
9. Semantic Role Labeling


Definition: The process of identifying the semantic relationships between predicates (typically verbs)
and their associated arguments in a sentence, determining "who did what to whom, when, where, why,
and how."
Theoretical Foundation
1. Frame Semantics:
Linguistic frames: cognitive structures that describe events or situations
Predicates evoke frames with specific participant roles
FrameNet: resource cataloging frames and their elements
2. Propositional Semantics:
Sentences express propositions about entities and their relations
Predicates (usually verbs) establish relations between arguments
Arguments fill specific thematic roles
3. Thematic Roles (Semantic Roles):
Agent: Entity performing an action (typically animate)
Patient/Theme: Entity affected by the action
Experiencer: Entity experiencing a state
Instrument: Object used to perform an action
Goal/Recipient: Entity toward which action is directed
Source: Entity from which something moves
Location: Place where action occurs
Time: When action occurs
Manner: How action is performed
Cause: What caused the event
Beneficiary: Entity for whose benefit the action is performed
SRL Annotation Schemes
1. PropBank-style:
Verb-specific argument labels (Arg0, Arg1, Arg2, etc.)
Core arguments (numbered) vs. modifiers (labeled by function)
Arg0 typically agent, Arg1 typically patient/theme
Example: "[John]ARG0 [gave]PREDICATE [Mary]ARG2 [a book]ARG1"
2. FrameNet-style:
Frame-specific semantic roles
More fine-grained and semantically transparent
Example: "[The chef]COOK [baked]TARGET [a cake]FOOD [in the oven]HEATING_INSTRUMENT"
3. VerbNet-style:
Generalizes across verbs with similar argument structures
Thematic roles with syntactic constraints
Example: "[John]AGENT [broke]PREDICATE [the window]PATIENT [with a hammer]INSTRUMENT"
SRL Process
1. Predicate Identification:
Identify words that function as predicates (often verbs)
Can include nominal, adjectival predicates
Example: "The destruction of the city was complete." ("destruction" is nominal predicate)
2. Predicate Disambiguation:
Determine specific sense of predicate
Different senses may have different argument structures
Example: "run" as in "run a company" vs. "run a marathon"
3. Argument Identification:
Identify text spans that represent arguments
Usually phrase boundaries align with argument boundaries
4. Argument Classification:
Assign semantic role labels to identified arguments
Based on semantics and syntactic position
Detailed Example:
Sentence: "Yesterday, John carefully opened the door with a key because it was
locked."

Predicate: "opened"
Arguments:
- "Yesterday" - ARGM-TMP (temporal modifier)
- "John" - ARG0 (agent)
- "carefully" - ARGM-MNR (manner)
- "the door" - ARG1 (patient)
- "with a key" - ARGM-INS (instrument)
- "because it was locked" - ARGM-CAU (cause)

Visualization:
[Yesterday]ARGM-TMP, [John]ARG0 [carefully]ARGM-MNR [opened]PREDICATE [the door]ARG1
[with a key]ARGM-INS [because it was locked]ARGM-CAU.

SRL Approaches
1. Feature-based Methods:
Syntactic features (position, path in parse tree)
Lexical features (predicate, headword)
Named entity features
Voice (active/passive)
Traditional ML algorithms (SVM, MaxEnt, CRF)
2. Neural Approaches:
BiLSTM with attention mechanisms
End-to-end deep learning architectures
Multi-task learning with syntactic parsing
Transformer-based methods (BERT, RoBERTa)
3. Rule-based Systems:
Mapping from syntactic structure to semantic roles
Pattern matching on parse trees
Limited coverage but high precision
Code Example (Python with AllenNLP):
python
# This example uses AllenNLP for SRL
from allennlp.predictors.predictor import Predictor

# Load the pretrained SRL model
# (the URL below is one published AllenNLP SRL model; adjust it for your AllenNLP version)
predictor = Predictor.from_path(
    "https://storage.googleapis.com/allennlp-public-models/structured-prediction-srl-bert.2020.12.15.tar.gz"
)

# Example sentence
sentence = "Yesterday, John carefully opened the door with a key because it was locked."

# Get SRL predictions
prediction = predictor.predict(sentence=sentence)

# Print the results in a readable format
verbs = prediction['verbs']
words = prediction['words']

print("Sentence:", sentence)
print("\nSemantic Role Labels:")
print("-" * 50)

for verb_info in verbs:
    print(f"Predicate: {verb_info['verb']}")

    # Collect argument spans from the BIO tags
    args = {}
    current_arg = None
    current_span = []

    for tag, word in zip(verb_info['tags'], words):
        if tag.startswith('B-'):
            # Begin a new argument
            if current_arg:
                args[current_arg] = ' '.join(current_span)
            current_arg = tag[2:]  # Remove the 'B-' prefix
            current_span = [word]
        elif tag.startswith('I-'):
            # Continue the current argument
            current_span.append(word)
        else:  # 'O' tag
            if current_arg:
                args[current_arg] = ' '.join(current_span)
            current_arg = None
            current_span = []
    # Handle the last argument if there is one
    if current_arg:
        args[current_arg] = ' '.join(current_span)

    # Separate numbered (core) arguments from modifiers
    core_args = [arg for arg in args if arg.startswith('ARG') and not arg.startswith('ARGM')]
    modifier_args = [arg for arg in args if arg.startswith('ARGM')]

    print("Core arguments:")
    for arg in sorted(core_args):
        print(f"  {arg}: {args[arg]}")

    print("Modifiers:")
    for arg in sorted(modifier_args):
        print(f"  {arg}: {args[arg]}")

    # Print the tagged sentence
    print("\nTagged sentence:")
    print(verb_info['description'])
    print("-" * 50)

Applications:
Question answering (finding specific information)
Information extraction (structured knowledge)
Machine translation (role-preserving translation)
Text summarization (identifying key propositions)
Semantic search (finding content by meaning)
Reading comprehension systems
Dialogue systems (understanding user intents)
Natural language inference (logical reasoning)
Event extraction (structured representations of events)
Core Components of Natural Language Processing (NLP)
Introduction
Natural Language Processing (NLP) enables computers to understand, interpret, and generate human
language. This presentation provides an in-depth exploration of the fundamental building blocks that
form the foundation of modern NLP systems.
1. Tokenization
Definition: The process of breaking text into smaller units called tokens, which serve as the basic
elements for further language processing.
Types:
Word Tokenization: Splitting text into individual words
Example: "I love NLP" → ["I", "love", "NLP"]
Challenges: Handling contractions, hyphenated words, abbreviations
Sentence Tokenization: Dividing text into sentences
Example: "Hello! How are you? I'm fine." → ["Hello!", "How are you?", "I'm fine."]
Challenges: Abbreviations (e.g., "Dr."), quotations, non-standard punctuation
Subword Tokenization: Breaking words into meaningful subunits
Byte-Pair Encoding (BPE): Iteratively merges most frequent character pairs
WordPiece: Similar to BPE but uses likelihood rather than frequency
SentencePiece: Language-agnostic tokenization that treats spaces as symbols
Example: "unhappiness" → ["un", "happiness"] or ["un", "happy", "ness"] (see the subword sketch after the code example below)
Advanced Tokenization Considerations:
Language-specific challenges:
Chinese/Japanese/Korean: No explicit word boundaries
Arabic/Hebrew: Complex morphology and right-to-left script
Agglutinative languages (Finnish, Turkish): Long compound words
Tokenization in modern transformers:
Special tokens: [CLS], [SEP], [MASK], [PAD]
Handling out-of-vocabulary words
Code Example (Python with NLTK):
python

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download('punkt')

# Sentence tokenization
text = "NLP is fascinating! It has many applications."
sentences = sent_tokenize(text)
print(sentences)  # ['NLP is fascinating!', 'It has many applications.']

# Word tokenization
for sentence in sentences:
    words = word_tokenize(sentence)
    print(words)  # ['NLP', 'is', 'fascinating', '!'] ['It', 'has', 'many', 'applications', '.']
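To complement the word- and sentence-level example above, the following minimal sketch illustrates the subword tokenization and special tokens described earlier. It assumes the Hugging Face transformers package is installed; "bert-base-uncased" is used only as one convenient example of a WordPiece vocabulary, and the exact subword pieces depend on that vocabulary.
python
# Subword tokenization sketch (assumes: pip install transformers)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Rare or long words are split into known subword pieces ("##" marks continuations)
print(tokenizer.tokenize("unhappiness"))    # e.g. ['un', '##happiness'] (vocabulary-dependent)
print(tokenizer.tokenize("tokenization"))   # e.g. ['token', '##ization']

# Encoding also adds the special tokens mentioned above ([CLS], [SEP])
ids = tokenizer.encode("NLP is fascinating!")
print(tokenizer.convert_ids_to_tokens(ids))  # ['[CLS]', ..., '[SEP]'] (pieces are vocabulary-dependent)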

Applications:
Foundation for all other NLP tasks
Text preprocessing and normalization
Feature extraction for machine learning models
Input preparation for neural networks
Information retrieval systems
2. Part-of-Speech (POS) Tagging
Definition: Identifying the grammatical category of each word or token in text, enabling deeper
syntactic and semantic analysis.
Common POS Tag Sets:
Penn Treebank POS Tags (most common in English NLP)
Universal Dependencies POS Tags (cross-linguistic compatibility)
Detailed POS Categories:
Nouns:
Common nouns (NN): generic objects, concepts (book, happiness)
Proper nouns (NNP): specific names (John, London)
Singular/plural distinctions (NN/NNS)
Possessive forms (NN's/NNP's)
Verbs:
Base form (VB): "go", "see"
Present tense (VBP/VBZ): "go/goes"
Past tense (VBD): "went"
Gerund/present participle (VBG): "going"
Past participle (VBN): "gone"
Modal auxiliaries (MD): "can", "should"
Adjectives:
Base form (JJ): "happy"
Comparative (JJR): "happier"
Superlative (JJS): "happiest"
Additional Categories:
Adverbs (RB/RBR/RBS): "quickly", "more", "most"
Determiners (DT): "the", "a", "this"
Prepositions (IN): "in", "on", "by"
Conjunctions (CC): "and", "but", "or"
Pronouns (PRP): "he", "she", "they"
Cardinal numbers (CD): "one", "2", "three"
Foreign words (FW), Interjections (UH), etc.
POS Tagging Approaches:
1. Rule-based: Use hand-crafted rules and dictionaries
Advantage: Interpretable, no training data needed
Disadvantage: Limited coverage, difficult to maintain
2. Stochastic/Statistical:
Hidden Markov Models (HMMs)
Maximum Entropy Markov Models
Conditional Random Fields (CRFs)
3. Deep Learning Approaches:
Recurrent Neural Networks (RNNs/LSTMs)
Bidirectional LSTMs with CRF layer
Transformer-based models (BERT, etc.)
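The two tag sets listed above can be inspected side by side with spaCy; the minimal sketch below (assuming spaCy and its en_core_web_sm model are installed) prints the coarse Universal Dependencies tag (token.pos_) next to the fine-grained Penn Treebank tag (token.tag_).
python
# Universal Dependencies vs. Penn Treebank tags with spaCy
# (assumes: pip install spacy && python -m spacy download en_core_web_sm)
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown fox jumps over the lazy dog.")

for token in doc:
    # token.pos_ = Universal Dependencies tag, token.tag_ = Penn Treebank tag
    print(f"{token.text:<8} UD: {token.pos_:<6} Penn: {token.tag_}")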
Extended Example:
Text: "The quick brown fox jumps over the lazy dog."

Detailed POS analysis:


"The" - DT (Determiner)
"quick" - JJ (Adjective)
"brown" - JJ (Adjective)
"fox" - NN (Noun, singular)
"jumps" - VBZ (Verb, 3rd person singular present)
"over" - IN (Preposition)
"the" - DT (Determiner)
"lazy" - JJ (Adjective)
"dog" - NN (Noun, singular)
"." - . (Punctuation)

Code Example (Python with NLTK):


python

import nltk
from nltk import pos_tag
from nltk.tokenize import word_tokenize

# Download necessary data


nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# POS tagging example


text = "The quick brown fox jumps over the lazy dog."
tokens = word_tokenize(text)
tagged = pos_tag(tokens)

print(tagged)
# Output: [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'),
# ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
# ('dog', 'NN'), ('.', '.')]

Applications:
Syntactic parsing and grammatical analysis
Named entity recognition preprocessing
Word sense disambiguation
Machine translation
Text-to-speech pronunciation
Grammar checking and correction
Information extraction systems
Sentiment analysis enhancement
3. Named Entity Recognition (NER)
Definition: Identifying and classifying real-world entities mentioned in text into predefined categories,
enabling structured information extraction from unstructured text.
Standard Entity Types:
People (PER): Individual persons, groups of people
Organizations (ORG): Companies, institutions, government agencies
Locations (LOC): Geographic locations, political entities
Geo-Political Entities (GPE): Countries, cities, states with political and geographic properties
Temporal Expressions (TIME/DATE): Dates, times, durations, periods
Numeric Expressions:
Monetary values (MONEY): "$50 million", "€20"
Percentages (PERCENT): "10%", "one-third"
Cardinal numbers (CARDINAL): "ten", "15"
Ordinal numbers (ORDINAL): "first", "5th"
Domain-Specific Entity Types:
Biomedical: Genes, proteins, diseases, drugs
Legal: Legal citations, court cases, statutes
Scientific: Chemical compounds, astronomical objects
Financial: Stock symbols, financial instruments
Products: Brand names, product models
Creative Works: Books, movies, songs
NER Approaches:
1. Rule-based Methods:
Gazetteer/dictionary matching
Regular expressions
Hand-crafted linguistic rules
2. Statistical Methods:
Hidden Markov Models (HMMs)
Conditional Random Fields (CRFs)
Support Vector Machines (SVMs)
3. Deep Learning Methods:
Bidirectional LSTMs with CRF
Transformer-based models (BERT, RoBERTa)
Fine-tuned language models
NER Evaluation Metrics:
Precision: Proportion of identified entities that are correct
Recall: Proportion of actual entities that were identified
F1 Score: Harmonic mean of precision and recall
Exact match vs. partial match scoring
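To make these metrics concrete, here is a small illustrative sketch (not tied to any particular toolkit) that scores exact-match entity predictions against gold annotations; entities are represented as hypothetical (start, end, type) tuples.
python
# Illustrative exact-match NER scoring: entities as (start, end, type) tuples
def ner_scores(gold_entities, predicted_entities):
    gold, pred = set(gold_entities), set(predicted_entities)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

gold = [(0, 10, "ORG"), (56, 70, "LOC"), (71, 81, "DATE"), (106, 117, "MONEY")]
pred = [(0, 10, "ORG"), (56, 70, "LOC"), (106, 117, "MONEY"), (30, 35, "PERSON")]
print(ner_scores(gold, pred))  # (0.75, 0.75, 0.75): 3 correct, 1 missed, 1 spurious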
Extended Example:
Text: "Apple Inc. is planning to open a new flagship store in central London next
March, with an investment of $40 million."

Detailed NER:
"Apple Inc." - ORGANIZATION
"central London" - LOCATION
"next March" - DATE
"$40 million" - MONEY

Visualization:
[Apple Inc.](ORG) is planning to open a new flagship store in [central London](LOC)
[next March](DATE), with an investment of [$40 million](MONEY).

Code Example (Python with spaCy):


python

import spacy

# Load English NER model
nlp = spacy.load("en_core_web_sm")

# Process text
text = ("Apple Inc. is planning to open a new flagship store in central London "
        "next March, with an investment of $40 million.")
doc = nlp(text)

# Display entities
for ent in doc.ents:
    print(f"Entity: {ent.text}, Type: {ent.label_}, Start: {ent.start_char}, End: {ent.end_char}")

# Example output (exact spans and character offsets depend on the model version):
# Entity: Apple Inc., Type: ORG
# Entity: London, Type: GPE
# Entity: next March, Type: DATE
# Entity: $40 million, Type: MONEY

Advanced NER Challenges:


Dealing with ambiguous entities (e.g., "Apple" as company vs. fruit)
Handling nested entities (e.g., "Bank of America" contains "America")
Cross-lingual entity recognition
Entity recognition in noisy text (social media, speech-to-text)
Low-resource settings with limited training data
Applications:
Knowledge graph construction
Information extraction systems
Question answering systems
Content recommendation and personalization
Search engine optimization and relevance
News categorization and tagging
Compliance monitoring (PII detection)
Customer service automation
Research and market intelligence
4. Lemmatization and Stemming
Definition: Text normalization techniques that reduce inflected or derived words to their base or root
form, allowing systems to treat different word forms as equivalent.
Stemming
Core Concept:
Strips affixes (prefixes and suffixes) from words using algorithmic rules
Language-specific but dictionary-independent process
Focuses on computational efficiency over linguistic accuracy
Popular Stemming Algorithms:
1. Porter Stemmer:
Most widely used English stemmer
Five-phase rule-based approach
Example transformations:
"CONNECTED" → "CONNECT" (remove "-ED")
"GENERALIZATION" → "GENER" (multiple rules applied)
2. Snowball Stemmer (Porter2):
Improved version of Porter stemmer
Supports multiple languages (English, French, Spanish, etc.)
More accurate but slightly slower than Porter
3. Lancaster (Paice/Husk) Stemmer:
Aggressive stemming with iterative rules
Often produces shorter stems
Higher error rate but good for strict matching
4. Lovins Stemmer:
One of the earliest stemmers (1968)
Single-pass, longest-match algorithm
Larger rule set (over 260 endings, 29 conditions)
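The Porter, Snowball, and Lancaster stemmers listed above are all available in NLTK; the short sketch below compares them on a few words (the exact stems depend on the NLTK version, so no outputs are claimed).
python
# Comparing NLTK's Porter, Snowball (Porter2), and Lancaster stemmers
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")
lancaster = LancasterStemmer()

for word in ["connected", "generalization", "university", "universal", "maximum"]:
    print(f"{word:<16} porter={porter.stem(word):<10} "
          f"snowball={snowball.stem(word):<10} lancaster={lancaster.stem(word)}")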
Stemming Limitations:
Over-stemming: Different words reduced to same stem (e.g., "university" and "universal" →
"univers")
Under-stemming: Related words not reduced to same stem (e.g., "alumnus" and "alumni" remain
different)
Produces non-words or lexically invalid forms
No semantic understanding
Lemmatization
Core Concept:
Reduces words to their dictionary form (lemma) based on vocabulary and morphological analysis
Considers word's context and part of speech
Prioritizes linguistic accuracy over computational efficiency
Lemmatization Process:
1. POS tagging to determine grammatical category
2. Morphological analysis of word structure
3. Dictionary lookup to find base form
4. Application of language-specific rules
Examples with Context:
"better" (adjective) → "good"
"better" (verb) → "better"
"saw" (noun) → "saw"
"saw" (verb, past tense) → "see"
"studies" (noun, plural) → "study"
"studies" (verb, 3rd person) → "study"
Popular Lemmatization Tools:
WordNet Lemmatizer (NLTK)
spaCy's Lemmatizer
Stanford CoreNLP Lemmatizer
TreeTagger
Code Example (Python with NLTK and spaCy):
python
# NLTK Stemming and Lemmatization
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet
from nltk import pos_tag

nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

# Initialize
porter = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Helper function to convert NLTK POS tags to WordNet POS tags
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN  # Default to noun

# Example words
words = ["running", "runs", "ran", "better", "best", "studies", "studying"]
pos_tagged = pos_tag(words)

print("Word\t\tStem\t\tLemma (no POS)\tLemma (with POS)")
print("-" * 60)

for word, pos in pos_tagged:
    wordnet_pos = get_wordnet_pos(pos)
    print(f"{word}\t\t{porter.stem(word)}\t\t{lemmatizer.lemmatize(word)}\t\t"
          f"{lemmatizer.lemmatize(word, wordnet_pos)}")

# spaCy Lemmatization
import spacy
nlp = spacy.load("en_core_web_sm")

text = "The children were running in the park. She studied better than him."
doc = nlp(text)

print("\nspaCy Lemmatization:")
print("Token\t\tLemma\t\tPOS")
print("-" * 40)
for token in doc:
    print(f"{token.text}\t\t{token.lemma_}\t\t{token.pos_}")

Comparative Analysis:
Aspect      | Stemming             | Lemmatization
Speed       | Faster               | Slower
Complexity  | Lower                | Higher
Resources   | Rule-based only      | Requires dictionaries
Accuracy    | Lower                | Higher
Output      | Often non-words      | Valid dictionary words
Context     | Context-independent  | Context-aware
Applications:
Text normalization for search engines
Feature space reduction in text classification
Query expansion in information retrieval
Document clustering and topic modeling
Spelling correction systems
Keyword extraction
Text similarity measurement
Machine translation preprocessing
5. Syntax and Parsing
Definition: Analyzing grammatical structure to determine relationships between words and phrases,
revealing the hierarchical organization of language and enabling deeper semantic understanding.
Formal Grammar Theory
Context-Free Grammars (CFGs):
Formalized set of production rules that describe all possible strings in a language
Consists of:
Non-terminal symbols (syntactic categories like NP, VP)
Terminal symbols (actual words)
Production rules (NP → Det N)
Start symbol (S)
Example CFG Rules:
S → NP VP
NP → Det N | Det N PP | Pronoun
VP → V | V NP | V NP PP
PP → P NP
Det → "the" | "a" | "my"
N → "cat" | "mouse" | "house"
V → "chased" | "saw" | "ate"
P → "in" | "on" | "with"
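A grammar like the one above can be tried directly with NLTK's chart parser; the following is a minimal sketch using a slightly simplified version of these rules (NLTK's grammar format expects ASCII arrows and quoted terminals).
python
# Parsing with a toy CFG using NLTK's chart parser
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N | Det N PP
VP -> V | V NP | V NP PP
PP -> P NP
Det -> 'the' | 'a' | 'my'
N -> 'cat' | 'mouse' | 'house'
V -> 'chased' | 'saw' | 'ate'
P -> 'in' | 'on' | 'with'
""")

parser = nltk.ChartParser(grammar)
sentence = "the cat chased a mouse in the house".split()
for tree in parser.parse(sentence):
    print(tree)  # prints each licensed parse as a bracketed tree (PP attachment is ambiguous here)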

Grammar Formalisms:
Context-free grammars (CFGs)
Tree-adjoining grammars (TAGs)
Lexicalized tree-adjoining grammars (LTAGs)
Head-driven phrase structure grammar (HPSG)
Combinatory categorial grammar (CCG)
Constituency Parsing
Core Concept:
Breaks sentences into nested constituents based on phrase structure grammar
Represents hierarchical structure as a parse tree
Focuses on phrases and their composition
Constituent Types:
Noun Phrase (NP): "the red car"
Verb Phrase (VP): "is driving fast"
Prepositional Phrase (PP): "on the road"
Adjective Phrase (ADJP): "very happy"
Adverb Phrase (ADVP): "quite slowly"
Example Parse Tree (bracketed form):
(S
  (NP (Det The) (N cat))
  (VP (V ate)
      (NP (Det the) (N fish))))

Parsing Algorithms:
CKY Algorithm: Dynamic programming approach for CFGs
Earley Parser: Top-down parser with bottom-up filtering
Shift-Reduce Parser: Uses stack and buffer with shift/reduce operations
Chart Parsing: Builds partial analyses in a well-formed substring table
Dependency Parsing
Core Concept:
Identifies direct grammatical relationships between words
Represents sentence as a directed graph with labeled edges
No intermediate phrasal nodes, only connections between words
Common Dependency Relations:
nsubj: Nominal subject
dobj: Direct object
iobj: Indirect object
det: Determiner
amod: Adjectival modifier
advmod: Adverbial modifier
aux: Auxiliary verb
prep: Preposition
pobj: Object of preposition
conj: Conjunct
Detailed Example:
Sentence: "The black cat chased the small mouse in the kitchen."

Labeled dependencies (head → dependent, relation):
chased  → cat      (nsubj)
chased  → mouse    (dobj)
chased  → in       (prep)
cat     → The      (det)
cat     → black    (amod)
mouse   → the      (det)
mouse   → small    (amod)
in      → kitchen  (pobj)
kitchen → the      (det)

Dependency Parsing Approaches:


1. Transition-Based Parsing:
Uses a sequence of actions (shift, reduce, etc.)
Greedy or beam search for action sequence
Linear time complexity (O(n))
2. Graph-Based Parsing:
Scores possible dependency trees
Finds maximum spanning tree
Cubic time complexity (O(n³))
3. Neural Dependency Parsing:
Deep biaffine attention models
Transformer-based approaches
Graph neural networks
Universal Dependencies
Cross-linguistic annotation standard
Consistent dependency relations across languages
Universal POS tags
Facilitates multilingual parsing
Code Example (Python with spaCy):
python

import spacy
from spacy import displacy

# Load English model
nlp = spacy.load("en_core_web_sm")

# Example sentence
text = "The black cat chased the small mouse in the kitchen."
doc = nlp(text)

# Note: the default spaCy pipeline produces dependency parses only;
# constituency parses require an additional pipeline component.

# Display dependency parse as text
print("DEPENDENCY PARSING RESULTS:")
print("{:<15} {:<10} {:<15} {:<10}".format("TOKEN", "POS", "DEPENDENCY", "HEAD"))
print("-" * 50)
for token in doc:
    print("{:<15} {:<10} {:<15} {:<10}".format(
        token.text, token.pos_, token.dep_, token.head.text))

# Visualize dependency parse (in a notebook or save to file)
# displacy.render(doc, style="dep", jupyter=True)

Parsing Evaluation Metrics:


Labeled attachment score (LAS)
Unlabeled attachment score (UAS)
Label accuracy
F1 score on constituents (for constituency parsing)
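For illustration, attachment scores can be computed as in the minimal sketch below, where each token is represented by its gold and predicted (head index, relation label) pair; the toy data is hypothetical.
python
# Illustrative UAS/LAS computation: one (head_index, relation) pair per token
def attachment_scores(gold, predicted):
    assert len(gold) == len(predicted)
    uas = sum(g[0] == p[0] for g, p in zip(gold, predicted)) / len(gold)  # correct heads only
    las = sum(g == p for g, p in zip(gold, predicted)) / len(gold)        # correct head and label
    return uas, las

# Toy example with 1-based head indices; 0 = root
gold = [(3, "det"), (3, "amod"), (4, "nsubj"), (0, "root")]
pred = [(3, "det"), (3, "advmod"), (4, "nsubj"), (0, "root")]
print(attachment_scores(gold, pred))  # (1.0, 0.75): all heads correct, one label wrong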
Applications:
Semantic role labeling
Grammatical error detection
Machine translation
Information extraction
Question answering
Relationship extraction
Coreference resolution
Sentiment analysis (aspect-based)
Text simplification
Readability assessment
6. Word Sense Disambiguation
Definition: Identifying which specific meaning (sense) of a word is activated by its context when a
word has multiple possible interpretations, resolving lexical ambiguity to enable accurate language
understanding.
Types of Lexical Ambiguity
Homonymy:
Words that share spelling and pronunciation but have unrelated meanings
Example: "bank" (financial institution) vs. "bank" (river shore)
Usually developed from different etymological origins
Polysemy:
Words with related but distinct meanings
Example: "head" (body part, leader, top section)
Meanings derive from semantic extension of original sense
Fine-grained vs. Coarse-grained Disambiguation:
Fine-grained: Distinguishing between closely related senses
Coarse-grained: Distinguishing between major sense categories
Example (fine): "bright student" vs. "bright light" vs. "bright colors"
WSD Approaches
1. Knowledge-based Methods:
Leverage lexical resources and semantic networks
No training data required
a. Lesk Algorithm:
Compares dictionary definitions with context words
Selects sense with maximum overlap between definition and context
Variations: Simplified Lesk, Enhanced Lesk
b. WordNet-based Methods:
Exploit semantic relationships in WordNet
Use hypernyms, hyponyms, synonyms, etc.
Measure semantic similarity between senses
c. Graph-based Methods:
Represent WordNet as semantic graph
Apply algorithms like PageRank to find most relevant sense
2. Supervised Methods:
Require sense-annotated training data
Learn statistical models of word senses
a. Feature Engineering:
Local context features (surrounding words, POS tags)
Topical features (document topic, domain)
Syntactic features (dependency relations)
Semantic features (named entities, semantic roles)
b. Classification Algorithms:
Support Vector Machines
Decision Trees
Maximum Entropy Models
Neural Networks
3. Semi-supervised Methods:
Address lack of training data
Use small amount of labeled data with large unlabeled corpus
a. Bootstrapping:
Start with seed examples
Iteratively expand training set
b. One-sense-per-discourse:
Assume consistent sense within document
Propagate identified senses
4. Unsupervised Methods:
No annotated data required
Induce senses from text
a. Word Embeddings:
Generate context-specific embeddings
Cluster similar contexts
b. Topic Models:
Discover latent topics
Assign senses based on topic distributions
5. Deep Learning Approaches:
Contextual Embeddings:
BERT, ELMo, GPT produce context-dependent representations
Capture sense information in vector space
Neural Architectures:
Bidirectional LSTMs
Transformer-based encoders
Attention mechanisms for context modeling
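To illustrate how contextual embeddings separate senses, here is a minimal sketch assuming the transformers and torch packages are installed; "bert-base-uncased" is just one example encoder, and the helper token_vector is a hypothetical convenience function. It compares the vector of "bank" in financial and river contexts.
python
# Sketch: contextual embeddings for the senses of "bank"
# (assumes: pip install torch transformers)
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def token_vector(sentence, target="bank"):
    """Return the contextual embedding of the target token in the sentence."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, hidden_dim)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    return hidden[tokens.index(target)]

v_financial = token_vector("The bank approved my loan application.")
v_river = token_vector("The river bank was eroding after the flood.")
v_financial2 = token_vector("She deposited the check at the bank.")

cos = torch.nn.functional.cosine_similarity
# The two financial uses are typically closer to each other than to the river use
print("financial vs river:    ", cos(v_financial, v_river, dim=0).item())
print("financial vs financial:", cos(v_financial, v_financial2, dim=0).item())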
Advanced WSD Concepts
Cross-lingual WSD:
Leveraging parallel corpora
Transfer learning across languages
BabelNet and other multilingual resources
Domain Adaptation:
Adapting general WSD systems to specific domains
Domain-specific sense inventories
WSD Evaluation Framework:
SensEval/SemEval competitions
Metrics: Precision, recall, F1-score
All-words vs. lexical sample tasks
Detailed Examples with Analysis:
Example 1: Homonymy
Sentence: "The bank approved my loan application."
Context analysis:
- Financial terms: "loan", "application", "approved"
- No water-related terms
- Syntactic role: subject of approval action
→ Sense: financial institution

Example 2: Polysemy
Sentence: "The company runs a successful business."
Context analysis:
- Business entity: "company", "business"
- Subject-verb relationship suggests operation, not movement
→ Sense: operates/manages (not physical running)

Example 3: Fine-grained distinction


Sentence: "The bright student quickly solved the problem."
Context analysis:
- Human subject: "student"
- Mental activity: "solved", "problem"
→ Sense: intelligent (not luminous or colorful)

Code Example (Python with NLTK):


python
from nltk.corpus import wordnet as wn
from nltk.wsd import lesk
from nltk import word_tokenize, pos_tag
import nltk

nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Helper function to convert NLTK POS tags to WordNet POS tags
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wn.ADJ
    elif treebank_tag.startswith('V'):
        return wn.VERB
    elif treebank_tag.startswith('N'):
        return wn.NOUN
    elif treebank_tag.startswith('R'):
        return wn.ADV
    else:
        return None

# Example sentences for disambiguation
sentences = [
    "The bank approved my loan application.",
    "The river bank was eroding after the flood.",
    "She can't bear to watch the sad movie.",
    "The bear caught a fish in the stream."
]

target_word = "bank"  # Word to disambiguate

for i, sentence in enumerate(sentences[:2]):  # Just using the bank examples
    # Tokenize and POS tag
    tokens = word_tokenize(sentence)
    tagged = pos_tag(tokens)

    # Find position of target word
    positions = [j for j, (word, _) in enumerate(tagged) if word.lower() == target_word]

    if positions:
        pos = positions[0]
        # Get WordNet POS tag
        wordnet_pos = get_wordnet_pos(tagged[pos][1])

        # Apply Lesk algorithm
        best_sense = lesk(tokens, target_word, wordnet_pos)

        print(f"Sentence {i+1}: {sentence}")
        print(f"Context: {' '.join(tokens)}")
        if best_sense:
            print(f"Best sense: {best_sense}")
            print(f"Definition: {best_sense.definition()}")
            print(f"Examples: {best_sense.examples()}")
        else:
            print("No sense found")
        print()

# List all senses of "bank"
print("All senses of 'bank':")
for i, synset in enumerate(wn.synsets('bank')):
    print(f"{i+1}. {synset.name()}: {synset.definition()}")
Applications:
Machine translation (selecting correct translation)
Information retrieval (improving search precision)
Text summarization (preserving meaning)
Question answering (understanding queries)
Sentiment analysis (contextual polarity)
Speech recognition (disambiguating homophones)
Ontology mapping (knowledge integration)
Semantic parsing (meaning representation)
7. Coreference Resolution
Definition: Identifying when different expressions in text refer to the same entity or event, establishing
connections between these mentions to maintain coherence and enable deeper understanding of text.
Types of Coreference
1. Pronominal Coreference:
Pronouns referring back to a previously mentioned entity
Personal pronouns: he, she, it, they, etc.
Possessive pronouns: his, her, its, their, etc.
Reflexive pronouns: himself, herself, itself, themselves, etc.
Example: "[John]₁ said that [he]₁ would finish the project."
2. Nominal Coreference:
Noun phrases referring to the same entity
Exact repetitions: "a dog"..."the dog"
Synonyms: "the automobile"..."the car"
Hypernyms/hyponyms: "the animal"..."the dog"
Example: "[The President]₁ spoke yesterday. [Joe Biden]₁ addressed climate concerns."
3. Zero Anaphora (Ellipsis):
Omitted expressions that are understood from context
Commonly occurs in coordinate structures and certain languages
Example: "[Mary]₁ wanted to go to the beach and [∅]₁ packed her bag." (∅ = Mary)
4. Split Antecedent:
When a plural pronoun refers to multiple separate antecedents
Example: "[John]₁ met [Mary]₂ at the conference. [They]₁₊₂ had dinner together."
Challenges in Coreference Resolution
1. Syntactic Ambiguity:
Multiple potential antecedents with matching grammatical features
Example: "[The trophy]₁ didn't fit into [the suitcase]₂ because [it]₁/₂ was too big."
2. Semantics and World Knowledge:
Requires understanding beyond syntax
Example: "[The city council]₁ denied [the demonstrators]₂ a permit because [they]₁/₂ feared
violence."
3. Bridging References:
Implicit relations between entities
Example: "I went to [a new restaurant]₁. [The chef]₂ was excellent." (where chef is part of
restaurant)
4. Event Coreference:
Identifying when different expressions refer to the same event
Example: "The building [exploded]₁ yesterday. The [blast]₁ injured ten people."
Coreference Resolution Approaches
1. Rule-based Systems:
Linguistic constraints and preferences
Hobbs algorithm
Centering theory
Binding constraints (Chomsky's Government and Binding Theory)
2. Mention-Pair Models:
Classify pairs of mentions as coreferent or not
Features: distance, syntactic position, gender/number agreement, semantic compatibility
Challenge: Local decisions may be inconsistent globally
3. Entity-Mention Models:
Consider all previously resolved entities for each new mention
Build coreference chains incrementally
More coherent entity representations
4. Neural Coreference Models:
End-to-end neural systems
Span representations with attention
Higher-order inference
Joint learning with other NLP tasks
5. Transformer-based Approaches:
Fine-tuned language models (BERT, RoBERTa, etc.)
Contextual token representations
Self-attention captures long-distance relationships
Detailed Resolution Process
1. Mention Detection:
Identify potential referring expressions
Filter non-referring expressions
Named entities, noun phrases, pronouns
2. Mention-level Feature Extraction:
Grammatical properties: gender, number, person
Semantic class: human, organization, location, etc.
Syntactic role: subject, object, etc.
Definiteness, proper noun status
3. Pairwise Feature Computation:
Distance features: sentences, mentions between
String matching features: exact, partial, head match
Syntactic features: syntactic paths, c-command
Semantic compatibility features
4. Clustering or Classification:
Agglomerative clustering of mentions
Ranking of candidate antecedents
Global optimization for consistency
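As an illustration of the pairwise features listed above, the toy sketch below computes a few classic mention-pair features. The Mention structure and its gender/number fields are hypothetical stand-ins for information a real system would derive from a parser and lexicons.
python
# Toy mention-pair feature extraction (hypothetical Mention structure)
from dataclasses import dataclass

@dataclass
class Mention:
    text: str
    sent_index: int
    gender: str   # 'm', 'f', 'n', or 'unknown'
    number: str   # 'sg' or 'pl'

def pair_features(antecedent: Mention, anaphor: Mention) -> dict:
    """Compute a few classic mention-pair features."""
    head_a = antecedent.text.lower().split()[-1]   # crude head word: last token
    head_b = anaphor.text.lower().split()[-1]
    return {
        "sentence_distance": anaphor.sent_index - antecedent.sent_index,
        "exact_match": antecedent.text.lower() == anaphor.text.lower(),
        "head_match": head_a == head_b,
        "gender_agree": antecedent.gender == anaphor.gender,
        "number_agree": antecedent.number == anaphor.number,
    }

mark = Mention("Mark", 0, "m", "sg")
he = Mention("he", 0, "m", "sg")
print(pair_features(mark, he))
# {'sentence_distance': 0, 'exact_match': False, 'head_match': False,
#  'gender_agree': True, 'number_agree': True}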
Comprehensive Example:
Text: "Mark told John that he had won the competition. His friends were very proud
of him.
The young programmer had worked very hard for this achievement."

Referring expressions:
- [Mark]₁
- [John]₂
- [he]₃ → refers to [Mark]₁
- [the competition]₄
- [His]₅ → refers to [Mark]₁
- [friends]₆
- [him]₇ → refers to [Mark]₁
- [The young programmer]₈ → refers to [Mark]₁
- [this achievement]₉ → refers to [winning the competition]

Coreference chains:
Chain 1: [Mark]₁ - [he]₃ - [His]₅ - [him]₇ - [The young programmer]₈
Chain 2: [John]₂
Chain 3: [the competition]₄ - [this achievement]₉
Chain 4: [friends]₆

Resolution process:
- "he" → ambiguous between Mark/John, resolved to Mark due to syntactic prominence
(subject)
- "His" → matches masculine singular, references most recent compatible entity
(Mark)
- "him" → matches masculine singular object, continues reference to Mark
- "The young programmer" → requires world knowledge that Mark is a programmer
- "this achievement" → event reference to the winning action

Code Example (Python with spaCy):


python

import spacy
from spacy import displacy

# Load English NLP model with coreference resolution capability
# Note: the default spaCy pipeline does not resolve coreference; this requires an
# additional component such as neuralcoref, which must be installed separately.
# pip install neuralcoref
# import neuralcoref

nlp = spacy.load("en_core_web_sm")
# neuralcoref.add_to_pipe(nlp)

# Process text
text = ("Mark told John that he had won the competition. His friends were very proud of him. "
        "The young programmer had worked very hard for this achievement.")
doc = nlp(text)

# Print tokens with their part-of-speech tags and dependencies
print("TOKEN\t\tPOS\t\tDEP\t\tHEAD")
print("-" * 50)
for token in doc:
    print(f"{token.text}\t\t{token.pos_}\t\t{token.dep_}\t\t{token.head.text}")

# With neuralcoref installed, you could access coreference chains:
# print("\nCoreference clusters:")
# for cluster in doc._.coref_clusters:
#     print(f"Cluster: {cluster.main} contains: {cluster.mentions}")

# Display dependency visualization
# displacy.render(doc, style="dep", jupyter=True)

Evaluation Metrics:
MUC (Message Understanding Conference) score
B³ (B-cubed)
CEAF (Constrained Entity-Alignment F-measure)
BLANC (BiLateral Assessment of Noun-phrase Coreference)
CoNLL F1 (average of MUC, B³, and CEAF)
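As a concrete illustration of one of these metrics, the sketch below computes B³ precision and recall over toy coreference clusters; mentions are reduced to string ids, and the sketch is simplified to score only mentions present in both the key and the response (real scorers also handle twinless mentions).
python
# Toy B-cubed (B³) precision/recall over coreference clusters
def b_cubed(key_clusters, response_clusters):
    key = {m: frozenset(c) for c in key_clusters for m in c}
    resp = {m: frozenset(c) for c in response_clusters for m in c}
    mentions = key.keys() & resp.keys()   # simplified: common mentions only
    precision = sum(len(key[m] & resp[m]) / len(resp[m]) for m in mentions) / len(mentions)
    recall = sum(len(key[m] & resp[m]) / len(key[m]) for m in mentions) / len(mentions)
    return precision, recall

# Gold: Mark's chain plus {John}; the system split Mark's chain in two
gold = [{"Mark", "he", "His", "him", "programmer"}, {"John"}]
system = [{"Mark", "he", "His"}, {"him", "programmer"}, {"John"}]
print(b_cubed(gold, system))  # (1.0, 0.6): every system link is correct, but one gold chain was split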
Applications:
Machine translation (pronoun resolution across languages)
Question answering (connecting entities in questions and answers)
Text summarization (maintaining entity coherence)
Information extraction (building knowledge bases)
Dialogue systems (tracking conversation entities)
Reading comprehension (understanding narrative flow)
Sentiment analysis (attributing opinions to correct entities)
8. Text Normalization and Preprocessing
Definition: The process of transforming text into a standardized, canonical form to ensure consistency
and improve the quality of downstream NLP tasks. Text normalization addresses variations and
irregularities in natural language to create a clean, uniform foundation for analysis.
Components of Text Normalization
1. Case Normalization:
Converting text to lowercase or uppercase
Preserves proper nouns when necessary
Example: "New York City" → "new york city" or "New york city"
Impact: Reduces vocabulary size but can lose information
2. Noise Removal:
Eliminating irrelevant characters and artifacts
HTML/XML tags, URLs, email addresses
Special characters, emoji, decorative symbols
Extra whitespace, line breaks, tabs
Example: "Contact us at info@example.com" → "Contact us at"
3. Punctuation Handling:
Removing punctuation completely
Separating punctuation from words
Standardizing quotes, hyphens, apostrophes
Example: "Don't worry!" → "Don t worry"
Consideration: Important for parsing, sentiment, dialogue
4. Number Handling:
Removing numbers
Converting numbers to words
Categorizing numeric expressions
Example: "I have 42 apples" → "I have NUMBER apples" or "I have forty-two apples"
5. Text Encoding Standardization:
Converting to UTF-8 or ASCII
Handling special characters, accents, diacritics
Normalizing Unicode representations (NFC, NFD)
Example: "café" can be stored precomposed as "caf\u00e9" (NFC) or decomposed as "cafe\u0301" (NFD); see the Unicode sketch after this list
6. Spelling Correction and Normalization:
Fixing spelling errors
Standardizing spelling variants
Handling regional variations (e.g., "color" vs. "colour")
Example: "I luv gr8 txts" → "I love great texts"
7. Abbreviation and Acronym Expansion:
Expanding common abbreviations
Normalizing domain-specific shorthand
Example: "WHO announced..." → "World Health Organization announced..."
8. Contractions and Possessives:
Expanding contractions
Standardizing possessive forms
Example: "I'll" → "I will"; "can't" → "cannot" or "can not"
9. Character-level Normalization:
Handling repeated characters ("goooooal!")
Standardizing character substitutions ("gr8" → "great")
Removing non-standard characters
Advanced Normalization Techniques:
10. Text Segmentation:
Word boundaries in languages without spaces (Chinese, Japanese)
Compound word handling (German, Finnish)
Hashtag segmentation (#NaturalLanguageProcessing → Natural Language Processing)
11. Domain-Specific Normalization:
Medical terminology standardization
Legal text normalization
Social media language normalization
Technical jargon standardization
12. Language Identification:
Determining the primary language
Handling code-switching and multilingual text
Applying language-specific normalization rules
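The Unicode issue from item 5 above can be demonstrated directly with Python's standard unicodedata module; this is a minimal sketch of NFC vs. NFD normalization.
python
# NFC vs. NFD: the same visible string, two different code-point sequences
import unicodedata

nfc = "caf\u00e9"     # precomposed: 'é' is a single code point (U+00E9)
nfd = "cafe\u0301"    # decomposed: 'e' followed by a combining acute accent (U+0301)

print(nfc == nfd)                                   # False: raw code points differ
print(unicodedata.normalize("NFC", nfd) == nfc)     # True after normalizing to NFC
print(unicodedata.normalize("NFD", nfc) == nfd)     # True after normalizing to NFD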
Normalization Pipeline Example
Input Text:
OMG!!! I luv NLP sooooo much :) #NaturalLanguageProcessing
Check out https://example.com or email me@email.com

Normalization Steps:
1. Lowercase conversion:
omg!!! i luv nlp sooooo much :) #naturallanguageprocessing
check out https://example.com or email me@email.com

2. URL and email removal:


omg!!! i luv nlp sooooo much :) #naturallanguageprocessing
check out or email

3. Punctuation removal:
omg i luv nlp sooooo much naturallanguageprocessing
check out or email

4. Character normalization:
omg i luv nlp so much naturallanguageprocessing
check out or email

5. Text expansion:
oh my god i love nlp so much natural language processing
check out or email
Implementation Approaches
Rule-based Normalization:
Regular expressions
Dictionary-based replacement
Handcrafted rules
Advantages: Interpretable, controllable
Disadvantages: Limited coverage, high maintenance
Statistical Normalization:
Noisy channel models
Spelling correction algorithms
Advantages: Data-driven, handles unseen cases
Disadvantages: Requires training data, may overgeneralize
Neural Normalization:
Sequence-to-sequence models
Character-level neural networks
Advantages: Handles complex transformations, learns patterns
Disadvantages: Black box, requires substantial training data
Code Example (Python):
python
import re
import string
import unicodedata
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# NLTK data needed: nltk.download('punkt'), nltk.download('stopwords'), nltk.download('wordnet')

class TextNormalizer:
    def __init__(self, language='english'):
        self.language = language
        self.stop_words = set(stopwords.words(language)) if language in stopwords.fileids() else set()
        self.stemmer = PorterStemmer()
        self.lemmatizer = WordNetLemmatizer()

        # Common contractions
        self.contractions = {
            "n't": " not",
            "'ll": " will",
            "'ve": " have",
            "'re": " are",
            "'m": " am",
            "'d": " would"
        }

    def remove_urls_emails(self, text):
        """Remove URLs and email addresses"""
        text = re.sub(r'https?://\S+|www\.\S+', '', text)
        text = re.sub(r'\S+@\S+', '', text)
        return text

    def replace_contractions(self, text):
        """Expand contractions"""
        for contraction, expansion in self.contractions.items():
            text = text.replace(contraction, expansion)
        return text

    def normalize_unicode(self, text):
        """Normalize Unicode characters to an ASCII approximation"""
        return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8')

    def remove_punctuation(self, text):
        """Remove punctuation"""
        translator = str.maketrans('', '', string.punctuation)
        return text.translate(translator)

    def normalize_whitespace(self, text):
        """Normalize whitespace"""
        return ' '.join(text.split())

    def normalize_repeating_chars(self, text):
        """Collapse characters repeated three or more times down to two"""
        return re.sub(r'(.)\1{2,}', r'\1\1', text)

    def tokenize_and_filter(self, text, remove_stopwords=True):
        """Tokenize and optionally remove stopwords"""
        tokens = word_tokenize(text)
        if remove_stopwords:
            tokens = [token for token in tokens if token.lower() not in self.stop_words]
        return tokens

    def stem_tokens(self, tokens):
        """Apply stemming to tokens"""
        return [self.stemmer.stem(token) for token in tokens]

    def lemmatize_tokens(self, tokens):
        """Apply lemmatization to tokens"""
        return [self.lemmatizer.lemmatize(token) for token in tokens]

    def normalize(self, text, lowercase=True, remove_urls=True, expand_contractions=True,
                  unicode_norm=True, remove_punct=True, norm_whitespace=True,
                  norm_repeating=True, remove_stopwords=False, stem=False, lemmatize=False):
        """Full normalization pipeline"""
        if lowercase:
            text = text.lower()

        if remove_urls:
            text = self.remove_urls_emails(text)

        if expand_contractions:
            text = self.replace_contractions(text)

        if unicode_norm:
            text = self.normalize_unicode(text)

        if remove_punct:
            text = self.remove_punctuation(text)

        if norm_whitespace:
            text = self.normalize_whitespace(text)

        if norm_repeating:
            text = self.normalize_repeating_chars(text)

        tokens = self.tokenize_and_filter(text, remove_stopwords)

        if stem:
            tokens = self.stem_tokens(tokens)

        if lemmatize:
            tokens = self.lemmatize_tokens(tokens)

        return tokens if (stem or lemmatize) else ' '.join(tokens)

# Example usage
normalizer = TextNormalizer()
input_text = "OMG!!! I luv NLP sooooo much :) #NaturalLanguageProcessing"
normalized_text = normalizer.normalize(input_text)
print(f"Original: {input_text}")
print(f"Normalized: {normalized_text}")

Normalization Considerations:
When to Normalize:
Data cleaning and preprocessing
Increasing lexical overlap for similarity measures
Reducing vocabulary size for efficiency
Standardizing inputs for machine learning
When to Preserve Original Text:
Sentiment analysis (punctuation and capitalization convey emotion)
Author attribution (writing style contains individual markers)
Certain domain-specific tasks (legal, medical precision)
When information loss outweighs standardization benefits
Effectiveness Metrics:
Task-specific performance improvement
Vocabulary size reduction
Improved matching/retrieval rates
Error reduction in downstream tasks
Applications:
Information retrieval and search engines
Text classification and categorization
Machine translation preprocessing
Sentiment analysis preparation
Named entity recognition preprocessing
Text similarity and duplicate detection
Speech recognition post-processing
Optical character recognition (OCR) correction
