Natural Language Processing
Unit I
🧠 Natural Language Processing (NLP)
📌 Definition:
Natural Language Processing (NLP) is a branch of Artificial Intelligence
(AI) that enables computers to understand, interpret, and generate
human language.
It combines:
Linguistics
Computer Science
Machine Learning
🌍 Scope of NLP
NLP allows machines to interact with human language for a wide variety of
tasks:
Core Areas in NLP:
Area | Description
Text Classification | Categorizing texts (e.g., spam vs. non-spam)
Sentiment Analysis | Identifying emotions in text (positive/negative/neutral)
Machine Translation | Translating text from one language to another
Speech Recognition | Converting spoken language to text
Text Summarization | Creating concise summaries from large texts
Question Answering | Systems that can answer user queries (like ChatGPT)
Named Entity Recognition (NER) | Identifying names, places, and organizations in text
💡 Applications in Various Domains
1. 📱 Customer Support
Chatbots and virtual assistants (e.g., Alexa, Siri)
Automated replies and email sorting
2. 📈 Business Intelligence
Analyzing customer feedback, reviews, and surveys
Market sentiment analysis
3. 🏥 Healthcare
Analyzing clinical notes and health records
Extracting drug names, diagnoses, and treatments
4. 📰 Media and Journalism
Auto-generating news summaries
Detecting fake news
5. ⚖️Legal and Compliance
Parsing legal documents
Contract analysis
6. 🛒 E-Commerce
Product recommendations based on reviews
Search query understanding and auto-completion
7. 🎓 Education
Automated essay scoring
Language learning apps (e.g., Duolingo)
⚠️Challenges and Limitations of NLP
Despite its success, NLP faces several challenges:
1. Ambiguity of Language
Words have multiple meanings based on context (e.g., “bank” –
riverbank or financial institution?).
2. Sarcasm & Irony
Machines struggle to detect sarcasm or emotional undertones.
3. Language Diversity
Thousands of languages, dialects, and scripts make universal NLP hard.
4. Data Dependency
Requires huge labeled datasets for training models.
Data scarcity for low-resource languages.
5. Grammar and Syntax Complexity
Natural languages have inconsistent grammar rules.
Informal language, typos, slang complicate processing.
6. Context Understanding
Hard to understand context across multiple sentences or long
documents.
7. Bias and Fairness
NLP models can learn and reflect biases present in training data
(gender, racial, political, etc.).
🚧 Limitations of Current NLP Systems:
Often lack true understanding – they work statistically, not
semantically.
Struggle with reasoning, common sense, and logic-based tasks.
Require retraining for domain-specific vocabulary (e.g., medical vs.
legal terms).
1. Syntax in NLP
Syntax refers to the structure of language — how words are combined to
form grammatically correct sentences. It's about rules and grammar.
🔧 NLP Tasks in Syntax:
Task | Description
Part-of-Speech (POS) Tagging | Assigns word types such as noun, verb, adjective, etc. to each word in a sentence.
Parsing (Syntactic Analysis) | Analyzes the grammatical structure of a sentence and builds a parse tree.
Chunking (Shallow Parsing) | Groups words into meaningful phrases (noun phrases, verb phrases, etc.).
Sentence Segmentation | Splits a paragraph into individual sentences.
Morphological Analysis | Breaks words into morphemes (smallest units of meaning). Example: "unhappiness" = un + happy + ness
✅ Example:
Sentence: "The cat sat on the mat."
POS Tags: [The/DET] [cat/NOUN] [sat/VERB] [on/PREP] [the/DET]
[mat/NOUN]
Parse Tree: Shows how sentence components are nested.
🧠 2. Semantics in NLP
Semantics is the study of meaning in language — understanding what the
sentence actually conveys.
🔧 NLP Tasks in Semantics:
Task | Description
Word Sense Disambiguation (WSD) | Identifies the correct meaning of a word in context (e.g., "bank" as riverbank or financial institution).
Named Entity Recognition (NER) | Identifies entities like names, places, organizations, etc.
Semantic Role Labeling (SRL) | Determines the roles of words in a sentence (who did what to whom).
Textual Entailment | Checks if one sentence logically follows from another.
Semantic Similarity | Measures how close the meaning of two sentences or words is.
✅ Example:
Sentence: "Apple launched a new iPhone."
NER: Apple → Organization, iPhone → Product
SRL: Agent (Apple), Action (launched), Object (iPhone)
🧠 3. Pragmatics in NLP
Pragmatics focuses on how context and world knowledge affect
language interpretation. It goes beyond literal meaning.
🔧 NLP Tasks in Pragmatics:
Task | Description
Coreference Resolution | Finds what a pronoun or noun phrase refers to (e.g., "John said he will come" → "he" = John).
Discourse Analysis | Understands the flow of meaning across multiple sentences or paragraphs.
Dialogue Management | Handles conversations in chatbots or virtual assistants by keeping track of context.
Intent Detection | Infers what the user wants (e.g., asking for weather, booking a flight).
Sarcasm & Irony Detection | Identifies non-literal or humorous use of language.
✅ Example:
Conversation:
User: "I’m freezing!"
Bot: (Understands the context means the user is cold and suggests
turning on the heater)
🎯 Summary Table
Level | Focus | Key NLP Tasks
Syntax | Structure | POS Tagging, Parsing, Chunking
Semantics | Meaning | WSD, NER, SRL, Entailment
Pragmatics | Context & Intent | Coreference, Discourse, Dialogue, Intent Detection
📘 1. Boolean Model
✅ Definition:
The Boolean model represents documents and queries as a set of terms
(keywords) and uses Boolean logic (AND, OR, NOT) to retrieve
documents that exactly match the query conditions.
📚 How it Works:
Each document is represented as a binary vector of terms.
A term is either present (1) or absent (0).
Queries are written as Boolean expressions.
🔍 Example:
Query: "AI AND Healthcare NOT Robotics"
Only documents that contain both “AI” and “Healthcare” but not “Robotics”
will be returned.
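The same logic is easy to sketch in plain Python with sets; the documents and query below are purely illustrative, not part of any particular IR library:

docs = {
    "d1": {"ai", "healthcare", "diagnosis"},
    "d2": {"ai", "robotics", "automation"},
    "d3": {"healthcare", "policy"},
}

# Query: AI AND Healthcare NOT Robotics
results = [doc_id for doc_id, terms in docs.items()
           if "ai" in terms and "healthcare" in terms and "robotics" not in terms]
print(results)  # ['d1']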
🧠 Advantages:
Simple and easy to implement.
Results are unambiguous.
⚠️Limitations:
No ranking of documents (all matches are equal).
Does not handle partial relevance.
No concept of term frequency or document similarity.
📘 2. Vector Space Model (VSM)
✅ Definition:
The Vector Model represents documents and queries as vectors in an n-
dimensional space, where each dimension corresponds to a term. Relevance
is measured using cosine similarity between vectors.
📚 How it Works:
Documents and queries are represented as TF-IDF vectors.
Cosine similarity is used to calculate the angle between query and
document vectors.
The smaller the angle, the more relevant the document.
📈 Formula:
\text{cosine\_similarity}(\vec{d}, \vec{q}) = \frac{\vec{d} \cdot \vec{q}}{\|\vec{d}\| \cdot \|\vec{q}\|}
Where:
d = Document vector
q = Query vector
🔍 Example:
If a query is "Machine Learning" and Document A has those terms with high
weights, it will rank higher than Document B which only mentions "Learning".
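A minimal sketch of this ranking idea, assuming scikit-learn is available (the two documents here are illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["machine learning improves search ranking",
        "deep learning for image recognition"]
query = ["machine learning"]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)   # TF-IDF vectors for the documents
query_vector = vectorizer.transform(query)     # TF-IDF vector for the query

# One cosine score per document; the higher score marks the more relevant document
print(cosine_similarity(query_vector, doc_vectors))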
🧠 Advantages:
Supports ranking of documents.
Measures partial matches.
Easy to implement using linear algebra.
⚠️Limitations:
Ignores term dependencies (word order).
Can't handle uncertainty or noise in meaning.
Performance drops in very large corpora without optimization.
📘 3. Probabilistic Model
✅ Definition:
The Probabilistic Model estimates the probability that a document is
relevant to a given query. It ranks documents based on this probability.
📚 How it Works:
Based on Bayes' Theorem.
For each document D, calculate:
P(R = 1 \mid D, Q) → the probability that document D is relevant to query Q
The system ranks documents by decreasing order of probability.
🔍 Example Models:
Binary Independence Model (BIM)
BM25 (Best Matching 25) – one of the most popular probabilistic IR
models.
🧠 Advantages:
Provides statistical foundation.
Can be tuned using relevance feedback.
More flexible and accurate than Boolean or basic Vector models.
⚠️Limitations:
Requires training data or assumptions about relevance.
Complex to compute in large-scale environments.
📊 Comparison Table
Feature | Boolean Model | Vector Model | Probabilistic Model
Basis | Set theory | Linear algebra | Probability theory
Result Type | Binary (match/no match) | Ranked documents | Ranked documents
Term Weighting | None (0 or 1) | Yes (TF-IDF) | Yes (based on relevance)
Handles Partial Match | No | Yes | Yes
Complexity | Low | Moderate | High
Common Use | Simple search filters | Full-text search engines | Advanced IR systems (e.g., BM25)
✅ Summary
Boolean Model – Simple, rule-based, good for exact match.
Vector Model – Ranks documents based on similarity, suitable for
search engines.
Probabilistic Model – Advanced, statistical model used in modern IR
systems like Elasticsearch and Lucene.
🧠 1. Rule-Based Model (General NLP)
✅ Overview:
Uses manually crafted linguistic rules (grammar, syntax,
morphology).
Based on expert knowledge and dictionaries.
No learning from data.
🧩 Example Use:
POS tagging using handcrafted rules (e.g., "if a word ends with 'ly', tag
it as an adverb").
🧠 Pros:
Interpretable, deterministic.
Useful in domain-specific or low-resource languages.
⚠️Cons:
Time-consuming to develop.
Poor generalization to unseen data.
Difficult to scale for real-world language complexities.
📈 2. Statistical Model
✅ Overview:
Uses probabilities derived from large corpora.
Learns language patterns using machine learning.
Examples: Hidden Markov Models (HMM), n-grams, Maximum
Entropy Models.
🧩 Example Use:
POS tagging using HMMs
Speech recognition using n-gram language models
🧠 Pros:
Better adaptability to unseen data.
More accurate than rule-based for many tasks.
⚠️Cons:
Requires large datasets.
May produce unnatural output in some cases.
🔍 3. Information Retrieval (IR) Model
✅ Overview:
Focuses on retrieving relevant documents given a user query.
Based on models like Boolean, Vector Space, and Probabilistic
(BM25).
🧩 Example Use:
Search engines, document ranking, and question answering systems.
🧠 Pros:
Efficient retrieval over large text corpora.
Supports ranking of relevance.
⚠️Cons:
Doesn't understand deep semantics.
Focuses more on matching than true understanding.
🌍 4. Rule-Based Machine Translation (RBMT)
✅ Overview:
Translates text using predefined linguistic rules and bilingual
dictionaries.
Language pairs are mapped using transfer rules.
🧩 Example Use:
Translating between grammatically similar languages using syntactic
rules.
🧠 Pros:
High accuracy in limited domain or formal language.
Human-readable transformation rules.
⚠️Cons:
Extremely resource-intensive to build.
Fails with idiomatic or ambiguous language.
Not robust to informal or spoken input.
📊 5. Probabilistic Graphical Models (PGM)
✅ Overview:
Combines graph theory and probability to model complex
dependencies.
Examples: Bayesian Networks, Markov Random Fields,
Conditional Random Fields (CRF).
🧩 Example Use:
NER, POS tagging, sequence labeling using CRFs.
🧠 Pros:
Captures dependencies and uncertainty in language.
Well-suited for structured prediction tasks.
⚠️Cons:
Computationally intensive.
Requires labelled data and expertise in probabilistic modeling.
📊 Summary Comparison Table
Feature / Model | Rule-Based NLP | Statistical NLP | IR Model | Rule-Based MT | Probabilistic Graphical Model
Based on | Human rules | Data / statistics | Term matching | Translation rules | Graph + probability
Learning from data? | ❌ No | ✅ Yes | ✅/❌ (depends) | ❌ No | ✅ Yes
Flexibility | Low | Moderate to high | High for retrieval | Low | High
Interpretability | High | Moderate | High | High | Moderate
Scalability | Low | High | High | Low | Moderate
Example Task | Grammar check | POS tagging, MT | Search engine | Language translation | NER, POS, info extraction
✅ Final Notes:
Modern NLP systems (like BERT, GPT) use deep learning, which
has largely replaced these classical models, though PGMs and
statistical models are still used in low-resource and explainable NLP
systems.
Rule-based and IR models are still used in hybrid systems where
interpretability is critical (e.g., legal, medical).
Unit II: Linguistics and Morphology
1. Phonetics
✅ Definition:
Phonetics is the study of the physical sounds of human speech — how
speech sounds are produced, transmitted, and received.
🔍 Key Aspects of Phonetics:
Type of Phonetics | Description
Articulatory Phonetics | Studies how speech sounds are produced using the vocal organs (e.g., lips, tongue, vocal cords).
Acoustic Phonetics | Studies the physical properties of sound waves (frequency, amplitude, duration).
Auditory Phonetics | Focuses on how listeners perceive speech sounds.
🧩 Applications in NLP:
Speech Recognition Systems: Understanding sound variations
(accents, pronunciation).
Text-to-Speech (TTS): Generating natural-sounding speech.
Voice Biometrics: Identifying users based on vocal characteristics.
Example:
The word “cat” is made up of these phonetic sounds:
/k/ – voiceless velar stop
/æ/ – low front vowel
/t/ – voiceless alveolar stop
🧠 2. Phonology
✅ Definition:
Phonology is the study of how sounds function in a particular language
or languages — the abstract, mental representation of sounds.
🔍 Key Concepts:
Concept | Explanation
Phoneme | The smallest sound unit that can change meaning (e.g., /p/ vs. /b/ in pat vs. bat).
Allophone | Variations of a phoneme that do not change meaning (e.g., aspirated /pʰ/ in pin vs. unaspirated /p/ in spin).
Minimal Pairs | Pairs of words that differ by only one phoneme (e.g., bit vs. pit).
Syllable Structure | Rules on how sounds are organized into syllables (onset, nucleus, coda).
🧩 Applications in NLP:
Pronunciation modeling in speech synthesis.
Language modeling for ASR (Automatic Speech Recognition).
Phonological rules used in grammar correction and spelling
prediction.
Example:
In English, the plural ending "-s" is pronounced differently based on the final
sound of the noun:
cats → /s/
dogs → /z/
buses → /ɪz/
This variation is phonological and follows specific sound rules.
🔄 Phonetics vs. Phonology — Comparison
Feature | Phonetics | Phonology
Focus | Physical properties of speech sounds | Abstract, mental representation of sounds
Scope | How sounds are made and heard | How sounds function and relate to each other
Units of Study | Phones (actual sounds) | Phonemes (meaningful sound units)
Tools | Spectrograms, waveforms, articulatory diagrams | Phonemic charts, minimal pairs
NLP Use Case | TTS, ASR | Pronunciation dictionaries, language rules
🎯 Summary
Phonetics = the science of sounds (how they are produced,
transmitted, and heard).
Phonology = the rules and patterns of how sounds are organized in a
language.
Both are crucial for speech-based NLP tasks and understanding the
linguistic backbone of natural language.
🧱 1. Morphology
✅ Definition:
Morphology is the study of the structure of words—how they are formed
and how they relate to other words.
🔹 Key Term: Morpheme
A morpheme is the smallest meaningful unit of language.
Types:
o Free morpheme: Can stand alone (e.g., "book", "run").
o Bound morpheme: Cannot stand alone (e.g., "-s", "-ed", "un-").
🧩 Applications in NLP:
Lemmatization & stemming
Spell checking
Machine translation
2. Syntax
✅ Definition:
Syntax studies the structure of sentences — how words are combined to
form grammatically correct phrases and sentences.
🔹 Concepts:
Phrases (noun phrase, verb phrase)
Parse trees
Part-of-Speech (POS) tagging
🧩 Applications in NLP:
Grammar checking
Sentence parsing
Text generation
💬 3. Semantics
✅ Definition:
Semantics is the study of meaning in language — both of individual words
and how meaning is constructed in phrases/sentences.
🔹 Examples:
Word sense disambiguation (e.g., “bank” as a riverbank vs financial)
Semantic roles (who did what to whom)
🧩 Applications in NLP:
Question answering
Chatbots
Semantic search
🧠 4. Pragmatics
✅ Definition:
Pragmatics deals with language in context — how meaning is interpreted
based on situation, speaker intent, and shared knowledge.
🔹 Examples:
“Can you pass the salt?” → A request, not a question about ability.
🧩 Applications in NLP:
Conversational AI
Sentiment analysis
Emotion detection
🧭 5. Semiotics
✅ Definition:
Semiotics is the study of signs and symbols and their use or
interpretation in communication.
🔹 Aspects:
Signifier (form) and Signified (meaning)
Language as a system of signs
🧩 Applications:
Symbolic reasoning
Brand and visual language analysis
Human-computer interaction
📚 6. Discourse Analysis
✅ Definition:
Discourse analysis studies language beyond the sentence level —
including coherence, context, and structure in conversation or text.
🔹 Topics:
Turn-taking in conversation
Co-reference resolution (e.g., “John was tired. He went home.”)
Discourse markers ("however", "therefore")
🧩 Applications in NLP:
Dialogue systems
Summarization
Narrative analysis
🧬 7. Psycholinguistics
✅ Definition:
Psycholinguistics examines how language is processed in the human
brain, including how people understand, produce, and acquire language.
🔹 Areas:
Language acquisition
Word recognition
Sentence processing
🧩 Applications in NLP:
Cognitive modeling
Speech recognition and error prediction
Assistive technologies
8. Corpus Linguistics
✅ Definition:
Corpus linguistics involves the study of language through large collections of
real-world text data (corpora).
🔹 Components:
Annotated corpora (POS tags, syntactic trees, semantic roles)
Frequency analysis
Concordance analysis
🧩 Applications in NLP:
Training language models
POS tagging
Collocation extraction
Sentiment analysis
🧠 Summary Table:
Concept | Focus | NLP Use Cases
Morphology | Word structure & formation | Lemmatization, stemming, tokenization
Syntax | Sentence structure | Parsing, POS tagging, syntax trees
Semantics | Meaning of words & sentences | QA systems, WSD, knowledge graphs
Pragmatics | Contextual meaning | Chatbots, intent detection
Semiotics | Symbols and meaning | Symbolic AI, UX design
Discourse Analysis | Beyond sentence-level meaning | Summarization, co-reference resolution
Psycholinguistics | Brain's language processing | Cognitive NLP, speech prediction
Corpus Linguistics | Study of language via datasets | Model training, text mining, language resources
📚 1. Word Formation Processes
These are the different ways in which new words are formed in a language.
Understanding these processes helps in morphological analysis, tokenization,
and language generation.
🔹 Major Types:
Process | Description | Example
Derivation | Adding prefixes or suffixes to create new words | happy → unhappy, play → playful
Inflection | Modifying a word to express tense, number, etc., without changing its category | walk → walked, cat → cats
Compounding | Joining two words to form a new word | notebook, blackboard
Clipping | Shortening a word | telephone → phone, influenza → flu
Blending | Combining parts of two words | brunch (breakfast + lunch), smog (smoke + fog)
Acronyms & Initialisms | Forming words from initial letters | NASA, FBI
Conversion | Changing the grammatical category without modification | Google (noun) → to Google (verb)
🧩 2. Morphological Analysis
✅ Definition:
Morphological analysis is the process of breaking down a word into its
base form (root) and its morphemes (smallest units of meaning).
🔍 Example:
Word: "unhappiness"
o Root: happy
o Prefix: un- (negation)
o Suffix: -ness (noun-forming)
🧠 Uses in NLP:
Lemmatization (finding base form)
Stemming
Machine Translation
Spell Checkers
🔧 Tools:
Rule-based analyzers
Morphological dictionaries
Finite State Transducers (see next section)
⚙️3. Morphological Finite State Transducers (FSTs)
✅ Definition:
A Finite State Transducer (FST) is an automaton (like a state machine)
used for modeling the mapping between two sets of symbols. In
morphology, it maps surface forms (inflected) to lexical forms (base +
features).
🧠 Think of it like:
Input: "running"
Output: run + V + Prog
🔹 Structure:
States (nodes)
Transitions (edges with input/output)
Accept states
Alphabet (symbols or letters)
📜 How it works:
It reads a word character-by-character, matching rules (like suffix patterns)
to derive its root and grammatical properties.
🔍 Example Transition:
State 0:
"r" → "r" → State 1
"u" → "u" → State 2
...
"ing" → "+V+Prog" → Final state
💡 Applications:
Morphological parsing (analyze form)
Morphological generation (create word forms from root + grammar)
Speech recognition and synthesis
Language learning tools
🧪 Example Output:
Input: "walked"
Output: walk + V + Past
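A real FST encodes such rules as states and transitions; the toy Python sketch below imitates the same input/output mapping with simple suffix rules (the rule list is illustrative and ignores spelling changes):

RULES = [("ing", "+V+Prog"), ("ed", "+V+Past"), ("s", "+N+Pl")]

def analyze(word):
    # Strip the first matching suffix and append its features
    for suffix, features in RULES:
        if word.endswith(suffix):
            return word[:-len(suffix)] + features
    return word + "+N+Sg"

print(analyze("walked"))   # walk+V+Past
print(analyze("running"))  # runn+V+Prog (a real FST would also undo the doubled 'n')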
Implementation in NLP
Popular tools that use FSTs or support morphological analysis:
Tool | Description
XFST | Xerox Finite-State Tool for morphological parsing
FOMA | Open-source alternative to XFST
Hunspell | Used in spell checkers (LibreOffice, Firefox)
NLTK | Can be used to create simple morphological analyzers
🎯 Summary
Concept | Description
Word Formation Processes | How new words are created in a language
Morphological Analysis | Identifying root words and affixes
Finite State Transducers | Automata used to model and process morphological rules
Unit III: Word Level Analysis
🧱 1. Tokenization
✅ Definition:
Tokenization is the process of splitting text into smaller units (called
tokens), such as words, subwords, or sentences.
🔹 Types:
Word Tokenization – Splits text into words.
o Example: "I love NLP" → ["I", "love", "NLP"]
Sentence Tokenization – Splits a paragraph into sentences.
o Example: "I love NLP. It's amazing!" → ["I love NLP.", "It's
amazing!"]
Subword Tokenization – Useful for deep learning models (e.g.,
BERT).
o Example: "unhappiness" → ["un", "happi", "ness"]
🧠 Why it's useful:
First step in most NLP pipelines.
Enables other processes like POS tagging, lemmatization, etc.
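A quick sketch with NLTK (assuming the Punkt tokenizer data has been downloaded):

import nltk
nltk.download("punkt", quiet=True)   # newer NLTK releases may also need "punkt_tab"
from nltk.tokenize import sent_tokenize, word_tokenize

text = "I love NLP. It's amazing!"
print(sent_tokenize(text))   # ['I love NLP.', "It's amazing!"]
print(word_tokenize(text))   # ['I', 'love', 'NLP', '.', 'It', "'s", 'amazing', '!']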
🧩 2. Part-of-Speech Tagging (POS Tagging)
✅ Definition:
POS tagging is the process of labeling each word with its grammatical
category (e.g., noun, verb, adjective).
🔹 Example:
"She is eating an apple."
→ [('She', PRON), ('is', AUX), ('eating', VERB), ('an', DET), ('apple', NOUN)]
🔹 Common POS Tags:
Tag | Meaning
NN | Noun
VB | Verb
JJ | Adjective
RB | Adverb
PRP | Pronoun
Libraries:
NLTK
spaCy
Stanford NLP
TextBlob
🧠 Use Cases:
Grammar checking
Named Entity Recognition
Machine Translation
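A short sketch with NLTK's default tagger (assuming the tagger and tokenizer data have been downloaded); the tags follow the Penn Treebank tagset:

import nltk
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("She is eating an apple.")
print(nltk.pos_tag(tokens))
# e.g., [('She', 'PRP'), ('is', 'VBZ'), ('eating', 'VBG'), ('an', 'DT'), ('apple', 'NN'), ('.', '.')]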
🧬 3. Lemmatization
✅ Definition:
Lemmatization is the process of reducing a word to its base or dictionary
form (lemma), considering the context and part of speech.
🔹 Example:
"am", "are", "is" → "be"
"better" → "good"
"running" → "run"
💡 Features:
Considers word meaning and POS
Uses lexicons and morphological analysis
Tools:
WordNetLemmatizer (NLTK)
spaCy's Token.lemma_
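A minimal sketch with NLTK's WordNetLemmatizer (assuming the WordNet data has been downloaded); note that supplying the part of speech matters:

import nltk
nltk.download("wordnet", quiet=True)
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running", pos="v"))  # run
print(lemmatizer.lemmatize("is", pos="v"))       # be
print(lemmatizer.lemmatize("better", pos="a"))   # good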
✂️4. Stemming
✅ Definition:
Stemming is the process of removing suffixes or prefixes to find the root
form of a word.
🔹 Example:
"playing", "played", "plays" → "play"
"studies" → "studi"
💡 Characteristics:
Often results in non-real words
Uses heuristic-based rules
Faster but less accurate than lemmatization
🔧 Common Algorithms:
Porter Stemmer
Lancaster Stemmer
Snowball Stemmer
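A quick sketch with NLTK's Porter stemmer:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["playing", "played", "plays", "studies"]:
    print(word, "->", stemmer.stem(word))
# playing -> play, played -> play, plays -> play, studies -> studi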
🔍 Comparison Table:
Feature | Tokenization | POS Tagging | Lemmatization | Stemming
Purpose | Split text | Assign grammatical tags | Reduce to base word | Remove suffixes
Output | Tokens | (word, POS) pairs | Lemmas | Stems
Context aware? | ❌ | ✅ | ✅ | ❌
Real words? | ✅ | ✅ | ✅ | ❌ (sometimes)
Speed | Fast | Moderate | Moderate | Fast
1. Named Entity Recognition (NER)
✅ Definition:
NER is the task of identifying and classifying named entities in text into
predefined categories such as:
Person names
Organizations
Locations
Dates, Time
Monetary values
Percentages
🔍 Example:
"Barack Obama was born in Hawaii and served as President of the United
States."
NER Tags:
Barack Obama → PERSON
Hawaii → LOCATION
President → TITLE
United States → LOCATION/ORGANIZATION
🔧 NER Categories (Typical):
Entity Type | Example
PERSON | Elon Musk
LOCATION | India, Himalayas
ORGANIZATION | Google, UN
DATE | 15 August 1947
TIME | 10:30 AM
MONEY | $100, ₹500
Libraries:
spaCy (ent.label_)
Stanford NLP
Flair
Hugging Face transformers (e.g., BERT NER models)
🧠 Applications:
Information extraction
Question answering
Resume parsing
Chatbots
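A small sketch with spaCy (assuming the en_core_web_sm model has been downloaded); the exact labels depend on the model:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Barack Obama was born in Hawaii and served as President of the United States.")
for ent in doc.ents:
    print(ent.text, "->", ent.label_)
# e.g., Barack Obama -> PERSON, Hawaii -> GPE, the United States -> GPE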
🧠 2. Word Sense Disambiguation (WSD)
✅ Definition:
WSD is the process of determining the correct meaning (sense) of a word
based on its context, especially for words with multiple meanings.
🔍 Example:
Word: "bank"
"He went to the bank to deposit money." → financial institution
"He sat on the bank of the river." → side of a river
🔧 Approaches:
Approach | Description
Knowledge-based | Uses dictionaries like WordNet
Supervised | Uses annotated corpora for training
Unsupervised | Clusters word usages without labeling
Contextual models | Uses transformers (e.g., BERT, GPT)
Tools:
NLTK + WordNet
Lesk Algorithm (classic)
Deep Learning with Hugging Face models
🧠 Applications:
Machine translation
Search engines
Semantic analysis
Question answering systems
🔡 3. Word Embedding
✅ Definition:
Word Embedding is a technique where words are represented as dense
vectors in a continuous vector space, where similar words have similar
representations.
🔍 Why?
It captures semantic meaning of words based on context and co-
occurrence in large corpora.
🧠 Key Ideas:
Words used in similar contexts are closer in vector space.
Unlike one-hot encoding, word embeddings are dense and low-
dimensional.
🔧 Popular Word Embedding Models:
Model | Description
Word2Vec | Predicts a word from its context (or vice versa)
GloVe | Uses a word co-occurrence matrix
FastText | Includes subword information
BERT Embeddings | Contextualized, dynamic embeddings from transformers
🌐 Example:
from gensim.models import Word2Vec
# Sample sentence
sentences = [["dog", "barks"], ["cat", "meows"]]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)
# Get embedding for a word
vector = model.wv["dog"]
🔍 Word2Vec Example Similarities:
model.wv.most_similar("king") → might return ["queen", "prince",
"monarch"]
🧠 Applications:
Sentiment analysis
Recommendation systems
Chatbots
Text classification
Similarity detection
🔁 Summary Table:
Concept | Description | Key Use | Tool Examples
NER | Identify names, locations, etc. | Info extraction | spaCy, NLTK, Transformers
WSD | Identify the correct meaning of a word in context | Semantic analysis | WordNet, BERT, Lesk
Word Embedding | Represent words as vectors | Word similarity, NLP modeling | Word2Vec, GloVe, BERT
1. ✅ Rule-Based POS Tagging
📌 Definition:
Uses hand-written rules and lexicons (dictionaries) to assign POS tags
based on patterns in the words and their neighbors.
🔧 How it works:
A dictionary provides possible tags for each word.
Syntactic rules determine the most likely tag based on context.
o e.g., If a word is preceded by a determiner (like "the"), it's likely a
noun.
🧪 Example:
“The boy played football.”
Rule: If a word follows "The", tag it as a noun → "boy" = NOUN
✅ Pros:
Simple, interpretable
Doesn’t require training data
❌ Cons:
Hard to scale (requires many rules)
Less accurate than statistical methods
2. 🎲 Stochastic (Statistical) POS Tagging
📌 Definition:
Uses probability and statistics from annotated corpora to determine the
most likely tag for a word.
🔧 Methods:
Unigram Tagging: Assigns the most frequent tag for each word.
Bigram/Trigram Tagging: Considers previous one or two tags as
context.
Hidden Markov Models (HMM): Uses probabilistic models for
sequences of words and tags.
🧪 Example:
Word: “play”
In “They play cricket” → verb
In “a play by Shakespeare” → noun
The tag with the highest probability is selected based on context.
✅ Pros:
Learns from data
Higher accuracy with more training
❌ Cons:
Requires large annotated corpora
Can struggle with unknown words
3. 🔁 Transformation-Based (Brill) POS Tagging
📌 Definition:
A hybrid approach that combines rule-based and statistical methods. It starts
with initial tagging (usually unigram), then refines it using
transformation rules learned from training data.
🧠 Example:
Initial tagging: "can" → noun
Transformation rule: If "can" is followed by a verb, it’s likely a modal
verb
🔧 Key Steps:
1. Initial tagging
2. Learn transformation rules from corpus
3. Apply rules iteratively to improve accuracy
✅ Pros:
High accuracy
Learns interpretable rules from data
❌ Cons:
Slower to train
Needs both annotated data and rule generation
4. 📚 Lexical POS Tagging
📌 Definition:
Uses a lexicon (dictionary) where each word is associated with its possible
POS tags based on usage frequency or pre-defined mapping.
🧠 Characteristics:
Doesn't look much at surrounding words (unlike stochastic methods).
Works well for words with unique or less ambiguous tags.
🧪 Example:
"and" → Always tagged as CONJUNCTION
"the" → Always tagged as DETERMINER
✅ Pros:
Simple and fast
Effective for known, unambiguous words
❌ Cons:
Poor at resolving ambiguity
Doesn’t handle context well
🧾 Summary Table:
Type | Method Used | Learns from Data? | Context Awareness | Accuracy
Rule-Based | Hand-coded rules | ❌ | ✅ (via rules) | Medium
Stochastic | Probabilities, HMM | ✅ | ✅ | High
Transformation-Based | Rules learned from data | ✅ | ✅ | High
Lexical | Word-tag dictionary | ❌ | ❌ | Low-Med
Tools & Libraries Supporting All:
NLTK (Python)
spaCy
Stanford CoreNLP
OpenNLP
AllenNLP
1. Hidden Markov Model (HMM)
✅ Definition:
An HMM is a statistical model used for sequence prediction, where the
system being modeled is assumed to be a Markov process with hidden
states.
In NLP, it is widely used for:
Part-of-Speech Tagging
Named Entity Recognition
Speech Recognition
🧠 Key Concepts:
States: POS tags (e.g., Noun, Verb)
Observations: Words in the sentence
Transition probabilities: Probability of one tag following another
Emission probabilities: Probability of a word given a tag
🔁 HMM Steps:
1. Start with a known POS tag.
2. Predict the next tag using transition probabilities.
3. Predict the word using emission probabilities.
4. Use Viterbi Algorithm to find the most likely tag sequence.
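As an illustration of step 4, here is a compact Python sketch of Viterbi decoding over a toy two-tag HMM (all probabilities are invented for illustration):

tags = ["NOUN", "VERB"]
start_p = {"NOUN": 0.6, "VERB": 0.4}
trans_p = {"NOUN": {"NOUN": 0.3, "VERB": 0.7}, "VERB": {"NOUN": 0.6, "VERB": 0.4}}
emit_p  = {"NOUN": {"dogs": 0.5, "bark": 0.1}, "VERB": {"dogs": 0.1, "bark": 0.5}}

def viterbi(words):
    # best[i][tag] = (probability of the best path ending in tag at word i, previous tag)
    best = [{t: (start_p[t] * emit_p[t].get(words[0], 1e-6), None) for t in tags}]
    for w in words[1:]:
        best.append({t: max((best[-1][p][0] * trans_p[p][t] * emit_p[t].get(w, 1e-6), p)
                            for p in tags) for t in tags})
    # Backtrack from the most probable final tag
    seq = [max(tags, key=lambda t: best[-1][t][0])]
    for layer in reversed(best[1:]):
        seq.append(layer[seq[-1]][1])
    return list(reversed(seq))

print(viterbi(["dogs", "bark"]))  # ['NOUN', 'VERB']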
⚖️2. Maximum Entropy Model (MaxEnt)
✅ Definition:
A discriminative model that uses the principle of maximum entropy to
predict the most likely outcome (e.g., POS tag, NER tag) based on
contextual features.
🧠 Key Idea:
Rather than modeling sequences like HMMs, MaxEnt focuses on:
Feature-based classification (e.g., word suffix, previous tags)
Tries to be as unbiased as possible without contradicting known data
💡 Example Features for POS:
Current word = "running"
Previous word = "is"
Word suffix = "ing"
📦 Used in:
POS tagging
NER
Text classification
✅ Pros:
Can use arbitrary, overlapping features
More flexible than HMMs
📊 3. N-grams
✅ Definition:
N-grams are contiguous sequences of n items (words or characters) from
a given sample of text.
N-Gram | Example (Sentence: "I love NLP")
Unigram (n=1) | I, love, NLP
Bigram (n=2) | I love, love NLP
Trigram (n=3) | I love NLP
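A quick sketch of extracting these with NLTK's ngrams helper:

from nltk import ngrams

tokens = "I love NLP".split()
print(list(ngrams(tokens, 2)))  # [('I', 'love'), ('love', 'NLP')]
print(list(ngrams(tokens, 3)))  # [('I', 'love', 'NLP')]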
📌 Applications:
Language modeling
Text generation
Spelling correction
Auto-suggestions
⚠️Limitation:
Poor for capturing long-range dependencies
Data sparsity in higher n-grams
🔗 4. Collocations
✅ Definition:
Collocations are combinations of words that occur together more often
than by chance.
🔍 Examples:
“strong tea” (but not “powerful tea”)
“make a decision” (not “do a decision”)
📊 Detected Using:
Frequency-based methods: Pointwise Mutual Information (PMI)
Statistical association measures: Chi-squared, t-score
📦 Applications:
Machine translation
Lexicography (dictionary building)
Improving naturalness of generated text
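A small sketch of PMI-based collocation finding with NLTK (the token list is a tiny illustrative stand-in for a real corpus):

from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

tokens = ["strong", "tea", "and", "strong", "tea", "again",
          "make", "a", "decision", "and", "make", "a", "decision"]
finder = BigramCollocationFinder.from_words(tokens)
measures = BigramAssocMeasures()

# Top bigrams ranked by pointwise mutual information
print(finder.nbest(measures.pmi, 3))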
5. Applications of Named Entity Recognition (NER)
📌 Real-World Use Cases:
Application Area | Description
Information Extraction | Extract people, organizations, locations from news articles
Question Answering | Identify named entities as answers
Resume Parsing | Extract names, dates, skills from CVs
Chatbots | Understand user intents better
Social Media Monitoring | Detect brands, products, or sentiment drivers
Legal & Medical Docs | Extract patient names, drug names, legal terms
🔧 Example:
Input:
“Apple Inc. launched a new iPhone in California on September 14.”
NER Tags:
Apple Inc. → ORGANIZATION
iPhone → PRODUCT
California → LOCATION
September 14 → DATE
🧾 Summary Table:
Concept | Description | Main Use
HMM | Probabilistic sequence model using hidden states | POS tagging, NER
MaxEnt | Feature-based discriminative classifier | POS, NER, sentiment
N-Grams | Sequence of N tokens | Language modeling
Collocations | Frequent co-occurring phrases | Natural phrasing
NER Applications | Use of entity extraction in real domains | Resumes, QA, NLP
Unit IV: Syntax Analysis
1. Grammatical Formalisms
Grammatical formalisms are systems or methods used to describe the
syntactic structure of sentences in a language. They form the basis for
syntactic parsing in Natural Language Processing (NLP).
🧱 2. Context-Free Grammar (CFG)
✅ Definition:
A Context-Free Grammar (CFG) is a type of formal grammar that consists
of a set of production rules used to generate strings (sentences) in a
language.
CFGs are widely used in NLP to describe the syntax of natural languages in
a simplified, yet powerful way.
✨ Components of CFG:
A CFG is defined as a 4-tuple:
G = (N, Σ, P, S), where:
N = Non-terminal symbols (e.g., S, NP, VP)
Σ = Terminal symbols (e.g., actual words)
P = Production rules (e.g., S → NP VP)
S = Start symbol (usually S for sentence)
🔧 Example of a CFG:
S → NP VP
NP → Det N
VP → V NP
Det → "the" | "a"
N → "cat" | "dog"
V → "chased" | "saw"
🧪 Example Sentence:
"The cat chased the dog."
Parsed as:
S
o NP (The cat)
o VP (chased the dog)
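The CFG above can be run directly with NLTK's chart parser, as a quick sketch:

import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the' | 'a'
N -> 'cat' | 'dog'
V -> 'chased' | 'saw'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the cat chased the dog".split()):
    tree.pretty_print()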
3. Grammar Rules for English
These are specific rules to describe how words combine to form phrases
and sentences in English.
🔹 Common English Grammar Rules (CFG-like):
Rule | Description
S → NP VP | A sentence consists of a noun phrase and a verb phrase
NP → Det N | A noun phrase is a determiner followed by a noun
VP → V NP | A verb phrase is a verb followed by a noun phrase
PP → P NP | A prepositional phrase is a preposition followed by a noun phrase
NP → NP PP | A noun phrase can be expanded with a prepositional phrase
📍 Example:
"The dog in the garden barked loudly."
This can be parsed using rules:
NP → Det N PP
PP → P NP
VP → V Adv
🧠 4. Syntactic Parsing
✅ Definition:
Syntactic parsing (or syntactic analysis) is the process of analyzing the
structure of a sentence according to a grammar.
It answers:
“How are the words in this sentence structured grammatically?”
📍 Types of Parsing:
Type | Description
Constituency Parsing | Divides a sentence into sub-phrases (constituents) like NP, VP
Dependency Parsing | Analyzes grammatical relationships (e.g., subject, object) between words
📊 Constituency Parse Tree Example (for "The cat saw a dog"):
            S
          /   \
        NP     VP
       /  \   /  \
     Det   N  V    NP
      |    |  |   /  \
     the  cat saw Det  N
                   |   |
                   a  dog
🔗 Dependency Parse Example:
saw → root
o cat → nsubj (subject)
o dog → dobj (object)
o the, a → determiners
🧠 Parsing Algorithms:
Algorithm | Description
Top-Down Parsing | Starts from the root (S) and expands using grammar rules
Bottom-Up Parsing | Starts from the words and builds up to the sentence
CYK Algorithm | For parsing with a CFG in Chomsky Normal Form
Earley Parser | Efficient for general CFGs
✅ Summary:
Concept | Description
CFG | Formal grammar with rules like S → NP VP
Grammar Rules (English) | Rule sets describing English syntax
Syntactic Parsing | Identifying sentence structure via CFG or dependency relations
Parsing Types | Constituency vs. Dependency Parsing
1. Grammar Formalisms
✅ Definition:
Grammar formalisms are structured systems of rules that describe how
words combine to form grammatical sentences in a language.
These formalisms serve as the theoretical foundation for syntactic
parsing.
🔹 Types of Grammar Formalisms:
Grammar Type | Description | Example
Context-Free Grammar (CFG) | Each rule has a single non-terminal on the left-hand side. | S → NP VP
Lexicalized Tree Adjoining Grammar (LTAG) | Allows an extended domain of locality; includes word-level information in grammar trees. | A tree rooted in a verb like "eat" includes its NP object
Head-Driven Phrase Structure Grammar (HPSG) | Highly lexicalized, based on constraints and feature structures. | Uses typed feature structures to capture word agreement
Dependency Grammar | Based on binary relations between words (head-dependent). | "The cat sleeps" → sleeps is the head of cat
Combinatory Categorial Grammar (CCG) | Uses category logic and combinatory rules; highly expressive. | S/NP → a verb phrase needing a noun phrase
Minimalist Grammar (MG) | From Chomsky's minimalist program, based on operations like merge and move. | Focus on economy of derivation
💡 Why Grammar Formalisms Matter in NLP:
Help define the structure of a language
Allow for automatic syntactic parsing
Form the basis for treebanks
🌳 2. Treebanks
✅ Definition:
A treebank is a parsed text corpus that annotates syntactic or semantic
sentence structures using a specific grammar formalism.
Each sentence is associated with a parse tree (usually a constituency or
dependency tree).
🧱 Types of Treebanks:
Type | Description | Example
Constituency Treebank | Based on phrase structure (CFG). | Penn Treebank
Dependency Treebank | Based on head-dependent relations. | Universal Dependencies (UD)
Hybrid Treebank | Combines features of both. | TIGER Treebank (German)
📌 Notable Treebanks:
Treebank | Language | Formalism | Notes
Penn Treebank | English | CFG-based | Widely used; contains POS tags and phrase structure
Universal Dependencies (UD) | Multilingual | Dependency grammar | Cross-linguistic annotations
Prague Dependency Treebank | Czech | Dependency grammar | Rich morphological tagging
TIGER Treebank | German | Hybrid | Uses both dependencies and phrases
NEGRA Treebank | German | Constituency-based | Emphasizes word order and structure
📊 Example (Constituency Parse from Penn Treebank):
(S
(NP (DT The) (NN dog))
(VP (VBD barked))
(. .))
🔗 Dependency Version:
barked is the root
o dog → nsubj (nominal subject)
o The → det (determiner)
🔧 How Treebanks Are Used:
Train and evaluate parsers (e.g., Stanford Parser, spaCy)
Improve POS tagging and NER
Statistical grammar induction
Semantic role labeling
Linguistic research
📈 Summary Table
Term | Definition | Example
Grammar Formalism | Rule system for sentence formation | CFG, HPSG, Dependency Grammar
Treebank | Annotated corpus of syntactic trees | Penn Treebank, Universal Dependencies
Constituency Parsing | Phrase structure analysis | NP → Det N
Dependency Parsing | Head-dependent relations | sleeps ← subj ← cat
1. Efficient Parsing for Context-Free Grammars (CFGs)
✅ Context-Free Grammar (CFG) Recap:
CFGs are sets of recursive rewriting rules used to generate patterns of
strings in a language.
Example:
S → NP VP
NP → Det N
VP → V NP
🔧 Parsing Challenges:
Parsing CFGs involves finding whether a sentence can be generated by a
grammar, and if yes, how (the parse tree). Some sentences can have
multiple parse trees (ambiguity), which complicates parsing.
🚀 Efficient CFG Parsing Algorithms:
Algorithm | Description | Time Complexity
Top-Down Parsing | Starts from the start symbol (S), tries to rewrite using rules | Not efficient; may lead to infinite loops
Bottom-Up Parsing | Starts from the input and works upward | More efficient than top-down
Earley Parser | Works with all CFGs; a top-down and bottom-up hybrid | O(n³) worst case, O(n²) average
CYK Algorithm | Requires the grammar in Chomsky Normal Form | O(n³)
Chart Parsing | Uses dynamic programming to avoid redundant parsing | Efficient with memoization
📌 CYK Algorithm Overview:
Works with Chomsky Normal Form (CNF) grammars
Builds a parse table where each cell [i, j] stores possible non-
terminals that can derive substring from position i to j
Time complexity: O(n³) where n is sentence length
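A compact Python sketch of the CYK table-filling idea for a tiny CNF grammar (the grammar and sentence are illustrative):

from itertools import product

# CNF rules: A -> B C (binary) and A -> 'word' (lexical)
binary = {("NP", "VP"): "S", ("Det", "N"): "NP", ("V", "NP"): "VP"}
lexical = {"the": "Det", "cat": "N", "dog": "N", "saw": "V"}

def cyk(words):
    n = len(words)
    # table[i][j] = set of non-terminals that derive words[i..j]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(words):
        table[i][i].add(lexical[w])
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):  # split point
                for B, C in product(table[i][k], table[k + 1][j]):
                    if (B, C) in binary:
                        table[i][j].add(binary[(B, C)])
    return "S" in table[0][n - 1]

print(cyk("the cat saw the dog".split()))  # True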
🔹 2. Statistical Parsing
✅ What is it?
Statistical parsing uses probability to choose the most likely parse tree
when multiple parses exist.
Instead of just checking grammaticality, it asks:
“What is the most probable structure for this sentence?”
🔍 How it Works:
Trains a parser using a treebank (annotated dataset)
Computes probabilities from frequency counts
Chooses parse trees based on likelihood
🔹 3. Probabilistic Context-Free Grammars (PCFGs)
✅ Definition:
A Probabilistic Context-Free Grammar (PCFG) extends CFG by assigning
probabilities to each production rule.
✨ Format:
Each rule has the form:
A → B C [p]
Where p is the probability of the rule.
Example:
S → NP VP [1.0]
NP → Det N [0.6]
NP → N [0.4]
VP → V NP [0.8]
VP → V [0.2]
🧠 Parsing with PCFGs:
Use algorithms like Viterbi parsing (a dynamic programming method)
to find the most likely parse
Combines syntax with statistics for better accuracy
Deals well with ambiguity in natural language
📌 Use Case Example:
Given:
"The cat saw the dog"
Two parse trees are possible. PCFG assigns probabilities to each and selects
the more likely one based on training data.
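A small sketch of a PCFG and Viterbi parsing with NLTK (the rule probabilities here are illustrative):

import nltk

pcfg = nltk.PCFG.fromstring("""
S -> NP VP [1.0]
NP -> Det N [0.6]
NP -> N [0.4]
VP -> V NP [0.8]
VP -> V [0.2]
Det -> 'the' [1.0]
N -> 'cat' [0.5]
N -> 'dog' [0.5]
V -> 'saw' [1.0]
""")

parser = nltk.ViterbiParser(pcfg)
for tree in parser.parse("the cat saw the dog".split()):
    print(tree)         # the most probable parse
    print(tree.prob())  # its probability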
📊 Advantages of PCFG:
Handles ambiguity effectively
Learns from real data (treebanks)
Helps in building robust NLP systems
✅ Summary Table
Concept | Description | Key Algorithm
CFG | Defines grammatical structure | CYK, Earley
Statistical Parsing | Selects the best parse based on data | PCFG, Viterbi
PCFG | CFG + rule probabilities | Viterbi algorithm
Unit V: Semantic Analysis
1. Requirements for Knowledge Representation in AI
To build intelligent systems, information must be represented in a structured
form that a machine can reason about. This is known as knowledge
representation (KR).
✅ Requirements for Good Representation:
Requirement | Explanation
Representational Adequacy | Should be able to represent all kinds of knowledge required for the task (e.g., facts, rules, relationships)
Inferential Adequacy | Must support deriving new knowledge through reasoning (inference rules)
Inferential Efficiency | Should allow for efficient reasoning (fast and scalable)
Acquisitional Efficiency | Easy to acquire and modify knowledge from real-world sources
Modularity | Representation should allow modular knowledge chunks (easy to add/remove components)
Clarity | The structure should be easy to understand, define, and interpret
Expressiveness | Capable of expressing uncertainty, defaults, or exceptions where needed
🔹 2. First-Order Logic (FOL)
✅ What is FOL?
First-Order Logic (also called Predicate Logic) is a formal logic system
that extends propositional logic by adding:
Quantifiers
Predicates
Variables
✨ Components of FOL:
Component | Example | Meaning
Constant | John, 5 | Represents an object
Predicate | Loves(John, IceCream) | Expresses a relationship
Variable | x, y | Placeholder for objects
Quantifiers | ∀x, ∃x | "for all", "there exists"
Function | FatherOf(x) | Returns an object
🧠 Example FOL Statement:
"Everyone loves ice cream."
scss
CopyEdit
∀x Loves(x, IceCream)
"There exists a person who is a teacher."
scss
CopyEdit
∃x Teacher(x)
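These statements can be written and parsed with NLTK's logic module, as a quick sketch:

import nltk

read_expr = nltk.sem.Expression.fromstring
print(read_expr("all x. Loves(x, IceCream)"))   # universal quantification
print(read_expr("exists x. Teacher(x)"))        # existential quantification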
✅ Why FOL is useful:
Powerful enough to represent real-world facts
Supports inference rules (Modus Ponens, resolution, etc.)
Basis for AI systems like expert systems and knowledge bases
🔹 3. Description Logics (DL)
✅ What are Description Logics?
Description Logics (DLs) are a family of formal knowledge representation
languages. They are subsets of FOL, designed for representing structured
knowledge (like ontologies).
✨ DL Focuses On:
Concepts (Classes)
Roles (Properties/Relationships)
Individuals (Instances)
📌 Key Features of DL:
Feature | Description
Concepts (C) | Represent sets or classes (e.g., Person, Animal)
Roles (R) | Binary relations (e.g., hasChild, owns)
Individuals (a, b) | Objects in the domain (e.g., Alice, Car1)
🧠 Example in DL Notation:
Father ≡ Man ⊓ ∃hasChild.Person
→ A Father is a Man who has at least one child who is a Person.
Student ⊑ Person
→ All students are persons.
✅ Applications of DL:
Semantic Web (used in OWL - Web Ontology Language)
Ontology creation (e.g., biological taxonomies, legal systems)
Medical informatics
Natural language understanding
📌 Summary Table:
Concept | Description | Use Case
Knowledge Representation | Format to store, interpret, and reason about information | AI systems
FOL | Formal logic with predicates and quantifiers | Rule-based AI, theorem proving
Description Logics | Subset of FOL for structured knowledge | Semantic Web, ontologies
1. Syntax-Driven Semantic Analysis
✅ What is it?
Syntax-Driven Semantic Analysis is the process of assigning meaning
(semantics) to a sentence using its syntactic structure (typically derived
from a parse tree).
It is based on the idea that:
"The meaning of a sentence can be built compositionally from the meaning
of its parts (words/phrases) and their syntactic structure."
This concept is central in natural language processing (NLP) and
compiler design.
🧠 Key Principles:
1. Compositional Semantics:
Meaning is derived by combining meanings of subparts.
2. Parse Trees & Grammar Rules:
Semantic rules are attached to grammar productions.
3. Bottom-up or Top-down Traversal:
Parse trees are used to evaluate or generate the meaning.
📌 Example:
Grammar rule:
S → NP VP
Semantic rule (attachment):
S.meaning = combine(NP.meaning, VP.meaning)
Let’s say:
NP.meaning = "John"
VP.meaning = "eats apples"
Then:
S.meaning = "John eats apples"
This is how semantics are driven by syntax.
🔹 2. Semantic Attachments
✅ What are they?
Semantic attachments are semantic rules or semantic actions
associated with grammar rules to generate meaning.
They define how to compute the meaning of a phrase based on its
constituents.
How It Works:
Each grammar production is augmented with a semantic rule.
For example:
Production:
VP → V NP
Semantic Attachment:
VP.meaning = apply(V.meaning, NP.meaning)
If:
V.meaning = λx.eat(x)
NP.meaning = "apples"
Then:
VP.meaning = eat(apples)
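A toy Python sketch of the same attachment, with an ordinary lambda standing in for the lambda-calculus meaning (all names are illustrative):

V_meaning = lambda x: f"eat({x})"     # λx.eat(x)
NP_meaning = "apples"

VP_meaning = V_meaning(NP_meaning)    # apply(V.meaning, NP.meaning)
print(VP_meaning)                     # eat(apples)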
💡 Tools Used:
Lambda Calculus: For meaning representation
Attribute Grammars: Formal systems with attributes and rules
Syntax-directed translation: Common in compilers and NLP
pipelines
🧠 Application in NLP:
Component | Role
Syntax Parser | Builds the syntactic structure
Semantic Analyzer | Uses attachments to build logical forms
Intermediate Representation | e.g., first-order logic, semantic graphs
Downstream Tasks | Question answering, machine translation, etc.
✅ Summary
Concept | Description
Syntax-Driven Semantic Analysis | Constructs meaning using syntactic structure
Semantic Attachments | Rules linked to grammar that specify how to compute meaning
1. Word Senses
✅ What is a Word Sense?
A word sense is a particular meaning of a word.
Most words in natural language are polysemous, meaning they have
multiple senses.
📌 Example:
The word "bank" can mean:
1. A financial institution (e.g., "I deposited money at the bank.")
2. The side of a river (e.g., "We sat on the river bank.")
These are two different senses of the same word.
🧠 Why is Word Sense Important?
Understanding which sense is intended is crucial for:
Machine translation
Information retrieval
Question answering
Text summarization
This is addressed in Word Sense Disambiguation (WSD).
🔹 2. Relations Between Word Senses
WordNet (a lexical database) defines several types of semantic
relationships between senses:
Relation | Description | Example
Synonymy | Same meaning | big ↔ large
Antonymy | Opposite meaning | hot ↔ cold
Hypernymy | More general | animal is a hypernym of dog
Hyponymy | More specific | sparrow is a hyponym of bird
Meronymy | Part-whole | wheel is a meronym of car
Holonymy | Whole-part | car is a holonym of wheel
Troponymy | Specific manner of action | to sprint is a troponym of to run
These relations help in semantic reasoning and NLP applications.
🔹 3. Thematic Roles (Theta Roles)
✅ What are Thematic Roles?
Thematic Roles describe the roles that entities play in an event or
action, typically in relation to a verb.
They are part of semantic role labeling (SRL) in NLP.
📌 Common Thematic Roles:
Role | Description | Example
Agent | The doer of the action | "John" in "John kicked the ball"
Theme/Patient | The receiver of the action | "the ball" in the sentence above
Experiencer | One who feels or perceives | "Mary" in "Mary felt cold"
Instrument | Means used | "a stick" in "He hit it with a stick"
Location | Where the event happens | "in the park"
Goal | Endpoint of an action | "to the store"
Source | Starting point | "from Delhi"
🔹 4. Selectional Restrictions
✅ What are they?
Selectional restrictions are semantic constraints that verbs (or predicates)
impose on their arguments.
They help determine which types of words can logically fill a role.
📌 Examples:
1. eat(x) expects x to be edible.
o ✅ "She ate an apple"
o ❌ "She ate a bicycle"
2. drive(x) expects x to be a vehicle.
o ✅ "He drove a car"
o ❌ "He drove a tree"
These restrictions guide syntactic parsing, semantic analysis, and
disambiguation.
✅ Summary Table
Concept | Description
Word Sense | Specific meaning of a word
Semantic Relations | Connections like synonymy, hypernymy
Thematic Roles | Roles entities play in actions
Selectional Restrictions | Semantic constraints on the arguments of verbs
Word-Sense Disambiguation (WSD)
🔹 What is Word-Sense Disambiguation?
Word-Sense Disambiguation (WSD) is the task of determining which
sense (meaning) of a word is activated by its context in a sentence when
the word has multiple meanings.
🔍 Example:
In the sentence "He sat on the bank of the river",
WSD helps the system understand that "bank" refers to riverbank, not a
financial institution.
🔹 Approaches to WSD
1️⃣ Supervised Learning-Based WSD
Uses labeled training data where words are tagged with the correct
sense.
🧠 How it works:
Extract features from context (e.g., surrounding words, POS tags)
Train a machine learning model like:
o Decision Trees
o Naive Bayes
o SVM
o Neural Networks
✅ Example:
For the word “bass”:
"I caught a bass." → fish
"I played the bass." → music
A classifier learns to associate context with correct senses.
✔️Pros:
High accuracy if quality data is available
❌ Cons:
Requires large amounts of sense-annotated corpora (expensive to
create)
2️⃣ Dictionary/Gloss-Based WSD (Lesk Algorithm)
Uses definitions (glosses) from dictionaries like WordNet.
🧠 How it works (Lesk Algorithm):
For each sense of a word:
o Compare the definition (gloss) to the context words.
o Select the sense with the most overlapping words.
✅ Example:
For "bat":
Gloss for animal: "a nocturnal flying mammal"
Gloss for sports: "an implement used in sports"
If context includes words like “fly”, “nocturnal” → selects animal sense.
✔️Pros:
Doesn’t require labeled data
Simple and interpretable
❌ Cons:
Depends heavily on quality and coverage of dictionary
Ignores deeper syntactic/semantic relationships
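A minimal sketch of gloss-overlap disambiguation using NLTK's simplified Lesk implementation (assuming the WordNet and tokenizer data are downloaded; on such a short context the chosen sense may still be wrong):

import nltk
nltk.download("wordnet", quiet=True)
nltk.download("punkt", quiet=True)
from nltk.wsd import lesk
from nltk.tokenize import word_tokenize

context = word_tokenize("He sat on the bank of the river")
sense = lesk(context, "bank", pos="n")
print(sense, "-", sense.definition())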
3️⃣ Thesaurus/Knowledge-Based WSD
Uses semantic networks or thesauri (like WordNet) to infer word sense
using:
Synonyms
Hypernyms
Semantic similarity
🧠 Example methods:
Path-based similarity: Shortest path in a semantic graph
Semantic relatedness scores between words
Sense clustering
✔️Pros:
Doesn’t need training data
Useful in low-resource languages
❌ Cons:
Less accurate than supervised models
🎯 Applications of WSD
Application | Role of WSD
Machine Translation | Ensures correct translation of ambiguous words
Information Retrieval | Improves search relevance by understanding query intent
Text Summarization | Identifies the correct sense to avoid ambiguity
Question Answering Systems | Helps in understanding queries and extracting precise answers
Speech Recognition & Generation | Corrects homophones (e.g., "pair" vs. "pear")
Chatbots and Virtual Assistants | Enhances context understanding in conversations
✅ Summary Table
Approach | Technique | Requires Training? | Strength
Supervised | ML classifiers | ✅ Yes | High accuracy
Dictionary-Based | Lesk Algorithm | ❌ No | Easy to implement
Thesaurus-Based | WordNet paths, synonyms | ❌ No | Knowledge-rich