What is NLP?
Natural Language Processing (NLP) is a field of artificial intelligence (AI) that focuses on enabling
computers to understand, interpret, and generate human language — both written and spoken.
It bridges the gap between human communication and computer understanding, allowing
machines to work with natural language data.
Key Issues in NLP
1. Ambiguity
o Lexical Ambiguity: Words have multiple meanings (e.g., “bank” = riverbank or
financial institution).
o Syntactic Ambiguity: Sentence structure can be interpreted in more than one way.
Example:
Sentence: "I saw the man with the telescope."
Possible interpretations:
1. I used the telescope to see the man (the phrase with the telescope modifies saw).
2. The man I saw had a telescope (the phrase with the telescope modifies the man).
o Semantic Ambiguity: Meaning is unclear without context.
Example:
Sentence: "I went to the bank."
Possible meanings:
1. Bank as a financial institution: a place to deposit or withdraw money.
2. Bank as the side of a river: the edge of a river or lake.
2. Data Sparsity
Many languages and specialized domains lack enough high-quality annotated data to train models effectively.
3. Out-of-Vocabulary (OOV) Words
New words, typos, slang, or rare terms that models haven't seen before cause comprehension issues.
4. Handling Idioms and Figurative Language
Phrases like "kick the bucket" (meaning “to die”) cannot be understood by reading the words literally, so systems that interpret them word-by-word get the meaning wrong.
What is Morphological Processing?
Morphological processing is the step in NLP that deals with the structure of words — how words are
formed from smaller meaningful units called morphemes (like roots, prefixes, suffixes).
It includes tasks like:
Stemming: Reducing words to their base or root form (not necessarily a real word).
Lemmatization: Reducing words to their dictionary (lemma) form using vocabulary and
context.
Morphological Analysis: Breaking down words into morphemes (root + affixes).
Example: Morphological Processing for the word “running”
Process        | Output    | Explanation
Input Word     | running   | The word to analyze
Stemming       | run       | Remove the suffix “-ing” to get the stem “run”
Lemmatization  | run       | The lemma “run” is the dictionary form
Morpheme Split | run + ing | Root = "run", suffix = "-ing" (present participle)
Python Example with NLTK
import nltk
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
# Download required resources (run once)
nltk.download('wordnet')
nltk.download('omw-1.4')
word = "running"
# Stemming
ps = PorterStemmer()
stem = ps.stem(word)
# Lemmatization
lemmatizer = WordNetLemmatizer()
lemma = lemmatizer.lemmatize(word, pos='v') # pos='v' for verb
print("Original word:", word)
print("Stemmed word:", stem)
print("Lemmatized word:", lemma)
Output:
Original word: running
Stemmed word: run
Lemmatized word: run
What is Syntax Analysis?
Syntax Analysis (also called Parsing) is the process in NLP that examines the grammatical structure of
a sentence. It identifies how words relate to each other to form phrases, clauses, and overall
sentence meaning according to the rules of a language.
The goal is to build a parse tree or syntax tree that shows the syntactic structure.
Why is Syntax Analysis important?
Helps understand the grammatical relationships between words.
Crucial for tasks like machine translation, question answering, and information extraction.
Differentiates sentences with similar words but different meanings based on structure.
Types of Syntax Analysis
1. Constituency Parsing
Breaks sentence into nested constituents or phrases (noun phrase, verb phrase, etc.)
2. Dependency Parsing
Represents grammatical relations as links between words (e.g., subject, object).
Example Sentence
"The cat sat on the mat."
Constituency Parse Tree (simplified):
(S
(NP The cat)
(VP sat
(PP on
(NP the mat))))
S = Sentence
NP = Noun Phrase
VP = Verb Phrase
PP = Prepositional Phrase
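The constituency tree above can be reproduced programmatically. This is a minimal sketch using NLTK's chart parser with a hand-written toy grammar that covers only this one sentence; real constituency parsers learn far broader grammars:

```python
import nltk

# Toy grammar covering only "The cat sat on the mat"
grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> Det N
    VP -> V PP
    PP -> P NP
    Det -> 'The' | 'the'
    N -> 'cat' | 'mat'
    V -> 'sat'
    P -> 'on'
""")

parser = nltk.ChartParser(grammar)
tokens = "The cat sat on the mat".split()
for tree in parser.parse(tokens):
    tree.pretty_print()  # draws the same S / NP / VP / PP structure shown above
```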
Example of Dependency Parsing:
sat is the root verb
cat is the subject of sat
on is a preposition linked to sat
mat is the object of the preposition on
Python Example using spaCy (Dependency Parsing)
import spacy
# Load English model
nlp = spacy.load("en_core_web_sm")
sentence = "The cat sat on the mat."
doc = nlp(sentence)
# Print dependencies
for token in doc:
    print(f"{token.text:10} --> {token.dep_:10} --> {token.head.text}")
Output:
The --> det --> cat
cat --> nsubj --> sat
sat --> ROOT --> sat
on --> prep --> sat
the --> det --> mat
mat --> pobj --> on
. --> punct --> sat
What is Semantic Analysis?
Semantic Analysis is the process of understanding the meaning of text. It goes beyond the structure
(syntax) to capture what the text actually means — the concepts, relationships, and the intended
message.
Why is Semantic Analysis Important?
To understand the context and meaning of sentences.
Enables applications like question answering, chatbots, machine translation, and
information retrieval.
Helps resolve ambiguities, e.g., word sense disambiguation.
Key Tasks in Semantic Analysis
1. Word Sense Disambiguation
Determine which sense of a word is used in context (e.g., “bank” as riverbank vs. financial
bank).
2. Named Entity Recognition (NER)
Identify entities like people, places, organizations.
3. Semantic Role Labeling
Identify predicate-argument structures — who did what to whom.
4. Coreference Resolution
Find which words refer to the same entity (e.g., “John ... he”).
5. Sentiment Analysis
Determine the sentiment or emotion expressed.
Example: Sentence Meaning
Sentence: “John gave Mary a book.”
Semantic analysis identifies:
o John = giver (agent)
o Mary = receiver (recipient)
o book = object (theme)
The relation: John → gave → Mary (with object book)
Simple Semantic Analysis in Python (Using SpaCy for NER and Dependency)
import spacy
nlp = spacy.load("en_core_web_sm")
text = "Apple is looking at buying a startup in New York."
doc = nlp(text)
print("Named Entities:")
for ent in doc.ents:
    print(ent.text, ent.label_)
print("\nSemantic Roles (Subject - Verb - Object):")
for token in doc:
    if token.dep_ == "ROOT":
        subject = [child for child in token.children if child.dep_ == "nsubj"]
        # The direct object may attach to a lower verb (here "buying"),
        # so search the root's whole subtree rather than only its children.
        dobj = [t for t in token.subtree if t.dep_ == "dobj"]
        if subject and dobj:
            print(f"{subject[0].text} - {token.text} - {dobj[0].text}")
Output:
Named Entities:
Apple ORG
New York GPE
Semantic Roles (Subject - Verb - Object):
Apple - looking - startup
What is Discourse Integration?
Discourse Integration is the process of understanding how individual sentences or utterances
connect to form a coherent whole in a text or conversation. It goes beyond analyzing single
sentences to interpreting the relationships between sentences, paragraphs, or turns in dialogue.
It helps NLP systems understand context across multiple sentences, maintain topic continuity, and
grasp implied meaning.
Why is Discourse Integration Important?
Maintains coherence in text understanding.
Resolves pronouns and references across sentences (anaphora resolution).
Detects relations like cause-effect, contrast, elaboration between sentences.
Essential for text summarization, dialogue systems, story understanding, and machine
translation.
Key Tasks in Discourse Integration
1. Anaphora Resolution
Identifying what pronouns (he, she, it, they) refer to across sentences.
2. Coherence Relations
Understanding logical relations between sentences, e.g., contrast, cause, elaboration.
3. Discourse Parsing
Structuring text into discourse units linked by relations.
Example
Text:
“John went to the park. He saw a dog.”
Discourse integration links “He” in the second sentence to “John” in the first, and recognizes that the two sentences form one coherent account of John’s activities.
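A crude baseline for this link is to resolve a pronoun to the nearest preceding capitalized word. The toy heuristic below handles the example above; it is not how real coreference systems work (they use learned models), but it shows the shape of the task:

```python
# Toy anaphora baseline: resolve "he"/"she"/"it"/"they" to the nearest
# preceding capitalized word (a crude stand-in for a proper-noun check).
PRONOUNS = {"he", "she", "it", "they"}

def resolve_pronouns(text: str) -> str:
    resolved = []
    last_name = None
    for word in text.split():
        stripped = word.strip(".,")
        if stripped[:1].isupper() and stripped.lower() not in PRONOUNS:
            last_name = stripped          # remember the latest candidate antecedent
        if stripped.lower() in PRONOUNS and last_name:
            word = word.replace(stripped, last_name)
        resolved.append(word)
    return " ".join(resolved)

print(resolve_pronouns("John went to the park. He saw a dog."))
# -> John went to the park. John saw a dog.
```

Sentence-initial capitals ("The", "He") would confuse this heuristic in longer texts, which is exactly why real systems rely on syntactic and learned features instead.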
Simple Python Example: Anaphora Resolution with neuralcoref (extension to spaCy)
Note: neuralcoref works with spaCy 2.x only; it is not compatible with spaCy 3.x.
pip install spacy neuralcoref
import spacy
import neuralcoref
nlp = spacy.load('en_core_web_sm')
# Add neuralcoref to spaCy's pipeline
neuralcoref.add_to_pipe(nlp)
text = "John went to the park. He saw a dog."
doc = nlp(text)
print("Original Text:")
print(text)
print("\nAfter Coreference Resolution:")
print(doc._.coref_resolved)
Output:
Original Text:
John went to the park. He saw a dog.
After Coreference Resolution:
John went to the park. John saw a dog.
What is Pragmatic Analysis?
Pragmatic Analysis in NLP is the process of understanding the intended meaning of language in
context — not just what the words say literally, but what the speaker/writer actually means based
on the situation, shared knowledge, and social cues.
It deals with things like:
Implicature (implied meaning beyond literal words)
Speech acts (e.g., requests, promises, questions)
Contextual factors (who is speaking, to whom, when, where)
Deixis (words like “this,” “that,” “here,” “now” whose meaning depends on context)
Why is Pragmatic Analysis Important?
Understands indirect meaning (e.g., sarcasm, irony, politeness)
Helps dialogue systems respond appropriately
Crucial for natural conversations, sentiment understanding, humor detection
Example
Sentence:
“Can you pass the salt?”
Literal meaning: Asking if someone is capable of passing the salt.
Pragmatic meaning: Polite request for the salt.
Simple Example in NLP
Pragmatic understanding often requires context or external knowledge beyond the sentence alone.
Here’s a simple example using contextual dialogue:
dialogue = [
    "Person A: It's cold in here.",
    "Person B: I'll close the window."
]
# Literal meaning: Person A states a fact.
# Pragmatic meaning: Person A is indirectly requesting that the window be closed.

print("Dialogue:")
for line in dialogue:
    print(line)

print("\nPragmatic interpretation:")
print("Person A's statement implies a request to close the window.")
Challenges of Pragmatic Analysis
Requires world knowledge and context tracking.
Hard to automate fully — often involves common sense reasoning.
Complex in multi-turn conversations.
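The “Can you pass the salt?” example suggests one very small automatable piece: utterances phrased as ability questions are usually requests, not literal questions about ability. A toy rule-based speech-act classifier, a heuristic sketch and nothing more:

```python
# Toy speech-act heuristic: ability questions ("Can you ...?") are usually
# indirect requests. Real pragmatic analysis needs far more context than this.
def classify_speech_act(utterance: str) -> str:
    text = utterance.strip().lower()
    if text.endswith("?") and text.startswith(("can you", "could you", "would you")):
        return "indirect request"
    if text.endswith("?"):
        return "question"
    return "statement"

print(classify_speech_act("Can you pass the salt?"))   # indirect request
print(classify_speech_act("Where is the station?"))    # question
print(classify_speech_act("It's cold in here."))       # statement
```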
End-to-End Example: Combining spaCy and a Transformers QA Model
import spacy
from transformers import pipeline
# Load spaCy model
nlp = spacy.load("en_core_web_sm")
# Sample solar system text (knowledge base)
solar_system_text = """
Jupiter is the largest planet in the solar system.
It has a strong magnetic field and at least 79 moons.
Saturn is known for its ring system.
Mars is called the Red Planet due to iron oxide on its surface.
Venus has a thick atmosphere rich in carbon dioxide.
Earth has only one moon.
Earth's moon is named Chandrama.
"""
# Process with spaCy
doc = nlp(solar_system_text)
# -------------------------------
# 1. Tokenization and POS tagging
# -------------------------------
print("=== Token Info ===")
for token in doc[:10]:  # show first 10 tokens
    print(f"{token.text:<12} POS: {token.pos_:<6} | Lemma: {token.lemma_}")
# -------------------------------
# 2. Named Entity Recognition
# -------------------------------
print("\n=== Named Entities ===")
for ent in doc.ents:
    print(f"{ent.text:<20} --> {ent.label_}")
# -------------------------------
# 3. Dependency Parsing
# -------------------------------
print("\n=== Dependency Parsing ===")
for sent in doc.sents:
    print(f"\nSentence: {sent.text.strip()}")
    for token in sent:
        print(f"{token.text:<12} Head: {token.head.text:<12} Dep: {token.dep_}")
# -------------------------------
# 4. Extract simple facts
# -------------------------------
print("\n=== Extracted Facts ===")
for sent in doc.sents:
    if "moon" in sent.text.lower() or "ring" in sent.text.lower():
        print("Fact:", sent.text.strip())
# -------------------------------
# 5. Question Answering (QA)
# -------------------------------
print("\n=== Question Answering ===")
# Load QA model
qa_pipeline = pipeline("question-answering", model="distilbert-base-uncased-distilled-squad")
# Sample questions
questions = [
    "How many moons does Jupiter have?",
    "Which planet has a ring system?",
    "Why is Mars called the Red Planet?",
    "What is Venus's atmosphere made of?",
    "What is the largest planet?",
    # Informally phrased queries:
    "number of moon jupiter have?",
    "number of moon earth have?",
    "what is the name of earth moon?"
]

# Ask each question against the solar system text
for question in questions:
    result = qa_pipeline(question=question, context=solar_system_text)
    print(f"Q: {question}")
    print(f"A: {result['answer']}\n")