Core Components of Natural Language Processing (NLP)
Introduction
Natural Language Processing (NLP) enables computers to understand, interpret, and generate human
language. This presentation provides an in-depth exploration of the fundamental building blocks that
form the foundation of modern NLP systems.
1. Tokenization
Definition: The process of breaking text into smaller units called tokens, which serve as the basic
elements for further language processing.
Types:
Word Tokenization: Splitting text into individual words
Example: "I love NLP" → ["I", "love", "NLP"]
Challenges: Handling contractions, hyphenated words, abbreviations
Sentence Tokenization: Dividing text into sentences
Example: "Hello! How are you? I'm fine." → ["Hello!", "How are you?", "I'm fine."]
Challenges: Abbreviations (e.g., "Dr."), quotations, non-standard punctuation
Subword Tokenization: Breaking words into meaningful subunits
Byte-Pair Encoding (BPE): Iteratively merges the most frequent character pairs (see the sketch after this list)
WordPiece: Similar to BPE but uses likelihood rather than frequency
SentencePiece: Language-agnostic tokenization that treats spaces as symbols
Example: "unhappiness" → ["un", "happiness"] or ["un", "happy", "ness"]
Advanced Tokenization Considerations:
Language-specific challenges:
Chinese/Japanese/Korean: No explicit word boundaries
Arabic/Hebrew: Complex morphology and right-to-left script
Agglutinative languages (Finnish, Turkish): Long compound words
Tokenization in modern transformers:
Special tokens: [CLS], [SEP], [MASK], [PAD]
Handling out-of-vocabulary words
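As a concrete illustration (a sketch assuming the Hugging Face transformers package and the bert-base-uncased checkpoint, which is downloaded on first use), the snippet below shows subword splits and the automatically added special tokens; exact splits depend on the model's learned vocabulary:
python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Rare or long words fall back to subword pieces ("##" marks a word-internal piece)
print(tokenizer.tokenize("tokenization is fascinating"))

# Encoding a sentence adds the special [CLS] and [SEP] tokens automatically
encoded = tokenizer("NLP is fascinating!")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))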
Code Example (Python with NLTK):
python
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download('punkt')  # tokenizer models

# Sentence tokenization
text = "NLP is fascinating! It has many applications."
sentences = sent_tokenize(text)
print(sentences)  # ['NLP is fascinating!', 'It has many applications.']

# Word tokenization
for sentence in sentences:
    words = word_tokenize(sentence)
    print(words)  # ['NLP', 'is', 'fascinating', '!'] then ['It', 'has', 'many', 'applications', '.']
Applications:
Foundation for all other NLP tasks
Text preprocessing and normalization
Feature extraction for machine learning models
Input preparation for neural networks
Information retrieval systems
2. Part-of-Speech (POS) Tagging
Definition: Identifying the grammatical category of each word or token in text, enabling deeper
syntactic and semantic analysis.
Common POS Tag Sets:
Penn Treebank POS Tags (most common in English NLP)
Universal Dependencies POS Tags (cross-linguistic compatibility)
Detailed POS Categories:
Nouns:
Common nouns (NN): generic objects, concepts (book, happiness)
Proper nouns (NNP): specific names (John, London)
Singular/plural distinctions (NN/NNS)
Possessive forms (NN's/NNP's)
Verbs:
Base form (VB): "go", "see"
Present tense (VBP/VBZ): "go/goes"
Past tense (VBD): "went"
Gerund/present participle (VBG): "going"
Past participle (VBN): "gone"
Modal auxiliaries (MD): "can", "should"
Adjectives:
Base form (JJ): "happy"
Comparative (JJR): "happier"
Superlative (JJS): "happiest"
Additional Categories:
Adverbs (RB/RBR/RBS): "quickly", "more", "most"
Determiners (DT): "the", "a", "this"
Prepositions (IN): "in", "on", "by"
Conjunctions (CC): "and", "but", "or"
Pronouns (PRP): "he", "she", "they"
Cardinal numbers (CD): "one", "2", "three"
Foreign words (FW), Interjections (UH), etc.
POS Tagging Approaches:
1. Rule-based: Use hand-crafted rules and dictionaries
Advantage: Interpretable, no training data needed
Disadvantage: Limited coverage, difficult to maintain
2. Stochastic/Statistical:
Hidden Markov Models (HMMs) (see the Viterbi sketch after this list)
Maximum Entropy Markov Models
Conditional Random Fields (CRFs)
3. Deep Learning Approaches:
Recurrent Neural Networks (RNNs/LSTMs)
Bidirectional LSTMs with CRF layer
Transformer-based models (BERT, etc.)
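To make the HMM approach concrete, here is a minimal Viterbi decoder over a hand-built two-tag model; the probabilities are invented for illustration, whereas a real tagger estimates them from a tagged corpus:
python
import math

def viterbi(words, states, start_p, trans_p, emit_p):
    """Most likely tag sequence under an HMM (log-space Viterbi)."""
    V = [{s: (math.log(start_p[s]) + math.log(emit_p[s].get(words[0], 1e-12)), [s])
          for s in states}]
    for word in words[1:]:
        layer = {}
        for s in states:
            best_prev, best_score = max(
                ((p, V[-1][p][0] + math.log(trans_p[p][s])) for p in states),
                key=lambda x: x[1])
            layer[s] = (best_score + math.log(emit_p[s].get(word, 1e-12)),
                        V[-1][best_prev][1] + [s])
        V.append(layer)
    return max(V[-1].values(), key=lambda x: x[0])[1]

# Toy model: two tags with hand-set probabilities (illustrative numbers only)
states = ["N", "V"]
start_p = {"N": 0.7, "V": 0.3}
trans_p = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
emit_p = {"N": {"dogs": 0.6, "bark": 0.1}, "V": {"dogs": 0.05, "bark": 0.7}}
print(viterbi(["dogs", "bark"], states, start_p, trans_p, emit_p))  # ['N', 'V']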
Extended Example:
Text: "The quick brown fox jumps over the lazy dog."
python
import nltk
from nltk import pos_tag
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

text = "The quick brown fox jumps over the lazy dog."
tokens = word_tokenize(text)
tagged = pos_tag(tokens)
print(tagged)
# Output: [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'),
# ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
# ('dog', 'NN'), ('.', '.')]
Applications:
Syntactic parsing and grammatical analysis
Named entity recognition preprocessing
Word sense disambiguation
Machine translation
Text-to-speech pronunciation
Grammar checking and correction
Information extraction systems
Sentiment analysis enhancement
3. Named Entity Recognition (NER)
Definition: Identifying and classifying real-world entities mentioned in text into predefined categories,
enabling structured information extraction from unstructured text.
Standard Entity Types:
People (PER): Individual persons, groups of people
Organizations (ORG): Companies, institutions, government agencies
Locations (LOC): Geographic locations, political entities
Geo-Political Entities (GPE): Countries, cities, states with political and geographic properties
Temporal Expressions (TIME/DATE): Dates, times, durations, periods
Numeric Expressions:
Monetary values (MONEY): "$50 million", "€20"
Percentages (PERCENT): "10%", "one-third"
Cardinal numbers (CARDINAL): "ten", "15"
Ordinal numbers (ORDINAL): "first", "5th"
Domain-Specific Entity Types:
Biomedical: Genes, proteins, diseases, drugs
Legal: Legal citations, court cases, statutes
Scientific: Chemical compounds, astronomical objects
Financial: Stock symbols, financial instruments
Products: Brand names, product models
Creative Works: Books, movies, songs
NER Approaches:
1. Rule-based Methods:
Gazetteer/dictionary matching
Regular expressions
Hand-crafted linguistic rules
2. Statistical Methods:
Hidden Markov Models (HMMs)
Conditional Random Fields (CRFs)
Support Vector Machines (SVMs)
3. Deep Learning Methods:
Bidirectional LSTMs with CRF
Transformer-based models (BERT, RoBERTa)
Fine-tuned language models
NER Evaluation Metrics:
Precision: Proportion of identified entities that are correct
Recall: Proportion of actual entities that were identified
F1 Score: Harmonic mean of precision and recall
Exact match vs. partial match scoring
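A minimal sketch of exact-match entity scoring over (start, end, label) spans; the gold and predicted spans below are illustrative:
python
def entity_prf(gold, predicted):
    # gold/predicted: sets of (start, end, label) spans; exact-match scoring
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# One of two predictions is exactly right; one of three gold entities was found
gold = {(0, 10, "ORG"), (54, 60, "GPE"), (93, 104, "MONEY")}
pred = {(0, 10, "ORG"), (54, 60, "LOC")}
print(entity_prf(gold, pred))  # (0.5, 0.3333..., 0.4)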
Extended Example:
Text: "Apple Inc. is planning to open a new flagship store in central London next
March, with an investment of $40 million."
Detailed NER:
"Apple Inc." - ORGANIZATION
"central London" - LOCATION
"next March" - DATE
"$40 million" - MONEY
Visualization:
[Apple Inc.](ORG) is planning to open a new flagship store in [central London](LOC)
[next March](DATE), with an investment of [$40 million](MONEY).
Code Example (Python with spaCy):
python
import spacy

# Load the small English pipeline
nlp = spacy.load("en_core_web_sm")

# Process text
text = "Apple Inc. is planning to open a new flagship store in central London next March, with an investment of $40 million."
doc = nlp(text)

# Display entities with their character offsets
for ent in doc.ents:
    print(f"Entity: {ent.text}, Type: {ent.label_}, Start: {ent.start_char}, End: {ent.end_char}")
# Output:
# Entity: Apple Inc., Type: ORG, Start: 0, End: 10
# Entity: London, Type: GPE, Start: 54, End: 60
# Entity: next March, Type: DATE, Start: 61, End: 71
# Entity: $40 million, Type: MONEY, Start: 93, End: 104
4. Stemming and Lemmatization
Definition: Reducing inflected words to a common base form — stemming by heuristically stripping affixes, lemmatization by mapping each word to its dictionary form (lemma) using vocabulary and morphological analysis.
Code Example (Python with NLTK and spaCy):
python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

# Initialize
porter = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Example words
words = ["running", "runs", "ran", "better", "best", "studies", "studying"]
print("NLTK Stemming vs. Lemmatization:")
for word in words:
    # Lemmatize as verbs for illustration; real pipelines map POS tags first
    print(f"{word}\tstem: {porter.stem(word)}\tlemma: {lemmatizer.lemmatize(word, pos='v')}")

# spaCy Lemmatization
import spacy
nlp = spacy.load("en_core_web_sm")
text = "The children were running in the park. She studied better than him."
doc = nlp(text)
print("\nspaCy Lemmatization:")
print("Token\t\tLemma\t\tPOS")
print("-" * 40)
for token in doc:
    print(f"{token.text}\t\t{token.lemma_}\t\t{token.pos_}")
Comparative Analysis:
Aspect     | Stemming            | Lemmatization
-----------|---------------------|------------------------
Speed      | Faster              | Slower
Complexity | Lower               | Higher
Resources  | Rule-based only     | Requires dictionaries
Accuracy   | Lower               | Higher
Output     | Often non-words     | Valid dictionary words
Context    | Context-independent | Context-aware
Applications:
Text normalization for search engines
Feature space reduction in text classification
Query expansion in information retrieval
Document clustering and topic modeling
Spelling correction systems
Keyword extraction
Text similarity measurement
Machine translation preprocessing
5. Syntax and Parsing
Definition: Analyzing grammatical structure to determine relationships between words and phrases,
revealing the hierarchical organization of language and enabling deeper semantic understanding.
Formal Grammar Theory
Context-Free Grammars (CFGs):
Formalized set of production rules that describe all possible strings in a language
Consists of:
Non-terminal symbols (syntactic categories like NP, VP)
Terminal symbols (actual words)
Production rules (NP → Det N)
Start symbol (S)
Example CFG Rules:
S → NP VP
NP → Det N | Det N PP | Pronoun
VP → V | V NP | V NP PP
PP → P NP
Det → "the" | "a" | "my"
N → "cat" | "mouse" | "house"
V → "chased" | "saw" | "ate"
P → "in" | "on" | "with"
Grammar Formalisms:
Context-free grammars (CFGs)
Tree-adjoining grammars (TAGs)
Lexicalized tree-adjoining grammars (LTAGs)
Head-driven phrase structure grammar (HPSG)
Combinatory categorial grammar (CCG)
Constituency Parsing
Core Concept:
Breaks sentences into nested constituents based on phrase structure grammar
Represents hierarchical structure as a parse tree
Focuses on phrases and their composition
Constituent Types:
Noun Phrase (NP): "the red car"
Verb Phrase (VP): "is driving fast"
Prepositional Phrase (PP): "on the road"
Adjective Phrase (ADJP): "very happy"
Adverb Phrase (ADVP): "quite slowly"
Example Parse Tree:
(S
  (NP (Det The) (N cat))
  (VP (V ate)
      (NP (Det the) (N fish))))
Parsing Algorithms:
CKY Algorithm: Dynamic programming approach for CFGs
Earley Parser: Top-down parser with bottom-up filtering
Shift-Reduce Parser: Uses stack and buffer with shift/reduce operations
Chart Parsing: Builds partial analyses in a well-formed substring table
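As a sketch of the dynamic-programming idea behind CKY, here is a recognizer for a grammar in Chomsky normal form (a CNF fragment of the grammar above; a full parser would also store backpointers to recover trees):
python
def cky_recognize(words, grammar):
    """CKY recognition for a CNF grammar.
    grammar: list of (lhs, rhs) rules; rhs is a 1-tuple (terminal)
    or a 2-tuple (two nonterminals)."""
    n = len(words)
    table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    # Fill the diagonal with preterminals
    for i, w in enumerate(words):
        table[i][i + 1] = {lhs for lhs, rhs in grammar if rhs == (w,)}
    # Combine adjacent spans bottom-up
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for lhs, rhs in grammar:
                    if len(rhs) == 2 and rhs[0] in table[i][k] and rhs[1] in table[k][j]:
                        table[i][j].add(lhs)
    return "S" in table[0][n]

grammar = [
    ("S", ("NP", "VP")), ("NP", ("Det", "N")), ("VP", ("V", "NP")),
    ("Det", ("the",)), ("N", ("cat",)), ("N", ("fish",)), ("V", ("ate",)),
]
print(cky_recognize("the cat ate the fish".split(), grammar))  # True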
Dependency Parsing
Core Concept:
Identifies direct grammatical relationships between words
Represents sentence as a directed graph with labeled edges
No intermediate phrasal nodes, only connections between words
Common Dependency Relations:
nsubj: Nominal subject
dobj: Direct object
iobj: Indirect object
det: Determiner
amod: Adjectival modifier
advmod: Adverbial modifier
aux: Auxiliary verb
prep: Preposition
pobj: Object of preposition
conj: Conjunct
Detailed Example:
Sentence: "The black cat chased the small mouse in the kitchen."
chased
  nsubj → cat
    det → The
    amod → black
  dobj → mouse
    det → the
    amod → small
  prep → in
    pobj → kitchen
      det → the
Code Example (Python with spaCy):
python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")

# Example sentence
text = "The black cat chased the small mouse in the kitchen."
doc = nlp(text)

# Print each token's dependency relation and head
for token in doc:
    print(f"{token.text:<10} {token.dep_:<10} head: {token.head.text}")

# displacy.render(doc, style="dep")  # renders the tree in a notebook/browser
6. Word Sense Disambiguation (WSD)
Definition: Determining which sense of an ambiguous word is intended in a given context.
Example: Polysemy
Sentence: "The company runs a successful business."
Context analysis:
- Business entity: "company", "business"
- Subject-verb relationship suggests operation, not movement
→ Sense: operates/manages (not physical running)
Code Example (Python with NLTK):
python
# Simplified sketch using NLTK's Lesk algorithm for sense disambiguation
import nltk
from nltk.wsd import lesk
from nltk.tokenize import word_tokenize

nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

sentence = "The company runs a successful business."
tokens = word_tokenize(sentence)
sense = lesk(tokens, 'runs', pos='v')  # restrict candidates to verb senses
if sense:
    print(f"Predicted sense: {sense.name()} - {sense.definition()}")
Applications:
Machine translation (selecting correct translation)
Information retrieval (improving search precision)
Text summarization (preserving meaning)
Question answering (understanding queries)
Sentiment analysis (contextual polarity)
Speech recognition (disambiguating homophones)
Ontology mapping (knowledge integration)
Semantic parsing (meaning representation)
7. Coreference Resolution
Definition: Identifying when different expressions in text refer to the same entity or event, establishing
connections between these mentions to maintain coherence and enable deeper understanding of text.
Types of Coreference
1. Pronominal Coreference:
Pronouns referring back to a previously mentioned entity
Personal pronouns: he, she, it, they, etc.
Possessive pronouns: his, her, its, their, etc.
Reflexive pronouns: himself, herself, itself, themselves, etc.
Example: "[John]₁ said that [he]₁ would finish the project."
2. Nominal Coreference:
Noun phrases referring to the same entity
Exact repetitions: "a dog"..."the dog"
Synonyms: "the automobile"..."the car"
Hypernyms/hyponyms: "the animal"..."the dog"
Example: "[The President]₁ spoke yesterday. [Joe Biden]₁ addressed climate concerns."
3. Zero Anaphora (Ellipsis):
Omitted expressions that are understood from context
Commonly occurs in coordinate structures and certain languages
Example: "[Mary]₁ wanted to go to the beach and [∅]₁ packed her bag." (∅ = Mary)
4. Split Antecedent:
When a plural pronoun refers to multiple separate antecedents
Example: "[John]₁ met [Mary]₂ at the conference. [They]₁₊₂ had dinner together."
Challenges in Coreference Resolution
1. Syntactic Ambiguity:
Multiple potential antecedents with matching grammatical features
Example: "[The trophy]₁ didn't fit into [the suitcase]₂ because [it]₁/₂ was too big."
2. Semantics and World Knowledge:
Requires understanding beyond syntax
Example: "[The city council]₁ denied [the demonstrators]₂ a permit because [they]₁/₂ feared
violence."
3. Bridging References:
Implicit relations between entities
Example: "I went to [a new restaurant]₁. [The chef]₂ was excellent." (where chef is part of
restaurant)
4. Event Coreference:
Identifying when different expressions refer to the same event
Example: "The building [exploded]₁ yesterday. The [blast]₁ injured ten people."
Coreference Resolution Approaches
1. Rule-based Systems:
Linguistic constraints and preferences
Hobbs algorithm
Centering theory
Binding constraints (Chomsky's Government and Binding Theory)
2. Mention-Pair Models:
Classify pairs of mentions as coreferent or not
Features: distance, syntactic position, gender/number agreement, semantic compatibility
Challenge: Local decisions may be inconsistent globally
3. Entity-Mention Models:
Consider all previously resolved entities for each new mention
Build coreference chains incrementally
More coherent entity representations
4. Neural Coreference Models:
End-to-end neural systems
Span representations with attention
Higher-order inference
Joint learning with other NLP tasks
5. Transformer-based Approaches:
Fine-tuned language models (BERT, RoBERTa, etc.)
Contextual token representations
Self-attention captures long-distance relationships
Detailed Resolution Process
1. Mention Detection:
Identify potential referring expressions
Filter non-referring expressions
Named entities, noun phrases, pronouns
2. Mention-level Feature Extraction:
Grammatical properties: gender, number, person
Semantic class: human, organization, location, etc.
Syntactic role: subject, object, etc.
Definiteness, proper noun status
3. Pairwise Feature Computation:
Distance features: sentences, mentions between
String matching features: exact, partial, head match
Syntactic features: syntactic paths, c-command
Semantic compatibility features
4. Clustering or Classification:
Agglomerative clustering of mentions
Ranking of candidate antecedents
Global optimization for consistency
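A toy mention-pair scorer illustrating steps 2-4: agreement features plus subject-prominence and recency preferences rank candidate antecedents. All attributes are hand-assigned for illustration; real systems learn these weights from annotated data:
python
mentions = [
    {"text": "Mark", "gender": "m", "number": "sg", "role": "subj", "pos": 0, "pronoun": False},
    {"text": "John", "gender": "m", "number": "sg", "role": "obj",  "pos": 2, "pronoun": False},
    {"text": "he",   "gender": "m", "number": "sg", "role": "subj", "pos": 4, "pronoun": True},
]

def score_pair(antecedent, anaphor):
    score = 0.0
    score += 1.0 if antecedent["gender"] == anaphor["gender"] else -5.0  # near-hard constraint
    score += 1.0 if antecedent["number"] == anaphor["number"] else -5.0
    score += 0.5 if antecedent["role"] == "subj" else 0.0  # subject prominence
    score -= 0.1 * (anaphor["pos"] - antecedent["pos"])    # mild recency preference
    return score

for m in mentions:
    if m["pronoun"]:
        candidates = [c for c in mentions if c["pos"] < m["pos"] and not c["pronoun"]]
        best = max(candidates, key=lambda c: score_pair(c, m))
        print(f"{m['text']} -> {best['text']}")  # he -> Mark (subject prominence wins)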
Comprehensive Example:
Text: "Mark told John that he had won the competition. His friends were very proud
of him.
The young programmer had worked very hard for this achievement."
Referring expressions:
- [Mark]₁
- [John]₂
- [he]₃ → refers to [Mark]₁
- [the competition]₄
- [His]₅ → refers to [Mark]₁
- [friends]₆
- [him]₇ → refers to [Mark]₁
- [The young programmer]₈ → refers to [Mark]₁
- [this achievement]₉ → refers to [winning the competition]
Coreference chains:
Chain 1: [Mark]₁ - [he]₃ - [His]₅ - [him]₇ - [The young programmer]₈
Chain 2: [John]₂
Chain 3: [the competition]₄ - [this achievement]₉
Chain 4: [friends]₆
Resolution process:
- "he" → ambiguous between Mark/John, resolved to Mark due to syntactic prominence
(subject)
- "His" → matches masculine singular, references most recent compatible entity
(Mark)
- "him" → matches masculine singular object, continues reference to Mark
- "The young programmer" → requires world knowledge that Mark is a programmer
- "this achievement" → event reference to the winning action
Code Example (Python with spaCy):
python
import spacy

nlp = spacy.load("en_core_web_sm")
# Coreference requires an add-on component, e.g. neuralcoref (spaCy 2.x only):
# import neuralcoref
# neuralcoref.add_to_pipe(nlp)

# Process text
text = "Mark told John that he had won the competition. His friends were very proud of him."
doc = nlp(text)
# With neuralcoref enabled, doc._.coref_clusters lists the resolved chains
Evaluation Metrics:
MUC (Message Understanding Conference) score
B³ (B-cubed)
CEAF (Constrained Entity-Alignment F-measure)
BLANC (BiLateral Assessment of Noun-phrase Coreference)
CoNLL F1 (average of MUC, B³, and CEAF)
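As an illustration, B³ is straightforward to compute once mentions are mapped to clusters; the key (gold) and response (system) assignments below are illustrative:
python
def b_cubed(key, response):
    """B-cubed precision/recall. key and response map each mention to a cluster id."""
    def cluster_of(assign):
        clusters = {}
        for mention, cid in assign.items():
            clusters.setdefault(cid, set()).add(mention)
        return {m: clusters[c] for m, c in assign.items()}

    key_cl, resp_cl = cluster_of(key), cluster_of(response)
    mentions = key.keys() & response.keys()
    precision = sum(len(resp_cl[m] & key_cl[m]) / len(resp_cl[m]) for m in mentions) / len(mentions)
    recall = sum(len(resp_cl[m] & key_cl[m]) / len(key_cl[m]) for m in mentions) / len(mentions)
    return precision, recall

key = {"Mark": 1, "he": 1, "His": 1, "him": 1, "John": 2}          # gold chains
response = {"Mark": "a", "he": "a", "him": "a", "His": "b", "John": "b"}  # system output
print(b_cubed(key, response))  # (0.8, 0.7)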
Applications:
Machine translation (pronoun resolution across languages)
Question answering (connecting entities in questions and answers)
Text summarization (maintaining entity coherence)
Information extraction (building knowledge bases)
Dialogue systems (tracking conversation entities)
Reading comprehension (understanding narrative flow)
Sentiment analysis (attributing opinions to correct entities)
8. Text Normalization and Preprocessing
Definition: The process of transforming text into a standardized, canonical form to ensure consistency
and improve the quality of downstream NLP tasks. Text normalization addresses variations and
irregularities in natural language to create a clean, uniform foundation for analysis.
Components of Text Normalization
1. Case Normalization:
Converting text to lowercase or uppercase
Preserves proper nouns when necessary
Example: "New York City" → "new york city" or "New york city"
Impact: Reduces vocabulary size but can lose information
2. Noise Removal:
Eliminating irrelevant characters and artifacts
HTML/XML tags, URLs, email addresses
Special characters, emoji, decorative symbols
Extra whitespace, line breaks, tabs
Example: "Contact us at info@example.com" → "Contact us at"
3. Punctuation Handling:
Removing punctuation completely
Separating punctuation from words
Standardizing quotes, hyphens, apostrophes
Example: "Don't worry!" → "Don t worry"
Consideration: Important for parsing, sentiment, dialogue
4. Number Handling:
Removing numbers
Converting numbers to words
Categorizing numeric expressions
Example: "I have 42 apples" → "I have NUMBER apples" or "I have forty-two apples"
5. Text Encoding Standardization:
Converting to UTF-8 or ASCII
Handling special characters, accents, diacritics
Normalizing Unicode representations (NFC, NFD)
Example: "café" might be represented as "café" or "cafe\u0301"
6. Spelling Correction and Normalization:
Fixing spelling errors
Standardizing spelling variants
Handling regional variations (e.g., "color" vs. "colour")
Example: "I luv gr8 txts" → "I love great texts"
7. Abbreviation and Acronym Expansion:
Expanding common abbreviations
Normalizing domain-specific shorthand
Example: "WHO announced..." → "World Health Organization announced..."
8. Contractions and Possessives:
Expanding contractions
Standardizing possessive forms
Example: "I'll can't" → "I will cannot" or "I will can not"
9. Character-level Normalization:
Handling repeated characters ("goooooal!")
Standardizing character substitutions ("gr8" → "great")
Removing non-standard characters
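The Unicode sketch referenced in item 5 above, using Python's standard unicodedata module:
python
import unicodedata

s1 = "caf\u00e9"   # precomposed é (NFC form)
s2 = "cafe\u0301"  # e + combining acute accent (NFD form)
print(s1 == s2)                                # False: different code points
print(unicodedata.normalize("NFC", s2) == s1)  # True: same canonical form

# NFKD plus stripping combining marks gives an ASCII approximation
stripped = "".join(c for c in unicodedata.normalize("NFKD", s1)
                   if not unicodedata.combining(c))
print(stripped)  # cafe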
Advanced Normalization Techniques:
10. Text Segmentation:
Word boundaries in languages without spaces (Chinese, Japanese)
Compound word handling (German, Finnish)
Hashtag segmentation (#NaturalLanguageProcessing → Natural Language Processing)
11. Domain-Specific Normalization:
Medical terminology standardization
Legal text normalization
Social media language normalization
Technical jargon standardization
12. Language Identification:
Determining the primary language
Handling code-switching and multilingual text
Applying language-specific normalization rules
Normalization Pipeline Example
Input Text:
OMG!!! I luv NLP sooooo much :) #NaturalLanguageProcessing
Check out https://example.com or email me@email.com
Normalization Steps:
1. Lowercase conversion:
omg!!! i luv nlp sooooo much :) #naturallanguageprocessing
check out https://example.com or email me@email.com
2. Noise removal (URLs, emails):
omg!!! i luv nlp sooooo much :) #naturallanguageprocessing
check out or email
3. Punctuation removal:
omg i luv nlp sooooo much naturallanguageprocessing
check out or email
4. Character normalization:
omg i luv nlp so much naturallanguageprocessing
check out or email
5. Text expansion:
oh my god i love nlp so much natural language processing
check out or email
Implementation Approaches
Rule-based Normalization:
Regular expressions
Dictionary-based replacement
Handcrafted rules
Advantages: Interpretable, controllable
Disadvantages: Limited coverage, high maintenance
Statistical Normalization:
Noisy channel models
Spelling correction algorithms
Advantages: Data-driven, handles unseen cases
Disadvantages: Requires training data, may overgeneralize
Neural Normalization:
Sequence-to-sequence models
Character-level neural networks
Advantages: Handles complex transformations, learns patterns
Disadvantages: Black box, requires substantial training data
Code Example (Python):
python
import re
import string
import unicodedata
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

class TextNormalizer:
    def __init__(self, language='english'):
        self.language = language
        self.stop_words = set(stopwords.words(language)) if language in stopwords.fileids() else set()
        self.stemmer = PorterStemmer()
        self.lemmatizer = WordNetLemmatizer()
        # Common contractions
        self.contractions = {
            "n't": " not",
            "'ll": " will",
            "'ve": " have",
            "'re": " are",
            "'m": " am",
            "'d": " would"
        }

    def remove_urls_emails(self, text):
        text = re.sub(r'https?://\S+|www\.\S+', '', text)
        return re.sub(r'\S+@\S+', '', text)

    def replace_contractions(self, text):
        for contraction, expansion in self.contractions.items():
            text = text.replace(contraction, expansion)
        return text

    def normalize_unicode(self, text):
        # NFKC composes accents and folds compatibility characters
        return unicodedata.normalize('NFKC', text)

    def remove_punctuation(self, text):
        return text.translate(str.maketrans('', '', string.punctuation))

    def normalize_whitespace(self, text):
        return re.sub(r'\s+', ' ', text).strip()

    def normalize_repeating_chars(self, text):
        # Collapse runs of 3+ identical characters ("sooooo" -> "so")
        return re.sub(r'(.)\1{2,}', r'\1', text)

    def tokenize_and_filter(self, text, remove_stopwords):
        tokens = word_tokenize(text.lower())
        return [t for t in tokens if not (remove_stopwords and t in self.stop_words)]

    def stem_tokens(self, tokens):
        return [self.stemmer.stem(t) for t in tokens]

    def lemmatize_tokens(self, tokens):
        return [self.lemmatizer.lemmatize(t) for t in tokens]

    def normalize(self, text, remove_urls=True, expand_contractions=True,
                  unicode_norm=True, remove_punct=True, norm_whitespace=True,
                  norm_repeating=True, remove_stopwords=False, stem=False,
                  lemmatize=True):
        if remove_urls:
            text = self.remove_urls_emails(text)
        if expand_contractions:
            text = self.replace_contractions(text)
        if unicode_norm:
            text = self.normalize_unicode(text)
        if remove_punct:
            text = self.remove_punctuation(text)
        if norm_whitespace:
            text = self.normalize_whitespace(text)
        if norm_repeating:
            text = self.normalize_repeating_chars(text)
        tokens = self.tokenize_and_filter(text, remove_stopwords)
        if stem:
            tokens = self.stem_tokens(tokens)
        if lemmatize:
            tokens = self.lemmatize_tokens(tokens)
        return ' '.join(tokens)

# Example usage
normalizer = TextNormalizer()
input_text = "OMG!!! I luv NLP sooooo much :) #NaturalLanguageProcessing"
normalized_text = normalizer.normalize(input_text)
print(f"Original: {input_text}")
print(f"Normalized: {normalized_text}")
Normalization Considerations:
When to Normalize:
Data cleaning and preprocessing
Increasing lexical overlap for similarity measures
Reducing vocabulary size for efficiency
Standardizing inputs for machine learning
When to Preserve Original Text:
Sentiment analysis (punctuation and capitalization convey emotion)
Author attribution (writing style contains individual markers)
Certain domain-specific tasks (legal, medical precision)
When information loss outweighs standardization benefits
Effectiveness Metrics:
Task-specific performance improvement
Vocabulary size reduction
Improved matching/retrieval rates
Error reduction in downstream tasks
Applications:
Information retrieval and search engines
Text classification and categorization
Machine translation preprocessing
Sentiment analysis preparation
Named entity recognition preprocessing
Text similarity and duplicate detection
Speech recognition post-processing
Optical character recognition (OCR) correction
9. Semantic Role Labeling (SRL)
Definition: Identifying the predicate-argument structure of a sentence — determining who did what to whom, when, where, how, and why.
Extended Example:
Sentence: "Yesterday, John carefully opened the door with a key because it was locked."
Predicate: "opened"
Arguments:
- "Yesterday" - ARGM-TMP (temporal modifier)
- "John" - ARG0 (agent)
- "carefully" - ARGM-MNR (manner)
- "the door" - ARG1 (patient)
- "with a key" - ARGM-INS (instrument)
- "because it was locked" - ARGM-CAU (cause)
Visualization:
[Yesterday]ARGM-TMP, [John]ARG0 [carefully]ARGM-MNR [opened]PREDICATE [the door]ARG1
[with a key]ARGM-INS [because it was locked]ARGM-CAU.
SRL Approaches
1. Feature-based Methods:
Syntactic features (position, path in parse tree)
Lexical features (predicate, headword)
Named entity features
Voice (active/passive)
Traditional ML algorithms (SVM, MaxEnt, CRF)
2. Neural Approaches:
BiLSTM with attention mechanisms
End-to-end deep learning architectures
Multi-task learning with syntactic parsing
Transformer-based methods (BERT, RoBERTa)
3. Rule-based Systems:
Mapping from syntactic structure to semantic roles
Pattern matching on parse trees
Limited coverage but high precision
Code Example (Python with AllenNLP):
python
# This example uses AllenNLP's pretrained BERT-based SRL model
from allennlp.predictors.predictor import Predictor

# Model URL from the AllenNLP public models archive; adjust if it has moved
predictor = Predictor.from_path(
    "https://storage.googleapis.com/allennlp-public-models/structured-prediction-srl-bert.2020.12.15.tar.gz"
)

# Example sentence
sentence = "Yesterday, John carefully opened the door with a key because it was locked."
result = predictor.predict(sentence=sentence)

print("Sentence:", sentence)
print("\nSemantic Role Labels:")
print("-" * 50)

# Each predicate ("verb") carries BIO tags over the sentence's words
for verb in result["verbs"]:
    print(f"Predicate: {verb['verb']}")
    args = {}
    for word, tag in zip(result["words"], verb["tags"]):
        if tag != "O":
            args.setdefault(tag[2:], []).append(word)  # strip the B-/I- prefix
    core_args = [a for a in args if a.startswith("ARG") and not a.startswith("ARGM")]
    modifier_args = [a for a in args if a.startswith("ARGM")]
    print("Core arguments:")
    for arg in sorted(core_args):
        print(f"  {arg}: {' '.join(args[arg])}")
    print("Modifiers:")
    for arg in sorted(modifier_args):
        print(f"  {arg}: {' '.join(args[arg])}")
Applications:
Question answering (finding specific information)
Information extraction (structured knowledge)
Machine translation (role-preserving translation)
Text summarization (identifying key propositions)
Semantic search (finding content by meaning)
Reading comprehension systems
Dialogue systems (understanding user intents)
Natural language inference (logical reasoning)
Event extraction (structured representations of events)