19CSE453
Natural Language Processing
Lecture 2
Text processing: tokenization
What is Tokenization?
Tokenization is the process of segmenting a string of characters into words.
Depending on the application at hand, you might have to perform sentence segmentation as well.
Sentence Segmentation
The problem of deciding where sentences begin and end.
Challenges Involved
• ‘!’ and ‘?’ are quite unambiguous end-of-sentence markers.
• The period “.” is quite ambiguous, since it is additionally used for:
  • Abbreviations (Dr., Mr., m.p.h.)
  • Numbers (2.4%, 4.3)
Approach: build a binary classifier that, for each “.”, decides EndOfSentence/NotEndOfSentence.
• Classifiers can be: hand-written rules, regular expressions, or machine learning (a minimal rule-based sketch is shown below).
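A minimal sketch of such a hand-written-rules classifier (the abbreviation list and heuristics are illustrative assumptions, not from the lecture):

import re

# Illustrative abbreviation list (an assumption for this sketch)
ABBREVIATIONS = {"dr", "mr", "mrs", "prof", "etc"}

def is_end_of_sentence(text, i):
    """Classify the '.' at index i as EndOfSentence (True) or not (False)."""
    before_tokens = text[:i].split()
    before = before_tokens[-1] if before_tokens else ""
    after = text[i + 1:].lstrip()
    if before.lower().rstrip(".") in ABBREVIATIONS:
        return False                           # abbreviation, e.g. "Dr."
    if i + 1 < len(text) and text[i + 1].isdigit():
        return False                           # decimal number, e.g. "2.4"
    return after == "" or after[:1].isupper()  # next sentence starts capitalized

text = "Dr. Smith drove 2.4 miles. He was late."
for match in re.finditer(r"\.", text):
    print(match.start(), is_end_of_sentence(text, match.start()))
# Only the '.' after "miles" and the final '.' are classified EndOfSentence.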
Sentence Segmentation: Decision Tree Example
Decision Tree: Is this word the end-of-sentence (E-O-S)?
Other important features likely to influence sentence segmentation:
• Case of word with “.”: Upper, Lower, Cap, Number
• Case of word after “.”: Upper, Lower, Cap, Number
• Numeric features
➢ Length of word with “.”
➢ Probability(word with “.” occurs at end-of-sentence)
➢ Probability(word after “.” occurs at beginning-of-sentence)
A rough sketch of computing these features is shown below.
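The following sketch covers the categorical and length features above (the probability features would additionally need corpus counts; all names here are illustrative):

def case_class(w):
    """Coarse case feature for a word: Number, Upper, Cap, or Lower."""
    if w.replace(".", "").isdigit():
        return "Number"
    if w.isupper():
        return "Upper"
    if w[:1].isupper():
        return "Cap"
    return "Lower"

def eos_features(word_with_dot, next_word):
    stem = word_with_dot.rstrip(".")
    return {
        "case_of_word_with_dot": case_class(stem),
        "case_of_word_after_dot": case_class(next_word),
        "length_of_word_with_dot": len(stem),
    }

print(eos_features("Mr.", "Smith"))
# {'case_of_word_with_dot': 'Cap', 'case_of_word_after_dot': 'Cap', 'length_of_word_with_dot': 2}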
Word Tokenization
What is Tokenization?
Tokenization is the process of segmenting a string of characters into words.
I have a can opener; but I can not open these cans.
Word Token
An occurrence of a word.
The above sentence has 12 word tokens.
Word Type
A distinct word form.
The above sentence has 10 word types (“I” and “can” each occur twice).
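The counts can be verified with a minimal whitespace tokenizer (stripping the punctuation in this particular sentence; real tokenizers are more careful):

sentence = "I have a can opener; but I can not open these cans."
tokens = [w.strip(";.") for w in sentence.split()]
print(len(tokens), tokens)  # 12 tokens
print(len(set(tokens)))     # 10 types ("I" and "can" each occur twice)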
Word Tokenization
Issues in Tokenization
• Finland’s → Finland? Finland’s? Finland ’s?
• What’re, I’m, shouldn’t → What are, I am, should not?
• San Francisco → one token or two?
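How one widely used tokenizer resolves such cases (NLTK's word_tokenize, which follows Penn Treebank conventions; the punkt models must be downloaded first):

import nltk
nltk.download("punkt")
from nltk.tokenize import word_tokenize

print(word_tokenize("Finland's capital"))  # ['Finland', "'s", 'capital']
print(word_tokenize("I shouldn't go"))     # ['I', 'should', "n't", 'go']
print(word_tokenize("San Francisco"))      # ['San', 'Francisco'], i.e. two tokens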
Normalization
Why “normalize”?
• Indexed text and query terms must have the same form.
  Example: U.S.A. and USA should be matched.
• We implicitly define equivalence classes of terms.
Three forms of normalization:
• Case folding
• Stemming
• Lemmatization
Case Folding
• Reduce all letters to lower case.
• Possible exceptions (task dependent):
➢ Upper case in mid-sentence may point to named entities (e.g., General Motors)
➢ Words written in ALL CAPS may convey emphasis (e.g., US vs. us, I REALLY MEAN IT)
➢ Relevant in applications such as sentiment analysis, information extraction, etc.
Lemmatization
• Reduce inflections or variant forms to the base form:
✓ am, are, is → be
✓ car, cars, car’s, cars’ → car
✓ eat, ate, eaten → eat
✓ write, wrote, written → write
• Must find the correct dictionary headword form (the lemma).
Lemmatization learns from Morphology
Morphology studies the internal structure of words: how words are built up from smaller meaningful units called morphemes.
Morphemes are divided into two categories:
Stems: the core meaning-bearing units
Affixes: bits and pieces adhering to stems to change their meanings and grammatical functions
• Prefix: un-, anti-, etc.
• Suffix: -ation, -ity, -en, -ed, etc.
*Lemmatization algorithms take input from morphology to convert tokens to their root form.
Python Code for Lemmatization
*Note: the second parameter of the lemmatize function defaults to noun if not provided.
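The slide's code listing is not reproduced here; below is a minimal sketch consistent with the note, using NLTK's WordNetLemmatizer, whose lemmatize(word, pos) defaults pos to 'n' (noun):

import nltk
nltk.download("wordnet")
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("cars"))              # car (noun assumed by default)
print(lemmatizer.lemmatize("ate"))               # ate (treated as a noun, so unchanged)
print(lemmatizer.lemmatize("ate", pos="v"))      # eat (correct once tagged as a verb)
print(lemmatizer.lemmatize("written", pos="v"))  # write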
Stemming
• Reducing terms to their stems; used in information retrieval.
• Crude chopping of affixes:
➢ language dependent
➢ automate(s), automatic, automation are all reduced to automat
Example: Porter’s algorithm for stemming.
Porter’s algorithm
Step 1a
sses → ss (caresses → caress)
ies → i (ponies → poni)
ss → ss (caress → caress)
s → φ (cats → cat)
Step 1b
(*v*)ing → φ (walking → walk, king → king)
(*v*)ed → φ (played → play)
*Note: (*v*) means the stem contains a vowel, which is why king → king is unchanged.
Porter’s algorithm (continued)
Step 2
ational → ate (relational → relate)
izer → ize (digitizer → digitize)
ator → ate (operator → operate)
Step 3
al → φ (revival → reviv)
able → φ (adjustable → adjust)
ate → φ (activate → activ)
Python code for Stemming

import nltk
from nltk.stem import PorterStemmer

# Initialize Porter stemmer
stemmer = PorterStemmer()

# Stem each token
print(stemmer.stem("cats"))        # cat
print(stemmer.stem("played"))      # play
print(stemmer.stem("playing"))     # play
print(stemmer.stem("welcomes"))    # welcom
print(stemmer.stem("persual"))     # persual
print(stemmer.stem("ideologies"))  # ideolog

*Note: SnowballStemmer and LancasterStemmer are other stemmers available in NLTK.
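For comparison, a minimal sketch of the other two stemmers mentioned in the note (both ship with NLTK; Snowball is a refined "Porter2" stemmer with multi-language support, while Lancaster is more aggressive):

from nltk.stem import SnowballStemmer, LancasterStemmer

snowball = SnowballStemmer("english")
lancaster = LancasterStemmer()

# Compare the two stemmers on the same tokens
for word in ["playing", "welcomes", "ideologies"]:
    print(word, snowball.stem(word), lancaster.stem(word))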
Popular Python packages for NLP
➢ NLTK (Natural Language Toolkit): NLTK is one of the oldest and most
comprehensive libraries for NLP tasks. It provides tools for tasks such as
tokenization, stemming, lemmatization, part-of-speech tagging, parsing, and more.
➢ spaCy: spaCy is a modern NLP library that's designed to be fast and efficient. It
offers features like tokenization, POS tagging, named entity recognition (NER),
dependency parsing, and sentence segmentation.
➢ TextBlob: TextBlob is built on top of NLTK and provides a simpler interface for
common NLP tasks such as tokenization, POS tagging, noun phrase extraction,
sentiment analysis, and more.
➢ Gensim: Gensim is primarily focused on topic modeling and document similarity
analysis, but it also offers functionality for tasks like text preprocessing, word
embedding, and similarity queries.
➢ scikit-learn: While scikit-learn is a general-purpose machine learning library, it also
includes utilities for text preprocessing, such as CountVectorizer and TfidfVectorizer
for converting text data into numerical feature vectors.
Python code for pre-processing using spaCy

import spacy

# Load English tokenizer, tagger, parser, NER, and word vectors
nlp = spacy.load("en_core_web_sm")

# Sample text
text = "The dogs are barking loudly outside. I am reading a book."
doc = nlp(text)

# Perform various preprocessing tasks
cleaned_text = []
for token in doc:
    # Remove stop words and punctuation
    if not token.is_stop and not token.is_punct:
        # Lemmatize each token
        lemma = token.lemma_
        # Lowercase each token
        cleaned_text.append(lemma.lower())

# Join the cleaned tokens back into a string
cleaned_text = " ".join(cleaned_text)

# Print the preprocessed text
print("Original text:", text)
print("Preprocessed text:", cleaned_text)

Output:
Original text: The dogs are barking loudly outside. I am reading a book.
Preprocessed text: dog bark loudly outside read book
Python code for pre-processing using TextBlob

from textblob import TextBlob

# Sample text
text = "Barack Obama was born in Hawaii on August 4, 1961. He served as the 44th President of the United States."

# Tokenization
blob = TextBlob(text)
tokens = blob.words

# POS tagging
pos_tags = blob.tags

# Noun phrase extraction (TextBlob has no built-in NER; noun phrases are a rough substitute)
noun_phrases = blob.noun_phrases

print("Tokens:", tokens)
print("POS tags:", pos_tags)
print("Noun phrases:", noun_phrases)
Output:
Tokens: ['Barack', 'Obama', 'was', 'born', 'in', 'Hawaii', 'on', 'August', '4', '1961', 'He', 'served', 'as', 'the', '44th', 'President', 'of', 'the', 'United', 'States']
POS tags: [('Barack', 'NNP'), ('Obama', 'NNP'), ('was', 'VBD'), ('born', 'VBN'), ('in', 'IN'), ('Hawaii', 'NNP'), ('on', 'IN'), ('August', 'NNP'), ('4', 'CD'), ('1961', 'CD'), ('He', 'PRP'), ('served', 'VBD'), ('as', 'IN'), ('the', 'DT'), ('44th', 'CD'), ('President', 'NNP'), ('of', 'IN'), ('the', 'DT'), ('United', 'NNP'), ('States', 'NNPS')]
Noun phrases: ['barack obama', 'hawaii', 'august', '44th president']
Spelling Correction: Edit Distance
Spelling Correction
I am writing this email on behaf of ...
The user typed ‘behaf’. Which are some close words?
• behalf
• behave
• ...
Isolated word error correction: pick the word that is closest to ‘behaf’.
How to define ‘closest’? We need a distance metric.
The simplest metric: edit distance.
Edit Distance
• The minimum edit distance between two strings is the minimum number of editing operations needed to transform one string into the other:
• Insert
• Delete
• Substitute
For example, turning ‘cat’ into ‘cart’ takes a single insertion, so their edit distance is 1.
How to find the Minimum Edit Distance?
Searching for a path (sequence of edits) from the start string to the final string:
• Initial state: the word we are transforming
• Operators: insert, delete, substitute
• Goal state: the word we are trying to get to
• Path cost: the number of edits (what we want to minimize)
Minimum Edit as Search
How to navigate? The space of all edit sequences is huge, but:
• Many distinct paths end up at the same state.
• We don’t have to keep track of all of them.
• We only keep track of the shortest path to each state.
Spelling Correction: Edit Distance
Defining the Minimum Edit Distance Matrix
For two strings:
• X of length n
• Y of length m
we define D(i, j) as the edit distance between X[1..i] and Y[1..j], i.e., between the first i characters of X and the first j characters of Y.
Thus, the edit distance between X and Y is D(n, m).
Computing Minimum Edit Distance
Dynamic Programming
A tabular computation of D(n, m): solving problems by combining solutions to subproblems.
Bottom-up:
• Compute D(i, j) for small i, j.
• Compute larger D(i, j) based on previously computed smaller values.
• Continue for all i and j until you reach D(n, m), using the recurrence below.
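The recurrence behind this computation is the standard one (consistent with the definition of D(i, j) above):

D(i, 0) = i and D(0, j) = j
D(i, j) = min( D(i-1, j) + 1,         (deletion)
               D(i, j-1) + 1,         (insertion)
               D(i-1, j-1) + cost )   (substitution)

where cost = 0 if X[i] = Y[j], and otherwise 1 (or 2, in the variant of Levenshtein distance used in Jurafsky & Martin).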
Dynamic Programming Algorithm
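The algorithm listing from the original slide is not reproduced here; the following is a minimal Python sketch of the tabular computation (the sub_cost parameter is an assumption added to cover both substitution-cost conventions):

def min_edit_distance(source, target, sub_cost=2):
    """Fill the D table bottom-up and return D(n, m)."""
    n, m = len(source), len(target)
    # D[i][j] = edit distance between source[:i] and target[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i                                # i deletions
    for j in range(1, m + 1):
        D[0][j] = j                                # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if source[i - 1] == target[j - 1] else sub_cost
            D[i][j] = min(D[i - 1][j] + 1,         # deletion
                          D[i][j - 1] + 1,         # insertion
                          D[i - 1][j - 1] + cost)  # substitution (or match)
    return D[n][m]

print(min_edit_distance("intention", "execution"))              # 8 (substitution cost 2)
print(min_edit_distance("intention", "execution", sub_cost=1))  # 5 (standard Levenshtein)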
The Edit Distance Table
The table D is filled bottom-up: each cell D(i, j) is computed from its neighbours D(i-1, j), D(i, j-1), and D(i-1, j-1). (The original slides stepped through an example table in a sequence of images, not reproduced here.)
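As a small worked example (not from the original slides), the completed table for source ‘sun’ and target ‘sat’ with substitution cost 2:

        #  s  a  t
    #   0  1  2  3
    s   1  0  1  2
    u   2  1  2  3
    n   3  2  3  4

D(3, 3) = 4: substitute u → a and n → t, each at cost 2.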
Computing Alignments
➢ Computing the edit distance alone may not be sufficient for some applications.
➢ We often need to align the characters of the two strings with each other.
➢ Alignments can be recovered by storing a backpointer in each cell (recording which neighbour produced its minimum) and tracing back from D(n, m) to D(0, 0).
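For example, for the ‘intention’ → ‘execution’ pair used in this lecture, one minimum-cost alignment (following the convention in Jurafsky & Martin) is:

    i n t e * n t i o n
    * e x e c u t i o n
    d s s     i s

where d = delete, s = substitute, i = insert, and an unmarked column is a match; the total cost is 8 with substitution cost 2.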
Minimum Edit Distance
Example
Edit distance from ‘intention’ to ‘execution’: 8 with substitution cost 2 (5 with unit substitution cost).
Performance
Time: O(nm), since the table has n × m cells and each is filled in constant time.
Backtrace: O(n + m), since each step of the trace decreases i or j (or both).