19CSE453
Natural Language Processing
Lecture 2
Text processing: tokenization
What is Tokenization?
Tokenization is the process of segmenting a string of characters into words.
Depending on the application at hand, you might have to perform sentence segmentation as well.
Sentence Segmentation
The problem of deciding where sentences begin and end.
Challenges Involved
• ‘!’ and ‘?’ are quite unambiguous end-of-sentence markers.
• The period “.” is quite ambiguous, since it is additionally used for:
  • Abbreviations (Dr., Mr., m.p.h.)
  • Numbers (2.4%, 4.3)
Approach: build a binary classifier that, for each “.”, decides EndOfSentence/NotEndOfSentence.
• Classifiers can be: hand-written rules, regular expressions, or machine learning (a minimal rule-based sketch is shown below).
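A minimal sketch of such a hand-written-rules classifier (the abbreviation list and heuristics are illustrative assumptions, not from the lecture):

import re

# Illustrative abbreviation list (an assumption for this sketch)
ABBREVIATIONS = {"dr", "mr", "mrs", "prof", "etc"}

def is_end_of_sentence(text, i):
    """Classify the '.' at index i as EndOfSentence (True) or not (False)."""
    before_tokens = text[:i].split()
    before = before_tokens[-1] if before_tokens else ""
    after = text[i + 1:].lstrip()
    if before.lower().rstrip(".") in ABBREVIATIONS:
        return False                           # abbreviation, e.g. "Dr."
    if i + 1 < len(text) and text[i + 1].isdigit():
        return False                           # decimal number, e.g. "2.4"
    return after == "" or after[:1].isupper()  # next sentence starts capitalized

text = "Dr. Smith drove 2.4 miles. He was late."
for match in re.finditer(r"\.", text):
    print(match.start(), is_end_of_sentence(text, match.start()))
# Only the '.' after "miles" and the final '.' are classified EndOfSentence.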
Sentence Segmentation: Decision Tree Example
Decision Tree: Is this word the end-of-sentence (E-O-S)?
Other important features likely to influence sentence segmentation:
• Case of word with “.”: Upper, Lower, Cap, Number
• Case of word after “.”: Upper, Lower, Cap, Number
• Numeric features
➢ Length of word with “.”
➢ Probability(word with “.” occurs at end-of-sentence)
➢ Probability(word after “.” occurs at beginning-of-sentence)
A rough sketch of computing these features is shown below.
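The following sketch covers the categorical and length features above (the probability features would additionally need corpus counts; all names here are illustrative):

def case_class(w):
    """Coarse case feature for a word: Number, Upper, Cap, or Lower."""
    if w.replace(".", "").isdigit():
        return "Number"
    if w.isupper():
        return "Upper"
    if w[:1].isupper():
        return "Cap"
    return "Lower"

def eos_features(word_with_dot, next_word):
    stem = word_with_dot.rstrip(".")
    return {
        "case_of_word_with_dot": case_class(stem),
        "case_of_word_after_dot": case_class(next_word),
        "length_of_word_with_dot": len(stem),
    }

print(eos_features("Mr.", "Smith"))
# {'case_of_word_with_dot': 'Cap', 'case_of_word_after_dot': 'Cap', 'length_of_word_with_dot': 2}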
Word Tokenization
What is Tokenization?
Tokenization is the process of segmenting a string of characters into words.
I have a can opener; but I can not open these cans.
Word Token
An occurrence of a word.
The above sentence has 12 word tokens.
Word Type
A distinct word form.
The above sentence has 10 word types (“I” and “can” each occur twice).
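The counts can be verified with a minimal whitespace tokenizer (stripping the punctuation in this particular sentence; real tokenizers are more careful):

sentence = "I have a can opener; but I can not open these cans."
tokens = [w.strip(";.") for w in sentence.split()]
print(len(tokens), tokens)  # 12 tokens
print(len(set(tokens)))     # 10 types ("I" and "can" each occur twice)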
Word Tokenization
Issues in Tokenization
• Finland’s → Finland? Finland’s? Finland ’s?
• What’re, I’m, shouldn’t → What are, I am, should not?
• San Francisco → one token or two?
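How one widely used tokenizer resolves such cases (NLTK's word_tokenize, which follows Penn Treebank conventions; the punkt models must be downloaded first):

import nltk
nltk.download("punkt")
from nltk.tokenize import word_tokenize

print(word_tokenize("Finland's capital"))  # ['Finland', "'s", 'capital']
print(word_tokenize("I shouldn't go"))     # ['I', 'should', "n't", 'go']
print(word_tokenize("San Francisco"))      # ['San', 'Francisco'], i.e. two tokens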
Normalization
Why “normalize”?
• Indexed text and query terms must have the same form.
  Example: U.S.A. and USA should be matched.
• We implicitly define equivalence classes of terms.
Three forms of normalization:
• Case folding
• Stemming
• Lemmatization
Case Folding
• Reduce all letters to lower case.
• Possible exceptions (task dependent):
➢ Upper case in mid-sentence may point to named entities (e.g., General Motors)
➢ Words written in ALL CAPS may convey emphasis (e.g., US vs. us, I REALLY MEAN IT)
➢ Relevant in applications such as sentiment analysis, information extraction, etc.
Lemmatization
• Reduce inflections or variant forms to the base form:
✓ am, are, is → be
✓ car, cars, car’s, cars’ → car
✓ eat, ate, eaten → eat
✓ write, wrote, written → write
• Must find the correct dictionary headword form (the lemma).
Lemmatization learns from Morphology
Morphology studies the internal structure of words: how words are built up from smaller meaningful units called morphemes.
Morphemes are divided into two categories:
Stems: the core meaning-bearing units
Affixes: bits and pieces adhering to stems to change their meanings and grammatical functions
• Prefix: un-, anti-, etc.
• Suffix: -ation, -ity, -en, -ed, etc.
*Lemmatization algorithms take input from morphology to convert tokens to their root form.
Python Code for Lemmatization
*Note: the second parameter of the lemmatize function defaults to noun if not provided.
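The slide's code listing is not reproduced here; below is a minimal sketch consistent with the note, using NLTK's WordNetLemmatizer, whose lemmatize(word, pos) defaults pos to 'n' (noun):

import nltk
nltk.download("wordnet")
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("cars"))              # car (noun assumed by default)
print(lemmatizer.lemmatize("ate"))               # ate (treated as a noun, so unchanged)
print(lemmatizer.lemmatize("ate", pos="v"))      # eat (correct once tagged as a verb)
print(lemmatizer.lemmatize("written", pos="v"))  # write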
Stemming
• Reducing terms to their stems; used in information retrieval.
• Crude chopping of affixes:
➢ language dependent
➢ automate(s), automatic, automation are all reduced to automat
Example: Porter’s algorithm for stemming.
Porter’s algorithm
Step 1a
sses → ss (caresses → caress)
ies → i (ponies → poni)
ss → ss (caress → caress)
s → φ (cats → cat)
Step 1b
(*v*)ing → φ (walking → walk, king → king)
(*v*)ed → φ (played → play)
*Note: (*v*) means the stem contains a vowel, which is why king → king is unchanged.
Porter’s algorithm (continued)
Step 2
ational → ate (relational → relate)
izer → ize (digitizer → digitize)
ator → ate (operator → operate)
Step 3
al → φ (revival → reviv)
able → φ (adjustable → adjust)
ate → φ (activate → activ)
Python code for Stemming

import nltk
from nltk.stem import PorterStemmer

# Initialize Porter stemmer
stemmer = PorterStemmer()

# Stem each token
print(stemmer.stem("cats"))        # cat
print(stemmer.stem("played"))      # play
print(stemmer.stem("playing"))     # play
print(stemmer.stem("welcomes"))    # welcom
print(stemmer.stem("persual"))     # persual
print(stemmer.stem("ideologies"))  # ideolog

*Note: SnowballStemmer and LancasterStemmer are other stemmers available in NLTK.
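For comparison, a minimal sketch of the other two stemmers mentioned in the note (both ship with NLTK; Snowball is a refined "Porter2" stemmer with multi-language support, while Lancaster is more aggressive):

from nltk.stem import SnowballStemmer, LancasterStemmer

snowball = SnowballStemmer("english")
lancaster = LancasterStemmer()

# Compare the two stemmers on the same tokens
for word in ["playing", "welcomes", "ideologies"]:
    print(word, snowball.stem(word), lancaster.stem(word))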
Popular Python packages for NLP
➢ NLTK (Natural Language Toolkit): NLTK is one of the oldest and most
comprehensive libraries for NLP tasks. It provides tools for tasks such as
tokenization, stemming, lemmatization, part-of-speech tagging, parsing, and more.
➢ spaCy: spaCy is a modern NLP library that's designed to be fast and efficient. It
offers features like tokenization, POS tagging, named entity recognition (NER),
dependency parsing, and sentence segmentation.
➢ TextBlob: TextBlob is built on top of NLTK and provides a simpler interface for
common NLP tasks such as tokenization, POS tagging, noun phrase extraction,
sentiment analysis, and more.
➢ Gensim: Gensim is primarily focused on topic modeling and document similarity
analysis, but it also offers functionality for tasks like text preprocessing, word
embedding, and similarity queries.
➢ scikit-learn: While scikit-learn is a general-purpose machine learning library, it also
includes utilities for text preprocessing, such as CountVectorizer and TfidfVectorizer
for converting text data into numerical feature vectors.
Python code for pre-processing using spaCy

import spacy

# Load English tokenizer, tagger, parser, NER, and word vectors
nlp = spacy.load("en_core_web_sm")

# Sample text
text = "The dogs are barking loudly outside. I am reading a book."
doc = nlp(text)

# Perform various preprocessing tasks
cleaned_text = []
for token in doc:
    # Remove stop words and punctuation
    if not token.is_stop and not token.is_punct:
        # Lemmatize each token
        lemma = token.lemma_
        # Lowercase each token
        cleaned_text.append(lemma.lower())

# Join the cleaned tokens back into a string
cleaned_text = " ".join(cleaned_text)

# Print the preprocessed text
print("Original text:", text)
print("Preprocessed text:", cleaned_text)

Output:
Original text: The dogs are barking loudly outside. I am reading a book.
Preprocessed text: dog bark loudly outside read book
Python code for pre-processing using TextBlob

from textblob import TextBlob

# Sample text
text = "Barack Obama was born in Hawaii on August 4, 1961. He served as the 44th President of the United States."

# Tokenization
blob = TextBlob(text)
tokens = blob.words

# POS tagging
pos_tags = blob.tags

# Noun phrase extraction (TextBlob has no built-in NER; noun phrases are a rough substitute)
noun_phrases = blob.noun_phrases

print("Tokens:", tokens)
print("POS tags:", pos_tags)
print("Noun phrases:", noun_phrases)
Output:
Tokens: ['Barack', 'Obama', 'was', 'born', 'in', 'Hawaii', 'on', 'August', '4', '1961', 'He', 'served', 'as', 'the', '44th', 'President', 'of', 'the', 'United', 'States']
POS tags: [('Barack', 'NNP'), ('Obama', 'NNP'), ('was', 'VBD'), ('born', 'VBN'), ('in', 'IN'), ('Hawaii', 'NNP'), ('on', 'IN'), ('August', 'NNP'), ('4', 'CD'), ('1961', 'CD'), ('He', 'PRP'), ('served', 'VBD'), ('as', 'IN'), ('the', 'DT'), ('44th', 'CD'), ('President', 'NNP'), ('of', 'IN'), ('the', 'DT'), ('United', 'NNP'), ('States', 'NNPS')]
Noun phrases: ['barack obama', 'hawaii', 'august', '44th president']
Spelling Correction: Edit Distance
Spelling Correction
I am writing this email on behaf of ...
The user typed ‘behaf’. Which are some close words?
• behalf
• behave
• ...
Isolated word error correction: pick the word that is closest to ‘behaf’.
How to define ‘closest’? We need a distance metric.
The simplest metric: edit distance.
Edit Distance
• The minimum edit distance between two strings is the minimum number of editing operations needed to transform one string into the other:
• Insert
• Delete
• Substitute
For example, turning ‘cat’ into ‘cart’ takes a single insertion, so their edit distance is 1.
How to find the Minimum Edit Distance?
Searching for a path (sequence of edits) from the start string to the final string:
• Initial state: the word we are transforming
• Operators: insert, delete, substitute
• Goal state: the word we are trying to get to
• Path cost: the number of edits (what we want to minimize)
Minimum Edit as Search
How to navigate? The space of all edit sequences is huge, but:
• Many distinct paths end up at the same state.
• We don’t have to keep track of all of them.
• We only keep track of the shortest path to each state.
Spelling Correction: Edit Distance
Defining the Minimum Edit Distance Matrix
For two strings:
• X of length n
• Y of length m
we define D(i, j) as the edit distance between X[1..i] and Y[1..j], i.e., between the first i characters of X and the first j characters of Y.
Thus, the edit distance between X and Y is D(n, m).
Computing Minimum Edit Distance
Dynamic Programming
A tabular computation of D(n, m): solving problems by combining solutions to subproblems.
Bottom-up:
• Compute D(i, j) for small i, j.
• Compute larger D(i, j) based on previously computed smaller values.
• Continue for all i and j until you reach D(n, m), using the recurrence below.
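The recurrence behind this computation is the standard one (consistent with the definition of D(i, j) above):

D(i, 0) = i and D(0, j) = j
D(i, j) = min( D(i-1, j) + 1,         (deletion)
               D(i, j-1) + 1,         (insertion)
               D(i-1, j-1) + cost )   (substitution)

where cost = 0 if X[i] = Y[j], and otherwise 1 (or 2, in the variant of Levenshtein distance used in Jurafsky & Martin).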
Dynamic Programming Algorithm
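The algorithm listing from the original slide is not reproduced here; the following is a minimal Python sketch of the tabular computation (the sub_cost parameter is an assumption added to cover both substitution-cost conventions):

def min_edit_distance(source, target, sub_cost=2):
    """Fill the D table bottom-up and return D(n, m)."""
    n, m = len(source), len(target)
    # D[i][j] = edit distance between source[:i] and target[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i                                # i deletions
    for j in range(1, m + 1):
        D[0][j] = j                                # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if source[i - 1] == target[j - 1] else sub_cost
            D[i][j] = min(D[i - 1][j] + 1,         # deletion
                          D[i][j - 1] + 1,         # insertion
                          D[i - 1][j - 1] + cost)  # substitution (or match)
    return D[n][m]

print(min_edit_distance("intention", "execution"))              # 8 (substitution cost 2)
print(min_edit_distance("intention", "execution", sub_cost=1))  # 5 (standard Levenshtein)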
The Edit Distance Table
The table D is filled bottom-up: each cell D(i, j) is computed from its neighbours D(i-1, j), D(i, j-1), and D(i-1, j-1). (The original slides stepped through an example table in a sequence of images, not reproduced here.)
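As a small worked example (not from the original slides), the completed table for source ‘sun’ and target ‘sat’ with substitution cost 2:

        #  s  a  t
    #   0  1  2  3
    s   1  0  1  2
    u   2  1  2  3
    n   3  2  3  4

D(3, 3) = 4: substitute u → a and n → t, each at cost 2.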
Computing Alignments
➢ Computing the edit distance alone may not be sufficient for some applications.
➢ We often need to align the characters of the two strings with each other.
➢ Alignments can be recovered by storing a backpointer in each cell (recording which neighbour produced its minimum) and tracing back from D(n, m) to D(0, 0).
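For example, for the ‘intention’ → ‘execution’ pair used in this lecture, one minimum-cost alignment (following the convention in Jurafsky & Martin) is:

    i n t e * n t i o n
    * e x e c u t i o n
    d s s     i s

where d = delete, s = substitute, i = insert, and an unmarked column is a match; the total cost is 8 with substitution cost 2.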
Minimum Edit Distance
Example
Edit distance from ‘intention’ to ‘execution’: 8 with substitution cost 2 (5 with unit substitution cost).
Performance
Time: O(nm), since the table has n × m cells and each is filled in constant time.
Backtrace: O(n + m), since each step of the trace decreases i or j (or both).