Tokenization, Stemming
and Lemmatization
What is Tokenization?
Tokenization is the task of chopping text into pieces, called tokens
Token is an instance of a sequence of characters in some particular document (grouped as a useful
semantic unit)
Type is the class of all tokens containing the same character sequence
Term is a (perhaps normalized) type that is included in the corpus dictionary
Example: to sleep more to learn
Tokens: to, sleep, more, to, learn
Types: to, sleep, more, learn
Terms: sleep, more, learn (stop words removed)
Tokenization
◼ Input: “Girls, Bengalees and Countrymen”
◼ Output: Tokens
◼ Girls
◼ Bengalees
◼ Countrymen
◼ Each such token is now a candidate for an
index entry, after further processing
◼ Described below
◼ But what are valid tokens to emit?
Tokenization
◼ Issues in tokenization:
◼ Japan’s capital →
Japan? Japans? Japan’s?
◼ অগ্নি-বীণা → অগ্নি and বীণা as two tokens?
◼ state-of-the-art: break up hyphenated sequence.
◼ co-education
◼ lowercase, lower-case, lower case ?
◼ It’s effective to get the user to put in possible hyphens
◼ San Francisco: one token or two? How
do you decide it is one token?
◼ For example, if the document to be indexed is
to sleep perchance to dream, then there are
five tokens, but only four types (because there are
two instances of to). However, if to is omitted from
the index (as a stop word), then there are only three
terms: sleep, perchance, and dream.
◼ The major question of the tokenization phase is
what are the correct tokens to use? In this
example, it looks fairly trivial: you chop on
whitespace and throw away punctuation
characters.
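A minimal Python sketch of this naive approach on the example above (the one-word stop list is an assumption for illustration):

```python
import re

text = "to sleep perchance to dream"
# Chop on whitespace, then throw away punctuation characters
tokens = [re.sub(r"[^\w']", "", t) for t in text.split()]
types = set(tokens)
stop_words = {"to"}                 # assumed stop list for this example
terms = types - stop_words

print(len(tokens), tokens)          # 5 tokens
print(len(types), types)            # 4 types
print(len(terms), terms)            # 3 terms: sleep, perchance, dream
```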
◼ Example.
◼ Mr. O’Neill thinks that the boys’ stories about
Chile’s capital aren’t amusing.
Numbers
◼ 3/12/91 Mar. 12, 1991
◼ 55 B.C.
◼ B-52
◼ My PGP key is 324a3df234cb23e
◼ (800) 234-2333
◼ Often have embedded spaces
◼ Often, don’t index as text
◼ But often very useful: think about things like
looking up error codes/stacktraces on the web
◼ (One answer is using n-grams: Lecture 3)
◼ Will often index “meta-data” separately
◼ Creation date, format, etc.
Tokenization: language issues
◼ French
◼ L'ensemble → one token or two?
◼ L ? L’ ? Le ?
◼ Want l’ensemble to match with un ensemble
◼ German noun compounds are not
segmented
◼ Lebensversicherungsgesellschaftsangestellter
◼ ‘life insurance company employee’
◼ German retrieval systems benefit greatly from a
compound splitter module
Tokenization: language issues
◼ Chinese and Japanese have no spaces
between words:
◼ 莎拉波娃现在居住在美国东南部的佛罗里达。
◼ Not always guaranteed a unique tokenization
◼ Further complicated in Japanese, with
multiple alphabets intermingled
◼ Dates/amounts in multiple formats
フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)
(a single string mixing Katakana, Hiragana, Kanji and Romaji)
End-user can express query entirely in hiragana!
Tokenization: language issues
◼ Arabic (or Hebrew) is basically written right
to left, but with certain items like numbers
written left to right
◼ Words are separated, but letter forms within
a word form complex ligatures
◼ Example: an Arabic sentence read right to left, with the numbers read left to right; its English translation is
◼ ‘Algeria achieved its independence in 1962
after 132 years of French occupation.’
◼ With Unicode, the surface presentation is
complex, but the stored form is straightforward
Word-based Tokenization
Approach
● Splitting the text by spaces
● Other delimiters such as punctuation can be used
Advantages
● Easy to implement
Disadvantages
● High risk of missing or near-duplicate words; e.g., Let and Let’s end up as
two different types
● Languages like Chinese do not use spaces between words
● Huge vocabulary size (many token types)
○ Forces a limit on the number of words that can be added to the
vocabulary
● Misspelled words are treated as separate tokens
Character-based Tokenization
Approach
●Splitting the text into individual characters
Advantages
●There will be no or very few unknown words
(Out Of Vocabulary)
●Useful for languages whose characters carry
meaning (e.g., Chinese)
●Far fewer token types (small vocabulary)
●Easy to implement
Disadvantages
●A character usually does not have a meaning
○ Hard to learn word-level semantics
●Larger sequence to be processed by models
○ More input to process
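A minimal sketch of character-based tokenization (the example string is assumed):

```python
text = "Let's learn"
char_tokens = list(text)            # every character becomes a token
print(char_tokens)                  # ['L', 'e', 't', "'", 's', ' ', 'l', ...]
print(len(char_tokens), "character tokens vs.", len(text.split()), "word tokens")
```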
Subword Tokenization
Approach
●Frequently used words should not be split into smaller subwords
●Rare words should be decomposed into meaningful subwords
●Uses a special symbol to mark whether a subword begins a word or continues
the preceding subword
○ Tokenization → “Token”, “##ization”
●State-of-the-art approaches for NLP and IR rely on this type
Advantages
●Out-of-vocabulary word problem solved
●Manageable vocabulary sizes
Disadvantages
●New scheme and needs more exploration
Byte Pair Encoding (BPE) and WordPiece are two examples of this scheme
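A minimal sketch of how BPE learns merges starting from character tokens; the toy corpus and the number of merges are assumptions for illustration, not the full BPE/WordPiece procedures used in practice:

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs over all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the adjacent pair with one merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: each word starts as a character sequence plus an end-of-word marker
corpus = "low low low lower lower newest newest newest widest widest".split()
vocab = Counter(tuple(word) + ("</w>",) for word in corpus)

for step in range(8):                   # learn 8 merges (assumed, for illustration)
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)    # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(f"merge {step + 1}: {best}")
```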
Stop words
◼ With a stop list, you exclude the commonest
words from the dictionary entirely. Intuition:
◼ They have little semantic content: the, a, and, to, be
◼ There are a lot of them: ~30% of postings for the top 30 words
◼ But the trend is away from doing this:
◼ Good compression techniques mean the space for
including stop words in a system is very small
◼ Good query optimization techniques mean you pay little
at query time for including stop words.
◼ You need them for:
◼ Phrase queries: “King of Denmark”
◼ Various song titles, etc.: “Let it be”, “To be or not to be”
◼ “Relational” queries: “flights to London”
Stopwords Removal
Stopping: Removing common words from the stream of tokens that become
index terms
●Function words that help form sentence structure: the, of, and, to, …
●For an application, an additional domain specific stop words list may be
constructed
●Why do we need to remove stop words?
○ Reduce indexing (or data) file size
○ Usually has no impact on the NLP task’s effectiveness, and may even
improve it
●Can sometimes cause issues for NLP tasks:
○ e.g., phrases: “to be or not to be”, “let it be”, “flights to Portland Maine”
○ Some tasks consider very small stopwords list
■ Sometimes perhaps only “the”
●List of stopwords: https://www.ranks.nl/stopwords
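A minimal sketch of stopping with NLTK’s English stop-word list (assumes NLTK is installed and nltk.download('stopwords') has been run):

```python
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
tokens = "the stories about the boys".split()
index_terms = [t for t in tokens if t not in stop_words]
print(index_terms)                  # ['stories', 'boys']
```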
Normalization
◼ Need to “normalize” terms in indexed text as
well as query terms into the same form
◼ We want to match U.S.A. and USA
◼ We most commonly implicitly define
equivalence classes of terms
◼ e.g., by deleting periods in a term
◼ Alternative is to do asymmetric expansion:
◼ Enter: window Search: window, windows
◼ Enter: windows Search: Windows, windows, window
◼ Enter: Windows Search: Windows
◼ Potentially more powerful, but less efficient
Normalization: other languages
◼ Accents: résumé vs. resume.
◼ Most important criterion:
◼ How are your users likely to write their queries
for these words?
◼ Even in languages that standardly have
accents, users often may not type them
◼ German: Tuebingen vs. Tübingen
◼ Should be equivalent
Case folding
◼ Reduce all letters to lower case
◼ exception: upper case in mid-sentence?
◼ e.g., General Motors
◼ Fed vs. fed
◼ SAIL vs. sail
◼ Often best to lower case everything, since
users will use lowercase regardless of
‘correct’ capitalization…
◼ Aug 2005 Google example:
◼ C.A.T. → Cat Fanciers website, not Caterpillar
Inc.
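A minimal sketch of the equivalence classing described above, combining period deletion with case folding (a deliberately crude rule, as noted):

```python
def normalize(term):
    # Delete periods and reduce all letters to lower case
    return term.replace(".", "").lower()

for term in ["U.S.A.", "USA", "Windows", "windows", "C.A.T."]:
    print(term, "->", normalize(term))   # U.S.A. and USA both become 'usa'
```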
Stemming and lemmatization
◼ The goal of both stemming and lemmatization is to
reduce inflectional forms and sometimes
derivationally related forms of a word to a
common base form. For instance:
◼ e.g., car = automobile
◼ color = colour
◼ Rewrite to form equivalence classes
◼ Index such equivalences
◼ When the document contains automobile, index it under car
as well (usually, also vice-versa)
Stemming is a process that extracts stems by removing the last few characters of a word,
often leading to incorrect meanings and spellings
Lemmatization considers the context and converts the word to its meaningful base form,
which is called the lemma
Stemming
Stemming: To group words that are derived from a common stem
●e.g., “fish”, “fishes”, “fishing” could be mapped to “fish”
●Generally produces small improvements in task effectiveness
●Similar to stopping, stemming can be done aggressively, conservatively, or not
at all
○ Aggressively: consider “fish” and “fishing” the same
○ Conservatively: just identifying plural forms using the letter “s”
■ issue: ‘Centuries’ → ‘Centurie’
○ Not at all: Consider all the word variants
●In different languages, stemming can have different importance for
effectiveness:
○ In Arabic, morphology is more complicated than English
○ In Chinese, stemming is not effective
Stemming
◼ Reduce terms to their “roots” before
indexing
◼ “Stemming” suggests crude affix chopping
◼ language dependent
◼ e.g., automate(s), automatic, automation all
reduced to automat.
for example compressed and compression are both
accepted as equivalent to compress
→ for exampl compress and compress ar both accept
as equival to compress
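A minimal sketch of this crude affix chopping with NLTK’s Porter stemmer (assumes NLTK is installed):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["compressed", "compression", "fishing", "fishes", "automation"]:
    print(word, "->", stemmer.stem(word))
# 'compressed' and 'compression' both reduce to 'compress'
```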
Evaluation of Stemmers
There are three criteria for evaluating stemmers:
1.Correctness
2.Efficiency of the task
3.Compression performance
There are two ways in which stemming can be incorrect:
●Over-stemming (too much of the term is removed)
○ Two or more words being reduced to the same wrong root
○ e.g., ‘centennial’, ‘century’, ‘center’: ‘cent’
●Under-stemming (too little of the term is removed)
○ Two or more words could be wrongly reduced to more than one
root word
○ e.g., ‘acquire’, ‘acquiring’, ‘acquired’ → ‘acquir’, but
‘acquisition’ → ‘acquis’
Lemmatization
◼ Reduce inflectional/variant forms to base
form
◼ E.g.,
◼ am, are, is → be
◼ car, cars, car's, cars' → car
◼ the boy's cars are different colors → the boy
car be different color
◼ Lemmatization implies doing “proper”
reduction to dictionary headword form
Lemmatization
Reduce inflectional/variant forms to base form
●am, are, is → be
●car, cars, car's, cars' → car
●the boy's cars are different colors → the boy car be different color
Lemmatization implies doing “proper” reduction to dictionary headword form
●e.g., WordNet is a lexical database of semantic relations between words in
more than 200 languages
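A minimal sketch using NLTK’s WordNet lemmatizer (assumes NLTK is installed and nltk.download('wordnet') has been run; verbs need an explicit POS tag):

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cars"))           # 'car'
print(lemmatizer.lemmatize("colors"))         # 'color'
print(lemmatizer.lemmatize("are", pos="v"))   # 'be'
print(lemmatizer.lemmatize("is", pos="v"))    # 'be'
```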
Phrases
In a task such as information retrieval, input queries can be 2-3
word phrases
● Phrases can yield more precise queries
○ “University of Southern Maine”, “black sea”
● Less ambiguous
○ “Red apple” vs. “apple”
Phrase is any sequence of n words: n-gram
● unigram: one bigram: sequence of 2 trigram: sequence of 3
word words words
● Generated by:
○ Choosing a particular value for ‘n’
○ Moving that “window” forward one word at a time
● The more frequently a word n-gram occurs, the more likely it is
to correspond to a meaningful phrase in the language
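A minimal sketch of generating word n-grams by sliding a window over the tokens (the example phrase is assumed):

```python
def ngrams(tokens, n):
    # Slide an n-word window forward one word at a time
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "university of southern maine".split()
print(ngrams(tokens, 2))   # bigrams
print(ngrams(tokens, 3))   # trigrams
```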
Language-specificity
◼ Many of the above features embody
transformations that are
◼ Language-specific and
◼ Often, application-specific
◼ These are “plug-in” addenda to the indexing
process
◼ Both open source and commercial plug-ins
are available for handling these
Dictionary entries – first cut
ensemble.french
時間.japanese
MIT.english
mit.german
guaranteed.english
entries.english
sometimes.english
tokenization.english
These may be grouped by language (or not…).
More on this in ranking/query processing.
THANK YOU