Tokenization, Stemming
and Lemmatization
What is Tokenization?
Tokenization is the task of chopping text into pieces, called tokens
Token is an instance of a sequence of characters in some particular document (grouped as a useful
semantic unit)
Type is the class of all tokens containing the same character sequence
Term is a (perhaps normalized) type that is included in the corpus dictionary
Example: to sleep more to learn
Tokens: to, sleep, more, to, learn
Types: to, sleep, more, learn
Terms: sleep, more, learn (stop words removed)
Tokenization
◼ Input: “Girls, Bengalees and Countrymen”
◼ Output: Tokens
◼ Girls
◼ Bengalees
◼ Countrymen
◼ Each such token is now a candidate for an
index entry, after further processing
◼ Described below
◼ But what are valid tokens to emit?
Tokenization
◼ Issues in tokenization:
◼ Japan’s capital →
Japan? Japans? Japan’s?
◼ অগ্নি-বীণা → অগ্নি and বীণা as two tokens?
◼ state-of-the-art: break up hyphenated sequence.
◼ co-education
◼ lowercase, lower-case, lower case ?
◼ It’s effective to get the user to put in possible hyphens
◼ San Francisco: one token or two? How
do you decide it is one token?
◼ For example, if the document to be indexed is
to sleep perchance to dream, then there are
five tokens, but only four types (because there are
two instances of to). However, if to is omitted from
the index (as a stop word), then there are only three
terms: sleep, perchance, and dream.
◼ The major question of the tokenization phase is
what are the correct tokens to use? In this
example, it looks fairly trivial: you chop on
whitespace and throw away punctuation
characters.
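A minimal Python sketch of this naive approach on the example above (the one-word stop list is an assumption for illustration):

```python
import re

text = "to sleep perchance to dream"
# Chop on whitespace, then throw away punctuation characters
tokens = [re.sub(r"[^\w']", "", t) for t in text.split()]
types = set(tokens)
stop_words = {"to"}                 # assumed stop list for this example
terms = types - stop_words

print(len(tokens), tokens)          # 5 tokens
print(len(types), types)            # 4 types
print(len(terms), terms)            # 3 terms: sleep, perchance, dream
```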
◼ Example.
◼ Mr. O’Neill thinks that the boys’ stories about
Chile’s capital aren’t amusing.
Numbers
◼ 3/12/91 Mar. 12, 1991
◼ 55 B.C.
◼ B-52
◼ My PGP key is 324a3df234cb23e
◼ (800) 234-2333
◼ Often have embedded spaces
◼ Often, don’t index as text
◼ But often very useful: think about things like
looking up error codes/stacktraces on the web
◼ (One answer is using n-grams: Lecture 3)
◼ Will often index “meta-data” separately
◼ Creation date, format, etc.
Tokenization: language issues
◼ French
◼ L'ensemble → one token or two?
◼ L ? L’ ? Le ?
◼ Want l’ensemble to match with un ensemble
◼ German noun compounds are not
segmented
◼ Lebensversicherungsgesellschaftsangestellter
◼ ‘life insurance company employee’
◼ German retrieval systems benefit greatly from a
compound splitter module
Tokenization: language issues
◼ Chinese and Japanese have no spaces
between words:
◼ 莎拉波娃现在居住在美国东南部的佛罗里达。
◼ Not always guaranteed a unique tokenization
◼ Further complicated in Japanese, with
multiple alphabets intermingled
◼ Dates/amounts in multiple formats
フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)
(a single string mixing Katakana, Hiragana, Kanji and Romaji)
End-user can express query entirely in hiragana!
Tokenization: language issues
◼ Arabic (or Hebrew) is basically written right
to left, but with certain items like numbers
written left to right
◼ Words are separated, but letter forms within
a word form complex ligatures
◼ Example: an Arabic sentence read right to left, with the numbers read left to right; its English translation is
◼ ‘Algeria achieved its independence in 1962
after 132 years of French occupation.’
◼ With Unicode, the surface presentation is
complex, but the stored form is straightforward
Word-based Tokenization
Approach
● Splitting the text by spaces
● Other delimiters such as punctuation can be used
Advantages
● Easy to implement
Disadvantages
● High risk of missing or near-duplicate words; e.g., Let and Let’s end up as
two different types
● Languages like Chinese do not use spaces between words
● Huge vocabulary size (many token types)
○ Forces a limit on the number of words that can be added to the
vocabulary
● Misspelled words are treated as separate tokens
Character-based Tokenization
Approach
●Splitting the text into individual characters
Advantages
●There will be no or very few unknown words
(Out Of Vocabulary)
●Useful for languages whose characters carry
meaning (e.g., Chinese)
●Far fewer token types (small vocabulary)
●Easy to implement
Disadvantages
●A character usually does not have a meaning
○ Hard to learn word-level semantics
●Larger sequence to be processed by models
○ More input to process
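A minimal sketch of character-based tokenization (the example string is assumed):

```python
text = "Let's learn"
char_tokens = list(text)            # every character becomes a token
print(char_tokens)                  # ['L', 'e', 't', "'", 's', ' ', 'l', ...]
print(len(char_tokens), "character tokens vs.", len(text.split()), "word tokens")
```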
Subword Tokenization
Approach
●Frequently used words should not be split into smaller subwords
●Rare words should be decomposed into meaningful subwords
●Uses a special symbol to mark whether a subword begins a word or continues
the preceding subword
○ Tokenization → “Token”, “##ization”
●State-of-the-art approaches for NLP and IR rely on this type
Advantages
●Out-of-vocabulary word problem solved
●Manageable vocabulary sizes
Disadvantages
●New scheme and needs more exploration
Byte Pair Encoding (BPE) and WordPiece are two examples of this scheme
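A minimal sketch of how BPE learns merges starting from character tokens; the toy corpus and the number of merges are assumptions for illustration, not the full BPE/WordPiece procedures used in practice:

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs over all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the adjacent pair with one merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: each word starts as a character sequence plus an end-of-word marker
corpus = "low low low lower lower newest newest newest widest widest".split()
vocab = Counter(tuple(word) + ("</w>",) for word in corpus)

for step in range(8):                   # learn 8 merges (assumed, for illustration)
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)    # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(f"merge {step + 1}: {best}")
```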
Stop words
◼ With a stop list, you exclude the commonest
words from the dictionary entirely. Intuition:
◼ They have little semantic content: the, a, and, to, be
◼ There are a lot of them: ~30% of postings for the top 30 words
◼ But the trend is away from doing this:
◼ Good compression techniques mean the space for
including stop words in a system is very small
◼ Good query optimization techniques mean you pay little
at query time for including stop words.
◼ You need them for:
◼ Phrase queries: “King of Denmark”
◼ Various song titles, etc.: “Let it be”, “To be or not to be”
◼ “Relational” queries: “flights to London”
Stopwords Removal
Stopping: Removing common words from the stream of tokens that become
index terms
●Function words that help form sentence structure: the, of, and, to, …
●For an application, an additional domain specific stop words list may be
constructed
●Why do we need to remove stop words?
○ Reduce indexing (or data) file size
○ Usually has no impact on the NLP task’s effectiveness, and may even
improve it
●Can sometimes cause issues for NLP tasks:
○ e.g., phrases: “to be or not to be”, “let it be”, “flights to Portland Maine”
○ Some tasks consider very small stopwords list
■ Sometimes perhaps only “the”
●List of stopwords: https://www.ranks.nl/stopwords
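A minimal sketch of stopping with NLTK’s English stop-word list (assumes NLTK is installed and nltk.download('stopwords') has been run):

```python
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
tokens = "the stories about the boys".split()
index_terms = [t for t in tokens if t not in stop_words]
print(index_terms)                  # ['stories', 'boys']
```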
Normalization
◼ Need to “normalize” terms in indexed text as
well as query terms into the same form
◼ We want to match U.S.A. and USA
◼ We most commonly implicitly define
equivalence classes of terms
◼ e.g., by deleting periods in a term
◼ Alternative is to do asymmetric expansion:
◼ Enter: window Search: window, windows
◼ Enter: windows Search: Windows, windows, window
◼ Enter: Windows Search: Windows
◼ Potentially more powerful, but less efficient
Normalization: other languages
◼ Accents: résumé vs. resume.
◼ Most important criterion:
◼ How are your users likely to write their queries
for these words?
◼ Even in languages that standardly have
accents, users often may not type them
◼ German: Tuebingen vs. Tübingen
◼ Should be equivalent
Case folding
◼ Reduce all letters to lower case
◼ exception: upper case in mid-sentence?
◼ e.g., General Motors
◼ Fed vs. fed
◼ SAIL vs. sail
◼ Often best to lower case everything, since
users will use lowercase regardless of
‘correct’ capitalization…
◼ Aug 2005 Google example:
◼ C.A.T. → Cat Fanciers website, not Caterpillar
Inc.
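A minimal sketch of the equivalence classing described above, combining period deletion with case folding (a deliberately crude rule, as noted):

```python
def normalize(term):
    # Delete periods and reduce all letters to lower case
    return term.replace(".", "").lower()

for term in ["U.S.A.", "USA", "Windows", "windows", "C.A.T."]:
    print(term, "->", normalize(term))   # U.S.A. and USA both become 'usa'
```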
Stemming and lemmatization
◼ The goal of both stemming and lemmatization is to
reduce inflectional forms and sometimes
derivationally related forms of a word to a
common base form. For instance:
◼ e.g., car = automobile
◼ color = colour
◼ Rewrite to form equivalence classes
◼ Index such equivalences
◼ When the document contains automobile, index it under car
as well (usually, also vice-versa)
Stemming is a process that extracts stems by removing the last few characters of a word,
often leading to incorrect meanings and spellings
Lemmatization considers the context and converts the word to its meaningful base form,
which is called the lemma
Stemming
Stemming: To group words that are derived from a common stem
●e.g., “fish”, “fishes”, “fishing” could be mapped to “fish”
●Generally produces small improvements in task effectiveness
●Similar to stopping, stemming can be done aggressively, conservatively, or not
at all
○ Aggressively: consider “fish” and “fishing” the same
○ Conservatively: just identifying plural forms using the letter “s”
■ issue: ‘Centuries’ → ‘Centurie’
○ Not at all: Consider all the word variants
●In different languages, stemming can have different importance for
effectiveness:
○ In Arabic, morphology is more complicated than English
○ In Chinese, stemming is not effective
Stemming
◼ Reduce terms to their “roots” before
indexing
◼ “Stemming” suggests crude affix chopping
◼ language dependent
◼ e.g., automate(s), automatic, automation all
reduced to automat.
for example compressed and compression are both
accepted as equivalent to compress
→ for exampl compress and compress ar both accept
as equival to compress
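A minimal sketch of this crude affix chopping with NLTK’s Porter stemmer (assumes NLTK is installed):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["compressed", "compression", "fishing", "fishes", "automation"]:
    print(word, "->", stemmer.stem(word))
# 'compressed' and 'compression' both reduce to 'compress'
```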
Evaluation of Stemmers
There are three criteria for evaluating stemmers:
1.Correctness
2.Efficiency of the task
3.Compression performance
There are two ways in which stemming can be incorrect:
●Over-stemming (too much of the term is removed)
○ Two or more words being reduced to the same wrong root
○ e.g., ‘centennial’, ‘century’, ‘center’: ‘cent’
●Under-stemming (too little of the term is removed)
○ Two or more words could be wrongly reduced to more than one
root word
○ e.g., ‘acquire’, ‘acquiring’, ‘acquired’ → ‘acquir’, but
‘acquisition’ → ‘acquis’
Lemmatization
◼ Reduce inflectional/variant forms to base
form
◼ E.g.,
◼ am, are, is → be
◼ car, cars, car's, cars' → car
◼ the boy's cars are different colors → the boy
car be different color
◼ Lemmatization implies doing “proper”
reduction to dictionary headword form
Lemmatization
Reduce inflectional/variant forms to base form
●am, are, is → be
●car, cars, car's, cars' → car
●the boy's cars are different colors → the boy car be different color
Lemmatization implies doing “proper” reduction to dictionary headword form
●e.g., WordNet is a lexical database of semantic relations between words in
more than 200 languages
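A minimal sketch using NLTK’s WordNet lemmatizer (assumes NLTK is installed and nltk.download('wordnet') has been run; verbs need an explicit POS tag):

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cars"))           # 'car'
print(lemmatizer.lemmatize("colors"))         # 'color'
print(lemmatizer.lemmatize("are", pos="v"))   # 'be'
print(lemmatizer.lemmatize("is", pos="v"))    # 'be'
```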
Phrases
In a task such as information retrieval, input queries can be 2-3
word phrases
● Phrases can yield more precise queries
○ “University of Southern Maine”, “black sea”
● Less ambiguous
○ “Red apple” vs. “apple”
Phrase is any sequence of n words: n-gram
● unigram: one bigram: sequence of 2 trigram: sequence of 3
word words words
● Generated by:
○ Choosing a particular value for ‘n’
○ Moving that “window” forward one word at a time
● The more frequently a word n-gram occurs, the more likely it is
to correspond to a meaningful phrase in the language
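A minimal sketch of generating word n-grams by sliding a window over the tokens (the example phrase is assumed):

```python
def ngrams(tokens, n):
    # Slide an n-word window forward one word at a time
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "university of southern maine".split()
print(ngrams(tokens, 2))   # bigrams
print(ngrams(tokens, 3))   # trigrams
```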
Language-specificity
◼ Many of the above features embody
transformations that are
◼ Language-specific and
◼ Often, application-specific
◼ These are “plug-in” addenda to the indexing
process
◼ Both open source and commercial plug-ins
are available for handling these
Dictionary entries – first cut
ensemble.french
時間.japanese
MIT.english
mit.german
guaranteed.english
entries.english
sometimes.english
tokenization.english
These may be grouped by language (or not…).
More on this in ranking/query processing.
THANK YOU