Speech Recognition
NLTK - Part of Speech Tagging, Stemming, Lemmatizing
Prof. Dr.-Ing. Udo Garmann
DIT Faculty of Computer Science
Content
• Introduction / What is PoS?
• What is PoS-Tagging?
• Why PoS-Tagging?
• Approaches to PoS-Tagging
• NLTK PoS Tagging
• Evaluation / Gold Standard
• Additional Aspects
• PoS Tagging German
• Stemming
• Lemmatizing
Also see Jurafsky, Chapter 8, and the NLTK Book, Chapter 5.
Introduction
• Dionysius Thrax of Alexandria
(170–90 BCE) was a Hellenistic
grammarian
• Eight parts of speech: noun, verb,
pronoun, preposition, adverb,
conjunction, participle, and article.
• Translation: https://en.wikisource.org/wiki/The_grammar_of_Dionysios_Thrax
• Also known as PoS, word classes, or
syntactic categories
Penn Treebank PoS
Figure 2: Penn Treebank part-of-speech tags (including punctuation)
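The Penn Treebank tags themselves can be inspected directly in NLTK. A small sketch (the 'tagsets' data package may need to be downloaded once):

import nltk
# nltk.download('tagsets')      # tagset documentation, needed once
nltk.help.upenn_tagset('NN')    # documentation for a single tag
nltk.help.upenn_tagset()        # without an argument: the whole Penn Treebank tagset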
Tags for German
Approach for POS tagging of German:
https://datascience.blog.wzb.eu/2016/07/13/accurate-part-of-speech-tagging-of-german-texts-with-nltk/
German tagset:
https://www.sketchengine.eu/german-stts-part-of-speech-tagset/
which refers to
https://www.ims.uni-stuttgart.de/en/research/projects/textkorpora-werkzeuge/
Types of PoS (1)
Language changes over time. That is why two categories of PoS tags are distinguished:
• closed class and
• open class types
Closed classes are those with relatively fixed membership, such as prepositions—new prepositions are rarely coined.
Open classes: nouns and verbs are open classes—new nouns and verbs like iPhone or to fax are continually being created or borrowed.
Figure 3: 17 parts of speech in the Universal Dependencies tagset (de Marneffe et al., 2021, from Jurafsky)
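NLTK can map its Penn Treebank tags to such coarse universal tags directly. A small sketch (assumes the 'universal_tagset' mapping data is installed; the exact mapping may differ slightly between versions):

>>> import nltk
>>> from nltk import word_tokenize
>>> # nltk.download('universal_tagset')
>>> nltk.pos_tag(word_tokenize("They refuse to permit us"), tagset='universal')
# e.g. [('They', 'PRON'), ('refuse', 'VERB'), ('to', 'PRT'), ('permit', 'VERB'), ('us', 'PRON')]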
Types of PoS (2)
See Jurafsky section 8.1.
Four major open classes occur in the languages of the world: nouns, verbs, adjectives, and adverbs.
English has all four, although not every language does.
The closed classes differ more from language to language than do the open classes. Some of the
important closed classes in English include:
prepositions: on, under, over, near, by, at, from, to, with
particles: up, down, on, off, in, out, at, by
determiners: a, an, the
conjunctions: and, but, or, as, if, when
pronouns: she, who, I, others
auxiliary verbs: can, may, should, are
numerals: one, two, three, first, second, third
What is PoS-Tagging?
• Part-of-speech tagging is the process of assigning a part-of-speech marker to each word in an input text. The input to a tagging algorithm is a sequence of (tokenized) words and a tagset, and the output is a sequence of tags, one per token.
• Tagging is a disambiguation task; words are ambiguous - they have more than one possible part-of-speech - and the goal is to find the correct tag for the context.
• For example, “book” can be a verb (“book that flight”) or a noun (“hand me that book”). “That” can be a determiner (“Does that flight serve dinner”) or a complementizer (“I thought that your flight was earlier”); see the sketch below.
• So PoS tagging can be seen as a classification task!
• A good PoS tagger is more than 95% accurate.
Figure 4: PoS Tagging, see Jurafsky
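A quick way to see this disambiguation in action. A minimal sketch; the exact tags depend on the tagger version and may not always match the ideal analysis:

>>> import nltk
>>> from nltk import word_tokenize
>>> nltk.pos_tag(word_tokenize("Book that flight"))    # 'Book' should come out as a verb (VB)
>>> nltk.pos_tag(word_tokenize("Hand me that book"))   # 'book' should come out as a noun (NN)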
Difference to Grammars
When grammars are defined, constituents of sentences (e.g. noun phrase) are named first, and only then word classes (e.g. noun).
Figure 5: Example Sentence Structure, see NLTK Book, chapter 8
Limitations of Grammars
• In a very simple grammar for natural languages, there are few rules about words, e.g. a is only a determiner, dog is only a noun, and runs is only a verb.
• More realistic grammars for natural languages become ambiguous when a broader set of sentences is to be parsed. E.g. a is also a noun (e.g. part a), dog is also a verb (meaning to follow closely), and runs is also a noun (e.g. ski runs).
• In English, verbs can be derived from nouns, e.g. to fish.
• Likewise, nouns can often be derived from verbs.
• Sometimes, strange sentences are syntactically correct, “the a are of I” - are is a noun meaning a hundredth of a hectare (or 100 sq m), and a and I are nouns designating coordinates.
Why PoS?
PoS tells us about likely neighboring words (nouns are preceded by determiners and
adjectives, verbs by nouns) and syntactic structure (nouns are generally part of noun
phrases), making part-of-speech tagging a key aspect of parsing.
Parts of speech are useful features for labelling named entities like people or
organizations in information extraction.
A word’s part of speech can even play a role in speech recognition or synthesis, e.g.,
the word content is pronounced CONtent when it is a noun (“The CONtent of the
package is broken.”) and conTENT when it is an adjective (“She is conTENT with the
solution”).
Approaches for Taggers
Remember: Tagging can be seen as a classification task.
Typical approaches applied for PoS-Tagging:
• Rule-based, e.g. Regex Tagger (NLTK)
• Probabilistic, Hidden Markov Models (see the sketch below)
(e.g. https://serwiss.bib.hs-hannover.de/frontdoor/index/index/docId/1527 )
• Neural Networks (e.g. https://www.researchgate.net/publication/250806272 )
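As a small illustration of the probabilistic (HMM) approach, NLTK itself contains an HMM tagger that can be trained in a supervised fashion on a tagged corpus. A sketch, with all smoothing options left at their defaults:

import nltk
from nltk.corpus import brown
from nltk.tag import hmm

# train on a (small) tagged portion of the Brown corpus
train_sents = brown.tagged_sents(categories='news')[:500]
hmm_tagger = hmm.HiddenMarkovModelTrainer().train_supervised(train_sents)
hmm_tagger.tag(['Mitchell', 'decried', 'the', 'high', 'rate', 'of', 'unemployment'])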
NLTK PoS Tagging (1)
see NLTK Chapter 5, section 1
>>> import nltk
>>> from nltk import word_tokenize
>>> text = word_tokenize("They refuse to permit us to obtain the refuse permit")
>>> nltk.pos_tag(text)  # use NLTK's currently recommended part-of-speech tagger to tag the token list
[('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'),
('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]
Notice that refuse and permit both appear as a present tense verb (VBP) and a noun (NN). E.g. refUSE is a verb meaning “deny,” while REFuse is a
noun meaning “trash” (i.e. they are not homophones). Thus, we need to know which word is being used in order to pronounce the text correctly.
(For this reason, text-to-speech systems usually perform POS-tagging.)
“Currently, NLTK pos_tag only supports English and Russian” (approach for German see later slide).
A tagged token is represented as a tuple consisting of the token and the tag:
nltk.pos_tag(text)
# e.g. `('fly', 'NN')`
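A tagged token can also be constructed from the standard string representation word/tag using nltk.tag.str2tuple:

>>> tagged_token = nltk.tag.str2tuple('fly/NN')
>>> tagged_token
('fly', 'NN')
>>> tagged_token[0]
'fly'
>>> tagged_token[1]
'NN'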
NLTK PoS Tagging (2)
NLTK provides several taggers.
https://www.nltk.org/_modules/nltk/tag.html
contains examples for different taggers, e.g.
>>> from nltk.corpus import brown
>>> from nltk.tag import UnigramTagger
>>> tagger = UnigramTagger(brown.tagged_sents(categories='news')[:500])
>>> sent = ['Mitchell', 'decried', 'the', 'high', 'rate', 'of', 'unemployment']
>>> for word, tag in tagger.tag(sent):
... print(word, '->', tag)
Mitchell -> NP
decried -> None
the -> AT
high -> JJ
rate -> NN
of -> IN
unemployment -> None
The input can also be obtained by tokenizing a sentence:
from nltk import word_tokenize
sent = word_tokenize("They refuse to permit us to obtain the refuse permit")
NLTK PoS Tagging (3)
NLTK provides built-in documentation for the tagsets of some corpora.
It can be queried with a tag, e.g. nltk.help.upenn_tagset('RB'), or with a regular expression,
e.g. nltk.help.upenn_tagset('NN.*').
Some corpora have README files with tagset documentation. The README can be
printed with nltk.corpus.???.readme(), where ??? is the name of the corpus.
Example: nltk.corpus.treebank.readme()
from nltk.book import *
raw = ' '.join(text1)
nltk.pos_tag(text1[:20])
nltk.help.upenn_tagset('RB')
# when using the Brown corpus: nltk.help.brown_tagset()
Counting similar Tags
see https://www.nltk.org/book/ch05.html , section 2.7
Let’s find the most frequent nouns of each noun part-of-speech type, i.e. find all tags
starting with NN.
def findtags(tag_prefix, tagged_text):
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                                   if tag.startswith(tag_prefix))
    return dict((tag, cfd[tag].most_common(5)) for tag in cfd.conditions())

tagdict = findtags('NN', nltk.corpus.brown.tagged_words(categories='news'))
for tag in sorted(tagdict):
    print(tag, tagdict[tag])
N-Gram Tagging
• An n-gram tagger assigns the tag that is most likely in the given context.
• A 1-gram (‘unigram’) tagger considers only the current token.
• A 2-gram (‘bigram’) tagger uses the current token together with the tag of the previous token; see the sketch below.
• In the n-gram tagger shown in the figure, n = 3.
Figure 6: Tagging Context
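A minimal sketch of a 2-gram tagger trained on part of the Brown corpus. Note that a bigram tagger assigns None as soon as it meets a token/context combination it has not seen during training (the sparse-data problem), which is one reason for combining taggers (see the Combining Taggers slide):

>>> import nltk
>>> from nltk.corpus import brown
>>> train_sents = brown.tagged_sents(categories='news')[:500]
>>> bigram_tagger = nltk.BigramTagger(train_sents)
>>> bigram_tagger.tag(['Mitchell', 'decried', 'the', 'high', 'rate', 'of', 'unemployment'])
# unseen contexts are tagged None, and typically every token after the first None gets None as well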
RegEx Tagger
A regular-expression tagger assigns tags to tokens on the basis of matching patterns; a set of rules can be defined for it.
>>> patterns = [
...     (r'.*ing$', 'VBG'),                # gerunds
...     (r'.*ed$', 'VBD'),                 # simple past
...     (r'.*es$', 'VBZ'),                 # 3rd singular present
...     (r'.*ould$', 'MD'),                # modals
...     (r'.*\'s$', 'NN$'),                # possessive nouns
...     (r'.*s$', 'NNS'),                  # plural nouns
...     (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # cardinal numbers
...     (r'.*', 'NN')                      # nouns (default)
... ]
>>> brown_sents = nltk.corpus.brown.sents(categories='news')  # as defined in the NLTK book
>>> regexp_tagger = nltk.RegexpTagger(patterns)
>>> regexp_tagger.tag(brown_sents[3])
[('``', 'NN'), ('Only', 'NN'), ('a', 'NN'), ('relative', 'NN'), ('handful', 'NN'), ...]
Evaluation of Taggers / Gold Standard
• The performance of a tagger is evaluated relative to the tags a human expert
would assign.
• Such test data is called a gold standard.
• A gold standard is a corpus which has been manually annotated and which is
accepted as a standard against which the guesses of an automatic system are
assessed.
• The tagger is regarded as correct if the tag it guesses for a given word is the same as the gold-standard tag (see the sketch below).
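A minimal sketch of such an evaluation with NLTK: a gold-standard (manually tagged) part of the Brown corpus that was held out from training is tagged again and compared against the human annotation. Newer NLTK versions call the method accuracy(), older ones evaluate():

>>> import nltk
>>> from nltk.corpus import brown
>>> tagged_sents = brown.tagged_sents(categories='news')
>>> train_sents, test_sents = tagged_sents[100:], tagged_sents[:100]
>>> unigram_tagger = nltk.UnigramTagger(train_sents)
>>> unigram_tagger.accuracy(test_sents)   # fraction of gold-standard tags reproduced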
Separating the Training and Testing Data
Data should be split into training data (90%) and testing data (10%):
split_perc = 0.1
split_size = int(len(tagged_sents) * split_perc)
train_sents, test_sents = tagged_sents[split_size:], tagged_sents[:split_size]

from ClassifierBasedGermanTagger.ClassifierBasedGermanTagger import ClassifierBasedGermanTagger
tagger = ClassifierBasedGermanTagger(train=train_sents)
Combining Taggers
• Several taggers can be combined to get better results (trade-off between accuracy
and coverage)
• Also, a fallback algorithm can be used.
• For example, the results of a bigram tagger, a unigram tagger, and a default
tagger can be combined like this:
• Try tagging the token with the bigram tagger.
• If the bigram tagger is unable to find a tag for the token, try the unigram tagger.
• If the unigram tagger is also unable to find a tag, use a default tagger.
Most NLTK taggers permit a backoff-tagger to be specified. The backoff-tagger may
itself have a backoff tagger:
>>> t0 = nltk.DefaultTagger('NN')  # tags every word with 'NN'
>>> t1 = nltk.UnigramTagger(train_sents, backoff=t0)
>>> t2 = nltk.BigramTagger(train_sents, backoff=t1)
>>> t2.evaluate(test_sents)
0.844513...
Tagging Unknown Words
A regular-expression tagger or a default tagger cannot make use of context.
If such a tagger encounters, for example, the word blog, it assigns it the same tag everywhere, regardless of context.
A better approach may be (see the sketch below):
• limit the vocabulary of a tagger to the most frequent n words
• replace all other (unknown) words with a special word UNK
• During training, a unigram tagger will probably learn that UNK is usually a noun.
• However, the n-gram taggers will detect contexts in which it has some other tag.
• For example, if the preceding word is to (tagged TO), then UNK will probably be tagged as a verb.
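A small sketch of this idea; the cut-off of 1000 words and the helper mark_unk are hypothetical choices for illustration. The vocabulary is limited to the most frequent words of the training data and all other words are replaced by UNK before training:

>>> import nltk
>>> from nltk.corpus import brown
>>> tagged_sents = brown.tagged_sents(categories='news')
>>> fd = nltk.FreqDist(w.lower() for w in brown.words(categories='news'))
>>> vocab = set(w for w, _ in fd.most_common(1000))     # the n most frequent words
>>> def mark_unk(sent):
...     return [(w if w.lower() in vocab else 'UNK', t) for (w, t) in sent]
>>> train_sents = [mark_unk(s) for s in tagged_sents]   # at tagging time, unknown words must be mapped to 'UNK' in the same way
>>> t1 = nltk.UnigramTagger(train_sents, backoff=nltk.DefaultTagger('NN'))
>>> t2 = nltk.BigramTagger(train_sents, backoff=t1)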
Storing Taggers
After training, a tagger may be saved using the Python module pickle, which can serialize almost any Python object.
>>> from pickle import dump
>>> output = open('t2.pkl', 'wb')
>>> dump(t2, output, -1)
>>> output.close()

# Now, in a separate Python process, we can load our saved tagger.

>>> from pickle import load
>>> input = open('t2.pkl', 'rb')
>>> tagger = load(input)
>>> input.close()
https://docs.python.org/3/library/pickle.html
German Tagging (1)
Example on how to tag a different language:
https://datascience.blog.wzb.eu/2016/07/13/accurate-part-of-speech-tagging-of-german-texts-with-nltk/ (visited 23-04-28)
takes the Tiger corpus
https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger/ (visited
24-04-13)
and uses the ConllCorpusReader
corp = nltk.corpus.ConllCorpusReader('./data', 'tiger_release_aug07.corrected.16012013.conll09',
                                     ['ignore', 'words', 'ignore', 'ignore', 'pos'],
                                     encoding='utf-8')
import random
tagged_sents = list(corp.tagged_sents())
random.shuffle(tagged_sents)
German Tagging (2)
Code continued:
The Python class ClassifierBasedGermanTagger can be downloaded here:
https://github.com/ptnplanet/NLTK-Contributions/tree/master/ClassifierBasedGermanTagger
# set a split size: use 90% for training, 10% for testing
split_perc = 0.1
split_size = int(len(tagged_sents) * split_perc)
train_sents, test_sents = tagged_sents[split_size:], tagged_sents[:split_size]

from ClassifierBasedGermanTagger.ClassifierBasedGermanTagger import ClassifierBasedGermanTagger
tagger = ClassifierBasedGermanTagger(train=train_sents)
accuracy = tagger.accuracy(test_sents)
tagger.tag(['Das', 'ist', 'ein', 'einfacher', 'Test'])
# [('Das', 'ART'), ('ist', 'VAFIN'), ('ein', 'ART'), ('einfacher', 'ADJA'), ...]
German Tagging (3)
CoNLL-style files (see https://www.nltk.org/_modules/nltk/corpus/reader/conll.html):
“These files consist of a series of sentences, separated by blank lines. Each sentence is encoded using a table (or “grid”) of values, where each line corresponds to a single word, and each column corresponds to an annotation type. The set of columns used by CoNLL-style files can vary from corpus to corpus.”
Example content:
# more tiger_release_aug07.corrected.16012013.conll09
1_1 `` -- _ $( _ _ _ 4 _ -- _ _ _ _
1_2 Ross Ross _ NE _ case=nom|number=sg|gender=masc _ 3 _ PNC __ _ _
1_3 Perot Perot _ NE _ case=nom|number=sg|gender=masc _ 4 _ SB __ _ _
1_4 wäre sein _ VAFIN _ number=sg|person=3|tense=past|mood=subj _ 0 _ -- _ _
1_5 vielleicht vielleicht _ ADV _ _ _ 4 _ MO _ __ _
1_6 ein ein _ ART _ case=nom|number=sg|gender=masc _ 8 _ NK __ _ _
1_7 prächtiger prächtig _ ADJA _ case=nom|number=sg|gender=masc|degree=pos _8 _ NK
1_8 Diktator Diktator _ NN _ case=nom|number=sg|gender=masc _ 4 _PD _ _
1_9 '' -- _ $( _ _ _ 4 _ -- _ _ _ _
Stemmer and Lemmatizer
Both stemming and lemmatization aim to reduce inflectional (and sometimes derivationally related) forms of a word to a common base form (see the sketch below):
“Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. If confronted with the token saw, stemming might return just s, whereas lemmatization would attempt to return either see or saw depending on whether the use of the token was as a verb or a noun.”
see https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html (visited 23-04-06)
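The difference can be seen directly in NLTK, e.g. for the word flies: the Porter stemmer just chops the ending, while the WordNet lemmatizer returns the dictionary form. A small sketch; the WordNet data must be available via nltk.download('wordnet'):

>>> from nltk.stem.porter import PorterStemmer
>>> from nltk.stem import WordNetLemmatizer
>>> PorterStemmer().stem('flies')
'fli'
>>> WordNetLemmatizer().lemmatize('flies', pos='n')
'fly'
>>> WordNetLemmatizer().lemmatize('flies', pos='v')
'fly'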
NLTK Stemmer
see https://www.nltk.org/howto/stem.html
• Porter
• Snowball
• …
>>> from nltk.stem.porter import *
>>> stemmer = PorterStemmer()
>>> plurals = ['caresses', 'flies', 'dies', 'mules', 'denied',
...            'died', 'agreed', 'owned', 'humbled', 'sized',
...            'meeting', 'stating', 'siezing', 'itemization',
...            'sensational', 'traditional', 'reference', 'colonizer',
...            'plotted']

>>> singles = [stemmer.stem(plural) for plural in plurals]
>>> print(' '.join(singles))
Lemmatizing
see
https://pythonprogramming.net/lemmatizing-nltk-tutorial/
https://www.nltk.org/_modules/nltk/stem/wordnet.html
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("cats"))
print(lemmatizer.lemmatize("cacti"))
print(lemmatizer.lemmatize("geese"))
print(lemmatizer.lemmatize("rocks"))
print(lemmatizer.lemmatize("python"))
# pos is the part-of-speech parameter. Valid options are "n" for nouns,
# "v" for verbs, "a" for adjectives, "r" for adverbs and "s"
# for satellite adjectives; the default is "n".
print(lemmatizer.lemmatize("better", pos="a"))
print(lemmatizer.lemmatize("best", pos="a"))
print(lemmatizer.lemmatize("run"))
print(lemmatizer.lemmatize("run", 'v'))
Thank you! Questions?