Speech Recognition
NLTK - Part of Speech Tagging, Stemming, Lemmatizing
Prof. Dr.-Ing. Udo Garmann
DIT Faculty of Computer Science
Content
• Introduction / What is PoS?
• What is PoS-Tagging?
• Why PoS-Tagging?
• Approaches to PoS-Tagging
• NLTK PoS Tagging
• Evaluation / Gold Standard
• Additional Aspects
• PoS Tagging German
• Stemming
• Lemmatizing
Also see Jurafsky, Chapter 8, and the NLTK Book, Chapter 5.
Introduction
• Dionysius Thrax of Alexandria
(170–90 BCE) was a Hellenistic
grammarian
• Eight parts of speech: noun, verb,
pronoun, preposition, adverb,
conjunction, participle, and article.
• Translation: https://en.wikisource.org/wiki/The_grammar_of_Dionysios_Thrax
• Also known as PoS, word classes, or
syntactic categories
Penn Treebank PoS
Figure 2: Penn Treebank part-of-speech tags (including punctuation)
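The Penn Treebank tags themselves can be inspected directly in NLTK. A small sketch (the 'tagsets' data package may need to be downloaded once):

import nltk
# nltk.download('tagsets')      # tagset documentation, needed once
nltk.help.upenn_tagset('NN')    # documentation for a single tag
nltk.help.upenn_tagset()        # without an argument: the whole Penn Treebank tagset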
Tags for German
Approach for POS tagging of German:
https://datascience.blog.wzb.eu/2016/07/13/accurate-part-of-speech-tagging-of-german-texts-with-nltk/
German tagset:
https://www.sketchengine.eu/german-stts-part-of-speech-tagset/
which refers to
https://www.ims.uni-stuttgart.de/en/research/projects/textkorpora-werkzeuge/
Types of PoS (1)
Language changes over time. That is why two categories of PoS tags are distinguished:
• closed class and
• open class types
Closed classes are those with relatively fixed membership, such as prepositions—new prepositions are rarely coined.
Open classes: nouns and verbs are open classes—new nouns and verbs like iPhone or to fax are continually being created or borrowed.
Figure 3: 17 parts of speech in the Universal Dependencies tagset (de Marneffe et al., 2021, from Jurafsky)
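NLTK can map its Penn Treebank tags to such coarse universal tags directly. A small sketch (assumes the 'universal_tagset' mapping data is installed; the exact mapping may differ slightly between versions):

>>> import nltk
>>> from nltk import word_tokenize
>>> # nltk.download('universal_tagset')
>>> nltk.pos_tag(word_tokenize("They refuse to permit us"), tagset='universal')
# e.g. [('They', 'PRON'), ('refuse', 'VERB'), ('to', 'PRT'), ('permit', 'VERB'), ('us', 'PRON')]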
Types of PoS (2)
See Jurafsky section 8.1.
Four major open classes occur in the languages of the world: nouns, verbs, adjectives, and adverbs.
English has all four, although not every language does.
The closed classes differ more from language to language than do the open classes. Some of the
important closed classes in English include:
prepositions: on, under, over, near, by, at, from, to, with
particles: up, down, on, off, in, out, at, by
determiners: a, an, the
conjunctions: and, but, or, as, if, when
pronouns: she, who, I, others
auxiliary verbs: can, may, should, are
numerals: one, two, three, first, second, third
What is PoS-Tagging?
• Part-of-speech tagging is the process of assigning a part-of-speech marker to each word in an input text. The input to a tagging algorithm is a sequence of (tokenized) words and a tagset, and the output is a sequence of tags, one per token.
• Tagging is a disambiguation task; words are ambiguous - they have more than one possible part-of-speech - and the goal is to find the correct tag for the context.
• For example, “book” can be a verb (“book that flight”) or a noun (“hand me that book”). “That” can be a determiner (“Does that flight serve dinner”) or a complementizer (“I thought that your flight was earlier”); see the sketch below.
• So PoS tagging can be seen as a classification task!
• A good PoS tagger is more than 95% accurate.
Figure 4: PoS Tagging, see Jurafsky
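A quick way to see this disambiguation in action. A minimal sketch; the exact tags depend on the tagger version and may not always match the ideal analysis:

>>> import nltk
>>> from nltk import word_tokenize
>>> nltk.pos_tag(word_tokenize("Book that flight"))    # 'Book' should come out as a verb (VB)
>>> nltk.pos_tag(word_tokenize("Hand me that book"))   # 'book' should come out as a noun (NN)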
Difference to Grammars
When grammars are defined, constituents of sentences (e.g. noun phrase) are named first, and only then word classes (e.g. noun).
Figure 5: Example Sentence Structure, see NLTK Book, chapter 8
Limitations of Grammars
• In a very simple grammar for natural languages, there are few rules about words, e.g. a is only a determiner, dog is only a noun, and runs is only a verb.
• More realistic grammars for natural languages become ambiguous when a broader set of sentences is to be parsed. E.g. a is also a noun (e.g. part a), dog is also a verb (meaning to follow closely), and runs is also a noun (e.g. ski runs).
• In English, verbs can be derived from nouns, e.g. to fish.
• Likewise, nouns can often be derived from verbs.
• Sometimes, strange sentences are syntactically correct, “the a are of I” - are is a noun meaning a hundredth of a hectare (or 100 sq m), and a and I are nouns designating coordinates.
Why PoS?
PoS tells us about likely neighboring words (nouns are preceded by determiners and
adjectives, verbs by nouns) and syntactic structure (nouns are generally part of noun
phrases), making part-of-speech tagging a key aspect of parsing.
Parts of speech are useful features for labelling named entities like people or
organizations in information extraction.
A word’s part of speech can even play a role in speech recognition or synthesis, e.g.,
the word content is pronounced CONtent when it is a noun (“The CONtent of the
package is broken.”) and conTENT when it is an adjective (“She is conTENT with the
solution”).
Approaches for Taggers
Remember: Tagging can be seen as a classification task.
Typical approaches applied for PoS-Tagging:
• Rule-based, e.g. Regex Tagger (NLTK)
• Probabilistic, Hidden Markov Models (see the sketch below)
(e.g. https://serwiss.bib.hs-hannover.de/frontdoor/index/index/docId/1527 )
• Neural Networks (e.g. https://www.researchgate.net/publication/250806272 )
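As a small illustration of the probabilistic (HMM) approach, NLTK itself contains an HMM tagger that can be trained in a supervised fashion on a tagged corpus. A sketch, with all smoothing options left at their defaults:

import nltk
from nltk.corpus import brown
from nltk.tag import hmm

# train on a (small) tagged portion of the Brown corpus
train_sents = brown.tagged_sents(categories='news')[:500]
hmm_tagger = hmm.HiddenMarkovModelTrainer().train_supervised(train_sents)
hmm_tagger.tag(['Mitchell', 'decried', 'the', 'high', 'rate', 'of', 'unemployment'])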
NLTK PoS Tagging (1)
see NLTK Chapter 5, section 1
>>> import nltk
>>> from nltk import word_tokenize
>>> text = word_tokenize("They refuse to permit us to obtain the refuse permit")
>>> nltk.pos_tag(text)  # use NLTK's currently recommended part-of-speech tagger to tag the token list
[('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'),
('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]
Notice that refuse and permit both appear as a present tense verb (VBP) and a noun (NN). E.g. refUSE is a verb meaning “deny,” while REFuse is a
noun meaning “trash” (i.e. they are not homophones). Thus, we need to know which word is being used in order to pronounce the text correctly.
(For this reason, text-to-speech systems usually perform POS-tagging.)
“Currently, NLTK pos_tag only supports English and Russian” (approach for German see later slide).
A tagged token is represented as a tuple consisting of the token and the tag:
nltk.pos_tag(text)
# e.g. `('fly', 'NN')`
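A tagged token can also be constructed from the standard string representation word/tag using nltk.tag.str2tuple:

>>> tagged_token = nltk.tag.str2tuple('fly/NN')
>>> tagged_token
('fly', 'NN')
>>> tagged_token[0]
'fly'
>>> tagged_token[1]
'NN'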
NLTK PoS Tagging (2)
NLTK provides several taggers.
https://www.nltk.org/_modules/nltk/tag.html
contains examples for different taggers, e.g.
>>> from nltk.corpus import brown
>>> from nltk.tag import UnigramTagger
>>> tagger = UnigramTagger(brown.tagged_sents(categories='news')[:500])
>>> sent = ['Mitchell', 'decried', 'the', 'high', 'rate', 'of', 'unemployment']
>>> for word, tag in tagger.tag(sent):
... print(word, '->', tag)
Mitchell -> NP
decried -> None
the -> AT
high -> JJ
rate -> NN
of -> IN
unemployment -> None
The input can also be obtained by tokenizing a sentence:
from nltk import word_tokenize
sent = word_tokenize("They refuse to permit us to obtain the refuse permit")
NLTK PoS Tagging (3)
NLTK provides built-in documentation for the tagsets of some corpora.
It can be queried with a tag, e.g. nltk.help.upenn_tagset('RB'), or with a regular expression,
e.g. nltk.help.upenn_tagset('NN.*').
Some corpora have README files with tagset documentation. The README can be
printed with nltk.corpus.???.readme(), where ??? is the name of the corpus.
Example: nltk.corpus.treebank.readme()
from nltk.book import *
raw = ' '.join(text1)
nltk.pos_tag(text1[:20])
nltk.help.upenn_tagset('RB')
# when using the Brown corpus: nltk.help.brown_tagset()
Counting similar Tags
see https://www.nltk.org/book/ch05.html , section 2.7
Let’s find the most frequent nouns of each noun part-of-speech type, i.e. find all tags
starting with NN.
def findtags(tag_prefix, tagged_text):
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                                   if tag.startswith(tag_prefix))
    return dict((tag, cfd[tag].most_common(5)) for tag in cfd.conditions())

tagdict = findtags('NN', nltk.corpus.brown.tagged_words(categories='news'))
for tag in sorted(tagdict):
    print(tag, tagdict[tag])
N-Gram Tagging
• An n-gram tagger assigns the tag that is most likely in the given context.
• A 1-gram (‘unigram’) tagger considers only the current token.
• A 2-gram (‘bigram’) tagger uses the current token together with the tag of the previous token; see the sketch below.
• In the n-gram tagger shown in the figure, n = 3.
Figure 6: Tagging Context
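A minimal sketch of a 2-gram tagger trained on part of the Brown corpus. Note that a bigram tagger assigns None as soon as it meets a token/context combination it has not seen during training (the sparse-data problem), which is one reason for combining taggers (see the Combining Taggers slide):

>>> import nltk
>>> from nltk.corpus import brown
>>> train_sents = brown.tagged_sents(categories='news')[:500]
>>> bigram_tagger = nltk.BigramTagger(train_sents)
>>> bigram_tagger.tag(['Mitchell', 'decried', 'the', 'high', 'rate', 'of', 'unemployment'])
# unseen contexts are tagged None, and typically every token after the first None gets None as well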
RegEx Tagger
A regular-expression tagger assigns tags to tokens on the basis of matching patterns; a set of rules can be defined for it.
>>> patterns = [
...     (r'.*ing$', 'VBG'),                # gerunds
...     (r'.*ed$', 'VBD'),                 # simple past
...     (r'.*es$', 'VBZ'),                 # 3rd singular present
...     (r'.*ould$', 'MD'),                # modals
...     (r'.*\'s$', 'NN$'),                # possessive nouns
...     (r'.*s$', 'NNS'),                  # plural nouns
...     (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # cardinal numbers
...     (r'.*', 'NN')                      # nouns (default)
... ]
>>> brown_sents = nltk.corpus.brown.sents(categories='news')  # as defined in the NLTK book
>>> regexp_tagger = nltk.RegexpTagger(patterns)
>>> regexp_tagger.tag(brown_sents[3])
[('``', 'NN'), ('Only', 'NN'), ('a', 'NN'), ('relative', 'NN'), ('handful', 'NN'), ...]
Evaluation of Taggers / Gold Standard
• The performance of a tagger is evaluated relative to the tags a human expert
would assign.
• Such test data is called a gold standard.
• A gold standard is a corpus which has been manually annotated and which is
accepted as a standard against which the guesses of an automatic system are
assessed.
• The tagger is regarded as correct if the tag it guesses for a given word is the same as the gold-standard tag (see the sketch below).
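A minimal sketch of such an evaluation with NLTK: a gold-standard (manually tagged) part of the Brown corpus that was held out from training is tagged again and compared against the human annotation. Newer NLTK versions call the method accuracy(), older ones evaluate():

>>> import nltk
>>> from nltk.corpus import brown
>>> tagged_sents = brown.tagged_sents(categories='news')
>>> train_sents, test_sents = tagged_sents[100:], tagged_sents[:100]
>>> unigram_tagger = nltk.UnigramTagger(train_sents)
>>> unigram_tagger.accuracy(test_sents)   # fraction of gold-standard tags reproduced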
Separating the Training and Testing Data
Data should be split into training data (90%) and testing data (10%):
split_perc = 0.1
split_size = int(len(tagged_sents) * split_perc)
train_sents, test_sents = tagged_sents[split_size:], tagged_sents[:split_size]

from ClassifierBasedGermanTagger.ClassifierBasedGermanTagger import ClassifierBasedGermanTagger
tagger = ClassifierBasedGermanTagger(train=train_sents)
Combining Taggers
• Several taggers can be combined to get better results (trade-off between accuracy
and coverage)
• Also, a fallback algorithm can be used.
• For example, the results of a bigram tagger, a unigram tagger, and a default
tagger can be combined like this:
• Try tagging the token with the bigram tagger.
• If the bigram tagger is unable to find a tag for the token, try the unigram tagger.
• If the unigram tagger is also unable to find a tag, use a default tagger.
Most NLTK taggers permit a backoff-tagger to be specified. The backoff-tagger may
itself have a backoff tagger:
>>> t0 = nltk.DefaultTagger('NN')  # tags every word with 'NN'
>>> t1 = nltk.UnigramTagger(train_sents, backoff=t0)
>>> t2 = nltk.BigramTagger(train_sents, backoff=t1)
>>> t2.evaluate(test_sents)
0.844513...
Tagging Unknown Words
A regular-expression tagger or a default tagger cannot make use of context.
If such a tagger encounters, for example, the word blog, it assigns it the same tag everywhere, regardless of context.
A better approach may be (see the sketch below):
• limit the vocabulary of a tagger to the most frequent n words
• replace all other (unknown) words with a special word UNK
• During training, a unigram tagger will probably learn that UNK is usually a noun.
• However, the n-gram taggers will detect contexts in which it has some other tag.
• For example, if the preceding word is to (tagged TO), then UNK will probably be tagged as a verb.
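A small sketch of this idea; the cut-off of 1000 words and the helper mark_unk are hypothetical choices for illustration. The vocabulary is limited to the most frequent words of the training data and all other words are replaced by UNK before training:

>>> import nltk
>>> from nltk.corpus import brown
>>> tagged_sents = brown.tagged_sents(categories='news')
>>> fd = nltk.FreqDist(w.lower() for w in brown.words(categories='news'))
>>> vocab = set(w for w, _ in fd.most_common(1000))     # the n most frequent words
>>> def mark_unk(sent):
...     return [(w if w.lower() in vocab else 'UNK', t) for (w, t) in sent]
>>> train_sents = [mark_unk(s) for s in tagged_sents]   # at tagging time, unknown words must be mapped to 'UNK' in the same way
>>> t1 = nltk.UnigramTagger(train_sents, backoff=nltk.DefaultTagger('NN'))
>>> t2 = nltk.BigramTagger(train_sents, backoff=t1)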
Storing Taggers
After training, a tagger may be saved using the Python module pickle, which can serialize almost any Python object.
>>> from pickle import dump
>>> output = open('t2.pkl', 'wb')
>>> dump(t2, output, -1)
>>> output.close()

# Now, in a separate Python process, we can load our saved tagger.

>>> from pickle import load
>>> input = open('t2.pkl', 'rb')
>>> tagger = load(input)
>>> input.close()
https://docs.python.org/3/library/pickle.html
German Tagging (1)
Example on how to tag a different language:
https://datascience.blog.wzb.eu/2016/07/13/accurate-part-of-speech-tagging-of-german-texts-with-nltk/ (visited 23-04-28)
takes the Tiger corpus
https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger/ (visited
24-04-13)
and uses the ConllCorpusReader
corp = nltk.corpus.ConllCorpusReader('./data', 'tiger_release_aug07.corrected.16012013.conll09',
                                     ['ignore', 'words', 'ignore', 'ignore', 'pos'],
                                     encoding='utf-8')
import random
tagged_sents = list(corp.tagged_sents())
random.shuffle(tagged_sents)
German Tagging (2)
Code continued:
The Python class ClassifierBasedGermanTagger can be downloaded here:
https://github.com/ptnplanet/NLTK-Contributions/tree/master/ClassifierBasedGermanTagger
# set a split size: use 90% for training, 10% for testing
split_perc = 0.1
split_size = int(len(tagged_sents) * split_perc)
train_sents, test_sents = tagged_sents[split_size:], tagged_sents[:split_size]

from ClassifierBasedGermanTagger.ClassifierBasedGermanTagger import ClassifierBasedGermanTagger
tagger = ClassifierBasedGermanTagger(train=train_sents)
accuracy = tagger.accuracy(test_sents)
tagger.tag(['Das', 'ist', 'ein', 'einfacher', 'Test'])
# [('Das', 'ART'), ('ist', 'VAFIN'), ('ein', 'ART'), ('einfacher', 'ADJA'), ...]
German Tagging (3)
CoNLL-style files (see https://www.nltk.org/_modules/nltk/corpus/reader/conll.html):
“These files consist of a series of sentences, separated by blank lines. Each sentence is encoded using a table (or “grid”) of values, where each line corresponds to a single word, and each column corresponds to an annotation type. The set of columns used by CoNLL-style files can vary from corpus to corpus.”
Example content:
# more tiger_release_aug07.corrected.16012013.conll09
1_1 `` -- _ $( _ _ _ 4 _ -- _ _ _ _
1_2 Ross Ross _ NE _ case=nom|number=sg|gender=masc _ 3 _ PNC __ _ _
1_3 Perot Perot _ NE _ case=nom|number=sg|gender=masc _ 4 _ SB __ _ _
1_4 wäre sein _ VAFIN _ number=sg|person=3|tense=past|mood=subj _ 0 _ -- _ _
1_5 vielleicht vielleicht _ ADV _ _ _ 4 _ MO _ __ _
1_6 ein ein _ ART _ case=nom|number=sg|gender=masc _ 8 _ NK __ _ _
1_7 prächtiger prächtig _ ADJA _ case=nom|number=sg|gender=masc|degree=pos _8 _ NK
1_8 Diktator Diktator _ NN _ case=nom|number=sg|gender=masc _ 4 _PD _ _
1_9 '' -- _ $( _ _ _ 4 _ -- _ _ _ _
Stemmer and Lemmatizer
Both stemming and lemmatization aim to reduce inflectional (and sometimes derivationally related) forms of a word to a common base form (see the sketch below):
“Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. If confronted with the token saw, stemming might return just s, whereas lemmatization would attempt to return either see or saw depending on whether the use of the token was as a verb or a noun.”
see https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html (visited 23-04-06)
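The difference can be seen directly in NLTK, e.g. for the word flies: the Porter stemmer just chops the ending, while the WordNet lemmatizer returns the dictionary form. A small sketch; the WordNet data must be available via nltk.download('wordnet'):

>>> from nltk.stem.porter import PorterStemmer
>>> from nltk.stem import WordNetLemmatizer
>>> PorterStemmer().stem('flies')
'fli'
>>> WordNetLemmatizer().lemmatize('flies', pos='n')
'fly'
>>> WordNetLemmatizer().lemmatize('flies', pos='v')
'fly'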
NLTK Stemmer
see https://www.nltk.org/howto/stem.html
• Porter
• Snowball
• …
>>> from nltk.stem.porter import *
>>> stemmer = PorterStemmer()
>>> plurals = ['caresses', 'flies', 'dies', 'mules', 'denied',
...            'died', 'agreed', 'owned', 'humbled', 'sized',
...            'meeting', 'stating', 'siezing', 'itemization',
...            'sensational', 'traditional', 'reference', 'colonizer',
...            'plotted']

>>> singles = [stemmer.stem(plural) for plural in plurals]
>>> print(' '.join(singles))
Lemmatizing
see
https://pythonprogramming.net/lemmatizing-nltk-tutorial/
https://www.nltk.org/_modules/nltk/stem/wordnet.html
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("cats"))
print(lemmatizer.lemmatize("cacti"))
print(lemmatizer.lemmatize("geese"))
print(lemmatizer.lemmatize("rocks"))
print(lemmatizer.lemmatize("python"))
# pos is the part-of-speech parameter. Valid options are "n" for nouns,
# "v" for verbs, "a" for adjectives, "r" for adverbs and "s"
# for satellite adjectives; the default is "n".
print(lemmatizer.lemmatize("better", pos="a"))
print(lemmatizer.lemmatize("best", pos="a"))
print(lemmatizer.lemmatize("run"))
print(lemmatizer.lemmatize("run", 'v'))
Thank you! Questions?