
Module 2

Word Level Analysis


Basic Terms
▪ Tokenization, Stemming, Lemmatization; Survey of English Morphology, Inflectional Morphology, Derivational Morphology; Regular expressions with types.
Tokenization
▪ Tokenization is the process of breaking down the given text in natural language processing into the smallest units of a sentence, called tokens.
▪ Punctuation marks, words, and numbers can all be considered tokens.

▪ But why do we need tokenization?


▪ We may want to find the frequencies of the words in the
entire text by dividing the given text into tokens. Then,
models can be made on these frequencies.
▪ Or we may want to tag tokens by word type. (P-O-S
tagging)
Word Tokenization
▪ Word tokenization is the most commonly used method, where text is divided into individual words. It works well for languages with clear word boundaries, like English. For example, "Machine learning is fascinating" becomes:

Input before tokenization: ["Machine learning is fascinating"]
Output when tokenized by words: ["Machine", "learning", "is", "fascinating"]
Character Tokenization
▪ In character tokenization, the textual data is split into a sequence of individual characters. This is beneficial for tasks that require detailed analysis, such as spelling correction, or for tasks with unclear word boundaries. It is also useful for character-level language modelling.

Input before tokenization: ["You are helpful"]
Output when tokenized by characters: ["Y", "o", "u", " ", "a", "r", "e", " ", "h", "e", "l", "p", "f", "u", "l"]
Sub-word Tokenization
▪ This strikes a balance between word and character tokenization by breaking down text into units that are larger than a single character but smaller than a full word. This is useful when dealing with morphologically rich languages or rare words.

["Time", "table"]
["Rain", "coat"]
["Grace", "fully"]
["Run", "way"]

▪ Sub-word tokenization helps to handle out-of-vocabulary words in NLP tasks and languages that form words by combining smaller units.
Sentence Tokenization
▪ Sentence tokenization is also a common technique, used to divide paragraphs or large sets of sentences into separate sentences as tokens. This is useful for tasks requiring individual sentence analysis or processing.

Input before tokenization: ["Artificial Intelligence is an emerging technology. Machine learning is fascinating. Computer Vision handles images."]
Output when tokenized by sentences: ["Artificial Intelligence is an emerging technology.", "Machine learning is fascinating.", "Computer Vision handles images."]
N-gram Tokenization
▪ N-gram tokenization splits text into fixed-sized chunks of n consecutive tokens.

Input before tokenization: ["Machine learning is powerful"]
Output when tokenized by bigrams: [('Machine', 'learning'), ('learning', 'is'), ('is', 'powerful')]
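A small sketch generating bigrams from a word-tokenized sentence (the same idea extends to any n):

def ngrams(tokens, n):
    # Slide a window of size n over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "Machine learning is powerful".split()
print(ngrams(tokens, 2))
# [('Machine', 'learning'), ('learning', 'is'), ('is', 'powerful')]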

https://www.geeksforgeeks.org/nlp/nlp-how-tokenizing-text-sentence-words-works/
Stemming
▪ "Stemming is the process of finding the root of words."
▪ Stemming is a method in text processing that eliminates prefixes and suffixes from words, transforming them into their fundamental or root form. The main objective of stemming is to streamline and standardize words, enhancing the effectiveness of natural language processing tasks.
Why is Stemming important?
▪ Stemming in natural language processing reduces words to their base or root form, aiding in text normalization for easier processing. This technique is crucial in tasks like text classification, information retrieval, and text summarization.
▪ It simplifies text by reducing words to their root form, improving efficiency and accuracy in various NLP tasks.
▪ By grouping related words together, stemming helps reduce data redundancy, normalize text, and enhance search capabilities.
Uses of Stemming

Data Simplification and Redundancy Reduction:
▪ Stemming reduces the number of unique words in a text by converting different forms of a word (e.g., "running," "runs," "ran") into a single root form (e.g., "run").

Improved Search Accuracy:
▪ Stemming allows search engines to connect related words, increasing the breadth of search results.
▪ For example, a search for "running" might also return results containing "runner" or "runs" because they share the same root word.

Enhanced Performance of NLP Tasks:
▪ Stemming helps in tasks like text classification, clustering, indexing, and topic modeling by reducing the dimensionality of the data.
Common Stemming Techniques
▪ Porter Stemmer: Developed by Martin Porter in the 1980s, the Porter Stemmer is one of the oldest and most widely used stemming algorithms. It is designed primarily for English words and applies a series of rules to remove suffixes and transform words to their base form.
▪ Snowball Stemmer: Also known as the Porter2 Stemmer, the Snowball Stemmer is an extension of the Porter algorithm with support for multiple languages. It employs a more systematic approach and can handle stemming tasks in languages beyond English, including French, German, and Spanish.
▪ Lancaster Stemmer: Created by Chris Paice at Lancaster University, the Lancaster Stemmer is known for its aggressive stemming strategy. It applies a set of heuristic rules to truncate words aggressively, often resulting in shorter stems compared to other algorithms.
Example

Word         Porter     Snowball   Lancaster
running      run        run        run
generously   gener      generous   gen
connection   connect    connect    connect
happiness    happi      happi      happy
adjustable   adjust     adjust     adjust

• Porter: Classical, less aggressive, good balance.
• Snowball: Slightly improved over Porter (more consistent rules).
• Lancaster: Very aggressive, risks over-stemming.
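All three stemmers are available in NLTK; a short sketch comparing them (assuming nltk is installed; exact stems can differ slightly between NLTK versions):

from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

words = ["running", "generously", "connection", "happiness", "adjustable"]
porter = PorterStemmer()
snowball = SnowballStemmer("english")
lancaster = LancasterStemmer()

for w in words:
    # Print the word alongside each stemmer's output for comparison.
    print(w, porter.stem(w), snowball.stem(w), lancaster.stem(w))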
Over-Stemming
▪ Occurs when the stemming process reduces many unrelated words to the same root.
▪ E.g., despite sharing a core word and being related, the words "universal", "university" and "universe" have quite different meanings, yet a stemmer reduces all three to the same stem "univers".

(Diagram: "universal", "university" and "universe" collapsing onto the shared stem "univers".)
Under-Stemming
▪ Occurs when words that should be reduced to a single root are not stemmed to the same stem.
▪ For example, the Porter Stemmer algorithm does not reduce the words "alumnus," "alumnae," and "alumni" to the same word stem, although they should be treated as synonyms.
Lemmatization
▪ Lemmatization is a text normalization technique in Natural Language Processing (NLP) that reduces words to their base or dictionary form, known as the lemma.
▪ Unlike stemming, which can truncate words arbitrarily, lemmatization considers the context and part of speech to produce a meaningful base form.
How
Lemmatization
Works?

https://www.mygreatlearning.com/blog/what-is-lemmatization/
The process of lemmatization
▪ Tokenization: Splitting text into words.
▪ Example: Sentence: "The cats are playing in the garden."
▪ After tokenization: ['The', 'cats', 'are', 'playing', 'in', 'the', 'garden']

▪ Part-of-Speech (POS) Tagging: Identifying a word's role (noun, verb, adjective, etc.).
▪ Example: cats (noun), are (verb), playing (verb), garden (noun)
▪ POS tagging helps distinguish between words with multiple forms, such as "running" (verb) vs. "running" (adjective, as in "running water").

▪ Applying Lemmatization Rules: Converting words into their base form using a lexical database (see the sketch below). Example:
▪ playing → play
▪ cats → cat
▪ better → good

https://www.mygreatlearning.com/blog/what-is-lemmatization/
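A brief sketch with NLTK's WordNetLemmatizer, where the POS tag is passed explicitly (assuming nltk and its WordNet data are installed; expected outputs shown in comments):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("playing", pos="v"))  # play
print(lemmatizer.lemmatize("cats", pos="n"))     # cat
print(lemmatizer.lemmatize("better", pos="a"))   # good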
Why is
Lemmatization
Important in NLP?

https://www.mygreatlearning.com/blog/what-is-lemmatization/
Stemming or
Lemmatization?

https://www.mygreatlearning.com/blog/what-is-lemmatization/
Applications
Survey of English Morphology
▪ In NLP, morphological analysis focuses on analyzing the internal structure of words to understand how they are formed from smaller units called morphemes and how these morphemes combine to create different word forms.
▪ This involves identifying roots, prefixes, and suffixes, and understanding their roles in conveying grammatical information.
▪ This analysis is crucial for various NLP tasks like text normalization, information retrieval, and machine translation.
Key Concepts in English Morphology for NLP
▪ Morphemes: The smallest units of meaning in a language. They can be free (stand-alone words) or bound (prefixes, suffixes).
▪ Free (stand-alone words): bus, bicycle
▪ Bound: -ing, un-
Morphological Parsing
▪ It is the process of determining the morphemes from which a given word is constructed.
OR
▪ It is used to identify and find the number of morphemes in a given word.

▪ Morphemes are the smallest meaningful units, which cannot be divided further.
▪ Morphemes can be stems or affixes.
▪ A stem is the root of the word, whereas an affix can be a prefix, suffix or infix. For example:
▪ Unsuccessful → un + success + ful
  (prefix) (stem) (suffix)
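An illustrative sketch of naive affix stripping against tiny hand-made prefix/suffix lists (the lists are hypothetical, just to show the idea of splitting a word into prefix + stem + suffix):

PREFIXES = ["un", "re", "dis"]           # illustrative, not exhaustive
SUFFIXES = ["ful", "ness", "ing", "ed"]  # illustrative, not exhaustive

def parse(word):
    # Take the first matching prefix and suffix, if any, and return the split.
    prefix = next((p for p in PREFIXES if word.startswith(p)), "")
    rest = word[len(prefix):]
    suffix = next((s for s in SUFFIXES if rest.endswith(s)), "")
    stem = rest[:len(rest) - len(suffix)] if suffix else rest
    return prefix, stem, suffix

print(parse("unsuccessful"))  # ('un', 'success', 'ful')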
Steps to design a morphological parser
▪ 3 components: Lexicon, Morphotactics, Orthographic rules

▪ Lexicon:
▪ Stores the basic information about a word.
▪ E.g., whether the word is a stem or an affix.
▪ If it is a stem, whether it is a verb stem or a noun stem.
▪ If it is an affix, whether it is a prefix, infix or suffix.
▪ Morphotactics:
▪ A set of rules used to make ordering decisions.
▪ Decides whether a morpheme can appear or not appear before, after, or in between other morphemes.
▪ use + able + ness → useableness
▪ able + use + ness → ableuseness (wrong: "able" cannot be a prefix, and "-ness" cannot attach directly to "use")
▪ Orthographic rules:
▪ A set of rules used to decide spelling changes.
▪ Lady + s → ladys (wrong)
▪ Lady + s → ladies (right)

Morphology and Morphological Analysis

Aspect        Morphology                Morphological Analysis
What is it?   Study of word structure   Process of breaking down words
Type          Theoretical concept       Practical task
In NLP?       Provides rules            Used to extract structure
Output        Knowledge of morphemes    Morpheme-level breakdown of words
Types of Morphology
▪ Inflectional
▪ Derivational
Inflectional Morphology
▪ An inflectional morpheme is a suffix that's added to a word to assign a particular grammatical property to that word, such as its number, mood, tense, or possession.
▪ However, an inflectional morpheme can never change the grammatical category of a word.
▪ You can add an inflectional morpheme to a verb, noun, adjective, or an adverb.
▪ For example, adding '-s' to the verb 'run' produces the third-person singular form 'runs'. Similarly, adding '-ed' to the verb 'dance' creates the past tense of the verb (danced).
▪ Some more examples are as follows:
▪ Cat → Cats
▪ Teach → Teaches
▪ Clean → Cleaned
▪ Pretty → Prettier

As evident from the above examples, inflectional morphemes usually produce different forms of the same word, rather than different words. In addition, inflection does not generally change the basic meaning of a word; it only adds specifications to a word or emphasizes certain aspects of its meaning. Thus, inflected forms are not found as separate entries in dictionaries.
What is Derivational Morphology?
▪ Derivational morphology is the study of the formation of new words that differ either in syntactic category or in meaning from their bases. Thus, a derivational morpheme is an affix we add to a word in order to create a new word or a new form of a word. Moreover, a derivational morpheme can either change the meaning or the grammatical category of the word. For example,

Change in Meaning
Leaf → Leaflet
Pure → Impure

Change in Grammatical Category
Help (verb) → Helper (noun)
Logic (noun) → Logical (adjective)

Regular Expressions and Types
▪ A regular expression, also known as RegEx for short, is a character sequence that forms a search pattern to find a string or set of strings.
▪ RegEx can be used to find the presence of a character(s) in a string.
▪ The Python re library is frequently used to work with RegEx in Python.

https://heartbeat.comet.ml/understanding-regular-expression-for-natural-language-processing-ce9c4e272a29
Definition
▪ Regular expressions are particularly useful for searching in texts, when we have a pattern to search for and a corpus of texts to search through.
▪ A regular expression search function will search through the corpus, returning all texts that match the pattern.
▪ The corpus can be a single document or a collection. For example, the Unix command-line tool grep takes a regular expression and returns every line of the input document that matches the expression.
Regular Expressions use cases
Regular Expressions are used in various tasks such as:
▪ Data pre-processing;
▪ Rule-based information mining systems;
▪ Pattern matching;
▪ Text feature engineering;
▪ Web scraping;
▪ Data validation;
▪ Data extraction.

An example use case is extracting all hashtags from a tweet, or getting email addresses or phone numbers from large unstructured text content (see the sketch below).
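A small sketch using Python's re module to pull hashtags and email addresses out of unstructured text (the patterns are simplified illustrations, not production-grade validators):

import re

text = "Contact us at support@example.com about #NLP and #RegEx tips."
hashtags = re.findall(r"#\w+", text)
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)
print(hashtags)  # ['#NLP', '#RegEx']
print(emails)    # ['support@example.com']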
Regular Expression Pattern
▪ Patterns are what we want to find or match in a string. A pattern could be an ordinary character (such as a letter, e.g. A, or a digit, e.g. 9), a special character, a special sequence, or a combination of all of them.

Special Characters
▪ Special characters are those that have specific meanings.
\ - Backslash
▪ Backslash is used to signal that anything following it is a special sequence, or to escape a special character.
▪ For instance, the special dot character "." can be used to match any character other than a newline. In other words, it can match a letter, a digit, a space, a symbol, etc.
▪ The dot character alone will function as a special character, but if we wish to locate a literal dot character in a string, we must escape it with a backslash, e.g. \.
^ - Caret
▪ This is used to determine whether or not a string begins with a given character(s).
▪ As an example, ^h checks to see if a string begins with an h or not.
▪ Also, the ^https pattern checks to see if a string starts with this set of characters.
[ ] - Square Brackets
▪ This is used to specify a character class, or a group of characters to match.
▪ The characters can be listed individually within the brackets, such as [abc], which will match any string containing a, b, or c, or given as a range separated by "-", such as [a-c].
▪ The caret symbol can also be used within a character class, where it represents the negation of that class. For example, [^0-5] will match any character except 0 to 5.
| - Pipe Symbol
▪ It represents OR and is used whenever you want to match the pattern that is present before or after the OR symbol. For example, b|c will match any string that contains the letter b or c, such as "boy" or "cat".
( ) - Parentheses
▪ This is known as the group symbol, and it is used to group sub-patterns together. It matches whatever RegEx pattern is inside the parentheses.
The Repetition Qualifiers
▪ Repetition qualifiers allow you to match the result of the RegEx pattern multiple times.

+ - Plus
▪ The plus symbol causes a match of one or more occurrences of the pattern preceding it. In simpler terms, this means that the pattern (the character before the plus symbol) will be matched one or more times in the result.
* - Star Symbol
▪ The star symbol works in the same way as the plus symbol, with the exception that it causes a match of zero or more occurrences of the RegEx pattern. That is, unlike the plus symbol, which must include the pattern in the resulting match at least once, the star symbol may not include it at all.
? - Question Mark
▪ The question mark functions in the same way as the two preceding symbols, with the exception that it matches zero or one occurrence of the RegEx pattern.
Special Sequences
▪ A special sequence consists of a backslash \ and a character. They are frequently used to represent a predefined set of characters such as words, digits, and so on. Now we look at some of the most common special sequences in RegEx.

\d
▪ \d matches any digit in a string, from 0 to 9. Because it serves as a range, this special sequence is the same as [0-9].

\w
▪ \w is used to match any alphanumeric character in a string, either lowercase or uppercase, and is equivalent to the pattern [A-Za-z0-9_] (letters, digits, and the underscore).
▪ This makes sense given that we previously stated that we can list patterns individually in a character class, i.e. square brackets.
▪ The first pattern is a range A-Z, which matches any uppercase letter from A to Z, followed by a-z, which matches any lowercase letter from a to z, and finally 0-9, which matches any digit from 0 to 9.

\s
▪ \s matches a whitespace character in a string.
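A short sketch exercising several of the metacharacters and special sequences described above (illustrative patterns only):

import re

print(re.findall(r"\d+", "Room 42, floor 7"))                    # ['42', '7']
print(re.findall(r"\w+", "Hello, world!"))                       # ['Hello', 'world']
print(re.search(r"^https", "https://example.com") is not None)   # True
print(re.findall(r"[a-c]+", "abcxyzcab"))                        # ['abc', 'cab']
print(re.findall(r"b|c", "boy cat"))                             # ['b', 'c']
print(re.findall(r"colou?r", "color colour"))                    # ['color', 'colour']
print(re.findall(r"go+gle", "ggle gogle google"))                # ['gogle', 'google']
print(re.split(r"\s+", "a  b\tc"))                               # ['a', 'b', 'c']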
What Are Morphological Models?
▪ Morphological models are computational methods used to analyze, generate, or recognize word forms by breaking them into morphemes.
▪ They help in tasks like:
▪ Lemmatization
▪ POS tagging
▪ Machine translation
▪ Spell-checking
Types of Morphological Models
▪ Rule-Based Models: Use hand-written linguistic rules. Example: "Add -ed to a verb for past tense."
▪ Dictionary Lookup Models: Use a predefined list (lexicon) of word forms and lemmas. Example: look up went → lemma = go.
▪ Finite-State Transducers (FST): Automata that model both recognition and generation. Example: encode morpheme rules in a state machine.
▪ Statistical Models: Learn from data using probabilities. Example: learn that the "-ing" suffix is common for verbs.
▪ Neural Models: Use deep learning to learn morphological structure from data. Example: sequence-to-sequence models like LSTM or Transformer.
Dictionary Lookup
▪ In Natural Language Processing (NLP), dictionary lookup models analyze word structure by comparing input words to entries in a pre-compiled dictionary. These dictionaries store word forms along with their associated grammatical and lexical information.
How Dictionaries are Made
1. Corpus-based Compilation:
▪ Dictionaries can be built by analyzing large collections of text (corpora). This involves identifying unique words and their variations (inflections, derivations) and recording their occurrences.
2. Linguistic Resources:
▪ Existing linguistic resources like lexicons, thesauruses, and morphological databases can be used to populate the dictionary.
3. Hand-Crafted Entries:
▪ For specific domains or languages, dictionaries may require manual creation or augmentation of entries, especially for specialized vocabulary, proper nouns, or rare words.
4. Morpheme-based Analysis:
▪ Dictionaries can be created by breaking down words into their constituent morphemes (smallest units of meaning) and storing the morphemes along with their properties.
5. Data Structures:
▪ The dictionary data is typically stored using efficient data structures like tries, hash tables, or binary search trees for fast lookup.
Key Components of a Dictionary Lookup Model
▪ Lexicon: The core database containing word forms and their associated information.
▪ Matcher: An algorithm or component that searches for the input word in the lexicon and retrieves the corresponding information.
▪ Morphological Analyzer: In some cases, the lookup model might be combined with morphological analyzers (e.g., finite-state transducers) to handle inflected or derived word forms.
Example
▪ In the example of the word "unhappiness", a dictionary lookup model would:
▪ Check if "unhappiness" exists in the dictionary.
▪ If found, it might retrieve information like:
   ▪ Root word: "happy"
   ▪ Prefix: "un"
   ▪ Suffix: "ness"
   ▪ Part of speech: noun
▪ If not found, it might trigger a morphological analyzer to decompose the word into its morphemes.
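A toy sketch of a dictionary lookup with a naive morphological fallback (the lexicon entries and suffix list here are made up purely for illustration):

LEXICON = {  # hypothetical hand-made entries
    "happy": {"pos": "adjective"},
    "unhappiness": {"root": "happy", "prefix": "un", "suffix": "ness", "pos": "noun"},
    "kind": {"pos": "adjective"},
}
SUFFIXES = ["ness", "ful"]  # illustrative only

def lookup(word):
    if word in LEXICON:               # direct dictionary hit
        return LEXICON[word]
    for s in SUFFIXES:                # naive morphological fallback
        if word.endswith(s) and word[:-len(s)] in LEXICON:
            return {"root": word[:-len(s)], "suffix": s}
    return None

print(lookup("unhappiness"))  # found directly in the lexicon
print(lookup("kindness"))     # analyzed as root 'kind' + suffix 'ness'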
Finite State Automata
▪ Finite State Automata (FSA) are used for morphological analysis.
▪ A finite state automaton is defined as M = {Q, ∑, δ, q0, F} where,
  Q = finite non-empty set of states
  ∑ = finite set of input symbols
  δ = transition mapping, δ: Q × ∑ → Q (deterministic) or δ: Q × ∑ → 2^Q (non-deterministic)
  q0 = initial state of the finite automaton
  F = finite set of final states

Finite State Automata for Morphology
(Diagram: a two-state automaton with states "off" and "on"; the input PUSH toggles between the two states.)
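A minimal sketch of a deterministic FSA encoded as a transition dictionary, here accepting the strings "walk", "walks" and "walked" (states and transitions are made up for illustration):

# States: 0 = start; state 4 is reached after 'walk'; 's' and 'ed' lead to further states.
TRANSITIONS = {
    (0, "w"): 1, (1, "a"): 2, (2, "l"): 3, (3, "k"): 4,
    (4, "s"): 5, (4, "e"): 6, (6, "d"): 7,
}
ACCEPTING = {4, 5, 7}  # final states

def accepts(word):
    state = 0
    for ch in word:
        state = TRANSITIONS.get((state, ch))
        if state is None:          # no transition defined: reject
            return False
    return state in ACCEPTING

for w in ["walk", "walks", "walked", "walking"]:
    print(w, accepts(w))  # True, True, True, False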
Recognition vs. Analysis
▪ An FSA can recognize/accept a string but cannot tell its internal structure.
▪ A machine is required to map/transduce the input string into an output string that encodes the internal structure.
▪ We have Finite State Transducers (FSTs) for this.
Finite State Transducers
▪ Two-level morphology represents the correspondence between lexical and surface levels. Finite state transducers are used to find the mapping between these two levels.
▪ An FST is thus a two-tape automaton: it reads from one tape and writes to the other.
▪ For morphological processing, one tape holds the lexical representation, the second holds the surface form of a word.
▪ An FSA represents a set of strings like {walk, walks, walked}, while an FST represents a set of pairs of strings like {(walks, walk+N+Pl), (walk, walk+N+Sg), (walked, walk+V+Past)}.

Lexical Tape: d o g +N +Pl (upper tape)
Surface Tape: d o g s      (lower tape)

Porter Stemmer
▪ In morphological analysis, lexicon-free Finite State Transducers (FSTs) like the Porter Stemmer are used primarily for stemming, that is, reducing words to their root or base forms (stems) without using a dictionary or lexicon.
What is a Lexicon-Free FST?
▪ A lexicon is a dictionary of known words and their properties.
▪ A lexicon-free FST doesn't rely on any such word list.
▪ Instead, it uses rule-based rewriting, like stripping suffixes, to transform input words.
What is the Porter Stemmer?
▪ It's a rule-based stemmer developed by Martin Porter in 1980.
▪ It applies a sequence of suffix-stripping rules to reduce English words to their stems.
▪ Example:
   ▪ caresses → caress
   ▪ ponies → poni
   ▪ relational → relate
FST Implementation of Porter
An FST can model the Porter stemmer by:
▪ Using states for steps in the algorithm
▪ Encoding suffix-stripping rules as transitions
▪ Outputting the stemmed form (see the sketch below)
Language Modeling with N-grams
▪ Language modeling involves predicting the likelihood of sequences of words.
▪ N-grams: An N-gram is a contiguous sequence of 'N' items (such as words or characters) from a given text. In language modeling, N-grams predict the probability of a word based on the preceding 'N-1' words.
Types of N-grams
▪ Unigrams (1-grams): Single items (words or tokens).
▪ Bigrams (2-grams): Sequences of two items.
▪ Trigrams (3-grams): Sequences of three items.
▪ Higher-order N-grams: Sequences of 'N' items where 'N' > 3.
Example
Consider the sentence: "The cat sat on the mat."
▪ Unigrams: ["The", "cat", "sat", "on", "the", "mat"]
▪ Bigrams: ["The cat", "cat sat", "sat on", "on the", "the mat"]
▪ Trigrams: ["The cat sat", "cat sat on", "sat on the", "on the mat"]
Simple N-grams
▪ Simple N-grams refer to basic N-gram models where the probability of a word depends solely on the preceding 'N-1' words. These models are used to estimate the likelihood of word sequences in text.
Example of a Bigram Model
▪ To predict the next word given a sequence of one word, a bigram model uses the probability of a word given the previous word.
▪ Sentence: "The cat sat on the mat."
▪ Goal: Predict the probability of the word "mat" given the preceding word.
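A sketch of the calculation under the bigram assumption, using maximum-likelihood estimates from this single sentence (counts taken case-insensitively):

P(mat | the) = Count("the mat") / Count("the")
Count("the") = 2, Count("the mat") = 1, so P(mat | the) = 1/2.

The probability of the whole sentence under a bigram model would be
P(the | <s>) × P(cat | the) × P(sat | cat) × P(on | sat) × P(the | on) × P(mat | the).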
Counting Words in Corpora
▪ Counting words and N-grams involves calculating their frequency in a text corpus. These counts are crucial for estimating probabilities in N-gram models.
▪ Steps:
1. Collect Text Data: Gather a corpus of text, such as documents, books, or articles.
2. Tokenize Text: Split the text into tokens (words or phrases).
3. Count Frequencies: Count the occurrences of each token and each N-gram.
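A compact sketch of steps 2-3 using Python's collections.Counter:

from collections import Counter

corpus = ["The cat sat on the mat.", "The cat is on the mat."]
unigram_counts, bigram_counts = Counter(), Counter()
for sentence in corpus:
    # Lowercase and strip the final period before splitting into words.
    words = sentence.lower().rstrip(".").split()
    unigram_counts.update(words)
    bigram_counts.update(zip(words, words[1:]))

print(unigram_counts)  # e.g. Counter({'the': 4, 'cat': 2, 'on': 2, 'mat': 2, 'sat': 1, 'is': 1})
print(bigram_counts)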
Example
Consider a small corpus consisting of two sentences:
1. "The cat sat on the mat."
2. "The cat is on the mat."

Counting Unigrams (lowercasing the text):
• "the": 4
• "cat": 2
• "sat": 1
• "on": 2
• "mat": 2
• "is": 1

Counting Bigrams:
• "the cat": 2
• "cat sat": 1
• "sat on": 1
• "on the": 2
• "the mat": 2
• "cat is": 1
• "is on": 1
Calculating Bigram Probability
▪ Calculation of Bigram Probabilities:
To calculate the probability of "the" given "on":
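From the counts above, the maximum-likelihood bigram estimate is:

P(the | on) = Count("on the") / Count("on") = 2 / 2 = 1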
Numerical
▪ Corpus:
<s> I am a human </s>
<s> I am not a stone </s>
<s> I I live in Mumbai </s>
Check the probability of "I I am not" using bigrams.
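One possible worked solution, assuming the sentence-boundary marker <s> is included and maximum-likelihood bigram estimates are used:

Bigram counts: C(<s> I) = 3, C(I I) = 1, C(I am) = 2, C(am not) = 1
Unigram counts: C(<s>) = 3, C(I) = 4, C(am) = 2

P(I | <s>)  = 3/3 = 1
P(I | I)    = 1/4
P(am | I)   = 2/4 = 1/2
P(not | am) = 1/2

P(<s> I I am not) = 1 × 1/4 × 1/2 × 1/2 = 1/16 = 0.0625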
How do we evaluate models?
▪ Define an evaluation metric (scoring function).
We will want to measure how similar the predictions of the model are to real text.
▪ Train the model on a 'seen' training set.
Perhaps: tune some parameters based on held-out data (disjoint from the training data, meant to emulate unseen data).
▪ Test the model on an unseen test set
(usually from the same source (e.g. WSJ) as the training data). Test data must be disjoint from training and held-out data. Compare models by their scores.
N-gram Sensitivity to Training Corpus in NLP
▪ In N-gram language modeling, the quality and performance of the model heavily depend on the training corpus used to build it. An N-gram model estimates the probability of a word based on the previous (N-1) words. For example, in a trigram model, the probability of a word depends on the two preceding words. The model learns these probabilities from a given corpus, a large body of text.
Data Sparsity and Rare N-grams
▪ N-gram models are very sensitive to unseen or rare word sequences.
▪ If the training corpus is small or lacks sufficient examples of certain word combinations, the model will assign a zero probability to those N-grams, even if they are valid and common in real-world usage. This is known as the zero-frequency problem.
Domain Sensitivity
▪ N-gram models are domain-specific. If trained on medical texts, the model will perform poorly on informal conversations or legal documents because the vocabulary and phrasing differ drastically. The learned N-gram probabilities reflect only the style and terminology of the training data.
Mitigation Techniques
▪ Smoothing (e.g., Laplace, Good-Turing, Kneser-Ney) to handle zero probabilities.
▪ Backoff and interpolation strategies to fall back on smaller N-grams when higher ones are missing.
▪ Larger and more diverse training corpora to improve generalization.
Unknown Words: Open vs. Closed Vocabulary
1. Open Vocabulary:
▪ Definition: An open vocabulary task assumes that the system may encounter new or unseen words that were not present in the training vocabulary.
▪ Characteristics:
   ▪ More realistic for real-world applications, such as:
      ▪ Web search queries
      ▪ Social media text analysis
      ▪ Conversational agents or chatbots
   ▪ OOV handling becomes more complex and essential
Techniques for Handling OOV in Open Vocabulary Tasks
▪ Smoothing (e.g., Laplace, Good-Turing, Kneser-Ney): Helps assign some non-zero probability to unseen N-grams (including those with unknown words).
▪ Backoff and Interpolation: Back off to lower-order N-grams when the full N-gram is not found.
▪ Subword Models / Morphological Analysis: Break unknown words into parts like prefixes, suffixes, or character-level n-grams (e.g., "unhappiness" → "un", "happi", "ness").

Example:
▪ If the model encounters the word "joyful" (unseen in training), it may use:
   ▪ Subword N-grams like "joy" + "ful"
   ▪ Character-based trigrams: "j-o-y", "o-y-f", "y-f-u", etc.
   ▪ Morphological analysis: base word "joy" + suffix "-ful"
Closed Vocabulary Tasks
▪ Definition: A closed vocabulary task assumes a fixed and known vocabulary. All input text is expected to consist of words that already appear in the training vocabulary.
▪ Characteristics: The model does not need to handle unknown words at inference time. Any input word not in the vocabulary is typically discarded or replaced with a special symbol like <UNK> (unknown).
▪ Suitable for controlled environments, like:
   ▪ Command recognition systems
   ▪ Machine translation in a limited domain
   ▪ Speech recognition with predefined prompts
Closed Vocabulary
▪ Handling OOV: OOV words are replaced with a token like <UNK> during preprocessing. The model is trained with some proportion of words masked as <UNK> to learn the likelihood of <UNK> appearing in context.
▪ Example: If the vocabulary is {"I", "am", "happy"} and the test sentence is "I am joyful", then "joyful" is not in the vocabulary, so it is replaced with <UNK>: "I am <UNK>" (see the sketch below).
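A tiny sketch of the <UNK> replacement step during preprocessing:

VOCAB = {"I", "am", "happy"}

def replace_oov(sentence, vocab=VOCAB):
    # Replace any out-of-vocabulary word with the special <UNK> token.
    return " ".join(w if w in vocab else "<UNK>" for w in sentence.split())

print(replace_oov("I am joyful"))  # I am <UNK>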
Extrinsic Evaluation of Language Models
▪ The best way to evaluate the performance of a language model is to embed it in an application and measure how much the application improves. Such end-to-end evaluation is called extrinsic evaluation.
▪ Extrinsic evaluation is the only way to know if a particular improvement in the language model (or any component) is really going to help the task at hand.
▪ Thus, for evaluating n-gram language models that are a component of some task like speech recognition or machine translation, we can compare the performance of two candidate language models by running the speech recognizer or machine translator twice, once with each language model, and seeing which gives the more accurate transcription.
Training set, development set and the test set
▪ The training set is the data we use to learn the parameters of our model; for simple n-gram language models it's the corpus from which we get the counts that we normalize into the probabilities of the n-gram language model.
▪ The test set is a different, held-out set of data, not overlapping with the training set, that we use to evaluate the model.
▪ Even if we don't train on the test set, if we test our language model on the test set many times after making different changes, we might implicitly tune to its characteristics, by noticing which changes seem to make the model better. For this reason, we only want to run our model on the test set once, or a very few times, once we are sure our model is ready.
DevSet
▪ For this reason, we normally instead have a third dataset called a development test set, or devset. We do all our testing on this dataset until the very end, and then we test on the test set once to see how good our model is.
Solved Example
▪ Perplexity in n-gram modeling measures how well a language model predicts a sample. It's essentially the inverse probability of the test set, normalized by the number of words. A lower perplexity score indicates a better-performing model. Here's how to calculate it:

1. Calculate the probability of the sentence:
▪ Use the n-gram model (e.g., bigram, trigram) to find the probability of each word given its preceding (n-1) words.
▪ Multiply these conditional probabilities together to get the overall probability of the sentence.

2. Normalize the probability:
▪ Raise the sentence probability to the power of (1/number of words). This normalizes for sentence length.

3. Calculate Perplexity:
▪ Take the reciprocal of the normalized probability.
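Putting the three steps together, the perplexity of a test sequence W = w1 w2 ... wN is:

PP(W) = P(w1 w2 ... wN) ^ (-1/N)

For a bigram model this becomes PP(W) = [ product over i of P(wi | wi-1) ] ^ (-1/N).

A quick illustration: if a 4-word test sentence receives probability P = 1/16 under the model, then PP = (1/16) ^ (-1/4) = 16 ^ (1/4) = 2.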
Laplacian Smoothing
▪ Laplacian Smoothing (also called Add-One Smoothing) is a technique used in n-gram language models to handle the problem of zero probability for unseen n-grams in a given corpus.
Problem with raw n-gram models
▪ In an unsmoothed n-gram model, probabilities are estimated based on frequency counts:
  P(wi | wi-1) = Count(wi-1, wi) / Count(wi-1)
▪ If an n-gram has never occurred in the training corpus, the count becomes 0, and hence its probability becomes zero, which is problematic for real-world applications like speech recognition or text prediction.
What is Laplacian Smoothing?
▪ Laplacian Smoothing solves this by adding 1 to all counts, including unseen ones, ensuring no zero probabilities.

General Formula (for N-gram)
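The standard add-one (Laplace) estimate for a bigram, where V is the vocabulary size (number of distinct word types), is:

P_Laplace(wi | wi-1) = (Count(wi-1, wi) + 1) / (Count(wi-1) + V)

More generally, for an n-gram with history h:

P_Laplace(wi | h) = (Count(h, wi) + 1) / (Count(h) + V)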
Example
▪ Suppose we have the following corpus:
"the cat sat on the mat"
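A possible worked example on this corpus (vocabulary {the, cat, sat, on, mat}, so V = 5):

Counts: C(the) = 2, C(the cat) = 1, C(the mat) = 1, C(the sat) = 0

Unsmoothed: P(cat | the) = 1/2,  P(sat | the) = 0/2 = 0
Laplace:    P(cat | the) = (1 + 1) / (2 + 5) = 2/7
            P(sat | the) = (0 + 1) / (2 + 5) = 1/7

Every bigram, seen or unseen, now receives a non-zero probability.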
Good-Turing Discounting
▪ Good-Turing discounting is a statistical technique used to estimate the probability of events, especially infrequent or unseen events, in a given dataset or corpus.
▪ It's widely applied in natural language processing (NLP) to address the data sparsity problem in language models, where many n-grams (sequences of words) might have zero or very low frequencies in the training data, leading to inaccurate probability estimates.
Good-Turing Discounting
▪ The central idea behind Good-Turing discounting is that the probability of encountering an unseen event can be estimated by considering the frequency of events that occurred a certain number of times in the past.
▪ In simpler terms, it "borrows" probability mass from frequently occurring events and redistributes it to account for potentially unseen or rare events.
Example
Let's say we have a small corpus of text and are calculating bigram probabilities. We have the following counts for different bigrams:
▪ "the cat": 2 times
▪ "the dog": 2 times
▪ "a cat": 1 time
▪ "a dog": 1 time
▪ "cat sat": 1 time
▪ "dog ran": 1 time
Steps:
1. Calculate N(c):
▪ N(c) is the number of bigram types that occur exactly c times.
▪ N(1) = 4 (a cat, a dog, cat sat, dog ran)
▪ N(2) = 2 (the cat, the dog)
▪ N(0) = ? (number of unseen bigrams, which we are trying to estimate)

2. Calculate c*:
▪ Good-Turing uses the following formula to adjust the counts: c* = (c + 1) * N(c+1) / N(c)
▪ For the bigrams that appeared once (c=1): c* = (1+1) * (N(2) / N(1)) = 2 * (2/4) = 1
▪ For the bigrams that appeared twice (c=2): c* = (2+1) * (N(3) / N(2)) = 3 * (0/2) = 0 (assuming no bigrams appeared 3 times)
3. Estimate probability of unseen bigrams:
▪ The total probability mass reserved for unseen bigrams is estimated as N(1)/N, where N is the total number of bigram tokens in the corpus.
▪ N = 2+2+1+1+1+1 = 8
▪ N(1) = 4 (bigram types seen once)
▪ Probability mass of unseen bigrams = 4/8 = 1/2

4. Calculate adjusted probabilities:
▪ For bigrams seen once (c=1): The adjusted count is 1, and the probability is 1/8.
▪ For bigrams seen twice (c=2): The adjusted count is 0, and the probability is 0/8 = 0.
▪ For unseen bigrams (taken together): The probability mass is 1/2.
Results:
▪ "the cat" and "the dog" (seen twice) will have a probability of 0 after discounting (an artifact of this tiny example, since no bigram was seen three times).
▪ "a cat", "a dog", "cat sat", "dog ran" (seen once) will each have a discounted probability of 1/8.
▪ Unseen bigrams together will share a probability mass of 1/2.
