Week 2
Spelling Correction: Edit Distance
Pawan Goyal
CSE, IITKGP
Week 2: Lecture 1
Spelling Correction

Given a misspelled word such as 'behaf', suggest candidate corrections:
- behalf
- behave
- ....
Edit Distance

The minimum edit distance between two strings is the minimum number of editing operations needed to transform one string into the other:
- Insertion
- Deletion
- Substitution
Minimum Edit Distance

Example
Edit distance from 'intention' to 'execution'
Minimum Edit Distance

If each operation has a cost of 1 (Levenshtein):
- the distance between 'intention' and 'execution' is 5
If substitution costs 2 (alternate version):
- the distance between them is 8
How to find the Minimum Edit Distance?

Searching for a path (sequence of edits) from the start string to the final string:
- Initial state: the word we are transforming
- Operators: insert, delete, substitute
- Goal state: the word we are trying to get to
- Path cost: what we want to minimize: the number of edits
Minimum Edit as Search

How to navigate?
- The space of all edit sequences is huge
- Many distinct paths end up at the same state
- We don't have to keep track of all of them
- Keep track of the shortest path to each state
Defining Minimum Edit Distance Matrix

For two strings, X of length n and Y of length m, we define D(i, j) as the edit distance between X[1..i] and Y[1..j], i.e., between the first i characters of X and the first j characters of Y.

Thus, the edit distance between X and Y is D(n, m).
Computing Minimum Edit Distance

Dynamic Programming
A tabular computation of D(n, m): solving problems by combining solutions to subproblems.

Bottom-up:
- Compute D(i, j) for small i, j
- Compute larger D(i, j) based on previously computed smaller values
- Compute D(i, j) for all i and j till you get to D(n, m)
Dynamic Programming Algorithm

Initialization:
$$D(i, 0) = i, \qquad D(0, j) = j$$

Recurrence, for each i = 1..n and j = 1..m:
$$D(i,j) = \min \begin{cases} D(i-1, j) + 1 & \text{(deletion)} \\ D(i, j-1) + 1 & \text{(insertion)} \\ D(i-1, j-1) + \begin{cases} 1 & \text{if } X[i] \neq Y[j] \\ 0 & \text{otherwise} \end{cases} & \text{(substitution)} \end{cases}$$

Termination: D(n, m) is the minimum edit distance.
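A minimal Python sketch of this tabular computation, with unit costs as in the Levenshtein setting above (the function and variable names are ours, not from the lecture):

```python
def min_edit_distance(x, y):
    """Levenshtein distance between strings x and y, computed bottom-up."""
    n, m = len(x), len(y)
    # D[i][j] = edit distance between x[:i] and y[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i              # delete all i characters of x
    for j in range(m + 1):
        D[0][j] = j              # insert all j characters of y
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if x[i - 1] == y[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,        # deletion
                          D[i][j - 1] + 1,        # insertion
                          D[i - 1][j - 1] + sub)  # substitution / match
    return D[n][m]

print(min_edit_distance("intention", "execution"))  # 5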
The Edit Distance Table

[table: D(i, j) filled in for 'intention' vs. 'execution']
Computing Alignments

- We often need to align characters of the two strings to each other
- We do this by keeping a "backtrace"
- Trace back the path from the upper right corner to read off the alignment
Minimum Edit with Backtrace

[table: edit distance cells annotated with backtrace pointers]
Adding Backtrace to Minimum Edit

[pseudocode: the DP algorithm augmented to record, at each cell, which operation achieved the minimum]
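A hedged Python sketch of this augmentation (our own naming): each cell additionally records which operation achieved the minimum, and the alignment is read off by walking the pointers back from (n, m).

```python
def min_edit_alignment(x, y):
    """Edit distance plus backtrace: returns (distance, list of edit operations)."""
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]  # backtrace pointers
    for i in range(1, n + 1):
        D[i][0], ptr[i][0] = i, "del"
    for j in range(1, m + 1):
        D[0][j], ptr[0][j] = j, "ins"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if x[i - 1] == y[j - 1] else 1
            choices = [(D[i - 1][j - 1] + sub, "sub"),  # substitution / match
                       (D[i - 1][j] + 1, "del"),        # deletion
                       (D[i][j - 1] + 1, "ins")]        # insertion
            D[i][j], ptr[i][j] = min(choices)
    # trace back from (n, m) to (0, 0), reading off the operations
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        op = ptr[i][j]
        ops.append(op)
        if op == "sub":
            i, j = i - 1, j - 1
        elif op == "del":
            i -= 1
        else:
            j -= 1
    return D[n][m], list(reversed(ops))

dist, ops = min_edit_alignment("intention", "execution")
print(dist, ops)  # 5, and the operation at each aligned position
```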
The distance matrix

Every path from (0,0) to (M,N) corresponds to an alignment of the two sequences.

An optimal alignment is composed of optimal sub-alignments.
Result of Backtrace

[figure: character-by-character alignment of 'intention' and 'execution' read off from the backtrace]
Performance

Time: O(nm)
Space: O(nm)
Backtrace: O(n + m)
Weighted Edit Distance, Other variations
Pawan Goyal
CSE, IITKGP
Week 2: Lecture 2
Weighted Edit Distance

Why add weights to the computation?
Some letters are more likely to be mistyped than others.
Confusion Matrix for Spelling Errors

[table: counts of which letters are mistyped as which, from a spelling-error corpus]
Keyboard Design

[figure: keyboard layout]
Weighted Minimum Edit Distance

[slide: the dynamic programming recurrence with operation-specific costs]
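The slide itself is an image; a standard formulation of the weighted recurrence, with per-operation costs drawn from tables like the confusion matrix above (the notation is assumed, not taken verbatim from the slide):

$$D(0,0) = 0, \qquad D(i,0) = D(i-1,0) + \mathrm{del}[x_i], \qquad D(0,j) = D(0,j-1) + \mathrm{ins}[y_j]$$

$$D(i,j) = \min \begin{cases} D(i-1, j) + \mathrm{del}[x_i] \\ D(i, j-1) + \mathrm{ins}[y_j] \\ D(i-1, j-1) + \mathrm{sub}[x_i, y_j] \end{cases}$$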
How to modify the algorithm with transpose?

Transpose
transpose(x, y) = (y, x)
Also known as metathesis

Modification to the dynamic programming algorithm:

$$D(i,j) = \min \begin{cases} D(i-1, j) + 1 & \text{(deletion)} \\ D(i, j-1) + 1 & \text{(insertion)} \\ D(i-1, j-1) + \begin{cases} 1 & \text{if } x[i] \neq y[j] \\ 0 & \text{otherwise} \end{cases} & \text{(substitution)} \\ D(i-2, j-2) + 1 & \text{if } x[i] = y[j-1] \text{ and } x[i-1] = y[j] \text{ (transposition)} \end{cases}$$
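A Python sketch of the resulting (restricted) Damerau-Levenshtein distance; it extends the earlier min_edit_distance with the fourth case (names are ours):

```python
def damerau_levenshtein(x, y):
    """Levenshtein distance plus transposition of two adjacent characters."""
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,         # deletion
                          D[i][j - 1] + 1,         # insertion
                          D[i - 1][j - 1] + cost)  # substitution / match
            # transposition of two adjacent characters
            if i > 1 and j > 1 and x[i - 1] == y[j - 2] and x[i - 2] == y[j - 1]:
                D[i][j] = min(D[i][j], D[i - 2][j - 2] + 1)
    return D[n][m]

print(damerau_levenshtein("acress", "caress"))  # 1 (one transposition)
```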
How to find dictionary entries with smallest edit distance?

Naïve Method
Compute the edit distance from the query term to each dictionary term – an exhaustive search.

Can be made efficient if we do it over a trie structure.
How to find dictionary entries with smallest edit distance?

Generate all terms within a small edit distance (delete + transpose + substitution + insertion) from the query term and search for them in the dictionary.

For a word of length 9 and an alphabet of size 36, this will lead to 114,324 terms to search for.
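A sketch of the edit-distance-1 generator in the style of Peter Norvig's spelling corrector (the 36-symbol alphabet of letters plus digits is an assumption chosen to match the count above; applying edits1 to each of its outputs gives the distance-2 set):

```python
import string

ALPHABET = string.ascii_lowercase + string.digits  # 36 symbols, as assumed above

def edits1(word):
    """All strings within edit distance 1 of `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    substitutions = [L + c + R[1:] for L, R in splits if R for c in ALPHABET]
    inserts = [L + c + R for L, R in splits for c in ALPHABET]
    return set(deletes + transposes + substitutions + inserts)

# the edit-distance-2 set: every edit of every edit
candidates = {e2 for e1 in edits1("keyboards") for e2 in edits1(e1)}
```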
How to find dictionary entries with smallest edit distance?

Generate terms within edit distance 2 (deletes only) from each dictionary term (offline).
Generate terms within edit distance 2 (deletes) from the input term and search in the dictionary.

The number of deletes within edit distance 2 for a word of length 9 will be 45.
A further check is required to remove the false positives.
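A sketch of the delete-only generator (names ours); note how few variants it produces compared to the full edit set above:

```python
def deletes_up_to_2(word):
    """All strings obtainable from `word` by 1 or 2 character deletions.
    For a length-9 word with distinct letters: 9 + C(9,2) = 45 variants."""
    d1 = {word[:i] + word[i + 1:] for i in range(len(word))}
    d2 = {w[:i] + w[i + 1:] for w in d1 for i in range(len(w))}
    return d1 | d2

print(len(deletes_up_to_2("keyboards")))  # 45
```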
Spelling Correction

Non-word spelling errors: behaf → behalf
Real-word spelling errors:
- Typographical errors: three → there
- Cognitive errors (homophones): piece → peace, too → two
Non-word spelling errors

Detection: any word not in a dictionary is an error.
The larger the dictionary the better.
Real word spelling errors

For each word w, generate a candidate set:
- Find candidate words with similar pronunciations
- Find candidate words with similar spelling
- Include w in the candidate set
Noisy Channel Model for Spelling Correction
Pawan Goyal
CSE, IITKGP
Week 2: Lecture 3
Noisy Channel

Given an observed (possibly misspelled) word x, find the word w in the vocabulary V that maximizes P(w|x):

$$\hat{w} = \arg\max_{w \in V} P(w|x) = \arg\max_{w \in V} \frac{P(x|w)P(w)}{P(x)} = \arg\max_{w \in V} P(x|w)P(w)$$
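In code the decision rule is a one-line argmax; a minimal sketch, assuming the candidate set and the two probability functions are supplied elsewhere:

```python
def correct(x, candidates, channel_prob, lm_prob):
    """Return argmax over candidate words w of P(x|w) * P(w).
    channel_prob(x, w) and lm_prob(w) are assumed to be given."""
    return max(candidates, key=lambda w: channel_prob(x, w) * lm_prob(w))
```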
Non-word spelling error: acress

Words with similar spelling: small edit distance to the error
Words with similar pronunciation: small edit distance of pronunciation to the error

Damerau-Levenshtein edit distance
Minimum edit distance, where the edits are:
- Insertion, Deletion, Substitution
- Transposition of two adjacent letters
Words within edit distance 1 of acress

[table: candidate corrections such as actress, cress, caress, access, across, acres, with the error type for each]
Candidate generation

80% of errors are within edit distance 1.
Almost all errors are within edit distance 2.
Computing error probability: confusion matrix

del[x,y]: count(xy typed as x)
ins[x,y]: count(x typed as xy)
sub[x,y]: count(x typed as y)
trans[x,y]: count(xy typed as yx)
Channel model

[slide: formulas for P(x|w) in terms of the confusion-matrix counts]
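The slide formulas are not recoverable from the extraction; a standard way of turning these counts into the channel model, following Kernighan, Church and Gale (1990), is:

$$P(x|w) = \begin{cases} \dfrac{\mathrm{del}[w_{i-1}, w_i]}{\mathrm{count}[w_{i-1} w_i]} & \text{if deletion} \\[2mm] \dfrac{\mathrm{ins}[w_{i-1}, x_i]}{\mathrm{count}[w_{i-1}]} & \text{if insertion} \\[2mm] \dfrac{\mathrm{sub}[x_i, w_i]}{\mathrm{count}[w_i]} & \text{if substitution} \\[2mm] \dfrac{\mathrm{trans}[w_i, w_{i+1}]}{\mathrm{count}[w_i w_{i+1}]} & \text{if transposition} \end{cases}$$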
Channel model for acress

[table: channel probabilities P(x|w) for each candidate correction of acress]
Noisy channel probability for acress

[table: P(x|w) · P(w) for each candidate correction of acress]
Using a bigram language model

Counts from the Corpus of Contemporary American English with add-1 smoothing:

P(actress|versatile) = 0.000021, P(across|versatile) = 0.000021
P(whose|actress) = 0.0010, P(whose|across) = 0.000006

P("versatile actress whose") = 0.000021 × 0.0010 = 210 × 10⁻¹⁰
P("versatile across whose") = 0.000021 × 0.000006 ≈ 1 × 10⁻¹⁰
Real-word spelling errors

The study was conducted mainly be John Black
The design an construction of the system ...

25-40% of spelling errors are real words
Noisy channel for real-word spell correction

Given a sentence $X = w_1, w_2, w_3, \ldots, w_n$, generate a candidate set for each word:

$\text{Candidate}(w_1) = \{w_1, w'_1, w''_1, w'''_1, \ldots\}$
$\text{Candidate}(w_2) = \{w_2, w'_2, w''_2, w'''_2, \ldots\}$
$\text{Candidate}(w_3) = \{w_3, w'_3, w''_3, w'''_3, \ldots\}$

Choose the sequence W that maximizes P(W|X)
Noisy channel for real-word spell correction

[figure: lattice of candidate words at each sentence position; the correction is the best path through the lattice]
Simplification: One error per sentence

Example: two of thew

$w_1, w''_2, w_3$: two off thew
$w_1, w_2, w'_3$: two of the
$w'''_1, w_2, w_3$: too of thew

Choose the sequence W that maximizes P(W|X)
Getting the probability values

Noisy Channel
$$\hat{W} = \arg\max_{W \in S} P(W|X) = \arg\max_{W \in S} P(X|W)P(W)$$
where X is the observed sentence and S is the set of all the possible sequences from the candidate set.

P(X|W)
- Same as for non-word spelling correction
- Also requires the probability of no error, P(w|w)
Probability of no error

What is the probability for a correctly typed word? P("the"|"the")
Computing P(W)

Use a Language Model:
- Unigram
- Bigram
- ...
Context Sensitive Spelling Correction

Use a Language Model:
P(about fifteen minutes from) > P(about fifteen minuets from)

Speech Recognition
P(I saw a van) >> P(eyes awe of an)
Machine Translation
Which sentence is more plausible in the target language?
P(high winds) > P(large winds)

Other Applications
- Context Sensitive Spelling Correction
- Natural Language Generation
- ...
A language model also supports predicting the completion of a sentence:
- Please turn off your cell ...
- Your program does not ...

Predictive text input systems can guess what you are typing and give choices on how to complete it.
Goal: compute the probability of a sequence of words:
$$P(W) = P(w_1, w_2, w_3, \ldots, w_n)$$

Related task: probability of an upcoming word:
$$P(w_4 | w_1, w_2, w_3)$$

A model that computes either of these is called a language model.
How to compute the joint probability P(about, fifteen, minutes, from)?

Basic Idea
Rely on the Chain Rule of Probability
Conditional Probabilities

$$P(B|A) = \frac{P(A, B)}{P(A)}$$
$$P(A, B) = P(A)P(B|A)$$

More Variables
$$P(A, B, C, D) = P(A)P(B|A)P(C|A, B)P(D|A, B, C)$$

The Chain Rule in General
$$P(x_1, x_2, \ldots, x_n) = P(x_1)P(x_2|x_1)P(x_3|x_1, x_2) \cdots P(x_n|x_1, \ldots, x_{n-1})$$

Applied to a word sequence:
$$P(w_1 w_2 \ldots w_n) = \prod_i P(w_i | w_1 \ldots w_{i-1})$$
Estimating these probabilities directly from counts:

$$P(\text{office} \mid \text{about fifteen minutes from}) = \frac{\text{Count(about fifteen minutes from office)}}{\text{Count(about fifteen minutes from)}}$$
Markov Assumption

$$P(\text{office} \mid \text{about fifteen minutes from}) \approx P(\text{office} \mid \text{from})$$

More generally, condition on only the last k words:

$$P(w_1 w_2 \ldots w_n) \approx \prod_i P(w_i \mid w_{i-k} \ldots w_{i-1})$$

We approximate each component in the product:
- Unigram: P(office)
- Bigram: P(office | from)
- Trigram: P(office | minutes from)
In general, an insufficient model of language, because language has long-distance dependencies:
"The computer which I had just put into the machine room on the fifth floor crashed."
Estimating bigram probabilities

Maximum Likelihood Estimate (MLE): the value that makes the observed data the "most probable":

$$P(w_i | w_{i-1}) = \frac{\text{count}(w_{i-1}, w_i)}{\text{count}(w_{i-1})} = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}$$
An example

$$P(w_i | w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}$$

Corpus:
<s> I am here </s>
<s> who am I </s>
<s> I would like to know </s>
Estimating bigrams

P(I | <s>) = 2/3
P(</s> | here) = 1
P(would | I) = 1/3
P(here | am) = 1/2
P(know | like) = 0
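These estimates are easy to verify in Python; a minimal sketch over the (reconstructed) three-sentence corpus above:

```python
from collections import Counter

corpus = ["<s> I am here </s>",
          "<s> who am I </s>",
          "<s> I would like to know </s>"]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    tokens = sent.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def p_mle(w, prev):
    """MLE bigram probability P(w | prev) = c(prev, w) / c(prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

print(p_mle("I", "<s>"))     # 2/3
print(p_mle("would", "I"))   # 1/3
print(p_mle("know", "like")) # 0.0
```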
[table: bigram counts from the restaurant corpus]

Normalize by unigrams:

Bigram Probabilities
[table: bigram probabilities]
P(<s> I want english food </s>)
= P(I | <s>) × P(want | I) × P(english | want) × P(food | english) × P(</s> | food)
= 0.000031
What kinds of knowledge?

P(english | want) = .0011
P(chinese | want) = .0065
P(to | want) = .66
P(eat | to) = .28
P(food | to) = 0
P(want | spend) = 0
P(i | <s>) = .25
Practical Issues

We do everything in log space:
- Avoids underflow
- Adding is faster than multiplying

$$\log(p_1 \times p_2 \times p_3 \times p_4) = \log p_1 + \log p_2 + \log p_3 + \log p_4$$

Handling zeros: use smoothing
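A small illustration in Python (the probability values are arbitrary):

```python
import math

# sum log-probabilities instead of multiplying probabilities
probs = [0.25, 0.33, 0.0011, 0.5]
log_p = sum(math.log(p) for p in probs)
print(log_p)            # about -10.0, a safe numeric range
print(math.exp(log_p))  # recover the product if it is not too small
```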
Language Modeling Toolkit

SRILM: http://www.speech.sri.com/projects/srilm/
Google N-gram Release (August 2006)

Number of unigrams: 13,588,391
Number of bigrams: 314,843,401
Number of trigrams: 977,069,902
Number of fourgrams: 1,313,818,354
Number of fivegrams: 1,176,470,663

http://googleresearch.blogspot.in/2006/08/all-our-n-gram-are-belong-to-you.html
Example 4-gram counts from the Google N-gram release:

serve as the inspiration 1390
serve as the installation 136
serve as the institute 187
serve as the institution 279
serve as the institutional 461
Evaluation of Language Models, Basic Smoothing
Pawan Goyal
CSE, IITKGP
Week 2: Lecture 5
Evaluating Language Model

Does our language model prefer good (frequently observed) sentences to ungrammatical (or rarely observed) ones?

Parameters of the model are trained on a large corpus of text, called the training set.
Performance is tested on a disjoint (held-out) test set using an evaluation metric.
Extrinsic evaluation of N-gram models

Best evaluation for comparing two models A and B:
- Use each model for one or more tasks: spelling corrector, speech recognizer, machine translation
- Get accuracy values for A and B
- Compare accuracy for A and B
Intrinsic evaluation: Perplexity

How well can we predict the next word?
- I always order pizza with cheese and ...
- The president of India is ...
- I wrote a ...

A unigram model doesn't work for this game.

A better model of text is one which assigns a higher probability to the actual next word.
Perplexity

The best language model is one that best predicts an unseen test set.

Perplexity (PP(W))
Perplexity is the inverse probability of the test data, normalized by the number of words:

$$PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}}$$

By the chain rule:

$$PP(W) = \left(\prod_i \frac{1}{P(w_i | w_1 \ldots w_{i-1})}\right)^{\frac{1}{N}}$$

For bigrams:

$$PP(W) = \left(\prod_i \frac{1}{P(w_i | w_{i-1})}\right)^{\frac{1}{N}}$$
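A minimal Python sketch computing bigram perplexity in log space (the bigram probability function is assumed to come from a smoothed model, since any zero probability makes perplexity infinite):

```python
import math

def perplexity(tokens, p_bigram):
    """PP(W) = (prod_i 1 / P(w_i | w_{i-1}))^(1/N) over a test sequence."""
    N = len(tokens) - 1            # number of predicted words
    log_sum = sum(math.log(p_bigram(w, prev))
                  for prev, w in zip(tokens, tokens[1:]))
    return math.exp(-log_sum / N)
```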
Example: A Simple Scenario

Consider a sentence of N random digits, each of the 10 digits equally likely. Then:

$$PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \left(\left(\frac{1}{10}\right)^N\right)^{-\frac{1}{N}} = \left(\frac{1}{10}\right)^{-1} = 10$$
Lower perplexity = better model

WSJ Corpus
Training: 38 million words
Test: 1.5 million words

[table: perplexity of unigram, bigram, and trigram models on the test set]

Unigram perplexity: 962?
The model is as confused on test data as if it had to choose uniformly and independently among 962 possibilities for each word.
The Shannon Visualization Method

Generating sentences from a bigram model:
- Choose a random bigram (<s>, w) as per its probability
- Then choose a random bigram (w, x) as per its probability
- And so on until we choose </s>
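A Python sketch of the sampling loop (the model representation is our assumption: each word maps to its possible successors with probabilities):

```python
import random

def shannon_generate(bigram_dist):
    """Sample a sentence from a bigram model.
    bigram_dist: dict mapping a word to a list of (next_word, prob) pairs."""
    word, words = "<s>", []
    while True:
        successors, probs = zip(*bigram_dist[word])
        word = random.choices(successors, weights=probs)[0]
        if word == "</s>":
            return " ".join(words)
        words.append(word)
```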
Shakespeare as Corpus

N = 884,647 tokens, V = 29,066
Shakespeare produced 300,000 bigram types out of V² = 844 million possible bigrams.
So, about 99.96% of the possible bigrams were never seen.
Approximating Shakespeare

[figure: sample sentences generated from N-gram models of increasing order trained on Shakespeare]
Problems with simple MLE estimate: zeros

Training set:
... denied the allegations
... denied the reports
... denied the claims
... denied the request

Test data:
... denied the offer
... denied the loan

Bigrams unseen in training get probability 0, so the whole test sentence gets probability 0.
Language Modeling: Smoothing

Steal probability mass to generalize better.
Laplace Smoothing (Add-one estimation)

Pretend as if we saw each word (N-gram) one more time than we actually did: just add one to all the counts!

MLE estimate for bigram:
$$P_{MLE}(w_i | w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}$$

Add-1 estimate:
$$P_{Add\text{-}1}(w_i | w_{i-1}) = \frac{c(w_{i-1}, w_i) + 1}{c(w_{i-1}) + V}$$
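A sketch in Python, reusing the Counter-based counts from the earlier example (V is the vocabulary size):

```python
def p_add1(w, prev, bigrams, unigrams, V):
    """Add-one smoothed bigram probability: (c(prev, w) + 1) / (c(prev) + V)."""
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

# unseen bigrams now get a small non-zero probability:
# p_add1("know", "like", bigrams, unigrams, V=len(unigrams)) > 0
```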
Reconstituted counts as effect of smoothing

Effective bigram count $c^*(w_{n-1} w_n)$:

$$\frac{c^*(w_{n-1} w_n)}{c(w_{n-1})} = \frac{c(w_{n-1} w_n) + 1}{c(w_{n-1}) + V}$$
Comparing with bigrams: Restaurant corpus

[table: original vs. reconstituted bigram counts for the restaurant corpus]
More general formulations: Add-k

$$P_{Add\text{-}k}(w_i | w_{i-1}) = \frac{c(w_{i-1}, w_i) + k}{c(w_{i-1}) + kV}$$

Equivalently, with m = kV:
$$P_{Add\text{-}k}(w_i | w_{i-1}) = \frac{c(w_{i-1}, w_i) + m\left(\frac{1}{V}\right)}{c(w_{i-1}) + m}$$

Unigram prior smoothing:
$$P_{UnigramPrior}(w_i | w_{i-1}) = \frac{c(w_{i-1}, w_i) + m\,P(w_i)}{c(w_{i-1}) + m}$$

A good value of k or m? Can be optimized on a held-out set.
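Both generalizations are one-line changes to the add-one sketch above (k, m, and the unigram prior are the knobs tuned on held-out data):

```python
def p_add_k(w, prev, bigrams, unigrams, V, k=0.5):
    """Add-k smoothed bigram probability."""
    return (bigrams[(prev, w)] + k) / (unigrams[prev] + k * V)

def p_unigram_prior(w, prev, bigrams, unigrams, N, m=1.0):
    """Unigram prior smoothing: back off toward P(w) = c(w) / N."""
    p_w = unigrams[w] / N
    return (bigrams[(prev, w)] + m * p_w) / (unigrams[prev] + m)
```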