Week 2
Spelling Correction: Edit Distance
Pawan Goyal
CSE, IITKGP
Week 2: Lecture 1
Spelling Correction

Given a misspelled word such as 'behaf', suggest candidate corrections:
- behalf
- behave
- ....
Edit Distance

The minimum edit distance between two strings is the minimum number of editing operations needed to transform one string into the other:
- Insertion
- Deletion
- Substitution
Minimum Edit Distance

Example
Edit distance from 'intention' to 'execution'
Minimum Edit Distance

If each operation has a cost of 1 (Levenshtein):
- the distance between 'intention' and 'execution' is 5
If substitution costs 2 (alternate version):
- the distance between them is 8
How to find the Minimum Edit Distance?

Searching for a path (sequence of edits) from the start string to the final string:
- Initial state: the word we are transforming
- Operators: insert, delete, substitute
- Goal state: the word we are trying to get to
- Path cost: what we want to minimize: the number of edits
Minimum Edit as Search

How to navigate?
- The space of all edit sequences is huge
- Many distinct paths end up at the same state
- We don't have to keep track of all of them
- Keep track of the shortest path to each state
Defining Minimum Edit Distance Matrix

For two strings, X of length n and Y of length m, we define D(i, j) as the edit distance between X[1..i] and Y[1..j], i.e., between the first i characters of X and the first j characters of Y.

Thus, the edit distance between X and Y is D(n, m).
Computing Minimum Edit Distance

Dynamic Programming
A tabular computation of D(n, m): solving problems by combining solutions to subproblems.

Bottom-up:
- Compute D(i, j) for small i, j
- Compute larger D(i, j) based on previously computed smaller values
- Compute D(i, j) for all i and j till you get to D(n, m)
Dynamic Programming Algorithm

Initialization:
$$D(i, 0) = i, \qquad D(0, j) = j$$

Recurrence, for each i = 1..n and j = 1..m:
$$D(i,j) = \min \begin{cases} D(i-1, j) + 1 & \text{(deletion)} \\ D(i, j-1) + 1 & \text{(insertion)} \\ D(i-1, j-1) + \begin{cases} 1 & \text{if } X[i] \neq Y[j] \\ 0 & \text{otherwise} \end{cases} & \text{(substitution)} \end{cases}$$

Termination: D(n, m) is the minimum edit distance.
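A minimal Python sketch of this tabular computation, with unit costs as in the Levenshtein setting above (the function and variable names are ours, not from the lecture):

```python
def min_edit_distance(x, y):
    """Levenshtein distance between strings x and y, computed bottom-up."""
    n, m = len(x), len(y)
    # D[i][j] = edit distance between x[:i] and y[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i              # delete all i characters of x
    for j in range(m + 1):
        D[0][j] = j              # insert all j characters of y
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if x[i - 1] == y[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,        # deletion
                          D[i][j - 1] + 1,        # insertion
                          D[i - 1][j - 1] + sub)  # substitution / match
    return D[n][m]

print(min_edit_distance("intention", "execution"))  # 5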
The Edit Distance Table

[table: D(i, j) filled in for 'intention' vs. 'execution']
Computing Alignments

- We often need to align characters of the two strings to each other
- We do this by keeping a "backtrace"
- Trace back the path from the upper right corner to read off the alignment
Minimum Edit with Backtrace

[table: edit distance cells annotated with backtrace pointers]
Adding Backtrace to Minimum Edit

[pseudocode: the DP algorithm augmented to record, at each cell, which operation achieved the minimum]
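A hedged Python sketch of this augmentation (our own naming): each cell additionally records which operation achieved the minimum, and the alignment is read off by walking the pointers back from (n, m).

```python
def min_edit_alignment(x, y):
    """Edit distance plus backtrace: returns (distance, list of edit operations)."""
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]  # backtrace pointers
    for i in range(1, n + 1):
        D[i][0], ptr[i][0] = i, "del"
    for j in range(1, m + 1):
        D[0][j], ptr[0][j] = j, "ins"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if x[i - 1] == y[j - 1] else 1
            choices = [(D[i - 1][j - 1] + sub, "sub"),  # substitution / match
                       (D[i - 1][j] + 1, "del"),        # deletion
                       (D[i][j - 1] + 1, "ins")]        # insertion
            D[i][j], ptr[i][j] = min(choices)
    # trace back from (n, m) to (0, 0), reading off the operations
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        op = ptr[i][j]
        ops.append(op)
        if op == "sub":
            i, j = i - 1, j - 1
        elif op == "del":
            i -= 1
        else:
            j -= 1
    return D[n][m], list(reversed(ops))

dist, ops = min_edit_alignment("intention", "execution")
print(dist, ops)  # 5, and the operation at each aligned position
```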
The distance matrix

Every path from (0,0) to (M,N) corresponds to an alignment of the two sequences.

An optimal alignment is composed of optimal sub-alignments.
Result of Backtrace

[figure: character-by-character alignment of 'intention' and 'execution' read off from the backtrace]
Performance

Time: O(nm)
Space: O(nm)
Backtrace: O(n + m)
Weighted Edit Distance, Other variations
Pawan Goyal
CSE, IITKGP
Week 2: Lecture 2
Weighted Edit Distance

Why add weights to the computation?
Some letters are more likely to be mistyped than others.
Confusion Matrix for Spelling Errors

[table: counts of which letters are mistyped as which, from a spelling-error corpus]
Keyboard Design

[figure: keyboard layout]
Weighted Minimum Edit Distance

[slide: the dynamic programming recurrence with operation-specific costs]
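The slide itself is an image; a standard formulation of the weighted recurrence, with per-operation costs drawn from tables like the confusion matrix above (the notation is assumed, not taken verbatim from the slide):

$$D(0,0) = 0, \qquad D(i,0) = D(i-1,0) + \mathrm{del}[x_i], \qquad D(0,j) = D(0,j-1) + \mathrm{ins}[y_j]$$

$$D(i,j) = \min \begin{cases} D(i-1, j) + \mathrm{del}[x_i] \\ D(i, j-1) + \mathrm{ins}[y_j] \\ D(i-1, j-1) + \mathrm{sub}[x_i, y_j] \end{cases}$$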
How to modify the algorithm with transpose?

Transpose
transpose(x, y) = (y, x)
Also known as metathesis

Modification to the dynamic programming algorithm:

$$D(i,j) = \min \begin{cases} D(i-1, j) + 1 & \text{(deletion)} \\ D(i, j-1) + 1 & \text{(insertion)} \\ D(i-1, j-1) + \begin{cases} 1 & \text{if } x[i] \neq y[j] \\ 0 & \text{otherwise} \end{cases} & \text{(substitution)} \\ D(i-2, j-2) + 1 & \text{if } x[i] = y[j-1] \text{ and } x[i-1] = y[j] \text{ (transposition)} \end{cases}$$
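A Python sketch of the resulting (restricted) Damerau-Levenshtein distance; it extends the earlier min_edit_distance with the fourth case (names are ours):

```python
def damerau_levenshtein(x, y):
    """Levenshtein distance plus transposition of two adjacent characters."""
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,         # deletion
                          D[i][j - 1] + 1,         # insertion
                          D[i - 1][j - 1] + cost)  # substitution / match
            # transposition of two adjacent characters
            if i > 1 and j > 1 and x[i - 1] == y[j - 2] and x[i - 2] == y[j - 1]:
                D[i][j] = min(D[i][j], D[i - 2][j - 2] + 1)
    return D[n][m]

print(damerau_levenshtein("acress", "caress"))  # 1 (one transposition)
```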
How to find dictionary entries with smallest edit distance?

Naïve Method
Compute the edit distance from the query term to each dictionary term – an exhaustive search.

Can be made efficient if we do it over a trie structure.
How to find dictionary entries with smallest edit distance?

Generate all terms within a small edit distance (delete + transpose + substitution + insertion) from the query term and search for them in the dictionary.

For a word of length 9 and an alphabet of size 36, this will lead to 114,324 terms to search for.
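A sketch of the edit-distance-1 generator in the style of Peter Norvig's spelling corrector (the 36-symbol alphabet of letters plus digits is an assumption chosen to match the count above; applying edits1 to each of its outputs gives the distance-2 set):

```python
import string

ALPHABET = string.ascii_lowercase + string.digits  # 36 symbols, as assumed above

def edits1(word):
    """All strings within edit distance 1 of `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    substitutions = [L + c + R[1:] for L, R in splits if R for c in ALPHABET]
    inserts = [L + c + R for L, R in splits for c in ALPHABET]
    return set(deletes + transposes + substitutions + inserts)

# the edit-distance-2 set: every edit of every edit
candidates = {e2 for e1 in edits1("keyboards") for e2 in edits1(e1)}
```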
How to find dictionary entries with smallest edit distance?

Generate terms within edit distance 2 (deletes only) from each dictionary term (offline).
Generate terms within edit distance 2 (deletes) from the input term and search in the dictionary.

The number of deletes within edit distance 2 for a word of length 9 will be 45.
A further check is required to remove the false positives.
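A sketch of the delete-only generator (names ours); note how few variants it produces compared to the full edit set above:

```python
def deletes_up_to_2(word):
    """All strings obtainable from `word` by 1 or 2 character deletions.
    For a length-9 word with distinct letters: 9 + C(9,2) = 45 variants."""
    d1 = {word[:i] + word[i + 1:] for i in range(len(word))}
    d2 = {w[:i] + w[i + 1:] for w in d1 for i in range(len(w))}
    return d1 | d2

print(len(deletes_up_to_2("keyboards")))  # 45
```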
Spelling Correction

Non-word spelling errors: behaf → behalf
Real-word spelling errors:
- Typographical errors: three → there
- Cognitive errors (homophones): piece → peace, too → two
Non-word spelling errors

Detection: any word not in a dictionary is an error.
The larger the dictionary the better.
Real word spelling errors

For each word w, generate a candidate set:
- Find candidate words with similar pronunciations
- Find candidate words with similar spelling
- Include w in the candidate set
Noisy Channel Model for Spelling Correction
Pawan Goyal
CSE, IITKGP
Week 2: Lecture 3
Noisy Channel

Given an observed (possibly misspelled) word x, find the word w in the vocabulary V that maximizes P(w|x):

$$\hat{w} = \arg\max_{w \in V} P(w|x) = \arg\max_{w \in V} \frac{P(x|w)P(w)}{P(x)} = \arg\max_{w \in V} P(x|w)P(w)$$
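In code the decision rule is a one-line argmax; a minimal sketch, assuming the candidate set and the two probability functions are supplied elsewhere:

```python
def correct(x, candidates, channel_prob, lm_prob):
    """Return argmax over candidate words w of P(x|w) * P(w).
    channel_prob(x, w) and lm_prob(w) are assumed to be given."""
    return max(candidates, key=lambda w: channel_prob(x, w) * lm_prob(w))
```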
Non-word spelling error: acress

Words with similar spelling: small edit distance to the error
Words with similar pronunciation: small edit distance of pronunciation to the error

Damerau-Levenshtein edit distance
Minimum edit distance, where the edits are:
- Insertion, Deletion, Substitution
- Transposition of two adjacent letters
Words within edit distance 1 of acress

[table: candidate corrections such as actress, cress, caress, access, across, acres, with the error type for each]
Candidate generation

80% of errors are within edit distance 1.
Almost all errors are within edit distance 2.
Computing error probability: confusion matrix

del[x,y]: count(xy typed as x)
ins[x,y]: count(x typed as xy)
sub[x,y]: count(x typed as y)
trans[x,y]: count(xy typed as yx)
Channel model

[slide: formulas for P(x|w) in terms of the confusion-matrix counts]
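The slide formulas are not recoverable from the extraction; a standard way of turning these counts into the channel model, following Kernighan, Church and Gale (1990), is:

$$P(x|w) = \begin{cases} \dfrac{\mathrm{del}[w_{i-1}, w_i]}{\mathrm{count}[w_{i-1} w_i]} & \text{if deletion} \\[2mm] \dfrac{\mathrm{ins}[w_{i-1}, x_i]}{\mathrm{count}[w_{i-1}]} & \text{if insertion} \\[2mm] \dfrac{\mathrm{sub}[x_i, w_i]}{\mathrm{count}[w_i]} & \text{if substitution} \\[2mm] \dfrac{\mathrm{trans}[w_i, w_{i+1}]}{\mathrm{count}[w_i w_{i+1}]} & \text{if transposition} \end{cases}$$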
Channel model for acress

[table: channel probabilities P(x|w) for each candidate correction of acress]
Noisy channel probability for acress

[table: P(x|w) · P(w) for each candidate correction of acress]
Using a bigram language model

Counts from the Corpus of Contemporary American English with add-1 smoothing:

P(actress|versatile) = 0.000021, P(across|versatile) = 0.000021
P(whose|actress) = 0.0010, P(whose|across) = 0.000006

P("versatile actress whose") = 0.000021 × 0.0010 = 210 × 10⁻¹⁰
P("versatile across whose") = 0.000021 × 0.000006 ≈ 1 × 10⁻¹⁰
Real-word spelling errors

The study was conducted mainly be John Black
The design an construction of the system ...

25-40% of spelling errors are real words
Noisy channel for real-word spell correction

Given a sentence $X = w_1, w_2, w_3, \ldots, w_n$, generate a candidate set for each word:

$\text{Candidate}(w_1) = \{w_1, w'_1, w''_1, w'''_1, \ldots\}$
$\text{Candidate}(w_2) = \{w_2, w'_2, w''_2, w'''_2, \ldots\}$
$\text{Candidate}(w_3) = \{w_3, w'_3, w''_3, w'''_3, \ldots\}$

Choose the sequence W that maximizes P(W|X)
Noisy channel for real-word spell correction

[figure: lattice of candidate words at each sentence position; the correction is the best path through the lattice]
Simplification: One error per sentence

Example: two of thew

$w_1, w''_2, w_3$: two off thew
$w_1, w_2, w'_3$: two of the
$w'''_1, w_2, w_3$: too of thew

Choose the sequence W that maximizes P(W|X)
Getting the probability values

Noisy Channel
$$\hat{W} = \arg\max_{W \in S} P(W|X) = \arg\max_{W \in S} P(X|W)P(W)$$
where X is the observed sentence and S is the set of all the possible sequences from the candidate set.

P(X|W)
- Same as for non-word spelling correction
- Also requires the probability of no error, P(w|w)
Probability of no error

What is the probability for a correctly typed word? P("the"|"the")
Computing P(W)

Use a Language Model:
- Unigram
- Bigram
- ...
Context Sensitive Spelling Correction

Use a Language Model:
P(about fifteen minutes from) > P(about fifteen minuets from)

Speech Recognition
P(I saw a van) >> P(eyes awe of an)
Machine Translation
Which sentence is more plausible in the target language?
P(high winds) > P(large winds)

Other Applications
- Context Sensitive Spelling Correction
- Natural Language Generation
- ...
A language model also supports predicting the completion of a sentence:
- Please turn off your cell ...
- Your program does not ...

Predictive text input systems can guess what you are typing and give choices on how to complete it.
Goal: compute the probability of a sequence of words:
$$P(W) = P(w_1, w_2, w_3, \ldots, w_n)$$

Related task: probability of an upcoming word:
$$P(w_4 | w_1, w_2, w_3)$$

A model that computes either of these is called a language model.
How to compute the joint probability P(about, fifteen, minutes, from)?

Basic Idea
Rely on the Chain Rule of Probability
Conditional Probabilities

$$P(B|A) = \frac{P(A, B)}{P(A)}$$
$$P(A, B) = P(A)P(B|A)$$

More Variables
$$P(A, B, C, D) = P(A)P(B|A)P(C|A, B)P(D|A, B, C)$$

The Chain Rule in General
$$P(x_1, x_2, \ldots, x_n) = P(x_1)P(x_2|x_1)P(x_3|x_1, x_2) \cdots P(x_n|x_1, \ldots, x_{n-1})$$

Applied to a word sequence:
$$P(w_1 w_2 \ldots w_n) = \prod_i P(w_i | w_1 \ldots w_{i-1})$$
Estimating these probabilities directly from counts:

$$P(\text{office} \mid \text{about fifteen minutes from}) = \frac{\text{Count(about fifteen minutes from office)}}{\text{Count(about fifteen minutes from)}}$$
Markov Assumption

$$P(\text{office} \mid \text{about fifteen minutes from}) \approx P(\text{office} \mid \text{from})$$

More generally, condition on only the last k words:

$$P(w_1 w_2 \ldots w_n) \approx \prod_i P(w_i \mid w_{i-k} \ldots w_{i-1})$$

We approximate each component in the product:
- Unigram: P(office)
- Bigram: P(office | from)
- Trigram: P(office | minutes from)
In general, an insufficient model of language, because language has long-distance dependencies:
"The computer which I had just put into the machine room on the fifth floor crashed."
Estimating bigram probabilities

Maximum Likelihood Estimate (MLE): the value that makes the observed data the "most probable":

$$P(w_i | w_{i-1}) = \frac{\text{count}(w_{i-1}, w_i)}{\text{count}(w_{i-1})} = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}$$
An example

$$P(w_i | w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}$$

Corpus:
<s> I am here </s>
<s> who am I </s>
<s> I would like to know </s>
Estimating bigrams

P(I | <s>) = 2/3
P(</s> | here) = 1
P(would | I) = 1/3
P(here | am) = 1/2
P(know | like) = 0
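These estimates are easy to verify in Python; a minimal sketch over the (reconstructed) three-sentence corpus above:

```python
from collections import Counter

corpus = ["<s> I am here </s>",
          "<s> who am I </s>",
          "<s> I would like to know </s>"]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    tokens = sent.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def p_mle(w, prev):
    """MLE bigram probability P(w | prev) = c(prev, w) / c(prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

print(p_mle("I", "<s>"))     # 2/3
print(p_mle("would", "I"))   # 1/3
print(p_mle("know", "like")) # 0.0
```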
[table: bigram counts from the restaurant corpus]

Normalize by unigrams:

Bigram Probabilities
[table: bigram probabilities]
P(<s> I want english food </s>)
= P(I | <s>) × P(want | I) × P(english | want) × P(food | english) × P(</s> | food)
= 0.000031
What kinds of knowledge?

P(english | want) = .0011
P(chinese | want) = .0065
P(to | want) = .66
P(eat | to) = .28
P(food | to) = 0
P(want | spend) = 0
P(i | <s>) = .25
Practical Issues

We do everything in log space:
- Avoids underflow
- Adding is faster than multiplying

$$\log(p_1 \times p_2 \times p_3 \times p_4) = \log p_1 + \log p_2 + \log p_3 + \log p_4$$

Handling zeros: use smoothing
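A small illustration in Python (the probability values are arbitrary):

```python
import math

# sum log-probabilities instead of multiplying probabilities
probs = [0.25, 0.33, 0.0011, 0.5]
log_p = sum(math.log(p) for p in probs)
print(log_p)            # about -10.0, a safe numeric range
print(math.exp(log_p))  # recover the product if it is not too small
```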
Language Modeling Toolkit

SRILM: http://www.speech.sri.com/projects/srilm/
Google N-gram Release (August 2006)

Number of unigrams: 13,588,391
Number of bigrams: 314,843,401
Number of trigrams: 977,069,902
Number of fourgrams: 1,313,818,354
Number of fivegrams: 1,176,470,663

http://googleresearch.blogspot.in/2006/08/all-our-n-gram-are-belong-to-you.html
Example 4-gram counts from the Google N-gram release:

serve as the inspiration 1390
serve as the installation 136
serve as the institute 187
serve as the institution 279
serve as the institutional 461
Evaluation of Language Models, Basic Smoothing
Pawan Goyal
CSE, IITKGP
Week 2: Lecture 5
Evaluating Language Model

Does our language model prefer good (frequently observed) sentences to ungrammatical (or rarely observed) ones?

Parameters of the model are trained on a large corpus of text, called the training set.
Performance is tested on a disjoint (held-out) test set using an evaluation metric.
Extrinsic evaluation of N-gram models

Best evaluation for comparing two models A and B:
- Use each model for one or more tasks: spelling corrector, speech recognizer, machine translation
- Get accuracy values for A and B
- Compare accuracy for A and B
Intrinsic evaluation: Perplexity

How well can we predict the next word?
- I always order pizza with cheese and ...
- The president of India is ...
- I wrote a ...

A unigram model doesn't work for this game.

A better model of text is one which assigns a higher probability to the actual next word.
Perplexity

The best language model is one that best predicts an unseen test set.

Perplexity (PP(W))
Perplexity is the inverse probability of the test data, normalized by the number of words:

$$PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}}$$

By the chain rule:

$$PP(W) = \left(\prod_i \frac{1}{P(w_i | w_1 \ldots w_{i-1})}\right)^{\frac{1}{N}}$$

For bigrams:

$$PP(W) = \left(\prod_i \frac{1}{P(w_i | w_{i-1})}\right)^{\frac{1}{N}}$$
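A minimal Python sketch computing bigram perplexity in log space (the bigram probability function is assumed to come from a smoothed model, since any zero probability makes perplexity infinite):

```python
import math

def perplexity(tokens, p_bigram):
    """PP(W) = (prod_i 1 / P(w_i | w_{i-1}))^(1/N) over a test sequence."""
    N = len(tokens) - 1            # number of predicted words
    log_sum = sum(math.log(p_bigram(w, prev))
                  for prev, w in zip(tokens, tokens[1:]))
    return math.exp(-log_sum / N)
```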
Example: A Simple Scenario

Consider a sentence of N random digits, each of the 10 digits equally likely. Then:

$$PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \left(\left(\frac{1}{10}\right)^N\right)^{-\frac{1}{N}} = \left(\frac{1}{10}\right)^{-1} = 10$$
Lower perplexity = better model

WSJ Corpus
Training: 38 million words
Test: 1.5 million words

[table: perplexity of unigram, bigram, and trigram models on the test set]

Unigram perplexity: 962?
The model is as confused on test data as if it had to choose uniformly and independently among 962 possibilities for each word.
The Shannon Visualization Method

Generating sentences from a bigram model:
- Choose a random bigram (<s>, w) as per its probability
- Then choose a random bigram (w, x) as per its probability
- And so on until we choose </s>
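A Python sketch of the sampling loop (the model representation is our assumption: each word maps to its possible successors with probabilities):

```python
import random

def shannon_generate(bigram_dist):
    """Sample a sentence from a bigram model.
    bigram_dist: dict mapping a word to a list of (next_word, prob) pairs."""
    word, words = "<s>", []
    while True:
        successors, probs = zip(*bigram_dist[word])
        word = random.choices(successors, weights=probs)[0]
        if word == "</s>":
            return " ".join(words)
        words.append(word)
```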
Shakespeare as Corpus

N = 884,647 tokens, V = 29,066
Shakespeare produced 300,000 bigram types out of V² = 844 million possible bigrams.
So, about 99.96% of the possible bigrams were never seen.
Approximating Shakespeare

[figure: sample sentences generated from N-gram models of increasing order trained on Shakespeare]
Problems with simple MLE estimate: zeros

Training set:
... denied the allegations
... denied the reports
... denied the claims
... denied the request

Test data:
... denied the offer
... denied the loan

Bigrams unseen in training get probability 0, so the whole test sentence gets probability 0.
Language Modeling: Smoothing

Steal probability mass to generalize better.
Laplace Smoothing (Add-one estimation)

Pretend as if we saw each word (N-gram) one more time than we actually did: just add one to all the counts!

MLE estimate for bigram:
$$P_{MLE}(w_i | w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}$$

Add-1 estimate:
$$P_{Add\text{-}1}(w_i | w_{i-1}) = \frac{c(w_{i-1}, w_i) + 1}{c(w_{i-1}) + V}$$
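A sketch in Python, reusing the Counter-based counts from the earlier example (V is the vocabulary size):

```python
def p_add1(w, prev, bigrams, unigrams, V):
    """Add-one smoothed bigram probability: (c(prev, w) + 1) / (c(prev) + V)."""
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

# unseen bigrams now get a small non-zero probability:
# p_add1("know", "like", bigrams, unigrams, V=len(unigrams)) > 0
```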
Reconstituted counts as effect of smoothing

Effective bigram count $c^*(w_{n-1} w_n)$:

$$\frac{c^*(w_{n-1} w_n)}{c(w_{n-1})} = \frac{c(w_{n-1} w_n) + 1}{c(w_{n-1}) + V}$$
Comparing with bigrams: Restaurant corpus

[table: original vs. reconstituted bigram counts for the restaurant corpus]
More general formulations: Add-k

$$P_{Add\text{-}k}(w_i | w_{i-1}) = \frac{c(w_{i-1}, w_i) + k}{c(w_{i-1}) + kV}$$

Equivalently, with m = kV:
$$P_{Add\text{-}k}(w_i | w_{i-1}) = \frac{c(w_{i-1}, w_i) + m\left(\frac{1}{V}\right)}{c(w_{i-1}) + m}$$

Unigram prior smoothing:
$$P_{UnigramPrior}(w_i | w_{i-1}) = \frac{c(w_{i-1}, w_i) + m\,P(w_i)}{c(w_{i-1}) + m}$$

A good value of k or m? Can be optimized on a held-out set.
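Both generalizations are one-line changes to the add-one sketch above (k, m, and the unigram prior are the knobs tuned on held-out data):

```python
def p_add_k(w, prev, bigrams, unigrams, V, k=0.5):
    """Add-k smoothed bigram probability."""
    return (bigrams[(prev, w)] + k) / (unigrams[prev] + k * V)

def p_unigram_prior(w, prev, bigrams, unigrams, N, m=1.0):
    """Unigram prior smoothing: back off toward P(w) = c(w) / N."""
    p_w = unigrams[w] / N
    return (bigrams[(prev, w)] + m * p_w) / (unigrams[prev] + m)
```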