0% found this document useful (0 votes)

21 views41 pages

03 Text Processing - Minimum Edit Distance

The document discusses Minimum Edit Distance, a measure of string similarity that quantifies the minimum number of operations required to transform one string into another. It highlights various applications in Natural Language Processing, including autocorrection, plagiarism detection, and machine translation. The document also explains the computation of Minimum Edit Distance using dynamic programming and backtrace methods for aligning strings.

Uploaded by

esmailelhariri272

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

21 views41 pages

03 Text Processing - Minimum Edit Distance

Uploaded by

esmailelhariri272

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 41

TM340 Natural

Language Processing

Text Processing

Minimum Edit Distance

Based on slides by Dan Jurafsky and Chris Manning

Agenda

➢ Definition of Minimum Edit Distance

➢ Computing Minimum Edit Distance

➢ Backtrace for Computing Alignments

➢ Weighted Minimum Edit Distance

2
Definition of Minimum Edit
Distance

3
How similar are two strings?

➢ In spell correction application:

• The user typed “graffe”
➢ Which is closest?
• graf
• graft
• grail
• giraffe
4
Minimum Edit Distance

➢ Minimum Edit Distance: is a measure of string similarity

• Minimum edit distance (also known as Levenshtein
distance) quantifies the similarity between two strings by
determining the minimum number of operations (insertions,
deletions, and substitutions) required to transform one
string into the other.
• It provides a quantitative measure of how different or
similar two strings are.

5
Applications in Natural Language Processing (NLP)

➢ Autocorrection: Used to suggest corrections for misspelled words in various

applications like smartphones, email clients, and text processors.

➢ Plagiarism Detection:

• Comparing documents: By calculating the minimum edit distance between two documents, you
can identify similarities and potential plagiarism.
• Identifying copied passages: This can help in academic integrity and content creation.

➢ Machine Translation:

• Alignment of sentences: Minimum edit distance can be used to align words or phrases between
different languages during the translation process.
• Measuring translation quality: It can help evaluate the accuracy of machine translations.

6
Additional Applications of Minimum Edit Distance in NLP

➢ Information Retrieval:

• Fuzzy string matching: When searching for documents or keywords, minimum edit distance can be used
to find matches that are similar but not exact.
• Improving search results: This can enhance the relevance of search results.

➢ Speech Recognition:

• Word recognition: By comparing the acoustic features of a spoken word to a dictionary of known words,
minimum edit distance can help identify the most likely word.
• Error correction: It can be used to correct errors in speech recognition systems.

➢ DNA Sequence Alignment:

• Biological similarity: Minimum edit distance can be used to compare DNA sequences and identify
similarities between organisms.
• Evolutionary analysis: It can help in understanding evolutionary relationships.

7
Autocorrect Process

➢ We will look at how the Minimum Edit Distance can be used in

autocorrection systems.

➢ The autocorrect process can be broken down into four steps:

• Identify an incorrect word
• Find strings (n) edit distance away
• Filter candidate words
• Calculate the word probabilities
8
Autocorrect Process: Example
➢ Consider the following sentence:

"my deah friend"

1. We can easily see that the word deah is incorrect. The intended word is probably dear.

my dear friend

For humans, it is usually quite easy to identify incorrect words. But for a computer, we need to find a
way to identify incorrect words.
The simplified assumption is that if a word is not in the vocabulary (set of words, i.e. it does not contain
duplicates.), then it is probably a typo.

9
Autocorrect Process: Example
➢ Consider the following sentence:

"my deah friend"

2. We can find words that are 1 edit distance away from deah (by applying Edit Distance
algorithm):

• Note that we do not care if the words are valid words or not. We just want to find
words that are 1 edit distance away from deah.
• Substitutions, e.g. dear, dead, deas, yeah

• Insertions, e.g. deahs, deaht, deahh

• Deletions, e.g. dea, deh, eah

10
Autocorrect Process: Example
➢ Consider the following sentence:

"my deah friend"

3. Many words returned by the edit distance algorithm are not valid words. We need to filter these
words out. To do this, we can also check if the word is in the vocabulary. If it is, then we can keep it.

• dear
• dead
• deal
• deas
• ...

11
Autocorrect Process: Example
➢ Consider the following sentence:

"my deah friend"

4. We can calculate the word probabilities to find the most likely word.

• In our case, we can intuitively assume that dear is more likely than dead.
• My dear friend

• My dead friend

12
Edit Distance

➢ Given a string, the minimum edit distance is the minimum

number of single-character edits (insertions, deletions, or
substitutions) required to change one string into the other.

➢ Editing operations includes:

• Insertion
• Deletion
• Substitution
13
Minimum Edit Distance

➢ Two strings and their alignment:

14
Minimum Edit Distance: Edit Cost

➢ If we transform one string into another string, we can calculate

the total edit cost by adding up the costs of all edit operations.

➢ This is referred to as the total edit cost, and this is what we are
trying to minimize.
• If each operation has cost of 1
• Distance between these two strings is 5

• If substitutions cost 2 (Levenshtein)

• Distance between them is 8

15
How to find the Min Edit Distance?

➢ Searching for a path (sequence of edits) from the start string

to the final string:
• Initial state: the word we’re transforming
• Operators: insert, delete, substitute
• Goal state: the word we’re trying to get to
• Path cost: what we want to minimize: the number of edits

16
Minimum Edit as Search

➢ But the space of all edit sequences is huge!

• We can’t afford to navigate naïvely

• Lots of distinct paths wind up at the same state.
• We don’t have to keep track of all of them
• Just the shortest path to each of those revisited states.

17
Computing Minimum Edit Distance

18
Dynamic Programming for
Minimum Edit Distance
➢ Dynamic programming is a method for solving complex
problems by breaking them down into simpler
subproblems.

➢ It is often used when the subproblems are overlapping, i.e.

when subproblems share subproblems.

➢ We can often solve dynamic programming problems by

using recursion.

19
Dynamic Programming for
Minimum Edit Distance
➢ The Minimum Edit Distance algorithm is a dynamic
programming algorithm because it breaks down the
problem of calculating the minimum edit distance into
smaller subproblems, and it uses the results of those
subproblems to solve the larger problem of calculating
the minimum edit distance.

20
Dynamic Programming for
Minimum Edit Distance
➢ When computing the minimum edit distance, we start with a
source word and transform it into the target word.

➢ The algorithm would then return the minimum cost of

transforming the source word into the target word.

➢ We can do that by created a distance matrix D where the

source word is shown on the vertical axis, and the target
word is shown on the horizontal axis.
21
The Distance Matrix

9 N
8 O
7 I
6 T
5 N
4 E
3 T
2 N
1 I
0 #
# E X E C U T I O N
0 1 2 3 4 5 6 7 8 9
22
The Distance Matrix
➢ The following Distance Matrix is bottom-up, it could also be a top-down.

➢ Each letter is assigned an index ( zero based indexing).

➢ We also add an empty character # to the bottom and left of the matrix.
• This is because we need to be able to transform the source word into an
empty string, and the empty string into the target word.
➢ Our goal is to fill each cell D[i,j] with the correct edit distance values
9 N
8 O
7 I
6 T
5 N
4 E
3 T
2 N
1 I
0 #
# E X E C U T I O N
0 1 2 3 4 5 6 7 8 9 23
Minimum Edit Distance
Initialization
9 N 9
8 O 8
D(i,0) = i deletion
7 I 7
D(0,j) = j insertion 6 T 6
5 N 5
4 E 4
3 T 3
2 N 2
1 I 1
0 # 0 1 2 3 4 5 6 7 8 9
# E X E C U T I O N
Example: 0 1 2 3 4 5 6 7 8 9

The cell D[0,0]: # ➔ # (an empty string into an empty string) cost of 0
The cell D[1,0]: I ➔ # (deleting I to so that we have an empty string) cost of 1.
The cell D[0,1]: # ➔ E (inserting E into an empty string) cost of 1.

24
Minimum Edit Distance
Now, consider the cell D[1,1], which represents the transformation I ➔ E, There are three paths to do
that:
- Insert E and delete I (cost1+1= 2) : D(i-1,j) + 1 Delete cost

- Delete I and insert E (cost1+1= 2): D(i,j-1) + 1 Insert cost

- Substitute I with E (cost 2): D(i-1,j-1) + 2 if X(i) ≠ Y(j) Substitute cost

0 if X(i) = Y(j) No change

All edit paths lead to a cost of 2 9 N 9

8 O 8
But we want to find the minimum cost 7 I 7
6 T 6
So, if one of those costs would be smaller 5 N 5
4 E 4
➔ it would be the cost to enter in 3 T 3
2 N 2
1 I 1 2
0 # 0 1 2 3 4 5 6 7 8 9
# E X E C U T I O N
0 1 2 3 4 5 6 7 25 8 9
The Distance Matrix

9 N 9
8 O 8
7 I 7
6 T 6
5 N 5
4 E 4
3 T 3
2 N 2
1 I 1 2
0 # 0 1 2 3 4 5 6 7 8 9
# E X E C U T I O N
0 1 2 3 4 5 6 7 8 9
26
Defining Minimum Edit Distance (Levenshtein)
Initialization

D(i,0) = i
D(0,j) = j

Recurrence Relation:
For each i = 1…M
For each j = 1…N
D(i-1,j) + 1 deletion

D(i,j)= min D(i,j-1) + 1 insertion

D(i-1,j-1) + 2; if X(i) ≠ Y(j) substitution

0; if X(i) = Y(j) no change

Termination:
D(N,M) is distance

27
The Edit Distance Matrix

N 9 8 9 10 11 12 11 10 9 8
O 8 7 8 9 10 11 10 9 8 9
I 7 6 7 8 9 10 9 8 9 10

T 6 5 6 7 8 9 8 9 10 11
N 5 4 5 6 7 8 9 10 11 10
E 4 3 4 5 6 7 8 9 10 9
T 3 4 5 6 7 8 7 8 9 8
N 2 3 4 5 6 7 8 7 8 7
I 1 2 3 4 5 6 7 6 7 8
# 0 1 2 3 4 5 6 7 8 9
# E X E C U T I O N
28
Backtrace for Computing
Alignments

29
Computing alignments

➢ Edit distance isn’t sufficient

• We often need to align each character of the two strings to each other
➢ We do this by keeping a “backtrace”

➢ Every time we enter a cell, remember where we came from

➢ When we reach the end,

• Trace back the path from the upper right corner to read off the
alignment

30
Minimum Edit with Backtrace

31
Minimum Edit with Backtrace

➢ Two strings and their alignment:

i n t e * n t i o n
* e x e c u t i o n
Vertical or Horizontal path = an alignment

32
Minimum Edit with Backtrace

➢ Every non-Increasing path from (0,0)

x0 …………………… xN

to (M, N) corresponds to an
alignment of the two sequences

➢ An optimal alignment is composed

y0 ……………………………… yM
of optimal subalignments
33
Adding Backtrace to Minimum Edit Distance
Initialization

D(i,0) = i
D(0,j) = j

Recurrence Relation:
For each i = 1…M
For each j = 1…N
D(i-1,j) + 1 deletion

D(i,j)= min D(i,j-1) + 1 insertion

D(i-1,j-1) + 2; if X(i) ≠ Y(j) substitution

0; if X(i) = Y(j) no change

LEFT insertion

ptr(i,j)= DOWN deletion

DIAG substitution

Termination:
D(N,M) is distance

34
Weighted Minimum Edit Distance

35
Weighted Edit Distance

➢ Instead of adding fixed cost for all characters:

1 for insert/delete
2 for substitution

➢ It could be better if we assign a weighted cost based

on the operation and the character.

36
Weighted Edit Distance

➢ Why would we add weights to the computation?

• Spell Correction : some letters are more likely to be

mistyped than others
• Biology : certain kinds of deletions or insertions are more
likely than others

37
Confusion matrix for spelling errors

38
Confusion matrix for spelling errors

➢ some letters are more likely to be mistyped due

Keyboard structure

39
Weighted Minimum Edit Distance
Initialization

D(0,0) = 0
D(i,0) = D(i-1,0) + del[x(i)]; 1 < i ≤ N
D(0,j) = D(0,j-1) + ins[y(j)]; 1 < j ≤ M

Recurrence Relation:

D(i-1,j) + del[x(i)]
D(i,j)= min D(i,j-1) + ins[y(j)]
D(i-1,j-1) + sub[x(i),y(j)]
Termination:
D(N,M) is distance

40
Thank You

03 Text Processing - Minimum Edit Distance
No ratings yet
03 Text Processing - Minimum Edit Distance
41 pages
Minimum Edit Distance in NLP
No ratings yet
Minimum Edit Distance in NLP
28 pages
Edit Distance
No ratings yet
Edit Distance
30 pages
Edit Distance
No ratings yet
Edit Distance
19 pages
4 EditDistance
No ratings yet
4 EditDistance
23 pages
2 EditDistance 2023
No ratings yet
2 EditDistance 2023
35 pages
Understanding Minimum Edit Distance
No ratings yet
Understanding Minimum Edit Distance
3 pages
2 EditDistance 2022
No ratings yet
2 EditDistance 2022
37 pages
Edit Dist
No ratings yet
Edit Dist
35 pages
Minimum Edit Distance24
No ratings yet
Minimum Edit Distance24
35 pages
03 Med
No ratings yet
03 Med
35 pages
Calculating Minimum Edit Distance
0% (1)
Calculating Minimum Edit Distance
52 pages
Edit Distance for NLP & Biology
No ratings yet
Edit Distance for NLP & Biology
52 pages
03 Med
No ratings yet
03 Med
52 pages
Definition of Minimum Edit Distance
No ratings yet
Definition of Minimum Edit Distance
49 pages
18-IntroNLP II PDF
No ratings yet
18-IntroNLP II PDF
187 pages
Edit Dist
No ratings yet
Edit Dist
24 pages
L3 Edit Distance
No ratings yet
L3 Edit Distance
23 pages
Spelling Correction: Edit Distance: Pawan Goyal
No ratings yet
Spelling Correction: Edit Distance: Pawan Goyal
67 pages
Minimum Edit Distance.
No ratings yet
Minimum Edit Distance.
12 pages
Lec 6
No ratings yet
Lec 6
19 pages
Spell Correction & Edit Distance
No ratings yet
Spell Correction & Edit Distance
35 pages
Multimedia Application L3
No ratings yet
Multimedia Application L3
49 pages
Week 2
No ratings yet
Week 2
95 pages
Lecture # 15 - New
No ratings yet
Lecture # 15 - New
70 pages
NLP Mrinmoyee Mam
No ratings yet
NLP Mrinmoyee Mam
4 pages
DL UNIT V NLP Application
No ratings yet
DL UNIT V NLP Application
83 pages
Edit Distance
No ratings yet
Edit Distance
59 pages
Week 2
No ratings yet
Week 2
222 pages
Edit Distance & Dynamic Programming
No ratings yet
Edit Distance & Dynamic Programming
30 pages
Compu'ng Minimum Edit Distance
No ratings yet
Compu'ng Minimum Edit Distance
8 pages
Distances
No ratings yet
Distances
40 pages
NLP Lab 5
No ratings yet
NLP Lab 5
2 pages
Error Detection
No ratings yet
Error Detection
6 pages
Assignement 3 1
No ratings yet
Assignement 3 1
3 pages
Lec10 12 Edit Distance
No ratings yet
Lec10 12 Edit Distance
54 pages
Theory I Algorithm Design and Analysis: (13 - Edit Distance and Approximate String Matching)
No ratings yet
Theory I Algorithm Design and Analysis: (13 - Edit Distance and Approximate String Matching)
13 pages
Auto-Correction via N-gram Indexing
No ratings yet
Auto-Correction via N-gram Indexing
5 pages
Dynamic Programming and Single Word Recognizers (Part 1)
No ratings yet
Dynamic Programming and Single Word Recognizers (Part 1)
25 pages
Alignment Algorithm
No ratings yet
Alignment Algorithm
58 pages
A Guided Tour To Approximate String Matching: Gonzalo Navarro
No ratings yet
A Guided Tour To Approximate String Matching: Gonzalo Navarro
58 pages
Levenshtein
No ratings yet
Levenshtein
14 pages
4-Tolerant Retrieval
No ratings yet
4-Tolerant Retrieval
82 pages
String Matching
No ratings yet
String Matching
116 pages
String Matching
No ratings yet
String Matching
66 pages
Spelling Correction
No ratings yet
Spelling Correction
85 pages
UNIT3
No ratings yet
UNIT3
52 pages
12 Strings.v3
No ratings yet
12 Strings.v3
111 pages
B505 Lec.10 DynamicProgramming 1
No ratings yet
B505 Lec.10 DynamicProgramming 1
19 pages
Dynamic Programming 4
No ratings yet
Dynamic Programming 4
107 pages
IR Lecture 3b
No ratings yet
IR Lecture 3b
44 pages
Workshop 01 - Live Session HW0
No ratings yet
Workshop 01 - Live Session HW0
21 pages
String Processing Algorithms
No ratings yet
String Processing Algorithms
111 pages
Java Lect 15
No ratings yet
Java Lect 15
14 pages
Cme4408 p7 Probmodels Med
No ratings yet
Cme4408 p7 Probmodels Med
50 pages
Efficient Algorithms For Normalized Edit Distance: Abdullah N. Arslan, Department of Computer Science
No ratings yet
Efficient Algorithms For Normalized Edit Distance: Abdullah N. Arslan, Department of Computer Science
18 pages
Lecture 2
No ratings yet
Lecture 2
71 pages
String Edit PDF
No ratings yet
String Edit PDF
39 pages
C1 Advanced 4, Test 4, Question 3 - Proposal
No ratings yet
C1 Advanced 4, Test 4, Question 3 - Proposal
2 pages
Civil Engineering Curriculum Update
No ratings yet
Civil Engineering Curriculum Update
78 pages
2024-Jounad-Review of Bond Strength Coefficient (KB) Between FRP Bars and Concrete
No ratings yet
2024-Jounad-Review of Bond Strength Coefficient (KB) Between FRP Bars and Concrete
14 pages
The Disc Filter: Description
No ratings yet
The Disc Filter: Description
4 pages
Rajshree Resume Tech
No ratings yet
Rajshree Resume Tech
2 pages
Set Analysis for Data Analysts
No ratings yet
Set Analysis for Data Analysts
19 pages
Volkswagen - New - Beetle - Factory - Repair - Manual - 1998-2010 - Fuel System
No ratings yet
Volkswagen - New - Beetle - Factory - Repair - Manual - 1998-2010 - Fuel System
160 pages
35 Basic Interview Questions
No ratings yet
35 Basic Interview Questions
8 pages
As 1442-2007 Carbon Steels and Carbon-Manganese Steels - Hot Rolled Bars and Semi-Finished Products
No ratings yet
As 1442-2007 Carbon Steels and Carbon-Manganese Steels - Hot Rolled Bars and Semi-Finished Products
7 pages
ASME PCC-1 Appendix-O-Calculation
50% (2)
ASME PCC-1 Appendix-O-Calculation
23 pages
Fault Tree Analysis FTA Methode Analisis Pohon
No ratings yet
Fault Tree Analysis FTA Methode Analisis Pohon
44 pages
Language Models
No ratings yet
Language Models
50 pages
W1 Retail Analytics
No ratings yet
W1 Retail Analytics
3 pages
EM 5.7 Parts Manual 2013
No ratings yet
EM 5.7 Parts Manual 2013
37 pages
Ramp Layout and Design
No ratings yet
Ramp Layout and Design
1 page
Lesson 1 Grade 9 Digitization
No ratings yet
Lesson 1 Grade 9 Digitization
27 pages
As-202 Press Kit
No ratings yet
As-202 Press Kit
33 pages
Gala Valve
No ratings yet
Gala Valve
2 pages
Az 9200
No ratings yet
Az 9200
8 pages
Good DESIGN THINKING Book by Dharam Mentor
100% (1)
Good DESIGN THINKING Book by Dharam Mentor
78 pages
The Book of (PLC & SCADA) Dosing System by HMI
No ratings yet
The Book of (PLC & SCADA) Dosing System by HMI
102 pages
Criterios para Flanges API SPEC 6A
No ratings yet
Criterios para Flanges API SPEC 6A
33 pages
Bi - Ia 167140
No ratings yet
Bi - Ia 167140
2 pages
POF
No ratings yet
POF
21 pages
SS4.5 Bacillariophyta+Part+3 VO
No ratings yet
SS4.5 Bacillariophyta+Part+3 VO
8 pages
Terms & Conditions: (Gir Online Permit Booking System)
No ratings yet
Terms & Conditions: (Gir Online Permit Booking System)
6 pages
Mumbai Plan
No ratings yet
Mumbai Plan
230 pages
Children and Poetry
No ratings yet
Children and Poetry
9 pages
3rd Quarter Entrepreneurship
No ratings yet
3rd Quarter Entrepreneurship
3 pages
2021 Jce Home Economics
No ratings yet
2021 Jce Home Economics
5 pages

03 Text Processing - Minimum Edit Distance

Uploaded by

03 Text Processing - Minimum Edit Distance

Uploaded by

TM340 Natural

Minimum Edit Distance

Based on slides by Dan Jurafsky and Chris Manning

➢ Definition of Minimum Edit Distance

➢ Computing Minimum Edit Distance

➢ Backtrace for Computing Alignments

➢ Weighted Minimum Edit Distance

➢ In spell correction application:

➢ Minimum Edit Distance: is a measure of string similarity

➢ Autocorrection: Used to suggest corrections for misspelled words in various

➢ DNA Sequence Alignment:

➢ We will look at how the Minimum Edit Distance can be used in

➢ The autocorrect process can be broken down into four steps:

"my deah friend"

"my deah friend"

• Insertions, e.g. deahs, deaht, deahh

• Deletions, e.g. dea, deh, eah

"my deah friend"

"my deah friend"

➢ Given a string, the minimum edit distance is the minimum

➢ Editing operations includes:

➢ Two strings and their alignment:

➢ If we transform one string into another string, we can calculate

• If substitutions cost 2 (Levenshtein)

➢ Searching for a path (sequence of edits) from the start string

➢ But the space of all edit sequences is huge!

• We can’t afford to navigate naïvely

➢ It is often used when the subproblems are overlapping, i.e.

➢ We can often solve dynamic programming problems by

➢ The algorithm would then return the minimum cost of

➢ We can do that by created a distance matrix D where the

➢ Each letter is assigned an index ( zero based indexing).

- Delete I and insert E (cost1+1= 2): D(i,j-1) + 1 Insert cost

- Substitute I with E (cost 2): D(i-1,j-1) + 2 if X(i) ≠ Y(j) Substitute cost

0 if X(i) = Y(j) No change

All edit paths lead to a cost of 2 9 N 9

D(i,j)= min D(i,j-1) + 1 insertion

D(i-1,j-1) + 2; if X(i) ≠ Y(j) substitution

0; if X(i) = Y(j) no change

➢ Edit distance isn’t sufficient

➢ Every time we enter a cell, remember where we came from

➢ When we reach the end,

➢ Two strings and their alignment:

➢ Every non-Increasing path from (0,0)

➢ An optimal alignment is composed

D(i,j)= min D(i,j-1) + 1 insertion

D(i-1,j-1) + 2; if X(i) ≠ Y(j) substitution

0; if X(i) = Y(j) no change

ptr(i,j)= DOWN deletion

➢ Instead of adding fixed cost for all characters:

➢ It could be better if we assign a weighted cost based

➢ Why would we add weights to the computation?

• Spell Correction : some letters are more likely to be

➢ some letters are more likely to be mistyped due

You might also like