[go: up one dir, main page]

0% found this document useful (0 votes)
21 views41 pages

03 Text Processing - Minimum Edit Distance

The document discusses Minimum Edit Distance, a measure of string similarity that quantifies the minimum number of operations required to transform one string into another. It highlights various applications in Natural Language Processing, including autocorrection, plagiarism detection, and machine translation. The document also explains the computation of Minimum Edit Distance using dynamic programming and backtrace methods for aligning strings.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
21 views41 pages

03 Text Processing - Minimum Edit Distance

The document discusses Minimum Edit Distance, a measure of string similarity that quantifies the minimum number of operations required to transform one string into another. It highlights various applications in Natural Language Processing, including autocorrection, plagiarism detection, and machine translation. The document also explains the computation of Minimum Edit Distance using dynamic programming and backtrace methods for aligning strings.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 41

TM340 Natural

Language Processing

Text Processing

Minimum Edit Distance

Based on slides by Dan Jurafsky and Chris Manning


Agenda

➢ Definition of Minimum Edit Distance

➢ Computing Minimum Edit Distance

➢ Backtrace for Computing Alignments

➢ Weighted Minimum Edit Distance

2
Definition of Minimum Edit
Distance

3
How similar are two strings?

➢ In spell correction application:


• The user typed “graffe”
➢ Which is closest?
• graf
• graft
• grail
• giraffe
4
Minimum Edit Distance

➢ Minimum Edit Distance: is a measure of string similarity


• Minimum edit distance (also known as Levenshtein
distance) quantifies the similarity between two strings by
determining the minimum number of operations (insertions,
deletions, and substitutions) required to transform one
string into the other.
• It provides a quantitative measure of how different or
similar two strings are.

5
Applications in Natural Language Processing (NLP)

➢ Autocorrection: Used to suggest corrections for misspelled words in various


applications like smartphones, email clients, and text processors.

➢ Plagiarism Detection:

• Comparing documents: By calculating the minimum edit distance between two documents, you
can identify similarities and potential plagiarism.
• Identifying copied passages: This can help in academic integrity and content creation.

➢ Machine Translation:

• Alignment of sentences: Minimum edit distance can be used to align words or phrases between
different languages during the translation process.
• Measuring translation quality: It can help evaluate the accuracy of machine translations.

6
Additional Applications of Minimum Edit Distance in NLP

➢ Information Retrieval:

• Fuzzy string matching: When searching for documents or keywords, minimum edit distance can be used
to find matches that are similar but not exact.
• Improving search results: This can enhance the relevance of search results.

➢ Speech Recognition:

• Word recognition: By comparing the acoustic features of a spoken word to a dictionary of known words,
minimum edit distance can help identify the most likely word.
• Error correction: It can be used to correct errors in speech recognition systems.

➢ DNA Sequence Alignment:

• Biological similarity: Minimum edit distance can be used to compare DNA sequences and identify
similarities between organisms.
• Evolutionary analysis: It can help in understanding evolutionary relationships.

7
Autocorrect Process

➢ We will look at how the Minimum Edit Distance can be used in


autocorrection systems.

➢ The autocorrect process can be broken down into four steps:


• Identify an incorrect word
• Find strings (n) edit distance away
• Filter candidate words
• Calculate the word probabilities
8
Autocorrect Process: Example
➢ Consider the following sentence:

"my deah friend"

1. We can easily see that the word deah is incorrect. The intended word is probably dear.

my dear friend

For humans, it is usually quite easy to identify incorrect words. But for a computer, we need to find a
way to identify incorrect words.
The simplified assumption is that if a word is not in the vocabulary (set of words, i.e. it does not contain
duplicates.), then it is probably a typo.

9
Autocorrect Process: Example
➢ Consider the following sentence:

"my deah friend"

2. We can find words that are 1 edit distance away from deah (by applying Edit Distance
algorithm):

• Note that we do not care if the words are valid words or not. We just want to find
words that are 1 edit distance away from deah.
• Substitutions, e.g. dear, dead, deas, yeah

• Insertions, e.g. deahs, deaht, deahh

• Deletions, e.g. dea, deh, eah

10
Autocorrect Process: Example
➢ Consider the following sentence:

"my deah friend"

3. Many words returned by the edit distance algorithm are not valid words. We need to filter these
words out. To do this, we can also check if the word is in the vocabulary. If it is, then we can keep it.

• dear
• dead
• deal
• deas
• ...

11
Autocorrect Process: Example
➢ Consider the following sentence:

"my deah friend"

4. We can calculate the word probabilities to find the most likely word.

• In our case, we can intuitively assume that dear is more likely than dead.
• My dear friend

• My dead friend

12
Edit Distance

➢ Given a string, the minimum edit distance is the minimum


number of single-character edits (insertions, deletions, or
substitutions) required to change one string into the other.

➢ Editing operations includes:


• Insertion
• Deletion
• Substitution
13
Minimum Edit Distance

➢ Two strings and their alignment:

14
Minimum Edit Distance: Edit Cost

➢ If we transform one string into another string, we can calculate


the total edit cost by adding up the costs of all edit operations.

➢ This is referred to as the total edit cost, and this is what we are
trying to minimize.
• If each operation has cost of 1
• Distance between these two strings is 5

• If substitutions cost 2 (Levenshtein)


• Distance between them is 8

15
How to find the Min Edit Distance?

➢ Searching for a path (sequence of edits) from the start string


to the final string:
• Initial state: the word we’re transforming
• Operators: insert, delete, substitute
• Goal state: the word we’re trying to get to
• Path cost: what we want to minimize: the number of edits

16
Minimum Edit as Search

➢ But the space of all edit sequences is huge!

• We can’t afford to navigate naïvely


• Lots of distinct paths wind up at the same state.
• We don’t have to keep track of all of them
• Just the shortest path to each of those revisited states.

17
Computing Minimum Edit Distance

18
Dynamic Programming for
Minimum Edit Distance
➢ Dynamic programming is a method for solving complex
problems by breaking them down into simpler
subproblems.

➢ It is often used when the subproblems are overlapping, i.e.


when subproblems share subproblems.

➢ We can often solve dynamic programming problems by


using recursion.

19
Dynamic Programming for
Minimum Edit Distance
➢ The Minimum Edit Distance algorithm is a dynamic
programming algorithm because it breaks down the
problem of calculating the minimum edit distance into
smaller subproblems, and it uses the results of those
subproblems to solve the larger problem of calculating
the minimum edit distance.

20
Dynamic Programming for
Minimum Edit Distance
➢ When computing the minimum edit distance, we start with a
source word and transform it into the target word.

➢ The algorithm would then return the minimum cost of


transforming the source word into the target word.

➢ We can do that by created a distance matrix D where the


source word is shown on the vertical axis, and the target
word is shown on the horizontal axis.
21
The Distance Matrix

9 N
8 O
7 I
6 T
5 N
4 E
3 T
2 N
1 I
0 #
# E X E C U T I O N
0 1 2 3 4 5 6 7 8 9
22
The Distance Matrix
➢ The following Distance Matrix is bottom-up, it could also be a top-down.

➢ Each letter is assigned an index ( zero based indexing).

➢ We also add an empty character # to the bottom and left of the matrix.
• This is because we need to be able to transform the source word into an
empty string, and the empty string into the target word.
➢ Our goal is to fill each cell D[i,j] with the correct edit distance values
9 N
8 O
7 I
6 T
5 N
4 E
3 T
2 N
1 I
0 #
# E X E C U T I O N
0 1 2 3 4 5 6 7 8 9 23
Minimum Edit Distance
Initialization
9 N 9
8 O 8
D(i,0) = i deletion
7 I 7
D(0,j) = j insertion 6 T 6
5 N 5
4 E 4
3 T 3
2 N 2
1 I 1
0 # 0 1 2 3 4 5 6 7 8 9
# E X E C U T I O N
Example: 0 1 2 3 4 5 6 7 8 9

The cell D[0,0]: # ➔ # (an empty string into an empty string) cost of 0
The cell D[1,0]: I ➔ # (deleting I to so that we have an empty string) cost of 1.
The cell D[0,1]: # ➔ E (inserting E into an empty string) cost of 1.

24
Minimum Edit Distance
Now, consider the cell D[1,1], which represents the transformation I ➔ E, There are three paths to do
that:
- Insert E and delete I (cost1+1= 2) : D(i-1,j) + 1 Delete cost

- Delete I and insert E (cost1+1= 2): D(i,j-1) + 1 Insert cost

- Substitute I with E (cost 2): D(i-1,j-1) + 2 if X(i) ≠ Y(j) Substitute cost

0 if X(i) = Y(j) No change

All edit paths lead to a cost of 2 9 N 9


8 O 8
But we want to find the minimum cost 7 I 7
6 T 6
So, if one of those costs would be smaller 5 N 5
4 E 4
➔ it would be the cost to enter in 3 T 3
2 N 2
1 I 1 2
0 # 0 1 2 3 4 5 6 7 8 9
# E X E C U T I O N
0 1 2 3 4 5 6 7 25 8 9
The Distance Matrix

9 N 9
8 O 8
7 I 7
6 T 6
5 N 5
4 E 4
3 T 3
2 N 2
1 I 1 2
0 # 0 1 2 3 4 5 6 7 8 9
# E X E C U T I O N
0 1 2 3 4 5 6 7 8 9
26
Defining Minimum Edit Distance (Levenshtein)
Initialization

D(i,0) = i
D(0,j) = j

Recurrence Relation:
For each i = 1…M
For each j = 1…N
D(i-1,j) + 1 deletion

D(i,j)= min D(i,j-1) + 1 insertion

D(i-1,j-1) + 2; if X(i) ≠ Y(j) substitution

0; if X(i) = Y(j) no change

Termination:
D(N,M) is distance

27
The Edit Distance Matrix

N 9 8 9 10 11 12 11 10 9 8
O 8 7 8 9 10 11 10 9 8 9
I 7 6 7 8 9 10 9 8 9 10

T 6 5 6 7 8 9 8 9 10 11
N 5 4 5 6 7 8 9 10 11 10
E 4 3 4 5 6 7 8 9 10 9
T 3 4 5 6 7 8 7 8 9 8
N 2 3 4 5 6 7 8 7 8 7
I 1 2 3 4 5 6 7 6 7 8
# 0 1 2 3 4 5 6 7 8 9
# E X E C U T I O N
28
Backtrace for Computing
Alignments

29
Computing alignments

➢ Edit distance isn’t sufficient


• We often need to align each character of the two strings to each other
➢ We do this by keeping a “backtrace”

➢ Every time we enter a cell, remember where we came from

➢ When we reach the end,


• Trace back the path from the upper right corner to read off the
alignment

30
Minimum Edit with Backtrace

31
Minimum Edit with Backtrace

➢ Two strings and their alignment:

i n t e * n t i o n
* e x e c u t i o n
Vertical or Horizontal path = an alignment

32
Minimum Edit with Backtrace

➢ Every non-Increasing path from (0,0)


x0 …………………… xN

to (M, N) corresponds to an
alignment of the two sequences

➢ An optimal alignment is composed


y0 ……………………………… yM
of optimal subalignments
33
Adding Backtrace to Minimum Edit Distance
Initialization

D(i,0) = i
D(0,j) = j

Recurrence Relation:
For each i = 1…M
For each j = 1…N
D(i-1,j) + 1 deletion

D(i,j)= min D(i,j-1) + 1 insertion

D(i-1,j-1) + 2; if X(i) ≠ Y(j) substitution

0; if X(i) = Y(j) no change

LEFT insertion

ptr(i,j)= DOWN deletion

DIAG substitution

Termination:
D(N,M) is distance

34
Weighted Minimum Edit Distance

35
Weighted Edit Distance

➢ Instead of adding fixed cost for all characters:


1 for insert/delete
2 for substitution

➢ It could be better if we assign a weighted cost based


on the operation and the character.

36
Weighted Edit Distance

➢ Why would we add weights to the computation?

• Spell Correction : some letters are more likely to be


mistyped than others
• Biology : certain kinds of deletions or insertions are more
likely than others

37
Confusion matrix for spelling errors

38
Confusion matrix for spelling errors

➢ some letters are more likely to be mistyped due


Keyboard structure

39
Weighted Minimum Edit Distance
Initialization

D(0,0) = 0
D(i,0) = D(i-1,0) + del[x(i)]; 1 < i ≤ N
D(0,j) = D(0,j-1) + ins[y(j)]; 1 < j ≤ M

Recurrence Relation:

D(i-1,j) + del[x(i)]
D(i,j)= min D(i,j-1) + ins[y(j)]
D(i-1,j-1) + sub[x(i),y(j)]
Termination:
D(N,M) is distance

40
Thank You

You might also like