Lecture 19: Edit Distance
Agenda:
• LCS
• Edit distance
Reading:
• Text: (partly covered) 350-355
1
Lecture 19: Dynamic Programming
Longest common subsequence (LCS) problem:
Definitions: – Sequence/string:
dynamicprogramming is a sequence over the English alpha-
bet
– Base/letter/character
– Subsequence:
the given sequence with zero or more letters deleted
e.g., dog is a subsequence of dynamicprogramming
WARNing: bases appear in the same order, but not nec-
essarily consecutive
– Common subsequence
– LCS problem: given two sequences X = x1 x2 . . . xn and
Y = y1 y2 . . . ym , find a maximum-length common subse-
quence of them.
• The LCS problem has the “optimal substructure” ...
– if xn is NOT in the LCS (to be computed), then we only
need to compute an LCS of x1 x2 . . . xn−1 and y1 y2 . . . ym
...
– similarly, if ym is NOT in the LCS (to be computed),
then we only need to compute an LCS of x1 x2 . . . xn and
y1 y2 . . . ym−1 ...
– if xn and ym are both in the LCS (to be computed), then
xn = ym and we need to compute an LCS of x1 x2 . . . xn−1
and y1 y2 . . . ym−1 ;
and then adding xn to the end to form an LCS for the
original problem
2
Lecture 19: Dynamic Programming
Longest common subsequence (LCS) problem (cont’d):
• Therefore,
Letting DP [n, m] to denote the length of an LCS of X and
Y , then it is equal to
LCS(x1 x2 . . . xn−1 , y1 y2 . . . ym ),
max length of LCS(x1 x2 . . . xn , y1 y2 . . . ym−1 ),
LCS(x1 x2 . . . xn−1 , y1 y2 . . . ym−1 ) + ‘x0n , if xn = ym
• Correctness
• In general, let DP [i, j] denote the length of an LCS of x1 x2 . . . xi
and y1 y2 . . . yj .
• Recurrence:
If i = 0 or j = 0 then clearly DP [i, j] = 0.
Else
DP [i − 1, j],
DP [i, j] = max DP [i, j − 1],
DP [i − 1, j − 1] + 1, if xi = yj
3
Lecture 19: Dynamic Programming
Longest common subsequence (LCS) problem (cont’d)
— solving the recurrence:
• Pseudocode
procedure dpLCS(X, Y )
n ← length[X]
m ← length[Y ]
for i ← 1 to m do
DP [i, 0] ← 0
for j ← 0 to n do
DP [0, j] ← 0
for i ← 1 to n do
for j ← 1 to m do
if xi = yj then
DP [i, j] ← DP [i − 1, j − 1] + 1
else if DP [i − 1, j] ≥ DP [i, j − 1] then
DP [i, j] ← DP [i − 1, j]
else
DP [i, j] ← DP [i, j − 1]
return DP [n, m]
• Running time: Θ(n × m)
There are n × m entries each takes constant time to compute.
• Step 4: to find the LCS, store extra information and trace
back!
4
Lecture 19: Edit Distance
Editing Sequences:
• Given two sequences S and T over alphabets Σ = {a, b, c}.
• Starting with S, at each step we can either:
– delete a letter from S,
– insert a letter from Σ into S (at any place),
– exchange a letter of S with another one in Σ.
• each of these operations has a cost (given).
• If we use δ to denote “null” then:
– cost of exchanging a with b is: Cost[a, b], and
– cost of deleting c (i.e. replacing it with δ) is Cost[c, δ],
and
– cost of inserting letter b (i.e. replacing δ with b) is
Cost[δ, b].
• Given S, T , and cost table Cost, want to find a sequence of
operations that transform S into T with minimum cost.
• We assume that Cost is metric:
– Cost[α, α] = 0 for α ∈ Σ,
– Cost[α, β] = Cost[β, α], for α, β ∈ Σ ∪ {δ},
– Cost[α, β] ≤ Cost[α, σ] + Cost[σ, β], for α, β, σ ∈ Σ ∪ {δ}.
5
Lecture 19: Edit Distance
Edit Distance:
• Let’s assume S = s1 s2 . . . sn and T = t1 t2 . . . tm .
• Subproblems will be: best edit distance of s1 . . . si and t1 . . . tj
with 1 ≤ i ≤ n and 1 ≤ j ≤ m.
• Let Edit[i, j] denote the minimum cost of editing s1 . . . si into
t1 . . . tj .
• In the optimal solution, either:
– si is deleted and the rest of s1 . . . si−1 is transformed into
t1 . . . tj , or
– s1 . . . si is transformed into t1 . . . tj−1 and we insert tj at
the end, or
– si is changed into tj and the rest of s1 . . . si−1 is trans-
formed into t1 . . . tj−1 .
• Thus:
Edit[i − 1, j] + Cost[si , δ],
Edit[i, j] = min Edit[i, j − 1] + Cost[δ, tj ],
Edit[i − 1, j − 1] + Cost[si , tj ],
Pi Pj
• with Edit[i, 0] = k=1
Cost[sk , δ] and Edit[0, j] = k=1
Cost[δ, tj ].
6
Lecture 19: Dynamic Programming
Edit Distance:
• Pseudocode
procedure Edit-Dist(S, T, Cost)
n ← length[S]
m ← length[T ]
Edit[0, 0] ← 0
for i ← 1 to n do
Edit[i, 0] ← Edit[i − 1, 0] + Cost[S[i], δ]
for j ← 1 to n do
Edit[0, j] ← Edit[0, j − 1] + Cost[δ, T [j]]
for i ← 1 to n do
for j ← 1 to m do
Op1 ← Edit[i − 1, j] + Cost[S[i], δ]
Op2 ← Edit[i, j − 1] + Cost[δ, T [j]]
Op3 ← Edit[i − 1, j − 1] + Cost[S[i], T [j]]
Edit[i, j] = min{Op1, Op2, Op3}
return Edit[n, m]
• Running time: Θ(n × m)
There are n × m entries each takes constant time to compute.
• To find the sequence of operations: keep extra information
and trace back!
7
Lecture 19: Edit Distance
Have you understood the lecture contents?
well ok not-at-all topic
LCS
edit distance
DP for edit distance