STRING EDITING
• Given two strings X = x1, x2,...,xn and Y = y1, y2,….,ym, where xi, 1≤ i ≤ n, and yj, 1
≤ j ≤ m, are members of a finite set of symbols known as the alphabet.
• The problem is to transform X into Y using a sequence of edit operations on X.
• The permissible edit operations are insert, delete, and change (a symbol of X into
another), and there is a cost associated with performing each.
• The cost of a sequence of operations is the sum of the costs of the individual
operations in the sequence.
• The problem of string editing is to identify a minimum-cost sequence of edit
operations that will transform X into Y.
Let,
✓ D(xi) be the cost of deleting the symbol xi from X,
✓ I(yj) be the cost of inserting the symbol yj into X, and
✓ C(xi,yj) be the cost of changing the symbol xi of X into yj.
Example: Consider the sequences X = x1, x2, x3, x4, x5 = a, a, b, a, b and Y = y1, y2, y3,
y4 = b, a, b, b. Let the cost associated with each insertion and deletion be 1(for any
symbol). Also let the cost of changing any symbol to any other symbol be 2.
• One possible way of transforming X into Y is delete each xi, 1≤ i ≤ 5, and insert each
yj, 1≤ j ≤ 4. The total cost of this edit sequence is 9.
• Another possible edit sequence is delete x1 and x2 and insert y4 at the end of
string X. The total cost is only 3.
Dynamic Programming Approach
A solution to the string editing problem consists of a sequence of decisions, one for
each edit operation. A dynamic programming solution for this problem can be obtained
as follows.
Define cost(i,j) to be the minimum cost of any edit sequence for transforming x1,
x2,…..,xi into y1,y2,……,yj (for 0≤ i ≤ n and 1≤ i ≤ m).
Compute cost(i,j) for each i and j.
Then cost(n,m) is the cost of an optimal edit sequence.
The minimum cost of any edit sequence that transforms x1, x2,….,xi into y1,y2,……,yj is:
We have to compute cost(i, j) for all possibles values of i and j. There are (n+1)(m+1)
such values. These values can be computed in the form of a table, M, where each row
of M corresponds to a particular value of i and each column of M corresponds to a
specific value of j. M(i,j) stores the value cost(i,j).
The zeroth row can be computed first since it corresponds to performing a series of
insertions. Likewise, the zeroth column can also be computed. After this, one could
compute the entries of M in row-major order, starting from the first row. Rows should
be processed in the order 1,2,... ,n. Entries in any row are computed in increasing order
of column number.
The value cost(n,m) is the final answer we are interested in. Having computed all the
entries of M, a minimum edit sequence can be obtained by backward trace from
cost(n,m). This is enabled by recording which of the three options for i > 0, j > 0 yielded
the minimum cost for each i and j.
Example
Consider the sequences X = x1,x2, x3, x4, x5 = a,a,b,a,b and Y = y1,y2,y3,y4 = b,a,b,b.
Each insertion and deletion has a unit cost and a change costs 2 units.
j→ Ø 1 2 3 4
i↓
Ø
1
2
3
4
5
Step 1: Fill first row and first column of the table
Using the formula, Compute cost for first row,
Cost (0,0) = 0
Cost(0,1) = cost(0, j-1) + I(y1) //i=0 j =1
= cost(0,0) + 1 = 0+1 = 1
Cost(0,2) = cost(0,j-1) + I(y2) //i=0, j=2
= cost(0,1) + 1 = 1+1 = 2
Cost(0,3) = cost(0,j-1) + I(y3) //i=0, j=3
= cost(0,2) + 1 = 2+1 = 3
Cost(0,4) = cost(0,j-1) + I(y4) //i=0, j=4
= cost(0,3) + 1 = 3+1 = 4
Using the formula, Compute cost for first column,
Cost(1,0) = cost(i-1, 0) + D(x1) //i=1 j =0
= cost(0,0) + 1 = 0+1 = 1
Cost(2,0) = cost(i-1, 0) + D(x2) //i=2 j =0
= cost(1,0) + 1 = 1+1 = 2
Cost(3,0) = cost(i-1, 0) + D(x3) //i=3 j =0
= cost(2,0) + 1 = 2+1 = 3
Cost(4,0) = cost(i-1, 0) + D(x4) //i=4 j =0
= cost(3,0) + 1 = 3+1 = 4
Cost(5,0) = cost(i-1, 0) + D(x4) //i=5 j =0
= cost(4,0) + 1 = 4+1 = 5
j→ Ø 1 2 3 4
i↓
Ø 0 1 2 3 4
1 1
2 2
3 3
4 4
5 5
Compute Cost(i,j) for 2nd row
Cost(1,1) = cost’(1,1)
= min{cost(0,1) + D(x1), cost(0,0) + C(x1, y1), cost(1,0) + I(y1)}
= min{1+1, 0+2, 1+1} = 2
Cost(1,2) = cost’(1,2) //i=1,j=2
= min{cost(0,2) + D(x1), cost(0,1) + C(x1, y2), cost(1,1) + I(y2)
= min{2+1, 1+0, 2+1} = 1
Cost(1,3) = cost’(1,3) //i=1,j=3
= min{cost(0,3) + D(x1), cost(0,2) + C(x1, y3), cost(1,2) + I(y3)
= min{3+1, 2+2, 1+1} = 2
Cost(1,4) = cost’(1,4) //i=1,j=4
= min{cost(0,4) + D(x1), cost(0,3) + C(x1, y4), cost(1,3) + I(y4)
= min{4+1, 3+2, 2+1} = 3
j→ Ø 1 2 3 4
i↓
Ø 0 1 2 3 4
1 1 2 1 2 3
2 2
3 3
4 4
5 5
Compute for 3rd row
Cost(2,1) = cost’(2,1)
= min{cost(1,1) + D(x2), cost(1,0) + C(x2, y1), cost(2,0) + I(y1)}
= min{2+1, 1+2, 2+1} = 3
Cost(2,2) = cost’(2,2)
= min{cost(1,2) + D(x2), cost(1,1) + C(x2, y2), cost(2,1) + I(y2)}
= min{1+1, 2+0, 2+1} = 2
Cost(2,3) = cost’(2,3)
= min{cost(1,3) + D(x2), cost(1,2) + C(x2, y3), cost(2,2) + I(y3)}
= min{2+1, 2+2, 2+1} = 3
Cost(2,4) = cost’(2,4)
= min{cost(1,4) + D(x2), cost(1,3) + C(x2, y4), cost(2,3) + I(y4)}
= min{3+1, 2+2, 3+1} = 4
j→ Ø 1 2 3 4
i↓
Ø 0 1 2 3 4
1 1 2 1 2 3
2 2 3 2 3 4
3 3
4 4
5 5
Compute for 4th row
Cost(3,1) = cost’(3,1)
= min{cost(2,1) + D(x3), cost(2,0) + C(x3, y1), cost(3,0) + I(y1)}
= min{2+1, 2+0, 3+1} = 2
Cost(3,2) = cost’(3,2)
= min{cost(2,2) + D(x3), cost(2,1) + C(x3, y2), cost(3,1) + I(y2)}
= min{2+1, 3+2, 2+1} = 3
Cost(3,3) = cost’(3,3)
= min{cost(2,3) + D(x3), cost(2,2) + C(x3, y3), cost(3,2) + I(y3)}
= min{3+1, 2+0, 3+1} = 2
Cost(3,4) = cost’(3,4)
= min{cost(2,4) + D(x3), cost(2,3) + C(x3, y4), cost(3,3) + I(y4)}
= min{4+1, 3+0, 2+1} = 3
j→ Ø 1 2 3 4
i↓
Ø 0 1 2 3 4
1 1 2 1 2 3
2 2 3 2 3 4
3 3 2 3 2 3
4 4
5 5
Compute for 5th row
Cost(4,1) = cost’(4,1)
= min{cost(3,1) + D(x4), cost(3,0) + C(x4, y1), cost(4,0) + I(y1)}
= min{2+1, 3+2, 4+1} = 3
Cost(4,2) = cost’(4,2)
= min{cost(3,2) + D(x4), cost(3,1) + C(x4, y2), cost(4,1) + I(y2)}
= min{3+1, 2+0, 3+1} = 2
Cost(4,3) = cost’(4,3)
= min{cost(3,3) + D(x4), cost(3,2) + C(x4, y3), cost(4,2) + I(y3)}
= min{2+1, 3+2, 2+1} = 3
Cost(4,4) = cost’(4,4)
= min{cost(3,4) + D(x4), cost(3,3) + C(x4, y4), cost(4,3) + I(y4)}
= min{3+1, 2+2, 3+1} = 4
j→ Ø 1 2 3 4
i↓
Ø 0 1 2 3 4
1 1 2 1 2 3
2 2 3 2 3 4
3 3 2 3 2 3
4 4 3 2 3 4
5 5
Compute for 6th row
Cost(5,1) = cost’(5,1)
= min{cost(4,1) + D(x5), cost(4,0) + C(x5, y1), cost(5,0) + I(y1)}
= min{3+1, 4+0, 5+1} = 4
Cost(5,2) = cost’(5,2)
= min{cost(4,2) + D(x5), cost(4,1) + C(x5, y2), cost(5,1) + I(y2)}
= min{2+1, 3+2, 4+1} = 3
Cost(5,3) = cost’(5,3)
= min{cost(4,3) + D(x5), cost(4,2) + C(x5, y3), cost(5,2) + I(y3)}
= min{3+1, 2+0, 3+1} = 2
Cost(5,4) = cost’(5,4)
= min{cost(4,4) + D(x5), cost(4,3) + C(x5, y4), cost(5,3) + I(y4)}
= min{4+1, 3+0, 2+1} = 3
j→ Ø 1 (b) 2 (a) 3 (b) 4 (b)
i↓
Ø 0 1 2 3 4
1 (a) 1 2 1 2 3
2 (a) 2 3 2 3 4
3 (b) 3 2 3 2 3
4 (a) 4 3 2 3 4
5 (b) 5 4 3 2 3
The value cost(5,4)= 3.
One possible minimum-cost edit sequence is delete x1, delete x2, and insert y4.
Homework
Consider the sequences X = x1,x2, x3, x4, x5, x6 = a,a,b,a,a,b and Y = y1,y2,y3,y4, y5
= b,a,b,a,a. Each insertion and deletion has a unit cost and a change costs 2 units.