String processing
algorithms
David Kauchak
cs161
Summer 2009
Administrative
⚫ Check your scores on coursework
⚫ SCPD Final exam: e-mail me with proctor
information
⚫ Office hours next week?
⚫ Reminder: HW6 due Wed. 8/12 before class
and no late homework
Where did “dynamic programming” come from?
Richard Bellman On the Birth of
Dynamic Programming
Stuart Dreyfus
http://www.eng.tau.ac.il/~ami/cd/o
r50/1526-5463-2002-50-01-
0048.pdf
Strings
⚫ Let Σ be an alphabet, e.g. Σ = ( , a, b, c, …,
z)
⚫ A string is any member of Σ*, i.e. any
sequence of 0 or more members of Σ
⚫ ‘this is a string’ Σ*
⚫ ‘this is also a string’ Σ*
⚫ ‘1234’ Σ*
String operations
⚫ Given strings s1 of length n and s2 of length m
⚫ Equality: is s1 = s2? (case sensitive or
insensitive)
‘this is a string’ = ‘this is a string’
‘this is a string’ ≠ ‘this is another string’
‘this is a string’ =? ‘THIS IS A STRING’
⚫ Running time
⚫ O(n) where n is length of shortest string
String operations
⚫ Concatenate (append): create string s1s2
‘this is a’ . ‘ string’ → ‘this is a string’
⚫ Running time
⚫ Θ(n+m)
String operations
⚫ Substitute: Exchange all occurrences of a
particular character with another character
Substitute(‘this is a string’, ‘i’, ‘x’) →
‘thxs xs a strxng’
Substitute(‘banana’, ‘a’, ‘o’) → ‘bonono’
⚫ Running time
⚫ Θ(n)
String operations
⚫ Length: return the number of
characters/symbols in the string
Length(‘this is a string’) → 16
Length(‘this is another string’) → 24
⚫ Running time
⚫ O(1) or Θ(n) depending on implementation
String operations
⚫ Prefix: Get the first j characters in the string
Prefix(‘this is a string’, 4) → ‘this’
⚫ Running time
⚫ Θ(j)
⚫ Suffix: Get the last j characters in the string
Suffix(‘this is a string’, 6) → ‘string’
⚫ Running time
⚫ Θ(j)
String operations
⚫ Substring – Get the characters between i and
j inclusive
Substring(‘this is a string’, 4, 8) → ‘s is ’
⚫ Running time
⚫ Θ(j - i)
⚫ Prefix?
⚫ Prefix(S, i) = Substring(S, 1, i)
⚫ Suffix?
⚫ Suffix(S, i) = Substring(S, i+1, length(n))
Edit distance
(aka Levenshtein distance)
⚫ Edit distance between two strings is the
minimum number of insertions, deletions and
substitutions required to transform string s1
into string s2
Insertion:
ABACED ABACCED DABACCED
Insert ‘C’ Insert ‘D’
Edit distance
(aka Levenshtein distance)
⚫ Edit distance between two strings is the
minimum number of insertions, deletions and
substitutions required to transform string s1
into string s2
Deletion:
ABACED
Edit distance
(aka Levenshtein distance)
⚫ Edit distance between two strings is the
minimum number of insertions, deletions and
substitutions required to transform string s1
into string s2
Deletion:
ABACED BACED
Delete ‘A’
Edit distance
(aka Levenshtein distance)
⚫ Edit distance between two strings is the
minimum number of insertions, deletions and
substitutions required to transform string s1
into string s2
Deletion:
ABACED BACED BACE
Delete ‘A’ Delete ‘D’
Edit distance
(aka Levenshtein distance)
⚫ Edit distance between two strings is the
minimum number of insertions, deletions and
substitutions required to transform string s1
into string s2
Substitution:
ABACED ABADED ABADES
Sub ‘D’ for ‘C’ Sub ‘S’ for ‘D’
Edit distance examples
Edit(Kitten, Mitten) = 1
Operations:
Sub ‘M’ for ‘K’ Mitten
Edit distance examples
Edit(Happy, Hilly) = 3
Operations:
Sub ‘a’ for ‘i’ Hippy
Sub ‘l’ for ‘p’ Hilpy
Sub ‘l’ for ‘p’ Hilly
Edit distance examples
Edit(Banana, Car) = 5
Operations:
Delete ‘B’ anana
Delete ‘a’ nana
Delete ‘n’ naa
Sub ‘C’ for ‘n’ Caa
Sub ‘a’ for ‘r’ Car
Edit distance examples
Edit(Simple, Apple) = 3
Operations:
Delete ‘S’ imple
Sub ‘A’ for ‘i’ Ample
Sub ‘m’ for ‘p’ Apple
Is edit distance symmetric?
⚫ that is, is Edit(s1, s2) = Edit(s2, s1)?
Edit(Simple, Apple) =? Edit(Apple, Simple)
⚫ Why?
⚫ sub ‘i’ for ‘j’ → sub ‘j’ for ‘i’
⚫ delete ‘i’ → insert ‘i’
⚫ insert ‘i’ → delete ‘i’
Calculating edit distance
X=ABCBDAB
Y=BDCABA
Ideas?
Calculating edit distance
X=ABCBDA?
Y=BDCAB?
After all of the operations, X needs
to equal Y
Calculating edit distance
X=ABCBDA?
Y=BDCAB?
Operations: Insert
Delete
Substitute
Insert
X=ABCBDA?
Y=BDCAB?
Insert
X=ABCBDA?
Edit
Y=BDCAB?
Edit ( X , Y ) = 1 + Edit ( X 1...n , Y1...m −1 )
Delete
X=ABCBDA?
Y=BDCAB?
Delete
X=ABCBDA?
Edit
Y=BDCAB?
Edit ( X , Y ) = 1 + Edit ( X 1...n −1 , Y1...m )
Substition
X=ABCBDA?
Y=BDCAB?
Substition
X=ABCBDA?
Edit
Y=BDCAB?
Edit ( X , Y ) = 1 + Edit ( X 1...n −1 , Y1...m −1 )
Anything else?
X=ABCBDA?
Y=BDCAB?
Equal
X=ABCBDA?
Y=BDCAB?
Equal
X=ABCBDA?
Edit
Y=BDCAB?
Edit ( X , Y ) = Edit ( X 1...n −1 , Y1...m −1 )
Combining results
Insert: Edit ( X , Y ) = 1 + Edit ( X 1...n , Y1...m −1 )
Delete: Edit ( X , Y ) = 1 + Edit ( X 1...n −1 , Y1...m )
Substitute: Edit ( X , Y ) = 1 + Edit ( X 1...n −1 , Y1...m −1 )
Equal: Edit ( X , Y ) = Edit ( X 1...n −1 , Y1...m −1 )
Combining results
1 + Edit(X1...n,Y1...m −1 ) insertion
Edit( X , Y ) = min 1 + Edit( X 1...n −1 , Y1...m ) deletion
Diff ( x , y ) + Edit( X
n m 1...n −1 , Y1...m −1 ) equal/substitution
Running time
Θ(nm)
Variants
⚫ Only include insertions and deletions
⚫ What does this do to substitutions?
⚫ Include swaps, i.e. swapping two adjacent
characters counts as one edit
⚫ Weight insertion, deletion and substitution
differently
⚫ Weight specific character insertion, deletion
and substitutions differently
⚫ Length normalize the edit distance
String matching
⚫ Given a pattern string P of length m and a
string S of length n, find all locations where P
occurs in S
P = ABA
S = DCABABBABABA
String matching
⚫ Given a pattern string P of length m and a
string S of length n, find all locations where P
occurs in S
P = ABA
S = DCABABBABABA
Uses
⚫ grep/egrep
⚫ search
⚫ find
⚫ java.lang.String.contains()
Naive implementation
Is it correct?
Running time?
⚫ What is the cost of the equality check?
⚫ Best case: O(1)
⚫ Worst case: O(m)
Running time?
⚫ Best case
⚫ Θ(n) – when the first character of the pattern does
not occur in the string
⚫ Worst case
⚫ O((n-m+1)m)
Worst case
P = AAAA
S = AAAAAAAAAAAAA
Worst case
P = AAAA
S = AAAAAAAAAAAAA
Worst case
P = AAAA
S = AAAAAAAAAAAAA
Worst case
P = AAAA
S = AAAAAAAAAAAAA
repeated work!
Worst case
P = AAAA
S = AAAAAAAAAAAAA
Ideally, after the first match, we’d
know to just check the next
character to see if it is an ‘A’
Patterns
⚫ Which of these patterns will have that
problem?
P = ABAB
P = ABDC
P = BAA
P = ABBCDDCAABB
Patterns
⚫ Which of these patterns will have that
problem?
P = ABAB If the pattern has a
suffix that is also a
P = ABDC prefix then we will
have this problem
P = BAA
P = ABBCDDCAABB
Finite State Automata (FSA)
⚫ An FSA is defined by 5 components
⚫ Q is the set of states
q0 q1 q2 … qn
Finite State Automata (FSA)
⚫ An FSA is defined by 5 components
⚫ Q is the set of states
q0 q1 q2 … qn
⚫ q0 is the start state q7
⚫ A Q, is the set of accepting states where |A| > 0
⚫ Σ is the alphabet (e.g. {A, B}
⚫ is the transition function from Q x Σ to Q
QΣ Q B
q0 A q1
q0 B q2 q0 q1
A
q2 …
A
q1 A q1
…
FSA operation
B A A
q0 q1 q1 q1
A B A
B
B
An FSA starts at state q0 and reads the characters of the input
string one at a time.
If the automaton is in state q and reads character a, then it
transitions to state (q,a).
If the FSA reaches an accepting state (q A), then the FSA has
found a match.
FSA operation
P = ABA
B A A
q0 q1 q1 q1
A B A
B
B
What pattern does this represent?
FSA operation
P = ABA
B A A
q0 q1 q1 q1
A B A
B
B
S = BABABBABABA
FSA operation
P = ABA
B A A
q0 q1 q1 q1
A B A
B
B
S = BABABBABABA
FSA operation
P = ABA
B A A
q0 q1 q1 q1
A B A
B
B
S = BABABBABABA
FSA operation
P = ABA
B A A
q0 q1 q1 q1
A B A
B
B
S = BABABBABABA
FSA operation
P = ABA
B A A
q0 q1 q1 q1
A B A
B
B
S = BABABBABABA
FSA operation
P = ABA
B A A
q0 q1 q1 q1
A B A
B
B
S = BABABBABABA
FSA operation
P = ABA
B A A
q0 q1 q1 q1
A B A
B
B
S = BABABBABABA
FSA operation
P = ABA
B A A
q0 q1 q1 q1
A B A
B
B
S = BABABBABABA
FSA operation
P = ABA
B A A
q0 q1 q1 q1
A B A
B
B
S = BABABBABABA
FSA operation
P = ABA
B A A
q0 q1 q1 q1
A B A
B
B
S = BABABBABABA
FSA operation
P = ABA
B A A
q0 q1 q1 q1
A B A
B
B
S = BABABBABABA
FSA operation
P = ABA
B A A
q0 q1 q1 q1
A B A
B
B
S = BABABBABABA
Suffix function
⚫ The suffix function σ(x,y) is the length of the
longest suffix of x that is a prefix of y
( x, y ) = max ( xm−i +1...m = y1...i )
i
σ(abcdab, ababcd) = ?
Suffix function
⚫ The suffix function σ(x,y) is the length of the
longest suffix of x that is a prefix of y
( x, y ) = max ( xm−i +1...m = y1...i )
i
σ(abcdab, ababcd) = 2
Suffix function
⚫ The suffix function σ(x,y) is the index of the
longest suffix of x that is a prefix of y
( x, y ) = max ( xm−i +1...m = y1...i )
i
σ(daabac, abacac) = ?
Suffix function
⚫ The suffix function σ(x,y) is the length of the
longest suffix of x that is a prefix of y
( x, y ) = max ( xm−i +1...m = y1...i )
i
σ(daabac, abacac) = 4
Suffix function
⚫ The suffix function σ(x,y) is the length of the
longest suffix of x that is a prefix of y
( x, y ) = max ( xm−i +1...m = y1...i )
i
σ(dabb, abacd) = ?
Suffix function
⚫ The suffix function σ(x,y) is the length of the
longest suffix of x that is a prefix of y
( x, y ) = max ( xm−i +1...m = y1...i )
i
σ(dabb, abacd) = 0
Building a string matching
automata
⚫ Given a pattern P = p1, p2, …, pm, we’d like to build
an FSA that recognizes P in strings
P = ababaca
Ideas?
Building a string matching automata
P = ababaca
⚫ Q = q1, q2, …, qm corresponding to each
symbol, plus a q0 starting state
⚫ the set of accepting states, A = {qm}
⚫ vocab Σ all symbols in P, plus one more
representing all symbols not in P
⚫ The transition function for q Q and a Σ is
defined as:
⚫ (q, a) = σ(p1…qa, P)
Transition function
P = ababaca
⚫ (q, a) = σ(p1…qa, P)
state a b c P
q0 ? a σ(a, ababaca)
q1 b
q2 a
q3 b
q4 a
q5 c
q6 a
q7
Transition function
P = ababaca
⚫ (q, a) = σ(p1…qa, P)
state a b c P
q0 1 ? a σ(b, ababaca)
q1 b
q2 a
q3 b
q4 a
q5 c
q6 a
q7
Transition function
P = ababaca
⚫ (q, a) = σ(p1…qa, P)
state a b c P
q0 1 0 ? a σ(b, ababaca)
q1 b
q2 a
q3 b
q4 a
q5 c
q6 a
q7
Transition function
P = ababaca
⚫ (q, a) = σ(p1…qa, P)
state a b c P
q0 1 0 0 a σ(b, ababaca)
q1 b
q2 a
q3 b
q4 a
q5 c
q6 a
q7
Transition function
P = ababaca
⚫ (q, a) = σ(p1…qa, P)
B,C
state a b c P
q0 1 0 0 a q0 q1
A
q1 b
q2 a
q3 b
q4 a
q5 c
q6 a
q7
Transition function
P = ababaca
⚫ (q, a) = σ(p1…qa, P)
state a b c P
q0 1 0 0 a
We’ve seen ‘aba’ so far
q1 1 2 0 b
q2 3 0 0 a σ(abaa, ababaca)
q3 ? b
q4 a
q5 c
q6 a
q7
Transition function
P = ababaca
⚫ (q, a) = σ(p1…qa, P)
state a b c P
q0 1 0 0 a
We’ve seen ‘aba’ so far
q1 1 2 0 b
q2 3 0 0 a σ(abaa, ababaca)
q3 1 b
q4 a
q5 c
q6 a
q7
Transition function
P = ababaca
⚫ (q, a) = σ(p1…qa, P)
state a b c P
q0 1 0 0 a
We’ve seen ‘ababa’ so far
q1 1 2 0 b
q2 3 0 0 a
q3 1 4 0 b
q4 5 0 0 a
q5 1 ? c
q6 a
q7
Transition function
P = ababaca
⚫ (q, a) = σ(p1…qa, P)
state a b c P
q0 1 0 0 a
We’ve seen ‘ababa’ so far
q1 1 2 0 b
q2 3 0 0 a σ(ababab, ababaca)
q3 1 4 0 b
q4 5 0 0 a
q5 1 ? c
q6 a
q7
Transition function
P = ababaca
⚫ (q, a) = σ(p1…qa, P)
state a b c P
q0 1 0 0 a
We’ve seen ‘ababa’ so far
q1 1 2 0 b
q2 3 0 0 a σ(ababab, ababaca)
q3 1 4 0 b
q4 5 0 0 a
q5 1 4 c
q6 a
q7
Transition function
P = ababaca
⚫ (q, a) = σ(p1…qa, P)
state a b c P
q0 1 0 0 a
q1 1 2 0 b
q2 3 0 0 a
q3 1 4 0 b
q4 5 0 0 a
q5 1 4 6 c
q6 7 0 0 a
q7 1 2 0
Matching runtime
⚫ Once we’ve built the FSA, what is the
runtime?
⚫ Θ(n) - Each symbol causes a state transition and
we only visit each character once
⚫ What is the cost to build the FSA?
⚫ How many entries in the table?
⚫ m|Σ| - Best case: Ω(m|Σ|)
⚫ How long does it take to calculate the suffix
function at each entry?
⚫ Naïve: O(m2)
⚫ Overall naïve: O(m3|Σ|)
⚫ Overall fast implementation O(m|Σ|)
Rabin-Karp algorithm
- Use a function T to that computes a numerical
representation of P
- Calculate T for all m symbol sequences of S
and compare
P = ABA
S = BABABBABABA
Rabin-Karp algorithm
- Use a function T to that computes a numerical
representation of P
- Calculate T for all m symbol sequences of S
and compare
P = ABA Hash P
T(P)
S = BABABBABABA
Rabin-Karp algorithm
- Use a function T to that computes a numerical
representation of P
- Calculate T for all m symbol sequences of S
and compare
P = ABA
Hash m symbol
S = BABABBABABA sequences and compare
T(BAB)
=
T(P)
Rabin-Karp algorithm
- Use a function T to that computes a numerical
representation of P
- Calculate T for all m symbol sequences of S
and compare
P = ABA
match
Hash m symbol
S = BABABBABABA sequences and compare
T(ABA)
=
T(P)
Rabin-Karp algorithm
- Use a function T to that computes a numerical
representation of P
- Calculate T for all m symbol sequences of S
and compare
P = ABA
Hash m symbol
S = BABABBABABA sequences and compare
T(BAB)
=
T(P)
Rabin-Karp algorithm
- Use a function T to that computes a numerical
representation of P
- Calculate T for all m symbol sequences of S
and compare
P = ABA
Hash m symbol
S = BABABBABABA sequences and compare
…
T(BAB)
=
T(P)
Rabin-Karp algorithm
For this to be
useful/efficient, what
P = ABA needs to be true
about T?
S = BABABBABABA
…
T(BAB)
=
T(P)
Rabin-Karp algorithm
For this to be
useful/efficient, what
P = ABA needs to be true
about T?
⚫ Given T(si…i+m-1) we must be
able to efficiently calculate
S = BABABBABABA T(si+1…i+m)
…
T(BAB)
=
T(P)
Calculating the hash function
⚫ For simplicity, assume Σ = (0, 1, 2, …, 9). (in
general we can use a base larger than 10).
⚫ A string can then be viewed as a decimal
number
⚫ How do we efficiently calculate the numerical
representation of a string?
T(‘9847261’) = ?
Horner’s rule
T ( p1...m ) = pm + 10 ( pm −1 + 10 ( pm − 2 + ... + 10 ( p2 + 10 p1 )))
9847261
9 * 10 = 90
(90 + 8)*10 = 980
(980 + 4)*10 = 9840
(9840 + 7)*10 = 98470
… = 9847621
Horner’s rule
T ( p1...m ) = pm + 10 ( pm −1 + 10 ( pm − 2 + ... + 10 ( p2 + 10 p1 )))
9847261
Running time?
9 * 10 = 90
Θ(m)
(90 + 8)*10 = 980
(980 + 4)*10 = 9840
(9840 + 7)*10 = 98470
… = 9847621
Calculating the hash on the
string
⚫ Given T(si…i+m-1) how can we efficiently
calculate T(si+1…i+m)?
m=4
963801572348267
T(si…i+m-1)
T ( si +1...i + m ) = 10 (T ( si...i + m −1 ) − 10 m −1 si ) + si + m
Calculating the hash on the
string
⚫ Given T(si…i+m-1) how can we efficiently
calculate T(si+1…i+m)?
m=4 801
963801572348267
T(si…i+m-1) subtract highest order digit
T ( si +1...i + m ) = 10 (T ( si...i + m −1 ) − 10 m −1 si ) + si + m
Calculating the hash on the
string
⚫ Given T(si…i+m-1) how can we efficiently
calculate T(si+1…i+m)?
m=4 8010
963801572348267
T(si…i+m-1)
shift digits up
T ( si +1...i + m ) = 10 (T ( si...i + m −1 ) − 10 m −1 si ) + si + m
Calculating the hash on the
string
⚫ Given T(si…i+m-1) how can we efficiently
calculate T(si+1…i+m)?
m=4 8015
963801572348267
T(si…i+m-1)
add in the lowest digit
T ( si +1...i + m ) = 10 (T ( si...i + m −1 ) − 10 m −1 si ) + si + m
Calculating the hash on the
string
⚫ Given T(si…i+m-1) how can we efficiently
calculate T(si+1…i+m)?
m=4 Running time?
- Θ(m) for s1…m
963801572348267
- O(1) for the rest
T(si…i+m-1)
T ( si +1...i + m ) = 10 (T ( si...i + m −1 ) − 10 m −1 si ) + si + m
Algorithm so far…
⚫ Is it correct?
⚫ Each string has a unique numerical value and we
compare that with each value in the string
⚫ Running time
⚫ Preprocessing:
⚫ Θ(m)
⚫ Matching
⚫ Θ(n-m+1)
Is there any problem with this analysis?
Algorithm so far…
⚫ Is it correct?
⚫ Each string has a unique numerical value and we
compare that with each value in the string
⚫ Running time
⚫ Preprocessing:
⚫ Θ(m)
⚫ Matching
⚫ Θ(n-m+1)
How long does the check T(P) = T(si…i+m-1) take?
Modular arithmetics
⚫ The run time assumptions we made were
assuming arithmetic operations were
constant time, which is not true for large
numbers
⚫ To keep the numbers small, we’ll use modular
arithmetics, i.e. all operations are performed
mod q
⚫ a+b = (a+b) mod q
⚫ a*b = (a*b) mod q
⚫ …
Modular arithmetics
⚫ If T(A) = T(B), then T(A) mod q = T(B) mod q
⚫ In general, we can apply mods as many times as
we want and we will not effect the result
⚫ What is the downside to this modular
approach?
⚫ Spurious hits: if T(A) mod q = T(B) mod q that
does not necessarily mean that T(A) = T(B)
⚫ If we find a hit, we must check that the actual
string matches the pattern
Runtime
⚫ Preprocessing
⚫ Θ(m)
⚫ Running time
⚫ Best case:
⚫ Θ(n-m+1) – No matches and no spurious hits
⚫ Worst case
⚫ Θ((n-m+1)m)
Average case running time
⚫ Assume v valid matches in the string
⚫ What is the probability of a spurious hit?
⚫ As with hashing, assume a uniform mapping onto
values of q:
Σ* 1…q
⚫ What is the probability under this assumption?
Average case running time
⚫ Assume v valid matches in the string
⚫ What is the probability of a spurious hit?
⚫ As with hashing, assume a uniform mapping onto
values of q:
Σ* 1…q
⚫ What is the probability under this assumption? 1/q
Average case running time
⚫ How many spurious hits?
⚫ n/q
⚫ Average case running time:
O(n-m+1) + O(m(v+n/q)
iterate over the checking matches
positions and spurious hits
Matching running times
Algorithm Preprocessing time Matching time
Naïve 0 O((n-m+1)m)
FSA Θ(m|Σ|) Θ(n)
Rabin-Karp Θ(m) O(n-m+1)m)
Knuth-Morris-Pratt Θ(m) Θ(n)