[go: up one dir, main page]

0% found this document useful (0 votes)
19 views111 pages

12 Strings.v3

String matching
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPT, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
19 views111 pages

12 Strings.v3

String matching
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPT, PDF, TXT or read online on Scribd
You are on page 1/ 111

String processing

algorithms
David Kauchak
cs161
Summer 2009
Administrative
 Check your scores on coursework
 SCPD Final exam: e-mail me with proctor

information
 Office hours next week?

 Reminder: HW6 due Wed. 8/12 before class

and no late homework


Where did “dynamic programming” come from?

Richard Bellman On the Birth of


Dynamic Programming
Stuart Dreyfus
http://www.eng.tau.ac.il/~ami/cd/
or50/1526-5463-2002-50-01-
0048.pdf
Strings
 Let Σ be an alphabet, e.g. Σ = ( , a, b, c, …,
z)
 A string is any member of Σ*, i.e. any

sequence of 0 or more members of Σ


 ‘this is a string’  Σ*
 ‘this is also a string’  Σ*
 ‘1234’  Σ*
String operations
 Given strings s1 of length n and s2 of length m
 Equality: is s1 = s2? (case sensitive or
insensitive)
‘this is a string’ = ‘this is a string’
‘this is a string’ ≠ ‘this is another string’
‘this is a string’ =? ‘THIS IS A STRING’

 Running time
 O(n) where n is length of shortest string
String operations
 Concatenate (append): create string s1s2
‘this is a’ . ‘ string’ → ‘this is a string’

 Running time
 Θ(n+m)
String operations
 Substitute: Exchange all occurrences of a
particular character with another character
Substitute(‘this is a string’, ‘i’, ‘x’) →
‘thxs xs a strxng’
Substitute(‘banana’, ‘a’, ‘o’) → ‘bonono’

 Running time
 Θ(n)
String operations
 Length:return the number of
characters/symbols in the string
Length(‘this is a string’) → 16
Length(‘this is another string’) → 24

 Running time
 O(1) or Θ(n) depending on implementation
String operations
 Prefix: Get the first j characters in the string

Prefix(‘this is a string’, 4) → ‘this’

 Running time
 Θ(j)
 Suffix: Get the last j characters in the string
Suffix(‘this is a string’, 6) → ‘string’

 Running time
 Θ(j)
String operations
 Substring – Get the characters between i and
j inclusive

Substring(‘this is a string’, 4, 8) → ‘s is ’

 Running time
 Θ(j - i)
 Prefix?
 Prefix(S, i) = Substring(S, 1, i)
 Suffix?
 Suffix(S, i) = Substring(S, i+1, length(n))
Edit distance
(aka Levenshtein distance)
 Edit distance between two strings is the
minimum number of insertions, deletions and
substitutions required to transform string s1
into string s2
Insertion:

ABACED ABACCED DABACCED

Insert ‘C’ Insert ‘D’


Edit distance
(aka Levenshtein distance)
 Edit distance between two strings is the
minimum number of insertions, deletions and
substitutions required to transform string s1
into string s2
Deletion:

ABACED
Edit distance
(aka Levenshtein distance)
 Edit distance between two strings is the
minimum number of insertions, deletions and
substitutions required to transform string s1
into string s2
Deletion:

ABACED BACED

Delete ‘A’
Edit distance
(aka Levenshtein distance)
 Edit distance between two strings is the
minimum number of insertions, deletions and
substitutions required to transform string s1
into string s2
Deletion:

ABACED BACED BACE

Delete ‘A’ Delete ‘D’


Edit distance
(aka Levenshtein distance)
 Edit distance between two strings is the
minimum number of insertions, deletions and
substitutions required to transform string s1
into string s2
Substitution:

ABACED ABADED ABADES

Sub ‘D’ for ‘C’ Sub ‘S’ for ‘D’


Edit distance examples

Edit(Kitten, Mitten) = 1

Operations:

Sub ‘M’ for ‘K’ Mitten


Edit distance examples

Edit(Happy, Hilly) = 3

Operations:

Sub ‘a’ for ‘i’ Hippy


Sub ‘l’ for ‘p’ Hilpy
Sub ‘l’ for ‘p’ Hilly
Edit distance examples

Edit(Banana, Car) = 5

Operations:

Delete ‘B’ anana


Delete ‘a’ nana
Delete ‘n’ naa
Sub ‘C’ for ‘n’ Caa
Sub ‘a’ for ‘r’ Car
Edit distance examples

Edit(Simple, Apple) = 3

Operations:

Delete ‘S’ imple


Sub ‘A’ for ‘i’ Ample
Sub ‘m’ for ‘p’ Apple
Is edit distance symmetric?
 that is, is Edit(s1, s2) = Edit(s2, s1)?

Edit(Simple, Apple) =? Edit(Apple, Simple)

 Why?
 sub ‘i’ for ‘j’ → sub ‘j’ for ‘i’
 delete ‘i’ → insert ‘i’
 insert ‘i’ → delete ‘i’
Calculating edit distance

X=ABCBDAB

Y=BDCABA

Ideas?
Calculating edit distance

X=ABCBDA?

Y=BDCAB?

After all of the operations, X needs


to equal Y
Calculating edit distance

X=ABCBDA?

Y=BDCAB?

Operations: Insert
Delete
Substitute
Insert

X=ABCBDA?

Y=BDCAB?
Insert

X=ABCBDA?
Edit

Y=BDCAB?

Edit ( X , Y ) 1  Edit ( X 1...n , Y1...m  1 )


Delete

X=ABCBDA?

Y=BDCAB?
Delete

X=ABCBDA?
Edit

Y=BDCAB?

Edit ( X , Y ) 1  Edit ( X 1...n  1 , Y1...m )


Substition

X=ABCBDA?

Y=BDCAB?
Substition

X=ABCBDA?
Edit

Y=BDCAB?

Edit ( X , Y ) 1  Edit ( X 1...n  1 , Y1...m  1 )


Anything else?

X=ABCBDA?

Y=BDCAB?
Equal

X=ABCBDA?

Y=BDCAB?
Equal

X=ABCBDA?
Edit

Y=BDCAB?

Edit ( X , Y ) Edit ( X 1...n  1 , Y1...m  1 )


Combining results
Insert: Edit ( X , Y ) 1  Edit ( X 1...n , Y1...m  1 )

Delete: Edit ( X , Y ) 1  Edit ( X 1...n  1 , Y1...m )

Substitute: Edit ( X , Y ) 1  Edit ( X 1...n  1 , Y1...m  1 )

Equal: Edit ( X , Y ) Edit ( X 1...n  1 , Y1...m  1 )


Combining results
 1  Edit(X 1...n ,Y1...m  1 ) insertion

Edit ( X , Y ) min  1  Edit ( X 1...n  1 , Y1...m ) deletion
 Diff ( x , y )  Edit ( X
 n m 1... n  1 , Y1... m  1 ) equal/substitution
Running time

Θ(nm)
Variants
 Only include insertions and deletions
 What does this do to substitutions?
 Include swaps, i.e. swapping two adjacent
characters counts as one edit
 Weight insertion, deletion and substitution
differently
 Weight specific character insertion, deletion
and substitutions differently
 Length normalize the edit distance
String matching
 Given a pattern string P of length m and a
string S of length n, find all locations where P
occurs in S
P = ABA

S = DCABABBABABA
String matching
 Given a pattern string P of length m and a
string S of length n, find all locations where P
occurs in S
P = ABA

S = DCABABBABABA
Uses
 grep/egrep

 search

 find

 java.lang.String.contains()
Naive implementation
Is it correct?
Running time?

 What is the cost of the equality check?


 Best case: O(1)
 Worst case: O(m)
Running time?

 Best case
 Θ(n) – when the first character of the pattern does
not occur in the string
 Worst case
 O((n-m+1)m)
Worst case
P = AAAA

S = AAAAAAAAAAAAA
Worst case
P = AAAA

S = AAAAAAAAAAAAA
Worst case
P = AAAA

S = AAAAAAAAAAAAA
Worst case
P = AAAA

S = AAAAAAAAAAAAA

repeated work!
Worst case
P = AAAA

S = AAAAAAAAAAAAA

Ideally, after the first match, we’d


know to just check the next
character to see if it is an ‘A’
Patterns
 Whichof these patterns will have that
problem?

P = ABAB

P = ABDC

P = BAA

P = ABBCDDCAABB
Patterns
 Whichof these patterns will have that
problem?

P = ABAB If the pattern has a


suffix that is also a
P = ABDC prefix then we will
have this problem

P = BAA

P = ABBCDDCAABB
Finite State Automata (FSA)
 An FSA is defined by 5 components
 Q is the set of states
q0 q1 q2 … qn
Finite State Automata (FSA)
 An FSA is defined by 5 components
 Q is the set of states
q0 q1 q2 … qn

 q0 is the start state q7

 A  Q, is the set of accepting states where |A| > 0


 Σ is the alphabet (e.g. {A, B}
  is the transition function from Q x Σ to Q
QΣ Q B

q0 A q1
q0 B q2
q0 q1
A
q2 …
A


q1 A q1
FSA operation

B A A

q0 q1 q1 q1
A B A

B
B

An FSA starts at state q0 and reads the characters of the input


string one at a time.
If the automaton is in state q and reads character a, then it
transitions to state (q,a).
If the FSA reaches an accepting state (q  A), then the FSA has
found a match.
FSA operation
P = ABA
B A A

q0 q1 q1 q1
A B A

B
B

What pattern does this represent?


FSA operation
P = ABA
B A A

q0 q1 q1 q1
A B A

B
B

S = BABABBABABA
FSA operation
P = ABA
B A A

q0 q1 q1 q1
A B A

B
B

S = BABABBABABA
FSA operation
P = ABA
B A A

q0 q1 q1 q1
A B A

B
B

S = BABABBABABA
FSA operation
P = ABA
B A A

q0 q1 q1 q1
A B A

B
B

S = BABABBABABA
FSA operation
P = ABA
B A A

q0 q1 q1 q1
A B A

B
B

S = BABABBABABA
FSA operation
P = ABA
B A A

q0 q1 q1 q1
A B A

B
B

S = BABABBABABA
FSA operation
P = ABA
B A A

q0 q1 q1 q1
A B A

B
B

S = BABABBABABA
FSA operation
P = ABA
B A A

q0 q1 q1 q1
A B A

B
B

S = BABABBABABA
FSA operation
P = ABA
B A A

q0 q1 q1 q1
A B A

B
B

S = BABABBABABA
FSA operation
P = ABA
B A A

q0 q1 q1 q1
A B A

B
B

S = BABABBABABA
FSA operation
P = ABA
B A A

q0 q1 q1 q1
A B A

B
B

S = BABABBABABA
FSA operation
P = ABA
B A A

q0 q1 q1 q1
A B A

B
B

S = BABABBABABA
Suffix function
 Thesuffix function σ(x,y) is the length of the
longest suffix of x that is a prefix of y

 ( x, y ) max ( xm  i 1...m  y1...i )


i

σ(abcdab, ababcd) = ?
Suffix function
 Thesuffix function σ(x,y) is the length of the
longest suffix of x that is a prefix of y

 ( x, y ) max ( xm  i 1...m  y1...i )


i

σ(abcdab, ababcd) = 2
Suffix function
 Thesuffix function σ(x,y) is the index of the
longest suffix of x that is a prefix of y

 ( x, y ) max ( xm  i 1...m  y1...i )


i

σ(daabac, abacac) = ?
Suffix function
 Thesuffix function σ(x,y) is the length of the
longest suffix of x that is a prefix of y

 ( x, y ) max ( xm  i 1...m  y1...i )


i

σ(daabac, abacac) = 4
Suffix function
 Thesuffix function σ(x,y) is the length of the
longest suffix of x that is a prefix of y

 ( x, y ) max ( xm  i 1...m  y1...i )


i

σ(dabb, abacd) = ?
Suffix function
 Thesuffix function σ(x,y) is the length of the
longest suffix of x that is a prefix of y

 ( x, y ) max ( xm  i 1...m  y1...i )


i

σ(dabb, abacd) = 0
Building a string matching
automata
 Given a pattern P = p1, p2, …, pm, we’d like to build
an FSA that recognizes P in strings

P = ababaca

Ideas?
Building a string matching automata
P = ababaca
Q = q1, q2, …, qm corresponding to each
symbol, plus a q0 starting state
 the set of accepting states, A = {qm}
 vocab Σ all symbols in P, plus one more
representing all symbols not in P
 The transition function for q  Q and a  Σ is

defined as:
 (q, a) = σ(p1…qa, P)
Transition function
P = ababaca
 (q, a) = σ(p1…qa, P)
state a b c P
q0 ? a σ(a, ababaca)
q1 b
q2 a
q3 b
q4 a
q5 c
q6 a
q7
Transition function
P = ababaca
 (q, a) = σ(p1…qa, P)
state a b c P
q0 1 ? a σ(b, ababaca)
q1 b
q2 a
q3 b
q4 a
q5 c
q6 a
q7
Transition function
P = ababaca
 (q, a) = σ(p1…qa, P)
state a b c P
q0 1 0 ? a σ(b, ababaca)
q1 b
q2 a
q3 b
q4 a
q5 c
q6 a
q7
Transition function
P = ababaca
 (q, a) = σ(p1…qa, P)
state a b c P
q0 1 0 0 a σ(b, ababaca)
q1 b
q2 a
q3 b
q4 a
q5 c
q6 a
q7
Transition function
P = ababaca
 (q, a) = σ(p1…qa, P)
B,C
state a b c P
q0 1 0 0 a q0 q1
A
q1 b
q2 a
q3 b
q4 a
q5 c
q6 a
q7
Transition function
P = ababaca
 (q, a) = σ(p1…qa, P)
state a b c P
q0 1 0 0 a
We’ve seen ‘aba’ so far
q1 1 2 0 b
q2 3 0 0 a σ(abaa, ababaca)
q3 ? b
q4 a
q5 c
q6 a
q7
Transition function
P = ababaca
 (q, a) = σ(p1…qa, P)
state a b c P
q0 1 0 0 a
We’ve seen ‘aba’ so far
q1 1 2 0 b
q2 3 0 0 a σ(abaa, ababaca)
q3 1 b
q4 a
q5 c
q6 a
q7
Transition function
P = ababaca
 (q, a) = σ(p1…qa, P)
state a b c P
q0 1 0 0 a
We’ve seen ‘ababa’ so far
q1 1 2 0 b
q2 3 0 0 a
q3 1 4 0 b
q4 5 0 0 a
q5 1 ? c
q6 a
q7
Transition function
P = ababaca
 (q, a) = σ(p1…qa, P)
state a b c P
q0 1 0 0 a
We’ve seen ‘ababa’ so far
q1 1 2 0 b
q2 3 0 0 a σ(ababab, ababaca)
q3 1 4 0 b
q4 5 0 0 a
q5 1 ? c
q6 a
q7
Transition function
P = ababaca
 (q, a) = σ(p1…qa, P)
state a b c P
q0 1 0 0 a
We’ve seen ‘ababa’ so far
q1 1 2 0 b
q2 3 0 0 a σ(ababab, ababaca)
q3 1 4 0 b
q4 5 0 0 a
q5 1 4 c
q6 a
q7
Transition function
P = ababaca
 (q, a) = σ(p1…qa, P)
state a b c P
q0 1 0 0 a
q1 1 2 0 b
q2 3 0 0 a
q3 1 4 0 b
q4 5 0 0 a
q5 1 4 6 c
q6 7 0 0 a
q7 1 2 0
Matching runtime
 Once we’ve built the FSA, what is the
runtime?
 Θ(n) - Each symbol causes a state transition and
we only visit each character once
 What is the cost to build the FSA?
 How many entries in the table?
 m|Σ| - Best case: Ω(m|Σ|)
 How long does it take to calculate the suffix
function at each entry?
 Naïve: O(m2)
 Overall naïve: O(m3|Σ|)
 Overall fast implementation O(m|Σ|)
Rabin-Karp algorithm
- Use a function T to that computes a numerical
representation of P
- Calculate T for all m symbol sequences of S
and compare

P = ABA

S = BABABBABABA
Rabin-Karp algorithm
- Use a function T to that computes a numerical
representation of P
- Calculate T for all m symbol sequences of S
and compare

P = ABA Hash P

T(P)

S = BABABBABABA
Rabin-Karp algorithm
- Use a function T to that computes a numerical
representation of P
- Calculate T for all m symbol sequences of S
and compare

P = ABA

Hash m symbol
S = BABABBABABA sequences and compare

T(BAB)
=
T(P)
Rabin-Karp algorithm
- Use a function T to that computes a numerical
representation of P
- Calculate T for all m symbol sequences of S
and compare

P = ABA

match
Hash m symbol
S = BABABBABABA sequences and compare

T(ABA)
=
T(P)
Rabin-Karp algorithm
- Use a function T to that computes a numerical
representation of P
- Calculate T for all m symbol sequences of S
and compare

P = ABA

Hash m symbol
S = BABABBABABA sequences and compare

T(BAB)
=
T(P)
Rabin-Karp algorithm
- Use a function T to that computes a numerical
representation of P
- Calculate T for all m symbol sequences of S
and compare

P = ABA

Hash m symbol
S = BABABBABABA sequences and compare

T(BAB)
=
T(P)
Rabin-Karp algorithm

For this to be
useful/efficient, what
P = ABA needs to be true
about T?

S = BABABBABABA

T(BAB)
=
T(P)
Rabin-Karp algorithm

For this to be
useful/efficient, what
P = ABA needs to be true
about T?

 Given T(si…i+m-1) we must be


able to efficiently calculate
S = BABABBABABA T(si+1…i+m)

T(BAB)
=
T(P)
Calculating the hash function
 For simplicity, assume Σ = (0, 1, 2, …, 9). (in
general we can use a base larger than 10).
 A string can then be viewed as a decimal

number
 How do we efficiently calculate the numerical

representation of a string?

T(‘9847261’) = ?
Horner’s rule
T ( p1...m )  pm  10( pm  1  10( pm  2  ...  10( p2  10 p1 )))

9847261
9 * 10 = 90

(90 + 8)*10 = 980

(980 + 4)*10 = 9840

(9840 + 7)*10 = 98470

… = 9847621
Horner’s rule
T ( p1...m )  pm  10( pm  1  10( pm  2  ...  10( p2  10 p1 )))

9847261 Running time?


9 * 10 = 90
Θ(m)
(90 + 8)*10 = 980

(980 + 4)*10 = 9840

(9840 + 7)*10 = 98470

… = 9847621
Calculating the hash on the
string
 Given T(si…i+m-1) how can we efficiently
calculate T(si+1…i+m)?

m=4
963801572348267
T(si…i+m-1)

T ( si 1...i m ) 10(T ( si...i m  1 )  10m  1 si )  si m


Calculating the hash on the
string
 Given T(si…i+m-1) how can we efficiently
calculate T(si+1…i+m)?

m=4 801
963801572348267
T(si…i+m-1) subtract highest order digit

T ( si 1...i m ) 10(T ( si...i m  1 )  10m  1 si )  si m


Calculating the hash on the
string
 Given T(si…i+m-1) how can we efficiently
calculate T(si+1…i+m)?

m=4 8010
963801572348267
T(si…i+m-1)
shift digits up

T ( si 1...i m ) 10(T ( si...i m  1 )  10m  1 si )  si m


Calculating the hash on the
string
 Given T(si…i+m-1) how can we efficiently
calculate T(si+1…i+m)?

m=4 8015
963801572348267
T(si…i+m-1)
add in the lowest digit

T ( si 1...i m ) 10(T ( si...i m  1 )  10m  1 si )  si m


Calculating the hash on the
string
 Given T(si…i+m-1) how can we efficiently
calculate T(si+1…i+m)?

m=4 Running time?


- Θ(m) for s1…m
963801572348267
- O(1) for the rest
T(si…i+m-1)

T ( si 1...i m ) 10(T ( si...i m  1 )  10m  1 si )  si m


Algorithm so far…
 Is it correct?
 Each string has a unique numerical value and we
compare that with each value in the string
 Running time
 Preprocessing:
 Θ(m)
 Matching
 Θ(n-m+1)

Is there any problem with this analysis?


Algorithm so far…
 Is it correct?
 Each string has a unique numerical value and we
compare that with each value in the string
 Running time
 Preprocessing:
 Θ(m)
 Matching
 Θ(n-m+1)

How long does the check T(P) = T(si…i+m-1) take?


Modular arithmetics
 The run time assumptions we made were
assuming arithmetic operations were
constant time, which is not true for large
numbers
 To keep the numbers small, we’ll use
modular arithmetics, i.e. all operations are
performed mod q
 a+b = (a+b) mod q
 a*b = (a*b) mod q
 …
Modular arithmetics
 If T(A) = T(B), then T(A) mod q = T(B) mod q
 In general, we can apply mods as many times as
we want and we will not effect the result
 What is the downside to this modular
approach?
 Spurious hits: if T(A) mod q = T(B) mod q that
does not necessarily mean that T(A) = T(B)
 If we find a hit, we must check that the actual
string matches the pattern
Runtime
 Preprocessing
 Θ(m)
 Running time
 Best case:
 Θ(n-m+1) – No matches and no spurious hits
 Worst case
 Θ((n-m+1)m)
Average case running time
 Assume v valid matches in the string
 What is the probability of a spurious hit?
 As with hashing, assume a uniform mapping onto
values of q:

Σ* 1…q

 What is the probability under this assumption?


Average case running time
 Assume v valid matches in the string
 What is the probability of a spurious hit?
 As with hashing, assume a uniform mapping onto
values of q:

Σ* 1…q

 What is the probability under this assumption? 1/q


Average case running time
 How many spurious hits?
 n/q
 Average case running time:
O(n-m+1) + O(m(v+n/q)

iterate over the checking matches


positions and spurious hits
Matching running times

Algorithm Preprocessing time Matching time


Naïve 0 O((n-m+1)m)
FSA Θ(m|Σ|) Θ(n)
Rabin-Karp Θ(m) O(n-m+1)m)
Knuth-Morris-Pratt Θ(m) Θ(n)

You might also like