String processing algorithms
David Kauchak
cs161
Summer 2009
Administrative
⚫ Check your scores on coursework
⚫ SCPD Final exam: e-mail me with proctor
information
⚫ Office hours next week?
⚫ Reminder: HW6 due Wed. 8/12 before class
and no late homework
Where did “dynamic programming” come from?

Richard Bellman on the Birth of Dynamic Programming
Stuart Dreyfus
http://www.eng.tau.ac.il/~ami/cd/or50/1526-5463-2002-50-01-0048.pdf
Strings
⚫ Let Σ be an alphabet, e.g. Σ = {‘ ’, a, b, c, …, z}
⚫ A string is any member of Σ*, i.e. any sequence of 0 or more members of Σ
⚫ ‘this is a string’ ∈ Σ*
⚫ ‘this is also a string’ ∈ Σ*
⚫ ‘1234’ ∉ Σ*
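Membership in Σ* can be checked directly: a string belongs to Σ* exactly when every one of its characters is in Σ. A minimal Python sketch, assuming Σ is the space character plus the lowercase letters a to z (the names `SIGMA` and `in_sigma_star` are illustrative, not from the lecture):

```python
import string

# Assumed alphabet: the space character plus lowercase a..z.
SIGMA = set(" " + string.ascii_lowercase)

def in_sigma_star(s):
    """Return True if every character of s is a member of SIGMA."""
    return all(ch in SIGMA for ch in s)

print(in_sigma_star("this is a string"))  # True
print(in_sigma_star("1234"))              # False: digits are not in SIGMA
print(in_sigma_star(""))                  # True: the empty string is in Σ*
```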
String operations
⚫ Given strings s1 of length n and s2 of length m
⚫ Equality: is s1 = s2? (case sensitive or
insensitive)
‘this is a string’ = ‘this is a string’
‘this is a string’ ≠ ‘this is another string’
‘this is a string’ =? ‘THIS IS A STRING’

⚫ Running time
⚫ O(min(n, m)), i.e. linear in the length of the shorter string
String operations
⚫ Concatenate (append): create string s1s2
‘this is a’ . ‘ string’ → ‘this is a string’

⚫ Running time
⚫ Θ(n+m)
String operations
⚫ Substitute: Exchange all occurrences of a
particular character with another character
Substitute(‘this is a string’, ‘i’, ‘x’) →
‘thxs xs a strxng’
Substitute(‘banana’, ‘a’, ‘o’) → ‘bonono’

⚫ Running time
⚫ Θ(n)
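Substitute amounts to a single pass over the string. A minimal Python sketch (the function name `substitute` is chosen here for illustration):

```python
def substitute(s, old, new):
    """Replace every occurrence of the character old in s with new."""
    # One pass over the n characters of s, so Θ(n) time.
    return "".join(new if ch == old else ch for ch in s)

print(substitute("this is a string", "i", "x"))  # thxs xs a strxng
print(substitute("banana", "a", "o"))            # bonono
```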
String operations
⚫ Length: return the number of
characters/symbols in the string
Length(‘this is a string’) → 16
Length(‘this is another string’) → 22

⚫ Running time
⚫ O(1) or Θ(n) depending on implementation
String operations
⚫ Prefix: Get the first j characters in the string

Prefix(‘this is a string’, 4) → ‘this’

⚫ Running time
⚫ Θ(j)
⚫ Suffix: Get the last j characters in the string

Suffix(‘this is a string’, 6) → ‘string’

⚫ Running time
⚫ Θ(j)
String operations
⚫ Substring – Get the characters between i and
j inclusive

Substring(‘this is a string’, 4, 8) → ‘s is ’

⚫ Running time
⚫ Θ(j - i + 1)
⚫ Prefix?
⚫ Prefix(S, i) = Substring(S, 1, i)
⚫ Suffix?
⚫ Suffix(S, i) = Substring(S, n - i + 1, n), where n = Length(S)
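Prefix and Suffix can both be written in terms of Substring. A minimal Python sketch, assuming the slides' 1-based inclusive indexing and taking Suffix(S, j) to mean the last j characters, as defined earlier (the function names are illustrative):

```python
def substring(s, i, j):
    """Characters i..j of s, 1-based and inclusive."""
    return s[i - 1:j]

def prefix(s, j):
    """First j characters: Substring(S, 1, j)."""
    return substring(s, 1, j)

def suffix(s, j):
    """Last j characters: Substring(S, n - j + 1, n), where n = len(s)."""
    n = len(s)
    return substring(s, n - j + 1, n)

s = "this is a string"
print(repr(substring(s, 4, 8)))  # 's is '
print(prefix(s, 4))              # this
print(suffix(s, 6))              # string
```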
Edit distance
(aka Levenshtein distance)
⚫ Edit distance between two strings is the
minimum number of insertions, deletions and
substitutions required to transform string s1
into string s2
Insertion:

ABACED → ABACCED → DABACCED
(insert ‘C’, then insert ‘D’)

Deletion:

ABACED → BACED → BACE
(delete ‘A’, then delete ‘D’)

Substitution:

ABACED → ABADED → ABADES
(sub ‘D’ for ‘C’, then sub ‘S’ for ‘D’)

Edit distance examples

Edit(Kitten, Mitten) = 1

Operations:

Sub ‘M’ for ‘K’ Mitten

Edit distance examples

Edit(Happy, Hilly) = 3

Operations:

Sub ‘i’ for ‘a’ Hippy

Sub ‘l’ for ‘p’ Hilpy
Sub ‘l’ for ‘p’ Hilly
Edit distance examples

Edit(Banana, Car) = 5

Operations:

Delete ‘B’ anana

Delete ‘a’ nana
Delete ‘n’ naa
Sub ‘C’ for ‘n’ Caa
Sub ‘r’ for ‘a’ Car
Edit distance examples

Edit(Simple, Apple) = 3

Operations:

Delete ‘S’ imple

Sub ‘A’ for ‘i’ Ample
Sub ‘p’ for ‘m’ Apple
Is edit distance symmetric?
⚫ that is, is Edit(s1, s2) = Edit(s2, s1)?

Edit(Simple, Apple) =? Edit(Apple, Simple)

⚫ Why?
⚫ sub ‘i’ for ‘j’ → sub ‘j’ for ‘i’
⚫ delete ‘i’ → insert ‘i’
⚫ insert ‘i’ → delete ‘i’
Calculating edit distance

X=ABCBDAB

Y=BDCABA

Ideas?
Calculating edit distance

X=ABCBDA?

Y=BDCAB?

After all of the operations, X needs to equal Y

Operations: Insert, Delete, Substitute
Insert

X=ABCBDA?

Y=BDCAB?

$\mathrm{Edit}(X, Y) = 1 + \mathrm{Edit}(X_{1 \ldots n}, Y_{1 \ldots m-1})$

Delete

X=ABCBDA?

Y=BDCAB?

$\mathrm{Edit}(X, Y) = 1 + \mathrm{Edit}(X_{1 \ldots n-1}, Y_{1 \ldots m})$

Substitution

X=ABCBDA?

Y=BDCAB?

$\mathrm{Edit}(X, Y) = 1 + \mathrm{Edit}(X_{1 \ldots n-1}, Y_{1 \ldots m-1})$

Anything else?

Equal

X=ABCBDA?

Y=BDCAB?

$\mathrm{Edit}(X, Y) = \mathrm{Edit}(X_{1 \ldots n-1}, Y_{1 \ldots m-1})$

Combining results
Insert: $\mathrm{Edit}(X, Y) = 1 + \mathrm{Edit}(X_{1 \ldots n}, Y_{1 \ldots m-1})$

Delete: $\mathrm{Edit}(X, Y) = 1 + \mathrm{Edit}(X_{1 \ldots n-1}, Y_{1 \ldots m})$

Substitute: $\mathrm{Edit}(X, Y) = 1 + \mathrm{Edit}(X_{1 \ldots n-1}, Y_{1 \ldots m-1})$

Equal: $\mathrm{Edit}(X, Y) = \mathrm{Edit}(X_{1 \ldots n-1}, Y_{1 \ldots m-1})$

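Combining results

$$
\mathrm{Edit}(X, Y) = \min
\begin{cases}
1 + \mathrm{Edit}(X_{1 \ldots n}, Y_{1 \ldots m-1}) & \text{insertion} \\
1 + \mathrm{Edit}(X_{1 \ldots n-1}, Y_{1 \ldots m}) & \text{deletion} \\
\mathrm{Diff}(x_n, y_m) + \mathrm{Edit}(X_{1 \ldots n-1}, Y_{1 \ldots m-1}) & \text{equal/substitution}
\end{cases}
$$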
Running time

Θ(nm)
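The recurrence can be filled in bottom-up with a table of (n+1) × (m+1) entries, which is where the Θ(nm) bound comes from. A minimal Python sketch, assuming Diff is 0 when the two characters are equal and 1 otherwise, and assuming the base cases Edit(X, empty) = n and Edit(empty, Y) = m (neither base case is spelled out on the slides; the function name is illustrative):

```python
def edit_distance(x, y):
    """Levenshtein distance between x and y via bottom-up dynamic programming."""
    n, m = len(x), len(y)
    # d[i][j] holds Edit(first i characters of x, first j characters of y).
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i  # assumed base case: delete all i characters
    for j in range(m + 1):
        d[0][j] = j  # assumed base case: insert all j characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diff = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(
                1 + d[i][j - 1],         # insertion
                1 + d[i - 1][j],         # deletion
                diff + d[i - 1][j - 1],  # equal / substitution
            )
    return d[n][m]

print(edit_distance("Kitten", "Mitten"))  # 1
print(edit_distance("Happy", "Hilly"))    # 3
print(edit_distance("Banana", "Car"))     # 5
print(edit_distance("Simple", "Apple"))   # 3
```

Each table entry is computed once in constant time from three neighbours, so both time and space are Θ(nm) in this version.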
Variants
⚫ Only include insertions and deletions
⚫ What does this do to substitutions?
⚫ Include swaps, i.e. swapping two adjacent
characters counts as one edit
⚫ Weight insertion, deletion and substitution
differently
⚫ Weight specific character insertion, deletion
and substitutions differently
⚫ Length normalize the edit distance
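Several of these variants only change the costs in the recurrence. A minimal Python sketch with per-operation weights; the parameter names and the choice to normalize by the longer string's length are assumptions made here, not definitions from the lecture:

```python
def weighted_edit_distance(x, y, insert_cost=1, delete_cost=1, sub_cost=1):
    """Edit distance where each operation type has its own cost."""
    n, m = len(x), len(y)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + delete_cost
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + insert_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diff = 0 if x[i - 1] == y[j - 1] else sub_cost
            d[i][j] = min(
                insert_cost + d[i][j - 1],
                delete_cost + d[i - 1][j],
                diff + d[i - 1][j - 1],
            )
    return d[n][m]

def normalized_edit_distance(x, y):
    """Unit-cost distance divided by the longer length (one possible normalization)."""
    longest = max(len(x), len(y))
    return weighted_edit_distance(x, y) / longest if longest else 0.0

# With sub_cost=2 a substitution is never cheaper than a delete plus an insert,
# so the result equals the insertions-and-deletions-only distance.
print(weighted_edit_distance("Kitten", "Mitten", sub_cost=2))  # 2
print(weighted_edit_distance("Banana", "Car", sub_cost=2))     # 7
```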
String matching
⚫ Given a pattern string P of length m and a
string S of length n, find all locations where P
occurs in S
P = ABA

S = DCABABBABABA
Uses
⚫ grep/egrep
⚫ search
⚫ find
⚫ java.lang.String.contains()
Naive implementation
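A minimal Python sketch of the naive approach: try every alignment of P against S and run an equality check at each one. The function name `naive_match` and the 0-based result indices are assumptions made here for illustration:

```python
def naive_match(p, s):
    """Return every 0-based index i at which p occurs in s."""
    m, n = len(p), len(s)
    matches = []
    for i in range(n - m + 1):   # n - m + 1 candidate alignments
        if s[i:i + m] == p:      # equality check: up to m character comparisons
            matches.append(i)
    return matches

print(naive_match("ABA", "DCABABBABABA"))  # [2, 7, 9]
```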
Is it correct?
Running time?

⚫ What is the cost of the equality check?

⚫ Best case: O(1)
⚫ Worst case: O(m)
Running time?

⚫ Best case
⚫ Θ(n) – when the first character of the pattern does
not occur in the string
⚫ Worst case
⚫ O((n-m+1)m)
Worst case
P = AAAA

S = AAAAAAAAAAAAA
repeated work!

Ideally, after the first match, we’d know to just check the next character to see if it is an ‘A’
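The repeated work can be made concrete by counting character comparisons. A minimal Python sketch, taking S to be a run of 13 ‘A’s (an assumption about the slide's example) and using an illustrative function name:

```python
def count_comparisons(p, s):
    """Count the character comparisons made by the naive matcher."""
    m, n = len(p), len(s)
    comparisons = 0
    for i in range(n - m + 1):
        for j in range(m):
            comparisons += 1
            if s[i + j] != p[j]:
                break  # mismatch: give up on this alignment
    return comparisons

print(count_comparisons("AAAA", "A" * 13))  # 40 = (13 - 4 + 1) * 4
print(count_comparisons("BAAA", "A" * 13))  # 10: one comparison per alignment
```

The first call hits the O((n-m+1)m) worst case; the second shows the best case, where the first character of the pattern never matches.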
Patterns
⚫ Which of these patterns will have that
problem?

P = ABAB

P = ABDC

P = BAA

P = ABBCDDCAABB
If the pattern has a proper suffix (shorter than the whole pattern) that is also a prefix, then we will have this problem. Of the patterns above, that is ABAB (‘AB’) and ABBCDDCAABB (‘ABB’); ABDC and BAA have no such suffix.
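Whether a pattern has a suffix that is also a prefix can be checked by brute force. A minimal Python sketch that returns the longest such overlap, excluding the whole pattern itself (the name `longest_border` is illustrative, and this quadratic check is only for demonstration):

```python
def longest_border(p):
    """Longest proper suffix of p that is also a prefix of p ('' if none)."""
    for k in range(len(p) - 1, 0, -1):  # try the longest candidates first
        if p[:k] == p[-k:]:
            return p[:k]
    return ""

for p in ["ABAB", "ABDC", "BAA", "ABBCDDCAABB", "AAAA"]:
    print(p, repr(longest_border(p)))
# ABAB 'AB'
# ABDC ''
# BAA ''
# ABBCDDCAABB 'ABB'
# AAAA 'AAA'
```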
Finite State Automata (FSA)
⚫ An FSA is defined by 5 components
⚫ Q is the set of states
q0 q1 q2 … qn
⚫ q0 is the start state
⚫ A ⊆ Q is the set of accepting states, where |A| > 0
⚫ Σ is the alphabet (e.g. {A, B})
⚫ δ is the transition function from Q × Σ to Q

Q    Σ    Q
q0   A    q1
q0   B    q2
q1   A    q1
…
FSA operation

[Figure: FSA state diagram; states q0, q1, … with transitions labelled A and B]

An FSA starts at state q0 and reads the characters of the input string one at a time.
If the automaton is in state q and reads character a, then it transitions to state δ(q, a).
If the FSA reaches an accepting state (q ∈ A), then the FSA has found a match.
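The operation described above can be sketched in a few lines of Python. The transition table below is hand-built here for P = ABA over Σ = {A, B}, with state k meaning "the last k characters read match the first k characters of P"; it is an assumption for illustration and is not copied from the slide's diagram. The input omits the leading ‘DC’ of the earlier example string because those characters are outside this alphabet.

```python
def run_fsa(delta, start, accepting, text):
    """Feed text to the automaton; return the 0-based positions
    at which it is in an accepting state."""
    q = start
    hits = []
    for pos, a in enumerate(text):
        q = delta[(q, a)]  # transition to δ(q, a)
        if q in accepting:
            hits.append(pos)
    return hits

# Assumed transition function for P = ABA over the alphabet {A, B}.
DELTA = {
    (0, "A"): 1, (0, "B"): 0,
    (1, "A"): 1, (1, "B"): 2,
    (2, "A"): 3, (2, "B"): 0,
    (3, "A"): 1, (3, "B"): 2,
}

print(run_fsa(DELTA, 0, {3}, "ABABBABABA"))  # [2, 7, 9]
```

Each reported position is where a match ends; the match starts m - 1 = 2 characters earlier. Every input character is read exactly once, so the scan takes Θ(n) time once the table exists.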
FSA operation
P = ABA
[Figure: FSA state diagram for P = ABA; transitions labelled A and B]