Algoritmen & Datastructuren
2012 2013
Substring Search
(Slides by Sedgewick)
Philip Dutr & Benedict Brown
Dept. of Computer Science, K.U.Leuven
Goal: Find pattern of length M in a text of length N
Usually M << N
Copyright Ph.Dutr, Spring 2013
Applications
Parsers
Spam filters
Digital libraries
Word processors
Web search engines
Natural language processing
Computational molecular biology
Feature detection in digitized images
Copyright Ph.Dutr, Spring 2013
Application: spam filtering
Identify patterns indicative
of spam.
PROFITS
L0SE WE1GHT
herbal V1AGRA
There is no catch
L0W M0RTGAGE RATES
This is a one-time mailing.
This message is sent in
compliance with spam
regulations
Copyright Ph.Dutr, Spring 2013
Application: electronic surveillance
Look for certain phrases in email traffic
Detonate Bomb
Attack Belgians
Ik zal een taart in het gezicht van Leonard werpen
Huge amount of data, incoming streams, no backup
Copyright Ph.Dutr, Spring 2013
Application: page scraping
Goal. Extract relevant data from web page.
E.g. Find string delimited by <b> and </b> after first
occurrence of pattern Last Trade
Copyright Ph.Dutr, Spring 2013
Brute force
Check all possible starting positions,
check entire length of pattern
Copyright Ph.Dutr, Spring 2013
Brute force
Copyright Ph.Dutr, Spring 2013
Brute force: worst case
Brute-force algorithm can be slow if text and pattern
are repetitive
Worst case: ~ M.N character comparisons
M.(N-M+1) = M.N M.M + M ~ M.N if M << N
Copyright Ph.Dutr, Spring 2013
Challenges
In many applications, backup in data to be avoided
Solution 1: maintain buffer (memory!)
Solution 2: avoid backup
Challenge: we want linear time complexity!
10
Copyright Ph.Dutr, Spring 2013
Brute force: alternate implementation
Same sequence of character compares
11
i points to end of sequence of already-matched chars in text
j stores number of already-matched chars
(end of sequence in pattern)
Copyright Ph.Dutr, Spring 2013
Knuth-Morris-Pratt (KMP)
Intuition: Suppose we are searching in text for pattern
BAAAAAAAAA (alphabet = {A,B})
Suppose we match 5 chars, with mismatch on 6th char
We know previous 6 characters in text are BAAAAB
Don't need to back up text pointer!
12
Copyright Ph.Dutr, Spring 2013
Deterministic finite state automaton (DFA)
DFA is abstract string-searching machine.
13
Finite number of states (including start and halt)
Exactly one transition for each char in alphabet
Accept if sequence of transitions leads to halt state
Copyright Ph.Dutr, Spring 2013
Interpretation of Knuth-Morris-Pratt DFA
Interpretation of DFA state after reading in txt[i]?
State-number = number of characters in pattern that have
been matched.
length of longest prefix of pat[] that is at end of txt[0..i]
E.g. DFA is in state 3 after reading in character txt[6]
14
Copyright Ph.Dutr, Spring 2013
KMP trace
15
Copyright Ph.Dutr, Spring 2013
KMP Implementation
Differences from brute-force implementation:
Text pointer i never decrements (no backup needed)
Need to precompute dfa[][] from pattern (some work)
Running time
Simulate DFA on text: at most N character accesses
Build DFA: how to do efficiently?
16
Copyright Ph.Dutr, Spring 2013
How to build DFA from pattern?
Match transition
first j characters of pattern
have already been matched
next char matches
If in state j and next char c == pat.charAt(j),
then go to state j+1.
now first j+1 characters of
pattern have been matched
17
Copyright Ph.Dutr, Spring 2013
How to build DFA from pattern?
Mismatch transition
To compute dfa[c][j]:
If in state j and next char c != pat.charAt(j), then the last j characters
of input are pat[1..j-1], followed by c
Simulate pat[1..j-1] on DFA (so far ) and take transition c
Running time: Seems to require j steps
Simulation does not start from scratch: remember state X
18
Copyright Ph.Dutr, Spring 2013
19
Copyright Ph.Dutr, Spring 2013
20
Copyright Ph.Dutr, Spring 2013
KMP Analysis
KMP accesses no more than M + N chars
M for constructing DFA
N for actual search
KMP constructs dfa[][] in time and space
proportional to R.M
21
R = size of alphabet
Copyright Ph.Dutr, Spring 2013
The Art of Computer Programming
Donald
Knuth
"If you think you're a really good programmer . . . read
(Knuth's) Art of Computer Programming . . . You should
definitely send me a rsum if you can read the whole
thing. (Bill Gates)
22
Copyright Ph.Dutr, Spring 2013
Boyer-Moore
Intuition:
23
Scan characters in pattern from right to left
Can skip M text chars when finding one not in the pattern
Copyright Ph.Dutr, Spring 2013
Boyer-Moore
How much to skip?
right[c] = rightmost occurrence of character c in pattern
-1 for characters not in pattern
24
Copyright Ph.Dutr, Spring 2013
Boyer-Moore
25
Copyright Ph.Dutr, Spring 2013
Boyer-Moore
26
Copyright Ph.Dutr, Spring 2013
Boyer-Moore Analysis
~ N/M character compares
(characters in pattern < characters in alphabet)
Worst-case: ~ M.N
27
KMP-like rules to avoid repetition linear ~ N
Copyright Ph.Dutr, Spring 2013
Rabin-Karp
Basic idea = modular hashing
28
Compute a hash of pattern characters 0 to M-1
For each i, compute a hash of text characters i to M+i-1
If pattern hash = text substring hash, check for a match
Copyright Ph.Dutr, Spring 2013
Rabin-Karp
Intuition: M-digit, base-R integer, modulo Q
ti = txt.charAt(i)
Horner's method
29
Linear-time method to evaluate degree-M polynomial
Copyright Ph.Dutr, Spring 2013
Rabin-Karp
Can update hash function in constant time!
30
Copyright Ph.Dutr, Spring 2013
Rabin-Karp
31
Copyright Ph.Dutr, Spring 2013
Rabin-Karp
Las Vegas variant
If hash is equal, then check pattern
100% correct
Extremely likely to run in linear time (but worst case
M.N)
Monte Carlo variant
If hash is equal, we assume pattern is equal as well
Always runs in linear time
Extremely likely to return correct answer
32
Probability 1/Q choose large Q
Copyright Ph.Dutr, Spring 2013