pattern matching
pattern matching
1
String Searching
• The object of string searching is to find the location of a
specific text pattern within a larger body of text (e.g., a
sentence, a paragraph, a book, etc.).
• As with most algorithms, the main considerations for string
searching are speed and efficiency.
• There are a number of string searching algorithms in
existence today, but the three we shall review are Brute
Force,Rabin-Karp, and Knuth-Morris-Pratt.
2
Brute Force
• The Brute Force algorithm compares the pattern to the text, one
character at a time, until unmatching characters are found
4
Brute Force-Complexity
• Given a pattern M characters in length, and a text N characters in
length...
• Worst case: compares pattern to each substring of text of length M.
For example, M=5.
• This kind of case can occur for image data.
9
Rabin-Karp Algorithm
11
Rabin-Karp Math
• Consider an M-character sequence as an M-digit number in base b,
where b is the number of letters in the alphabet. The text subsequence
t[i .. i+M-1] is mapped to the number
13
•
Rabin-Karp Mods
If M is large, then the resulting value (~bM) will be enormous. For this
reason, we hash the value by taking it mod a prime number q.
• The mod function (% in Java) is particularly useful in this case due to
several of its inherent properties:
[(x mod q) + (y mod q)] mod q = (x+y) mod q
(x mod q) mod q = x mod q
• For these reasons:
h(i)=((t[i] bM-1 mod q) +(t[i+1] bM-2 mod q) + ...
+(t[i+M-1] mod q))mod q
h(i+1) =( h(i) b mod q
Shift left one digit
-t[i] bM mod q
Subtract leftmost digit
+t[i+M] mod q )
Add new rightmost digit 14
mod q
Rabin-Karp Mods
15
Rabin-Karp Mods
16
Rabin-Karp Complexity
• If a sufficiently large prime number is used for the hash function,
the hashed values of two different patterns will usually be distinct.
• If this is the case, searching takes O(N) time, where N is the
number of characters in the larger body of text.
• It is always possible to construct a scenario with a worst case
complexity of O(MN). This, however, is likely to happen only if
the prime number used for hashing is small.
17
The Knuth-Morris-Pratt Algorithm
• The Knuth-Morris-Pratt (KMP) string searching algorithm differs from the
brute-force algorithm by keeping track of information gained from previous
comparisons.
• A failure function (f) is computed that indicates how much of the last
comparison can be reused if it fails.
• Specifically, f is defined to be the longest prefix of the pattern P[0,..,j] that is
also a suffix of P[1,..,j]
-Note: not a suffix of P[0,..,j]
• Example:-value of the
• KMP failure function:
• This shows how much of the beginning of the string matches up to the
portion immediately preceding a failed comparison.
-if the comparison fails at (4), we know the a,b in positions 2,3 is identical
to positions 0,1 18
The Knuth-Morris-Pratt
Algorithm
19
The Knuth-Morris-Pratt
Algorithm
20
The Knuth-Morris-Pratt
Algorithm
21
The Knuth-Morris-Pratt
Algorithm
22
The Knuth-Morris-Pratt
Algorithm
23
The Knuth-Morris-Pratt
Algorithm
24
The Knuth-Morris-Pratt
Algorithm
25
The Knuth-Morris-Pratt
Algorithm
26
The KMP Algorithm (contd.)
• the KMP string matching algorithm: Pseudo-Code
Algorithm KMPMatch(T,P)
Input: Strings T (text) with n characters and P
(pattern) with m characters.
Output: Starting index of the first substring of T
matching P, or an indication that P is not a
substring of T.
27
Algorithm
f KMPFailureFunction(P) {build failure function}
i0
j0
while i < n do
if P[j] = T[i] then
if j = m - 1 then
return i - m - 1 {a match}
ii+1
jj+1
else if j > 0 then {no match, but we have advanced}
j f(j-1) {j indexes just after matching prefix in P}
else
ii+1
return “There is no substring of T matching P”
28
The KMP Algorithm (contd.)
• The KMP failure function: Pseudo-Code
Algorithm KMPMatch(T,P)
Input: String P (pattern) with m characters
Output: The failure function f for P, which maps j to
the length of the longest prefix of P that is a suffix
of P[1,..,j]
29
Algorithm
f KMPFailureFunction(P) {build failure function}
i0
j0
while i m-1 do
if P[j] = T[i] then
if j = m - 1 then
{ we have matched j+1 characters}
f(i) j + 1
ii+1
jj+1
else if j > 0 then
j f(j-1) {j indexes just after matching prefix in P}
else {there is no match}
f(i) 0
ii+1 30
The KMP Algorithm (contd.)
• A graphical representation of the KMP string searching algorithm
31
The KMP Algorithm (contd.)
• Time Complexity Analysis
• define k = i - j
• In every iteration through the while loop, one of three things happens.
1) if T[i] = P[j], then i increases by 1, as does j k remains the same.
2) if T[i] != P[j] and j > 0, then i does not change and k increases by at least 1,
since k changes from i - j to i - f(j-1)
3) if T[i] != P[j] and j = 0, then i increases by 1 and k increases by 1 since j remains
the same.
• Thus, each time through the loop, either i or k increases by at least 1, so the
greatest possible number of loops is 2n
• This of course assumes that f has already been computed.
• However, f is computed in much the same manner as KMPMatch so the time
complexity argument is analogous. KMPFailureFunction is O(m)
• Total Time Complexity: O(n + m) 32
Regular Expressions
• notation for describing a set of strings, possibly of infinite
size
• denotes the empty string
• ab + c denotes the set {ab, c}
• a* denotes the set {, a, aa, aaa, ...}
• Examples
(a+b)* all the strings from the alphabet {a,b}
b*(ab*a)*b* strings with an even number of a’s
(a+b)*sun(a+b)* strings containing the pattern “sun”
(a+b)(a+b)(a+b)a 4-letter strings ending in a
33