[go: up one dir, main page]

100% found this document useful (1 vote)
208 views48 pages

Boyer Moore Algorithm: Idan Szpektor

The Boyer-Moore algorithm is a string matching algorithm that improves on naive algorithms by skipping characters when possible. It uses two rules: the bad character rule, which shifts the pattern by more than one character on a mismatch, and the good suffix rule, which uses the matched suffix to determine the shift distance. The algorithm preprocesses the pattern to calculate shift amounts, then searches the text from right to left. It runs in O(n+m) time in the worst case, where n is the pattern length and m is the text length.

Uploaded by

Nischitha Nish
Copyright
© Attribution Non-Commercial (BY-NC)
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPT, PDF, TXT or read online on Scribd
100% found this document useful (1 vote)
208 views48 pages

Boyer Moore Algorithm: Idan Szpektor

The Boyer-Moore algorithm is a string matching algorithm that improves on naive algorithms by skipping characters when possible. It uses two rules: the bad character rule, which shifts the pattern by more than one character on a mismatch, and the good suffix rule, which uses the matched suffix to determine the shift distance. The algorithm preprocesses the pattern to calculate shift amounts, then searches the text from right to left. It runs in O(n+m) time in the worst case, where n is the pattern length and m is the text length.

Uploaded by

Nischitha Nish
Copyright
© Attribution Non-Commercial (BY-NC)
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPT, PDF, TXT or read online on Scribd
You are on page 1/ 48

Boyer Moore Algorithm

Idan Szpektor
Boyer and Moore
What It’s About
 A String Matching Algorithm

 Preprocess a Pattern P (|P| = n)

 For a text T (| T| = m), find all of the


occurrences of P in T

 Time complexity: O(n + m), but usually sub-


linear
Right to Left (like in Hebrew)
 Matching the pattern from right to left

 For a pattern abc:



T: bbacdcbaabcddcdaddaaabcbcb
P: abc

 Worst case is still O(n m)


The Bad Character Rule (BCR)

 On a mismatch between the pattern and the


text, we can shift the pattern by more than
one place.
Sublinearity!
ddbbacdcbaabcddcdaddaaabcbcb
acabc

BCR Preprocessing

 A table, for each position in the pattern and a


character, the size of the shift. O(n |Σ|) space. O(1)
access time.
1 2 3 4 5
a b a c b:
a 1 1 3 3 3
1 2 3 4 5
b 2 2 2 5
c 4 4

 A list of positions for each character. O(n + |Σ|)


space. O(n) access time, But in total O(m).
BCR - Summary

 On a mismatch, shift the pattern to the right


until the first occurrence of the mismatched
char in P.

 Still O(n m) worst case running time:

T: aaaaaaaaaaaaaaaaaaaaaaaaa
P: abaaaa
The Good Suffix Rule (GSR)

 We want to use the knowledge of the


matched characters in the pattern’s suffix.

 If we matched S characters in T, what is (if


exists) the smallest shift in P that will align a
sub-string of P of the same S characters ?
GSR (Cont…)
 Example 1 – how much to move:

T: bbacdcbaabcddcdaddaaabcbcb
P: cabbabdbab
cabbabdbab
GSR (Cont…)
 Example 2 – what if there is no alignment:

T: bbacdcbaabcbbabdbabcaabcbcb
P: bcbbabdbabc
bcbbabdbabc
GSR - Detailed
 We mark the matched sub-string in T with t
and the mismatched char with x

1. In case of a mismatch: shift right until the


first occurrence of t in P such that the next
char y in P holds y≠x

2. Otherwise, shift right to the largest prefix of


P that aligns with a suffix of t.
Boyer Moore Algorithm
 Preprocess(P)
 k := n
while (k ≤ m) do
 Match P and T from right to left starting at k

 If a mismatch occurs: shift P right (advance k)


by max(good suffix rule, bad char rule).

 else, print the occurrence and shift P right


(advance k) by the good suffix rule.
Algorithm Correctness

 The bad character rule shift never misses a


match

 The good suffix rule shift never misses a


match
Preprocessing the GSR – L(i)

 L(i) – The biggest index j, such that j < n and


prefix P[1..j] contains suffix P[i..n] as a suffix
but not suffix P[i-1..n]

1 2 3 4 5 6 7 8 9 10 11 12 13

P: b b a b b a a b b c a b b
L: 0 0 0 0 0 0 0 0 0 5 9 0 12
Preprocessing the GSR – l(i)

 l(i) – The length of the longest suffix of P[i..n]


that is also a prefix of P

P: b b a b b a a b b c a b b
l: 2 2 2 2 2 2 2 2 2 2 2 1
Using L(i) and l(i) in GSR

 If mismatch occurs at position n, shift P by 1

 If a mismatch occurs at position i-1 in P:


 If L(i) > 0, shift P by n – L(i)
 else shift P by n – l(i)

 If P was found, shift P by n – l(2)


Building L(i) and l(i) – the Z
 For a string s, Z(i) is the length of the longest
sub-string of s starting at i that matches a
prefix of s.

s: b b a c d c b b a a b b c d d
Z: 1 0 0 0 0 3 1 0 0 2 1 0 0 0

 Naively, we can build Z in O(n^2)


From Z to N

 N(i) is the longest suffix of P[1..i] that is also a


suffix of P.
 N(i) is Z(i), built over P reversed.

s: d d c b b a a b b c d c a b b
N: 0 0 0 1 2 0 0 1 3 0 0 0 0 1
Building L(i) in O(n)
 L(i) – The biggest index j < n, such that prefix
P[1..j] contains suffix P[i..n] as a suffix but not
suffix P[i-1..n]

 L(i) – The biggest index j < n such that:


N(j) == | P[i..n] | == n – i + 1

 for i := 1 to n, L(i) := 0
 for j := 1 to n-1
 i := n – N(j) + 1
 L(i) := j
Building l(i) in O(n)
 l(i) – The length of the longest suffix of P[i..n]
that is also a prefix of P

 l(i) – The biggest j <= | P[i..n] | == n – i + 1


such that N(j) == j

 k := 0
 for j := 1 to n-1
 If(N(j) == j), k := j
 l(n – j + 1) := k
Building Z in O(n)

 For calculating Z(i), we want to use the


previously calculated Z(1)…Z(i-1)

 For each I we remember the right most Z(j):


j, such that j < i and j + Z(j) >= k + Z(k), for all
k<i
Building Z in O(n) (Cont…)

↑ ↑ ↑ ↑
 S i’ j i

 If i < j + Z(j), s[i … j + Z(j) - 1] appeared previously,


starting at i’ = i – j + 1.
 Z(i’) < Z(j) – (i - j) ?
Building Z in O(n) (Cont…)
 For Z(2) calculate explicitly
 j := 2, i := 3
 While i <= |s|:
 if i >= j + Z(j), calculate Z(i) explicitly
 else
 Z(i) := Z(i’)

 If Z(i’) >= Z(j) – (i - j), calculate Z(i) tail

explicitly
 If j + Z(j) < i + Z(i), j := i
Building Z in O(n) - Analysis

 The algorithm builds Z correctly

 The algorithm executes in O(n)


 A new character is matched only once
 All other operations are in O(1)
Boyer Moore Worst Case Analysis
 Assume P consists of n copies of a single
char and T consists of m copies of the same
char:
T: aaaaaaaaaaaaaaaaaaaaaaaaa
P: aaaaaa

 Boyer Moore Algorithm runs in Θ(m n) when


finding all the matches
The Galil Rule
 In a specific matching phase, We mark with k
the position in T of the right end of P. We
mark with s the position of last matched char
in this phase.
s k k’
T: bbacdcbaabcddcdaddaaabcbcb
P: abaab
abaab
The Galil Rule (Cont…)
 All the chars in position s < j ≤ k are known to
be matching. The algorithm doesn’t need to
check them.

 An extended Boyer Moore algorithm with the


Galil rule runs in O(m + n) worst case (even
without the bad-character rule).
Don’t Sleep Yet…
O(n + m) proof - Outline
 Preprocess in O(n) – already proved

1. Properties of strings
2. Proof of search in O(m) if P is not in T, using
only the good suffix rule.
3. Proof of search in O(m) even if P is in T,
adding the Galil rule.
Properties of Strings
 If for two strings δ, γ: δγ = γδ then there is a
string β such that δ = βi and γ = βj, i, j > 0
- Proof by induction

 Definition: A string s is semiperiodic with


period β if s consists of a non-empty suffix of
β (possibly the entire β) followed by one or
more complete copies of β.

β’ β β β
Properties of Strings (Cont…)

 A string is prefix semiperiodic if it contains


one or more complete copies of β followed by
a non-empty prefix of β.

 A string is prefix semiperiodic iff it is


semiperiodic with the same length period
Lemma 1
 Suppose P occurs in T starting at position p
and also at position q, q > p. If q – p ≤  n/2 
then P is semiperiodic with period
α = P[n-(q-p)+1…n]
p

α’ α α α
q

α’ α α α
Proof - when P is Not Found in T

 We have R rounds during the search.

 After each round the good suffix rule decides


on a right shift of si chars.

 Σsi ≤ m

 We shall use Σsi as an upper bound.


Proof (Cont…)
 For each round we count the matched chars
by:
 fi – the number of chars matched for the first
time
 gi –the number of chars already matched in
previous rounds.

 Σfi = m
 We want to prove that gi ≤ 3si ( Σgi ≤ 3m).
Proof (Cont…)
 Each round don’t find P  it matched a
substring ti and one bad char xi in T (xiti  T)

T: bbacdcbaabcbbabdbabcaabcbcb
P: bdbabc

 |ti|+1 ≤ 3si  gi ≤ 3si (because gi + fi = |ti|+1)


 For the rest of the proof we assume that for
the specific round i: |ti| + 1 > 3si
Lemma 2 (|ti| + 1 > 3si)
 In round i we look at the matched suffix of P,
marked P*. P* = yi ti, yi≠ xi.

 Both P* and ti are semiperiodic with period α


of length si and hence with minimal length
period β, α = βk.

 Proof: by Lemma 1.
Lemma 3 (|ti| + 1 > 3si)
 Suppose P overlapped ti during round i. We
shall examine in what ways could P overlap ti
in previous rounds.

 In any round h < i, the right end of P could not


have been aligned with the right end of any
full copy of β in ti.
- proof:
 Both round h and i fail at char xi
 two cases of possible shift after round h are invalid
Lemma 4 (|ti| + 1 > 3si)
 In round h < i, P can correctly match at most |
β|-1 chars in ti.

 By Lemma 3, P is not aligned with a right end of ti in phase h.
 Thus if it matched |β| chars or more there is a suffix γ of β
followed by a prefix δ of β such that δ γ = γ δ.
 By the string properties there is a substring μ such that β = μk,
k>1.
 This contradicts the minimal period size property of β.
Lemma 5 (|ti| + 1 > 3si)

 If in round h < i the right end of P is aligned


with a char in ti, it can only be aligned with
one of the following:
 One of the left-most |β|-1 chars of ti

 One of the right-most |β| chars of ti


-proof:
 If not, By Lemma 3,4, max |β|-1 chars are matched and only
from the middle of a β copy, while there are at least |β|
 A shift cannot pass the right end of that β copy
Proof (Cont…)

 If |ti| + 1 > 3si then gi ≤ 3si



 Using Lemma 5, in previous rounds we could match only the
bad char xi, the last |β|-1 chars in ti or start from the first |β| right
chars in ti.
 In the last case, using Lemma 4, we can only match up to |β|-1
chars
 in total we could previously match:
gi = 1 + |β|-1 + (|β| + |β|-1) ≤ 3|β| ≤ 3si
Proof - Final

 Number of matches = ∑(fi + gi) =


∑fi + ∑gi ≤ m + ∑3si ≤ m + 3m = 4m
Proof - when P is Found in T

 Split the rounds to two groups:


 “match” rounds –an occurrence of P in T was
found.
 “mismatch” rounds –P was not found in T.

 we have proved O(m) for “mismatch” rounds.


Proof (Cont…)
 After P was found in T, P will be shifted by a
constant length s. (s = n – l(2)).

 |n| + 1 ≤ 3s 
∑ matches in round i ≤ ∑3s ≤ m

 For the rest of the proof we assume that:


|n| + 1 > 3s
Proof (|n| + 1 > 3s)
 By Lemma 1, P is semiperiodic with minimal
length period β, |β| = s.

 If round i+1 is also a “match” round then, by


the Galil rule, only the new |β| chars are
compared.

 A contiguous series of “match” rounds, i…i+k


is called a “run”.
Proof (|n| + 1 > 3s)

 ∑ The length of a “run”, not including chars


that where already matched in previous “runs”
≤ m

 How many chars in a “run” where already


matched in previous “runs”?
Lemma (|n| + 1 > 3s)
 Suppose k-1 was a “match” round and k is a
“mismatch” round that ends the “run”.
 If k’ > k is the first “match” round then it
overlaps at most |β|-1 chars with the previous
“run” (ended by round k-1).

 The left end of P at round k’ cannot be aligned with the left end
of a full copy of |β| at round k-1.
 As a result, P cannot overlap |β| chars or more with round k-1.
Proof (|n| + 1 > 3s)
 By the Lemma and because the shift after
every “match” round is of |β|, only the first
round of a “run” can overlap, and only with
the last previous “run”.

 ∑ The length of the chars that where already
matched in previous “runs” ≤ m
Proof (|n| + 1 > 3s) - Final
 ∑ The length of a “run” =
∑ The length of a “run”, not including chars
that where already matched in previous “runs”
+
∑ The length of the chars that where already
matched in previous “runs” ≤
m+m

You might also like