TIME AND SPACE TRADEOFFS:
Input Enhancement in String Matching: Horspool’s algorithm,
Hashing: Open and Closed hashing.
Prof. Naveen Kulkarni
Assistant Professor
Department of Computer Science and Engineering (Cyber Security)
Dayananda Sagar University, Harohalli, Karnataka.
Email-id: naveenk-cse@dsu.edu.in
Faculty Cabin: A434
Review: String searching by brute force
Pattern: a string of m characters to search for
Text: a (long) string of n characters to search in
Brute force algorithm
Step 1 Align pattern at beginning of text
Step 2 Moving from left to right, compare each character of
pattern to the corresponding character in text until either all characters are found to match
(successful search) or a mismatch is detected
Step 3 While a mismatch is detected and the text is not yet exhausted, realign pattern one position to
the right and repeat Step 2
Worst case efficiency of this algorithm is in the O(nm) class.
Intention is to improve the efficiency.
Slides prepared by: Prof.Naveen Kulkarni, Assistant Professor, Department of CSE (CS), Dayananda Sagar University, Harohalli, Karnataka. 2
String searching by preprocessing
Several string searching algorithms are based on the input enhancement idea of preprocessing the
pattern
• Knuth-Morris-Pratt (KMP) algorithm preprocesses pattern left to right to get useful
information for later searching.
• Boyer -Moore algorithm preprocesses pattern right to left and store information into two tables.
• Horspool’s algorithm simplifies the Boyer-Moore algorithm by using just one table.
Slides prepared by: Prof.Naveen Kulkarni, Assistant Professor, Department of CSE (CS), Dayananda Sagar University, Harohalli, Karnataka. 3
Horspool’s Algorithm
A simplified version of Boyer-Moore algorithm:
• Preprocesses pattern to generate a shift table that determines how much to
shift the pattern when a mismatch occurs.
• Always makes a shift based on the text’s character c aligned with the last
character in the pattern according to the shift table’s entry for c.
Slides prepared by: Prof.Naveen Kulkarni, Assistant Professor, Department of CSE (CS), Dayananda Sagar University, Harohalli, Karnataka. 4
How far to shift?
Look at first (rightmost) character in text that was compared:
• The character is not in the pattern
.....c...................... (c not in pattern)
BAOBAB
• The character is in the pattern (but not the rightmost)
.....O...................... (O occurs once in pattern)
BAOBAB
.....A...................... (A occurs twice in pattern)
BAOBAB
• The rightmost characters do match
.....B......................
BAOBAB
Slides prepared by: Prof.Naveen Kulkarni, Assistant Professor, Department of CSE (CS), Dayananda Sagar University, Harohalli, Karnataka. 5
Shift table
• Shift sizes can be precomputed by the formula
distance from c’s rightmost occurrence in pattern
among its first m-1 characters to its right end
t(c) = pattern’s length m, otherwise
by scanning pattern before search begins and stored in a table called shift table
• Shift table is indexed by text and pattern alphabet
Eg, for BAOBAB:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
1 2 6 6 6 6 6 6 6 6 6 6 6 6 3 6 6 6 6 6 6 6 6 6 6 6
Slides prepared by: Prof.Naveen Kulkarni, Assistant Professor, Department of CSE (CS), Dayananda Sagar University, Harohalli, Karnataka. 6
Example of Horspool’s alg. application
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z _
1 2 6 6 6 6 6 6 6 6 6 6 6 6 3 6 6 6 6 6 6 6 6 6 6 6 6
BARD LOVED BANANAS
BAOBAB
BAOBAB
BAOBAB
BAOBAB (unsuccessful search)
Slides prepared by: Prof.Naveen Kulkarni, Assistant Professor, Department of CSE (CS), Dayananda Sagar University, Harohalli, Karnataka. 7
Algorithm ShiftTavble(P[0..m-1])
// Fills the shift table used by Hospool’s and Boyer-Moore algorithms
//Input: Pattern P[0..m-1] and an alphabet of possible characters
//Output: Table[0..size-1] indexed by the alphabet of possible characters
filled with shift sizes computed by formula.
for i 0 to size – 1 do Table[i] m
for j 0 to m-2 do Table[P[j]] m-1-j
return Table
Slides prepared by: Prof.Naveen Kulkarni, Assistant Professor, Department of CSE (CS), Dayananda Sagar University, Harohalli, Karnataka. 8
Horspool’s Algorithm
Step 1 : For a given pattern of length m and the alphabet used in both the pattern and text, construct
the shift table.
Step 2 : Align the pattern against the beginning of the text.
Step 3 : Repeat the following until either a matching substring is found or the pattern reaches beyond
the last character of the text. Starting with the last character in the pattern, compare the
corresponding characters in the pattern and text until either all m characters are matched
(then stop) or a mismatching pair is encountered. In the latter case, retrieve the entry t(c)
from the c’s column of the shift table where c is text’s character currently aligned against the
last character of the pattern, and shift the pattern by t(c) characters to right along the text.
Slides prepared by: Prof.Naveen Kulkarni, Assistant Professor, Department of CSE (CS), Dayananda Sagar University, Harohalli, Karnataka. 9
ALGORITHM HorspoolMatching(P[0..m − 1], T [0..n − 1])
//Implements Horspool’s algorithm for string matching
//Input: Pattern P[0..m − 1] and text T [0..n − 1]
//Output: The index of the left end of the first matching substring or −1 if there are no matches
ShiftTable(P[0..m − 1]) //generate Table of shifts
i←m−1 //position of the pattern’s right end
while i ≤ n − 1 do
k←0 //number of matched characters
while k ≤ m − 1 and P[m − 1 − k] = T [i − k] do
k←k+1
if k = m
return i − m + 1
else i ← i + Table[T [i]]
return −1
Slides prepared by: Prof.Naveen Kulkarni, Assistant Professor, Department of CSE (CS), Dayananda Sagar University, Harohalli, Karnataka. 10
Efficiency of Horspool’s Algorithm
• A simple example can demonstrate that the worst-case efficiency of
Horspool’s algorithm is in O(nm).
• But for random texts, it is in Θ(n), and, although in the same efficiency
class, Horspool’s algorithm is obviously faster on average than the
brute-force algorithm
Slides prepared by: Prof.Naveen Kulkarni, Assistant Professor, Department of CSE (CS), Dayananda Sagar University, Harohalli, Karnataka. 11
Hashing
• Idea : prestructuring — preprocess the input to make accessing its elements easier.
• Hashing is a very efficient method for implementing a dictionary, i.e., a set
with the operations:
• find
• insert
• delete
• Based on representation-change and space-for-time tradeoff ideas
• Important applications:
• symbol tables
• databases (extendible hashing)
Slides prepared by: Prof.Naveen Kulkarni, Assistant Professor, Department of CSE (CS), Dayananda Sagar University, Harohalli, Karnataka. 12
Hash tables and hash functions
The idea of hashing is to map keys of a given file of size n into a table of size m,
called the hash table, by using a predefined function, called the hash function.
h: K location (cell) in the hash table
Example: student records, key = SSN. Hash function:
h(K) = K mod m where m is some integer (typically, prime)
If m = 1000, where is record with SSN= 314159265 stored?
Generally, a hash function should:
• be easy to compute
• distribute keys about evenly throughout the hash table
Slides prepared by: Prof.Naveen Kulkarni, Assistant Professor, Department of CSE (CS), Dayananda Sagar University, Harohalli, Karnataka. 13
Collisions
If h(K1) = h(K2), there is a collision
• Good hash functions result in fewer collisions but some collisions should be expected
(birthday paradox)
• Two principal hashing schemes handle collisions differently:
• Open hashing
– each cell is a header of linked list of all keys hashed to it
• Closed hashing
• one key per cell
• in case of collision, finds another cell by
• linear probing: use next free bucket
• double hashing: use second hash function to compute increment
Slides prepared by: Prof.Naveen Kulkarni, Assistant Professor, Department of CSE (CS), Dayananda Sagar University, Harohalli, Karnataka. 14
Open hashing (Separate chaining)
Keys are stored in linked lists outside a hash table whose elements serve as the lists’ headers.
Example: A, FOOL, AND, HIS, MONEY, ARE, SOON, PARTED
h(K) = sum of K ‘s letters’ positions in the alphabet MOD 13
Keys A FOOL AND HIS MONEY ARE SOON PARTED
Hash 1 9 6 10 7 11 11 12
addressess
0 1 2 3 4 5 6 7 8 9 10 11 12
A AND MONEY FOOL HIS ARE Parted
Search for KID SOON
Slides prepared by: Prof.Naveen Kulkarni, Assistant Professor, Department of CSE (CS), Dayananda Sagar University, Harohalli, Karnataka. 15
Open hashing (cont.)
• If hash function distributes keys uniformly, average length of
linked list will be α = n/m. This ratio is called load factor.
• Average number of probes in successful, S, and unsuccessful
searches, U:
S 1+α/2, U=α
• Load α is typically kept small (ideally, about 1)
• Open hashing still works if n > m
Slides prepared by: Prof.Naveen Kulkarni, Assistant Professor, Department of CSE (CS), Dayananda Sagar University, Harohalli, Karnataka. 16
Closed hashing (Open addressing)
Slides prepared by: Prof.Naveen Kulkarni, Assistant Professor, Department of CSE (CS), Dayananda Sagar University, Harohalli, Karnataka. 17
Closed hashing (cont.)
• Does not work if n > m
• Avoids pointers
• Deletions are not straightforward
• Number of probes to find/insert/delete a key depends on load factor α = n/m (hash table density)
and collision resolution strategy. For linear probing:
S = (½) (1+ 1/(1- α)) and U = (½) (1+ 1/(1- α)²)
• As the table gets filled (α approaches 1), number of probes in linear probing increases
dramatically:
Slides prepared by: Prof.Naveen Kulkarni, Assistant Professor, Department of CSE (CS), Dayananda Sagar University, Harohalli, Karnataka. 18