Lecture 13.
Hash Tables
Hash Tables (1/4)
¢ Motivation: symbol tables
u A compiler uses a symbol table to relate symbols to associated
data
ê Symbols: variable names, procedure names, etc.
ê Associated data: memory location, call graph, etc.
u For a symbol table (also called a dictionary), we care about
search, insertion, and deletion
u We typically don’t care about sorted order
Hash Tables (2/4)
[Figure: a hash function h maps keys k1 … k4 from the universe of keys U (K ⊆ U is the set of actual keys) to slots h(k1) … h(k4) in a table indexed 0 … m−1]
Hash Tables (3/4)
¢ Idea
u Use a function h to compute the slot for each key
u Store the element in slot ℎ(𝑘)
¢ A hash function h transforms a key into an index in a hash table
𝑇[0 … 𝑚 − 1]:
ℎ ∶ 𝑈 → {0, 1, . . . , 𝑚 − 1}
¢ We say that 𝑘 hashes to slot ℎ(𝑘)
¢ Advantages:
u Reduce the range of array indices handled: 𝑚 instead of |𝑈|
u Storage is correspondingly reduced
Hash Tables (4/4)
¢ More formally:
u Given a table 𝑇 and a record 𝑥, with key (= symbol) and
satellite data, we need to support:
ê Insert (𝑇, 𝑥)
ê Delete (𝑇, 𝑥)
ê Search (𝑇, 𝑥)
u We want these to be fast, but don’t care about sorting the
records
¢ The structure we will use is a hash table
u Supports all of the above in 𝑂(1) expected time!
Direct-Address Tables
¢ Suppose:
u The range of keys is 0 … 𝑚 − 1
u Keys are distinct
¢ The idea:
u Set up an array 𝑇[0 … 𝑚 − 1] in which
ê 𝑇[𝑖] = 𝑥 if 𝑥 ∈ 𝑇 and key[𝑥] = 𝑖
ê 𝑇[𝑖] = NULL otherwise
u This is called a direct-address table
ê Operations take 𝑂(1) time!
ê So what’s the problem?
Direct-Address Tables
¢ Dictionary operations are trivial and take 𝑂(1) time
each:
u DIRECT-ADDRESS-SEARCH(T, k)
return 𝑇[𝑘]
u DIRECT-ADDRESS-INSERT(T, x)
𝑇[key[𝑥]] ← 𝑥
u DIRECT-ADDRESS-DELETE(T, x)
𝑇[key[𝑥]] ← NIL
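A minimal Python sketch of these operations (the Record class and field names below are illustrative assumptions, not from the slides):

class Record:
    def __init__(self, key, data):
        self.key, self.data = key, data   # key in 0..m-1, plus satellite data

class DirectAddressTable:
    def __init__(self, m):
        self.T = [None] * m               # T[0..m-1], all slots initially NIL

    def search(self, k):
        return self.T[k]                  # O(1)

    def insert(self, x):
        self.T[x.key] = x                 # O(1): T[key[x]] <- x

    def delete(self, x):
        self.T[x.key] = None              # O(1): T[key[x]] <- NIL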
The Problem With Direct Addressing
¢ Direct addressing works well when the range 𝑚 of keys
is relatively small
¢ But what if the keys are 32-bit integers?
u Problem 1: a direct-address table would have
2^32 entries, more than 4 billion
u Problem 2: even if memory is not an issue, the time to initialize
the elements to NULL may be …
¢ Solution: map keys to a smaller range 0 … 𝑚 − 1
¢ This mapping is called a hash function
Hash Functions
¢ Next problem: collision
[Figure: keys k1 … k5 from the universe U hash into T[0 … m−1]; two keys collide: h(k2) = h(k5)]
Resolving Collisions
¢ How can we solve the problem of collisions?
u Solution 1: chaining
u Solution 2: open addressing
Chaining (1/10)
[Figure: an empty chained hash table; each slot 0 … n−1 holds the head of an (empty) linked list]
Chaining (2/10)
[Figure: the hash function maps key1:data1 to slot 3; it is inserted into the list at that slot]
Chaining (3/10)
[Figure: key2:data2 hashes to slot 1 and is inserted there; key1:data1 remains in slot 3]
Chaining (4/10)
[Figure: key3:data3 hashes to slot 4 and is inserted there]
Chaining (5/10)
[Figure: key4:data4 also hashes to slot 3 and is inserted at the beginning of that slot's list, in front of key1:data1]
Chaining (6/10)
¢ Chaining puts elements that hash to the same slot in a
linked list:
[Figure: keys that hash to the same slot are chained: k1→k4 in one slot, k5→k2→k7 in another, k3 alone, and k8→k6; empty slots hold NIL]
Chaining (7/10)
¢ How do we insert an element?
Chaining (8/10)
¢ How do we delete an element?
u Do we need a doubly-linked list for efficient delete?
Chaining (9/10)
¢ How do we search for an element with a given key?
Chaining (10/10)
u CHAINED-HASH-INSERT(T, x)
insert 𝑥 at the head of list 𝑇[ℎ(key[𝑥])]
u CHAINED-HASH-SEARCH(T, k)
search for an element with key 𝑘 in list 𝑇[ℎ(𝑘)]
u CHAINED-HASH-DELETE(T, x)
delete 𝑥 from the list 𝑇[ℎ(key[𝑥])]
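A minimal Python sketch of chaining (assuming integer keys, h(k) = k mod m, and ordinary Python lists standing in for the linked lists, with insertion at the head as in the slides):

class ChainedHashTable:
    def __init__(self, m):
        self.m = m
        self.T = [[] for _ in range(m)]               # one chain per slot

    def _h(self, k):
        return k % self.m                             # assumed hash function

    def insert(self, key, data):
        self.T[self._h(key)].insert(0, (key, data))   # insert at head of chain

    def search(self, key):
        for k, d in self.T[self._h(key)]:             # scan the chain at T[h(key)]
            if k == key:
                return d
        return None

    def delete(self, key):
        chain = self.T[self._h(key)]
        self.T[self._h(key)] = [(k, d) for k, d in chain if k != key]

With a real doubly-linked list, deleting a node we already hold a pointer to is O(1); the list rebuild above is only for brevity.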
Practice Problems
¢ Draw a hash table after we insert the keys 5, 28,
19, 15, 20, 33, 12, 17, 10 with collisions resolved
by chaining. Let the table have 9 slots, 𝑇[0..8],
and the hash function be ℎ(𝑘) = 𝑘 mod 9.
𝑖:     0    1             2     3     4    5    6         7    8
𝑇[𝑖]:  [ ]  [10, 19, 28]  [20]  [12]  [ ]  [5]  [33, 15]  [ ]  [17]
Analysis of Chaining
¢ Assume simple uniform hashing: each key is
equally likely to hash to any of the 𝑚 slots
¢ Given 𝑛 keys and 𝑚 slots in the table, the
load factor α = 𝑛/𝑚 = average number of keys per slot
¢ What will be the average cost of an unsuccessful
search for a key?
u 𝑂(1 + α)
¢ What will be the average cost of a successful search?
u 𝑂(1 + α/2) = 𝑂(1 + α)
Analysis of Chaining
¢ So, the cost of searching is
u 𝑂(1 + α)
¢ If the number of keys 𝑛 is proportional to the
number of slots in the table, what is α?
u α = 𝑂(1)
u In other words, we can make the expected cost of
searching constant if we make α constant
Choosing A Hash Function
¢ Clearly choosing the hash function well is crucial
¢ What are desirable features of the hash function?
u Should distribute keys uniformly into slots
u Should not depend on patterns in the data
Hash Functions: The Division Method
¢ Idea
u Map a key 𝑘 into one of the 𝑚 slots by taking the
remainder of 𝑘 divided by 𝑚
ℎ(𝑘) = 𝑘 𝑚𝑜𝑑 𝑚
¢ Advantage
u fast, requires only one operation
¢ Disadvantage
u Certain values of 𝑚 are bad, e.g., powers of 2 and
other non-prime numbers
The Division Method
¢ A good choice for 𝑚: a prime number, not too close
to an exact power of 2
¢ e.g., allocate a hash table, with collisions resolved
through chaining
u 𝑛 = 2000 character strings (8 bits/character)
u Choose m roughly 𝑛/3: 𝑚 = 701 (prime near 2000/3, not
near a power of 2)
u ℎ(𝑘) = 𝑘 mod 701
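A small Python sketch of this example (how a character string is turned into an integer key is an assumption here; the slide does not fix an encoding):

M = 701                            # prime, not close to a power of 2

def string_to_key(s):
    k = 0
    for ch in s:                   # treat the string as a base-256 number
        k = k * 256 + ord(ch)      # 8 bits per character
    return k

def h_division(s):
    return string_to_key(s) % M    # h(k) = k mod 701

print(h_division("symbol"), h_division("table"))   # two slots in 0..700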
Hash Functions: The Multiplication Method
Idea:
¢ Multiply key 𝑘 by a constant 𝐴, 0 < 𝐴 < 1
¢ Extract the fractional part of 𝑘𝐴
¢ Multiply the fractional part by 𝑚
¢ Take the floor of the result
ℎ(𝑘) = ⌊𝑚 (𝑘𝐴 mod 1)⌋
where 𝑘𝐴 mod 1 = fractional part of 𝑘𝐴 = 𝑘𝐴 − ⌊𝑘𝐴⌋
¢ Disadvantage: Slower than the division method
¢ Advantage: Value of 𝑚 is not critical: typically 𝑚 = 2^𝑝
The Multiplication Method
¢ For a constant 𝐴, 0 < 𝐴 < 1:
¢ ℎ(𝑘) = ⌊𝑚 (𝑘𝐴 − ⌊𝑘𝐴⌋)⌋, where 𝑘𝐴 − ⌊𝑘𝐴⌋ is the
fractional part of 𝑘𝐴
¢ Choose 𝑚 = 2^𝑝
¢ Choose 𝐴 not too close to 0 or 1
¢ Knuth: a good choice is 𝐴 = (√5 − 1)/2 ≈ 0.618
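A sketch of the multiplication method in Python with Knuth's constant (p = 14, i.e. m = 2^14, is just an illustrative choice):

import math

A = (math.sqrt(5) - 1) / 2          # Knuth's suggestion, approximately 0.618
m = 2 ** 14                         # m = 2^p; the exact value of p is not critical

def h_mult(k):
    frac = (k * A) % 1.0            # fractional part of k*A
    return math.floor(m * frac)     # floor(m * (k*A mod 1)), an index in 0..m-1

print(h_mult(123456))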
Truncation method
• Ignore part of the key and use the rest as the array
index (converting non-numeric parts)
• Example
If students have a 9-digit identification number, take
the last 3 digits as the table position
– e.g. 925371622 becomes 622
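In Python, this truncation is a single modulo (or string slice):

def h_truncate(key):
    return key % 1000               # keep the last 3 digits as the table position

print(h_truncate(925371622))        # 622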
Mid-square method
¢ Middle of square method
¢ This method squares the key value, and then
takes out the number of bits from the middle
of the square
¢ The number of bits to be used to obtain the
address depends on the table size
u If r bits are used, the range of values is 0 to 2^r − 1
¢ This works well because most or all bits of the
key value contribute to the result
Mid-square method
¢ Example
u consider items whose keys are 4-digit numbers in
base 10
u The goal is to hash these key values to a table of
size 100
u This range is equivalent to two digits in base 10,
that is, the number of digits used is 2
u For example, if the input is the number 4567,
squaring it gives an 8-digit number, 20857489
u The middle two digits of this result are 57
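A decimal sketch of this example in Python (assuming 4-digit keys and a table of size 100):

def h_midsquare(key):
    sq = str(key * key).zfill(8)    # square the key, pad to 8 digits if needed
    return int(sq[3:5])             # take the middle two digits

print(h_midsquare(4567))            # 4567^2 = 20857489 -> 57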
Folding method
¢ Partition the key into several parts of the same length
except for the last
¢ These parts are then added together to obtain the hash
address for the key
¢ Two ways of carrying out this addition
u Shift-folding
u Folding and reverse
Folding method
¢ Example
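As an illustrative shift-folding sketch in Python (the key, part length, and table size below are assumptions, not the slide's own example):

def h_shift_fold(key, part_len=3, m=1000):
    digits = str(key)
    parts = [int(digits[i:i + part_len])        # split into parts of part_len digits
             for i in range(0, len(digits), part_len)]
    return sum(parts) % m                       # add the parts, reduce to table size

print(h_shift_fold(925371622))   # 925 + 371 + 622 = 1918 -> 918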
Digit analysis method
¢ All the keys in the table are known in advance
¢ The index is formed by extracting, and then
manipulating specific digits from the key
¢ For example, if the key is 925371622, we select
the digits in positions 5 through 7, resulting in 537
¢ The manipulation can then take many forms
u Reversing the digits (735)
u Performing a circular shift to the right (753)
u Performing a circular shift to the left (375)
u Swapping each pair of digits (357)
Open Addressing (1/9)
¢ If we have enough contiguous memory to store all the
keys (𝑚 > 𝑛)
⇒ store the keys in the table itself
¢ No need to use the linked lists anymore
¢ Collision resolution
u Put the elements that collide in the available empty places in
the table
Open Addressing (2/9)
¢ Basic idea:
u To insert: if slot is full, try another slot, …, until an open slot is
found (probing)
u To search, follow same sequence of probes as would be used
when inserting the element
ê If reach element with correct key, return it
ê If reach a NULL pointer, element is not in table
¢ Good for fixed sets (adding but no deletion)
u Example: spell checking
¢ Table needn’t be much bigger than 𝑛
Open Addressing (3/9)
[Figure: an empty open-addressing table with slots 0 … n−1; elements are stored directly in the slots]
Open Addressing (4/9)
[Figure: the hash function maps key1:data1 to slot 3, where it is stored]
Open Addressing (5/9)
[Figure: key2:data2 hashes to slot 1 and is stored there]
Open Addressing (6/9)
[Figure: key3:data3 hashes to slot 4 and is stored there]
Open Addressing (7/9)
[Figure: key4:data4 also hashes to slot 3, which is already occupied by key1:data1; where should it go?]
Open Addressing (8/9)
[Figure: slots 3 and 4 are occupied, so key4:data4 is placed in the next empty slot, slot 5]
Open Addressing (9/9)
Linear Probing (1/3)
¢ When there is a collision, check the next available
position in the table (probing)
ℎ(𝑘, 𝑖) = (ℎ1(𝑘) + 𝑖) mod 𝑚
¢ First slot probed: ℎ1(𝑘), second: ℎ1(𝑘) + 1 and so on
Linear Probing (2/3)
¢ Searching for a key
¢ Three situations:
u Position in table is occupied with an element of equal key
u Position in table is empty
u Position in table is occupied with a different element
¢ Probe the next higher index until the element is found
or an empty position is found
¢ The process wraps around to the beginning of the table
[Figure: the table T[0 … m−1] with hash positions h(k1), h(k4), h(k2) = h(k5), h(k3) marked]
Linear Probing (3/3)
¢ Deleting a key
u We cannot simply mark the slot as empty
u Otherwise it would be impossible to retrieve keys
inserted after that slot was occupied
¢ Solution
u Mark the slot with a sentinel value DELETED
¢ The deleted slot can later be used for insertion
¢ Searching will still be able to find all the keys
Practice Problems
¢ Consider a hash table of length 11 using open addressing
with the primary hash function ℎ1(𝑘) = 𝑘. Illustrate the
result of inserting 31, 4, 15, 28, 59 using linear probing.
u Hash function ℎ(𝑘, 𝑖) = (ℎ1(𝑘) + 𝑖) mod 𝑚 = (𝑘 + 𝑖) mod 𝑚
ê ℎ(31, 0) = 9 ⇒ 𝑇[9] = 31
ê ℎ(4, 0) = 4 ⇒ 𝑇[4] = 4
ê ℎ(15, 0) = 4 collision
o ℎ(15, 1) = 5 ⇒ 𝑇[5] = 15
ê ℎ(28, 0) = 6 ⇒ 𝑇[6] = 28
ê ℎ(59, 0) = 4 collision
o ℎ(59, 1) = 5 collision
o ℎ(59, 2) = 6 collision
o ℎ(59, 3) = 7 ⇒ 𝑇[7] = 59
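A minimal Python sketch that reproduces this probe sequence (m = 11, h1(k) = k, as in the problem):

m = 11
T = [None] * m

def linear_probe_insert(T, k):
    for i in range(m):
        slot = (k + i) % m          # h(k, i) = (h1(k) + i) mod m
        if T[slot] is None:
            T[slot] = k
            return slot
    raise RuntimeError("table is full")

for k in [31, 4, 15, 28, 59]:
    print(k, "->", linear_probe_insert(T, k))
# 31 -> 9, 4 -> 4, 15 -> 5, 28 -> 6, 59 -> 7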
Double Hashing (1/3)
¢ Use a first hash function to determine the first slot
¢ Use a second hash function to determine the increment
for the probe sequence
ℎ(𝑘, 𝑖) = (ℎ1(𝑘) + 𝑖 ℎ2(𝑘)) mod 𝑚
¢ Initial probe: ℎ1(𝑘); each subsequent probe is offset by
ℎ2(𝑘) (mod 𝑚), and so on
¢ Advantage: avoids clustering
¢ Disadvantage: harder to delete an element
Double Hashing (2/3)
ℎ1(𝑘) = 𝑘 mod 13
ℎ2(𝑘) = 1 + (𝑘 mod 11)
ℎ(𝑘, 𝑖) = (ℎ1(𝑘) + 𝑖 ℎ2(𝑘)) mod 13
¢ Insert key 14:
ℎ1(14) = 14 mod 13 = 1
ℎ2(14) = 1 + (14 mod 11) = 4
ℎ(14, 1) = (ℎ1(14) + ℎ2(14)) mod 13 = (1 + 4) mod 13 = 5
ℎ(14, 2) = (ℎ1(14) + 2 ℎ2(14)) mod 13 = (1 + 8) mod 13 = 9
[Figure: a table of size 13 with 79 in slot 1, 69 in slot 4, 98 in slot 5, 72 in slot 7, and 50 in slot 11; slots 1 and 5 are occupied, so key 14 is placed in slot 9]
Double Hashing (3/3)
Choosing the second hash function
ℎ(𝑘, 𝑖) = (ℎ1(𝑘) + 𝑖 ℎ2(𝑘)) mod 𝑚
¢ ℎ2 should never evaluate to 0, otherwise the probe
sequence does not advance ⇒ infinite loop
¢ ℎ2 should be relatively prime to the table size 𝑚
¢ Solution 1
u Choose m prime
u Design ℎ2 such that it produces an integer less than 𝑚
¢ Solution 2
u Choose 𝑚 = 2^𝑝
u Design ℎ2 such that it always produces an odd number
Practice Problems
¢ Consider a hash table of length 11 using open addressing.
Illustrate the result of inserting 31, 4, 15, 28, 59 using
double hashing with functions ℎ1(𝑘) = 𝑘 and ℎ2(𝑘) = 1 + (𝑘
mod (𝑚 − 1)).
u Hash function ℎ(𝑘, 𝑖) = (𝑘 + 𝑖 (1 + (𝑘 mod (𝑚 − 1)))) mod 𝑚
ê ℎ(31, 0) = 9 ⇒ 𝑇[9] = 31
ê ℎ(4, 0) = 4 ⇒ 𝑇[4] = 4
ê ℎ(15, 0) = 4 collision
o ℎ(15, 1) = 10 ⇒ 𝑇[10] = 15
ê ℎ(28, 0) = 6 ⇒ 𝑇[6] = 28
ê ℎ(59, 0) = 4 collision
o ℎ(59, 1) = 3 ⇒ 𝑇[3] = 59
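The same insertions with double hashing, as a Python sketch (m = 11, h1(k) = k, h2(k) = 1 + (k mod (m − 1))):

m = 11
T = [None] * m

def double_hash_insert(T, k):
    h2 = 1 + (k % (m - 1))
    for i in range(m):
        slot = (k + i * h2) % m     # h(k, i) = (h1(k) + i*h2(k)) mod m
        if T[slot] is None:
            T[slot] = k
            return slot
    raise RuntimeError("no empty slot found")

for k in [31, 4, 15, 28, 59]:
    print(k, "->", double_hash_insert(T, k))
# 31 -> 9, 4 -> 4, 15 -> 10, 28 -> 6, 59 -> 3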