[go: up one dir, main page]

0% found this document useful (0 votes)
120 views5 pages

L11 PDF

The document discusses techniques for dynamic hashing including cuckoo hashing and FKS hashing. Cuckoo hashing maintains a hash table with two hash functions and handles insertions by evicting collided items and finding empty slots for them in O(1) expected time. FKS hashing constructs a static hash table with no collisions in O(n) expected time using multiple levels of hashing with exponentially increasing table sizes. The document also discusses universal hashing families that provide probabilistic guarantees for hash collisions.

Uploaded by

bala_thegame
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
120 views5 pages

L11 PDF

The document discusses techniques for dynamic hashing including cuckoo hashing and FKS hashing. Cuckoo hashing maintains a hash table with two hash functions and handles insertions by evicting collided items and finding empty slots for them in O(1) expected time. FKS hashing constructs a static hash table with no collisions in O(n) expected time using multiple levels of hashing with exponentially increasing table sizes. The document also discusses universal hashing families that provide probabilistic guarantees for hash collisions.

Uploaded by

bala_thegame
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 5

Lecture 11

Hashing: static perfect hashing via FKS, dynamic cuckoo hashing

The Problem: Membership/Dictionary: maintain a set S of n items from a universe U under:


- query(x): x ∈ S? (+ information associated with x)
- insert(x) (dynamic)
- delete(x) (dynamic)

The Solution: A hash function h : U → [m] for some positive integer m < |U |.
– maintain a table T [1 . . . m] of linked lists (chains)
– insert(x): add x to T [h(x)].
– query(x): scan T [h(x)].
– ∀h there exist x 6= y s.t h(x) = h(y) ⇒ our goal is short chains.

Theorem 1. If m > n and h is selected uniformly from all hash functions then insert/delete/query
take O(1) expected time.

However, a random hash function requires |U | lg m bits to represent ⇒ infeasible.

Universal Hashing:

weak universal hashing is enough to obtain O(1) expected time per operation.

Definition 1. A set H of hash functions is a weak universal family if for all x, y ∈ U , x 6= y,


O(1)
Pr [h(x) = h(y)] = .
h←H m
d
- Sometimes called d-universal for probability= m .
- Why is weak universal enough?
n
Pick m so that m = O(1), and randomly pick h ∈ H. Let Iy = 1 iff h(x) = h(y).
 
X X X O(1)
E[chain length] = E  Iy  = E[Iy ] = 1 + Pr[h(x) = h(y)] ≤ 1 + n · = O(1)
m
y∈S y∈S y6=x

- Dictionary construction: randomly choose h ∈ H and hash all elements. If there’s a chain that is
“too long”, pick a new h and rehash (rebuild the table from scratch, we will use this idea a lot).

- An example weak universal family: Hp,m = {ha,b | a ∈ {1, 2, . . . , p}, b ∈ {0, 1, 2, . . . , p}}, for some
prime p > |U |, where ha,b (x) = ((ax + b) mod p) mod m. Proof in CLRS.

Definition 2. H is a strong universal family if for all distinct x, y ∈ U , and for all a, b ∈ [m],
O(1)
Pr [h(x) = a ∧ h(y) = b] = .
h←H m2
Definition 3. H is k-independent if for all k distinct items x1 , . . . , xk ∈ U , and for all a1 , . . . , ak ∈
[m],
O(1)
Pr [h(x1 ) = a1 ∧ h(x2 ) = a2 ∧ . . . ∧ h(xk ) = ak ] = .
h←H mk
- An example k-independent family: again, pick some prime p > |U |.

H = {h | h(x) = (c0 + c1 x + · · · + ck−1 xk−1 ) mod m, for some c0 , c1 , . . . , ck−1 ∈ [p]}.

Theorem 2 (Siegel, 1989). ∀ε > 0, ∃ a nΩ(1) -independent family of hash functions, each repre-
sented in nε space, and evaluated in O(1) time.

Theorem 3 (Pagh, Ostlin, 2003). ∃ a n-independent family of hash functions, each represented
in O(n) words, and evaluated in O(1) time.

Worst-case Guarantees in Static Hashing:

-Universal hashing gives good performance only in expectation ⇒ vulnerable to an adversary.

Theorem 4 (Gonnet, 1981).³Let H´be an n-independent family of hash functions. The expected
length of the longest chain is Θ lglglgnn .
³ ´
lg n
⇒ We can construct a static hash table with Θ lg lg n worst-case query time:

– pick a random h ∈ H, hash every x ∈ S (in O(n) time).


– if longest-chain ≤ 2·expected-length then stop.
– otherwise, pick a new h and start over.
1
Pr(bad hash function) ≤ 2 ⇒ O(1) trials, O(n) expected construction time.

- Mitzenmacher 1996 [3]: By using two hash functions (insert to the shorter list, search is in both
lists) we can get Θ(lg lg n) worst-case query time.

FKS - Static Hashing (Fredman, Komlós, Szemerédi [1])

- Construct static hash table with no collisions in expected O(n) time, O(n) worst-case space, and
O(1) worst-case query time.
- Requires only a weak universal family H
- Easy to implement.

First attempt: If m = Ω(n2 ) and we randomly pick h ∈ H then


X µ ¶
n c 1
E[number of collisions] = Pr[h(x) = h(y)] = · ≤
2 m 2
x,y∈S, x6=y

⇒ After expected O(1) trials, we get a collision-free hash function (total time is O(m) = O(n2 )).

Second attempt: If m = n, the same calculation yields


µ ¶
n c
E[number of collisions] = · = O(n)
2 n

⇒ After expected O(1) trials, we find a function h0 that produces O(n) collisions (total time is
O(n)).
FKS: Use h0 to hash into n buckets, then use hi ’s to hash a bucket of size ni to n2i locations.

1
U 2
h’ n22
1 n1 = 0 - 3
2 n2 = 2 h2 4 1
n3 =1 - 2
3
nm = 3 hm 4
5 nm2
6
7
8
9

Let ni = |{x ∈ S | h0 (x) = i}|.


P ¡ni ¢
(I) The number of collisions is i∈[m] 2 = O(n) because we choose h0 so. Thus,
 
X X µni ¶
n2i = O   = O(n).
2
i∈[m] i∈[m]

(II) We can hash ni elements into a table of size n2i without any collisions in expected O(n2i ) time.

- The construction takes O(n) + O(n21 ) + . . . + O(n2m ) = O(n) time in expectation
- Worst-case O(n) space.
- Worst-case O(1) query time (two hashes).

Cuckoo - Dynamic Hashing (Pagh and Rodler 2001 [5])

On the nesting habits of the Cuckoo bird...

- O(1) expected time for insert


- O(1) worst-case time for queries/deletes.
- Requires two O(lg n)-independent hash functions, h1 and h2 . (OPEN: same bound using only
O(1)-independent hash family)
- m > 2n (we will use m = 4n).
- Invariant: x is either at T [h1 (x)] or at T [h2 (x)] ⇒ query/delete takes worst-case two probes.

Insertion:

1. Compute h1 (x),
2. If T [h1 (x)] is empty, we put x there, and we are done.
Otherwise, if y ∈ T [h1 (x)], we evict y and put x in T [h1 (x)].
3. We find a new spot for y by looking at T [h1 (y)] or T [h2 (y)] (the one that is not occupy by x).
4. Repeat this process. After 6 lg n steps stop and rehash.
Let x1 , x2 , . . . , xt be the items that are evicted during the process.
- Cuckoo graph G = (V, E), where V = [m] and (h1 (x), h2 (x)) ∈ E for all x ∈ U . Insertion is one of
three possible walks on G:

x7
•o •O
x1 x1
x8 x6
¹ x2 x3 x4 x5 ¸ ²
• /• /• /• /• •^
x2
/• x3
/• /• x4 x5
/•
I_ u ^I_ u ^I_ u
x2 x3 x4

x1

¼ x9 x10 x11 x12


• /• /• /• /•

x7
•o •O
x1
x8 x6
¸ x2 x3 x4 ² x5
•^ /• /• /• /•
I_ u ^I_ u ^I_ u
x2 x3 x4

x1

¼ x9 x10
• /• /•
O
x13 x11
²
•o x12

Key observation: our functions are O(lg n)-independent so we can treat them as truly random
functions.

– No cycle: Pr[1st eviction] = Pr[T [h1 (x1 )] is occupied ] ≤


X 1 2n 1
(Pr[h1 (x) = h1 (x1 )] + Pr[h2 (x) = h1 (x1 )]) < 2n = = .
m 4n 2
x∈S,x6=x1

By same reasoning, Pr[2nd ≤ 2−2 , and Pr[tth eviction] ≤ 2−t ⇒ the expected running
P∞ eviction]
time of this case is ≤ t=1 t · 2−t = O(1).

1
Also, Pr[rehash] ≤ 2−6 lg n ≤ n2
(∗)

– One cycle: One of the path parts (solid, dashed


P or dotted) is at least t/3 long.
⇒ the expected running time of this case is ≤ ∞ t=1 t · 2−t/3 = O(1).
1
Also, Pr[rehash] ≤ 2−(6 lg n)/3 = n2
(∗)

– Two cycles: Counting argument. How many two-cycle configurations are there?
• The first item in the sequence is x1 .
• At most nt−1 choices of other items in the sequence.
• At most t choices for where the first loop occurs, t choices for where this loop returns, and
t choices for when the second loop occurs.
• We also have to pick t − 1 hash values to associate with the items.
⇒ At most t3 nt−1 (4n)t−1 configurations.

The probability that a specific configuration occurs is 2t (4n)−2t . Why?


⇒ The probability that some two-cycle configuration occurs is at most

t3 nt−1 (4n)t−1 2t t3
= .
(4n)2t 4n2 2t

⇒ The probability that a two-cycle occurs at all is at most



X ∞ µ ¶
t3 1 X t3 1 1
= = · O(1) = O .(∗)
4n2 2t 4n2 2t 2n2 n2
t=2 t=2

By (∗)’s, Pr[insertion causes rehash] ≤ O(1/n2 ).


⇒ Pr[n insertions cause rehash] ≤ O(1/n).
⇒ Rehashing (n insertions) succeeds with prob. 1 − O(1/n), so after constant number of trials.
- A trial takes n · O(1) + O(lg n) = O(n) time in expectation.
⇒ Rehashing takes O(n) time in expectation.
⇒ The expected running time of an insertion is O(1) + O(1/n2 ) · O(n) = O(1) + O(1/n) = O(1).

References
1. M. Fredman, J. Komlós, E. Szemerédi, Storing a Sparse Table with O(1) Worst Case Access Time, Journal of the
ACM, 31(3):538-544, 1984.
2. G. Gonnet, Expected Length of the Longest Probe Sequence in Hash Code Searching, Journal of the ACM, 28(2):289-
304, 1981.
3. M. Mitzenmacher, The Power of Two Choices in Randomized Load Balancing, Ph.D. Thesis 1996.
4. A. Ostlin, R. Pagh, Uniform hashing in constant time and linear space, 35th STOC, p. 622-628, 2003.
5. R. Pagh, F. Rodler, Cuckoo Hashing, Journal of Algorithms, 51(2004), p. 122-144.
6. A. Siegel, On universal classes of fast hash functions, their time-space tradeoff, and their applications, 30th FOCS,
p. 20-25, Oct. 1989.

You might also like