
Hash Tables

ESO207

Indian Institute of Technology, Kanpur


Introduction

• Consider a dynamic set that supports the dictionary operations INSERT, SEARCH and DELETE.
• The previously seen LIST (and even STACK and QUEUE) data structures can support these operations, but not efficiently.
• A hash table is an effective data structure for implementing dictionaries.
• Searching for an element in a hash table can take as long as in a linked list, Θ(n) in the worst case, but
• in practice hashing performs extremely well: under reasonable assumptions, the expected time to search for an element in a hash table is O(1).
Introduction (contd.)

• A hash table generalizes an array’s ability to be directly addressed by an index in O(1) time.
• Notation: Let U be the universe of possible key values.
• The dynamic set S at any time is a subset of U.
• For simplicity (although this is not fully general), assume that U = {0, 1, . . . , m − 1} is a set of integer keys.
• No two elements of the set have the same key.
Direct Addressing
Direct Address Tables

• Direct addressing works well when the universe is small.


• An array called the direct-address table is used and denoted by T[0, . . . , m − 1].
• Each position or slot of the array corresponds to a key in the universe U.
• If the dynamic set is S, then for each k ∈ S, T[k] (that is, slot k) points to the element of the set with key k.
• For each k ∈ U − S, T[k] is NIL.
Operations: Direct Address Tables

DIRECT-ADDRESS-SEARCH(T, k)

1. return T[k]

DIRECT-ADDRESS-INSERT(T, x)

1. T[x.key] = x

DIRECT-ADDRESS-DELETE(T, x)

1. T[x.key] = NIL
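The three operations above can be sketched in Python (an illustrative implementation, not from the slides; elements are represented as small dictionaries with a "key" field):

```python
# A minimal sketch of a direct-address table over keys 0..m-1.
# The class name and element representation are illustrative.

class DirectAddressTable:
    def __init__(self, m):
        self.T = [None] * m           # slot k holds the element with key k, or None (NIL)

    def search(self, k):
        return self.T[k]              # DIRECT-ADDRESS-SEARCH: return T[k]

    def insert(self, x):
        self.T[x["key"]] = x          # DIRECT-ADDRESS-INSERT: T[x.key] = x

    def delete(self, x):
        self.T[x["key"]] = None       # DIRECT-ADDRESS-DELETE: T[x.key] = NIL
```

Each operation is a single array access, which is where the worst-case O(1) bound comes from.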
Direct Address Tables

• For some applications, the direct-address table can hold the elements directly.
• That is, rather than storing the element’s key and satellite data in an object external to the table, with a pointer from the slot, we can store the object in the slot itself.
• For this, empty slots need to be indicated by a special key value.
Hash Tables

• Downsides of direct addressing:

1. If the universe U is large, storing a table T of size |U| may be impractical or even impossible.
2. Further, the set of keys K actually stored in the table may be very small, so most of the allocated space would be wasted.
3. Hash tables, in contrast, typically need space O(|K|), while still supporting search in expected O(1) time. [For direct addressing, this was worst-case O(1) time.]
Hash Tables
• Keep a table T consisting of m slots, numbered T[0, . . . , m − 1].
• A hash function h maps each key value of the universe U to some slot in T, that is,

h : U → {0, 1, . . . , m − 1}

• We say that an element with key k hashes to slot h(k).

• In general, m ≪ |U| (and that is where the benefit in the storage requirement of hash tables arises).
• Hence, two keys from U may hash to the same value, that is,

Collision: h(k1) = h(k2), k1, k2 ∈ U, k1 ≠ k2.

In such a case we say that the keys k1 and k2 collide under the hash function h.
Collisions

• Collisions cannot be avoided, since |U| > m.


• One of the goals of the design of hash functions is to
make h appear “random”.
• The very term “to hash” evokes images of random mixing and chopping.
• We will explain the notion of randomness of hash functions later. Note that a hash function h is deterministic: given the same input key k, it always produces the same value h(k).
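A tiny illustration of both points, using the division-method hash h(k) = k mod m introduced later in these slides (the value m = 7 and the sample keys are arbitrary):

```python
# Illustrative: a deterministic hash function, and a collision between
# two distinct keys (collisions are unavoidable whenever |U| > m).

m = 7

def h(k):
    return k % m              # deterministic: the same k always maps to the same slot

same = (h(12) == h(12))       # determinism: repeated evaluation agrees
collide = (h(12) == h(19))    # 12 and 19 are distinct keys that both hash to slot 5
```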
Collision resolution: by chaining

• The simplest way to resolve collisions is to keep a linked list of the elements that hash to each slot.
• Each slot j contains a pointer to the head of the list of all the stored elements that hash to slot j.
• If there are no such elements, slot j contains NIL.
Code for Insert, Search and Delete

CHAINED-HASH-INSERT(T, x)

1. Insert x at the head of the list T[h(x.key)]

CHAINED-HASH-SEARCH(T, k)

1. Search for an element with key k in list T[h(k)]

CHAINED-HASH-DELETE(T, x)

1. Delete x from the list T[h(x.key)]

• Worst-case time for insertion: O(1). This assumes the element being inserted is not already present.
• Otherwise, we can check this assumption by searching for an element whose key is x.key before inserting.
Search, Delete

CHAINED-HASH-INSERT(T, x)

1. Insert x at the head of the list T[h(x.key)]

CHAINED-HASH-SEARCH(T, k)

1. Search for an element with key k in list T[h(k)]

CHAINED-HASH-DELETE(T, x)

1. Delete x from the list T[h(x.key)]

• Worst-case time for Search: proportional to the length of the linked list.
• Worst-case time for Delete: O(1). Note that deletion takes a pointer x to the element to be deleted.
• Since the chains are doubly linked lists, deletion is fast.
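The chained operations above can be sketched as follows (an illustrative Python version; for brevity each chain is a Python list rather than a doubly linked list, so DELETE here scans the chain instead of running in O(1)):

```python
# A minimal chained hash table following the CHAINED-HASH-* operations.
# Chains are Python lists for brevity (a doubly linked list would give O(1) delete).

class ChainedHashTable:
    def __init__(self, m):
        self.m = m
        self.T = [[] for _ in range(m)]    # m slots, each an initially empty chain

    def _h(self, key):
        return key % self.m                # division-method hash (illustrative choice)

    def insert(self, x):
        self.T[self._h(x["key"])].insert(0, x)   # insert at the head of the chain

    def search(self, key):
        for x in self.T[self._h(key)]:           # scan the chain at slot h(key)
            if x["key"] == key:
                return x
        return None                              # NIL

    def delete(self, x):
        self.T[self._h(x["key"])].remove(x)      # remove the given element from its chain
```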
Comment

Hash tables with chaining (separate chaining) are a very popular data structure in industry.
Analysis of Chaining: introduction

• Load factor: Given a hash table T with m slots that stores n elements, the load factor α for T is n/m, that is, the average number of elements stored in a chain.
• In practice, a well-designed hash table attempts to keep α close to 1.
• Worst-case performance. In the worst case, the input may consist of n keys, all of which hash to the same slot. This gives a chain of length n at that slot and chains of length 0 at all other slots, so the worst-case time for searching is Θ(n).
• Hash tables are not used for their worst-case performance.
Analysis: introduction

• Hash tables are immensely popular because of their average-case performance.
• The average-case performance of hashing depends on how well the hash function h distributes the set of keys to be stored among the m slots {0, 1, . . . , m − 1}, on average.
• Idealized assumption about hash functions for the analysis:
• Simple uniform hashing. Any given key is equally likely to hash into any of the m slots, independently of where any other key hashes to.
Analysis
• Corresponding to slot j ∈ {0, 1, . . . , m − 1}, denote the length of the chain at slot j by n_j. Then

n = n_0 + n_1 + . . . + n_{m−1}

• The average chain length is the sum of all chain lengths divided by the number of slots, that is,

E[n_j] = (n_0 + n_1 + . . . + n_{m−1})/m = n/m = α,

which is the load factor of the hash table.
• For the analysis, we will make an important assumption, namely that

the hash value h(k) is computed in time O(1).
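A quick sanity check of the identity E[n_j] = n/m: whatever the hash function does, the chain lengths n_j always sum to n, so their average over the m slots is exactly α (the slot count, key count, and key range below are arbitrary):

```python
# Simulate hashing n random keys into m slots and check that the
# average chain length equals the load factor alpha = n/m.

import random

m, n = 8, 100
counts = [0] * m
for _ in range(n):
    k = random.randrange(10**6)      # a random key from a large universe
    counts[k % m] += 1               # n_j = number of keys hashing to slot j

alpha = n / m
total = sum(counts)                  # n = n_0 + n_1 + ... + n_{m-1}, by construction
average = total / m                  # average chain length: always equals alpha
```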


Analysis

• Expected time required to search for a key k in a hash table:
• Step 1: compute h(k) in O(1) time. Then search through the list at slot number h(k).
• The length of this list is n_{h(k)}.
• Hence, an unsuccessful search goes over all the keys in this list, requiring time O(n_{h(k)}).
Expected time for searching

• By the simple uniform hashing assumption, h(k) = j with equal probability 1/m for each slot j.
• The average time for search is therefore

E[n_{h(k)}] = (1/m) · (n_0 + n_1 + . . . + n_{m−1}) = n/m = α

as calculated before.
• The average-case time complexity of search is O(1 + α), where the O(1) term is the time required to compute the hash value h(k).
Expectation Analysis: contd.

• One would expect a successful search to take slightly less time on average, since the entire chain need not be scanned; the search halts once the key is found.
• A slightly more detailed analysis shows that the average time for a successful search is 1 + α/2 − α/(2m) = Θ(1 + α).
• All this analysis shows that if α = n/m is O(1), then the search operation requires O(1) time on average.
• Using doubly linked lists to store the chains, deletion takes O(1) time.
• Insertion (without searching for duplicates) takes O(1) time. Hence, all the hash table operations can be performed in O(1) time on average if α = O(1).
Good Hash Functions

• A good hash function should come close to realizing the


idealized assumption made by simple uniform hashing.
• Many hash functions assume that the universe of keys is
the set of natural numbers N = {0, 1, 2, . . .}.
• Under this assumption, either the keys themselves are
natural numbers, or they are transformed into natural
numbers (e.g., strings of characters may be viewed as
long numbers over a larger alphabet).
Good and bad hash functions

• The simplest hash function is of the form

h(k) = k mod m

• This hash function is very efficient to compute, although it does not have good uniformity properties.
• Using m = 2^r for some value of r is often not recommended, since k mod 2^r is just the low-order r bits of k.
• Unless we know that all patterns of the low-order r bits are equally likely, we are better off designing a hash function whose value depends on all the bits of the key.
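The low-order-bits problem can be seen directly: with m = 2^r, keys that differ only in their high-order bits all collide (the bit patterns below are illustrative):

```python
# Why m = 2^r can be a poor choice: k mod 2^r keeps only the low-order
# r bits of k, so keys differing only in higher bits land in the same slot.

r = 4
m = 2 ** r
keys = [0b0000_0101, 0b0011_0101, 0b1111_0101]   # same low 4 bits, different high bits
slots = {k % m for k in keys}                    # all three map to slot 0b0101 = 5

# mod by a power of two is the same as masking off the low r bits
masked_equal = all(k % m == k & (m - 1) for k in keys)
```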
Example hash function: key modulo prime

• Choosing m to be a prime p that is not too close to a


power of 2 is often a good choice.
• For example, suppose there are n = 2000 keys to be
hashed. Let p = 701, which would give a load factor of
2000/701 ≈ 3 and h(k ) = k mod 701.
Multiplicative method for creating hash functions

• The multiplication method for creating hash functions works in two steps:

1. Choose a constant A in the range 0 < A < 1, multiply the key k by A, and take the fractional part of kA, that is, kA − ⌊kA⌋. This is also written as

kA mod 1 = kA − ⌊kA⌋

2. Now multiply this by the number of slots m and take the floor, that is,

h(k) = ⌊m (kA mod 1)⌋
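A direct sketch of h(k) = ⌊m(kA mod 1)⌋ in Python, using floating point for simplicity and the constant A = (√5 − 1)/2 that Knuth suggests later in these slides:

```python
# Multiplication method, floating-point version (illustrative; a real
# implementation would use the fixed-point word trick shown later).

import math

def mult_hash(k, m, A=(math.sqrt(5) - 1) / 2):
    frac = (k * A) % 1.0            # kA mod 1 = kA - floor(kA), the fractional part
    return math.floor(m * frac)     # scale the fraction into {0, ..., m-1}
```

Because only the fractional part of kA is scaled, the result always lies in {0, . . . , m − 1} for any choice of m; this is why the value of m is not critical for this method.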
Multiplication method

• Advantage: the value of m is not critical.
• An efficient way of implementing the multiplication method:

1. Choose m = 2^p, a power of 2.
2. Suppose the word size of the machine is w bits and a key value fits into a word.
3. Choose A to be of the form s/2^w, so that A · 2^w = s.
4. Since k and s are both w-bit integers, ks is a 2w-bit integer that we write as

ks = r1 · 2^w + r0

where r1 is the high-order w bits of ks and r0 is the low-order w bits.
Multiplication method

1. Then,

kA = ks/2^w = r1 + r0/2^w

2. Hence the fractional part of kA is r0/2^w (since 0 ≤ r0 ≤ 2^w − 1).
3. So, kA mod 1 = r0/2^w.
4. Hence,

h(k) = ⌊m (kA mod 1)⌋ = ⌊2^p · r0/2^w⌋

5. Since r0 is a w-bit number, multiplying r0/2^w by 2^p gives the top p bits of r0.
• Why? Write r0 as a w-bit binary number:

r0 = b_1 + b_2 · 2 + b_3 · 2^2 + . . . + b_w · 2^{w−1}

• Then,

r0 · 2^p / 2^w = (1/2^w) (b_1 · 2^p + b_2 · 2^{p+1} + . . . + b_w · 2^{w+p−1})
             = (1/2^{w−p}) (b_1 + b_2 · 2 + . . . + b_{w−p} · 2^{w−p−1})
               + (b_{w−p+1} + b_{w−p+2} · 2 + . . . + b_w · 2^{p−1})

• Thus,

⌊r0 · 2^p / 2^w⌋ = b_{w−p+1} + b_{w−p+2} · 2 + . . . + b_w · 2^{p−1}

which is the number corresponding to the top p bits of r0.
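The word-level recipe above can be sketched as follows (the concrete values of w and p are illustrative; s = ⌊A · 2^w⌋ for Knuth's constant A):

```python
# Fixed-point multiplication method for a w-bit word:
# h(k) = top p bits of the low word of k*s, i.e. ((k*s) mod 2^w) >> (w - p).

import math

w = 32                                            # machine word size (illustrative)
p = 14                                            # m = 2^p slots
s = math.floor(((math.sqrt(5) - 1) / 2) * 2**w)   # A = s / 2^w

def mult_hash_fixed(k):
    r0 = (k * s) & (2**w - 1)     # low-order w bits of the 2w-bit product k*s
    return r0 >> (w - p)          # top p bits of r0, as derived above
```

With these parameters s = 2654435769, and the hash of any w-bit key is computed with one multiplication and one shift, avoiding floating point entirely.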


Multiplication method

• Although this method works with any value of the constant A, experiments show that it works better with some values than others.
• Knuth suggests that

A ≈ (√5 − 1)/2 = 0.6180339887 . . .

is likely to work reasonably well.
