Week13 1
Week13 1
Week13 1
Introduction to Hashing
What is hashing?
Hashing is the process of converting a given key into another value. A hash function is used to
generate the new value according to a mathematical algorithm. The result of a hash function is
known as a hash value or simply, a hash.
A good hash function uses a one-way hashing algorithm, or in other words, the hash cannot be
converted back into the original key.
Collisions
Keep in mind that two keys can generate the same hash. This phenomenon is known as
a collision.
Hashing Mechanism
In hashing,
An array data structure called as Hash table is used to store key-value pairs.
The key is sent to a hash function that performs arithmetic operations on it. Hash function
maps any big number or string to a small integer value and is commonly called the hash
value or hash.It is used as an index of the key-value pair in the hash table.
Based on the hash value, data items are inserted into the hash table.
Hash functions
There are many hash functions that use numeric or alphanumeric keys. Let us focus on
discussing different hash functions:
1. Division Method.
2. Mid Square Method.
3. Folding Method.
4. Multiplication Method.
1. Division Method:
This is the most simple and easiest method to generate a hash value. The hash function divides
the value k by M and then uses the remainder obtained.
Formula:
h(K) = k mod M
Here,
k is the key value, and
M is the size of the hash table.
It is best suited that M is a prime number as that can make sure the keys are more uniformly
distributed. The hash function is dependent upon the remainder of a division.
Examples:
k = 12345
M = 95
h(12345) = 12345 mod 95
= 90
k = 1276
M = 11
h(1276) = 1276 mod 11
=0
Pros:
1. This method is quite good for any value of M.
2. The division method is very fast since it requires only a single division operation.
Cons:
1. This method leads to poor performance since consecutive keys map to consecutive hash
values in the hash table.
2. Sometimes extra care should be taken to choose value of M.
4. Multiplication Method
This method involves the following steps:
1. Choose a constant value A such that 0 < A < 1.
2. Multiply the key value with A.
3. Extract the fractional part of kA.
4. Multiply the result of the above step by the size of the hash table i.e. M.
5. The resulting hash value is obtained by taking the floor of the result obtained in step 4.
Formula:
h(K) = floor (M (kA mod 1))
Here,
M is the size of the hash table.
k is the key value.
A is a constant value.
Example:
k = 12345
A = 0.357840
M = 100
h(12345) = floor[ 100 (12345*0.357840 mod 1)]
= floor[ 100 (4417.5348 mod 1) ]
= floor[ 100 (0.5348) ]
= floor[ 53.48 ]
= 53
Pros:
The advantage of the multiplication method is that it can work with any value of between 0 and
1, although there are some values that tend to give better results than the rest.
Cons:
The multiplication method is generally suitable when the table size is the power of two, then
the whole process of computing the index by the key using multiplication hashing is very fast.
In computer science, a perfect hash function h for a set S is a hash function that maps distinct
elements in S to a set of m integers, with no collisions.
Hash Collisions
What is Collision?
If the set of objects we intend to store within our hash table is larger than the size of our hash
table we are bound to have two or more different objects having the same hash value; a hash
collision. Even if the size of the hash table is large enough to accommodate all the objects
finding a hash function which generates a unique hash for each object in the hash table is a
difficult task.
Collisions are bound to occur (unless we find a perfect hash function, which in most of the
cases is hard to find) but can be significantly reduced with the help of various collision
resolution techniques.
Following are the collision resolution techniques used:
Open Hashing (Separate chaining)
Closed Hashing (Open Addressing)
o Liner Probing
o Quadratic probing
o Double hashing
Suppose you wish to store a set of numbers = {0,1,2,4,5,7} into a hash table of size 5.
Now, assume that we have a hash function H, such that H(x) = x%5
So, if we were to map the given data with the given hash function we'll get the
corresponding values.
H(0)-> 0%5 = 0
H(1)-> 1%5 = 1
H(2)-> 2%5 = 2
H(4)-> 4%5 = 4
H(5)-> 5%5 = 0
H(7)-> 7%5 = 2
Clearly 0 and 5, as well as 2 and 7 will have the same hash value, and in this case we'll simply
append the colliding values to a list being pointed by their hash keys.
Obviously in practice the table size can be significantly large and the hash function can be even
more complex, also the data being hashed would be more complex and non-primitive, but the
idea remains the same.
This is an easy way to implement hashing but it has its own demerits.
a. Linear Probing
b. Quadratic probing
c. Double hashing
a. Linear Probing
The idea of linear probing is simple, we take a fixed sized hash table and every time we face a
hash collision we linearly traverse the table in a cyclic manner to find the next empty slot.
Assume a scenario where we intend to store the following set of numbers = {0,1,2,4,5,7}
into a hash table of size 5 with the help of the following hash function H, such that H(x)
= x%5.
So, if we were to map the given data with the given hash function we'll get the
corresponding values
H(0)-> 0%5 = 0
H(1)-> 1%5 = 1
H(2)-> 2%5 = 2
H(4)-> 4%5 = 4
H(5)-> 5%5 = 0
In this case we see a collision of two terms (0 & 5). In this situation we move linearly
down the table to find the first empty slot. Note that this linear traversal is cyclic in
nature, i.e. in the event we exhaust the last element during the search we start again
from the beginning until the initial key is reached.
b. Quadratic Probing
The general idea remains the same, the only difference is that we look at the Q(i) increment at
each iteration when looking for an empty bucket, where Q(i) is some quadratic expression of i.
A simple expression of Q would be Q(i) = i^2, in which case the hash function looks something
like this:
H(x, i) = (H(x) + i^2)%N
Assume a scenario where we intend to store the following set of numbers = {0,1,2,5}
into a hash table of size 5 with the help of the following hash function H, such
that H(x, i) = (x%5 + i^2)%5.
Clearly 5 and 0 will face a collision, in which case we'll do the following:
c. Double Hashing
This method is based upon the idea that in the event of a collision we use an another hashing
function with the key value as an input to find where in the open addressing scheme the data
should actually be placed at.
In this case we use two hashing functions, such that the final hashing function looks
like:
H(x, i) = (H1(x) + i*H2(x))%N
Typically for H1(x) = x%N a good H2 is H2(x) = P - (x%P), where P is a prime number
smaller than N.
A good H2 is a function which never evaluates to zero and ensures that all the cells
of a table are effectively traversed.
Assume a scenario where we intend to store the following set of numbers = {0,1,2,5}
into a hash table of size 5 with the help of the following hash function H, such that
Below is a simple hash function that returns the modulus of the length of the hash table. In our
case, the length of the hash table is 10.
Modulo operator (%) is used in the hashing function. The % (modulo) operator yields the
remainder from the division of the first argument by the second.
defhashing_func(key):
print(hashing_func(10))# Output: 0
print(hashing_func(20))# Output: 0
print(hashing_func(25))# Output: 5
Collision
A collision occurs when two items/values get the same slot/index, i.e. the hashing function
generates same slot number for multiple items. If proper collision resolution steps are not
taken then the previous item in the slot will be replaced by the new item whenever the collision
occurs.
Collision Resolution
Chaining
One of the ways to resolve collision is Chaining. This allows multiple items exist in the same
slot/index. This can create a chain/collection of items in a single slot. When the collision
happens, the item is stored in the same slot using chaining mechanism.
While implementing Chaining in Python, we first create the hash table as a nested list (lists
inside a list).
hash_table=[[]for _ inrange(10)]
print(hash_table)
# Output:
# [[], [], [], [], [], [], [], [], [], []]
hash_key = hashing_func(key)
bucket = hash_table[hash_key]
bucket.append((key, value))
insert(hash_table,10,'Nepal')
insert(hash_table,25,'USA')
insert(hash_table,20,'India')
print(hash_table)
# Output:
# [[(10, 'Nepal'), (20, 'India')], [], [], [], [], [(25, 'USA')], [], [], [], []]
defsearch(hash_table, key):
hash_key = hashing_func(key)
bucket =hash_table[hash_key]
k, v =kv
if key == k:
return v
While deleting any existing element from the hash table, we first search if the key already exists
in the hash table.
– If the key is present (found) in the hash table, then we simply delete it. We delete that
particular key-value pair from the hash table.
– Otherwise, no operation is done. We can simply print a message saying that the key was not
found in the hash table.
def delete(hash_table, key):
hash_key = hashing_func(key)
key_exists=False
bucket =hash_table[hash_key]
k, v =kv
if key == k:
key_exists=True
break
if key_exists:
del bucket[i]
else:
delete(hash_table,100)
print(hash_table)
# Output:
# Key 100 not found
# [[(10, 'Nepal'), (20, 'India')], [], [], [], [], [(25, 'USA')], [], [], [], []]
delete(hash_table,10)
print(hash_table)
# Output:
# Key 10 deleted
# [[(20, 'India')], [], [], [], [], [(25, 'USA')], [], [], [], []]
Complete Code
#create hash table as a nested list
hash_table = [[] for _ in range(10)]
def hashing_func(key):
# division method
return key % len(hash_table)
#use `append()` function to insert key-value pairs in the hash table
def insert(hash_table, key, value):
hash_key = hashing_func(key)
bucket = hash_table[hash_key]
bucket.append((key, value))
def search(hash_table, key):
hash_key = hashing_func(key)
bucket = hash_table[hash_key]
for i, kv in enumerate(bucket):
k, v = kv
if key == k:
return v
def delete(hash_table, key):
hash_key = hashing_func(key)
key_exists = False
bucket = hash_table[hash_key]
for i, kv in enumerate(bucket):
k, v = kv
if key == k:
key_exists = True
break
if key_exists:
del bucket[i]
print("key ",key, " deleted")
else:
print ("key ",key, "not found")
Output:
[[(10, 'Nepal'), (20, 'India')], [], [], [], [], [(25, 'USA')], [], [], [], []]
Nepal
India
None
key 100 not found
[[(10, 'Nepal'), (20, 'India')], [], [], [], [], [(25, 'USA')], [], [], [], []]
key 10 deleted
[[(20, 'India')], [], [], [], [], [(25, 'USA')], [], [], [], []]
==============================================================================