[go: up one dir, main page]

0% found this document useful (0 votes)
62 views16 pages

Week13 1

Download as docx, pdf, or txt
Download as docx, pdf, or txt
Download as docx, pdf, or txt
You are on page 1/ 16

Week 13 – DS with Python

Introduction to Hashing

What is hashing?
Hashing is the process of converting a given key into another value. A hash function is used to
generate the new value according to a mathematical algorithm. The result of a hash function is
known as a hash value or simply, a hash.

A good hash function uses a one-way hashing algorithm, or in other words, the hash cannot be
converted back into the original key.

Collisions
Keep in mind that two keys can generate the same hash. This phenomenon is known as
a collision.

Hashing Mechanism
In hashing,

 An array data structure called as Hash table is used to store key-value pairs.
 The key is sent to a hash function that performs arithmetic operations on it. Hash function
maps any big number or string to a small integer value and is commonly called the hash
value or hash.It is used as an index of the key-value pair in the hash table.
 Based on the hash value, data items are inserted into the hash table.
Hash functions
There are many hash functions that use numeric or alphanumeric keys. Let us focus on
discussing different hash functions:
1. Division Method.
2. Mid Square Method.
3. Folding Method.
4. Multiplication Method.

1. Division Method:
This is the most simple and easiest method to generate a hash value. The hash function divides
the value k by M and then uses the remainder obtained.
Formula:
h(K) = k mod M
Here,
k is the key value, and
M is the size of the hash table.
It is best suited that M is a prime number as that can make sure the keys are more uniformly
distributed. The hash function is dependent upon the remainder of a division.
Examples:
k = 12345
M = 95
h(12345) = 12345 mod 95
= 90
k = 1276
M = 11
h(1276) = 1276 mod 11
=0
Pros:
1. This method is quite good for any value of M.
2. The division method is very fast since it requires only a single division operation.
Cons:
1. This method leads to poor performance since consecutive keys map to consecutive hash
values in the hash table.
2. Sometimes extra care should be taken to choose value of M.

2. Mid Square Method:


The mid square method is a very good hashing method. It involves two steps to compute the
hash value-
1. Square the value of the key k i.e. k2
2. Extract the middle r digits as the hash value.
Formula:
h(K) = h(k x k)
Here,
k is the key value.
The value of r can be decided based on the size of the table.
Example:
Suppose the hash table has 100 memory locations. So r = 2 because two digits are required to
map the key to the memory location.
k = 60
k x k = 60 x 60
= 3600
h(60) = 60
The hash value obtained is 60
Pros:
1. The performance of this method is good as most or all digits of the key value contribute to
the result. This is because all digits in the key contribute to generating the middle digits of
the squared result.
2. The result is not dominated by the distribution of the top digit or bottom digit of the original
key value.
Cons:
1. The size of the key is one of the limitations of this method, as the key is of big size then its
square will double the number of digits.
2. Another disadvantage is that there will be collisions but we can try to reduce collisions.

3. Digit Folding Method:


This method involves two steps:
1. Divide the key-value k into a number of parts i.e. k1, k2, k3,….,kn, where each part has the
same number of digits except for the last part that can have lesser digits than the other
parts.
2. Add the individual parts. The hash value is obtained by ignoring the last carry if any.
Formula:
k = k1, k2, k3, k4, ….., kn
s = k1+ k2 + k3 + k4 +….+ kn
h(K)= s
Here,
s is obtained by adding the parts of the key k
Example:
k = 12345
k1 = 12, k2 = 34, k3 = 5
s = k1 + k2 + k3
= 12 + 34 + 5
= 51
h(K) = 51
Note:
The number of digits in each part varies depending upon the size of the hash table. Suppose for
example the size of the hash table is 100, then each part must have two digits except for the
last part that can have a lesser number of digits.

4. Multiplication Method
This method involves the following steps:
1. Choose a constant value A such that 0 < A < 1.
2. Multiply the key value with A.
3. Extract the fractional part of kA.
4. Multiply the result of the above step by the size of the hash table i.e. M.
5. The resulting hash value is obtained by taking the floor of the result obtained in step 4.
Formula:
h(K) = floor (M (kA mod 1))
Here,
M is the size of the hash table.
k is the key value.
A is a constant value.
Example:
k = 12345
A = 0.357840
M = 100
h(12345) = floor[ 100 (12345*0.357840 mod 1)]
= floor[ 100 (4417.5348 mod 1) ]
= floor[ 100 (0.5348) ]
= floor[ 53.48 ]
= 53
Pros:
The advantage of the multiplication method is that it can work with any value of between 0 and
1, although there are some values that tend to give better results than the rest.
Cons:
The multiplication method is generally suitable when the table size is the power of two, then
the whole process of computing the index by the key using multiplication hashing is very fast.

Perfect hashing function


Definition: It is a hash function that maps each different key to a distinct integer. Usually all
possible keys must be known beforehand. A hash table that uses a perfect hash has
no collisions.

In computer science, a perfect hash function h for a set S is a hash function that maps distinct
elements in S to a set of m integers, with no collisions.

Hash Collisions
What is Collision?

If the set of objects we intend to store within our hash table is larger than the size of our hash
table we are bound to have two or more different objects having the same hash value; a hash
collision. Even if the size of the hash table is large enough to accommodate all the objects
finding a hash function which generates a unique hash for each object in the hash table is a
difficult task.

Collisions are bound to occur (unless we find a perfect hash function, which in most of the
cases is hard to find) but can be significantly reduced with the help of various collision
resolution techniques.
Following are the collision resolution techniques used:
 Open Hashing (Separate chaining)
 Closed Hashing (Open Addressing)
o Liner Probing
o Quadratic probing
o Double hashing

1. Open Hashing (Separate chaining)


Collisions are resolved using a list of elements to store objects with the same key together.

 Suppose you wish to store a set of numbers = {0,1,2,4,5,7} into a hash table of size 5.
 Now, assume that we have a hash function H, such that H(x) = x%5
 So, if we were to map the given data with the given hash function we'll get the
corresponding values.

 H(0)-> 0%5 = 0
 H(1)-> 1%5 = 1
 H(2)-> 2%5 = 2
 H(4)-> 4%5 = 4
 H(5)-> 5%5 = 0
 H(7)-> 7%5 = 2

 Clearly 0 and 5, as well as 2 and 7 will have the same hash value, and in this case we'll simply
append the colliding values to a list being pointed by their hash keys.
Obviously in practice the table size can be significantly large and the hash function can be even
more complex, also the data being hashed would be more complex and non-primitive, but the
idea remains the same.
This is an easy way to implement hashing but it has its own demerits.

 The lookups/inserts/updates can become linear [O(N)] instead of constant time


[O(1)] if the hash function has too many collisions.
 It doesn't account for any empty slots.
 Ideally we require a good hash function to guarantee even distribution of the values.

2. Closed Hashing (Open Addressing)


This collision resolution technique requires a hash table with fixed and known size. During
insertion, if a collision is encountered, alternative cells are tried until an empty bucket is found.
These techniques require the size of the hash table to be supposedly larger than the number of
objects to be stored .

There are various methods to find these empty buckets:

a. Linear Probing
b. Quadratic probing
c. Double hashing

a. Linear Probing
The idea of linear probing is simple, we take a fixed sized hash table and every time we face a
hash collision we linearly traverse the table in a cyclic manner to find the next empty slot.

 Assume a scenario where we intend to store the following set of numbers = {0,1,2,4,5,7}
into a hash table of size 5 with the help of the following hash function H, such that H(x)
= x%5.
 So, if we were to map the given data with the given hash function we'll get the
corresponding values

 H(0)-> 0%5 = 0
 H(1)-> 1%5 = 1
 H(2)-> 2%5 = 2
 H(4)-> 4%5 = 4
 H(5)-> 5%5 = 0

 In this case we see a collision of two terms (0 & 5). In this situation we move linearly
down the table to find the first empty slot. Note that this linear traversal is cyclic in
nature, i.e. in the event we exhaust the last element during the search we start again
from the beginning until the initial key is reached.

 In this case our hash function can be considered as this:


H(x, i) = (H(x) + i)%N
where N is the size of the table and i represents the linearly increasing variable
which starts from 1 (until empty bucket is found).
Despite being easy to compute, implement and deliver best cache performance, this suffers
from the problem of clustering (many consecutive elements get grouped together, which
eventually reduces the efficiency of finding elements or empty buckets).

b. Quadratic Probing
The general idea remains the same, the only difference is that we look at the Q(i) increment at
each iteration when looking for an empty bucket, where Q(i) is some quadratic expression of i.
A simple expression of Q would be Q(i) = i^2, in which case the hash function looks something
like this:
H(x, i) = (H(x) + i^2)%N
 Assume a scenario where we intend to store the following set of numbers = {0,1,2,5}
into a hash table of size 5 with the help of the following hash function H, such
that H(x, i) = (x%5 + i^2)%5.
 Clearly 5 and 0 will face a collision, in which case we'll do the following:

- we look at 5%5 = 0 (collision)


- we look at (5%5 + 1^2)%5 = 1 (collision)
- we look at (5%5 + 2^2)%5 = 4 (empty -> place element here)

c. Double Hashing
This method is based upon the idea that in the event of a collision we use an another hashing
function with the key value as an input to find where in the open addressing scheme the data
should actually be placed at.

 In this case we use two hashing functions, such that the final hashing function looks
like:
H(x, i) = (H1(x) + i*H2(x))%N
 Typically for H1(x) = x%N a good H2 is H2(x) = P - (x%P), where P is a prime number
smaller than N.
 A good H2 is a function which never evaluates to zero and ensures that all the cells
of a table are effectively traversed.
 Assume a scenario where we intend to store the following set of numbers = {0,1,2,5}
into a hash table of size 5 with the help of the following hash function H, such that

H(x, i) = (H1(x) + i*H2(x))%5


H1(x) = x%5 and H2(x) = P - (x%P), where P = 3
(3 is a prime smaller than 5)
 Clearly 5 and 0 will face a collision, in which case we'll do the following:

- we look at 5%5 = 0 (collision)


- we look at (5%5 + 1*(3 - (5%3)))%5 = 1 (collision)
- we look at (5%5 + 2*(3 - (5%3)))%5 = 2 (collision)
- we look at (5%5 + 3*(3 - (5%3)))%5 = 3 (empty -> place element here)

Coding Hash Function & Hash Table(with


chaining)in python

Below is a simple hash function that returns the modulus of the length of the hash table. In our
case, the length of the hash table is 10.

Modulo operator (%) is used in the hashing function. The % (modulo) operator yields the
remainder from the division of the first argument by the second.

defhashing_func(key):

return key %len(hash_table)

print(hashing_func(10))# Output: 0
print(hashing_func(20))# Output: 0
print(hashing_func(25))# Output: 5

Collision
A collision occurs when two items/values get the same slot/index, i.e. the hashing function
generates same slot number for multiple items. If proper collision resolution steps are not
taken then the previous item in the slot will be replaced by the new item whenever the collision
occurs.

Collision Resolution
Chaining
One of the ways to resolve collision is Chaining. This allows multiple items exist in the same
slot/index. This can create a chain/collection of items in a single slot. When the collision
happens, the item is stored in the same slot using chaining mechanism.

While implementing Chaining in Python, we first create the hash table as a nested list (lists
inside a list).

hash_table=[[]for _ inrange(10)]

print(hash_table)

# Output:
# [[], [], [], [], [], [], [], [], [], []]

Inserting Data into Hash Table


We use `append()` function to insert key-value pairs in the hash table.

def insert(hash_table, key, value):

hash_key = hashing_func(key)

bucket = hash_table[hash_key]

bucket.append((key, value))

insert(hash_table,10,'Nepal')

insert(hash_table,25,'USA')
insert(hash_table,20,'India')

print(hash_table)

# Output:
# [[(10, 'Nepal'), (20, 'India')], [], [], [], [], [(25, 'USA')], [], [], [], []]

Search Data from the Hash Table


While searching for any key in the hash table, we have to loop through each individual sublist.

defsearch(hash_table, key):

hash_key = hashing_func(key)

bucket =hash_table[hash_key]

for i,kv in enumerate(bucket):

k, v =kv

if key == k:

return v

print(search(hash_table,10))# Output: Nepal


print(search(hash_table,20))# Output: India
print(search(hash_table,30))# Output: None

Delete Data from the Hash Table


Deleting element (key-value pair) from the hash table is somewhat similar to inserting element.
The loop is almost the same.

While deleting any existing element from the hash table, we first search if the key already exists
in the hash table.

– If the key is present (found) in the hash table, then we simply delete it. We delete that
particular key-value pair from the hash table.
– Otherwise, no operation is done. We can simply print a message saying that the key was not
found in the hash table.
def delete(hash_table, key):

hash_key = hashing_func(key)

key_exists=False

bucket =hash_table[hash_key]

for i,kv in enumerate(bucket):

k, v =kv

if key == k:

key_exists=True

break

if key_exists:

del bucket[i]

print("key ",key, " deleted")

else:

print ("key ",key, "not found")

delete(hash_table,100)

print(hash_table)

# Output:
# Key 100 not found
# [[(10, 'Nepal'), (20, 'India')], [], [], [], [], [(25, 'USA')], [], [], [], []]

delete(hash_table,10)
print(hash_table)
# Output:
# Key 10 deleted
# [[(20, 'India')], [], [], [], [], [(25, 'USA')], [], [], [], []]
Complete Code
#create hash table as a nested list
hash_table = [[] for _ in range(10)]
def hashing_func(key):
# division method
return key % len(hash_table)
#use `append()` function to insert key-value pairs in the hash table
def insert(hash_table, key, value):
hash_key = hashing_func(key)
bucket = hash_table[hash_key]
bucket.append((key, value))
def search(hash_table, key):
hash_key = hashing_func(key)
bucket = hash_table[hash_key]
for i, kv in enumerate(bucket):
k, v = kv
if key == k:
return v
def delete(hash_table, key):
hash_key = hashing_func(key)
key_exists = False
bucket = hash_table[hash_key]
for i, kv in enumerate(bucket):
k, v = kv
if key == k:
key_exists = True
break
if key_exists:
del bucket[i]
print("key ",key, " deleted")
else:
print ("key ",key, "not found")

insert(hash_table, 10, 'Nepal')


insert(hash_table, 25, 'USA')
insert(hash_table, 20, 'India')
print (hash_table)
print (search(hash_table, 10)) # Output: Nepal
print (search(hash_table, 20)) # Output: India
print (search(hash_table, 30)) # Output: None
delete(hash_table, 100)
print (hash_table)
delete(hash_table, 10)
print (hash_table)

Output:
[[(10, 'Nepal'), (20, 'India')], [], [], [], [], [(25, 'USA')], [], [], [], []]
Nepal
India
None
key 100 not found
[[(10, 'Nepal'), (20, 'India')], [], [], [], [], [(25, 'USA')], [], [], [], []]
key 10 deleted
[[(20, 'India')], [], [], [], [], [(25, 'USA')], [], [], [], []]
==============================================================================

You might also like