Hashing in DBMS
Components of Hashing
1. Hash Table:
o A hash table is an array-like data structure whose size is determined by the total volume of data records present in the database.
o Each memory location in a hash table is called a ‘bucket’ or hash index; it stores a data record’s exact location and is accessed through a hash function.
2. Bucket:
o A bucket is a memory location (index) in the hash table that stores the data record.
o These buckets generally store a disk block, which in turn stores multiple records. A bucket is also known as a hash index.
3. Hash Function:
o A hash function is a mathematical equation or algorithm that takes a data record’s primary key as input and computes the hash index as output (a short sketch of all three components follows this list).
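The following is a minimal Python sketch of how the three components fit together. The table size of 5, the function name hash_function, and the sample employee record are illustrative assumptions, not taken from any particular DBMS.

```python
# Minimal sketch: a 5-bucket hash table, where each bucket is a list
# (standing in for a disk block that can hold several records).
NUM_BUCKETS = 5
hash_table = [[] for _ in range(NUM_BUCKETS)]   # each inner list is one bucket

def hash_function(primary_key: int) -> int:
    """Map a record's primary key to a hash index (bucket number)."""
    return primary_key % NUM_BUCKETS

# Hypothetical data record, keyed by its primary key employee_id.
record = {"employee_id": 106, "name": "A. Sharma"}
bucket_index = hash_function(record["employee_id"])   # 106 % 5 = 1
hash_table[bucket_index].append(record)               # stored in bucket 1
```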
Working of Hash Function
The hash function generates a hash index from the primary key of the data record. There are then two possibilities (sketched below):
1. The hash index generated is not already occupied by any other value, so the address of the data record is stored there.
2. The hash index generated is already occupied by some other value. This is called a collision, and a collision resolution technique is applied to handle it.
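As a rough illustration of the two cases, the sketch below assumes each hash index can hold only one record address, so the second insertion with the same index makes the collision visible; the keys and the disk-block names are made up for the example.

```python
NUM_BUCKETS = 5
slots = [None] * NUM_BUCKETS              # one record address per hash index

def h(key: int) -> int:
    return key % NUM_BUCKETS

def place(key: int, record_address: str) -> None:
    idx = h(key)
    if slots[idx] is None:
        # Possibility 1: the hash index is free, so the record's address is stored here.
        slots[idx] = record_address
    else:
        # Possibility 2: the index is already occupied -> collision; a resolution
        # technique (chaining, open addressing, ...) would be applied here.
        print(f"collision at index {idx} for key {key}")

place(106, "disk-block-17")   # 106 % 5 = 1 -> stored at index 1
place(11, "disk-block-42")    # 11 % 5 = 1  -> collision with key 106
```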
1. Static Hashing
o In static hashing, the hash function always maps a given key to the same bucket address, and the number of buckets stays fixed.
o For example, suppose we have a data record for employee_id = 1065 and the hash function is a mod-5 function, H(x) = x mod 5, where x is the id. The operation then works like this:
H(1065) = 1065 mod 5 = 0.
This indicates that the data record should be placed or searched in bucket 0 (hash index 0) of the hash table, as checked in the snippet below.
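The calculation from the example can be verified directly; the function name H is just shorthand for the mod-5 hash described above.

```python
def H(x: int) -> int:
    return x % 5          # mod-5 hash function

print(H(1065))            # prints 0 -> the record goes to bucket 0
```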
Static Hashing has the following properties:
Data Buckets: The number of buckets in memory remains constant. The size of the hash table is decided initially; it may also implement chaining to handle some collision issues.
Hash function: A simple hash function maps each data record to its appropriate bucket. It is generally a modulo hash function.
Efficient for known data size: Static hashing is very efficient when the data size and its distribution in the database are known in advance.
Searching a Record
o When a record needs to be searched, the same hash function is used to retrieve the address of the bucket where the data is stored.
Inserting a Record
o When a new record is inserted into the table, an address is generated for it from its hash key, and the record is stored at that location.
Deleting a Record
o To delete a record, we first fetch the record that is to be deleted and then remove it from that address in memory.
Updating a Record
o To update a record, we first search for it using the hash function and then update the data record in place (a combined sketch of these operations follows).
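Below is a combined Python sketch of these four operations on a static hash table. For simplicity it assumes at most one record per bucket and no collision handling; the class name and the sample employee record are illustrative assumptions.

```python
class StaticHashTable:
    """Toy static hash table: fixed bucket count, one record per bucket."""

    def __init__(self, num_buckets: int = 5):
        self.buckets = [None] * num_buckets

    def _index(self, key: int) -> int:
        return key % len(self.buckets)

    def search(self, key: int):
        # The same hash function gives the bucket address to look in.
        return self.buckets[self._index(key)]

    def insert(self, key: int, record) -> None:
        # The address is generated from the hash key; the record is stored there.
        self.buckets[self._index(key)] = record

    def delete(self, key: int) -> None:
        # Fetch the bucket for the key, then clear it.
        self.buckets[self._index(key)] = None

    def update(self, key: int, record) -> None:
        # Search via the hash function, then overwrite the stored record.
        if self.buckets[self._index(key)] is not None:
            self.buckets[self._index(key)] = record

t = StaticHashTable()
t.insert(1065, {"employee_id": 1065, "name": "B. Iyer"})
print(t.search(1065))                       # found in bucket 1065 % 5 = 0
t.update(1065, {"employee_id": 1065, "name": "B. K. Iyer"})
t.delete(1065)
```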
Static hashing is inefficient and inaccurate when the data size varies dynamically, because the space is limited and the hash function always generates the same value for every specific input. When the data size fluctuates frequently it is not useful at all, because collisions keep happening and lead to problems such as bucket skew and insufficient buckets.
To resolve this problem of bucket overflow, techniques such as chaining and open addressing are used. Here is a brief overview of these techniques:
1. Chaining/Open Hashing/Closed Addressing
The method used to handle collisions in open hashing is called separate chaining. In separate chaining, if two or more keys have the same hash value, the next element is stored in a new node linked to the previous element.
For example, suppose R3 is a new record that needs to be inserted into the table, and the hash function generates the address 110 for it. That bucket is already full and cannot store the new data, so a new bucket is added to the chain of bucket 110, linked to it, and R3 is stored there.
2. Dynamic Hashing
o Dynamic hashing, also known as extendible hashing, is used to handle databases whose data sets change frequently.
o This method offers a way to add and remove data buckets on demand. As the number of data records varies, the number of buckets also grows and shrinks whenever a change is made (a simplified sketch follows).
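Below is a simplified extendible-hashing sketch showing how the directory (bucket address table) doubles and a full bucket splits as records are added. Real systems store buckets as disk blocks and are considerably more involved; the class names, the bucket capacity of 2, and the use of the low-order bits of Python's hash() are assumptions made for this example.

```python
class Bucket:
    def __init__(self, local_depth: int):
        self.local_depth = local_depth
        self.records = {}                          # key -> record

class ExtendibleHashTable:
    BUCKET_CAPACITY = 2                            # tiny capacity, to force splits

    def __init__(self):
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]    # the bucket address table

    def _index(self, key) -> int:
        # Use the low global_depth bits of the hash value as the directory index.
        return hash(key) & ((1 << self.global_depth) - 1)

    def insert(self, key, record) -> None:
        bucket = self.directory[self._index(key)]
        if key in bucket.records or len(bucket.records) < self.BUCKET_CAPACITY:
            bucket.records[key] = record
            return
        self._split(bucket)
        self.insert(key, record)                   # retry after the split

    def _split(self, bucket: "Bucket") -> None:
        if bucket.local_depth == self.global_depth:
            # Directory cannot tell the two halves apart yet: double it.
            self.directory += self.directory
            self.global_depth += 1
        bucket.local_depth += 1
        new_bucket = Bucket(bucket.local_depth)
        # Directory entries whose new distinguishing bit is 1 now point to the new bucket.
        high_bit = 1 << (bucket.local_depth - 1)
        for i, b in enumerate(self.directory):
            if b is bucket and i & high_bit:
                self.directory[i] = new_bucket
        # Redistribute the old bucket's records between the two buckets.
        old_records, bucket.records = bucket.records, {}
        for k, v in old_records.items():
            self.directory[self._index(k)].records[k] = v

table = ExtendibleHashTable()
for employee_id in (1, 2, 3, 4, 5, 6):
    table.insert(employee_id, {"employee_id": employee_id})
print(table.global_depth)                          # directory has doubled as needed
```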
In dynamic hashing, the bucket address table grows or shrinks as the amount of data increases or decreases. Maintenance of the bucket address table becomes difficult when there is a significant increase in data, and bucket overflow can still happen.
Load Factor
The load factor measures how full the hash table is: the number of records stored divided by the total number of buckets. When the load factor grows too large, lookups slow down and rehashing is needed.
Rehashing
Rehashing means increasing the size of the hash table and redistributing its elements. It is a very costly operation (see the sketch below) because:
o We have to make a new hash table of a bigger size.
o We have to compute a new hash value (index) for each element and insert it into the new hash table.
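The cost of rehashing can be seen in a small sketch that grows the table whenever the load factor passes a threshold; the threshold of 0.75 and the doubling policy are assumptions for the example, not fixed rules.

```python
class RehashingTable:
    MAX_LOAD_FACTOR = 0.75                         # assumed threshold for this example

    def __init__(self, num_buckets: int = 4):
        self.buckets = [[] for _ in range(num_buckets)]
        self.count = 0                             # number of stored records

    def load_factor(self) -> float:
        # Load factor = records stored / number of buckets.
        return self.count / len(self.buckets)

    def insert(self, key: int, record) -> None:
        self.buckets[key % len(self.buckets)].append((key, record))
        self.count += 1
        if self.load_factor() > self.MAX_LOAD_FACTOR:
            self._rehash()

    def _rehash(self) -> None:
        # Costly step 1: make a new hash table of a bigger size (here: double it).
        old_buckets = self.buckets
        self.buckets = [[] for _ in range(2 * len(old_buckets))]
        # Costly step 2: recompute the index of every element and reinsert it.
        for bucket in old_buckets:
            for key, record in bucket:
                self.buckets[key % len(self.buckets)].append((key, record))

t = RehashingTable()
for key in range(10):
    t.insert(key, f"record-{key}")
print(len(t.buckets), t.load_factor())             # table has grown; load factor is back down
```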