Hashing
Hashing
HASHING: Hashing is the process of transforming any key or a string into another fixed
size value usually of smaller size using a hash function.
HASH TABLE REPRESENTATION
In hashing the dictionary pairs are stored in a table, ht, called the hash table. The hash table is
partitioned into b buckets, ht[0],…,ht[b-1]. The address or location of a pair is determined by
a hash function, h, which maps keys into buckets.
Hash table is a data structure used for storing and retrieving data very quickly. Insertion
of data in the hash table is based on the key value. Hence every entry in the hash table
is associated with some key.
Using the hash key the required piece of data can be searched in the hash table by few
or more key comparisons.
HASH FUNCTION
Hash function is a function which is used to store the data into the hash table. Hence
one can use the same hash function to retrieve the data from the hash table. The integer
returned by the hash function is called hash key.
Static Hashing: Refers to the hashing process where the size of the hash table is fixed.
2. Division:
In this method, a simple hash function is obtained by using the modulus (%)
operator.
In this scheme, we divide the identifier x by some number M and use the
remainder as the hash address for x.
The hash function is:
f(x)= x % M
This gives bucket addresses that range from 0 to M - 1, where M = the table
size.
3. Folding:
In this method, the identifier ‘X’ is partitioned into several parts. All parts,
except for the last one have the same length.
We then add the parts together to obtain the hash address for X.
4. Digit Analysis:
The last method we will examine, digit analysis, is used with static files. A
static file is one in which all the identifiers are known in advance.
We first transform the identifiers into numbers using some radix, r. Then
examine the digits of each identifier. Some digits having most skewed
distributions are deleted.
This deleting of digits is continued until the number of remaining digits is small
enough to give an address in the range of the hash table. Then these digits are
used to calculate the hash address.
COLLISSION
The situation in which the hash function returns the same hash key (home bucket) for
more than one record is called collision and two same hash keys returned for different records
is called synonym.
COLLISION RESOLUTION TECHNIQUES
If collision occurs then it should be handled by applying some techniques. Such a
technique is called collision resolution technique.
There are two methods for detecting collisions and overflows in a static hash table; each
method using a different data structure to represent the hash table. Linear Probing and
Chaining.
1. Open addressing (linear probing)
2. Chaining
3. Rehashing
1. Linear Probing:
When linear open addressing is used , the hash table is represented as a one dimensional
array with indices that range from 0 to the desired table size - 1.
The component type of the array is a struct that contains at least a key field. The C
declarations creating the hash table ht with one slot per bucket are:
Before inserting any elements into this table, the table must be initialized to represent
the situation where all slots are empty. This allows to detect overflows and collisions.
If the slot at the hash address is empty, we simply place the new element into this slot.
However, if the new element is hashed into a full bucket, we must find another bucket
for it. The simplest solution places the new element in the closest unfilled bucket.
DYNAMIC HASHING:
• The dynamic hashing method is used to overcome the problems of static hashing like
bucket overflow.
• In this method, data buckets grow or shrink as the records increases or decreases. This
method is also known as Extendable hashing method.
• One of the most important classes of software is the database management system or
DBMS.
• In a DBMS the user enters a query using some language (possibly SQL) and the system
translates it and retrieves the resulting data. Fast access time is essential since a DBMS
is typically used to hold large sets of information.
• Another key characteristic of a DBMS is that the amount of information can vary a
great deal over time.
• Dynamic hashing retains the fast retrieval time of conventional hashing, while
extending the technique so that it can accommodate dynamically increasing and
decreasing file size without penalty.
• We assume that a file, F, is a collection of records, R. Each record has a key field, K,
by which it is identified. Records are stored in buckets, or pages as they are called in
dynamic hashing, whose capacity is p.
i. Dynamic Hashing Using Directories
Consider an example where an identifier consists of two characters and each
character is represented by 3 bits.
The identifiers are to be placed into a table that has four pages. Each page can hold no
more than two identifiers, and the pages are indexed by the 2 bit sequence 00, 01, 10, 11,
respectively.
Depth refers to the number of bits used as the index.
ii. Directoryless Dynamic Hashing:
Assuming that there exists a contiguous address space which is large enough
to hold all the records, we can eliminate the directory.
In effect, this leaves it to the operating system to break the address space
into pages, and to manage moving them into and out of memory. This
scheme is referred to as directoryless hashing or linear hashing.
Priority Queues
Definition:
A priority queue is a collection of zero or more elements. Each element has a
priority or value.
Unlike the queues, which are FIFO structures, the order of deleting from a priority
queue is determined by the element priority.
Elements are removed/deleted either in increasing or decreasing order of priority
rather than in the order in which they arrived in the queue.