Topic 1: Hashing - Introduction: Hashing Is A Method of Storing and Retrieving Data From A Database Efficiently
Topic 1: Hashing - Introduction: Hashing Is A Method of Storing and Retrieving Data From A Database Efficiently
Suppose we want to design a system for storing employee records keyed using
phone numbers. And we want following queries to be performed efficiently:
We can think of using the following data structures to maintain information about
different phone numbers.
For arrays and linked lists, we need to search in a linear fashion, which can be
costly in practice. If we use arrays and keep the data sorted, then a phone number
can be searched in O(Logn) time using Binary Search, but insert and delete
operations become costly as we have to maintain sorted order.
With a balanced binary search tree, we get moderate search, insert and delete
times. All of these operations can be guaranteed to be in O(Logn) time.
Another solution that one can think of is to use a direct access table where we
make a big array and use phone numbers as indexes in the array. An entry in an
array is NIL if phone number is not present, else the array entry stores a pointer to
records corresponding to phone number. Time complexity wise this solution is the
best among all, we can do all operations in O(1) time. For example, to insert a phone
number, we create a record with details of given phone number, use phone number
as index and store the pointer to the created record in the table.
This solution has many practical limitations. First problem with this solution is the
extra space required is huge. For example if phone number is n digits, we need O(m
* 10n) space for the table where m is the size of a pointer to record. Another problem
is an integer in a programming language may not store n digits.
Due to above limitations Direct Access Table cannot always be used. Hashing is the
solution that can be used in almost all such situations and performs extremely well
compared to above data structures like Array, Linked List, Balanced BST in practice.
With hashing we get O(1) search time on average (under reasonable assumptions)
and O(n) in the worst case.
Hashing is an improvement over Direct Access Table. The idea is to use a
hash function that converts a given phone number or any other key to a
smaller number and uses the small number as index in a table called hash
table.
Hash Function: A function that converts a given big phone number to a small
practical integer value. The mapped integer value is used as an index in the hash
table. In simple terms, a hash function maps a big number or string to a small integer
that can be used as an index in a hash table.
A good hash function should have following properties:
1. Efficiently computable.
2. Should uniformly distribute the keys (Each table position equally likely for
each key)
For example for phone numbers a bad hash function is to take the first three digits. A
better function is to consider the last three digits. Please note that this may not be
the best hash function. There may be better ways.
Hash Table: An array that stores pointers to records corresponding to a given phone
number. An entry in a hash table is NIL if no existing phone number has a hash
function value equal to the index for the entry.
Collision Handling: Since a hash function gets us a small number for a big key,
there is a possibility that two keys result in the same value. The situation where a
newly inserted key maps to an already occupied slot in a hash table is called
collision and must be handled using some collision handling technique. Following are
the ways to handle collisions:
● Chaining:The idea is to make each cell of the hash table point to a linked list
of records that have the same hash function value. Chaining is simple, but
requires additional memory outside the table.
● Open Addressing: In open addressing, all elements are stored in the hash
table itself. Each table entry contains either a record or NIL. When searching
for an element, we one by one examine table slots until the desired element is
found or it is clear that the element is not in the table.
Topic 2 : Open Addressing
Important Operations:
● Insert(k): Keep probing until an empty slot is found. Once an empty slot is
found, insert k.
● Search(k): Keep probing until the slot's key doesn't become equal to k or an
empty slot is reached.
● Delete(k): Delete operation is interesting. If we simply delete a key, then
search may fail. So slots of deleted keys are marked specially as "deleted".
Insert can insert an item in a deleted slot, but the search doesn't stop at a deleted
slot.
1. Linear Probing: In linear probing, we linearly probe for the next slot. For
example, the typical gap between two probes is 1 as taken in below example
also.
let hash(x) be the slot index computed using hash function and S be the table
size
Let us consider a simple hash function as “key mod 7” and sequence of keys
as 50, 700, 76, 85, 92, 73, 101.
Clustering: The main problem with linear probing is clustering, many
consecutive elements form groups and it starts taking time to find a free slot
or to search an element.
3. Double Hashing We use another hash function hash2(x) and look for
i*hash2(x) slot in i'th rotation.
● Linear probing has the best cache performance but suffers from clustering.
One more advantage of Linear probing is easy to compute.
● Quadratic probing lies between the two in terms of cache performance and
clustering.
● Double hashing has poor cache performance but no clustering. Double
hashing requires more computation time as two hash functions need to be
computed.
S.No
Separate Chaining Open Addressing
.
Open Addressing requires more
1. Chaining is Simpler to implement.
computation.
In chaining, the Hash table never fills
In open addressing, the table may
2. up, we can always add more
become full.
elements to the chain.
Open addressing requires extra
Chaining is Less sensitive to the hash
3. care to avoid clustering and load
function or load factors.
factor.
Chaining is mostly used when it is
Open addressing is used when
unknown how many and how
4. the frequency and number of keys
frequently keys may be inserted or
is known.
deleted.
Cache performance of chaining is not Open addressing provides better
5. good as keys are stored using linked cache performance as everything
lists. is stored in the same table.
Wastage of Space (Some Parts of the In Open addressing, a slot can be
6. hash table in chaining are never used even if an input doesn't map
used). to it.
7. Chaining uses extra space for links. No links in Open addressing
Hashing in C++ can be implemented using different containers present in STL as per
the requirement. Usually, STL offers the below-mentioned containers for
implementing hashing:
● set
● unordered_set
● map
● unordered_map
set
Sets are a type of associative container in which each element has to be unique,
because the value of the element identifies it. The value of the element cannot be
modified once it is added to the set, though it is possible to remove and add the
modified value of that element.
Sets are used in the situation where it is needed to check if an element is present in
a list or not. It can also be done with the help of arrays, but it would take up a lot of
space. Sets can also be used to solve many problems related to sorting as the
elements in the set are arranged in a sorted order.
Implementation:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
// C++ program to illustrate hashing using
// set in CPP STL
#include <iostream>
#include <set>
#include <iterator>
using namespace std;
int main()
{
// empty set container
set <int> s;
// List of elements
int arr[] = {40, 20, 60, 30, 50, 50, 10};
50 is present
unordered_set
The unordered_set container is implemented using a hash table where keys are
hashed into indices of this hash table so it is not possible to maintain any order. All
operation on an unordered_set takes constant time O(1) on an average which can
go up to linear time in the worst case which depends on the internally used hash
function but practically they perform very well and generally provide constant time
search operation.
The unordered-set can contain keys of any type – predefined or user-defined data
structure but when we define a key of a user-defined type, we need to specify our
comparison function according to which keys will be compared.
set vs unordered_set
Note: Like set containers, the Unordered_set also allows only unique keys.
Implementation:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
// C++ program to illustrate hashing using
// unordered_set in CPP STL
#include <iostream>
#include <unordered_set>
#include <iterator>
using namespace std;
int main()
{
// empty set container
unordered_set <int> s;
// List of elements
int arr[] = {40, 20, 60, 30, 50, 50, 10};
50 is present
Map container
As a set, the Map container is also associative and stores elements in an ordered
way but Maps store elements in a mapped fashion. Each element has a key value
and a mapped value. No two mapped values can have the same key values.
Implementation:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
// C++ program to illustrate Map container
#include <iostream>
#include <iterator>
#include <map>
using namespace std;
int main()
{
// empty map container
map<int, int> mp;
// Insert elements in random order
// First element of the pair is the key
// second element of the pair is the value
mp.insert(pair<int, int>(1, 40));
mp.insert(pair<int, int>(2, 30));
mp.insert(pair<int, int>(3, 60));
mp.insert(pair<int, int>(4, 20));
mp.insert(pair<int, int>(5, 50));
mp.insert(pair<int, int>(6, 50));
mp.insert(pair<int, int>(7, 10));
// printing map mp
map<int, int>::iterator itr;
cout << "The map mp is : \n";
cout << "KEY\tELEMENT\n";
for (itr = mp.begin(); itr != mp.end(); ++itr) {
cout << itr->first
Run
Output:
The map mp is :
KEY ELEMENT
1 40
2 30
3 60
4 20
5 50
6 50
7 10
unordered_map Container
The unordered_map is an associated container that stores elements formed by a
combination of key value and a mapped value. The key value is used to uniquely
identify the element and mapped value is the content associated with the key. Both
key and value can be of any type predefined or user-defined.
Internally unordered_map is implemented using Hash Table, the key provided to
map are hashed into indices of a hash table that is why the performance of data
structure depends on hash function a lot but on an average, the cost of search, insert
and delete from hash table is O(1).
Implementation:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
// C++ program to demonstrate functionality of unordered_map
#include <iostream>
#include <unordered_map>
using namespace std;
int main()
{
// Declaring umap to be of <string, int> type
// key will be of string type and mapped value will
// be of double type
unordered_map<string, int> umap;
// inserting values
umap.insert({"GeeksforGee", 10});
umap.insert({"Practice", 20});
umap.insert({"Contribute", 30});
// Traversing an unordered map
cout << "The map umap is : \n";
cout << "KEY\t\tELEMENT\n";
for (auto itr = umap.begin(); itr!= umap.end(); itr++)
cout << itr->first
<< '\t' << itr->second << '\n';
return 0;
}
Run
Output:
unordered_map vs unordered_set
In an unordered_set, we have only keys, no value, these are mainly used to see
presence/absence in a set. For example, consider the problem of counting
frequencies of individual words. We can’t use unordered_set (or set) as we can’t
store counts.
unordered_map vs map
map (like set) is an ordered sequence of unique keys whereas in the unordered_map
key can be stored in any order, so unordered.
A map is implemented as a balanced tree structure that is why it is possible to
maintain order between the elements (by specific tree traversal). The time
complexity of map operations is O(Log n) while for an unordered_set, it is O(1) on
average.
Topic 4 : Implementing Hashing in Java
Java provides many built-in classes and interfaces to implement hashing easily. That
is, without creating any HashTable or HashFunction. Java mainly provides us with
the following classes to implement Hashing:
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
import java.util.*;
class GFG {
Hashtable<Integer, String>
hm.put(1, "Geeks");
hm.put(12, "forGeeks");
hm.put(3, "Portal");
// Printing the Hashtable
System.out.println(hm);
Run
Output:
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
class GFG {
Integer c = hmap.get(arr[i]);
if (hmap.get(arr[i]) == null) {
hmap.put(arr[i], 1);
else {
hmap.put(arr[i], ++c);
Run
Output:
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
import java.util.*;
lhm.put("one", "practice.geeksforgeeks.org");
lhm.put("two", "code.geeksforgeeks.org");
lhm.put("four", "quiz.geeksforgeeks.org");
// It prints the elements in same order
System.out.println(lhm);
+ lhm.get("one"));
lhm.containsKey("two"));
".geeksforgeeks.org"));
lhm.remove("one"));
System.out.println(lhm);
Run
Output:
{one=practice.geeksforgeeks.org, two=code.geeksforgeeks.org,
four=quiz.geeksforgeeks.org}
Getting value for key 'one': practice.geeksforgeeks.org
Size of the map: 3
Is map empty? false
Contains key 'two'? true
Contains value 'practice.geeksforgeeks.org'? true
delete element 'one': practice.geeksforgeeks.org
{two=code.geeksforgeeks.org, four=quiz.geeksforgeeks.org}
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
import java.util.*;
class Test {
h.add("India");
h.add("Australia");
h.add("South Africa");
System.out.println(h);
+ h.contains("India"));
h.remove("Australia");
Iterator<String> i = h.iterator();
Run
Output:
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
import java.util.LinkedHashSet;
LinkedHashSet<String> linkedset =
new LinkedHashSet<String>();
linkedset.add("A");
linkedset.add("B");
linkedset.add("C");
linkedset.add("D");
// This will not add new element as A already
exists
linkedset.add("A");
linkedset.add("E");
linkedset.size());
System.out.println("Original LinkedHashSet:" +
linkedset);
linkedset.remove("D"));
"present: " +
linkedset.remove("Z"));
System.out.println("Checking if A is present=" +
linkedset.contains("A"));
Run
Output:
Size of LinkedHashSet = 5
Original LinkedHashSet:[A, B, C, D, E]
Removing D from LinkedHashSet: true
Trying to Remove Z which is not present: false
Checking if A is present=true
Updated LinkedHashSet: [A, B, C, E]
6. TreeSet (Implements the SortedSet interface, Objects are stored in a
sorted and ascending order):
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
import java.util.*;
class TreeSetDemo {
ts1.add("A");
ts1.add("B");
ts1.add("C");
// Duplicates will not get insert
ts1.add("C");
// Sorting Order(Ascending)
+ ts1.contains("A"));
ts1.remove("A");
Run
Output:
TreeSet: [A, B, C]