Tries and Huffman Encoding
Tries and Huffman Encoding
To search a word in the hashmap, we again have to calculate the hashcode of the
string to be searched, and for that also, it would require O(string_length) time.
Similarly, for removing a word from the hashmap, we would have to calculate the
hashcode for that string. It would again require O(string_length) time.
For the same purpose, we can use another data structure known as tries.
1
Here, the node at the top named as the start is the root node of our n
-ary tree.
Suppose we want to insert the word ARE i n the trie. We will begin from the root,
search for the first word A and then insert R
as its child and similarly insert the
letter E
. You can see it as the left-most path in the above trie.
Moving on to searching a word in the trie, for that also, we would have to traverse
the complete length of the string, which would take O(word_length) time.
2
Let us take an example. Suppose we want to search for the word DOT i n the above
trie, we will first-of-all search for the letter D
as the direct child of the start node and
then search for O and T and, then we will return true as the word is present in the
trie.
Consider another example AR, which we want to search for in the given trie.
Following the above approach, we would return true as the given word is present in
it. However ideally, we should return false as the actual word was ARE and not A
R.
To overcome this, it can be observed that in the above trie, some of the letters are
marked with a bolder circle, and others are not. This boldness represents the
termination of a word starting from the root node. Hence, while searching for a
word, we will always be careful about the last letter that it should be bolded to
represent the termination.
Note: W
hile insertion in a trie, the last letter of the word should be bolded as a
mark of termination of a word.
In the above trie, the following are all possibilities that can occur:
● ARE, AS
● DO, DOT
● NEW, NEWS, NOT
● ZEN
Here, to remove the word, we will simply unbold the terminating letter as it will
make that word invalid. For example: If you want to remove the string D
O f rom the
above trie by preserving the occurrence of string DOT, then we will reach O
and
then unbolden it. This way the word D
O i s removed but at the same time, another
word DOT w
hich was on the same path as that of word DO was still preserved in the
trie structure.
3
For removal of a word from trie, the time complexity is still O(word_length) as we
are traversing the complete length of the word to reach the last letter to unbold it.
It can be observed that using tries, we can improve the space complexity.
class TrieNode {
public :
char data; // To store data (character value: ‘A’ - ‘Z’)
TreeNode **children; // To store the address of each child
bool isTerminal; // it will be TRUE if the word terminates at this character
TrieNode(char data) { // Constructor for values initialization
this -> data = data;
children = new TrieNode*[26];
for(int i = 0; i < 26; i++) {
children[i] = NULL;
}
4
isTerminal = false;
}
};
Insert function
To insert a word in a trie, we will use recursion. Suppose we have the following trie:
Recursion says that we need to work on a smaller problem, and the rest it will
handle itself. So, we will do it on the root node.
Note: P
ractically, the functionality of bolding the terminal character is achieved
using the boolean i sTerminal v
ariable for that particular node. If this value is true
means that the node is the terminal value of the string, otherwise not.
Approach:
5
● Small Calculation: We will check if the root node has the first character of
the string as one of its children or not. If not, then we will create one.
● Recursive call: W
e will tell the recursion to insert the remaining string in the
subtree of our trie.
● Base Case: As the length of the string becomes zero, or in other words, we
reach the NULL character, then we need to mark the isTerminal f or the last
character as True.
class Trie {
TrieNode *root;
public :
Trie() {
root = new TrieNode('\0');
}
void insertWord(TrieNode *root, string word) {
// Base case
6
if(word.size() == 0) {
root -> isTerminal = true;
return;
}
// Small Calculation
int index = word[0] - 'a'; // As for ‘a’ refers to index 0, ‘b’ refers to
// index 2 and so on, so to reach the correct index we will do so
TrieNode *child;
if(root -> children[index] != NULL) { // If the first character of string is
// already present as the child node of the root node
child = root -> children[index];
}
else { / / If not present as the child then creating one.
child = new TrieNode(word[0]);
root -> children[index] = child;
}
// Recursive call
insertWord(child, word.substr(1));
}
// For user
void insertWord(string word) {
insertWord(root, word);
}
};
Search in tries
Objective: Create a search function which will get a string as its argument to be
searched in the trie and returns a boolean value. True if the string is present in the
trie and False if not.
Approach: We will be using the same process as that used during insertion. We will
call recursion over the root node after searching for the first character as its child. If
the first character is not present as one of the children of the root node, then we
will simply return false; otherwise, we will send the remaining string to the
recursion. When we reach the empty string, then check for the last character’s
7
isTerminal v
alue; if it is true, m
eans that word exists in our trie, and we will return
true from there otherwise, return false.
Try to code it yourselves, and for the code, refer to the solution tab of the same.
Approach: First-of-all we need to search for the word in the trie, and if found, then
we simply need to mark the i sTerminal value of the last character of the word to
false as that will simply denote that the word does not exist in our trie.
Check out the code below: (Nearly same as that of insertion, just a minor changes
which are explained along side)
8
Types of tries
● Compressed tries:
○ Majorly, used for space optimization.
○ We generally club the characters if they have at most one child.
○ General rule: Every node has at least two child nodes.
9
Figure - 1
● Pattern matching:
○ Used to match patterns in the trie.
10
Huffman Coding
11
Here, each of the characters of the string takes 8 bits of memory. Since there are a
total of 15 characters in the string so the total memory consumption will be 15*8 =
120 bits. Let’s try to compress its size using the Huffman Algorithm.
First-of-all, the Huffman Coding creates a tree by calculating the frequencies of each
character of the string and then assigns them some unique code so that we can
retrieve the data back using these codes.
1. Begin with calculating the frequency of each character value in the given
string.
12
2. Sort the characters in ascending order concerning their frequency and store
them in a priority queue, say Q
.
3. Each character should be considered as
a
different leaf node.
4. Make an empty node, say z . The left child of z marked as the minimum
frequency and right child, the second minimum frequency. The value of z is
calculated by summing up the first two frequencies.
13
5. Now, remove the two characters with the lowest frequencies from the
priority queue Q and append their sum to the same.
6. Simply insert the above node z to the tree.
7. For every character in the string, repeat steps 3 to 5.
14
8. Assign 0 to the left side and 1 to the right side except for the leaf nodes.
15
B 1 100 1*3 = 3
C 6 0 6*1 = 6
D 3 101 3*3 = 9
To decode the code, simply traverse through the tree (starting from the root)
to find the character. Suppose we want to decode 101, then:
Time complexity:
In case of encoding, inserting each character into the priority queue takes O(log n)
time. Therefore, for the complete array, the time complexity becomes O(nlog n).
Similarly, extraction of the element from the priority queue takes O(log n) time.
Hence, for the complete array, the achieved time complexity is O(nlog n).
16