NPTEL MOOC,JAN-FEB 2015
Week 6, Module 5
DESIGN AND ANALYSIS
OF ALGORITHMS
Greedy algorithms: Huffman codes
MADHAVAN MUKUND, CHENNAI MATHEMATICAL INSTITUTE
http://www.cmi.ac.in/~madhavan
Communication and
compression
Messages in English/Hindi/Tamil/… are
transmitted between computers in binary
Encode letters {a,b,…,z} as strings over {0,1}
26 letters, 25 = 32, use strings of length 5?
Can we optimize the amount of data to transfer?
Use shorter strings for more frequent letters?
Variable length encoding
Morse code
Encode letters using dots (0) and dashes (1)
Encoding of e is 0, t is 1, a is 01
Decode 0101 — etet, aa, eta, aet?
Use pauses between letters to distinguish
Like an extra symbol in encoding
Variable length encoding
Prefix code
Encoding E( ), E(x) is not a prefix of E(y) for any x,y
In Morse code E(e) = 0 is a prefix of E(a) = 01
x a b c d e
Example: {a,b,c,d,e}
E(x) 11 01 001 10 000
Decode 0010000011101
Variable length encoding
Prefix code
Encoding E( ), E(x) is not a prefix of E(y) for any x,y
In Morse code E(e) = 0 is a prefix of E(a) = 01
x a b c d e
Example: {a,b,c,d,e}
E(x) 11 01 001 10 000
Decode 0010000011101
c
Variable length encoding
Prefix code
Encoding E( ), E(x) is not a prefix of E(y) for any x,y
In Morse code E(e) = 0 is a prefix of E(a) = 01
x a b c d e
Example: {a,b,c,d,e}
E(x) 11 01 001 10 000
Decode 0010000011101
c e
Variable length encoding
Prefix code
Encoding E( ), E(x) is not a prefix of E(y) for any x,y
In Morse code E(e) = 0 is a prefix of E(a) = 01
x a b c d e
Example: {a,b,c,d,e}
E(x) 11 01 001 10 000
Decode 0010000011101
c e c
Variable length encoding
Prefix code
Encoding E( ), E(x) is not a prefix of E(y) for any x,y
In Morse code E(e) = 0 is a prefix of E(a) = 01
x a b c d e
Example: {a,b,c,d,e}
E(x) 11 01 001 10 000
Decode 0010000011101
c e c a
Variable length encoding
Prefix code
Encoding E( ), E(x) is not a prefix of E(y) for any x,y
In Morse code E(e) = 0 is a prefix of E(a) = 01
x a b c d e
Example: {a,b,c,d,e}
E(x) 11 01 001 10 000
Decode 0010000011101
c e c a b
Optimal prefix codes
Measure frequency f(x) of each letter x
Fraction of occurrences of x over large body of
text
A = {x1,x2,…,xn}, f(x1) + f(x2) + … + f(xn) = 1
f(x) is the “probability” that next letter is x
Optimal prefix codes …
Message M consists of n symbols
For each letter x, n ∙ f(x) occurrences of x in M
Each x is encoded by E(x) with length |E(x)|
Total length of encoded message:
Sum over all x, n ∙ f(x) ∙ |E(x)|
Average number of bits per letter
Sum over all x, f(x) ∙ |E(x)|
Optimal prefix codes ..
Suppose we have
x a b c d e
these frequencies
E(x) 11 01 001 10 000
for our example
f(x) 0.32 0.25 0.20 0.18 0.05
Average number of bits per letter is
0.32 ∙ 2 + 0.25 ∙ 2 + 0.20 ∙ 3 + 0.18 ∙ 2 + 0.05 ∙ 3
2.25
Fixed length encoding uses 3 bits per letter
25% saving using variable length code
Optimal prefix codes ..
A better encoding
x a b c d e
E(x) 11 10 01 001 000
f(x) 0.32 0.25 0.20 0.18 0.05
Average number of bits per letter is
0.32 ∙ 2 + 0.25 ∙ 2 + 0.20 ∙ 2 + 0.18 ∙ 3 + 0.05 ∙ 3
2.23
Given a set of letters A with frequencies, produce a prefix
code that is as efficient as possible
Minimize ABL(A) — Average Bits per Letter
Codes as trees
Encoding can be viewed x a b c d e
as a binary tree
E(x) 11 01 001 10 000
Path to a node is a binary
string—left is 0, right is 1
Label each node by the
letter it encodes
b d a
Prefix code: only leaves
encode letters e c
Codes as trees
Encoding can be viewed x a b c d e
as a binary tree
E(x) 11 10 01 001 000
Path to a node is a binary
string—left is 0, right is 1
Label each node by the
letter it encodes
c b a
Prefix code: only leaves
encode letters e d
Codes as trees …
Full tree: Every node has 0 or 2 children
Claim 1: Any optimal prefix code generates a full tree
If any node has only one child, we can promote its
child and create a shorter tree
Codes as trees …
Claim 2: In an optimal tree, if a leaf labelled x is at a
smaller depth than a leaf labelled y, then f(y) ≤ f(x)
If f(y) > f(x), exchange labels to get a better tree
Codes as trees …
Claim 3: In an optimal tree, if a leaf at maximum
depth is labelled x then its sibling is also a leaf.
If not, the sibling of this leaf has children
There is a leaf at a lower depth
But depth of the leaf labelled x was at maximum
depth
A recursive solution
From Claim 3, leaves at maximum depth occur in
pairs
From Claim 2, these must have lowest frequencies
Pick letters x and y such that f(x) and f(y) are
lowest
We will assign these to a pair of leaves at
maximum depth (left/right does not matter)
A recursive solution …
“Combine” x and y into a new letter “xy” with f(xy) =
f(x) + f(y)
New alphabet A’ is original A - {x,y} + {xy}
Recursively find an optimal encoding of A’
Base case, |A’| = 2, assign the two letters codes 0, 1
Replace the leaf labelled “xy” by a node with two
children labelled x and y
Huffman’s algorithm — Huffman coding
Huffman’s algorithm
x a b c d e
f(x) 0.32 0.25 0.20 0.18 0.05
Huffman’s algorithm
x a b c d e Combine
f(x) 0.32 0.25 0.20 0.18 0.05 d, e as “de”
x a b c de
f(x) 0.32 0.25 0.20 0.23
Huffman’s algorithm
x a b c d Combine
e
f(x) 0.32 0.25 0.20 0.18 0.05 d, e as “de”
x a b c de Combine
f(x) 0.32 0.25 0.20 0.23 c, de as “cde”
x a b cde
f(x) 0.32 0.25 0.43
Huffman’s algorithm
x a b c Combine
d e
f(x) 0.32 0.25 0.20 0.18 0.05 d, e as “de”
x a b c de Combine
f(x) 0.32 0.25 0.20 0.23 c, de as “cde”
x a b cde Combine
f(x) 0.32 0.25 0.43 a, b as “ab”
x ab cde
f(x) 0.57 0.43
Huffman’s algorithm
x a b c Combine
d e
f(x) 0.32 0.25 0.20 0.18 0.05 d, e as “de”
x a b c de Combine
f(x) 0.32 0.25 0.20 0.23 c, de as “cde”
x a b cde Combine
f(x) 0.32 0.25 0.43 a, b as “ab”
x ab cde Two letters,
f(x) 0.57 0.43 base case ab cde
Huffman’s algorithm
x a b c Combine
d e
f(x) 0.32 0.25 0.20 0.18 0.05 d, e as “de”
x a b c de Combine
f(x) 0.32 0.25 0.20 0.23 c, de as “cde”
x a b cde Split “ab”
f(x) 0.32 0.25 0.43 as a, b cde
x ab cde Two letters,
a b
f(x) 0.57 0.43 base case
Huffman’s algorithm
x a b c Combine
d e
f(x) 0.32 0.25 0.20 0.18 0.05 d, e as “de”
x a b c de Split “cde”
f(x) 0.32 0.25 0.20 0.23 as c, de
a b c de
x a b cde Split “ab”
f(x) 0.32 0.25 0.43 as a, b
x ab cde Two letters,
f(x) 0.57 0.43 base case
Huffman’s algorithm
x a b c Split “de”
d e
f(x) 0.32 0.25 0.20 0.18 0.05 as d, e
x a b c de a b c
Split “cde”
f(x) 0.32 0.25 0.20 0.23 as c, de
d e
x a b cde Split “ab”
f(x) 0.32 0.25 0.43 as a, b
x ab cde Two letters,
f(x) 0.57 0.43 base case
Optimality
By induction on the size of the alphabet A
For |A| = 2, base case, clearly the code that uses
{0,1} for the two letters is optimal
Assuming our algorithm is optimal for |A| = k-1, we
have to show it is also optimal for |A| = k
Optimality
Combine lowest frequency x, y into xy
Construct a tree T’ for this alphabet
ABL(T’) optimal by induction
Expand xy into x,y to get T from T’
Claim: ABL(T) - ABL(T’) = f(xy)
Optimality
Claim: ABL(T) - ABL(T’) = f(xy)
From T’ to T, only xy, x, y change contribution to
ABL
Subtract depth(xy)f(xy), add (1+depth(xy))(f(x) + f(y))
f(xy) = f(x)+f(y), so
depth(xy)f(xy) = depth(xy)(f(x) + f(y))
Hence ABL(T) is bigger than ABL(T’) by
f(x)+f(y) = f(xy)
Optimality
Suppose there is another tree S with ABL(S) < ABL(T)
Can shuffle labels of max depth leaves in S, so that
lowest frequency pair x and y label siblings
Merge, x and y into xy and contract S to S’
S’ is over same alphabet as T’, T’ is optimal by
induction, so ABL(T’) ≤ ABL(S’)
ABL(S) - ABL(S’) = ABL(T) - ABL(T’) = f(xy),
so ABL(T) ≤ ABL(S) as well, contradiction!
Implementation, complexity
At each recursive step, extract letters with
minimum frequency and replace by composite
letter with combined frequency
Store frequencies in an array
Linear scan to find minimum values
|A| = k, number of recursive calls is k - 1
Complexity is O(k2)
Implementation, complexity
At each recursive step, extract letters with
minimum frequency and replace by composite
letter with combined frequency
Instead, maintain frequencies in a heap
O(log k) to find minimum values and insert new
combined letter
Complexity drops to O(k log k)
Why is Huffman coding
greedy?
We recursively combine letters with two lowest
frequencies
This is a locally optimal choice
We never go back and consider other ways of
pairing up letters
Historical note
Shannon and Fano tried a divide and conquer
approach
Partition A as A1, A2
Sum of frequencies in A1, A2 roughly equal
Solve each partition recursively
Shannon-Fano codes are not optimal
Huffman heard about this problem in a class by Fano
and later found an optimal solution