
Compression Algorithms

Lecture 22
Shafi Goldwasser
Representing Data
Course so far: our goal was to represent
data so that it is easiest to manipulate
– Search, edit: add, delete, cut, paste.
• However, there are actually competing interests:
– easily manipulated/processed
– short for storage and transmission
• Today our goal: represent data in the
shortest possible way
• Look for algorithms to achieve this
Algorithms: Data Compression
[Diagram: Document D → Compression Algorithm → Compressed document;
Compressed document → De-Compression Algorithm → Document D']

Today: Lossless Compression: D = D'

Lossy Compression: D' close enough to D (application dependent)
Data Compression: Lossy or
Lossless

• Lossless:
– Huffman Coding
-- Lempel-Ziv
– .gif
• Lossy:
– .mp3
– .jpg
Non Adaptive and Adaptive
Algorithms

• Non Adaptive & Adaptive Algorithms


– Non-Adaptive: assume we know something
about the data source, say the frequency
of different characters
– Adaptive: do not know anything in advance; the algorithm has to build up the knowledge by itself.

– TODAY: an example of each, starting with non-adaptive
Set Up

• Input: a sequence of characters


– e.g. English language characters a,b,c…
– Text: I am a human being

• Look for Binary codes: encode each


character as a Binary string, also called a
code word

• Output: a concatenation of code words


corresponding to the input characters
Example of fixed length code

• Fixed length code: each character is


assigned a code word of the same length
• Suppose we have a file of 100k characters
and each character is one of 8 letters
[a…h]
• Need only 3 bits to represent each
character: a:000, b: 001, c:010,…, h:111
• To represent the File requires: 300k bits
• Easy to encode and decode: e.g. baba ≡
001000001000
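As a small illustration (a Python sketch, not part of the lecture), such a 3-bit fixed-length code is easy to encode and decode:

# Hypothetical sketch: a 3-bit fixed-length code for the letters a..h.
ALPHABET = "abcdefgh"
CODE = {ch: format(i, "03b") for i, ch in enumerate(ALPHABET)}  # a:000, b:001, ..., h:111

def encode(text):
    return "".join(CODE[ch] for ch in text)

def decode(bits):
    # Every codeword is exactly 3 bits, so just cut the stream into 3-bit chunks.
    return "".join(ALPHABET[int(bits[i:i + 3], 2)] for i in range(0, len(bits), 3))

assert encode("baba") == "001000001000"
assert decode("001000001000") == "baba"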
Variable Length Code
Variable Length: Codewords for
different characters may have
different length

Say that you knew something about the


document to be compressed: the
frequency with which each character
appears.

Idea: use fewer bits for code words


corresponding to highly frequent
characters & more bits to encode
characters which are rare
Example of variable length code

              a     b     c     d     e     f
• Frequency:  45k   13k   12k   16k   9k    5k
• Codeword:   0     101   100   111   1101  1100
• Variable length code size for this example:
  45k*1 + 13k*3 + 12k*3 + 16k*3 + 9k*4 + 5k*4 = 224k,
  a 25% savings on the 300k of the fixed length code

• But how do we decode?
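Before answering that, here is a quick Python sketch (not from the slides) that recomputes both totals for the example above:

# Frequencies are in thousands of occurrences (the file has 100k characters).
freq = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
code = {"a": "0", "b": "101", "c": "100", "d": "111", "e": "1101", "f": "1100"}

fixed_bits = 3 * sum(freq.values())                      # 3 bits per character
var_bits = sum(freq[ch] * len(code[ch]) for ch in freq)  # B(C) = sum of f(ai) * L(c(ai))

print(fixed_bits, "k bits vs", var_bits, "k bits")       # 300 k vs 224 k, a 25% saving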


Prefix (Free) Coding
• Codes where no codeword is a prefix of
some other codeword.
• Ex: 0 101 100 111 1101 1100
• Decoding (de-compressing) back to the original file is now easy: scan from left to right; as soon as you recognize a code word, peel it off and continue decoding (a small sketch follows this slide)
• e.g. 01101100 = 0 | 1101 | 1100 ≡ a e f; 0101011110011011101 ≡ a b a d c e e
• Claim: optimal data compression can be
achieved by a prefix-free code
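A minimal Python sketch of that left-to-right decoding, assuming the example code above (this is an illustration, not code from the lecture):

# Greedy left-to-right decoding of a prefix-free code: since no codeword is a
# prefix of another, the first codeword we recognize is the only possible one.
code = {"a": "0", "b": "101", "c": "100", "d": "111", "e": "1101", "f": "1100"}
decode_table = {w: ch for ch, w in code.items()}

def decode(bits):
    out, current = [], ""
    for b in bits:
        current += b
        if current in decode_table:      # recognized a codeword: peel it off
            out.append(decode_table[current])
            current = ""
    assert current == "", "leftover bits: not a valid encoding"
    return "".join(out)

assert decode("01101100") == "aef"       # 0 | 1101 | 1100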
Binary Tree Representation of
Prefix Codes: Easy Decoding

A labeled binary tree where

• each left edge is labeled 0
• each right edge is labeled 1
• each leaf is labeled with a character
• To decode a string: follow edges from the root until reaching a leaf, output that leaf's character, and start again at the root
• Prefix-free implies all characters sit at leaves
Examples of Specific Codes as
Binary Trees

• Fixed Length Codes

• Variable Length Codes


Optimal Coding
Problem: Given alphabet {a1…an} & frequency
counts {f(a1),…f(an)}, find the Binary prefix
code that minimizes the number of bits to
represent your data.

Namely, find codeword c(ai) ∀ ai which


minimizes B(C) = Σ f(ai) L(c(ai)) where
L(c(ai)) is the length of the code word c(ai)
Equivalent Optimization Problem
on Trees
Codes: find codeword c(ai) for each ai that
minimizes B(C) = Σ f(ai) L(c(ai)) where
L(c(ai)) is length of the code word c(ai)

Trees: Let d(ai) be depth of leaf ai in tree T


Find tree T with n leaves where each leaf has f(ai)
associated with it so that it has the minimum
weighted path B(T) = Σ f(ai) d(ai)

The equivalence is obvious since d(ai) = L(c(ai))


Search for Optimal Coding

• Shannon: invented information theory


• Fano and Shannon worked together to find minimal-size codes, and found good heuristics
• Fano assigned the problem to his class
• Huffman solved it: not knowing that his
teacher had struggled with it !!!!

• A lesson to us all.
Huffman Code
Huffman invented an algorithm to construct an
optimal prefix code/tree: Huffman Code.

Greedy Algorithm Idea


Step 1: Pick two letters x,y from alphabet A
with the smallest frequencies and create a
sub-tree that has these two characters as
leaves
Step 2: Set frequency f(z) = f(x) + f(y). Remove x, y and add z, creating a new alphabet A = A ∪ {z} − {x, y}. Obviously |A| is now smaller
Repeat until only one symbol remains in A (it becomes the root)
Example of the Algorithm

1. Take the characters and their


frequencies, and sort this list by
increasing frequency

E: 10, T: 7, O: 5, A: 3 →
A: 3, O: 5, T: 7, E: 10
Huffman Codes
2. All the characters are vertices of the tree:

A: 3 O: 5 T: 7 E: 10
Huffman codes
3. Take the smallest (first) 2 vertices from the list
and make them children of a new vertex
having the sum of their frequencies
New vertex Z with frequency 3 + 5 = 8; its children are A: 3 and O: 5
Huffman Codes

4. Insert the new vertex into the sorted list of


vertices waiting to be put into the tree

List of remaining vertices: T: 7, E: 10

New list, with the new vertex inserted: T: 7, Z: 8, E: 10
Huffman Codes
5. Take the first 2 vertices from the list and make
them children of a new vertex having the sum of
their frequencies
New vertex Z2 with frequency 7 + 8 = 15; its children are T: 7 and Z: 8
(Z's children are A: 3 and O: 5)
Huffman Codes

6. Insert the new vertex into the sorted list of


vertices waiting to be put into the tree

List of remaining vertices: E: 10

New list, with the new vertex inserted: E: 10, Z2: 15
Huffman Codes
7. Take the first 2 vertices from the list and make them children of a new vertex having the sum of their frequencies

New vertex Z3 with frequency 10 + 15 = 25; its children are E: 10 and Z2: 15
(Z2's children are T: 7 and Z: 8; Z's children are A: 3 and O: 5)
Huffman Codes
• Label every left branch 0 and every right branch 1
• Reading the edge labels from the root to a leaf gives that character's codeword

Huffman code:
E: 0
T: 10
A: 110
O: 111
Implementation Details,
Complexity
• Start with a forest of 1-node trees, one for each a in A.
• Merge them greedily, using a priority queue keyed by frequency (smallest first)
• Pseudo code: section 16.3 CLR

• Runtime: Let n be the size of the alphabet.
  Then creating the tree takes O(n log n), since each priority queue operation takes O(log n) and we perform O(n) of them.
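As an illustration, a compact Python sketch of the whole construction using heapq as the priority queue (a sketch, not the CLR pseudocode):

import heapq
from itertools import count

def huffman_code(freq):
    """freq: dict character -> frequency. Returns dict character -> codeword."""
    tie = count()                              # tie-breaker keeps heap tuples comparable
    heap = [(f, next(tie), ch) for ch, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)      # the two smallest frequencies
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))
    _, _, root = heap[0]

    codes = {}
    def walk(node, word):
        if isinstance(node, tuple):            # internal node: (left subtree, right subtree)
            walk(node[0], word + "0")
            walk(node[1], word + "1")
        else:                                  # leaf: a character
            codes[node] = word or "0"
    walk(root, "")
    return codes

print(huffman_code({"E": 10, "T": 7, "O": 5, "A": 3}))
# {'E': '0', 'T': '10', 'A': '110', 'O': '111'} -- the tree from the example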
Optimality?
Claim: Huffman's Algorithm produces a tree T with min B(T)
• Proof by induction on |A|. When |A| = 2, the
optimal tree clearly has two leaves, corresponding
to strings 0 and 1, as the algorithm constructs.
• Suppose |A| > 2. The first greedy choice the algorithm makes is to take the two lowest-frequency characters (call them x and y) as leaf nodes, create a new node z with frequency f(z) = f(x) + f(y), and apply the algorithm to the new smaller A' = A − {x,y} ∪ {z}. Now A' is smaller, |A'| = |A| − 1, so by induction we can assume the tree T' built for A' is optimal.
• Induction step: Must show that T obtained by
adding x & y as children to z in T’ is optimal.
Claim: Huffman's Algorithm produces a tree T with min B(T) (2)
Claim: Adding x and y as left and right children of z in the optimal T' yields an optimal T
Proof: Suppose not. Say ∃ a tree S with B(S) < B(T). We will then show a contradiction to the cost of T' being optimal.
Case 1: x and y are siblings in S. Remove them and make their parent a leaf labeled z to get a tree S' for A'. Then B(S') < B(T') by the following calculation.
B(S') = B(S) + f(z)d(z) − f(x)d(x) − f(y)d(y) = B(S) + (f(x)+f(y))(d(x)−1) − f(x)d(x) − f(y)d(y) = B(S) − (f(x)+f(y)),
using f(z) = f(x)+f(y), d(z) = d(x)−1 and d(x) = d(y).
Similarly, B(T') = B(T) − (f(x)+f(y)).
So B(S) < B(T) implies B(S') < B(T'), a contradiction!
Claim: Huffman's Algorithm produces a tree T with min B(T) (3)
• Case 2: x and y are not sibling leaves in the better tree S. Then we can show that there exists another tree of no greater cost in which x and y are sibling leaves, and apply the argument of Case 1 to that tree.
– Lemma: an optimal tree must be a full binary tree (i.e. each internal node has 2 children)
– Proof: if some internal node had only one child, remove it and splice in its child; that would decrease the total cost of the encoding
• Idea: take sibling leaves a and b at maximum depth and swap a with x and b with y. A calculation (CLR sec 16) shows that the weight of the new tree cannot increase.
Huffman Codes Summary
• Reduce size of data by 20%-90% in general
• If no characters occur more frequently than
others, then no advantage over ASCII
• Encoding: Given the characters & their
frequencies, perform the algorithm and
generate a code. Write the file using the code
• Decoding:
- Given the Huffman tree, figure out what
each character is (possible because of prefix
property)
In Practice

• Both the .mp3 and .jpg file formats use


Huffman coding at one stage of the
compression
• An alternative method that achieves higher compression but is slower is patented by IBM, making Huffman codes attractive
Are we done with compression…?
Is Huffman Coding always Optimal?
Huffman Coding is optimal under some assumptions

1. The compression is lossless.

When lossy compression is permitted, as for


video, other algorithms can achieve much
greater compression, and this is a very active
area of research because people want to be able
to send video and audio over the Web.
Is Huffman Coding always Optimal?
2. Assumed know all the frequencies f(c) of each
character appearing. How do we get this
information?
– Make two passes over the data, first to compute
the f(c), second to encode the file. May be much
more expensive than passing over the data once
for large files residing on disk or tape.
– Assume fractions f(c)/n of each character in the
file are similar to files you’ve compressed before.
E.g. assume all Java programs, or PowerPoint
files ≈ same fractions of characters appearing.
– Estimate the fractions f(c)/n on the fly as you process the file: adaptive Huffman coding
Is Huffman Coding always Optimal?
3. Assumed we know the set of characters (the
alphabet) appearing in the file. Obvious?
– Many choices: For example, the alphabet could
be the English alphabet characters, or the key
words and variables names appearing in a
program.
– Alphabet Matters: suppose we have a file consisting of n strings aaaa and n strings bbbb concatenated in some order. If we choose the alphabet {a, b}, the two characters are equally frequent, so each gets a 1-bit codeword and the 8n characters of the file need 8n bits. If we choose the alphabet {aaaa, bbbb}, the file is only 2n symbols long, so 2n bits suffice.
Picking the Correct Alphabet as
You Compress
• Correct alphabet is crucial in practical
compression algorithms.
• Both the UNIX compress and GNU gzip
algorithms use a greedy algorithm due
to Lempel and Ziv

• Lempel-Ziv Compression:
• One pass: compute a good alphabet (of
`phrases’) in one pass while compressing.
• Adaptive compression algorithm: builds knowledge of the source by itself
Lempel Ziv Idea
A is a dictionary of phrases, initially A[0] = ε (the empty string)
LZ breaks the text it sees into `phrases’ as
it goes along, storing new phrases in the
dictionary A till it fills up. How?
– As new text is encountered, it is broken into the longest matching past phrase + 1 new character appended to it. This is stored as a new phrase in A, if A is not full
– A phrase is represented as: (location of the longest past phrase, end character)
– Compressed file: the list of such pairs
Example

F=0 00 1 11 10 101 1010 00 (to be compressed)


Build dictionary A[1…6]
A(1)=A(0)0=0,
A(2)=A(1)0=00,
A(3)=A(0)1=1,
A(4)=A(3)1=11,
A(5)=A(3)0=10,
A(6)=A(5)1=101
(the dictionary is now full; the remaining phrases 1010 = A(6)0 and 00 = A(1)0 are encoded but not stored)
Compressed F =
(0,0), (1,0), (0,1), (3,1), (3,0), (5,1), (6,0), (1,0)
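As a sanity check (a Python sketch, not part of the lecture), decompression simply replays the same dictionary construction from the pairs; the dictionary size of 6 is taken from this example:

# Rebuild the file from (dictionary index, extra bit) pairs, growing the same
# dictionary the encoder used.  A[0] is the empty phrase.
def lz_decode(pairs, dict_size=6):
    A = [""]                                  # A[0] = empty string
    out = []
    for k, bit in pairs:
        phrase = A[k] + bit
        out.append(phrase)
        if len(A) <= dict_size:               # store new phrases only while there is room
            A.append(phrase)
    return "".join(out)

pairs = [(0, "0"), (1, "0"), (0, "1"), (3, "1"), (3, "0"), (5, "1"), (6, "0"), (1, "0")]
assert lz_decode(pairs) == "0" "00" "1" "11" "10" "101" "1010" "00"   # the original F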
Remarks
• In this small example no compression is obtained, but if A were large and the same long bit strings appeared frequently, the compression would be substantial. The gzip manual claims that source code and English text are typically compressed by 60%-70%.

• LZ encodes blocks of varying length into blocks of fixed size, whereas Huffman encodes blocks of the same length into blocks of varying length
Lempel Ziv Encoding Algorithm
• Input: file F, alphabet (dictionary) size N
• Initialize A = { A[0] = ε }, i = 1 (points to the next place in file F to start encoding)
Repeat while i ≤ |F|:
1. find the A(k), 0 ≤ k ≤ N, in the dictionary that matches as many of the bits fi fi+1 fi+2 … as possible
2. let p be the number of bits in A(k)
3. if A is only filled up to some l < N, store the new phrase A(k)·fi+p as A(l+1)
   (fi+p is the first bit unmatched by A(k))
4. output (k, fi+p)
5. i = i + p + 1
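A rough Python sketch of this loop (an illustration of the idea; indices here are 0-based unlike the slide's pseudocode, and the end-of-file handling and dictionary size of 6 are assumptions based on the earlier example):

def lz_encode(F, dict_size=6):
    """F: string of bits. Returns a list of (dictionary index, extra bit) pairs."""
    A = [""]                                       # A[0] = the empty phrase
    pairs, i = [], 0
    while i < len(F):
        # steps 1-2: find the stored phrase A[k] matching the most bits starting at i
        k = max((j for j in range(len(A)) if F.startswith(A[j], i)),
                key=lambda j: len(A[j]))
        p = len(A[k])
        bit = F[i + p] if i + p < len(F) else ""   # first unmatched bit (empty at end of file)
        # step 3: store the new phrase if the dictionary still has room
        if len(A) <= dict_size:
            A.append(A[k] + bit)
        # steps 4-5: output the pair and skip past the matched phrase plus the extra bit
        pairs.append((k, bit))
        i += p + 1
    return pairs

# The first seven phrases of the worked example: 0, 00, 1, 11, 10, 101, 1010
assert lz_encode("000111101011010") == [(0, "0"), (1, "0"), (0, "1"),
                                        (3, "1"), (3, "0"), (5, "1"), (6, "0")]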
Lempel Ziv: What can you prove?

• There are no optimality guarantees for this


algorithm. It can perform badly if the
nature of the file changes substantially
after A is filled up

• If we assume the text in the file obeys a Markov chain source model, average-case guarantees can be proved: HARD PROOFS
Compression: Current Research

• Sub-linear time approximation algorithms to estimate how compressible a file is

• Clustering by compression:
– Want to cluster similar document sources
– Two objects are close if we can compress one given
information on the other.
– Yields: Automatic Clustering without any background
knowledge of the data
– Genomics applications
