Lecture 1
Introduction to lossless compression
EE 274: Data Compression - Lecture 1 1
Plan: Lecture 1-3: theory and concepts from information theory
EE 274: Data Compression - Lecture 1 2
A simple probability distribution
Consider:
Alphabet X = {A, B, C, D}
Uniform probability distribution: P (A) = P (B) = P (C) = P (D) = 14
A text file generating by independently sampling one million symbols from this distribution:
$ cat abcd.txt
ACABDADCBDDC....
What is the size of this file?
EE 274: Data Compression - Lecture 1 3
Bits and bytes
bit: a unit of information expressed as either a 0 or 1 in binary notation.
byte: a group of eight bits operated on as a unit.
1 byte (B) = 8 bits
1 kilobyte (KB) = 1000 bytes = 8000 bits
So on for MB, GB, TB, PB, EB, ...
Note: Sometimes we like to use powers of two, e.g., 1 kilobyte = 1024 bytes.
EE 274: Data Compression - Lecture 1 4
abcd.txt
Size on disk: 1 MB (1 million bytes).
Why 1 byte per letter/character?
EE 274: Data Compression - Lecture 1 5
EE 274: Data Compression - Lecture 1 6
ASCII Table
Symbol ASCII code
A 1000001
B 1000010
C 1000011
D 1000100
8 bits = 1 byte per symbol.
Can we do better?
EE 274: Data Compression - Lecture 1 7
Fixed bitwidth code
Symbol Code
A 00
B 01
C 10
D 11
Bits/symbol?
Decoding?
EE 274: Data Compression - Lecture 1 8
Fixed bitwidth code
k = ∣S∣ different symbols implies at least ⌈log2 k⌉ bits per symbol in a fixed bitwidth
code.
Can we do better? In the uniform distribution example above?
EE 274: Data Compression - Lecture 1 9
Uniform distribution
Symbol Probability
A 0.5
B 0.5
Fixed bitwidth code: 1 bit/symbol
EE 274: Data Compression - Lecture 1 10
Non-uniform distribution
Symbol Probability
A 0.49
B 0.49
C 0.01
D 0.01
Fixed bitwidth code: 2 bits/symbol
Can we do better? Closer to the previous page's 1 bit/base?
EE 274: Data Compression - Lecture 1 11
Non-uniform distribution
Symbol Probability
A 0.49
B 0.49
C 0.01
D 0.01
Solution 1: C and D are low probability, let's just lose them - Lossy Compression (not
commonly used for text/database/log data).
EE 274: Data Compression - Lecture 1 12
Non-uniform distribution
Symbol Probability
A 0.49
B 0.49
C 0.01
D 0.01
Solution 2: Variable length codes: Use fewer bits for more probable symbols.
EE 274: Data Compression - Lecture 1 13
Variable length codes
Use fewer bits for more probable symbols
Symbol Probability Code
A 0.49 0
B 0.49 10
C 0.01 110
D 0.01 111
How to evaluate coding efficiency? Expected code length.
EE 274: Data Compression - Lecture 1 14
Expected code length
"Compressed size/Uncompressed size" - often in units bits/symbol.
Also sometimes called compression rate/compression ratio.
Warning: There's some variability in notation and definitions of these terms so be
careful.
Let l(x) denote the code length for symbol x with probability P (x), where x ∈ X .
Expected code length: E[l(X)] = ∑x∈X P (x)l(x)
EE 274: Data Compression - Lecture 1 15
Expected code length
Symbol Probability Code
A 0.49 0
B 0.49 10
C 0.01 110
D 0.01 111
Expected code length: E[l(X)] = ?
EE 274: Data Compression - Lecture 1 16
Expected code length
Symbol Probability Code l(x)
A 0.49 0 1
B 0.49 10 2
C 0.01 110 3
D 0.01 111 3
E[l(X)] = 0.49 × 1 + 0.49 × 2 + 0.01 × 3 + 0.01 × 3 = 1.53 bits/symbol
EE 274: Data Compression - Lecture 1 17
Thoughts and conclusion
Is the code above lossless? Can you decode it? <- homework for next lecture!
EE 274: Data Compression - Lecture 1 18
Thoughts and conclusion
Is the code above lossless? Can you decode it? <- homework for next lecture!
The non-uniform distribution above seems "worse" but "similar" to the uniform
distribution on just A and B.
EE 274: Data Compression - Lecture 1 19
Thoughts and conclusion
Is the code above lossless? Can you decode it? <- homework for next lecture!
The non-uniform distribution above seems "worse" but "similar" to the uniform
distribution on just A and B.
In the next few lectures, we will learn how to compute the optimal compression rate
and how we can get close to 1.14 bits/symbol for the above distribution (and no
better).
EE 274: Data Compression - Lecture 1 20
Thank you!
EE 274: Data Compression - Lecture 1 21