
Lecture 1

Introduction to lossless compression

EE 274: Data Compression - Lecture 1 1


Plan: Lectures 1-3: theory and concepts from information theory



A simple probability distribution
Consider:
Alphabet X = {A, B, C, D}
Uniform probability distribution: P(A) = P(B) = P(C) = P(D) = 1/4

A text file generated by independently sampling one million symbols from this distribution:
$ cat abcd.txt

ACABDADCBDDC....

What is the size of this file?
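Such a file can be produced with a short script; a minimal sketch (the filename matches the example above, the seed is an arbitrary choice for reproducibility):

```python
import random

random.seed(0)  # arbitrary seed, only for reproducibility
# Sample one million symbols independently and uniformly from {A, B, C, D}.
text = "".join(random.choice("ABCD") for _ in range(1_000_000))
with open("abcd.txt", "w") as f:
    f.write(text)
```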


Bits and bytes
bit: a unit of information expressed as either a 0 or 1 in binary notation.
byte: a group of eight bits operated on as a unit.
1 byte (B) = 8 bits
1 kilobyte (KB) = 1000 bytes = 8000 bits
So on for MB, GB, TB, PB, EB, ...
Note: Sometimes powers of two are used instead, e.g., 1 kilobyte = 1024 bytes (strictly speaking, a kibibyte, KiB).



abcd.txt
Size on disk: 1 MB (1 million bytes).
Why 1 byte per letter/character?



ASCII Table
| Symbol | ASCII code |
|--------|------------|
| A      | 1000001    |
| B      | 1000010    |
| C      | 1000011    |
| D      | 1000100    |

The 7-bit ASCII code is stored as 8 bits = 1 byte per symbol.
Can we do better?
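This is easy to check in Python: each character of the file is stored as one byte, whose low 7 bits are the ASCII code from the table:

```python
data = "ACAB".encode("ascii")    # one byte per character
print(list(data))                # byte values: [65, 67, 65, 66]
print(format(data[0], "07b"))    # 'A' -> 1000001, matching the table
```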
Fixed bitwidth code
| Symbol | Code |
|--------|------|
| A      | 00   |
| B      | 01   |
| C      | 10   |
| D      | 11   |
Bits/symbol?
Decoding?
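A sketch of encoding and decoding with this fixed 2-bit code (the function names are my own, not from the lecture):

```python
code = {"A": "00", "B": "01", "C": "10", "D": "11"}
inverse = {bits: sym for sym, bits in code.items()}

def encode(text):
    # Look up each symbol's 2-bit codeword and concatenate.
    return "".join(code[c] for c in text)

def decode(bits):
    # Fixed bitwidth makes decoding trivial: read 2 bits at a time.
    return "".join(inverse[bits[i:i + 2]] for i in range(0, len(bits), 2))

msg = "ACABDADC"
assert decode(encode(msg)) == msg
print(len(encode(msg)) / len(msg))  # 2.0 bits/symbol
```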
Fixed bitwidth code
A fixed bitwidth code for k = |X| different symbols needs at least ⌈log2 k⌉ bits per symbol.

Can we do better? In the uniform distribution example above?
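The ⌈log2 k⌉ bound can be computed directly; a small sketch (the alphabet-size examples beyond {A, B, C, D} are my own):

```python
import math

def fixed_bitwidth(k):
    # Minimum bits/symbol for k distinct symbols with a fixed-width code.
    return math.ceil(math.log2(k))

print(fixed_bitwidth(4))   # 2 -- our {A, B, C, D} alphabet
print(fixed_bitwidth(26))  # 5 -- e.g., lowercase English letters
```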



Uniform distribution
| Symbol | Probability |
|--------|-------------|
| A      | 0.5         |
| B      | 0.5         |
Fixed bitwidth code: 1 bit/symbol



Non-uniform distribution
| Symbol | Probability |
|--------|-------------|
| A      | 0.49        |
| B      | 0.49        |
| C      | 0.01        |
| D      | 0.01        |

Fixed bitwidth code: 2 bits/symbol
Can we do better? Closer to the previous slide's 1 bit/symbol?
Non-uniform distribution
| Symbol | Probability |
|--------|-------------|
| A      | 0.49        |
| B      | 0.49        |
| C      | 0.01        |
| D      | 0.01        |

Solution 1: C and D have low probability, so just drop them - lossy compression (not commonly used for text/database/log data).
Non-uniform distribution
| Symbol | Probability |
|--------|-------------|
| A      | 0.49        |
| B      | 0.49        |
| C      | 0.01        |
| D      | 0.01        |
Solution 2: Variable length codes: Use fewer bits for more probable symbols.



Variable length codes
Use fewer bits for more probable symbols
| Symbol | Probability | Code |
|--------|-------------|------|
| A      | 0.49        | 0    |
| B      | 0.49        | 10   |
| C      | 0.01        | 110  |
| D      | 0.01        | 111  |
How to evaluate coding efficiency? Expected code length.
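Encoding with this table is a straight lookup-and-concatenate; a minimal sketch (decoding is deliberately left out, since it is the homework question later in the lecture):

```python
code = {"A": "0", "B": "10", "C": "110", "D": "111"}

def encode(text):
    # More probable symbols get shorter codewords.
    return "".join(code[c] for c in text)

print(encode("AABC"))  # 0 + 0 + 10 + 110 = "0010110"
```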
Expected code length
"Compressed size/Uncompressed size" - often in units bits/symbol.
Also sometimes called compression rate/compression ratio.
Warning: There's some variability in notation and definitions of these terms so be
careful.
Let l(x) denote the code length for symbol x with probability P(x), where x ∈ X.
Expected code length: E[l(X)] = ∑_{x∈X} P(x) l(x)



Expected code length
| Symbol | Probability | Code |
|--------|-------------|------|
| A      | 0.49        | 0    |
| B      | 0.49        | 10   |
| C      | 0.01        | 110  |
| D      | 0.01        | 111  |
Expected code length: E[l(X)] = ?



Expected code length
| Symbol | Probability | Code | l(x) |
|--------|-------------|------|------|
| A      | 0.49        | 0    | 1    |
| B      | 0.49        | 10   | 2    |
| C      | 0.01        | 110  | 3    |
| D      | 0.01        | 111  | 3    |
E[l(X)] = 0.49 × 1 + 0.49 × 2 + 0.01 × 3 + 0.01 × 3 = 1.53 bits/symbol
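The arithmetic above can be checked directly:

```python
prob = {"A": 0.49, "B": 0.49, "C": 0.01, "D": 0.01}
length = {"A": 1, "B": 2, "C": 3, "D": 3}

# E[l(X)] = sum over symbols of P(x) * l(x)
expected_len = sum(prob[s] * length[s] for s in prob)
print(round(expected_len, 2))  # 1.53 bits/symbol
```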





Thoughts and conclusion
Is the code above lossless? Can you decode it? <- homework for next lecture!
The non-uniform distribution above seems "worse" but "similar" to the uniform
distribution on just A and B.
In the next few lectures, we will learn how to compute the optimal compression rate
and how we can get close to 1.14 bits/symbol for the above distribution (and no
better).



Thank you!
