Data Compression
Cleophas Mochoge
INTE 412
In computer science and information theory, data
compression, source coding, or bit-rate
reduction involves encoding information using
fewer bits than the original representation.
Compression can be either lossy or lossless.
Lossless compression reduces bits by identifying
and eliminating statistical redundancy. No
information is lost in lossless compression.
Lossy compression reduces bits by identifying
marginally important information and removing it.
The process of reducing the size of a data file is
popularly referred to as data compression,
although its formal name is source coding
(coding done at the source of the data, before it
is stored or transmitted). Compression is useful
because it helps reduce resource usage, such
as data storage space or transmission capacity.
Because compressed data must be decompressed
before it can be used, this extra processing
imposes computational or other costs.
Lossless
• Lossless data compression algorithms
usually exploit statistical redundancy to
represent data more concisely without
losing information. Lossless compression is
possible because most real-world data has
statistical redundancy.
For example, an image may have areas
of colour that do not change over several
pixels; instead of coding "red pixel, red
pixel, ..." the data may be encoded as
"279 red pixels". This is a simple example
of run-length encoding; there are many
schemes to reduce size by eliminating
redundancy.
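A minimal run-length encoding sketch in Python (the pixel string and function names are illustrative, not taken from any specific image format):

from itertools import groupby

def rle_encode(data):
    # Collapse each run of identical symbols into a (symbol, count) pair
    return [(symbol, len(list(run))) for symbol, run in groupby(data)]

def rle_decode(pairs):
    # Expand (symbol, count) pairs back into the original sequence
    return "".join(symbol * count for symbol, count in pairs)

pixels = "RRRRRRRRRRGGGBB"        # ten red, three green, two blue pixels
encoded = rle_encode(pixels)      # [('R', 10), ('G', 3), ('B', 2)]
assert rle_decode(encoded) == pixels

The encoded form grows with the number of runs rather than their lengths, which is why long uniform areas compress well.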
Lossy
Lossy data compression is contrasted
with lossless data compression. In these
schemes, some loss of information is
acceptable. Depending upon the
application, detail can be dropped from
the data to save storage space. Generally,
lossy data compression schemes are guided
by research on how people perceive the data
in question.
For example, the human eye is more sensitive
to subtle variations in luminance than it is to
variations in color. JPEG image compression
works in part by "rounding off" less-important
visual information. There is a corresponding
trade-off between information lost and the
size reduction. A number of popular
compression formats exploit these perceptual
differences, including those used in music
files, images, and video.
Introduction
Compression is used to reduce the volume
of information to be stored, or to reduce
the communication bandwidth required for
its transmission over a network.
Compression Principles
Entropy Encoding
Run-length encoding
Lossless & Independent of the type of source
information
Used when the source information comprises
long substrings of the same character or
binary digit
Output: (character or bit pattern, # of occurrences) pairs, as in FAX
e.g.) 000000011111111110000011……
→ (0,7) (1,10) (0,5) (1,2) …… or simply 7, 10, 5, 2 …… (see the sketch below)
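A short sketch of this bit-oriented variant in Python; it emits only the run lengths, so the receiver must know whether the line starts with a 0 or a 1 (an assumption of this sketch, as in the example above):

def run_lengths(bits):
    # Return the length of each run of identical bits in the string
    if not bits:
        return []
    runs, count = [], 1
    for prev, cur in zip(bits, bits[1:]):
        if cur == prev:
            count += 1
        else:
            runs.append(count)
            count = 1
    runs.append(count)
    return runs

print(run_lengths("000000011111111110000011"))   # [7, 10, 5, 2]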
Compression Principles
Entropy Encoding
Statistical encoding
Based on the probability of occurrence of a
pattern
The more probable a pattern is, the shorter its codeword
“Prefix property”: a shorter codeword must not
form the start of a longer codeword (see the check sketched below)
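A small sketch of the prefix-property check in Python (the codewords are only examples):

def is_prefix_free(codewords):
    # The prefix property holds if no codeword is the start of another codeword
    return not any(a != b and b.startswith(a)
                   for a in codewords for b in codewords)

print(is_prefix_free(["1", "01", "001", "000"]))   # True: uniquely decodable
print(is_prefix_free(["01", "010"]))               # False: "01" starts "010"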
Huffman Coding
Huffman coding is one type of entropy
coding in which each character is encoded
according to its probability of
occurrence. The Huffman coding
algorithm determines the optimal code,
using the minimum number of bits. The
length (number of bits) of the codeword
differs from character to character.
To determine a Huffman code, it is useful to
construct a binary tree. The leaves of
the tree represent the characters that are to be
encoded. Every node contains a probability of
occurrence, and 0 and 1 are assigned to the
branches of the tree. Every character has an
associated weight equal to the number of times
the character occurs in the data stream.
Stream of characters
• P(A) = 0.16
• P(B) = 0.51
• P(C) = 0.09
• P(D) = 0.13
• P(E) = 0.11
E.g.) symbols A, B, C, D, E with probabilities
A(0.16), B(0.51), C(0.09), D(0.13), E(0.11)
H' = Σ(i=1..5) Ni Pi = 1(0.51) + 3(0.16) + 3(0.13) + 3(0.11) + 3(0.09)
= 1.98 bits/codeword
(codeword lengths from the Huffman tree: B gets 1 bit, the others 3 bits)
H = -Σ(i=1..5) Pi log2 Pi =
-((0.16 log2 0.16) + (0.51 log2 0.51) +
(0.09 log2 0.09) + (0.13 log2 0.13) +
(0.11 log2 0.11)) ≈ 1.96 bits/symbol
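These figures can be checked with a few lines of Python (the codeword lengths 1 and 3 are taken from the Huffman tree built for these probabilities; the variable names are just illustrative):

import math

probs   = {"A": 0.16, "B": 0.51, "C": 0.09, "D": 0.13, "E": 0.11}
lengths = {"A": 3,    "B": 1,    "C": 3,    "D": 3,    "E": 3}    # Ni, bits per codeword

H     = -sum(p * math.log2(p) for p in probs.values())   # entropy, ~1.96 bits/symbol
H_avg = sum(lengths[s] * probs[s] for s in probs)        # H', 1.98 bits/codeword
print(round(H, 2), round(H_avg, 2))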
Compression Principles
Huffman Encoding
Entropy, H: the theoretical minimum average # of bits required
to transmit a particular stream
H = -Σ(i=1..n) Pi log2 Pi
where n: # of symbols, Pi: probability of symbol i
Efficiency, E = H / H'
where H' = average # of bits per codeword = Σ(i=1..n) Ni Pi
Ni: # of bits in the codeword for symbol i
E.g.) symbols M(10), F(11), Y(010), N(011), 0(000),
1(001) with probabilities 0.25, 0.25, 0.125, 0.125,
0.125, 0.125
H' = Σ(i=1..6) Ni Pi = 2(2 × 0.25) + 4(3 × 0.125) = 2.5 bits/codeword
H = -Σ(i=1..6) Pi log2 Pi = -(2(0.25 log2 0.25) + 4(0.125 log2 0.125)) = 2.5
E = H / H' = 100%
(3 bits/codeword would be needed if fixed-length codewords were used for six symbols)
Huffman Algorithm
Method of construction for an encoding tree
• Full Binary Tree Representation
• Each edge of the tree has a value
(0 on the edge to the left child, 1 on the edge to the right child)
• Data is at the leaves, not internal nodes
• Result: encoding tree
• “Variable-Length Encoding”
Huffman Algorithm
• 1. Maintain a forest of trees
• 2. Weight of a tree = sum of the frequencies of its
leaves
• 3. Repeat until one tree remains (N - 1 merges for N symbols):
– Select the two smallest-weight trees
– Merge them to form a new tree
• Huffman coding
• a variable-length code: the more frequent a character,
the shorter its codeword
• must satisfy the prefix property to be uniquely
decodable
• two-pass algorithm
– the first pass accumulates the character frequencies
and generates the codebook
– the second pass does the compression using the
codebook
Huffman coding
• create codes by constructing a binary tree (see the sketch below)
1. consider all characters as free nodes
2. assign the two free nodes with the lowest frequencies to
a parent node whose weight equals the sum of
their frequencies
3. remove the two free nodes and add the newly
created parent node to the list of free nodes
4. repeat steps 2 and 3 until there is one free node
left; it becomes the root of the tree
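A compact sketch of this construction in Python, using a min-heap to pick the two lowest-frequency free nodes at each step (one common way to implement it; the exact 0/1 labels it assigns can differ from the table on the later slide, but the codeword lengths come out the same):

import heapq

def huffman_codes(freqs):
    # Build a Huffman tree from symbol frequencies and return {symbol: codeword}.
    # Heap entries are (weight, tie_breaker, tree); a tree is either a symbol
    # (leaf) or a (left, right) pair (branch node).
    heap = [(weight, i, symbol) for i, (symbol, weight) in enumerate(freqs.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)        # the two smallest-weight trees...
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next_id, (left, right)))   # ...merged into one
        next_id += 1
    codes = {}
    def assign(tree, prefix):
        if isinstance(tree, tuple):              # branch node: 0 to the left, 1 to the right
            assign(tree[0], prefix + "0")
            assign(tree[1], prefix + "1")
        else:                                    # leaf node: record the symbol's codeword
            codes[tree] = prefix or "0"
    assign(heap[0][2], "")
    return codes

# Colour counts from the 64-value example on the following slides
freqs = {"R": 19, "K": 17, "G": 14, "B": 7, "C": 4, "M": 2, "Y": 1}
for symbol, code in sorted(huffman_codes(freqs).items(), key=lambda kv: len(kv[1])):
    print(symbol, freqs[symbol], code)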
• Right branch of the binary tree: 1
• Left branch of the binary tree: 0
• Prefix violation (example)
– e: ”01”, b: “010”
– “01” is a prefix of “010”, so the bits “010” could be read
as “e” followed by “0” ==> not uniquely decodable
• when frequencies are equal, break ties consistently
(always the same side, left or right)
• Example (64 data values)
• R K K K K K K K
• K K K R R K K K
• K K R R R R G G
• K K B C C C R R
• G G G M C B R R
• B B B M Y B B R
• G G G G G G G R
• G R R R R G R R
• Color frequency Huffman code
• =================================
• R 19 00
• K 17 01
• G 14 10
• B 7 110
• C 4 1110
• M 2 11110
• Y 1 11111
Static Huffman Coding
Huffman (Code) Tree
Given: a number of symbols (or characters) and their relative
probabilities, known in advance
Must hold the “prefix property” among codes

Symbol   Occurrence
A        4/8
B        2/8
C        1/8
D        1/8

[Huffman code tree: the root node (weight 8) branches to leaf A (weight 4)
on the 1 side and a branch node (weight 4) on the 0 side; that branch node
splits into leaf B (weight 2) on its 1 side and a branch node (weight 2),
which splits into leaf C on its 1 side and leaf D on its 0 side.]

Symbol   Code
A        1
B        01
C        001
D        000

4(1) + 2(2) + 1(3) + 1(3) = 14 bits are required to transmit “AAAABBCD”
Prefix Property!
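A quick check of this count in Python, using the code table above as a dictionary (a sketch, not a full codec):

code = {"A": "1", "B": "01", "C": "001", "D": "000"}   # the code table above

message = "AAAABBCD"
bits = "".join(code[ch] for ch in message)
print(bits, len(bits))          # 11110101001000 -> 14 bits

# Because the code is prefix-free, greedy bit-by-bit decoding is unambiguous
decoded, buffer = [], ""
for bit in bits:
    buffer += bit
    if buffer in code.values():
        decoded.append(next(s for s, c in code.items() if c == buffer))
        buffer = ""
print("".join(decoded))         # AAAABBCD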
The end
Thank you