UNIT 1
BASICS OF DATA COMPRESSION
Multimedia:
It is the use of computers to present and combine text, graphics, audio, video and animation in digital format.
Basic Elements of Multimedia
Text, Images, Audio, Video and Animation
Compression:
It is the method of reducing the size of a digital file so that it occupies less space.
Types of Compression:
There are two types of compression: lossless and lossy.
Lossless compression algorithms reduce the size of files without losing any information in
the file, which means that we can reconstruct the original data from the compressed file.
Lossy compression algorithms reduce the size of files by discarding the less important
information in a file, which can significantly reduce file size but also affect file quality.
Difference between Lossy and Lossless Compression
1. Lossy compression eliminates data that is not noticeable; lossless compression does not eliminate any data.
2. In lossy compression, a file cannot be restored or rebuilt in its original form; in lossless compression, a file can be restored in its original form.
3. In lossy compression, the data's quality is compromised; lossless compression does not compromise the data's quality.
4. Lossy compression reduces the size of data greatly; lossless compression reduces the size of data to a smaller extent.
5. Algorithms used in lossy compression: transform coding, Discrete Cosine Transform, Discrete Wavelet Transform, fractal compression, etc. Algorithms used in lossless compression: Run-Length Encoding, Lempel-Ziv-Welch, Huffman coding, arithmetic coding, etc.
6. Lossy compression is used for images, audio and video; lossless compression is used for text, images and sound.
7. Lossy compression has more data-holding capacity; lossless compression has less data-holding capacity than the lossy technique.
8. Lossy compression is also termed irreversible compression; lossless compression is also termed reversible compression.
Advantages and Disadvantages of Lossy and Lossless compression:
Lossy - Pros: small file sizes, ideal for web use; lots of tools, plugins and software support it.
Lossy - Cons: quality degrades due to the higher rate of compression.
Lossless - Pros: no loss in quality; slight decreases in file sizes.
Lossless - Cons: compressed files are larger than lossy files.
Methods used for Lossy and Lossless compression
Algorithms used in lossy compression are:
Transform coding
Discrete Cosine Transform
Discrete Wavelet Transform
Fractal compression, etc.
Algorithms used in lossless compression are:
Run-Length Encoding
Lempel-Ziv-Welch
Huffman Coding
Arithmetic coding, etc.
Run Length Encoding
Run-length encoding (RLE) is a form of lossless compression. RLE is a simple method of
compressing data by specifying the number of times a character or pixel colour repeats,
followed by the value of that character or pixel. The aim is to reduce the number of bits
used to represent a set of data (a code sketch follows the examples below).
1. Encode AAAAAABBBBBCCCCCCCCDDEEEEEFFF using RLE
Output:
6A 5B 8C 2D 5E 3F
2. Encode
WWWWWWWWWWWWBWWWWWWWWWWWWBBBWWWWWWWWWWWWWWW
WWWWWWWWWBWWWWWWWWWWWWWW
Output
12W 1B 12W 3B 24W 1B 14W
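The two examples above can be reproduced with a minimal Python sketch of such an encoder (an illustrative function of my own naming, assuming single-character symbols and decimal counts):

def rle_encode(data):
    # Emit one "countSymbol" pair for each run of identical symbols.
    if not data:
        return ""
    pairs = []
    count = 1
    for prev, cur in zip(data, data[1:]):
        if cur == prev:
            count += 1          # still inside the current run
        else:
            pairs.append(f"{count}{prev}")
            count = 1           # a new run starts at cur
    pairs.append(f"{count}{data[-1]}")
    return " ".join(pairs)

print(rle_encode("AAAAAABBBBBCCCCCCCCDDEEEEEFFF"))  # 6A 5B 8C 2D 5E 3F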
DICTIONARY-BASED ALGORITHMS
The compression techniques we have seen so far replace individual symbols with
variable-length codewords. In dictionary compression, variable-length substrings are
replaced by short, possibly even fixed-length codewords. Compression is achieved by
replacing long strings with shorter codewords.
The general scheme is as follows:
• The dictionary D is a collection of strings. For completeness, the dictionary includes all single symbols.
• The text T is parsed into a sequence of phrases: T = T1 T2 . . . Tz.
• The text is encoded by replacing each phrase Ti with a code that acts as a pointer to the dictionary.
LEMPEL ZIV
Lossless compression
Dictionary-based compression
For short inputs such as the example below, the encoded output may even be longer than the input stream; compression appears only as the dictionary grows (a Python sketch appears after the examples).
Example 1
Encode AABABBBABAABABBBABBABB, with A = 0 and B = 1.

POSITION                    1     2     3     4     5     6     7     8     9
SEQUENCE                    A     AB    ABB   B     ABA   ABAB  BB    ABBA  BB
NUMERICAL REPRESENTATION    ΦA    1B    2B    ΦB    2A    5B    4B    3A    7
CODE                        0000  0011  0101  0001  0100  1011  1001  0110  0111

In the numerical representation, the number denotes the dictionary position of the longest
previously seen prefix (Φ = empty prefix) and the letter is the new symbol. In the code, the
first 3 bits give that position and the last bit gives the letter (A = 0, B = 1). The final
phrase BB already exists as dictionary entry 7, so only its index is sent (padded here to 0111).
Example 2: 01000101110010100101
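A minimal Python sketch of this LZ78-style parsing, matching the table above (the function name is illustrative):

def lz78_encode(text):
    # Each phrase = longest match already in the dictionary + one new symbol.
    dictionary = {}              # phrase -> index; index 0 stands for the empty prefix (Φ)
    output = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch         # keep extending the match
        else:
            output.append((dictionary.get(phrase, 0), ch))
            dictionary[phrase + ch] = len(dictionary) + 1
            phrase = ""
    if phrase:                   # input ended inside a known phrase: send its index only
        output.append((dictionary[phrase], ""))
    return output

print(lz78_encode("AABABBBABAABABBBABBABB"))
# [(0,'A'), (1,'B'), (2,'B'), (0,'B'), (2,'A'), (5,'B'), (4,'B'), (3,'A'), (7,'')]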
LEMPEL ZIV WELCH
Common technique
Used in GIF, and optionally in TIFF and PDF
Simple
High throughput in hardware
Widely used in Unix file compression (the compress utility)
Can roughly double the effective capacity of storage and transmission hardware
Scans a file for data patterns that appear more than once
Patterns are stored in a dictionary
Example 1: Encode "TOBEORNOTTOBEORTOBEORNOT#"
There are 26 symbols in the plaintext alphabet (the capital letters A through Z). # is used
to represent a stop code.
The initial dictionary, then, consists of the following entries:
Symbol Decimal Symbol Decimal
# 0 N 14
A 1 O 15
B 2 P 16
C 3 Q 17
D 4 R 18
E 5 S 19
F 6 T 20
G 7 U 21
H 8 V 22
I 9 W 23
J 10 X 24
K 11 Y 25
L 12 Z 26
M 13
OUTPUT:
Extended Dictionary   CODE   Description
27 TO 20 CODE OF FIRST LETTER T
28 OB 15 CODE OF FIRST LETTER O (this O overlaps: it is also the second letter of the previous entry TO)
29 BE 2 CODE OF FIRST LETTER B
30 EO 5 CODE OF FIRST LETTER E
31 OR 15 CODE OF FIRST LETTER O
32 RN 18 CODE OF FIRST LETTER R
33 NO 14 CODE OF FIRST LETTER N
34 OT 15 CODE OF FIRST LETTER O
35 TT 20 CODE OF FIRST LETTER T
36 TOB 27 CODE OF FIRST TWO LETTERS TO
37 BEO 29 CODE OF FIRST TWO LETTERS BE
38 ORT 31 CODE OF FIRST TWO LETTERS OR
39 TOBE 36 CODE OF FIRST TWO LETTERS TOB
40 EOR 30 CODE OF FIRST TWO LETTERS EO
41 RNO 32 CODE OF FIRST TWO LETTERS RN
42 OT# 34 CODE OF FIRST TWO LETTERS OT
Finally the stop code # is emitted as 0 (no new entry is added).
Thus the code is 20,15,2,5,15,18,14,15,20,27,29,31,36,30,32,34,0
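A compact Python sketch of the LZW encoder (an illustrative function; the alphabet string below maps # to 0 and A-Z to 1-26, as in the initial dictionary):

def lzw_encode(text, alphabet):
    # Emit the code of the longest known string, then add string + next symbol.
    dictionary = {sym: i for i, sym in enumerate(alphabet)}
    output = []
    current = ""
    for ch in text:
        if current + ch in dictionary:
            current += ch                # keep extending the match
        else:
            output.append(dictionary[current])
            dictionary[current + ch] = len(dictionary)   # new entries: 27, 28, ...
            current = ch
    if current:
        output.append(dictionary[current])
    return output

print(lzw_encode("TOBEORNOTTOBEORTOBEORNOT#", "#ABCDEFGHIJKLMNOPQRSTUVWXYZ"))
# [20, 15, 2, 5, 15, 18, 14, 15, 20, 27, 29, 31, 36, 30, 32, 34, 0]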
Example 2:
Encode wabba_wabba_wabba_wabba_w (this string is consistent with the dictionary entries and output below)
Initial given dictionary
Index Entry
1 _
2 a
3 b
4 o
5 w
Solution:
Extended Dictionary   CODE   Description
6 wa 5 CODE OF FIRST LETTER w
7 ab 2 CODE OF FIRST LETTER a
8 bb 3 CODE OF FIRST LETTER b
9 ba 3 CODE OF FIRST LETTER b
10 a_ 2 CODE OF FIRST LETTER a
11 _w 1 CODE OF FIRST LETTER _
12 wab 6 CODE OF FIRST TWO LETTERS wa
13 bba 8 CODE OF FIRST TWO LETTERS bb
14 a_w 10 CODE OF FIRST TWO LETTERS a_
15 wabb 12 CODE OF FIRST THREE LETTERS wab
16 ba_ 9 CODE OF FIRST TWO LETTERS ba
17 _wa 11 CODE OF FIRST TWO LETTERS _w
18 abb 7 CODE OF FIRST TWO LETTERS ab
19 ba_w 16 CODE OF FIRST THREE LETTERS ba_
Thus the code is 5,2,3,3,2,1,6,8,10,12,9,11,7,16,5 (the final 5 encodes the trailing w).
Example 3:
Encode geekific-geeficf
Initial dictionary (the ASCII codes of the symbols used):
Index Entry
45    -
99    c
101   e
102   f
103   g
105   i
107   k
The initial dictionary spans ASCII codes 0-255, so new entries start at index 256.
Extended Dictionary   CODE   Description
256 ge 103 CODE OF FIRST LETTER g
257 ee 101 CODE OF FIRST LETTER e
258 ek 101 CODE OF FIRST LETTER e
259 ki 107 CODE OF FIRST LETTER k
260 if 105 CODE OF FIRST LETTER i
261 fi 102 CODE OF FIRST LETTER f
262 ic 105 CODE OF FIRST LETTER i
263 c- 99 CODE OF FIRST LETTER c
264 -g 45 CODE OF FIRST LETTER -
265 gee 256 CODE OF FIRST TWO LETTERS ge
266 ef 101 CODE OF FIRST LETTER e
267 fic 261 CODE OF FIRST TWO LETTERS fi
268 cf 99 CODE OF FIRST LETTER c
Thus the codeword is 103,101,101,107,105,102,105,99,45,256,101,261,99,102 (the final 102 encodes the trailing f).
HUFFMAN CODING
Huffman coding is an algorithm for compressing data with the aim of reducing its size
without losing any of the details. This algorithm was developed by David Huffman.
Huffman coding is typically useful for the case where data that we want to compress has
frequently occurring characters in it.
It is a lossless compression.
It is a variable length code.
Variable length is based on the frequencies of the characters.
Variable length codes are assigned to input characters.
They are prefix codes.
The step-by-step working of Huffman coding, i.e. the creation of the Huffman tree, is as follows (a code sketch follows these steps):
Step 1: Calculate the frequency of each character.
Step 2: Sort all the characters on the basis of their frequency, in ascending order.
Step 3: Mark each unique character as a leaf node.
Step 4: Create a new internal node.
Step 5: Set the frequency of the new node to the sum of the two minimum-frequency nodes.
Step 6: Make the first of those two nodes the left child and the other the right child of the
newly created node.
Step 7: Repeat steps 2 to 6 until only one node (the root) remains.
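These steps can be sketched in Python with a min-heap; this is one illustrative way to build the tree (names are mine), not the only one:

import heapq
from collections import Counter

def huffman_codes(text):
    # Heap items are (frequency, tiebreaker, tree); a tree is a symbol or [left, right].
    heap = [(f, i, s) for i, (s, f) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)      # the two minimum-frequency nodes (steps 2-4)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, tiebreak, [left, right]))  # steps 5-6
        tiebreak += 1
    codes = {}
    def walk(node, code):
        if isinstance(node, list):             # internal node
            walk(node[0], code + "0")          # "0" on left edges
            walk(node[1], code + "1")          # "1" on right edges
        else:
            codes[node] = code or "0"          # single-symbol edge case
    walk(heap[0][2], "")
    return codes

print(huffman_codes("go go gophers"))  # e.g. the frequent 'g' and 'o' get the shortest codes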
Example:
Suppose a data file has five characters A-E whose frequencies sum to 56 (the values 12, 13,
15, 7 and 9 are used in the calculation below). If Huffman coding is used, calculate:
Huffman Code of each character
Average code length
Length of Huffman encoded data
Solution:
Initially, create the Huffman Tree:
(Steps 2-5: tree-construction diagrams omitted.)
The above tree is a Huffman tree. Now assign weights to the edges: "0" to every left edge
and "1" to every right edge (labelled tree diagram omitted).
Huffman Code of each character:
A: 00
B: 10
C: 110
D: 01
E: 111
Average code length per character = ∑(frequency_i x code length_i) / ∑ frequency_i
= {(12 x 2) + (13 x 2) + (15 x 2) + (7 x 3) + (9 x 3)} / (12 + 13 + 15 + 7 + 9)
= (24 + 26 + 30 + 21 + 27) / 56
= 128/56
≈ 2.29
Average code length per character ≈ 2.29 bits
Total number of bits in the Huffman-encoded message
= total number of characters in the message x average code length per character
= 56 x (128/56)
= 128 bits (exactly ∑ frequency_i x code length_i)
Example 2:
Encode "go go gophers" using Huffman coding.
SYMBOL  FREQUENCY
g       3
o       3
p       1
e       1
h       1
r       1
s       1
space   2
Step 1: Arrange the frequencies in ascending order.
Steps 2-7: Repeatedly add the two least frequencies and re-sort in ascending order
(tree diagrams omitted).
Step 8: Add the two remaining frequencies to form the root, and assign 0 to each left branch
and 1 to each right branch (tree diagram omitted).
Step 9: For every symbol, read off its code from the tree, top to bottom:
Symbol  Frequency  Code  Size
g       3          00    3 x 2 = 6
o       3          01    3 x 2 = 6
s       1          100   1 x 3 = 3
e       1          1100  1 x 4 = 4
h       1          1101  1 x 4 = 4
p       1          1110  1 x 4 = 4
r       1          1111  1 x 4 = 4
space   2          101   2 x 3 = 6
Totals: symbol storage 8 x 4 = 32 bits; frequency storage 13 bits; encoded message 37 bits
Before compression: 13 x 8 = 104 bits
Total size after compression = 32 + 13 + 37 = 82 bits
Example 3:
Find the Huffman codewords for the text "AAAAAAAAAABBBBBCCCSS" using a static Huffman
tree. Calculate the entropy and derive the average number of bits per character for the
codewords.
Avg L = Σ Li x P(i)
= (10/20 x 1) + (5/20 x 2) + (3/20 x 3) + (2/20 x 3) = (10 + 10 + 9 + 6)/20 = 35/20 = 1.75
Space required to store the original message = 20 x 8 = 160 bits
Space required to store the encoded message = 35 bits
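The question also asks for the entropy; a worked step (not spelled out in the notes), using H = -Σ Pi log2 Pi with the probabilities 10/20 = 0.5, 5/20 = 0.25, 3/20 = 0.15 and 2/20 = 0.1:
H = -(0.5 log2 0.5 + 0.25 log2 0.25 + 0.15 log2 0.15 + 0.1 log2 0.1)
  = 0.5 + 0.5 + 0.4105 + 0.3322
  ≈ 1.74 bits/character
so the code's efficiency is H / Avg L = 1.74/1.75 ≈ 99.6%.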
Example 4:
Encode BCAADDDCCACACAC using Huffman coding.
ENTROPY
Entropy is the average level of information, surprise, or uncertainty inherent in the
variable's possible outcomes.
H[X] = -Σ Pi log2 Pi
Using base-10 logarithms: H[X] = -(1/0.3010) Σ Pi log10 Pi (since log2 x = log10 x / 0.3010)
Average length L = Σ Pi x Ni, where Ni is the length of the codeword for symbol i
Efficiency η = H/L
Redundancy = 1 - η
Procedure
Example: Encode using Huffman coding P0=0.4, P1=0.2, P2=0.1, P3=0.2 and P4=0.1
Step 1: Arrange the probabilities in Descending order
MSG PROB
P0 0.4
P1 0.2
P3 0.2
P2 0.1
P4 0.1
Step 2: Combine (add) the last two probabilities and arrange in descending order.
Step 3: Combine (add) the last two probabilities and arrange in descending order.
Repeat until all the probabilities are combined.
Step 4: Assign values 0 and 1 to all elements as shown
Step 5: Obtain the codes by tracing back through the diagram (these traced bit strings are
reversed to give the final prefix codes, shown as the inverse codes below):
P0 = 00
P1 = 01
P2 = 010
P3 = 11
P4 = 110
SYMBOL PROB CODE INVERSE CODE CODE LENGTH
P0 0.4 00 00 2
P1 0.2 01 10 2
P2 0.1 010 010 3
P3 0.2 11 11 2
P4 0.1 110 011 3
Entropy H[X] = -(1/0.3010) Σ Pi log10 Pi
= -(1/0.3010)[(0.4 x log 0.4) + (0.2 x log 0.2) + (0.1 x log 0.1) + (0.2 x log 0.2) + (0.1 x log 0.1)]
= 2.122 bits/symbol
Average length L = Σ Pi x Ni
= (0.4 x 2) + (0.2 x 2) + (0.1 x 3) + (0.2 x 2) + (0.1 x 3)
= 2.2
Efficiency η = H/L = 2.122/2.2 = 0.9645 (96.45%)
Redundancy = 1 - η = 0.0355
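These figures can be checked with a few lines of Python (a quick verification sketch, not part of the original notes):

import math

probs = [0.4, 0.2, 0.1, 0.2, 0.1]             # P0..P4
lengths = [2, 2, 3, 2, 3]                      # code lengths from the table above

H = -sum(p * math.log2(p) for p in probs)      # entropy in bits/symbol
L = sum(p * n for p, n in zip(probs, lengths)) # average code length
print(round(H, 3), round(L, 2), round(H / L, 4))   # 2.122 2.2 0.9645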
-----------------------------------------------------------------------------------------------------------------------
Example 3: x1 = 0.4, x2 = 0.19, x3 = 0.16, x4 = 0.15, x5 = 0.1 (the probabilities must sum to 1)
Example 4: x1 = 0.3, x2 = 0.25, x3 = 0.2, x4 = 0.12, x5 = 0.08 and x6 = 0.05
Advantages:
Efficient
Shorter codes for more frequent symbols
Higher compression ratio
Does not need any special markers to separate different codes (the codes are prefix-free)
Easy to integrate with existing systems
Lossless
Easy to implement in hardware and software
Disadvantages:
Frequencies must be known in advance
The tree is complex
Difficult to understand and debug
Time-consuming
Less effective when most symbols are unique (occur only once)
-----------------------------------------------------------------------------------------------------------------------
ARITHMETIC CODING
Arithmetic coding is a type of entropy encoding utilized in lossless data compression.
Arithmetic coding can be used in most applications of data compression. Its main
usefulness is in obtaining compression in conjunction with an adaptive model, or when the
probability of one event is close to 1.
Difference between Huffman coding and Arithmetic coding:
Arithmetic coding                                   Huffman coding
Can work with an adaptive model; a fixed            Needs the probability distribution in
probability table is not needed in advance          advance to build the code tree
No need to keep and share a codeword table          Needs to store a codeword table
Decompression speed is low                          Decompression speed is high
Compression speed is low                            Compression speed is high
Compression ratio is high                           Compression ratio is lower
PROCEDURE
Example:
Find the tag Sm for the message "babc", with probabilities a = 0.2, b = 0.5 and c = 0.3,
using arithmetic coding.
symbol probability Cumulative probability
a 0.2 0.2
b 0.5 0.7
c 0.3 1.0
Step 1: Divide the interval [0, 1) according to the cumulative probabilities: a → [0, 0.2), b → [0.2, 0.7), c → [0.7, 1.0).
Step 2: For b
For “b” Calculate Range and mark in the diagram
Lower Limit =0.2
Upper Limit =0.7
Difference =d =0.7-0.2 =0.5
Range = lower Limit +(difference x probability of the symbol)
Range of a=0.2+(0.5x0.2)=0.2+0.10=0.3
Range of b= 0.3+(0.5x0.5)=0.3+0.25=0.55
Range of c=0.55+(0.5x0.3)=0.55+0.15=0.7
Step 3: For a
For “a” Calculate Range and mark in the diagram
Lower Limit =0.2
Upper Limit =0.3
Difference =d =0.3-0.2 =0.1
Range = lower Limit +(difference x probability of the symbol)
Range of a=0.2+(0.1x0.2)=0.2+0.02=0.22
Range of b= 0.22+(0.1x0.5)=0.22+0.05=0.27
Range of c=0.27+(0.1x0.3)=0.27+0.03=0.3
Step 4: For "b", calculate the range and mark it in the diagram
Lower Limit =0.22
Upper Limit =0.27
Difference =d =0.27-0.22 =0.05
Range = lower Limit +(difference x probability of the symbol)
Range of a=0.22+(0.05x0.2)=0.22+0.01=0.23
Range of b= 0.23+(0.05x0.5)=0.23+0.025=0.255
Range of c=0.255+(0.05x0.3)=0.255+0.015=0.27
Step 5: For the final symbol "c", the interval is the range of c computed above.
The codeword lies between the lower limit 0.255 and the upper limit 0.27.
Tag = (0.255 + 0.27)/2 = 0.2625
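A floating-point Python sketch of this interval narrowing (illustrative only; practical coders use the integer update shown at the end of the next section to avoid precision loss):

def arithmetic_tag(message, probs):
    # Assign each symbol a sub-interval of [0, 1) from the cumulative probabilities.
    ranges, cum = {}, 0.0
    for sym, p in probs.items():
        ranges[sym] = (cum, cum + p)
        cum += p
    low, high = 0.0, 1.0
    for sym in message:
        width = high - low                   # current interval width
        s_low, s_high = ranges[sym]
        low, high = low + width * s_low, low + width * s_high
    return (low + high) / 2                  # tag = midpoint of the final interval

print(arithmetic_tag("babc", {"a": 0.2, "b": 0.5, "c": 0.3}))  # ≈ 0.2625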
-----------------------------------------------------------------------------------------------------------------------
CONTEXT-BASED COMPRESSION / PREDICTIVE CODING
Compresses based on the previous symbols.
Encoding is done in prediction mode.
If the distribution or transformation is based on the previous sequence (the history), there
is no need to transfer additional information.
Using the history to determine the next symbol in a predictive manner is called predictive
coding.
A popular algorithm used is Prediction by Partial Matching (PPM).
In PPM it is necessary only to store the context (surrounding elements) of an encoded
sequence.
At first, since there are no previously encoded elements, the first symbol has to be encoded
separately.
In PPM an escape symbol <ESC> is used to denote that the letter to be encoded is not present
in the context.
PPM Basic Algorithm
The basic algorithm initially attempts to use the largest context. The size of the largest
context is predetermined.
If the symbol to be encoded has not previously been present in this context, an escape
symbol is encoded and the algorithm attempts to use the next smaller context.
If the symbol has not occurred in this context either, the size of the context is further
reduced.
This process continues until either we obtain a context that has previously been present
with this symbol, or we arrive at the conclusion that the symbol has not been encountered
previously in any context.
In this case, we use a probability of 1/M to encode the symbol, where M is the size of the
source alphabet.
Example:
Letter to be coded: the letter a of the word "probability".
First attempt: see if the string proba has previously occurred, that is, if the letter a
has previously occurred in the context of prob. If not, we send an escape symbol.
Second attempt: see if the letter a has previously occurred in the context of rob.
If the string roba has not occurred previously, we send an escape symbol.
Third attempt: see if the letter a has previously occurred in the context of ob.
If the string oba has not occurred previously, we send an escape symbol.
Fourth attempt: see if the letter a has previously occurred in the context of b.
If the string ba has not occurred previously, it is concluded that a has occurred for the
first time, and it is encoded separately (with probability 1/M).
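A toy Python sketch of the escape cascade described above (illustrative only; a real PPM coder also maintains the count tables of the next example and drives an arithmetic coder):

def ppm_find_context(history, symbol, max_order=2, alphabet_size=256):
    # Try contexts from the longest order down to order 0, then order -1.
    for order in range(max_order, 0, -1):
        if len(history) < order:
            continue
        context = history[-order:]
        # Count occurrences of `symbol` right after `context` earlier in the history.
        count = sum(1 for i in range(len(history) - order)
                    if history[i:i + order] == context and history[i + order] == symbol)
        if count:
            return f"encode in order-{order} context '{context}' (count {count})"
        # Otherwise emit <ESC> and fall through to the next smaller context.
    if symbol in history:
        return "encode in the order-0 model"
    return f"order -1: encode with probability 1/{alphabet_size}"

print(ppm_find_context("this is", " "))   # space follows "is" in the history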
Example:
Encode the sequence
“this is the “
1. Assume that we have already encoded the initial 7 symbols "this is" (including the space).
2. Assume that the longest context length is two; that is, the orders used are -1, 0, 1 and 2.
3. Count array for the order -1 context (each distinct letter of "this is the " counted once):
Letter count Cumulative count
t 1 1
h 1 2
i 1 3
s 1 4
space 1 5
e 1 6
Total count 6
Since the special character (space) should be listed last, interchange e and space:
Letter count Cumulative count
t 1 1
h 1 2
i 1 3
s 1 4
e 1 5
space 1 6
Total count 6
4.Count array for 0 order context: “this is”
Letter count Cumulative count Remarks
t 1 1
h 1 2
i 2 4 i appears twice
s 2 6 s appears twice
space 1 7
<ESC> 1 8
Total count 8
5. Count array for the order-1 contexts of "this is".
It gives how many times each letter appears immediately after the given context.
I order context  Letter  Count  Cum count  Remarks
t                h       1      1          h follows t once in "this is"
t                <ESC>   1      2
                 Total count: 2
h                i       1      1          i follows h once in "this is"
h                <ESC>   1      2
                 Total count: 2
i                s       2      2          s follows i twice in "this is"
i                <ESC>   1      3
                 Total count: 3
s                space   1      1          space follows s once in "this is"
s                <ESC>   1      2
                 Total count: 2
space            i       1      1          i follows space once in "this is"
space            <ESC>   1      2
                 Total count: 2
6. Count array for the order-2 contexts of "this is":
II order context  Letter  Count  Cum count  Remarks
th                i       1      1          i follows th once in "this is"
th                <ESC>   1      2
                  Total count: 2
hi                s       1      1          s follows hi once in "this is"
hi                <ESC>   1      2
                  Total count: 2
is                space   1      1          space follows is once in "this is"
is                <ESC>   1      2
                  Total count: 2
s space           i       1      1          i follows "s " once in "this is"
s space           <ESC>   1      2
                  Total count: 2
space i           s       1      1          s follows " i" once in "this is"
space i           <ESC>   1      2
                  Total count: 2
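These count arrays can be generated mechanically; a small Python check (illustrative):

from collections import Counter

history = "this is"
order1 = Counter((history[i], history[i + 1]) for i in range(len(history) - 1))
order2 = Counter((history[i:i + 2], history[i + 2]) for i in range(len(history) - 2))
print(order1[("i", "s")])    # 2: s follows i twice
print(order2[("is", " ")])   # 1: space follows "is" once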
7. We assume that the word length for arithmetic coding is six bits. Thus l = 000000 and
u = 111111.
8. As "this is" has already been encoded, the next letter to be encoded is the space in
"this is the".
In the order-2 table the context is "is" (the last two encoded letters); in this context the
cumulative count for space is 1 and the total count is 2.
The update formulas for arithmetic coding with counts are:
l(n) = l(n-1) + floor[ (u(n-1) - l(n-1) + 1) x Cum_Count(x_n - 1) / Total_Count ]
u(n) = l(n-1) + floor[ (u(n-1) - l(n-1) + 1) x Cum_Count(x_n) / Total_Count ] - 1
Here u(n-1) = decimal value of 111111 = 63, l(n-1) = 0, Cum_Count(x_n - 1) = 0,
Cum_Count(x_n) = 1 (for space) and Total_Count = 2.
Substituting: l(n) = 0 + floor[(63 - 0 + 1) x 0 / 2] = 0 = 000000
and u(n) = 0 + floor[(63 - 0 + 1) x 1 / 2] - 1 = 31 = 011111