CS3343: Digital Media Production
Audio and Video
Compression
Brown
Lecture Topics
• Basic Compression
• Audio and hearing
• Audio compression
• Video compression
– Transform coding (DCT)
• Codecs
Outcome: After this lecture you should have a much better idea of audio and
video representation, and how it is compressed. We will also talk about
a few standards.
Basic Compression
Data Compression
• Two categories
– Information Preserving
• Error free compression
• Original data can be recovered completely
– Lossy
• Original data is approximated
• Less than perfect
• Generally allows much higher compression
Basics
• Data Compression
– Process of reducing the amount of data
required to represent a given quantity of
information
• Data vs. Information
– Data and Information are not the same thing
– Data
• the means by which information is conveyed
• various amounts of data can convey the same
information
– Information
• “A signal that contains no uncertainty”
Redundancy
• Redundancy
– “data” that provides no relevant information
– “data” that restates what is already known
• For example
– Consider that N1 and N2 denote the
number of “data units” in two sets that
represent the same information
– where Cr is the “Compression Ratio”
• Cr = N1 / N2
EXAMPLE
N1 = 10 and N2 = 1 data units can encode the same information
Compression Ratio
Cr = N1/N2 = 10 (or 10:1)
Implying 90% of the data in N1 is redundant
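The ratio and redundancy above are easy to compute; a minimal sketch (the function name is my own):

```python
def compression_stats(n1, n2):
    """Compression ratio Cr = N1/N2 and the fraction of N1 that is
    redundant, for two encodings of the same information."""
    cr = n1 / n2
    redundancy = 1 - 1 / cr
    return cr, redundancy

cr, r = compression_stats(10, 1)   # the example above: Cr = 10, 90% redundant
```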
Variable Length Coding (1)
Our binary system (called natural binary) is not always that
good at representing data from a compression point of view
Consider the following string of data:
abaaaaabbbcccccaaaaaaaabbbbdaaaaaaa
There are 4 different pieces of information (let’s say 4 symbols)
a, b, c, d
In natural binary we would need at least 2 bits to represent this,
assigning bits as follows:
a=00, b=01, c=10, d=11
There are 35 pieces of data, that is 35 * 2 bits = 70 bits
Variable Length Coding (2)
Now, consider the occurrence of each symbol: a,b,c,d
abaaaaabbbcccccaaaaaaaabbbbdaaaaaaa
a = 21/35 (60%)
b = 8/35 (23%)
c = 5/35 (14%)
d = 1/35 (3%)
Variable Length Coding (3)
Idea of variable length coding: assign fewer bits to
more frequent symbols, more bits to less frequent
symbols
Bit assignment using VLC:
a=1, b=01, c=001, d=0001
Now, compute the number of bits used to encode the same
data:
21*(1) + 8*(2) + 5*(3) + 1*(4) = 56 bits
So, we have a compression ratio of 70:56 or 1.25,
meaning 20% of the data using natural binary encoding
is redundant
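The totals can be checked directly, counting symbols in the string and summing the codeword lengths (variable names are my own):

```python
from collections import Counter

data = "abaaaaabbbcccccaaaaaaaabbbbdaaaaaaa"
code = {"a": "1", "b": "01", "c": "001", "d": "0001"}   # the VLC above

counts = Counter(data)                                   # a:21, b:8, c:5, d:1
vlc_bits = sum(counts[s] * len(code[s]) for s in counts)
fixed_bits = 2 * len(data)                               # natural binary: 2 bits/symbol
print(fixed_bits, vlc_bits)                              # 70 56
```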
Huffman encoding
• This is an example of error free coding, the information is
completely the same, the data is different
• This idea of variable length coding is used in many places
– Area codes for cities in large countries (China/India)
• Prof. David A. Huffman developed an algorithm to take a data set
and compute its “optimal” encoding (bit assignment)
– He developed this as a student at MIT
• This is very commonly applied to many compression techniques as a
final stage
• There are other VLC techniques, such as Arithmetic coding and
LZW coding (ZIP)
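A minimal sketch of Huffman's algorithm using a heap, repeatedly merging the two least-frequent nodes (function and variable names are my own):

```python
import heapq
from collections import Counter

def huffman_code(data):
    """Return a prefix-free code (symbol -> bitstring) built from
    the symbol frequencies in `data`."""
    heap = [[freq, i, {sym: ""}] for i, (sym, freq)
            in enumerate(Counter(data).items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)            # two least-frequent nodes
        hi = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in lo[2].items()}
        merged.update({s: "1" + c for s, c in hi[2].items()})
        heapq.heappush(heap, [lo[0] + hi[0], lo[1], merged])
    return heap[0][2]

code = huffman_code("abaaaaabbbcccccaaaaaaaabbbbdaaaaaaa")
# The most frequent symbol ('a') gets the shortest codeword.
```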
Run-length-encoding (RLE)
• Consider the following
aaaaaaaabbbaaaaaaaabbbba
We could instead represent this by “runs”
8a,3b,8a,4b,1a
RLE is often more compact, especially when the data contains
lots of runs of the same symbol(s)
RLE is also lossless, you can always reconstruct the
original
Fax machines transmit RLE-encoded linescans of
black-and-white documents
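The encode/decode pair is tiny in Python (a sketch; real fax codecs use bit-level codes, and the function names here are my own):

```python
from itertools import groupby

def rle_encode(s):
    """Lossless run-length encoding: 'aaab' -> [(3, 'a'), (1, 'b')]."""
    return [(len(list(g)), ch) for ch, g in groupby(s)]

def rle_decode(runs):
    return "".join(n * ch for n, ch in runs)

runs = rle_encode("aaaaaaaabbbaaaaaaaabbbba")
print(runs)   # [(8, 'a'), (3, 'b'), (8, 'a'), (4, 'b'), (1, 'a')]
```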
Quantization (Lossy)
Another thing we can do is actually quantize the data such that it
cannot be recovered completely
Consider the following string of numbers:
ORIGINAL = {10, 12, 15, 18, 20, 3, 5, 2, 13}
(all the numbers are unique so Huffman coding won’t help)
Integer divide this by 5 (i.e. quantize it)
QUANTIZED = {2, 2, 3, 3, 4, 0, 1, 0, 2}
The natural binary range is smaller, and we could also use Huffman
encoding to get a bit more compression.
Of course, the values are not the same; on "reconstruction" (multiplying by
5) we get only an approximation of the original:
RECOVERED = {10, 10, 15, 15, 20, 0, 5, 0, 10}
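The quantize/dequantize round trip above, as a sketch (function names are my own):

```python
def quantize(values, step):
    """Integer-divide each value by `step` -- the lossy step."""
    return [v // step for v in values]

def dequantize(qvalues, step):
    """Reconstruction only approximates the original."""
    return [q * step for q in qvalues]

original = [10, 12, 15, 18, 20, 3, 5, 2, 13]
print(quantize(original, 5))                  # [2, 2, 3, 3, 4, 0, 1, 0, 2]
print(dequantize(quantize(original, 5), 5))   # [10, 10, 15, 15, 20, 0, 5, 0, 10]
```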
Lossy vs. Lossless
• For things like text documents and computer data files, lossy
compression doesn’t make sense
– An approximation of the original is no good!
• But for data like audio or visual, small errors are not easily
detectable by our senses
– An approximation is acceptable
• This is one reason we can get significant compression of images
and audio, vs. other types of data
– Lossless 10:1 is typically possible
– Lossy 300:1 is possible with no significant perceptual loss
• With lossy we can even talk about “quality”
• The more like the original the higher the quality
• The less like the original, the lower the quality
Audio
Audio
Sound is a compression wave made by moving particles in a medium.
Sound can be transmitted through gases, plasma, liquids, and even solids.
Sound waves are important, because
they are perceived/heard by us.
Sound Representation
• We can consider sound as amplitudes
over time
Sound Waves
• We can decompose a sound wave into frequencies and
their amplitude
[Figure: the original waveform, its decomposition into 10 frequencies, and the result of adding the frequencies back up]
Human Response
[Figure: human hearing response by frequency, from the cannot-hear threshold up to the pain threshold]
Human response is not uniform. We have "better" hearing at different frequencies.
Measure of sound
• Our range is huge, from ~0.0002 to 130
– This is measured in terms of sound pressure, in micropascals (µPa)
– 20 µPa is generally considered the lower limit of our hearing
• We use a log scale (decibels=dB) to represent this range
– dB is not a unit of pressure; the unit is still µPa
• X dB is expressed as a ratio, between a reference (R)
and another pressure (P2)
20 log(P2/R) = X dB
Examples in dB
Example:
A sound pressure, P2, at 86 dB means:
20 log(P2/R) = 86 dB // the factor 20 comes from the squared (power) ratio
log(P2/R) = 4.3 // divide both sides by 20
P2/R = 10^4.3 // raise 10 to both sides to remove the log
This means that P2 is 10^4.3 (~20,000) times stronger than the reference sound
A sound pressure, P2, at 10 dB means:
20 log(P2/R) = 10 dB
log(P2/R) = 0.5 // divide both sides by 20
P2/R = 10^0.5 // raise 10 to both sides to remove the log
This means that P2 is 10^0.5 (~3.2) times stronger than the reference sound
A sound pressure at 0 dB means:
20 log(P2/R) = 0 dB
log(P2/R) = 0
P2/R = 1
This means P2 is the same as the reference. So, 0 dB is the starting point.
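The dB conversions above can be sketched in a couple of lines (function names are my own):

```python
import math

def ratio_from_db(db):
    """Invert X = 20*log10(P2/R): pressure ratio P2/R for a given dB value."""
    return 10 ** (db / 20)

def db_from_ratio(ratio):
    return 20 * math.log10(ratio)

print(round(ratio_from_db(86)))   # ~19953, i.e. roughly 20,000x the reference
print(ratio_from_db(0))           # 1.0 -- same as the reference
```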
dB in different Frequencies
If we change the reference level for each frequency (because our response is different),
we have dB defined differently for each frequency. These are the dB curves based on human hearing
experiments from the 1930s, known as the Fletcher-Munson curves.
Sound Examples
Equalization
Not surprising that when we manipulate audio, we do it at varying frequencies.
We rarely manipulate the compression wave directly.
Digital Representation
• Wave-form coding
• Just sample the waveform in time
– We call this pulse-code modulation (PCM)
• Two values to control
• Number of samples to capture per second
• Quantization values (number of bits used to represent that sample)
[Figure: a sound waveform (amplitude over time) and its digitized sample points, uniform in time]
Base-Line High-Quality Sampling
• Humans can hear frequencies from about 16Hz to
20,000Hz
– That is 16 oscillations per second (low pitch - bass)
– To 20,000 oscillations per second (very high pitch – high treble)
• To capture these frequencies we need to sample at at least
twice the highest frequency
– This comes from sampling theory, known as the Nyquist rate
– So, 44,100 samples per second (44.1 kHz) is considered the baseline
• 16 bits per sample allows -32,768 to +32,767 discrete
quantities to be captured
– This seems to be sufficient bits to quantize the amplitude,
although you can read online that people will debate this
http://www.cs.columbia.edu/~hgs/audio/44.1.html
Uncompressed Audio
• CD audio uses:
– 44,100 samples per second
– 16 bits per sample
– 2 channels (stereo)
= 1,411,200 bit/s = 1,411.2 kbit/s
(Around 10 MB per minute of recording)
• This is often considered the “gold standard” for
audio encoding
– That is, we would say this is uncompressed
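The CD numbers above can be checked with simple arithmetic:

```python
# CD-quality PCM: compute the raw bitrate and storage per minute.
sample_rate = 44_100      # samples per second, per channel
bits_per_sample = 16
channels = 2              # stereo

bitrate = sample_rate * bits_per_sample * channels   # bits per second
mb_per_minute = bitrate * 60 / 8 / 1_000_000         # megabytes per minute
print(bitrate, round(mb_per_minute, 1))              # 1411200 10.6
```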
Cheap Compression (Lossy)
• Less sampling
– There will be frequencies you can’t reproduce in the sampled
signal
– Maybe OK for some sounds
• Less bits
– 8-bit, 4-bit quantization
• The resolution of the amplitudes will be coarser
• Maybe OK for some sounds?
Some Tricks for Compression
• aLaw and µLaw encoding
– Non-linear Amplitude Encoding
• Assign more bits to low amplitudes, fewer bits to higher
amplitudes
– Average bit is still 8-bits per sample, but quantization of low
amplitudes have more bits
– Can call these “log compressors”
• 8 kHz temporal sampling with aLaw or µLaw bit
assignment is often used for phone audio
– It is claimed to reduce perceivable noise in the quantization
– The quantization scheme is fixed (based on empirical studies)
• aLaw or µLaw are two different quantization standards, but the
same idea
http://en.wikipedia.org/wiki/Mu-law
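The continuous µ-law companding curve can be sketched as follows; real codecs use a fixed 8-bit table derived from it, and µ = 255 is the standard telephony value (function names are my own):

```python
import math

MU = 255  # standard µ for telephony

def mu_law_compress(x):
    """Compand a sample x in [-1, 1]: small amplitudes are expanded so
    they get relatively more of the quantizer's levels."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_law_expand(y):
    """Inverse of the companding curve."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

print(round(mu_law_compress(0.01), 2))  # 0.23 -- a 1% amplitude maps to ~23% of the range
```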
Adaptive Resolution PCM (ADPCM)
• Like log compressor
– But add an additional trick
• Encode the difference of the samples, not
just the sample
– We call this delta-pulse-code-modulation
Assume the two samples are values 13 and 14
PCM would record 13 and 14
Delta-PCM could record “1” (i.e. 14-13=1)
ADPCM claims to get almost 50% better
compression (with similar quality) than µLaw.
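The delta step above can be sketched as a lossless encode/decode pair (function names are my own):

```python
def delta_encode(samples):
    """Keep the first sample, then store successive differences."""
    return samples[:1] + [b - a for a, b in zip(samples, samples[1:])]

def delta_decode(deltas):
    out = deltas[:1]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

print(delta_encode([13, 14, 15, 13]))   # [13, 1, 1, -2] -- small values need fewer bits
```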
Variations on ADPCM
• Encoding the difference between two samples is a
simple form of "predictive coding", where you try to
predict what the next value is going to be, and encode
only the difference
• There are better predictors (like looking at a longer
history of previous samples instead of just one – i.e.
ADPCM)
• GSM 06.10 is a predictive coder standard for voice
Note that delta coders and predictive coders do not have
to be lossy. That is, if they encode the difference
exactly, no data is lost
Any of these look familiar?
Exploiting Psychoauditory Properties
• For even better compression we can
exploit properties of the human auditory
system
• Loudness
– Quiet sound around a relatively loud sound
won’t be perceived
• Frequency
– Low sound around a relatively high sound
won’t be perceived
Temporal Masking
• Masking can affect our perception of sound up to 200
milliseconds after a “masking event”
(mask=loud noise or high-frequency)
– Call this Post-Masking
• And even up to 20 milliseconds before a mask
– Call this Pre-Masking (amazing!)
– That’s right, if there is a loud sound at time T, it will affect your
perception at time T minus 20ms --- amazing!
Compressors can exploit this
[Figure: frequency bands after masking; other bands are set to 0]
MP3
• Actually part of the MPEG1 standard for audio
• Input data is at 44.1 kHz sampling and 16 bits
– (other sampling rates can be used, but 44.1 kHz is most common)
• Algorithm cuts the audio into “frames”
– Each lasts a fraction of a second
• Transforms the frame into 32 frequency bands
– Looks at these bands and determines how to distribute bits
– Bands with more activity need more bits
• Apply psychoauditory masking: throw away bands based
on human perception (see previous slide)
– This is where most of the compression comes in
– Lots of bands may become 0
• Then applies Huffman encoding to the output to squeeze
the bits a little further
http://www.mp3-converter.com/mp3codec/
MP3 Bitrate
• In MP3, we can set the bitrate
• This is considered a constant bitrate encoding,
versus a variable bitrate
• The encoder changes the amount of
quantization performed based on specified
bitrate
• This can be complex: since Huffman encoding is
also used, the encoder may need to try several
different quantization levels and run each
through Huffman encoding to see the result
– Padding with 0s is sometimes used
Other formats
• Advanced Audio Coding (AAC)
– Similar to MP3, but has a more flexible encoding standard
– Supports a larger range of temporal sampling rates
– Part of the MPEG2 and MPEG4 standards
• Ogg Vorbis
– Free, open-source codec that has become popular
– Again, similar to MP3 in its use of masking
Video
Images and Video
• Image: 2D array of pixels
• Video: Images/frames over time
Video cameras interlacing and progressive scan
• Interlaced
– Cameras sometimes
capture 2 fields to make a frame
– So, we capture 60 fields per second
• Progressive Scan
– Captures the entire frame at the same time
• 30 full frames per second
– No interlace effect
– But the motion may not look as smooth
Transform Coding
[Figure: an 8x8 image expressed as the sum of 64 single-pixel basis images]
In the pixel domain, you can consider each pixel intensity as a coefficient of a
basis image that occupies only that pixel (all other pixels are 0). For an 8x8 image, we
would have 64 basis images. Summing these basis images gives you the 8x8 image.
DCT Frequency Domain
In the frequency domain we can transform the 8x8 pixels into their corresponding frequency
representation (Discrete Cosine Transform). This also requires 64 coefficients, each one
corresponding to the amount of each basis to use. Sum all the bases up and you have the
original image.
Why Frequency Domain?
• It turns out we can remove many of the
frequencies and the viewer will not notice.
– Call this psychovisual redundancy
• Humans are not so sensitive to high-frequencies
• If we heavily quantize the high-frequency
coefficients and then transform back to the pixel
domain
– Pixel values are different, but visually we can’t tell
• This is a lossy compression and provides the
most significant gain in image compression
JPEG Approach
1) Take the original image and break it into 8x8 blocks
2) Level-shift each block: f(x,y) - 128 (normalize to the range -128 to 127)
3) Apply the forward DCT to each block
4) Quantize the DCT coefficients using a quantization "table" T(u,v):
C’(u,v) = round(C(u,v)/T(u,v))
5) Differentially code the DC component
6) Order the AC coefficients in "zig-zag" order and run-length encode (RLE) them
7) Huffman encode the result into the JPEG bitstream
JPEG applies almost every compression trick known:
1) transform coding, 2) psychovisual quantization (lossy), 3) RLE, 4) difference coding, and 5) Huffman.
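The transform-and-quantize steps can be sketched in plain Python on a toy 8x8 block. The flat quantization table here is hypothetical for illustration; real JPEG uses standard per-frequency tables:

```python
import math

N = 8

def dct2(block):
    """2-D DCT-II of an NxN block (the orthonormal form used by JPEG)."""
    def alpha(k):
        return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)
    return [[alpha(u) * alpha(v) * sum(
                block[x][y]
                * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                * math.cos((2 * y + 1) * v * math.pi / (2 * N))
                for x in range(N) for y in range(N))
             for v in range(N)] for u in range(N)]

def quantize(coeffs, table):
    """The lossy JPEG step: C'(u,v) = round(C(u,v) / T(u,v))."""
    return [[round(c / t) for c, t in zip(crow, trow)]
            for crow, trow in zip(coeffs, table)]

T = [[16] * N for _ in range(N)]                             # hypothetical flat table
block = [[2 * (x + y) for y in range(N)] for x in range(N)]  # smooth toy block
Cq = quantize(dct2(block), T)
# Most high-frequency coefficients quantize to 0 -- that is where the compression comes from.
```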
JPEG for Color Images
Convert the image from RGB to YIQ, where Y = brightness (luminance) and I, Q are chroma:
Y = 0.299 R + 0.587 G + 0.114 B
I = 0.596 R - 0.275 G - 0.321 B
Q = 0.212 R - 0.523 G + 0.311 B
The eye is less sensitive to chroma, so downsample I and Q by 2 (2:1:1 quantization of the color info).
This downsampling by 2 is a lossy compression.
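The conversion is a small matrix multiply; the coefficients below are the slide's, with the signs of the standard YIQ matrix (the function name is my own):

```python
def rgb_to_yiq(r, g, b):
    """RGB (each in [0, 1]) -> YIQ; Y is brightness, I and Q are chroma."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    i = 0.596 * r - 0.275 * g - 0.321 * b
    q = 0.212 * r - 0.523 * g + 0.311 * b
    return y, i, q

# A pure gray has no chroma: I and Q come out ~0, so they compress well.
y, i, q = rgb_to_yiq(0.5, 0.5, 0.5)
```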
JPEG Quality
• The amount of quantization applied to the
DCT coefficients amounts to a "quality"
setting
["Format Options" refer to different ways
to choose the quantization. Baseline uses
the original quantization tables specified
by JPEG. Optimize may generate tables
specific to your image.]
• JPEG at 100% quality means no
quantization
MPEG is based on JPEG
• JPEG is the foundation
– MPEG extends this to video
– Some encoders just use lots of JPEGs; we call this Motion JPEG
• MPEG incorporates motion compensation prediction to help exploit
similarity between frames
– This is done with 16x16 Macroblocks
• Two types of frames
– I frames = intra-coded (JPEG)
– P&B frames = inter-coded
• predicted frames (motion compensation + JPEG-like compression)
• P predicts itself from either I or P frames
• B predicts itself from I or P in either direction (front or back)
• Frames are organized into a series of I, P, and B frames
– Call this the Group of Pictures (GOP)
– This is the basic building block
GOP
• Group of Pictures
GOP of 12
GOP of 6
• DVD (MPEG2)
– GOP size is fixed at 18 for NTSC, 15 for PAL
Motion Compensation
Break the image into 16x16 macroblocks. Each macroblock is
broken into 4 8x8 blocks and encoded in JPEG style.
In a predictive frame, search a reference frame for a similar block.
Encode only the difference and the motion vector.
(Best case, you find exactly the same pixels, and the difference = 0.)
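The block search can be sketched as an exhaustive search with a sum-of-absolute-differences (SAD) cost, a common matching cost; the function names and the tiny frame are my own:

```python
def sad(a, b):
    """Sum of absolute differences between two equal-sized blocks."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def block_at(frame, top, left, size):
    return [row[left:left + size] for row in frame[top:top + size]]

def motion_search(ref, target, size):
    """Find the (top, left) offset in `ref` whose block best matches
    `target`. Returns (top, left, cost); cost 0 is an exact match, so
    only the motion vector needs to be encoded."""
    best = None
    for top in range(len(ref) - size + 1):
        for left in range(len(ref[0]) - size + 1):
            cost = sad(block_at(ref, top, left, size), target)
            if best is None or cost < best[2]:
                best = (top, left, cost)
    return best

ref = [[r * 8 + c for c in range(8)] for r in range(8)]   # toy reference frame
target = block_at(ref, 1, 2, 4)                            # a block that "moved"
print(motion_search(ref, target, 4))                       # (1, 2, 0)
```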
I,P,B frames
• I frames are the largest in terms of bits
• P frames are significantly smaller
• B frames even smaller
• Why don’t we just have very large GOP?
– E.g. One I frame with 127 PB frames
• Consider random access
– How can you jump to a frame in the middle of a
GOP?
• Consider if we encounter errors
– What happens on the decoder if there is an error?
MPEG Quality
• Very similar to JPEG quality
• You are setting the amount of quantization
in the DCT components
Quality control in the encoder
Errors in MPEG
• If data becomes corrupted (e.g. during
transmission), the 8x8 and 16x16 blocks
used in the encoding scheme become
apparent
Errors in MPEG
• If motion vector data becomes corrupted,
we see very strange things
This is really bad, because if one frame gets corrupted, subsequent frames
use it for reconstruction (i.e. they predicted from the correct frame; now the
frame is bad, so their prediction is wrong).
Video should become good again when an I-frame is encountered. This is one
reason for short GOPs.
CODECs
• Codecs
– Means “Coder/Decoder”
– Often when we download a codec, it is the decoder
• Codec refers to the algorithms used to encode and
decode the audio/video file
• Note that while MPEG is a standard, its exact
implementation is not specified
– Some encoders might do a better job
– Better motion compensation, better use of DCT quantization, etc
– Almost all encoders are based on MPEG or MPEG-like encoding
• I.e. Motion compensation with GOP structure
• Also, note that encoding and decoding are asymmetric
– Encoding requires much more computation than decoding
AVI, MOV, and DV
• AVI (Windows)
– Audio Video Interleave (AVI)
– A wrapper (container) format that can hold data
encoded with other codecs
• I.e. you can have AVIs with different codecs
• MOV (Apple)
– Similar; it is a wrapper or container format
– Many different codecs are supported
• DV (Digital Video) format
– Uses intraframe-only compression
• (similar to Motion JPEG)
Interleaving Audio
• Technically Audio and Video are separate
• You can have the audio in a different file or logical region
in the video file
• But this might not be good for access, especially on a
hard drive (you’d get disk thrashing)
• We instead interleave the audio data with the video data
Constant Bitrate Video
• Some video encoders support constant
bitrate
– The video is compressed at several different
qualities
– The highest quality closest to the desired
bitrate is chosen
– This means quality may change based on the
image content
• More motion and image content, lower quality
• Less motion and image content, higher quality
Temporal Downsampling
• Another way to get compression is to reduce the
framerate
– That is, show fewer frames per second
• A rule of thumb is that humans feel comfortable at 24
frames per second
– Less than this can lead to temporal aliasing where we
feel uncomfortable and motion looks unnatural
• For animations it can be as low as 4-8 frames
per second
– This is acceptable as long as the animation has
limited motion
– Once motion is too large, we see temporal aliasing
again
Spatial Downsampling
• Another easier way to gain compression is
spatial down-sampling
– Make the image smaller
• This is acceptable for internet use
– Unacceptable for TV, film, broadcast
• Keep in mind that once you down-sample
you can never get back the original
– Resizing back to the larger size does not
restore the lost details
– In fact, it often looks worse
Summary
• Compression is necessary to save space
– Even with today’s storage, uncompressed
data is too large
• Several types of compression
– Redundancy coding, RLE, transform coding,
psycho coding
• Psycho-coding allows us to “throw away”
information (lossy coding)
• This gives us the greatest “gain”
• Also is often what “quality” is associated with in
compression
• Higher quality means less “thrown away”