Lecture Notes in Te 314
By
Prof. Matthew L. Luhanga
CHAPTER ONE
INTRODUCTION
1.1 Etymology
The origin and development of the word “information”, i.e. its etymology, shows that it is a multi-faceted concept [1]. In this introductory course on information theory we shall confine ourselves to the technical meaning of the word. In the context of Lin [1], we confine ourselves to the domains of syntax and empirics.
Syntax and empirics represent, respectively, the formalism used in representing a message and the study of signals (i.e. signs and symbols) used to carry a message. Indeed, human communication can be described in these terms [2].
Information theory is the cornerstone of the theory and practice of electrical and optical communication. It describes the context within which electrical and optical communication systems can be studied and understood.
Engineers are interested in information theory as an important discipline in the design, operation and optimization of communication systems and networks. Elementary concepts of information theory to be covered in this course are shown in the annex to this chapter.
As alluded to previously, the concept of information has many attributes. One Web-based
definition of information is [9]:
2. Knowledge gained through study, communication, research, instruction, etc.; factual data: “His wealth of general …”
3. The act or fact of informing.
This broad definition can be narrowed by classifying information into two broad categories [10]:
Elementary information
Advanced information
Elementary information is related to technical details such as, for example, the frequency of observing certain letters of the alphabet in a long text in a given language.
Advanced information is related, for example, to understanding the context of the text. Thus, for example, the information a reader gets from reading a text on self-similar processes in telecommunications is very different from the information the reader would get from reading a book by, say, Shaaban Robert.
Advanced information involves the understanding of content in, possibly, multimedia messages.
This necessarily involves the brain, a very complex organ which varies from individual to
individual.
The complexity of the brain has made it difficult, so far, to find a measure for advanced
information. We will confine ourselves to a discussion involving elementary information for
which an information measure exists.
The concept of information and its quantification (measure) on which we will focus in this course is one which is useful in the modeling, design and dimensioning of telecommunication systems.
Information theory, as an engineering discipline, is often taken to have begun with work by Harry Nyquist in the 1920s. In a paper written in 1924 [3], Nyquist suggested that there were two factors which determined the speed of transmission of “intelligence”: the shape of signals used in the transmission of intelligence and the choice of a code used to represent the intelligence.
Nyquist’s paper derives the following formula:
W = k log m
where W is the speed of transmission of intelligence, m is the number of distinct current values (symbols) that can be transmitted and k is a constant.
The words “speed of transmission of intelligence” were meant by Nyquist to mean “the number of characters, representing letters, figures, etc., transmitted in a given unit of time”.
Nyquist used his formula to compute the theoretical speed of transmission of intelligence and
compared his results to practical speeds obtained by using different codes in a variety of media.
What was being transmitted – the intelligence- was not the message in the form of letters, figures
etc, but a representation of the message by using characters. Intelligence was thus embedded in
the representation of messages.
A few years later, in 1928, without acknowledging the work of Nyquist, R. V. L. Hartley [11] expanded the ideas related to “information” rather than “intelligence”. Hartley developed the concept of information based on “…physical as contrasted to physiological considerations” for use in electronic communication. Although Hartley never defined the concept of information (he called it an elusive term), he addressed the “…amount of information”. Information existed in the transmission of symbols, with the symbols having “…certain meaning to the parties communicating”. When someone received information, each received symbol sequence allowed the recipient to “exclude possibilities”, excluding other possible symbol sequences which could have been transmitted and their associated meanings. The precision of information depended upon what other symbol sequences might have been chosen to be transmitted.
The measure of these other sequences provides an indication of the amount of information
transmitted. It was then suggested that use be made “…as our measure of information the
logarithm of the number of possible symbol sequences” [11]
Hartley quantified his information measure by first assuming that each transmitted message consisted of n symbols transmitted consecutively. At each transmission instant, one symbol, out of an alphabet of m symbols, was chosen for transmission. There are thus m^n possible symbol sequences which could be transmitted. Assuming that each symbol sequence was equally likely to be transmitted, the information H in a transmission was computed to be [11]:
H = log m^n = n log m …………………………………………1.1
Hartley was the first person to use the symbol H to represent the amount of information.
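Hartley’s measure is simple enough to evaluate directly. The short Python sketch below (an illustration added to these notes, not part of the original text) computes H = n log2 m in bits; the alphabet size m = 26 and sequence length n = 10 are arbitrary illustrative values.

```python
import math

def hartley_information(m: int, n: int) -> float:
    """Hartley's measure: the logarithm of the number of possible
    sequences, H = log2(m**n) = n*log2(m), expressed in bits."""
    return n * math.log2(m)

# Example: sequences of n = 10 symbols drawn from an m = 26 letter alphabet
print(hartley_information(m=26, n=10))   # about 47 bits
```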
During World War II, Claude E. Shannon developed a model of a communication process based on the earlier work of Nyquist and Hartley. This earlier work dealt with technical aspects of information, i.e. it dealt with elementary information.
Shannon’s paper, published in 1948 in The Bell System Technical Journal and entitled “A Mathematical Theory of Communication”, established information theory as a mathematical discipline and laid a foundation for all future work in information theory.
1.3 Shannon’s 1948 Paper
Before the appearance of Shannon’s paper in 1948 communication Engineering was viewed
differently depending on the medium and methodology used in communication. No concept
existed which unified the communication systems which were in existence in 1948. Between the
1880s and 1948 the following communication systems and technologies had been been
developed [7]:
Telegraph (Morse,1830’s)
Telephone (Bell, 1870)
Wireless Telegraph (Marconi,1897)
AM Radio (1900’s)
Single-Sideband Modulation (Canson,1922)
Television (1920’s)
Teletype (1931)
Frequency Modulation (Armstrong, 1936)
Pulse-Code Modulation (1930’s)
Vocoder (1939)
Spread Spectrum (1940’s)
The introduction of information theory with its detailed treatment of the concept of
“information” made it possible to see the connection between these disparate fields and to bind
them together.
Although no unifying theory to bring these fields together existed prior to 1948, some ingredients which would be key elements of a mathematical theory of communication could be seen in these systems. Some of the elements are [8]:
The Morse code, a methodology for source coding which ensured efficient transmission by encoding sources of English text in accordance with the frequency of occurrence of the letters of the English alphabet. We shall return to this point later.
Pulse-code modulation (PCM), which can be considered to be the first “digital” communication system.
Shannon built on Hartley’s thesis that information, as an engineering concept, is contained in the number of possible symbol sequences which an information source could transmit.
This viewpoint of Hartley, which led to eqn 1.1, is echoed by Shannon when he states that [4]:
…the fundamental problem of communication is that of reproducing at one point either exactly
or approximately a message selected at another point. Frequently the messages have meaning;
that is they refer to or are correlated according to some system with certain physical or
conceptual entities. These semantic aspects of communication are irrelevant to the engineering
problem. The significant aspect is that the actual message is one selected from a set of possible
messages…
Noise
A key addition by Shannon to the work of Nyquist and Hartley was the integration of noise into the communication system. The addition of noise changed messages so that what was received was not necessarily what was transmitted.
The transmitter does not transmit the messages themselves. It transmits a coded form of the messages. This coding, which takes place before transmission, is called source coding. Source coding uses symbol occurrence probabilities (see, for example, the Morse code) in message representation.
Before transmission, the coded source messages undergo further processing to mitigate the effects of noise in the channel. This processing is called channel coding.
Shannon’s paper of 1948 addressed both source coding and channel coding. These considerations expand the block diagram of a communication system given in Fig. 1.1 into Fig. 1.2.
[Fig. 1.2: Block diagram of a communication system: Source → Source Encoder → Transmitter/Modulator → Channel (with Noise) → Demodulator/Receiver → Source Decoder → Sink]
Source coding and channel coding are two important concepts at the center of Shannon’s 1948
paper. That paper dealt with two fundamental questions:
i. What is the maximum limit to which redundancy in a source of signals can be reduced?
ii. What is the maximum transmission rate for reliable (i.e. error-free) communication over a noisy channel?
Answers to these two questions yield fundamental limits in information theory. Arriving at these limits involves:
i. A quantification of the concept of information
ii. The use of coding for reducing source redundancy
iii. A quantification of the concept of channel capacity
iv. The use of coding for reliable and efficient communication
The two quantifications provide a basis for answering the fundamental questions in Shannon’s paper. The two fundamental questions are best seen in relation to the block diagram of a communication system of Fig. 1.2.
Building on Shannon’s work reported in 1948, information theory has been extensively
diversified. An example of this diversity is depicted in Fig. 1.3. The areas in Fig 1.3 which we
will touch on in this course are:
Lossless compression
Channel capacity
Error detection and correction coding
The development of non-Shannon information measures has motivated the use of information-theoretic measures in other disciplines, as shown in Table 1.1 [5,6]. These applications represent the vistas beyond classical information theory. As Fig. 1.3 shows, information theory is a very broad discipline.
[Fig. 1.3: The breadth of information theory: signal processing, lossless/lossy compression, space-time communication, network information theory, channel capacity, coding theory, channel coding, coding with side information.]
[Table 1.1: Applications of information-theoretic measures in other disciplines: thermodynamics; condensed matter physics; economics (portfolio theory, Kelly gambling); statistics; reassociation of CO in heme-proteins; studies of turbulence; coding; encryption; modulation; detection; fundamental limits.]
CHAPTER TWO
SOURCE CODING
2.1 Introduction
Fig. 1.2 from Chapter One shows that both source coding and channel coding are effected on the transmission side, with the corresponding decoding operations being performed on the reception side. Since we are dealing with digital communication systems, the source messages are first digitized into bit streams.
Note that Fig. 1.2 implies that source coding and decoding and channel coding and decoding are done sequentially. Source coding reduces the bits to be transmitted or stored by reducing the source redundancy. Channel coding, on the other hand, adds bits which are used to improve the reliability of data transmission. There are two ways of handling the combined source and channel coding problem:
i. Exploit the relevant separation theorem and carry out the optimization by sequentially performing source and channel coding and decoding
ii. Perform joint source-channel coding. This is achieved by finding an optimal bit allocation between source coding and channel coding for given channel loss characteristics, and designing the source encoder and channel encoder to meet the target source rate and channel robustness, respectively.
The separation theorem and joint source-channel coding are beyond the scope of this course
Source encoding is the representation of source messages by binary digital sequences. Efficient source encoding can be attained if the source statistics are known.
Assume some source output symbols are more probable than others. Intuitively, it is seen that an efficient source encoder is one which assigns long code words to rare output symbols and shorter code words to frequent symbols. Such a code is referred to as a variable-length code. The Morse code is one such code.
One binary digit representation of the Morse Code as a variable length code is given in the
problems at the end of this chapter.
Note that there is dependency between the various letters of a language source. The dependency between the letters may be exploited to improve the efficiency of representation. Thus, for example, a person at the receiving end of a communication will be able to decode the following received message, making it unnecessary for the source to transmit the omitted letters.
The meaning of the sentence is clear even though about 40 percent of the letters are left out.
The frequency of letters and their dependence on one another has been exploited in the design of
the layout of typewriters and computer keyboards for several languages. Go to Google to read
further on this.
The statistics of single letters of a source of English are available in [12]. Similar statistics for Kiswahili are yet to be generated. A problem at the end of this chapter responds to this challenge.
The dependency of single letters on one another, or of combinations of two letters (digraphs) or three letters (trigraphs), etc., has been exploited in the decryption of ciphertext messages. Interested students may Google this.
ii. Reduces redundancy, i.e. there is a compact representation of the source outputs in the one-to-one mapping. This is referred to as data compression.
The “data compression” measure for a source is the rate, in symbols per second, needed to fully represent the source output sequence, and finally reconstitute it at the source decoder output, using fewer symbols than those in the original message.
“Data compression” relates to sources of text, audio signals, real-time audio signals and
multimedia signals. Real-time voice and video traffic refers to traffic in which bits are
periodically generated thus providing a constant flow (stream) of traffic.
In the discrete communication systems under consideration, source output symbols are represented by binary symbols. The representation must enable the recovery of the original symbols from their binary representations while using as few binary symbols as possible to represent the source output symbols. A representation achieving this is called an efficient code.
Definition of a DMS
A discrete memoryless source (DMS) is a source which emits symbols from a finite alphabet u = {u1, u2, …, uM} with probabilities Pk = Pr{u = uk}, k = 1, 2, …, M. A source output at a particular time is independent of all preceding and succeeding outputs. This is the memoryless property. Later we will discuss sources with memory.
Consider a DMS with an output random sequence of length N, u = {u1, u2, …, uN}. Each of these output sequences is represented by a binary sequence x = {x1, x2, …, xL} of length L, where the xi’s are 0’s and 1’s. For fixed N, the set of binary sequences (code words) corresponding to the M^N possible output sequences is called a code.
For uniquely decodable codes, there must be a one-to-one mapping between source output sequences and the corresponding binary sequence representations. If the binary representation can be reconstituted into the original source output symbols exactly, we say that the coding and decoding operation is lossless.
The concept of information and its quantification is basic to the analytical modeling of sources. A
quantification of the concept of information is based on the earlier work of Nyquist, Hartley and
Shannon [3,4,11].
The work of Nyquist, Hartley and Shannon led to the following principle underlying the concept
of information:
i. The more uncommon or surprising an event is, the more information it conveys. The
statement “The sun will not rise tomorrow” conveys more information than the statement
“The sun will rise tomorrow”. Thus an information measure must be related to
uncertainty about the occurrence of an event.
Consider a sample set of N possible events. Let the probability that event i has occurred be denoted by Pi. Normalization yields
∑_{i=1}^{N} Pi = 1 ………………………………………………….. 2.1
We do not know which event will occur. But suppose that one of these events, say j, finally takes
place. Then the occurrence of event j clearly has made us gain information, because before the
event occurred we did not know which event would occur.
Suppose that the probability Pj of the occurrence of event j is close to 1. This means that we gain very little information by the observed occurrence of event j, because this event was very likely anyway (assume, for example, that j is the event of the “sun rising tomorrow”). On the other hand, if Pj is close to zero, then we gain a lot of information by the actual occurrence of event j, because we did not really expect this event to happen (assume, for example, that j is the event that “the sun will not rise tomorrow”).
The information gain due to the occurrence of a single event j can be measured by a function H(Pj), which should be close to zero for Pj close to 1. For example, we could choose
H(Pj) = log (1/Pj) ………………………………………………….. 2.2
The definition of the information measure given in equation 2.2 above has the following intuitively appealing properties:
iv. H(Pi∩j) = H(Pi) + H(Pj) for events i and j statistically independent, i.e. Pi∩j = Pi·Pj
The base of the logarithm is arbitrary. It is standard practice, however, in information theory to use logarithms to base 2. The unit of information is then called the bit (short for “binary digit”). We then write
H(Pk) = log2 (1/Pk) bits ………………………………………………….. 2.3
Entropy
Assume that we have a long sequence of occurrences of events. Then the average information gained from the sequence of observed events is given by
H(u) = ∑_{k=1}^{N} H(Pk) Pk ……………………………………2.4
where Pk = Pr{u = uk}.
Shannon [4] used letter frequencies for English to compute the entropy of English. The Entropy
of Kiswahili can be computed in a similar manner. Luhanga and Nyambo [] have computed the
entropy of Kiswahili using a convergent gambling approach.
Many information measures studied in the literature are of this simple form. But other forms are
possible as well [10]. If the events represent the sequence of output symbols of a DMS then the
units of H(u) are bits/symbol. Combining equations 2.3 and 2.4 we get
H(u) = −∑_{k=1}^{N} Pk log2 Pk …………………………………………2.5
Note the close resemblance of eqn 2.5 to the entropy equation of thermodynamics. Based on this
correspondence, the term entropy is also used as a measure of average information.
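As a quick illustration of eqn 2.5, the following Python sketch (an addition to these notes, not part of the original text) computes the entropy of an arbitrary discrete distribution; the four-symbol distribution used in the example is illustrative only.

```python
import math

def entropy(probabilities):
    """Shannon entropy H(u) = -sum_k p_k log2 p_k in bits/symbol.
    Zero-probability symbols contribute nothing (0 log 0 = 0)."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Example: a four-symbol source
print(entropy([0.5, 0.25, 0.125, 0.125]))   # 1.75 bits/symbol
```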
Khinchin formulated four axioms which describe the properties an information measure I(u)
should have [10]. We follow convention by now using the letter S to represent entropy.
Axiom 1
S(u) = f(P1, P2, …, PN) ……………………………………………………...2.6
This axiom means that the information measure S(u) depends only on the probabilities Pi of
the events and on nothing else
Axiom 2
S(W^{-1}, W^{-1}, …, W^{-1}) ≥ S(P1, P2, …, PN) ………………………………………………….2.7
This axiom means that the information measure S(u) takes on an absolute maximum for the
uniform distribution:
Pi = 1/W for all i,
and that any other probability distribution has an information content that is less than or equal
to that of uniform distribution. The uniform distribution corresponds to the maximum uncertainty
about the events.
Axiom 3
S(P1, P2, …, PN, 0) = S(P1, P2, …, PN) ………………………………………………….2.8
This axiom means that the information measure S(u) should not change if the sample set of events is enlarged by another event that has probability zero.
Axiom 4
S{P_{ij}^{AB}} = S{P_i^A} + ∑_i P_i^A S{P^B(j/i)} …………………………………………………2.9
This axiom deals with a compound system made up of two subsystems A and B (not necessarily statistically independent). The probabilities of the first subsystem are P_i^A, those of the second subsystem are P_j^B, and those of the compound system (A, B) are denoted by the joint probabilities
P_{ij}^{AB} = P_i^A P^B(j/i)
where P^B(j/i) is the conditional probability of event j in subsystem B given that event i has already occurred in subsystem A. S{P^B(j/i)} is the conditional information of subsystem B formed with the conditional probabilities P^B(j/i), i.e. under the condition that subsystem A is in state i.
This axiom states that if we first collect the information in subsystem B, assuming that subsystem A is in state i, then the information in the compound system (A, B) is the information in subsystem A added to the sum of the conditional information of subsystem B given that subsystem A is in state i, with the sum taken over all events i in subsystem A, weighted with the probabilities P_i^A.
For the special case that subsystems A and B are statistically independent the probability of the
compound system factorizes as
P_{ij}^{A,B} = P_i^A P_j^B …………………………………………2.10
and only in this case does Axiom 4 reduce to the additivity rule for information in statistically
independent subsystems
S{P_{ij}^{A,B}} = S{P_i^A} + S{P_j^B} …………………………………..2.11
Note that equation 2.9 can be developed from first principles as follows
S{P_{ij}^{A,B}} = −∑_i ∑_j P_{ij}^{A,B} log P_{ij}^{A,B}
But
P_{ij}^{A,B} = P_i^A P^B(j/i)
and hence
S{P_{ij}^{A,B}} = −∑_i ∑_j P_i^A P^B(j/i) log{P_i^A P^B(j/i)}
= −∑_i ∑_j P_i^A P^B(j/i) {log P_i^A + log P^B(j/i)}
= −∑_i ∑_j P_i^A P^B(j/i) log P_i^A − ∑_i ∑_j P_i^A P^B(j/i) log P^B(j/i)
= −∑_i P_i^A log P_i^A ∑_j P^B(j/i) − ∑_i P_i^A ∑_j P^B(j/i) log P^B(j/i)
= S{P_i^A} + ∑_i P_i^A S{P^B(j/i)}
since ∑_j P^B(j/i) = 1.
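The identity just derived can be checked numerically. The sketch below (illustrative code added to these notes, with an arbitrarily chosen joint distribution) computes the joint entropy S{P_ij^AB} directly and compares it with S{P_i^A} + ∑_i P_i^A S{P^B(j/i)}.

```python
import math

def shannon(ps):
    """Shannon entropy (base 2) of a probability list."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

# An arbitrary joint distribution P[i][j] for subsystems A (rows) and B (columns)
P = [[0.10, 0.20, 0.10],
     [0.05, 0.25, 0.30]]

joint = shannon([p for row in P for p in row])       # S{P_ij^AB}
PA = [sum(row) for row in P]                         # marginal probabilities of A
conditional = sum(PA[i] * shannon([P[i][j] / PA[i] for j in range(len(P[i]))])
                  for i in range(len(P)))            # sum_i P_i^A S{P^B(j/i)}

print(joint, shannon(PA) + conditional)   # the two values agree
```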
The Shannon Entropy
In his 1948 paper, Shannon derived the following expression for the entropy (average
information) of a source.
S = −k ∑_{i=1}^{W} Pi ln Pi …………………………. 2.12
It is left as an exercise to the reader to verify that the Shannon entropy satisfies all four of the Khinchin axioms, with Axiom 4 being given by equation 2.9.
Note that the Shannon entropy can be obtained from eqn 2.4 by setting H(Pk) = −log Pk.
In equation 2.12, k represents Boltzmann’s constant. For the uniform distribution Pi = 1/W for all i, the Shannon entropy takes on its maximum value
S = k ln W ………………………..2.13
Maximizing the Shannon entropy subject to suitable constraints leads to ordinary statistical mechanics, as first shown by E. T. Jaynes [14,15].
Jaynes’ maximum entropy method is the major foundation for the vistas beyond classical information theory indicated in Fig. 1.3. In the sequel we shall confine ourselves to the Shannon entropy formula as given by equation 2.12, with the logarithms taken to base 2. Then we have
S = H(u) = ∑_{k=1}^{N} pk log (1/pk)  bits/symbol
= −∑_{k=1}^{N} pk log pk …………………………..2.14
Note that 0 log 0 = lim_{λ→0} λ log λ = 0.
From now on, all expressions involving logarithms will be taken to be using logarithms to base 2.
It may be shown that the Shannon entropy measure is the only entropic form which satisfies, up to an arbitrary multiplicative constant, all four of the Khinchin axioms in which Axiom 4 is given by eqn 2.9.
We may relax Axiom 4 by replacing it with the less stringent equation 2.11. In that case one ends up with other information measures such as the Renyi entropies:
S_α^{(R)} = (1/(1−α)) ln ∑_i Pi^α ……………………………2.15
The Renyi entropies satisfy Khinchin Axioms 1-3 and the additivity condition of eqn 2.11.
Indeed, it can be shown that they follow uniquely, up to multiplicative constant, from these
conditions [10]. For α →1 the Renyi entropies reduce to the Shannon entropy.
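The α → 1 limit can be seen numerically. The sketch below (an illustration added to these notes) evaluates the Renyi entropy of eqn 2.15 for several values of α and compares the results with the Shannon entropy of the same distribution (both in nats).

```python
import math

def renyi_entropy(ps, alpha):
    """Renyi entropy S_alpha = ln(sum_i p_i**alpha) / (1 - alpha), in nats."""
    return math.log(sum(p ** alpha for p in ps)) / (1.0 - alpha)

def shannon_nats(ps):
    return -sum(p * math.log(p) for p in ps if p > 0)

ps = [0.5, 0.3, 0.2]
for alpha in (0.5, 0.9, 0.99, 0.999, 2.0):
    print(alpha, renyi_entropy(ps, alpha))
print("Shannon:", shannon_nats(ps))   # the alpha -> 1 values approach this
```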
Many other entropic forms are available in the literature [13]. Another entropic form, for example, may be shown to follow uniquely (up to an arbitrary multiplicative constant) from Khinchin axioms 1 to 3 with Axiom 4 replaced by the following more general version:
S_q^{A,B} = S_q^A + S_q^{B/A} + (1−q) S_q^A S_q^{B/A} ………………………2.16
This entropic form is the Tsallis entropy
S_q = K (1 − ∑_i Pi^q)/(q − 1) ………………………2.18
where K is a constant and q is a real parameter.
In equation 2.16, S_q^{B/A} is the conditional entropy formed with the conditional probabilities P^B(j/i) and averaged over all states i using the escort probabilities P̃_i:
S_q^{B/A} = ∑_i P̃_i S_q{P^B(j/i)} ………………………2.19
P̃_i = Pi^q / ∑_j Pj^q ………………………2.20
For q = 1 the new Axiom 4 reduces to the old Khinchin Axiom 4 (i.e. eqns 2.9 and 2.19):
S_1^{A,B} = S_1^A + S_1^{B/A} ………………………2.21
For q ≠ 1 and A and B statistically independent, the new Axiom 4 reduces to the pseudo-additivity condition
S_q^{A,B} = S_q^A + S_q^B + (1−q) S_q^A S_q^B ………………………2.22
since in this case
S_q^{B/A} = S_q^B
For q = 1 and A and B statistically independent, the pseudo-additivity condition reduces to the additivity condition of equation 2.11:
S_1^{A,B} = S_1^A + S_1^B
In this case (i.e the case of q=1) the Tsallis entropy reduces to the Shannon entropy.
Maximizing the Tsallis entropy subject to appropriate constraints may be shown to lead to
nonextensive statistical mechanics.This is the approach which largely account for the extension
of information-theoretic terms into other fields as shown in Table 1.1 [6].
The derivation of the entropy of a discrete memoryless source requires the following two
lemmas.
Lemma 2.1
ln x ≤ x-1 ……………………………………2.23
Proof
Note that ln x is a concave function of x. Remember that concavity of a twice-differentiable function f(x) means f″(x) < 0 for all x.
Note that the curve of y=ln x lies below the curve of y=x-1 except at the point where x=1 where
the curves touch. This proves the lemma.
Lemma 2.2
For two probability distributions pk and qk defined over the same alphabet,
∑_k pk log2 (qk/pk) ≤ 0 ………………………2.24
Proof
∑_k pk log2 (qk/pk) = (1/log_e 2) ∑_k pk log_e (qk/pk)
Using Lemma 2.1 we get:
∑_k pk log2 (qk/pk) ≤ (1/log_e 2) ∑_k pk (qk/pk − 1)
= (1/log_e 2) ∑_k (qk − pk)
= (1/log_e 2) (∑_k qk − ∑_k pk) = 0
which proves the lemma.
Example 2.1
For a DMS with alphabet u = {0, 1}, p(0) = p and p(1) = 1 − p (i.e. a binary memoryless source, BMS),
H(u) = p log (1/p) + (1−p) log (1/(1−p))
H(u) is called the binary entropy measure.
H(u) ≤ 1 with equality iff p(0) = p(1) = 1/2; in that case each output of the BMS carries 1 bit of information per symbol.
The entropy measure, H(u), for a BMS is defined over a source alphabet {0, 1}. One can use the
entropy measure to define a mathematical function
F(p) = p log (1/p) + (1−p) log (1/(1−p))
The function is defined over the interval p ∈ (0, 1).
A plot of the function is shown in Fig. 2.1. The figure represents what is termed the entropy function. The entropy function is frequently encountered in information-theoretic studies.
24
ln x
-2
-1
2 4 x-1
1
Fig 2.1. The Entropy Function
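The entropy function F(p) is easy to tabulate. The following Python sketch (added for illustration) evaluates F(p) at a few points; the values confirm the symmetry about p = 1/2 and the maximum of 1 bit there.

```python
import math

def entropy_function(p: float) -> float:
    """F(p) = p log2(1/p) + (1-p) log2(1/(1-p)), defined for 0 < p < 1."""
    return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

# F(p) is symmetric about p = 0.5, where it peaks at 1 bit
for p in (0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.99):
    print(f"p = {p:<5}  F(p) = {entropy_function(p):.4f} bits")
```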
Consider a source producing blocks of N successive source symbols. We may consider each
block as being produced by an extended source with a source alphabet of size M^N, where M is the number of distinct symbols in the source alphabet of the original source:
u={u1,u2…,uN} ……………..2.26
In a DMS, the N symbols in a block are iid and hence the probability of a block of symbols is
equal to the product of the symbol probabilities of symbols in the original source:
p_N(u) = ∏_{i=1}^{N} p(u_i) ………………………2.27
H(u^N) = ∑_{u ∈ u^N} p_N(u) log (1/p_N(u)) ………………………2.28
In equation 2.28, p(u_i) can be obtained as a marginal probability of p_N(u) as follows:
p(u_i) = ∑_{u_1} ⋯ ∑_{u_n, n ≠ i} ⋯ ∑_{u_N} p_N(u) ………………………2.29
i.e. the sum is taken over all u_n with n ≠ i.
Lemma 2.3
Suppose u^N = {u1, u2, …, uN} represents a block of output symbols of a DMS of length N. If the random variables u1, …, uN are i.i.d. then
H(u^N) = N H(u) ………………………2.30
Proof
Let the probability distribution of the N output symbols be defined by p_N(u). The average entropy for the N symbols, H(u^N), is given by
H(u^N) = ∑_u p_N(u) log (1/p_N(u)) ………………………2.31
But since the N symbols are i.i.d.,
p_N(u) = ∏_{k=1}^{N} p(u_k)
where p(u_k) is the individual symbol probability. Hence
H(u^N) = ∑_u p_N(u) log (1/∏_{l=1}^{N} p(u_l))
= −∑_u p_N(u) ∑_{l=1}^{N} log p(u_l)
Interchanging the order of summation,
H(u^N) = −∑_{l=1}^{N} (∑_u p_N(u) log p(u_l)) ……………2.32
Using equation 2.29 to carry out the inner sum over all u_n with n ≠ l, each inner term reduces to −∑_{u_l} p(u_l) log p(u_l) = H(u). Hence
H(u^N) = N H(u)
The average entropy of a block of N i.i.d. random variables is equal to the sum of the entropies of the N individual random variables, H(u).
Example 2.2
Consider a DMS with alphabet u = {u1, u2, u3} and symbol probabilities p1 = 2^{−2}, p2 = 2^{−2}, p3 = 2^{−1}, where pk = Pr(u = uk). Its entropy is
H(u) = (1/4) log 4 + (1/4) log 4 + (1/2) log 2
= 1.5 bits/symbol
Example 2.5
Consider the same DMS, with
p1 = 2^{−2},
p2 = 2^{−2},
p3 = 2^{−1},
where pk = Pr(u = uk).
Now consider an extended DMS whose output symbols consist of two of the original symbols taken in sequence (i.e. N = 2). The alphabet of the extended source then has M^N = 3^2 = 9 symbols.
The possible output symbols of the extended DMS and their probabilities are:
u1u1: 2^{−2}·2^{−2} = 2^{−4}
u1u2: 2^{−2}·2^{−2} = 2^{−4}
u1u3: 2^{−2}·2^{−1} = 2^{−3}
u2u2: 2^{−2}·2^{−2} = 2^{−4}
u2u1: 2^{−2}·2^{−2} = 2^{−4}
u2u3: 2^{−2}·2^{−1} = 2^{−3}
u3u3: 2^{−1}·2^{−1} = 2^{−2}
u3u1: 2^{−1}·2^{−2} = 2^{−3}
u3u2: 2^{−1}·2^{−2} = 2^{−3}
H_N(u) = 4·(2^{−4})·4 + 4·(2^{−3})·3 + 1·(2^{−2})·2
= 3 bits per extended symbol
But H(u) = 1.5 bits/symbol and N = 2. Therefore
H_N(u) = N·H(u)
as expected.
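The result H_N(u) = N H(u) can also be verified by brute force. The sketch below (illustrative code, not part of the original notes) enumerates all nine blocks of the N = 2 extended source of this example and compares the block entropy with 2 H(u).

```python
import math
from itertools import product

def entropy(ps):
    return -sum(p * math.log2(p) for p in ps if p > 0)

# The DMS of the example: p1 = p2 = 1/4, p3 = 1/2
p = {"u1": 0.25, "u2": 0.25, "u3": 0.5}
N = 2

# probabilities of all M**N = 9 blocks of the extended source
block_probs = [math.prod(p[s] for s in block) for block in product(p, repeat=N)]

print(entropy(list(p.values())))   # 1.5 bits/symbol
print(entropy(block_probs))        # 3.0 bits per extended symbol = N * H(u)
```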
Since digital sources contain redundant information, lossless data compression is used to ensure efficient utilization of storage and transmission facilities while guaranteeing that the original source symbol sequence can be reconstructed exactly from its binary sequence representation.
If the original source sequence can be reconstituted exactly, then the data compression is said to be lossless.
Source encoding involves a one-to-one mapping between source output sequences and their binary symbol representations. To ensure efficiency in the encoding process, source statistics are exploited to yield a code which uses the lowest possible average number of binary symbols (bits) needed to uniquely represent the source symbols. The limit on this average number of bits is given by Shannon’s Source Coding Theorem.
2.5.1 Shannon’s Source Coding Theorem
Shannon’s Source Coding Theorem establishes a fundamental limit on the average codeword length for a source. We confine ourselves to the noiseless source coding theorem for discrete memoryless sources.
The source coding theorem states that a uniquely decodable binary code exists which meets a
fundamental limit. The source coding theorem does not indicate how such a code can be
obtained.
Theorem 2.1 The Noiseless Coding Theorem for Discrete Memoryless Sources.
Consider a DMS of alphabet u and entropy H(u) bits per symbol. For source sequences of length N (N = 1, 2, …) there exists a uniquely decodable binary code with code words of length l_N(u), u ∈ u^N, where u^N is the alphabet of an extended source made of source sequences of length N, and the average length of the code words is bounded by
N H(u) ≤ ⟨L_N⟩ < N H(u) + o(N) ………………………2.33
Note that ⟨L_N⟩ represents the mean (average) value of the codeword lengths l_N(u).
The direct half of the theorem as represented by equation 2.33 is proved by constructing a
uniquely decodable code which meets the average length bound.
(ii) A technique due to Huffman (1952) produces an optimal code in that the code minimizes
the average code word length for any N.
Below we adapt a method due to Viterbi and Omura (1979) which introduces the Asymptotic Equipartition Property (AEP) [17]. The proof of the theorem will require the following definitions and lemmas.
The central concepts in source coding are those of typical sequences and typical sets. All output
source sequences are divided into typical sequences which are elements of typical sets with the
rest of the source output sequences assumed to belong to a set which is complementary to the
typical set.
Definition 2.2: Typical Sequence
A typical sequence is a source output sequence
u^N = {u1, u2, …, uN}
such that
p_N(u) ≈ 2^{−NH(u)} ………………………2.36
Example
Assume that a source produces a random binary sequence with N = 20, Pr{u = 1} = p = 0.1 and Pr{u = 0} = 1 − p = 0.9.
(i) H(u) = −p log p − (1−p) log (1−p) = 0.47 bits/symbol
(ii) 2^{−NH(u)} = 0.0015
(iii) If the sequence consists of 13 “1”s and 7 “0”s, then p_N(u) = p^13 (1−p)^7 ≈ 10^{−14}. This differs markedly from the result in (ii), thus indicating that this is not a typical sequence.
(iv) If the sequence consists of 2 “1”s and 18 “0”s, then p_N(u) = p^2 (1−p)^18 = 0.0015. Comparing to (ii), we conclude that this is a typical sequence.
(v) If the sequence consists of all “0”s, then p_N(u) = (1−p)^20 = 0.12, leading to the conclusion that this is not a typical sequence.
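The numbers in the example above are easily reproduced. The sketch below (added for illustration) computes H(u), 2^{−NH(u)} and the block probabilities for the three cases considered.

```python
import math

p, N = 0.1, 20                                   # Pr{u = 1} = 0.1, block length 20
H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
print("H(u) =", round(H, 3), "bits;  2^(-N H) =", 2 ** (-N * H))

def block_probability(num_ones):
    """Probability of one particular length-N sequence containing num_ones 1s."""
    return p ** num_ones * (1 - p) ** (N - num_ones)

for k in (13, 2, 0):                             # cases (iii), (iv) and (v) above
    print(f"{k:2d} ones: P_N(u) = {block_probability(k):.3e}")
```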
Note that in a long binary sequence of length N with the probability of a “1” being p, the number of “1”s is approximately Np and the number of “0”s is approximately N(1−p).
If each of the source symbols is chosen from a source alphabet of size M, then there are M^N possible source output sequences of length N. Since there are approximately 2^{NH(u)} typical source sequences of length N, and H(u) ≤ log2 M, the typical sequences form, in general, only a small fraction of all the possible sequences.
A typical set, S(N, ζ), is a set whose elements are typical sequences, i.e. sequences whose probabilities satisfy
p_N(u) ∈ [2^{−N{H(u)+ζ}}, 2^{−N{H(u)−ζ}}] ………………………2.38
Linking this to the definition of a typical sequence, we note that a typical set has the following properties:
(i) The typical sequences making up the elements of a typical set are nearly equiprobable, with probabilities in the neighbourhood of 2^{−NH(u)}
(ii) The number of elements in a typical set is nearly 2^{NH(u)}
We next show that a typical set occurs with a probability near 1 (that is why it is typical!). To accomplish this we need the following three lemmas.
Lemma 2.4 (Chebyshev’s Inequality)
Let u be a random variable with mean μ and variance E{(u − μ)²} = σ², and let ζ be any positive real number. Then
Pr{|u − μ| > ζ} ≤ σ²/ζ² ………………………2.39
Proof
σ² = ∑_{u_i=−∞}^{∞} (u_i − μ)² P(u_i)
= ∑_{u_i=−∞}^{μ−ζ} (u_i − μ)² P(u_i) + ∑_{u_i=μ−ζ}^{μ+ζ} (u_i − μ)² P(u_i) + ∑_{u_i=μ+ζ}^{∞} (u_i − μ)² P(u_i) ………2.40
Since (u_i − μ)² and P(u_i) are non-negative,
σ² ≥ ∑_{u_i=−∞}^{μ−ζ} (u_i − μ)² P(u_i) + ∑_{u_i=μ+ζ}^{∞} (u_i − μ)² P(u_i)
In both of these summations (u_i − μ)² ≥ ζ², and so
σ² ≥ ζ² { ∑_{u_i=−∞}^{μ−ζ} P(u_i) + ∑_{u_i=μ+ζ}^{∞} P(u_i) } ………2.41
But
∑_{u_i=−∞}^{μ−ζ} P(u_i) + ∑_{u_i=μ+ζ}^{∞} P(u_i) = Pr{|u − μ| > ζ} ………2.42
and hence
Pr{|u − μ| > ζ} ≤ σ²/ζ²
Lemma 2.5
Let u^N = {u1, u2, …, uN} consist of i.i.d. random variables u_i, i = 1, …, N, with mean μ and variance σ². Let
S_N = ∑_{i=1}^{N} u_i ………………………2.43
Then,
lim_{N→∞} S_N/N = μ (in probability) ………………………2.44
Proof
Since the u_i’s are i.i.d. we can apply Lemma 2.4 with
Variance{S_N/N} = σ²/N
But
E{S_N/N} = E{(u1 + u2 + … + uN)/N} = Nμ/N = μ
By Chebyshev’s Inequality,
Pr{|S_N/N − μ| > ζ} ≤ σ²/(Nζ²) ………………………2.45
Thus for any fixed ζ,
Pr{|S_N/N − μ| > ζ} → 0 as N → ∞
i.e.
lim_{N→∞} S_N/N = μ
The sequence {u_n}, n an integer ≥ 0, converges to a limit u if for every ζ > 0 there exists an m such that for every n > m, |u_n − u| < ζ.
Lemma 2.6
If the random variables u^N = {u1, u2, …, uN} are i.i.d. with pmf p_N(u), then
−(1/N) log p_N(u) → H(u) ………………………2.46
Proof:
p_N(u) = ∏_{i=1}^{N} p(u_i)
and hence
−(1/N) log p_N(u) = −(1/N) ∑_{i=1}^{N} log p(u_i)
The right-hand side is the sample mean of −log p(u_i). Thus, by invoking Lemma 2.5, we have
lim_{N→∞} −(1/N) log p_N(u) = E{−log p(u_i)} = H(u)
We are now in a position to show that a typical set occurs with a probability near 1
For any ζ > 0, consider a DMS with alphabet u and entropy H(u), and the subset of all source sequences of length N defined by:
S(N, ζ) = {u : 2^{−N{H(u)+ζ}} ≤ p_N(u) ≤ 2^{−N{H(u)−ζ}}} ………………………2.47
Then all source sequences in S(N, ζ) can be uniquely represented by binary code words of finite length L_N, where
L_N < N{H(u) + ζ} + 1 ………………………2.48
Furthermore
Pr{u ∉ S(N, ζ)} ≤ σ²/(Nζ²) ………………………2.49
where
σ² = E{(−log p(u) − H(u))²} = ∑_u p(u) [log p(u)]² − H(u)² ………………………2.50
and p(u) and H(u) represent the single-letter probabilities and entropy, respectively.
Note that all sequences in the set S(N, ζ) are nearly equiprobable, differing from the value 2^{−NH(u)} by a factor no greater than 2^{Nζ}.
Proof
Since S(N, ζ) is a subset of u^N, the set of all sequences of length N, we have the inequality
∑_u p_N(u) = 1 ≥ ∑_{u∈S(N,ζ)} p_N(u) ………………………2.51
Since p_N(u) ≥ 2^{−N{H(u)+ζ}} for every u ∈ S(N, ζ),
1 ≥ ∑_{u∈S(N,ζ)} 2^{−N{H(u)+ζ}} = 2^{−N{H(u)+ζ}} |S(N, ζ)| ………………………2.52
where |S(N, ζ)| is the number of distinct sequences of length N in S(N, ζ), i.e. it is the cardinality of the set S(N, ζ). Hence
|S(N, ζ)| ≤ 2^{N{H(u)+ζ}} ………………………2.53
This bound, of course, is not an integer, let alone a power of 2. However, we may bracket it between powers of 2 by choosing the integer L_N such that:
2^{L_N − 1} < 2^{N{H(u)+ζ}} ≤ 2^{L_N} ………………………2.54
Focusing now on the representation of the source sequences not in S(N, ζ), we note that the DMS has an alphabet of size A, so that there are A^N source sequences of length N in all. It is possible to represent all of these sequences by binary sequences of length L̄_N if L̄_N is chosen to meet the constraint
2^{L̄_N − 1} ≤ A^N < 2^{L̄_N} ………………………2.55
It is now possible to propose a uniquely decodable code for the DMS. Codewords for the source sequences in S(N, ζ) are obtained by appending one more binary symbol to each of the binary representations of length L_N of eqn 2.54: each representation is preceded by a “0”.
While this increases the binary sequence length from LN to 1+ LN, it has a vanishingly small
influence for asymptotically large N.
Since there are 2^{L_N} distinct binary sequences of length L_N, we can represent uniquely all source sequences belonging to S(N, ζ) by binary sequences of length L_N satisfying equation 2.48. Then, using eqn 2.48, all sequences in S(N, ζ) are represented uniquely by binary sequences of length 1 + L_N < N{H(u) + ζ} + 2 bits. All other sequences, i.e. those in the complementary set S̄(N, ζ), are represented by sequences in which the first binary digit is always a “1” and the remaining L̄_N symbols uniquely represent each sequence, with L̄_N chosen in accordance with eqn 2.55. Then all sequences in S̄(N, ζ) are represented uniquely by binary sequences of length
1 + L̄_N ≤ N log2 A + 2 bits ………………………2.56
Each codeword is a unique binary sequence and there is never any uncertainty as to when a code word sequence begins and ends: no code word is a prefix of another.
We now have a uniquely decodable binary code for each output sequence of length N. As an
example, consider a source with alphabet u = {a, b, c} for source sequences of length N = 2. A
possible code, which satisfies the parameters mentioned above is:
u^N   Codeword
aa    000
ab    001
ac    010
ba    011
bb    1000
bc    1001
ca    1010
cb    1011
cc    1100
In order to show that the typical set S(N, ζ) occurs with a probability near 1, we proceed as follows. Let
F_N = Pr{u ∈ S̄(N, ζ)} = ∑_{u∈S̄(N,ζ)} p_N(u) ………………………2.57
where S̄(N, ζ) is the complementary set to S(N, ζ), which contains the source sequences other than the typical sequences. Now
S̄(N, ζ) = {u : log p_N(u) ∉ [−N{H(u)+ζ}, −N{H(u)−ζ}]}
= {u : log ∏_{n=1}^{N} p(u_n) ∉ [−N{H(u)+ζ}, −N{H(u)−ζ}]}
= {u : |∑_{n=1}^{N} log p(u_n) + N H(u)| > Nζ} ………………………2.58
i.e.
S̄(N, ζ) = {u : |−(1/N) ∑_{n=1}^{N} log p(u_n) − H(u)| > ζ} ………………………2.59
With σ² given by eqn 2.50, the bound of eqn 2.49 on F_N follows easily by applying Lemma 2.5 (in the form of eqn 2.45) to eqn 2.59. Furthermore, the quantity
−(1/N) ∑_{n=1}^{N} log p(u_n) − H(u)
should, according to Lemma 2.6 on the AEP, approach zero as N approaches infinity, so that F_N in eqn 2.57 approaches zero and the typical set occurs with a probability near 1.
We have just developed a uniquely decodable code with two possible codeword lengths, 1 + L_N and 1 + L̄_N. The average length of the code words is thus
⟨L_N⟩ = (1 + L_N) Pr{u ∈ S(N, ζ)} + (1 + L̄_N) Pr{u ∈ S̄(N, ζ)} ………………………2.61
Using Pr{u ∈ S(N, ζ)} ≤ 1 and Pr{u ∈ S̄(N, ζ)} ≤ σ²/(Nζ²), together with the bounds on 1 + L_N and 1 + L̄_N, we get
⟨L_N⟩ < N{H(u) + ζ} + 2 + (N log A + 2) σ²/(Nζ²)
= N{ H(u) + ζ + 2/N + (log A + 2/N) σ²/(Nζ²) }
or
⟨L_N⟩/N < H(u) + ζ + 2/N + (log A + 2/N) σ²/(Nζ²)
Choosing ζ = N^{−1/3},
⟨L_N⟩/N < H(u) + 2N^{−1} + N^{−1/3} + (log A + 2N^{−1}) σ² N^{−1/3}
= H(u) + o(1)
which establishes the direct half of the theorem, i.e. we have shown that for source sequences of length N of a DMS there exists a uniquely decodable code for which ⟨L_N⟩ < N H(u) + o(N).
We note that for large N nearly all code words can be made of equal length, slightly larger than NH(u), and only two lengths of code words are required (1 + L_N and 1 + L̄_N, where L̄_N refers to the length used for the sequences in S̄(N, ζ)).
Note that L_N = ⌈N{H(u) + ζ}⌉, i.e. L_N lies in [N{H(u) + ζ}, N{H(u) + ζ} + 1), and so, for large N, almost all codewords are of almost the same length.
We now establish the converse half of the theorem. Consider any uniquely decodable code with codeword lengths l_N(u), u ∈ u^N, and examine the M-th power of the sum ∑_u 2^{−l_N(u)}:
(∑_u 2^{−l_N(u)})^M = (∑_{u1} 2^{−l_N(u1)}) (∑_{u2} 2^{−l_N(u2)}) ⋯ (∑_{uM} 2^{−l_N(uM)})
= ∑_{u1} ∑_{u2} ⋯ ∑_{uM} 2^{−{l_N(u1) + l_N(u2) + ⋯ + l_N(uM)}} ……………2.62
where each sum on both sides of the equation is over the entire space u^N.
Note that there is a distinct term on the right hand side of eqn 2.62 corresponding to each possible sequence of M code words. Furthermore, the quantity
l_N(u1) + l_N(u2) + ⋯ + l_N(uM)
gives the total length in code letters (bits) of the corresponding sequence of code words.
If we let A_k be the number of sequences of M successive code words having a total length of k binary symbols, equation 2.62 can be expressed as
(∑_u 2^{−l_N(u)})^M = ∑_{k=1}^{M l*_N} A_k 2^{−k} ……………2.63
where
l*_N = max_u l_N(u)
But in order for the source sequence to be recoverable from the binary sequence we must have
A_k ≤ 2^k ,  k = 1, 2, ⋯, M l*_N ……………2.64
Otherwise, two or more sequences of M successive code words would give the same binary sequence, violating our uniqueness requirement. Using this bound for A_k we have
(∑_u 2^{−l_N(u)})^M ≤ ∑_{k=1}^{M l*_N} 1 = M l*_N ……………2.65
Since M is arbitrary, this can only hold for all M if
∑_u 2^{−l_N(u)} ≤ 1 ……………2.66
for otherwise the left hand side of 2.65 would grow exponentially in M while the right hand side grows only linearly in M. This inequality is known as the Kraft-McMillan inequality.
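The Kraft-McMillan inequality is a simple numerical test on a proposed set of codeword lengths. The sketch below (illustrative code with arbitrary example lengths, added to these notes) evaluates the sum ∑ 2^{−l} for two candidate sets.

```python
def kraft_sum(lengths):
    """Kraft-McMillan sum for binary codeword lengths: sum of 2**(-l)."""
    return sum(2 ** (-l) for l in lengths)

# Lengths 1, 2, 3, 3 satisfy the inequality; lengths 1, 1, 2, 2 do not,
# so no uniquely decodable binary code exists with the latter lengths.
print(kraft_sum([1, 2, 3, 3]))   # 1.0  (admissible)
print(kraft_sum([1, 1, 2, 2]))   # 1.5  (> 1, not admissible)
```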
If we were now to use a general variable-length source encoder whose code lengths l_N(u) satisfy the Kraft-McMillan inequality, we would have an average codeword length of
⟨L_N⟩ = ∑_u P_N(u) l_N(u) ……………2.67
Define the auxiliary probability distribution
Q_N(u) = 2^{−l_N(u)} / ∑_u 2^{−l_N(u)} ……………2.68
We have from Lemma 2.2 the inequality ∑_u P_N(u) log (Q_N(u)/P_N(u)) ≤ 0, and hence
N H(u) = ∑_u P_N(u) log (1/P_N(u))
≤ ∑_u P_N(u) log (1/Q_N(u))
= ∑_u P_N(u) [ l_N(u) + log (∑_u 2^{−l_N(u)}) ]
= ⟨L_N⟩ + log (∑_u 2^{−l_N(u)}) ……………2.69
Since the Kraft-McMillan inequality guarantees that the second term is not positive, we have
N H(u) ≤ ⟨L_N⟩ ……………2.70
The bound applies for any sequence length N, and it follows that any source code for which the source sequence can be uniquely recovered from the binary sequence requires at least an average of NH(u) bits per block of N source symbols, i.e. at least H(u) bits per source symbol.
This completes the proof of Theorem 2.1. We have shown that it is possible to source encode a
DMS with an average number of binary symbols per source symbol arbitrarily close to its
entropy and that it is impossible to have a lower average.
2.5.2 Source Encoding Algorithms
Shannon’s Source Coding Theorem guarantees the existence of a code for source sequences of
length N and provides a bound on the average codeword lengths. The Source Coding Theorem
does not, however, indicate how such a code can be constructed. We provide in this section
some algorithms for designing source codes.
The binary codes we are looking for must meet two conditions:
Efficient codes are the ones which exploit source statistics to ensure that short codewords are
assigned to more probable source symbols and vice versa.
Uniquely decodable codes are the ones which ensure exact reconstitution of the original source
messages from their coded representations. A sufficient condition for a code to be uniquely
decodable is that it be a prefix condition code.
Prefix Coding
A prefix condition code, or a prefix code for short, is a code in which no codeword is a prefix of
another codeword. To illustrate this consider the codes in Table 2.1 below
Symbol  Probability  Code I  Code II  Code III
s1      0.5          0       0        0
s2      0.25         1       10       01
s3      0.125        00      110      011
s4      0.125        11      111      0111
(i) Code I is not a prefix code, since the bit “0”, the codeword for symbol s1, is a prefix of “00”, the codeword for symbol s3. Note also that the bit “1”, the codeword for symbol s2, is a prefix of “11”, the codeword for symbol s4.
(ii) Code II is a prefix code, whereas Code III is not.
A prefix code is uniquely decodable. The converse is not true as seen in code III. This code is
uniquely decodable as a “0” indicates the beginning of a codeword. The code however is not a
prefix code. The prefix condition is thus a sufficient, but not necessary, condition for unique decodability.
Prefix condition codes are a subset of uniquely decodable codes. Together with being uniquely
decodable they satisfy the Kraft – McMillan Inequality. This is based on the fact that unique
decodability was a condition used in deriving the inequality from eqns 2.63, 2.64, 2.65 and 2.66.
Note, however, that the Kraft – McMillan inequality does not tell us that a source code is a
prefix code. It is merely a condition on the code word lengths of a code and not on the code
words themselves.
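Checking the prefix condition itself is also straightforward. The sketch below (added for illustration; the example codewords follow Table 2.1 as reconstructed above) tests whether any codeword is a prefix of another.

```python
def is_prefix_code(codewords):
    """True if no codeword is a prefix of another (prefix condition)."""
    for a in codewords:
        for b in codewords:
            if a != b and b.startswith(a):
                return False
    return True

# The three codes of Table 2.1
print(is_prefix_code(["0", "1", "00", "11"]))       # False (Code I)
print(is_prefix_code(["0", "10", "110", "111"]))    # True  (Code II)
print(is_prefix_code(["0", "01", "011", "0111"]))   # False (Code III)
```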
Example 2.4
Suppose s = {s1, s2, s3, s4} and the binary symbol representations are as shown below:
Symbol  Probability  Code 1  Code 2  Code 3
s1      0.5          0       0       0
s2      0.25         1       10      01
s3      0.125        00      110     011
s4      0.125        11      111     0111
(i) Code 1 violates the Kraft – McMillan Inequality. Is it a prefix code? Why?
(ii) Both codes 2 and 3 satisfy the Kraft – McMillan Inequality. Code 2 is a prefix code
whereas code 3 is not.
Decoding
As far as decoding is concerned, let us start by looking at the example below.
Example 2.5
Suppose u = {a, b, c} and the binary symbol representation uses three codes as shown below. We note that if N = 1 (i.e. single symbols a, b or c are transmitted) then:
Symbol  Code 1  Code 2  Code 3
a       0       00      1
b       1       01      10
c       10      10      100
• Code 1 is not uniquely decodable: for example, the received sequence 1010 could represent cc, cba, bac or baba, and so cannot be uniquely decoded into a source sequence.
• Code 2 is uniquely decodable, since all its codewords have the same length; it is also a prefix condition code.
• Code 3 is uniquely decodable: a “1” marks the beginning of each codeword, and the codewords are unique.
To decode a sequence of prefix condition code words generated by a source encoder, the source decoder sets up a decision tree. For Code II of Table 2.1 the decision tree branches on each received bit, starting from an initial (root) state and ending at a terminal state corresponding to a source symbol. For example,
1011111000
is decoded as
s2 s4 s3 s1 s1
The decoder always starts in the initial state and ends at a terminal state. Each source symbol corresponds to one terminal state. Similarly, the encoded sequence 101111000 is easily decoded, using the decision tree, as s2 s4 s2 s1 s1.
For prefix condition codes, the end of a codeword is always recognizable. Hence the decoding of a prefix codeword representing a source symbol can be accomplished as soon as the binary sequence representing it is fully received. Such a code is said to be an instantaneous code.
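The decision-tree decoding just described amounts to reading bits until the accumulated string matches a codeword. A minimal Python sketch of this idea is given below (added for illustration), using Code II of Table 2.1; it reproduces the decoding of the bit stream 1011111000.

```python
def decode_prefix_code(bits, codebook):
    """Walk the received bit stream, emitting a symbol as soon as the bits
    read so far match a codeword (possible because the code is prefix-free)."""
    inverse = {word: sym for sym, word in codebook.items()}
    symbols, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:            # reached a terminal state of the tree
            symbols.append(inverse[current])
            current = ""
    return symbols

code_II = {"s1": "0", "s2": "10", "s3": "110", "s4": "111"}
print(decode_prefix_code("1011111000", code_II))   # ['s2', 's4', 's3', 's1', 's1']
```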
Given the probabilities of occurrence of the source symbols and the codewords used to represent each source symbol, it is easy to compute the average codeword length L̄ as
L̄ = ∑_{k=1}^{N} Pk Lk ……………2.71
where Pk is the probability of the k-th source symbol and Lk is the length of its codeword. The variance of the codeword length is
σ² = ∑_{k=1}^{N} Pk (Lk − L̄)² ……………2.72
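Equations 2.71 and 2.72 translate directly into code. The sketch below (an illustration added to these notes) computes L̄ and σ² for Code 2 of the example that follows.

```python
def average_length_and_variance(probs, lengths):
    """Average codeword length (eqn 2.71) and its variance (eqn 2.72)."""
    L_bar = sum(p * l for p, l in zip(probs, lengths))
    var = sum(p * (l - L_bar) ** 2 for p, l in zip(probs, lengths))
    return L_bar, var

# Code 2 of the example that follows: two symbols of probability 1/4 (length 2),
# two of probability 1/8 (length 3) and four of probability 1/16 (length 4)
probs   = [1/4, 1/4, 1/8, 1/8, 1/16, 1/16, 1/16, 1/16]
lengths = [2, 2, 3, 3, 4, 4, 4, 4]
print(average_length_and_variance(probs, lengths))   # (2.75, 0.6875)
```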
Example 2.4
Symbol  Code 1  Code 2  Probability
a0      000     00      1/4
a1      001     01      1/4
a2      010     100     1/8
a3      011     101     1/8
a4      100     1100    1/16
a5      101     1101    1/16
a6      110     1110    1/16
a7      111     1111    1/16
The average codeword length for Code 2 is
L̄(Code 2) = ∑_k pk Lk = 2×(1/4)×2 + 2×(1/8)×3 + 4×(1/16)×4
= 2.75 bits
The source entropy is
H(u) = 2×(1/4) log 4 + 2×(1/8) log 8 + 4×(1/16) log 16
= 2.75 bits/symbol
For code 2, the average codeword length is 2.75 bits/symbol which equals the average entropy.
We show in the sequel that this result implies that the code is optimal according to Shannon’s
Source Coding Theorem.
Find, and comment on, the entropy and average codeword length for Code 1.
Consider a DMS with alphabet u and entropy H(u). For such a source, consider further that the source symbol uk occurs with probability pk and that the binary sequence representing uk has length lk. Suppose that the symbol probabilities are of the form
pk = 2^{−lk} ……………2.73
so that the code is matched to the source. Then the average codeword length is
⟨L⟩ = ∑_{k=1}^{N} lk / 2^{lk} ……………2.74
and the entropy is
H(u) = ∑_{k=1}^{N} pk log (1/pk)
= ∑_{k=1}^{N} (1/2^{lk}) log 2^{lk}
= ∑_{k=1}^{N} lk / 2^{lk} = ⟨L⟩ ……………2.75
Thus, when the symbol probabilities are negative integer powers of two, the average codeword length equals the source entropy.
How does one match a prefix code to an arbitrary source? One uses an extended code (a code for blocks of source symbols).
We know from Theorem 2.1 that
N H(u) ≤ ⟨L_N⟩ < N H(u) + o(N)
As N → ∞ and ζ becomes arbitrarily small, the lower and upper bounds converge, since
lim_{N→∞, ζ→0} ⟨L_N⟩/N = H(u)
Thus, by making the order N of an extended source encoder large enough and ζ small enough, the code can represent a DMS as closely as desired, i.e. it almost matches the source.
The average code word length per source symbol of an extended code can thus be made as close to the entropy of the source as desired, provided the extended code has a high enough order. The price to be paid is the increased decoding complexity resulting from the higher order of the extended code.
In this section we will present three algorithms for generating source codes. The average length
of the code words in each source code is computed and so is the entropy of the source. A
comparison of the two values gives an indication of how closely a code matches the source.
(i) Huffman Code
The Huffman code is a prefix condition code. This code assigns, to each symbol in an alphabet, a sequence of bits roughly equal in length to the amount of information carried by the symbol. This means that the code word lengths are approximately equal to the negative logarithm of the probability of occurrence of the symbol represented by the code word, so that more probable symbols receive shorter code words.
The Huffman code results in a code whose average code word length approaches the entropy of the source, assumed to be a DMS (it approaches the fundamental limit by asymptotically matching the source as N → ∞).
For extended sources, with each block comprising N source symbols, the Huffman code is optimal in the sense that it produces a code with the smallest average code word length for a given N.
1. List source symbols in order of decreasing probability. The two source symbols with the
lowest probability are assigned a “0” and a “1”. This is the splitting stage.
2. Combine the two lowest probability source symbols into a new source symbol with
probability equal to the sum of the probabilities of the original source symbols.
The list of source symbols is reduced by one. Place the new symbol in the list of symbols in accordance with its probability. When the probability of a newly combined symbol is found to equal another probability in the list, the selection of either of them as the next symbol to be combined will not affect the average codeword length, but it will affect the variance. The symbol yielding the lower variance is used in coming up with the final design for the code.
3. Repeat the process until only two symbols are left. Assign a “0” and a “1” to the two
symbols.
The code for each original symbol is found by working backward, tracing the sequence of
“0s” and “1s” assigned to the symbol and its successors.
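The splitting-and-combining procedure above is captured by the following Python sketch (added for illustration; it is one possible implementation, not the step-by-step procedure as the notes present it). Because of tie-breaking it may produce a different, but equally good, set of codewords than the worked example that follows; the average codeword length is the same.

```python
import heapq
import itertools

def huffman_code(probabilities):
    """Binary Huffman code for a {symbol: probability} dictionary.
    Repeatedly combines the two least probable entries, prefixing their
    partial codewords with '0' and '1' respectively."""
    counter = itertools.count()            # tie-breaker for equal probabilities
    heap = [(p, next(counter), {sym: ""}) for sym, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, group1 = heapq.heappop(heap)       # two lowest-probability entries
        p2, _, group2 = heapq.heappop(heap)
        merged = {sym: "0" + code for sym, code in group1.items()}
        merged.update({sym: "1" + code for sym, code in group2.items()})
        heapq.heappush(heap, (p1 + p2, next(counter), merged))
    return heap[0][2]

probs = {"s1": 0.4, "s2": 0.2, "s3": 0.2, "s4": 0.1, "s5": 0.1}
code = huffman_code(probs)
print(code)
print(sum(probs[s] * len(w) for s, w in code.items()))   # average length: 2.2 bits
```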
Example
Consider a source with five symbols having the probabilities
s1 = 0.4
s2 = 0.2
s3 = 0.2
s4 = 0.1
s5 = 0.1
Using the Huffman algorithm, we generate the Huffman tree and read off the following code:
Symbol  Probability  Codeword
s1      0.4          1
s2      0.2          01
s3      0.2          000
s4      0.1          0010
s5      0.1          0011
The average codeword length is
L̄ = 0.4×1 + 0.2×2 + 0.2×3 + 0.1×4 + 0.1×4 = 2.2 bits
The entropy is
H(u) = 0.4 log (1/0.4) + 2×0.2 log (1/0.2) + 2×0.1 log (1/0.1) = 2.12 bits/symbol
Note that the average code word length differs from the source entropy by only 3.7 percent. Note also that
L̄ = 2.2 bits and H(u) + 1 = 3.12
and thus H(u) ≤ L̄ < H(u) + 1, as required by the source coding theorem for N = 1.
If, at each stage of the Huffman procedure, the newly combined symbol is instead placed as high as possible in the re-ordered list, a different Huffman tree results. The resulting Huffman code is:
Symbol  Probability  Codeword
s1      0.4          00
s2      0.2          10
s3      0.2          11
s4      0.1          010
s5      0.1          011
Now both codes have the same average codeword length,
L̄ = ∑_{k=1}^{5} pk Lk = 2.2 bits
but their variances (eqn 2.72) differ:
σ² = 1.36 for the first code and σ² = 0.16 for the second code.
Which of the two codes would you recommend for implementation? Why?
(ii) Shannon – Fano Code
1. Arrange the symbols from the source alphabet in order of decreasing probability
2. Group the source symbols into two sub-sets so that the probabilities of the two sub-sets are
just about equal. Assign a “0” to one sub-set and a “1” to the other sub-set.
3. Group the symbols in each sub-set into two groups of almost equal probability. Assign a “0”
and a “1” to each sub-group.
4. Repeat step 3 until all source symbols have been assigned code words.
The Shannon – Fano derived code words for each symbol are read from the original sub –
set to the final sub – set.
Implementation of the Shannon – Fano algorithm produces a prefix condition code [20].
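The four steps above can be expressed as a short recursive routine. The sketch below (illustrative code added to these notes) sorts the symbols, finds the split point that makes the two group probabilities as nearly equal as possible, and recurses; applied to the six-symbol example that follows, it reproduces the code words listed there.

```python
def shannon_fano(symbols):
    """Shannon-Fano coding: sort by decreasing probability, then recursively
    split into two groups of (nearly) equal probability, assigning '0'/'1'."""
    def split(items, prefix, codes):
        if len(items) == 1:
            codes[items[0][0]] = prefix or "0"
            return
        total, running, cut, best_diff = sum(p for _, p in items), 0.0, 1, float("inf")
        for i in range(1, len(items)):
            running += items[i - 1][1]
            diff = abs(2 * running - total)       # imbalance if we cut before item i
            if diff < best_diff:
                best_diff, cut = diff, i
        split(items[:cut], prefix + "0", codes)
        split(items[cut:], prefix + "1", codes)

    ordered = sorted(symbols.items(), key=lambda kv: kv[1], reverse=True)
    codes = {}
    split(ordered, "", codes)
    return codes

probs = {"s1": 0.3, "s2": 0.25, "s3": 0.2, "s4": 0.1, "s5": 0.1, "s6": 0.05}
print(shannon_fano(probs))
```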
Symbol Probability
s1 0.3
s2 0.25
s3 0.2
s4 0.1
s5 0.1
s6 0.05
Code words are obtained by reading from the left to the right. Thus, for this example we get:
Symbol  Probability  Codeword
s1      0.3          00
s2      0.25         01
s3      0.2          10
s4      0.1          110
s5      0.1          1110
s6      0.05         1111
The same result can be obtained by using a tree, with the code words indicated along its branches. The tree is known as the Shannon-Fano tree.
Example 2.6
A source has an alphabet u = {a, b, c} with source statistics p(a) = ¼ , p(b) = ¼ , and p(c) = ½ .
This source is used to get an extended code with N = 2. Find a Huffman code and a Shannon-Fano code for the extended source.
A Huffman code for the extended source is:
Symbol  Codeword
cc      10
ac      010
bc      011
ca      000
cb      001
aa      1110
ab      1111
ba      1100
bb      1101
The code words are read from the initial state (root) of the Huffman tree to its terminal states (leaves).
A Shannon-Fano code for the extended source is:
Symbol  Codeword
cc      00
ac      01
bc      100
ca      101
cb      1100
aa      1101
ab      11100
ba      11101
bb      1111
Since
p(a) = 1/4, p(b) = 1/4, p(c) = 1/2,
the entropy of the original source is H(u) = 1.5 bits/symbol, so that the entropy of the extended source is 2H(u) = 3 bits per extended symbol. The average codeword length of the Huffman code above is exactly 3 bits per extended symbol, so the Huffman code matches the extended source.
Decoding
Since both Huffman coding and Shannon-Fano coding produce prefix condition codes, decoding can be effected using decision trees as discussed in Section 2.5.2. The Huffman tree is the decision tree for the Huffman code and the Shannon-Fano tree is the decision tree for the Shannon-Fano code. The initial state in both types of decision tree is the root of the tree. The leaves of the trees constitute the terminal states.
As an example, suppose we are to decode the following bit stream using the Huffman tree of Example 2.6.
Bit stream: 001110100110
Reading the bits from the left to the right, starting from the root of the Huffman tree and proceeding to the leaves (the symbols) of the tree, we decode the bit stream as:
cb bb cb cc
Decoding the same bit stream using the Shannon-Fano tree instead, starting from the root of the tree and moving to the leaves, yields a decoded message which differs from the original message.
If the same message, cb bb cb cc, had been encoded using the Shannon-Fano code of Example 2.6, the resulting bit stream would have been
11001111110000
If this had been the received bit stream, then it can easily be verified that using the Shannon-Fano tree for decoding, starting from the root of the tree and moving to the leaves, would yield a decoded message which is exactly the same as the encoded message.
Both the Shannon-Fano and the Huffman coding algorithms require a priori knowledge of the source probabilities. In practice this may not always be possible. In addition, neither algorithm exploits dependencies between words and phrases (i.e. inter-character dependences) in text, thus compromising the efficiency with which such sources can be coded.
One code which overcomes the shortcomings mentioned above is the Lempel-Ziv-Welch code. This code belongs to a class of codes known as dictionary codes.
The first Lempel-Ziv compression technique was published in 1977 [21] and was later refined by Welch in 1984 [22].
The Lempel-Ziv-Welch (LZW) algorithm works by looking for data sequences that occur multiple times and replacing them in a file with fixed-size pointers. As it reads data in, it stores strings in a dictionary. When it encounters a string that already exists in the dictionary, it replaces it with a fixed-size index. Compression is achieved since the index requires less storage space than the string it represents. The LZW coder does not need to store the dictionary along with the strings. During decoding, the decoder works on the same principles as the encoder, reading in the data and simultaneously creating a dictionary.
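A compact sketch of the classic LZW encoder is given below (added for illustration). Note that it follows the usual textbook formulation, emitting dictionary indices of the longest already-known string, which differs in detail from the phrase-parsing procedure worked through in Example 2.7 below; it is meant only to convey the dictionary-building idea.

```python
def lzw_encode(data: str, alphabet=("0", "1")):
    """Dictionary-based LZW encoding sketch: emit the dictionary index of the
    longest already-known string, then add that string plus the next symbol."""
    dictionary = {sym: i for i, sym in enumerate(alphabet)}
    output, current = [], ""
    for symbol in data:
        candidate = current + symbol
        if candidate in dictionary:
            current = candidate                  # keep extending the match
        else:
            output.append(dictionary[current])   # emit index of the longest match
            dictionary[candidate] = len(dictionary)
            current = symbol
    if current:
        output.append(dictionary[current])
    return output, dictionary

indices, table = lzw_encode("000101110010100101")
print(indices)      # sequence of dictionary indices sent to the decoder
print(len(table))   # the dictionary grows as the data is read
```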
The Lempel – Ziv algorithm works by parsing the source data stream into segments that
are the shortest subsequences not encountered previously.
Example 2.7
Consider the source data stream
000101110010100101
(i) It is assumed initially that bits 0 and 1 are already stored in a buffer, in that order, to get:
Subsequences stored: 0, 1
Data to be parsed: 000101110010100101
(ii) With bits 0 and 1 already stored, the shortest subsequence encountered for the first time and not previously encountered, beginning from the left, is 00. This is the first string to be stored in the dictionary. This yields:
Subsequences stored: 0, 1, 00
Data to be parsed: 0101110010100101
(iii) In the remaining data to be parsed, the shortest substream not previously seen is 01. The second string to be stored is thus 01. This yields:
Subsequences stored: 0, 1, 00, 01
Data to be parsed: 01110010100101
(iv) At this stage, the shortest substream not previously seen and stored in the dictionary is 011. The third string to be stored is thus 011. This yields:
Subsequences stored: 0, 1, 00, 01, 011
Data to be parsed: 10010100101
(v) The process is repeated until the source input data stream has been completely parsed. For the example at hand the result is:
Subsequences stored: 0, 1, 00, 01, 011, 10, 010, 100, 101
Data to be parsed: (none remaining)
(vi) The substreams stored are the contents of the dictionary. Indexes are assigned to the dictionary contents. The indexes are a binary representation of the decimal number giving the position of the dictionary entry. For the example at hand we get:
Numerical position: 1 2 3 4 5 6 7 8 9
Index : 001 010 011 100 101 110 111 1000 1001
Sub stream stored : 0 1 00 01 011 10 010 100 101
(vii) In the LZW encoding algorithm, the last bit of a codeword representing a stored substream is taken to be the same as the last bit of the stored substream. This last bit of a stored substream is called an innovation symbol. The innovation symbol, when appended to a subsequence, enables that subsequence to be distinguished from all previously stored subsequences. The rest of the bits in a codeword are obtained by concatenating the index of the root subsequence that matches the one in question except for the innovation symbol. Thus, for example, the codeword for the subsequence 00 is made up of a 0 as the last bit, since this is the innovation symbol, concatenated with 001, since the root subsequence that matches 00 except for the innovation symbol 0 is the subsequence 0 stored in position 1. The index for this subsequence is 001. Thus the codeword for subsequence 00 is 0010.
Numerical position: 1 2 3 4 5 6 7 8 9
Index : 001 010 011 100 101 110 111 1000 1001
Sub stream stored : 0 1 00 01 011 10 010 100 101
Codeword : 0010 0011 1001 0100 1000 1100 1101
The Lempel-Ziv code described above:
(a) produces fixed-length codewords;
(b) has codewords of length 1 + ⌈log2 M⌉, where M is the number of distinct strings in the parsed source sequence and ⌈·⌉ represents the smallest integer not less than the argument;
(c) is asymptotically optimal in the sense that the average codeword length approaches the source entropy as the source sequence gets very long.
(viii) The decoding procedure parallels the encoding procedure. Thus, assume for example that
the received bit stream is:
0010001110010100100011001101
(a) Since LZW produces fixed-length codewords, we divide, for this example, the bit stream into codewords of four bits to get:
0010 0011 1001 0100 1000 1100 1101
(b) The decoder sets up a dictionary just like the encoder. The default entries in positions 1 and 2 are 0 and 1. We thus have:
Numerical position: 1 2 3 4 5 6 7 8 9
Index : 001 010 011 100 101 110 111 1000 1001
Decoded Entry : 0 1
When the first codeword 0010 is received, the innovation symbol – a 0 in this case – is
decoded as the last bit of the original data string. The index is then used to identify the
other bit(s) in the input string. The index is 001 indicating that the other bit in the
encoded string is a 0. We thus get:
Numerical position: 1 2 3 4 5 6 7 8 9
Index : 001 010 011 100 101 110 111 1000 1001
Decoded Entry : 0 1 00
The next received codeword is 0011. The innovation bit is decoded as the last bit in the
encoded string. The index identifies one other bit – a 0 – for the input string. Thus
codeword 0011 is decoded as 01, yielding:
Numerical position: 1 2 3 4 5 6 7 8 9
Index : 001 010 011 100 101 110 111 1000 1001
Decoded Entry : 0 1 00 01
The procedure is repeated sequentially until all the received codewords have been
decoded. It is left as an exercise to the reader to show that completing the procedure
leads to the complete recovery of the original source output data stream.
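The decoding procedure can likewise be sketched in a few lines. The sketch below is not from the notes; it assumes, to match the example, fixed 4-bit codewords consisting of a 3-bit index and a 1-bit innovation symbol.

def lz_decode(bitstream, index_width=3):
    dictionary = ["0", "1"]            # default entries in positions 1 and 2
    output = []
    word_len = index_width + 1
    for i in range(0, len(bitstream), word_len):
        codeword = bitstream[i:i + word_len]
        prefix = dictionary[int(codeword[:index_width], 2) - 1]  # 1-based index
        innovation = codeword[index_width:]
        entry = prefix + innovation
        dictionary.append(entry)       # rebuild the encoder's dictionary
        output.append(entry)
    return "".join(output)

print(lz_decode("0010001110010100100011001101"))
# 000101110010100101  -- the original source stream of Example 2.7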
Huffman coding was for a long time the algorithm of choice for data compression.
However, the Lempel–Ziv–Welch (LZW) algorithm has taken over from the Huffman
algorithm. By exploiting inter-character dependence, the LZW algorithm achieves greater
compression than the Huffman algorithm. Archiving utilities such as ZIP are among the
current practical software implementations of the LZW algorithm.
The information sources we have been dealing with so far are memoryless. A symbol emitted by
a source at a particular time is independent of preceding symbols and of succeeding symbols.
The assumption that the emitted symbols were independent and identically distributed was
made so as to simplify the mathematical modeling of information sources.
Practical information sources have memory. Thus, for example, in English, the probability of a
letter “u” following the letter “q” is much higher than that of any other letter as can be easily
verified by looking at the digraph frequencies of the English language.
The simplest analytical approach to modeling sources with memory is to consider them to be
Markov Sources. Markov Sources belong to a class of processes known as stochastic processes.
The properties of such processes evolve in time in accordance with probability laws. In our case
we shall restrict ourselves to using discrete time, discrete state stochastic processes in the
modeling of Markov sources.
We shall not attempt a formal mathematical definition of discrete time, discrete state
stochastic processes. Discrete time, discrete state Markov processes are called Markov
chains. Since Markov chains belong to the class of Markov processes, we shall begin with a
definition of Markov processes.
Shannon (1948) gives examples of modeling English text as a Markov source. Shannon gives
sample outputs of such models. Interested readers are referred to Shannon’s paper.
Pr{X_n = x | X_m = y, X_l = z, …} = Pr{X_n = x | X_m = y},    n > m > l > …

The right-hand side of the equation shows that the dependence is on the latest of the time points
l, m, … The probability density functions of Markov processes are not conditioned on two or more
values of past time points.
The definition of Markov chains follows from the definition of Markov processes:

Pr{X_n = k | X_l = h, …, X_m = j} = Pr{X_n = k | X_m = j}
Example 2.8
Suppose that a binary source can exist in two states A and B and that the probabilities of
transmission of a 0 or a 1 are state dependent. Thus, suppose the nth transmission is a 0 (i.e. the
binary source is in state A); then the probability that a 0 will be transmitted at the (n + 1)th time
is 1 − α and the probability that a 1 is transmitted is α. Similarly, if the nth transmission is a 1
(state B), the probability that the (n + 1)th transmission is a 1 is 1 − β and that of a 0 is β.
Alternatively, this may be taken to be equivalent to saying that if the source is in state A at time
n, then the probability that it is still in state A at time (n + 1) is 1 − α and the probability that it
moves to state B at time (n + 1) is α. Similarly, if the source is in state B at time n, then the
probability of being in state B at time (n + 1) is 1 − β and that of being in state A at time (n + 1) is
β. These probabilities are called transition probabilities and may be represented in a matrix as:
                        Next state
                       A          B
Present state   A  [ 1 − α        α   ]
                B  [   β        1 − β ]
Note that we have made the simplifying assumption that the transition probabilities are not
time-dependent. Note further that the sum of the elements in any row is exactly 1, since the
current state must surely go somewhere at the next transition.
p_jk = Pr{X_{n+1} = k | X_n = j}                          ……..(2.76)

Pr{X_n = k} = p_k(n)                                      ………(2.77)

p_k(n+1) = Σ_j p_j(n) p_jk                                ……….(2.78)
    [ p_11  p_12  ⋯  p_1N ]
P = [ p_21   ⋯            ]
    [  ⋮                  ]
    [ p_N1  p_N2  ⋯  p_NN ]                               ………..(2.79)

p(n) = [ p_1(n)  p_2(n)  ⋯  p_N(n) ]                      ……….(2.80)

and

p(n+1) = p(n) P                                           ……….(2.81)
The initial state distribution is

p(0) = [ p_1(0)  p_2(0)  ⋯  p_N(0) ]                      ………. (2.83)

Particularizing our discussion to the simplest non-trivial case of a two-state Markov chain, we get

[ p_0(n)  p_1(n) ] = [ p_0(0)  p_1(0) ] [ P_11  P_12 ]ⁿ
                                        [ P_21  P_22 ]

or, in matrix notation,

p(n) = p(0) Pⁿ                                            ………. (2.85)
In order to find the time-dependent probabilities p(n), it is necessary to evaluate Pⁿ in equation
2.85. This is conveniently done by using the diagonal (spectral) representation of P.
If we assume that after a sufficiently long time the system settles down to statistical equilibrium,
in which the state occupancy probabilities lose their dependence on time and are independent
of the initial conditions, i.e.

lim_{n→∞} [ p_0(n)  p_1(n) ] = lim_{n→∞} p(n) = [ π_0  π_1 ] = π

then

π = π P

or

π (I − P) = 0                                             …………. (2.86)
To get a unique solution from equation 2.86 we need to use, in addition, the normalization
condition that the state probabilities must sum to unity or, in matrix notation,

π 1 = 1                                                   ………….. (2.89)

where 1 denotes a column vector of ones. The equilibrium probabilities of state occupancy, π,
are obtained by solving equation 2.86 together with the normalizing condition of equation 2.89.
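As a numerical illustration (not part of the original notes), the equations π = πP and π1 = 1 can be solved directly for the generic two-state chain above; the values α = 0.3 and β = 0.1 below are illustrative assumptions only.

import numpy as np

alpha, beta = 0.3, 0.1                    # illustrative transition probabilities
P = np.array([[1 - alpha, alpha],
              [beta, 1 - beta]])

# Replace one of the dependent equations of pi(I - P) = 0 with the
# normalization condition sum(pi) = 1, then solve the linear system.
A = np.vstack([(np.eye(2) - P).T[:-1], np.ones(2)])
pi = np.linalg.solve(A, np.array([0.0, 1.0]))
print(pi)                                  # [0.25 0.75] = [beta/(alpha+beta), alpha/(alpha+beta)]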
With the background on Markov chains given above, we are now in a position to find an
analytical model of a Markov source.
Definition 2.7: Markov Source
A Markov source is characterized by a set of discrete states, denoted by A = {a1, a2, …, aN} and a
source letter alphabet, denoted by u = {u1, u2, …, uM}. At each discrete unit of time the source
emits, with a given probability, a symbol from its alphabet and assumes a new state, which could
be the previous state.
Note that the Markov process is being used to model the memory of the source.
Let qj(ak) be the probability that the source emits symbol ak given that it is in state j and assume
that this probability depends on the current state
Note that the symbol Sn represents the state of the Markov chain at time n.
Finally assume that the next state the source moves to depends only on ( i.e. is uniquely
determined by) the previous emitted letter. Emitting symbol a k forces the source to move to
state sk
Figure 2.2 illustrates the operation of a Markov source. The branches are labelled by the output
a_k and the probability q_j(a_k), i.e. the probabilities for leaving state j. The nodes correspond to
the states.
Each branch directed away from a given node must correspond to a different letter, to ensure
that the new state the chain moves to is determined by the previous state and the emitted letter.
The conditional entropy of a Markov source is the entropy computed assuming that the source is
occupying a particular state. Using the definition of entropy, this is given by

H(u | s_j) = Σ_k q_j(a_k) log [ 1 / q_j(a_k) ]            ………. (2.91)
The entropy of a Markov source given that it is in a particular state must satisfy the same
inequalities satisfied by a DMS.
The average entropy is obtained by computing the conditional entropy given the state occupied
by the Markov source and then removing the conditioning
H(u) = Σ_{k=1}^{N} H(u | s = s_k) π_k                     ……… (2.92)
Example 2.8
P = [ 0.5  0.5
      0.2  0.8 ]
(i) How many symbols does the source alphabet contain?
In this example, let ai be the source sequence generated given that the Markov source is in state
si, i = 1, 2.
The binary output sequences produced by an encoder depend on the state of the Markov chain.
This is thus an example of a Markov Modulated Source (MMS).
A generalization of this, the Markov Modulated Poisson Process (MMPP) has been used as a
mathematical model of Internet Protocol (IP) traffic. [24 and references therein].
[ π_1  π_2 ] = [ π_1  π_2 ] [ 0.5  0.5
                              0.2  0.8 ]

which, together with the normalization π_1 + π_2 = 1, gives π_1 = 2/7 and π_2 = 5/7.
The computation of the entropy is easily visualized by using the Markov chain representation of
the source
H(u|s_1) = 0.5 log (1/0.5) + 0.5 log (1/0.5) = 1

H(u|s_2) = 0.2 log (1/0.2) + 0.8 log (1/0.8) = log 5 − 1.6

H(u) = H(u|s_1) π_1 + H(u|s_2) π_2
     = 1 × π_1 + (log 5 − 1.6) π_2
     = (5 log₂ 5 − 6) / 7
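A quick numerical check of this result (not part of the original notes) can be made as follows, using the transition matrix of Example 2.8.

import numpy as np

P = np.array([[0.5, 0.5], [0.2, 0.8]])

# Equilibrium state probabilities from pi = pi P with sum(pi) = 1.
A = np.vstack([(np.eye(2) - P).T[:-1], np.ones(2)])
pi = np.linalg.solve(A, np.array([0.0, 1.0]))          # [2/7, 5/7]

def H(row):
    # Entropy in bits of one row of conditional probabilities.
    return -sum(p * np.log2(p) for p in row if p > 0)

H_cond = np.array([H(row) for row in P])               # H(u|s1), H(u|s2)
print(pi, H_cond, pi @ H_cond)
# [0.2857 0.7143] [1.0 0.7219] 0.8014  =  (5*log2(5) - 6)/7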
Coding for Markov Sources
Coding for Markov sources is achieved by designing a separate code, preferably a prefix
condition code, for each state of the underlying Markov chain. These separate codes are then
combined and used in producing an encoded source output by treating the letters of the source
alphabet as states of a Markov chain. An example will clarify this.
Consider a Markov source with source alphabet {a, b, c} and transition probability matrix:
a b c
1 1 1
a
3 3 3
1 1 1
b
4 2 4
1 1 1
c
4 4 2
1/4
1/2
1/3
a [
P=
p1
p21
⋮
pN1
p12
⋯
⋯ p1 N
pN2 ⋯ pNN ] b
[ ]
p1 p12 ⋯ p1 N
p21 ⋯
P=
⋮
pN1 pN2 ⋯ pNN
1/4 1/4
1/4
1/3
c [
p1
P = p2 1
⋮
pN 1
p12
⋯
⋯ p1 N
pN 2 ⋯ pN N ]
1/2
The Huffman code for each state can easily be shown to be as given in the Table below.
Following Definition 2.6, the word “state” in the Table refers to the condition that the
Markov source has just emitted a particular letter. Thus s_a, for example, refers to the state in
which the Markov source has just emitted an a.
                 CODE (by STATE)
SYMBOL      Sa      Sb      Sc
  a          1      10      10
  b         00       0      11
  c         01      11       0
In encoding, we assume that the encoder first transmits the initial state, s_0 say, of the Markov
source. If the Markov source has an alphabet of size m, the transmission of the initial state
requires on the order of log₂ m bits. After transmitting the initial state, the encoder then
successively encodes each source letter using the code associated with the current state, which
governs the transition into the next state. The decoder, after decoding the initial state s_0, can
decode the first received codeword by using the code based on state s_0. The other received
symbols are then decoded sequentially after that.
Example 2.9
baccbacba
We assume that the sequence is read from left to right and that the initial state is s_a. The first
letter is b; from the Table, the codeword for this is 00 and the new state is s_b. The next letter
is a; from the Table, the codeword for this transition is 10. We summarize all the transitions in
the Table below.

Letter    Current state    Next state    Codeword
  b           Sa               Sb           00
  a           Sb               Sa           10
  c           Sa               Sc           01
  c           Sc               Sc            0
  b           Sc               Sb           11
  a           Sb               Sa           10
  c           Sa               Sc           01
  b           Sc               Sb           11
  a           Sb               Sa           10
The encoder output sequence is thus:
00100101110011110
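The encoding rule can be checked with the short sketch below (not part of the original notes); the per-state code table and the initial state s_a are taken from the example above.

CODES = {
    "a": {"a": "1",  "b": "00", "c": "01"},   # code used when in state s_a
    "b": {"a": "10", "b": "0",  "c": "11"},   # code used when in state s_b
    "c": {"a": "10", "b": "11", "c": "0"},    # code used when in state s_c
}

def encode(sequence, initial_state):
    state, out = initial_state, []
    for letter in sequence:
        out.append(CODES[state][letter])  # codeword for this transition
        state = letter                     # next state is the letter just emitted
    return "".join(out)

print(encode("baccbacba", "a"))   # 00100101110011110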
Example
Suppose a decoder has decoded that the initial state of a Markov source during the encoding
process was state c. Decode the following received sequence
110111000101
While in state c, we decode the first two bits – 11 – as b. We are now in state b. This enables us
to decode the next bit – 0 – as b. We have remained in state b, which forces us to decode the
next two bits – 11 – as c. And so on, leading to the following decoded message:
bbcabaa
CHAPTER THREE
CHANNEL CODING
3.1 Discrete Memoryless Channel
So far we have been dealing with sources required for information generation. We have
confined ourselves to answering questions arising out of Shannon’s first theorem:
(i) What is the fundamental limiting number of binary digits required to uniquely
represent the source?
(ii) How close to this limit can we come in practice?
We have found that the fundamental limit is the source entropy. We have also given two
encoding algorithms whose outputs approach the source entropy.
Now we return to Shannon’s theorem which deals with reliability in information transmission.
To mitigate the effect of the noisy channel, redundancy is introduced in the channel input and is
used to reconstitute the channel input sequence as accurately as possible at the channel
decoder output.
Redundancy is introduced by coding. Different codes are used for bandwidth-limited and power-
limited channels, giving rise to Automatic Repeat Request (ARQ) error-control coding systems
versus those using Forward Error Correction (FEC) codes.
For a noisy channel, the mapping between the input and output is random, mapping the input,
defined over a given discrete set, into an output sequence defined over a, usually different,
discrete set.
The channel encoder and decoder map the input digital sequence into a channel input
sequence and, conversely, a channel output sequence into an output digital sequence, so that
the effect of the noisy channel is minimized.
Channel coding, which is introduced to mitigate the effects of channel noise, has an important
influence on the rate at which useful information can be transmitted over a channel. This leads
to one of the questions which Shannon posed in 1948 [4].
The question which Shannon posed was: What is the maximum transmission rate for
reliable communication over a noisy channel?
Definition 3.1: A Discrete Memoryless Channel (DMC) is characterized by a discrete input
alphabet, X, a discrete output alphabet, Y, and a set of conditional probabilities for outputs
given each of the inputs
The conditional probabilities are denoted by p(y|x) for xϵ X and y Є Y . Each output letter of the
channel depends only on one input symbol.
For an input sequence of length N denoted by x = {x1, x2, …, xN}, the conditional probability of a
corresponding output sequence, y = {y1, y2, …, yN}, is expressed as
P_N(y|x) = ∏_{n=1}^{N} P(y_n|x_n)                          ⋯⋯(3.1)

Normally 0 ≤ P(y_n|x_n) ≤ 1 for all n.
The conditional probabilities p(y_k|x_j) are called transition probabilities, for obvious reasons.
The channel is memoryless when the current output symbol depends only on the current
input symbol and not on any of the previous symbols; equation 3.1 expresses this memoryless
condition of the DMC. The channel is said to be discrete when both alphabets X and Y are
countably finite.
Note that it is not necessary for the input alphabet, X, to have the same size as the output
alphabet, Y. For example, for channel coding the size N of the output alphabet may be larger
than the size M of the input alphabet, i.e. N ≥ M.
The channel may conveniently be described by a matrix whose elements are the conditional
probabilities of the channel, with rows indexed by the input and columns by the output, as
follows:

P = [ p(y_1|x_1)   p(y_2|x_1)   ⋯   p(y_N|x_1)
      p(y_1|x_2)   p(y_2|x_2)   ⋯   p(y_N|x_2)
        ⋮
      p(y_1|x_M)   p(y_2|x_M)   ⋯   p(y_N|x_M) ]           ⋯(3.2)
The M x N matrix is called the channel matrix. Note that each row of the matrix corresponds to a fixed
channel input whereas each column of the matrix corresponds to a fixed channel output. Note that a
fundamental property of the channel matrix is that the row elements sum to one, i.e.
Σ_{k=1}^{N} p(y_k|x_j) = 1   for every j                   ⋯(3.3)
Suppose that the probability of the event that the channel input X = x_j is denoted by
Pr{X = x_j} = p(x_j),   j = 1, 2, …, M                     ⋯(3.4)
Then the joint probability distribution of the input random variable, X, and the output random
variable, Y, is given by
p( x j , y k )= Pr {X= x j , Y = y k }
=Pr {Y = y k|X = x j }Pr {X = x j }
= p( y k|x j ) p( x j ) ⋯3 . 5
p(y_k) = Pr{Y = y_k}
       = Σ_{j=1}^{M} p(y_k, x_j)
       = Σ_{j=1}^{M} p(y_k|x_j) p(x_j),   k = 1, 2, …, N    ⋯(3.6)
The probabilities p( x j ), j=1,2,…, M are known as the a priori probabilities of the various
input random variables.
Thus, given the a priori probabilities p(xj) and the channel matrix, we may calculate the
probabilities of the various output variables p(yk) using equation 3.6.
The channel defined next is a special case of the DMC with M = N = 2. It is of great theoretical
importance and practical significance.
Definition 3.2 A binary symmetric channel (BSC) is a DMC with X = Y = {0, 1}, two input symbols
(x1 = 0, x2 = 1) and two output symbols (y1 = 0, y2 = 1) . The conditional probabilities are of the
form
p(y2|x1) = p(y1|x2) = p
p(y1|x1) = p(y2|x2) = 1 – p
The channel is symmetric because the probabilities of error (i.e. receiving a 0 when a 1 is sent
and vice versa) are equal
3.2 Mutual Information
We may think of the channel output, Y, as a noisy version of the channel input, X. We would like
to know how to measure the uncertainty about X after observing Y. It is reasonable to think of
the entropy of the input, H(X), as the prior uncertainty we have about X before observing any
output.
To find the uncertainty about X after observing Y, Shannon defined the notion of mutual
information between two events, α and β, denoted by I(α, β). Mutual information is the
information provided about the α event by the occurrence of the event β. As before, the
probabilities p(α), p(β) and p(α ∩ β) are assumed known.
To be consistent with our previous definition of information we must have the following
boundary conditions.
a) If α and β are independent, i.e. p(α|β) = p(α), then the occurrence of β provides no
information about α, so that I(α, β) = 0.
b) If the occurrence of β indicates that α has definitely occurred, i.e. p(α|β) = 1, then the
occurrence of β provides us with all the information regarding α. Thus

I(α, β) = I(α) = log [1/p(α)]
The two boundary conditions of mutual information are satisfied if we define the mutual
information as follows:
I(α, β) = log [ p(α|β) / p(α) ]
        = log [ p(α ∩ β) / (p(α) p(β)) ]                    ⋯(3.7)

Note that

I(α, α) = log [1/p(α)] = I(α)
Thus, mutual information is a generalization of the concept of information. Hence I(α) is
sometimes referred to as the self – information of the event α.
Note also that if, for example, p(α| β) < p(α) then I(α, β) is negative. Thus while self –
information is non – negative, mutual information can take on negative values.
Negative values of I(α, β) indicate that α is less likely than it was a priori before β was observed.
We are primarily interested in the mutual information between the channel input, X,
and the channel output, Y, of a DMC. We will regard the channel as a random mapping of the
random variable, the channel input, X, into the random variable, Y, the channel output, treating
both the input and output as discrete random variables.
Consider a DMC with input alphabet X, output alphabet Y, and conditional probability p(y|x) for
y Є Y and x Є X.
Suppose in addition that input symbols occur with probability p(x) for x Є X.
The input to the channel can then be regarded as a random variable and the output as a
random variable. If we observe the output y, the information provided about the input x is the
mutual information
I(x, y) = log [ p(y|x) / p(y) ]
        = log [ p(x|y) / p(x) ]                             ⋯⋯(3.8)
we see that it is necessary for us to know the input probability distribution p(xi), i = 1, 2, …, M
so that we may calculate the mutual information I{X, Y} of a channel.
As with sources, we are interested in the average amount of information that the output of the
channel provides about the input. Thus we define the average mutual information between the
input and output of a DMC as
I(X, Y) = E{I(x, y)}
        = Σ_y Σ_x p(y|x) q(x) log [ p(y|x) / Σ_x p(y|x) q(x) ]        ⋯⋯(3.10)
Where q(x) is the probability distribution of the input
The average mutual information I(X,Y) is defined in terms of the given channel conditional
probabilities and the input probability which is independent of the DMC. We can then maximize
I(X,Y) with respect to the input probability distribution
q={q( x ), x∈X }
We digress a little bit before linking the material here, and what follows, with the concept of
channel capacity.
Given an input with an alphabet X, the quantity H(X) gives us the a priori uncertainty about X.
We extend this concept in order to measure the uncertainty about X after observing Y.
Specifically, we define the conditional entropy of X, selected from the alphabet, X, given that Y
= yk i.e.
H{X|Y = y_k} ≜ Σ_{j=1}^{M} p(x_j|y_k) log [ 1 / p(x_j|y_k) ]          ⋯⋯(3.11)
The quantity H{X|Y=yk} is itself a random variable that takes on values H{X|Y=y1}, …, H{X|Y=yN}
with probabilities p(y1), …, p(yN), respectively. The mean of H{X|Y=yk} over the entire alphabet Y
is given by
H{X|Y} = Σ_{k=1}^{N} H{X|Y = y_k} p(y_k)
       = Σ_{k=1}^{N} Σ_{j=1}^{M} p(x_j|y_k) p(y_k) log [ 1 / p(x_j|y_k) ]
       = Σ_{k=1}^{N} Σ_{j=1}^{M} p(x_j, y_k) log [ 1 / p(x_j|y_k) ]   ⋯⋯(3.12)
The quantity H{X|Y} is called the conditional entropy and it represents the amount of
uncertainty remaining about the channel input after the channel output has been observed.
Since H{X} represents the uncertainty about the channel input before observing the channel
output and H{X|Y} is the uncertainty about the channel input after observing the channel
output, the difference H{X} − H{X|Y} must represent our uncertainty about the channel input
that is cleared (resolved) by observing the channel output. Thus this difference must represent
the information which the observation of Y provides about X.
This quantity provides an alternative method of computing the mutual information of the
channel. If we denote the channel mutual information as I{X, Y}, we get

I{X, Y} = H{X} − H{X|Y}

Similarly, by using the symmetry of I{X, Y}, we may also write I{X, Y} = H{Y} − H{Y|X}.
Property 1
We have already shown that the mutual information of a channel is symmetric, i.e.

I{X, Y} = I{Y, X}

where I{X, Y} is the amount of uncertainty about the channel input that is cleared by observing
the channel output and I{Y, X} is the amount of uncertainty about the channel output that is
cleared by sending the channel input.
Property 2
For two random variables X and Y, defined on alphabets X and Y respectively, and given that
X = x_j, j = 1, 2, …, M, then

Pr{X = x_j|Y} = Pr{Y|X = x_j} Pr{X = x_j} / Σ_{i=1}^{M} Pr{Y|X = x_i} Pr{X = x_i}      ⋯⋯(3.15)

We know that

Pr{X, Y} = Pr{X|Y} Pr{Y} = Pr{Y|X} Pr{X}

Hence

Pr{X|Y} = Pr{Y|X} Pr{X} / Pr{Y},   so long as Pr{Y} ≠ 0     ⋯⋯(3.16)

Equation 3.16 is one form of Bayes' rule, and by the theorem of total probability

Pr{Y} = Σ_{i=1}^{M} Pr{Y|X = x_i} Pr{X = x_i}               ⋯⋯(3.17)

Letting X = x_j and using 3.16 and 3.17 yields 3.15.
Example 3.1
p(0|0) = 0.95
p(1|1) = 0.9
(c) Find the probability that a one is transmitted given that a one was received
Solution
(a) The probability of error is not symmetric and hence it is not a BSC.
(b) Define the events
X₀ = a zero is transmitted
X₁ = a one is transmitted
Y₁ = a one is received
(c) Taking the a priori probabilities as p(X₁) = 0.6 and p(X₀) = 0.4,

p(X₁|Y₁) = p(Y₁|X₁) p(X₁) / p(Y₁) = (0.9 × 0.6) / (1 − 0.44) = 0.54 / 0.56 = 27/28
Let us now consider an example which is unrelated to a communication system but which
demonstrates the computation of I{X, Y}.
Example 3.2
Three urns, A1, A2 and A3, are equally likely to be selected. The urns contain white balls (W) and
black balls (B) in proportions such that P(W|A1) = 0.8, P(W|A2) = 0.6 and P(W|A3) = 0.2.
An experiment is done by selecting an urn, drawing a ball from it and observing the colour of
the ball. What is the average mutual information provided by the experiment?
Solution
The input alphabet consists of selecting an urn. Thus X = {A1,A2,A3} with equal probabilities of
selection.
The output of the experiment consists of observing the colour of a drawn ball. Thus
Y = {W, B}
We need to compute

I(X, Y) = H(X) − H(X|Y)

(i) H(X) = log 3
(ii) We will need the conditional probabilities p(x_j|y_k), with x1 = A1, x2 = A2, x3 = A3 and
y1 = W, y2 = B.
P(A1|W) = P(W|A1) P(A1) / [ P(W|A1) P(A1) + P(W|A2) P(A2) + P(W|A3) P(A3) ]
        = (0.8/3) / (0.8/3 + 0.6/3 + 0.2/3)
        = 0.5

P(A1|B) = (0.2/3) / (0.2/3 + 0.4/3 + 0.8/3)
        = 1/7
and similarly

P(A2|W) = 3/8,    P(A2|B) = 2/7
P(A3|W) = 1/8,    P(A3|B) = 4/7
The conditional entropy H(X|Y) may be computed urn by urn. The contribution of urn A_i,
obtained by averaging over the colour of the ball drawn from that urn, is

H(A_i|Y) = P(W|A_i) log [ 1/P(A_i|W) ] + P(B|A_i) log [ 1/P(A_i|B) ]

Thus

H(A1|Y) = 0.8 log 2 + 0.2 log 7 ≈ 1.361
H(A2|Y) = 0.6 log (8/3) + 0.4 log (7/2) ≈ 1.572
H(A3|Y) = 0.2 log 8 + 0.8 log (7/4) ≈ 1.246

and

H(X|Y) = Σ_{i=1}^{3} H(A_i|Y) p(A_i) = (1.361 + 1.572 + 1.246)/3 ≈ 1.393

Hence

I(X, Y) = H(X) − H(X|Y) = log 3 − 1.393 ≈ 0.19 bits

Since I(X, Y) > 0, knowing the colour of the ball drawn from the urn resolves part of the
uncertainty about the input, i.e. about which urn was selected.
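The computation of Example 3.2 can be verified numerically. The sketch below is not part of the original notes; it simply evaluates H(X), H(X|Y) and I(X, Y) from the joint distribution implied by the urn proportions above.

import numpy as np

p_x = np.array([1/3, 1/3, 1/3])                 # P(A1), P(A2), P(A3)
p_y_given_x = np.array([[0.8, 0.2],             # rows: urns, columns: [W, B]
                        [0.6, 0.4],
                        [0.2, 0.8]])

p_xy = p_y_given_x * p_x[:, None]               # joint distribution P(Ai, colour)
p_y = p_xy.sum(axis=0)                          # P(W), P(B) = [8/15, 7/15]

H_x = -np.sum(p_x * np.log2(p_x))               # log2(3)
H_x_given_y = -np.sum(p_xy * np.log2(p_xy / p_y))
print(H_x, H_x_given_y, H_x - H_x_given_y)      # 1.585, 1.393, 0.192 bits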
The input probability distribution, q(x_i), is independent of the channel. Since mutual information
is a relation between the channel input and the channel output, we may maximize the mutual
information of the channel with respect to q(x_i). We thus have the following definition.
The channel capacity, denoted by C, of a DMC is the maximum average mutual information in
any single use of the channel (i.e. transmission interval), where the maximization is taken over
all possible input distributions, i.e.

C = max_{q(x)} I(X, Y)

The channel capacity is measured in bits per channel use or bits per transmission.
The channel capacity is a function only of the conditional probabilities p(yk|xi) which completely
define the channel.
The maximization of the channel capacity over all possible input probability distributions is done
subject to the obvious constraints
q(x_i) ≥ 0 for all i,   and   Σ_{i=1}^{M} q(x_i) = 1
Consider the BSC with probability of error p. The entropy H(X) is maximized (maximum
uncertainty) when p(x = 0) = p(x = 1) = ½. This also maximizes the average mutual information.
Thus

C = max_{q(x)} I(X, Y) = I(X, Y)|_{p(x=0) = p(x=1) = ½}

with

p(y=0|x=1) = p(y=1|x=0) = p
p(y=0|x=0) = p(y=1|x=1) = 1 − p

Then

H(X) = ½ log 2 + ½ log 2 = 1 bit/transmission interval

H(X|Y) = Σ_{k=1}^{2} Σ_{i=1}^{2} p(x_i|y_k) p(y_k) log [ 1/p(x_i|y_k) ]
       = {(1 − p) log [1/(1 − p)] + p log (1/p)} p(y = 0)
         + {(1 − p) log [1/(1 − p)] + p log (1/p)} p(y = 1)
       = p log (1/p) + (1 − p) log [1/(1 − p)]
       = −p log p − (1 − p) log (1 − p)
       = H(p) bits per transmission interval

The last equality is obtained by using the definition of the entropy function which we
encountered when discussing the Binary Memoryless Source (BMS). Thus

C = H(X) − H(X|Y) = 1 − H(p) = 1 + p log p + (1 − p) log (1 − p) bits per channel use

A plot of C versus p is:
Fig. 3.2 Channel capacity versus p
1. When the channel is noise free i.e. the transition probability p=0, C attains its maximum
value of 1 bit per channel use. At this value of p, the entropy function, H(p), attains its
minimum value.
2. When the channel has noise resulting in an error probability p = ½ , the channel capacity,
C, attains its minimum value of zero. The channel is then said to be useless.
3. When p=0, then C=1 bit per channel use, which is exactly equal to the information in
each channel input. This is intuitively correct. Since with p=0, when the channel output
is observed, the uncertainty that is resolved about the channel input must equal the
information in the channel input.
5. Note that we also have C = 1 when p = 1. In this case the channel output is the
complement of the channel input, which means that by observing the output we can
uniquely determine the channel input.
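Fig. 3.2 can be reproduced with a few lines of code. The sketch below is not part of the original notes; it simply plots C = 1 − H(p) against the transition probability p.

import numpy as np
import matplotlib.pyplot as plt

def binary_entropy(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)         # avoid log(0) at the end points
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

p = np.linspace(0, 1, 201)
C = 1 - binary_entropy(p)

plt.plot(p, C)
plt.xlabel("transition probability p")
plt.ylabel("capacity C (bits per channel use)")
plt.title("BSC capacity: C = 1 at p = 0 or 1, C = 0 at p = 1/2")
plt.show()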
3.4 The Channel Coding Theorem
Noise is always present in a communication channel. Noise causes errors between the channel
output and input of a digital communication system. With a probability of error of 10⁻⁴, say, on
average only 1 bit in every 10⁴ transmitted bits is received in error. It is normal in landline
communication to specify a probability of error of 10⁻⁶ or better. To achieve such a high level
of performance requires the use of channel coding.
Channel coding consists of mapping an incoming data sequence into a channel input
sequence and inverse-mapping a corresponding channel output sequence into an output
data sequence in such a way that the effect of channel noise on the system is minimized.
The first mapping operation is performed in the transmitter by a channel encoder and the
inverse mapping operation is performed in the receiver by a channel decoder.
Channel coding is usually employed to effect error detection and/or error correction in a
received channel output data sequence.
We shall confine ourselves to discussing in detail block codes. In this class of codes, k
information bits are mapped into an n – bit block where the n – k bits are called parity –
check bits. The n – k parity – check bits are obtained by a linear operation on the k data bits
and are then appended to the data bits. Note that n > k.
The code rate is thus

r = k/n < 1                                                ⋯⋯(3.19)
We shall further restrict ourselves to discussing a particular class of block codes, cyclic codes,
for which relatively simple digital encoders and decoders (codecs) may be found. Cyclic codes
have found wide utilization in bandwidth-limited channels.
Power – limited channels such as space and satellite channels have been the prime areas of
application of FEC codes. Here both error detection and error correction capabilities of
codes are used.
The accurate representation of the original source sequence at the destination requires that
the average probability of error in the received digital data sequence be as low as possible.
This poses the question: Does there exist a channel coding scheme for which the bit error
probability, Pe, is arbitrarily close to zero and yet the channel coding scheme is efficient in
that the code rate is not too small?
The answer to this question is provided by Shannon’s second theorem. Shannon posed the
question: Is it possible to communicate over a channel at a fixed rate R>0 (bits per second)
such that Pe is arbitrarily close to zero?
Shannon answered the question in the affirmative (a “yes” answer) provided that R is less than
or equal to a quantity, C, called the “channel capacity”. This is achieved by “channel coding”, a
process in which the channel input source sequence (bits) are made to depend on the
source output sequence (bits) in a complicated way. Arbitrarily small bit error probability,
Pe, may be attained as the coding complexity becomes great. Both convolutional codes and
turbo codes are near-capacity coding schemes in that their use may enable transmission
rates at close to the Shannon channel capacity to be attained.
Let the DMS in Fig. 3.3 have a source alphabet u and entropy H(u) bits per source symbol.
We assume that the source emits one symbol every Ts seconds. The average information
rate of the source is thus H(u)/Ts
bits per second (bps). There is no backlog in the system and so the destination is assumed
to receive symbols at the same rate they were emitted by the source i.e. one symbol every
Ts seconds. The DMC was assumed earlier to have a channel capacity of C bits per channel
use.
Assuming that the channel is used once every Tc seconds, the channel capacity per unit time is
C/Tc bits per second. Thus the maximum rate of information transfer over the channel is C/Tc
bits per second (bps).
As explained in the introduction, the theorem asks and answers the question: What is the
maximum transmission rate for reliable communication over a noisy channel?
Theorem 3.1. The Channel Coding theorem for Discrete Memoryless Channels.
(i) Let a DMS with an alphabet u and entropy H(u) produce one symbol every Ts seconds. Let
a DMC have capacity C bits per channel use and be used once every Tc seconds. Then if
H(u)/Ts ≤ C/Tc                                             ⋯⋯(3.20)
there exists a coding scheme for which the source output can be transmitted over the channel
and be reconstituted at the receiver with an arbitrarily small probability of error Pe.
The parameter C/Tc is called the critical rate. If equation 3.20 is satisfied with equality, the
system is said to be transmitting at the critical rate.
(ii) Conversely, if

H(u)/Ts > C/Tc

it is impossible to transmit symbols over the channel and reconstitute them at the receiver with
an arbitrarily small Pe.
Shannon’s second theorem specifies the channel capacity as a fundamental limit on the rate at
which the transmission of error free messages can take place over a DMC.
(i) Shannon’s second theorem assures us that if the theorem is satisfied, a suitable code
exists. It does not tell us how to obtain the code.
(ii) The theorem does not specify a precise value for Pe. If equation 3.20 is satisfied, it only
asserts that Pe can be made to approach zero by using codes of sufficiently large block length
(and hence complexity).
Note that power and bandwidth, although neglected in our discussion so far, are important
determinants of Pe for both power – limited and bandwidth – limited channels. They do
determine values of the elements of the channel matrix P.
As an application, consider a binary source for which:
• The source entropy is one bit per source symbol (i.e. p(0) = p(1) = ½).
• The source output is applied to a channel encoder with code rate r.
Then, assuming that the channel capacity is that determined by the transition probabilities of a
BSC, the channel capacity per unit time is C/Tc bits per second. Part (i) of the Channel Coding
theorem implies that if

1/Ts ≤ C/Tc                                                ⋯⋯(3.21)
then the probability of error can be made to approach zero by choice of a proper channel
encoding scheme.
The code rate is

r = Tc/Ts                                                  ⋯⋯(3.22)

That is, if r ≤ C there exists a code, with code rate less than or equal to C, which can achieve
probabilities of error approaching zero for a BSC.
Example 3.3
Consider a repetition code of block length n = 2m + 1 used over a BSC with transition probability
p: each source bit is transmitted n times. The majority rule is used at the channel decoder: if the
number of received 0's exceeds the number of received 1's in a block of n bits, then a “0” is
declared to be the bit which had been transmitted, and vice versa.
An error occurs when m + 1 or more bits out of n = 2m + 1 are received in error. Because the
channel is symmetric, the average probability of error Pe is independent of the a priori
probabilities of transmission of a “0” or a “1” (prove this). Pe is thus given by

Pe = Σ_{k=m+1}^{n} C(n, k) p^k (1 − p)^{n−k}               ⋯⋯(3.22)
Note that, from equation 3.22, Pe → 0 as n → ∞ provided p < ½. (In fact, by the Law of Large
Numbers, about np channel errors are expected to occur in n transmissions for very large n.)
The price which we must pay for this is that the code rate r = 1/n for repetition codes also tends
to zero as n tends to infinity.
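Equation 3.22 is easy to evaluate numerically. The short sketch below is not part of the original notes; the value p = 0.01 is an illustrative assumption.

from math import comb

def repetition_error_prob(n, p):
    m = (n - 1) // 2
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(m + 1, n + 1))

p = 0.01
for n in (1, 3, 5, 7, 9):
    print(n, repetition_error_prob(n, p))
# Pe falls rapidly with n, while the code rate r = 1/n also falls.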
Note also that for a BSC we can plot C as a function of the transition probability, p, to get:
C =1 + p log p + (1 – p)log(1 – p)
Fig. 3.4
3.5 Block Codes for Error Detection and Correction
We previously introduced the concept of coding for either error detection or for both error
detection and correction. Codes for Forward Error Correction (FEC) are obviously longer than
codes for error detection alone.
Shannon’s Second theorem states that so long as the transmission rate over a channel is less
than the channel capacity, a code exists which will ensure reliable communication over the
channel.
In this section we will discuss one class of such possible codes – parity – check block codes in
which r parity – check bits are formed by a linear operation on k data bits and then appended to
the k data bits and then transmitted as one block. These codes can be used for FEC. We shall
focus on a particular class of block codes – cyclic codes – for which relatively simple CoDecs may
be designed.
Note that the encoder takes the k data bits and outputs a block of n > k binary digits. Both the k
data bits and the block of n bits have a transmission interval of T sec.
This accounts for the need for a greater bandwidth in the modulator required to transmit the
block of n bits as compared to that required for the k data bits. Why is this? Alternatively, if the
bit duration is held constant, the n bits have to be transmitted at a lower rate over the same
bandwidth.
In this code n = k + 1 i.e. for every k data bits a single parity – check bit, usually the modulo – 2
(mod – 2) sum of the data bits, is appended to the data bits. Such a code can detect an odd
number of errors. Why?
With mod-2 addition, the resulting code is called an even parity check code, as the number of
1's in the resulting code word is even. Addition using mod-2 arithmetic will be symbolized by ⊕.
In logic it is synonymous with the Exclusive-OR operation.
Decoding, and detection of an odd number of errors if any have occurred, is done at the receiver
by repeating the parity-check operation (mod-2 addition).
Example 3.4
Suppose the data sequence 100110 (read from the right) is to be transmitted. The even – parity
code word is
Codeword:  1 0 0 1 1 0 1   (the first six bits are the data bits, the last bit is the parity-check bit)
Suppose that three codewords are received: 1011000, 1001111 and 1011111.
Performing mod – 2 addition we get an odd – parity (the sum is 1) for the first two code words
indicating that an odd number of errors has occurred (3 in the first code word and 1 in the
second code word).
Note that although mod – 2 addition enables us to detect an odd number of errors, it does not
tell us which bits in the code word have been received in error.
Mod – 2 addition of the third code word yields an even – parity (the sum is a 0) which may lead
us to conclude that no error in transmission has occurred when, in fact, two errors have
occurred.
With k data bits in a block of n bits, one speaks of an (n, k) code. The encoder outputs a unique
n – bit code word for each of the 2k possible input data words. Thus, for example, a (7, 4) code
has r = n – k = 7 – 4 =3 parity – check bits.
We will limit our discussion to linear codes in which each of the parity – check bits are given by
mod – 2 addition of the data bits. We shall further limit ourselves to systematic codes in which
the parity – check bits appear at the end of a code word.
Let d = {d1, d2, …, dk} be a vector representing the k – data bits. Let the n – bit vector c = {c1, c2,
…, ck, ck+1, …, cn}. For a systematic code,
c_i = d_i,   i = 1, 2, …, k
The r = n − k parity-check bits are obtained as a weighted mod-2 addition of the data bits:

c_{k+1} = a_11 d_1 ⊕ a_12 d_2 ⊕ … ⊕ a_1k d_k
c_{k+2} = a_21 d_1 ⊕ a_22 d_2 ⊕ … ⊕ a_2k d_k
   ⋮
c_n     = a_r1 d_1 ⊕ a_r2 d_2 ⊕ … ⊕ a_rk d_k               ⋯⋯(3.23)

In matrix form, c = dG, where G is a k × n matrix whose first k columns form an identity matrix
I_k representing the k data bits. The remaining r columns represent the transpose of the a_ij
coefficients of equation 3.23, i.e. G = [ I_k  P ] with

P = [ a_11  a_21  ⋯  a_r1
      a_12  a_22  ⋯  a_r2
       ⋮     ⋮         ⋮
      a_1k  a_2k  ⋯  a_rk ]
G is called the code generator matrix and P is called the parity check matrix.
Example 3.5
Consider a (7, 3) code with

P = [ 1 1 0 0
      0 1 1 0
      1 1 1 1 ]                                            ……(3.26)
Note that k = 3 and, therefore, the code rate, k/n, is 3/7. The n = 7 code words are
generated from the matrix equation
   d          c = d[I_3  P]
000 000 0000
001 001 1111
010 010 0110
011 011 1001
100 100 1100
101 101 0011
110 110 1010
111 111 0101
Will linear systematic codes always satisfy the prefix condition? Why?
The Hamming weight of a code word c, denoted by w(c) is the number of non zero components
of c.
The Hamming distance between two code words c1 and c2, denoted by d(c1, c2), is the number
of positions in which they differ.
For example, if c1 = 110101 and c2 = 111000 then d(c1, c2) = 3. Note that d(c1, c2) = w(c1 ⊕ c2).
For Example 3.5 we note that the minimum distance between any two code words is 3. We can
thus correct one error, since the received code word, with one error, will still be closest in
distance to the correct code word. This is thus an example of a single error correcting code.
If the code is used only for error correction, then the code can correct all patterns of t or fewer
errors provided that 2t + 1 ≤ d.
A code can correct all patterns of t or fewer random errors and detect all patterns having no
more than s errors provided that 2(s + t) + 1 ≤ d.
3.5.3 Decoding
Before discussing decoding, let us establish a probabilistic criterion for decoding. Such a
criterion, called maximum likelihood decoding, will be shown to be equivalent to finding the
closest code word.
Given the received vector r, the decision rule that minimizes the probability of error is to choose
the codeword c_i which maximizes p(c = c_i|r). This is called the maximum a posteriori decision
rule. The proof that this minimizes the probability of error will not be presented here.
By Bayes' rule,

p(c|r) = p(r|c) p(c) / p(r)
since p(r), the probability of observing the vector r is independent of c, maximizing p(c/r) is
equivalent to maximizing
p(r/c)p(c)
if we now assume, without loss of generality, that each codeword is chosen with equal
probability then maximizing p(r/c)p(c) is equivalent to maximizing p(r/c).
A codeword which is chosen to maximize p(r/c) is said to be chosen according to the maximum
likelihood criterion.
Assume a BSC with independent noise in each position and crossover probability p. Then

p(r|c) = ∏_{i=1}^{n} p(r_i|c_i)

with

p(r_i|c_i) = 1 − p   if c_i = r_i
           = p       if c_i ≠ r_i

Then

p(r|c) = ∏_{i=1}^{n} p(r_i|c_i) = (1 − p)^{#(r_i = c_i)} p^{#(r_i ≠ c_i)}

or

p(r|c) = (1 − p)^{n − #(r_i ≠ c_i)} p^{#(r_i ≠ c_i)}
       = (1 − p)ⁿ [ p/(1 − p) ]^{#(r_i ≠ c_i)}

But #(r_i ≠ c_i) is the Hamming distance between r and c, and for p < ½ we have p/(1 − p) < 1.
Thus p(r|c) is maximized when #(r_i ≠ c_i) attains its lowest value. Hence, to maximize p(r|c) we
should choose the codeword c_i which is closest to r. Thus, under our assumptions, the
maximum likelihood criterion is the minimum distance criterion. We now consider decoding.
c = d[ Ik P] = [ d dP ] ……3.27
=[d cp ] ……3.28
Where cp = dP
Suppose that c now represents a received sequence of n bits. The first k bits correspond to data
bits. To decode the received sequence, therefore, the receiver performs the operation dP and
compares it to the received parity – check sequence. Since mod – 2 addition and subtraction are
the same, then
dP ⊕ c_p = 0                                               ……(3.29)

dP ⊕ c_p = [ d  c_p ] [ P
                        I_{n−k} ] = 0                      ……(3.30)
where I_{n−k} is an (n − k) × (n − k) identity matrix. Since the codeword is c = [ d  c_p ],
equation 3.30 may be written as c Hᵀ = 0, where

H = [ a_11  a_12  ⋯  a_1k  1  0  ⋯  0
      a_21  a_22  ⋯  a_2k  0  1  ⋯  0
       ⋮
      a_r1  a_r2  ⋯  a_rk  0  0  ⋯  1 ]                    ……(3.34)

i.e. H = [ Pᵀ  I_{n−k} ].
Now suppose that an error has occurred in one of the bits of c. The received vector, r, can be
written as
r=c⊕e ……3.35
where e is an n – bit vector representing the error pattern. [If an error has occurred in the third
and fifth positions, for example,
e = (001010…0)]
Post-multiplying r by Hᵀ yields:

r Hᵀ = (c ⊕ e) Hᵀ = c Hᵀ ⊕ e Hᵀ = e Hᵀ = s                  ……(3.36)
The vector s, called the syndrome, will be non zero if an error has occurred.
Suppose that a single error has occurred in the ith position. Then

e = (0 0 … 0 1 0 … 0)   (the 1 in the ith position)

and

s = e Hᵀ

Comparing with 3.34, we note that s is just the ith column of H. This makes the error not only
detectable but also correctable.
Example 3.6
Consider a (7, 3) code and assume that the transmitted code word is 1101010 and the received
bit sequence is 1111010 ( i.e. an error has occurred in the third position). From 3.30 we get
H = [ 1 0 1 1 0 0 0
      1 1 1 0 1 0 0
      0 1 1 0 0 1 0
      0 0 1 0 0 0 1 ]
s = rHT = (1111)
s is the third column of H verifying that an error has occurred in the third position.
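The syndrome computation of Example 3.6 can be checked with the short sketch below (not part of the original notes), working over GF(2) with numpy.

import numpy as np

H = np.array([[1, 0, 1, 1, 0, 0, 0],
              [1, 1, 1, 0, 1, 0, 0],
              [0, 1, 1, 0, 0, 1, 0],
              [0, 0, 1, 0, 0, 0, 1]])

r = np.array([1, 1, 1, 1, 0, 1, 0])        # received word, error in position 3

s = r @ H.T % 2                             # syndrome s = r H^T (mod 2)
print(s)                                    # [1 1 1 1]

# The syndrome equals the third column of H, so the single error is in
# position 3; correcting it recovers the transmitted codeword 1101010.
error_position = int(np.argmax((H.T == s).all(axis=1))) + 1
corrected = r.copy(); corrected[error_position - 1] ^= 1
print(error_position, corrected)            # 3 [1 1 0 1 0 1 0]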
For a code to correct up to t errors it must meet any of several bounds. One of these bounds is
the Hamming bound.
With an (n, k) code, transmitted blocks have length n and there are 2ⁿ possible binary sequences
of this length. Since channel coding is being used, only some of the 2ⁿ sequences are codewords.
We are interested in the number of sequences that lie within a given distance of a codeword, as
determined by the error-correcting capability of the code.
Assume that the codeword of interest is represented by the all-zeros vector. It is then easy to
show that the number of sequences that are at a distance of less than or equal to t from the
codeword is given by

Σ_{j=0}^{t} C(n, j)                                        ……(3.40)

Since there are 2^k possible codewords, the total number of sequences which are at a distance
of less than or equal to t from some codeword is given by

2^k Σ_{j=0}^{t} C(n, j)                                    ……(3.41)

Obviously

2^k Σ_{j=0}^{t} C(n, j) ≤ 2ⁿ

or

2^{n−k} ≥ Σ_{j=0}^{t} C(n, j)

which is the Hamming bound for a code that corrects up to t errors.
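The bound is easy to check for specific codes. The short sketch below is not part of the original notes; the (n, k) pairs used are illustrative, chosen to match Example 3.7 and the (7, 3) and (7, 4) codes discussed elsewhere in this chapter.

from math import comb

def satisfies_hamming_bound(n, k, t=1):
    return 2 ** (n - k) >= sum(comb(n, j) for j in range(t + 1))

for n, k in [(6, 3), (6, 4), (7, 4), (7, 3)]:
    print((n, k), satisfies_hamming_bound(n, k))
# (6, 3) True, (6, 4) False, (7, 4) True, (7, 3) True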
For a single-error-correcting code (t = 1) the Hamming bound becomes

2^r − (r + 1) ≥ k                                          ……(3.42)

This result can be related to the properties which the columns of the H matrix must satisfy. To
be a single error correcting code, all columns of H must be unique. This can happen if the first
k columns (of r bits each) are distinct, as they are related to the data bits. They must also be
different from the last r columns, which contain a single 1 in each column, and H must not
contain an all-zero column. (Why?)
Example 3.7
Consider a (6, 3) and a (6, 4) code and determine whether they are single error correcting.
For the (6, 3) code r = 3, and r = 2 for the (6, 4) code. Applying inequality 3.42, for the (6, 3)
code we get 2³ − 4 = 4 ≥ k = 3, so the code can be single error correcting; for the (6, 4) code
we get 2² − 3 = 1 < k = 4, so the code cannot be single error correcting.
Some typical codes, where we select the largest value of k only, are shown below.
3.5.4 Cyclic Codes
These are codes in which the code words are cyclic shifts of one another. Thus, for example, if
c = (c_1, c_2, …, c_n) is a code word, so are (c_2, c_3, …, c_n, c_1) and (c_3, c_4, …, c_n, c_1, c_2).
These codes are generated by a generator matrix in which the element in the kth row and nth
column of G is always a 1.
    [ 1  0  ⋯  0 | ⋯     ]
G = [ 0  1  ⋯  0 | ⋯     ]
    [ ⋮              ⋮   ]
    [ 0  0  ⋯  1 | ⋯  1  ]
To see why, choose the data word d_1 = d_2 = … = d_{k−1} = 0, d_k = 1, so that c = dG is the kth
row of G. If the parity-check part of this row were all zero, the codeword would be

c = 0 0 … 0 1 0 … 0   (the single 1 in the kth position)

and a single cyclic shift would give the codeword

c = 0 0 … 0 0 1 0 … 0   (the single 1 in the (k + 1)th position)

This is impossible, since an all-zero data sequence would then be producing a non-zero
parity-check sequence.
Cyclic codes can be represented by polynomials. Thus the codeword c = (c_1, c_2, …, c_n) is
represented by

c(x) = c_1 x^{n−1} + c_2 x^{n−2} + … + c_n

In the polynomial, c_1 represents the first bit of the code word and c_n the last bit of the code
word. A single cyclic shift, generating another code word, is represented by x c(x).
The polynomial representation may be extended to the matrix G. To do that, we insert the
appropriate power of x wherever there is a 1 in G and we leave blanks where we have 0s.
    [ 1 0 0 1 1 1 0 ]
G = [ 0 1 0 0 1 1 1 ]
    [ 0 0 1 1 1 0 1 ]

    [ x⁶  ·   ·   x³  x²  x   ·  ]
G = [ ·   x⁵  ·   ·   x²  x   1  ]
    [ ·   ·   x⁴  x³  x²  ·   1  ]

In general,

    [ x^{n−1}    ·     ⋯                 ]
G = [    ·    x^{n−2}  ⋯                 ]
    [    ⋮                               ]
    [    ·       ·     ⋯  x^{n−k}  ⋯  1  ]
The polynomial for the last row has the form

g(x) = x^{n−k} + … + 1
and is called the generator polynomial of the code. This polynomial determines the properties
of the code.
(i) The kth (bottom) row of G is the generator polynomial g(x).
(ii) To generate the (k − 1)th row, cyclically shift the kth row one column to the left. This is
x g(x). The kth column entry of this row must be a zero, since the matrix G has the identity
matrix in the first k columns.
(iii) If the kth column entry is a 1, add the kth row, g(x), to this (k − 1)th row using mod-2
addition.
(iv) To generate the (k − 2)th row, repeat the process, but now operating on row (k − 1). Add
g(x) to this row if the kth column entry is a 1.
(v) Repeat the process until you reach the top row.
Example 3.8
Assume g(x) = x⁴ + x³ + x² + 1 with n = 7. (This is a (7, 3) code.) If shifting is done without adding
g(x) to the (k − 1)th and subsequent rows we get
    [ x²g(x) ]   [ x⁶  x⁵  x⁴  ·   x²  ·  · ]
G = [ x g(x) ] = [ ·   x⁵  x⁴  x³  ·   x  · ]
    [  g(x)  ]   [ ·   ·   x⁴  x³  x²  ·  1 ]

We note that the first k = 3 columns do not represent an identity matrix. To generate G in the
standard form, we follow the procedure above:

    [ x[x g(x) + g(x)] ]   [ x⁶  ·   ·   x³  x²  x  · ]
G = [  x g(x) + g(x)   ] = [ ·   x⁵  ·   ·   x²  x  1 ]
    [      g(x)        ]   [ ·   ·   x⁴  x³  x²  ·  1 ]
Replacing x’s by 1’s and blanks by 0’s we get
    [ 1 0 0 1 1 1 0 ]
G = [ 0 1 0 0 1 1 1 ]
    [ 0 0 1 1 1 0 1 ]
Theorem 3.2: The generator polynomial, g(x), of an (n, k) cyclic code is a divisor of xⁿ + 1.
Example 3.9
Consider the class of (7, k) codes. The generator polynomials are divisors of x⁷ + 1. We can show
that

x⁷ + 1 = (x + 1)(x³ + x² + 1)(x³ + x + 1)
The rows of G are obtained from g(x) by successive multiplication by x. Each cyclic code is thus
derivable from g(x) using the relation c = dG. A code word c(x), in polynomial form, may thus be
written as

c(x) = d(x) g(x)

The polynomial d(x) must be of degree (k − 1) or less so that c(x) is a polynomial of degree
(n − 1) or less. The operation x^{n−k} d(x) produces a polynomial of degree (n − 1) or less. Now,
with g(x) a polynomial of degree (n − k), we have
x^{n−k} d(x) / g(x) = q(x) + r(x) / g(x)                   ……(3.45)
where the division has yielded a polynomial q(x) of degree (k – 1) or less and a remainder
polynomial r(x).
Since in mod-2 addition r(x) + r(x) = 0, it follows that the polynomial x^{n−k} d(x) + r(x), of degree
(n − 1) or less, is divisible by g(x) and must, therefore, be a code word. Thus

c(x) = x^{n−k} d(x) + r(x)

But x^{n−k} d(x) represents a shift of the data bits by (n − k) positions. Hence the remainder r(x)
must represent the parity-check bits.
Example 3.10
Consider the (7, 3) code with generator polynomial g(x) = x⁴ + x³ + x² + 1. Assume that the data
word is d = (0 0 1), i.e. d(x) = 1. Then

r(x) = remainder of [ x^{n−k} d(x) / g(x) ]
     = remainder of [ x⁴ / (x⁴ + x³ + x² + 1) ]
     = x³ + x² + 1

The code word is thus

c = (0 0 1 1 1 0 1)

Similarly, for the data word d = (1 1 1), i.e. d(x) = x² + x + 1, the same procedure yields

c = (1 1 1 0 1 0 0)
Since for a given k we can generate all possible data sequences, this approach enables us to find
all the corresponding codewords.
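The parity-bit computation of Example 3.10 can be sketched in code. The following is not part of the original notes; it represents polynomials as lists of GF(2) coefficients, highest power first.

def poly_mod2_remainder(dividend, divisor):
    rem = dividend[:]                       # work on a copy
    for i in range(len(dividend) - len(divisor) + 1):
        if rem[i]:                          # XOR the divisor in when the leading bit is 1
            for j, g in enumerate(divisor):
                rem[i + j] ^= g
    return rem[-(len(divisor) - 1):]        # the last n-k coefficients are the remainder

def cyclic_encode(data_bits, g, n):
    k = len(data_bits)
    shifted = data_bits + [0] * (n - k)     # x^(n-k) d(x)
    parity = poly_mod2_remainder(shifted, g)
    return data_bits + parity

g = [1, 1, 1, 0, 1]                         # g(x) = x^4 + x^3 + x^2 + 1
print(cyclic_encode([0, 0, 1], g, 7))       # [0, 0, 1, 1, 1, 0, 1]
print(cyclic_encode([1, 1, 1], g, 7))       # [1, 1, 1, 0, 1, 0, 0]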
To summarise:
i. We use the Hamming bound to select a code meeting our design criteria. We choose n and
k.
Example 3.11
Consider communication over an AWGN channel using PSK modulation. The bit error probability
is given by

Pe = ½ erfc( √(Eb/N₀) )

where Eb/N₀ is the bit energy to noise density ratio. If a rate 4/7 single-error-correcting block
code is used, the energy available per transmitted bit is reduced by the code rate, so that the
channel bit error probability becomes

Pe,1 = ½ erfc( √( (4/7) Eb/N₀ ) )

Since this is a single error correcting code, the block error probability is given by

Pe,coded = Σ_{j=2}^{7} C(7, j) (Pe,1)^j (1 − Pe,1)^{7−j} ≈ 21 (Pe,1)²
3.6 CONVOLUTIONAL CODES
The convolutional encoder, unlike the block encoder which operates on a block-by-block basis,
acts on streams of bits. A general block diagram of a convolutional encoder is shown in Fig. 3.6.
This particular encoder has one input, n outputs and K shift-register stages. For every one input
bit, the encoder outputs n bits. It is thus referred to as a rate 1/n encoder. K is referred to as the
constraint length. The encoder is thus a constraint-length K, rate 1/n encoder.
Assume that the ith bit, d_i, has arrived at the input. The two output bits at time i are denoted as
c_i1 and c_i2. Clearly

c_i1 = d_i ⊕ d_{i−1} ⊕ d_{i−2}
c_i2 = d_i ⊕ d_{i−2}

or, in matrix notation,

[ c_i1  c_i2 ] = [ d_i  d_{i−1}  d_{i−2} ] [ 1  1
                                             1  0
                                             1  1 ]
In the tree diagram, inputs of the encoder are indicated by paths whereas encoder outputs are
indicated by symbols on the upper branches of the tree. An input zero specifies the upper
branch of a bifurcation while a one specifies the lower one. Thus for the encoder of figure (),
the input sequence 0 1 1 0 is indicated by up at the first branching level, down at the second
and third, and up at the fourth producing the outputs indicated on the diagram traversed: 00 11
01 01. Fig() indicates all output sequences corresponding to all 32 possible sequences for the
first five input bits.
The tree diagram indicates that after the first three branches the structure becomes repetitive.
In fact, we note that beyond the third branch the code symbols on branches emanating from
the two nodes labelled a are identical, and similarly for all the identically labelled pairs of nodes.
Thus all identically labeled nodes in fig() can be joined together. This leads to redrawing the tree
diagram as shown in fig (). This new figure is called a trellis diagram.
The convention used in the trellis diagram is that the code branches produced by a “0” input bit
are shown as solid branches and code branches produced by a “1” input bit are shown dashed.
Note that the last bit must be followed by K-1 zeros to clear the register of all input bits. These
K-1 output branches are called the tail of this code. In this case the last two branches are the tail
of this code.
The completely recursive structure of the trellis diagram leads to a further simplification of the
representation of the encoder. Fig() is a representation of the encoder using a state diagram.
The states of the state diagram correspond to nodes of the trellis diagram
The encoder is a finite state digital machine. Since K=3, a four state directed graph can be used
to uniquely represent the input-output relations for the machine. A state diagram
representation of fig() is shown in fig()
Polynomial Encoding
Since

c_i1 = d_i ⊕ d_{i−1} ⊕ d_{i−2}
c_i2 = d_i ⊕ d_{i−2}

the delays of the data bits may be captured by using generator polynomials. Thus the generator
polynomials for the two streams may be written as

g⁽¹⁾(D) = 1 + D + D²
g⁽²⁾(D) = 1 + D²

where D is the unit delay operator.
The output polynomials are C⁽ⁱ⁾(D) = g⁽ⁱ⁾(D) d(D). For the input message 1 1 0 0 1, i.e.
d(D) = 1 + D + D⁴, we get

C⁽¹⁾(D) = 1 + D³ + D⁴ + D⁵ + D⁶
C⁽²⁾(D) = 1 + D + D² + D³ + D⁴ + D⁶

Thus the output bit streams are

C⁽¹⁾ = 1 0 0 1 1 1 1
C⁽²⁾ = 1 1 1 1 1 0 1

The output message is formed by interleaving the two bit streams to get

1 1 0 1 0 1 1 1 1 1 1 0 1 1

The last 4 output bits are flushing bits. Thus polynomial encoding automatically yields the
correct encoded bit stream together with the flushing bits.
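The rate-1/2, constraint-length-3 encoder above is easily simulated. The sketch below is not part of the original notes; it appends the K − 1 = 2 flushing zeros and reproduces the interleaved output of the worked example.

def conv_encode(bits):
    d1 = d2 = 0                       # shift-register contents d_{i-1}, d_{i-2}
    out = []
    for d in bits + [0, 0]:           # append the two flushing bits
        out.append(d ^ d1 ^ d2)       # c_i1 = d_i + d_{i-1} + d_{i-2} (mod 2)
        out.append(d ^ d2)            # c_i2 = d_i + d_{i-2} (mod 2)
        d1, d2 = d, d1                # shift
    return "".join(map(str, out))

print(conv_encode([1, 1, 0, 0, 1]))   # 11010111111011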
Exercise
1. Find the output bit stream for an input message 1 1 0 1 1 1 0 0 1 applied to the encoder
of fig ()
2. Draw a code tree, trellis diagram and state diagram for the following encoder
PROBLEMS
Entropy
1. (i) Justify the code word lengths for the letters Q and E in the Morse code for English
(ii) Using approved Kiswahili story books such as those by Shaaban Robert find the single
letter frequencies of letters in Kiswahili.
2. Show that the Shannon entropy measure of equation 2.12 satisfies all four Khinchin
axioms.
3. Entropy measures using the escort probabilities of eqn 2.29 provide control over the
contributions made by individual events. If α or q is large and positive, for example, the entropy
measure is more sensitive to events that occur more often. Draw curves for the Rényi, Tsallis
and Shannon entropies for the following values of α and q: −0.1, −0.5, −0.8, 1.2, 1.5, 2.0, 3.0.
Discuss your results.
4. (i) Show that the Shannon entropy measure of eqn 2.12 satisfies all four Khinchin axioms,
with the axiom given by eqn 2.11.
(ii) Derive eqn 2.17 using the Tsallis entropy measure of eqn 2.18.
Show also that

P_i^q = P_i e^{(q−1) ln P_i}

and hence that, in the limit as q → 1, the Tsallis entropy

S_q^T = [ 1 − Σ_{i=1}^{N} P_i^q ] / (q − 1)

reduces to the Shannon entropy.
Data Compression
6. Identify which codes are prefix codes and which ones are uniquely decodable in the Table
below. Draw decision trees for all prefix codes.
Which of the two codes would you use in a communication system and why?
Consider the binary sequence

11101001100010110100…

Use the Lempel–Ziv–Welch algorithm to encode the sequence. Assume that the binary
symbols 0 and 1 are already stored in the codebook.
9. A source has an alphabet of 4 letters. The probabilities of the letters and two possible sets
of binary code words for the source are given below:
For each code, without proofs or numerical calculations; answer the following questions:
10. Consider two discrete memoryless sources. Source I has an alphabet of 6 letters with
probabilities: 0.3, 0.2, 0.15, 0.1, 0.1. Source II has an alphabet of 7 letters with probabilities:
0.3, 0.25, 0.15, 0.1, 0.1, 0.05 and 0.05. Construct a binary Huffman code for each set of
messages. Find the average number of code letters per source letter for each code. Find the
Huffman code using two different methods and show that they both yield the same average
codeword lengths but different codeword length variances.
11. (i) Find H(u).
(ii) Construct a prefix condition code for this source with average code word length equal
to H(u).
Note that

Σ_{n=1}^{∞} n r^{−n} = r / (r − 1)²
12. Two sources have source alphabets u and a with source statistics
Symbol   Probability        Symbol   Probability
  u1        1/4               a1        1/3
  u2        1/4               a2        1/3
  u3        1/8               a3        1/9
  u4        1/16              a4        1/9
  u5        1/16              a5        1/27
  u6        1/8               a6        1/27
  u7        1/16              a7        1/27
  u8        1/16
(a) Find
(i) The Huffman code for each source.
(ii) The entropy for each source.
(iii) The average code word length and variance for each source.
(b) Draw a block diagram for a computer program to determine the entropy for each
source if source sequences are of length N = 4. Draw another block diagram for
arbitrary N.
(c) In percentage terms, how does the average length differ from the entropy?
(d) Is the inequality H(u) ≤ L_N ≤ H(u) + 1 satisfied?
(e) Repeat parts (a) – (d) above for the Shannon – Fano code.
13. A source has the following alphabet and source probabilities:
Symbol Probability
S1 0.3
S2 0.1
Space 0.6
Two source symbols are taken in sequence and then encoded into a binary sequence. If a
space is used to separate the code words, find:
(i) The entropy of the source, H(u).
(ii) The entropy of the source sequence, H(u^N).
(iii) Is H(u^N) = N H(u)? Why?
15. A source produces a sequence of statistically independent binary digits with the
probabilities p(1) = 0.005 and p(0) = 0.995. These digits are taken 100 at a time and a binary
code word is provided for every sequence of 100 digits containing 3 or fewer 1's.
(i) If the code words are all of the same length, find the minimum length required to
provide the specified set of code words.
(ii) Find the probability of getting a source sequence for which no codeword has been
provided.
16. Two sources have alphabets u1 = {a1, a2} and u2 = {b1, b2, b3}, with joint probabilities

P(u1, u2)     b1      b2      b3
   a1        0.08     0.2     0.1
   a2        0.06     0.5     0.06

(i) Determine a Huffman code, using one codeword for each symbol pair {u1, u2}. Find
its average codeword length.
(ii) Ascertain whether the entropy of the joint event {u1, u2} is equal to, greater than or
smaller than the sum of the two entropies H(u1) + H(u2).
17. A DMS has an alphabet u = {u1, u2, u3} with source statistics and code word representations
as shown in the table below:
(b) There is one code for which it is not possible to draw a decision tree. Which code is
that and why is it not possible to draw a decision tree? Draw a decision tree for the
code for which it is possible to do so.
18. A discrete memoryless source has source alphabet u={a, b, c} with source statistics
P(a) = 1/2,   P(b) = 1/4,   P(c) = 1/4
An extended source is made by taking as outputs source sequences of length N = 2.
(a) Determine the alphabet and source statistics for the extended source.
(b) Find the Shannon – Fano code for the extended source.
(c) Find the entropy of the extended source
(d) Find the average codeword length for the Shannon-Fano code.
(e) Find a Huffman code with the least codeword variance.
19. Define a sequence of random variables as follows. Toss an unbiased coin. If the
Markov Sources
20. Find the equilibrium state probabilities for the Markov source shown below
(i) Find the entropy of each source output letter conditional on the current source
state.
(ii) Find the entropy per letter of the source sequence.
(iii) For each source state, find a variable length binary code for the letters in the source
alphabet.
(iv) Do the average codeword lengths for each state satisfy Shannon’s source coding
Theorem for discrete memoryless sources?
21. (a) A four state Markov source is represented by the following state transition
(b) Assume that Sx represents the state of the Markov source just after it has emitted
letter X, find the encoded source output bit stream for the following source output
symbol stream.
dabbcaacd
REFERENCES
1. Shu-Kun Lin, “Gibbs Paradox and the Concepts of Information, Symmetry, Similarity
and their Relationship”, http://www.mdpi.com/1099-4300/10/1/1,2008.
3. H. Nyquist, “Certain Factors Affecting Telegraph Speed”, Bell Syst. Tech. J., Vol. 3, pp
324 – 352, April, 1924.
4. C.E. Shannon, “A mathematical Theory of Communication”, Bell Syst. Tech. J., Vol. 27,
pp 379 – 423, 623 – 656, July, October, 1948.
7. S. Verdu, “Fifty Years of Shannon Theory”, IEEE Trans. Inform. Theory, Vol. 44, No. 6,
pp 2057 – 2078, 1998.
9. Dictionary.com
10. C. Beck, “Generalized Information and Entropy Measures in Physics”,
arXiv:0902.1235v2 [cond-mat.stat-mech].
11. R.V.L. Hartley, “Transmission of information”, Bell Syst. Tech. J., Vol.7, pp 535 -563,
July 1928.
14. E.T. Jaynes, “Information Theory and Statistical Mechanics I”, Phys. Rev., 106, pp 620
– 630, 1957.
15. E.T. Jaynes, “Information Theory and Statistical Mechanics II”, Phys. Rev. 108, pp 171
– 190, 1957.
16. M.L. Luhanga and T.G.B. Nyambo, “On the Entropy of Kiswahili”, Tanzania Journal of
Science, Vol.18, pp 1-5, 1992.
17. A.J. Viterbi and J.K. Omura, Principles of Digital Communication and Coding,
McGraw-Hill Book Company, New York, N.Y., 1979.
18. D.A. Huffman, “A Method for the Construction of Minimum-Redundancy Codes”,
Proc. IRE, Vol. 40, No. 9, pp 1098 – 1101, Sept. 1952.
19. E.S. Schwartz, “An Optimal Encoding with Minimum Longest Code and Total Number
of Digits”, Inform. Contr. Vol.7, pp 37 – 44, March, 1960.
20. R.G. Gallager, Information Theory and Reliable Communication, Wiley, New York,
1968.
23. D.R. Cox and H.D. Miller, The Theory of Stochastic Processes, Chapman and Hall,
London, 1965.
24. C.E.M. Amos, “Performance Analysis of IP – Based Networks: A Case Study of the
University of Dar es Salaam Local Area Network”, M.Sc. Dissertation, University of
Dar es Salaam, July, 2006.