
LECTURE NOTES IN

TE 314: INTRODUCTION TO INFORMATION THEORY

By
Prof. Matthew L. Luhanga

CHAPTER ONE
INTRODUCTION

1.1 Etymology

The sources and development of the word "information", i.e., its etymology, show that it is a multi-faceted concept [1]. In this introductory course on information theory we shall confine ourselves to the technical meaning of the word. In the context of Lin [1], we confine ourselves to the domains of syntax and empirics.

Syntax and empirics represent, respectively, the formalism used in representing a message and the study of signals (i.e., signs and symbols) used to carry a message. Indeed, human communication can be described as [2]:

Humans are symbol-making creatures. We communicate by symbols
—growls and grunts, hand signals and drawings painted on cave
walls in prehistoric times. Later we developed languages,
associating sounds with ideas. Eventually Homo sapiens developed
writing, perhaps first as symbols scratched on rocks, and then written
more permanently on tablets and paper. Today we transmit symbols –
coded digital signals of voice, graphics, video and data – around
the world at close to the speed of light.

Information theory is the cornerstone of the theory and practice of electrical and optical communication. It describes the context within which electrical and optical communication systems can be studied and understood.

1.2 Technical meaning of Information

Engineers are interested in information theory as an important discipline in the design, operation and optimization of communication systems and networks. The elementary concepts of information theory to be covered in this course are shown in the annex to this chapter.

Information theory is a mathematical discipline. It may be considered to have begun with the work of Harry Nyquist [3], although it builds on earlier work for some of its concepts, such as entropy. We will come back to this point later.

1.2.1 What is Information?

As alluded to previously, the concept of information has many attributes. One Web-based definition of information is [9]:

1. Knowledge communicated concerning a particular fact or circumstance; news: information concerning a crime.

2. Knowledge gained through study, communication, research, instruction, etc.; factual data: His wealth of general information …

3. The act or fact of informing.

This broad definition can be narrowed by classifying information into two broad categories [10]:

• Elementary information
• Advanced information

Elementary information is related to technical details such as, for example, the frequency of observing certain letters of the alphabet in a long text in a given language.

Advanced information is related, for example, to understanding the content of the text. Thus, for example, the information a reader gets from reading a text on self-similar processes in telecommunications is very different from the information the reader would get from reading a book by, say, Shaaban Robert.

Advanced information involves the understanding of content in, possibly, multimedia messages. This necessarily involves the brain, a very complex organ which varies from individual to individual.

The complexity of the brain has made it difficult, so far, to find a measure for advanced information. We will confine ourselves to a discussion involving elementary information, for which an information measure exists.

The concept of information and its quantification (measure) on which we will focus in this course is one which is useful in the modeling, design and dimensioning of telecommunication systems.

1.2.2 Historical Quest for a Quantitative Measure of Information

Information theory, as an engineering discipline, is often taken to have begun with work by Harry Nyquist in the 1920s. In a paper written in 1924 [3], Nyquist suggested that there were two factors which determined the speed of transmission of "intelligence": the shape of the signals used in the transmission of intelligence and the choice of a code used to represent the intelligence. Nyquist's paper derives the following formula:

W = k log m

where W is the speed of transmission of intelligence,

m is the number of current levels used in coding the intelligence,

and k is a constant.

The words "speed of transmission of intelligence" were meant by Nyquist to mean "the number of characters, representing letters, figures, etc., transmitted in a given unit of time".

Nyquist used his formula to compute the theoretical speed of transmission of intelligence and compared his results to practical speeds obtained by using different codes in a variety of media.

What was being transmitted – the intelligence – was not the message in the form of letters, figures, etc., but a representation of the message using characters. Intelligence was thus embedded in the representation of messages.

A few years later, in 1928, without acknowledging the work of Nyquist, R.V.L. Hartley [11] expanded the ideas related to "information" rather than "intelligence". Hartley developed the concept of information based on "…physical as contrasted to physiological considerations" for use in electronic communication. Although Hartley never defined the concept of information – he called it an elusive term – he addressed the "…amount of information". Information existed in the transmission of symbols, with the symbols having "…certain meaning to the parties communicating". When someone receives information, each received symbol sequence allows the recipient to "exclude possibilities" – excluding other possible symbol sequences which could have been transmitted and their associated meanings. The precision of the information depends upon what other symbol sequences might have been chosen for transmission.

The measure of these other sequences provides an indication of the amount of information transmitted. It was then suggested that use be made "…as our measure of information the logarithm of the number of possible symbol sequences" [11].

Hartley quantified his information measure by first assuming that each transmitted message consisted of n symbols transmitted consecutively. At each transmission instant, one symbol, out of an alphabet of m symbols, was chosen for transmission. There are thus m^n possible symbol sequences which could be transmitted. Assuming that each symbol sequence was equally likely to be transmitted, the information H in a transmission was computed to be [11]:

H = log m^n = n log m ……………………………………1.1

Hartley was the first person to use the symbol H to represent the amount of information.

During World War II, Claude E. Shannon developed a model of a communication process based on the earlier work of Nyquist and Hartley. This earlier work dealt with the technical aspects of information, i.e., it dealt with elementary information.

Shannon's paper, published in 1948 in The Bell System Technical Journal and entitled "A Mathematical Theory of Communication", established information theory as a mathematical discipline and laid a foundation for all future work in information theory.

1.3 Shannon’s 1948 Paper

In 1948 Claude E. Shannon published a paper, "A Mathematical Theory of Communication" [4], which unified electrical communication and served as a foundation upon which all subsequent work in information theory has been built.

Before the appearance of Shannon's paper in 1948, communication engineering was viewed differently depending on the medium and methodology used in communication. No concept existed which unified the communication systems which were in existence in 1948. Between the 1880s and 1948 the following communication systems and technologies had been developed [7]:

• Telegraph (Morse, 1830s)
• Telephone (Bell, 1876)
• Wireless Telegraph (Marconi, 1897)
• AM Radio (1900s)
• Single-Sideband Modulation (Carson, 1922)
• Television (1920s)
• Teletype (1931)
• Frequency Modulation (Armstrong, 1936)
• Pulse-Code Modulation (1930s)
• Vocoder (1939)
• Spread Spectrum (1940s)

The introduction of information theory, with its detailed treatment of the concept of "information", made it possible to see the connection between these disparate fields and to bind them together.

Although no unifying theory to bring these fields together existed prior to 1948, some ingredients which would be key elements of a mathematical theory of communication could be seen in these systems. Some of the elements are [8]:

• The Morse code, a methodology for source coding which ensured efficient transmission by encoding sources of English text in accordance with the frequencies of occurrence of the letters of the English alphabet. We shall return to this point later.
• Pulse-Code Modulation (PCM), which can be considered to be the first "digital" communication system.

Shannon built on Hartley's thesis that information, as an engineering concept, is contained in the number of possible symbol sequences which an information source could transmit.

This viewpoint of Hartley, which led to eqn 1.1, is echoed by Shannon when he states that [4]:

…the fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point. Frequently the messages have meaning; that is they refer to or are correlated according to some system with certain physical or conceptual entities. These semantic aspects of communication are irrelevant to the engineering problem. The significant aspect is that the actual message is one selected from a set of possible messages…

A block diagram of the model studied by Shannon is shown in Fig. 1.1.

[Figure: Message source → Transmitter → Channel (with Noise added) → Receiver → Message sink]

Fig. 1.1. Model of a Communication System

A key addition by Shannon to the work of Nyquist and Hartley was the integration of noise into the communication system. The addition of noise changes messages, so that what is received is not necessarily what was transmitted.

The transmitter does not transmit the messages themselves. It transmits a coded form of the messages. This coding, which takes place before transmission, is called source coding. Source coding uses symbol occurrence probabilities (see, for example, the Morse code) in message representation. Before transmission, the coded source messages undergo further processing to mitigate the effects of noise in the channel. This processing is called channel coding.

Shannon's paper of 1948 addressed both source coding and channel coding. These considerations expand the block diagram of a communication system given in Fig. 1.1 into Fig. 1.2.

[Figure: Source → Source Encoder → Channel Encoder → Modulator (Transmitter) → Channel (with Noise added) → Demodulator (Receiver) → Channel Decoder → Source Decoder → Sink]

Fig. 1.2. Shannon's Model of a Communication System

Source coding and channel coding are two important concepts at the center of Shannon's 1948 paper. That paper dealt with two fundamental questions:

i. What is the maximum limit to which redundancy in a source of signals can be reduced?

ii. What is the maximum transmission rate for reliable (i.e., error-free) communication over a noisy channel?

Answers to these two questions yield fundamental limits in Information Theory. These limits are
important because:

• Limits may suggest better system designs

• Limits assist in the evaluation of how good a particular design is

• Limits may prevent one from trying to do the impossible.

The paper by Shannon contains:

i. A quantification of the concept of information
ii. The use of coding for reducing source redundancy
iii. A quantification of the concept of channel capacity
iv. The importance and use of coding for reliable and efficient communication

The quantifications by Shannon were based on:

i. Entropy for sources

ii. Channel capacity for communication channels

These two quantifications are meant to provide a basis for answering the fundamental questions
in Shannon’s paper. The two fundamental questions are best seen in relation to the block diagram
of a communication system of Fig. 1.2

1.4 The Diversity of Information Theory

Building on Shannon's work reported in 1948, information theory has been extensively diversified. An example of this diversity is depicted in Fig. 1.3. The areas in Fig. 1.3 which we will touch on in this course are:

• Lossless compression
• Channel capacity
• Error detection and correction coding

The development of non-Shannon information measures has motivated the use of information-theoretic measures in other disciplines, as shown in Table 1.1 [5,6]. These applications represent the vistas beyond classical information theory. As Fig. 1.3 shows, information theory is a very broad discipline.

[Figure: INFORMATION THEORY at the center, branching into Source Coding (lossless/lossy compression, rate distortion theory), Channel Coding (channel capacity, error detection and correction coding, coding with side information, space-time coding), Joint Source-Channel Coding, Network Information Theory, Signal Processing, Communication, and Vistas beyond Classical Information Theory (non-Shannon information measures, maximum entropy method)]

Fig. 1.3 Diversity of Information Theory

Table 1.1 The Span of Information Theory

FIELD                    RELATIONSHIP TO INFORMATION THEORY

Physics                  • Self-organizing systems in astrophysics and cosmology
                         • Quantum information theory
                         • Thermodynamics
                         • Condensed matter

Computer Science         • Kolmogorov complexity

Economics                • Portfolio theory
                         • Kelly gambling

Mathematics              • Probability theory
                         • Statistics
                         • Maximum entropy formalism

Chemistry                • Generalization of the Arrhenius law to anomalous diffusion
                         • Generalized simulated annealing algorithm

Biology                  • Self-organized criticality
                         • Reassociation of CO in heme-proteins

Linguistics              • Generalization of the Zipf law

Medicine                 • Processing of electro-encephalographic (EEG) signals of epileptic patients
                         • Processing of electro-cardiographic (ECG) signals

Geophysics               • Earthquake studies
                         • Studies of turbulence

Social Science           • Derivation of laws governing social activities

Communication theory     • Compression
                         • Coding
                         • Encryption
                         • Modulation
                         • Detection
                         • Fundamental limits
                         • Modeling the Internet

CHAPTER TWO
SOURCE CODING
2.1 Introduction

Fig. 1.2 from Chapter One shows that both source coding and channel coding are effected on the transmission side, with the corresponding decoding operations being performed on the reception side. Since we are dealing with digital communication systems, the source messages are first digitized into bit streams.

Note that Fig. 1.2 implies that source coding and decoding and channel coding and decoding are done sequentially. Source coding reduces the number of bits to be transmitted or stored by reducing the source redundancy. Channel coding, on the other hand, adds bits which are used to improve the reliability of data transmission.

To optimize the two coding operations one can:

i. Exploit the relevant separation theorem and carry out the optimization by sequentially performing source and channel coding and decoding, or

ii. Perform joint source-channel coding. This is achieved by finding an optimal bit allocation between source coding and channel coding, for given channel loss characteristics, which allows the source encoder and channel encoder to meet the target source rate and channel robustness, respectively.

The separation theorem and joint source-channel coding are beyond the scope of this course.

Source encoding is the representation of source messages as binary digital sequences. Efficient source encoding can be attained if the source statistics are known.

Assume some source output symbols are more probable than others. Intuitively, an efficient source encoder is one which assigns long code words to rare output symbols and shorter code words to frequent symbols. Such a code is referred to as a variable length code. The Morse code is one such code.

One binary digit representation of the Morse code as a variable length code is given in the problems at the end of this chapter.

Note that there is dependency between the various letters of a language source. The dependency between the letters may be exploited to improve the efficiency of representation. Thus, for example, a person at the receiving end of a communication will be able to decode the following received message, making it unnecessary for the source to transmit the missing letters:

"u shld b abl t ried ths evn tho svrl ltrs r mising"

The meaning of the sentence is clear even though about 40 percent of the letters are left out.

The frequencies of letters and their dependence on one another have been exploited in the design of the layout of typewriters and computer keyboards for several languages. Go to Google to read further on this.

The statistics of single letters of a source of English text are available in [12]. Similar statistics for Kiswahili are yet to be generated. A problem at the end of this chapter responds to this challenge.

The dependency of single letters on one another, or of combinations of two letters (digraphs) or three letters (trigraphs), etc., has been exploited in the decryption of ciphertext messages. Interested students may Google this.

The source encoder performs the following functions:

i. It maps the source output into a digital data sequence.

ii. It reduces redundancy, i.e., it produces a more compact representation of the source outputs in the one-to-one mapping. This is referred to as data compression.

The "data compression" measure for a source is the rate, in symbols per second, needed to fully represent, and finally reconstitute at the source decoder output, the source output sequence using fewer symbols than those in the original message.

"Data compression" relates to sources of text, audio signals, real-time video signals and multimedia signals. Real-time voice and video traffic refers to traffic in which bits are periodically generated, thus providing a constant flow (stream) of traffic.

2.2 Model of a Source

In the discrete communication systems under consideration, source output symbols are represented by binary symbols. The representation must enable the recovery of the original symbols from their binary representations while using as few binary symbols as possible to represent the source output symbols. A representation achieving this is called an efficient code.

The efficient code we are looking for must be uniquely decodable.

Definition 2.1: A Discrete Memoryless Source (DMS)

A discrete memoryless source (DMS) is a source characterized by a random variable u which takes letters from a discrete finite alphabet

u = {u0, u1, …, uN}

with probabilities

Pr{u = uk} = pk,  k = 0, 1, …, N

A source output at a particular time is independent of all preceding and succeeding outputs. This is the memoryless property. Later we will discuss sources with memory.
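As an illustration of Definition 2.1, the short Python sketch below models a DMS and draws independent output symbols from it. The alphabet, the probability values and the function name are illustrative choices, not part of the notes.

import random

# A hypothetical DMS: alphabet u = {u0, u1, u2} with p0 = 0.25, p1 = 0.25, p2 = 0.5
alphabet = ["u0", "u1", "u2"]
probabilities = [0.25, 0.25, 0.5]   # must sum to 1

def emit_symbols(n):
    # Draw n output symbols; each draw is independent of all others (memoryless property)
    return random.choices(alphabet, weights=probabilities, k=n)

print(emit_symbols(10))   # e.g. ['u2', 'u0', 'u2', ...]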

Consider a DMS with an output random sequence of length N, u = {u1, u2, …, uN}. Each of these output sequences is represented by a binary sequence of length M, x = {x1, x2, …, xM}, where the xi's are 0's and 1's. For fixed N, the set of binary sequences (code words) corresponding to the output sequences is called a code.

For uniquely decodable codes, there must be a one-to-one mapping between source output sequences and the corresponding binary sequence representations. If the binary representation can be reconstituted into the original source output symbols exactly, we say that the coding and decoding operation is lossless.

2.3 Information and its Measurements

The concept of information and its quantification is basic to the analytical modeling of sources. A quantification of the concept of information is based on the earlier work of Nyquist, Hartley and Shannon [3,4,11].

The work of Nyquist, Hartley and Shannon led to the following principles underlying the concept of information:

i. The more uncommon or surprising an event is, the more information it conveys. The statement "The sun will not rise tomorrow" conveys more information than the statement "The sun will rise tomorrow". Thus an information measure must be related to the uncertainty about the occurrence of an event.

ii. Information contained in statistically independent events must be additive.

2.3.1 An Information Measure

We confine ourselves to elementary information and will therefore, restrict ourselves to a


concept of an information measure which is relevant to information that is a function of a
probability distribution of events.

Consider a sample set of N possible events. Let the probability that event i has occurred be denoted by Pi. Normalization yields

Σ_{i=1}^{N} Pi = 1 ………………………………………………….. 2.1

We do not know which event will occur. But suppose that one of these events, say j, finally takes place. Then the occurrence of event j has clearly made us gain information, because before the event occurred we did not know which event would occur.

Suppose that the probability Pj of the occurrence of event j is close to 1. This means that we gain very little information from the observed occurrence of event j, because this event was very likely anyway (assume, for example, that j is the event "the sun rising tomorrow"). On the other hand, if Pj is close to zero, then we gain a lot of information from the actual occurrence of event j, because we did not really expect this event to happen (assume, for example, that j is the event "the sun will not rise tomorrow").

The information gain due to the occurrence of a single event j can be measured by a function H(Pj), which should be close to zero for Pj close to 1. For example, we could choose

H(Pj) = −log_a Pj ……………………………………2.2

where a is the base of the logarithm.

Note that variants or other functions H(Pj) are possible depending on the application one has in mind [10].

The definition of the information measure given in equation 2.2 above has the following intuitively appealing properties:

i. H(Pi) = 0 for Pi = 1.0

ii. H(Pi) ≥ 0 for Pi ∈ (0, 1]

iii. H(Pj) > H(Pi) for Pj < Pi

iv. H(Pi ∩ Pj) = H(Pi) + H(Pj) for events i and j statistically independent, i.e., Pi ∩ Pj = Pi·Pj

The base of the logarithm is arbitrary. It is standard practice, however, in information theory to use logarithms to base 2. The unit of information is then called the bit (short for "binary digit"). We then write

H(Pj) = −log2 Pj bits ……………………………..2.3
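Properties (i)–(iv) above can be checked numerically. The following sketch uses the measure of eqn 2.3; the probability values chosen are arbitrary.

import math

def H(p):
    # Self-information of an event of probability p, in bits (eqn 2.3)
    return -math.log2(p)

print(H(1.0))                        # property (i): 0.0
print(H(0.25) > H(0.5))              # property (iii): the rarer event carries more information
Pi, Pj = 0.5, 0.25                   # independent events: Pr(i and j) = Pi * Pj
print(math.isclose(H(Pi * Pj), H(Pi) + H(Pj)))   # property (iv): information is additive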

Entropy

Assume that we have a long sequence of occurrences of events; then the average information gained from the sequence of observed events is given by

H(u) = Σ_{k=0}^{N} H(pk) pk ……………………………………2.4

where u = {u0, u1, …, uN} and pk = Pr{u = uk}.

Shannon [4] used letter frequencies for English to compute the entropy of English. The entropy of Kiswahili can be computed in a similar manner. Luhanga and Nyambo [] have computed the entropy of Kiswahili using a convergent gambling approach.

Many information measures studied in the literature are of this simple form, but other forms are possible as well [10]. If the events represent the sequence of output symbols of a DMS, then the units of H(u) are bits/symbol. Combining equations 2.3 and 2.4 we get

H(u) = −Σ_{k=0}^{N} pk log2 pk …………………………………………2.5

Note the close resemblance of eqn 2.5 to the entropy equation of thermodynamics. Based on this correspondence, the term entropy is also used as a measure of average information.
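Equation 2.5 translates directly into a few lines of Python. The sketch below (the function name and test distributions are illustrative) returns H(u) in bits/symbol for any discrete distribution.

import math

def entropy(probs):
    # Shannon entropy in bits/symbol (eqn 2.5); terms with p = 0 are skipped since 0 log 0 = 0
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.25, 0.25, 0.5]))   # 1.5 bits/symbol
print(entropy([0.5, 0.5]))          # 1.0 bit/symbol, the maximum for a binary source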

The Khinchin Axioms


There is a more formal way of selecting a measure of average information. The point of departure is a formulation of axioms which are then used in searching for a measure or measures of information which satisfy the axioms. The chosen axioms must postulate some basic and essential properties which the information measure we are interested in must possess; the functional form of the information measure is then derived from the postulates.

Khinchin formulated four axioms which describe the properties an information measure should have [10]. We follow convention by now using the letter S to represent entropy.

Axiom 1

S(u) = f(P0, P1, …, PN) ………………………………………………………...2.6

This axiom means that the information measure S(u) depends only on the probabilities Pi of the events and on nothing else.

Axiom 2

S(W^{-1}, W^{-1}, …, W^{-1}) ≥ S(P0, P1, …, PN) ………………………………………………….2.7

This axiom means that the information measure S(u) takes on an absolute maximum for the uniform distribution

Pi = 1/W for all i,

and that any other probability distribution has an information content that is less than or equal to that of the uniform distribution. The uniform distribution corresponds to maximum uncertainty about the events.

Axiom 3

S(P0, P1, …, PN) = S(P0, P1, …, PN, PN+1) ………………………………………2.8

where PN+1 = 0.

This means that the information measure S(u) should not change if the sample set of events is enlarged by another event that has probability zero.

Axiom 4

S{P_ij^{A,B}} = S{P_i^A} + Σ_i P_i^A S{P^B(j|i)} …………………………………………………2.9

This axiom deals with a compound system made up of two subsystems A and B (not necessarily statistically independent). The probabilities of the first subsystem are P_i^A, those of the second subsystem are P_j^B, and those of the compound system (A,B) are denoted by the joint probabilities

P_ij^{A,B} = P_i^A P^B(j|i)

where P^B(j|i) is the conditional probability of event j in subsystem B given that event i has already occurred in subsystem A. S{P^B(j|i)} is the conditional information of subsystem B formed with the conditional probabilities P^B(j|i), i.e., under the condition that subsystem A is in state i.

This axiom states that the information in the compound system (A,B) is the information in subsystem A added to the sum, over all events i in subsystem A, of the conditional information of subsystem B given that subsystem A is in state i, each term weighted by the probability P_i^A.

For the special case that subsystems A and B are statistically independent, the probability of the compound system factorizes as

P_ij^{A,B} = P_i^A P_j^B …………………………………………2.10

and only in this case does Axiom 4 reduce to the additivity rule for information in statistically independent subsystems

S{P_ij^{A,B}} = S{P_i^A} + S{P_j^B} …………………………………..2.11

Note that equation 2.9 can be developed from first principles as follows:

S{P_ij^{A,B}} = −Σ_i Σ_j P_ij^{A,B} log P_ij^{A,B}

But

P_ij^{A,B} = P_i^A P^B(j|i)

and hence

S{P_ij^{A,B}} = −Σ_i Σ_j P_i^A P^B(j|i) {log P_i^A + log P^B(j|i)}

= −Σ_i P_i^A log P_i^A Σ_j P^B(j|i) − Σ_i P_i^A Σ_j P^B(j|i) log P^B(j|i)

= S{P_i^A} + Σ_i P_i^A S{P^B(j|i)}

since Σ_j P^B(j|i) = 1.
The Shannon Entropy

In his 1948 paper, Shannon derived the following expression for the entropy (average information) of a source:

S = −k Σ_{i=1}^{W} Pi ln Pi …………………………. 2.12

It is left as an exercise to the reader to verify that the Shannon entropy satisfies all four of the Khinchin axioms, with Axiom 4 being given by equation 2.11.

Note that the Shannon entropy can be obtained from eqn 2.4 by setting H(Pk) = −log Pk.

In equation 2.12, k represents Boltzmann's constant. For the uniform distribution Pi = 1/W for all i, the Shannon entropy takes on its maximum value

S = k ln W ………………………..2.13

which is Boltzmann's famous formula, carved on his grave in Vienna.

Maximizing the Shannon entropy subject to suitable constraints leads to ordinary statistical mechanics, as first shown by E.T. Jaynes [14,15].

Jaynes' maximum entropy method is the major foundation for the vistas beyond classical information theory indicated in Fig. 1.3. In the sequel we shall confine ourselves to the Shannon entropy formula as given by equation 2.12, with the logarithms taken to base 2. Then we have

S = H(u) = Σ_{k=0}^{N} pk log(1/pk) = −Σ_{k=0}^{N} pk log pk bits/symbol …………………………..2.14

Note that

0 log 0 = lim_{λ→0} λ log λ = 0

From now on, all expressions involving logarithms will be taken to use logarithms to base 2.

More General Information Measures

It may be shown that the Shannon entropy is the only entropic form which satisfies, up to an arbitrary multiplicative constant, all four of the Khinchin axioms with Axiom 4 given by eqn 2.11.

We may relax Axiom 4 by replacing it with the less stringent equation 2.9. In that case one ends up with other information measures, such as the Renyi entropies.

The Renyi entropies are defined for an arbitrary real parameter α as

S_α^{(R)} = (1/(1−α)) ln Σ_i Pi^α ………2.15

The Renyi entropies satisfy Khinchin Axioms 1–3 and the additivity condition of eqn 2.11. Indeed, it can be shown that they follow uniquely, up to a multiplicative constant, from these conditions [10]. For α → 1 the Renyi entropies reduce to the Shannon entropy.

Many other entropic forms are available in the literature [13]. Another entropic form, which may be shown to follow uniquely (up to an arbitrary multiplicative constant) from Khinchin Axioms 1 to 3 with Axiom 4 replaced by the following more general version

S_q^{A,B} = S_q^A + S_q^{B|A} + (1−q) S_q^{B|A} S_q^A ……2.17

is the Tsallis entropy

S_q = K (1 − Σ_i Pi^q)/(q − 1) …… 2.18
where K is a constant and q is a real parameter.

In equation 2.17, S_q^{B|A} is the conditional entropy formed with the conditional probabilities P^B(j|i) and averaged over all states i using the escort probabilities P̃_i:

S_q^{B|A} = Σ_i P̃_i S_q{P^B(j|i)} ………2.19

The escort probabilities, P̃_i, are given by [6]

P̃_i = Pi^q / Σ_j Pj^q …………2.20
For q = 1 the new Axiom 4 reduces to the old Khinchin Axiom 4 (i.e., eqns 2.9 and 2.19):

S_1^{A,B} = S_1^A + S_1^{B|A} …………2.21

For q ≠ 1 and A and B statistically independent, the new Axiom 4 reduces to the pseudo-additivity condition

S_q^{A,B} = S_q^A + S_q^B + (1−q) S_q^A S_q^B …………2.22

since in this case

S_q^{B|A} = S_q^B

For q = 1 and A and B statistically independent, the pseudo-additivity condition reduces to the additivity condition of equation 2.11:

S_1^{A,B} = S_1^A + S_1^B
In this case (i.e., the case of q = 1) the Tsallis entropy reduces to the Shannon entropy.

Maximizing the Tsallis entropy subject to appropriate constraints may be shown to lead to nonextensive statistical mechanics. This is the approach which largely accounts for the extension of information-theoretic methods into the other fields shown in Table 1.1 [6].

2.4 Entropy of a DMS

The derivation of the entropy of a discrete memoryless source requires the following two
lemmas.

Lemma 2.1

ln x ≤ x − 1 ……………………………………2.23

Proof

Consider a plot of the two curves y = ln x and y = x − 1.

Note that ln x is a concave function of x. Recall that concavity of a twice-differentiable function f(x) means f''(x) < 0 for all x.

The curve y = ln x lies below the line y = x − 1 except at the point x = 1, where the curves touch. This proves the lemma.

Lemma 2.2

For two probability distributions pk and qk defined over the same alphabet,

Σ_k pk log2(qk/pk) ≤ 0 …… 2.24

Changing the base of the logarithm yields

Σ_k pk log2(qk/pk) = (1/ln 2) Σ_k pk ln(qk/pk)

Using Lemma 2.1 we get

Σ_k pk log2(qk/pk) ≤ (1/ln 2) Σ_k pk (qk/pk − 1)

= (1/ln 2) Σ_k (qk − pk)

= (1/ln 2) (Σ_k qk − Σ_k pk) = 0

which proves the lemma. We thus have the inequality

Σ_k pk log2(qk/pk) ≤ 0

The equality holds iff pk = qk for all k.

Note that inequality 2.24 can be re-written as

Σ_k pk log2(1/pk) ≤ Σ_k pk log2(1/qk) ………2.25

with the equality holding iff pk = qk for all k.

Example 2.1

For a DMS with alphabet u = {0, 1} with p(0) = p and p(1) = 1 − p (i.e., a binary memoryless source, BMS),

H(u) = p log(1/p) + (1 − p) log(1/(1 − p))

H(u) is called the binary entropy measure.

H(u) ≤ 1 with equality iff p(0) = p(1) = 1/2. In that case each output of the BMS carries 1 bit of information per symbol.

The entropy measure H(u) for a BMS is defined over the source alphabet {0, 1}. One can use the entropy measure to define a mathematical function

F(p) = p log(1/p) + (1 − p) log(1/(1 − p))

defined over the interval p ∈ (0, 1).

A plot of the function is shown in Fig. 2.1. It represents what is termed the entropy function. The entropy function is frequently encountered in information-theoretic studies.

See problem 4 for more entropy functions.
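The entropy function F(p) is easy to tabulate; the sketch below (the step values are arbitrary) shows its symmetry about p = 1/2 and its maximum of 1 bit at p = 1/2.

import math

def F(p):
    # Binary entropy function F(p) = p log2(1/p) + (1 - p) log2(1/(1 - p))
    if p in (0.0, 1.0):
        return 0.0                    # limiting value, since 0 log 0 = 0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.1, 0.25, 0.5, 0.75, 0.9):
    print(p, round(F(p), 3))          # 0.469, 0.811, 1.0, 0.811, 0.469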

[Figure omitted]

Fig. 2.1. The Entropy Function

Entropy of Blocks of Symbols

Consider a source producing blocks of N successive source symbols. We may consider each block as being produced by an extended source with a source alphabet of size M^N, where M is the number of distinct symbols in the source alphabet of the original source.

Let u represent the N symbols in a block:

u = {u1, u2, …, uN} ……………..2.26

In a DMS, the N symbols in a block are i.i.d. and hence the probability of a block of symbols is equal to the product of the probabilities of the symbols in the original source:

pN(u) = Π_{i=1}^{N} p(ui) ………2.27

where p(ui) is the probability of symbol ui in the original source.

The average entropy of the extended source is

H(u^N) = Σ_{u} pN(u) log(1/pN(u)) ……2.28

where the sum is over all blocks u of the extended source. In equation 2.28, p(ui) can be obtained as a marginal probability of pN(u) as follows:

p(ui) = Σ_{u1} … Σ_{uN} pN(u) ……2.29

where the sums run over all un with n ≠ i.
Lemma 2.3

Suppose u^N = {u1, u2, …, uN} represents a block of output symbols of a DMS, of length N. If the random variables ui are i.i.d., then

H(u^N) = N H(u) ……2.30

where H(u) is the average source entropy per symbol.

Proof

Let the probability distribution of the N-symbol output sequence be pN(u). The average entropy for the N symbols, H(u^N), is given by

H(u^N) = Σ_{u} pN(u) log(1/pN(u)) ……2.31
But since the N symbols are i.i.d.,

pN(u) = Π_{k=1}^{N} pk

where pk is the individual symbol probability. Hence

H(u^N) = Σ_{u} pN(u) log(1/Π_{l=1}^{N} pl) = −Σ_{u} pN(u) Σ_{l=1}^{N} log pl

Interchanging the order of summation,

H(u^N) = −Σ_{l=1}^{N} (Σ_{u} pN(u) log pl) ……..2.32

For each l, using equation 2.29,

−Σ_{u} pN(u) log pl = −Σ_{k} pk log pk = H(u)  (Why?)

Hence

H(u^N) = N H(u)

The average entropy of a block of N i.i.d random variables is equal to the sum of the entropies of
the N individual random variables H(u)

Example 2.2

Consider a DMS with alphabet u = {a,b,c} with respective probabilities pa = ¼, pb = ¼, pc = ½ .

H(u) = ¼ log 4 + ¼ log 4 + ½ log 2

= 1.5 bits/symbol

Example 2.3

Consider a DMS with alphabet u = {u1, u2, u3} with probabilities

p1 = 2^{-2},

p2 = 2^{-2},

p3 = 2^{-1},

where pk = Pr{u = uk}.

Show that the average source entropy is 1.5 bits/symbol.

Now consider an extended DMS whose output symbols consist of two source symbols taken in sequence (i.e., N = 2). Show that the alphabet of the extended source has M^N = 3^2 = 9 symbols.
The possible output symbols of the extended DMS and their probabilities are:

u1u1   2^{-2} × 2^{-2} = 2^{-4}

u1u2   2^{-2} × 2^{-2} = 2^{-4}

u1u3   2^{-2} × 2^{-1} = 2^{-3}

u2u2   2^{-2} × 2^{-2} = 2^{-4}

u2u1   2^{-2} × 2^{-2} = 2^{-4}

u2u3   2^{-2} × 2^{-1} = 2^{-3}

u3u3   2^{-1} × 2^{-1} = 2^{-2}

u3u1   2^{-1} × 2^{-2} = 2^{-3}

u3u2   2^{-1} × 2^{-2} = 2^{-3}

HN(u) = 4 × (2^{-4} × 4) + 4 × (2^{-3} × 3) + 2^{-2} × 2

= 3 bits/symbol of the extended source

But

N H(u) = 2 × 1.5 = 3 bits

Therefore

HN(u) = N H(u)

as expected.
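The arithmetic of this example can be checked mechanically. The sketch below (variable names are illustrative) lists the nine block probabilities of the extended source and confirms that its entropy equals N H(u).

import math
from itertools import product

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

p = {"u1": 0.25, "u2": 0.25, "u3": 0.5}            # the DMS of this example

# Probabilities of the 9 two-symbol blocks (N = 2) of the extended source
block_probs = [p[a] * p[b] for a, b in product(p, repeat=2)]

print(entropy(p.values()))    # 1.5 bits/symbol
print(entropy(block_probs))   # 3.0 bits per block = N * H(u)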

2.5 Data Compression

Since digital sources contain redundant information, lossless data compression is used to ensure efficient utilization of storage and transmission facilities while guaranteeing that the original source symbol sequence can be reconstructed exactly from its binary sequence representation.

If the original source sequence can be reconstituted exactly, then the data compression is said to be lossless.

Source encoding involves a one-to-one mapping between source output sequences and their binary symbol representations. To ensure efficiency in the encoding process, source statistics are exploited to yield a code which uses the lowest possible average number of binary symbols (bits). The lowest average number of bits needed to uniquely represent the source symbols is given by Shannon's Source Coding Theorem.

2.5.1 Shannon’s Source Coding Theorem

Shannon's Source Coding Theorem establishes a fundamental limit on the average codeword length for a source. We confine ourselves to the noiseless source coding theorem for discrete memoryless sources.

The source coding theorem states that a uniquely decodable binary code exists which meets this fundamental limit. The source coding theorem does not indicate how such a code can be obtained.

Theorem 2.1 The Noiseless Coding Theorem for Discrete Memoryless Sources.

Consider a DMS of alphabet u and entropy H(u) bits per symbol. For source sequences of length N (N = 1, 2, …) there exists a uniquely decodable binary code with code words of length lN(u), u ∈ u^N, where u^N is the alphabet of an extended source made up of source sequences of length N, and the average length of the code words is bounded by

⟨LN⟩ = Σ_{u} lN(u) PN(u) ≤ N{H(u) + O(N)} ⋯⋯⋯2.33

where O(N) is a term which becomes vanishingly small as N approaches infinity. Conversely, no uniquely decodable code exists for which

⟨LN⟩ < N H(u)

Note that ⟨LN⟩ represents the mean (average) value of LN.

The direct half of the theorem, as represented by equation 2.33, is proved by constructing a uniquely decodable code which meets the average length bound.

Several techniques are available in the literature for doing this:

(i) The earliest technique is that due to Shannon (1948).

(ii) A technique due to Huffman (1952) produces an optimal code, in that the code minimizes the average code word length for any N.

Below we adapt a method due to Viterbi and Omura (1979) which introduces the Asymptotic Equipartition Property (AEP) [17]. The proof of the theorem will require the following definitions and lemmas.

The central concepts in source coding are those of typical sequences and typical sets. All source output sequences are divided into typical sequences, which are elements of a typical set, and the rest of the source output sequences, which belong to the set complementary to the typical set.

Definition 2.2: Typical Sequence

A typical sequence of length N (where N is large) of a DMS of Definition 2.1 is a source sequence

u^N = {u1, u2, …, uN}

such that

pN(u) ≈ 2^{-N H(u)} ……….2.36

(In mathematics "≈" is defined more rigorously.)

Example

Assume that a source produces a random binary sequence with N = 20, Pr{u = 1} = p = 0.1 and Pr{u = 0} = 1 − p = 0.9.

(i) H(u) = −p log p − (1 − p) log(1 − p) = 0.47 bits/symbol
(ii) 2^{-N H(u)} = 0.0015
(iii) If u^N consists of 13 "1"s and 7 "0"s, then PN(u) = p^13 (1 − p)^7 ≈ 10^{-14}. This differs markedly from the result in (ii), indicating that this is not a typical sequence.
(iv) If u^N consists of 2 "1"s and 18 "0"s, then PN(u) = p^2 (1 − p)^18 ≈ 0.0015. Comparing with (ii) we conclude that this is a typical sequence.
(v) If u^N consists of all "0"s, then PN(u) = (1 − p)^20 = 0.12, leading to the conclusion that this is not a typical sequence.

These figures can be checked numerically, as sketched below.

Note that in a long binary sequence of length N with the probability of a "1" being p, the number of "1"s is approximately Np and the number of "0"s is approximately N(1 − p).

If each of the source symbols is chosen from a source alphabet of size M, then there are M^N possible source output sequences of length N. Since there are approximately 2^{N H(u)} typical source sequences of length N,

2^{N H(u)} ≤ M^N …………………………2.37

Definition 2.3: Typical Set

A typical set, S(N, ζ), is the set whose elements are the typical sequences:

S(N, ζ) = {u : PN(u) ∈ [2^{-N{H(u)+ζ}}, 2^{-N{H(u)-ζ}}]} …….2.38

Linking this to the definition of a typical sequence, we note that a typical set has the following properties:

(i) The typical sequences making up the elements of a typical set are nearly equiprobable, with probabilities that are in the neighborhood of 2^{-N H(u)}.

(ii) The number of elements in a typical set is nearly 2^{N H(u)}.

We next show that a typical set occurs with a probability near 1 (that is why it is typical!). To accomplish this we need the following three lemmas.

Lemma 2.4: The Chebyshev Inequality

Let u be a discrete random variable with expected value E{u} = μ and variance E{(u − μ)²} = σ², and let ζ be any positive real number. Then

Pr{|u − μ| ≥ ζ} ≤ σ²/ζ² …… 2.39

Proof

If the probability mass function (pmf) of u is P(u), then

σ² = Σ_{u} (u − μ)² P(u)

= Σ_{u ≤ μ−ζ} (u − μ)² P(u) + Σ_{μ−ζ < u < μ+ζ} (u − μ)² P(u) + Σ_{u ≥ μ+ζ} (u − μ)² P(u) …2.40

Since (u − μ)² and P(u) are non-negative,

σ² ≥ Σ_{u ≤ μ−ζ} (u − μ)² P(u) + Σ_{u ≥ μ+ζ} (u − μ)² P(u)

In both of the remaining summations (u − μ)² ≥ ζ², and so

σ² ≥ ζ² {Σ_{u ≤ μ−ζ} P(u) + Σ_{u ≥ μ+ζ} P(u)} …..2.41

By invoking a property of the pmf P(u), we get

Σ_{u ≤ μ−ζ} P(u) + Σ_{u ≥ μ+ζ} P(u) = Pr{|u − μ| ≥ ζ} ……2.42

Combining eqns 2.41 and 2.42 we finally get

σ² ≥ ζ² Pr{|u − μ| ≥ ζ}

which can be re-written as

Pr{|u − μ| ≥ ζ} ≤ σ²/ζ²

Lemma 2.5: The Weak Law of Large Numbers

Let u^N = {u1, u2, …, uN} consist of i.i.d. random variables ui, i = 1, …, N, each with mean μ and variance σ². Let

SN = Σ_{i=1}^{N} ui ……..2.43

Then

lim_{N→∞} SN/N = μ (in probability) …..2.44

Proof

Since the ui's are i.i.d. we can apply Lemma 2.4 with

Variance{SN} = N σ²   and   Variance{SN/N} = σ²/N

But

E{SN/N} = E{(u1 + u2 + … + uN)/N} = Nμ/N = μ

By Chebyshev's Inequality,

Pr{|SN/N − μ| ≥ ζ} ≤ σ²/(N ζ²) …2.45

Thus for any fixed ζ,

Pr{|SN/N − μ| ≥ ζ} → 0

as N → ∞. This implies that

lim_{N→∞} SN/N = μ (in probability)

Before proving the next lemma we need a definition of convergence of random variables.

Definition 2.4: Convergence of Random Variables

A sequence {un}, n an integer ≥ 0, converges to a limit u if for every ζ > 0 there exists an m such that for every n > m, |un − u| < ζ.

A sequence of random variables {un}, n ≥ 0, converges to a random variable U:

(i) in mean square if E{(un − U)²} → 0;

(ii) in probability if for every ζ > 0, Pr{|un − U| > ζ} → 0;

(iii) with probability 1 (also called almost surely) if Pr{lim_{n→∞} un = U} = 1.

Lemma 2.6: The Asymptotic Equipartition Property (AEP)

If the random variables u^N = {u1, u2, …, uN} are i.i.d. with pmf PN(u), then

−(1/N) log PN(u) → H(u) ….2.46

Proof

Since the ui's are i.i.d. we note that

PN(u) = Π_{i=1}^{N} p(ui)

and hence

−(1/N) log PN(u) = −(1/N) Σ_{i=1}^{N} log p(ui)

which is the sample mean of −log p(ui). Thus, by invoking Lemma 2.5, we have

lim_{N→∞} {−(1/N) log PN(u)} = E{−log p(ui)} = H(u)
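The AEP can also be observed by simulation: for long i.i.d. sequences, −(1/N) log2 PN(u) concentrates around H(u). The sketch below uses an arbitrary three-symbol source; the sequence lengths are illustrative.

import math
import random

probs = {"a": 0.25, "b": 0.25, "c": 0.5}
H = -sum(q * math.log2(q) for q in probs.values())        # 1.5 bits/symbol

def empirical_rate(N):
    # Draw one length-N i.i.d. sequence and return -(1/N) log2 PN(u)
    seq = random.choices(list(probs), weights=list(probs.values()), k=N)
    return -sum(math.log2(probs[s]) for s in seq) / N

for N in (10, 100, 1000, 10000):
    print(N, round(empirical_rate(N), 3))   # values approach H(u) = 1.5 as N grows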
We are now in a position to show that a typical set occurs with a probability near 1

Lemma 2.7: Probability of Occurrence of a Typical Set

For any ζ > 0, consider a DMS with alphabet u and entropy H(u) and the subset of all source sequences of length N defined by

S(N, ζ) ≜ {u : PN(u) ∈ [2^{-N{H(u)+ζ}}, 2^{-N{H(u)-ζ}}]} ….2.47

Then all source sequences in S(N, ζ) can be uniquely represented by binary code words of finite length LN, where

LN ∈ [N{H(u)+ζ}, N{H(u)+ζ}+1) ….2.48

Furthermore,

Pr{u ∉ S(N, ζ)} ≤ σ²/(N ζ²) ….2.49

where

σ² = Σ_{u} {−log P(u) − H(u)}² P(u) ….2.50

and P(u) and H(u) represent the single-letter probabilities and entropy, respectively.

Note that all sequences in the set S(N, ζ) are nearly equiprobable, differing from the value 2^{-N H(u)} by a factor no greater than 2^{Nζ}.

Proof

Since S(N, ζ) is a subset of u^N, the set of all source sequences of length N, we have the inequality

Σ_{u} PN(u) = 1 ≥ Σ_{u∈S(N,ζ)} PN(u) ….2.51

Since, by definition,

PN(u) ≥ 2^{-N{H(u)+ζ}}

for every u ∈ S(N, ζ), equation 2.51 becomes

1 ≥ Σ_{u∈S(N,ζ)} 2^{-N{H(u)+ζ}} = 2^{-N{H(u)+ζ}} |S(N, ζ)| ….2.52

where |S(N, ζ)| is the number of distinct sequences of length N in S(N, ζ), i.e., it is the cardinality of the set S(N, ζ).

This gives us the bound

|S(N, ζ)| ≤ 2^{N{H(u)+ζ}} ………… 2.53

This bound, of course, is not an integer, let alone a power of 2. However, we may bracket it between powers of 2 by choosing the integer LN such that

2^{N{H(u)+ζ}} ∈ (2^{LN−1}, 2^{LN}] ………………… 2.54

This is the same as equation 2.48.

Turning now to the representation of the source sequences not in S(N, ζ), i.e., those in the complementary set S̄(N, ζ), we note that the DMS has an alphabet of size A, with each source sequence being of length N. It is possible to represent all A^N source sequences of the DMS by binary sequences of length L̄N if L̄N is chosen to meet the constraint

2^{L̄N − 1} ≤ A^N < 2^{L̄N} ……………………….2.55

or

N log A ≤ L̄N < N log A + 1 ……………………. 2.56
It is now possible to propose a uniquely decodable code for the DMS. Codewords for source sequences in S(N, ζ) are obtained by appending one more binary symbol to each of the binary representatives of eqn 2.48, namely by preceding these binary representatives with a "0".

While this increases the binary sequence length from LN to 1 + LN, it has a vanishingly small influence for asymptotically large N. Since there are 2^{LN} distinct binary sequences of length LN, we can represent uniquely all source sequences belonging to S(N, ζ) by binary sequences of length LN satisfying equation 2.48.

Then, using eqn 2.48, all sequences in S(N, ζ) are represented uniquely by binary sequences of length 1 + LN < N{H(u) + ζ} + 2 bits. All other source sequences, i.e., those in S̄(N, ζ), are represented by sequences in which the first binary digit is always a "1" and the remaining L̄N symbols uniquely represent each sequence, with L̄N chosen in accordance with eqn 2.55. Then all sequences in S̄(N, ζ) are represented uniquely by binary sequences of length 1 + L̄N.

The code we have constructed is uniquely decodable since the first bit specifies the length of the code word ("0" means length 1 + LN and "1" means length 1 + L̄N) and the remaining bits uniquely specify the source output sequence of length N. If the first bit is a "0", we examine the remaining LN bits, which establish uniquely a source sequence in S(N, ζ), while if the first bit is a "1" we examine the next L̄N bits, which establish uniquely a source sequence in S̄(N, ζ).

Each code word is a unique binary sequence and there is never any uncertainty as to when a code word sequence begins and ends. No code word is a prefix of another.

We now have a uniquely decodable binary code for each output sequence of length N. As an
example, consider a source with alphabet u = {a, b, c} for source sequences of length N = 2. A
possible code, which satisfies the parameters mentioned above is:

uN Codeword

aa 000

ab 001

ac 010

ba 011

bb 1000

bc 1001

ca 1010

cb 1011

cc 1100
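The "first bit tells the length" rule makes decoding of such a code mechanical. The sketch below decodes a bit stream using the N = 2 example code above (the codebook is copied from the table; the function name is illustrative).

codebook = {
    "000": "aa", "001": "ab", "010": "ac", "011": "ba",                    # leading 0: 3-bit words
    "1000": "bb", "1001": "bc", "1010": "ca", "1011": "cb", "1100": "cc",  # leading 1: 4-bit words
}

def decode(bits):
    # Split the bit stream into code words, using the leading bit to choose the word length
    out, i = [], 0
    while i < len(bits):
        length = 3 if bits[i] == "0" else 4
        out.append(codebook[bits[i:i + length]])
        i += length
    return out

print(decode("0001001010"))   # ['aa', 'bc', 'ac']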

In order to show that the typical set S(N, ζ) occurs with a probability near 1, we proceed as follows. Let

FN = Pr{u ∈ S̄(N, ζ)} = Σ_{u∈S̄(N,ζ)} PN(u) …….. 2.57

where S̄(N, ζ) is the set complementary to S(N, ζ), which contains the source sequences other than the typical sequences.

From the definition of S(N, ζ) we have

S(N, ζ) = {u : log PN(u) ∈ [−N{H(u)+ζ}, −N{H(u)−ζ}]}

= {u : log Π_{n=1}^{N} p(un) ∈ [−N{H(u)+ζ}, −N{H(u)−ζ}]}

= {u : |Σ_{n=1}^{N} log p(un) + N H(u)| ≤ Nζ}

= {u : |−(1/N) Σ_{n=1}^{N} log p(un) − H(u)| ≤ ζ} ……… 2.58

Hence the complementary set is

S̄(N, ζ) = {u : |−(1/N) Σ_{n=1}^{N} log p(un) − H(u)| > ζ} …….. 2.59

If we re-write eqn 2.43 as

SN = Σ_{i=1}^{N} {−log p(ui)}

then, applying eqn 2.45 to eqn 2.59, we get

FN = Pr{u : |−(1/N) Σ_{n=1}^{N} log p(un) − H(u)| > ζ} ≤ σ²/(N ζ²) ……………………. 2.60

with σ² being given by eqn 2.50, as can easily be shown by applying Lemma 2.5 to 2.59.

Thus FN, the probability of occurrence of sequences not belonging to S(N, ζ), becomes vanishingly small as N approaches infinity or, conversely, the probability of occurrence of source sequences in the set S(N, ζ) approaches 1 as N approaches infinity.

This result is not surprising. The term

−(1/N) Σ_{n=1}^{N} log p(un) − H(u)

should, according to Lemma 2.6 on the AEP, approach zero as N approaches infinity, thus justifying the result of eqn 2.60.

Proof of Theorem 2.1

We have just developed a uniquely decodable code with two possible codeword lengths, 1 + LN and 1 + L̄N. The average length of the code words is thus

⟨LN⟩ = (1 + LN) Pr{u ∈ S(N, ζ)} + (1 + L̄N) Pr{u ∈ S̄(N, ζ)} ……..2.61

and it follows from 2.48, 2.55 and 2.60 that

⟨LN⟩ < 1 + N[H(u) + ζ] + 1 + [N log A + 2] σ²/(N ζ²)

= N{H(u) + ζ + 2/N + (log A + 2/N) σ²/(N ζ²)}

or

⟨LN⟩/N < H(u) + ζ + 2/N + (log A + 2/N) σ²/(N ζ²)

Choosing ζ = N^{−1/3} this yields

⟨LN⟩/N < H(u) + 2N^{−1} + N^{−1/3} + (log A + 2N^{−1}) σ² N^{−1/3}

= H(u) + O(N)

which establishes the direct half of the theorem, i.e., we have shown that for source sequences of length N of a DMS there exists a uniquely decodable code for which ⟨LN⟩ < N{H(u) + O(N)}.

We note that for large N nearly all code words can be made of equal length, slightly larger than NH(u), and only two code word lengths are required, 1 + LN and 1 + L̄N, where L̄N refers to the length used for source sequences belonging to S̄(N, ζ).

Note that LN ∈ [N{H(u)+ζ}, N{H(u)+ζ}+1), and so, for large N, all codewords are of almost the same length.

We now prove the converse half of the theorem.

Consider the identity

(Σ_{u} 2^{−lN(u)})^M = (Σ_{u1} 2^{−lN(u1)}) (Σ_{u2} 2^{−lN(u2)}) ⋯ (Σ_{uM} 2^{−lN(uM)})

= Σ_{u1} Σ_{u2} ⋯ Σ_{uM} 2^{−{lN(u1) + lN(u2) + ⋯ + lN(uM)}} ……….. 2.62

where each sum on both sides of the equation is over the entire space u^N.

Note that there is a distinct term on the right hand side of equation 2.62 corresponding to each possible sequence of M code words. Furthermore, the quantity

lN(u1) + lN(u2) + ⋯ + lN(uM)

gives the total length, in code letters (bits), of the corresponding sequence of code words.

If we let Ak be the number of sequences of M successive code words having a total length of k binary symbols, equation 2.62 can be expressed as

(Σ_{u} 2^{−lN(u)})^M = Σ_{k=1}^{M l*N} Ak 2^{−k} ………….. 2.63

where

l*N = max_{u} lN(u)

But in order for the source sequences to be recoverable from the binary sequences we must have

Ak ≤ 2^k,  k = 1, 2, ⋯, M l*N ……….. 2.64

Otherwise, two or more sequences of M successive code words would give the same binary sequence, violating our uniqueness requirement. Using this bound for Ak we have

(Σ_{u} 2^{−lN(u)})^M ≤ Σ_{k=1}^{M l*N} 1 = M l*N ……….. 2.65

for all integers M.

Clearly this can be satisfied for all M only if

Σ_{u} 2^{−lN(u)} ≤ 1 …………. 2.66

for the left hand side of 2.65 grows exponentially in M if this sum exceeds 1, while the right hand side grows only linearly in M. This inequality is known as the Kraft–McMillan inequality.

If we now use a general variable length source encoder whose code word lengths satisfy 2.66, the average number of binary symbols per source sequence is

⟨LN⟩ = Σ_{u} PN(u) lN(u) ………… 2.67

Define the distribution

QN(u) = 2^{−lN(u)} / Σ_{u′} 2^{−lN(u′)} …….. 2.68

We then have, from the inequality of Lemma 2.2 (eqn 2.25),

N H(u) = Σ_{u} PN(u) log(1/PN(u))

≤ Σ_{u} PN(u) log(1/QN(u))

= Σ_{u} PN(u) log[Σ_{u′} 2^{−lN(u′)} / 2^{−lN(u)}]

= Σ_{u} PN(u) lN(u) + log(Σ_{u′} 2^{−lN(u′)})

= ⟨LN⟩ + log(Σ_{u′} 2^{−lN(u′)}) ……… 2.69

Since the Kraft–McMillan inequality guarantees that the second term is not positive, we have

N H(u) ≤ ⟨LN⟩ ……….. 2.70

The bound applies for any sequence length N, and it follows that any source code for which the source sequence can be uniquely recovered from the binary sequence requires an average of at least N H(u) bits per source sequence, i.e., at least H(u) bits per source symbol.

This completes the proof of Theorem 2.1. We have shown that it is possible to source encode a DMS with an average number of binary symbols per source symbol arbitrarily close to its entropy, and that it is impossible to do so with a lower average.

2.5.2 Source Encoding Algorithms

Shannon's Source Coding Theorem guarantees the existence of a code for source sequences of length N and provides a bound on the average codeword length. The Source Coding Theorem does not, however, indicate how such a code can be constructed. We provide in this section some algorithms for designing source codes.

The binary codes we are looking for must meet two conditions:

(i) They must be efficient.

(ii) They must be uniquely decodable.

Efficient codes are the ones which exploit source statistics to ensure that short codewords are assigned to more probable source symbols and vice versa.

Uniquely decodable codes are the ones which ensure exact reconstitution of the original source messages from their coded representations. A sufficient condition for a code to be uniquely decodable is that it be a prefix condition code.

Prefix Coding

A prefix condition code, or a prefix code for short, is a code in which no codeword is a prefix of another codeword. To illustrate this, consider the codes in Table 2.1 below.

Table 2.1: Illustration of Prefix Codes

Source symbol   Probability of occurrence   Code I   Code II   Code III

s1              0.5                         0        0         0

s2              0.25                        1        10        01

s3              0.125                       00       110       011

s4              0.125                       11       111       0111

(i) Code I is not a prefix code, since the bit "0" – the codeword for symbol s1 – is a prefix of "00", the codeword for symbol s3. Note also that the bit "1", the codeword for symbol s2, is a prefix of "11", the codeword for symbol s4.

(ii) Code II is a prefix code whereas Code III is not.

A prefix code is uniquely decodable. The converse is not true, as seen in Code III. This code is uniquely decodable since a "0" indicates the beginning of a codeword; the code, however, is not a prefix code. The prefix condition is thus a sufficient, but not necessary, condition for unique decodability.

Prefix condition codes are a subset of uniquely decodable codes. Being uniquely decodable, they satisfy the Kraft–McMillan inequality; unique decodability was the condition used in deriving the inequality from eqns 2.63, 2.64, 2.65 and 2.66.

Note, however, that the Kraft–McMillan inequality does not tell us that a source code is a prefix code. It is merely a condition on the code word lengths of a code and not on the code words themselves.

Example 2.4

Suppose s = {s1, s2, s3, s4} and the binary symbol representations are as shown below:

Source symbol   Probability of occurrence   Code 1   Code 2   Code 3

s1              0.5                         0        0        0

s2              0.25                        1        10       01

s3              0.125                       00       110      011

s4              0.125                       11       111      0111

From this example we note the following:

(i) Code 1 violates the Kraft–McMillan inequality. Is it a prefix code? Why?

(ii) Both Codes 2 and 3 satisfy the Kraft–McMillan inequality. Code 2 is a prefix code whereas Code 3 is not.

(A computational check of the Kraft–McMillan sums is sketched below.)
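A sketch of the check; the code word lists are copied from the table above.

def kraft_sum(code_words):
    # Sum of 2^(-l) over all code word lengths; at most 1 for a uniquely decodable code (eqn 2.66)
    return sum(2 ** -len(w) for w in code_words)

code1 = ["0", "1", "00", "11"]
code2 = ["0", "10", "110", "111"]
code3 = ["0", "01", "011", "0111"]

for name, code in (("Code 1", code1), ("Code 2", code2), ("Code 3", code3)):
    print(name, kraft_sum(code))   # 1.5 (violates the inequality), 1.0, 0.9375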

Decoding

As far as decoding is concerned, let us start by looking at the example below:

Example 2.5

Suppose u = {a, b, c} and the binary symbol representation uses the three codes shown below. We note that if N = 1 (i.e., single symbols a, b, or c are transmitted) then:

u   Code 1   Code 2   Code 3

a   0        00       1

b   1        01       10

c   10       10       100

• Code 1 is not uniquely decodable since, for example, the received sequence 1010 could represent the source sequences baba, bac, cba or cc, and so it cannot be uniquely decoded into a source sequence.

• Code 2 is uniquely decodable. It is a prefix code.

• Code 3 is uniquely decodable: a "1" marks the beginning of a codeword and the codewords are unique.

A prefix code is uniquely decodable because:

(i) The end of a codeword can be uniquely determined.

(ii) No two code words are the same.

To decode a sequence of prefix condition code words generated from a source encoder, the
source decoder sets up a decision tree. For code II the decision tree is:

Fig. 2.2 Decision Tree for Code II

The received bit sequence

1011111000

is decoded as

s2 s4 s3 s1 s1

The source decoder emits the source output symbols as indicated.

The decoder always starts in the initial state (the root of the tree) and ends at a terminal state. Each source symbol corresponds to one terminal state. For example, the encoded sequence 101111000 is easily decoded, using the decision tree, as s2 s4 s2 s1 s1.

For prefix condition codes, the end of a codeword is always recognizable. Hence the decoding of a codeword representing a source symbol can be accomplished as soon as the last bit of that codeword is received. Such a code is said to be an instantaneous code.
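Walking the decision tree is equivalent to accumulating received bits until they match a codeword and then returning to the root. The sketch below illustrates this for Code II of Table 2.1 (the function name and dictionary are illustrative, not part of the text above):

```python
# Instantaneous decoding of the prefix condition Code II of Table 2.1:
# s1 = 0, s2 = 10, s3 = 110, s4 = 111.
CODE_II = {"0": "s1", "10": "s2", "110": "s3", "111": "s4"}

def decode_prefix(bits, codebook):
    decoded, current = [], ""
    for bit in bits:
        current += bit
        if current in codebook:        # a terminal state of the decision tree
            decoded.append(codebook[current])
            current = ""               # return to the root (initial state)
    if current:
        raise ValueError("bit stream ended in the middle of a codeword")
    return decoded

print(decode_prefix("1011111000", CODE_II))   # ['s2', 's4', 's3', 's1', 's1']
```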

Average Codeword Length and Variance

Given the probabilities of occurrence of source symbols and the codewords used to represent each source symbol it is easy to compute the average codeword length, \bar{L}, as:

\bar{L} = \sum_{k=1}^{N} P_k L_k    ……. 2.71

Where

Pk is the probability of occurrence of source symbol k

L k is the length of the codeword representing source symbol k.

The variance of the codeword lengths is easily computed using

\sigma^2 = \sum_{k=1}^{N} P_k (L_k - \bar{L})^2    ………………….. 2.72

Example 2.4

Compute the average codeword lengths for the following codes

A    Code 1   Code 2   p_k(a) (both codes)

a0   000      00       1/4

a1   001      01       1/4

a2   010      100      1/8

a3   011      101      1/8

a4   100      1100     1/16

a5   101      1101     1/16

a6   110      1110     1/16

a7   111      1111     1/16

The average codeword length for code 1 is 3.0 bits

Average length of a codeword in code 2 is:


\bar{L}(\text{Code 2}) = \sum_{k=0}^{7} n_k p_k(a_k)
                       = 2 \times \left(\tfrac{1}{4}\right) \times 2 + 3 \times \left(\tfrac{1}{8}\right) \times 2 + 4 \times \left(\tfrac{1}{16}\right) \times 4
                       = 2.75 \text{ bits}

The source entropy is

H(u) = 2 \times \left(\tfrac{1}{4} \log 4\right) + 2 \times \left(\tfrac{1}{8} \log 8\right) + 4 \times \left(\tfrac{1}{16} \log 16\right)
     = 2.75 \text{ bits/symbol}

For code 2, the average codeword length is 2.75 bits/symbol, which equals the source entropy. We show in the sequel that this result implies that the code is optimal according to Shannon's Source Coding Theorem.

Find, and comment on, the entropy and average codeword length for code 1.
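Equations 2.71 and 2.72, together with the entropy formula, are easy to evaluate numerically. A short sketch for code 2 of the example above (the probabilities and codeword lengths are taken from the table; the variable names are illustrative) is:

```python
import math

# Probabilities and codeword lengths of code 2 in the example above.
probs   = [1/4, 1/4, 1/8, 1/8, 1/16, 1/16, 1/16, 1/16]
lengths = [2, 2, 3, 3, 4, 4, 4, 4]

L_bar    = sum(p * l for p, l in zip(probs, lengths))                 # eqn 2.71
variance = sum(p * (l - L_bar) ** 2 for p, l in zip(probs, lengths))  # eqn 2.72
entropy  = sum(p * math.log2(1 / p) for p in probs)

print(f"average codeword length = {L_bar} bits")           # 2.75
print(f"variance                = {variance:.4f}")
print(f"source entropy          = {entropy} bits/symbol")  # 2.75
```

Replacing the lengths with those of code 1 (all equal to 3) gives the figures asked for in the exercise.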

Encoder- Source Matching

Consider a DMS with alphabet u and entropy H(u). For such a source consider further that the
source symbol uk occurs with probability pk and that the binary sequence representing uk has
length lk.

Consider the special case where

p_k = 2^{-l_k}    ………………… 2.73

The average length of the code words, ⟨L⟩, is given by

⟨L⟩ = \sum_{k=1}^{N} \frac{l_k}{2^{l_k}}    …………………. 2.74

The entropy, H(u), is given by

H(u) = \sum_{k=1}^{N} p_k \log \frac{1}{p_k}
     = \sum_{k=1}^{N} \frac{1}{2^{l_k}} \log 2^{l_k}
     = \sum_{k=1}^{N} \frac{l_k}{2^{l_k}}    ………………. 2.75

Note that in this specific case


⟨L⟩ = H(u)

and the code is said to be matched to the source.

How does one match a prefix code to an arbitrary source? One uses an extended code (for a
block of binary symbols).

We know that

L_N \in [\, N\{H(u)+\zeta\},\; N\{H(u)+\zeta\}+1 \,)

or

\frac{L_N}{N} \in \left[\, H(u)+\zeta,\; H(u)+\zeta+\frac{1}{N} \,\right)

As N →∞ and ζ becomes arbitrarily small, the lower and upper bounds converge since

\lim_{N \to \infty,\; \zeta \to 0} \frac{L_N}{N} = H(u)

Thus, by making the order, N, of an extended source encoder large enough and ζ small enough, the average codeword length per source symbol can be brought arbitrarily close to the entropy of the DMS, i.e. the code almost matches the source.

The average code word length of an extended code can thus be made as close to the entropy of the source as desired, provided the extended code has a high enough order. The price to be paid is the increased decoding complexity arising from the higher order of the extended code.

Examples of Practical Source Codes

In this section we will present three algorithms for generating source codes. The average length
of the code words in each source code is computed and so is the entropy of the source. A
comparison of the two values gives an indication of how closely a code matches the source.

(i) Huffman Code

The Huffman code is a prefix condition code. This code assigns, to each symbol in an alphabet, a sequence of bits roughly equal in length to the information carried by the symbol. This means that the code word lengths are roughly inversely related to the probabilities of occurrence of the symbols they represent: more probable symbols receive shorter codewords.

The Huffman code results in a code whose average code word length approaches the entropy of the source, assumed to be a DMS (it approaches this fundamental limit by asymptotically matching the source as N → ∞).

For extended sources, with each block being comprised of N source symbols, the Huffman code is optimal in the sense that it produces a code with the smallest average code word length for a given N.

The Huffman encoding algorithm is:

1. List source symbols in order of decreasing probability. The two source symbols with the
lowest probability are assigned a “0” and a “1”. This is the splitting stage.

2. Combine the two lowest probability source symbols into a new source symbol with
probability equal to the sum of the probabilities of the original source symbols.

The list of source symbols is reduced by one. Place the new symbol in the list of symbols in accordance with its probability. When the probability of a newly combined symbol is found to equal another probability in the list, the selection of any one of them to be the next symbol to be combined will not affect the average codeword length but will affect the variance. The symbol yielding the lower variance is used in coming up with the final design for the code.

3. Repeat the process until only two symbols are left. Assign a “0” and a “1” to the two
symbols.

50
The code for each original symbol is found by working backward, tracing the sequence of
“0s” and “1s” assigned to the symbol and its successors.

Implementation of the Huffman algorithm produces a prefix condition code [18].
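One compact way of implementing the algorithm is with a priority queue of subtrees, repeatedly merging the two least probable entries. The sketch below is one such implementation (the names are illustrative; it does not apply the variance-minimizing tie-breaking rule discussed later, so when probabilities tie it may return a different, but equally efficient, set of codeword lengths):

```python
import heapq
import itertools

def huffman_code(probabilities):
    """Return a prefix code (symbol -> bit string) produced by Huffman's algorithm."""
    tie_break = itertools.count()     # keeps heapq from ever comparing the dicts
    heap = [(p, next(tie_break), {sym: ""}) for sym, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, left = heapq.heappop(heap)     # the two least probable subtrees
        p2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (p1 + p2, next(tie_break), merged))
    return heap[0][2]

code = huffman_code({"s1": 0.4, "s2": 0.2, "s3": 0.2, "s4": 0.1, "s5": 0.1})
print(code)   # the codeword lengths average 2.2 bits for this source
```

For the five-symbol source considered next, different tie-breaking choices give codeword lengths of either {1, 2, 3, 4, 4} or {2, 2, 2, 3, 3}; both average 2.2 bits.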

The Huffman algorithm is implemented using a Huffman tree.

The Huffman tree

Consider a source with the following alphabet and source statistics

p(s1) = 0.4
p(s2) = 0.2
p(s3) = 0.2
p(s4) = 0.1
p(s5) = 0.1

Find the Huffman code for the source.

Using the Huffman algorithm, we generate the following Huffman tree.

The resulting Huffman code is:

Symbol Probability Code word

s1 0.4 1

s2 0.2 01

s3 0.2 000

s4 0.1 0010

s5 0.1 0011

The average code word length is


\bar{L} = 1 \times 0.4 + 2 \times 0.2 + 3 \times 0.2 + 4 \times 0.1 + 4 \times 0.1
        = 2.2 \text{ bits}

The entropy is

H(u) = -(0.4 \log 0.4 + 0.2 \log 0.2 + 0.2 \log 0.2 + 0.1 \log 0.1 + 0.1 \log 0.1)
     = 2.12 \text{ bits/symbol}

Note that the average code word length differs from the source entropy by only 3.7 percent.

Note further that

⟨L⟩ = 2.2 bits
H(u) + 1 = 3.12 bits

and thus

H(u) < ⟨L⟩ < H(u) + 1


Note that a modification in the merging of probabilities, in which a new node arising from the merging of probabilities is placed above existing nodes of the same weight (probability) in the list, while always merging the last two weights in the list, would produce a different code with minimum values of maximum codeword length and total codeword length [19]. Thus, for example, if after combining we place the new probability at the top of the relevant probabilities in the list of probabilities, the previous example would yield:

The Resulting Huffman Code is:

Symbol Probability Codeword

s1 0.4 00

s2 0.2 10

s3 0.2 11

s4 0.1 010

S5 0.1 011

Now

⟨L⟩ = 2 \times 0.4 + 2 \times 0.2 + 2 \times 0.2 + 3 \times 0.1 + 3 \times 0.1
    = 2.2 \text{ bits}

i.e. the average codeword length has not changed.

The variances are as follows:

(i) For the first code we have:

\sigma^2 = \sum_{k=1}^{5} p_k (l_k - 2.2)^2 = 1.36

(ii) For the second code we have:

\sigma^2 = \sum_{k=1}^{5} p_k (l_k - 2.2)^2 = 0.16

Which of the two codes would you recommend for implementation? Why?

(ii) Shannon – Fano Code

The algorithm for generating the Shannon – Fano code is:

1. Arrange the symbols from the source alphabet in order of decreasing probability

2. Group the source symbols into two sub-sets so that the probabilities of the two sub-sets are
just about equal. Assign a “0” to one sub-set and a “1” to the other sub-set.

3. Group the symbols in each sub-set into two groups of almost equal probability. Assign a “0”
and a “1” to each sub-group.

4. Repeat step 3 until all source symbols have been assigned code words.

The Shannon – Fano derived code words for each symbol are read from the original sub –
set to the final sub – set.

Implementation of the Shannon – Fano algorithm produces a prefix condition code [20].
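The recursive splitting in steps 2 to 4 can be sketched as follows (a minimal illustration, assuming that we always split at the point making the two sub-set probabilities as nearly equal as possible; other splitting choices on near-ties can give slightly different codes):

```python
def shannon_fano(symbols):
    """symbols: list of (symbol, probability) pairs in order of decreasing probability.
    Returns a dictionary mapping each symbol to its codeword."""
    if len(symbols) == 1:
        return {symbols[0][0]: ""}
    total = sum(p for _, p in symbols)
    running, best_split, best_diff = 0.0, 1, float("inf")
    for i in range(1, len(symbols)):              # find the most balanced split
        running += symbols[i - 1][1]
        diff = abs(2 * running - total)
        if diff < best_diff:
            best_diff, best_split = diff, i
    code = {}
    for sym, cw in shannon_fano(symbols[:best_split]).items():
        code[sym] = "0" + cw                      # first sub-set gets a "0"
    for sym, cw in shannon_fano(symbols[best_split:]).items():
        code[sym] = "1" + cw                      # second sub-set gets a "1"
    return code

source = [("s1", 0.3), ("s2", 0.25), ("s3", 0.2), ("s4", 0.1), ("s5", 0.1), ("s6", 0.05)]
print(shannon_fano(source))
```

For the source considered next, this sketch reproduces the codewords listed in the resulting table.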

Suppose we have a source with the following source probabilities.

Symbol Probability

s1 0.3

s2 0.25

s3 0.2

s4 0.1

s5 0.1

s6 0.05

Applying the Shannon – Fano algorithm yields

Code words are obtained by reading from the left to the right. Thus, for this example we get:

Symbol Probability Code word

s1 0.3 00

s2 0.25 01

s3 0.2 10

s4 0.1 110

s5 0.1 1110

s6 0.05 1111

The Shannon – Fano algorithm has produced a prefix condition code.

The same result can be obtained by using a tree as shown below:

The code words are indicated in the diagram. The tree is known as the Shannon – Fano tree.

Example 2.6

A source has an alphabet u = {a, b, c} with source statistics p(a) = ¼, p(b) = ¼, and p(c) = ½. This source is used to get an extended code with N = 2. Find a Huffman code and a Shannon – Fano code for the extended source.

(a) Huffman Code

cc 10

ac 010

bc 011

ca 000

cb 001

aa 1110

ab 1111

ba 1100

bb 1101

(b) Shannon – Fano Code

Read the code from the initial state

The Shannon – Fano code is

cc 00

ac 01

bc 100

ca 101

cb 1100

aa 1101

ab 11100

ba 11101

bb 1111

Since

p(a) = 1/4, p(b) = 1/4, p(c) = 1/2

find ⟨L_2⟩ and H(u^2). Compare your results and comment.

Decoding

Since both Huffman coding and Shannon – Fano coding produce prefix condition codes, decoding can be effected using decision trees as discussed in Section 2.5.2. The Huffman tree is the decision tree for the Huffman code and the Shannon – Fano tree is the decision tree for the Shannon – Fano code. The initial state in both types of decision tree is the root of the tree. The leaves of the trees constitute the terminal states.

As an example, suppose we are to decode the following bit stream using the Huffman tree of Example 2.6.

Bit stream: 001110100110

Reading the bits from the left to the right, starting from the root of the Huffman tree and proceeding to the leaves (the symbols) of the tree, we decode the bit stream as:

cb bb cb cc

Decoding the same bit stream using the Shannon – Fano tree, starting from the root of the tree and moving to the leaves, would yield a different decoded message, since the two codes assign different codewords to the symbols.

If the same message – cb bb cb cc – had been encoded using the Shannon – Fano code of
Example 2.6 the resulting bit stream would have been

11001111110000

If this had been the received bit stream then it can be easily verified that using the Shannon – Fano tree for decoding the message, starting from the root of the tree and moving to the leaves, would yield a decoded message which is exactly the same as the encoded message.

(iii) The Lempel – Ziv – Welch Code

Both the Shannon – Fano and the Huffman coding algorithms require a priori knowledge of the source probabilities. In practice this may not always be possible. In addition, when applied symbol by symbol they do not exploit the dependencies between characters, words and phrases in text (i.e. inter-character dependencies), which compromises the coding efficiency achievable for such sources.

One code which overcomes the shortcoming mentioned above is the Lempel – Ziv –
Welch code. This code belongs to a class of codes known as dictionary codes.

The first Lempel – Ziv compression technique was published in 1977 [21] and it was later refined by Welch in 1984 [22].

The Lempel – Ziv – Welch (LZW) algorithm works by looking for data sequences that occur multiple times and replacing them in a file with fixed size pointers. As it reads data in, it stores strings into a dictionary. When it encounters a string that already exists in the dictionary, it replaces it with a fixed size index. Compression is achieved since the index typically requires less storage space than the string it represents. The LZW coder does not need to store the dictionary along with the strings. During decoding the decoder works along the same principles as the encoder, reading in the data and simultaneously creating a dictionary.

The Lempel – Ziv algorithm works by parsing the source data stream into segments that
are the shortest subsequences not encountered previously.

Example 2.7

Assume that the following source data stream is to be encoded:

000101110010100101

(i) It is assumed initially that bits 0 and 1 are already stored in a buffer in that order to get:

Subsequence stored: 0,1

Data to be parsed : 000101110010100101…

(ii) With bits 0 and 1 already stored, the shortest subsequence encountered for the first time
and not previously encountered – beginning from the left – is 00. This is the first string to
be stored in the dictionary. This yields:

Subsequences stored: 0, 1, 00
Data to be parsed : 0101110010100101

(iii) In the remaining data to be parsed the shortest substring not previously seen is 01. The second string to be stored is thus 01. This yields:

Subsequences stored: 0, 1, 00, 01
Data to be parsed : 01110010100101

(iv) At this stage, the shortest substring not previously seen and stored in the dictionary is 011. The third string to be stored is thus 011. This yields:

Subsequences stored: 0, 1, 00, 01, 011


Data to be parsed : 10010100101

(v) The process is repeated until the source input data stream has been completely parsed. For the example at hand the result is:

Subsequences stored: 0, 1, 00, 01, 011, 10, 010, 100, 101

(vi) The substrings stored are the contents of the dictionary. Indexes are assigned to the dictionary contents. The indexes are the binary representations of the decimal numbers giving the positions of the dictionary entries. For the example at hand we get:

Numerical position: 1 2 3 4 5 6 7 8 9
Index : 001 010 011 100 101 110 111 1000 1001
Sub stream stored : 0 1 00 01 011 10 010 100 101

(vii) In the LZW encoding algorithm, the last bit of the codeword representing a stored subsequence is the last bit of that subsequence. This last bit is called the innovation symbol. The innovation symbol, when appended to a subsequence, enables that subsequence to be distinguished from all previously stored subsequences. The rest of the bits in a codeword are obtained from the index of the root subsequence, i.e. the stored subsequence that matches the one in question except for the innovation symbol. Thus, for example, the codeword for the subsequence 00 is made up of a 0 as the last bit, since this is the innovation symbol, preceded by 001, since the root subsequence that matches 00 except for the innovation symbol 0 is the subsequence 0 stored in position 1, whose index is 001. Thus the codeword for subsequence 00 is 0010.

Numerical position : 1    2    3     4     5     6     7     8     9
Index              : 001  010  011   100   101   110   111   1000  1001
Substring stored   : 0    1    00    01    011   10    010   100   101
Codeword           :           0010  0011  1001  0100  1000  1100  1101

(Positions 1 and 2 are the default entries and are not assigned codewords.)

Note that the Lempel – Ziv – Welch code is:

(a) a variable length to fixed length code

(b) has codewords of length ⌈log M⌉ + 1, where M is the number of distinct strings in the parsed source sequence and ⌈x⌉ represents the smallest integer greater than or equal to x.

(c) is asymptotically optimal in the sense that the average codeword length approaches the source entropy as the source sequence gets very long.

(viii) The decoding procedure parallels the encoding procedure. Thus, assume for example that
the received bit stream is:

0010001110010100100011001101

The decoding procedure proceeds as follows:

(a) Since LZW produces fixed length codewords, we divide, for this example, the bit stream into codewords of four bits to get:
0010 0011 1001 0100 1000 1100 1101

(b) The decoder sets up a dictionary just like the encoder. The default entries in positions 1 and 2 are 0 and 1. We thus have:

Numerical position: 1 2 3 4 5 6 7 8 9
Index : 001 010 011 100 101 110 111 1000 1001
Decoded Entry : 0 1
When the first codeword 0010 is received, the innovation symbol – a 0 in this case – is
decoded as the last bit of the original data string. The index is then used to identify the
other bit(s) in the input string. The index is 001 indicating that the other bit in the
encoded string is a 0. We thus get:

Numerical position: 1 2 3 4 5 6 7 8 9
Index : 001 010 011 100 101 110 111 1000 1001
Decoded Entry : 0 1 00

The next received codeword is 0011. The innovation bit is decoded as the last bit in the
encoded string. The index identifies one other bit – a 0 – for the input string. Thus
codeword 0011 is decoded as 0 1 yielding:

Numerical position: 1 2 3 4 5 6 7 8 9
Index : 001 010 011 100 101 110 111 1000 1001
Decoded Entry : 0 1 00 01

The procedure is repeated sequentially until all the received codewords have been decoded. It is left as an exercise to the reader to show that completion of the exercise would lead to the complete recovery of the original source output data stream.

Huffman coding was for a long time the algorithm of choice for data compression. However, the Lempel – Ziv – Welch algorithm has taken over from the Huffman algorithm. By exploiting inter-character dependence, the LZW algorithm achieves greater compression than the Huffman algorithm. Archiving utilities such as ZIP are among the practical software implementations of Lempel – Ziv based compression.
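The dictionary-building step of Example 2.7 is easily sketched in code. The fragment below (illustrative names only) parses a binary string into the shortest subsequences not previously stored, starting from the default entries 0 and 1; codeword formation from the root index and the innovation symbol would then follow step (vii):

```python
def lz_parse(data):
    """Parse a binary string into the shortest subsequences not seen before,
    with '0' and '1' assumed to be stored in advance as in Example 2.7."""
    dictionary = ["0", "1"]
    phrase = ""
    for bit in data:
        phrase += bit
        if phrase not in dictionary:       # shortest subsequence not yet stored
            dictionary.append(phrase)
            phrase = ""
    return dictionary

print(lz_parse("000101110010100101"))
# ['0', '1', '00', '01', '011', '10', '010', '100', '101']
```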

2.5.3 Markov Sources

The information sources we have been dealing with so far are memoryless. A symbol emitted by
a source at a particular time is independent of preceding symbols and of succeeding symbols.

The assumption that the emitted symbols were independent and identically distributed was
made so as to simplify the mathematical modeling of information sources.

Practical information sources have memory. Thus, for example, in English, the probability of a
letter “u” following the letter “q” is much higher than that of any other letter as can be easily
verified by looking at the digraph frequencies of the English language.

The simplest analytical approach to modeling sources with memory is to consider them to be
Markov Sources. Markov Sources belong to a class of processes known as stochastic processes.
The properties of such processes evolve in time in accordance with probability laws. In our case
we shall restrict ourselves to using discrete time, discrete state stochastic processes in the
modeling of Markov sources.

We shall not attempt a formal mathematical definition of discrete time, discrete state stochastic processes. Discrete time, discrete state Markov processes are called Markov chains. Since Markov chains belong to the class of Markov processes we shall begin with a definition of Markov processes.

Shannon (1948) gives examples of modeling English text as a Markov source. Shannon gives
sample outputs of such models. Interested readers are referred to Shannon’s paper.

Definition 2.5: Markov Processes

A stochastic process is a Markov process if for arbitrary times … l < m < n,

Pr\{X_n = x \mid X_m = y, X_l = z, \ldots\} = Pr\{X_n = x \mid X_m = y\}

The right hand side shows that the dependence is only on the latest of the time points l, m, … The probability density functions of Markov processes are not conditioned on two or more values of past time points.

The definition of Markov Chains is articulated from the definition of Markov processes.

Definition 2.6: Markov Chain

Given a sequence of discrete random variables X_l, …, X_m, X_n, then

Pr\{X_n = k \mid X_l = h, \ldots, X_m = j\} = Pr\{X_n = k \mid X_m = j\}

Example 2.8

Suppose that a binary source can exist in two states A and B and that the probabilities of transmission of a 0 or a 1 are state dependent. Thus, suppose the nth transmission is a 0 (i.e. the binary source is in state A); then the probability that a 0 will be transmitted at the (n + 1)th time is 1 − α and the probability that a 1 is transmitted is α. Similarly, if the nth transmission is a 1, the probability that the (n + 1)th transmission is a 1 is 1 − β and that of a 0 is β. Alternatively, this may be taken to be equivalent to saying that if the source is in state A at time n, then the probability that it is still in state A at time (n + 1) is 1 − α and the probability that it moves to state B at time (n + 1) is α. Similarly, if the source is in state B at time n, then the probability of being in state B at time (n + 1) is 1 − β and that of being in state A at time (n + 1) is β. These probabilities are called transition probabilities and may be represented in a matrix as:

P = \begin{bmatrix} 1-\alpha & \alpha \\ \beta & 1-\beta \end{bmatrix}

where the rows are indexed by the present state (A, B) and the columns by the next state (A, B).
Note that we have made the simplifying assumption that the transition probabilities are not time dependent. Note further that the sum of elements in any row is exactly 1 since the current state must surely go somewhere at the next transition.

The entries of the transition probability matrix are defined as:

p_{jk} = Pr\{X_{n+1} = k \mid X_n = j\}    ……..2.76

If the probability that the Markov chain is in state k at time n is denoted by

Pr\{X_n = k\} = p_k^{(n)}    ………2.77

then, using the transition probabilities, we write

p_k^{(n+1)} = \sum_j p_j^{(n)} p_{jk}    ……….2.78

or, if we define P as the matrix

P = \begin{bmatrix} p_{11} & p_{12} & \cdots & p_{1N} \\ p_{21} & p_{22} & \cdots & p_{2N} \\ \vdots & & & \vdots \\ p_{N1} & p_{N2} & \cdots & p_{NN} \end{bmatrix}    ………..2.79

and

p^{(n)} = [\, p_1^{(n)} \;\; p_2^{(n)} \;\; \cdots \;\; p_N^{(n)} \,]    ……….2.80

eqn 2.78 can be written in matrix notation as

p^{(n+1)} = p^{(n)} P    ……….2.81

Particularizing our discussion to the simplest non-trivial case of a two-state Markov chain we get

[\, p_0^{(n+1)} \;\; p_1^{(n+1)} \,] = [\, p_0^{(n)} \;\; p_1^{(n)} \,] \begin{bmatrix} P_{11} & P_{12} \\ P_{21} & P_{22} \end{bmatrix}    ………. 2.82

Starting with the initial condition

p^{(0)} = [\, p_1^{(0)} \;\; p_2^{(0)} \;\; \ldots \;\; p_N^{(0)} \,]    ………. 2.83

It can be easily shown that through iteration we get:


[\, p_0^{(n)} \;\; p_1^{(n)} \,] = [\, p_0^{(0)} \;\; p_1^{(0)} \,] \begin{bmatrix} P_{11} & P_{12} \\ P_{21} & P_{22} \end{bmatrix}^{n}    ………. 2.84

or in matrix notation

p^{(n)} = p^{(0)} P^{n}    ……….. 2.85

In order to find the time-dependent probabilities p^{(n)}, it is necessary to evaluate P^n in equation 2.85. This is conveniently done by using the diagonal (spectral) representation of P. Assume that after a sufficiently long time the system settles down to statistical equilibrium, in which the state occupancy probabilities lose their dependence on time and are independent of the initial conditions, i.e.

\lim_{n \to \infty} [\, p_0^{(n)} \;\; p_1^{(n)} \,] = \lim_{n \to \infty} p^{(n)} = [\, \pi_0 \;\; \pi_1 \,] = \pi

Assuming further that the underlying stochastic process is ergodic, it can be shown that, at equilibrium [23],

\pi = \pi P    …………. 2.86

or

\pi (I - P) = 0    …………. 2.87

Equation 2.87 has a non-zero solution if

\det(I - P) = 0    …………. 2.88

Where I is the identity matrix

To get a unique solution from 2.87 we need to use the normalization condition that the state probabilities must sum to unity or, in matrix notation:

\pi \mathbf{1} = 1    …………..2.89

where \mathbf{1} is a column vector of ones.

The matrix P is called the State Transition Matrix

The equilibrium probabilities of state occupancy, π, are obtained by solving equation 2.87, but since its equations can be shown to be linearly dependent, we invoke, in addition, the normalizing condition given by equation 2.89.

With the background on Markov chains given above, we are now in a position to find an
analytical model of a Markov source.

Definition 2.7: Markov Source

A Markov source is characterized by a set of discrete states, denoted by A = {a1, a2, …, aN}, and a source letter alphabet, denoted by u = {u1, u2, …, uM}. At each discrete unit of time the source emits, with a given probability, a symbol from its alphabet and assumes a new state, which could be the same as the previous state.

Note that the Markov process is being used to model the memory of the source.

Let qj(ak) be the probability that the source emits symbol ak given that it is in state j and assume
that this probability depends on the current state

q_j(a_k) \triangleq Pr\{a_n = a_k \mid s_n = j\}    ………. 2.90

Note that the symbol Sn represents the state of the Markov chain at time n.

Finally assume that the next state the source moves to depends only on ( i.e. is uniquely
determined by) the previous emitted letter. Emitting symbol a k forces the source to move to
state sk

Figure 2.2 illustrates the operation of a Markov source. The branches are labelled by the output a_k and the probability q_j(a_k), i.e. the probabilities for leaving state j. The nodes correspond to the states.

Figure 2.2 A Markov source

Each branch directed away from a given node must correspond to a different letter, to ensure that the new state the chain moves to is determined by the previous state and the emitted letter.

The conditional entropy of a Markov source is the entropy computed assuming that the source occupies a particular state. It is given by:

H(u \mid s_j) = \sum_k q_j(a_k) \log \frac{1}{q_j(a_k)}    ………. 2.91

The entropy of a Markov source given that it is in a particular state must satisfy the same
inequalities satisfied by a DMS.

The average entropy is obtained by computing the conditional entropy given the state occupied
by the Markov source and then removing the conditioning

H(u) = \sum_{k=1}^{N} H(u \mid s = s_k)\, \pi_k    ……… 2.92

Example 2.8

Consider a Markov source with the following state transition matrix

P = \begin{bmatrix} 0.5 & 0.5 \\ 0.2 & 0.8 \end{bmatrix}
(i) How many symbols does the source alphabet contain?

(ii) Draw the Markov chain for the source.

(iii) Determine the entropy of the source

We leave parts (i) and (ii) as exercises for the reader.

In this example, let ai be the source sequence generated given that the Markov source is in state
si, i = 1, 2.

The binary output sequences by an encoder depend on the state of the Markov chain.

This is thus an example of a Markov Modulated Source (MMS).

A generalization of this, the Markov Modulated Poisson Process (MMPP) has been used as a
mathematical model of Internet Protocol (IP) traffic. [24 and references therein].

When the Markov source in the example is at equilibrium, π = πP, i.e.

[\, \pi_1 \;\; \pi_2 \,] = [\, \pi_1 \;\; \pi_2 \,] \begin{bmatrix} 0.5 & 0.5 \\ 0.2 & 0.8 \end{bmatrix}

which gives \pi_1 = 0.5\pi_1 + 0.2\pi_2 and \pi_2 = 0.5\pi_1 + 0.8\pi_2.

Using one of these equations together with the equation \pi_1 + \pi_2 = 1 we get

\pi_1 = \frac{2}{7} \quad \text{and} \quad \pi_2 = \frac{5}{7}

The computation of the entropy is easily visualized by using the Markov chain representation of
the source

H(u \mid s_1) = 0.5 \log \frac{1}{0.5} + 0.5 \log \frac{1}{0.5} = 1

H(u \mid s_2) = 0.2 \log \frac{1}{0.2} + 0.8 \log \frac{1}{0.8} = \log 5 - 1.6

H(u) = H(u \mid s_1)\,\pi_1 + H(u \mid s_2)\,\pi_2 = 1 \times \pi_1 + (\log 5 - 1.6)\,\pi_2 = \frac{1}{7}(5 \log_2 5 - 6)
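A numerical check of the equilibrium probabilities and of equations 2.91 and 2.92 for this example can be sketched as follows (the two-state closed form π1 = β/(α + β) is assumed, with α and β read off the transition matrix; names are illustrative):

```python
import math

# Transition matrix of the example: from state 1 the source stays with prob 0.5,
# from state 2 it moves back to state 1 with prob 0.2.
P = [[0.5, 0.5],
     [0.2, 0.8]]

alpha, beta = P[0][1], P[1][0]
pi1 = beta / (alpha + beta)          # solves pi = pi P together with pi1 + pi2 = 1
pi2 = 1 - pi1
print(pi1, pi2)                      # 2/7 and 5/7

def row_entropy(row):
    """Conditional entropy H(u | s) for one state (eqn 2.91)."""
    return sum(p * math.log2(1 / p) for p in row if p > 0)

H_u = pi1 * row_entropy(P[0]) + pi2 * row_entropy(P[1])   # eqn 2.92
print(H_u)                           # (5*log2(5) - 6)/7, about 0.80 bits/symbol
```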

Coding for Markov Sources

Coding for Markov sources is achieved by designing a separate code, preferably a prefix
condition code, for each state of the underlying Markov chain. These separate codes are then
combined and used in producing an encoded source output by treating the letters of the source
alphabet as states of a Markov chain. An example will clarify this.

Consider a Markov source with source alphabet {a, b, c} and transition probability matrix:

        a     b     c
a      1/3   1/3   1/3
b      1/4   1/2   1/4
c      1/4   1/4   1/2

The Markov chain for this example is:


(State transition diagram: the three states a, b and c, with branches labelled by the transition probabilities given in the matrix above.)

The Huffman code for each state can be easily shown to be as given in the Table below. Following Definition 2.7, the word "state" in the Table below refers to the condition that the Markov source has just emitted a particular letter. Thus s_a, for example, refers to the state in which the Markov source has just emitted an a.

                 STATE
SYMBOL     s_a     s_b     s_c
a          1       10      10
b          00      0       11
c          01      11      0

In encoding we assume that the encoder first transmits the initial state, say s_0, of the Markov source. If the Markov source has an alphabet of size m, the transmission of the initial state will require ⌈log m⌉ bits, where ⌈x⌉ represents the smallest integer greater than or equal to x. The long-term expected number of bits per source letter is not affected by this transmission as it is done only once at the beginning of the transmission.

After transmitting the initial state, the encoder then successively encodes each source letter using the code for the current state, i.e. the code for the transition into the next state. The decoder, after decoding the initial state s_0, can decode the first received codeword by using the code based on state s_0. The other received symbols are then decoded sequentially after that.

Example 2.9

An encoder is to encode the Markov source output sequence.

baccbacba

with the initial state of the Markov source being state a.

We assume that the sequence is read from the left to the right. After the initial state, the next letter is b. From the Table, the codeword for this is 00. The current state is now s_b and the next letter (state) is a. From the Table the codeword for this transition is 10. We summarize the transitions in the Table below.

OUTPUT SYMBOL    MARKOV SOURCE STATE              CODEWORD
SEQUENCE         CURRENT STATE    NEXT STATE
b Sa Sb 00
a Sb Sa 10
c Sa Sc 01
c Sc Sc 0
b Sc Sb 11

a Sb Sa 10
c Sa Sc 01
b Sc Sb 11
a Sb Sa 10

The encoder output sequence is thus:

00100101110011110

Example 2.10

Suppose a decoder has decoded that the initial state of a Markov source during the encoding
process was state c. Decode the following received sequence

110111000101

While in state s_c, we decode the first two bits, 11, as b. We are now in state s_b. This enables us to decode the next bit, 0, as b. We have remained in state s_b, which forces us to decode the next two bits, 11, as c. And so on, leading to the following decoded message:

bbcabaa
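The two worked examples above can be reproduced with a short sketch that keeps track of the current state and uses the state code table (the dictionary below restates that table; the function names are illustrative):

```python
# State-dependent prefix codes; state "x" means the source has just emitted x.
CODES = {
    "a": {"a": "1",  "b": "00", "c": "01"},
    "b": {"a": "10", "b": "0",  "c": "11"},
    "c": {"a": "10", "b": "11", "c": "0"},
}

def encode(sequence, initial_state):
    state, bits = initial_state, ""
    for letter in sequence:
        bits += CODES[state][letter]   # codeword chosen by the current state
        state = letter                 # the next state is the letter just emitted
    return bits

def decode(bits, initial_state):
    reverse = {s: {cw: sym for sym, cw in table.items()} for s, table in CODES.items()}
    state, letters, current = initial_state, "", ""
    for bit in bits:
        current += bit
        if current in reverse[state]:
            letter = reverse[state][current]
            letters, state, current = letters + letter, letter, ""
    return letters

print(encode("baccbacba", "a"))        # 00100101110011110
print(decode("110111000101", "c"))     # bbcabaa
```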

CHAPTER THREE

CHANNEL CODING
3.1 Discrete Memoryless Channel

So far we have been dealing with sources required for information generation. We have
confined ourselves to answering questions arising out of Shannon’s first theorem:

(i) What is the fundamental limiting number of binary digits required to uniquely
represent the source?
(ii) How close to this limit can we come in practice?

We have found that the fundamental limit is the source entropy. We have also presented encoding algorithms whose average codeword lengths approach the source entropy.

Now we return to Shannon's second theorem, which deals with reliability in information transmission.

To mitigate the efect of the noisy channel, redundancy is introduced in the channel input and it
is used to reconstitute the channel input sequence as accurately as possible at the channel
decoder output.

Redundancy is introduced by coding. Different codes are used for bandwidth limited and power limited channels, giving rise to Automatic Repeat Request (ARQ) error – control coding systems versus those using Forward Error Correction (FEC) codes.

For a noisy channel, a mapping between the input and output is random, mapping the input,
defined over a given discrete set, into an output sequence defined over a, usually diferent,
discrete set.

The channel encoder and decoder map the input digital sequence into a channel input sequence and, conversely, a channel output sequence into an output digital sequence so that the effect of the noisy channel is minimized.

Channel coding, which is introduced to mitigate the effects of channel noise, has an important effect on the rate at which useful information can be transmitted over a channel. This led to one of the questions which Shannon posed in 1948 [4].

The question which Shannon posed was: What is the maximum transmission rate for reliable communication over a noisy channel?

Definition 3.1: A Discrete Memoryless Channel (DMC) is characterized by a discrete input
alphabet, X, a discrete output alphabet, Y, and a set of conditional probabilities for outputs
given each of the inputs

The conditional probabilities are denoted by p(y|x) for x ∈ X and y ∈ Y. Each output letter of the channel depends only on one input symbol.

For an input sequence of length N denoted by x = {x1, x2, …, xN}, the conditional probability of a
corresponding output sequence, y = {y1, y2, …, yN}, is expressed as

P_N(\mathbf{y} \mid \mathbf{x}) = \prod_{n=1}^{N} P(y_n \mid x_n)    ⋯⋯3.1

where, as usual, 0 \le P(y_n \mid x_n) \le 1 for all n.

The conditional probabilities P(y_k|x_j) are called transition probabilities for obvious reasons. The channel is memoryless when the current output symbol depends only on the current input symbol and not on any of the previous symbols; equation 3.1 expresses this memoryless condition of the DMC. The channel is said to be discrete when both alphabets X and Y are countably finite.

Note that it is not necessary for the input alphabet, X, to have the same size as the output alphabet, Y. For example, for channel coding, the size N of the output alphabet may be larger than the size M of the input alphabet, i.e. N ≥ M.

The channel may conveniently be described by a matrix whose elements are the conditional probabilities of the channel as follows:

P = \begin{bmatrix} p(y_1|x_1) & p(y_2|x_1) & \cdots & p(y_N|x_1) \\ p(y_1|x_2) & p(y_2|x_2) & \cdots & p(y_N|x_2) \\ \vdots & & & \vdots \\ p(y_1|x_M) & p(y_2|x_M) & \cdots & p(y_N|x_M) \end{bmatrix}    ⋯3.2
The M x N matrix is called the channel matrix. Note that each row of the matrix corresponds to a fixed
channel input whereas each column of the matrix corresponds to a fixed channel output. Note that a
fundamental property of the channel matrix is that the row elements sum to one, i.e.

\sum_{k=1}^{N} p(y_k \mid x_j) = 1 \quad \forall j    ⋯3.3

Is the column sum also equal to one? Why or why not?

Suppose that the probability of the event that the channel input X = x_j is denoted by

Pr\{X = x_j\} = p(x_j), \quad j = 1, 2, \ldots, M    ⋯3.4
Then the joint probability distribution of the input random variable, X, and the output random
variable, Y, is given by

p( x j , y k )= Pr {X= x j , Y = y k }
=Pr {Y = y k|X = x j }Pr {X = x j }
= p( y k|x j ) p( x j ) ⋯3 . 5

The marginal probability distribution of the output random variable, Y, is obtained by

p( y k )=Pr {Y = y k }
M
=∑ p( y k , x j )
j=1
M
=∑ p( y k|x j ) p( x j ) ∀ k=1, 2, …, N ⋯3 .6
j =i

The probabilities p( x j ), j=1,2,…, M are known as the a priori probabilities of the various
input random variables.

Thus, given the a priori probabilities p(xj) and the channel matrix, we may calculate the
probabilities of the various output variables p(yk) using equation 3.6.

3.1.1 The Binary Symmetric Channel (BSC)

This channel is a special case of the DMC with M = N = 2. It is of great theoretical importance and practical significance.

Definition 3.2 A binary symmetric channel (BSC) is a DMC with X = Y = {0, 1}, two input symbols
(x1 = 0, x2 = 1) and two output symbols (y1 = 0, y2 = 1) . The conditional probabilities are of the
form

p(y2|x1) = p(y1|x2) = p

p(y1|x1) = p(y2|x2) = 1 – p

The channel is symmetric because the probabilities of error (i.e. receiving a 0 when a 1 is sent
and vice versa) are equal

The conditional probabilities are shown in the figure below:

Fig. 3.1 The Binary Symmetric Channel

We shall return to discuss the BSC in more detail in section 3.3.1.

3.2 Mutual Information

We may think of the channel output, Y, as a noisy version of the channel input, X. We would like to know how to measure the uncertainty about X after observing Y. It is reasonable to think of the entropy of the input, H(X), as the prior uncertainty we have about X before observing any output.

To find the uncertainty about X after observing Y, Shannon defined the notion of mutual
information between two events, α and β, denoted by I(α, β). Mutual information is the
information provided about the α event by the occurrence of the event β. As before, the
probabilities p(α), p(β) and p(α ∩ β) are assumed known.

To be consistent with our previous definition of information we must have the following
boundary conditions.

a) If α and β are independent events (p(α ∩ β) = p(α)p(β)), then the occurrence of β provides no information about α, i.e. I(α, β) = 0.

b) If the occurrence of β indicates that α has definitely occurred, i.e. p(α | β) = 1, then the occurrence of β provides us with all the information regarding α. Thus

I(\alpha, \beta) = I(\alpha) = \log \frac{1}{p(\alpha)}

The two boundary conditions of mutual information are satisfied if we define the mutual
information as follows:

p( α|β )
I ( α , β )=log
p( α )
p( α ∩β )
=log ⋯3 .7
p( α ) p ( β )

Note that the definition is symmetric in α, β since I(α, β) = I(β, α)

Note further that

1
I( α , α )=log =I (α )
p(α )

Thus, mutual information is a generalization of the concept of information. Hence I(α) is
sometimes referred to as the self – information of the event α.

Note also that if, for example, p(α| β) < p(α) then I(α, β) is negative. Thus while self –
information is non – negative, mutual information can take on negative values.

Negative values of I(α, β) indicate that α is less likely than it was a priori before β was observed.

We are primarily interested in the mutual information between the channel input, X, and the channel output, Y, of a DMC. We will regard the channel as a random mapping of the random variable X, the channel input, into the random variable Y, the channel output, treating both the input and output as discrete random variables.

Consider a DMC with input alphabet X, output alphabet Y, and conditional probability p(y|x) for
y Є Y and x Є X.

Suppose in addition that input symbols occur with probability p(x) for x Є X.

The input to the channel can then be regarded as a random variable and the output as a
random variable. If we observe the output y, the information provided about the input x is the
mutual information

p( y|x )
I ( x , y )=log
p( y )
p( x| y )
=log ⋯⋯3. 8
p( x )

where

p(y) = \sum_{x} p(y \mid x)\, p(x)    ⋯⋯3.9

We see that it is necessary for us to know the input probability distribution p(x_i), i = 1, 2, …, M so that we may calculate the mutual information I{X, Y} of a channel.

As with sources, we are interested in the average amount of information that the output of the
channel provides about the input. Thus we define the average mutual information between the
input and output of a DMC as

I(X, Y) = E\{I(x, y)\} = \sum_{y} \sum_{x} p(y \mid x)\, q(x) \log \frac{p(y \mid x)}{\sum_{x} p(y \mid x)\, q(x)}    ⋯⋯3.10

Where q(x) is the probability distribution of the input

The average mutual information I(X,Y) is defined in terms of the given channel conditional
probabilities and the input probability which is independent of the DMC. We can then maximize
I(X,Y) with respect to the input probability distribution
q={q( x ), x∈X }

We digress a little bit before linking the material here, and what follows, with the concept of
channel capacity.

3.2.1 Conditional Entropy

Given an input with an alphabet X, the quantity H(X) gives us the a priori uncertainty about X.
We extend this concept in order to measure the uncertainty about X after observing Y.
Specifically, we define the conditional entropy of X, selected from the alphabet, X, given that Y
= yk i.e.

M
H {χ|γ= y k }Δ ∑ p( x j| y k ) log
j=1 [ 1
p ( x j| y k ) ] ⋯⋯3 . 11

The quantity H{X|Y=yk} is itself a random variable that takes on values H{X|Y=y1}, …, H{X|Y=yN}
with probabilities p(y1), …, p(yN), respectively. The mean of H{X|Y=yk} over the entire alphabet Y
is given by

H\{X \mid Y\} = \sum_{k=1}^{N} H\{X \mid Y = y_k\}\, p(y_k)
             = \sum_{k=1}^{N} \sum_{j=1}^{M} p(x_j \mid y_k)\, p(y_k) \log \left[ \frac{1}{p(x_j \mid y_k)} \right]
             = \sum_{k=1}^{N} \sum_{j=1}^{M} p(x_j, y_k) \log \left[ \frac{1}{p(x_j \mid y_k)} \right]    ⋯⋯3.12

The quantity H{X|Y} is called the conditional entropy and it represents the amount of
uncertainty remaining about the channel input after the channel output has been observed.

Since H{X} represents the uncertainty about the channel input before observing the channel
output and H{X|Y} is the uncertainty about the channel input after observing the channel
output, the diference H{X} - H{X|Y} must represent our uncertainty about the channel input
that is cleared (resolved) by observing the channel output. Thus this diference must represent
the information which the observation of Y provides about X.

This quantity provides an alternative method of computing the mutual information of the
channel. If we denote the channel mutual information as I{X,Y} , we get

I{X, Y} = H{X} − H{X|Y}

Similarly, by using the symmetry of I{X, Y} we may also write I{X, Y} = H{Y} − H{Y|X}.

3.2.2 Properties of Mutual Information

Property 1

We have already shown that the mutual information of a channel is symmetric i.e.

I{X, Y} = I{Y, X}    ………3.13

where I{X, Y} is the amount of uncertainty about the channel input that is cleared by observing the channel output and I{Y, X} is the amount of uncertainty about the channel output that is cleared by sending the channel input.

Property 2

I{X, Y} = H{X} − H{X|Y}    ………3.14

In order to prove the result we need Bayes' Rule.

Lemma 3.1 (Bayes Rule)

For two random variables X and Y, defined on alphabets X and Y respectively, and given that X = x_j, j = 1, 2, …, M, then

Pr\{X = x_j \mid Y\} = \frac{Pr\{Y \mid X = x_j\}\, Pr\{X = x_j\}}{\sum_{i=1}^{M} Pr\{Y \mid X = x_i\}\, Pr\{X = x_i\}}    ⋯⋯3.15
Proof

We know that

Pr {X , Y }=Pr {X|Y }Pr {Y }


=Pr {Y |X }Pr{X }

Hence
Pr {Y| X }Pr {X}
Pr {X|Y }= ⋯⋯3 .16
Pr {Y }
Equation 3.16 is one form of Bayes' rule,

valid so long as Pr{Y} ≠ 0.

We also know that

Pr {Y , X=x i }=Pr {Y |X=x i }Pr {x=x i }

and thus
M
Pr {Y }=∑ Pr {Y| X=x i }Pr {x=x i } ⋯⋯3 .17
i =1

Letting X = x_j and using 3.16 and 3.17 yields 3.15, Bayes' Rule.

Example 3.1

Consider a binary communication channel with the following conditional probabilities

p(0|0) = 0.95

p(1|1) = 0.9

and Pr{x = 0} = p(0) = 0.4

(a) Is this a BSC?

(b) Find the probability that a zero is received.

(c) Find the probability that a one is transmitted given that a one was received

Solution

(a) The probability of error is not symmetric and hence it is not a BSC.

(b) Define

Y0= zero is received

X0 = a zero is transmitted

X1 = a one is transmitted

Using 3.17 we get

Pr{Y0} = 0.95 x 0.4 + 0.1 x 0.6 = 0.44

Using 3.15 with Y1 = a one is received

p(X_1 \mid Y_1) = \frac{p(Y_1 \mid X_1)\, p(X_1)}{p(Y_1)} = \frac{0.9 \times 0.6}{1 - 0.44} = \frac{0.54}{0.56} = \frac{27}{28}

Let us now consider an example which is unrelated to a communication system but which
demonstrates the computation of I{X, Y}.

Example 3.2

Three urns, A1, A2 and A3 are equally likely to be selected. The number of white balls (W) and
black balls(B) contained in each urn is shown below.

An experiment is done by selecting an urn, drawing a ball from it and observing the colour of the ball. What is the average mutual information provided by the experiment?

Solution

The input alphabet consists of selecting an urn. Thus X = {A1,A2,A3} with equal probabilities of
selection.

The output of the experiment consists of observing the colour of a drawn ball. Thus

Y = {W, B}

In solving the problem, we shall use the relation

I{X, Y} = H{X} − H{X|Y}

H(X) = log 3

We therefore need to compute H(X|Y).

H(X|Y) is computed as follows:

i. We first note that X = {A1,A2,A3} and Y = {W, B}

ii. We will need the conditional probabilities p(x_j|y_k) with x1 = A1, x2 = A2 and x3 = A3; we also have y1 = W, y2 = B.

Using Bayes' Rule we get

P(A_1 \mid W) = \frac{P(W \mid A_1) P(A_1)}{P(W \mid A_1) P(A_1) + P(W \mid A_2) P(A_2) + P(W \mid A_3) P(A_3)}
             = \frac{0.8/3}{0.8/3 + 0.6/3 + 0.2/3}
             = 0.5

Applying Bayes' Rule in a similar manner, we get:

P(A_1 \mid B) = \frac{0.2/3}{0.2/3 + 0.4/3 + 0.8/3} = \frac{1}{7}

P(A_2 \mid W) = \frac{3}{8}, \quad P(A_2 \mid B) = \frac{2}{7}

P(A_3 \mid W) = \frac{1}{8}, \quad P(A_3 \mid B) = \frac{4}{7}
The conditional entropy of the input given each observed colour is obtained from equation 3.11:

H(X \mid Y = W) = \frac{1}{2} \log 2 + \frac{3}{8} \log \frac{8}{3} + \frac{1}{8} \log 8 \approx 1.406 \text{ bits}

H(X \mid Y = B) = \frac{1}{7} \log 7 + \frac{2}{7} \log \frac{7}{2} + \frac{4}{7} \log \frac{7}{4} \approx 1.379 \text{ bits}

H(X|Y) is then obtained by averaging over Y. Using p(W) = 8/15 and p(B) = 7/15 (from equation 3.6) we get

H(X \mid Y) = p(W)\, H(X \mid Y = W) + p(B)\, H(X \mid Y = B) \approx \frac{8}{15}(1.406) + \frac{7}{15}(1.379) \approx 1.393 \text{ bits}

Hence

I(X, Y) = H(X) - H(X \mid Y) \approx \log 3 - 1.393 \approx 1.585 - 1.393 \approx 0.19 \text{ bits}

Since I(X, Y) > 0, observing the colour of the drawn ball resolves part of the uncertainty about the input, i.e. about which urn was selected.
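The numbers in this example are easy to verify with a short computation of I(X, Y) = H(X) − H(X|Y). The conditional colour probabilities below are the ones used in the solution; they are assumptions about the urn contents, which were given in a figure:

```python
import math

p_x = [1/3, 1/3, 1/3]                  # prior probability of selecting each urn
p_y_given_x = [[0.8, 0.2],             # [p(W | A_i), p(B | A_i)] for i = 1, 2, 3
               [0.6, 0.4],
               [0.2, 0.8]]

def mutual_information(p_x, p_y_given_x):
    """I(X, Y) = H(X) - H(X|Y), with H(X|Y) evaluated as in eqn 3.12."""
    n_y = len(p_y_given_x[0])
    p_y = [sum(p_x[j] * p_y_given_x[j][k] for j in range(len(p_x)))
           for k in range(n_y)]
    H_x = sum(p * math.log2(1 / p) for p in p_x)
    H_x_given_y = 0.0
    for j in range(len(p_x)):
        for k in range(n_y):
            p_joint = p_x[j] * p_y_given_x[j][k]
            if p_joint > 0:
                posterior = p_joint / p_y[k]            # p(x_j | y_k)
                H_x_given_y += p_joint * math.log2(1 / posterior)
    return H_x - H_x_given_y

print(mutual_information(p_x, p_y_given_x))   # approximately 0.19 bits
```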

3.3 Channel capacity

The input probability distribution, q(xi) is independent of the channel. Since mutual information
is a relation between the channel input and channel output, we may maximize the mutual
information of the channel with respect to q(xi). We thus have the following definition .

Definition 3.3. Channel capacity

The channel capacity, denoted by C, of a DMC is the maximum average mutual information in
any single use of the channel (i.e. transmission interval), where the maximization is taken over
all possible input distributions, i.e.

C = \max_{q(x)} I(X, Y)    ⋯⋯3.18

The channel capacity is measured in bits per channel use or bits per transmission

The channel capacity is a function only of the conditional probabilities p(yk|xi) which completely
define the channel.

The maximization of the channel capacity over all possible input probability distributions is done
subject to the obvious constraints

q(x i )≥0 ∀ i

And
M
∑ q ( x i )=1
i=1

In general this maximization is not a trivial task

3.3.1 The Binary Symmetric Channel

Consider the BSC with probability of error p. The entropy H(X) is maximized (maximum uncertainty) when p(x = 0) = p(x = 1) = ½. This also maximizes the average mutual information. Thus

C = \max_{q(x)} I(X, Y) = I(X, Y)\Big|_{p(x=0) = p(x=1) = \frac{1}{2}}

But we know that

p(y = 0 \mid x = 1) = p(y = 1 \mid x = 0) = p

and

p(y = 0 \mid x = 0) = p(y = 1 \mid x = 1) = 1 - p

So

H(X) = \frac{1}{2} \log 2 + \frac{1}{2} \log 2 = 1 \text{ bit per transmission interval}

H(X \mid Y) = \sum_{k=1}^{2} \sum_{i=1}^{2} p(x_i \mid y_k)\, p(y_k) \log \frac{1}{p(x_i \mid y_k)}
           = \left\{ (1-p) \log \frac{1}{1-p} + p \log \frac{1}{p} \right\} p(y=0) + \left\{ (1-p) \log \frac{1}{1-p} + p \log \frac{1}{p} \right\} p(y=1)
           = p \log \frac{1}{p} + (1-p) \log \frac{1}{1-p}
           = -p \log p - (1-p) \log (1-p)
           = H(p) \text{ bits per transmission interval}

C = \max_{q(x)} I(X, Y) = 1 - H(p) \text{ bits per transmission interval}

The last equality is obtained by using the definition of the entropy function which we
encountered when discussing the Binary Memoryless Source (BMS). A plot of C versus p is:

Fig. 3.2 Channel capacity versus p

We observe the following:

1. When the channel is noise free i.e. the transition probability p=0, C attains its maximum
value of 1 bit per channel use. At this value of p, the entropy function, H(p), attains its
minimum value.

2. When the channel has noise resulting in an error probability p = ½ , the channel capacity,
C, attains its minimum value of zero. The channel is then said to be useless.

3. When p=0, then C=1 bit per channel use, which is exactly equal to the information in
each channel input. This is intuitively correct. Since with p=0, when the channel output
is observed, the uncertainty that is resolved about the channel input must equal the
information in the channel input.

4. We note that when the conditional probability of error, p, is zero, H(X|Y)=0

5. Note that we also have C=1 when p=1. In this case, the channel output, which is the
complement of the channel input, means that by observing the output we can uniquely
determine the channel input.
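The behaviour summarized in the observations above is easy to reproduce numerically from C = 1 − H(p); a minimal sketch (illustrative names) is:

```python
import math

def bsc_capacity(p):
    """Capacity of a binary symmetric channel, C = 1 - H(p), in bits per channel use."""
    if p in (0.0, 1.0):
        return 1.0
    return 1.0 + p * math.log2(p) + (1 - p) * math.log2(1 - p)

for p in (0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0):
    print(f"p = {p:4.2f}   C = {bsc_capacity(p):.4f} bits per channel use")
```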

3.4 The Channel Coding Theorem

Noise is always present in a communication channel. Noise causes errors between the channel output and input of a digital communication system. With a probability of error of 10^-4, say, on average 10^4 − 1 bits out of every 10^4 are received correctly. It is normal in landline communication to specify a probability of error of 10^-6 or better. To achieve such a high level of performance requires the use of channel coding.

Channel coding consists of mapping an incoming data sequence into a channel input
sequence and inverse mapping a corresponding channel output sequence into an output
data sequence in such a way that the efect of channel noise on the system is minimized.
The first mapping operation is performed in the transmitter by a channel encoder and the
inverse mapping operation is performed in the receiver by a channel decoder.

The channel encoder is designed to add redundancy in a controlled manner so as to enable the reconstruction of the original channel input sequence as accurately as possible at the output of the channel decoder.

Channel coding is usually employed to effect error detection and/or error correction in a received channel output data sequence.

Historically, coding schemes have been grouped into:

(i) Block coding schemes

(ii) Convolutional coding schemes

(iii) Turbo coding schemes

We shall confine ourselves to discussing in detail block codes. In this class of codes, k
information bits are mapped into an n – bit block where the n – k bits are called parity –
check bits. The n – k parity – check bits are obtained by a linear operation on the k data bits
and are then appended to the data bits. Note that n > k.

The ratio k/n is called the code rate, r, i.e.

r = \frac{k}{n} < 1    ⋯⋯3.19

For a specified k, the code rate r approaches zero as n approaches infinity.

We shall further restrict ourselves to discussing a particular class of block codes, cyclic codes, for which relatively simple digital encoders and decoders (CODECs) may be found. Cyclic codes have found wide utilization in bandwidth – limited channels.

In these channels, only the error detection capability of codes is used.

Power – limited channels such as space and satellite channels have been the prime areas of
application of FEC codes. Here both error detection and error correction capabilities of
codes are used.

The accurate representation of the original source sequence at the destination requires that
the average probability of error in the received digital data sequence be as low as possible.

This poses the question: Does there exist a channel coding scheme for which the bit error
probability, Pe, is arbitrarily close to zero and yet the channel coding scheme is efficient in
that the code rate is not too small?

The answer to this question is provided by Shannon’s second theorem. Shannon posed the
question: Is it possible to communicate over a channel at a fixed rate R>0 (bits per second)
such that Pe is arbitrarily close to zero?

Shannon answered the question in the affirmative (a “yes” answer) provided that R is less or
equal to a quantity, C, called the “channel capacity”. This is achieved by “channel coding”, a
process in which the channel input source sequence (bits) are made to depend on the
source output sequence (bits) in a complicated way. Arbitrarily small bit error probability,
Pe, may be attained as the coding complexity becomes great. Both convolutional codes and
turbo codes are near-capacity coding schemes in that their use may enable transmission
rates at close to the Shannon channel capacity to be attained.

In our previous description of channel capacity, time was not considered.

Suppose we consider the communication system represented by figure 3.3

Fig. 3.3. A Digital Communication System

Let the DMS in Fig. 3.3 have a source alphabet u and entropy H(u) bits per source symbol.
We assume that the source emits one symbol every Ts seconds. The average information
rate of the source is thus H(u)/Ts

bits per second (bps). There is no backlog in the system and so the destination is assumed
to receive symbols at the same rate they were emitted by the source i.e. one symbol every
Ts seconds. The DMC was assumed earlier to have a channel capacity of C bits per channel
use.

Assuming that the channel is used once every Tc seconds, then the channel capacity per unit time is C/Tc bits per second. Thus the maximum rate of information transfer over the channel is C/Tc bits per second (bps).

3.4.1 Statement of the Channel Coding Theorem

As explained in the introduction, the theorem asks and answers the question: What is the
maximum transmission rate for reliable communication over a noisy channel?

Theorem 3.1. The Channel Coding theorem for Discrete Memoryless Channels.

(i) Let a DMS with an alphabet u and entropy H(u) produce one symbol every Ts seconds. Let
a DMC have capacity C bits per channel use and be used once every Tc seconds. Then if

H (u ) C
≤ ⋯⋯3 .20
Ts Tc

there exists a coding scheme for which the source output can be transmitted over the channel and be reconstituted at the receiver with an arbitrarily small probability of error Pe.

The parameter C/Tc is called the critical rate. If equation 3.20 is satisfied with equality, the
system is said to be transmitting at the critical rate.

(ii) Conversely, if

H (u ) C
>
Ts Tc

it is impossible to transmit symbols over the channel and reconstitute it with an arbitrarily small
Pe.

Shannon’s second theorem specifies the channel capacity as a fundamental limit on the rate at
which the transmission of error free messages can take place over a DMC.

The following should be noted:

(i) Shannon’s second theorem assures us that if the theorem is satisfied, a suitable code
exists. It does not tell us how to obtain the code.

(ii) The theorem does not specify a precise value for Pe. If equation 3.20 is satisfied, it only guarantees that Pe can be made arbitrarily small by a suitable (generally increasingly complex) coding scheme.

Note that power and bandwidth, although neglected in our discussion so far, are important
determinants of Pe for both power – limited and bandwidth – limited channels. They do
determine values of the elements of the channel matrix P.

Application of the Channel Coding Theorem to the Binary Symmetric Channel.

Consider a DMC with:

• Equally likely binary symbols (0’s and 1’s)

• Binary symbols are transmitted once every Ts seconds.

• The source entropy is one bit per source symbol (i.e. p(0) = p(1) = ½ )

• The source output is applied to a channel encoder with code rate r.

• The channel encoder produces one symbol every Tc seconds

Then:

• The information rate of the source is 1/Ts bits per second (bps).

• The source encoder symbol transmission rate is 1/Ts symbols per second.

• The channel encoder engages a BSC once every Tc seconds.

Assuming that the channel capacity is as given by the transition probabilities of a BSC, then the channel capacity per unit time is C/Tc bits per second.
Part (i) of the Channel Coding theorem implies that if
\frac{1}{T_s} \le \frac{C}{T_c}    ⋯⋯3.21

then the probability of error can be made to approach zero by choice of a proper channel
encoding scheme.

The code rate of the channel encoder is

r = \frac{T_c}{T_s}    ⋯⋯3.22

and therefore, in terms of r, equation 3.21 becomes r ≤ C.

That is, if r ≤ C, there exists a code, whose code rate is less than or equal to C, which can achieve probabilities of error approaching zero for a BSC.

Example 3.3

In a repetition code, each bit in a message is repeated n times where n = 2m + 1 is an odd integer. For example, for n = 3, we transmit, for a "0", the digits 000 and, for a "1", the digits 111. This obviously improves the reliability of the channel as compared to the transmission of single 0's and 1's.

The majority rule is used at the channel decoder where, if the number of received 0’s exceeds
the number of received 1’s in a block of n bits, then a “0” is declared to be the bit which had
been transmitted and vice versa.

An error occurs when m + 1 or more bits out of n = 2m + 1 are received in error. Because the channel is symmetric, the average probability of error Pe is independent of the a priori probabilities of transmission of a "0" or a "1" (prove this). Pe is thus given by

P_e = \sum_{k=m+1}^{n} \binom{n}{k} p^k (1-p)^{n-k}    ⋯⋯3.22

where p is the transition probability of the channel.

For p = 10^-2 and various values of n, we get:

Note that in equation 3.22, Pe → 0 as n → ∞. (In fact, by the Law of Large Numbers, about np channel errors are expected to occur in n transmissions for very large values of n.)

The price which we must pay for this is that the code rate r = 1/n for repetition codes also tends
to zero as n tends to infinity.
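The expression for Pe above is straightforward to evaluate for p = 10^-2 and increasing n; the sketch below (illustrative names) shows Pe falling towards zero while the code rate 1/n also falls:

```python
from math import comb

def repetition_pe(n, p):
    """Average error probability of an n = 2m + 1 repetition code with
    majority-rule decoding over a BSC with transition probability p."""
    m = (n - 1) // 2
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(m + 1, n + 1))

for n in (1, 3, 5, 7, 9, 11):
    print(f"n = {n:2d}   code rate = 1/{n}   Pe = {repetition_pe(n, 1e-2):.3e}")
```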

Note also that for a BSC we can plot C as a function of the transition probability, p, to get:

C =1 + p log p + (1 – p)log(1 – p)

Fig. 3.4

Note that C decreases from 1 to 0 as p increases from 0 to ½ .

3.5 Block Codes for Error Detection and Correction

We previously introduced the concept of coding for either error detection or for both error
detection and correction. Codes for Forward Error Correction (FEC) are obviously longer than
codes for error detection alone.

Shannon’s Second theorem states that so long as the transmission rate over a channel is less
than the channel capacity, a code exists which will ensure reliable communication over the
channel.

In this section we will discuss one class of such possible codes – parity – check block codes in
which r parity – check bits are formed by a linear operation on k data bits and then appended to
the k data bits and then transmitted as one block. These codes can be used for FEC. We shall
focus on a particular class of block codes – cyclic codes – for which relatively simple CoDecs may
be designed.

Fig 3.5 A Channel Encoder

Note that the encoder takes the k data bits and outputs a block of n > k binary digits. Both the k data bits and the block of n bits have a transmission interval of T sec.

This accounts for the need for a greater bandwidth in the modulator required to transmit the block of n bits as compared to that required for the k data bits. Why is this? Alternatively, if the bit duration is held constant, the information has to be transmitted at a lower rate over the same bandwidth.

3.5.1 Single Parity – Check Code

In this code n = k + 1 i.e. for every k data bits a single parity – check bit, usually the modulo – 2
(mod – 2) sum of the data bits, is appended to the data bits. Such a code can detect an odd
number of errors. Why?

With mod-2 addition, the resulting code is called an even parity check code, as the number of 1's in the resulting code word is even. Addition using mod-2 arithmetic will be symbolized by ⊕. In logic it is synonymous with the Exclusive-OR operation.

Decoding, and detection of an odd number of errors if one has occurred, is done at the receiver by repeating the parity-check operation (mod-2 addition).

Example 3.4

Suppose the data sequence 100110 (read from the right) is to be transmitted. The even-parity code word is

1 0 0 1 1 0 1

where the first six bits are the data bits and the seventh bit is the appended parity-check bit (the mod-2 sum of the data bits).

Suppose that three codewords are received: 1011000, 1001111 and 1011111.

Performing mod – 2 addition we get an odd – parity (the sum is 1) for the first two code words
indicating that an odd number of errors has occurred (3 in the first code word and 1 in the
second code word).

Note that although mod – 2 addition enables us to detect an odd number of errors, it does not
tell us which bits in the code word have been received in error.

Mod – 2 addition of the third code word yields an even – parity (the sum is a 0) which may lead
us to conclude that no error in transmission has occurred when, in fact, two errors have
occurred.
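The encoding and checking operations of this example are easily mechanised. The following Python sketch (an illustration, not part of the notes) appends the mod-2 sum of the data bits and then repeats the parity check on the three received words above:

# Sketch: even-parity encoding and parity checking for Example 3.4.
def add_parity(bits):
    return bits + [sum(bits) % 2]       # append the mod-2 sum of the data bits

def parity_check(codeword):
    return sum(codeword) % 2            # 0 = even parity, 1 = odd parity (error detected)

data = [1, 0, 0, 1, 1, 0]
print(add_parity(data))                 # -> [1, 0, 0, 1, 1, 0, 1]

for received in ("1011000", "1001111", "1011111"):
    cw = [int(b) for b in received]
    print(received, "error detected" if parity_check(cw) else "no error detected")

As the example notes, the third word passes the check even though two errors have occurred.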

With k data bits in a block of n bits, one speaks of an (n, k) code. The encoder outputs a unique
n – bit code word for each of the 2k possible input data words. Thus, for example, a (7, 4) code
has r = n – k = 7 – 4 =3 parity – check bits.

3.5.2 Linear Codes

We will limit our discussion to linear codes, in which each of the parity-check bits is given by a mod-2 addition of the data bits. We shall further limit ourselves to systematic codes, in which the parity-check bits appear at the end of a code word.

Let d = {d1, d2, …, dk} be a vector representing the k – data bits. Let the n – bit vector c = {c1, c2,
…, ck, ck+1, …, cn}. For a systematic code,

ci = di,  i = 1, 2, …, k

The r = n – k parity – check bits are obtained as a weighted mod – 2 addition of data bits:

ck+1 = a11 d1 ⊕ a12 d2 ⊕ … ⊕ a1k dk
ck+2 = a21 d1 ⊕ a22 d2 ⊕ … ⊕ a2k dk
  ⋮
cn  = ar1 d1 ⊕ ar2 d2 ⊕ … ⊕ ark dk          ……3.23

The coefficients aij for i = 1, …, r and j =1, …, k are either 0 or 1.

For systematic codes we can put 3.23 in matrix notation as

c = dG          ……3.24

where G is a k x n matrix with the first k columns forming an identity matrix Ik, representing the k data bits. The remaining r columns contain the transpose of the array of aij coefficients of equation 3.23.

We may now write

G = [Ik | P]          ……3.25

with

    | a11  a21  ⋯  ar1 |
P = | a12  a22  ⋯  ar2 |
    |  ⋮    ⋮        ⋮  |
    | a1k  a2k  ⋯  ark |

G is called the code generator matrix and P is called the parity check matrix.

Example 3.5

Suppose we have a (7, 3) code with P given as

    | 1 1 0 0 |
P = | 0 1 1 0 |          ……3.26
    | 1 1 1 1 |

Note that k = 3 and, therefore, the code rate, k/n, is 3/7. The code words, each of length n = 7, are generated from the matrix equation

c = dG = d[I3 | P]

There are 2^k = 2^3 = 8 possible data words and, therefore, we get:

d          c = d[I3 | P]
000 000 0000
001 001 1111
010 010 0110
011 011 1001
100 100 1100
101 101 0011
110 110 1010
111 111 0101
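The table above can be reproduced mechanically from c = dG. A minimal Python sketch (illustrative only; the function name encode is arbitrary) is:

import itertools

# Sketch: generating all codewords of the (7, 3) code of Example 3.5
# from c = dG with G = [I3 | P]; all arithmetic is mod 2.
P = [[1, 1, 0, 0],
     [0, 1, 1, 0],
     [1, 1, 1, 1]]

def encode(d, P):
    parity = [sum(d[j] * P[j][i] for j in range(len(d))) % 2 for i in range(len(P[0]))]
    return list(d) + parity             # systematic: data bits followed by parity bits

for d in itertools.product((0, 1), repeat=3):
    print(d, encode(d, P))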

Will linear systematic codes always satisfy the prefix condition? Why?

Definition 3.4 Hamming Weight

The Hamming weight of a code word c, denoted by w(c) is the number of non zero components
of c.

For example, if c = 110101, then w(c) = 4.

Definition 3.5 Hamming Distance

The Hamming distance between two code words c1 and c2, denoted by d(c1, c2), is the number of positions in which they differ.

For example, if c1 = 110101 and c2 = 111000 then d(c1, c2) = 3. Note that

d(c1, c2) = w(c1 ⊕ c2)

For Example 3.5 we note that the minimum distance between any two code words is 3. We can thus correct one error, since the received code word, with one error, will still be closer to the correct code word than to any other code word. This is thus an example of a single error correcting code.
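The definitions above, and the claim that the minimum distance of the code of Example 3.5 is 3, can be checked with the following Python sketch (illustrative only; the codeword list is copied from the table of Example 3.5):

import itertools

# Sketch: Hamming weight, Hamming distance and minimum distance.
def weight(c):
    return sum(int(b) for b in c)

def distance(c1, c2):
    return sum(a != b for a, b in zip(c1, c2))    # d(c1, c2) = w(c1 XOR c2)

codewords = ["0000000", "0011111", "0100110", "0111001",
             "1001100", "1010011", "1101010", "1110101"]
print(weight("110101"), distance("110101", "111000"))     # 4 and 3, as in the text
dmin = min(distance(a, b) for a, b in itertools.combinations(codewords, 2))
print("minimum distance =", dmin)                          # 3, so one error can be corrected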

If the code is used only for error correction, then the code can correct all patterns of t or fewer random errors provided d ≥ 2t + 1, i.e. t = [(d − 1)/2], where the notation [x] represents the integer part of its argument.

A code can correct all patterns of t or fewer random errors and detect all patterns having no more than s errors provided that 2(s + t) + 1 ≤ d.

3.5.3 Decoding

Before discussing decoding, let us establish a probabilistic criterion for decoding. Such a criterion, called maximum likelihood decoding, will be shown to be equivalent to finding the closest code word.

Maximum Likelihood Decoding

Given the received vector r, the decision rule that minimizes the probability of error is to choose the codeword ci which maximizes p(c = ci/r). This is called the maximum a posteriori decision rule. The proof that this rule minimizes the probability of error will not be presented here.

We know by Bayes' rule that

p(c/r) = p(r/c) p(c) / p(r)

Since p(r), the probability of observing the vector r, is independent of c, maximizing p(c/r) is equivalent to maximizing

p(r/c)p(c)

If we now assume, without loss of generality, that each codeword is chosen with equal probability, then maximizing p(r/c)p(c) is equivalent to maximizing p(r/c).

A codeword which is chosen to maximize p(r/c) is said to be chosen according to the maximum
likelihood criterion.

Assume a BSC with crossover probability p in which the channel acts on each transmitted bit independently (iid errors). Then

p(r/c) = Π_{i=1}^{n} p(ri/ci)

where

p(ri/ci) = 1 − p,  if ci = ri
         = p,      if ci ≠ ri

Then

p(r/c) = Π_{i=1}^{n} p(ri/ci) = (1 − p)^#(ri = ci) p^#(ri ≠ ci)

where # represents "the number of". Or

p(r/c) = (1 − p)^(n − #(ri ≠ ci)) p^#(ri ≠ ci)
       = (1 − p)^n [p/(1 − p)]^#(ri ≠ ci)
But

#(ri ≠ ci) = Hamming distance between r and c

Since p < ½, the factor p/(1 − p) is less than 1, and so p(r/c) is maximized when #(ri ≠ ci) attains its lowest value. Thus, if we want to maximize p(r/c) we should choose the codeword ci which is closest to r: under our assumptions, the maximum likelihood criterion is the minimum distance criterion. We now consider decoding.
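A minimal Python sketch of minimum-distance (and hence, under the assumptions above, maximum-likelihood) decoding for the (7, 3) code of Example 3.5 follows; it is an illustration only, and the received word chosen anticipates Example 3.6 below.

# Sketch: minimum-distance decoding over the codeword list of Example 3.5.
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def ml_decode(received, codewords):
    return min(codewords, key=lambda c: hamming(received, c))

codewords = ["0000000", "0011111", "0100110", "0111001",
             "1001100", "1010011", "1101010", "1110101"]
print(ml_decode("1111010", codewords))    # -> '1101010' (the single error is corrected)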

From equations 3.24 and 3.25 we get

c = d[ Ik P] = [ d dP ] ……3.27

=[d cp ] ……3.28

where cp = dP. The sequence cp is called the parity-check sequence.

Suppose that c now represents a received sequence of n bits. The first k bits correspond to data
bits. To decode the received sequence, therefore, the receiver performs the operation dP and
compares it to the received parity – check sequence. Since mod – 2 addition and subtraction are
the same, then

dP ⊕ cp = 0          ……3.29

if the received sequence is not in error.

Rewriting equation 3.29 in matrix form we get:

dP ⊕ cp = [d  cp] [ P ; In−k ] = 0          ……3.30

in which [ P ; In−k ] denotes the matrix P stacked on top of In−k,

where In-k is an (n – k) x (n – k) identity matrix.

In general, for any code word c, we have

c = [d  cp]

and so

cH^T = 0

where H^T = [ P ; In−k ] is the matrix appearing in equation 3.30.

The matrix H is of the form

    | a11  a12  ⋯  a1k  1  0  ⋯  0 |
H = | a21  a22  ⋯  a2k  0  1  ⋯  0 |
    |  ⋮    ⋮       ⋮   ⋮  ⋮      ⋮ |
    | ar1  ar2  ⋯  ark  0  0  ⋯  1 |          ……3.34

Now suppose that an error has occurred in one of the bits of c. The received vector, r, can be
written as

r=c⊕e ……3.35

where e is an n – bit vector representing the error pattern. [If an error has occurred in the third
and fifth positions, for example,

e = (001010…0)]

Multiplying r by H^T yields:

rH^T = (c ⊕ e)H^T = cH^T ⊕ eH^T = eH^T = s          ……3.36

The vector s, called the syndrome, will be non zero if an error has occurred.

Suppose that a single error has occurred in the ith position. Then

e = (0 0 … 0 1 0 … 0)

with the single 1 in the ith position, and

eH^T = (a1i  a2i  …  ari)          ……3.37

Comparing with 3.34 we note that s is just the ith column of H. Since the columns of H are distinct, the error is not only detectable but also correctable.

Example 3.6

Consider the (7, 3) code of Example 3.5 and assume that the transmitted code word is 1101010 and the received bit sequence is 1111010 (i.e. an error has occurred in the third position). From equation 3.34 we get

    | 1 0 1 1 0 0 0 |
H = | 1 1 1 0 1 0 0 |
    | 0 1 1 0 0 1 0 |
    | 0 0 1 0 0 0 1 |

Performing the operation rHT we get

s = rHT = (1111)

s is the third column of H verifying that an error has occurred in the third position.
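The syndrome computation of Example 3.6, and the resulting single-error correction, can be sketched in Python as follows (illustrative only; all arithmetic is mod 2 and the function names are arbitrary):

# Sketch: syndrome decoding s = rH^T for Example 3.6.
H = [[1, 0, 1, 1, 0, 0, 0],
     [1, 1, 1, 0, 1, 0, 0],
     [0, 1, 1, 0, 0, 1, 0],
     [0, 0, 1, 0, 0, 0, 1]]

def syndrome(r, H):
    return [sum(ri * hi for ri, hi in zip(r, row)) % 2 for row in H]

def correct_single_error(r, H):
    s = syndrome(r, H)
    if not any(s):
        return list(r)                      # zero syndrome: accept the word as received
    for i in range(len(r)):                 # find the column of H equal to the syndrome
        if [row[i] for row in H] == s:
            return [b ^ (j == i) for j, b in enumerate(r)]
    return list(r)                          # no matching column: more than one error

r = [1, 1, 1, 1, 0, 1, 0]
print(syndrome(r, H))                       # -> [1, 1, 1, 1], the third column of H
print(correct_single_error(r, H))           # -> [1, 1, 0, 1, 0, 1, 0], the transmitted word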

For a code to correct up to t errors, certain bounds must be satisfied. One of these bounds is the Hamming bound.

A source using an (n, k) code produces sequences of length n, and there are 2^n such possible sequences. Since channel coding is being used, only some of the 2^n sequences are codewords. We are interested in counting the sequences that lie within a given distance of a codeword of interest, the distance being determined by the error-correcting capability t of the code.

Assume that the codeword of interest is represented by a vector of all 0's. It is then easy to show that the number of sequences that are at a distance of less than or equal to t from this codeword is given by

Σ_{j=0}^{t} C(n, j)          ……3.40

where C(n, j) is the number of n-bit sequences that differ from the codeword in exactly j positions.

Since there are 2^k possible codewords, and since for a t-error-correcting code the spheres of radius t around distinct codewords cannot overlap, the total number of sequences which are at a distance of less than or equal to t from some codeword is given by

2^k Σ_{j=0}^{t} C(n, j)          ……3.41

Obviously

2^k Σ_{j=0}^{t} C(n, j) ≤ 2^n

or

2^(n−k) ≥ Σ_{j=0}^{t} C(n, j)

which is the Hamming bound for a code to correct up to t errors.

Note that with t = 1 and r = n − k, the bound yields

2^r − (r + 1) ≥ k          ……3.42
This result can be related to the properties which the columns of the matrix H must satisfy.

To be a single error correcting code, all columns of H must be distinct. This can happen if the first k columns (of r bits each) are distinct, as they are related to the data bits. They must also be different from the last r columns, each of which contains a single 1, and no column may be all zeros. (Why?)

Example 3.7

Consider a (6, 3) and a (6, 4) code and determine whether they are single error correcting.

For the (6, 3) code r = 3 and r = 2 for the (6, 4) code. Applying inequality 3.42, we get

(i) For the (6, 3) code

2^3 − 4 = 4 ≥ k = 3. The code is therefore single-error correcting.

(ii) For the (6, 4) code

2^2 − 3 = 1 < k = 4

This code is therefore not single-error correcting.
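The same check can be programmed directly from the Hamming bound 2^(n−k) ≥ Σ_{j=0}^{t} C(n, j). A small Python sketch (illustrative only), applied to the two codes above and to the (7, 4) code of Table 3.1:

from math import comb

# Sketch: testing the Hamming bound for a t-error-correcting (n, k) code.
def satisfies_hamming_bound(n, k, t):
    return 2 ** (n - k) >= sum(comb(n, j) for j in range(t + 1))

for n, k in [(6, 3), (6, 4), (7, 4)]:
    print((n, k), satisfies_hamming_bound(n, k, t=1))
# (6, 3) and (7, 4) satisfy the bound for t = 1; (6, 4) does not.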

Some typical codes where we select the largest value of k only are shown below

Table 3.1 Typical Codes

t     n     k max     Code        Efficiency
1     4     1         (4, 1)      0.25
      5     2         (5, 2)      0.4
      6     3         (6, 3)      0.5
      7     4         (7, 4)      0.57
      15    11        (15, 11)    0.73
2     10    4         (10, 4)     0.4
      11    4         (11, 4)     0.36
      15    8         (15, 8)     0.53
3     10    2         (10, 2)     0.2
      15    5         (15, 5)     0.33
      23    12        (23, 12)    0.52
      24    12        (24, 12)    0.5

Note that the efficiency of a code (k / n) decreases with t

3.5.4 Cyclic Codes

These are codes in which code words are cyclic shifts of one another. Thus, for example, if c = (c1, c2, …, cn) is a code word, so are (c2, c3, …, cn, c1) and (c3, c4, …, cn, c1, c2). These codes are generated by a generator matrix in which the element in the kth row and nth column of G is always a 1.

The generator matrix for a systematic code has the form:

    | 1  0  ⋯  0  ×  ⋯  × |
G = | 0  1  ⋯  0  ×  ⋯  × |
    | ⋮              ⋮      |
    | 0  0  ⋯  1  ×  ⋯  1 |

where the entries marked × are the parity-generating coefficients and the element in the kth row and nth column is a 1.
To see why this last element must be a 1, suppose it were a 0 and take the data word

d1 = d2 = … = dk−1 = 0,  dk = 1

The corresponding code word

c = dG = (0 0 … 0 1 ck+1 … cn−1 0)

then has a 1 in the kth position and, by assumption, a 0 in the nth position.

Cyclically shifting c once to the right yields

c′ = (0 0 … 0 0 1 ck+1 … cn−1)

which must also be a code word. But its first k (data) positions are all zero while its parity-check part contains a 1 in the (k + 1)th position. This is impossible, since an all-zero data sequence must produce an all-zero parity-check sequence.

This is why the last element in G cannot be a zero; it has to be a 1.

Cyclic codes can be represented by polynomials. Thus the codeword

c = ( c1, c2, …, cn) can be represented by an (n – 1) – degree polynomial:

c(x) = c1 x^(n−1) + c2 x^(n−2) + … + c(n−1) x + cn          ……3.43

Each power of x represents a one bit shift in time.

In the polynomial, c1 represents the first bit of the code word and cn the last bit of the code word. A single cyclic shift, generating another code word, is represented by xc(x) (with the multiplication taken modulo x^n + 1, so that x^n is replaced by 1):

xc(x) = c2 x^(n−1) + c3 x^(n−2) + … + cn x + c1

Shifting twice yields

x^2 c(x) = c3 x^(n−1) + c4 x^(n−2) + … + cn x^2 + c1 x + c2

The polynomial representation may be extended to the matrix G. To do that, we insert the appropriate power of x wherever there is a 1 in G and we leave blanks (shown as dots below) where we have 0's.

Thus for the (7, 3) code, for example, we have

    | 1 0 0 1 1 1 0 |
G = | 0 1 0 0 1 1 1 |
    | 0 0 1 1 1 0 1 |

And its polynomial representation

    | x^6   ·     ·     x^3   x^2   x    ·  |
G = | ·     x^5   ·     ·     x^2   x    1  |
    | ·     ·     x^4   x^3   x^2   ·    1  |

More generally, for an (n,k) cyclic code

    | x^(n−1)   ·         ⋯    ·          ⋯       |
G = | ·         x^(n−2)   ⋯    ·          ⋯       |
    | ⋮                                            |
    | ·         ·         ⋯    x^(n−k)    ⋯    1  |

The polynomial for the last row has the form

g(x) = x^(n−k) + … + 1

and is called the generator polynomial of the code. This polynomial determines the properties
of the code.

To generate the polynomials in G yielding systematic codewords we follow these steps:

(i) Use g(x) as the kth row

(ii) To generate the (k − 1)th row, cyclically shift the kth row one column to the left. This is xg(x). The kth column entry must be a zero, since the matrix G has the identity matrix in its first k columns.

(iii) If the kth column entry is a 1, add the kth row (i.e. g(x)) to this, the (k − 1)th row, using mod-2 addition.

(iv) To generate the (k − 2)th row, repeat the process, but now operating on row (k − 1). Add g(x) to this row if the kth column entry is a 1.

(v) Repeat the process until you reach the top row.

Example 3.8

Assume g(x) = x^4 + x^3 + x^2 + 1 with n = 7. (This is a (7, 3) code.) If shifting is done without adding g(x) to the (k − 1)th and subsequent rows we get

    | x^2 g(x) |   | x^6   x^5   x^4   ·     x^2   ·    ·  |
G = | x g(x)   | = | ·     x^5   x^4   x^3   ·     x    ·  |
    | g(x)     |   | ·     ·     x^4   x^3   x^2   ·    1  |

We note that the first k = 3 columns do not represent an identity matrix. To generate G in the standard form, we follow the procedure above:

    | x[xg(x) + g(x)] |   | x^6   ·     ·     x^3   x^2   x    ·  |
G = | xg(x) + g(x)    | = | ·     x^5   ·     ·     x^2   x    1  |
    | g(x)            |   | ·     ·     x^4   x^3   x^2   ·    1  |

Replacing the x's by 1's and the dots (blanks) by 0's we get

    | 1 0 0 1 1 1 0 |
G = | 0 1 0 0 1 1 1 |
    | 0 0 1 1 1 0 1 |

This is the same code generator matrix we saw before.

The generator polynomial is obtained from the theorem:

Theorem 3.2 The generator polynomial, g(x), of an (n, k) cyclic code is a divisor of x^n + 1.

Example 3.9

Consider the class of (7, k) codes. The generator polynomials are divisors of x^7 + 1. We can show that

(x^7 + 1) = (x + 1)(x^3 + x + 1)(x^3 + x^2 + 1)

Products of divisors are divisors themselves. Consider the generator polynomial

g(x) = (x + 1)(x^3 + x + 1) = x^4 + x^3 + x^2 + 1

This is precisely the generator polynomial we encountered before.

Let us now consider how to generate code words.

The rows of G are obtained from g(x) by successive multiplication by x. Each cyclic code is thus
derivable from g(x) using the relation c = dG. A code word c(x), in polynomial form may thus be
written as

c(x) = d(x)g(x) ………3.44

The polynomial d(x) must be of degree (k − 1) or less to have c(x) a polynomial of degree (n − 1) or less.

Now consider a polynomial made from a data sequence

d(x) = d1 x^(k−1) + d2 x^(k−2) + … + d(k−1) x + dk

The operation x^(n−k) d(x) produces a polynomial of degree (n − 1) or less. Now, with g(x) a polynomial of degree (n − k), we have:

x^(n−k) d(x) / g(x) = q(x) + r(x) / g(x)          ……3.45

where the division has yielded a quotient polynomial q(x) of degree (k − 1) or less and a remainder polynomial r(x) of degree less than (n − k).

Since in mod-2 addition r(x) + r(x) = 0, it follows that the polynomial x^(n−k)d(x) + r(x), of degree (n − 1) or less, is divisible by g(x) and must, therefore, be a code word. Thus

c(x) = q(x)g(x) = x^(n−k)d(x) + r(x)

But x^(n−k)d(x) represents a shift of the data bits by (n − k) positions. Hence the remainder r(x) must represent the parity-check bits.

Example 3.10

Consider the (7, 3) code with generator polynomial g(x) = x^4 + x^3 + x^2 + 1. Assume that the data word is d = (0 0 1), i.e. d(x) = 1. Then

r(x) = Remainder of [x^(n−k) d(x) / g(x)]
     = Remainder of [x^4 / (x^4 + x^3 + x^2 + 1)]
     = x^3 + x^2 + 1
The code word is thus

c = ( 0 0 1 1 1 0 1)

Similarly, say the data sequence is d = (1 1 1), i.e. d(x) = x^2 + x + 1. Then

r(x) = Remainder of [(x^6 + x^5 + x^4) / (x^4 + x^3 + x^2 + 1)]
     = x^2
The code word is thus

c = ( 1 1 1 0 1 0 0)

Since for a given k we can generate all possible data sequences, this approach enables us to find
all the corresponding codewords.
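The polynomial division used in Example 3.10 is just long division over GF(2). The Python sketch below is an illustration only (polynomials are held as bit lists, highest power first); it reproduces both codewords of the example:

# Sketch: systematic cyclic encoding for the (7, 3) code of Example 3.10.
def gf2_remainder(dividend, divisor):
    rem = list(dividend)
    for i in range(len(rem) - len(divisor) + 1):
        if rem[i]:                                  # leading bit is 1: subtract (XOR) the divisor
            for j, dbit in enumerate(divisor):
                rem[i + j] ^= dbit
    return rem[-(len(divisor) - 1):]                # remainder has degree < deg g(x)

def cyclic_encode(data, g, n):
    shifted = list(data) + [0] * (n - len(data))    # coefficients of x^(n-k) d(x)
    return list(data) + gf2_remainder(shifted, g)

g = [1, 1, 1, 0, 1]                                 # g(x) = x^4 + x^3 + x^2 + 1
print(cyclic_encode([0, 0, 1], g, n=7))             # -> [0, 0, 1, 1, 1, 0, 1]
print(cyclic_encode([1, 1, 1], g, n=7))             # -> [1, 1, 1, 0, 1, 0, 0]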

To summarise:

i. We use the Hamming bound to select a code meeting our design criteria. We choose n and k.

ii. Given n, we find an appropriate generator polynomial g(x)

iii. Knowing g(x), we obtain G and we can then get c=dG

Example 3.11

Consider communication over an AWGN channel using PSK modulation. The bit error probability is given by

Pe = (1/2) erfc[√(Eb/n0)]

where Eb/n0 is the bit energy to noise density ratio.

Assume that a (7,4) code is used. Then, n=7 and k=4

The block error probability for transmission of 4 information bits is

(i) Pe,uncoded = 1 − (1 − Pe)^4 ≈ 4Pe

(ii) For the coded case, we transmit 7 bits in the time interval in which we previously transmitted 4 bits. The energy per transmitted bit is therefore reduced by the factor 4/7, and the bit error probability is now given by

Pe1 = (1/2) erfc[√(4Eb/7n0)]
Since this is a single error correcting code, the block error probability is given by

Pe,coded = Σ_{j=2}^{7} C(7, j) Pe1^j (1 − Pe1)^(7−j) ≈ 21 Pe1^2

Typical performance values are given in the table below.

Eb/n0 (dB)    Pe        Pe,uncoded     Pe,coded
−1            10^−1     0.34           0.34
6.8           10^−3     4 × 10^−3      1.9 × 10^−3
9.6           10^−5     4 × 10^−5      8.6 × 10^−6
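The entries of this table can be regenerated with the short Python sketch below (illustrative only; it simply evaluates the expressions for Pe, Pe1, Pe,uncoded and Pe,coded given above):

from math import erfc, sqrt, comb

# Sketch: uncoded versus (7, 4)-coded block error probability for PSK on AWGN.
def block_error_probs(ebno_db):
    ebno = 10 ** (ebno_db / 10)
    pe = 0.5 * erfc(sqrt(ebno))               # uncoded bit error probability
    pe1 = 0.5 * erfc(sqrt(4 * ebno / 7))      # coded bit error probability (energy scaled by 4/7)
    pe_uncoded = 1 - (1 - pe) ** 4            # block of 4 information bits
    pe_coded = sum(comb(7, j) * pe1**j * (1 - pe1) ** (7 - j) for j in range(2, 8))
    return pe, pe_uncoded, pe_coded

for db in (-1, 6.8, 9.6):
    print(db, block_error_probs(db))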

3.6 CONVOLUTIONAL CODES

The convolutional encoder, unlike the block encoder which operates on a block-by-block basis, acts on a stream of bits. A general block diagram of a convolutional encoder is shown in Fig. 3.6. This particular encoder has one input, n outputs and K shift register stages. For every one input bit, the encoder outputs n bits. It is thus referred to as a rate 1/n encoder. K is referred to as the constraint length; the encoder is thus a constraint length K, rate 1/n encoder.

Let us consider the K = 3, r = 1/2 encoder shown in Fig. 3.6 below.

Assume that the ith bit, di, has arrived at the input. The two output bits at time i are denoted as ci1 and ci2. Clearly

ci1 = di ⊕ di−1 ⊕ di−2
ci2 = di ⊕ di−2

or, in matrix notation,

                                  | 1  1 |
[ci1  ci2] = [di  di−1  di−2]     | 1  0 |
                                  | 1  1 |
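These equations can be realised directly as a two-stage shift register. The Python sketch below is an illustration only (the function name conv_encode and the flushing option are not from the notes). Encoding the 5-bit message 1 1 0 0 1, which is the message used in the polynomial-encoding example later in this section, reproduces the interleaved output stream derived there.

# Sketch: the K = 3, rate-1/2 convolutional encoder
# c_i1 = d_i + d_(i-1) + d_(i-2), c_i2 = d_i + d_(i-2) (mod 2).
def conv_encode(data, flush=True):
    d1 = d2 = 0                                       # register contents d_(i-1), d_(i-2)
    bits = list(data) + ([0, 0] if flush else [])     # K - 1 = 2 flushing zeros
    out = []
    for d in bits:
        out += [d ^ d1 ^ d2, d ^ d2]
        d2, d1 = d1, d                                # shift the register
    return out

print(conv_encode([1, 1, 0, 0, 1]))
# -> [1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1], i.e. 11010111111011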

A convolutional encoder may be considered to be a digital, linear time-invariant finite state


machine. The structure of such a machine may be described with the aid of several diagrams.
We start with a tree diagram.

In the tree diagram, inputs of the encoder are indicated by paths whereas encoder outputs are
indicated by symbols on the upper branches of the tree. An input zero specifies the upper
branch of a bifurcation while a one specifies the lower one. Thus for the encoder of figure (),
the input sequence 0 1 1 0 is indicated by up at the first branching level, down at the second
and third, and up at the fourth producing the outputs indicated on the diagram traversed: 00 11
01 01. Fig() indicates all output sequences corresponding to all 32 possible sequences for the
first five input bits.

The tree diagram indicates that after the first three branches the structure becomes repetitive. In fact, we note that beyond the third branch the code symbols on branches emanating from the two nodes labelled a are identical, and similarly for all the identically labelled pairs of nodes. Thus all identically labelled nodes in fig () can be joined together. This leads to redrawing the tree diagram as shown in fig (). This new figure is called a trellis diagram.

The convention used in the trellis diagram is that the code branches produced by a “0” input bit
are shown as solid branches and code branches produced by a “1” input bit are shown dashed.
Note that the last bit must be followed by K-1 zeros to clear the register of all input bits. These
K-1 output branches are called the tail of this code. In this case the last two branches are the tail
of this code.

The complete recursive structure of the trellis diagram leads to a further simplification of the representation of the encoder. Fig () is a representation of the encoder using a state diagram.

The states of the state diagram correspond to nodes of the trellis diagram

The encoder is a finite state digital machine. Since K=3, a four state directed graph can be used
to uniquely represent the input-output relations for the machine. A state diagram
representation of fig() is shown in fig()

Polynomial Encoding

Let us return to figure () and specifically to equations

ci1 = di ⊕ di−1 ⊕ di−2
ci2 = di ⊕ di−2

The delays of the data bits may be captured by using generator polynomials. Thus the generator polynomials for the two streams may be written as

g(1)(D) = 1 + D + D^2
g(2)(D) = 1 + D^2

where D is the unit delay operator.

Assume that we are required to encode the message 1 1 0 0 1, whose message polynomial may be written as

m(D) = 1 + D + D^4

The output polynomial C(i) (D) is given by

C(i)(D) = g(i)(D) m(D)


It can easily be shown that

C(1)(D) = 1 + D^3 + D^4 + D^5 + D^6
C(2)(D) = 1 + D + D^2 + D^3 + D^4 + D^6
Thus the output bit streams are

C(1) = 1 0 0 1 1 1 1

C(2) = 1 1 1 1 1 0 1

The output message is formed by interleaving the two bit streams to get

11010111111011

The last 4 output bits are flushing bits. Thus polynomial encoding automatically yields the
correct encoded bit stream together with the flushing bits.
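Polynomial encoding is simply multiplication of coefficient sequences over GF(2) followed by interleaving. A Python sketch (illustrative only; coefficients are stored lowest power of D first) that reproduces the bit stream above:

# Sketch: polynomial (generator-polynomial) encoding of the rate-1/2 code.
def gf2_poly_mul(a, b):
    prod = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            prod[i + j] ^= ai & bj
    return prod

g1 = [1, 1, 1]             # g(1)(D) = 1 + D + D^2
g2 = [1, 0, 1]             # g(2)(D) = 1 + D^2
m  = [1, 1, 0, 0, 1]       # m(D) = 1 + D + D^4

c1 = gf2_poly_mul(g1, m)   # -> [1, 0, 0, 1, 1, 1, 1]
c2 = gf2_poly_mul(g2, m)   # -> [1, 1, 1, 1, 1, 0, 1]
interleaved = [bit for pair in zip(c1, c2) for bit in pair]
print("".join(map(str, interleaved)))      # -> 11010111111011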

Exercise

1. Find the output bit stream for an input message 1 1 0 1 1 1 0 0 1 applied to the encoder
of fig ()
2. Draw a code tree, trellis diagram and state diagram for the following encoder

PROBLEMS

Entropy

1. (i) Justify the code word lengths for the letters Q and E in the Morse code for English

(ii) Using approved Kiswahili story books such as those by Shaaban Robert find the single
letter frequencies of letters in Kiswahili.

(iii) Access www.sxlist.com/techref/method/compress/etxfreq.htm to find a binary digit representation of the Morse Code as a variable length code. Compare this representation with a representation of the same characters in the ASCII (American Standard Code for Information Interchange) code.

2. Show that the Shannon entropy measure of equation 2.12 satisfies all four Khinchin
axioms.

3. Entropy measures using the escort probabilities of eqn 2.29 provide control over the contributions of individual events. If α or q is large and positive, for example, the entropy measure is more sensitive to events that occur more often. Draw curves for the Rényi, Tsallis and Shannon entropies for the following values of α and q: −0.1, −0.5, −0.8, 1.2, 1.5, 2.0, 3.0. Discuss your results.

4. (i) Show that the Shannon entropy measure of eqn 2.12 satisfies all four Khinchin axioms
with Axiom � being given by eqn 2.11.

(ii) Derive eqn 2.17 using the Tsallis entropy measure of eqn 2.18.

5. Verify the expression

Pi^q = Pi e^((q − 1) ln Pi)

and hence show that, in the limit as q → 1, the Tsallis entropy

Sq^T = (1 − Σ_{i=1}^{N} Pi^q) / (q − 1)

yields the Shannon entropy.

Data Compression

6. Identify which codes are prefix codes and which ones are uniquely decodable in the Table
below. Draw decision trees for all prefix codes.

Source Symbol    Code I    Code II    Code III    Code IV
S0 0 0 0 00
S1 10 01 01 01
S2 110 001 011 10
S3 1110 0010 110 110
S4 1111 0011 111 111

7. Two codes are given in the Table below:

Source Symbol Code I Code II


S1 0 0
S2 11 100
S3 100 101
S4 1010 110
S5 1011 111

Which of the two codes would you use in a communication system and why?

8. Consider the following binary sequence

11101001100010110100…

Use the Lempel–Ziv–Welch algorithm to encode the sequence. Assume that the binary symbols 0 and 1 are already stored in the codebook.

9. A source has an alphabet of 4 letters. The probabilities of the letters and two possible sets
of binary code words for the source are given below:

Source Symbol    Probability    Code I    Code II
a1 0.4 1 1
a2 0.3 01 10
a3 0.2 001 100
a4 0.1 000 1000

For each code, without proofs or numerical calculations; answer the following questions:

(i) Does the code satisfy the prefix condition?


(ii) Is the code uniquely decodable?

10. Consider two discrete memoryless sources. Source I has an alphabet of 6 letters with probabilities: 0.3, 0.2, 0.15, 0.15, 0.1, 0.1. Source II has an alphabet of 7 letters with probabilities: 0.3, 0.25, 0.15, 0.1, 0.1, 0.05 and 0.05. Construct a binary Huffman code for each set of messages. Find the average number of code letters per source letter for each code. Find the Huffman code using two different methods and show that they both yield the same average codeword lengths but different codeword length variances.

11. A DMS outputs symbols from an infinite alphabet u = {u1, u2, …, uk, …} with probabilities

p(uk) = 2^−k,  k = 1, 2, …

(i) Find H(u).
(ii) Construct a prefix condition code for this source with average code word length equal to H(u).

Note that Σ_{n=1}^{∞} n r^−n = r/(r − 1)^2

12. Two sources have source alphabets u and a with source statistics

Symbol          Probability
u1    a1        1/4     1/3
u2    a2        1/4     1/3
u3    a3        1/8     1/9
u4    a4        1/16    1/9
u5    a5        1/16    1/27
u6    a6        1/8     1/27
u7    a7        1/16    1/27
u8              1/16

(a) Find
(i) The Huffman code for each source.
(ii) The entropy for each source.
(iii) The average code word length and variance for each source.

(b) Draw a block diagram for a computer program to determine the entropy for each
source if source sequences are of length N = 4. Draw another block diagram for
arbitrary N.

(c) In percentage terms, how does the average length differ from the entropy?

(d) Is the inequality H(u) ≤ LN ≤ H(u) + 1 satisfied?

(e) Repeat parts (a) – (d) above for the Shannon – Fano code.

13. A source has the following alphabet and source probabilities:

Symbol Probability
S1 0.3
S2 0.1
Space 0.6

Two source symbols are taken in sequence and then encoded into a binary sequence. If a
space is used to separate the code words, find:

(i) The entropy of the source, H(u).
(ii) The entropy of the source sequence, H(u^N).
(iii) Is H(u^N) = H(u)? Why?

14. Given the following code

Symbol Probability Code


S1 0.5 0
S2 0.25 01
S3 0.125 011
S4 0.125 0111

(i) Is this a prefix condition code? Why?


(ii) Is the code uniquely decodable?
(iii) Find the entropy if source symbols are taken two in sequence before encoding.

15. A source produces a sequence of statistically independent binary digits with the
probabilities p(l) = 0.005 and p(0) = 0.995. These digits are taken 100 at a time and a binary
code word is provided for every sequence of 100 digits containing 3 or fewer 1's.

(i) If the code words are all of the same length, find the minimum length required to
provide the specified set of code words.
(ii) Find the probability of getting a source sequence for which no codeword has been
provided.

16. A pair of DMSs, operating in parallel, are connected to an encoder as shown.

u1 = {a1, a2 }

u2 = {b1,b2,b3 }

The probability of the joint event {u1, u2} is given in the following table:

P(u1, u2)    b1      b2     b3
a1           0.08    0.2    0.1
a2           0.06    0.5    0.06

(i) Determine a Huffman code, using one codeword for each symbol pair {u1, u2}. Find its average codeword length.
(ii) Ascertain whether the entropy of the joint event {u1, u2} is equal to, greater than or smaller than the sum of the two entropies H(u1) + H(u2).

17. A DMS has an alphabet u = {u1, u2, u3} with source statistics and code word representations
as shown in the table below:

Symbol Probability Code 1 Code 2


u1    1/2    00    1
u2    1/4    01    10
u3    1/4    10    100
(a) Are the two codes satisfactory representations for the source symbols? Why?

(b) There is one code for which it is not possible to draw a decision tree. Which code is
that and why is it not possible to draw a decision tree? Draw a decision tree for the
code for which it is possible to do so.

18. A discrete memoryless source has source alphabet u={a, b, c} with source statistics

P(a) = 1/2
P(b) = 1/4
P(c) = 1/4
An extended source is made by taking as outputs source sequences of length N = 2.

(a) Determine the alphabet and source statistics for the extended source.
(b) Find the Shannon – Fano code for the extended source.
(c) Find the entropy of the extended source
(d) Find the average codeword length for the Shannon-Fano code.
(e) Find a Huffman code with the least codeword variance.

19. Define a sequence of random variables as follows. Toss an unbiased coin. If the result is heads, let sn = 1 for all n; if the result is tails, let sn = 0 for all n.

Markov Sources

20. Find the equilibrium state probabilities for the Markov source shown below


(i) Find the entropy of each source output letter conditional on the current source
state.
(ii) Find the entropy per letter of the source sequence.
(iii) For each source state, find a variable length binary code for the letters in the source
alphabet.
(iv) Do the average codeword lengths for each state satisfy Shannon’s source coding
Theorem for discrete memoryless sources?

21. (a) A four state Markov source is represented by the following state transition

diagram. Find the stationary average entropy for the source.

(b) Assume that Sx represents the state of the Markov source just after it has emitted
letter X, find the encoded source output bit stream for the following source output
symbol stream.

dabbcaacd

Assume that the initial state of the Markov source is Sc.

REFERENCES

1. Shu-Kun Lin, “Gibbs Paradox and the Concepts of Information, Symmetry, Similarity
and their Relationship”, http://www.mdpi.com/1099-4300/10/1/1,2008.

2. “Bell-Labs Celebrates 50 years of Information Theory”, www.aleatel-lucent.com

3. H. Nyquist, "Certain Factors Affecting Telegraph Speed", Bell Syst. Tech. J., Vol.3, pp
324 – 352, April, 1924.

4. C.E. Shannon, “A mathematical Theory of Communication”, Bell Syst. Tech. J., Vol. 27,
pp 379 – 423, 623 – 656, July, October, 1948.

5. David P. Feldman, “Extensions to Shannon Entropy, Renyi Entropies Multifractals,


Tsallis Entropies”, http://hornacek.cca.edu/dare/parts/apf.slides.day.2.pdf

6. M. Gell-Mann and C. Tsallis (Eds), Non extensive Entropy Interdisciplinary


Applications, Oxford University Press. Oxford, 2004.

7. S. Verdu, "Fifty Years of Shannon Theory", IEEE Trans. Inform. Theory, Vol. 44, No. 6, pp 2057 – 2078, October 1998.

8. E. Chiu, et al., "Mathematical Theory of Claude Shannon", http://mit.edu/6.933/www/Fall2001/Shannon.pdf

9. Dictionary.com

10. C. Beck, "Generalized Information and Entropy Measures in Physics", arXiv:0902.1235 [cond-mat.stat-mech].

11. R.V.L. Hartley, “Transmission of information”, Bell Syst. Tech. J., Vol.7, pp 535 -563,
July 1928.

12. “Statistical Distribution of English Text”, www.data-compression.com/english.shtml

13. D.P. Feldman, http://hormacek.cca.edu/davo

14. E.T. Jaynes, “Information Theory and Statistical Mechanics I”, Phys. Rev., 106, pp 620
– 630, 1957.

15. E.T. Jaynes, “Information Theory and Statistical Mechanics II”, Phys. Rev. 108, pp 171
– 190, 1957.

16. M.L. Luhanga and T.G.B. Nyambo, “On the Entropy of Kiswahili”, Tanzania Journal of
Science, Vol.18, pp 1-5, 1992.

17. A.J. Viterbi and J.K. Omura, Principles of Digital Communication and Coding,
McGraw-Hill Book Company, New York, N.Y., 1979.

18. D.A. Huffman, "A Method for the Construction of Minimum-Redundancy Codes",
Proc. IRE, Vol. 40, No. 9, pp 1098 – 1101, Sept. 1952.

19. E.S. Schwartz, “An Optimal Encoding with Minimum Longest Code and Total Number
of Digits”, Inform. Contr. Vol.7, pp 37 – 44, March, 1960.

20. R.G. Gallager, Information Theory and Reliable Communication, Wiley, New York,
1968.

21. J. Ziv and A. Lempel – Formerly Ref [18].

22. T.A. Welch – Formerly Ref [19].

23. D.R. Cox and H.D. Miller, The Theory of Stochastic Processes, Chapman and Hall,
London, 1965.

24. C.E.M. Amos, “Performance Analysis of IP – Based Networks: A Case Study of the
University of Dar es Salaam Local Area Network”, M.Sc. Dissertation, University of
Dar es Salaam, July, 2006.
