EE376A/STATS376A Information Theory Lecture 8 - 02/01/2018

Lecture 8: Channel Capacity, Continuous Random Variables


Lecturer: Tsachy Weissman Scribe: Augustine Chemparathy, Adithya Ganesh, Philip Hwang

1 Channel Capacity
Given a channel with inputs X and outputs Y with transition probability P(Y|X):
Define: The channel capacity C is the maximal rate of reliable communication over the memoryless channel characterized by P(Y|X).
Further, recall the following definition:

C^(I) = max_{P_X} I(X; Y).

Theorem.
C = C^(I).
Proof: We will see this proof in the coming lectures.
• This theorem is important because C, being defined operationally, is hard to evaluate directly, whereas C^(I) is a tractable
optimization problem, as the sketch below illustrates.
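
To make this concrete, here is a minimal numerical sketch (not part of the original notes) that approximates C^(I) for a binary-input discrete memoryless channel given as a row-stochastic matrix, simply by scanning Ber(p) input distributions. The function names mutual_information and capacity_binary_input are our own, chosen for illustration.

    import numpy as np

    def mutual_information(p_x, channel):
        """I(X;Y) in bits for an input pmf p_x and a row-stochastic channel matrix P(y|x)."""
        p_xy = p_x[:, None] * channel              # joint pmf P(x, y)
        p_y = p_xy.sum(axis=0)                     # output marginal P(y)
        mask = p_xy > 0                            # skip zero-probability cells
        return float(np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x[:, None] * p_y)[mask])))

    def capacity_binary_input(channel, grid=10001):
        """Approximate C^(I) = max_{P_X} I(X;Y) by scanning Ber(p) input distributions."""
        return max(mutual_information(np.array([1.0 - p, p]), channel)
                   for p in np.linspace(0.0, 1.0, grid))

    # BSC with crossover probability 0.1: the maximum should be 1 - h2(0.1) ≈ 0.531 bits/use.
    bsc = np.array([[0.9, 0.1],
                    [0.1, 0.9]])
    print(capacity_binary_input(bsc))

The same scan applies to any binary-input channel matrix, including the BEC treated in Example II below.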

1.1 Examples

Example I. Channel capacity of a Binary Symmetric Channel (BSC).


Define alphabets X = Y = {0, 1}. A BSC is defined by the PMF:

P_{Y|X}(y|x) = p if y ≠ x, and 1 − p if y = x.

This is equivalent to the channel matrix

[ 1 − p     p   ]
[   p     1 − p ]
The rows of the matrix correspond to input symbols 0 and 1, while the columns correspond to output symbols
0 and 1.
The channel also has a graph representation (transition diagram figure omitted).
This can also be expressed in the form of additive noise.

Y = X ⊕2 Z, where Z ∼ Ber(p) and Z is independent of X.

To determine the capacity of the BSC, by the theorem we must maximize the mutual information over input distributions:

I(X; Y ) = H(Y ) − H(Y |X)


= H(Y ) − H(X ⊕2 Z|X)

Once we condition on X, the uncertainty in X ⊕2 Z is the same as the uncertainty in Z. Formally, we can simplify the second term (this can also be shown by separately considering the cases X = 0 and X = 1):

I(X; Y ) = H(Y ) − H(Z)


= H(Y ) − h2 (p) ≤ 1 − h2 (p).

where h2(p) is the binary entropy function. Taking X ∼ Ber(1/2) achieves equality: I(X; Y) = 1 − h2(p)
(note: by symmetry, X ∼ Ber(1/2) implies Y ∼ Ber(1/2)). Thus, C = 1 − h2(p).
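
As a quick sanity check (not in the original notes), the short sketch below evaluates C = 1 − h2(p) for a few crossover probabilities; the helper name h2 is ours.

    import numpy as np

    def h2(p):
        """Binary entropy in bits."""
        return 0.0 if p in (0.0, 1.0) else -p * np.log2(p) - (1 - p) * np.log2(1 - p)

    # Capacity of the BSC, C = 1 - h2(p), for a few crossover probabilities.
    for p in (0.0, 0.1, 0.25, 0.5):
        print(f"p = {p:4.2f}  ->  C = {1 - h2(p):.4f} bits/use")

Note that p = 0.5 gives C = 0: the output is then independent of the input.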
Example II. Channel capacity of a Binary Erasure Channel (BEC).
Define alphabets X = {0, 1} and Y = {0, 1, e} where e stands for erasure. Any input symbol Xi has a
probability 1−α of being retained in the output sequence and a probability α of being erased. Schematically,
we have the BEC transition diagram (figure omitted).
Examining the mutual information, we have that

I(X; Y ) = H(X) − H(X|Y )


= H(X) − [H(X|Y = e)P(Y = e) + H(X|Y = 0)P(Y = 0) + H(X|Y = 1)P(Y = 1)]
= H(X) − [H(X) · α + 0 · P(Y = 0) + 0 · P(Y = 1)]
= (1 − α)H(X),

where we used that an erasure reveals nothing about X (so H(X|Y = e) = H(X) and P(Y = e) = α), while observing Y = 0 or Y = 1 determines X exactly.

Because the entropy of a binary variable can be no larger than 1:

(1 − α)H(X) ≤ 1 − α

Equality is achieved when H(X) = 1, that is, X ∼ Ber(1/2). Thus, the capacity of the BEC is C = 1 − α.
Note that if we knew in advance exactly which positions were going to be erased, we could communicate at this rate by sending
information bits only in the positions that survive (the expected fraction of unerased positions is 1 − α). The fact
that C = 1 − α indicates that we can achieve this rate even when we do not know which positions are going
to be erased.
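
To illustrate numerically (this sketch is not part of the original notes; the function name mutual_information_bits and the output ordering (0, e, 1) are our own choices), the uniform input on the BEC indeed attains I(X; Y) = 1 − α:

    import numpy as np

    def mutual_information_bits(p_x, channel):
        """I(X;Y) in bits for an input pmf p_x and a row-stochastic channel matrix P(y|x)."""
        p_xy = p_x[:, None] * channel
        p_y = p_xy.sum(axis=0)
        mask = p_xy > 0
        return float(np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x[:, None] * p_y)[mask])))

    alpha = 0.3
    # BEC channel matrix with output columns ordered as (0, e, 1).
    bec = np.array([[1 - alpha, alpha, 0.0],
                    [0.0,       alpha, 1 - alpha]])

    # The uniform input attains I(X;Y) = 1 - alpha.
    print(mutual_information_bits(np.array([0.5, 0.5]), bec))   # ≈ 0.7
    print(1 - alpha)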

2 Information of Continuous Random Variables


The channel capacity theorem also holds for continuous-valued channels, which are important in many practical scenarios, e.g., wireless communication. But before studying such channels, we need
to extend notions like entropy and mutual information to continuous random variables.
Definition: The relative entropy between two probability density functions f and g is given by

D(f||g) = ∫ f(x) log( f(x) / g(x) ) dx.
Exercise: Show that D(f ||g) ≥ 0 with equality if and only if f = g.
Proof. Observe that

D(f||g) = ∫ f(x) log( f(x) / g(x) ) dx
        = − ∫ f(x) log( g(x) / f(x) ) dx
        = −E[ log( g(X) / f(X) ) ]        (where X ∼ f)
        ≥ − log E[ g(X) / f(X) ]          (Jensen's inequality, since log is concave)
        = − log ∫ f(x) ( g(x) / f(x) ) dx
        = − log 1
        = 0.

Equality holds in Jensen's inequality exactly when g(X)/f(X) is constant, i.e., when f = g.
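
As a numerical illustration (not in the original notes), the snippet below integrates D(f||g) on a grid for two Gaussian densities (the means and variances are arbitrary choices) and compares the result with the known closed form for the relative entropy between two Gaussians, in nats:

    import numpy as np

    def gaussian_pdf(x, mu, sigma):
        return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

    # Numerically integrate D(f||g) (in nats) on a fine grid for two Gaussian densities.
    x = np.linspace(-20.0, 20.0, 200001)
    f = gaussian_pdf(x, 0.0, 1.0)
    g = gaussian_pdf(x, 1.0, 2.0)
    d_fg = np.sum(f * np.log(f / g)) * (x[1] - x[0])

    # Known closed form for Gaussians: ln(s2/s1) + (s1^2 + (m1 - m2)^2)/(2 s2^2) - 1/2.
    closed = np.log(2.0 / 1.0) + (1.0 + 1.0) / (2 * 4.0) - 0.5
    print(d_fg, closed)   # both ≈ 0.4431, and nonnegative as expected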


Definition: The mutual information between X and Y with joint probability density function f_{X,Y} is

I(X; Y) = D(f_{X,Y} || f_X f_Y).

Definition: The differential entropy of a continuous random variable X with probability density function f_X is

h(X) = − ∫ f_X(x) log f_X(x) dx = E[− log f_X(X)].

If X, Y have joint density f_{X,Y}, the conditional differential entropy is

h(X|Y) = − ∫ f_{X,Y}(x, y) log f_{X|Y}(x|y) dx dy = E[− log f_{X|Y}(X|Y)],

and the joint differential entropy is

h(X, Y) = − ∫ f_{X,Y}(x, y) log f_{X,Y}(x, y) dx dy = E[− log f_{X,Y}(X, Y)].
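
Since h(X) = E[− log f_X(X)] is just an expectation, it can be estimated by Monte Carlo. The short sketch below (ours, not from the notes) does this for an Exp(1) random variable, whose differential entropy is 1 nat = 1/ln 2 ≈ 1.443 bits:

    import numpy as np

    rng = np.random.default_rng(0)

    # Monte Carlo estimate of h(X) = E[-log2 f_X(X)] for X ~ Exp(1), whose pdf is e^{-x} on x >= 0.
    samples = rng.exponential(1.0, size=1_000_000)
    h_mc = np.mean(-np.log2(np.exp(-samples)))   # -log2 f_X(x) = x / ln 2

    print(h_mc, 1 / np.log(2))   # both ≈ 1.4427 bits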

2.1 Exercises

Exercise 1. Show that


h(X|Y ) ≤ h(X)
with equality iff X and Y are independent.
Proof. This follows from Exercise 2 below combined with the fact that I(X; Y) ≥ 0, which holds since
relative entropy is non-negative (see the exercise above). Equality holds exactly when f_{X,Y}(x, y) =
f_X(x) f_Y(y), i.e., when X and Y are independent.
Exercise 2. Show that

I(X; Y ) = h(X) − h(X|Y )


= h(Y ) − h(Y |X)
= h(X) + h(Y ) − h(X, Y ).

Proof.

I(X; Y) = ∫ f_{X,Y}(x, y) log [ f_{X,Y}(x, y) / (f_X(x) f_Y(y)) ] dx dy
        = ∫ f_{X,Y}(x, y) log [ f_{X,Y}(x, y) / f_X(x) ] dx dy − ∫ f_{X,Y}(x, y) log f_Y(y) dx dy
        = ∫ f_X(x) [ ∫ f_{Y|X}(y|x) log f_{Y|X}(y|x) dy ] dx − ∫ f_Y(y) log f_Y(y) dy
        = −h(Y|X) + h(Y)
        = h(Y) − h(Y|X).

By symmetry, the same argument shows I(X; Y) = h(X) − h(X|Y). Also,

I(X; Y) = ∫ f_{X,Y}(x, y) log [ f_{X,Y}(x, y) / (f_X(x) f_Y(y)) ] dx dy
        = ∫ f_{X,Y}(x, y) log f_{X,Y}(x, y) dx dy − ∫ f_{X,Y}(x, y) log f_X(x) dx dy − ∫ f_{X,Y}(x, y) log f_Y(y) dx dy
        = −h(X, Y) + h(X) + h(Y)
        = h(X) + h(Y) − h(X, Y).
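
For a concrete check (ours, not from the notes), a bivariate Gaussian with unit variances and correlation ρ has I(X; Y) = −(1/2) log2(1 − ρ²), and the identity I(X; Y) = h(X) + h(Y) − h(X, Y) can be verified with the Gaussian differential entropy formulas derived in Example II below:

    import numpy as np

    # Bivariate Gaussian with unit variances and correlation rho:
    # I(X;Y) = h(X) + h(Y) - h(X,Y) = -0.5 * log2(1 - rho^2).
    rho = 0.8
    cov = np.array([[1.0, rho],
                    [rho, 1.0]])

    h_x = 0.5 * np.log2(2 * np.pi * np.e * cov[0, 0])
    h_y = 0.5 * np.log2(2 * np.pi * np.e * cov[1, 1])
    h_xy = 0.5 * np.log2((2 * np.pi * np.e) ** 2 * np.linalg.det(cov))

    print(h_x + h_y - h_xy)               # ≈ 0.737 bits
    print(-0.5 * np.log2(1 - rho ** 2))   # same value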

Exercise 3. Show that

h(X + c) = h(X),

and

h(c · X) = h(X) + log|c|,   c ≠ 0.

Proof. We combine the two statements and prove that h(aX + b) = h(X) + log|a| whenever a ≠ 0. Let Y = aX + b.
Then the pdf of Y can be written as

f_Y(y) = (1/|a|) f_X( (y − b)/a ).

Thus,

h(aX + b) = h(Y) = − ∫ f_Y(y) log f_Y(y) dy
          = − ∫ (1/|a|) f_X( (y − b)/a ) log [ (1/|a|) f_X( (y − b)/a ) ] dy.

Changing variables to x = (y − b)/a (and reversing the limits of integration if a < 0),

h(aX + b) = − ∫ f_X(x) log [ (1/|a|) f_X(x) ] dx
          = − ∫ f_X(x) log(1/|a|) dx − ∫ f_X(x) log f_X(x) dx
          = h(X) + log|a|.
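
Here is a quick Monte Carlo check of the scaling law (ours, not from the notes), using a zero-mean Gaussian, for which aX is again Gaussian; the helper name mc_diff_entropy_gaussian is our own:

    import numpy as np

    rng = np.random.default_rng(1)

    def mc_diff_entropy_gaussian(sigma, n=1_000_000):
        """Monte Carlo estimate (in bits) of h(X) for X ~ N(0, sigma^2)."""
        x = rng.normal(0.0, sigma, size=n)
        pdf = np.exp(-x ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
        return np.mean(-np.log2(pdf))

    a, sigma = 3.0, 1.0
    # h(aX) should equal h(X) + log2|a|; adding a constant b would change nothing.
    print(mc_diff_entropy_gaussian(a * sigma))                  # h(aX)
    print(mc_diff_entropy_gaussian(sigma) + np.log2(abs(a)))    # h(X) + log2|a|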

2.2 Examples

Example I: Differential entropy of a uniform random variable U ∼ U(a, b).


• The pdf of a uniform random variable is

f_X(x) = 1/(b − a) for a ≤ x ≤ b, and 0 otherwise.

The differential entropy is simply:

h(X) = E[− log fX (X)] = log(b − a)

• Notice that the differential entropy can be negative or positive, depending on whether b − a is less than
or greater than 1 (see the small numerical check below). In practice, because of this property, differential entropy is usually used as a means
to compute mutual information and does not have much operational significance by itself.
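
A tiny numerical illustration (ours, not from the notes) of h(U) = log(b − a) in bits for a few intervals:

    import numpy as np

    # h(U) = log2(b - a) in bits; it is negative whenever b - a < 1.
    for a, b in [(0.0, 4.0), (0.0, 1.0), (0.0, 0.5)]:
        print(f"U({a}, {b}):  h = {np.log2(b - a):+.3f} bits")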
Example II: Differential entropy of a Gaussian random variable X ∼ N(0, σ²).

• The pdf of a Gaussian random variable is f(x) = (1/√(2πσ²)) e^{−x²/(2σ²)}.
The differential entropy is:

h(X) = E[− log f (X)]

For simplicity, convert the base to e:

h(X) = (1/ln 2) E[− ln f(X)]
     = (1/ln 2) E[ (1/2) ln(2πσ²) + X²/(2σ²) ]
     = (1/ln 2) [ (1/2) ln(2πσ²) + (1/(2σ²)) E[X²] ]
     = (1/ln 2) [ (1/2) ln(2πσ²) + (1/(2σ²)) σ² ]
     = (1/ln 2) (1/2) ln(2πeσ²)
     = (1/2) log(2πeσ²).
• Per Exercise 3, differential entropy is invariant to constant shifts. Therefore this expression gives the
differential entropy of any Gaussian random variable with variance σ², regardless of its mean.
• Claim: The Gaussian distribution has maximal differential entropy, i.e., if X ∼ fX with second moment
E[X 2 ] ≤ σ 2 and G ∼ N (0, σ 2 ) then h(X) ≤ h(G). Equality holds if and only if X ∼ N (0, σ 2 ).
Proof:

0 ≤ D(f_X || f_G) = E[ log( f_X(X) / f_G(X) ) ]
                  = −h(X) + E[ log( 1 / f_G(X) ) ]
                  = −h(X) + log √(2πσ²) + E[X²] / (2σ² ln 2).

Because the second moment of X is upper bounded by the second moment of G, i.e., E[X²] ≤ σ² = E[G²]:

0 ≤ D(f_X || f_G) ≤ −h(X) + log √(2πσ²) + E[G²] / (2σ² ln 2)
                  = −h(X) + E[ log( 1 / f_G(G) ) ]
                  = −h(X) + h(G).

Rearranging:

h(X) ≤ h(G).

Equality holds when D(f_X || f_G) = 0, i.e., when X ∼ N(0, σ²). A small numerical comparison follows below.
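
As an illustration of the claim (ours, not from the notes; it assumes the standard closed-form differential entropies of the uniform and Laplace densities, which are not derived here), the Gaussian has the largest differential entropy among a few zero-mean densities with the same second moment:

    import numpy as np

    # Differential entropies (in bits) of zero-mean densities with E[X^2] = 1.
    h_gauss = 0.5 * np.log2(2 * np.pi * np.e)      # N(0, 1)
    h_unif = np.log2(np.sqrt(12.0))                # Uniform(-sqrt(3), sqrt(3)), variance 1
    b = np.sqrt(0.5)                               # Laplace scale with variance 2b^2 = 1
    h_laplace = np.log2(2 * b * np.e)

    print(h_gauss, h_unif, h_laplace)   # ≈ 2.047, 1.792, 1.943: the Gaussian is largest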
Example III: Channel capacity of an additive white Gaussian noise (AWGN) channel with power constraint P.
The AWGN channel with parameter σ² has real-valued input and output related as Y_i = X_i + W_i, where the W_i are
i.i.d. ∼ N(0, σ²) and independent of the X_i.

• Power constraint: (1/n) Σ_{i=1}^n X_i² ≤ P.
• The Channel Coding Theorem in this setting states that

C(P) = max_{E[X²] ≤ P} I(X; Y),

where C(P) represents the capacity, i.e., the maximal rate of reliable communication when constrained
to power P. This maximization problem will be addressed in the next lecture. A small simulation of the channel model appears below.
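
To fix ideas before the next lecture, here is a minimal simulation sketch (ours, not from the notes) of the AWGN channel model with a Gaussian input chosen to satisfy the power constraint on average; the parameter values are arbitrary:

    import numpy as np

    rng = np.random.default_rng(2)

    # Simulate n uses of the AWGN channel Y_i = X_i + W_i with noise variance sigma2,
    # using a Gaussian input that meets the power constraint (1/n) sum X_i^2 <= P on average.
    n, P, sigma2 = 100_000, 2.0, 1.0
    x = rng.normal(0.0, np.sqrt(P), size=n)        # E[X^2] = P
    w = rng.normal(0.0, np.sqrt(sigma2), size=n)   # noise, independent of the input
    y = x + w

    print(np.mean(x ** 2))   # ≈ P: empirical input power
    print(np.mean(y ** 2))   # ≈ P + sigma2, since X and W are independent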
