
GUEST LECTURE

Probability Theory and Stochastic Processes
at HUST

Autumn 2022

Matthias Pätzold
University of Agder
Faculty of Engineering and Science
Mobile Communications Group
P.O. Box 509, NO-4898 Grimstad, Norway
Tel.: (+47) 3723 - 3283
E-mail: Matthias.Paetzold@uia.no
Web: http://mcg.uia.no/
Table of Contents

Introduction

I Random Variables

1 The Meaning of Probability
  1.1 Definitions of Probability
    1.1.1 The Relative Frequency Definition of Probability
    1.1.2 The Classical Definition of Probability
  1.2 Bertrand Paradox

2 The Axioms of Probability
  2.1 Set Theory
    2.1.1 Set Notations
    2.1.2 Set Operations
  2.2 Probability Space
  2.3 Conditional Probability
  2.4 Total Probability
  2.5 Bayes' Theorem
  2.6 Independence

3 Repeated Trials
  3.1 Combined Experiments
  3.2 Bernoulli Trials

4 The Concept of a Random Variable
  4.1 Definition of Random Variables
  4.2 The Cumulative Distribution Function
    4.2.1 Properties of Cumulative Distribution Functions
    4.2.2 Continuous and Discrete Types of Random Variables
  4.3 The Probability Density Function
    4.3.1 Properties of Probability Density Functions
    4.3.2 Examples of Continuous Probability Density Functions
    4.3.3 Examples of Discrete Probability Density Functions

5 Functions of Random Variables
  5.1 Transformations of Random Variables
  5.2 Expected Value
    5.2.1 The Expected Value of Continuous-Type Random Variables
    5.2.2 The Expected Value of Discrete-Type Random Variables
    5.2.3 The Expected Value of Y = g(X)
    5.2.4 Properties of the Expectation Operator
  5.3 Variance
  5.4 Moments
  5.5 The Chebyshev and Markov Inequalities
  5.6 The Characteristic Function

6 Two Random Variables
  6.1 Joint Distribution and Joint Density
  6.2 One Function of Two Random Variables
    6.2.1 The Sum of Two Random Variables
    6.2.2 The Quotient of Two Random Variables
    6.2.3 The Maximum of Two Random Variables
  6.3 Two Functions of Two Random Variables
  6.4 Joint Moments
  6.5 The Joint Characteristic Function
  6.6 Conditional Probability
    6.6.1 The Conditional Distribution of Y given X = x
    6.6.2 The Conditional Density of Y given X = x

7 Sums of Random Variables and Limit Theorems
  7.1 Sums of Random Variables
    7.1.1 The Expected Value of Sums of Random Variables
    7.1.2 The Variance of Sums of Random Variables
    7.1.3 The Density of Sums of Independent Random Variables
  7.2 The Sample Mean and the Sample Variance
  7.3 The Laws of Large Numbers
    7.3.1 The Weak Law of Large Numbers
  7.4 The Central Limit Theorem
  7.5 Gaussian Approximation for Binomial Probabilities
    7.5.1 Approximate Evaluation of P{X = k}
    7.5.2 Approximate Evaluation of P{k1 ≤ X ≤ k2}
  7.6 The Central Limit Theorem for Products

II Stochastic Processes

8 Stochastic Processes
  8.1 Definition of Stochastic Processes
  8.2 Statistics of Stochastic Processes
    8.2.1 Specifying a Random Process
    8.2.2 Stationary Processes
    8.2.3 Ergodic Processes
  8.3 Correlation Functions of Stochastic Processes
    8.3.1 The Autocorrelation Function
    8.3.2 Properties of Autocorrelation Functions
    8.3.3 The Cross-Correlation and Cross-Covariance Function
  8.4 The Power Spectral Density
  8.5 Linear Time Invariant Systems with Stochastic Inputs
    8.5.1 Linear Time Invariant Systems
    8.5.2 The Mean and Autocorrelation Function of the Output Process
    8.5.3 The Power Spectral Density of the Output Process
    8.5.4 The Cross-Correlation Function and the Cross-Power Spectral Density Between the Input and the Output Process
  8.6 White Noise
  8.7 Applications
    8.7.1 System Identification Using White Noise
    8.7.2 Matched Filter

Bibliography
Introduction

This lecture is divided in two parts. The first part presents an introductory treat-

ment of probability and random variables, and the second part deals with stochastic

processes.

The theory of probability is rooted in phenomena which can be modeled by an experi-

ment with an outcome that is unpredictable. In case that the experiment is repeated,

the outcome can differ due to the influence of an underlying random phenomenon.

Such types of experiments are known as random experiments. A typical random ex-

periment is the observation of the result of tossing a fair coin, where the possible

outcomes of a trial are “heads” or “tails”. In this coin experiment, for example, the

percentage of heads approaches 1/2. Also in other random experiments, it has been

observed that certain averages approach a constant value as the number of observa-

tions increases. The purpose of the probability theory is to describe and predict such

averages in terms of probabilities of events. The probability P (A) of an event A is a

number that can be interpreted as follows:

If the experiment is performed n times and the event A occurs nA times, then
(with a high degree of certainty) the relative frequency nA /n of the occurrence
of A is close to P (A), i.e.,

P(A) ≃ nA/n

provided that n is sufficiently large.

Approximately the first half of the lecture is devoted to the theory of probability and

random variables.

Physical systems and models in which there is uncertainty or randomness play a very

important role in the area of electrical engineering (see Fig. 1). For example, the

received signal in a radio communication system usually consists of the transmitted

signal, a random interference component, and channel noise. The interference compo-

nent may represent spurious electromagnetic waves produced by other communication

systems operating in the vicinity of the receiver. A major source of channel noise is

thermal noise, which is caused by the random motion of the electrons in devices at the

front end of the receiver. Hence, the received signal is random in nature. Although

it is not possible to predict the exact value of a random signal (stochastic process) in

advance, it is possible to describe the signal in terms of important statistical quanti-

ties such as mean value, autocorrelation function, and power spectral density, as we

will see in the second half of the lecture.

The intent of the course is to provide the foundation in statistics which would be

needed in other courses such as mobile radio communications, digital signal process-

ing, coding theory, and control theory.

Fig. 1: The importance of statistics: random variables & stochastic processes at the centre, linked to mobile radio communications, information & coding theory, analog/digital communications, measurement theory, digital signal processing, and control theory.


Part I

Random Variables
Chapter 1

The Meaning of Probability

1.1 Definitions of Probability

1.1.1 The Relative Frequency Definition of Probability

According to the relative frequency approach, the probability P (A) of an event A is

defined as follows.

Definition 1.1. The probability P (A) of an event A is the limit (Richard von Mises)

P(A) = lim_{n→∞} H(A, n) = lim_{n→∞} nA/n    (1.1)

where
H(A, n): relative frequency

nA : number of occurrences of A

n: number of trials.

• Problem: In a physical experiment, the numbers nA and n are finite. However,


provided that n is sufficiently large, we may write
P(A) ≈ nA/n .    (1.2)
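The convergence of the relative frequency can be illustrated numerically. The following minimal Python sketch (standard library only; the sample sizes are arbitrary illustrative choices) simulates repeated tosses of a fair coin and prints H(A, n) = nA/n for increasing n:

```python
import random

random.seed(1)  # reproducible run

def relative_frequency(n, p_heads=0.5):
    """Toss a coin n times and return the relative frequency of heads."""
    n_heads = sum(1 for _ in range(n) if random.random() < p_heads)
    return n_heads / n

for n in (10, 100, 10_000, 1_000_000):
    print(f"n = {n:>9}: H(A, n) = {relative_frequency(n):.4f}")
# The printed values approach P(A) = 0.5 as n grows.
```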

1.1.2 The Classical Definition of Probability

According to the classical definition, the probability P (A) of an event A is determined

a priori without actual experimentation.



Definition 1.2. The probability P (A) of an event A is given by the ratio

P(A) = NA/N    (1.3)

where
NA : number of favorable outcomes

N: number of possible outcomes.

The classical definition was introduced by J. Bernoulli in 1713 as a consequence of

the principle of insufficient reason:

“In the absence of any prior knowledge, we must assume that the events Ai have
equal probabilities.”

Example 1.1. We roll two dice and we want to find the probability p that the sum

of the appearing numbers equals 7. To solve this problem, we must determine the

numbers NA and N .

Number of favorable outcomes NA :

(3, 4), (4, 3), (5, 2), (2, 5), (6, 1), (1, 6) → NA = 6.

Number of possible outcomes N :

N = 6 · 6 = 36.

According to the classical definition (1.3), the probability is

P(sum equals 7) = 6/36 = 1/6 .
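As a quick cross-check of the classical definition, the N = 36 equally likely outcomes can be enumerated by brute force. A minimal Python sketch (standard library only):

```python
from fractions import Fraction

# All outcomes of rolling two dice
outcomes = [(i, j) for i in range(1, 7) for j in range(1, 7)]
favorable = [o for o in outcomes if sum(o) == 7]

p = Fraction(len(favorable), len(outcomes))
print(len(favorable), len(outcomes), p)   # 6 36 1/6
```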

1.2 Bertrand Paradox

Given is a circle C of radius r. What is the probability p that the length l of a “randomly selected” chord AB is greater than the side length r√3 of the inscribed equilateral triangle?
Fig. 1.1: Bertrand paradox: circle C of radius r with an inscribed equilateral triangle of side length r√3.

Solutions:

a) If the center M of the chord AB lies inside the circle C1 of radius r/2, then
   l > r√3 (see Fig. 1.2). Hence, we conclude that

   p = \frac{A_{C_1}}{A_C} = \frac{\pi r^2/4}{\pi r^2} = \frac{1}{4} .

Fig. 1.2: Bertrand paradox (first solution): chord AB with center M inside the concentric circle C1 of radius r/2.

b) We now assume that the end A of the chord AB is fixed (see Fig. 1.3). Hence,

   p = \frac{\text{B is on the 120° arc } DBE}{\text{B is on the circumference of } C} = \frac{2\pi r/3}{2\pi r} = \frac{1}{3} .

Fig. 1.3: Bertrand paradox (second solution).


c) We assume finally that the direction of AB is perpendicular to the line FK (see
   Fig. 1.4). Hence,

   p = \frac{\text{the center } M \text{ of } AB \text{ is between } G \text{ and } H}{\text{the center } M \text{ of } AB \text{ is between } F \text{ and } K} = \frac{r}{2r} = \frac{1}{2} .

Fig. 1.4: Bertrand paradox (third solution).

Obviously three different solutions exist for the same problem. Note that these solu-

tions correspond to three different experiments. The Bertrand paradox demonstrates

the ambiguities associated with the classical definition, and the need for a clear spec-

ification of the outcomes of an experiment.
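The three answers can also be reproduced by Monte Carlo simulation. The following Python sketch (standard library only; r = 1 and the sample size are arbitrary illustrative choices) encodes each "random chord" experiment by the geometric condition derived above and shows that the estimated probabilities approach 1/4, 1/3, and 1/2, respectively:

```python
import math
import random

random.seed(1)
N = 200_000
r = 1.0

def method_a():
    # Chord center uniform in the disk: l > r*sqrt(3) iff the center lies within r/2 of the origin.
    while True:
        x, y = random.uniform(-r, r), random.uniform(-r, r)
        if x * x + y * y <= r * r:
            return math.hypot(x, y) < r / 2

def method_b():
    # One endpoint fixed, the other uniform on the circumference:
    # chord length 2r*sin(theta/2) > r*sqrt(3) iff theta lies in (2*pi/3, 4*pi/3).
    theta = random.uniform(0.0, 2 * math.pi)
    return 2 * r * math.sin(theta / 2) > r * math.sqrt(3)

def method_c():
    # Chord perpendicular to a fixed diameter, midpoint uniform on the radius:
    # l > r*sqrt(3) iff the midpoint is closer than r/2 to the center.
    d = random.uniform(0.0, r)
    return d < r / 2

for name, method in (("a", method_a), ("b", method_b), ("c", method_c)):
    p = sum(method() for _ in range(N)) / N
    print(f"method {name}: p ≈ {p:.3f}")
# Typical output: method a ≈ 0.250, method b ≈ 0.333, method c ≈ 0.500
```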


Chapter 2

The Axioms of Probability

2.1 Set Theory

• A “set” is a collection of objects called “elements”.

• A “subset” B of a set A is another set whose elements are also elements of A.

2.1.1 Set Notations

• A = {ζ1, ζ2, . . . , ζn}: “The set A consists of the elements ζ1, ζ2, . . . , ζn.”

• B ⊂ A: “B is a subset of A”

• ζi ∈ A: ζi is an element of A

• ζi ∉ A: ζi is not an element of A

• ∅: empty or null set

• S: space

Note that if a set consists of n elements, then the total number of its subsets equals 2^n.

Example 2.1. Suppose that a coin is tossed twice. The resulting outcomes are the four objects hh, ht, th, tt. Hence, the set S is given by S = {hh, ht, th, tt}. The set S has 2^4 = 16 subsets. For example,

A = {heads at the first toss} = {hh, ht}

B = {only one head showed} = {ht, th} .

2.1.2 Set Operations


Subset: B⊂A
Transitivity: A ⊂ B and B ⊂ C then A ⊂ C
Equality: A ⊂ B and B ⊂ A then A = B
Sum (union): A + B or A ∪ B
Product (intersection): AB or A ∩ B

Set operations and set relations can be illustrated by the use of Venn diagrams.
Some basic set operations using Venn diagrams are shown in Fig. 2.1.
Fig. 2.1: Venn diagrams illustrating basic set operations: (a) B ⊂ A; (b) the union A + B and the intersection AB.

Example 2.2. If S = {1, 2, 3, 4, 5, 6}, A = {even}, B = {less than 5}


then

AB = {even, less than 5} = {2, 4} .

Mutually exclusive sets: Two sets A and B are said to be mutually exclusive or

disjoint if

AB = ∅ .

Several sets A1 , A2 , . . . are called mutually exclusive or disjoint if

Ai Aj = ∅ for every i and j ≠ i.

Partition: A partition U of a set S is a collection of mutually exclusive subsets Ai

of S whose union equals S, i.e.,

A1 ∪ A2 ∪ · · · ∪ An = S,    Ai Aj = ∅,  i ≠ j.

Thus,
U = [A1 , A2 , . . . , An ].

Fig. 2.2: A partition U = [A1, A2, A3, A4] of a set S.

Complement: Ā (“Ā denotes the complement of A”)

Fig. 2.3: The complement Ā of A.

Properties:  A ∪ Ā = S
             A Ā = ∅
             \bar{\bar{A}} = A
             \bar{S} = ∅
             \bar{∅} = S

De Morgan’s law:  \overline{A ∪ B} = Ā B̄   and   \overline{AB} = Ā ∪ B̄

Fig. 2.4: De Morgan’s law.

2.2 Probability Space

In probability theory, the following set terminology is used:

• The space S is called the certain event.

• The elements of S are called experimental outcomes.

• The subsets of S are called events.

• The empty set ∅ is called the impossible event.

• The event {ζi} consisting of a single element ζi is called an elementary event.

• A single performance of an experiment is called a trial.

Definition 2.1. Axiomatic definition of probability (A. N. Kolmogorov, 1933)

A mapping P , which assigns to each event A a real number P (A), is called the

probability of the event A if the following three conditions are satisfied:

I. 0 ≤ P (A) ≤ 1 (2.1)

II. P (S) = 1 (2.2)

III. If AB = ∅ then P (A ∪ B) = P (A) + P (B) (2.3)



These conditions are the axioms of the theory of probability. In the development of

the theory, all conclusions are based directly or indirectly on the axioms and only on

the axioms.

Properties:

(i) The probability of the impossible event is 0, i.e.,

P (∅) = 0 . (2.4)

Proof. From A∅ = ∅ and A + ∅ = A, it follows by using (2.3)

P (A) = P (A + ∅) = P (A) + P (∅)

⇒ P (∅) = 0 .

(ii) For any A,

P (A) = 1 − P (Ā) ≤ 1 . (2.5)

Proof. Since A + Ā = S and AĀ = ∅, it follows

1 = P (S) = P (A + Ā)

= P (A) + P (Ā) .

(iii) For any A and B,

P (A + B) = P (A) + P (B) − P (AB) (2.6)

≤ P (A) + P (B) . (2.7)

Proof. A + B = A + ĀB , B = AB + ĀB

Therefore,

P (A + B) = P (A) + P (ĀB) , P (B) = P (AB) + P (ĀB).

Eliminating P (ĀB) gives (2.6).



Definition 2.2. Fields

A field F is a nonempty class of sets such that:

I. If A ∈ F then Ā ∈ F (2.8)

II. If A ∈ F and B ∈ F then A + B ∈ F . (2.9)

Properties:

i) If A ∈ F and B ∈ F then AB ∈ F (2.10)

ii) A field contains the certain event and the impossible event, i.e.,
S ∈ F and ∅ ∈ F . (2.11)

From this it follows that all sets that can be written as unions or intersections of

finitely many sets in F are also in F. However, this is not necessarily the case for

infinitely many sets.

Definition 2.3. Borel fields

Suppose that A1 , . . . , An , . . . is an infinite sequence of sets in F. If the union and

intersection of these sets also belong to F, then F is called a Borel field.

2.3 Conditional Probability

Definition 2.4. The conditional probability of an event A assuming another event

B, denoted by P (A|B), is by definition the ratio

P(A|B) = \frac{P(AB)}{P(B)}    (2.12)

where we assume that P (B) is not 0.

Example 2.3. Fair-die experiment

Determine the conditional probability of the event {f2 } assuming that the event even

occurred.

With

A = {f2 } and B = {even} = {f2 , f4 , f6 }

we have
P(A) = 1/6 and P(B) = 3/6. Since AB = A, (2.12) yields

P(A|B) = P(AB)/P(B) = P(A)/P(B)

P{f2 | even} = P{f2}/P{even} = (1/6)/(3/6) = 1/3 .

This probability equals the relative frequency of the occurrence of the event {f2 } in

the subsequence whose outcomes are even numbers.

2.4 Total Probability

Definition 2.5. Total probability theorem

If U = [A1 , A2 , . . . , An ] is a partition of S and B is an arbitrary event [see Fig. 2.4],

then

P (B) = P (B|A1 ) P (A1 ) + · · · + P (B|An ) P (An ) . (2.13)

Fig. 2.4: A partition of S.

Proof. To prove (2.13), we write

B = BS = B(A1 ∪ A2 ∪ · · · ∪ An ) = BA1 ∪ BA2 ∪ · · · ∪ BAn .



Since U = [A1 , A2 , . . . , An ] is a partition of S, the events Ai and Aj are mutu-

ally exclusive. Consequently BAi and BAj are mutually exclusive too, and thus

P (B) = P (BA1 ) + P (BA2 ) + · · · + P (BAn ) .

Hence, (2.13) follows because P (BAi ) can be expressed as [see (2.12)]

P (BAi ) = P (B|Ai ) P (Ai ) .

Example 2.4. Binary symmetric channel

Consider the binary symmetric channel shown in Fig. 2.5. The transmission of binary

data (0, 1) over this channel results in an error if the receiver does not detect the

transmitted bit (symbol) correctly.


Channel
P(B1 |A 1)
P(A 1 ) 1 1 P(B1 )

P(B0 |A 1 )

P(B 1 |A0 )

P(A 0 ) 0 0 P(B0 )
P(B0 |A 0 )

Fig. 2.5: Binary symmetric channel.

The following results have been obtained from measurements:

• 95% of all “1” are transmitted correctly, i.e., P (B1 |A1 ) = 0.95

⇒ P (B0 |A0 ) = 0.95 (due to symmetry)

• 45% of all transmitted symbols are “0”, i.e., P (A0 ) = 0.45

⇒ P (A1 ) = 1 − P (A0 ) = 0.55

Compute the probability P (error) for the occurrence of a transmission error.

Solution. In this example, the certain event S is given by S = {0, 1}², and the event (A, B) ∈ S describes that the bit A was sent and the bit B was received. By using (2.13), the probability of a transmission error is then given by

P(error) = P({(0, 1), (1, 0)})
         = P(A0) · P(B1|A0) + P(A1) · P(B0|A1)
         = 0.45 · 0.05 + 0.55 · 0.05 = 0.05 .
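The two-term sum of the total probability theorem (2.13) can be written out directly in a few Python lines (a minimal sketch using the numbers of this example):

```python
# Binary symmetric channel of Example 2.4
P_A0, P_A1 = 0.45, 0.55          # a priori probabilities of sending "0" and "1"
P_B1_given_A0 = 0.05             # bit error probabilities (channel symmetry)
P_B0_given_A1 = 0.05

# Total probability over the partition {A0, A1}
P_error = P_A0 * P_B1_given_A0 + P_A1 * P_B0_given_A1
print(P_error)   # 0.05
```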

2.5 Bayes’ Theorem

Since P (BAi ) = P (Ai |B) P (B) = P (B|Ai ) P (Ai ) it follows that


P(Ai|B) = \frac{P(B|Ai)\, P(Ai)}{P(B)} .    (2.14)

Inserting (2.13) into (2.14), we obtain Bayes’ theorem

P(Ai|B) = \frac{P(B|Ai)\, P(Ai)}{\sum_{j=1}^{n} P(B|Aj)\, P(Aj)} .    (2.15)

The probabilities P (Ai ) and P (Ai |B) are often called a priori and a posteriori prob-

abilities, respectively.

Example 2.5. Maximum a posteriori (MAP) receiver

When a MAP receiver is used, the detection of the received signal is based on the a

posteriori probability.

We continue with Example 2.4 and ask for the probability that the bit “1” was

transmitted on the condition that the bit “0” was received. With the events

A1 = {“1” was transmitted}, B0 = {“0” was received}

and (2.15), we obtain


P(A1|B0) = \frac{P(B0|A1) · P(A1)}{P(B0|A0) · P(A0) + P(B0|A1) · P(A1)}
         = \frac{0.05 · 0.55}{0.95 · 0.45 + 0.05 · 0.55}
         = 0.06 .

Next, we ask for the probability that the bit “0” was transmitted on the condition

that the bit “0” was received. With the event A0 = {“0” was transmitted} and (2.15),

we find

P(A0|B0) = \frac{P(B0|A0) P(A0)}{P(B0|A0) P(A0) + P(B0|A1) P(A1)}
         = \frac{0.95 · 0.45}{0.95 · 0.45 + 0.05 · 0.55}
         = 0.94 .

Hence, P (A0 |B0 ) > P (A1 |B0 ) which implies that the MAP receiver decides that the

bit “0” was transmitted.
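The a posteriori probabilities of this example follow directly from Bayes' theorem (2.15). A minimal Python sketch (standard library only, using the numbers of Examples 2.4 and 2.5) reproduces the MAP decision:

```python
# A priori and transition probabilities from Examples 2.4 and 2.5
P_A0, P_A1 = 0.45, 0.55
P_B0_given_A0, P_B0_given_A1 = 0.95, 0.05

# Total probability of receiving "0"
P_B0 = P_B0_given_A0 * P_A0 + P_B0_given_A1 * P_A1

# Bayes' theorem: a posteriori probabilities given that "0" was received
P_A1_given_B0 = P_B0_given_A1 * P_A1 / P_B0
P_A0_given_B0 = P_B0_given_A0 * P_A0 / P_B0

print(round(P_A1_given_B0, 2), round(P_A0_given_B0, 2))   # 0.06 0.94
decision = 0 if P_A0_given_B0 > P_A1_given_B0 else 1
print("MAP decision:", decision)                          # MAP decision: 0
```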

2.6 Independence

Definition 2.6. Independence of two events

Two events A and B are called (statistically) independent if

P (AB) = P (A) · P (B) (2.16a)

or when

P (A|B) = P (A) . (2.16b)

Definition 2.7. Independence of three events

The events A1 , A2 , and A3 are called (mutually) independent if they are independent

in pairs, i.e.,

P(Ai Aj) = P(Ai) · P(Aj),   i ≠ j    (2.17)

and

P (A1 A2 A3 ) = P (A1 ) P (A2 ) P (A3 ) . (2.18)

It should be emphasized that three events might be independent in pairs but not

independent.

Example 2.6. If we toss a coin twice, then we generate the four outcomes hh, ht, th,

and tt.
A1 = {first coin shows head} = {hh, ht}

A2 = {second coin shows head} = {hh, th}

A3 = {only one coin shows head} = {ht, th}


Hence, P(A1) = P(A2) = P(A3) = 1/2, and due to

P(A1 ∩ A2) = P{hh} = 1/4 = P(A1) · P(A2)
P(A1 ∩ A3) = P{ht} = 1/4 = P(A1) · P(A3)
P(A2 ∩ A3) = P{th} = 1/4 = P(A2) · P(A3)
it follows that the events A1 , A2 , and A3 are independent in pairs. But they are not

independent, because

A1 ∩ A2 ∩ A3 = ∅

and

P(A1 ∩ A2 ∩ A3) = 0 ≠ 1/8 = P(A1) · P(A2) · P(A3) .

Definition 2.8. Independence of n events

The events A1 , A2 , . . . , An are said to be independent if the following relations are

satisfied for all combinations 1 ≤ i < j < k · · · ≤ n:

P (Ai Aj ) = P (Ai ) P (Aj )

P (Ai Aj Ak ) = P (Ai ) P (Aj ) P (Ak ) (2.19)


..
.

P (A1 A2 · · · An ) = P (A1 ) P (A2 ) · · · P (An ) .

Example 2.7. Three switches connected in parallel operate independently (see Fig. 2.6).

Each switch remains closed with probability p. Find the probability of receiving an

input signal at the output.


Fig. 2.6: Three independent parallel switches S1, S2, and S3 between the input and the output.

Solution. Let Ai be the event Ai = {Switch Si is closed}. Then P (Ai ) = p, i =

1, 2, 3 . Since the switches operate independently, we have

P (Ai Aj ) = P (Ai ) P (Aj ) , 1≤i<j≤3

P (A1 A2 A3 ) = P (A1 ) P (A2 ) P (A3 ) .

Let R represent the event: “Input signal is received at the output.” Note that for the

occurrence of the event R, either S1 or S2 or S3 must remain closed.

Hence,

R = A1 ∪ A2 ∪ A3

and thus

P (R) = 1 − P (R̄) = 1 − P (Ā1 Ā2 Ā3 )

= 1 − P (Ā1 ) P (Ā2 ) P (Ā3 )

= 1 − (1 − p)3

= 3p − 3p2 + p3 .
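The closed-form result can be checked against a brute-force simulation of the three independent switches. A minimal Python sketch (p = 0.3 and the sample size are arbitrary illustrative choices):

```python
import random

random.seed(1)
p = 0.3
N = 100_000

# Closed form: P(R) = 1 - (1 - p)^3 = 3p - 3p^2 + p^3
p_closed = 1 - (1 - p) ** 3

# Monte Carlo: the input signal reaches the output if at least one switch is closed
hits = 0
for _ in range(N):
    closed = [random.random() < p for _ in range(3)]
    hits += any(closed)
p_mc = hits / N

print(f"closed form: {p_closed:.4f}, simulation: {p_mc:.4f}")
```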
Chapter 3

Repeated Trials

3.1 Combined Experiments

Let us consider the following two (independent) experiments:

1) Rolling of a fair die

S1 = {f1 , f2 , . . . , f6 }, P1 {fi } = 1/6

2) Tossing of a fair coin

S2 = {h, t} , P2 {h} = P2 {t} = 1/2

If we perform both experiments, then the two experiments can be viewed as a single

experiment whose possible outcomes are

f1 h, f2 h, . . . , f6 h, f1 t, f2 t, . . . , f6 t .

Thus, the resulting space S consists of 12 elements.

Definition 3.1. Cartesian product

Let S1 and S2 be two spaces with elements ζ1 and ζ2 , respectively. The Cartesian

product

S = S1 × S2 (3.1)

of the spaces S1 and S2 defines the combined space whose elements are all pairs ζ1 ζ2 ,

where ζ1 is any element of S1 and ζ2 is any element of S2 .

Example 3.1. The Cartesian product of the spaces



S1 = {h, t} and S2 = {h, t}

has four elements

S = S1 × S2 = {h, t} × {h, t} = {hh, ht, th, tt} .

Note that the element ht is different from the element th.

Definition 3.2. Generalization

Let S1 , S2 , . . . , Sn be the spaces corresponding to n experiments. As the Cartesian

product, we define the space

S = S1 × S2 × · · · × Sn (3.2)

whose elements are all ordered n tuples ζ1 , ζ2 , . . . , ζn , where ζi ∈ Si . Events in the

space S are all events of the form

A1 × A2 × · · · × An

where Ai ⊂ Si . If the experiments are independent and P (Ai ) is the probability of

the event Ai in the experiment Si , then

P (A1 × A2 × · · · × An ) = P (A1 ) P (A2 ) · · · P (An ) . (3.3)

Example 3.2. Outage probability

The outage probability p of a system describes the probability that the system fails.

Fig. 3.1 shows a system that is composed of four sub-systems each of which has an

outage probability p. System 3 represents a backup-system for the systems 1 and 2.

What is the outage probability of the overall system?

Fig. 3.1: An overall system consisting of four sub-systems: systems 1 and 2 in series, with system 3 as a parallel backup (block A), followed by system 4 (block B).



The outage probability of the overall system is

P {outage} = P {outage of A or B}

= 1 − P {no outage of A and B}

= 1 − P {no outage of A}P {no outage of B}

= 1 − (1 − P {outage of A})(1 − P {outage of B})

= 1 − (1 − P {outage of A})(1 − p) . (3.4)

Outage probability of sub-system A:

P {outage of A} = P {outage of 1 or 2} · P {outage of 3}

= (1 − P {no outage of 1 and 2}) · P {outage of 3}

= (1 − P {no outage of 1}P {no outage of 2})P {outage of 3}

= [1 − (1 − P {outage of 1})(1 − P {outage of 2})]P {outage of 3}

= [1 − (1 − p)2 ] p

= 2p2 − p3 . (3.5)

Substituting (3.5) in (3.4) gives

P {outage} = 1 − [1 − (2p2 − p3 )](1 − p)

= p + 2p2 − 3p3 + p4 .
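The formula can be verified by enumerating all 2⁴ outage patterns of the four sub-systems and weighting each pattern by its probability. A minimal Python sketch (the block structure is the one described above and in Fig. 3.1; p = 0.1 is an arbitrary test value):

```python
from itertools import product

def outage_probability(p):
    """Exact outage probability of the overall system by enumeration."""
    total = 0.0
    for s1, s2, s3, s4 in product([True, False], repeat=4):  # True = sub-system fails
        prob = 1.0
        for failed in (s1, s2, s3, s4):
            prob *= p if failed else (1 - p)
        block_a_fails = (s1 or s2) and s3   # backup system 3 covers systems 1 and 2
        if block_a_fails or s4:             # overall system fails if A or system 4 fails
            total += prob
    return total

p = 0.1
print(outage_probability(p))               # enumeration
print(p + 2 * p**2 - 3 * p**3 + p**4)      # closed form from (3.4) and (3.5)
```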

3.2 Bernoulli Trials

Let us consider an experiment S and an event A with

P (A) = p , P (Ā) = q , p + q = 1.

We repeat the experiment n times and we denote the resulting product space by

S^n = S × S × · · · × S .

The probability pn (k) that the event A occurs exactly k times in any order is deter-

mined by the following fundamental theorem.


Theorem 3.1. Fundamental Theorem

p_n(k) = P\{A \text{ occurs } k \text{ times in any order}\} = \binom{n}{k} p^k q^{n-k}    (3.6)

where

\binom{n}{k} = \frac{n!}{k!\,(n-k)!} = \frac{n(n-1)(n-2) \cdots (n-k+1)}{1 \cdot 2 \cdots k} .    (3.7)

Proof. Without proof.

Note that the term \binom{n}{k} in (3.6) denotes the number of combinations in which the event A occurs k times in any order.

Example 3.3. A coin with P {h} = p is tossed n times. The probability pn (k) that

heads shows k times (in any order) is given by


 
p_n(k) = P\{k \text{ heads in any order}\} = \binom{n}{k} p^k q^{n-k}

with q = 1 − p.

Example 3.4. A sequence of n bits is transmitted over a binary symmetric channel

with error probability p. The probability pn (k) that the sequence of n bits contains

k errors (at any position) is given by


 
p_n(k) = P\{k \text{ errors at any position}\} = \binom{n}{k} p^k q^{n-k}

with q = 1 − p.

Example 3.5. A pair of dice is tossed 5 times. Find the probability of obtaining at

least one “double six”.

Solution: The space S of a single roll of two dice consists of the 36 elements fi fj ,

i, j = 1, 2, . . . , 6. The event A = {double six} consists of the single element f6 f6 .

Thus, p = P (A) = 1/36 and q = P (Ā) = 35/36. For the combined experiment with

n = 5 tosses, it follows that

P\{\text{at least one double six}\} = 1 − P\{\text{no double six}\}
                                    = 1 − \binom{n}{0} p^0 q^{n-0} = 1 − q^n
                                    = 1 − (35/36)^5 ≈ 0.13 .
Alternatively, we can write

P\{\text{at least one double six}\} = \sum_{k=1}^{n} \binom{n}{k} p^k q^{n-k}
                                    = 1 − \binom{n}{0} p^0 q^{n-0} = 1 − q^n
                                    ≈ 0.13 .
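A minimal Python check of this Bernoulli-trial calculation compares the closed form 1 − (35/36)⁵ with a simulation of the repeated experiment (standard library only; the sample size is an arbitrary choice):

```python
import random

random.seed(1)
N = 200_000

p_closed = 1 - (35 / 36) ** 5

hits = 0
for _ in range(N):
    rolls = [(random.randint(1, 6), random.randint(1, 6)) for _ in range(5)]
    hits += any(a == b == 6 for a, b in rolls)

print(f"closed form: {p_closed:.4f}, simulation: {hits / N:.4f}")  # both ≈ 0.13
```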
Chapter 4

The Concept of a Random Variable

4.1 Definition of Random Variables

We consider an experiment specified by the space S. The elements s ∈ S are the

outcomes of the experiment. A random variable is a function X which assigns a real

number X(s) to every outcome s of this experiment.1

Definition 4.1. Random variable

A random variable X is a function whose domain is the set of outcomes s ∈ S, and

whose range is the real line R (see Fig. 4.1), i.e.,

X : S → R,   s ↦ X(s) .    (4.1)

Fig. 4.1: Illustration of the definition of a random variable: the mapping X assigns to each outcome s ∈ S a real number X(s) ∈ R.

1
All random variables will be denoted by uppercase letters, and fixed values of random variables
will be denoted by lowercase letters.

The function X must satisfy the following two conditions:

I. The set {s|X(s) ≤ x0 } is an event for every x0 ∈ R.

II. The probabilities of the events {s|X(s) = ∞} and {s|X(s) = −∞}

equal zero, i.e.,

P {X = ∞} = P {X = −∞} = 0 .

Example 4.1. Consider the toss of one die. In our experiment, we assign to the

six outcomes si the numbers X(si ) = 10i. The mapping performed by the random

variable X is described by
X : S → R,   si ↦ X(si) = 10i .

Thus, the possible outcomes of the random variable X are

X(s1 ) = 10 , X(s2 ) = 20, . . . , X(s6 ) = 60 .

The probability that the outcome 10i occurs is P {X = 10i} = 1/6 .

Example 4.2. In the die experiment, we assign the number 1 to every even outcome and the number 0 to every odd outcome, i.e.,

X(s_i) = \begin{cases} 1 & \text{if } s_i \text{ is even} \\ 0 & \text{if } s_i \text{ is odd.} \end{cases}

Thus,

X(s1 ) = X(s3 ) = X(s5 ) = 0 , X(s2 ) = X(s4 ) = X(s6 ) = 1 .

The probability that the outcome equals 0 is P {X = 0} = 3/6 = 1/2, and the

probability that the outcome 1 occurs is P {X = 1} = 3/6 = 1/2 .

Definition 4.2. Complex random variables

A complex random variable Z is a sum

Z = X + jY (4.2)

where X and Y are real random variables.



We will make use of the following abbreviations:

P {X = x} := P {s|X(s) = x}

P {X ≤ x} := P {s|X(s) ≤ x}

P {x1 < X ≤ x2 } := P {s|x1 < X(s) ≤ x2 } .

4.2 The Cumulative Distribution Function

Definition 4.3. The (cumulative) distribution function (CDF) of a random variable

X is the function

FX (x) = P {X ≤ x} (4.3)

which is defined for every x from −∞ to ∞.

Example 4.3. In case of the die experiment of Example 4.1, the cumulative distri-

bution function FX (x) of X is a staircase function as shown in Fig. 4.2.

Fig. 4.2: Cumulative distribution function FX(x) of X defined in Example 4.1 (a staircase with steps of height 1/6 at x = 10, 20, . . . , 60).

Definition 4.4. Percentiles

The u-percentile of a random variable X is the smallest number xu such that

u = P {X ≤ xu } = FX (xu ) . (4.4)

Note that xu is the inverse of the function u = FX(xu), i.e., xu = FX^{-1}(u). The 0.5-percentile of X is called the median. (Application: median filter.)

4.2.1 Properties of Cumulative Distribution Functions

1. FX (x) is normalized:

FX (−∞) = 0 , FX (+∞) = 1 . (4.5)

2. FX (x) is a nondecreasing function of x:

If x1 < x2 then FX (x1 ) ≤ FX (x2 ) . (4.6)

3. FX(x) is continuous from the right:

   lim_{ε→0} FX(x + ε) = FX(x) .    (4.7)

4. The probability that X takes a particular value x is:

   P{X = x} = FX(x) − lim_{ε→0} FX(x − ε) .    (4.8)

5. The probability that X is between x1 and x2 is:

P {x1 < X ≤ x2 } = FX (x2 ) − FX (x1 ) . (4.9)

4.2.2 Continuous and Discrete Types of Random Variables

Definition 4.5. The random variable X is said to be a continuous-type random

variable if its distribution function FX (x) is continuous [see Fig. 4.3(a)].

Definition 4.6. The random variable X is said to be a discrete-type random variable if its distribution function FX(x) is constant except for a finite number of jump discontinuities [see Fig. 4.3(b)].


Fig. 4.3: Examples of distribution functions for (a) continuous-type and (b) discrete-type random variables.

Note that if FX(x) is continuous, then lim_{ε→0} FX(x − ε) = FX(x) for all x, and from (4.8) we get

P{X = x} = FX(x) − lim_{ε→0} FX(x − ε) = 0 .

Otherwise, if FX(x) is discrete and xi is a discontinuity point, then

P{X = xi} = FX(xi) − lim_{ε→0} FX(xi − ε) = pi

where pi is a positive constant.

4.3 The Probability Density Function

Definition 4.7. The derivative of the cumulative distribution function FX (x ) is

called the probability density function (PDF) fX (x) of the random variable X. Thus,

fX(x) = \frac{dF_X(x)}{dx} .    (4.10)

Example 4.4. With reference to the die experiment of Examples 4.1 and 4.3, the

probability density function fX (x) of X is a sum of weighted Dirac delta functions


of the form fX(x) = \sum_{i=1}^{6} p_i\, δ(x − x_i), where p_i = 1/6 and x_i = 10i, as shown in Fig. 4.4.
Fig. 4.4: Probability density function fX(x) of X defined in Example 4.1 (six Dirac impulses of weight 1/6 at x = 10, 20, . . . , 60).

4.3.1 Properties of Probability Density Functions

1. Since FX (x) is a nondecreasing function, it follows that fX (x) is non-negative,

because

   fX(x) = \frac{dF_X(x)}{dx} = \lim_{\Delta x \to 0} \frac{F_X(x + \Delta x) - F_X(x)}{\Delta x} \ge 0 \quad \text{for all } x .    (4.11)

2. From (4.10), we obtain

   FX(x) = \int_{-\infty}^{x} f_X(u)\, du .    (4.12)

3. Since FX(+∞) = 1, (4.12) yields

   \int_{-\infty}^{\infty} f_X(x)\, dx = 1    (4.13)

   i.e., the area under fX(x) equals one.

4. Furthermore, from (4.12), we also get

   P\{x_1 < X \le x_2\} = F_X(x_2) - F_X(x_1) = \int_{x_1}^{x_2} f_X(x)\, dx .    (4.14)

5. If X is a continuous-type (discrete-type) random variable, then fX (x) will be a

continuous (discrete) function.



4.3.2 Examples of Continuous Probability Density Functions

Uniform Distribution. We say X is uniformly distributed in the interval (a, b) if its density function is given by [see Fig. 4.5(a)]

fX(x) = \begin{cases} \frac{1}{b-a} & \text{if } a \le x \le b \\ 0 & \text{otherwise} \end{cases}    (4.15)

The distribution function of X is given by [see Fig. 4.5(b)]

FX(x) = \begin{cases} 1 & \text{if } x \ge b \\ \frac{x-a}{b-a} & \text{if } a \le x < b \\ 0 & \text{if } x < a \end{cases}    (4.16)

Fig. 4.5: (a) Uniform distribution and (b) corresponding distribution function.

To represent the uniform distribution in (4.15), we will use the notation X ∼ U [a, b].

Applications. Modelling of quantization noise. The phase of the received signal in

wireless communications is often assumed to be uniformly distributed.

Normal (Gaussian) Distribution. We say X is a normal or Gaussian random

variable with parameters µ and σ² if its density function is given by [see Fig. 4.6(a)]

fX(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x-\mu)^2/(2\sigma^2)} .    (4.17)

The corresponding distribution function is [see Fig. 4.6(b)]

FX(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{x} e^{-(y-\mu)^2/(2\sigma^2)}\, dy \;\triangleq\; G\!\left(\frac{x-\mu}{\sigma}\right)    (4.18)

where

G(x) \triangleq \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-y^2/2}\, dy    (4.19)

is known as the Gaussian error integral. To represent the Gaussian PDF in (4.17), we

often use the notation X ∼ N (µ, σ 2 ). The special case X ∼ N (0, 1) is referred to as

the standard normal random variable. Note that G(x) is the cumulative distribution

function of the standard normal random variable.

Fig. 4.6: (a) Normal density function and (b) the corresponding distribution function for µ = 2 and σ² = 1, 2, 3.

Applications. Modelling of random electrical noise in communication systems. The

normal distribution is one of the most widely used PDFs.
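The Gaussian error integral G(x) in (4.19) has no closed form, but it can be evaluated numerically. A minimal Python sketch (standard library only) uses the identity G(x) = ½[1 + erf(x/√2)] to evaluate FX(x) = G((x − µ)/σ):

```python
import math

def gaussian_cdf(x, mu=0.0, sigma=1.0):
    """F_X(x) = G((x - mu)/sigma) with G(x) = 0.5*(1 + erf(x/sqrt(2)))."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# Example: X ~ N(2, 1); about 68.3% of the probability lies within mu +/- sigma
print(gaussian_cdf(3, mu=2, sigma=1) - gaussian_cdf(1, mu=2, sigma=1))  # ≈ 0.6827
```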

Exponential Distribution. We say X is an exponential random variable with

parameter λ if its density function is given by [see Fig. 4.7(a)]

fX(x) = \begin{cases} \lambda e^{-\lambda x} & \text{if } x \ge 0 \\ 0 & \text{otherwise} \end{cases}    (4.20)

The corresponding distribution function is [see Fig. 4.7(b)]

FX(x) = \begin{cases} 1 - e^{-\lambda x} & \text{if } x \ge 0 \\ 0 & \text{otherwise} \end{cases}    (4.21)

Fig. 4.7: (a) Exponential distribution and (b) the corresponding distribution function for λ = 1, 2, 3.

Applications. Modelling of the waiting time a customer spends at a service location such as a restaurant, a bank, or an airline counter. Modelling of the life duration of an appliance (birth and death processes).

Rayleigh Distribution. The random variable X is said to be Rayleigh distributed with parameter σ² if [see Fig. 4.8(a)]

fX(x) = \begin{cases} \frac{x}{\sigma^2}\, e^{-x^2/(2\sigma^2)} & \text{if } x \ge 0 \\ 0 & \text{otherwise} \end{cases}    (4.22)

The distribution function of X is given by [see Fig. 4.8(b)]

FX(x) = \begin{cases} 1 - e^{-x^2/(2\sigma^2)} & \text{if } x \ge 0 \\ 0 & \text{otherwise} \end{cases}    (4.23)
Fig. 4.8: (a) Rayleigh distribution and (b) the corresponding distribution function for σ² = 1, 2, 3.

Applications. In mobile radio communications, the distribution of the signal ampli-

tude of a randomly received signal is often modelled as a Rayleigh distribution if no

line-of-sight component is present.

Rice Distribution. The random variable X is said to be Ricean distributed with parameters σ² and ρ if [see Fig. 4.9(a)]

fX(x) = \begin{cases} \frac{x}{\sigma^2}\, e^{-\frac{x^2+\rho^2}{2\sigma^2}}\, I_0\!\left(\frac{x\rho}{\sigma^2}\right) & \text{if } x \ge 0 \\ 0 & \text{otherwise} \end{cases}    (4.24)

where I0(·) is the 0th-order modified Bessel function of the first kind. Note that if ρ → 0, then the Rice distribution tends to the Rayleigh distribution. The distribution function of a Ricean distributed random variable X is given by [see Fig. 4.9(b)]

FX(x) = \begin{cases} 1 - Q_1\!\left(\frac{\rho}{\sigma}, \frac{x}{\sigma}\right) & \text{if } x \ge 0 \\ 0 & \text{otherwise} \end{cases}    (4.25)

where

Q_1(\alpha, \beta) = \int_{\beta}^{\infty} x\, e^{-\frac{x^2 + \alpha^2}{2}}\, I_0(\alpha x)\, dx    (4.26)

is the first-order Marcum Q-function.


Fig. 4.9: (a) Rice distribution and (b) the corresponding distribution function for σ² = 1 and ρ = 1, 2, 3.

Applications. The Rice distribution is frequently used in mobile radio communications

to model the distribution of the signal amplitude of a randomly received signal in cases

where a line-of-sight component is present.

Nakagami-m Distribution. The Nakagami distribution

fX(x) = \begin{cases} \frac{2}{\Gamma(m)} \left(\frac{m}{\Omega}\right)^{m} x^{2m-1}\, e^{-m x^2/\Omega} & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases}    (4.27)

is a generalization of the Rayleigh distribution.² Notice that the Nakagami distribution (4.27) reduces to the Rayleigh distribution if m = 1.

² In (4.27), Γ(m) represents the gamma function defined as Γ(m) = \int_{0}^{\infty} x^{m-1} e^{-x}\, dx. Some important properties and formulas for the gamma function are: Γ(m + 1) = m Γ(m) (which equals m! for integer m), Γ(1) = 1, and Γ(1/2) = √π.

Applications. Modelling of fading channels in mobile communications.

Lognormal Distribution. Suppose that X is a normally distributed random vari-

able with parameters µ and σ 2 . Let us define a new random variable Y that is related

to X through the transformation Y = e^X. Then, the PDF of Y is

fY(y) = \begin{cases} \frac{1}{\sqrt{2\pi}\,\sigma y}\, e^{-(\ln y - \mu)^2/(2\sigma^2)} & \text{if } y \ge 0 \\ 0 & \text{otherwise} \end{cases}    (4.28)

Applications. In mobile communications, the lognormal distribution is suitable for

modelling the effect of shadowing of the transmitted signal due to large obstructions,

such as tall buildings.

4.3.3 Examples of Discrete Probability Density Functions

Bernoulli Distribution. The random variable X is said to be Bernoulli distributed

if X takes the values 1 and 0 with

fX (1) = P {X = 1} = p (4.29a)

fX (0) = P {X = 0} = q = 1 − p . (4.29b)

Applications. Binary symmetric channel (BSC).

Binomial Distribution. The random variable X is said to be binomially distributed with parameters n and p if its PDF is given by [see Fig. 4.10(a)]

fX(k) = P\{X = k\} = \binom{n}{k} p^k q^{n-k},  \quad p + q = 1, \quad k = 0, 1, \ldots, n .    (4.30)

The corresponding distribution function is a staircase function as shown in Fig. 4.10(b).


Fig. 4.10: (a) Binomial density function and (b) the corresponding distribution function for n = 25 and p = 1/2.

Applications. Number of errors in a block of transmitted bits.

Poisson Distribution. X is said to be a Poisson random variable with parameter

λ, if X takes the values 0, 1, 2, . . . , ∞ with [see Fig. 4.11(a)]

fX(k) = P\{X = k\} = e^{-\lambda}\, \frac{\lambda^k}{k!},  \quad k = 0, 1, \ldots, \infty .    (4.31)

The corresponding distribution function shown in Fig. 4.11(b) is a staircase function

similar to Fig. 4.10(b) but it contains an infinite number of steps.


Fig. 4.11: (a) Poisson density function and (b) the corresponding distribution function for λ = 3.

Applications. The Poisson distribution is often used to describe the number of occur-

rences of a rare event in a large number of trials. Typical examples are:



• The number of telephone calls at a telephone exchange over a fixed time interval

(0, t0 ).

• The number of printing errors in a book.

• The number of defective devices detected during quality control.

The following theorem shows that the binomial distribution is closely connected to

the Poisson distribution.

Theorem 4.1. Poisson theorem

If n → ∞ and p → 0 such that np → λ, then


 
\binom{n}{k} p^k q^{n-k} \;\xrightarrow[\,p \to 0\,]{\,n \to \infty\,}\; e^{-\lambda}\, \frac{\lambda^k}{k!},  \quad k = 0, 1, 2, \ldots    (4.32)

Proof. Without proof.
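The Poisson theorem can be illustrated numerically: for large n and small p with np = λ fixed, the binomial probabilities are already close to the Poisson ones. A minimal Python sketch (standard library only; n = 1000 and p = 0.003 are arbitrary illustrative values):

```python
from math import comb, exp, factorial

n, p = 1000, 0.003
lam = n * p   # λ = np

for k in range(6):
    binom = comb(n, k) * p**k * (1 - p) ** (n - k)
    poisson = exp(-lam) * lam**k / factorial(k)
    print(f"k={k}: binomial={binom:.5f}  Poisson={poisson:.5f}")
```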


Chapter 5

Functions of Random Variables

5.1 Transformations of Random Variables

In practice, one often finds situations in which a random variable Y is a function of

another random variable X, i.e.,

Y = g(X) (5.1)

where g(x) is a real-valued function. The random variable Y is described by a PDF

fY (y), which depends on the function g(x) as well as the PDF fX (x) of the random

variable X.
Fig. 5.1: A function g(x) maps the random variable X with PDF fX(x) into another random variable Y with PDF fY(y).

The problem is to find the PDF fY (y) of Y for a given function g(x) and a known

PDF fX (x) of X. A general solution to this problem is provided by the following

theorem.

Theorem 5.1. Fundamental theorem

Suppose that X is a random variable and g(x) is a real-valued function. Let y be

a particular value of Y = g(X) and let x1 , x2 , . . . , xn be the roots of the equation

y = g(x), i.e.,

y = g(x1 ) = g(x2 ) = · · · = g(xn ) (5.2)

then the PDF fY (y) of Y = g(X) is given by


fY(y) = \frac{f_X(x_1)}{|g'(x_1)|} + \frac{f_X(x_2)}{|g'(x_2)|} + \cdots + \frac{f_X(x_n)}{|g'(x_n)|}    (5.3)
where g ′ (x) denotes the derivative of g(x).

Proof. To avoid generalities, we assume that the equation y = g(x) has three roots as shown in Fig. 5.2.

Fig. 5.2: Transformation of a continuous random variable.

We know that

fY (y)dy = P {y < Y ≤ y + dy} .



We have to find the set of values of x such that y < g(x) ≤ y + dy, then we can obtain

fY (y) dy from the probability that X belongs to this set, i.e.,

P {y < Y ≤ y + dy} = P {x|y < g(x) ≤ y + dy} .

For the example shown in Fig. 5.2, this set contains the following three intervals:

x1 < x ≤ x1 + dx1 x2 + dx2 < x ≤ x2 x3 < x ≤ x3 + dx3

where dx1 > 0, dx3 > 0 but dx2 < 0. From the foregoing it follows that

P {y < Y < y + dy} = P {x1 < X < x1 + dx1 }

+ P {x2 + dx2 < X < x2 } + P {x3 < X < x3 + dx3 } .

We can see from Fig. 5.2 that the terms on the right hand side are given by

P {x1 < X < x1 + dx1 } = fX (x1 ) dx1

P {x2 + dx2 < X < x2 } = fX (x2 ) |dx2 |

P {x3 < X < x3 + dx3 } = fX (x3 ) dx3 .

Since the slope g ′ (x) of g(x) is dy/dx, we have

dx1 = dy/g′ (x1 ) dx2 = dy/g ′ (x2 ) dx3 = dy/g ′ (x3 ) .

Hence, when we have three roots for the equation y = g(x), we conclude that

fY(y)\, dy = \frac{f_X(x_1)}{g'(x_1)}\, dy + \frac{f_X(x_2)}{|g'(x_2)|}\, dy + \frac{f_X(x_3)}{g'(x_3)}\, dy .

Cancelling the dy and generalizing the result gives (5.3).

Example 5.1. Suppose that Y is a random variable given by

Y = aX 2 + b

where a > 0 and b are constants, and X has a Gaussian distribution with parameters

µ = 0 and σ 2 = 1. Find the density function of Y .



Solution. If y ≤ b, then the equation y = g(x) = ax² + b has no real solution; hence fY(y) = 0. If y > b, then the equation y = g(x) = ax² + b has two roots:

x_1 = +\sqrt{\frac{y-b}{a}}, \qquad x_2 = -\sqrt{\frac{y-b}{a}} .

Since the slope g ′ (x) = dy/dx = 2ax at x1 and x2 is given by

g'(x_1) = 2\sqrt{a(y-b)}, \qquad g'(x_2) = -2\sqrt{a(y-b)}

it follows from (5.3) that the density function of Y can be expressed by

fY(y) = \frac{f_X(x_1)}{|g'(x_1)|} + \frac{f_X(x_2)}{|g'(x_2)|}
      = \frac{1}{2\sqrt{a(y-b)}} \left[ f_X\!\left(\sqrt{\frac{y-b}{a}}\right) + f_X\!\left(-\sqrt{\frac{y-b}{a}}\right) \right] .

With fX(x) given as

fX(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}

we obtain

fY(y) = \begin{cases} \frac{1}{\sqrt{2\pi a (y-b)}}\, e^{-\frac{y-b}{2a}} & \text{if } y > b \\ 0 & \text{otherwise.} \end{cases}

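The derived density of Y = aX² + b can be checked empirically: draw samples of X ~ N(0, 1), transform them, and compare a histogram estimate with fY(y). A minimal Python sketch (standard library only; a = 2, b = 1 and the evaluation points are arbitrary test values):

```python
import math
import random

random.seed(1)
a, b = 2.0, 1.0
N = 500_000

def f_Y(y):
    """Derived density of Y = a*X^2 + b with X ~ N(0, 1), valid for y > b."""
    return math.exp(-(y - b) / (2 * a)) / math.sqrt(2 * math.pi * a * (y - b))

samples = [a * random.gauss(0.0, 1.0) ** 2 + b for _ in range(N)]

# Compare a histogram estimate of the density with f_Y at a few points
width = 0.2
for y_center in (1.5, 2.5, 4.0, 6.0):
    lo, hi = y_center - width / 2, y_center + width / 2
    hist = sum(1 for y in samples if lo < y <= hi) / (N * width)
    print(f"y = {y_center}: histogram ≈ {hist:.3f}, f_Y(y) = {f_Y(y_center):.3f}")
```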
Example 5.2. Lognormal distribution

A Gaussian distributed random variable X ∼ N (µ, σ 2 ) is transformed into another

random variable Y according to

Y = eX .

Find the density function of Y .

Solution. If y ≤ 0, then the function y = g(x) = e x has no solution, i.e., fY (y) = 0.

If y > 0, then the equation y = e x has a single solution x1 = ln y. Since the slope

g′ (x) = dy/dx = e x at x1 = ln y is given by g′ (x1 ) = y, it follows from (5.3)

fY(y) = \frac{f_X(x_1)}{|g'(x_1)|} = \frac{f_X(\ln y)}{y},  \quad y > 0 .
With fX(x) given as

fX(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}

we obtain the lognormal distribution

fY(y) = \frac{1}{\sqrt{2\pi}\,\sigma y}\, e^{-\frac{(\ln y - \mu)^2}{2\sigma^2}},  \quad y > 0 .

5.2 Expected Value

The distribution of a random variable is completely described by the PDF. In some

situations, however, we are interested in a few parameters that summarize the in-

formation provided by the PDF. Such parameters are the expected value and the

variance.

5.2.1 The Expected Value of Continuous-Type Random Variables

Definition 5.1. The expected value or mean of a continuous-type random variable

X is defined by

E\{X\} = \int_{-\infty}^{\infty} x\, f_X(x)\, dx .    (5.4)

This number will also be denoted by µX or, simply, µ.

Example 5.3. If X is normally distributed, i.e., X ∼ N (µ, σ 2 ), then

E\{X\} = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{\infty} x\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\, dx .
Using the transformation of variables t := (x − µ)/σ gives

E\{X\} = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} (t\sigma + \mu)\, e^{-t^2/2}\, dt
       = \frac{\sigma}{\sqrt{2\pi}} \underbrace{\int_{-\infty}^{\infty} t\, e^{-t^2/2}\, dt}_{=\,0} \;+\; \frac{\mu}{\sqrt{2\pi}} \underbrace{\int_{-\infty}^{\infty} e^{-t^2/2}\, dt}_{=\,\sqrt{2\pi}}
       = \mu .

5.2.2 The Expected Value of Discrete-Type Random Variables

Suppose that the random variable X is discrete and takes the values xi with proba-

bility P {X = xi } = pi . In this case, the PDF of X is of the form

fX(x) = \sum_i p_i\, \delta(x - x_i) .    (5.5)

Inserting (5.5) into (5.4) and using the sifting property of the delta function
\int_{-\infty}^{\infty} x\, \delta(x - x_i)\, dx = x_i

results in

E\{X\} = \sum_i x_i\, p_i ,  \qquad p_i = P\{X = x_i\} .    (5.6)

Example 5.4. If X is a Bernoulli distributed random variable, then X takes the value x0 = 0 with probability p0 = q = 1 − p and the value x1 = 1 with probability p1 = p. Hence,

E\{X\} = \sum_{i=0}^{1} x_i\, p_i ,  \qquad p_i = P\{X = x_i\}
       = x_0 p_0 + x_1 p_1 = 0 · (1 − p) + 1 · p = p .

5.2.3 The Expected Value of Y = g(X)

Suppose that we are interested in finding the expected value of Y = g(X). The direct

approach involves first finding the PDF of Y , and then the evaluation of E{Y } using

(5.4). This, however, is not necessary. As the next theorem shows, E{Y } can be

expressed directly in terms of the function g(X) and the PDF fX (x) of X.

Theorem 5.2. Let X be a random variable and g(x) be a function. Then, the expected

value of Y = g(X) is given by

E\{Y\} = E\{g(X)\} = \int_{-\infty}^{\infty} g(x)\, f_X(x)\, dx .    (5.7)

Proof. For simplicity, we suppose that g(x) is strictly increasing as shown in Fig. 5.3.

Let us divide the y-axis into intervals of length ∆y. We index the intervals with

the index k and we let yk be the value in the center of the kth interval. Then, the

expected value of Y can be approximated by the following sum


E\{Y\} ≈ \sum_k y_k\, f_Y(y_k)\, \Delta y .
Since g(x) is strictly increasing, the kth interval in the y-axis has a unique correspond-

ing equivalent event of width ∆x in the x-axis (see Fig. 5.3). Let xk be the value in

the kth interval such that g(xk ) = yk . Then, by using fY (yk )∆y = fX (xk )∆x, we

obtain
E\{Y\} ≈ \sum_k y_k\, f_Y(y_k)\, \Delta y = \sum_k g(x_k)\, f_X(x_k)\, \Delta x .

Hence, in the limit ∆y → 0 (∆x → 0), we obtain (5.7). It should be noted that this

equation is valid even if g(x) is not strictly increasing. In this case, for the derivation,

we have to take into account the fact that each y-interval has an equivalent event as

in Fig. 5.2.
Fig. 5.3: A strictly increasing function g(x).

Example 5.5. Let X be uniform in the interval (0, 3), i.e.,

fX(x) = \begin{cases} \frac{1}{3} & \text{if } 0 \le x \le 3 \\ 0 & \text{otherwise.} \end{cases}

Let Y be a random variable defined by Y = X²/2. Find the expected value E{Y} of Y.

Solution. With g(x) = x²/2 and (5.7), we obtain

E\{Y\} = \int_{-\infty}^{\infty} g(x)\, f_X(x)\, dx = \int_{0}^{3} \frac{x^2}{2} \cdot \frac{1}{3}\, dx = \frac{1}{6} \cdot \frac{x^3}{3} \Big|_{0}^{3} = \frac{3}{2} .
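The result E{Y} = 3/2 of Example 5.5 is easy to confirm by simulation, which also illustrates that (5.7) avoids computing fY(y) explicitly. A minimal Python sketch (standard library only):

```python
import random

random.seed(1)
N = 1_000_000

# X uniform in (0, 3); Y = g(X) = X^2 / 2
samples = (random.uniform(0.0, 3.0) for _ in range(N))
e_y = sum(x * x / 2 for x in samples) / N

print(e_y)   # ≈ 1.5 = 3/2
```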

5.2.4 Properties of the Expectation Operator

Let X and Y be two random variables, and a and b are two real constants. Then,

1. Expected value of a constant:

E{a} = a . (5.8a)

2. Linearity:

E{aX + bY } = a E{X} + b E{Y } (5.8b)

E{aX + b} = a E{X} + b . (5.8c)

3. If X and Y are statistically independent, then

E{X · Y } = E{X} · E{Y } . (5.8d)

4. If Z = X + j Y is a complex random variable, then

E{Z} = E{X} + j E{Y } . (5.8e)

5.3 Variance

Apart from the mean, the variance is an additional parameter to describe the PDF of

a random variable. The variance of a random variable X is a measure of the spread

of X around its mean.

Definition 5.2. The variance of a continuous-type random variable X with mean µ is defined by

Var\{X\} = E\{(X - \mu)^2\} = \int_{-\infty}^{\infty} (x - \mu)^2 f_X(x)\, dx \ge 0 .    (5.9)

This number will also be denoted by σ_X² or, simply, σ². The positive constant σ = \sqrt{Var\{X\}} is known as the standard deviation of X.

The expression in (5.9) can be simplified as follows

Var\{X\} = σ_X² = E\{(X − µ)²\}
         = E\{X² − 2Xµ + µ²\}
         = E\{X²\} − 2µ E\{X\} + µ²
         = E\{X²\} − µ²
         = E\{X²\} − (E\{X\})² .    (5.10)

Example 5.6. Quantization noise

The quantization noise is in general described by a random variable Q, which is

uniformly distributed in the interval (−q/2, q/2). Since the PDF of the quantization
1
noise Q is fQ (x) = q rect ( xq ), the variance of Q is given by1

2
Var{Q} = σQ = E{Q2 } − (E{Q})2
| {z }
=0
Z∞
= x2 fQ (x) dx
−∞
q
Zq/2 2

1 x3 1 q2
= x2 · dx = · = .
q 3 q 12
−q/2 − q2
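The value σ_Q² = q²/12 is a standard rule of thumb for uniform quantization noise, and it is easy to confirm numerically. A minimal Python sketch (standard library only; q = 0.1 is an arbitrary step size):

```python
import random

random.seed(1)
q = 0.1          # quantization step size
N = 1_000_000

samples = [random.uniform(-q / 2, q / 2) for _ in range(N)]
mean = sum(samples) / N
var = sum((x - mean) ** 2 for x in samples) / N

print(f"simulated variance: {var:.3e}, q^2/12 = {q * q / 12:.3e}")
```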

Example 5.7. Find the variance of a normal random variable.

Solution. A normal random variable X is described by the density

fX(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} .

From the fact that the area under fX(x) equals 1, i.e.,

\int_{-\infty}^{\infty} f_X(x)\, dx = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\, dx = 1

it follows that

\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\, dx = \sigma .

Differentiating both sides with respect to σ gives

\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, \frac{(x-\mu)^2}{\sigma^3}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\, dx = 1 .

By rearranging the above equation, we obtain

Var\{X\} = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma}\, (x-\mu)^2\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\, dx = \sigma^2 .

Definition 5.3. If X is a discrete-type random variable with mean µ, then the

variance of X is defined by
Var\{X\} = σ_X² = \sum_i p_i\, (x_i - \mu)^2 ,  \qquad p_i = P\{X = x_i\} .    (5.12)

Example 5.8. Find the variance of a Bernoulli distributed random variable.

Solution. If X is a Bernoulli distributed random variable, then X takes the values 1

and 0 with probabilities p and q = 1 − p, respectively. Hence, with

E{X} = 1 · p + 0 · q = p

E{X²} = 1² · p + 0² · q = p

we obtain

Var{X} = E{X 2 } − (E{X})2

= p − p2 = p(1 − p) = pq .

Properties:

1. The variance of a constant a is zero, i.e.,

Var{a} = 0 . (5.13)

2. The variance is a non-linear operation:

Var{aX + b} = a² Var{X} .    (5.14)

3. In case of statistically independent random variables X1 , X2 , . . . , Xn , the fol-

lowing identity holds:


Var\left\{\sum_{i=1}^{n} X_i\right\} = \sum_{i=1}^{n} Var\{X_i\} .    (5.15)

5.4 Moments

Besides the expected value and the variance, the following quantities are of interest

in the study of random variables.

Moments:  m_n = E\{X^n\} = \int_{-\infty}^{\infty} x^n f_X(x)\, dx    (5.16)

Central moments:  \eta_n = E\{(X - \mu)^n\} = \int_{-\infty}^{\infty} (x - \mu)^n f_X(x)\, dx    (5.17)

Absolute moments:  E\{|X|^n\},  \quad E\{|X - \mu|^n\}    (5.18)

Notice that the mean and variance can be seen to be defined in terms of the first two

moments, E{X} and E{X 2 }.

Under certain conditions, the PDF fX (x) of a random variable is uniquely determined

if all its moments mn are known. The underlying theory is known in mathematics as

the moment theorem (see Section 5.6).



5.5 The Chebyshev and Markov Inequalities

In general, the mean and variance do not provide enough information to determine

the PDF. However, the mean and variance of a random variable X do allow us to

obtain bounds for probabilities of the form P {|X| ≥ ε}.

Chebyshev Inequality. Suppose that X is a random variable with mean E{X} = µ

and variance Var{X} = σ 2 . The Chebyshev inequality states that

P\{|X - \mu| \ge \varepsilon\} \le \frac{\sigma^2}{\varepsilon^2}    (5.19)

where ε > 0.

Proof. The probability that |X − µ| ≥ ε can be expressed as

P\{|X - \mu| \ge \varepsilon\} = \int_{|x-\mu| \ge \varepsilon} f_X(x)\, dx .

For the variance of X, we obtain the following bound

\sigma^2 = Var\{X\} = \int_{-\infty}^{\infty} (x - \mu)^2 f_X(x)\, dx \ge \int_{|x-\mu| \ge \varepsilon} (x - \mu)^2 f_X(x)\, dx \ge \varepsilon^2 \int_{|x-\mu| \ge \varepsilon} f_X(x)\, dx .

Hence, (5.19) results because the last integral equals P {|X − µ| ≥ ε}.

Markov Inequality. Suppose that X is a nonnegative random variable, i.e., fX (x) =

0 for x < 0. Then, the Markov inequality states that

P\{X \ge \varepsilon\} \le \frac{\mu}{\varepsilon}    (5.20)

where ε > 0.
Proof. The mean of X is bounded by

\mu = E\{X\} = \int_{0}^{\infty} x\, f_X(x)\, dx \ge \int_{\varepsilon}^{\infty} x\, f_X(x)\, dx \ge \varepsilon \int_{\varepsilon}^{\infty} f_X(x)\, dx .

Hence, (5.20) results because the last integral equals P {X ≥ ε}.
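Both bounds are typically loose, which is easy to see numerically. The following minimal Python sketch (standard library only; ε = 2 is an arbitrary test value) compares the exact tail probability of an N(0, 1) random variable with the Chebyshev bound, and the tail of an exponential random variable with the Markov bound:

```python
import random

random.seed(1)
N = 1_000_000
eps = 2.0

# Chebyshev: X ~ N(0, 1), so P{|X - mu| >= eps} <= sigma^2/eps^2 = 1/eps^2
x = [random.gauss(0.0, 1.0) for _ in range(N)]
p_tail = sum(1 for v in x if abs(v) >= eps) / N
print(f"Chebyshev: P ≈ {p_tail:.4f} <= bound {1 / eps**2:.4f}")

# Markov: X ~ Exp(1) is nonnegative with mean mu = 1, so P{X >= eps} <= mu/eps
y = [random.expovariate(1.0) for _ in range(N)]
p_tail = sum(1 for v in y if v >= eps) / N
print(f"Markov:    P ≈ {p_tail:.4f} <= bound {1 / eps:.4f}")
```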

5.6 The Characteristic Function

The concept of the characteristic function will prove very useful when we consider

sums of random variables in Chapter 6.

Definition 5.4. The characteristic function of a continuous-type random variable X

is defined by

\Phi_X(\omega) = E\{e^{j\omega X}\}    (5.21a)
              = \int_{-\infty}^{\infty} f_X(x)\, e^{j\omega x}\, dx .    (5.21b)

The two expressions given above motivate the following two interpretations of the

characteristic function:

• In (5.21a), ΦX (ω) can be viewed as the expected value of the function e jωX .

• In (5.21b), ΦX (ω) is simply (the complex conjugate of) the Fourier transform

of the PDF fX (x), i.e., ΦX (ω) = F ∗ {fX (x)}.

Properties:

1. ΦX (ω) is maximum at the origin because fX (x) ≥ 0 :

|ΦX (ω)| ≤ ΦX (0) = 1 . (5.22)

2. If Y = aX + b then ΦY (ω) = e jbω ΦX (aω) . (5.23)



Proof. ΦY (ω) = E{e jωY } = E{e jω(aX+b) } = e jωb E{e jaωX }

3. The properties of characteristic functions are essentially the same as the proper-

ties of Fourier transforms. This follows from the fact that ΦX (ω) is the Fourier

transform of fX (x).

4. In particular, from the Fourier transform inversion formula, it follows that fX (x)

can be expressed in terms of ΦX (ω)


fX(x) = (1/(2π)) ∫_{−∞}^{∞} ΦX(ω) e^{−jωx} dω .   (5.24)

Example 5.9. Show that the characteristic function of an N (µ, σ 2 ) random variable

X equals

ΦX(ω) = e^{jµω − σ²ω²/2} .   (5.25)

Proof. We consider the N (0, 1) random variable Y = (X − µ)/σ. The characteristic

function of Y equals
ΦY(ω) = ∫_{−∞}^{∞} fY(y) e^{jωy} dy

      = ∫_{−∞}^{∞} (1/√(2π)) e^{−y²/2} e^{jωy} dy .

With the identity

−y²/2 + jωy = −(y − jω)²/2 − ω²/2

we obtain

ΦY(ω) = e^{−ω²/2} ∫_{−∞}^{∞} (1/√(2π)) e^{−(y−jω)²/2} dy = e^{−ω²/2}

where the remaining integral equals 1.

Since X = σY + µ, we conclude that (5.25) follows from (5.23).
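As an illustrative cross-check (not part of the original derivation), the Python sketch below estimates ΦX(ω) = E{e^{jωX}} empirically from Gaussian samples and compares it with the closed form (5.25); the parameter values µ = 1, σ = 2 and the sample size are arbitrary, and NumPy is assumed.

import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 1.0, 2.0
x = rng.normal(mu, sigma, 500_000)

for w in (0.0, 0.5, 1.0):
    phi_emp = np.mean(np.exp(1j * w * x))                   # E{e^{j w X}} estimated from samples
    phi_theo = np.exp(1j * mu * w - 0.5 * sigma**2 * w**2)  # closed form (5.25)
    print(f"omega = {w:.1f}:  empirical = {phi_emp:.4f},  theoretical = {phi_theo:.4f}")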



Definition 5.5. Let X be a discrete-type random variable. Then the characteristic

function of X is given by

ΦX(ω) = ∑_i pi e^{jωxi} ,   pi = P{X = xi} .   (5.26)

In this case, ΦX (ω) is the (complex conjugate of the) Fourier transform of the sequence

pi .

A further useful property of the characteristic function is its relation to the moments

of the random variable. As the following moment theorem states, the moments of X

can be obtained from the derivatives of ΦX (ω) at the origin ω = 0.

Theorem 5.3. Moment theorem

The moments of a random variable X are given by

E{X^n} = (1/j^n) · d^n ΦX(ω)/dω^n |_{ω=0} .   (5.27)

Proof. To prove this, we expand e^{jωx} in a power series

e^{jωx} = 1 + jωx + (jωx)²/2! + · · · .

Substituting the power series in the definition of ΦX (ω) [see (5.21b)] gives
ΦX(ω) = ∫_{−∞}^{∞} fX(x) [ 1 + jωx + (jωx)²/2! + · · · ] dx .

On the assumption that the above integrals converge, we obtain

ΦX(ω) = 1 + jω E{X} + ((jω)²/2!) E{X²} + · · · + ((jω)^n/n!) E{X^n} + · · ·   (5.28)

If we differentiate the above expression n times and evaluate the result at ω = 0, we

finally obtain

d^n ΦX(ω)/dω^n |_{ω=0} = j^n E{X^n}

which yields (5.27).



In particular, if n = 1, then we obtain the expected value

E{X} = −j · dΦX(ω)/dω |_{ω=0} .   (5.29)

It should be observed that if the power series in (5.28) converges, then the charac-

teristic function ΦX (ω) and hence the PDF fX (x) are completely determined by the

moments of X. In other words,



ΦX(ω) = ∑_{n=0}^{∞} ((jω)^n/n!) E{X^n} .   (5.30)
Chapter 6

Two Random Variables

6.1 Joint Distribution and Joint Density

Definition 6.1. Joint cumulative distribution function

The joint (cumulative) distribution function FXY (x, y) or, simply, F (x, y) of two

random variables X and Y is defined as the probability of the product-form event

{X ≤ x} ∩ {Y ≤ y} = {X ≤ x, Y ≤ y} = {(X, Y ) ∈ D}, i.e.,

F (x, y) = P {X ≤ x, Y ≤ y} (6.1)

where x and y are two arbitrary real numbers, and D is the semi-infinite rectangle

shown in Fig. 6.1.

Fig. 6.1: The joint cumulative distribution function is defined as the probability
that (X, Y ) is in the semi-infinite rectangle determined by the points x
and y.

Properties:

1. It is impossible for either X or Y to assume a value less than −∞. Therefore,

F (−∞, y) = 0 and F (x, −∞) = 0 . (6.2)

2. It is certain that X and Y will assume values less than infinity. Therefore,

F (∞, ∞) = 1 . (6.3)

3. F (x, y) is a nondecreasing function of x and y:

If x1 ≤ x2 and y1 ≤ y2 then F (x1 , y1 ) ≤ F (x2 , y2 ) . (6.4)

4. F (x, y) is continuous from the “north” and from the “east”:

lim F (x, y + ε) = F (x, y) lim F (x + ε, y) = F (x, y) . (6.5)


ε→0 ε→0

5. The probability that (X, Y ) is in the rectangle D of Fig. 6.2 is

P {x1 < X ≤ x2 , y1 < Y ≤ y2 } = F (x2 , y2 ) − F (x1 , y2 )

− F (x2 , y1 ) + F (x1 , y1 ) . (6.6)


Fig. 6.2: The joint cumulative distribution function can be used to determine
the probability that (X, Y ) is in the rectangle D.

Definition 6.2. Joint probability density function

The joint probability density function fXY (x, y) or, simply, f (x, y) of X and Y is by

definition the function

∂ 2 F (x, y)
f (x, y) = . (6.7)
∂x∂y

Properties:

1. From (6.7), we obtain


Zy Zx
F (x, y) = f (α, β) dα dβ . (6.8)
−∞ −∞

2. Since F (∞, ∞) = 1, (6.8) yields


Z∞ Z∞
f (α, β) dα dβ = 1 . (6.9)
−∞ −∞

3. The probability that the point (X, Y ) is in a region D equals the integral of

f (x, y) in D, i.e.,
Z Z
P {(X, Y ) ∈ D} = f (x, y) dx dy (6.10)
D

where the event {(X, Y ) ∈ D} consists of all outcomes ζ such that the point

(X(ζ), Y (ζ)) is in D.

Marginal distribution. The marginal (cumulative) distribution functions FX (x)

and FY (y) are obtained from the joint distribution function F (x, y) as follows

FX (x) = F (x, ∞) FY (y) = F (∞, y) . (6.11)

Marginal density. The marginal probability density function fX (x) and fY (y) are

obtained from the joint probability density function f (x, y) by integrating out the

variable which is not of interest. Thus,

Z∞ Z∞
fX (x) = f (x, y) dy fY (y) = f (x, y) dx . (6.12)
−∞ −∞

Definition 6.3. Independence of two random variables

Two random variables X and Y are called (statistically) independent if the events

{X ∈ A} and {Y ∈ B} are independent, that is, if

P {X ∈ A, Y ∈ B} = P {X ∈ A} P {Y ∈ B} (6.13)

where A and B are two arbitrary sets on the x and y axes, respectively. Let us apply

this to the events {X ≤ x} and {Y ≤ y}, then the independence of X and Y implies

that

F (x, y) = FX (x) FY (y) (6.14)

and, thus,

f (x, y) = fX (x) fY (y) . (6.15)

Therefore, if X and Y are independent random variables, then the joint PDF is equal

to the product of the marginal PDFs. It can also be shown that, if the joint PDF of

X and Y equals the product of the marginal PDFs, then X and Y are independent.

Hence,

X and Y are independent ⇐⇒ f (x, y) = fX (x) fY (y) . (6.16)

Example 6.1. Let X ∼ N (0, σ 2 ) and Y ∼ N (0, σ 2 ). Suppose that X and Y are

independent. Find the joint PDF f (x, y) of X and Y .



Solution. Since X and Y are independent, we only have to substitute

fX(x) = (1/(√(2π) σ)) e^{−x²/(2σ²)}

and

fY(y) = (1/(√(2π) σ)) e^{−y²/(2σ²)}

in (6.15). Hence,

f(x, y) = fX(x) fY(y) = (1/(2πσ²)) e^{−(x²+y²)/(2σ²)} .   (6.17)

6.2 One Function of Two Random Variables

In this section, we discuss methods of determining the probabilities of events involving

functions of two random variables.

Let a random variable Z be defined as a function of two given random variables X

and Y , i.e.,

Z = g(X, Y ) (6.18)

where it is supposed that the joint PDF fXY (x, y) is given. The problem is then to

obtain the PDF fZ (z) of the random variable Z.

To solve this problem, we first find the cumulative distribution function of Z

FZ (z) = P {Z(ζ) ≤ z} = P {g(X, Y ) ≤ z} = P {(X, Y ) ∈ DZ }


Z Z
= fXY (x, y) dx dy (6.19)
x,y∈DZ

where DZ represents the region in which the inequality g(x, y) ≤ z is satisfied (see

Fig. 6.3). The PDF fZ (z) of Z is then found by taking the derivative of FZ (z), i.e.,

dFZ (z)
fZ (z) = . (6.20)
dz


Fig. 6.3: The region DZ where g(x, y) ≤ z.

In practice, we are quite often interested in functions g(X, Y ), which are of the type

shown in Fig. 6.4.


Fig. 6.4: Important types of the function g(X, Y): X + Y, X − Y, X/Y, XY, X² + Y², max(X, Y), min(X, Y), tan⁻¹(Y/X).

Example 6.2. Let the random variables X ∼ N (0, σ 2 ) and Y ∼ N (0, σ 2 ) be inde-

pendent and Z = √(X² + Y²). Determine fZ(z).

Solution. We have

FZ(z) = P{Z ≤ z} = P{√(X² + Y²) ≤ z} = ∫∫_{DZ} fXY(x, y) dx dy

where DZ = {(x, y) : √(x² + y²) ≤ z} represents the area of a circle with radius z as shown in Fig. 6.5.
Fig. 6.5.

Fig. 6.5: The region DZ = {√(x² + y²) ≤ z}.

The transformation of the Cartesian coordinates (x, y) into polar coordinates (r, θ) by

means of

x = r cos θ y = r sin θ dx dy = r dr dθ

gives
Z2π Zz
FZ (z) = fXY (r cos θ, r sin θ)rdrdθ .
0 0

The joint PDF fXY(x, y) of two zero-mean independent Gaussian random variables, each with variance σ², is known from (6.17):

fXY(x, y) = (1/(2πσ²)) e^{−(x²+y²)/(2σ²)} .

Hence, it follows
FZ(z) = (1/(2πσ²)) ∫_{0}^{2π} ∫_{0}^{z} e^{−r²/(2σ²)} r dr dθ

      = (1/σ²) ∫_{0}^{z} r e^{−r²/(2σ²)} dr .

By differentiation, we finally obtain

fZ(z) = dFZ(z)/dz = (z/σ²) e^{−z²/(2σ²)} ,   z ≥ 0 .   (6.21)

Note that the PDF in (6.21) is the Rayleigh density (4.22). Thus, if W = X + jY ,

where X and Y are real independent normal random variables with zero mean and
equal variance, then the random variable Z = |W| = √(X² + Y²) has a Rayleigh

density.
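As an illustration (not in the original text), the following Python sketch, assuming NumPy is available, draws samples of Z = |X + jY| for independent zero-mean Gaussians and compares the empirical distribution with the Rayleigh CDF implied by (6.21) at a few arbitrary points.

import numpy as np

rng = np.random.default_rng(2)
sigma = 1.5
n = 1_000_000
x = rng.normal(0.0, sigma, n)
y = rng.normal(0.0, sigma, n)
z = np.abs(x + 1j * y)                              # Z = |W| = sqrt(X^2 + Y^2)

for z0 in (0.5, 1.5, 3.0):
    f_emp = np.mean(z <= z0)                        # empirical CDF of Z at z0
    f_theo = 1.0 - np.exp(-z0**2 / (2 * sigma**2))  # CDF of the Rayleigh density (6.21)
    print(f"z = {z0:.1f}:  empirical F_Z = {f_emp:.4f},  Rayleigh F_Z = {f_theo:.4f}")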

6.2.1 The Sum of Two Random Variables

Theorem 6.1. Let us consider the sum of two random variables

Z =X +Y . (6.22)

Then, the PDF fZ (z) of Z is given by

fZ(z) = ∫_{−∞}^{∞} fXY(z − y, y) dy .   (6.23)

Proof. In the first step, we determine the cumulative distribution function FZ (z) of

Z. Therefore, we have to integrate the joint PDF fXY (x, y) of X and Y over the

region DZ of the xy-plane, where DZ = x+y ≤ z is the shaded area shown in Fig. 6.6.

Hence, from (6.19) it follows

FZ(z) = P{Z ≤ z} = P{X + Y ≤ z}

      = ∫_{y=−∞}^{∞} ∫_{x=−∞}^{z−y} fXY(x, y) dx dy .   (6.24)

Fig. 6.6: The region DZ = x + y ≤ z.



In the next step, we determine fZ (z) by using the relation

fZ(z) = (d/dz) FZ(z) .

In this context, it is useful to apply the differentiation rule due to Leibniz, which

states that the differentiation of


FZ(z) = ∫_{a(z)}^{b(z)} f(x, z) dx   (6.25)

results in

fZ(z) = (d/dz) FZ(z) = f(b(z), z) · db(z)/dz − f(a(z), z) · da(z)/dz + ∫_{a(z)}^{b(z)} ∂f(x, z)/∂z dx .   (6.26)

Hence, using (6.26) in (6.24) with b(z) = z − y and a(z) = −∞, we get

fZ(z) = ∫_{−∞}^{∞} ∂/∂z [ ∫_{−∞}^{z−y} fXY(x, y) dx ] dy

      = ∫_{−∞}^{∞} [ 1 · fXY(z − y, y) − 0 + ∫_{−∞}^{z−y} ∂fXY(x, y)/∂z dx ] dy

      = ∫_{−∞}^{∞} fXY(z − y, y) dy

where the last step uses the fact that fXY(x, y) does not depend on z, so the inner integral vanishes.

If X and Y are independent random variables, then

fXY (x, y) = fX (x) fY (y) . (6.27)

Inserting (6.27) into (6.23), we get the integral

fZ(z) = ∫_{−∞}^{∞} fX(z − y) fY(y) dy   (6.28)

which represents the convolution of fX (z) with fY (z). This result leads to the fol-

lowing conclusion: If two random variables are independent, then the density of their

sum equals the convolution of their densities.
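To make the convolution statement concrete, here is a small Python sketch (an added illustration, assuming NumPy): it convolves the densities of two U(0, 1) random variables numerically on a grid and compares the result with the exact triangular density of their sum at a few arbitrary points.

import numpy as np

# Densities of X ~ U(0,1) and Y ~ U(0,1) sampled on a fine grid
dz = 0.001
z_grid = np.arange(0.0, 1.0, dz)
f_x = np.ones_like(z_grid)          # uniform density on (0,1)
f_y = np.ones_like(z_grid)

# Numerical convolution approximates f_Z(z) = (f_X * f_Y)(z) for Z = X + Y
f_z = np.convolve(f_x, f_y) * dz

# The exact density of the sum of two independent U(0,1) variables is triangular
for z0 in (0.25, 1.0, 1.6):
    idx = int(round(z0 / dz))
    exact = max(0.0, 1.0 - abs(z0 - 1.0))
    print(f"z = {z0:.2f}:  numerical convolution = {f_z[idx]:.3f},  exact = {exact:.3f}")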



6.2.2 The Quotient of Two Random Variables

Theorem 6.2. Let us consider the quotient of two random variables

Z = X / Y .   (6.29)

Then, the PDF fZ (z) of Z is given by

fZ(z) = ∫_{−∞}^{∞} |y| fXY(yz, y) dy .   (6.30)

Proof. The cumulative distribution function FZ (z) of Z is

FZ (z) = P {Z ≤ z} = P {X/Y ≤ z} . (6.31)

The inequality X/Y ≤ z can be rewritten as X ≤ Y z if Y > 0, and X ≥ Y z if

Y < 0. Hence, the event {X/Y ≤ z} in (6.31) needs to be conditioned by the event

A = {Y > 0} and its complement Ā = {Y < 0}. Since A ∪ Ā = S, we have

FZ (z) = P {X/Y ≤ z}

= P {X/Y ≤ z ∩ (A ∪ Ā)}

= P {X/Y ≤ z ∩ A} + P {X/Y ≤ z ∩ Ā}

= P {X/Y ≤ z, Y > 0} + P {X/Y ≥ z, Y < 0}

= P {X ≤ Y z, Y > 0} + P {X ≥ Y z, Y < 0} .

The areas corresponding to the first term and the second term are shown in Fig. 6.7(a)

and Fig. 6.7(b), respectively.




Fig. 6.7: (a) The region x ≤ yz (y > 0) and (b) the region x ≥ yz (y < 0).

The integration over these two regions gives


FZ(z) = ∫_{y=0}^{∞} ∫_{x=−∞}^{yz} fXY(x, y) dx dy + ∫_{y=−∞}^{0} ∫_{x=yz}^{∞} fXY(x, y) dx dy .

Finally, the differentiation of FZ (z) with respect to z results in


fZ(z) = (d/dz) FZ(z) = ∫_{0}^{∞} y fXY(yz, y) dy + ∫_{−∞}^{0} (−y) fXY(yz, y) dy

      = ∫_{−∞}^{∞} |y| fXY(yz, y) dy .   (6.32)

Note that if X and Y are independent random variables, then fXY (x, y) = fX (x)fY (y)

holds, and (6.30) simplifies to

fZ(z) = ∫_{−∞}^{∞} |y| fX(yz) fY(y) dy .   (6.33)

6.2.3 The Maximum of Two Random Variables

Theorem 6.3. Let us consider the maximum of two random variables

Z = max(X, Y) = { X if X > Y ;   Y if X ≤ Y } .   (6.34)

Then, the cumulative distribution function FZ (z) of Z is given by

FZ (z) = FXY (z, z) . (6.35)

Proof. The cumulative distribution function FZ (z) of Z is

FZ (z) = P {Z ≤ z} = P {max(X, Y ) ≤ z}

= P {(X ≤ z, X > Y ) ∪ (Y ≤ z, X ≤ Y )}

= P {X ≤ z, X > Y } + P {Y ≤ z, X ≤ Y } (6.36)

where we have used the fact that {X > Y } and {X ≤ Y } are mutually exclusive sets

that form a partition. Figure 6.8(a) and Fig. 6.8(b) show the regions satisfying the

corresponding inequalities in each term in (6.36). Figure 6.8(c) illustrates the total

region from which we conclude that

FZ (z) = P {X ≤ z, Y ≤ z} = FXY (z, z) . (6.37)



y y
x=z

y
=

y
x≤y

=
x
x≤z (z, z) y=z
(z, z)

x x
x>y y≤z

(a) (b)

y
(z, z)
x ≤ z, y ≤ z

(c)

Fig. 6.8: The regions (a) x ≤ z (x > y) and (b) y ≤ z (x ≤ y), as well as the total
region (c) x ≤ z (y ≤ z).
Note that if X and Y are independent, then

FZ (z) = FX (z) · FY (z) (6.38)

and hence

fZ(z) = dFZ(z)/dz = FX(z) · fY(z) + fX(z) · FY(z) .   (6.39)

6.3 Two Functions of Two Random Variables

Let X and Y be two random variables with joint PDF fXY (x, y). Furthermore, let

g(x, y) and h(x, y) be two functions, and suppose that Z and W are two new random

variables defined as

Z = g(X, Y ) (6.40)

W = h(X, Y ) . (6.41)

We consider the problem of finding the joint PDF fZW (z, w) of Z and W . Obviously,

with fZW (z, w) in hand, one can easily determine the marginal PDFs fZ (z) and

fW (w).

The procedure for determining fZW (z, w) is essentially the same as that described

in the previous section. The joint cumulative distribution function FZW (z, w) of Z

and W at the point (z, w) equals the probability that (X, Y ) is in the region of the

xy-plane where g(x, y) ≤ z and h(x, y) ≤ w. Hence,

FZW (z, w) = P {Z(ζ) ≤ z, W (ζ) ≤ w}

= P {g(X, Y ) ≤ z, h(X, Y ) ≤ w}
ZZ
= P {(X, Y ) ∈ DZW } = fXY (x, y) dx dy (6.42)
(x,y)∈DZW

with DZW representing the region where the inequalities g(x, y) ≤ z and h(x, y) ≤ w

are simultaneously satisfied (see Fig. 6.9).


Fig. 6.9: The region DZW , where g(x, y) ≤ z and h(x, y) ≤ w.

Theorem 6.4. Suppose that X and Y are random variables, which are described

by the joint PDF fXY (x, y). Furthermore, we suppose that g(x, y) and h(x, y) are

continuous and differentiable functions. Let (x1 , y1 ), (x2 , y2 ), . . . , (xn , yn ) be the

solution of the equations

z = g(x, y) w = h(x, y) (6.43)

such that

z = g(x1 , y1 ) = g(x2 , y2 ) = · · · = g(xn , yn )

w = h(x1 , y1 ) = h(x2 , y2 ) = · · · = h(xn , yn ) .

Then, the joint PDF fZW (z, w) of the random variables

Z = g(X, Y ) and W = h(X, Y )

is given by
fZW(z, w) = ∑_{i=1}^{n} fXY(xi, yi) / |J(xi, yi)|   (6.44)

where J(xi, yi) is the determinant

J(xi, yi) = det [ ∂g/∂x  ∂g/∂y ;  ∂h/∂x  ∂h/∂y ] |_{x=xi, y=yi}   (6.45)

which is called the Jacobian of the transformation.

Proof. Without proof.

Remark:

It can be shown that

|J(xi, yi)| = 1 / |J(z, w)|   (6.46)

where

J(z, w) = det [ ∂g1/∂z  ∂g1/∂w ;  ∂h1/∂z  ∂h1/∂w ] .   (6.47)
In (6.47), the functions g1 and h1 denote the inverse transformation in (6.43), so that

xi = g1 (z, w) and yi = h1 (z, w). Hence, due to (6.46), we can replace 1/|J(xi , yi )|

by |J(z, w)| in (6.44) .



Example 6.3. Let X and Y be two zero-mean independent Gaussian random vari-

ables with common variance σ 2 . Find the PDF of


 
R = √(X² + Y²)   and   Θ = tan⁻¹(Y/X)   (6.48)

where 0 < r < ∞ and |θ| < π.

Solution. With reference to Example 6.1, the joint PDF fXY (x, y) of X and Y equals
fXY(x, y) = (1/(2πσ²)) e^{−(x²+y²)/(2σ²)} .   (6.49)
Here, the functions g(x, y) and h(x, y) are given by

r = g(x, y) = √(x² + y²)   and   θ = h(x, y) = tan⁻¹(y/x)   (6.50)
respectively, where 0 < r < ∞ and |θ| < π. In this case, we have only a single solution

pair (x1 , y1 ) given by

x1 = g1 (r, θ) = r cos θ y1 = h1 (r, θ) = r sin θ . (6.51)

By using (6.47), we obtain

J(r, θ) = det [ ∂g1/∂r  ∂g1/∂θ ;  ∂h1/∂r  ∂h1/∂θ ] = det [ cos θ  −r sin θ ;  sin θ  r cos θ ]

        = r cos²θ + r sin²θ

        = r   (6.52)

so that

|J(r, θ)| = r .   (6.53)

From (6.46), it follows

|J(x1, y1)| = 1/|J(r, θ)| = 1/r .   (6.54)
Substituting (6.49), (6.51), and (6.54) in (6.44), we get the joint PDF

fRΘ(r, θ) = r fXY(x1, y1)

          = (r/(2πσ²)) e^{−(x1² + y1²)/(2σ²)}

          = (r/(2πσ²)) e^{−r²/(2σ²)} ,   0 < r < ∞ ,  |θ| < π .   (6.55)

Thus,

fR(r) = ∫_{−π}^{π} fRΘ(r, θ) dθ = (r/σ²) e^{−r²/(2σ²)} ,   0 < r < ∞   (6.56)

and

fΘ(θ) = ∫_{0}^{∞} fRΘ(r, θ) dr = 1/(2π) ,   |θ| < π .   (6.57)

Note that fR (r) in (6.56) represents the density of a Rayleigh random variable with

parameter σ 2 , whereas fΘ (θ) in (6.57) is the density of a uniform random variable in

the interval (−π, π). We also observe that

fRΘ (r, θ) = fR (r) · fΘ (θ) (6.58)

implying that R and Θ are independent.

The above results can be summarized in the following statement: If X ∼ N (0, σ 2 ) and

Y ∼ N (0, σ 2 ) are independent, then the magnitude R and phase Θ of the complex

Gaussian random variable

X + jY = R e jΘ (6.59)

are independent with Rayleigh and uniform distributions, respectively.

Auxiliary Variables. Let X and Y be two random variables and g(x, y) is a

function. Suppose that

Z = g(X, Y ) . (6.60)

To determine the PDF fZ (z) of Z by using the relation in (6.44), we define an auxiliary

variable

W =X (or W = Y ) . (6.61)

Then, the PDF fZ (z) can be obtained from fZW (z, w) by integration over w.

Example 6.4. Find the PDF of Z = XY .

Solution. Let us define W = X. Then, the system of equations

z = g(x, y) = xy w = h(x, y) = x

has a single solution:

x1 = w y1 = z/w .

The Jacobian is

J(x1, y1) = det [ ∂g/∂x  ∂g/∂y ;  ∂h/∂x  ∂h/∂y ] |_{x=x1, y=y1} = det [ y  x ;  1  0 ] |_{x=x1} = −x1 = −w .

From (6.44), it follows

fZW(z, w) = (1/|w|) fXY(w, z/w)

and, thus, the density fZ(z) of Z = XY is given by

fZ(z) = ∫_{−∞}^{∞} (1/|w|) fXY(w, z/w) dw .   (6.62)
−∞

In the special case that the random variables X and Y are independent, it follows
fZ(z) = ∫_{−∞}^{∞} (1/|w|) fX(w) fY(z/w) dw   (6.63)
−∞

since fXY (w, z/w) = fX (w)fY (z/w).

6.4 Joint Moments

In the following, we suppose that X and Y are random variables and g(x, y) is a

function.

Expected Value. The expected value of the random variable Z = g(X, Y ) is defined

by
Z∞
E{Z} = zfZ (z) dz (6.64)
−∞

or, alternatively,

Z∞ Z∞
E{Z} = E{g(X, Y )} = g(x, y)f (x, y) dx dy (6.65)
−∞ −∞

where f (x, y) is the joint density of X and Y .

Covariance. The covariance CXY of two random variables X and Y is defined by

the number

CXY = E{(X − µX )(Y − µY )} (6.66)

where µX = E{X} and µY = E{Y }.

The expression in (6.66) can be simplified as follows

CXY = E{(X − µX )(Y − µY )}

= E{XY − XµY − Y µX + µX µY }

= E{XY } − µY E{X} − µX E{Y } + µX µY

= E{XY } − µX µY

= E{XY } − E{X}E{Y } . (6.67)

Correlation Coefficient. The correlation coefficient ρXY of the random variables

X and Y is by definition the ratio

ρXY = CXY / (σX σY) .   (6.68)

Without proof, we maintain that

|ρXY | ≤ 1 |CXY | ≤ σX σY . (6.69)

Uncorrelatedness. Two random variables X and Y are called uncorrelated if their

covariance CXY is 0. This can be phrased in the following three equivalent forms:

CXY = 0 ρXY = 0 E{XY } = E{X}E{Y } . (6.70)



Orthogonality. Two random variables X and Y are called orthogonal if

E{X Y } = 0 . (6.71)

To indicate that the random variables X and Y are orthogonal, one often uses the

notation X⊥Y .

Theorem 6.5. If two random variables X and Y are independent, that is, if

fXY (x, y) = fX (x) fY (y) (6.72)

then they are uncorrelated.

Proof. According to (6.70), we have to show that

E{XY } = E{X}E{Y } . (6.73)

From (6.65) and (6.72) it follows that


Z∞ Z∞ Z∞ Z∞
E{XY } = xyfXY (x, y) dx dy = xyfX (x)fY (y) dx dy
−∞ −∞ −∞ −∞
Z∞ Z∞
= xfX (x) dx yfY (y) dy = E{X}E{Y } .
−∞ −∞

It can be shown, however, that if two random variables X and Y are uncorrelated,

then they are not necessarily independent. The above results can be phrased in the

following statement:

X and Y are independent   =⇒ (but not ⇐=)   X and Y are uncorrelated.   (6.74)

It should be noted that for normal random variables uncorrelatedness is equivalent

to independence.

Finally, we summarize the properties of the correlation coefficient ρXY :



1. If X and Y are uncorrelated, then ρXY = 0.

2. If X and Y are linearly dependent, then ρXY = ±1.

3. For any random variables X and Y , −1 ≤ ρXY ≤ 1.

Fundamental results can also be obtained for the expected value and the variance of

the sum of two random variables.

Expected value of the sum of two random variables: If Z = X + Y , then

µZ = E{Z} = E{X + Y }

= E{X} + E{Y }

= µX + µY . (6.75)

Thus, the expected value of the sum of two random variables equals the sum of their

expected values. Note that X and Y need not be independent.

Variance of the sum of two random variables: If Z = X + Y , then

σZ² = σX² + 2ρXY σX σY + σY² .   (6.76)

This leads to the conclusion that if two random variables are uncorrelated, i.e., if

ρXY = 0, then the variance of their sum equals the sum of their variances, that is,

σZ² = σX² + σY² .   (6.77)

Proof. The relation (6.76) can be proved by using (6.66) and (6.68) as follows:

σZ² = Var{Z} = E{(Z − µZ)²}

     = E{[(X − µX) + (Y − µY)]²}

     = E{(X − µX)² + 2(X − µX)(Y − µY) + (Y − µY)²}

     = σX² + 2CXY + σY² = σX² + 2ρXY σX σY + σY² .

Joint Moments. The mean of the product X k Y r , denoted by

mkr = E{X^k Y^r} = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x^k y^r fXY(x, y) dx dy   (6.78)

defines the joint moment of the random variables X and Y of order n = k + r.

First-order moments: m10 = µX , m01 = µY


Second-order moments: m20 = E{X 2 } , m02 = E{Y 2 } , m11 = E{XY }

Similar to (5.17) and (5.18), the joint central and absolute moments of X and Y can

be defined.

Joint Normality. Let X ∼ N(µX, σX²) and Y ∼ N(µY, σY²). The random variables X and Y are called jointly normal (or jointly Gaussian) if their joint density is given

by

fXY(x, y) = A exp{ −1/(2(1 − ρXY²)) [ (x − µX)²/σX² − 2ρXY (x − µX)(y − µY)/(σX σY) + (y − µY)²/σY² ] }   (6.79)

where

A = 1 / (2π σX σY √(1 − ρXY²))   (6.80)

and ρXY denotes the correlation coefficient. The function fXY (x, y) in (6.79) is also
denoted by N(µX, µY, σX², σY², ρXY). Without proof, we mention that the marginal

densities of X and Y are given by

fX(x) = (1/(√(2π) σX)) e^{−(x−µX)²/(2σX²)} ,      fY(y) = (1/(√(2π) σY)) e^{−(y−µY)²/(2σY²)}   (6.81)

From the results above, it follows that if two random variables are jointly normal, then

they are also marginally normal. Note that fXY (x, y) can be written as fXY (x, y) =

fX (x) · fY (y) if ρXY = 0.



6.5 The Joint Characteristic Function

Definition 6.4. The joint characteristic function of the random variables X and Y

is defined as

ΦXY (ω1 , ω2 ) = E{e j(ω1 X+ω2 Y ) } (6.82a)


Z∞ Z∞
= fXY (x, y) e j(ω1 x+ω2 y) dx dy . (6.82b)
−∞ −∞

By analogy to the one-dimensional case (see Section 5.6), the two expressions given

above can be interpreted as follows:

• In (6.82a), ΦXY (ω1 , ω2 ) can be viewed as the expected value of the two-dimensional

function e j(ω1 X+ω2 Y ) .

• In (6.82b), ΦXY (ω1 , ω2 ) is simply (the complex conjugate of) the two-dimensional

Fourier transform of the joint PDF fXY (x, y) of X and Y , i.e., ΦXY (ω1 , ω2 ) =

F ∗ {fXY (x, y)}.

Properties:

1. The marginal characteristic functions can be obtained from the joint character-

istic function:

ΦX (ω) = ΦXY (ω, 0) ΦY (ω) = ΦXY (0, ω) . (6.83)

2. If Z = aX + bY , then

ΦZ(ω) = E{e^{jωZ}} = E{e^{j(aX+bY)ω}}

       = E{e^{j(aωX + bωY)}} = ΦXY(aω, bω) .   (6.84)

3. The inversion formula for the two-dimensional Fourier transform implies that

the joint PDF is given by


Z∞ Z∞
1
fXY (x, y) = ΦXY (ω1 , ω2 )e −j(ω1 x+ω2 y) dω1 dω2 . (6.85)
(2π)2
−∞ −∞

Independence. If the two random variables X and Y are independent, then the

joint characteristic function is the product of the marginal characteristic functions,

i.e.,

ΦXY (ω1 , ω2 ) = ΦX (ω1 ) · ΦY (ω2 ) . (6.86)

Proof.

ΦXY (ω1 , ω2 ) = E{e j(ω1 X+ω2 Y ) } = E{e jω1 X e jω2 Y }

= E{e jω1 X } E{e jω2 Y } = ΦX (ω1 )ΦY (ω2 )

where the third equality follows from the fact that E{g1 (X) g2 (Y )} = E{g1 (X)} E{g2 (Y )}.

Theorem 6.6. Convolution theorem

If the random variables X and Y are independent and Z = X + Y , then

ΦZ (ω) = ΦX (ω) · ΦY (ω) . (6.87)

Proof.

ΦZ (ω) = E{e jωZ } = E{e jω(X+Y ) }

= E{e jωX } · E{e jωY } = ΦX (ω) · ΦY (ω) .

Remember that the density fZ (z) of Z = X + Y equals the convolution of fX (x) and

fY (y) if X and Y are independent [see (6.28)]. From this and (6.87) it follows that

the characteristic function of the convolution of two densities equals the product of

their characteristic functions. This result is often used when dealing with sums of

random variables.

6.6 Conditional Probability

The definition of the conditional probability in Section 2.3 provides the formula for

computing the probability that the event {Y ≤ y} occurs given that we know the

exact value of X, i.e., X = x:

FY(y | X = x) = P{Y ≤ y | X = x} = P{Y ≤ y, X = x} / P{X = x} .   (6.88)

6.6.1 The Conditional Distribution of Y given X = x

Theorem 6.7. If X is a continuous-type random variable, then the conditional dis-

tribution of Y given X = x equals

FY(y | X = x) = [ ∫_{−∞}^{y} fXY(x, y′) dy′ ] / fX(x) .   (6.89)

Proof. If X is a continuous-type random variable, then P {X = x} = 0, so that (6.88)

is undefined. However, we can define the conditional distribution of Y given X = x

by the following limiting procedure

FY(y | X = x) = lim_{∆x→0} FY(y | x < X ≤ x + ∆x) .   (6.90)

The conditional distribution on the right-hand side of (6.90) is

FY(y | x < X ≤ x + ∆x) = P{Y ≤ y, x < X ≤ x + ∆x} / P{x < X ≤ x + ∆x}

                        = [ ∫_{−∞}^{y} ∫_{x}^{x+∆x} fXY(x′, y′) dx′ dy′ ] / [ ∫_{x}^{x+∆x} fX(x′) dx′ ]

                        ≃ [ ∫_{−∞}^{y} fXY(x, y′) dy′ · ∆x ] / [ fX(x) ∆x ] .   (6.91)

If we let ∆x approach 0, then (6.90) and (6.91) imply that

FY(y | X = x) = lim_{∆x→0} FY(y | x < X ≤ x + ∆x) = [ ∫_{−∞}^{y} fXY(x, y′) dy′ ] / fX(x) .

6.6.2 The Conditional Density of Y given X = x

The conditional PDF of Y given X = x is obtained by taking the derivative of

FY (y | X = x) with respect to y. Hence,

fY(y | X = x) = (d/dy) FY(y | X = x) = fXY(x, y) / fX(x) .   (6.92)

To simplify the notation, the function fY (y | X = x) will be written in the form

fY (y | x). Using this notation and defining fX (x | y) similarly, we may write

fY(y | x) = fXY(x, y) / fX(x)      fX(x | y) = fXY(x, y) / fY(y) .   (6.93)

Note that if the random variables X and Y are independent, then fXY (x, y) =

fX (x)fY (y), and thus

fY (y | x) = fY (y) fX (x | y) = fX (x) . (6.94)

Total Probability. Inserting fXY (x, y) = fY (y | x)fX (x) into the marginal PDF
Z∞
fY (y) = fXY (x, y) dx
−∞

gives the PDF version of the total probability introduced in Section 2.4

fY(y) = ∫_{−∞}^{∞} fY(y | x) fX(x) dx .   (6.95)

As this result shows, to remove the condition X = x from the conditional density

fY (y | x), we multiply fY (y | x) by the density of X and integrate the product over x.



Bayes’ Theorem. Since fXY (x, y) = fY (y | x) fX (x) = fX (x | y)fY (y), it follows

that

fX(x | y) = fY(y | x) fX(x) / fY(y) .   (6.96)

Inserting (6.95) into (6.96), we obtain Bayes’ theorem (see Section 2.5) for densities

fX(x | y) = fY(y | x) fX(x) / ∫_{−∞}^{∞} fY(y | x) fX(x) dx .   (6.97)
Chapter 7

Sums of Random Variables and


Limit Theorems

Many technical problems involve:

• the counting of the number of occurrences of events,

• the measurement of cumulative effects,

• the computation of arithmetic averages in a series of measurements.

Such problems can often be reduced to the problem of finding the distribution of

a random variable that consists of a sum of n independent, identically distributed

(i.i.d.) random variables. In this chapter, we investigate the statistical properties of

sums of n random variables as n becomes large.

7.1 Sums of Random Variables

Let X1 , X2 , . . . , Xn be a sequence of random variables, and let Y be their sum

n
X
Y = Xi . (7.1)
i=1

In this section, we study the expected value and variance of Y , as well as the PDF of

Y in the important case where the Xi ’s are independent.



7.1.1 The Expected Value of Sums of Random Variables

Let Y = X1 + X2 + · · · + Xn be a sum of n random variables, each with expected

value µXi (i = 1, 2, . . . , n), then

µY = E{Y } = E{X1 + X2 + · · · + Xn }

= E{X1 } + E{X2 } + · · · + E{Xn }

= µ X1 + µ X2 + · · · + µ Xn . (7.2)

Thus, regardless of statistical dependence, the expected value of a sum of n random

variables is equal to the sum of the expected values of the individual random variables.

7.1.2 The Variance of Sums of Random Variables

Let Y = X1 + X2 + · · · + Xn be a sum of n random variables, each with variance σXi² (i = 1, 2, . . . , n), then

σY² = Var{Y} = E{(Y − E{Y})²}

    = E{ [ ∑_{i=1}^{n} (Xi − E{Xi}) ]² }

    = E{ ∑_{i=1}^{n} ∑_{j=1}^{n} (Xi − E{Xi})(Xj − E{Xj}) }

    = ∑_{i=1}^{n} Var{Xi} + ∑_{i=1}^{n} ∑_{j≠i} C_{Xi Xj}

    = ∑_{i=1}^{n} σXi² + ∑_{i=1}^{n} ∑_{j≠i} C_{Xi Xj} .   (7.3)

However, if the random variables X1, X2, . . . , Xn are independent, then C_{Xi Xj} = 0 for i ≠ j and

σY² = ∑_{i=1}^{n} σXi² .   (7.4)

7.1.3 The Density of Sums of Independent Random Variables

In this section, we will find the PDF of

Y = X1 + X2 + · · · + Xn (7.5)

where we assume that the random variables X1 , X2 , . . . , Xn are independent and

that their densities fXi (x) are known for all i = 1, 2, . . . , n. To solve this problem,

we compute the characteristic function of Y . Hence,

ΦY (ω) = E{e jωY }

= E{e jω(X1 +X2 +···+Xn ) }

= E{e jωX1 } · E{e jωX2 } · · · E{ejωXn }

= ΦX1 (ω) · ΦX2 (ω) · · · ΦXn (ω) (7.6)

where ΦXi (ω) = F ∗ {fXi (x)}. Equation (7.6) states that the characteristic function of

Y equals the product of the individual characteristic functions of the random variables

X1 , X2 , . . . , Xn . Finally, the PDF of Y can then be obtained by finding the inverse

Fourier transform of ΦY (ω) [see (5.24)], i.e.,


fY(y) = (1/(2π)) ∫_{−∞}^{∞} ΦY(ω) e^{−jωy} dω .   (7.7)

Note that by applying the convolution theorem for Fourier transforms, we obtain

ΦY(ω) = ΦX1(ω) · ΦX2(ω) · · · ΦXn(ω)   (7.8)

    ◦—•

fY(y) = fX1(y) ⋆ fX2(y) ⋆ · · · ⋆ fXn(y)   (7.9)

where ⋆ denotes the convolution operator. Hence, the density fY (y) of the sum

Y = X1 + X2 + · · · + Xn equals the convolution of their individual densities.

Example 7.1. Let Y be the sum of n independent Gaussian random variables Xi (i = 1, 2, . . . , n), each with expected value µXi and variance σXi². Find the density of Y = X1 + X2 + · · · + Xn.

Solution. According to Example 5.9, the characteristic function ΦXi(ω) of Xi ∼ N(µXi, σXi²) is

ΦXi(ω) = e^{jωµXi − ω²σXi²/2} .   (7.10)

By using (7.6), we obtain

ΦY(ω) = ∏_{i=1}^{n} ΦXi(ω)   (7.11)

       = ∏_{i=1}^{n} e^{jωµXi − ω²σXi²/2}

       = e^{jω ∑_{i=1}^{n} µXi − ω² (∑_{i=1}^{n} σXi²)/2}

       = e^{jωµY − ω²σY²/2}   (7.12)

where µY = ∑_{i=1}^{n} µXi and σY² = ∑_{i=1}^{n} σXi². Note that (7.12) is the characteristic function of a Gaussian random variable with expected value µY and variance σY². Thus, Y ∼ N(µY, σY²).

7.2 The Sample Mean and the Sample Variance

In practice, the mean µX = E{X} and the variance σX² = Var{X} of random variables are often unknown. A problem is then to estimate the mean and variance from repeated measurements of X. Let X1, X2, . . . , Xn denote a sequence of n independent repeated measurements of X; then the random variables Xi (i = 1, 2, . . . , n) are i.i.d. random variables with the same density as X.

Definition 7.1. Let X1 , X2 , . . . , Xn be a sequence of n i.i.d. random variables for

which the mean and variance are unknown, then the sample mean µ̄X and the sample

variance σ̄X² are defined as

µ̄X = (1/n) ∑_{i=1}^{n} Xi   (7.13)

and

σ̄X² = (1/(n−1)) ∑_{i=1}^{n} (Xi − µ̄X)²   (7.14)

respectively.

The sample mean µ̄X can be considered as an estimator for the mean µX = E{X}, and the sample variance σ̄X² represents an estimator for the variance σX² = Var{X}.

To assess the effectiveness of the two estimators, we will show that

(i) E{µ̄X} = µX   (7.15)

and

(ii) E{σ̄X²} = σX² .   (7.16)

Proof. Since it has been assumed that the random variables Xi are i.i.d., they must have the same mean E{Xi} = µX and the same variance Var{Xi} = σX².

(i) The proof of (7.15) results from the linearity of the expected value as follows

E{µ̄X} = E{ (1/n) ∑_{i=1}^{n} Xi } = (1/n) ∑_{i=1}^{n} E{Xi} = (1/n) ∑_{i=1}^{n} µX = µX .   (7.17)

Hence, the expected value of µ̄X is equal to the expected value of X.

(ii) To prove (7.16), we observe first that the variance of µ̄X is given by

Var{µ̄X} = Var{ (1/n) ∑_{i=1}^{n} Xi } = (1/n²) ∑_{i=1}^{n} Var{Xi} = (1/n²) ∑_{i=1}^{n} σX² = σX²/n .   (7.18)

We note that the variance of µ̄X decreases inversely with the number of samples. We observe also that
observe also that


E{(Xi − µX)(µ̄X − µX)} = (1/n) E{(Xi − µX)[(X1 − µX) + · · · + (Xn − µX)]}

                       = (1/n) E{(Xi − µX)(Xi − µX)}

                       = (1/n) Var{Xi} = σX²/n

where the second equality holds because the random variables Xi and Xj (j ≠ i) are uncorrelated by assumption. Hence,

E{(Xi − µ̄X)²} = E{[(Xi − µX) − (µ̄X − µX)]²}

               = E{(Xi − µX)² − 2(Xi − µX)(µ̄X − µX) + (µ̄X − µX)²}

               = Var{Xi} − (2/n) Var{Xi} + Var{µ̄X}

               = σX² − (2/n) σX² + (1/n) σX²

               = ((n−1)/n) σX² .

With this result, we finally obtain


E{σ̄X²} = E{ (1/(n−1)) ∑_{i=1}^{n} (Xi − µ̄X)² }

        = (1/(n−1)) ∑_{i=1}^{n} E{(Xi − µ̄X)²}

        = (1/(n−1)) ∑_{i=1}^{n} ((n−1)/n) σX² = σX² .

Remarks:

1. Note that the sample mean µ̄X is an estimator for the mean µX of X. Analogously, the sample variance σ̄X² can be considered as an estimator for the variance σX² of X.

2. Equation (7.15) shows that the statistical average of the sample mean µ̄X is equal to E{X} = µX. For this reason, we say that the sample mean µ̄X is an unbiased estimator for µX. Analogously, we conclude from (7.16) that the sample variance σ̄X² is an unbiased estimator for σX².

3. The sample variance σ̄X², defined in (7.14), can be used as an estimate of σX² only if the mean µX = E{X} is unknown. If µX is known, however, then we



replace µ̄X by µX and we change n − 1 to n. This yields the estimate

σ̄X² = (1/n) ∑_{i=1}^{n} (Xi − µX)² .   (7.19)
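The following Python sketch (an added illustration assuming NumPy; the distribution and sample sizes are arbitrary) estimates E{µ̄X} and E{σ̄X²} by averaging over many repeated experiments. It shows that the n − 1 normalization in (7.14) gives an unbiased variance estimate, whereas dividing by n does not.

import numpy as np

rng = np.random.default_rng(3)
mu, sigma = 2.0, 3.0           # true mean and standard deviation of X
n = 5                          # samples per experiment (small n makes the bias visible)
trials = 200_000               # number of repeated experiments

x = rng.normal(mu, sigma, size=(trials, n))
sample_mean = x.mean(axis=1)
sample_var_unbiased = x.var(axis=1, ddof=1)   # divide by n - 1, as in (7.14)
sample_var_biased = x.var(axis=1, ddof=0)     # divide by n

print("E{sample mean}        ~", sample_mean.mean(), " (true mean:", mu, ")")
print("E{sample var, ddof=1} ~", sample_var_unbiased.mean(), " (true variance:", sigma**2, ")")
print("E{sample var, ddof=0} ~", sample_var_biased.mean(), " (biased by the factor (n-1)/n)")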

7.3 The Laws of Large Numbers

7.3.1 The Weak Law of Large Numbers

Theorem 7.1. Weak law of large numbers (Bernoulli)

Suppose we make a series of independent measurements of the same random variable

X with mean µX . Let X1 , X2 , . . . , Xn be the resulting sequence of i.i.d. random

variables. Then,

lim_{n→∞} P{|µ̄X − µX| ≥ ε} = 0   (7.20)

for ε > 0, where µ̄X denotes the sample mean of the sequence X1 , X2 , . . . , Xn . We

say that µ̄X tends to µX in probability. This is also called stochastic convergence.

Proof. Using the Chebyshev inequality (5.19) gives

P{|µ̄X − E{µ̄X}| ≥ ε} ≤ Var{µ̄X} / ε² .   (7.21)

Substituting (7.15) and (7.18) in (7.21) results in

P{|µ̄X − µX| ≥ ε} ≤ σX² / (n ε²) .   (7.22)

If we let n approach infinity, we obtain

lim_{n→∞} P{|µ̄X − µX| ≥ ε} = 0 .   (7.23)

The weak law of large numbers can be phrased as follows: The probability that the

sample mean µ̄X differs from the true mean µX by more than ε (ε > 0) tends to zero

as n approaches infinity.
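As a numerical illustration (not from the original notes; NumPy assumed, parameter values arbitrary), the sketch below estimates P{|µ̄X − µX| ≥ ε} for increasing n and compares it with the bound σX²/(nε²) from (7.22).

import numpy as np

rng = np.random.default_rng(4)
mu, sigma, eps = 0.0, 1.0, 0.1
trials = 1000                        # repeated experiments per value of n

for n in (10, 100, 1000, 10000):
    x = rng.normal(mu, sigma, size=(trials, n))
    sample_means = x.mean(axis=1)
    p_dev = np.mean(np.abs(sample_means - mu) >= eps)   # estimate of P{|mean - mu| >= eps}
    bound = sigma**2 / (n * eps**2)                      # Chebyshev-type bound (7.22)
    print(f"n = {n:6d}:  P(|mean - mu| >= {eps}) = {p_dev:.4f}   (bound: {min(bound, 1.0):.4f})")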

It can be shown that µ̄X tends to µX not only in probability, but also with probability

1. This result is known as the strong law of large numbers.

Theorem 7.2. Strong law of large numbers (Borel)

Suppose we make a series of independent measurements of the same random variable

X with mean µX . Let X1 , X2 , . . . , Xn be the resulting sequence of i.i.d. random

variables. Then,

P{ lim_{n→∞} µ̄X = µX } = 1   (7.24)

where µ̄X denotes the sample mean of the sequence X1 , X2 , . . . , Xn . We say that µ̄X

converges almost everywhere (or with probability 1).

Proof. Without proof.

The strong law of large numbers can be phrased as follows: With probability 1, we

expect that each particular sequence of sample mean calculations tends to the true

mean µX = E{X} and stays there. This statement is illustrated in Fig. 7.1, where,

for simplicity, a sequence of sample means µ̄X is drawn as a curve.

It should be noted that the strong law of large numbers predicts the convergence of

sample means µ̄X to the expected value µX . However, there are still gaps between the

mathematical theory and the real world, because we can never actually carry out an

infinite number of measurements and compute an infinite number of sample means.

Nevertheless, the strong law of large numbers demonstrates a remarkable consistency

between the theory and the observed physical behavior.




Fig. 7.1: Convergence of the sample mean µ̄X to the expected value µX .

7.4 The Central Limit Theorem

The central limit theorem is extremely useful concerning the density of a sum of

random variables in the limit as the number of terms in the sum approaches infinity.

Here, we consider two versions of this theorem.

Theorem 7.3. Central limit theorem (Lindeberg-Lévy)

Let

Y = X1 + X2 + · · · + Xn (7.25)

be a sum of n i.i.d. random variables, each having a finite mean E{Xi } = µ and a

finite variance Var{Xi } = σ 2 > 0. Then, the normalized random variable

Z = (Y − nµ) / (σ√n)   (7.26)

tends to a Gaussian random variable with zero mean and unit variance as n tends to

infinity. Hence,

Z ∼ N (0, 1) as n → ∞. (7.27)

Proof. First, we note that


Z = (Y − nµ)/(σ√n) = (1/(σ√n)) ∑_{i=1}^{n} (Xi − µ) .

The characteristic function of Z is given by


ΦZ(ω) = E{e^{jωZ}}

      = E{ e^{jω ∑_{i=1}^{n} (Xi − µ)/(σ√n)} }

      = E{ ∏_{i=1}^{n} e^{jω(Xi − µ)/(σ√n)} }

      = ∏_{i=1}^{n} E{ e^{jω(Xi − µ)/(σ√n)} }

      = [ E{ e^{jω(Xi − µ)/(σ√n)} } ]^n .   (7.28)

It should be mentioned that the second last equality follows from the independence

of the random variables Xi , and the last equality follows from the fact that the Xi ’s

are identically distributed.

Next, we expand the mean of the exponential function in (7.28) as follows

E{ e^{jω(Xi − µ)/(σ√n)} } = E{ 1 + (jω/(σ√n))(Xi − µ) + ((jω)²/(2σ²n))(Xi − µ)² + · · · }

                          = 1 + (jω/(σ√n)) E{Xi − µ} + ((jω)²/(2σ²n)) E{(Xi − µ)²} + Rn(ω)/n

where Rn(ω)/n denotes the remainder. We note that the term Rn(ω) approaches zero as n → ∞. Noting that E{Xi − µ} = 0 and E{(Xi − µ)²} = σ², we obtain

E{ e^{jω(Xi − µ)/(σ√n)} } = 1 − ω²/(2n) + Rn(ω)/n .   (7.29)

Hence, substituting (7.29) into (7.28) yields the characteristic function of Z in the form

ΦZ(ω) = [ 1 − ω²/(2n) + Rn(ω)/n ]^n .

Taking the natural logarithm, we obtain

ln ΦZ(ω) = n ln[ 1 − ω²/(2n) + Rn(ω)/n ] .   (7.30)

For small values of x, we can expand ln(1 + x) in the power series

ln(1 + x) = x − x²/2 + x³/3 − · · · .

This expansion applied to (7.30) yields

ln ΦZ(ω) = n [ −ω²/(2n) + Rn(ω)/n − (1/2)( −ω²/(2n) + Rn(ω)/n )² + · · · ] .

Finally, when we take the limit as n → ∞, the above equation reduces to

lim_{n→∞} ln ΦZ(ω) = −ω²/2

or, equivalently,

lim_{n→∞} ΦZ(ω) = e^{−ω²/2} .   (7.31)

This is just the characteristic function of a Gaussian random variable with zero mean

and unit variance, i.e., Z ∼ N (0, 1) .

The central limit theorem of Lindeberg and Lévy allows the following statement:

The sum of n statistically independent and identically distributed random vari-


ables with finite mean and variance approaches a Gaussian random variable as
n → ∞.

Although we have assumed that the random variables Xi in the sum Y are identically

distributed, the assumption can be relaxed provided that additional restrictions are

imposed on the properties of the Xi ’s. In the following variation of the central limit

theorem, the assumption of identically distributed random variables is abandoned in

favor of a condition on the third absolute moment of the random variables in the sum

of Y .

Theorem 7.4. Central limit theorem (Lyapunov)

Let

Y = X1 + X2 + · · · + Xn

be a sum of n independent random variables, each with mean µi and variance σi2 > 0.

Furthermore, let
Bn = [ ∑_{i=1}^{n} E{|Xi − µi|³} ]^{1/3}      Cn = [ ∑_{i=1}^{n} σi² ]^{1/2} .   (7.32)

If the condition

lim_{n→∞} Bn / Cn = 0   (7.33)

is fulfilled, then the normalized random variable

Z = ( Y − ∑_{i=1}^{n} µi ) / ( ∑_{i=1}^{n} σi² )^{1/2}   (7.34)

approaches an N (0, 1) random variable as n tends to infinity. Hence,

Z ∼ N (0, 1) as n → ∞. (7.35)

Proof. Without proof.

Note that Lyapunov’s central limit theorem is more general than the central limit

theorem of Lindeberg and Lévy, since the random variables Xi need not be iden-

tically distributed.

The nature of the central limit theorem approximation depends on the form of the

densities fXi (x) of the random variables Xi . If the Xi ’s are i.i.d. random variables,

then values of n in the range from 10 to 20 are adequate for most applications. In

fact, if the densities fXi (x) are smooth, then values of n as low as 5 can be used.

Fig. 7.2 illustrates the central limit theorem. This figure shows the density fY (y) of

the sum Y = X1 + X2 + · · · + Xn for n = 1, 2, 3, 4, where the random variables Xi

are i.i.d. and uniformly distributed in the interval (−1/2, 1/2).


Fig. 7.2: Density fY(y) of the sum Y = X1 + X2 + · · · + Xn of n i.i.d. random variables, uniformly distributed in (−1/2, 1/2), for n = 1, 2, 3, 4.
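The behavior illustrated in Fig. 7.2 can be reproduced numerically. The following Python sketch (an added illustration assuming NumPy; the sample size and evaluation points are arbitrary) standardizes the sum of n uniform random variables and compares its empirical CDF with the standard normal CDF.

import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(5)
trials = 200_000

def std_normal_cdf(z):
    # CDF of N(0, 1); math.erf is the standard error function (the convention of footnote 1)
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

for n in (1, 2, 5, 20):
    x = rng.uniform(-0.5, 0.5, size=(trials, n))   # Xi ~ U(-1/2, 1/2): mean 0, variance 1/12
    z = x.sum(axis=1) / np.sqrt(n / 12.0)          # normalized sum as in (7.26)
    for z0 in (-1.0, 0.5):
        print(f"n = {n:2d}, z0 = {z0:+.1f}:  P(Z <= z0) = {np.mean(z <= z0):.4f}"
              f"   (N(0,1): {std_normal_cdf(z0):.4f})")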

7.5 Gaussian Approximation for Binomial Probabilities

A binomial random variable X is a sum of i.i.d. Bernoulli random variables Xi , i.e.,

X = X1 + X2 + · · · + Xn (7.36)

where the Xi ’s (i = 1, 2, . . . , n) take the values 1 and 0 with probability

P {Xi = 1} = p and P {Xi = 0} = q = 1 − p . (7.37)

In Example 5.8, we have seen that the mean and the variance of a Bernoulli random

variable Xi are given by

E{Xi } = p (7.38)

Var{Xi } = p (1 − p) . (7.39)

Since the Xi ’s in (7.36) are i.i.d. random variables, the mean and the variance of the

binomial random variable X can be expressed as

E{X} = np (7.40)

Var{X} = np (1 − p) . (7.41)

According to (4.30), the density of a binomial random variable X is given by


 
fX(k) = P{X = k} = C(n, k) p^k (1 − p)^{n−k} ,   k = 0, 1, . . . , n .   (7.42)

The problem with (7.42) is that the binomial coefficient


 
C(n, k) = n! / ((n − k)! k!)   (7.43)

grows quite rapidly with n; for instance,

C(64, 15) ≈ 1.6 · 10^14      C(127, 14) ≈ 1.6 · 10^18 .

Hence, the evaluation of (7.42) is difficult for large n. A particularly important applica-

tion of the central limit theorem is in the approximation of binomial distributions. The

central limit theorem states that a binomial random variable X [see (7.36)] tends to a

Gaussian random variable with mean E{X} = np and variance Var{X} = np (1 − p)

as n approaches infinity.

7.5.1 Approximate Evaluation of P {X = k}

Theorem 7.5. DeMoivre-Laplace theorem

Let X be a binomial random variable with mean E{X} = np and variance Var{X} = np(1 − p); then

fX(k) = P{X = k} = C(n, k) p^k (1 − p)^{n−k}

      = (1/√(2πnp(1 − p))) e^{−(k−np)²/(2np(1−p))}   as n → ∞ .   (7.44)

Proof. First, we calculate the characteristic function of a binomial random variable.

Recall that the characteristic function of a discrete-type random variable X is given



by [see (5.26)]

ΦX(ω) = ∑_k pk e^{jωxk} ,   pk = P{X = xk} .   (7.45)

Here, the random variable X is binomial distributed, i.e.,


 
pk = P{X = k} = C(n, k) p^k q^{n−k} ,   k = 0, 1, . . . , n .   (7.46)

Substituting (7.46) into (7.45), we obtain the characteristic function of a binomial

random variable to be
ΦX(ω) = ∑_{k=0}^{n} C(n, k) p^k q^{n−k} e^{jωk}

       = (p e^{jω} + q)^n .   (7.47)

Next, we define for convenience

Z = (X − np) / √(npq) .   (7.48)

Using the property (5.23) gives the characteristic function of Z in the form

ΦZ(ω) = E{e^{jωZ}} = e^{−jnpω/√(npq)} ΦX(ω/√(npq))

      = e^{−jnpω/√(npq)} ( p e^{jω/√(npq)} + q )^n

      = ( p e^{jqω/√(npq)} + q e^{−jpω/√(npq)} )^n

      = { p [ 1 + jqω/√(npq) − q²ω²/(2npq) + ∑_{k=3}^{∞} (1/k!)(jqω/√(npq))^k ]

        + q [ 1 − jpω/√(npq) − p²ω²/(2npq) + ∑_{k=3}^{∞} (1/k!)(−jpω/√(npq))^k ] }^n

      = { 1 − (ω²/(2n)) [1 + Rn(ω)] }^n   (7.49)
2n

where the term Rn (ω) is defined as


Rn(ω) = 2 ∑_{k=3}^{∞} (1/k!) (jω/√n)^{k−2} (p q^k + q(−p)^k) / (√(pq))^k .   (7.50)

Note that Rn(ω) → 0 as n → ∞. Hence, in the limit n → ∞, (7.49) becomes

lim_{n→∞} ΦZ(ω) = e^{−ω²/2}   (7.51)

which is equivalent to

lim_{n→∞} fZ(z) = (1/√(2π)) e^{−z²/2}   (7.52)

where fZ(z) denotes the density of Z. Hence, the distribution of Z tends to the standard normal distribution, and from (7.48) we conclude that X ∼ N(np, npq).

From Theorem 7.5 it follows that if n is large, then the binomial distribution can be

approximated by the Gaussian distribution. Hence,


P{X = k} = C(n, k) p^k (1 − p)^{n−k} ≃ (1/√(2πnp(1−p))) e^{−(k−np)²/(2np(1−p))} .   (7.53)

Example 7.2. A fair coin is tossed 1000 times. Find the probability that the number

of heads is 500.

Solution. In this example, we have

p = q = 0.5 ,   n = 1000 ,   √(np(1 − p)) = √250 = 5√10 .

If k = 500, then k − np = 0 and (7.53) yields

P{X = 500} ≃ 1/√(2πnp(1 − p)) = 1/(√(2π) · 5√10) = 0.0252 .

The next example indicates that the approximation (7.53) is satisfactory even for

moderate values of n.

Example 7.3. A fair coin is tossed 10 times. Find the probability that the number

of heads equals 5.

Solution. Note that p = 0.5, n = 10, and k = 5 .

(a) Exact solution: Using (7.44), we obtain

P{X = 5} = C(n, k) p^k (1 − p)^{n−k} = (10!/(5! 5!)) (1/2)^5 (1/2)^5 = 0.246 .

(b) Approximate solution: Using (7.53), we find

P{X = 5} ≃ (1/√(2πnp(1 − p))) e^{−(k−np)²/(2np(1−p))}

         = 1/√(2π · 2.5) = 0.252 .
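A few lines of Python (an added illustration; SciPy is assumed for the exact binomial PMF) reproduce the numbers of Examples 7.2 and 7.3 and show how the quality of the approximation (7.53) behaves for different n and k.

import numpy as np
from scipy.stats import binom

def gauss_approx(k, n, p):
    # Gaussian approximation (7.53) of the binomial probability P{X = k}
    s2 = n * p * (1 - p)
    return np.exp(-(k - n * p) ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)

for n, k in ((10, 5), (1000, 500), (1000, 520)):
    p = 0.5
    exact = binom.pmf(k, n, p)
    approx = gauss_approx(k, n, p)
    print(f"n = {n:5d}, k = {k:4d}:  exact = {exact:.6f},  approximation (7.53) = {approx:.6f}")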

7.5.2 Approximate Evaluation of P {k1 ≤ X ≤ k2 }

From (7.53), we can find the following useful approximation for the probability that, in n trials, the binomial random variable X lies between k1 and k2

P{k1 ≤ X ≤ k2} = ∑_{k=k1}^{k2} fX(k) = ∑_{k=k1}^{k2} C(n, k) p^k (1 − p)^{n−k}

               ≃ (1/√(2πnp(1−p))) ∫_{k1}^{k2} e^{−(x−np)²/(2np(1−p))} dx .   (7.54)

This approximation can be stated as an equality if n → ∞. It should be mentioned

that the integral in (7.54) does not have a closed-form solution. In electrical engi-

neering it is customary to work with the so-called error function1 [see Fig. 7.3]

erf(x) = (1/√(2π)) ∫_{0}^{x} e^{−y²/2} dy ,   x > 0   (7.55)

which is listed in many books in form of looking up tables. Note that the symmetry

of the integrand in (7.55) implies that the error function is an odd function, i.e.,

erf(−x) = −erf(x) . (7.56)

1 In other references, the error function is often defined as erf(x) = (2/√π) ∫_{0}^{x} e^{−y²} dy .


Fig. 7.3: The error function erf(x).

Using (7.55) in (7.54), we finally obtain the desired approximation in the form

P{k1 ≤ X ≤ k2} ≃ erf( (k2 − np)/√(np(1−p)) ) − erf( (k1 − np)/√(np(1−p)) ) .   (7.57)

Example 7.4. A fair coin is tossed 10 000 times. What is the probability that the

number of heads is between 4 900 and 5 100?

Solution. In this example, we have

n = 10 000 ,   p = 0.5 ,   k1 = 4 900 ,   k2 = 5 100 .

Since

(k2 − np)/√(np(1 − p)) = 100/50 = 2   and   (k1 − np)/√(np(1 − p)) = −100/50 = −2

we conclude from (7.57) and (7.56) that the unknown probability is approximately

given by

P {4 900 ≤ X ≤ 5 100} ≃ erf(2) − erf(−2)

= 2 erf(2)

= 0.9545.
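The following Python sketch (illustrative; SciPy assumed) checks the result of Example 7.4 against the exact binomial probability and evaluates the right-hand side of (7.57). Note that scipy.special.erf uses the standard convention of footnote 1, so the error function of (7.55) corresponds to 0.5·erf(x/√2) there.

import numpy as np
from scipy.stats import binom
from scipy.special import erf as erf_std

def erf_doc(x):
    # Error function in the convention of (7.55): (1/sqrt(2*pi)) * integral_0^x exp(-y^2/2) dy
    return 0.5 * erf_std(x / np.sqrt(2.0))

n, p, k1, k2 = 10_000, 0.5, 4_900, 5_100
s = np.sqrt(n * p * (1 - p))

approx = erf_doc((k2 - n * p) / s) - erf_doc((k1 - n * p) / s)   # approximation (7.57)
exact = binom.cdf(k2, n, p) - binom.cdf(k1 - 1, n, p)            # exact P{k1 <= X <= k2}
print(f"approximation (7.57): {approx:.4f}   exact binomial: {exact:.4f}")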

Example 7.5. Telephone calls

An exchange is connected with 180 telephones. For each telephone, the probability

that a call will occur in a four-hour interval is equal to p = 1/3. What is the

probability that in a four-hour interval the number of calls is between 50 and 70?

Solution. The situation in this example can be considered as a problem of repeated

trials with p = 1/3. Using (7.54) and (7.57), the probability that the number of calls

is between k1 = 50 and k2 = 70 equals

∑_{k=k1}^{k2} C(n, k) p^k (1 − p)^{n−k} = ∑_{k=50}^{70} C(180, k) (1/3)^k (2/3)^{180−k}

  ≃ erf( (k2 − np)/√(np(1 − p)) ) − erf( (k1 − np)/√(np(1 − p)) )

  = erf( (70 − 180 · 1/3)/√(180 · 1/3 · 2/3) ) − erf( (50 − 180 · 1/3)/√(180 · 1/3 · 2/3) )

  = erf(√2.5) − erf(−√2.5)

  = 2 erf(√2.5)

  ≃ 0.8862 .

Example 7.6. Package error probability

A binary data stream is transmitted over a binary symmetric channel with a bit error

probability of p = 0.1. The data stream is partitioned into packages of data blocks

of length n. A package error occurs if a block of n bits contains one or more than

one bit errors. To protect the data transmission from errors, block codes are used,

which can correct up to t errors in a block of n bits. For some important block codes,

the typical values for n and t are listed in Table 7.1. What is the probability for a

package error when the block codes listed in Table 7.1 are used?

Block code      n     t
Hamming        31     1
Reed-Muller    64    15
BCH           127    14

Table 7.1. Characteristic parameters of block codes.

Solution. A package error occurs if the number of bit errors in a data block is larger

than t. Hence, for p = 0.1, the probability of a package error is

P{Package error} = P{More than t bit errors}

                 = ∑_{k=t+1}^{n} C(n, k) p^k (1 − p)^{n−k}

                 ≃ erf( (n − np)/√(np(1 − p)) ) − erf( (t + 1 − np)/√(np(1 − p)) ) .

a) Hamming code: n = 31, t = 1

• Exact solution: P{Package error} = ∑_{k=2}^{31} C(31, k) p^k (1 − p)^{31−k} = 0.8304

• Approximate solution: P{Package error} ≃ erf(16.7) − erf(−0.658) ≃ 0.7449

b) Reed-Muller code: n = 64, t = 15

• Exact solution: P{Package error} = ∑_{k=16}^{64} C(64, k) p^k (1 − p)^{64−k} = 4.467 · 10⁻⁴

• Approximate solution: P{Package error} ≃ erf(24) − erf(3.999) ≃ 3.167 · 10⁻⁵

c) BCH code: n = 127, t = 14

• Exact solution: P{Package error} = ∑_{k=15}^{127} C(127, k) p^k (1 − p)^{127−k} = 0.2875

• Approximate solution: P{Package error} ≃ erf(33.8) − erf(0.68) ≃ 0.248
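The exact sums in Example 7.6 are easy to evaluate numerically. The sketch below (an added illustration; SciPy assumed) computes both the exact package error probability and the erf-based approximation for the three codes of Table 7.1.

import numpy as np
from scipy.stats import binom
from scipy.special import erf as erf_std

def erf_doc(x):
    # Error function in the convention of (7.55)
    return 0.5 * erf_std(x / np.sqrt(2.0))

p = 0.1
for name, n, t in (("Hamming", 31, 1), ("Reed-Muller", 64, 15), ("BCH", 127, 14)):
    exact = 1.0 - binom.cdf(t, n, p)              # P{more than t bit errors}
    s = np.sqrt(n * p * (1 - p))
    approx = erf_doc((n - n * p) / s) - erf_doc((t + 1 - n * p) / s)
    print(f"{name:12s} (n={n:3d}, t={t:2d}):  exact = {exact:.4e},  approx = {approx:.4e}")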



7.6 The Central Limit Theorem for Products

Theorem 7.6. Let X1 , X2 , . . . , Xn be a sequence of n independent positive random

variables. Then, their product defines the random variable

Y = X1 · X2 · · · Xn , Xi > 0 . (7.58)

For large n, the density fY(y) of Y is approximately lognormally distributed:

fY(y) ≃ (1/(√(2π) y σ)) e^{−(ln y − µ)²/(2σ²)}   for y ≥ 0 ,   and fY(y) ≃ 0 otherwise,   (7.59)

where

µ = ∑_{i=1}^{n} E{ln Xi}      σ² = ∑_{i=1}^{n} Var{ln Xi} .   (7.60)

Proof. Consider the random variable

Z = ln Y = ln X1 + ln X2 + · · · + ln Xn

which is the sum of the random variables ln Xi . From the central limit theorem,

it follows that for large n, this random variable is nearly normal with mean µ and

variance σ 2 . Since

Y = eZ

we conclude from (4.28) that Y has a lognormal density. This theorem holds if

the random variables ln Xi satisfy the conditions for the validity of the central limit

theorem.
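As an illustration (not in the original notes; NumPy assumed, factor distribution and sizes arbitrary), the following sketch multiplies n independent uniform random variables on (0.5, 1.5) and compares the empirical CDF of the product with the lognormal CDF suggested by (7.59)-(7.60), where µ and σ² are estimated from the same samples.

import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(6)
n, trials = 30, 200_000

x = rng.uniform(0.5, 1.5, size=(trials, n))   # positive i.i.d. factors
y = x.prod(axis=1)                            # Y = X1 * X2 * ... * Xn

# Parameters (7.60), here estimated from the log-samples
mu = n * np.mean(np.log(x))
sigma = sqrt(n * np.var(np.log(x)))

def lognormal_cdf(y0):
    # P{Y <= y0} under the lognormal approximation, via the standard normal CDF
    return 0.5 * (1.0 + erf((np.log(y0) - mu) / (sigma * sqrt(2.0))))

for y0 in (0.3, 1.0, 3.0):
    print(f"y = {y0:3.1f}:  empirical P(Y <= y) = {np.mean(y <= y0):.4f},"
          f"  lognormal approx = {lognormal_cdf(y0):.4f}")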
Part II

Stochastic Processes
Chapter 8

Stochastic Processes

8.1 Definition of Stochastic Processes

• A random variable X is a rule for assigning to every outcome ζ of an experiment

S a number X(ζ).

• A stochastic process X(t) is a rule for assigning to every outcome ζ of an exper-

iment S a function X(t, ζ).

Definition 8.1. Stochastic process

A stochastic process X(t) is a family (or an ensemble) of time functions depending

on the parameter ζ. Thus,

X(t) = {X(t, ζ), t ∈ R, ζ ∈ S} (8.1)

where the domain of t is the set R of real numbers and the domain of ζ is the set S

of all experimental outcomes.

Note that we use the notation X(t) to represent a stochastic process omitting its

dependence on ζ. Figure 8.1 illustrates the following interpretation of stochastic

processes X(t):

1. If t and ζ are variables, then X(t) is a stochastic process, i.e., a family (or an
ensemble) of functions X(t, ζ).

2. If t is a variable and ζ = ζi is fixed, then X(t) is a specific time function X(t, ζi )


called sample function (or realization).

3. If t = t0 is fixed and ζ is a variable, then X(t0 ) is a random variable.

4. If t = t0 and ζ = ζi are fixed, then X(t0 ) is a number.


Fig. 8.1. Relationship between stochastic processes, sample functions, random
variables, and numbers.

A stochastic process X(t) is said to be continuous-time if the domain of t is the set

of real numbers R. If the domain of t is the set of integer numbers Z, then X(t) is a

discrete-time process.

We say that X(t) is a discrete-state process if its values are countable. Otherwise,

X(t) is a continuous-state process.

A complex stochastic process Z(t) is the sum

Z(t) = X(t) + jY (t) (8.2)

where X(t) and Y (t) are real stochastic processes. A vector process X (t) (or n-

dimensional process) is a family of n stochastic processes X1 (t), X2 (t), . . . , Xn (t)

X (t) = (X1 (t), X2 (t), . . . , Xn (t)) . (8.3)



Example 8.1. Toss of a die

Consider a random experiment that consists of tossing a die at t = 0. The sample

space S = {ζ1 , ζ2 , . . . , ζ6 } is the set of possible outcomes ζi (i = 1, 2, . . . , 6), where

the index i refers to the number of dots showing on the top face. For each outcome of

the experiment, let us arbitrarily assign the following functions of time t (see Fig. 8.2):

Outcome      waveform (sample function)

ζ1  ↦  X(t, ζ1) = −4,    t ≥ 0
ζ2  ↦  X(t, ζ2) = −2,    t ≥ 0
ζ3  ↦  X(t, ζ3) = 2,     t ≥ 0
ζ4  ↦  X(t, ζ4) = 4,     t ≥ 0
ζ5  ↦  X(t, ζ5) = −t/2,  t ≥ 0
ζ6  ↦  X(t, ζ6) = t/2,   t ≥ 0 .


In this experiment, each waveform X(t, ζi ) represents a sample function, and the

set of waveforms {X(t, ζ1 ), X(t, ζ2 ), . . . , X(t, ζ6 )} represents the stochastic process

X(t), i.e.,

X(t) = {X(t, ζ1 ), X(t, ζ2 ), . . . , X(t, ζ6 ) | t ≥ 0} .


Fig. 8.2: Example of a stochastic process.



Example 8.2. Random Sinusoids

a) Let A be a random variable uniformly distributed in the interval [−1, 1]. Define

sample functions X(t, ζi ) by

X(t, ζi ) = A(ζi ) cos(2πf t + θ), i = 1, 2, . . .

where f and θ are constants. The stochastic process X(t) = {X(t, ζ1 ), X(t, ζ2 ), . . . }

is a family of sinusoids with amplitudes A(ζi ), as shown in Fig. 8.3(a).

b) Let θ be a random variable uniformly distributed in the interval (−π, π]. Define

sample functions X(t, ζi ) by

X(t, ζi ) = A cos(2πf t + θ(ζi )), i = 1, 2, . . .

where A and f are constants. The stochastic process X(t) = {X(t, ζ1 ), X(t, ζ2 ), . . . }

is a family of time-shifted versions of A cos(2πf t), as shown in Fig. 8.3(b).

c) Let f be a random variable uniformly distributed in the interval [−2, 2]. Define

sample functions X(t, ζi ) by

X(t, ζi ) = A cos(2πf (ζi )t + θ), i = 1, 2, . . .

where A and θ are constants. The stochastic process X(t) = {X(t, ζ1 ), X(t, ζ2 ), . . . }

is a family of sinusoids with frequencies f (ζi ), as shown in Fig. 8.3(c).


Fig. 8.3. Sinusoid with (a) random amplitude, (b) random phase, and (c) random
frequency.
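The three families of Example 8.2 are easy to generate numerically. The short Python sketch below (an added illustration assuming NumPy; the parameter values are arbitrary) draws a few sample functions of the random-phase process of case (b).

import numpy as np

rng = np.random.default_rng(7)
A, f = 1.0, 1.0                       # fixed amplitude and frequency
t = np.linspace(0.0, 2.0, 201)        # time grid covering two periods

# Each outcome zeta_i fixes one phase value; each row is one sample function X(t, zeta_i)
thetas = rng.uniform(-np.pi, np.pi, size=5)
samples = np.array([A * np.cos(2 * np.pi * f * t + th) for th in thetas])

print("number of sample functions:", samples.shape[0])
print("X(t=0) across the ensemble:", np.round(samples[:, 0], 3))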

8.2 Statistics of Stochastic Processes

8.2.1 Specifying a Random Process

Let X1 , X2 , . . . , Xn be a sequence of n random variables obtained by sampling the

random process X(t) at the times t1 , t2 , . . . , tn , as shown in Fig. 8.4, i.e.,

X1 = X(t1 ), X2 = X(t2 ), ..., Xn = X(tn ) . (8.4)

Fig. 8.4. Three sample functions of a stochastic process sampled at t1, t2, t3, and t4.

The statistical properties of a random process X(t) are completely determined in

terms of its nth-order joint cumulative distribution function of the random variables

X1 , X2 , . . . , Xn .

nth-Order Distribution and Density. The nth-order (joint cumulative) distribu-

tion function of a stochastic process X(t) is the joint distribution

F (x1 , x2 , . . . , xn ; t1 , t2 , . . . , tn ) = P {X(t1 ) ≤ x1 , X(t2 ) ≤ x2 , . . . , X(tn ) ≤ xn }

(8.5)

of the random variables X1 , X2 , . . . , Xn , and the corresponding nth-order density

equals

∂ n F (x1 , x2 , . . . , xn ; t1 , t2 , . . . , tn )
f (x1 , x2 , . . . , xn ; t1 , t2 , . . . , tn ) = . (8.6)
∂x1 ∂x2 · · · ∂xn

Mean. The mean µX (t) of a stochastic process X(t) is defined by


µX(t) = E{X(t)} = ∫_{−∞}^{∞} x fX(x; t) dx   (8.7)

where fX (x; t) is the PDF of X(t). In general, µX (t) is a function of time. Trends in

the behavior of X(t) are reflected in the variation of µX (t) with time.

Autocorrelation Function. The autocorrelation function RXX (t1 , t2 ) of a stochas-

tic process X(t) is defined as the joint moment of the random variables X(t1 ) and

X ∗ (t2 ):

RXX (t1 , t2 ) = E{X(t1 )X ∗ (t2 )} . (8.8)

Thus, the autocorrelation function is the mean of the product X(t1 )X ∗ (t2 ). Note

that the conjugate term is associated with the second variable in RXX (t1 , t2 ). From

(8.8) it follows that

RXX(t2, t1) = E{X(t2)X∗(t1)} = R∗XX(t1, t2)   (8.9)

and

RXX (t, t) = E{|X(t)|2 } ≥ 0 . (8.10)



The last result shows that the autocorrelation function RXX(t1, t2) evaluated on the diagonal t1 = t2 = t equals the average power of X(t).

Autocovariance Function. The autocovariance function CXX (t1 , t2 ) of a stochastic

process X(t) is defined as the covariance of the random variables X(t1 ) and X(t2 ):

CXX (t1 , t2 ) = E{[X(t1 ) − µX (t1 )][X(t2 ) − µX (t2 )]∗ } . (8.11)

From (6.67), it follows that the autocovariance function can be expressed in terms of

the autocorrelation function and the means

CXX (t1 , t2 ) = RXX (t1 , t2 ) − µX (t1 )µ∗X (t2 ) . (8.12)

Variance. The variance of a stochastic process X(t) is defined by

Var{X(t)} = E{[X(t) − µX (t)][X(t) − µX (t)]∗ }

= CXX (t, t) . (8.13)

Correlation Coefficient. The correlation coefficient ρXX (t1 , t2 ) of a stochastic

process X(t) is by definition the ratio


ρXX(t1, t2) = CXX(t1, t2) / √[CXX(t1, t1) CXX(t2, t2)] .   (8.14)
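
The expected values defined above can be approximated by sample means over a large number of realizations. The sketch below is a minimal NumPy illustration for the random-amplitude sinusoid of Example 8.2 a); the time instants t1, t2, the seed, and the ensemble size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200_000                              # number of realizations (assumed)
t1, t2 = 0.1, 0.4                        # two fixed time instants (assumed)
f, theta = 1.0, 0.0

A = rng.uniform(-1.0, 1.0, size=N)       # random amplitude, E{A} = 0, E{A^2} = 1/3
X1 = A * np.cos(2 * np.pi * f * t1 + theta)   # samples of X(t1) over the ensemble
X2 = A * np.cos(2 * np.pi * f * t2 + theta)   # samples of X(t2) over the ensemble

mu1, mu2 = X1.mean(), X2.mean()                  # estimates of mu_X(t1), mu_X(t2)
R12 = np.mean(X1 * X2)                           # estimate of R_XX(t1, t2), cf. (8.8)
C12 = R12 - mu1 * mu2                            # autocovariance C_XX(t1, t2), cf. (8.12)
rho = C12 / np.sqrt(np.var(X1) * np.var(X2))     # correlation coefficient, cf. (8.14)

print(mu1, mu2, R12, C12, rho)
# Theoretical value of R_XX(t1, t2) for this process:
print((1 / 3) * np.cos(2 * np.pi * f * t1) * np.cos(2 * np.pi * f * t2))
```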

8.2.2 Stationary Processes

Many stochastic processes have the property that the nature of the randomness in the

process does not change with time. In such cases, an observation of the process in the

time interval (t1 , t2 ) exhibits the same type of random behavior as an observation in

some other time interval (t1 + c, t2 + c). This leads us to postulate that probabilities

involving samples taken at times t1 , t2 , . . . , tn will not differ from those taken at

t1 + c, t2 + c, . . . , tn + c.

Definition 8.2. Strict-sense stationary

A stochastic process X(t) is called strict-sense stationary (abbreviated SSS) if its

nth-order density is invariant to a shift of the time origin, i.e.,

f (x1 , x2 , . . . , xn ; t1 , t2 , . . . , tn ) = f (x1 , x2 , . . . , xn ; t1 + c, t2 + c, . . . , tn + c)
(8.15)

for any c.

If X(t) is a SSS process, then the processes X(t) and X(t + c) have the same statistics

for any c (see Fig. 8.5).

Properties of SSS processes:

1. The first-order PDF is independent of time, i.e.,

f (x; t) = f (x; t + c) = f (x) (8.16)

for all t and c.

2. The second-order PDF depends only on the time difference between the samples,

i.e.,

f (x1 , x2 ; t1 , t2 ) = f (x1 , x2 ; τ ) τ = t1 − t2 (8.17)

for all t1 and t2 .

Proof. In particular for c = −t2 , it follows from (8.15)

f (x1 , x2 ; t1 , t2 ) = f (x1 , x2 ; t1 + c, t2 + c)

= f (x1 , x2 ; t1 − t2 , 0)

= f (x1 , x2 ; t1 − t2 )
[Figure: three sample functions of X(t); the samples X(t1), X(t2) have the same statistics as the samples taken at the shifted times t1 + c, t2 + c.]

Fig. 8.5. Illustration of a SSS process X(t).

3. The first property implies that the mean of X(t) is constant

µX (t) = E{X(t)} = µX (8.18)

for all t.

4. The second property implies that the autocorrelation function of X(t) depends

only on τ = t1 − t2 , i.e.,

RXX (t1 , t2 ) = RXX (τ ) τ = t1 − t2 (8.19)

for all t1 and t2 .

In many technical situations, we cannot determine whether a random process is SSS,

since (8.15) is difficult to examine. But we can determine whether the mean is constant

and whether the autocorrelation function is a function of τ = t1 − t2 only. This

motivates us to introduce a further important class of stochastic processes.

Definition 8.3. Wide-sense stationary

A stochastic process X(t) is called wide-sense stationary (abbreviated WSS) if its

mean is constant and its autocorrelation function depends only on τ = t1 − t2 :

E{X(t)} = µX (8.20)

RXX (t1 , t2 ) = RXX (τ ) = E{X(t + τ )X ∗ (t)} . (8.21)

Note that all SSS processes are WSS, since they satisfy (8.20) and (8.21), but a WSS process need not be SSS. This statement can be phrased as follows:

SSS ⟹ WSS, but WSS ⇏ SSS .   (8.22)

Example 8.3. Sinusoid with random phase

Let X(t) = a cos(2πf t + Θ), where a and f are constants and Θ is a random variable,

which is uniformly distributed in the interval (−π, π]. Is the stochastic process X(t)

WSS?

Solution. The mean µX (t) of X(t) is found by using (8.7) and (5.7):

µX(t) = E{X(t)} = E{a cos(2πf t + Θ)}
= (a/2π) ∫_{−π}^{π} cos(2πf t + θ) dθ = 0 .

The autocorrelation function RXX (t1 , t2 ) is obtained by using (8.8):

RXX (t1 , t2 ) = E{X(t1 )X ∗ (t2 )}

= E{a cos(2πf t1 + Θ)a cos(2πf t2 + Θ)} .

Using the identity

cos(x) cos(y) = (1/2)[cos(x + y) + cos(x − y)]

gives

RXX(t1, t2) = (a²/2) E{cos[2πf(t1 + t2) + 2Θ] + cos[2πf(t1 − t2)]}
= (a²/2) E{cos[2πf(t1 + t2) + 2Θ]} + (a²/2) E{cos[2πf(t1 − t2)]}
= 0 + (a²/2) cos[2πf(t1 − t2)]
= (a²/2) cos[2πf(t1 − t2)] .

Note that µX (t) is a constant and RXX (t1 , t2 ) depends only on the time difference

τ = t1 − t2 . Hence, X(t) is WSS.
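
The result of Example 8.3 can be checked by a quick Monte Carlo experiment (a sketch only; a = f = 1, the seed, and the ensemble size are assumed values): the estimated mean is close to zero, and two pairs of time instants with the same difference τ give nearly the same autocorrelation value.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 500_000
a, f = 1.0, 1.0
Theta = rng.uniform(-np.pi, np.pi, size=N)   # random phase, uniform on (-pi, pi]

def R_est(t1, t2):
    """Ensemble estimate of R_XX(t1, t2) = E{X(t1) X(t2)}."""
    return np.mean(a * np.cos(2*np.pi*f*t1 + Theta) * a * np.cos(2*np.pi*f*t2 + Theta))

print(np.mean(a * np.cos(2*np.pi*f*0.3 + Theta)))   # mean at t = 0.3, approx. 0
print(R_est(0.5, 0.2), R_est(1.3, 1.0))             # same tau = 0.3, approx. equal
print(0.5 * a**2 * np.cos(2*np.pi*f*0.3))           # theory: (a^2/2) cos(2*pi*f*tau)
```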

Example 8.4. Sinusoid with random amplitude

Let X(t) = A cos(2πf t + θ), where f and θ are constants and A is a random variable

with mean µA > 0. Is the stochastic process X(t) WSS?

Solution. The mean µX (t) is found by using (8.7):

µX (t) = E{X(t)} = E{A cos(2πf t + θ)}

= E{A} cos(2πf t + θ)

= µA cos(2πf t + θ) .

The autocorrelation function RXX (t1 , t2 ) is

RXX (t1 , t2 ) = E{X(t1 )X ∗ (t2 )}

= E{A cos(2πf t1 + θ)A∗ cos(2πf t2 + θ)}

= E{|A|²} · cos(2πf t1 + θ) cos(2πf t2 + θ)
= (E{|A|²}/2) {cos[2πf(t1 + t2) + 2θ] + cos[2πf(t1 − t2)]} .

Note that µX (t) varies with t and RXX (t1 , t2 ) depends not only on the time difference

τ = t1 − t2 but also on t1 + t2 . Hence, X(t) is not WSS.

8.2.3 Ergodic Processes

A central problem in the application of stochastic processes is that the parameters

of a random process X(t) must be obtained through measurements. For example,

to estimate the mean µX (t) of a stochastic process, we have to average over a large

number of realizations X(t, ζ) of X(t). In many applications, however, we know only

a single sample function of X(t). In such cases, one refers to the ergodicity theorem

(see Theorem 8.1). The ergodicity theorem states conditions under which the time

average of a single sample function of X(t) converges to the ensemble average (mean)

of X(t).

Time average. The time average µ̄X of a single sample function X(t, ζi ) is defined

as

µ̄X = ⟨X(t, ζi)⟩ = lim_{T→∞} (1/2T) ∫_{−T}^{T} X(t, ζi) dt .   (8.23)

Time autocorrelation function. The time autocorrelation function R̄XX (τ ) of a

single sample function X(t, ζi ) is defined as

R̄XX(τ) = ⟨X(t + τ, ζi) X∗(t, ζi)⟩ = lim_{T→∞} (1/2T) ∫_{−T}^{T} X(t + τ, ζi) X∗(t, ζi) dt .   (8.24)

Definition 8.4. A stationary process X(t) is mean-ergodic if its ensemble average

µX equals the time average µ̄X of X(t, ζi ), i.e.,

µX = E{X(t)} = lim_{T→∞} (1/2T) ∫_{−T}^{T} X(t, ζi) dt = µ̄X .   (8.25)

Definition 8.5. A stationary process X(t) is autocorrelation-ergodic if its autocorre-

lation function RXX (τ ) equals the time autocorrelation function R̄XX (τ ) of X(t, ζi ),

i.e.,

RXX(τ) = E{X(t + τ)X∗(t)} = lim_{T→∞} (1/2T) ∫_{−T}^{T} X(t + τ, ζi) X∗(t, ζi) dt = R̄XX(τ) .   (8.26)

Figure 8.6 illustrates a mean-ergodic process X(t), and the following theorem states

the conditions under which a stochastic process X(t) is mean-ergodic.


[Figure: sample functions X(t, ζi) whose individual time averages µ̄X coincide with the ensemble average µX = E{X(t1)}.]

Fig. 8.6. Illustration of a mean-ergodic process X(t).

Theorem 8.1. A stationary process X(t) with autocovariance function CXX (τ ) is

mean-ergodic if and only if (iff )

(1/T) ∫_{0}^{2T} CXX(α) (1 − α/(2T)) dα −→ 0 as T → ∞ .   (8.27)

Proof. Omitted.



Example 8.5. Suppose that Z is a random variable with mean µZ . Is the stochastic

process X(t), defined by X(t) = Z, mean-ergodic?

Solution. In this case, X(t) is an ensemble of straight lines. The ensemble average of

X(t) is

µX = E{X(t)} = E{Z} = µZ .

For a specific outcome ζi , the time average

µ̄X = lim_{T→∞} (1/2T) ∫_{−T}^{T} X(t, ζi) dt = Z(ζi)

is a constant different from µX if Z(ζi) ≠ µX . Hence, the stochastic process X(t) is

not mean-ergodic.
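
The difference between a mean-ergodic and a non-ergodic process can also be seen numerically. In the sketch below (the averaging interval, the seed, and the distributions are our own choices), the time average of one realization of the random-phase sinusoid of Example 8.3 approaches the ensemble mean 0, whereas the time average of one realization of X(t) = Z simply reproduces the constant Z(ζi), which in general differs from µZ.

```python
import numpy as np

rng = np.random.default_rng(3)
t = np.linspace(-500.0, 500.0, 1_000_001)    # large interval approximating T -> infinity

# Mean-ergodic example: sinusoid with random phase (one realization)
theta_i = rng.uniform(-np.pi, np.pi)
x1 = np.cos(2 * np.pi * 1.0 * t + theta_i)
print(x1.mean())           # time average over [-T, T], close to the ensemble mean 0

# Non-ergodic example: X(t) = Z with Z ~ N(5, 1) (one realization)
z_i = rng.normal(5.0, 1.0)
x2 = np.full_like(t, z_i)
print(x2.mean(), z_i)      # time average equals Z(zeta_i), not the ensemble mean 5
```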

8.3 Correlation Functions of Stochastic Processes

8.3.1 The Autocorrelation Function

The autocorrelation function RXX (τ ) is an appropriate measure for the average rate

of change of a random process. Indeed, if a random process changes slowly with time,

then it remains correlated with itself for a long period of time, and RXX (τ ) decreases

slowly as a function of τ . On the other hand, a rapidly varying stochastic process

becomes quickly uncorrelated with itself, and RXX (τ ) decreases rapidly with τ . The

autocorrelation function of stochastic processes plays a crucial role in the design of

linear signal processing algorithms. For this reason, we will discuss here some prop-

erties of autocorrelation functions. In the following, we assume that all stochastic

processes are WSS.

Recall that the autocorrelation function RXX (τ ) of a WSS process X(t) depends only

on the time difference τ = t1 − t2

RXX (τ ) = E{X(t + τ )X ∗ (t)} . (8.28)

Furthermore, if the stochastic process is WSS and autocorrelation-ergodic, then the

autocorrelation function RXX (τ ) can be obtained from a single sample function

X(t, ζi ):
ZT
1
RXX (τ ) = lim X(t + τ, ζi )X ∗ (t, ζi ) dt . (8.29)
T →∞ 2T
−T

A discrete-time (or digital) process is a sequence X(n) of random variables, where n is

an integer. Most results involving continuous-time processes can be readily extended

to digital processes.

The autocorrelation sequence RXX (m) of a WSS discrete-time process X(n) is defined

as

RXX (m) = E{X(n + m)X ∗ (n)} . (8.30)

Finally, if a discrete-time process X(n) is WSS and autocorrelation ergodic, then the

autocorrelation sequence RXX (m) can be obtained from a single sample sequence

X(n, ζi ):

RXX(m) = lim_{N→∞} 1/(2N + 1) Σ_{n=−N}^{N} X(n + m, ζi) X∗(n, ζi) .   (8.31)
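
For a finite record of length N, (8.31) suggests the simple (biased) estimator implemented below. The sketch is our own illustration; the AR(1) test process X(n) = a X(n−1) + W(n) and all parameter values are assumptions, and its theoretical autocorrelation sequence RXX(m) = σ²_W a^|m|/(1 − a²) is used for comparison.

```python
import numpy as np

def acorr_est(x, max_lag):
    """Biased estimate of R_XX(m), m = 0..max_lag, from a single real sample sequence."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    return np.array([np.dot(x[m:], x[:N - m]) / N for m in range(max_lag + 1)])

rng = np.random.default_rng(4)
a, sigma_w, N = 0.8, 1.0, 200_000
w = rng.normal(0.0, sigma_w, size=N)
x = np.zeros(N)
for n in range(1, N):                     # AR(1) recursion x(n) = a x(n-1) + w(n)
    x[n] = a * x[n - 1] + w[n]

print(acorr_est(x, 5))
print([sigma_w**2 * a**m / (1 - a**2) for m in range(6)])   # theoretical values
```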

8.3.2 Properties of Autocorrelation Functions

The autocorrelation function RXX (τ ) of a WSS process has the following properties:

1. RXX (τ ) at τ = 0 gives the average power of the process X(t):

RXX (0) = E{|X(t)|2 } . (8.32)



2. RXX(τ) is such that

RXX(τ) = R∗XX(−τ) .   (8.33)

Note that if X(t) is real-valued, then RXX(τ) = RXX(−τ), i.e., the autocorrelation function is real-valued and even.

Proof.

RXX(−τ) = E{X(t − τ)X∗(t)}
= E{X(t)X∗(t + τ)}
= E{X∗(t + τ)X(t)} = R∗XX(τ) .

3. RXX (τ ) is maximum at τ = 0, i.e.,

|RXX (τ )| ≤ RXX (0) ∀τ . (8.34)

Proof. We make use of Schwarz’s inequality

|E{XY ∗ }|2 ≤ E{|X|2 }E{|Y |2 } (8.35)

which holds for any two random variables X and Y . Applying this equation to

X(t + τ ) and X(t), we obtain

|RXX(τ)|² = |E{X(t + τ)X∗(t)}|²
≤ E{|X(t + τ)|²} E{|X(t)|²} = R²XX(0) .

4. RXX (τ ) is positive definite, that is,

XX
ai a∗j RXX (τi − τj ) ≥ 0 (8.36)
i j

for every ai , aj , τi , and τj .



Proof. This property is a consequence of the identity


   
X 2 X X 
0 ≤ E ai X(ti ) =E ai a∗j X(ti )X ∗ (tj )
   
i i j
XX XX
= ai a∗j E{X(ti )X ∗ (tj )} = ai a∗j RXX (ti − tj ) .
i j i j

5. If RXX (0) = RXX (Tp ), then RXX (τ ) is periodic with period Tp and X(t) is

mean-square periodic, i.e.,

E{|X(t + Tp ) − X(t)|2 } = 0 . (8.37)

Proof. If we apply (8.35) to [X(t + τ + Tp ) − X(t + τ )] and X(t), then we obtain

|E{[X(t + τ + Tp ) − X(t + τ )]X(t)}|2

≤ E{|X(t + τ + Tp ) − X(t + τ )|2 }E{|X(t)|2 } .

This implies that

|RXX (τ + Tp ) − RXX (τ )|2 ≤ 2Re{RXX (0) − RXX (Tp )}RXX (0) .

Thus, RXX (0) = RXX (Tp ) implies that the right-hand side of the equation is

zero, and that

RXX (τ + Tp ) = RXX (τ ) for all τ . (8.38)

Hence, RXX (τ ) is periodic with period Tp . The fact that X(t) is mean-square

periodic follows from

E{|X(t + Tp ) − X(t)|2 } = 2Re{RXX (0) − RXX (Tp )} = 0 .



8.3.3 The Cross-Correlation and Cross-Covariance Function

If we deal with two or more stochastic processes, we are often interested in the rela-

tionship between the processes. In such situations, the following expected values are

useful to describe the relationship between any two WSS processes X(t) and Y (t):

Cross-Correlation Function.

RXY (τ ) = E{X(t + τ )Y ∗ (t)} (8.39)

Cross-Covariance Function.

CXY (τ ) = RXY (τ ) − µX µ∗Y (8.40)

Correlation Coefficient.

ρXY(τ) = CXY(τ) / √[CXX(0) CYY(0)]   (8.41)

Using these expected values, we can determine the degree of dependence between

two stochastic processes. The cross-correlation function RXY (τ ) has the following

properties:

1. RXY(τ) = R∗YX(−τ)   (8.42)

2. |RXY(τ)| ≤ √[RXX(0) RYY(0)]   (8.43)

3. |RXY(τ)| ≤ (1/2)[RXX(0) + RYY(0)]   (8.44)
4. X(t) and Y (t) are orthogonal if

RXY (τ ) = 0 for all τ . (8.45)

5. If X(t) and Y (t) are independent, then

RXY(τ) = µX µY for all τ .   (8.46)



Example 8.6. Suppose that the stochastic processes X(t) and Y (t) are real-valued

and WSS. Find the autocorrelation function of the complex process Z(t) = X(t) +

jY (t).

Solution. Using (8.28), (8.39), and (8.42), we obtain

RZZ (τ ) = E{Z(t + τ )Z ∗ (t)}

= E{[X(t + τ ) + jY (t + τ )][X(t) − jY (t)]}

= E{X(t + τ )X(t)} − jE{X(t + τ )Y (t)}

+ jE{Y (t + τ )X(t)} + E{Y (t + τ )Y (t)}

= RXX (τ ) − jRXY (τ ) + jRY X (τ ) + RY Y (τ )

= RXX (τ ) + RY Y (τ ) + j[RXY (−τ ) − RXY (τ )] .

Thus, the autocorrelation function of a complex process depends in general on the autocorrelation functions of the inphase and quadrature components as well as on their cross-correlation functions.
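
A typical use of the cross-correlation function is time-delay estimation: if Y(t) is a delayed and noisy copy of X(t), then RXY(τ) peaks at the (negative) delay. The discrete-time sketch below is our own example; the delay of 25 samples, the noise level, and the white input are assumptions made for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(5)
N, d = 100_000, 25                            # record length and true delay (assumed)
x_full = rng.normal(size=N + d)
X = x_full[d:]                                # X(n)
Y = x_full[:N] + 0.3 * rng.normal(size=N)     # Y(n) = X(n - d) + measurement noise

def xcorr_est(x, y, max_lag):
    """Biased estimate of R_XY(m) = E{X(n+m) Y(n)} for |m| <= max_lag (real signals)."""
    N = len(x)
    lags = np.arange(-max_lag, max_lag + 1)
    r = np.empty(lags.size)
    for i, m in enumerate(lags):
        if m >= 0:
            r[i] = np.dot(x[m:], y[:N - m]) / N
        else:
            r[i] = np.dot(x[:N + m], y[-m:]) / N
    return lags, r

lags, r = xcorr_est(X, Y, 40)
print(lags[np.argmax(r)])   # approx. -d, since R_XY(m) = R_XX(m + d) peaks at m = -d
```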

8.4 The Power Spectral Density

We now present the Wiener-Khinchin theorem, which states that the power spectral

density of a WSS process is given by the Fourier transform of the autocorrelation

function.

Definition 8.6. Wiener-Khinchin relation

The power spectral density (or power spectrum) SXX (ω) of a WSS process X(t) is

defined as the Fourier transform of its autocorrelation function RXX (τ ):

SXX(ω) = F{RXX(τ)} = ∫_{−∞}^{∞} RXX(τ) e^{−jωτ} dτ .   (8.47)

We also use the notation SXX (ω) •—◦ RXX (τ ). From the Fourier inversion formula,

it follows that

RXX(τ) = F⁻¹{SXX(ω)} = (1/2π) ∫_{−∞}^{∞} SXX(ω) e^{jωτ} dω .   (8.48)
Properties of the Power Spectral Density:

1. SXX (ω) is a real and nonnegative function of ω.

2. If X(t) is real, then RXX (τ ) is real and even and hence SXX (ω) is also real and

even, i.e.,

SXX (ω) = SXX (−ω) . (8.49)

3. The average power of X(t) is given by


E{|X(t)|²} = RXX(0) = (1/2π) ∫_{−∞}^{∞} SXX(ω) dω .   (8.50)

4. If X(t) has periodic components, then SXX (ω) will have impulse components.

Definition 8.7. Cross-power spectral density

The cross-power spectral density (or cross-power spectrum) SXY (ω) of two processes

X(t) and Y (t) is the Fourier transform of their cross-correlation function RXY (τ ):

SXY(ω) = F{RXY(τ)} = ∫_{−∞}^{∞} RXY(τ) e^{−jωτ} dτ .   (8.51)

From the Fourier inversion formula, we obtain

RXY(τ) = F⁻¹{SXY(ω)} = (1/2π) ∫_{−∞}^{∞} SXY(ω) e^{jωτ} dω .   (8.52)

Properties of the Cross-Power Spectral Density:

1. Since RXY (τ ) = RY∗ X (−τ ), it follows that

SXY (ω) = SY∗ X (ω) . (8.53)

2. The real part of SXY (ω) is an even function of ω, and the imaginary part of

SXY (ω) is an odd function of ω.

3. If X(t) and Y (t) are orthogonal, then

SXY (ω) = 0 . (8.54)

4. If X(t) and Y (t) are independent, then

SXY (ω) = 2πµX µY δ(ω) . (8.55)

Some important and frequently used autocorrelation functions RXX (τ ) and the cor-

responding power spectral densities SXX (ω) are listed in Table 8.1.

Autocorrelation function RXX(τ)        Power spectral density SXX(ω)

δ(τ)                ◦—•   1
1                   ◦—•   2πδ(ω)
e^{jω0τ}            ◦—•   2πδ(ω − ω0)
cos(ω0τ)            ◦—•   π[δ(ω + ω0) + δ(ω − ω0)]
sin(ω0τ)            ◦—•   jπ[δ(ω + ω0) − δ(ω − ω0)]
sin(ω0τ)/(πτ)       ◦—•   rect(ω/(2ω0)) = 1 for |ω| < ω0, 0 for |ω| > ω0
e^{−ω0|τ|}          ◦—•   2ω0/(ω0² + ω²)
e^{−(ω0τ)²}         ◦—•   (√π/ω0) e^{−(ω/(2ω0))²}

Table 8.1. Frequently used autocorrelation functions RXX(τ) and corresponding power spectral densities SXX(ω).
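
The Wiener–Khinchin relation can be checked numerically in discrete time by estimating the autocorrelation sequence of a process and summing RXX(m) e^{−jΩm}. The sketch below does this for the AR(1)-type process X(n) = a X(n−1) + W(n) and compares the result with its known power spectrum σ²_W/|1 − a e^{−jΩ}|²; the test process, its parameters, and the number of lags kept are our own choices.

```python
import numpy as np

rng = np.random.default_rng(6)
a, sigma_w, N, L = 0.8, 1.0, 400_000, 64      # L = number of lags kept (assumed)
w = rng.normal(0.0, sigma_w, size=N)
x = np.zeros(N)
for n in range(1, N):
    x[n] = a * x[n - 1] + w[n]

# Biased autocorrelation estimates for lags 0..L, extended to -L..L by symmetry
r_pos = np.array([np.dot(x[m:], x[:N - m]) / N for m in range(L + 1)])
lags = np.arange(-L, L + 1)
r = np.concatenate([r_pos[:0:-1], r_pos])      # R_XX(-m) = R_XX(m) for a real process

# Wiener-Khinchin: S_XX(Omega) = sum_m R_XX(m) exp(-j*Omega*m)
Omega = np.linspace(-np.pi, np.pi, 257)
S_est = np.real(np.exp(-1j * np.outer(Omega, lags)) @ r)
S_theory = sigma_w**2 / np.abs(1.0 - a * np.exp(-1j * Omega))**2

print(np.max(np.abs(S_est - S_theory)))   # small when L covers the correlation decay
```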

Example 8.7. Sinusoid with random phase

Find the power spectral density of X(t) = a cos(ω0 t + Θ), where a > 0 and ω0 > 0

are constants, and Θ is uniformly distributed in the interval (−π, π).

Solution. From Example 8.3, we know that the autocorrelation function of X(t) is

RXX(τ) = (a²/2) cos(ω0τ) .

Thus, using Table 8.1, it follows

SXX(ω) = F{RXX(τ)}
= F{(a²/2) cos(ω0τ)}
= (a²π/2) [δ(ω + ω0) + δ(ω − ω0)] .

Note that the power spectral density is infinite at ±ω0 , but the average power of X(t)

is RXX (0) = a2 /2.

8.5 Linear Time Invariant Systems with Stochastic Inputs

In many cases, physical systems are modeled as linear time invariant (LTI) systems.

If the input signal of an LTI system is deterministic, then the output signal can be

computed in the time domain via the convolution integral or in the spectral domain

via the Fourier transform. In this section, we develop techniques for calculating the

response of LTI systems driven by stochastic input signals.

8.5.1 Linear Time Invariant Systems

Let us consider a system in which a stochastic input process X(t) is mapped into a

stochastic output process Y (t) by the transformation

Y (t) = L[X(t)] . (8.56)



The system is said to be LTI if it has the following properties:

1. The system is linear, i.e.,

L[a1 X1 (t) + a2 X2 (t)] = a1 L[X1 (t)] + a2 L[X2 (t)] (8.57)

for any a1 , a2 , X1 (t), and X2 (t).

2. The system is time-invariant, i.e., if Y (t) = L[X(t)], then

Y (t − t0 ) = L[X(t − t0 )] . (8.58)

We shall assume throughout that all linear systems under consideration are time-

invariant. A linear system is completely specified by its impulse response. The impulse

response h(t) of a linear system is defined by

h(t) = L[δ(t)] (8.59)

where δ(t) is the delta function. The Fourier transform of the impulse response h(t)

is known as the transfer function

H(ω) = F{h(t)} = ∫_{−∞}^{∞} h(t) e^{−jωt} dt .   (8.60)

If the input to a linear system is a stochastic process X(t), as shown in Fig. 8.7, then

the output of the system is the stochastic process

Y(t) = X(t) ⋆ h(t) = ∫_{−∞}^{∞} X(t − τ) h(τ) dτ .   (8.61)

X(t) ──► h(t) ──► Y(t)

Fig. 8.7. A linear system with a stochastic input signal.



8.5.2 The Mean and Autocorrelation Function of the Output Process

In the following, we compute the mean and the autocorrelation function of the output

process Y (t) of a linear system with impulse response h(t), where we assume that the

input process X(t) is WSS.

Mean of Y(t). The mean of Y(t) is given by

E{Y(t)} = E{ ∫_{−∞}^{∞} X(t − τ) h(τ) dτ }
= ∫_{−∞}^{∞} E{X(t − τ)} h(τ) dτ
= µX ∫_{−∞}^{∞} h(τ) dτ
= µX H(0)   (8.62)

where H(0) is the transfer function of the linear system at ω = 0. In the second last

identity, we have used the fact that µX = E{X(t − τ )}, since X(t) is WSS. Equation

(8.62) shows that the mean of the output Y (t) does not depend on time.

Autocorrelation Function of Y (t). The autocorrelation function of Y (t) is given

by

RYY(τ) = E{Y(t + τ)Y∗(t)}
= E{ ∫_{−∞}^{∞} X(t + τ − s) h(s) ds ∫_{−∞}^{∞} X∗(t − r) h(r) dr }
= ∫_{−∞}^{∞} ∫_{−∞}^{∞} h(s) h(r) E{X(t + τ − s)X∗(t − r)} ds dr
= ∫_{−∞}^{∞} ∫_{−∞}^{∞} h(s) h(r) RXX(τ + r − s) ds dr   (8.63)

where we have used the fact that X(t) is WSS. Note that the autocorrelation function

of Y (t) depends only on τ , and since the mean E{Y (t)} is a constant, it follows that

Y (t) is a WSS process.

The above results can be phrased as follows:

If X(t) is a WSS process, then Y (t) is also a WSS process.

In particular, it can be shown that if the input process is a Gaussian WSS process,

then the output process Y (t) will also be a Gaussian WSS process.

8.5.3 The Power Spectral Density of the Output Process

The following theorem relates the input and output power spectral densities to the

transfer function.

Theorem 8.2. Wiener-Lee relation

The power spectral density SY Y (ω) of the output process Y (t) is given by

SY Y (ω) = |H(ω)|2 SXX (ω). (8.64)

Proof. Taking the Fourier transform of RY Y (τ ) as given in (8.63), we obtain

SYY(ω) = ∫_{−∞}^{∞} RYY(τ) e^{−jωτ} dτ
= ∫_{−∞}^{∞} ∫_{−∞}^{∞} ∫_{−∞}^{∞} h(s) h(r) RXX(τ + r − s) e^{−jωτ} ds dr dτ .

Changing variables by using u = τ + r − s gives

SYY(ω) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} ∫_{−∞}^{∞} h(s) h(r) RXX(u) e^{−jω(u−r+s)} ds dr du
= ∫_{−∞}^{∞} h(s) e^{−jωs} ds ∫_{−∞}^{∞} h(r) e^{jωr} dr ∫_{−∞}^{∞} RXX(u) e^{−jωu} du
= H(ω) H∗(ω) SXX(ω)
= |H(ω)|² SXX(ω) .
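
In discrete time, the Wiener–Lee relation together with (8.63) implies that an FIR filter h(n) driven by zero-mean white noise of power σ² produces an output with RYY(m) = σ² Σ_k h(k) h(k+m). The sketch below checks this numerically; the 5-tap filter, the input power, and the record length are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(7)
N = 1_000_000
sigma2 = 2.0                                   # input power (white, zero-mean, assumed)
h = np.array([0.2, 0.3, 0.3, 0.15, 0.05])      # impulse response of an FIR filter (assumed)

x = rng.normal(0.0, np.sqrt(sigma2), size=N)
y = np.convolve(x, h, mode="full")[:N]         # Y(n) = sum_k h(k) X(n-k)

def acorr(v, m):
    """Biased autocorrelation estimate at lag m from a single realization."""
    return np.dot(v[m:], v[:len(v) - m]) / len(v)

for m in range(4):
    theory = sigma2 * np.sum(h[:len(h) - m] * h[m:])   # R_YY(m) = sigma^2 sum_k h(k)h(k+m)
    print(m, acorr(y, m), theory)
```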

8.5.4 The Cross-Correlation Function and the Cross-Power Spectral Density Between the Input and the Output Process

The cross-correlation properties between the input and output processes are also of

interest. Some important relations are given in the following theorem.

Theorem 8.3. The cross-correlation functions and the cross-power spectral densities between the input process X(t) of a linear system and its response Y (t) are given by

RXY (τ ) = RXX (τ ) ⋆ h∗ (−τ ) RY X (τ ) = RXX (τ ) ⋆ h(τ ) (8.65)

SXY (ω) = SXX (ω) · H ∗ (ω) SY X (ω) = SXX (ω) · H(ω) . (8.66)

Proof. The cross-correlation function RXY (τ ) is obtained as follows:



RXY(τ) = E{X(t + τ)Y∗(t)}
= E{X(t + τ) ∫_{−∞}^{∞} X∗(t − r) h∗(r) dr}
= ∫_{−∞}^{∞} E{X(t + τ)X∗(t − r)} h∗(r) dr
= ∫_{−∞}^{∞} E{X(t + τ + r)X∗(t)} h∗(r) dr
= ∫_{−∞}^{∞} RXX(τ + r) h∗(r) dr
= ∫_{−∞}^{∞} RXX(τ − r) h∗(−r) dr
= RXX(τ) ⋆ h∗(−τ) .   (8.67)

By using (8.33) and (8.42), we obtain

RY X (τ ) = RXX (τ ) ⋆ h(τ ) . (8.68)

The cross-power spectral density SXY (ω) is obtained by taking the Fourier transform

of (8.67). Thus,

SXY (ω) = F{RXY (τ )}

= SXX (ω) · H ∗ (ω). (8.69)

Finally, the Fourier transform of (8.68) gives

SY X (ω) = SXX (ω) · H(ω) . (8.70)

Note that this equation follows also from (8.69) by using (8.53) and taking into account

that SXX (ω) is real.



8.6 White Noise

Definition 8.8. White noise

A white noise process N (t) is a stochastic process whose power spectral density equals

N0 /2 for all frequencies, i.e.,

SNN(ω) = N0/2 for all ω .   (8.71)

A stochastic process described by (8.71) is said to be “white” by analogy to white

light, which contains all frequency components in equal amounts. Using Table 8.1,

we find that the autocorrelation function of a white noise process N (t) is given by

RNN(τ) = (N0/2) δ(τ) .   (8.72)

Note that the average power of a white noise process is infinite, since RN N (0) = ∞.

Note also that (8.72) implies that N (t) and N (t + τ ) are uncorrelated for any value

of τ ≠ 0.
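
In discrete time, white noise is simply a sequence of zero-mean, uncorrelated random variables, and its autocorrelation sequence is a scaled unit impulse. A minimal sketch (the Gaussian distribution and the variance value are assumptions):

```python
import numpy as np

rng = np.random.default_rng(8)
N, var = 1_000_000, 1.5
n = rng.normal(0.0, np.sqrt(var), size=N)      # white Gaussian noise samples

est = [np.dot(n[m:], n[:N - m]) / N for m in range(5)]
print(est)    # approx. [var, 0, 0, 0, 0], i.e. R_NN(m) = var * delta(m)
```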

Example 8.8. Filtered white noise

Consider a linear system with the transfer function

H(ω) = rect(ω/(2ω0)) = { 1 for |ω| < ω0, 0 for |ω| > ω0 } .   (8.73)

Suppose that the input process of this system is white noise N (t). Find the power

spectral density SY Y (ω) and the autocorrelation function RY Y (τ ) of the output pro-

cess Y (t).

Solution. From the Wiener-Lee relation (8.64), it follows that SY Y (ω) is given by

SYY(ω) = |H(ω)|² SNN(ω)
= rect(ω/(2ω0)) · N0/2
= { N0/2 for |ω| < ω0, 0 for |ω| > ω0 } .   (8.74)

Taking the inverse Fourier transform of SY Y (ω) gives the autocorrelation function

RYY(τ) = (N0/2) · sin(ω0τ)/(πτ) .   (8.75)

Note that the average power of the filtered white noise process Y (t) is given by

RY Y (0) = N0 ω0 /(2π). The power spectral density SY Y (ω) and the autocorrelation

function RY Y (τ ) of a filtered white noise process are shown in Fig. 8.8.

[Figure: (a) a rectangular power spectral density of height N0/2 on |ω| < ω0; (b) the corresponding sinc-shaped autocorrelation function with maximum N0ω0/(2π) and zeros at integer multiples of π/ω0.]

Fig. 8.8. (a) Power spectral density SYY(ω) and (b) autocorrelation function RYY(τ) of a filtered white noise process.

8.7 Applications

8.7.1 System Identification Using White Noise

The impulse response of an unknown system can be determined by proceeding as

follows:

1. Excite the unknown system by white noise

SXX(ω) = N0/2 •—◦ RXX(τ) = (N0/2) δ(τ) .   (8.76)

2. Determine by measurements the cross-correlation function RY X (τ ) of the output

process Y (t) and the input process (white noise) X(t) and notice that RY X (τ )

can be expressed by using (8.65) as

RYX(τ) = RXX(τ) ⋆ h(τ)
= (N0/2) δ(τ) ⋆ h(τ)
= (N0/2) h(τ) .   (8.77)

3. Finally, from (8.77), it follows that the impulse response of the unknown system

can be identified as

h(τ) = (2/N0) RYX(τ) .   (8.78)
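
The three steps translate directly into a discrete-time simulation. In the sketch below (our own setup: an FIR system playing the role of the unknown system, and unit-variance white noise, so that N0/2 corresponds to σ² = 1), the estimated cross-correlation RYX(m) reproduces the impulse response.

```python
import numpy as np

rng = np.random.default_rng(9)
N = 1_000_000
h_true = np.array([0.5, -0.3, 0.8, 0.1])       # "unknown" system (assumed for the demo)

# Step 1: excite the system with white noise of power sigma^2 = 1
x = rng.normal(0.0, 1.0, size=N)
y = np.convolve(x, h_true, mode="full")[:N]    # measured output

# Step 2: estimate R_YX(m) = E{Y(n+m) X(n)};  Step 3: h(m) = R_YX(m) / sigma^2
sigma2 = 1.0
h_est = np.array([np.dot(y[m:], x[:N - m]) / N for m in range(6)]) / sigma2
print(h_est)    # approx. [0.5, -0.3, 0.8, 0.1, 0, 0]
```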

8.7.2 Matched Filter

Consider the transmission system shown in Fig. 8.9. The transmitted signal X(t) is

disturbed by additive white noise N (t). To minimize the bit error probability, a large

signal-to-noise ratio (SNR) is required. A receive filter h(t) that is “matched” to the transmit filter g(t) is called a matched filter. Its impulse response is given by

h(t) = c · g(T0 − t) (8.79)

where c and T0 are constants.


[Figure: the data sequence Xn is passed through the transmit filter g(t), giving X(t); white noise N(t) is added on the channel; the receive filter h(t) produces Y(t), which is sampled at t = TA to give Y(TA).]

Fig. 8.9. Transmission system with additive white noise.



In the following, we show that the matched filter maximizes the signal-to-noise ratio.

Proof. The signal before sampling can be expressed by

Y (t) = [X(t) + N (t)] ⋆ h(t) .

The average signal power at the output of the receiver at t = TA is given by

S = E{Y²(TA)} = E{[X(TA) ⋆ h(TA)]²}
= E{Xn²} [ ∫_{−∞}^{∞} h(τ) g(T0 − τ) dτ ]²

and the average noise power at the output of the receiver is obtained as

N = N0 ∫_{−∞}^{∞} h²(t) dt .

The SNR can be written as

S/N = E{Xn²} [ ∫_{−∞}^{∞} h(τ) g(T0 − τ) dτ ]² / ( N0 ∫_{−∞}^{∞} h²(t) dt ) .

By using the average energy of the transmission signal

ES = E{Xn²} ∫_{−∞}^{∞} g²(t) dt

we can express the SNR as

S/N = (ES/N0) · [ ∫_{−∞}^{∞} h(τ) g(T0 − τ) dτ ]² / ( ∫_{−∞}^{∞} g²(t) dt · ∫_{−∞}^{∞} h²(t) dt ) .

Using the Cauchy–Schwarz inequality

[ ∫_{−∞}^{∞} h(τ) g(T0 − τ) dτ ]² ≤ ∫_{−∞}^{∞} g²(t) dt · ∫_{−∞}^{∞} h²(t) dt

gives

S/N ≤ ES/N0 .
The equality holds if

h(t) = c · g(T0 − t) .
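
The SNR advantage of the matched filter is easy to illustrate with a discrete-time version of the SNR expression derived above (our own sketch; the transmit pulse g, the constants c = 1, E{Xn²} = 1, and N0 = 0.5 are assumed values).

```python
import numpy as np

g = np.array([1.0, 0.8, 0.6, 0.4])        # transmit pulse g(n) (assumed)
N0, Ex2 = 0.5, 1.0                        # noise level and E{X_n^2} (assumed)

def snr(h):
    """Discrete-time analogue of the SNR expression derived in the proof."""
    num = Ex2 * np.sum(h * g[::-1])**2    # [sum_k h(k) g(T0 - k)]^2 with T0 = len(g)-1
    den = N0 * np.sum(h**2)
    return num / den

h_matched = g[::-1]                       # matched filter h(n) = g(T0 - n), c = 1
h_other = np.array([1.0, 0.0, 0.0, 0.0])  # an arbitrary mismatched filter

print(snr(h_matched))   # equals E_S / N_0 = Ex2 * sum g^2 / N0  (the maximum)
print(snr(h_other))     # strictly smaller
```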
