Chapter II
Random Signals
This chapter provides a summary of definitions and fundamental concepts of probability spaces, random variables, and random processes. References [1]–[8] provide more extensive background and examples.
1 Kolmogorov's Axiomatic Definition of Probability
A probability space is a triple $(\Omega, \mathcal{F}, P)$ where $\Omega$ is the sample space, $\mathcal{F}$ is a $\sigma$-field defined on $\Omega$, and $P$ is a probability measure defined on $\mathcal{F}$.
The sample space $\Omega$ represents the set of outcomes of a random experiment, e.g., $\Omega = \{\text{Head}, \text{Tail}\}$ for a coin flip, or $\Omega = \mathbb{R}^+$ for the time to the next child birth in Urbana, Illinois.
The sigma field $\mathcal{F}$ is a collection of subsets of $\Omega$ (viewed as events) such that
1. $\emptyset \in \mathcal{F}$ and $\Omega \in \mathcal{F}$
2. If $\mathcal{E}_1, \mathcal{E}_2 \in \mathcal{F}$ then $\mathcal{E}_1 \cup \mathcal{E}_2 \in \mathcal{F}$
3. If $\mathcal{E} \in \mathcal{F}$ then $\mathcal{E}^c \in \mathcal{F}$
4. $\mathcal{F}$ is closed under countable unions and intersections, i.e.,
$$\bigcup_{i=1}^{\infty} \mathcal{E}_i \in \mathcal{F} \quad \text{and} \quad \bigcap_{i=1}^{\infty} \mathcal{E}_i \in \mathcal{F}.$$
The probability measure $P$ assigns a number $P(\mathcal{E})$ to each $\mathcal{E} \in \mathcal{F}$, such that
1. $P(\mathcal{E}) \ge 0$
2. $P(\Omega) = 1$
3. $P(\mathcal{E}_1 \cup \mathcal{E}_2) = P(\mathcal{E}_1) + P(\mathcal{E}_2)$ for disjoint sets: $\mathcal{E}_1 \cap \mathcal{E}_2 = \emptyset$
4. $P\left(\bigcup_{i=1}^{\infty} \mathcal{E}_i\right) = \sum_{i=1}^{\infty} P(\mathcal{E}_i)$ if $\mathcal{E}_i \cap \mathcal{E}_j = \emptyset$ for all $i \ne j$
If $\Omega$ is a countable set, $\mathcal{F}$ may be the ensemble of all subsets of $\Omega$. However, this cannot be the case when $\Omega$ is uncountable, e.g., $\Omega = [a, b]$, because it would not be possible to define a probability measure (satisfying the above axioms) on $\mathcal{F}$. When $\Omega = [a, b]$, one may take $\mathcal{F}$ to be the Borel $\sigma$-field defined over $[a, b]$. The Borel $\sigma$-field is the smallest $\sigma$-field on $[a, b]$ that contains all the open subsets of $[a, b]$.
2 Random Variables
Given a probability space $(\Omega, \mathcal{F}, P)$, we may formally define a real-valued random variable $X$ as a measurable function from $\Omega$ to $\mathbb{R}$, i.e.,
$$\{\omega : X(\omega) \le x\} \in \mathcal{F}, \quad \forall x \in \mathbb{R}.$$
The cumulative distribution function (cdf) for $X$ may then be defined as
$$F_X(x) \triangleq \Pr[X \le x] = P(\{\omega : X(\omega) \le x\}).$$
Observe that for every Borel set $B$ in $\mathbb{R}$, the set $\{\omega : X(\omega) \in B\}$ must correspond to an event $\mathcal{E} \in \mathcal{F}$, i.e., it must be in the domain of the probability function $P(\cdot)$. All sets of interest are unions or intersections of such sets.
A cdf is nondecreasing, right-continuous, and satisfies
$$\lim_{x \to -\infty} F_X(x) = 0 \quad \text{and} \quad \lim_{x \to +\infty} F_X(x) = 1.$$
Two types of random variables are often encountered in practice:
• Discrete random variables, for which the probability of $X$ is concentrated at a countable set of points $x_i$, $i \ge 1$:
$$P(X \in \{x_i, i \ge 1\}) = 1.$$
The cdf for discrete random variables is piecewise constant (a staircase function). There is a jump at each point $x_i$; the amplitude of this jump is given by the probability mass function (pmf)
$$p_X(x_i) = P(X = x_i), \quad i \ge 1.$$
The pmf is nonnegative and sums to 1.
• Continuous random variables, for which the cdf is the integral of a function:
$$F_X(x) = \int_{-\infty}^{x} f_X(u)\, du.$$
The function $f_X$ is nonnegative and integrates to one; $f_X$ is the probability density function (pdf) for $X$.
Other random variables are of a mixed discrete/continuous type, e.g., consider $X = 0$ with probability $\tfrac{1}{2}$ and $X$ uniformly distributed over the interval $[1, 2]$ with probability $\tfrac{1}{2}$.
The mathematical expectation of a random variable $X$ is defined as
$$EX \triangleq \int_{\mathbb{R}} x\, dF_X(x),$$
which may also be written as
$$EX = \int_{\Omega} X(\omega)\, dP(\omega).$$
These expressions simplify to the standard formulas $EX = \sum_{i \in I} x_i\, p_X(x_i)$ and $EX = \int x f_X(x)\, dx$ for discrete and continuous random variables, respectively.
3 Convergence of Random Sequences
Let $X_n$, $1 \le n < \infty$, be a sequence of random variables indexed by the integer $n$. We are interested in the convergence property $X_n \to X$. For a deterministic sequence, we would use the classical definition: $X_n \to X$ means that
$$\forall \epsilon > 0, \ \exists n(\epsilon) : \ |X_n - X| < \epsilon \quad \forall n > n(\epsilon).$$
For random sequences, the definition must involve the elements $\omega$ of $\Omega$.
3.1 Types of Convergence
We shall encounter the following types of convergence:
1. $X_n \to X$ surely means
$$\lim_{n \to \infty} X_n(\omega) = X(\omega) \quad \forall \omega \in \Omega.$$
2. $X_n \to X$ almost surely (with probability 1) means
$$P\left(\lim_{n \to \infty} X_n(\omega) = X(\omega)\right) = 1.$$
3. $X_n \to X$ in probability means
$$\lim_{n \to \infty} P(|X_n(\omega) - X(\omega)| < \epsilon) = 1 \quad \forall \epsilon > 0.$$
4. $X_n \to X$ in the mean square (m.s.) sense means
$$\lim_{n \to \infty} E|X_n(\omega) - X(\omega)|^2 = 0.$$
5. $X_n \to X$ in distribution means
$$\lim_{n \to \infty} F_n(x) = F(x) \quad \text{(pointwise, at all continuity points of } F\text{)}$$
where $F_n$ and $F$ are the distribution functions of $X_n$ and $X$, respectively.
• Convergence surely implies convergence a.s.
• Convergence a.s. implies convergence in probability; so does convergence in m.s.
• Convergence in probability implies convergence in distribution.
3.2 Law of Large Numbers
Let $X_i$, $1 \le i < \infty$, be i.i.d. random variables with mean $\mu$ and finite variance. Then the sample average
$$\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$$
converges in probability to $\mu$ (weak law of large numbers); as well, $\bar{X}_n$ converges a.s. to $\mu$ (strong law of large numbers).
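A quick numerical illustration of the law of large numbers (a minimal sketch, not part of the original notes; the uniform distribution and sample sizes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 0.5                                    # true mean of Uniform[0, 1]

for n in [10, 100, 10_000, 1_000_000]:
    x = rng.uniform(0.0, 1.0, size=n)       # i.i.d. samples with finite variance
    xbar = x.mean()                         # sample average (1/n) sum_i X_i
    print(f"n = {n:>9}: sample average = {xbar:.5f}, |error| = {abs(xbar - mu):.5f}")
```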
3.3 Central Limit Theorem
Let $X_i$, $1 \le i < \infty$, be i.i.d. random variables with mean $\mu$ and variance $\sigma^2$. Then the normalized sum
$$Z_n = \frac{1}{\sigma\sqrt{n}} \sum_{i=1}^{n} (X_i - \mu)$$
converges in distribution to the normal distribution $\mathcal{N}(0, 1)$.
A generalization of this problem is the case of variable components. Let $X_i$, $1 \le i < \infty$, be independent random variables with common mean $\mu$ but unequal variances $\sigma_i^2$, $1 \le i < \infty$ [2, p. 213] [7, p. 518]. Define the sum of variances
$$S_n^2 = \sum_{i=1}^{n} \sigma_i^2.$$
Then the normalized sum
$$Z_n = \frac{1}{S_n} \sum_{i=1}^{n} (X_i - \mu)$$
converges in distribution to the normal distribution $\mathcal{N}(0, 1)$ if Lindeberg's conditions are satisfied:
$$\forall \epsilon > 0, \ \exists n(\epsilon) : \ \frac{\sigma_i}{S_n} < \epsilon \quad \forall n > n(\epsilon), \ i = 1, \ldots, n.$$
We may interpret the ratio $\sigma_i / S_n$ as the contribution of component $i$ to the weighted sum $Z_n$. For the CLT to apply, $Z_n$ must be a sum of many asymptotically negligible components; no single component is allowed to dominate the sum.
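A minimal simulation sketch of the basic (i.i.d.) CLT above; the exponential distribution, sample size, and number of trials are arbitrary illustrative choices. It compares a few empirical quantiles of the normalized sum $Z_n$ with those of $\mathcal{N}(0, 1)$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 200, 20_000
mu, sigma = 1.0, 1.0                                # mean and std of Exp(1)

X = rng.exponential(scale=1.0, size=(trials, n))
Z = (X - mu).sum(axis=1) / (sigma * np.sqrt(n))     # normalized sums Z_n

# Compare empirical quantiles of Z_n with standard normal quantiles
probs = [0.10, 0.25, 0.50, 0.75, 0.90]
normal_q = [-1.2816, -0.6745, 0.0, 0.6745, 1.2816]  # quantiles of N(0, 1)
for p, q_norm, q_emp in zip(probs, normal_q, np.quantile(Z, probs)):
    print(f"p = {p:.2f}: empirical {q_emp:+.3f}   N(0,1) {q_norm:+.3f}")
```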
4 Random Processes
Definition: A random process is a collection of random variables $X(t)$, $t \in \mathcal{T}$, where $\mathcal{T}$ is the index set.
Usually $\mathcal{T}$ is not a finite set. We shall distinguish between the case where $\mathcal{T}$ is a countable index set, e.g., $\{0, 1, \ldots\}$, $\mathbb{Z}$, and $\mathbb{Z}^d$, and the case where it is an uncountable set, such as an interval on the real line.
One difficulty that arises when $|\mathcal{T}| = \infty$ is that the notion of a pdf breaks down. Moreover, the definition of infinite-dimensional cdfs requires care.
Let us start with finite-dimensional cdfs. The definition of one-dimensional cdfs,
$$F_t(x) = P[X(t) \le x],$$
is naturally extended to two dimensions:
$$F_{t_1, t_2}(x_1, x_2) = P[X(t_1) \le x_1,\; X(t_2) \le x_2]$$
and to any finite number $n$ of dimensions:
$$F_{t_1, \ldots, t_n}(x_1, \ldots, x_n) = P[X(t_1) \le x_1, \ldots, X(t_n) \le x_n].$$
This defines a hierarchy of multidimensional cdfs of all orders. They must satisfy the compatibility conditions
$$\lim_{x_2 \to \infty} F_{t_1, t_2}(x_1, x_2) = F_{t_1}(x_1), \quad \text{etc.}$$
In practice, fortunately, there exist simple mechanisms for generating probability distributions.
4.1 Kolmogorov's Extension Theorem
Having defined finite-dimensional distributions, we need to extend this concept to the whole index set $\mathcal{T}$. Formally, we need to specify a probability space $(\Omega^{\mathcal{T}}, \mathcal{F}^{\mathcal{T}}, P^{\mathcal{T}})$ that is compatible with all finite-dimensional probability spaces $(\Omega^n, \mathcal{F}^n, P^n)$. If $\mathcal{T}$ is countable (say $\mathbb{N}$), $\mathcal{F}^{\mathcal{T}}$ is the smallest $\sigma$-field of subsets of $\mathbb{R}^{\mathcal{T}}$ containing all finite-dimensional rectangles [4, p. 21].
The existence of such a probability space $(\Omega^{\mathcal{T}}, \mathcal{F}^{\mathcal{T}}, P^{\mathcal{T}})$ is guaranteed by Kolmogorov's Extension Theorem [5, p. 16] [4, p. 24]. Furthermore, this extension is unique when $\mathcal{T}$ is countable.
Examples (with $\mathcal{T} = \mathbb{Z}^2$; a small simulation sketch follows the list):
• $x(n_1, n_2)$ = i.i.d. random variables (spatially white noise)
• $x(n_1, n_2) \equiv A$, where $A$ is a random variable
• $x(n_1, n_2) \in \{0, 1\}$: binary random field
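A small illustrative sketch (the window size and the specific distributions are arbitrary assumptions) that generates one realization of each of the three example fields on a finite window of $\mathbb{Z}^2$:

```python
import numpy as np

rng = np.random.default_rng(2)
shape = (8, 8)                                    # finite window of Z^2

# (a) spatially white noise: i.i.d. random variables at every site
white = rng.normal(size=shape)

# (b) constant field: x(n1, n2) = A for a single random variable A
A = rng.normal()
constant = np.full(shape, A)

# (c) binary random field with values in {0, 1}, here i.i.d. Bernoulli(1/2)
binary = rng.integers(0, 2, size=shape)

print("white noise corner:\n", white[:3, :3])
print("constant field value A =", A)
print("binary field corner:\n", binary[:3, :3])
```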
4.2 Uncountable Index Set
When the index set $\mathcal{T}$ is countable, we may conceptually obtain the entire sample path of the random process by observing its samples for all values of $t$ (we can count these samples). However, technical difficulties arise when $\mathcal{T}$ is uncountably infinite, e.g., $\mathcal{T} = [a, b]$. Intuitively, the collection of random variables $X(t)$, $t \in \mathcal{T}$, may be too large and cannot be recovered by sampling. Such is the case, for instance, when $\mathcal{T} = \mathbb{R}$ and $X(t)$, $t \in \mathcal{T}$, are i.i.d. samples.
A simple example is the process $X(t)$, $t \in [0, 1]$, that is generated as follows: pick a random variable $t_0$ from the uniform distribution on $[0, 1]$ and let $X(t) = 0$ at all points except at $t = t_0$, where $X(t) = 1$. This process is statistically indistinguishable from the trivial process $\widetilde{X}(t) \equiv 0$ in the following sense: the $n$-dimensional cdfs for the processes $X$ and $\widetilde{X}$ are identical for all values of $n$. Heuristically speaking, we may collect an arbitrarily large number of samples of $X$, yet with probability 1 they will all have value 0, i.e., coincide with those for the process $\widetilde{X}$.
In measure-theoretic terms, when $\mathcal{T}$ is uncountably infinite, Kolmogorov's extension is not necessarily unique. This difficulty does not arise when the random process is continuous enough that specifying all finite-order cdfs completely characterizes the distribution of the process. Such processes can be sampled and are said to be separable. Fortunately, all random processes of interest to us are in this category. The example given above describes a process $X$ that is not separable because it is not continuous enough.
4.3 Moments of a Random Process
The moments of a real-valued random process include
• the mean $\mu_X(t)$, $t \in \mathcal{T}$, where $\mu_X(t) \triangleq EX(t)$;
• the correlation function $R_X(t_1, t_2)$, $t_1, t_2 \in \mathcal{T}$, where $R_X(t_1, t_2) = E[X(t_1) X(t_2)]$;
• the covariance function $C_X(t_1, t_2) = R_X(t_1, t_2) - \mu_X(t_1)\,\mu_X(t_2)$, which is the correlation function for the centered process $X(t) - \mu_X(t)$;
• the $n$-th order moment $E[X(t_1) \cdots X(t_n)]$.
The covariance function is symmetric and positive semidefinite, i.e., the $n \times n$ matrix $\{R_X(t_i, t_j)\}$, $1 \le i, j \le n$, is positive semidefinite for all choices of $n$-tuples $(t_i)$. A process is said to be a second-order process if $EX^2(t) = R_X(t, t) < \infty$ for all $t$.
4.4 Gaussian Random Process
Define a mean function $\mu_X(t)$, $t \in \mathcal{T}$, and a continuous covariance function $C_X(t_1, t_2)$. Let $F_{t_1, \ldots, t_n}(x_1, \ldots, x_n)$ be the classical $n$-dimensional Gaussian distribution with mean vector $(\mu_X(t_1), \ldots, \mu_X(t_n))$ and covariance matrix $\{C_X(t_i, t_j)\}$, $1 \le i, j \le n$. This defines a hierarchy of multidimensional distributions, and the consistency conditions are automatically satisfied. Note the simplicity of this construction.
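The construction translates directly into a sampling recipe: evaluate the mean and covariance functions on a finite set of index points $t_1, \ldots, t_n$ and draw from the corresponding multivariate normal. The sketch below is illustrative only; the zero mean and squared-exponential covariance are assumptions made for the example, not part of the notes.

```python
import numpy as np

def mu_X(t):
    return np.zeros_like(t)                        # assumed mean function

def C_X(t1, t2, ell=0.5):
    # assumed covariance: squared exponential (positive semidefinite)
    return np.exp(-((t1 - t2) ** 2) / (2.0 * ell ** 2))

rng = np.random.default_rng(3)
t = np.linspace(0.0, 1.0, 50)                      # finite set of index points
mean = mu_X(t)
cov = C_X(t[:, None], t[None, :])                  # matrix {C_X(t_i, t_j)}
cov += 1e-10 * np.eye(len(t))                      # tiny jitter for numerical stability

x = rng.multivariate_normal(mean, cov)             # one realization at the chosen points
print(x[:5])
```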
5 Stationary Processes
A stationary process is a random process whose statistical properties do not change over time/space. Formally, the process $X(t)$, $t \in \mathcal{T}$, is said to be stationary if
$$F_{t_1 + \tau, \ldots, t_n + \tau}(x_1, \ldots, x_n) = F_{t_1, \ldots, t_n}(x_1, \ldots, x_n) \quad \forall (t_i) \in \mathcal{T}^n,\ (x_i) \in \mathbb{R}^n,\ \tau \in \mathcal{T},\ \forall n.$$
The index set might be discrete or continuous time ($\mathcal{T} = \mathbb{Z}$ or $\mathbb{R}$, respectively) or $d$-dimensional space ($\mathcal{T} = \mathbb{R}^d$ or $\mathbb{Z}^d$).
A wide-sense stationary (WSS) process satisfies a similar invariance property in terms of its first two moments:
$$\mu_X(t) = \mu_X, \quad \forall t \in \mathcal{T}$$
$$R_X(t_1, t_2) = R_X(t_1 - t_2), \quad \forall t_1, t_2 \in \mathcal{T}.$$
Stationarity implies wide-sense stationarity, but the converse is not true. A WSS process is
also called weakly stationary.
When the index set for a WSS random process $X$ is $\mathbb{R}^d$ or $\mathbb{Z}^d$ where $d \ge 1$, we define the spectral density of $X$ as
$$S_X(f) = \mathcal{F}[C_X](f)$$
where $f$ is a $d$-dimensional frequency vector, and $\mathcal{F}$ denotes the appropriate $d$-dimensional Fourier transform. When $\mathcal{T} = \mathbb{R}^d$, we have the Fourier transform pair
$$S_X(f) = \int_{\mathbb{R}^d} C_X(t)\, e^{-j 2\pi f \cdot t}\, dt, \quad f \in \mathbb{R}^d$$
$$C_X(t) = \int_{\mathbb{R}^d} S_X(f)\, e^{j 2\pi f \cdot t}\, df, \quad t \in \mathbb{R}^d$$
where $f \cdot t$ denotes the dot product of the respective $d$-vectors. When $\mathcal{T} = \mathbb{Z}^d$, we have
$$S_X(f) = \sum_{i \in \mathbb{Z}^d} C_X(i)\, e^{-j 2\pi f \cdot i}, \quad f \in \left[-\tfrac{1}{2}, \tfrac{1}{2}\right]^d$$
$$C_X(t) = \int_{[-\frac{1}{2}, \frac{1}{2}]^d} S_X(f)\, e^{j 2\pi f \cdot t}\, df, \quad t \in \mathbb{Z}^d.$$
The ($d$-dimensional) spectral density represents the average distribution of energy across frequency. The spectral density is nonnegative (as proven by Herglotz (1911) for $\mathcal{T} = \mathbb{Z}$ and by Bochner (1932) for $\mathcal{T} = \mathbb{R}$). Existence of the spectral density is guaranteed if $\int |C_X| < \infty$ (in the case $\mathcal{T} = \mathbb{R}^d$) or $\sum_i |C_X(i)| < \infty$ (in the case $\mathcal{T} = \mathbb{Z}^d$).
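A concrete discrete-index illustration (the exponential covariance $C_X(k) = \rho^{|k|}$ is an assumed example): this covariance is absolutely summable, so the spectral density exists. The sketch evaluates the defining sum on a frequency grid and checks nonnegativity and agreement with the known closed form $(1-\rho^2)/(1 - 2\rho\cos 2\pi f + \rho^2)$.

```python
import numpy as np

rho = 0.7
k = np.arange(-200, 201)                     # truncation of the sum over Z
C = rho ** np.abs(k)                         # C_X(k) = rho^|k|, absolutely summable

f = np.linspace(-0.5, 0.5, 501)
# S_X(f) = sum_k C_X(k) exp(-j 2 pi f k); real-valued since C_X is even
S = (C[None, :] * np.exp(-2j * np.pi * np.outer(f, k))).sum(axis=1).real

S_closed = (1 - rho**2) / (1 - 2 * rho * np.cos(2 * np.pi * f) + rho**2)
print("min S_X(f) over the grid:", S.min())                   # nonnegative
print("max |S - closed form|  :", np.abs(S - S_closed).max()) # small truncation error
```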
6 Isotropic Processes
Let $\mathcal{T} = \mathbb{R}^d$ or $\mathbb{Z}^d$. A $d$-dimensional isotropic process is a stationary process whose correlation function $R_X(t)$ is spherically symmetric:
$$R_X(t) = R_X(\|t\|)$$
where $\|\cdot\|$ denotes the Euclidean norm. By the elementary properties of the Fourier transform, isotropy implies that the spectral density is also spherically symmetric:
$$S_X(f) = S_X(\|f\|).$$
Isotropic models are frequently used in image processing.
7 Wide-Sense Periodic Processes
A $d$-dimensional random process $X(t)$, $t \in \mathcal{T}$, is wide-sense periodic if its mean function $\mu_X(t)$ and its covariance function $C_X(t_1, t_2)$ satisfy the following properties:
1. $\mu_X(t) = \mu_X(t + T_i)$ for $i = 1, \ldots, d$, where $T_1, \ldots, T_d$ are $d$ periodicity vectors.
2. $C_X(t_1, t_2) = C_X(t_1 + T_i, t_2) = C_X(t_1, t_2 + T_i)$ for $i = 1, \ldots, d$.
For a wide-sense periodic process, it may be shown that
$$E|X(t + T_i) - X(t)|^2 = 0 \quad \forall t \in \mathcal{T}, \ i = 1, \ldots, d,$$
i.e., the realizations of this process are mean-square periodic.
8 Continuity of Random Processes
Let $X(t)$, $t \in \mathbb{R}^d$, be a second-order random process. The process is said to be mean-square (m.s.) continuous at $t$ if
$$\lim_{s \to t} X(s) = X(t) \quad \text{m.s.},$$
i.e., $E|X(t) - X(s)|^2 \to 0$ as $s \to t$. One may similarly define processes that are continuous in probability, or in distribution, by replacing the m.s. limit by the appropriate limit.
None of this implies that the sample paths of the process $X$ are continuous. For instance, define the following process: pick two independent random variables $A$ and $t_0$ according to a normal distribution, and let
$$X(t) = \begin{cases} A & t < t_0 \\ A + 1 & t \ge t_0. \end{cases}$$
Any sample path of this process has one point of discontinuity (at the random location $t_0$). Yet it can be verified that $X$ is m.s. continuous everywhere.
The notion of a.s. continuity of a random process is quite different from the three notions
discussed above. A random process is a.s. continuous if the probability that a sample path is
continuous is 1. The process X in our example above fails that stronger condition, because
the probability that a sample path is continuous is 0.
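The contrast between mean-square continuity and sample-path continuity in this example can be checked numerically. The sketch below (simulation sizes are arbitrary) estimates $E|X(t+h) - X(t)|^2$ over many independent draws of $(A, t_0)$; the mean-square gap shrinks with $h$ even though every individual sample path has a unit jump at $t_0$.

```python
import numpy as np

rng = np.random.default_rng(4)
n_paths = 200_000
A = rng.normal(size=n_paths)            # random level A
t0 = rng.normal(size=n_paths)           # random jump location t0

def X(t):
    # X(t) = A for t < t0, A + 1 for t >= t0, evaluated across all realizations
    return A + (t >= t0).astype(float)

t = 0.3
for h in [0.5, 0.1, 0.01, 0.001]:
    ms_gap = np.mean((X(t + h) - X(t)) ** 2)     # estimate of E|X(t+h) - X(t)|^2
    print(f"h = {h:>6}: E|X(t+h) - X(t)|^2 ~ {ms_gap:.5f}")
```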
The notion of m.s. continuity is useful in part because that property can be inferred by
inspection of the correlation function. Indeed the following three properties are equivalent:
1. X is m.s. continuous
2. $R_X(t, s)$ is continuous over $\mathcal{T} \times \mathcal{T}$
3. $R_X(t, t)$ is continuous at all $t \in \mathcal{T}$.
For a WSS process, the correlation function is of the form $R_X(t)$, $t \in \mathcal{T}$, and the following three properties are equivalent:
1. $X$ is m.s. continuous
2. $R_X(t)$ is continuous over $\mathcal{T}$
3. $R_X(t)$ is continuous at $t = 0$.
9 Stochastic Integrals
We often deal with random processes that are outputs of linear systems, e.g.,
$$Y(t) = \int_{\mathbb{R}^d} X(s)\, h(t - s)\, ds, \quad t \in \mathbb{R}^d.$$
Since X depends on the outcome of the random experiment, this integral may not always
exist. What meaning can we attach to such an integral?
Let us consider the following generic problem. Denote by $I$ the integral $\int_{\Delta} X(t)\, dt$ of $X$ over a domain $\Delta$, and partition $\Delta$ into $n$ disjoint cells $\Delta_i$:
$$\bigcup_{i=1}^{n} \Delta_i = \Delta, \quad \Delta_i \cap \Delta_j = \emptyset \quad \forall i \ne j.$$
Assume that $\max_i |\Delta_i| \to 0$ as $n \to \infty$. Choose some $t_i$ in each interval $\Delta_i$ and define the sum
$$I_n = \sum_{i=1}^{n} X(t_i)\, |\Delta_i|.$$
For a classical (Riemann) integral, we have $I \triangleq \lim_{n \to \infty} I_n$.
For a stochastic integral, the mean-square integral $I$ exists when
$$\lim_{n \to \infty} E(I - I_n)^2 = 0, \quad \text{i.e.,} \quad I_n \xrightarrow{\ \text{m.s.}\ } I.$$
Properties of the mean-square integral.
• Existence: guaranteed if $\iint R_X(t, s)\, dt\, ds < \infty$;
• Linearity: $\int [a X(t) + b Y(t)]\, dt = a \int X(t)\, dt + b \int Y(t)\, dt$, provided that $\int X$ and $\int Y$ exist;
• Moments: $E\left[\int X(t)\, dt\right] = \int E[X(t)]\, dt$ and $E\left(\int X(t)\, dt\right)^2 = \iint R_X(t, s)\, dt\, ds$.
LSI Filtering. Consider a Linear Shift Invariant (LSI) system with impulse response $h(t)$, and denote by $R_h(u) \triangleq \int h(t)\, h(t + u)\, dt$ the deterministic autocorrelation function of $h$. If the process $X(t)$, $t \in \mathbb{R}^d$, is input to this system, the output is
$$Y(t) = \int_{\mathbb{R}^d} X(s)\, h(t - s)\, ds, \quad t \in \mathbb{R}^d$$
$$\mu_Y(t) = EY(t) = \int EX(s)\, h(t - s)\, ds = \int \mu_X(s)\, h(t - s)\, ds$$
$$\begin{aligned}
R_Y(t, s) &= E[Y(t) Y(s)] \\
&= E\left[\int X(t')\, h(t - t')\, dt' \int X(s')\, h(s - s')\, ds'\right] \\
&= \iint E[X(t') X(s')]\, h(t - t')\, h(s - s')\, dt'\, ds' \\
&= \iint R_X(t', s')\, h(t - t')\, h(s - s')\, dt'\, ds'.
\end{aligned}$$
If the process $X$ is WSS, these expressions simplify to
$$\mu_Y(t) = \mu_X \int h(t)\, dt$$
$$R_Y(t - s) = (R_X * R_h)(t - s).$$
Special Case: Filtering white noise $X(t)$ through an LSI system.
Taking $R_X(t) = \delta(t)$, we obtain
$$R_Y(t) = (R_X * R_h)(t) = R_h(t).$$
The stochastic m.s. integral $Y(t) = \int X(s)\, h(t - s)\, ds$ exists if $R_h(0) = \int |h(t)|^2\, dt < \infty$. Compare with the traditional Bounded Input Bounded Output (BIBO) condition for stability of LSI systems: $\int |h(t)|\, dt < \infty$. Which condition is stronger?
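In discrete time this is easy to check by simulation. The sketch below (the FIR taps and the sample size are arbitrary assumptions) filters unit-variance white noise and compares the empirical autocorrelation of the output with the deterministic autocorrelation $R_h$.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 200_000
x = rng.normal(size=N)                    # unit-variance white noise, R_X(k) = delta(k)

h = np.array([1.0, 0.5, 0.25, 0.125])     # impulse response of an assumed FIR filter
y = np.convolve(x, h, mode="valid")       # output of the LSI system

# Deterministic autocorrelation R_h(u) = sum_t h(t) h(t + u), lags -(L-1) .. (L-1)
R_h = np.correlate(h, h, mode="full")

# Empirical autocorrelation of the output at the same lags
lags = range(-(len(h) - 1), len(h))
for k, rh in zip(lags, R_h):
    m = abs(k)
    ry = np.mean(y[: len(y) - m] * y[m:]) if m else np.mean(y * y)
    print(f"lag {k:+d}: R_h = {rh:.4f}, empirical R_Y = {ry:.4f}")
```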
Note that white noise on $\mathbb{R}$ is not a separable process, but the filtered noise is a separable process. We could have been more careful in the derivation above and selected a separable process $X_\epsilon$ with correlation function $R(t; \epsilon) \triangleq \epsilon^{-1} R(\epsilon^{-1} t)$, which forms a resolution of the identity:
$$\int f(t)\, \delta(t)\, dt \triangleq \lim_{\epsilon \to 0} \int f(t)\, R(t; \epsilon)\, dt = f(0)$$
for any function $f$ that is continuous at $t = 0$. Then if a WSS process $X_\epsilon$ with correlation function $R(t; \epsilon)$ is input to the LSI system, the correlation function of the output is given by $R_h(t)$ in the limit as $\epsilon \to 0$.
10 Ergodic Processes
Let $\mathcal{T} = \mathbb{R}^d$ or $\mathbb{Z}^d$. In practice, a statistical description of images (or any other random process) is not available, so these quantities have to be estimated from the data. Consider for instance the problem of estimating the mean $\mu$ of a stationary random process given data $X(t)$, $t \in \Omega$. The mean can be estimated using the stochastic integral
$$\hat{\mu} = \frac{1}{|\Omega|} \int_{\Omega} X(t)\, dt.$$
In many cases, $\hat{\mu} \to \mu$ as $|\Omega| \to \infty$. (Example: spatially white noise.) In this case, the process $X(t)$ is said to be ergodic in the mean. Formally:
Definition: A WSS process $X(t)$, $t \in \mathcal{T}$, is (m.s.) ergodic in the mean iff $\hat{\mu} \xrightarrow{\ \text{m.s.}\ } \mu$ as $|\Omega| \to \infty$.
What does it take to have ergodicity?
Intuitively, $\hat{\mu}$ should be an average of many independent observations (so that a law of large numbers type result applies). This means that $X(t)$ should decorrelate rapidly, i.e., the correlation area should be small enough.
Example #1 (worst-case scenario): $X(t) = A$ (a random variable) for all $t \in \mathcal{T}$. Then $\hat{\mu} = A$ no matter how large the observation window is, and $\hat{\mu}$ does not converge to $\mu$.
Example #2: $X(t) = A\cos(2\pi t)$ for all $t \in \mathcal{T}$. For a symmetric window $\Omega = [-(n+\eta),\, n+\eta]$, where $n \in \mathbb{N}$ and $\eta \in [0, 1)$, we obtain $\hat{\mu} = \frac{2A\sin(2\pi\eta)}{2\pi\,|\Omega|}$, which vanishes as $|\Omega| \to \infty$.
The expected value of $\hat{\mu}$ is given by
$$E\hat{\mu} = \frac{1}{|\Omega|} \int_{\Omega} EX(t)\, dt = \mu,$$
i.e., $\hat{\mu}$ is an unbiased estimator of $\mu$. The variance of $\hat{\mu}$ is given by
$$E(\hat{\mu} - \mu)^2 = \frac{1}{|\Omega|^2} \int_{\Omega}\int_{\Omega} C_X(t - t')\, dt\, dt'.$$
If this variance vanishes as $|\Omega| \to \infty$, we say that the estimator $\hat{\mu}$ is m.s. consistent.
Slutsky's Ergodic Theorem [3, p. 218]: $X$ is ergodic in the mean iff
$$\frac{1}{|\Omega|} \int_{\Omega} C_X(t)\, dt \to 0 \quad \text{as } |\Omega| \to \infty.$$
The condition above is certainly satisfied if $C_X(t) \to 0$ as $|t| \to \infty$, or even if $C_X(t)$ contains sinusoidal components (as in the case of periodic processes such as $X$ in Example #2 above). But $C_X(t)$ should not contain a d.c. component, which implies that $S_X(f)$ may not contain an impulse at $f = 0$! (as in Example #1 above).
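A short discrete-time simulation contrasting an ergodic process (white noise) with the non-ergodic Example #1, $X(t) \equiv A$ (window sizes and distributions are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(6)
mu = 0.0
A = rng.normal()                               # the single random level of Example #1

for n in [10, 1_000, 100_000]:
    white = rng.normal(loc=mu, size=n)         # ergodic in the mean: estimate -> mu
    mu_hat_white = white.mean()
    mu_hat_const = np.full(n, A).mean()        # not ergodic: estimate stays equal to A
    print(f"|window| = {n:>7}: white-noise estimate = {mu_hat_white:+.4f}, "
          f"constant-process estimate = {mu_hat_const:+.4f}")
print("true mean mu =", mu, ", realized A =", A)
```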
Other types of ergodicity can be defined similarly. For instance:
Definition: A WSS process $X$ is ergodic in correlation at the shift $s$ if
$$\hat{R}_X(s) \to R_X(s) \quad \text{m.s. as } |\Omega| \to \infty,$$
where
$$\hat{R}_X(s) = \frac{1}{|\Omega|} \int_{\Omega} X(t)\, X(t + s)\, dt.$$
11 Random Processes on the Sphere
For 2-D images, the notion of stationarity implies that images are defined over infinite planar domains, but that assumption is clearly artificial. Modeling images as wide-sense periodic processes, as discussed earlier, is a possible solution. The representation of images as random fields defined on the sphere is a useful alternative because one may view such images as projections of a physical 3-D image.
Now, how can we define stationary and isotropic processes on the sphere?
On the sphere, stationarity and isotropy mean invariance to rotations:
• $EX(t) = \mu_X$ is independent of $t$;
• $E[X(t) X(s)] = R_X(t, s) = R_X(\gamma)$, where $\gamma$ is the spherical angle (a distance) between $t$ and $s$.
The general form of a nonnegative definite kernel $R_X(t, s)$ that depends on $t$ and $s$ only via $\gamma$ is as follows [Grenander, p. 130]:
$$R_X(\gamma) = \sum_{m=0}^{\infty} S_X(m)\, P_m(\cos\gamma)$$
$$S_X(m) = \int_{\text{sphere}} R_X(\gamma)\, P_m(\cos\gamma)\, dt$$
where $S_X(m) \ge 0$ and $P_m(\cdot)$ are Legendre polynomials. These expressions are reminiscent of the Fourier transform formulas, except that the basis functions are not complex exponentials but Legendre polynomials.
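To evaluate such a correlation function numerically one can sum the Legendre series directly; NumPy evaluates Legendre series natively. The coefficients $S_X(m)$ below are made-up nonnegative values, used only for illustration.

```python
import numpy as np
from numpy.polynomial import legendre

# Hypothetical nonnegative coefficients S_X(m), m = 0, ..., 4
S = np.array([1.0, 0.6, 0.3, 0.1, 0.05])

gamma = np.linspace(0.0, np.pi, 7)             # spherical angles between t and s
# R_X(gamma) = sum_m S_X(m) P_m(cos gamma); legval evaluates the Legendre series
R = legendre.legval(np.cos(gamma), S)

for g, r in zip(gamma, R):
    print(f"gamma = {g:5.2f} rad: R_X = {r:+.4f}")
```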
12 Random Processes on Arbitrary Algebraic Structures
We have studied random processes whose correlation function $R_X(t, s)$ is invariant to shifts (in $\mathbb{Z}^d$ and $\mathbb{R}^d$) and rotations (on the sphere). Can this be further generalized?
The answer is positive if we consider an index set $\mathcal{T}$ with a group structure [6]. One can then introduce the concept of invariance to the group operation (e.g., translation or rotation) and define a spectral density which is a Fourier transform defined on the group. The basis functions for such Fourier transforms are group characters $\chi(t)$, which satisfy the property
$$\chi(t_1 + t_2) = \chi(t_1)\, \chi(t_2).$$
References
[1] B. Hajek, Notes for ECE534: An Exploration of Random Processes for Engineers, available
from http://www.ifp.uiuc.edu/hajek/Papers/randomprocJuly06.pdf, 2006.
[2] H. Stark and J. W. Woods, Probability, Random Processes, and Estimation Theory for
Engineers, Prentice-Hall, Upper Saddle River, NJ, 1994.
[3] A. M. Yaglom, Correlation Theory of Stationary and Related Random Functions I: Basic
Results, Springer-Verlag, 1987.
[4] L. Breiman, Probability, SIAM, 1992.
[5] U. Grenander, Abstract Inference, Wiley, New York, 1981.
[6] U. Grenander, Probabilities on Algebraic Structures, Wiley, New York, 1963.
[7] W. Feller, An Introduction to Probability Theory and Its Applications, Wiley, New York,
1971.
[8] J. Doob, Stochastic Processes, Wiley, 1953.