Dembo Notes
Contents

Preface
Bibliography
Index
Preface
These are the lecture notes for a year-long, PhD-level course in Probability Theory
that I taught at Stanford University in 2004, 2006 and 2009. The goal of this
course is to prepare incoming PhD students in Stanford's mathematics and statistics
departments to do research in probability theory. More broadly, the goal of the text
is to help the reader master the mathematical foundations of probability theory
and the techniques most commonly used in proving theorems in this area. This is
then applied to the rigorous study of the most fundamental classes of stochastic
processes.
Towards this goal, we introduce in Chapter 1 the relevant elements from measure
and integration theory, namely, the probability space and the σ-algebras of events
in it, random variables viewed as measurable functions, their expectation as the
corresponding Lebesgue integral, and the important concept of independence.
Utilizing these elements, we study in Chapter 2 the various notions of convergence
of random variables and derive the weak and strong laws of large numbers.
Chapter 3 is devoted to the theory of weak convergence, the related concepts
of distribution and characteristic functions and two important special cases: the
Central Limit Theorem (in short clt) and the Poisson approximation.
Drawing upon the framework of Chapter 1, we devote Chapter 4 to the definition,
existence and properties of the conditional expectation and the associated regular
conditional probability distribution.
Chapter 5 deals with filtrations, the mathematical notion of information progression in time, and with the corresponding stopping times. Results about the latter
are obtained as a by-product of the study of a collection of stochastic processes
called martingales. Martingale representations are explored, as well as maximal
inequalities, convergence theorems and various applications thereof. Aiming for a
clearer and easier presentation, we focus here on the discrete-time setting, deferring
the continuous-time counterpart to Chapter 8.
Chapter 6 provides a brief introduction to the theory of Markov chains, a vast
subject at the core of probability theory, to which many textbooks are devoted.
We illustrate some of the interesting mathematical properties of such processes by
examining a few special cases of interest.
Chapter 7 sets the framework for studying right-continuous stochastic processes
indexed by a continuous time parameter, introduces the family of Gaussian processes and rigorously constructs the Brownian motion as a Gaussian process of
continuous sample path and zero-mean, stationary independent increments.
Chapter 8 expands our earlier treatment of martingales and strong Markov processes to the continuous time setting, emphasizing the role of right-continuous filtration. The mathematical structure of such processes is then illustrated both in
the context of Brownian motion and that of Markov jump processes.
Building on this, in Chapter 9 we re-construct the Brownian motion via the invariance principle as the limit of certain rescaled random walks. We further delve
into the rich properties of its sample path and the many applications of Brownian
motion to the clt and the Law of the Iterated Logarithm (in short, lil).
The intended audience for this course should have prior exposure to stochastic
processes, at an informal level. While students are assumed to have taken a real
analysis class dealing with Riemann integration, and mastered well this material,
prior knowledge of measure theory is not assumed.
It is quite clear that these notes are much influenced by the textbooks [Bil95,
Dur10, Wil91, KaS97] I have been using.
I thank my students out of whose work this text materialized and my teaching assistants Su Chen, Kshitij Khare, Guoqiang Hu, Julia Salzman, Kevin Sun and Hua
Zhou for their help in the assembly of the notes of more than eighty students into
a coherent document. I am also much indebted to Kevin Ross, Andrea Montanari
and Oana Mocioalca for their feedback on earlier drafts of these notes, to Kevin
Ross for providing all the figures in this text, and to Andrea Montanari, David
Siegmund and Tze Lai for contributing some of the exercises in these notes.
Amir Dembo
Stanford, California
April 2010
CHAPTER 1
making sure that Σ_ω p_ω = 1. Then, it is easy to see that taking P(A) = Σ_{ω∈A} p_ω
for any A ⊆ Ω results with a probability measure on (Ω, 2^Ω). For instance, when
Ω is finite, we can take p_ω = 1/|Ω|, the uniform measure on Ω, whereby computing
probabilities is the same as counting. Concrete examples are a single coin toss, for
which we have Ω1 = {H, T} (ω = H if the coin lands on its head and ω = T if it
lands on its tail), and F1 = {∅, Ω1, {H}, {T}}, or when we consider a finite number of
coin tosses, say n, in which case Ωn = {(ω1, . . . , ωn) : ωi ∈ {H, T}, i = 1, . . . , n}
is the set of all possible n-tuples of coin tosses, while Fn = 2^{Ωn} is the collection
of all possible sets of n-tuples of coin tosses. Another example pertains to the
set of all non-negative integers Ω = {0, 1, 2, . . .} and F = 2^Ω, where we get the
Poisson probability measure of parameter λ > 0 when starting from p_k = (λ^k/k!) e^{−λ} for
k = 0, 1, 2, . . ..
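This recipe is easy to try numerically. Below is a minimal sketch (the helper `prob` and the chosen sample spaces are our own illustration, not notation from the text): point masses p_ω on a countable Ω, with P(A) = Σ_{ω∈A} p_ω.

```python
from itertools import product
from math import exp, factorial

def prob(p, A):
    """P(A) = sum of the point masses p[w] over outcomes w in the event A."""
    return sum(p[w] for w in A)

# Uniform measure on n = 3 fair coin tosses: p_w = 1/|Omega|.
omega = list(product("HT", repeat=3))
p_unif = {w: 1 / len(omega) for w in omega}
first_is_H = [w for w in omega if w[0] == "H"]
assert abs(prob(p_unif, omega) - 1.0) < 1e-12       # P(Omega) = 1
assert abs(prob(p_unif, first_is_H) - 0.5) < 1e-12  # P(first toss = H) = 1/2

# Poisson(lam) point masses p_k = lam^k e^{-lam} / k!, truncated at k < 60
# (the omitted tail is negligible for lam = 2).
lam = 2.0
p_pois = {k: lam**k * exp(-lam) / factorial(k) for k in range(60)}
assert abs(prob(p_pois, range(60)) - 1.0) < 1e-12
```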
When Ω is uncountable such a strategy as in Example 1.1.6 will no longer work.
The problem is that if we take p_ω = P({ω}) > 0 for uncountably many values of
ω, we shall end up with P(Ω) = ∞. Of course we may define everything as before
on a countable subset Ω̂ of Ω and demand that P(A) = P(A ∩ Ω̂) for each A ⊆ Ω.
Excluding such trivial cases, to genuinely use an uncountable sample space we
need to restrict our σ-algebra F to a strict subset of 2^Ω.
Definition 1.1.7. We say that a probability space (Ω, F, P) is non-atomic, or
alternatively call P non-atomic, if P(A) > 0 implies the existence of B ∈ F, B ⊂ A,
with 0 < P(B) < P(A).
Indeed, in contrast to the case of countable Ω, the generic uncountable sample
space results with a non-atomic probability space (c.f. Exercise 1.1.27). Here is an
interesting property of such spaces (see also [Bil95, Problem 2.19]).
Exercise 1.1.8. Suppose P is non-atomic and A ∈ F with P(A) > 0.
(a) Show that for every ε > 0, we have B ⊆ A such that 0 < P(B) < ε.
(b) Prove that if 0 < a < P(A) then there exists B ⊆ A with P(B) = a.
Hint: Fix εn ↓ 0 and define inductively numbers xn and sets Gn ∈ F, with H0 = ∅,
Hn = ∪_{k<n} Gk, xn = sup{P(G) : G ⊆ A \ Hn, P(Hn ∪ G) ≤ a} and Gn ⊆ A \ Hn
such that P(Hn ∪ Gn) ≤ a and P(Gn) ≥ (1 − εn)xn. Consider B = ∪_k Gk.
As you show next, the collection of all measures on a given space is a convex cone.
Here are a few properties of probability measures for which the conclusions of Exercise 1.1.4 are useful.
Exercise 1.1.10. A function d : X × X → [0, ∞) is called a semi-metric on
the set X if d(x, x) = 0, d(x, y) = d(y, x) and the triangle inequality d(x, z) ≤
d(x, y) + d(y, z) holds. With A∆B = (A ∩ B^c) ∪ (A^c ∩ B) denoting the symmetric
difference of subsets A and B of Ω, show that for any probability space (Ω, F, P),
the function d(A, B) = P(A∆B) is a semi-metric on F.
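On a finite sample space the assertions of Exercise 1.1.10 can be checked directly. A small sketch, with our own helper names, using the uniform measure on Ω = {0,1}³ (Python's set operator `^` is exactly the symmetric difference A∆B):

```python
from itertools import product

# Uniform probability on Omega = {0,1}^3, so P(A) = |A| / 8.
omega = set(product((0, 1), repeat=3))
P = lambda A: len(A) / len(omega)

def d(A, B):
    """d(A, B) = P(A symmetric-difference B)."""
    return P(A ^ B)  # (A \ B) union (B \ A)

A = {w for w in omega if w[0] == 1}
B = {w for w in omega if w[1] == 1}
C = {w for w in omega if sum(w) >= 2}
assert d(A, A) == 0                   # d(x, x) = 0
assert d(A, B) == d(B, A)             # symmetry
assert d(A, C) <= d(A, B) + d(B, C)   # triangle inequality
```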
Exercise 1.1.11. Consider events {An} in a probability space (Ω, F, P) that are
almost disjoint in the sense that P(An ∩ Am) = 0 for all n ≠ m. Show that then
P(∪_{n=1}^∞ An) = Σ_{n=1}^∞ P(An).
Different sets of generators may result with the same σ-algebra. For example, taking Ω = {1, 2, 3} it is easy to see that σ({1}) = σ({2, 3}) = {∅, {1}, {2, 3}, {1, 2, 3}}.
A σ-algebra F is countably generated if there exists a countable collection of sets
that generates it. Exercise 1.1.17 shows that B_R is countably generated, but as you
show next, there exist σ-algebras that are not countably generated, even on Ω = R.
Exercise 1.1.16. Let F consist of all A ⊆ Ω such that either A is a countable set
or A^c is a countable set.
(a) Verify that F is a σ-algebra.
(b) Show that F is countably generated if and only if Ω is a countable set.
Recall that if a collection of sets A is a subset of a σ-algebra G, then also σ(A) ⊆ G.
Consequently, to show that σ({Aα}) = σ({Bβ}) for two different sets of generators
{Aα} and {Bβ}, we only need to show that Aα ∈ σ({Bβ}) for each α and that
Bβ ∈ σ({Aα}) for each β. For instance, considering B_Q = σ({(a, b) : a < b ∈ Q}),
we have by this approach that B_Q = σ({(a, b) : a < b ∈ R}), as soon as we
show that any interval (a, b) is in B_Q. To see this fact, note that for any real
a < b there are rational numbers qn < rn such that qn ↓ a and rn ↑ b, hence
(a, b) = ∪_n (qn, rn) ∈ B_Q. Expanding on this, the next exercise provides useful
alternative definitions of B.
Exercise 1.1.24. Show that if P is a probability measure on (R, B) then for any
A ∈ B and ε > 0, there exists an open set G containing A such that P(A) + ε >
P(G).
Here is more information about B_{R^d}.
non-empty for finite n, then the countable intersection ∩_{i=1}^∞ Ki is also non-empty).
Exercise 1.1.27. Show that ((0, 1], B(0,1], U) is a non-atomic probability space and
deduce that (R, B, λ) is a non-atomic measure space.
Note that any countable union of sets of probability zero has probability zero, but
this is not the case for an uncountable union. For example, U({x}) = 0 for every
x ∈ (0, 1], but U((0, 1]) = 1.
As we have seen in Example 1.1.26 it is often impossible to explicitly specify the
value of a measure on all sets of the σ-algebra F. Instead, we wish to specify its
values on a much smaller and better behaved collection of generators A of F and
use Carathéodory's theorem to guarantee the existence of a unique measure on F
that coincides with our specified values. To this end, we require that A be an
algebra, that is,
Definition 1.1.28. A collection A of subsets of Ω is an algebra (or a field) if
(a) ∅ ∈ A,
(b) If A ∈ A then A^c ∈ A as well,
(c) If A, B ∈ A then also A ∪ B ∈ A.
(a) Verify that f(A) is indeed an algebra and that f(A) is minimal in the
sense that if G is an algebra and A ⊆ G, then f(A) ⊆ G.
(b) Show that f(A) is the collection of all finite disjoint unions of sets of the
form ∩_{j=1}^{n_i} A_{ij}, where for each i and j either A_{ij} or A_{ij}^c are in A.
We next state Carathéodory's extension theorem, a key result from measure theory, and demonstrate how it applies in the context of Example 1.1.26.
be a collection of subsets of (0, 1]. It is not hard to verify that A is an algebra, and
further that σ(A) = B(0,1] (c.f. Exercise 1.1.17, for a similar issue, just with (0, 1]
replaced by R). With U0 denoting the non-negative set function on A such that
(1.1.1)   U0(∪_{k=1}^r (a_k, b_k]) = Σ_{k=1}^r (b_k − a_k) ,
note that U0((0, 1]) = 1, hence the existence of a unique probability measure U on
((0, 1], B(0,1]) such that U(A) = U0(A) for sets A ∈ A follows by Carathéodory's
extension theorem, as soon as we verify that U0 is countably additive on A.
Exercise 1.1.32. Show that U0 is finitely additive on A. That is, U0(∪_{k=1}^n A_k) =
Σ_{k=1}^n U0(A_k) for any finite collection of disjoint sets A1, . . . , An ∈ A.
Sn
Proof. Let Gn = k=1 Ak and Hn = A \ Gn . Then, Hn and since
Ak , A A which is an algebra it follows that Gn and hence Hn are also in A. By
definition, U0 is finitely additive on A, so
n
X
U0 (A) = U0 (Hn ) + U0 (Gn ) = U0 (Hn ) +
U0 (Ak ) .
k=1
To prove that U0 is countably additive, it suffices to show that U0 (Hn ) 0, for then
n
X
X
U0 (A) = lim U0 (Gn ) = lim
U0 (Ak ) =
U0 (Ak ) .
n
k=1
k=1
=1
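Identifying each element of the algebra A with a finite list of disjoint intervals (a, b], the set function U0 of (1.1.1) can be sketched as follows (this list representation and the helper name are our own; in it, finite additivity as in Exercise 1.1.32 amounts to the sums concatenating):

```python
def U0(intervals):
    """U0 of a finite disjoint union of intervals (a, b] in (0, 1]:
    the sum of the lengths b - a, as in (1.1.1)."""
    return sum(b - a for a, b in intervals)

# Disjoint sets A1, A2, A3 in the algebra A, each a finite union of (a, b].
A1 = [(0.0, 0.1), (0.5, 0.6)]
A2 = [(0.1, 0.3)]
A3 = [(0.7, 1.0)]
union = A1 + A2 + A3  # still a finite disjoint union of intervals

# Finite additivity: U0 of the disjoint union equals the sum of the parts.
assert abs(U0(union) - (U0(A1) + U0(A2) + U0(A3))) < 1e-12
assert abs(U0([(0.0, 1.0)]) - 1.0) < 1e-12  # U0((0, 1]) = 1
```

In this representation additivity is immediate; the substance of Exercise 1.1.32 is that U0 is well defined, i.e. independent of how a set in A is decomposed into intervals.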
In particular, for every n, the set ∩_{k≤n} J_k is non-empty and therefore so are the
decreasing sets K_n = ∩_{k≤n} J̄_k. Since K_n are compact sets (by the Heine-Borel theorem), the set ∩_k J̄_k is then non-empty as well, and since J_k is a subset of H_k for all k
we arrive at ∩_n H_n non-empty, contradicting our assumption that Hn ↓ ∅.
Remark. The proof of Lemma 1.1.31 is generic (for finite measures). Namely,
any non-negative finitely additive set function μ0 on an algebra A is countably
additive if μ0(Hn) → 0 whenever Hn ∈ A and Hn ↓ ∅. Further, as this proof shows,
when Ω is a topological space it suffices for countable additivity of μ0 to have for
any H ∈ A a sequence J_k ∈ A such that J̄_k ⊆ H are compact and μ0(H \ J_k) → 0
as k → ∞.
Exercise 1.1.33. Show the necessity of the assumption that A be an algebra in
Carathéodory's extension theorem, by giving an example of two probability measures
μ ≠ ν on a measurable space (Ω, F) such that μ(A) = ν(A) for all A ∈ A and
F = σ(A).
Hint: This can be done with Ω = {1, 2, 3, 4} and F = 2^Ω.
It is often useful to assume that the probability space we have is complete, in the
sense we make precise now.
Definition 1.1.34. We say that a measure space (Ω, F, μ) is complete if any
subset N of any B ∈ F with μ(B) = 0 is also in F. If further μ = P is a probability
measure, we say that the probability space (Ω, F, P) is a complete probability space.
Our next theorem states that any measure space can be completed by adding to
its σ-algebra all subsets of sets of zero measure (a procedure that depends on the
measure in use).
Theorem 1.1.35. Given a measure space (Ω, F, μ), let N = {N : N ⊆ A for
some A ∈ F with μ(A) = 0} denote the collection of μ-null sets. Then, there
exists a complete measure space (Ω, F̄, μ̄), called the completion of the measure
space (Ω, F, μ), such that F̄ = {F ∪ N : F ∈ F, N ∈ N} and μ̄ = μ on F.
Proof. This is beyond our scope, but see the detailed proof in [Dur10, Theorem
A.2.3]. In particular, F̄ = σ(F, N) and μ̄(A ∪ N) = μ(A) for any N ∈ N and
A ∈ F (c.f. [Bil95, Problems 3.10 and 10.5]).
The following collections of sets play an important role in proving the easy part
of Carathéodory's theorem, the uniqueness of the extension.
Definition 1.1.36. A π-system is a collection P of sets closed under finite intersections (i.e. if I ∈ P and J ∈ P then I ∩ J ∈ P).
A λ-system is a collection L of sets containing Ω and B \ A for any A ⊆ B, A, B ∈ L,
Remark. With a somewhat more involved proof one can relax the condition
μ1(Ω) = μ2(Ω) < ∞ to the existence of An ∈ P such that An ↑ Ω and μ1(An) < ∞
(c.f. [Bil95, Theorem 10.3] for details). Accordingly, in Carathéodory's extension
theorem we can relax μ0(Ω) < ∞ to the assumption that μ0 is a σ-finite measure,
that is, μ0(An) < ∞ for some An ∈ A such that An ↑ Ω, as is the case with
Lebesgue's measure on R.
We conclude this subsection with an outline of the proof of Carathéodory's extension
theorem, noting that since an algebra A is a π-system and Ω ∈ A, the uniqueness of
the extension to σ(A) follows from Proposition 1.1.39. Our outline of the existence
of an extension follows [Wil91, Section A.1.8] (or see [Bil95, Theorem 11.3] for
the proof of a somewhat stronger result). This outline centers on the construction
of the appropriate outer measure, a relaxation of the concept of measure, which we
now define.
Definition 1.1.40. An increasing, countably sub-additive, non-negative set function μ* on a measurable space (Ω, F) is called an outer measure. That is, μ* : F →
[0, ∞], having the properties:
(a) μ*(∅) = 0 and μ*(A1) ≤ μ*(A2) for any A1, A2 ∈ F with A1 ⊆ A2.
(b) μ*(∪_n An) ≤ Σ_n μ*(An) for any countable collection of sets An ∈ F.
The first step of the proof defines, for any E ⊆ Ω, the outer measure
μ*(E) = inf{ Σ_{n=1}^∞ μ0(An) : E ⊆ ∪_n An, An ∈ A },
The third step uses the countable additivity of μ0 on A to show that for any A ∈ A
the outer measure is additive when splitting subsets of Ω by intersections with A
and A^c. That is, we show that any element of A is a μ*-measurable set, as defined
next.
on B(0,1], this is the σ-algebra B̄(0,1] of all Lebesgue measurable subsets of (0, 1].
Associated with it are the Lebesgue measurable functions f : (0, 1] → R for which
f^{-1}(B) ∈ B̄(0,1] for all B ∈ B. However, as noted for example in [Dur10, Theorem
A.2.4], the non-Borel set constructed in the proof of Proposition 1.1.18 is also non-Lebesgue measurable.
The following concept of a monotone class of sets is a considerable relaxation of
that of a λ-system (hence also of a σ-algebra, see Proposition 1.1.37).
Definition 1.1.43. A monotone class is a collection M of sets closed under both
monotone increasing and monotone decreasing limits (i.e. if Ai ∈ M and either
Ai ↑ A or Ai ↓ A, then A ∈ M).
When starting from an algebra instead of a π-system, one may save effort by
applying Halmos's monotone class theorem instead of Dynkin's theorem.
Theorem 1.1.44 (Halmos's monotone class theorem). If A ⊆ M with A
an algebra and M a monotone class then σ(A) ⊆ M.
Proof. Clearly, any algebra which is a monotone class must be a σ-algebra.
Another short though dense exercise in set manipulations shows that the intersection m(A) of all monotone classes containing an algebra A is both an algebra and
a monotone class (see the proof of [Bil95, Theorem 3.4]). Consequently, m(A) is
a σ-algebra. Since A ⊆ m(A) this implies that σ(A) ⊆ m(A) and we complete the
proof upon noting that m(A) ⊆ M.
Exercise 1.1.45. We say that a subset V of {1, 2, 3, . . .} has Cesàro density γ(V)
and write V ∈ CES if the limit
γ(V) = lim_{n→∞} n^{-1} |V ∩ {1, 2, 3, . . . , n}| ,
Proof. Let
fn(x) = n 1_{x>n} + Σ_{k=0}^{n2^n − 1} k2^{−n} 1_{(k2^{−n}, (k+1)2^{−n}]}(x) ,
noting that for a R.V. X ≥ 0, we have that Xn = fn(X) are simple functions. Since
X ≥ Xn+1 ≥ Xn and X(ω) − Xn(ω) ≤ 2^{−n} whenever X(ω) ≤ n, it follows that
Xn(ω) ↑ X(ω) as n → ∞, for each ω ∈ Ω.
We write a general R.V. as X(ω) = X₊(ω) − X₋(ω) where X₊(ω) = max(X(ω), 0)
and X₋(ω) = −min(X(ω), 0) are non-negative R.V.-s. By the above argument
the simple functions Xn = fn(X₊) − fn(X₋) have the convergence property we
claimed.
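The dyadic approximation fn in this proof is concrete enough to compute. A sketch (function name ours): for non-dyadic x ≤ n the displayed sum reduces to the dyadic floor of x at resolution 2^{−n}.

```python
from math import floor

def f_n(x, n):
    """f_n(x) = n for x > n, else the dyadic value k/2^n with x in
    (k/2^n, (k+1)/2^n]; for non-dyadic x this is floor(x * 2^n) / 2^n.
    (At dyadic points the floor convention differs from the displayed
    left-open intervals by 2^{-n}, immaterial for this illustration.)"""
    if x > n:
        return n
    return floor(x * 2**n) / 2**n

# Monotone convergence f_n(x) up to x, with error at most 2^{-n} once x <= n.
x = 2.718281828
vals = [f_n(x, n) for n in range(1, 30)]
assert all(v1 <= v2 for v1, v2 in zip(vals, vals[1:]))      # non-decreasing
assert all(x - f_n(x, n) <= 2**-n for n in range(3, 30))    # here x <= n
```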
Note that in case F = 2^Ω, every mapping X : Ω → S is measurable (and therefore
is an (S, S)-valued R.V.). The choice of the σ-algebra F is very important in
determining the class of all (S, S)-valued R.V. For example, there are non-trivial
σ-algebras G and F on Ω = R such that X(ω) = ω is a measurable function for
(Ω, F), but is non-measurable for (Ω, G). Indeed, one such example is when F is the
Borel σ-algebra B and G = σ({[a, b] : a, b ∈ Z}) (for example, the set {ω : ω ≤ α}
is not in G whenever α ∉ Z).
Building on Proposition 1.2.6 we have the following analog of Halmos's monotone
class theorem. It allows us to deduce in the sequel general properties of (bounded)
measurable functions upon verifying them only for indicators of elements of π-systems.
Theorem 1.2.7 (Monotone class theorem). Suppose H is a collection of
R-valued functions on Ω such that:
(a) The constant function 1 is an element of H.
(b) H is a vector space over R. That is, if h1, h2 ∈ H and c1, c2 ∈ R then
c1h1 + c2h2 is in H.
(c) If hn ∈ H are non-negative and hn ↑ h where h is a (bounded) real-valued
function on Ω, then h ∈ H.
If P is a π-system and I_A ∈ H for all A ∈ P, then H contains all (bounded)
functions on Ω that are measurable with respect to σ(P).
Remark. We stated here two versions of the monotone class theorem, with the
less restrictive assumption that (c) holds only for bounded h yielding the weaker
conclusion about bounded elements of mσ(P). In the sequel we use both versions,
which as we see next, are derived by essentially the same proof. Adapting this
proof you can also show that any collection H of non-negative functions on Ω
satisfying the conditions of Theorem 1.2.7, apart from requiring (b) to hold only
when c1h1 + c2h2 ≥ 0, must contain all non-negative elements of mσ(P).
Proof. Let L = {A ⊆ Ω : I_A ∈ H}. From (a) we have that Ω ∈ L, while (b)
implies that B \ A is in L whenever A ⊆ B are both in L. Further, in view of (c)
the collection L is closed under monotone increasing limits. Consequently, L is a
λ-system, so by Dynkin's π-λ theorem, our assumption that L contains P results
with σ(P) ⊆ L. With H a vector space over R, this in turn implies that H contains
all simple functions with respect to the measurable space (Ω, σ(P)). In the proof of
Proposition 1.2.6 we saw that any (bounded) measurable function is a difference of
Exercise 1.2.10. Adapting the proof of Theorem 1.2.9, show that for any mapping
X : Ω → S and any σ-algebra S of subsets of S, the collection {X^{-1}(B) : B ∈ S} is
a σ-algebra. Verify that X is an (S, S)-valued R.V. if and only if {X^{-1}(B) : B ∈
S} ⊆ F, in which case we denote {X^{-1}(B) : B ∈ S} either by σ(X) or by F^X and
call it the σ-algebra generated by X.
To practice your understanding of generated σ-algebras, solve the next exercise,
providing a convenient collection of generators for σ(X).
Exercise 1.2.11. If X is an (S, S)-valued R.V. and S = σ(A) then σ(X) is
generated by the collection of sets X^{-1}(A) := {X^{-1}(A) : A ∈ A}.
An important example of use of Exercise 1.2.11 corresponds to (R, B)-valued random variables and A = {(−∞, x] : x ∈ R} (or even A = {(−∞, x] : x ∈ Q}), which
generates B (see Exercise 1.1.17), leading to the following alternative definition of
the σ-algebra generated by such R.V. X.
More generally, given a random vector X = (X1, . . . , Xn), that is, random variables
X1, . . . , Xn on the same probability space, let σ(Xk, k ≤ n) (or F_n^X) denote the
smallest σ-algebra F such that Xk(ω), k = 1, . . . , n, are measurable on (Ω, F).
Alternatively,
σ(Xk, k ≤ n) = σ({ω : Xk(ω) ≤ α}, α ∈ R, k ≤ n) .
(b) Show that for any bounded random variable X and ε > 0 there exists a
simple function Y = Σ_{n=1}^N cn I_{An} with An ∈ A such that P(|X − Y| >
ε) < ε.
In view of Exercise 1.2.3 we have the following special case of Proposition 1.2.18,
corresponding to S = R^n and T = R equipped with the respective Borel σ-algebras.
Corollary 1.2.19. Let Xi, i = 1, . . . , n, be R.V. on the same measurable space
(Ω, F) and f : R^n → R a Borel function. Then, f(X1, . . . , Xn) is also a R.V. on
the same space.
To appreciate the power of Corollary 1.2.19, consider the following exercise, in
which you show that every continuous function is also a Borel function.
Exercise 1.2.20. Suppose (S, ρ) is a metric space (for example, S = R^n). A function g : S → [−∞, ∞] is called lower semi-continuous (l.s.c.) if lim inf_{ρ(y,x)→0} g(y) ≥
g(x), for all x ∈ S. A function g is said to be upper semi-continuous (u.s.c.) if −g
is l.s.c.
(a) Show that if g is l.s.c. then {x : g(x) ≤ b} is closed for each b ∈ R.
(b) Conclude that semi-continuous functions are Borel measurable.
(c) Conclude that continuous functions are Borel measurable.
sup_n Xn , lim inf_n Xn , lim sup_n Xn ,
{ω : inf_n Xn(ω) < b} = ∪_{n=1}^∞ {ω : Xn(ω) < b} = ∪_{n=1}^∞ Xn^{-1}([−∞, b)) ∈ F .
By the preceding proof we have that Yn = inf_{l≥n} Xl are R̄-valued R.V.-s and hence
so is W = sup_n Yn.
Similarly to the arguments already used, we conclude the proof either by observing
that
Z = lim sup_n Xn = inf_n [ sup_{l≥n} Xl ] ,
Remark. Since inf_n Xn, sup_n Xn, lim sup_n Xn and lim inf_n Xn may result in values ±∞ even when every Xn is R-valued, hereafter we let mF also denote the
collection of R̄-valued R.V.
An important corollary of this theorem deals with the existence of limits of sequences of R.V.
Corollary 1.2.23. For any sequence Xn ∈ mF, both
Ω0 = {ω : lim inf_n Xn(ω) = lim sup_n Xn(ω)}
and
Ω1 = {ω : lim inf_n Xn(ω) = lim sup_n Xn(ω) ∈ R}
are measurable sets.
Proof. By Theorem 1.2.22 we have that Z = lim sup_n Xn and W = lim inf_n Xn
are two R̄-valued variables on the same space, with Z(ω) ≥ W(ω) for all ω. Hence,
Ω1 = {ω : Z(ω) − W(ω) = 0, Z(ω) ∈ R, W(ω) ∈ R} is measurable (apply Corollary
1.2.19 for f(z, w) = z − w), as is Ω0 = W^{-1}({+∞}) ∪ Z^{-1}({−∞}) ∪ Ω1.
The following structural result is yet another consequence of Theorem 1.2.22.
Corollary 1.2.24. For any d < ∞ and R.V.-s Y1, . . . , Yd on the same measurable
space (Ω, F) the collection H = {h(Y1, . . . , Yd); h : R^d → R Borel function} is a
vector space over R containing the constant functions, such that if Xn ∈ H are
non-negative and Xn ↑ X, an R-valued function on Ω, then X ∈ H.
Proof. By Example 1.2.21 the collection of all Borel functions is a vector
space over R which evidently contains the constant functions. Consequently, the
same applies for H. Next, suppose Xn = hn(Y1, . . . , Yd) for Borel functions hn such
that 0 ≤ Xn(ω) ↑ X(ω) for all ω ∈ Ω. Then, h̄(y) = sup_n hn(y) is by Theorem
1.2.22 an R̄-valued Borel function on R^d, such that X = h̄(Y1, . . . , Yd). Setting
h(y) = h̄(y) when h̄(y) ∈ R and h(y) = 0 otherwise, it is easy to check that h is a
real-valued Borel function. Moreover, with X : Ω → R (finite valued), necessarily
X = h(Y1, . . . , Yd) as well, so X ∈ H.
The point-wise convergence of R.V., that is, Xn(ω) → X(ω) for every ω ∈ Ω, is
often too strong of a requirement, as it may fail to hold as a result of the R.V. being
ill-defined for a negligible set of values of ω (that is, a set of zero measure). We
thus define the more useful, weaker notion of almost sure convergence of random
variables.
Definition 1.2.25. We say that a sequence of random variables Xn on the same
probability space (Ω, F, P) converges almost surely if P(Ω0) = 1. We then set
X∞ = lim sup_n Xn, and say that Xn converges almost surely to X∞, or use the
notation Xn →a.s. X∞.
Remark. Note that in Definition 1.2.25 we allow the limit X∞(ω) to take the
values ±∞ with positive probability. So, we say that Xn converges almost surely
to a finite limit if P(Ω1) = 1, or alternatively, if X∞ ∈ R with probability one.
We proceed with an explicit characterization of the functions measurable with
respect to a σ-algebra of the form σ(Yk, k ≤ n).
Theorem 1.2.26. Let G = σ(Yk, k ≤ n) for some n < ∞ and R.V.-s Y1, . . . , Yn
on the same measurable space (Ω, F). Then, mG = {g(Y1, . . . , Yn) : g : R^n →
R is a Borel function}.
Proof. From Corollary 1.2.19 we know that Z = g(Y1, . . . , Yn) is in mG for
each Borel function g : R^n → R. Turning to prove the converse result, recall from
part (b) of Exercise 1.2.14 that the σ-algebra G is generated by the π-system P =
{A_α : α = (α1, . . . , αn) ∈ R^n}, where I_{A_α} = h_α(Y1, . . . , Yn) for the Borel function
h_α(y1, . . . , yn) = Π_{k=1}^n 1_{yk ≤ αk}. Thus, in view of Corollary 1.2.24, we have by the
monotone class theorem that H = {g(Y1, . . . , Yn) : g : R^n → R is a Borel function}
contains all elements of mG.
We conclude this sub-section with a few exercises, starting with Borel measurability of monotone functions (regardless of their continuity properties).
Show that for any ε > 0, there exists an event A with P(A) < ε and a non-random
N = N(ε), sufficiently large, such that Xn(ω) < X∞(ω) + ε for all n ≥ N and every
ω ∈ A^c.
Equipped with Theorem 1.2.22 you can also strengthen Proposition 1.2.6.
Finally, relying on Theorem 1.2.26 it is easy to show that a Borel function can
only reduce the amount of information quantified by the corresponding generated
σ-algebras, whereas such information content is invariant under invertible Borel
transformations, that is,
Exercise 1.2.33. Show that σ(g(Y1, . . . , Yn)) ⊆ σ(Yk, k ≤ n) for any Borel function g : R^n → R. Further, if Y1, . . . , Yn and Z1, . . . , Zm defined on the same probability space are such that Zk = gk(Y1, . . . , Yn), k = 1, . . . , m, and Yi = hi(Z1, . . . , Zm),
i = 1, . . . , n, for some Borel functions gk : R^n → R and hi : R^m → R, then
σ(Y1, . . . , Yn) = σ(Z1, . . . , Zm).
1.2.3. Distribution, density and law. As defined next, every random variable X induces a probability measure on its range which is called the law of X.
Definition 1.2.34. The law of a real-valued R.V. X, denoted PX, is the probability measure on (R, B) such that PX(B) = P({ω : X(ω) ∈ B}) for any Borel set
B.
Remark. Since X is a R.V., it follows that PX(B) is well defined for all B ∈ B.
Further, the non-negativity of P implies that PX is a non-negative set function on
(R, B), and since X^{-1}(R) = Ω, also PX(R) = 1. Consider next disjoint Borel sets
Bi, observing that X^{-1}(Bi) ∈ F are disjoint subsets of Ω such that
X^{-1}(∪_i Bi) = ∪_i X^{-1}(Bi) .
Our next result characterizes the set of all functions F : R → [0, 1] that are
distribution functions of some R.V.
Theorem 1.2.37. A function F : R → [0, 1] is a distribution function of some
R.V. if and only if
(a) F is non-decreasing,
(b) lim_{x→∞} F(x) = 1 and lim_{x→−∞} F(x) = 0,
(c) F is right-continuous, i.e. lim_{y↓x} F(y) = F(x).
Proof. First, assuming that F = FX is a distribution function, we show that
it must have the stated properties (a)-(c). Indeed, if x ≤ y then (−∞, x] ⊆ (−∞, y],
and by the monotonicity of the probability measure PX (see part (a) of Exercise
1.1.4), we have that FX(x) ≤ FX(y), proving that FX is non-decreasing. Further,
(−∞, x] ↑ R as x ↑ ∞, while (−∞, x] ↓ ∅ as x ↓ −∞, resulting with property (b)
of the theorem by the continuity from below and the continuity from above of the
probability measure PX on R. Similarly, since (−∞, y] ↓ (−∞, x] as y ↓ x we get
the right continuity of FX by yet another application of continuity from above of
PX.
We proceed to prove the converse result, that is, assuming F has the stated properties (a)-(c), we consider the random variable X⁻(ω) = sup{y : F(y) < ω} on
the probability space ((0, 1], B(0,1], U) and show that FX⁻ = F. With F having
property (b), we see that for any ω > 0 the set {y : F(y) < ω} is non-empty and
further if ω < 1 then X⁻(ω) < ∞, so X⁻ : (0, 1) → R is well defined. The identity
(1.2.1)   {ω : X⁻(ω) ≤ x} = {ω : ω ≤ F(x)} ,
implies that FX⁻(x) = U((0, F(x)]) = F(x) for all x ∈ R, and further, the sets
(0, F(x)] are all in B(0,1], implying that X⁻ is a measurable function (i.e. a R.V.).
Turning to prove (1.2.1), note that if ω ≤ F(x) then x ∉ {y : F(y) < ω} and so by
definition (and the monotonicity of F), X⁻(ω) ≤ x. Now suppose that ω > F(x).
Since F is right continuous, this implies that F(x + δ) < ω for some δ > 0, hence
by definition of X⁻ also X⁻(ω) ≥ x + δ > x, completing the proof of (1.2.1) and
with it the proof of the theorem.
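The construction X⁻(ω) = sup{y : F(y) < ω} in the preceding proof is what simulation calls inverse-transform sampling. A sketch (helper names and the empirical check are our own) for the exponential distribution F(x) = 1 − e^{−x}, where the sup has the closed form −log(1 − ω):

```python
import random
from math import exp, log

def X_minus(omega):
    """X^-(omega) = sup{y : F(y) < omega} for F(x) = 1 - exp(-x), x >= 0,
    which is the closed-form inverse -log(1 - omega)."""
    return -log(1.0 - omega)

# By the identity (1.2.1), with omega uniform on (0, 1] the random variable
# X^-(omega) has distribution function F, which we check empirically.
random.seed(0)
samples = [X_minus(random.random()) for _ in range(100_000)]
for x in (0.5, 1.0, 2.0):
    F = 1 - exp(-x)
    empirical = sum(s <= x for s in samples) / len(samples)
    assert abs(empirical - F) < 0.01
```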
Check your understanding of the preceding proof by showing that the collection
of distribution functions for R̄-valued random variables consists of all F : R → [0, 1]
that are non-decreasing and right-continuous.
Remark. The construction of the random variable X⁻(ω) in Theorem 1.2.37 is
called Skorokhod's representation. You can, and should, verify that the random
variable X⁺(ω) = sup{y : F(y) ≤ ω} would have worked equally well for that
purpose, since X⁺(ω) ≠ X⁻(ω) only if X⁺(ω) > q ≥ X⁻(ω) for some rational q,
in which case by definition F(q) = ω, so there are at most countably many such
values of ω (hence P(X⁺ ≠ X⁻) = 0). We shall return to this construction when
dealing with convergence in distribution in Section 3.2. An alternative approach to
Theorem 1.2.37 is to adapt the construction of the probability measure of Example
1.1.26, taking here Ω = R with the corresponding change to A and replacing the
right side of (1.1.1) with Σ_{k=1}^r (F(bk) − F(ak)), yielding a probability measure P
on (R, B) such that P((−∞, α]) = F(α) for all α ∈ R (c.f. [Bil95, Theorem 12.4]).
Our next example highlights the possible shape of the distribution function.
Example 1.2.38. Consider Example 1.1.6 of n coin tosses, with σ-algebra Fn =
2^{Ωn}, sample space Ωn = {H, T}^n, and the probability measure Pn(A) = Σ_{ω∈A} p_ω,
where p_ω = 2^{−n} for each ω ∈ Ωn (that is, ω = (ω1, ω2, . . . , ωn) for ωi ∈ {H, T}),
corresponding to independent, fair, coin tosses. Let Y(ω) = I_{ω1=H} measure the
outcome of the first toss. The law of this random variable is
PY(B) = (1/2) 1_{0∈B} + (1/2) 1_{1∈B}
and its distribution function is
(1.2.2)   FY(α) = PY((−∞, α]) = Pn(Y(ω) ≤ α) =
  1,   α ≥ 1
  1/2, 0 ≤ α < 1
  0,   α < 0 .
Note that in general σ(X) is a strict subset of the σ-algebra F (in Example 1.2.38
we have that σ(Y) determines the probability measure for the first coin toss, but
tells us nothing about the probability measure assigned to the remaining n − 1
tosses). Consequently, though the law PX determines the probability measure P
on σ(X) it usually does not completely determine P.
Example 1.2.38 is somewhat generic. That is, if the R.V. X is a simple function (or
more generally, when the set {X(ω) : ω ∈ Ω} is countable and has no accumulation
points), then its distribution function FX is piecewise constant with jumps at the
possible values that X takes and jump sizes that are the corresponding probabilities.
Indeed, note that (−∞, y] ↑ (−∞, x) as y ↑ x, so by the continuity from below of
PX it follows that
FX(x⁻) := lim_{y↑x} FX(y) = P({ω : X(ω) < x}) = FX(x) − P({ω : X(ω) = x}) ,
(1.2.4)   FU(α) = P(U ≤ α) = P(U ∈ [0, α]) =
  1, α > 1
  α, 0 ≤ α ≤ 1
  0, α < 0
and its density is fU(u) = 1 for 0 ≤ u ≤ 1 and fU(u) = 0 otherwise.
The exponential distribution function is
F(x) = 0 for x ≤ 0, and F(x) = 1 − e^{−x} for x ≥ 0,
corresponding to the density f(x) = 0 for x ≤ 0 and f(x) = e^{−x} for x > 0, while
the standard normal distribution has the density
φ(x) = (2π)^{−1/2} e^{−x²/2} .
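The relation between the exponential density and its distribution function can be checked numerically. A sketch (the midpoint-rule discretization is our own choice) verifying that ∫₀ˣ e^{−y} dy recovers F(x) = 1 − e^{−x}:

```python
from math import exp

def F(x):
    """Exponential distribution function."""
    return 1 - exp(-x) if x >= 0 else 0.0

def integral_f(x, steps=200_000):
    """Midpoint-rule Riemann sum of the density f(y) = exp(-y) over [0, x]."""
    h = x / steps
    return sum(exp(-(k + 0.5) * h) for k in range(steps)) * h

for x in (0.5, 1.0, 3.0):
    assert abs(F(x) - integral_f(x)) < 1e-6
```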
Every real-valued R.V. X has a distribution function but not necessarily a density.
For example X = 0 w.p.1 has distribution function FX(α) = 1_{α≥0}. Since FX is
discontinuous at 0, the R.V. X does not have a density.
Definition 1.2.42. We say that a function F is a Lebesgue singular function if
it has a zero derivative except on a set of zero Lebesgue measure.
Since the distribution function of any R.V. is non-decreasing, from real analysis
we know that it is almost everywhere differentiable. However, perhaps somewhat
surprisingly, there are continuous distribution functions that are Lebesgue singular
functions. Consequently, there are non-discrete random variables that do not have
a density. We next provide one such example.
Example 1.2.43. The Cantor set C is defined by removing (1/3, 2/3) from [0, 1] and then iteratively removing the middle third of each interval that remains. The uniform distribution on the (closed) set C corresponds to the distribution function obtained by setting F(x) = 0 for x ≤ 0, F(x) = 1 for x ≥ 1, F(x) = 1/2 for x ∈ [1/3, 2/3], then F(x) = 1/4 for x ∈ [1/9, 2/9], F(x) = 3/4 for x ∈ [7/9, 8/9], and so on (which, as you should check, satisfies the properties (a)-(c) of Theorem 1.2.37). From the definition, we see that dF/dx = 0 for almost every x ∉ C and that the corresponding probability measure has P(C^c) = 0. As the Lebesgue measure of C is zero, we see that the derivative of F is zero except on a set of zero Lebesgue measure, and consequently, there is no function f for which F(x) = ∫_{−∞}^x f(y) dy holds. Though it is somewhat more involved, you may want to check that F is everywhere continuous (c.f. [Bil95, Problem 31.2]).
Even discrete distribution functions can be quite complex. As the next example
shows, the points of discontinuity of such a function might form a (countable) dense
subset of R (which in a sense is extreme, per Exercise 1.2.39).
Example 1.2.44. Let q_1, q_2, … be an enumeration of the rational numbers and set
F(x) = Σ_{i=1}^∞ 2^{−i} 1_{[q_i, ∞)}(x)
(where 1_{[q_i,∞)}(x) = 1 if x ≥ q_i and zero otherwise). Clearly, such F is non-decreasing, with limits 0 and 1 as x → −∞ and x → ∞, respectively. It is not hard to check that F is also right continuous, hence a distribution function, whereas by construction F is discontinuous at each rational number.
As we have that P({ω : X(ω) ≤ α}) = F_X(α) for the generators {ω : X(ω) ≤ α} of σ(X), we are not at all surprised by the following proposition.
Proposition 1.2.45. The distribution function FX uniquely determines the law
PX of X.
monotonicity and linearity. In Subsection 1.3.2 we consider fundamental inequalities associated with the expectation. Subsection 1.3.3 is about the exchange of integration and limit operations, complemented by uniform integrability and its consequences in Subsection 1.3.4. Subsection 1.3.5 considers densities relative to arbitrary measures and relates our treatment of integration and expectation to Riemann's integral and the classical definition of the expectation for a R.V. with probability density. We conclude with Subsection 1.3.6 about moments of random variables, including their values for a few well known distributions.
1.3.1. Lebesgue integral, linearity and monotonicity. Let SF+ denote the collection of non-negative simple functions with respect to the given measurable space (S, F) and mF+ denote the collection of [0, ∞]-valued measurable functions on this space. We next define Lebesgue's integral with respect to any measure μ on (S, F), first for φ ∈ SF+, then extending it to all f ∈ mF+. With the notation μ(f) := ∫_S f(s) dμ(s) for this integral, we also denote by μ0(·) the more restrictive integral, defined only on SF+, so as to clarify the role each of these plays in some of our proofs. We call an R-valued measurable function f ∈ mF for which μ(|f|) < ∞ a μ-integrable function, and denote the collection of all μ-integrable functions by L1(S, F, μ), extending the definition of the integral μ(f) to all f ∈ L1(S, F, μ).
Definition 1.3.1. Fix a measure space (S, F, μ) and define μ(f) by the following four step procedure:
Step 1. Define μ0(I_A) := μ(A) for each A ∈ F.
Step 2. Represent each φ ∈ SF+ as φ = Σ_{l=1}^n c_l I_{A_l} with non-random c_l ∈ [0, ∞] and sets A_l ∈ F, yielding the definition of the integral via
μ0(φ) := Σ_{l=1}^n c_l μ(A_l).
Step 3. For f ∈ mF+ define μ(f) := sup{μ0(φ) : φ ∈ SF+, φ ≤ f}.
Step 4. For μ-integrable f ∈ mF set μ(f) := μ(f₊) − μ(f₋), where f₊ = max(f, 0) and f₋ = max(−f, 0) are both in mF+.
In particular, on a probability space (Ω, F, P), Step 2 gives for a non-negative simple R.V. X = Σ_{l=1}^n c_l I_{A_l} that
EX = Σ_{l=1}^n c_l E[I_{A_l}] = Σ_{l=1}^n c_l P(A_l).
Remark. Note that we may have EX = ∞ while X(ω) < ∞ for all ω. For instance, take the random variable X(ω) = ω for Ω = {1, 2, …} and F = 2^Ω. If P(ω = k) = c k^{−2} with c = [Σ_{k=1}^∞ k^{−2}]^{−1} a positive, finite normalization constant, then EX = c Σ_{k=1}^∞ k^{−1} = ∞.
μ0(cφ) = c μ0(φ).
μ(f_n(h) + f_n(g)) = μ0(f_n(h) + f_n(g)) = μ0(f_n(h)) + μ0(f_n(g)) = μ(f_n(h)) + μ(f_n(g)),
so that
μ(h + g) = lim_n μ(f_n(h) + f_n(g)) = lim_n μ(f_n(h)) + lim_n μ(f_n(g)) = μ(h) + μ(g).
Similarly,
μ(cf) = c μ(f).
(see Exercise 1.1.4). Hence, if μ({s : h(s) > 0}) > 0, then for some n < ∞,
0 < n^{−1} μ({s : h(s) > n^{−1}}) = μ0(n^{−1} I_{h>n^{−1}}) ≤ μ(h),
where the right most inequality is a consequence of the definition of μ(h) and the fact that h ≥ n^{−1} I_{h>n^{−1}} ∈ SF+. Thus, our assumption that μ(h) = 0 must imply that μ({s : h(s) > 0}) = 0.
To prove the second part of the lemma, consider h̃ = g − f, which is non-negative outside a set N ∈ F such that μ(N) = 0. Hence, h = (g − f) I_{N^c} ∈ mF+ and 0 = μ(g) − μ(f) = μ(h̃) = μ(h) by Proposition 1.3.5, implying that μ({s : h(s) > 0}) = 0 by the preceding proof. The same applies for −h̃ and the statement of the lemma follows.
We conclude this subsection by stating the results of Proposition 1.3.5 and Lemma
1.3.8 in terms of the expectation on a probability space (, F , P).
Theorem 1.3.9. The mathematical expectation E[X] is well defined for every R.V. X on (Ω, F, P) provided either X ≥ 0 almost surely, or X ∈ L1(Ω, F, P). Further,
(a) EX = EY whenever X a.s.= Y.
(b) The expectation is a linear operation, for if Y and Z are integrable R.V. then for any constants α, β the R.V. αY + βZ is integrable and E(αY + βZ) = α(EY) + β(EZ). The same applies when Y, Z ≥ 0 almost surely and α, β ≥ 0.
(c) The expectation is monotone. That is, if Y and Z are either integrable or non-negative and Y ≥ Z almost surely, then EY ≥ EZ. Further, if Y and Z are integrable with Y ≥ Z a.s. and EY = EZ, then Y a.s.= Z.
(d) Constants are invariant under the expectation. That is, if X a.s.= c for non-random c ∈ (−∞, ∞], then EX = c.
Remark. Part (d) of the theorem relies on the fact that P is a probability measure, namely P(Ω) = 1. Indeed, it is obtained by considering the expectation of the simple function c I_Ω, to which X equals with probability one.
The linearity of the expectation (i.e. part (b) of the preceding theorem) is often extremely helpful when looking for an explicit formula for it. We next provide a few examples of this.
Exercise 1.3.10. Write (Ω, F, P) for a random experiment whose outcome is a recording of the results of n independent rolls of a balanced six-sided die (including their order). Compute the expectation of the random variable D(ω) which counts the number of different faces of the die recorded in these n rolls.
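One natural route (a sketch of one possible solution, not the only one): write D as the sum over the six faces of the indicator that the face appears at least once, so by linearity ED = 6(1 − (5/6)^n). The short script below, with hypothetical helper names of our choosing, compares this closed form with a Monte Carlo estimate.

```python
import random

def expected_faces(n):
    # By linearity: D = sum of indicators that face f appears at least once,
    # and each face is missed in all n rolls with probability (5/6)**n.
    return 6 * (1 - (5 / 6) ** n)

def simulate_faces(n, trials=200_000, seed=0):
    # Monte Carlo estimate of E[D] for n independent rolls of a fair die.
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        total += len({rng.randint(1, 6) for _ in range(n)})
    return total / trials

if __name__ == "__main__":
    for n in (1, 3, 10):
        print(n, expected_faces(n), round(simulate_faces(n), 3))
```

For n = 1 the formula gives ED = 1, as it must, and the simulated values track the formula closely for larger n.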
Exercise 1.3.11 (Matching). In a random matching experiment, we apply a random permutation π to the integers {1, 2, …, n}, where each of the possible n! permutations is equally likely. Let Z_i = I_{π(i)=i} be the random variable indicating whether i = 1, 2, …, n is a fixed point of the random permutation, and let X_n = Σ_{i=1}^n Z_i count the number of fixed points of the random permutation (i.e. the number of self-matchings). Show that E[X_n(X_n − 1) ⋯ (X_n − k + 1)] = 1 for k = 1, 2, …, n.
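For small n one can verify the claimed identity by brute force, averaging the falling factorial X_n(X_n − 1)⋯(X_n − k + 1) over all n! permutations (this checks the statement of the exercise; it is of course not a proof):

```python
from itertools import permutations
from math import factorial

def falling_factorial_moment(n, k):
    # Average of X(X-1)...(X-k+1) over all n! permutations of {0,...,n-1},
    # where X counts the fixed points of the permutation.
    total = 0
    for perm in permutations(range(n)):
        x = sum(1 for i, p in enumerate(perm) if i == p)
        term = 1
        for j in range(k):
            term *= (x - j)
        total += term
    return total / factorial(n)

if __name__ == "__main__":
    for n in (3, 4, 5):
        print([falling_factorial_moment(n, k) for k in range(1, n + 1)])
```

Every printed value equals 1.0 exactly, in line with the exercise.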
Similarly, here is an elementary application of the monotonicity of the expectation
(i.e. part (c) of the preceding theorem).
Exercise 1.3.12. Suppose an integrable random variable X is such that E(X I_A) = 0 for each A ∈ σ(X). Show that necessarily X = 0 almost surely.
1.3.2. Inequalities. The linearity of the expectation often allows us to compute EX even when we cannot compute the distribution function FX . In such cases
the expectation can be used to bound tail probabilities, based on the following classical inequality.
Theorem 1.3.13 (Markov's inequality). Suppose ψ : R → [0, ∞] is a Borel function and let ψ*(A) = inf{ψ(y) : y ∈ A} for any A ∈ B. Then for any R.V. X,
ψ*(A) P(X ∈ A) ≤ E(ψ(X) I_{X∈A}) ≤ E ψ(X).
Proof. By the definition of ψ*(A) and the non-negativity of ψ we have that
ψ*(A) I_{x∈A} ≤ ψ(x) I_{x∈A} ≤ ψ(x),
for all x ∈ R. Therefore, ψ*(A) I_{X∈A} ≤ ψ(X) I_{X∈A} ≤ ψ(X) for every ω ∈ Ω. We deduce the stated inequality by the monotonicity of the expectation and the identity E(ψ*(A) I_{X∈A}) = ψ*(A) P(X ∈ A) (due to Step 2 of Definition 1.3.1).
We next specify three common instances of Markov's inequality.
Example 1.3.14. (a). Taking ψ(x) = x₊ and A = [a, ∞) for some a > 0 we have that ψ*(A) = a. Markov's inequality is then
P(X ≥ a) ≤ EX₊ / a,
which is particularly appealing when X ≥ 0, so EX₊ = EX.
(b). Taking ψ(x) = |x|^q and A = (−∞, −a] ∪ [a, ∞) for some a > 0, we get that ψ*(A) = a^q. Markov's inequality is then a^q P(|X| ≥ a) ≤ E|X|^q. Considering q = 2 and X = Y − EY, this yields
P(|Y − EY| ≥ a) ≤ Var(Y) / a²,
which we call Chebyshev's inequality (c.f. Definition 1.3.67 for the variance and moments of a random variable Y).
(c). Taking ψ(x) = e^{θx} for some θ > 0 and A = [a, ∞) for some a ∈ R we have that ψ*(A) = e^{θa}. Markov's inequality is then
P(X ≥ a) ≤ e^{−θa} E e^{θX}.
This bound provides an exponential decay in a, at the cost of requiring X to have finite exponential moments.
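The three bounds of Example 1.3.14 can be compared numerically for a rate-1 exponential variable, where the exact values P(X ≥ a) = e^{−a}, EX = 1, Var(X) = 1 and E e^{θX} = 1/(1 − θ) for θ < 1 are all in closed form (the function names below are ours):

```python
import math

# Exact tail and moments of a rate-1 exponential variable X:
# P(X >= a) = exp(-a), E[X] = 1, Var(X) = 1, E[exp(t X)] = 1/(1-t) for t < 1.
def tail(a):
    return math.exp(-a)

def markov_bound(a):           # (a): P(X >= a) <= E[X]/a
    return 1.0 / a

def chebyshev_bound(a):        # (b): P(|X - 1| >= a) <= Var(X)/a^2
    return 1.0 / a**2

def chernoff_bound(a, t=0.5):  # (c): P(X >= a) <= e^{-t a} E[e^{t X}], 0 < t < 1
    return math.exp(-t * a) / (1.0 - t)

if __name__ == "__main__":
    for a in (2.0, 5.0, 10.0):
        print(a, tail(a), markov_bound(a), chernoff_bound(a))
```

Note how the exponential bound of (c) eventually beats the polynomial bounds of (a) and (b) as a grows, matching the remark about exponential decay.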
In general, we cannot compute EX explicitly from Definition 1.3.1 except for discrete R.V.-s and for R.V.-s having a probability density function. We thus appeal to the properties of the expectation listed in Theorem 1.3.9, or use various inequalities to bound one expectation by another. To this end, we start with Jensen's inequality, dealing with the effect that a convex function has on the expectation.
Proposition 1.3.15 (Jensen's inequality). Suppose g(·) is a convex function on an open interval G of R, that is,
λ g(x) + (1 − λ) g(y) ≥ g(λx + (1 − λ)y)   ∀x, y ∈ G, 0 ≤ λ ≤ 1.
If X is an integrable R.V. with P(X ∈ G) = 1 and g(X) is also integrable, then E g(X) ≥ g(EX).
Proof. Since g is convex, for any c ∈ G the slopes of its chords satisfy
(D₋g)(c) := sup_{h>0, c−h∈G} [g(c) − g(c − h)]/h ≤ inf_{h>0, c+h∈G} [g(c + h) − g(c)]/h =: (D₊g)(c).
With G an open set, obviously (D₋g)(x) > −∞ and (D₊g)(x) < ∞ for any x ∈ G (in particular, g(·) is continuous on G). Now for any b ∈ [(D₋g)(c), (D₊g)(c)] ⊆ R we get
(1.3.3)   g(x) ≥ g(c) + b(x − c)   ∀x ∈ G,
out of the definition of D₊g and D₋g. Taking c = EX and x = X(ω) in (1.3.3) and then taking expectations yields E g(X) ≥ g(EX).
Remark. Since g(·) is convex if and only if −g(·) is concave, we may as well state Jensen's inequality for concave functions, just reversing the sign of the inequality in this case. A trivial instance of Jensen's inequality happens when X(ω) = x I_A(ω) + y I_{A^c}(ω) for some x, y ∈ R and A ∈ F such that P(A) = λ. Then,
EX = x P(A) + y P(A^c) = λx + y(1 − λ),
so E g(X) = λ g(x) + (1 − λ) g(y) ≥ g(λx + (1 − λ)y) = g(EX), as g is convex.
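This two-point instance is easy to check numerically for a concrete convex function such as g(x) = eˣ (a small sketch; the function name jensen_gap is ours):

```python
import math

def jensen_gap(x, y, lam, g=math.exp):
    # For X = x on A, y on A^c with P(A) = lam and convex g,
    # Jensen's inequality says E g(X) - g(E X) >= 0.
    e_gx = lam * g(x) + (1 - lam) * g(y)   # E g(X)
    g_ex = g(lam * x + (1 - lam) * y)      # g(E X)
    return e_gx - g_ex

if __name__ == "__main__":
    print(jensen_gap(0.0, 2.0, 0.3))   # strictly positive
    print(jensen_gap(1.0, 1.0, 0.5))   # zero when X is degenerate
```

The gap vanishes exactly when x = y, i.e. when X is almost surely constant, which is consistent with the strict convexity of eˣ.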
We next bound the expectation of the product of two R.V. while assuming nothing
about the relation between them.
Proposition 1.3.17 (Hölder's inequality). Let X, Y be two random variables on the same probability space. If p, q > 1 with 1/p + 1/q = 1, then
(1.3.4)   E|XY| ≤ (E|X|^p)^{1/p} (E|Y|^q)^{1/q}.
The special case p = q = 2, that is,
E|XY| ≤ √(EX²) √(EY²),
is called the Cauchy-Schwarz inequality.
Proof. Fixing p > 1 and q = p/(p − 1), let α = ||X||_p and β = ||Y||_q. If α = 0 then |X|^p a.s.= 0 (see Theorem 1.3.9). Likewise, if β = 0 then |Y|^q a.s.= 0. In either case, the inequality (1.3.4) trivially holds. As this inequality also trivially holds when either α = ∞ or β = ∞, we may and shall assume hereafter that both α and β are finite and strictly positive. Recall that
x^p/p + y^q/q − xy ≥ 0,   ∀x, y ≥ 0
(c.f. [Dur10, Page 21] where it is proved by considering the first two derivatives in x). Taking x = |X|/α and y = |Y|/β, we have by linearity and monotonicity of the expectation that
E|XY|/(αβ) ≤ E|X|^p/(p α^p) + E|Y|^q/(q β^q) = 1/p + 1/q = 1,
which gives (1.3.4).
A direct consequence of Hölder's inequality is the triangle inequality for the norm ||X||_p in Lp(Ω, F, P), that is,
Proposition 1.3.18 (Minkowski's inequality). If X, Y ∈ Lp(Ω, F, P), p ≥ 1, then ||X + Y||_p ≤ ||X||_p + ||Y||_p.
(Its proof combines the elementary bound |x + y|^p ≤ (|x| + |y|)|x + y|^{p−1}, for x, y ∈ R, p > 1, with Hölder's inequality for the exponent q such that 1 − 1/q = 1/p.)
Remark. Jensen's inequality applies only for probability measures, while both Hölder's inequality μ(|fg|) ≤ μ(|f|^p)^{1/p} μ(|g|^q)^{1/q} and Minkowski's inequality apply for any measure μ, with exactly the same proof we provided for probability measures.
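Hölder's inequality is easy to test on a finite probability space with uniform weights, where both sides are finite sums (a sketch, with function names of our choosing):

```python
# Hölder's inequality E|XY| <= (E|X|^p)^(1/p) (E|Y|^q)^(1/q), checked on a
# finite probability space with uniform weights, for several conjugate pairs.
def holder_sides(xs, ys, p):
    n = len(xs)
    q = p / (p - 1)                       # conjugate exponent: 1/p + 1/q = 1
    lhs = sum(abs(x * y) for x, y in zip(xs, ys)) / n
    rhs = (sum(abs(x) ** p for x in xs) / n) ** (1 / p) * \
          (sum(abs(y) ** q for y in ys) / n) ** (1 / q)
    return lhs, rhs

if __name__ == "__main__":
    xs, ys = [1.0, -2.0, 3.0, 0.5], [0.3, 1.2, -0.7, 2.0]
    for p in (1.5, 2.0, 3.0):
        lhs, rhs = holder_sides(xs, ys, p)
        print(p, lhs <= rhs + 1e-12)
```

The case p = 2.0 is exactly the Cauchy-Schwarz inequality of (1.3.4)'s special case.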
To practice your understanding of Markov's inequality, solve the following exercise.
Exercise 1.3.19. Let X be a non-negative random variable with Var(X) ≤ 1/2. Show that then P(1 + EX ≥ X ≥ EX − 1) ≥ 1/2.
To practice your understanding of the proof of Jensen's inequality, try to prove its extension to convex functions on R^n.
Exercise 1.3.20. Suppose g : R^n → R is a convex function and X_1, X_2, …, X_n are integrable random variables, defined on the same probability space and such that g(X_1, …, X_n) is integrable. Show that then E g(X_1, …, X_n) ≥ g(EX_1, …, EX_n).
Hint: Use convex analysis to show that g(·) is continuous and further that for any c ∈ R^n there exists b ∈ R^n such that g(x) ≥ g(c) + ⟨b, x − c⟩ for all x ∈ R^n (with ⟨·, ·⟩ denoting the inner product of two vectors in R^n).
Exercise 1.3.21. Let Y ≥ 0 with v = E(Y²) < ∞.
(a) Show that for any 0 ≤ a < EY,
P(Y > a) ≥ (EY − a)² / E(Y²).
Hint: Apply the Cauchy-Schwarz inequality to Y I_{Y>a}.
(b) Show that (E|Y² − v|)² ≤ 4v(v − (EY)²).
Example 1.3.25. Consider the probability space ((0, 1], B_{(0,1]}, U) and X_n(ω) = 1_{[t_n, t_n+s_n]}(ω) with s_n → 0 as n → ∞ slowly enough and t_n ∈ [0, 1 − s_n] such that any ω ∈ (0, 1] is in infinitely many intervals [t_n, t_n + s_n]. The latter property applies if t_n = (i − 1)/k and s_n = 1/k when n = k(k − 1)/2 + i, i = 1, 2, …, k and k = 1, 2, … (plot the intervals [t_n, t_n + s_n] to convince yourself). Then, X_n →p 0 (since s_n = U(X_n ≠ 0) → 0), whereas fixing each ω ∈ (0, 1], we have that X_n(ω) = 1 for infinitely many values of n, hence X_n does not converge a.s. to zero.
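The block structure of this "typewriter" sequence can be made explicit with a few lines of code (function names ours; we use the half-open reading t_n < ω ≤ t_n + s_n so that each block partitions (0, 1]):

```python
# Example 1.3.25, indexed as n = k(k-1)/2 + i with i = 1,...,k: the n-th
# interval is [(i-1)/k, i/k], of length s_n = 1/k. Each fixed omega lies in
# exactly one interval per block k, so it is hit infinitely often, while the
# interval lengths shrink to 0 (convergence in probability, but not a.s.).
def interval(n):
    k = 1
    while k * (k - 1) // 2 + k < n:    # locate the block k containing index n
        k += 1
    i = n - k * (k - 1) // 2
    return (i - 1) / k, i / k          # (t_n, t_n + s_n)

def hits(omega, n_max):
    return [n for n in range(1, n_max + 1)
            if interval(n)[0] < omega <= interval(n)[1]]

if __name__ == "__main__":
    print(interval(1), interval(4))    # block k=1, then block k=3 starts at n=4
    print(len(hits(0.37, 55)))         # n <= 55 covers blocks k = 1,...,10
```

For any fixed ω the count of hits grows with the number of completed blocks, confirming that X_n(ω) = 1 infinitely often even though U(X_n ≠ 0) = 1/k → 0.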
Associated with each space Lq(Ω, F, P) is the notion of Lq convergence, which we now define.
Next note that Lq convergence implies the convergence of the expectation of |X_n|^q.
Exercise 1.3.28. Fixing q ≥ 1, use Minkowski's inequality (Proposition 1.3.18) to show that if X_n →Lq X∞, then E[|X_n|^q] → E[|X∞|^q].
Turning to the proof of Proposition 1.3.29, note that by Markov's inequality (c.f. part (b) of Example 1.3.14),
P(|Y| > ε) ≤ ε^{−q} E|Y|^q
for any R.V. Y and any ε > 0. The assumed convergence in Lq means that E[|X_n − X∞|^q] → 0 as n → ∞, so taking Y = Y_n = X_n − X∞, we necessarily have also P(|X_n − X∞| > ε) → 0 as n → ∞. Since ε > 0 is arbitrary, we see that X_n →p X∞ as claimed.
The converse of Proposition 1.3.29 does not hold in general. As we next demonstrate, even the stronger almost sure convergence (see Exercise 1.3.23) and having a non-random constant limit are not enough to guarantee Lq convergence, for any q > 0.
Example 1.3.30. Fixing q > 0, consider the probability space ((0, 1], B_{(0,1]}, U) and the R.V. Y_n(ω) = n^{1/q} I_{[0, n^{−1}]}(ω). Since Y_n(ω) = 0 for all n ≥ n₀ and some finite n₀ = n₀(ω), it follows that Y_n(ω) →a.s. 0 as n → ∞. However, E[|Y_n|^q] = n U([0, n^{−1}]) = 1 for all n, so Y_n does not converge to zero in Lq (see Exercise 1.3.28).
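Example 1.3.30 is concrete enough to compute directly, say for q = 2 (the function names here are ours):

```python
# Example 1.3.30 for q = 2: Y_n = sqrt(n) on (0, 1/n], and 0 elsewhere.
# Pointwise Y_n(omega) -> 0 for every omega in (0,1], yet E|Y_n|^2 = 1 always.
import math

def Y(n, omega):
    return math.sqrt(n) if 0 < omega <= 1 / n else 0.0

def second_moment(n):
    # E|Y_n|^2 = n * U((0, 1/n]) computed exactly
    return n * (1 / n)

if __name__ == "__main__":
    omega = 0.01
    print([Y(n, omega) for n in (1, 10, 100, 1000)])  # eventually 0
    print([second_moment(n) for n in (1, 10, 1000)])  # always 1
```

The growing spike n^{1/q} exactly compensates the shrinking interval, which is why the q-th moment never decays.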
Turning to do just that, we first outline the results which apply in the more
general measure theory setting, starting with the proof of the monotone convergence
theorem.
Proof of Theorem 1.3.4. By part (c) of Proposition 1.3.5, the proof of which did not use Theorem 1.3.4, we know that μ(h_n) is a non-decreasing sequence that is bounded above by μ(h). It therefore suffices to show that
(1.3.5)   lim_n μ(h_n) ≥ sup{μ0(φ) : φ ∈ SF+, φ ≤ h}
(see Step 3 of Definition 1.3.1). That is, it suffices to find for each non-negative simple function φ ≤ h a sequence of non-negative simple functions φ_n ≤ h_n such that μ0(φ_n) → μ0(φ) as n → ∞. To this end, fixing φ, we may and shall choose without loss of generality a representation φ = Σ_{l=1}^m c_l I_{A_l} such that the A_l ∈ F are disjoint and further c_l μ(A_l) > 0 for l = 1, …, m (see the proof of Lemma 1.3.3). Using hereafter the notation f∗(A) = inf{f(s) : s ∈ A} for f ∈ mF+ and A ∈ F, the condition φ(s) ≤ h(s) for all s ∈ S is equivalent to c_l ≤ h∗(A_l) for all l, so
μ0(φ) ≤ Σ_{l=1}^m h∗(A_l) μ(A_l) = V.
Suppose first that V < ∞, that is, 0 < h∗(A_l) μ(A_l) < ∞ for all l. In this case, fixing ε < 1, consider for each n the disjoint sets A_{l,ε,n} = {s ∈ A_l : h_n(s) ≥ ε h∗(A_l)} ∈ F and the corresponding
φ_{ε,n}(s) = Σ_{l=1}^m ε h∗(A_l) I_{A_{l,ε,n}}(s),
where φ_{ε,n}(s) ≤ h_n(s) for all s ∈ S. If s ∈ A_l then h(s) > ε h∗(A_l). Thus, h_n ↑ h implies that A_{l,ε,n} ↑ A_l as n → ∞, for each l. Consequently, by the definition of μ(h_n) and the continuity from below of μ,
lim_n μ(h_n) ≥ lim_n μ0(φ_{ε,n}) = ε V,
and taking ε ↑ 1 yields (1.3.5) in this case. If instead V = ∞, then h∗(A_l) μ(A_l) = ∞ for some l, say l = 1, and repeating the preceding argument with φ replaced by x I_{A_1} for finite x < h∗(A_1) gives lim_n μ(h_n) ≥ x μ(A_1). Taking x ↑ h∗(A_1) we deduce that lim_n μ(h_n) ≥ h∗(A_1) μ(A_1) = ∞, completing the proof of (1.3.5) and that of the theorem.
Considering probability spaces, Theorem 1.3.4 tells us that we can exchange the
order of the limit and the expectation in case of monotone upward a.s. convergence
of non-negative R.V. (with the limit possibly infinite). That is,
Theorem 1.3.32 (Monotone convergence theorem). If X_n ≥ 0 and X_n(ω) ↑ X∞(ω) for almost every ω, then EX_n ↑ EX∞.
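The content of the theorem can be illustrated on a setting where every quantity is computable, say counting measure on {1, 2, …} (a sketch; the setup is our choice, not from the notes):

```python
# Monotone convergence illustrated with counting measure on {1, 2, ...}:
# h_n(k) = 1/k^2 for k <= n (and 0 otherwise) increases to h(k) = 1/k^2,
# and mu(h_n) = sum_{k<=n} 1/k^2 increases to mu(h) = pi^2/6.
import math

def mu_hn(n):
    return sum(1 / k**2 for k in range(1, n + 1))

if __name__ == "__main__":
    for n in (1, 10, 100, 10_000):
        print(n, mu_hn(n))
    print("limit:", math.pi**2 / 6)
```

The partial integrals increase monotonically and approach the integral of the limit function, exactly as Theorem 1.3.32 predicts.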
In Example 1.3.30 we have a point-wise convergent sequence of R.V. whose expectations exceed that of their limit. In a sense this is always the case, as stated
next in Fatous lemma (which is a direct consequence of the monotone convergence
theorem).
Lemma 1.3.33 (Fatou's lemma). For any measure space (S, F, μ) and any f_n ∈ mF, if f_n(s) ≥ g(s) for some μ-integrable function g, all n and μ-almost-every s ∈ S, then
(1.3.6)   lim inf_n μ(f_n) ≥ μ(lim inf_n f_n).
Alternatively, if f_n(s) ≤ g(s) for some μ-integrable function g, all n and μ-almost-every s ∈ S, then
(1.3.7)   lim sup_n μ(f_n) ≤ μ(lim sup_n f_n).
Proof. Assume first that f_n ∈ mF+ and let h_n(s) = inf_{k≥n} f_k(s), noting that h_n ∈ mF+ is a non-decreasing sequence, whose point-wise limit is h(s) := lim inf_n f_n(s). By the monotone convergence theorem, μ(h_n) ↑ μ(h). Since f_n(s) ≥ h_n(s) for all s ∈ S, the monotonicity of the integral (see Proposition 1.3.5) implies that μ(f_n) ≥ μ(h_n) for all n. Considering the lim inf as n → ∞ we arrive at (1.3.6).
Turning to extend this inequality to the more general setting of the lemma, note that our conditions imply that f_n = g + (f_n − g)₊ μ-a.e. for each n. Considering the countable union of the μ-negligible sets in which one of these identities is violated, we thus have that
h = g + lim inf_n (f_n − g)₊   μ-a.e.
Further, μ(f_n) = μ(g) + μ((f_n − g)₊) by the linearity of the integral in mF+ ∪ L1. Taking n → ∞ and applying (1.3.6) for (f_n − g)₊ ∈ mF+ we deduce that
lim inf_n μ(f_n) ≥ μ(g) + μ(lim inf_n (f_n − g)₊) = μ(g) + μ(h − g) = μ(h)
(where for the right most identity we used the linearity of the integral, as well as the fact that g is μ-integrable).
Finally, we get (1.3.7) for f_n by considering (1.3.6) for −f_n.
Remark. In terms of the expectation, Fatou's lemma is the statement that if R.V. X_n ≥ X, almost surely, for some X ∈ L1 and all n, then
(1.3.8)   lim inf_n E[X_n] ≥ E[lim inf_n X_n],
while if X_n ≤ X, almost surely, for some X ∈ L1 and all n, then
(1.3.9)   lim sup_n E[X_n] ≤ E[lim sup_n X_n].
Some text books call (1.3.9) and (1.3.7) the Reverse Fatou Lemma (e.g. [Wil91,
Section 5.4]).
Using Fatou's lemma, we can easily prove Lebesgue's dominated convergence theorem (in short DOM).
Theorem 1.3.34 (Dominated convergence theorem). For any measure space (S, F, μ) and any f_n ∈ mF, if for some μ-integrable function g and μ-almost-every s ∈ S both f_n(s) → f(s) as n → ∞ and |f_n(s)| ≤ g(s) for all n, then f is μ-integrable and further μ(|f_n − f|) → 0 as n → ∞.
as claimed.
We conclude this sub-section with quite a few exercises, starting with an alternative characterization of convergence almost surely.
Exercise 1.3.36. Show that X_n →a.s. 0 if and only if for each ε > 0 there is n̄ = n̄(ε) so that for each random integer M with M(ω) ≥ n̄ for all ω, we have that P({ω : |X_{M(ω)}(ω)| > ε}) < ε.
Exercise 1.3.37. Let Y_n be (real-valued) random variables on (Ω, F, P), and N_k positive integer valued random variables on the same probability space.
(a) Show that Y_{N_k}(ω) = Y_{N_k(ω)}(ω) are random variables on (Ω, F).
(b) Show that if Y_n →a.s. Y∞ and N_k →a.s. ∞, then Y_{N_k} →a.s. Y∞.
(1.3.10)   EX = lim_{δ↓0} E_δ X,   where   E_δ X = Σ_{j=0}^∞ jδ P({ω : jδ < X(ω) ≤ (j + 1)δ}),
and
E(Σ_{n=1}^∞ Y_n) = Σ_{n=1}^∞ EY_n.
Exercise 1.3.42.
(a) Show that for any random variable X, the function t ↦ E[e^{−|tX|}] is continuous on R (this function is sometimes called the bilateral exponential transform).
(b) Suppose X ≥ 0 is such that EX^q < ∞ for some q > 0. Show that then q^{−1}(EX^q − 1) → E log X as q ↓ 0, and deduce that also q^{−1} log EX^q → E log X as q ↓ 0.
Hint: Fixing x ≥ 0, deduce from the convexity of q ↦ x^q that q^{−1}(x^q − 1) ↓ log x as q ↓ 0.
Exercise 1.3.43. Suppose X is an integrable random variable.
Our next lemma shows that U.I. is a relaxation of the condition of dominated convergence, and that U.I. still implies boundedness in L1 of {X_α, α ∈ I}.
Lemma 1.3.48. If |X_α| ≤ Y for all α and some R.V. Y such that EY < ∞, then the collection {X_α} is U.I. In particular, any finite collection of integrable R.V. is U.I. Further, if {X_α} is U.I. then sup_α E|X_α| < ∞.
We next state and prove Vitali's convergence theorem for probability measures, deferring the general case to Exercise 1.3.53.
Remark. In view of Lemma 1.3.48, Vitali's theorem relaxes the assumed a.s. convergence X_n → X∞ of the dominated (or bounded) convergence theorem, and of Scheffé's lemma, to convergence in probability.
Proof. Suppose first that |X_n| ≤ M for some non-random finite constant M and all n. For each ε > 0 let B_{n,ε} = {ω : |X_n(ω) − X∞(ω)| > ε}. The assumed convergence in probability means that P(B_{n,ε}) → 0 as n → ∞ (see Definition 1.3.22). Since P(|X∞| ≥ M + ε) ≤ P(B_{n,ε}), taking n → ∞ and considering ε = ε_k ↓ 0, we get by continuity from below of P that almost surely |X∞| ≤ M. So, |X_n − X∞| ≤ 2M and by linearity and monotonicity of the expectation, for any n and ε > 0,
(1.3.12)   E|X_n − X∞| = E[|X_n − X∞| I_{B^c_{n,ε}}] + E[|X_n − X∞| I_{B_{n,ε}}] ≤ E[ε I_{B^c_{n,ε}}] + E[2M I_{B_{n,ε}}] ≤ ε + 2M P(B_{n,ε}).
increasing M if needed, by the U.I. condition also E[|X_n| I_{|X_n|>M}] < ε for all n. Considering the expectation of the inequality |x − φ_M(x)| ≤ |x| I_{|x|>M} (which holds for all x ∈ R), with x = X_n and x = X∞, we obtain that
E|X_n − X∞| ≤ E|X_n − φ_M(X_n)| + E|φ_M(X_n) − φ_M(X∞)| + E|X∞ − φ_M(X∞)| ≤ 2ε + E|φ_M(X_n) − φ_M(X∞)|.
Conversely, if X_n →L1 X∞ then, for any ε > 0 and all m large enough,
E|X∞| ≤ E|φ_m(X∞)| + 2ε.
As each X_n is integrable, E[|X_n| I_{|X_n|>M}] ≤ 2ε for some finite M ≥ m and all n (including also n < n₀(ε)). The fact that such finite M = M(ε) exists for any ε > 0 amounts to the collection {X_n} being U.I.
The following exercise builds upon the bounded convergence theorem.
Exercise 1.3.50. Show that for any X ≥ 0 (do not assume E(1/X) < ∞), both
(a) lim_{y→∞} y E[X^{−1} I_{X>y}] = 0, and
Exercise 1.3.52. Let U_n denote a random variable whose law is the uniform probability measure on (0, n], namely, Lebesgue measure restricted to the interval (0, n] and normalized by n^{−1} to a probability measure. Show that g(U_n) →p 0 as n → ∞, for any Borel function g(·) such that |g(y)| → 0 as y → ∞. Further, assuming that also sup_y |g(y)| < ∞, deduce that E|g(U_n)| = n^{−1} ∫_0^n |g(y)| dy → 0 as n → ∞.
Here is Vitalis convergence theorem for a general measure space.
Exercise 1.3.53. Given a measure space (S, F, μ), suppose f_n, f ∈ mF with μ(|f_n|) finite and μ(|f_n − f| > ε) → 0 as n → ∞, for each fixed ε > 0. Show that μ(|f_n − f|) → 0 as n → ∞ if and only if both sup_n μ(|f_n| I_{|f_n|>k}) → 0 and sup_n μ(|f_n| I_{A_k}) → 0 for k → ∞ and some {A_k} ⊆ F such that μ(A_k^c) < ∞.
We conclude this subsection with a useful sufficient criterion for uniform integrability and a few of its consequences.
Exercise 1.3.54. Let f ≥ 0 be a Borel function such that f(r)/r → ∞ as r → ∞. Suppose E f(|X_α|) ≤ C for some finite non-random constant C and all α ∈ I. Show that then {X_α : α ∈ I} is a uniformly integrable collection of R.V.
Exercise 1.3.55.
(a) Construct random variables X_n such that sup_n E(|X_n|) < ∞, but the collection {X_n} is not uniformly integrable.
(b) Show that if {X_n} is a U.I. collection and {Y_n} is a U.I. collection, then {X_n + Y_n} is also U.I.
(c) Show that if X_n →p X∞ and the collection {X_n} is uniformly integrable, then E(X_n I_A) → E(X∞ I_A) as n → ∞, for any measurable set A.
1.3.5. Expectation, density and Riemann integral. Applying the standard machine we now show that, fixing a measure space (S, F, μ), each non-negative measurable function f induces a measure fμ on (S, F), with fμ being the natural generalization of the concept of a probability density function.
Proposition 1.3.56. Fix a measure space (S, F, μ). Every f ∈ mF+ induces a measure fμ on (S, F) via (fμ)(A) = μ(f I_A) for all A ∈ F. These measures satisfy the composition relation h(fμ) = (hf)μ for all f, h ∈ mF+. Further, h ∈ L1(S, F, fμ) if and only if f h ∈ L1(S, F, μ), and then (fμ)(h) = μ(f h).
Proof. Fixing f ∈ mF+, obviously fμ is a non-negative set function on (S, F) with (fμ)(∅) = μ(f I_∅) = μ(0) = 0. To check that fμ is countably additive, hence a measure, let A = ∪_k A_k for a countable collection of disjoint sets A_k ∈ F. Since Σ_{k=1}^n f I_{A_k} ↑ f I_A, it follows by monotone convergence and linearity of the integral that
(fμ)(A) = μ(f I_A) = lim_n μ(Σ_{k=1}^n f I_{A_k}) = lim_n Σ_{k=1}^n μ(f I_{A_k}) = Σ_k (fμ)(A_k).
We next show that the identity
(1.3.13)   (fμ)(h I_A) = μ(f h I_A)   ∀A ∈ F,
holds for any h ∈ mF+. Since the left side of (1.3.13) is the value assigned to A by the measure h(fμ) and the right side of this identity is the value assigned to
the same set by the measure (hf)μ, this would verify the stated composition rule h(fμ) = (hf)μ. The proof of (1.3.13) proceeds by applying the standard machine:
Step 1. If h = I_B for B ∈ F we have by the definition of the integral of an indicator function that
(fμ)(I_B I_A) = (fμ)(I_{A∩B}) = (fμ)(A ∩ B) = μ(f I_{A∩B}) = μ(f I_B I_A),
which is (1.3.13).
Step 2. Take h ∈ SF+ represented as h = Σ_{l=1}^n c_l I_{B_l} with c_l ≥ 0 and B_l ∈ F. Then, by Step 1 and the linearity of the integrals with respect to fμ and with respect to μ, we see that
(fμ)(h I_A) = Σ_{l=1}^n c_l (fμ)(I_{B_l} I_A) = Σ_{l=1}^n c_l μ(f I_{B_l} I_A) = μ(f Σ_{l=1}^n c_l I_{B_l} I_A) = μ(f h I_A),
which is (1.3.13). For general h ∈ mF+, taking h_n ∈ SF+ with h_n ↑ h, so that f h_n I_A ↑ f h I_A, the identity (1.3.13) follows from Step 2 by monotone convergence on both sides.
Observing that (f h)± = f h± when f ∈ mF+, we thus deduce that h is fμ-integrable if and only if f h is μ-integrable, in which case ∫ h d(fμ) = ∫ f h dμ, as stated.
Step 2. Representing h ∈ SF+ as h = Σ_{l=1}^m c_l I_{B_l} for c_l ≥ 0, the identity (1.3.14) follows from Step 1 by the linearity of the expectation in both spaces.
Step 3. For h ∈ mB+, consider h_n ∈ SF+ such that h_n ↑ h. Since h_n(X(ω)) ↑ h(X(ω)) for all ω, we get by monotone convergence on (Ω, F, P), followed by applying Step 2 for h_n, and finally monotone convergence on (R, B, P_X), that
∫_Ω h(X(ω)) dP(ω) = lim_n ∫_Ω h_n(X(ω)) dP(ω) = lim_n ∫_R h_n(x) dP_X(x) = ∫_R h(x) dP_X(x),
as claimed.
Step 4. Write a Borel function h(x) as h₊(x) − h₋(x). Then, by Step 3, (1.3.14) applies for both non-negative functions h₊ and h₋. Further, at least one of these two identities involves finite quantities. So, taking their difference and using the linearity of the expectation (in both probability spaces), leads to the same result for h.
Combining Theorem 1.3.61 with Example 1.3.60, we show that the expectation of a Borel function of a R.V. X having a density f_X can be computed by performing calculus type integration on the real line.
Corollary 1.3.62. Suppose that the distribution function of a R.V. X is of the form (1.2.3) for some Lebesgue integrable function f_X(x). Then, for any Borel measurable function h : R → R, the R.V. h(X) is integrable if and only if ∫_R |h(x)| f_X(x) dx < ∞, in which case E h(X) = ∫_R h(x) f_X(x) dx. The latter formula applies also for any non-negative Borel function h(·).
Proof. Recall from Example 1.3.60 that the law P_X of X equals the probability measure f_X λ (with λ denoting Lebesgue's measure). For h ≥ 0 we thus deduce from Theorem 1.3.61 that E h(X) = (f_X λ)(h), which by the composition rule of Proposition 1.3.56 is given by λ(f_X h) = ∫_R h(x) f_X(x) dx. The decomposition h = h₊ − h₋ then completes the proof of the general case.
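Corollary 1.3.62 reduces an expectation to an ordinary integral on the line, which one can approximate numerically. As an illustration (a sketch with our own helper names), take the rate-1 exponential density f_X(x) = e^{−x} on x > 0, for which EX = 1 and EX² = 2:

```python
import math

def expectation(h, f, a=0.0, b=50.0, n=200_000):
    # Midpoint-rule approximation of E h(X) = integral of h(x) f(x) dx,
    # truncated to [a, b] (the exponential tail beyond b = 50 is negligible).
    dx = (b - a) / n
    return sum(h(a + (i + 0.5) * dx) * f(a + (i + 0.5) * dx)
               for i in range(n)) * dx

if __name__ == "__main__":
    f_exp = lambda x: math.exp(-x)
    print(expectation(lambda x: x, f_exp))       # close to E X = 1
    print(expectation(lambda x: x * x, f_exp))   # close to E X^2 = 2
```

The same routine works for any bounded-support or rapidly decaying density with h f integrable, which is precisely the hypothesis of the corollary.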
Our next task is to compare Lebesgues integral (of Definition 1.3.1) with Riemanns integral. To this end recall,
Definition 1.3.63. A function f : (a, b] → [0, ∞] is Riemann integrable with integral R(f) < ∞ if for any ε > 0 there exists δ = δ(ε) > 0 such that |Σ_l f(x_l) λ(J_l) − R(f)| < ε for any x_l ∈ J_l and {J_l} a finite partition of (a, b] into disjoint subintervals whose lengths λ(J_l) < δ.
Lebesgue's integral of a function f is based on splitting its range into small intervals and approximating f(s) by a constant on the subset of S for which f(·) falls into each such interval. As such, it accommodates an arbitrary domain S of the function, in contrast to Riemann's integral, where the domain of integration is split into small rectangles and hence limited to R^d. As we next show, even for S = (a, b], if f ≥ 0 (or more generally, f bounded) is Riemann integrable, then it is also Lebesgue integrable, with the integrals coinciding in value.
Proposition 1.3.64. If f(x) is a non-negative Riemann integrable function on an interval (a, b], then it is also Lebesgue integrable on (a, b] and λ(f) = R(f).
Proof. Let f∗(J) = inf{f(x) : x ∈ J} and f*(J) = sup{f(x) : x ∈ J}. Varying x_l over J_l we see that
(1.3.15)   R(f) − ε ≤ Σ_l f∗(J_l) λ(J_l) ≤ Σ_l f*(J_l) λ(J_l) ≤ R(f) + ε,
for any finite partition Π of (a, b] into disjoint subintervals J_l such that sup_l λ(J_l) ≤ δ. For any such partition, the non-negative simple functions ℓ(Π) = Σ_l f∗(J_l) I_{J_l} and u(Π) = Σ_l f*(J_l) I_{J_l} are such that ℓ(Π) ≤ u(Π), whereas R(f) − ε ≤ λ(ℓ(Π)) ≤ λ(u(Π)) ≤ R(f) + ε, by (1.3.15). Consider the dyadic partitions Π_n of (a, b] into 2^n intervals of length (b − a)2^{−n} each, such that Π_{n+1} is a refinement of Π_n for each n = 1, 2, …. Note that u(Π_n)(x) ≥ u(Π_{n+1})(x) for all x ∈ (a, b] and any n, hence u(Π_n)(x) ↓ u∞(x), a Borel measurable R-valued function (see Exercise 1.2.31). Similarly, ℓ(Π_n)(x) ↑ ℓ∞(x), with ℓ∞ also Borel measurable, and by the monotonicity of Lebesgue's integral,
R(f) ≤ lim_n λ(ℓ(Π_n)) ≤ λ(ℓ∞) ≤ λ(u∞) ≤ lim_n λ(u(Π_n)) ≤ R(f).
Σ_{n=0}^∞ E(X; A_n) = E(X; A),
that is, the sum converges absolutely and has the value on the right.
(b) Deduce from this that for Z ≥ 0 with EZ positive and finite, Q(A) := E[Z I_A]/EZ is a probability measure.
(c) Suppose that X and Y are non-negative random variables on the same probability space (Ω, F, P) such that EX = EY < ∞. Deduce from the preceding that if E[X I_A] = E[Y I_A] for any A in a π-system A such that F = σ(A), then X a.s.= Y.
Exercise 1.3.66. Suppose P is a probability measure on (R, B) and f ≥ 0 is a Borel function such that P(B) = ∫_B f(x) dx for B = (−∞, b], b ∈ R. Using the π-λ theorem, show that this identity applies for all B ∈ B. Building on this result, use the standard machine to directly prove Corollary 1.3.62 (without Proposition 1.3.56).
1.3.6. Mean, variance and moments. We start with the definition of moments of a random variable.
Definition 1.3.67. If k is a positive integer then EX^k is called the kth moment of X. When it is well defined, the first moment m_X = EX is called the mean. If EX² < ∞, then the variance of X is defined to be
(1.3.16)   Var(X) = E(X − m_X)² = EX² − (EX)².
For example, if X has the exponential distribution, then EX^k = k! for any k (see Example 1.2.41 for its density). The mean of X is m_X = 1 and its variance is EX² − (EX)² = 1. For any λ > 0, it is easy to see that T = X/λ
We call the law of G the normal distribution of mean μ and variance σ² (as EG = μ and Var(G) = σ²).
Next are some examples of R.V. with a finite or countable set of possible values.
Example 1.3.69. We say that B has a Bernoulli distribution of parameter p ∈ [0, 1] if P(B = 1) = 1 − P(B = 0) = p. Clearly,
EB = p · 1 + (1 − p) · 0 = p.
Further, B² = B so EB² = EB = p and
Var(B) = EB² − (EB)² = p − p² = p(1 − p).
Recall that N has a Poisson distribution with parameter λ ≥ 0 if
P(N = k) = λ^k e^{−λ} / k!   for k = 0, 1, 2, …
Its factorial moments are, for k = 1, 2, …,
E[N(N − 1) ⋯ (N − k + 1)] = Σ_{n=k}^∞ n(n − 1) ⋯ (n − k + 1) λ^n e^{−λ} / n! = λ^k Σ_{n=k}^∞ λ^{n−k} e^{−λ} / (n − k)! = λ^k.
In particular, EN = λ and EN² = EN(N − 1) + EN = λ² + λ, so
Var(N) = EN² − (EN)² = λ.
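The factorial-moment identity above is easy to confirm by summing the Poisson pmf over a long truncated range (a sketch; the function name is ours, and the pmf is built recursively to avoid large factorials):

```python
import math

def factorial_moment(lam, k, n_max=100):
    # Truncated sum of n(n-1)...(n-k+1) * P(N = n) for a Poisson(lam) pmf,
    # which should equal lam**k; P(N = n) is updated via pmf *= lam / n.
    pmf = math.exp(-lam)   # P(N = 0)
    total = 0.0
    for n in range(0, n_max + 1):
        if n >= 1:
            pmf *= lam / n
        if n >= k:
            term = 1.0
            for j in range(k):
                term *= (n - j)
            total += term * pmf
    return total

if __name__ == "__main__":
    print(factorial_moment(3.0, 1), factorial_moment(3.0, 2))  # ~3, ~9
```

With k = 1 and k = 2 this recovers EN = λ and EN(N − 1) = λ², hence Var(N) = λ as in the text.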
The random variable Z is said to have a Geometric distribution of success probability p ∈ (0, 1) if
P(Z = k) = p(1 − p)^{k−1}   for k = 1, 2, …
This is the distribution of the number of independent coin tosses needed till the first appearance of a Head, or more generally, the number of independent trials till the first success, when p is the probability of success in each trial. Here,
EZ = Σ_{k=1}^∞ k p(1 − p)^{k−1} = 1/p,
EZ(Z − 1) = Σ_{k=2}^∞ k(k − 1) p(1 − p)^{k−1} = 2(1 − p)/p²,
and consequently
Var(Z) = EZ(Z − 1) + EZ − (EZ)² = (1 − p)/p².
Exercise 1.3.70. Consider a counting random variable N_n = Σ_{i=1}^n I_{A_i}.
(a) Provide a formula for Var(N_n) in terms of P(A_i) and P(A_i ∩ A_j) for i ≠ j.
(b) Using your formula, find the variance of the number N_n of empty boxes when distributing at random r distinct balls among n distinct boxes, where each of the possible n^r assignments of balls to boxes is equally likely.
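The geometric series manipulations behind EZ and EZ(Z − 1) can be cross-checked by truncating the sums far enough out that the remainder is negligible (function name ours):

```python
def geometric_moments(p, k_max=5000):
    # Truncated series for E Z and E Z(Z-1) of a Geometric(p) variable,
    # to compare against the closed forms 1/p and 2(1-p)/p^2.
    ez = sum(k * p * (1 - p) ** (k - 1) for k in range(1, k_max + 1))
    ezz1 = sum(k * (k - 1) * p * (1 - p) ** (k - 1)
               for k in range(2, k_max + 1))
    return ez, ezz1

if __name__ == "__main__":
    p = 0.3
    ez, ezz1 = geometric_moments(p)
    print(ez, 1 / p)                            # both ~3.333
    print(ezz1 + ez - ez**2, (1 - p) / p**2)    # Var(Z), both ~7.778
```

The combination EZ(Z − 1) + EZ − (EZ)² reproduces Var(Z) = (1 − p)/p² to high accuracy.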
Exercise 1.3.71. Show that if P(X ∈ [a, b]) = 1, then Var(X) ≤ (b − a)²/4.
1.4. Independence and product measures
In Subsection 1.4.1 we build up the notion of independence, from events to random variables via σ-algebras, relating it to the structure of the joint distribution function. Subsection 1.4.2 considers finite product measures associated with the joint law of independent R.V.-s. This is followed by Kolmogorov's extension theorem, which we use in order to construct infinitely many independent R.V.-s. Subsection 1.4.3 is about Fubini's theorem and its applications for computing the expectation of functions of independent R.V.
1.4.1. Definition and conditions for independence. Recall the classical definition that two events A, B ∈ F are independent if P(A ∩ B) = P(A)P(B). For example, suppose two fair dice are thrown (i.e. Ω = {1, 2, 3, 4, 5, 6}² with F = 2^Ω and the uniform probability measure). Let E1 = {sum of the two dice is 6} and E2 = {first die is 4}. Then E1 and E2 are not independent, since
P(E1) = P({(1, 5), (2, 4), (3, 3), (4, 2), (5, 1)}) = 5/36,   P(E2) = 1/6,
and
P(E1 ∩ E2) = P({(4, 2)}) = 1/36 ≠ P(E1)P(E2).
However, one can check that E2 and E3 = {sum of the dice is 7} are independent.
G G, H H ,
that is, two -algebras are independent if every event in one of them is independent
of every event in the other.
The random vectors X = (X_1, . . . , X_n) and Y = (Y_1, . . . , Y_m) on the same probability space are independent if the corresponding σ-algebras σ(X_1, . . . , X_n) and σ(Y_1, . . . , Y_m) are independent.
Remark. Our definition of independence of random variables is consistent with that of independence of events. For example, if the events A, B ∈ F are independent, then so are I_A and I_B. Indeed, we need to show that σ(I_A) = {∅, Ω, A, A^c} and σ(I_B) = {∅, Ω, B, B^c} are independent. Since P(∅) = 0 and ∅ is invariant under intersections, whereas P(Ω) = 1 and all events are invariant under intersection with Ω, it suffices to consider G ∈ {A, A^c} and H ∈ {B, B^c}. We check independence first for G = A and H = B^c. Noting that A is the union of the disjoint events A ∩ B and A ∩ B^c we have that

P(A ∩ B^c) = P(A) − P(A ∩ B) = P(A)[1 − P(B)] = P(A)P(B^c) ,

where the middle equality is due to the assumed independence of A and B. The proof for all other choices of G and H is very similar.
More generally we define the mutual independence of events as follows.
Definition 1.4.2. Events A_i ∈ F are P-mutually independent if for any L < ∞ and distinct indices i_1, i_2, . . . , i_L,

P(A_{i_1} ∩ A_{i_2} ∩ ⋯ ∩ A_{i_L}) = ∏_{k=1}^L P(A_{i_k}) .
More generally, collections of events A_k ⊆ F are P-mutually independent if for any L < ∞ and distinct collections,

P(A_1 ∩ A_2 ∩ ⋯ ∩ A_L) = ∏_{k=1}^L P(A_k) ,    whenever A_k ∈ A_k , k = 1, . . . , L .
We say that random variables X_α, α ∈ I are P-mutually independent if the σ-algebras σ(X_α), α ∈ I are P-mutually independent.
When the probability measure P in consideration is clear from the context, we say
that random variables, or collections of events, are mutually independent.
Our next theorem gives a sufficient condition for the mutual independence of a collection of σ-algebras which, as we later show, greatly simplifies the task of checking independence.
Theorem 1.4.4. Suppose G_i = σ(A_i) ⊆ F for i = 1, 2, . . . , n, where A_i are π-systems. Then, a sufficient condition for the mutual independence of G_i is that A_i, i = 1, . . . , n are mutually independent.
Proof. Let H = A_{i_1} ∩ A_{i_2} ∩ ⋯ ∩ A_{i_L}, where i_1, i_2, . . . , i_L are distinct elements from {1, 2, . . . , n − 1} and A_i ∈ A_i for i = 1, . . . , n − 1. Consider the two finite measures µ_1(A) = P(A ∩ H) and µ_2(A) = P(H)P(A) on the measurable space (Ω, G_n). Note that

µ_1(Ω) = P(Ω ∩ H) = P(H) = P(H)P(Ω) = µ_2(Ω) .
P(A ∩ H) = ( ∏_{k=1}^L P(A_{i_k}) ) P(A) = P(H) P(A) ,    for any A ∈ A_n ,
Proof. Since G_k ⊆ G_{k+1} for all k and the G_k are σ-algebras, it follows that A = ∪_{k≥1} G_k is a π-system. The assumed P-independence of H and G_k for each k yields the P-independence of H and A. Thus, by Theorem 1.4.4 we have that H and σ(A) are P-independent. Since H ⊆ σ(A) it follows that in particular P(H) = P(H ∩ H) = P(H) P(H) for each H ∈ H. So, necessarily P(H) ∈ {0, 1} for all H ∈ H. That is, H is P-trivial.
We next define the tail σ-algebra of a stochastic process.
Definition 1.4.9. For a stochastic process {X_k} we set T_n^X = σ(X_r, r > n) and call T^X = ∩_n T_n^X the tail σ-algebra of the process {X_k}.
As we next see, the P-triviality of the tail σ-algebra of independent random variables is an immediate consequence of Lemma 1.4.8. This result, due to Kolmogorov, is just one of the many 0-1 laws that exist in probability theory.
Corollary 1.4.10 (Kolmogorov's 0-1 law). If {X_k} are P-mutually independent then the corresponding tail σ-algebra T^X is P-trivial.
Proof. Note that F_k^X ⊆ F_{k+1}^X and T^X ⊆ F_∞^X = σ(X_k, k ≥ 1) = σ(∪_{k≥1} F_k^X) (see Exercise 1.2.14 for the latter identity). Further, recall Exercise 1.4.7 that for any n ≥ 1, the σ-algebras T_n^X and F_n^X are P-mutually independent. Hence, each of the σ-algebras F_k^X is also P-mutually independent of the tail σ-algebra T^X, which by Lemma 1.4.8 is thus P-trivial.
Out of Corollary 1.4.6 we deduce that functions of disjoint collections of mutually
independent random variables are mutually independent.
Corollary 1.4.11. If R.V. X_{k,j}, 1 ≤ k ≤ m, 1 ≤ j ≤ l(k) are mutually independent and f_k : R^{l(k)} → R are Borel functions, then Y_k = f_k(X_{k,1}, . . . , X_{k,l(k)}) are mutually independent random variables for k = 1, . . . , m.
Proof. We apply Corollary 1.4.6 for the index set J = {(k, j) : 1 ≤ k ≤ m, 1 ≤ j ≤ l(k)}, and mutually independent π-systems H_{k,j} = σ(X_{k,j}), to deduce the mutual independence of G_k = σ(∪_j H_{k,j}). Recall that G_k = σ(X_{k,j}, 1 ≤ j ≤ l(k)) and σ(Y_k) ⊆ G_k (see Definition 1.2.12 and Exercise 1.2.33). We complete the proof by noting that Y_k are mutually independent if and only if σ(Y_k) are mutually independent.
Our next result is an application of Theorem 1.4.4 to the independence of random
variables.
Corollary 1.4.12. Real-valued random variables X_1, X_2, . . . , X_m on the same probability space (Ω, F, P) are mutually independent if and only if

(1.4.1)    P(X_1 ≤ x_1, . . . , X_m ≤ x_m) = ∏_{i=1}^m P(X_i ≤ x_i) ,    ∀ x_1, . . . , x_m ∈ R.
Proof. Let A_i denote the collection of subsets of Ω of the form X_i^{−1}((−∞, b]) for b ∈ R. Recall that A_i generates σ(X_i) (see Exercise 1.2.11), whereas (1.4.1) states that the π-systems A_i are mutually independent (by continuity from below of P, taking x_i ↑ ∞ for i ≠ i_1, i ≠ i_2, . . . , i ≠ i_L has the same effect as taking a subset of distinct indices i_1, . . . , i_L from {1, . . . , m}). So, just apply Theorem 1.4.4 to conclude the proof.
The condition (1.4.1) for mutual independence of R.V.-s is further simplified when these variables either are discrete valued or have probability density functions.
Exercise 1.4.13. Suppose (X_1, . . . , X_m) are random variables and (S_1, . . . , S_m) are countable sets such that P(X_i ∈ S_i) = 1 for i = 1, . . . , m. Show that if

P(X_1 = x_1, . . . , X_m = x_m) = ∏_{i=1}^m P(X_i = x_i)

for all x_i ∈ S_i, i = 1, . . . , m, then X_1, . . . , X_m are mutually independent.
Exercise 1.4.14. Suppose the random vector X = (X_1, . . . , X_m) has a joint probability density function f_X(x) = g_1(x_1) ⋯ g_m(x_m). That is,

P((X_1, . . . , X_m) ∈ A) = ∫_A g_1(x_1) ⋯ g_m(x_m) dx_1 . . . dx_m ,    ∀ A ∈ B_{R^m} .

Show that X_1, . . . , X_m are then mutually independent.
Exercise 1.4.18. Recall Euler's zeta-function, which for real s > 1 is given by ζ(s) = ∑_{k=1}^∞ k^{−s}. Fixing such s, let X and Y be independent random variables with P(X = k) = P(Y = k) = k^{−s}/ζ(s) for k = 1, 2, . . ..
(a) Show that the events D_p = {X is divisible by p}, with p a prime number, are P-mutually independent.
(b) By considering the event {X = 1}, provide a probabilistic explanation of Euler's formula 1/ζ(s) = ∏_p (1 − 1/p^s).
(c) Show that the probability that no perfect square other than 1 divides X is precisely 1/ζ(2s).
(d) Show that P(G = k) = k^{−2s}/ζ(2s), where G is the greatest common divisor of X and Y.
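For part (a), the identity P(D_p) = p^{−s} and the independence of, say, D_2 and D_3 can be checked numerically by truncating the series; the choice s = 2 and the cutoff below are ours:

```python
# Truncated check of Exercise 1.4.18(a): under P(X = k) = k^{-s}/zeta(s),
# divisibility events are independent across primes.  s = 2 and the
# series cutoff K are arbitrary choices of ours.
s = 2.0
K = 200_000
zeta_s = sum(k ** -s for k in range(1, K + 1))

def prob(pred):
    return sum(k ** -s for k in range(1, K + 1) if pred(k)) / zeta_s

p2 = prob(lambda k: k % 2 == 0)          # ~ 2^{-s}
p3 = prob(lambda k: k % 3 == 0)          # ~ 3^{-s}
p6 = prob(lambda k: k % 6 == 0)          # ~ 6^{-s}
assert abs(p2 - 2 ** -s) < 1e-3
assert abs(p6 - p2 * p3) < 1e-3          # D_2 and D_3 independent
```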
A = { ⊎_{j=1}^m (A_j × B_j) : A_j ∈ F_1, B_j ∈ F_2, m < ∞ } ,

and on it set

µ1 × µ2( ⊎_{j=1}^m A_j × B_j ) = ∑_{j=1}^m µ1(A_j) µ2(B_j) ,    A_j ∈ F_1, B_j ∈ F_2, m < ∞ .
( ⊎_{j=1}^m A_j × B_j ) ∩ ( ⊎_{i=1}^n C_i × D_i ) = ⊎_{i,j} [ (A_j × B_j) ∩ (C_i × D_i) ] = ⊎_{i,j} (A_j ∩ C_i) × (B_j ∩ D_i) .
( ⊎_{j=1}^m A_j × B_j )^c = ∩_{j=1}^m (A_j × B_j)^c .
(1.4.2)    ∑_{j=1}^m µ1(A_j) µ2(B_j) = ∑_i µ1(C_i) µ2(D_i) ,

whenever ⊎_{j=1}^m A_j × B_j = ⊎_i (C_i × D_i) for some m < ∞, A_j, C_i ∈ F_1 and B_j, D_i ∈ F_2, then we deduce that the value of µ1 × µ2(E) is independent of the representation we choose for E ∈ A in terms of measurable rectangles, and further that µ1 × µ2 is countably additive on A. To this end, note that the preceding set identity amounts to

∑_{j=1}^m I_{A_j}(x) I_{B_j}(y) = ∑_i I_{C_i}(x) I_{D_i}(y) ,    ∀ x ∈ Ω1, y ∈ Ω2 .
Hence, fixing x ∈ Ω1, we have that φ(y) = ∑_{j=1}^m I_{A_j}(x) I_{B_j}(y) ∈ SF+ is the monotone increasing limit of φ_n(y) = ∑_{i=1}^n I_{C_i}(x) I_{D_i}(y) ∈ SF+ as n → ∞. Thus, by linearity of the integral with respect to µ2 and monotone convergence,

g(x) := ∑_{j=1}^m µ2(B_j) I_{A_j}(x) = lim_{n→∞} ∑_{i=1}^n µ2(D_i) I_{C_i}(x) .

We deduce that the non-negative g(x) ∈ mF_1 is the monotone increasing limit of the non-negative measurable functions h_n(x) = ∑_{i=1}^n µ2(D_i) I_{C_i}(x). Hence, by the same reasoning,

∑_{j=1}^m µ2(B_j) µ1(A_j) = ∑_i µ2(D_i) µ1(C_i) ,

which is precisely (1.4.2).
It follows from Theorem 1.4.19 by induction on n that given any finite collection of σ-finite measure spaces (Ω_i, F_i, µ_i), i = 1, . . . , n, there exists a unique product measure µ = µ1 × ⋯ × µ_n on the product space (Ω, F) (i.e., Ω = Ω1 × ⋯ × Ω_n and F = σ(A_1 × ⋯ × A_n : A_i ∈ F_i, i = 1, . . . , n)), such that

(1.4.3)    µ1 × ⋯ × µ_n (A_1 × ⋯ × A_n) = ∏_{i=1}^n µ_i(A_i) ,    ∀ A_i ∈ F_i, i = 1, . . . , n.
∏_{i=1}^n P(X_i ∈ B_i) = ∏_{i=1}^n ν_i(B_i) = ν_1 × ⋯ × ν_n (B_1 × ⋯ × B_n) .
This shows that the law of (X_1, . . . , X_n) and the product measure ν_1 × ⋯ × ν_n agree on the collection of all measurable rectangles B_1 × ⋯ × B_n, a π-system that generates B_{R^n} (see Exercise 1.1.21). Consequently, these two probability measures agree on B_{R^n} (c.f. Proposition 1.1.39).
Conversely, if P_X = ν_1 × ⋯ × ν_n, then by the same reasoning, for Borel sets B_i,

P( ∩_{i=1}^n {ω : X_i(ω) ∈ B_i} ) = P_X(B_1 × ⋯ × B_n) = ν_1 × ⋯ × ν_n (B_1 × ⋯ × B_n) = ∏_{i=1}^n ν_i(B_i) = ∏_{i=1}^n P({ω : X_i(ω) ∈ B_i}) ,

yielding the mutual independence of X_1, . . . , X_n.
We wish to extend the construction of product measures to that of an infinite collection of independent random variables. To this end, let N = {1, 2, . . .} denote the set of natural numbers and R^N = {x = (x_1, x_2, . . .) : x_i ∈ R} denote the collection of all infinite sequences of real numbers. We equip R^N with the product σ-algebra B_c = σ(R) generated by the collection R of all finite dimensional measurable rectangles (also called cylinder sets), that is, sets of the form {x : x_1 ∈ B_1, . . . , x_n ∈ B_n}, where B_i ∈ B, i = 1, . . . , n ∈ N (e.g. see Example 1.1.19).
Kolmogorov's extension theorem provides the existence of a unique probability measure P on (R^N, B_c) whose projections coincide with a given consistent sequence of probability measures µ_n on (R^n, B_{R^n}).
Theorem 1.4.22 (Kolmogorov's extension theorem). Suppose we are given probability measures µ_n on (R^n, B_{R^n}) that are consistent, that is,

µ_{n+1}(B_1 × ⋯ × B_n × R) = µ_n(B_1 × ⋯ × B_n) ,    ∀ B_i ∈ B, i = 1, . . . , n < ∞ .

Then, there exists a unique probability measure P on (R^N, B_c) such that P({x : x_i ∈ B_i, i = 1, . . . , n}) = µ_n(B_1 × ⋯ × B_n) for all n and all B_i ∈ B.

Proof. One defines P0 on the collection A of all cylinder sets in the same linear manner we used when proving Theorem 1.4.19. Since A generates B_c and P0(R^N) = µ_n(R^n) = 1, by Caratheodory's extension theorem it suffices to check that P0 is countably additive on A. The countable additivity of P0 is verified by the method we already employed when dealing with Lebesgue's measure. That is, by the remark after Lemma 1.1.31, it suffices to prove that P0(H_n) ↓ 0 whenever H_n ∈ A and H_n ↓ ∅. The proof by contradiction of the latter, adapting the argument of Lemma 1.1.31, is based on approximating each H ∈ A by a finite union J_k ⊆ H of compact rectangles, such that P0(H \ J_k) → 0 as k → ∞. This is done for example in [Bil95, Page 490].
Example 1.4.23. To systematically construct an infinite sequence of independent random variables {X_i} of prescribed laws P_{X_i} = ν_i, we apply Kolmogorov's extension theorem for the product measures µ_n = ν_1 × ⋯ × ν_n constructed following Theorem 1.4.19 (where it is by definition that the sequence µ_n is consistent). Alternatively, for infinite product measures one can take arbitrary probability spaces (Ω_i, F_i, ν_i) and directly show by contradiction that P0(H_n) ↓ 0 whenever H_n ∈ A and H_n ↓ ∅ (for more details, see [Str93, Exercise 1.1.14]).
Remark. As we shall find in Sections 6.1 and 7.1, Kolmogorov's extension theorem is the key to the study of stochastic processes, where it relates the law of the process to its finite dimensional distributions. Certain properties of R are key to the proof of Kolmogorov's extension theorem, which indeed is false if (R, B) is replaced with an arbitrary measurable space (S, S) (see the discussions in [Dur10, Subsection 2.1.4] and [Dud89, notes for Section 12.1]). Nevertheless, as you show next, the conclusion of this theorem applies for any B-isomorphic measurable space (S, S).
Definition 1.4.24. Two measurable spaces (S, S) and (T, T) are isomorphic if there exists a one to one and onto measurable mapping between them whose inverse is also a measurable mapping. A measurable space (S, S) is B-isomorphic if it is isomorphic to a Borel subset T of R equipped with the induced Borel σ-algebra T = {B ∩ T : B ∈ B}.
Here is our generalized version of Kolmogorov's extension theorem.
Corollary 1.4.25. Given a measurable space (S, S) let S^N denote the collection of all infinite sequences of elements of S, equipped with the product σ-algebra S_c generated by the collection of all cylinder sets of the form {s : s_1 ∈ A_1, . . . , s_n ∈ A_n}, where A_i ∈ S for i = 1, . . . , n. If (S, S) is B-isomorphic then for any consistent sequence of probability measures µ_n on (S^n, S^n) (that is, µ_{n+1}(A_1 × ⋯ × A_n × S) = µ_n(A_1 × ⋯ × A_n) for all n and A_i ∈ S), there exists a unique probability measure Q on (S^N, S_c) such that for all n and A_i ∈ S,

(1.4.5)    Q({s : s_i ∈ A_i, i = 1, . . . , n}) = µ_n(A_1 × ⋯ × A_n) .
(c) Consider the one to one and onto mapping g_∞(s) = (g(s_1), . . . , g(s_n), . . .) from S^N to T^N and the unique probability measure P on (T^N, T_c) for which (1.4.4) holds. Verify that S_c is contained in the σ-algebra of subsets A of S^N for which g_∞(A) is in T_c and deduce that Q(A) = P(g_∞(A)) is a probability measure on (S^N, S_c).
(d) Conclude your proof of Corollary 1.4.25 by showing that this Q is the unique probability measure for which (1.4.5) holds.
Remark. Recall that Caratheodory's extension theorem applies for any σ-finite measure. It follows that, by the same proof as in the preceding exercise, any consistent sequence of σ-finite measures µ_n uniquely determines a σ-finite measure Q on (S^N, S_c) for which (1.4.5) holds, a fact which we use in later parts of this text (for example, in the study of Markov chains in Section 6.1).
Our next proposition shows that in most applications one encounters B-isomorphic measurable spaces (for which Kolmogorov's theorem applies).
Proposition 1.4.27. If S ∈ B_M for a complete separable metric space M and S is the restriction of B_M to S, then (S, S) is B-isomorphic.
Remark. While we do not provide the proof of this proposition, we note in passing that it is an immediate consequence of [Dud89, Theorem 13.1.1].
1.4.3. Fubini's theorem and its application. Returning to (Ω, F, µ) which is the product of two σ-finite measure spaces, as in Theorem 1.4.19, we now prove that:
Theorem 1.4.28 (Fubini's theorem). Suppose µ = µ1 × µ2 is the product of the σ-finite measures µ1 on (X, X) and µ2 on (Y, Y). If h ∈ mF for F = X × Y is such that h ≥ 0 or ∫ |h| dµ < ∞, then,

(1.4.6)    ∫_{X×Y} h dµ = ∫_X [ ∫_Y h(x, y) dµ2(y) ] dµ1(x) = ∫_Y [ ∫_X h(x, y) dµ1(x) ] dµ2(y) .
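The equality of the iterated integrals is easy to illustrate for finite measures on finite spaces, where every integral is a weighted sum; a minimal numerical sketch (all names are ours):

```python
import numpy as np

# Fubini for finite measures on finite spaces: the "integral" of h over
# the product measure mu1 x mu2 equals both iterated sums.
rng = np.random.default_rng(0)
mu1 = rng.random(5)                       # finite measure on X = {0,...,4}
mu2 = rng.random(7)                       # finite measure on Y = {0,...,6}
h = rng.standard_normal((5, 7))           # a bounded measurable function

joint = np.sum(h * np.outer(mu1, mu2))                     # integral over X x Y
dy_then_dx = np.sum(mu1 * (h * mu2).sum(axis=1))           # dmu2 inside, dmu1 outside
dx_then_dy = np.sum(mu2 * (h * mu1[:, None]).sum(axis=0))  # dmu1 inside, dmu2 outside
assert abs(joint - dy_then_dx) < 1e-12 and abs(joint - dx_then_dy) < 1e-12
```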
(1.4.7)    y ↦ h(x, y) ∈ mY ,    ∀ x ∈ X ,
(1.4.8)    x ↦ f_h(x) := ∫_Y h(x, y) dµ2(y) ∈ mX ,

so the double integral on the right side of (1.4.6) is well defined and

(1.4.9)    ∫_{X×Y} h dµ = ∫_X f_h(x) dµ1(x) .
We do so in three steps, first proving (1.4.7)-(1.4.9) for finite measures and bounded h, proceeding to extend these results to non-negative h and σ-finite measures, and then showing that (1.4.6) holds whenever h ∈ mF and ∫ |h| dµ is finite.
Step 1. Let H denote the collection of bounded functions on X × Y for which (1.4.7)-(1.4.9) hold. Assuming that both µ1(X) and µ2(Y) are finite, we deduce that H contains all bounded h ∈ mF by verifying the assumptions of the monotone class theorem (i.e. Theorem 1.2.7) for H and the π-system R = {A × B : A ∈ X, B ∈ Y} of measurable rectangles (which generates F).
To this end, note that if h = I_E and E = A × B ∈ R, then either h(x, ·) = I_B(·) (in case x ∈ A), or h(x, ·) is identically zero (when x ∉ A). With I_B ∈ mY we thus have (1.4.7) for any such h. Further, in this case the simple function f_h(x) = µ2(B) I_A(x) on (X, X) is in mX and

∫_{X×Y} I_E dµ = µ1 × µ2(E) = µ2(B) µ1(A) = ∫_X f_h(x) dµ1(x) .
is finite. By Step 2 we know that f_{h±} ∈ mX, hence X_0 = {x : f_{h+}(x) + f_{h−}(x) < ∞} is in X. From Step 2 we further have that µ1(f_{h±}) = µ(h±), whereby our assumption that ∫ |h| dµ = µ1(f_{h+} + f_{h−}) < ∞ implies that µ1(X_0^c) = 0. Let f̃_h(x) = f_{h+}(x) − f_{h−}(x) on X_0 and f̃_h(x) = 0 for all x ∉ X_0. Clearly, f̃_h ∈ mX is µ1-almost-everywhere the same as the inner integral on the right side of (1.4.6). Moreover, in view of (1.4.10) and linearity of the integrals with respect to µ1 and µ, we deduce that

µ(h) = µ(h+) − µ(h−) = µ1(f_{h+}) − µ1(f_{h−}) = µ1(f̃_h) ,
Equipped with Fubini's theorem, we have the following simpler formula for the expectation of a Borel function h of two independent R.V.
Theorem 1.4.29. Suppose that X and Y are independent random variables of laws µ1 = P_X and µ2 = P_Y. If h : R² → R is a Borel measurable function such that h ≥ 0 or E|h(X, Y)| < ∞, then,

(1.4.11)    E h(X, Y) = ∫ [ ∫ h(x, y) dµ1(x) ] dµ2(y) .

In particular, for Borel functions f, g : R → R such that f, g ≥ 0 or E|f(X)| < ∞ and E|g(Y)| < ∞,

(1.4.12)    E[f(X) g(Y)] = E f(X) E g(Y) .
Proof. By the independence of X and Y, the law of (X, Y) is the product measure µ1 × µ2, so combining the change of variables formula with Fubini's theorem yields

E h(X, Y) = ∫_{R²} h d(µ1 × µ2) = ∫ [ ∫ h(x, y) dµ1(x) ] dµ2(y) ,

which is (1.4.11). Take now h(x, y) = f(x) g(y) for non-negative Borel functions f(x) and g(y). In this case, the iterated integral on the right side of (1.4.11) can be further simplified to,

E[f(X) g(Y)] = ∫ [ ∫ f(x) g(y) dµ1(x) ] dµ2(y) = ∫ g(y) [ ∫ f(x) dµ1(x) ] dµ2(y) = ∫ [E f(X)] g(y) dµ2(y) = E f(X) E g(Y)

(with Theorem 1.3.61 applied twice here), which is the stated identity (1.4.12).
To deal with Borel functions f and g that are not necessarily non-negative, first apply (1.4.12) for the non-negative functions |f| and |g| to get that E(|f(X) g(Y)|) = E|f(X)| E|g(Y)| < ∞. Thus, the assumed integrability of f(X) and of g(Y) allows us to apply again (1.4.11) for h(x, y) = f(x) g(y). Now repeat the argument we used for deriving (1.4.12) in case of non-negative Borel functions.
Another consequence of Fubini's theorem is the following integration by parts formula.
Lemma 1.4.30 (integration by parts). Suppose H(x) = ∫_{−∞}^x h(y) dy for a non-negative Borel function h and all x ∈ R. Then, for any random variable X,

(1.4.13)    E H(X) = ∫_R h(y) P(X > y) dy .
Proof. Combining the change of variables formula (Theorem 1.3.61) with our assumption about H(·), we have that

E H(X) = ∫_R H(x) dP_X(x) = ∫_R [ ∫_R h(y) I_{x>y} dλ(y) ] dP_X(x) ,

where λ denotes Lebesgue's measure on (R, B). For each y ∈ R, the expectation of the simple function x ↦ ĥ(x, y) = h(y) I_{x>y} with respect to (R, B, P_X) is merely h(y) P(X > y). Thus, applying Fubini's theorem for the non-negative measurable function ĥ(x, y) on the product space R × R equipped with its Borel σ-algebra B_{R²}, and the σ-finite measures µ1 = P_X and µ2 = λ, we have that

E H(X) = ∫_R [ ∫_R h(y) I_{x>y} dP_X(x) ] dλ(y) = ∫_R h(y) P(X > y) dy ,

as claimed.
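The lemma is easy to sanity-check numerically, e.g. for X exponential with unit rate and H(x) = x², where both sides equal E[X²] = 2; a sketch with a crude Riemann sum (grid parameters are our choices):

```python
import math

# Check E[H(X)] = integral of h(y) P(X > y) dy for X ~ Exponential(1)
# and H(x) = x^2, i.e. h(y) = 2y on y > 0.  Since P(X > y) = e^{-y},
# the right side is the integral of 2y e^{-y}, which equals E[X^2] = 2.
dy = 1e-4
rhs = sum(2 * (i * dy) * math.exp(-i * dy) * dy for i in range(1, 200_000))
assert abs(rhs - 2.0) < 1e-3
```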
(b) If X, Y ≥ 0 are such that P(Y ≥ y) ≤ y^{−1} E[X I_{Y≥y}] for all y > 0, then ‖Y‖_p ≤ q ‖X‖_p for any p > 1 and q = p/(p − 1).
(c) Under the same hypothesis also E Y ≤ 1 + E[X (log Y)_+].
Proof. (a) The first identity is merely the integration by parts formula for h_p(y) = p y^{p−1} 1_{y>0} and H_p(x) = x^p 1_{x≥0}, and the second identity follows by the fact that P(Y = y) = 0 up to a (countable) set of y of zero Lebesgue measure. Finally, it is easy to check that H_p(x) = ∫_R h_{p,r}(x, y) dy for the non-negative Borel function h_{p,r}(x, y) = (1 − p/r) p y^{p−1} min(x/y, 1)^r 1_{x≥0} 1_{y>0} and any r > p > 0. Hence, replacing h(y) I_{x>y} throughout the proof of Lemma 1.4.30 by h_{p,r}(x, y) we find that E[H_p(X)] = ∫_0^∞ E[h_{p,r}(X, y)] dy, which is exactly our third identity.
(b) In a similar manner it follows from Fubini's theorem that for p > 1 and any non-negative random variables X and Y,

E[X Y^{p−1}] = E[X H_{p−1}(Y)] = E[ ∫_R h_{p−1}(y) X I_{Y≥y} dy ] = ∫_R h_{p−1}(y) E[X I_{Y≥y}] dy .

Thus, with y h_{p−1}(y) = q^{−1} h_p(y), part (a) and our hypothesis yield

E Y^p = ∫_R h_p(y) P(Y ≥ y) dy ≤ q ∫_R h_{p−1}(y) E[X I_{Y≥y}] dy = q E[X Y^{p−1}] .
Applying Hölder's inequality we deduce that

E Y^p ≤ q E[X Y^{p−1}] ≤ q ‖X‖_p ‖Y^{p−1}‖_q = q ‖X‖_p [E Y^p]^{1/q} ,

where the right-most equality is due to the fact that (p − 1)q = p. In case Y is bounded, dividing both sides of the preceding bound by [E Y^p]^{1/q} implies that ‖Y‖_p ≤ q ‖X‖_p. To deal with the general case, let Y_n = Y ∧ n, n = 1, 2, . . . and note that either {Y_n ≥ y} is empty (for n < y) or {Y_n ≥ y} = {Y ≥ y}. Thus, our assumption implies that P(Y_n ≥ y) ≤ y^{−1} E[X I_{Y_n ≥ y}] for all y > 0 and n ≥ 1. By the preceding argument ‖Y_n‖_p ≤ q ‖X‖_p for any n. Taking n → ∞ it follows by monotone convergence that ‖Y‖_p ≤ q ‖X‖_p.
(c) Considering part (a) with p = 1, we bound P(Y ≥ y) by one for y ∈ [0, 1] and by y^{−1} E[X I_{Y≥y}] for y > 1, to get by Fubini's theorem that

E Y = ∫_0^∞ P(Y ≥ y) dy ≤ 1 + ∫_1^∞ y^{−1} E[X I_{Y≥y}] dy = 1 + E[ X ∫_1^∞ y^{−1} I_{Y≥y} dy ] = 1 + E[X (log Y)_+] .
We further have the following corollary of (1.4.12), dealing with the expectation
of a product of mutually independent R.V.
Corollary 1.4.32. Suppose that X_1, . . . , X_n are P-mutually independent random variables such that either X_i ≥ 0 for all i, or E|X_i| < ∞ for all i. Then,

(1.4.14)    E[ ∏_{i=1}^n X_i ] = ∏_{i=1}^n E X_i ,

that is, the expectation on the left exists and has the value given on the right.
Proof. By Corollary 1.4.11 we know that X = X_1 and Y = X_2 ⋯ X_n are independent. Taking f(x) = |x| and g(y) = |y| in Theorem 1.4.29, we thus have that E|X_1 ⋯ X_n| = E|X_1| E|X_2 ⋯ X_n| for any n ≥ 2. Applying this identity iteratively for X_l, . . . , X_n, starting with l = m, then l = m + 1, m + 2, . . . , n − 1 leads to

(1.4.15)    E|X_m ⋯ X_n| = ∏_{k=m}^n E|X_k| ,

holding for any 1 ≤ m ≤ n. If X_i ≥ 0 for all i, then |X_i| = X_i and we have (1.4.14) as the special case m = 1.
To deal with the proof in case X_i ∈ L¹ for all i, note that for m = 2 the identity (1.4.15) tells us that E|Y| = E|X_2 ⋯ X_n| < ∞, so using Theorem 1.4.29 with f(x) = x and g(y) = y we have that E(X_1 ⋯ X_n) = (E X_1) E(X_2 ⋯ X_n). Iterating this identity for X_l, . . . , X_n, starting with l = 1, then l = 2, 3, . . . , n − 1 leads to the desired result (1.4.14).
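For discrete X_i one can verify (1.4.14) exactly by summing over the product space with product-measure weights; a minimal sketch (the three distributions below are arbitrary choices of ours):

```python
from fractions import Fraction
from itertools import product

# Exact check of (1.4.14): E[X1 X2 X3] = EX1 * EX2 * EX3 for independent
# discrete variables, summing over the product space with product weights.
dists = [  # each entry: list of (value, probability) pairs (our choices)
    [(0, Fraction(1, 2)), (2, Fraction(1, 2))],
    [(1, Fraction(1, 3)), (3, Fraction(2, 3))],
    [(-1, Fraction(1, 4)), (5, Fraction(3, 4))],
]
lhs = sum(x1 * x2 * x3 * p1 * p2 * p3
          for (x1, p1), (x2, p2), (x3, p3) in product(*dists))
rhs = Fraction(1)
for d in dists:
    rhs *= sum(x * p for x, p in d)
assert lhs == rhs == Fraction(49, 6)
```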
Another application of Theorem 1.4.29 provides us with the familiar formula for
the probability density function of the sum X + Y of independent random variables
X and Y , having densities fX and fY respectively.
f_Z(z) = ∫_R f_X(z − y) f_Y(y) dy .

Proof. Applying Theorem 1.4.29 for h(x, y) = I_{x+y ≤ z}, the distribution function of Z = X + Y satisfies

F_Z(z) = P(X + Y ≤ z) = ∫_R [ ∫_{−∞}^{z−y} f_X(x) dx ] dP_Y(y) ,
where the right most equality is by the existence of a density f_X(x) for X (c.f. Definition 1.2.40). Clearly, ∫_{−∞}^{z−y} f_X(x) dx = ∫_{−∞}^{z} f_X(x − y) dx. Thus, applying Fubini's theorem for the Borel measurable function g(x, y) = f_X(x − y) ≥ 0 and the product of the σ-finite Lebesgue measure λ on (−∞, z] and the probability measure P_Y, we see that

F_Z(z) = ∫_R [ ∫_{−∞}^{z} f_X(x − y) dx ] dP_Y(y) = ∫_{−∞}^{z} [ ∫_R f_X(x − y) dP_Y(y) ] dx
(in this application of Fubini's theorem we replace one iterated integral by another, exchanging the order of integrations). Since this applies for any z ∈ R, it follows by definition that Z has the probability density

f_Z(z) = ∫_R f_X(z − y) dP_Y(y) = E f_X(z − Y) .
Next you find two of the many applications of Fubini's theorem in real analysis.
Exercise 1.4.36. Show that the set G_f = {(x, y) ∈ R² : 0 ≤ y ≤ f(x)} of points under the graph of a non-negative Borel function f : R → [0, ∞) is in B_{R²}, and deduce the well-known formula λ × λ(G_f) = ∫_R f(x) dλ(x) for its area.
Hint: Note that Λ(u) = log E[e^{uY}] is convex, non-negative and finite on [0, 1] with Λ(0) = 0 and Λ′(0) = 0. Verify that Λ″(u) + Λ′(u)² ≤ γ² while Λ_0(u) = log cosh(γu) satisfies the differential equation Λ″(u) + Λ′(u)² = γ².
As demonstrated next, Fubini's theorem is also handy in proving the impossibility of certain constructions.
Random variables X and Y such that E(X²) < ∞ and E(Y²) < ∞ are called uncorrelated if E(XY) = E(X)E(Y). It follows from (1.4.12) that independent random variables X, Y with finite second moments are uncorrelated. While the converse is not necessarily true, it does apply for pairs of random variables that take only two different values each.
Exercise 1.4.42. Suppose X and Y are uncorrelated random variables.
(a) Show that if X = I_A and Y = I_B for some A, B ∈ F then X and Y are also independent.
(b) Using this, show that if {a, b}-valued R.V. X and {c, d}-valued R.V. Y
are uncorrelated, then they are also independent.
(c) Give an example of a pair of R.V. X and Y that are uncorrelated but not
independent.
Next come a pair of exercises utilizing Corollary 1.4.32.
Exercise 1.4.43. Suppose X and Y are random variables on the same probability
space, X has a Poisson distribution with parameter > 0, and Y has a Poisson
distribution with parameter > (see Example 1.3.69).
CHAPTER 2
or equivalently, if Cov(X_α, X_β) = 0 for all α ≠ β.
As we next show, the variance of the sum of finitely many uncorrelated random
variables is the sum of the variances of the variables.
Lemma 2.1.2. Suppose X_1, . . . , X_n are uncorrelated random variables (which necessarily are defined on the same probability space). Then,

(2.1.1)    Var(X_1 + ⋯ + X_n) = Var(X_1) + ⋯ + Var(X_n) .
Proof. Let S_n = ∑_{i=1}^n X_i. Then E S_n = ∑_{i=1}^n E X_i, so

Var(S_n) = E[ ∑_{i=1}^n X_i − ∑_{i=1}^n E X_i ]² = E[ ( ∑_{i=1}^n (X_i − E X_i) )² ] .
Writing the square of the sum as the sum of all possible cross-products, we get that

Var(S_n) = ∑_{i,j=1}^n E[(X_i − E X_i)(X_j − E X_j)] = ∑_{i,j=1}^n Cov(X_i, X_j) = ∑_{i=1}^n Cov(X_i, X_i) = ∑_{i=1}^n Var(X_i) ,

where we use the fact that Cov(X_i, X_j) = 0 for each i ≠ j since X_i and X_j are uncorrelated.
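A classical illustration: X_1, X_2 fair ±1 coins and X_3 = X_1 X_2 are pairwise uncorrelated (indeed pairwise independent) but not mutually independent, yet the variance of their sum still splits; an exact enumeration sketch (names are ours):

```python
from fractions import Fraction
from itertools import product

# X1, X2 fair +-1 coins and X3 = X1*X2: pairwise uncorrelated but not
# mutually independent; Var of the sum still equals the sum of variances.
omega = list(product((-1, 1), repeat=2))
P = Fraction(1, 4)

def E(f):
    return sum(P * f(w) for w in omega)

Xs = [lambda w: w[0], lambda w: w[1], lambda w: w[0] * w[1]]
for i in range(3):
    for j in range(i + 1, 3):
        # covariances vanish pairwise
        assert E(lambda w: Xs[i](w) * Xs[j](w)) == E(Xs[i]) * E(Xs[j])

S = lambda w: sum(X(w) for X in Xs)
var = lambda f: E(lambda w: f(w) ** 2) - E(f) ** 2
assert var(S) == sum(var(X) for X in Xs) == 3
```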
Equipped with this lemma we have our
Theorem 2.1.3 (L² weak law of large numbers). Consider S_n = ∑_{i=1}^n X_i for uncorrelated random variables X_1, X_2, . . . with E X_i = m_X and uniformly bounded variances. Then, n^{−1} S_n → m_X in L² (and in probability) as n → ∞.
Remark. As we shall see, the weaker condition E|X_i| < ∞ suffices for the convergence in probability of n^{−1} S_n to m_X. In Section 2.3 we show that it even suffices for the almost sure convergence of n^{−1} S_n to m_X, a statement called the strong law of large numbers.
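The L² weak law is easy to observe in simulation; a minimal sketch for i.i.d. uniform variables (seed and sizes are our choices):

```python
import numpy as np

# For i.i.d. U(0,1) variables, E[(S_n/n - 1/2)^2] = Var(X_1)/n = 1/(12n),
# so sample means concentrate at m_X = 1/2.  Seed and sizes are arbitrary.
rng = np.random.default_rng(42)
n, reps = 10_000, 200
means = rng.random((reps, n)).mean(axis=1)
mse = np.mean((means - 0.5) ** 2)
assert abs(float(means.mean()) - 0.5) < 0.01
assert mse < 3.0 / (12 * n)   # within a factor ~3 of 1/(12n)
```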
Exercise 2.1.5. Show that the conclusion of the L2 weak law of large numbers
holds even for correlated X_i, provided E X_i = x̄ and Cov(X_i, X_j) ≤ r(|i − j|) for all i, j, and some bounded sequence r(k) → 0 as k → ∞.
With an eye on generalizing the L² weak law of large numbers we observe that:

Lemma 2.1.6. If random variables Z_n and non-random b_n > 0 are such that b_n^{−2} Var(Z_n) → 0 as n → ∞, then b_n^{−1}(Z_n − E Z_n) → 0 in L² (and in probability).

Proof. We have E[ (b_n^{−1}(Z_n − E Z_n))² ] = b_n^{−2} Var(Z_n) → 0.
Example 2.1.7. Let Z_n = ∑_{k=1}^n X_k for uncorrelated random variables {X_k}. If Var(X_k)/k → 0 as k → ∞, then Lemma 2.1.6 applies for Z_n and b_n = n, hence n^{−1}(Z_n − E Z_n) → 0 in L² (and in probability). Alternatively, if also Var(X_k) → 0, then Lemma 2.1.6 applies even for Z_n and b_n = n^{1/2}.
More generally, we often consider S_n = ∑_{k=1}^n X_{n,k}, the row sums of triangular arrays of random variables {X_{n,k} : k = 1, . . . , n}. Here are two such examples, both relying on Lemma 2.1.6.
Example 2.1.8 (Coupon collector's problem). Consider i.i.d. random variables U_1, U_2, . . ., each distributed uniformly on {1, 2, . . . , n}. Let |{U_1, . . . , U_l}| denote the number of distinct elements among the first l variables, and τ_k^n = inf{l : |{U_1, . . . , U_l}| = k} be the first time one has k distinct values. We are interested in the asymptotic behavior as n → ∞ of T_n = τ_n^n, the time it takes to have at least one representative of each of the n possible values.
To motivate the name assigned to this example, think of collecting a set of n
different coupons, where independently of all previous choices, each item is chosen
at random in such a way that each of the possible n outcomes is equally likely.
Then, Tn is the number of items one has to collect till having the complete set.
Setting τ_0^n = 0, let X_{n,k} = τ_k^n − τ_{k−1}^n denote the additional time it takes to get an item different from the first k − 1 distinct items collected. Note that X_{n,k} has a geometric distribution of success probability q_{n,k} = 1 − (k−1)/n, hence E X_{n,k} = q_{n,k}^{−1} and Var(X_{n,k}) ≤ q_{n,k}^{−2} (see Example 1.3.69). Since
T_n = τ_n^n − τ_0^n = ∑_{k=1}^n (τ_k^n − τ_{k−1}^n) = ∑_{k=1}^n X_{n,k} ,

it follows that E T_n = ∑_{k=1}^n q_{n,k}^{−1} = n ∑_{l=1}^n l^{−1} = n(log n + γ_n), where γ_n = ∑_{l=1}^n l^{−1} − ∫_1^n x^{−1} dx is a bounded sequence. Further,
Var(T_n) = ∑_{k=1}^n Var(X_{n,k}) ≤ ∑_{k=1}^n ( 1 − (k−1)/n )^{−2} = n² ∑_{l=1}^n l^{−2} ≤ C n² ,
for some C < ∞. Applying Lemma 2.1.6 with b_n = n log n, we deduce that

( T_n − n(log n + γ_n) ) / (n log n) → 0 in L² ,

and since γ_n is bounded, equivalently

T_n / (n log n) → 1 in L² (and in probability).
One possible extension of Example 2.1.8 concerns infinitely many possible coupons.
That is,
Exercise 2.1.9. Suppose {ξ_k} are i.i.d. positive integer valued random variables, with P(ξ_1 = i) = p_i > 0 for i = 1, 2, . . .. Let D_l = |{ξ_1, . . . , ξ_l}| denote the number of distinct elements among the first l variables.
(a) Show that D_n → ∞ almost surely as n → ∞.
(b) Show that n^{−1} E D_n → 0 as n → ∞ and deduce that n^{−1} D_n → 0 in probability.
Hint: Recall that (1 − p)^n ≥ 1 − np for any p ∈ [0, 1] and n ≥ 0.
Example 2.1.10 (An occupancy problem). Suppose we distribute at random r distinct balls among n distinct boxes, where each of the possible n^r assignments of balls to boxes is equally likely. We are interested in the asymptotic behavior of the number N_n of empty boxes when r/n → α ∈ [0, ∞], while n → ∞. To this end, let A_i denote the event that the i-th box is empty, so N_n = ∑_{i=1}^n I_{A_i}. Since each of the r balls independently misses the i-th box with probability 1 − 1/n, we have P(A_i) = (1 − 1/n)^r.
(b) b_n^{−2} ∑_{k=1}^n Var(X̄_{n,k}) → 0.
Then, b_n^{−1}(S_n − a_n) → 0 in probability as n → ∞, where S_n = ∑_{k=1}^n X_{n,k} and a_n = ∑_{k=1}^n E X̄_{n,k}.
Proof. Let S̄_n = ∑_{k=1}^n X̄_{n,k}. Clearly, for any ε > 0,

{ |S_n − a_n| / b_n > ε } ⊆ { S_n ≠ S̄_n } ∪ { |S̄_n − a_n| / b_n > ε } .

Consequently,

(2.1.2)    P( |S_n − a_n| / b_n > ε ) ≤ P(S_n ≠ S̄_n) + P( |S̄_n − a_n| / b_n > ε ) .
To bound the first term, note that our condition (a) implies that as n → ∞,

P(S_n ≠ S̄_n) ≤ P( ∪_{k=1}^n {X_{n,k} ≠ X̄_{n,k}} ) ≤ ∑_{k=1}^n P(X_{n,k} ≠ X̄_{n,k}) = ∑_{k=1}^n P(|X_{n,k}| > b_n) → 0 .

Turning to bound the second term in (2.1.2), recall that pairwise independence is preserved under truncation, hence X̄_{n,k}, k = 1, . . . , n are uncorrelated random variables (to convince yourself, apply (1.4.12) for the appropriate functions). Thus, an application of Lemma 2.1.2 yields that as n → ∞,

Var(b_n^{−1} S̄_n) = b_n^{−2} ∑_{k=1}^n Var(X̄_{n,k}) → 0 ,
Specializing the weak law of Theorem 2.1.11 to a single sequence yields the following.
Proposition 2.1.12 (Weak law of large numbers). Consider i.i.d. random variables {X_i}, such that x P(|X_1| > x) → 0 as x → ∞. Then, n^{−1} S_n − µ_n → 0 in probability, where S_n = ∑_{i=1}^n X_i and µ_n = E[X_1 I_{|X_1| ≤ n}].
(see part (a) of Lemma 1.4.31 for the right identity). Considering Z = X̄_{n,1} = X_1 I_{|X_1| ≤ n}, for which P(|Z| > y) = P(|X_1| > y) − P(|X_1| > n) ≤ P(|X_1| > y) when 0 < y < n and P(|Z| > y) = 0 when y ≥ n, we deduce that

n^{−1} Var(Z) ≤ n^{−1} ∫_0^n g(y) dy ,
the integers m_n are such that m_n − log₂ n → ∞. Taking m_n ≥ log₂ n + log₂(log₂ n) implies that b_n ≥ n log₂ n and a_n/(n log₂ n) = m_n/log₂ n → 1 as n → ∞, with the consequence that S_n/(n log₂ n) → 1 in probability (for details see [Dur10, Example 2.2.7]).
2.2. The Borel-Cantelli lemmas
When dealing with asymptotic theory, we often wish to understand the relation between countably many events A_n in the same probability space. The two Borel-Cantelli lemmas of Subsection 2.2.1 provide information on the probability of the set of outcomes that are in infinitely many of these events, based only on P(A_n). These lemmas have numerous applications, a few of which are given in Subsection 2.2.2 while many more appear in later sections of these notes.
2.2.1. Limit superior and the Borel-Cantelli lemmas. We are often interested in the limits superior and limits inferior of a sequence of events A_n on the same measurable space (Ω, F).
Definition 2.2.1. For a sequence of subsets A_n ⊆ Ω, define

A := lim sup_n A_n = ∩_{m=1}^∞ ∪_{ℓ=m}^∞ A_ℓ ,

a set often denoted {A_n i.o.}, with i.o. standing for "infinitely often". Similarly,

lim inf_n A_n = ∪_{m=1}^∞ ∩_{ℓ=m}^∞ A_ℓ ,

a set often denoted {A_n ev.}, with ev. standing for "eventually".
Remark. Note that if A_n ∈ F are measurable, then so are lim sup_n A_n and lim inf_n A_n. By DeMorgan's law, we have that {A_n ev.} = {A_n^c i.o.}^c, that is, ω ∈ A_n for all n large enough if and only if ω ∈ A_n^c for finitely many n's. Also, if ω ∈ A_n eventually, then certainly ω ∈ A_n infinitely often, that is,

lim inf_n A_n ⊆ lim sup_n A_n .

The notations lim sup_n A_n and lim inf_n A_n are due to the intimate connection of these sets to the lim sup and lim inf of the indicator functions of the sets A_n. For example,

lim sup_{n→∞} I_{A_n}(ω) = I_{lim sup_n A_n}(ω) ,

since for a given ω, the lim sup on the left side equals 1 if and only if the sequence n ↦ I_{A_n}(ω) contains an infinite subsequence of ones. In other words, if and only if the given ω is in infinitely many of the sets A_n. Similarly,

lim inf_{n→∞} I_{A_n}(ω) = I_{lim inf_n A_n}(ω) ,

since for a given ω, the lim inf on the left side equals 1 if and only if there are only finitely many zeros in the sequence n ↦ I_{A_n}(ω) (for otherwise, their limit inferior is zero). In other words, if and only if the given ω is in A_n for all n large enough.
In view of the preceding remark, Fatou's lemma yields the following relations.
Exercise 2.2.2. Prove that for any sequence A_n ∈ F,

P(lim sup_n A_n) ≥ lim sup_n P(A_n) ≥ lim inf_n P(A_n) ≥ P(lim inf_n A_n) .

Show that the right most inequality holds even when the probability measure is replaced by an arbitrary measure µ(·), but the left most inequality may then fail unless µ(∪_{k≥n} A_k) < ∞ for some n.
Practice your understanding of the concepts of lim sup and lim inf of sets by solving
the following exercise.
Exercise 2.2.3. Assume that P(lim sup_n A_n) = 1 and P(lim inf_n B_n) = 1. Prove that P(lim sup_n (A_n ∩ B_n)) = 1. What happens if the condition on {B_n} is weakened to P(lim sup_n B_n) = 1?
Our next result, called the first Borel-Cantelli lemma, states that if the probabilities P(An ) of the individual events An converge to zero fast enough, then almost
surely, An occurs for only finitely many values of n, that is, P(An i.o.) = 0. This
lemma is extremely useful, as the possibly complex relation between the different
events An is irrelevant for its conclusion.
Lemma 2.2.4 (Borel-Cantelli I). Suppose A_n ∈ F and ∑_{n=1}^∞ P(A_n) < ∞. Then, necessarily P(A_n i.o.) = 0.
Proof. Define N(ω) = ∑_{k=1}^∞ I_{A_k}(ω). By monotone convergence and our assumption,

E[N(ω)] = E[ ∑_{k=1}^∞ I_{A_k}(ω) ] = ∑_{k=1}^∞ P(A_k) < ∞ .

Hence, N(ω) < ∞ for almost every ω, which amounts to P(A_n i.o.) = 0.
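A quick simulation illustrating this for independent events A_n with P(A_n) = n^{−2}, a summable series; the seed and cutoffs below are our choices:

```python
import numpy as np

# Independent events A_n with P(A_n) = n^{-2}; the series sums to
# pi^2/6 < infinity, so Borel-Cantelli I says only finitely many occur.
rng = np.random.default_rng(7)
n = 100_000
occurs = rng.random(n) < 1.0 / (np.arange(1, n + 1, dtype=float) ** 2)
count = int(occurs.sum())
assert count < 20   # E[count] = sum_n n^{-2} ~ 1.64
```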
Exercise 2.2.5. Suppose A_n ∈ F are such that ∑_{n=1}^∞ P(A_n ∩ A_{n+1}^c) < ∞ and P(A_n) → 0 as n → ∞. Show that then P(A_n i.o.) = 0.
The first Borel-Cantelli lemma states that if the series ∑_n P(A_n) converges then almost every ω is in finitely many sets A_n. If P(A_n) → 0, but the series ∑_n P(A_n) diverges, then the event {A_n i.o.} might or might not have positive probability. In this sense, the Borel-Cantelli I is not tight, as the following example demonstrates.
Example 2.2.6. Consider the uniform probability measure U on ((0, 1], B_{(0,1]}), and the events A_n = (0, 1/n]. Then A_n ↓ ∅, so {A_n i.o.} = ∅, but U(A_n) = 1/n, so ∑_n U(A_n) = ∞ and the Borel-Cantelli I does not apply.
Recall also Example 1.3.25 showing the existence of An = (tn , tn + 1/n] such that
U (An ) = 1/n while {An i.o.} = (0, 1]. Thus, in general the probability of {An i.o.}
depends on the relation between the different events An .
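Example 2.2.6 is easy to probe numerically. The following short Python sketch (an
illustration only, not part of the notes) draws samples of U and counts how many of the
events A_n = (0, 1/n] each sample falls in — always finitely many, even though ∑_n U(A_n)
diverges.

```python
import random

def count_hits(w, n_max=5000):
    """Number of indices n < n_max with w in A_n = (0, 1/n]."""
    return sum(1 for n in range(1, n_max) if w <= 1 / n)

# Although sum_n U(A_n) = sum_n 1/n diverges, a uniform sample w lies in
# A_n only for n <= 1/w, i.e. in finitely many events: {A_n i.o.} is empty.
random.seed(1)
for _ in range(200):
    w = random.uniform(1e-6, 1.0)
    assert count_hits(w) <= 1 / w
```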
As seen in the preceding example, the divergence of the series ∑_n P(A_n) is not
sufficient for the occurrence of a set of positive probability of ω values, each of
which is in infinitely many events A_n. However, upon adding the assumption that
the events A_n are mutually independent (flagrantly not the case in Example 2.2.6),
we conclude that almost all ω must be in infinitely many of the events A_n:
Lemma 2.2.7 (Borel-Cantelli II). Suppose A_n ∈ F are mutually independent
and ∑_{n=1}^∞ P(A_n) = ∞. Then, necessarily P(A_n i.o.) = 1.
Proof. Fix 0 < m < n < ∞. Use the mutual independence of the events A_ℓ
and the inequality 1 − x ≤ e^{−x} for x ≥ 0, to deduce that
P( ⋂_{ℓ=m}^n A_ℓ^c ) = ∏_{ℓ=m}^n P(A_ℓ^c) = ∏_{ℓ=m}^n (1 − P(A_ℓ))
≤ ∏_{ℓ=m}^n e^{−P(A_ℓ)} = exp( −∑_{ℓ=m}^n P(A_ℓ) ).
As n → ∞, the set ⋂_{ℓ=m}^n A_ℓ^c shrinks to ⋂_{ℓ=m}^∞ A_ℓ^c, so by
continuity from above of the probability measure P(·) we see that for any m,
P( ⋂_{ℓ=m}^∞ A_ℓ^c ) ≤ exp( −∑_{ℓ=m}^∞ P(A_ℓ) ) = 0.
Take the complement to see that P(B_m) = 1 for B_m = ⋃_{ℓ=m}^∞ A_ℓ and all m. Since
B_m ↓ {A_n i.o.} when m → ∞, it follows by continuity from above of P(·) that
P(A_n i.o.) = lim_{m→∞} P(B_m) = 1,
as stated.
As an immediate corollary of the two Borel-Cantelli lemmas, we observe yet another 0-1 law.
Corollary 2.2.8. If A_n ∈ F are P-mutually independent then P(A_n i.o.) is
either 0 or 1. In other words, for any given sequence of mutually independent
events, either almost all outcomes are in infinitely many of these events, or almost
all outcomes are in finitely many of them.
The Kochen-Stone lemma, left as an exercise, generalizes Borel-Cantelli II to situations lacking independence.
Exercise 2.2.9. Suppose A_k are events on the same probability space such that
∑_k P(A_k) = ∞ and
lim sup_{n→∞} ( ∑_{k=1}^n P(A_k) )² / ∑_{1≤j,k≤n} P(A_j ∩ A_k) = α > 0.
Show that then P(A_n i.o.) ≥ α.
X_{n(m_k)} → X_∞ a.s. as k → ∞.
Proof. Fix a subsequence X_{n(m)}. By Theorem 2.2.10 there exists a further subsequence
X_{n(m_k)} such that P(A) = 1 for A = {ω : X_{n(m_k)}(ω) → X_∞(ω) as k → ∞}.
Let B = {ω : X_∞(ω) ∉ D_g}, noting that by assumption P(B) = 1. For any
ω ∈ A ∩ B we have g(X_{n(m_k)}(ω)) → g(X_∞(ω)) by the continuity of g outside D_g.
Therefore, g(X_{n(m_k)}) →a.s. g(X_∞). Now apply Theorem 2.2.10 in the reverse
direction: for any subsequence, we have just constructed a further subsequence with
convergence a.s., hence g(X_n) →p g(X_∞).
Finally, if g is bounded, then the collection {g(X_n)} is U.I., yielding, by Vitali's
convergence theorem, its convergence in L1 (and hence that Eg(X_n) → Eg(X_∞)).
Next, you are to extend the scope of Theorem 2.2.10 and the continuous mapping
of Corollary 2.2.13 to random variables taking values in a separable metric space.
Exercise 2.2.14. Recall the definition of convergence in probability in a separable
metric space (S, ρ) as in Remark 1.3.24.
(a) Extend the proof of Theorem 2.2.10 to apply for any (S, B_S)-valued random
variables {X_n, n ≤ ∞} (and in particular for R-valued variables).
(b) Denote by D_g the set of discontinuities of a Borel measurable g : S ↦ R
(defined similarly to Exercise 1.2.28, where real-valued functions are
considered). Suppose X_n →p X_∞ and P(X_∞ ∈ D_g) = 0. Show that then
g(X_n) →p g(X_∞).
Exercise 2.2.15. The Laplace transform of a bounded, continuous function h(x) on
[0, ∞) is the function L_h(s) = ∫_0^∞ e^{−sx} h(x) dx on (0, ∞).
(a) Show that for any s > 0 and positive integer k,
(−1)^{k−1} (s^k/(k−1)!) L_h^{(k−1)}(s) = ∫_0^∞ e^{−sx} (s^k x^{k−1}/(k−1)!) h(x) dx = E[h(W_k)],
where L_h^{(k−1)}(·) denotes the (k−1)-th derivative of the function L_h(·) and
W_k has the gamma density with parameters k and s.
(b) Recall Exercise 1.4.46 that for s = n/y the law of W_n coincides with the
law of n^{−1} ∑_{i=1}^n T_i, where T_i ≥ 0 are i.i.d. random variables, each having
the exponential distribution of parameter 1/y (with ET_1 = y and finite
moments of all order, c.f. Example 1.3.68). Deduce that the inversion
formula
h(y) = lim_{n→∞} (−1)^{n−1} ((n/y)^n/(n−1)!) L_h^{(n−1)}(n/y)
holds for each y > 0.
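As a quick numerical sanity check of the inversion formula (a Python sketch, not part of
the notes, assuming the concrete choice h(x) = e^{−x}): here L_h(s) = 1/(s+1), whose
(n−1)-th derivative is (−1)^{n−1}(n−1)!/(s+1)^n, so the right side of the formula
collapses to (1 + y/n)^{−n}, which indeed converges to e^{−y}.

```python
import math

def inversion_approx(y, n):
    """n-th approximation in the inversion formula, for h(x) = exp(-x):
    L_h(s) = 1/(s+1), with (n-1)-th derivative (-1)^(n-1) (n-1)!/(s+1)^n."""
    s = n / y
    deriv = (-1) ** (n - 1) * math.factorial(n - 1) / (s + 1) ** n
    return (-1) ** (n - 1) * s ** n / math.factorial(n - 1) * deriv

y = 1.7
# algebraically the expression collapses to (1 + y/n)^(-n) -> exp(-y)
assert abs(inversion_approx(y, 50) - (1 + y / 50) ** (-50)) < 1e-6
assert abs(inversion_approx(y, 50) - math.exp(-y)) < 1e-2
```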
The next application of Borel-Cantelli I provides our first strong law of large
numbers.
Proposition 2.2.16. Suppose E[Z_n²] ≤ C for some C < ∞ and all n. Then,
n^{−1} Z_n →a.s. 0 as n → ∞.
Proof. Fixing ε > 0 let A_k = {ω : |k^{−1} Z_k(ω)| > ε} for k = 1, 2, . . .. Then, by
Chebyshev's inequality and our assumption,
P(A_k) = P({ω : |Z_k(ω)| ≥ εk}) ≤ E(Z_k²)/(εk)² ≤ C ε^{−2} k^{−2}.
Since ∑_k k^{−2} < ∞, it follows by Borel-Cantelli I that P(A_ε) = 0, where A_ε =
{ω : |k^{−1} Z_k(ω)| > ε for infinitely many values of k}. Hence, for any fixed
ε > 0, with probability one k^{−1}|Z_k(ω)| ≤ ε for all large enough k, that is,
lim sup_{n→∞} n^{−1}|Z_n(ω)| ≤ ε a.s. Considering a sequence ε_m ↓ 0 we conclude that
n^{−1} Z_n(ω) → 0 for n → ∞ and a.e. ω.
Exercise 2.2.17. Let S_n = ∑_{l=1}^n X_l for i.i.d. random variables {X_l} such that
EX_1 = 0 and EX_1^4 < ∞. Show that E[S_n^4] ≤ C n² for some C < ∞ and all n, and
deduce that n^{−1} S_n →a.s. 0.
Next we have a few other applications of Borel-Cantelli I, starting with some additional properties of convergence a.s.
Exercise 2.2.19. Show that for any R.V. Xn
a.s.
backwards from time m, we are interested in the asymptotics of the longest such
run during 1, 2, . . . , n, that is,
L_n = max{ℓ_m : m = 1, . . . , n}
= max{m − k : X_{k+1} = ··· = X_m = 1 for some m = 1, . . . , n}.
Noting that ℓ_m + 1 has a geometric distribution of success probability p = 1/2, we
deduce by an application of Borel-Cantelli I that for each ε > 0, with probability
one, ℓ_n ≤ (1 + ε) log₂ n for all n large enough. Hence, on the same set of probability
one, we have N = N(ω) finite such that L_n ≤ max(L_N, (1 + ε) log₂ n) for all n ≥ N.
Dividing by log₂ n and considering n → ∞ followed by ε ↓ 0, this implies that
lim sup_{n→∞} L_n / log₂ n ≤ 1 a.s.
For each fixed ε > 0 let A_n = {L_n < k_n} for k_n = [(1 − ε) log₂ n]. Noting that
A_n ⊆ ⋂_{i=1}^{m_n} B_i^c
for the mutually independent events B_i = {X_{(i−1)k_n+1} = ··· = X_{i k_n} = 1},
i = 1, . . . , m_n = [n/k_n], we get that P(A_n) ≤ (1 − 2^{−k_n})^{m_n}, which is
summable in n. Hence, by Borel-Cantelli I, P(A_n i.o.) = 0, and combining this with
the preceding bound we conclude that
L_n / log₂ n →a.s. 1.
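The concentration of L_n near log₂ n is visible already at moderate n. The following
Python sketch (an illustration, not part of the notes; the sample sizes are arbitrary)
averages L_n/log₂ n over independent simulated coin-toss sequences.

```python
import math
import random

def longest_run(bits):
    """Length of the longest run of 1's in a 0/1 sequence."""
    best = cur = 0
    for b in bits:
        cur = cur + 1 if b else 0
        best = max(best, cur)
    return best

random.seed(7)
n = 2 ** 14
ratios = [longest_run([random.randint(0, 1) for _ in range(n)]) / math.log2(n)
          for _ in range(20)]
avg = sum(ratios) / len(ratios)
# L_n / log2(n) is close to 1 (here log2 n = 14)
assert 0.7 < avg < 1.3
```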
After deriving the classical bounds on the tail of the normal distribution, you
use both Borel-Cantelli lemmas in bounding the fluctuations of the sums of i.i.d.
standard normal variables.
Exercise 2.2.24. Let {Gi } be i.i.d. standard normal random variables.
(a) Show that for any x > 0,
(x^{−1} − x^{−3}) e^{−x²/2} ≤ ∫_x^∞ e^{−y²/2} dy ≤ x^{−1} e^{−x²/2}.
Many texts prove these estimates, for example see [Dur10, Theorem
1.2.3].
(b) Show that, with probability one,
lim sup_{n→∞} G_n / √(2 log n) = 1.
(c) Let S_n = G_1 + ··· + G_n. Recall that n^{−1/2} S_n has the standard normal
distribution. Show that
P(|S_n| < 2 √(n log n), ev.) = 1.
Remark. Ignoring the dependence between the elements of the sequence S_k, the
bound in part (c) of the preceding exercise is not tight. The definitive result here is
the law of the iterated logarithm (in short lil), which states that when the i.i.d.
summands are of zero mean and variance one,
(2.2.1) P( lim sup_{n→∞} S_n / √(2n log log n) = 1 ) = 1.
We defer the derivation of (2.2.1) to Theorem 9.2.29, building on a similar lil for
the Brownian motion (but, see [Bil95, Theorem 9.5] for a direct proof of (2.2.1),
using both Borel-Cantelli lemmas).
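The √(2 log n) scaling in part (b) of Exercise 2.2.24 is easy to observe numerically. A
Python sketch (an illustration, not part of the notes; the sample size is arbitrary):

```python
import math
import random

random.seed(3)
n = 200_000
m = max(random.gauss(0.0, 1.0) for _ in range(n))
# the running maximum of n i.i.d. N(0,1) variables concentrates around
# sqrt(2 log n), which is about 4.94 for n = 200000
assert 0.75 < m / math.sqrt(2 * math.log(n)) < 1.25
```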
The next exercise relates explicit integrability conditions for i.i.d. random variables to the asymptotics of their running maxima.
Exercise 2.2.25. Consider possibly R̄-valued, i.i.d. random variables {Y_i} and
their running maxima M_n = max_{k≤n} Y_k.
(a) Using (2.3.4) if needed, show that P(|Y_n| > n i.o.) = 0 if and only if
E[|Y_1|] < ∞.
(b) Show that n^{−1} Y_n →a.s. 0 if and only if E[|Y_1|] < ∞.
(c) Show that n^{−1} M_n →a.s. 0 if and only if E[(Y_1)_+] < ∞ and P(Y_1 > −∞) > 0.
(d) Show that n^{−1} M_n →p 0 if and only if nP(Y_1 > n) → 0 and P(Y_1 > −∞) > 0.
(e) Show that n^{−1} Y_n →p 0 if and only if P(|Y_1| < ∞) = 1.
In the following exercise, you combine Borel-Cantelli I and the variance computation
of Lemma 2.1.2 to improve upon Borel-Cantelli II.
Exercise 2.2.26. Suppose ∑_{n=1}^∞ P(A_n) = ∞ for pairwise independent events
{A_i}. Let S_n = ∑_{i=1}^n I_{A_i} be the number of events occurring among the first n.
(a) Prove that Var(S_n) ≤ E(S_n) and deduce from it that S_n/E(S_n) →p 1.
(b) Applying Borel-Cantelli I show that S_{n_k}/E(S_{n_k}) →a.s. 1 as k → ∞, where
n_k = inf{n : E(S_n) ≥ k²}.
(c) Show that E(S_{n_{k+1}})/E(S_{n_k}) → 1 and since n ↦ S_n is non-decreasing,
deduce that S_n/E(S_n) →a.s. 1.
∑_{k=1}^∞ P(X_k ≠ X̄_k) = ∑_{k=1}^∞ P(|X_1| > k) ≤ ∫_0^∞ P(|X_1| > x) dx = E|X_1| < ∞
(see part (a) of Lemma 1.4.31 for the rightmost identity and recall our assumption
that X_1 is integrable). Thus, by Borel-Cantelli I, with probability one, X_k(ω) =
X̄_k(ω) for all but finitely many k's, in which case necessarily sup_n |S_n(ω) − S̄_n(ω)|
is finite. This shows that n^{−1}(S_n − S̄_n) →a.s. 0, whereby it suffices to prove that
n^{−1} S̄_n →a.s. EX_1.
To this end, we next show that it suffices to prove the following lemma about
almost sure convergence of S̄_n along suitably chosen subsequences.
Lemma 2.3.2. Fixing α > 1 let n_l = [α^l]. Under the conditions of the proposition,
n_l^{−1}(S̄_{n_l} − ES̄_{n_l}) →a.s. 0 as l → ∞.
By dominated convergence, E[X_1 I_{|X_1|≤k}] → EX_1 as k → ∞, and consequently, as
n → ∞,
n^{−1} ES̄_n = n^{−1} ∑_{k=1}^n EX̄_k = n^{−1} ∑_{k=1}^n E[X_1 I_{|X_1|≤k}] → EX_1
(we have used here the consistency of Cesaro averages, c.f. Exercise 1.3.52 for an
integral version). Thus, assuming that Lemma 2.3.2 holds, we have that n_l^{−1} S̄_{n_l} →a.s.
EX_1 when l → ∞, for each α > 1.
We complete the proof of the proposition by interpolating from the subsequences
n_l = [α^l] to the whole sequence. To this end, fix α > 1. Since n ↦ S̄_n is non-decreasing,
we have for all ω and any n ∈ [n_l, n_{l+1}],
(n_l/n_{l+1}) · n_l^{−1} S̄_{n_l}(ω) ≤ n^{−1} S̄_n(ω) ≤ (n_{l+1}/n_l) · n_{l+1}^{−1} S̄_{n_{l+1}}(ω).
With n_l/n_{l+1} → 1/α for l → ∞, the a.s. convergence of m^{−1} S̄_m along the subsequence
m = n_l implies that the event
A_α := {ω : α^{−1} EX_1 ≤ lim inf_{n→∞} n^{−1} S̄_n(ω) ≤ lim sup_{n→∞} n^{−1} S̄_n(ω) ≤ α EX_1}
has probability one. Consequently, taking α_m ↓ 1, we deduce that the event B := ⋂_m A_{α_m}
also has probability one, and further, n^{−1} S̄_n(ω) → EX_1 for each ω ∈ B.
We thus deduce that n^{−1} S̄_n →a.s. EX_1, as needed to complete the proof of the
proposition.
∑_{l=1}^∞ n_l^{−1} E[X_1² I_{|X_1|≤n_l}] = E[ X_1² ∑_{l=1}^∞ n_l^{−1} I_{|X_1|≤n_l} ] < ∞
(the latter identity is a special case of Exercise 1.3.40). Since E|X_1| < ∞, it thus
suffices to show that for n_l = [α^l] and any x > 0,
(2.3.1) u(x) := ∑_{l=1}^∞ n_l^{−1} I_{x≤n_l} ≤ c x^{−1},
where c = 2α/(α − 1) < ∞. To establish (2.3.1) fix α > 1 and x > 0, setting
L = min{l ≥ 1 : n_l ≥ x}. Then, α^L ≥ x, and since [y] ≥ y/2 for all y ≥ 1,
u(x) = ∑_{l=L}^∞ n_l^{−1} ≤ 2 ∑_{l=L}^∞ α^{−l} = c α^{−L} ≤ c x^{−1}.
So, we have established (2.3.1) and hence completed the proof of the lemma.
As already promised, it is not hard to extend the scope of the strong law of large
numbers beyond integrable and non-negative random variables.
Theorem 2.3.3 (Strong law of large numbers). Let S_n = ∑_{i=1}^n X_i for
pairwise independent and identically distributed random variables {X_i}, such that
either E[(X_1)_+] is finite or E[(X_1)_−] is finite. Then, n^{−1} S_n →a.s. EX_1 as n → ∞.
Proof. First consider non-negative X_i. The case of EX_1 < ∞ has already
been dealt with in Proposition 2.3.1. In case EX_1 = ∞, consider S_n^{(m)} = ∑_{i=1}^n X_i^{(m)}
for the bounded, non-negative, pairwise independent and identically distributed
random variables X_i^{(m)} = min(X_i, m) ≤ X_i. Since Proposition 2.3.1 applies for
{X_i^{(m)}}, it follows that a.s. for any fixed m < ∞,
(2.3.2) lim inf_{n→∞} n^{−1} S_n ≥ lim_{n→∞} n^{−1} S_n^{(m)} = E min(X_1, m).
also n^{−1} ∑_{i=1}^n (X_i)_− →a.s. E[(X_1)_−]. Our assumption that either E[(X_1)_+] < ∞ or
E[(X_1)_−] < ∞ implies that EX_1 = E[(X_1)_+] − E[(X_1)_−] is well defined, and in
view of (2.3.3) we have the stated a.s. convergence of n^{−1} S_n to EX_1.
Exercise 2.3.4. You are now to prove a converse to the strong law of large numbers
(for a more general result, due to Feller (1946), see [Dur10, Theorem 2.5.9]).
(a) Let Y denote the integer part of a random variable Z ≥ 0. Show that
Y = ∑_{n=1}^∞ I_{Z≥n}, and deduce that
(2.3.4) ∑_{n=1}^∞ P(Z ≥ n) ≤ EZ ≤ 1 + ∑_{n=1}^∞ P(Z ≥ n).
(b) Suppose {X_i} are i.i.d. R.V.-s with E[|X_1|^α] = ∞ for some α > 0. Show
that for any k > 0,
∑_{n=1}^∞ P(|X_n| ≥ k n^{1/α}) = ∞,
and deduce that lim sup_{n→∞} n^{−1/α}|X_n| = ∞ a.s.
We provide next two classical applications of the strong law of large numbers, the
first of which deals with the large sample asymptotics of the empirical distribution
function.
Example 2.3.5 (Empirical distribution function). Let
F_n(x) = n^{−1} ∑_{i=1}^n I_{(−∞,x]}(X_i),
denote the observed fraction of values among the first n variables of the sequence
{Xi } which do not exceed x. The functions Fn () are thus called the empirical
distribution functions of this sequence.
For i.i.d. {X_i} with distribution function F_X our next result improves the strong
law of large numbers by showing that F_n converges uniformly to F_X as n → ∞.
Theorem 2.3.6 (Glivenko-Cantelli). For i.i.d. {X_i} with arbitrary distribution
function F_X, as n → ∞,
D_n := sup_{x∈R} |F_n(x) − F_X(x)| →a.s. 0.
Remark. While outside our scope, we note in passing the Dvoretzky-Kiefer-Wolfowitz
inequality that P(D_n > ε) ≤ 2 exp(−2nε²) for any n and all ε > 0,
quantifying the rate of convergence of D_n to zero (see [DKW56], or [Mas90] for
the optimal pre-exponential constant).
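Both the Glivenko-Cantelli convergence and the DKW rate can be seen numerically. A
Python sketch (an illustration, not part of the notes), using uniform samples so that
F_X(x) = x on [0, 1] and the supremum is attained at the order statistics:

```python
import random

def d_n(sample):
    """D_n = sup_x |F_n(x) - x| for a sample from the uniform law on (0,1),
    where F_X(x) = x; the supremum is attained at the order statistics."""
    xs = sorted(sample)
    n = len(xs)
    return max(max(abs((i + 1) / n - x), abs(i / n - x))
               for i, x in enumerate(xs))

random.seed(5)
for n in (100, 10_000):
    dn = d_n([random.random() for _ in range(n)])
    # DKW with eps = 2/sqrt(n): P(D_n > eps) <= 2 exp(-8) < 0.001,
    # so this bound holds for all but very unlucky samples
    assert dn <= 2 / n ** 0.5
```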
Proof. By the right continuity of both x ↦ F_n(x) and x ↦ F_X(x) (c.f.
Theorem 1.2.37), the value of D_n is unchanged when the supremum over x ∈ R is
replaced by the one over x ∈ Q (the rational numbers). In particular, this shows
that each D_n is a random variable (c.f. Theorem 1.2.22).
Applying the strong law of large numbers for the i.i.d. non-negative I_{(−∞,x]}(X_i)
whose expectation is F_X(x), we deduce that F_n(x) →a.s. F_X(x) for each fixed
non-random x ∈ R. Similarly, considering the strong law of large numbers for the i.i.d.
non-negative I_{(−∞,x)}(X_i) whose expectation is F_X(x−), we have that F_n(x−) →a.s.
F_X(x−) for each fixed non-random x ∈ R. Consequently, for any fixed l < ∞ and
x_{1,l}, . . . , x_{l,l} we have that, a.s.,
F_n(x_{k,l}) → F_X(x_{k,l}) and F_n(x_{k,l}−) → F_X(x_{k,l}−) for k = 1, . . . , l.
Exercise 2.3.9. In the context of Example 2.3.7 fix ε > 0 such that P(τ_1 > ε) > 0
and let T̃_n = ∑_{k=1}^n ẽ_k for the i.i.d. random variables ẽ_i = ε I_{τ_i > ε}. Note that
T̃_n ≤ T_n and consequently N_t ≤ Ñ_t = sup{n : T̃_n ≤ t}.
(a) Show that lim sup_{t→∞} t^{−2} E[Ñ_t²] < ∞.
(b) Deduce that {t^{−1} N_t : t ≥ 1} is uniformly integrable (see Exercise 1.3.54),
and conclude that t^{−1} EN_t → 1/Eτ_1 when t → ∞.
The next exercise deals with an elaboration of Example 2.3.7.
Exercise 2.3.10. For i = 1, 2, . . . the i-th light bulb burns for an amount of time τ_i
and then remains burned out for time s_i before being replaced by the (i + 1)-th bulb.
Let R_t denote the fraction of time during [0, t] in which we have a working light.
Assuming that the two sequences {τ_i} and {s_i} are independent, each consisting of
i.i.d. positive and integrable random variables, show that R_t →a.s. Eτ_1/(Eτ_1 + Es_1).
Here is another exercise, dealing with sampling at times of heads in independent
fair coin tosses, from a non-random bounded sequence of weights v(l), the averages
of which converge.
We proceed with a few additional applications of the strong law of large numbers,
first to a problem of universal hypothesis testing, then an application involving
stochastic geometry, and finally one motivated by investment science.
Exercise 2.3.12. Consider i.i.d. [0, 1]-valued random variables {Xk }.
(a) Find Borel measurable functions f_n : [0, 1]^n ↦ {0, 1}, which are independent
of the law of X_k, such that f_n(X_1, X_2, . . . , X_n) →a.s. 0 whenever
EX_1 < 1/2 and f_n(X_1, X_2, . . . , X_n) →a.s. 1 whenever EX_1 > 1/2.
(b) Modify your answer to assure that f_n(X_1, X_2, . . . , X_n) →a.s. 1 also in case
EX_1 = 1/2.
Exercise 2.3.13. Let {U_n} be i.i.d. random vectors, each uniformly distributed
on the unit ball {u ∈ R² : |u| ≤ 1}. Consider the R²-valued random vectors
X_n = |X_{n−1}| U_n, n = 1, 2, . . ., starting at a non-random, non-zero vector X_0 (that is,
each point is uniformly chosen in a ball centered at the origin and whose radius is the
distance from the origin to the previously chosen point). Show that n^{−1} log |X_n| →a.s.
−1/2 as n → ∞.
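For the constant −1/2 in the preceding exercise, note that n^{−1} log |X_n| =
n^{−1} log |X_0| + n^{−1} ∑_{k=1}^n log |U_k|, so the strong law reduces the claim to
computing E log |U_1|. Spelling out this step (the radius R = |U_1| of a uniform point
in the unit disk has density 2r on [0, 1]):

```latex
\mathbb{E}[\log R] \;=\; \int_0^1 (\log r)\, 2r \,\mathrm{d}r
  \;=\; \Bigl[\, r^2 \log r - \tfrac{r^2}{2} \,\Bigr]_0^1 \;=\; -\tfrac{1}{2}.
```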
Exercise 2.3.14. Let {V_n} be i.i.d. non-negative random variables. Fixing r > 0
and q ∈ (0, 1], consider the sequence W_0 = 1 and W_n = (qr + (1 − q)V_n) W_{n−1},
n = 1, 2, . . .. A motivating example is of W_n recording the relative growth of a
portfolio where a constant fraction q of one's wealth is re-invested each year in a
risk-less asset that grows by r per year, with the remainder re-invested in a risky
asset whose annual growth factors are the random V_n.
Exercise 2.3.15. Show that almost surely lim sup_{n→∞} log Z_n / log EZ_n ≤ 1 for
any positive, non-decreasing random variables Z_n such that Z_n →a.s. ∞.
2.3.2. Convergence of random series. A second approach to the strong
law of large numbers is based on studying the convergence of random series. The
key tool in this approach is Kolmogorov's maximal inequality, which we prove next.
Proposition 2.3.16 (Kolmogorov's maximal inequality). Suppose the random
variables Y_1, . . . , Y_n are mutually independent, with EY_l² < ∞ and EY_l = 0 for l =
1, . . . , n. Then, for Z_k = Y_1 + ··· + Y_k and any z > 0,
(2.3.5) z² P( max_{1≤k≤n} |Z_k| ≥ z ) ≤ Var(Z_n).
Proof. Let A_k = {ω : |Z_k(ω)| ≥ z, |Z_j(ω)| < z for j = 1, . . . , k − 1} denote
the disjoint events on which the first crossing of level z occurs at time k, whose
union is {max_{1≤k≤n} |Z_k| ≥ z}. Since Z_k² ≥ z² on A_k,
(2.3.6) z² P( max_{1≤k≤n} |Z_k| ≥ z ) = ∑_{k=1}^n z² P(A_k) ≤ ∑_{k=1}^n E[Z_k²; A_k].
Moreover, as E[(Z_n − Z_k)Z_k; A_k] = 0 (which we show below) and E[(Z_n − Z_k)²; A_k] ≥ 0,
(2.3.7) ∑_{k=1}^n E[Z_k²; A_k] ≤ ∑_{k=1}^n E[Z_n²; A_k] ≤ E[Z_n²] = Var(Z_n),
and (2.3.5) follows by comparing (2.3.6) and (2.3.7). Since Z_k I_{A_k} can be represented
as a non-random Borel function of (Y_1, . . . , Y_k), it follows that Z_k I_{A_k} is measurable
on σ(Y_1, . . . , Y_k). Consequently, for fixed k and l > k the variables Y_l and Z_k I_{A_k}
are independent, hence uncorrelated. Further, EY_l = 0, so
E[(Z_n − Z_k)Z_k; A_k] = ∑_{l=k+1}^n E[Y_l Z_k I_{A_k}] = ∑_{l=k+1}^n E(Y_l) E(Z_k I_{A_k}) = 0,
as claimed.
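The bound (2.3.5) can be checked directly by Monte Carlo. A Python sketch (an
illustration, not part of the notes; the walk length, level and trial count are arbitrary
choices), using ±1 steps so that Var(Z_n) = n:

```python
import random

random.seed(11)
n, z, trials = 50, 12.0, 20_000
hits = 0
for _ in range(trials):
    zk, mx = 0.0, 0.0
    for _ in range(n):
        zk += random.choice((-1.0, 1.0))   # Y_l with EY = 0, EY^2 = 1
        mx = max(mx, abs(zk))
    hits += mx >= z
# (2.3.5): z^2 P(max_k |Z_k| >= z) <= Var(Z_n) = n
assert z * z * (hits / trials) <= n
```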
Proof. Applying Kolmogorov's maximal inequality for the independent variables
Y_l = X_{l+r}, we have that for any ε > 0 and positive integers r and n,
P( max_{r≤k≤r+n} |S_k − S_r| ≥ ε ) ≤ ε^{−2} Var(S_{r+n} − S_r) = ε^{−2} ∑_{l=r+1}^{r+n} Var(X_l).
Taking n → ∞, we get that
P( sup_{k≥r} |S_k − S_r| ≥ ε ) ≤ ε^{−2} ∑_{l=r+1}^∞ Var(X_l).
By our assumption that ∑_n Var(X_n) is finite, it follows that ∑_{l>r} Var(X_l) → 0 as
r → ∞. Hence, if we let T_r = sup_{n,m≥r} |S_n − S_m|, then for any ε > 0,
P(T_r ≥ 2ε) ≤ P( sup_{k≥r} |S_k − S_r| ≥ ε ) → 0 as r → ∞. Since r ↦ T_r is
non-increasing, it follows that T_r →a.s. 0, that is, a.s. the sequence S_n(ω) is Cauchy,
hence convergent.
The link between convergence of random series and the strong law of large numbers
is the following classical analysis lemma.
Lemma 2.3.20 (Kronecker's lemma). Consider two sequences of real numbers
{x_n} and {b_n} where b_n > 0 and b_n ↑ ∞. If ∑_n x_n/b_n converges, then s_n/b_n → 0
for s_n = x_1 + ··· + x_n.
Proof. Let u_n = ∑_{k=1}^n (x_k/b_k), which by assumption converges to a finite
limit denoted u_∞. Setting u_0 = b_0 = 0, summation by parts yields the identity
s_n = ∑_{k=1}^n b_k (u_k − u_{k−1}) = b_n u_n − ∑_{k=1}^n (b_k − b_{k−1}) u_{k−1}.
Since b_n^{−1} ∑_{k=1}^n (b_k − b_{k−1}) = 1 and u_{k−1} → u_∞, the weighted averages
b_n^{−1} ∑_{k=1}^n (b_k − b_{k−1}) u_{k−1} converge to u_∞ as well, so dividing the
preceding identity by b_n yields s_n/b_n → u_∞ − u_∞ = 0.
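Kronecker's lemma is easy to see in action. A Python sketch (an illustration, not part
of the notes; the particular sequences are arbitrary choices):

```python
# b_n = n and x_n = (-1)^n n^0.6: the series sum_n x_n / b_n is an
# alternating series with terms n^{-0.4} decreasing to 0, so it converges,
# and Kronecker's lemma then forces s_n / b_n -> 0 even though s_n itself
# is unbounded (|s_n| grows roughly like n^0.6 / 2).
def ratio(N):
    s = sum((-1) ** n * n ** 0.6 for n in range(1, N + 1))
    return abs(s) / N

assert ratio(100_000) < ratio(1_000) < ratio(100)
assert ratio(100_000) < 1e-2
```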
Theorem 2.3.17 provides an alternative proof for the strong law of large numbers
of Theorem 2.3.3 in case {Xi } are i.i.d. (that is, replacing pairwise independence
by mutual independence). Indeed, applying the same truncation scheme as in the
proof of Proposition 2.3.1, it suffices to prove the following alternative to Lemma
2.3.2.
Lemma 2.3.21. Under the conditions of Proposition 2.3.1, the random series
∑_k k^{−1}(X̄_k − EX̄_k) converges a.s., hence, by Kronecker's lemma,
m^{−1} ∑_{k=1}^m (X̄_k − EX̄_k) →a.s. 0.
Lemma 2.3.21, in contrast to Lemma 2.3.2, does not require the restriction to a
subsequence nl . Consequently, in this proof of the strong law there is no need for
an interpolation argument so it is carried directly for Xk , with no need to split each
variable to its positive and negative parts.
Proof of Lemma 2.3.21. We will shortly show that
(2.3.8) ∑_{k=1}^∞ k^{−2} Var(X̄_k) ≤ 2E|X_1| < ∞,
so the stated a.s. convergence of the random series follows from Theorem 2.3.17.
Indeed, since Var(X̄_k) ≤ E[X̄_k²] = E[X_1² I_{|X_1|≤k}] and k^{−2} ≤ 2/(k(k+1)),
∑_{k=1}^∞ k^{−2} Var(X̄_k) ≤ ∑_{k=1}^∞ (2/(k(k+1))) E[X_1² I_{|X_1|≤k}] = E[X_1² v(|X_1|)],
where, for x > 0,
v(x) = 2 ∑_{k≥x} 1/(k(k+1)) = 2 ∑_{k≥x} (1/k − 1/(k+1)) = 2/⌈x⌉ ≤ 2x^{−1},
and since x² · 2x^{−1} = 2x, the bound (2.3.8) follows.
Many of the ingredients of this proof of the strong law of large numbers are also
relevant for solving the following exercise.
Exercise 2.3.22. Let c_n be a bounded sequence of non-random constants, and
{X_i} i.i.d. integrable R.V.-s of zero mean. Show that n^{−1} ∑_{k=1}^n c_k X_k →a.s. 0 for
n → ∞.
Next you will find a few exercises that illustrate how useful Kronecker's lemma is when
proving the strong law of large numbers in the case of independent but not identically
distributed summands.
Exercise 2.3.23. Let S_n = ∑_{k=1}^n Y_k for independent random variables {Y_i} such
that Var(Y_k) < B < ∞ and EY_k = 0 for all k. Show that [n(log n)^{1+ε}]^{−1/2} S_n →a.s. 0
as n → ∞, where ε > 0 is fixed (this falls short of the law of the iterated logarithm of
(2.2.1), but each Y_k is allowed here to have a different distribution).
Exercise 2.3.24. Suppose the independent random variables {X_i} are such that
Var(X_k) ≤ p_k < ∞ and EX_k = 0 for k = 1, 2, . . ..
(a) Show that if ∑_k p_k < ∞ then n^{−1} ∑_{k=1}^n k X_k →a.s. 0.
(b) Conversely, assuming ∑_k p_k = ∞, give an example of independent random
variables {X_i}, such that Var(X_k) ≤ p_k, EX_k = 0, for which almost
surely lim sup_{n→∞} X_n(ω) = 1.
Exercise 2.3.25. Suppose {X_n} are mutually independent, non-negative random
variables.
(a) Show that if
(2.3.9) ∑_{n=1}^∞ [ P(X_n ≥ 1) + E(X_n I_{X_n<1}) ] < ∞,
then the random series ∑_n X_n(ω) converges w.p.1.
(b) Prove the converse, namely, that if ∑_n X_n(ω) converges w.p.1. then
(2.3.9) holds.
(c) Suppose G_n are mutually independent random variables, with G_n having
the normal distribution N(μ_n, v_n). Show that w.p.1. the random series
∑_n G_n(ω) converges if and only if e := ∑_n (μ_n² + v_n) is finite.
(d) Suppose τ_n are mutually independent random variables, with τ_n having
the exponential distribution of parameter λ_n > 0. Show that w.p.1. the
random series ∑_n τ_n(ω) converges if and only if ∑_n 1/λ_n is finite.
Hint: For part (b) recall that for any a_n ∈ [0, 1), the series ∑_n a_n is finite if and
only if ∏_n (1 − a_n) > 0. For part (c) let f(y) = ∑_n min((μ_n + √v_n y)², 1) and
observe that if e = ∞ then f(y) + f(−y) = ∞ for all y ≠ 0.
You can now also show that for such a strong law of large numbers (that is, with
independent but not identically distributed summands), it suffices to strengthen
the corresponding weak law (only) along the subsequence n_r = 2^r.
Exercise 2.3.26. Let Z_k = ∑_{j=1}^k Y_j where Y_j are mutually independent R.V.-s.
(a) Fixing ε > 0 show that if 2^{−r} Z_{2^r} →a.s. 0 then ∑_r P(|Z_{2^{r+1}} − Z_{2^r}| > ε 2^r)
is finite, and if m^{−1} Z_m →p 0 then max_{m<k≤2m} P(|Z_{2m} − Z_k| ≥ εm) → 0.
(b) Adapting the proof of Kolmogorov's maximal inequality show that for any
n and z > 0,
P( max_{1≤k≤n} |Z_k| ≥ 2z ) min_{1≤k≤n} P(|Z_n − Z_k| < z) ≤ P(|Z_n| > z).
CHAPTER 3
f(y) = (2πv)^{−1/2} exp( −(y − μ)²/(2v) ).
As we show next, the normal distribution is preserved when the sum of independent
variables is considered (which is the main reason for its role as the limiting law for
the clt).
Lemma 3.1.1. Let Y_{n,k} be mutually independent random variables, each having
the normal distribution N(μ_{n,k}, v_{n,k}). Then, G_n = ∑_{k=1}^n Y_{n,k} has the normal
distribution N(μ_n, v_n), with μ_n = ∑_{k=1}^n μ_{n,k} and v_n = ∑_{k=1}^n v_{n,k}.
of N(0, v_i) distribution, i = 1, 2, is
f_G(z) = (2π √(v_1 v_2))^{−1} ∫_{−∞}^∞ exp( −(z − y)²/(2v_1) ) exp( −y²/(2v_2) ) dy.
Comparing this with the formula of (3.1.1) for v = v_1 + v_2, it just remains to show
that for any z ∈ R,
(3.1.2) 1 = (2πu)^{−1/2} ∫_{−∞}^∞ exp( z²/(2v) − (z − y)²/(2v_1) − y²/(2v_2) ) dy,
where u = v_1 v_2/(v_1 + v_2). It is not hard to check that the argument of the exponential
function in (3.1.2) is −(y − cz)²/(2u) for c = v_2/(v_1 + v_2). Consequently,
(3.1.2) is merely the obvious fact that the N(cz, u) density function integrates to
one (as any density function should), no matter what the value of z is.
Considering Lemma 3.1.1 for Y_{n,k} = (nv)^{−1/2}(Y_k − μ) and i.i.d. random variables
Y_k, each having a normal distribution of mean μ and variance v, we see that μ_{n,k} =
0 and v_{n,k} = 1/n, so G_n = (nv)^{−1/2}( ∑_{k=1}^n Y_k − nμ ) has the standard N(0, 1)
distribution, regardless of n.
3.1.1. Lindeberg's clt for triangular arrays. Our next proposition, the
celebrated clt, states that the distribution of Ŝ_n = (nv)^{−1/2}( ∑_{k=1}^n X_k − nμ )
approaches the standard normal distribution in the limit n → ∞, even though X_k
may well be non-normal random variables.
Proposition 3.1.2 (Central Limit Theorem). Let
Ŝ_n = (nv)^{−1/2} ∑_{k=1}^n (X_k − μ),
where {X_k} are i.i.d. with v = Var(X_1) ∈ (0, ∞) and μ = E(X_1). Then,
(3.1.3) lim_{n→∞} P(Ŝ_n ≤ b) = (2π)^{−1/2} ∫_{−∞}^b exp(−y²/2) dy for every b ∈ R.
As we have seen in the context of the weak law of large numbers, it pays to extend
the scope of consideration to triangular arrays in which the random variables X_{n,k}
are independent within each row, but not necessarily of identical distribution. This
is the context of Lindeberg's clt, which we state next.
Theorem 3.1.3 (Lindeberg's clt). Let Ŝ_n = ∑_{k=1}^n X_{n,k} for P-mutually
independent random variables X_{n,k}, k = 1, . . . , n, such that EX_{n,k} = 0 for all k
and
v_n = ∑_{k=1}^n E[X_{n,k}²] → 1 as n → ∞.
Suppose also that for each ε > 0,
(3.1.4) g_n(ε) = ∑_{k=1}^n E[X_{n,k}²; |X_{n,k}| ≥ ε] → 0 as n → ∞.
Then, (3.1.3) holds, that is, Ŝ_n converges in distribution to a standard normal
variable.
Note that the variables in different rows need not be independent of each other
and could even be defined on different probability spaces.
Remark 3.1.4. Under the assumptions of Proposition 3.1.2 the variables X_{n,k} =
(nv)^{−1/2}(X_k − μ) are mutually independent and such that
EX_{n,k} = (nv)^{−1/2}(EX_k − μ) = 0,
v_n = ∑_{k=1}^n E[X_{n,k}²] = (nv)^{−1} ∑_{k=1}^n Var(X_k) = 1.
For each ε > 0 the sequence (X_1 − μ)² I_{|X_1−μ|≥ε√(nv)} converges a.s. to zero for
n → ∞ and is dominated by the integrable random variable (X_1 − μ)². Thus, by
dominated convergence, g_n(ε) → 0 as n → ∞. We conclude that all assumptions
of Theorem 3.1.3 are satisfied for this choice of X_{n,k}, hence Proposition 3.1.2 is a
special instance of Lindeberg's clt, to which we turn our attention next.
Let r_n = max{ √v_{n,k} : k = 1, . . . , n } for v_{n,k} = E[X_{n,k}²]. Since for every n, k and
ε > 0,
v_{n,k} = E[X_{n,k}²] = E[X_{n,k}²; |X_{n,k}| < ε] + E[X_{n,k}²; |X_{n,k}| ≥ ε] ≤ ε² + g_n(ε),
it follows that
r_n² ≤ ε² + g_n(ε) for all n and ε > 0,
hence Lindeberg's condition implies that r_n → 0 as n → ∞.
D
Remark. Recall that Gn = n G for n = vn . So, assuming vn 1 and
Lindebergs condition which implies that rn 0 for n , it follows from the
lemma that |Eh(Sbn ) Eh(n G)| 0 as n . Further, |h(n x) h(x)|
|n 1||x|kh k , so taking the expectation with respect to the standard normal
law we see that |Eh(n G) Eh(G)| 0 if the first derivative of h is also uniformly
bounded. Hence,
(3.1.5)
for any continuous function h(·) with continuous and uniformly bounded first three
derivatives. This is actually all we need from Lemma 3.1.5 in order to prove
Lindeberg's clt. Further, as we show in Section 3.2, convergence in distribution as in
(3.1.3) is equivalent to (3.1.5) holding for all continuous, bounded functions h(·).
Proof of Lemma 3.1.5. Let Ĝ_n = ∑_{k=1}^n Y_{n,k} for mutually independent Y_{n,k},
distributed according to N(0, v_{n,k}), that are independent of {X_{n,k}}. Fixing n and
h, we simplify the notations by eliminating n, that is, we write Y_k for Y_{n,k}, and X_k
for X_{n,k}. To facilitate the proof define the mixed sums
U_l = ∑_{k=1}^{l−1} X_k + ∑_{k=l+1}^n Y_k,  l = 1, . . . , n,
noting that U_1 + Y_1 = Ĝ_n, U_n + X_n = Ŝ_n, and U_l + X_l = U_{l+1} + Y_{l+1} for
l = 1, . . . , n − 1. Hence,
Eh(Ŝ_n) − Eh(Ĝ_n) = ∑_{l=1}^n Δ_l,  Δ_l := E[h(U_l + X_l)] − E[h(U_l + Y_l)].
By a second order Taylor expansion of h around U_l,
h(U_l + ξ) = h(U_l) + ξ h′(U_l) + (ξ²/2) h″(U_l) + R_l(ξ),
where both
|R_l(ξ)| ≤ ‖h‴‖_∞ |ξ|³/6,  |R_l(ξ)| ≤ ‖h″‖_∞ |ξ|²,
whence,
(3.1.7) |R_l(ξ)| ≤ min( ‖h‴‖_∞ |ξ|³/6, ‖h″‖_∞ |ξ|² ).
Considering this expansion for ξ = X_l and for ξ = Y_l, we get that
|Δ_l| ≤ |E[(X_l − Y_l) h′(U_l)]| + |E[ ((X_l² − Y_l²)/2) h″(U_l) ]| + |E[R_l(X_l) − R_l(Y_l)]|.
Recall that X_l and Y_l are independent of U_l and chosen such that EX_l = EY_l and
EX_l² = EY_l². As the first two terms in the bound on |Δ_l| vanish we have that
(3.1.8) |Δ_l| ≤ E|R_l(X_l)| + E|R_l(Y_l)|.
Utilizing (3.1.7),
E|R_l(X_l)| ≤ ‖h‴‖_∞ E[ |X_l|³/6 ; |X_l| ≤ ε ] + ‖h″‖_∞ E[ |X_l|² ; |X_l| ≥ ε ]
≤ (ε/6) ‖h‴‖_∞ E[|X_l|²] + ‖h″‖_∞ E[ X_l² ; |X_l| ≥ ε ].
Summing these bounds over l = 1, . . . , n, by our assumption that ∑_{l=1}^n E[X_l²] = v_n
and the definition of g_n(ε), we get that
(3.1.9) ∑_{l=1}^n E|R_l(X_l)| ≤ (ε/6) v_n ‖h‴‖_∞ + g_n(ε) ‖h″‖_∞.
Recall that Y_l/√v_{n,l} is a standard normal random variable, whose fourth moment
is 3 (see (1.3.18)). By monotonicity in q of the L^q-norms (c.f. Lemma 1.3.16), it
follows that E[|Y_l/√v_{n,l}|³] ≤ 3, hence E|Y_l|³ ≤ 3 v_{n,l}^{3/2} ≤ 3 r_n v_{n,l}. Utilizing once
more (3.1.7) and the fact that v_n = ∑_{l=1}^n v_{n,l}, we arrive at
(3.1.10) ∑_{l=1}^n E|R_l(Y_l)| ≤ (‖h‴‖_∞/6) ∑_{l=1}^n E|Y_l|³ ≤ (r_n/2) v_n ‖h‴‖_∞.
Proof. There are many ways to prove this. Here is one which is from first principles,
hence requires no analysis knowledge. The function ψ : [0, 1] ↦ [0, 1] given
by ψ(x) = 1 − 140 ∫_0^x u³(1 − u)³ du is monotone decreasing, with continuous
derivatives of all order, such that ψ(0) = 1, ψ(1) = 0 and whose first three derivatives
at 0 and at 1 are all zero. Its extension θ(x) = ψ( (min(x, 1))_+ ) to a function on R
that is one for x ≤ 0 and zero for x ≥ 1 is thus non-increasing, with continuous and
uniformly bounded first three derivatives. It is easy to check that the translated and
scaled functions h_k^+(x) = θ(k(x − b)) and h_k^−(x) = θ(k(x − b) + 1) have all the
claimed properties.
Proof of Theorem 3.1.3. Applying (3.1.5) for h_k^−(·), then taking k → ∞,
we have by monotone convergence that
lim inf_{n→∞} P(Ŝ_n < b) ≥ lim_{n→∞} E[h_k^−(Ŝ_n)] = E[h_k^−(G)] → F_G(b−).
Similarly, considering h_k^+(·), then taking k → ∞, we have by bounded convergence
that
lim sup_{n→∞} P(Ŝ_n ≤ b) ≤ lim_{n→∞} E[h_k^+(Ŝ_n)] = E[h_k^+(G)] → F_G(b).
3.1.2. Applications of the clt. We start with the simpler, i.i.d. case. In
doing so, we use the notation Z_n →D G when the analog of (3.1.3) holds for the
sequence {Z_n}, that is, P(Z_n ≤ b) → P(G ≤ b) as n → ∞ for all b ∈ R (where G is
a standard normal variable).
Example 3.1.7 (Normal approximation of the Binomial). Consider i.i.d.
random variables {B_i}, each of which is Bernoulli of parameter 0 < p < 1 (i.e.
P(B_1 = 1) = 1 − P(B_1 = 0) = p). The sum S_n = B_1 + ··· + B_n has the Binomial
distribution of parameters (n, p), that is,
P(S_n = k) = \binom{n}{k} p^k (1 − p)^{n−k},  k = 0, . . . , n.
For example, if B_i indicates that the i-th independent toss of the same coin lands on
a Head then S_n counts the total number of Heads in the first n tosses of the coin.
Recall that EB = p and Var(B) = p(1 − p) (see Example 1.3.69), so the clt states
that (S_n − np)/√(np(1 − p)) →D G. It allows us to approximate, for all large enough
n, the typically non-computable weighted sums of binomial terms by integrals with
respect to the standard normal density.
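A concrete numerical comparison (a Python sketch, not part of the notes; the choices
n = 1000, p = 0.3, k = 320 and the continuity correction are illustrative assumptions)
of the exact binomial CDF against the clt-based normal approximation:

```python
import math

n, p, k = 1000, 0.3, 320

# exact binomial CDF P(S_n <= k)
exact = sum(math.comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(k + 1))

# clt-based normal approximation, with the usual continuity correction
mu, sd = n * p, math.sqrt(n * p * (1 - p))
approx = 0.5 * (1 + math.erf((k + 0.5 - mu) / (sd * math.sqrt(2))))

assert abs(exact - approx) < 0.01
```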
Here is another example that is similar and almost as widely used.
Example 3.1.8 (Normal approximation of the Poisson distribution). It
is not hard to verify that the sum of two independent Poisson random variables has
the Poisson distribution, with a parameter which is the sum of the parameters of
the summands. Thus, by induction, if {X_i} are i.i.d. each of Poisson distribution
of parameter 1, then N_n = X_1 + ··· + X_n has a Poisson distribution of parameter
n. Since E(N_1) = Var(N_1) = 1 (see Example 1.3.69), the clt applies for
(N_n − n)/n^{1/2}. This provides an approximation for the distribution function of the
Poisson variable N_λ of parameter λ that is a large integer. To deal with non-integer
values λ = n + η for some η ∈ (0, 1), consider the mutually independent Poisson
variables N_n, N_η and N_{1−η}. Since N_λ D= N_n + N_η and N_{n+1} D= N_n + N_η + N_{1−η},
this provides a monotone coupling, that is, a construction of the random variables N_n,
N_λ and N_{n+1} on the same probability space, such that N_n ≤ N_λ ≤ N_{n+1}.
Because of this monotonicity, for any ε > 0 and all n ≥ n_0(b, ε) the event
{(N_λ − λ)/√λ ≤ b} is between {(N_{n+1} − (n+1))/√(n+1) ≤ b − ε} and
{(N_n − n)/√n ≤ b + ε}. Considering the limit as n → ∞ followed by ε ↓ 0, it thus
follows that the convergence (N_n − n)/n^{1/2} →D G implies also that
(N_λ − λ)/λ^{1/2} →D G as λ → ∞. In words, the normal distribution is a good
approximation of the Poisson distribution with a large parameter.
In Theorem 2.3.3 we established the strong law of large numbers when the summands Xi are only pairwise independent. Unfortunately, as the next example shows,
pairwise independence is not good enough for the clt.
Example 3.1.9. Consider i.i.d. {ξ_i} such that P(ξ_i = 1) = P(ξ_i = −1) = 1/2
for all i. Set X_1 = ξ_1 and successively let X_{2^k+j} = X_j ξ_{k+2} for j = 1, . . . , 2^k and
k = 0, 1, . . .. Note that each X_l is a {−1, 1}-valued variable, specifically, a product
of a different finite subset of the ξ_i-s that corresponds to the positions of ones in the
binary representation of 2l − 1 (with ξ_1 for its least significant digit, ξ_2 for the next
digit, etc.). Consequently, each X_l is of zero mean and if l ≠ r then in EX_l X_r
at least one of the ξ_i-s will appear exactly once, resulting with EX_l X_r = 0, hence
with {X_l} being uncorrelated variables. Recall part (b) of Exercise 1.4.42, that such
variables are pairwise independent. Further, EX_l = 0 and X_l ∈ {−1, 1} mean that
P(X_l = 1) = P(X_l = −1) = 1/2, so these variables are identically distributed. As for
the zero mean variables S_n = ∑_{j=1}^n X_j, we have arranged things such that S_1 = ξ_1
and for any k ≥ 0,
S_{2^{k+1}} = ∑_{j=1}^{2^k} (X_j + X_{2^k+j}) = ∑_{j=1}^{2^k} X_j (1 + ξ_{k+2}) = S_{2^k} (1 + ξ_{k+2}),
hence S_{2^k} = ξ_1 ∏_{i=2}^{k+1} (1 + ξ_i) for all k ≥ 1. In particular, S_{2^k} = 0 unless
ξ_2 = ξ_3 = ··· = ξ_{k+1} = 1, an event of probability 2^{−k}. Thus, P(S_{2^k} ≠ 0) = 2^{−k}
and certainly the clt result (3.1.3) does not hold along the subsequence n = 2^k.
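The construction of this example can be verified by exhaustive enumeration. A Python
sketch (an illustration, not part of the notes; the choice K = 3, i.e. checking
X_1, . . . , X_8 over all sign patterns of (ξ_1, . . . , ξ_4), is arbitrary):

```python
from itertools import product

K = 3
m = 2 ** K

def build_x(xi):
    """X_1 = xi_1 and X_{2^k + j} = X_j * xi_{k+2} (xi[k+1] is xi_{k+2})."""
    x = [xi[0]]
    k = 0
    while len(x) < m:
        x += [x[j] * xi[k + 1] for j in range(2 ** k)]
        k += 1
    return x

patterns = list(product((-1, 1), repeat=K + 1))
# uncorrelated: E[X_l X_r] = 0 for l != r (uniform over all sign patterns)
for l in range(m):
    for r in range(l + 1, m):
        assert sum(build_x(xi)[l] * build_x(xi)[r] for xi in patterns) == 0
# yet S_{2^K} = X_1 + ... + X_{2^K} is nonzero only when xi_2 = ... = xi_{K+1} = 1
nonzero = sum(1 for xi in patterns if sum(build_x(xi)) != 0)
assert nonzero / len(patterns) == 2 ** (-K)
```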
We turn next to applications of Lindeberg's triangular array clt, starting with
the asymptotics of the count of record events till time n ≥ 1.
Exercise 3.1.10. Consider the count R_n of record events during the first n instances
of i.i.d. R.V. with a continuous distribution function, as in Example 2.2.27.
Recall that R_n = B_1 + ··· + B_n for mutually independent Bernoulli random variables
{B_k} such that P(B_k = 1) = 1 − P(B_k = 0) = k^{−1}.
(a) Check that b_n/log n → 1 where b_n = Var(R_n).
(b) Show that Lindeberg's clt applies for X_{n,k} = (log n)^{−1/2}(B_k − k^{−1}).
(c) Recall that |ER_n − log n| ≤ 1, and conclude that (R_n − log n)/√(log n) →D G.
Remark. Let S_n denote the symmetric group of permutations on {1, . . . , n}. For
s ∈ S_n and i ∈ {1, . . . , n}, denoting by L_i(s) the smallest j ≥ 1 such that s^j(i) = i,
we call {s^j(i) : 1 ≤ j ≤ L_i(s)} the cycle of s containing i. If each s ∈ S_n is equally
likely, then the law of the number T_n(s) of different cycles in s is the same as that
of R_n of Example 2.2.27 (for a proof see [Dur10, Example 2.2.4]). Consequently,
Exercise 3.1.10 also shows that in this setting (T_n − log n)/√(log n) →D G.
Part (a) of the following exercise is a special case of Lindeberg's clt, known also
as Lyapunov's theorem.
Exercise 3.1.11 (Lyapunov's theorem). Let S_n = Σ_{k=1}^n X_k for {X_k} mutually
independent such that v_n = Var(S_n) < ∞.
(a) Show that if there exists q > 2 such that

lim_{n→∞} v_n^{−q/2} Σ_{k=1}^n E(|X_k − EX_k|^q) = 0,

then v_n^{−1/2}(S_n − ES_n) →^D G.
(b) Show that part (a) applies in case v_n → ∞ and E(|X_k − EX_k|^q) ≤
C(Var X_k)^r for r = 1, some q > 2, C < ∞ and k = 1, 2, ….
(c) Provide an example where the conditions of part (b) hold with r = q/2
but v_n^{−1/2}(S_n − ES_n) does not converge in distribution.
The next application of Lindeberg's clt involves the use of truncation (which we
have already introduced in the context of the weak law of large numbers), to derive
the clt for normalized sums of certain i.i.d. random variables of infinite variance.
Specifically, consider i.i.d. {X_k} of symmetric distribution such that P(|X_1| > x) = x^{−2}
for all x ≥ 1 (so that EX_1² = ∞), and the truncated, normalized variables

X_{n,k} = b_n^{−1} X_k I_{|X_k|≤c_n},

where b_n = √(n log n) and c_n ≥ 1 are such that both c_n/b_n → 0 and c_n/√n → ∞.
Indeed, for each n the variables X_{n,k}, k = 1, …, n, are i.i.d.
of bounded and symmetric distribution (since both the distribution of X_k and the
truncation function are symmetric). Consequently, EX_{n,k} = 0 for all n and k.
Further, we have chosen b_n such that

v_n = nEX_{n,1}² = (n/b_n²) E[X_1² I_{|X_1|≤c_n}] = (2n/b_n²) ∫_0^{c_n} x[P(|X_1| > x) − P(|X_1| > c_n)] dx
    = (n/b_n²) [∫_0^1 2x dx + ∫_1^{c_n} (2/x) dx − ∫_0^{c_n} (2x/c_n²) dx] = (2n log c_n)/b_n² → 1

as n → ∞. Finally, note that |X_{n,k}| ≤ c_n/b_n → 0 as n → ∞, implying that
g_n(ε) = 0 for any ε > 0 and all n large enough, hence Lindeberg's condition
trivially holds. We thus deduce from Lindeberg's clt that (n log n)^{−1/2} S̄_n →^D G as
n → ∞, where S̄_n = Σ_{k=1}^n X_k I_{|X_k|≤c_n} is the sum of the truncated variables. We
have chosen the truncation level c_n large enough to assure that

P(S_n ≠ S̄_n) ≤ Σ_{k=1}^n P(|X_k| > c_n) = nP(|X_1| > c_n) = n c_n^{−2} → 0,

hence also (n log n)^{−1/2} S_n →^D G as claimed.
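This clt is easy to simulate, since here |X_1| can be sampled by inverse transform: P(|X_1| > x) = x^{−2} gives |X_1| = U^{−1/2} for U uniform on (0, 1]. The sketch below (arbitrary seed and sizes) checks that (n log n)^{−1/2} S_n is roughly standard normal, keeping in mind that the heavy tail makes the convergence slow:

```python
import random, math

def sample_X(rng):
    # symmetric law with P(|X| > x) = x**-2 for x >= 1: |X| = U**-0.5
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids division by zero
    x = u ** -0.5
    return x if rng.random() < 0.5 else -x

rng = random.Random(2)
n, trials = 2000, 2000
norm = math.sqrt(n * math.log(n))
vals = [sum(sample_X(rng) for _ in range(n)) / norm for _ in range(trials)]
frac_neg = sum(v < 0 for v in vals) / trials       # exactly 1/2 by symmetry
frac_in = sum(abs(v) <= 1 for v in vals) / trials  # slowly approaches ~0.68
print(frac_neg, frac_in)
```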
We conclude this section with Kolmogorov's three series theorem, the most definitive result on the convergence of random series.
Theorem 3.1.14 (Kolmogorov's three series theorem). Suppose {X_k} are
independent random variables. For non-random c > 0 let X_n^{(c)} = X_n I_{|X_n|≤c} be the
corresponding truncated variables and consider the three series

(3.1.11)    Σ_n P(|X_n| > c),    Σ_n EX_n^{(c)},    Σ_n Var(X_n^{(c)}).

Then, the random series Σ_n X_n converges a.s. if and only if for some c > 0 all
three series of (3.1.11) converge.
Remark. In view of the theorem, if the
three series of (3.1.11) converge for some c > 0, then they necessarily converge for
every c > 0.
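For independent fair signs ξ_n = ±1 and X_n = a_n ξ_n with |a_n| ≤ 1, only the third series matters (the first two vanish for c = 1), so Σ_n X_n converges a.s. iff Σ_n a_n² < ∞. The sketch below (arbitrary seed and sizes) contrasts a_n = 1/n (convergent, partial sums have bounded variance Σ 1/k² ≤ π²/6) with a_n = 1/√n (divergent, variance grows like log n):

```python
import random, math

def partial_sum(coefs, rng):
    # S = sum of a_k * xi_k over the given coefficients, fair signs xi_k = +/-1
    return sum(a if rng.random() < 0.5 else -a for a in coefs)

def emp_var(coefs, rng, trials=1000):
    vals = [partial_sum(coefs, rng) for _ in range(trials)]
    m = sum(vals) / trials
    return sum((v - m) ** 2 for v in vals) / trials

rng = random.Random(3)
n = 4000
v_conv = emp_var([1 / k for k in range(1, n + 1)], rng)            # ~ pi^2/6
v_div = emp_var([1 / math.sqrt(k) for k in range(1, n + 1)], rng)  # ~ log n, keeps growing
print(v_conv, v_div)
```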
Proof. We prove the sufficiency first, that is, assume that for some c > 0
all three series of (3.1.11) converge. By Theorem 2.3.17 and the finiteness of
Σ_n Var(X_n^{(c)}) it follows that the random series Σ_n (X_n^{(c)} − EX_n^{(c)}) converges a.s.
Then, by our assumption that Σ_n EX_n^{(c)} converges, also Σ_n X_n^{(c)} converges a.s.
Further, by assumption the sequence of probabilities P(X_n^{(c)} ≠ X_n) = P(|X_n| > c)
is summable, hence by Borel-Cantelli I, we have that a.s. X_n^{(c)} ≠ X_n for at most
finitely many n-s. The convergence a.s. of Σ_n X_n^{(c)} thus results with the convergence a.s. of Σ_n X_n, as claimed.
We turn to prove the necessity of convergence of the three series in (3.1.11) for the
convergence of Σ_n X_n, which is where we use the clt. To this end, assume the
random series Σ_n X_n converges a.s. (to a finite limit) and fix an arbitrary constant
c > 0. The convergence of Σ_n X_n implies that |X_n| → 0, hence a.s. |X_n| > c
for only finitely many n-s. In view of the independence of these events and Borel-Cantelli II, necessarily the sequence P(|X_n| > c) is summable, that is, the series
Σ_n P(|X_n| > c) converges. Further, the convergence a.s. of Σ_n X_n then results
with the a.s. convergence of Σ_n X_n^{(c)}.
Suppose now that the non-decreasing sequence v_n = Σ_{k=1}^n Var(X_k^{(c)}) is unbounded,
in which case the latter convergence implies that a.s. T_n = v_n^{−1/2} Σ_{k=1}^n X_k^{(c)} → 0
when n → ∞. We further claim that in this case Lindeberg's clt applies for
Ŝ_n = Σ_{k=1}^n X_{n,k}, where

X_{n,k} = v_n^{−1/2}(X_k^{(c)} − m_k^{(c)})    and    m_k^{(c)} = EX_k^{(c)}.

Indeed, per fixed n the variables X_{n,k} are mutually independent of zero mean
and such that Σ_{k=1}^n EX_{n,k}² = 1. Further, since |X_k^{(c)}| ≤ c and we assumed that
v_n → ∞, Lindeberg's
condition holds (as g_n(ε) = 0 when ε > 2c/√v_n, i.e. for all n large enough).
Combining Lindeberg's clt conclusion that Ŝ_n →^D G and T_n →^{a.s.} 0, we deduce that
(Ŝ_n − T_n) →^D G (c.f. Exercise 3.2.8). However, since Ŝ_n − T_n = −v_n^{−1/2} Σ_{k=1}^n m_k^{(c)}
are non-random, the sequence P(Ŝ_n − T_n ≤ 0) is composed of zeros and ones,
hence cannot converge to P(G ≤ 0) = 1/2. We arrive at a contradiction to our
assumption that v_n → ∞, and so conclude that the sequence Var(X_n^{(c)}) is summable,
that is, the series Σ_n Var(X_n^{(c)}) converges.
By Theorem 2.3.17, the summability of Var(X_n^{(c)}) implies that the series Σ_n (X_n^{(c)} −
m_n^{(c)}) converges a.s. We have already seen that Σ_n X_n^{(c)} converges a.s., so it follows
that their difference Σ_n m_n^{(c)}, which is the middle term of (3.1.11), converges as
well.
3.2. Weak convergence
Focusing here on the theory of weak convergence, we first consider in Subsection
3.2.1 the convergence in distribution in a more general setting than that of the clt.
This is followed by the study in Subsection 3.2.2 of weak convergence of probability
measures and the theory associated with it, most notably its relation to other modes
of convergence.
As convergence point-wise of distribution functions may fail at points where the limit is discontinuous, we
have to restrict such convergence to the continuity points of the limiting distribution
function F_X, as done in Definition 3.2.1.
We have seen in Examples 3.1.7 and 3.1.8 that the normal distribution is a good
approximation for the Binomial and the Poisson distributions (when the corresponding parameter is large). Our next example is of the same type, now with the
approximation of the Geometric distribution by the Exponential one.
Example 3.2.5 (Exponential approximation of the Geometric). Let Z_p
be a random variable with a Geometric distribution of parameter p ∈ (0, 1), that is,
P(Z_p ≥ k) = (1 − p)^{k−1} for any positive integer k. As p → 0, we see that

P(pZ_p > t) = (1 − p)^{⌊t/p⌋} → e^{−t}    for all t ≥ 0,

that is, pZ_p converges in distribution to the Exponential distribution of parameter one.
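Since P(Z_p > s) = (1 − p)^{⌊s⌋}, the tail P(pZ_p > t) can be evaluated exactly; the sketch below shows the convergence to e^{−t} as p → 0:

```python
import math

def tail(p, t):
    # P(p * Z_p > t) = (1 - p) ** floor(t / p)
    return (1 - p) ** math.floor(t / p)

for p in (0.1, 0.01, 0.001):
    print(p, tail(p, 1.0), math.exp(-1.0))  # second column approaches the third
```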
While proving Theorem 1.2.37 we saw that F_{Y_n^−} = F_n for any n ≤ ∞, and as
remarked there Y_n^−(ω) = Y_n^+(ω) for all but at most countably many values of ω,
hence P(Y_n^− = Y_n^+) = 1. It thus suffices to show that for all ω ∈ (0, 1),

(3.2.1)    Y_∞^−(ω) ≤ lim inf_n Y_n^−(ω) ≤ lim sup_n Y_n^−(ω) ≤ lim sup_n Y_n^+(ω) ≤ Y_∞^+(ω).

Indeed, then Y_n(ω) → Y_∞(ω) for any ω ∈ A = {ω : Y_∞^−(ω) = Y_∞^+(ω)}, where
P(A) = 1. Hence, setting Y_n = Y_n^+ for 1 ≤ n ≤ ∞ would complete the proof of the
theorem.
Turning to prove (3.2.1), note that the two middle inequalities are trivial. Fixing
ω ∈ (0, 1) we proceed to show that

(3.2.2)    Y_∞^+(ω) ≥ lim sup_n Y_n^+(ω).

Since the continuity points of F_∞ form a dense subset of R (see Exercise 1.2.39),
it suffices for (3.2.2) to show that if z > Y_∞^+(ω) is a continuity point of F_∞, then
necessarily z ≥ Y_n^+(ω) for all n large enough. To this end, note that z > Y_∞^+(ω)
implies by definition that F_∞(z) > ω. Since z is a continuity point of F_∞ and
F_n →^w F_∞ we know that F_n(z) → F_∞(z). Hence, F_n(z) > ω for all sufficiently large
n. By definition of Y_n^+ and monotonicity of F_n, this implies that z ≥ Y_n^+(ω), as
needed. The proof of

(3.2.3)    lim inf_n Y_n^−(ω) ≥ Y_∞^−(ω)

is analogous.
Exercise 3.2.8. Suppose that X_n →^D X_∞ and Y_n →^p Y_∞, where Y_∞ is non-random and for each n the variables X_n and Y_n are defined on the same probability
space.

(a) … and E(X_1) = 0, then S_{N_m}/√(v̂_m) →^D G as m → ∞.
Hint: Use Kolmogorov's inequality to show that S_{N_m}/√(v̂_m) − Ŝ_m/√(v̂_m) →^p 0.
(b) Let N_t = sup{n : S_n ≤ t} for S_n = Σ_{k=1}^n Y_k and i.i.d. random variables
Y_k > 0 such that v = Var(Y_1) ∈ (0, ∞) and E(Y_1) = 1. Show that
(N_t − t)/√(vt) →^D G as t → ∞.
Theorem 3.2.7 is key to solving the following:
f(c))/f′(c) →^D Z for every positive constants b_n and every Borel function
The next exercise relates the decay (in n) of sup_s |F_{X_∞}(s) − F_{X_n}(s)| to that of
sup |Eh(X_n) − Eh(X_∞)| over all functions h : R → [−M, M] with sup_x |h′(x)| ≤ L.
Exercise 3.2.12. Let Δ_n = sup_s |F_{X_∞}(s) − F_{X_n}(s)|.
(a) Show that if sup_x |h(x)| ≤ M and sup_x |h′(x)| ≤ L, then for any b > a,
C = 4M + L(b − a) and all n,
|Eh(X_n) − Eh(X_∞)| ≤ CΔ_n + 4M P(X_∞ ∉ [a, b]).
(b) Show that if X_∞ ∈ [a, b] and f_{X_∞}(x) ≥ η > 0 for all x ∈ [a, b], then
|Q_n(ω) − Q_∞(ω)| ≤ η^{−1}Δ_n for any ω ∈ (Δ_n, 1 − Δ_n), where Q_n(ω) =
sup{x : F_{X_n}(x) < ω} denotes the ω-quantile for the law of X_n. Using this,
construct Y_n =^D X_n such that P(|Y_n − Y_∞| > η^{−1}Δ_n) ≤ 2Δ_n and deduce
the bound of part (a), albeit with the larger value 4M + L/η of C.
Here is another example of convergence in distribution, this time in the context
of extreme value theory.
Exercise 3.2.13. Let M_n = max_{1≤i≤n}{T_i}, where T_i, i = 1, 2, … are i.i.d. random variables of distribution function F_T(t). Noting that F_{M_n}(x) = F_T(x)^n, show
that b_n^{−1}(M_n − a_n) →^D M_∞ when:
(a) F_T(t) = 1 − e^{−t} for t ≥ 0 (i.e. T_i are Exponential of parameter one).
Here, a_n = log n, b_n = 1 and F_{M_∞}(y) = exp(−e^{−y}) for y ∈ R.
(b) F_T(t) = 1 − t^{−α} for t ≥ 1 and α > 0. Here, a_n = 0, b_n = n^{1/α} and
F_{M_∞}(y) = exp(−y^{−α}) for y > 0.
(c) F_T(t) = 1 − |t|^α for −1 ≤ t ≤ 0 and α > 0. Here, a_n = 0, b_n = n^{−1/α}
and F_{M_∞}(y) = exp(−|y|^α) for y ≤ 0.
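Part (a) is easy to check: for exponential T_i, P(M_n − log n ≤ y) = (1 − e^{−y}/n)^n → exp(−e^{−y}) exactly, and a simulation agrees. The sketch below (arbitrary seed and sizes) estimates the distribution function of M_n − log n at y = 0, where the Gumbel limit equals e^{−1}:

```python
import random, math

rng = random.Random(4)
n, trials = 2000, 3000

def centered_max():
    # M_n - log n for n i.i.d. standard exponentials (inverse transform sampling)
    m = max(-math.log(1.0 - rng.random()) for _ in range(n))
    return m - math.log(n)

frac = sum(centered_max() <= 0 for _ in range(trials)) / trials
print(frac, math.exp(-1.0))  # exact value at finite n is (1 - 1/n)**n
```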
(b) Let M_n = max_{1≤i≤n}{G_i} for i.i.d. standard normal random variables
{G_i}.
(c) Show that b_n √(2 log n) → 1 as n → ∞ and deduce that M_n/√(2 log n) →^p 1.
(d) More generally, suppose T_t = inf{x ≥ 0 : M_x ≥ t}, where x ↦ M_x
is some monotone non-decreasing family of random variables such that
M_0 = 0. Show that if e^{−t} T_t →^D T_∞ as t → ∞ with T_∞ having the
standard exponential distribution, then (M_x − log x) →^D M_∞ as x → ∞,
where F_{M_∞}(y) = exp(−e^{−y}).
Our next example is of a more combinatorial flavor.
Exercise 3.2.15 (The birthday problem). Suppose {X_i} are i.i.d. with each
X_i uniformly distributed on {1, …, n}. Let T_n = min{k : X_k = X_l for some l < k}
mark the first coincidence among the entries of the sequence X_1, X_2, …, so

P(T_n > r) = Π_{k=2}^r (1 − (k − 1)/n)

is the probability that among r items chosen uniformly and independently from
a set of n different objects, no two are the same (the name birthday problem
corresponds to n = 365 with the items interpreted as the birthdays for a group of
size r). Show that P(n^{−1/2} T_n > s) → exp(−s²/2) as n → ∞, for any fixed s ≥ 0.
Hint: Recall that −x − x² ≤ log(1 − x) ≤ −x for x ∈ [0, 1/2].
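The product formula makes the limit easy to check exactly; the sketch below evaluates P(T_n > r) at r ≈ s√n and compares it with exp(−s²/2):

```python
import math

def p_no_collision(n, r):
    # P(T_n > r) = prod_{k=2}^{r} (1 - (k - 1)/n)
    p = 1.0
    for k in range(2, r + 1):
        p *= 1.0 - (k - 1) / n
    return p

s = 1.0
for n in (10**3, 10**5, 10**7):
    r = int(s * math.sqrt(n))
    print(n, p_no_collision(n, r), math.exp(-s * s / 2))
```

The same function recovers the classical fact that with 23 people a shared birthday is more likely than not.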
The symmetric, simple random walk on the integers is the sequence of random
variables S_n = Σ_{k=1}^n ξ_k where ξ_k are i.i.d. such that P(ξ_k = 1) = P(ξ_k = −1) = 1/2.
From the clt we already know that n^{−1/2} S_n →^D G. The next exercise provides
the asymptotics of the first and last visits to zero by this random sequence, namely
R = inf{ℓ ≥ 1 : S_ℓ = 0} and L_n = sup{ℓ ≤ n : S_ℓ = 0}. Much more is known about
this random sequence (c.f. [Dur10, Section 4.3] or [Fel68, Chapter 3]).
Exercise 3.2.16. Let q_{n,r} = P(S_1 > 0, …, S_{n−1} > 0, S_n = r) and

p_{n,r} = P(S_n = r) = 2^{−n} C(n, k)    for k = (n + r)/2,

with C(n, k) the binomial coefficient.
(a) Counting paths of the walk, prove the discrete reflection principle that
P_x(R < n, S_n = y) = P_x(S_n = −y) = p_{n,x+y} for any positive integers
x, y, where P_x(·) denote probabilities for the walk starting at S_0 = x.
(b) Verify that q_{n,r} = (1/2)(p_{n−1,r−1} − p_{n−1,r+1}) for any n, r ≥ 1.
Hint: Paths of the walk contributing to q_{n,r} must have S_1 = 1. Hence,
use part (a) with x = 1 and y = r.
(c) Deduce that P(R > n) = p_{n−1,0} + p_{n−1,1} and that P(L_{2n} = 2k) =
p_{2k,0} p_{2n−2k,0} for k = 0, 1, …, n.
(d) Using Stirling's formula (that √(2πn)(n/e)^n/n! → 1 as n → ∞), show
that √(πn) P(R > 2n) → 1 and that (2n)^{−1} L_{2n} →^D X, where X has the
arc-sine probability density function f_X(x) = 1/(π√(x(1 − x))) on [0, 1].
(e) Let H_{2n} count the number of 1 ≤ k ≤ 2n such that S_k ≥ 0 and S_{k−1} ≥ 0.
Show that H_{2n} =^D L_{2n}, hence (2n)^{−1} H_{2n} →^D X.
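Part (c) gives the distribution of L_{2n} in closed form, so the arc-sine limit of part (d) can be checked without simulation; the sketch below compares P(L_{2n} ≤ 2⌊xn⌋) with the arc-sine distribution function (2/π) arcsin(√x):

```python
import math

def p0(m):
    # p_{2m,0} = P(S_{2m} = 0) = C(2m, m) * 2**(-2m)
    return math.comb(2 * m, m) * 0.25 ** m

def cdf_L(n, x):
    # P(L_{2n} <= 2*floor(x*n)) using part (c): P(L_{2n} = 2k) = p0(k) * p0(n - k)
    return sum(p0(k) * p0(n - k) for k in range(int(x * n) + 1))

n = 400
for x in (0.1, 0.5, 0.9):
    print(x, cdf_L(n, x), 2 / math.pi * math.asin(math.sqrt(x)))
```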
109
+
non-negative h
k Cb (R) be such that hk (x) I(,) (x) and hk (x) I(,] (x)
as k (c.f. Lemma 3.1.6 for a construction of such functions). We have by the
weak convergence of the laws when n , followed by monotone convergence as
k , that
Similarly, considering
that
h+
k ()
+
lim sup Pn ((, ]) lim Pn (h+
k ) = P (hk ) P ((, ]) = F () .
n
110
This holds for any h Cb (R), so by Proposition 3.2.18, we conclude that g(Xn )
g(X ).
Our next theorem collects several equivalent characterizations of weak convergence
of probability measures on (R, B). To this end we need the following definition.
Definition 3.2.20. For a subset A of a topological space S, we denote by ∂A the
boundary of A, that is, ∂A = Ā \ A° is the closed set of points in the closure of A
but not in the interior of A. For a measure μ on (S, B_S) we say that A ∈ B_S is a
μ-continuity set if μ(∂A) = 0.
Theorem 3.2.21 (portmanteau theorem). The following four statements are
equivalent for any probability measures μ_n, 1 ≤ n ≤ ∞ on (R, B).
(a) μ_n →^w μ_∞
(b) For every closed set F, one has lim sup_n μ_n(F) ≤ μ_∞(F)
(c) For every open set G, one has lim inf_n μ_n(G) ≥ μ_∞(G)
(d) For every μ_∞-continuity set A, one has lim_n μ_n(A) = μ_∞(A)
Remark. As shown in Subsection 3.5.1, this theorem holds with (R, B) replaced
by any metric space S and its Borel σ-algebra B_S.
For μ_n = P_{X_n} we get the formulation of the portmanteau theorem for random
variables X_n, 1 ≤ n ≤ ∞, where the following four statements are then equivalent
to X_n →^D X_∞:
(a) Eh(X_n) → Eh(X_∞) for each bounded continuous h
(b) For every closed set F one has lim sup_n P(X_n ∈ F) ≤ P(X_∞ ∈ F)
(c) For every open set G one has lim inf_n P(X_n ∈ G) ≥ P(X_∞ ∈ G)
(d) For every Borel set A such that P(X_∞ ∈ ∂A) = 0, one has
lim_n P(X_n ∈ A) = P(X_∞ ∈ A)
Proof. It suffices to show that (a) ⇒ (b) ⇒ (c) ⇒ (d) ⇒ (a), which we
shall establish in that order. To this end, with F_n(x) = μ_n((−∞, x]) denoting the
corresponding distribution functions, we replace μ_n →^w μ_∞ of (a) by the equivalent
condition F_n →^w F_∞ (see Proposition 3.2.18).
(a) ⇒ (b). Assuming F_n →^w F_∞, we have the random variables Y_n, 1 ≤ n ≤ ∞ of
Theorem 3.2.7, such that P_{Y_n} = μ_n and Y_n →^{a.s.} Y_∞. Since F is closed, the function
I_F is upper semi-continuous and bounded by one, so it follows that a.s.

lim sup_n I_F(Y_n) ≤ I_F(Y_∞),

hence, taking expectations and applying Fatou's lemma to 1 − I_F(Y_n), we get that
lim sup_n μ_n(F) = lim sup_n E[I_F(Y_n)] ≤ E[I_F(Y_∞)] = μ_∞(F),
as stated in (b).
implying that (c) holds. In an analogous manner we can show that (c) ⇒ (b), so
(b) and (c) are equivalent.
(c) ⇒ (d). Since (b) and (c) are equivalent, we assume now that both (b) and (c)
hold. Then, applying (c) for the open set G = A° and (b) for the closed set F = Ā,
we have that

(3.2.4)    μ_∞(A°) ≤ lim inf_n μ_n(A°) ≤ lim inf_n μ_n(A) ≤ lim sup_n μ_n(A) ≤ lim sup_n μ_n(Ā) ≤ μ_∞(Ā).

For a μ_∞-continuity set A the two extreme terms of (3.2.4) coincide with μ_∞(A), hence lim_n μ_n(A) = μ_∞(A), which is (d).
We turn to relate the weak convergence to the convergence point-wise of probability density functions. To this end, we first define a new concept of convergence
of measures, the convergence in total-variation.
Definition 3.2.22. The total variation norm of a finite signed measure ν on the
measurable space (S, F) is

‖ν‖_tv = sup{ν(h) : h ∈ mF, sup_{s∈S} |h(s)| ≤ 1}.

Remark. Note that ‖μ‖_tv = 1 for any probability measure μ (since μ(h) ≤
μ(|h|) ≤ ‖h‖_∞ μ(1) ≤ 1 for the functions h considered, with equality for h = 1). By
a similar reasoning, ‖ν − μ‖_tv ≤ 2 for any two probability measures ν, μ on (S, F).
Convergence in total-variation obviously implies weak convergence of the same
probability measures, but the converse fails, as demonstrated for example by ν_n =
δ_{1/n}, the probability measure on (R, B) assigning probability one to the point 1/n,
which converge weakly to ν_∞ = δ_0 (see Example 3.2.4), whereas ‖ν_n − ν_∞‖_tv = 2
for all n. The difference of course has to do with the non-uniformity of the weak
convergence with respect to the continuous function h.
To gain a better understanding of the convergence in total-variation, we consider
an important special case.
Proposition 3.2.23. Suppose P = f μ and Q = gμ for some measure μ on (S, F)
and f, g ∈ mF_+ such that μ(f) = μ(g) = 1. Then,

(3.2.5)    ‖P − Q‖_tv = ∫_S |f(s) − g(s)| dμ(s).

with equality when h(s) = sgn(f(s) − g(s)) (see Proposition 1.3.56 for the left-most
identity and note that f h and gh are in L¹(S, F, μ)). Consequently, ‖P − Q‖_tv =
sup{μ(f h) − μ(gh) : h as above} = μ(|f − g|), as claimed.
For ν_n = f_n μ, we thus have that ‖ν_n − ν_∞‖_tv = μ(|f_n − f_∞|), so the convergence in
total-variation is equivalent to f_n → f_∞ in L¹(S, F, μ). Since f_n ≥ 0 and μ(f_n) = 1
for any n ≤ ∞, it follows from Scheffé's lemma (see Lemma 1.3.35) that the latter
convergence is a consequence of f_n(s) → f_∞(s) for μ-a.e. s ∈ S.
Two specific instances of Proposition 3.2.23 are of particular value in applications.
Example 3.2.24. Let ν_n = P_{X_n} denote the laws of random variables X_n that
have probability density functions f_n, n = 1, 2, …, ∞. Recall Exercise 1.3.66 that
then ν_n = f_n λ for Lebesgue's measure λ on (R, B). Hence, by the preceding proposition, the convergence point-wise of f_n(x) to f_∞(x) implies the convergence in
total-variation of P_{X_n} to P_{X_∞}, and in particular implies that X_n →^D X_∞.
Example 3.2.25. Similarly, if X_n are integer valued for n = 1, 2, …, then ν_n =
f_n λ̃ for f_n(k) = P(X_n = k) and the counting measure λ̃ on (Z, 2^Z) such that
λ̃({k}) = 1 for each k ∈ Z. So, by the preceding proposition, the point-wise convergence of Exercise 3.2.3 is not only necessary and sufficient for weak convergence
but also for convergence in total-variation of the laws of X_n to that of X_∞.
but also for convergence in total-variation of the laws of Xn to that of X .
In the next exercise, you are to rephrase Example 3.2.25 in terms of the topological
space of all probability measures on Z.
Exercise 3.2.26. Show that d(, ) = k ktv is a metric on the collection of all
probability measures on Z, and that in this space the convergence in total variation
is equivalent to the weak convergence which in turn is equivalent to the point-wise
convergence at each x Z.
Hence, under the framework of Example 3.2.25, the Glivenko-Cantelli theorem
tells us that the empirical measures of integer valued i.i.d. R.V.-s {Xi } converge in
total-variation to the true law of X1 .
Here is an example from statistics that corresponds to the framework of Example
3.2.24.
Exercise 3.2.27. Let V_{n+1} denote the central value on a list of 2n + 1 values (that
is, the (n + 1)-th largest value on the list). Suppose the list consists of mutually
independent R.V., each chosen uniformly in [0, 1).
(a) Show that V_{n+1} has probability density function (2n + 1) C(2n, n) v^n (1 − v)^n
at each v ∈ [0, 1).
(b) Verify that the density f_n(v) of V̂_n = √(2n)(2V_{n+1} − 1) is of the form
f_n(v) = c_n(1 − v²/(2n))^n for some normalization constant c_n.
(X_1 + X_2)/√2 for X_1 and X_2 i.i.d. of law ν each. Explain why the clt implies that
T^m ν →^w γ as m → ∞, for any ν ∈ M. Show that Tγ = γ (see Lemma 3.1.1), and
explain why γ is the unique, globally attracting fixed point of T in M.
Your next exercise is the basis behind the celebrated method of moments for weak
convergence.
Exercise 3.2.29. Suppose that X and Y are [0, 1]-valued random variables such
that E(X^n) = E(Y^n) for n = 0, 1, 2, ….
(a) Show that Ep(X) = Ep(Y) for any polynomial p(·).
(b) Show that Eh(X) = Eh(Y) for any continuous function h : [0, 1] → R
Remark. As most texts use in the context of Definition 3.2.32 tight (or tight
sequence) instead of uniformly tight, we shall adopt the same convention here.
Uniform tightness of distribution functions has some structural resemblance to the
U.I. condition (1.3.11). As such we have the following simple sufficient condition
for uniform tightness (which is the analog of Exercise 1.3.54).
Exercise 3.2.33. A sequence of probability measures ν_n on (R, B) is uniformly
tight if sup_n ν_n(f(|x|)) is finite for some non-negative Borel function f such that
f(r) → ∞ as r → ∞. Alternatively, if sup_n Ef(|X_n|) < ∞ then the distribution
functions F_{X_n} form a tight sequence.
The importance of uniform tightness is that it guarantees the existence of limit
points for weak convergence.
Theorem 3.2.34 (Prohorov theorem). A collection Γ of probability measures
on a complete, separable metric space S equipped with its Borel σ-algebra B_S, is
uniformly tight if and only if for any sequence ν_m ∈ Γ there exists a subsequence
ν_{m_k} that converges weakly to some probability measure ν_∞ on (S, B_S) (where ν_∞ is
not necessarily in Γ and may depend on the subsequence ν_{m_k}).
Remark. For a proof of Prohorov's theorem, which is beyond the scope of these
notes, see [Dud89, Theorem 11.5.4].
Instead of Prohorov's theorem, we prove here a bare-hands substitute for the
special case S = R. When doing so, it is convenient to have the following notion of
convergence of distribution functions.
Definition 3.2.35. When a sequence F_n of distribution functions converges to a
right continuous, non-decreasing function F_∞ at all continuity points of F_∞, we
say that F_n converges vaguely to F_∞, denoted F_n →^v F_∞.
In contrast with weak convergence, the vague convergence allows for the limit
F_∞(x) = μ((−∞, x]) to correspond to a measure μ such that μ(R) < 1.
115
Deferring the proof of Hellys theorem to the end of this section, uniform tightness
is exactly what prevents probability mass from escaping to , thus assuring the
existence of limit points for weak convergence.
Lemma 3.2.38. The sequence of distribution functions {Fn } is uniformly tight if
and only if each vague limit point of this sequence is a distribution function. That
v
is, if and only if when Fnk F , necessarily 1 F (x) + F (x) 0 as x .
v
Proof. Suppose first that {Fn } is uniformly tight and Fnk F . Fixing > 0,
there exist r1 < M and r2 > M that are both continuity points of F . Then, by
the definition of vague convergence and the monotonicity of Fn ,
1 F (r2 ) + F (r1 ) = lim (1 Fnk (r2 ) + Fnk (r1 ))
k
It follows that lim supx (1 F (x) + F (x)) and since > 0 is arbitrarily
small, F must be a distribution function of some probability measure on (R, B).
Conversely, suppose {Fn } is not uniformly tight, in which case by Definition 3.2.32,
for some > 0 and nk
(3.2.6)
for all k.
Considering now r = min(r1 , r2 ) , this shows that inf r (1F (r)+F (r)) ,
hence the vague limit point F cannot be a distribution function of a probability
measure on (R, B).
Remark. Comparing Definitions 3.2.31 and 3.2.32 we see that if a collection Γ
of probability measures on (R, B) is uniformly tight, then for any sequence ν_m ∈ Γ
the corresponding sequence F_m of distribution functions is uniformly tight. In view
of Lemma 3.2.38 and Helly's theorem, this implies the existence of a subsequence
m_k and a distribution function F_∞ such that F_{m_k} →^w F_∞. By Proposition 3.2.18
we deduce that ν_{m_k} →^w ν_∞, a probability measure on (R, B), thus proving the only
direction of Prohorov's theorem that we ever use.
Proof of Theorem 3.2.37. Fix a sequence of distribution functions F_n. The
key to the proof is to observe that there exists a sub-sequence n_k and a non-decreasing function H : Q → [0, 1] such that F_{n_k}(q) → H(q) for any q ∈ Q.
This is done by a standard analysis argument called the principle of diagonal
selection. That is, let q_1, q_2, …, be an enumeration of the set Q of all rational
numbers. There exists then a limit point H(q_1) of the sequence F_n(q_1) ∈ [0, 1],
that is, a sub-sequence n_k^{(1)} along which F_{n_k^{(1)}}(q_1) → H(q_1). Since F_{n_k^{(1)}}(q_2) ∈ [0, 1],
there is a further sub-sequence n_k^{(2)} of n_k^{(1)} along which F_{n_k^{(2)}}(q_2) → H(q_2), and
proceeding in this manner we get for each i a sub-sequence n_k^{(i)} of n_k^{(i−1)} such
that F_{n_k^{(i)}}(q_j) → H(q_j) for j = 1, …, i. The diagonal sub-sequence n_k = n_k^{(k)} is
eventually a sub-sequence of each n_k^{(j)}, hence F_{n_k}(q_j) → H(q_j) for all j.
Setting F_∞(x) = inf{H(q) : q ∈ Q, q > x}, a non-decreasing, right continuous function, fix a continuity point x of F_∞ and ε > 0, and choose rationals r_1 < x < r_2 such that

(3.2.7)    F_∞(x) − ε < H(r_1) ≤ H(r_2) < F_∞(x) + ε.

Recall that F_{n_k}(x) ∈ [F_{n_k}(r_1), F_{n_k}(r_2)] and F_{n_k}(r_i) → H(r_i) as k → ∞, for i = 1, 2.
Thus, by (3.2.7), for all k large enough,

F_∞(x) − ε < F_{n_k}(r_1) ≤ F_{n_k}(x) ≤ F_{n_k}(r_2) < F_∞(x) + ε,

which since ε > 0 is arbitrary implies F_{n_k}(x) → F_∞(x) as k → ∞.
Hint: If |EX_{n_l}| → ∞ then sup_l Var(X_{n_l}) < ∞ yields X_{n_l}/EX_{n_l} →^p 1, whereas the
uniform tightness of {F_{X_{n_l}}} implies that X_{n_l}/EX_{n_l} →^p 0.
Using Lemma 3.2.38 and Helly's theorem, you next explore the possibility of establishing weak convergence for non-negative random variables out of the convergence
of the corresponding Laplace transforms.
Exercise 3.2.40.
(a) Based on Exercise 3.2.29 show that if Z ≥ 0 and W ≥ 0 are such that
E(e^{−sZ}) = E(e^{−sW}) for each s > 0, then Z =^D W.
(b) Further, show that for any Z ≥ 0, the function L_Z(s) = E(e^{−sZ}) is
infinitely differentiable at all s > 0 and for any positive integer k,

E[Z^k] = (−1)^k lim_{s↓0} (d^k/ds^k) L_Z(s).
where θ ∈ R and obviously both cos(θX) and sin(θX) are integrable R.V.-s.
We also denote by Φ_ν(θ) the characteristic function associated with a probability
measure ν on (R, B). That is, Φ_ν(θ) = ν(e^{iθx}) is the characteristic function of a
R.V. X whose law P_X is ν.
Here are some of the properties of characteristic functions, where the complex
conjugate x − iy of z = x + iy ∈ C is denoted throughout by z̄ and the modulus of
z = x + iy is |z| = √(x² + y²).
Proposition 3.3.2. Let X be a R.V. and Φ_X its characteristic function, then
(a) Φ_X(0) = 1
(b) Φ_X(−θ) = Φ̄_X(θ)
(c) |Φ_X(θ)| ≤ 1
(d) θ ↦ Φ_X(θ) is a uniformly continuous function on R
(e) Φ_{aX+b}(θ) = e^{iθb} Φ_X(aθ)
118
Proof. For (a), X (0) = E[ei0X ] = E[1] = 1. For (b), note that
X () = E cos(X) + iE sin(X)
= E cos(X) iE sin(X) = X () .
p
For (c), note that the function |z| = x2 + y 2 : R2 7 R is convex, hence by
Jensens inequality (c.f. Exercise 1.3.20),
|X ()| = |EeiX | E|eiX | = 1
(3.3.1)
from which we deduce that |k,h (x)| 2|x|k for all and h 6= 0, and further that
|k,h (x)| 0 as h 0. Thus, for k = 1, . . . , n we have by dominated convergence
(and Jensens inequality for the modulus function) that
|Ek,h (X)| E|k,h (X)| 0
for
h 0.
The converse of Lemma 3.3.3 does not hold. That is, there exist random variables
X with E|X| = ∞ for which Φ_X(θ) is differentiable at θ = 0 (c.f. Exercise 3.3.24).
However, as we see next, the existence of a finite second derivative of Φ_X(θ) at
θ = 0 implies that EX² < ∞.
Lemma 3.3.4. If lim inf_{θ↓0} θ^{−2}(2Φ_X(0) − Φ_X(θ) − Φ_X(−θ)) < ∞, then EX² < ∞.
Proof. Note that θ^{−2}(2Φ_X(0) − Φ_X(θ) − Φ_X(−θ)) = Eg_θ(X), where

g_θ(x) = θ^{−2}(2 − e^{iθx} − e^{−iθx}) = 2θ^{−2}[1 − cos(θx)] → x²    for θ → 0.
The same type of explicit formula applies to any discrete valued R.V. For example,
if N has the Poisson distribution of parameter λ then

(3.3.3)    Φ_N(θ) = E[e^{iθN}] = Σ_{k=0}^∞ e^{iθk} (λ^k/k!) e^{−λ} = exp(λ(e^{iθ} − 1)).

The characteristic function has an explicit form also when the R.V. X has a
probability density function f_X as in Definition 1.2.40. Indeed, then by Corollary
1.3.62 we have that

(3.3.4)    Φ_X(θ) = ∫_R e^{iθx} f_X(x) dx,

which is merely the Fourier transform of the density f_X (and is well defined since
cos(θx)f_X(x) and sin(θx)f_X(x) are both integrable with respect to Lebesgue's measure).
Example 3.3.6. If G has the N(μ, v) distribution, namely, the probability density
function f_G(y) is given by (3.1.1), then its characteristic function is

Φ_G(θ) = e^{iθμ − vθ²/2}.

Indeed, considering part (e) of Proposition 3.3.2 for
a = √v and b = μ, it suffices to show that Φ_X(θ) = e^{−θ²/2} for X of standard normal distribution. To this end, as X is
integrable, we have from Lemma 3.3.3 that

Φ_X′(θ) = E(iXe^{iθX}) = −∫_R x sin(θx) f_X(x) dx

(since x cos(θx)f_X(x) is an integrable odd function, whose integral is thus zero).
Integrating by parts, since xf_X(x) = −f_X′(x) for the standard normal density, we get that
Φ_X′(θ) = −θ ∫_R cos(θx) f_X(x) dx = −θΦ_X(θ)
(since sin(θx)f_X(x) is an integrable odd function). We know that Φ_X(0) = 1 and
since φ(θ) = e^{−θ²/2} is the unique solution of the ordinary differential equation
φ′(θ) = −θφ(θ) with φ(0) = 1, it follows that Φ_X(θ) = e^{−θ²/2}, as claimed.
Example 3.3.7. For U of uniform distribution on [a, b], by (3.3.4),

Φ_U(θ) = (e^{iθb} − e^{iθa}) / (iθ(b − a))

(recall that ∫_a^b e^{zx} dx = (e^{zb} − e^{za})/z for any z ∈ C). For a = −b the characteristic
function simplifies to sin(θb)/(θb). Or, in case b = 1 and a = 0 we have Φ_U(θ) =
(e^{iθ} − 1)/(iθ) for the random variable U of Example 1.1.26.
For a = 0 and z = iθ − λ, λ > 0, the same integration identity applies also
when b → ∞ (since the real part of z is negative). Consequently, by (3.3.4), the
exponential distribution of parameter λ > 0 whose density is f_T(t) = λe^{−λt}1_{t>0}
(see Example 1.3.68), has the characteristic function Φ_T(θ) = λ/(λ − iθ).
Finally, for the density f_S(s) = 0.5e^{−|s|} it is not hard to check that Φ_S(θ) =
0.5/(1 − iθ) + 0.5/(1 + iθ) = 1/(1 + θ²) (just break the integration over s ∈ R in
(3.3.4) according to the sign of s).
We next express the characteristic function of the sum of independent random
variables in terms of the characteristic functions of the summands. This relation
makes the characteristic function a useful tool for proving weak convergence statements involving sums of independent variables.
Lemma 3.3.8. If X and Y are two independent random variables, then

Φ_{X+Y}(θ) = Φ_X(θ)Φ_Y(θ).

Proof. By the definition of the characteristic function,

Φ_{X+Y}(θ) = Ee^{iθ(X+Y)} = E[e^{iθX} e^{iθY}] = E[e^{iθX}] E[e^{iθY}],

where the right-most equality is obtained by the independence of X and Y (i.e.
applying (1.4.12) for the integrable f(x) = g(x) = e^{iθx}). Observing that the right-most expression is Φ_X(θ)Φ_Y(θ) completes the proof.
Here are three simple applications of this lemma.
Example 3.3.9. If X and Y are independent and uniform on (−1/2, 1/2) then
by Corollary 1.4.33 the random variable W = X + Y has the triangular density,
f_W(x) = (1 − |x|)1_{|x|≤1}. Thus, by Example 3.3.7, Lemma 3.3.8, and the trigonometric identity cos θ = 1 − 2 sin²(θ/2) we have that its characteristic function is

Φ_W(θ) = [Φ_X(θ)]² = [2 sin(θ/2)/θ]² = 2(1 − cos θ)/θ².
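The closed form can be verified by direct numerical integration of (3.3.4) for the triangular density; the sketch below uses a simple midpoint rule (by symmetry only the cosine part survives):

```python
import math

def phi_triangular(theta, m=20000):
    # Phi(theta) = int_{-1}^{1} e^{i theta x} (1 - |x|) dx
    #            = 2 * int_0^1 cos(theta x)(1 - x) dx, via midpoint rule
    h = 1.0 / m
    return 2 * h * sum(math.cos(theta * (j + 0.5) * h) * (1 - (j + 0.5) * h)
                       for j in range(m))

theta = 3.0
closed = (2 * math.sin(theta / 2) / theta) ** 2
print(phi_triangular(theta), closed)  # the two values agree
```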
Exercise 3.3.10. Let X, X̃ be i.i.d. random variables.
(a) Show that the characteristic function of Z = X − X̃ is a non-negative,
real-valued function.
(b) Prove that there do not exist a < b and i.i.d. random variables X, X̃
such that X − X̃ is the uniform random variable on (a, b).
In the next exercise you construct a random variable X whose law has no atoms
while its characteristic function does not converge to zero for θ → ∞.
Exercise 3.3.11. Let X = 2 Σ_{k=1}^∞ 3^{−k} B_k for {B_k} i.i.d. Bernoulli random variables such that P(B_k = 1) = P(B_k = 0) = 1/2.
We conclude with an application of characteristic functions for proving an interesting identity in law.
Exercise 3.3.12. For integer 1 ≤ n ≤ ∞ and i.i.d. T_k, k = 1, 2, …, each of which
has the standard exponential distribution, let S_n := Σ_{k=1}^n k^{−1}(T_k − 1).
Theorem 3.3.13 (Lévy's inversion theorem). Suppose Φ_X is the characteristic function of a random variable X whose distribution function is F_X. For a < b and T > 0 let

(3.3.5)    Ψ_{a,b}(θ) = (1/2π) ∫_a^b e^{−iθu} du    and    J_T(a, b) = ∫_{−T}^T Ψ_{a,b}(θ) Φ_X(θ) dθ.

Then,

(3.3.6)    lim_{T→∞} J_T(a, b) = (1/2)[F_X(b) + F_X(b^−)] − (1/2)[F_X(a) + F_X(a^−)].

Furthermore, if ∫_R |Φ_X(θ)| dθ < ∞, then X has the bounded continuous probability
density function

(3.3.7)    f_X(x) = (1/2π) ∫_R e^{−iθx} Φ_X(θ) dθ.
Example 3.3.14. The Cauchy density is f_X(x) = 1/[π(1 + x²)]. Recall Example
3.3.7 that the density f_S(s) = 0.5e^{−|s|} has the positive, integrable characteristic
function 1/(1 + θ²). Thus, by (3.3.7),

0.5e^{−|s|} = (1/2π) ∫_R e^{−its} (1/(1 + t²)) dt.

Multiplying both sides by two, then changing t to x and s to −θ, we get (3.3.4) for
the Cauchy density, resulting with its characteristic function Φ_X(θ) = e^{−|θ|}.
When using characteristic functions for proving limit theorems we do not need
the explicit formulas of Lévy's inversion theorem, but rather only the fact that the
characteristic function determines the law, that is:
Corollary 3.3.15. If the characteristic functions of two random variables X and
Y are the same, that is Φ_X(θ) = Φ_Y(θ) for all θ, then X =^D Y.
Remark. While the real-valued moment generating function M_X(s) = E[e^{sX}] is
perhaps a simpler object than the characteristic function, it has a somewhat limited
scope of applicability. For example, the law of a random variable X is uniquely
determined by M_X(·) provided M_X(s) is finite for all s ∈ [−δ, δ], some δ > 0 (c.f.
[Bil95, Theorem 30.1]). More generally, assuming all moments of X are finite, the
Hamburger moment problem is about uniquely determining the law of X from a
given sequence of moments EX^k. You saw in Exercise 3.2.29 that this is always
possible when X has bounded support, but unfortunately, this is not always the case
when X has unbounded support. For more on this issue, see [Dur10, Subsection
3.3.5].
Proof of Corollary 3.3.15. Since Φ_X = Φ_Y, comparing the right side of
(3.3.6) for X and Y shows that

[F_X(b) + F_X(b^−)] − [F_X(a) + F_X(a^−)] = [F_Y(b) + F_Y(b^−)] − [F_Y(a) + F_Y(a^−)].

As F_X is a distribution function, both F_X(a) → 0 and F_X(a^−) → 0 when a → −∞.
For this reason also F_Y(a) → 0 and F_Y(a^−) → 0. Consequently,

F_X(b) + F_X(b^−) = F_Y(b) + F_Y(b^−)    for all b ∈ R.

In particular, F_X(b) = F_Y(b) at any point b where both F_X and F_Y are continuous; since such points are dense and distribution functions are right continuous, it follows that F_X = F_Y, that is, X =^D Y.
Remark. In Lemma 3.1.1, it was shown directly that the sum of independent
random variables of normal distributions N(μ_k, v_k) has the normal distribution
N(μ, v) where μ = Σ_k μ_k and v = Σ_k v_k. The proof easily reduces to dealing
with two independent random variables, X of distribution N(μ_1, v_1) and Y of
distribution N(μ_2, v_2), and showing that X + Y has the normal distribution N(μ_1 +
μ_2, v_1 + v_2). Here is an easy proof of this result via characteristic functions. First, by
the independence of X and Y (see Lemma 3.3.8), and their normality (see Example
3.3.6),

Φ_{X+Y}(θ) = Φ_X(θ)Φ_Y(θ) = exp(iμ_1θ − v_1θ²/2) exp(iμ_2θ − v_2θ²/2)
           = exp(i(μ_1 + μ_2)θ − (v_1 + v_2)θ²/2).

We recognize this expression as the characteristic function corresponding to the
N(μ_1 + μ_2, v_1 + v_2) distribution, which by Corollary 3.3.15 must indeed be the
distribution of X + Y.
Proof of Lévy's inversion theorem. Consider the product ν = P_X × λ_T of the law
P_X of X, which is a probability measure on R, and Lebesgue's measure λ_T of [−T, T],
noting that ν is a finite measure on R × [−T, T] of total mass 2T.
Fixing a < b ∈ R let h_{a,b}(x, θ) = Ψ_{a,b}(θ)e^{iθx}, where by (3.3.5) and Jensen's
inequality for the modulus function (and the uniform measure on [a, b]),

|h_{a,b}(x, θ)| = |Ψ_{a,b}(θ)| ≤ (1/2π) ∫_a^b |e^{−iθu}| du = (b − a)/(2π).

Consequently, ∫ |h_{a,b}| dν < ∞, and applying Fubini's theorem, we conclude that

J_T(a, b) := ∫_{−T}^T Ψ_{a,b}(θ) [∫_R e^{iθx} dP_X(x)] dθ = ∫_{−T}^T [∫_R h_{a,b}(x, θ) dP_X(x)] dθ
           = ∫_R [∫_{−T}^T h_{a,b}(x, θ) dθ] dP_X(x).

Since h_{a,b}(x, θ) is the difference between the function e^{iθu}/(i2πθ) at u = x − a and
the same function at u = x − b, it follows that

∫_{−T}^T h_{a,b}(x, θ) dθ = R(x − a, T) − R(x − b, T).

Further, as the cosine function is even and the sine function is odd,

R(u, T) = ∫_{−T}^T (e^{iθu}/(i2πθ)) dθ = (1/π) ∫_0^T (sin(θu)/θ) dθ = (sgn(u)/π) S(|u|T),

with S(r) = ∫_0^r x^{−1} sin x dx for r > 0.
Even though the Lebesgue integral ∫_0^∞ x^{−1} sin x dx does not exist, because both
the integral of the positive part and the integral of the negative part are infinite,
we still have that S(r) is uniformly bounded on (0, ∞) and lim_{r→∞} S(r) = π/2.
Consequently,

lim_{T→∞} [R(x − a, T) − R(x − b, T)] = g_{a,b}(x) = { 0 if x < a or x > b;  1/2 if x = a or x = b;  1 if a < x < b }.
124
1
1
= PX ({a}) + PX ((a, b)) + PX ({b}) .
2
2
With PX ({a}) = FX (a) FX (a ), PX ((a, b)) = FX (b ) FX (a) and PX ({b}) =
FX (b) FX (b ), we arrive at the assertion (3.3.6).
Suppose now that $\int_{\mathbb{R}} |\Phi_X(\theta)|\,d\theta = C < \infty$. This implies that both the real and the
imaginary parts of $e^{-i\theta x}\Phi_X(\theta)$ are integrable with respect to Lebesgue's measure on
$\mathbb{R}$, hence $f_X(x)$ of (3.3.7) is well defined. Further, $|f_X(x)| \le C/(2\pi)$ is uniformly bounded
and by dominated convergence with respect to Lebesgue's measure on $\mathbb{R}$,
$$\lim_{h\to 0} |f_X(x+h) - f_X(x)| \le \lim_{h\to 0}\frac{1}{2\pi}\int_{\mathbb{R}} |e^{-i\theta x}|\,|\Phi_X(\theta)|\,|e^{-i\theta h} - 1|\,d\theta = 0\,,$$
implying that $f_X(\cdot)$ is also continuous. Turning to prove that $f_X(\cdot)$ is the density
of $X$, note that
$$|\Psi_{a,b}(\theta)\,\Phi_X(\theta)| \le \frac{b-a}{2\pi}\,|\Phi_X(\theta)|\,,$$
so by dominated convergence we have that
$$(3.3.8)\qquad \lim_{T\to\infty} J_T(a,b) = J_\infty(a,b) = \int_{\mathbb{R}}\Psi_{a,b}(\theta)\,\Phi_X(\theta)\,d\theta\,.$$
Further, in view of (3.3.5), upon applying Fubini's theorem for the integrable function $e^{-i\theta u}I_{[a,b]}(u)\Phi_X(\theta)$ with respect to Lebesgue's measure on $\mathbb{R}^2$, we see that
$$J_\infty(a,b) = \frac{1}{2\pi}\int_{\mathbb{R}}\Big[\int_a^b e^{-i\theta u}\,du\Big]\,\Phi_X(\theta)\,d\theta = \int_a^b f_X(u)\,du\,,$$
for all $a < b$. This shows that necessarily $f_X(x)$ is a non-negative real-valued
function, which is the density of $X$. $\square$
Exercise 3.3.16. Integrating $z^{-1}e^{iz}$ around the contour formed by the upper semi-circles
of radii $\epsilon$ and $r$ and the intervals $[-r, -\epsilon]$ and $[\epsilon, r]$, deduce that
$S(r) = \int_0^r x^{-1}\sin x\,dx$ is uniformly bounded on $(0,\infty)$ with $S(r) \to \pi/2$ as $r \to \infty$.
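A numerical illustration of this exercise (a sketch, not a proof; the step counts are arbitrary discretization choices): the partial integrals $S(r)$ stay bounded and approach $\pi/2$.

```python
import math

# Midpoint-rule evaluation of S(r) = int_0^r sin(x)/x dx, which avoids
# evaluating the integrand at x = 0 where it is defined by continuity.
def S(r, steps=400_000):
    dx = r / steps
    return sum(math.sin(k * dx + dx / 2) / (k * dx + dx / 2)
               for k in range(steps)) * dx

for r in (1.0, 10.0, 1000.0):
    print(r, S(r))   # S(r) oscillates around pi/2 with amplitude ~ 1/r
```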
Our strategy for handling the clt and similar limit results is to establish the
convergence of characteristic functions and deduce from it the corresponding convergence in distribution. One ingredient for this is of course the fact that the
characteristic function uniquely determines the corresponding law. Our next result
provides an important second ingredient, that is, an explicit sufficient condition for
uniform tightness in terms of the limit of the characteristic functions.
With $J(x) := \int_{-r}^{r}(1 - e^{i\theta x})\,d\theta = 2r - \frac{2\sin(rx)}{x}$ and $|\sin(rx)| \le 1$, we have the lower bound
$$(3.3.10)\qquad J(x) \ge \max\Big(2r - \frac{2}{|x|}\,,\,0\Big) \ge r\,I_{\{|x| > 2/r\}}\,.$$
Now, applying Fubini's theorem for the function $1 - e^{i\theta x}$, whose modulus is bounded
by 2, and the product of the probability measure $\nu$ and Lebesgue's measure on
$[-r, r]$, which is a finite measure of total mass $2r$, we get the identity
$$\int_{-r}^{r}(1 - \Phi_\nu(\theta))\,d\theta = \int_{-r}^{r}\Big[\int_{\mathbb{R}}(1 - e^{i\theta x})\,d\nu(x)\Big]\,d\theta = \int_{\mathbb{R}} J(x)\,d\nu(x)\,.$$
Thus, the lower bound (3.3.10) and monotonicity of the integral imply that
$$\frac{1}{r}\int_{-r}^{r}(1 - \Phi_\nu(\theta))\,d\theta = \frac{1}{r}\int_{\mathbb{R}} J(x)\,d\nu(x) \ge \int_{\mathbb{R}} I_{\{|x|>2/r\}}\,d\nu(x) = \nu([-2/r, 2/r]^c)\,,$$
and hence also
$$\nu([-2/r, 2/r]^c) \le \frac{1}{r}\int_{-r}^{r}|1 - \Phi_\nu(\theta)|\,d\theta\,,$$
which in view of (3.3.9) results with the stated uniform tightness.
Building upon Corollary 3.3.15 and Lemma 3.3.17 we can finally relate the pointwise convergence of characteristic functions to the weak convergence of the corresponding measures.
Theorem 3.3.18 (Lévy's continuity theorem). Let $\nu_n$, $1 \le n \le \infty$, be probability measures on $(\mathbb{R}, \mathcal{B})$.
(a) If $\nu_n \stackrel{w}{\Rightarrow} \nu_\infty$, then $\Phi_{\nu_n}(\theta) \to \Phi_{\nu_\infty}(\theta)$ for each $\theta \in \mathbb{R}$.
(b) Conversely, if $\Phi_{\nu_n}(\theta) \to \Phi_{\nu_\infty}(\theta)$ as $n \to \infty$ for each $\theta \in \mathbb{R}$, then $\nu_n \stackrel{w}{\Rightarrow} \nu_\infty$.
Proof. For part (a), since both $x \mapsto \cos(\theta x)$ and $x \mapsto \sin(\theta x)$ are bounded
continuous functions, the assumed weak convergence of $\nu_n$ to $\nu_\infty$ implies that
$\Phi_{\nu_n}(\theta) = \nu_n(e^{i\theta x}) \to \nu_\infty(e^{i\theta x}) = \Phi_{\nu_\infty}(\theta)$ (c.f. Definition 3.2.17).
Turning to deal with part (b), recall that by Lemma 3.3.17 we know that the
collection $\Gamma = \{\nu_n\}$ is uniformly tight. Hence, by Prohorov's theorem (see the
remark preceding the proof of Lemma 3.2.38), for every subsequence $n(m)$ there is a
further sub-subsequence $n(m_k)$ such that $\nu_{n(m_k)}$ converges weakly to some probability measure
$\hat\nu$. Though in general $\hat\nu$ might depend on the specific choice of $n(m)$, we deduce
from part (a) of the theorem that necessarily $\Phi_{\hat\nu} = \Phi_{\nu_\infty}$. Since the characteristic
function uniquely determines the law (see Corollary 3.3.15), here the same limit
$\hat\nu = \nu_\infty$ applies for all choices of $n(m)$. In particular, fixing $h \in C_b(\mathbb{R})$, the sequence
$y_n = \nu_n(h)$ is such that every subsequence $y_{n(m)}$ has a further sub-subsequence
$y_{n(m_k)}$ that converges to $y = \nu_\infty(h)$. Consequently, $y_n = \nu_n(h) \to y = \nu_\infty(h)$ (see
Lemma 2.2.11), and since this applies for all $h \in C_b(\mathbb{R})$, we conclude that $\nu_n \stackrel{w}{\Rightarrow} \nu_\infty$. $\square$
Here is a direct consequence of Lévy's continuity theorem.

(a) Set $Y_k = X_k - \widetilde X_k$ and show that $n^{-1/2}\sum_{k=1}^{n} Y_k \stackrel{D}{\longrightarrow} Z - \widetilde Z$, with $Z$ and $\widetilde Z$ i.i.d.
(b) Let $U_k = Y_k I_{\{|Y_k|\le b\}}$ and $V_k = Y_k I_{\{|Y_k|>b\}}$. Show that for any $u < \infty$ and
all $n$,
$$P\Big(\sum_{k=1}^{n} Y_k \le u\sqrt{n}\Big) \ge P\Big(\sum_{k=1}^{n} U_k \le u\sqrt{n}\,,\ \sum_{k=1}^{n} V_k \le 0\Big) \ge \frac12\,P\Big(\sum_{k=1}^{n} U_k \le u\sqrt{n}\Big)\,.$$
(c) Apply the Portmanteau theorem and the clt for the bounded i.i.d. $\{U_k\}$
to get that for any $u, b < \infty$,
$$P(Z - \widetilde Z \le u) \ge \frac12\,P\big(G \le u/\sqrt{EU_1^2}\,\big)\,.$$
Considering the limit $b \to \infty$ followed by $u \to -\infty$ deduce that $EY_1^2 < \infty$.
(d) Conclude that if $n^{-1/2}\sum_{k=1}^{n} X_k \stackrel{D}{\longrightarrow} Z$, then necessarily $EX_1^2 < \infty$.
Remark. The trick of replacing $X_k$ by the variables $Y_k = X_k - \widetilde X_k$, whose law is
symmetric (i.e. $Y_k \stackrel{D}{=} -Y_k$), is very useful in many problems. It is often called the
symmetrization trick.
Exercise 3.3.21. Provide an example of a random variable $X$ with a bounded
probability density function but for which $\int_{\mathbb{R}} |\Phi_X(\theta)|\,d\theta = \infty$, and another example
of a random variable $X$ whose characteristic function $\Phi_X(\theta)$ is not differentiable at
$\theta = 0$.
As you find out next, Lévy's inversion theorem can help when computing densities.
Exercise 3.3.22. Suppose the random variables $U_k$ are i.i.d., where the law of each
$U_k$ is the uniform probability measure on $(-1, 1)$. Considering Example 3.3.7, show
that for each $n \ge 2$, the probability density function of $S_n = \sum_{k=1}^{n} U_k$ is
$$f_{S_n}(s) = \frac{1}{\pi}\int_0^\infty \cos(\theta s)\,\big(\sin\theta/\theta\big)^n\,d\theta\,,$$
and deduce that $\int_0^\infty \cos(\theta s)(\sin\theta/\theta)^n\,d\theta = 0$ for all $s > n \ge 2$.
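As a numerical sanity check of this formula (a sketch assuming numpy is available; the integration cutoff and step count are arbitrary discretization choices), take $n = 2$, where $S_2$ has the triangular density $(2-|s|)/4$ on $(-2,2)$ and zero density for $|s| > 2$:

```python
import numpy as np

# Approximate f_{S_n}(s) = (1/pi) * int_0^inf cos(theta*s) (sin theta/theta)^n dtheta
# for n = 2 by a trapezoidal rule on a truncated range.
def f_S2(s, cutoff=2000.0, steps=2_000_000):
    theta = np.linspace(1e-9, cutoff, steps)
    y = np.cos(theta * s) * (np.sin(theta) / theta) ** 2
    dx = theta[1] - theta[0]
    return (y.sum() - 0.5 * (y[0] + y[-1])) * dx / np.pi

print(f_S2(0.0))   # triangular density at 0 equals 1/2
print(f_S2(3.0))   # vanishes, since 3 > n = 2
```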
We next relate differentiability of $\Phi_X(\theta)$ with the weak law of large numbers and
show that it does not imply that $E|X|$ is finite.
Exercise 3.3.24. Let $S_n = \sum_{k=1}^{n} X_k$ where the i.i.d. random variables $\{X_k\}$ have
each the characteristic function $\Phi_X(\theta)$.
(a) Show that if $\frac{d\Phi_X}{d\theta}(0) = z \in \mathbb{C}$, then $z = ia$ for some $a \in \mathbb{R}$ and $n^{-1}S_n \stackrel{p}{\longrightarrow} a$
as $n \to \infty$.
(b) Show that if $n^{-1}S_n \stackrel{p}{\longrightarrow} a$, then $\Phi_X(h_k)^{n_k} \to e^{ia\theta}$ for any $h_k \downarrow 0$, $\theta > 0$
and $n_k = [\theta/h_k]$, and deduce that $\frac{d\Phi_X}{d\theta}(0) = ia$.
(c) Conclude that the weak law of large numbers holds (i.e. $n^{-1}S_n \stackrel{p}{\longrightarrow} a$ for
some non-random $a$), if and only if $\Phi_X(\theta)$ is differentiable at $\theta = 0$ (this
result is due to E.J.G. Pitman, see [Pit56]).
(d) Use Exercise 2.1.13 to provide a random variable $X$ for which $\Phi_X(\theta)$ is
differentiable at $\theta = 0$ but $E|X| = \infty$.
Exercise 3.3.26. Suppose $|\Phi_Y(\theta)| = 1$ for some $\theta \ne 0$.
(a) Show that $Y$ is a $(2\pi/\theta)$-lattice random variable, namely, that $Y \bmod (2\pi/\theta)$
is P-degenerate.
Hint: Check conditions for equality when applying Jensen's inequality for
$(\cos\theta Y, \sin\theta Y)$ and the convex function $g(x,y) = \sqrt{x^2 + y^2}$.
(b) Deduce that if in addition $|\Phi_Y(\theta')| = 1$ for some $\theta'/\theta \notin \mathbb{Q}$ then $Y$ must be
P-degenerate, in which case $\Phi_Y(\theta) = \exp(ic\theta)$ for some $c \in \mathbb{R}$.
Building on the preceding two exercises, you are to prove next the following convergence of types result.
Exercise 3.3.27. Suppose $Z_n \stackrel{D}{\longrightarrow} Y$ and $\alpha_n Z_n + \beta_n \stackrel{D}{\longrightarrow} \widehat Y$ for some $\widehat Y$, non-P-degenerate $Y$, and non-random $\alpha_n \ge 0$, $\beta_n$.
(a) Show that
$$\lim_{T\to\infty}\ \theta_0\sum_{k=-T}^{T}\Big(1 - \frac{|k|}{T}\Big)\Psi_{a,b}(k\theta_0)\,\Phi_X(k\theta_0) = \frac12\,[F_X(b) + F_X(b^-)] - \frac12\,[F_X(a) + F_X(a^-)]\,.$$
Hint: Recall that $S_T(r) = \sum_{k=1}^{T}(1 - k/T)\frac{\sin kr}{k}$ is uniformly bounded for
$r \in (0, 2\pi)$ and integer $T \ge 1$, and $S_T(r) \to \frac{\pi - r}{2}$ as $T \to \infty$.
(b) Show that if $\sum_k |\Phi_X(k\theta_0)| < \infty$ then $X$ has the bounded continuous
probability density function, given for $x \in (0, t)$ by
$$f_X(x) = \frac{\theta_0}{2\pi}\sum_{k\in\mathbb{Z}} e^{-ik\theta_0 x}\,\Phi_X(k\theta_0)\,.$$
(c) Deduce that if R.V.s $X$ and $Y$ supported on $(0, t)$ are such that $\Phi_X(k\theta_0) = \Phi_Y(k\theta_0)$ for all $k \in \mathbb{Z}$, then $X \stackrel{D}{=} Y$.
Proof. Let $R_2(x) = e^{ix} - 1 - ix - (ix)^2/2$. Then, rearranging terms, recalling
$E(X) = 0$ and using Jensen's inequality for the modulus function, we see that
$$\Big|\Phi_X(\theta) - 1 + \frac{v\theta^2}{2}\Big| = \Big|E\Big[e^{i\theta X} - 1 - i\theta X + \frac{\theta^2}{2}X^2\Big]\Big| = |E R_2(\theta X)| \le E|R_2(\theta X)|\,.$$
Since $|R_2(x)| \le \min(|x|^2, |x|^3/6)$ for any $x \in \mathbb{R}$ (see also Exercise 3.3.35), by monotonicity of the expectation we get that $E|R_2(\theta X)| \le E\min(\theta^2 X^2, |\theta|^3|X|^3/6)$, completing the proof of the lemma. $\square$
The following simple complex analysis estimate is needed for relating the approximation of the characteristic function of summands to that of their sum.
Lemma 3.3.31. Suppose $z_{n,k} \in \mathbb{C}$ are such that $z_n = \sum_{k=1}^{n} z_{n,k} \to z_\infty$ and
$\eta_n = \sum_{k=1}^{n} |z_{n,k}|^2 \to 0$ when $n \to \infty$. Then,
$$\Pi_n := \prod_{k=1}^{n}(1 + z_{n,k}) \to \exp(z_\infty) \qquad\text{for } n \to \infty\,.$$
Proof. The power series expansion
$$\log(1+z) = \sum_{k=1}^{\infty}\frac{(-1)^{k-1}z^k}{k}$$
converges for $|z| < 1$. In particular, for $|z| \le 1/2$ it follows that
$$|\log(1+z) - z| \le \sum_{k=2}^{\infty}\frac{|z|^k}{k} \le |z|^2\sum_{k=2}^{\infty}\frac{2^{-(k-2)}}{k} \le |z|^2\sum_{k=2}^{\infty}2^{-(k-1)} = |z|^2\,.$$
Since $\max_k |z_{n,k}|^2 \le \eta_n \to 0$, for all $n$ large enough we have $|z_{n,k}| \le 1/2$ for every $k$, in which case
$$\Big|\log\Pi_n - \sum_{k=1}^{n} z_{n,k}\Big| \le \sum_{k=1}^{n}\big|\log(1+z_{n,k}) - z_{n,k}\big| \le \sum_{k=1}^{n}|z_{n,k}|^2 = \eta_n\,.$$
With $z_n \to z_\infty$ and $\eta_n \to 0$, it follows that $\log\Pi_n \to z_\infty$, hence $\Pi_n \to \exp(z_\infty)$. $\square$
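A quick numerical illustration of Lemma 3.3.31 (a sketch, not part of the notes), in the special case $z_{n,k} = z/n$: here $z_n = z$ for every $n$ while $\eta_n = |z|^2/n \to 0$, so the lemma recovers the classical limit $(1 + z/n)^n \to e^z$ for complex $z$.

```python
import cmath

# Lemma 3.3.31 with z_{n,k} = z/n: the product is (1 + z/n)^n, and the
# lemma asserts it converges to exp(z); the error decays roughly like 1/n.
z = 0.3 + 1.1j
for n in (10, 1_000, 100_000):
    prod_n = (1 + z / n) ** n
    print(n, abs(prod_n - cmath.exp(z)))
```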
We have already seen two examples of symmetric stable laws, namely those associated with the zero-mean normal density and with the Cauchy density of Example
3.3.14. Indeed, as you show next, for each $\alpha \in (0, 2)$ there corresponds the symmetric $\alpha$-stable variable $Y_\alpha$ whose characteristic function is $\Phi_{Y_\alpha}(\theta) = \exp(-|\theta|^\alpha)$
(so the Cauchy distribution corresponds to the symmetric stable of index $\alpha = 1$
and the normal distribution corresponds to index $\alpha = 2$).
Exercise 3.3.34. Fixing $\alpha \in (0, 2)$, suppose $X \stackrel{D}{=} -X$ and $P(|X| > x) = x^{-\alpha}$ for
all $x \ge 1$.
(a) Check that $\Phi_X(\theta) = 1 - \gamma(|\theta|)\,|\theta|^\alpha$ where $\gamma(r) = \alpha\int_r^\infty (1-\cos u)\,u^{-(\alpha+1)}\,du$
converges as $r \downarrow 0$ to $\gamma(0)$ finite and positive.
(b) Setting $\Phi_{\alpha,0}(\theta) = \exp(-|\theta|^\alpha)$, $b_n = (\gamma(0)\,n)^{1/\alpha}$ and $\widehat S_n = b_n^{-1}\sum_{k=1}^{n} X_k$
for i.i.d. copies $X_k$ of $X$, deduce that $\Phi_{\widehat S_n}(\theta) \to \Phi_{\alpha,0}(\theta)$ as $n \to \infty$, for
any fixed $\theta \in \mathbb{R}$.
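For the case $\alpha = 1$ this stability can be seen in simulation (a sketch assuming numpy is available; the replica and sample counts are arbitrary): the sample mean of i.i.d. standard Cauchy variables is again standard Cauchy, with characteristic function $e^{-|\theta|}$.

```python
import numpy as np

# The standard Cauchy law is symmetric 1-stable: Phi(theta) = exp(-|theta|),
# and S_n / n for i.i.d. Cauchy summands is again standard Cauchy.
rng = np.random.default_rng(1)
m, n = 100_000, 20
x = rng.standard_cauchy((m, n))
mean_n = x.mean(axis=1)          # m independent copies of S_n / n

def emp_cf(sample, t):
    return np.exp(1j * t * sample).mean()

for t in (0.5, 1.0, 2.0):
    print(t, abs(emp_cf(mean_n, t) - np.exp(-abs(t))))  # O(m^{-1/2}) errors
```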
where $g_1(r) = (2/\pi)\log|r|$ and $g_\alpha = \tan(\pi\alpha/2)$ is constant for all $\alpha \ne 1$ (in
particular, $g_2 = 0$ so the parameter $\beta$ is irrelevant when $\alpha = 2$). Further, in case
$\alpha < 2$ the domain of attraction of $Y_{\alpha,\beta}$ consists precisely of the random variables
$X$ for which $L(x) = x^\alpha P(|X| > x)$ is slowly varying at $\infty$ and $(P(X > x) - P(X < -x))/P(|X| > x) \to \beta$ as $x \to \infty$ (for example, see [Bre92, Theorem 9.34]). To
complete this picture, we recall [Fel71, Theorem XVII.5.1], that $X$ is in the domain
of attraction of the normal variable $Y_2$ if and only if $L(x) = E[X^2 I_{\{|X|\le x\}}]$ is slowly
varying (as is of course the case whenever $EX^2$ is finite).
As shown in the following exercise, controlling the modulus of the remainder term
for the $n$-th order Taylor approximation of $e^{ix}$ one can generalize the bound on
$\Phi_X(\theta)$ beyond the case $n = 2$ of Lemma 3.3.30.
Exercise 3.3.35. For any $x \in \mathbb{R}$ and non-negative integer $n$, let
$$R_n(x) = e^{ix} - \sum_{k=0}^{n}\frac{(ix)^k}{k!}\,.$$
(a) Show that $R_n(x) = \int_0^x i R_{n-1}(y)\,dy$ for all $n \ge 1$ and deduce by induction
on $n$ that
$$|R_n(x)| \le \min\Big(\frac{2|x|^n}{n!}\,,\,\frac{|x|^{n+1}}{(n+1)!}\Big) \qquad\text{for all } x \in \mathbb{R},\ n = 0, 1, 2, \ldots\,.$$
(b) Deduce that if $E|X|^n < \infty$, then
$$\Big|\Phi_X(\theta) - \sum_{k=0}^{n}\frac{(i\theta)^k\,EX^k}{k!}\Big| \le |\theta|^n\,E\min\Big(\frac{2|X|^n}{n!}\,,\,\frac{|\theta|\,|X|^{n+1}}{(n+1)!}\Big)\,.$$
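The bound of part (a) is easy to probe numerically (a sketch; the grid of test points is an arbitrary choice):

```python
import cmath, math

# Check |R_n(x)| <= min(2|x|^n/n!, |x|^{n+1}/(n+1)!) on a grid of x and n.
def R(n, x):
    # remainder of the n-th order Taylor approximation of e^{ix}
    return cmath.exp(1j * x) - sum((1j * x) ** k / math.factorial(k)
                                   for k in range(n + 1))

def bound(n, x):
    return min(2 * abs(x) ** n / math.factorial(n),
               abs(x) ** (n + 1) / math.factorial(n + 1))

ok = all(abs(R(n, x)) <= bound(n, x) + 1e-9
         for n in range(6) for x in (-10.0, -2.5, -0.3, 0.4, 3.1, 9.0))
print(ok)
```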
By solving the next exercise you generalize the proof of Theorem 3.1.2 via characteristic functions to the setting of Lindeberg's clt.
Exercise 3.3.36. Consider $\widehat S_n = \sum_{k=1}^{n} X_{n,k}$ for mutually independent random
variables $X_{n,k}$, $k = 1, \ldots, n$, of zero mean and variance $v_{n,k}$, such that $v_n = \sum_{k=1}^{n} v_{n,k} \to 1$ as $n \to \infty$.
(a) Fixing $\theta \in \mathbb{R}$ show that
$$\Pi_n = \Phi_{\widehat S_n}(\theta) = \prod_{k=1}^{n}(1 + z_{n,k})\,,$$
We conclude this section with an exercise that reviews various techniques one may
use for establishing convergence in distribution for sums of independent random
variables.
Exercise 3.3.37. Throughout this problem $S_n = \sum_{k=1}^{n} X_k$ for mutually independent random variables $\{X_k\}$.
(a) Suppose that $P(X_k = k^\alpha) = P(X_k = -k^\alpha) = 1/(2k^\beta)$ and $P(X_k = 0) = 1 - k^{-\beta}$. Show that for any fixed $\alpha \in \mathbb{R}$ and $\beta > 1$, the series $S_n$
converges almost surely as $n \to \infty$.
(b) Consider the setting of part (a) when $0 < \beta \le 1$ and $\gamma = 2\alpha - \beta + 1$ is
positive. Find non-random $b_n$ such that $b_n^{-1}S_n \stackrel{D}{\longrightarrow} Z$ and $0 < F_Z(z) < 1$
for some $z \in \mathbb{R}$. Provide also the characteristic function $\Phi_Z(\theta)$ of $Z$.
(c) Repeat part (b) in case $\beta = 1$ and $\alpha > 0$ (see Exercise 3.1.11 for $\alpha = 0$).
(d) Suppose now that $P(X_k = 2k) = P(X_k = -2k) = 1/(2k^2)$ and $P(X_k = 1) = P(X_k = -1) = 0.5(1 - k^{-2})$. Show that $S_n/\sqrt{n} \stackrel{D}{\longrightarrow} G$.
3.4. Poisson approximation and the Poisson process
Subsection 3.4.1 deals with the Poisson approximation theorem and a few of its applications. It leads naturally to the introduction of the Poisson process in Subsection
3.4.2, where we also explore its relation to sums of i.i.d. Exponential variables and
to order statistics of i.i.d. uniform random variables.
Theorem 3.4.1 (Poisson approximation). Let $\bar S_n = \sum_{k=1}^{n} Z_{n,k}$, where for each
$n$ the random variables $Z_{n,k}$ for $1 \le k \le n$, are mutually independent, each taking
value in the set of non-negative integers. Suppose that $p_{n,k} = P(Z_{n,k} = 1)$ and
$\epsilon_{n,k} = P(Z_{n,k} \ge 2)$ are such that as $n \to \infty$,
(a) $\sum_{k=1}^{n} p_{n,k} \to \lambda < \infty$,
(b) $\max_{k=1,\ldots,n} \{p_{n,k}\} \to 0$,
(c) $\sum_{k=1}^{n} \epsilon_{n,k} \to 0$.
Then $\bar S_n \stackrel{D}{\longrightarrow} N_\lambda$, a Poisson random variable of parameter $\lambda$.
Proof. Let $\bar Z_{n,k} = Z_{n,k} I_{\{Z_{n,k}\le 1\}}$, and note that
$$P\Big(\bar S_n \ne \sum_{k=1}^{n}\bar Z_{n,k}\Big) \le \sum_{k=1}^{n} P(Z_{n,k} \ge 2) \to 0 \qquad\text{for } n \to \infty\,,\ \text{by assumption (c)}\,.$$
Hence, it suffices to show that $\sum_{k=1}^{n}\bar Z_{n,k} \stackrel{D}{\longrightarrow} N_\lambda$. To this end, by the independence of $\{\bar Z_{n,k}\}$,
$$\Phi_{\sum_k \bar Z_{n,k}}(\theta) = \prod_{k=1}^{n}\Phi_{\bar Z_{n,k}}(\theta) = \prod_{k=1}^{n}\big(1 - p_{n,k} + p_{n,k}e^{i\theta}\big) = \prod_{k=1}^{n}(1 + z_{n,k})\,,$$
where $z_{n,k} = p_{n,k}(e^{i\theta} - 1)$. By assumption (a),
$$\sum_{k=1}^{n} z_{n,k} = \Big(\sum_{k=1}^{n} p_{n,k}\Big)(e^{i\theta} - 1) \to \lambda(e^{i\theta} - 1) := z_\infty\,.$$
Further, since $|z_{n,k}| \le 2p_{n,k}$, our assumptions (a) and (b) imply that for $n \to \infty$,
$$\eta_n = \sum_{k=1}^{n}|z_{n,k}|^2 \le 4\sum_{k=1}^{n} p_{n,k}^2 \le 4\big(\max_{k=1,\ldots,n}\{p_{n,k}\}\big)\Big(\sum_{k=1}^{n} p_{n,k}\Big) \to 0\,.$$
Consequently, by Lemma 3.3.31, $\Phi_{\sum_k \bar Z_{n,k}}(\theta) \to \exp(\lambda(e^{i\theta} - 1)) = \Phi_{N_\lambda}(\theta)$
(see (3.3.3) for the last identity), thus completing the proof. $\square$
Remark. Recall Example 3.2.25 that the weak convergence of the laws of the
integer valued $\bar S_n$ to that of $N_\lambda$ also implies their convergence in total variation.
In the setting of the Poisson approximation theorem, taking $\lambda_n = \sum_{k=1}^{n} p_{n,k}$, the
more quantitative result
$$\|P_{\bar S_n} - P_{N_{\lambda_n}}\|_{\mathrm{tv}} \le \min(\lambda_n^{-1}, 1)\sum_{k=1}^{n} p_{n,k}^2$$
due to Stein (1987) also holds (see also [Dur10, (3.6.1)] for a simpler argument,
due to Hodges and Le Cam (1960), which is just missing the factor $\min(\lambda_n^{-1}, 1)$).
For the remainder of this subsection we list applications of the Poisson approximation theorem, starting with
Example 3.4.2 (Poisson approximation for the Binomial). Take independent variables $Z_{n,k} \in \{0, 1\}$, so $\epsilon_{n,k} = 0$, with $p_{n,k} = p_n$ that does not depend on
$k$. Then, the variable $S_n = \bar S_n$ has the Binomial distribution of parameters $(n, p_n)$.
By Stein's result, the Binomial distribution of parameters $(n, p_n)$ is approximated
well by the Poisson distribution of parameter $\lambda_n = np_n$, provided $p_n \to 0$. In case
$\lambda_n = np_n \to \lambda < \infty$, Theorem 3.4.1 yields that the Binomial $(n, p_n)$ laws converge
weakly as $n \to \infty$ to the Poisson distribution of parameter $\lambda$. This is in agreement
with Example 3.1.7 where we approximate the Binomial distribution of parameters
$(n, p)$ by the normal distribution, for in Example 3.1.8 we saw that, upon the same
scaling, $N_{\lambda_n}$ is also approximated well by the normal distribution when $\lambda_n \to \infty$.
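The quality of this approximation is easy to quantify numerically (a sketch; the choice $\lambda = 2$ is arbitrary, and the half-sum normalization of total variation is used):

```python
import math

# Total variation distance between Binomial(n, lam/n) and Poisson(lam),
# with both pmfs evaluated iteratively to avoid overflow for large n.
def tv_binom_poisson(lam, n):
    p = lam / n
    binom = (1.0 - p) ** n          # P(Bin(n,p) = 0)
    poisson = math.exp(-lam)        # P(Poisson(lam) = 0)
    tv = abs(binom - poisson)
    for k in range(1, n + 1):
        binom *= (n - k + 1) * p / (k * (1.0 - p))
        poisson *= lam / k
        tv += abs(binom - poisson)
    return 0.5 * tv                 # Poisson tail beyond n is negligible here

for n in (10, 100, 1000):
    print(n, tv_binom_poisson(2.0, n))   # shrinks as n grows, roughly like 1/n
```

The observed decay in $n$ is consistent with Stein's bound, which for $p_n = \lambda/n$ is of order $\lambda^2/n$.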
Recall the occupancy problem where we distribute at random $r$ distinct balls
among $n$ distinct boxes and each of the possible $n^r$ assignments of balls to boxes is
equally likely. In Example 2.1.10 we considered the asymptotic fraction of empty
boxes when $r/n \to \alpha$ and $n \to \infty$. Noting that the number of balls $M_{n,k}$ in the
$k$-th box follows the Binomial distribution of parameters $(r, n^{-1})$, we deduce from
Example 3.4.2 that $M_{n,k} \stackrel{D}{\longrightarrow} N_\alpha$. Thus, $P(M_{n,k} = 0) \to P(N_\alpha = 0) = e^{-\alpha}$.
That is, for large $n$ each box is empty with probability about $e^{-\alpha}$, which may
explain (though not prove) the result of Example 2.1.10. Here we use the Poisson
approximation theorem to tackle a different regime, in which $r = r_n$ is of order
$n \log n$, and consequently, there are fewer empty boxes.
Proposition 3.4.3. Let $S_n$ denote the number of empty boxes. Assuming $r = r_n$
is such that $n e^{-r/n} \to \lambda \in [0, \infty)$, we have that $S_n \stackrel{D}{\longrightarrow} N_\lambda$ as $n \to \infty$.
Proof. Let $Z_{n,k} = I_{\{M_{n,k}=0\}}$ for $k = 1, \ldots, n$, that is $Z_{n,k} = 1$ if the $k$-th box
is empty and $Z_{n,k} = 0$ otherwise. Note that $S_n = \sum_{k=1}^{n} Z_{n,k}$, with each $Z_{n,k}$
having the Bernoulli distribution of parameter $p_n = (1 - n^{-1})^r$. Our assumption
about $r_n$ guarantees that $np_n \to \lambda$. If the occupancy $Z_{n,k}$ of the various boxes were
mutually independent, then the stated convergence of $S_n$ to $N_\lambda$ would have followed
from Theorem 3.4.1. Unfortunately, this is not the case, so we present a bare-hands approach showing that the dependence is weak enough to retain the same
conclusion. To this end, first observe that for any $l = 1, 2, \ldots, n$, the probability
that given boxes $k_1 < k_2 < \ldots < k_l$ are all empty is,
$$P(Z_{n,k_1} = Z_{n,k_2} = \cdots = Z_{n,k_l} = 1) = \Big(1 - \frac{l}{n}\Big)^r\,.$$
Let $p_l = p_l(r, n) = P(S_n = l)$ denote the probability that exactly $l$ boxes are empty
out of the $n$ boxes into which the $r$ balls are placed at random. Then, considering
all possible choices of the locations of these $l \ge 1$ empty boxes we get the identities
$p_l(r, n) = b_l(r, n)\,p_0(r, n - l)$ for
$$(3.4.1)\qquad b_l(r, n) = \binom{n}{l}\Big(1 - \frac{l}{n}\Big)^r\,.$$
Further, $p_0(r, n) = 1 - P(\text{at least one empty box})$, so that by the inclusion-exclusion
formula,
$$(3.4.2)\qquad p_0(r, n) = \sum_{l=0}^{n}(-1)^l\, b_l(r, n)\,.$$
which is an improvement over our weak law result that $T_n/(n \log n) \stackrel{p}{\longrightarrow} 1$. Indeed, to
derive (3.4.3) view the first $r$ trials of the coupon collector as the random placement
of $r$ balls into $n$ distinct boxes that correspond to the $n$ possible values. From this
point of view, the event $\{T_n \le r\}$ corresponds to filling all $n$ boxes with the $r$
balls, that is, having none empty. Taking $r = r_n = [n \log n + nx]$ we have that
$n e^{-r_n/n} \to e^{-x}$, and so it follows from Proposition 3.4.3 that $P(T_n \le r_n) \to P(N_{e^{-x}} = 0) = e^{-e^{-x}}$, as stated in (3.4.3).
Note that though $T_n = \sum_{k=1}^{n} X_{n,k}$ with $X_{n,k}$ independent, the convergence in distribution of $T_n$, given by (3.4.3), is to a non-normal limit. This should not surprise
you, for the terms $X_{n,k}$ with $k$ near $n$ are large and do not satisfy Lindeberg's
condition.
Exercise 3.4.6. Recall that $\tau_\ell$ denotes the first time one has $\ell$ distinct values
when collecting coupons that are uniformly distributed on $\{1, 2, \ldots, n\}$. Using the
Poisson approximation theorem show that if $n \to \infty$ and $\ell = \ell(n)$ is such that
$\ell\, n^{-1/2} \to \mu \in [0, \infty)$, then $\tau_\ell - \ell \stackrel{D}{\longrightarrow} N$ with $N$ a Poisson random variable of
parameter $\mu^2/2$.
3.4.2. Poisson Process. The Poisson process is a continuous time stochastic
process $\omega \mapsto N_t(\omega)$, $t \ge 0$, which belongs to the following class of counting processes.
Our next proposition, which is often used as an alternative definition of the Poisson
process, also explains its name.
Proposition 3.4.9. For any $\ell$ and any $0 = t_0 < t_1 < \cdots < t_\ell$, the increments
$N_{t_1}, N_{t_2} - N_{t_1}, \ldots, N_{t_\ell} - N_{t_{\ell-1}}$, are independent random variables and for some
$\lambda > 0$ and all $t > s \ge 0$, the increment $N_t - N_s$ has the Poisson($\lambda(t-s)$) law.
Thus, the Poisson process has independent increments, each having a Poisson law,
where the parameter of the count $N_t - N_s$ is proportional to the length of the
corresponding interval $[s, t]$.
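This characterization is easy to probe in simulation (a sketch assuming numpy is available; the rate $\lambda = 1.5$, horizon $t = 10$ and replica count are arbitrary choices): building the process from i.i.d. Exponential($\lambda$) inter-arrival gaps, the count $N_t$ should have mean and variance close to $\lambda t$.

```python
import numpy as np

# Simulate N_t by accumulating Exponential(lam) gaps until time t is passed.
rng = np.random.default_rng(2)
lam, t, reps = 1.5, 10.0, 20_000

def sample_Nt():
    total, count = 0.0, 0
    while True:
        total += rng.exponential(1 / lam)   # next inter-arrival gap
        if total > t:
            return count
        count += 1

counts = np.array([sample_Nt() for _ in range(reps)])
print(counts.mean(), counts.var())   # both should be close to lam * t = 15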
The proof of Proposition 3.4.9 relies on the lack of memory of the exponential
distribution. That is, if the law of a random variable $T$ is exponential (of some
parameter $\lambda > 0$), then for all $t, s \ge 0$,
$$(3.4.4)\qquad \frac{P(T > t + s)}{P(T > t)} = \frac{e^{-\lambda(t+s)}}{e^{-\lambda t}} = e^{-\lambda s} = P(T > s)\,.$$
Indeed, the key to the proof of Proposition 3.4.9 is the following lemma.
Lemma 3.4.10. Fixing $t > 0$, the variables $\{\tau_j\}$ with $\tau_1 = T_{N_t+1} - t$, and $\tau_j = T_{N_t+j} - T_{N_t+j-1}$, $j \ge 2$, are i.i.d. each having the exponential distribution of parameter $\lambda$. Further, the collection $\{\tau_j\}$ is independent of $N_t$ which has the Poisson
distribution of parameter $\lambda t$.
Remark. Note that in particular, $E_t = T_{N_t+1} - t$, which counts the time till the
next arrival occurs, hence called the excess life time at $t$, follows the exponential
distribution of parameter $\lambda$.
Proof. Fixing $t > 0$ and $n \ge 1$ let $H_n(x) = P(t \ge T_n > t - x)$. With
$H_n(x) = \int_0^x f_{T_n}(t-y)\,dy$ and $T_n$ independent of $\tau_{n+1}$, we get by Fubini's theorem
(for $I_{\{t \ge T_n > t - \tau_{n+1}\}}$), and the integration by parts of Lemma 1.4.30 that
$$(3.4.5)\qquad P(N_t = n) = P(t \ge T_n > t - \tau_{n+1}) = \int_0^t f_{T_n}(t-y)\,P(\tau_{n+1} > y)\,dy = e^{-\lambda t}\,\frac{(\lambda t)^n}{n!}\,.$$
As this applies for any $n \ge 1$, it follows that $N_t$ has the Poisson distribution of
parameter $\lambda t$. Similarly, observe that for any $s_1 \ge 0$ and $n \ge 1$,
$$P(N_t = n,\ \tau_1 > s_1) = P(t \ge T_n > t - \tau_{n+1} + s_1) = \int_0^t f_{T_n}(t-y)\,P(\tau_{n+1} > s_1 + y)\,dy\,,$$
and continuing in this manner one finds that for any $k$ and $s_j \ge 0$, $j = 1, \ldots, k$,
$$P(N_t = n,\ \tau_j > s_j,\ j = 1, \ldots, k) = P(N_t = n)\prod_{j=1}^{k} P(\tau_j > s_j)\,.$$
Since $s_j \ge 0$ and $n \ge 0$ are arbitrary, this shows that the random variables $N_t$ and
$\tau_j$, $j = 1, \ldots, k$ are mutually independent (c.f. Corollary 1.4.12), with each $\tau_j$ having an exponential distribution of parameter $\lambda$. As $k$ is arbitrary, the independence
of $N_t$ and the countable collection $\{\tau_j\}$ follows by Definition 1.4.3. $\square$
Proof of Proposition 3.4.9. Fix $t, s_j \ge 0$, $j = 1, \ldots, k$, and non-negative
integers $n$ and $m_j$, $1 \le j \le k$. The event $\{N_{s_j} = m_j, 1 \le j \le k\}$ is of the form
$\{(\tau_1, \ldots, \tau_r) \in H\}$ for $r = m_k + 1$ and
$$H = \bigcap_{j=1}^{k}\Big\{(u_1, \ldots, u_r) : \sum_{i=1}^{m_j} u_i \le s_j < \sum_{i=1}^{m_j+1} u_i\Big\}\,.$$
By induction on $\ell$ this identity implies that if $0 = t_0 < t_1 < t_2 < \cdots < t_\ell$, then
$$(3.4.6)\qquad P(N_{t_i} - N_{t_{i-1}} = n_i,\ 1 \le i \le \ell) = \prod_{i=1}^{\ell} P(N_{t_i - t_{i-1}} = n_i)\,.$$
(b) Compute the value of $v = E\big(\sum_{k=1}^{N_t}(t - T_k)\big)$.
(c) Suppose that $T_k$ is the arrival time to the train station of the $k$-th passenger on a train that departs the station at time $t$. What is the meaning
of $N_t$ and of $v$ in this case?
The representation of the order statistics {Vn,k } in terms of the jump times of
a Poisson process is very useful when studying the large n asymptotics of their
spacings {Rn,k }. For example,
Exercise 3.4.14. Let $R_{n,k} = V_{n,k} - V_{n,k-1}$, $k = 1, \ldots, n$, denote the spacings
between the $V_{n,k}$ of Exercise 3.4.11 (with $V_{n,0} = 0$). Show that as $n \to \infty$,
$$(3.4.7)\qquad \frac{n}{\log n}\,\max_{k=1,\ldots,n} R_{n,k} \stackrel{p}{\longrightarrow} 1\,,$$
and further for each fixed $x \ge 0$,
$$(3.4.8)\qquad G_n(x) := n^{-1}\sum_{k=1}^{n} I_{\{R_{n,k} > x/n\}} \stackrel{p}{\longrightarrow} e^{-x}\,,$$
$$(3.4.9)\qquad B_n(x) := P\Big(\min_{k=1,\ldots,n} R_{n,k} > x/n^2\Big) \to e^{-x}\,.$$
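Both (3.4.7) and (3.4.8) show up clearly in simulation (a sketch assuming numpy is available; the sample size is an arbitrary choice, and convergence in (3.4.7) is slow, of order $1/\log n$):

```python
import numpy as np

# Spacings of n i.i.d. Uniform(0,1) order statistics, with V_{n,0} = 0.
rng = np.random.default_rng(3)
n = 200_000
v = np.sort(rng.uniform(size=n))
r = np.diff(np.concatenate(([0.0], v)))   # R_{n,k} = V_{n,k} - V_{n,k-1}

print(n / np.log(n) * r.max())            # (3.4.7): near 1, up to O(1/log n)
x = 1.0
print((r > x / n).mean(), np.exp(-x))     # (3.4.8): both near e^{-1} = 0.368
```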
(b) Show that the sub-sequence of jump times $\{\widetilde T_k\}$ obtained by independently
keeping with probability $p$ each of the jump times $\{T_k\}$ of a Poisson process $N_t$ of rate $\lambda$, yields in turn a Poisson process $\widetilde N_t$ of rate $p\lambda$.
We conclude this section noting the superposition property, namely that the sum
of two independent Poisson processes is yet another Poisson process.
Exercise 3.4.17. Suppose $N_t = N_t^{(1)} + N_t^{(2)}$ where $N_t^{(1)}$ and $N_t^{(2)}$ are two independent Poisson processes of rates $\lambda_1 > 0$ and $\lambda_2 > 0$, respectively. Show that $N_t$
is a Poisson process of rate $\lambda_1 + \lambda_2$.
3.5. Random vectors and the multivariate clt
The goal of this section is to extend the clt to random vectors, that is, $\mathbb{R}^d$-valued
random variables. Towards this end, we revisit in Subsection 3.5.1 the theory
of weak convergence, this time in the more general setting of $\mathbb{R}^d$-valued random
variables. Subsection 3.5.2 is devoted to the extension of characteristic functions
and Lévy's theorems to the multivariate setting, culminating with the Cramér-Wold reduction of convergence in distribution of random vectors to that of their
one dimensional linear projections. Finally, in Subsection 3.5.3 we introduce the
important concept of Gaussian random vectors and prove the multivariate clt.
3.5.1. Weak convergence revisited. Recall Definition 3.2.17 of weak convergence for a sequence of probability measures on a topological space $\mathbb{S}$, which
suggests the following definition for convergence in distribution of $\mathbb{S}$-valued random
variables.
Definition 3.5.1. We say that $(\mathbb{S}, \mathcal{B}_{\mathbb{S}})$-valued random variables $X_n$ converge in
distribution to a $(\mathbb{S}, \mathcal{B}_{\mathbb{S}})$-valued random variable $X_\infty$, denoted by $X_n \stackrel{D}{\longrightarrow} X_\infty$, if
$P_{X_n} \stackrel{w}{\Rightarrow} P_{X_\infty}$.
As already remarked, the Portmanteau theorem about equivalent characterizations
of the weak convergence holds also when the probability measures $\nu_n$ are on a Borel
measurable space $(\mathbb{S}, \mathcal{B}_{\mathbb{S}})$ with $(\mathbb{S}, \rho)$ any metric space (and in particular for $\mathbb{S} = \mathbb{R}^d$).
Theorem 3.5.2 (portmanteau theorem). The following five statements are
equivalent for any probability measures $\nu_n$, $1 \le n \le \infty$ on $(\mathbb{S}, \mathcal{B}_{\mathbb{S}})$, with $(\mathbb{S}, \rho)$ any
metric space.
(a) $\nu_n \stackrel{w}{\Rightarrow} \nu_\infty$
(b) For every closed set $F$, one has $\limsup_n \nu_n(F) \le \nu_\infty(F)$
(c) For every open set $G$, one has $\liminf_n \nu_n(G) \ge \nu_\infty(G)$
(d) For every $\nu_\infty$-continuity set $A$, one has $\lim_n \nu_n(A) = \nu_\infty(A)$
(e) For every bounded Borel function $g$ whose set of discontinuity points $D_g$ has $\nu_\infty(D_g) = 0$, one has $\lim_n \nu_n(g) = \nu_\infty(g)$
apply it again in Subsection 9.2, considering there $\mathbb{S} = C([0, \infty))$, the metric space
of all continuous functions on $[0, \infty)$.
Proof. The derivation of (b) $\Leftrightarrow$ (c) $\Rightarrow$ (d) in Theorem 3.2.21 applies for any
topological space. The direction (e) $\Rightarrow$ (a) is also obvious since $h \in C_b(\mathbb{S})$ has
$D_h = \emptyset$ and $C_b(\mathbb{S})$ is a subset of the bounded Borel functions on the same space
(c.f. Exercise 1.2.20). So taking $g = h \in C_b(\mathbb{S})$ in (e) results with (a). It thus remains
only to show that (a) $\Rightarrow$ (b) and that (d) $\Rightarrow$ (e), which we proceed to show next.
(a) $\Rightarrow$ (b). Fixing $A \in \mathcal{B}_{\mathbb{S}}$ let $\rho_A(x) = \inf_{y\in A}\rho(x, y) : \mathbb{S} \mapsto [0, \infty)$. Since $|\rho_A(x) - \rho_A(x')| \le \rho(x, x')$ for any $x, x'$, it follows that $x \mapsto \rho_A(x)$ is a continuous function
on $(\mathbb{S}, \rho)$. Consequently, $h_r(x) = (1 - r\rho_A(x))^+ \in C_b(\mathbb{S})$ for all $r \ge 0$. Further,
$\rho_A(x) = 0$ for all $x \in \bar A$, implying that $h_r \ge I_{\bar A}$ for all $r$. Thus, applying part (a)
of the Portmanteau theorem for $h_r$ we have that
$$\limsup_n \nu_n(\bar A) \le \lim_n \nu_n(h_r) = \nu_\infty(h_r)\,.$$
$$\sum_{i} \eta_i\,\nu_n(A_i) \to \sum_{i} \eta_i\,\nu_\infty(A_i)$$
as $n \to \infty$. Our choice of $\eta_i$ and $A_i$ is such that $g - \epsilon \le \sum_{i} \eta_i I_{A_i} \le g + \epsilon$, resulting
with
$$\nu_n(g) \le \sum_{i} \eta_i\,\nu_n(A_i) + \epsilon\,,$$
where $\|h\| = \sup_{x\in\mathbb{S}} |h(x)|$ is finite (by the boundedness of $h$). Considering $n \to \infty$
followed by $r \to \infty$ we deduce from the convergence in probability of $\rho(X_n, X_\infty)$
to zero, that
$$\limsup_n E[\,|h(X_n) - h(X_\infty)|\,] \le \epsilon + 2\|h\|\lim_{r\to\infty} P(X_\infty \notin G_r) = \epsilon\,.$$
Since this applies for any $\epsilon > 0$, it follows by the triangle inequality that $Eh(X_n) \to Eh(X_\infty)$.
$$(3.5.1)\qquad \Phi_X(\theta) = E\big[e^{i(\theta, X)}\big]\,,$$
where $\theta = (\theta_1, \theta_2, \ldots, \theta_d) \in \mathbb{R}^d$ and $i = \sqrt{-1}$,
with both real and imaginary parts being bounded (hence integrable) random variables. Actually, it is easy to check that all five properties of Proposition 3.3.2 hold,
where part (e) is modified to $\Phi_{A^t X + b}(\theta) = \exp(i(b, \theta))\,\Phi_X(A\theta)$, for any non-random
$d \times d$-dimensional matrix $A$ and $b \in \mathbb{R}^d$ (with $A^t$ denoting the transpose of the
matrix $A$).
Here is the extension of the notion of probability density function (as in Definition
1.2.40) to a random vector.
Definition 3.5.5. Suppose $f_X$ is a non-negative Borel measurable function with
$\int_{\mathbb{R}^d} f_X(x)\,dx = 1$. We say that a random vector $X = (X_1, \ldots, X_d)$ has a probability
density function $f_X(\cdot)$ if for every $b = (b_1, \ldots, b_d)$,
$$F_X(b) = \int_{-\infty}^{b_1}\cdots\int_{-\infty}^{b_d} f_X(x_1, \ldots, x_d)\,dx_d \cdots dx_1\,.$$
We next state and prove the corresponding extension of Lévy's inversion theorem.
Theorem 3.5.7 (Lévy's inversion theorem). Suppose $\Phi_X(\theta)$ is the characteristic function of random vector $X = (X_1, \ldots, X_d)$ whose law is $P_X$, a probability
measure on $(\mathbb{R}^d, \mathcal{B}_{\mathbb{R}^d})$. If $A = [a_1, b_1] \times \cdots \times [a_d, b_d]$ with $P_X(\partial A) = 0$, then
$$(3.5.2)\qquad P_X(A) = \lim_{T\to\infty}\int_{[-T,T]^d}\prod_{j=1}^{d}\Psi_{a_j,b_j}(\theta_j)\,\Phi_X(\theta)\,d\theta$$
for $\Psi_{a,b}(\cdot)$ of (3.3.5). Further, the characteristic function determines the law of a
random vector. That is, if $\Phi_X(\theta) = \Phi_Y(\theta)$ for all $\theta$ then $X$ has the same law as $Y$.
Proof. We derive (3.5.2) by adapting the proof of Theorem 3.3.13. First apply
Fubini's theorem with respect to the product of Lebesgue's measure on $[-T, T]^d$
and the law of $X$ (both of which are finite measures on $\mathbb{R}^d$) to get the identity
$$J_T(a, b) := \int_{[-T,T]^d}\prod_{j=1}^{d}\Psi_{a_j,b_j}(\theta_j)\,\Phi_X(\theta)\,d\theta = \int_{\mathbb{R}^d}\prod_{j=1}^{d}\Big[\int_{-T}^{T} h_{a_j,b_j}(x_j, \theta_j)\,d\theta_j\Big]\,dP_X(x)$$
(where $h_{a,b}(x, \theta) = \Psi_{a,b}(\theta)e^{i\theta x}$). In the course of proving Theorem 3.3.13 we have
seen that for $j = 1, \ldots, d$ the integral over $\theta_j$ is uniformly bounded in $T$ and that
it converges to $g_{a_j,b_j}(x_j)$ as $T \to \infty$. Thus, by bounded convergence it follows that
$$\lim_{T\to\infty} J_T(a, b) = \int_{\mathbb{R}^d} g_{a,b}(x)\,dP_X(x)\,,$$
where
$$g_{a,b}(x) = \prod_{j=1}^{d} g_{a_j,b_j}(x_j)$$
is zero on $(\bar A)^c$ and one on $A^o$ (see the explicit formula for $g_{a,b}(x)$ provided there).
So, our assumption that $P_X(\partial A) = 0$ implies that the limit of $J_T(a, b)$ as $T \to \infty$ is
merely $P_X(A)$, thus establishing (3.5.2).
Suppose now that $\Phi_X(\theta) = \Phi_Y(\theta)$ for all $\theta$. Adapting the proof of Corollary 3.3.15
to the current setting, let $\mathcal{J} = \{\alpha \in \mathbb{R} : P(X_j = \alpha) > 0 \text{ or } P(Y_j = \alpha) > 0 \text{ for some }
j = 1, \ldots, d\}$, noting that if all the coordinates $\{a_j, b_j, j = 1, \ldots, d\}$ of a rectangle
$A$ are from the complement of $\mathcal{J}$ then both $P_X(\partial A) = 0$ and $P_Y(\partial A) = 0$. Thus,
by (3.5.2) we have that $P_X(A) = P_Y(A)$ for any $A$ in the collection $\mathcal{C}$ of rectangles
with coordinates in the complement of $\mathcal{J}$. Recall that $\mathcal{J}$ is countable, so for any
rectangle $A$ there exists $A_n \in \mathcal{C}$ such that $A_n \downarrow A$, and by continuity from above of
both $P_X$ and $P_Y$ it follows that $P_X(A) = P_Y(A)$ for every rectangle $A$. In view of
Proposition 1.1.39 and Exercise 1.1.21 this implies that the probability measures
$P_X$ and $P_Y$ agree on all Borel subsets of $\mathbb{R}^d$. $\square$
We next provide the ingredients needed when using characteristic functions en-route to the derivation of a convergence in distribution result for random vectors.
To this end, we start with the following analog of Lemma 3.3.17.
Lemma 3.5.8. Suppose the random vectors $\underline X_n$, $1 \le n \le \infty$ on $\mathbb{R}^d$ are such that
$\Phi_{\underline X_n}(\theta) \to \Phi_{\underline X_\infty}(\theta)$ as $n \to \infty$ for each $\theta \in \mathbb{R}^d$. Then, the corresponding sequence
of laws $\{P_{\underline X_n}\}$ is uniformly tight.
Proof. Fixing $\theta \in \mathbb{R}^d$ consider the sequence of random variables $Y_n = (\theta, \underline X_n)$.
Since $\Phi_{Y_n}(s) = \Phi_{\underline X_n}(s\theta)$ for $1 \le n \le \infty$, we have that $\Phi_{Y_n}(s) \to \Phi_{Y_\infty}(s)$ for all
$s \in \mathbb{R}$. The uniform tightness of the laws of $Y_n$ then follows by Lemma 3.3.17.
Considering $\theta_1, \ldots, \theta_d$ which are the unit vectors in the $d$ different coordinates,
we have the uniform tightness of the laws of $X_{n,j}$ for the sequence of random
vectors $\underline X_n = (X_{n,1}, X_{n,2}, \ldots, X_{n,d})$ and each fixed coordinate $j = 1, \ldots, d$. For
the compact sets $K_M = [-M, M]^d$ and all $n$,
$$P(\underline X_n \notin K_M) \le \sum_{j=1}^{d} P(|X_{n,j}| > M)\,.$$
As $d$ is finite, this leads from the uniform tightness of the laws of $X_{n,j}$ for each
$j = 1, \ldots, d$ to the uniform tightness of the laws of $\underline X_n$. $\square$
Equipped with Lemma 3.5.8 we are ready to state and prove Lévy's continuity
theorem.
Theorem 3.5.9 (Lévy's continuity theorem). Let $\underline X_n$, $1 \le n \le \infty$, be random
vectors on $\mathbb{R}^d$. Then $\underline X_n \stackrel{D}{\longrightarrow} \underline X_\infty$ if and only if $\Phi_{\underline X_n}(\theta) \to \Phi_{\underline X_\infty}(\theta)$ as $n \to \infty$,
for each fixed $\theta \in \mathbb{R}^d$.
upon the choice of $n(m)$. As $\underline X_{n(m_k)} \stackrel{D}{\longrightarrow} Y$, we have by the preceding part of the
proof that $\Phi_{\underline X_{n(m_k)}}(\theta) \to \Phi_Y(\theta)$, and necessarily $\Phi_Y = \Phi_{\underline X_\infty}$. The characteristic function
uniquely determines the law (see Theorem 3.5.7), so $Y \stackrel{D}{=} \underline X_\infty$ for every choice of $n(m)$, and the conclusion follows as in the proof of Theorem 3.3.18.
The proof of the multivariate clt is just one of the results that rely on the following
immediate corollary of Lévy's continuity theorem.
Corollary 3.5.10 (Cramér-Wold device). A sufficient condition for $\underline X_n \stackrel{D}{\longrightarrow} \underline X_\infty$ is that $(\theta, \underline X_n) \stackrel{D}{\longrightarrow} (\theta, \underline X_\infty)$ for each $\theta \in \mathbb{R}^d$.
$$(3.5.3)\qquad \Phi_{\underline Y}(\theta) = \prod_{k=1}^{d}\Phi_{Y_k}(\theta_k)\,.$$
Conversely, show that if (3.5.3) holds for all $\theta \in \mathbb{R}^d$, then the random variables $Y_k$,
$k = 1, \ldots, d$, are mutually independent of each other.
3.5.3. Gaussian random vectors and the multivariate clt. Recall the
following linear algebra concept.
Definition 3.5.12. A $d \times d$ matrix $A$ with entries $A_{jk}$ is called non-negative
definite (or positive semidefinite) if $A_{jk} = A_{kj}$ for all $j, k$, and for any $\theta \in \mathbb{R}^d$,
$$(\theta, A\theta) = \sum_{j=1}^{d}\sum_{k=1}^{d}\theta_j A_{jk}\theta_k \ge 0\,.$$
We are ready to define the class of multivariate normal distributions via the corresponding characteristic functions.
Definition 3.5.13. We say that a random vector $X = (X_1, X_2, \ldots, X_d)$ is Gaussian, or alternatively that it has a multivariate normal distribution, if
$$(3.5.4)\qquad \Phi_X(\theta) = e^{-\frac12(\theta, V\theta)}\,e^{i(\theta, \mu)}\,,$$
$\widehat S_n = n^{-1/2}\sum_{k=1}^{n}(\underline X_k - \mu)$, where $\{\underline X_k\}$
are i.i.d. random vectors with finite second moments and such that $\mu = E\underline X_1$.
Then, $\widehat S_n \stackrel{D}{\longrightarrow} G$, with $G$ having the $N(0, V)$ distribution and where $V$ is the $d \times d$-dimensional covariance matrix of $\underline X_1$.
Proof. Consider the i.i.d. random vectors $\underline Y_k = \underline X_k - \mu$, each having also the
covariance matrix $V$. Fixing an arbitrary vector $\theta \in \mathbb{R}^d$ we proceed to show that
$(\theta, \widehat S_n) \stackrel{D}{\longrightarrow} (\theta, G)$, which in view of the Cramér-Wold device completes the proof
of the theorem. Indeed, note that $(\theta, \widehat S_n) = n^{-\frac12}\sum_{k=1}^{n} Z_k$, where $Z_k = (\theta, \underline Y_k)$ are
i.i.d. $\mathbb{R}$-valued random variables, having zero mean and variance
$$v_\theta = \mathrm{Var}(Z_1) = E[(\theta, \underline Y_1)^2] = (\theta, E[\underline Y_1 \underline Y_1^t]\,\theta) = (\theta, V\theta)\,.$$
Observing that the clt of Proposition 3.1.2 thus applies to $(\theta, \widehat S_n)$, it remains only
to verify that the resulting limit distribution $N(0, v_\theta)$ is indeed the law of $(\theta, G)$.
To this end note that by Definitions 3.5.4 and 3.5.13, for any $s \in \mathbb{R}$,
$$\Phi_{(\theta, G)}(s) = \Phi_G(s\theta) = e^{-\frac{s^2}{2}(\theta, V\theta)} = e^{-v_\theta s^2/2}\,,$$
which is the characteristic function of the $N(0, v_\theta)$ distribution (see Example 3.3.6).
Since the characteristic function uniquely determines the law (see Corollary 3.3.15),
we are done. $\square$
Here is an explicit example for which the multivariate clt applies.
Example 3.5.17. The simple random walk on $\mathbb{Z}^d$ is $\underline S_n = \sum_{k=1}^{n} \underline X_k$ where $\underline X$,
$\underline X_k$ are i.i.d. random vectors such that
$$P(\underline X = +e_i) = P(\underline X = -e_i) = \frac{1}{2d}\,,\qquad i = 1, \ldots, d,$$
and $e_i$ is the unit vector in the $i$-th direction, $i = 1, \ldots, d$. In this case $E\underline X = 0$
and if $i \ne j$ then $EX_iX_j = 0$, resulting with the covariance matrix $V = (1/d)\,I$ for
the multivariate normal limit in distribution of $n^{-1/2}\underline S_n$.
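The covariance structure $V = (1/d)I$ is easy to see in simulation (a sketch assuming numpy is available; the walk length and replica count are arbitrary choices), here for $d = 2$:

```python
import numpy as np

# Simulate many copies of n^{-1/2} S_n for the simple random walk on Z^2
# and compare the empirical covariance with (1/d) I = 0.5 I.
rng = np.random.default_rng(4)
d, n, reps = 2, 200, 5_000
dirs = np.array([[1, 0], [-1, 0], [0, 1], [0, -1]])
steps = dirs[rng.integers(0, 2 * d, size=(reps, n))]   # uniform over +/- e_i
s = steps.sum(axis=1) / np.sqrt(n)                     # reps copies of n^{-1/2} S_n

print(np.cov(s.T))   # approx [[0.5, 0], [0, 0.5]]
```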
Building on Lindebergs clt for weighted sums of i.i.d. random variables, the
following multivariate normal limit is the basis for the convergence of random walks
to Brownian motion, to which Section 9.2 is devoted.
Exercise 3.5.18. Suppose $\{\xi_k\}$ are i.i.d. with $E\xi_1 = 0$ and $E\xi_1^2 = 1$. Consider the random functions $\widehat{S}_n(t) = n^{-1/2} S(nt)$, where $S(t) = \sum_{k=1}^{[t]} \xi_k + (t - [t])\xi_{[t]+1}$ and $[t]$ denotes the integer part of $t$.
(a) Verify that Lindeberg's clt applies for $\widehat{S}_n = \sum_{k=1}^n a_{n,k} \xi_k$ whenever the non-random $\{a_{n,k}\}$ are such that $r_n = \max\{|a_{n,k}| : k = 1, \ldots, n\} \to 0$ and $v_n = \sum_{k=1}^n a_{n,k}^2 \to 1$.
(b) Let $c(s,t) = \min(s,t)$ and fixing $0 = t_0 \le t_1 < \cdots < t_d$, denote by $C$ the $d \times d$ matrix of entries $C_{jk} = c(t_j, t_k)$. Show that for any $\theta \in \mathbb{R}^d$,
$$\sum_{r=1}^d (t_r - t_{r-1}) \Big( \sum_{j=r}^d \theta_j \Big)^2 = (\theta, C\theta) \,,$$
(c) Using the Cramér-Wold device deduce that $(\widehat{S}_n(t_1), \ldots, \widehat{S}_n(t_d)) \xrightarrow{D} G$ with $G$ having the $N(0, C)$ distribution.
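The covariance structure $C_{jk} = \min(t_j, t_k)$ in Exercise 3.5.18 can be seen empirically. A rough sketch, assuming numpy; the interpolation term of $\widehat{S}_n(t)$ is dropped here since it is $O(n^{-1/2})$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 500, 10000
t1, t2 = 0.3, 0.8

xi = rng.choice([-1.0, 1.0], size=(trials, n))   # E xi = 0, E xi^2 = 1
csum = np.cumsum(xi, axis=1)
# Shat_n(t) ~ n^{-1/2} S(nt), keeping only the partial-sum term:
S1 = csum[:, int(n * t1) - 1] / np.sqrt(n)
S2 = csum[:, int(n * t2) - 1] / np.sqrt(n)

C = np.cov(np.vstack([S1, S2]))
print(np.round(C, 2))   # should approximate [[0.3, 0.3], [0.3, 0.8]]
```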
As we see in the next exercise, there is more to a Gaussian random vector than
each coordinate having a normal distribution.
Exercise 3.5.19. Suppose $X_1$ has a standard normal distribution and $S$ is independent of $X_1$ and such that $P(S = 1) = P(S = -1) = 1/2$.
(a) Check that $X_2 = S X_1$ also has a standard normal distribution.
(b) Check that $X_1$ and $X_2$ are uncorrelated random variables, each having the standard normal distribution, while $X = (X_1, X_2)$ is not a Gaussian random vector and $X_1$ and $X_2$ are not independent variables.
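A simulation makes Exercise 3.5.19 concrete: $X_2 = S X_1$ is standard normal and uncorrelated with $X_1$, yet $X_1 + X_2$ vanishes with probability $1/2$, which is impossible for a linear functional of a Gaussian vector with unit-variance coordinates. A sketch, assuming numpy:

```python
import numpy as np

rng = np.random.default_rng(2)
m = 100000
X1 = rng.standard_normal(m)
S = rng.choice([-1.0, 1.0], size=m)
X2 = S * X1   # also standard normal, by the sign-symmetry of X1

corr = np.corrcoef(X1, X2)[0, 1]     # ~ 0: uncorrelated
p_zero = np.mean(X1 + X2 == 0.0)     # ~ 1/2: X1 + X2 has an atom at zero,
print(round(corr, 3), p_zero)        # so (X1, X2) is not a Gaussian vector
```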
We next relate the density of a random vector with its characteristic function, and
provide the density for the non-degenerate multivariate normal distribution.
Exercise 3.5.22.
(a) Show that if $\int_{\mathbb{R}^d} |\Phi_X(\theta)| d\theta < \infty$, then $X$ has the bounded continuous probability density function
$$f_X(x) = \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} e^{-i(\theta, x)} \Phi_X(\theta) \, d\theta \,. \tag{3.5.5}$$
Exercise 3.5.23. Suppose $\{Y_k\}$ are i.i.d. random variables with $EY_1^2 = 1$ and $EY_1 = 0$. Let $W_n = n^{-1} \sum_{k=1}^n Y_k^2$ and $X_{n,k} = Y_k/\sqrt{W_n}$ for $k = 1, \ldots, n$.
Next you find an interesting property about the coordinate of maximal value in
certain Gaussian random vectors.
Exercise 3.5.24. Suppose the random vector $X = (X_1, X_2, \ldots, X_d)$ has the multivariate normal distribution $N(0, V)$, with $V_{ii} = 1$ for all $i$ and $V_{ij} < 1$ for all $i \neq j$.
(a) Show that for each $1 \le j \le d$, the random variable $X_j$ is independent of
$$M_j := \max_{1 \le i \le d,\, i \neq j} \Big\{ \frac{X_i - V_{ij} X_j}{1 - V_{ij}} \Big\} \,.$$
(b) Check that with probability one, the index
$$j^\star := \arg\max_{1 \le j \le d} X_j \,,$$
$$\lim_{n \to \infty} \sum_{k=1}^n P(\|Y_k\| > \epsilon\sqrt{n}) = 0 \,,$$
and for some symmetric, (strictly) positive definite matrix $V$ and any fixed $\epsilon \in (0, 1]$,
$$\lim_{n \to \infty} n^{-1} \sum_{k=1}^n E\big( Y_k Y_k^t I_{\|Y_k\| \le \epsilon\sqrt{n}} \big) = V \,.$$
(a) Let $T_n = \sum_{k=1}^n X_{n,k}$ for $X_{n,k} = n^{-1/2} Y_k I_{\|Y_k\| \le \epsilon\sqrt{n}}$. Show that $T_n \xrightarrow{D} G$, with $G$ having the $N(0, V)$ multivariate normal distribution.
(b) Let $\widehat{S}_n = n^{-1/2} \sum_{k=1}^n Y_k$ and show that $\widehat{S}_n \xrightarrow{D} G$.
(c) Show that $(\widehat{S}_n)^t V^{-1} \widehat{S}_n \xrightarrow{D} Z$ and identify the law of $Z$.
CHAPTER 4
$$P(X = x_j | Z = z_i) = \frac{P(X = x_j, Z = z_i)}{P(Z = z_i)} \,,$$
$$E[X | Z = z_i] = \sum_{j=1}^m x_j P(X = x_j | Z = z_i) \,.$$
Then,
$$P(X = 1 | Z = 1) = \frac{P(X = 1, Z = 1)}{P(Z = 1)} = \frac{5}{6} \,,$$
implying that $P(X = 2 | Z = 1) = \frac{1}{6}$ and
$$E[X | Z = 1] = 1 \cdot \frac{5}{6} + 2 \cdot \frac{1}{6} = \frac{7}{6} \,.$$
Likewise, check that $E[X|Z = 2] = \frac{7}{4}$, hence $E[X|Z] = \frac{7}{6} I_{Z=1} + \frac{7}{4} I_{Z=2}$.
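This kind of computation is mechanical enough to script. The joint pmf below is hypothetical, chosen with $P(Z=1) = P(Z=2) = 1/2$ so as to reproduce the conditional probabilities stated here; it is not necessarily the joint law defined earlier in the text:

```python
from fractions import Fraction as F

# Hypothetical joint pmf consistent with P(X=1|Z=1) = 5/6 and E[X|Z=2] = 7/4.
pmf = {(1, 1): F(5, 12), (2, 1): F(1, 12),
       (1, 2): F(1, 8),  (2, 2): F(3, 8)}

def cond_exp_X_given(z):
    pz = sum(p for (x, zz), p in pmf.items() if zz == z)
    return sum(x * p for (x, zz), p in pmf.items() if zz == z) / pz

print(cond_exp_X_given(1), cond_exp_X_given(2))   # 7/6 7/4
```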
$$E[Y I_G] = \sum_{i \in I} \sum_{j=1}^m x_j P(X = x_j, Z = z_i) = E[X I_G] \,.$$
$$E[(X - Y) I_G] = 0 \,. \tag{4.1.1}$$
Moreover, if (4.1.1) holds for any $G \in \mathcal{G}$ and R.V.s $Y$ and $\widetilde{Y}$, both of which are in $L^1(\Omega, \mathcal{G}, P)$, then $P(\widetilde{Y} = Y) = 1$. In other words, the C.E. is uniquely defined for P-almost every $\omega$.
To prove the existence of the C.E. we need the following definition of absolute continuity of measures.
Definition 4.1.4. Let $\mu$ and $\nu$ be two measures on the measurable space $(S, \mathcal{F})$. We say that $\nu$ is absolutely continuous with respect to $\mu$, denoted by $\nu \ll \mu$, if
$$\mu(A) = 0 \ \Longrightarrow \ \nu(A) = 0 \,.$$
Hence, $P(G_\epsilon) = 0$. Since this applies for any $\epsilon > 0$ and $G_\epsilon \uparrow G_0$ as $\epsilon \downarrow 0$, we deduce that $P(Y - \widetilde{Y} > 0) = 0$. The same argument applies with the roles of $Y$ and $\widetilde{Y}$ reversed, so $P(Y - \widetilde{Y} = 0) = 1$ as claimed.
We turn to the existence of the C.E. assuming first that $X \in L^1(\Omega, \mathcal{F}, P)$ is also non-negative. Let $\mu$ denote the probability measure obtained by restricting $P$ to the measurable space $(\Omega, \mathcal{G})$ and $\nu$ denote the measure obtained by restricting $XP$ of Proposition 1.3.56 to this measurable space, noting that $\nu$ is a finite measure (since $\nu(\Omega) = (XP)(\Omega) = E[X] < \infty$). If $G \in \mathcal{G}$ is such that $\mu(G) = P(G) = 0$, then by definition also $\nu(G) = (XP)(G) = 0$. Therefore, $\nu$ is absolutely continuous with respect to $\mu$, and by the Radon-Nikodym theorem there exists $Y \in m\mathcal{G}_+$ such that $\nu = Y\mu$. This implies that for any $G \in \mathcal{G}$,
$$E[X I_G] = P(X I_G) = \nu(G) = (Y\mu)(G) = \mu(Y I_G) = E[Y I_G]$$
(and in particular, that $E[Y] = \nu(\Omega) < \infty$), proving the existence of the C.E. for non-negative R.V.s.
Turning to deal with the case of a general integrable R.V. $X$ we use the representation $X = X_+ - X_-$ with $X_+ \ge 0$ and $X_- \ge 0$ such that both $E[X_+]$ and $E[X_-]$ are finite. Set $Y = Y_+ - Y_-$, where the integrable, non-negative R.V.s $Y_\pm = E[X_\pm|\mathcal{G}]$
Remark. Beware that for $Y = E[X|\mathcal{G}]$ often $|Y| \neq E[|X|\,|\mathcal{G}]$ (for example, take the trivial $\mathcal{G} = \{\emptyset, \Omega\}$ and $P(X = 1) = P(X = -1) = 1/2$, for which $Y = 0$ while $E[|X|\,|\mathcal{G}] = 1$).
Exercise 4.1.7. Suppose either $E(Y_k)_+$ is finite or $E(Y_k)_-$ is finite for random variables $Y_k$, $k = 1, 2$ on $(\Omega, \mathcal{F}, P)$ such that $E[Y_1 I_A] \le E[Y_2 I_A]$ for any $A \in \mathcal{F}$. Show that then $P(Y_1 \le Y_2) = 1$.
In the next exercise you show that the Radon-Nikodym density preserves the product structure.
Exercise 4.1.8. Suppose that $\nu_k \ll \mu_k$ are pairs of $\sigma$-finite measures on $(S_k, \mathcal{F}_k)$ for $k = 1, \ldots, n$, with the corresponding Radon-Nikodym derivatives $f_k = d\nu_k/d\mu_k$.
(a) Show that the $\sigma$-finite product measure $\nu = \nu_1 \times \cdots \times \nu_n$ on the product space $(S, \mathcal{F})$ is absolutely continuous with respect to the $\sigma$-finite measure $\mu = \mu_1 \times \cdots \times \mu_n$ on $(S, \mathcal{F})$, with $d\nu/d\mu(s) = \prod_{k=1}^n f_k(s_k)$ for $s = (s_1, \ldots, s_n)$.
(b) Suppose $\nu$ and $\mu$ are probability measures on $S = \{(s_1, \ldots, s_n) : s_k \in S_k, k = 1, \ldots, n\}$. Show that $f_k(s_k)$, $k = 1, \ldots, n$, are both mutually $\nu$-independent and mutually $\mu$-independent.
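Part (a) of Exercise 4.1.8 can be checked point by point when the spaces are finite, since there the Radon-Nikodym derivative is just a ratio of point masses. A toy sketch in exact arithmetic (the measures below are invented for illustration):

```python
from fractions import Fraction as F
from itertools import product

# Two toy pairs nu_k << mu_k on finite spaces, with f_k = d nu_k / d mu_k.
mu1 = {'a': F(1, 2), 'b': F(1, 2)}
nu1 = {'a': F(1, 4), 'b': F(3, 4)}
mu2 = {0: F(1, 3), 1: F(2, 3)}
nu2 = {0: F(1, 2), 1: F(1, 2)}

f1 = {s: nu1[s] / mu1[s] for s in mu1}
f2 = {s: nu2[s] / mu2[s] for s in mu2}

# d(nu1 x nu2)/d(mu1 x mu2)(s1, s2) = f1(s1) f2(s2), checked pointwise:
ok = all(nu1[s1] * nu2[s2] == f1[s1] * f2[s2] * mu1[s1] * mu2[s2]
         for s1, s2 in product(mu1, mu2))
print(ok)   # True
```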
4.1.2. Proof of the Radon-Nikodym theorem. This section is devoted to proving the Radon-Nikodym theorem, which we have already used for establishing the existence of the C.E. This is done by proving the more general Lebesgue decomposition, based on the following definition.
Definition 4.1.9. Two measures $\mu_1$ and $\mu_2$ on the same measurable space $(S, \mathcal{F})$ are mutually singular if there is a set $A \in \mathcal{F}$ such that $\mu_1(A) = 0$ and $\mu_2(A^c) = 0$. This is denoted by $\mu_1 \perp \mu_2$, and we sometimes state that $\mu_1$ is singular with respect to $\mu_2$, instead of $\mu_1$ and $\mu_2$ being mutually singular.
Equipped with the concept of mutually singular measures, we next state the
Lebesgue decomposition and show that the Radon-Nikodym theorem is a direct
consequence of this decomposition.
Theorem 4.1.10 (Lebesgue decomposition). Suppose $\mu$ and $\nu$ are measures on the same measurable space $(S, \mathcal{F})$ such that $\mu(S)$ and $\nu(S)$ are finite. Then, $\nu = \nu_{ac} + \nu_s$ where the measure $\nu_s$ is singular with respect to $\mu$ and $\nu_{ac} = f\mu$ for some $f \in m\mathcal{F}_+$. Further, such a decomposition of $\nu$ is unique (per given $\mu$).
Remark. To build your intuition, note that the Lebesgue decomposition is quite explicit for $\sigma$-finite measures on a countable space $S$ (with $\mathcal{F} = 2^S$). Indeed, then $\nu_{ac}$ and $\nu_s$ are the restrictions of $\nu$ to the support $S_\mu = \{s \in S : \mu(\{s\}) > 0\}$ of $\mu$ and its complement, respectively, with $f(s) = \nu(\{s\})/\mu(\{s\})$ for $s \in S_\mu$ the Radon-Nikodym derivative of $\nu_{ac}$ with respect to $\mu$ (see Exercise 1.2.48 for more on the support of a measure).
Proof of the Radon-Nikodym theorem. Assume first that $\mu(S)$ and $\nu(S)$ are finite. Let $\nu = \nu_{ac} + \nu_s$ be the unique Lebesgue decomposition induced by $\mu$. Then, by definition there exists a set $A \in \mathcal{F}$ such that $\nu_s(A^c) = \mu(A) = 0$. Further, our assumption that $\nu \ll \mu$ implies that $\nu_s(A) \le \nu(A) = 0$ as well, hence $\nu_s(S) = 0$, i.e. $\nu = \nu_{ac} = f\mu$ for some $f \in m\mathcal{F}_+$.
Next, in case $\mu$ and $\nu$ are $\sigma$-finite measures the sample space $S$ is a countable union of disjoint sets $A_n \in \mathcal{F}$ such that both $\mu(A_n)$ and $\nu(A_n)$ are finite. Considering the measures $\mu_n = I_{A_n}\mu$ and $\nu_n = I_{A_n}\nu$, such that $\mu_n(S) = \mu(A_n)$ and $\nu_n(S) = \nu(A_n)$ are finite, our assumption that $\nu \ll \mu$ implies that $\nu_n \ll \mu_n$. Hence, by the preceding argument for each $n$ there exists $f_n \in m\mathcal{F}_+$ such that $\nu_n = f_n \mu_n$. With $\nu = \sum_n \nu_n$ and $\nu_n = (f_n I_{A_n})\mu$ (by the composition relation of Proposition 1.3.56), it follows that $\nu = f\mu$ for $f = \sum_n f_n I_{A_n} \in m\mathcal{F}_+$ finite valued.
As for the uniqueness of the Radon-Nikodym derivative $f$, suppose that $f\mu = g\mu$ for some $g \in m\mathcal{F}_+$ and a $\sigma$-finite measure $\mu$. Consider $E_n = D_n \cap \{s : g(s) - f(s) \ge 1/n,\ g(s) \le n\}$ and measurable $D_n \uparrow S$ such that $\mu(D_n) < \infty$. Then, necessarily both $\mu(f I_{E_n})$ and $\mu(g I_{E_n})$ are finite with
$$n^{-1} \mu(E_n) \le \mu((g - f) I_{E_n}) = (g\mu)(E_n) - (f\mu)(E_n) = 0 \,,$$
is contained in the negative set $D_n^c$ for $\nu - n^{-1}\mu$, it follows that $\nu(A^c) \le n^{-1}\mu(A^c)$. Taking $n \to \infty$ we deduce that $\nu(A^c) = 0$. If $\mu(D_n) = 0$ for all $n$ then $\mu(A) = 0$ and necessarily $\nu$ is singular with respect to $\mu$, contradicting the assumptions of the lemma. Therefore, $\mu(D_n) > 0$ for some finite $n$. Taking $\epsilon = n^{-1}$ and $B = D_n$ results with the thesis of the lemma.
Proof of Lebesgue decomposition. Our goal is to construct $f \in m\mathcal{F}_+$ such that the measure $\nu_s = \nu - f\mu$ is singular with respect to $\mu$. Since necessarily $\nu_s(A) \ge 0$ for any $A \in \mathcal{F}$, such a function $f$ must belong to
$$\mathcal{H} = \{h \in m\mathcal{F}_+ : \nu(A) \ge (h\mu)(A) \ \text{ for all } A \in \mathcal{F}\} \,.$$
That is, $\mathcal{H}$ is also closed under the formation of finite maxima and in particular, the function $\lim_n \max(h_1, \ldots, h_n)$ is in $\mathcal{H}$ for any $h_n \in \mathcal{H}$. Now let $\gamma = \sup\{(h\mu)(S) : h \in \mathcal{H}\}$, noting that $\gamma \le \nu(S)$ is finite. Choosing $h_n \in \mathcal{H}$ such that $(h_n\mu)(S) \ge \gamma - n^{-1}$ results with $f = \lim_n \max(h_1, \ldots, h_n)$ in $\mathcal{H}$ such that $(f\mu)(S) \ge \lim_n (h_n\mu)(S) = \gamma$. Since $f$ is an element of $\mathcal{H}$, both $\nu_{ac} = f\mu$ and $\nu_s = \nu - f\mu$ are finite measures.
If $\nu_s$ fails to be singular with respect to $\mu$ then by Lemma 4.1.11 there exist $B \in \mathcal{F}$ and $\epsilon > 0$ such that $\mu(B) > 0$ and $\nu_s(A) \ge \epsilon (I_B\mu)(A)$ for all $A \in \mathcal{F}$. Since $\nu = \nu_s + f\mu$, this implies that $f + \epsilon I_B \in \mathcal{H}$. However, $((f + \epsilon I_B)\mu)(S) \ge \gamma + \epsilon\mu(B) > \gamma$, contradicting the fact that $\gamma$ is the finite maximal value of $(h\mu)(S)$ over $h \in \mathcal{H}$. Consequently, this construction of $f$ has $\nu = f\mu + \nu_s$ with a finite measure $\nu_s$ that is singular with respect to $\mu$.
Finally, to prove the uniqueness of the Lebesgue decomposition, suppose there exist $f_1, f_2 \in m\mathcal{F}_+$ such that both $\nu - f_1\mu$ and $\nu - f_2\mu$ are singular with respect to $\mu$. That is, there exist $A_1, A_2 \in \mathcal{F}$ such that $\mu(A_i) = 0$ and $(\nu - f_i\mu)(A_i^c) = 0$ for $i = 1, 2$. Considering $A = A_1 \cup A_2$ it follows that $\mu(A) = 0$ and $(\nu - f_i\mu)(A^c) = 0$ for $i = 1, 2$. Consequently, for any $E \in \mathcal{F}$ we have that $(\nu - f_1\mu)(E) = \nu(E \cap A) = (\nu - f_2\mu)(E)$, proving the uniqueness of $\nu_s$, and hence of the decomposition of $\nu$ as $\nu_{ac} + \nu_s$.
We conclude with a simple application of the Radon-Nikodym theorem in conjunction with Lemma 1.3.8.
Exercise 4.1.13. Suppose $\mu$ and $\nu$ are two $\sigma$-finite measures on the same measurable space $(S, \mathcal{F})$ such that $\nu(A) \le \mu(A)$ for all $A \in \mathcal{F}$. Show that if $\nu(g) = \mu(g)$ is finite for some $g \in m\mathcal{F}$ such that $\mu(\{s : g(s) \le 0\}) = 0$, then $\nu(\cdot) = \mu(\cdot)$.
4.2. Properties of the conditional expectation
In some generic settings the C.E. is rather explicit. One such example is when $X$ is measurable on the conditioning $\sigma$-algebra $\mathcal{G}$.
$$E[X|\mathcal{H}] = EX \,.$$
We turn to derive various properties of the C.E. operation, starting with its positivity and linearity (per fixed conditioning $\sigma$-algebra).
Proposition 4.2.4. Let $X \in L^1(\Omega, \mathcal{F}, P)$ and set $Y = E[X|\mathcal{G}]$ for some $\sigma$-algebra $\mathcal{G} \subseteq \mathcal{F}$. Then,
(a) $EX = EY$.
(b) (Positivity) $X \ge 0$ a.s. $\Longrightarrow Y \ge 0$ a.s., and $X > 0$ a.s. $\Longrightarrow Y > 0$ a.s.
Proof. Considering $G = \Omega$ in the definition of the C.E. we find that
$$EX = E[X I_\Omega] = E[Y I_\Omega] = EY \,.$$
Turning to the positivity of the C.E. note that if $X \ge 0$ a.s. then $0 \le E[X I_G] = E[Y I_G] \le 0$ for $G = \{\omega : Y(\omega) \le 0\} \in \mathcal{G}$. Hence, in this case $E[Y I_{Y \le 0}] = 0$. That is, almost surely $Y \ge 0$. Further, $P(X > \epsilon, Y \le 0) \le \epsilon^{-1} E[X I_{X > \epsilon} I_{Y \le 0}] \le \epsilon^{-1} E[X I_{Y \le 0}] = 0$ for any $\epsilon > 0$, so $P(X > 0, Y = 0) = 0$ as well.
We next show that the C.E. operator is linear.
From its positivity and linearity we immediately get the monotonicity of the C.E.
Corollary 4.2.6 (Monotonicity). If $X, Y \in L^1(\Omega, \mathcal{F}, P)$ are such that $X \le Y$ a.s., then $E[X|\mathcal{G}] \le E[Y|\mathcal{G}]$ a.s. for any $\sigma$-algebra $\mathcal{G} \subseteq \mathcal{F}$.
In the following exercise you are to combine the linearity and positivity of the C.E. with Fubini's theorem.
Exercise 4.2.7. Show that if $X, Y \in L^1(\Omega, \mathcal{F}, P)$ are such that $E[X|Y] = Y$ and $E[Y|X] = X$, then almost surely $X = Y$.
Hint: First show that $E[(X - Y) I_{\{X > c \ge Y\}}] = 0$ for any non-random $c$.
We next deal with the relationship between the C.E.s of the same R.V. for nested conditioning $\sigma$-algebras.
Proposition 4.2.8 (Tower property). Suppose $X \in L^1(\Omega, \mathcal{F}, P)$ and the $\sigma$-algebras $\mathcal{H}$ and $\mathcal{G}$ are such that $\mathcal{H} \subseteq \mathcal{G} \subseteq \mathcal{F}$. Then, $E[X|\mathcal{H}] = E[E(X|\mathcal{G})|\mathcal{H}]$.
Remark. The tower property is also called the law of iterated expectations. Any $\sigma$-algebra $\mathcal{G}$ contains the trivial $\sigma$-algebra $\mathcal{F}_0 = \{\emptyset, \Omega\}$. Applying the tower property with $\mathcal{H} = \mathcal{F}_0$ and using the fact that $E[Y|\mathcal{F}_0] = EY$ for any integrable random variable $Y$, it follows that for any $\sigma$-algebra $\mathcal{G}$,
$$E[E(X|\mathcal{G})] = EX \,. \tag{4.2.1}$$
Proof. Let $Z = E[X|\mathcal{G}]$, which is well defined due to our assumption that $E|X| < \infty$. With $YZ \in m\mathcal{G}$ and $E|XY| < \infty$, it suffices to check that
$$E[XY I_A] = E[ZY I_A] \tag{4.2.2}$$
for any $A \in \mathcal{G}$.
Exercise 4.2.12. Let $Z = (X, Y)$ be a uniformly chosen point in $(0,1)^2$. That is, $X$ and $Y$ are independent random variables, each having the $U(0,1)$ measure of Example 1.1.26. Set $T = 2 I_A(Z) + 10 I_B(Z) + 4 I_C(Z)$, where $A = \{(x,y) : 0 < x < 1/4,\ 3/4 < y < 1\}$, $B = \{(x,y) : 1/4 < x < 3/4,\ 0 < y < 1/2\}$ and $C = \{(x,y) : 3/4 < x < 1,\ 1/4 < y < 1\}$.
(a) Find an explicit formula for the conditional expectation $W = E(T|X)$ and use it to determine the conditional expectation $U = E(TX|X)$.
(b) Find the value of $E[(T - W)\sin(e^X)]$.
Exercise 4.2.13. Fixing a positive integer $k$, compute $E(X|Y)$ in case $Y = kX - [kX]$ for $X$ having the $U(0,1)$ measure of Example 1.1.26 (and where $[x]$ denotes the integer part of $x$).
Exercise 4.2.14. Fixing $t \in \mathbb{R}$ and an integrable random variable $X$, let $Y = \max(X, t)$ and $Z = \min(X, t)$. Setting $a_t = E[X|X \le t]$ and $b_t = E[X|X \ge t]$, show that $E[X|Y] = Y I_{Y > t} + a_t I_{Y = t}$ and $E[X|Z] = Z I_{Z < t} + b_t I_{Z = t}$.
Exercise 4.2.15. Let $X, Y$ be i.i.d. random variables. Suppose $\xi$ is independent of $(X, Y)$, with $P(\xi = 1) = p$, $P(\xi = 0) = 1 - p$. Let $Z = (Z_1, Z_2)$, where $Z_1 = \xi X + (1 - \xi) Y$ and $Z_2 = \xi Y + (1 - \xi) X$.
Exercise 4.2.16. Suppose $EX^2 < \infty$ and define $\mathrm{Var}(X|\mathcal{G}) = E[(X - E(X|\mathcal{G}))^2 | \mathcal{G}]$.
(a) Show that $E[\mathrm{Var}(X|\mathcal{G}_2)] \le E[\mathrm{Var}(X|\mathcal{G}_1)]$ for any two $\sigma$-algebras $\mathcal{G}_1 \subseteq \mathcal{G}_2$ (that is, the dispersion of $X$ about its conditional mean decreases as the $\sigma$-algebra grows).
(b) Show that for any $\sigma$-algebra $\mathcal{G}$,
$$\mathrm{Var}[X] = E[\mathrm{Var}(X|\mathcal{G})] + \mathrm{Var}[E(X|\mathcal{G})] \,.$$
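The variance decomposition of part (b) can be seen numerically when $\mathcal{G}$ is generated by finitely many atoms, since then the conditional mean and variance are constant on each atom. A sketch, assuming numpy (the particular law of $X$ is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(3)
m = 200000

Z = rng.integers(0, 3, size=m)        # three atoms generating G = sigma(Z)
X = Z + rng.standard_normal(m)

# E(X|G) and Var(X|G) are constant on each atom {Z = z}:
cond_mean = np.array([X[Z == z].mean() for z in range(3)])[Z]
cond_var = np.array([X[Z == z].var() for z in range(3)])[Z]

total = cond_var.mean() + cond_mean.var()   # E[Var(X|G)] + Var[E(X|G)]
print(round(float(X.var()), 3), round(float(total), 3))   # the two agree
```

For empirical (sample) means and variances this identity even holds exactly, not just in the large-sample limit.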
Exercise 4.2.17. Suppose $N$ is a non-negative, integer valued R.V. which is independent of the independent, integrable R.V.-s $\xi_i$ on the same probability space, and that $\sum_i P(N \ge i) E|\xi_i|$ is finite.
(a) Check that
$$X(\omega) = \sum_{i=1}^{N(\omega)} \xi_i(\omega) \,,$$
is integrable and deduce that $EX = \sum_i P(N \ge i) E\xi_i$.
(b) Suppose in addition that $\xi_i$ are identically distributed, in which case this is merely Wald's identity $EX = EN \, E\xi_1$. Show that if both $\xi_1$ and $N$ are square-integrable, then so is $X$ and
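Wald's identity from part (b) is easy to check by simulation. A sketch, assuming numpy; the Poisson/exponential choice is just one convenient example of an $(N, \xi_i)$ pair:

```python
import numpy as np

rng = np.random.default_rng(4)
trials = 50000

N = rng.poisson(5.0, size=trials)     # EN = 5, independent of the xi's
# X = sum of N i.i.d. Exp(mean 2) variables, so E xi_1 = 2:
X = np.array([rng.exponential(2.0, size=k).sum() for k in N])

print(round(float(X.mean()), 1))      # Wald: EX = EN * E xi_1 = 10
```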
As shown in the sequel, per fixed conditioning $\sigma$-algebra we can interpret the C.E. as an expectation in a different (conditional) probability space. Indeed, every property of the expectation has a corresponding extension to the C.E. For example, the extension of Jensen's inequality is
Proposition 4.2.19 (Jensen's inequality). Suppose $g(\cdot)$ is a convex function on an open interval $G$ of $\mathbb{R}$, that is,
$$\lambda g(x) + (1 - \lambda) g(y) \ge g(\lambda x + (1 - \lambda) y) \qquad \forall x, y \in G, \ 0 \le \lambda \le 1 \,.$$
$$g(x) \ge g(c) + (D_- g)(c)(x - c) \qquad \forall c, x \in G \,.$$
Further, with $(D_- g)(\cdot)$ a finite, non-decreasing function on $G$ where $g(\cdot)$ is continuous, it follows that
$$g(x) = \sup_{c \in G \cap \mathbb{Q}} \{g(c) + (D_- g)(c)(x - c)\} = \sup_n \{a_n x + b_n\} \,.$$
$$E[|E(X|\mathcal{H})|^q] = \|E(X|\mathcal{H})\|_q^q \,.$$
(b) Deduce the conditional version of Markov's inequality, that for any $a > 0$,
$$P(|X| \ge a \,|\, \mathcal{G}) \le a^{-p} E[|X|^p \,|\, \mathcal{G}] \,.$$
Proof. Let $Y_m = E[X_m|\mathcal{G}] \in m\mathcal{G}_+$. By monotonicity of the C.E. we have that the sequence $Y_m$ is a.s. non-decreasing, hence it has a limit $Y_\infty \in m\mathcal{G}_+$ (possibly infinite). We complete the proof by showing that $Y_\infty = E[X_\infty|\mathcal{G}]$. Indeed, for any $G \in \mathcal{G}$,
$$E[Y_\infty I_G] = \lim_m E[Y_m I_G] = \lim_m E[X_m I_G] = E[X_\infty I_G] \,,$$
where since $Y_m \uparrow Y_\infty$ and $X_m \uparrow X_\infty$ the first and third equalities follow by the monotone convergence theorem (the unconditional version), and the second equality from the definition of the C.E. $Y_m$. Considering $G = \Omega$ we see that $Y_\infty$ is integrable. In conclusion, $E[X_m|\mathcal{G}] = Y_m \uparrow Y_\infty = E[X_\infty|\mathcal{G}]$, as claimed.
Lemma 4.2.25 (Fatou's lemma for C.E.). If the non-negative, integrable $X_n$ on the same measurable space $(\Omega, \mathcal{F})$ are such that $\liminf_n X_n$ is integrable, then for any $\sigma$-algebra $\mathcal{G} \subseteq \mathcal{F}$,
$$E\big[\liminf_n X_n \,\big|\, \mathcal{G}\big] \le \liminf_n E[X_n|\mathcal{G}] \quad \text{a.s.}$$
Proof. Applying the monotone convergence theorem for the C.E. of the non-decreasing sequence of non-negative R.V.s $Z_n = \inf\{X_k : k \ge n\}$ (whose limit is the integrable $\liminf_n X_n$), results with
$$E\big[\liminf_n X_n \,\big|\, \mathcal{G}\big] = E(\lim_n Z_n | \mathcal{G}) = \lim_n E[Z_n|\mathcal{G}] \quad \text{a.s.} \tag{4.2.3}$$
Since $Z_n \le X_n$ it follows that $E[Z_n|\mathcal{G}] \le E[X_n|\mathcal{G}]$ for all $n$ and
$$\lim_n E[Z_n|\mathcal{G}] = \liminf_n E[Z_n|\mathcal{G}] \le \liminf_n E[X_n|\mathcal{G}] \quad \text{a.s.} \tag{4.2.4}$$
Upon combining (4.2.3) and (4.2.4) we obtain the thesis of the lemma.
Fatou's lemma leads to the C.E. version of the dominated convergence theorem.
Theorem 4.2.26 (Dominated convergence for C.E.). If $\sup_m |X_m|$ is integrable and $X_m \xrightarrow{a.s.} X_\infty$, then $E[X_m|\mathcal{G}] \xrightarrow{a.s.} E[X_\infty|\mathcal{G}]$.
Proof. Let $Y = \sup_m |X_m|$ and $Z_m = Y - X_m \ge 0$. Applying Fatou's lemma for the C.E. of the non-negative, integrable R.V.s $Z_m \le 2Y$, we see that
$$E\big[\liminf_m Z_m \,\big|\, \mathcal{G}\big] \le \liminf_m E[Z_m|\mathcal{G}] \quad \text{a.s.}$$
Applying the same reasoning for $Z'_m = Y + X_m \ge 0$ yields the companion bound, and together the two inequalities show that a.s. the $\liminf$ and $\limsup$ of the sequence $E[X_m|\mathcal{G}]$ coincide and are equal to $E[X_\infty|\mathcal{G}]$, as stated.
Exercise 4.2.27. Let $X_1, X_2$ be random variables defined on the same probability space $(\Omega, \mathcal{F}, P)$ and $\mathcal{G} \subseteq \mathcal{F}$ a $\sigma$-algebra. Prove that (a), (b) and (c) below are equivalent.
(a) For any Borel sets $B_1$ and $B_2$,
$$P(X_1 \in B_1, X_2 \in B_2 | \mathcal{G}) = P(X_1 \in B_1 | \mathcal{G}) P(X_2 \in B_2 | \mathcal{G}) \,.$$
that $E(|X_n - X_\infty|^q) \to 0$. Then, $E[X_n|\mathcal{G}] \to E[X_\infty|\mathcal{G}]$ for any $\sigma$-algebra $\mathcal{G} \subseteq \mathcal{F}$.
As you will show, the C.E. operation is also continuous with respect to the following topology of weak $L^q$ convergence.
Definition 4.2.31. Let $L^\infty(\Omega, \mathcal{F}, P)$ denote the collection of all random variables on $(\Omega, \mathcal{F})$ which are P-a.s. bounded, with $\|Y\|_\infty$ denoting the smallest non-random $K$ such that $P(|Y| \le K) = 1$. Setting $p(q) : [1, \infty] \to [1, \infty]$ via $p(q) = q/(q-1)$, we
Deduce that if $X_n \xrightarrow{wL^q} X_\infty$ then $E[X_n|\mathcal{G}] \xrightarrow{wL^q} E[X_\infty|\mathcal{G}]$ for any $\sigma$-algebra $\mathcal{G} \subseteq \mathcal{F}$.
In view of Example 4.2.20 we already know that for each integrable random variable $X$ the collection $\{E[X|\mathcal{G}] : \mathcal{G} \subseteq \mathcal{F} \text{ is a } \sigma\text{-algebra}\}$ is bounded in $L^1(\Omega, \mathcal{F}, P)$. As we show next, this collection is even uniformly integrable (U.I.), a key fact in our study of uniformly integrable martingales (see Subsection 5.3.1).
Proposition 4.2.33. For any $X \in L^1(\Omega, \mathcal{F}, P)$, the collection $\{E[X|\mathcal{H}] : \mathcal{H} \subseteq \mathcal{F} \text{ is a } \sigma\text{-algebra}\}$ is U.I.
Proof. Fixing $\epsilon > 0$, let $\delta = \delta(X, \epsilon) > 0$ be as in part (b) of Exercise 1.3.43 and set the finite constant $M = \delta^{-1} E|X|$. By Markov's inequality and Example 4.2.20 we get that $M P(A) \le E|Y| \le E|X|$ for $A = \{|Y| \ge M\} \in \mathcal{H}$ and $Y = E[X|\mathcal{H}]$. Hence, $P(A) \le \delta$ by our choice of $M$, whereby our choice of $\delta$ results with $E[|X| I_A] \le \epsilon$ (c.f. part (b) of Exercise 1.3.43). Further, by (the conditional) Jensen's inequality $|Y| \le E[|X|\,|\mathcal{H}]$ (see Example 4.2.20). Therefore, by definition of the C.E. $E[|X|\,|\mathcal{H}]$,
$$E[|Y| I_{|Y| > M}] \le E[|Y| I_A] \le E\big[E[|X|\,|\mathcal{H}] I_A\big] = E[|X| I_A] \le \epsilon \,.$$
Since this applies for any $\sigma$-algebra $\mathcal{H} \subseteq \mathcal{F}$ and the value of $M = M(X, \epsilon)$ does not depend on $Y$, we conclude that the collection of such $Y = E[X|\mathcal{H}]$ is U.I.
To check your understanding of the preceding derivation, prove the following natural extension of Proposition 4.2.33.
Exercise 4.2.34. Let $C$ be a uniformly integrable collection of random variables on $(\Omega, \mathcal{F}, P)$. Show that the collection $D$ of all R.V. $Y$ such that $Y \stackrel{a.s.}{=} E[X|\mathcal{H}]$ for some $X \in C$ and $\sigma$-algebra $\mathcal{H} \subseteq \mathcal{F}$, is U.I.
Here is a somewhat counter-intuitive fact about the conditional expectation.
(a) Show that $E[Y_n|\mathcal{G}] \to E[Y|\mathcal{G}]$ in probability, for any $\sigma$-algebra $\mathcal{G} \subseteq \mathcal{F}$.
(b) Provide an example of such a sequence $\{Y_n\}$ and a $\sigma$-algebra $\mathcal{G} \subseteq \mathcal{F}$ such that $E[Y_n|\mathcal{G}]$ does not converge almost surely to $E[Y|\mathcal{G}]$.
4.3. The conditional expectation as an orthogonal projection
It readily follows from our next proposition that for $X \in L^2(\Omega, \mathcal{F}, P)$ and $\sigma$-algebras $\mathcal{G} \subseteq \mathcal{F}$ the C.E. $Y = E[X|\mathcal{G}]$ is the unique $Y \in L^2(\Omega, \mathcal{G}, P)$ such that
$$\|X - Y\|_2 = \inf\{\|X - W\|_2 : W \in L^2(\Omega, \mathcal{G}, P)\} \,. \tag{4.3.1}$$
$$E[(X - Y) Z] = 0 \qquad \text{for all } Z \in L^2(\Omega, \mathcal{G}, P) \,.$$
Example 4.3.2. If $\mathcal{G} = \sigma(A_1, \ldots, A_n)$ for finite $n$ and disjoint sets $A_i$ such that $P(A_i) > 0$ for $i = 1, \ldots, n$, then $L^2(\Omega, \mathcal{G}, P)$ consists of all variables of the form $W = \sum_{i=1}^n v_i I_{A_i}$, $v_i \in \mathbb{R}$. A R.V. $Y$ of this form satisfies (4.3.1) if and only if the corresponding $\{v_i\}$ minimizes
$$E\Big[\Big(X - \sum_{i=1}^n v_i I_{A_i}\Big)^2\Big] - EX^2 = \sum_{i=1}^n P(A_i) v_i^2 - 2 \sum_{i=1}^n v_i E[X I_{A_i}] \,,$$
which amounts to $v_i = E[X I_{A_i}]/P(A_i)$. In particular, if $Z = \sum_{i=1}^n z_i I_{A_i}$ for distinct $z_i$-s, then $\sigma(Z) = \mathcal{G}$ and we thus recover our first definition of the C.E.
$$E[X|Z] = \sum_{i=1}^n \frac{E[X I_{Z = z_i}]}{P(Z = z_i)} I_{Z = z_i} \,.$$
Exercise 4.3.6. Let $\|h\| = (h, h)^{1/2}$ with $(h_1, h_2)$ an inner product for a linear vector space $\mathbb{H}$. Show that Schwarz's inequality
$$(u, v)^2 \le \|u\|^2 \|v\|^2 \,,$$
and the parallelogram law $\|u + v\|^2 + \|u - v\|^2 = 2\|u\|^2 + 2\|v\|^2$ hold for any $u, v \in \mathbb{H}$.
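Both identities in Exercise 4.3.6 are easy to spot-check for the Euclidean inner product on $\mathbb{R}^5$; a sketch, assuming numpy:

```python
import numpy as np

rng = np.random.default_rng(6)
u, v = rng.standard_normal(5), rng.standard_normal(5)

sq = lambda x: float(np.dot(x, x))   # ||x||^2 for the Euclidean inner product
lhs = sq(u + v) + sq(u - v)
rhs = 2 * sq(u) + 2 * sq(v)
schwarz_holds = np.dot(u, v) ** 2 <= sq(u) * sq(v)

print(abs(lhs - rhs) < 1e-9, schwarz_holds)   # True True
```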
Our next proposition shows that for each finite $q \ge 1$ the space $L^q(\Omega, \mathcal{F}, P)$ is a Banach space for the norm $\|\cdot\|_q$, the usual addition of R.V.s and the multiplication of a R.V. $X(\omega)$ by a non-random (scalar) constant. Further, $L^2(\Omega, \mathcal{G}, P)$ is a Hilbert sub-space of $L^2(\Omega, \mathcal{F}, P)$ for any $\sigma$-algebras $\mathcal{G} \subseteq \mathcal{F}$.
Proposition 4.3.7. Upon identifying $\mathbb{R}$-valued R.V. which are equal with probability one as being in the same equivalence class, for each $q \ge 1$ and $\sigma$-algebra $\mathcal{F}$, the space $L^q(\Omega, \mathcal{F}, P)$ is a Banach space for the norm $\|\cdot\|_q$. Further, $L^2(\Omega, \mathcal{G}, P)$ is then a Hilbert sub-space of $L^2(\Omega, \mathcal{F}, P)$ for the inner product $(X, Y) = EXY$ and any $\sigma$-algebras $\mathcal{G} \subseteq \mathcal{F}$.
Proof. Fixing $q \ge 1$, we identify $X$ and $Y$ such that $P(X \neq Y) = 0$ as being the same element of $L^q(\Omega, \mathcal{F}, P)$. The resulting set of equivalence classes is a normed vector space. Indeed, both $\|\cdot\|_q$, the addition of R.V. and the multiplication by a non-random scalar are compatible with this equivalence relation. Further, if $X, Y \in L^q(\Omega, \mathcal{F}, P)$ then $\|\alpha X\|_q = |\alpha| \|X\|_q < \infty$ for all $\alpha \in \mathbb{R}$ and by Minkowski's inequality $\|X + Y\|_q \le \|X\|_q + \|Y\|_q < \infty$. Consequently, $L^q(\Omega, \mathcal{F}, P)$ is closed under the operations of addition and multiplication by a non-random scalar, with $\|\cdot\|_q$ a norm on this collection of equivalence classes.
Suppose next that $\{X_n\} \subseteq L^q$ is a Cauchy sequence for $\|\cdot\|_q$. Then, by definition, there exist $k_n \uparrow \infty$ such that $\|X_r - X_s\|_q^q < 2^{-n(q+1)}$ for all $r, s \ge k_n$. Observe that by Markov's inequality
$$P(|X_{k_{n+1}} - X_{k_n}| \ge 2^{-n}) \le 2^{nq} \|X_{k_{n+1}} - X_{k_n}\|_q^q < 2^{-n} \,,$$
so by the first Borel-Cantelli lemma, for a.e. $\omega$ the sequence $\{X_{k_n}(\omega)\}$ is eventually Cauchy in $\mathbb{R}$, hence converges to a finite limit $X(\omega)$. Next let $X = \limsup_n X_{k_n}$ (which per Theorem 1.2.22 is an $\mathbb{R}$-valued R.V.). Then, fixing $n$ and $r \ge k_n$, for any $t \ge n$,
$$E\big[|X_r - X_{k_t}|^q\big] = \|X_r - X_{k_t}\|_q^q \le 2^{-nq} \,,$$
so that by the a.s. convergence of $X_{k_t}$ to $X$ and Fatou's lemma
$$E|X_r - X|^q = E\big[\lim_t |X_r - X_{k_t}|^q\big] \le \liminf_t E|X_r - X_{k_t}|^q \le 2^{-nq} \,.$$
Exercise 4.3.8. For $q \ge 1$ finite and a given Banach space $(\mathbb{Y}, \|\cdot\|)$, consider the space $L^q(S, \mathcal{F}, \mu; \mathbb{Y})$ of all $\mu$-a.e. equivalence classes of functions $f : S \mapsto \mathbb{Y}$, measurable with respect to the Borel $\sigma$-algebra induced on $\mathbb{Y}$ by $\|\cdot\|$ and such that $\mu(\|f(\cdot)\|^q) < \infty$.
(a) Show that $\|f\|_q = \mu(\|f(\cdot)\|^q)^{1/q}$ makes $L^q(S, \mathcal{F}, \mu; \mathbb{Y})$ into a Banach space.
(b) For future applications of the preceding, verify that the space $\mathbb{Y} = C_b(\mathbb{T})$ of bounded, continuous real-valued functions on a topological space $\mathbb{T}$ is a Banach space for the supremum norm $\|f\|_\infty = \sup\{|f(t)| : t \in \mathbb{T}\}$.
$$\|h - g_k\|^2 + \|h - g_m\|^2 = 2\big\|h - \tfrac{1}{2}(g_m + g_k)\big\|^2 + 2\big\|\tfrac{1}{2}(g_m - g_k)\big\|^2 \ge 2d^2 + \tfrac{1}{2}\|g_m - g_k\|^2$$
since $\frac{1}{2}(g_m + g_k) \in \mathbb{G}$. Taking $k, m \to \infty$, both $\|h - g_k\|^2$ and $\|h - g_m\|^2$ approach $d^2$ and hence by the preceding inequality $\|g_m - g_k\| \to 0$. In conclusion, $\{g_n\}$ is a Cauchy sequence in the Hilbert sub-space $\mathbb{G}$, which thus converges to some $\widehat{h} \in \mathbb{G}$. Recall that $\|h - \widehat{h}\| \ge d$ by the definition of $d$. Since for $n \to \infty$ both $\|h - g_n\| \to d$ and $\|g_n - \widehat{h}\| \to 0$, the converse inequality is a consequence of the triangle inequality $\|h - \widehat{h}\| \le \|h - g_n\| + \|g_n - \widehat{h}\|$.
Next, suppose there exist $g_1, g_2 \in \mathbb{G}$ such that $(h - g_i, f) = 0$ for $i = 1, 2$ and all $f \in \mathbb{G}$. Then, by linearity of the inner product $(g_1 - g_2, f) = 0$ for all $f \in \mathbb{G}$. Considering $f = g_1 - g_2 \in \mathbb{G}$ we see that $(g_1 - g_2, g_1 - g_2) = \|g_1 - g_2\|^2 = 0$, so necessarily $g_1 = g_2$.
We complete the proof by showing that $\widehat{h} \in \mathbb{G}$ is such that $\|h - \widehat{h}\|^2 \le \|h - g\|^2$ for all $g \in \mathbb{G}$ if and only if $(h - \widehat{h}, f) = 0$ for all $f \in \mathbb{G}$. This is done exactly as in the proof of Proposition 4.3.1. That is, by symmetry and bi-linearity of the inner product, for all $f \in \mathbb{G}$ and $\lambda \in \mathbb{R}$,
$$\|h - \widehat{h} - \lambda f\|^2 - \|h - \widehat{h}\|^2 = \lambda^2 \|f\|^2 - 2\lambda (h - \widehat{h}, f) \,.$$
We arrive at the stated conclusion upon noting that fixing $f$, this function is non-negative for all $\lambda$ if and only if $(h - \widehat{h}, f) = 0$.
Exercise 4.3.11. Show that for any non-negative integrable $X$, not necessarily in $L^2$, the sequence $Y_n \in \mathbb{G}$ corresponding to $X_n = \min(X, n)$ is non-decreasing and that its limit $Y$ satisfies (4.1.1). Verify that this allows you to prove Theorem 4.1.2 without ever invoking the Radon-Nikodym theorem.
Exercise 4.3.12. Suppose $\mathcal{G} \subseteq \mathcal{F}$ is a $\sigma$-algebra.
(a) Show that for any $X \in L^1(\Omega, \mathcal{F}, P)$ there exists some $G \in \mathcal{G}$ such that
$$E[X I_G] = \sup_{A \in \mathcal{G}} E[X I_A] \,.$$
(d) Let $\gamma = \sup E[E(X|\mathcal{H})^2]$, where the supremum is over all finite $\sigma$-algebras $\mathcal{H} \subseteq \mathcal{G}$. Show that $\gamma$ is finite, and that there exists an increasing sequence of finite $\sigma$-algebras $\mathcal{H}_n$ such that $E[E(X|\mathcal{H}_n)^2] \to \gamma$ as $n \to \infty$.
(e) Let $\mathcal{H}_\infty = \sigma(\cup_n \mathcal{H}_n)$ and $Y_n = E[X|\mathcal{H}_n]$ for the $\mathcal{H}_n$ in part (d). Explain why your proof of part (b) implies the $L^2$ convergence of $Y_n$ to a R.V. $Y$ such that $E[Y I_A] = E[X I_A]$ for any $A \in \mathcal{H}_\infty$.
(f) Fixing $A \in \mathcal{G}$ such that $A \notin \mathcal{H}_\infty$, let $\mathcal{H}_{n,A} = \sigma(A, \mathcal{H}_n)$ and $Z_n = E[X|\mathcal{H}_{n,A}]$. Explain why some sub-sequence of $\{Z_n\}$ has an a.s. and $L^2$ limit, denoted $Z$. Show that $EZ^2 = EY^2 = \gamma$ and deduce that $E[(Y - Z)^2] = 0$, hence $Z = Y$ a.s.
(g) Show that $Y$ is a version of the C.E. $E[X|\mathcal{G}]$.
Proof. Since the Borel function $h(x, z) = g(x) f_{X,Z}(x, z)$ is integrable with respect to Lebesgue's measure on $(\mathbb{R}^2, \mathcal{B}_{\mathbb{R}^2})$, it follows that $\widehat{g}(\cdot)$ is also a Borel function (c.f. our proof of Fubini's theorem). Further, by Fubini's theorem the integrability of $g(X)$ implies that $\lambda(\mathbb{R} \setminus A) = 0$ for $A = \{z : \int |g(x)| f_{X,Z}(x, z) dx < \infty\}$, and with $P_Z = f_Z \lambda$ this implies that $P(Z \in A) = 1$. By Jensen's inequality,
$$|\widehat{g}(z)| \le \int |g(x)| f_{X|Z}(x|z) dx \,, \qquad \forall z \in A \,.$$
Thus, by Fubini's theorem and the definition of $f_{X|Z}$ we have that
$$\infty > E|g(X)| = \int |g(x)| f_X(x) dx \ge \int |g(x)| \Big[\int_A f_{X|Z}(x|z) f_Z(z) dz\Big] dx = \int_A \Big[\int |g(x)| f_{X|Z}(x|z) dx\Big] f_Z(z) dz \ge \int_A |\widehat{g}(z)| f_Z(z) dz = E|\widehat{g}(Z)| \,.$$
So, $\widehat{g}(Z)$ is integrable. With (4.4.1) holding for all $z \in A$ and $P(Z \in A) = 1$, by Fubini's theorem and the definition of $f_{X|Z}$ we have that for any Borel set $B$,
$$E[\widehat{g}(Z) I_B(Z)] = \int \widehat{g}(z) I_{B \cap A}(z) f_Z(z) dz = \int \Big[\int g(x) f_{X|Z}(x|z) dx\Big] I_{B \cap A}(z) f_Z(z) dz = \int_{\mathbb{R}^2} g(x) I_{B \cap A}(z) f_{X,Z}(x, z) dx\, dz = E[g(X) I_B(Z)] \,.$$
To each conditional probability density $f_{X|Z}(\cdot|\cdot)$ corresponds the collection of conditional probability measures $\widehat{P}_{X|Z}(B, z) = \int_B f_{X|Z}(x|z) dx$. The remainder of this section deals with the following generalization of the latter object.
Definition 4.4.2. Let $Y : \Omega \mapsto S$ be an $(S, \mathcal{S})$-valued R.V. in the probability space $(\Omega, \mathcal{F}, P)$, per Definition 1.2.1, and $\mathcal{G} \subseteq \mathcal{F}$ a $\sigma$-algebra. The collection $\widehat{P}_{Y|\mathcal{G}}(\cdot, \omega) : \mathcal{S} \mapsto [0, 1]$ is called the regular conditional probability distribution (R.C.P.D.) of $Y$ given $\mathcal{G}$ if:
(a) $\widehat{P}_{Y|\mathcal{G}}(A, \omega)$ is a version of the C.E. $E[I_{Y \in A}|\mathcal{G}]$ for each fixed $A \in \mathcal{S}$.
(b) For any fixed $\omega \in \Omega$, the set function $\widehat{P}_{Y|\mathcal{G}}(\cdot, \omega)$ is a probability measure on $(S, \mathcal{S})$.
In case $S = \Omega$, $\mathcal{S} = \mathcal{F}$ and $Y(\omega) = \omega$, we call this collection the regular conditional probability (R.C.P.) on $\mathcal{F}$ given $\mathcal{G}$, denoted also by $\widehat{P}(A|\mathcal{G})(\omega)$.
If the R.C.P. exists, then we can define all conditional expectations through the R.C.P. Unfortunately, the R.C.P. might not exist (see [Bil95, Exercise 33.11] for an example in which there exists no R.C.P. on $\mathcal{F}$ given $\mathcal{G}$).
Recall that each C.E. is uniquely determined only a.e. Hence, for any countable collection of disjoint sets $A_n \in \mathcal{F}$ there is possibly a set of $\omega$ of probability zero for which a given collection of C.E. is such that
$$P\Big(\bigcup_n A_n \,\Big|\, \mathcal{G}\Big)(\omega) \neq \sum_n P(A_n|\mathcal{G})(\omega) \,.$$
Thus, to each $\omega \in \Omega$ corresponds a unique probability measure $\widehat{P}(\cdot, \omega)$ on $(\mathbb{R}, \mathcal{B})$ such that $\widehat{P}((-\infty, x], \omega) = F(x, \omega)$ for all $x \in \mathbb{R}$ (recall Theorem 1.2.37 for its existence and Proposition 1.2.45 for its uniqueness).
Note that $G(q, \omega) \in m\mathcal{G}$ for all $q \in \mathbb{Q}$, hence so is $F(x, \omega)$ for each $x \in \mathbb{R}$ (see Theorem 1.2.22). It follows that $\{B \in \mathcal{B} : \widehat{P}(B, \omega) \in m\mathcal{G}\}$ is a $\lambda$-system (see Corollary 1.2.19 and Theorem 1.2.22), containing the $\pi$-system $\mathcal{P} = \{\mathbb{R}, (-\infty, q] : q \in \mathbb{Q}\}$, hence by Dynkin's theorem $\widehat{P}(B, \omega) \in m\mathcal{G}$ for all $B \in \mathcal{B}$. Further, for $\omega \notin D$ and $q \in \mathbb{Q}$,
$$H(q, \omega) = G(q, \omega) \le F(q, \omega) \le G(q + n^{-1}, \omega) = H(q + n^{-1}, \omega) \downarrow H(q, \omega)$$
Remark. The reason behind Proposition 4.4.3 is that $\sigma(X)$ inherits the structure of the Borel $\sigma$-algebra $\mathcal{B}$, which in turn is not too big due to the fact that the rational numbers are dense in $\mathbb{R}$. Indeed, as you are to deduce in the next exercise, there exists a R.C.P.D. for any $(S, \mathcal{S})$-valued R.V. $X$ with a $\mathcal{B}$-isomorphic $(S, \mathcal{S})$.
Exercise 4.4.4. Suppose $(S, \mathcal{S})$ is $\mathcal{B}$-isomorphic, that is, there exists a Borel set $\mathbb{T}$ (equipped with the induced Borel $\sigma$-algebra $\mathcal{T} = \{B \cap \mathbb{T} : B \in \mathcal{B}\}$) and a one to one and onto mapping $g : S \mapsto \mathbb{T}$ such that both $g$ and $g^{-1}$ are measurable. For any $\sigma$-algebra $\mathcal{G}$ and $(S, \mathcal{S})$-valued R.V. $X$, let $\widehat{P}_{Y|\mathcal{G}}(\cdot, \omega)$ denote the R.C.P.D. of the real-valued random variable $Y = g(X)$.
(a) Explain why without loss of generality $\widehat{P}_{Y|\mathcal{G}}(\mathbb{T}, \omega) = 1$ for all $\omega \in \Omega$.
(b) Verify that for any $A \in \mathcal{S}$ both $\{\omega : X(\omega) \in A\} = \{\omega : Y(\omega) \in g(A)\}$ and $g(A) \in \mathcal{B}$.
(c) Deduce that $\widehat{Q}(A, \omega) = \widehat{P}_{Y|\mathcal{G}}(g(A), \omega)$ is the R.C.P.D. of $X$ given $\mathcal{G}$.
Exercise 4.4.5. Suppose $(S, \mathcal{S})$ is $\mathcal{B}$-isomorphic and $X$ and $Y$ are $(S, \mathcal{S})$-valued R.V. in the same probability space $(\Omega, \mathcal{F}, P)$. Prove that there exists a (regular) transition probability $\widehat{P}_{X|Y}(\cdot, \cdot) : S \times \mathcal{S} \mapsto [0, 1]$ such that
(a) For each $A \in \mathcal{S}$ fixed, $y \mapsto \widehat{P}_{X|Y}(y, A)$ is a measurable function and $\widehat{P}_{X|Y}(Y(\omega), A)$ is a version of the C.E. $E[I_{X \in A}|\sigma(Y)](\omega)$.
of the C.E. in terms of the corresponding R.C.P.D. (with the right side denoting the Lebesgue integral of Definition 1.3.1 for the probability space $(\mathbb{R}, \mathcal{B}, \widehat{P}_{X|\mathcal{G}}(\cdot, \omega))$).
Solving the next exercise should improve your understanding of the relation between the R.C.P.D. and the conditional probability density function.
Solving the next exercise should improve your understanding of the relation between the R.C.P.D. and the conditional probability density function.
Exercise 4.4.7. Suppose that the random vector (X, Y, Z) has a probability density
function fX,Y,Z per Definition 3.5.5.
b Y |(X,Z) in terms of fX,Y,Z .
(a) Express the R.C.P.D. P
(b) Using this expression show that if X is independent of (Y, Z), then
E[Y |X, Z] = E[Y |Z] .
Exercise 4.4.10. Suppose $(X, Y)$ are distributed according to a multivariate normal distribution, with $EX = EY = 0$ and $EY^2 > 0$. Show that $E[X|Y] = \alpha Y$ with $\alpha = E[XY]/EY^2$.
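The linearity of the conditional expectation in Exercise 4.4.10 shows up clearly in simulation: averaging $X$ over bins of $Y$ lands on the line with slope $E[XY]/EY^2$. A sketch, assuming numpy, with an arbitrarily chosen jointly normal pair:

```python
import numpy as np

rng = np.random.default_rng(7)
m = 200000

Y = rng.standard_normal(m)
X = 0.7 * Y + rng.standard_normal(m)      # (X, Y) jointly normal, zero mean

alpha = np.mean(X * Y) / np.mean(Y ** 2)  # E[XY] / EY^2, here ~ 0.7
# Averaging X over bins of Y approximates E[X|Y]; it should sit on alpha * Y:
bins = np.digitize(Y, [-1.0, 0.0, 1.0])
for b in range(4):
    assert abs(X[bins == b].mean() - alpha * Y[bins == b].mean()) < 0.03
print(round(float(alpha), 1))
```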
CHAPTER 5
Definition 5.1.3. The filtration $\{\mathcal{F}_n^X\}$ with $\mathcal{F}_n^X = \sigma(X_0, X_1, \ldots, X_n)$ is the minimal filtration with respect to which $\{X_n\}$ is adapted. We therefore call it the canonical filtration for the S.P. $\{X_n\}$.
Whenever it is clear from the context what it means, we shall use the notation $X_n$ both for the whole S.P. $\{X_n\}$ and for the $n$-th R.V. of this process, and likewise we may sometimes use $\mathcal{F}_n$ to denote the whole filtration $\{\mathcal{F}_n\}$.
A martingale consists of a filtration and an adapted S.P. which can represent the outcome of a "fair gamble". That is, the expected future reward given current information is exactly the current value of the process, or as a rigorous definition:
Definition 5.1.4. A martingale (denoted MG) is a pair $(X_n, \mathcal{F}_n)$, where $\{\mathcal{F}_n\}$ is a filtration and $\{X_n\}$ is an integrable S.P., that is, $E|X_n| < \infty$ for all $n$, adapted to this filtration, such that
$$E[X_{n+1}|\mathcal{F}_n] = X_n \qquad \forall n, \ \text{a.s.} \tag{5.1.1}$$
and the alternative expression for the martingale property follows from (5.1.1).
Our first example of a martingale is the random walk, perhaps the most fundamental stochastic process.
Definition 5.1.6. The random walk is the stochastic process $S_n = S_0 + \sum_{k=1}^n \xi_k$ with real-valued, independent, identically distributed $\{\xi_k\}$ which are also independent of $S_0$. Unless explicitly stated otherwise, we always set $S_0$ to be zero. We say that the random walk is symmetric if the law of $\xi_k$ is the same as that of $-\xi_k$. We
call it a simple random walk (on $\mathbb{Z}$), in short srw, if $\xi_k \in \{-1, 1\}$. The srw is completely characterized by the parameter $p = P(\xi_k = 1)$, which is always assumed to be in $(0, 1)$ (or alternatively, by $q = 1 - p = P(\xi_k = -1)$). Thus, the symmetric srw corresponds to $p = 1/2 = q$ (and the asymmetric srw corresponds to $p \neq 1/2$).
The random walk is a MG (with respect to its canonical filtration), whenever $E|\xi_1| < \infty$ and $E\xi_1 = 0$.
Remark. More generally, such partial sums $\{S_n\}$ form a MG even when the independent and integrable R.V. $\xi_k$ of zero mean have non-identical distributions, and the canonical filtration of $\{S_n\}$ is merely $\{\mathcal{F}_n\}$, where $\mathcal{F}_n = \sigma(\xi_1, \ldots, \xi_n)$. Indeed, this is an application of Proposition 5.1.5 for independent, integrable $D_k = S_k - S_{k-1} = \xi_k$, $k \ge 1$ (with $D_0 = 0$), where $E[D_{n+1}|D_0, D_1, \ldots, D_n] = ED_{n+1} = 0$ for all $n \ge 0$ by our assumption that $E\xi_k = 0$ for all $k$.
Definition 5.1.7. We say that a stochastic process $\{X_n\}$ is square-integrable if $EX_n^2 < \infty$ for all $n$. Similarly, we call a martingale $(X_n, \mathcal{F}_n)$ such that $EX_n^2 < \infty$ for all $n$ an $L^2$-MG (or a square-integrable MG).
Square-integrable martingales have zero-mean, uncorrelated differences and admit
an elegant decomposition of conditional second moments.
Exercise 5.1.8. Suppose $(X_n, \mathcal{F}_n)$ and $(Y_n, \mathcal{F}_n)$ are square-integrable martingales.
$$E\Big[\Big(\sum_{k=1}^{n} D_k^2\Big)^2\Big] \le 6 C^4 n^2 \,.$$
Remark. A square-integrable stochastic process with zero-mean, mutually independent differences is necessarily a martingale (consider Proposition 5.1.5). So, in view of part (a) of Exercise 5.1.8, the MG property is between the more restrictive requirement of having zero-mean, independent differences, and the not as useful property of just having zero-mean, uncorrelated differences. While in general these three conditions are not the same, as you show next they do coincide in the case of Gaussian stochastic processes.
Exercise 5.1.9. A stochastic process {Xn } is Gaussian if for each n the random vector (X1 , . . . , Xn ) has the multivariate normal distribution (c.f. Definition
3.5.13). Show that having independent or uncorrelated differences are equivalent
properties for such processes, which together with each of these differences having
a zero mean is then also equivalent to the MG property.
Products of R.V.s are another classical source of martingales.
Example 5.1.10. Consider the stochastic process $M_n = \prod_{k=1}^n Y_k$ for independent, integrable random variables $Y_k \ge 0$. Its canonical filtration coincides with $\mathcal{F}_n^Y$ (see Exercise 1.2.33), and taking out what is known we get by independence that
$$E[M_{n+1} \mid \mathcal{F}_n^Y] = E[Y_{n+1} M_n \mid \mathcal{F}_n^Y] = M_n E[Y_{n+1} \mid \mathcal{F}_n^Y] = M_n E[Y_{n+1}] ,$$
so $\{M_n\}$ is a MG, which we then call the product martingale, if and only if $EY_k = 1$ for all $k \ge 1$ (for a general sequence $\{Y_n\}$ we need instead that a.s. $E[Y_{n+1} \mid Y_1, \ldots, Y_n] = 1$ for all $n$).
Remark. In investment applications, the MG condition EYk = 1 corresponds to
a neutral return rate, and is not the same as the condition E[log Yk ] = 0 under
which the associated partial sums Sn = log Mn form a MG.
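As a sanity check of Example 5.1.10, one can verify $EM_n = 1$ exactly for a toy law with $EY_k = 1$. The two-point law below is an illustrative choice of mine, not from the text; exact rationals avoid floating-point doubt.

```python
from fractions import Fraction
from itertools import product

def product_mg_mean(n, values, probs):
    """Exact E[prod_{k=1}^n Y_k] for i.i.d. Y_k with a finite law,
    by enumerating all len(values)**n value combinations."""
    total = Fraction(0)
    for combo in product(range(len(values)), repeat=n):
        prob, prod_val = Fraction(1), Fraction(1)
        for i in combo:
            prob *= probs[i]
            prod_val *= values[i]
        total += prob * prod_val
    return total

# Y_k uniform on {1/2, 3/2} has E[Y_k] = 1, so {M_n} is a product martingale.
vals = [Fraction(1, 2), Fraction(3, 2)]
pr = [Fraction(1, 2), Fraction(1, 2)]
```

Note that here $E[\log Y_1] = \frac{1}{2}\log\frac{3}{4} < 0$, illustrating the remark: the product is a MG while the log-partial-sums are not.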
We proceed to define the important concept of stopping time (in the simpler
context of a discrete parameter filtration).
Definition 5.1.11. A random variable $\tau$ taking values in $\{0, 1, \ldots, n, \ldots, \infty\}$ is a stopping time for the filtration $\{\mathcal{F}_n\}$ (also denoted $\mathcal{F}_n$-stopping time), if the event $\{\omega : \tau(\omega) \le n\}$ is in $\mathcal{F}_n$ for each finite $n \ge 0$.
Remark. Intuitively, a stopping time corresponds to a situation where the decision whether to stop or not at any given (non-random) time step is based on the
information available by that time step. As we shall amply see in the sequel, one of
the advantages of MGs is in providing a handle on explicit computations associated
with various stopping times.
The next two exercises provide examples of stopping times. Practice your understanding of this concept by solving them.
Exercise 5.1.12. Suppose that $\theta$ and $\tau$ are stopping times for the same filtration $\{\mathcal{F}_n\}$. Show that then $\theta \wedge \tau$, $\theta \vee \tau$ and $\theta + \tau$ are also stopping times for this filtration.
Exercise 5.1.13. Show that the first hitting time $\tau(\omega) = \min\{k \ge 0 : X_k(\omega) \in B\}$ of a Borel set $B \subseteq \mathbb{R}$ by a sequence $\{X_k\}$ is a stopping time for the canonical filtration $\{\mathcal{F}_n^X\}$. Provide an example where the last hitting time $\theta = \sup\{k \ge 0 : X_k \in B\}$ of a set $B$ by the sequence is not a stopping time (not surprising, since we need to know the whole sequence $\{X_k\}$ in order to verify that there are no visits to $B$ after a given time $n$).
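In code, a first hitting time is computable online: deciding whether $\tau \le n$ inspects only the first $n+1$ values, which is exactly the stopping-time property. A small sketch (function name is mine):

```python
def first_hitting_time(xs, B):
    """tau = min{k >= 0 : x_k in B} for a finite path xs; None if never hit.
    Only x_0, ..., x_k are inspected before returning k, so the event
    {tau <= n} is decided by time n.  The last hitting time, by contrast,
    cannot be computed without seeing the entire path."""
    for k, x in enumerate(xs):
        if x in B:
            return k
    return None
```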
Here is an elementary application of first hitting times.
Exercise 5.1.14 (Reflection principle). Suppose $\{S_n\}$ is a symmetric random walk starting at $S_0 = 0$ (see Definition 5.1.6).
(a) Show that $P(S_n - S_k \ge 0) \ge 1/2$ for $k = 1, 2, \ldots, n$.
(b) Fixing $x > 0$, let $\tau = \inf\{k \ge 0 : S_k > x\}$ and show that
$$P(S_n > x) \ge \sum_{k=1}^{n} P(\tau = k, S_n - S_k \ge 0) \ge \frac{1}{2} \sum_{k=1}^{n} P(\tau = k) .$$
(d) Considering now the symmetric srw, show that for any positive integers $n, x$,
$$P(\max_{1 \le k \le n} S_k \ge x) = 2 P(S_n \ge x) - P(S_n = x) .$$
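The identity in part (d) can be confirmed by exact enumeration of all $2^n$ paths of the symmetric srw. This is a quick illustrative check of mine, assuming nothing beyond the statement:

```python
from itertools import product

def reflection_sides(n, x):
    """Return (P(max_{1<=k<=n} S_k >= x), 2*P(S_n >= x) - P(S_n = x))
    for the symmetric simple random walk, by exact path enumeration."""
    hit = ge = eq = 0
    for steps in product((1, -1), repeat=n):
        s, m = 0, -n  # running sum and running max over k >= 1
        for d in steps:
            s += d
            m = max(m, s)
        hit += (m >= x)
        ge += (s >= x)
        eq += (s == x)
    paths = 2 ** n
    return hit / paths, (2 * ge - eq) / paths
```

All counts sit over a power-of-two denominator, so the two sides agree exactly in floating point, e.g. both equal $5/8$ for $n = 3$, $x = 1$.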
$$E[X_{n+1} \mid \mathcal{F}_n] \ge X_n \quad \text{for all } n, \text{ a.s.}$$
Exercise 5.1.19. Show that if {Xn } and {Yn } are sub-MGs with respect to a
filtration {Fn }, then so is {Xn + Yn }. In contrast, show that for any sub-MG {Yn }
there exists integrable {Xn } adapted to {FnY } such that {Xn + Yn } is not a sub-MG
with respect to any filtration.
Here are some of the properties of sub-MGs (and of sup-MGs).
Proposition 5.1.20. If $(X_n, \mathcal{F}_n)$ is a sub-MG, then a.s. $E[X_\ell \mid \mathcal{F}_m] \ge X_m$ for any $\ell > m$. Consequently, for a sub-MG necessarily $n \mapsto EX_n$ is non-decreasing. Similarly, for a sup-MG a.s. $E[X_\ell \mid \mathcal{F}_m] \le X_m$ (with $n \mapsto EX_n$ non-increasing), and for a martingale a.s. $E[X_\ell \mid \mathcal{F}_m] = X_m$ for all $\ell > m$ (with $E[X_n]$ independent of $n$).
Proof. Suppose $\{X_n\}$ is a sub-MG and $\ell = m + k$ for $k \ge 1$. Then,
$$E[X_{m+k} \mid \mathcal{F}_m] = E[E(X_{m+k} \mid \mathcal{F}_{m+k-1}) \mid \mathcal{F}_m] \ge E[X_{m+k-1} \mid \mathcal{F}_m]$$
with the equality due to the tower property and the inequality by the definition of a sub-MG and monotonicity of the C.E. Iterating this inequality for decreasing values of $k$ we deduce that $E[X_{m+k} \mid \mathcal{F}_m] \ge E[X_m \mid \mathcal{F}_m] = X_m$ for all non-negative integers $k, m$, as claimed. Next, taking the expectation of this inequality, we have by monotonicity of the expectation and (4.2.1) that $E[X_{m+k}] \ge E[X_m]$ for all $k, m \ge 0$, or equivalently, that $n \mapsto EX_n$ is non-decreasing.
To get the corresponding results for a super-martingale $\{X_n\}$ note that then $\{-X_n\}$ is a sub-martingale, see Remark 5.1.17. As already mentioned there, if $\{X_n\}$ is a MG then it is both a super-martingale and a sub-martingale, hence both $E[X_\ell \mid \mathcal{F}_m] \le X_m$ and $E[X_\ell \mid \mathcal{F}_m] \ge X_m$, resulting in $E[X_\ell \mid \mathcal{F}_m] = X_m$, as stated.
Exercise 5.1.21. Show that a sub-martingale (Xn , Fn ) is a martingale if and only
if EXn = EX0 for all n.
We next detail a few examples in which sub-MGs or sup-MGs naturally appear, starting with an immediate consequence of Jensen's inequality.
Proposition 5.1.22. Suppose $\phi : \mathbb{R} \mapsto \mathbb{R}$ is convex and $E[|\phi(X_n)|] < \infty$ for all $n$.
(a) If $(X_n, \mathcal{F}_n)$ is a martingale then $(\phi(X_n), \mathcal{F}_n)$ is a sub-martingale.
(b) If $x \mapsto \phi(x)$ is also non-decreasing, $(\phi(X_n), \mathcal{F}_n)$ is a sub-martingale even when $(X_n, \mathcal{F}_n)$ is only a sub-martingale.

Proof. With $\phi(X_n)$ integrable and adapted, it suffices to check that a.s. $E[\phi(X_{n+1}) \mid \mathcal{F}_n] \ge \phi(X_n)$ for all $n$. To this end, since $\phi(\cdot)$ is convex and $X_n$ is integrable, by the conditional Jensen's inequality,
$$E[\phi(X_{n+1}) \mid \mathcal{F}_n] \ge \phi(E[X_{n+1} \mid \mathcal{F}_n]) ,$$
so it remains only to verify that $\phi(E[X_{n+1} \mid \mathcal{F}_n]) \ge \phi(X_n)$. This clearly applies when $(X_n, \mathcal{F}_n)$ is a MG, and even for a sub-MG $(X_n, \mathcal{F}_n)$, provided that $\phi(\cdot)$ is monotone non-decreasing.
Example 5.1.23. Typical convex functions for which the preceding proposition is often applied are $\phi(x) = |x|^p$, $p \ge 1$, $\phi(x) = (x - c)_+$, $\phi(x) = \max(x, c)$ (for $c \in \mathbb{R}$), $\phi(x) = e^{\lambda x}$ and $\phi(x) = x \log x$ (the latter only for non-negative S.P.). Considering instead $\phi(\cdot)$ concave leads to a sup-MG, as for example when $\phi(x) = \min(x, c)$ or $\phi(x) = x^p$ for some $p \in (0, 1)$ or $\phi(x) = \log x$ (latter two cases restricted to non-negative S.P.). For example, if $\{X_n\}$ is a sub-martingale then $(X_n - c)_+$ is also a sub-martingale (since $(x - c)_+$ is a convex, non-decreasing function). Similarly, if $\{X_n\}$ is a super-martingale, then $\min(X_n, c)$ is also a super-martingale (since $\{-X_n\}$ is a sub-martingale and the function $-\min(x, c) = \max(-x, -c)$ is convex and non-decreasing).
Here is a concrete application of Proposition 5.1.22.
Exercise 5.1.24. Suppose $\{\xi_i\}$ are mutually independent, $E\xi_i = 0$ and $E\xi_i^2 = \sigma_i^2$.
(a) Let $S_n = \sum_{i=1}^n \xi_i$ and $s_n^2 = \sum_{i=1}^n \sigma_i^2$. Show that $\{S_n^2\}$ is a sub-martingale and $\{S_n^2 - s_n^2\}$ is a martingale.
(b) Show that if in addition $m_n = \prod_{i=1}^n E e^{\lambda \xi_i}$ are finite, then $\{e^{\lambda S_n}\}$ is a sub-martingale and $M_n = e^{\lambda S_n} / m_n$ is a martingale.
Remark. A special case of Exercise 5.1.24 is the random walk $S_n$ of Definition 5.1.6, with $S_n^2 - n E\xi_1^2$ being a MG when $\xi_1$ is square-integrable and of zero mean. Likewise, $e^{\lambda S_n}$ is a sub-MG whenever $E\xi_1 = 0$ and $E e^{\lambda \xi_1}$ is finite. Though $e^{\lambda S_n}$ is in general not a MG, the normalized $M_n = e^{\lambda S_n} / [E e^{\lambda \xi_1}]^n$ is merely the product MG of Example 5.1.10 for the i.i.d. variables $Y_i = e^{\lambda \xi_i} / E(e^{\lambda \xi_1})$.
Here is another family of super-martingales, this time related to super-harmonic
functions.
Definition 5.1.25. A lower semi-continuous function $f : \mathbb{R}^d \mapsto \mathbb{R}$ is super-harmonic if for any $x$ and $r > 0$,
$$f(x) \ge \frac{1}{|B(0, r)|} \int_{B(x, r)} f(y)\, dy$$
The sequence $Y_n = \sum_{k=1}^{n} V_k (X_k - X_{k-1})$ is called the martingale transform of the $\mathcal{F}_n$-predictable $\{V_n\}$ with respect to a sub- or super-martingale $(X_n, \mathcal{F}_n)$.
Theorem 5.1.28. Suppose $\{Y_n\}$ is the martingale transform of $\mathcal{F}_n$-predictable $\{V_n\}$ with respect to a sub- or super-martingale $(X_n, \mathcal{F}_n)$.
Since
$$X_{\tau \wedge n} - X_{\theta \wedge n} = \sum_{k=1}^{n} I_{\{\theta < k \le \tau\}} (X_k - X_{k-1})$$
is the martingale transform of $V_n = I_{\{\theta < n \le \tau\}}$ with respect to the sub-MG $(X_n, \mathcal{F}_n)$, we know from Theorem 5.1.28 that $(X_{\tau \wedge n} - X_{\theta \wedge n}, \mathcal{F}_n)$ is also a sub-MG. Finally, considering the latter sub-MG for $\theta = 0$ and adding to it the sub-MG $(X_0, \mathcal{F}_n)$, we conclude that $(X_{\tau \wedge n}, \mathcal{F}_n)$ is a sub-MG (c.f. Exercise 5.1.19 and note that $X_{0 \wedge n} = X_0$).
Theorem 5.1.32 thus implies the following key ingredient in the proof of Doob's optional stopping theorem (to which we return in Section 5.4).
Corollary 5.1.33. If $(X_n, \mathcal{F}_n)$ is a sub-MG and $\theta \le \tau$ are $\mathcal{F}_n$-stopping times, then $E X_{\theta \wedge n} \le E X_{\tau \wedge n}$ for all $n$. The reverse inequality holds in case $(X_n, \mathcal{F}_n)$ is a sup-MG, with $E X_{\theta \wedge n} = E X_{\tau \wedge n}$ for all $n$ in case $(X_n, \mathcal{F}_n)$ is a MG.
Proof. It suffices to consider $X_{\tau \wedge n}$, which is a sub-MG for the filtration $\mathcal{F}_n$. In this case we have from Theorem 5.1.32 that $Y_n = X_{\tau \wedge n} - X_{\theta \wedge n}$ is also a sub-MG for this filtration. Noting that $Y_0 = 0$ we thus get from Proposition 5.1.20 that $E Y_n \ge 0$. Theorem 5.1.32 also implies the integrability of $X_{\theta \wedge n}$, so by linearity of the expectation we conclude that $E X_{\theta \wedge n} \le E X_{\tau \wedge n}$.
An important concept associated with each stopping time is the stopped $\sigma$-algebra defined next.

Definition 5.1.34. The stopped $\sigma$-algebra $\mathcal{F}_\tau$ associated with the stopping time $\tau$ for a filtration $\{\mathcal{F}_n\}$ is the collection of events $A \in \mathcal{F}$ such that $A \cap \{\omega : \tau(\omega) \le n\} \in \mathcal{F}_n$ for all $n$.
With $\mathcal{F}_n$ representing the information known at time $n$, think of $\mathcal{F}_\tau$ as quantifying the information known upon stopping at $\tau$. Some of the properties of these stopped $\sigma$-algebras are detailed in the next exercise.
Exercise 5.1.35. Let $\theta$ and $\tau$ be $\mathcal{F}_n$-stopping times.
(a) Verify that $\mathcal{F}_\tau$ is a $\sigma$-algebra and that if $\tau(\omega) = n$ is non-random then $\mathcal{F}_\tau = \mathcal{F}_n$.
(b) Suppose $X_n \in m\mathcal{F}_n$ for all $n$ (including $n = \infty$ unless $\tau$ is finite for all $\omega$). Show that then $X_\tau \in m\mathcal{F}_\tau$. Deduce that $(\theta \wedge \tau) \in m\mathcal{F}_\tau$ and $X_k I_{\{\tau = k\}} \in m\mathcal{F}_\tau$ for any $k$ non-random.
(c) Show that for any integrable $\{Y_n\}$ and non-random $k$,
$$E[Y_k I_{\{\tau = k\}} \mid \mathcal{F}_\tau] = E[Y_k \mid \mathcal{F}_k] I_{\{\tau = k\}} .$$
Our next exercise shows that the martingale property is equivalent to the strong martingale property, whereby conditioning on stopped $\sigma$-algebras $\mathcal{F}_\tau$ replaces that on $\mathcal{F}_n$ for non-random $n$.

Exercise 5.1.36. Given an integrable stochastic process $\{X_n\}$ adapted to a filtration $\{\mathcal{F}_n\}$, show that $(X_n, \mathcal{F}_n)$ is a martingale if and only if $E[X_n \mid \mathcal{F}_\tau] = X_\tau$ for any non-random, finite $n$ and all $\mathcal{F}_n$-stopping times $\tau \le n$.
For non-integrable stochastic processes we generalize the concept of a martingale into that of a local martingale.

Exercise 5.1.37. The pair $(X_n, \mathcal{F}_n)$ is called a local martingale if $\{X_n\}$ is adapted to the filtration $\{\mathcal{F}_n\}$ and there exist $\mathcal{F}_n$-stopping times $\tau_k \uparrow \infty$ with probability one such that $(X_{n \wedge \tau_k}, \mathcal{F}_n)$ is a martingale for each $k$. Show that any martingale is a local martingale and that any integrable local martingale is a martingale.
We conclude with the renewal property of stopping times with respect to the
canonical filtration of an i.i.d. sequence.
Exercise 5.1.38. Suppose $\tau$ is an a.s. finite stopping time with respect to the canonical filtration $\{\mathcal{F}_n^Z\}$ of a sequence $\{Z_k\}$ of i.i.d. R.V.-s.
(a) Show that $\mathcal{T}_\tau Z = \sigma(Z_{\tau + k}, k \ge 1)$ is independent of the stopped $\sigma$-algebra $\mathcal{F}_\tau^Z$.
(b) Provide an example of a finite $\mathcal{F}_n^Z$-stopping time $\tau$ and independent $\{Z_k\}$ for which $\mathcal{T}_\tau Z$ is not independent of $\mathcal{F}_\tau^Z$.
5.2. Martingale representations and inequalities
In Subsection 5.2.1 we show that martingales are at the core of all adapted processes. We further explore there the structure of certain sub-martingales, introducing the increasing process associated with square-integrable martingales. This
is augmented in Subsection 5.2.2 by the study of maximal inequalities for sub-martingales (and martingales). Such inequalities are an important technical tool
in many applications of probability theory. In particular, they are the key to the
convergence results of Section 5.3.
5.2.1. Martingale decompositions. To demonstrate the relevance of martingales to the study of general stochastic processes, we start with a representation
of any adapted, integrable, discrete-time S.P. as the sum of a martingale and a
predictable process.
Theorem 5.2.1 (Doob's decomposition). Given an integrable stochastic process $\{X_n\}$, adapted to a discrete parameter filtration $\{\mathcal{F}_n\}$, $n \ge 0$, there exists a decomposition $X_n = Y_n + A_n$ such that $(Y_n, \mathcal{F}_n)$ is a MG and $\{A_n\}$ is an $\mathcal{F}_n$-predictable sequence. This decomposition is unique up to the value of $Y_0 \in m\mathcal{F}_0$.
Exercise 5.2.2. Check that the predictable part of Doob's decomposition of a sub-martingale $(X_n, \mathcal{F}_n)$ is a non-decreasing sequence, that is, $A_n \le A_{n+1}$ for all $n$.
Remark. As shown in Subsection 5.3.2, Doob's decomposition is particularly useful in connection with square-integrable martingales $\{X_n\}$, where one can relate the limit of $X_n$ as $n \to \infty$ with that of the non-decreasing sequence $\{A_n\}$ in the decomposition of $\{X_n^2\}$.
We next evaluate Doob's decomposition for two classical sub-MGs.

Example 5.2.3. Consider the sub-MG $\{S_n^2\}$ for the random walk $S_n = \sum_{k=1}^n \xi_k$, where $\xi_k$ are i.i.d. random variables with $E\xi_1 = 0$ and $E\xi_1^2 = 1$. Since $Y_n = S_n^2 - n$ is a martingale (see Exercise 5.1.24), and Doob's decomposition $S_n^2 = Y_n + A_n$ is unique, it follows that the non-decreasing predictable part in the decomposition of $S_n^2$ is $A_n = n$.
In contrast with the preceding example, the non-decreasing predictable part in Doob's decomposition is for most sub-MGs a non-degenerate random sequence, as is the case in our next example.
Example 5.2.4. Consider the sub-MG $(M_n, \mathcal{F}_n^Z)$ where $M_n = \prod_{i=1}^n Z_i$ for i.i.d. integrable $Z_i \ge 0$ such that $EZ_1 > 1$ (see Example 5.1.10). The non-decreasing predictable part of its Doob's decomposition is such that for $n \ge 1$
$$A_{n+1} - A_n = E[M_{n+1} - M_n \mid \mathcal{F}_n^Z] = E[Z_{n+1} M_n - M_n \mid \mathcal{F}_n^Z] = M_n E[Z_{n+1} - 1 \mid \mathcal{F}_n^Z] = M_n (EZ_1 - 1)$$
(since $Z_{n+1}$ is independent of $\mathcal{F}_n^Z$). In this case $A_n = (EZ_1 - 1) \sum_{k=1}^{n-1} M_k + A_1$, where we are free to choose for $A_1$ any non-random constant. We see that $\{A_n\}$ is a non-degenerate random sequence (assuming the R.V. $Z_i$ are not a.s. constant).
We conclude with the representation of any $L^1$-bounded martingale as the difference of two non-negative martingales (resembling the representation $X = X_+ - X_-$ for an integrable R.V. $X$ and non-negative $X_\pm$).

Exercise 5.2.5. Let $(X_n, \mathcal{F}_n)$ be a martingale with $\sup_n E|X_n| < \infty$. Show that there is a representation $X_n = Y_n - Z_n$ with $(Y_n, \mathcal{F}_n)$ and $(Z_n, \mathcal{F}_n)$ non-negative martingales such that $\sup_n E|Y_n| < \infty$ and $\sup_n E|Z_n| < \infty$.
5.2.2. Maximal and up-crossing inequalities. Martingales are rather tame stochastic processes. In particular, as we see next, the tail of $\max_{k \le n} X_k$ is bounded by moments of $X_n$. This is a major improvement over Markov's inequality, which relates the typically much smaller tail of the R.V. $X_n$ to its moments (see part (b) of Example 1.3.14).
Theorem 5.2.6 (Doob's inequality). For any sub-martingale $\{X_n\}$ and $x > 0$ let $\tau_x = \min\{k \ge 0 : X_k \ge x\}$. Then, for any finite $n \ge 0$,
(5.2.1)
$$x\, P(\max_{0 \le k \le n} X_k \ge x) \le E[X_n I_{\{\max_{0 \le k \le n} X_k \ge x\}}] \le E[(X_n)_+] .$$

Proof. Since
$$A_n = \{\omega : \tau_x(\omega) \le n\} = \{\omega : \max_{0 \le k \le n} X_k(\omega) \ge x\} ,$$
it follows that
$$E[X_{\tau_x \wedge n}] = E[X_{\tau_x} I_{\tau_x \le n}] + E[X_n I_{\tau_x > n}] \ge x P(A_n) + E[X_n I_{A_n^c}] .$$
With $\{X_n\}$ a sub-MG and $\tau_x \wedge n \le n$ a pair of $\mathcal{F}_n^X$-stopping times, it follows from Corollary 5.1.33 that $E[X_{\tau_x \wedge n}] \le E[X_n]$. Therefore, $E[X_n] - E[X_n I_{A_n^c}] \ge x P(A_n)$, which is exactly the left inequality in (5.2.1). The right inequality there holds by monotonicity of the expectation and the trivial fact $X I_A \le (X)_+$ for any R.V. $X$ and any measurable set $A$.
Remark. Doob's inequality generalizes Kolmogorov's maximal inequality. Indeed, consider $X_k = Z_k^2$ for the $L^2$-martingale $Z_k = Y_1 + \cdots + Y_k$, where $\{Y_l\}$ are mutually independent with $EY_l = 0$ and $EY_l^2 < \infty$. By Proposition 5.1.22, $\{X_k\}$ is a sub-MG, so by Doob's inequality we obtain that for any $z > 0$,
$$P(\max_{1 \le k \le n} |Z_k| \ge z) = P(\max_{1 \le k \le n} X_k \ge z^2) \le z^{-2} E[(X_n)_+] = z^{-2} \mathrm{Var}(Z_n)$$
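Both inequalities in (5.2.1) can be checked exactly for the sub-MG $X_k = S_k^2$ built from the symmetric srw. The small enumeration sketch below is my own illustration, not part of the notes:

```python
from itertools import product

def doob_terms(n, x):
    """For X_k = S_k^2 (a sub-MG of the symmetric srw), return the three
    terms of Doob's inequality: x*P(max X_k >= x), E[X_n ; max X_k >= x]
    and E[(X_n)_+], each computed exactly over all 2**n paths."""
    paths = 2 ** n
    count = restricted = plus = 0
    for steps in product((1, -1), repeat=n):
        s, m = 0, 0  # X_0 = 0 is included in the maximum
        for d in steps:
            s += d
            m = max(m, s * s)
        xn = s * s
        if m >= x:
            count += 1
            restricted += xn
        plus += xn  # X_n >= 0 here, so (X_n)_+ = X_n
    return x * count / paths, restricted / paths, plus / paths
```

For instance, with $x = 1$ the left term equals $1$ (the walk always leaves $0$ at the first step) while the right term is $ES_n^2 = n$.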
Further, in this case $E[V_\infty^p] \le c_p E[A_\infty^p]$ for $c_p = 1 + 1/(1-p)$ and any $p \in (0, 1)$.

Proof. Since $M_n = Z_n - A_n$ is a MG with respect to the filtration $\{\mathcal{F}_n\}$ (starting at $M_0 = 0$), by Theorem 5.1.32 the same applies for the stopped stochastic process $M_{\tau \wedge n}$, with $\tau$ any $\mathcal{F}_n$-stopping time. By the same reasoning $Z_{\tau \wedge n} = M_{\tau \wedge n} + A_{\tau \wedge n}$ is a sub-MG with respect to $\{\mathcal{F}_n\}$. Applying Doob's inequality (5.2.1) for this non-negative sub-MG we deduce that for any $n$ and $x > 0$,
$$P(V_{\tau \wedge n} \ge x) = P(\max_{0 \le k \le \tau \wedge n} Z_k \ge x) \le x^{-1} E[Z_{\tau \wedge n}] = x^{-1} E[A_{\tau \wedge n}] .$$
Martingales also provide bounds on the probability that the sum of bounded independent variables is too close to its mean (in lieu of the CLT).

Exercise 5.2.10. Let $S_n = \sum_{k=1}^n \xi_k$ where $\{\xi_k\}$ are independent with $E\xi_k = 0$ and $|\xi_k| \le K$ for all $k$. Let $s_n^2 = \sum_{k=1}^n E\xi_k^2$. Using Corollary 5.1.33 for the martingale $S_n^2 - s_n^2$ and a suitable stopping time, show that
$$P(\max_{1 \le k \le n} |S_k| \le x) \le \frac{(x + K)^2}{s_n^2} .$$
If the positive part of the sub-MG has a finite $p$-th moment, you can improve the rate of decay in $x$ in Doob's inequality by an application of Proposition 5.1.22 for the convex non-decreasing $\phi(y) = \max(y, 0)^p$, denoted hereafter by $(y)_+^p$. Further, in case of a MG the same argument yields comparable bounds on tail probabilities for the maximum of $|Y_k|$.
Exercise 5.2.11.
(a) Show that for any sub-MG $\{Y_n\}$, $p \ge 1$, finite $n \ge 0$ and $y > 0$,
$$P(\max_{0 \le k \le n} Y_k \ge y) \le y^{-p} E\big[\max(Y_n, 0)^p\big] .$$
(c) Suppose the martingale $\{Y_n\}$ is such that $Y_0 = 0$. Using the fact that $(Y_n + c)^2$ is a sub-martingale and optimizing over $c$, show that for $y > 0$,
$$P(\max_{0 \le k \le n} Y_k \ge y) \le \frac{EY_n^2}{EY_n^2 + y^2} .$$
Here is the version of Doob's inequality for non-negative sup-MGs and its application to the random walk.

Exercise 5.2.12.
(a) Show that if $\tau$ is a stopping time for the canonical filtration of a non-negative super-martingale $\{X_n\}$ then $EX_0 \ge EX_{\tau \wedge n} \ge E[X_\tau I_{\tau \le n}]$ for any finite $n$.
(b) Deduce that if $\{X_n\}$ is a non-negative super-martingale then for any $x > 0$,
$$P(\sup_k X_k \ge x) \le x^{-1} EX_0 .$$
Proof. The bound (5.2.4) is obtained by applying part (b) of Lemma 1.4.31 for the non-negative variables $X = (X_n)_+$ and $Y = (\max_{k \le n} X_k)_+$. Indeed, the hypothesis $P(Y \ge y) \le y^{-1} E[X I_{Y \ge y}]$ of this lemma is provided by the left inequality in (5.2.1), and its conclusion that $EY^p \le q^p EX^p$ is precisely (5.2.4). In case $\{Y_n\}$ is a martingale, we get (5.2.5) by applying (5.2.4) for the non-negative sub-MG $X_n = |Y_n|$.
Remark. A bound such as (5.2.5) cannot hold for all sub-MGs. For example, the non-random sequence $Y_k = (k - n) \le 0$ is a sub-MG with $|Y_0| = n$ but $Y_n = 0$. The following two exercises show that while $L^p$ maximal inequalities as in Corollary 5.2.13 cannot hold for $p = 1$, such an inequality does hold provided we replace $E(X_n)_+$ in the bound by $E[(X_n)_+ \log \max(X_n, 1)]$.
Exercise 5.2.14. Consider the martingale $M_n = \prod_{k=1}^n Y_k$ for i.i.d. non-negative random variables $\{Y_k\}$ with $EY_1 = 1$ and $P(Y_1 = 1) < 1$.
(a) Explain why $E(\log Y_1)_+$ is finite and why the strong law of large numbers implies that $n^{-1} \log M_n \to E[\log Y_1] < 0$ a.s. as $n \to \infty$.
(b) Deduce that $M_n \to 0$ a.s. as $n \to \infty$ and that consequently $\{M_n\}$ is not uniformly integrable.
(c) Show that if (5.2.4) applies for $p = 1$ and some $q < \infty$, then any non-negative martingale would have been uniformly integrable.
Hint: Apply part (c) of Lemma 1.4.31 and recall that $x (\log y)_+ \le e^{-1} y + x (\log x)_+$ for any $x, y \ge 0$.
We just saw that in general $L^1$-bounded martingales might not be U.I. Nevertheless, as you show next, for sums of independent zero-mean random variables these two properties are equivalent.

Exercise 5.2.16. Suppose $S_n = \sum_{k=1}^n \xi_k$ with $\xi_k$ independent.
(a) Prove Ottaviani's inequality. Namely, show that for any $n$ and $t, s \ge 0$,
$$P(\max_{1 \le k \le n} |S_k| \ge t + s) \min_{1 \le k \le n} P(|S_n - S_k| \le s) \le P(|S_n| \ge t) .$$
In the spirit of Doob's inequality bounding the tail probability of the maximum of a sub-MG $\{X_k, k = 0, 1, \ldots, n\}$ in terms of the value of $X_n$, we will bound the oscillations of $\{X_k, k = 0, 1, \ldots, n\}$ over an interval $[a, b]$ in terms of $X_0$ and $X_n$. To this end, we require the following definition of up-crossings.

Definition 5.2.17. The number of up-crossings of the interval $[a, b]$ by $\{X_k(\omega), k = 0, 1, \ldots, n\}$, denoted $U_n[a, b](\omega)$, is the largest $\ell \in \mathbb{Z}_+$ such that $X_{s_i}(\omega) < a$ and $X_{t_i}(\omega) > b$ for $1 \le i \le \ell$ and some $0 \le s_1 < t_1 < \cdots < s_\ell < t_\ell \le n$.
For example, Fig. 1 depicts two up-crossings of [a, b].
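Definition 5.2.17 translates into a simple greedy scan, which attains the largest such $\ell$; a sketch of mine for illustration:

```python
def upcrossings(xs, a, b):
    """Count the up-crossings U_n[a,b] of Definition 5.2.17: completed
    excursions of the path from strictly below a to strictly above b.
    The greedy scan (wait for x < a, then wait for x > b) yields the
    maximal number of such index pairs."""
    count, waiting_above = 0, False
    for x in xs:
        if not waiting_above and x < a:
            waiting_above = True      # path entered (-inf, a)
        elif waiting_above and x > b:
            count += 1                # up-crossing of [a, b] completed
            waiting_above = False
    return count
```

For example, the path $0, 3, -1, 4$ makes two up-crossings of $[0.5, 2]$, mirroring the two up-crossings depicted in Fig. 1.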
Lemma 5.2.18 (Doob's up-crossing inequality). If $(X_n, \mathcal{F}_n)$ is a super-martingale, then for any $n$ and $a < b$,
(5.2.6)
$$(b - a)\, E(U_n[a, b]) \le E[(X_n - a)_-] - E[(X_0 - a)_-] .$$
Proof. Fixing $a < b$, let $V_1 = I_{\{X_0 < a\}}$ and for $n = 2, 3, \ldots$, define recursively $V_n = I_{\{V_{n-1} = 1, X_{n-1} \le b\}} + I_{\{V_{n-1} = 0, X_{n-1} < a\}}$. Informally, the sequence $V_k$ is zero while waiting for the process $\{X_n\}$ to enter $(-\infty, a)$, after which time it reverts to one and stays so while waiting for this process to enter $(b, \infty)$. See Figure 1 for an illustration, in which black circles depict indices $k$ such that $V_k = 1$ and open circles indicate those values of $k$ with $V_k = 0$. Clearly, the sequence $\{V_n\}$ is predictable for the canonical filtration of $\{X_n\}$. Let $\{Y_n\}$ denote the martingale transform of $\{V_n\}$ with respect to $\{X_n\}$ (per Definition 5.1.27). By the choice of $V$ every up-crossing of the interval $[a, b]$ by $\{X_k, k = 0, 1, \ldots, n\}$ contributes to $Y_n$ the difference between the value of $X$ at the end of the up-crossing (i.e. the last in the corresponding run of black circles), which is at least $b$, and its value at the start of the up-crossing (i.e. the last in the preceding run of open circles), which is at most $a$. Thus, each up-crossing increases $Y_n$ by at least $(b - a)$, and if $X_0 < a$ then the first up-crossing must have contributed at least $(b - X_0) = (b - a) + (X_0 - a)_-$ to $Y_n$. The only other contribution to $Y_n$ is by the up-crossing of the interval $[a, b]$ that is in progress at time $n$ (if there is such), and since it started at value at most $a$, its contribution to $Y_n$ is at least $-(X_n - a)_-$. We thus conclude that
$$Y_n \ge (b - a) U_n[a, b] + (X_0 - a)_- - (X_n - a)_-$$
for all $\omega$. With $\{V_n\}$ predictable, bounded and non-negative, it follows that $\{Y_n\}$ is a super-martingale (see parts (b) and (c) of Theorem 5.1.28). Thus, considering the expectation of the preceding inequality yields the up-crossing inequality (5.2.6), since $0 = EY_0 \ge EY_n$ for the sup-MG $\{Y_n\}$.
Doob's up-crossing inequality implies that the total number of up-crossings of $[a, b]$ by a non-negative sup-MG has a finite expectation. In this context, Dubins' up-crossing inequality, which you are to derive next, provides universal (i.e. depending only on $a/b$) exponential bounds on the tail probabilities of this random variable.
Exercise 5.2.19. Suppose $(X_n^1, \mathcal{F}_n)$ and $(X_n^2, \mathcal{F}_n)$ are both sup-MGs and $\tau$ is an $\mathcal{F}_n$-stopping time such that $X_\tau^1 \ge X_\tau^2$.
(a) Show that $W_n = X_n^1 I_{\tau > n} + X_n^2 I_{\tau \le n}$ is a sup-MG with respect to $\mathcal{F}_n$ and deduce that so is $Y_n = X_n^1 I_{\tau \ge n} + X_n^2 I_{\tau < n}$ (this is sometimes called the switching principle).
(b) For a sup-MG $X_n \ge 0$ and constants $b > a > 0$ define the $\mathcal{F}_n^X$-stopping times $\tau_0 = -1$, $\sigma_{\ell+1} = \inf\{k > \tau_\ell : X_k \le a\}$ and $\tau_{\ell+1} = \inf\{k > \sigma_{\ell+1} : X_k \ge b\}$, $\ell = 0, 1, \ldots$. That is, the $\ell$-th up-crossing of $(a, b)$ by $\{X_n\}$ starts at $\sigma_\ell$ and ends at $\tau_\ell$. For $\ell = 0, 1, \ldots$ let $Z_n = a^{-\ell} b^\ell$ when $n \in [\tau_\ell, \sigma_{\ell+1})$ and $Z_n = a^{-\ell-1} b^\ell X_n$ for $n \in [\sigma_{\ell+1}, \tau_{\ell+1})$. Show that $(Z_n, \mathcal{F}_n^X)$ is a sup-MG.
(c) For $b > a > 0$ let $U_\infty[a, b]$ denote the total number of up-crossings of the interval $[a, b]$ by a non-negative super-martingale $\{X_n\}$. Deduce from the preceding that for any positive integer $\ell$,
$$P(U_\infty[a, b] \ge \ell) \le \Big(\frac{a}{b}\Big)^\ell E[\min(X_0 / a, 1)]$$
(this is Dubins' up-crossing inequality).
5.3. The convergence of Martingales
Indeed, these convergence results are closely related to the fact that the maximum
and up-crossings counts of a sub-MG do not grow too rapidly (and same applies
for sup-MGs and martingales). To further explore this direction, we next link the
finiteness of the total number of up-crossings U [a, b] of intervals [a, b], b > a, by a
process {Xn } to its a.s. convergence.
Lemma 5.3.1. If for each $b > a$ almost surely $U_\infty[a, b] < \infty$, then $X_n \to X_\infty$ a.s., where $X_\infty$ is an $\overline{\mathbb{R}}$-valued random variable.
Proof. Note that the event that $X_n$ has an almost sure ($\overline{\mathbb{R}}$-valued) limit as $n \to \infty$ is the complement of
$$\Gamma = \bigcup_{a, b \in \mathbb{Q},\, a < b} \Gamma_{a, b} , \qquad \Gamma_{a, b} = \{\omega : \liminf_n X_n(\omega) < a < b < \limsup_n X_n(\omega)\} .$$
Since $\Gamma$ is a countable union of these events, it thus suffices to show that $P(\Gamma_{a, b}) = 0$ for any $a, b \in \mathbb{Q}$, $a < b$. To this end note that if $\omega \in \Gamma_{a, b}$ then $\limsup_n X_n(\omega) > b$ and $\liminf_n X_n(\omega) < a$ are both limit points of the sequence $\{X_n(\omega)\}$, hence the total number of up-crossings of the interval $[a, b]$ by this sequence is infinite. That is, $\Gamma_{a, b} \subseteq \{\omega : U_\infty[a, b](\omega) = \infty\}$. So, from our hypothesis that $U_\infty[a, b]$ is finite almost surely it follows that $P(\Gamma_{a, b}) = 0$ for each $a < b$, resulting in the stated conclusion.
Combining Doob's up-crossing inequality of Lemma 5.2.18 with Lemma 5.3.1 we now prove Doob's a.s. convergence theorem for sup-MGs (and sub-MGs).
Theorem 5.3.2 (Doob's convergence theorem). Suppose the sup-MG $(X_n, \mathcal{F}_n)$ is such that $\sup_n \{E[(X_n)_-]\} < \infty$. Then $X_n \to X_\infty$ a.s. and $E|X_\infty| \le \liminf_n E|X_n|$ is finite.
Proof. Fixing $b > a$, recall that $0 \le U_n[a, b] \uparrow U_\infty[a, b]$ as $n \to \infty$, where $U_\infty[a, b]$ denotes the total number of up-crossings of $[a, b]$ by the sequence $\{X_n\}$. Hence, by monotone convergence $E(U_\infty[a, b]) = \sup_n E(U_n[a, b])$. Further, with $(x - a)_- \le |a| + x_-$, we get from Doob's up-crossing inequality and the monotonicity of the expectation that
$$E(U_n[a, b]) \le \frac{1}{(b - a)} E(X_n - a)_- \le \frac{1}{(b - a)} \Big[ |a| + \sup_n E[(X_n)_-] \Big] .$$
Thus, our hypothesis that $\sup_n E[(X_n)_-] < \infty$ implies that $E(U_\infty[a, b])$ is finite, hence in particular $U_\infty[a, b]$ is finite almost surely.
Since this applies for any $b > a$, we have from Lemma 5.3.1 that $X_n \to X_\infty$ a.s. Further, with $X_n$ a sup-MG, we have that $E|X_n| = EX_n + 2E(X_n)_- \le EX_0 + 2E(X_n)_-$ for all $n$. Using this observation in conjunction with Fatou's lemma for $0 \le |X_n| \to |X_\infty|$ a.s. and our hypothesis, we find that
$$E|X_\infty| \le \liminf_n E|X_n| \le EX_0 + 2 \sup_n \{E[(X_n)_-]\} < \infty ,$$
as stated.
Further, by monotone convergence $E[X_\tau I_{\tau < n}] \uparrow E[X_\tau I_{\tau < \infty}]$ and $E[X_\theta I_{\theta < n}] \uparrow E[X_\theta I_{\theta < \infty}]$. Hence, taking $n \to \infty$ results with
$$E[X_\theta I_{\theta < \infty}] \ge E[X_\tau I_{\tau < \infty}] + E[X_\infty I_{\tau = \infty} I_{\theta < \infty}] .$$
Adding the identity $E[X_\infty I_{\theta = \infty}] = E[X_\infty I_{\tau = \infty} I_{\theta = \infty}]$, which holds for $\theta \le \tau$, yields the stated inequality $E[X_\theta] \ge E[X_\tau]$. Considering $\theta = 0$ we further see that $E[X_0] \ge E[X_\theta] \ge E[X_\tau] \ge 0$ are finite, as claimed.
Solving the next exercise should improve your intuition about the domain of validity of Proposition 5.1.22 and of Doob's convergence theorem.
Exercise 5.3.9.
(a) Provide an example of a sub-martingale $\{X_n\}$ for which $\{X_n^2\}$ is a super-martingale and explain why it does not contradict Proposition 5.1.22.
(b) Provide an example of a martingale which converges a.s. to $-\infty$ and explain why it does not contradict Theorem 5.3.2.
Hint: Try $S_n = \sum_{i=1}^n \xi_i$, with zero-mean, independent but not identically distributed $\xi_i$.
We conclude this sub-section with a few additional applications of Doob's convergence theorem.
Exercise 5.3.10. Suppose $\{X_n\}$ and $\{Y_n\}$ are non-negative, integrable processes adapted to the filtration $\mathcal{F}_n$ such that $\sum_{n \ge 1} Y_n < \infty$ a.s. and $E[X_{n+1} \mid \mathcal{F}_n] \le (1 + Y_n) X_n + Y_n$ for all $n$. Show that $X_n$ converges a.s. to a finite limit as $n \to \infty$.
Hint: Find a non-negative super-martingale $(W_n, \mathcal{F}_n)$ whose convergence implies that of $X_n$.
Exercise 5.3.11. Let $\{X_k\}$ be mutually independent, but not necessarily integrable, random variables.
(a) Fixing $c < \infty$ non-random, let $Y_n^{(c)} = \sum_{k=1}^n I_{\{|S_{k-1}| \le c\}} X_k I_{\{|X_k| \le c\}}$.
5.3.1. Uniformly integrable martingales. The main result of this sub-section is the following $L^1$ convergence theorem for uniformly integrable (U.I.) sub-MGs (and sup-MGs).
Remark. If $\{X_n\}$ is uniformly integrable then $\sup_n E|X_n|$ is finite (see Lemma 1.3.48). Thus, the assumption of Theorem 5.3.12 is stronger than that of Theorem 5.3.2, as is its conclusion.
Proof. If $\{X_n\}$ is U.I. then $\sup_n E|X_n| < \infty$. For $\{X_n\}$ a sub-MG it thus follows by Doob's convergence theorem that $X_n \to X_\infty$ a.s. with $X_\infty$ integrable. Obviously, this implies that $X_n \to X_\infty$ in probability. Similarly, if we start instead by assuming ... $X_m \le E[X_\ell \mid \mathcal{F}_m]$ for all $\ell > m$ and any sub-MG (see Proposition 5.1.20). Further, since $X_\ell \to X_\infty$ in $L^1$ it follows that $E[X_\ell \mid \mathcal{F}_m] \to E[X_\infty \mid \mathcal{F}_m]$ in $L^1$ as $\ell \to \infty$, per fixed $m$ (see Theorem 4.2.30). The latter implies the a.s. convergence of these conditional expectations along some sub-sequence $\ell_k$ (c.f. Theorem 2.2.10). Hence, we conclude that for any $m$, a.s.
$$X_m \le \liminf_{k \to \infty} E[X_{\ell_k} \mid \mathcal{F}_m] = E[X_\infty \mid \mathcal{F}_m] ,$$
The preceding theorem identifies the collection of U.I. martingales as merely the set of all Doob's martingales, a concept we now define.

Definition 5.3.13. The sequence $X_n = E[X \mid \mathcal{F}_n]$, with $X$ an integrable R.V. and $\{\mathcal{F}_n\}$ a filtration, is called Doob's martingale of $X$ with respect to $\{\mathcal{F}_n\}$.

Corollary 5.3.14. A martingale $(X_n, \mathcal{F}_n)$ is U.I. if and only if $X_n = E[X_\infty \mid \mathcal{F}_n]$, where $X_\infty$ denotes the a.s. and $L^1$ limit of $X_n$.
Proof. Consider first the special case where $X_n = X$ does not depend on $n$. Then $Y_n = E[X \mid \mathcal{F}_n]$ is a U.I. martingale. Therefore, $E[Y_\infty \mid \mathcal{F}_n] = E[X \mid \mathcal{F}_n]$ for all $n$, where $Y_\infty$ denotes the a.s. and $L^1$ limit of $Y_n$ (see Corollary 5.3.14). As $Y_n \in m\mathcal{F}_n \subseteq m\mathcal{F}_\infty$, clearly $Y_\infty = \lim_n Y_n \in m\mathcal{F}_\infty$. Further, by definition of the C.E., $E[X I_A] = E[Y_\infty I_A]$ for all $A$ in the $\pi$-system $\mathcal{P} = \bigcup_n \mathcal{F}_n$, hence with $\mathcal{F}_\infty = \sigma(\mathcal{P})$ it follows that $Y_\infty = E[X \mid \mathcal{F}_\infty]$ (see Exercise 4.1.3).
Turning to the general case, with $Z = \sup_m |X_m|$ integrable and $X_m \to X_\infty$ a.s., we deduce that $X_\infty$ and $W_k = \sup\{|X_n - X_\infty| : n \ge k\} \le 2Z$ are both integrable. So, the conditional Jensen's inequality and the monotonicity of the C.E. imply that for all $n \ge k$,
$$|E[X_n \mid \mathcal{F}_n] - E[X_\infty \mid \mathcal{F}_n]| \le E[|X_n - X_\infty| \mid \mathcal{F}_n] \le E[W_k \mid \mathcal{F}_n] .$$
Corollary 5.3.16 (Lévy's 0-1 law). If $\mathcal{F}_n \uparrow \mathcal{F}_\infty$ and $A \in \mathcal{F}_\infty$, then $E[I_A \mid \mathcal{F}_n] \to I_A$ a.s.
As shown in the sequel, Kolmogorov's 0-1 law about $P$-triviality of the tail $\sigma$-algebra $\mathcal{T}^X = \bigcap_n \mathcal{T}_n^X$ of independent random variables is a special case of Lévy's 0-1 law.

Proof of Corollary 1.4.10. Let $\mathcal{F}_\infty^X = \sigma(\bigcup_n \mathcal{F}_n^X)$. Recall Definition 1.4.9, whereby $\mathcal{T}^X \subseteq \mathcal{T}_n^X \subseteq \mathcal{F}_\infty^X$ for all $n$. Thus, by Lévy's 0-1 law, $E[I_A \mid \mathcal{F}_n^X] \to I_A$ a.s. for any $A \in \mathcal{T}^X$. By assumption $\{X_k\}$ are $P$-mutually independent, hence for any $A \in \mathcal{T}^X$ the R.V. $I_A \in m\mathcal{T}_n^X$ is independent of the $\sigma$-algebra $\mathcal{F}_n^X$. Consequently, $E[I_A \mid \mathcal{F}_n^X] = P(A)$ a.s. for all $n$. We deduce that $P(A) = I_A$ a.s., implying that $P(A) \in \{0, 1\}$ for all $A \in \mathcal{T}^X$, as stated.
The generalization of Theorem 4.2.30 which you derive next also relaxes the assumptions of Lévy's upward theorem in case only $L^1$ convergence is of interest.

Exercise 5.3.18. Suppose $X_n \to 0$ in $L^1$ for $[0, 1]$-valued random variables $\{X_n\}$ and $\{M_n\}$ is a non-negative MG.
Recall that $x : [0, 1] \to \mathbb{R}$ is absolutely continuous if for each $\varepsilon > 0$ there exists $\delta > 0$ such that $\sum_{\ell=1}^k |t_\ell - s_\ell| \le \delta$ implies $\sum_{\ell=1}^k |x(t_\ell) - x(s_\ell)| \le \varepsilon$, for any finite collection of non-overlapping intervals $(s_\ell, t_\ell) \subseteq [0, 1]$.
The next exercise uses convergence properties of MGs to prove a classical result
in real analysis, namely, that an absolutely continuous function is differentiable for
Lebesgue a.e. t [0, 1).
Exercise 5.3.20. On the probability space $([0, 1), \mathcal{B}, U)$ consider the events
$$A_{i,n} = [(i - 1) 2^{-n}, i 2^{-n}) \qquad \text{for } i = 1, \ldots, 2^n, \quad n = 0, 1, \ldots ,$$
(e) Recall Lebesgue's theorem, that $\delta^{-1} \int_s^{s+\delta} |h(s) - h(u)| \, du \to 0$ as $\delta \downarrow 0$, for a.e. $s \in [0, 1)$. Using it, conclude that $\frac{dx}{dt} = h$ for almost every $t \in [0, 1)$.
$$Z_k := E[X \mid \sigma(B, Y_0, \ldots, Y_{k-1})] .$$
such that $\sum_k E\xi_k^2 < \infty$. Since $ES_n^2 = \sum_{k=1}^n E\xi_k^2$, it follows from Proposition 5.3.22 that the random series $S_n(\omega) \to S_\infty(\omega)$ almost surely and in $L^2$ (see also Theorem 2.3.17 for a direct proof of this result, based on Kolmogorov's maximal inequality).
Pn
Exercise 5.3.24. Suppose Zn = 1n k=1 k for i.i.d. k L2 (, F , P) of zeromean and unit variance. Let Fn = (k , k n) and F = (k , k < ).
(a) Prove that EW Zn 0 for any fixed W L2 (, F , P).
(b) Deduce that the same applies for any W L2 (, F , P) and conclude that
Zn does not converge in L2 .
D
(c) Show that though Zn G, a standard normal variable, there exists no
p
Z mF such that Zn Z .
We conclude this sub-section with the application of martingales to the study of Pólya's urn scheme.
Example 5.3.25 (Pólya's urn). Consider an urn that initially contains $r$ red and $b$ blue marbles. At the $k$-th step a marble is drawn at random from the urn, with all possible choices being equally likely, and it and $c_k$ more marbles of the same color are then returned to the urn. With $N_n = r + b + \sum_{k=1}^n c_k$ counting the number of marbles in the urn after $n$ iterations of this procedure, let $R_n$ denote the number of red marbles at that time and $M_n = R_n / N_n$ the corresponding fraction of red marbles. Since $R_{n+1} \in \{R_n, R_n + c_{n+1}\}$ with $P(R_{n+1} = R_n + c_{n+1} \mid \mathcal{F}_n^M) = R_n / N_n = M_n$, it follows that $E[R_{n+1} \mid \mathcal{F}_n^M] = R_n + c_{n+1} M_n = N_{n+1} M_n$. Consequently, $E[M_{n+1} \mid \mathcal{F}_n^M] = M_n$ for all $n$, with $\{M_n\}$ a uniformly bounded martingale.
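The urn dynamics admit an exact dynamic-programming computation of the law of $R_n$. The sketch below is my own illustration; it also exhibits the martingale property $E[M_n] = M_0 = r/(r+b)$ in exact arithmetic.

```python
from fractions import Fraction

def polya_law(n, r, b, c):
    """Exact law of R_n, the number of red marbles after n draws of
    Polya's urn (the drawn marble is returned together with c extra
    marbles of its color), via dynamic programming on the red count."""
    law = {r: Fraction(1)}   # law of R_0
    total = r + b            # N_0
    for _ in range(n):
        nxt = {}
        for red, p in law.items():
            p_red = Fraction(red, total)
            nxt[red + c] = nxt.get(red + c, Fraction(0)) + p * p_red
            nxt[red] = nxt.get(red, Fraction(0)) + p * (1 - p_red)
        law = nxt
        total += c
    return law
```

For $r = b = c = 1$ the resulting law of $R_n$ is uniform on $\{1, \ldots, n+1\}$, consistent with $M_\infty$ having the law of $U(0, 1]$ in Exercise 5.3.27.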
For the study of Pólya's urn scheme we need the following definition.
Definition 5.3.26. The beta density with parameters $\alpha > 0$ and $\beta > 0$ is
$$f(u) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} u^{\alpha - 1} (1 - u)^{\beta - 1} 1_{u \in [0, 1]} ,$$
where $\Gamma(\alpha) = \int_0^\infty s^{\alpha - 1} e^{-s} \, ds$ is finite and positive (compare with Definition 1.4.45). In particular, $\alpha = \beta = 1$ corresponds to the density $f_U(u)$ of the uniform measure on $(0, 1]$, as in Example 1.2.41.
Exercise 5.3.27. Let $\{M_n\}$ be the martingale of Example 5.3.25.
(a) Show that $M_n \to M_\infty$ a.s. and in $L^p$ for any $p > 1$.
(b) Assuming further that $c_k = c$ for all $k \ge 1$, show that for $\ell = 0, \ldots, n$,
$$P(R_n = r + \ell c) = \binom{n}{\ell} \frac{\prod_{i=0}^{\ell - 1} (r + ic) \prod_{j=0}^{n - \ell - 1} (b + jc)}{\prod_{k=0}^{n - 1} (r + b + kc)} ,$$
and deduce that $M_\infty$ has the beta density with parameters $\alpha = r/c$ and $\beta = b/c$ (in particular, $M_\infty$ has the law of $U(0, 1]$ when $r = b = c_k > 0$).
(c) For $r = b = c_k > 0$ show that $P(\sup_{k \ge 1} M_k > 3/4) \le 2/3$.
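The pmf of part (b) can be cross-checked numerically in exact rational arithmetic; a short sketch of mine:

```python
from fractions import Fraction
from math import comb

def polya_pmf(n, ell, r, b, c):
    """P(R_n = r + ell*c) per the product formula of Exercise 5.3.27(b):
    binom(n, ell) * prod_{i<ell}(r+ic) * prod_{j<n-ell}(b+jc)
    divided by prod_{k<n}(r+b+kc)."""
    num = Fraction(comb(n, ell))
    for i in range(ell):
        num *= r + i * c
    for j in range(n - ell):
        num *= b + j * c
    den = 1
    for k in range(n):
        den *= r + b + k * c
    return num / den
```

For $r = b = c = 1$ this gives $P(R_n = 1 + \ell) = \binom{n}{\ell}\,\ell!\,(n-\ell)!/(n+1)! = 1/(n+1)$, i.e. the uniform law behind part (b)'s $U(0, 1]$ limit.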
Exercise 5.3.29. Fixing $b_n \in [\delta, 1]$ for some $\delta > 0$, suppose $\{X_n\}$ are $[0,1]$-valued, $\mathcal{F}_n$-adapted such that $X_{n+1} = (1 - b_n)X_n + b_n B_n$, $n \ge 0$, and $P(B_n = 1|\mathcal{F}_n) = 1 - P(B_n = 0|\mathcal{F}_n) = X_n$. Show that $X_n \xrightarrow{a.s.} X_\infty \in \{0, 1\}$ and $P(X_\infty = 1|\mathcal{F}_0) = X_0$.
Proposition 5.3.31. There exist finite constants $c_q$, $q \in (0,1]$, such that if $(X_n, \mathcal{F}_n)$ is an $L^2$-MG with $X_0 = 0$, then
$$E[\sup_k |X_k|^{2q}] \le c_q\, E[\langle X\rangle_\infty^q] .$$
Theorem 5.3.33. Suppose $(X_n, \mathcal{F}_n)$ is an $L^2$ martingale with $X_0 = 0$ and predictable compensator $\langle X\rangle_n$. Then:
(a) $X_n(\omega)$ converges to a finite limit for a.e. $\omega$ for which $\langle X\rangle_\infty(\omega)$ is finite.
(b) $X_n(\omega)/\langle X\rangle_n(\omega) \to 0$ for a.e. $\omega$ for which $\langle X\rangle_\infty(\omega)$ is infinite.
(c) If the martingale differences $X_n - X_{n-1}$ are uniformly bounded then the converse to part (a) holds. That is, $\langle X\rangle_\infty(\omega)$ is finite for a.e. $\omega$ for which $X_n(\omega)$ converges to a finite limit.
Proof. (a) Recall that for any $n$ and $\mathcal{F}_n$-stopping time $\theta$ we have the identity $X_{n \wedge \theta}^2 = M_{n \wedge \theta} + \langle X\rangle_{n \wedge \theta}$ with $EM_{n \wedge \theta} = 0$, yielding by monotone convergence that $\sup_n E[X_{n \wedge \theta}^2] = E\langle X\rangle_\theta$. While proving Lemma 5.2.7 we noted that $\theta_v = \min\{n \ge 0 : \langle X\rangle_{n+1} > v\}$ are $\mathcal{F}_n$-stopping times such that $\langle X\rangle_{\theta_v} \le v$. Thus, setting $Y_n = X_{n \wedge \theta_k}$ for a positive integer $k$, the martingale $(Y_n, \mathcal{F}_n)$ is $L^2$-bounded and as such it almost surely has a finite limit. Further, if $\langle X\rangle_\infty(\omega)$ is finite, then by definition $\theta_k(\omega) = \infty$ for some random positive integer $k = k(\omega)$, in which case $X_{n \wedge \theta_k} = X_n$ for all $n$. As we consider only countably many values of $k$, this yields the thesis of part (a) of the theorem.
(b). Since $V_n = (1 + \langle X\rangle_n)^{-1}$ is an $\mathcal{F}_n$-predictable sequence of bounded variables, its martingale transform $Y_n = \sum_{k=1}^n V_k (X_k - X_{k-1})$ with respect to the square-integrable martingale $\{X_n\}$ is also a square-integrable martingale for the filtration $\{\mathcal{F}_n\}$ (c.f. Theorem 5.1.28). Further, since $V_k \in m\mathcal{F}_{k-1}$ it follows that for all $k \ge 1$,
$$\langle Y\rangle_k - \langle Y\rangle_{k-1} = E[(Y_k - Y_{k-1})^2 | \mathcal{F}_{k-1}] = V_k^2\, E[(X_k - X_{k-1})^2 | \mathcal{F}_{k-1}] = \frac{\langle X\rangle_k - \langle X\rangle_{k-1}}{(1 + \langle X\rangle_k)^2} \le \frac{1}{1 + \langle X\rangle_{k-1}} - \frac{1}{1 + \langle X\rangle_k} .$$
$\sum_{k=1}^n \xi_k$ denote the sum of the first $n$ conditional probabilities $\xi_k = P(A_k|\mathcal{F}_{k-1})$, and let $Z_\infty = \sum_k \xi_k$. Then, for almost every $\omega$,
(a) If $Z_\infty(\omega)$ is finite, then so is $S_\infty(\omega)$.
(b) If $Z_\infty(\omega)$ is infinite, then $S_n(\omega)/Z_n(\omega) \to 1$.
Remark. Given any sequence of events, by the tower property $E\xi_k = P(A_k)$ for all $k$, and setting $\mathcal{F}_n = \sigma(A_k, k \le n)$ guarantees that $A_k \in \mathcal{F}_k$ for all $k$. Hence,
(a) If $EZ_\infty = \sum_k P(A_k)$ is finite, then from part (a) of Proposition 5.3.34 we deduce that $\sum_k I_{A_k}$ is finite a.s., thus recovering the first Borel-Cantelli lemma.
(b) For $\mathcal{F}_n = \sigma(A_k, k \le n)$ and mutually independent events $\{A_k\}$ we have that $\xi_k = P(A_k)$ and $Z_n = ES_n$ for all $n$. Thus, in this case, part (b) of Proposition 5.3.34 is merely the statement that $S_n/ES_n \xrightarrow{a.s.} 1$ when $\sum_k P(A_k) = \infty$, which is your extension of the second Borel-Cantelli via Exercise 2.2.26.
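Part (b) of the proposition is easy to see in a quick simulation. The sketch below is our own illustration, not from the text: for independent events with $P(A_k) = k^{-1/2}$ the sum $Z_n$ diverges, and the ratio $S_n/Z_n$ settles near one.

```python
import random

rng = random.Random(1)
n = 100_000
s = 0.0  # S_n: how many of A_1, ..., A_n occurred
z = 0.0  # Z_n: sum of the conditional probabilities, here simply k**-0.5
for k in range(1, n + 1):
    p_k = k ** -0.5
    z += p_k
    if rng.random() < p_k:
        s += 1.0
ratio = s / z  # part (b): S_n / Z_n -> 1 a.s. when Z_infinity is infinite
print(round(ratio, 2))
```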
Proof. Clearly, $M_n = S_n - Z_n$ is square-integrable and $\mathcal{F}_n$-adapted. Further, as $M_n - M_{n-1} = I_{A_n} - E[I_{A_n}|\mathcal{F}_{n-1}]$ and $\mathrm{Var}(I_{A_n}|\mathcal{F}_{n-1}) = \xi_n(1 - \xi_n)$, it follows that the predictable compensator of the $L^2$ martingale $(M_n, \mathcal{F}_n)$ is $\langle M\rangle_n = \sum_{k=1}^n \xi_k (1 - \xi_k)$. Hence, $\langle M\rangle_n \le Z_n$ for all $n$, and if $Z_\infty(\omega)$ is finite, then so is $\langle M\rangle_\infty(\omega)$. By part (a) of Theorem 5.3.33, for a.e. such $\omega$ the finite limit $M_\infty(\omega)$ of $M_n(\omega)$ exists, implying that $S_\infty = M_\infty + Z_\infty$ is finite as well.
With $S_n = M_n + Z_n$, it suffices for part (b) of the proposition to show that $M_n/Z_n \to 0$ for a.e. $\omega$ for which $Z_\infty(\omega) = \infty$. To this end, note first that by the preceding argument, the finite limit $M_\infty(\omega)$ exists also for a.e. $\omega$ for which $Z_\infty(\omega) = \infty$ while $\langle M\rangle_\infty(\omega)$ is finite. For such $\omega$ we have that $M_n/Z_n \to 0$ (since $M_n(\omega)$ is a bounded sequence while $Z_n(\omega)$ is unbounded). Finally, from part (b) of Theorem 5.3.33 we know that $M_n/\langle M\rangle_n$, hence also $M_n/Z_n$, converges to zero for a.e. $\omega$ for which $\langle M\rangle_\infty(\omega)$ is infinite.
Here is a direct application of Theorem 5.3.33.
Exercise 5.3.35. Given a martingale $(M_n, \mathcal{F}_n)$ and positive, non-random $b_n \uparrow \infty$, show that $b_n^{-1} M_n \to 0$ for a.e. $\omega$ such that $\sum_{k \ge 1} b_k^{-2} E[(M_k - M_{k-1})^2|\mathcal{F}_{k-1}]$ is finite.
Hint: Consider $X_n = \sum_{k=1}^n b_k^{-1}(M_k - M_{k-1})$ and recall Kronecker's lemma.
The following extension of Kolmogorov's three series theorem uses both Theorem 5.3.33 and Lévy's extension of the Borel-Cantelli lemmas.
Exercise 5.3.36. Suppose $\{X_n\}$ is adapted to the filtration $\{\mathcal{F}_n\}$ and for any $n$, the R.C.P.D. of $X_n$ given $\mathcal{F}_{n-1}$ equals the R.C.P.D. of $-X_n$ given $\mathcal{F}_{n-1}$. For non-random $c > 0$ let $X_n^{(c)} = X_n I_{|X_n| \le c}$ be the corresponding truncated variables.
(a) Verify that $(Z_n, \mathcal{F}_n)$ is a MG, where $Z_n = \sum_{k=1}^n X_k^{(c)}$.
(b) Considering the series
$$(5.3.3) \qquad \sum_n P(|X_n| > c\,|\mathcal{F}_{n-1}), \qquad \text{and} \qquad \sum_n \mathrm{Var}(X_n^{(c)}|\mathcal{F}_{n-1}),$$
show that for a.e. $\omega$ the series $\sum_n X_n(\omega)$ has a finite limit if and only if both series in (5.3.3) converge.
(c) Provide an example where the convergence in part (b) occurs with probability $0 < p < 1$.
We now consider sufficient conditions for almost sure convergence of the martingale transform.
Exercise 5.3.37. Suppose $Y_n = \sum_{k=1}^n V_k(Z_k - Z_{k-1})$ is the martingale transform of the $\mathcal{F}_n$-predictable $\{V_n\}$ with respect to the martingale $(Z_n, \mathcal{F}_n)$, per Definition 5.1.27.
(a) Show that if $\{Z_n\}$ is $L^2$-bounded and $\{V_n\}$ is uniformly bounded then $Y_n \xrightarrow{a.s.} Y_\infty$ finite.
(b) Deduce that for an $L^2$-bounded MG $\{Z_n\}$ the sequence $Y_n(\omega)$ converges to a finite limit for a.e. $\omega$ for which $\sup_{k \ge 1} |V_k(\omega)|$ is finite.
(c) Suppose now that $\{V_k\}$ is predictable for the canonical filtration $\{\mathcal{F}_n\}$ of the i.i.d. $\{\xi_k\}$. Show that if $\xi_k \stackrel{D}{=} -\xi_k$ and $u \mapsto u P(|\xi_1| \ge u)$ is bounded above, then the series $\sum_n V_n \xi_n$ has a finite limit for a.e. $\omega$ for which $\sum_{k \ge 1} |V_k(\omega)|$ is finite.
Hint: Consider Exercise 5.3.36 for the adapted sequence $X_k = V_k \xi_k$.
(a) Show that $(X_n, \mathcal{F}_n)$ is a sub-martingale and provide its Doob decomposition.
(b) Using this decomposition and Lévy's extension of the Borel-Cantelli lemmas, show that $X_n \to \infty$ almost surely.
(c) Let $Z_n = \gamma^{X_n}$ for $\gamma = (1 - \varepsilon)/(1 + \varepsilon)$. Show that $(Z_n, \mathcal{F}_n)$ is a super-martingale and deduce that $P(\inf_n X_n \ge 0) \ge \varepsilon$.
As we show next, the predictable compensator controls the exponential tails for martingales of bounded differences.
Exercise 5.3.39. Fix $\lambda > 0$ non-random and an $L^2$ martingale $(M_n, \mathcal{F}_n)$ with $M_0 = 0$ and bounded differences $\sup_k |M_k - M_{k-1}| \le 1$.
(a) Show that $N_n = \exp(\lambda M_n - (e^\lambda - \lambda - 1)\langle M\rangle_n)$ is a sup-MG for $\{\mathcal{F}_n\}$.
Hint: Recall part (a) of Exercise 1.4.40.
(b) Show that for any a.s. finite $\mathcal{F}_n$-stopping time $\theta$ and constants $u, r > 0$,
$$P(M_\theta \ge u, \langle M\rangle_\theta \le r) \le \exp(-\lambda u + r(e^\lambda - \lambda - 1)) .$$
(c) Applying (a) show that if the martingale $\{S_n\}$ of Example 5.3.23 has uniformly bounded differences $|\xi_k| \le 1$, then $E\exp(\lambda S_\infty)$ is finite for $S_\infty = \sum_k \xi_k$ and any $\lambda \in \mathbb{R}$.
Applying part (c) of the preceding exercise, you are next to derive the following tail estimate, due to Dvoretzky, in the context of Lévy's extension of the Borel-Cantelli lemmas.
Exercise 5.3.40. Suppose $A_k \in \mathcal{F}_k$ for some filtration $\{\mathcal{F}_k\}$. Let $S_n = \sum_{k=1}^n I_{A_k}$ and $Z_n = \sum_{k=1}^n P(A_k|\mathcal{F}_{k-1})$. Show that $P(S_n \ge r + u, Z_n \le r) \le e^u (r/(r+u))^{r+u}$ for all $n$ and $u, r > 0$, then deduce that for any $0 < r < 1$,
$$P\Big(\bigcup_{k=1}^n A_k\Big) \le e r + P\Big(\sum_{k=1}^n P(A_k|\mathcal{F}_{k-1}) > r\Big) .$$
(a) Show that $N_n = \exp(\lambda M_n - \lambda^2 r_n/2)$ is a sup-MG for $\mathcal{F}_n$ provided $\lambda \in [0, 1]$ and $r_n = \sum_{k=1}^n \gamma_k^2$.
Hint: Recall part (b) of Exercise 1.4.40.
(b) Deduce that for $I(x) = (x \wedge 1)(2x - x \wedge 1)$ and any $u \ge 0$,
we see that $E[(Z - X_\theta) I_A] \le 0$ for all $A \in \mathcal{F}_\theta$. Since both $Z$ and $X_\theta$ are measurable on $\mathcal{F}_\theta$ (see part (b) of Exercise 5.1.35), it thus follows that a.s. $Z \le X_\theta$, as claimed.
Proposition 5.4.4. Suppose $\{Y_n\}$ is integrable and $\theta$ is a stopping time for a filtration $\{\mathcal{F}_n\}$. Then, $\{Y_{n \wedge \theta}\}$ is uniformly integrable if any one of the following conditions hold.
(a) $E\theta < \infty$ and a.s. $E[|Y_n - Y_{n-1}|\,|\mathcal{F}_{n-1}] \le c$ for some finite, non-random $c$.
(b) $\{Y_{n \wedge \theta} I_{\theta > n}\}$ is uniformly integrable and $Y_\theta I_{\theta < \infty}$ is integrable.
(c) $(Y_n, \mathcal{F}_n)$ is a uniformly integrable sub-MG (or sup-MG).
Proof. (a) Clearly, $|Y_{n \wedge \theta}| \le Z_\infty$, where
$$Z_n = |Y_0| + \sum_{k=1}^n |Y_k - Y_{k-1}|\, I_{\theta \ge k} ,$$
so that
$$EZ_\infty \le E|Y_0| + c \sum_{k=1}^\infty P(\theta \ge k) = E|Y_0| + c\, E\theta ,$$
which is finite.
(c) The hypothesis of (c) that $\{Y_n\}$ is U.I. implies that $\{Y_{n \wedge \theta} I_{\theta > n}\}$ is also U.I. and that $\sup_n E[(Y_n)^+]$ is finite. With $\theta$ an $\mathcal{F}_n$-stopping time and $(Y_n, \mathcal{F}_n)$ a sub-MG, it further follows by Lemma 5.3.7 that $Y_\theta I_{\theta < \infty}$ is integrable. Having arrived at the hypothesis of part (b), we are done.
Check that by part (b) of Exercise 5.2.16 and part (c) of Proposition 5.4.4 it follows from Doob's optional stopping theorem that $ES_\theta = 0$ for any stopping time $\theta$ with respect to the canonical filtration of $S_n = \sum_{k=1}^n \xi_k$, provided the independent $\xi_k$ are integrable with $E\xi_k = 0$ and $\sup_n E|S_n| < \infty$.
Sometimes Doob's optional stopping theorem is applied en-route to a useful contradiction. For example,
Exercise 5.4.6. Show that if $\{X_n\}$ is a sub-martingale such that $EX_0 \ge 0$ and $\inf_n X_n < 0$ a.s. then necessarily $E[\sup_n X_n] = \infty$.
Hint: Assuming first that $\sup_n |X_n|$ is integrable, apply Doob's optional stopping theorem to arrive at a contradiction. Then consider the same argument for the sub-MG $Z_n = \max\{X_n, -1\}$.
Exercise 5.4.7. Fixing $b > 0$, let $\tau_b = \min\{n \ge 0 : S_n \ge b\}$ for the random walk $\{S_n\}$ of Definition 5.1.6 and suppose $\xi_n = S_n - S_{n-1}$ are uniformly bounded, of zero mean and positive variance.
(a) Show that $\tau_b$ is almost surely finite.
Hint: See Proposition 5.3.5.
(b) Show that $E[\min\{S_n : n \le \tau_b\}] = -\infty$.
Martingales often provide much information about specific stopping times. We detail below one such example, pertaining to the srw of Definition 5.1.6.
Corollary 5.4.8 (Gambler's Ruin). Fixing positive integers $a$ and $b$, the probability that a srw $\{S_n\}$, starting at $S_0 = 0$, hits $-a$ before first hitting $+b$ is $r = (e^{\lambda b} - 1)/(e^{\lambda b} - e^{-\lambda a})$ for $\lambda = \log[(1-p)/p] \ne 0$. For the symmetric srw, i.e. when $p = 1/2$, this probability is $r = b/(a+b)$.
Remark. The probability $r$ is often called the gambler's ruin, or ruin probability, for a gambler with initial capital of $+a$, betting on the outcome of independent rounds of the same game, a unit amount per round, gaining or losing an amount equal to his bet in each round and stopping when either all his capital is lost (the ruin event), or his accumulated gains reach the amount $+b$.
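The closed form of Corollary 5.4.8 is simple to confirm by Monte Carlo. This sketch is our own (the parameter values are arbitrary choices, not from the text):

```python
import math
import random

def ruin_prob(p, a, b):
    """Corollary 5.4.8: P(srw hits -a before +b), with lam = log((1-p)/p)."""
    if p == 0.5:
        return b / (a + b)
    lam = math.log((1 - p) / p)
    return (math.exp(lam * b) - 1) / (math.exp(lam * b) - math.exp(-lam * a))

def simulate_ruin(p, a, b, n_runs, rng):
    ruins = 0
    for _ in range(n_runs):
        s = 0
        while -a < s < b:
            s += 1 if rng.random() < p else -1
        ruins += (s == -a)
    return ruins / n_runs

rng = random.Random(2)
p, a, b = 0.55, 3, 4
exact = ruin_prob(p, a, b)
est = simulate_ruin(p, a, b, 20_000, rng)
print(round(exact, 3), round(est, 3))
```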
Proof. Consider the stopping time $\tau_{a,b} = \inf\{n \ge 0 : S_n \ge b, \text{ or } S_n \le -a\}$ for the canonical filtration of the srw. That is, $\tau_{a,b}$ is the first time that the srw exits the interval $(-a, b)$. Since $(S_k + k)/2$ has the Binomial$(k, p)$ distribution, it is not hard to check that $\sup_\ell P(S_k = \ell) \to 0$ as $k \to \infty$, hence $P(\tau_{a,b} > k) \le P(-a < S_k < b) \to 0$ as $k \to \infty$. Consequently, $\tau_{a,b}$ is finite a.s. Further, starting at $S_0 \in (-a, b)$ and using only increments $\xi_k \in \{-1, 1\}$, necessarily $S_{\tau_{a,b}} \in \{-a, b\}$ with probability one. Our goal is thus to compute the ruin probability $r = P(S_{\tau_{a,b}} = -a)$.
You are now to derive Wald's identities about stopping times for the random walk, and use them to gain further information about the stopping times $\tau_{a,b}$ of the preceding corollary.
Exercise 5.4.10. Let $\tau$ be an integrable stopping time for the canonical filtration of the random walk $\{S_n\}$.
(a) Show that if $\xi_1$ is integrable, then Wald's identity $ES_\tau = E\xi_1\, E\tau$ holds.
Hint: Use the representation $S_\tau = \sum_{k=1}^\infty \xi_k I_{\tau \ge k}$ and independence.
(b) Show that if in addition $\xi_1$ is square-integrable, then Wald's second identity $E[(S_\tau - \tau E\xi_1)^2] = \mathrm{Var}(\xi_1)\, E\tau$ holds as well.
Hint: Explain why you may assume that $E\xi_1 = 0$, prove the identity with $\tau \wedge n$ instead of $\tau$ and use Doob's $L^2$ convergence theorem.
(c) Show that if $\xi_1 \ge 0$ then Wald's identity applies also when $E\tau = \infty$ (under the convention that $\infty \cdot 0 = 0$).
Exercise 5.4.11. For the srw $\{S_n\}$ and positive integers $a, b$ consider the stopping time $\tau_{a,b} = \min\{n \ge 0 : S_n \notin (-a, b)\}$ as in the proof of Corollary 5.4.8.
(a) Check that $E[\tau_{a,b}] < \infty$.
Hint: See Exercise 5.1.15.
(b) Combining Corollary 5.4.8 with Wald's identities, compute the value of $E[\tau_{a,b}]$.
(c) Show that $\tau_{a,b} \uparrow \tau_b = \min\{n \ge 0 : S_n = b\}$ for $a \uparrow \infty$ (where the minimum over the empty set is $\infty$), and deduce that $E\tau_b = b/(2p - 1)$ when $p > 1/2$.
(d) Show that $\tau_b$ is almost surely finite when $p \ge 1/2$.
(e) Find constants $c_1$ and $c_2$ such that $Y_n = S_n^4 - 6nS_n^2 + c_1 n^2 + c_2 n$ is a martingale for the symmetric srw, and use it to evaluate $E[(\tau_{b,b})^2]$ in this case.
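For the symmetric srw the exit time is easy to probe by simulation (this sketch is ours, not from the text): since $S_{\tau_{a,b}} \in \{-a, b\}$ with the probabilities of Corollary 5.4.8, Wald's second identity $E[S_{\tau_{a,b}}^2] = E\tau_{a,b}$ gives $E\tau_{a,b} = ab$ when $p = 1/2$.

```python
import random

def exit_time(a, b, rng):
    """Steps until the symmetric srw started at 0 leaves (-a, b)."""
    s, n = 0, 0
    while -a < s < b:
        s += 1 if rng.random() < 0.5 else -1
        n += 1
    return n

rng = random.Random(3)
a, b = 3, 5
times = [exit_time(a, b, rng) for _ in range(20_000)]
avg = sum(times) / len(times)
# Wald's second identity suggests E[tau_{a,b}] = a*b = 15 for p = 1/2
print(round(avg, 1))
```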
We next provide a few applications of Doob's optional stopping theorem, starting with information on the law of $\tau_b$ for the srw (and certain other random walks).
Exercise 5.4.12. Consider the stopping time $\tau_b = \inf\{n \ge 0 : S_n = b\}$ and the martingale $M_n = \exp(\lambda S_n) M(\lambda)^{-n}$ for a srw $\{S_n\}$, with $b$ a positive integer and $M(\lambda) = E[e^{\lambda \xi_1}]$.
(a) Show that if $p = 1 - q \in [1/2, 1)$ then $e^{\lambda b}\, E[M(\lambda)^{-\tau_b}] = 1$ for every $\lambda > 0$.
(b) Deduce that for $p \in [1/2, 1)$ and every $0 < s < 1$,
$$E[s^{\tau_1}] = \frac{1}{2qs}\Big[1 - \sqrt{1 - 4pqs^2}\,\Big] ,$$
(c) Show that $n^{-1} L_n \xrightarrow{a.s.} a^{-1}$ and further that $(L_n - n/a)/\sqrt{vn} \xrightarrow{D} G$ for some finite, positive constant $v$.
Hint: Show that the renewal theory clt of Exercise 3.2.9 applies here.
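The generating function in part (b) can be checked against simulation; the sketch below is our own (with $p = 0.6$ and $s = 0.9$ as arbitrary choices):

```python
import math
import random

def hit_time_one(p, rng):
    """First time the srw with P(step = +1) = p reaches +1 (a.s. finite if p >= 1/2)."""
    s, n = 0, 0
    while s < 1:
        s += 1 if rng.random() < p else -1
        n += 1
    return n

rng = random.Random(4)
p, q, s0 = 0.6, 0.4, 0.9
est = sum(s0 ** hit_time_one(p, rng) for _ in range(20_000)) / 20_000
exact = (1 - math.sqrt(1 - 4 * p * q * s0 * s0)) / (2 * q * s0)
print(round(exact, 3), round(est, 3))
```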
Exercise 5.4.15. Consider a fair game consisting of successive turns whose outcomes are the i.i.d. signs $\xi_k \in \{-1, 1\}$ such that $P(\xi_1 = 1) = \frac{1}{2}$, and where upon betting the wagers $\{V_k\}$ in each turn, your gain (or loss) after $n$ turns is $Y_n = \sum_{k=1}^n V_k \xi_k$.
$$Z_n = \sum_{j=1}^{Z_{n-1}} N_j^{(n)} ,$$
for the offspring counts $\{N_j^{(n)}, j = 1, \ldots\}$
$$E[I_A|\mathcal{F}_n]\, I_{\{Z_n \le k\}} \ge E[I_{\{Z_{n+1} = 0\}}|\mathcal{F}_n]\, I_{\{Z_n \le k\}} \ge P(N = 0)^k I_{\{Z_n \le k\}}$$
for all $n$ and $k$. That is, (5.5.1) holds in this case for $\epsilon_k = P(N = 0)^k > 0$. As shown already, this implies that with probability one either $Z_n \to \infty$ or $Z_n = 0$ for all $n$ large enough.
The generating function
$$(5.5.2) \qquad L(s) = E[s^N] = P(N = 0) + \sum_{k=1}^\infty P(N = k) s^k$$
plays a key role in analyzing the branching process. In this task, we employ the following martingales associated with the branching process.
Lemma 5.5.4. Suppose $1 > P(N = 0) > 0$. Then, $(X_n, \mathcal{F}_n)$ is a martingale where $X_n = m_N^{-n} Z_n$. In the super-critical case we also have the martingale $(M_n, \mathcal{F}_n)$ for $M_n = \eta^{Z_n}$ and $\eta \in (0, 1)$ the unique solution of $s = L(s)$. The same applies in the sub-critical case if there exists a solution $\eta \in (1, \infty)$ of $s = L(s)$.
Proof. We first verify that
$$E[Z_{n+1}|\mathcal{F}_n] = m_N Z_n .$$
Indeed, recall that the i.i.d. random variables $N_j^{(n+1)}$ of finite mean $m_N$ are independent of $\mathcal{F}_n$, on which $Z_n$ is measurable. Hence, by linearity of the expectation it follows that for any $A \in \mathcal{F}_n$,
$$E[Z_{n+1} I_A] = \sum_{j=1}^\infty E[N_j^{(n+1)} I_{\{Z_n \ge j\}} I_A] = \sum_{j=1}^\infty E[N_j^{(n+1)}]\, E[I_{\{Z_n \ge j\}} I_A] = m_N E[Z_n I_A] .$$
Similarly, writing $s^{Z_{n+1}} = \sum_{\ell=0}^\infty I_{\{Z_n = \ell\}} \prod_{j=1}^\ell s^{N_j^{(n+1)}}$, with the variables $N_j^{(n+1)}$ independent of each other and of $\mathcal{F}_n$, we get that for any $A \in \mathcal{F}_n$,
$$E[s^{Z_{n+1}} I_A] = \sum_{\ell=0}^\infty E\Big[I_{\{Z_n = \ell\}} I_A \prod_{j=1}^\ell s^{N_j^{(n+1)}}\Big] = \sum_{\ell=0}^\infty E[I_{\{Z_n = \ell\}} I_A] \prod_{j=1}^\ell E[s^{N_j^{(n+1)}}] = E[L(s)^{Z_n} I_A] .$$
Since $Z_n \ge 0$ and $L(s) \le \max(s, 1)$, this implies that $E s^{Z_{n+1}} \le 1 + E s^{Z_n}$ and the integrability of $s^{Z_n}$ follows by induction on $n$. Given that $s^{Z_n}$ is integrable and the preceding identity holds for all $A \in \mathcal{F}_n$, we have thus verified the right identity in (5.5.3), which in case $s = L(s)$ is precisely the martingale condition for $M_n = s^{Z_n}$.
Finally, to prove that $s = L(s)$ has a unique solution $\eta \in (0, 1)$ when $m_N = EN > 1$, note that the function $s \mapsto L(s)$ of (5.5.2) is continuous and bounded on $[0, 1]$. Further, since $L(1) = 1$ and $L'(1) = EN > 1$, it follows that $L(s) < s$ for some $0 < s < 1$. With $L(0) = P(N = 0) > 0$ we have by continuity that $s = L(s)$ for some $s \in (0, 1)$. To show the uniqueness of such a solution, note that $EN > 1$ implies that $P(N = k) > 0$ for some $k > 1$, so $L''(s) = \sum_{k=2}^\infty k(k-1) P(N = k) s^{k-2}$ is positive and finite on $(0, 1)$. Consequently, $L(\cdot)$ is strictly convex there. Hence, if $\eta \in (0, 1)$ is such that $\eta = L(\eta)$, then $L(s) < s$ for $s \in (\eta, 1)$, so such a solution $\eta \in (0, 1)$ is unique.
Remark. Since $X_n = m_N^{-n} Z_n$ is a martingale with $X_0 = 1$, it follows that $EZ_n = m_N^n$ for all $n \ge 0$. Thus, a sub-critical branching process, i.e. when $m_N < 1$, has mean total population size
$$E\Big[\sum_{n=0}^\infty Z_n\Big] = \sum_{n=0}^\infty m_N^n = \frac{1}{1 - m_N} ,$$
which is finite.
as stated.
Remark. In particular, the probability qn = P(Zn = 0) = Ln (0) that the branching process is extinct after n generations is given by the recursion qn = L(qn1 )
for n 1, starting at q0 = 0. Since the continuous function L(s) is above s on
the interval from zero to the smallest positive solution of s = L(s) it follows that
qn is a monotone non-decreasing sequence that converges to this solution, which is
thus the value of pex . This alternative evaluation of pex does not use martingales.
Though implicit here, it instead relies on the Markov property of the branching
process (c.f. Example 6.1.10).
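The recursion $q_n = L(q_{n-1})$ is directly computable. Below is a minimal numerical sketch (our own, taking Poisson offspring as the assumed example, so $L(s) = e^{m(s-1)}$): it iterates to the smallest positive fixed point, which is the extinction probability.

```python
import math

def L(s, m):
    """Generating function E[s^N] for Poisson(m) offspring: exp(m(s-1))."""
    return math.exp(m * (s - 1.0))

def extinction_prob(m, n_iter=200):
    """Iterate q_n = L(q_{n-1}) from q_0 = 0; q_n increases to p_ex."""
    q = 0.0
    for _ in range(n_iter):
        q = L(q, m)
    return q

q_super = extinction_prob(1.5)  # super-critical: m_N > 1, so p_ex < 1
q_sub = extinction_prob(0.8)    # sub-critical: m_N < 1, so p_ex = 1
print(round(q_super, 3), round(q_sub, 3))
```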
$$L_n(s) = E[E(s^{Z_n}|Z_1)] = \sum_{k=0}^\infty P(N = k) L_{n-1}(s)^k = L[L_{n-1}(s)]$$
for $L(\cdot)$ of (5.5.2), as claimed. Obviously, $L_0(s) = s$ and $L_1(s) = E[s^N] = L(s)$.
From this identity we conclude that $\widehat{L}_n(s) = L[\widehat{L}_{n-1}(s^{1/m_N})]$ for $\widehat{L}_n(s) = E[s^{X_n}]$ and $X_n = m_N^{-n} Z_n$. With $X_n \xrightarrow{a.s.} X_\infty$ we have by bounded convergence that $\widehat{L}_n(s) \to \widehat{L}_\infty(s) = E[s^{X_\infty}]$, which by the continuity of $r \mapsto L(r)$ on $[0, 1]$ is thus a solution of the identity $\widehat{L}_\infty(s) = L[\widehat{L}_\infty(s^{1/m_N})]$. Further, by monotone convergence $\widehat{L}_\infty(s) \uparrow \widehat{L}_\infty(1) = 1$ as $s \uparrow 1$.
Remark. Of course, $q_n = P(T \le n)$ provides the distribution function of the time of extinction $T = \min\{k \ge 0 : Z_k = 0\}$. For example, if $N$ has the Bernoulli$(p)$ distribution for some $0 < p < 1$ then $T$ is merely a Geometric$(1-p)$ random variable, but in general the law of $T$ is more involved.
$$E[N_n^2] = \Big(\prod_{k=1}^n a_k\Big)^{-2} E M_n \le \Big(\prod_k a_k\Big)^{-2} = c < \infty .$$
Thus, with $M_k \le N_k^2$ it follows by the $L^2$-maximal inequality that for all $n$,
$$E \max_{k=0}^n M_k \le E \max_{k=0}^n N_k^2 \le 4 E[N_n^2] \le 4c .$$
Hence, $M_k \ge 0$ are such that $\sup_k M_k$ is integrable and in particular, $\{M_n\}$ is U.I. (that is, (a) holds).
Finally, to see why the statements (d) and (e) are equivalent, note that upon applying the Borel-Cantelli lemmas for independent events $A_n$ with $P(A_n) = 1 - a_n$, the divergence of the series $\sum_k (1 - a_k)$ is equivalent to $P(A_n^c \text{ eventually}) = 0$, which for strictly positive $a_k$ is equivalent to $\prod_k a_k = 0$.
We next consider another martingale that is key to the study of likelihood ratios in sequential statistics. To this end, let $P$ and $Q$ be two probability measures on the same measurable space $(\Omega, \mathcal{F})$, with $P_n = P|_{\mathcal{F}_n}$ and $Q_n = Q|_{\mathcal{F}_n}$ denoting the restrictions of $P$ and $Q$ to a filtration $\mathcal{F}_n \subseteq \mathcal{F}$.
Theorem 5.5.11. Suppose $Q_n \ll P_n$ for all $n$, with $M_n = dQ_n/dP_n$ denoting the corresponding Radon-Nikodym derivatives on $(\Omega, \mathcal{F}_n)$. Then,
(c) More generally, the Lebesgue decomposition of $Q$ into its absolutely continuous and singular parts with respect to $P$ is
$$(5.5.4) \qquad Q = Q_{ac} + Q_s = M_\infty P + I_{\{M_\infty = \infty\}} Q .$$
and since $M_\infty$ is finite $P$-a.s. this is precisely the stated Lebesgue decomposition of $Q$ with respect to $P$.
Combining Theorem 5.5.11 and Kakutani's theorem we next deduce that if the marginals of one infinite product measure are absolutely continuous with respect to those of another, then either the former product measure is absolutely continuous with respect to the latter, or these two measures are mutually singular. This dichotomy is a key result in the treatment by theoretical statistics of the problem of hypothesis testing (with independent observables under both the null hypothesis and the alternative hypothesis).
Proposition 5.5.13. Suppose that $P$ and $Q$ are product measures on $(\mathbb{R}^{\mathbb{N}}, \mathcal{B}_c)$ which make the coordinates $X_n(\omega) = \omega_n$ independent, with the respective laws $Q \circ X_k^{-1} \ll P \circ X_k^{-1}$ for each $k \in \mathbb{N}$. Let $Y_k(\omega) = \frac{d(Q \circ X_k^{-1})}{d(P \circ X_k^{-1})}(X_k(\omega))$ denote the likelihood ratios of the marginals. Then, $M_\infty = \prod_k Y_k$ exists a.s. under both $P$ and $Q$. If $\gamma = \prod_k P(\sqrt{Y_k})$ is positive then $Q$ is absolutely continuous with respect to $P$ with $dQ/dP = M_\infty$, whereas if $\gamma = 0$ then $Q$ is singular with respect to $P$, such that $Q$-a.s. $M_\infty = \infty$ while $P$-a.s. $M_\infty = 0$.
Remark 5.5.14. Note that the preceding $Y_k$ are identically distributed when both $P$ and $Q$ are products of i.i.d. random variables. Hence, in this case $\gamma > 0$ if and only if $P(\sqrt{Y_1}) = 1$, which with $P(Y_1) = 1$ is equivalent to $P[(\sqrt{Y_1} - 1)^2] = 0$, i.e. to having $P$-a.s. $Y_1 = 1$. The latter condition implies that $P$-a.s. $M_\infty = 1$, so $Q = P$. We thus deduce that any $Q \ne P$ that are both products of i.i.d. random variables are mutually singular, and for $n$ large enough the likelihood test of comparing $M_n$ to a fixed threshold decides correctly between the two hypotheses regarding the law of $\{X_k\}$, since $P$-a.s. $M_n \to 0$ while $Q$-a.s. $M_n \to \infty$.
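The dichotomy is easy to see numerically. The sketch below is our own illustration, not from the text: with $N(0,1)$ marginals under $P$ and $N(\theta, 1)$ marginals under $Q$, the log-likelihood ratio $\log M_n = \sum_k (\theta X_k - \theta^2/2)$ has negative drift under $P$ and positive drift under $Q$.

```python
import random

def log_likelihood_ratio(theta, n, under_q, rng):
    """log M_n = sum_k (theta*X_k - theta^2/2) for N(theta,1)-vs-N(0,1) marginals."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(theta if under_q else 0.0, 1.0)
        total += theta * x - 0.5 * theta * theta
    return total

rng = random.Random(5)
theta, n = 0.5, 2000
lp = log_likelihood_ratio(theta, n, False, rng)  # mean drift -n*theta^2/2 = -250
lq = log_likelihood_ratio(theta, n, True, rng)   # mean drift +n*theta^2/2 = +250
print(lp < 0 < lq)
```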
Proof. We are in the setting of Theorem 5.5.11 for $\Omega = \mathbb{R}^{\mathbb{N}}$ and the filtration $\mathcal{F}_n^X = \sigma(X_k : 1 \le k \le n) \uparrow \mathcal{F}^X = \sigma(X_k, k < \infty) = \mathcal{B}_c$
spect to $P$ if and only if $\sum_k \big(1 - \sqrt{p_k q_k} - \sqrt{(1 - p_k)(1 - q_k)}\big)$ is finite.
(b) Show that if $\sum_k |p_k - q_k|$ is finite then $Q$ is absolutely continuous with respect to $P$.
(c) Show that if $p_k, q_k \in [\varepsilon, 1 - \varepsilon]$ for some $\varepsilon > 0$ and all $k$, then $Q \ll P$ if and only if $\sum_k (p_k - q_k)^2 < \infty$.
(d) Show that if $\sum_k q_k < \infty$ and $\sum_k p_k = \infty$ then $Q \perp P$, so in general the condition $\sum_k (p_k - q_k)^2 < \infty$ is not sufficient for absolute continuity of $Q$ with respect to $P$.
In the spirit of Theorem 5.5.11, as you show next, a positive martingale $(Z_n, \mathcal{F}_n)$ induces a collection of probability measures $Q_n$ that are equivalent to $P_n = P|_{\mathcal{F}_n}$ (i.e. both $Q_n \ll P_n$ and $P_n \ll Q_n$), and satisfy a certain martingale Bayes rule. In particular, the following discrete time analog of Girsanov's theorem shows that such a construction can significantly simplify certain computations upon moving from $P_n$ to $Q_n$.
Exercise 5.5.16. Suppose $(Z_n, \mathcal{F}_n)$ is a (strictly) positive MG on $(\Omega, \mathcal{F}, P)$, normalized so that $EZ_0 = 1$. Let $P_n = P|_{\mathcal{F}_n}$ and consider the equivalent probability measure $Q_n$ on $(\Omega, \mathcal{F}_n)$ of Radon-Nikodym derivative $dQ_n/dP_n = Z_n$.
(a) Show that $Q_k = Q_n|_{\mathcal{F}_k}$ for any $0 \le k \le n$.
(b) Fixing $0 \le k \le m \le n$ and $Y \in L^1(\Omega, \mathcal{F}_m, P)$ show that $Q_n$-a.s. (hence also $P$-a.s.), $E_{Q_n}[Y|\mathcal{F}_k] = E[Y Z_m|\mathcal{F}_k]/Z_k$.
(c) For $\mathcal{F}_n = \mathcal{F}_n^\xi$, the canonical filtration of i.i.d. standard normal variables $\{\xi_k\}$, and any bounded, $\mathcal{F}_n$-predictable $V_n$, consider the measures $Q_n$ induced by the exponential martingale $Z_n = \exp(Y_n - \frac{1}{2} \sum_{k=1}^n V_k^2)$, where $Y_n = \sum_{k=1}^n \xi_k V_k$. Show that $X$ of coordinates $X_m = \sum_{k=1}^m (\xi_k - V_k)$, $1 \le m \le n$, is under $Q_n$ a Gaussian random vector whose law is the same as that of $\{\sum_{k=1}^m \xi_k : 1 \le m \le n\}$ under $P$.
Hint: Use characteristic functions.
Remark. Actually, $(X_n, \mathcal{F}_n)$ is a RMG for $X_n = E[Y|\mathcal{F}_n]$, $n \le 0$, and any integrable $Y$ (possibly $Y \notin m\mathcal{F}_0$). Further, $E[Y|\mathcal{F}_n] \to E[Y|\mathcal{F}_{-\infty}]$ almost surely and in $L^1$. This is merely a restatement of Lévy's downward theorem, since for $X_0 = E[Y|\mathcal{F}_0]$ we have by the tower property that $E[Y|\mathcal{F}_n] = E[X_0|\mathcal{F}_n]$ for any $n \le 0$.
Proof. Suppose $(X_n, \mathcal{F}_n)$ is a RMG. Then, fixing $n < 0$ and applying Proposition 5.1.20 for the MG $(Y_k, \mathcal{G}_k)$ with $Y_k := X_{n+k}$ and $\mathcal{G}_k := \mathcal{F}_{n+k}$, $k = 0, \ldots, -n$ (taking there $\ell = -n > m = 0$), we deduce that $E[X_0|\mathcal{F}_n] = X_n$. Conversely, suppose $X_n = E[X_0|\mathcal{F}_n]$ for $X_0$ integrable and all $n \le 0$. Then, $X_n \in L^1(\Omega, \mathcal{F}_n, P)$ by the definition of the C.E. and further, with $\mathcal{F}_n \subseteq \mathcal{F}_{n+1}$, we have by the tower property that
$$X_n = E[X_0|\mathcal{F}_n] = E[E(X_0|\mathcal{F}_{n+1})|\mathcal{F}_n] = E[X_{n+1}|\mathcal{F}_n] ,$$
Not all reversed sub-MGs are U.I. but here is an explicit characterization of those that are.
Exercise 5.5.21. Show that a reversed sub-MG $\{X_n\}$ is U.I. if and only if $\inf_n EX_n$ is finite.
Our first application of RMG-s is to provide an alternative proof of the strong law of large numbers of Theorem 2.3.3, with the added bonus of $L^1$ convergence.
Theorem 5.5.22 (Strong law of large numbers). Suppose $S_n = \sum_{k=1}^n \xi_k$ for i.i.d. integrable $\{\xi_k\}$. Then, $n^{-1} S_n \to E\xi_1$ a.s. and in $L^1$ when $n \to \infty$.
Proof. Let $X_{-m} = (m+1)^{-1} S_{m+1}$ for $m \ge 0$, and define the corresponding filtration $\mathcal{F}_{-m} = \sigma(X_{-k}, k \ge m)$. Recall part (a) of Exercise 4.4.8, that $X_{-n} = E[\xi_1|X_{-n}]$ for each $n \ge 0$. Further, clearly $\mathcal{F}_{-n} = \sigma(\mathcal{G}_n, \mathcal{T}_n)$ for $\mathcal{G}_n = \sigma(X_{-n})$ and $\mathcal{T}_n = \sigma(\xi_r, r > n+1)$. With $\mathcal{T}_n$ independent of $\sigma(\sigma(\xi_1), \mathcal{G}_n)$, we thus have that $X_{-n} = E[\xi_1|\mathcal{F}_{-n}]$ for each $n \ge 0$ (see Proposition 4.2.3). Consequently, $(X_n, \mathcal{F}_n)$ is a RMG which by Lévy's downward theorem converges for $n \to -\infty$ both a.s. and in $L^1$ to the finite valued random variable $X_{-\infty} = E[\xi_1|\mathcal{F}_{-\infty}]$. Combining this and the tower property leads to $EX_{-\infty} = E\xi_1$, so it only remains to show that $P(X_{-\infty} \ne c) = 0$ for some non-random constant $c$. To this end, note that for any $\ell$ finite,
$$X_{-\infty} = \limsup_{m \to \infty} \frac{1}{m} \sum_{k=1}^m \xi_k = \limsup_{m \to \infty} \frac{1}{m} \sum_{k=\ell+1}^m \xi_k .$$
Clearly, $X_{-\infty} \in m\mathcal{T}_\ell$ for any $\ell$, so $X_{-\infty}$ is also measurable on the tail $\sigma$-algebra $\mathcal{T} = \cap_\ell \mathcal{T}_\ell$ of the sequence $\{\xi_k\}$. We complete the proof upon noting that the $\sigma$-algebra $\mathcal{T}$ is $P$-trivial (by Kolmogorov's 0-1 law and the independence of the $\xi_k$), so in particular, a.s. $X_{-\infty}$ equals a non-random constant (see Proposition 1.2.47).
In this context, you find next that while any RMG $X_{-m}$ is U.I., it is not necessarily dominated by an integrable variable, and its a.s. convergence may not translate to the conditional expectations $E[X_{-m}|\mathcal{H}]$.
Exercise 5.5.23. Consider integrable i.i.d. copies $\{\xi_n\}$ of $\xi_1$, having distribution function $F_{\xi_1}(x) = 1 - x^{-1}(\log x)^{-2}$ for $x \ge e$ and $P(\xi_1 = -2e/(e-1)) = 1 - e^{-1}$, so $E\xi_1 = 0$. Let $\mathcal{H} = \sigma(A_n, n \ge 3)$ for $A_n = \{\xi_n \ge e\, n/(\log n)\}$ and recall Theorem 5.5.22 that for $m \to \infty$ the U.I. RMG $X_{-m} = (m+1)^{-1} S_{m+1}$ converges a.s. to zero.
(a) Verify that $m^{-1} E[\xi_m|\mathcal{H}] \ge I_{A_m}$ for all $m \ge 3$ and deduce that a.s. $\limsup_m m^{-1} E[\xi_m|\mathcal{H}] \ge 1$.
(b) Conclude that $E[X_{-m}|\mathcal{H}]$ does not converge to zero a.s. and $\sup_m |X_{-m}|$ is not integrable.
In preparation for the Hewitt-Savage 0-1 law and de Finetti's theorem, we now define the exchangeable $\sigma$-algebra and exchangeable random variables.
Definition 5.5.24 (exchangeable $\sigma$-algebra and random variables). Consider the product measurable space $(\mathbb{R}^{\mathbb{N}}, \mathcal{B}_c)$ as in Kolmogorov's extension theorem. Let $\mathcal{E}_m \subseteq \mathcal{B}_c$ denote the $\sigma$-algebra of events that are invariant under permutations of the first $m$ coordinates; that is, $A \in \mathcal{E}_m$ if $(\omega_{\pi(1)}, \ldots, \omega_{\pi(m)}, \omega_{m+1}, \ldots) \in A$ for any permutation $\pi$ of $\{1, \ldots, m\}$ and all $\omega = (\omega_1, \omega_2, \ldots) \in A$. The exchangeable $\sigma$-algebra $\mathcal{E} = \cap_m \mathcal{E}_m$ consists of all events that are invariant under all finite permutations of coordinates. Similarly, we call an infinite sequence of R.V.-s $\{\xi_k\}_{k \ge 1}$ on the same
Lemma 5.5.25. Suppose $\xi_k(\omega) = \omega_k$ is an exchangeable sequence of random variables on $(\mathbb{R}^{\mathbb{N}}, \mathcal{B}_c)$. For any bounded Borel function $\varphi : \mathbb{R}^\ell \mapsto \mathbb{R}$ and $m \ge \ell$ let $\widehat{S}_m(\varphi) = (m)_\ell^{-1} \sum_{\mathbf{i}} \varphi(\omega_{i_1}, \ldots, \omega_{i_\ell})$, where $\mathbf{i} = (i_1, \ldots, i_\ell)$ is an $\ell$-tuple of distinct integers from $\{1, \ldots, m\}$ and $(m)_\ell = \frac{m!}{(m-\ell)!}$ is the number of such $\ell$-tuples. Then, as $m \to \infty$,
$$(5.5.5) \qquad \widehat{S}_m(\varphi) \to E[\varphi(\omega_1, \ldots, \omega_\ell)|\mathcal{E}] \quad \text{a.s. and in } L^1 .$$
Proof. Fixing $m \ge \ell$, since the value of $\widehat{S}_m(\varphi)$ is invariant under any permutation of the first $m$ coordinates of $\omega$, we have that $\widehat{S}_m(\varphi)$ is measurable on $\mathcal{E}_m$. Further, this bounded R.V. is obviously integrable, so
$$(5.5.6) \qquad \widehat{S}_m(\varphi) = E[\widehat{S}_m(\varphi)|\mathcal{E}_m] = (m)_\ell^{-1} \sum_{\mathbf{i}} E[\varphi(\omega_{i_1}, \ldots, \omega_{i_\ell})|\mathcal{E}_m] .$$
Fixing any $\ell$-tuple of distinct integers $i_1, \ldots, i_\ell$ from $\{1, \ldots, m\}$, by our exchangeability assumption, the probability measure on $(\mathbb{R}^{\mathbb{N}}, \mathcal{B}_c)$ is invariant under any permutation $\pi$ of the first $m$ coordinates of $\omega$ such that $\pi(i_k) = k$ for $k = 1, \ldots, \ell$. Consequently, $E[\varphi(\omega_{i_1}, \ldots, \omega_{i_\ell}) I_A] = E[\varphi(\omega_1, \ldots, \omega_\ell) I_A]$ for any $A \in \mathcal{E}_m$, implying that $E[\varphi(\omega_{i_1}, \ldots, \omega_{i_\ell})|\mathcal{E}_m] = E[\varphi(\omega_1, \ldots, \omega_\ell)|\mathcal{E}_m]$. Since this applies for any $\ell$-tuple of distinct integers from $\{1, \ldots, m\}$, it follows by (5.5.6) that $\widehat{S}_m(\varphi) = E[\varphi(\omega_1, \ldots, \omega_\ell)|\mathcal{E}_m]$ for all $m \ge \ell$. In conclusion, considering the filtration $\mathcal{F}_{-n} = \mathcal{E}_n$, $n \ge 0$, for which $\mathcal{F}_{-\infty} = \mathcal{E}$, we have in view of the remark following Lévy's downward theorem that $(\widehat{S}_n(\varphi), \mathcal{E}_n)$, $n \ge 0$, is a RMG and the convergence in (5.5.5) holds a.s. and in $L^1$.
Remark. Noting that any sequence of i.i.d. random variables is also exchangeable, our first application of Lemma 5.5.25 is the following zero-one law.
Theorem 5.5.26 (Hewitt-Savage 0-1 law). The exchangeable $\sigma$-algebra $\mathcal{E}$ of a sequence of i.i.d. random variables $\xi_k(\omega) = \omega_k$ is $P$-trivial (that is, $P(A) \in \{0, 1\}$ for any $A \in \mathcal{E}$).
Remark. Given the Hewitt-Savage 0-1 law, we can simplify the proof of Theorem 5.5.22 upon noting that for each $m$ the $\sigma$-algebra $\mathcal{F}_{-m}$ is contained in $\mathcal{E}_{m+1}$, hence $\mathcal{F}_{-\infty} \subseteq \mathcal{E}$ must also be $P$-trivial.
Proof. As the i.i.d. $\xi_k(\omega) = \omega_k$ are exchangeable, from Lemma 5.5.25 we have that for any bounded Borel $\varphi : \mathbb{R}^\ell \mapsto \mathbb{R}$, almost surely $\widehat{S}_m(\varphi) \to \widehat{S}_\infty(\varphi) = E[\varphi(\omega_1, \ldots, \omega_\ell)|\mathcal{E}]$.
We proceed to show that $\widehat{S}_\infty(\varphi) = E[\varphi(\omega_1, \ldots, \omega_\ell)]$. To this end, fixing a finite integer $r \le m$, let
$$\widehat{S}_{m,r}(\varphi) = (m)_\ell^{-1} \sum_{\{\mathbf{i}:\, i_1 > r, \ldots, i_\ell > r\}} \varphi(\omega_{i_1}, \ldots, \omega_{i_\ell})$$
denote the contribution of the $\ell$-tuples $\mathbf{i}$ that do not intersect $\{1, \ldots, r\}$. Since there are exactly $(m-r)_\ell$ such $\ell$-tuples and $\varphi$ is bounded, it follows that
$$|\widehat{S}_m(\varphi) - \widehat{S}_{m,r}(\varphi)| \le \Big[1 - \frac{(m-r)_\ell}{(m)_\ell}\Big] \|\varphi\|_\infty \to 0 \quad \text{as } m \to \infty$$
$$\widehat{S}_m(h) = \frac{m}{m+1}\, \widehat{S}_m(f)\widehat{S}_m(g) + \frac{1}{m+1} \sum_{j=1}^m \widehat{S}_m(h_j) .$$
$$E\Big[\prod_{k=1}^\ell g_k(\xi_k)\Big|\mathcal{E}\Big] = \prod_{k=1}^\ell E[g_k(\xi_k)|\mathcal{E}] = \prod_{k=1}^\ell P(\xi_k \in B_k|\mathcal{E}) ,$$
which implies that conditional on $\mathcal{E}$ the R.V.-s $\{\xi_k\}$ are mutually independent (see Proposition 1.4.21). Further, $E[g(\xi_1) I_A] = E[g(\xi_r) I_A]$ for any $A \in \mathcal{E}$, bounded Borel $g(\cdot)$, positive integer $r$ and exchangeable variables $\xi_k(\omega) = \omega_k$, from which it follows that conditional on $\mathcal{E}$ these R.V.-s are also identically distributed.
$$\lim_{m \to \infty} \sup_{B \in \mathcal{G}_m} |P(A \cap B) - P(A)P(B)| = 0 .$$
CHAPTER 6
Markov chains
The rich theory of Markov processes is the subject of many text books and one can
easily teach a full course on this subject alone. Thus, we limit ourselves here to the
discrete time Markov chains and to their most fundamental properties. Specifically,
in Section 6.1 we provide definitions and examples, and prove the strong Markov
property of such chains. Section 6.2 explores the key concepts of recurrence, transience, invariant and reversible measures, as well as the asymptotic (long time)
behavior for time homogeneous Markov chains of countable state space. These concepts and results are then generalized in Section 6.3 to the class of Harris Markov
chains.
6.1. Canonical construction and the strong Markov property
We start with the definition of a Markov chain.
Definition 6.1.1. Given a filtration $\{\mathcal{F}_n\}$, an $\mathcal{F}_n$-adapted stochastic process $\{X_n\}$ taking values in a measurable space $(S, \mathcal{S})$ is called an $\mathcal{F}_n$-Markov chain with state space $(S, \mathcal{S})$ if for any $A \in \mathcal{S}$,
$$(6.1.1) \qquad P[X_{n+1} \in A\,|\,\mathcal{F}_n] = P[X_{n+1} \in A\,|\,X_n] \quad \forall n, \text{ a.s.}$$
Remark. We call $\{X_n\}$ a Markov chain in case $\mathcal{F}_n = \sigma(X_k, k \le n)$, noting that if $\{X_n\}$ is an $\mathcal{F}_n$-Markov chain then it is also a Markov chain. Indeed, $\mathcal{F}_n^X = \sigma(X_k, k \le n) \subseteq \mathcal{F}_n$ since $\{X_n\}$ is adapted to $\{\mathcal{F}_n\}$, so by the tower property we have that for any $\mathcal{F}_n$-Markov chain, any $A \in \mathcal{S}$ and all $n$, almost surely,
$$P[X_{n+1} \in A|\mathcal{F}_n^X] = E[E[I_{\{X_{n+1} \in A\}}|\mathcal{F}_n]|\mathcal{F}_n^X] = E[E[I_{\{X_{n+1} \in A\}}|X_n]|\mathcal{F}_n^X] = E[I_{\{X_{n+1} \in A\}}|X_n] = P[X_{n+1} \in A|X_n] .$$
The key object in characterizing an $\mathcal{F}_n$-Markov chain is its transition probabilities, as defined next.
Definition 6.1.2. A set function $p : S \times \mathcal{S} \mapsto [0, 1]$ is a transition probability if
(a) For each $x \in S$, $A \mapsto p(x, A)$ is a probability measure on $(S, \mathcal{S})$.
(b) For each $A \in \mathcal{S}$, $x \mapsto p(x, A)$ is a measurable function on $(S, \mathcal{S})$.
We say that an $\mathcal{F}_n$-Markov chain $\{X_n\}$ has transition probabilities $p_n(x, A)$ if almost surely $P[X_{n+1} \in A|\mathcal{F}_n] = p_n(X_n, A)$ for every $n \ge 0$ and every $A \in \mathcal{S}$, and call it a homogeneous $\mathcal{F}_n$-Markov chain if $p_n(x, A) = p(x, A)$ for all $n$, $x \in S$ and $A \in \mathcal{S}$.
With $b\mathcal{S} \subseteq m\mathcal{S}$ denoting the collection of all bounded $(\mathbb{R}, \mathcal{B}_{\mathbb{R}})$-valued measurable mappings on $(S, \mathcal{S})$, we next express $E[h(X_{k+1})|\mathcal{F}_k]$ for $h \in b\mathcal{S}$ in terms of the transition probabilities of the $\mathcal{F}_n$-Markov chain $\{X_n\}$.
Lemma 6.1.3. If $\{X_n\}$ is an $\mathcal{F}_n$-Markov chain with state space $(S, \mathcal{S})$ and transition probabilities $p_n(\cdot, \cdot)$, then for any $h \in b\mathcal{S}$ and all $k \ge 0$,
$$(6.1.2) \qquad E[h(X_{k+1})|\mathcal{F}_k] = (p_k h)(X_k) \quad \text{a.s., where } (p_k h)(x) = \int_S h(y)\, p_k(x, dy) .$$
We turn to show how relevant the preceding proposition is for Markov chains.
Proposition 6.1.5. To any $\sigma$-finite measure $\mu$ on $(S, \mathcal{S})$ and any sequence of transition probabilities $p_n(\cdot, \cdot)$ there correspond unique $\sigma$-finite measures $\mu_k = \mu \otimes p_0 \otimes \cdots \otimes p_{k-1}$ on $(S^{k+1}, \mathcal{S}^{k+1})$, $k = 1, 2, \ldots$, such that
$$\mu_k(A_0 \times \cdots \times A_k) = \int_{A_0} \mu(dx_0) \int_{A_1} p_0(x_0, dx_1) \cdots \int_{A_k} p_{k-1}(x_{k-1}, dx_k) .$$
Moreover, for any $h_\ell \in b\mathcal{S}$ and all $k \ge 0$,
$$(6.1.3) \qquad E\Big[\prod_{\ell=0}^k h_\ell(X_\ell)\Big] = \int h_0(x_0) \cdots h_k(x_k)\, d\mu_k(x_0, \ldots, x_k) .$$
Proof. Starting at a $\sigma$-finite measure $\nu_1 = \mu$ on $(S, \mathcal{S})$ and applying Proposition 6.1.4 for $\nu_2(x, B) = p_0(x, B)$ on $S \times \mathcal{S}$ yields the $\sigma$-finite measure $\mu_1 = \mu \otimes p_0$ on $(S^2, \mathcal{S}^2)$. Applying this proposition once more, now with $\nu_1 = \mu_1$ and $\nu_2((x_0, x_1), B) = p_1(x_1, B)$ for $x = (x_0, x_1) \in S \times S$, yields the $\sigma$-finite measure $\mu_2 = \mu \otimes p_0 \otimes p_1$ on $(S^3, \mathcal{S}^3)$, and upon repeating this procedure $k$ times we arrive at the $\sigma$-finite measure $\mu_k = \mu \otimes p_0 \otimes \cdots \otimes p_{k-1}$ on $(S^{k+1}, \mathcal{S}^{k+1})$. Since $p_n(x, S) = 1$ for all $n$ and $x \in S$, it follows that if $\mu$ is a probability measure, so are the $\mu_k$, which by construction are also consistent.
Suppose next that the Markov chain $\{X_n\}$ has transition probabilities $p_n(\cdot, \cdot)$ and initial distribution $\mu$. Fixing $k$ and $h_\ell \in b\mathcal{S}$ we have by the tower property and (6.1.2) that
$$E\Big[\prod_{\ell=0}^k h_\ell(X_\ell)\Big] = E\Big[\prod_{\ell=0}^{k-1} h_\ell(X_\ell)\, E(h_k(X_k)|\mathcal{F}_{k-1}^X)\Big] = E\Big[\prod_{\ell=0}^{k-1} h_\ell(X_\ell)\, (p_{k-1} h_k)(X_{k-1})\Big] .$$
Remark. In particular, this construction implies that for any probability measure $\mu$ on $(S, \mathcal{S})$ and all $A \in \mathcal{S}_c$,
$$(6.1.6) \qquad P_\mu(A) = \int \mu(dx)\, P_x(A) .$$
We shall use the latter identity as an alternative definition for $P_\mu$, one that is applicable even for a non-finite initial measure $\mu$ (namely, when $\mu(S) = \infty$), noting that if $\mu$ is $\sigma$-finite then $P_\mu$ is also the unique $\sigma$-finite measure on $(S^\infty, \mathcal{S}_c)$ for which (6.1.5) holds (see the remark following Corollary 1.4.25).
(where the first and last equalities are due to (6.1.5)). Consequently, for any $B \in \mathcal{S}$ and $k \ge 0$ finite, $p_k(Y_k, B)$ is a version of the C.E. $E[I_{\{Y_{k+1} \in B\}}|\mathcal{F}_k^Y]$ for $\mathcal{F}_k^Y = \sigma(Y_0, \ldots, Y_k)$, thus showing that $\{Y_n\}$ is a Markov chain of transition probabilities $p_n(\cdot, \cdot)$.
Remark. Conversely, given a Markov chain $\{X_n\}$ of state space $(S, \mathcal{S})$, applying this construction for its transition probabilities and initial distribution yields a Markov chain $\{Y_n\}$ that has the same law as $\{X_n\}$. To see this, recall (6.1.4) that the f.d.d. of a Markov chain are uniquely determined by its transition probabilities and initial distribution, and further, for a B-isomorphic state space, the f.d.d. uniquely determine the law $P_\mu$ of the corresponding stochastic process. For this reason we consider $(S^\infty, \mathcal{S}_c, P_\mu)$ to be the canonical probability space for Markov chains, with $X_n(\omega) = \omega_n$ given by the coordinate maps.
The evaluation of the f.d.d. of a Markov chain is considerably more explicit when the state space S is a countable set (in which case 𝒮 = 2^S), as then

p_n(x, A) = Σ_{y∈A} p_n(x, y) ,

for any A ⊆ S, so the transition probabilities are determined by p_n(x, y) ≥ 0 such that Σ_{y∈S} p_n(x, y) = 1 for all n and x ∈ S (and all Lebesgue integrals are in this case merely sums). In particular, if S is a finite set and the chain is homogeneous, then identifying S with {1, …, m} for some m < ∞, we view p(x, y) as the (x, y)-th entry of an m × m transition probability matrix, and express probabilities of interest in terms of powers of the latter matrix.
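To make the last observation concrete, here is a minimal sketch (the two-state chain and all numbers are our own illustration, not from the text) computing n-step transition probabilities as entries of the n-th power of the transition matrix:

```python
def mat_mult(A, B):
    # product of two square matrices, given as lists of rows
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(P, n):
    # n-th power of the transition probability matrix P, for n >= 1
    R = P
    for _ in range(n - 1):
        R = mat_mult(R, P)
    return R

# toy two-state chain: stay with probability 0.6, switch with probability 0.4
P = [[0.6, 0.4],
     [0.4, 0.6]]
P3 = mat_pow(P, 3)
# P3[x][y] = P_x(X_3 = y); each row of P3 is again a probability vector
row_sums = [sum(row) for row in P3]
```

For this symmetric chain P_x(X_n = x) = 1/2 + (0.2)^n/2, so for instance P3[0][0] ≈ 0.504.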
For homogeneous Markov chains whose state space is S = R^d (or a product of closed intervals thereof), equipped with the corresponding Borel σ-algebra, computations are relatively explicit when for each x ∈ S the transition probability p(x, ·) admits an explicit description. For example, the random walk S_n = S_0 + Σ_{k=1}^{n} ξ_k, whose increments {ξ_k} are i.i.d. R^d-valued random variables that are also independent of S_0, is an example of a homogeneous Markov chain. Indeed, S_{n+1} = S_n + ξ_{n+1} with ξ_{n+1} independent of F^S_n = σ(S_0, …, S_n). Hence, P[S_{n+1} ∈ A | F^S_n] = P[S_n + ξ_{n+1} ∈ A | S_n]. With ξ_{n+1} having the same law as ξ_1, we thus get that P[S_n + ξ_{n+1} ∈ A | S_n] = p(S_n, A) for the transition probabilities p(x, A) = P(ξ_1 ∈ {y − x : y ∈ A}) (c.f. Exercise 4.2.2) and the state space S = R^d (with its Borel σ-algebra).
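For a random walk whose increment ξ_1 takes countably many values, the formula p(x, A) = P(ξ_1 ∈ {y − x : y ∈ A}) can be evaluated by direct summation; the sketch below (our own toy illustration, with the increment law stored as a dict) also exhibits the spatial homogeneity of the walk:

```python
def transition_prob(x, A, xi_law):
    # p(x, A) = P(xi_1 in {y - x : y in A}) for the walk S_{n+1} = S_n + xi_{n+1};
    # xi_law: dict mapping each possible increment to its probability
    return sum(prob for step, prob in xi_law.items() if x + step in A)

xi_law = {-1: 0.5, 1: 0.5}                    # symmetric simple random walk on Z
p_from_0 = transition_prob(0, {1}, xi_law)    # chance of stepping from 0 into {1}
p_from_5 = transition_prob(5, {6}, xi_law)    # same value: p(x, A) depends only on A - x
```

Here p_from_0 == p_from_5 == 0.5, reflecting that p(x, x + B) = P(ξ_1 ∈ B) does not depend on x.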
As we see in the sequel, our next result, the strong Markov property, is extremely
useful. It applies to any homogeneous Markov chain with a B-isomorphic state
space and allows us to handle expectations of random variables shifted by any
stopping time with respect to the canonical filtration of the chain.
Proposition 6.1.16 (Strong Markov property). Consider a canonical probability space (S^∞, 𝒮_c, P_ν), a homogeneous Markov chain X_n(ω) = ω_n constructed on it via Theorem 6.1.8, its canonical filtration F^X_n = σ(X_k, k ≤ n) and the shift operator θ : S^∞ ↦ S^∞ such that (θω)_k = ω_{k+1} for all k ≥ 0 (with the corresponding iterates (θ^n ω)_k = ω_{k+n} for k, n ≥ 0). Then, for any {h_n} ⊆ b𝒮_c with sup_{n,ω} |h_n(ω)| finite, and any F^X_n-stopping time τ,

(6.1.7)    E_ν[h_τ(θ^τ ω) | F^X_τ] = E_{X_τ}[h_τ] .

Remark. Here F^X_τ is the stopped σ-algebra associated with the stopping time τ (c.f. Definition 5.1.34) and E_ν (or E_x) indicates expectation taken with respect to P_ν (P_x, respectively). Both sides of (6.1.7) are set to zero when τ(ω) = ∞ and otherwise its right hand side is g(n, x) = E_x[h_n] evaluated at n = τ(ω) and x = X_{τ(ω)}(ω).
The strong Markov property is a significant extension of the Markov property,

(6.1.8)    E_ν[h(θ^n ω) | F^X_n] = E_{X_n}[h] ,

holding almost surely for any non-negative integer n and fixed h ∈ b𝒮_c (that is, the identity (6.1.7) with τ = n non-random). This in turn generalizes Lemma 6.1.3, where (6.1.8) is proved in the special case of h(ω_1) and h ∈ b𝒮.
Proof. We first prove (6.1.8) for h(ω) = ∏_{ℓ=0}^{k} g_ℓ(ω_ℓ) with g_ℓ ∈ b𝒮, ℓ = 0, …, k. To this end, fix B ∈ 𝒮^{n+1} and recall that μ_m = ν ⊗ p ⊗ ⋯ ⊗ p are the f.d.d. for P_ν. Then,

E_ν[I_B(X_0, …, X_n) ∏_{ℓ=0}^{k} g_ℓ(X_{ℓ+n})]
    = ∫ μ_n(dx_0, …, dx_n) I_B(x_0, …, x_n) [ g_0(x_n) ∫ p(x_n, dy_1)g_1(y_1) ⋯ ∫ p(y_{k−1}, dy_k)g_k(y_k) ]
    = E_ν[I_B(X_0, …, X_n) E_{X_n}(h)] .

This holds for all B ∈ 𝒮^{n+1}, which by definition of the conditional expectation amounts to (6.1.8).
The collection H ⊆ b𝒮_c of bounded, measurable h : S^∞ ↦ R for which (6.1.8) holds clearly contains the constant functions and is a vector space over R (by linearity of the expectation and the conditional expectation). Moreover, by the monotone convergence theorem for conditional expectations, if h_m ∈ H are non-negative and h_m ↑ h which is bounded, then also h ∈ H. Taking in the preceding g_ℓ = I_{B_ℓ} we see that I_A ∈ H for any A in the π-system P of cylinder sets (i.e. whenever A = {ω : ω_0 ∈ B_0, …, ω_k ∈ B_k} for some k finite and B_ℓ ∈ 𝒮). We thus deduce by the (bounded version of the) monotone class theorem that H = b𝒮_c, the collection of all bounded functions on S^∞ that are measurable with respect to the σ-algebra 𝒮_c generated by P.
Having established the Markov property (6.1.8), fixing {h_n} ⊆ b𝒮_c and a F^X_n-stopping time τ, we proceed to prove (6.1.7) by decomposing both sides of the latter identity according to the value of τ. Specifically, the bounded random variables Y_n = h_n(θ^n ω) are integrable and, applying (6.1.8) for h = h_n, we have that E_ν[Y_n | F^X_n] = g(n, X_n). Hence, by part (c) of Exercise 5.1.35, for any finite integer k ≥ 0,

E_ν[h_τ(θ^τ ω) I_{{τ=k}} | F^X_τ] = g(k, X_k) I_{{τ=k}} = g(τ, X_τ) I_{{τ=k}} .

The identity (6.1.7) is then established by taking out the F^X_τ-measurable indicator on {τ = k} and summing over k = 0, 1, … (where the finiteness of sup_{n,ω} |h_n(ω)| provides the required integrability).
Exercise 6.1.17. Modify the last step of the proof of Proposition 6.1.16 to show that (6.1.7) holds as soon as Σ_k E_{X_k}[|h_k|] I_{{τ=k}} is P_ν-integrable.
Here are a few applications of the Markov and strong Markov properties.

Exercise 6.1.18. Fixing a shift invariant event A ∈ 𝒮_c (that is, A = θ^{−1}A):
(a) Using the Markov property and Levy's upward theorem (Theorem 5.3.15), show that P(A | X_n) → I_A a.s.
(b) Show that P({X_n ∈ A_n i.o.} \ A) = 0 for any {A_n} ⊆ 𝒮 such that for some ε > 0 and all n, with probability one, P(A | X_n) ≥ ε I_{{X_n ∈ A_n}}.
(c) Suppose A, B ∈ 𝒮 are such that P_x(X_l ∈ B for some l ≥ 1) ≥ ε for some ε > 0 and all x ∈ A. Deduce that

P({X_n ∈ A finitely often} ∪ {X_n ∈ B i.o.}) = 1 .
Derive also the following, more precise result for the symmetric srw, where for any integer b > 0,

P(max_{k≤n} S_k ≥ b) = 2P(S_n > b) + P(S_n = b) .
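The reflection identity for the symmetric srw can be checked by brute force for small n, enumerating all 2^n equally likely ±1 increment sequences (a verification sketch of our own, using exact rational arithmetic; n = 8 and b = 3 are arbitrary choices):

```python
from itertools import product
from fractions import Fraction

def reflection_check(n, b):
    # compare P(max_{k<=n} S_k >= b) with 2 P(S_n > b) + P(S_n = b)
    w = Fraction(1, 2) ** n          # probability of each increment sequence
    lhs = p_gt = p_eq = Fraction(0)
    for steps in product((-1, 1), repeat=n):
        s = m = 0
        for d in steps:
            s += d
            m = max(m, s)
        if m >= b:
            lhs += w
        if s > b:
            p_gt += w
        elif s == b:
            p_eq += w
    return lhs, 2 * p_gt + p_eq

lhs, rhs = reflection_check(8, 3)    # the two sides agree exactly
```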
The concept of invariant measure for a homogeneous Markov chain, which we now
introduce, plays an important role in our study of such chains throughout Sections
6.2 and 6.3.
Definition 6.1.20. A measure μ on (S, 𝒮) such that μ(S) > 0 is called a positive or non-zero measure. An event A ∈ 𝒮_c is called shift invariant if A = θ^{−1}A (i.e. A = {ω : θ(ω) ∈ A}), and a positive measure μ on (S^∞, 𝒮_c) is called shift invariant if μ(θ^{−1}(·)) = μ(·) (i.e. μ(A) = μ({ω : θ(ω) ∈ A}) for all A ∈ 𝒮_c). We say that a stochastic process {X_n} with a B-isomorphic state space (S, 𝒮) is (strictly) stationary if its joint law is shift invariant. A positive σ-finite measure μ on a B-isomorphic space (S, 𝒮) is called an invariant measure for a transition probability p(·, ·) if it defines via (6.1.6) a shift invariant measure P_μ(·). In particular, starting at X_0 chosen according to an invariant probability measure π results with a stationary Markov chain {X_n}.
Lemma 6.1.21. Suppose a σ-finite measure μ and transition probability p_0(·, ·) on (S, 𝒮) are such that μ ⊗ p_0(S × A) = μ(A) for any A ∈ 𝒮. Then, for all k ≥ 1 and A ∈ 𝒮^{k+1},

μ ⊗ p_0 ⊗ ⋯ ⊗ p_k(S × A) = μ ⊗ p_1 ⊗ ⋯ ⊗ p_k(A) .
Proof. Our assumption that μ((p_0 f)) = μ(f) for f = I_A and any A ∈ 𝒮 extends by the monotone class theorem to all f ∈ b𝒮. Fixing A_i ∈ 𝒮 and k ≥ 1, let f_k(x) = I_{A_0}(x) p_1 ⊗ ⋯ ⊗ p_k(x, A_1 × ⋯ × A_k) (where p_1 ⊗ ⋯ ⊗ p_k(x, ·) are the probability measures of Proposition 6.1.5 in case ν = δ_x is the probability measure supported on the singleton {x} and p_0(y, {y}) = 1 for all y ∈ S). Since (p_j h) ∈ b𝒮 for any h ∈ b𝒮 and j ≥ 1 (see Lemma 6.1.3), it follows that f_k ∈ b𝒮 as well. Further, μ(f_k) = μ ⊗ p_1 ⊗ ⋯ ⊗ p_k(A) for A = A_0 × A_1 × ⋯ × A_k. By the same reasoning also

μ((p_0 f_k)) = ∫_S μ(dy) ∫_{A_0} p_0(y, dx) p_1 ⊗ ⋯ ⊗ p_k(x, A_1 × ⋯ × A_k) = μ ⊗ p_0 ⊗ ⋯ ⊗ p_k(S × A) .

Thus, the stated identity holds for the π-system of product sets A = A_0 × ⋯ × A_k which generates 𝒮^{k+1}, and since μ ⊗ p_1 ⊗ ⋯ ⊗ p_k(B_n × S^k) = μ(B_n) < ∞ for some B_n ↑ S, this identity extends to all of 𝒮^{k+1} (see the remark following Proposition 1.1.39).
Remark 6.1.22. Let μ_k = μ ⊗ p ⊗ ⋯ ⊗ p (with k factors of p) denote the σ-finite measures of Proposition 6.1.5 in case p_n(·, ·) = p(·, ·) for all n (so μ_0 = μ). Specializing Lemma 6.1.21 to this setting, we see that if μ_1(S × A) = μ_0(A) for any A ∈ 𝒮 then μ_{k+1}(S × A) = μ_k(A) for all k ≥ 0 and A ∈ 𝒮^{k+1}.
Building on the preceding remark we next characterize the invariant measures for
a given transition probability.
Proposition 6.1.23. A positive σ-finite measure μ(·) on B-isomorphic (S, 𝒮) is an invariant measure for transition probability p(·, ·) if and only if μ ⊗ p(S × A) = μ(A) for all A ∈ 𝒮.

Proof. With μ a positive σ-finite measure, so are the measures P_μ and P_μ ∘ θ^{−1} on (S^∞, 𝒮_c), which for a B-isomorphic space (S, 𝒮) are uniquely determined by their finite dimensional distributions (see the remark following Corollary 1.4.25). By (6.1.5) the f.d.d. of P_μ are the σ-finite measures μ_k(A) = μ ⊗ p ⊗ ⋯ ⊗ p(A) for A ∈ 𝒮^{k+1} and k = 0, 1, … (where μ_0 = μ). By definition the corresponding f.d.d. of P_μ ∘ θ^{−1} are μ_{k+1}(S × A). Therefore, a positive σ-finite measure μ is an invariant measure for p(·, ·) if and only if μ_{k+1}(S × A) = μ_k(A) for any non-negative integer k and A ∈ 𝒮^{k+1}, which by Remark 6.1.22 is equivalent to μ ⊗ p(S × A) = μ(A) for all A ∈ 𝒮.
6.2. Markov chains with countable state space
Throughout this section we restrict our attention to homogeneous Markov chains {X_n} on a countable (finite or infinite) state space S, setting as usual 𝒮 = 2^S and p(x, y) = P_x(X_1 = y) for the corresponding transition probabilities. Noting that such chains admit the canonical construction of Theorem 6.1.8, since their state space is B-isomorphic (c.f. Proposition 1.4.27 for M = S equipped with the metric d(x, y) = 1_{x≠y}), we start with a few useful consequences of the Markov and strong Markov properties that apply for any homogeneous Markov chain on a countable state space.
Proposition 6.2.1 (Chapman-Kolmogorov). For any x, y ∈ S and non-negative integers k ≤ n,

(6.2.1)    P_x(X_n = y) = Σ_{z∈S} P_x(X_k = z) P_z(X_{n−k} = y) .
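Identity (6.2.1) is easy to verify numerically; the sketch below (a toy two-state chain of our own, with arbitrary labels 'a', 'b') propagates the exact distribution of X_n and compares the direct n-step probability with the sum over intermediate states at time k:

```python
def n_step(p, x, n):
    # distribution of X_n under P_x, for transition probabilities p[x][y]
    dist = {x: 1.0}
    for _ in range(n):
        new = {}
        for z, w in dist.items():
            for y, q in p[z].items():
                new[y] = new.get(y, 0.0) + w * q
        dist = new
    return dist

p = {'a': {'a': 0.2, 'b': 0.8},
     'b': {'a': 0.7, 'b': 0.3}}
n, k, x, y = 5, 2, 'a', 'b'
direct = n_step(p, x, n).get(y, 0.0)
# Chapman-Kolmogorov: sum over the state z occupied at the intermediate time k
split = sum(n_step(p, x, k).get(z, 0.0) * n_step(p, z, n - k).get(y, 0.0)
            for z in p)
```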
P_π(X_n = 0) = β/(α + β) + (1 − α − β)^n (π(0) − β/(α + β)) .

(b) Fixing π(0) and 1 − α − β ≠ 0 non-random, suppose α = β and conditional on {X_n} the variables B_k are independent Bernoulli(X_k). Evaluate the mean and variance of the additive functional S_n = Σ_{k=1}^{n} B_k.
(c) Verify that E_x[(X_n − N/2)] = (1 − 2/N)^n (x − N/2) for the Ehrenfest chain whose transition probabilities are p(x, x − 1) = x/N = 1 − p(x, x + 1).
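The Ehrenfest identity of part (c) can be confirmed numerically by propagating the exact law of the chain (a verification sketch of our own; the choices N = 6, x0 = 1 and n = 4 are arbitrary):

```python
def ehrenfest_step(dist, N):
    # one step of the Ehrenfest chain: p(x, x-1) = x/N, p(x, x+1) = 1 - x/N
    new = [0.0] * (N + 1)
    for x, p in enumerate(dist):
        if p > 0.0:
            if x > 0:
                new[x - 1] += p * x / N
            if x < N:
                new[x + 1] += p * (N - x) / N
    return new

N, x0, n = 6, 1, 4
dist = [0.0] * (N + 1)
dist[x0] = 1.0                      # start the chain at x0
for _ in range(n):
    dist = ehrenfest_step(dist, N)
mean_centered = sum((x - N / 2) * p for x, p in enumerate(dist))
predicted = (1 - 2 / N) ** n * (x0 - N / 2)   # the claimed identity
```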
6.2.1. Classification of states, recurrence and transience. We start with the partition of a countable state space of a homogeneous Markov chain into its intercommunicating (equivalence) classes, as defined next.
Definition 6.2.7. Let ρ_{xy} = P_x(T_y < ∞) denote the probability that starting at x the chain eventually visits the state y. State y is said to be accessible from state x ≠ y if ρ_{xy} > 0 (or alternatively, we then say that x leads to y). Two states x ≠ y, each accessible from the other, are said to intercommunicate, denoted by x ↔ y. A non-empty collection of states C ⊆ S is called irreducible if each two states in C intercommunicate, and closed if there is no y ∉ C and x ∈ C such that y is accessible from x.
Remark. Evidently an irreducible set C may be a non-closed set and vice versa. For example, if p(x, y) > 0 for any x, y ∈ S then S \ {z} is irreducible and non-closed (for any z ∈ S). More generally, adopting hereafter the convention that x ↔ x, any non-empty proper subset of an irreducible set is irreducible and non-closed. Conversely, when there exists y ∈ S such that p(x, y) = 0 for all x ∈ S \ {y}, then S is closed and reducible. More generally, a closed set that has a closed proper subset is reducible. Note however the following elementary properties.
Exercise 6.2.8.
(a) Show that if ρ_{xy} > 0 and ρ_{yz} > 0 then also ρ_{xz} > 0.
(b) Deduce that intercommunication is an equivalence relation (that is, x ↔ x, if x ↔ y then also y ↔ x, and if both x ↔ y and y ↔ z then also x ↔ z).
(c) Explain why its equivalence classes partition S into maximal irreducible sets such that the directed graph indicating which one leads to each other
(6.2.2)    P_x(T_y^k < ∞) = ρ_{xy} ρ_{yy}^{k−1} .

Further, let N(y) denote the number of visits to state y by the Markov chain at positive times. Then, E_x N(y) = ρ_{xy}/(1 − ρ_{yy}) is positive if and only if ρ_{xy} > 0, in which case it is finite when y is transient and infinite when y is recurrent.
Proof. The identity (6.2.2) is merely the observation that starting at x, in order to have k visits to y one has to first reach y and then to have k − 1 consecutive returns to y. More formally, the event {T_y < ∞} = ∪_n {T_y ≤ n} is in 𝒮_c, so fixing k ≥ 2 the strong Markov property applies for the stopping time τ = T_y^{k−1} and the indicator function h = I_{{T_y < ∞}}. Further, τ < ∞ implies that h(θ^τ ω) = I_{{T_y^k < ∞}}(ω) and X_τ = y, so E_{X_τ} h = P_y(T_y < ∞) = ρ_{yy}. Combining the tower property with the strong Markov property we thus find that

P_x(T_y^k < ∞) = E_x[h(θ^τ ω) I_{{τ<∞}}] = E_x[E_x[h(θ^τ ω) | F^X_τ] I_{{τ<∞}}] = E_x[ρ_{yy} I_{{τ<∞}}] = ρ_{yy} P_x(T_y^{k−1} < ∞) ,

and consequently,

E_x N(y) = Σ_{k=1}^{∞} P_x(N(y) ≥ k) = Σ_{k=1}^{∞} P_x(T_y^k < ∞) = Σ_{k=1}^{∞} ρ_{xy} ρ_{yy}^{k−1} = ρ_{xy}/(1 − ρ_{yy}) when ρ_{xy} > 0, and = 0 when ρ_{xy} = 0 ,

as claimed.
In the same spirit as the preceding proof, you next show that successive returns to the same state by a Markov chain are renewal times.

Exercise 6.2.11. Fix a recurrent state y ∈ S of a Markov chain {X_n}. Let R_k = T_y^k and r_k = R_k − R_{k−1} denote the number of steps between consecutive returns to y.
(a) Deduce from the strong Markov property that under P_y the random vectors Y^k = (r_k, X_{R_{k−1}}, …, X_{R_k − 1}) for k = 1, 2, … are independent and identically distributed.
(b) Show that for any probability measure ν, under P_ν and conditional on the event {T_y < ∞}, the random vectors Y^k are independent of each other
where T is the set of transient states and the R_i are disjoint, irreducible closed sets of recurrent states with ρ_{xy} = 1 whenever x, y ∈ R_i.

Remark. An alternative statement of the decomposition theorem is that ρ_{xy} = ρ_{yx} ∈ {0, 1} for any pair of recurrent states, while ρ_{xy} = 0 if x is recurrent and y is transient (so x ↦ {y ∈ S : ρ_{xy} > 0} induces a unique partition of the recurrent states into irreducible closed sets).
Proof. Suppose x ↔ y. Then, ρ_{xy} > 0 implies that P_x(X_K = y) > 0 for some finite K and ρ_{yx} > 0 implies that P_y(X_L = x) > 0 for some finite L. By the Chapman-Kolmogorov equations we have for any integer n ≥ 0,

P_x(X_{K+n+L} = x) = Σ_{z,v∈S} P_x(X_K = z) P_z(X_n = v) P_v(X_L = x)
(6.2.4)        ≥ P_x(X_K = y) P_y(X_n = y) P_y(X_L = x) .
We thus consider the unique partition of S into (disjoint) maximal irreducible equivalence classes of ↔ (see Exercise 6.2.8), with R_i denoting those equivalence classes that are recurrent, and proceed to show that if x is recurrent and ρ_{xy} > 0 for y ≠ x, then ρ_{yx} = 1. The latter implies that any y accessible from x ∈ R_ℓ must intercommunicate with x, so with R_ℓ a maximal irreducible set, necessarily such y is also in R_ℓ. We thus conclude that each R_ℓ is closed, with ρ_{xy} = 1 whenever x, y ∈ R_ℓ, as claimed.
To complete the proof fix a state y ≠ x that is accessible by the chain from a recurrent state x, noting that then L = inf{n ≥ 1 : P_x(X_n = y) > 0} is finite. Further, because L is the minimal such value, there exist y_0 = x, y_L = y and y_i ≠ x for 1 ≤ i ≤ L such that ∏_{k=1}^{L} p(y_{k−1}, y_k) > 0. Consequently, if P_y(T_x = ∞) = 1 − ρ_{yx} > 0, then

P_x(T_x = ∞) ≥ [∏_{k=1}^{L} p(y_{k−1}, y_k)] P_y(T_x = ∞) > 0 ,

contradicting the recurrence of x.
Proposition 6.2.21. Suppose S is irreducible for a chain {X_n} and there exists h : S ↦ [0, ∞) of finite level sets G_r = {x : h(x) < r} that is super-harmonic at S \ G_r for this chain and some finite r. Then, the chain {X_n} is recurrent.

Proof. If S is finite then the chain is recurrent by Proposition 6.2.15. Assuming hereafter that S is infinite, fix r_0 large enough so the finite set F = G_{r_0} is non-empty and h(·) is super-harmonic at x ∉ F. By Proposition 6.2.15 and part (c) of Exercise 6.1.18 (for B = F = S \ A), if P_x(τ_F < ∞) = 1 for all x ∈ S then F contains at least one recurrent state, so by irreducibility of S the chain is recurrent, as claimed. Proceeding to show that P_x(τ_F < ∞) = 1 for all x ∈ S, fix r > r_0 and C = C_r = F ∪ (S \ G_r). Note that h(·) is super-harmonic at x ∉ C, hence h(X_{n∧τ_C}) is a non-negative sup-MG under P_x for any x ∈ S. Further, S \ C is a subset of G_r, hence a finite set, so it follows by irreducibility of S that P_x(τ_C < ∞) = 1 for all x ∈ S (see part (a) of Exercise 6.2.5). Consequently, from Proposition 5.3.8 we get that

h(x) ≥ E_x h(X_{τ_C}) ≥ r P_x(τ_C < τ_F)

(since h(X_{τ_C}) ≥ r when τ_C < τ_F). Thus,

P_x(τ_F < ∞) ≥ P_x(τ_F ≤ τ_C) ≥ 1 − h(x)/r ,

and taking r ↑ ∞ completes the proof.
h(x) = Σ_{k=0}^{m−1} ∏_{j=1}^{k} (q_j/p_j) ,        P_x(T_a < T_b) = (h(b) − h(x)) / (h(b) − h(a)) .
Example 6.2.26. Some chains do not have any invariant measure. For example, in a birth and death chain with p_i = 1, i ≥ 0, the identity (6.2.5) is merely μ(0) = 0 and μ(i) = μ(i − 1) for i ≥ 1, whose only solution is the zero function. However, the totally asymmetric srw on Z with p(x, x + 1) = 1 at every integer x has an invariant measure μ(x) = 1, although just as in the preceding birth and death chain all its states are transient, with the only closed set being the whole state space. Nevertheless, as we show next, to every recurrent state corresponds an invariant measure.
Proposition 6.2.27. Let T_z denote the possibly infinite return time to a state z by a homogeneous Markov chain {X_n}. Then,

μ_z(y) = E_z[ Σ_{n=0}^{T_z − 1} I_{{X_n = y}} ] ,

is an excessive measure for {X_n}, the support of which is the closed set of all states accessible from z. If z is recurrent then μ_z(·) is an invariant measure, whose support is the closed and recurrent equivalence class of z.
Remark. We have by the second claim of Proposition 6.2.15 (for the closed set S) that any chain with a finite state space has at least one recurrent state. Further, recall that any invariant measure is σ-finite, which for a finite state space amounts to being a finite measure. Hence, by Proposition 6.2.27 any chain with a finite state space has at least one invariant probability measure.
Example 6.2.28. For a transient state z the excessive measure μ_z(y) may be infinite at some y ∈ S. For example, the transition probability p(x, 0) = 1 for all x ∈ S = {0, 1} has 0 as an absorbing (recurrent) state and 1 as a transient state, with T_1 = ∞ and μ_1(1) = 1 while μ_1(0) = ∞.
Proof. Using the canonical construction of the chain, we set

h_k(ω, y) = Σ_{n=0}^{T_z(ω)−1} I_{{ω_{n+k} = y}} ,
so that μ_z(y) = E_z h_0(ω, y) ≥ E_z h_1(ω, y). By the tower property and the Markov property of the chain,

E_z h_1(ω, y) = E_z[ Σ_{n=0}^{∞} I_{{T_z > n}} I_{{X_{n+1} = y}} ]
    = Σ_{x∈S} Σ_{n=0}^{∞} E_z[ I_{{T_z > n}} I_{{X_n = x}} P_z(X_{n+1} = y | F^X_n) ]
    = Σ_{x∈S} Σ_{n=0}^{∞} E_z[ I_{{T_z > n}} I_{{X_n = x}} ] p(x, y) = Σ_{x∈S} μ_z(x) p(x, y) ,

with equality in μ_z(y) ≥ Σ_x μ_z(x)p(x, y) when y ≠ z or z is recurrent (in which case P_z(T_z < ∞) = 1). By definition μ_z(z) = 1, so μ_z(·) is an excessive measure. Iterating the preceding inequality k times we further deduce that μ_z(y) ≥ Σ_x μ_z(x) P_x(X_k = y) for any k ≥ 1 and y ∈ S, with equality when z is recurrent. If ρ_{zy} = 0 then clearly μ_z(y) = 0, while if ρ_{zy} > 0 then P_z(X_k = y) > 0 for some k finite, hence μ_z(y) ≥ μ_z(z) P_z(X_k = y) > 0. The support of μ_z is thus the closed set of states accessible from z, which for z recurrent is its equivalence class. Finally, note that if x ↔ z then P_x(X_k = z) > 0 for some k finite, so 1 = μ_z(z) ≥ μ_z(x) P_x(X_k = z), implying that μ_z(x) < ∞. That is, if z is recurrent then μ_z is a σ-finite, positive invariant measure, as claimed.
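For a small concrete chain the measure μ_z of Proposition 6.2.27 can be computed by summing P_z(X_n = y, T_z > n) over n, and its invariance checked directly (the three-state chain below is our own toy example, not from the text):

```python
def excursion_measure(p, z, tol=1e-15, max_steps=1000):
    # mu_z(y) = E_z[ sum_{n=0}^{T_z - 1} I_{X_n = y} ], computed as
    # sum_n P_z(X_n = y, T_z > n), for transition probabilities p[x][y]
    mu = {s: 0.0 for s in p}
    dist = {z: 1.0}          # sub-probability law of X_n on the event {T_z > n}
    for _ in range(max_steps):
        if sum(dist.values()) < tol:
            break
        for y, w in dist.items():
            mu[y] += w
        new = {}
        for s, w in dist.items():
            for y, q in p[s].items():
                if y != z:   # stepping back into z ends the excursion
                    new[y] = new.get(y, 0.0) + w * q
        dist = new
    return mu

# recurrent toy chain: 0 -> 1, then 1 -> 0 or 1 -> 2 (prob. 1/2 each), 2 -> 0
p = {0: {1: 1.0}, 1: {0: 0.5, 2: 0.5}, 2: {0: 1.0}}
mu = excursion_measure(p, 0)
mu_p = {y: sum(mu[x] * p[x].get(y, 0.0) for x in p) for y in p}  # mu p = mu
```

Here μ_0 = (1, 1, 1/2), the total mass μ_0(S) = 5/2 agrees with E_0 T_0, and μ_0 p = μ_0, as the proposition asserts for a recurrent z.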
What about uniqueness of the invariant measure for a given transition probability? By definition the set of invariant measures for p(·, ·) is a convex cone (that is, if μ_1 and μ_2 are invariant measures, possibly the same, then for any positive c_1 and c_2 the measure c_1 μ_1 + c_2 μ_2 is also invariant). Thus, hereafter we say that the invariant measure is unique whenever it is unique up to multiplication by a positive constant. The first negative result in this direction comes from Proposition 6.2.27. Indeed, the invariant measures μ_z and μ_x are clearly mutually singular (and in particular, not constant multiples of each other) whenever the two recurrent states x and z do not intercommunicate. In contrast, your next exercise yields a positive result: the invariant measure supported within each recurrent equivalence class of states is unique (and given by Proposition 6.2.27).
Exercise 6.2.29. Suppose μ : S ↦ (0, ∞) is a strictly positive invariant measure for the transition probability p(·, ·) of a Markov chain {X_n} on the countable set S.
(a) Verify that q(x, y) = μ(y)p(y, x)/μ(x) is a transition probability on S.
(b) Verify that if ν : S ↦ [0, ∞) is an excessive measure for p(·, ·) then h(x) = ν(x)/μ(x) is super-harmonic for q(·, ·).
(c) Show that if p(·, ·) is irreducible and recurrent, then so is q(·, ·). Deduce from Exercise 6.2.23 that then h(x) is a constant function, hence ν(x) = cμ(x) for some c > 0 and all x ∈ S.
Remark. Evidently, having a uniform (or counting) invariant measure (i.e. μ(x) ≡ c > 0 for all x ∈ S), as in the preceding example, is equivalent to the transition probability being doubly stochastic, that is, Σ_{x∈S} p(x, y) = 1 for all y ∈ S.
Example 6.2.32 motivates our next subject, namely the conditions under which a Markov chain is reversible, starting with the relevant definitions.
As their name suggests, reversible measures have to do with the time reversed chain (and the corresponding adjoint transition probability), which we now define.

Definition 6.2.34. If μ(·) is an invariant measure for transition probability p(x, y), then q(x, y) = μ(y)p(y, x)/μ(x) is a transition probability on the support of μ(·), which we call the adjoint (or dual) of p(·, ·) with respect to μ. The corresponding chain of law Q_μ is called the time reversed chain (with respect to μ).
It is not hard, and left to the reader, to check that for any invariant probability measure μ the stationary Markov chains {Y_n} of law Q_μ and {X_n} of law P_μ are such that (Y_k, …, Y_0) has the same distribution as (X_0, …, X_k) for any k finite. Indeed, this is why {Y_n} is called the time reversed chain. Also note that μ(·) is a reversible measure if and only if p(·, ·) is self-adjoint with respect to μ(·) (that is, q(x, y) = p(x, y) on the support of μ(·)). Alternatively put, μ(·) is a reversible measure if and only if P_μ = Q_μ, that is, the shift invariant law of the chain induced by μ is the same as that of its time reversed chain.
By Definition 6.2.33 the set of reversible measures for p(, ) is a convex cone.
The following exercise affirms that reversible measures are zero outside the closed
equivalence classes of the chain and uniquely determined by it within each such
class. It thus reduces the problem of characterizing reversible chains (and measures)
to doing so for irreducible chains.
Exercise 6.2.35. Suppose μ(x) is a reversible measure for the transition probability p(x, y) of a Markov chain {X_n} with a countable state space S.
(a) Show that μ(x)P_x(X_k = y) = μ(y)P_y(X_k = x) for any x, y ∈ S and all k ≥ 1.
(b) Deduce that if μ(x) > 0 then any y accessible from x must intercommunicate with x.
(c) Conclude that the support of μ(·) is a disjoint union of closed equivalence classes, within each of which the measure is uniquely determined by p(·, ·) up to a non-negative constant multiple.
We proceed to characterize reversible irreducible Markov chains as random walks
on networks.
Definition 6.2.36. A network (or weighted graph) consists of a countable (finite or infinite) set of vertices V with a symmetric weight function w : V × V ↦ [0, ∞) (i.e. w_{xy} = w_{yx} for all x, y ∈ V). Further requiring that μ(x) = Σ_{y∈V} w_{xy} is finite and positive for each x ∈ V, a random walk on the network is a homogeneous Markov chain of state space V and transition probability p(x, y) = w_{xy}/μ(x). That is, when at state x the probability of the chain moving to state y is proportional to the weight w_{xy} of the pair {x, y}.
Remark. For example, an undirected graph is merely a network the weights w_{xy} of which are either one (indicating an edge in the graph whose ends are x and y) or zero (no such edge). Assuming such a graph has positive and finite degrees, the random walker moves at each time step to a vertex chosen uniformly at random from those adjacent in the graph to its current position.
Exercise 6.2.37. Check that a random walk on a network has a strictly positive reversible measure μ(x) = Σ_y w_{xy}, and that a Markov chain is reversible if and only if there exists an irreducible closed set V on which it is a random walk (with weights w_{xy} = μ(x)p(x, y)).
Example 6.2.38 (Birth and death chain). We leave for the reader to check that the irreducible birth and death chain of Exercise 6.2.24 is a random walk on the network Z_+ with weights w_{x,x+1} = p_x μ(x) = q_{x+1} μ(x + 1), w_{xx} = r_x μ(x) and w_{xy} = 0 for |x − y| > 1, and the unique reversible measure μ(x) = ∏_{i=1}^{x} (p_{i−1}/q_i) (with μ(0) = 1).
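The reversible measure of the birth and death chain satisfies the detailed balance relations μ(x)p_x = μ(x+1)q_{x+1} across every edge, which is easy to verify numerically (a sketch of our own; the rates 0.3 and 0.5 are arbitrary choices):

```python
def bd_reversible_measure(p_up, q_down, xmax):
    # mu(x) = prod_{i=1}^{x} p_{i-1}/q_i, with mu(0) = 1, for a birth and death
    # chain with up-probabilities p_up[i] and down-probabilities q_down[i]
    mu = [1.0]
    for x in range(1, xmax + 1):
        mu.append(mu[-1] * p_up[x - 1] / q_down[x])
    return mu

xmax = 4
p_up = [0.3] * (xmax + 1)       # p_i = 0.3
q_down = [0.5] * (xmax + 1)     # q_i = 0.5 (so interior holding probability r_i = 0.2)
mu = bd_reversible_measure(p_up, q_down, xmax)
# detailed balance across each edge {x, x+1}: mu(x) p_x = mu(x+1) q_{x+1}
balance_gaps = [abs(mu[x] * p_up[x] - mu[x + 1] * q_down[x + 1])
                for x in range(xmax)]
```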
Remark. Though irreducibility does not imply uniqueness of the invariant measure (c.f. Example 6.2.32), if ν is an invariant measure of the preceding birth and death chain then ν(x + 1) is determined by (6.2.5) from ν(x) and ν(x − 1), so starting at ν(0) = 1 we conclude that the reversible measure of Example 6.2.38 is also the unique invariant measure for this chain.
We conclude our discussion of reversible measures with an explicit condition for
reversibility of an irreducible chain, whose proof is left for the reader (for example,
see [Dur10, Theorem 6.5.1]).
Exercise 6.2.39 (Kolmogorov's cycle condition). Show that an irreducible chain of transition probability p(x, y) is reversible if and only if p(x, y) > 0 whenever p(y, x) > 0, and

∏_{i=1}^{k} p(x_{i−1}, x_i) = ∏_{i=1}^{k} p(x_i, x_{i−1}) ,

for any finite k and states x_0, x_1, …, x_k = x_0.
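Kolmogorov's condition is straightforward to test on a given cycle: compare the product of transition probabilities traversing it forwards with the product traversing it backwards. The biased 3-cycle below (our own toy example, not from the text) violates the condition, hence is not reversible:

```python
def cycle_products(p, cycle):
    # products of one-step probabilities around the cycle x_0, x_1, ..., x_k = x_0,
    # traversed forwards and backwards; reversibility requires fwd == bwd
    fwd = bwd = 1.0
    for a, b in zip(cycle, cycle[1:]):
        fwd *= p[a][b]
        bwd *= p[b][a]
    return fwd, bwd

# irreducible chain on a 3-cycle, biased clockwise
p = {0: {1: 0.8, 2: 0.2},
     1: {2: 0.8, 0: 0.2},
     2: {0: 0.8, 1: 0.2}}
fwd, bwd = cycle_products(p, [0, 1, 2, 0])   # 0.8**3 versus 0.2**3
```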
Remark. The renewal Markov chain of Example 6.1.11 is one of the many recurrent chains that fail to satisfy Kolmogorov's condition (and thus are not reversible).
Turning to investigate the existence and support of finite invariant measures (or
equivalently, that of invariant probability measures), we further partition the recurrent states of the chain according to the integrability (or lack thereof) of the
corresponding return times.
Definition 6.2.40. With T_z denoting the first return time to state z, a recurrent state z is called positive recurrent if E_z(T_z) < ∞ and null recurrent otherwise.
Indeed, invariant probability measures require the existence of positive recurrent
states, on which they are supported.
Σ_{n=1}^{∞} P_μ(X_n = z) = E_μ N(z) = Σ_{x∈S} μ(x) E_x N(z) = Σ_{x∈S} μ(x) ρ_{xz}/(1 − ρ_{zz}) ≤ 1/(1 − ρ_{zz})

(since ρ_{xz} ≤ 1 for all x). Starting at X_0 chosen according to an invariant probability measure μ(·) results with a stationary Markov chain {X_n} and in particular P_μ(X_n = z) = μ(z) for all n. The left side of the preceding inequality is thus infinite for positive μ(z) and invariant probability measure μ(·). Consequently, in this case ρ_{zz} = 1, or equivalently z must be a recurrent state of the chain. Since this applies for any z ∈ S, we conclude that μ(·) is supported outside the set T of transient states.
Next, recall that for any z ∈ S,

μ_z(S) = Σ_{y∈S} μ_z(y) = E_z[ Σ_{y∈S} Σ_{n=0}^{T_z−1} I_{{X_n = y}} ] = E_z T_z ,
For example, since the counting measure is an invariant measure for the (irreducible) srw of Example 6.2.32, this chain does not have an invariant probability measure, regardless of the value of p. For the same reason, the symmetric srw on Z (i.e. where p = 1/2) is a null recurrent chain.

Similarly, the irreducible birth and death chain of Exercise 6.2.24 has an invariant probability measure if and only if its reversible measure μ(x) = ∏_{i=1}^{x} (p_{i−1}/q_i) is finite (c.f. Example 6.2.38). In particular, if p_j = 1 − q_j = p for all j ≥ 1, then this chain is positive recurrent with an invariant probability measure when p < 1/2, but null recurrent for p = 1/2 (and transient when 1 > p > 1/2).

Finally, a random walk on a graph is irreducible if and only if the graph is connected. With μ(v) ≥ 1 for all v ∈ V (see Definition 6.2.36), it is positive recurrent only for finite graphs.
Exercise 6.2.44. Check that μ(j) = Σ_{k>j} q_k is an invariant measure for the recurrent renewal Markov chain of Example 6.1.11 in case {k : q_k > 0} is unbounded (see Example 6.2.19). Conclude that this chain is positive recurrent if and only if Σ_k k q_k is finite.
In the next exercise you find how the invariant probability measure is modified by the introduction of holding times.

Exercise 6.2.45. Let π(·) be the unique invariant probability measure of an irreducible, positive recurrent Markov chain {X_n} with transition probability p(x, y) such that p(x, x) = 0 for all x ∈ S. Fixing r(x) ∈ (0, 1), consider the Markov chain {Y_n} whose transition probability is q(x, x) = 1 − r(x) and q(x, y) = r(x)p(x, y) for all y ≠ x. Show that {Y_n} is an irreducible, recurrent chain of invariant measure ν(x) = π(x)/r(x), and deduce that {Y_n} is further positive recurrent if and only if Σ_x π(x)/r(x) < ∞.
Though we have established the next result in a more general setting, the proof we outline here is elegant, self-contained and instructive.

Exercise 6.2.46. Suppose g(·) is a strictly concave bounded function on [0, ∞) and π(·) is a strictly positive invariant probability measure for irreducible transition probability p(x, y). For any μ : S ↦ [0, ∞) let (μp)(y) = Σ_{x∈S} μ(x)p(x, y) and

E(μ) = Σ_{y∈S} g(μ(y)/π(y)) π(y) .
p̂(x, y) = Σ_{n=1}^{∞} 2^{−n} P_x(X_n = y) > 0 ,    x, y ∈ S ,

and that invariant measures for p(x, y) are also invariant for p̂(x, y).
(a) Show that V_n = Z_{n∧τ_z} is a F^Z_n-Markov chain and compute its transition probabilities q(x, y).
(b) Suppose h : S ↦ [0, ∞) is such that h(z) = 0, the function (ph)(x) = Σ_y p(x, y)h(y) is finite everywhere and h(x) ≥ (ph)(x) + ε for some ε > 0 and all x ≠ z. Show that (W_n, F^Z_n) is a sup-MG under P_x for W_n = h(V_n) + ε(n ∧ τ_z) and any x ∈ S.
(c) Deduce that E_x τ_z ≤ h(x)/ε for any x ∈ S and conclude that z is positive recurrent in the stronger sense that E_x T_z is finite for all x ∈ S.
(d) Fixing ε > 0, consider i.i.d. random vectors v_k = (ξ_k, η_k) such that P(v_1 = (1, 0)) = P(v_1 = (0, 1)) = 0.25 − ε and P(v_1 = (−1, 0)) = P(v_1 = (0, −1)) = 0.25 + ε. The chain Z_n = (X_n, Y_n) on Z² is such that X_{n+1} = X_n + sgn(X_n)ξ_{n+1} and Y_{n+1} = Y_n + sgn(Y_n)η_{n+1}, where sgn(0) = 0. Prove that (0, 0) is positive recurrent in the sense of part (c).
Exercise 6.2.48. Consider the Markov chain Z_n = ξ_n + (Z_{n−1} − 1)^+, n ≥ 1, on S = {0, 1, 2, …}, where ξ_n are i.i.d. S-valued such that P(ξ_1 > 1) > 0 and Eξ_1 = 1 − ε for some ε > 0.
(a) Show that {Z_n} is positive recurrent.
(b) Find its invariant probability measure π(·) in case P(ξ_1 = k) = p(1 − p)^k, k ∈ S, for some p ∈ (1/2, 1).
(c) Is this Markov chain reversible?
6.2.3. Aperiodicity and limit theorems. Building on our classification of states and study of the invariant measures of homogeneous Markov chains with countable state space S, we focus here on the large-n asymptotics of the state X_n(ω) of the chain and its law. We start with the asymptotic behavior of the occupation times

N_n(y) = Σ_{ℓ=1}^{n} I_{{X_ℓ = y}} ,
N_n(w)/N_n(y) → μ_y(w) ,    P_x-a.s.,

lim_{n→∞} P_x(X_n = y) = ρ_{xy}/E_y(T_y) .
Even for an irreducible chain of finite state space the sequence n ↦ P_x(X_n = y) may fail to converge pointwise.
Example 6.2.53. Consider the Markov chain {X_n} on state space S = {0, 1} with transition probabilities p(x, y) = 1_{x≠y}. Then, P_x(X_n = y) = 1_{{n even}} when x = y and P_x(X_n = y) = 1_{{n odd}} when x ≠ y, so the sequence n ↦ P_x(X_n = y) alternates between zero and one, having no limit for any fixed (x, y) ∈ S².
Nevertheless, as we prove in the sequel (more precisely, in Theorem 6.2.59), periodicity of the state y is the only reason for such non-convergence of P_x(X_n = y).

Definition 6.2.54. The period d_x of a state x ∈ S of a Markov chain {X_n} is the greatest common divisor (g.c.d.) of the set I_x = {n ≥ 1 : P_x(X_n = x) > 0}, with d_x = 0 in case I_x is empty. Similarly, we say that the chain is of period d if d_x = d for all x ∈ S. A state x is called aperiodic if d_x ≤ 1, and a Markov chain is called aperiodic if every x ∈ S is aperiodic.
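The period of a state can be read off numerically as the g.c.d. of the times at which return has positive probability (a sketch of our own, truncating the set I_x at a finite horizon; the two toy chains illustrate the periodic and the aperiodic case):

```python
from math import gcd

def period(p, x, horizon=30):
    # g.c.d. of {1 <= n <= horizon : P_x(X_n = x) > 0}; returns 0 if no return seen
    d = 0
    dist = {x: 1.0}
    for n in range(1, horizon + 1):
        new = {}
        for z, w in dist.items():
            for y, q in p[z].items():
                new[y] = new.get(y, 0.0) + w * q
        dist = new
        if dist.get(x, 0.0) > 0.0:
            d = gcd(d, n)
    return d

flip = {0: {1: 1.0}, 1: {0: 1.0}}            # the two-state flip chain: d_0 = 2
lazy = {0: {0: 0.5, 1: 0.5}, 1: {0: 1.0}}    # a holding probability makes d_0 = 1
```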
As the first step in this program, we show that the period is constant on each
irreducible set.
Lemma 6.2.55. The set I_x contains all large enough integer multiples of d_x, and if x ↔ y then d_x = d_y.
Proof. Considering (6.2.4) for x = y and L = 0 we find that Ix is closed
under addition. Hence, this set contains all large enough integer multiples of dx
because every non-empty set I of positive integers which is closed under addition
must contain all large enough integer multiples of its g.c.d. d. Indeed, it suffices
to prove this fact when d = 1 since the general case then follows upon considering
the non-empty set I′ = {n ≥ 1 : nd ∈ I} whose g.c.d. is one (and which is
also closed under addition). Further, note that any integer n ≥ ℓ² is of the form
n = ℓ² + kℓ + r = r(ℓ + 1) + (ℓ − r + k)ℓ for some k ≥ 0 and 0 ≤ r < ℓ. Hence, if
two consecutive integers ℓ and ℓ + 1 are in I then so are all integers n ≥ ℓ². We
thus complete the proof by showing that K = inf{m − ℓ : m, ℓ ∈ I, m > ℓ > 0} > 1
is in contradiction with I having g.c.d. d = 1. Indeed, both m0 and m0 + K are
in I for some positive integer m0 and if d = 1 then I must contain also a positive
integer of the form m1 = sK + r for some 0 < r < K and s ≥ 0. With I closed
under addition, (s + 1)(m0 + K) and (s + 1)m0 + m1 must then both be in I but
their difference is (s + 1)K − m1 = K − r < K, in contradiction with the definition
of K.
If x ↔ y then in view of the inequality (6.2.4) there exist finite K and L such that
K + n + L ∈ Ix whenever n ∈ Iy. Moreover, K + L ∈ Ix so every n ∈ Iy must
also be an integer multiple of dx. Consequently, dx is a common divisor of Iy and
therefore dy, being the greatest common divisor of Iy, is an integer multiple of dx.
Reversing the roles of x and y we likewise have that dx is an integer multiple of dy,
from which we conclude that in this case dx = dy.
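The number-theoretic fact used in the proof (a non-empty set of positive integers closed under addition contains all large enough multiples of its g.c.d.) can be spot-checked by brute force; here for the generators {6, 10, 15} with g.c.d. one (the generators and cutoffs are arbitrary choices):

```python
from math import gcd
from functools import reduce

def generated_set(gens, limit):
    """All positive sums of elements of gens (with repetition) up to limit."""
    reachable = {0}
    while True:
        new = {r + g for r in reachable for g in gens if r + g <= limit}
        if new <= reachable:
            break
        reachable |= new
    return {n for n in reachable if n >= 1}

gens = {6, 10, 15}                    # closed under addition once generated
S = generated_set(gens, 200)
assert reduce(gcd, gens) == 1
missing = [n for n in range(30, 201) if n not in S]
print(missing)                        # [] : every n >= 30 is a sum of 6, 10, 15
```

Here 29 is not representable, but every integer from 30 on is, in line with the lemma.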
The key for determining the asymptotics of Px (Xn = y) is to handle this question
for aperiodic irreducible chains, to which end the next lemma is most useful.
Lemma 6.2.56. Consider two independent copies {Xn} and {Yn} of an aperiodic,
irreducible chain on a countable state space S with transition probabilities p(·, ·). The
Markov chain Zn = (Xn, Yn) on S² with transition probabilities p2((x′, y′), (x, y)) =
p(x′, x)p(y′, y) is then also aperiodic and irreducible. If {Xn} has invariant probability measure π(·) then {Zn} is further positive recurrent and has the invariant
probability measure π2(x, y) = π(x)π(y).
Remark. Example 6.2.53 shows that for periodic p(·, ·) the chain of transition
probabilities p2(·, ·) may not be irreducible.
By the Markov property and taking out the known I{τ=k} it thus follows that
E[I{τ=k} g(Xn)] = E(I{τ=k} E_{Xk}[g(X_{n−k})]) .
Since |g(Xn) − g(Yn)| ≤ 2, we conclude that |Eg(Xn) − Eg(Yn)| ≤ 2P(τ > n) for
any g ∈ bS bounded by one, which is precisely what is claimed in (6.2.9).
6. MARKOV CHAINS
Since
ρxy = Σ_{k=1}^∞ Px(Ty = k)   and   Px(Xn = y) = Σ_{k=1}^n Px(Ty = k) Py(X_{n−k} = y)
(see part (b) of Exercise 6.2.2), the asymptotics (6.2.8) follows by bounded convergence (with respect to the law of Ty conditional on {Ty < ∞}), from
(6.2.10)   lim_{n→∞} Py(Xn = y) = 1/Ey(Ty) .
Turning to prove (6.2.10), in view of Corollary 6.2.52 we may and shall assume
hereafter that y is an aperiodic recurrent state. Further, recall that by Theorem
6.2.13 it then suffices to consider the aperiodic, irreducible, recurrent chain {Xn }
obtained upon restricting the original Markov chain to the closed equivalence
class of y, which with some abuse of notation we denote hereafter also by S.
Suppose first that {Xn} is positive recurrent and so it has the invariant probability
measure π(w) = 1/Ew(Tw) (see Proposition 6.2.41). The irreducible chain Zn =
(Xn, Yn) of Lemma 6.2.56 is then recurrent, so we apply Theorem 6.2.57 for X0 = y
and Y0 chosen according to the invariant probability measure π. Since Yn is a
stationary Markov chain (see Definition 6.1.20), in particular Yn has the law π of Y0
for all n. Moreover, the corresponding first meeting time τ is a.s. finite. Hence,
P(τ > n) → 0 as n → ∞ and by (6.2.9) the law of Xn converges in total variation
to π.
Proceeding to prove our thesis when the chain {Zn} is recurrent, suppose to the
contrary that the sequence n ↦ Py(Xn = y) has a limit point μ(y) > 0. Then,
mapping S in a one to one manner into Z we deduce from Helly's theorem that
along a further sub-sequence nℓ the distributions of X_{nℓ} under Py converge vaguely,
hence pointwise (see Exercise 3.2.3), to some finite, positive measure μ on S. We
complete the proof of the theorem by showing that μ is an excessive measure for
the irreducible, recurrent chain {Xn}. Indeed, by part (c) of Exercise 6.2.29 this
would imply the existence of a finite invariant measure for {Xn}, in contradiction
with our assumption that this chain is null recurrent (see Corollary 6.2.42).
To prove that μ is an excessive measure, note first that considering Theorem
6.2.57 for Z0 = (x, y) we get from (6.2.9) that |Px(Xn = w) − Py(Xn = w)| → 0
as n → ∞, for any x, w ∈ S. Consequently, Px(X_{nℓ} = w) → μ(w) as ℓ → ∞, for
every x, w ∈ S. Moreover, from the Chapman-Kolmogorov equations we have that
for any w ∈ S, any finite set F ⊆ S and all ℓ ≥ 1,
Σ_{z∈S} p(x, z) Pz(X_{nℓ} = w) = Px(X_{nℓ+1} = w) ≥ Σ_{z∈F} Px(X_{nℓ} = z) p(z, w) .
In the limit ℓ → ∞ this yields by bounded convergence (with respect to the probability measure p(x, ·) on S), that for all w ∈ S,
μ(w) = Σ_{z∈S} p(x, z) μ(w) ≥ Σ_{z∈F} μ(z) p(z, w) .
Taking F ↑ S we conclude by monotone convergence that μ(·) is an excessive measure on S, as we have claimed before.
Turning to the behavior of Px(Xn = y) for a periodic state y, we start with the
following consequence of Theorem 6.2.59.
Corollary 6.2.60. The convergence (6.2.8) holds whenever y is a null recurrent
state of the Markov chain {Xn} and if y is a positive recurrent state of {Xn} having
period d = dy, then
(6.2.11)   lim_{n→∞} Py(X_{nd} = y) = d/Ey(Ty) .
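The limit (6.2.11) can be checked against matrix powers; below, a four-state chain whose return times to state 0 are 2 or 4 (each with probability 1/2), so d = 2 and E_0(T_0) = 3 (the chain is an ad hoc illustration, not from the text):

```python
# From 0 go to 1 or 3 with probability 1/2 each, then count down to 0,
# so the return times to 0 are 2 or 4: d = gcd(2, 4) = 2, E_0(T_0) = 3.
P = [
    [0.0, 0.5, 0.0, 0.5],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]

def evolve(dist, P, steps):
    for _ in range(steps):
        dist = [sum(dist[i] * P[i][j] for i in range(len(P)))
                for j in range(len(P))]
    return dist

dist = evolve([1.0, 0.0, 0.0, 0.0], P, 2000)    # an even number of steps
print(abs(dist[0] - 2.0 / 3.0) < 1e-6)           # d / E_0(T_0) = 2/3
```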
In the next exercise, you extend (6.2.11) to the asymptotic behavior of Px (Xn = y)
for any two states x, y in a recurrent chain (which is not necessarily aperiodic).
Exercise 6.2.61. Suppose {Xn} is an irreducible, recurrent chain of period d. For
each x, y ∈ S let Ix,y = {n ≥ 1 : Px(Xn = y) > 0}.
(a) Fixing z ∈ S show that there exist integers 0 ≤ ry < d such that if n ∈ Iz,y
then d divides n − ry.
(b) Show that if n ∈ Ix,y then n ≡ (ry − rx) mod d and deduce that Si =
{y ∈ S : ry = i}, i = 0, . . . , d − 1 are the irreducible equivalence classes
of the aperiodic chain {X_{nd}} (Si are called the cyclic classes of {Xn}).
(c) Show that for all x, y ∈ S,
lim_{n→∞} Px(X_{nd+ry−rx} = y) = d/Ey(Ty) .
Remark. It is not always true that if a recurrent state y has period d then
Px(X_{nd+r} = y) → dρxy/Ey(Ty) for some r = r(x, y) ∈ {0, . . . , d − 1}. Indeed, let
p(x, y) be the transition probabilities of the renewal chain with q1 = 0 and qk > 0
for k ≥ 2 (see Example 6.1.11), except for setting p(1, 2) = 1 (instead of p(1, 0) = 1
in the renewal chain). The corresponding Markov chain has precisely two recurrent
states, y = 1 and y = 2, both of period d = 2 and mean return times E1(T1) =
E2(T2) = 2. Further, ρ02 = 1 but P0(X_{nd} = 2) → α and P0(X_{nd+1} = 2) → 1 − α,
where α = Σ_k q_{2k} is strictly between zero and one.
We next consider the large n asymptotic behavior of the Markov additive functional
A_n^f = Σ_{ℓ=1}^n f(Xℓ), where {Xℓ} is an irreducible, positive recurrent Markov chain.
In the following two exercises you establish first the strong law of large numbers
(thereby generalizing Proposition 6.2.49), and then the central limit theorem for
such Markov additive functionals.
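As a sanity check of the strong law below, one can simulate a small chain and compare n^{−1}A_n^f with π(f); a seeded Monte Carlo sketch (the two-state chain, the function f and the tolerance are arbitrary illustrative choices):

```python
import random

random.seed(0)
# Two-state chain: p(0,1) = 0.3, p(1,0) = 0.2, so pi = (0.4, 0.6).
p01, p10 = 0.3, 0.2
pi0, pi1 = p10 / (p01 + p10), p01 / (p01 + p10)

f = {0: -1.0, 1: 3.0}                    # an arbitrary bounded f
target = pi0 * f[0] + pi1 * f[1]         # pi(f) = 1.4

x, total, n = 0, 0.0, 200_000
for _ in range(n):
    if x == 0:
        x = 1 if random.random() < p01 else 0
    else:
        x = 0 if random.random() < p10 else 1
    total += f[x]
print(abs(total / n - target) < 0.05)    # n^{-1} A_n^f is close to pi(f)
```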
Exercise 6.2.62. Suppose {Xn} is an irreducible, positive recurrent chain of initial probability measure ν and invariant probability measure π(·). Let f : S → R be
such that π(|f|) < ∞.
(a) Fixing y ∈ S let Rk = T_y^k. Show that the random variables
Z_k^f = Σ_{ℓ=R_{k−1}}^{R_k − 1} f(Xℓ) ,   k ≥ 2,
are i.i.d. with E Z_2^f = π(f) Ey(Ty).
(b) Deduce that n^{−1} A_n^f → π(f), Pν-a.s.
Exercise 6.2.63. In the setting of Exercise 6.2.62:
(a) Show that n^{−1/2} S_n^f converges in distribution to uG as n → ∞, for
u = v_f/Ey(Ty) finite and G a standard normal variable.
Hint: See part (a) of Exercise 3.2.9.
(b) Show that max_{k≤n} {n^{−1/2} Z_k^{|f|}} → 0 and deduce that the convergence in
distribution to uG of part (a) holds also for n^{−1/2}(A_n^f − nπ(f)).
Building upon their strong law of large numbers, you are next to show that irreducible, positive recurrent chains have P-trivial tail σ-algebra and that the laws of any
two such chains are mutually singular (for the analogous results for i.i.d. variables,
see Corollary 1.4.10 and Remark 5.5.14, respectively).
Exercise 6.2.64. Suppose {Xn} is an irreducible, positive recurrent chain of law
Px on (S^∞, S_c) (as in Definition 6.1.7).
(a) Show that Px(A) is independent of x ∈ S whenever A is in the tail σ-algebra T^X (of Definition 1.4.9).
(b) Deduce that T^X is P-trivial.
Exercise 6.2.65. Suppose {Xn} is an irreducible, positive recurrent chain of transition probability p(x, y), initial and invariant probability measures ν(·) and π(·),
respectively.
(a) Show that {(Xn, Xn+1)} is an irreducible, positive recurrent chain on S²₊ =
{(x, y) : x, y ∈ S, p(x, y) > 0}, of initial and invariant measures ν(x)p(x, y)
and π(x)p(x, y), respectively.
(b) Let P and P′ denote the laws of two irreducible, positive recurrent
chains on the same countable state space S, whose transition probabilities p(x, y) and p′(x, y) are not identical. Show that P and P′ are
mutually singular measures (per Definition 4.1.9).
Hint: Consider the conclusion of Exercise 6.2.62 (for f(·) = 1x(·), or, if
the invariant measures π and π′ are identical, then for f(·) = 1(x,y)(·)
and the induced pair-chains of part (a)).
Exercise 6.2.66. Fixing 1 > α > β > 0 let P^{α,β}_n denote the law of (X0, . . . , Xn)
for the Markov chain {Xk} of state space S = {−1, 1} starting from X0 = 1
and evolving according to transition probabilities p(1, −1) = α = 1 − p(1, 1) and
p(−1, 1) = β = 1 − p(−1, −1). Fixing an integer b > 0 consider the stopping time
τb = inf{n ≥ 0 : An = −b} where An = Σ_{k=1}^n Xk.
(a) Setting λ = log((1 − β)/(1 − α)), h(1) = 1 and h(−1) = β(1 − β)/(α(1 − α)), show
that the Radon-Nikodym derivative Mn = dP^{β,α}_n/dP^{α,β}_n is of the form
Mn = exp(λ An) h(Xn).
(b) Deduce that P^{β,α}(τb < ∞) = exp(−λ b) h(−1).
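Part (a) can be verified by enumerating all short paths; in the following sketch the constants as reconstructed here, λ = log((1 − β)/(1 − α)) and h(−1) = β(1 − β)/(α(1 − α)), are checked against the product of one-step likelihood ratios (the numeric values of α, β and the path length are arbitrary):

```python
from itertools import product
from math import log, exp, isclose

alpha, beta = 0.6, 0.2                  # 1 > alpha > beta > 0
lam = log((1 - beta) / (1 - alpha))
h = {1: 1.0, -1: beta * (1 - beta) / (alpha * (1 - alpha))}

def trans(a, b):
    """p(1,-1) = a = 1 - p(1,1) and p(-1,1) = b = 1 - p(-1,-1)."""
    return {(1, -1): a, (1, 1): 1 - a, (-1, 1): b, (-1, -1): 1 - b}

p, p_swapped = trans(alpha, beta), trans(beta, alpha)

for path in product([1, -1], repeat=5):   # all paths X_1..X_5, with X_0 = 1
    xs = (1,) + path
    ratio = 1.0                           # dP^{beta,alpha}/dP^{alpha,beta}
    for a, b in zip(xs, xs[1:]):
        ratio *= p_swapped[(a, b)] / p[(a, b)]
    A = sum(path)                         # A_n = X_1 + ... + X_n
    M = exp(lam * A) * h[xs[-1]]
    assert isclose(M, ratio)
print("likelihood-ratio form verified on all length-5 paths")
```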
Exercise 6.2.67. Suppose {Xn} is a Markov chain of transition probability p(x, y)
and g(·) = (ph)(·) − h(·) for some bounded function h(·) on S. Show that h(Xn) −
Σ_{ℓ=0}^{n−1} g(Xℓ) is then a martingale.
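The one-step martingale property here reduces to E[h(X_{n+1}) − h(Xn) | Xn = x] = (ph)(x) − h(x) = g(x); a small numeric check (the 3-state matrix and the function h are arbitrary choices):

```python
# One-step check of the martingale property for a concrete 3-state chain.
P = [[0.2, 0.5, 0.3],
     [0.1, 0.1, 0.8],
     [0.4, 0.4, 0.2]]
h = [1.0, -2.0, 0.5]                           # an arbitrary bounded h
ph = [sum(P[x][y] * h[y] for y in range(3)) for x in range(3)]
g = [ph[x] - h[x] for x in range(3)]           # g = (ph) - h

# The increment M_{n+1} - M_n = h(X_{n+1}) - h(X_n) - g(X_n) has
# conditional expectation (ph)(x) - h(x) - g(x) = 0 given X_n = x.
for x in range(3):
    inc = sum(P[x][y] * (h[y] - h[x] - g[x]) for y in range(3))
    assert abs(inc) < 1e-12
print("martingale increment has zero conditional mean")
```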
6.3. General state space: Doeblin and Harris chains
The refined analysis of homogeneous Markov chains with countable state space
is possible because such chains hit states with positive probability. This does not
happen in many important applications where the state space is uncountable. However, most proofs require only having one point of the state space that the chain
hits with probability one. As we shall see, subject to the rather mild irreducibility and recurrence properties of Section 6.3.1, it is possible to create such a point
(called a recurrent atom), even in an uncountable state space, by splitting the chain
transitions. Guided by successive visits of the recurrent atom for the split chain, we
establish in Section 6.3.2 the existence and attractiveness of invariant (probability)
measures for the split chain (which then yield such results about the original chain).
6.3.1. Minorization, splitting, irreducibility and recurrence. Considering hereafter homogeneous Markov chains, we start by imposing a minorization
property of the transition probability p(·, ·) which yields the splitting of these transitions.
Definition 6.3.1. Consider a B-isomorphic state space (S, S). Suppose there
exists a non-zero measurable function v : S → [0, 1] and a probability measure q(·)
on (S, S) such that the transition probability of the chain {Xn} is of the form
(6.3.1)   p(x, ·) = (1 − v(x)) p̂(x, ·) + v(x) q(·) ,
for some transition probability p̂(x, ·) and v(x)q(·) ≤ p̂(x, ·). Amending the state
space to S̄ = S ∪ {α} with the corresponding σ-algebra S̄ = {A, A ∪ {α} : A ∈ S},
we then consider the split chain {X̄n} on (S̄, S̄) with transition probability
p̄(x, A) = (1 − v(x)) p̂(x, A) ,  p̄(x, {α}) = v(x) ,   x ∈ S, A ∈ S,
p̄(α, B̄) = ∫ q(dx) p̄(x, B̄) ,   B̄ ∈ S̄.
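The minorization (6.3.1) can be exhibited concretely on a finite state space: pick q and v with v(x)q(·) ≤ p(x, ·), and recover p̂ as the normalized residual kernel. A sketch (the 3-point kernel, q and the halving factor are arbitrary choices; the factor 0.5 also ensures v(x)q(·) ≤ p̂(x, ·)):

```python
# Minorization p(x, .) >= v(x) q(.) on a 3-point space, with residual
# kernel phat(x, .) = (p(x, .) - v(x) q(.)) / (1 - v(x)).
p = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.3, 0.3, 0.4]]
q = [0.4, 0.3, 0.3]
# largest feasible v(x), halved so that v(x) < 1 strictly
v = [0.5 * min(p[x][y] / q[y] for y in range(3)) for x in range(3)]

phat = [[(p[x][y] - v[x] * q[y]) / (1 - v[x]) for y in range(3)]
        for x in range(3)]

for x in range(3):
    assert all(val >= 0 for val in phat[x])
    assert abs(sum(phat[x]) - 1.0) < 1e-12          # phat is a kernel
    for y in range(3):
        # the split form (6.3.1): p = (1 - v) phat + v q
        assert abs((1 - v[x]) * phat[x][y] + v[x] * q[y] - p[x][y]) < 1e-12
print("splitting identity verified")
```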
We note in passing that f̄(x) = f(x) for all x ∈ S and f̄(α) = q(f), and further use in the sequel the following elementary fact about the closure of transition
probabilities under composition.
Corollary 6.3.3. Given any transition probabilities νi : X × X → [0, 1], i = 1, 2,
the set function ν1ν2 : X × X → [0, 1] such that ν1ν2(x, A) = ∫ ν1(x, dy)ν2(y, A) for
all x ∈ X and A ∈ X is a transition probability.
Proof. From Proposition 6.1.4 we see that
ν1ν2(x, A) = (ν1(x, ·) ⊗ ν2)(X × A) = (ν1(ν2(·, A)))(x) .
for all x ∈ S̄, by the Markov property of Ȳn (and Exercise 5.1.15), we deduce that
Eν̄[T̄α] is finite and uniformly bounded (in terms of the initial distribution
ν̄). Consequently, the atom α is a positive recurrent, aperiodic state of the split
chain, which is accessible with probability one from each of its states.
As we see in Section 6.3.2, this is more than enough to assure that starting at
any initial state, the law of Yn converges in total variation norm to the unique invariant
probability measure for {Yn}.
You are next going to examine which Markov chains of countable state space are
Doeblin chains.
Exercise 6.3.6. Suppose S = 2^S with S a countable set.
(a) Show that a Markov chain of state space (S, S) is a Doeblin chain if and
only if there exist a ∈ S and r finite such that inf_x Px(Xr = a) > 0.
(b) Deduce that for any Doeblin chain S = T ∪ R, where R = {y ∈ S : ρay >
0} is a non-empty irreducible, closed set of positive recurrent, aperiodic
states and T = {y ∈ S : ρay = 0} consists of transient states, all of which
lead to R.
(c) Verify that a Markov chain on a finite state space is a Doeblin chain if
and only if it has an aperiodic state a ∈ S that is accessible from any
other state.
(d) Check that branching processes with 0 < P(N = 0) < 1, renewal Markov
chains and birth and death chains are never Doeblin chains.
The preceding exercise shows that the Doeblin (recurrence) condition is too strong
for many chains of interest. We thus replace it by the weaker H-irreducibility
condition whereby the small function v(x) is only assumed bounded below on a
small, accessible set C. To this end, we start with the definitions of an accessible
set and weakly irreducible Markov chain.
Definition 6.3.7. We say that A ∈ S is accessible by the Markov chain {Xn} if
Px(TA < ∞) > 0 for all x ∈ S.
Given a non-zero σ-finite measure φ on (S, S), the chain is φ-irreducible if any set
A ∈ S with φ(A) > 0 is accessible by it. Finally, a homogeneous Markov chain on
(S, S) is called weakly irreducible if it is φ-irreducible for some non-zero σ-finite
measure φ (in particular, any Doeblin chain is weakly irreducible).
Remark. Modern texts on Markov chains typically refer to the preceding as the
standard definition of irreducibility but we use here the term weak irreducibility to
clearly distinguish it from the elementary definition for a countable S. Indeed, in
case S is a countable set, let φ̃ denote the corresponding counting measure of S.
A Markov chain of state space S is then φ̃-irreducible if and only if ρxy > 0 for
all x, y ∈ S, matching our Definition 6.2.14 of irreducibility, whereas a chain on S
countable is weakly irreducible if and only if ρxa > 0 for some a ∈ S and all x ∈ S.
In particular, a weakly irreducible chain of a countable state space S has exactly one
non-empty equivalence class of intercommunicating states (i.e. {y ∈ S : ρay > 0}),
which is further accessible by the chain.
As we show next, a weakly irreducible chain has a maximal irreducibility measure
ψ such that ψ(A) > 0 if and only if A ∈ S is accessible by the chain.
Proposition 6.3.8. Suppose {Xn} is a weakly irreducible Markov chain on (S, S).
Then, there exists a probability measure ψ on (S, S) such that for any A ∈ S,
(6.3.2)   ψ(A) > 0  ⟺  Px(TA < ∞) > 0 for all x ∈ S .
Proof. Consider
k(x, A) = Σ_{n=1}^∞ 2^{−n} Px(Xn ∈ A) .
Indeed, with {TA < ∞} = ∪_{n≥1} {Xn ∈ A}, clearly Px(TA < ∞) > 0 if and only
if k(x, A) > 0. Consequently, if Px(TA < ∞) is positive for all x ∈ S then so is
k(x, A) and hence ψ(A) > 0. Conversely, if ψ(A) > 0 then necessarily q(Cε) > 0
for Cε = {x ∈ S : k(x, A) ≥ ε} and some ε > 0 small enough. In particular,
fixing x ∈ S, as {Xn} is q-irreducible, also Px(TCε < ∞) > 0. That is, there exists a
positive integer m = m(x) such that Px(Xm ∈ Cε) > 0. It now follows by the
Markov property at m (for h(·) = Σ_{ℓ≥1} 2^{−ℓ} I_{Xℓ ∈ A}), that
k(x, A) ≥ 2^{−m} Σ_{ℓ=1}^∞ 2^{−ℓ} Px(X_{m+ℓ} ∈ A)
(b) Show that if {Xn} is strong H-irreducible then the atom α of the split
chain {X̄n} is accessible by {X̄n} from all states in S̄.
(c) Show that in a countable state space every weakly irreducible chain is
strong H-irreducible.
Hint: Try C = {a} and q(·) = p(a, ·) for some a ∈ S.
Actually, the converse to part (a) of Exercise 6.3.10 holds as well. That is, weak
irreducibility is equivalent to H-irreducibility (for the proof, see [Num84, Proposition 2.6]), and weakly irreducible chains can be analyzed via the study of an
appropriate split chain. For simplicity we focus hereafter on the somewhat more
restricted setting of strong H-irreducible chains. The following example shows that
it still applies for many Markov chains of interest.
Example 6.3.11 (Continuous transition densities). Let S = R^d with S =
B_S. Suppose that for each x ∈ R^d the transition probability has a density p(x, y) with
respect to Lebesgue measure λ^d(·) on R^d such that (x, y) ↦ p(x, y) is continuous
jointly in x and y. Picking u and v such that p(u, v) > 0, there exist a neighborhood
C of u and a bounded neighborhood K of v, such that inf{p(x, y) : x ∈ C, y ∈
K} > 0. Hence, setting q(·) to be the uniform measure on K (i.e. q(A) = λ^d(A ∩
K)/λ^d(K) for any A ∈ S), such a chain is strong H-irreducible provided C is an
accessible set. For example, this occurs whenever p(x, u) > 0 for all x ∈ R^d.
Remark 6.3.12. Though our study of Markov chains has been mostly concerned
with measure theoretic properties of (S, S) (e.g. being B-isomorphic), quite often
S is actually a topological state space with S its Borel σ-algebra. As seen in the
preceding example, continuity properties of the transition probability are then of
much relevance in the study of Markov chains on S. In this context, we say that
p : S × B_S → [0, 1] is a strong Feller transition probability, when the linear operator
(ph)(·) = ∫ p(·, dy)h(y) of Lemma 6.1.3 maps every bounded B_S-measurable function h to ph ∈ Cb(S), a continuous bounded function on S. In case of continuous
transition densities, as in Example 6.3.11, the transition probability is strong Feller
whenever the collection of probability measures {p(x, ·), x ∈ S} is uniformly tight
(per Definition 3.2.31).
In case S = B_S we further have the following topological notions of reachability
and irreducibility.
Exercise 6.3.14. Show that if a strong Feller transition probability p(·, ·) has a
reachable state x ∈ S, then it is weakly irreducible.
Hint: Try the irreducibility measure φ(·) = p(x, ·).
Remark. The minorization (6.3.1) may cause the maximal irreducibility measure
for the split chain to be supported on a smaller subset of the state space than the
one for the original chain. For example, consider the trivial Doeblin chain of i.i.d.
{Xn}, that is, p(x, ·) = q(·). In this case, taking v(x) = 1 results with the split
chain X̄n = α for all n ≥ 1, so the maximal irreducibility measures ψ = q of {Xn} and
ψ̄ = δα of {X̄n} are then mutually singular.
This is of course precluded by our additional requirement that v(x)q(·) ≤ p̂(x, ·).
For a strong H-irreducible chain {Xn} it is easily accommodated by, for example,
setting v(x) = ε̃ I_C(x) with ε̃ = ε/2 > 0, and then the restriction of ψ̄ to S is a
maximal irreducibility measure for {Xn}.
Strong H-irreducible chains with a recurrent atom α are called H-recurrent chains.
That is,
Definition 6.3.15. A strong H-irreducible chain {Xn} is called H-recurrent if
Pα(T̄α < ∞) = 1. By the strong Markov property of X̄n at the consecutive visit
times T̄α^k of α, H-recurrence further implies that Pα(T̄α^k finite for all k) = 1, or
equivalently Pα(X̄n = α i.o.) = 1.
Here are a few examples and exercises to clarify the concept of H-recurrence.
Example 6.3.16. Many strong H-irreducible chains are not H-recurrent. For example, combining part (c) of Exercise 6.3.10 with the remark following Definition
6.3.7 we see that such are all irreducible transient chains on a countable state space.
By the same reasoning, a Markov chain of countable state space S is H-recurrent if
and only if S = T ∪ R with R a non-empty irreducible, closed set of recurrent states
and T a collection of transient states that lead to R (c.f. part (b) of Exercise 6.3.6
for such a decomposition in case of Doeblin chains). In particular, such chains are
not necessarily recurrent in the sense of Definition 6.2.14. For example, the chain
on S = {1, 2, . . .} with transitions p(k, 1) = 1 − p(k, k + 1) = k^{−s} for some constant
s > 0, is H-recurrent but has only one recurrent state, i.e. R = {1}. Further,
ρk1 < 1 for all k ≠ 1 when s > 1, while ρk1 = 1 for all k when s ≤ 1.
Remark. Advanced texts on Markov chains refer to what we call H-recurrence
as the standard definition of recurrence and call such chains Harris recurrent when
in addition Px(T̄α < ∞) = 1 for all x ∈ S. As seen in the preceding example,
both notions are weaker than the elementary notion of recurrence for countable
S, per Definition 6.2.14. For this reason, we adopt here the convention of calling H-recurrence (with H after Harris) what is not the usual definition of Harris
recurrence.
Exercise 6.3.17. Verify that any strong Doeblin chain is also H-recurrent. Conversely, show that for any H-recurrent chain {Xn} there exist C ∈ S and a probability distribution q on (S, S) such that Pq(T_C^k finite for all k) = 1 and the Markov
chain Zk = X_{T_C^{k+1}} for k ≥ 0 is then a strong Doeblin chain.
The next proposition shows that similarly to the elementary notion of recurrence,
H-recurrence is transferred from the atom α to all sets that are accessible from
it. Building on this proposition, you show in Exercise 6.3.19 that the same applies
when starting at any irreducibility probability measure of the split chain and that
every set in S̄ is either almost surely visited or almost surely never reached from α
by the split chain.
Proposition 6.3.18. For an H-recurrent chain {Xn} consider the probability measure
(6.3.4)   ψ̄(B̄) = Σ_{n=1}^∞ 2^{−n} Pα(X̄n ∈ B̄) .
Exercise 6.3.20. Suppose {X̄n} and {X̃n} are two different split chains for the
same strong H-irreducible chain {Xn} with the corresponding atoms α and α̃.
Relying on Proposition 6.3.18 prove that Pα(T̄α < ∞) = 1 if and only if Pα̃(T̃α̃ <
∞) = 1.
The concept of H-recurrence builds on measure theoretic properties of the chain,
namely the minorization associated with strong H-irreducibility. In contrast, for a
topological state space we have the following topological concept of O-recurrence,
built on reachability of states.
Definition 6.3.21. A state x of a Markov chain {Xn} on (topological) state space
(S, B_S) is called O-recurrent (or open set recurrent), if Px(Xn ∈ O i.o.) = 1 for
any neighborhood O of x in S. All states x ∈ S which are not O-recurrent are called
O-transient. Such a chain is then called O-recurrent if every x ∈ S is O-recurrent
and O-transient if every x ∈ S is O-transient.
Remark. As was the case with O-irreducibility versus irreducibility, for a countable state space S equipped with its discrete topology, being O-recurrent (or O-transient) is equivalent to being recurrent (or transient, respectively), per Definitions 6.2.9 and 6.2.14.
(a) Show that {Sn} is O-recurrent if for every r > 0,
Σ_{n=0}^∞ P0(|Sn| < r) = ∞ .
(b) Show that
Σ_{n=0}^∞ P0(Sn ∈ [kr, (k + 1)r)) ≤ Σ_{m=0}^∞ P0(|Sm| < r) ,
and deduce that it suffices to check divergence of the series in part (a) for
large r.
(c) Conclude that if n^{−1} Sn → 0 in probability as n → ∞, then {Sn} is O-recurrent.
6.3.2. Invariant measures, aperiodicity and asymptotic behavior. We
consider hereafter an H-recurrent Markov chain {Xn} of transition probability p(·, ·)
on the B-isomorphic state space (S, S) with its recurrent pseudo-atom α and the
corresponding split and merge chains p̄(·, ·), m(·, ·) on (S̄, S̄) per Definitions 6.3.1
and 6.3.2.
The following lemma characterizes the invariant measures of the split chain p̄(·, ·)
and their relation to the invariant measures for p(·, ·). To this end, we use hereafter
ν1ν2 also for the measure ν1ν2(A) = ν1(ν2(·, A)) on (X, X) in case ν1 is a measure
on (X, X) and ν2 is a transition probability on this space, and let p̄n(x, B̄) denote
the transition probability Px(X̄n ∈ B̄) on (S̄, S̄).
Lemma 6.3.24. A measure μ̄ on (S̄, S̄) is invariant for the split chain p̄(·, ·) of a
strong H-irreducible chain if and only if μ̄ = μp̄ for some invariant measure μ of
p(·, ·), in which case 0 < μ̄({α}) < ∞. Further,
(6.3.5)   μ(B) = (μ̄m)(B) = μ̄(B) + μ̄({α})q(B) ,   B ∈ S .
(μ̄m)(A) = μ̄(A) + μ̄({α})q(A) ,   A ∈ S ,
and in particular, such μ̄m is a positive, σ-finite measure on (S, S) for any σ-finite
μ̄ on (S̄, S̄), and any probability measure q(·) on (S, S). Further, starting the
inhomogeneous Markov chain {Zn} of Proposition 6.3.4 with initial measure μ̄ for
Z0 = X̄0 yields the measure μ̄m for Z1 = X0. By construction, the measure of
Z2 = X̄1 is then μ̄p̄ and that of Z3 = X1 is (μ̄p̄)m = μ̄(p̄m). Next, the invariance
of μ̄ for p̄ implies that the measure of X̄1 equals that of X̄0. Consequently, the
measure of X1 must equal that of X0, namely μ̄m = μ̄(p̄m). With m(·, {α}) ≡ 0
necessarily μ̄m({α}) = 0 and the identity μ̄m = μ̄(p̄m) holds also for the restrictions
to (S, S) of both μ̄m and μ̄(p̄m). Since the latter equals μp for μ = μ̄m (see part (a) of
Proposition 6.3.4), we conclude that μ = μp, as claimed.
Conversely, let μ̄ = μp̄ where μ is an invariant measure for p (and we set μ({α}) =
0). Since μ is σ-finite, there exist An ↑ S such that μ(An) < ∞ for all n and
necessarily also q(An) > 0 for all n large enough (by monotonicity from below
of the probability measure q(·)). Further, the invariance of μ implies that μ̄m =
(μp̄)m = μ, i.e. the relation (6.3.5) holds. In particular, μ̄(S̄) = μ(S) so μ̄ inherits
the positivity of μ. Moreover, both μ̄({α}) = ∞ and μ̄(An) = ∞ contradict the
finiteness of μ(An) for all n, so the measure μ̄ is σ-finite on (S̄, S̄). Next, start
the chain {Zn} at Z0 = X̄0 ∈ S̄ of initial measure μ̄. It yields the same measure
μ = μ̄m for Z1 = X0, with measure μ̄ = μp̄ for Z2 = X̄1 followed by μ̄m = μ for
Z3 = X1 and μp̄ for Z4 = X̄2. As the measure of X1 equals that of X0, it follows
that the measure μp̄ of X̄2 equals the measure μ̄ of X̄1, i.e. μ̄ is invariant for p̄.
Finally, suppose the measure μ̄ satisfies μ̄ = μ̄p̄. Iterating this identity we deduce
that μ̄ = μ̄p̄n for all n ≥ 1, hence also μ̄ = μ̄k̄ for the transition probability
(6.3.6)   k̄(x, B̄) = Σ_{n=1}^∞ 2^{−n} p̄n(x, B̄) .
Due to its strong H-irreducibility, the atom {α} of the split chain is an accessible
set for the transition probability p̄ (see part (b) of Exercise 6.3.10). So, from (6.3.6)
we deduce that k̄(x, {α}) > 0 for all x ∈ S̄. Consequently, as n → ∞,
B̄n = {x ∈ S̄ : k̄(x, {α}) ≥ n^{−1}} ↑ S̄ ,
from which it follows that μ̄({α}) < ∞.
Our next result shows that, similarly to Proposition 6.2.27, the recurrent atom α
induces an invariant measure for the split chain (and hence also one for the original
chain).
Proposition 6.3.25. If {Xn} is H-recurrent of transition probability p(·, ·) then
(6.3.7)   μ̄α(B̄) = Eα[ Σ_{n=0}^{T̄α − 1} I{X̄n ∈ B̄} ]
is an invariant measure for the split chain p̄(·, ·).
μ̄α(B̄) = Σ_{n=0}^∞ μ̄α,n(B̄) ,   B̄ ∈ S̄ ,
and μ̄α,n(g) = Eα[I{T̄α>n} g(X̄n)] for all g ∈ bS̄. Since {T̄α > n} ∈ F_n^X̄ =
σ(X̄k, k ≤ n), we have by the tower and Markov properties that, for each n ≥ 0,
Pα(X̄_{n+1} ∈ B̄, T̄α > n) = Eα[I{T̄α>n} Pα(X̄_{n+1} ∈ B̄ | F_n^X̄)]
= Eα[I{T̄α>n} p̄(X̄n, B̄)] = μ̄α,n(p̄(·, B̄)) = (μ̄α,n p̄)(B̄) .
Hence,
(μ̄α p̄)(B̄) = Σ_{n=0}^∞ (μ̄α,n p̄)(B̄) = Σ_{n=0}^∞ Pα(X̄_{n+1} ∈ B̄, T̄α > n)
= Eα Σ_{n=1}^{T̄α} I{X̄n ∈ B̄} = μ̄α(B̄)
π(A) = (1/Eq(T̄α)) Σ_{n=0}^∞ Pq(X̄n ∈ A, T̄α > n) ,   A ∈ S .
Proof. By Lemma 6.3.24, to any invariant measure μ for p (with μ({α}) = 0),
corresponds the invariant measure μ̄ = μp̄ for the split chain p̄. It is also shown
there that 0 < μ̄({α}) < ∞. Hence, with no loss of generality we assume hereafter
that the given invariant measure μ for p has already been divided by this positive,
finite constant, and so μ̄({α}) = 1. Recall that while proving Lemma 6.3.24 we
further noted that μ = μ̄m, due to the invariance of μ for p. Consequently, to prove
the theorem it suffices to show that μ̄ = μ̄α (for then μ = μ̄m = μ̄αm).
To this end, fix B̄ ∈ S̄ and recall from the proof of Lemma 6.3.24 that μ̄ is also
invariant for p̄n and any n ≥ 1. Using the latter invariance property and applying
Exercise 6.2.3 for y = α and the split chain {X̄n}, we find that
μ̄(B̄) = (μ̄p̄n)(B̄) = ∫_S̄ μ̄(dx)Px(X̄n ∈ B̄) ≥ ∫_S̄ μ̄(dx)Px(X̄n ∈ B̄, T̄α ≤ n)
= Σ_{k=0}^{n−1} (μ̄p̄^{n−k})({α}) Pα(X̄k ∈ B̄, T̄α > k) = Σ_{k=0}^{n−1} μ̄α,k(B̄) ,
so that, taking n → ∞,
(6.3.10)   μ̄(B̄) ≥ Σ_{k=0}^∞ μ̄α,k(B̄) = μ̄α(B̄) ,   B̄ ∈ S̄ .
We proceed to show that this inequality actually holds with equality, namely, that
μ̄ = μ̄α. To this end, recall that while proving Lemma 6.3.24 we showed that
invariant measures for p̄, such as μ̄ and μ̄α, are also invariant for the transition
probability k̄(·, ·) of (6.3.6), and by strong H-irreducibility the measurable function
ḡ(·) = k̄(·, {α}) is strictly positive on S̄. Therefore,
μ̄(ḡ) = (μ̄k̄)({α}) = μ̄({α}) = 1 = μ̄α({α}) = (μ̄αk̄)({α}) = μ̄α(ḡ) .
where ‖ · ‖tv denotes the total variation norm of Definition 3.2.22 and τ = min{ℓ ≥
0 : X̄ℓ = Ȳℓ = α} is the time of the first joint visit of the atom α by the corresponding
copies of the split chain under the coupling of Proposition 6.3.4.
Proof. Fixing g ∈ bS bounded by one, recall that the split mapping yields
ḡ ∈ bS̄ of the same bound, and by part (c) of Proposition 6.3.4,
E g(Xn) − E g(Yn) = E ḡ(X̄n) − E ḡ(Ȳn)
for any joint initial distribution of (X̄0, Ȳ0) on (S̄², S̄ × S̄) and all n ≥ 0. Further,
since X̄n = Ȳn in case τ ≤ n, following the proof of Theorem 6.2.57 one finds that
|E ḡ(X̄n) − E ḡ(Ȳn)| ≤ 2P(τ > n). Since this applies for all g ∈ bS bounded by
one, we are done.
Our goal is to extend the scope of the convergence result of Theorem 6.2.59 to the
setting of positive H-recurrent chains. To this end, we first adapt Definition 6.2.54
of an aperiodic chain.
Definition 6.3.28. The period of a strongly H-irreducible chain is the g.c.d. d
of the set Īα = {n ≥ 1 : Pα(X̄n = α) > 0} of return times to its pseudo-atom α, and
such a chain is called aperiodic if it has period one. For example, q(C) > 0 implies
aperiodicity of the chain.
Remark. Recall that being (strongly) H-irreducible amounts for a countable state
space to having exactly one non-empty equivalence class of intercommunicating
states (which is accessible from any other state). The preceding definition then
coincides with the common period of these intercommunicating states per Definition
6.2.54.
More generally, our definition of the period of the chain seems to depend on which
small set and regeneration measure one chooses. However, in analogy with Exercise
6.3.20, after some work it can be shown that any two split chains for the same strong
H-irreducible chain induce the same period.
Theorem 6.3.29. Let π(·) denote the unique invariant probability measure of an
aperiodic, positive H-recurrent Markov chain {Xn}. If x ∈ S is such that Px(T̄α <
∞) = 1, then
(6.3.12)   lim_{n→∞} ‖Px(Xn ∈ ·) − π(·)‖tv = 0 .
(6.3.13)   Pα(TZ = n) = Σ_{k=1}^∞ 2^{−k} Pα(T̄α^k = n) = Σ_{k=1}^∞ 2^{−k} Pα(Nn(α) = k, X̄n = α) .
E(TZ) = Σ_{k=1}^∞ Eα(T̄α^k) P(Z = k) = Eα(T̄α) Σ_{k=1}^∞ k P(Z = k) = Eα(T̄α) E(Z) < ∞ .
Hence, the increments of the irreducible random walk {Sn} on Z are integrable
and of zero mean. Consequently, n^{−1} Sn → 0 in probability as n → ∞, which by the Chung-Fuchs
theorem implies the recurrence of {Sn} (see Exercise 6.3.23).
Exercise 6.3.30. Suppose {Xn} is the first order auto-regressive process Xn =
θX_{n−1} + ξn, n ≥ 1, with |θ| < 1 and where the integrable i.i.d. {ξn} have a strictly
positive, continuous density f(·) with respect to Lebesgue measure on R^d.
(a) Show that {Xn} is a strong H-irreducible chain.
(b) Show that Vn = Σ_{k=0}^n θ^k ξk converges a.s. to V∞ = Σ_{k≥0} θ^k ξk, whose
law π(·) is an invariant probability measure for {Xn}.
(c) Show that {Xn} is positive H-recurrent.
(d) Explain why {Xn} is aperiodic and deduce that starting at any fixed x ∈
R^d the law of Xn converges in total variation to π(·).
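As a quick consistency check for part (b) in the scalar case, and under the extra assumption (beyond the integrability required above) that the noise is square-integrable, the variance recursion of the chain converges to the variance of the stationary law:

```python
# Deterministic variance recursion for the scalar AR(1) chain
# X_n = theta * X_{n-1} + xi_n with Var(xi) = s2 and X_0 fixed:
# Var(X_n) = theta^2 * Var(X_{n-1}) + s2  ->  s2 / (1 - theta^2),
# which is the variance of V = sum_k theta^k xi_k from part (b).
theta, s2 = 0.8, 1.0
var = 0.0                      # X_0 is a fixed point, so Var(X_0) = 0
for _ in range(500):
    var = theta**2 * var + s2
print(abs(var - s2 / (1 - theta**2)) < 1e-9)   # True
```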
Exercise 6.3.31. Show that if {Xn} is an aperiodic, positive H-recurrent chain
and x, y ∈ S are such that Px(T̄α < ∞) = Py(T̄α < ∞) = 1, then for any A ∈ S,
lim_{n→∞} |Px(Xn ∈ A) − Py(Xn ∈ A)| = 0 .
CHAPTER 7
[Figure 1. Sample functions Xt(ω1) and Xt(ω2) of a continuous time stochastic process, corresponding to two outcomes ω1 and ω2.]
Not all f.d.d. are relevant here, for you should convince yourself that the f.d.d. of
any S.P. should be consistent, as specified next.
Definition 7.1.3. We say that a collection of finite dimensional distributions is
consistent if for any Bk ∈ B, distinct tk ∈ T and finite n,
(7.1.1)   μ_{t1,...,tn,tn+1}(B1 × ⋯ × Bn × R) = μ_{t1,...,tn}(B1 × ⋯ × Bn) .
for some D ∈ Bc and C = {tk} ⊆ T. The set C is then called the (countable) base
of the (countable) representation (C, D) of A.
Indeed, B^T consists of the sets in R^T having a countable representation, and the
σ-algebra F^X = σ(Xt, t ∈ T) is further its pre-image via the mapping X· : Ω ↦ R^T.
Lemma 7.1.7. The σ-algebra B^T is the collection C of all subsets of R^T that have
a countable representation. Further, for any S.P. {Xt, t ∈ T}, the σ-algebra F^X is
the collection G of sets of the form {ω : X·(ω) ∈ A} with A ∈ B^T.
Proof. First note that for any subsets T1 ⊆ T2 of T, the restriction to T1 of
functions on T2 induces a measurable projection p : (R^{T2}, B^{T2}) ↦ (R^{T1}, B^{T1}). Further, enumerating over a countable C maps the corresponding cylindrical σ-algebra
B^C in a one to one manner into the product σ-algebra Bc. Thus, if A ∈ C has the
countable representation (C, D) then A = p^{−1}(D) for the measurable projection p
from R^T to R^C, hence A ∈ B^T. Having just shown that C ⊆ B^T we turn to show that
conversely B^T ⊆ C. Since each finite dimensional measurable rectangle has a countable representation (of a finite base), this is an immediate consequence of the fact
that C is a σ-algebra. Indeed, R^T has a countable representation (of empty base),
and if A ∈ C has the countable representation (C, D) then A^c has the countable
representation (C, D^c). Finally, if Ak ∈ C has a countable representation (Ck, Dk)
for k = 1, 2, . . . then the subset C = ∪k Ck of T serves as a common countable base
for these sets. That is, Ak has the countable representation (C, D̃k), for k = 1, 2, . . .
and D̃k = pk^{−1}(Dk) ∈ Bc, where pk denotes the measurable projection from R^C to R^{Ck}.
As for uniqueness, recall Lemma 7.1.7 that every set in F^X is of the form {ω : (X_{t1}(ω), X_{t2}(ω), . . .) ∈ D} for some D ∈ Bc and C = {tj} a countable subset of T. Fixing such C, recall Kolmogorov's extension theorem, that the law of (X_{t1}, X_{t2}, . . .) on Bc is uniquely determined by the specified laws of (X_{t1}, . . . , X_{tn}) for n = 1, 2, . . .. Since this applies for any countable C, we see that the whole restriction of P to F^X is uniquely determined by the given collection of f.d.d.
any (S, S)-valued S.P. {Xt } provided (S, S) is B-isomorphic (c.f. [Dud89, Theorem
12.1.2] for an even more general setting in which the same applies).
Motivated by Proposition 7.1.8 our definition of the law of the S.P. is as follows.
Definition 7.1.9. The law (or distribution) of a S.P. is the probability measure P_X on B^T such that for all A ∈ B^T,
P_X(A) = P({ω : X·(ω) ∈ A}) .
Proposition 7.1.8 tells us that the f.d.d. uniquely determine the law of any S.P. and provide the probability of any event in F^X. However, for our construction to be considered a success story, we want most events of interest to be in F^X. That is, their image via the sample function should be in B^T. Unfortunately, as we show next, this is certainly not the case for uncountable T.
Lemma 7.1.10. Fixing γ ∈ R and I = [a, b) for some a < b, the following sets
A = {x ∈ R^I : x(t) ≤ γ for all t ∈ I} ,
(a) Show that none of the following collections of functions is in B^I: all linear functions, all polynomials, all constants, all non-decreasing functions, all functions of bounded variation, all differentiable functions, all analytic functions, all functions continuous at a fixed t ∈ I.
(b) Show that B^I fails to contain the collection of functions that vanish somewhere in I, the collection of functions such that x(s) < x(t) for some s < t, and the collection of functions with at least one local maximum.
(c) Show that C(I) has no non-empty subset A ∈ B^I, but the complement of C(I) in R^I has a non-empty subset A ∈ B^I.
(d) Show that the completion B̄^I of B^I with respect to any probability measure P on B^I fails to contain the set A = B(I) of all Borel measurable functions x : I ↦ R.
Hint: Consider A and A^c.
In contrast to the preceding exercise, independence of the increments of a S.P. is determined by its f.d.d.
Exercise 7.1.12. A continuous time S.P. {Xt, t ≥ 0} has independent increments if X_{t+h} − Xt is independent of FtX = σ(Xs, 0 ≤ s ≤ t) for any h > 0 and all t ≥ 0. Show that if X_{t1}, X_{t2} − X_{t1}, . . . , X_{tn} − X_{t_{n−1}} are mutually independent, for all n < ∞ and 0 ≤ t1 < t2 < · · · < tn < ∞, then {Xt} has independent increments. Hence, this property is determined by the f.d.d. of {Xt}.
Here is the canonical construction for Poisson random measures, where T is not a subset of R (for example, the Poisson point processes where T = B_{R^d}).
Exercise 7.1.13. Let T = {A ∈ X : μ(A) < ∞} for a given measure space (X, X, μ). Construct a S.P. {N_A : A ∈ T} such that N_A has the Poisson(μ(A)) law for each A ∈ T and N_{Ak}, k = 1, . . . , n are P-mutually independent whenever Ak, k = 1, . . . , n are disjoint sets.
Hint: Given Aj ∈ T, j = 1, 2, let B_{j1} = Aj = B_{j0}^c and N_{b1,b2}, for b1, b2 ∈ {0, 1} such that (b1, b2) ≠ (0, 0), be independent R.V.-s of Poisson(μ(B_{1b1} ∩ B_{2b2})) law. As the distribution of (N_{A1}, N_{A2}) take the joint law of (N_{1,1} + N_{1,0}, N_{1,1} + N_{0,1}).
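The two-set coupling of the hint is easy to simulate. The sketch below is a minimal illustration, not part of the exercise: the helper names and the particular cell masses μ(A1 ∩ A2) = 0.5, μ(A1 \ A2) = 1.0, μ(A2 \ A1) = 2.0 are arbitrary assumptions.

```python
import math
import random

rng = random.Random(0)

def poisson(lam):
    # Knuth's product-of-uniforms sampler; adequate for the small means used here
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def sample_pair(mu_11=0.5, mu_10=1.0, mu_01=2.0):
    """One draw of (N_{A1}, N_{A2}) per the hint: independent Poisson counts
    on the partition cells A1∩A2, A1\\A2, A2\\A1, then sum the relevant cells."""
    n11, n10, n01 = poisson(mu_11), poisson(mu_10), poisson(mu_01)
    return n11 + n10, n11 + n01

pairs = [sample_pair() for _ in range(20000)]
mean_a1 = sum(p[0] for p in pairs) / len(pairs)   # close to mu(A1) = 1.5
mean_a2 = sum(p[1] for p in pairs) / len(pairs)   # close to mu(A2) = 2.5
```

Each marginal is then Poisson of the total mass of its set, while the shared cell A1 ∩ A2 is what induces the dependence between N_{A1} and N_{A2}.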
Remark. The Poisson process N̂t of rate one is merely the restriction to sets A = [0, t], t ≥ 0, of the Poisson random measure {N_A} in case μ(·) is Lebesgue's measure on [0, ∞). More generally, in case μ(·) has density f(·) with respect to Lebesgue's measure on [0, ∞), we call such restriction Xt = N_{[0,t]} the inhomogeneous Poisson process of rate function f(t) ≥ 0, t ≥ 0. It is a counting process of independent increments, which is a non-random time change Xt = N̂_{μ([0,t])} of a Poisson process of rate one, but in general the gaps between jump times of {Xt} are neither i.i.d. nor of exponential distribution.
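That time change can be sketched in a few lines, under the illustrative assumption f(t) = 2t, so that μ([0, t]) = t² and the inverse time change is u ↦ √u: run a rate-one Poisson process on [0, μ([0, T])] and map its jump times back.

```python
import math
import random

rng = random.Random(1)

def inhom_poisson_jumps(T):
    """Jump times of X_t = N̂_{μ([0,t])} for rate f(t) = 2t, i.e. μ([0,t]) = t²."""
    jumps, s = [], 0.0
    while True:
        s += rng.expovariate(1.0)   # unit-rate exponential gaps of N̂
        if s > T * T:               # past μ([0, T]); stop
            return jumps
        jumps.append(math.sqrt(s))  # inverse time change maps N̂'s jumps to X's

times = inhom_poisson_jumps(3.0)
```

The gaps between consecutive entries of `times` are neither i.i.d. nor exponential, even though the gaps of the underlying unit-rate process are.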
7.2. Continuous and separable modifications
The canonical construction of Section 7.1 determines the law of a S.P. {Xt} on the image B^T of F^X. Recall that F^X is inadequate as far as properties of the sample functions t ↦ Xt(ω) are concerned. Nevertheless, a typical patch of this approach is to choose among S.P.-s with the given f.d.d. one that has regular enough sample functions. To illustrate this, we start with a simple explicit example in which path properties are not entirely determined by the f.d.d.
Example 7.2.1. Consider Yt(ω) = 0 for all t and ω, together with
Xt(ω) = 1 if t = ω, and Xt(ω) = 0 otherwise,
on the probability space ([0, 1], B_{[0,1]}, U), with U the uniform measure on I = [0, 1]. Since At = {ω : Xt(ω) ≠ Yt(ω)} = {t}, clearly P(Xt = Yt) = 1 for each fixed t ∈ I. Moreover, P(∪_{i=1}^n A_{ti}) = 0 for any t1, . . . , tn ∈ I, hence {Xt} has the same f.d.d. as {Yt}. However, P({ω : sup_{t∈I} Xt(ω) ≠ 0}) = 1, whereas P({ω : sup_{t∈I} Yt(ω) ≠ 0}) = 0. Similarly, P({ω : X·(ω) ∈ C(I)}) = 0, whereas P({ω : Y·(ω) ∈ C(I)}) = 1.
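This dichotomy is easy to see numerically; in the minimal sketch below the grid is just an illustrative discretization of I = [0, 1].

```python
import random

rng = random.Random(2)

def X(t, omega):          # X_t(w) = 1 when t = w, else 0
    return 1.0 if t == omega else 0.0

def Y(t, omega):          # Y_t(w) = 0 identically
    return 0.0

omega = rng.random()      # w drawn from the uniform measure U on [0, 1]
grid = [k / 1000 for k in range(1001)]

# For each fixed t the two paths agree (P(w = t) = 0 for fixed t):
agree_pointwise = all(X(t, omega) == Y(t, omega) for t in grid)

# Yet the path suprema differ: sup_t X_t(w) = X_w(w) = 1, while sup_t Y_t(w) = 0.
sup_X = max(X(t, omega) for t in grid + [omega])
sup_Y = max(Y(t, omega) for t in grid + [omega])
```

The supremum over the whole path "sees" the single exceptional time t = ω, which no finite dimensional distribution detects.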
While the two S.P. of Example 7.2.1 have different maximal value and differ in
their sample path continuity, we would typically consider one to be merely a (small)
modification of the other, motivating our next definition.
Definition 7.2.2. Stochastic processes {Xt} and {Yt} are called versions of one another if they have the same f.d.d. A S.P. {Yt, t ∈ T} is further called a modification of {Xt, t ∈ T} if P(Xt ≠ Yt) = 0 for all t ∈ T, and two such S.P.-s are called indistinguishable if {ω : Xt(ω) ≠ Yt(ω) for some t ∈ T} is a P-null set (hence, upon completing the space, P(Xt ≠ Yt for some t ∈ T) = 0). Similarly to Definition 1.2.8, throughout we consider two indistinguishable S.P.-s to be the same process, hence often omit the qualifier a.s. in reference to sample function properties that apply for all t ∈ T.
For example, {Yt } is the continuous modification of {Xt } in Example 7.2.1 but
these two processes are clearly distinguishable. In contrast, modifications with a.s.
right-continuous sample functions are indistinguishable.
Exercise 7.2.3. Show that continuous time S.P.-s {Xt } and {Yt } which are modifications of each other and have w.p.1. right-continuous sample functions, must
also be indistinguishable.
You should also convince yourself at this point that as we have implied, if {Yt } is
a modification of {Xt }, then {Yt } is also a version of {Xt }. The converse fails, for
while a modification has to be defined on the same probability space as the original
S.P. this is not required of versions. Even on the same probability space it is easy
to find a pair of versions which are not modifications of each other.
Example 7.2.4. For the uniform probability measure on the finite set Ω = {H, T}, the constant in time S.P.-s Xt(ω) = I_H(ω) and Yt(ω) = 1 − Xt(ω) are clearly versions of each other but not modifications of each other.
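A two-line check of this example (the particular time grid below is an arbitrary illustrative choice, since the paths are constant in t):

```python
# Omega = {H, T} with the uniform probability measure
Omega = ["H", "T"]

def X(t, omega):          # X_t(w) = I_H(w), constant in t
    return 1 if omega == "H" else 0

def Y(t, omega):          # Y_t(w) = 1 - X_t(w)
    return 1 - X(t, omega)

# Identical one-point laws (hence identical f.d.d., the paths being constant):
law_X = sorted(X(0.0, w) for w in Omega)
law_Y = sorted(Y(0.0, w) for w in Omega)

# But the paths never agree: P(X_t != Y_t) = 1 for every t, so not a modification.
never_equal = all(X(t, w) != Y(t, w) for w in Omega for t in (0.0, 1.0, 2.5))
```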
We proceed to derive a relatively easy to check sufficient condition for the existence of a (continuous) modification of the S.P. which has Hölder continuous sample functions, as defined next.
Definition 7.2.5. Recall that a function f(t) on a metric space (T, d(·, ·)) is locally γ-Hölder continuous if
sup_{ {t≠s : d(t,u) ∨ d(s,u) < hu} } |f(t) − f(s)| / d(t, s)^γ ≤ cu ,
for γ > 0, some c : T ↦ [0, ∞) and h : T ↦ (0, ∞], and is uniformly γ-Hölder continuous if the same applies for constant cu = c < ∞ and hu = ∞. In case γ = 1 such functions are also called locally (or uniformly) Lipschitz continuous, respectively. We say
P({ω : sup_{t≠s} |X̃t(ω) − X̃s(ω)| / |t − s|^γ ≤ c(ω)}) = 1
for some finite R.V. c(ω). Our starting point is the estimate
(7.2.2) P(|Xt − Xs| ≥ δ) ≤ δ^{−α} E[|Xt − Xs|^α] ≤ c δ^{−α} |t − s|^{1+β} , for all s, t ∈ T ,
which holds for any δ > 0, t, s ∈ I, where the first inequality follows from Markov's inequality and the second from (7.2.1). From this bound we establish the a.s. local Hölder continuity of the sample function of {Xt} over the collection Q1^{(2)} = ∪ℓ Q1^{(2,ℓ)} of dyadic rationals in [0, 1], where QT^{(2,ℓ)} = {j2^{−ℓ} ∈ [0, T], j ∈ Z+}. To this end, fixing γ < β/α and considering (7.2.2) for δ = 2^{−γℓ}, we have by finite sub-additivity that
P( max_{1≤j≤2^ℓ} |X_{j2^{−ℓ}} − X_{(j−1)2^{−ℓ}}| ≥ 2^{−γℓ} ) ≤ c 2^{−(β−γα)ℓ} := ηℓ ,
As you show in Exercise 7.2.7 this implies the local γ-Hölder continuity of t ↦ Xt(ω) over the dyadic rationals. That is, for some finite c(ω),
|Xt(ω) − Xs(ω)| ≤ c(ω) |t − s|^γ for all t, s ∈ Q1^{(2)} .
Next set
(7.2.3) X̃t = limn X̃_{sn} , for some non-random {sn} ⊆ Q1^{(2)} such that sn → t ∈ [0, 1] \ Q1^{(2)} .
Indeed, in view of (7.2.3), by the uniform continuity of s ↦ X̃s(ω) over Q1^{(2)}, the sequence n ↦ X̃_{sn}(ω) is Cauchy, hence convergent, per ω. By construction, the S.P. {X̃t, t ∈ [0, 1]} is such that
|X̃t(ω) − X̃s(ω)| ≤ c(ω) |t − s|^γ , for all t, s ∈ Q1^{(2)} ,
using the bound |x(t) − x(s)| ≤ 2 Σ_{ℓ=m+1}^{k} η_{ℓ,1}(x) .
in S = (C(T), ‖·‖∞) is the countable intersection of Rt ∩ C(T) for the corresponding one dimensional measurable rectangles Rt ∈ B^T indexed by t ∈ T ∩ Qr. Consequently, each open ball B(x, r) is in the σ-algebra C = {A ⊆ C(T) : A ∈ B^T}. With Γ denoting a countable dense subset of the separable metric space S, it readily follows that S has a countable base U, consisting of the balls B(x, 1/n) for positive integers n and centers x ∈ Γ. With every open set thus being a countable union of elements from U, it follows that B_S = σ(U). Further, U ⊆ C, hence also B_S ⊆ C. Conversely, recall that C = σ(O) for the collection O of sets of the form
O = {x ∈ C(T) : x(ti) ∈ Oi, i = 1, . . . , n} ,
While direct application of Theorem 7.2.6 is limited to (locally γ-Hölder) continuous modifications on compact intervals, say [0, T], it is easy to combine these to one (locally γ-Hölder) continuous modification, valid on [0, ∞).
Lemma 7.2.12. Suppose there exist Tn ↑ ∞ such that the continuous time S.P. {Xt, t ≥ 0} has (locally γ-Hölder) continuous modifications {X̃t^(n), t ∈ [0, Tn]}. Then, the S.P. {Xt, t ≥ 0} also has such a modification on [0, ∞).
Proof. By assumption, for each positive integer n, the event
An = {ω : X̃t^(n)(ω) = Xt(ω), ∀t ∈ Q ∩ [0, Tn]} ,
has probability one. The event A = ∩n An of probability one is then such that X̃t^(n)(ω) = X̃t^(m)(ω) for all ω ∈ A, positive integers n, m and any t ∈ Q ∩ [0, Tn ∧ Tm]. By continuity of t ↦ X̃t^(n)(ω) and t ↦ X̃t^(m)(ω) it follows that for all ω ∈ A,
X̃t^(n)(ω) = X̃t^(m)(ω) , ∀n, m, t ∈ [0, Tn ∧ Tm] .
Setting X̃t(ω) = X̃t^(n)(ω) for ω ∈ A and t ∈ [0, Tn] thus consistently defines a S.P. {X̃t, t ≥ 0}. Since Tn ↑ ∞, we conclude that this S.P. is a (locally γ-Hölder) continuous modification of {Xt, t ≥ 0}.
The following application of the Kolmogorov–Centsov theorem demonstrates the importance of its free parameter.
Exercise 7.2.13. Suppose {Xt, t ∈ I} is a continuous time S.P. such that E(Xt) = 0 and E(Xt^2) = 1 for all t ∈ I, a compact interval on the line.
(a) Show that if for some finite c, p > 1 and h > 0,
(7.2.4)
Example 7.2.14. There exist S.P.-s satisfying (7.2.4) with p = 1 for which there is no continuous modification. One such process is the random telegraph signal Rt = (−1)^{Nt} R0, where P(R0 = 1) = P(R0 = −1) = 1/2 and R0 is independent of the Poisson process {Nt} of rate one. The process {Rt} alternately jumps between −1 and +1 at the random jump times {Tk} of the Poisson process {Nt}. Hence, by the same argument as in Example 7.2.11 it does not have a continuous modification. Further, for any t > s ≥ 0,
E[Rs Rt] = 1 − 2P(Rs ≠ Rt) ≥ 1 − 2P(Ns < Nt) ≥ 1 − 2(t − s) ,
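The telegraph signal is simple to simulate; the sketch below is illustrative (rate one and horizon T = 10 are arbitrary choices) and makes visible why every sample function is ±1-valued and piecewise constant, so no modification can be continuous.

```python
import random

rng = random.Random(3)

def telegraph_path(lam=1.0, T=10.0):
    """Return t -> R_t = (-1)^{N_t} R_0, with R_0 = +-1 fair, independent of
    the rate-lam Poisson jump times on [0, T]."""
    r0 = rng.choice([-1, 1])
    jumps, s = [], 0.0
    while True:
        s += rng.expovariate(lam)
        if s > T:
            break
        jumps.append(s)
    def R(t):
        n_t = sum(1 for u in jumps if u <= t)   # N_t = number of jumps by time t
        return r0 * (-1) ** n_t
    return R

R = telegraph_path()
values = [R(k / 10) for k in range(101)]
```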
Remark. The S.P. {Rt} of Example 7.2.14 is a special instance of the continuous-time Markov jump processes, which we study in Section 8.3.3. Though the sample function of this process is a.s. discontinuous, it has the following RCLL property, as is the case for all continuous-time Markov jump processes.
Definition 7.2.15. Given a countable C ⊆ I we say that a function x ∈ R^I is C-separable at t if there exists a sequence sk ∈ C that converges to t such that x(sk) → x(t). If this holds at all t ∈ I, we call x(·) a C-separable function. A continuous time S.P. {Xt, t ∈ I} is separable if there exists a non-random, countable C ⊆ I such that all sample functions t ↦ Xt(ω) are C-separable. Such a process is further right-continuous with left-limits (in short, RCLL), if the sample function t ↦ Xt(ω) is right-continuous and of left-limits at any t ∈ I (that is, for h ↓ 0 both X_{t+h}(ω) → Xt(ω) and the limit of X_{t−h}(ω) exists). Similarly, a modification which is a separable S.P. or one having RCLL sample functions is called a separable modification, or RCLL modification of the S.P., respectively. As usual, it suffices to have any of these properties w.p.1 (for we do not differentiate between a pair of indistinguishable S.P.-s).
Remark. Clearly, a S.P. of continuous sample functions is also RCLL and a S.P. having right-continuous sample functions (in particular, any RCLL process), is further separable. To summarize,
Hölder continuity ⇒ Continuity ⇒ RCLL ⇒ Separable
(a) Fixing B ∈ B, consider the probabilities p(D) = P(Ys ∈ B for all s ∈ D), for countable D ⊆ I. Show that for any A ⊆ I there exists a countable subset D∗ = D∗(A, B) of A such that p(D∗) = inf{p(D) : countable D ⊆ A}.
Hint: Let D∗ = ∪k Dk where p(Dk) ≤ k^{−1} + inf{p(D) : countable D ⊆ A}.
(b) Deduce that if t ∈ A then Nt(A, B) = {ω : Ys(ω) ∈ B for all s ∈ D∗(A, B) and Yt(ω) ∉ B} has zero probability.
(c) Let C denote the union of D∗(A, B) over all A = I ∩ (q1, q2) and B = (q3, q4)^c, with qi ∈ Q. Show that at any t ∈ I there exists Nt ∈ F such that P(Nt) = 0 and the sample functions t ↦ Yt(ω) are C-separable at t for every ω ∉ Nt.
Hint: Let Nt denote the union of Nt(A, B) over the sets (A, B) as in the definition of C, such that further t ∈ A.
for the events Nt of zero probability from part (c) of Exercise 7.2.17. The resulting S.P. {Ỹt, t ∈ I} is a [0, 1]-valued modification of {Yt} (since P(Ỹt ≠ Yt) ≤ P(Nt) = 0 for each t ∈ I). It clearly suffices to check C-separability of t ↦ Ỹt(ω) at each fixed t ∉ C and this holds by our construction if ω ∈ Nt and by part (c) of Exercise 7.2.17 in case ω ∈ Nt^c. For any (0, 1)-valued S.P. {Yt} we have thus constructed a separable [0, 1]-valued modification {Ỹt}. To handle an R-valued S.P. {Xt, t ∈ I}, let {Ỹt, t ∈ I} denote the [0, 1]-valued, separable modification of the (0, 1)-valued S.P. Yt = F_G(Xt), with F_G(·) denoting the standard normal distribution function. Since F_G(·) has a continuous inverse F_G^{−1} : [0, 1] ↦ R̄ (where F_G^{−1}(0) = −∞ and F_G^{−1}(1) = ∞), it directly follows that X̃t = F_G^{−1}(Ỹt) is an R̄-valued separable modification of the S.P. {Xt}.
Here are a few elementary and useful consequences of separability. For example, for any C-separable S.P. and interval J = [s, s + h),
sup_{t∈J} |Xt − Xs| = sup_{t∈J∩C} |Xt − Xs| ,
which is clearly also a measurable S.P. By the denseness of {sk} in [0, 1], it follows from the continuity in probability of {Yt} that Yt^{(n)} →^p Yt as n → ∞, for any fixed t ∈ [0, 1]. Hence, by bounded convergence E[|Yt^{(n)} − Yt^{(m)}|] → 0 as n, m → ∞ for each t ∈ [0, 1]. Then, by yet another application of bounded convergence
lim_{m,n→∞} E[|Y_T^{(n)} − Y_T^{(m)}|] = 0 ,
where the R.V. T ∈ [0, 1] is chosen independently of P, according to the uniform probability measure U corresponding to Lebesgue's measure λ(·) restricted to ([0, 1], B_{[0,1]}). By Fubini's theorem, this amounts to {Yt^{(n)}(ω)} being a Cauchy, hence convergent, sequence in L^1([0, 1] × Ω, B_{[0,1]} × F, U × P) (recall Proposition 4.3.7 that the latter is a Banach space). In view of Theorem 2.2.10, upon passing to a suitable subsequence nj we thus have that (t, ω) ↦ Yt^{(nj)}(ω) converges to some
Thus, {Ỹt, t ∈ [0, 1]} is {sk}-separable and as claimed, it is a separable, measurable modification of {Yt, t ∈ [0, 1]}.
Recall (1.4.7) that the measurability of the process, namely of (t, ω) ↦ Xt(ω), implies that all its sample functions t ↦ Xt(ω) are Lebesgue measurable functions on I. Measurability of a S.P. also results with well defined integrals of its sample function. For example, if a Borel function h(t, x) is such that ∫_I E[|h(t, Xt)|] dt is finite, then by Fubini's theorem t ↦ E[h(t, Xt)] is in L^1(I, B_I, λ), the integral ∫_I h(s, Xs) ds is an a.s. finite R.V. and
∫_I E[h(s, Xs)] ds = E[ ∫_I h(s, Xs) ds ] .
Conversely, as you are to show next, under mild conditions the differentiability of sample functions t ↦ Xt implies the differentiability of t ↦ E[Xt].
Convince yourself that non-negative definiteness is the only property that the autocovariance function of a Gaussian S.P. must have and further that the following is
an immediate corollary of the canonical construction and the definitions of Gaussian
random vectors and stochastic processes.
Exercise 7.3.4.
(a) Show that for any index set T, the law of a Gaussian S.P. is uniquely
determined by its mean and auto-covariance functions.
(b) Show that a Gaussian S.P. exists for any mean function and any nonnegative definite auto-covariance function.
Remark. An interesting consequence of Exercise 7.3.4 is the existence of an isonormal process on any vector space H equipped with an inner product as in Definition 4.3.5. That is, a centered Gaussian process {Xh, h ∈ H} indexed by elements of H whose auto-covariance function is given by the inner product (h1, h2) : H × H ↦ R. Indeed, the latter is non-negative definite on H × H since for h = Σ_{j=1}^n aj hj ∈ H,
Σ_{j=1}^n Σ_{k=1}^n aj ak (hj, hk) = (h, h) ≥ 0 .
One of the useful properties of Gaussian processes is their closure with respect to L^2-convergence (as a consequence of Proposition 3.5.15).
Proposition 7.3.5. If the S.P. {Xt, t ∈ T} and the Gaussian S.P.-s {Xt^(k), t ∈ T} are such that E[(Xt^(k) − Xt)^2] → 0 as k → ∞, for each fixed t ∈ T, then {Xt, t ∈ T} is a Gaussian S.P. whose mean and auto-covariance functions are the pointwise limits of those for the processes {Xt^(k), t ∈ T}.
for any n finite and s, ti ≥ 0 (or for any s, ti ∈ R in case of a two-sided continuous time S.P.). In contrast, here is a much weaker concept of stationarity.
Definition 7.3.8. A square-integrable continuous time S.P. of constant mean function and auto-covariance function of the form c(t, s) = r(|t − s|) is called weakly stationary (or L^2-stationary).
Indeed, considering (7.3.2) for n = 1 and n = 2, clearly any square-integrable
stationary S.P. is also weakly stationary. As you show next, the converse fails in
general, but applies for all Gaussian S.P.
Exercise 7.3.9. Show that any weakly stationary Gaussian S.P. is also (strictly)
stationary. In contrast, provide an example of a (non-Gaussian) weakly stationary
process which is not stationary.
To gain more insight about stationary processes solve the following exercise.
Wt = x + Σ_{k=0}^∞ ak(t) Gk ,
Figure 2. Three sample functions of Brownian motion. The density curves illustrate that the random variable W1 has a N(0, 1) law, while W2 has a N(0, 2) law.
with {Gk} i.i.d. standard normal random variables and ak(·) continuous functions on I such that
(7.3.3) Σ_{k=0}^∞ ak(t) ak(s) = t ∧ s = ½ (|t + s| − |t − s|) .
For example, taking T = 1/2 and expanding f(x) = |x| for |x| ≤ 1 into a Fourier series, one finds that
|x| = ½ − Σ_{k=0}^∞ [4 / ((2k + 1)^2 π^2)] cos((2k + 1)πx) .
P( ‖ Σ_{k=n}^∞ ak(·) Gk ‖∞ ≥ ε ) → 0
CHAPTER 8
Proof. Fixing t > 0, let Q_{t+}^{(ℓ)} denote the finite set of dyadic rationals of the form j2^{−ℓ} ∈ [0, t] augmented by {t} and arranged in increasing order 0 = t0 < t1 < · · · < tk = t (where k = ⌈t2^ℓ⌉). The ℓ-th approximation of the sample function Xs(ω) for s ∈ [0, t], is then
Xs^(ℓ)(ω) = X0(ω) I_{{0}}(s) + Σ_{j=1}^k X_{tj}(ω) I_{(t_{j−1}, tj]}(s) ,
so that for any B ∈ B,
(X^(ℓ))^{−1}(B) = ({0} × X0^{−1}(B)) ∪ ∪_{j=1}^k (t_{j−1}, tj] × X_{tj}^{−1}(B) ,
which is in the product σ-algebra B_{[0,t]} × Ft, since each of the sets X_{tj}^{−1}(B) is in Ft (recall that {Xs, s ≥ 0} is Ft-adapted and tj ∈ [0, t]). Consequently, each
Associated with any filtration {Ft} is the collection of all Ft-stopping times and the corresponding stopped σ-algebras (compare with Definitions 5.1.11 and 5.1.34).
Definition 8.1.9. A random variable τ : Ω ↦ [0, ∞] is called a stopping time for the (continuous time) filtration {Ft}, or in short Ft-stopping time, if {ω : τ(ω) ≤ t} ∈ Ft for all t ≥ 0. Associated with each Ft-stopping time τ is the stopped σ-algebra Fτ = {A ∈ F : A ∩ {τ ≤ t} ∈ Ft for all t ≥ 0} (which quantifies the information in the filtration at the stopping time τ).
The Ft+-stopping times are also called Ft-Markov times (or Ft-optional times), with the corresponding Markov σ-algebras Fτ+ = {A ∈ F : A ∩ {τ ≤ t} ∈ Ft+ for all t ≥ 0}.
Remark. As their name suggests, Markov/optional times appear both in the context of Doob's optional stopping theorem (in Section 8.2.3), and in that of the strong Markov property (see Section 8.3.2).
Obviously, any non-random constant t ≥ 0 is a stopping time. Further, by definition, every Ft-stopping time is also an Ft-Markov time and the two concepts coincide for right-continuous filtrations. Similarly, the Markov σ-algebra Fτ+ contains the stopped σ-algebra Fτ for any Ft-stopping time τ (and they coincide in case of right-continuous filtrations).
Your next exercise provides a more explicit characterization of Markov times and closure properties of Markov and stopping times (some of which you saw before in Exercise 5.1.12).
Exercise 8.1.10.
(a) Show that θ is an Ft-Markov time if and only if {ω : θ(ω) < t} ∈ Ft for all t ≥ 0.
(b) Show that if {τn, n ∈ Z+} are Ft-stopping times, then so are τ1 ∧ τ2, τ1 + τ2 and supn τn.
(c) Show that if {θn, n ∈ Z+} are Ft-Markov times, then in addition to θ1 + θ2 and supn θn, also infn θn, lim infn θn and lim supn θn are Ft-Markov times.
(d) In the setting of part (c) show that θ1 + θ2 is an Ft-stopping time when either both θ1 and θ2 are strictly positive, or alternatively, when θ1 is a strictly positive Ft-stopping time.
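The closure properties of part (b) are already visible in discrete time. The sketch below is an illustration on one fixed sample path (the path data and the level thresholds are arbitrary assumptions): first hitting times are stopping times because {τ ≤ n} is decided by the first n + 1 observations alone, and minima and sums of such times inherit this.

```python
# A fixed sample path of a discrete-time process (illustrative data)
path = [0, 1, -1, 2, -3, 5, -2, 4]

def hitting_time(prefix, level):
    """tau = inf{n : X_n >= level}; {tau <= n} depends only on X_0, ..., X_n."""
    for n, x in enumerate(prefix):
        if x >= level:
            return n
    return float("inf")

tau1 = hitting_time(path, 2)      # first n with X_n >= 2
tau2 = hitting_time(path, 4)      # first n with X_n >= 4

# {tau1 <= n} is determined by the first n+1 observations (the stopping property):
adapted = all((tau1 <= n) == (hitting_time(path[:n + 1], 2) <= n)
              for n in range(len(path)))

tau_min, tau_sum = min(tau1, tau2), tau1 + tau2   # again stopping times, per (b)
```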
Similarly, here are some of the basic properties of stopped σ-algebras (compare with Exercise 5.1.35), followed by additional properties of Markov σ-algebras.
Exercise 8.1.11. Suppose θ and τ are Ft-stopping times.
(a) Verify that σ(τ) ⊆ Fτ, that Fτ is a σ-algebra, and if τ(ω) = t is non-random then Fτ = Ft.
(b) Show that Fθ∧τ = Fθ ∩ Fτ and deduce that each of the events {θ < τ}, {θ ≤ τ}, {θ = τ} belongs to Fθ∧τ.
Hint: Show first that if A ∈ Fθ then A ∩ {θ ≤ τ} ∈ Fτ.
In contrast to adaptedness, progressive measurability transfers to stopped processes (i.e. the continuous time extension of Definition 5.1.31), which is essential when dealing in Section 8.2 with stopped sub-martingales (i.e. the continuous time extension of Theorem 5.1.32).
Proposition 8.1.13. Given an Ft-progressively measurable S.P. {Xs, s ≥ 0}, the stopped at τ (an Ft-stopping time) S.P. {X_{s∧τ(ω)}(ω), s ≥ 0} is also Ft-progressively measurable. In particular, if either τ < ∞ or there exists X∞ ∈ mF∞, then Xτ ∈ mFτ.
Proof. Fixing t > 0, denote by S the product σ-algebra B_{[0,t]} × Ft on the product space S = [0, t] × Ω. The assumed Ft-progressive measurability of {Xs, s ≥ 0} amounts to the measurability of g1 : (S, S) ↦ (R, B) such that g1(s, ω) = Xs(ω). Further, as (s, ω) ↦ X_{s∧τ(ω)}(ω) is the composition g1(g2(s, ω)) for the mapping g2(s, ω) = (s ∧ τ(ω), ω) from (S, S) to itself, by Proposition 1.2.18 the Ft-progressive measurability of the stopped S.P. follows from our claim that g2 is measurable. Indeed, recall that τ is an Ft-stopping time, so {ω : τ(ω) > u} ∈ Ft for any u ∈ [0, t]. Hence, for any fixed u ∈ [0, t] and A ∈ Ft,
g2^{−1}((u, t] × A) = (u, t] × (A ∩ {ω : τ(ω) > u}) ∈ S ,
Example 8.1.14. Indeed, consider B = (0, ∞) and the S.P. Xt(ω) = tω of Example 8.1.6. In this case, τB(1) = 0 while τB(−1) = ∞, so the event {ω : τB(ω) ≤ 0} = {1} is not in F0^X = {∅, Ω} (hence τB is not an FtX-stopping time). As shown next, this problem is only due to the lack of right-continuity in the filtration {FtX}.
Proposition 8.1.15. Consider an Ft-adapted, right-continuous S.P. {Xs, s ≥ 0}. Then, the first hitting time τB(ω) = inf{t ≥ 0 : Xt(ω) ∈ B} is an Ft-Markov time for an open set B and further an Ft-stopping time when B is a closed set and {Xs, s ≥ 0} has continuous sample functions.
Proof. Fixing t > 0, by definition of τB the set τB^{−1}([0, t)) is the union of Xs^{−1}(B) over all s ∈ [0, t). Further, if the right-continuous function s ↦ Xs(ω) intersects an open set B at some s ∈ [0, t) then necessarily Xq(ω) ∈ B at some q ∈ Qt = Q ∩ [0, t). Consequently,
(8.1.1) τB^{−1}([0, t)) = ∪_{s∈Qt} Xs^{−1}(B) .
Now, the Ft-adaptedness of {Xs} implies that Xs^{−1}(B) ∈ Fs ⊆ Ft for all s ≤ t, and in particular for any s in the countable collection Qt. We thus deduce from (8.1.1) that {τB < t} ∈ Ft for all t ≥ 0, and in view of part (a) of Exercise 8.1.10, conclude that τB is an Ft-Markov time in case B is open.
Assuming hereafter that B is closed and u ↦ Xu continuous, we claim that for any t > 0,
(8.1.2) {τB ≤ t} = ∪_{0≤s≤t} Xs^{−1}(B) = ∩_{k=1}^∞ {τ_{Bk} < t} := At ,
where Bk = {x ∈ R : |x − y| < k^{−1}, for some y ∈ B}, and that the left identity in (8.1.2) further holds for t = 0. Clearly, X0^{−1}(B) ∈ F0 and for Bk open, by the preceding proof {τ_{Bk} < t} ∈ Ft. Hence, (8.1.2) implies that {τB ≤ t} ∈ Ft for all t ≥ 0, namely, that τB is an Ft-stopping time.
Turning to verify (8.1.2), fix t > 0 and recall that if ω ∈ At then |X_{sk}(ω) − yk| < k^{−1} for some sk ∈ [0, t) and yk ∈ B. Upon passing to a sub-sequence, sk → s ∈ [0, t], hence by continuity of the sample function X_{sk}(ω) → Xs(ω). This in turn implies that yk → Xs(ω) ∈ B (because B is a closed set). Conversely, if Xs(ω) ∈ B for some s ∈ [0, t) then also τ_{Bk} ≤ s < t for all k ≥ 1, whereas even if only Xt(ω) = y ∈ B, by continuity of the sample function also Xs(ω) → y for s ↑ t (and once again τ_{Bk} < t for all k ≥ 1). To summarize, ω ∈ At if and only if there exists s ∈ [0, t] such that Xs(ω) ∈ B, as claimed. Considering hereafter t ≥ 0 (possibly t = 0), the existence of s ∈ [0, t] such that Xs ∈ B results with {τB ≤ t}. Conversely, if τB(ω) ≤ t then X_{sn}(ω) ∈ B for some sn(ω) ≤ t + n^{−1} and all n. But then s_{nk} → s ≤ t along some sub-sequence nk, so for B closed, by continuity of the sample function also X_{s_{nk}}(ω) → Xs(ω) ∈ B.
We conclude with a technical result on which we shall later rely, for example,
in proving the optional stopping theorem and in the study of the strong Markov
property.
Lemma 8.1.16. Given an Ft-Markov time θ, let θℓ = 2^{−ℓ}([2^ℓ θ] + 1) for ℓ ≥ 1. Then, θℓ are Ft-stopping times and A ∩ {ω : θℓ(ω) = q} ∈ Fq for any A ∈ Fθ+, ℓ ≥ 1 and q ∈ Q^{(2,ℓ)} = {k2^{−ℓ}, k ∈ Z+}.
Proof. By its construction, θℓ takes values in the discrete set Q^{(2,ℓ)} ∪ {∞}. Moreover, with {ω : θ(ω) < t} ∈ Ft for any t ≥ 0 (see part (a) of Exercise 8.1.10), it follows that for any q ∈ Q^{(2,ℓ)},
{ω : θℓ(ω) = q} = {ω : θ(ω) ∈ [q − 2^{−ℓ}, q)} ∈ Fq .
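The dyadic discretization θℓ is a deterministic rounding, which the sketch below makes concrete (the value θ = 0.3 is an arbitrary illustrative choice): θℓ rounds θ strictly up to the level-ℓ dyadic grid, so each θℓ takes countably many values and θℓ ↓ θ as ℓ → ∞.

```python
import math

def discretize(theta, ell):
    """theta_ell = 2^{-ell} * (floor(2^ell * theta) + 1): round theta strictly up
    to the next dyadic point of level ell."""
    if math.isinf(theta):
        return theta
    return (math.floor(2 ** ell * theta) + 1) / 2 ** ell

theta = 0.3
approx = [discretize(theta, ell) for ell in range(1, 12)]
```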
EXt^2 ≥ EXs^2 for any t ≥ s ≥ 0, so that t ↦ EXt^2 is non-decreasing.
As you see next, the Wiener process and the compensated Poisson process play
the same role that the random walk of zero-mean increments plays in the discrete
time setting (with Wiener process being the prototypical MG of continuous sample
functions, and compensated Poisson process the prototypical MG of discontinuous
RCLL sample functions).
Proposition 8.2.4. Any integrable S.P. {Xt, t ≥ 0} of independent increments (see Exercise 7.1.12), and constant mean function is a MG.
Proof. Recall that a S.P. Xt has independent increments if X_{t+h} − Xt is independent of FtX, for all h > 0 and t ≥ 0. We have also assumed that E|Xt| < ∞ and EXt = EX0 for all t ≥ 0. Therefore, E[X_{t+h} − Xt | FtX] = E[X_{t+h} − Xt] = 0. Further, Xt ∈ mFtX and hence E[X_{t+h} | FtX] = Xt. That is, {Xt, FtX, t ≥ 0} is a MG, as claimed.
Example 8.2.5. In view of Exercise 7.3.13 and Proposition 8.2.4 we have that the Wiener process/Brownian motion (Wt, t ≥ 0) of Definition 7.3.12 is a martingale. Combining Proposition 3.4.9 and Exercise 7.1.12, we see that the Poisson process Nt of rate λ has independent increments and mean function ENt = λt. Consequently, by Proposition 8.2.4 the compensated Poisson process Mt = Nt − λt is also a martingale (and FtM = FtN).
Similarly to Exercise 5.1.9, as you check next, a Gaussian martingale {Xt, t ≥ 0} is necessarily square-integrable and of independent increments, in which case Mt = Xt^2 − ⟨X⟩t is also a martingale.
Exercise 8.2.6.
(a) Show that if {Xt, t ≥ 0} is a square-integrable S.P. having zero-mean independent increments, then (Xt^2 − ⟨X⟩t, FtX, t ≥ 0) is a MG with ⟨X⟩t = EXt^2 − EX0^2 a non-random, non-decreasing function.
(b) Prove that the conclusion of part (a) applies to any martingale {Xt, t ≥ 0} which is a Gaussian S.P.
(c) Deduce that if {Xt, t ≥ 0} is square-integrable, with X0 = 0 and zero-mean, stationary independent increments, then (Xt^2 − tEX1^2, FtX, t ≥ 0) is a MG.
In the context of the Brownian motion {Bt, t ≥ 0}, we deduce from part (b) of Exercise 8.2.6 that {Bt^2 − t, t ≥ 0} is a MG. This is merely a special case of the following collection of MGs associated with the standard Brownian motion.
uk(t, y, 0) = Σ_{r=0}^{[k/2]} [k! / ((k − 2r)! r!)] y^{k−2r} (−t/2)^r .
(c) Deduce that the S.P.-s (uk(t, Bt, θ), t ≥ 0), k = 1, 2, . . . are also MGs with respect to FtB, as are Bt^2 − t, Bt^3 − 3tBt, Bt^4 − 6tBt^2 + 3t^2 and Bt^6 − 15tBt^4 + 45t^2Bt^2 − 15t^3.
(d) Verify that for each k ∈ Z+ and θ ∈ R the function uk(t, y, θ) solves the heat equation ut(t, y) + ½ uyy(t, y) = 0.
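Part (d) can be checked numerically for the θ = 0 polynomials, using central differences (which are nearly exact for polynomials). The helper names and the evaluation points below are arbitrary illustrative choices.

```python
from math import factorial

def u(k, t, y):
    """u_k(t, y, 0) = sum_r k!/((k-2r)! r!) * y^{k-2r} * (-t/2)^r."""
    return sum(factorial(k) // (factorial(k - 2 * r) * factorial(r))
               * y ** (k - 2 * r) * (-t / 2) ** r
               for r in range(k // 2 + 1))

def heat_residual(k, t, y, h=1e-4):
    """u_t + (1/2) u_yy, approximated via central differences."""
    u_t = (u(k, t + h, y) - u(k, t - h, y)) / (2 * h)
    u_yy = (u(k, t, y + h) - 2 * u(k, t, y) + u(k, t, y - h)) / h ** 2
    return u_t + 0.5 * u_yy
```

For instance u(2, t, y) = y² − t and u(4, t, y) = y⁴ − 6ty² + 3t², matching the martingales listed in part (c), and the residual of the heat equation vanishes up to floating-point error.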
The collection of sub-MGs (equivalently, sup-MGs or MGs), is closed under the addition of S.P.-s (compare with Exercise 5.1.19).
Exercise 8.2.8. Suppose (Xt, Ft) and (Yt, Ft) are sub-MGs and t ↦ f(t) a non-decreasing, non-random function.
(a) Verify that (Xt + Yt, Ft) is a sub-MG and hence so is (Xt + f(t), Ft).
(b) Rewrite this, first for sup-MGs Xt and Yt, then in case of MGs.
With the same proof as in Proposition 5.1.22, you are next to verify that the collection of sub-MGs (and that of sup-MGs), is also closed under the application of a non-decreasing convex (concave, respectively), function (c.f. Example 5.1.23 for the most common choices of this function).
Exercise 8.2.9. Suppose the integrable S.P. {Xt, t ≥ 0} and convex function Φ : R ↦ R are such that E[|Φ(Xt)|] < ∞ for all t ≥ 0. Show that if (Xt, Ft) is a MG then (Φ(Xt), Ft) is a sub-MG and the same applies even when (Xt, Ft) is only a sub-MG, provided Φ(·) is also non-decreasing.
As you show next, the martingale Bayes rule of Exercise 5.5.16 applies also for a
positive, continuous time martingale (Zt , Ft , t 0).
Exercise 8.2.10. Suppose (Zt, Ft, t ≥ 0) is a (strictly) positive MG on (Ω, F, P), normalized so that EZ0 = 1. For each t > 0, let Pt = P|Ft and consider the equivalent probability measure Qt on (Ω, Ft) of Radon-Nikodym derivative dQt/dPt = Zt.
(a) Show that Qs = Qt|Fs for any s ∈ [0, t].
(b) Fixing u ≤ s ∈ [0, t] and Y ∈ L^1(Ω, Fs, Qt) show that Qt-a.s. (hence also P-a.s.), E_{Qt}[Y | Fu] = E[Y Zs | Fu]/Zu.
(c) Verify that if λ̃ > 0 and Nt is a Poisson process of rate λ > 0 then Zt = e^{(λ−λ̃)t} (λ̃/λ)^{Nt} is a strictly positive martingale with EZ0 = 1 and show that {Nt, t ∈ [0, T]} is a Poisson process of rate λ̃ under the measure QT, for any finite T.
Remark. Up to the re-parametrization θ = log(λ̃/λ), the martingale Zt of part (c) of the preceding exercise is of the form Zt = u0(t, Nt, θ) for u0(t, y, θ) = exp(θy − λt(e^θ − 1)). Building on it and following the line of reasoning of Exercise 8.2.7 yields the analogous collection of martingales for the Poisson process {Nt, t ≥ 0}. For example, here the functions uk(t, y, θ) on (t, y) ∈ R+ × Z+ solve the equation ut(t, y) + λ[u(t, y + 1) − u(t, y)] = 0, with Mt = u1(t, Nt, 0) being the compensated Poisson process of Example 8.2.5 while u2(t, Nt, 0) is the martingale Mt^2 − λt.
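The re-parametrization is a direct identity, which the sketch below checks on a few values of (t, n); the particular rates λ = 1 and λ̃ = 2.5 are arbitrary illustrative choices.

```python
import math

lam, lam_tilde = 1.0, 2.5           # illustrative rates
theta = math.log(lam_tilde / lam)   # theta = log(lam_tilde / lam)

def Z(t, n):
    """Z_t of Exercise 8.2.10(c), evaluated at N_t = n."""
    return math.exp((lam - lam_tilde) * t) * (lam_tilde / lam) ** n

def u0(t, y, th):
    """u_0(t, y, theta) = exp(theta*y - lam*t*(e^theta - 1))."""
    return math.exp(th * y - lam * t * (math.exp(th) - 1.0))

# The two expressions agree for every (t, n):
gap = max(abs(Z(t, n) - u0(t, n, theta))
          for t in (0.0, 0.5, 2.0) for n in range(6))
```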
Remark. While beyond our scope, we note in passing that in continuous time the martingale transform of Definition 5.1.27 is replaced by the stochastic integral Yt = ∫_0^t Vs dXs. This stochastic integral results with stochastic differential equations and is the main object of study of stochastic calculus (to which many texts are devoted, among them [KaS97]). In case Vs = Xs is the Wiener process Ws, the analog of Example 5.1.29 is Yt = ∫_0^t Ws dWs, which for the appropriate definition of the stochastic integral (due to Itô), is merely the martingale Yt = ½(Wt^2 − t). Indeed, Itô's stochastic integral is defined via martingale theory, at the cost of deviating from the standard integration by parts formula. The latter would have applied if the sample functions t ↦ Wt(ω) were differentiable w.p.1., which is definitely not the case (as we shall see in Section 9.3).
Exercise 8.2.11. Suppose the S.P. {Xt, t ≥ 0} is integrable and Ft-adapted. Show that if E[Xu] ≥ E[Xτ] for any u ≥ 0 and Ft-stopping time τ whose range τ(Ω) is a finite subset of [0, u], then (Xt, Ft, t ≥ 0) is a sub-MG.
Hint: Consider τ = sIA + uI_{A^c} with s ∈ [0, u] and A ∈ Fs.
We conclude this sub-section with the relations between continuous and discrete time (sub/super) martingales.
Example 8.2.12. Convince yourself that to any discrete time sub-MG (Yn, Gn, n ∈ Z+) corresponds the interpolated continuous time sub-MG (Xt, Ft, t ≥ 0) of the interpolated right-continuous filtration Ft = G_{[t]} and RCLL S.P. Xt = Y_{[t]} of Example 8.1.5.
Remark 8.2.13. In proving results about continuous time MGs (or sub-MGs/sup-MGs), we often rely on the converse of Example 8.2.12. Namely, for any non-random, non-decreasing sequence {s_k} ⊆ [0, ∞), if (X_t, F_t) is a continuous time MG (or sub-MG/sup-MG), then clearly (X_{s_k}, F_{s_k}, k ∈ Z_+) is a discrete time MG (or sub-MG/sup-MG, respectively), while (X_{s_k}, F_{s_k}, k ∈ Z_−) is a RMG (or reversed sub-MG/sup-MG, respectively), where s_0 ≥ s_{−1} ≥ ⋯ ≥ s_{−k} ≥ ⋯.
8.2.2. Inequalities and convergence. In this section we extend the tail inequalities and convergence properties of discrete time sub-MGs (or sup-MGs), to the corresponding results for sub-MGs (and sup-MGs) of right-continuous sample functions, which we call hereafter in short right-continuous sub-MGs (or sup-MGs).
We start with Doob's inequality (compare with Theorem 5.2.6).
Theorem 8.2.14 (Doob's inequality). If {X_s, s ≥ 0} is a right-continuous sub-MG, then for t ≥ 0 finite, M_t = sup_{0≤s≤t} X_s, and any x > 0,

(8.2.2)    x P(M_t ≥ x) ≤ E[X_t I_{M_t ≥ x}] ≤ E[(X_t)_+] .
sub-MG {X_{s_k}}. Applying Doob's inequality (5.2.1) for this sub-MG, we find that x P(M^(ℓ)(ω) ≥ x) ≤ E[X_t I_{M^(ℓ)(ω) ≥ x}] for

M^(ℓ)(ω) = max_{s ∈ Q_{t+}^{(2,ℓ)}} X_s(ω) ,

where as ℓ ↑ ∞,

M^(ℓ)(ω) ↑ M(ω) = sup_{s ∈ Q_t^{(2)} ∪ {t}} X_s(ω) ,
By integrating Doob's inequality (8.2.2) you bound the moments of the supremum of a right-continuous sub-MG over a compact time interval.
Corollary 8.2.16 (Lp maximal inequalities). With q = q(p) = p/(p − 1), for any p > 1, t ≥ 0 and a right-continuous sub-MG {X_s, s ≥ 0},

(8.2.4)    E[( sup_{0≤u≤t} X_u )_+^p] ≤ q^p E[(X_t)_+^p] ,

while for a right-continuous MG {Y_s, s ≥ 0},

(8.2.5)    E[( sup_{0≤u≤t} |Y_u| )^p] ≤ q^p E[|Y_t|^p] .
Proof. Adapting the proof of Corollary 5.2.13, the bound (8.2.4) is just the conclusion of part (b) of Lemma 1.4.31 for the non-negative variables X = (X_t)_+ and Y = (M_t)_+, with the left inequality in (8.2.2) providing its hypothesis. We are thus done, as the bound (8.2.5) is merely (8.2.4) in case of the non-negative sub-MG X_t = |Y_t|.
In case p = 1 we have the following extension of Exercise 5.2.15.
Exercise 8.2.17. Suppose {X_s, s ≥ 0} is a non-negative, right-continuous sub-MG. Show that for any t ≥ 0,

E[ sup_{0≤u≤t} X_u ] ≤ (1 − e^{−1})^{−1} {1 + E[X_t (log X_t)_+]} .
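The L^p maximal inequality is easy to probe numerically. The following sketch (my illustration, with an assumed discretization of 500 steps on [0, 1]) estimates both sides of the p = 2 case of (8.2.5), E[(sup_{u≤1}|W_u|)²] ≤ 4 E[W_1²] = 4, for the Brownian martingale:

```python
import numpy as np

# Monte Carlo check of Doob's L2 maximal inequality for Brownian motion:
# E[(sup_{u<=1} |W_u|)^2] <= 4 * E[W_1^2] = 4.  (Assumed grid: 500 steps.)
rng = np.random.default_rng(1)
paths, steps, t = 4000, 500, 1.0
dW = np.sqrt(t / steps) * rng.standard_normal((paths, steps))
W = np.cumsum(dW, axis=1)

lhs = np.mean(np.max(np.abs(W), axis=1) ** 2)  # E[(sup |W_u|)^2]
rhs = 4 * np.mean(W[:, -1] ** 2)               # 4 * E[W_1^2], about 4
print(lhs, rhs)  # lhs well below rhs
```

The observed gap reflects that the constant q^p = 4 is not attained by Brownian motion, though it is sharp over all L2-bounded martingales.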
a < b .
Recall Remark 8.2.13 that enumerating s_k ∈ Q_{t+}^{(2,ℓ)} in a non-decreasing order produces a discrete time sup-MG {X_{s_k}, k = 0, . . . , n} with s_0 = 0 and s_n = t, so this is merely Doob's up-crossing inequality (5.2.6).
Since Doob's maximal and up-crossing inequalities apply for any right-continuous sub-MG (and sup-MG), so do most convergence results we have deduced from them in Section 5.3. For completeness, we provide a short summary of these results (and briefly outline how to adapt their proofs), starting with Doob's a.s. convergence theorem.
Theorem 8.2.20 (Doob's convergence theorem). Suppose the right-continuous sup-MG {X_t, t ≥ 0} is such that sup_t {E[(X_t)_−]} < ∞. Then, X_t → X_∞ a.s. and E|X_∞| ≤ lim inf_{t→∞} E|X_t| is finite.
Proof. Let U_∞[a, b] = sup_{n∈Z_+} U_n[a, b]. Paralleling the proof of Theorem 5.3.2, in view of our assumption that sup_t {E[(X_t)_−]} is finite, it follows from Lemma 8.2.19 and monotone convergence that E(U_∞[a, b]) is finite for each b > a. Hence, w.p.1. the variables U_∞[a, b](ω) are finite for all a, b ∈ Q, a < b. By sample path right-continuity and diagonal selection, on the set

Γ_{a,b} = {ω : lim inf_{t→∞} X_t(ω) < a < b < lim sup_{t→∞} X_t(ω)} ,

it suffices to consider t ∈ Q^{(2)}, in which case X_{q_{2k−1}}(ω) < a < b < X_{q_{2k}}(ω) for some dyadic rationals q_k ↑ ∞, hence U_∞[a, b](ω) = ∞. Consequently, the a.s. convergence of X_t to X_∞ follows as in the proof of Lemma 5.3.1. Finally, the stated bound on E|X_∞| is then derived exactly as in the proof of Theorem 5.3.2.
Remark. Similarly to Exercise 5.3.3, for a right-continuous sub-MG {X_t} the finiteness of sup_t E|X_t|, of sup_t E[(X_t)_+] and of lim inf_{t→∞} E|X_t| are equivalent to each other and to the existence of a finite limit for E|X_t| (or equivalently, for lim_{t→∞} E[(X_t)_+]), each of which further implies that X_t → X_∞ a.s. with X_∞ integrable. Replacing (X_t)_+ by (X_t)_−, the same applies for sup-MGs. In particular, any non-negative, right-continuous sup-MG {X_t, t ≥ 0} converges a.s. to integrable X_∞ such that EX_∞ ≤ EX_0.
Note that Doob's convergence theorem does not apply for the Wiener process {W_t, t ≥ 0} (as E[(W_t)_+] = √(t/(2π)) is unbounded). Indeed, as we see in Exercise 8.2.35, almost surely, lim sup_{t→∞} W_t = ∞ and lim inf_{t→∞} W_t = −∞. That is, the magnitude of oscillations of the Brownian sample path grows indefinitely.
In contrast, Doob's convergence theorem allows you to extend Doob's inequality (8.2.2) to the maximal value of a U.I. right-continuous sub-MG over all t ≥ 0.
Exercise 8.2.21. Let M_∞ = sup_{s≥0} X_s for a U.I. right-continuous sub-MG {X_t, t ≥ 0}. Show that X_t → X_∞ a.s. with X_∞ integrable, and for any x > 0,

(8.2.7)    x P(M_∞ ≥ x) ≤ E[X_∞ I_{M_∞ ≥ x}] ≤ E[(X_∞)_+] .

Hint: Start with (8.2.3) and adapt the proof of Corollary 5.3.4.
The following integrability condition is closely related to L1 convergence of right-continuous sub-MGs (and sup-MGs).
Definition 8.2.22. We say that a sub-MG (X_t, F_t, t ≥ 0) is right closable, or has a last element (X_∞, F_∞), if F_t ⊆ F_∞ ⊆ F and X_∞ ∈ L1(Ω, F_∞, P) is such that for any t ≥ 0, almost surely E[X_∞|F_t] ≥ X_t. A similar definition applies for a sup-MG, but with E[X_∞|F_t] ≤ X_t, and for a MG, in which case we require that E[X_∞|F_t] = X_t, namely, that {X_t} is a Doob's martingale of X_∞ with respect to {F_t} (see Definition 5.3.13).
Building upon Doob's convergence theorem, we extend Theorem 5.3.12 and Corollary 5.3.14, showing that for right-continuous MGs the properties of having a last element, uniform integrability and L1 convergence, are equivalent to each other.
Proposition 8.2.23. The following conditions are equivalent for a right-continuous non-negative sub-MG {X_t, t ≥ 0}:
(a) {X_t} is U.I.;
(b) X_t → X_∞ in L1;
(c) X_t → X_∞ a.s., a last element of {X_t}.
Further, even without non-negativity (a) ⇔ (b) ⇒ (c), and a right-continuous MG has any, hence all, of these properties, if and only if it is a Doob martingale.
also implies such convergence in probability. Either way, recall Vitali's convergence theorem (i.e. Theorem 1.3.49), that U.I. is equivalent to L1 convergence when X_t → X_∞ in probability. We thus deduce the equivalence of (a) and (b) for right-continuous sub-MGs, where either (a) or (b) yields the corresponding a.s. convergence.
(a) and (b) yield a last element: With X_∞ denoting the a.s. and L1 limit of the U.I. collection {X_t}, it is left to show that E[X_∞|F_s] ≥ X_s for any s ≥ 0. Fixing t > s and A ∈ F_s, by the definition of sub-MG we have E[X_t I_A] ≥ E[X_s I_A]. Further, E[X_t I_A] → E[X_∞ I_A] (recall part (c) of Exercise 1.3.55). Consequently, E[X_∞ I_A] ≥ E[X_s I_A] for all A ∈ F_s. That is, E[X_∞|F_s] ≥ X_s.
Last element and non-negative ⇒ (a): Since X_t ≥ 0 and EX_t ≤ EX_∞ finite, it follows that for any finite t ≥ 0 and M > 0, by Markov's inequality P(X_t > M) ≤ M^{−1} EX_t ≤ M^{−1} EX_∞ → 0 as M → ∞. It then follows that E[X_∞ I_{X_t > M}] converges to zero as M → ∞, uniformly in t (recall part (b) of Exercise 1.3.43). Further, by definition of the last element we have that E[X_t I_{X_t > M}] ≤ E[X_∞ I_{X_t > M}]. Therefore, E[X_t I_{X_t > M}] also converges to zero as M → ∞, uniformly in t, i.e. {X_t} is U.I.
Equivalence for MGs: For a right-continuous MG the equivalent properties (a) and (b) imply the a.s. convergence to X_∞ such that for any fixed t ≥ 0, a.s. X_t ≤ E[X_∞|F_t]. Applying this also for the right-continuous MG {−X_t} we deduce that X_∞ is a last element of the Doob's martingale X_t = E[X_∞|F_t]. To complete the proof recall that any Doob's martingale is U.I. (see Proposition 4.2.33).
Finally, paralleling the proof of Proposition 5.3.22, upon combining Doob's convergence theorem 8.2.20 with Doob's Lp maximal inequality (8.2.5) we arrive at Doob's Lp MG convergence.
Proposition 8.2.24 (Doob's Lp martingale convergence). If a right-continuous MG {X_t, t ≥ 0} is Lp-bounded for some p > 1, then X_t → X_∞ a.s. and in Lp (in particular, ‖X_t‖_p ↑ ‖X_∞‖_p).
Throughout we rely on right continuity of the sample functions to control the tails of continuous time sub/sup-MGs and thereby deduce convergence properties. Of course, the interpolated MGs of Example 8.2.12 and the MGs derived in Exercise 8.2.7 out of the Wiener process are right-continuous. More generally, as shown next, for any MG the right-continuity of the filtration translates (after a modification) into RCLL sample functions, and only a little more is required for an RCLL modification in case of a sup-MG (or a sub-MG).
Theorem 8.2.25. Suppose (X_t, F_t, t ≥ 0) is a sup-MG with right-continuous filtration {F_t, t ≥ 0} and t ↦ EX_t is right-continuous. Then, there exists an RCLL modification {X̃_t, t ≥ 0} of {X_t, t ≥ 0} such that (X̃_t, F_t, t ≥ 0) is a sup-MG.
Hence, M_n^− is a.s. finite. Starting with Doob's second inequality (5.2.3) for the sub-MG {−X_t}, by the same reasoning P(M_n^+ > y) ≤ y^{−1}(E[(X_n)_−] + E[X_0]) for all y > 0. Thus, M_n^+ is also a.s. finite and as claimed P(Γ) = 0.
Step 2. Recall that our convention, as in Remark 8.1.3, implies that the P-null event Γ ∈ F_0. It then follows by the F_t-adaptedness of {X_t} and the preceding construction of X_{t+}, that {X̃_t} is F_{t+}-adapted, namely F_t-adapted (by the assumed right-continuity of {F_t, t ≥ 0}). Clearly, our construction of X_{t+} yields right-continuous sample functions t ↦ X̃_t(ω). Further, a re-run of part of Step 1 yields the RCLL property, by showing that for any ω ∈ Γ^c the sample function t ↦ X_{t+}(ω) has finite left limits at each t > 0. Indeed, otherwise there exist a, b ∈ Q, b > a and s_k ↑ t such that X_{s_{2k}+}(ω) < a < b < X_{s_{2k−1}+}(ω). By construction of X_{t+} this implies the existence of q_k ∈ Q_n^{(2)} such that q_k ↑ t and X_{q_{2k−1}}(ω) < a < b < X_{q_{2k}}(ω). Consequently, in this case U_n[a, b](ω) = ∞, in contradiction with ω ∈ Γ^c.
Step 3. Fixing s ≥ 0, we show that X_{s+} = X_s for a.e. ω ∉ Γ, hence the F_t-adapted S.P. {X̃_t, t ≥ 0} is a modification of the sup-MG (X_t, F_t, t ≥ 0) and as such, (X̃_t, F_t, t ≥ 0) is also a sup-MG. Turning to show that X_{s+} = X_s a.s., fix non-random dyadic rationals q_k ↓ s as k → ∞ and recall Remark 8.2.13 that (X_{q_k}, F_{q_k}, k ∈ Z_−) is a reversed sup-MG. Further, from the sup-MG property, for any A ∈ F_s,

sup_k E[X_{q_k} I_A] ≤ E[X_s I_A] < ∞ .
In case θ is an F_t-stopping time, note that by the tower property (and taking out the known I_A), also E[(Z_θ − X_θ)I_A] ≥ 0 for Z_θ = E[Z_{θ+}|F_θ] = E[X_∞|F_θ] and all A ∈ F_θ. Here, as noted before, we further have that X_θ ∈ mF_θ and consequently, in this case, a.s. Z_θ ≥ X_θ as well. Finally, if (X_t, F_t, t ≥ 0) is further a MG, combine the statement of the corollary for sub-MGs (X_t, F_t) and (−X_t, F_t) to find that a.s. X_θ = Z_{θ+} (and X_θ = Z_θ for an F_t-stopping time θ).
Remark 8.2.28. We refer hereafter to both Theorem 8.2.26 and its refinement in Corollary 8.2.27 as Doob's optional stopping. Clearly, both apply if the right-continuous sub-MG (X_t, F_t, t ≥ 0) is such that a.s. E[Y|F_t] ≥ X_t for some integrable R.V. Y and each t ≥ 0 (for by the tower property, such a sub-MG has the last element X_∞ = E[Y|F_∞]). Further, note that if θ is a bounded F_t-Markov time, namely θ ∈ [0, T] for some non-random finite T, then you dispense with the requirement of a last element by considering these results for Y_t = X_{t∧T} (whose last element Y_∞ is the integrable X_T ∈ mF_T ⊆ mF_∞ and where Y_θ = X_θ, Y_τ = X_τ). As this applies whenever both θ and τ are non-random, we deduce from Corollary 8.2.27 that if X_t is a right-continuous sub-MG (or MG) for some filtration {F_t}, then it is also a right-continuous sub-MG (or MG, respectively), for the corresponding filtration {F_{t+}}.
The latter observation leads to the following result about the stopped continuous time sub-MG (compare to Theorem 5.1.32).
Corollary 8.2.29. If τ is an F_t-stopping time and (X_t, F_t, t ≥ 0) is a right-continuous sub-MG (or sup-MG or a MG), then X_{t∧τ} = X_{t∧τ(ω)}(ω) is also a right-continuous sub-MG (or sup-MG or MG, respectively), for this filtration.
Proof. Recall part (b) of Exercise 8.1.10, that τ ∧ u is a bounded F_t-stopping time for each u ∈ [0, ∞). Further, fixing s ≤ u, note that for any A ∈ F_s,

θ = (s ∧ τ)I_A + (u ∧ τ)I_{A^c}

is an F_t-stopping time (as {θ ≤ t} ∈ F_t for all t ≥ 0). In view of Remark 8.2.28 we thus deduce, upon applying Theorem 8.2.26, that E[I_A X_{u∧τ}] ≥ E[I_A X_{s∧τ}] for all A ∈ F_s. From this we conclude that the sub-MG condition E[X_{u∧τ}|F_s] ≥ X_{s∧τ} holds a.s., whereas the right-continuity of t ↦ X_{t∧τ} is an immediate consequence of the right-continuity of t ↦ X_t.
In the discrete time setting we have derived Theorem 5.4.1 also for U.I. {X_{n∧τ}} and mostly used it in this form (see Remark 5.4.2). Similarly, you now prove Doob's optional stopping theorem for a right-continuous sub-MG (X_t, F_t, t ≥ 0) and F_t-stopping time τ such that {X_{t∧τ}} is U.I.
Exercise 8.2.30. Suppose (X_t, F_t, t ≥ 0) is a right-continuous sub-MG.
(a) Fixing finite, non-random u ≥ 0, show that for any F_t-stopping times θ ≤ τ, a.s. E[X_{u∧τ}|F_θ] ≥ X_{u∧θ} (with equality in case of a MG).
Hint: Apply Corollary 8.2.27 for the stopped sub-MG (X_{t∧u}, F_t, t ≥ 0).
(b) Show that if (X_{u∧τ}, u ≥ 0) is U.I. then further X_τ and X_θ (defined as lim sup_{t→∞} X_t in case τ = ∞), are integrable and E[X_τ|F_θ] ≥ X_θ a.s. (again with equality for a MG).
Hint: Show that Y_u = X_{u∧τ} has a last element.
Relying on Corollary 8.2.27 you can now also extend Corollary 5.4.5.
Exercise 8.2.31. Suppose (X_t, F_t, t ≥ 0) is a right-continuous sub-MG and {τ_k} is a non-decreasing sequence of F_t-stopping times. Show that if (X_t, F_t, t ≥ 0) has a last element or sup_k τ_k ≤ T for some non-random finite T, then (X_{τ_k}, F_{τ_k}, k ∈ Z_+) is a discrete time sub-MG.
Next, restarting a right-continuous sub-MG at a stopping time yields another sub-MG and an interesting formula for the distribution of the supremum of certain non-negative MGs.
Exercise 8.2.32. Suppose (X_t, F_t, t ≥ 0) is a right-continuous sub-MG and that τ is a bounded F_t-stopping time.
Exercise 8.2.34. Using Doob's optional stopping theorem re-derive Doob's inequality. Namely, show that for t, x > 0 and a right-continuous sub-MG {X_s, s ≥ 0},

P( sup_{0≤s≤t} X_s > x) ≤ x^{−1} E[(X_t)_+] .

Hint: Consider the sub-MG ((X_{u∧t})_+, F_u^X), the F_u^X-Markov time τ = inf{s ≥ 0 : X_s > x}, and θ = τ ∧ t.
We conclude this sub-section with concrete applications of Doob's optional stopping theorem in the context of first hitting times for the Wiener process (W_t, t ≥ 0) of Definition 7.3.12.
Exercise 8.2.36. Fixing r ∈ R and positive constants a, b, consider the S.P. Z_t^{(r)} = W_t + rt, the first hitting time τ_b^{(r)} = inf{t ≥ 0 : Z_t^{(r)} = b} of level b, and the first exit time τ_{a,b}^{(r)} = inf{t ≥ 0 : Z_t^{(r)} ∉ (−a, b)} of an interval (−a, b).
(a) Check that τ_{a,b}^{(r)} is an a.s. finite F_t^W-stopping time and show that for any r ≠ 0,

P(Z^{(r)}_{τ^{(r)}_{a,b}} = −a) = 1 − P(Z^{(r)}_{τ^{(r)}_{a,b}} = b) = (1 − e^{−2rb}) / (e^{2ra} − e^{−2rb}) .

(b) Show that for any s > 0,

E(e^{−s τ^{(0)}_{a,b}}) = [sinh(a√(2s)) + sinh(b√(2s))] / sinh((a + b)√(2s)) .

Hint: Stop the MGs u_0(t, W_t, ±√(2s)) of Exercise 8.2.7 at τ_{a,b} = τ^{(0)}_{a,b}.
(c) Deduce that Eτ_{a,b} = ab and Var(τ_{a,b}) = (ab/3)(a² + b²).
Hint: Recall part (b) of Exercise 3.2.40.
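The exit probability formula of part (a) can be probed by simulation. The sketch below (my illustration, with assumed parameters r = 0.5, a = b = 1 and Euler step 10^{-3}; the small discretization bias from excursions between grid points is ignored) compares the empirical frequency of exiting at −a with the closed form:

```python
import numpy as np

# Monte Carlo sketch of the exit probability of Z_t = W_t + r t from (-a, b):
# P(exit at -a) = (1 - exp(-2rb)) / (exp(2ra) - exp(-2rb)).
rng = np.random.default_rng(7)
n_paths, dt, r, a, b = 2000, 1e-3, 0.5, 1.0, 1.0

Z = np.zeros(n_paths)
done = np.zeros(n_paths, dtype=bool)
hit_low = np.zeros(n_paths, dtype=bool)
for _ in range(int(20 / dt)):            # cap the horizon at t = 20
    active = ~done
    if not active.any():
        break
    Z[active] += r * dt + np.sqrt(dt) * rng.standard_normal(active.sum())
    low, high = active & (Z <= -a), active & (Z >= b)
    hit_low |= low
    done |= low | high

p_mc = hit_low.mean()
p_exact = (1 - np.exp(-2 * r * b)) / (np.exp(2 * r * a) - np.exp(-2 * r * b))
print(p_mc, p_exact)  # the two should be close
```

Refining dt and increasing the number of paths tightens the agreement, since the hitting-time overshoot bias is of order √dt.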
Here is a related result about first hitting time of spheres by a standard d-dimensional Brownian motion.
Definition 8.2.37. The standard d-dimensional Brownian motion is the R^d-valued S.P. {W(t), t ≥ 0} such that W(t) = (W_1(t), . . . , W_d(t)) with {W_i(t), t ≥ 0}, i = 1, 2, . . . , d mutually independent, standard (one-dimensional) Wiener processes. It is clearly a MG and a centered R^d-valued Gaussian S.P. of continuous sample functions and stationary, independent increments.
Exercise 8.2.38. Let F_t^W = σ(W(s), s ≤ t) denote the canonical filtration of a standard k-dimensional Brownian motion, R_t = ‖W(t)‖_2 its Euclidean distance from the origin and τ_b = inf{t ≥ 0 : R_t ≥ b} the corresponding first hitting time of a sphere of radius b > 0 centered at the origin.
(a) Show that M_t = R_t² − kt is an F_t^W-martingale of continuous sample functions and that τ_b is an a.s. finite F_t^W-stopping time.
(b) Deduce that E[τ_b] = b²/k.
Remark. The S.P. {R_t, t ≥ 0} of the preceding exercise is called the Bessel process with dimension k. Though we shall not do so, it can be shown that the S.P. B_t = R_t − ν ∫_0^t R_s^{−1} ds is well-defined and in fact is a standard Wiener process (c.f. [KaS97, Proposition 3.3.21]), with ν = (k − 1)/2 the corresponding index of the Bessel process. The Bessel process is thus defined for all ν ≥ 1/2 (and starting at R_0 = r > 0, also for 0 < ν < 1/2). One can then further show that if R_0 = r > 0 then P_r(inf_{t≥0} R_t > 0) = I_{ν>1/2} (hence the k-dimensional Brownian motion is O-transient for k ≥ 3, see Definition 6.3.21), and P_r(R_t > 0, for all t ≥ 0) = 1 even for the critical case of ν = 1/2 (so by translation, for any given point z ∈ R², the two-dimensional Brownian path, starting at any position other than z, w.p.1. enters every disc of positive radius centered at z but never reaches the point z).
8.2.4. Doob-Meyer decomposition and square-integrable martingales. In this section we study the structure of square-integrable martingales and in particular the roles of the corresponding predictable compensator and quadratic variation. In doing so, we fix throughout the probability space (Ω, F, P) and a right-continuous filtration {F_t} on it, augmented so that every P-null set is in F_0 (see Remark 8.1.3).
Namely, for a finite partition π = {a = t_0 < t_1 < ⋯ < t_k = b} of [a, b], let

V^{(q)}_{(π)}(f) = Σ_{i=1}^k |f(t_i) − f(t_{i−1})|^q

denote the q-th variation of the function f(·) on the partition π. The q-th variation of f(·) on [a, b] is then the [0, ∞]-valued

(8.2.9)    V^{(q)}(f) = lim_{‖π‖→0} V^{(q)}_{(π)}(f) ,

provided such limit exists (namely, the same R-valued limit exists along each sequence {π_n, n ≥ 1} such that ‖π_n‖ → 0). Similarly, the q-th variation on [a, b] of a S.P. {X_t, t ≥ 0}, denoted V^{(q)}(X), is the limit in probability of V^{(q)}_{(π)}(X_·(ω)) per (8.2.9), if such a limit exists, and when this occurs for any compact interval [0, t], we have the q-th variation, denoted V^{(q)}(X)_t, as a stochastic process with non-negative, non-decreasing sample functions, such that V^{(q)}(X)_0 = 0.
Remark. As you are soon to find out, of most relevance here is the case of q-th variation for q = 2, which is also called the quadratic variation. Note also that V^{(1)}_{(π)}(f) is bounded above by the total variation of the function f, namely V(f) = sup_π V^{(1)}_{(π)}(f) (which induces a norm on the linear subspace of functions of finite total variation, see also the related Definition 3.2.22 of total variation norm for finite signed measures). Further, as you show next, if V^{(1)}(f) exists then it equals V(f) (but beware that V^{(1)}(f) may not exist, for example, in case f(t) = 1_Q(t)).
Exercise 8.2.42.
From the preceding exercise we see that any increasing process A_t has finite total variation, with V(A)_t = V^{(1)}(A)_t = A_t for all t. This is certainly not the case for non-constant continuous martingales, as shown in the next lemma (which is also key to the uniqueness of the Doob-Meyer decomposition for sub-MGs of continuous sample path).
Lemma 8.2.43. A martingale M_t of continuous sample functions and finite total variation on each compact interval, is indistinguishable from a constant.
Remark. Sample path continuity is necessary here, for in its absence we have the compensated Poisson process M_t = N_t − λt which is a martingale (see Example 8.2.5), of finite total variation on compact intervals (since V(M)_t ≤ V(N)_t + V(λ·)_t = N_t + λt by part (a) of Exercise 8.2.42).
Proof. Considering the martingale M̃_t = M_t − M_0 such that V(M̃)_t = V(M)_t for all t, we may and shall assume hereafter that M_0 = 0. Suppose first that V(M)_t ≤ K is bounded, uniformly in t and ω, by a non-random finite constant. In particular, |M_t| ≤ K for all t ≥ 0 and fixing a finite partition π = {0 = s_0 < s_1 < ⋯ < s_k = t}, the discrete time martingale M_{s_i} is square integrable and as shown in part (b) of Exercise 5.1.8,

E[M_t²] = E[ Σ_{i=1}^k (M_{s_i} − M_{s_{i−1}})² ] ≤ E[ D_π V(M)_t ] ,    D_π = max_{1≤i≤k} |M_{s_i} − M_{s_{i−1}}| .
Taking expectation on both sides we deduce in view of the preceding identity that E[M_t²] ≤ K E[D_π], where 0 ≤ D_π ≤ V(M)_t ≤ K for all finite partitions π of [0, t]. Further, by the uniform continuity of t ↦ M_t(ω) on [0, t] we have that D_π(ω) → 0 when ‖π‖ → 0, hence E[D_π] → 0 as ‖π‖ → 0 and consequently E[M_t²] = 0.
We have thus shown that if the continuous martingale M_t is such that sup_t V(M)_t is bounded by a non-random constant, then M_t(ω) = 0 for any t ≥ 0 and a.e. ω. To deal with the general case, recall Remark 8.2.28 that (M_t, F^M_{t+}, t ≥ 0) is a continuous martingale, hence by Corollary 8.2.29 and part (d) of Exercise 8.2.42 so is (M_{t∧τ_n}, F^M_{t+}, t ≥ 0), where τ_n = inf{t ≥ 0 : V(M)_t ≥ n} are non-decreasing and V(M)_{t∧τ_n} ≤ n for all n and t. Consequently, for any t ≥ 0, w.p.1. M_{t∧τ_n} = 0 for n = 1, 2, . . .. The assumed finiteness of V(M)_t(ω) implies that τ_n ↑ ∞, hence M_{t∧τ_n} → M_t as n → ∞, resulting with M_t(ω) = 0 for a.e. ω. Finally, by the continuity of t ↦ M_t(ω), the martingale M must then be indistinguishable from the zero stochastic process (see Exercise 7.2.3).
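The identity from part (b) of Exercise 5.1.8 invoked at the start of this proof is just orthogonality of martingale increments. As a reminder, a sketch of the standard computation (my addition, with the cited exercise's setting assumed):

```latex
% For a square-integrable martingale (M_{s_i})_{i \le k} with M_{s_0} = 0:
% whenever i < j, conditioning on \mathcal{F}_{s_{j-1}} kills the cross term,
% \mathbb{E}\big[(M_{s_i}-M_{s_{i-1}})(M_{s_j}-M_{s_{j-1}})\big]
%   = \mathbb{E}\big[(M_{s_i}-M_{s_{i-1}})\,
%       \mathbb{E}(M_{s_j}-M_{s_{j-1}} \mid \mathcal{F}_{s_{j-1}})\big] = 0 ,
\mathbb{E}[M_t^2]
  = \mathbb{E}\Big[\Big(\sum_{i=1}^{k}(M_{s_i}-M_{s_{i-1}})\Big)^{2}\Big]
  = \sum_{i=1}^{k}\mathbb{E}\big[(M_{s_i}-M_{s_{i-1}})^{2}\big] .
```

Expanding the square and applying the tower property to each cross term gives the middle equality.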
Considering a bounded, continuous martingale X_t, the next lemma allows us to conclude in the sequel that V^{(2)}_{(π)}(X) converges in L2 as ‖π‖ → 0 and its limit can be set to be an increasing process.
Lemma 8.2.44. Suppose X ∈ M_2^c. For any partition π = {0 = s_0 < s_1 < ⋯} of [0, ∞) with a finite number of points on each compact interval, the S.P. M_t^{(π)} = X_t² − V_t^{(π)}(X) is an F_t-martingale of continuous sample path, where

(8.2.10)    V_t^{(π)}(X) = Σ_{i=1}^k (X_{s_i} − X_{s_{i−1}})² + (X_t − X_{s_k})² ,    t ∈ [s_k, s_{k+1}) .

If in addition sup_t |X_t| ≤ K for some finite, non-random constant K, then V^{(2)}_{(π_n)}(X) is a Cauchy sequence in L2(Ω, F, P) for any fixed b and finite partitions π_n of [0, b] such that ‖π_n‖ → 0.
Proof. (a). For s_k ≤ s < t < s_{k+1} we have M_t^{(π)} − M_s^{(π)} = 2X_{s_k}(X_t − X_s), and clearly then E[M_t^{(π)} − M_s^{(π)}|F_s] = 2X_{s_k} E[X_t − X_s|F_s] = 0, by the martingale property of (X_t, F_t), which suffices for verifying that {M_t^{(π)}, F_t, t ≥ 0} is a martingale.
(b). Utilizing these martingales, we now turn to prove the second claim of the lemma. To this end, fix two finite partitions π and π′ of [0, b] and let π̂ denote their common refinement. Setting U_t = V_t^{(π)}(X), U′_t = V_t^{(π′)}(X) and Z_t = U_t − U′_t = M_t^{(π′)} − M_t^{(π)}, the latter is a continuous martingale, so by part (b) of Exercise 5.1.8 (over the partition π̂), E[Z_b²] = E[V_b^{(π̂)}(Z)]. Next, recall (8.2.10) that V_b^{(π̂)}(Z) is a finite sum of terms of the form (U_u − U_s − U′_u + U′_s)² ≤ 2(U_u − U_s)² + 2(U′_u − U′_s)². Consequently, V_b^{(π̂)}(Z) ≤ 2V_b^{(π̂)}(U) + 2V_b^{(π̂)}(U′), and it suffices to show that E[V_b^{(π̂)}(U)] → 0 as ‖π‖ → 0 (and likewise for U′), in order to conclude that V^{(2)}_{(π_n)}(X) is a Cauchy sequence in L2(Ω, F, P) for any finite partitions π_n of [0, b] with ‖π_n‖ → 0. Now, for consecutive t_j, t_{j+1} ∈ π̂ within [s_i, s_{i+1}), s_i ∈ π, we have from (8.2.10) that

U_{t_{j+1}} − U_{t_j} = (X_{t_{j+1}} − X_{s_i})² − (X_{t_j} − X_{s_i})² = (X_{t_{j+1}} − X_{t_j})(X_{t_{j+1}} + X_{t_j} − 2X_{s_i}) .

Since t_{j+1} − s_i ≤ ‖π‖, this implies in turn that

V_b^{(π̂)}(U) ≤ 4 V_b^{(π̂)}(X) [osc_{‖π‖}(X)]² ,

where osc_δ(X) = sup{|X_u − X_v| : u, v ∈ [0, b], |u − v| ≤ δ}, so by Cauchy-Schwarz, E[V_b^{(π̂)}(U)] ≤ 4 E[(V_b^{(π̂)}(X))²]^{1/2} E[(osc_{‖π‖}(X))⁴]^{1/2}. The random variables osc_δ(X) are uniformly (in δ and ω) bounded (by 2K) and converge to zero as δ ↓ 0 (in view of the uniform continuity of t ↦ X_t on [0, b]). Thus, by bounded convergence the right-most expectation in the preceding inequality goes to zero as ‖π‖ → 0. To complete the proof simply note that V_b^{(π̂)}(X) is of the form Σ_j D_j² for the differences D_j = X_{t_j} − X_{t_{j−1}} of the uniformly bounded discrete time martingale {X_{t_j}}, hence E[(V_b^{(π̂)}(X))²] ≤ 6K⁴ by part (c) of Exercise 5.1.8.
Building on the preceding lemma, the following decomposition is an important special case of the more general Doob-Meyer decomposition and a key ingredient in the theory of stochastic integration.
Theorem 8.2.45. For X ∈ M_2^c, the continuous modification of V^{(2)}(X)_t is the unique F_t-increasing process A_t = ⟨X⟩_t of continuous sample functions, such that M_t = X_t² − A_t is an F_t-martingale (also of continuous sample functions), and any two such decompositions of X_t² as the sum of a martingale and increasing process are indistinguishable.
Proof. Step 1. Uniqueness. If X_t² = M_t + A_t = N_t + B_t with A_t, B_t increasing processes of continuous sample paths and M_t, N_t martingales, then Y_t = N_t − M_t = A_t − B_t is a martingale of continuous sample paths, starting at Y_0 = A_0 − B_0 = 0, such that V(Y)_t ≤ V(A)_t + V(B)_t = A_t + B_t is finite for any t finite. From Lemma 8.2.43 we then deduce that w.p.1. Y_t = 0 for all t ≥ 0 (i.e. {A_t} is indistinguishable from {B_t}), proving the stated uniqueness of the decomposition.
Step 2. Existence of V^{(2)}(X)_t when X is uniformly bounded.
Turning to construct such a decomposition, assume first that X ∈ M_2^c is uniformly (in t and ω) bounded by a non-random finite constant. Let V_ℓ(t) = V_t^{(π_ℓ)}(X) of (8.2.10) for the partitions π_ℓ of [0, ∞) whose elements are the dyadic Q_∞^{(2,ℓ)} = {k2^{−ℓ}, k ∈ Z_+}. By definition, V_ℓ(t) = V^{(2)}_{(π)}(X) for the partitions π of [0, t] whose elements are the finite collections Q_{t+}^{(2,ℓ)} of dyadic from [0, t] augmented by {t}. Since ‖π_{ℓ+1}‖ ≤ ‖π_ℓ‖ = 2^{−ℓ}, we deduce from Lemma 8.2.44 that per t ≥ 0 fixed, {V_ℓ(t), ℓ ≥ 1} is a Cauchy sequence in L2(Ω, F, P). Recall Proposition 4.3.7 that
Σ_{j=1}^∞ 2^{−j} E[1 ∧ ‖V_ℓ − A‖_j] → 0

as ℓ → ∞, where ‖f‖_j denotes the supremum of |f(t)| over t ∈ [0, j]. Moreover, if q < q′ ∈ Q then for all k large enough q, q′ ∈ π_{n_k}, implying that V_{n_k}(q) ≤ V_{n_k}(q′). Taking k → ∞ it follows that A_q(ω) ≤ A_{q′}(ω) for all ω, thus by sample path continuity, A_t is an F_t-increasing process.
Finally, since the F_t-martingales M_{ℓ,t} converge in L1 for ℓ → ∞ (and t ≥ 0 fixed), to the F_t-adapted process M_t = X_t² − A_t, it is easy to check that {M_t, F_t, t ≥ 0} is a martingale.
Step 5. Localization. Having established the stated decomposition in case X ∈ M_2^c is uniformly bounded by a non-random constant, we remove the latter condition
by localizing via the stopping times τ_r = inf{t ≥ 0 : |X_t| ≥ r} for positive integers r. Indeed, note that since X_t(ω) is bounded on any compact time interval, τ_r ↑ ∞ when r → ∞ (for each ω). Further, with X_t^{(r)} = X_{t∧τ_r} a uniformly bounded (by r), continuous martingale (see Corollary 8.2.29), by the preceding proof we have F_t-increasing processes A_t^{(r)}, each of which is the continuous modification of the quadratic variation V^{(2)}(X^{(r)})_t, such that M_t^{(r)} = (X_t^{(r)})² − A_t^{(r)} are continuous F_t-martingales. Since E[ρ(V^{(π_ℓ)}(X^{(r)}), A^{(r)})] → 0 for ℓ → ∞ and each positive integer r, in view of Theorem 2.2.10 we get by diagonal selection the existence of a non-random sub-sequence n_k ↑ ∞ and a P-null set N such that ρ(V^{(π_{n_k})}(X^{(r)}), A^{(r)})(ω) → 0 for k → ∞, all r and ω ∉ N. From (8.2.10) we note that V_t^{(π)}(X^{(r)}) = V_{t∧τ_r}^{(π)}(X) for any t, π, r and ω. Consequently, if ω ∉ N then A_t^{(r)}(ω) = A_{t∧τ_r}^{(r′)}(ω) for any r′ ≥ r, so setting A_t(ω) = A_t^{(r)}(ω) whenever t ≤ τ_r(ω) consistently defines an F_t-increasing process {A_t} of continuous sample functions, with V^{(2)}_{(π)}(X)_{t∧τ_r} → A_{t∧τ_r} in probability as ‖π‖ → 0. Hence, considering r → ∞ we deduce that V^{(2)}_{(π)}(X) → A_t in probability. That is, the process {A_t} is a modification of the quadratic variation of {X_t}.
We complete the proof by verifying that the integrable, F_t-adapted process M_t = X_t² − A_t of continuous sample functions satisfies the martingale condition. Indeed, since M_t^{(r)} are F_t-martingales, we have for each s ≤ u and all r that w.p.1

E[X²_{u∧τ_r} | F_s] = E[A_u^{(r)} | F_s] + M_s^{(r)} .

Considering r → ∞ we have already seen that X²_{u∧τ_r} → X_u² and a.s. A_u^{(r)} ↑ A_u, hence also M_s^{(r)} → M_s a.s. With sup_r {X²_{u∧τ_r}} integrable, we get by dominated convergence of C.E. that E[X²_{u∧τ_r}|F_s] → E[X_u²|F_s] (see Theorem 4.2.26). Similarly, E[A_u^{(r)}|F_s] ↑ E[A_u|F_s] by monotone convergence of C.E., hence w.p.1 E[X_u²|F_s] = E[A_u|F_s] + M_s for each s ≤ u, namely, (M_t, F_t) is a martingale.
The following exercise shows that X ∈ M_2^c has zero q-th variation for all q > 2. Moreover, unless X_t ∈ M_2^c is zero throughout an interval of positive length, its q-th variation for 0 < q < 2 is infinite with positive probability and its sample path are then not locally γ-Hölder continuous for any γ > 1/2.
Exercise 8.2.46.
(a) Suppose S.P. {X_t, t ≥ 0} of continuous sample functions has an a.s. finite r-th variation V^{(r)}(X)_t for each fixed t > 0. Show that then for each t > 0 and q > r a.s. V^{(q)}(X)_t = 0, whereas if 0 < q < r, then V^{(q)}(X)_t = ∞ for a.e. ω for which V^{(r)}(X)_t > 0.
(b) Show that if X ∈ M_2^c and Ã_t is a S.P. of continuous sample path and finite total variation on compact intervals, then the quadratic variation of X_t + Ã_t is ⟨X⟩_t.
(c) Suppose X ∈ M_2^c and F_t-stopping time τ are such that ⟨X⟩_τ = 0. Show that P(X_{t∧τ} = 0 for all t ≥ 0) = 1.
(d) Show that if a S.P. {X_t, t ≥ 0} is locally γ-Hölder continuous on [0, T] for some γ > 1/2, then its quadratic variation on this interval is zero.
Remark. You may have noticed that so far we did not need the assumed right-continuity of F_t. In contrast, the latter assumption plays a key role in our proof of the more general Doob-Meyer decomposition, which is to follow next.
We start by stating the necessary and sufficient condition under which a sub-MG has a Doob-Meyer decomposition, namely, it is the sum of a martingale and increasing part.
Definition 8.2.47. An F_t-progressively measurable (and in particular F_t-adapted, right-continuous) S.P. {Y_t, t ≥ 0} is of class DL if the collection {Y_{τ∧u}, τ an F_t-stopping time} is U.I. for each finite, non-random u.
Theorem 8.2.48 (Doob-Meyer decomposition). A right-continuous sub-MG {Y_t, t ≥ 0} for {F_t} admits the decomposition Y_t = M_t + A_t with M_t a right-continuous F_t-martingale and A_t an F_t-increasing process, if and only if {Y_t, t ≥ 0} is of class DL.
Remark 8.2.49. To extend the uniqueness of Doob-Meyer decomposition beyond sub-MGs with continuous sample functions, one has to require A_t to be a natural process. While we do not define this concept here, we note in passing that every continuous increasing process is a natural process and a natural process is also an increasing process (c.f. [KaS97, Definition 1.4.5]), whereas the uniqueness is attained since if a finite linear combination of natural processes is a martingale, then it is indistinguishable from zero (c.f. proof of [KaS97, Theorem 1.4.10]).
Proof outline. We focus on constructing the Doob-Meyer decomposition for {Y_t, t ∈ I} in case I = [0, 1]. To this end, start with the right-continuous modification of the non-positive F_t-sub-martingale Z_t = Y_t − E[Y_1|F_t], which exists since t ↦ EZ_t is right-continuous (see Theorem 8.2.25). Suppose you can find A_1 ∈ L1(Ω, F_1, P) such that

(8.2.11)    A_t = Z_t + E[A_1|F_t] ,
So, by the tower property (M_t, F_t, t ∈ I) satisfies the martingale condition and we are done.
Proceeding to construct such A_1, fix ℓ ≥ 1 and for the (ordered) finite set Q_1^{(2,ℓ)} of dyadic rationals recall Doob's decomposition (in Theorem 5.2.1) of the discrete time sub-MG {Z_{s_j}, F_{s_j}, s_j ∈ Q_1^{(2,ℓ)}} as the sum of a discrete time U.I. martingale {M_{s_j}^{(ℓ)}, s_j ∈ Q_1^{(2,ℓ)}} and the predictable, non-decreasing (in view of Exercise 5.2.2), finite sequence {A_{s_j}^{(ℓ)}, s_j ∈ Q_1^{(2,ℓ)}}, starting with A_0^{(ℓ)} = 0. Noting that Z_1 = 0, or equivalently M_1^{(ℓ)} = −A_1^{(ℓ)}, it follows that for any q ∈ Q_1^{(2,ℓ)},

(8.2.12)    A_q^{(ℓ)} = Z_q − M_q^{(ℓ)} = Z_q − E[M_1^{(ℓ)}|F_q] = Z_q + E[A_1^{(ℓ)}|F_q] .

Relying on the fact that the sub-MG {Y_t, t ∈ I} is of class DL, this representation allows one to deduce that the collection {A_1^{(ℓ)}, ℓ ≥ 1} is U.I. (for details see [KaS97, proof of Theorem 1.4.10]). This in turn implies by the Dunford-Pettis compactness criterion that there exists an integrable A_1 and a non-random sub-sequence n_k such that A_1^{(n_k)} → A_1 in the weak L1 topology.
So, by our choice of V necessarily P(A_q > A_{q′}) = 0 for q < q′ in Q_1^{(2)} and consequently, w.p.1. the sample functions t ↦ A_t(ω) are non-decreasing over Q_1^{(2)}. By right-continuity the same applies over I and we are done, for {A_t, t ∈ I} of (8.2.11) is thus indistinguishable from an F_t-increasing process.
same applies over I and we are done, for {At , t I} of (8.2.11) is thus indistinguishable from an Ft -increasing process.
The same argument applies for I = [0, r] and any r Z+ . While we do not
do so here, the Ft -increasing process {At , t I} can be further shown to be a
natural process. By the uniqueness of such decompositions, as alluded to in Remark 8.2.49, it then follows that the restriction of the process {At } constructed on
[0, r ] to a smaller interval [0, r] is indistinguishable from the increasing process one
constructed directly on [0, r]. Thus, concatenating the processes {At , t r} and
{Mt , t r} yields the stated Doob-Meyer decomposition on [0, ).
As for the much easier converse, fixing non-random u R, by monotonicity of
t 7 At the collection {Au , an Ft -stopping time} is dominated by the integrable
Au hence U.I. Applying Doobs optional stopping theorem for the right-continuous
MG (Mt , Ft ), you further have that Mu = E[Mu |F ] for any Ft -stopping time
(see part (a) of Exercise 8.2.30), so by Proposition 4.2.33 the collection {Mu ,
an Ft -stopping time} is also U.I. In conclusion, the existence of such Doob-Meyer
decomposition Yt = Mt + At implies that the right-continuous sub-MG {Yt , t 0}
is of class DL (recall part (b) of Exercise 1.3.55).
Your next exercise provides a concrete instance in which the Doob-Meyer decomposition applies, connecting it with the decomposition in Theorem 8.2.45 of the non-negative sub-MG Y_t = X_t² of continuous sample path, as the sum of the quadratic variation ⟨X⟩_t and the continuous martingale X_t² − ⟨X⟩_t.
Exercise 8.2.50. Suppose {Y_t, t ≥ 0} is a non-negative, right-continuous sub-MG for {F_t}.
(a) Show that Y_t is in class DL.
(b) Show that if Y_t further has continuous sample functions then the processes M_t and A_t in its Doob-Meyer decomposition also have continuous sample functions (and are thus unique).
Remark. From the preceding exercise and Remark 8.2.49, we associate to each X ∈ M_2 a unique natural process, denoted ⟨X⟩_t and called the predictable quadratic variation of X, such that X_t² − ⟨X⟩_t is a right-continuous martingale. However, when X ∉ M_2^c, it is no longer the case that the predictable quadratic variation matches the quadratic variation of Definition 8.2.41 (as a matter of fact, the latter may not exist).
Example 8.2.51. A standard Brownian Markov process consists of a standard Wiener process {W_t, t ≥ 0} and filtration {F_t, t ≥ 0} such that F_s^W ⊆ F_s for any s ≥ 0 while (W_t − W_s, t ≥ s) is independent of F_s (see also Definition 8.3.7 for its Markov property). For a right-continuous augmented filtration F_t, such a process W_t is in M_2^c and further, M_t = W_t² − t is a martingale of continuous sample path. We thus deduce from Theorem 8.2.45 that its (predictable) quadratic variation is the non-random ⟨W⟩_t = t, which by Exercise 8.2.46 implies that the total variation of the Brownian sample path is a.s. infinite on any interval of positive length. More generally, recall part (b) of Exercise 8.2.6 that ⟨X⟩_t is non-random for any Gaussian martingale, hence so is the quadratic variation of any Gaussian martingale of continuous sample functions.
As you show next, the type of convergence to the quadratic variation may be strengthened (e.g. to convergence in L2 or a.s.) for certain S.P. by imposing some restrictions on the partitions considered.
Exercise 8.2.52. Let V^(2)_(π_n)(W) denote the quadratic variations of the Wiener process on a sequence of finite partitions π_n of [0, t] such that ‖π_n‖ → 0 as n → ∞.
(a) Show that V^(2)_(π_n)(W) → t in L^2.
(b) Show that V^(2)_(π_n)(W) → t a.s. when Σ_{n=1}^∞ ‖π_n‖ < ∞.
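A quick numerical illustration of the preceding exercise (a sketch; the uniform partitions, horizon, sample size and seed are all arbitrary choices): on a partition of [0, t] into n equal pieces, the sum of squared Brownian increments has mean exactly t and variance 2t^2/n, so it concentrates around t as the mesh shrinks, matching the L^2 convergence asserted in part (a).

```python
import numpy as np

def quadratic_variation(t, n, rng):
    """Sum of squared Brownian increments over a partition of [0, t] into n equal pieces."""
    dW = rng.normal(0.0, np.sqrt(t / n), size=n)  # increments W_{t_i} - W_{t_{i-1}}
    return np.sum(dW ** 2)

rng = np.random.default_rng(0)
t = 2.0
# Each (dW)^2 has mean t/n, so E[V] = t exactly, while Var(V) = 2 t^2 / n -> 0.
samples = [quadratic_variation(t, 4096, rng) for _ in range(200)]
mean_v = np.mean(samples)
```

With n = 4096 the per-sample standard deviation is sqrt(2 t^2/n) ≈ 0.044, so every sample already sits close to t.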
Remark. However, beware that for a.e. ω there exist random finite partitions π_n of [0, 1] such that ‖π_n‖ → 0 and V^(2)_(π_n)(W) → ∞ (see [Fre71, Page 48]).
Example 8.2.53. While we shall not prove it, Lévy's martingale characterization of the Brownian motion states the converse of Example 8.2.51, that any X ∈ M_2^c of quadratic variation ⟨X⟩_t = t must be a standard Brownian Markov process (c.f. [KaS97, Theorem 3.3.16]). However, recall from Example 8.2.5 that for a Poisson process N_t of rate λ, the compensated process M_t = N_t − λt is in M_2 and you can easily check that M_t^2 − λt is then a right-continuous martingale. Since the continuous increasing process λt is natural, we deduce from the uniqueness of the Doob-Meyer decomposition that ⟨M⟩_t = λt. More generally, by the same argument we deduce from part (c) of Exercise 8.2.6 that ⟨X⟩_t = tE(X_1^2) for any square-integrable S.P. with X_0 = 0 and zero-mean, stationary independent increments. In particular, this shows that sample path continuity is necessary for Lévy's characterization of the Brownian motion and that the standard Wiener process is the only zero-mean, square-integrable stochastic process X_t of continuous sample path and stationary independent increments, such that X_0 = 0.
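The identity ⟨M⟩_t = λt for the compensated Poisson process can be checked by simulation (a sketch; the rate, horizon, sample size and seed are arbitrary choices): since M_t^2 − λt is a martingale started at 0, one has E[M_t] = 0 and E[M_t^2] = λt.

```python
import numpy as np

rng = np.random.default_rng(1)
lam, t, n_paths = 3.0, 2.0, 200_000

# N_t ~ Poisson(lam * t); the compensated process is M_t = N_t - lam * t.
N_t = rng.poisson(lam * t, size=n_paths)
M_t = N_t - lam * t

mean_M = M_t.mean()        # martingale with M_0 = 0, so E[M_t] = 0
var_M = (M_t ** 2).mean()  # E[M_t^2] = lam * t, the predictable quadratic variation
```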
Building upon Lévy's characterization, you can now prove the following special case of the extremely useful Girsanov's theorem.
Exercise 8.2.54. Suppose (W_t, F_t, t ≥ 0) is a standard Brownian Markov process on a probability space (Ω, F, P) and, fixing non-random parameters μ ∈ R and T > 0, consider the exponential F_t-martingale Z_t = exp(μW_t − μ^2 t/2) and the corresponding probability measure Q_T(A) = E(I_A Z_T) on (Ω, F_T).
(a) Show that V^(2)(Z)_t = μ^2 ∫_0^t Z_u^2 du.
(b) Show that W̃_u = W_u − μu is for u ∈ [0, T] an F_u-martingale on the probability space (Ω, F_T, Q_T).
(c) Deduce that (W̃_t, F_t, t ≤ T) is a standard Brownian Markov process on (Ω, F_T, Q_T).
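A Monte Carlo sanity check of this change of measure (a sketch; μ, T, the sample size and seed are arbitrary choices): since Q_T(A) = E[I_A Z_T], an expectation under Q_T is a Z_T-weighted expectation under P. The total Q_T-mass is E[Z_T] = 1, and E_{Q_T}[W_T] = μT, consistent with W_t − μt being a standard Brownian motion under Q_T.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, T, n = 0.7, 1.5, 400_000

W_T = rng.normal(0.0, np.sqrt(T), size=n)
Z_T = np.exp(mu * W_T - 0.5 * mu ** 2 * T)   # exponential martingale, E[Z_T] = 1

mass = Z_T.mean()             # Q_T is a probability measure: total mass near 1
mean_Q = (Z_T * W_T).mean()   # E under Q_T of W_T, near mu * T
```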
Here is the extension to the continuous time setting of Lemma 5.2.7 and Proposition 5.3.31.
Exercise 8.2.55. Let V_t = sup_{s∈[0,t]} Y_s and A_t be the increasing process of continuous sample functions in the Doob-Meyer decomposition of a non-negative, continuous, F_t-submartingale {Y_t, t ≥ 0} with Y_0 = 0.
(a) Show that P(V_τ ≥ x, A_τ < y) ≤ x^{−1} E(A_τ ∧ y) for all x, y > 0 and any F_t-stopping time τ.
(b) Setting c_1 = 4 and c_q = (2 − q)/(1 − q) for q ∈ (0, 1), conclude that E[sup_s |X_s|^{2q}] ≤ c_q E[⟨X⟩_∞^q] for any X ∈ M_2^c and q ∈ (0, 1], hence {|X_t|^{2q}, t ≥ 0} is U.I. when ⟨X⟩_∞^q is integrable.
Definition 8.2.56. For any pair X, Y ∈ M_2, we call the S.P. ⟨X, Y⟩_t the bracket of X and Y and say that X, Y ∈ M_2 are orthogonal if for any t ≥ 0 the bracket ⟨X, Y⟩_t is a.s. zero.
Remark. It is easy to check that ⟨X, X⟩ = ⟨X⟩ for any X ∈ M_2. Further, for any s ∈ [0, t], w.p.1.
E[(X_t − X_s)(Y_t − Y_s)|F_s] = E[X_t Y_t − X_s Y_s|F_s] = E[⟨X, Y⟩_t − ⟨X, Y⟩_s|F_s],
(1/2) [⟨X⟩_t − ⟨X⟩_s + ⟨Y⟩_t − ⟨Y⟩_s] .
(e) Show that for X, Y ∈ M_2^c the bracket ⟨X, Y⟩_t is also the limit in probability as ‖π‖ → 0 of
Σ_{i=1}^k (X_{t_i^(π)} − X_{t_{i−1}^(π)})(Y_{t_i^(π)} − Y_{t_{i−1}^(π)}) ,
where π = {0 = t_0^(π) < t_1^(π) < ⋯ < t_k^(π) = t}.
where 0 = t_0 < t_1 < ⋯ < t_k < ⋯ is a non-random unbounded sequence and the F_{t_n}-adapted sequence {η_n(ω)} is bounded uniformly in n and ω.
(a) With A_t = ∫_0^t X_u^2 du, show that both
I_t = Σ_{j=0}^{k−1}
x ∈ S, B ∈ S,
x ∈ S, B ∈ S, u, s ≥ 0 ,
(8.3.3)
Remark. Recall that a G_n-Markov chain {Y_n} is a discrete time S.P. Hence, in this case one considers only t, s ∈ Z_+ and (8.3.1) is automatically satisfied by setting p_{s,t} = p_{s,s+1} p_{s+1,s+2} ⋯ p_{t−1,t} to be the composition of the (one-step) transition probabilities of the Markov chain (see Definition 6.1.2, with p_{s,s+1} = p independent of s when the chain is homogeneous). Further, the interpolated process X_t = Y_{[t]} is then a right-continuous F_t-Markov process for the right-continuous interpolated filtration F_t = G_{[t]} of Example 8.1.5, but {X_t} is in general an inhomogeneous Markov process, even in case the Markov chain {Y_n} is homogeneous.
Similarly, if an F_t-adapted S.P. (X_t, t ≥ 0) satisfies (8.3.4) for p_t(x, B) = P(X_t ∈ B|X_0 = x) and x ↦ p_t(x, B) is measurable per fixed t ≥ 0 and B ∈ S, then considering the tower property for I_B(X_{s+u})I_{{x}}(X_0) and σ(X_0) ⊆ F_s, one easily verifies that (8.3.2) holds, hence (X_t, t ≥ 0) is a homogeneous F_t-Markov process.
More generally, in analogy with our definition of Markov chains via (6.1.1), one may opt to say that an F_t-adapted S.P. (X_t, t ≥ 0) is an F_t-Markov process provided for each B ∈ S and t ≥ s ≥ 0,
P[X_t ∈ B | F_s] = P[X_t ∈ B | σ(X_s)]   a.s.
Indeed, as noted in Remark 6.1.6 (in view of Exercise 4.4.5), for B-isomorphic (S, S) this suffices for the existence of transition probabilities which satisfy (8.3.3). However, this simpler to verify, plausible definition of Markov processes results in Chapman-Kolmogorov equations holding only up to a null set per fixed t_3 ≥ t_2 ≥ t_1 ≥ 0. The study of such processes is consequently made more cumbersome, which is precisely why we, like most texts, do not take this route.
By Lemma 6.1.3 we deduce from Definition 8.3.1 that for any f ∈ bS and all t ≥ s ≥ 0,
(8.3.5)
for 0 = s_0 < s_1 < ⋯ < s_n, and there exists a Markov process of state space (S, S) having these f.d.d. Conversely, the f.d.d. of any Markov process having initial probability distribution ν(B) = P(X_0 ∈ B) and satisfying (8.3.3), are given by (8.3.6).
Proof. Recall from Proposition 6.1.5 that ν ⊗ p_{0,s_1} ⊗ ⋯ ⊗ p_{s_{n−1},s_n} denotes the Markov-product-like measures, whose evaluation on product sets is by iterated integrations over the transition probabilities p_{s_{k−1},s_k}, in reverse order k = n, ..., 1, followed by a final integration over the initial measure ν. As shown in this proposition, given any transition probabilities {p_{s,t}, t ≥ s ≥ 0}, the probability distribution ν on (S, S) uniquely determines ν ⊗ p_{0,s_1} ⊗ ⋯ ⊗ p_{s_{n−1},s_n}, namely, the f.d.d. specified in (8.3.6). We then uniquely specify the remaining f.d.d. as the probability measures μ_{s_1,...,s_n}(D) = μ_{s_0,s_1,...,s_n}(S × D). Proceeding to check the consistency of these f.d.d. note that p_{s,u} p_{u,t}(·, S × ·) = p_{s,t}(·, ·) for any s < u < t (by the Chapman-Kolmogorov identity (8.3.1)). Thus, considering s = s_{k−1}, u = s_k and t = s_{k+1} we deduce that if D = A_0 × ⋯ × A_n with A_k = S for some k = 1, ..., n−1, then
p_{s_0,s_1} ⊗ ⋯ ⊗ p_{s_{n−1},s_n}(D) = ⋯ ⊗ p_{s_{k−1},s_{k+1}} ⊗ ⋯ ⊗ p_{s_{n−1},s_n}(D_k)
for D_k = A_0 × ⋯ × A_{k−1} × A_{k+1} × ⋯ × A_n, which are precisely the consistency conditions of (7.1.3) for the f.d.d. {μ_{s_0,...,s_n}}. These consistency requirements are further handled in case of a product set D with A_n = S by observing that for all x ∈ S and any transition probability p_{s_{n−1},s_n}(x, S) = 1, whereas our definition of μ_{s_1,...,s_n} already dealt with A_0 = S. Having shown that this collection of f.d.d. is consistent, recall that Proposition 7.1.8 applies even with (R, B) replaced by the B-isomorphic measurable space (S, S). Setting T = [0, ∞), it provides the construction of a S.P. {Y_t(ω) = ω(t), t ∈ T} via the coordinate maps on the canonical probability space (S^T, S^T, P_ν) with the f.d.d. of (8.3.6). Turning next to verify that (Y_t, F_t^Y, t ∈ T) satisfies the Markov condition (8.3.3), fix t ≥ s ≥ 0, B ∈ S and recall that, for t > s as in the proof of Theorem 6.1.8, and by definition in case t = s,
(8.3.7)
The latter identity is proved by induction on n, where denoting its right side by g_{n+1,s}(X_s), we see that g_{n+1,s} = p_{s,t_n}(f_n g_{n,t_n}) and the case n = 0 is merely (8.3.5). In the induction step we have from the tower property and F_t-adaptedness of {X_t} that
E[∏_{ℓ=0}^n f_ℓ(X_{t_ℓ}) | F_s] = E[f_n(X_{t_n}) E[∏_{ℓ=0}^{n−1} f_ℓ(X_{t_ℓ}) | F_{t_n}] | F_s] = E[(f_n g_{n,t_n})(X_{t_n}) | F_s] = g_{n+1,s}(X_s) ,
where the induction hypothesis is used in the second equality and (8.3.5) in the third. In particular, considering the expected value of (8.3.8) for s = 0 and indicator functions f_ℓ(·) it follows that the f.d.d. of this process are given by (8.3.6), as claimed.
Remark 8.3.3. As in Lemma 7.1.7, for B-isomorphic state space (S, S) any F ∈ F^X is of the form F = (X_·)^{−1}(A) for some A ∈ S^T, where X_·(ω) : Ω ↦ S^T denotes the collection of sample functions of the given Markov process {X_t, t ≥ 0}. Then, P(F) = P_ν(A), so while proving Theorem 8.3.2 we have defined the law P_ν(·) of a Markov process {X_t, t ≥ 0} as the unique probability measure on S^{[0,∞)} such that
P_ν({ω : ω(s_ℓ) ∈ B_ℓ, ℓ = 0, ..., n}) = P(X_{s_0} ∈ B_0, ..., X_{s_n} ∈ B_n) ,
for B_ℓ ∈ S and distinct s_ℓ ≥ 0 (compare with Definition 6.1.7 for the law of a Markov chain). We denote by P_x the law P_ν in case ν(B) = I_{x∈B}, namely, when X_0 = x is non-random, and note that P_ν(A) = ∫_S P_x(A) ν(dx) for any probability measure ν on (S, S) and all A ∈ S^T, with P_x uniquely determined by the specified (consistent) transition probabilities {p_{s,t}, t ≥ s ≥ 0}.
The evaluation of the f.d.d. of a Markov process is more explicit when S is a countable set, as then p_{s,t}(x, B) = Σ_{y∈B} p_{s,t}(x, y) for any B ∈ S (and all Lebesgue integrals are merely sums). Likewise, in case S = R^d (equipped with S = B_S), computations are relatively explicit if for each t > s ≥ 0 and x ∈ S the probability measure p_{s,t}(x, ·) is absolutely continuous with respect to Lebesgue measure on S, in which case (p_{s,t}f)(x) = ∫ p_{s,t}(x, y)f(y) dy and the right side of (8.3.8) amounts
to iterated integrations of the transition probability kernel ps,t (x, y) of the process
with respect to Lebesgue measure on S.
The next exercise is about the closure of the collection of Markov processes under
certain invertible non-random measurable mappings.
Exercise 8.3.4. Suppose (X_t, F_t^X, t ≥ 0) is a Markov process of state space (S, S), u : [0, ∞) ↦ [0, ∞) is an invertible, strictly increasing function and for each t ≥ 0 the measurable mapping φ_t : (S, S) ↦ (S̃, S̃) is invertible, with φ_t^{−1} measurable.
(a) Setting Y_t = φ_t(X_{u(t)}), verify that F_t^Y = F_{u(t)}^X and that (Y_t, F_t^Y, t ≥ 0) is a Markov process of state space (S̃, S̃).
(b) Show that if (X_t, F_t^X, t ≥ 0) is a homogeneous Markov process then so is Z_t = φ_0(X_t).
Of particular note is the following collection of Markov processes.
Proposition 8.3.5. If a real-valued S.P. {X_t, t ≥ 0} has independent increments, then (X_t, F_t^X, t ≥ 0) is a Markov process of transition probabilities p_{s,t}(y, B) = P_{X_t−X_s}({z : y + z ∈ B}), and if {X_t, t ≥ 0} further has stationary, independent increments, then this Markov process is homogeneous.
Proof. Considering Exercise 4.2.2 for G = F_s^X, Y = X_s ∈ mG and the R.V. Z = Z_{t,s} = X_t − X_s which is independent of G, you find that (8.3.3) holds for p_{s,t}(y, B) = P(y + Z ∈ B), which in case of stationary increments depends only on t − s. Clearly, B ↦ P(y + Z ∈ B) is a probability measure on (R, B), for any t ≥ s and y ∈ R. Further, if B = (−∞, b] then p_{s,t}(y, B) = F_Z(b − y) is a Borel function of y (see Exercise 1.2.27). As the λ-system L = {B ∈ B : y ↦ P(y + Z ∈ B) is a Borel function} contains the π-system {(−∞, b] : b ∈ R} generating B, it follows that L = B, hence p_{s,t}(·, ·) is a transition probability for each t ≥ s ≥ 0. To verify that the Chapman-Kolmogorov equations hold, fix u ∈ [s, t] noting that Z_{s,t} = Z_{s,u} + Z_{u,t}, with Z_{u,t} = X_t − X_u independent of Z_{s,u} = X_u − X_s. Hence, by the tower property,
p_{s,t}(y, B) = E[P(y + Z_{s,u} + Z_{u,t} ∈ B|Z_{s,u})]
and this relation, i.e. (8.3.1), holds for all y ∈ R and B ∈ B, as claimed.
Among the consequences of Proposition 8.3.5 is the fact that both the Brownian motion and the Poisson process (potentially starting at N_0 = x ∈ R) are homogeneous Markov processes of explicit Markov semi-groups.
Example 8.3.6. Recall Proposition 3.4.9 and Exercise 7.3.13 that both the Poisson process and the Brownian motion are processes of stationary independent increments. Further, this property clearly extends to the Brownian motion with drift Z_t^(r) = W_t + rt + x, and to the Poisson process with drift N_t^(r) = N_t + rt + x, where the drift r ∈ R is a non-random constant, x ∈ R is the specified (under P_x) initial value of N_0^(r) (or Z_0^(r)), and N_t − N_0 is a Poisson process of rate λ. Consequently, both {Z_t^(r), t ≥ 0} and {N_t^(r), t ≥ 0} are real-valued homogeneous Markov processes. Specifically, from the preceding proposition we have that the Markov semi-group of the Brownian motion with drift is p_t(x + rt, B), where for t > 0,
(8.3.9)   p_t(x, B) = ∫_B e^{−(y−x)^2/2t} / √(2πt) dy ,
having the transition probability kernel p_t(x, y) = exp(−(y − x)^2/2t)/√(2πt). Similarly, the Markov semi-group of the Poisson process with drift is q_t(x + rt, B), where
(8.3.10)   q_t(x, B) = e^{−λt} Σ_{k=0}^∞ (λt)^k/k!  I_B(x + k) .
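The semi-group property of the Brownian kernel (8.3.9) is easy to verify numerically (a sketch; the grid and the parameter values are arbitrary choices): the Chapman-Kolmogorov identity here is just the statement that convolving Gaussian kernels of variances s and t yields the Gaussian kernel of variance s + t.

```python
import numpy as np

def p_kernel(t, x, y):
    """Brownian transition density p_t(x, y) = exp(-(y-x)^2/(2t)) / sqrt(2 pi t), as in (8.3.9)."""
    return np.exp(-(y - x) ** 2 / (2 * t)) / np.sqrt(2 * np.pi * t)

# Chapman-Kolmogorov: integrating p_s(x, z) p_t(z, y) over z recovers p_{s+t}(x, y).
s, t, x, y = 0.4, 0.9, 0.3, -0.5          # arbitrary illustrative values
z = np.linspace(-12.0, 12.0, 20001)
dz = z[1] - z[0]
lhs = np.sum(p_kernel(s, x, z) * p_kernel(t, z, y)) * dz   # Riemann sum over z
rhs = p_kernel(s + t, x, y)
```

Since the integrand is smooth with rapidly decaying tails, the Riemann sum reproduces p_{s+t}(x, y) to many digits.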
Remark. Homogeneous Markov chains are characterized by their (one-step) transition probabilities, whereas each homogeneous Markov process has a full semi-group p_t(·), t ≥ 0. While outside our scope, we note in passing that the semi-group relation (8.3.2) can be rearranged as s^{−1}(p_{s+t} − p_t) = s^{−1}(p_s − p_0)p_t, which subject to the appropriate regularity conditions should yield for s ↓ 0 the celebrated backward Kolmogorov equation ∂_t p_t = Lp_t. The operator L = lim_{s↓0} s^{−1}(p_s − p_0) is then called the generator of the Markov process (or its semi-group). For example, the transition probability kernel p_t(x + rt, y) of the Brownian motion with drift solves the partial differential equation (pde) u_t = (1/2)u_xx + ru_x and the generator of this semi-group is Lu = (1/2)u_xx + ru_x (c.f. [KaS97, Chapter 5]). For this reason, many computations about Brownian motion can also be done by solving rather simple elliptic or parabolic pde-s.
We saw in Proposition 8.3.5 that the Wiener process (W_t, t ≥ 0) is a homogeneous F_t^W-Markov process of continuous sample functions and the Markov semi-group of (8.3.9). This motivates the following definition of a Brownian Markov process (W_t, F_t), where our accommodation of possible enlargements of the filtration and different initial distributions will be useful in future applications.
Definition 8.3.7 (Brownian Markov process). We call (W_t, F_t) a Brownian Markov process if {W_t, t ≥ 0} of continuous sample functions is a homogeneous F_t-Markov process with the Brownian semi-group {p_t, t ≥ 0} of (8.3.9). If in addition W_0 = 0, we call such a process a standard Brownian Markov process.
Stationarity of Markov processes, in the sense of Definition 7.3.7, is related to the important concept of invariant probability measures which we define next (compare with Definition 6.1.20).
Definition 8.3.8. A probability measure ν on a B-isomorphic space (S, S) is called an invariant (probability) measure for a semi-group of transition probabilities {p_u, u ≥ 0}, if the induced law P_ν(·) = ∫_S P_x(·) ν(dx) (see Remark 8.3.3) is invariant under any time shift θ_s, s ≥ 0.
You can easily check that if a Markov process is also a stationary process under an initial probability measure ν, then it is effectively a homogeneous Markov process, in the sense that p_{s,t}(x, ·) = p_{t−s}(x, ·) for any t ≥ s ≥ 0 and ν-a.e. x ∈ S. However, many homogeneous Markov processes are non-stationary (for example, recall Examples 7.3.14 and 8.3.6, that the Brownian motion is a non-stationary yet homogeneous Markov process).
Here is the explicit characterization of invariant measures for a given Markov semi-group and their connection to stationary Markov processes.
Exercise 8.3.9. Adapting the proof of Proposition 6.1.23 show that a probability measure ν on B-isomorphic (S, S) is an invariant measure for a Markov semi-group {p_u, u ≥ 0}, if and only if νp_t = ν for any t ≥ 0 (note that a homogeneous Markov
[Figure: sample paths plotted against t; the only legible panel title is "Ornstein-Uhlenbeck process U_t".]
Remark 8.3.12. From Lemma 7.1.7 you can easily deduce that any V ∈ bF^X is of the form V = h(X_·) with h ∈ bS^{[0,∞)}. Further, in view of Exercises 1.2.32 and 7.2.9, any bounded Borel function h(·) on the space C([0, ∞)) of continuous functions equipped with the topology of uniform convergence on compact intervals is the restriction to C([0, ∞)) of some h̃ ∈ bR^{[0,∞)}. In particular, for a real-valued Markov process {X_t, t ≥ 0} of continuous sample functions, E_x[h̃] = E_x[h] and h̃_s(X_·) = h_s(X_·), hence (8.3.11) applies for any bounded, B_{C([0,∞))}-measurable function h.
Proof. Fixing s ≥ 0, in case h(x(·)) = ∏_{ℓ=0}^n f_ℓ(x(u_ℓ)) for finite n, f_ℓ ∈ bS and u_0 > ⋯ > u_n ≥ 0, we have by (8.3.8) for t_ℓ = s + u_ℓ and the semi-group p_{r,t} = p_{t−r} of (X_t, t ≥ 0), that
E[∏_{ℓ=0}^n f_ℓ(X_{s+u_ℓ}) | F_s] = E_{X_s}[∏_{ℓ=0}^n f_ℓ(X_{u_ℓ})] .
In particular, if (X_t, F_t) is a strong Markov process, then {X_t, t ≥ 0} is a homogeneous F_{t+}-Markov process and for any s ≥ 0 and h ∈ bS^{[0,∞)}, almost surely
(8.3.14)
P[X_{τ+u} ∈ B | F_{τ+}] = p_u(X_τ, B) .
almost surely, for ℓ = 2, ..., n.
The identity (8.3.16) is the n = 1 basis of the proof. To carry out the induction step, recall part (c) of Exercise 8.1.10 that τ_ℓ = τ + u_ℓ is a decreasing sequence of F_t-Markov times, which are finite if and only if τ is, and further, F_{τ+} ⊆ F_{τ_n+} (see part (b) of Exercise 8.1.11). It thus follows by the tower property and taking out the known term f_n(X_{τ_n}) ∈ bF_{τ_n+} (when τ < ∞, see Proposition 8.1.13), that
E[I_{τ<∞} ∏_{ℓ=1}^n f_ℓ(X_{τ_ℓ}) | F_{τ+}] = E[I_{τ<∞} f_n(X_{τ_n}) E[∏_{ℓ=1}^{n−1} f_ℓ(X_{τ_ℓ}) | F_{τ_n+}] | F_{τ+}] .
Indeed, since η_ℓ = u_ℓ − u_n are non-random and positive, the induction hypothesis applies for the F_t-Markov time τ_n to yield the second equality, whereas the third equality is established by considering the identity (8.3.16) for f = f_n g_{n−1} and u = u_n.
Step 3. Similarly to the proof of Proposition 8.3.11, fixing A ∈ F_{τ+}, yet another application of the monotone class theorem shows that any h ∈ bU is in the collection
Indeed, in Step 2 we have shown that H contains the indicators on the π-system P such that U = σ(P). Further, constants are in H, which, by the linearity of the expectation (and hence of h ↦ g_h), is a vector space. Finally, if h_n → h bounded and h_n ∈ H are non-negative, then h ∈ bU and by monotone convergence g_{h_n} → g_h bounded and measurable, with the pair (h, g_h) also satisfying (8.3.17). Since g_h(τ, X_{τ+·})I_{τ<∞} is in bF_{τ+} and the preceding argument applies for all A ∈ F_{τ+}, we conclude that per τ and h the identity (8.3.12) holds w.p.1., as claimed.
As you are to show now, both the Markov and strong Markov properties apply for product laws of finitely many independent processes, each of which has the corresponding property.
Exercise 8.3.16. Suppose on some probability space (Ω, F, P) we have homogeneous Markov processes (X_t^(i), F_t^(i)) of B-isomorphic state spaces (S_i, S_i) and Markov semi-groups p_t^(i)(·, ·), such that F^(i), i = 1, ..., ℓ, are P-mutually independent. Show that X_t = (X_t^(1), ..., X_t^(ℓ)) is then a homogeneous Markov process of semi-group
p_t(x, B_1 × ⋯ × B_ℓ) = ∏_{i=1}^ℓ p_t^(i)(x_i, B_i) .
Indeed, recall that in Lemma 8.1.16 we have constructed a sequence of finite F_t-stopping times θ_ℓ = 2^{−ℓ}([2^ℓ θ] + 1) taking values in the countable set of non-negative dyadic rationals, such that θ_ℓ ↓ θ. Further, for any ℓ we have that A ∈ F_{θ_ℓ} (see part (b) of Exercise 8.1.12), hence as shown in Exercise 8.3.17,
E[f(X_{θ_ℓ+u})I_A] = E[(p_u f)(X_{θ_ℓ})I_A] .
Due to the sample path right-continuity, both X_{θ_ℓ+u} → X_{θ+u} and X_{θ_ℓ} → X_θ. Since f ∈ C_b(R) and p_u f ∈ C_b(R) (by the assumed Feller property), as ℓ → ∞ both f(X_{θ_ℓ+u}) → f(X_{θ+u}) and (p_u f)(X_{θ_ℓ}) → (p_u f)(X_θ). We thus deduce by bounded convergence that (8.3.18) holds.
Next, consider non-negative f_k ∈ C_b(R) such that f_k ↑ I_{(−∞,b)} (see Lemma 3.1.6 for an explicit construction of such). By monotone convergence p_u f_k ↑ p_u I_{(−∞,b)} and hence
(8.3.19)
for any B in the π-system {(−∞, b) : b ∈ R} which generates the Borel σ-algebra B. The collection L of sets B ∈ B for which the preceding identity holds is a λ-system (by linearity of the expectation and monotone convergence), so by Dynkin's theorem it holds for any Borel set B. Since this applies for any A ∈ F_{θ+}, the strong Markov property of (X_t, F_t) follows from Proposition 8.3.15, upon noting that the right-continuity of t ↦ X_t implies that X_t is F_t-progressively measurable, with p_u(X_θ, B) ∈ mF_{θ+} (see Propositions 8.1.8 and 8.1.13, respectively).
Taking advantage of the preceding result, you can now verify that any right-continuous S.P. of stationary, independent increments is a strong Markov process.
Exercise 8.3.20. Suppose {X_t, t ≥ 0} is a real-valued process of stationary, independent increments.
(a) Show that {X_t, t ≥ 0} has a Feller semi-group.
(b) Show that if {X_t, t ≥ 0} is also right-continuous, then it is a strong Markov process. Deduce that this applies in particular for the Poisson process (starting at N_0 = x ∈ R as in Example 8.3.6), as well as for any Brownian Markov process (X_t, F_t).
(c) Suppose the right-continuous {X_t, t ≥ 0} is such that lim_{t↓0} E|X_t| = 0 and X_0 = 0. Show that X_t is integrable for all t ≥ 0 and M_t = X_t − tEX_1 is a martingale. Deduce that then E[X_τ] = E[τ]E[X_1] for any integrable F_t^X-stopping time τ.
Hint: Establish the last claim first for the F_t^X-stopping times θ_ℓ = 2^{−ℓ}([2^ℓ τ] + 1).
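Part (c) above is a continuous-time Wald identity, which can be sanity-checked on the Poisson process (a sketch; the rate λ, the jump count k, the sample size and seed are arbitrary choices): for τ the time of the k-th jump of a rate-λ Poisson process, N_τ = k identically, while E[τ]E[N_1] = (k/λ)·λ = k as well.

```python
import numpy as np

rng = np.random.default_rng(4)
lam, k, n = 2.0, 5, 100_000

# tau = time of the k-th jump of a rate-lam Poisson process (an F^N_t-stopping time):
# a sum of k i.i.d. Exp(lam) inter-arrival times.
tau = rng.exponential(1.0 / lam, size=(n, k)).sum(axis=1)

# Wald identity: E[N_tau] = E[tau] * E[N_1]; here N_tau = k identically.
lhs = float(k)             # E[N_tau]
rhs = tau.mean() * lam     # Monte Carlo estimate of E[tau] * E[N_1]
```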
Our next example demonstrates that some regularity of the semi-group is needed
when aiming at the strong Markov property (i.e., merely considering the canonical
filtration of a homogeneous Markov process with continuous sample functions is
not enough).
Example 8.3.21. Suppose X_0 is independent of the standard Wiener process {W_t, t ≥ 0} and q = P(X_0 = 0) ∈ (0, 1). The S.P. X_t = X_0 + W_t I_{X_0≠0} has continuous sample functions and for any fixed s ≥ 0, a.s. I_{X_0=0} = I_{X_s=0} (as the difference occurs on the event {W_s = −X_0 ≠ 0} which is of zero probability). Further, the independence of increments of {W_t} implies the same for {X_t} conditioned on X_0, hence for any u ≥ 0 and Borel set B, almost surely,
P(X_{s+u} ∈ B|F_s^X) = I_{0∈B} I_{X_0=0} + P(W_{s+u} − W_s + X_s ∈ B|X_s) I_{X_0≠0} = p̂_u(X_s, B) ,
where p̂_u(x, B) = p_0(x, B)I_{x=0} + p_u(x, B)I_{x≠0} for the Brownian semi-group p_u(·). Clearly, per u fixed, p̂_u(·, ·) is a transition probability on (R, B) and p̂_0(x, B) is the identity element for the semi-group relation p̂_{u+s} = p̂_u p̂_s which is easily shown to hold (but this is not a Feller semi-group, since x ↦ (p̂_t f)(x) is discontinuous at x = 0 whenever f(0) ≠ Ef(W_t)). In view of Definition 8.3.1, we have just shown that p̂_u(·, ·) is the Markov semi-group associated with the F_t^X-progressively measurable homogeneous Markov process {X_t, t ≥ 0} (regardless of the distribution of X_0). However, (X_t, F_t^X) is not a strong Markov process. Indeed, note that τ = inf{t ≥ 0 : X_t = 0} is an F_t^X-stopping time (see Proposition 8.1.15), which is finite a.s. (since if X_0 ≠ 0 then X_t = W_t + X_0 and τ = τ^(0)_{−X_0} of Exercise 8.2.35, whereas for X_0 = 0 obviously τ = 0). Further, by continuity of the sample functions, X_τ = 0 whenever τ < ∞, so if (X_t, F_t^X) was a strong Markov process, then in particular, a.s.
P(X_{τ+1} > 0|F_{τ+}^X) = p̂_1(0, (0, ∞)) = 0
(this is merely (8.3.15) for the stopping time τ, u = 1 and B = (0, ∞)). However, the latter identity fails whenever X_0 ≠ 0 (i.e. with probability 1 − q > 0), for then the left side is merely p_1(0, (0, ∞)) = 1/2 (since {W_t, F_t^X} is a Brownian Markov process, hence a strong Markov process, see Exercise 8.3.20).
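The failure of the strong Markov property in this example shows up plainly in simulation. The sketch below (q, the starting point 1 for the non-zero case, sample size and seed are arbitrary choices) shortcuts the hitting time: on {X_0 ≠ 0} one has τ < ∞ a.s. and, by the strong Markov property of W itself, X_{τ+1} = W_{τ+1} − W_τ is N(0, 1); on {X_0 = 0} the path is identically zero. Hence P(X_{τ+1} > 0) = (1 − q)/2, not the value 0 predicted by p̂_1(0, (0, ∞)).

```python
import numpy as np

rng = np.random.default_rng(5)
q, n = 0.3, 200_000

X0 = np.where(rng.random(n) < q, 0.0, 1.0)   # P(X0 = 0) = q, else start at 1
# On {X0 != 0}: tau < inf a.s. and, by the strong Markov property of W,
# X_{tau+1} = W_{tau+1} - W_tau ~ N(0, 1).  On {X0 = 0}: X_t = 0 for all t.
incr = rng.normal(0.0, 1.0, size=n)
X_tau_plus_1 = np.where(X0 != 0.0, incr, 0.0)

p_pos = np.mean(X_tau_plus_1 > 0)   # near (1 - q)/2, contradicting phat_1(0,(0,inf)) = 0
```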
Here is an alternative, martingale based, proof that any Brownian Markov process
is a strong Markov process.
Exercise 8.3.22. Suppose (X_t, F_t) is a Brownian Markov process.
(a) Let R_t and I_t denote the real and imaginary parts of the complex-valued S.P. M_t = exp(iθX_t + tθ^2/2). Show that both (R_t, F_t) and (I_t, F_t) are MG-s.
(b) Fixing a bounded F_t-Markov time τ, show that E[M_{τ+u}|F_{τ+}] = M_τ w.p.1.
(c) Deduce that w.p.1. the R.C.P.D. of X_{τ+u} given F_{τ+} matches the normal distribution of mean X_τ(ω) and variance u.
(d) Conclude that the F_t-progressively measurable homogeneous Markov process {X_t, t ≥ 0} is a strong Markov process.
8.3.3. Markov jump processes. This section is about the following Markov
processes which in many respects are very close to Markov chains.
proceed to describe the jump parameters of Markov jump processes. These parameters then serve as a convenient alternative to the general characterization of a homogeneous Markov process via its (Markov) semi-group.
Proposition 8.3.26. Suppose (X_t, t ≥ 0) is a right-continuous, homogeneous Markov process.
(a) Under P_y, the F_t^X-Markov time τ = inf{t ≥ 0 : X_t ≠ X_0} has the exponential distribution of parameter λ_y, for all y ∈ S and some measurable λ : S ↦ [0, ∞].
(b) If λ_y > 0 then τ is P_y-almost-surely finite and P_y-independent of the S-valued random variable X_τ.
(c) If (X_t, t ≥ 0) is a strong Markov process and λ_y > 0 is finite, then P_y-almost-surely X_τ ≠ y.
(d) If (X_t, t ≥ 0) is a Markov jump process, then τ is a strictly positive F_t^X-stopping time.
X_τ ≠ y.
(d). Here t ↦ X_t(ω) is a step function, hence clearly, for each t ≥ 0
{τ ≤ t} = ∪_{q∈Q∩[0,t]} {X_q ≠ X_0} ∈ F_t^X .
the name explosion given to such phenomena. But, observe that unbounded jump rates do not necessarily imply an explosion (as for example, in case of simple birth processes), and explosion may occur for one initial state but not for another (for example, here λ_0 = 0 so there is no explosion if starting at x = 0).
As you are to verify now, the jump parameters characterize the relatively explicit generator for the semi-group of a Markov jump process, which in particular satisfies Kolmogorov's forward (in case of bounded jump rates) and backward equations.
Definition 8.3.32. The linear operator L : bS ↦ mS such that (Lh)(x) = λ_x ∫ (h(y) − h(x)) p(x, dy) for h ∈ bS is called the generator of the Markov jump process corresponding to jump parameters (λ, p). In particular, (L I_{{x}^c})(x) = λ_x and more generally (L I_B)(x) = λ_x p(x, B) for any B ⊆ {x}^c (so specifying the generator is in this context equivalent to specifying the jump parameters).
Exercise 8.3.33. Consider a Markov jump process (X_t, t ≥ 0) of semi-group p_t(·, ·) and jump parameters (λ, p) as in Definition 8.3.32. Let T_k = Σ_{j=1}^k τ_j denote the jump times of the sample function s ↦ X_s(ω) and Y_t = Σ_{k≥1} I_{T_k≤t} the number of such jumps in the interval [0, t].
(a) Show that if λ_x > 0 then
P_x(τ_2 ≤ t | τ_1) = ∫ (1 − e^{−λ_y t}) p(x, dy) ,
and deduce that t^{−1} P_x(Y_t ≥ 2) → 0 as t ↓ 0, for any x ∈ S.
(b) Fixing x ∈ S and h ∈ bS, show that
(8.3.20)   lim_{s↓0} s^{−1}[(p_s h)(x) − h(x)] = (Lh)(x) ,
then deduce Kolmogorov's backward equation
(8.3.21)   ∂_t (p_t h)(x) = (L p_t h)(x) ,   t ≥ 0, x ∈ S, h ∈ bS,
and, in case of bounded jump rates, Kolmogorov's forward equation
(8.3.22)   ∂_t (p_t h)(x) = (p_t Lh)(x) ,   t ≥ 0, x ∈ S, h ∈ bS.
Remark. Exercise 8.3.33 relates the Markov semi-group with the corresponding jump parameters, showing that a Markov semi-group p_t(·, ·) corresponds to a Markov jump process only if for any x ∈ S, the limit
(8.3.23)   lim_{t↓0} t^{−1}(1 − p_t(x, {x})) = λ_x
exists, as does
(8.3.24)   lim_{t↓0} t^{−1} p_t(x, B) = λ_x p(x, B) ,   B ⊆ {x}^c,
for some transition probability p(·, ·). Recall from Theorem 8.3.28 that, with the exception of possible explosion, the converse applies, namely whenever (8.3.23) and (8.3.24) hold, the semi-group p_t(·, ·) corresponds to a (possibly explosive) Markov jump process. We note in passing that while Kolmogorov's backward equation (8.3.21) is well defined for any jump parameters, the existence of a solution which is a Markov semi-group is equivalent to non-explosion of the corresponding Markov jump process.
In particular, in case of bounded jump rates the conditions (8.3.23) and (8.3.24) are equivalent to the Markov process being a Markov pure jump process, and in this setting you are now to characterize the invariant measures for the Markov jump process in terms of its jump parameters (or equivalently, in terms of its generator).
Exercise 8.3.34. Suppose (λ, p) are jump parameters on B-isomorphic state space (S, S) such that sup_{x∈S} λ_x is finite.
(a) Show that a probability measure μ is invariant for the corresponding Markov jump process if and only if μ(Lh) = 0 for the generator L : bS ↦ bS of these jump parameters and all h ∈ bS.
Hint: Combine Exercises 8.3.9 and 8.3.33 (utilizing the boundedness of x ↦ (Lh)(x)).
(b) Deduce that μ is an invariant probability measure for (λ, p) if and only if (μλ)p = μλ, for the measure (μλ)(dx) = λ_x μ(dx).
In particular, the invariant probability measures of a Markov jump process with constant jump rates are precisely the invariant measures for its jump transition probability.
Of particular interest is the following special family of Markov jump processes.
Definition 8.3.35. Real-valued Markov pure jump processes with a constant jump rate λ whose jump transition probability is of the form p(x, B) = P_ξ({z : x + z ∈ B}) for some law P_ξ on (R, B), are called compound Poisson processes. Recall Remark 8.3.29 that a compound Poisson process is of the form X_t = S_{N_t} for a random walk S_n = S_0 + Σ_{k=1}^n ξ_k with i.i.d. {ξ, ξ_k} which are independent of the Poisson process N_t of rate λ.
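The representation X_t = S_{N_t} is straightforward to simulate (a sketch; the rate λ, horizon t, sample size, seed and the choice of a normal jump-size law with Eξ = 0.5 are all arbitrary): with S_0 = 0, Wald's identity gives E[X_t] = λt Eξ.

```python
import numpy as np

rng = np.random.default_rng(6)
lam, t, n = 2.0, 3.0, 200_000
mean_xi = 0.5   # jump-size law P_xi = N(0.5, 1), an arbitrary illustrative choice

N_t = rng.poisson(lam * t, size=n)   # number of jumps of each path by time t
# Draw all jumps at once; path i sums its own block of N_t[i] consecutive jumps.
total = np.concatenate([[0.0], np.cumsum(rng.normal(mean_xi, 1.0, size=N_t.sum()))])
idx = np.cumsum(N_t)
X_t = total[idx] - total[idx - N_t]  # X_t = S_{N_t} with S_0 = 0

mean_X = X_t.mean()                  # near lam * t * E[xi]
```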
To this end, note that N_{t_0} = 0 and conditional on the event {N_{t_i} = m_i} for m_i = Σ_{j=1}^i r_j and fixed r = (r_1, ..., r_n) ∈ Z_+^n, we have that D_i = Σ_{k=m_{i−1}+1}^{m_i} ξ_k are mutually independent, with D_i then having the same distribution as the random walk S_{r_i} starting at S_0 = 0. So, for any f_i ∈ bB, by the tower property and the mutual independence of {N_{t_i} − N_{t_{i−1}}, 1 ≤ i ≤ n},
E_x[∏_{i=1}^n f_i(D_i)] = E[E_x(∏_{i=1}^n f_i(D_i)|F^N)] = Σ_{r∈Z_+^n} ∏_{i=1}^n P(N_{t_i} − N_{t_{i−1}} = r_i) E_0[f_i(S_{r_i})]
= ∏_{i=1}^n Σ_{r_i=0}^∞ P(N_{t_i} − N_{t_{i−1}} = r_i) E_0[f_i(S_{r_i})] = ∏_{i=1}^n E_x[f_i(D_i)] ,
the contributions
X_t^(j) = Σ_{k=1}^{N_t} ξ_k I_{B_j}(ξ_k)
Proof. While one can directly prove this result along the lines of Exercise 3.4.16, we resort to an indirect alternative, whereby we set X̂_t = X_0 + Σ_{j=1}^m Y_t^(j) for the independent compound Poisson processes Y_t^(j) of jump rates λ^(j) and i.i.d. jump sizes {ξ_k^(j)}, starting at Y_0^(j) = 0. By construction, X̂_t is a pure jump process whose jump times {T_k(ω)} are contained in the union over j = 1, ..., m of the isolated jump times {T_k^(j)(ω)} of t ↦ Y_t^(j)(ω). Recall that each T_k^(j) has the gamma density of parameters α = k and λ^(j) (see Exercise 1.4.46 and Definition 3.4.8). Therefore, by the independence of {Y_t^(j), t ≥ 0}, w.p.1. no two jump times among {T_k^(j), j, k ≥ 1} are the same, in which case X̂_t^(j) = Y_t^(j) for all j and t ≥ 0 (as the jump sizes of each Y_t^(j) are in the disjoint element B_j of the specified finite partition of R \ {0}). With the R^m-valued process (X̂^(1), ..., X̂^(m)) being indistinguishable from (Y^(1), ..., Y^(m)), it thus suffices to show that {X̂_t, t ≥ 0} is a compound Poisson process of the specified jump rate λ and jump size law P_ξ.
To this end, recall from Proposition 8.3.36 that each of the processes Y_t^(j) has stationary independent increments and due to their independence, the same applies for {X̂_t, t ≥ 0}, which is thus a real-valued homogeneous Markov process (see Proposition 8.3.5). Next, note that since λ = Σ_{j=1}^m λ^(j) and for all θ ∈ R,
Σ_{j=1}^m λ^(j) Φ^(j)(θ) = λ Φ_ξ(θ) ,
we have
E_x[e^{iθX̂_t}] = e^{iθx} ∏_{j=1}^m E[e^{iθY_t^(j)}] = e^{iθx} ∏_{j=1}^m e^{λ^(j) t (Φ^(j)(θ) − 1)} = e^{iθx} e^{λt(Φ_ξ(θ) − 1)} .
That is, denoting by p_t(·, ·) and p̂_t(·, ·) the Markov semi-groups of {X_t, t ≥ 0} and {X̂_t, t ≥ 0} respectively, we found that per fixed x ∈ R and t ≥ 0 the transition probabilities p_t(x, ·) and p̂_t(x, ·) have the same characteristic function. Consequently, by Lévy's inversion theorem p_t(·, ·) = p̂_t(·, ·) for all t ≥ 0, i.e., these two semi-groups are identical. Obviously, this implies that the Markov pure jump processes X_t and X̂_t have the same jump parameters (see (8.3.23) and (8.3.24)), so as claimed {X̂_t, t ≥ 0} is a compound Poisson process of jump rate λ and jump size law P_ξ.
Exercise 8.3.39. Suppose {Y_t, t ≥ 0} is a compound Poisson process of jump rate λ > 0, integrable Y_0 and jump size law P_ξ for some integrable ξ > 0.
(a) Show that for any integrable F_t^Y-Markov time τ, EY_τ = EY_0 + λ Eτ Eξ.
Hint: Consider the F_t^Y-martingales X_{t∧n}, n ≥ 1, where X_t = Y_t − λt Eξ.
(b) Suppose that Y_0 and ξ are square integrable and let τ_r = inf{t ≥ 0 : Z_t^(r) > Y_t}, where Z_t^(r) = B_t + rt for a standard Brownian Markov process (B_t, t ≥ 0), independent of (Y_t, t ≥ 0). Show that for any r ∈ R,
Eτ_r = E(Y_0)_+ / (r − λEξ)_+
(c) In case $S$ is a finite set, show that the matrix $P_s$ of entries $p_s(x,z)$ is given by $P_s = e^{sQ} = \sum_{k=0}^{\infty} \frac{s^k}{k!} Q^k$, where $Q$ is the matrix of entries $q(x,y)$.
The formula $P_s = e^{sQ}$ explains why $Q$, and more generally $\mathcal{L}$, is called the generator of the semi-group $P_s$.
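For a concrete (hypothetical, not from the text) two-state generator, the series $P_s = \sum_k \frac{s^k}{k!} Q^k$ can be truncated numerically; each row of the resulting $P_s$ sums to one, as it must for a Markov semi-group. A plain-Python sketch:

```python
import math

def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(Q, s, terms=60):
    """Truncated series e^{sQ} = sum_k (s^k / k!) Q^k."""
    n = len(Q)
    P = [[float(i == j) for j in range(n)] for i in range(n)]   # identity
    term = [row[:] for row in P]                                 # holds Q^k
    for k in range(1, terms):
        term = mat_mul(term, Q)
        c = s ** k / math.factorial(k)
        P = [[P[i][j] + c * term[i][j] for j in range(n)] for i in range(n)]
    return P

# assumed two-state generator: leave state 0 at rate 2, state 1 at rate 1
Q = [[-2.0, 2.0], [1.0, -1.0]]
P = mat_exp(Q, 1.0)
print([round(sum(row), 6) for row in P])   # each row of P_s sums to 1
```

Here the eigenvalues of $Q$ are $0$ and $-3$, so the exact entry is $p_1(0,0) = 1/3 + (2/3)e^{-3}$, which the truncated series matches to high accuracy.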
From Exercise 8.3.34 we further deduce that, at least for bounded jump rates, an invariant probability measure for the Markov jump process is uniquely determined by the function $\pi : S \mapsto [0,1]$ such that $\sum_x \pi(x) = 1$ and
(8.3.26) $\qquad \lambda_y \pi(y) = \sum_{x \in S} \pi(x) \lambda_x p(x,y) \, , \qquad \forall y \in S \, .$
For constant positive jump rates this condition coincides with the characterization
(6.2.5) of invariant probability measures for the jump transition probability. Consequently, for such jump processes the invariant, reversible and excessive measures
as well as positive and null recurrent states are defined as the corresponding objects for the jump transition probability and obey the relations explored already in
Subsection 6.2.2.
Remark. While we do not pursue this further, we note in passing that more generally, a measure $\pi(\cdot)$ is reversible for a Markov jump process with countable state space $S$ if and only if $\lambda_y \pi(y) p(y,x) = \pi(x) \lambda_x p(x,y)$ for any $x, y \in S$ (so any reversible probability measure is by (8.3.26) invariant for the Markov jump process). Similarly, in general we call $x \in S$ with $\lambda_x = 0$ an absorbing, hence positive recurrent, state and say that a non-absorbing state is positive recurrent if it has finite mean return time. That is, if $E_x T_x < \infty$ for the first return time $T_x$ to state $x$ (after the first jump). It can then be shown, in analogy with Proposition 6.2.41, that any invariant probability measure $\pi(\cdot)$ is zero outside the positive recurrent states and if its support is an irreducible class $R$ of non-absorbing positive recurrent states, then $\pi(z) = 1/(\lambda_z E_z[T_z])$ (see [GS01, Section 6.9] for more details).
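The last formula is easy to check by simulation. A hypothetical two-state example (assumed rates, not from the text): with $\lambda_0 = 2$, $\lambda_1 = 1$ and $p(0,1) = p(1,0) = 1$, balance gives $\pi = (1/3, 2/3)$ and $E_0[T_0] = 1/2 + 1 = 3/2$, so $1/(\lambda_0 E_0[T_0]) = 1/3 = \pi(0)$:

```python
import math
import random

random.seed(5)

def exp_rv(rate):
    """Exponential random variable of the given rate."""
    return -math.log(random.random()) / rate

# return time to 0: Exp(2) holding time at 0, then Exp(1) holding time at 1
trials = 20000
mean_T0 = sum(exp_rv(2.0) + exp_rv(1.0) for _ in range(trials)) / trials

pi0 = 1.0 / (2.0 * mean_T0)   # pi(0) = 1/(lambda_0 * E_0[T_0]), should be ~1/3
print(round(pi0, 2))
```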
To practice your understanding, the next exercise explores in more depth the
important family of birth and death Markov jump processes (or in short, birth and
death processes).
Exercise 8.3.41 (Birth and death processes). A birth and death process is a Markov jump process $\{X_t\}$ on $S = \{0, 1, 2, \ldots\}$ for which $\{Z_n\}$ is a birth and death chain. That is, $p(x, x+1) = p_x = 1 - p(x, x-1)$ for all $x \in S$ (where of course $p_0 = 1$). Assuming $\lambda_x > 0$ for all $x$ and $p_x \in (0,1)$ for all $x > 0$, let
$$b(k) = \frac{\lambda_0}{\lambda_k} \prod_{i=1}^{k} \frac{p_{i-1}}{1 - p_i} \, .$$
Show that $\{X_t\}$ is irreducible and has an invariant probability measure if and only if $c = \sum_{k \ge 0} b(k)$ is finite, in which case its invariant measure is $\pi(k) = b(k)/c$.
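A quick numerical sanity check of this formula (a sketch for a hypothetical M/M/1-type choice of rates, not from the text): taking birth rate $1$ and death rate $2$, so $\lambda_0 = 1$, $\lambda_x = 3$ and $p_x = 1/3$ for $x > 0$, one gets $b(k) = 2^{-k}$, and the resulting $\pi(k) = b(k)/c$ satisfies the invariance equation (8.3.26):

```python
lam, mu = 1.0, 2.0                                    # assumed birth/death rates
rates = lambda x: lam if x == 0 else lam + mu         # jump rate lambda_x
p_up = lambda x: 1.0 if x == 0 else lam / (lam + mu)  # p(x, x+1)

K = 200                                               # truncation level

def b(k):
    out = rates(0) / rates(k)
    for i in range(1, k + 1):
        out *= p_up(i - 1) / (1.0 - p_up(i))
    return out

bs = [b(k) for k in range(K)]
c = sum(bs)
pi = [v / c for v in bs]

# check lambda_y pi(y) = sum_x pi(x) lambda_x p(x, y) at an interior state y
y = 5
lhs = rates(y) * pi[y]
rhs = (pi[y - 1] * rates(y - 1) * p_up(y - 1)
       + pi[y + 1] * rates(y + 1) * (1.0 - p_up(y + 1)))
print(abs(lhs - rhs) < 1e-9)
```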
The next exercise deals with independent random sampling along the path of a Markov pure jump process.
Exercise 8.3.42. Let $Y_k = X_{\widetilde{T}_k}$, $k = 0, 1, \ldots$, where $\widetilde{T}_k = \sum_{i=1}^k e_i$ and the i.i.d. $e_i \ge 0$ are independent of the Markov pure jump process $\{X_t, t \ge 0\}$.
(a) Show that $\{Y_k\}$ is a homogeneous Markov chain and verify that any invariant probability measure for $\{X_t\}$ is also an invariant measure for $\{Y_k\}$.
(b) Show that in case of constant jump rates $\lambda_x = \lambda$ and each $e_i$ having the exponential distribution of parameter $\widetilde{\lambda} > 0$, one has the representation $Y_k = Z_{L_k}$ of sampling the embedded chain $\{Z_n\}$ at $L_k = \sum_{i=1}^k (\eta_i - 1)$ for i.i.d. $\eta_i \ge 1$, each having the Geometric distribution of success probability $p = \widetilde{\lambda}/(\widetilde{\lambda} + \lambda)$.
(c) Conclude that if $\{\widetilde{T}_k\}$ are the jump times of a Poisson process of rate $\widetilde{\lambda} > 0$ which is independent of the compound Poisson process $\{X_t\}$, then $\{Y_k\}$ is a random walk, the increment of which has the law of $\sum_{i=1}^{\eta_1 - 1} \xi_i$.
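The geometric success probability in part (b) reflects a standard competing-exponentials fact: the sampling clock of rate $\widetilde{\lambda}$ rings before the next jump of rate $\lambda$ with probability $\widetilde{\lambda}/(\widetilde{\lambda}+\lambda)$. A small simulation with assumed rates (hypothetical values, not from the text):

```python
import math
import random

random.seed(6)

lam, lam_t = 2.0, 1.0     # jump rate lambda, sampling rate lambda-tilde
trials = 30000

# count how often the Exp(lam_t) sampling clock beats the Exp(lam) jump clock;
# this frequency estimates p = lam_t / (lam_t + lam) = 1/3
wins = sum(1 for _ in range(trials)
           if -math.log(random.random()) / lam_t
           < -math.log(random.random()) / lam)
p_hat = wins / trials
print(round(p_hat, 2))
```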
Compare your next result with part (a) of Exercise 8.2.46.
CHAPTER 9
From this corollary and the Brownian time-inversion property we further deduce both Blumenthal's 0-1 law about the $P_x$-triviality of the $\sigma$-algebra $\mathcal{F}_{0^+}^W$ and its analog about the $P_x$-triviality of the tail $\sigma$-algebra of the Wiener process (compare the latter with Kolmogorov's 0-1 law). To this end, we first extend the definition of the tail $\sigma$-algebra, as in Definition 1.4.9, to continuous time S.P.-s.
Definition 9.1.3. Associate with any continuous time S.P. $\{X_t, t \ge 0\}$ the canonical future $\sigma$-algebras $\mathcal{T}_t^X = \sigma(X_s, s \ge t)$, with the corresponding tail $\sigma$-algebra of the process being $\mathcal{T}^X = \bigcap_{t \ge 0} \mathcal{T}_t^X$.
Proposition 9.1.4 (Blumenthal's 0-1 law). Let $P_x$ denote the law of the Wiener process $\{W_t, t \ge 0\}$ starting at $W_0 = x$ (identifying $(\Omega, \mathcal{F}^W)$ with $C([0,\infty))$ and its Borel $\sigma$-algebra). Then, $P_x(A) \in \{0,1\}$ for each $A \in \mathcal{F}_{0^+}^W$ and $x \in \mathbb{R}$. Further, if $A \in \mathcal{T}^W$ then either $P_x(A) = 0$ for all $x$ or $P_x(A) = 1$ for all $x$.
Proof.
Consequently, applying our first claim for the canonical filtration of the standard Wiener process $\{X_t\}$ we see that $P_0(A) \in \{0,1\}$ for any $A \in \mathcal{F}_{0^+}^X = \mathcal{T}^W$. Moreover, since $A \in \mathcal{T}_1^W$, it is of the form $I_A = I_D \circ \theta_1$ for some $D \in \mathcal{F}^W$, so by the tower and Markov properties,
$$P_x(A) = E_x[I_D \circ \theta_1(\omega(\cdot))] = E_x[P_{W_1}(D)] = \int p_1(x,y) P_y(D) \, dy \, ,$$
for the strictly positive Brownian transition kernel $p_1(x,y) = \exp(-(x-y)^2/2)/\sqrt{2\pi}$. If $P_0(A) = 0$ then necessarily $P_y(D) = 0$ for Lebesgue almost every $y$, hence also $P_x(A) = 0$ for all $x \in \mathbb{R}$. Conversely, if $P_0(A) = 1$ then $P_0(A^c) = 0$ and with $A^c \in \mathcal{T}^W$, by the preceding argument $1 - P_x(A) = P_x(A^c) = 0$ for all $x \in \mathbb{R}$.
$$\liminf_{t \to \infty} \frac{1}{\sqrt{t}}\, W_t = -\infty \, .$$
Proof. Since $P_0(\tau_{0^+} \le t) \ge P_0(W_t > 0) = 1/2$ for all $t > 0$, also $P_0(\tau_{0^+} = 0) \ge 1/2$. Further, $\tau_{0^+}$ is an $\mathcal{F}_t^W$-Markov time (see Proposition 8.1.15). Hence, $\{\tau_{0^+} = 0\} = \{\tau_{0^+} \le 0\} \in \mathcal{F}_{0^+}^W$ and from Blumenthal's 0-1 law it follows that $P_0(\tau_{0^+} = 0) = 1$. By the symmetry property of the standard Wiener process (see part (a) of Exercise 9.1.1), also $P_0(\tau_{0^-} = 0) = 1$. Combining these two facts we deduce that $P_0$-a.s. there exist $t_n \downarrow 0$ and $s_n \downarrow 0$ such that $W_{t_n} > 0 > W_{s_n}$ for all $n$. By sample path continuity, this implies the existence of $u_n \downarrow 0$ such that $W_{u_n} = 0$ for all $n$. Hence, $P_0(T_0 = 0) = 1$. As for the second claim, note that for any $r > 0$,
$$P_0(W_n \ge r\sqrt{n} \text{ i.o.}) \ge \limsup_{n \to \infty} P_0(W_n \ge r\sqrt{n}) = P_0(W_1 \ge r) > 0 \, ,$$
where the first inequality is due to Exercise 2.2.2 and the equality holds by the scaling property of $\{W_t\}$ (see part (d) of Exercise 9.1.1). Since $\{W_n \ge r\sqrt{n} \text{ i.o.}\} \in \mathcal{T}^W$ we thus deduce from Blumenthal's 0-1 law that $P_x(W_n \ge r\sqrt{n} \text{ i.o.}) = 1$ for any $x \in \mathbb{R}$. Considering $r_k \uparrow \infty$ this implies that $\limsup_{t \to \infty} W_t/\sqrt{t} = \infty$ with $P_x$-probability one. Further, by the symmetry property of the standard Wiener process, the same applies for the $\liminf$ claim.
fixing $A \in \mathcal{F}_{\tau^+}$, by the tower property and the strong Markov property (8.3.12) of the Brownian Markov process $(W_t, \mathcal{F}_t)$ we have that
$$E[I_A h(B_\cdot)] = E[I_A \widetilde{h}(W_{\tau + \cdot})] = E[I_A g_{\widetilde{h}}(W_\tau)] = P(A)\, E[h(\widetilde{W}_\cdot)] \, .$$
In particular, considering $A = \Omega$ we deduce that the S.P. $\{B_t, t \ge 0\}$ has the f.d.d. and hence the law of the standard Wiener process $\{\widetilde{W}_t\}$. Further, recall Lemma 7.1.7 that for any $F \in \mathcal{F}^B$, the indicator $I_F$ is of the form $I_F = h(B_\cdot)$ for some $h \in b\mathcal{B}_{[0,\infty)}$, in which case by the preceding $P(A \cap F) = P(A)P(F)$. Since this applies for any $F \in \mathcal{F}^B$ and $A \in \mathcal{F}_{\tau^+}$ we have established the $P$-independence of the two $\sigma$-algebras, namely, the stated independence of $\{B_t, t \ge 0\}$ and $\mathcal{F}_{\tau^+}$.
Beware that to get such a regeneration it is imperative to start with a Markov time $\tau$. To convince yourself, solve the following exercise.
Exercise 9.1.7. Suppose $\{W_t, t \ge 0\}$ is a standard Wiener process.
(a) Provide an example of a finite a.s. random variable $\tau \ge 0$ such that $\{W_{\tau+t} - W_\tau, t \ge 0\}$ does not have the law of a standard Brownian motion.
(b) Provide an example of a finite $\mathcal{F}_t^W$-stopping time $\tau$ such that $[\tau]$ is not an $\mathcal{F}_t^W$-stopping time.
Combining Corollary 9.1.6 with the fact that w.p.1. $\tau_{0^+} = 0$, you are next to prove the somewhat surprising fact that w.p.1. a Brownian Markov process enters $(b, \infty)$ as soon as it exits $(-\infty, b)$.
Exercise 9.1.8. Let $\tau_{b^+} = \inf\{t \ge 0 : W_t > b\}$ for $b \ge 0$ and a Brownian Markov process $(W_t, \mathcal{F}_t)$.
(a) Show that $P_0(\tau_b \ne \tau_{b^+}) = 0$.
(b) Suppose $W_0 = 0$ and a finite random variable $H \ge 0$ is independent of $\mathcal{F}^W$. Show that $\{\tau_H \ne \tau_{H^+}\} \in \mathcal{F}$ has probability zero.
The strong Markov property of the Wiener process also provides the probability that starting at $x \in (c,d)$ it reaches level $d$ before level $c$ (i.e., the event $\{W(\tau_{c,d}) = d\}$ for $W(0) = x$).
[Figure 1: a Brownian sample path and its reflection about the level $b$ after the passage time $T_b = s$.]
Remark. The reflection principle was stated by P. Lévy [Lev39] and first rigorously proved by Hunt [Hun56]. It is attributed to D. André [And1887] who solved the ballot problem of Exercise 5.5.30 by a similar symmetry argument (leading also to the reflection principle for symmetric random walks, as in Exercise 6.1.19).
Proof. Recall Proposition 8.1.15 that $\tau_b$ is a stopping time for $\mathcal{F}_t^W$. Further, since $b > 0 = W_0$ and $s \mapsto W_s$ is continuous, clearly $\tau_b = T_b$ and $W_{T_b} = b$ whenever $T_b$ is finite. Heuristically, given that $T_b = s < u$ we have that $W_s = b$ and by reflection symmetry of the Brownian motion, expect the conditional law of $W_u - W_s$ to retain its symmetry around zero, as illustrated in Figure 1. This of course leads to the prediction that for any $u, b > 0$,
(9.1.3) $\qquad P(T_b < u, W_u > b) = \frac{1}{2} P(T_b < u) \, .$
With $W_0 = 0$, by sample path continuity $\{W_u > b\} \subseteq \{T_b < u\}$, so the preceding prediction implies that
$$P(T_b < u) = 2 P(T_b < u, W_u > b) = 2 P(W_u > b) \, .$$
The supremum $M_t(\omega)$ of the continuous function $s \mapsto W_s(\omega)$ over the compact interval $[0,t]$ is attained at some $s \in [0,t]$, hence the identity $\{M_t \ge b\} = \{\tau_b \le t\}$ holds for all $t, b > 0$. Thus, considering $u \downarrow t > 0$ leads in view of the continuity of $(u,b) \mapsto P(W_u > b)$ to the statement (9.1.2) of the proposition. Turning to rigorously prove (9.1.3), we rely on the strong Markov property of the standard Wiener process for the $\mathcal{F}_t^W$-stopping time $T_b$ and the functional $h(s, x(\cdot)) = I_A(s, x(\cdot))$, where $A = \{(s, x(\cdot)) : x(\cdot) \in C(\mathbb{R}_+), \ s \in [0,u) \text{ and } x(u-s) > b\}$. To this end, note that $F_{y,a,a'} = \{x \in C([0,\infty)) : x(u-s) \ge y \text{ for all } s \in [a,a']\}$ is closed (with respect to uniform convergence on compact subsets of $[0,\infty)$), and $x(u-s) > b$ for some $s \in [0,u)$ if and only if $x(\cdot) \in F_{b_k,q,q'}$ for some $b_k = b + 1/k$, $k \ge 1$ and $q < q' \in \mathbb{Q} \cap [0,u)$. So, $A$ is the countable union of closed sets $[q,q'] \times F_{b_k,q,q'}$, hence
Borel measurable on $[0,\infty) \times C([0,\infty))$. Next recall that by the definition of the set $A$,
$$g_h(s,b) = E_b[I_A(s, W_\cdot)] = I_{[0,u)}(s)\, P_b(W_{u-s} > b) = \frac{1}{2} I_{[0,u)}(s) \, .$$
Further, $h(s, x(s+\cdot)) = I_{[0,u)}(s) I_{\{x(u) > b\}}$ and $W_{T_b} = b$ whenever $T_b$ is finite, so taking the expectation of (8.3.13) yields (for our choices of $h(\cdot,\cdot)$ and $\tau$), the identity
$$E[I_{\{T_b < u\}} I_{\{W_u > b\}}] = E[h(T_b, W_{T_b + \cdot})] = E[g_h(T_b, W_{T_b})] = E[g_h(T_b, b)] = \frac{1}{2} E[I_{\{T_b < u\}}] \, ,$$
which is precisely (9.1.3).
which is precisely (9.1.3).
Since t1/2 Wt = G, a standard normal variable of continuous distribution function, we deduce from the reflection principle that the distribution functions
of Tb
and Mt are continuous and such that FTb (t) = 1 FMt (b) = 2(1 FG (b/ t)). In
particular, P(Tb > t) 0 as t , hence Tb is a.s. finite. We further have the
corresponding explicit probability density functions on [0, ),
b
b2
(9.1.4)
fTb (t) = FTb (t) =
e 2t ,
t
2t3
2 b2
(9.1.5)
FMt (b) =
e 2t .
fMt (b) =
b
2t
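As a quick sanity check (a hypothetical Monte Carlo sketch, not part of the text), the identity $P(M_t \ge b) = 2P(W_t > b)$ underlying these densities can be tested by approximating the Wiener path with a scaled Gaussian random walk:

```python
import math
import random

random.seed(1)

def brownian_max_and_end(n):
    """Approximate (M_1, W_1) by a Gaussian walk with n steps of variance 1/n."""
    s, m, step = 0.0, 0.0, 1.0 / math.sqrt(n)
    for _ in range(n):
        s += step * random.gauss(0.0, 1.0)
        m = max(m, s)
    return m, s

b, trials, n = 1.0, 4000, 400
hits_max = hits_end = 0
for _ in range(trials):
    m, w = brownian_max_and_end(n)
    hits_max += (m >= b)
    hits_end += (w > b)

p_max = hits_max / trials   # estimates P(M_1 >= 1), about 0.32
p_end = hits_end / trials   # estimates P(W_1 > 1), about 0.16
print(round(p_max, 3), round(p_end, 3))
```

The discretization slightly underestimates the running maximum, so only approximate agreement should be expected.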
Remark. From the preceding formula for the density of $T_b = \tau_b$ you can easily check that it has infinite expected value, in contrast with the exit times $\tau_{a,b}$ of bounded intervals $(a,b)$, which have finite moments (see part (c) of Exercise 8.2.36 for finiteness of the second moment and note that the same method extends to all moments). Recall that in part (b) of Exercise 8.2.35 you have already found that the Laplace transform of the density of $T_b$ is
$$\int_0^\infty e^{-st} f_{T_b}(t) \, dt = e^{-b\sqrt{2s}}$$
(and for inverting Laplace transforms, see Exercise 2.2.15). Further, using the density of passage times, you can now derive the well-known arc-sine law for the last exit of the Brownian motion from zero by time one.
Exercise 9.1.11. For the standard Wiener process $\{W_t\}$ and any $t > 0$, consider the time $L_t = \sup\{s \in [0,t] : W_s = 0\}$ of last exit from zero by $t$, and the Markov time $R_t = \inf\{s > t : W_s = 0\}$ of first return to zero after $t$.
(a) Verify that $P_x(T_y > u) = P(T_{|y-x|} > u)$ for any $x, y \in \mathbb{R}$, and with $p_t(x,y)$ denoting the Brownian transition probability kernel, show that for $u > 0$ and $0 < u < t$, respectively,
$$P(R_t > t + u) = \int p_t(0,y)\, P(T_{|y|} > u) \, dy \, ,$$
$$P(L_t \le u) = \int p_u(0,y)\, P(T_{|y|} > t - u) \, dy \, .$$
(c) Show that $L_t$ has the arc-sine law $P(L_t \le u) = (2/\pi) \arcsin(\sqrt{u/t})$ and hence the density $f_{L_t}(u) = 1/(\pi \sqrt{u(t-u)})$ on $[0,t]$.
(d) Find the joint probability density function of $(L_t, R_t)$.
Remark. Knowing the law of $L_t$ is quite useful, for $\{L_t > u\}$ is just the event $\{W_s = 0 \text{ for some } s \in (u,t]\}$. You have encountered the arc-sine law in Exercise 3.2.16 (where you proved the discrete reflection principle for the path of the symmetric srw). Indeed, as shown in Section 9.2 by Donsker's invariance principle, these two arc-sine laws are equivalent.
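The equivalence of the two arc-sine laws is visible in simulation. A hypothetical sketch (not from the text): for the symmetric srw, the fraction of time to the last visit of zero should satisfy $P(L \le 1/2) = (2/\pi)\arcsin(\sqrt{1/2}) = 1/2$:

```python
import random

random.seed(2)

def last_zero_fraction(n):
    """Fraction of time at which a 2n-step symmetric walk last visits 0."""
    s, last = 0, 0
    for k in range(1, 2 * n + 1):
        s += random.choice((-1, 1))
        if s == 0:
            last = k
    return last / (2 * n)

trials = 3000
frac = sum(1 for _ in range(trials)
           if last_zero_fraction(200) <= 0.5) / trials
# arc-sine law predicts probability (2/pi) arcsin(sqrt(1/2)) = 1/2
print(round(frac, 2))
```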
Here are a few additional results about passage times and running maxima.
Exercise 9.1.12. Generalizing the proof of (9.1.3), deduce that for a standard Wiener process, any $u > 0$ and $a_1 < a_2 \le b$,
(9.1.6) $\qquad P(T_b < u, \, a_1 < W_u < a_2) = P(2b - a_2 < W_u < 2b - a_1) \, ,$
and from it the joint density of $(W_t, M_t)$,
$$\frac{2(2b-a)}{\sqrt{2\pi t^3}}\, e^{-\frac{(2b-a)^2}{2t}} \, .$$
(c) Deduce that $\{T_b, b \ge 0\}$ is a S.P. of stationary, non-negative independent increments, whose Markov semi-group has the transition probability kernel
$$\widehat{q}_t(x,y) = \frac{t}{\sqrt{2\pi (y-x)_+^3}}\, e^{-\frac{t^2}{2(y-x)_+}} \, .$$
[Figure: the first 20 steps of a random walk path, linearly interpolated.]
(9.2.1) $\qquad S(t) = \sum_{k=1}^{[t]} \xi_k + (t - [t]) \xi_{[t]+1} \, ,$
and $\{\xi_k\}$ are i.i.d. Recall Exercise 3.5.18 that by the clt, if $E\xi_1 = 0$ and $E\xi_1^2 = 1$, then as $n \to \infty$ the f.d.d. of the S.P. $\widehat{S}_n(\cdot)$ of continuous sample path, converge weakly to those of the standard Wiener process. Since f.d.d. uniquely determine the law of a S.P. it is thus natural to expect also to have the stronger, convergence in distribution, as defined next.
Definition 9.2.1. We say that S.P. $\{X_n(t), t \ge 0\}$ of continuous sample functions converge in distribution to a S.P. $\{X_\infty(t), t \ge 0\}$, denoted $X_n(\cdot) \stackrel{D}{\to} X_\infty(\cdot)$, if the corresponding laws converge weakly in the topological space $S$ consisting of $C([0,\infty))$ equipped with the topology of uniform convergence on compact subsets of $[0,\infty)$. That is, if $g(X_n(\cdot)) \stackrel{D}{\to} g(X_\infty(\cdot))$ whenever $g : C([0,\infty)) \mapsto \mathbb{R}$ Borel measurable, is such that w.p.1. the sample function of $X_\infty(\cdot)$ is not in the set $D_g$ of points of discontinuity of $g$ (with respect to uniform convergence on compact subsets of $[0,\infty)$).
As we state now and prove in the sequel, such functional clt, also known as Donsker's invariance principle, indeed holds.
Since $h(x(\cdot)) = f(x(t_1), \ldots, x(t_k))$ is continuous and bounded on $C([0,\infty))$ for any $f \in C_b(\mathbb{R}^k)$ and each finite subset $\{t_1, \ldots, t_k\}$ of $[0,\infty)$, convergence in distribution of S.P. of continuous sample path implies the weak convergence of their f.d.d. But, beware that the convergence of f.d.d. does not necessarily imply convergence in distribution, even for S.P. of continuous sample functions.
Exercise 9.2.3. Give a counter-example to show that weak convergence of the f.d.d. of S.P. $\{X_n(\cdot)\}$ of continuous sample functions to those of S.P. $\{X_\infty(\cdot)\}$ of continuous sample functions, does not imply that $X_n(\cdot) \stackrel{D}{\to} X_\infty(\cdot)$.
Hint: Try $X_n(t) = nt\, \mathbf{1}_{[0,1/n]}(t) + (2 - nt)\, \mathbf{1}_{(1/n,2/n]}(t)$.
Nevertheless, with $S = (C([0,\infty)), \rho)$ a complete, separable metric space (c.f. Exercise 7.2.9), we have the following useful partial converse as an immediate consequence of Prohorov's theorem.
Proposition 9.2.4. If the laws of S.P. $\{X_n(\cdot)\}$ of continuous sample functions are uniformly tight in $C([0,\infty))$ and for $n \to \infty$ the f.d.d. of $\{X_n(\cdot)\}$ converge weakly to the f.d.d. of $\{X_\infty(\cdot)\}$, then $X_n(\cdot) \stackrel{D}{\to} X_\infty(\cdot)$.
Proof. Recall part (e) of Theorem 3.5.2, that by the Portmanteau theorem $X_n(\cdot) \stackrel{D}{\to} X_\infty(\cdot)$ as in Definition 9.2.1, if and only if the corresponding laws $\nu_n = P \circ X_n^{-1}$ converge weakly on the metric space $S = C([0,\infty))$ (and its Borel $\sigma$-algebra). That is, if and only if $E h(X_n(\cdot)) \to E h(X_\infty(\cdot))$ for each $h$ continuous and bounded on $S$ (also denoted by $\nu_n \stackrel{w}{\to} \nu_\infty$, see Definition 3.2.17). Let $\{\nu_n^{(m)}\}$ be a subsequence of $\{\nu_n\}$. Since $\{\nu_n\}$ is uniformly tight, so is $\{\nu_n^{(m)}\}$. Thus, by Prohorov's theorem, there exists a further sub-subsequence $\{\nu_n^{(m_k)}\}$ such that $\nu_n^{(m_k)}$ converges weakly to a probability measure $\widetilde{\nu}$ on $S$. Recall Proposition 7.1.8 that the f.d.d. uniquely determine the law of S.P. of continuous sample functions. Hence, from the assumed convergence of f.d.d. of $\{X_n(\cdot)\}$ to those of $\{X_\infty(\cdot)\}$, we deduce that $\widetilde{\nu} = P \circ X_\infty^{-1} = \nu_\infty$. Consequently, $E h(X_n^{(m_k)}(\cdot)) \to E h(X_\infty(\cdot))$ for each $h \in C_b(S)$ (see Exercise 7.2.9). Fixing $h \in C_b(S)$ note that we have just shown that every subsequence $y_n^{(m)}$ of the sequence $y_n = E h(X_n(\cdot))$ has a further sub-subsequence $y_n^{(m_k)}$ that converges to $y_\infty$. Hence, we deduce by Lemma 2.2.11 that $y_n \to y_\infty$. Since this holds for all $h \in C_b(S)$, we conclude that $X_n(\cdot) \stackrel{D}{\to} X_\infty(\cdot)$.
Having Proposition 9.2.4 and the convergence of f.d.d. of $\widehat{S}_n(\cdot)$, Donsker's invariance principle is a consequence of the uniform tightness in $S$ of the laws of these S.P.-s. In view of Definition 3.2.31, we prove this uniform tightness by exhibiting compact sets $K_\eta$ such that $\sup_n P(\widehat{S}_n \notin K_\eta) \to 0$ as $\eta \downarrow 0$. To this end, recall the following classical result of functional analysis (for a proof see [KaS97, Theorem 2.4.9] or the more general version provided in [Dud89, Theorem 2.4.7]).
Theorem 9.2.5 (Arzelà-Ascoli theorem). A set $K \subseteq C([0,\infty))$ has compact closure with respect to uniform convergence on compact intervals, if and only if $\sup_{x \in K} |x(0)|$ is finite and for $t > 0$ fixed, $\sup_{x \in K} \mathrm{osc}_{t,\delta}(x(\cdot)) \to 0$ as $\delta \downarrow 0$, where
(9.2.2) $\qquad \mathrm{osc}_{t,\delta}(x(\cdot)) = \sup_{0 \le h \le \delta} \ \sup_{0 \le s \le s+h \le t} |x(s+h) - x(s)| \, ,$
is just the maximal absolute increment of $x(\cdot)$ over all pairs of times in $[0,t]$ which are within distance $\delta$ of each other.
The Arzelà-Ascoli theorem suggests the following strategy for proving uniform tightness.
Exercise 9.2.6. Let $S$ denote the set $C([0,\infty))$ equipped with the topology of uniform convergence on compact intervals, and consider its subsets $F_{r,\delta} = \{x(\cdot) : x(0) = 0, \ \mathrm{osc}_{r,\delta}(x(\cdot)) \le 1/r\}$ for $\delta > 0$ and integer $r \ge 1$.
(a) Verify that the functional $x(\cdot) \mapsto \mathrm{osc}_{t,\delta}(x(\cdot))$ is continuous on $S$ per fixed $t$ and $\delta$ and further that per $x(\cdot)$ fixed, the function $\mathrm{osc}_{t,\delta}(x(\cdot))$ is non-decreasing in $t$ and in $\delta$. Deduce that $F_{r,\delta}$ are closed sets and for any $\delta_r \downarrow 0$, the intersection $\bigcap_r F_{r,\delta_r}$ is a compact subset of $S$.
(b) Show that if S.P.-s $\{X_n(t), t \ge 0\}$ of continuous sample functions are such that $X_n(0) = 0$ for all $n$ and for any $r \ge 1$,
$$\lim_{\delta \downarrow 0} \sup_{n \ge 1} P(\mathrm{osc}_{r,\delta}(X_n(\cdot)) > r^{-1}) = 0 \, ,$$
then the laws of $\{X_n(\cdot)\}$ are uniformly tight in $S$.
for $S(t) = \sum_{k=1}^{[t]} \xi_k + (t - [t]) \xi_{[t]+1}$ of (9.2.1).
Proof. Fixing $r \ge 1$, let $q_{n,\delta} = P(\mathrm{osc}_{nr,n\delta}(S(\cdot)) > r^{-1}\sqrt{n})$. Since $t \mapsto S(t)$ is uniformly continuous on compacts, $\mathrm{osc}_{nr,n\delta}(S(\cdot))(\omega) \to 0$ when $\delta \downarrow 0$ (for each $\omega$). Consequently, $q_{n,\delta} \to 0$ for each fixed $n$, hence uniformly over $n \le n_0$ and any fixed $n_0$. With $\delta \mapsto q_{n,\delta}$ non-decreasing, this implies that $b = b(n_0) = \inf_{\delta > 0} \sup_{n \ge n_0} q_{n,\delta}$ is independent of $n_0$, hence $b(1) = 0$ provided $\inf_{n_0 \ge 1} b(n_0) = \lim_{\delta \downarrow 0} \limsup_{k \to \infty} q_{k,\delta} = 0$. To show the latter, observe that since the piecewise linear $S(t)$ changes slope only at integer values of $t$,
$$\mathrm{osc}_{kr,k\delta}(S(\cdot)) \le \mathrm{osc}_{kr,m}(S(\cdot)) \le M_{m,\ell} \, ,$$
for $m = [k\delta] + 1$, $\ell = \lceil rk/m \rceil$ and
(9.2.3) $\qquad M_{m,\ell} = \max_{1 \le i \le m} \ \max_{0 \le j \le \ell m - 1} |S(i + j) - S(j)| \, .$
where $v = r^{-1}\sqrt{k/m} \to 1/(r\sqrt{\delta})$ as $k \to \infty$ and $\psi(v) = r^3 v^2$. Since $v \to \infty$ when $\delta \downarrow 0$, we complete the proof by appealing to part (c) of Exercise 9.2.8.
As you have just seen, the key to the proof of Proposition 9.2.7 is the following bound on maximal fluctuations of increments of the random walk.
Exercise 9.2.8. Suppose $S_m = \sum_{k=1}^m \xi_k$ for i.i.d. $\xi_k$ such that $E\xi_1 = 0$ and $E\xi_1^2 = 1$. For integers $m, \ell \ge 1$, let $S(m) = S_m$ and $M_{m,\ell}$ be as in (9.2.3), with $M_{m,0} = \max_{i=1}^m |S_i|$.
(a) Show that for any $m \ge 1$ and $t \ge 0$,

$m^{-1/2} S_m \stackrel{D}{\to} G$ by the clt.
Applying Donsker's invariance principle, you can induce limiting results for random walks out of the corresponding facts about the standard Brownian motion, which we have found already in Subsection 9.1.
Example 9.2.9. Recall the running maxima $M_t = \sup_{s \in [0,t]} W_s$, whose density we got in (9.1.5) out of the reflection principle. Since $h_0(x(\cdot)) = \sup\{x(s) : s \in [0,1]\}$ is continuous with respect to uniform convergence on $C([0,1])$, we have from Donsker's invariance principle that as $n \to \infty$,
$$h_0(\widehat{S}_n) = \frac{1}{\sqrt{n}} \max_{k=0}^{n} S_k \stackrel{D}{\to} M_1$$
(where we have used the fact that the maximum of the linearly interpolated function $S(t)$ must be obtained at some integer value of $t$). The functions $h_\alpha(x(\cdot)) = \int_0^1 x(u)^\alpha \, du$ can be handled similarly.
Exercise 9.2.10.
(a) Building on Example 9.2.9, show that for any integer $\alpha \ge 1$,
$$n^{-(1+\alpha/2)} \sum_{k=1}^{n} (S_k)^\alpha \stackrel{D}{\to} \int_0^1 (W_u)^\alpha \, du \, ,$$
as soon as $E\xi_1 = 0$ and $E\xi_1^2 = 1$ (i.e. there is no need to assume finiteness of the $\alpha$-th moment of $\xi_1$), and in case $\alpha = 1$ the limit law is merely a normal of zero mean and variance $1/3$.
(b) The cardinality of the set $\{S_0, \ldots, S_n\}$ is called the range of the walk by time $n$ and denoted $\mathrm{rng}_n$. Show that for the symmetric srw on $\mathbb{Z}$,
$$n^{-1/2}\, \mathrm{rng}_n \stackrel{D}{\to} \sup_{s \le 1} W_s - \inf_{s \le 1} W_s \, .$$
We continue in the spirit of Example 9.2.9, except for dealing with functionals that are no longer continuous throughout $C([0,\infty))$.
(9.2.4) $\qquad \widehat{A}_n(b) = \frac{1}{n} \sum_{k=1}^{n} I_{\{S_k > b\sqrt{n}\}} \stackrel{D}{\to} A_1(b) \, ,$
as you are also to justify upon solving Exercise 9.2.12. Of particular note is the case of $b = 0$, where Lévy's arc-sine law tells us that $A_t(0) \stackrel{D}{=} L_t$ of Exercise 9.1.11 (as shown for example in [KaS97, Proposition 4.4.11]).
Recall the arc-sine limiting law of Exercise 3.2.16 for $n^{-1} \sup\{\ell \le n : S_{\ell-1} S_\ell \le 0\}$ in case of the symmetric srw. In view of Exercise 9.1.11, working with $g_0(x(\cdot)) = \sup\{s \in [0,1] : x(s) = 0\}$ one can extend the validity of this limit law to any random walk with increments of mean zero and variance one (c.f. [Dur10, Example 8.6.3]).
Exercise 9.2.12.
(a) Let $g_1^+(x(\cdot)) = \inf\{t \ge 0 : x(t) > 1\}$. Show that $P(W_\cdot \in G) = 1$ for the subset $G = \{x(\cdot) : x(0) = 0 \text{ and } g_1(x(\cdot)) = g_1^+(x(\cdot)) < \infty\}$ of $C([0,\infty))$, and that $g_1(x_n(\cdot)) \to g_1(x(\cdot))$ for any sequence $\{x_n(\cdot)\} \subseteq C([0,\infty))$ which converges uniformly on compacts to $x(\cdot) \in G$. Further, show that $\widehat{\tau}_n - n^{-1} \le g_1(\widehat{S}_n) \le \widehat{\tau}_n$ and deduce that $\widehat{\tau}_n \stackrel{D}{\to} T_1$.
(b) To justify (9.2.4), first verify that the non-negative $g(x(\cdot), (b,\infty))$ is continuous on any sequence whose limit is in $G = \{x(\cdot) : g(x(\cdot), \{b\}) = 0\}$ and that $E[g(W_\cdot, \{b\})] = 0$, hence $g(\widehat{S}_n, (b,\infty)) \stackrel{D}{\to} A_1(b)$. Then, deduce that $\widehat{A}_n(b) \stackrel{D}{\to} A_1(b)$ by showing that for any $\epsilon > 0$ and $n \ge 1$,
$$g(\widehat{S}_n, (b+\epsilon, \infty)) - \eta_n(\epsilon) \le \widehat{A}_n(b) \le g(\widehat{S}_n, (b-\epsilon, \infty)) + \eta_n(\epsilon) \, ,$$
with $\eta_n(\epsilon) = n^{-1} \sum_{k=1}^{n} I_{\{|\xi_k| > \epsilon\sqrt{n}\}}$ converging in probability to zero when $n \to \infty$.
Our next result is a refinement due to Kolmogorov and Smirnov, of the Glivenko-Cantelli theorem (which states that for i.i.d. $\{X, X_k\}$ the empirical distribution functions $F_n(x) = n^{-1} \sum_{i=1}^{n} I_{(-\infty,x]}(X_i)$, converge w.p.1., uniformly in $x$, to the distribution function of $X$, whichever it may be, see Theorem 2.3.6).
(9.2.5) $\qquad n^{1/2} D_n \stackrel{D}{\to} \sup_{t \in [0,1]} |\widehat{B}_t| \, ,$
for the Kolmogorov-Smirnov statistic $D_n = \sup_x |F_n(x) - F_X(x)|$, which in case $F_X$ is continuous has the representation
$$D_n = \sup_{u \in F_X(\mathbb{R})} \Big| u - n^{-1} \sum_{i=1}^{n} I_{(0,u]}(U_i) \Big| \, ,$$
in terms of the i.i.d. uniform on $[0,1]$ variables $U_i = F_X(X_i)$.
$$n^{1/2} \widetilde{D}_n = Z_n \max_{k=1}^{n} \big| n^{-1/2} S_k - \tfrac{k}{n} (n^{-1/2} S_n) \big| = Z_n \sup_{t \in [0,1]} |\widehat{S}_n(t) - t \widehat{S}_n(1)| \, ,$$
where $\widehat{S}_n(t) = n^{-1/2} S(nt)$ for $S(\cdot)$ of (9.2.1), so $t \mapsto \widehat{S}_n(t) - t \widehat{S}_n(1)$ is linear on each of the intervals $[k/n, (k+1)/n]$, $k \ge 0$. Consequently,
$$n^{1/2} \widetilde{D}_n = Z_n \, g(\widehat{S}_n(\cdot)) \, ,$$
for $g(x(\cdot)) = \sup_{t \in [0,1]} |x(t) - t x(1)|$. By the strong law of large numbers, $Z_n \stackrel{a.s.}{\to} 1$. Moreover, $g(\cdot)$ is continuous on $C([0,1])$, so by Donsker's invariance principle $g(\widehat{S}_n(\cdot)) \stackrel{D}{\to} g(W_\cdot) \stackrel{D}{=} \sup_{t \in [0,1]} |\widehat{B}_t|$ (see Definition 9.2.1), and by Slutsky's lemma the same limit law applies for $n^{1/2} \widetilde{D}_n$.
Remark. Defining $F_n^{-1}(t) = \inf\{x \in \mathbb{R} : F_n(x) \ge t\}$, with minor modifications the preceding proof also shows that in case $F_X(\cdot)$ is continuous, $n^{1/2}(F_X(F_n^{-1}(t)) - t) \stackrel{D}{\to} \widehat{B}_t$ on $[0,1]$. Further, with little additional work one finds that $n^{1/2} D_n \stackrel{D}{\to} \sup_{x \in \mathbb{R}} |\widehat{B}_{F_X(x)}|$ even in case $F_X(\cdot)$ is discontinuous (and which for continuous $F_X(\cdot)$ coincides with (9.2.5)).
You are now to provide an explicit formula for the distribution function $F_{KS}(\cdot)$ of $\sup_{t \in [0,1]} |\widehat{B}_t|$.
Exercise 9.2.14. Consider the standard Brownian bridge $\widehat{B}_t$ on $[0,1]$, as in Exercises 7.3.15-7.3.16. Show that
(9.2.6) $\qquad F_{KS}(b) = P\Big( \sup_{t \in [0,1]} |\widehat{B}_t| \le b \Big) = 1 - 2 \sum_{n=1}^{\infty} (-1)^{n-1} e^{-2 n^2 b^2} \, .$
and any $\alpha \ge 1$ fixed (where $\|x\|_t = \sup_{s \in [0,t]} |x(s)|$). Show that $X_n \stackrel{D}{\to} X_\infty$ in the topological space $S$ consisting of $C([0,\infty))$ equipped with the topology of uniform convergence on compact subsets of $[0,\infty)$.
Hint: Consider Exercise 7.2.9 and Corollary 3.5.3.
Proof. Recall that each sample function $t \mapsto W(t)(\omega)$ is uniformly continuous on $[0,\alpha]$, hence $\mathrm{osc}_{\alpha,\delta}(W(\cdot))(\omega) \to 0$ as $\delta \downarrow 0$ (see (9.2.2) for the definition of $\mathrm{osc}_{\alpha,\delta}(x(\cdot))$). Fixing $\epsilon > 0$ note that as $r \to \infty$, the probability of the event
$$G_r = \{\omega : \mathrm{osc}_{\alpha,3/r}(W(\cdot))(\omega) \le \epsilon\}$$
approaches one, while
$$P(\max_j |T_{n,[n s_j]} - s_j| \le r^{-1}) \to 1 \, ,$$
and since $s_{j+1} - s_{j-1} = 2/r$, it follows that for any $n \ge \max(n_0, r)$,
(9.2.8) $\qquad P\Big( \sup_{b \in \{0,1\}, \, t \in [0,\alpha)} |T_{n,[nt]+b} - t| \le 3 r^{-1} \Big) \ge 1 - \epsilon \, ,$
where $\eta = nt - [nt] \in [0,1)$. Observe that if both $G_r$ and the event in (9.2.8) occur, then by definition of $G_r$ each of the two terms on the right-side of the last inequality is at most $\epsilon$. We thus see that $\|\widehat{S}_n - W\|_\alpha \le 2\epsilon$ whenever both $G_r$ and the event in (9.2.8) occur. That is, $P(\|\widehat{S}_n - W\|_\alpha \le 2\epsilon) \ge 1 - 2\epsilon$. Since this applies for all $\epsilon > 0$, we have just shown that $\|\widehat{S}_n - W\|_\alpha \stackrel{p}{\to} 0$, as claimed.
The key tool in our program is an alternative Skorokhod representation of random variables. Whereas in Theorem 1.2.37 we applied the inverse of the desired distribution function to a uniform random variable on $[0,1]$, here we construct a stopping time of the form $\tau_{A,B} = \inf\{t \ge 0 : W_t \notin (-A, B)\}$ such that $W_\tau$ has the stated, mean zero law. To this end, your next exercise exhibits the appropriate random levels $(A, B)$ to be used in this construction.
Definition 9.2.17. Given a random variable $V \ge 0$ of positive, finite mean, we say that $Z \ge 0$ is a size-biased sample of $V$ if $E[g(Z)] = E[V g(V)]/EV$ for all $g \in b\mathcal{B}$. Alternatively, the Radon-Nikodym derivative between the corresponding laws is $\frac{dP_Z}{dP_V}(v) = v/EV$.
Exercise 9.2.18. Suppose $X$ is an integrable random variable, such that $EX = 0$ (so $EX_+ = EX_-$ is finite). Consider the $[0,\infty)^2$-valued random vector
$$(A, B) = (0,0) I_{\{X=0\}} + (Z, X) I_{\{X>0\}} + (-X, Y) I_{\{X<0\}} \, ,$$
where $Y$ and $Z$ are size-biased samples of $X_+$ and $X_-$, respectively, which are further independent of $X$. Show that then for any $f \in b\mathcal{B}$,
(9.2.9) $\qquad E\big[(1 - r(A,B)) f(B) + r(A,B) f(-A)\big] = E[f(X)] \, ,$
where $r(a,b) = b/(a+b)$ for $a > 0$ and $r(0,b) = 1$. In view of the identity (9.2.9) we thus deduce that for any $f \in b\mathcal{B}$,
$$E[f(W_\tau)] = E[V] = E[g(A,B)] = E[f(X)] \, .$$
That is, $W_\tau \stackrel{D}{=} X$, as claimed. Recall part (c) of Exercise 8.2.36 that $E[\tau_{-a,b}] = ab$. Since $ab = r(a,b) f(-a) + (1 - r(a,b)) f(b)$ for $f(x) = x^2$ and any $(a,b) \in \mathcal{Y}$, we deduce by the same reasoning that $E[\tau] = E[AB] = E[X^2]$. Similarly, the bound $E\tau^2 \le \frac{5}{3} E X^4$ follows from the identity $E[\tau_{-a,b}^2] = (ab)^2 + ab(a^2 + b^2)/3$ of part (c) of Exercise 8.2.36, and the inequality
$$ab(a^2 + b^2 + 3ab) \le 5 ab(a^2 + b^2 - ab) = 5 \big[ r(a,b) a^4 + (1 - r(a,b)) b^4 \big] \, .$$
With $(W(t), \mathcal{F}_t^W)$ a Brownian Markov process, it thus follows that so are $(W(t), \mathcal{F}_{k,t})$.
(b). Starting at $T_0 = 0$ we sequentially construct the non-decreasing $\mathcal{F}_{k,t}$-stopping times $T_k$. Assuming $T_{k-1}$ have been constructed already, consider Corollary 9.1.6 for the Brownian Markov process $(W(t), \mathcal{F}_{k-1,t})$ and the $\mathcal{F}_{k-1,t}$-stopping time $T_{k-1}$, to deduce that $W_k(t) = W(T_{k-1} + t) - W(T_{k-1})$ is a standard Wiener process which is independent of $\mathcal{H}_{k-1}$. The pair $(U_{2k-1}, U_{2k})$ of $[0,1]$-valued independent uniform variables is by assumption independent of $\mathcal{F}_{k-1,\infty}$, hence of both $\mathcal{H}_{k-1}$ and the standard Wiener process $\{W_k(t)\}$. Recall that $D_k = M_k - M_{k-1}$ is integrable and by the martingale property $E[D_k|\mathcal{F}_{k-1}] = 0$. With $\mathcal{F}_{k-1} \subseteq \mathcal{H}_{k-1}$ and representing our probability measure as a product measure on $(\Omega \times \Omega', \mathcal{F} \times \mathcal{G})$, we thus apply Theorem 9.2.19 for the continuous filtration $\mathcal{G}_t = \sigma(U_{2k-1}, U_{2k}, W_k(s), s \le t)$ which is independent of $\mathcal{H}_{k-1}$ and the random distribution function
$$F_{X_k}(x; \omega) = \widehat{P}_{D_k|\mathcal{F}_{k-1}}((-\infty, x], \omega) \, ,$$
(see Exercise 4.4.6 for the second identity), and by the same reasoning $E[\tau_k^2|\mathcal{H}_{k-1}] \le 2 E[D_k^4|\mathcal{F}_{k-1}]$, while the R.C.P.D. of $W_k(\tau_k)$ given $\mathcal{H}_{k-1}$ matches the law $\widehat{P}_{D_k|\mathcal{F}_{k-1}}$
of $X_k$. Note that the threshold levels $(A_k, B_k)$ of Exercise 9.2.18 are measurable on $\mathcal{F}_{k,0}$ since by right-continuity of distribution functions their construction requires only the values of $(U_{2k-1}, U_{2k})$ and $\{F_{X_k}(q; \omega), q \in \mathbb{Q}\}$. For example, for any $x \in \mathbb{R}$,
$$\{\omega : X_k(\omega) \le x\} = \{\omega : U_1(\omega) \le F_{X_k}(q; \omega) \text{ for all } q \in \mathbb{Q}, \, q > x\} \, .$$
Further, setting $T_k = T_{k-1} + \tau_k$, from the proof of Theorem 9.2.19 we have that $\{T_k \le t\}$ if and only if $\{T_{k-1} \le t\}$ and either $\sup_{u \in [0,t]} \{W(u) - W(u \wedge T_{k-1})\} \ge B_k$ or $\inf_{u \in [0,t]} \{W(u) - W(u \wedge T_{k-1})\} \le -A_k$. Consequently, the event $\{T_k \le t\}$ is in
$$\sigma(A_k, B_k, t \wedge T_{k-1}, I_{\{T_{k-1} \le t\}}, W(s), s \le t) \subseteq \mathcal{F}_{k,t} \, ,$$
by the $\mathcal{F}_{k,0}$-measurability of $(A_k, B_k)$ and our hypothesis that $T_{k-1}$ is an $\mathcal{F}_{k-1,t}$-stopping time.
(c). With $W(T_0) = M_0 = 0$, it clearly suffices to show that the f.d.d. of $\{W(\tau_\ell) = W(T_\ell) - W(T_{\ell-1})\}$ match those of $\{D_\ell = M_\ell - M_{\ell-1}\}$. To this end, recall that $\mathcal{H}_k = \mathcal{F}_{k,T_k}$ is a filtration (see part (b) of Exercise 8.1.11), and in part (b) we saw that $W_k(\tau_k)$ is $\mathcal{H}_k$-adapted such that its R.C.P.D. given $\mathcal{H}_{k-1}$ matches the R.C.P.D. of the $\mathcal{F}_k$-adapted $D_k$, given $\mathcal{F}_{k-1}$. Hence, for any $f_\ell \in b\mathcal{B}$, we have from the tower property that
$$E\Big[\prod_{\ell=1}^{n} f_\ell(W(\tau_\ell))\Big] = E\Big[\prod_{\ell=1}^{n-1} f_\ell(W(\tau_\ell)) \, E\big[f_n(W_n(\tau_n)) \,\big|\, \mathcal{H}_{n-1}\big]\Big] = E\Big[\prod_{\ell=1}^{n-1} f_\ell(D_\ell) \, E\big[f_n(D_n) \,\big|\, \mathcal{F}_{n-1}\big]\Big] = E\Big[\prod_{\ell=1}^{n} f_\ell(D_\ell)\Big] \, ,$$
where the middle equality is from the induction assumption that $\{D_\ell\}_{\ell=1}^{n-1}$ has the same law as $\{W(\tau_\ell)\}_{\ell=1}^{n-1}$.
The following corollary of Strassen's representation recovers Skorokhod's representation of the random walk $\{S_n\}$ as the samples of Brownian motion at a sequence of stopping times with i.i.d. increments.
Corollary 9.2.21 (Skorokhod's representation for random walks). Suppose $\xi_1$ is integrable and of zero mean. The random walk $S_n = \sum_{k=1}^{n} \xi_k$ of i.i.d. $\{\xi_k\}$ can be represented as $S_n = W(T_n)$ for $T_0 = 0$, i.i.d. $\tau_k = T_k - T_{k-1} \ge 0$ such that $E[\tau_1] = E[\xi_1^2]$ and standard Wiener process $\{W(t), t \ge 0\}$. Further, each $T_k$ is a stopping time for $\mathcal{F}_{k,t} = \sigma(\sigma(U_i, i \le 2k), \mathcal{F}_t^W)$ (with i.i.d. $[0,1]$-valued uniform $\{U_i\}$ that are independent of $\{W_t, t \ge 0\}$).
Proof. The construction we provided in proving Theorem 9.2.20 is based on inductively applying Theorem 9.2.19 for $k = 1, 2, \ldots$, where $X_k$ follows the R.C.P.D. of the MG difference $D_k$ given $\mathcal{F}_{k-1}$. For a martingale of independent differences, such as the random walk $S_n$, we can a-priori produce the independent thresholds $(A_k, B_k)$, $k \ge 1$, out of the given pairs of $\{U_i\}$, independently of the Wiener process. Then, in view of Corollary 9.1.6, for $k = 1, 2, \ldots$ both $(A_k, B_k)$ and the standard Wiener process $W_k(\cdot) = W(T_{k-1} + \cdot) - W(T_{k-1})$ are independent of the stopped at $T_{k-1}$ element of the continuous time filtration $\sigma(A_i, B_i, i < k, W(s), s \le t)$. Consequently, so is the stopping time $\tau_k = \inf\{t \ge 0 : W_k(t) \notin (-A_k, B_k)\}$ with respect to the continuous time filtration $\mathcal{G}_{k,t} = \sigma(A_k, B_k, W_k(s), s \le t)$, from which we conclude that $\{\tau_k\}$ are in this case i.i.d.
Combining Strassen's martingale representation with Lemma 9.2.16, we are now in position to prove a Lindeberg type martingale clt.
Theorem 9.2.22 (martingale clt, Lindeberg's). Suppose that for any $n \ge 1$ fixed, $(M_{n,\ell}, \mathcal{F}_{n,\ell})$ is a (discrete time) $L^2$-martingale, starting at $M_{n,0} = 0$, and the corresponding martingale differences $D_{n,k} = M_{n,k} - M_{n,k-1}$ and predictable compensators
$$\langle M_n \rangle_\ell = \sum_{k=1}^{\ell} E[D_{n,k}^2 \,|\, \mathcal{F}_{n,k-1}] \, ,$$
are such that
(9.2.10) $\qquad \langle M_n \rangle_{[nt]} \stackrel{p}{\to} t \, ,$
and $\max_{k=1}^{n} |D_{n,k}| \le 2\epsilon_n$,
P
the representation Mn, = W (Tn, ). Recall that Tn, = k=1 n,k where for each
n, the non-negative n,k are adapted to the filtration {Hn,k , k 1} and such that
w.p.1. for k = 1, . . . , n,
2
E[n,k |Hn,k1 ] = E[Dn,k
|Fn,k1 ] ,
(9.2.13)
2
4
E[n,k
|Hn,k1 ] 2E[Dn,k
|Fn,k1 ] .
(9.2.14)
Under this representation, the process Sbn () of (9.2.12) is of the form considered in
p
p
Lemma 9.2.16, and as shown there, kSbn W k 0 provided Tn,[nt] t for each
fixed t [0, 1].
To verify the latter convergence in probability, set $\widehat{T}_{n,\ell} = T_{n,\ell} - \langle M_n \rangle_\ell$ and $\widehat{\tau}_{n,k} = \tau_{n,k} - E[\tau_{n,k}|\mathcal{H}_{n,k-1}]$. Note that by the identities of (9.2.13),
$$\langle M_n \rangle_\ell = \sum_{k=1}^{\ell} E[D_{n,k}^2 \,|\, \mathcal{F}_{n,k-1}] = \sum_{k=1}^{\ell} E[\tau_{n,k} \,|\, \mathcal{H}_{n,k-1}] \, ,$$
hence $\widehat{T}_{n,\ell} = \sum_{k=1}^{\ell} \widehat{\tau}_{n,k}$ is for each $n$, the $\mathcal{H}_{n,\ell}$-martingale part in Doob's decomposition of the integrable, $\mathcal{H}_{n,\ell}$-adapted sequence $\{T_{n,\ell}, \ell \ge 0\}$. Further, considering the expectation in both sides of (9.2.14), by our assumed uniform bound $|D_{n,k}| \le 2\epsilon_n$ it follows that for any $k = 1, \ldots, n$,
$$E[\widehat{\tau}_{n,k}^2] \le E[\tau_{n,k}^2] \le 2 E[D_{n,k}^4] \le 8 \epsilon_n^2 E[D_{n,k}^2] \, ,$$
hence
$$\sum_{k=1}^{n} E[\widehat{\tau}_{n,k}^2] \le 8 \epsilon_n^2 \sum_{k=1}^{n} E[D_{n,k}^2] = 8 \epsilon_n^2 E[\langle M_n \rangle_n] \, .$$
Recall our assumption that $\langle M_n \rangle_n$ is uniformly bounded, hence fixing $t \in [0,1]$, we conclude that $\widehat{T}_{n,[nt]} \stackrel{L^2}{\to} 0$ as $n \to \infty$. This of course implies the convergence to zero in probability of $\widehat{T}_{n,[nt]}$, and in view of assumption (9.2.10) and Slutsky's lemma, also $T_{n,[nt]} = \widehat{T}_{n,[nt]} + \langle M_n \rangle_{[nt]} \stackrel{p}{\to} t$ as $n \to \infty$.
Step 2. We next eliminate the superfluous assumption $\langle M_n \rangle_n \le 2$ via the strategy employed in proving part (a) of Theorem 5.3.33. That is, consider the $\mathcal{F}_{n,\ell}$-stopped martingales $\widetilde{M}_{n,\ell} = M_{n,\ell \wedge \theta_n}$ for stopping times $\theta_n = n \wedge \min\{\ell < n : \langle M_n \rangle_{\ell+1} > 2\}$, such that $\langle \widetilde{M}_n \rangle_n \le 2$. As the corresponding martingale differences are $\widetilde{D}_{n,k} = D_{n,k} I_{\{k \le \theta_n\}}$, you can easily verify that for all $\ell \le n$
$$\langle \widetilde{M}_n \rangle_\ell = \langle M_n \rangle_{\ell \wedge \theta_n} \, .$$
$$\sum_{k} E[\widehat{D}_{n,k}^2 \,|\, \mathcal{F}_{n,k-1}]$$
(for the right hand side is bounded by $2 g_n(\epsilon_n)$ which by our choice of $\epsilon_n$ converges to zero in probability, so the convergence (9.2.10) of the predictable compensators transfers from $M_n$ to $\widetilde{M}_n$). These inequalities are in turn a direct consequence of the bounds
(9.2.15) $\qquad E[\widetilde{D}_{n,k}^2|\mathcal{F}] \le E[D_{n,k}^2|\mathcal{F}] \le E[\widetilde{D}_{n,k}^2|\mathcal{F}] + 2 E[\widehat{D}_{n,k}^2|\mathcal{F}] \, ,$
and the identity
$$E[D_{n,k}^2|\mathcal{F}] = E[D_{n,k}^2 I_{\{|D_{n,k}| \le \epsilon_n\}}|\mathcal{F}] + E[\widehat{D}_{n,k}^2|\mathcal{F}] \, .$$
The latter identity also yields the right-most inequality in (9.2.15), for $E[D_{n,k} I_{\{|D_{n,k}| \le \epsilon_n\}}|\mathcal{F}] = -E[\widehat{D}_{n,k}|\mathcal{F}]$ (due to the martingale condition $E[D_{n,k}|\mathcal{F}_{n,k-1}] = 0$), hence
$$E[D_{n,k}^2 I_{\{|D_{n,k}| \le \epsilon_n\}}|\mathcal{F}] - E[\widetilde{D}_{n,k}^2|\mathcal{F}] = (E[\widehat{D}_{n,k}|\mathcal{F}])^2 \le E[\widehat{D}_{n,k}^2|\mathcal{F}] \, .$$
Now that we have exhibited a coupling for which $\|\widetilde{S}_n - W\|_\infty \stackrel{p}{\to} 0$, if $\|\widehat{S}_n - \widetilde{S}_n\|_\infty \stackrel{p}{\to} 0$, then by the triangle inequality for the supremum norm $\|\cdot\|_\infty$ there also exists a coupling with $\|\widehat{S}_n - W\|_\infty \stackrel{p}{\to} 0$ (to construct the latter coupling, since there exist …
Note also that if the event $\Gamma_n = \{|D_{n,k}| < \varepsilon_n \text{ for all } k \le n\}$ occurs, then $D_{n,k} - \widetilde{D}_{n,k} = E[\bar{D}_{n,k} \mid \mathcal{F}_{n,k-1}] = -E[\widehat{D}_{n,k} \mid \mathcal{F}_{n,k-1}]$ for all $k$. Therefore,
\[
I_{\Gamma_n} \|\widehat{S}_n - \widetilde{S}_n\|_\infty \le I_{\Gamma_n} \sum_{k=1}^{n} |D_{n,k} - \widetilde{D}_{n,k}| \le \sum_{k=1}^{n} \big| E(\widehat{D}_{n,k} \mid \mathcal{F}_{n,k-1}) \big| \le \varepsilon_n^{-1} g_n(\varepsilon_n) \,,
\]
and our choice of $\varepsilon_n \to 0$ such that $\varepsilon_n^{-1} g_n(\varepsilon_n) \stackrel{p}{\to} 0$ implies that $I_{\Gamma_n} \|\widehat{S}_n - \widetilde{S}_n\|_\infty \stackrel{p}{\to} 0$.
We thus complete the proof by showing that $P(\Gamma_n^c) \to 0$. Indeed, fixing $n$ and $r > 0$, we apply Exercise 5.3.40 for the events $A_k = \{|D_{n,k}| \ge \varepsilon_n\}$ adapted to the filtration $\{\mathcal{F}_{n,k}, k \ge 0\}$, and by Markov's inequality for C.E. (see part (b) of Exercise 4.2.22), arrive at
\[
P(\Gamma_n^c) = P\Big( \bigcup_{k=1}^{n} A_k \Big) \le r + P\Big( \sum_{k=1}^{n} P(|D_{n,k}| \ge \varepsilon_n \mid \mathcal{F}_{n,k-1}) > r \Big)
\le r + P\Big( \varepsilon_n^{-2} \sum_{k=1}^{n} E[\widehat{D}_{n,k}^2 \mid \mathcal{F}_{n,k-1}] > r \Big) = r + P\big( \varepsilon_n^{-2} g_n(\varepsilon_n) > r \big) \,.
\]
Consequently, our choice of $\varepsilon_n$ implies that $P(\Gamma_n^c) \le 3r$ for any $r > 0$ and all $n$ large enough. So, upon considering $r \downarrow 0$ we deduce that $P(\Gamma_n^c) \to 0$. As explained before, this concludes our proof of the martingale clt. $\square$
Specializing Theorem 9.2.22 to the case of a single martingale $(M_\ell, \mathcal{F}_\ell)$ leads to the following corollary.

Corollary 9.2.23. Suppose an $L^2$-martingale $(M_\ell, \mathcal{F}_\ell)$ starting at $M_0 = 0$ is of $\mathcal{F}_\ell$-predictable compensators such that $n^{-1}\langle M\rangle_n \stackrel{p}{\to} 1$ and, as $n \to \infty$,
\[
n^{-1} \sum_{k=1}^{n} E\big[ (M_k - M_{k-1})^2 I_{\{|M_k - M_{k-1}| \ge \epsilon \sqrt{n}\}} \big] \to 0
\]
for any fixed $\epsilon > 0$. Then, as $n \to \infty$, the linearly interpolated, time-scaled S.P.
\[
\widehat{S}_n(t) = n^{-1/2} \big\{ M_{[nt]} + (nt - [nt])(M_{[nt]+1} - M_{[nt]}) \big\} \,, \tag{9.2.16}
\]
converges in distribution on $C([0,1])$ to the standard Wiener process.

Proof. Simply consider Theorem 9.2.22 for $M_{n,\ell} = n^{-1/2} M_\ell$ and $\mathcal{F}_{n,\ell} = \mathcal{F}_\ell$. In this case $\langle M_n\rangle_\ell = n^{-1} \langle M\rangle_\ell$, so (9.2.10) amounts to $n^{-1}\langle M\rangle_n \stackrel{p}{\to} 1$, and in stating the corollary we merely replaced the condition (9.2.11) by the stronger assumption that $E[g_n(\epsilon)] \to 0$ as $n \to \infty$. $\square$
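Although the notes contain no code, Corollary 9.2.23 is easy to probe numerically. The sketch below is an added illustration with hypothetical helper names, not part of the original text: it takes the simple $\pm 1$ random walk, a martingale with $\langle M\rangle_n = n$, and checks that the scaled endpoint $\widehat{S}_n(1) = n^{-1/2} M_n$ has empirical mean and variance close to those of a standard normal variable.

```python
import math
import random

def scaled_endpoint(n, rng):
    """Return S_hat_n(1) = n^{-1/2} M_n for the +/-1 random walk martingale."""
    m = sum(rng.choice((-1, 1)) for _ in range(n))
    return m / math.sqrt(n)

def empirical_moments(n=400, reps=2000, seed=0):
    """Empirical mean and variance of n^{-1/2} M_n over independent replications."""
    rng = random.Random(seed)
    xs = [scaled_endpoint(n, rng) for _ in range(reps)]
    mean = sum(xs) / reps
    var = sum((x - mean) ** 2 for x in xs) / reps
    return mean, var

if __name__ == "__main__":
    mean, var = empirical_moments()
    # By the corollary these should be close to 0 and 1, respectively.
    print(round(mean, 3), round(var, 3))
```

Only the endpoint is examined here; the corollary asserts much more, namely convergence in distribution of the whole interpolated path in $C([0,1])$.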
Further specializing Theorem 9.2.22, you are to derive next the martingale extension of Lyapunov's clt.
Exercise 9.2.24. Let $Z_k = \sum_{i=1}^{k} X_i$, where the $\mathcal{F}_k$-adapted, square-integrable $\{X_k\}$ are such that w.p.1. $E[X_k \mid \mathcal{F}_{k-1}] = \mu$ for some non-random $\mu$ and all $k \ge 1$. Setting $V_{n,q} = n^{-q/2} \sum_{k=1}^{n} E[|X_k - \mu|^q \mid \mathcal{F}_{k-1}]$, suppose further that $V_{n,2} \stackrel{p}{\to} 1$, while $V_{n,q} \stackrel{p}{\to} 0$ for some non-random $q > 2$.
(a) Setting $M_k = Z_k - k\mu$, show that $\widehat{S}_n(\cdot)$ of (9.2.16) converges in distribution on $C([0,1])$ to the standard Wiener process.
(b) Deduce that $L_n \stackrel{D}{\to} L$ for $L = \max_{t \in [0,1]} \{W_t - t W_1\}$, where
\[
L_n = n^{-1/2} \max_{0 \le k \le n} \Big\{ Z_k - \frac{k}{n} Z_n \Big\} \,.
\]
… of the parameter $\theta$, conclude that $\sqrt{V_n}\,(\widehat{\theta}_n - \theta) \stackrel{D}{\to} N(0,1)$ as $n \to \infty$.
(c) Suppose now that $\theta = 1$. Show that in this case $(Y_n, \mathcal{F}_n)$ is a martingale of uniformly bounded differences and deduce from the martingale clt that the two-dimensional random vectors $(n^{-1} Z_n, n^{-2} V_n)$ converge in distribution to $(\frac{1}{2}(W_1^2 - 1), \int_0^1 W_t^2 \, dt)$ with $\{W_t, t \ge 0\}$ a standard Wiener process. Conclude that in this case
\[
\sqrt{V_n}\,(\widehat{\theta}_n - \theta) \stackrel{D}{\to} \frac{W_1^2 - 1}{2 \sqrt{\int_0^1 W_t^2 \, dt}} \,.
\]
(d) Show that the conclusion of part (c) applies in case $\theta = -1$, except for multiplying the limiting variable by $-1$.
Hint: Consider the sequence $(-1)^k Y_k$, with $\widetilde{Y}_k$ corresponding to $\widetilde{D}_k = (-1)^k D_k$ and $\theta = 1$.
Exercise 9.2.26. Let $L_n = n^{-1/2} \max_{0 \le k \le n} \{\sum_{i=1}^{k} c_i Y_i\}$, where the $\mathcal{F}_k$-adapted, square-integrable $\{Y_k\}$ are such that w.p.1. $E[Y_k \mid \mathcal{F}_{k-1}] = 0$ and $E[Y_k^2 \mid \mathcal{F}_{k-1}] = 1$ for all $k \ge 1$. Suppose further that $\sup_{k \ge 1} E[|Y_k|^q \mid \mathcal{F}_{k-1}]$ is finite a.s. for some $q > 2$ and $c_k \in m\mathcal{F}_{k-1}$ are such that $n^{-1} \sum_{k=1}^{n} c_k^2 \stackrel{p}{\to} 1$. Show that $L_n \stackrel{D}{\to} L$ with $P(L \ge b) = 2 P(G \ge b)$ for a standard normal variable $G$ and all $b \ge 0$.
Hint: Show that $k^{-1/2} c_k \stackrel{p}{\to} 0$, then consider part (a) of Exercise 9.2.24.
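The form of the limit law can be read off from the reflection principle for the Wiener process (a reminder added here for convenience; the principle itself appears earlier in the notes): for the maximum of the limiting path,

```latex
P\Big( \max_{t \in [0,1]} W_t \ge b \Big) = 2\, P(W_1 \ge b) = 2\, P(G \ge b) , \qquad b \ge 0 ,
```

so the limit $L$ has the law of $|G|$.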
\[
\limsup_{t \to \infty} \frac{W_t}{\widetilde{h}(t)} = 1 \,, \qquad \liminf_{t \to \infty} \frac{W_t}{\widetilde{h}(t)} = -1 \,. \tag{9.2.18}
\]
Remark. To determine the scale $h(t)$ recall the estimate $P(G \ge x) = e^{-x^2(1+o(1))/2}$, which implies that for $t_n = \delta^n$, $\delta \in (0,1)$, the sequence $P(W_{t_n} \ge b\, h(t_n)) = n^{-b^2 + o(1)}$ is summable when $b > 1$ but not summable when $b < 1$. Indeed, using such tail bounds we prove the lil in the form of (9.2.17) by the subsequence method you have seen already in the proof of Proposition 2.3.1. Specifically, we consider such time skeleton $\{t_n\}$ and apply Borel-Cantelli I for $b > 1$ and $\delta$ near one (where Doob's inequality controls the fluctuations of $t \mapsto W_t$ by those at $\{t_n\}$), en-route to the upper bound. To get a matching lower bound we use Borel-Cantelli II for $b < 1$ and the independent increments $W_{t_n} - W_{t_{n+1}}$ (which are near $W_{t_n}$ when $\delta$ is small).
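The rate $n^{-b^2+o(1)}$ quoted here follows by a one-line computation (spelled out for convenience, with $h(t) = \sqrt{2t \log\log(1/t)}$ as in the lil): since $W_{t_n} \stackrel{D}{=} \sqrt{t_n}\, G$,

```latex
P\big( W_{t_n} \ge b\, h(t_n) \big)
  = P\big( G \ge b \sqrt{2 \log\log(1/t_n)} \big)
  = \exp\big( -b^2 \log\log(\delta^{-n}) (1 + o(1)) \big)
  = \big( n \log(1/\delta) \big)^{-b^2(1+o(1))}
  = n^{-b^2 + o(1)} .
```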
Proof. Since $\widetilde{h}(t) = t h(1/t)$, by the time-inversion invariance of the standard Wiener process it follows, upon considering $\widetilde{W}_t = t W_{1/t}$, that (9.2.18) is equivalent to (9.2.17). Further, by the symmetry of this process, it suffices to prove the statement about the $\limsup$ in (9.2.17).
Proceeding to upper bound $W_t/h(t)$, applying Doob's inequality (8.2.2) for the non-negative martingale $X_s = \exp(\theta(W_s - \theta s/2))$ (see part (a) of Exercise 8.2.7), such that $E[X_0] = 1$, we find that for any $t, \theta, y \ge 0$,
\[
P\Big( \sup_{s \in [0,t]} \{W_s - \theta s/2\} \ge y \Big) = P\Big( \sup_{s \in [0,t]} X_s \ge e^{\theta y} \Big) \le e^{-\theta y} \,. \tag{9.2.19}
\]
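Optimizing this bound over $\theta$ is the standard route to Gaussian-type maximal tails (a routine step, recorded here for convenience): since $\{\sup_{s \le t} W_s \ge y\} \subseteq \{\sup_{s \le t}(W_s - \theta s/2) \ge y - \theta t/2\}$,

```latex
P\Big( \sup_{s \in [0,t]} W_s \ge y \Big)
  \le \inf_{\theta \ge 0} \exp\big( -\theta y + \theta^2 t/2 \big)
  = \exp\big( -y^2/(2t) \big) ,
```

the infimum being attained at $\theta = y/t$.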
To bound below the left side of (9.2.19), consider the independent events $A_n = \{W_{t_n} - W_{t_{n+1}} \ge (1-\delta)^2 h(t_n)\}$, where as before $t_n = \delta^n$, $n > n_0(\delta)$ and $\delta \in (0,1)$ …
… at a scale of $O(\sqrt{t})$, rather than the lil scale $\widetilde{h}(t)$.
Exercise 9.2.28. Show that for $t_n = \exp(\exp(n))$ and a Brownian Markov process $\{W_t, t \ge 0\}$, almost surely,
\[
\limsup_{n \to \infty} W_{t_n} \big/ \sqrt{2 t_n \log\log\log t_n} = 1 \,.
\]
Combining Khinchin's lil and the representation of the random walk as samples of the Brownian motion at random times, we have the corresponding lil of Hartman-Wintner for the random walk.

Proposition 9.2.29 (Hartman-Wintner's lil). Suppose $S_n = \sum_{k=1}^{n} \xi_k$, where $\xi_k$ are i.i.d. with $E\xi_1 = 0$ and $E\xi_1^2 = 1$. Then, w.p.1.
\[
\limsup_{n \to \infty} S_n / \widetilde{h}(n) = 1 \,,
\]
where $\widetilde{h}(n) = n h(1/n) = \sqrt{2 n \log\log n}$.

Remark. Recall part (c) of Exercise 2.3.4 that if $E[|\xi_1|^\alpha] = \infty$ for some $0 < \alpha < 2$ then w.p.1. $n^{-1/\alpha} |S_n|$ is unbounded, hence so is $|S_n|/\widetilde{h}(n)$, and in particular the lil fails.
…
\[
\limsup_{\ell \to \infty} \frac{V_\ell}{\widetilde{h}(t_\ell)} \le \delta \quad \text{w.p.1., where} \quad
V_\ell \le 2 \sup\{ |W_s - W_{t_\ell}| : s \in [t_\ell, t_{\ell+3}] \} \,.
\]
…
\[
P(W_1 \ge x_\ell) \le (2\pi)^{-1/2} x_\ell^{-1} e^{-x_\ell^2/2} \,.
\]
Having just shown that $\sum_\ell P\big( V_\ell \ge 8\epsilon\, \widetilde{h}(t_\ell) \big)$ is finite, we deduce by Borel-Cantelli I that (9.2.20) holds for $\delta = 8\epsilon$, which as explained before, completes the proof. $\square$
Remark. Strassen's lil goes further than Hartman-Wintner's lil, in characterizing the almost sure limit set (i.e., the collection of all limits of convergent subsequences in $C([0,1])$) of $\{S(n \cdot)/\widetilde{h}(n)\}$, for $S(\cdot)$ of (9.2.1), as
\[
\mathcal{K} = \Big\{ x(\cdot) \in C([0,1]) : x(t) = \int_0^t y(s) \, ds, \ \int_0^1 y(s)^2 \, ds \le 1 \Big\} \,.
\]
While Strassen's lil is outside our scope, here is a small step in this direction.

Exercise 9.2.30. Show that w.p.1. $[-1, 1]$ is the limit set of the $\mathbb{R}$-valued sequence $\{S_n/\widetilde{h}(n)\}$.
9.3. Brownian path: regularity, local maxima and level sets
Recall Exercise 7.3.13 that the Brownian sample function is a.s. locally $\gamma$-Hölder continuous for any $\gamma < 1/2$, and Khinchin's lil tells us that it is not $\gamma$-Hölder continuous for any $\gamma \ge 1/2$ on any interval $[0, t]$. Generalizing the latter irregularity property, we first state and prove the classical result of Paley, Wiener and Zygmund (see [PWZ33]), showing that a.s. a Brownian Markov process has nowhere differentiable sample functions (not even at a random time $t = t(\omega)$).
Definition 9.3.1. For a continuous function $f : [0,\infty) \mapsto \mathbb{R}$ and $\gamma \in (0,1]$, the upper and lower (right) $\gamma$-derivatives at $s \ge 0$ are the $\overline{\mathbb{R}}$-valued
\[
D^\gamma f(s) = \limsup_{u \downarrow 0} \frac{f(s+u) - f(s)}{u^\gamma} \,, \qquad
D_\gamma f(s) = \liminf_{u \downarrow 0} \frac{f(s+u) - f(s)}{u^\gamma} \,,
\]
which always exist. The Dini derivatives correspond to $\gamma = 1$ and are denoted by $D^1 f(s)$ and $D_1 f(s)$. Indeed, a continuous function $f$ is differentiable from the right at $s$ if $D^1 f(s) = D_1 f(s)$ is finite.
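A concrete example may help calibrate Definition 9.3.1 (added for illustration): for $f(s) = \sqrt{s}$ at $s = 0$,

```latex
D^{1/2} f(0) = D_{1/2} f(0) = \lim_{u \downarrow 0} \frac{\sqrt{u}}{u^{1/2}} = 1 ,
\qquad
D^1 f(0) = \lim_{u \downarrow 0} \frac{\sqrt{u}}{u} = +\infty ,
```

so $f$ has finite one-half-derivatives at the origin, yet fails to be differentiable there.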
Proposition 9.3.2 (Paley-Wiener-Zygmund). With probability one, the sample function of a Wiener process $t \mapsto W_t(\omega)$ is nowhere differentiable. More precisely, for $\gamma = 1$ and any $T \le \infty$,
\[
P(\{\omega : -\infty < D_\gamma W_t(\omega) \le D^\gamma W_t(\omega) < \infty \ \text{for some } t \in [0,T]\}) = 0 \,. \tag{9.3.1}
\]
Proof. Fixing integers $k, r \ge 1$, let
\[
A_{k,r} = \bigcup_{s \in [0,1]} \bigcap_{u \in [0,1/r]} \{\omega : |W_{s+u}(\omega) - W_s(\omega)| \le k u\}
\]
and $C = \bigcap_{n \ge 4r} \bigcup_{i=1}^{n} C_{n,i} \in \mathcal{F}$, where for $i = 1, \ldots, n$,
\[
C_{n,i} = \bigcap_{j=1}^{3} \{\omega : |W_{(i+j)/n}(\omega) - W_{(i+j-1)/n}(\omega)| \le 8k/n\} \,.
\]
To see that $A_{k,r} \subseteq C$ note that for any $n \ge 4r$, if $\omega \in A_{k,r}$ then for some integer $1 \le i \le n$ there exists $s \in [(i-1)/n, i/n]$ such that $|W_t(\omega) - W_s(\omega)| \le k(t-s)$ for all $t \in [s, s + 1/r]$. This applies in particular for $t = (i+j)/n$, $j = 0, 1, 2, 3$, in which case $0 \le t - s \le 4/n \le 1/r$ and consequently $|W_t(\omega) - W_s(\omega)| \le 4k/n$. Then, by the triangle inequality, necessarily also $\omega \in C_{n,i}$.
We next show that $P(C) = 0$. Indeed, note that for each $i, n$ the random variables $G_j = \sqrt{n}\, (W_{(i+j)/n} - W_{(i+j-1)/n})$, $j = 1, 2, 3$, are independent, each having the standard normal distribution. Hence,
\[
P(C_{n,i}) = \prod_{j=1}^{3} P\big( |G_j| \le 8k/\sqrt{n} \big) \le (8k/\sqrt{n})^3 \,.
\]
This in turn implies that $P(C) \le \sum_{i \le n} P(C_{n,i}) \le (8k)^3/\sqrt{n}$ for any $n \ge 4r$ and, upon taking $n \to \infty$, results with $P(C) = 0$, as claimed.
Having established (9.3.1) for $T = 1$, we note that by the scaling property of the Wiener process the same applies for any finite $T$. Finally, the subset of $\Omega$ considered there in case $T = \infty$ is merely the increasing limit as $n \to \infty$ of such subsets for $T = n$, hence also of zero probability. $\square$
You can even improve upon this negative result as follows.
Exercise 9.3.3. Adapting the proof of Proposition 9.3.2, show that for any fixed $\gamma > \frac{1}{2}$, w.p.1. the sample function $t \mapsto W_t(\omega)$ is nowhere $\gamma$-Hölder continuous. That is, (9.3.1) holds for any $\gamma > 1/2$.
\[
\limsup_{\delta \to 0} \frac{\mathrm{osc}_{T,\delta}(W)}{g(\delta)} = 1 \,, \tag{9.3.2}
\]
Proof. Fixing $0 < T < \infty$, note that $g(T\delta)/(\sqrt{T}\, g(\delta)) \to 1$ as $\delta \to 0$. Further, $\mathrm{osc}_{1,\delta}(\widetilde{W}) = T^{-1/2}\, \mathrm{osc}_{T,T\delta}(W)$, where $\widetilde{W}_s = T^{-1/2} W_{Ts}$ is a standard Wiener process on $[0,1]$. Consequently, it suffices to prove (9.3.2) only for $T = 1$.
Setting hereafter $T = 1$, we start with the easier lower bound on the left side of (9.3.2). To this end, fix $\gamma \in (0,1)$ and note that by independence of the increments of the Wiener process,
\[
P\big( \Delta_{\ell,1}(W) \le (1-\gamma) g(2^{-\ell}) \big) = (1 - q_\ell)^{2^\ell} \le \exp(-2^\ell q_\ell) \,.
\]
Further, by scaling and the lower bound of part (a) of Exercise 2.2.24, it is easy to check that for some $\ell_0 = \ell_0(\gamma)$ and all $\ell \ge \ell_0$,
\[
q_\ell = P\big( |G| > (1-\gamma)\sqrt{2 \ell \log 2} \big) \ge 2^{-(1-\gamma)\ell} \,.
\]
By definition $\mathrm{osc}_{1,2^{-\ell}}(x(\cdot)) \ge \Delta_{\ell,1}(x(\cdot))$ for any $x : [0,1] \mapsto \mathbb{R}$, and with $\sum_\ell \exp(-2^\ell q_\ell) \le \sum_\ell \exp(-2^{\gamma\ell})$ summable, it follows by Borel-Cantelli I that w.p.1.
\[
\mathrm{osc}_{1,2^{-\ell}}(W) \ge \Delta_{\ell,1}(W) > (1-\gamma) g(2^{-\ell})
\]
for all $\ell \ge \ell_1(\gamma, \omega)$ finite. In particular, for any $\gamma > 0$ fixed, w.p.1.
\[
\limsup_{\delta \to 0} \frac{\mathrm{osc}_{1,\delta}(W)}{g(\delta)} \ge 1 - \gamma \,,
\]
and considering $\gamma \downarrow 0$ we deduce that
\[
\limsup_{\delta \to 0} \frac{\mathrm{osc}_{1,\delta}(W)}{g(\delta)} \ge 1 \quad \text{w.p.1.}
\]
To show the matching upper bound, we fix $\gamma \in (0,1)$ and $b = b(\gamma) = (1+2\gamma)/(1-\gamma)$, and consider the events
\[
A_\ell = \bigcap_{r \le 2^{\gamma\ell}} \{ \Delta_{\ell,r}(W) < \sqrt{b}\, g(r 2^{-\ell}) \} \,.
\]
By the union bound,
\[
P(A_\ell^c) \le \sum_{r \le 2^{\gamma\ell}} P\big( \Delta_{\ell,r}(W) \ge \sqrt{b}\, g(r 2^{-\ell}) \big) \le \sum_{r \le 2^{\gamma\ell}} \sum_{j=0}^{2^\ell} P(|G_{r,j}| \ge x_{\ell,r}) \,,
\]
where $x_{\ell,r} = \sqrt{2 b \log(2^\ell/r)}$ and $G_{r,j} = (W_{(j+r)2^{-\ell}} - W_{j 2^{-\ell}})/\sqrt{r 2^{-\ell}}$ have the standard normal distribution. Since $\gamma > 0$, clearly $x_{\ell,r}$ is bounded away from zero for $r \le 2^{\gamma\ell}$ and $\exp(-x_{\ell,r}^2/2) = (r 2^{-\ell})^b$. Hence, from the upper bound of part (a) of Exercise 2.2.24 we deduce that the $r$-th term of the outer sum is at most $2^\ell C (r 2^{-\ell})^b$ for some finite constant $C = C(\gamma)$. Further, for some finite $\kappa = \kappa(\gamma)$ and all $\ell \ge 1$,
\[
\sum_{r \le 2^{\gamma\ell}} r^b \le \int_1^{2^{\gamma\ell}+1} t^b \, dt \le \kappa\, 2^{\gamma\ell(b+1)} \,,
\]
so that
\[
P(A_\ell^c) \le C 2^\ell 2^{-\ell b} \kappa\, 2^{\gamma\ell(b+1)} = C\kappa\, 2^{-\gamma\ell} \,,
\]
and since $\sum_\ell P(A_\ell^c)$ is finite, by the first Borel-Cantelli lemma, on a set $\Omega_\gamma$ of probability one, $\omega \in A_\ell$ for all $\ell \ge \ell_0(\gamma, \omega)$ finite. As you show in Exercise 9.3.6, it then follows from the continuity of $t \mapsto W_t$ that on $\Omega_\gamma$,
\[
\sup_{0 \le s \le s+h \le 1,\, h = \delta} |W_{s+h}(\omega) - W_s(\omega)| \le \sqrt{b}\, g(\delta) \big[ 1 + \eta(\gamma, \ell_0, \delta) \big] \,.
\]
Since $g(\cdot)$ is non-decreasing on $[0, 1/e]$, we can further replace the condition $h = \delta$ in the preceding inequality by $h \in [0,\delta]$ and deduce that on $\Omega_\gamma$,
\[
\limsup_{\delta \to 0} \frac{\mathrm{osc}_{1,\delta}(W)}{g(\delta)} \le \sqrt{b(\gamma)} \,.
\]
Taking $\gamma_k = 1/k$, for which $b(1/k) \to 1$, we conclude that w.p.1. the same bound also holds with $b = 1$. $\square$
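The role of the specific choice $b(\gamma) = (1+2\gamma)/(1-\gamma)$ deserves emphasis (arithmetic spelled out here for convenience): collecting the powers of $2$ in the bound on $P(A_\ell^c)$,

```latex
\ell(1-b) + \gamma\ell(b+1)
  = \ell\big( 1 + \gamma - b(1-\gamma) \big)
  = \ell\big( 1 + \gamma - (1+2\gamma) \big)
  = -\gamma\ell ,
```

which is exactly why $\sum_\ell P(A_\ell^c) < \infty$ and Borel-Cantelli I applies.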
Exercise 9.3.6. Suppose $x \in C([0,1])$ and $\Delta_{m,r}(x)$ are as in (9.3.3).
(a) Show that for any $m, r \ge 0$,
\[
\sup_{r 2^{-m} \le |t-s| < (r+1) 2^{-m}} |x(t) - x(s)| \le \Delta_{m,r}(x) + 4 \sum_{\ell=m+1}^{\infty} \Delta_{\ell,1}(x) \,.
\]
Hint: Deduce from part (a) of Exercise 7.2.7 that this holds if in addition $t, s \in Q_1^{(2,k)}$ for some $k > m$.
(b) Show that for some $c_\gamma$ finite, if $2^{-(m+1)(1-\gamma)} \le h \le 1/e$ with $m \ge 0$ and $\gamma \in (0,1)$, then
\[
\sum_{\ell=m+1}^{\infty} g(2^{-\ell}) \le c_\gamma\, g(2^{-m-1}) \le c_\gamma\, 2^{-\gamma(m+1)/2} g(h) \,.
\]
Hint: Recall that $g(h) = \sqrt{2h \log(1/h)}$ is non-decreasing on $[0, 1/e]$.
(c) Conclude that there exists $\eta(\gamma, \ell_0, h) \to 0$ as $h \to 0$, such that if $\Delta_{\ell,r}(x) \le \sqrt{b}\, g(r 2^{-\ell})$ whenever $r \le 2^{\gamma\ell}$ and $\ell \ge \ell_0$, then
\[
\sup_{0 \le s \le s+h \le 1} |x(s+h) - x(s)| \le \sqrt{b}\, g(h) \big[ 1 + \eta(\gamma, \ell_0, h) \big] \,.
\]
We take up now the study of level sets of the standard Wiener process,
\[
\mathcal{Z}_\omega(b) = \{ t \ge 0 : W_t(\omega) = b \} \,, \tag{9.3.4}
\]
… is finite and $W_{T_b} = b$, in which case it follows from (9.3.4) that $t \in \mathcal{Z}_\omega(b)$ if and only if $t = T_b + u$ for $u \ge 0$ such that $\widetilde{W}_u(\omega) = 0$, where $\widetilde{W}_u = W_{T_b + u} - W_{T_b}$ is, by Corollary 9.1.6, a standard Wiener process. That is, up to a translation by $T_b(\omega)$ the level set $\mathcal{Z}_\omega(b)$ is merely the zero set of $\widetilde{W}_t$, and we conclude the proof by applying Proposition 9.3.7 for the latter zero set. $\square$
Remark 9.3.9. Recall Example 8.2.51 that for a.e. $\omega$ the sample function $W_t(\omega)$ is of unbounded total variation on each finite interval $[s,t]$ with $s < t$. Thus, from part (a) of Exercise 8.2.42 we deduce that on any such interval, w.p.1., the sample function $W_\cdot(\omega)$ is non-monotone. Since every nonempty interval includes one with rational endpoints, of which there are only countably many, we conclude that for a.e. $\omega$, the sample path $t \mapsto W_t(\omega)$ of the Wiener process is monotone in no interval. Here is an alternative, direct proof of this fact.

Exercise 9.3.10. Let $A_n = \bigcap_{i=1}^{n} \{\omega : W_{i/n}(\omega) - W_{(i-1)/n}(\omega) \ge 0\}$ and $A = \{\omega : t \mapsto W_t(\omega) \text{ is non-decreasing on } [0,1]\}$.
(a) Show that $P(A_n) = 2^{-n}$ for all $n \ge 1$ and that $A = \bigcap_n A_n \in \mathcal{F}$ has zero probability.
(b) Deduce that for any interval $[s,t]$ with $0 \le s < t$ non-random, the probability that $W_\cdot(\omega)$ is monotone on $[s,t]$ is zero, and conclude that the event $F \in \mathcal{F}$ where $t \mapsto W_t(\omega)$ is monotone on some non-empty open interval has zero probability.
Hint: Recall the invariance transformations of Exercise 9.1.1 and that $F$ can be expressed as a countable union of events indexed by $s < t \in \mathbb{Q}$.
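For part (a) of the exercise, the key computation (spelled out here as an aid) uses the independence and symmetry of the $n$ increments:

```latex
P(A_n) = \prod_{i=1}^{n} P\big( W_{i/n} - W_{(i-1)/n} \ge 0 \big) = 2^{-n},
\qquad
P(A) \le \inf_{n} P(A_n) = 0 \quad \text{since } A \subseteq A_n \text{ for all } n .
```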
Our next objects of interest are the collections of local maxima and points of
increase along the Brownian sample path.
Remark. Recall that the upper Dini derivative $D^1 f(t)$ of Definition 9.3.1 is non-positive, hence finite, at every point $t$ of local maximum of $f(\cdot)$. Thus, Proposition 9.3.13 provides a dense set of points $t \ge 0$ where $D^1 W_t(\omega) < \infty$ and, by symmetry of the Brownian motion, another dense set where $D_1 W_t(\omega) > -\infty$, though as we have seen in Proposition 9.3.2, w.p.1. there is no point $t(\omega) \ge 0$ for which both apply.

Proof. If a continuous function $f$ has a non-strict local maximum then there exist rational numbers $0 \le q_1 < q_4$ such that the set $M = \{u \in (q_1, q_4) : f(u) = \sup_{t \in [q_1,q_4]} f(t)\}$ has an accumulation point in $[q_1, q_4)$. In particular, for some rational numbers $0 \le q_1 < q_2 < q_3 < q_4$ the set $M$ intersects both intervals $(q_1, q_2)$ and $(q_3, q_4)$. Thus, setting $M_{s,r} = \sup_{t \in [s,r]} W_t$, if $P(M_{s_3,s_4} = M_{s_1,s_2}) = 0$ for each $0 \le s_1 < s_2 < s_3 < s_4$, then w.p.1. every local maximum of $W_t(\omega)$ is strict. This is all we need to show, since in view of Remark 9.3.9, Exercise 9.3.12 and the continuity of Brownian motion, w.p.1. the (countable) set of (strict) local maxima of $W_t(\omega)$ is dense on $[0,\infty)$. Now, fixing $0 \le s_1 < s_2 < s_3 < s_4$, note that $M_{s_3,s_4} - M_{s_1,s_2} = Z - Y + X$ for the mutually independent $Z = \sup_{t \in [s_3,s_4]} \{W_t - W_{s_3}\}$, $Y = \sup_{t \in [s_1,s_2]} \{W_t - W_{s_2}\}$ and $X = W_{s_3} - W_{s_2}$. Since $g(x) = P(X = x) = 0$ for all $x \in \mathbb{R}$, we are done, as by Fubini's theorem $P(M_{s_3,s_4} = M_{s_1,s_2}) = P(X - Y + Z = 0) = E[g(Y - Z)] = 0$. $\square$
Remark. While proving Proposition 9.3.13 we have shown that for any countable collection of disjoint intervals $\{I_i\}$, w.p.1. the corresponding maximal values $\sup_{t \in I_i} W_t$ of the Brownian motion must all be distinct. In particular, $P(W_q = W_{q'} \text{ for some } q \ne q' \in \mathbb{Q}) = 0$, which of course does not contradict the fact that $P(W_0 = W_t \text{ for uncountably many } t \ge 0) = 1$ (as implied by Proposition 9.3.7).
Here is a remarkable contrast with Proposition 9.3.13, showing that the Brownian sample path has no point of increase (try to imagine such a path!).

Theorem 9.3.14 (Dvoretzky, Erdős, Kakutani). Almost every sample path of the Wiener process has no point of increase (or decrease).

For the proof of this result, see [MP09, Theorem 5.14].
Bibliography
[And1887] Désiré André, Solution directe du problème résolu par M. Bertrand, Comptes Rendus Acad. Sci. Paris, 105, (1887), 436–437.
[Bil95] Patrick Billingsley, Probability and measure, third edition, John Wiley and Sons, 1995.
[Bre92] Leo Breiman, Probability, Classics in Applied Mathematics, Society for Industrial and Applied Mathematics, 1992.
[Bry95] Wlodzimierz Bryc, The normal distribution, Springer-Verlag, 1995.
[Doo53] Joseph Doob, Stochastic processes, Wiley, 1953.
[Dud89] Richard Dudley, Real analysis and probability, Chapman and Hall, 1989.
[Dur10] Rick Durrett, Probability: Theory and Examples, fourth edition, Cambridge U Press, 2010.
[DKW56] Aryeh Dvoretzky, Jack Kiefer and Jacob Wolfowitz, Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator, Ann. Math. Stat., 27, (1956), 642–669.
[Dyn65] Eugene Dynkin, Markov processes, volumes 1-2, Springer-Verlag, 1965.
[Fel71] William Feller, An introduction to probability theory and its applications, volume II, second edition, John Wiley and sons, 1971.
[Fel68] William Feller, An introduction to probability theory and its applications, volume I, third edition, John Wiley and sons, 1968.
[Fre71] David Freedman, Brownian motion and diffusion, Holden-Day, 1971.
[GS01] Geoffrey Grimmett and David Stirzaker, Probability and random processes, 3rd ed., Oxford University Press, 2001.
[Hun56] Gilbert Hunt, Some theorems concerning Brownian motion, Trans. Amer. Math. Soc., 81, (1956), 294–319.
[KaS97] Ioannis Karatzas and Steven E. Shreve, Brownian motion and stochastic calculus, Springer Verlag, third edition, 1997.
[Lev37] Paul Lévy, Théorie de l'addition des variables aléatoires, Gauthier-Villars, Paris, (1937).
[Lev39] Paul Lévy, Sur certains processus stochastiques homogènes, Compositio Math., 7, (1939), 283–339.
[KT75] Samuel Karlin and Howard M. Taylor, A first course in stochastic processes, 2nd ed., Academic Press, 1975.
[Mas90] Pascal Massart, The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality, Ann. Probab. 18, (1990), 1269–1283.
[MP09] Peter Mörters and Yuval Peres, Brownian motion, Cambridge University Press, 2010.
[Num84] Esa Nummelin, General irreducible Markov chains and non-negative operators, Cambridge University Press, 1984.
[Oks03] Bernt Oksendal, Stochastic differential equations: An introduction with applications, 6th ed., Universitext, Springer Verlag, 2003.
[PWZ33] Raymond E.A.C. Paley, Norbert Wiener and Antoni Zygmund, Note on random functions, Math. Z. 37, (1933), 647–668.
[Pit56] E. J. G. Pitman, On the derivative of a characteristic function at the origin, Ann. Math. Stat. 27, (1956), 1156–1160.
[SW86] Galen R. Shorack and Jon A. Wellner, Empirical processes with applications to statistics, Wiley, 1986.
[Str93] Daniel W. Stroock, Probability theory: an analytic view, Cambridge university press, 1993.
[Wil91] David Williams, Probability with martingales, Cambridge university press, 1991.
Index
λ-system, 14
µ-integrable, 31
π-system, 14
σ-algebra, 7, 177
σ-algebra, P-trivial, 30, 56, 159
σ-algebra, Borel, 10
σ-algebra, Markov, 293, 326
σ-algebra, completion, 14, 276, 277, 284, 291, 292
σ-algebra, countably generated, 10
σ-algebra, cylindrical, 273, 325, 344
σ-algebra, exchangeable, 223
σ-algebra, generated, 10, 20
σ-algebra, induced, 62
σ-algebra, invariant, 234
σ-algebra, optional, 293
σ-algebra, stopped, 185, 232, 293
σ-algebra, tail, 57, 257, 344
σ-algebra, trivial, 8, 160
σ-field, 7
0-1 law, 79
0-1 law, Blumenthal's, 344
0-1 law, Hewitt-Savage, 224
0-1 law, Kolmogorov's, 57, 132, 199, 344
0-1 law, Lévy's, 199
consistent, 61
continuous mapping, 80, 81, 109
continuous modification, 277
continuous, Hölder, 370
convergence almost surely, 24
convergence in Lq , 39, 165
convergence in Lq , weak, 165
convergence in distribution, 104, 141, 351
convergence in measure, 39
convergence in probability, 39, 105, 142,
309
convergence weakly, 352
convergence, bounded, 45
convergence, dominated, 42, 164, 198, 314
convergence, monotone, 33, 41, 163, 314
convergence, of types, 128, 131
convergence, total-variation, 111, 253, 268
convergence, uniformly integrable, 46
convergence, vague, 114
convergence, Vitali's theorem, 46, 48, 222
convergence, weak, 104, 109, 352
convolution, 68
countable representation, 22, 273
counting process, 137, 336
coupling, 105, 360
coupling, Markovian, 253, 254
coupling, monotone, 100
coupon collector's problem, 73, 136
covariance, 71
Cramér-Wold device, 146
DeMorgan's law, 7, 77
density, Cesàro, 17
derivative, Dini, 370, 375
diagonal selection, principle, 115, 301, 314
distribution function, 26, 88, 104, 143
distribution, Bernoulli, 53, 92, 100, 119, 134
distribution, beta, 202
distribution, Binomial, 100, 135
distribution, Cauchy, 122, 131
distribution, exponential, 28, 52, 53, 83,
105, 120, 121, 137, 174, 356
distribution, extreme value, 107
distribution, gamma, 70, 81, 137, 139, 340
distribution, geometric, 53, 73, 83, 105,
211, 342
distribution, multivariate normal, 147, 149,
175, 179, 232, 281, 286
distribution, multivariate normal,
non-degenerate, 149
distribution, normal, 29, 53, 95, 119, 289
distribution, Poisson, 53, 70, 100, 119, 134,
137
distribution, Poisson thinning, 140, 339
distribution, stable, 350
distribution, support, 30
distribution, triangular, 120
Doob's convergence theorem, 194, 301
random variable, 18
random variable, P-degenerate, 30, 128,
131
random variable, P-trivial, 30
random variable, integrable, 32
random variable, lattice, 128
random variables, exchangeable, 223
random vector, 18, 143, 147, 286
random walk, 129, 178, 183, 187, 196, 209–211, 231, 242, 247, 265, 351
random walk, simple, 108, 148, 179, 181,
210, 231, 241
random walk, simple, range, 355
random walk, symmetric, 178, 180, 210, 234
record values, 85, 92, 101, 212
reflection principle, 108, 180, 347, 354
regeneration measure, 258
regeneration times, 258, 345
regular conditional probability, 171, 172
regular conditional probability distribution,
172, 205
renewal theory, 89, 106, 137, 211, 231, 238,
251
renewal times, 89, 231, 238, 251
RMG, 221, 299
ruin probability, 209, 308
sample function, continuous, 278, 288, 309
sample function, RCLL, 282
sample function, right-continuous, 299
sample space, 7
Scheffé's lemma, 43
set function, countably additive, 8, 274
set function, finitely additive, 8
set, boundary, 110
set, continuity, 110
set, cylinder, 61
set, Lebesgue measurable, 50
set, negative, 157
set, negligible, 24
set, null, 14, 277, 291, 308
set, positive, 157
Skorokhod representation, 27, 105, 356,
358, 359
Slutsky's lemma, 106, 128, 356, 363
space, Lq , 32, 165
split mapping, 258
square-integrable, 179
srw, 288
stable law, 131
stable law, domain of attraction, 131
stable law, index, 131
stable law, skewness, 132
stable law, symmetric, 131
standard machine, 33, 49, 50
state space, 227, 320
Stirling's formula, 108
stochastic integral, 299, 319