Great Clarendon Street, Oxford, OX2 6DP,
United Kingdom
Oxford University Press is a department of the University of Oxford.
It furthers the University’s objective of excellence in research, scholarship,
and education by publishing worldwide. Oxford is a registered trade mark of
Oxford University Press in the UK and in certain other countries
© Geoffrey R. Grimmett and David R. Stirzaker 2020
The moral rights of the authors have been asserted
Fourth Edition published in 2020
Reprinted with corrections in 2024
John Arbuthnot
An essay on the usefulness of mathematical learning
25 November 1700
To this may be added, that some of the problems about chance having a great appearance
of simplicity, the mind is easily drawn into a belief, that their solution may be attained
by the mere strength of natural good sense; which generally proving otherwise, and the
mistakes occasioned thereby being not infrequent, it is presumed that a book of this
kind, which teaches to distinguish truth from what seems so nearly to resemble it, will
be looked on as a help to good reasoning.
Abraham de Moivre
The Doctrine of Chances
1717
Preface to the Fourth Edition
The second part of the book is devoted to the theory of random processes. Chapter 6 deals
with Markov chains in discrete and continuous time. The treatment of discrete-time chains is
quite detailed and includes an easy proof of the limit theorem for chains with countably infinite
state spaces. The sections on continuous-time chains provide a significant amplification of
those of the third edition, and constitute an approachable but rigorous account of the principal
theory and applications. Chapter 7 contains a general discussion of convergence, together with
simple, rigorous accounts of the strong law of large numbers, and of martingale convergence.
Each of these two chapters could be used as a basis for a lecture course.
Chapter 8 is an introduction to stochastic processes in their various types; most of these
are studied in detail in later chapters, but we have also aspired there to engage the reader with
wider aspects of probability by including new essays on a number of topics such as Lévy
processes, self-similarity, and stability. Chapters 8–13 provide suitable material for about
five shorter lecture courses on: stationary processes and ergodic theory; renewal processes;
queues; martingales; diffusions and stochastic integration with applications to finance.
We thank those who have read and commented upon sections of this and earlier editions,
and also those readers who have taken the trouble to write to us with notice of imperfections.
Further help in thinning any remaining errors will be greatly appreciated.
6 Markov chains
6.1 Markov processes 239
6.2 Classification of states 246
6.3 Classification of chains 250
6.4 Stationary distributions and the limit theorem 254
6.5 Reversibility 265
6.6 Chains with finitely many states 269
6.7 Branching processes revisited 273
6.8 Birth processes and the Poisson process 276
6.9 Continuous-time Markov chains 288
6.10 Kolmogorov equations and the limit theorem 297
6.11 Birth–death processes and imbedding 306
6.12 Special processes 313
6.13 Spatial Poisson processes 319
6.14 Markov chain Monte Carlo 330
6.15 Problems 339
8 Random processes
8.1 Introduction 406
8.2 Stationary processes 407
8.3 Renewal processes 411
8.4 Queues 414
8.5 The Wiener process 416
8.6 Lévy processes and subordinators 418
8.7 Self-similarity and stability 421
8.8 Time changes 426
8.9 Existence of processes 429
8.10 Problems 431
9 Stationary processes
9.1 Introduction 433
9.2 Linear prediction 436
9.3 Autocovariances and spectra 438
9.4 Stochastic integration and the spectral representation 446
9.5 The ergodic theorem 452
9.6 Gaussian processes 464
9.7 Problems 468
10 Renewals
10.1 The renewal equation 471
10.2 Limit theorems 476
10.3 Excess life 480
10.4 Applications 483
10.5 Renewal–reward processes 491
10.6 Problems 497
11 Queues
11.1 Single-server queues 501
11.2 M/M/1 503
11.3 M/G/1 506
12 Martingales
12.1 Introduction 533
12.2 Martingale differences and Hoeffding’s inequality 538
12.3 Crossings and convergence 544
12.4 Stopping times 550
12.5 Optional stopping 554
12.6 The maximal inequality 559
12.7 Backward martingales and continuous-time martingales 562
12.8 Some examples 567
12.9 Problems 572
13 Diffusion processes
13.1 Introduction 577
13.2 Brownian motion 578
13.3 Diffusion processes 580
13.4 First passage times 590
13.5 Barriers 595
13.6 Excursions and the Brownian bridge 599
13.7 Stochastic calculus 602
13.8 The Itô integral 605
13.9 Itô’s formula 610
13.10 Option pricing 613
13.11 Passage probabilities and potentials 620
13.12 Problems 627
1.1 Introduction
Much of our life is based on the belief that the future is largely unpredictable. For example,
games of chance such as dice or roulette would have few adherents if their outcomes were
known in advance. We express this belief in chance behaviour by the use of words such as
‘random’ or ‘probability’, and we seek, by way of gaming and other experience, to assign
quantitative as well as qualitative meanings to such usages. Our main acquaintance with
statements about probability relies on a wealth of concepts, some more reasonable than others.
A mathematical theory of probability will incorporate those concepts of chance which are
expressed and implicit in common rational understanding. Such a theory will formalize these
concepts as a collection of axioms, which should lead directly to conclusions in agreement with
practical experimentation. This chapter contains the essential ingredients of this construction.
(2) Example. A coin is tossed. There are two possible outcomes, heads (denoted by H) and
tails (denoted by T), so that Ω = {H, T}. We may be interested in the possible occurrences of
the following events:
(a) the outcome is a head;
(b) the outcome is either a head or a tail;
(c) the outcome is both a head and a tail (this seems very unlikely to occur);
(d) the outcome is not a head.
(3) Example. A die is thrown once. There are six possible outcomes depending on which of
the numbers 1, 2, 3, 4, 5, or 6 is uppermost. Thus Ω = {1, 2, 3, 4, 5, 6}. We may be interested
in the following events:
(a) the outcome is the number 1;
(b) the outcome is an even number;
(c) the outcome is even but does not exceed 3;
(d) the outcome is not even.
We see immediately that each of the events of these examples can be specified as a subset
A of the appropriate sample space Ω. In the first example they can be rewritten as
that is to say, F is closed under finite unions and hence under finite intersections also (see
Problem (1.8.3)). This is fine when Ω is a finite set, but we require slightly more to deal with
the common situation when Ω is infinite, as the following example indicates.
(4) Example. A coin is tossed repeatedly until the first head turns up; we are concerned
with the number of tosses before this happens. The set of all possible outcomes is the set
Ω = {ω1 , ω2 , ω3 , . . . }, where ωi denotes the outcome when the first i − 1 tosses are tails
and the i th toss is a head. We may seek to assign a probability to the event A, that the first
head occurs after an even number of tosses, that is, A = {ω2 , ω4 , ω6 , . . . }. This is an infinite
countable union of members of Ω and we require that such a set belong to F in order that we
can discuss its probability.
Thus we also require that the collection of events be closed under the operation of taking
countable unions. Any collection of subsets of Ω with these properties is called a σ-field (or
sometimes a σ -algebra).
(5) Definition. A collection F of subsets of Ω is called a σ-field if it satisfies the following
conditions:
(a) ∅ ∈ F ;
(b) if A1, A2, . . . ∈ F then ⋃_{i=1}^∞ Ai ∈ F;
(c) if A ∈ F then Ac ∈ F.
It follows from Problem (1.8.3) that σ -fields are closed under the operation of taking
countable intersections. Here are some examples of σ -fields.
(6) Example. The smallest σ-field associated with Ω is the collection F = {∅, Ω}.
(8) Example. The power set of Ω, which is written {0, 1}^Ω and contains all subsets of Ω, is
obviously a σ-field. For reasons beyond the scope of this book, when Ω is infinite, its power
set is too large a collection for probabilities to be assigned reasonably to all its members.
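For readers who like to experiment, here is a short Python sketch (an illustration only, assuming a
finite Ω and using nothing beyond the standard library) which constructs the smallest σ-field containing
a given family of subsets by closing the family repeatedly under complements and pairwise unions; on
a finite Ω this closure is exactly the generated σ-field.

from itertools import combinations

def generate_sigma_field(omega, family):
    # Smallest collection containing `family` that is closed under
    # complementation and (finite, hence here countable) unions.
    omega = frozenset(omega)
    sets = {frozenset(), omega} | {frozenset(A) for A in family}
    changed = True
    while changed:
        changed = False
        for A in list(sets):
            if omega - A not in sets:          # closure under complements
                sets.add(omega - A)
                changed = True
        for A, B in combinations(list(sets), 2):
            if A | B not in sets:              # closure under unions
                sets.add(A | B)
                changed = True
    return sets

# The sigma-field generated by {1} and {2,4,6} on a die has the three
# atoms {1}, {3,5}, {2,4,6}, and hence 2**3 = 8 members.
F = generate_sigma_field({1, 2, 3, 4, 5, 6}, [{1}, {2, 4, 6}])
print(len(F))  # 8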
To recapitulate, with any experiment we may associate a pair (Ω, F), where Ω is the
set of all possible outcomes or elementary events and F is a σ-field of subsets of Ω which
contains all the events in whose occurrences we may be interested; henceforth, to call a set
A an event is equivalent to asserting that A belongs to the σ -field in question. We usually
translate statements about combinations of events into set-theoretic jargon; for example, the
event that both A and B occur is written as A ∩ B. Table 1.1 is a translation chart.
2. Let A and B belong to some σ -field F. Show that F contains the sets A ∩ B, A \ B, and A △ B.
3. A conventional knock-out tournament (such as that at Wimbledon) begins with 2^n competitors
and has n rounds. There are no play-offs for the positions 2, 3, . . . , 2^n − 1, and the initial table of
draws is specified. Give a concise description of the sample space of all possible outcomes.
4. Let F be a σ-field of subsets of Ω and suppose that B ∈ F. Show that G = { A ∩ B : A ∈ F} is a
σ -field of subsets of B.
5. Which of the following are identically true? For those that are not, say when they are true.
(a) A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C);
(b) A ∩ (B ∩ C) = (A ∩ B) ∩ C;
(c) (A ∪ B) ∩ C = A ∪ (B ∩ C);
(d) A \ (B ∩ C) = (A \ B) ∪ (A \ C).
1.3 Probability
We wish to be able to discuss the likelihoods of the occurrences of events. Suppose that we
repeat an experiment a large number N of times, keeping the initial conditions as equal as
possible, and in such a way that no outcome is influenced by the other outcomes. Let A be
some event which may or may not occur on each repetition. Our experience of most scientific
experimentation is that the proportion of times that A occurs settles down to some value as N
becomes larger and larger; that is to say, writing N(A) for the number of occurrences of A in
the N trials, the ratio N(A)/N appears to converge to a constant limit as N increases‡. We
can think of the ultimate value of this ratio as being the probability P(A) that A occurs on any
particular trial∗ ; it may happen that the empirical ratio does not behave in a coherent manner
and our intuition fails us at this level, but we shall not discuss this here. In practice, N may be
†Augustus De Morgan is well known for having given the first clear statement of the principle of mathematical
induction. He applauded probability theory with the words: “The tendency of our study is to substitute the
satisfaction of mental exercise for the pernicious enjoyment of an immoral stimulus”.
‡This property of practical experiments is called the stability of statistical ratios by physicists and statisti-
cians, or the empirical law of large numbers by probabilists.
∗ This superficial discussion of probabilities is inadequate in many ways; questioning readers may care to
discuss the philosophical and empirical aspects of the subject amongst themselves (see Appendix III). We do
not define probability by such limits in this text.
taken to be large but finite, and the ratio N(A)/N may be taken as an approximation to P(A).
Clearly, the ratio is a number between zero and one; if A = ∅ then N(∅) = 0 and the ratio
is 0, whilst if A = Ω then N(Ω) = N and the ratio is 1. Furthermore, suppose that A and B
are two disjoint events, each of which may or may not occur at each trial. Then N(A ∪ B) = N(A) + N(B),
and so the ratio N(A ∪ B)/N is the sum of the two ratios N(A)/N and N(B)/N. We now
think of these ratios as representing the probabilities of the appropriate events. The above
relations become P(∅) = 0, P(Ω) = 1, and P(A ∪ B) = P(A) + P(B) for disjoint events A and B.
This discussion suggests that the probability function P should be finitely additive, which is
to say that
if A1, A2, . . . , An are disjoint events, then P(⋃_{i=1}^n Ai) = Σ_{i=1}^n P(Ai).
A glance at Example (1.2.4) suggests the more extensive property that P be countably additive,
in that the corresponding property should hold for countable collections A1 , A2 , . . . of disjoint
events.
These relations are sufficient to specify the desirable properties of a probability function P
applied to the set of events. Any such assignment of likelihoods to the members of F is called
a probability measure. Some individuals refer informally to P as a ‘probability distribution’,
especially when the sample space is finite or countably infinite; this practice is best avoided
since the term ‘probability distribution’ is reserved for another purpose to be encountered in
Chapter 2.
(1) Definition. A probability measure P on (, F ) is a function P : F → [0, 1] satisfying
(a) P(∅) = 0, P() = 1;
(b) if A1, A2, . . . is a collection of disjoint members of F, in that Ai ∩ Aj = ∅ for all pairs
i, j satisfying i ≠ j, then
P(⋃_{i=1}^∞ Ai) = Σ_{i=1}^∞ P(Ai).
The triple (Ω, F, P), comprising a set Ω, a σ-field F of subsets of Ω, and a probability
measure P on (Ω, F), is called a probability space.
A probability measure is a special example of what is called a measure on the pair (Ω, F).
A measure is a function µ : F → [0, ∞) satisfying µ(∅) = 0 together with (b) above. A
measure µ is a probability measure if µ(Ω) = 1.
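When Ω is finite or countably infinite, a probability measure is determined by the probabilities of the
singletons. The short Python sketch below (an illustration only, with a dictionary of point masses
standing in for P on the power set) builds such a measure and exhibits its additivity over disjoint events.

from fractions import Fraction

def make_measure(point_masses):
    # point_masses: dict mapping each outcome w to P({w}); values sum to 1.
    assert sum(point_masses.values()) == 1
    def P(event):
        return sum(point_masses[w] for w in event)
    return P

P = make_measure({i: Fraction(1, 6) for i in range(1, 7)})   # a fair die
print(P({2, 4, 6}))                          # 1/2
print(P({1}) + P({3, 5}) == P({1, 3, 5}))    # True: additivity over disjoint events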
We can associate a probability space (Ω, F, P) with any experiment, and all questions
associated with the experiment can be reformulated in terms of this space. It may seem
natural to ask for the numerical value of the probability P(A) of some event A. The answer
to such a question must be contained in the description of the experiment in question. For
example, the assertion that a fair coin is tossed once is equivalent to saying that heads and
tails have an equal probability of occurring; actually, this is the definition of fairness.
(2) Example. A coin, possibly biased, is tossed once. We can take Ω = {H, T} and F =
{∅, {H}, {T}, Ω}, and a possible probability measure P : F → [0, 1] is given by
P(∅) = 0,   P({H}) = p,   P({T}) = 1 − p,   P(Ω) = 1,
where p is a fixed real number in the interval [0, 1]. If p = 1/2, then we say that the coin is
fair, or unbiased.
(3) Example. A die is thrown once. We can take Ω = {1, 2, 3, 4, 5, 6}, F = {0, 1}^Ω, and
the probability measure P given by
P(A) = Σ_{i∈A} pi   for any A ⊆ Ω,
where p1, p2, . . . , p6 are specified numbers from the interval [0, 1] having unit sum. The
probability that i turns up is pi. The die is fair if pi = 1/6 for each i, in which case
P(A) = |A|/6   for any A ⊆ Ω.
P(A1 ∪ A2 ∪ · · · ∪ An) = Σ_i P(Ai) − Σ_{i<j} P(Ai ∩ Aj) + Σ_{i<j<k} P(Ai ∩ Aj ∩ Ak)
− · · · + (−1)^{n+1} P(A1 ∩ A2 ∩ · · · ∩ An),
where, for example, Σ_{i<j} sums over all unordered pairs (i, j) with i ≠ j.
Proof.
(a) A ∪ Ac = Ω and A ∩ Ac = ∅, so P(A ∪ Ac) = P(A) + P(Ac) = 1.
(b) B = A ∪ (B \ A). This is the union of disjoint sets and therefore P(B) = P(A) + P(B \ A) ≥ P(A).
(d) The proof is by induction on n, and is left as an exercise (see Exercise (1.3.4)).
In Lemma (4b), B \ A denotes the set of members of B which are not in A. In order to
write down the quantity P(B \ A), we require that B \ A belongs to F , the domain of P; this is
always true when A and B belong to F , and to prove this was part of Exercise (1.2.2). Notice
that each proof proceeded by expressing an event in terms of disjoint unions and then applying
P. It is sometimes easier to calculate the probabilities of intersections of events rather than
their unions; part (d) of the lemma is useful then, as we shall discover soon. The next property
of P is more technical, and says that P is a continuous set function; this property is essentially
equivalent to the condition that P is countably additive rather than just finitely additive (see
Problem (1.8.16) also).
(5) Theorem. Continuity of probability measures. Let A1 , A2 , . . . be an increasing se-
quence of events, so that A1 ⊆ A2 ⊆ A3 ⊆ · · · , and write A for their limit:
A = ⋃_{i=1}^∞ Ai = lim_{i→∞} Ai.
Then P(A) = lim_{i→∞} P(Ai). Similarly, if B1, B2, . . . is a decreasing sequence of events with
limit B = ⋂_{i=1}^∞ Bi, then P(B) = lim_{i→∞} P(Bi).
To show the result for decreasing families of events, take complements and use the first part
(exercise).
To recapitulate, statements concerning chance are implicitly related to experiments or
trials, the outcomes of which are not entirely predictable. With any such experiment we can
associate a probability space (Ω, F, P) the properties of which are consistent with our shared
and reasonable conceptions of the notion of chance.
(6) Example. In many cases, and especially in games of chance (as in (3) above), it is natural
to assume that all outcomes are equally likely. In such a case, the probability of an event A is
P(A) = |A|/|Ω|.
Here is some final jargon. An event A is called null if P(A) = 0. If P(A) = 1, we say that
A occurs almost surely (frequently abbreviated to a.s.). Null events should not be confused
with the impossible event ∅. Null events are happening all around us, even though they have
zero probability; after all, what is the chance that a dart strikes any given point of the target at
which it is thrown? That is, the impossible event is null, but null events need not be impossible.
− · · · + (−1)^{n+1} P(A1 ∩ A2 ∩ · · · ∩ An).
In each packet of Corn Flakes may be found a plastic bust of one of the last five Vice-Chancellors
of Cambridge University, the probability that any given packet contains any specific Vice-Chancellor
being 1/5, independently of all other packets. Show that the probability that each of the last three
Vice-Chancellors is obtained in a bulk purchase of six packets is 1 − 3(4/5)^6 + 3(3/5)^6 − (2/5)^6.
5. Let Ar, r ≥ 1, be events such that P(Ar) = 1 for all r. Show that P(⋂_{r=1}^∞ Ar) = 1.
6. You are given that at least one of the events Ar , 1 ≤ r ≤ n, is certain to occur, but certainly no
more than two occur. If P(Ar) = p, and P(Ar ∩ As) = q, r ≠ s, show that p ≥ 1/n and q ≤ 2/n.
7. You are given that at least one, but no more than three, of the events Ar , 1 ≤ r ≤ n, occur, where
n ≥ 3. The probability of at least two occurring is 1/2. If P(Ar) = p, P(Ar ∩ As) = q, r ≠ s, and
P(Ar ∩ As ∩ At) = x, r < s < t, show that p ≥ 3/(2n), and q ≤ 4/n.
If we now think of these ratios as probabilities, we see that the probability that A occurs, given
that B occurs, should be reasonably defined as P(A ∩ B)/P(B).
Probabilistic intuition leads to the same conclusion. Given that an event B occurs, it is the
case that A occurs if and only if A ∩ B occurs. Thus the conditional probability of A given B
should be proportional to P(A ∩ B), which is to say that it equals α P(A ∩ B) for some constant
α = α(B). The conditional probability of Ω given B must equal 1, and thus αP(Ω ∩ B) = 1,
yielding α = 1/P(B).
We formalize these notions as follows.
(1) Definition. If P(B) > 0 then the conditional probability that A occurs given that B
occurs is defined to be
P(A | B) = P(A ∩ B)/P(B).
We denote this conditional probability by P(A | B), pronounced ‘the probability of A given
B’, or sometimes ‘the probability of A conditioned (or conditional) on B’.
(2) Example. Two fair dice are thrown. Given that the first shows 3, what is the probability
that the total exceeds 6? The answer is obviously 1/2, since the second must show 4, 5, or
6. However, let us labour the point. Clearly Ω = {1, 2, 3, 4, 5, 6}^2, the set† of all ordered
pairs (i, j) for i, j ∈ {1, 2, . . . , 6}, and we can take F to be the set of all subsets of Ω, with
P(A) = |A|/36 for any A ⊆ Ω. Let B be the event that the first die shows 3, and A be the
event that the total exceeds 6. Then
B = {(3, b) : 1 ≤ b ≤ 6}, A = {(a, b) : a + b > 6}, A ∩ B = {(3, 4), (3, 5), (3, 6)},
and
P(A | B) = P(A ∩ B)/P(B) = |A ∩ B|/|B| = 3/6.
(3) Example. Boys and girls. A family has two children. What is the probability that both
are boys, given that at least one is a boy? The older and younger child may each be male
or female, so there are four possible combinations of sexes, which we assume to be equally
likely. Hence we can represent the sample space in the obvious way as Ω = {GG, GB, BG, BB},
where P(GG) = P(BB) = P(GB) = P(BG) = 1/4. From the definition of conditional
probability,
P(BB | at least one boy) = P(BB)/P(GB ∪ BG ∪ BB) = (1/4)/(3/4) = 1/3.
A popular but incorrect answer to the question is 1/2. This is the correct answer to another
question: for a family with two children, what is the probability that both are boys given that
the younger is a boy? In this case,
P(BB | younger is a boy) = P(BB)/P(younger is a boy) = (1/4)/(1/2) = 1/2.
(4) Lemma. For any events A and B such that 0 < P(B) < 1,
P(A) = P(A | B)P(B) + P(A | Bc)P(Bc).
More generally, let B1, B2, . . . , Bn be a partition of Ω such that P(Bi) > 0 for all i. Then
P(A) = Σ_{i=1}^n P(A | Bi)P(Bi).
Lemma (4) is often called the law of total probability or the partition theorem for proba-
bilities.
Proof. A = (A ∩ B) ∪ (A ∩ Bc). This is a disjoint union and so
P(A) = P(A ∩ B) + P(A ∩ Bc) = P(A | B)P(B) + P(A | Bc)P(Bc).
(5) Example. We are given two urns, each containing a collection of coloured balls. Urn I
contains two white and three blue balls, whilst urn II contains three white and four blue balls.
A ball is drawn at random from urn I and put into urn II, and then a ball is picked at random
from urn II and examined. What is the probability that it is blue? We assume unless otherwise
specified that a ball picked randomly from any urn is equally likely to be any of those present.
The reader will be relieved to know that we no longer need to describe (Ω, F, P) in detail;
we are confident that we could do so if necessary. Clearly, the colour of the final ball depends
on the colour of the ball picked from urn I. So let us ‘condition’ on this. Let A be the event
that the final ball is blue, and let B be the event that the first one picked was blue. Then, by
Lemma (4),
P(A) = P(A | B)P(B) + P(A | B c )P(B c ).
We can easily find all these probabilities:
P(A | B) = P(A | urn II contains three white and five blue balls) = 5/8,
P(A | Bc) = P(A | urn II contains four white and four blue balls) = 1/2,
P(B) = 3/5,   P(Bc) = 2/5.
Hence
P(A) = (5/8) · (3/5) + (1/2) · (2/5) = 23/40.
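The value 23/40 is easy to check by simulation. The following Python sketch (a Monte Carlo illustration,
not part of the worked solution; the number of trials is arbitrary) repeats the two-urn experiment many
times and prints an estimate close to 23/40 = 0.575.

import random

def experiment():
    urn1 = ['white'] * 2 + ['blue'] * 3
    urn2 = ['white'] * 3 + ['blue'] * 4
    urn2.append(random.choice(urn1))      # move a random ball from urn I to urn II
    return random.choice(urn2) == 'blue'  # then examine a random ball from urn II

trials = 200_000
print(sum(experiment() for _ in range(trials)) / trials)   # close to 0.575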
Unprepared readers may have been surprised by the sudden appearance of urns in this book.
In the seventeenth and eighteenth centuries, lotteries often involved the drawing of slips from
urns, and voting was often a matter of putting slips or balls into urns. In France today, aller aux
urnes is synonymous with voting. It was therefore not unnatural for the numerous Bernoullis
and others to model births, marriages, deaths, fluids, gases, and so on, using urns containing
balls of varied hue.†
(6) Example. Zoggles. Only two factories manufacture zoggles. 20 per cent of the zoggles
from factory I and 5 per cent from factory II are defective. Factory I produces twice as many
zoggles as factory II each week. What is the probability that a zoggle, randomly chosen from
a week’s production, is satisfactory? Clearly this satisfaction depends on the factory of origin.
Let A be the event that the chosen zoggle is satisfactory, and let B be the event that it was
made in factory I. Arguing as before,
P(A) = P(A | B)P(B) + P(A | Bc)P(Bc) = (4/5) · (2/3) + (19/20) · (1/3) = 51/60.
If the chosen zoggle is defective, what is the probability that it came from factory I? In our
notation this is just P(B | Ac ). However,
P(B | Ac) = P(B ∩ Ac)/P(Ac) = P(Ac | B)P(B)/P(Ac) = ((1/5) · (2/3))/(1 − 51/60) = 8/9.
†Readers favouring applications over thought experiments may wish to consider fish in lakes rather than
balls in urns.
one of them and to shoot the other two, but he is not permitted to reveal to any prisoner the
fate of that prisoner. Prisoner A knows therefore that his chance of survival is 1/3. In order
to gain information, he asks the warder to tell him in secret the name of some prisoner (but
not himself) who will be killed, and the warder names prisoner B. What now is prisoner A’s
assessment of the chance that he will survive? Could it be 1/2: after all, he knows now that
the survivor will be either A or C, and he has no information about which? Could it be 1/3:
after all, according to the rules, at least one of B and C has to be killed, and thus the extra
information cannot reasonably affect A’s earlier calculation of the odds? What does the reader
think about this? The resolution of the paradox lies in the situation when either response (B
or C) is possible.
An alternative formulation of this paradox has become known as the Monty Hall problem,
the controversy associated with which has been provoked by Marilyn vos Savant (and many
others) in Parade magazine in 1990; see Exercise (1.4.5).
3. A man possesses five coins, two of which are double-headed, one is double-tailed, and two are
normal. He shuts his eyes, picks a coin at random, and tosses it. What is the probability that the lower
face of the coin is a head?
He opens his eyes and sees that the coin is showing heads; what is the probability that the lower
face is a head? He shuts his eyes again, and tosses the coin again. What is the probability that the
lower face is a head? He opens his eyes and sees that the coin is showing heads; what is the probability
that the lower face is a head?
He discards this coin, picks another at random, and tosses it. What is the probability that it shows
heads?
4. What do you think of the following ‘proof’ by Lewis Carroll that an urn cannot contain two balls
of the same colour? Suppose that the urn contains two balls, each of which is either black or white;
thus, in the obvious notation, P(BB) = P(BW) = P(WB) = P(WW) = 1/4. We add a black ball, so
that P(BBB) = P(BBW) = P(BWB) = P(BWW) = 1/4. Next we pick a ball at random; the chance
that the ball is black is (using conditional probabilities) 1 · (1/4) + (2/3) · (1/4) + (2/3) · (1/4) + (1/3) · (1/4) = 2/3.
However, if there is probability 2/3 that a ball, chosen randomly from three, is black, then there must be two black
and one white, which is to say that originally there was one black and one white ball in the urn.
5. The Monty Hall problem: goats and cars. (a) In a game show, you have to choose one of
three doors. One conceals a new car, two conceal old goats. You choose, but your chosen door is not
opened immediately. Instead the presenter opens another door, which reveals a goat. He offers you
the opportunity to change your choice to the third door (unopened and so far unchosen). Let p be the
(conditional) probability that the third door conceals the car. The presenter’s protocol is:
(i) he is determined to show you a goat; with a choice of two, he picks one at random. Show p = 2/3.
(ii) he is determined to show you a goat; with a choice of two goats (Bill and Nan, say) he shows you
Bill with probability b. Show that, given you see Bill, the probability is 1/(1 + b).
(iii) he opens a door chosen at random irrespective of what lies behind. Show p = 1/2.
(b) Show that, for α ∈ [1/2, 2/3], there exists a protocol such that p = α. Are you well advised to change
your choice to the third door?
(c) In a variant of this question, the presenter is permitted to open the first door chosen, and to reward
you with whatever lies behind. If he chooses to open another door, then this door invariably conceals
a goat. Let p be the probability that the unopened door conceals the car, conditional on the presenter
having chosen to open a second door. Devise protocols to yield the values p = 0, p = 1, and deduce
that, for any α ∈ [0, 1], there exists a protocol with p = α.
6. The prosecutor’s fallacy†. Let G be the event that an accused is guilty, and T the event that
some testimony is true. Some lawyers have argued on the assumption that P(G | T ) = P(T | G).
Show that this holds if and only if P(G) = P(T ).
7. Urns. There are n urns of which the r th contains r − 1 red balls and n − r magenta balls. You
pick an urn at random and remove two balls at random without replacement. Find the probability that:
(a) the second ball is magenta;
(b) the second ball is magenta, given that the first is magenta.
8. Boys and girls, Example (1.4.3) revisited. Consider a family of two children in a population
in which each child is equally likely to be male or female; each child has red hair with probability r;
these characteristics are independent of each other and occur independently between children.
What is the probability that both children are boys given that the family contains at least one red-
haired boy? Show that the probability that both are boys, given that one is a boy born on a Monday,
is 13/27.
1.5 Independence
In general, the occurrence of some event B changes the probability that another event A
occurs, the original probability P(A) being replaced by P(A | B). If the probability remains
unchanged, that is to say P(A | B) = P(A), then we call A and B ‘independent’. This is
well defined only if P(B) > 0. Definition (1.4.1) of conditional probability leads us to the
following.
(1) Definition. Events A and B are called independent if
P(A ∩ B) = P(A)P(B).
Remark. A common student error is to make the fallacious statement that A and B are
independent if A ∩ B = ∅.
It is straightforward to prove that two events A, B are independent if and only if A, B c are
independent. Similarly, for J ⊆ I , a family {Ai : i ∈ I } is independent if and only if the
family {Ai : i ∈ J } ∪ {Akc : k ∈ I \ J }, in which each Ak with k ∈ I \ J is replaced by its
complement, is independent. See Exercise (1.5.10).
†The prosecution made this error in the famous Dreyfus case of 1894.
(3) Example (1.4.6) revisited. The events A and B of this example are clearly dependent
because P(A | B) = 4/5 and P(A) = 51/60.
(4) Example. Choose a card at random from a pack of 52 playing cards, each being picked
with equal probability 1/52. We claim that the suit of the chosen card is independent of its rank.
For example,
P(king) = 4/52,   P(king | spade) = 1/13.
Alternatively,
P(spade king) = 1/52 = (1/4) · (1/13) = P(spade)P(king).
7. Families. Jane has three children, each of which is equally likely to be a boy or a girl independently
of the others. Define the events:
and N. It can be shown that the domain of P may be extended in an obvious way from F to
G; (Ω, G, P) is called the completion of (Ω, F, P).
(B) Product spaces. The probability spaces discussed in this chapter have usually been con-
structed around the outcomes of one experiment, but instances occur naturally when we need
to combine the outcomes of several independent experiments into one space (see Examples
(1.2.4) and (1.4.2)). How should we proceed in general?
Suppose two experiments have associated probability spaces (Ω1, F1, P1) and (Ω2, F2, P2)
respectively. The sample space of the pair of experiments, considered jointly, is the collection
Ω1 × Ω2 = {(ω1, ω2) : ω1 ∈ Ω1, ω2 ∈ Ω2} of ordered pairs. The appropriate σ-field of events
is more complicated to construct. Certainly it should contain all subsets of Ω1 × Ω2 of the form
A1 × A2 = {(a1, a2) : a1 ∈ A1, a2 ∈ A2} where A1 and A2 are typical members of F1 and F2
respectively. However, the family of all such sets, F1 × F2 = {A1 × A2 : A1 ∈ F1, A2 ∈ F2},
is not in general a σ-field. By the discussion after (1), there exists a unique smallest σ-field
G = σ(F1 × F2) of subsets of Ω1 × Ω2 which contains F1 × F2. All we require now is a
suitable probability function on (Ω1 × Ω2, G). Let P12 : F1 × F2 → [0, 1] be given by:
(3)   P12(A1 × A2) = P1(A1)P2(A2)   for A1 ∈ F1, A2 ∈ F2.
It can be shown that the domain of P12 can be extended from F1 × F2 to the whole of
G = σ(F1 × F2). The ensuing probability space (Ω1 × Ω2, G, P12) is called the product
space of (Ω1, F1, P1) and (Ω2, F2, P2). Products of larger numbers of spaces are constructed
similarly. The measure P12 is sometimes called the ‘product measure’ since its defining
equation (3) assumed that two experiments are independent. There are of course many other
measures that can be applied to (Ω1 × Ω2, G).
In many simple cases this technical discussion is unnecessary. Suppose that Ω1 and Ω2
are finite, and that their σ-fields contain all their subsets; this is the case in Examples (1.2.4)
and (1.4.2). Then G contains all subsets of Ω1 × Ω2.
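On finite spaces the construction is entirely concrete. The Python sketch below (an illustration only,
with dictionaries standing in for P1 and P2) forms the product measure of a fair coin and a fair die, and
evaluates the probability of an event depending on both experiments.

from fractions import Fraction

def product_measure(p1, p2):
    # p1, p2: dicts mapping outcomes of Omega1, Omega2 to their probabilities.
    # Returns P12 on Omega1 x Omega2 with P12({(w1, w2)}) = P1({w1}) P2({w2}).
    return {(w1, w2): q1 * q2 for w1, q1 in p1.items() for w2, q2 in p2.items()}

coin = {'H': Fraction(1, 2), 'T': Fraction(1, 2)}
die = {i: Fraction(1, 6) for i in range(1, 7)}
p12 = product_measure(coin, die)

# the event 'head, and an even score on the die' has probability (1/2)(1/2)
event = {(w, i) for (w, i) in p12 if w == 'H' and i % 2 == 0}
print(sum(p12[x] for x in event))   # 1/4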
(1) Example. Consider a series of hands dealt at bridge. Let A be the event that in a given
deal each player has one ace. Show that the probability that A occurs at least once in seven
deals is approximately 1/2.
Solution. The number of ways of dealing 52 cards into four equal hands is 52!/(13!)^4. There
are 4! ways of distributing the aces so that each hand holds one, and there are 48!/(12!)^4 ways
of dealing the remaining cards. Thus
P(A) = (4! · 48!/(12!)^4) / (52!/(13!)^4) ≃ 1/10.
Now let Bi be the event that A occurs for the first time on the ith deal. Clearly Bi ∩ Bj = ∅,
i ≠ j. Thus
P(A occurs in seven deals) = P(B1 ∪ · · · ∪ B7) = Σ_{i=1}^7 P(Bi),   using Definition (1.3.1).
Since successive deals are independent, P(Bi) ≃ (9/10)^{i−1}(1/10), and thus
P(A occurs in seven deals) ≃ Σ_{i=1}^7 (9/10)^{i−1} (1/10) ≃ 1/2.
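The counting above is easy to verify numerically. The following Python sketch (an illustration only,
using exact integer arithmetic) evaluates P(A) and the probability of at least one occurrence in seven
independent deals.

from fractions import Fraction
from math import factorial

# P(A): each of the four hands receives exactly one ace.
pA = Fraction(factorial(4) * factorial(48) // factorial(12)**4,
              factorial(52) // factorial(13)**4)
print(float(pA))                  # about 0.105
print(float(1 - (1 - pA)**7))     # about 0.54, roughly one half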
(2) Example. There are two roads from A to B and two roads from B to C. Each of the four
roads has probability p of being blocked by snow, independently of all the others. What is
the probability that there is an open road from A to C?
Solution.
P(open road) = P((open road from A to B) ∩ (open road from B to C))
= P(open road from A to B) P(open road from B to C),
using the independence. However, p is the same for all roads; thus, using Lemma (1.3.4),
P(open road) = {1 − P(no road from A to B)}^2
= {1 − P((first road blocked) ∩ (second road blocked))}^2
= {1 − P(first road blocked) P(second road blocked)}^2
(3)   = (1 − p^2)^2.
Further suppose that there is also a direct road from A to C, which is independently blocked
with probability p. Then, by Lemma (1.4.4) and equation (3),
P(open road) = P(open road | direct road blocked) p + P(open road | direct road open)(1 − p)
= (1 − p^2)^2 p + 1 · (1 − p).
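Reliability calculations of this kind are easily checked by simulation. The sketch below (an illustration
only, with p = 0.3 chosen arbitrarily) samples the four road blockages independently and compares the
empirical estimate with (1 − p^2)^2 for the two-stage network considered first.

import random

def open_route(p):
    # Roads from A to B: r1, r2; roads from B to C: r3, r4. True means blocked.
    r1, r2, r3, r4 = (random.random() < p for _ in range(4))
    return (not (r1 and r2)) and (not (r3 and r4))

p, trials = 0.3, 200_000
est = sum(open_route(p) for _ in range(trials)) / trials
print(est, (1 - p**2) ** 2)   # both about 0.828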
(4) Example. Symmetric random walk (or ‘Gambler’s ruin’). A man is saving up to buy
a new Jaguar at a cost of N units of money. He starts with k units where 0 < k < N, and
tries to win the remainder by the following gamble with his bank manager. He tosses a fair
coin repeatedly; if it comes up heads then the manager pays him one unit, but if it comes up
tails then he pays the manager one unit. He plays this game repeatedly until one of two events
occurs: either he runs out of money and is bankrupted or he wins enough to buy the Jaguar.
What is the probability that he is ultimately bankrupted?
Solution. This is one of many problems the solution to which proceeds by the construction
of a linear difference equation subject to certain boundary conditions. Let A denote the event
that he is eventually bankrupted, and let B be the event that the first toss of the coin shows
heads. By Lemma (1.4.4),
(5)   Pk(A) = Pk(A | B)Pk(B) + Pk(A | Bc)Pk(Bc) = (1/2)Pk(A | B) + (1/2)Pk(A | Bc),
where Pk denotes probabilities calculated relative to the starting point k. We want to find
Pk (A). Consider Pk (A | B). If the first toss is a head then his capital increases to k + 1 units
and the game starts afresh from a different starting point. Thus Pk (A | B) = Pk+1 (A) and
similarly Pk(A | Bc) = Pk−1(A). So, writing pk = Pk(A), (5) becomes
(6)   pk = (1/2)pk+1 + (1/2)pk−1   if 0 < k < N.
The general solution to (6) has the form pk = b0 + b1k. The boundary conditions p0 = 1 and
pN = 0 imply that b0 = 1 and b1 = −1/N, giving
(7)   Pk(A) = 1 − k/N.
As the price of the Jaguar rises, that is as N → ∞, ultimate bankruptcy becomes very likely.
This is the problem of the ‘symmetric random walk with two absorbing barriers’ to which we
shall return in more generality later.
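The conclusion (7) may be observed in simulation. The following sketch (an illustration only, with
N = 10 and k = 3 chosen arbitrarily) plays the fair game to absorption and compares the empirical ruin
frequency with 1 − k/N.

import random

def ruined(k, N):
    while 0 < k < N:
        k += 1 if random.random() < 0.5 else -1   # fair coin: win or lose one unit
    return k == 0

k, N, trials = 3, 10, 100_000
print(sum(ruined(k, N) for _ in range(trials)) / trials, 1 - k / N)   # both about 0.7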
Remark. Our experience of student calculations leads us to stress that probabilities lie be-
tween zero and one; any calculated probability which violates this must be incorrect.
(8) Example. Testimony. A court is investigating the possible occurrence of an unlikely event
T . The reliability of two independent witnesses called Alf and Bob is known to the court:
Alf tells the truth with probability α and Bob with probability β, and there is no collusion
between the two of them. Let A and B be the events that Alf and Bob assert (respectively)
that T occurred, and let τ = P(T ). What is the probability that T occurred given that both
Alf and Bob declare that T occurred?
Solution. We are asked to calculate P(T | A ∩ B), which is equal to P(T ∩ A ∩ B)/P(A ∩ B).
Now P(T ∩ A ∩ B) = P(A ∩ B | T)P(T) and
P(A ∩ B) = P(A ∩ B | T)P(T) + P(A ∩ B | Tc)P(Tc).
We have from the independence of the witnesses that A and B are conditionally independent
given either T or Tc. Therefore
P(A ∩ B | T) = P(A | T)P(B | T) = αβ,   P(A ∩ B | Tc) = P(A | Tc)P(B | Tc) = (1 − α)(1 − β),
so that
P(T | A ∩ B) = αβτ / {αβτ + (1 − α)(1 − β)(1 − τ)}.
As an example, suppose that α = β = 9/10 and τ = 1/1000. Then P(T | A ∩ B) = 81/1080,
which is somewhat small as a basis for a judicial conclusion.
This calculation may be informative. However, it is generally accepted that such an appli-
cation of the axioms of probability is inappropriate to questions of truth and belief.
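The final formula is easily evaluated for other values of α, β, and τ. The sketch below (an illustration
only) reproduces the figure 81/1080 quoted above.

def posterior(alpha, beta, tau):
    # P(T | A and B) for conditionally independent witnesses.
    num = alpha * beta * tau
    return num / (num + (1 - alpha) * (1 - beta) * (1 - tau))

print(posterior(0.9, 0.9, 0.001))   # 0.075 = 81/1080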
(9) Example. Zoggles revisited. A new process for the production of zoggles is invented,
and both factories of Example (1.4.6) install extra production lines using it. The new process
is cheaper but produces fewer reliable zoggles, only 75 per cent of items produced in this new
way being reliable.
Factory I fails to implement its new production line efficiently, and only 10 per cent of its
output is made in this manner. Factory II does better: it produces 20 per cent of its output by
the new technology, and now produces twice as many zoggles in all as Factory I.
Is the new process beneficial to the consumer?
Solution. Both factories now produce a higher proportion of unreliable zoggles than before,
and so it might seem at first sight that there is an increased proportion of unreliable zoggles
on the market.
Let A be the event that a randomly chosen zoggle is satisfactory, B the event that it came
from factory I, and C the event that it was made by the new method. Then
(Figure: rectangles D1, D2, D3, D4 and R1, R2, with side lengths a, b, c, d, e, f, g, h, illustrating
Simpson’s paradox below.)
(10) Example. Simpson’s paradox†. A doctor has performed clinical trials to determine
the relative efficacies of two drugs, with the following results.
(Table: numbers of successes and failures under Drug I and Drug II, recorded separately for women
and for men.)
The trial results are such that, in the notation introduced below,
P(A | B ∩ C) > P(A | Bc ∩ C)   and   P(A | B ∩ Cc) > P(A | Bc ∩ Cc),
but
P(A | B) < P(A | Bc).
We may think of A as the event that treatment is successful, B as the event that drug I is given
to a randomly chosen individual, and C as the event that this individual is female. The above
inequalities imply that B is preferred to B c when C occurs and when C c occurs, but B c is
preferred to B overall.
†This paradox, named after Simpson (1951), was remarked by Yule in 1903, having been observed previously
by Pearson in 1899. The nomenclature is an instance of Stigler’s law of eponymy: “No law, theorem, or discovery
is named after its originator”. This law applies to many eponymous statements in this book, including the law
itself. As remarked by A. N. Whitehead, “Everything of importance has been said before, by somebody who
did not discover it”.
Setting
area (D1 ) > area (D2 ), area (D3 ) > area (D4 ), area (R1 ) < area (R2 ).
Many such rectangles may be found, by inspection, as for example those with a = 3/30, b = 1/30,
c = 8/30, d = 3/30, e = 3/30, f = 8/30, g = 1/30, h = 3/30. Similar conclusions are valid for finer
partitions {Ci : i ∈ I } of the sample space, though the corresponding pictures are harder to
draw.
Simpson’s paradox has arisen many times in practical situations. There are many well-
known cases, including the admission of graduate students to the University of California at
Berkeley and a clinical trial comparing treatments for kidney stones.
(14) Example. False positives. A rare disease affects one person in 10^5. A test for the
disease shows positive with probability 99/100 when applied to an ill person, and with
probability 1/100 when applied to a healthy person. What is the probability that you have the disease given
that the test shows positive?
Solution. In the obvious notation,
P(ill | +) = P(+ | ill)P(ill) / {P(+ | ill)P(ill) + P(+ | healthy)P(healthy)}
= (99/100) · 10^−5 / {(99/100) · 10^−5 + (1/100)(1 − 10^−5)}
= 99/(99 + 10^5 − 1) ≃ 1/1011.
The chance of being ill is rather small. Indeed it is more likely that the test was incorrect.
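Computations with Bayes’s formula of this kind are conveniently scripted. The sketch below (an
illustration only) reproduces the figure of roughly 1/1011.

def p_ill_given_positive(prevalence, sensitivity, false_positive):
    num = sensitivity * prevalence
    return num / (num + false_positive * (1 - prevalence))

print(p_ill_given_positive(1e-5, 0.99, 0.01))   # about 0.00099, i.e. roughly 1/1011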
3. A symmetric random walk takes place on the integers 0, 1, 2, . . . , N with absorbing barriers at 0
and N , starting at k. Show that the probability that the walk is never absorbed is zero.
4. The so-called ‘sure thing principle’ asserts that if you prefer x to y given C, and also prefer x to
y given C c , then you surely prefer x to y. Agreed?
5. A pack contains m cards, labelled 1, 2, . . . , m. The cards are dealt out in a random order, one
by one. Given that the label of the kth card dealt is the largest of the first k cards dealt, what is the
probability that it is also the largest in the pack?
6. A group of 2b friends meet for a bridge soirée. There are m men and 2b − m women where
2 ≤ m ≤ b. The group divides into b teams of pairs, formed uniformly at random. What is the
probability that no pair comprises 2 men?
1.8 Problems
1. A traditional fair die is thrown twice. What is the probability that:
(a) a six turns up exactly once?
(b) both numbers are odd?
(c) the sum of the scores is 4?
(d) the sum of the scores is divisible by 3?
2. A fair coin is thrown repeatedly. What is the probability that on the nth throw:
(a) a head appears for the first time?
(b) the numbers of heads and tails to date are equal?
(c) exactly two heads have appeared altogether to date?
(d) at least two heads have appeared to date?
3. Let F and G be σ-fields of subsets of Ω.
(a) Use elementary set operations to show that F is closed under countable intersections; that is, if
A1, A2, . . . are in F, then so is ⋂_i Ai.
(b) Let H = F ∩ G be the collection of subsets of Ω lying in both F and G. Show that H is a σ-field.
(c) Show that F ∪ G, the collection of subsets of Ω lying in either F or G, is not necessarily a σ-field.
4. Describe the underlying probability spaces for the following experiments:
(a) a biased coin is tossed three times;
(b) two balls are drawn without replacement from an urn which originally contained two ultramarine
and two vermilion balls;
(c) a biased coin is tossed repeatedly until a head turns up.
5. Show that the probability that exactly one of the events A and B occurs is P(A) + P(B) − 2P(A ∩ B).
10. Let B1, B2, . . . be a partition of the sample space Ω, each Bi having positive probability, and
show that
P(A) = Σ_{j=1}^∞ P(A | Bj)P(Bj).
13. Let A1 , A2 , . . . , An be events, and let Nk be the event that exactly k of the Ai occur. Prove the
result sometimes referred to as Waring’s theorem:
P(Nk) = Σ_{i=0}^{n−k} (−1)^i (k+i choose k) S_{k+i},   where   S_j = Σ_{i1<i2<···<ij} P(Ai1 ∩ Ai2 ∩ · · · ∩ Aij).
Use this result to find an expression for the probability that a purchase of six packets of Corn Flakes
yields exactly three distinct busts (see Exercise (1.3.4)).
14. Prove Bayes’s formula: if A1, A2, . . . , An is a partition of Ω, each Ai having positive probability,
then
P(Aj | B) = P(B | Aj)P(Aj) / Σ_{i=1}^n P(B | Ai)P(Ai).
15. A random number N of dice is thrown. Let Ai be the event that N = i, and assume that
P(Ai) = 2^−i, i ≥ 1. The sum of the scores is S. Find the probability that:
(a) N = 2 given S = 4;
(b) S = 4 given N is even;
(c) N = 2, given that S = 4 and the first die showed 1;
(d) the largest number shown by any die is r , where S is unknown.
16. Let A1 , A2 , . . . be a sequence of events. Define
Bn = ⋃_{m=n}^∞ Am,   Cn = ⋂_{m=n}^∞ Am.
Clearly Cn ⊆ An ⊆ Bn . The sequences {Bn } and {Cn } are decreasing and increasing respectively
with limits
lim Bn = B = ⋂_n Bn = ⋂_n ⋃_{m≥n} Am,   lim Cn = C = ⋃_n Cn = ⋃_n ⋂_{m≥n} Am.
The events B and C are denoted lim supn→∞ An and lim inf n→∞ An respectively. Show that
(a) B = {ω ∈ Ω : ω ∈ An for infinitely many values of n},
(b) C = {ω ∈ Ω : ω ∈ An for all but finitely many values of n}.
We say that the sequence { An } converges to a limit A = lim An if B and C are the same set A. Suppose
that An → A and show that
(c) A is an event, in that A ∈ F,
(d) P(An ) → P(A).
17. In Problem (1.8.16) above, show that B and C are independent whenever Bn and Cn are inde-
pendent for all n. Deduce that if this holds and furthermore An → A, then P(A) equals either zero or
one.
18. Show that the assumption that P is countably additive is equivalent to the assumption that P is
continuous. That is to say, show that if a function P : F → [0, 1] satisfies P(∅) = 0, P() = 1, and
P(A ∪ B) = P(A) + P(B) whenever A, B ∈ F and A ∩ B = ∅, then P is countably additive (in the
sense of satisfying Definition (1.3.1b)) if and only if P is continuous (in the sense of Lemma (1.3.5)).
19. Anne, Betty, Chloë, and Daisy were all friends at school. Subsequently each of the (4 choose 2) = 6
subpairs meet up; at each of the six meetings the pair involved quarrel with some fixed probability
p, or become firm friends with probability 1 − p. Quarrels take place independently of each other.
In future, if any of the four hears a rumour, then she tells it to her firm friends only. If Anne hears a
rumour, what is the probability that:
(a) Daisy hears it?
(b) Daisy hears it if Anne and Betty have quarrelled?
(c) Daisy hears it if Betty and Chloë have quarrelled?
(d) Daisy hears it if she has quarrelled with Anne?
20. A biased coin is tossed repeatedly. Each time there is a probability p of a head turning up. Let pn
be the probability that an even number of heads has occurred after n tosses (zero is an even number).
Show that p0 = 1 and that pn = p(1 − pn−1 ) + (1 − p) pn−1 if n ≥ 1. Solve this difference equation.
21. A biased coin is tossed repeatedly. Find the probability that there is a run of r heads in a row
before there is a run of s tails, where r and s are positive integers.
22. (a) A bowl contains twenty cherries, exactly fifteen of which have had their stones removed. A
greedy pig eats five whole cherries, picked at random, without remarking on the presence or
absence of stones. Subsequently, a cherry is picked randomly from the remaining fifteen.
(i) What is the probability that this cherry contains a stone?
(ii) Given that this cherry contains a stone, what is the probability that the pig consumed at least
one stone?
(b) 100 contestants buy numbered lottery tickets for a reverse raffle, in which the last ticket drawn
from an urn is the winner. Halfway through the draw, the Mistress of Ceremonies discovers that
10 tickets have inadvertently not been added to the urn, so she adds them, and continues the draw.
Is the lottery fair?
23. The ‘ménages’ problem poses the following question. Some consider it to be desirable that men
and women alternate when seated at a circular table. If n heterosexual couples are seated randomly
according to this rule, show that the probability that nobody sits next to his or her partner is
(1/n!) Σ_{k=0}^n (−1)^k {2n/(2n − k)} (2n−k choose k) (n − k)!.
You may find it useful to show first that the number of ways of selecting k non-overlapping pairs of
adjacent seats is (2n−k choose k) · 2n(2n − k)^−1.
24. An urn contains b blue balls and r red balls. They are removed at random and not replaced. Show
that the probability that the first red ball drawn is the (k + 1)th ball drawn equals
(r+b−k−1 choose r−1) / (r+b choose b).
Find the probability that the last ball drawn is red.
25. An urn contains a azure balls and c carmine balls, where ac ≠ 0. Balls are removed at random
and discarded until the first time that a ball (B, say) is removed having a different colour from its
predecessor. The ball B is now replaced and the procedure restarted. This process continues until the
last ball is drawn from the urn. Show that this last ball is equally likely to be azure or carmine.
26. Protocols. A pack of four cards contains one spade, one club, and the two red aces. You deal
two cards faces downwards at random in front of a truthful friend. She inspects them and tells you
that one of them is the ace of hearts. What is the chance that the other card is the ace of diamonds?
Perhaps 1/3?
Suppose that your friend’s protocol was:
(a) with no red ace, say “no red ace”,
(b) with the ace of hearts, say “ace of hearts”,
(c) with the ace of diamonds but not the ace of hearts, say “ace of diamonds”.
Show that the probability in question is 1/3.
Devise a possible protocol for your friend such that the probability in question is zero.
27. Eddington’s controversy. Four witnesses, A, B, C, and D, at a trial each speak the truth with
probability 1/3 independently of each other. In their testimonies, A claimed that B denied that C declared
that D lied. What is the (conditional) probability that D told the truth? [This problem seems to have
appeared first as a parody in a university magazine of the ‘typical’ Cambridge Philosophy Tripos
question.]
28. The probabilistic method. 10 per cent of the surface of a sphere S is coloured blue, the rest is red.
Show that, irrespective of the manner in which the colours are distributed, it is possible to inscribe a
cube in S with all its vertices red.
29. Repulsion. The event A is said to be repelled by the event B if P(A | B) < P(A), and to be
attracted by B if P(A | B) > P(A). Show that if B attracts A, then A attracts B, and B c repels A.
If A attracts B, and B attracts C, does A attract C?
30. Birthdays. At a lecture, there are m students born on independent days in 2007.
(a) With 2 ≤ m ≤ 365, show that the probability that at least two of them share a birthday is
p = 1 − (365)!/{(365 − m)! 365^m}. Show that p > 1/2 when m = 23.
(b) With 2 ≤ m ≤ 366, find the probability p1 that exactly one pair of individuals share a birthday,
with no others sharing.
(c) Suppose m students are born on independent random days on the planet Magrathea, whose year has
M ≫ m days. Show that the probability p0 that no two students share a birthday is approximately
exp(−m(m − 1)/(2M)) for large M.
31. Lottery. You choose r of the first n positive integers, and a lottery chooses a random subset L of
the same size. What is the probability that:
(a) L includes no consecutive integers?
(b) L includes exactly one pair of consecutive integers?
(c) the numbers in L are drawn in increasing order?
(d) your choice of numbers is the same as L?
(e) there are exactly k of your numbers matching members of L?
32. Bridge. During a game of bridge, you are dealt at random a hand of thirteen cards. With an
obvious notation, show that P(4S, 3H, 3D, 3C) ≃ 0.026 and P(4S, 4H, 3D, 2C) ≃ 0.018. However
if suits are not specified, so numbers denote the shape of your hand, show that P(4, 3, 3, 3) ≃ 0.11
and P(4, 4, 3, 2) ≃ 0.22.
33. Poker. During a game of poker, you are dealt a five-card hand at random. With the convention
that aces may count high or low, show that:
34. Poker dice. There are five dice each displaying 9, 10, J, Q, K, A. Show that, when rolled:
35. You are lost in the National Park of Bandrika†. Tourists comprise two-thirds of the visitors to
the park, and give a correct answer to requests for directions with probability 3/4. (Answers to repeated
questions are independent, even if the question and the person are the same.) If you ask a Bandrikan
for directions, the answer is always false.
(a) You ask a passer-by whether the exit from the Park is East or West. The answer is East. What is
the probability this is correct?
(b) You ask the same person again, and receive the same reply. Show the probability that it is correct
is 1/2.
(c) You ask the same person again, and receive the same reply. What is the probability that it is
correct?
(d) You ask for the fourth time, and receive the answer East. Show that the probability it is correct
is 27/70.
(e) Show that, had the fourth answer been West instead, the probability that East is nevertheless
correct is 9/10.
36. Mr Bayes goes to Bandrika. Tom is in the same position as you were in the previous problem,
but he has reason to believe that, with probability ε, East is the correct answer. Show that:
(a) whatever answer first received, Tom continues to believe that East is correct with probability ε,
(b) if the first two replies are the same (that is, either WW or EE), Tom continues to believe that East
is correct with probability ε,
(c) after three like answers, Tom will calculate as follows, in the obvious notation:
P(East correct | EEE) = 9ε/(11 − 2ε),   P(East correct | WWW) = 11ε/(9 + 2ε).
Evaluate these when ε = 9/20.
37. Bonferroni’s inequality. Show that
P(⋃_{r=1}^n Ar) ≥ Σ_{r=1}^n P(Ar) − Σ_{r<k} P(Ar ∩ Ak).
39. The lost boarding pass‡. The n passengers for a Bell-Air flight in an airplane with n seats have
been told their seat numbers. They get on the plane one by one. The first person sits in the wrong seat.
Subsequent passengers sit in their assigned seats whenever they find them available, or otherwise in a
†A fictional country made famous in the Hitchcock film ‘The Lady Vanishes’.
‡The authors learned of this problem from David Bell in 2000 or earlier.
randomly chosen empty seat. What is the probability that the last passenger finds his or her assigned
seat to be free?
What is the answer if the first person sits in a seat chosen uniformly at random from the n available?
40. Flash’s problem. A number n of spaceships land independently and uniformly at random on the
surface of planet Mongo. Each ship controls the hemisphere of which it is the centre. Show that the
probability that every point on Mongo is controlled by at least one ship is 1 − 2^−n(n^2 − n + 2). [Hint:
n great circles almost surely partition the surface of the sphere into n^2 − n + 2 disjoint regions.]
41. Let X be uniformly distributed on {1, 2, . . . , n − 1}, where n ≥ 2. Given X, a team of size X is
selected at random from a pool of n players (including you), each such subset of size X being equally
likely. Call the selected team A, and the remainder team B.
(a) What is the probability that your team has size k?
(b) Each team picks a captain uniformly at random from its members. What is the probability your
team has size k given that you are chosen as captain?
42. Alice and Bob flip a fair coin in turn. A wins if she gets a head, provided her preceding flip was
a tail; B wins if he gets a tail, provided his preceding flip was a head. Let n ≥ 3. Show that the
probability the game ends on the nth flip is (n + 1)(n − 1)/2^{n+2} if n is odd, and (n + 2)(n − 2)/2^{n+2}
if even.
What is the probability that A wins the game?
43. A coin comes up heads with probability p ∈ (0, 1). Let k ≥ 1, and let ρm be the probability that,
in m (≥ 1) coin flips, the longest run of consecutive heads has length strictly less than k. Show that
ρm − ρm−1 + (1 − p) p^k ρm−k−1 = 0,   m ≥ k + 1,
(1) Example. A fair coin is tossed twice: Ω = {HH, HT, TH, TT}. For ω ∈ Ω, let X(ω) be
the number of heads, so that X(HH) = 2, X(HT) = X(TH) = 1, X(TT) = 0.
Now suppose that a gambler wagers his fortune of £1 on the result of this experiment. He
gambles cumulatively so that his fortune is doubled each time a head appears, and is annihilated
on the appearance of a tail. His subsequent fortune W is a random variable given by
W(HH) = 4,   W(HT) = W(TH) = W(TT) = 0.
Figure 2.1. The distribution function FX of the random variable X of Examples (1) and (5).
After the experiment is done and the outcome ω ∈ Ω is known, a random variable X : Ω → R
takes some value. In general this numerical value is more likely to lie in certain subsets
of R than in certain others, depending on the probability space (Ω, F, P) and the function X
itself. We wish to be able to describe the distribution of the likelihoods of possible values of
X. Example (1) above suggests that we might do this through the function f : R → [0, 1]
defined by
f (x) = probability that X is equal to x,
but this turns out to be inappropriate in general. Rather, we use the distribution function
F : R → [0, 1] defined by
(2)   F(x) = P(X ≤ x).
If you so desire, you may pay no attention to the technical condition in the definition
and think of random variables simply as functions mapping Ω into R. We shall always use
upper-case letters, such as X, Y , and Z , to represent generic random variables, whilst lower-
case letters, such as x, y, and z, will be used to represent possible numerical values of these
variables. Do not confuse this notation in your written work.
Every random variable has a distribution function, given by (2); distribution functions are
very important and useful.
(4) Definition. The distribution function of a random variable X is the function F : R →
[0, 1] given by F(x) = P(X ≤ x).
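For the random variable X of Example (1), the distribution function can be tabulated directly from the
probabilities of the four outcomes. The sketch below (an illustration only) evaluates F at a few points
and agrees with Figure 2.1.

from itertools import product

# Sample space of two fair coin tosses; each outcome has probability 1/4.
omega = list(product('HT', repeat=2))
prob = {w: 0.25 for w in omega}
X = {w: w.count('H') for w in omega}          # number of heads

def F(x):
    # Distribution function F(x) = P(X <= x).
    return sum(prob[w] for w in omega if X[w] <= x)

print([F(x) for x in (-1, 0, 0.5, 1, 2)])   # [0.0, 0.25, 0.25, 0.75, 1.0]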
Figure 2.2. The distribution function FW of the random variable W of Examples (1) and (5).
and is sketched in Figure 2.2. This illustrates the important point that the distribution function
of a random variable X tells us about the values taken by X and their relative likelihoods,
rather than about the sample space and the collection of events.
Proof.
(a) Let Bn = {ω ∈ : X (ω) ≤ −n} = {X ≤ −n}. The sequence B1 , B2 , . . . is decreasing
with the empty set as limit. Thus, by Theorem (1.3.5), P(Bn ) → P(∅) = 0. The other
part is similar.
(b) Let A(x) = {X ≤ x}, A(x, y) = {x < X ≤ y}. Then A(y) = A(x) ∪ A(x, y) is a
disjoint union, and so by Definition (1.3.1), P(A(y)) = P(A(x)) + P(A(x, y)),
giving
F(y) = F(x) + P(x < X ≤ y) ≥ F(x).
(c) This is an exercise. Use Theorem (1.3.5).
Actually, this lemma characterizes distribution functions. That is to say, F is the distribution
function of some random variable if and only if it satisfies (6a), (6b), and (6c).
For the time being we can forget all about probability spaces and concentrate on random
variables and their distribution functions. The distribution function F of X contains a great
deal of information about X.
(7) Example. Constant variables. The simplest random variable takes a constant value on
the whole domain Ω. Let c ∈ R and define X : Ω → R by X(ω) = c for all ω ∈ Ω.
Slightly more generally, we call X constant (almost surely) if there exists c ∈ R such that
P(X = c) = 1. A constant random variable is sometimes said to be degenerate or trivial.
(8) Example. Bernoulli variables. Consider a coin toss with Ω = {H, T}, in which a head turns up with probability p (= 1 − q). Define X : Ω → R by
X(H) = 1,   X(T) = 0.
Then X is the simplest non-trivial random variable, having two possible values, 0 and 1. Its
distribution function F(x) = P(X ≤ x) is
F(x) = 0 if x < 0, = 1 − p if 0 ≤ x < 1, = 1 if x ≥ 1.
(9) Example. Indicator functions. A particular class of Bernoulli variables is very useful
in probability theory. Let A be an event and let IA : Ω → R be the indicator function of A;
that is,
IA(ω) = 1 if ω ∈ A, and IA(ω) = 0 if ω ∈ Ac.
Then IA is a Bernoulli random variable taking the values 1 and 0 with probabilities P(A) and
P(Ac) respectively. Suppose {Bi : i ∈ I} is a family of disjoint events with A ⊆ ⋃_{i∈I} Bi.
Then
(10)   IA = Σ_i I_{A∩Bi},
where x is large and positive. We shall see later that the rates at which the Ti decay to zero
as x → ∞ have a substantial effect on the existence or non-existence of certain associated
quantities called the ‘moments’ of the distribution.
course the existence of a corresponding probability space (Ω, F, P), but we do not plan to get
bogged down in such matters here. We think of Ai as being the event 'that A occurs on the ith
experiment'. We write Sn = Σ_{i=1}^{n} I_{Ai}, the sum of the indicator functions of A1, A2, . . . , An;
Sn is a random variable which counts the number of occurrences of Ai for 1 ≤ i ≤ n (certainly
Sn is a function on Ω, since it is the sum of such functions, and it is left as an exercise to show
that Sn is F-measurable). The following result concerning the ratio n^{−1}Sn was proved by
James Bernoulli before 1692.
(1) Theorem. It is the case that n^{−1}Sn converges to p as n → ∞ in the sense that, for all
ε > 0,
P(p − ε ≤ n^{−1}Sn ≤ p + ε) → 1   as n → ∞.
There are certain technicalities involved in the study of the convergence of random variables
(see Chapter 7), and this is the reason for the careful statement of the theorem. For the time
being, we encourage the reader to interpret the theorem as asserting simply that the proportion
n^{−1}Sn of times that the events A1, A2, . . . , An occur converges as n → ∞ to their common
probability p. We shall see later how important it is to be careful when making such statements.
Interpreted in terms of tosses of a fair coin, the theorem implies that the proportion of heads
is (with large probability) near to 1/2. As a caveat regarding the difficulties inherent in studying
the convergence of random variables, we remark that it is not true that, in a ‘typical’ sequence
of tosses of a fair coin, heads outnumber tails about one-half of the time.
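As an informal illustration of Theorem (1) (a simulation sketch, not part of the text; the sample sizes are arbitrary), one may watch the proportion of heads settle down near p.

    import random

    random.seed(1)
    p = 0.5
    for n in [10, 100, 1000, 10000, 100000]:
        s_n = sum(1 for _ in range(n) if random.random() < p)   # S_n = number of heads in n tosses
        print(n, s_n / n)                                        # the proportion approaches p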
Proof. Suppose that we toss a coin repeatedly, and heads occurs on each toss with probability
p. The random variable Sn has the same probability distribution as the number Hn of heads
which occur during the first n tosses, which is to say that P(Sn = k) = P(Hn = k) for all k.
It follows that, for small positive values of ε,
P(n^{−1}Sn ≥ p + ε) = Σ_{k ≥ n(p+ε)} P(Hn = k).
by the binomial theorem. It is a simple exercise to show that e^x ≤ x + e^{x^2} for x ∈ R. With
the aid of this inequality, we obtain
(3)   P(n^{−1}Sn ≥ p + ε) ≤ e^{−λnε}(pe^{λq} + qe^{−λp})^n ≤ e^{λ^2 n − λnε}.
3. Let {X r : r ≥ 1} be observations which are independent and identically distributed with unknown
distribution function F. Describe and justify a method for estimating F(x).
We shall see that the distribution function of a discrete variable has jump discontinuities
at the values x 1 , x 2 , . . . and is constant in between; such a distribution is called atomic. This
contrasts sharply with the other important class of distribution functions considered here.
(2) Definition. The random variable X is called continuous if its distribution function can
be expressed as
F(x) = ∫_{−∞}^{x} f(u) du,   x ∈ R,
for some integrable function f : R → [0, ∞) called the (probability) density function of X.
(3) Example. Discrete variables. The variables X and W of Example (2.1.1) take values in
the sets {0, 1, 2} and {0, 4} respectively; they are both discrete.
(4) Example. Continuous variables. A straight rod is flung down at random onto a horizontal
plane and the angle ω between the rod and true north is measured. The result is a number
in Ω = [0, 2π). Never mind about F for the moment; we can suppose that F contains
all nice subsets of Ω, including the collection of open subintervals such as (a, b), where
0 ≤ a < b < 2π. The implicit symmetry suggests the probability measure P which satisfies
P((a, b)) = (b − a)/(2π); that is to say, the probability that the angle lies in some interval is
†The first such example of a nowhere dense subset of [0, 1] with strictly positive Lebesgue measure is in
fact due to H. J. S. Smith, whose discovery predated that of Cantor by eight years. This is an imperfect example
of Stigler’s law of eponymy since Smith’s set was to base 4 and Cantor’s to base 3.
Figure 2.3. The distribution function FX of the random variable X in Example (5).
directly proportional to the length of the interval. Here are two random variables X and Y :
X(ω) = ω,   Y(ω) = ω^2.
Notice that Y is a function of X in that Y = X^2. The distribution functions of X and Y are
FX(x) = 0 if x ≤ 0, = x/(2π) if 0 ≤ x < 2π, = 1 if x ≥ 2π,
FY(y) = 0 if y ≤ 0, = √y/(2π) if 0 ≤ y < 4π^2, = 1 if y ≥ 4π^2.
To see this, let 0 ≤ x < 2π and 0 ≤ y < 4π^2. Then
FX(x) = P({ω ∈ Ω : 0 ≤ X(ω) ≤ x}) = P({ω ∈ Ω : 0 ≤ ω ≤ x}) = x/(2π),
FY(y) = P({ω : Y(ω) ≤ y}) = P({ω : ω^2 ≤ y}) = P({ω : 0 ≤ ω ≤ √y}) = P(X ≤ √y) = √y/(2π).
The random variables X and Y are continuous because
FX(x) = ∫_{−∞}^{x} fX(u) du,   FY(y) = ∫_{−∞}^{y} fY(u) du,
where
fX(u) = 1/(2π) if 0 ≤ u ≤ 2π, and 0 otherwise,
fY(u) = u^{−1/2}/(4π) if 0 < u ≤ 4π^2, and 0 otherwise.
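A quick numerical check of the formula FY(y) = √y/(2π) (an illustrative sketch only, not part of the text; the sample size is arbitrary):

    import math, random

    random.seed(2)
    N = 200000
    samples = [random.uniform(0.0, 2.0 * math.pi) ** 2 for _ in range(N)]   # Y = X^2

    for y in [1.0, 5.0, 10.0, 20.0, 35.0]:
        empirical = sum(1 for s in samples if s <= y) / N       # empirical P(Y <= y)
        exact = math.sqrt(y) / (2.0 * math.pi)                  # the formula above
        print(y, round(empirical, 4), round(exact, 4))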
(5) Example. A random variable which is neither continuous nor discrete. A coin is
tossed, and a head turns up with probability p (= 1−q). If a head turns up then a rod is flung on
the ground and the angle measured as in Example (4). Then Ω = {T} ∪ {(H, x) : 0 ≤ x < 2π},
in the obvious notation. Let X : Ω → R be given by
X (T) = −1, X ((H, x)) = x.
The random variable X takes values in {−1} ∪ [0, 2π) (see Figure 2.3 for a sketch of its
distribution function). We say that X is continuous with the exception of a ‘point mass (or
atom) at −1’.
Let F be a distribution function which is continuous and strictly increasing. Show that Y = F −1 (X)
is a random variable having distribution function F. Is it necessary that F be continuous and/or strictly
increasing?
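The exercise above underlies the 'inverse transform' method of simulation. The sketch below is illustrative only: it assumes, as is standard for this exercise, that X is uniformly distributed on [0, 1], and the particular choice F(y) = 1 − e^{−y} is an arbitrary example.

    import math, random

    random.seed(3)
    N = 200000
    # Y = F^{-1}(X) with F(y) = 1 - exp(-y), so F^{-1}(x) = -log(1 - x)
    ys = [-math.log(1.0 - random.random()) for _ in range(N)]

    for y in [0.5, 1.0, 2.0]:
        empirical = sum(1 for v in ys if v <= y) / N
        print(y, round(empirical, 4), round(1.0 - math.exp(-y), 4))   # Y has distribution function F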
4. Show that, if f and g are density functions, and 0 ≤ λ ≤ 1, then λ f + (1 − λ)g is a density. Is
the product f g a density function?
5. Which of the following are density functions? Find c and the corresponding distribution function
F for those that are.
(a) f(x) = cx^{−d} if x > 1, and 0 otherwise.
(b) f(x) = ce^x(1 + e^x)^{−2}, x ∈ R.
(1) Example. Darts. A dart is flung at a circular target of radius 3. We can think of the
hitting point as the outcome of a random experiment; we shall suppose for simplicity that the
player is guaranteed to hit the target somewhere. Setting the centre of the target at the origin
of R^2, we see that the sample space of this experiment is Ω = {(x, y) : x^2 + y^2 ≤ 9}.
Never mind about the collection F of events. Let us suppose that, roughly speaking, the
probability that the dart lands in some region A is proportional to its area |A|. Thus
(2)   P(A) = |A|/(9π) for suitable regions A ⊆ Ω.
The scoring system is as follows. The target is partitioned by three concentric circles C1 , C2 ,
and C3 , centred at the origin with radii 1, 2, and 3. These circles divide the target into three
annuli A1 , A2 , and A3 , where
Ak = {(x, y) : k − 1 ≤ √(x^2 + y^2) < k}.
[Figure 2.4: the distribution function FX(x) of the score X.]
We suppose that the player scores an amount k if and only if the dart hits Ak . The resulting
score X is the random variable given by
X (ω) = k whenever ω ∈ Ak .
Its distribution function is FX(r) = P(X ≤ r) = 0 if r < 1, = ⌊r⌋^2/9 if 1 ≤ r < 3, = 1 if r ≥ 3,
where ⌊r⌋ denotes the largest integer not larger than r (see Figure 2.4).
(3) Example. Continuation of (1). Let us consider a revised method of scoring in which the
player scores an amount equal to the distance between the hitting point ω and the centre of
the target. This time the score Y is a random variable given by
Y(ω) = √(x^2 + y^2)   if ω = (x, y).
For r ≥ 0, let Cr denote the disc of radius r centred at the origin, Cr = {(x, y) : x^2 + y^2 ≤ r^2}.
Then
FY(r) = P(Y ≤ r) = P(Cr) = r^2/9   if 0 ≤ r ≤ 3.
[Figure 2.5: the distribution function FY(r) of Example (3). Figure 2.6: the distribution function FZ(r) of Example (4).]
(4) Example. Continuation of (1). Now suppose that the player fails to hit the target with
fixed probability p; if he is successful then we suppose that the distribution of the hitting point
is described by equation (2). His score is specified as follows. If he hits the target then he
scores an amount equal to the distance between the hitting point and the centre; if he misses
then he scores 4. What is the distribution function of his score Z ?
Solution. Clearly Z takes values in the interval [0, 4]. Use Lemma (1.4.4) to see that
FZ(r) = P(Z ≤ r)
= P(Z ≤ r | hits target)P(hits target) + P(Z ≤ r | misses target)P(misses target)
= 0 if r < 0, = (1 − p)FY(r) if 0 ≤ r < 4, = 1 if r ≥ 4,
where FY is given in Example (3) (see Figure 2.6 for a sketch of FZ ).
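A simulation sketch of Example (4) (illustrative only; the miss probability p = 0.3 and the sample size are arbitrary choices) that compares the empirical distribution of Z with the formula for FZ above:

    import math, random

    random.seed(4)
    p = 0.3
    N = 200000

    def one_score():
        if random.random() < p:
            return 4.0                           # the player misses and scores 4
        r = 3.0 * math.sqrt(random.random())     # uniform point on the disc: P(R <= r) = r^2/9
        return r

    scores = [one_score() for _ in range(N)]
    for r in [1.0, 2.0, 3.5, 4.0]:
        empirical = sum(1 for z in scores if z <= r) / N
        exact = 1.0 if r >= 4 else (1 - p) * min(r * r / 9.0, 1.0)   # FZ(r) as above
        print(r, round(empirical, 4), round(exact, 4))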
(1) Example. Tontine is a scheme wherein subscribers to a common fund each receive an
annuity from the fund during his or her lifetime, this annuity increasing as the other subscribers
die. When all the subscribers are dead, the fund passes to the French government (this was
the case in the first such scheme designed by Lorenzo Tonti around 1653). The performance
of the fund depends on the lifetimes L 1 , L 2 , . . . , L n of the subscribers (as well as on their
wealths), and we may record these as a vector (L 1 , L 2 , . . . , L n ) of random variables.
(2) Example. Darts. A dart is flung at a conventional dartboard. The point of striking
determines a distance R from the centre, an angle Θ with the upward vertical (measured
clockwise, say), and a score S. With this experiment we may associate the random vector
(R, Θ, S), and we note that S is a function of the pair (R, Θ).
(3) Example. Coin tossing. Suppose that we toss a coin n times, and set X i equal to 0
or 1 depending on whether the i th toss results in a tail or a head. We think of the vector
X = (X 1 , X 2 , . . . , X n ) as describing the result of this composite experiment. The total
number of heads is the sum of the entries in X.
An individual random variable X has a distribution function FX defined by FX (x) =
P(X ≤ x) for x ∈ R. The corresponding ‘joint’ distribution function of a random vector
(X 1 , X 2 , . . . , X n ) is the quantity P(X 1 ≤ x 1 , X 2 ≤ x 2 , . . . , X n ≤ x n ), a function of n real
variables x 1 , x 2 , . . . , x n . In order to aid the notation, we introduce an ordering of vectors of
We state this lemma for a random vector with only two components X and Y , but the
corresponding result for n components is valid also. The proof of the lemma is left as an
exercise. Rather more is true. It may be seen without great difficulty that FX(x) = lim_{y→∞} FX,Y(x, y),
and similarly FY(y) = lim_{x→∞} FX,Y(x, y).
This more refined version of part (a) of the lemma tells us that we may recapture the individual
distribution functions of X and Y from a knowledge of their joint distribution function. The
converse is false: it is not generally possible to calculate FX,Y from a knowledge of FX and
FY alone. The functions FX and FY are called the ‘marginal’ distribution functions of FX,Y .
(8) Example. A schoolteacher asks each member of his or her class to flip a fair coin twice
and to record the outcomes. The diligent pupil D does this and records a pair (X D , Y D ) of
outcomes. The lazy pupil L flips the coin only once and writes down the result twice, recording
thus a pair (X L , Y L ) where X L = Y L . Clearly X D , Y D , X L , and Y L are random variables with
the same distribution functions. However, the pairs (X D , Y D ) and (X L , Y L ) have different
joint distribution functions. In particular, P(XD = YD = heads) = 1/4 since only one of the
four possible pairs of outcomes contains heads only, whereas P(XL = YL = heads) = 1/2.
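A short simulation sketch (not part of the text) of Example (8): the two pairs have identical marginal distributions but different joint distributions.

    import random

    random.seed(5)
    N = 100000
    both_heads_D = both_heads_L = heads_D = heads_L = 0
    for _ in range(N):
        xd, yd = random.randint(0, 1), random.randint(0, 1)   # diligent: two independent flips
        xl = random.randint(0, 1)
        yl = xl                                               # lazy: one flip recorded twice
        both_heads_D += (xd == 1 and yd == 1)
        both_heads_L += (xl == 1 and yl == 1)
        heads_D += (xd == 1)
        heads_L += (xl == 1)

    print("P(X_D = Y_D = heads) ~", both_heads_D / N, "(exact 1/4)")
    print("P(X_L = Y_L = heads) ~", both_heads_L / N, "(exact 1/2)")
    print("marginals:", heads_D / N, heads_L / N, "(both exact 1/2)")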
Once again there are two classes of random vectors which are particularly interesting: the
‘discrete’ and the ‘continuous’.
(9) Definition. The random variables X and Y on the probability space (Ω, F, P) are called
(jointly) discrete if the vector (X, Y ) takes values in some countable subset of R2 only. The
jointly discrete random variables X, Y have joint (probability) mass function f : R2 →
[0, 1] given by f (x, y) = P(X = x, Y = y).
(10) Definition. The random variables X and Y on the probability space (Ω, F, P) are called
(jointly) continuous if their joint distribution function can be expressed as
FX,Y(x, y) = ∫_{u=−∞}^{x} ∫_{v=−∞}^{y} f(u, v) dv du,   x, y ∈ R,
for some integrable function f : R2 → [0, ∞) called the joint (probability) density function
of the pair (X, Y ).
We shall return to such questions in later chapters. Meanwhile here are two concrete
examples.
(11) Example. Three-sided coin. We are provided with a special three-sided coin, each
toss of which results in one of the possibilities H (heads), T (tails), E (edge), each having
probability 1/3. Let Hn, Tn, and En be the numbers of such outcomes in n tosses of the coin.
The vector (Hn , Tn , E n ) is a vector of random variables satisfying Hn + Tn + E n = n. If the
outcomes of different tosses have no influence on each other, it is not difficult to see that
P((Hn, Tn, En) = (h, t, e)) = (n!/(h! t! e!)) (1/3)^n
for any triple (h, t, e) of non-negative integers with sum n. The random variables Hn , Tn , E n
are (jointly) discrete and are said to have (jointly) the trinomial distribution.
(12) Example. Darts. Returning to the flung dart of Example (2), let us assume that no
region of the dartboard is preferred unduly over any other region of equal area. It may then
be shown (see Example (2.4.3)) that
P(R ≤ r) = r^2/ρ^2,   P(Θ ≤ θ) = θ/(2π),   for 0 ≤ r ≤ ρ, 0 ≤ θ ≤ 2π,
It follows that
FR,Θ(r, θ) = ∫_{u=0}^{r} ∫_{v=0}^{θ} f(u, v) dv du
where
f(u, v) = u/(πρ^2),   0 ≤ u ≤ ρ, 0 ≤ v ≤ 2π.
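A simulation sketch of Example (12) (illustrative only; sampling the hitting point uniformly on a disc of radius ρ by rejection is one convenient way to check the marginal distributions of R and Θ stated above):

    import math, random

    random.seed(6)
    rho, N = 1.0, 200000
    rs, thetas = [], []
    while len(rs) < N:
        x, y = random.uniform(-rho, rho), random.uniform(-rho, rho)
        if x * x + y * y <= rho * rho:                 # keep only points inside the disc
            rs.append(math.hypot(x, y))
            thetas.append(math.atan2(y, x) % (2.0 * math.pi))

    r0 = 0.5
    print("P(R <= 0.5)    ~", sum(r <= r0 for r in rs) / N, "  exact:", (r0 / rho) ** 2)
    print("P(Theta <= pi) ~", sum(t <= math.pi for t in thetas) / N, "  exact: 0.5")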
(1) Example. Gambler’s ruin revisited. The gambler of Example (1.7.4) eventually won
his Jaguar after a long period devoted to tossing coins, and he has now decided to save up
for a yacht. His bank manager has suggested that, in order to speed things up, the stake on
each gamble should not remain constant but should vary as a certain prescribed function of
the gambler’s current fortune. The gambler would like to calculate the chance of winning the
yacht in advance of embarking on the project, but he finds himself incapable of doing so.
Fortunately, he has kept a record of the extremely long sequence of heads and tails encoun-
tered in his successful play for the Jaguar. He calculates his sequence of hypothetical fortunes
based on this information, until the point when this fortune reaches either zero or the price of
the yacht. He then starts again, and continues to repeat the procedure until he has completed
it a total of N times, say. He estimates the probability that he will actually win the yacht by
the proportion of the N calculations which result in success.
Why might this method make him overconfident? Should he retoss the coins?
(2) Example. A dam. It is proposed to build a dam in order to regulate the water supply,
and in particular to prevent seasonal flooding downstream. How high should the dam be?
Dams are expensive to construct, and some compromise between cost and risk is necessary.
It is decided to build a dam which is just high enough to ensure that the chance of a flood
of some given extent within ten years is less than 10^{−2}, say. No one knows exactly how
high such a dam need be, and a young probabilist proposes the following scheme. Through
examination of existing records of rainfall and water demand we may arrive at an acceptable
model for the pattern of supply and demand. This model includes, for example, estimates for
the distributions of rainfall on successive days over long periods. With the aid of a computer,
the ‘real world’ situation is simulated many times in order to study the likely consequences
of building dams of various heights. In this way we may arrive at an accurate estimate of the
height required.
(3) Example. Integration. Let g : [0, 1] → [0, 1] be a continuous but nowhere differentiable
function. How may we calculate its integral I = ∫_0^1 g(x) dx? The following experimental
technique is known as the 'hit or miss Monte Carlo technique'.
Let (X, Y) be a random vector having the uniform distribution on the unit square. That is,
we assume that P((X, Y) ∈ A) = |A|, the area of A, for any nice subset A of the unit square
[0, 1]2 ; we leave the assumption of niceness somewhat up in the air for the moment, and shall
return to such matters in Chapter 4. We declare (X, Y ) to be ‘successful’ if Y ≤ g(X). The
chance that (X, Y ) is successful equals I , the area under the curve y = g(x). We now repeat
this experiment a large number N of times, and calculate the proportion of times that the
experiment is successful. Following the law of averages, Theorem (2.2.1), we may use this
value as an estimate of I .
Clearly it is desirable to know the accuracy of this estimate. This is a harder problem to
which we shall return later.
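A sketch of the hit-or-miss technique (illustrative only, not part of the text; the integrand g below is a smooth stand-in, since the method requires only that g map [0, 1] into [0, 1]):

    import math, random

    random.seed(7)

    def g(x):
        # any function from [0, 1] into [0, 1] will do for the illustration
        return 0.5 * (1.0 + math.sin(7.0 * x)) * x

    N = 200000
    hits = 0
    for _ in range(N):
        x, y = random.random(), random.random()   # (X, Y) uniform on the unit square
        if y <= g(x):                             # a 'successful' trial
            hits += 1

    print("hit-or-miss estimate of I:", hits / N)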
Simulation is a dangerous game, and great caution is required in interpreting the results.
There are two major reasons for this. First, a computer simulation is limited by the degree
to which its so-called ‘pseudo-random number generator’ may be trusted. It has been said
for example that the summon-according-to-birthday principle of conscription to the United
States armed forces may have been marred by a pseudo-random number generator with a bias
for some numbers over others. Secondly, in estimating a given quantity, one may in some
circumstances have little or no idea how many repetitions are necessary in order to achieve an
estimate within a specified accuracy.
We have made no remark about the methods by which computers calculate ‘pseudo-random
numbers’. Needless to say they do not flip coins, but rely instead on operations of sufficient
numerical complexity that the outcome, although deterministic, is apparently unpredictable
except by an exact repetition of the calculation.
These techniques were named in honour of Monte Carlo by Metropolis, von Neumann, and
Ulam, while they were involved in the process of building bombs at Los Alamos in the 1940s.
2.7 Problems
1. Each toss of a coin results in a head with probability p. The coin is tossed until the first head
appears. Let X be the total number of tosses. What is P(X > m)? Find the distribution function of
the random variable X.
2. (a) Show that any discrete random variable may be written as a linear combination of indicator
variables.
(b) Show that any random variable may be expressed as the limit of an increasing sequence of discrete
random variables.
(c) Show that the limit of any increasing convergent sequence of random variables is a random
variable.
3. (a) Show that, if X and Y are random variables on a probability space (Ω, F, P), then so are
X + Y, XY, and min{X, Y}.
(b) Show that the set of all random variables on a given probability space (Ω, F, P) constitutes a
vector space over the reals. If Ω is finite, write down a basis for this space.
4. Let X have distribution function
F(x) = 0 if x < 0, = x/2 if 0 ≤ x ≤ 2, = 1 if x > 2.
5. Let X have distribution function
F(x) = 0 if x < −1, = 1 − p if −1 ≤ x < 0, = 1 − p + xp/2 if 0 ≤ x ≤ 2, = 1 if x > 2.
Sketch this function, and find: (a) P(X = −1), (b) P(X = 0), (c) P(X ≥ 1).
6. Buses arrive at ten minute intervals starting at noon. A man arrives at the bus stop a random
number X minutes after noon, where X has distribution function
0 if x < 0,
P(X ≤ x) = x/60 if 0 ≤ x ≤ 60,
1 if x > 60.
What is the probability that he waits less than five minutes for a bus?
7. Airlines find that each passenger who reserves a seat fails to turn up with probability 1/10, independently
of the other passengers. EasyPeasy Airlines always sell 10 tickets for their 9 seat aeroplane
while RyeLoaf Airways always sell 20 tickets for their 18 seat aeroplane. Which is more often
over-booked?
8. A fairground performer claims the power of telekinesis. The crowd throws coins and he wills
them to fall heads up. He succeeds five times out of six. What chance would he have of doing at least
as well if he had no supernatural powers?
9. Express the distribution functions of
Show that dTV satisfies (i) and (iii) with S the space of integer-valued random variables, and that
dTV (X, Y ) = 0 if and only if X and Y have the same distribution. Thus dTV is a metric on the space
of equivalence classes of S with equivalence relation given by X ∼ Y if X and Y have the same
distribution. We call dTV the total variation distance.
Show that
dTV(X, Y) = 2 sup_{A⊆Z} |P(X ∈ A) − P(Y ∈ A)|.
14. Ascertain in the following cases whether or not F is the joint distribution function of some pair
(X, Y ) of random variables. If your conclusion is affirmative, find the distribution functions of X and
Y separately.
(a) F(x, y) = 1 − e^{−x−y} if x, y ≥ 0, and 0 otherwise.
(b) F(x, y) = 1 − e^{−x} − xe^{−y} if 0 ≤ x ≤ y, = 1 − e^{−y} − ye^{−y} if 0 ≤ y ≤ x, and 0 otherwise.
15. It is required to place in order n books B1 , B2 , . . . , Bn on a library shelf in such a way that readers
searching from left to right waste as little time as possible on average. Assuming that each reader
requires book Bi with probability pi , find the ordering of the books which minimizes P(T ≥ k) for
all k, where T is the (random) number of titles examined by a reader before discovery of the required
book.
16. Transitive coins. Three coins each show heads with probability 3/5 and tails otherwise. The first
counts 10 points for a head and 2 for a tail, the second counts 4 points for both head and tail, and the
third counts 3 points for a head and 20 for a tail.
You and your opponent each choose a coin; you cannot choose the same coin. Each of you tosses
your coin and the person with the larger score wins £10^{10}. Would you prefer to be the first to pick a
coin or the second?
17. Before the development of radar, inertial navigation, and GPS, flying to isolated islands (for
example, from Los Angeles to Hawaii) was somewhat ‘hit or miss’. In heavy cloud or at night it was
necessary to fly by dead reckoning, and then to search the surface. With the aid of a radio, the pilot
had a good idea of the correct great circle along which to search, but could not be sure which of the
two directions along this great circle was correct (since a strong tailwind could have carried the plane
over its target). When you are the pilot, you calculate that you can make n searches before your plane
will run out of fuel. On each search you will discover the island with probability p (if it is indeed in
the direction of the search) independently of the results of other searches; you estimate initially that
there is probability α that the island is ahead of you. What policy should you adopt in deciding the
directions of your various searches in order to maximize the probability of locating the island?
18. Eight pawns are placed randomly on a chessboard, no more than one to a square. What is the
probability that:
(a) they are in a straight line (do not forget the diagonals)?
(b) no two are in the same row or column?
19. Which of the following are distribution functions? For those that are, give the corresponding
density function f.
(a) F(x) = 1 − e^{−x^2} if x ≥ 0, and 0 otherwise.
(b) F(x) = e^{−1/x} if x > 0, and 0 otherwise.
(c) F(x) = e^x/(e^x + e^{−x}), x ∈ R.
(d) F(x) = e^{−x^2} + e^x/(e^x + e^{−x}), x ∈ R.
20. (a) If U and V are jointly continuous, show that P(U = V ) = 0.
(b) Let X be uniformly distributed on (0, 1), and let Y = X. Then X and Y are continuous, and
P(X = Y ) = 1. Is there a contradiction here?
21. Continued fractions. Let X be uniformly distributed on the interval [0, 1], and express X as a
continued fraction thus:
X = 1/(Y1 + 1/(Y2 + 1/(Y3 + · · · ))).
Show that the joint mass function of Y1 and Y2 is
f(u, v) = 1/((uv + 1)(uv + u + 1)),   u, v = 1, 2, . . . .
22. Let V be a vector space of dimension n over a finite field F with q elements. Let X 1 , X 2 , . . . , X m
be independent random variables, each uniformly distributed on V .
(a) Let ai ∈ F, i = 1, 2, . . . , m, be not all zero. Show that the linear combination Z = Σ_i ai Xi is
uniformly distributed on V.
(b) Let pm be the probability that X 1 , X 2 , . . . , X m are linearly dependent. Show that, if m ≤ n + 1,
q^{−(n−m+1)} ≤ pm ≤ q^{−(n−m)},   m = 1, 2, . . . , n + 1.
23. Modes. A random variable X with distribution function F is said to be unimodal† about a mode
M if F is convex on (−∞, M) and concave on (M, ∞). Show that, if F is unimodal about M, then
the following hold.
(a) F is absolutely continuous, except possibly for an atom at M.
(b) If F is differentiable, then it has a density that is non-decreasing on (−∞, M) and non-increasing
on (M, ∞), and furthermore, the set of modes of F is a closed bounded interval. [Cf. Problem
(2.7.11).]
(c) If the distribution functions F and G are unimodal about the same mode M, then a F + (1 − a)G
is unimodal about M for any 0 < a < 1.
†It is a source of potential confusion that the word ‘mode’ is used in several contexts. A function is sometimes
said to be unimodal if it has a unique maximum. The word mode is also used for the value(s) of x at which a
mass function (or density function) f (x) is maximized, and even on occasion the locations of its local maxima.
3 Discrete random variables
(3) Example. Binomial distribution. A coin is tossed n times, and a head turns up each time
with probability p (= 1 − q). Then Ω = {H, T}^n. The total number X of heads takes values
in the set {0, 1, 2, . . . , n} and is a discrete random variable. Its probability mass function
f (x) = P(X = x) satisfies
f(x) = 0 if x ∉ {0, 1, 2, . . . , n}.
Let 0 ≤ k ≤ n, and consider f(k). Exactly \binom{n}{k} points in Ω give a total of k heads; each of
these points occurs with probability p^k q^{n−k}, and so
f(k) = \binom{n}{k} p^k q^{n−k}   if 0 ≤ k ≤ n.
The random variable X is said to have the binomial distribution with parameters n and p,
written bin(n, p). It is the sum X = Y1 + Y2 + · · · + Yn of n Bernoulli variables (see Example
(2.1.8)).
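A numerical sketch (not part of the text; the parameters are arbitrary) comparing the bin(n, p) mass function with the empirical distribution of a sum of n Bernoulli variables, as in Example (3):

    import math, random
    from collections import Counter

    random.seed(8)
    n, p, N = 10, 0.3, 100000
    # each trial: X = number of heads in n independent tosses with P(head) = p
    counts = Counter(sum(1 for _ in range(n) if random.random() < p) for _ in range(N))

    for k in range(n + 1):
        exact = math.comb(n, k) * p ** k * (1 - p) ** (n - k)   # f(k) above
        print(k, round(counts[k] / N, 4), round(exact, 4))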
(4) Example. Poisson distribution. If a random variable X takes values in the set {0, 1, 2, . . . }
with mass function
f(k) = (λ^k / k!) e^{−λ},   k = 0, 1, 2, . . . ,
where λ > 0, then X is said to have the Poisson distribution with parameter λ.
3.2 Independence
Remember that events A and B are called ‘independent’ if the occurrence of A does not
change the subsequent probability of B occurring. More rigorously, A and B are independent
if and only if P(A ∩ B) = P(A)P(B). Similarly, we say that discrete variables X and Y are
‘independent’ if the numerical value of X does not affect the distribution of Y . With this in
mind we make the following definition.
(1) Definition. Discrete variables X and Y are independent if the events {X = x} and
{Y = y} are independent for all x and y.
Suppose X takes values in the set {x 1 , x 2 , . . . } and Y takes values in the set {y1 , y2 , . . . }.
Let
Ai = {X = x i }, B j = {Y = y j }.
Notice (see Problem (2.7.2)) that X and Y are linear combinations of the indicator variables
I Ai , I Bj , in that
X = Σ_i xi I_{Ai}   and   Y = Σ_j yj I_{Bj}.
The random variables X and Y are independent if and only if Ai and B j are independent for
all pairs i, j . A similar definition holds for collections {X 1 , X 2 , . . . , X n } of discrete variables.
(Recall Definition (1.5.1).)
(2) Example. Poisson flips. A coin is tossed once and heads turns up with probability
p = 1 − q. Let X and Y be the numbers of heads and tails respectively. It is no surprise that
X and Y are not independent. After all, P(X = 1, Y = 1) = 0, whereas P(X = 1)P(Y = 1) = pq > 0.
Suppose now that the coin is tossed a random number N of times, where N has the Poisson
distribution with parameter λ. It is a remarkable fact that the resulting numbers X and Y of
heads and tails are independent, since
P(X = x, Y = y) = P(X = x, Y = y | N = x + y) P(N = x + y)
= \binom{x+y}{x} p^x q^y · (λ^{x+y}/(x + y)!) e^{−λ} = ((λp)^x (λq)^y / (x! y!)) e^{−λ}.
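A simulation sketch of Example (2) (illustrative only; λ = 3 and p = 0.4 are arbitrary choices): the joint frequencies of heads and tails factorize, as independence predicts.

    import random

    random.seed(9)
    lam, p, N = 3.0, 0.4, 200000

    def poisson(rate):
        # count arrivals of a rate-1 Poisson process before time `rate`
        k, t = 0, random.expovariate(1.0)
        while t < rate:
            k += 1
            t += random.expovariate(1.0)
        return k

    xy = []
    for _ in range(N):
        n = poisson(lam)                                          # random number of tosses
        x = sum(1 for _ in range(n) if random.random() < p)       # heads
        xy.append((x, n - x))                                     # (heads, tails)

    px1 = sum(1 for x, _ in xy if x == 1) / N
    py2 = sum(1 for _, y in xy if y == 2) / N
    pboth = sum(1 for x, y in xy if x == 1 and y == 2) / N
    print(pboth, "vs", px1 * py2)   # approximately equal, as independence predicts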
for all sets {x i : i ∈ I } and for all finite subsets J of I . The conditional independence
of a family of random variables, given an event C, is defined similarly to the conditional
independence of events; see equation (1.5.5).
Independent families of random variables are very much easier to study than dependent
families, as we shall see soon. Note that pairwise-independent families are not necessarily
independent.
3.3 Expectation
Let x 1 , x 2 , . . . , x N be the numerical outcomes of N repetitions of some experiment. The
average of these outcomes is
m = (1/N) Σ_i xi.
If the underlying mass function is f, then about N f(x) of the xi take the value x when N is large, so that
m ≃ (1/N) Σ_x x N f(x) = Σ_x x f(x),
where the summation here is over all possible values of the X i . This average is called the
‘expectation’ or ‘mean value’ of the underlying distribution with mass function f .
(1) Definition. The mean value, or expectation, or expected value of the random variable
X with mass function f is defined to be
E(X) = Σ_{x: f(x)>0} x f(x).
(2) Example (2.1.5) revisited. The random variables X and W of this example have mean
values
E(X) = Σ_x x P(X = x) = 0 · (1/4) + 1 · (1/2) + 2 · (1/4) = 1,
E(W) = Σ_x x P(W = x) = 0 · (3/4) + 4 · (1/4) = 1.
(3) Lemma. Change of variable formula. If X has mass function f and g : R → R, then
E(g(X)) = Σ_x g(x) f(x).
(4) Example. Suppose that X takes values −2, −1, 1, 3 with probabilities 1/4, 1/8, 1/4, 3/8 respectively.
The random variable Y = X^2 takes values 1, 4, 9 with probabilities 3/8, 1/4, 3/8 respectively,
and so
E(Y) = Σ_x x P(Y = x) = 1 · (3/8) + 4 · (1/4) + 9 · (3/8) = 19/4.
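A small computational check of Lemma (3) on Example (4) (a sketch only, not part of the text): E(X^2) computed via the mass function of X agrees with E(Y) computed via the mass function of Y = X^2.

    from fractions import Fraction as F

    f_X = {-2: F(1, 4), -1: F(1, 8), 1: F(1, 4), 3: F(3, 8)}   # mass function of X

    # change of variable formula: E(g(X)) = sum_x g(x) f(x), with g(x) = x^2
    via_X = sum(x * x * prob for x, prob in f_X.items())

    # mass function of Y = X^2, then E(Y) = sum_y y P(Y = y)
    f_Y = {}
    for x, prob in f_X.items():
        f_Y[x * x] = f_Y.get(x * x, F(0)) + prob
    via_Y = sum(y * prob for y, prob in f_Y.items())

    print(via_X, via_Y)   # both equal 19/4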
Lemma (3) provides a method for calculating the ‘moments’ of a distribution; these are
defined as follows.
(5) Definition. If k is a positive integer, the kth moment mk of X is defined to be mk = E(X^k).
The kth central moment σk is σk = E((X − m1)^k).
The two moments of most use are m1 = E(X) and σ2 = E((X − EX)^2), called the mean (or
expectation) and variance of X. These two quantities are measures of the mean and dispersion
of X; that is, m 1 is the average value of X, and σ2 measures the amount by which X tends to
deviate from this average. The mean m1 is often denoted µ, and the variance of X is often
denoted var(X). The positive square root σ = √var(X) is called the standard deviation, and
in this notation σ2 = σ^2. The central moments {σi} can be expressed in terms of the ordinary
moments {m i }. For example, σ1 = 0 and
σ2 = Σ_x (x − m1)^2 f(x)
= Σ_x x^2 f(x) − 2m1 Σ_x x f(x) + m1^2 Σ_x f(x)
= m2 − m1^2,
Remark. Experience with student calculations of variances causes us to stress the following
elementary fact: variances cannot be negative. We sometimes omit the parentheses and write
simply var X. The expression E(X)^2 means (E(X))^2 and must not be confused with E(X^2).
(6) Example. Bernoulli variables. Let X be a Bernoulli variable, taking the value 1 with
probability p (= 1 − q). Then
E(X) = Σ_x x f(x) = 0 · q + 1 · p = p,
E(X^2) = Σ_x x^2 f(x) = 0 · q + 1 · p = p,
var(X) = E(X^2) − E(X)^2 = pq.
Thus the indicator variable I A has expectation P(A) and variance P(A)P(Ac ).
If X is bin(n, p), then
E(X) = Σ_{k=0}^{n} k f(k) = Σ_{k=0}^{n} k \binom{n}{k} p^k q^{n−k}.
To evaluate this sum, differentiate the binomial identity
Σ_{k=0}^{n} \binom{n}{k} x^k = (1 + x)^n
with respect to x,
multiply by x to obtain
Σ_{k=0}^{n} k \binom{n}{k} x^k = nx(1 + x)^{n−1},
and substitute x = p/q to obtain E(X) = np. A similar argument shows that the variance of
X is given by var(X) = npq, although it is faster to use the forthcoming Theorem (11).
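A numerical sketch (not part of the text; the parameters are arbitrary) confirming that a bin(n, p) variable has mean np and variance npq:

    import math

    n, p = 12, 0.35
    q = 1 - p
    f = [math.comb(n, k) * p ** k * q ** (n - k) for k in range(n + 1)]   # bin(n, p) mass function

    m1 = sum(k * f[k] for k in range(n + 1))          # first moment
    m2 = sum(k * k * f[k] for k in range(n + 1))      # second moment
    print(m1, n * p)                                  # mean np
    print(m2 - m1 ** 2, n * p * q)                    # variance m2 - m1^2 = npq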
We can think of the process of calculating expectations as a linear operator on the space of
random variables.
(8) Theorem. The expectation operator E has the following properties:
(a) if X ≥ 0 then E(X) ≥ 0,
(b) if a, b ∈ R then E(a X + bY ) = a E(X) + bE(Y ),
(c) the random variable 1, taking the value 1 always, has expectation E(1) = 1.
and the solution of the first part of Problem (3.11.3) shows that
E(aX + bY) = Σ_{x,y} (ax + by) P(Ax ∩ By).
However,
Σ_y P(Ax ∩ By) = P(Ax ∩ ⋃_y By) = P(Ax ∩ Ω) = P(Ax),
and similarly Σ_x P(Ax ∩ By) = P(By), which gives
E(aX + bY) = Σ_x ax Σ_y P(Ax ∩ By) + Σ_y by Σ_x P(Ax ∩ By)
= a Σ_x x P(Ax) + b Σ_y y P(By)
= aE(X) + bE(Y).
and so
E(XY) = Σ_{x,y} xy P(Ax)P(By)   by independence
= Σ_x x P(Ax) Σ_y y P(By) = E(X)E(Y).