
Probability and Random Processes

Fortuna, goddess of chance and luck


Engraving by Hans Sebald Beham (1541)
Probability and Random
Processes
Fourth Edition
GEOFFREY R. GRIMMETT
Statistical Laboratory, Cambridge University
and
DAVID R. STIRZAKER
Mathematical Institute, Oxford University

Great Clarendon Street, Oxford, OX2 6DP,
United Kingdom
Oxford University Press is a department of the University of Oxford.
It furthers the University’s objective of excellence in research, scholarship,
and education by publishing worldwide. Oxford is a registered trade mark of
Oxford University Press in the UK and in certain other countries
© Geoffrey R. Grimmett and David R. Stirzaker 2020
The moral rights of the authors have been asserted
Fourth Edition published in 2020
Reprinted with corrections in 2024

All rights reserved. No part of this publication may be reproduced, stored in


a retrieval system, or transmitted, in any form or by any means, without the
prior permission in writing of Oxford University Press, or as expressly permitted
by law, by licence or under terms agreed with the appropriate reprographics
rights organization. Enquiries concerning reproduction outside the scope of the
above should be sent to the Rights Department, Oxford University Press, at the
address above
You must not circulate this work in any other form
and you must impose this same condition on any acquirer
Published in the United States of America by Oxford University Press
198 Madison Avenue, New York, NY 10016, United States of America
British Library Cataloguing in Publication Data
Data available
Library of Congress Control Number: 2020939178
ISBN 978–0–19–884760–1 (hbk)
ISBN 978–0–19–884759–5 (pbk)
ISBN 978–0–19–884762–5 (set with One Thousand Exercises in Probability, 3e)
Printed in Great Britain by
Bell & Bain Ltd., Glasgow
Links to third party websites are provided by Oxford in good faith and
for information only. Oxford disclaims any responsibility for the materials
contained in any third party website referenced in this work.
Lastly, numbers are applicable even to such things as seem to be governed by no rule, I
mean such as depend on chance: the quantity of probability and proportion of it in any
two proposed cases being subject to calculation as much as anything else. Upon this
depend the principles of game. We find sharpers know enough of this to cheat some
men that would take it very ill to be thought bubbles; and one gamester exceeds another,
as he has a greater sagacity and readiness in calculating his probability to win or lose
in any particular case. To understand the theory of chance thoroughly, requires a great
knowledge of numbers, and a pretty competent one of Algebra.

John Arbuthnot
An essay on the usefulness of mathematical learning
25 November 1700

To this may be added, that some of the problems about chance having a great appearance
of simplicity, the mind is easily drawn into a belief, that their solution may be attained
by the mere strength of natural good sense; which generally proving otherwise, and the
mistakes occasioned thereby being not infrequent, it is presumed that a book of this
kind, which teaches to distinguish truth from what seems so nearly to resemble it, will
be looked on as a help to good reasoning.

Abraham de Moivre
The Doctrine of Chances
1717
Preface to the Fourth Edition

This book provides an extensive introduction to probability and random processes. It is


intended for those working in the many and varied applications of the subject as well as for
those studying more theoretical aspects. We hope it will be found suitable for mathematics
undergraduates at all levels, as well as for graduate students and others with interests in these
fields.
In particular, we aim:
• to give a rigorous introduction to probability theory while limiting the amount of measure
theory in the early chapters;
• to discuss the most important random processes in some depth, with many examples;
• to include various topics which are suitable for undergraduate courses, but are not routinely
taught;
• to impart to the beginner the flavour of more advanced work, thereby whetting the appetite
for more.
The ordering and numbering of material in this fourth edition has for the most part been
preserved from the third. However, a good many minor alterations and additions have been
made in the pursuit of clearer exposition. Furthermore, we have revised extensively the
sections on Markov chains in continuous time, and added new sections on coupling from the
past, Lévy processes, self-similarity and stability, and time changes.
In an immoderate manifestation of millennial mania, the number of exercises and problems
has been increased beyond the statutory 1000 to a total of 1322. Moreover, many of the
existing exercises have been refreshed by additional parts, making a total of more than 3000
challenges for the diligent reader. These are frequently far from being merely drill exercises,
but they complement and illustrate the text, or are entertaining, or (usually, we hope) both. In
a companion volume, One Thousand Exercises in Probability (Oxford University Press, third
edition, 2020), we give worked solutions to almost all exercises and problems.
The basic layout of the book remains unchanged. Chapters 1–5 begin with the foundations
of probability theory, move through the elementary properties of random variables, and finish
with the weak law of large numbers and the central limit theorem; en route, the reader meets
random walks, branching processes, and characteristic functions. This material is suitable for
about two lecture courses at a moderately elementary level.

The second part of the book is devoted to the theory of random processes. Chapter 6 deals
with Markov chains in discrete and continuous time. The treatment of discrete-time chains is
quite detailed and includes an easy proof of the limit theorem for chains with countably infinite
state spaces. The sections on continuous-time chains provide a significant amplification of
those of the third edition, and constitute an approachable but rigorous account of the principal
theory and applications. Chapter 7 contains a general discussion of convergence, together with
simple, rigorous accounts of the strong law of large numbers, and of martingale convergence.
Each of these two chapters could be used as a basis for a lecture course.
Chapter 8 is an introduction to stochastic processes in their various types; most of these
are studied in detail in later chapters, but we have also aspired there to engage the reader with
wider aspects of probability by including new essays on a number of topics such as Lévy
processes, self-similarity, and stability. Chapters 8–13 provide suitable material for about
five shorter lecture courses on: stationary processes and ergodic theory; renewal processes;
queues; martingales; diffusions and stochastic integration with applications to finance.
We thank those who have read and commented upon sections of this and earlier editions,
and also those readers who have taken the trouble to write to us with notice of imperfections.
Further help in thinning any remaining errors will be greatly appreciated.

Cambridge and Oxford    G.R.G.
April 2020    D.R.S.

Note on the Frontispiece


The iconography associated with Fortuna, the goddess of chance and luck (Tyche or Agathodaemon
to the Greeks), has accumulated over more than two millennia, and is correspondingly
complex; we give no more than a brief and much simplified account here of the various allegories
involved. The goddess Fortuna was originally associated with fertility, hence the sheaf
of corn shown in her right hand. Later associations with the uncertainty of sea voyages are
indicated by the ship in full sail seen in the background. The sphere by her feet may initially
have represented the instability of life, and this interpretation is sometimes strengthened by
depicting the sphere as a bubble. (By contrast, the goddess Virtue is frequently depicted on
or by a cube, representing stability.) Subsequently the sphere comes to represent the entire
world, over which chance reigns supreme. The wheel carried by Fortuna in her left hand
represents the fickleness and uncertainty entailed by the passage of time; that is, the inevitable
ups and downs of life. ‘The wheel of fortune’ is a metaphor for chance and uncertainty still
today, as exemplified by the title of a recent television game show.
A further discussion of the iconography is provided in Roberts 1998.
Contents

1 Events and their probabilities


1.1 Introduction 1
1.2 Events as sets 1
1.3 Probability 4
1.4 Conditional probability 8
1.5 Independence 13
1.6 Completeness and product spaces 15
1.7 Worked examples 16
1.8 Problems 22

2 Random variables and their distributions


2.1 Random variables 28
2.2 The law of averages 32
2.3 Discrete and continuous variables 35
2.4 Worked examples 37
2.5 Random vectors 40
2.6 Monte Carlo simulation 43
2.7 Problems 45

3 Discrete random variables


3.1 Probability mass functions 49
3.2 Independence 51
3.3 Expectation 53
3.4 Indicators and matching 59
3.5 Examples of discrete variables 64
3.6 Dependence 66
3.7 Conditional distributions and conditional expectation 71
3.8 Sums of random variables 75
3.9 Simple random walk 77
3.10 Random walk: counting sample paths 81
3.11 Problems 89

4 Continuous random variables


4.1 Probability density functions 99
4.2 Independence 101
4.3 Expectation 103
4.4 Examples of continuous variables 106
4.5 Dependence 110
4.6 Conditional distributions and conditional expectation 117
4.7 Functions of random variables 121
4.8 Sums of random variables 129
4.9 Multivariate normal distribution 131
4.10 Distributions arising from the normal distribution 135
4.11 Sampling from a distribution 138
4.12 Coupling and Poisson approximation 144
4.13 Geometrical probability 150
4.14 Problems 157

5 Generating functions and their applications


5.1 Generating functions 168
5.2 Some applications 176
5.3 Random walk 183
5.4 Branching processes 192
5.5 Age-dependent branching processes 197
5.6 Expectation revisited 199
5.7 Characteristic functions 203
5.8 Examples of characteristic functions 208
5.9 Inversion and continuity theorems 212
5.10 Two limit theorems 216
5.11 Large deviations 226
5.12 Problems 231

6 Markov chains
6.1 Markov processes 239
6.2 Classification of states 246
6.3 Classification of chains 250
6.4 Stationary distributions and the limit theorem 254
6.5 Reversibility 265
6.6 Chains with finitely many states 269
6.7 Branching processes revisited 273
6.8 Birth processes and the Poisson process 276
6.9 Continuous-time Markov chains 288
6.10 Kolmogorov equations and the limit theorem 297
6.11 Birth–death processes and imbedding 306
6.12 Special processes 313
6.13 Spatial Poisson processes 319
6.14 Markov chain Monte Carlo 330
6.15 Problems 339

7 Convergence of random variables


7.1 Introduction 349
7.2 Modes of convergence 352
7.3 Some ancillary results 362
7.4 Laws of large numbers 370
7.5 The strong law 374
7.6 The law of the iterated logarithm 377
7.7 Martingales 378
7.8 Martingale convergence theorem 383
7.9 Prediction and conditional expectation 388
7.10 Uniform integrability 395
7.11 Problems 400

8 Random processes
8.1 Introduction 406
8.2 Stationary processes 407
8.3 Renewal processes 411
8.4 Queues 414
8.5 The Wiener process 416
8.6 Lévy processes and subordinators 418
8.7 Self-similarity and stability 421
8.8 Time changes 426
8.9 Existence of processes 429
8.10 Problems 431

9 Stationary processes
9.1 Introduction 433
9.2 Linear prediction 436
9.3 Autocovariances and spectra 438
9.4 Stochastic integration and the spectral representation 446
9.5 The ergodic theorem 452
9.6 Gaussian processes 464
9.7 Problems 468

10 Renewals
10.1 The renewal equation 471
10.2 Limit theorems 476
10.3 Excess life 480
10.4 Applications 483
10.5 Renewal–reward processes 491
10.6 Problems 497

11 Queues
11.1 Single-server queues 501
11.2 M/M/1 503
11.3 M/G/1 506

11.4 G/M/1 512


11.5 G/G/1 516
11.6 Heavy traffic 523
11.7 Networks of queues 524
11.8 Problems 530

12 Martingales
12.1 Introduction 533
12.2 Martingale differences and Hoeffding’s inequality 538
12.3 Crossings and convergence 544
12.4 Stopping times 550
12.5 Optional stopping 554
12.6 The maximal inequality 559
12.7 Backward martingales and continuous-time martingales 562
12.8 Some examples 567
12.9 Problems 572

13 Diffusion processes
13.1 Introduction 577
13.2 Brownian motion 578
13.3 Diffusion processes 580
13.4 First passage times 590
13.5 Barriers 595
13.6 Excursions and the Brownian bridge 599
13.7 Stochastic calculus 602
13.8 The Itô integral 605
13.9 Itô’s formula 610
13.10 Option pricing 613
13.11 Passage probabilities and potentials 620
13.12 Problems 627

Appendix I. Foundations and notation 631


Appendix II. Further reading 636
Appendix III. History and varieties of probability 638
Appendix IV. John Arbuthnot’s Preface to Of the laws of chance (1692) 641
Appendix V. Table of distributions 644
Appendix VI. Chronology 646
Bibliography 649
Notation 653
Index 655
1
Events and their probabilities

Summary. Any experiment involving randomness can be modelled as a probability
space. Such a space comprises a set Ω of possible outcomes of the
experiment, a set F of events, and a probability measure P. The definition and
basic properties of a probability space are explored, and the concepts of conditional
probability and independence are introduced. Many examples involving
modelling and calculation are included.

1.1 Introduction
Much of our life is based on the belief that the future is largely unpredictable. For example,
games of chance such as dice or roulette would have few adherents if their outcomes were
known in advance. We express this belief in chance behaviour by the use of words such as
‘random’ or ‘probability’, and we seek, by way of gaming and other experience, to assign
quantitative as well as qualitative meanings to such usages. Our main acquaintance with
statements about probability relies on a wealth of concepts, some more reasonable than others.
A mathematical theory of probability will incorporate those concepts of chance which are
expressed and implicit in common rational understanding. Such a theory will formalize these
concepts as a collection of axioms, which should lead directly to conclusions in agreement with
practical experimentation. This chapter contains the essential ingredients of this construction.

1.2 Events as sets


Many everyday statements take the form ‘the chance (or probability) of A is p’, where A is
some event (such as ‘the sun shining tomorrow’, ‘Cambridge winning the Boat Race’, . . . )
and p is a number or adjective describing quantity (such as ‘one-eighth’, ‘low’, . . . ). The
occurrence or non-occurrence of A depends upon the chain of circumstances involved. This
chain is called an experiment or trial; the result of an experiment is called its outcome. In
general, we cannot predict with certainty the outcome of an experiment in advance of its
completion; we can only list the collection of possible outcomes.
(1) Definition. The set of all possible outcomes of an experiment is called the sample space
and is denoted by Ω.

(2) Example. A coin is tossed. There are two possible outcomes, heads (denoted by H) and
tails (denoted by T), so that Ω = {H, T}. We may be interested in the possible occurrences of
the following events:
(a) the outcome is a head;
(b) the outcome is either a head or a tail;
(c) the outcome is both a head and a tail (this seems very unlikely to occur);
(d) the outcome is not a head.

(3) Example. A die is thrown once. There are six possible outcomes depending on which of
the numbers 1, 2, 3, 4, 5, or 6 is uppermost. Thus Ω = {1, 2, 3, 4, 5, 6}. We may be interested
in the following events:
(a) the outcome is the number 1;
(b) the outcome is an even number;
(c) the outcome is even but does not exceed 3;
(d) the outcome is not even.
We see immediately that each of the events of these examples can be specified as a subset
A of the appropriate sample space Ω. In the first example they can be rewritten as

(a) A = {H}, (b) A = {H} ∪ {T}, (c) A = {H} ∩ {T}, (d) A = {H}c,
whilst those of the second example become

(a) A = {1}, (b) A = {2, 4, 6}, (c) A = {2, 4, 6} ∩ {1, 2, 3}, (d) A = {2, 4, 6}c.
The complement of a subset A of Ω is denoted here and subsequently by Ac; from now on,
subsets of Ω containing a single member, such as {H}, will usually be written without the
containing braces.
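These set-theoretic combinations of events can be tried out directly. The following minimal Python sketch (our illustration, not part of the text) encodes the events of Example (3) as sets, with the complement taken relative to the sample space:

```python
# Events as subsets of a sample space, illustrated for the die of Example (3).
omega = {1, 2, 3, 4, 5, 6}          # sample space
even = {2, 4, 6}                     # event: the outcome is even
small = {1, 2, 3}                    # event: the outcome does not exceed 3

union = even | small                 # 'even or small'
intersection = even & small          # 'even but does not exceed 3'
complement = omega - even            # 'not even'

print(union)         # {1, 2, 3, 4, 6}
print(intersection)  # {2}
print(complement)    # {1, 3, 5}
```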
Henceforth we think of events as subsets of the sample space Ω. Whenever A and B are
events in which we are interested, then we can reasonably concern ourselves also with the
events A ∪ B, A ∩ B, and Ac, representing ‘A or B’, ‘A and B’, and ‘not A’ respectively. Events
A and B are called disjoint if their intersection is the empty set ∅; ∅ is called the impossible
event. The set Ω is called the certain event, since some member of Ω will certainly occur.
Thus events are subsets of Ω, but need all the subsets of Ω be events? The answer is no, but
some of the reasons for this are too difficult to be discussed here. It suffices for us to think of
the collection of events as a subcollection F of the set of all subsets of Ω. This subcollection
should have certain properties in accordance with the earlier discussion:
(a) if A, B ∈ F then A ∪ B ∈ F and A ∩ B ∈ F ;
(b) if A ∈ F then Ac ∈ F ;
(c) the empty set ∅ belongs to F .
Any collection F of subsets of  which satisfies these three conditions is called a field. It
follows from the properties of a field F that
if A1, A2, . . . , An ∈ F then A1 ∪ A2 ∪ · · · ∪ An ∈ F;

Typical notation    Set jargon                Probability jargon

Ω                   Collection of objects     Sample space
ω                   Member of Ω               Elementary event, outcome
A                   Subset of Ω               Event that some outcome in A occurs
Ac                  Complement of A           Event that no outcome in A occurs
A ∩ B               Intersection              Both A and B
A ∪ B               Union                     Either A or B or both
A \ B               Difference                A, but not B
A △ B               Symmetric difference      Either A or B, but not both
A ⊆ B               Inclusion                 If A, then B
∅                   Empty set                 Impossible event
Ω                   Whole space               Certain event

Table 1.1. The jargon of set theory and probability theory.

that is to say, F is closed under finite unions and hence under finite intersections also (see
Problem (1.8.3)). This is fine when Ω is a finite set, but we require slightly more to deal with
the common situation when Ω is infinite, as the following example indicates.
(4) Example. A coin is tossed repeatedly until the first head turns up; we are concerned
with the number of tosses before this happens. The set of all possible outcomes is the set
Ω = {ω1, ω2, ω3, . . . }, where ωi denotes the outcome when the first i − 1 tosses are tails
and the ith toss is a head. We may seek to assign a probability to the event A, that the first
head occurs after an even number of tosses, that is, A = {ω2, ω4, ω6, . . . }. This is an infinite
countable union of members of Ω and we require that such a set belong to F in order that we
can discuss its probability.
Thus we also require that the collection of events be closed under the operation of taking
countable unions. Any collection of subsets of Ω with these properties is called a σ-field (or
sometimes a σ-algebra).
(5) Definition. A collection F of subsets of Ω is called a σ-field if it satisfies the following
conditions:
(a) ∅ ∈ F;
(b) if A1, A2, . . . ∈ F then A1 ∪ A2 ∪ · · · ∈ F;
(c) if A ∈ F then Ac ∈ F.

It follows from Problem (1.8.3) that σ-fields are closed under the operation of taking
countable intersections. Here are some examples of σ-fields.
(6) Example. The smallest σ-field associated with Ω is the collection F = {∅, Ω}.

(7) Example. If A is any subset of Ω then F = {∅, A, Ac, Ω} is a σ-field.

(8) Example. The power set of Ω, which is written {0, 1}Ω and contains all subsets of Ω, is
obviously a σ-field. For reasons beyond the scope of this book, when Ω is infinite, its power
set is too large a collection for probabilities to be assigned reasonably to all its members.
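For a finite collection of subsets, the conditions of Definition (5) can be checked by brute force; for a finite collection, closure under countable unions reduces to closure under pairwise unions. The following sketch (our illustration, not part of the text) verifies that the collections of Examples (6) and (7) are σ-fields:

```python
from itertools import combinations

def is_sigma_field(F, omega):
    """Check the sigma-field conditions of Definition (5) for a finite
    collection F of subsets of omega."""
    F = {frozenset(A) for A in F}
    if frozenset() not in F:                                     # (a) empty set belongs to F
        return False
    if any(frozenset(set(omega) - set(A)) not in F for A in F):  # (c) closed under complements
        return False
    return all((A | B) in F for A, B in combinations(F, 2))      # (b) closed under unions

omega = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}
print(is_sigma_field([set(), omega], omega))                # True: Example (6)
print(is_sigma_field([set(), A, omega - A, omega], omega))  # True: Example (7)
print(is_sigma_field([set(), A, omega], omega))             # False: complement of A missing
```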

To recapitulate, with any experiment we may associate a pair (Ω, F), where Ω is the
set of all possible outcomes or elementary events and F is a σ-field of subsets of Ω which
contains all the events in whose occurrences we may be interested; henceforth, to call a set
A an event is equivalent to asserting that A belongs to the σ-field in question. We usually
translate statements about combinations of events into set-theoretic jargon; for example, the
event that both A and B occur is written as A ∩ B. Table 1.1 is a translation chart.

Exercises for Section 1.2


1. Let {Ai : i ∈ I} be a collection of sets. Prove ‘De Morgan’s Laws’†:

(⋃i Ai)c = ⋂i Aci,    (⋂i Ai)c = ⋃i Aci.

2. Let A and B belong to some σ-field F. Show that F contains the sets A ∩ B, A \ B, and A △ B.
3. A conventional knock-out tournament (such as that at Wimbledon) begins with 2^n competitors
and has n rounds. There are no play-offs for the positions 2, 3, . . . , 2^n − 1, and the initial table of
draws is specified. Give a concise description of the sample space of all possible outcomes.
4. Let F be a σ-field of subsets of Ω and suppose that B ∈ F. Show that G = {A ∩ B : A ∈ F} is a
σ-field of subsets of B.
5. Which of the following are identically true? For those that are not, say when they are true.
(a) A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C);
(b) A ∩ (B ∩ C) = (A ∩ B) ∩ C;
(c) (A ∪ B) ∩ C = A ∪ (B ∩ C);
(d) A \ (B ∩ C) = (A \ B) ∪ (A \ C).

1.3 Probability
We wish to be able to discuss the likelihoods of the occurrences of events. Suppose that we
repeat an experiment a large number N of times, keeping the initial conditions as equal as
possible, and in such a way that no outcome is influenced by the other outcomes. Let A be
some event which may or may not occur on each repetition. Our experience of most scientific
experimentation is that the proportion of times that A occurs settles down to some value as N
becomes larger and larger; that is to say, writing N(A) for the number of occurrences of A in
the N trials, the ratio N(A)/N appears to converge to a constant limit as N increases‡. We
can think of the ultimate value of this ratio as being the probability P(A) that A occurs on any
particular trial∗ ; it may happen that the empirical ratio does not behave in a coherent manner
and our intuition fails us at this level, but we shall not discuss this here. In practice, N may be

†Augustus De Morgan is well known for having given the first clear statement of the principle of mathematical
induction. He applauded probability theory with the words: “The tendency of our study is to substitute the
satisfaction of mental exercise for the pernicious enjoyment of an immoral stimulus”.
‡This property of practical experiments is called the stability of statistical ratios by physicists and statisticians,
or the empirical law of large numbers by probabilists.
∗ This superficial discussion of probabilities is inadequate in many ways; questioning readers may care to
discuss the philosophical and empirical aspects of the subject amongst themselves (see Appendix III). We do
not define probability by such limits in this text.

taken to be large but finite, and the ratio N(A)/N may be taken as an approximation to P(A).
Clearly, the ratio is a number between zero and one; if A = ∅ then N(∅) = 0 and the ratio
is 0, whilst if A = Ω then N(Ω) = N and the ratio is 1. Furthermore, suppose that A and B
are two disjoint events, each of which may or may not occur at each trial. Then

N(A ∪ B) = N(A) + N(B)

and so the ratio N(A ∪ B)/N is the sum of the two ratios N(A)/N and N(B)/N. We now
think of these ratios as representing the probabilities of the appropriate events. The above
relations become

P(A ∪ B) = P(A) + P(B), P(∅) = 0, P(Ω) = 1.
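The ‘stability of statistical ratios’ described above is easy to observe numerically. The following simulation (our illustration, not part of the text) tosses a fair coin N times for increasing N and prints the ratio N(A)/N for the event A that a toss shows heads:

```python
import random

random.seed(2020)  # fixed seed so the run is reproducible
for N in (100, 10_000, 1_000_000):
    heads = sum(random.random() < 0.5 for _ in range(N))  # N(A) for A = 'heads'
    print(N, heads / N)  # the ratio N(A)/N settles down near P(A) = 1/2
```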

This discussion suggests that the probability function P should be finitely additive, which is
to say that
if A1, A2, . . . , An are disjoint events, then P(A1 ∪ A2 ∪ · · · ∪ An) = P(A1) + P(A2) + · · · + P(An).

A glance at Example (1.2.4) suggests the more extensive property that P be countably additive,
in that the corresponding property should hold for countable collections A1 , A2 , . . . of disjoint
events.
These relations are sufficient to specify the desirable properties of a probability function P
applied to the set of events. Any such assignment of likelihoods to the members of F is called
a probability measure. Some individuals refer informally to P as a ‘probability distribution’,
especially when the sample space is finite or countably infinite; this practice is best avoided
since the term ‘probability distribution’ is reserved for another purpose to be encountered in
Chapter 2.
(1) Definition. A probability measure P on (Ω, F) is a function P : F → [0, 1] satisfying
(a) P(∅) = 0, P(Ω) = 1;
(b) if A1, A2, . . . is a collection of disjoint members of F, in that Ai ∩ Aj = ∅ for all pairs
i, j satisfying i ≠ j, then

P(A1 ∪ A2 ∪ · · · ) = P(A1) + P(A2) + · · · .

The triple (Ω, F, P), comprising a set Ω, a σ-field F of subsets of Ω, and a probability
measure P on (Ω, F), is called a probability space.

A probability measure is a special example of what is called a measure on the pair (Ω, F).
A measure is a function µ : F → [0, ∞) satisfying µ(∅) = 0 together with (b) above. A
measure µ is a probability measure if µ(Ω) = 1.
We can associate a probability space (Ω, F, P) with any experiment, and all questions
associated with the experiment can be reformulated in terms of this space. It may seem
natural to ask for the numerical value of the probability P(A) of some event A. The answer
to such a question must be contained in the description of the experiment in question. For

example, the assertion that a fair coin is tossed once is equivalent to saying that heads and
tails have an equal probability of occurring; actually, this is the definition of fairness.
(2) Example. A coin, possibly biased, is tossed once. We can take Ω = {H, T} and F =
{∅, H, T, Ω}, and a possible probability measure P : F → [0, 1] is given by

P(∅) = 0, P(H) = p, P(T) = 1 − p, P(Ω) = 1,

where p is a fixed real number in the interval [0, 1]. If p = 1/2, then we say that the coin is
fair, or unbiased.

(3) Example. A die is thrown once. We can take Ω = {1, 2, 3, 4, 5, 6}, F = {0, 1}Ω, and
the probability measure P given by

P(A) = Σi∈A pi for any A ⊆ Ω,

where p1, p2, . . . , p6 are specified numbers from the interval [0, 1] having unit sum. The
probability that i turns up is pi. The die is fair if pi = 1/6 for each i, in which case

P(A) = |A|/6 for any A ⊆ Ω,

where |A| denotes the cardinality of A.
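On a finite sample space a probability measure is thus specified by a table of weights pi of unit sum. A small Python sketch of ours (not part of the text), mirroring Example (3) with exact arithmetic:

```python
# A probability measure on a finite sample space, as in Example (3):
# assign weights p_i of unit sum, and let P(A) be the sum of weights over A.
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}
p = {i: Fraction(1, 6) for i in omega}   # fair die: p_i = 1/6

def P(A):
    """Probability of the event A, a subset of omega."""
    return sum(p[i] for i in A)

print(P({2, 4, 6}))   # 1/2: the outcome is even
print(P(omega))       # 1: the certain event
print(P(set()))       # 0: the impossible event
```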


The triple (Ω, F, P) denotes a typical probability space. We now give some of its simple
but important properties.
(4) Lemma.
(a) P(Ac ) = 1 − P(A),
(b) if B ⊇ A then P(B) = P(A) + P(B \ A) ≥ P(A),
(c) P(A ∪ B) = P(A) + P(B) − P(A ∩ B),
(d) (inclusion–exclusion principle) more generally, if A1, A2, . . . , An are events, then

P(A1 ∪ A2 ∪ · · · ∪ An) = Σi P(Ai) − Σi<j P(Ai ∩ Aj) + Σi<j<k P(Ai ∩ Aj ∩ Ak) − · · ·
+ (−1)n+1 P(A1 ∩ A2 ∩ · · · ∩ An),

where, for example, Σi<j sums over all unordered pairs (i, j) with i ≠ j.

Proof.
(a) A ∪ Ac = Ω and A ∩ Ac = ∅, so P(A ∪ Ac) = P(A) + P(Ac) = 1.
(b) B = A ∪ (B \ A). This is the union of disjoint sets and therefore

P(B) = P(A) + P(B \ A).

(c) A ∪ B = A ∪ (B \ A), which is a disjoint union. Therefore, by (b),

P(A ∪ B) = P(A) + P(B \ A) = P(A) + P(B \ (A ∩ B))


= P(A) + P(B) − P(A ∩ B).

(d) The proof is by induction on n, and is left as an exercise (see Exercise (1.3.4)). □
In Lemma (4b), B \ A denotes the set of members of B which are not in A. In order to
write down the quantity P(B \ A), we require that B \ A belongs to F , the domain of P; this is
always true when A and B belong to F , and to prove this was part of Exercise (1.2.2). Notice
that each proof proceeded by expressing an event in terms of disjoint unions and then applying
P. It is sometimes easier to calculate the probabilities of intersections of events rather than
their unions; part (d) of the lemma is useful then, as we shall discover soon. The next property
of P is more technical, and says that P is a continuous set function; this property is essentially
equivalent to the condition that P is countably additive rather than just finitely additive (see
Problem (1.8.16) also).
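The inclusion–exclusion formula of Lemma (4d) can be checked mechanically on a finite space. In this sketch of ours (not part of the text), probabilities are computed by counting outcomes under the uniform measure, and the alternating sum over k-fold intersections is compared with the probability of the union computed directly:

```python
from itertools import combinations

omega = set(range(1, 13))          # uniform probability measure on {1, ..., 12}
def P(A):
    return len(A) / len(omega)

events = [{1, 2, 3, 4}, {3, 4, 5, 6}, {4, 6, 8, 10}]
n = len(events)

# Right-hand side: alternating sums of probabilities of k-fold intersections.
rhs = 0.0
for k in range(1, n + 1):
    for sub in combinations(events, k):
        rhs += (-1) ** (k + 1) * P(set.intersection(*sub))

lhs = P(set.union(*events))        # left-hand side: P(A1 ∪ A2 ∪ A3) directly
print(lhs, rhs)                    # the two agree
```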
(5) Theorem. Continuity of probability measures. Let A1, A2, . . . be an increasing sequence
of events, so that A1 ⊆ A2 ⊆ A3 ⊆ · · · , and write A for their limit:

A = A1 ∪ A2 ∪ · · · = limi→∞ Ai.

Then P(A) = limi→∞ P(Ai).

Similarly, if B1, B2, . . . is a decreasing sequence of events, so that B1 ⊇ B2 ⊇ B3 ⊇ · · · ,
then

B = B1 ∩ B2 ∩ · · · = limi→∞ Bi

satisfies P(B) = limi→∞ P(Bi).


Proof. A = A1 ∪ (A2 \ A1) ∪ (A3 \ A2) ∪ · · · is the union of a disjoint family of events.
Thus, by Definition (1),

P(A) = P(A1) + P(A2 \ A1) + P(A3 \ A2) + · · ·
     = P(A1) + limn→∞ [ (P(A2) − P(A1)) + · · · + (P(An) − P(An−1)) ]
     = limn→∞ P(An).

To show the result for decreasing families of events, take complements and use the first part
(exercise). □
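Theorem (5) can be seen concretely on a countable sample space. In this sketch of ours (not from the text) we take Ω = {1, 2, . . . } with P({k}) = 1/2^k, so that P(Ω) = 1, and the increasing events An = {1, . . . , n}, for which P(An) = 1 − 1/2^n increases to P(Ω):

```python
# Continuity of a probability measure: P(An) increases to P(A) when An increases to A.
# Here P({k}) = 2**-k on omega = {1, 2, ...}, and An = {1, ..., n} increases to omega.
def P_An(n):
    """P(An) = sum of 2**-k for k = 1..n, which equals 1 - 2**-n."""
    return sum(2.0 ** -k for k in range(1, n + 1))

for n in (1, 5, 10, 30):
    print(n, P_An(n))   # approaches P(omega) = 1 as n grows
```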
To recapitulate, statements concerning chance are implicitly related to experiments or
trials, the outcomes of which are not entirely predictable. With any such experiment we can
associate a probability space (Ω, F, P) the properties of which are consistent with our shared
and reasonable conceptions of the notion of chance.
(6) Example. In many cases, and especially in games of chance (as in (3) above), it is natural
to assume that all outcomes are equally likely. In such a case, the probability of an event A is

P(A) = |A| / |Ω|.

Here is some final jargon. An event A is called null if P(A) = 0. If P(A) = 1, we say that
A occurs almost surely (frequently abbreviated to a.s.). Null events should not be confused
with the impossible event ∅. Null events are happening all around us, even though they have
zero probability; after all, what is the chance that a dart strikes any given point of the target at
which it is thrown? That is, the impossible event is null, but null events need not be impossible.

Exercises for Section 1.3


1. Let A and B be events with probabilities P(A) = 3/4 and P(B) = 1/3. Show that 1/12 ≤ P(A ∩ B) ≤ 1/3, and give examples to show that both extremes are possible. Find corresponding bounds for P(A ∪ B).
2. A fair coin is tossed repeatedly. Show that, with probability one, a head turns up sooner or later.
Show similarly that any given finite sequence of heads and tails occurs eventually with probability
one. Explain the connection with Murphy’s Law.
3. Six cups and saucers come in pairs: there are two cups and saucers which are red, two white, and
two with stars on. If the cups are placed randomly onto the saucers (one each), find the probability
that no cup is upon a saucer of the same pattern.
4. Let A1, A2, . . . , An be events where n ≥ 2, and prove that

P( ⋃_{i=1}^n Ai ) = Σ_i P(Ai) − Σ_{i<j} P(Ai ∩ Aj) + Σ_{i<j<k} P(Ai ∩ Aj ∩ Ak)
        − · · · + (−1)^{n+1} P(A1 ∩ A2 ∩ · · · ∩ An).

In each packet of Corn Flakes may be found a plastic bust of one of the last five Vice-Chancellors of Cambridge University, the probability that any given packet contains any specific Vice-Chancellor being 1/5, independently of all other packets. Show that the probability that each of the last three Vice-Chancellors is obtained in a bulk purchase of six packets is 1 − 3(4/5)^6 + 3(3/5)^6 − (2/5)^6.
5. Let Ar, r ≥ 1, be events such that P(Ar) = 1 for all r. Show that P( ⋂_{r=1}^∞ Ar ) = 1.
6. You are given that at least one of the events Ar, 1 ≤ r ≤ n, is certain to occur, but certainly no more than two occur. If P(Ar) = p, and P(Ar ∩ As) = q, r ≠ s, show that p ≥ 1/n and q ≤ 2/n.
7. You are given that at least one, but no more than three, of the events Ar, 1 ≤ r ≤ n, occur, where n ≥ 3. The probability of at least two occurring is 1/2. If P(Ar) = p, P(Ar ∩ As) = q, r ≠ s, and P(Ar ∩ As ∩ At) = x, r < s < t, show that p ≥ 3/(2n), and q ≤ 4/n.

1.4 Conditional probability


Many statements about chance take the form ‘if B occurs, then the probability of A is p’,
where B and A are events (such as ‘it rains tomorrow’ and ‘the bus being on time’ respectively)
and p is a likelihood as before. To include this in our theory, we return briefly to the discussion
about proportions at the beginning of the previous section. An experiment is repeated N times,
and on each occasion we observe the occurrences or non-occurrences of two events A and
B. Now, suppose we only take an interest in those outcomes for which B occurs; all other
experiments are disregarded. In this smaller collection of trials the proportion of times that A
occurs is N(A ∩ B)/N(B), since B occurs at each of them. However,

N(A ∩ B)/N(B) = (N(A ∩ B)/N) / (N(B)/N).
If we now think of these ratios as probabilities, we see that the probability that A occurs, given
that B occurs, should be reasonably defined as P(A ∩ B)/P(B).
Probabilistic intuition leads to the same conclusion. Given that an event B occurs, it is the
case that A occurs if and only if A ∩ B occurs. Thus the conditional probability of A given B
should be proportional to P(A ∩ B), which is to say that it equals αP(A ∩ B) for some constant
α = α(B). The conditional probability of Ω given B must equal 1, and thus αP(Ω ∩ B) = 1,
yielding α = 1/P(B).
We formalize these notions as follows.
(1) Definition. If P(B) > 0 then the conditional probability that A occurs given that B occurs is defined to be

P(A | B) = P(A ∩ B) / P(B).

We denote this conditional probability by P(A | B), pronounced ‘the probability of A given
B’, or sometimes ‘the probability of A conditioned (or conditional) on B’.
(2) Example. Two fair dice are thrown. Given that the first shows 3, what is the probability that the total exceeds 6? The answer is obviously 1/2, since the second must show 4, 5, or 6. However, let us labour the point. Clearly Ω = {1, 2, 3, 4, 5, 6}², the set† of all ordered pairs (i, j) for i, j ∈ {1, 2, . . . , 6}, and we can take F to be the set of all subsets of Ω, with P(A) = |A|/36 for any A ⊆ Ω. Let B be the event that the first die shows 3, and A be the event that the total exceeds 6. Then

B = {(3, b) : 1 ≤ b ≤ 6},    A = {(a, b) : a + b > 6},    A ∩ B = {(3, 4), (3, 5), (3, 6)},

and

P(A | B) = P(A ∩ B)/P(B) = |A ∩ B|/|B| = 3/6.
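The ratio-of-frequencies view above suggests an empirical check of this answer. The following Monte Carlo sketch (our own illustration; the function name and trial count are arbitrary) estimates P(A | B) by retaining only those trials in which B occurs:

```python
import random

def estimate_conditional(trials=100_000, seed=1):
    """Estimate P(total > 6 | first die shows 3) for two fair dice."""
    rng = random.Random(seed)
    b_count = 0   # trials in which B occurs (first die shows 3)
    ab_count = 0  # trials in which A and B both occur
    for _ in range(trials):
        first, second = rng.randint(1, 6), rng.randint(1, 6)
        if first == 3:
            b_count += 1
            if first + second > 6:
                ab_count += 1
    return ab_count / b_count

print(estimate_conditional())  # close to 1/2
```

As the number of retained trials grows, the estimate settles near the exact value 1/2.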

(3) Example. Boys and girls. A family has two children. What is the probability that both are boys, given that at least one is a boy? The older and younger child may each be male or female, so there are four possible combinations of sexes, which we assume to be equally likely. Hence we can represent the sample space in the obvious way as

Ω = {GG, GB, BG, BB}

where P(GG) = P(BB) = P(GB) = P(BG) = 1/4. From the definition of conditional probability,

P(BB | one boy at least) = P(BB | GB ∪ BG ∪ BB)
    = P(BB ∩ (GB ∪ BG ∪ BB)) / P(GB ∪ BG ∪ BB)
    = P(BB) / P(GB ∪ BG ∪ BB) = 1/3.
†Remember that A × B = {(a, b) : a ∈ A, b ∈ B} and that A × A = A2 .

A popular but incorrect answer to the question is 1/2. This is the correct answer to another question: for a family with two children, what is the probability that both are boys given that the younger is a boy? In this case,

P(BB | younger is a boy) = P(BB | GB ∪ BB)
    = P(BB ∩ (GB ∪ BB)) / P(GB ∪ BB) = P(BB) / P(GB ∪ BB) = 1/2.

The usual dangerous argument contains the assertion

P(BB | one child is a boy) = P(other child is a boy).

Why is this meaningless? [Hint: Consider the sample space.]
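The hint can be followed mechanically: fix the sample space and count. Here is a small sketch (ours, not the book's) that computes both conditional probabilities exactly over Ω = {GG, GB, BG, BB}:

```python
from fractions import Fraction

# The four equally likely sex patterns (older child listed first).
omega = ["GG", "GB", "BG", "BB"]

def pr(event):
    """P(event) under the uniform measure on omega."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

def cond(event, given):
    """P(event | given) = P(event and given) / P(given)."""
    return pr(lambda w: event(w) and given(w)) / pr(given)

at_least_one_boy = lambda w: "B" in w
younger_is_boy   = lambda w: w[1] == "B"
both_boys        = lambda w: w == "BB"

print(cond(both_boys, at_least_one_boy))  # 1/3
print(cond(both_boys, younger_is_boy))    # 1/2
```

Changing the conditioning event changes the answer, which is precisely why the 'dangerous argument' above is meaningless until the sample space is fixed.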


The next lemma is crucially important in probability theory. A family B1, B2, . . . , Bn of events is called a partition of the set Ω if

Bi ∩ Bj = ∅ when i ≠ j,    and    ⋃_{i=1}^n Bi = Ω.

Each elementary event ω ∈ Ω belongs to exactly one set in a partition of Ω.
(4) Lemma. For any events A and B such that 0 < P(B) < 1,

P(A) = P(A | B)P(B) + P(A | Bᶜ)P(Bᶜ).

More generally, let B1, B2, . . . , Bn be a partition of Ω such that P(Bi) > 0 for all i. Then

P(A) = Σ_{i=1}^n P(A | Bi)P(Bi).

Lemma (4) is often called the law of total probability or the partition theorem for proba-
bilities.
Proof. A = (A ∩ B) ∪ (A ∩ Bᶜ). This is a disjoint union and so

P(A) = P(A ∩ B) + P(A ∩ Bᶜ)
     = P(A | B)P(B) + P(A | Bᶜ)P(Bᶜ).

The second part is similar (see Problem (1.8.10)). □

(5) Example. We are given two urns, each containing a collection of coloured balls. Urn I
contains two white and three blue balls, whilst urn II contains three white and four blue balls.
A ball is drawn at random from urn I and put into urn II, and then a ball is picked at random
from urn II and examined. What is the probability that it is blue? We assume unless otherwise
specified that a ball picked randomly from any urn is equally likely to be any of those present.
The reader will be relieved to know that we no longer need to describe (Ω, F, P) in detail;
we are confident that we could do so if necessary. Clearly, the colour of the final ball depends
on the colour of the ball picked from urn I. So let us ‘condition’ on this. Let A be the event
that the final ball is blue, and let B be the event that the first one picked was blue. Then, by
Lemma (4),
P(A) = P(A | B)P(B) + P(A | Bᶜ)P(Bᶜ).

We can easily find all these probabilities:

P(A | B) = P(A | urn II contains three white and five blue balls) = 5/8,
P(A | Bᶜ) = P(A | urn II contains four white and four blue balls) = 1/2,
P(B) = 3/5,    P(Bᶜ) = 2/5.

Hence

P(A) = (5/8) · (3/5) + (1/2) · (2/5) = 23/40.
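Two-stage calculations of this sort are easily sanity-checked by simulation. The sketch below (our own; the urn contents follow the example) repeats the transfer-then-draw experiment many times:

```python
import random

def urn_experiment(trials=200_000, seed=2):
    """Simulate: move a random ball from urn I (2 white, 3 blue) to
    urn II (3 white, 4 blue), then draw from urn II; return the
    proportion of trials in which the final ball is blue."""
    rng = random.Random(seed)
    blue = 0
    for _ in range(trials):
        urn1 = ["W", "W", "B", "B", "B"]
        urn2 = ["W", "W", "W", "B", "B", "B", "B"]
        urn2.append(urn1[rng.randrange(len(urn1))])  # transfer one ball
        if urn2[rng.randrange(len(urn2))] == "B":
            blue += 1
    return blue / trials

print(urn_experiment())  # close to 23/40 = 0.575
```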

Unprepared readers may have been surprised by the sudden appearance of urns in this book.
In the seventeenth and eighteenth centuries, lotteries often involved the drawing of slips from
urns, and voting was often a matter of putting slips or balls into urns. In France today, aller aux
urnes is synonymous with voting. It was therefore not unnatural for the numerous Bernoullis
and others to model births, marriages, deaths, fluids, gases, and so on, using urns containing
balls of varied hue.†
(6) Example. Zoggles. Only two factories manufacture zoggles. 20 per cent of the zoggles
from factory I and 5 per cent from factory II are defective. Factory I produces twice as many
zoggles as factory II each week. What is the probability that a zoggle, randomly chosen from
a week’s production, is satisfactory? Clearly this satisfaction depends on the factory of origin.
Let A be the event that the chosen zoggle is satisfactory, and let B be the event that it was
made in factory I. Arguing as before,

P(A) = P(A | B)P(B) + P(A | Bᶜ)P(Bᶜ)
     = (4/5) · (2/3) + (19/20) · (1/3) = 51/60.

If the chosen zoggle is defective, what is the probability that it came from factory I? In our notation this is just P(B | Aᶜ). However,

P(B | Aᶜ) = P(B ∩ Aᶜ)/P(Aᶜ) = P(Aᶜ | B)P(B)/P(Aᶜ) = [(1/5) · (2/3)] / [1 − 51/60] = 8/9.
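Both steps of this example, the law of total probability and the Bayesian reversal, can be reproduced with exact rational arithmetic. A minimal sketch (ours), taking the example's numbers as inputs:

```python
from fractions import Fraction

# Inputs taken from the example.
p_B = Fraction(2, 3)            # P(made in factory I)
p_A_given_B  = Fraction(4, 5)   # P(satisfactory | factory I)
p_A_given_Bc = Fraction(19, 20) # P(satisfactory | factory II)

# Law of total probability.
p_A = p_A_given_B * p_B + p_A_given_Bc * (1 - p_B)

# Bayes' reversal: P(factory I | defective).
p_B_given_Ac = (1 - p_A_given_B) * p_B / (1 - p_A)

print(p_A)           # 17/20, i.e. 51/60
print(p_B_given_Ac)  # 8/9
```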

This section is terminated with a cautionary example. It is not untraditional to perpetuate
errors of logic in calculating conditional probabilities. Lack of unambiguous definitions and
notation has led astray many probabilists, including even Boole, who was credited by Russell
with the discovery of pure mathematics and by others for some of the logical foundations of
computing. The well-known ‘prisoners’ paradox’ also illustrates some of the dangers here.
(7) Example. Prisoners’ paradox. In a dark country, three prisoners have been incarcerated
without trial. Their warder tells them that the country’s dictator has decided arbitrarily to free

†Readers favouring applications over thought experiments may wish to consider fish in lakes rather than
balls in urns.
one of them and to shoot the other two, but he is not permitted to reveal to any prisoner the fate of that prisoner. Prisoner A knows therefore that his chance of survival is 1/3. In order to gain information, he asks the warder to tell him in secret the name of some prisoner (but not himself) who will be killed, and the warder names prisoner B. What now is prisoner A's assessment of the chance that he will survive? Could it be 1/2: after all, he knows now that the survivor will be either A or C, and he has no information about which? Could it be 1/3: after all, according to the rules, at least one of B and C has to be killed, and thus the extra information cannot reasonably affect A's earlier calculation of the odds? What does the reader think about this? The resolution of the paradox lies in the situation when either response (B or C) is possible.
An alternative formulation of this paradox has become known as the Monty Hall problem,
the controversy associated with which has been provoked by Marilyn vos Savant (and many
others) in Parade magazine in 1990; see Exercise (1.4.5).
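Protocol (i) of the Monty Hall problem lends itself to direct simulation. The sketch below (our own illustration; it assumes the presenter always reveals a goat, chosen at random when he holds two) compares switching with staying:

```python
import random

def monty_hall(switch, trials=100_000, seed=3):
    """Estimate the win probability when the presenter always reveals
    a goat, choosing at random if two goat doors are available."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        car = rng.randrange(3)
        choice = rng.randrange(3)
        # Presenter opens a goat door other than the chosen one.
        opened = rng.choice([d for d in range(3) if d != choice and d != car])
        if switch:
            choice = next(d for d in range(3) if d != choice and d != opened)
        wins += (choice == car)
    return wins / trials

print(monty_hall(switch=True))   # close to 2/3
print(monty_hall(switch=False))  # close to 1/3
```

Switching wins about two times in three, in agreement with the answer p = 2/3 under protocol (i) of Exercise (1.4.5).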

Exercises for Section 1.4


1. Prove that P(A | B) = P(B | A)P(A)/P(B) whenever P(A)P(B) ≠ 0. Show that, if P(A | B) > P(A), then P(B | A) > P(B).
2. For events A1 , A2 , . . . , An satisfying P(A1 ∩ A2 ∩ · · · ∩ An−1 ) > 0, prove that
P(A1 ∩ A2 ∩ · · · ∩ An ) = P(A1 )P(A2 | A1 )P(A3 | A1 ∩ A2 ) · · · P(An | A1 ∩ A2 ∩ · · · ∩ An−1 ).

3. A man possesses five coins, two of which are double-headed, one is double-tailed, and two are
normal. He shuts his eyes, picks a coin at random, and tosses it. What is the probability that the lower
face of the coin is a head?
He opens his eyes and sees that the coin is showing heads; what is the probability that the lower
face is a head? He shuts his eyes again, and tosses the coin again. What is the probability that the
lower face is a head? He opens his eyes and sees that the coin is showing heads; what is the probability
that the lower face is a head?
He discards this coin, picks another at random, and tosses it. What is the probability that it shows
heads?
4. What do you think of the following 'proof' by Lewis Carroll that an urn cannot contain two balls of the same colour? Suppose that the urn contains two balls, each of which is either black or white; thus, in the obvious notation, P(BB) = P(BW) = P(WB) = P(WW) = 1/4. We add a black ball, so that P(BBB) = P(BBW) = P(BWB) = P(BWW) = 1/4. Next we pick a ball at random; the chance that the ball is black is (using conditional probabilities) 1 · (1/4) + (2/3) · (1/4) + (2/3) · (1/4) + (1/3) · (1/4) = 2/3. However, if there is probability 2/3 that a ball, chosen randomly from three, is black, then there must be two black and one white, which is to say that originally there was one black and one white ball in the urn.
5. The Monty Hall problem: goats and cars. (a) In a game show, you have to choose one of three doors. One conceals a new car, two conceal old goats. You choose, but your chosen door is not opened immediately. Instead the presenter opens another door, which reveals a goat. He offers you the opportunity to change your choice to the third door (unopened and so far unchosen). Let p be the (conditional) probability that the third door conceals the car. The presenter's protocol is:
(i) he is determined to show you a goat; with a choice of two, he picks one at random. Show p = 2/3.
(ii) he is determined to show you a goat; with a choice of two goats (Bill and Nan, say) he shows you Bill with probability b. Show that, given you see Bill, the probability is 1/(1 + b).
(iii) he opens a door chosen at random irrespective of what lies behind. Show p = 1/2.
(b) Show that, for α ∈ [1/2, 2/3], there exists a protocol such that p = α. Are you well advised to change your choice to the third door?
(c) In a variant of this question, the presenter is permitted to open the first door chosen, and to reward
you with whatever lies behind. If he chooses to open another door, then this door invariably conceals
a goat. Let p be the probability that the unopened door conceals the car, conditional on the presenter
having chosen to open a second door. Devise protocols to yield the values p = 0, p = 1, and deduce
that, for any α ∈ [0, 1], there exists a protocol with p = α.
6. The prosecutor’s fallacy†. Let G be the event that an accused is guilty, and T the event that
some testimony is true. Some lawyers have argued on the assumption that P(G | T ) = P(T | G).
Show that this holds if and only if P(G) = P(T ).
7. Urns. There are n urns of which the r th contains r − 1 red balls and n − r magenta balls. You
pick an urn at random and remove two balls at random without replacement. Find the probability that:
(a) the second ball is magenta;
(b) the second ball is magenta, given that the first is magenta.
8. Boys and girls, Example (1.4.3) revisited. Consider a family of two children in a population
in which each child is equally likely to be male as female; each child has red hair with probability r ;
these characteristics are independent of each other and occur independently between children.
What is the probability that both children are boys given that the family contains at least one red-
haired boy? Show that the probability that both are boys, given that one is a boy born on a Monday,
is 13/27.

1.5 Independence
In general, the occurrence of some event B changes the probability that another event A
occurs, the original probability P(A) being replaced by P(A | B). If the probability remains
unchanged, that is to say P(A | B) = P(A), then we call A and B ‘independent’. This is
well defined only if P(B) > 0. Definition (1.4.1) of conditional probability leads us to the
following.
(1) Definition. Events A and B are called independent if

P(A ∩ B) = P(A)P(B).

More generally, a family {Ai : i ∈ I} is called independent if

P( ⋂_{i∈J} Ai ) = ∏_{i∈J} P(Ai)

for all finite subsets J of I.

Remark. A common student error is to make the fallacious statement that A and B are
independent if A ∩ B = ∅.
It is straightforward to prove that two events A, B are independent if and only if A, Bᶜ are independent. Similarly, for J ⊆ I, a family {Ai : i ∈ I} is independent if and only if the family {Ai : i ∈ J} ∪ {Akᶜ : k ∈ I \ J} is independent. See Exercise (1.5.10).

†The prosecution made this error in the famous Dreyfus case of 1894.
If the family {Ai : i ∈ I} has the property that

P(Ai ∩ Aj) = P(Ai)P(Aj)    for all i ≠ j

then it is called pairwise independent. Pairwise-independent families are not necessarily independent, as the following example shows.
(2) Example. Suppose Ω = {abc, acb, cab, cba, bca, bac, aaa, bbb, ccc}, and each of the nine elementary events in Ω occurs with equal probability 1/9. Let Ak be the event that the kth letter is a. It is left as an exercise to show that the family {A1, A2, A3} is pairwise independent but not independent.
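The verification left as an exercise amounts to brute-force counting, which can be sketched as follows (our own code; it enumerates Ω and tests the product rule for every pair and for the triple):

```python
from fractions import Fraction
from itertools import combinations

omega = ["abc", "acb", "cab", "cba", "bca", "bac", "aaa", "bbb", "ccc"]

def pr(event):
    """P(event) under the uniform measure on omega."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

# A[k]: the kth letter is 'a'.
A = [lambda w, k=k: w[k] == "a" for k in range(3)]

# Pairwise independence holds ...
pairwise = all(
    pr(lambda w: A[i](w) and A[j](w)) == pr(A[i]) * pr(A[j])
    for i, j in combinations(range(3), 2)
)
# ... but full independence fails: P(A1 ∩ A2 ∩ A3) = 1/9, not 1/27.
triple = pr(lambda w: all(a(w) for a in A)) == pr(A[0]) * pr(A[1]) * pr(A[2])

print(pairwise, triple)  # True False
```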

(3) Example (1.4.6) revisited. The events A and B of this example are clearly dependent because P(A | B) = 4/5 and P(A) = 51/60.

(4) Example. Choose a card at random from a pack of 52 playing cards, each being picked with equal probability 1/52. We claim that the suit of the chosen card is independent of its rank. For example,

P(king) = 4/52,    P(king | spade) = 1/13.

Alternatively,

P(spade king) = 1/52 = (1/4) · (1/13) = P(spade)P(king).

Let C be an event with P(C) > 0. To the conditional probability measure P( · | C) corresponds the idea of conditional independence. Two events A and B are called conditionally independent given C if

(5)    P(A ∩ B | C) = P(A | C)P(B | C);

there is a natural extension to families of events. [However, note Exercise (1.5.5).]

Exercises for Section 1.5


1. Let A and B be independent events; show that Ac , B are independent, and deduce that Ac , B c
are independent.
2. We roll a die n times. Let Ai j be the event that the i th and j th rolls produce the same number.
Show that the events { Ai j : 1 ≤ i < j ≤ n} are pairwise independent but not independent.
3. A fair coin is tossed repeatedly. Show that the following two statements are equivalent:
(a) the outcomes of different tosses are independent,
(b) for any given finite sequence of heads and tails, the chance of this sequence occurring in the first
m tosses is 2−m , where m is the length of the sequence.
4. Let Ω = {1, 2, . . . , p} where p is prime, F be the set of all subsets of Ω, and P(A) = |A|/p for all A ∈ F. Show that, if A and B are independent events, then at least one of A and B is either ∅ or Ω.
5. Show that the conditional independence of A and B given C neither implies, nor is implied by,
the independence of A and B. For which events C is it the case that, for all A and B, the events A and
B are independent if and only if they are conditionally independent given C?
6. Safe or sorry? Some form of prophylaxis is said to be 90 per cent effective at prevention during
one year’s treatment. If the degrees of effectiveness in different years are independent, show that the
treatment is more likely than not to fail within 7 years.
7. Families. Jane has three children, each of which is equally likely to be a boy or a girl independently of the others. Define the events:

A = {all the children are of the same sex},
B = {there is at most one boy},
C = {the family includes a boy and a girl}.

(a) Show that A is independent of B, and that B is independent of C.
(b) Is A independent of C?
(c) Do these results hold if boys and girls are not equally likely?
(d) Do these results hold if Jane has four children?
8. Galton's paradox. You flip three fair coins. At least two are alike, and it is an evens chance that the third is a head or a tail. Therefore P(all alike) = 1/2. Do you agree?
9. Two fair dice are rolled. Show that the event that their sum is 7 is independent of the score shown
by the first die.
10. Let X and Y be the scores on two fair dice taking values in the set {1, 2, . . . , 6}. Let A1 =
{X + Y = 9}, A2 = {X ∈ {1, 2, 3}}, and A3 = {X ∈ {3, 4, 5}}. Show that

P(A1 ∩ A2 ∩ A3 ) = P(A1 )P(A2 )P(A3 ).

Are these three events independent?

1.6 Completeness and product spaces


This section should be omitted at the first reading, but we shall require its contents later. It
contains only a sketch of complete probability spaces and product spaces; the reader should
look elsewhere for a more detailed treatment (see Billingsley 1995). We require the following
result.
(1) Lemma. If F and G are two σ-fields of subsets of Ω then their intersection F ∩ G is a σ-field also. More generally, if {Fi : i ∈ I} is a family of σ-fields of subsets of Ω then G = ⋂_{i∈I} Fi is a σ-field also.

The proof is not difficult and is left as an exercise. Note that the union F ∪ G may not be a σ-field, although it may be extended to a unique smallest σ-field written σ(F ∪ G), as follows. Let {Gi : i ∈ I} be the collection of all σ-fields which contain both F and G as subsets; this collection is non-empty since it contains the set of all subsets of Ω. Then G = ⋂_{i∈I} Gi is the unique smallest σ-field which contains F ∪ G.
(A) Completeness. Let (Ω, F, P) be a probability space. Any event A which has zero probability, that is P(A) = 0, is called null. It may seem reasonable to suppose that any subset B of a null set A will itself be null, but this may be without meaning since B may not be an event, and thus P(B) may not be defined.
(2) Definition. A probability space (Ω, F, P) is called complete if all subsets of null sets are events.
Any incomplete space can be completed thus. Let N be the collection of all subsets of null sets in F and let G = σ(F ∪ N) be the smallest σ-field which contains all sets in F and N. It can be shown that the domain of P may be extended in an obvious way from F to G; (Ω, G, P) is called the completion of (Ω, F, P).

(B) Product spaces. The probability spaces discussed in this chapter have usually been con-
structed around the outcomes of one experiment, but instances occur naturally when we need
to combine the outcomes of several independent experiments into one space (see Examples
(1.2.4) and (1.4.2)). How should we proceed in general?
Suppose two experiments have associated probability spaces (Ω1, F1, P1) and (Ω2, F2, P2) respectively. The sample space of the pair of experiments, considered jointly, is the collection Ω1 × Ω2 = {(ω1, ω2) : ω1 ∈ Ω1, ω2 ∈ Ω2} of ordered pairs. The appropriate σ-field of events is more complicated to construct. Certainly it should contain all subsets of Ω1 × Ω2 of the form A1 × A2 = {(a1, a2) : a1 ∈ A1, a2 ∈ A2} where A1 and A2 are typical members of F1 and F2 respectively. However, the family of all such sets, F1 × F2 = {A1 × A2 : A1 ∈ F1, A2 ∈ F2}, is not in general a σ-field. By the discussion after (1), there exists a unique smallest σ-field G = σ(F1 × F2) of subsets of Ω1 × Ω2 which contains F1 × F2. All we require now is a suitable probability function on (Ω1 × Ω2, G). Let P12 : F1 × F2 → [0, 1] be given by:

(3)    P12(A1 × A2) = P1(A1)P2(A2)    for A1 ∈ F1, A2 ∈ F2.

It can be shown that the domain of P12 can be extended from F1 × F2 to the whole of G = σ(F1 × F2). The ensuing probability space (Ω1 × Ω2, G, P12) is called the product space of (Ω1, F1, P1) and (Ω2, F2, P2). Products of larger numbers of spaces are constructed similarly. The measure P12 is sometimes called the 'product measure' since its defining equation (3) assumed that two experiments are independent. There are of course many other measures that can be applied to (Ω1 × Ω2, G).
In many simple cases this technical discussion is unnecessary. Suppose that Ω1 and Ω2 are finite, and that their σ-fields contain all their subsets; this is the case in Examples (1.2.4) and (1.4.2). Then G contains all subsets of Ω1 × Ω2.

1.7 Worked examples


Here are some more examples to illustrate the ideas of this chapter. The reader is now equipped to try his or her hand at a substantial number of those problems which exercised the pioneers in probability. These frequently involved experiments having equally likely outcomes, such as dealing whist hands, putting balls of various colours into urns and taking them out again, throwing dice, and so on. In many such instances, the reader will be pleasantly surprised to find that it is not necessary to write down (Ω, F, P) explicitly, but only to think of Ω as being a collection {ω1, ω2, . . . , ωN} of possibilities, each of which may occur with probability 1/N. (Recall Example (1.3.6).) Thus, P(A) = |A|/N for any A ⊆ Ω. The basic tools used in such problems are as follows.
(a) Combinatorics: remember that the number of permutations of n objects is n! and that the number of ways of choosing r objects from n is the binomial coefficient (n r) = n!/(r!(n − r)!).
(b) Set theory: to obtain P(A) we can compute P(Aᶜ) = 1 − P(A) or we can partition A by conditioning on events Bi, and then use Lemma (1.4.4).
(c) Use of independence.
(1) Example. Consider a series of hands dealt at bridge. Let A be the event that in a given deal each player has one ace. Show that the probability that A occurs at least once in seven deals is approximately 1/2.
Solution. The number of ways of dealing 52 cards into four equal hands is 52!/(13!)^4. There are 4! ways of distributing the aces so that each hand holds one, and there are 48!/(12!)^4 ways of dealing the remaining cards. Thus

P(A) = (4! · 48!/(12!)^4) / (52!/(13!)^4) ≈ 1/10.

Now let Bi be the event that A occurs for the first time on the ith deal. Clearly Bi ∩ Bj = ∅, i ≠ j. Thus

P(A occurs in seven deals) = P(B1 ∪ · · · ∪ B7) = Σ_{i=1}^7 P(Bi)    using Definition (1.3.1).

Since successive deals are independent, we have

P(Bi) = P(Aᶜ occurs on deal 1, Aᶜ occurs on deal 2, . . . , Aᶜ occurs on deal i − 1, A occurs on deal i)
      = P(Aᶜ)^{i−1} P(A)    using Definition (1.5.1)
      ≈ (1 − 1/10)^{i−1} (1/10).

Thus

P(A occurs in seven deals) = Σ_{i=1}^7 P(Bi) ≈ Σ_{i=1}^7 (9/10)^{i−1} (1/10) ≈ 1/2.

Can you see an easier way of obtaining this answer?
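One easier route is to note that, by independence, P(A occurs in seven deals) = 1 − (1 − P(A))^7. The exact arithmetic can be checked as follows (a sketch of ours using exact rational arithmetic):

```python
from fractions import Fraction
from math import factorial

# Exact single-deal probability: 4! * 48!/(12!)^4 favourable deals
# out of 52!/(13!)^4 deals in all.
p_A = Fraction(factorial(4) * factorial(48) * factorial(13) ** 4,
               factorial(12) ** 4 * factorial(52))

# Probability that A occurs at least once in seven independent deals.
p_seven = 1 - (1 - p_A) ** 7

print(float(p_A))      # about 0.105
print(float(p_seven))  # about 0.54
```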

(2) Example. There are two roads from A to B and two roads from B to C. Each of the four
roads has probability p of being blocked by snow, independently of all the others. What is
the probability that there is an open road from A to C?
Solution.

P(open road) = P((open road from A to B) ∩ (open road from B to C))
             = P(open road from A to B) P(open road from B to C)

using the independence. However, p is the same for all roads; thus, using Lemma (1.3.4),

P(open road) = (1 − P(no road from A to B))²
             = (1 − P((first road blocked) ∩ (second road blocked)))²
             = (1 − P(first road blocked) P(second road blocked))²

using the independence. Thus

(3)    P(open road) = (1 − p²)².


Further suppose that there is also a direct road from A to C, which is independently blocked with probability p. Then, by Lemma (1.4.4) and equation (3),

P(open road) = P(open road | direct road blocked) · p + P(open road | direct road open) · (1 − p)
             = (1 − p²)² · p + 1 · (1 − p).
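The final expression is conveniently wrapped in a function for experimentation (a sketch of ours; the network layout is that of the example):

```python
def p_open(p):
    """P(open route from A to C) for the network of the example: two
    parallel roads A-B, two parallel roads B-C, plus a direct road
    A-C, each independently blocked with probability p."""
    two_stage = (1 - p ** 2) ** 2       # equation (3): route via B
    return two_stage * p + 1 * (1 - p)  # condition on the direct road

print(p_open(0.5))  # 0.78125
```

Note the sanity checks: p_open(0) = 1 (nothing is blocked) and p_open(1) = 0 (everything is blocked).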

(4) Example. Symmetric random walk (or ‘Gambler’s ruin’). A man is saving up to buy
a new Jaguar at a cost of N units of money. He starts with k units where 0 < k < N, and
tries to win the remainder by the following gamble with his bank manager. He tosses a fair
coin repeatedly; if it comes up heads then the manager pays him one unit, but if it comes up
tails then he pays the manager one unit. He plays this game repeatedly until one of two events
occurs: either he runs out of money and is bankrupted or he wins enough to buy the Jaguar.
What is the probability that he is ultimately bankrupted?
Solution. This is one of many problems the solution to which proceeds by the construction
of a linear difference equation subject to certain boundary conditions. Let A denote the event
that he is eventually bankrupted, and let B be the event that the first toss of the coin shows
heads. By Lemma (1.4.4),

(5) Pk (A) = Pk (A | B)P(B) + Pk (A | B c )P(B c ),

where Pk denotes probabilities calculated relative to the starting point k. We want to find
Pk (A). Consider Pk (A | B). If the first toss is a head then his capital increases to k + 1 units
and the game starts afresh from a different starting point. Thus Pk (A | B) = Pk+1 (A) and
similarly Pk (A | B c ) = Pk−1 (A). So, writing pk = Pk (A), (5) becomes

(6)    pk = ½( pk+1 + pk−1 )    if 0 < k < N,

which is a linear difference equation subject to the boundary conditions p0 = 1, pN = 0.
The analytical solution to such equations is routine, and we shall return later to the general
method of solution. In this case we can proceed directly. We put bk = pk − pk−1 to obtain
bk = bk−1 and hence bk = b1 for all k. Thus

pk = b1 + pk−1 = 2b1 + pk−2 = · · · = kb1 + p0

is the general solution to (6). The boundary conditions imply that p0 = 1, b1 = −1/N, giving

(7)    Pk(A) = 1 − k/N.

As the price of the Jaguar rises, that is as N → ∞, ultimate bankruptcy becomes very likely.
This is the problem of the ‘symmetric random walk with two absorbing barriers’ to which we
shall return in more generality later.
Remark. Our experience of student calculations leads us to stress that probabilities lie be-
tween zero and one; any calculated probability which violates this must be incorrect.
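The closed form (7) is easy to test against simulation. The sketch below (ours; the parameters are arbitrary) plays the fair game to completion many times and records the proportion of ruins:

```python
import random

def ruin_probability(k, N, trials=20_000, seed=4):
    """Estimate the bankruptcy probability for a fair-coin gambler
    starting with k units and stopping at 0 or at the target N."""
    rng = random.Random(seed)
    ruined = 0
    for _ in range(trials):
        capital = k
        while 0 < capital < N:
            capital += 1 if rng.random() < 0.5 else -1
        ruined += (capital == 0)
    return ruined / trials

print(ruin_probability(3, 10))  # close to 1 - 3/10 = 0.7
```

With k = 3 and N = 10, equation (7) gives 1 − 3/10 = 0.7, which the simulation reproduces.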
(8) Example. Testimony. A court is investigating the possible occurrence of an unlikely event
T . The reliability of two independent witnesses called Alf and Bob is known to the court:
Alf tells the truth with probability α and Bob with probability β, and there is no collusion
between the two of them. Let A and B be the events that Alf and Bob assert (respectively)
that T occurred, and let τ = P(T ). What is the probability that T occurred given that both
Alf and Bob declare that T occurred?
Solution. We are asked to calculate P(T | A ∩ B), which is equal to P(T ∩ A ∩ B)/P(A ∩ B).
Now P(T ∩ A ∩ B) = P(A ∩ B | T )P(T ) and

P(A ∩ B) = P(A ∩ B | T )P(T ) + P(A ∩ B | T c )P(T c ).

We have from the independence of the witnesses that A and B are conditionally independent
given either T or T c . Therefore

P(A ∩ B | T) = P(A | T)P(B | T) = αβ,
P(A ∩ B | Tᶜ) = P(A | Tᶜ)P(B | Tᶜ) = (1 − α)(1 − β),

so that

P(T | A ∩ B) = αβτ / (αβτ + (1 − α)(1 − β)(1 − τ)).

As an example, suppose that α = β = 9/10 and τ = 1/1000. Then P(T | A ∩ B) = 81/1080, which is somewhat small as a basis for a judicial conclusion.
This calculation may be informative. However, it is generally accepted that such an appli-
cation of the axioms of probability is inappropriate to questions of truth and belief.
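The posterior formula can be evaluated exactly for the quoted numbers; here is a minimal sketch (ours) with rational arithmetic:

```python
from fractions import Fraction

def p_T_given_AB(alpha, beta, tau):
    """P(T | A ∩ B) from the formula derived in the example."""
    num = alpha * beta * tau
    return num / (num + (1 - alpha) * (1 - beta) * (1 - tau))

p = p_T_given_AB(Fraction(9, 10), Fraction(9, 10), Fraction(1, 1000))
print(p)  # 3/40, i.e. 81/1080
```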

(9) Example. Zoggles revisited. A new process for the production of zoggles is invented,
and both factories of Example (1.4.6) install extra production lines using it. The new process
is cheaper but produces fewer reliable zoggles, only 75 per cent of items produced in this new
way being reliable.
Factory I fails to implement its new production line efficiently, and only 10 per cent of its
output is made in this manner. Factory II does better: it produces 20 per cent of its output by
the new technology, and now produces twice as many zoggles in all as Factory I.
Is the new process beneficial to the consumer?
Solution. Both factories now produce a higher proportion of unreliable zoggles than before,
and so it might seem at first sight that there is an increased proportion of unreliable zoggles
on the market.
Let A be the event that a randomly chosen zoggle is satisfactory, B the event that it came
from factory I, and C the event that it was made by the new method. Then

P(A) = (1/3) P(A | B) + (2/3) P(A | Bᶜ)
     = (1/3) [ (1/10) P(A | B ∩ C) + (9/10) P(A | B ∩ Cᶜ) ]
       + (2/3) [ (1/5) P(A | Bᶜ ∩ C) + (4/5) P(A | Bᶜ ∩ Cᶜ) ]
     = (1/3) [ (1/10)(3/4) + (9/10)(4/5) ] + (2/3) [ (1/5)(3/4) + (4/5)(19/20) ]
     = 523/600 > 51/60,

so that the proportion of satisfactory zoggles has been increased.
Figure 1.1. Two unions of rectangles illustrating Simpson's paradox.

(10) Example. Simpson’s paradox†. A doctor has performed clinical trials to determine
the relative efficacies of two drugs, with the following results.

               Women                 Men
          Drug I   Drug II     Drug I   Drug II
Success     200       10         19      1000
Failure    1800      190          1      1000
Which drug is the better? Here are two conflicting responses.
1. Drug I was given to 2020 people, of whom 219 were cured. The success rate was 219/2020, which is much smaller than the corresponding figure, 1010/2200, for drug II. Therefore drug II is better than drug I.
2. Amongst women the success rates of the drugs are 1/10 and 1/20, and amongst men 19/20 and 1/2. Drug I wins in both cases.
This well-known statistical paradox may be reformulated in the following more general
way. Given three events A, B, C, it is possible to allocate probabilities such that

(11) P(A | B ∩ C) > P(A | B c ∩ C) and P(A | B ∩ C c ) > P(A | B c ∩ C c )

but

(12) P(A | B) < P(A | B c ).

We may think of A as the event that treatment is successful, B as the event that drug I is given
to a randomly chosen individual, and C as the event that this individual is female. The above
inequalities imply that B is preferred to B c when C occurs and when C c occurs, but B c is
preferred to B overall.

†This paradox, named after Simpson (1951), was remarked by Yule in 1903, having been observed previously
by Pearson in 1899. The nomenclature is an instance of Stigler’s law of eponymy: “No law, theorem, or discovery
is named after its originator”. This law applies to many eponymous statements in this book, including the law
itself. As remarked by A. N. Whitehead, “Everything of importance has been said before, by somebody who
did not discover it”.

Setting

a = P(A ∩ B ∩ C), b = P(Ac ∩ B ∩ C),


c = P(A ∩ B c ∩ C), d = P(Ac ∩ B c ∩ C),
e = P(A ∩ B ∩ C c ), f = P(Ac ∩ B ∩ C c ),
g = P(A ∩ B c ∩ C c ), h = P(Ac ∩ B c ∩ C c ),

and expanding (11)–(12), we arrive at the (equivalent) inequalities

(13) ad > bc, eh > f g, (a + e)(d + h) < (b + f )(c + g),

subject to the conditions a, b, c, . . . , h ≥ 0 and a + b + c + · · · + h = 1. Inequalities (13)


are equivalent to the existence of two rectangles R1 and R2 , as in Figure 1.1, satisfying

area (D1 ) > area (D2 ), area (D3 ) > area (D4 ), area (R1 ) < area (R2 ).
Many such rectangles may be found, by inspection, as for example those with a = 3/30, b = 1/30,
c = 8/30, d = 3/30, e = 3/30, f = 8/30, g = 1/30, h = 3/30. Similar conclusions are valid for finer
partitions {Ci : i ∈ I } of the sample space, though the corresponding pictures are harder to
draw.
Simpson’s paradox has arisen many times in practical situations. There are many well-
known cases, including the admission of graduate students to the University of California at
Berkeley and a clinical trial comparing treatments for kidney stones.
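Both the rectangle construction and the doctor’s table can be verified with exact arithmetic; the following sketch (not part of the text) checks inequalities (13) and the rate reversal in the trial data.

```python
from fractions import Fraction as F

# The rectangle example: joint probabilities a, ..., h (each a multiple of 1/30).
a, b, c, d, e, f, g, h = (F(x, 30) for x in (3, 1, 8, 3, 3, 8, 1, 3))
assert a + b + c + d + e + f + g + h == 1

cond1 = a * d > b * c                          # P(A|B∩C)   > P(A|B^c∩C)
cond2 = e * h > f * g                          # P(A|B∩C^c) > P(A|B^c∩C^c)
cond3 = (a + e) * (d + h) < (b + f) * (c + g)  # yet B^c beats B overall
assert cond1 and cond2 and cond3

# The clinical-trial table shows the same reversal.
rate = lambda successes, total: F(successes, total)
assert rate(200, 2000) > rate(10, 200)     # women: drug I wins (1/10 > 1/20)
assert rate(19, 20) > rate(1000, 2000)     # men:   drug I wins (19/20 > 1/2)
assert rate(219, 2020) < rate(1010, 2200)  # overall: drug II wins
```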

(14) Example. False positives. A rare disease affects one person in 10^5 . A test for the
disease shows positive with probability 99/100 when applied to an ill person, and with probability
1/100 when applied to a healthy person. What is the probability that you have the disease given
that the test shows positive?
Solution. In the obvious notation,

    P(ill | +) = P(+ | ill)P(ill) / { P(+ | ill)P(ill) + P(+ | healthy)P(healthy) }
               = (99/100) · 10^{−5} / { (99/100) · 10^{−5} + (1/100)(1 − 10^{−5}) }
               = 99/(99 + 10^5 − 1) ≃ 1/1011.

The chance of being ill is rather small. Indeed it is more likely that the test was incorrect.
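As a check on the arithmetic, the Bayes computation can be reproduced exactly; this sketch is illustrative and not part of the text.

```python
from fractions import Fraction as F

p_ill = F(1, 10**5)
p_pos_given_ill = F(99, 100)
p_pos_given_healthy = F(1, 100)

# denominator of Bayes's formula: total probability of a positive test
p_pos = p_pos_given_ill * p_ill + p_pos_given_healthy * (1 - p_ill)
p_ill_given_pos = p_pos_given_ill * p_ill / p_pos

assert p_ill_given_pos == F(99, 99 + 10**5 - 1)
assert abs(float(p_ill_given_pos) - 1 / 1011) < 1e-4   # roughly 1 in 1011
```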

Exercises for Section 1.7


1. There are two roads from A to B and two roads from B to C. Each of the four roads is blocked by
snow with probability p, independently of the others. Find the probability that there is an open road
from A to B given that there is no open route from A to C.
If, in addition, there is a direct road from A to C, this road being blocked with probability p
independently of the others, find the required conditional probability.
2. Calculate the probability that a hand of 13 cards dealt from a normal shuffled pack of 52 contains
exactly two kings and one ace. What is the probability that it contains exactly one ace given that it
contains exactly two kings?

3. A symmetric random walk takes place on the integers 0, 1, 2, . . . , N with absorbing barriers at 0
and N , starting at k. Show that the probability that the walk is never absorbed is zero.
4. The so-called ‘sure thing principle’ asserts that if you prefer x to y given C, and also prefer x to
y given C c , then you surely prefer x to y. Agreed?
5. A pack contains m cards, labelled 1, 2, . . . , m. The cards are dealt out in a random order, one
by one. Given that the label of the kth card dealt is the largest of the first k cards dealt, what is the
probability that it is also the largest in the pack?
6. A group of 2b friends meet for a bridge soirée. There are m men and 2b − m women where
2 ≤ m ≤ b. The group divides into b teams of pairs, formed uniformly at random. What is the
probability that no pair comprises 2 men?

1.8 Problems
1. A traditional fair die is thrown twice. What is the probability that:
(a) a six turns up exactly once?
(b) both numbers are odd?
(c) the sum of the scores is 4?
(d) the sum of the scores is divisible by 3?
2. A fair coin is thrown repeatedly. What is the probability that on the nth throw:
(a) a head appears for the first time?
(b) the numbers of heads and tails to date are equal?
(c) exactly two heads have appeared altogether to date?
(d) at least two heads have appeared to date?
3. Let F and G be σ-fields of subsets of Ω.
(a) Use elementary set operations to show that F is closed under countable intersections; that is, if
A1 , A2 , . . . are in F, then so is ⋂_i A_i .
(b) Let H = F ∩ G be the collection of subsets of Ω lying in both F and G. Show that H is a σ-field.
(c) Show that F ∪ G, the collection of subsets of Ω lying in either F or G, is not necessarily a σ-field.
4. Describe the underlying probability spaces for the following experiments:
(a) a biased coin is tossed three times;
(b) two balls are drawn without replacement from an urn which originally contained two ultramarine
and two vermilion balls;
(c) a biased coin is tossed repeatedly until a head turns up.
5. Show that the probability that exactly one of the events A and B occurs is

P(A) + P(B) − 2P(A ∩ B).

6. Prove that P(A ∪ B ∪ C) = 1 − P(Ac | B c ∩ C c )P(B c | C c )P(C c ).


7. (a) If A is independent of itself, show that P(A) is 0 or 1.
(b) If P(A) is 0 or 1, show that A is independent of all events B.
8. Let F be a σ-field of subsets of Ω, and suppose P : F → [0, 1] satisfies: (i) P(Ω) = 1, and (ii) P
is additive, in that P(A ∪ B) = P(A) + P(B) whenever A ∩ B = ∅. Show that P(∅) = 0.
9. Suppose (Ω, F, P) is a probability space and B ∈ F satisfies P(B) > 0. Let Q : F → [0, 1] be
defined by Q(A) = P(A | B). Show that (Ω, F, Q) is a probability space. If C ∈ F and Q(C) > 0,
show that Q(A | C) = P(A | B ∩ C); discuss.

10. Let B1 , B2 , . . . be a partition of the sample space Ω, each Bi having positive probability, and
show that

    P(A) = ∑_{j=1}^∞ P(A | B_j) P(B_j).

11. Prove Boole’s inequalities:

    P(⋃_{i=1}^n A_i) ≤ ∑_{i=1}^n P(A_i),        P(⋂_{i=1}^n A_i) ≥ 1 − ∑_{i=1}^n P(A_i^c).

12. Prove that

    P(⋂_{i=1}^n A_i) = ∑_i P(A_i) − ∑_{i<j} P(A_i ∪ A_j) + ∑_{i<j<k} P(A_i ∪ A_j ∪ A_k)
                        − · · · − (−1)^n P(A_1 ∪ A_2 ∪ · · · ∪ A_n).

13. Let A_1 , A_2 , . . . , A_n be events, and let N_k be the event that exactly k of the A_i occur. Prove the
result sometimes referred to as Waring’s theorem:

    P(N_k) = ∑_{i=0}^{n−k} (−1)^i C(k+i, k) S_{k+i} ,   where   S_j = ∑_{i_1<i_2<···<i_j} P(A_{i_1} ∩ A_{i_2} ∩ · · · ∩ A_{i_j}).

Use this result to find an expression for the probability that a purchase of six packets of Corn Flakes
yields exactly three distinct busts (see Exercise (1.3.4)).
14. Prove Bayes’s formula: if A_1 , A_2 , . . . , A_n is a partition of Ω, each A_i having positive probability,
then

    P(A_j | B) = P(B | A_j) P(A_j) / ∑_{i=1}^n P(B | A_i) P(A_i).

15. A random number N of dice is thrown. Let Ai be the event that N = i , and assume that
P(Ai ) = 2−i , i ≥ 1. The sum of the scores is S. Find the probability that:
(a) N = 2 given S = 4;
(b) S = 4 given N is even;
(c) N = 2, given that S = 4 and the first die showed 1;
(d) the largest number shown by any die is r , where S is unknown.
16. Let A_1 , A_2 , . . . be a sequence of events. Define

    B_n = ⋃_{m=n}^∞ A_m ,        C_n = ⋂_{m=n}^∞ A_m .

Clearly C_n ⊆ A_n ⊆ B_n . The sequences {B_n} and {C_n} are decreasing and increasing respectively
with limits

    lim B_n = B = ⋂_n B_n = ⋂_n ⋃_{m≥n} A_m ,        lim C_n = C = ⋃_n C_n = ⋃_n ⋂_{m≥n} A_m .

The events B and C are denoted lim sup_{n→∞} A_n and lim inf_{n→∞} A_n respectively. Show that
(a) B = {ω ∈ Ω : ω ∈ A_n for infinitely many values of n},
(b) C = {ω ∈ Ω : ω ∈ A_n for all but finitely many values of n}.

We say that the sequence { An } converges to a limit A = lim An if B and C are the same set A. Suppose
that An → A and show that
(c) A is an event, in that A ∈ F,
(d) P(An ) → P(A).
17. In Problem (1.8.16) above, show that B and C are independent whenever Bn and Cn are inde-
pendent for all n. Deduce that if this holds and furthermore An → A, then P(A) equals either zero or
one.
18. Show that the assumption that P is countably additive is equivalent to the assumption that P is
continuous. That is to say, show that if a function P : F → [0, 1] satisfies P(∅) = 0, P(Ω) = 1, and
P(A ∪ B) = P(A) + P(B) whenever A, B ∈ F and A ∩ B = ∅, then P is countably additive (in the
sense of satisfying Definition (1.3.1b)) if and only if P is continuous (in the sense of Lemma (1.3.5)).

19. Anne, Betty, Chloë, and Daisy were all friends at school. Subsequently each of the C(4, 2) = 6
subpairs meet up; at each of the six meetings the pair involved quarrel with some fixed probability
p, or become firm friends with probability 1 − p. Quarrels take place independently of each other.
In future, if any of the four hears a rumour, then she tells it to her firm friends only. If Anne hears a
rumour, what is the probability that:
(a) Daisy hears it?
(b) Daisy hears it if Anne and Betty have quarrelled?
(c) Daisy hears it if Betty and Chloë have quarrelled?
(d) Daisy hears it if she has quarrelled with Anne?
20. A biased coin is tossed repeatedly. Each time there is a probability p of a head turning up. Let pn
be the probability that an even number of heads has occurred after n tosses (zero is an even number).
Show that p0 = 1 and that pn = p(1 − pn−1 ) + (1 − p) pn−1 if n ≥ 1. Solve this difference equation.
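Without giving away the closed-form solution, the stated difference equation can be checked numerically against a direct sum over the binomial distribution; the sketch below (not part of the text) uses an arbitrary illustrative bias p = 1/3.

```python
from fractions import Fraction as F
from math import comb

def p_even_direct(n, p):
    # P(even number of heads in n tosses), summed over the binomial pmf
    return sum(F(comb(n, k)) * p**k * (1 - p)**(n - k)
               for k in range(0, n + 1, 2))

p = F(1, 3)        # an arbitrary bias, chosen for illustration
pn = F(1)          # p_0 = 1: zero heads is an even number
for n in range(1, 10):
    pn = p * (1 - pn) + (1 - p) * pn      # the stated difference equation
    assert pn == p_even_direct(n, p)      # agrees exactly for each n
```

The recurrence simply conditions on the parity after n − 1 tosses: the count becomes even either by a head after an odd count, or a tail after an even one.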
21. A biased coin is tossed repeatedly. Find the probability that there is a run of r heads in a row
before there is a run of s tails, where r and s are positive integers.
22. (a) A bowl contains twenty cherries, exactly fifteen of which have had their stones removed. A
greedy pig eats five whole cherries, picked at random, without remarking on the presence or
absence of stones. Subsequently, a cherry is picked randomly from the remaining fifteen.
(i) What is the probability that this cherry contains a stone?
(ii) Given that this cherry contains a stone, what is the probability that the pig consumed at least
one stone?
(b) 100 contestants buy numbered lottery tickets for a reverse raffle, in which the last ticket drawn
from an urn is the winner. Halfway through the draw, the Mistress of Ceremonies discovers that
10 tickets have inadvertently not been added to the urn, so she adds them, and continues the draw.
Is the lottery fair?
23. The ‘ménages’ problem poses the following question. Some consider it to be desirable that men
and women alternate when seated at a circular table. If n heterosexual couples are seated randomly
according to this rule, show that the probability that nobody sits next to his or her partner is
    (1/n!) ∑_{k=0}^n (−1)^k (2n/(2n − k)) C(2n − k, k) (n − k)!

You may find it useful to show first that the number of ways of selecting k non-overlapping pairs of
adjacent seats is C(2n − k, k) 2n(2n − k)^{−1} .
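The stated formula can be checked against a brute-force count over all seatings for small n; this sketch (not part of the text) fixes the women in alternate seats and permutes the men, which by symmetry loses nothing.

```python
from fractions import Fraction as F
from itertools import permutations
from math import comb, factorial

def menage_probability(n):
    # The stated formula: (1/n!) sum_k (-1)^k (2n/(2n-k)) C(2n-k, k) (n-k)!
    total = sum(F((-1)**k * 2 * n, 2 * n - k) * comb(2 * n - k, k)
                * factorial(n - k) for k in range(n + 1))
    return total / factorial(n)

def menage_bruteforce(n):
    # Women sit at seats 0, 2, ..., 2n-2 in fixed order; the men fill the
    # odd seats in all n! ways.  The man in seat 2k+1 sits beside women
    # k and (k+1) mod n, so couple i are adjacent iff pi[k] hits k or k+1.
    good = sum(all(pi[k] != k and pi[k] != (k + 1) % n for k in range(n))
               for pi in permutations(range(n)))
    return F(good, factorial(n))

for n in (3, 4, 5, 6):
    assert menage_probability(n) == menage_bruteforce(n)
```

For n = 3, 4, 5 the probabilities come out as 1/6, 1/12, 13/120, matching the classical ménage numbers 1, 2, 13.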
24. An urn contains b blue balls and r red balls. They are removed at random and not replaced. Show
that the probability that the first red ball drawn is the (k + 1)th ball drawn equals
C(r + b − k − 1, r − 1) / C(r + b, b).
Find the probability that the last ball drawn is red.
25. An urn contains a azure balls and c carmine balls, where ac 6= 0. Balls are removed at random
and discarded until the first time that a ball (B, say) is removed having a different colour from its

predecessor. The ball B is now replaced and the procedure restarted. This process continues until the
last ball is drawn from the urn. Show that this last ball is equally likely to be azure or carmine.
26. Protocols. A pack of four cards contains one spade, one club, and the two red aces. You deal
two cards face downwards at random in front of a truthful friend. She inspects them and tells you
that one of them is the ace of hearts. What is the chance that the other card is the ace of diamonds?
Perhaps 1/3?
Suppose that your friend’s protocol was:
(a) with no red ace, say “no red ace”,
(b) with the ace of hearts, say “ace of hearts”,
(c) with the ace of diamonds but not the ace of hearts, say “ace of diamonds”.
Show that the probability in question is 1/3.
Devise a possible protocol for your friend such that the probability in question is zero.
27. Eddington’s controversy. Four witnesses, A, B, C, and D, at a trial each speak the truth with
probability 1/3 independently of each other. In their testimonies, A claimed that B denied that C declared
that D lied. What is the (conditional) probability that D told the truth? [This problem seems to have
appeared first as a parody in a university magazine of the ‘typical’ Cambridge Philosophy Tripos
question.]
28. The probabilistic method. 10 per cent of the surface of a sphere S is coloured blue, the rest is red.
Show that, irrespective of the manner in which the colours are distributed, it is possible to inscribe a
cube in S with all its vertices red.
29. Repulsion. The event A is said to be repelled by the event B if P(A | B) < P(A), and to be
attracted by B if P(A | B) > P(A). Show that if B attracts A, then A attracts B, and B c repels A.
If A attracts B, and B attracts C, does A attract C?
30. Birthdays. At a lecture, there are m students born on independent days in 2007.
(a) With 2 ≤ m ≤ 365, show that the probability that at least two of them share a birthday is
p = 1 − (365)!/{(365 − m)! 365^m }. Show that p > 1/2 when m = 23.
(b) With 2 ≤ m ≤ 366, find the probability p1 that exactly one pair of individuals share a birthday,
with no others sharing.
(c) Suppose m students are born on independent random days on the planet Magrathea, whose year has
M ≫ m days. Show that the probability p_0 that no two students share a birthday is approximately
exp{−m(m − 1)/(2M)} for large M.
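Parts (a) and (c) can be checked numerically; the sketch below (not part of the text) computes the exact no-collision probability as a product, and compares it with the stated approximation for a hypothetical Magrathean year of 10^6 days.

```python
from fractions import Fraction as F
from math import exp

def p_no_shared(m, M=365):
    # exact P(no two of m people share a birthday) for a year of M days
    prob = F(1)
    for i in range(m):
        prob *= F(M - i, M)     # the (i+1)th person avoids the first i days
    return prob

# Part (a): with m = 23 the chance of a shared birthday exceeds 1/2.
assert 1 - p_no_shared(23) > F(1, 2)

# Part (c): for M much larger than m, p_0 is close to exp(-m(m-1)/(2M)).
m, M = 50, 10**6                # illustrative Magrathean parameters
exact = float(p_no_shared(m, M))
approx = exp(-m * (m - 1) / (2 * M))
assert abs(exact - approx) < 1e-4
```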
31. Lottery. You choose r of the first n positive integers, and a lottery chooses a random subset L of
the same size. What is the probability that:
(a) L includes no consecutive integers?
(b) L includes exactly one pair of consecutive integers?
(c) the numbers in L are drawn in increasing order?
(d) your choice of numbers is the same as L?
(e) there are exactly k of your numbers matching members of L?
32. Bridge. During a game of bridge, you are dealt at random a hand of thirteen cards. With an
obvious notation, show that P(4S, 3H, 3D, 3C) ≃ 0.026 and P(4S, 4H, 3D, 2C) ≃ 0.018. However
if suits are not specified, so numbers denote the shape of your hand, show that P(4, 3, 3, 3) ≃ 0.11
and P(4, 4, 3, 2) ≃ 0.22.
33. Poker. During a game of poker, you are dealt a five-card hand at random. With the convention
that aces may count high or low, show that:

P(1 pair) ≃ 0.423, P(2 pairs) ≃ 0.0475, P(3 of a kind) ≃ 0.021,


P(straight) ≃ 0.0039, P(flush) ≃ 0.0020, P(full house) ≃ 0.0014,
P(4 of a kind) ≃ 0.00024, P(straight flush) ≃ 0.000015.

34. Poker dice. There are five dice each displaying 9, 10, J, Q, K, A. Show that, when rolled:

P(1 pair) ≃ 0.46, P(2 pairs) ≃ 0.23, P(3 of a kind) ≃ 0.15,


P(no 2 alike) ≃ 0.093, P(full house) ≃ 0.039, P(4 of a kind) ≃ 0.019,
P(5 of a kind) ≃ 0.0008.

35. You are lost in the National Park of Bandrika†. Tourists comprise two-thirds of the visitors to
the park, and give a correct answer to requests for directions with probability 3/4. (Answers to repeated
questions are independent, even if the question and the person are the same.) If you ask a Bandrikan
for directions, the answer is always false.
(a) You ask a passer-by whether the exit from the Park is East or West. The answer is East. What is
the probability this is correct?
(b) You ask the same person again, and receive the same reply. Show that the probability that it is
correct is 1/2.
(c) You ask the same person again, and receive the same reply. What is the probability that it is
correct?
(d) You ask for the fourth time, and receive the answer East. Show that the probability it is correct
is 27/70.
(e) Show that, had the fourth answer been West instead, the probability that East is nevertheless
correct is 9/10.
36. Mr Bayes goes to Bandrika. Tom is in the same position as you were in the previous problem,
but he has reason to believe that, with probability ǫ, East is the correct answer. Show that:
(a) whatever answer first received, Tom continues to believe that East is correct with probability ǫ,
(b) if the first two replies are the same (that is, either WW or EE), Tom continues to believe that East
is correct with probability ǫ,
(c) after three like answers, Tom will calculate as follows, in the obvious notation:

    P(East correct | EEE) = 9ǫ/(11 − 2ǫ),        P(East correct | WWW) = 11ǫ/(9 + 2ǫ).

Evaluate these when ǫ = 9/20.
37. Bonferroni’s inequality. Show that

    P(⋃_{r=1}^n A_r) ≥ ∑_{r=1}^n P(A_r) − ∑_{r<k} P(A_r ∩ A_k).

38. Kounias’s inequality. Show that

    P(⋃_{r=1}^n A_r) ≤ min_k { ∑_{r=1}^n P(A_r) − ∑_{r : r≠k} P(A_r ∩ A_k) }.

39. The lost boarding pass‡. The n passengers for a Bell-Air flight in an airplane with n seats have
been told their seat numbers. They get on the plane one by one. The first person sits in the wrong seat.
Subsequent passengers sit in their assigned seats whenever they find them available, or otherwise in a

†A fictional country made famous in the Hitchcock film ‘The Lady Vanishes’.
‡The authors learned of this problem from David Bell in 2000 or earlier.

randomly chosen empty seat. What is the probability that the last passenger finds his or her assigned
seat to be free?
What is the answer if the first person sits in a seat chosen uniformly at random from the n available?
40. Flash’s problem. A number n of spaceships land independently and uniformly at random on the
surface of planet Mongo. Each ship controls the hemisphere of which it is the centre. Show that the
probability that every point on Mongo is controlled by at least one ship is 1 − 2^{−n}(n^2 − n + 2). [Hint:
n great circles almost surely partition the surface of the sphere into n^2 − n + 2 disjoint regions.]
41. Let X be uniformly distributed on {1, 2, . . . , n − 1}, where n ≥ 2. Given X, a team of size X is
selected at random from a pool of n players (including you), each such subset of size X being equally
likely. Call the selected team A, and the remainder team B.
(a) What is the probability that your team has size k?
(b) Each team picks a captain uniformly at random from its members. What is the probability your
team has size k given that you are chosen as captain?
42. Alice and Bob flip a fair coin in turn. A wins if she gets a head, provided her preceding flip was
a tail; B wins if he gets a tail, provided his preceding flip was a head. Let n ≥ 3. Show that the
probability the game ends on the nth flip is (n + 1)(n − 1)/2^{n+2} if n is odd, and (n + 2)(n − 2)/2^{n+2}
if n is even.
What is the probability that A wins the game?
43. A coin comes up heads with probability p ∈ (0, 1). Let k ≥ 1, and let ρ_m be the probability that,
in m (≥ 1) coin flips, the longest run of consecutive heads has length strictly less than k. Show that

    ρ_m − ρ_{m−1} + (1 − p) p^k ρ_{m−k−1} = 0,   m ≥ k + 1,

and find ρ_m when k = 2.
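The stated difference equation can be checked against a brute-force enumeration of all head/tail sequences for small m; the sketch below (not part of the text) uses arbitrary illustrative values p = 2/5 and k = 3.

```python
from fractions import Fraction as F
from itertools import product

def rho_bruteforce(m, k, p):
    # P(longest run of heads in m flips is strictly shorter than k),
    # by summing the probability of every head/tail sequence
    total = F(0)
    for seq in product((0, 1), repeat=m):     # 1 = head
        run = best = 0
        for x in seq:
            run = run + 1 if x else 0
            best = max(best, run)
        if best < k:
            heads = sum(seq)
            total += p**heads * (1 - p)**(m - heads)
    return total

p, k = F(2, 5), 3                # arbitrary illustrative values
rho = [rho_bruteforce(m, k, p) for m in range(12)]
for m in range(k + 1, 12):       # the stated difference equation
    assert rho[m] - rho[m - 1] + (1 - p) * p**k * rho[m - k - 1] == 0
```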


2
Random variables and their distributions

Summary. Quantities governed by randomness correspond to functions on the


probability space called random variables. The value taken by a random vari-
able is subject to chance, and the associated likelihoods are described by a
function called the distribution function. Two important classes of random
variables are discussed, namely discrete variables and continuous variables.
The law of averages, known also as the law of large numbers, states that the
proportion of successes in a long run of independent trials converges to the
probability of success in any one trial. This result provides a mathematical
basis for a philosophical view of probability based on repeated experimenta-
tion. Worked examples involving random variables and their distributions are
included, and the chapter terminates with sections on random vectors and on
Monte Carlo simulation.

2.1 Random variables


We shall not always be interested in an experiment itself, but rather in some consequence
of its random outcome. For example, many gamblers are more concerned with their losses
than with the games which give rise to them. Such consequences, when real valued, may
be thought of as functions which map  into the real line R, and these functions are called
‘random† variables’.

(1) Example. A fair coin is tossed twice: Ω = {HH, HT, TH, TT}. For ω ∈ Ω, let X (ω) be
the number of heads, so that

X (HH) = 2, X (HT) = X (TH) = 1, X (TT) = 0.

Now suppose that a gambler wagers his fortune of £1 on the result of this experiment. He
gambles cumulatively so that his fortune is doubled each time a head appears, and is annihilated
on the appearance of a tail. His subsequent fortune W is a random variable given by

W (HH) = 4, W (HT) = W (TH) = W (TT) = 0.

†Derived from the Old French word randon meaning ‘haste’.



[Figure 2.1. The distribution function FX of the random variable X of Examples (1) and (5): a step
function rising from 0 to 1/4 at x = 0, to 3/4 at x = 1, and to 1 at x = 2.]

After the experiment is done and the outcome ω ∈ Ω is known, a random variable X : Ω →
R takes some value. In general this numerical value is more likely to lie in certain subsets
of R than in certain others, depending on the probability space (Ω, F , P) and the function X
itself. We wish to be able to describe the distribution of the likelihoods of possible values of
X. Example (1) above suggests that we might do this through the function f : R → [0, 1]
defined by
f (x) = probability that X is equal to x,
but this turns out to be inappropriate in general. Rather, we use the distribution function
F : R → [0, 1] defined by

F(x) = probability that X does not exceed x.

More rigorously, this is

(2) F(x) = P(A(x))

where A(x) ⊆ Ω is given by A(x) = {ω ∈ Ω : X (ω) ≤ x}. However, P is a function on the
collection F of events; we cannot discuss P(A(x)) unless A(x) belongs to F , and so we are
led to the following definition.
(3) Definition. A random variable is a function X : Ω → R with the property that {ω ∈ Ω :
X (ω) ≤ x} ∈ F for each x ∈ R. Such a function is said to be F -measurable.

If you so desire, you may pay no attention to the technical condition in the definition
and think of random variables simply as functions mapping Ω into R. We shall always use
upper-case letters, such as X, Y , and Z , to represent generic random variables, whilst lower-
case letters, such as x, y, and z, will be used to represent possible numerical values of these
variables. Do not confuse this notation in your written work.
Every random variable has a distribution function, given by (2); distribution functions are
very important and useful.
(4) Definition. The distribution function of a random variable X is the function F : R →
[0, 1] given by F(x) = P(X ≤ x).

This is the obvious abbreviation of equation (2). Events written as {ω ∈ Ω : X (ω) ≤ x}
are commonly abbreviated to {ω : X (ω) ≤ x} or {X ≤ x}. We write FX where it is necessary
to emphasize the role of X.

[Figure 2.2. The distribution function FW of the random variable W of Examples (1) and (5): a step
function rising from 0 to 3/4 at x = 0 and to 1 at x = 4.]

(5) Example (1) revisited. The distribution function FX of X is given by

    FX (x) = 0     if x < 0,
             1/4   if 0 ≤ x < 1,
             3/4   if 1 ≤ x < 2,
             1     if x ≥ 2,

and is sketched in Figure 2.1. The distribution function FW of W is given by

    FW (x) = 0     if x < 0,
             3/4   if 0 ≤ x < 4,
             1     if x ≥ 4,

and is sketched in Figure 2.2. This illustrates the important point that the distribution function
of a random variable X tells us about the values taken by X and their relative likelihoods,
rather than about the sample space and the collection of events.
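The distribution functions above can be computed directly from the four equally likely outcomes; the following sketch (not part of the text, with illustrative names) does exactly that.

```python
from fractions import Fraction as F

# The probability space of Example (1): two tosses of a fair coin.
omega = ["HH", "HT", "TH", "TT"]            # four equally likely outcomes
X = {w: w.count("H") for w in omega}        # number of heads
W = {"HH": 4, "HT": 0, "TH": 0, "TT": 0}    # the gambler's fortune

def distribution_function(rv):
    # F(x) = P(rv <= x), counting outcomes of mass 1/4 each
    return lambda x: F(sum(1 for w in omega if rv[w] <= x), len(omega))

FX = distribution_function(X)
FW = distribution_function(W)

assert [FX(x) for x in (-1, 0, 1, 2)] == [0, F(1, 4), F(3, 4), 1]
assert [FW(x) for x in (-1, 0, 3, 4)] == [0, F(3, 4), F(3, 4), 1]
```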

(6) Lemma. A distribution function F has the following properties:
(a) lim_{x→−∞} F(x) = 0,   lim_{x→∞} F(x) = 1,
(b) if x < y then F(x) ≤ F(y),
(c) F is right-continuous, that is, F(x + h) → F(x) as h ↓ 0.

Proof.
(a) Let Bn = {ω ∈ Ω : X (ω) ≤ −n} = {X ≤ −n}. The sequence B1 , B2 , . . . is decreasing
with the empty set as limit. Thus, by Theorem (1.3.5), P(Bn ) → P(∅) = 0. The other
part is similar.
(b) Let A(x) = {X ≤ x}, A(x, y) = {x < X ≤ y}. Then A(y) = A(x) ∪ A(x, y) is a
disjoint union, and so by Definition (1.3.1),

P(A(y)) = P(A(x)) + P(A(x, y))

giving
F(y) = F(x) + P(x < X ≤ y) ≥ F(x).
(c) This is an exercise. Use Theorem (1.3.5). 

Actually, this lemma characterizes distribution functions. That is to say, F is the distribution
function of some random variable if and only if it satisfies (6a), (6b), and (6c).
For the time being we can forget all about probability spaces and concentrate on random
variables and their distribution functions. The distribution function F of X contains a great
deal of information about X.

(7) Example. Constant variables. The simplest random variable takes a constant value on
the whole domain Ω. Let c ∈ R and define X : Ω → R by

    X (ω) = c   for all ω ∈ Ω.

The distribution function F(x) = P(X ≤ x) is the step function

    F(x) = 0 if x < c,    1 if x ≥ c.

Slightly more generally, we call X constant (almost surely) if there exists c ∈ R such that
P(X = c) = 1. A constant random variable is sometimes said to be degenerate or trivial.

(8) Example. Bernoulli variables. Consider Example (1.3.2). Let X : Ω → R be given by

X (H ) = 1, X (T ) = 0.

Then X is the simplest non-trivial random variable, having two possible values, 0 and 1. Its
distribution function F(x) = P(X ≤ x) is

    F(x) = 0       if x < 0,
           1 − p   if 0 ≤ x < 1,
           1       if x ≥ 1.

X is said to have the Bernoulli distribution, sometimes denoted Bern( p).

(9) Example. Indicator functions. A particular class of Bernoulli variables is very useful
in probability theory. Let A be an event and let I_A : Ω → R be the indicator function of A;
that is,

    I_A (ω) = 1 if ω ∈ A,    0 if ω ∈ A^c.

Then I_A is a Bernoulli random variable taking the values 1 and 0 with probabilities P(A) and
P(A^c ) respectively. Suppose {B_i : i ∈ I } is a family of disjoint events with A ⊆ ⋃_{i∈I} B_i .
Then

(10)    I_A = ∑_i I_{A∩B_i} ,

an identity which is often useful. We sometimes write I (A) for I_A .



(11) Lemma. Let F be the distribution function of X . Then


(a) P(X > x) = 1 − F(x),
(b) P(x < X ≤ y) = F(y) − F(x),
(c) P(X = x) = F(x) − lim_{y↑x} F(y).

Proof. (a) and (b) are exercises.


(c) Let Bn = {x − 1/n < X ≤ x} and use the method of proof of Lemma (6). 
Note one final piece of jargon for future use. A random variable X with distribution function
F is said to have two ‘tails’ given by

T1 (x) = P(X > x) = 1 − F(x), T2 (x) = P(X ≤ −x) = F(−x),

where x is large and positive. We shall see later that the rates at which the Ti decay to zero
as x → ∞ have a substantial effect on the existence or non-existence of certain associated
quantities called the ‘moments’ of the distribution.

Exercises for Section 2.1


1. Let X be a random variable on a given probability space, and let a ∈ R. Show that
(a) a X is a random variable,
(b) X − X = 0, the random variable taking the value 0 always, and X + X = 2X.
2. A random variable X has distribution function F. What is the distribution function of Y = a X +b,
where a and b are real constants?
3. A fair coin is tossed n times. Show that, under reasonable assumptions, the probability of exactly
k heads is C(n, k)(1/2)^n . What is the corresponding quantity when heads appears with probability p on
each toss?
4. Show that if F and G are distribution functions and 0 ≤ λ ≤ 1 then λF +(1−λ)G is a distribution
function. Is the product F G a distribution function?
5. Let F be a distribution function and r a positive integer. Show that the following are distri-
bution functions: (a) F(x)r , (b) 1 − {1 − F(x)}r , (c) F(x) + {1 − F(x)} log{1 − F(x)},
(d) {F(x) − 1}e + exp{1 − F(x)}.
6. Uniform distribution. A random variable that is equally likely to take any value in a finite set S
is said to have the uniform distribution on S. If U is such a random variable and ∅ 6= R ⊆ S, show
that the distribution of U conditional on {U ∈ R} is uniform on R.

2.2 The law of averages


We may recall the discussion in Section 1.3 of repeated experimentation. In each of N
repetitions of an experiment, we observe whether or not a given event A occurs, and we write
N(A) for the total number of occurrences of A. One possible philosophical underpinning of
probability theory requires that the proportion N(A)/N settles down as N → ∞ to some
limit interpretable as the ‘probability of A’. Is our theory to date consistent with such a
requirement?
With this question in mind, let us suppose that A1 , A2 , . . . is a sequence of independent
events having equal probability P(Ai ) = p, where 0 < p < 1; such an assumption requires of

course the existence of a corresponding probability space (Ω, F , P), but we do not plan to get
bogged down in such matters here. We think of Ai as being the event ‘that A occurs on the ith
experiment’. We write Sn = ∑_{i=1}^n I_{Ai} , the sum of the indicator functions of A1 , A2 , . . . , An ;
Sn is a random variable which counts the number of occurrences of Ai for 1 ≤ i ≤ n (certainly
Sn is a function on Ω, since it is the sum of such functions, and it is left as an exercise to show
that Sn is F -measurable). The following result concerning the ratio n−1 Sn was proved by
James Bernoulli before 1692.
(1) Theorem. It is the case that n−1 Sn converges to p as n → ∞ in the sense that, for all
ǫ > 0,

    P( p − ǫ ≤ n−1 Sn ≤ p + ǫ) → 1   as n → ∞.

There are certain technicalities involved in the study of the convergence of random variables
(see Chapter 7), and this is the reason for the careful statement of the theorem. For the time
being, we encourage the reader to interpret the theorem as asserting simply that the proportion
n −1 Sn of times that the events A1 , A2 , . . . , An occur converges as n → ∞ to their common
probability p. We shall see later how important it is to be careful when making such statements.
Interpreted in terms of tosses of a fair coin, the theorem implies that the proportion of heads
is (with large probability) near to 1/2. As a caveat regarding the difficulties inherent in studying
the convergence of random variables, we remark that it is not true that, in a ‘typical’ sequence
of tosses of a fair coin, heads outnumber tails about one-half of the time.
Proof. Suppose that we toss a coin repeatedly, and heads occurs on each toss with probability
p. The random variable Sn has the same probability distribution as the number Hn of heads
which occur during the first n tosses, which is to say that P(Sn = k) = P(Hn = k) for all k.
It follows that, for small positive values of ǫ,

    P(n−1 Sn ≥ p + ǫ) = ∑_{k≥n( p+ǫ)} P(Hn = k).

We have from Exercise (2.1.3) that

    P(Hn = k) = C(n, k) p^k (1 − p)^{n−k}   for 0 ≤ k ≤ n,
and hence

(2)    P(n−1 Sn ≥ p + ǫ) = ∑_{k=m}^n C(n, k) p^k (1 − p)^{n−k}
where m = ⌈n( p + ǫ)⌉, the least integer not less than n( p + ǫ). The following argument is
standard in probability theory. Let λ > 0 and note that e^{λk} ≥ e^{λn( p+ǫ)} if k ≥ m. Writing
  X n  
1 λ[k−n( p+ǫ)] n
P Sn ≥ p + ǫ ≤ e pk q n−k
n k
k=m
Xn  
−λnǫ n
≤e ( peλq )k (qe−λp )n−k
k
k=0
= e−λnǫ ( peλq + qe−λp )n ,

by the binomial theorem. It is a simple exercise to show that $e^{x} \le x + e^{x^2}$ for $x \in \mathbb{R}$. With the aid of this inequality, we obtain

$$(3)\qquad P\bigl(n^{-1}S_n \ge p + \epsilon\bigr) \le e^{-\lambda n\epsilon}\bigl(pe^{\lambda^2 q^2} + qe^{\lambda^2 p^2}\bigr)^{n} \le e^{\lambda^2 n - \lambda n\epsilon}.$$
We can pick λ to minimize the right-hand side, namely $\lambda = \frac12\epsilon$, giving

$$(4)\qquad P\bigl(n^{-1}S_n \ge p + \epsilon\bigr) \le e^{-\frac14 n\epsilon^2} \qquad \text{for } \epsilon > 0,$$
an inequality that is commonly known as ‘Bernstein’s inequality’. It follows immediately
that P(n −1 Sn ≥ p + ǫ) → 0 as n → ∞. An exactly analogous argument shows that
P(n −1 Sn ≤ p − ǫ) → 0 as n → ∞, and thus the theorem is proved. 
Bernstein’s inequality (4) is rather powerful, asserting that the chance that Sn deviates from
np by a quantity of order n tends to zero exponentially fast as n → ∞; such an inequality is
known as a ‘large-deviation estimate’†. We may use the inequality to prove rather more than
the conclusion of the theorem. Instead of estimating the chance that, for a specific value of
$n$, $S_n$ lies between $n(p - \epsilon)$ and $n(p + \epsilon)$, let us estimate the chance that this occurs for all large $n$. Writing $A_n = \{p - \epsilon \le n^{-1}S_n \le p + \epsilon\}$, we wish to estimate $P\bigl(\bigcap_{n=m}^{\infty} A_n\bigr)$. Now the complement of this intersection is the event $\bigcup_{n=m}^{\infty} A_n^{c}$, and the probability of this union satisfies, by the inequalities of Boole and Bernstein,

$$(5)\qquad P\Bigl(\bigcup_{n=m}^{\infty} A_n^{c}\Bigr) \le \sum_{n=m}^{\infty} P(A_n^{c}) \le \sum_{n=m}^{\infty} 2e^{-\frac14 n\epsilon^2} \to 0 \qquad \text{as } m \to \infty,$$

giving that, as required,

$$(6)\qquad P\bigl(p - \epsilon \le n^{-1}S_n \le p + \epsilon \text{ for all } n \ge m\bigr) \to 1 \qquad \text{as } m \to \infty.$$
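The convergence asserted by the theorem, and the exponential rate promised by Bernstein's inequality (4), may be observed numerically. The following Python sketch is an illustration of ours, not part of the text; the choices of p, ǫ, and the numbers of trials are arbitrary. It estimates P(n⁻¹Sₙ ≥ p + ǫ) by repeated simulation and compares it with the bound e^{−nǫ²/4}.

```python
import math
import random

def deviation_prob(n, p, eps, trials=2000, rng=None):
    """Estimate P(S_n / n >= p + eps), where S_n counts heads in n tosses."""
    rng = rng or random.Random(0)
    count = 0
    for _ in range(trials):
        s = sum(rng.random() < p for _ in range(n))  # S_n for one experiment
        if s / n >= p + eps:
            count += 1
    return count / trials

p, eps = 0.5, 0.1
for n in (10, 100, 500):
    est = deviation_prob(n, p, eps)
    bound = math.exp(-n * eps ** 2 / 4)  # Bernstein's inequality (4)
    print(n, est, bound)
```

The estimated probabilities decay far faster than the bound, which is consistent with (4) being an inequality rather than an asymptotic equality.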

Exercises for Section 2.2


1. You wish to ask each of a large number of people a question to which the answer “yes” is
embarrassing. The following procedure is proposed in order to determine the embarrassed fraction of
the population. As the question is asked, a coin is tossed out of sight of the questioner. If the answer
would have been “no” and the coin shows heads, then the answer “yes” is given. Otherwise people
respond truthfully. What do you think of this procedure?
2. A coin is tossed repeatedly and heads turns up on each toss with probability p. Let Hn and Tn be
the numbers of heads and tails in n tosses. Show that, for ǫ > 0,
 
$$P\bigl(2p - 1 - \epsilon \le n^{-1}(H_n - T_n) \le 2p - 1 + \epsilon\bigr) \to 1 \qquad \text{as } n \to \infty.$$

3. Let {X r : r ≥ 1} be observations which are independent and identically distributed with unknown
distribution function F. Describe and justify a method for estimating F(x).

†The quantity np is of course the mean of Sn . See Example (3.3.7).



2.3 Discrete and continuous variables


Much of the study of random variables is devoted to distribution functions, characterized by
Lemma (2.1.6). The general theory of distribution functions and their applications is quite
difficult and abstract and is best omitted at this stage. It relies on a rigorous treatment of
the construction of the Lebesgue–Stieltjes integral; this is sketched in Section 5.6. However,
things become much easier if we are prepared to restrict our attention to certain subclasses
of random variables specified by properties which make them tractable. We shall consider in
depth the collection of ‘discrete’ random variables and the collection of ‘continuous’ random
variables.
(1) Definition. The random variable X is called discrete if it takes values in some countable
subset {x 1 , x 2 , . . . }, only, of R. The discrete random variable X has (probability) mass
function f : R → [0, 1] given by f (x) = P(X = x).

We shall see that the distribution function of a discrete variable has jump discontinuities
at the values x 1 , x 2 , . . . and is constant in between; such a distribution is called atomic. This
contrasts sharply with the other important class of distribution functions considered here.
(2) Definition. The random variable X is called continuous if its distribution function can
be expressed as
$$F(x) = \int_{-\infty}^{x} f(u)\,du, \qquad x \in \mathbb{R},$$

for some integrable function f : R → [0, ∞) called the (probability) density function of X.

The distribution function of a continuous random variable is certainly continuous (actually


it is ‘absolutely continuous’). For the moment we are concerned only with discrete variables
and continuous variables. There is another sort of random variable, called ‘singular’, for a
discussion of which the reader should look elsewhere. A common example of this phenomenon
is based upon the Cantor ternary set† (see Grimmett and Welsh 2014, or Billingsley 1995).
Other variables are ‘mixtures’ of discrete, continuous, and singular variables. Note that the
word ‘continuous’ is a misnomer when used in this regard: in describing X as continuous,
we are referring to a property of its distribution function rather than of the random variable
(function) X itself.

(3) Example. Discrete variables. The variables X and W of Example (2.1.1) take values in
the sets {0, 1, 2} and {0, 4} respectively; they are both discrete.

(4) Example. Continuous variables. A straight rod is flung down at random onto a horizontal
plane and the angle ω between the rod and true north is measured. The result is a number in Ω = [0, 2π). Never mind about F for the moment; we can suppose that F contains all nice subsets of Ω, including the collection of open subintervals such as (a, b), where 0 ≤ a < b < 2π. The implicit symmetry suggests the probability measure P which satisfies P((a, b)) = (b − a)/(2π); that is to say, the probability that the angle lies in some interval is

†The first such example of a nowhere dense subset of [0, 1] with strictly positive Lebesgue measure is in
fact due to H. J. S. Smith, whose discovery predated that of Cantor by eight years. This is an imperfect example
of Stigler’s law of eponymy since Smith’s set was to base 4 and Cantor’s to base 3.
36 2.3 Random variables and their distributions

Figure 2.3. The distribution function FX of the random variable X in Example (5).

directly proportional to the length of the interval. Here are two random variables X and Y :
$X(\omega) = \omega$, $Y(\omega) = \omega^2$.
Notice that Y is a function of X in that $Y = X^2$. The distribution functions of X and Y are

$$F_X(x) = \begin{cases} 0 & x \le 0,\\ x/(2\pi) & 0 \le x < 2\pi,\\ 1 & x \ge 2\pi, \end{cases} \qquad F_Y(y) = \begin{cases} 0 & y \le 0,\\ \sqrt{y}/(2\pi) & 0 \le y < 4\pi^2,\\ 1 & y \ge 4\pi^2. \end{cases}$$

To see this, let $0 \le x < 2\pi$ and $0 \le y < 4\pi^2$. Then

$$F_X(x) = P\bigl(\{\omega \in \Omega : 0 \le X(\omega) \le x\}\bigr) = P\bigl(\{\omega \in \Omega : 0 \le \omega \le x\}\bigr) = x/(2\pi),$$
$$F_Y(y) = P\bigl(\{\omega : Y(\omega) \le y\}\bigr) = P\bigl(\{\omega : \omega^2 \le y\}\bigr) = P\bigl(\{\omega : 0 \le \omega \le \sqrt{y}\}\bigr) = P(X \le \sqrt{y}) = \sqrt{y}/(2\pi).$$
The random variables X and Y are continuous because
$$F_X(x) = \int_{-\infty}^{x} f_X(u)\,du, \qquad F_Y(y) = \int_{-\infty}^{y} f_Y(u)\,du,$$
where
$$f_X(u) = \begin{cases} 1/(2\pi) & \text{if } 0 \le u \le 2\pi,\\ 0 & \text{otherwise}, \end{cases} \qquad f_Y(u) = \begin{cases} u^{-1/2}/(4\pi) & \text{if } 0 < u \le 4\pi^2,\\ 0 & \text{otherwise}. \end{cases}$$
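These calculations are easy to verify numerically. The following Python sketch, ours rather than the text's, with an arbitrary sample size, simulates the uniform angle and compares the empirical distribution functions of X and Y with the exact values x/(2π) and √y/(2π).

```python
import math
import random

def empirical_cdf(samples, t):
    """Proportion of samples not exceeding t."""
    return sum(s <= t for s in samples) / len(samples)

rng = random.Random(1)
omegas = [rng.uniform(0.0, 2 * math.pi) for _ in range(100_000)]
xs = omegas                   # X(omega) = omega
ys = [w * w for w in omegas]  # Y(omega) = omega^2

# Compare empirical and exact distribution functions at the point 1
print(empirical_cdf(xs, 1.0), 1 / (2 * math.pi))            # F_X(1)
print(empirical_cdf(ys, 1.0), math.sqrt(1.0) / (2 * math.pi))  # F_Y(1)
```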

(5) Example. A random variable which is neither continuous nor discrete. A coin is
tossed, and a head turns up with probability p (= 1−q). If a head turns up then a rod is flung on
the ground and the angle measured as in Example (4). Then Ω = {T} ∪ {(H, x) : 0 ≤ x < 2π}, in the obvious notation. Let X : Ω → R be given by
X (T) = −1, X ((H, x)) = x.
The random variable X takes values in {−1} ∪ [0, 2π) (see Figure 2.3 for a sketch of its
distribution function). We say that X is continuous with the exception of a ‘point mass (or
atom) at −1’.

Exercises for Section 2.3


1. Let X be a random variable with distribution function F, and let a = (am : −∞ < m < ∞)
be a strictly increasing sequence of real numbers satisfying a−m → −∞ and am → ∞ as m → ∞.
Define G(x) = P(X ≤ am ) when am−1 ≤ x < am , so that G is the distribution function of a discrete
random variable. How does the function G behave as the sequence a is chosen in such a way that
supm |am − am−1 | becomes smaller and smaller?
2. Let X be a random variable and let g : R → R be continuous and strictly increasing. Show that
Y = g(X) is a random variable.
3. Let X be a random variable with distribution function
$$P(X \le x) = \begin{cases} 0 & \text{if } x \le 0,\\ x & \text{if } 0 < x \le 1,\\ 1 & \text{if } x > 1. \end{cases}$$

Let F be a distribution function which is continuous and strictly increasing. Show that Y = F −1 (X)
is a random variable having distribution function F. Is it necessary that F be continuous and/or strictly
increasing?
4. Show that, if f and g are density functions, and 0 ≤ λ ≤ 1, then λ f + (1 − λ)g is a density. Is
the product f g a density function?
5. Which of the following are density functions? Find c and the corresponding distribution function
F for those that are.
(a) $f(x) = \begin{cases} cx^{-d} & x > 1,\\ 0 & \text{otherwise}. \end{cases}$
(b) $f(x) = ce^{x}(1 + e^{x})^{-2}$, $x \in \mathbb{R}$.

2.4 Worked examples

(1) Example. Darts. A dart is flung at a circular target of radius 3. We can think of the
hitting point as the outcome of a random experiment; we shall suppose for simplicity that the
player is guaranteed to hit the target somewhere. Setting the centre of the target at the origin
of $\mathbb{R}^2$, we see that the sample space of this experiment is

$$\Omega = \{(x, y) : x^2 + y^2 < 9\}.$$

Never mind about the collection F of events. Let us suppose that, roughly speaking, the
probability that the dart lands in some region A is proportional to its area |A|. Thus

(2) P(A) = |A|/(9π).

The scoring system is as follows. The target is partitioned by three concentric circles C1 , C2 ,
and C3 , centred at the origin with radii 1, 2, and 3. These circles divide the target into three
annuli A₁, A₂, and A₃, where
$$A_k = \bigl\{(x, y) : k - 1 \le \sqrt{x^2 + y^2} < k\bigr\}.$$

Figure 2.4. The distribution function FX of X in Example (1).

We suppose that the player scores an amount k if and only if the dart hits Ak . The resulting
score X is the random variable given by

X (ω) = k whenever ω ∈ Ak .

What is its distribution function?


Solution. Clearly

$$P(X = k) = P(A_k) = |A_k|/(9\pi) = \tfrac19(2k - 1), \qquad \text{for } k = 1, 2, 3,$$

and so the distribution function of X is given by

$$F_X(r) = P(X \le r) = \begin{cases} 0 & \text{if } r < 1,\\ \tfrac19 \lfloor r \rfloor^{2} & \text{if } 1 \le r < 3,\\ 1 & \text{if } r \ge 3, \end{cases}$$

where ⌊r⌋ denotes the largest integer not larger than r (see Figure 2.4).
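The mass function just computed can be checked by simulation. The Python sketch below is an illustration of ours; the rejection-sampling helper and the sample size are our own choices. It draws uniform points on the target and compares the empirical frequencies of the scores with (2k − 1)/9.

```python
import math
import random

def random_point_on_target(rng, radius=3.0):
    """Rejection sampling: a uniform point in the open disc of given radius."""
    while True:
        x = rng.uniform(-radius, radius)
        y = rng.uniform(-radius, radius)
        if x * x + y * y < radius * radius:
            return x, y

def score(x, y):
    """X(omega) = k when the dart lands in the annulus A_k."""
    return math.floor(math.hypot(x, y)) + 1

rng = random.Random(2)
n = 100_000
counts = {1: 0, 2: 0, 3: 0}
for _ in range(n):
    x, y = random_point_on_target(rng)
    counts[score(x, y)] += 1

for k in (1, 2, 3):
    print(k, counts[k] / n, (2 * k - 1) / 9)  # empirical vs exact
```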

(3) Example. Continuation of (1). Let us consider a revised method of scoring in which the
player scores an amount equal to the distance between the hitting point ω and the centre of
the target. This time the score Y is a random variable given by
$$Y(\omega) = \sqrt{x^2 + y^2}, \qquad \text{if } \omega = (x, y).$$

What is the distribution function of Y ?


Solution. For any real r let $C_r$ denote the disc with centre (0, 0) and radius r, that is
$$C_r = \bigl\{(x, y) : \sqrt{x^2 + y^2} \le r\bigr\}.$$
Then
$$F_Y(r) = P(Y \le r) = P(C_r) = \tfrac19 r^2 \qquad \text{if } 0 \le r \le 3.$$

Figure 2.5. The distribution function FY of Y in Example (3).

Figure 2.6. The distribution function FZ of Z in Example (4).

This distribution function is sketched in Figure 2.5.

(4) Example. Continuation of (1). Now suppose that the player fails to hit the target with
fixed probability p; if he is successful then we suppose that the distribution of the hitting point
is described by equation (2). His score is specified as follows. If he hits the target then he
scores an amount equal to the distance between the hitting point and the centre; if he misses
then he scores 4. What is the distribution function of his score Z ?
Solution. Clearly Z takes values in the interval [0, 4]. Use Lemma (1.4.4) to see that
$$F_Z(r) = P(Z \le r) = P(Z \le r \mid \text{hits target})P(\text{hits target}) + P(Z \le r \mid \text{misses target})P(\text{misses target})$$
$$= \begin{cases} 0 & \text{if } r < 0,\\ (1 - p)F_Y(r) & \text{if } 0 \le r < 4,\\ 1 & \text{if } r \ge 4, \end{cases}$$
where FY is given in Example (3) (see Figure 2.6 for a sketch of FZ ).
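The mixed distribution of Z lends itself to a simulation check. In the Python sketch below, which is ours and in which the value of p and the sample size are arbitrary, a miss is scored 4 with probability p, and otherwise the distance from the centre is drawn by inverting F_Y.

```python
import math
import random

def sample_Z(rng, p):
    """Score of Example (4): miss (score 4) with probability p; otherwise a
    distance from the centre with F_Y(r) = r^2/9 on [0, 3]."""
    if rng.random() < p:
        return 4.0
    # inverse transform: F_Y(r) = r^2 / 9  =>  r = 3 * sqrt(U)
    return 3.0 * math.sqrt(rng.random())

rng = random.Random(3)
p, n = 0.25, 100_000
zs = [sample_Z(rng, p) for _ in range(n)]

r = 2.0
empirical = sum(z <= r for z in zs) / n
print(empirical, (1 - p) * r * r / 9)  # F_Z(r) = (1 - p) F_Y(r) for 0 <= r < 4
```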

Exercises for Section 2.4


1. Let X be a random variable with a continuous distribution function F. Find expressions for the
distribution functions of the following random variables:

(a) $X^2$, (b) $\sqrt{X}$,
(c) $\sin X$, (d) $G^{-1}(X)$,
(e) $F(X)$, (f) $G^{-1}(F(X))$,
where G is a continuous and strictly increasing function.
2. Truncation. Let X be a random variable with distribution function F, and let a < b. Sketch the
distribution functions of the ‘truncated’ random variables Y and Z given by

$$Y = \begin{cases} a & \text{if } X < a,\\ X & \text{if } a \le X \le b,\\ b & \text{if } X > b, \end{cases} \qquad Z = \begin{cases} X & \text{if } |X| \le b,\\ 0 & \text{if } |X| > b. \end{cases}$$

Indicate how these distribution functions behave as a → −∞, b → ∞.

2.5 Random vectors


Suppose that X and Y are random variables on the probability space (Ω, F, P). Their distribution functions, FX and FY, contain information about their associated probabilities. But
how may we encapsulate information about their properties relative to each other? The key
is to think of X and Y as being the components of a ‘random vector’ (X, Y ) taking values in
R2 , rather than being unrelated random variables each taking values in R.

(1) Example. Tontine is a scheme wherein subscribers to a common fund each receive an
annuity from the fund during his or her lifetime, this annuity increasing as the other subscribers
die. When all the subscribers are dead, the fund passes to the French government (this was
the case in the first such scheme designed by Lorenzo Tonti around 1653). The performance
of the fund depends on the lifetimes L 1 , L 2 , . . . , L n of the subscribers (as well as on their
wealths), and we may record these as a vector (L 1 , L 2 , . . . , L n ) of random variables.

(2) Example. Darts. A dart is flung at a conventional dartboard. The point of striking
determines a distance R from the centre, an angle Θ with the upward vertical (measured clockwise, say), and a score S. With this experiment we may associate the random vector (R, Θ, S), and we note that S is a function of the pair (R, Θ).

(3) Example. Coin tossing. Suppose that we toss a coin n times, and set X i equal to 0
or 1 depending on whether the i th toss results in a tail or a head. We think of the vector
X = (X 1 , X 2 , . . . , X n ) as describing the result of this composite experiment. The total
number of heads is the sum of the entries in X.
An individual random variable X has a distribution function FX defined by FX (x) =
P(X ≤ x) for x ∈ R. The corresponding ‘joint’ distribution function of a random vector
(X 1 , X 2 , . . . , X n ) is the quantity P(X 1 ≤ x 1 , X 2 ≤ x 2 , . . . , X n ≤ x n ), a function of n real
variables x 1 , x 2 , . . . , x n . In order to aid the notation, we introduce an ordering of vectors of

real numbers: for vectors x = (x 1 , x 2 , . . . , x n ) and y = (y1 , y2 , . . . , yn ) we write x ≤ y if


x i ≤ yi for each i = 1, 2, . . . , n.
(4) Definition. The joint distribution function of a random vector X = (X₁, X₂, . . . , Xₙ) on the probability space (Ω, F, P) is the function FX : Rⁿ → [0, 1] given by FX(x) = P(X ≤ x) for x ∈ Rⁿ.

As before, the expression {X ≤ x} is an abbreviation for the event {ω ∈ Ω : X(ω) ≤ x}.


Joint distribution functions have properties similar to those of ordinary distribution functions.
For example, Lemma (2.1.6) becomes the following.
(5) Lemma. The joint distribution function FX,Y of the random vector (X, Y ) has the follow-
ing properties:
(a) lim x,y→−∞ FX,Y (x, y) = 0, lim x,y→∞ FX,Y (x, y) = 1,
(b) if (x 1 , y1 ) ≤ (x 2 , y2 ) then FX,Y (x 1 , y1 ) ≤ FX,Y (x 2 , y2 ),
(c) FX,Y is continuous from above, in that

FX,Y (x + u, y + v) → FX,Y (x, y) as u, v ↓ 0.

We state this lemma for a random vector with only two components X and Y , but the
corresponding result for n components is valid also. The proof of the lemma is left as an
exercise. Rather more is true. It may be seen without great difficulty that

$$(6)\qquad \lim_{y \to \infty} F_{X,Y}(x, y) = F_X(x) \quad (= P(X \le x)),$$

and similarly

$$(7)\qquad \lim_{x \to \infty} F_{X,Y}(x, y) = F_Y(y) \quad (= P(Y \le y)).$$

This more refined version of part (a) of the lemma tells us that we may recapture the individual
distribution functions of X and Y from a knowledge of their joint distribution function. The
converse is false: it is not generally possible to calculate FX,Y from a knowledge of FX and
FY alone. The functions FX and FY are called the ‘marginal’ distribution functions of FX,Y .
(8) Example. A schoolteacher asks each member of his or her class to flip a fair coin twice
and to record the outcomes. The diligent pupil D does this and records a pair (X D , Y D ) of
outcomes. The lazy pupil L flips the coin only once and writes down the result twice, recording
thus a pair (X L , Y L ) where X L = Y L . Clearly X D , Y D , X L , and Y L are random variables with
the same distribution functions. However, the pairs (X D , Y D ) and (X L , Y L ) have different
joint distribution functions. In particular, P(X_D = Y_D = heads) = 1/4 since only one of the four possible pairs of outcomes contains heads only, whereas P(X_L = Y_L = heads) = 1/2.
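The contrast between the two pairs is visible in a short simulation. This Python sketch is ours; the function names and sample size are arbitrary choices.

```python
import random

def diligent(rng):
    """Two independent fair flips (True = heads)."""
    return rng.random() < 0.5, rng.random() < 0.5

def lazy(rng):
    """One fair flip, recorded twice."""
    h = rng.random() < 0.5
    return h, h

rng = random.Random(4)
n = 100_000
both_heads_d = sum(all(diligent(rng)) for _ in range(n)) / n
both_heads_l = sum(all(lazy(rng)) for _ in range(n)) / n
print(both_heads_d, both_heads_l)  # near 1/4 and 1/2 respectively
```

The marginal distributions of the four variables agree, yet the joint frequencies of (heads, heads) differ, exactly as the example asserts.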
Once again there are two classes of random vectors which are particularly interesting: the
‘discrete’ and the ‘continuous’.
(9) Definition. The random variables X and Y on the probability space (Ω, F, P) are called
(jointly) discrete if the vector (X, Y ) takes values in some countable subset of R2 only. The
jointly discrete random variables X, Y have joint (probability) mass function f : R2 →
[0, 1] given by f (x, y) = P(X = x, Y = y).

(10) Definition. The random variables X and Y on the probability space (Ω, F, P) are called (jointly) continuous if their joint distribution function can be expressed as
$$F_{X,Y}(x, y) = \int_{u=-\infty}^{x} \int_{v=-\infty}^{y} f(u, v)\,dv\,du, \qquad x, y \in \mathbb{R},$$

for some integrable function f : R2 → [0, ∞) called the joint (probability) density function
of the pair (X, Y ).

We shall return to such questions in later chapters. Meanwhile here are two concrete
examples.

(11) Example. Three-sided coin. We are provided with a special three-sided coin, each
toss of which results in one of the possibilities H (heads), T (tails), E (edge), each having
probability 1/3. Let Hₙ, Tₙ, and Eₙ be the numbers of such outcomes in n tosses of the coin.
The vector (Hn , Tn , E n ) is a vector of random variables satisfying Hn + Tn + E n = n. If the
outcomes of different tosses have no influence on each other, it is not difficult to see that

$$P\bigl((H_n, T_n, E_n) = (h, t, e)\bigr) = \frac{n!}{h!\,t!\,e!}\Bigl(\frac{1}{3}\Bigr)^{n}$$

for any triple (h, t, e) of non-negative integers with sum n. The random variables Hn , Tn , E n
are (jointly) discrete and are said to have (jointly) the trinomial distribution.
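The trinomial mass function is easily computed directly. The following sketch, which is ours, evaluates it and checks that it sums to 1 over all admissible triples (h, t, e) with h + t + e = n.

```python
from math import factorial

def trinomial_pmf(h, t, e):
    """P((H_n, T_n, E_n) = (h, t, e)) for n = h + t + e fair three-sided tosses."""
    n = h + t + e
    return factorial(n) / (factorial(h) * factorial(t) * factorial(e)) * (1 / 3) ** n

n = 6
total = sum(trinomial_pmf(h, t, n - h - t)
            for h in range(n + 1) for t in range(n + 1 - h))
print(total)  # the mass function sums to 1, up to rounding error
```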

(12) Example. Darts. Returning to the flung dart of Example (2), let us assume that no
region of the dartboard is preferred unduly over any other region of equal area. It may then
be shown (see Example (2.4.3)) that

$$P(R \le r) = \frac{r^2}{\rho^2}, \qquad P(\Theta \le \theta) = \frac{\theta}{2\pi}, \qquad \text{for } 0 \le r \le \rho,\ 0 \le \theta \le 2\pi,$$

where ρ is the radius of the board, and furthermore

$$P(R \le r,\ \Theta \le \theta) = P(R \le r)P(\Theta \le \theta).$$

It follows that
$$F_{R,\Theta}(r, \theta) = \int_{u=0}^{r} \int_{v=0}^{\theta} f(u, v)\,dv\,du$$
where
$$f(u, v) = \frac{u}{\pi\rho^2}, \qquad 0 \le u \le \rho,\ 0 \le v \le 2\pi.$$
The pair (R, Θ) is (jointly) continuous.



Exercises for Section 2.5


1. A fair coin is tossed twice. Let X be the number of heads, and let W be the indicator function of
the event {X = 2}. Find P(X = x, W = w) for all appropriate values of x and w.
2. Let X be a Bernoulli random variable, so that P(X = 0) = 1 − p, P(X = 1) = p. Let Y = 1 − X
and Z = XY . Find P(X = x, Y = y) and P(X = x, Z = z) for x, y, z ∈ {0, 1}.
3. The random variables X and Y have joint distribution function
$$F_{X,Y}(x, y) = \begin{cases} 0 & \text{if } x < 0,\\[2pt] (1 - e^{-x})\Bigl(\dfrac12 + \dfrac1\pi \tan^{-1} y\Bigr) & \text{if } x \ge 0. \end{cases}$$
Show that X and Y are (jointly) continuously distributed.
4. Let X and Y have joint distribution function F. Show that
P(a < X ≤ b, c < Y ≤ d) = F(b, d) − F(a, d) − F(b, c) + F(a, c)

whenever a < b and c < d.


5. Let X, Y be discrete random variables taking values in the integers, with joint mass function f .
Show that, for integers x, y,
f (x, y) = P(X ≥ x, Y ≤ y) − P(X ≥ x + 1, Y ≤ y)
− P(X ≥ x, Y ≤ y − 1) + P(X ≥ x + 1, Y ≤ y − 1).
Hence find the joint mass function of the smallest and largest numbers shown in r rolls of a fair die.
6. Is the function F(x, y) = 1 − e−x y , 0 ≤ x, y < ∞, the joint distribution function of some pair
of random variables?

2.6 Monte Carlo simulation


It is presumably the case that the physical shape of a coin is one of the major factors relevant
to whether or not it will fall with heads uppermost. In principle, the shape of the coin may
be determined by direct examination, and hence we may arrive at an estimate for the chance
of heads. Unfortunately, such a calculation would be rather complicated, and it is easier to
estimate this chance by simulation, which is to say that we may toss the coin many times and
record the proportion of successes. Similarly, roulette players are well advised to observe
the behaviour of the wheel with care in advance of placing large bets, in order to discern
its peculiarities (unfortunately, casinos are now wary of such observation, and change their
wheels at regular intervals).
Here is a related question. Suppose that we know that our coin is fair (so that the chance of heads is 1/2 on each toss), and we wish to know the chance that a sequence of 50 tosses contains
a run of outcomes of the form HTHHT. In principle, this probability may be calculated
explicitly and exactly. If we require only an estimate of its value, then another possibility is
to simulate the experiment: toss the coin 50N times for some N, divide the result into N runs
of 50, and find the proportion of such runs which contain HTHHT.
It is not unusual in real life for a specific calculation to be possible in principle but extremely
difficult in practice, often owing to limitations on the operating speed or the size of the memory
of a computer. Simulation can provide a way around such a problem. Here are some examples.

(1) Example. Gambler’s ruin revisited. The gambler of Example (1.7.4) eventually won
his Jaguar after a long period devoted to tossing coins, and he has now decided to save up
for a yacht. His bank manager has suggested that, in order to speed things up, the stake on
each gamble should not remain constant but should vary as a certain prescribed function of
the gambler’s current fortune. The gambler would like to calculate the chance of winning the
yacht in advance of embarking on the project, but he finds himself incapable of doing so.
Fortunately, he has kept a record of the extremely long sequence of heads and tails encoun-
tered in his successful play for the Jaguar. He calculates his sequence of hypothetical fortunes
based on this information, until the point when this fortune reaches either zero or the price of
the yacht. He then starts again, and continues to repeat the procedure until he has completed
it a total of N times, say. He estimates the probability that he will actually win the yacht by
the proportion of the N calculations which result in success.
Why might this method make him overconfident? Should he retoss the coins?
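The bank manager's proposal may be explored by direct simulation with freshly generated tosses. The sketch below is illustrative only: the stake rule `bold_stake`, the initial fortune, and the target price are our own assumptions, not taken from the text.

```python
import random

def play(rng, fortune, target, stake_rule, p=0.5):
    """Gamble until ruin (fortune 0) or the target is reached; True on success."""
    while 0 < fortune < target:
        stake = stake_rule(fortune)
        if rng.random() < p:
            fortune += stake
        else:
            fortune -= stake
    return fortune >= target

def bold_stake(fortune):
    # an illustrative rule: stake half the current fortune, at least 1
    return max(1, fortune // 2)

rng = random.Random(5)
N = 10_000
wins = sum(play(rng, 10, 100, bold_stake) for _ in range(N))
print(wins / N)  # estimated chance of winning the yacht
```

Estimating the same chance by replaying one fixed recorded sequence of tosses, as the gambler proposes, reuses the same randomness in every one of the N repetitions, so the N outcomes are far from independent; fresh tosses avoid this.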

(2) Example. A dam. It is proposed to build a dam in order to regulate the water supply,
and in particular to prevent seasonal flooding downstream. How high should the dam be?
Dams are expensive to construct, and some compromise between cost and risk is necessary.
It is decided to build a dam which is just high enough to ensure that the chance of a flood
of some given extent within ten years is less than 10−2 , say. No one knows exactly how
high such a dam need be, and a young probabilist proposes the following scheme. Through
examination of existing records of rainfall and water demand we may arrive at an acceptable
model for the pattern of supply and demand. This model includes, for example, estimates for
the distributions of rainfall on successive days over long periods. With the aid of a computer,
the ‘real world’ situation is simulated many times in order to study the likely consequences
of building dams of various heights. In this way we may arrive at an accurate estimate of the
height required.

(3) Example. Integration. Let g : [0, 1] → [0, 1] be a continuous but nowhere differentiable function. How may we calculate its integral $I = \int_0^1 g(x)\,dx$? The following experimental technique is known as the ‘hit or miss Monte Carlo technique’.
Let (X, Y) be a random vector having the uniform distribution on the unit square. That is, we assume that P((X, Y) ∈ A) = |A|, the area of A, for any nice subset A of the unit square
[0, 1]2 ; we leave the assumption of niceness somewhat up in the air for the moment, and shall
return to such matters in Chapter 4. We declare (X, Y ) to be ‘successful’ if Y ≤ g(X). The
chance that (X, Y ) is successful equals I , the area under the curve y = g(x). We now repeat
this experiment a large number N of times, and calculate the proportion of times that the
experiment is successful. Following the law of averages, Theorem (2.2.1), we may use this
value as an estimate of I .
Clearly it is desirable to know the accuracy of this estimate. This is a harder problem to
which we shall return later.
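Here is a Python sketch of the hit-or-miss technique; it is ours rather than the text's, and since the text's g is arbitrary we substitute the smooth test function g(x) = x², whose integral 1/3 is known, merely to check the method.

```python
import random

def hit_or_miss(g, N, seed=0):
    """Estimate I = integral of g over [0, 1], assuming 0 <= g <= 1, by the
    proportion of uniform points (X, Y) in the unit square with Y <= g(X)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(N):
        x, y = rng.random(), rng.random()
        if y <= g(x):  # the point falls under the curve: a 'success'
            hits += 1
    return hits / N

estimate = hit_or_miss(lambda x: x * x, 200_000)
print(estimate)  # close to 1/3
```

Since each trial is a Bernoulli variable with success probability I, the standard error of the estimate is of order √(I(1 − I)/N), which quantifies the accuracy question raised above.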
Simulation is a dangerous game, and great caution is required in interpreting the results.
There are two major reasons for this. First, a computer simulation is limited by the degree
to which its so-called ‘pseudo-random number generator’ may be trusted. It has been said
for example that the summon-according-to-birthday principle of conscription to the United
States armed forces may have been marred by a pseudo-random number generator with a bias
for some numbers over others. Secondly, in estimating a given quantity, one may in some

circumstances have little or no idea how many repetitions are necessary in order to achieve an
estimate within a specified accuracy.
We have made no remark about the methods by which computers calculate ‘pseudo-random
numbers’. Needless to say they do not flip coins, but rely instead on operations of sufficient
numerical complexity that the outcome, although deterministic, is apparently unpredictable
except by an exact repetition of the calculation.
These techniques were named in honour of Monte Carlo by Metropolis, von Neumann, and
Ulam, while they were involved in the process of building bombs at Los Alamos in the 1940s.

2.7 Problems
1. Each toss of a coin results in a head with probability p. The coin is tossed until the first head
appears. Let X be the total number of tosses. What is P(X > m)? Find the distribution function of
the random variable X.
2. (a) Show that any discrete random variable may be written as a linear combination of indicator
variables.
(b) Show that any random variable may be expressed as the limit of an increasing sequence of discrete
random variables.
(c) Show that the limit of any increasing convergent sequence of random variables is a random
variable.
3. (a) Show that, if X and Y are random variables on a probability space (, F, P), then so are
X + Y , XY , and min{X, Y }.
(b) Show that the set of all random variables on a given probability space (, F, P) constitutes a
vector space over the reals. If  is finite, write down a basis for this space.
4. Let X have distribution function
$$F(x) = \begin{cases} 0 & \text{if } x < 0,\\ \tfrac12 x & \text{if } 0 \le x \le 2,\\ 1 & \text{if } x > 2, \end{cases}$$

and let $Y = X^2$. Find
(a) $P\bigl(\tfrac12 \le X \le \tfrac32\bigr)$, (b) $P(1 \le X < 2)$,
(c) $P(Y \le X)$, (d) $P(X \le 2Y)$,
(e) $P\bigl(X + Y \le \tfrac34\bigr)$, (f) the distribution function of $Z = \sqrt{X}$.
5. Let X have distribution function
$$F(x) = \begin{cases} 0 & \text{if } x < -1,\\ 1 - p & \text{if } -1 \le x < 0,\\ 1 - p + \tfrac12 xp & \text{if } 0 \le x \le 2,\\ 1 & \text{if } x > 2. \end{cases}$$

Sketch this function, and find: (a) P(X = −1), (b) P(X = 0), (c) P(X ≥ 1).

6. Buses arrive at ten minute intervals starting at noon. A man arrives at the bus stop a random
number X minutes after noon, where X has distribution function
$$P(X \le x) = \begin{cases} 0 & \text{if } x < 0,\\ x/60 & \text{if } 0 \le x \le 60,\\ 1 & \text{if } x > 60. \end{cases}$$

What is the probability that he waits less than five minutes for a bus?
7. Airlines find that each passenger who reserves a seat fails to turn up with probability 1/10 independently of the other passengers. EasyPeasy Airlines always sell 10 tickets for their 9 seat aeroplane
while RyeLoaf Airways always sell 20 tickets for their 18 seat aeroplane. Which is more often
over-booked?
8. A fairground performer claims the power of telekinesis. The crowd throws coins and he wills
them to fall heads up. He succeeds five times out of six. What chance would he have of doing at least
as well if he had no supernatural powers?
9. Express the distribution functions of

X + = max{0, X}, X − = − min{0, X}, |X| = X + + X − , −X,

in terms of the distribution function F of the random variable X.


10. Show that FX (x) is continuous at x = x0 if and only if P(X = x0 ) = 0.
11. The real number m is called a median of the distribution function F whenever $\lim_{y \uparrow m} F(y) \le \frac12 \le F(m)$.
(a) Show that every distribution function F has at least one median, and that the set of medians of F
is a closed interval of R.
(b) Show, if F is continuous, that F(m) = 1/2 for any median m.
12. Loaded dice.
(a) Show that it is not possible to weight two dice in such a way that the sum of the two numbers
shown by these loaded dice is equally likely to take any value between 2 and 12 (inclusive).
(b) Given a fair die and a loaded die, show that the sum of their scores, modulo 6, has the same
distribution as a fair die, irrespective of the loading.
13. A function d : S × S → R is called a metric on S if:
(i) d(s, t) = d(t, s) ≥ 0 for all s, t ∈ S,
(ii) d(s, t) = 0 if and only if s = t, and
(iii) d(s, t) ≤ d(s, u) + d(u, t) for all s, t, u ∈ S.
(a) Lévy metric. Let F and G be distribution functions and define the Lévy metric
$$d_L(F, G) = \inf\bigl\{\epsilon > 0 : G(x - \epsilon) - \epsilon \le F(x) \le G(x + \epsilon) + \epsilon \text{ for all } x\bigr\}.$$

Show that dL is indeed a metric on the space of distribution functions.


(b) Total variation distance. Let X and Y be integer-valued random variables, and let
$$d_{TV}(X, Y) = \sum_{k} \bigl| P(X = k) - P(Y = k) \bigr|.$$

Show that dTV satisfies (i) and (iii) with S the space of integer-valued random variables, and that
dTV (X, Y ) = 0 if and only if X and Y have the same distribution. Thus dTV is a metric on the space
of equivalence classes of S with equivalence relation given by X ∼ Y if X and Y have the same
distribution. We call dTV the total variation distance.

Show that
$$d_{TV}(X, Y) = 2 \sup_{A \subseteq \mathbb{Z}} \bigl| P(X \in A) - P(Y \in A) \bigr|.$$

14. Ascertain in the following cases whether or not F is the joint distribution function of some pair
(X, Y ) of random variables. If your conclusion is affirmative, find the distribution functions of X and
Y separately.
(a) $F(x, y) = \begin{cases} 1 - e^{-x-y} & \text{if } x, y \ge 0,\\ 0 & \text{otherwise}. \end{cases}$
(b) $F(x, y) = \begin{cases} 1 - e^{-x} - xe^{-y} & \text{if } 0 \le x \le y,\\ 1 - e^{-y} - ye^{-y} & \text{if } 0 \le y \le x,\\ 0 & \text{otherwise}. \end{cases}$

15. It is required to place in order n books B1 , B2 , . . . , Bn on a library shelf in such a way that readers
searching from left to right waste as little time as possible on average. Assuming that each reader
requires book Bi with probability pi , find the ordering of the books which minimizes P(T ≥ k) for
all k, where T is the (random) number of titles examined by a reader before discovery of the required
book.
16. Transitive coins. Three coins each show heads with probability 3/5 and tails otherwise. The first
counts 10 points for a head and 2 for a tail, the second counts 4 points for both head and tail, and the
third counts 3 points for a head and 20 for a tail.
You and your opponent each choose a coin; you cannot choose the same coin. Each of you tosses
your coin and the person with the larger score wins £1010 . Would you prefer to be the first to pick a
coin or the second?
17. Before the development of radar, inertial navigation, and GPS, flying to isolated islands (for
example, from Los Angeles to Hawaii) was somewhat ‘hit or miss’. In heavy cloud or at night it was
necessary to fly by dead reckoning, and then to search the surface. With the aid of a radio, the pilot
had a good idea of the correct great circle along which to search, but could not be sure which of the
two directions along this great circle was correct (since a strong tailwind could have carried the plane
over its target). When you are the pilot, you calculate that you can make n searches before your plane
will run out of fuel. On each search you will discover the island with probability p (if it is indeed in
the direction of the search) independently of the results of other searches; you estimate initially that
there is probability α that the island is ahead of you. What policy should you adopt in deciding the
directions of your various searches in order to maximize the probability of locating the island?
18. Eight pawns are placed randomly on a chessboard, no more than one to a square. What is the
probability that:
(a) they are in a straight line (do not forget the diagonals)?
(b) no two are in the same row or column?
19. Which of the following are distribution functions? For those that are, give the corresponding
density function f.
(a) F(x) = 1 − e^{−x²} if x ≥ 0, and F(x) = 0 otherwise.
(b) F(x) = e^{−1/x} if x > 0, and F(x) = 0 otherwise.
(c) F(x) = e^x/(e^x + e^{−x}), x ∈ R.
(d) F(x) = e^{−x²} + e^x/(e^x + e^{−x}), x ∈ R.
20. (a) If U and V are jointly continuous, show that P(U = V ) = 0.
(b) Let X be uniformly distributed on (0, 1), and let Y = X. Then X and Y are continuous, and
P(X = Y ) = 1. Is there a contradiction here?
21. Continued fractions. Let X be uniformly distributed on the interval [0, 1], and express X as a
continued fraction thus:

    X = 1/(Y1 + 1/(Y2 + 1/(Y3 + · · · ))).

Show that the joint mass function of Y1 and Y2 is

    f(u, v) = 1/((uv + 1)(uv + u + 1)),   u, v = 1, 2, . . . .
22. Let V be a vector space of dimension n over a finite field F with q elements. Let X1, X2, . . . , Xm
be independent random variables, each uniformly distributed on V.
(a) Let ai ∈ F, i = 1, 2, . . . , m, be not all zero. Show that the linear combination Z = Σ_i ai Xi is
uniformly distributed on V.
(b) Let pm be the probability that X1, X2, . . . , Xm are linearly dependent. Show that, if m ≤ n + 1,

    q^{−(n−m+1)} ≤ pm ≤ q^{−(n−m)},   m = 1, 2, . . . , n + 1.
23. Modes. A random variable X with distribution function F is said to be unimodal† about a mode
M if F is convex on (−∞, M) and concave on (M, ∞). Show that, if F is unimodal about M, then
the following hold.
(a) F is absolutely continuous, except possibly for an atom at M.
(b) If F is differentiable, then it has a density that is non-decreasing on (−∞, M) and non-increasing
on (M, ∞), and furthermore, the set of modes of F is a closed bounded interval. [Cf. Problem
(2.7.11).]
(c) If the distribution functions F and G are unimodal about the same mode M, then a F + (1 − a)G
is unimodal about M for any 0 < a < 1.

†It is a source of potential confusion that the word ‘mode’ is used in several contexts. A function is sometimes
said to be unimodal if it has a unique maximum. The word mode is also used for the value(s) of x at which a
mass function (or density function) f (x) is maximized, and even on occasion the locations of its local maxima.
3
Discrete random variables

Summary. The distribution of a discrete random variable may be specified via


its probability mass function. The key notion of independence for discrete
random variables is introduced. The concept of expectation, or mean value,
is defined for discrete variables, leading to a definition of the variance and the
moments of a discrete random variable. Joint distributions, conditional distri-
butions, and conditional expectation are introduced, together with the ideas of
covariance and correlation. The Cauchy–Schwarz inequality is presented. The
analysis of sums of random variables leads to the convolution formula for mass
functions. Random walks are studied in some depth, including the reflection
principle, the ballot theorem, the hitting time theorem, and the arc sine laws
for visits to the origin and for sojourn times.

3.1 Probability mass functions


Recall that a random variable X is discrete if it takes values only in some countable set
{x 1 , x 2 , . . . }. Its distribution function F(x) = P(X ≤ x) is a jump function; just as important
as its distribution function is its mass function.
(1) Definition. The (probability) mass function† of a discrete random variable X is the
function f : R → [0, 1] given by f(x) = P(X = x).

The distribution and mass functions are related by

    F(x) = Σ_{i : xi ≤ x} f(xi),    f(x) = F(x) − lim_{y↑x} F(y).
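These two relations are easy to see in action on a small example. In the Python sketch below the mass function is arbitrary (chosen only for illustration): summing f over the atoms up to x yields F(x), and the jump of F at each atom recovers f.

```python
from fractions import Fraction

# An arbitrary mass function with atoms at 1, 2, and 4.
f = {1: Fraction(1, 2), 2: Fraction(1, 3), 4: Fraction(1, 6)}

def F(x):
    """Distribution function: sum of f(x_i) over atoms x_i <= x."""
    return sum(p for xi, p in f.items() if xi <= x)

# The jump of F at an atom x equals f(x) = F(x) - lim_{y↑x} F(y).
# Atoms here are at least 1 apart, so a small step below x suffices.
eps = Fraction(1, 10**9)
jumps = {xi: F(xi) - F(xi - eps) for xi in f}
print(jumps == f)  # True
```

F is a jump function: constant between atoms, with a jump of size f(x) at each atom x.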

(2) Lemma. The probability mass function f : R → [0, 1] satisfies:
(a) the set of x such that f(x) ≠ 0 is countable,
(b) Σ_i f(xi) = 1, where x1, x2, . . . are the values of x such that f(x) ≠ 0.

Proof. The proof is obvious. 


This lemma characterizes probability mass functions.

†Some refer loosely to the mass function of X as its distribution.



(3) Example. Binomial distribution. A coin is tossed n times, and a head turns up each time
with probability p (= 1 − q). Then Ω = {H, T}^n. The total number X of heads takes values
in the set {0, 1, 2, . . . , n} and is a discrete random variable. Its probability mass function
f(x) = P(X = x) satisfies

    f(x) = 0   if x ∉ {0, 1, 2, . . . , n}.

Let 0 ≤ k ≤ n, and consider f(k). Exactly C(n, k) points in Ω give a total of k heads, where
C(n, k) = n!/(k! (n − k)!) is the binomial coefficient; each of these points occurs with
probability p^k q^{n−k}, and so

    f(k) = C(n, k) p^k q^{n−k}   if 0 ≤ k ≤ n.

The random variable X is said to have the binomial distribution with parameters n and p,
written bin(n, p). It is the sum X = Y1 + Y2 + · · · + Yn of n Bernoulli variables (see Example
(2.1.8)).
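The counting argument behind the binomial mass function can be checked by brute force for a small n. This Python sketch (the parameters n and p are arbitrary) enumerates the sample space {H, T}^n, accumulates the probability of each head count, and compares with C(n, k) p^k q^(n−k).

```python
from itertools import product
from math import comb

n, p = 4, 0.3
q = 1 - p

# Accumulate the probability of each total number of heads over Ω = {H, T}^n.
by_count = [0.0] * (n + 1)
for omega in product("HT", repeat=n):
    k = omega.count("H")
    by_count[k] += p**k * q**(n - k)   # probability of this particular sequence

pmf = [comb(n, k) * p**k * q**(n - k) for k in range(n + 1)]
ok = all(abs(a - b) < 1e-12 for a, b in zip(by_count, pmf))
print(ok, abs(sum(pmf) - 1.0) < 1e-12)  # True True
```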

(4) Example. Poisson distribution. If a random variable X takes values in the set {0, 1, 2, . . . }
with mass function

    f(k) = (λ^k / k!) e^{−λ},   k = 0, 1, 2, . . . ,

where λ > 0, then X is said to have the Poisson distribution with parameter λ.
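As a quick sanity check on the definition, the masses e^{−λ} λ^k/k! sum to 1, by the Taylor series of e^λ. A short sketch (λ is arbitrary; the sum is truncated where the tail is negligible):

```python
from math import exp, factorial

lam = 2.5
pmf = [exp(-lam) * lam**k / factorial(k) for k in range(60)]
# For this λ the tail beyond k = 60 is far below floating-point precision.
print(abs(sum(pmf) - 1.0) < 1e-12)  # True
```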

Exercises for Section 3.1


1. For what values of the constant C do the following define mass functions on the positive integers
1, 2, . . . ?
(a) Geometric: f(x) = C 2^{−x}.
(b) Logarithmic: f(x) = C 2^{−x}/x.
(c) Inverse square: f(x) = C x^{−2}.
(d) ‘Modified’ Poisson: f(x) = C 2^x/x!.
2. For a random variable X having (in turn) each of the four mass functions of Exercise (3.1.1), find:
(i) P(X > 1),
(ii) the most probable value of X,
(iii) the probability that X is even.
3. We toss n coins, and each one shows heads with probability p, independently of each of the
others. Each coin which shows heads is tossed again. What is the mass function of the number of
heads resulting from the second round of tosses?
4. Let Sk be the set of positive integers whose base-10 expansion contains exactly k elements (so
that, for example, 1024 ∈ S4 ). A fair coin is tossed until the first head appears, and we write T for
the number of tosses required. We pick a random element, N say, from ST , each such element having
equal probability. What is the mass function of N ?
5. Log-convexity. (a) Show that, if X is a binomial or Poisson random variable, then the mass
function f(k) = P(X = k) has the property that f(k − 1) f(k + 1) ≤ f(k)².
(b) Show that, if f(k) = 90/(πk)^4, k ≥ 1, then f(k − 1) f(k + 1) ≥ f(k)².
(c) Find a mass function f such that f(k)² = f(k − 1) f(k + 1), k ≥ 1.

3.2 Independence
Remember that events A and B are called ‘independent’ if the occurrence of A does not
change the subsequent probability of B occurring. More rigorously, A and B are independent
if and only if P(A ∩ B) = P(A)P(B). Similarly, we say that discrete variables X and Y are
‘independent’ if the numerical value of X does not affect the distribution of Y . With this in
mind we make the following definition.
(1) Definition. Discrete variables X and Y are independent if the events {X = x} and
{Y = y} are independent for all x and y.

Suppose X takes values in the set {x1, x2, . . . } and Y takes values in the set {y1, y2, . . . }.
Let

    Ai = {X = xi},   Bj = {Y = yj}.

Notice (see Problem (2.7.2)) that X and Y are linear combinations of the indicator variables
I_{Ai}, I_{Bj}, in that

    X = Σ_i xi I_{Ai}   and   Y = Σ_j yj I_{Bj}.

The random variables X and Y are independent if and only if Ai and B j are independent for
all pairs i, j . A similar definition holds for collections {X 1 , X 2 , . . . , X n } of discrete variables.
(Recall Definition (1.5.1).)
(2) Example. Poisson flips. A coin is tossed once and heads turns up with probability
p = 1 − q. Let X and Y be the numbers of heads and tails respectively. It is no surprise that
X and Y are not independent. After all,

    P(X = Y = 1) = 0,   P(X = 1)P(Y = 1) = p(1 − p).

Suppose now that the coin is tossed a random number N of times, where N has the Poisson
distribution with parameter λ. It is a remarkable fact that the resulting numbers X and Y of
heads and tails are independent, since

    P(X = x, Y = y) = P(X = x, Y = y | N = x + y) P(N = x + y)
                    = C(x + y, x) p^x q^y · (λ^{x+y}/(x + y)!) e^{−λ}
                    = ((λp)^x/x!) ((λq)^y/y!) e^{−λ}.

However, by Lemma (1.4.4),

    P(X = x) = Σ_{n≥x} P(X = x | N = n) P(N = n)
             = Σ_{n≥x} C(n, x) p^x q^{n−x} (λ^n/n!) e^{−λ} = ((λp)^x/x!) e^{−λp};

a similar result holds for Y, and so

    P(X = x, Y = y) = P(X = x) P(Y = y).
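The factorization in this example can be verified numerically. The sketch below (λ and p are arbitrary) computes the joint mass of (X, Y) by conditioning on N = x + y, and compares it with the product of the two claimed Poisson marginals, Poisson(λp) for heads and Poisson(λq) for tails.

```python
from math import exp, factorial, comb

lam, p = 3.0, 0.4
q = 1 - p

def joint(x, y):
    """P(X = x, Y = y), computed by conditioning on N = x + y."""
    n = x + y
    p_n = exp(-lam) * lam**n / factorial(n)   # P(N = n)
    return comb(n, x) * p**x * q**y * p_n

def poisson(mu, k):
    """Poisson(mu) mass function."""
    return exp(-mu) * mu**k / factorial(k)

ok = all(abs(joint(x, y) - poisson(lam * p, x) * poisson(lam * q, y)) < 1e-12
         for x in range(12) for y in range(12))
print(ok)  # True
```

The key identity used here is e^{−λp} e^{−λq} = e^{−λ}, which is exactly why the joint mass splits into a product.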



If X is a random variable and g : R → R, then Z = g(X), defined by Z (ω) = g(X (ω)),


is a random variable also. We shall need the following.
(3) Theorem. If X and Y are independent and g, h : R → R, then g(X) and h(Y ) are
independent also.
Proof. Exercise. See Problem (3.11.1). 
More generally, we say that a family {Xi : i ∈ I} of (discrete) random variables is
independent if the events {Xi = xi}, i ∈ I, are independent for all possible choices of the set
{xi : i ∈ I} of the values of the Xi. That is to say, {Xi : i ∈ I} is an independent family if
and only if

    P(Xi = xi for all i ∈ J) = Π_{i∈J} P(Xi = xi)

for all sets {xi : i ∈ I} and for all finite subsets J of I. The conditional independence
of a family of random variables, given an event C, is defined similarly to the conditional
independence of events; see equation (1.5.5).
Independent families of random variables are very much easier to study than dependent
families, as we shall see soon. Note that pairwise-independent families are not necessarily
independent.

Exercises for Section 3.2


1. Let X and Y be independent random variables, each taking the values −1 or 1 with probability
1/2, and let Z = XY. Show that X, Y, and Z are pairwise independent. Are they independent?
2. Let X and Y be independent random variables taking values in the positive integers and having
the same mass function f(x) = 2^{−x} for x = 1, 2, . . . . Find:
(a) P(min{X, Y } ≤ x), (b) P(Y > X),
(c) P(X = Y ), (d) P(X ≥ kY ), for a given positive integer k,
(e) P(X divides Y ), (f) P(X = r Y ), for a given positive rational r .
3. Let X1, X2, X3 be independent random variables taking values in the positive integers and having
mass functions given by P(Xi = x) = (1 − pi) pi^{x−1} for x = 1, 2, . . . , and i = 1, 2, 3.
(a) Show that

    P(X1 < X2 < X3) = (1 − p1)(1 − p2) p2 p3² / ((1 − p2 p3)(1 − p1 p2 p3)).

(b) Find P(X1 ≤ X2 ≤ X3).
4. Three players, A, B, and C, take turns to roll a die; they do this in the order ABCABCA. . . .
(a) Show that the probability that, of the three players, A is the first to throw a 6, B the second, and
C the third, is 216/1001.
(b) Show that the probability that the first 6 to appear is thrown by A, the second 6 to appear is thrown
by B, and the third 6 to appear is thrown by C, is 46656/753571.
5. Let Xr, 1 ≤ r ≤ n, be independent random variables which are symmetric about 0; that is,
Xr and −Xr have the same distributions. Show that, for all x, P(Sn ≥ x) = P(Sn ≤ −x), where
Sn = Σ_{r=1}^{n} Xr.
Is the conclusion necessarily true without the assumption of independence?

3.3 Expectation
Let x1, x2, . . . , xN be the numerical outcomes of N repetitions of some experiment. The
average of these outcomes is

    m = (1/N) Σ_i xi.

In advance of performing these experiments we can represent their outcomes by a sequence


X 1 , X 2 , . . . , X N of random variables, and we shall suppose that these variables are discrete
with a common mass function f . Then, roughly speaking (see the beginning of Section 1.3),
for each possible value x, about N f(x) of the Xi will take that value x. Therefore, for large
N, the average m is approximately

    m ≃ (1/N) Σ_x x N f(x) = Σ_x x f(x)

where the summation here is over all possible values of the X i . This average is called the
‘expectation’ or ‘mean value’ of the underlying distribution with mass function f .
(1) Definition. The mean value, or expectation, or expected value of the random variable
X with mass function f is defined to be
    E(X) = Σ_{x : f(x)>0} x f(x)

whenever this sum is absolutely convergent.

We require absolute convergence in order that E(X) be unchanged by reordering the xi. We
can, for notational convenience, write E(X) = Σ_x x f(x). This appears to be an uncountable
sum; however, all but countably many of its contributions are zero. If the numbers f(x) are
regarded as masses f(x) at points x then E(X) is just the position of the centre of gravity; we
can speak of X as having an ‘atom’ or ‘point mass’ of size f(x) at x. We sometimes omit the
parentheses and simply write EX.

(2) Example (2.1.5) revisited. The random variables X and W of this example have mean
values

    E(X) = Σ_x x P(X = x) = 0 · 1/4 + 1 · 1/2 + 2 · 1/4 = 1,
    E(W) = Σ_x x P(W = x) = 0 · 3/4 + 4 · 1/4 = 1.

If X is a random variable and g : R → R, then Y = g(X), given formally by Y (ω) =


g(X (ω)), is a random variable also. To calculate its expectation we need first to find its
probability mass function f Y . This process can be complicated, and it is avoided by the
following lemma (called by some the ‘law of the unconscious statistician’!).

(3) Lemma. Change of variable formula. If X has mass function f and g : R → R, then

    E(g(X)) = Σ_x g(x) f(x)

whenever this sum is absolutely convergent.

Proof. This is Problem (3.11.3). 

(4) Example. Suppose that X takes values −2, −1, 1, 3 with probabilities 1/4, 1/8, 1/4, 3/8 respec-
tively. The random variable Y = X² takes values 1, 4, 9 with probabilities 3/8, 1/4, 3/8 respectively,
and so

    E(Y) = Σ_x x P(Y = x) = 1 · 3/8 + 4 · 1/4 + 9 · 3/8 = 19/4.

Alternatively, use the law of the unconscious statistician to find that

    E(Y) = E(X²) = Σ_x x² P(X = x) = 4 · 1/4 + 1 · 1/8 + 1 · 1/4 + 9 · 3/8 = 19/4.
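Both routes of this example are easy to mechanize. In the Python sketch below, exact arithmetic with `Fraction` avoids rounding, and the dictionaries are simply the mass functions of the example.

```python
from fractions import Fraction as F

# Mass function of X from Example (4).
pmf_X = {-2: F(1, 4), -1: F(1, 8), 1: F(1, 4), 3: F(3, 8)}

# Route 1: derive the mass function of Y = X^2 first, then average.
pmf_Y = {}
for x, px in pmf_X.items():
    pmf_Y[x * x] = pmf_Y.get(x * x, F(0)) + px
EY_direct = sum(y * py for y, py in pmf_Y.items())

# Route 2: the 'law of the unconscious statistician': sum x^2 f(x) over x.
EY_lotus = sum(x * x * px for x, px in pmf_X.items())

print(EY_direct, EY_lotus)  # 19/4 19/4
```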

Lemma (3) provides a method for calculating the ‘moments’ of a distribution; these are
defined as follows.
(5) Definition. If k is a positive integer, the kth moment m k of X is defined to be m k = E(X k ).
The kth central moment σk is σk = E((X − m 1 )k ).

The two moments of most use are m1 = E(X) and σ2 = E((X − EX)²), called the mean (or
expectation) and variance of X. These two quantities are measures of the mean and dispersion
of X; that is, m1 is the average value of X, and σ2 measures the amount by which X tends to
deviate from this average. The mean m1 is often denoted µ, and the variance of X is often
denoted var(X). The positive square root σ = √var(X) is called the standard deviation, and
in this notation σ2 = σ². The central moments {σi} can be expressed in terms of the ordinary
moments {mi}. For example, σ1 = 0 and

    σ2 = Σ_x (x − m1)² f(x)
       = Σ_x x² f(x) − 2m1 Σ_x x f(x) + m1² Σ_x f(x)
       = m2 − m1²,

which may be written as

    var(X) = E((X − EX)²) = E(X²) − (EX)².
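The identity var(X) = E(X²) − (EX)² is worth checking on the variable X of Example (2.1.5), whose masses 1/4, 1/2, 1/4 at 0, 1, 2 were used above. A sketch in exact arithmetic:

```python
from fractions import Fraction as F

pmf = {0: F(1, 4), 1: F(1, 2), 2: F(1, 4)}   # X of Example (2.1.5)
m1 = sum(x * p for x, p in pmf.items())       # E(X) = 1

var_def = sum((x - m1)**2 * p for x, p in pmf.items())     # E((X - EX)^2)
var_alt = sum(x * x * p for x, p in pmf.items()) - m1**2   # E(X^2) - (EX)^2
print(var_def == var_alt, var_def)  # True 1/2
```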

Remark. Experience with student calculations of variances causes us to stress the following
elementary fact: variances cannot be negative. We sometimes omit the parentheses and write
simply var X. The expression E(X)² means (E(X))² and must not be confused with E(X²).
3.3 Expectation 55

(6) Example. Bernoulli variables. Let X be a Bernoulli variable, taking the value 1 with
probability p (= 1 − q). Then

    E(X) = Σ_x x f(x) = 0 · q + 1 · p = p,
    E(X²) = Σ_x x² f(x) = 0 · q + 1 · p = p,
    var(X) = E(X²) − E(X)² = pq.

Thus the indicator variable I_A has expectation P(A) and variance P(A)P(A^c).

(7) Example. Binomial variables. Let X be bin(n, p). Then

    E(X) = Σ_{k=0}^{n} k f(k) = Σ_{k=0}^{n} k C(n, k) p^k q^{n−k}.

To calculate this, differentiate the identity

    Σ_{k=0}^{n} C(n, k) x^k = (1 + x)^n,

multiply by x to obtain

    Σ_{k=0}^{n} k C(n, k) x^k = nx(1 + x)^{n−1},

and substitute x = p/q to obtain E(X) = np. A similar argument shows that the variance of
X is given by var(X) = npq, although it is faster to use the forthcoming Theorem (11).
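The results E(X) = np and var(X) = npq can also be confirmed numerically by summing the mass function directly (the values of n and p below are arbitrary):

```python
from math import comb

n, p = 12, 0.35
q = 1 - p

# The bin(n, p) mass function, term by term.
f = [comb(n, k) * p**k * q**(n - k) for k in range(n + 1)]
mean = sum(k * f[k] for k in range(n + 1))
var  = sum(k * k * f[k] for k in range(n + 1)) - mean**2

print(abs(mean - n * p) < 1e-9, abs(var - n * p * q) < 1e-9)  # True True
```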
We can think of the process of calculating expectations as a linear operator on the space of
random variables.
(8) Theorem. The expectation operator E has the following properties:
(a) if X ≥ 0 then E(X) ≥ 0,
(b) if a, b ∈ R then E(a X + bY ) = a E(X) + bE(Y ),
(c) the random variable 1, taking the value 1 always, has expectation E(1) = 1.

Proof. (a) and (c) are obvious.


(b) Let Ax = {X = x}, By = {Y = y}. Then

    aX + bY = Σ_{x,y} (ax + by) I_{Ax ∩ By},

and the solution of the first part of Problem (3.11.3) shows that

    E(aX + bY) = Σ_{x,y} (ax + by) P(Ax ∩ By).

However,

    Σ_y P(Ax ∩ By) = P(Ax ∩ ∪_y By) = P(Ax ∩ Ω) = P(Ax),

and similarly Σ_x P(Ax ∩ By) = P(By), which gives

    E(aX + bY) = Σ_x Σ_y ax P(Ax ∩ By) + Σ_y Σ_x by P(Ax ∩ By)
               = a Σ_x x P(Ax) + b Σ_y y P(By)
               = a E(X) + b E(Y). □

Remark. It is not in general true that E(X Y ) is the same as E(X)E(Y ).

(9) Lemma. If X and Y are independent then E(X Y ) = E(X)E(Y ).

Proof. Let Ax and By be as in the proof of (8). Then

    XY = Σ_{x,y} xy I_{Ax ∩ By},

and so

    E(XY) = Σ_{x,y} xy P(Ax) P(By)   by independence
          = (Σ_x x P(Ax)) (Σ_y y P(By)) = E(X) E(Y). □

(10) Definition. X and Y are called uncorrelated if E(X Y ) = E(X)E(Y ).


Lemma (9) asserts that independent variables are uncorrelated. The converse is not true,
as Problem (3.11.16) indicates.
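A standard example of uncorrelated but dependent variables (in the spirit of, though not necessarily identical to, Problem (3.11.16)) takes X uniform on {−1, 0, 1} and Y = X². A sketch in exact arithmetic:

```python
from fractions import Fraction as F

pmf_X = {-1: F(1, 3), 0: F(1, 3), 1: F(1, 3)}   # X uniform on {-1, 0, 1}
# With Y = X^2 we have XY = X^3, so E(XY) = E(X^3).
EX  = sum(x * p for x, p in pmf_X.items())        # 0
EY  = sum(x * x * p for x, p in pmf_X.items())    # 2/3
EXY = sum(x**3 * p for x, p in pmf_X.items())     # 0

print(EXY == EX * EY)  # True: X and Y are uncorrelated
# Yet they are dependent: P(X = 1, Y = 0) = 0, whereas
# P(X = 1) P(Y = 0) = (1/3)(1/3) = 1/9.
```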
(11) Theorem. For random variables X and Y,
(a) var(aX) = a² var(X) for a ∈ R,
(b) var(X + Y) = var(X) + var(Y) if X and Y are uncorrelated.
Proof. (a) Using the linearity of E,

    var(aX) = E((aX − E(aX))²) = E(a²(X − EX)²)
            = a² E((X − EX)²) = a² var(X).

(b) We have when X and Y are uncorrelated that

    var(X + Y) = E((X + Y − E(X + Y))²)
               = E((X − EX)² + 2(X − EX)(Y − EY) + (Y − EY)²)
               = var(X) + 2(E(XY) − E(X)E(Y)) + var(Y)
               = var(X) + var(Y). □
