Lecture Note
Kayvan Sadeghi
September 2024
Contents
Outline
1.4.3 Covariance
1.4.4 Correlation
2 Transformation of Variables
3 Generating Functions
3.2.1 Definition
4.1 Motivation
4.3.1 The distribution of Σ (Xi − X̄)²
5 Statistical Estimation
5.1 Overview
5.2.1 Terminology
5.3.1 Overview
6.1 Introduction
Lecturer:
Kayvan Sadeghi
Aims of course
To continue the study of probability and statistics beyond the basic concepts introduced in
previous courses (see prerequisites below). To provide further study of probability theory, in
particular as it relates to multivariate distributions, and to introduce some formal concepts
and methods in statistical estimation.
Objectives of course
Application areas
As with other core modules in probability and statistics, the material in this course has
applications in almost every field of quantitative investigation; the course expands on material from earlier courses.
Prerequisites
Lectures
Exercise Sheets
There will be ten weekly exercise sheets in total, numbered 0 to 9. Small group tutorials related to the exercise sheets will be held weekly: the tutorial for Sheet 0 takes place in the second week of term and the tutorial for Sheet 8 in the last week of term. Exercise Sheet 9 has no tutorial; full solutions to it will therefore be provided on Moodle.
Four of these exercise sheets are assessed: Exercise Sheets 2, 4, 6, and 8. Together they make up a 25% in-course assessed component. The best three marks out of the four assessed sheets count. Your answers should be handed in online via the Turnitin facility on the STAT0005 Moodle page by the deadline stated on the exercise sheet. You may submit an entirely word-processed document (which is a lot of work) or a scan or photograph of hand-written work, provided the scans/photographs are clearly legible, they are submitted as a single file (e.g. a pdf), and the exercise sheet does not stipulate otherwise.
If you are unable to meet the in-course assessment submission deadline for reasons outside
your control, for example illness or bereavement, you must submit a claim for extenuating
circumstances, normally within a week of the deadline. Your home department will advise
you of the appropriate procedures. For Statistical Science students, the relevant information
is on the DOSSSH Moodle page.
Your exercises will be handed back to you, and solutions as well as common mistakes will be discussed in your small group tutorial.
Discussion Forum
There is a general discussion forum on Moodle, in which you are encouraged to take part.
You can post any questions regarding the course content, including the exercise questions, and you may do so anonymously to your peers. You must not give away any of the answers
to the exercise sheet questions before the deadline has passed, however. (Staff can break
the anonymity if required, so please do not post any inappropriate content.) You are also
encouraged to answer other students’ questions on the forum, again as long as you do not
give away any of the answers to the exercise sheets before the deadline has passed. I do not
generally discuss mathematics by email. Please ask questions in the Moodle discussion forum
instead.
Summer Exam
A written closed-book examination paper in term 3 will make up 75% of the total mark for
the module. All questions need to be answered; past papers are available on Moodle. The
final mark will be a 75% to 25% weighted average of the written examination, and the four
assessed exercise sheets, respectively.
If you do not attend small group tutorials (which are compulsory), then you will be asked to
discuss your progress with the Departmental Tutor. In an extreme case of non-participation
in tutorials, you may be banned from taking the summer exam for the course, which means
that you will be classified as ‘not complete’ for the course (in practice this means that you
will fail the course).
Feedback
Feedback in this course will be given mainly through two channels: written feedback on your
weekly exercise sheet and discussion of the exercise sheet, in particular of common mistakes,
in your small group tutorial. You can also come to the office hours to discuss any questions
you may have on the course material in greater detail.
Texts
The following texts are a small selection of the many good books available on this material.
They are recommended as being especially useful and relevant to this course. The first book
listed is particularly recommended. It includes large numbers of sensible worked examples
and exercises (with answers to selected exercises) and also covers material on data analysis
that will be useful for other statistics courses. Books marked ‘*’ are slightly more theoretical
and cover more details than given in the lectures. Overall, from past experience, the lecture
script contains all the relevant material and there are plenty of examples in the lecture notes,
homework sheets and past exams so that you should not need to use a book if you do not
want to.
Joint probability distributions (or multivariate distributions) describe the joint be-
haviour of two or more random variables. Before introducing this new concept we will revise
the basic notions related to the distribution of only one random variable.
The fundamental idea of probability is that chance can be measured on a scale which runs
from zero, which represents impossibility, to one, which represents certainty.
Event Space, A: The family of all events A whose probability we may be interested in. A
is a family of sets, so e.g. the events A1 ⊆ Ω and A2 ⊆ Ω may be contained in it: A1 ∈ A,
A2 ∈ A. The event space always contains Ω, i.e Ω ∈ A.
Probability measure, P : a mapping from the event space to [0, 1]. To qualify as a
probability measure P must satisfy the following axioms of probability:
1. P (A) ≥ 0 for every A ∈ A;
2. P (Ω) = 1;
3. P (∪i Ai ) = Σi P (Ai ) for any sequence of pairwise disjoint events A1 , A2 , . . . ∈ A.
If Ω is uncountable, like the real numbers, we have to define a ‘suitable’ family of subsets,
i.e. the event space A does not contain all subsets of Ω. However, in practice the event
space can always be constructed to include all events of interest.
From the axioms of probability one can mathematically prove the addition rule:
P (A ∪ B) = P (A) + P (B) − P (A ∩ B)
The events A1 , . . . , An are said to be (mutually) independent if P (Ai1 ∩ Ai2 ∩ · · · ∩ Aik ) = P (Ai1 )P (Ai2 ) · · · P (Aik ) for all possible choices of k and 1 ≤ i1 < i2 < · · · < ik ≤ n. That is, the product rule must hold for every subclass of the events A1 , . . . , An .
Note: In some contexts this would be called mutual independence. Whenever we speak of
independence of more than two events or random variables, in this course, we mean mutual
independence.
Example 1.1 Consider two independent tosses of a fair coin and the events A = ‘first toss
is head’, B = ‘second toss is head’, C = ‘different results on two tosses’.
Find the sample space, the probability of an elementary event and the individual probabilities
of A, B, and C.
Show that A, B, and C are not independent.
Suppose that P (B) > 0. Then the conditional probability of A given B, P (A|B) is defined
as
P (A|B) = P (A ∩ B) / P (B) ,
i.e. the relative weight attached to event A within the restricted sample space B. The
conditional probability is undefined if P (B) = 0. Note that P (·|B) is a probability measure
on B. Further note that if A and B are independent events then P (A|B) = P (A) and
P (B|A) = P (B).
Rearranging the definition gives the multiplication rule: P (A ∩ B) = P (A|B)P (B) = P (B|A)P (A).
Note that if P (B|A) = P (B) then we recover the multiplication rule for independent events.
Events A and B are conditionally independent given C if P (A ∩ B | C) = P (A|C)P (B|C). Conditional independence means that once we know that C is true, A carries no further information about B. Note that conditional independence does not imply independence, nor vice versa.
Example 1.2 (1.1 ctd.) Show that A and B are not conditionally independent given C.
The law of total probability, or partition law follows from the additivity axiom and the
definition of conditional probability: suppose that B1 , . . . , Bk are mutually exclusive and
exhaustive events ( i.e. Bi ∩ Bj = ∅ for all i ̸= j and ∪i Bi = Ω) and let A be any event.
Then
P (A) = Σ_{j=1}^{k} P (A ∩ Bj ) = Σ_{j=1}^{k} P (A|Bj )P (Bj ) .
Example 1.3 A child gets to throw a fair die. If the die comes up 5 or 6, she gets to sample
a sweet from box A which contains 10 chocolate sweets and 20 caramel sweets. If the die
comes up 1,2,3 or 4 then she gets to sample a sweet from box B which contains 5 chocolate
sweets and 15 caramel sweets. What is the conditional probability she will get a chocolate
sweet if the die comes up 5 or 6? What is the conditional probability she will get a chocolate
sweet if the die comes up 1,2,3 or 4? What is her probability of getting a chocolate sweet?
Bayes theorem follows from the law of total probability and the multiplication rule. Again,
let B1 , . . . , Bk be mutually exclusive and exhaustive events and let A be any event with
P (A) > 0. Then Bayes theorem states that
P (Bi |A) = P (A|Bi )P (Bi ) / Σ_{j=1}^{k} P (A|Bj )P (Bj ) .
Bayes’ theorem can be used to update the probability P (Bi ) attached to some belief Bi
held before an experiment is conducted in the light of the new information obtained in the
experiment. P (Bi ) is then called the a priori probability and P (Bi |A) is known as the a
posteriori probability. A lot more on this later!
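As a small illustration of this updating step, the sketch below continues the sweets example of Example 1.3 (the continuation is my own, not part of the notes): given that the child did obtain a chocolate sweet, it computes the posterior probability that the die showed 5 or 6.

# Bayes' theorem: posterior probability that box A (die showed 5 or 6) was used,
# given that a chocolate sweet was drawn. Hypothetical continuation of Example 1.3.
prior = {"A": 2 / 6, "B": 4 / 6}          # P(B_i): which box the die selects
likelihood = {"A": 10 / 30, "B": 5 / 20}  # P(chocolate | B_i)

evidence = sum(likelihood[b] * prior[b] for b in prior)  # law of total probability
posterior_A = likelihood["A"] * prior["A"] / evidence    # Bayes' theorem
print(posterior_A)  # (1/9) / (5/18) = 0.4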
Example 1.4 Give an example of a random variable whose cdf is right-continuous (it has to
be) but not continuous.
X takes only a finite or countably infinite set of values {x1 , x2 , . . .}. FX is a step-function,
with steps at the xi of sizes pX (xi ) = P (X = xi ), and pX (·) is the probability mass
function (pmf) of X. ( E.g. X = place of horse in race, grade of egg.) CDFs of discrete
random variables are only right-continuous but not continuous.
Example 1.5 (1.1 ctd. II) Consider the random variable X = number of heads obtained
on the two tosses. Obtain the pmf and cdf of X. Sketch the cdf – is it continuous?
Thus fX (x) dx is the probability that X lies in the infinitesimal interval (x, x + dx). Note
that the probability that X is exactly equal to x is zero for all x ( i.e. P (X = x) = 0).
If FX is a valid cdf which is continuous with piecewise derivative g, then FX is the cdf of a
continuous random variable and the pdf is given by g.
Example 1.7 Suppose fX (x) = k(2 − x2 ) on (−1, 1). Calculate k and sketch the pdf.
Calculate and sketch the cdf. Is the cdf differentiable? Calculate P (|X| > 1/2).
A distribution has several characteristics that could be of interest, such as its shape or
skewness. Another one is its expectation, which can be regarded as a summary of the
‘average’ value of a random variable.
Discrete case:
E[X] = Σ_i xi pX (xi ) = Σ_ω X(ω) P ({ω}) .
That is, the averaging can be taken over the (distinct) values of X with weights given by the
probability distribution pX , or over the sample space Ω with weights P({ω}).
Continuous case:
E[X] = ∫_{−∞}^{∞} x fX (x) dx .
Note: Integration is applied for continuous random variables, summation is applied for
discrete random variables. Make sure not to confuse the two.
Example 1.8 The discrete random variable X has pmf pX (k) = µ^k / ((e^µ − 1) k!) for k ∈ N. Compute its expectation.
For a transformed random variable Y = ϕ(X) we have E[ϕ(X)] = Σ_i ϕ(xi ) pX (xi ) = Σ_ω ϕ(X(ω)) P ({ω}). The first expression on the right-hand side averages the values of ϕ(x) over the distribution
of X, whereas the second expression averages the values of ϕ(X(ω)) over the probabilities
of ω ∈ Ω. A third method would be to compute the distribution of Y and average the values
of y over the distribution of Y .
Example 1.9 (1.1 ctd. III) Let X be the random variable indicating the number of heads
on two tosses. Consider the transformation ϕ with ϕ(0) = ϕ(2) = 0 and ϕ(1) = 1.
Find E[X] and E[ϕ(X)].
The variance of X is
σ² = Var(X) = E[(X − E[X])²] .
Linear functions of X
The following properties of expectation and variance are easily proved ( exercise/previous
notes):
E[a + bX] = a + bE[X], Var(a + bX) = b2 Var(X)
Example 1.11 (1.1 ctd. V) Let Y be the excess of heads over tails obtained on the two
tosses of the coin. Write down E[Y ] and Var(Y ).
Learning Outcomes: Most of the material in STAT0002 and STAT0003 (or MATH0057)
is relevant and important for STAT0005. Students are strongly advised to revise this
material if they don’t feel confident about basic probability.
In particular, regarding subsections 1.1 and 1.2, you should be able to
Let us first consider the bivariate case. Suppose that the two random variables X and Y
share the same sample space Ω ( e.g. the height and the weight of an individual). Then we
can consider the event
{ω : X(ω) ≤ x, Y (ω) ≤ y}
and define its probability, regarded as a function of the two variables x and y, to be the
joint (cumulative) distribution function of X and Y , denoted by FX,Y (x, y) = P (X ≤ x, Y ≤ y).
It is often helpful to think geometrically about X and Y : In fact, (X, Y ) is a random point on
the two-dimensional Euclidean plane, R2 , i.e. each outcome of the pair of random variables
X and Y , or equivalently each outcome of the bivariate random variable (X, Y ) corresponds
to the point in R2 whose horizontal coordinate is X and whose vertical coordinate is Y . For
this reason, (X, Y ) is also called a random vector. FX,Y (x, y) is then simply the probability
that the point lands in the semi-infinite rectangle (−∞, x] × (−∞, y] = {(a, b) ∈ R2 : a ≤
x and b ≤ y}.
The joint cumulative distribution function (cdf) has similar properties to the univariate cdf.
If the function FX,Y (x, y) is the joint distribution function of random variables X and Y , then the marginal cdfs of X and Y are given by
FX (x) = P (X ≤ x, Y < ∞) = FX,Y (x, ∞)
and
FY (y) = P (X < ∞, Y ≤ y) = FX,Y (∞, y) ,
respectively.
We already know in the univariate case that P (x1 < X ≤ x2 ) = FX (x2 ) − FX (x1 ). Similarly,
we find in the bivariate case that
P (x1 < X ≤ x2 , y1 < Y ≤ y2 ) = FX,Y (x2 , y2 ) − FX,Y (x1 , y2 ) − FX,Y (x2 , y1 ) + FX,Y (x1 , y1 ) .

Example 1.12 Consider the function
FX,Y (x, y) = x²y + y²x − x²y² ,   0 ≤ x ≤ 1, 0 ≤ y ≤ 1 ,
extended by suitable constants outside (0, 1)² so as to make it a cdf. Show that FX,Y has the properties of a cdf mentioned above. Find the marginal cdfs of X and Y . Also find P (0 ≤ X ≤ 1/2, 0 ≤ Y ≤ 1/2).
Are the properties of bivariate cdfs given so far enough to decide whether a given function on
R2 is a cdf? Unfortunately, the answer is negative. A positive result is given by the following
lemma:
Lemma: If a function F : R² → [0, 1] is such that
• F is right-continuous,
• limx,y→∞ F (x, y) = 1 and limx→−∞ F (x, y) = limy→−∞ F (x, y) = 0,
• F (x2 , y2 ) − F (x1 , y2 ) − F (x2 , y1 ) + F (x1 , y1 ) ≥ 0 whenever x1 ≤ x2 and y1 ≤ y2 ,
then F is the joint cdf of some bivariate random variable (X, Y ).
Cumulative distribution functions fully specify the distribution of a random variable - they
encode everything there is to know about that distribution. However, as Example 1.12
showed, they can be somewhat difficult to handle. Making additional assumptions about the
random variables makes their distribution easier to handle, so let’s assume in this part that
X and Y take only values in a countable set, i.e. that (X, Y ) is a discrete bivariate random
variable. Then FX,Y is a step function in each variable separately and we consider the joint
probability mass function
pX,Y (xi , yj ) = P (X = xi , Y = yj ).
Example 1.13 Consider three independent tosses of a fair coin. Let X = ‘number of heads
in first and second toss’ and Y = ‘number of heads in second and third toss’. Give the
probabilities for any combination of possible outcomes of X and Y in a two-way table and
obtain the marginal pmfs of X and Y .
In general, from the joint distribution we can use the law of total probability to obtain the
marginal pmf of Y as
pY (yj ) = P (Y = yj ) = Σ_{xi} P (X = xi , Y = yj ) = Σ_{xi} pX,Y (xi , yj ) .
The marginal distribution is thus the distribution of just one of the variables.
Note that there will be jumps in FX,Y at each of the xi and yj values.
Independence
The random variables X and Y , defined on the sample space Ω with probability measure P,
are independent if the events
{X = xi } and {Y = yj }
are independent events, for all possible values xi and yj . Thus X and Y are independent
if
pX,Y (xi , yj ) = pX (xi ) pY (yj )    (1.1)
for all xi , yj . This implies that P (X ∈ A, Y ∈ B) = P (X ∈ A)P (Y ∈ B) for all sets A and
B, so that the two events {ω : X(ω) ∈ A}, {ω : Y (ω) ∈ B} are independent. (Exercise:
prove this.)
NB: If x is such that pX (x) = 0, then pX,Y (x, yj ) = 0 for all yj and (1.1) holds automatically.
Thus it does not matter whether we require (1.1) for all possible xi , yj i.e. those with
positive probability, or all real x, y. (That is, pX,Y (x, y) = pX (x)pY (y) for all x, y would be
an equivalent definition of independence.)
If X, Y are independent then the entries in the two-way table are the products of the marginal
probabilities. In Example 1.13 we see that X and Y are not independent.
Conditional distributions
These are defined for random variables by analogy with conditional probabilities of events.
Consider the conditional probability
P (X = xi | Y = yj ) = P (X = xi , Y = yj ) / P (Y = yj ) = pX,Y (xi , yj ) / pY (yj ) ,
and it gives the probabilities for observing X = xi given that we already know Y = yj . We
therefore define the conditional probability distribution of X given Y = yj as
pX|Y (xi |yj ) = pX,Y (xi , yj ) / pY (yj ) .
From the above definition we immediately obtain the multiplication rule for pmfs:
pX,Y (xi , yj ) = pX|Y (xi |yj ) pY (yj ) ,
which can be used to find a bivariate pmf when we know one marginal distribution and one
conditional distribution.
Note that if X and Y are independent then pX,Y (xi , yj ) = pX (xi )pY (yj ), so that pX|Y (xi |yj ) = pX (xi ), i.e. the conditional distribution is the same as the marginal distribution.
In general, X and Y are independent if and only if the conditional distribution of X given
Y = yj is the same as the marginal distribution of X for all yj . (This condition is equivalent
to pX,Y (xi , yj ) = pX (xi )pY (yj ) for all xi , yj , above).
Example 1.14 (1.13 ctd.) Obtain the conditional pmf of X given Y = y. Use this condi-
tional distribution to verify that X and Y are not independent.
Example 1.15 Suppose that R and N have a joint distribution in which R|N is Bin(N, π)
and N is Poi(λ). Show that R is Poi(πλ).
Conditional expectation
Since pX|Y (xi |yj ) is a probability distribution, it has a mean or expected value:
E[X|Y = yj ] = Σ_{xi} xi pX|Y (xi |yj ) ,
which represents the average value of X among outcomes ω for which Y (ω) = yj . This
may also be written EX|Y [X|Y = yj ]. We can also regard the conditional expectation
E[X|Y = yj ] as the mean value of X in the subgroup characterised by Y = yj .
Example 1.16 (1.13 ctd. II) Find the conditional expectations E[X|Y = y] for y =
0, 1, 2. Plot the graph of the function ϕ(y) = E[X|Y = y]. What do these values tell
us about the relationship between X and Y ?
In general, what is the relationship between the unconditional expectation E[X] and the
conditional expectation E[X|Y = yj ]?
We see from the above example that the overall mean is just the average of the conditional
means. We now prove this fact in general. Consider the conditional expectation ϕ(y) =
EX|Y [X|Y = y] as a function of y. This function ϕ may be used to transform the random
variable Y , i.e. we can consider the new random variable ϕ(Y ). This random variable is
usually written simply EX|Y [X|Y ] because the possibly more correct notation EX|Y [X|Y =
Y ] would be even more confusing! We may then compute the expectation of our new random
variable ϕ(Y ), i.e. E[ϕ(Y )] = EY [EX|Y [X|Y ]]. But, from the definition of the expectation
of a function of Y , we have E[ϕ(Y )] = Σ_{yj} ϕ(yj ) pY (yj ), so that
EY [EX|Y [X|Y ]] = Σ_{yj} E[X|Y = yj ] pY (yj ) ,
where each term E[X|Y = yj ] is a function of yj .
This gives the marginal expectation E[X], as will be shown in the lectures. That is,
E[X] = EY [EX|Y [X|Y ]] ,
which is known as the iterated conditional expectation formula. It is most useful when
the conditional distribution of X given Y = y is known and easier to handle than the joint
distribution (requiring integration/summation to find the marginal of X if it is not known).
Example 1.18 (1.13 ctd. II) Verify that E[X] = EY [EX|Y [X|Y ]] in this example.
Example 1.19 (1.15 ctd.) Find the mean of R using the iterated conditional expectation
formula.
Note that the definition of expectation generalises immediately to functions of two variables,
i.e.
E[ϕ(X, Y )] = Σ_ω ϕ(X(ω), Y (ω)) P ({ω})
= Σ_{xi} Σ_{yj} ϕ(xi , yj ) P ({ω : X(ω) = xi , Y (ω) = yj })
= Σ_{xi} Σ_{yj} ϕ(xi , yj ) pX,Y (xi , yj ) ,
and that the above result on conditional expectations generalises too, since
E[ϕ(X, Y )] = Σ_{xi} Σ_{yj} ϕ(xi , yj ) pX|Y (xi |yj ) pY (yj )
= Σ_{yj} pY (yj ) Σ_{xi} ϕ(xi , yj ) pX|Y (xi |yj ) = Σ_{yj} pY (yj ) EX|Y [ϕ(X, yj )|yj ] .
This will be shown in lectures for discrete random variables only. It also holds for continuous
random variables, however.
Example 1.20 Consider two discrete random variables X and Y , where the marginal prob-
abilities of Y are P (Y = 0) = 3/4, P (Y = 1) = 1/4 and the conditional probabilities of
X are P (X = 1|Y = 0) = P (X = 2|Y = 0) = 1/2 and P (X = 0|Y = 1) = P (X =
1|Y = 1) = P (X = 2|Y = 1) = 1/3. Use the iterated conditional expectation formula to
find E(XY ).
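A short numerical sketch of the iterated conditional expectation formula, using the probabilities of Example 1.20 (the dictionary layout is my own): it computes E(XY ) both directly from the joint pmf and via EY [EX|Y [XY |Y ]] = EY [Y E(X|Y )].

# Iterated conditional expectation for Example 1.20 (numerical sketch).
p_Y = {0: 3 / 4, 1: 1 / 4}
p_X_given_Y = {0: {1: 1 / 2, 2: 1 / 2},
               1: {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}}

# E[XY] directly from the joint pmf p(x, y) = p(x | y) p(y)
direct = sum(x * y * px * p_Y[y]
             for y, cond in p_X_given_Y.items() for x, px in cond.items())

# E[XY] via the iterated formula E_Y[ E[XY | Y] ] = E_Y[ Y * E[X | Y] ]
iterated = sum(y * sum(x * px for x, px in cond.items()) * p_Y[y]
               for y, cond in p_X_given_Y.items())

print(direct, iterated)  # both equal 0.25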
We consider now the case where both X and Y take values in a continuous range (i.e. their
set of possible values is uncountable) and their joint distribution function FX,Y (x, y) can be
expressed as
FX,Y (x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} fX,Y (u, v) dv du ,
where fX,Y (x, y) is the joint probability density function of X and Y . In short, we
consider a bivariate continuous random variable (X, Y ).
Letting y → ∞ we get
FX (x) = FX,Y (x, ∞) = ∫_{−∞}^{x} ( ∫_{−∞}^{∞} fX,Y (u, v) dv ) du .
But from §1.2 we also know that FX (x) = ∫_{−∞}^{x} fX (u) du. It follows that the marginal density function of X is
fX (x) = ∫_{−∞}^{∞} fX,Y (x, v) dv .
That is, fX,Y (x, y)dxdy is the probability that (X, Y ) lies in the infinitesimal rectangle
(x, x + dx) × (y, y + dy). As in the univariate case, P (X = x, Y = y) = 0 for all x, y.
Example 1.21 Consider two continuous random variables X and Y with joint density
fX,Y (x, y) = 8xy for 0 ≤ x ≤ y ≤ 1, and fX,Y (x, y) = 0 otherwise.
Sketch the area where fX,Y is positive. Derive the marginal pdfs of X and Y .
Independence
By analogy with the discrete case, two random variables X and Y are said to be independent
if their joint density factorises, i.e. if
fX,Y (x, y) = fX (x) fY (y) for all x, y.
Two continuous random variables are independent if and only if there exist func-
tions g(·) and h(·) such for all (x, y) the joint density factorises as fX,Y (x, y) =
g(x)h(y), where g is a function of x only and h is a function of y only.
Proof. If X and Y are independent then simply take g(x) = fX (x) and h(y) = fY (y). For
the converse, suppose that fX,Y (x, y) = g(x)h(y) and define
G = ∫_{−∞}^{∞} g(x) dx ,    H = ∫_{−∞}^{∞} h(y) dy .
Note that both G and H are finite (why?). Then the marginal densities are fX (x) = g(x)H,
fY (y) = Gh(y) and either of these equations implies that GH = 1 (integrate wrt. x in the
first equation or wrt. y in the second equation to see this). It follows that
fX,Y (x, y) = g(x)h(y) = (fX (x)/H)(fY (y)/G) = fX (x) fY (y)
and so X and Y are independent. □
The advantage of knowing that under independence fX,Y (x, y) = g(x)h(y) is that we don’t
need to find the marginal densities fX (x) and fY (y) (which would typically involve some
integration) to verify independence. It suffices to know fX (x) and fY (y) up to some unknown
constant.
Conditional distributions
For the conditional distribution of X given Y , we cannot condition on Y = y in the usual way,
as for any arbitrary set A, P (X ∈ A and Y = y) = P (Y = y) = 0 when Y is continuous,
so that
P (X ∈ A | Y = y) = P (X ∈ A, Y = y) / P (Y = y)
is not defined (0/0). However, we can consider
P (x < X ≤ x + dx, y < Y ≤ y + dy) / P (y < Y ≤ y + dy) ≈ fX,Y (x, y) dx dy / ( fY (y) dy )
and interpret fX,Y (x, y)/fY (y) as the conditional density of X given Y = y written as
fX|Y (x | y).
If X and Y are independent then, as before, the conditional density of X given Y = y is just
the marginal density of X.
Example 1.23 (1.21 ctd. II) Give the conditional densities of X given Y = y and of Y
given X = x indicating clearly the area where they are positive. Also, find E[X|Y = y] and
E[X], using the law of iterated conditional expectation for the latter. Compare this with the
direct calculation of E[X].
Consider the sum ϕ(X) + ψ(Y ) when X, Y have joint probability mass function pX,Y (x, y).
(The continuous case follows similarly, replacing probability mass functions by probability
densities and summations by integrals.) Then
EX,Y [ϕ(X) + ψ(Y )] = Σ_{xi} Σ_{yj} {ϕ(xi ) + ψ(yj )} pX,Y (xi , yj )
= Σ_{xi} ϕ(xi ) Σ_{yj} pX,Y (xi , yj ) + Σ_{yj} ψ(yj ) Σ_{xi} pX,Y (xi , yj )
= Σ_{xi} ϕ(xi ) pX (xi ) + Σ_{yj} ψ(yj ) pY (yj ) = EX [ϕ(X)] + EY [ψ(Y )] .
Note that the subscripts on the E’s are unnecessary as there is no possible ambiguity in this
equation, and also that this holds regardless of whether or not X and Y are independent.
In particular we have E[X + Y ] = E[X] + E[Y ]. Note the power of this result: there is no
need to calculate the probability distribution of X + Y (which may be hard!) if all we need
is the mean of X + Y .
What about the product ϕ(X)ψ(Y )? In general
EX,Y [ϕ(X)ψ(Y )] = Σ_{xi} Σ_{yj} ϕ(xi )ψ(yj ) pX,Y (xi , yj ) .    (1.2)
If X and Y are independent, then pX,Y (xi , yj ) = pX (xi ) pY (yj ) and the double sum in
(1.2) factorises, that is
EX,Y [ϕ(X)ψ(Y )] = ( Σ_{xi} ϕ(xi ) pX (xi ) ) ( Σ_{yj} ψ(yj ) pY (yj ) ) = EX [ϕ(X)] EY [ψ(Y )] .
Thus, except for the case where X and Y are independent, we typically have E[ϕ(X)ψ(Y )] ̸= E[ϕ(X)] E[ψ(Y )].
Slogan:
Independence means Multiply
1.4.3 Covariance
A particular function of interest is the covariance between X and Y . As we will see, this
is a measure for the strength of the linear relationship between X and Y . The covariance is
defined as
Cov(X, Y ) = E [(X − E[X])(Y − E[Y ])]
An alternative formula for the covariance follows on expanding the bracket, giving
Cov(X, Y ) = E[XY ] − E[X] E[Y ] .
Note that Cov(X, X) = Var(X), giving the familiar formula Var(X) = E[X²] − {E[X]}².
However in general Cov(X, Y ) = 0 does not imply that X and Y are independent! An
example for this will be given below in Example 1.25.
Exercise: Using the fact that Var(X +Y ) = Cov(X +Y, X +Y ), derive the general formula
Var(X + Y ) = Var(X) + Var(Y ) + 2Cov(X, Y ).
1.4.4 Correlation
From above, we see that the covariance varies with the scale of measurement of the variables
(lbs/kilos etc), making it difficult to interpret its numerical value. The correlation is a
standardised form of the covariance, which is scale-invariant and therefore its values are
easier to interpret.
Corr(X, Y ) = Cov(X, Y ) / √( Var(X) Var(Y ) )
Suppose that a > 0. Then Cov(aX, Y ) = a Cov(X, Y ) and Var(aX) = a2 Var(X) and it
follows that Corr(aX, Y ) = Corr(X, Y ). Thus the correlation is scale-invariant.
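The scale-invariance just noted, and the variance decomposition in the exercise above, are easy to check empirically. The sketch below uses simulated correlated data; the simulation set-up is my own and serves only as an illustration.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=100_000)
Y = 0.6 * X + rng.normal(size=100_000)        # X and Y positively correlated

cov_XY = ((X - X.mean()) * (Y - Y.mean())).mean()

# Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y)
print(np.var(X + Y), np.var(X) + np.var(Y) + 2 * cov_XY)

# Corr(aX, Y) = Corr(X, Y) for a > 0: the correlation is scale-invariant
print(np.corrcoef(X, Y)[0, 1], np.corrcoef(5.0 * X, Y)[0, 1])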
A further important property of the correlation is that −1 ≤ Corr(X, Y ) ≤ +1.
Example 1.24 (1.20 ctd.) Find the covariance and correlation of X and Y .
Not examined:
To prove this, we use the following trick. For any constant z ∈ R,
0 ≤ Var(zX + Y ) = z² Var(X) + 2z Cov(X, Y ) + Var(Y ) ,
which, viewed as a quadratic in z, never dips below the z-axis; its discriminant therefore satisfies 4 Cov(X, Y )² − 4 Var(X)Var(Y ) ≤ 0, i.e. Corr(X, Y )² ≤ 1.
We get the extreme values, Corr(X, Y ) = ±1, when the quadratic touches the z-axis;
that is, when Var (zX + Y ) = 0. But if the variance of a random variable is zero then
the random variable must be a constant (we say that its distribution is degenerate).
Therefore, letting z be the particular value for which the quadratic touches the z-axis,
we obtain
zX + Y = constant. (1.3)
Rearranging (1.3), with z = −Cov(X, Y )/Var(X) the value at which the quadratic attains its minimum, gives
Y − E[Y ] = [ Cov(X, Y ) / Var(X) ] (X − E[X]) .
We therefore see that correlation measures the degree of linearity of the relationship
between X and Y , and takes its maximum and minimum values (±1) when there is an
exact linear relationship between them. As there may be other forms of dependence
between X and Y ( i.e. non–linear dependence), it is now clear that Corr(X, Y ) = 0
does not imply independence.
Consider random variables X and Y and the conditional probability distribution of X given
Y = y. This conditional distribution has a mean, denoted E(X|Y = y), and a variance,
Var(X|Y = y). We have already shown that the marginal (unconditional) mean E(X) is related to the conditional mean via the formula E(X) = EY [E(X|Y )].
In the lectures we will obtain a similar result for the relation between the marginal and
conditional variances. The result is that
Var(X) = EY [Var(X|Y )] + VarY [E(X|Y )] ,
known as the iterated conditional variance formula.
Example 1.26 (1.20 ctd.) Find the conditional variances of X given Y = 0, 1. Compute
the marginal variance of X by using the above result.
Example 1.27 (1.15 ctd. II) Find the variance of R using the iterated conditional variance
formula.
Learning Outcomes: Sections 1.3 and 1.4 represent the base of STAT0005. A thorough
understanding of the material is essential in order to follow the remaining sections as
well as many courses in the second and third year.
1. Compute the covariance of two variables using the simplest possible way for doing
so in standard situations;
2. Compute the correlation of two random variables, and interpret the result in terms
of linear dependence;
3. Derive the covariance / correlation for simple linear transformations of the vari-
ables;
4. State the main properties of the correlation coefficient;
5. Sketch the proof of −1 ≤ Corr ≤ 1.
The idea of joint probability distributions extends immediately to more than two variables, giv-
ing general multivariate distributions, i.e. the variables X1 , . . . , Xn have a joint cumulative
distribution function
FX1 ,...,Xn (x1 , . . . , xn ) = P (X1 ≤ x1 , . . . , Xn ≤ xn ) ,
so that a function ϕ(X1 , . . . , Xn ) has an expectation with respect to this joint distribution
etc.
Conditional distributions of a subset of variables given the rest then follow as before; for
example, for discrete random variables X1 , X2 , X3 ,
pX1 ,X2 |X3 (x1 , x2 | x3 ) = pX1 ,X2 ,X3 (x1 , x2 , x3 ) / pX3 (x3 )
is the conditional pmf of (X1 , X2 ) given X3 = x3 . Similarly, the discrete random variables
X1 , . . . , Xn are (mutually) independent if and only if
pX1 ,...,Xn (x1 , . . . , xn ) = ∏_{i=1}^{n} pXi (xi )
for all x1 , . . . , xn , and X1 and X2 are conditionally independent given X3 if and only if
pX1 ,X2 |X3 (x1 , x2 | x3 ) = pX1 |X3 (x1 | x3 ) pX2 |X3 (x2 | x3 )
for all x1 , x2 , x3 . These definitions hold for continuous distributions by replacing the pmf by
the pdf.
The multinomial distribution
Suppose that n individuals are sampled independently from a population in which each individual is one of m + 1 types, with
P (individual is of type i) = pi ,   i = 1, . . . , m + 1,
where Σ_{i=1}^{m+1} pi = 1. Let Ni be the number of type i individuals in the sample. Note that, since Nm+1 = n − Σ_{i=1}^{m} Ni , Nm+1 is determined by N1 , . . . , Nm . We therefore only need to consider the joint distribution of the m random variables N1 , . . . , Nm .
The joint pmf is
P (N1 = n1 , . . . , Nm = nm ) = [ n! / (n1 ! · · · nm+1 !) ] p1^{n1} · · · pm+1^{nm+1} ,
where nm+1 = n − Σ_{i=1}^{m} ni . This is the multinomial distribution with index n and parameters p1 , . . . , pm , where pm+1 = 1 − Σ_{i=1}^{m} pi (so pm+1 is not a ‘free’ parameter).
To justify the above joint pmf note that we want the probability that the n trials result in
exactly n1 outcomes of the first category, n2 of the second, . . . , nm+1 in the last category.
Any specific ordering of these n outcomes has probability p1^{n1} · · · pm+1^{nm+1} by the assumption of independent trials, and there are n! / (n1 ! · · · nm+1 !) such orderings.
Example 1.28 Suppose that a bag contains five red, five black and five yellow balls and
that three balls are drawn at random with replacement. What is the probability that there is
one of each colour?
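Example 1.28 can be checked directly with scipy's multinomial distribution; the sketch below is an illustration of mine using scipy.stats.multinomial, which is parametrised by the index n and the full probability vector.

from scipy.stats import multinomial

# Example 1.28: three draws with replacement, three equally likely colours.
p = [1 / 3, 1 / 3, 1 / 3]
prob_one_of_each = multinomial.pmf([1, 1, 1], n=3, p=p)
print(prob_one_of_each)  # 3! * (1/3)^3 = 2/9 ≈ 0.222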
Marginal distribution of Ni
Each of the n independent trials results in type i with probability pi , so, regarding the outcomes simply as ‘type i’ versus ‘not type i’, the marginal distribution of Ni is binomial: Ni ∼ Bin(n, pi ).
Example 1.29 Let NA , NB and NF be the numbers of A grades, B grades and fails respec-
tively amongst a class of 100 students. Suppose that generally 5% of students achieve grade
A, 30% grade B and that 5% fail. Write down the joint distribution of NA , NB and NF and
find the marginal distribution of NA .
Again we can regard individuals as being one of three types, i, j and k={not i or j}. This
is the trinomial distribution with probabilities
P (Ni = ni , Nj = nj ) = [ n! / (ni ! nj ! nk !) ] pi^{ni} pj^{nj} pk^{nk}    if ni + nj ≤ n ,
and P (Ni = ni , Nj = nj ) = 0 otherwise, where nk = n − ni − nj and pk = 1 − pi − pj .
In the lectures we compute E[Ni Nj ] = n(n − 1)pi pj by summing ni nj over this joint pmf. The manipulations in the third line of that calculation are designed to create a multinomial expansion that we can sum. Note that we may take ni , nj ≥ 1 in the sum, since if either ni or nj is zero then the corresponding term in the sum is zero.
Finally
Cov(Ni , Nj ) = E[Ni Nj ] − E[Ni ]E[Nj ] = n(n − 1)pi pj − (npi )(npj ) = −npi pj
and so
Corr(Ni , Nj ) = −npi pj / √( npi (1 − pi ) npj (1 − pj ) ) = − √( pi pj / ((1 − pi )(1 − pj )) ) .
Note that Corr(Ni , Nj ) is negative, as anticipated, and also that it does not depend on n.
Given Nj = nj , there are n − nj remaining independent Bernoulli trials, each with probability
of being type i given by
P (type i | not type j) = P (type i) / P (not type j) = pi / (1 − pj ) .
Thus, given Nj = nj , Ni has a binomial distribution with index n − nj and probability pi /(1 − pj ).
Exercise: Verify this result by using the definition of conditional probability together with
the joint distribution of Ni and Nj and the marginal distribution of Nj .
Example 1.30 (1.29 ctd.) Find the conditional distribution of NA given NF = 10 and
calculate Corr(NA , NF ).
Remark: The multinomial distribution can also be used as a model for contingency tables.
Let X and Y be discrete random variables with a number of I and J different outcomes,
respectively. Then, in a trial of size n, Nij will count the number of outcomes where we
observe X = i and Y = j. The counts Nij , i = 1, . . . , I, j = 1, . . . , J, are typically
arranged in a contingency table, and from the above considerations we know that their joint
distribution is multinomial with parameters n and pij = P (X = i, Y = j), i = 1, . . . , I,
j = 1, . . . , J. This leads to the analysis of categorical data, for which a question of interest
is often ‘are the categories independent?’, i.e. is pij = pi pj for all i, j? Exact significance
tests of this hypothesis can be constructed from the multinomial distribution of the entries
in the contingency table.
The continuous random variables X and Y are said to have a bivariate normal distribution
if they have joint probability density function
fX,Y (x, y) = [ 1 / (2πσX σY √(1 − ρ²)) ] exp[ −(1/(2(1 − ρ²))) { ((x − µX )/σX )² − 2ρ ((x − µX )/σX )((y − µY )/σY ) + ((y − µY )/σY )² } ]
for −∞ < x, y < ∞, where −∞ < µX , µY < ∞; σX , σY > 0; ρ² < 1. The parameters of this distribution are µX , µY , σX², σY², and ρ. As we will see below, these turn out to be the marginal means, variances, and the correlation of X and Y .
The bivariate normal is widely used as a model for many observed phenomena where depen-
dence is expected, e.g. height and weight of an individual, length and width of a petal,
income and investment returns. Sometimes the data need to be transformed ( e.g. by taking
logs) before using the bivariate normal.
Marginal distributions
In order to simplify the integrations required to find the marginal densities of X and Y , we
set
(x − µX )/σX = u ,    (y − µY )/σY = v .
Then, integrating with respect to y, the marginal density of X can be found as
fX (x) = ∫_{−∞}^{∞} fX,Y (x, y) dy
= ∫_{−∞}^{∞} [ 1 / (2πσX σY √(1 − ρ²)) ] exp[ −(u² − 2ρuv + v²) / (2(1 − ρ²)) ] σY dv
= [ 1 / (σX √(2π)) ] ∫_{−∞}^{∞} [ 1 / √(2π(1 − ρ²)) ] exp[ −( (v − ρu)² + u²(1 − ρ²) ) / (2(1 − ρ²)) ] dv ,
where we have completed the square in v in the exponent. Taking the term not involving
v outside the integral we then get
fX (x) = [ 1 / (σX √(2π)) ] exp( −u²/2 ) ∫_{−∞}^{∞} [ 1 / √(2π(1 − ρ²)) ] exp[ −(v − ρu)² / (2(1 − ρ²)) ] dv
= [ 1 / (σX √(2π)) ] exp( −u²/2 ) = [ 1 / (√(2π) σX ) ] exp[ −(1/2) ((x − µX )/σX )² ] .
The final step here follows by noting that the integrand is the density of a N (ρu, 1 − ρ²) random variable and hence integrates to one. Thus the marginal distribution of X is normal, with mean µX and variance σX².
It will be shown later in chapter 3 that the fifth parameter, ρ, also has a simple interpretation,
namely ρ = Corr(X, Y ).
Conditional distributions
It will be shown in the lectures that the conditional distribution of X given Y = y is again normal:
X | Y = y ∼ N( µX + ρ (σX /σY )(y − µY ) ,  σX² (1 − ρ²) ) .
The role of ρ
We see from this that the conditional mean of X is a linear function of y. If y is relatively large then the conditional mean of X is also relatively large if ρ = Corr(X, Y ) > 0, or is relatively small if ρ < 0.
Setting ρ = 0 in the joint density gives fX,Y (x, y) = fX (x)fY (y), showing that uncorrelated normal variables are independent (remember that this is not true in the general case).
Example 1.31 Let X be the one-year yield of portfolio A and Y be the one-year yield of
portfolio B. From past data, the marginal distribution of X is modelled as N (7, 1), whereas
the marginal distribution of Y is N (8, 4) (being a more risky portfolio but having a higher
average yield). Furthermore, the correlation between X and Y is 0.5. Assuming that X, Y
have a bivariate normal distribution, find the conditional distribution of X given that Y = 9
and compare this with the marginal distribution of X. Calculate the probability P (X >
8|Y = 9).
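A numerical sketch for Example 1.31, assuming the conditional-distribution result stated above, namely X | Y = y ∼ N( µX + ρ(σX /σY )(y − µY ), σX²(1 − ρ²) ); scipy is used only for the normal cdf.

from math import sqrt
from scipy.stats import norm

# Example 1.31: X ~ N(7, 1), Y ~ N(8, 4), Corr(X, Y) = 0.5, condition on Y = 9.
mu_X, sd_X = 7.0, 1.0
mu_Y, sd_Y = 8.0, 2.0
rho, y = 0.5, 9.0

cond_mean = mu_X + rho * (sd_X / sd_Y) * (y - mu_Y)  # 7.25
cond_var = sd_X**2 * (1 - rho**2)                    # 0.75

p = 1 - norm.cdf(8.0, loc=cond_mean, scale=sqrt(cond_var))
print(cond_mean, cond_var, p)                 # P(X > 8 | Y = 9) ≈ 0.19
print(1 - norm.cdf(8.0, loc=mu_X, scale=sd_X))  # marginal P(X > 8) ≈ 0.16, for comparison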
Matrix Basics
First of all, let us recall some general matrix notation. An m by n matrix, i.e. a matrix containing m rows and n columns of real numbers, is denoted by A = (ai,j )_{i=1,j=1}^{m,n} and is thought of as an element of R^{m×n}. Matrices are added entry-wise, and two matrices A ∈ R^{k×m} and B ∈ R^{m×n} can be multiplied to yield a matrix C ∈ R^{k×n} whose entries are obtained by taking inner products of rows in A with columns in B, i.e. as follows:
ck,j = Σ_{i=1}^{m} ak,i bi,j .
Matrices can be multiplied by real numbers and they can act on column vectors (from the
right) and row vectors (from the left), so if A ∈ R^{m×n} is a matrix, x ∈ R^n is a vector and α is a real number, then
αAx = y ∈ R^m ,   where   yi = α Σ_{j=1}^{n} ai,j xj .
Matrices, vectors and scalars satisfy the usual associative and distributive laws, e.g. A(x +
y) = Ax + Ay and (AB)C = A(BC) etc. However, note that matrix multiplication,
contrary to normal multiplication of real numbers, is not commutative, i.e. in general we
have AB ̸= BA.
Transpose
The transpose of a column vector x with entries x1 , x2 , . . . , xn is the row vector x^T = (x1 , x2 , . . . , xn ),
although sometimes we will use column and row vectors interchangeably when no confusion
can arise. The transpose of a matrix A = (ai,j )_{i,j=1}^{m,n} is just A^T = (aj,i ), i.e. you mirror the matrix entries across its diagonal, and the following rules apply for transposition:
(A^T )^T = A ,   (A + B)^T = A^T + B^T ,   (AB)^T = B^T A^T ,   (αx + βy)^T = α x^T + β y^T .
Here, matrices are denoted by A, B, and x, y are vectors and α, β are real numbers.
Inner Product
The inner product, also known as scalar product, of two vectors, x, y, ∈ Rn is denoted by
xT y ∈ R. It is symmetric, i.e. xT y = y T x, and works with the transpose and inverse of
invertible matrices A ∈ Rn×n as follows:
x^T A y = (A^T x)^T y ,    (A^T )^{−1} = (A^{−1})^T ,
where the latter equality is the reason for the abbreviated notation A^{−T} = (A^{−1})^T – it doesn’t
matter whether the transpose or the inverse is carried out first.
Determinants
The determinant of a square matrix A ∈ Rn×n , denoted either det(A) or simply |A|, satisfies
the following rules:
det(αA) = α^n det(A) ,   det(A^T ) = det(A) ,   det(A^{−1}) = 1 / det(A) ,
where A is assumed to be invertible for the last line to hold. A determinant can be computed
by proceeding in a column first or a row first fashion and proceeding recursively to the
determinants of the sub-matrices created, e.g. in dimension n = 3 we have
the determinant is expanded along the first column into 2 × 2 sub-determinants:
det(A) = a1,1 (a2,2 a3,3 − a3,2 a2,3 ) − a2,1 (a1,2 a3,3 − a3,2 a1,3 ) + a3,1 (a1,2 a2,3 − a2,2 a1,3 ),
Define
X = (X, Y )^T ,   µ = (µX , µY )^T ,   and Σ the 2 × 2 matrix with diagonal entries σX², σY² and off-diagonal entries σXY = ρσX σY .
Here we call X a random vector, µ = E(X) is its mean vector and Σ = Cov(X) is the
covariance matrix, or dispersion matrix of X. Then
det(Σ) = σX² σY² (1 − ρ²) ,   and Σ^{−1} is 1/det(Σ) times the 2 × 2 matrix with diagonal entries σY², σX² and off-diagonal entries −ρσX σY .
and, writing x = (x, y)^T ,
(x − µ)^T Σ^{−1} (x − µ) = [ 1 / (σX² σY² (1 − ρ²)) ] { (x − µX )² σY² − 2(x − µX )(y − µY ) ρσX σY + (y − µY )² σX² }
= [ 1 / (1 − ρ²) ] { ((x − µX )/σX )² − 2ρ ((x − µX )/σX )((y − µY )/σY ) + ((y − µY )/σY )² } .
The usefulness of this matrix representation is that the bivariate normal distribution now
extends immediately to a general multivariate form, with joint density given by (1.8), with
(µ)i = E(Xi ) = µi and
(Σ)ij = Cov(Xi , Xj ) = ρij σi σj
Further note that, since Σ is k × k, we can write det(2πΣ)^{1/2} = (2π)^{k/2} det(Σ)^{1/2} .
Example 1.32 Let X1 , X2 , X3 have a trivariate normal distribution with mean vector (µ1 , µ2 , µ3 )
and covariance matrix Σ = diag(a, b, c), the 3 × 3 diagonal matrix with diagonal entries a, b, c.
Show that fX1 ,X2 ,X3 = fX1 fX2 fX3 and give the marginal distributions of X1 , X2 , and X3 .
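Example 1.32 can also be verified numerically: with a diagonal covariance matrix the joint density evaluates to the product of three univariate normal densities. A minimal sketch, in which the values of a, b, c and the evaluation point are arbitrary choices of mine:

import numpy as np
from scipy.stats import multivariate_normal, norm

mu = np.array([1.0, -2.0, 0.5])
a, b, c = 2.0, 0.5, 3.0                 # diagonal entries of Sigma
Sigma = np.diag([a, b, c])

x = np.array([0.3, -1.5, 1.0])          # arbitrary evaluation point

joint = multivariate_normal.pdf(x, mean=mu, cov=Sigma)
product = np.prod([norm.pdf(x[i], loc=mu[i], scale=np.sqrt(Sigma[i, i]))
                   for i in range(3)])
print(joint, product)                   # the two values agree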
Learning Outcomes:
1. Recognise the bivariate normal density, describe its shape and interpret the pa-
rameters;
2. Name the marginal distributions and know how to derive them;
3. Give the conditional distributions, know how to derive them and name the char-
acteristic properties of the conditional distributions;
4. Relate the correlation parameter to independence between jointly normal random
variables;
5. With the help of the foregoing points, characterise situations where the multivari-
ate normal distribution is appropriate;
6. Compute probabilities of joint and conditional events;
7. Explain how the multivariate normal distribution is constructed using matrix no-
tation.
Chapter 2
Transformation of Variables
In this section we will see how to derive the distribution of transformed random variables.
This is useful because many statistics applied to data analysis (e.g. test statistics) are
transformations of the sample variables.
Suppose first that X is a discrete random variable with pmf pX and that Y = ϕ(X) for some function ϕ. Then
P (Y = y) = P ({ω : ϕ(X(ω)) = y}) = Σ_{ω:ϕ(X(ω))=y} P ({ω})
= Σ_{x:ϕ(x)=y} P ({ω : X(ω) = x})
= Σ_{x:ϕ(x)=y} pX (x) .
Example 2.1 Consider two independent throws of a fair die. Let X be the sum of the
numbers that show up. Give the distribution of X. Now consider the transformation Y =
(X − 7)2 . Derive the distribution of Y .
Suppose that Y = ϕ(X) where ϕ is a strictly increasing and differentiable function. Then,
FY (y) = P (ϕ(X) ≤ y) = P (X ≤ ϕ^{−1}(y)) = FX (ϕ^{−1}(y)) .
The first and third equalities arise simply from the definition of the cdf. The middle equality
says that the two probabilities on either side are equal because the events are the same, i.e.
{ω ∈ Ω : ϕ(X(ω)) ≤ y} = {ω ∈ Ω : X(ω) ≤ ϕ−1 (y)}. In words this simply means that
the event that ϕ(X) ≤ y happens if and only if the event that X ≤ ϕ−1 (y) happens. To see
that this is true ...
Differentiating with respect to y gives
fY (y) = fX (x) (dx/dy) , evaluated at x = ϕ^{−1}(y),
where the index x = ϕ^{−1}(y) means that any x in the formula has to be replaced by the inverse ϕ^{−1}(y) because fY (y) is a function of y.
If instead ϕ is strictly decreasing, then FY (y) = P (X ≥ ϕ^{−1}(y)) = 1 − FX (ϕ^{−1}(y)), so that
fY (y) = −fX (x) (dx/dy) , evaluated at x = ϕ^{−1}(y).
In the first case dy/dx = dϕ(x)/dx is positive (since ϕ is increasing), in the second it is
negative (since ϕ is decreasing) so either way the transformation formula is
fY (y) = fX (x) |dx/dy| , evaluated at x = ϕ^{−1}(y).
We can check that the right-hand side of the above formula is a valid pdf as follows. Recall
that ∫_{−∞}^{∞} fX (x) dx = 1. Changing variable to y = ϕ(x) we have, for ϕ increasing,
1 = ∫ fX (x) (dx/dy) |_{x=ϕ^{−1}(y)} dy ,
so that fX (x) (dx/dy) |_{x=ϕ^{−1}(y)} , as a function of y, is a valid pdf. Similarly for ϕ decreasing.
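Before turning to many-to-one functions, here is a quick numerical illustration of the monotone transformation formula. The choice of example is mine: take X ∼ N(0, 1) and ϕ(x) = e^x, so that ϕ^{−1}(y) = log y and |dx/dy| = 1/y, giving the standard lognormal density, which scipy provides for comparison.

import numpy as np
from scipy.stats import norm, lognorm

# Y = exp(X) with X ~ N(0, 1): transformation formula vs known lognormal density.
y = np.linspace(0.1, 5.0, 50)

f_Y_formula = norm.pdf(np.log(y)) * (1.0 / y)   # f_X(phi^{-1}(y)) |dx/dy|
f_Y_scipy = lognorm.pdf(y, s=1.0)               # standard lognormal (sigma = 1)

print(np.max(np.abs(f_Y_formula - f_Y_scipy)))  # essentially zero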
When ϕ is a many-to-one function we use the generalised formula
fY (y) = Σ_{x:ϕ(x)=y} fX (x) |dx/dy| ,
where the summation is over the set {x : ϕ(x) = y}. That is, we add up the contributions to the density at y from all x values which map to y.
Example 2.3 Suppose that fX (x) = 2x on (0, 1) and let Y = (X − 1/2)². Obtain the pdf of Y .
For the bivariate case we consider two random variables X, Y with joint density fX,Y (x, y).
What is the joint density of transformations U = u(X, Y ), V = v(X, Y ), where u(·, ·) and
v(·, ·) are functions from R² to R, such as the ratio X/Y or the sum X + Y ?
In order to use the following generalisation of the method of section 2.1, we need to assume
that u, v are such that each pair (x, y) defines a unique (u, v) and conversely, so that u =
u(x, y) and v = v(x, y) are differentiable and invertible. The formula that gives the joint
density of U, V is similar to the univariate case, but the derivative, as we used it above, now has to be replaced by the Jacobian J(x, y) of the transformation:
fU,V (u, v) = fX,Y (x, y) |J(x, y)| , evaluated at x = x(u, v), y = y(u, v).
But how do we get the Jacobian J(x, y)? It is actually the determinant of the matrix of
partial derivatives:
J(x, y) = det( ∂(x, y)/∂(u, v) ) , the determinant of the 2 × 2 matrix with rows ( ∂x/∂u , ∂x/∂v ) and ( ∂y/∂u , ∂y/∂v ).
We finally take its absolute value, |J(x, y)|. There are two ways of computing this:
(1) Obtain the inverse transformation x = x(u, v), y = y(u, v), compute the matrix of partial
derivatives ∂(x, y)/∂(u, v) and then its determinant and absolute value.
(2) Alternatively find the determinant J(u, v) from the matrix of partial derivatives of (u, v)
with respect to (x, y) and then its absolute value and invert this.
Example 2.4 Let X and Y be two independent exponential variables with X ∼ Exp(λ) and
Y ∼ Exp(µ). Find the distribution of U = X/Y
Example 2.5 Consider two independent and identically distributed random variables X and
Y having a uniform distribution on [0, 2]. Derive the joint density of Z = X/Y and W = Y ,
stating the area where this density is positive. Are Z and W independent?
Obtain the marginal density of Z = X/Y .
For the sum Z = X + Y of two discrete random variables, partitioning according to the value of X gives P (Z = z) = Σ_x P (X = x, Y = z − x). That is,
pZ (z) = Σ_x pX,Y (x, z − x) .
Example 2.6 Let X and Y be two positive random variables with joint pdf
fX,Y (x, y) = xy e^{−(x+y)} ,   x, y > 0 .
Derive and name the distribution of their sum Z = X + Y .
The ideas of section 2.2 extend in a straightforward way to the case of more than two
continuous random variables. The general problem is to find the distribution of Y = ϕ(X),
where Y is s × 1 and X is r × 1, from the known distribution of X. Here X is the random
vector X = (X1 , X2 , . . . , Xr )^T .
Case (i): ϕ is a one-to-one transformation (so that s = r). Then the rule is the direct generalisation of the bivariate case: fY (y) = fX (x) |J(x)|, where x = ϕ^{−1}(y) and J is the determinant of the r × r matrix of partial derivatives ∂(x1 , . . . , xr )/∂(y1 , . . . , yr ).
Case (ii): s < r. First transform the s-vector Y to the r-vector Y ′ , where Yi′ = Yi , i =
1, . . . , s , and the other r − s random variables Yi′ , i = s + 1, . . . , r , are chosen for con-
venience. Now find the density of Y ′ as in case (i) and then integrate out Ys+1 ′
, . . . , Yr′ to
obtain the marginal density of Y , as required. ( c.f. Examples 2.6 & 2.7 in the bivariate
case.)
Case (iii): s = r but ϕ(·) is not monotonic. Then there will generally be more than one value
of x corresponding to a given y and we need to add the probability contributions from all
relevant xs.
Sometimes we may not need the complete probability distribution of ϕ(X), but just the first
two moments. Recall that E[aX + b] = aE[X] + b, so the relation E[ϕ(X)] = ϕ(E[X]) is
true whenever ϕ is a linear function. However, in general if Y = ϕ(X) it will not be true
that E[Y ] = E[ϕ(X)] = ∫ ϕ(x) fX (x) dx is the same as ϕ(E[X]) = ϕ( ∫ x fX (x) dx ) (or
equivalent summations if X is discrete).
To find moments of Y we can use the distribution of X, as above. However, the sums or
integrals involved may be analytically intractable. In practice an approximate answer may be
sufficient. Intuitively, if X has mean µX and X is not very variable, then we would expect
E[Y ] to be quite close to ϕ(µX ).
Suppose that ϕ(x) is a continuous function of x for which the following Taylor expansion
about µX exists (which requires the existence of the derivatives of ϕ):
ϕ(x) = ϕ(µX ) + (x − µX ) ϕ′(µX ) + (1/2)(x − µX )² ϕ′′(µX ) + . . .
Replacing x by X and taking expectations (or, equivalently, multiplying both sides of the
above equation by fX (x) and integrating over x) term by term, we get
E[ϕ(X)] = ϕ(µX ) + ϕ′(µX ) E[X − µX ] + (1/2) ϕ′′(µX ) E[(X − µX )²] + . . . ,
where E[X − µX ] = 0 and E[(X − µX )²] = σX²,
so that
E[Y ] ≈ ϕ(µX ) + (1/2) ϕ′′(µX ) σX² .
This approximation will be good if the function ϕ is well-approximated by the quadratic (i.e.
second order) Taylor expansion in the region to which fX assigns significant probability. If
this region is large, e.g. because σX² is large, then ϕ will need to be well-approximated by the quadratic Taylor expansion throughout a large region. If σX² is small, ϕ may only need to be
nearly a quadratic function on a much smaller region.
The rougher approximation E(Y ) ≈ ϕ(µX ) will usually only be good if ϕ′′(µX ) σX² is small, which will be the case if σX² is small and/or ϕ′′(µX ) is small; that is, if X is not very variable and/or ϕ is approximately linear at µX . Including the next term in the expansion will usually provide a better approximation.
A usually sufficiently good approximate formula for the variance is based on a first-order approximation, yielding
Var ϕ(X) = E[ϕ(X) − E(ϕ(X))]² ≈ E[ϕ(X) − ϕ(µX )]² ≈ E[(X − µX )² (ϕ′(µX ))²] = (ϕ′(µX ))² E[(X − µX )²] ,
so that
Var(Y ) ≈ (ϕ′(µX ))² σX² .
Example 2.8 Consider a Poisson variable X ∼ Poi(µ). Find approximations to the expectation and variance of Y = √X.
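A quick simulation shows how good these approximations are in the setting of Example 2.8. The sketch below plugs ϕ(x) = √x into the two approximation formulas and compares them with Monte Carlo estimates; the choice µ = 4 is mine. The comparison also illustrates the warnings above: the first-order variance approximation is noticeably rougher than the second-order mean approximation here.

import numpy as np

rng = np.random.default_rng(2)
mu = 4.0
X = rng.poisson(mu, size=500_000)
Y = np.sqrt(X)

# Second-order approximation: E[Y] ≈ phi(mu) + 0.5 * phi''(mu) * Var(X) = sqrt(mu) - 1/(8 sqrt(mu))
approx_mean = np.sqrt(mu) - 1 / (8 * np.sqrt(mu))
# First-order approximation: Var(Y) ≈ (phi'(mu))^2 * Var(X) = 1/4
approx_var = (1 / (2 * np.sqrt(mu)))**2 * mu

print(Y.mean(), approx_mean)  # simulation ≈ 1.92, approximation ≈ 1.94
print(Y.var(), approx_var)    # simulation ≈ 0.31, approximation 0.25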
Order statistics are a special kind of transformation of the sample variables. Their joint and
marginal distributions can be derived by combinatorial considerations.
Suppose that X1 , . . . , Xn are independent with common density fX . Denote the ordered
values by X(1) ≤ X(2) ≤ . . . ≤ X(n) . What is the distribution Fr of X(r) ?
In particular, X(n) = max (X1 , . . . , Xn ) is the sample maximum and X(1) = min(X1 , . . . , Xn )
is the sample minimum. To find the distribution of X(n) , note that {X(n) ≤ x} and
{all Xi ≤ x} are the same event – and so have the same probability! Therefore the distribu-
tion function of X(n) is
Fn (x) = P (X(n) ≤ x) = P (X1 ≤ x, . . . , Xn ≤ x) = {FX (x)}^n ,
since the Xi are independent with the same distribution function FX . Similarly, for the sample minimum, P (X(1) > x) = P (all Xi > x) = {1 − FX (x)}^n , and thus
F1 (x) = 1 − (1 − FX (x))^n .
Consider next the situation for general 1 ≤ r ≤ n. For dx sufficiently small we have
P (x < X(r) ≤ x + dx) = P ( r − 1 values Xi with Xi ≤ x, one value in (x, x + dx], and n − r values with Xi > x + dx )
≈ [ n! / ((r − 1)!(n − r)!) ] {FX (x)}^{r−1} fX (x) dx (1 − FX (x + dx))^{n−r} ,
where the combinatorial factor n! / ((r − 1)!(n − r)!) is the number of ways of ordering the r − 1, 1 and n − r values.
Recalling that fr (x) = limdx→0 P (x < X(r) ≤ x + dx)/dx, dividing both sides of the above
expression by dx and letting dx → 0 we obtain the density function of the rth order statistic
X(r) as
fr (x) = [ n! / ((r − 1)!(n − r)!) ] (FX (x))^{r−1} (1 − FX (x))^{n−r} fX (x) .
Exercise: Show that this formula gives the previous densities when r = n and r = 1.
Activity 2.1 Collect a random sample of five students from the audience and have their
heights X1 , . . . , X5 measured. Record the heights and compute the mean height. Reorder
the students by height to obtain the first to fifth order statistic, X(1) , . . . , X(5) . What do the
first, third and fifth order statistic, X(1) , X(3) and X(5) correspond to?
Example 2.9 A village is protected from a river by a dike of height h. The maximum water
levels Xi reached by the river in subsequent years i = 1, 2, 3, . . . are modelled as independent
following an exponential distribution with mean λ^{−1} = 10. What is the probability that the
village will be flooded (in statistical language this would be called a “threshold exceedance”)
at least once in the next 100 years? How high does the dike need to be to make this probability
smaller than 0.5?
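The probability asked for in Example 2.9 only involves the cdf of the sample maximum derived above. The sketch below evaluates it and inverts the formula to find the required dike height; the code and its numerical values are mine.

import numpy as np

lam, n_years = 0.1, 100                 # Exp(lambda) with mean 1/lambda = 10

def p_flood(h):
    # P(max of n_years iid exponential levels exceeds h) = 1 - F_X(h)^n
    return 1 - (1 - np.exp(-lam * h)) ** n_years

print(p_flood(30.0))                    # ≈ 0.994 for a dike of height 30

# Smallest height with flood probability 0.5 (inverting the cdf in closed form):
h_star = -np.log(1 - 0.5 ** (1 / n_years)) / lam
print(h_star, p_flood(h_star))          # ≈ 49.75; any higher dike gives probability below 0.5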
The distribution of the sample maximum is an important quantity in the field of extreme value
theory as the preceding example showed. Extreme value theory is important for its application
in insurance pricing and risk assessment. It turns out that the probability distribution of
threshold exceedances follows a universal class of distributions in the limit of high thresholds
h, independently of the individual distribution of the Xi . A most remarkable result!
Learning Outcomes: In Chapter 2, the most important aspects are the following.
1. Compute the distribution of the rth order statistic (in particular sample maximum
and sample minimum) and explain the required probabilistic and combinatorial
reasoning;
Chapter 3
Generating Functions
Overview
The transformation method presented in the previous chapter may become tedious when a
large number of variables is involved, in particular for transformations of the sample variables
when the sample size tends to infinity. Generating functions provide an alternative way of
determining a distribution ( e.g. of sums of random variables).
We consider different generating functions for the discrete and continuous case. For the
former, we can use probability generating functions (section 3.1) as well as moment generating
functions (section 3.2), whereas for the latter, only moment generating functions can be
used. We will point out the simple connection between pgfs and mgfs and further consider
joint generating functions in section 3.3 and apply these to linear combinations of random
variables in section 3.4. Finally, in section 3.4.1, we will state, prove, and use the Central
Limit Theorem.
Suppose that X is a discrete random variable taking values 0, 1, 2 . . .. Then the probability
generating function (pgf) G(z) of X is defined as
G(z) ≡ E(z^X ) = Σ_{r=0}^{∞} pr z^r ,   where pr = P (X = r).
The pgf is a function of particular interest, because it sometimes provides an easy way of
determining the distribution of a discrete random variable.
We know from the theory of Taylor expansions that the rth derivative satisfies G^{(r)}(0) = pr r!, yielding an expression for the probability pr in terms of the rth derivative of G evaluated at z = 0:
pr = G^{(r)}(0) / r! ,   r = 0, 1, 2, . . . .
In practice it is usually easier to find the power series expansion of G and extract pr as the
coefficient of z r .
Whereas the probabilities are related to the derivatives of G at z = 0, it turns out that
the moments of X are related to the derivatives of G at z = 1. To see this, note that
G′(z) = p1 + 2zp2 + 3z²p3 + · · · = Σ_{i=1}^{∞} i z^{i−1} pi , so that G′(1) = Σ_{i=1}^{∞} i pi = E(X). Thus, we have
E(X) = G′(1) .
Further, G′′(z) = Σ_{i=2}^{∞} i(i − 1) z^{i−2} pi , so that G′′(1) = Σ_{i=2}^{∞} i(i − 1) pi = E{X(X − 1)}. But we can write Var(X) = E(X²) − {E(X)}² = E{X(X − 1)} + E(X) − {E(X)}², from which
Var(X) = G′′(1) + G′(1) − {G′(1)}² .
Example 3.1 Let X ∼ Poi(µ). Find the pgf G(z) = E(z X ). Also, verify the above formulae
for the expectation and variance of X.
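Since Example 3.1 only requires routine differentiation, a symbolic check is easy. The sketch below assumes the Poisson pgf G(z) = exp(µ(z − 1)) (which the example derives) and applies the formulas E(X) = G′(1) and Var(X) = G′′(1) + G′(1) − {G′(1)}².

import sympy as sp

z, mu = sp.symbols('z mu', positive=True)
G = sp.exp(mu * (z - 1))                          # pgf of Poi(mu)

EX = sp.diff(G, z).subs(z, 1)                     # E(X) = G'(1)
VarX = sp.diff(G, z, 2).subs(z, 1) + EX - EX**2   # Var(X) = G''(1) + G'(1) - G'(1)^2

print(sp.simplify(EX), sp.simplify(VarX))         # both simplify to mu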
Example 3.2 Suppose that a random variable X has pgf
G(z) = (1 − p + pz)^n ,
where 0 < p < 1 and n ≥ 1 is an integer. Find the power expansion of G(z) and hence
derive the distribution of the random variable X that has this pgf.
3.2.1 Definition
Another function of special interest, particularly for continuous variables, is the moment
generating function (mgf) M (s) of X, defined as
M (s) ≡ E(e^{sX}) = ∫_{−∞}^{∞} e^{sx} fX (x) dx .
The moment generating function does not necessarily exist for all s ∈ R, i.e. the integral
might be infinite. However, we assume for the following that M (s) is finite for s in some
open interval containing zero.
The mgf can be expanded by writing e^{sX} = Σ_{n=0}^{∞} (sX)^n / n!, taking expectations and integrating term by term (there are no convergence problems here due to assuming finiteness).
It follows that
M (s) = Σ_{n=0}^{∞} E(X^n) s^n / n! .
Thus M (s) is a power series in s and the coefficient of sn is E(X n )/n! – hence the name
‘moment generating function’.
Again, from the theory of Taylor expansions the rth derivative, M (r) (0), of M (s) at s = 0
must therefore equal the rth (raw) moment E(X r ) of X. In particular we have M ′ (0) = E(X)
and M ′′ (0) = E(X 2 ), so that
E(X) = M ′ (0)
and
Var(X) = M ′′ (0) − {M ′ (0)}2
(Alternatively, and more directly, note that M ′ (s) = E(XesX ) and M ′′ (s) = E(X 2 esX ) and
set s = 0.) Note also that M (0) = E(e0 ) = 1.
It can be shown that if the moment generating function exists on an open interval including
zero then it uniquely determines the distribution.
The pgf tends to be used more for discrete distributions and the mgf for continuous ones,
although note that when X takes nonnegative integer values then the two are related by
M (s) = E(esX ) = E{(es )X } = G(es ).
Example 3.3 Suppose that X has a gamma distribution with parameters (α, λ). Find the
mgf of X. Use this to derive the expectation and variance of X.
Example 3.4 Let X ∼ N (µ, σ 2 ) be a normal variable. Find the mgf of X. Use this mgf to
obtain the expectation and variance of X.
Suppose that Y = a + bX and that we know the mgf MX (s) of X. What is the mgf of Y ?
We have
MY (s) = E(esY ) = E{es(a+bX) } = esa E(esbX ) = eas MX (bs) .
We can therefore easily obtain the mgf of any linear function of X from the mgf of X.
Example 3.5 (3.4 ctd.) Use the mgf of X to find the distribution of Y = a + bX.
A more general concept is that of the characteristic function. This is defined in a similar
way to the mgf and has similar properties but involves complex variables. The main advantage
over the moment generating function is that the characteristic function of a random variable
always exists. However, we will not consider it here.
So far we have considered the pgf or mgf of a single real variable. The joint distribution of
a collection of random variables X1 , . . . , Xn can be characterised in a similar way by the
joint generating functions:
G(z1 , . . . , zn ) = E(z1^{X1} · · · zn^{Xn}) ,    M (s1 , . . . , sn ) = E(e^{s1 X1 + · · · + sn Xn}) = E(e^{s^T X}) .
In both cases (pgf and mgf) we find that if X1 , . . . , Xn are independent random variables
then the pgf / mgf are given as the product of the individual pgfs / mgfs. (Recall: E(XY ) =
E(X)E(Y ) when X, Y are independent.)
The above property can be used to characterise independence because it can be shown that
the factorisation of the joint mgf holds if and only if the variables are independent.
Marginal mgfs
It is straightforward to see that if MX,Y (s1 , s2 ) is the joint mgf of X, Y then the marginal
mgf of X is given by MX (s1 ) = MX,Y (s1 , 0).
(Proof: E(es1 X ) = E(es1 X+0.Y ) = MX,Y (s1 , 0).)
Higher Moments
The joint moment generating function can further be useful to find higher moments of a
distribution. More precisely, we can compute E(Xi^r Xj^k) in the following way:
∂^r M / ∂si^r = E(Xi^r e^{sᵀX}),
∂^{r+k} M / (∂si^r ∂sj^k) = E(Xi^r Xj^k e^{sᵀX}),
so that evaluating these derivatives at s = 0 gives E(Xi^r) and E(Xi^r Xj^k) respectively.
Example 3.6 Suppose X1 , . . . , Xn are jointly multivariate normally distributed. Then, from
equation (1.8) in §1.5.3, the density of X = (X1 , . . . , Xn ) is given in matrix notation by
fX(x) = |2πΣ|^{−1/2} exp{ −(1/2)(x − µ)ᵀ Σ^{−1} (x − µ) } ,
where
x = (x1 , . . . , xn )T , µ = (µ1 , . . . , µn )T , Σ = covariance matrix
i.e. the µi = E(Xi ) are the individual expectations and the σij = (Σ)ij = Cov(Xi , Xj ) are
the pairwise covariances (variances if i = j).
The joint mgf is M(s) = E(e^{sᵀX}) = ∫ e^{sᵀx} fX(x) dx. (Remember that the integral here represents
an n-dimensional integral.) To evaluate this integral we need to complete the square in the
exponent. The result (derived in lectures) is that
M(s1, . . . , sn) = exp{ sᵀµ + (1/2) sᵀΣs } .
Exercise: from this derive the mgf of the univariate N (µ, σ 2 ) distribution.
For illustration, we now derive the joint moment E(Xi Xj ). Differentiate first with respect to
si . Since
sᵀµ = Σ_{k=1}^n sk µk   and   sᵀΣs = Σ_{k=1}^n Σ_{l=1}^n sk sl σkl ,
the terms of sᵀΣs involving si are si² σii + 2 si Σ_{l≠i} sl σil , giving
∂(sᵀΣs)/∂si = 2 si σii + 2 Σ_{l≠i} sl σil = 2 Σ_l sl σil .
Therefore
E(Xi e^{sᵀX}) = ∂M(s)/∂si = ( µi + Σ_{l=1}^n σil sl ) exp{ sᵀµ + (1/2) sᵀΣs } .
Now differentiate again with respect to sj , j ̸= i, to give
E(Xi Xj e^{sᵀX}) = { σij + ( µi + Σ_{l=1}^n σil sl )( µj + Σ_{l=1}^n σjl sl ) } exp{ sᵀµ + (1/2) sᵀΣs } .
Setting s = 0 then gives E(Xi Xj) = σij + µi µj . Exercise: find E(X1 X2) (i) directly and (ii) from the joint mgf.
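As an optional numerical sanity check (with illustrative values, not from the notes), the identity E(Xi Xj) = σij + µi µj can be verified by Monte Carlo simulation for an assumed bivariate normal:

import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, 2.0])                    # assumed illustrative mean vector
Sigma = np.array([[2.0, 0.7],
                  [0.7, 1.5]])               # assumed illustrative covariance matrix

X = rng.multivariate_normal(mu, Sigma, size=1_000_000)
print(np.mean(X[:, 0] * X[:, 1]))            # Monte Carlo estimate of E(X1 X2)
print(Sigma[0, 1] + mu[0] * mu[1])           # sigma_12 + mu_1*mu_2 = 2.7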
We will now use the above methods to derive (properties of) the distribution of linear com-
binations of random variables.
Y = a1 X 1 + · · · + an X n
for any real-valued constants a1 , . . . , an . A popular linear combination is for example the
sample mean Y = X, for which ai = 1/n, i = 1, . . . , n.
Let us first find the expectation and variance of the linear combination Y in terms of the
moments of the Xi . The methods from §1.4 can be used for this purpose. First,
E(Y) = Σi ai E(Xi) ,
regardless of whether or not the Xi are independent. For the variance we have
Var(Y) = Cov( Σi ai Xi , Σj aj Xj )
= Σi Σj ai aj Cov(Xi, Xj)
= Σi ai² Var(Xi) + Σ_{i≠j} ai aj Cov(Xi, Xj).
In vector notation, with µ = E(X) and Σ = Cov(X), these become E(Y) = aᵀµ and Var(Y) = aᵀΣa .
Now we want to find out about the actual distribution of Y . If we have the joint distribution
of X1 , . . . , Xn we could proceed by transformation similar to §2.3, but this could be very
tedious if n is large. Instead, let us explore an approach based on the joint pgf / mgf of
the Xi . (The result below for the mgf is actually just a special case of the earlier linear
transformation property of a joint mgf.)
So, an alternative to the transformation method is to obtain the joint pgf or mgf and use this
to derive the pgf or mgf of Y . Of course, we still have to get from there to the probability
mass function or density if needed — but often the generating function of the univariate Y
is known to belong to a specific distribution.
If the Xi are independent then
MY(s) = E(e^{sY}) = E(e^{s(a1 X1 + ··· + an Xn)}) = Πi MXi(s ai) ,
which is the product of the individual mgfs MXi(s ai) of the Xi , i.e. the joint mgf evaluated at si = s ai .
Example 3.9 (3.3 ctd.) Let X1 , . . . , Xn be independent with Xi ∼ Gam(αi , λ). Find the
mgf and the distribution of Y = Σi Xi .
Example 3.10 (3.6 ctd.) Suppose X1 , . . . , Xn are jointly multivariate normally distributed.
Consider again the linear transformation Y = a1 X1 + · · · + an Xn , or Y = aT X in vector
notation, where a = (a1 , . . . , an )T , X = (X1 , . . . , Xn )T . It follows from the general result
that the mgf of Y is
MY(s) = MX(sa) = E{exp(s aᵀX)} = exp{ s aᵀµ + (1/2) s² aᵀΣa }
from §3.3. By comparison with the univariate mgf (see Example 3.4) we see that
Y = aT X ∼ N (aT µ, aT Σa)
We have seen earlier that for any random vector X we have E(aT X) = aT µ, Var(aT X) =
aT Σa, so the importance of the foregoing result is that any linear combination of jointly
normal variables is itself normally distributed even if the variables are correlated (and thus
not independent).
Example 3.11 (3.7 ctd.) Use the joint mgf of X1 and X2 to find the distribution of Y =
X1 − X 2 .
Possibly the most important theorem in statistics is the Central Limit Theorem which enables
us to approximate the distribution of sums of independent random variables. One of the most
popular uses is to make statements about the sample mean X̄ which, after all, is nothing but
a scaled sum of the samples.
The Central Limit Theorem states that if X1 , X2 , . . . are i.i.d. random variables with mean µ and
finite variance σ² > 0, and X̄n denotes the sample mean of X1 , . . . , Xn , then
(√n/σ)(X̄n − µ) →d N(0, 1)   as n → ∞.
Here, →d denotes convergence in distribution, which means that as n → ∞ the cdf of the
random variable on the left converges to the cdf of the random variable on the right at each
point. Writing this down more carefully, let Yn = (√n/σ)(X̄n − µ). Then the statement simply
means that limn→∞ FYn(x) = Φ(x) holds for any x ∈ R, where Φ is the standard normal cdf.
We can use the central limit theorem to compute probabilities about X n from
P(a < X̄n ≤ b) ≈ Φ( √n(b − µ)/σ ) − Φ( √n(a − µ)/σ ) .
NB. There are many generalisations of the theorem where the assumptions of independence,
common distribution or finite variance of the Xi are relaxed.
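As an optional illustration of the approximation (with assumed numbers, not taken from the notes), the following Python sketch compares the CLT value of P(0.9 < X̄n ≤ 1.1) for n = 50 i.i.d. Exp(1) variables (so µ = σ = 1) with a Monte Carlo estimate:

import numpy as np
from scipy.stats import norm

n, mu, sigma = 50, 1.0, 1.0
a, b = 0.9, 1.1

# CLT approximation from the display above
clt_approx = norm.cdf(np.sqrt(n) * (b - mu) / sigma) - norm.cdf(np.sqrt(n) * (a - mu) / sigma)

# Monte Carlo estimate of the same probability
rng = np.random.default_rng(1)
xbars = rng.exponential(scale=1.0, size=(200_000, n)).mean(axis=1)
mc_estimate = np.mean((xbars > a) & (xbars <= b))

print(clt_approx, mc_estimate)   # both roughly 0.52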
From next to no assumptions (independence and identical distribution with finite variance)
we arrive at a phenomenally strong conclusion: The sample mean asymptotically follows a
Gaussian distribution. Regardless of which particular distribution the Xi follow (exponential,
Bernoulli, binomial, uniform, triangle, ...), the limiting distribution of the standardised sample mean is always Gaussian! It is
as though the whole of statistics and probability theory reduces to one distribution only!!
Example 3.13 Consider a sequence of independent Bernoulli trials with constant probability
of success π, so that P (Xi = 1) = π, i = 1, 2, . . .. Use the Central Limit Theorem with
µ = π and σ 2 = π(1 − π) to derive the normal approximation to the binomial distribution.
One of the disadvantages of the Central Limit Theorem is that in practical situations, one
can never be quite sure just how large n needs to be for the approximation to work well.
As an illustration, consider the sum of independent uniformly distributed random variables.
Approximate probability density functions (obtained through simulation and histograms - if
you want to know more about how this works, take STAT0023) are shown in Figures 3.1
and 3.2 below. In the case of U (0, 1) variables, it is clear that for n ≥ 5, the difference in
the pdf is nearly invisible. Any rule of thumb as to how large n needs to be suffers from
mathematicians’ curse: since the actual distribution of the Xi is assumed unknown, one can
always find a distribution for which convergence has not yet occurred for the particular value
of n considered!
Figure 3.1: Estimated probability density functions standardised to mean zero and variance
one. Top left: X1 ; top right: X1 + X2 ; bottom left: X1 + X2 + X3 ; bottom right: Σ_{i=1}^5 Xi .
Figure 3.2: Estimated cumulative distribution functions standardised to mean zero. Note that for
n ≥ 5, the cdfs are essentially indistinguishable from the standard normal.
Example 3.14 (Sums of i.i.d. uniformly distributed random variables) Let Xi ∼ U (0, 1)
be independent random variables. Compute the approximate distribution of Y = Σ_{n=1}^{100} Xn
and the probability that Y < 60.
Finally, as increasingly faster computers become available at steadily decreasing cost, the
Central Limit Theorem loses some of its practical importance. It is still regularly being used,
however, as an easy first guess.
We prove the theorem only in the case when the Xi have a moment generating function
defined on an open interval containing zero. Let us first suppose that E(X) = µ = 0. We
then know that E(X̄n) = 0 and Var(X̄n) = σ²/n.
Consider the standardised variable Zn = √n X̄n/σ. Then E[Zn] = 0 and Var(Zn) = 1
for all n. Note that Zn is a linear combination of the Xi , since
Zn = √n Σ_{i=1}^n Xi / (σn) = Σ_{i=1}^n Xi / (σ√n) .
Thus, by the arguments from the previous section, taking ai = 1/(σ√n), the mgf of Zn is
given by
MZn(s) = Π_{i=1}^n MXi( s/(σ√n) ) = { MX( s/(σ√n) ) }^n
since the Xi all have the same distribution and thus the same mgf MX .
As the distribution of the Xi is not given we don’t know the mgf MX . However, MX has
the following Taylor series expansion about 0:
MX(t) = MX(0) + t MX′(0) + (t²/2) MX′′(0) + εt ,        (3.1)
where εt is a term for which we know that εt /t2 → 0 for t → 0. We write this as εt = o(t2 )
(‘small order’ of t2 , meaning that εt tends to zero at a rate faster than t2 ).
To make use of the above Taylor series expansion, note that (recalling that µ = 0)
MX(0) = E[e^{0·X}] = 1
MX′(0) = E[Xi] = 0
MX′′(0) = E[Xi²] = σ².
Inserting these values in (3.1) and replacing t by s/(σ√n) we find that
MX( s/(σ√n) ) = 1 + 0 + (σ²/2)( s/(σ√n) )² + o(1/n) = 1 + s²/(2n) + o(1/n) .
Now back to Zn : from earlier, the mgf MZ of Zn is now given as the nth power of the above
and we find that
MZn(s) = { MX( s/(σ√n) ) }^n = { 1 + s²/(2n) + o(1/n) }^n
= { 1 + ((1/2)s² + δn)/n }^n −→ e^{s²/2}
as n → ∞, where δn → 0. The limit e^{s²/2} is the mgf of the standard normal distribution (see Example 3.4).
It can be shown that convergence of the moment generating functions implies convergence
of the corresponding distribution functions at all points of continuity. This proves the claim
in the case µ = 0.
In the case that E(Xi ) = µ ̸= 0 define Yi = Xi − µ. Then E(Yi ) = 0 so the result already
proved gives Zn = √n(X̄ − µ)/σ = √n Ȳ/σ →d N(0, 1) as required. □
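The limiting step { 1 + s²/(2n) + o(1/n) }^n → e^{s²/2} can also be confirmed symbolically; the following optional sympy sketch checks the simplest version of it (with the o(1/n) term omitted):

import sympy as sp

s = sp.symbols('s', real=True)
n = sp.symbols('n', positive=True)

expr = (1 + s**2 / (2 * n))**n     # the mgf of Zn up to the o(1/n) term
print(sp.limit(expr, n, sp.oo))    # exp(s**2/2), the standard normal mgf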
Learning Outcomes:
1. Reproduce the definitions of a pgf / mgf and derive their main properties;
2. Compute and recognise the pgf / mgf for standard situations;
3. Use the pgf to find the pmf of a discrete random variable as well as its expectation
and variance for standard cases;
4. Use the mgf to find the moments of random variables, in particular the mean and
variance, for standard cases.
4.1 Motivation
The Central Limit Theorem (CLT) implies that the normal distribution arises in, or is a good
approximation to, many practical situations. Also, the case of a random sample X1 , . . . , Xn
from a normal population N (µ, σ 2 ) is favourable because many expressions are available in
explicit form which is probably at least as important a reason for the widespread use of the
normal approximation as the CLT. The random sample from a normal population was treated
in STAT0003 and the estimators X and S 2 for the mean and variance were exhibited as well
as some of their properties shown. In this chapter, we will develop a more careful discussion
of the properties and go beyond STAT0003 by establishing results on the joint distribution
of these two estimators. Finally, the t- and F-distributions are derived and discussed in order
to have them ready for applications in statistics (such as estimation and hypothesis testing).
Given a random sample X1 , . . . , Xn from the N(µ, σ²) population, the sample mean
X̄ = (1/n) Σ_{i=1}^n Xi
is known (from STAT0003 or equivalent) to be unbiased, i.e. E[X] = µ and to have variance
Var(X) = σ 2 /n. To estimate the variance, σ 2 , in STAT0003 you used the sample variance
S² = (1/(n − 1)) Σ_{i=1}^n (Xi − X̄)²
which you showed to be unbiased and consistent: the variance of S² goes to zero as the sample
gets larger (i.e. as n → ∞), which together with unbiasedness implies consistency. To obtain
the sampling distribution of this estimator, i.e. the distribution of S 2 thought of as a random
variable itself, it is thus necessary to think about the distributions of sums of iid random
variables as well as the distribution of (Xi − X)2 and its sums, i.e. the sums of squares of
normal random variables. These distributions will occupy us for the remainder of this chapter
whereas the next chapter will consider estimators (and present more detail on unbiasedness,
variance, sampling distribution etc.) as well as address the question whether there could be
any better estimators than sample mean and sample variance.
Preliminaries
Recall the pdf of the Gam(α, λ) distribution (see Appendix 2), its mean and variance E(X) =
α/λ, Var(X) = α/λ², its mgf {λ/(λ − s)}^α (see Example 3.3) and the additive property: if
X1 , . . . , Xn are independent Gam(α, λ) random variables then X1 + · · · + Xn is Gam(nα, λ).
We start by showing that if X has the standard normal distribution then X 2 has the gamma
distribution with index 1/2 and rate parameter 1/2.
The mgf of X² is
M_{X²}(s) = E(e^{sX²}) = ∫_{−∞}^{∞} e^{sx²} (2π)^{−1/2} e^{−x²/2} dx        (4.1)
= (1 − 2s)^{−1/2} ,   s < 1/2,        (4.2)
where (4.2) follows by comparison of (4.1) with the integral of the density of a normal variable
with mean 0 and variance (1 − 2s)^{−1}. (Alternatively, substitute z = x√(1 − 2s) in the integral
(4.1).)
By comparison with the gamma mgf, we see that X 2 has a gamma distribution with index
and rate parameter each having the value 1/2. This distribution is also known as the χ²
distribution with one degree of freedom. (Thus χ²1 ≡ Gam(1/2, 1/2).)
Example 4.1 Verify the result that X 2 ∼ Gam(1/2, 1/2) by using the transformation
ϕ(x) = x² on a mean zero normal random variable X ∼ N(0, 1).
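As an optional numerical check, the identity χ²1 ≡ Gam(1/2, 1/2) can be confirmed with scipy (which parametrises the gamma distribution by shape a and scale = 1/rate, so rate 1/2 corresponds to scale 2):

import numpy as np
from scipy.stats import chi2, gamma

x = np.linspace(0.1, 10, 50)
# pdfs of the chi-squared with 1 df and Gam(1/2, rate 1/2) agree pointwise
print(np.allclose(chi2.pdf(x, df=1), gamma.pdf(x, a=0.5, scale=2.0)))   # True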
It further follows from the mean and variance of the gamma distribution that the mean and
variance of U ∼ χ²ν are (verify)
E(U) = ν ,  Var(U) = 2ν .
The pdf of the χ²ν distribution is
fU(u) = u^{ν/2 − 1} e^{−u/2} / ( 2^{ν/2} Γ(ν/2) )
for u > 0. This can be verified by comparison with the Gam(ν/2, 1/2) density (see Appendix 2).
Question: Could this also be derived using transformation of variables and the convolution formula
fX+Y(z) = ∫_{−∞}^{∞} f(x) f(z − x) dx for the density of a sum of independent random variables?
We know that if X has a normal distribution with mean µ and variance σ 2 then the standard-
ised variable (X − µ)/σ has the standard normal distribution. It follows that if X1 , . . . , Xn
are independent normal variables, all with mean µ and variance σ 2 , then
Σ_{i=1}^n ( (Xi − µ)/σ )² ∼ χ²_n
because the left-hand side is the sum of squares of n independent standard normal random
variables. Since σ² = E(X − µ)², when µ is known the sample average Sµ² = (1/n) Σ_{i=1}^n (Xi − µ)²
is an intuitively natural estimator of σ 2 . The above result gives the sampling distribution of
Sµ² as nSµ² ∼ σ²χ²_n. We can now deduce the mean and variance of Sµ² from those of the χ²_n
distribution:
E(Sµ²) = (σ²/n) E( nSµ²/σ² ) = (σ²/n) × n = σ²
Var(Sµ²) = (σ⁴/n²) Var( nSµ²/σ² ) = (σ⁴/n²) × 2n = 2σ⁴/n
Note that the expectation formula is generally true, even if the Xi are not normal. However,
the variance formula is only true for normal distributions.
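An optional simulation sketch (with arbitrarily chosen illustrative values) of these two formulae for normal data:

import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, reps = 0.0, 2.0, 10, 200_000    # assumed illustrative values

X = rng.normal(mu, sigma, size=(reps, n))
S2_mu = np.mean((X - mu)**2, axis=1)          # the estimator S_mu^2 for each replication

print(S2_mu.mean(), sigma**2)                 # ~4.0 vs 4.0
print(S2_mu.var(), 2 * sigma**4 / n)          # ~3.2 vs 3.2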
However, in order to be able to use this result for statistical inference when σ 2 is also unknown,
we need to estimate σ 2 . An intuitively natural estimator of σ 2 when µ is unknown is the
sample variance S² = (1/(n − 1)) Σi (Xi − X̄)². The reason for the factor 1/(n − 1) rather than 1/n is
that S 2 is unbiased for σ 2 , as we will see in the next chapter. In order to deduce the sampling
distribution of S² we need to find the distribution of the sum of squares Σi (Xi − X̄)².
4.3.1 The distribution of Σi (Xi − X̄)²
It turns out that Σi (Xi − X̄)²/σ² follows a χ² distribution with n − 1 (rather than n)
degrees of freedom. In the following it will become clear why, exactly, we have to reduce the
number of degrees of freedom by one.
Result 4.1:
X̄ and S² = Σi (Xi − X̄)²/(n − 1) are independent.
Result 4.2:
The sampling distribution of S² is given by (n − 1)S²/σ² ∼ χ²_{n−1} .
We do not prove Result 4.1 in this module: it uses "joint" moment generating functions.
To establish Result 4.2, write Xi − µ = (Xi − X̄) + (X̄ − µ) and expand the sum of squares;
the cross terms vanish because Σi (Xi − X̄) = 0, so that
Σi (Xi − µ)²/σ² = Σi (Xi − X̄)²/σ² + (X̄ − µ)²/(σ²/n) .
But, since the (Xi − µ)/σ are independent N(0, 1) and (X̄ − µ)/(σ/√n) is N(0, 1), we have
Σi (Xi − µ)²/σ² ∼ χ²_n   and   (X̄ − µ)²/(σ²/n) ∼ χ²_1 .
Using mgfs, it follows that
Σi (Xi − X̄)²/σ² ∼ χ²_{n−1}
since (X̄ − µ)²/(σ²/n) and Σi (Xi − X̄)²/σ² are independent.
From Results 4.1 and 4.2 we can deduce the sampling mean and variance of S 2 to be
E(S²) = (σ²/(n − 1)) E( (n − 1)S²/σ² ) = (σ²/(n − 1)) × (n − 1) = σ²
Var(S²) = (σ⁴/(n − 1)²) Var( (n − 1)S²/σ² ) = (σ⁴/(n − 1)²) × 2(n − 1) = 2σ⁴/(n − 1) ,
recalling that the mean and variance of χ2n−1 are n − 1, 2(n − 1) respectively. As for the case
where µ is known, the unbiasedness property of S 2 is generally true, even if the Xi are not
normal, but the variance formula is only true for normal distributions.
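Again, these results are easy to check by simulation if you wish (illustrative values only):

import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n, reps = 5.0, 2.0, 8, 200_000     # assumed illustrative values

X = rng.normal(mu, sigma, size=(reps, n))
S2 = X.var(axis=1, ddof=1)                    # sample variance with divisor n - 1

print(S2.mean(), sigma**2)                    # ~4.0 vs 4.0
print(S2.var(), 2 * sigma**4 / (n - 1))       # ~4.57 vs 4.571...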
If X1 , . . . , Xn are independent normally distributed variables with common mean µ and vari-
ance σ 2 then X is normally distributed with mean µ and variance σ 2 /n. Recall (STAT0002
& STAT0003, or equivalent) that if σ 2 is known, then we may test the hypothesis µ = µ0 by
examining the statistic
X − µ0
Z= √ .
σ/ n
Z is a linear transformation of a normal variable and hence is also normally distributed.
When µ = µ0 we see that Z ∼ N (0, 1) so we conduct the test by computing Z and referring
to the N (0, 1) distribution.
When σ² is unknown we replace σ by S and consider instead the statistic
T = (X̄ − µ0)/(S/√n) = [ (X̄ − µ0)/(σ/√n) ] · (σ/S) = Z / √( U/(n − 1) ) ,
where U = Σi (Xi − X̄)²/σ². From the above results we have Z ∼ N(0, 1), U ∼ χ²_{n−1} and
Z, U are independent random variables. (Note that the distribution of T does not depend
on µ0 and σ 2 , but only on the known number n of observations and is therefore suitable as
a test statistic.)
We can now find the probability distribution of T by the transformation method that was
described in chapter 2.2. Alternatively, it can be derived from the F distribution. The
distribution of T , denoted by tn−1 , is known as Student’s t distribution with n-1 degrees
of freedom.
The general description of the t distribution is as follows. Suppose that Z has a standard nor-
mal distribution, U has a χ2 distribution with ν degrees of freedom, and Z, U are independent
random variables. Then
Z
T =p
U/ν
has the Student’s t distribution with ν degrees of freedom, denoted by tν . T has probability
density function
fT(t) = [ Γ((ν + 1)/2) / ( Γ(ν/2) √(νπ) ) ] ( 1 + t²/ν )^{−(ν+1)/2}
for −∞ < t < ∞. The distribution of T is symmetrical about 0, so that E(T) = 0 (for ν > 1).
It can further be shown that the variance of T is ν/(ν − 2) for ν > 2.
Example 4.2 Write down the pdf of a t1 -distribution and identify the distribution by name
(look back at Example 2.2). Why is ν > 1 needed for the expected value to be zero?
Not examined: What happens as the sample size ν becomes large? Derive the limit of the
pdf in this case.
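For those who want to see the construction Z/√(U/ν) in action, here is an optional simulation sketch (ν = 5 is an assumed illustrative value):

import numpy as np
from scipy.stats import t, kstest

rng = np.random.default_rng(4)
nu, reps = 5, 100_000

Z = rng.standard_normal(reps)                 # Z ~ N(0,1)
U = rng.chisquare(df=nu, size=reps)           # U ~ chi^2_nu, independent of Z
T = Z / np.sqrt(U / nu)

print(kstest(T, t(df=nu).cdf).pvalue)         # typically large: consistent with t_5
print(T.var(), nu / (nu - 2))                 # sample variance vs nu/(nu - 2) = 5/3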
Suppose now that X1 , . . . , Xm and Y1 , . . . , Yn are independent random samples from N(µX, σ²X)
and N(µY, σ²Y) respectively, with sample variances S²X and S²Y. Since (m − 1)S²X/σ²X ∼ χ²_{m−1}
and (n − 1)S²Y/σ²Y ∼ χ²_{n−1}, we can write
(S²X/σ²X) / (S²Y/σ²Y) = (U/(m − 1)) / (V/(n − 1)) ,
where U and V are independent χ² variables with m − 1 and n − 1 degrees of freedom respectively.
The above considerations motivate the following general description of the F distribution.
Suppose that U ∼ χ2α , V ∼ χ2β and U, V are independent. Then
W = (U/α) / (V/β)
has the F distribution with α and β degrees of freedom, denoted by Fα,β . Its probability
density function is
fW(w) = [ Γ((α + β)/2) / ( Γ(α/2) Γ(β/2) ) ] (α/β)^{α/2} w^{α/2 − 1} ( 1 + αw/β )^{−(α+β)/2}
for w ≥ 0.
It can be shown that, for all α and for β > 2, E(W ) = β/(β − 2).
We can now apply this result to our test statistic discussed earlier. Under the hypothesis that
σ²X = σ²Y we have (taking α = m − 1, β = n − 1)
S²X / S²Y ∼ F_{m−1,n−1} .
This is the sampling distribution of the variance ratio. Note that this distribution is free from
σX and σY .
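An optional simulation sketch of the variance-ratio result (assumed sample sizes m = 6, n = 10 and equal variances):

import numpy as np
from scipy.stats import f, kstest

rng = np.random.default_rng(5)
m, n, reps = 6, 10, 100_000

SX2 = rng.normal(0.0, 1.0, size=(reps, m)).var(axis=1, ddof=1)
SY2 = rng.normal(0.0, 1.0, size=(reps, n)).var(axis=1, ddof=1)
ratio = SX2 / SY2                              # the variance ratio S_X^2 / S_Y^2

print(kstest(ratio, f(dfn=m - 1, dfd=n - 1).cdf).pvalue)   # typically large: consistent with F_{5,9}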
Example 4.3 Suppose that X, Y, U are independent random variables such that X ∼
N (2, 9), Y ∼ t4 and U ∼ χ23 . Give four functions of the above variables that have the
following distributions:
(i) χ21 , (ii) χ24 , (iii) t3 , (iv) F1,4 . □
Learning Outcomes: Chapter 4 derives some fundamental results for statistical theory
using the methods presented in Chapters 1 – 3. It should therefore consolidate your
usage of and familiarity with those methods. In particular, you should be able to
2. State the main properties of the above distributions (mean, variance, shape, meaning of parameters),
3. State the relationship between the chi–squared and gamma distributions,
4. Remember the relationship between X and S 2 and sketch how it is derived,
5. Show how the chi–squared distribution can be derived using the properties of
mgfs;
Chapter 5
Statistical Estimation
5.1 Overview
We will now address the problem of estimating an unknown population parameter based on
a sample X1 , . . . , Xn . In particular, we focus on how to decide whether a potential estimator
is a ‘good’ one or if we can find a ‘better’ one. To this end, we need to define criteria
for good estimation. Here we present two obvious criteria that state that a good estimator
should be ‘close’ to the true unknown parameter value (accuracy) and should have as little
variation as possible (precision). We then go on to describe some methods that typically
yield good estimators in the above sense. Results from the previous sections may be used to
derive properties of estimators, which can be viewed as transformations of the original sample
variables X1 , . . . , Xn .
An estimator Tn = T(X1 , . . . , Xn) of θ is a function of the sample variables only. Since Tn is
a random variable it has a probability distribution, called the sampling distribution of the
estimator. The properties of this distribution determine whether or not Tn is a ‘good’ estimator of θ.
The difference E(Tn ) − θ = bTn (θ) is called the bias of the estimator Tn . If bTn (θ) ≡ 0 ( i.e.
bTn (θ) = 0 for all values of θ) then Tn is an unbiased estimator of θ. Although we tend to
regard unbiasedness as desirable, there may be a biased estimator Tn′ giving values that tend
to be closer to θ than those of the unbiased estimator Tn .
If we want an estimator that gives estimates close to θ we might look for one with small
mean square error (mse), defined as
mse(Tn; θ) = E(Tn − θ)².
Note that
mse(Tn, θ) = E{Tn − E(Tn) + E(Tn) − θ}²
= E{Tn − E(Tn)}² + {E(Tn) − θ}² + 2{E(Tn) − θ} E{Tn − E(Tn)}
= Var(Tn) + {bTn(θ)}²
since E{Tn − E(Tn)} = 0. We therefore have the relation
mse(Tn, θ) = Var(Tn) + {bTn(θ)}².
We see that minimising the mean square error involves a trade-off between small variance and small
bias.
Example 5.1 Let X1 , . . . , Xn be an iid sample from a normal distribution with mean µ and
variance σ². Then S² is unbiased for σ², whereas S̃² = (1/n) Σi (Xi − X̄)² is biased for σ².
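An optional simulation of the bias and variance trade-off in Example 5.1 (illustrative values): the divisor-n estimator is biased but here has the smaller mean square error.

import numpy as np

rng = np.random.default_rng(6)
sigma2, n, reps = 4.0, 10, 200_000            # assumed illustrative values

X = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
S2 = X.var(axis=1, ddof=1)                    # unbiased estimator S^2
S2_tilde = X.var(axis=1, ddof=0)              # biased divisor-n estimator

print(np.mean((S2 - sigma2)**2))              # mse of S^2        ~ 3.56
print(np.mean((S2_tilde - sigma2)**2))        # mse of S2_tilde   ~ 3.04, smaller despite the bias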
5.2.1 Terminology:
The standard deviation of an estimator is also called its standard error. When the standard
deviation is estimated by replacing the unknown θ by its estimate, it is the estimated standard
error – but often just referred to as the ‘standard error’.
Example 5.2 (Example 5.1 ctd.) S² has standard deviation √( 2σ⁴/(n − 1) ). So S² is an
estimator of σ² with standard error s² √( 2/(n − 1) ).
5.3.1 Overview
We have discussed some desirable properties that estimators should have, but so far have only
‘inspired guesswork’ to find them. We need an objective method that will tend to produce
good estimators.
In general, for a sample X1 , . . . , Xn from a density with k unknown parameters, the method
of moments (Karl Pearson 1894) is to calculate the first k sample moments and equate
them to their theoretical counterparts (in terms of the unknown parameters), giving a set of k
simultaneous equations in k unknowns for solution. This procedure requires no distributional
assumptions. However, it might sometimes be helpful to assume a specific distribution in
order to express the theoretical moments in terms of the parameters to be estimated. The
moments may be derived using the mgf, for example.
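As an optional illustration of the method (not one of the examples below), here is a small sketch of the moment estimators for a Gam(α, λ) sample: equating x̄ and the sample variance v to E(X) = α/λ and Var(X) = α/λ² gives α̂ = x̄²/v and λ̂ = x̄/v.

import numpy as np

rng = np.random.default_rng(7)
alpha_true, lam_true = 3.0, 2.0               # assumed illustrative values
x = rng.gamma(shape=alpha_true, scale=1.0 / lam_true, size=100_000)

xbar = x.mean()
v = x.var()                                    # second central sample moment (divisor n)

alpha_hat = xbar**2 / v
lam_hat = xbar / v
print(alpha_hat, lam_hat)                      # close to 3 and 2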
Example 5.4 (Example 5.1 ctd.) What are the moment estimators of µ and σ 2 given an
i.i.d. normal sample?
The method of least squares estimates a parameter by minimising a sum of squared deviations from it.
Example 5.5 Let X1 , . . . , Xn be a random sample with common mean E(Xi) = µ. Find
the least squares estimator of µ.
Note that no other assumptions are required — the sample does not need to be independent
or identically distributed and no distributional assumptions are needed. The properties of
the estimator, however, will depend on the probability distribution of the sample.
Least squares is commonly used for estimation in (multiple) regression models. For example,
the straight-line regression model is
E(Xi ) = α + βzi
where the zi are given values. The least squares estimators of α and β are obtained by solving
the equations ∂R/∂α = ∂R/∂β = 0, where R = Σi (Xi − α − βzi)².
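An optional numerical sketch of least squares for the straight-line model (data simulated with assumed true values α = 1, β = 2 and unit-variance normal errors):

import numpy as np

rng = np.random.default_rng(8)
z = np.linspace(0, 10, 50)
X = 1.0 + 2.0 * z + rng.normal(0.0, 1.0, size=z.size)

A = np.column_stack([np.ones_like(z), z])      # design matrix for (alpha, beta)
(alpha_hat, beta_hat), *_ = np.linalg.lstsq(A, X, rcond=None)
print(alpha_hat, beta_hat)                     # close to 1 and 2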
BLUE property. In linear models least squares estimators have minimum variance amongst
all estimators that are linear in X1 , . . . , Xn and unbiased – best linear unbiased estimator.
(The Gauss-Markov Theorem.)
When X1 , . . . , Xn are i.i.d. with density (or mass function) fX (xi ; θ) — where now we make
explicit the dependence of the function on the unknown parameter(s) θ (which can be a
vector) — suppose that we observe the sample values x1 , . . . , xn . Then the joint density or
joint probability of the sample is
L(θ) = Πi fX(xi; θ) ,
which, regarded as a function of θ for the observed values x1 , . . . , xn , is called the likelihood function.
The method of maximum likelihood estimates θ by that value, θ̂M L , which maximises
L(θ) over the parameter space Θ, i.e. θ̂M L = arg max_{θ∈Θ} L(θ).
Since, for a sample of independent observations, the likelihood is a product of the likelihoods
for the individual observations, it is often more convenient to maximise the log-likelihood
ℓ(θ) = log L(θ) :
ℓ(θ) = log L(θ) = log Πi fX(xi; θ) = Σi log fX(xi; θ).
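In practice the log-likelihood is often maximised numerically. The following optional sketch does this for the shape parameter of a Gam(α, 1) sample (an assumed illustration for which no closed-form maximiser exists):

import numpy as np
from scipy.stats import gamma
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(9)
x = rng.gamma(shape=2.5, scale=1.0, size=5_000)   # assumed true shape 2.5

def neg_loglik(alpha):
    # minus the log-likelihood l(alpha) = sum_i log f(x_i; alpha)
    return -np.sum(gamma.logpdf(x, a=alpha, scale=1.0))

result = minimize_scalar(neg_loglik, bounds=(0.1, 10.0), method='bounded')
print(result.x)                                   # close to 2.5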
Differentiation gives only local maxima and minima, so that even if the second derivative
d2 ℓ(θ)/dθ2 is negative (corresponding to a local maximum, rather than minimum) it is still
possible that the global maximum is elsewhere. The global maximum will be either a local
maximum or achieved on the boundary of the parameter space. (In Example 5.7 there is a
single local maximum and the likelihood vanishes on the boundary of Θ.)
Example 5.8 Let X1 , . . . , Xn be an i.i.d. sample from a Uniform distribution on [0, θ]. Find
the ML estimator θ̂M L of θ.
NOTE: The ML estimator cannot be found by differentiating in this case. Why?
Example 5.9 Let X1 , . . . , Xn be an i.i.d. sample from a Uniform distribution on [θ, θ + 1].
Then the ML estimator of θ is not unique.
Example 5.10 [Example 5.7 ctd.] Let X1 , . . . , Xn be an i.i.d. sample of Exponential vari-
ables with parameter λ. Find the ML estimator of µ = E(Xi ).
Example 5.10 illustrates an important result, that if we reparameterise the distribution using
particular functions of the original parameters, then the maximum likelihood estimators of
the new parameters are the corresponding functions of the maximum likelihood estimators of
the original parameters. This is easy to see, as follows.
Suppose that ϕ = g(θ) where θ is a (single) real parameter and g is an invertible function.
Then, letting L̃(ϕ) be the likelihood function in terms of ϕ, we have L̃(ϕ) = L(g⁻¹(ϕ)) = L(θ),
and so
dL̃(ϕ)/dϕ = (dL(θ)/dθ) (dθ/dϕ) .
Assuming the existence of a local maximum, the ML estimator ϕ̂M L of ϕ occurs when
dL̃(ϕ)/dϕ = 0. Since g is assumed invertible we have dϕ/dθ ̸= 0 so it follows that
dL(θ)/dθ = 0, and hence L̃(ϕ) is maximised when θ = g⁻¹(ϕ) = θ̂M L . It follows
that ϕ̂M L = g(θ̂M L ). (This result also generalises to the case when θ is a vector.)
Maximum likelihood estimators can be shown to have good properties, and are often used.
Under fairly general regularity conditions it can be shown that the maximum likelihood esti-
mator (mle) exists and is unique for sufficiently large sample size and that it is consistent;
that is, θ̂M L → θ, the true value of θ, ‘in probability’. This means that when the sample size
is large the ML estimator of θ will (with high probability) be close to θ.
For a single observation X with density fX(x; θ), Fisher's information is defined as
i(θ) = E{ −∂²/∂θ² log fX(X; θ) } .
Note that if X1 , . . . , Xn are i.i.d. with density f(x; θ) then the joint density is fX(X; θ) = Πi f(Xi; θ). Therefore
∂²/∂θ² log fX(X; θ) = Σi ∂²/∂θ² log f(Xi; θ) ,
and taking expectations shows that the Fisher information of the whole sample is I(θ) = n i(θ),
where i(θ) is Fisher's information in a single observation. This illustrates the additive property
of Fisher information; information increases with increasing sample size.
The result is that θ̂M L is asymptotically normally distributed with mean θ and variance
1/I(θ) where, for an i.i.d. sample, I(θ) = n i(θ) = nE{−∂²/∂θ² log f(X1; θ)} as before. That is,
θ̂M L ∼ N(θ, 1/I(θ)) approximately for large n. More formally,
√(I(θ)) (θ̂M L − θ) →d N(0, 1)
as n → ∞.
In addition, since the asymptotic probability distribution is known and normal, we can use
it to construct tests or confidence intervals for θ. However, notice that the variance involves
Fisher’s information I(θ), which is unknown since we do not know θ. But it can be shown
that the above asymptotic result remains true if we estimate I(θ) by simply plugging in the
ML estimator of θ; that is, use I(θ̂M L ).
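An optional sketch of the plug-in idea for the Poisson model (an assumed illustration, not one of the examples here): symbolically, i(θ) = E[X/θ²] = 1/θ, so I(θ) = n/θ and an approximate 95% interval for θ is θ̂ ± 1.96 √(θ̂/n), where θ̂ is the sample mean (the ML estimator in this model).

import numpy as np
import sympy as sp

# Fisher information for one Poisson observation, computed symbolically
theta, x = sp.symbols('theta x', positive=True)
logpmf = x * sp.log(theta) - theta - sp.log(sp.factorial(x))
i_theta = -sp.diff(logpmf, theta, 2)             # = x/theta^2
print(sp.simplify(i_theta.subs(x, theta)))       # E[X] = theta gives i(theta) = 1/theta

# Plug-in approximate 95% confidence interval from a simulated sample
rng = np.random.default_rng(10)
sample = rng.poisson(lam=3.0, size=500)          # assumed true mean 3
theta_hat = sample.mean()
se = np.sqrt(theta_hat / sample.size)            # 1/sqrt(I(theta_hat))
print(theta_hat - 1.96 * se, theta_hat + 1.96 * se)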
NB: Although we have discussed the method of maximum likelihood in the context of an
i.i.d. sample, this is not required. All that is needed is that we know the form of the joint
density or probability mass function for the data.
Example 5.11 (Example 5.7 ctd.) Obtain the asymptotic distribution of λ̂M L .
Learning Outcomes: Chapter 5 presents the essentials of the theory of statistical estima-
tion. It is important that you are able to
6.1 Introduction
Suppose that B1 , B2 , B3 form a partition of the sample space, so that any event A is the
disjoint union of A ∩ B1 , A ∩ B2 and A ∩ B3 ; thus
P(A) = P(A ∩ B1) + P(A ∩ B2) + P(A ∩ B3).
We also know from the definition of conditional probabilities that if each of the Bi has
non-zero probability, then
P(A ∩ Bi) = P(A|Bi)P(Bi), so that P(A) = Σi P(A|Bi)P(Bi) (the law of total probability).
Example 6.1 (Two Face) The DC comic book villain Two-Face often uses a coin to decide
the fate of his victims. If the result of the flip is tails, then the victim is spared, otherwise
the victim is killed. It turns out he actually randomly selects from three coins: a fair one,
one that comes up tails 1/3 of the time, and another that comes up tails 1/10 of the time.
What is the probability that a victim is spared?
Sometimes we also want to compute P(Bi|A), and a bit of algebra gives the following formula,
in the case i = 3:
P(B3|A) = P(B3 ∩ A) / P(A)
= P(A|B3)P(B3) / [ P(A|B1)P(B1) + P(A|B2)P(B2) + P(A|B3)P(B3) ].
Example 6.2 (Example 6.1 ctd.) Suppose that the victim was spared. Then what is the
probability that the fair coin was used?
In Bayesian statistics, rather than thinking of the parameter as a fixed unknown constant, we think of it as
a random variable having some unknown distribution. Let (fθ)θ∈∆ be a family of pdfs. Let
Θ be a random variable with pdf r taking values in ∆. Here r is called prior pdf for Θ; we
do not really know the true pdf for Θ, and this is a subjective assignment or guess based on
our present knowledge or ignorance. We think of f (x1 ; θ) = f (x1 |θ) as the conditional pdf
of a random variable X1 that can be generated in the following two step procedure: First,
we generate Θ = θ, then we generate X1 with pdf fθ . In other words, we let the joint pdf of
X1 and Θ be given by
f (x1 |θ)r(θ).
That is, X1 | θ ∼ f(x1|θ) and Θ ∼ r(θ).
Similarly, we say that X = (X1 , . . . , Xn ) is an i.i.d. random sample from the conditional
distribution of X1 given Θ = θ if X1 |θ ∼ f (x1 |θ) and
L(x; θ) = L(x|θ) = Π_{i=1}^n f(xi|θ);
the joint pdf of X and Θ is then
j(x, θ) = L(x|θ) r(θ).
Thus the random sample X can be sampled on a computer, by sampling Θ, and then upon
knowing that Θ = θ, we draw a random sample from the pdf fθ .
What we are interested in is updating our knowledge or belief about the distribution of Θ,
after observing X = x; more precisely, we consider
s(θ|x) = j(x, θ) / fX(x) = L(x|θ) r(θ) / fX(x) ,
where fX is the pdf of X (alone), which can be obtained by integrating or summing the joint
density j(x, θ) with respect to θ. We call s the posterior pdf. Thus ‘prior’ refers to our
knowledge of the distribution of Θ prior to our observation X and ‘posterior’ refers to our
knowledge after our observation of X.
Let us remark on our notation. Earlier in the course, we used Θ to denote the set of possible
parameter values; here, we use ∆ to denote this set, since we are reserving Θ to be a random
variable taking values θ ∈ ∆. We use r to denote the prior pdf, the next letter s to denote
the posterior pdf, j to denote the joint pdf of X and Θ, fX to denote the pdf of X alone,
and fθ = f (·|θ) to denote the pdf corresponding to the conditional distribution of Xi given
θ.
Example 6.3 Let X = (X1 , . . . , Xn ) be an i.i.d. random sample from the conditional dis-
tribution of X1 given Θ = θ, where X1 |θ ∼ Ber(θ) and Θ ∼ U (0, 1). Find the posterior
distribution.
Since s itself is a pdf, fX(x) (which does not depend on θ) can be thought of as a normalizing
constant. Often one writes
s(θ|x) ∝ G(x; θ),
where ‘∝’ denotes equality up to a multiplicative factor that does not depend on θ. Thus
s(θ|x) ∝ L(x|θ) r(θ).
It is often possible to identify the pdf from an expression involving L(x|θ)r(θ) or other
simplified expressions.
Example 6.4 Let X = (X1 , . . . , Xn ) be an i.i.d. random sample from the conditional dis-
tribution of X1 given Θ = θ, where X1 |θ ∼ N (θ, 1) and Θ ∼ N (0, 1). Find the posterior
distribution.
Consider a family C of prior pdfs and the family F = {fθ : θ ∈ ∆} of conditional distributions
of Xi given Θ = θ. We call C a conjugate family for F if, whenever the prior pdf belongs to C,
the posterior pdf also belongs to C.
Example 6.5 Let C = (rα,β )α>0,β>0 , where rα,β is the pdf of the gamma distribution with
parameters α and β. Let F = (fθ )θ>0 , where fθ is the pdf of a Poisson random variable with
mean θ. Show that C is a conjugate family for F.
Example 6.6 Fix σ > 0. Let C = (rµ0 ,σ0 )µ0 >0,σ0 >0 , where rµ0 ,σ0 is the pdf of a normal
random variable with mean µ0 and variance σ02 . Let F = (fθ )θ>0 , where fθ is the pdf of a
normal random variable with mean θ and variance σ 2 . Show that C is a conjugate family for
F and the posterior hyperparameters are given by
µ′ = [ σ0² / ( σ0² + σ²/n ) ] x̄ + [ (σ²/n) / ( σ0² + σ²/n ) ] µ0
and
σ′² = [ (σ²/n) / ( σ0² + σ²/n ) ] σ0² .
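An optional numerical evaluation of these posterior hyperparameters (all values assumed for illustration):

import numpy as np

rng = np.random.default_rng(11)
sigma, mu0, sigma0 = 1.5, 0.0, 2.0               # assumed data sd and prior hyperparameters
theta_true, n = 1.0, 25

x = rng.normal(theta_true, sigma, size=n)
xbar = x.mean()

w = sigma0**2 / (sigma0**2 + sigma**2 / n)       # weight on xbar in mu_prime
mu_prime = w * xbar + (1 - w) * mu0
sigma_prime2 = (sigma**2 / n) / (sigma0**2 + sigma**2 / n) * sigma0**2
print(mu_prime, sigma_prime2)                    # posterior mean shrinks xbar towards mu0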
Example 6.7 Let C = (rα,β )α>0,β>0 , where rα,β is the pdf of a beta random variable with
parameters α and β. Let F = (fθ )θ>0 , where fθ is the pdf of a Bernoulli random variable
with parameter θ ∈ (0, 1). Show that C is a conjugate family for F and the posterior
hyperparameters are given by
α′ = α + t
and
β′ = β + n − t, where t = Σ_{i=1}^n xi is the observed number of successes.
Let us remark that in the notation of these examples, we will sometimes call α′ and β′ the
posterior hyperparameters and α and β the prior hyperparameters.
One natural point estimate for θ is the posterior mean δ(x) = E(Θ | X = x), and the associated
point estimator is δ(X) = E(Θ|X). This is just one important example of a Bayes' estimator.
It can be shown (see Example 6.8 below) that δ(x) = E(Θ | X = x) is the value of a for which
E[(Θ − a)² | X = x]
is minimized. Thus in the case where L(θ, θ′) = |θ − θ′|², we have that δ(x) minimizes the
conditional expected loss E[L(Θ, a) | X = x] over a.
In general, we can consider other choices of L. The function L is called a loss function, and
we aim to find a δ(x) which minimizes the conditional expected loss. The function δ is called
a decision function and is a Bayes' estimate of θ if it is a minimizer. More generally, if
we are interested in estimating a function of θ, given by g(θ), a Bayes' estimator of g(θ) is
a decision function δ(x) which minimizes E[ L(g(Θ), δ(x)) | X = x ].
In this course we will mostly be concerned with the squared loss function L(θ, θ′ ) = |θ − θ′ |2 .
There are many other reasonable choices of loss function to consider, for example, the absolute
loss given by L(θ, θ′ ) = |θ − θ′ |.
Example 6.8 Let Y be a random variable. Set g(a) = E(Y − a)2 . Minimize g.
Example 6.9 Let Y be a continuous random variable. Set g(a) = E[|Y − a|]. Minimize g.
Example 6.10 Let X = (X1 , . . . , Xn ) be an i.i.d. random sample from the conditional
distribution of X1 given Θ = θ, where X1 |θ ∼ U (0, θ) and Θ has the Pareto distribution
with scale parameter b > 0 and shape parameter α > 0 with pdf:
r(θ) = α b^α / θ^{α+1} ,
for θ > b; and 0 otherwise. Find the Bayes’ estimator (with respect to the squared loss
function) for θ.
So, as we see from Example 6.10, computing a Bayes' estimator essentially boils down to
computing the posterior distribution.
Learning Outcomes: Chapter 6 presents the essentials of the theory of Bayesian statistics.
It is important that you are able to
Bernoulli distribution
X = 0 with probability 1 − p,  X = 1 with probability p
or
pX(x) = p^x (1 − p)^{1−x} , x = 0, 1 ; 0 < p < 1
E(X) = p , Var(X) = p(1 − p)
Denoted by Ber(p)
Binomial distribution
pX(x) = (n choose x) p^x (1 − p)^{n−x} , x = 0, 1, . . . , n ; 0 < p < 1
E(X) = np , Var(X) = np(1 − p)
Denoted by Bin(n, p)
X is the number of successes in n independent Bernoulli trials with constant probability of
success p.
X can be written X = Y1 + · · · + Yn , where Yi are independent Ber(p).
Geometric distribution
pX(x) = p (1 − p)^{x−1} , x = 1, 2, . . . ; 0 < p < 1
E(X) = 1/p , Var(X) = (1 − p)/p²
Denoted by Geo(p)
X is the number of trials until the first success in a sequence of independent Bernoulli trials
with constant probability of success p.
Negative binomial distribution
pX(x) = ((x − 1) choose (r − 1)) p^r (1 − p)^{x−r} , x = r, r + 1, . . . ; 0 < p < 1
E(X) = r/p , Var(X) = r(1 − p)/p²
Denoted by NB(r, p)
X is the number of trials until the rth success in a sequence of independent Bernoulli trials
with constant probability of success p.
The case r = 1 is the Geo(p) distribution.
X can be written X = Y1 + · · · + Yr , where Yi are independent Geo(p)
Hypergeometric distribution
pX(x) = (M choose x)(N − M choose n − x) / (N choose n) , x = 0, 1, . . . , min(n, M) ; 0 < n, M ≤ N
E(X) = nM/N , Var(X) = nM(N − M)(N − n) / ( N²(N − 1) )
Denoted by H(n, M, N )
X is the number of items of Type I when sampling n items without replacement from a
population of size N , where there are M items of Type I in the population.
May be approximated by Bin(n, M/N) as N → ∞
Poisson distribution
pX(x) = e^{−λ} λ^x / x! , x = 0, 1, . . . ; λ > 0
E(X) = λ , Var(X) = λ
Denoted by Poi(λ)
X is the number of random events in time or space, where the rate of occurrence is λ.
Approximation to Bin(n, p) as n → ∞, p → 0 such that np → λ.
Uniform distribution
fX(x) = 1/(b − a) , a < x < b ; a < b
E(X) = (a + b)/2 , Var(X) = (b − a)²/12
Denoted by U(a, b)
Every point in the interval (a, b) is ‘equally likely’.
Important use for simulation of random numbers.
Exponential distribution
fX(x) = λ e^{−λx} , x > 0 ; λ > 0
E(X) = 1/λ , Var(X) = 1/λ²
Denoted by Exp(λ)
X is the waiting time until the first event in a Poisson process, rate λ.
Gamma distribution
fX(x) = λ^α x^{α−1} e^{−λx} / Γ(α) , x > 0 ; λ > 0, α > 0
where
Γ(α) = ∫_0^∞ x^{α−1} e^{−x} dx
E(X) = α/λ , Var(X) = α/λ²
Denoted by Gam(α, λ)
When α = r, an integer, X is the waiting time until the rth event in a Poisson process, rate
λ. The case α = 1 is the Exp(λ) distribution.
When α = r, an integer, X can be written X = Y1 + · · · + Yr , where Yi are independent
Exp(λ).
Beta distribution
fX(x) = x^{α−1} (1 − x)^{β−1} / B(α, β) , 0 < x < 1 ; α > 0, β > 0
where
B(α, β) = ∫_0^1 x^{α−1} (1 − x)^{β−1} dx = Γ(α)Γ(β) / Γ(α + β)
is the beta function.
E(X) = α/(α + β) , Var(X) = αβ / ( (α + β)²(α + β + 1) )
Denoted by Beta(α, β)
The case α = β = 1 is the U(0, 1) distribution.
X sometimes represents an unknown proportion lying in the interval (0, 1).
Normal distribution
fX(x) = ( 1/(σ√(2π)) ) exp{ −(x − µ)²/(2σ²) } , −∞ < x < ∞ ; σ > 0
E(X) = µ , Var(X) = σ 2
Denoted by N (µ, σ 2 )
Widely used as a distribution for continuous variables representing many real-world phenom-
ena, and as an approximation to many other distributions.