Expectations and sums of variables
You are expected to know some probability theory, including expectations/averages. This
sheet reviews some of that background.
1 Probability Distributions / Notation
The notation on this sheet follows MacKay’s textbook, available online here:
https://www.inference.org.uk/itila/book.html
An outcome, x, comes from a discrete set or ‘alphabet’ A_X = {a_1, a_2, ..., a_I}, with corresponding probabilities P_X = {p_1, p_2, ..., p_I}.
Examples:
A standard six-sided die has A_X = {1, 2, 3, 4, 5, 6} with corresponding probabilities P_X = {1/6, 1/6, 1/6, 1/6, 1/6, 1/6}.
A Bernoulli distribution, which has probability distribution

    P(x) = p        if x = 1,
           1 − p    if x = 0,        (1)
           0        otherwise,

has alphabet A_X = {1, 0} with P_X = {p, 1 − p}.
2 Expectations
An expectation is a property of a probability distribution, defined by a probability-weighted
sum. The expectation of some function, f, of an outcome, x, is:

    E_{P(x)}[f(x)] = ∑_{i=1}^{I} p_i f(a_i).    (2)
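To make the definition concrete, here is a minimal NumPy sketch of equation (2), using the six-sided die from Section 1; the choice f(x) = x² is just an illustration, not from the note:

    import numpy as np

    # Alphabet and probabilities for a fair six-sided die (Section 1).
    a = np.array([1, 2, 3, 4, 5, 6])
    p = np.full(6, 1/6)

    # E[f(x)] as the probability-weighted sum in equation (2), with f(x) = x^2.
    f_of_a = a**2
    print(np.sum(p * f_of_a))   # 91/6 ≈ 15.17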
Often the subscript P(x) is dropped from the notation because the reader knows under which distribution the expectation is being taken. Notation can vary considerably, and details are often dropped. You might also see E[f], 𝔼[f], or ⟨f⟩, which all mean the same thing.
The expectation is sometimes a useful representative value of a random function value. The
expectation of the identity function, f ( x ) = x, is the ‘mean’, which is one measure of the
centre of a distribution.
The expectation is a linear operator:
    E[f(x) + g(x)] = E[f(x)] + E[g(x)]   and   E[c f(x)] = c E[f(x)].    (3)
These properties are apparent if you explicitly write out the summations.
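You can also check them numerically; a small sketch continuing the die example above, where the functions f and g and the constant c are arbitrary choices:

    import numpy as np

    a = np.array([1, 2, 3, 4, 5, 6])
    p = np.full(6, 1/6)
    E = lambda f: np.sum(p * f(a))   # expectation under the die distribution

    f = lambda x: x**2
    g = lambda x: 3*x + 1
    c = 2.5

    # Both properties in equation (3) hold exactly:
    assert np.isclose(E(lambda x: f(x) + g(x)), E(f) + E(g))
    assert np.isclose(E(lambda x: c * f(x)), c * E(f))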
The expectation of a constant with respect to x is the constant:
    E[c] = c ∑_{i=1}^{I} p_i = c,    (4)
because probability distributions sum to one (‘probabilities are normalized’).
The expectation of a product of independent outcomes separates into a product of expectations:

    E[f(x) g(y)] = E[f(x)] E[g(y)].    (5)

This holds when x and y are independent.
Exercise 1: prove this. (Answers at the end of the note.)
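If you want to sanity-check the identity numerically (not a proof), you can build the joint distribution of two independent variables as an outer product and compare both sides; the alphabets and functions below are arbitrary illustrations:

    import numpy as np

    # Two independent discrete variables: a fair die and a Bernoulli(0.3).
    ax, px = np.array([1, 2, 3, 4, 5, 6]), np.full(6, 1/6)
    ay, py = np.array([0, 1]), np.array([0.7, 0.3])

    f = lambda x: x**2
    g = lambda y: 2*y - 1

    # Independence: the joint probability is the outer product p(x)p(y).
    joint = np.outer(px, py)
    lhs = np.sum(joint * np.outer(f(ax), g(ay)))    # E[f(x)g(y)]
    rhs = np.sum(px * f(ax)) * np.sum(py * g(ay))   # E[f(x)] E[g(y)]
    assert np.isclose(lhs, rhs)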
3 The mean
The mean of a distribution over a number is simply the ‘expected’ value of the numerical outcome.
    ‘Expected Value’ = ‘mean’ = µ = E[x] = ∑_{i=1}^{I} p_i a_i.    (6)
For a six-sided die:
    E[x] = (1/6)×1 + (1/6)×2 + (1/6)×3 + (1/6)×4 + (1/6)×5 + (1/6)×6 = 3.5.    (7)
In everyday language I wouldn’t say that I ‘expect’ to see 3.5 as the outcome of throwing a die... I expect to see an integer! However, 3.5 is the ‘expected value’ as it is commonly defined. Similarly, a single Bernoulli outcome will be a zero or a one, but its ‘expected’ value is a fraction,

    E[x] = p×1 + (1 − p)×0 = p,    (8)

the probability of getting a one.
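A quick Monte Carlo illustration: the average of many simulated Bernoulli outcomes approaches p (the sample size and p = 0.3 below are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(0)
    p = 0.3
    x = rng.random(100_000) < p   # 100,000 Bernoulli(p) outcomes as True/False
    print(x.mean())               # sample mean close to p = 0.3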
Change of units: I might have a distribution over heights measured in metres, for which I
have computed the mean. If I multiply the heights by 100 to obtain heights in centimetres,
the mean in centimetres can be obtained by multiplying the mean in metres by 100. Formally:
E[100 x ] = 100 E[ x ].
4 The variance
The variance is also an expectation, measuring the average squared distance from the mean:
    var[x] = σ² = E[(x − µ)²] = E[x²] − E[x]²,    (9)
where µ = E[ x ] is the mean.
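Both expressions for the variance give the same answer; a quick numerical check on the six-sided die from the running example:

    import numpy as np

    a = np.array([1, 2, 3, 4, 5, 6])
    p = np.full(6, 1/6)

    mu = np.sum(p * a)               # mean, equation (6)
    var1 = np.sum(p * (a - mu)**2)   # E[(x - mu)^2]
    var2 = np.sum(p * a**2) - mu**2  # E[x^2] - E[x]^2
    assert np.isclose(var1, var2)    # both equal 35/12 ≈ 2.92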
Exercise 2: prove that E[(x − µ)²] = E[x²] − E[x]².
Exercise 3: show that var[cx] = c² var[x].
Exercise 4: show that var[x + y] = var[x] + var[y], for independent outcomes x and y.
Exercise 5: Given outcomes distributed with mean µ and variance σ², how could you shift and scale them to have mean zero and variance one?
Change of units: If the outcome x is a height measured in metres, then x² has units of m²; x² is an area. The variance also has units of m², so it cannot be represented on the same scale as the outcome, because it has different units. If you multiply all heights by 100 to convert to centimetres, the variance is multiplied by 100². Therefore, the relative size of the mean and the variance depends on the units you use, and so often isn’t meaningful.
Standard deviation: The standard deviation σ, the square root of the variance, does have the
same units as the mean. The standard deviation is often used as a measure of the typical
distance from the mean. Often variances are used in intermediate calculations because
they are easier to deal with: it is variances that add (as in Exercise 4 above), not standard
deviations.
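A small simulation illustrating this point, with two independent Gaussian variables (the standard deviations 2 and 3 are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(0.0, 2.0, size=1_000_000)   # std 2, so var[x] = 4
    y = rng.normal(0.0, 3.0, size=1_000_000)   # std 3, so var[y] = 9

    print(np.var(x + y))   # ≈ 13 = 4 + 9: variances add
    print(np.std(x + y))   # ≈ 3.61 = sqrt(13), not 2 + 3 = 5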
5 Sums of independent variables: “random walks”
A drunkard starts at the centre of an alleyway, with exits at each end. He takes a sequence of random staggers either to the left or right along the alleyway. His position after N steps is k_N = ∑_{n=1}^{N} x_n, where the outcomes, {x_n}, the staggering motions, are drawn from a distribution with zero mean and finite variance σ². For example, A_X = {−1, +1} with P_X = {1/2, 1/2}, which has E[x_n] = 0 and var[x_n] = 1.
If the drunkard started in the centre of the alleyway, will he ever escape? If so, roughly how long will it take? (If you don’t already know, have a think...)
The expected, or mean, position after N steps is E[k_N] = N E[x_n] = 0. This doesn’t mean we don’t think the drunkard will escape. There are ways of escaping both left and right; it’s just ‘on average’ that he’ll stay in the middle.
The variance of the drunkard’s position is var[k_N] = N var[x_n] = Nσ². The standard deviation of the position is then std[k_N] = √N σ, which is a measure of the width of the distribution over the displacement from the centre of the alleyway. If we double the length of the alley, then it will typically take four times the number of random steps to escape.
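A short simulation of this walk, using the A_X = {−1, +1} step distribution above (the numbers of trials and steps are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    trials = 10_000
    for N in [100, 400, 1600]:
        steps = rng.choice([-1, +1], size=(trials, N))   # staggers left/right
        k_N = steps.sum(axis=1)                          # positions after N steps
        print(N, k_N.mean(), k_N.std())                  # mean ≈ 0, std ≈ sqrt(N)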
Worthwhile remembering: the typical magnitude of the sum of N independent zero-mean variables scales with √N. The individual variables need to have finite variance, and ‘typical magnitude’ is measured by standard deviation. Sometimes you might have to work out the σ for your problem, or do other detailed calculations. But sometimes the scaling of the width of the distribution is all that really matters.
Corollary: the typical magnitude of the mean of N independent zero-mean variables with finite variance scales with 1/√N.
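The corollary in code: across many simulated sets of N steps (same step distribution as above), the spread of the sample mean shrinks like 1/√N:

    import numpy as np

    rng = np.random.default_rng(0)
    for N in [100, 400, 1600]:
        means = rng.choice([-1, +1], size=(10_000, N)).mean(axis=1)
        print(N, means.std())   # ≈ 1/sqrt(N): about 0.1, 0.05, 0.025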
6 Solutions
As always, you are strongly recommended to work hard on a problem yourself before looking
at the solutions. As you transition into doing research, there won’t be any answers, and you
have to build confidence in getting and checking your own answers.
Exercise 1: For independent outcomes x and y, p(x, y) = p(x)p(y) and so
E[f(x)g(y)] = ∑_x ∑_y p(x)p(y) f(x)g(y) = ∑_x p(x)f(x) ∑_y p(y)g(y) = E[f(x)] E[g(y)].
Exercise 2: E[(x − µ)²] = E[x² + µ² − 2xµ] = E[x²] + µ² − 2µE[x] = E[x²] − µ².
Exercise 3: var[cx] = E[(cx)²] − E[cx]² = E[c²x²] − (cE[x])² = c²(E[x²] − E[x]²) = c² var[x].
Exercise 4: var[x + y] = E[(x + y)²] − E[x + y]² = E[x²] + E[y²] + 2E[xy] − (E[x]² + E[y]² + 2E[x]E[y]) = var[x] + var[y], if E[xy] = E[x]E[y], which is true when x and y are independent variables.
Exercise 5: z = (x − µ)/σ has mean 0 and variance 1. The division is by the standard deviation, not the variance. You should now be able to prove this result for yourself.
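A numerical illustration of this shift-and-scale, using arbitrary skewed samples (a gamma distribution, chosen purely for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.gamma(shape=2.0, scale=3.0, size=1_000_000)  # mean 6, variance 18

    z = (x - x.mean()) / x.std()   # shift by the mean, scale by the std
    print(z.mean(), z.var())       # ≈ 0 and ≈ 1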
What to remember: using the expectation notation where possible, rather than writing out
the summations or integrals explicitly, makes the mathematics concise.
MLPR:w0f Iain Murray and Arno Onken, http://www.inf.ed.ac.uk/teaching/courses/mlpr/2020/