Slides 2
In this part of the lectures, our interest is in the behaviour of functions of a random variable, X, whose cdf we know.
This part is heavily based on Casella & Berger, Chapters 2 and 3.
Specifically, we cover material from Sections 2.1, 2.2, 2.3, 2.6, 3.1, 3.2, 3.3 and 3.5.
Therefore, the mapping $g^{-1}(\cdot)$ takes sets into sets; that is, $g^{-1}(A)$ is the set of points in $\mathcal{X}$ that $g(x)$ takes into the set $A$.
If $A = \{y\}$, a point set, then
$$g^{-1}(\{y\}) = \{x \in \mathcal{X} : g(x) = y\}.$$
$$P(Y \in A) = P(g(X) \in A) = P(\{x \in \mathcal{X} : g(x) \in A\}) = P(X \in g^{-1}(A)), \qquad (2)$$
where the last line follows from (1). This defines the probability distribution of $Y$.
$$f_Y(z) = P(Y = z) = \sum_{q \in g^{-1}(z)} P(X = q) = \sum_{q \in g^{-1}(z)} f_X(q), \quad \text{for } z \in \mathcal{Y},$$
and $f_Y(z) = 0$ for $z \notin \mathcal{Y}$.
Note that $\sum_{q \in g^{-1}(z)} P(X = q)$ simply amounts to adding up the probabilities $P(X = q)$ for all $q$ such that $g(q) = z$.
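As a quick illustration of this summation (a minimal sketch; the pmf of X and the mapping g below are made-up values, not from the slides), one can build the pmf of Y = g(X) directly:

```python
# Sketch: pmf of Y = g(X) for a discrete X, obtained by summing f_X(q) over all q with g(q) = z.
# The pmf of X and the mapping g are illustrative choices only.
from collections import defaultdict

f_X = {-2: 0.1, -1: 0.2, 0: 0.4, 1: 0.2, 2: 0.1}   # pmf of X on its support
g = lambda x: x ** 2                                # the mapping Y = g(X)

f_Y = defaultdict(float)
for q, prob in f_X.items():
    f_Y[g(q)] += prob        # add up P(X = q) for every q such that g(q) = z

print(dict(f_Y))             # e.g. {4: 0.2, 1: 0.4, 0: 0.4}; the probabilities still sum to one
```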
$$f_Y(y) = \sum_{x \in g^{-1}(y)} f_X(x) = \binom{n}{y} (1-p)^y p^{\,n-y}, \quad y = 0, 1, \ldots, n.$$
Behind all these dry results, our main objective actually is to obtain simple formulas
for the cdf and pdf of Y in terms of the mapping g ( ) and the cdf and pdf of X .
Now, the cdf of $Y = g(X)$ is
$$F_Y(y) = P(Y \le y) = P(g(X) \le y) = P(\{x \in \mathcal{X} : g(x) \le y\}) = \int_{\{x \in \mathcal{X}:\, g(x) \le y\}} f_X(x)\,dx.$$
where $\beta > 0$ and $n$ is a positive integer. We want to find the pdf of $g(X) = 1/X$.
Note that here the support sets $\mathcal{X}$ and $\mathcal{Y}$ are both the interval $(0, \infty)$. Now, $y = g(x) = 1/x$, and so $g^{-1}(y) = 1/y$ and $\frac{d}{dy} g^{-1}(y) = -1/y^2$. By Theorem (2.1.5), for $y \in (0, \infty)$, we obtain
$$f_Y(y) = f_X(g^{-1}(y)) \left| \frac{d}{dy} g^{-1}(y) \right| = \frac{1}{(n-1)!\,\beta^n} \left(\frac{1}{y}\right)^{n-1} e^{-1/(\beta y)}\, \frac{1}{y^2} = \frac{1}{(n-1)!\,\beta^n} \left(\frac{1}{y}\right)^{n+1} e^{-1/(\beta y)}.$$
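As a sanity check on this derivation (not in the original notes), one can simulate draws of X from the gamma pdf above, transform them to Y = 1/X, and compare a histogram with the formula; the values of n and β below are arbitrary.

```python
# Sketch: Monte Carlo check of the derived pdf of Y = 1/X when X ~ gamma(n, beta).
# n and beta are arbitrary illustrative values.
import numpy as np
from math import factorial

n, beta = 4, 2.0
rng = np.random.default_rng(0)
x = rng.gamma(shape=n, scale=beta, size=200_000)   # f_X(x) = x^(n-1) e^(-x/beta) / ((n-1)! beta^n)
y = 1.0 / x

def f_Y(v):
    # the pdf derived above: (1/((n-1)! beta^n)) (1/v)^(n+1) exp(-1/(beta v))
    return (1.0 / (factorial(n - 1) * beta**n)) * (1.0 / v) ** (n + 1) * np.exp(-1.0 / (beta * v))

hist, edges = np.histogram(y, bins=60, range=(0.02, 0.6), density=True)
mids = 0.5 * (edges[:-1] + edges[1:])
print(np.max(np.abs(hist - f_Y(mids))))   # small relative to the pdf height: simulation and formula agree
```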
Because $X$ is continuous, we can drop the equality from the left endpoint and obtain
$$F_Y(y) = P(-\sqrt{y} < X \le \sqrt{y}) = P(X \le \sqrt{y}) - P(X \le -\sqrt{y}) = F_X(\sqrt{y}) - F_X(-\sqrt{y}).$$
Then,
$$f_Y(y) = \frac{d}{dy} F_Y(y) = \frac{d}{dy}\left[F_X(\sqrt{y}) - F_X(-\sqrt{y})\right] = \frac{1}{2\sqrt{y}} f_X(\sqrt{y}) + \frac{1}{2\sqrt{y}} f_X(-\sqrt{y}).$$
Importantly, the pdf of Y consists of two pieces which represent the intervals where
g (x ) = x 2 is monotone.
Distributions of Functions of a Random Variable
The above idea can be extended to cases where we need a larger number of intervals
to obtain, if you like, interval-by-interval monotonicity.
Theorem (2.1.8): Let $X$ have pdf $f_X(x)$, let $Y = g(X)$ and define the sample space $\mathcal{X}$ as in (3). Suppose there exists a partition $A_0, A_1, A_2, \ldots, A_k$ of $\mathcal{X}$ such that $P(X \in A_0) = 0$ and $f_X(x)$ is continuous on each $A_i$. Further, suppose there exist functions $g_1(x), \ldots, g_k(x)$, defined on $A_1, \ldots, A_k$ respectively, satisfying
1 $g(x) = g_i(x)$, for $x \in A_i$,
2 $g_i(x)$ is monotone on $A_i$,
3 the set $\mathcal{Y} = \{y : y = g_i(x) \text{ for some } x \in A_i\}$ is the same for each $i = 1, \ldots, k$,
4 $g_i^{-1}(y)$ has a continuous derivative on $\mathcal{Y}$, for each $i = 1, \ldots, k$.
Then,
$$f_Y(y) = \begin{cases} \sum_{i=1}^{k} f_X\!\left(g_i^{-1}(y)\right) \left| \frac{d}{dy} g_i^{-1}(y) \right| & y \in \mathcal{Y} \\ 0 & \text{otherwise} \end{cases}.$$
$A_0 = \{0\}$;
$A_1 = (-\infty, 0)$, $g_1(x) = x^2$, $g_1^{-1}(y) = -\sqrt{y}$;
$A_2 = (0, \infty)$, $g_2(x) = x^2$, $g_2^{-1}(y) = \sqrt{y}$.
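As an illustration (a sketch, with X taken to be standard normal purely for concreteness), the two-branch formula above reproduces the pdf of the chi-squared distribution with 1 degree of freedom for Y = X²:

```python
# Sketch: Theorem (2.1.8) applied to Y = X^2 with X ~ N(0, 1) (illustrative choice of f_X).
# f_Y(y) = f_X(-sqrt(y)) |d/dy(-sqrt(y))| + f_X(sqrt(y)) |d/dy(sqrt(y))|
import numpy as np
from scipy.stats import norm, chi2

y = np.linspace(0.05, 5.0, 200)
f_Y = norm.pdf(-np.sqrt(y)) / (2 * np.sqrt(y)) + norm.pdf(np.sqrt(y)) / (2 * np.sqrt(y))

print(np.allclose(f_Y, chi2.pdf(y, df=1)))   # True: Y = X^2 has a chi-squared(1) distribution
```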
We finish with another important result known as the Probability Integral Transform.
Theorem (2.1.10): Let $X$ have continuous cdf $F_X(x)$ and define the random variable $Y$ as $Y = F_X(X)$. Then, $Y$ is uniformly distributed on $(0, 1)$, that is,
$$P(Y \le y) = y, \quad 0 < y < 1.$$
$$P(Y \le y) = P(F_X(X) \le y) = P(F_X^{-1}[F_X(X)] \le F_X^{-1}(y)) = P(X \le F_X^{-1}(y)) = F_X(F_X^{-1}(y)) = y.$$
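A small simulation sketch of the theorem (the exponential distribution and its scale are arbitrary choices): plugging X into its own cdf yields draws that behave like Uniform(0, 1).

```python
# Sketch: probability integral transform, Y = F_X(X) ~ Uniform(0, 1).
# An exponential X with an arbitrary scale is used for illustration.
import numpy as np
from scipy.stats import expon, kstest

rng = np.random.default_rng(1)
x = expon.rvs(scale=3.0, size=100_000, random_state=rng)
u = expon.cdf(x, scale=3.0)             # Y = F_X(X)

print(u.mean(), u.var())                # close to 1/2 and 1/12, the Uniform(0, 1) moments
print(kstest(u, "uniform").pvalue)      # large p-value: consistent with Uniform(0, 1)
```

Read in the other direction, this is the basis of inverse-transform sampling: applying $F_X^{-1}$ to Uniform(0, 1) draws produces draws from $F_X$.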
In this section, we will introduce one of the most widely used concepts in
econometrics, the expected value.
As we will see in more detail, this is one of the moments that a random variable can
possess.
This concept is akin to the concept of “average.” The standard “average” is an
arithmetic average where all available observations are weighted equally.
The expected value, on the other hand, is the average of all possible values a
random variable can take, weighted by the probability distribution.
The question is: what value would we expect the random variable to take, on average?
provided that the integral or sum exists. If $E[|g(X)|] = \infty$, we say that $E[g(X)]$ does not exist.
In both cases, the idea is that we are taking the average of g (x ) over all of its
possible values (x 2 X ), where these values are weighted by the respective value of
the pdf, fX (x ).
To obtain this result, we use a method called integration by parts. This is based on
$$\int u\,dv = uv - \int v\,du.$$
Then, taking
$$u = x, \quad du = dx, \qquad v = -e^{-x/\lambda}, \quad dv = \lambda^{-1} e^{-x/\lambda}\,dx,$$
gives (4).
Finally,
$$\int_0^{\infty} e^{-x/\lambda}\,dx = -\lambda e^{-x/\lambda}\Big|_0^{\infty} = \lambda.$$
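The same computation can be checked symbolically; a minimal sympy sketch using the exponential pdf $(1/\lambda)e^{-x/\lambda}$:

```python
# Sketch: symbolic check that E[X] = lambda for the exponential pdf (1/lambda) e^(-x/lambda).
import sympy as sp

x, lam = sp.symbols("x lambda", positive=True)
pdf = sp.exp(-x / lam) / lam
EX = sp.integrate(x * pdf, (x, 0, sp.oo))   # the integration by parts is done internally
print(sp.simplify(EX))                      # lambda
```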
$$E[a + bX] = a + E[bX] = a + bE[X] = a + b\mu.$$
But we know that $g_1(x) - g_2(x) \ge 0$ for all $x$! So, we are integrating a function which is nonnegative everywhere, weighted by $f_X(x) \ge 0$ for all $x$.
Hence, $E[g_1(X)] - E[g_2(X)] \ge 0$. The rest can be shown similarly.
Let’s consider the following examples. We will first find the pdf of the inverted gamma distribution. Then, we will use this result to find the expectation of a random variable which has the inverted gamma distribution.
Another widely used moment is the variance of a random variable. Obviously, this
moment measures the variation/dispersion/spread of the random variable (around its
expectation).
While the expectation is usually denoted by $\mu$, $\sigma^2$ is generally used for variance.
Variance is a second-order moment. If available, higher order moments of a random
variable can be calculated, as well.
For example, the third and fourth moments are concerned with how symmetric and
fat-tailed the underlying distribution is. We will talk more about these.
The $n$th moment of $X$ (about the origin) is $\mu_n' = E[X^n]$, while the $n$th central moment is
$$\mu_n = E[(X - \mu)^n],$$
where $\mu = \mu_1' = E[X]$.
Let us digress for a moment and briefly mention another important concept: covariance. When it exists, the covariance of two random variables $X$ and $Y$ is defined as
$$\mathrm{Cov}(X, Y) = E\left(\{X - E[X]\}\{Y - E[Y]\}\right).$$
We will talk in more detail about this when we deal with multivariate random variables. For the time being, suffice it to say that the covariance is a measure of the co-movement between two random variables.
We will be dealing with moments in more detail. However, this is a good time to
stop and review some of the commonly used distributions.
Usually, one deals with families of distributions. Families of distributions are indexed
by one or more parameters, which allow one to vary certain characteristics of the
distribution while staying within one functional form.
To give an example, one of the most commonly employed distributions, the normal distribution, has two parameters, the mean and the variance, denoted by $\mu$ and $\sigma^2$, respectively.
Although one might know the value of σ2 for the random variable at hand, the
actual value of µ might be unknown.
In that case, the distribution will be indexed by µ, and the behaviour of the random
variable will change as µ varies.
Figure: Normal pdfs for $\sigma^2 = 1, 2, 4$.
This part will also provide a good opportunity to put some of the abstract concepts
we have learned into action.
We start with discrete distributions.
A random variable X is said to have a discrete distribution if the range of X , the
sample space, is countable.
Using (6) we can now calculate the mean and the variance.
Now,
$$E[X] = \sum_{x=1}^{N} x\,P(X = x \mid N) = \sum_{x=1}^{N} x\,\frac{1}{N} = \frac{1}{N}\frac{N(N+1)}{2} = \frac{N+1}{2},$$
and
$$E[X^2] = \sum_{x=1}^{N} x^2\,P(X = x \mid N) = \sum_{x=1}^{N} x^2\,\frac{1}{N} = \frac{(N+1)(2N+1)}{6}.$$
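A quick numerical check of these two formulas (N is an arbitrary choice), together with the implied variance $E[X^2] - (E[X])^2$:

```python
# Sketch: check E[X] = (N+1)/2 and E[X^2] = (N+1)(2N+1)/6 for the discrete uniform on {1,...,N}.
N = 10
xs = range(1, N + 1)
EX = sum(x / N for x in xs)
EX2 = sum(x ** 2 / N for x in xs)

print(EX, (N + 1) / 2)                     # 5.5  5.5
print(EX2, (N + 1) * (2 * N + 1) / 6)      # 38.5  38.5
print(EX2 - EX ** 2)                       # the variance, 8.25 for N = 10
```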
$$E[X] = 1 \cdot p + 0 \cdot (1 - p) = p,$$
and
$$\mathrm{Var}(X) = (1 - p)^2 p + (0 - p)^2 (1 - p) = p(1 - p).$$
Examples:
1 Tossing a coin (p=probability of a head, X = 1 if heads)
2 Roulette (X = 1 if red occurs, p=probability of red)
3 Election polls (X = 1 if candidate A gets vote)
4 Incidence of disease (p=probability that a random person gets infected)
Each particular sequence of outcomes containing $y$ successes and $n - y$ failures has probability $p^y (1-p)^{n-y}$. Since there are $\binom{n}{y}$ such sequences,
$$P(Y = y \mid n, p) = \binom{n}{y} p^y (1-p)^{n-y}, \quad y = 0, 1, 2, \ldots, n,$$
and $Y$ is called a binomial$(n, p)$ random variable.
Common Families of Distributions
Discrete Distributions: Binomial Distribution
The following theorem, called the “Binomial Theorem,” is a useful result which can be used to show that $\sum_{y=0}^{n} P(Y = y) = 1$ for the binomial$(n, p)$ random variable mentioned above.
Theorem (3.2.2): For any real numbers $x$ and $y$ and integer $n \ge 0$,
$$(x + y)^n = \sum_{i=0}^{n} \binom{n}{i} x^i y^{n-i}.$$
Setting $x = p$ and $y = 1 - p$ gives $\sum_{y=0}^{n} \binom{n}{y} p^y (1-p)^{n-y} = (p + 1 - p)^n = 1$.
Let’s calculate the mean and variance for the Binomial Distribution.
Example (2.2.3):
$$E[X] = \sum_{x=0}^{n} x \binom{n}{x} p^x (1-p)^{n-x} = \sum_{x=1}^{n} x \binom{n}{x} p^x (1-p)^{n-x},$$
since
$$x \binom{n}{x} = x \frac{n!}{x!(n-x)!} = n \frac{(n-1)!}{(x-1)!(n-x)!} = n \binom{n-1}{x-1}.$$
Then,
$$E[X] = \sum_{x=1}^{n} n \binom{n-1}{x-1} p^x (1-p)^{n-x} = \sum_{y=0}^{n-1} n \binom{n-1}{y} p^{y+1} (1-p)^{n-(y+1)} = np \sum_{y=0}^{n-1} \binom{n-1}{y} p^y (1-p)^{n-1-y} = np.$$
Observe that
$$x^2 \binom{n}{x} = x^2 \frac{n!}{(n-x)!\,x!} = x\,n \frac{(n-1)!}{(n-x)!\,(x-1)!} = x\,n \binom{n-1}{x-1}.$$
Then,
$$\sum_{x=0}^{n} x^2 \binom{n}{x} p^x (1-p)^{n-x} = \sum_{x=1}^{n} x\,n \binom{n-1}{x-1} p^x (1-p)^{n-x} = n \sum_{y=0}^{n-1} (y+1) \binom{n-1}{y} p^{y+1} (1-p)^{n-y-1}$$
$$= np \sum_{y=0}^{n-1} y \binom{n-1}{y} p^y (1-p)^{n-y-1} + np \sum_{y=0}^{n-1} \binom{n-1}{y} p^y (1-p)^{n-y-1}.$$
Think about the first sum first. This is equal to $E[Z]$ for $Z \sim$ binomial$(n-1, p)$.
What about the second sum? Following a similar reasoning as above, it is equal to one.
Hence,
$$E[X^2] = np \underbrace{\sum_{y=0}^{n-1} y \binom{n-1}{y} p^y (1-p)^{n-1-y}}_{(n-1)p} + np \underbrace{\sum_{y=0}^{n-1} \binom{n-1}{y} p^y (1-p)^{n-1-y}}_{1} = n(n-1)p^2 + np.$$
Finally,
$$\mathrm{Var}(X) = E[X^2] - \{E[X]\}^2 = n(n-1)p^2 + np - (np)^2 = np(1 - p).$$
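A direct numerical check of $E[X] = np$ and $\mathrm{Var}(X) = np(1-p)$ (the values of n and p are arbitrary):

```python
# Sketch: verify the binomial mean and variance formulas by direct summation over the pmf.
from math import comb

n, p = 12, 0.3
pmf = [comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(n + 1)]
EX = sum(x * f for x, f in enumerate(pmf))
EX2 = sum(x ** 2 * f for x, f in enumerate(pmf))

print(EX, n * p)                        # both approximately 3.6
print(EX2 - EX ** 2, n * p * (1 - p))   # both approximately 2.52
```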
Example (3.2.3) - cont’d: Now we consider another game; throw a pair of dice 24 times and ask for the probability of at least one double 6. This, again, can be modelled by the binomial distribution with success probability $p$, where
$$p = P(\text{roll a double 6}) = \frac{1}{36}.$$
So, if $Y$ = number of double 6s in 24 rolls, $Y \sim$ binomial$(24, 1/36)$ and
$$P(\text{at least one double 6}) = P(Y \ge 1) = 1 - P(Y = 0) = 1 - \left(\frac{35}{36}\right)^{24} \approx 0.491.$$
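A one-line check of this number (a sketch using scipy's binomial pmf):

```python
# Sketch: probability of at least one double 6 in 24 throws of a pair of dice.
from scipy.stats import binom

print(1 - binom.pmf(0, 24, 1 / 36))   # about 0.4914, i.e. 1 - (35/36)**24
```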
A random variable $X$, taking values in the nonnegative integers, has a Poisson$(\lambda)$ distribution if
$$P(X = x \mid \lambda) = \frac{e^{-\lambda} \lambda^x}{x!}, \quad x = 0, 1, \ldots.$$
Is $\sum_{x=0}^{\infty} P(X = x \mid \lambda) = 1$? The Taylor series expansion of $e^y$ is given by
$$e^y = \sum_{i=0}^{\infty} \frac{y^i}{i!}.$$
Don’t worry about this for the moment, if you are not familiar with this expansion.
We will cover it later in the course.
So,
$$\sum_{x=0}^{\infty} \frac{e^{-\lambda} \lambda^x}{x!} = e^{-\lambda} \underbrace{\sum_{x=0}^{\infty} \frac{\lambda^x}{x!}}_{e^{\lambda}} = 1.$$
Moreover,
$$E[X] = \sum_{x=0}^{\infty} x\,\frac{e^{-\lambda} \lambda^x}{x!} = \lambda \sum_{x=1}^{\infty} \frac{e^{-\lambda} \lambda^{x-1}}{(x-1)!} = \lambda \sum_{y=0}^{\infty} \frac{e^{-\lambda} \lambda^y}{y!} = \lambda,$$
where we substituted $y = x - 1$.
Similarly,
$$E[X^2] = \sum_{x=0}^{\infty} x^2\,\frac{e^{-\lambda} \lambda^x}{x!} = \lambda \sum_{x=1}^{\infty} x\,\frac{e^{-\lambda} \lambda^{x-1}}{(x-1)!} = \lambda \sum_{x=1}^{\infty} (x-1)\,\frac{e^{-\lambda} \lambda^{x-1}}{(x-1)!} + \lambda \sum_{x=1}^{\infty} \frac{e^{-\lambda} \lambda^{x-1}}{(x-1)!}$$
$$= \lambda \sum_{y=0}^{\infty} y\,\frac{e^{-\lambda} \lambda^y}{y!} + \lambda e^{-\lambda} \underbrace{\sum_{y=0}^{\infty} \frac{\lambda^y}{y!}}_{e^{\lambda}},$$
where $\sum_{y=0}^{\infty} y\,\frac{e^{-\lambda}\lambda^y}{y!} = E[Y]$ for $Y \sim$ Poisson$(\lambda)$. Therefore,
$$E[X^2] = \lambda^2 + \lambda.$$
Thus,
$$\mathrm{Var}(X) = E[X^2] - \mu^2 = \lambda^2 + \lambda - \lambda^2 = \lambda.$$
Notice that
$$\Gamma(1) = \int_0^{\infty} e^{-t}\,dt = 1,$$
and so, for any integer $n > 0$,
$$\Gamma(n) = (n-1)!.$$
Finally,
$$\Gamma(1/2) = \sqrt{\pi}.$$
These results make life a lot easier when calculating Γ(c ) for some positive c.
Figure: Gamma pdfs for $(\alpha, \beta) = (1, 2), (2, 2), (3, 2), (5, 1), (9, 0.5)$.
Some interesting distribution functions are special cases of the gamma distribution.
For example, for $\alpha = p/2$, where $p$ is an integer, and $\beta = 2$,
$$f(x \mid \alpha, \beta) = f(x \mid p) = \frac{1}{\Gamma(p/2)\,2^{p/2}}\, x^{(p/2)-1} e^{-x/2}, \quad 0 < x < \infty.$$
This is known as the pdf of the $\chi^2_{(p)}$ distribution, read “chi-squared distribution with $p$ degrees of freedom”.
Obviously, for $X \sim \chi^2_{(p)}$, $E[X] = \alpha\beta = p$ and $\mathrm{Var}(X) = \alpha\beta^2 = 2p$.
This distribution is used heavily in hypothesis testing. You will come back to this
distribution next term.
Now consider $\alpha = 1$:
$$f(x \mid \alpha, \beta) = f(x \mid 1, \beta) = \frac{1}{\Gamma(1)\beta^1}\, x^0 e^{-x/\beta} = \frac{1}{\beta} e^{-x/\beta}, \quad 0 < x < \infty.$$
This is the exponential distribution, with $E[X] = \beta$ and $\mathrm{Var}(X) = \beta^2$.
$$P(X > s \mid X > t) = e^{-(s-t)/\beta} = P(X > s - t).$$
This is because
$$\int_{s-t}^{\infty} \frac{1}{\beta} e^{-x/\beta}\,dx = -e^{-x/\beta}\Big|_{x=s-t}^{\infty} = e^{-(s-t)/\beta}.$$
What does this mean? When calculating P (X > s jX > t ), what matters is not
whether X has passed a threshold or not. What matters is the distance between the
threshold and the value to be reached.
If Mr X has been fired twice, what is the probability that he will be fired three times? It is not different from the probability that a person, who has never been fired, is fired. History does not matter.
$$P(X \le x) = P(Y \ge \alpha). \qquad (7)$$
Let's sketch the proof of this result. Remember that if $\alpha$ is an integer, then $\Gamma(\alpha) = (\alpha - 1)!$.
Hence, using integration by parts,
$$P(X \le x) = \frac{1}{(\alpha-1)!\,\beta^{\alpha}} \int_0^x t^{\alpha-1} e^{-t/\beta}\,dt = \frac{1}{(\alpha-1)!\,\beta^{\alpha}} \left[ -t^{\alpha-1} \beta e^{-t/\beta} \Big|_{t=0}^{x} + \int_0^x (\alpha-1) t^{\alpha-2} \beta e^{-t/\beta}\,dt \right]$$
$$= -\frac{(x/\beta)^{\alpha-1} e^{-x/\beta}}{(\alpha-1)!} + \frac{1}{(\alpha-2)!\,\beta^{\alpha-1}} \int_0^x t^{\alpha-2} e^{-t/\beta}\,dt = -P(Y = \alpha - 1) + \frac{1}{(\alpha-2)!\,\beta^{\alpha-1}} \int_0^x t^{\alpha-2} e^{-t/\beta}\,dt,$$
where $Y \sim$ Poisson$(x/\beta)$. If we keep doing the same operation, we will eventually obtain (7).
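The relationship (7) is easy to verify numerically for an integer α; a sketch with arbitrary parameter values:

```python
# Sketch: check P(X <= x) = P(Y >= alpha) for X ~ gamma(alpha, beta) and Y ~ Poisson(x/beta),
# with integer alpha. The parameter values are arbitrary.
from scipy.stats import gamma, poisson

alpha, beta, x = 4, 2.0, 5.0
lhs = gamma.cdf(x, a=alpha, scale=beta)    # P(X <= x)
rhs = poisson.sf(alpha - 1, mu=x / beta)   # P(Y >= alpha) = 1 - P(Y <= alpha - 1)
print(lhs, rhs)                            # the two values coincide
```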
The normal distribution or the Gaussian distribution is the one distribution which
you should know by heart!
Why is this distribution so popular?
1 Analytical tractability.
2 Bell-shaped and symmetric.
3 It is central to the Central Limit Theorem; results of this type guarantee that, under (mild) conditions, the normal distribution can be used to approximate a large variety of distributions in large samples.
The distribution has two parameters: mean and variance, denoted by $\mu$ and $\sigma^2$, respectively.
The pdf is given by
$$f(x \mid \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}\right).$$
You should memorise this!!!
where we substitute $t = (x - \mu)/\sigma$. Notice that this implies that $dt/dx = 1/\sigma$.
This shows that $P(Z \le z)$ is the standard normal cdf.
Then, we can do all calculations for the standard normal variable and then convert
these results for whatever normal random variable we have in mind.
Consider, for $Z \sim N(0, 1)$, the following:
$$E[Z] = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} z e^{-z^2/2}\,dz = -\frac{1}{\sqrt{2\pi}} e^{-z^2/2} \Big|_{-\infty}^{\infty} = 0.$$
$$E[X] = E[\mu + \sigma Z] = \mu + \sigma E[Z] = \mu + \sigma \cdot 0 = \mu.$$
Similarly,
$$\mathrm{Var}(X) = \mathrm{Var}(\mu + \sigma Z) = \sigma^2 \mathrm{Var}(Z) = \sigma^2.$$
What about
$$\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-z^2/2}\,dz = 1?$$
The integrand is symmetric about 0, so it is enough to show that the integral over $(0, \infty)$ equals $1/2$. First observe that for $w = z^2/2$ we have $dw = z\,dz$. Then, using this substitution, we obtain
$$\int_0^{\infty} \frac{1}{\sqrt{2\pi}} e^{-z^2/2}\,dz = \frac{1}{\sqrt{2\pi}} \int_0^{\infty} \frac{1}{\sqrt{2w}} e^{-w}\,dw = \frac{1}{2\sqrt{\pi}} \underbrace{\int_0^{\infty} w^{-1/2} e^{-w}\,dw}_{\Gamma(1/2)} = \frac{1}{2}.$$
Hence,
$$\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-z^2/2}\,dz = 1.$$
An important characteristic of the normal distribution is that the shape and location
of the distribution are determined completely by its two parameters.
It can be shown easily that the normal pdf has its maximum at x = µ.
The probability content within 1, 2, or 3 standard deviations of the mean is approximately 0.6827, 0.9545 and 0.9973, respectively.
$$\log X \sim N(\mu, \sigma^2).$$
$$f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\,\frac{1}{x} \exp\left(-\frac{(\log x - \mu)^2}{2\sigma^2}\right),$$
where $0 < x < \infty$, $-\infty < \mu < \infty$, and $\sigma > 0$.
How? Take $W = \log X$. We start from the distribution of $W$ and want to find the distribution of $X = \exp W$. Then, $g(W) = \exp(W)$ and $g^{-1}(X) = \log(X)$. The rest follows by using Theorem (2.1.5).
$$E[X] = \exp\left(\mu + \frac{\sigma^2}{2}\right) \quad \text{and} \quad \mathrm{Var}(X) = \exp\left[2(\mu + \sigma^2)\right] - \exp\left[2\mu + \sigma^2\right].$$
Why use this distribution? It is similar in appearance to the Gamma distribution: it
is skewed to the right. Convenient for some variables which are skewed to the right,
such as income.
But why not use Gamma instead? Lognormal is based on the normal distribution so
it allows one to use normal-theory statistics, which is technically more convenient.
When $\nu = 1$, the distribution is called the Cauchy distribution. Note that in this case, even the mean does not exist.
The use of the term “family” will be a bit less intuitive in this part.
We will consider three types of families under this heading: location families, scale
families and location-scale families.
Let’s start with the following Theorem.
Theorem (3.5.1): Let $f(x)$ be any pdf and let $\mu$ and $\sigma > 0$ be any given constants. Then the function
$$g(x \mid \mu, \sigma) = \frac{1}{\sigma} f\!\left(\frac{x - \mu}{\sigma}\right)$$
is a pdf.
Definition (3.5.2): Let $f(x)$ be any pdf. Then the family of pdfs $f(x - \mu)$, indexed by the parameter $\mu$, $-\infty < \mu < \infty$, is called the location family with standard pdf $f(x)$, and $\mu$ is called the location parameter for the family.
This simply changes the location of the distribution without changing any other
properties of it.
Figure: A location family: the standard pdf $f(x)$ and the shifted pdf $f(x - \mu)$ for $\mu = 3$.
$$P(-1 \le X \le 2 \mid \mu = 0) = P(\mu - 1 \le X \le \mu + 2 \mid \mu = 3) = P(2 \le X \le 5 \mid \mu = 3).$$
Some of the families of continuous distributions discussed here have location families.
Consider the normal distribution with some specified $\sigma > 0$:
$$f(x \mid \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{x^2}{2\sigma^2}\right).$$
By replacing $x$ with $x - \mu$, we will obtain normal distributions with different means for each particular value of $\mu$. Hence, the location family with standard pdf $f(x)$ is the set of normal distributions with unknown mean $\mu$ and known variance $\sigma^2$.
Definition (3.5.2) simply says that we can start with any pdf $f(x)$ and introduce a location parameter $\mu$ which will generate a family of pdfs.
Figure: A scale family: normal pdfs with $\sigma^2 = 1, 2, 4$.
Definition (3.5.5): Let $f(x)$ be any pdf. Then for any $\mu$, $-\infty < \mu < \infty$, and any $\sigma > 0$, the family of pdfs
$$\frac{1}{\sigma} f\!\left(\frac{x - \mu}{\sigma}\right),$$
indexed by the parameter $(\mu, \sigma)$, is called the location-scale family with standard pdf $f(x)$; $\mu$ is called the location parameter and $\sigma$ is called the scale parameter.
From the previous discussion, it is obvious that the scale parameters are used to
stretch/contract the distribution while the location parameter shifts the distribution.
The normal family is an example of location-scale families.
Figure: An example of a location-scale family: the normal distribution for varying values of µ and
σ.
Theorem (3.5.6): Let $f(\cdot)$ be any pdf. Let $\mu$ be any real number, and let $\sigma$ be any positive real number. Then $X$ is a random variable with pdf
$$\frac{1}{\sigma} f\!\left(\frac{x - \mu}{\sigma}\right)$$
if and only if there exists a random variable $Z$ with pdf $f(z)$ such that $X = \sigma Z + \mu$.
Proof: For the “only if” part, let $Z = g(X) = (X - \mu)/\sigma$, so that $g^{-1}(z) = \sigma z + \mu$ and $\frac{d}{dz} g^{-1}(z) = \sigma$. Then
$$f_Z(z) = f_X(g^{-1}(z)) \left|\frac{d}{dz} g^{-1}(z)\right| = \frac{1}{\sigma} f\!\left(\frac{\sigma z + \mu - \mu}{\sigma}\right) \sigma = f(z).$$
Also,
$$\sigma Z + \mu = \sigma g(X) + \mu = \sigma\,\frac{X - \mu}{\sigma} + \mu = X.$$
This proves the “only if” part.
$$f_Z(z) = f(z),$$
which is the same as $\frac{1}{\sigma} f\!\left(\frac{x - \mu}{\sigma}\right)$ for $\mu = 0$ and $\sigma = 1$.
Therefore, the distribution of $Z$ is that member of the location-scale family corresponding to $\mu = 0$ and $\sigma = 1$. For the normal family, remember that we have already shown that for $Z$ defined as above, $Z$ is a normally distributed random variable with $\mu = 0$ and $\sigma = 1$.
Note that probabilities for any member of a location-scale family may be computed in terms of the standard variable $Z$.
This is because
$$P(X \le x) = P\left(\frac{X - \mu}{\sigma} \le \frac{x - \mu}{\sigma}\right) = P\left(Z \le \frac{x - \mu}{\sigma}\right).$$
Consider the normal family. If we know $P\left(Z \le \frac{x - \mu}{\sigma}\right)$ for all values of $x$, $\mu$ and $\sigma$, where $Z$ has the standard normal distribution, then we can calculate $P(X \le x)$ for all values of $x$ where $X$ is a normally distributed random variable with some mean $\mu$ and variance $\sigma^2$.
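In practice this is exactly what statistical software does; a small sketch (arbitrary values of x, µ and σ):

```python
# Sketch: P(X <= x) for X ~ N(mu, sigma^2) computed via the standard normal cdf.
from scipy.stats import norm

mu, sigma, x = 1.5, 2.0, 3.0
z = (x - mu) / sigma
print(norm.cdf(z))                          # P(Z <= (x - mu)/sigma) with Z ~ N(0, 1)
print(norm.cdf(x, loc=mu, scale=sigma))     # the same probability computed directly
```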
There is another special family of densities known as the exponential family, used
very much in statistics and to some extent in econometrics, thanks to its convenient
properties. For the time being we do not cover this topic, but will return to it if time
permits.
$$M_X(t) = E[e^{tX}],$$
provided that the expectation exists for $t$ in some neighbourhood of 0. That is, there is an $h > 0$ such that, for all $t$ in $-h < t < h$, $E[e^{tX}]$ exists. If the expectation does not exist in a neighbourhood of 0, we say that the mgf does not exist.
We can write the mgf of $X$ as
$$M_X(t) = \int_{-\infty}^{\infty} e^{tx} f_X(x)\,dx \quad \text{if $X$ is continuous},$$
$$M_X(t) = \sum_{x} e^{tx} P(X = x) \quad \text{if $X$ is discrete}.$$
where we define
$$M_X^{(n)}(0) = \frac{d^n}{dt^n} M_X(t) \Big|_{t=0}.$$
Hence,
$$\frac{d}{dt} M_X(t) \Big|_{t=0} = E[X e^{tX}]\big|_{t=0} = E[X].$$
Similarly,
$$\frac{d^2}{dt^2} M_X(t) = \frac{d^2}{dt^2} \int_{-\infty}^{\infty} e^{tx} f_X(x)\,dx = \int_{-\infty}^{\infty} \frac{d^2}{dt^2} e^{tx} f_X(x)\,dx = \int_{-\infty}^{\infty} x^2 e^{tx} f_X(x)\,dx = E[X^2 e^{tX}],$$
and
$$\frac{d^2}{dt^2} M_X(t) \Big|_{t=0} = E[X^2 e^{tX}]\big|_{t=0} = E[X^2].$$
Proceeding in the same manner, it can be shown that
$$\frac{d^n}{dt^n} M_X(t) \Big|_{t=0} = E[X^n e^{tX}]\big|_{t=0} = E[X^n].$$
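As a symbolic illustration (a sketch; the Bernoulli(p) case is chosen because its mgf, M(t) = 1 − p + pe^t, follows immediately from the definition), differentiating the mgf at t = 0 recovers the moments obtained earlier:

```python
# Sketch: recover E[X] and E[X^2] from the Bernoulli(p) mgf M(t) = (1 - p) + p e^t.
import sympy as sp

t, p = sp.symbols("t p", positive=True)
M = (1 - p) + p * sp.exp(t)

EX = sp.diff(M, t).subs(t, 0)        # first derivative at 0
EX2 = sp.diff(M, t, 2).subs(t, 0)    # second derivative at 0
print(EX, EX2)   # both equal p, so Var(X) = E[X^2] - E[X]^2 = p - p^2 = p(1 - p)
```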
There is an alternative, perhaps less formal, proof which we will also go through. But first we have to informally introduce a useful tool.
Definition: If a function $g(x)$ has derivatives of order $r$, that is,
$$g^{(r)}(x) = \frac{d^r}{dx^r} g(x)$$
exists, then for any constant $a$, the Taylor expansion of order $r$ about $a$ is
$$g(x) = \sum_{i=0}^{r} \frac{g^{(i)}(a)}{i!} (x - a)^i + R,$$
$= E[X^j]$.
We will now talk about the moment generating functions of some common distributions. But first, we introduce the concept of a kernel.
Definition: The kernel of a function is the main part of the function, the part that remains when constants are disregarded.
Example (2.3.8): Remember that the gamma pdf is
$$f(x) = \frac{1}{\Gamma(\alpha)\beta^{\alpha}}\, x^{\alpha-1} e^{-x/\beta}, \quad 0 < x < \infty, \quad \alpha > 0, \quad \beta > 0.$$
Now,
$$M_X(t) = \frac{1}{\Gamma(\alpha)\beta^{\alpha}} \int_0^{\infty} e^{tx} x^{\alpha-1} e^{-x/\beta}\,dx = \frac{1}{\Gamma(\alpha)\beta^{\alpha}} \int_0^{\infty} x^{\alpha-1} \exp\left(-\left(\frac{1}{\beta} - t\right)x\right) dx = \frac{1}{\Gamma(\alpha)\beta^{\alpha}} \int_0^{\infty} x^{\alpha-1} \exp\left(-\frac{x}{\beta/(1 - \beta t)}\right) dx. \qquad (11)$$
The integrand is the kernel of a gamma pdf with parameters $\alpha$ and $\beta/(1 - \beta t)$, and $\int_0^{\infty} x^{a-1} e^{-x/b}\,dx = \Gamma(a) b^a$, so
$$M_X(t) = \frac{\Gamma(\alpha) \left(\frac{\beta}{1 - \beta t}\right)^{\alpha}}{\Gamma(\alpha)\beta^{\alpha}} = \left(\frac{1}{1 - \beta t}\right)^{\alpha} \quad \text{if } t < \frac{1}{\beta}.$$
If $t \ge 1/\beta$, then $1/\beta - t \le 0$ and the integral in (11) is infinite.
Then,
$$E[X] = \frac{d}{dt} M_X(t) \Big|_{t=0} = \frac{\alpha\beta}{(1 - \beta t)^{\alpha+1}} \Big|_{t=0} = \alpha\beta,$$
as we have shown previously.
Moments and Moment Generating Functions
Binomial mgf
$$f_X(x) = \binom{n}{x} p^x (1 - p)^{n-x}.$$
Then,
$$M_X(t) = \sum_{x=0}^{n} e^{tx} \binom{n}{x} p^x (1 - p)^{n-x} = \sum_{x=0}^{n} \binom{n}{x} (p e^t)^x (1 - p)^{n-x}.$$
By the Binomial Theorem (Theorem 3.2.2), with $x = p e^t$ and $y = 1 - p$,
$$M_X(t) = [p e^t + (1 - p)]^n.$$
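A short symbolic check (sketch) that differentiating this mgf at t = 0 gives back E[X] = np:

```python
# Sketch: differentiate the binomial mgf [p e^t + (1 - p)]^n at t = 0 to recover E[X] = np.
import sympy as sp

t, p, n = sp.symbols("t p n", positive=True)
M = (p * sp.exp(t) + 1 - p) ** n

EX = sp.simplify(sp.diff(M, t).subs(t, 0))
print(EX)   # n*p
```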
$$x^2 - 2\mu x + \mu^2 - 2\sigma^2 t x = x^2 - 2(\mu + \sigma^2 t)x + \mu^2 = \left[x - (\mu + \sigma^2 t)\right]^2 - (\mu + \sigma^2 t)^2 + \mu^2 = \left[x - (\mu + \sigma^2 t)\right]^2 - \left[2\mu\sigma^2 t + (\sigma^2 t)^2\right].$$
Therefore,
$$M_X(t) = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\left\{[x - (\mu + \sigma^2 t)]^2 - [2\mu\sigma^2 t + (\sigma^2 t)^2]\right\}/2\sigma^2}\,dx = e^{[2\mu\sigma^2 t + (\sigma^2 t)^2]/2\sigma^2} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-[x - (\mu + \sigma^2 t)]^2/2\sigma^2}\,dx.$$
Notice that
$$g(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-[x - (\mu + \sigma^2 t)]^2/2\sigma^2}$$
can be considered as the pdf of a random variable $Y \sim N(a, b)$ where $a = \mu + \sigma^2 t$ and $b = \sigma^2$.
Then
$$M_X(t) = e^{[2\mu\sigma^2 t + (\sigma^2 t)^2]/2\sigma^2} = \exp\left(\mu t + \frac{\sigma^2 t^2}{2}\right).$$
Clearly,
$$E[X] = \frac{d}{dt} M_X(t) \Big|_{t=0} = (\mu + \sigma^2 t) \exp\left(\mu t + \frac{\sigma^2 t^2}{2}\right) \Big|_{t=0} = \mu,$$
$$E[X^2] = \frac{d^2}{dt^2} M_X(t) \Big|_{t=0} = \sigma^2 \exp\left(\mu t + \frac{\sigma^2 t^2}{2}\right) \Big|_{t=0} + (\mu + \sigma^2 t)^2 \exp\left(\mu t + \frac{\sigma^2 t^2}{2}\right) \Big|_{t=0} = \sigma^2 + \mu^2,$$
$$\mathrm{Var}(X) = E[X^2] - \{E[X]\}^2 = \sigma^2 + \mu^2 - \mu^2 = \sigma^2.$$
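The same two derivatives can be taken symbolically; a minimal sketch:

```python
# Sketch: the first two moments of N(mu, sigma^2) from its mgf exp(mu t + sigma^2 t^2 / 2).
import sympy as sp

t, mu = sp.symbols("t mu", real=True)
sigma = sp.symbols("sigma", positive=True)
M = sp.exp(mu * t + sigma ** 2 * t ** 2 / 2)

EX = sp.diff(M, t).subs(t, 0)
EX2 = sp.expand(sp.diff(M, t, 2).subs(t, 0))
print(EX, EX2, sp.expand(EX2 - EX ** 2))   # mu, mu**2 + sigma**2, sigma**2
```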
Now, for $X_1 \sim f_1(x)$,
$$E[X_1^r] = e^{r^2/2}, \quad r = 0, 1, \ldots,$$
so $X_1$ has all of its moments.
Now take $X_2 \sim f_2(x)$. Then,
$$E[X_2^r] = \int_0^{\infty} x^r f_1(x)[1 + \sin(2\pi \log x)]\,dx = E[X_1^r] + \int_0^{\infty} x^r f_1(x) \sin(2\pi \log x)\,dx.$$
It can be shown that the integral is actually equal to zero for r = 0, 1, ... .
Hence, even though X1 and X2 have distinct pdfs, they have the same moments for
all r .
$$\lim_{i \to \infty} F_{X_i}(x) = F_X(x).$$
That is, convergence, for $|t| < h$, of mgfs to an mgf implies convergence of cdfs.
$$\lim_{n \to \infty} M_{X_n}(t) = \lim_{n \to \infty} \left[1 + \frac{1}{n}(e^t - 1)\lambda\right]^n = \lim_{n \to \infty} \left[1 + \frac{a_n}{n}\right]^n = \exp\left[\lambda(e^t - 1)\right] = M_Y(t),$$
where $a_n = \lambda(e^t - 1)$.
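A numerical sketch of this convergence (λ, n and the grid of values are arbitrary), comparing binomial(n, λ/n) probabilities with Poisson(λ) probabilities:

```python
# Sketch: binomial(n, lambda/n) probabilities approach Poisson(lambda) probabilities as n grows.
import numpy as np
from scipy.stats import binom, poisson

lam = 3.0
k = np.arange(0, 15)
for n in (10, 100, 10_000):
    gap = np.max(np.abs(binom.pmf(k, n, lam / n) - poisson.pmf(k, lam)))
    print(n, gap)   # the maximum gap shrinks as n increases
```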
We finish the discussion of moment generating functions with the following Theorem.
Theorem (2.3.15): For any constants $a$ and $b$, the mgf of the random variable $aX + b$ is given by
$$M_{aX+b}(t) = e^{bt} M_X(at).$$
Proof: This is pretty easy to show;
$$M_{aX+b}(t) = E\left[e^{(aX+b)t}\right] = E\left[e^{(aX)t} e^{bt}\right] = e^{bt} E\left[e^{(at)X}\right] = e^{bt} M_X(at),$$
where the last line follows from the definition of the moment generating function.
It is now more or less clear that using moment generating functions is not the most straightforward way to determine the distribution function. However, there are other “generating functions” we can consider.
Definition: The characteristic function of $X$ is defined by
$$\varphi_X(t) = E[e^{itX}],$$
where $i = \sqrt{-1}$.
if and only if
$$\lim_{k \to \infty} F_{X_k}(x) = F_X(x),$$
at every $x$ where $F_X(x)$ is continuous. Here, $F_{X_k}(x)$ and $F_X(x)$ are the cdfs for $X_k$ and $X$, respectively.
$$K_X(t) = \log E[e^{tX}] = \log M_X(t).$$
The cumulant generating function is, obviously, closely related to the moment generating function. In fact, $\kappa_n(X)$ can be expressed in terms of moments (and vice versa). For example,
$$E[X] = \kappa_1(X),$$
$$E[X^2] = \kappa_2(X) + [\kappa_1(X)]^2,$$
$$E[X^3] = \kappa_3(X) + 3\kappa_2(X)\kappa_1(X) + [\kappa_1(X)]^3,$$
and so on.
So the expected value and the variance correspond to the first two cumulants!
Depending on the question at hand, it might be more convenient to work with
cumulant or moment generating functions. In any case, information on the
cumulants can give us information on the moments (and vice versa).
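As a small symbolic sketch of the relationship (using the standard definition $\kappa_n(X) = K_X^{(n)}(0)$, the $n$th derivative of the cumulant generating function at zero), take the Poisson($\lambda$) case, whose mgf $\exp[\lambda(e^t - 1)]$ was given above: then $K_X(t) = \lambda(e^t - 1)$, so every cumulant equals $\lambda$, matching $E[X] = \lambda$ and $\mathrm{Var}(X) = \lambda$ found earlier.

```python
# Sketch: cumulants as derivatives of K_X(t) = log M_X(t) at t = 0, for the Poisson(lambda) case.
import sympy as sp

t, lam = sp.symbols("t lambda", positive=True)
M = sp.exp(lam * (sp.exp(t) - 1))    # Poisson mgf
K = sp.log(M)                        # cumulant generating function, lambda*(e^t - 1)

kappas = [sp.simplify(sp.diff(K, t, n).subs(t, 0)) for n in (1, 2, 3)]
print(kappas)   # [lambda, lambda, lambda]; kappa_1 = E[X] and kappa_2 = Var(X)
```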
A detailed reference for cumulant generating functions is “Tensor Methods in
Statistics” by McCullagh (1987).