Slides 2
In this part of the lectures, our interest is in the behaviour of functions of a random variable, X, whose cdf we know.
This part is heavily based on Casella & Berger, Chapters 2 and 3.
Specifically, we cover material from Sections 2.1, 2.2, 2.3, 2.6, 3.1, 3.2, 3.3 and 3.5.
Therefore, the mapping $g^{-1}(\cdot)$ takes sets into sets; that is, $g^{-1}(A)$ is the set of points in $\mathcal{X}$ that $g(x)$ takes into the set $A$.
If $A = \{y\}$, a point set, then
$$g^{-1}(\{y\}) = \{x \in \mathcal{X} : g(x) = y\}.$$
$$P(Y \in A) = P(g(X) \in A) = P(\{x \in \mathcal{X} : g(x) \in A\}) = P(X \in g^{-1}(A)), \qquad (2)$$
where the last line follows from (1). This defines the probability distribution of $Y$.
$$f_Y(z) = P(Y = z) = \sum_{q \in g^{-1}(z)} P(X = q) = \sum_{q \in g^{-1}(z)} f_X(q), \quad \text{for } z \in \mathcal{Y},$$
and $f_Y(z) = 0$ for $z \notin \mathcal{Y}$.
Note that $\sum_{q \in g^{-1}(z)} P(X = q)$ simply amounts to adding up the probabilities $P(X = q)$ for all $q$ such that $g(q) = z$.
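As a quick illustration of this summation (a minimal sketch; the pmf of X and the mapping g below are made-up values, not from the slides), one can build the pmf of Y = g(X) directly:

```python
# Sketch: pmf of Y = g(X) for a discrete X, obtained by summing f_X(q) over all q with g(q) = z.
# The pmf of X and the mapping g are illustrative choices only.
from collections import defaultdict

f_X = {-2: 0.1, -1: 0.2, 0: 0.4, 1: 0.2, 2: 0.1}   # pmf of X on its support
g = lambda x: x ** 2                                # the mapping Y = g(X)

f_Y = defaultdict(float)
for q, prob in f_X.items():
    f_Y[g(q)] += prob        # add up P(X = q) for every q such that g(q) = z

print(dict(f_Y))             # e.g. {4: 0.2, 1: 0.4, 0: 0.4}; the probabilities still sum to one
```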
$$f_Y(y) = \sum_{x \in g^{-1}(y)} f_X(x) = \binom{n}{y} (1-p)^y p^{\,n-y}, \quad y = 0, 1, \ldots, n.$$
Behind all these dry results, our main objective actually is to obtain simple formulas
for the cdf and pdf of Y in terms of the mapping g ( ) and the cdf and pdf of X .
Now, the cdf of $Y = g(X)$ is
$$F_Y(y) = P(Y \le y) = P(g(X) \le y) = P(\{x \in \mathcal{X} : g(x) \le y\}) = \int_{\{x \in \mathcal{X}:\, g(x) \le y\}} f_X(x)\,dx.$$
where $\beta > 0$ and $n$ is a positive integer. We want to find the pdf of $g(X) = 1/X$.
Note that here the support sets $\mathcal{X}$ and $\mathcal{Y}$ are both the interval $(0, \infty)$. Now, $y = g(x) = 1/x$, and so $g^{-1}(y) = 1/y$ and $\frac{d}{dy} g^{-1}(y) = -1/y^2$. By Theorem (2.1.5), for $y \in (0, \infty)$, we obtain
$$f_Y(y) = f_X(g^{-1}(y)) \left| \frac{d}{dy} g^{-1}(y) \right| = \frac{1}{(n-1)!\,\beta^n} \left(\frac{1}{y}\right)^{n-1} e^{-1/(\beta y)}\, \frac{1}{y^2} = \frac{1}{(n-1)!\,\beta^n} \left(\frac{1}{y}\right)^{n+1} e^{-1/(\beta y)}.$$
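As a sanity check on this derivation (not in the original notes), one can simulate draws of X from the gamma pdf above, transform them to Y = 1/X, and compare a histogram with the formula; the values of n and β below are arbitrary.

```python
# Sketch: Monte Carlo check of the derived pdf of Y = 1/X when X ~ gamma(n, beta).
# n and beta are arbitrary illustrative values.
import numpy as np
from math import factorial

n, beta = 4, 2.0
rng = np.random.default_rng(0)
x = rng.gamma(shape=n, scale=beta, size=200_000)   # f_X(x) = x^(n-1) e^(-x/beta) / ((n-1)! beta^n)
y = 1.0 / x

def f_Y(v):
    # the pdf derived above: (1/((n-1)! beta^n)) (1/v)^(n+1) exp(-1/(beta v))
    return (1.0 / (factorial(n - 1) * beta**n)) * (1.0 / v) ** (n + 1) * np.exp(-1.0 / (beta * v))

hist, edges = np.histogram(y, bins=60, range=(0.02, 0.6), density=True)
mids = 0.5 * (edges[:-1] + edges[1:])
print(np.max(np.abs(hist - f_Y(mids))))   # small relative to the pdf height: simulation and formula agree
```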
Because $X$ is continuous, we can drop the equality from the left endpoint and obtain
$$F_Y(y) = P(-\sqrt{y} < X \le \sqrt{y}) = P(X \le \sqrt{y}) - P(X \le -\sqrt{y}) = F_X(\sqrt{y}) - F_X(-\sqrt{y}).$$
Then,
$$f_Y(y) = \frac{d}{dy} F_Y(y) = \frac{d}{dy}\left[F_X(\sqrt{y}) - F_X(-\sqrt{y})\right] = \frac{1}{2\sqrt{y}} f_X(\sqrt{y}) + \frac{1}{2\sqrt{y}} f_X(-\sqrt{y}).$$
Importantly, the pdf of Y consists of two pieces which represent the intervals where
g (x ) = x 2 is monotone.
Distributions of Functions of a Random Variable
The above idea can be extended to cases where we need a larger number of intervals
to obtain, if you like, interval-by-interval monotonicity.
Theorem (2.1.8): Let $X$ have pdf $f_X(x)$, let $Y = g(X)$ and define the sample space $\mathcal{X}$ as in (3). Suppose there exists a partition $A_0, A_1, A_2, \ldots, A_k$ of $\mathcal{X}$ such that $P(X \in A_0) = 0$ and $f_X(x)$ is continuous on each $A_i$. Further, suppose there exist functions $g_1(x), \ldots, g_k(x)$, defined on $A_1, \ldots, A_k$ respectively, satisfying
1 $g(x) = g_i(x)$, for $x \in A_i$,
2 $g_i(x)$ is monotone on $A_i$,
3 the set $\mathcal{Y} = \{y : y = g_i(x) \text{ for some } x \in A_i\}$ is the same for each $i = 1, \ldots, k$,
4 $g_i^{-1}(y)$ has a continuous derivative on $\mathcal{Y}$, for each $i = 1, \ldots, k$.
Then,
$$f_Y(y) = \begin{cases} \sum_{i=1}^{k} f_X\!\left(g_i^{-1}(y)\right) \left| \frac{d}{dy} g_i^{-1}(y) \right| & y \in \mathcal{Y} \\ 0 & \text{otherwise} \end{cases}.$$
$A_0 = \{0\}$;
$A_1 = (-\infty, 0)$, $g_1(x) = x^2$, $g_1^{-1}(y) = -\sqrt{y}$;
$A_2 = (0, \infty)$, $g_2(x) = x^2$, $g_2^{-1}(y) = \sqrt{y}$.
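As an illustration (a sketch, with X taken to be standard normal purely for concreteness), the two-branch formula above reproduces the pdf of the chi-squared distribution with 1 degree of freedom for Y = X²:

```python
# Sketch: Theorem (2.1.8) applied to Y = X^2 with X ~ N(0, 1) (illustrative choice of f_X).
# f_Y(y) = f_X(-sqrt(y)) |d/dy(-sqrt(y))| + f_X(sqrt(y)) |d/dy(sqrt(y))|
import numpy as np
from scipy.stats import norm, chi2

y = np.linspace(0.05, 5.0, 200)
f_Y = norm.pdf(-np.sqrt(y)) / (2 * np.sqrt(y)) + norm.pdf(np.sqrt(y)) / (2 * np.sqrt(y))

print(np.allclose(f_Y, chi2.pdf(y, df=1)))   # True: Y = X^2 has a chi-squared(1) distribution
```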
We finish with another important result known as the Probability Integral Transform.
Theorem (2.1.10): Let $X$ have continuous cdf $F_X(x)$ and define the random variable $Y$ as $Y = F_X(X)$. Then, $Y$ is uniformly distributed on $(0, 1)$, that is,
$$P(Y \le y) = y, \quad 0 < y < 1.$$
$$P(Y \le y) = P(F_X(X) \le y) = P(F_X^{-1}[F_X(X)] \le F_X^{-1}(y)) = P(X \le F_X^{-1}(y)) = F_X(F_X^{-1}(y)) = y.$$
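A small simulation sketch of the theorem (the exponential distribution and its scale are arbitrary choices): plugging X into its own cdf yields draws that behave like Uniform(0, 1).

```python
# Sketch: probability integral transform, Y = F_X(X) ~ Uniform(0, 1).
# An exponential X with an arbitrary scale is used for illustration.
import numpy as np
from scipy.stats import expon, kstest

rng = np.random.default_rng(1)
x = expon.rvs(scale=3.0, size=100_000, random_state=rng)
u = expon.cdf(x, scale=3.0)             # Y = F_X(X)

print(u.mean(), u.var())                # close to 1/2 and 1/12, the Uniform(0, 1) moments
print(kstest(u, "uniform").pvalue)      # large p-value: consistent with Uniform(0, 1)
```

Read in the other direction, this is the basis of inverse-transform sampling: applying $F_X^{-1}$ to Uniform(0, 1) draws produces draws from $F_X$.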
In this section, we will introduce one of the most widely used concepts in
econometrics, the expected value.
As we will see in more detail, this is one of the moments that a random variable can
possess.
This concept is akin to the concept of “average.” The standard “average” is an
arithmetic average where all available observations are weighted equally.
The expected value, on the other hand, is the average of all possible values a
random variable can take, weighted by the probability distribution.
The question is: what value would we expect the random variable to take, on average?
provided that the integral or sum exists. If $E[|g(X)|] = \infty$, we say that $E[g(X)]$ does not exist.
In both cases, the idea is that we are taking the average of g (x ) over all of its
possible values (x 2 X ), where these values are weighted by the respective value of
the pdf, fX (x ).
To obtain this result, we use a method called integration by parts. This is based on
$$\int u\,dv = uv - \int v\,du.$$
Then, taking
$$u = x, \quad du = dx, \qquad v = -e^{-x/\lambda}, \quad dv = \lambda^{-1} e^{-x/\lambda}\,dx,$$
gives (4).
Finally,
$$\int_0^{\infty} e^{-x/\lambda}\,dx = -\lambda e^{-x/\lambda}\Big|_0^{\infty} = \lambda.$$
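The same computation can be checked symbolically; a minimal sympy sketch using the exponential pdf $(1/\lambda)e^{-x/\lambda}$:

```python
# Sketch: symbolic check that E[X] = lambda for the exponential pdf (1/lambda) e^(-x/lambda).
import sympy as sp

x, lam = sp.symbols("x lambda", positive=True)
pdf = sp.exp(-x / lam) / lam
EX = sp.integrate(x * pdf, (x, 0, sp.oo))   # the integration by parts is done internally
print(sp.simplify(EX))                      # lambda
```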
$$E[a + bX] = a + E[bX] = a + bE[X] = a + b\mu.$$
But we know that $g_1(x) - g_2(x) \ge 0$ for all $x$! So, we are integrating a function which is nonnegative everywhere, weighted by $f_X(x) \ge 0$ for all $x$.
Hence, $E[g_1(X)] - E[g_2(X)] \ge 0$. The rest can be shown similarly.
Let’s consider the following examples. We will first find the pdf of the inverted gamma distribution. Then, we will use this result to find the expectation of a random variable which has the inverted gamma distribution.
Another widely used moment is the variance of a random variable. Obviously, this
moment measures the variation/dispersion/spread of the random variable (around its
expectation).
While the expectation is usually denoted by $\mu$, $\sigma^2$ is generally used for variance.
Variance is a second-order moment. If available, higher order moments of a random
variable can be calculated, as well.
For example, the third and fourth moments are concerned with how symmetric and
fat-tailed the underlying distribution is. We will talk more about these.
The $n$th moment of $X$ (about the origin) is $\mu_n' = E[X^n]$, while the $n$th central moment is
$$\mu_n = E[(X - \mu)^n],$$
where $\mu = \mu_1' = E[X]$.
Let us digress for a moment and briefly mention another important concept: covariance. When it exists, the covariance of two random variables $X$ and $Y$ is defined as
$$\mathrm{Cov}(X, Y) = E\left(\{X - E[X]\}\{Y - E[Y]\}\right).$$
We will talk in more detail about this when we deal with multivariate random variables. For the time being, suffice it to say that the covariance is a measure of the co-movement between two random variables.
We will be dealing with moments in more detail. However, this is a good time to
stop and review some of the commonly used distributions.
Usually, one deals with families of distributions. Families of distributions are indexed
by one or more parameters, which allow one to vary certain characteristics of the
distribution while staying within one functional form.
To give an example, one of the most commonly employed distributions, the normal distribution, has two parameters, the mean and the variance, denoted by $\mu$ and $\sigma^2$, respectively.
Although one might know the value of σ2 for the random variable at hand, the
actual value of µ might be unknown.
In that case, the distribution will be indexed by µ, and the behaviour of the random
variable will change as µ varies.
Figure: Normal pdfs for $\sigma^2 = 1, 2, 4$.
This part will also provide a good opportunity to put some of the abstract concepts
we have learned into action.
We start with discrete distributions.
A random variable X is said to have a discrete distribution if the range of X , the
sample space, is countable.
Using (6) we can now calculate the mean and the variance.
Now,
$$E[X] = \sum_{x=1}^{N} x\,P(X = x \mid N) = \sum_{x=1}^{N} x\,\frac{1}{N} = \frac{1}{N}\frac{N(N+1)}{2} = \frac{N+1}{2},$$
and
$$E[X^2] = \sum_{x=1}^{N} x^2\,P(X = x \mid N) = \sum_{x=1}^{N} x^2\,\frac{1}{N} = \frac{(N+1)(2N+1)}{6}.$$
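A quick numerical check of these two formulas (N is an arbitrary choice), together with the implied variance $E[X^2] - (E[X])^2$:

```python
# Sketch: check E[X] = (N+1)/2 and E[X^2] = (N+1)(2N+1)/6 for the discrete uniform on {1,...,N}.
N = 10
xs = range(1, N + 1)
EX = sum(x / N for x in xs)
EX2 = sum(x ** 2 / N for x in xs)

print(EX, (N + 1) / 2)                     # 5.5  5.5
print(EX2, (N + 1) * (2 * N + 1) / 6)      # 38.5  38.5
print(EX2 - EX ** 2)                       # the variance, 8.25 for N = 10
```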
$$E[X] = 1 \cdot p + 0 \cdot (1 - p) = p,$$
and
$$\mathrm{Var}(X) = (1 - p)^2 p + (0 - p)^2 (1 - p) = p(1 - p).$$
Examples:
1 Tossing a coin (p=probability of a head, X = 1 if heads)
2 Roulette (X = 1 if red occurs, p=probability of red)
3 Election polls (X = 1 if candidate A gets vote)
4 Incidence of disease (p=probability that a random person gets infected)
Each particular sequence of outcomes containing $y$ successes and $n - y$ failures has probability $p^y (1-p)^{n-y}$. Since there are $\binom{n}{y}$ such sequences,
$$P(Y = y \mid n, p) = \binom{n}{y} p^y (1-p)^{n-y}, \quad y = 0, 1, 2, \ldots, n,$$
and $Y$ is called a binomial$(n, p)$ random variable.
Common Families of Distributions
Discrete Distributions: Binomial Distribution
The following theorem, called the “Binomial Theorem,” is a useful result which can be used to show that $\sum_{y=0}^{n} P(Y = y) = 1$ for the binomial$(n, p)$ random variable mentioned above.
Theorem (3.2.2): For any real numbers $x$ and $y$ and integer $n \ge 0$,
$$(x + y)^n = \sum_{i=0}^{n} \binom{n}{i} x^i y^{n-i}.$$
Setting $x = p$ and $y = 1 - p$ gives $\sum_{y=0}^{n} \binom{n}{y} p^y (1-p)^{n-y} = (p + 1 - p)^n = 1$.
Let’s calculate the mean and variance for the Binomial Distribution.
Example (2.2.3):
$$E[X] = \sum_{x=0}^{n} x \binom{n}{x} p^x (1-p)^{n-x} = \sum_{x=1}^{n} x \binom{n}{x} p^x (1-p)^{n-x},$$
since
$$x \binom{n}{x} = x \frac{n!}{x!(n-x)!} = n \frac{(n-1)!}{(x-1)!(n-x)!} = n \binom{n-1}{x-1}.$$
Then,
$$E[X] = \sum_{x=1}^{n} n \binom{n-1}{x-1} p^x (1-p)^{n-x} = \sum_{y=0}^{n-1} n \binom{n-1}{y} p^{y+1} (1-p)^{n-(y+1)} = np \sum_{y=0}^{n-1} \binom{n-1}{y} p^y (1-p)^{n-1-y} = np.$$
Observe that
$$x^2 \binom{n}{x} = x^2 \frac{n!}{(n-x)!\,x!} = x\,n \frac{(n-1)!}{(n-x)!\,(x-1)!} = x\,n \binom{n-1}{x-1}.$$
Then,
$$\sum_{x=0}^{n} x^2 \binom{n}{x} p^x (1-p)^{n-x} = \sum_{x=1}^{n} x\,n \binom{n-1}{x-1} p^x (1-p)^{n-x} = n \sum_{y=0}^{n-1} (y+1) \binom{n-1}{y} p^{y+1} (1-p)^{n-y-1}$$
$$= np \sum_{y=0}^{n-1} y \binom{n-1}{y} p^y (1-p)^{n-y-1} + np \sum_{y=0}^{n-1} \binom{n-1}{y} p^y (1-p)^{n-y-1}.$$
Think about the first sum first. This is equal to $E[Z]$ for $Z \sim$ binomial$(n-1, p)$.
What about the second sum? Following a similar reasoning as above, it is equal to one.
Hence,
$$E[X^2] = np \underbrace{\sum_{y=0}^{n-1} y \binom{n-1}{y} p^y (1-p)^{n-1-y}}_{(n-1)p} + np \underbrace{\sum_{y=0}^{n-1} \binom{n-1}{y} p^y (1-p)^{n-1-y}}_{1} = n(n-1)p^2 + np.$$
Finally,
$$\mathrm{Var}(X) = E[X^2] - \{E[X]\}^2 = n(n-1)p^2 + np - (np)^2 = np(1 - p).$$
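A direct numerical check of $E[X] = np$ and $\mathrm{Var}(X) = np(1-p)$ (the values of n and p are arbitrary):

```python
# Sketch: verify the binomial mean and variance formulas by direct summation over the pmf.
from math import comb

n, p = 12, 0.3
pmf = [comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(n + 1)]
EX = sum(x * f for x, f in enumerate(pmf))
EX2 = sum(x ** 2 * f for x, f in enumerate(pmf))

print(EX, n * p)                        # both approximately 3.6
print(EX2 - EX ** 2, n * p * (1 - p))   # both approximately 2.52
```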
Example (3.2.3) - cont’d: Now we consider another game; throw a pair of dice 24 times and ask for the probability of at least one double 6. This, again, can be modelled by the binomial distribution with success probability $p$, where
$$p = P(\text{roll a double 6}) = \frac{1}{36}.$$
So, if $Y$ = number of double 6s in 24 rolls, $Y \sim$ binomial$(24, 1/36)$ and
$$P(\text{at least one double 6}) = P(Y \ge 1) = 1 - P(Y = 0) = 1 - \left(\frac{35}{36}\right)^{24} \approx 0.491.$$
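A one-line check of this number (a sketch using scipy's binomial pmf):

```python
# Sketch: probability of at least one double 6 in 24 throws of a pair of dice.
from scipy.stats import binom

print(1 - binom.pmf(0, 24, 1 / 36))   # about 0.4914, i.e. 1 - (35/36)**24
```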
A random variable $X$, taking values in the nonnegative integers, has a Poisson$(\lambda)$ distribution if
$$P(X = x \mid \lambda) = \frac{e^{-\lambda} \lambda^x}{x!}, \quad x = 0, 1, \ldots.$$
Is $\sum_{x=0}^{\infty} P(X = x \mid \lambda) = 1$? The Taylor series expansion of $e^y$ is given by
$$e^y = \sum_{i=0}^{\infty} \frac{y^i}{i!}.$$
Don’t worry about this for the moment, if you are not familiar with this expansion.
We will cover it later in the course.
So,
$$\sum_{x=0}^{\infty} \frac{e^{-\lambda} \lambda^x}{x!} = e^{-\lambda} \underbrace{\sum_{x=0}^{\infty} \frac{\lambda^x}{x!}}_{e^{\lambda}} = 1.$$
Moreover,
$$E[X] = \sum_{x=0}^{\infty} x\,\frac{e^{-\lambda} \lambda^x}{x!} = \lambda \sum_{x=1}^{\infty} \frac{e^{-\lambda} \lambda^{x-1}}{(x-1)!} = \lambda \sum_{y=0}^{\infty} \frac{e^{-\lambda} \lambda^y}{y!} = \lambda,$$
where we substituted $y = x - 1$.
Similarly,
$$E[X^2] = \sum_{x=0}^{\infty} x^2\,\frac{e^{-\lambda} \lambda^x}{x!} = \lambda \sum_{x=1}^{\infty} x\,\frac{e^{-\lambda} \lambda^{x-1}}{(x-1)!} = \lambda \sum_{x=1}^{\infty} (x-1)\,\frac{e^{-\lambda} \lambda^{x-1}}{(x-1)!} + \lambda \sum_{x=1}^{\infty} \frac{e^{-\lambda} \lambda^{x-1}}{(x-1)!}$$
$$= \lambda \sum_{y=0}^{\infty} y\,\frac{e^{-\lambda} \lambda^y}{y!} + \lambda e^{-\lambda} \underbrace{\sum_{y=0}^{\infty} \frac{\lambda^y}{y!}}_{e^{\lambda}},$$
where $\sum_{y=0}^{\infty} y\,\frac{e^{-\lambda}\lambda^y}{y!} = E[Y]$ for $Y \sim$ Poisson$(\lambda)$. Therefore,
$$E[X^2] = \lambda^2 + \lambda.$$
Thus,
$$\mathrm{Var}(X) = E[X^2] - \mu^2 = \lambda^2 + \lambda - \lambda^2 = \lambda.$$
Notice that
$$\Gamma(1) = \int_0^{\infty} e^{-t}\,dt = 1,$$
and so, for any integer $n > 0$,
$$\Gamma(n) = (n-1)!.$$
Finally,
$$\Gamma(1/2) = \sqrt{\pi}.$$
These results make life a lot easier when calculating Γ(c ) for some positive c.
Figure: Gamma pdfs for $(\alpha, \beta) = (1, 2), (2, 2), (3, 2), (5, 1), (9, 0.5)$.
Some interesting distribution functions are special cases of the gamma distribution.
For example, for $\alpha = p/2$, where $p$ is an integer, and $\beta = 2$,
$$f(x \mid \alpha, \beta) = f(x \mid p) = \frac{1}{\Gamma(p/2)\,2^{p/2}}\, x^{(p/2)-1} e^{-x/2}, \quad 0 < x < \infty.$$
This is known as the pdf of the $\chi^2_{(p)}$ distribution, read “chi-squared distribution with $p$ degrees of freedom”.
Obviously, for $X \sim \chi^2_{(p)}$, $E[X] = \alpha\beta = p$ and $\mathrm{Var}(X) = \alpha\beta^2 = 2p$.
This distribution is used heavily in hypothesis testing. You will come back to this
distribution next term.
Now consider $\alpha = 1$:
$$f(x \mid \alpha, \beta) = f(x \mid 1, \beta) = \frac{1}{\Gamma(1)\beta^1}\, x^0 e^{-x/\beta} = \frac{1}{\beta} e^{-x/\beta}, \quad 0 < x < \infty.$$
This is the exponential distribution, with $E[X] = \beta$ and $\mathrm{Var}(X) = \beta^2$.
$$P(X > s \mid X > t) = e^{-(s-t)/\beta} = P(X > s - t).$$
This is because
$$\int_{s-t}^{\infty} \frac{1}{\beta} e^{-x/\beta}\,dx = -e^{-x/\beta}\Big|_{x=s-t}^{\infty} = e^{-(s-t)/\beta}.$$
What does this mean? When calculating P (X > s jX > t ), what matters is not
whether X has passed a threshold or not. What matters is the distance between the
threshold and the value to be reached.
If Mr X has been fired twice, what is the probability that he will be fired three times? It is not different from the probability that a person, who has never been fired, is fired. History does not matter.
$$P(X \le x) = P(Y \ge \alpha). \qquad (7)$$
Let's sketch the proof of this result. Remember that if $\alpha$ is an integer, then $\Gamma(\alpha) = (\alpha - 1)!$.
Hence, using integration by parts,
$$P(X \le x) = \frac{1}{(\alpha-1)!\,\beta^{\alpha}} \int_0^x t^{\alpha-1} e^{-t/\beta}\,dt = \frac{1}{(\alpha-1)!\,\beta^{\alpha}} \left[ -t^{\alpha-1} \beta e^{-t/\beta} \Big|_{t=0}^{x} + \int_0^x (\alpha-1) t^{\alpha-2} \beta e^{-t/\beta}\,dt \right]$$
$$= -\frac{(x/\beta)^{\alpha-1} e^{-x/\beta}}{(\alpha-1)!} + \frac{1}{(\alpha-2)!\,\beta^{\alpha-1}} \int_0^x t^{\alpha-2} e^{-t/\beta}\,dt = -P(Y = \alpha - 1) + \frac{1}{(\alpha-2)!\,\beta^{\alpha-1}} \int_0^x t^{\alpha-2} e^{-t/\beta}\,dt,$$
where $Y \sim$ Poisson$(x/\beta)$. If we keep doing the same operation, we will eventually obtain (7).
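The relationship (7) is easy to verify numerically for an integer α; a sketch with arbitrary parameter values:

```python
# Sketch: check P(X <= x) = P(Y >= alpha) for X ~ gamma(alpha, beta) and Y ~ Poisson(x/beta),
# with integer alpha. The parameter values are arbitrary.
from scipy.stats import gamma, poisson

alpha, beta, x = 4, 2.0, 5.0
lhs = gamma.cdf(x, a=alpha, scale=beta)    # P(X <= x)
rhs = poisson.sf(alpha - 1, mu=x / beta)   # P(Y >= alpha) = 1 - P(Y <= alpha - 1)
print(lhs, rhs)                            # the two values coincide
```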
The normal distribution or the Gaussian distribution is the one distribution which
you should know by heart!
Why is this distribution so popular?
1 Analytical tractability.
2 Bell-shaped and symmetric.
3 It is central to the Central Limit Theorem; results of this type guarantee that, under (mild) conditions, the normal distribution can be used to approximate a large variety of distributions in large samples.
The distribution has two parameters: mean and variance, denoted by $\mu$ and $\sigma^2$, respectively.
The pdf is given by
$$f(x \mid \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}\right).$$
You should memorise this!!!
where we substitute $t = (x - \mu)/\sigma$. Notice that this implies that $dt/dx = 1/\sigma$.
This shows that $P(Z \le z)$ is the standard normal cdf.
Then, we can do all calculations for the standard normal variable and then convert
these results for whatever normal random variable we have in mind.
Consider, for $Z \sim N(0, 1)$, the following:
$$E[Z] = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} z e^{-z^2/2}\,dz = -\frac{1}{\sqrt{2\pi}} e^{-z^2/2} \Big|_{-\infty}^{\infty} = 0.$$
$$E[X] = E[\mu + \sigma Z] = \mu + \sigma E[Z] = \mu + \sigma \cdot 0 = \mu.$$
Similarly,
$$\mathrm{Var}(X) = \mathrm{Var}(\mu + \sigma Z) = \sigma^2 \mathrm{Var}(Z) = \sigma^2.$$
What about
$$\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-z^2/2}\,dz = 1?$$
The integrand is symmetric about 0, so it is enough to show that the integral over $(0, \infty)$ equals $1/2$. First observe that for $w = z^2/2$ we have $dw = z\,dz$. Then, using this substitution, we obtain
$$\int_0^{\infty} \frac{1}{\sqrt{2\pi}} e^{-z^2/2}\,dz = \frac{1}{\sqrt{2\pi}} \int_0^{\infty} \frac{1}{\sqrt{2w}} e^{-w}\,dw = \frac{1}{2\sqrt{\pi}} \underbrace{\int_0^{\infty} w^{-1/2} e^{-w}\,dw}_{\Gamma(1/2)} = \frac{1}{2}.$$
Hence,
$$\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-z^2/2}\,dz = 1.$$
An important characteristic of the normal distribution is that the shape and location
of the distribution are determined completely by its two parameters.
It can be shown easily that the normal pdf has its maximum at x = µ.
The probability content within 1, 2, or 3 standard deviations of the mean is approximately 0.6827, 0.9545 and 0.9973, respectively.
$$\log X \sim N(\mu, \sigma^2).$$
$$f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\,\frac{1}{x} \exp\left(-\frac{(\log x - \mu)^2}{2\sigma^2}\right),$$
where $0 < x < \infty$, $-\infty < \mu < \infty$, and $\sigma > 0$.
How? Take $W = \log X$. We start from the distribution of $W$ and want to find the distribution of $X = \exp W$. Then, $g(W) = \exp(W)$ and $g^{-1}(X) = \log(X)$. The rest follows by using Theorem (2.1.5).
$$E[X] = \exp\left(\mu + \frac{\sigma^2}{2}\right) \quad \text{and} \quad \mathrm{Var}(X) = \exp\left[2(\mu + \sigma^2)\right] - \exp\left[2\mu + \sigma^2\right].$$
Why use this distribution? It is similar in appearance to the Gamma distribution: it
is skewed to the right. Convenient for some variables which are skewed to the right,
such as income.
But why not use Gamma instead? Lognormal is based on the normal distribution so
it allows one to use normal-theory statistics, which is technically more convenient.
When $\nu = 1$, the distribution is called the Cauchy distribution. Note that in this case, even the mean does not exist.
The use of the term “family” will be a bit less intuitive in this part.
We will consider three types of families under this heading: location families, scale
families and location-scale families.
Let’s start with the following Theorem.
Theorem (3.5.1): Let $f(x)$ be any pdf and let $\mu$ and $\sigma > 0$ be any given constants. Then the function
$$g(x \mid \mu, \sigma) = \frac{1}{\sigma} f\!\left(\frac{x - \mu}{\sigma}\right)$$
is a pdf.
Definition (3.5.2): Let $f(x)$ be any pdf. Then the family of pdfs $f(x - \mu)$, indexed by the parameter $\mu$, $-\infty < \mu < \infty$, is called the location family with standard pdf $f(x)$, and $\mu$ is called the location parameter for the family.
This simply changes the location of the distribution without changing any other
properties of it.
Figure: A location family: the standard pdf $f(x)$ and the shifted pdf $f(x - \mu)$ for $\mu = 3$.
$$P(-1 \le X \le 2 \mid \mu = 0) = P(\mu - 1 \le X \le \mu + 2 \mid \mu = 3) = P(2 \le X \le 5 \mid \mu = 3).$$
Some of the families of continuous distributions discussed here have location families.
Consider the normal distribution with some specified $\sigma > 0$:
$$f(x \mid \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{x^2}{2\sigma^2}\right).$$
By replacing $x$ with $x - \mu$, we will obtain normal distributions with different means for each particular value of $\mu$. Hence, the location family with standard pdf $f(x)$ is the set of normal distributions with unknown mean $\mu$ and known variance $\sigma^2$.
Definition (3.5.2) simply says that we can start with any pdf $f(x)$ and introduce a location parameter $\mu$ which will generate a family of pdfs.
Figure: A scale family: normal pdfs with $\sigma^2 = 1, 2, 4$.
Definition (3.5.5): Let $f(x)$ be any pdf. Then for any $\mu$, $-\infty < \mu < \infty$, and any $\sigma > 0$, the family of pdfs
$$\frac{1}{\sigma} f\!\left(\frac{x - \mu}{\sigma}\right),$$
indexed by the parameter $(\mu, \sigma)$, is called the location-scale family with standard pdf $f(x)$; $\mu$ is called the location parameter and $\sigma$ is called the scale parameter.
From the previous discussion, it is obvious that the scale parameters are used to
stretch/contract the distribution while the location parameter shifts the distribution.
The normal family is an example of location-scale families.
Figure: An example of a location-scale family: the normal distribution for varying values of µ and
σ.
Theorem (3.5.6): Let $f(\cdot)$ be any pdf. Let $\mu$ be any real number, and let $\sigma$ be any positive real number. Then $X$ is a random variable with pdf
$$\frac{1}{\sigma} f\!\left(\frac{x - \mu}{\sigma}\right)$$
if and only if there exists a random variable $Z$ with pdf $f(z)$ such that $X = \sigma Z + \mu$.
Proof: For the “only if” part, let $Z = g(X) = (X - \mu)/\sigma$, so that $g^{-1}(z) = \sigma z + \mu$ and $\frac{d}{dz} g^{-1}(z) = \sigma$. Then
$$f_Z(z) = f_X(g^{-1}(z)) \left|\frac{d}{dz} g^{-1}(z)\right| = \frac{1}{\sigma} f\!\left(\frac{\sigma z + \mu - \mu}{\sigma}\right) \sigma = f(z).$$
Also,
$$\sigma Z + \mu = \sigma g(X) + \mu = \sigma\,\frac{X - \mu}{\sigma} + \mu = X.$$
This proves the “only if” part.
$$f_Z(z) = f(z),$$
which is the same as $\frac{1}{\sigma} f\!\left(\frac{x - \mu}{\sigma}\right)$ for $\mu = 0$ and $\sigma = 1$.
Therefore, the distribution of $Z$ is that member of the location-scale family corresponding to $\mu = 0$ and $\sigma = 1$. For the normal family, remember that we have already shown that for $Z$ defined as above, $Z$ is a normally distributed random variable with $\mu = 0$ and $\sigma = 1$.
Note that probabilities for any member of a location-scale family may be computed in terms of the standard variable $Z$.
This is because
$$P(X \le x) = P\left(\frac{X - \mu}{\sigma} \le \frac{x - \mu}{\sigma}\right) = P\left(Z \le \frac{x - \mu}{\sigma}\right).$$
Consider the normal family. If we know $P\left(Z \le \frac{x - \mu}{\sigma}\right)$ for all values of $x$, $\mu$ and $\sigma$, where $Z$ has the standard normal distribution, then we can calculate $P(X \le x)$ for all values of $x$ where $X$ is a normally distributed random variable with some mean $\mu$ and variance $\sigma^2$.
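In practice this is exactly what statistical software does; a small sketch (arbitrary values of x, µ and σ):

```python
# Sketch: P(X <= x) for X ~ N(mu, sigma^2) computed via the standard normal cdf.
from scipy.stats import norm

mu, sigma, x = 1.5, 2.0, 3.0
z = (x - mu) / sigma
print(norm.cdf(z))                          # P(Z <= (x - mu)/sigma) with Z ~ N(0, 1)
print(norm.cdf(x, loc=mu, scale=sigma))     # the same probability computed directly
```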
There is another special family of densities known as the exponential family, used
very much in statistics and to some extent in econometrics, thanks to its convenient
properties. For the time being we do not cover this topic, but will return to it if time
permits.
$$M_X(t) = E[e^{tX}],$$
provided that the expectation exists for $t$ in some neighbourhood of 0. That is, there is an $h > 0$ such that, for all $t$ in $-h < t < h$, $E[e^{tX}]$ exists. If the expectation does not exist in a neighbourhood of 0, we say that the mgf does not exist.
We can write the mgf of $X$ as
$$M_X(t) = \int_{-\infty}^{\infty} e^{tx} f_X(x)\,dx \quad \text{if $X$ is continuous},$$
$$M_X(t) = \sum_{x} e^{tx} P(X = x) \quad \text{if $X$ is discrete}.$$
where we define
$$M_X^{(n)}(0) = \frac{d^n}{dt^n} M_X(t) \Big|_{t=0}.$$
Hence,
$$\frac{d}{dt} M_X(t) \Big|_{t=0} = E[X e^{tX}]\big|_{t=0} = E[X].$$
Similarly,
$$\frac{d^2}{dt^2} M_X(t) = \frac{d^2}{dt^2} \int_{-\infty}^{\infty} e^{tx} f_X(x)\,dx = \int_{-\infty}^{\infty} \frac{d^2}{dt^2} e^{tx} f_X(x)\,dx = \int_{-\infty}^{\infty} x^2 e^{tx} f_X(x)\,dx = E[X^2 e^{tX}],$$
and
$$\frac{d^2}{dt^2} M_X(t) \Big|_{t=0} = E[X^2 e^{tX}]\big|_{t=0} = E[X^2].$$
Proceeding in the same manner, it can be shown that
$$\frac{d^n}{dt^n} M_X(t) \Big|_{t=0} = E[X^n e^{tX}]\big|_{t=0} = E[X^n].$$
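As a symbolic illustration (a sketch; the Bernoulli(p) case is chosen because its mgf, M(t) = 1 − p + pe^t, follows immediately from the definition), differentiating the mgf at t = 0 recovers the moments obtained earlier:

```python
# Sketch: recover E[X] and E[X^2] from the Bernoulli(p) mgf M(t) = (1 - p) + p e^t.
import sympy as sp

t, p = sp.symbols("t p", positive=True)
M = (1 - p) + p * sp.exp(t)

EX = sp.diff(M, t).subs(t, 0)        # first derivative at 0
EX2 = sp.diff(M, t, 2).subs(t, 0)    # second derivative at 0
print(EX, EX2)   # both equal p, so Var(X) = E[X^2] - E[X]^2 = p - p^2 = p(1 - p)
```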
There is an alternative, perhaps less formal, proof which we will also go through. But first we have to informally introduce a useful tool.
Definition: If a function $g(x)$ has derivatives of order $r$, that is,
$$g^{(r)}(x) = \frac{d^r}{dx^r} g(x)$$
exists, then for any constant $a$, the Taylor expansion of order $r$ about $a$ is
$$g(x) = \sum_{i=0}^{r} \frac{g^{(i)}(a)}{i!} (x - a)^i + R,$$
$= E[X^j]$.
We will now talk about the moment generating functions of some common distributions. But first, we introduce the concept of a kernel.
Definition: The kernel of a function is the main part of the function, the part that remains when constants are disregarded.
Example (2.3.8): Remember that the gamma pdf is
$$f(x) = \frac{1}{\Gamma(\alpha)\beta^{\alpha}}\, x^{\alpha-1} e^{-x/\beta}, \quad 0 < x < \infty, \quad \alpha > 0, \quad \beta > 0.$$
Now,
$$M_X(t) = \frac{1}{\Gamma(\alpha)\beta^{\alpha}} \int_0^{\infty} e^{tx} x^{\alpha-1} e^{-x/\beta}\,dx = \frac{1}{\Gamma(\alpha)\beta^{\alpha}} \int_0^{\infty} x^{\alpha-1} \exp\left(-\left(\frac{1}{\beta} - t\right)x\right) dx = \frac{1}{\Gamma(\alpha)\beta^{\alpha}} \int_0^{\infty} x^{\alpha-1} \exp\left(-\frac{x}{\beta/(1 - \beta t)}\right) dx. \qquad (11)$$
The integrand is the kernel of a gamma pdf with parameters $\alpha$ and $\beta/(1 - \beta t)$, and $\int_0^{\infty} x^{a-1} e^{-x/b}\,dx = \Gamma(a) b^a$, so
$$M_X(t) = \frac{\Gamma(\alpha) \left(\frac{\beta}{1 - \beta t}\right)^{\alpha}}{\Gamma(\alpha)\beta^{\alpha}} = \left(\frac{1}{1 - \beta t}\right)^{\alpha} \quad \text{if } t < \frac{1}{\beta}.$$
If $t \ge 1/\beta$, then $1/\beta - t \le 0$ and the integral in (11) is infinite.
Then,
$$E[X] = \frac{d}{dt} M_X(t) \Big|_{t=0} = \frac{\alpha\beta}{(1 - \beta t)^{\alpha+1}} \Big|_{t=0} = \alpha\beta,$$
as we have shown previously.
Moments and Moment Generating Functions
Binomial mgf
$$f_X(x) = \binom{n}{x} p^x (1 - p)^{n-x}.$$
Then,
$$M_X(t) = \sum_{x=0}^{n} e^{tx} \binom{n}{x} p^x (1 - p)^{n-x} = \sum_{x=0}^{n} \binom{n}{x} (p e^t)^x (1 - p)^{n-x}.$$
By the Binomial Theorem (Theorem 3.2.2), with $x = p e^t$ and $y = 1 - p$,
$$M_X(t) = [p e^t + (1 - p)]^n.$$
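A short symbolic check (sketch) that differentiating this mgf at t = 0 gives back E[X] = np:

```python
# Sketch: differentiate the binomial mgf [p e^t + (1 - p)]^n at t = 0 to recover E[X] = np.
import sympy as sp

t, p, n = sp.symbols("t p n", positive=True)
M = (p * sp.exp(t) + 1 - p) ** n

EX = sp.simplify(sp.diff(M, t).subs(t, 0))
print(EX)   # n*p
```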
$$x^2 - 2\mu x + \mu^2 - 2\sigma^2 t x = x^2 - 2(\mu + \sigma^2 t)x + \mu^2 = \left[x - (\mu + \sigma^2 t)\right]^2 - (\mu + \sigma^2 t)^2 + \mu^2 = \left[x - (\mu + \sigma^2 t)\right]^2 - \left[2\mu\sigma^2 t + (\sigma^2 t)^2\right].$$
Therefore,
$$M_X(t) = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\left\{[x - (\mu + \sigma^2 t)]^2 - [2\mu\sigma^2 t + (\sigma^2 t)^2]\right\}/2\sigma^2}\,dx = e^{[2\mu\sigma^2 t + (\sigma^2 t)^2]/2\sigma^2} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-[x - (\mu + \sigma^2 t)]^2/2\sigma^2}\,dx.$$
Notice that
$$g(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-[x - (\mu + \sigma^2 t)]^2/2\sigma^2}$$
can be considered as the pdf of a random variable $Y \sim N(a, b)$ where $a = \mu + \sigma^2 t$ and $b = \sigma^2$.
Then
$$M_X(t) = e^{[2\mu\sigma^2 t + (\sigma^2 t)^2]/2\sigma^2} = \exp\left(\mu t + \frac{\sigma^2 t^2}{2}\right).$$
Clearly,
$$E[X] = \frac{d}{dt} M_X(t) \Big|_{t=0} = (\mu + \sigma^2 t) \exp\left(\mu t + \frac{\sigma^2 t^2}{2}\right) \Big|_{t=0} = \mu,$$
$$E[X^2] = \frac{d^2}{dt^2} M_X(t) \Big|_{t=0} = \sigma^2 \exp\left(\mu t + \frac{\sigma^2 t^2}{2}\right) \Big|_{t=0} + (\mu + \sigma^2 t)^2 \exp\left(\mu t + \frac{\sigma^2 t^2}{2}\right) \Big|_{t=0} = \sigma^2 + \mu^2,$$
$$\mathrm{Var}(X) = E[X^2] - \{E[X]\}^2 = \sigma^2 + \mu^2 - \mu^2 = \sigma^2.$$
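The same two derivatives can be taken symbolically; a minimal sketch:

```python
# Sketch: the first two moments of N(mu, sigma^2) from its mgf exp(mu t + sigma^2 t^2 / 2).
import sympy as sp

t, mu = sp.symbols("t mu", real=True)
sigma = sp.symbols("sigma", positive=True)
M = sp.exp(mu * t + sigma ** 2 * t ** 2 / 2)

EX = sp.diff(M, t).subs(t, 0)
EX2 = sp.expand(sp.diff(M, t, 2).subs(t, 0))
print(EX, EX2, sp.expand(EX2 - EX ** 2))   # mu, mu**2 + sigma**2, sigma**2
```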
Now, for $X_1 \sim f_1(x)$,
$$E[X_1^r] = e^{r^2/2}, \quad r = 0, 1, \ldots,$$
so $X_1$ has all of its moments.
Now take $X_2 \sim f_2(x)$. Then,
$$E[X_2^r] = \int_0^{\infty} x^r f_1(x)[1 + \sin(2\pi \log x)]\,dx = E[X_1^r] + \int_0^{\infty} x^r f_1(x) \sin(2\pi \log x)\,dx.$$
It can be shown that the integral is actually equal to zero for r = 0, 1, ... .
Hence, even though X1 and X2 have distinct pdfs, they have the same moments for
all r .
$$\lim_{i \to \infty} F_{X_i}(x) = F_X(x).$$
That is, convergence, for $|t| < h$, of mgfs to an mgf implies convergence of cdfs.
$$\lim_{n \to \infty} M_{X_n}(t) = \lim_{n \to \infty} \left[1 + \frac{1}{n}(e^t - 1)\lambda\right]^n = \lim_{n \to \infty} \left[1 + \frac{a_n}{n}\right]^n = \exp\left[\lambda(e^t - 1)\right] = M_Y(t),$$
where $a_n = \lambda(e^t - 1)$.
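A numerical sketch of this convergence (λ, n and the grid of values are arbitrary), comparing binomial(n, λ/n) probabilities with Poisson(λ) probabilities:

```python
# Sketch: binomial(n, lambda/n) probabilities approach Poisson(lambda) probabilities as n grows.
import numpy as np
from scipy.stats import binom, poisson

lam = 3.0
k = np.arange(0, 15)
for n in (10, 100, 10_000):
    gap = np.max(np.abs(binom.pmf(k, n, lam / n) - poisson.pmf(k, lam)))
    print(n, gap)   # the maximum gap shrinks as n increases
```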
We finish the discussion of moment generating functions with the following Theorem.
Theorem (2.3.15): For any constants $a$ and $b$, the mgf of the random variable $aX + b$ is given by
$$M_{aX+b}(t) = e^{bt} M_X(at).$$
Proof: This is pretty easy to show;
$$M_{aX+b}(t) = E\left[e^{(aX+b)t}\right] = E\left[e^{(aX)t} e^{bt}\right] = e^{bt} E\left[e^{(at)X}\right] = e^{bt} M_X(at),$$
where the last line follows from the definition of the moment generating function.
It is now more or less clear that using moment generating functions is not the most straightforward way to determine the distribution function. However, there are other “generating functions” we can consider.
Definition: The characteristic function of $X$ is defined by
$$\varphi_X(t) = E[e^{itX}],$$
where $i = \sqrt{-1}$.
if and only if
$$\lim_{k \to \infty} F_{X_k}(x) = F_X(x),$$
at every $x$ where $F_X(x)$ is continuous. Here, $F_{X_k}(x)$ and $F_X(x)$ are the cdfs for $X_k$ and $X$, respectively.
$$K_X(t) = \log E[e^{tX}] = \log M_X(t).$$
The cumulant generating function is, obviously, closely related to the moment generating function. In fact, $\kappa_n(X)$ can be expressed in terms of moments (and vice versa). For example,
$$E[X] = \kappa_1(X),$$
$$E[X^2] = \kappa_2(X) + [\kappa_1(X)]^2,$$
$$E[X^3] = \kappa_3(X) + 3\kappa_2(X)\kappa_1(X) + [\kappa_1(X)]^3,$$
and so on.
So the expected value and the variance correspond to the first two cumulants!
Depending on the question at hand, it might be more convenient to work with
cumulant or moment generating functions. In any case, information on the
cumulants can give us information on the moments (and vice versa).
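As a small symbolic sketch of the relationship (using the standard definition $\kappa_n(X) = K_X^{(n)}(0)$, the $n$th derivative of the cumulant generating function at zero), take the Poisson($\lambda$) case, whose mgf $\exp[\lambda(e^t - 1)]$ was given above: then $K_X(t) = \lambda(e^t - 1)$, so every cumulant equals $\lambda$, matching $E[X] = \lambda$ and $\mathrm{Var}(X) = \lambda$ found earlier.

```python
# Sketch: cumulants as derivatives of K_X(t) = log M_X(t) at t = 0, for the Poisson(lambda) case.
import sympy as sp

t, lam = sp.symbols("t lambda", positive=True)
M = sp.exp(lam * (sp.exp(t) - 1))    # Poisson mgf
K = sp.log(M)                        # cumulant generating function, lambda*(e^t - 1)

kappas = [sp.simplify(sp.diff(K, t, n).subs(t, 0)) for n in (1, 2, 3)]
print(kappas)   # [lambda, lambda, lambda]; kappa_1 = E[X] and kappa_2 = Var(X)
```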
A detailed reference for cumulant generating functions is “Tensor Methods in
Statistics” by McCullagh (1987).