Statistics and Econometrics Lecture Notes 2021
Paolo Zacchia
2 Common Distributions
2.1 Discrete Distributions
2.2 Continuous Distributions I
2.3 Continuous Distributions II
2.4 Continuous Distributions III
3 Random Vectors
3.1 Multivariate Distributions
3.2 Independence and Random Ratios
3.3 Multivariate Moments
3.4 Multivariate Moment Generation
3.5 Conditional Distributions
3.6 Two Multivariate Distributions
Bibliography
Part I
Lecture 1
Random Variables
Definition 1.1. Sample Space. The set S collecting all possible outcomes
associated with a certain phenomenon is called the sample space.
A basic example of a sample space is the one associated with the classical
experiment about tossing a coin: Scoin = {Head, Tail}. A more expanded
sample space is that of grades in a university exam: with letter grades for
example (but not allowing for plus and minus), Sexam = {A, B, C, D, E, F}.
for any two a, b ∈ R with a ≤ b. It appears then that the notion of sigma
algebra is general enough to allow for a wide class of reasonable collections of
subsets or events; note though that collections that are not sigma algebras
exist, and probability functions cannot be applied to them.²
The definition of probability function is thus in order: given a sample space S and an associated sigma algebra B, a probability function is a function P with domain B that satisfies the three Kolmogorov axioms:

a. P(A) ≥ 0 ∀A ∈ B;

b. P(S) = 1;

c. P(∪_{i=1}^{∞} Ai) = Σ_{i=1}^{∞} P(Ai) for any countable collection of pairwise disjoint sets A1, A2, . . . ∈ B.
¹ For readers unfamiliar with topology, a connected set is somewhat informally defined
as a set that cannot be partitioned into two nonempty subsets such that each subset has
no points in common with the set closure of the other subset. For example, the subset of
R defined as A = {x : x ∈ [a, b) ∨ x ∈ (b, c] , a < b < c} is not connected, because it can
be partitioned in such a way that defies the above definition.
² Here is an example of a collection B′ of subsets of R which is not a sigma algebra. Suppose that B′ contains all the finite disjoint unions of half-open intervals of the form (a, b]; then ∪_{i=1}^{∞} (0, (i − 1)/i] = (0, 1) ∉ B′, which contradicts the definition of sigma algebra.
It would seem that in order to calculate the desired probabilities, one would
need to divide such area by the total area of the dartboard – which equals
πr²I² – however this is not quite enough, because one must take into account
the event that a player fails to hit the board and scores 0 points, and that
all the probabilities must sum up to one. Let the area outside the dartboard
measure T > 0; an appropriate probability function would thus be given by
P(0 points) = T/(T + πI²r²) and the following expression for 0 < i ≤ I.

P(i points) = πr²[(I + 1 − i)² − (I − i)²] / (T + πI²r²)
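As a quick numerical check (not part of the original notes), the following Python sketch evaluates this probability function for hypothetical values of I, r and T and verifies that the probabilities sum to one.

    # Minimal sketch: dartboard probabilities; I rings of width r, outside area T.
    from math import pi

    I, r, T = 5, 1.0, 3.0             # hypothetical parameter values
    total = T + pi * I**2 * r**2      # normalizing constant

    p_zero = T / total
    p_ring = [pi * r**2 * ((I + 1 - i)**2 - (I - i)**2) / total
              for i in range(1, I + 1)]

    print(round(p_zero + sum(p_ring), 12))   # 1.0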
Proof. To prove a. note that B can be expressed as the union of two disjoint
sets B = {B ∩ A} ∪ {B ∩ Ac }, thus P (B) = P (B ∩ A) + P (B ∩ Ac ). To show
b. decompose the union of A and B as A ∪ B = A ∪ {B ∩ Ac }, again two
disjoint sets; hence:
P (A ∪ B) = P (A) + P (B ∩ Ac ) = P (A) + P (B) − P (A ∩ B)
where a. implies the second equality. Finally, c. follows from a. as A ⊂ B
implies that P (A ∩ B) = P (A), thus P (B ∩ Ac ) = P (B) − P (A) ≥ 0.
Theorem 1.4. Properties of Probability Functions (c). If P is some
probability function, the following properties hold:
a. P(A) = Σ_{i=1}^{∞} P(A ∩ Ci) for any A ∈ B and any partition C1, C2, . . .
of the sample space such that Ci ∈ B for all i ∈ N;

b. P(∪_{i=1}^{∞} Ai) ≤ Σ_{i=1}^{∞} P(Ai) for any sets A1, A2, . . . such that Ai ∈ B for
all i ∈ N.
Proof. Regarding a. note that, by the Distributive Laws of events, it is

A = A ∩ S = A ∩ (∪_{i=1}^{∞} Ci) = ∪_{i=1}^{∞} (A ∩ Ci)

where the intersection sets of the form A ∩ Ci are pairwise disjoint as the
Ci sets are, hence:

P(A) = P(∪_{i=1}^{∞} (A ∩ Ci)) = Σ_{i=1}^{∞} P(A ∩ Ci)

Regarding b., suppose that there is an additional collection of pairwise disjoint events A∗1, A∗2, . . . in B such that ∪_{i=1}^{∞} A∗i = ∪_{i=1}^{∞} Ai; then:

P(∪_{i=1}^{∞} Ai) = P(∪_{i=1}^{∞} A∗i) = Σ_{i=1}^{∞} P(A∗i) ≤ Σ_{i=1}^{∞} P(Ai)

where the second equality would follow from the pairwise disjoint property.
Such additional collection of events can be obtained as:

A∗1 = A1,   A∗i = Ai ∩ (∪_{j=1}^{i−1} Aj)^c = Ai ∩ (∩_{j=1}^{i−1} Aj^c)   for i = 2, 3, . . .

Furthermore, by construction A∗i ⊂ Ai for every i, implying P(A∗i) ≤ P(Ai) for every i and thus Σ_{i=1}^{∞} P(A∗i) ≤ Σ_{i=1}^{∞} P(Ai), which establishes the inequality.
(nobody uses E anymore). Simple additions thus give P(passing) = 0.9 and
P(failing) = 0.1. Also note that the event A is a subset of the event passing,
so P(A ∩ passing) = P(A). Consequently, the probability of getting an A
given that a student passes the exam can be expressed as the following
conditional probability:

P(A | passing) = P(A ∩ passing)/P(passing) = 0.3/0.9 = 1/3

and similarly P(D | failing) = P(F | failing) = 0.5 express the odds that
a student who fails the exam gets either D or F.
⁴ Consider the sigma algebra for S based upon the maximal partition including all singleton grade sets; the two "passing" and "failing" sets are obtained by taking appropriate unions of the singleton grade sets, hence must be part of the same sigma algebra.
Example 1.6. Tossing Two Coins. Consider the “extended” coin exper-
iment for n = 2. The sample space for this scenario is the set
S2.coins = {Head & Head, Head & Tail, Tail & Head, Tail & Tail}

where for each element, the two terms before and after the '&' sign represent
the outcome for the first and second coin respectively. The random variable
of interest takes values in the set X2.coins = {0, 1, 2} ⊂ R, and the mapping
X equals the number of tails in each attempt.
Assuming that the coins in the experiment are “balanced” (that is, there are
equal chances to obtain heads or tails), and given that clearly the outcome
of either coin cannot be predicted by the other (if treated as separate events,
they would be independent), the probability associated with each element of
S2.coins is 0.25, meaning that P (X2.coins = 0) = P (X2.coins = 2) = 0.25 while
P (X2.coins = 1) = 0.50. This results in the following cumulative probability
distribution:
FX2.coins(x) =
   0     if x ∈ (−∞, 0)
   .25   if x ∈ [0, 1)
   .75   if x ∈ [1, 2)
   1     if x ∈ [2, ∞)
which is easily represented graphically as in Figure 1.2 below.
[Figure 1.2: cumulative distribution function FX2.coins(x), stepping from 0 to .25, .75 and 1 at x = 0, 1, 2]
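The probabilities behind Figure 1.2 can be reproduced with a short enumeration; the following Python sketch (not part of the original notes) lists the sample space of the two-coin experiment and recovers both the mass function and the cumulative distribution of X.

    # Minimal sketch: enumerate S_2.coins and compute the pmf and CDF of
    # X = number of tails, assuming a balanced coin.
    from itertools import product
    from collections import Counter

    outcomes = list(product(["Head", "Tail"], repeat=2))   # 4 equally likely elements
    pmf = Counter(o.count("Tail") for o in outcomes)

    for x in sorted(pmf):
        p = pmf[x] / len(outcomes)
        F = sum(pmf[t] for t in pmf if t <= x) / len(outcomes)
        print(x, p, F)      # (0, 0.25, 0.25), (1, 0.5, 0.75), (2, 0.25, 1.0)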
while also taking positive values on the entire real line and complying with
Theorem 1.7. This distribution is also displayed in Figure 1.3 (dashed line);
observe how, relative to the standard logistic, the standard normal implies
a higher probability associated with values of x closer to zero.
[Figure 1.3: standard logistic Λ(x) and standard normal Φ(x) cumulative distribution functions]
As elaborated later in this chapter, the standard logistic and the stan-
dard normal are both specialized cases of more flexible specifications of the
logistic and normal distributions, which allow for parameters that deter-
mine their exact shape (however, the particular notation Λ (x) and Φ (x) is
typically reserved for the “standard” versions of both distributions). Both
distributions are often used to represent real world scenarios that are best
represented on the entire real line, like the deviations of a certain variable of
interest (say, human height) from some focal point (say, a group-specific av-
erage). The predominance of the standard normal is motivated by a funda-
mental result in asymptotic probability theory, the Central Limit Theorem,
which is discussed at length in Lecture 6.
Cumulative probability distributions are seldom handled directly; it is
usually more convenient to manipulate some associated mathematical ob-
jects that more directly relate to the underlying probability measures. Such
objects, the probability mass and density functions, are defined differently
for discrete and continuous distributions, respectively. These two concepts
make it easier to also characterize the support of a random variable, which
intuitively is the subset of R where all the probability is concentrated.
Definition 1.13. Probability Mass Function. Given a discrete random
variable X, its probability mass function fX (x) (which is often abbreviated
as p.m.f.) is defined as follows.
fX (x) = P (X = x) for all x ∈ R
hence:

P(X ≤ b) = FX(b) = Σ_{t=inf X}^{b} fX(t)

and:

P(X ∈ X) = Σ_{t∈X} fX(t) = 1
which connects directly with the cumulative probability distribution FX (x).
Example 1.8. Tossing Two Coins, Revisited. Consider the cumulative
distribution function for the experiment about “tossing two coins” described
in Example 1.6. The associated probability mass function is obtained from
the original probability function:
fX2.coins(x) =
   .25   if x = 0
   .50   if x = 1
   .25   if x = 2
   0     otherwise
and it is visually represented in Figure 1.4, as displayed next.
Figure 1.4: Probability Mass Function for the Two Coins Experiment
For density functions instead the support is an uncountable set, and the
interpretation of the quantity fX (x) ≥ 0 is subtler: it cannot be interpreted
as a probability because x has measure zero in the support. However, when
X is continuous the definition of cumulative distribution functions implies
that:

P(a ≤ X ≤ b) = FX(b) − FX(a) = ∫_a^b fX(t) dt

hence:

P(X ∈ X) = ∫_X fX(t) dt = 1
hence density functions bear a probabilistic interpretation for segments of R.
Also observe that unlike in the case of mass functions, density functions can
generally take values larger than one, since their probabilistic interpretation
is based on the above integral formulations.
Both functions are displayed in Figure 1.5 below. The graphical comparison
between the two density functions highlights again that the standard logistic
is “thicker in the tails” relative to the standard normal, that is, the standard
[Figure 1.5: standard normal and standard logistic density functions; the shaded areas represent the probability that x falls in the [1, 3] interval for either distribution]
logistic probability is more dispersed towards the outer values of the support
R. To better exemplify the probabilistic interpretation of density functions,
the Figure also displays – by means of distinct shaded areas – the probability
that x occurs between 1 and 3 for either distribution.
One can summarize the properties of mass and density function through
the following statement.
It must be noted at this point that not all random variables are either exclu-
sively discrete, or exclusively continuous. In numerous situations of interest,
a random variable appears continuous only on a subset of the support and
discrete in other points. In such cases, the definition of cumulative proba-
bility distribution is still valid, however those of mass and density functions
are only valid upon a subset of the support. It is possible to describe these
mixed cases by using a generalized density which is formulated in terms of
a Lebesgue integral, but this is beyond the scope of this treatment.
[Figure: cumulative distribution function Φ≥0(x)]
In this case, it is sensible to characterize the density function only for the
nonnegative part of the distribution’s support:
φ≥0(x) = (1/√(2π)) exp(−x²/2)   if x ≥ 0

P(X = 0) = 0.5
where g⁻¹([a, b]) ⊆ R is the subset of real numbers obtained by applying the
inverse mapping g⁻¹(·) to [a, b].⁶ Also note that in general, a transformed random
variable Y has a support Y which differs from the support X of the original
random variable X; an obvious example is Y = exp(X), whereby if X = R,
it is Y = R++; conversely, if Y = log(X) and X = R++, it is Y = R.
A relevant question is about how to calculate the distribution and the
mass or density functions of Y starting from those of X. If X is discrete
also Y is, and the calculation of mass functions is straightforward.
fY(y) = fX(g⁻¹(y))   (1.5)
For continuous random variables, things are slightly more complicated. Let
us start from the following result about cumulative distributions.
Theorem 1.10. Cumulative Distribution of Transformed Random
Variables. Let X and Y = g (X) be two random variables that are related
by a transformation g (·), X and Y their respective supports, and FX (x) the
cumulative distribution of X.
a. If g (·) is increasing in X, it is FY (y) = FX (g −1 (y)) for all y ∈ Y.
b. If g (·) is decreasing in X and X is a continuous random variable, it
is FY (y) = 1 − FX (g −1 (y)) for all y ∈ Y.
Proof. This is almost tautological: a. is shown as:
FY(y) = ∫_{−∞}^{g⁻¹(y)} fX(x) dx = FX(g⁻¹(y))

where the first equality is motivated by (1.4) and the fact that an increasing
function applied upon some interval preserves its order. The demonstration
of b. is symmetric:

FY(y) = ∫_{g⁻¹(y)}^{∞} fX(x) dx = 1 − FX(g⁻¹(y))

since a decreasing function upon an interval inverts the order and because
∫_{−∞}^{a} fX(x) dx + ∫_{a}^{∞} fX(x) dx = 1 if fX(x) is a density function.
⁶ Note that this subset may not equal [g⁻¹(a), g⁻¹(b)] because the inverse mapping g⁻¹(·) may not preserve the order or the connectedness of the original interval.
Proof. Both increasing and decreasing functions are monotone; hence, since
g −1 (·) is continuously differentiable on Y, for all y ∈ Y:
fY(y) = (d/dy) FY(y) =
   fX(g⁻¹(y)) · (d/dy) g⁻¹(y)     if g(·) is increasing
   −fX(g⁻¹(y)) · (d/dy) g⁻¹(y)    if g(·) is decreasing
[Figure: cumulative distribution function FX(x) on the left, density function fX(x) on the right]
[Figure: cumulative distribution function FY(y) on the left, density function fY(y) on the right]
Proof. (Outline.) The logic of this result is that if g(·) is not monotone, but
it can be separated into a sequence of monotone subfunctions over different
intervals of the support of X, then the result of Theorem 1.11 can be applied
to each interval, and thus the density for each point y ∈ Y can be obtained
as the sum of the transformed densities associated with all points x ∈ X
that map to y (note that this allows g(·) not to be invertible over the entire
support of X; it suffices that it is invertible on each interval of the partition).
The "dummy" set X0 with zero probability allows for discontinuity or even
saddle points separating the K subfunctions in the partition.
Example 1.12. Normal to Chi-squared Transformation. Let X be
a random variable that follows the standard normal distribution Φ (x); the
support of X is thus X = R. Consider the transformation Y = X 2 : function
g (x) = x2 is obviously not monotone over all R. However, it is respectively
decreasing in R− and increasing in R+ ; and it is easy to verify that it satisfies
the requirements of Theorem 1.12 for the following sets.
X0 = {0}
X1 = R−−
X2 = R++
Therefore, the density of Y is obtained, for y ∈ Y = R++ , as:
fY(y) = (1/√(2π)) exp(−(−√y)²/2) · (1/(2√y)) + (1/√(2π)) exp(−(√y)²/2) · (1/(2√y))
      = (1/(√(2π)√y)) exp(−y/2)

as g1⁻¹(y) = −√y and g2⁻¹(y) = √y. Its cumulative distribution is obtained
by integrating the density.
FY(y) = ∫_0^y (1/(√(2π)√t)) exp(−t/2) dt
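As a numerical aside (not part of the original notes), the density just derived can be checked against the chi-squared density with one degree of freedom as implemented in scipy; the grid of evaluation points is arbitrary.

    # Minimal sketch: the density of Y = X^2 (X standard normal) equals the
    # chi-squared density with one degree of freedom.
    import numpy as np
    from scipy import stats

    y = np.linspace(0.1, 8.0, 50)
    f_derived = np.exp(-y / 2) / (np.sqrt(2 * np.pi) * np.sqrt(y))

    print(np.allclose(f_derived, stats.chi2.pdf(y, df=1)))   # True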
[Figure: cumulative distribution function FY(y) on the left, density function fY(y) on the right]
The shape of this distribution closely mimics that of the standard logistic
from Example 1.7, but with a crucial difference that makes it “non-strictly”
monotonic: over the (0, 3) interval this distribution is flat, and it resumes
a “standard logistic” behavior only for x ≥ 3, as if the support is shifted by
three units of measurement associated with the random variable X.
[Figure: cumulative distribution function FX(x) on the left, quantile function QX(p) on the right]
This theorem motivates the use of the uniform distribution for generating
random draws from any distribution FX (x). Given that it is easier to obtain
actual random draws from the uniform distribution, it is convenient to do so
and then apply the quantile function QX (p) in order to obtain the desired
random draws from FX (x), whatever the random variable X of interest.
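A minimal Python sketch of this idea (not part of the original notes) follows: uniform draws are passed through the quantile function of a target distribution, here the standard logistic with Q(p) = log(p/(1 − p)), which is only an illustrative choice.

    # Minimal sketch: inverse transform sampling from the standard logistic.
    import numpy as np

    rng = np.random.default_rng(0)
    p = rng.uniform(size=100_000)     # draws from U(0, 1)
    x = np.log(p / (1 - p))           # draws from the standard logistic

    # the empirical CDF at 0 should be close to Lambda(0) = 0.5
    print(np.mean(x <= 0.0))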
Var[X] = E[X²] − E[X]²
x heads (or tails) occurs with probability 0.6^x · 0.4^(n−x), the probability mass
function for this specific random variable can be written as:

fXn.coins(x) = (n choose x) · 0.6^x · 0.4^(n−x)

and its expected value can be calculated as follows, for y = x − 1:

E[Xn.coins] = Σ_{x=0}^{n} x · (n choose x) · 0.6^x · 0.4^(n−x)
            = Σ_{x=1}^{n} n · (n−1 choose x−1) · 0.6^x · 0.4^(n−x)
            = 0.6 · n · Σ_{x=1}^{n} (n−1 choose x−1) · 0.6^(x−1) · 0.4^((n−1)−(x−1))
            = 0.6 · n · Σ_{y=0}^{n−1} (n−1 choose y) · 0.6^y · 0.4^(n−1−y)
            = 0.6 · n
where the simplification in the second-to-last line occurs because the sum-
mation therein is recognized as the total probability mass of an analogous,
hypothetical experiment with n − 1 attempts and y successes.
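The result E[Xn.coins] = 0.6 · n is easy to verify numerically; the short Python check below (not part of the original notes) uses n = 10 as an arbitrary example.

    # Minimal sketch: direct check that the expected number of heads in n
    # tosses of a coin with P(head) = 0.6 equals 0.6 * n.
    from math import comb

    n, p = 10, 0.6
    mean = sum(x * comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1))
    print(mean, 0.6 * n)   # both equal 6 up to floating-point error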
since lim_{M→∞} −y exp(−y)|_0^M = 0 and ∫_0^∞ exp(−y) dy = 1. The variance is
calculated by noting that, integrating by parts again:

E[Y²] = ∫_0^∞ y² exp(−y) dy = −y² exp(−y)|_0^∞ + 2 ∫_0^∞ y exp(−y) dy = 2

hence Var[Y] = E[Y²] − E[Y]² = 2 − 1 = 1.
One is often interested in the analysis of the moments of a transformed
random variable Y = g (X) in terms of the moments of the original random
variable X. By applying the standard linear properties of summations and
integration, it is quite easy to see that if Y = a + bX, then:
E[Y] = a + b E[X]      Var[Y] = b² Var[X]

Matters are different for non-linear transformations: by Jensen's inequality, if g(·) is concave it is E[g(X)] ≤ g(E[X]), while if g(·) is convex it is E[g(X)] ≥ g(E[X]).
This shows that, in general, a non-linear function does not pass through
the expectation operator. In addition, a first-order approximation of g(X)
based on a Taylor expansion around E[X]:

g(X) ≈ g(E[X]) + g′(E[X]) (X − E[X])

whose expectation gives E[g(X)] ≈ g(E[X]), is actually a decent one, as it accounts for the first order term of the series.
The next two properties of the mean and the variance are instrumental
in establishing some important results of asymptotic theory.
since neither E[X] nor X̂ is random, so both can be taken out of the expectation
operator, while E[(X − E[X])] = 0 by definition. Of the two remaining
terms in the last line of (1.7), the first one – the variance of X – is constant,
while the second is shrunk to zero when X̂ = E[X]. Thus, in addition to
the interpretation of the mean as “best predictor” under quadratic distances,
the variance is intuitively interpreted as the prediction error that “cannot
be removed.” Later in Lecture 7 this property of the mean and variance is
generalized in a setting where multiple random variables are used to jointly
predict the realization of some other random variable of interest.
The last concept covered in this lecture is about classes of functions that
are most useful to calculate the moments of a distribution.
Definition 1.20. Moment generating function. Given a random vari-
able X with support X, the moment-generating function MX (t) is de-
fined, for t ∈ R, as the expectation of the transformation g (X) = exp (tX),
so long as it exists; for discrete random variables this is:
MX(t) = E[exp(tX)] = Σ_{x∈X} exp(tx) fX(x)
Moment generating functions draw their name from the following result.
Theorem 1.16. Moment generation. If a random variable X has an
associated moment generating function MX (t), its r-th uncentered moment
can be calculated as the r-th derivative of the moment generating function
evaluated at t = 0.
E[X^r] = d^r MX(t)/dt^r |_{t=0}
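As an illustration of Theorem 1.16 (not part of the original notes), the following sympy sketch differentiates the moment generating function M(t) = 1/(1 − t) of the exponential distribution with unit parameter, recovering E[X] = 1 and E[X²] = 2 as in the earlier example.

    # Minimal sketch: moments via derivatives of an MGF at t = 0.
    import sympy as sp

    t = sp.symbols("t")
    M = 1 / (1 - t)        # MGF of the unit-parameter exponential, for t < 1

    for r in (1, 2):
        moment = sp.diff(M, t, r).subs(t, 0)
        print(r, moment)   # E[X] = 1, E[X^2] = 2, hence Var[X] = 2 - 1 = 1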
Lecture 2
Common Distributions
Bernoulli distribution
The Bernoulli distribution is the one describing dichotomous events akin to
those of the coin experiment. In general, one must allow for the two events
under consideration to occur with different probabilities (for example, coins
may not be “fair” or “balanced”). One writes X ∼ Be (p) if X = {0, 1} and:
P (X = 1) = p
P (X = 0) = 1 − p
implying a probability mass function that can be written as in the above
example about notation, or equivalently – but more elegantly – as:
fX(x; p) = p^x (1 − p)^(1−x)   (2.1)
for x ∈ {0, 1} and p ∈ [0, 1]. The cumulative distribution writes:
FX (x; p) = (1 − p) · 1 [x ∈ [0, 1)] + 1 [x ∈ [1, ∞)] (2.2)
its moment generating function is:
MX (t; p) = p exp (t) + (1 − p) (2.3)
and this allows one to obtain the mean and the variance easily, as follows.
E [X] = p (2.4)
Var [X] = p (1 − p) (2.5)
The Bernoulli distribution is elementary; thus, it forms the basis for several
other discrete distributions.
Binomial distribution
The binomial distribution characterizes a random variable defined on a sam-
ple space constituted by all possible recombinations of n Bernoulli (binary)
events with probability p, and that measures the probability for the number
x and n−x of realizations of each alternative. Thus, this distribution corre-
sponds to the hypothetical experiment from Lecture 1 about tossing several
(say, n) possibly unbalanced coins. Conventionally, the outcomes counted as
x are defined as “successes” and those counted as n − x as “failures;” for this
reason, it is common to verbally describe the binomial distribution as the
one that measures the “probability of x successes of a binary phenomenon
out of n attempts.” In less verbal terms, a random variable X that follows
the binomial distribution is typically denoted as follows.
X ∼ Bn (p, n)
Its probability mass function is fX(x; p, n) = (n choose x) p^x (1 − p)^(n−x), while its cumulative distribution function
writes, for x ∈ [0, n], and given ⌊x⌋ the largest integer not exceeding x, as:

P(X ≤ x; p, n) = FX(x; p, n) = Σ_{i=0}^{⌊x⌋} (n choose i) p^i (1 − p)^(n−i)   (2.7)
with P (X = n; p, n) = FX (n; p, n) = 1 per the binomial formula.
FX(n; p, n) = Σ_{x=0}^{n} (n choose x) p^x (1 − p)^(n−x)
            = (p + (1 − p))^n
            = 1
The moment generating function is similarly obtained (see Example 1.17):
MX(t; p, n) = Σ_{x=0}^{n} exp(tx) (n choose x) p^x (1 − p)^(n−x)
            = [p exp(t) + (1 − p)]^n   (2.8)
while the mean and variance are as follows (one can calculate them through
the approach in Example 1.14, or with the moment generating function).
E [X] = np (2.9)
Var [X] = np (1 − p) (2.10)
The two distributions discussed next are variations on the idea of multiple
Bernoulli events or, as these are usually referred to, “Bernoulli trials.”
Geometric distribution
Consider the sample space constructed out of all combinations of an infinite
number of Bernoulli trials with identical probability p. Suppose that these
trials are ordered; for example, the order might correspond to a sequence in
time when the trials are realized. Rather than defining a random variable
X that counts the number of successes, let X ∈ N denote the index of the
first Bernoulli trial in the sequence for which a “success” is observed. It is:
P(X = x; p) = fX(x; p) = p (1 − p)^(x−1)   (2.11)

because for a success with probability p to happen in the x-th trial, x − 1
failures must first occur, an event with probability (1 − p)^(x−1). The proba-
bility mass function in (2.11) is characterized by a geometric series, which
motivates the name for the distribution associated with X. By the proper-
ties of the geometric series, for x ∈ R the cumulative distribution function
of X is obtained as:
P(X ≤ x; p) = FX(x; p) = Σ_{i=0}^{⌊x⌋−1} p (1 − p)^i
            = p · [1 − (1 − p)^⌊x⌋] / [1 − (1 − p)]   (2.12)
            = 1 − (1 − p)^⌊x⌋
which converges to 1 as x → ∞; while it is FX (x; p) = 0 for x < 1. Similarly,
the moment generating function is obtained, for t < − log (1 − p), as:
MX(t; p) = lim_{M→∞} Σ_{x=1}^{M} exp(tx) · p (1 − p)^(x−1)
         = p exp(t) · lim_{M→∞} Σ_{x=1}^{M} [(1 − p) · exp(t)]^(x−1)
         = p exp(t) · lim_{M→∞} {1 − [(1 − p) · exp(t)]^M} / {1 − (1 − p) · exp(t)}   (2.13)
         = p exp(t) / [1 − (1 − p) exp(t)]
allowing one to derive the mean and variance after some tedious calculations.

E[X] = 1/p   (2.14)

Var[X] = (1 − p)/p²   (2.15)
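A simulation check of (2.14) and (2.15) is immediate; the Python sketch below (not part of the original notes) relies on numpy's geometric generator, which counts the number of trials up to and including the first success, and uses p = 0.3 as an arbitrary value.

    # Minimal sketch: sample mean and variance of geometric draws versus
    # 1/p and (1 - p)/p^2.
    import numpy as np

    rng = np.random.default_rng(1)
    p, n_draws = 0.3, 200_000

    draws = rng.geometric(p, size=n_draws)
    print(draws.mean(), 1 / p)            # both close to 3.33
    print(draws.var(), (1 - p) / p**2)    # both close to 7.78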
Poisson distribution
The Poisson distribution, which is presented next, is an important discrete
distribution with numerous applications. Like the other distributions presented
thus far, the Poisson too is related to the concept of Bernoulli trials,
although the connection is less immediate to appreciate intuitively. To that
end, it is helpful to provide first a formal description of the distribution.
The approximation is quite good even for moderately large values of n and
moderately small values of p, as shown in the example from Figure 2.1.
[Figure 2.1: binomial probabilities are denoted with solid thin lines, smaller full points; Poisson probabilities are denoted with dashed thicker lines, larger hollow points. All probabilities for x > 10 are negligible.]
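In the same spirit as Figure 2.1, the following Python sketch (not part of the original notes) compares the binomial mass function with its Poisson approximation of parameter np, for the arbitrary choice n = 30 and p = 0.1.

    # Minimal sketch: binomial(n, p) probabilities versus Poisson(n * p).
    from math import comb, exp, factorial

    n, p = 30, 0.1
    lam = n * p
    for x in range(11):
        binom = comb(n, x) * p**x * (1 - p)**(n - x)
        poisson = exp(-lam) * lam**x / factorial(x)
        print(f"{x:2d}  {binom:.4f}  {poisson:.4f}")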
Hypergeometric distribution
The last discrete distribution considered here is a variation of the binomial
experiment of obtaining x successes out of n attempts of a binary outcome,
but in an environment where the probability of each success is not fixed,
and thus the correspondence with n independent Bernoulli trials cannot be
established. Specifically, the hypergeometric distribution is modeled on the
idea of randomly selecting n ∈ N objects out of a population that contains
N ∈ N of them in total, of which K ∈ N presents a certain binary feature
that is called a “success,” and where n < N and K < N . The two most
common concrete representations of this mental experiment are:
• the urn model about the extractions of certain n items (“balls”) from
a container (“urn”) – the items-balls are N in total and K of them
present a certain feature (“color”);
• the concept of sampling without replacement from a finite population
of size N , K of whose elements present a certain feature; this corre-
sponds to the case of a statistical sample which is obtained by ran-
domly drawing n individual units from the population, one at a time,
and excluding them from additional draws once they are selected.
The random variable X that represents the number of “successes” x out
of the n “extractions” is said to follow the hypergeometric distribution, also
written as:
X ∼ H (N, K, n)
which presents three parameters: N , K, and n. Its support X must satisfy
the following four conditions: x ≥ 0, x ≤ n, (these two are obvious) x ≤ K
and n−x ≤ N −K (actual successes and failures cannot exceed the possible
maxima). Combining all these inequalities together, the support is written
as follows.
X = {max (0, n + K − N ) , . . . , min (n, K)}
The hypergeometric probability mass function presents an expression composed
of three binomial coefficients:

P(X = x; N, K, n) = fX(x; N, K, n) = (K choose x) (N − K choose n − x) / (N choose n)   (2.32)
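The following Python sketch (not part of the original notes) evaluates (2.32) with binomial coefficients for arbitrary values of N, K and n, and verifies that the probabilities sum to one over the support described above.

    # Minimal sketch: hypergeometric pmf and its total probability mass.
    from math import comb

    N, K, n = 20, 8, 6
    support = range(max(0, n + K - N), min(n, K) + 1)
    pmf = {x: comb(K, x) * comb(N - K, n - x) / comb(N, n) for x in support}
    print(sum(pmf.values()))   # 1.0 up to floating-point error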
Normal distribution
The all-important location-scale family of normal distributions, sometimes
called Gaussian, includes all continuous distributions with support on the
whole of R, density function of the form
fX(x; µ, σ²) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²))   (2.35)
and cumulative distribution obtained by appropriate integration as follows.
FX(x; µ, σ²) = ∫_{−∞}^{x} (1/√(2πσ²)) exp(−(t − µ)²/(2σ²)) dt   (2.36)
These distributions present two parameters, µ for location and σ for shape,
although the latter is typically replaced with its square σ2 for an immediate
interpretation as variance (see below). The expression
X ∼ N(µ, σ²)
indicates that X follows the normal distribution with the specified parame-
ters. In the case of the standard normal distribution from Examples 1.7 and
1.9, with density φ (x) and cumulative distribution Φ (x), these parameters
are equal to 0 and 1 respectively, hence the above writes:
X ∼ N (0, 1)
although X is often replaced with Z for clearer indication of standardization
– e.g. Z ∼ N (0, 1).
Figure 2.2 shows three examples of normal density functions. The con-
tinuous line represents the familiar standard version. The dashed line dis-
plays a density with µ = 2 and σ2 = 1: note how an increase of the location
parameter produces a “shift to the right” of the distribution (if the location
parameter is decreased, one would obtain a “shift to the left”). The dotted
line depicts a density with µ = 0 and σ2 = 4: while still centered at zero, an
increase in the shape parameter produces a more “flattened out,” dispersed
distribution (conversely, one would obtain a more concentrated distribution
if the shape parameter is decreased).
[Figure 2.2: normal density functions, standard version and versions with µ = 2 and with σ² = 4]
An alternative parametrization uses φ² = σ⁻², called the precision parameter; in Figure 2.2, the dotted density has φ² = 0.25.
Showing that the density function of a normal distribution integrates to
one is not an immediate task; thanks to Theorem 2.1, it is enough to show
that the result holds in the standardized case, that is:
∫_{−∞}^{+∞} (1/√(2π)) exp(−z²/2) dz = 1
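A numerical confirmation of this integral (not part of the original notes) is straightforward with standard quadrature routines.

    # Minimal sketch: the standard normal density integrates to one.
    import numpy as np
    from scipy.integrate import quad

    value, abs_error = quad(lambda z: np.exp(-z**2 / 2) / np.sqrt(2 * np.pi),
                            -np.inf, np.inf)
    print(value)   # approximately 1.0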
that is, the two parameters of the normal distribution have an immediate
interpretation in terms of fundamental moments, a quite convenient fact!
In addition, it can be shown that:

Skew[X] = 0      Kurt[X] = 3

that is, all normal distributions have zero skewness (they are all perfectly
symmetric) and kurtosis equal to three, which makes this number a reference
value for evaluating the kurtosis of other distributions.²
The normal distribution has ubiquitous applications that are motivated
by its numerous relationships with other probability distributions and, most
importantly, by the asymptotic result known as Central Limit Theorem, in
its various versions (Lecture 6). For these reasons, the normal distribution
is central in statistical inference as well as in econometric analysis.
Lognormal distribution
The lognormal distribution has the following probability density function,
for y ∈ R++ , µ ∈ R, and σ2 ∈ R++ .
fY(y; µ, σ²) = (1/√(2πσ²)) (1/y) exp(−(log y − µ)²/(2σ²))   (2.44)
Observation 2.2. By Theorem 1.11, one can observe that the lognormal
distribution is evidently obtained through the transformation Y = exp (X),
where X ∼ N (µ, σ2 ). Conversely, X = log (Y ): the logarithm of a random
variable Y which follows the lognormal distribution is normally distributed,
hence the former distribution’s name.
The cumulative distribution of the lognormal distribution is:
FY(y; µ, σ²) = ∫_0^y (1/√(2πσ²)) (1/t) exp(−(log t − µ)²/(2σ²)) dt   (2.45)
and the density must clearly integrate to one since the normal distribution
does. In order to specify that some random variable Y follows the lognormal
distribution, the most convenient way is surely as follows.
log (Y ) ∼ N (µ, σ2 )
² In fact, the fourth standardized central moment of some random variable X is often expressed in terms of "excess kurtosis" Kurt[X] − 3, that is, as the difference between the kurtosis of X and that of the normal distribution.
Figure 2.3 displays examples of lognormal density functions that are anal-
ogous to those from Figure 2.2, highlighting how lognormal densities can
assume different shapes, always asymmetric by some degree; for example,
the standard version is recognizable by its characteristic hump. These char-
acteristics make the lognormal distribution well suited to describe phenom-
ena that only take positive values and that are characterized by apparent
“inequality,” such as the income distribution or that of firm size.
[Figure 2.3: lognormal density functions, standard version and versions with µ = 2 and with σ² = 4]
The lognormal distribution is one for which the moment generating func-
tion does not exist (the integral that defines it diverges), and even the char-
acteristic function takes a very complex expression. Fortunately, uncentered
moments can be calculated easily:
E[Y^r] = E[(exp(X))^r] = E[exp(rX)] = exp(µr + σ²r²/2)

because the second equality corresponds with the definition of the moment generating
function of X ∼ N(µ, σ²), evaluated at t = r. The mean and variance, for
example, are obtained as follows.
E[Y] = exp(µ + σ²/2)   (2.46)

Var[Y] = [exp(σ²) − 1] exp(2µ + σ²)   (2.47)
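The two formulas can be verified by simulation; the Python sketch below (not part of the original notes) draws exp(X) with X normal, for arbitrary values of µ and σ.

    # Minimal sketch: lognormal mean and variance against their closed forms.
    import numpy as np

    rng = np.random.default_rng(2)
    mu, sigma = 0.5, 0.8
    y = np.exp(rng.normal(mu, sigma, size=1_000_000))

    print(y.mean(), np.exp(mu + sigma**2 / 2))
    print(y.var(), (np.exp(sigma**2) - 1) * np.exp(2 * mu + sigma**2))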
Logistic distribution
The standard logistic distribution introduced in Examples 1.7 and
1.9 also has a full-fledged location-scale family. With support X = R, a location
parameter µ ∈ R, and a scale parameter σ ∈ R++, the probability density
function of a generic logistic distribution is written as follows.
fX(x; µ, σ) = (1/σ) exp(−(x − µ)/σ) [1 + exp(−(x − µ)/σ)]^(−2)   (2.49)
The logistic cumulative distribution also has a closed form expression, which
can be obtained by an exercise in integrating the density above.

FX(x; µ, σ) = [1 + exp(−(x − µ)/σ)]^(−1)   (2.50)
making it obvious that limx→∞ FX (x; µ, σ) = 1. Expression (2.50) is easy
to manipulate and invert; consequently the logistic distribution has a simple
expression for its quantile function.
QX(p; µ, σ) = µ + σ log(p/(1 − p))   for p ∈ (0, 1)   (2.51)
[Figure 2.4: logistic density functions, standard version and versions with µ = 2 and with σ = 2]
and since the Beta function is defined only for positive values
of a and b, the moment generating function is only defined for t ∈ (−1, 1).
For a general logistic distribution such that X = σZ + µ, by the properties
of moment generating functions the one of X is:

MX(t; µ, σ) = exp(µt) · ∫_0^1 u^(−σt) (1 − u)^(σt) du   (2.52)
Cauchy distribution
The Cauchy distribution is an interesting “pathological” case of a location-
scale family of symmetric, bell-shaped distributions having X = R as their
support. Its density function writes, given a location parameter µ ∈ R and
a scale parameter σ ∈ R++, as:

fX(x; µ, σ) = (1/(πσ)) [1 + ((x − µ)/σ)²]^(−1)   (2.57)
while the cumulative distribution has the following closed form expression,
displaying the necessary property limx→∞ FX(x; µ, σ) = 1.

FX(x; µ, σ) = (1/π) arctan((x − µ)/σ) + 1/2   (2.58)
Since the above is an invertible function, the quantile function of the
Cauchy distribution also has a simple closed form expression, for p ∈ (0, 1).

QX(p; µ, σ) = µ + σ tan(π(p − 1/2))   (2.59)
X ∼ Cauchy (µ, σ)
[Figure 2.5: Cauchy density functions, standard version and versions with µ = 2 and with σ = 2]
To show this property, one typically works with the standardized case
Z = (X − µ) /σ, whose expectation is rewritten as follows:
E[Z] = ∫_{−∞}^{0} (1/π) · z/(1 + z²) dz + ∫_{0}^{+∞} (1/π) · z/(1 + z²) dz

that is, the integral that defines the mean is split between two parts that,
since the distribution is symmetric around z = 0, must have opposite signs
but equal absolute value. Consider the integral defined on R++; note that:

∫_{0}^{+∞} z/(1 + z²) dz = lim_{M→∞} [log(1 + z²)/2] |_0^M = ∞
the integral does not converge and lacks a finite solution. Because the same
applies to the other half of the partition above, the value of E [Z] remains
undefined. Similar arguments also apply to the non-standardized versions of
the distribution, the higher moments, and the moment generating function.
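This lack of a mean has a visible counterpart in simulations: the running sample mean of Cauchy draws keeps fluctuating instead of settling down. The sketch below (not part of the original notes) illustrates the point.

    # Minimal sketch: the running mean of standard Cauchy draws does not converge.
    import numpy as np

    rng = np.random.default_rng(3)
    draws = rng.standard_cauchy(1_000_000)
    running_mean = np.cumsum(draws) / np.arange(1, draws.size + 1)
    print(running_mean[[999, 9_999, 99_999, 999_999]])   # no convergence pattern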
However, the Cauchy distribution – like all distributions – always has a
characteristic function, which can be shown to be the following.
ϕX (t; µ, σ) = exp (iµt − σ |t|) (2.60)
Note that this particular characteristic function is not differentiable at t = 0,
thus it cannot help derive the moments of the distribution. As it lacks the
moments, the Cauchy distribution has limited practical applications; it is
however notable for its links – to be illustrated later – with distributions of
more practical use such as the normal and the Student’s t-distribution.
Laplace distribution
The last location-scale family discussed here is that of the Laplace distri-
bution, characterized by support X = R and the following density function.
fX(x; µ, σ) = (1/(2σ)) exp(−|x − µ|/σ)   (2.61)
Note that this density function is not differentiable at x = µ; despite this,
it is easy to integrate on the two complementary subsets of
R that are split at that point. Hence, the cumulative distribution
function has two distinct closed form expressions:
(
1
exp x−µ if x < µ
FX (x; µ, σ) = 2 σ
(2.62)
1 x−µ
if x ≥ µ
1 − 2 exp − σ
[Figure 2.6: Laplace density functions, standard version and versions with µ = 2 and with σ = 2]
The three Laplace densities displayed in Figure 2.6 all take the typical
“tent shape” that characterizes this distribution. Unlike in the Cauchy case,
this distribution has well-defined moments; to see this, it is easiest to start
from calculating the moment generating function in the standardized case:
MZ(t) = ∫_{−∞}^{+∞} (1/2) exp(tz − |z|) dz
      = (1/2) ∫_{−∞}^{0} exp((1 + t)z) dz + (1/2) ∫_{0}^{+∞} exp(−(1 − t)z) dz
      = (1/2) [1/(1 + t) + 1/(1 − t)]
      = 1/(1 − t²)
which is only defined for |t| < 1 (or else the two integrals in the second line
diverge). By the properties of the moment generating functions for linearly
transformed random variables, it is easy to see that in the general case:
MX(t; µ, σ) = exp(µt)/(1 − σ²t²)   (2.64)
where, for analogous reasons, this moment generating function is defined
only for |t| < σ−1 . By (2.64) it is possible to obtain all the other moments;
the mean and variance for example are as follows.
E [X] = µ (2.65)
Var [X] = 2σ2 (2.66)
The Laplace distribution has a limited number of applications (it can be
used, for example, to model the growth rates of firms) but is mostly known
role; neither of them is more characteristic of the location or the scale of the
distribution (as µ and σ are for the distributions examined previously). If
a random variable X follows the uniform distribution on the [a, b] interval,
one usually writes:

X ∼ U(a, b)

where X ∼ U(0, 1) is just a special case. By generalizing the examples
given in the previous lecture, it is straightforward to verify that the moment
generating function of a generic uniform distribution is:
MX(t; a, b) =
   [exp(bt) − exp(at)] / [t(b − a)]   if t ≠ 0
   1                                  if t = 0
(2.69)
while the mean and variance are as follows.
E[X] = (a + b)/2   (2.70)

Var[X] = (b − a)²/12   (2.71)
Depending on the context of interest, all uniform distributions can be alter-
natively defined on the open interval (a, b); the analysis of the distribution
is largely unaffected whether the support is an open or a closed set.
Beta distribution
With the expression Beta distributions one usually refers to a family of dis-
tributions that, like the uniform distribution, have support on a segment of
the real line, but unlike the uniform distribution, can take varying shapes.
The starting point in their description is the standard family of Beta dis-
tributions with support on the unit interval, X = (0, 1). These particular
distributions are defined by two positive parameters α ∈ R++ and β ∈ R++
that jointly define the shape of the density function, which reads:
fX(x; α, β) = x^(α−1) (1 − x)^(β−1) / ∫_0^1 u^(α−1) (1 − u)^(β−1) du = x^(α−1) (1 − x)^(β−1) / B(α, β)   for x ∈ (0, 1)   (2.72)
where B (α, β) is a Beta function with the parameters α and β as arguments
and serves as a normalization constant to ensure that the density integrates
to one on the unit interval. The Beta function is also related to the so-called
Gamma function Γ (γ), a function with one argument γ ∈ R++ :3
Γ(γ) = ∫_0^∞ u^(γ−1) exp(−u) du
³ Specifically, it can be shown that B(a, b) = Γ(a)Γ(b)/Γ(a + b).
[Figure 2.7: Beta density functions for (α, β) = (2, 2), (.5, .5), (2, 5), (5, 2)]
Figure 2.7 displays some examples of the standard Beta distribution that
are different from the uniform case. The shapes assumed by the different
density functions in the figure are wildly different, and more can be obtained
with different configurations of the parameters. It is easy to calculate the
moments of this distribution directly. Observe that, because of (2.72):

E[X^r] = [1/B(α, β)] ∫_0^1 x^(r+α−1) (1 − x)^(β−1) dx = B(r + α, β)/B(α, β)
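The ratio B(r + α, β)/B(α, β) is easy to evaluate numerically; the following sketch (not part of the original notes) compares it with the moments reported by scipy for the arbitrary choice α = 2 and β = 5.

    # Minimal sketch: Beta moments from Beta functions versus scipy.
    from scipy import special, stats

    alpha, beta = 2.0, 5.0
    for r in (1, 2):
        moment = special.beta(r + alpha, beta) / special.beta(alpha, beta)
        print(r, moment, stats.beta.moment(r, alpha, beta))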
Exponential distribution
Like the uniform distribution, even the exponential distribution has already
been introduced in Lecture 1 through a special case, defined as that with
“unit parameter.” The larger family of exponential distributions takes its
support on the set of nonnegative real numbers X = R+ , and it allows for
different values of the parameter λ ∈ R++ (where λ = 1 is clearly the “unit
parameter” case). The probability density function reads generally as:
fX(x; λ) = (1/λ) exp(−x/λ)   for x ≥ 0   (2.79)
[Figure 2.8: exponential density functions for λ = .5, 1, 2]
Y ∼ Exp (Kλ)
Gamma distribution
The Gamma family is central in the taxonomy of continuous distributions,
since it relates directly or indirectly to many other such families. Its support
equals the set of positive real numbers, X = R++; like the Beta family, it
is identified by two positive parameters α ∈ R++ and β ∈ R++ (but several
different parametrizations are possible, here the focus is on a specific one).
The name of this distribution derives from the fact that its density function
features the Gamma function Γ(α) as its normalization constant:

fX(x; α, β) = [β^α/Γ(α)] x^(α−1) exp(−βx)   for x > 0   (2.84)

note that for α = 1 this reduces to the exponential density with parameter λ = 1/β; that
is, exponential distributions are all special cases of the Gamma family.
[Figure 2.9: Gamma density functions for (α, β) = (2, 2), (4, 2), (2, 8)]
Figure 2.9 displays different Gamma density functions for α > 1; they
all display the asymmetric humped shape that usually characterizes these
distributions. Similarly as with the Beta distributions, the direct calculation
of the Gamma uncentered moments is easy. By inspecting (2.84), it is:

E[X^r] = [1/Γ(α)] (1/β^r) ∫_0^∞ β^(r+α) x^(r+α−1) exp(−βx) dx = Γ(r + α)/(Γ(α) β^r)
since the expression inside the integral, divided by Γ (r + α), is the density
function of yet another Gamma distribution with parameters r + α and β.
All the moments are thus easily obtained thanks to the Gamma function’s
property that Γ (γ + 1) = γΓ (γ); for example, the mean and the variance
are expressed as follows.
E[X] = α/β   (2.86)

Var[X] = α/β²   (2.87)
Alternatively, one could have used the moment generating function, which
is calculated in analogy with the uncentered moments.

MX(t; α, β) = [1/Γ(α)] ∫_0^∞ exp(tx) β^α x^(α−1) exp(−βx) dx
            = [β^α/(β − t)^α] · [1/Γ(α)] ∫_0^∞ (β − t)^α x^(α−1) exp(−(β − t)x) dx
            = [β/(β − t)]^α   (2.88)
            = (1 − t/β)^(−α)
Note that the integral in the second line is easily related to the density function
of a Gamma distribution with parameters α and β − t > 0, which again helps
simplify the calculation. This implies that the moment generating function is
only defined for t < β.
The Gamma distribution has numerous applications in several branches
of science; but its direct applications in the social sciences are quite scant.
As mentioned, the main importance of this distribution lies in its relation-
ship with other distributions.
Chi-squared distribution
The family of chi-squared distributions is central in the theory of statistical
inference. It has its support on the positive real numbers X = R++ , a single
positive parameter κ ∈ R++ , and its density function is as follows.
fX(x; κ) = [1/(Γ(κ/2) · 2^(κ/2))] x^(κ/2 − 1) exp(−x/2)   for x > 0   (2.89)
If a random variable X follows the chi-squared distribution, this is written
in the following way, which might slightly differ in the details.
X ∼ χ2 (κ) or X ∼ χ2κ
[Figure 2.10: chi-squared density functions for κ = 3, 5, 7]
The two observations above clarify that the chi-squared distributions are
a subfamily of the Gamma family, and are consequently related to the ex-
ponential distribution as well. Unsurprisingly, the chi-squared distributions
typically have humped shapes similar to the Gamma ones, as displayed in
Figure 2.10. For low values of κ however, as for Gamma distributions with
specific combinations of α and β, this is not the case; recall for example
Figure 1.9 which shows that fX (x, κ = 1) approaches the y-axis asymptoti-
cally as x → 0, and lacks a maximum.5 Since every chi-squared distribution
is a particular Gamma distribution, the analysis of the former follows that
of the latter. For example, the mean and variance are as follows.
E [X] = κ (2.90)
Var [X] = 2κ (2.91)
The moment generating function, defined for t < 0.5, is instead given below.
MX(t; κ) = (1 − 2t)^(−κ/2)   (2.92)
⁴ In this case one says that some random variable X follows the chi-squared distribution "with κ degrees of freedom."
⁵ The chi-squared density function with one degree of freedom given in Example 1.12 is reconciled with (2.89) by noting that Γ(1/2) = √π.
Snedecor’s F -distribution
The distribution named after Ronald Fisher and George W. Snedecor – for
brevity, Snedecor's F-distribution or just the F-distribution – is yet another
quite involved family of distributions with support restricted to the positive
set of real numbers, X = R++, which is defined by two positive parameters
(ν1, ν2) ∈ R²++; its density is given as:

fX(x; ν1, ν2) = [1/B(ν1/2, ν2/2)] (ν1/ν2)^(ν1/2) x^(ν1/2 − 1) [1 + (ν1/ν2) x]^(−(ν1+ν2)/2)   for x > 0   (2.93)
where B(ν1/2, ν2/2) is a Beta function of the (halved) parameters. Its cumulative
distribution can be expressed in a compact way by using the previously
introduced incomplete Beta function:

FX(x; ν1, ν2) = B(x; ν1/2, ν2/2) / B(ν1/2, ν2/2)   (2.94)
but these two moments are only defined for ν2 > 2 and ν2 > 4 respectively
(or else the integral that defines them diverges).
[Figure 2.11: F density functions for (ν1, ν2) = (2, 2), (2, 6), (12, 12)]
Student’s t-distribution
The distribution named after the pseudonym of William Sealy Gosset, that
is Student (while ‘t’ derives from the distribution’s use in statistical tests),
is the only one analyzed in this section which has support on the entire set of
real numbers.
⁶ In such settings, one would say that X follows the F-distribution "with ν1 and ν2 degrees of freedom."
[Figure 2.12: densities of Student's t with ν = 3, the standard Cauchy, and the standard normal]
Pareto distribution
This section is concluded with the analysis of the distribution named after
Vilfredo Pareto, a famous distribution with support defined on a subset of
the set of positive real numbers, X = [α, ∞). Here, α ∈ R++ is a parameter
of the family of Pareto distributions; this role is shared with another positive
parameter β ∈ R++ . Given two such parameters, the density function of a
particular Pareto distribution is:
fX(x; α, β) = βα^β / x^(β+1)   for x ≥ α   (2.104)

and the cumulative distribution is:

FX(x; α, β) = 1 − (α/x)^β   for x ≥ α   (2.105)
and zero otherwise (x < α). Its cumulative distribution clearly tends to 1 as
X tends to infinity. The shape of the distribution is displayed in Figure 2.13
below for a fixed value α = 1 and three different values of β. Clearly, the
parameter β affects the shape of the distribution, with lower values making
the distribution flatter, and vice versa (similarly as, but contrarily to, λ in
the exponential distribution’s case). Instead, parameter α affects both the
location of the distribution (as it defines the support) and the overall scale.
A random variable X distributed according to some Pareto distribution is
denoted as follows.
X ∼ Pareto (α, β)
[Figure 2.13: Pareto density functions for α = 1 and β = 1, 2, 3]
E[X] =
   ∞             for β ≤ 1
   αβ/(β − 1)    for β > 1
(2.107)

while the variance is as follows.

Var[X] =
   ∞                         for β ≤ 2
   α²β/[(β − 1)²(β − 2)]     for β > 2
(2.108)
Intuitively, the Pareto distribution should not be too flat, or else its right
tail becomes “too heavy” causing the relevant moments to diverge. This is
a well known property of the Pareto distribution.
Like the lognormal and the Gamma distributions, the Pareto distribu-
tion is typically used to model asymmetric phenomena that are associated
with positive real numbers; it may be more motivated than its competing
alternatives when it is necessary to study phenomena that display “fat tails”
(i.e. notable inequality), such as the distribution of wealth or that of cities’
size. Another attractive feature of the Pareto distribution is that the den-
sity function (2.104) satisfies a mathematical power law, which is easy to
describe as a linear function upon applying a logarithmic transformation.
Furthermore, the power law implies that the cumulative distribution (2.105)
can be easily inverted, resulting in a quantile function of conveniently simple
manipulation.
QX(p; α, β) = α (1 − p)^(−1/β)
and one can easily verify that in all cases, the distribution integrates to 1.
Observe that for both the density function and the cumulative distribution,
the expression for ξ = 0 corresponds to the limit case of the expression for
ξ 6= 0. An important property of GEV distributions is that (2.113) is easy
to invert in all cases, so that the quantile function can be written as follows.
QZ(p; ξ) =
   [(− log(p))^(−ξ) − 1]/ξ    for ξ > 0, p ∈ [0, 1); or ξ < 0, p ∈ (0, 1]
   − log(− log(p))            for ξ = 0 and p ∈ (0, 1)
(2.114)
Note that the restrictions in the domain of the quantile function correspond
to the restrictions on the support of X.
[Figure 2.14: GEV density functions for ξ = .5, 0, −.5]
X ∼ GEV (µ, σ, ξ)
where Γ(·) is the Gamma function and γ ≈ 0.577 is the Euler-Mascheroni
constant, while the variance is as follows.

Var[X] =
   (σ²/ξ²) [Γ(1 − 2ξ) − (Γ(1 − ξ))²]    if ξ ≠ 0 and ξ < 1/2
   σ²π²/6                               if ξ = 0
   ∞                                    if ξ ≥ 1/2
(2.116)
The remainder of this section (and of this lecture alike) discusses in more
detail three particular cases of GEV distributions. These are of particular
interest for economists and econometricians, as they feature prominently in
both theoretical economic models with a stochastic component and in the
closely related structural econometric models. These restricted subfamilies
of the larger GEV family are typically named according to their discoverers,
but are also distinguished by a number from a classification between types.
In such a case, the support is Y = [µ, ∞), the density function reads as:

fY(y; α, µ, σ) = (α/σ) ((y − µ)/σ)^(−α−1) exp(−((y − µ)/σ)^(−α))   for y > µ   (2.121)

the cumulative distribution function as:

FY(y; α, µ, σ) = exp(−((y − µ)/σ)^(−α))   for y > µ   (2.122)

and the quantile function as follows.

QY(p; α, µ, σ) = µ + σ (− log(p))^(−1/α)   for p ∈ [0, 1)   (2.123)
[Figure 2.16: Fréchet density functions, standard version and versions with µ = 2 and with σ = 2]
Three different Fréchet densities for a fixed value of the shape parameter
are shown above in Figure 2.16, again highlighting the role of the location
and scale parameters; observe how the former (µ) also affects the bound of
the distribution’s support. The mean and the variance of the transformed
random variable, if they are finite, are obtained by applying the standard
properties of simple moments to (2.115) and (2.116) respectively. If it fol-
lows the Fréchet distribution, a random variable Y is indicated either as:
Y ∼ Frechet (α, µ, σ)
or as:
Y ∼ EV2 (α, µ, σ)
where the latter notation now refers to “Type II.” The most frequent use of
the Fréchet distribution is similar to the Gumbel’s, that is, for modeling the
maximum value of many realizations. Furthermore, the Fréchet distribution
features prominently in several economic and econometric models, especially
in the field of international trade.
W = − [σ + µ (1 − ξ) + ξX] (2.124)
one obtains the (traditional) Weibull distribution, which predates the the-
ory of GEV distributions (thus explaining the name reverse Weibull for the
Type III GEV case, since W = −Y ). The traditional Weibull distribution
has support W = [µ, ∞); its density function is:
fW(w; α, µ, σ) = (α/σ) ((w − µ)/σ)^(α−1) exp(−((w − µ)/σ)^α)   for w > µ   (2.125)
its cumulative distribution function as:
FW(w; α, µ, σ) = 1 − exp(−((w − µ)/σ)^α)   for w > µ   (2.126)
[Figure 2.17: reverse Weibull density functions, standard version and versions with µ = 2 and with σ = 2]
[Figure 2.18: Weibull density functions, standard version and versions with µ = 2 and with σ = 2]
Figures 2.17 and 2.18 display the reverse Weibull and the Weibull dis-
tribution respectively, for α = 2 and the same perturbations of the location
and scale parameters. The two figures clarify that each distribution is the
perfect “mirror image” of the other, as their names and mathematical rela-
tionship suggest. The moments of both the inverse Weibull and the Weibull
distribution can again be appropriately derived from the expressions in the
GEV case. If a random variable Y follows the reverse Weibull distribution,
this is best written with an explicit reference to the “Type III” GEV:
Y ∼ EV3 (α, µ, σ)
while one writes:
W ∼ Weibull (α, µ, σ)
if a random variable W follows the (traditional) Weibull distribution.
This section is concluded by highlighting some relationships between the
Weibull distribution, other types of GEV distributions, and the exponential
distribution (and by extension, all distributions related to the exponential).
Observation 2.18. If X ∼ Exp(1), Y = µ − σ log(X), and W = µ + σX^(1/α),
it is Y ∼ Gumbel(µ, σ) and W ∼ Weibull(α, µ, σ).
Observation 2.19. If X ∼ Exp(√α) and W ∼ Weibull(1/2, 0, α), the two
random variables are symmetrically related: X = √W and W = X².
Observation 2.20. If Y ∼ Frechet(α, µY, σ), and W = (Y − µY)^(−1) + µW,
it is W ∼ Weibull(α, µW, σ^(−1)).
The (traditional) Weibull distribution is most often used to model the min-
imum value among multiple realizations, contrary to the Gumbel and the
Fréchet cases. A frequent application is in survival analysis (the statistical
study of waiting times) along with the related exponential distribution.
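Observation 2.18 lends itself to a simple simulation check; the sketch below (not part of the original notes) transforms unit-exponential draws as indicated and compares the results with scipy's Gumbel and Weibull implementations through Kolmogorov-Smirnov tests, for arbitrary parameter values.

    # Minimal sketch: Exp(1) draws mapped into Gumbel and Weibull variables.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    x = rng.exponential(scale=1.0, size=100_000)    # X ~ Exp(1)
    mu, sigma, alpha = 1.0, 2.0, 3.0                # hypothetical parameter values

    y = mu - sigma * np.log(x)                      # should be Gumbel(mu, sigma)
    w = mu + sigma * x**(1 / alpha)                 # should be Weibull(alpha, mu, sigma)

    print(stats.kstest(y, stats.gumbel_r(loc=mu, scale=sigma).cdf).pvalue)
    print(stats.kstest(w, stats.weibull_min(alpha, loc=mu, scale=sigma).cdf).pvalue)
    # large p-values indicate the samples are consistent with the stated laws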
Lecture 3
Random Vectors
This lecture introduces those conceptual and mathematical tools that are
necessary to handle multiple random variables: these notions include those
of random vector, joint versus marginal probability distribution, multivari-
ate transformation of a random vector, independence, covariance and cor-
relation, conditional distribution and conditional moments. While develop-
ing these concepts, this lecture introduces additional relationships between
common univariate distributions, concluding with the treatment of two im-
portant multivariate distributions. The mathematical notation is chosen so
as to facilitate the later treatment of econometric theory.
where the summation is taken over the vectors t in X whose elements are all
smaller than or equal to the corresponding elements of x. In addition, the joint probability
mass function must obviously satisfy the following condition.

P(x ∈ X) = Σ_{x∈X} fx(x) = 1

Clearly, the joint density integrates to one over the entire support of x.

P(x ∈ X) = ∫_{X1} · · · ∫_{XK} fx(x1, . . . , xK) dx1 . . . dxK = 1
The above relationship expresses a summation over all the supports of all
the other random variables in x, excluding Xk . It has to be interpreted in
a general sense, whatever the dimension of K and the actual index k are: if
K is small and/or k is at either extreme of the list, it must be reformulated
accordingly. This is best seen with small values of K.
Example 3.1. Joint Medical Outcomes. Recall Example 1.4 about the
probability of getting sick following the take-up (or lack thereof) of some
preemptive medical treatment. One could reformulate that example via a
random vector (X, Y ) where: x = 1 if an individual is a taker, x = 0 if he
or she hesitates, y = 1 if an individual stays healthy, y = 0 if he or she gets
sick. This is a bivariate Bernoulli distribution with:
fX,Y (x = 1, y = 1) = 0.40, fX,Y (x = 1, y = 0) = 0.20,
fX,Y (x = 0, y = 1) = 0.15, fX,Y (x = 0, y = 0) = 0.25,
and the marginal mass function of either Bernoulli-distributed random vari-
able is obtained by appropriately summing over the support of the other.
fX (x) = fX,Y (x, y = 1) + fX,Y (x, y = 0) for x = 0, 1
fY (y) = fX,Y (x = 1, y) + fX,Y (x = 0, y) for y = 0, 1
Bernoulli distributions are typically represented through frequency tables,
where joint probabilities are displayed at the center, and marginal proba-
bilities at the margins; a frequency table for this example is shown below.
Y =0 Y =1 Total
X=0 0.25 0.15 0.40
X=1 0.20 0.40 0.60
Total 0.45 0.55 1
The denomination “marginal” clearly derives from this graphical device.
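The same marginalization can be written as a two-line computation; the sketch below (not part of the original notes) stores the joint probabilities of Example 3.1 in a dictionary and sums over the other variable.

    # Minimal sketch: marginal distributions from the joint probabilities.
    joint = {(1, 1): 0.40, (1, 0): 0.20, (0, 1): 0.15, (0, 0): 0.25}   # keys are (x, y)

    f_X = {x: sum(p for (xx, yy), p in joint.items() if xx == x) for x in (0, 1)}
    f_Y = {y: sum(p for (xx, yy), p in joint.items() if yy == y) for y in (0, 1)}
    print(f_X)   # {0: 0.4, 1: 0.6}
    print(f_Y)   # {0: 0.45, 1: 0.55}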
Analogously, the density functions of continuous marginal distributions
can be obtained by integrating the joint density over the support of all the
random variables in the random vector, except the one of interest.
Definition 3.7. Marginal Distribution (continuous case). For a given
random vector x made of continuous random variables only, the probability
density function of Xk – the k-th random variable in x – is obtained as:
f_{Xk}(xk) = ∫_{×_{ℓ≠k} Xℓ} f_x(x) dx_{−k}
and thus F_{Xk}(xk) = ∫_{−∞}^{xk} f_{Xk}(t) dt.
In this more compact definition, the notation ×_{ℓ≠k} Xℓ indicates the Cartesian product of all the supports of each random variable in x excluding Xk: e.g. ×_{ℓ≠k} Xℓ = X1 × · · · × X_{k−1} × X_{k+1} × · · · × XK; similarly the expression dx_{−k} for the differential is to be interpreted as the product of all differentials except the one for xk: dx_{−k} = dx1 · · · dx_{k−1} dx_{k+1} · · · dxK.
f_{X1,X2}(x1, x2; µ1, µ2, σ1, σ2, ρ) = [ 2π σ1 σ2 √(1 − ρ²) ]⁻¹ ×
    × exp( −(x1 − µ1)²/[2σ1²(1 − ρ²)] − (x2 − µ2)²/[2σ2²(1 − ρ²)] + ρ(x1 − µ1)(x2 − µ2)/[σ1 σ2 (1 − ρ²)] )    (3.1)
[Figure: surface plot of the bivariate normal density (3.1) over the (x1, x2) plane.]
while the observation that both densities must integrate to one gives:
fG (g = 1; p) = p
fG (g = 0; p) = 1 − p
This implies X1 = g1⁻¹(Y1, Y2) = log(Y1) and X2 = g2⁻¹(Y1, Y2) = log(Y2),
[Figure: joint density f_{Y1,Y2}(y1, y2) of the resulting bivariate lognormal distribution, with the marginal densities f_{Y1}(y1) and f_{Y2}(y2) displayed along the axes.]
This example is relatively simple, because the two original random vari-
ables X1 and X2 do not interact in the transformation (that is, the Jacobian
is diagonal). Some more elaborate cases are discussed in the next section.
However, this example is also useful in itself as an occasion to graphically
visualize another bivariate distribution (in this case, the lognormal).
All the concepts and ideas discussed until this point in this lecture ex-
tend easily to random matrices, that is arrayed combinations of L random
vectors, written e.g. X = (x1 . . . xL). The convention of reserving bold uppercase
letters for random matrices explains why uppercase letters are not used to denote random
vectors. Random matrices do not involve particular conceptual hurdles, be-
ing nothing else but a different algebraic way of arraying random variables.
However, they are necessary in multivariate statistical analysis and econo-
metrics as a means to more elegantly, compactly and often clearly describe
statistical estimators and their properties.
3.2. Independence and Random Ratios
Note that within each random vector, the underlying random variables are
not necessarily independent. Moreover, if all the random vectors in question
have length one, these definitions reduce to those given above.
Two results about independent random variables are well worth being
discussed: the first helps the interpretation of independence, the second is of
more practical use and instrumental to derive other properties and results.
and:
det( ∂ g⁻¹(y, z) / ∂(y, z)ᵀ ) = det [ z   y
                                       0   1 ] = z > 0
and in both cases the result is equal to z which is positive by construction,
leaving no need to take absolute values. In addition, since the two sets X1
and X2 in the partition are both symmetric around x2 = 0 – just like the
transformation that defines z is – the joint density of y can be obtained by
applying Theorem 3.1 once to the joint density of x, and multiplying the result by
two.
f_{Y,Z}(y, z) = (z/π) exp( −(y² + 1) z²/2 )
The final objective is to show that the marginal density of Y indeed follows
the standard Cauchy distribution. To achieve this, the route is to integrate
the joint density of y over the support of Z, which is R+ :
f_Y(y) = ∫_{0}^{+∞} f_{Y,Z}(y, z) dz
       = ∫_{0}^{+∞} (z/π) exp( −(y² + 1) z²/2 ) dz
       = ∫_{0}^{+∞} (1/2π) exp( −(y² + 1) u/2 ) du
       = [ 1/(π(y² + 1)) ] ∫_{0}^{+∞} [ (y² + 1)/2 ] exp( −(y² + 1) u/2 ) du
       = 1/[ π (y² + 1) ]
where in the third line the change of variable u = z² is applied, while the
integral in the fourth line equals one because it is the total probability of an
exponential distribution with parameter λ = (y² + 1)/2. The final result is
indeed the probability density function of a standard Cauchy distribution,
as it was originally postulated.
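A quick Monte Carlo sanity check of this derivation can be run in Python; the sketch below (an illustrative addition, with arbitrary simulation sizes) compares the empirical distribution of a ratio of two independent standard normals with the standard Cauchy cumulative distribution F(y) = 1/2 + arctan(y)/π.

import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Ratio of two independent standard normals.
y = rng.standard_normal(n) / rng.standard_normal(n)

# Compare the empirical CDF at a few points with the standard Cauchy CDF.
points = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
empirical = np.array([(y <= p).mean() for p in points])
theoretical = 0.5 + np.arctan(points) / np.pi
print(np.round(empirical, 3))
print(np.round(theoretical, 3))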
Observation 3.2. If Z ∼ N(0, 1) and X ∼ χ²(ν), and the two random
variables Z and X are independent, the random variable Y obtained as
Y = Z/√(X/ν) is such that Y ∼ T(ν).
Proof. The first step is to show what the distribution of the transformed
random variable W = √(X/ν) is. The distribution of the square root of a
random variable following the chi-squared distribution is called (unsurpris-
ingly) the chi distribution, and W just follows one rescaled version of it.
This transformation is monotone, it preserves the support X = W = R+, its
In the above analysis, the third line applies the change of variable u = w²;
the fourth line is obtained through some manipulation, whereas the integral
therein is recognized as that of a Gamma density with parameters
α = (ν + 1)/2 and β = (ν + y²)/2, and thus integrates to one.
Observation 3.3. If X1 ∼ χ2 (ν1 ) and X2 ∼ χ2 (ν2 ), and the two random
variables X1 and X2 are independent, the random variable Y obtained as
Y = (X1 /ν1 ) / (X2 /ν2 ) is such that Y ∼ F (ν1 , ν2 ).
Proof. (Outline.) This proceeds as in the previous two observations. First,
define W1 = X1 /ν1 ; the density function of this transformation is the same
as X1 ’s but multiplied by ν1 , and similarly for W2 = X2 /ν2 . The next step
is the transformation Y = W1 /W2 and Z = |W2 |; the joint density function
of Y and Z can be derived consequently from the one of W1 and W2 . Some
manipulation would then reveal that the marginal density of the ratio Y is
that of an F -distribution with parameters ν1 and ν2 .
The last observation is presented completely without proof.
Observation 3.4. If X1 ∼ Γ (α, γ) and X2 ∼ Γ (β, γ), and the two random
variables X1 and X2 are independent, the random variable Y obtained as
Y = X1 / (X1 + X2 ) is such that Y ∼ Beta (α, β), and is independent of the
random variable W obtained as W = X1 + X2 such that W ∼ Γ (α + β, γ).
This completes the picture about the best known results on random ratios.
Among these Observations, 3.2 and 3.3 play an important role in statistical
inference, as elaborated in the next lectures.
3.3. Multivariate Moments
where in both cases, dx = dx1 . . . dxK is the product of all the differentials.
In a multivariate context it is interesting to describe the degree by which
two random variables tend to deviate from their mean in the same direction,
in a probabilistic sense. This concept is expressed by the covariance (an
absolute measure) and the correlation (a normalized one).
Definition 3.11. Covariance. For any two random variables Xk and X`
belonging to a random vector x, their specific covariance is defined as the
expectation of a particular function of Xk and X` , that is, the product of
both variables’ deviations from their respective means.
Cov [Xk , X` ] = E [(Xk − E [Xk ]) (X` − E [X` ])]
The full expression is written as follows, for discrete and continuous random
variables respectively.
Cov[Xk, Xℓ] = Σ_{x1∈X1} · · · Σ_{xK∈XK} (xk − E[Xk]) (xℓ − E[Xℓ]) f_x(x)
Cov[Xk, Xℓ] = ∫_{X1} · · · ∫_{XK} (xk − E[Xk]) (xℓ − E[Xℓ]) f_x(x) dx
The covariance takes positive values if the two variables (Xk and X` ) tend
to deviate from the mean in the same direction, and negative vice versa. It
must be observed that, however, the covariance expresses a relationship of
dependence which is essentially linear; if two random variables tend to move
together in a very non-linear or irregular way, this may not be captured at
all by the covariance. Similarly to the variance, the definition of covariance
can be rewritten in a way that is often more convenient to handle.
Cov [Xk , X` ] = E [(Xk − E [Xk ]) (X` − E [X` ])]
= E [Xk X` ] − E [Xk E [X` ]] − E [X` E [Xk ]] + E [Xk ] E [X` ]
= E [Xk X` ] − E [Xk ] E [X` ]
Definition 3.12. Correlation. For any two random variables Xk and X`
belonging to a random vector x, their population correlation is defined as
follows.
Corr[Xk, Xℓ] = Cov[Xk, Xℓ] / ( √Var[Xk] · √Var[Xℓ] )
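As an illustrative aside (not part of the original notes), the following Python sketch computes empirical covariance and correlation matrices from simulated bivariate normal draws; the chosen parameter values (ρ = 0.6, σ1 = 1, σ2 = 2) are arbitrary.

import numpy as np

rng = np.random.default_rng(1)

# Simulate a bivariate normal with rho = 0.6, sigma1 = 1, sigma2 = 2.
mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 1.2],
                  [1.2, 4.0]])   # off-diagonal = rho*sigma1*sigma2
x = rng.multivariate_normal(mu, Sigma, size=200_000)

# Empirical covariance and correlation matrices.
print(np.cov(x, rowvar=False))       # close to Sigma
print(np.corrcoef(x, rowvar=False))  # off-diagonal close to rho = 0.6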
or equivalently:
P ((X − E [X]) t + (Y − E [Y ]) = 0) = 1
which only occurs if, given Y = aX + b:
a = −t,        b = E[X] · t + E[Y],        t = − Cov[X, Y] / Var[X]
and the proof of b. is completed by showing that a and Corr [X, Y ] must
also share the same sign.
Result a. in the above Theorem characterizes the normalized interpretation
of correlation. Result b. instead specifies the linear nature of the relation-
ship captured by measures of correlation, which equal either 1 or −1 if and
only if the two random variables under consideration are connected through
an exact linear dependence.
E [XY ] = E [X] E [Y ]
E [U V ] = E [U ] E [V ]
because U and V are also independent (so long as all the relevant moments
exist); this also implies that U and V have zero covariance and correlation
and that all higher moments of X and Y inherit this property, for example:
E[ (X − E[X])² (Y − E[Y])² ] = E[ (X − E[X])² ] E[ (Y − E[Y])² ]
E [X1 X2 ] = ρσ1 σ2 + µ1 µ2
[Figure: surface plot of a bivariate normal density over the (x1, x2) plane, for one value of the correlation parameter ρ.]
[Figure: surface plot of a bivariate normal density over the (x1, x2) plane, for another value of the correlation parameter ρ.]
Note that in all these cases the marginal distributions stay the same, since
they do not depend upon the parameter ρ; as already mentioned, the fact that
the marginal distributions are unaffected by the dependence parameter is a
distinctive feature of the bivariate normal distribution.
• Finally, Var[x] is positive semi-definite, that is, for any non-zero vec-
tor a of length K, the quadratic form aᵀ Var[x] a is nonnegative.
This property is demonstrated later while analyzing the moments of
linear transformations of random vectors.
Example 3.6. Summarizing the Moments of Bivariate Normal Dis-
tributions. All the moments of the bivariate normal distribution from the
previous examples can be summarized using the following notation:
µ ≡ (µ1, µ2)ᵀ = E[x]
and
Σ ≡ [ σ1²      ρσ1σ2
      ρσ1σ2    σ2²   ] = Var[x]
and it is straightforward to verify that Σ complies with the properties of all
variance-covariance matrices. If x = (X1 , X2 ) follows the bivariate normal
distribution, one can write x ∼ N (µ, Σ).
With the aid of some linear algebra, the usual properties of means and
variances are generalized to a multivariate environment. Consider a random
vector x with mean E [x] and variance Var [x] in the three following cases.
• Linear Transformations returning Scalars. Consider some vec-
tor a = (a1 , . . . , aK )T of length K which, multiplied to x, returns the
random variable Y = aT x as a linear combination. Because expecta-
tions are linear operators, the mean of Y is:
E[Y] = E[aᵀx]
     = E[a1 X1 + · · · + aK XK]
     = a1 E[X1] + · · · + aK E[XK]
     = aᵀ E[x]
as for the variance of Y instead:
Var[Y] = Var[aᵀx]
       = E[ ( aᵀx − E[aᵀx] ) ( aᵀx − E[aᵀx] )ᵀ ]
       = E[ aᵀ (x − E[x]) (x − E[x])ᵀ a ]
       = aᵀ E[ (x − E[x]) (x − E[x])ᵀ ] a
       = aᵀ Var[x] a
where the last expression is a quadratic form that cannot be negative
(showing that Var [x] is positive semi-definite); in particular:
Var[Y] = Σ_{k=1}^{K} [ a_k² Var[Xk] + 2 Σ_{ℓ=1}^{k−1} a_k a_ℓ Cov[Xk, Xℓ] ]
Cov[x, y] = E[x yᵀ] − E[x] E[y]ᵀ
3.4. Multivariate Moment Generation
The r-th moments of each k-th element of the random vector x
can be calculated in analogy with the univariate case.
E[Xk^r] = ∂^r Mx(t)/∂tk^r |_{t=0} = (1/i^r) · ∂^r φx(t)/∂tk^r |_{t=0}
Furthermore, the cross-moments are obtained, for two integers r and s, as:
E[Xk^r Xℓ^s] = ∂^{r+s} Mx(t)/(∂tk^r ∂tℓ^s) |_{t=0} = (1/i^{r+s}) · ∂^{r+s} φx(t)/(∂tk^r ∂tℓ^s) |_{t=0}
and the case of the characteristic function is analogous. This fact allows to
calculate covariances using these two important functions.
Example 3.7. Moment Generating Function and Covariance of the
Bivariate Normal Distribution. The moment generating function of the
bivariate normal distribution is the following.
M_{X1,X2}(t1, t2) = E[exp(t1 X1 + t2 X2)]
                  = exp( t1 µ1 + t2 µ2 + (1/2)( t1² σ1² + 2 t1 t2 ρσ1σ2 + t2² σ2² ) )
Obtaining this expression while keeping track of all these parameters is not
as difficult as it is annoying, therefore a proper and more elegant derivation
is postponed to the later, more general analysis of the multivariate normal
distribution. Here the point is to show how the covariance between X1 and
X2 can be derived via the moment generating function. It is not difficult to
see that E[Xk] = µk and E[Xk²] = σk² + µk² for k = 1, 2, as in the univariate
case. As per the first cross-moment, some calculations show that:
∂² M_{X1,X2}(t1, t2) / ∂t1 ∂t2 = [ (µ1 + t1 σ1² + t2 ρσ1σ2)(µ2 + t2 σ2² + t1 ρσ1σ2) + ρσ1σ2 ] ×
    × exp( t1 µ1 + t2 µ2 + (1/2)( t1² σ1² + 2 t1 t2 ρσ1σ2 + t2² σ2² ) )
and evaluating the above expression for t1 = 0 and t2 = 0 gives:
E[X1 X2] = ∂² M_{X1,X2}(t1, t2) / ∂t1 ∂t2 |_{t1,t2=0} = ρσ1σ2 + µ1µ2
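The same cross-derivative can be checked symbolically; the following sketch (an illustrative addition, assuming the sympy library is available) differentiates the moment generating function above and recovers ρσ1σ2 + µ1µ2.

import sympy as sp

t1, t2, m1, m2, s1, s2, rho = sp.symbols('t1 t2 mu1 mu2 sigma1 sigma2 rho', real=True)

# Moment generating function of the bivariate normal distribution.
M = sp.exp(t1*m1 + t2*m2
           + sp.Rational(1, 2)*(t1**2*s1**2 + 2*t1*t2*rho*s1*s2 + t2**2*s2**2))

# Cross-moment E[X1 X2]: second cross-derivative evaluated at t1 = t2 = 0.
cross = sp.diff(M, t1, t2).subs({t1: 0, t2: 0})
print(sp.simplify(cross))   # rho*sigma1*sigma2 + mu1*mu2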
Like in the univariate case, both the moment generating and the char-
acteristic functions uniquely characterize a distribution, but only the more
“complex” characteristic function is guaranteed to always exist. In addition,
in the multivariate context it is possible to derive some results of extreme
importance about independent random variables.
Theorem 3.6. Moment generating and characteristic functions of
independent random variables. If the random variables from a random
vector x = (X1, . . . , XK) are mutually independent, the moment generating
function (if it exists) and the characteristic function of x equal respectively
the product of the K moment generating functions (if they exist) and the K
characteristic functions of the random variables involved.
Mx(t) = Π_{k=1}^{K} M_{Xk}(tk)
φx(t) = Π_{k=1}^{K} φ_{Xk}(tk)
Proof. This is an application of Theorem 3.3 and Theorem 3.5 to a sequence
of K transformed random variables, exp(t1 X1), . . . , exp(tK XK), which are
themselves mutually independent. For moment generating functions:
" K
!#
X
Mx (t) = E exp tT x = E exp
tk Xk
k=1
" K
#
Y
=E exp (tk Xk )
k=1
K
Y
= E [exp (tk Xk )]
k=1
YK
= MXk (tk )
k=1
This powerful result often allows to easily obtain the distribution of
some linear combination of random variables x = (X1, . . . , XK), if their
underlying distribution is known and its moment generating or characteristic
function can be manipulated in such a way that it returns the moment
generating function of another known random variable. A list of important
cases follows; for all results, the proof is either provided or outlined. Below,
the notation Xi indicates one of N random variables (for i = 1, . . . , N) that
are all mutually independent and follow the indicated distribution.
Observation 3.5. If Xi ∼ Be(p), it is Σ_{i=1}^{N} Xi ∼ Bin(N, p).
Proof. If M_{Xi}(t) = p exp(t) + (1 − p), it suffices to multiply the N identical
moment generating functions: M_{Σ_{i=1}^{N} Xi}(t) = [p exp(t) + (1 − p)]^N.
The next five results are easily demonstrated through the same approach as
in the previous observation: that is, by multiplying the moment generating
functions of the N specified primitive, independent random variables Xi .
Observation 3.6. If Xi ∼ NB(p, 1), it is Σ_{i=1}^{N} Xi ∼ NB(p, N).
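As a quick illustrative check of the Bernoulli-to-Binomial result (an addition to the notes, with arbitrary values of N and p), one can compare the empirical frequencies of a sum of Bernoulli draws with the Binomial mass function:

import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(2)
N, p, reps = 10, 0.3, 200_000

# Sums of N independent Bernoulli(p) draws, replicated many times.
sums = rng.binomial(1, p, size=(reps, N)).sum(axis=1)

# Empirical frequencies versus the Binomial(N, p) mass function.
values = np.arange(N + 1)
empirical = np.array([(sums == v).mean() for v in values])
print(np.round(empirical, 3))
print(np.round(binom.pmf(values, N, p), 3))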
Proof. Define the two random variables W1 = X1 /λ1 and W2 = −X2 /λ2 ,
which are obviously independent. By the properties of moment generating
functions for linear transformations, the two transformed random variables
have moment generating function:
M_{W1}(t) = (1 − t)⁻¹
M_{W2}(t) = (1 + t)⁻¹
and since Y = W1 + W2, the moment generating function of Y is:
M_Y(t) = M_{W1}(t) M_{W2}(t) = (1 − t²)⁻¹
that is, that of a standard Laplace distribution.
Observation 3.14. If X1 ∼ Gumbel (µ1 , σ) and X2 ∼ Gumbel (µ2 , σ), and
the two random variables X1 and X2 are independent, the random variable
Y obtained as Y = X1 − X2 is such that Y ∼ Logistic (µ1 − µ2 , σ).
Proof. The moment generating function of Xi – for i = 1, 2 – is given by
MXi (t) = exp (µi t) Γ (1 − σt). Similarly, the transformed random variables
Wi = −Xi – again for i = 1, 2 – have moment generating functions given
by MWi (t) = exp (−µi t) Γ (1 + σt). It is easy to see that X1 is independent
of W2 and vice versa. Since Y = X1 + W2 , the moment generating function
of Y is therefore obtained as:
MY (t) = MX1 (t) MW2 (t)
= exp (µ1 t) Γ (1 − σt) · exp (−µ2 t) Γ (1 + σt)
Γ (1 − σt) Γ (1 + σt)
= exp (µ1 t − µ2 t)
Γ (2)
= exp ((µ1 − µ2 ) t) · B (1 − σt, 1 + σt)
which is indeed the moment generating function of the logistic distribution
with specified parameters (note that Γ (2) = 1! = 1).
3.5. Conditional Distributions
In what follows, it is presumed for simplicity's sake that both x and y are
either composed of discrete random variables only or of continuous random
variables only, but the two types need not coincide across the two vectors (that
is, x might include only discrete random variables and y only continuous
ones, or vice versa). The definition of conditional mass or density function
is the point of departure of the discussion, as it allows to subsequently define
the cumulative conditional distribution.
Definition 3.15. Conditional mass or density function. Consider the
combined random vector (x, y) with joint mass/density function fx,y (x, y).
Suppose that the random vector x has a probability mass or density func-
tion f_x(x). The conditional mass or density function of y, given x = x, is
defined as follows for all x ∈ X:
f_{y|x}(y | x = x) = f_{x,y}(x, y) / f_x(x)
It is a conditional mass function if all the random variables in y are discrete,
and a conditional density function if they are all continuous.
Definition 3.16. Conditional cumulative distribution. Consider the
combined random vector (x, y) with joint mass/density function fx,y (x, y).
The conditional cumulative distribution of y, given x = x is defined as:
F_{y|x}(y | x = x) = Σ_{t∈Y: t≤y} f_{y|x}(t | x = x)
where the parameters are dropped in the expression on the left hand side
for the sake of brevity. One can observe that the resulting density is that
of another univariate normal distribution with different parameters, which
can be expressed in compact form as follows for any X2 = x2 .
X1 | X2 = x2 ∼ N( µ1 + ρ (σ1/σ2) (x2 − µ2), σ1² (1 − ρ²) )
Clearly, the expression for the distribution of X2 conditional on X1 = x1,
for any x1 ∈ R, is symmetrical.
If y is an all-continuous random vector and x is an all-discrete one, the
definition of conditional density function may not be directly applicable –
short of resorting to a more general mathematical definition of joint density
that allows for discrete mass points. However, the concept is still valid as
much as it is useful, and it is best illustrated with an example.
Example 3.9. Conditional height distribution. Remember Example
3.3 about the height distribution with mixed genders. If one aims to describe
the density function of height for females only, the appropriate concept is
that of a conditional distribution:
f_{H|G=1}(h | g = 1) = (1/σ_F) φ( (h − µ_F)/σ_F )
and symmetrically for males.
[Figure: selected conditional densities f_{Yi|Xi}(yi | xi) together with the conditional expectation function E[Yi | Xi].]
Note: the conditional distribution Yi | Xi is normal, but with parameters that vary as a function of Xi .
Selected density functions of Yi | Xi are displayed for xi = {0, 1, 2, 3}.
The researcher, however, seems more intent on analyzing how the vari-
ation of Y differs across groups with respect to the variation of Y in the
population as a whole. The interest of the researcher can be, for example,
to gauge to what extent the resentment against inequality interacts with is-
sues about ethnicity. The concern of the researcher is only heightened after
having visualized a plot of the four conditional distributions, which is
reported in Figure 3.6 below. The figure shows how not only the means of Y
markedly differ across the four groups, but also the variation of log-income
is quite heterogeneous. Problems like this are the domain of the so-called
analysis of variance, a set of statistical techniques for assessing differ-
ences in certain characteristics between groups of a population.
[Figure 3.6: the four conditional distributions of Yi | Xi.]
Note: the figure displays the four conditionally normal distributions of Yi | Xi for every Xi = 1, 2, 3, 4.
The four straight lines delimited by circles and placed beneath each density function denote the range
of all values of Yi | Xi within two standard deviations below or above each group’s conditional mean.
To the rescue of the researcher comes the Law of Total Variance, which
in this bivariate case reads as follows.
The first component on the right hand side, VarX [E [Y | X]] is the so-called
between group variation and is interpreted as the “variance of the con-
ditional means” – that is, how much do the four groups differ on average (a
more direct measure of cross-group inequality). Here, this is:
Var_X[ E[Y | X] ] = (1/4) Σ_{x=1}^{4} ( E[Y | X = x] − E_X[E[Y | X]] )² = 2.125
which has about the same magnitude as the within group variation. Therefore,
the researcher concludes that the overall inequality has a very strong group
component, which is likely to bear social and political consequences.
3.6. Two Multivariate Distributions

Multinomial Distribution
The multinomial distribution describes a variation of the binomial experi-
ment with many, mutually exclusive, alternatives. Specifically, suppose one
is making n draws, and each of these can end up with the realization of one
and only one of K ≥ 2 possible events. All these events have
probability pk ∈ [0, 1] of happening (for k = 1, . . . , K), with Σ_{k=1}^{K} pk = 1.
After the n draws, the result is a list of success counts for each alternative,
The probability mass function of this particular random vector is given by:
f_x(x1, . . . , xK; n, p1, . . . , pK) = ( n! / Π_{k=1}^{K} xk! ) · Π_{k=1}^{K} pk^{xk}    (3.12)
where the multinomial coefficient n! · ( Π_{k=1}^{K} xk! )⁻¹ counts, in what appears
to be an extension of the binomial coefficient, the number of realizations
containing exactly (x1, . . . , xK) successes for each alternative out of n draws.
The cumulative distribution clearly sums the mass function over points in
the support as follows, for t = (t1 , . . . , tK ).
F_x(x1, . . . , xK; n, p1, . . . , pK) = Σ_{t∈X: t≤x} ( n! / Π_{k=1}^{K} tk! ) · Π_{k=1}^{K} pk^{tk}    (3.13)
This distribution draws its name from the multinomial theorem, which is
useful to analyze it. It helps show that the total probability mass equals 1:
P(x ∈ X) = Σ_{x∈X} f_x(x1, . . . , xK; n, p1, . . . , pK)
         = Σ_{x∈X} ( n! / Π_{k=1}^{K} xk! ) · Π_{k=1}^{K} pk^{xk}
         = ( Σ_{k=1}^{K} pk )^n
         = 1
as well as the calculation of the moment generating function:
Mx(t1, . . . , tK) = Σ_{x∈X} exp( Σ_{k=1}^{K} tk xk ) · f_x(x1, . . . , xK; n, p1, . . . , pK)
                  = Σ_{x∈X} ( n! / Π_{k=1}^{K} xk! ) · Π_{k=1}^{K} ( pk · exp(tk) )^{xk}    (3.14)
                  = ( Σ_{k=1}^{K} pk · exp(tk) )^n
hence one can write the mean vector and the variance-covariance matrix for
this distribution in more compact and elegant form as follows.
E[x] = n p    (3.18)
Var[x] = n ( diag(p) − p pᵀ )    (3.19)
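A small simulation sketch (an illustrative addition; the values of n and p are arbitrary) can be used to verify (3.18) and (3.19) numerically:

import numpy as np

rng = np.random.default_rng(3)
n = 20
p = np.array([0.2, 0.5, 0.3])

# Many multinomial draws: each row is a vector of K = 3 success counts.
draws = rng.multinomial(n, p, size=500_000)

print(draws.mean(axis=0))                 # close to n * p = [4, 10, 6]
print(np.cov(draws, rowvar=False))        # close to n * (diag(p) - p p^T)
print(n * (np.diag(p) - np.outer(p, p)))  # theoretical variance-covariance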
The expression
x ∼ N (µ, Σ)
typically indicates that the random vector x follows the multivariate normal
distribution with parameters µ and Σ. A particular case of the multivariate
normal distribution is the standardized one, with µ = 0 and Σ = I. If a
random vector z follows the standard multivariate normal distribution, this
is written as follows.
z ∼ N (0, I)
Note that since Σ is symmetric and positive semi-definite, a Cholesky de-
composition can always be applied to it so as to find some matrix Σ^{1/2} such that
Σ^{−1/2} Σ Σ^{−1/2} = I. Therefore, a random vector that follows a generic normal
distribution with parameters µ and Σ is always related to a random vector
z that follows the standard multivariate normal via the transformations:
z = Σ^{−1/2} (x − µ)
x = Σ^{1/2} z + µ
which is analogous to the univariate case; also observe that the two trans-
formations together comply with Theorem 3.1 about the transformation of
continuous random vectors.
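The transformation x = Σ^{1/2} z + µ is also how multivariate normal draws are typically generated in practice. The sketch below (an illustrative addition with arbitrary µ and Σ) uses the Cholesky factor as one possible choice of Σ^{1/2}:

import numpy as np

rng = np.random.default_rng(4)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

# A matrix square root of Sigma via the Cholesky factorization
# (lower triangular L with L L^T = Sigma).
L = np.linalg.cholesky(Sigma)

# x = Sigma^(1/2) z + mu maps standard multivariate normal draws z into N(mu, Sigma).
z = rng.standard_normal(size=(300_000, 2))
x = z @ L.T + mu

print(x.mean(axis=0))             # close to mu
print(np.cov(x, rowvar=False))    # close to Sigma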
As usual, the cumulative distribution of the normal distribution lacks a
closed form, hence it must be expressed as a multiple integral.
F_x(x1, . . . , xK; µ, Σ) =
    = ∫_{−∞}^{x1} · · · ∫_{−∞}^{xK} [ (2π)^K |Σ| ]^{−1/2} exp( −(1/2) (t − µ)ᵀ Σ⁻¹ (t − µ) ) dt    (3.21)
Like in the univariate case, it is not immediate to show that the joint density
function integrates to one; the demonstration is a tedious extension of the
one from Lecture 2 for K = 1. However, obtaining the moment generating
function is again a relatively simpler task if one starts from the standardized
case, z ∼ N (0, I):
Mz(t) = ∫_{R^K} exp(tᵀz) · (2π)^{−K/2} exp( −(1/2) zᵀz ) dz
      = exp( tᵀt/2 ) ∫_{R^K} (2π)^{−K/2} exp( −(1/2) (z − t)ᵀ(z − t) ) dz
      = exp( tᵀt/2 )
where the integral in the second line equates that of another multivariate
normal distribution with µ = t and Σ = I, hence it integrates to one. To
obtain the general expression of the moment generating function, note that:
Mx(t) = E[exp(tᵀx)]
      = E[ exp( tᵀ( Σ^{1/2} z + µ ) ) ]
      = exp(tᵀµ) · E[ exp( tᵀ Σ^{1/2} z ) ]
      = exp( tᵀµ + tᵀΣt/2 )
since the expectation in the third line corresponds to the definition of mo-
ment generating function for the standardized normal distribution but with
a rescaled argument: Σ^{1/2} t instead of t (recall that Σ^{1/2} is symmetric).
By analyzing the above moment generating function, one can conclude
the following about the moments of a multivariate normal distribution.
E [x] = µ (3.22)
Var [x] = Σ (3.23)
Note that this is a different parametrization from that of the particular bivari-
ate case (see e.g. Example 3.6); there, σ1 and σ2 correspond for convenience
to standard deviations (square roots of variances) instead of just variances.
Here instead, the variances are denoted as Var[Xk] = σkk for k = 1, . . . , K
and the covariances as Cov[Xk, Xℓ] = σkℓ for k, ℓ = 1, . . . , K and k ≠ ℓ.
An additional observation about the moment generating function is that for
any J-dimensional linear combination of x – write it y = a + Bx where a is
a vector of length J and B is a J × K matrix – one can obtain the moment
generating function for y: My (t) (where now t has length J) following the
same procedure as above:
My(t) = E[exp(tᵀy)]
      = E[ exp( tᵀ(a + Bx) ) ]
      = exp(tᵀa) · E[ exp( tᵀBx ) ]
      = exp( tᵀ(Bµ + a) + tᵀBΣBᵀt/2 )
which is the moment generating function of another multivariate normal dis-
tribution, that is, y ∼ N(Bµ + a, BΣBᵀ). In plain words, any collection
of linear combinations of some possibly dependent normally distributed random
variables itself follows a multivariate normal distribution. This result is frequently
applied to derive the distribution of just one single linear combination (J = 1).
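The following sketch (an illustrative addition; B, a, µ and Σ are arbitrary choices) verifies numerically that y = a + Bx has mean Bµ + a and variance BΣBᵀ:

import numpy as np

rng = np.random.default_rng(5)
mu = np.array([0.0, 1.0, 2.0])
Sigma = np.array([[1.0, 0.3, 0.0],
                  [0.3, 2.0, 0.5],
                  [0.0, 0.5, 1.5]])

# A J x K matrix B and a vector a define y = a + B x, with J = 2 and K = 3.
B = np.array([[1.0, -1.0, 0.0],
              [0.5,  0.5, 1.0]])
a = np.array([10.0, -5.0])

x = rng.multivariate_normal(mu, Sigma, size=400_000)
y = x @ B.T + a

print(y.mean(axis=0))            # close to B mu + a
print(B @ mu + a)
print(np.cov(y, rowvar=False))   # close to B Sigma B^T
print(B @ Sigma @ B.T)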
where:
Σ1 ≡ Σ11 − Σ12 Σ22⁻¹ Σ21        Σ2 ≡ Σ22 − Σ21 Σ11⁻¹ Σ12
and:
|Σ| = |Σ1| · |Σ22| = |Σ2| · |Σ11|
relating the determinant of Σ to those of the matrices expressing its parti-
tioned inverse. With all this in mind, (3.20) can be rewritten as:
f_x(x1, x2; µ1, µ2, Σ11, Σ12, Σ21, Σ22) = [ (2π)^K |Σ1| · |Σ22| ]^{−1/2} ×
    × exp( (1/2) (x1 − µ1)ᵀ Σ1⁻¹ Σ12 Σ22⁻¹ (x2 − µ2) − (1/2) (x1 − µ1)ᵀ Σ1⁻¹ (x1 − µ1) +
           + (1/2) (x2 − µ2)ᵀ Σ2⁻¹ Σ21 Σ11⁻¹ (x1 − µ1) − (1/2) (x2 − µ2)ᵀ Σ2⁻¹ (x2 − µ2) )
or alternatively, by inverting the ‘1’ and ‘2’ subscripts. The above expression
of the joint density and its symmetric version can be exploited to show, after
more calculations, that the “marginalized” distributions for x1 and x2 are:
x1 ∼ N (µ1 , Σ11 )
x2 ∼ N (µ2 , Σ22 )
while the conditional ones, for one subvector given the other, are:
x1 | x2 ∼ N( µ1 + Σ12 Σ22⁻¹ (x2 − µ2), Σ11 − Σ12 Σ22⁻¹ Σ21 )
x2 | x1 ∼ N( µ2 + Σ21 Σ11⁻¹ (x1 − µ1), Σ22 − Σ21 Σ11⁻¹ Σ12 )
generalizing the observations already made for the simpler bivariate case.
Lecture 4
This lecture introduces some core concepts associated with the practice of
statistical analysis: samples and statistics that are calculated from samples.
A short introduction to the general notion of statistical sample is followed by
a focused analysis of an important special case: that of random sample, with
special emphasis on random samples drawn from the normal distribution. In
developing these concepts, the first half of the lecture introduces properties
of common sample statistics, while the second half is devoted to the analysis
of sample statistics with specific properties: order and sufficient statistics.
This definition is already general enough that it allows for the observation
of multiple variables for every unit of analysis; in the simpler cases when
each of these is associated with the realization of just one random variable,
a sample is written as {xi}_{i=1}^{N}. Conversely, the definition can be extended to
4.1. Random Samples
The hypothesis motivating random samples is that all the realizations are
obtained from a given population characterized by some joint probability
distribution expressed by some random vector x. A sample complies with
this framework if, for example, each realization is obtained by extracting
every unit of observation from some specific population through a protocol
that assigns to all such units the same probability of being drawn into the
sample, a process known as sampling with replacement (this name derives
from sampling protocols applied to finite populations where a unit is allowed
to produce multiple realizations xi). Conversely, other protocols, such as
sampling without replacement, where every realization is drawn sequentially
from a population and is not allowed to be extracted again, do not comply
with the random sample framework. It is important to realize that not all
samples are random.
Intuitively, the more a sample departs from the i.i.d. benchmark the
more statistical inference is complicated. Yet in social sciences non-random
samples are common. For example, the data may be composed of observa-
tions obtained from recognizably different distributions (i.n.i.d.), or of
observations that exhibit statistical dependence due to some un-
derlying socio-economic phenomenon, like group behavior or the response
to economic events (n.i.i.d.). While statistical inference is still possible on
non-random samples, the asymptotic framework is better suited to these
settings. The present lecture only focuses on random samples. Yet it must
where θ is the collection of parameters that are associated with the proba-
bility distribution of x. This is an extremely useful fact aiding the analysis
of selected statistics of a random sample.
Definition 4.5. Statistic. A function of the N random variables, vectors
or matrices that are specific to each i-th unit of observation and that gen-
erate a sample is called a statistic. Any statistic is itself a random variable,
vector or matrix.
Definition 4.6. Sampling distribution. The probability distribution of
a statistic is called its sampling distribution.
The two most common and better known statistics are the following.
Definition 4.7. Sample mean. In samples derived from random vectors,
the sample mean is a vector-valued statistic which is usually denoted as x̄
and defined as follows.
x̄ = (1/N) Σ_{i=1}^{N} xi
This definition can be reduced to samples that are drawn from univariate
random variables, in which case the usual notation is X̄:
X̄ = (1/N) Σ_{i=1}^{N} Xi
or extended to samples drawn from random matrices, where one can write
X̄ and the definition is again analogous.
Definition 4.8. Sample variance-covariance. In samples collected from
random vectors, the sample variance-covariance is a matrix-valued statistic
which is usually denoted by S and defined as follows.
S = (1/(N − 1)) Σ_{i=1}^{N} (xi − x̄)(xi − x̄)ᵀ
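As an illustrative aside (not part of the original notes), the two statistics can be computed in Python as follows; the simulated sample is an arbitrary choice.

import numpy as np

rng = np.random.default_rng(6)

# A random sample of N = 1000 realizations of a K = 2 dimensional random vector.
x = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.4], [0.4, 1.0]], size=1000)

x_bar = x.mean(axis=0)                       # sample mean vector
S = np.cov(x, rowvar=False, ddof=1)          # sample variance-covariance (divides by N-1)

# Equivalent "by hand" computation following Definition 4.8.
d = x - x_bar
S_manual = d.T @ d / (x.shape[0] - 1)
print(np.allclose(S, S_manual))   # True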
b. (N − 1) S = Σ_{i=1}^{N} xi xiᵀ − N · x̄ x̄ᵀ.
where two terms in the second line are both equal to zero by definition of
sample mean; in the last line, the first term does not depend on a while the
second is minimized at a = x̄. To show b. simply note that:
Σ_{i=1}^{N} (xi − x̄)(xi − x̄)ᵀ = Σ_{i=1}^{N} xi xiᵀ − Σ_{i=1}^{N} xi x̄ᵀ − Σ_{i=1}^{N} x̄ xiᵀ + N · x̄ x̄ᵀ
and the result again follows from the definition of a sample mean.
where the first equality follows from the linear properties of the expectation
operator and the second equality follows from the fact that the distributions
of yi for i = 1, . . . , N are identical (this particular result does not require
independence and is also valid for n.i.i.d. samples). Regarding b. it is:
" N # " N #! N " N #!T
X XN X X X
Var yi = E yi − E yi yi − E yi
i=1 i=1 i=1 i=1 i=1
! !T
N
X N
X
= E (yi − E [yi ]) (yi − E [yi ])
i=1 i=1
" N
#
X
=E (yi − E [yi ]) (yi − E [yi ])T
i=1
N
X h i
= E (yi − E [yi ]) (yi − E [yi ])T
i=1
= N · Var [yi ]
where the first line is just the definition of variance for Σ_{i=1}^{N} yi, the sec-
ond line applies the linear properties of expectations while also rearranging
terms, the third line rearranges terms again after observing that, for i ≠ j:
E[ (yi − E[yi]) (yj − E[yj])ᵀ ] = 0
which follows from the independence of the realizations in the random sam-
ple, the fourth line is another application of the linear properties of expec-
tations, while the fifth line again exploits the fact that all the realizations
follow from identically distributed random variables.
where the third line follows after adding and subtracting N · E[x] E[x]ᵀ.
The theorem examined last is the culmination of the results analyzed
previously, and specifies how to obtain quantities that can be used to evalu-
ate – more precisely, estimate – the moments of the underlying distribution.
By selecting the sample mean and the sample variance-covariance for the
sake of estimating the corresponding moments, one can rely on the property
that the expectation of those quantities is indeed the moment sought after,
in both cases. Later, this property is defined as unbiasedness.
4.2. Normal Sampling
The next result is central in statistical inference and allows to derive the
sampling distribution of the t-statistic when the sample is drawn from the
normal distribution.
Theorem 4.4. Sampling from the Normal Distribution. Consider a
random sample {xi}_{i=1}^{N} which is drawn from a random variable following the
normal distribution X ∼ N(µ, σ²), and the random variables corresponding
to the two sample statistics X̄ = (1/N) Σ_{i=1}^{N} Xi and S² = (1/(N − 1)) Σ_{i=1}^{N} (Xi − X̄)².
The following three properties are true:
a. X̄ and S² are independent;
b. X̄ ∼ N(µ, σ²/N);
c. (N − 1) S²/σ² ∼ χ²_{N−1}.
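Before turning to the proof, a simulation-based sanity check of the three properties can be helpful; the sketch below is an illustrative addition with arbitrary values of N, µ and σ.

import numpy as np

rng = np.random.default_rng(7)
N, mu, sigma, reps = 8, 2.0, 1.5, 200_000

samples = rng.normal(mu, sigma, size=(reps, N))
x_bar = samples.mean(axis=1)
s2 = samples.var(axis=1, ddof=1)

# b.: the sample mean has variance sigma^2 / N.
print(x_bar.var(), sigma**2 / N)

# c.: (N-1) S^2 / sigma^2 should have mean N-1 and variance 2(N-1), like a chi-squared.
q = (N - 1) * s2 / sigma**2
print(q.mean(), q.var())              # close to 7 and 14

# a.: independence implies zero correlation between the two statistics.
print(np.corrcoef(x_bar, s2)[0, 1])   # close to 0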
Proof. Point a. is the most crucial to show. To this end, it is useful to start
from the observation that the sample variance can be expressed in terms of
only N − 1 of the original random variables, say X2 , . . . , XN :
S² = (1/(N − 1)) Σ_{i=1}^{N} (Xi − X̄)²
   = (1/(N − 1)) [ (X1 − X̄)² + Σ_{i=2}^{N} (Xi − X̄)² ]
   = (1/(N − 1)) [ ( Σ_{i=2}^{N} (Xi − X̄) )² + Σ_{i=2}^{N} (Xi − X̄)² ]
so that S² is a function of X2 − X̄, . . . , XN − X̄ alone. Hence, proving
that the sample mean is independent of the sample variance requires showing
that X̄ is independent of N − 1 out of the N demeaned normally distributed
random variables, say X2 − X̄, . . . , XN − X̄. To do so, a convenient approach
is to define the following random vector z̃ of length N, which is a function
of the standardized random variables Zi = (Xi − µ)/σ for i = 1, . . . , N.
z̃ = [ Z̄  ]   [ Z̄       ]   [  N⁻¹       N⁻¹     · · ·    N⁻¹    ]
    [ Z̃2 ] = [ Z2 − Z̄  ] = [ −N⁻¹     1 − N⁻¹   · · ·   −N⁻¹    ] z
    [  ⋮  ]   [   ⋮      ]   [   ⋮         ⋮       ⋱       ⋮     ]
    [ Z̃N ]   [ ZN − Z̄  ]   [ −N⁻¹      −N⁻¹     · · ·  1 − N⁻¹  ]
Here, z = (Z1, . . . , ZN). One can prove that this linear transformation has
Jacobian determinant equaling 1/N , and therefore it is invertible; it follows
and it can be clearly decomposed into the product of two components: the
density function of Z̄ and that of all the other elements of z̃, implying that
X̄ is independent of X2 − X̄, . . . , XN − X̄, and consequently of S 2 .
To continue the proof about the other points in the statement, note that
point b. is, as said, a consequence of Theorem 3.7. In order to demonstrate
point c. instead, it is easiest to proceed as follows:
(N − 1) S²/σ² = Σ_{i=1}^{N} (Xi − X̄)²/σ²
             = Σ_{i=1}^{N} (Xi − µ + µ − X̄)²/σ²
             = Σ_{i=1}^{N} (Xi − µ)²/σ² + N(X̄ − µ)²/σ² − 2 (X̄ − µ) Σ_{i=1}^{N} (Xi − µ)/σ²
             = Σ_{i=1}^{N} ( (Xi − µ)/σ )² − ( √N (X̄ − µ)/σ )²
that is, the statistic (N − 1) S 2 /σ2 is shown to be the sum of the squares
of N independent random variables all of which follow the standard normal
distribution (the standardized versions of X1, . . . , XN) minus the square of
another random variable that follows the standard normal distribution (the
standardized version of the sample mean X̄). By the demonstration of point
a. the latter is independent of the former. Consequently, the distribution of
Yj ∼ N(µ_Y, σ_Y²) for j = 1, . . . , N_Y
where x̄ is the sample mean, whose expectation and variance are µ = E[xi]
and Σ/N = Var[xi]/N respectively, and where σ∗kℓ denotes the kℓ-th element of Σ⁻¹.
¹ Note that this nomenclature is not standard, but is applied throughout the lectures.
4.3. Order Statistics
thus x(1) = min{xi}_{i=1}^{N} and x(N) = max{xi}_{i=1}^{N}. The j-th order statistic is
the random variable, denoted as X(j), that generates the j-th realization in
the above sequence, that is x(j). Any univariate sample has N associated
order statistics that must satisfy the following property.
X(1) ≤ X(2) ≤ · · · ≤ X(N)
² To prove this result one must develop the distribution of the random matrix S: the so-called Wishart distribution, which is outside the scope of this analysis.
that is, for the j-th order statistic to be less or equal than some x, all the
inferior order statistic must also be less or equal than x (while the superior
ones can be larger, equal or lower than x). In a random sample, where all
the realizations obtain from independent and identically distributed random
variables, the above expression is considerably easier to evaluate.
Proof. For at least j realizations to be less or equal than x, the event defined
as Xi ≤ x must occur an integer number of j ≤ k ≤ N times, whereas the
complementary event Xi > x must instead occur N −k times. If the sample
is random (i.i.d.), these two fundamental events occur with probabilities
that are constant across all realizations:
P (Xi ≤ x) = FX (x)
P (Xi > x) = 1 − FX (x)
and since realizations are independent, any joint combination of said events
can be expressed as the appropriate product of those probabilities. Clearly,
for a given k any joint events can be expressed through a binomial distribu-
tion, with the binomial coefficient counting all potential combinations with
k “successes” (Xi ≤ x) and N − k “failures” (Xi > x). Summing over the el-
igible values of k delivers the result sought after, of which the distributions
for the minimum and the maximum are special cases.
Corollary. If X is a continuous distribution with density function fX (x),
the density function of the j-th order statistic is the following.
f_{X(j)}(x) = [ N! / ((j − 1)! (N − j)!) ] · f_X(x) [F_X(x)]^{j−1} [1 − F_X(x)]^{N−j}
Proof. This follows from manipulating the derivative of the cumulative dis-
tribution FX(j) (x). By the chain rule one obtains the density function:
f_{X(j)}(x) = dF_{X(j)}(x)/dx
    = Σ_{k=j}^{N} [ N!/(k!(N − k)!) ] { k [F_X(x)]^{k−1} [1 − F_X(x)]^{N−k} f_X(x) −
        − (N − k) [F_X(x)]^{k} [1 − F_X(x)]^{N−k−1} f_X(x) }
    = [ N!/((j − 1)!(N − j)!) ] f_X(x) [F_X(x)]^{j−1} [1 − F_X(x)]^{N−j} +
        + Σ_{k=j+1}^{N} [ N!/(k!(N − k)!) ] k [F_X(x)]^{k−1} [1 − F_X(x)]^{N−k} f_X(x) −
        − Σ_{k=j}^{N} [ N!/(k!(N − k)!) ] (N − k) [F_X(x)]^{k} [1 − F_X(x)]^{N−k−1} f_X(x)
where the third line obtains by isolating the term for k = j in the summation
that results from taking the derivative. All that is left to do is to show that
the two “residual” summations in the third line cancel out. To this end, some
additional manipulation is necessary. In particular, a re-indexing of the first
of the two concerned residual summations, as well as the observation that
the term for k = N in the second residual summation equals zero, gives:
f_{X(j)}(x) = [ N!/((j − 1)!(N − j)!) ] f_X(x) [F_X(x)]^{j−1} [1 − F_X(x)]^{N−j} +
    + Σ_{k=j}^{N−1} [ N!/((k + 1)!(N − k − 1)!) ] (k + 1) [F_X(x)]^{k} [1 − F_X(x)]^{N−k−1} f_X(x) −
    − Σ_{k=j}^{N−1} [ N!/(k!(N − k)!) ] (N − k) [F_X(x)]^{k} [1 − F_X(x)]^{N−k−1} f_X(x)
where the second line follows from the properties of the Gamma function;
the result is the density function of the postulated Beta distribution.
This result, in conjunction with Theorem 1.13, allows to derive the sampling
distribution of the order statistic of a sample of percentiles p drawn from
some known distribution, as p ∼ U (0, 1).
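For instance, the j-th order statistic of a random sample from U(0, 1) follows a Beta(j, N − j + 1) distribution; the sketch below (an illustrative addition with arbitrary N and j) checks this by simulation.

import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(8)
N, j, reps = 10, 3, 200_000

# j-th order statistic of N i.i.d. U(0,1) draws (np.sort puts the j-th smallest at index j-1).
u = np.sort(rng.uniform(size=(reps, N)), axis=1)
x_j = u[:, j - 1]

# Its distribution should be Beta(j, N - j + 1): compare a few moments.
print(x_j.mean(), beta.mean(j, N - j + 1))   # both close to j / (N + 1)
print(x_j.var(), beta.var(j, N - j + 1))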
Other relevant results are, instead, restricted to the two extreme order
statistics: the minimum and the maximum. In particular, certain distribu-
tions have the useful feature of returning – in random samples – minima and
maxima that follow a distribution in the same sub-family.
Definition 4.18. Extreme order statistics (min-max) stability. Con-
sider a random sample drawn from some known distribution. If the sample
minimum (maximum) follows another distribution of the same family, that
distribution is said to be min-stable (max-stable).
For example, the exponential distribution is notoriously min-stable.
Observation 4.2. Consider a random sample drawn from the exponential
distribution with parameter λ, X ∼ Exp (λ). The first order statistic – the
minimum – is such that X(1) ∼ Exp (N −1 λ).
Proof. By applying the formula for the distribution of the minimum:
F_{X(1)}(x; λ, N) = 1 − [1 − F_X(x; λ)]^N
                 = 1 − [ exp( −x/λ ) ]^N
                 = 1 − exp( −N x/λ )
the postulated (cumulative) distribution is obtained straightforwardly.
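A quick numerical illustration (an addition to the notes; λ, N and the simulation size are arbitrary) of the min-stability of the exponential distribution:

import numpy as np

rng = np.random.default_rng(9)
lam, N, reps = 2.0, 5, 300_000   # lam is the mean of the exponential, as in the text

# Minimum of N i.i.d. Exp(lam) draws (numpy's scale parameter is the mean).
x = rng.exponential(scale=lam, size=(reps, N))
x_min = x.min(axis=1)

# The minimum should again be exponential, with mean lam / N.
print(x_min.mean(), lam / N)                   # both close to 0.4
print(np.exp(-1), (x_min > lam / N).mean())    # P(X_(1) > mean) = e^{-1} for an exponential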
All three types of distributions in the GEV family, instead, are max-stable:
this is another motivation for their collective name.
Observation 4.3. Consider a random sample drawn from the Type I GEV
(Gumbel) distribution with parameters µ and σ, X ∼ EV1(µ, σ). The top
order statistic – the maximum – is such that X(N) ∼ EV1(µ + σ log(N), σ).
Proof. By applying the formula for the distribution of the maximum:
F_{X(N)}(x; µ, σ, N) = [F_X(x; µ, σ)]^N
                    = [ exp( −exp( −(x − µ)/σ ) ) ]^N
                    = exp( −N exp( −(x − µ)/σ ) )
                    = exp( −exp( −(x − µ − σ log(N))/σ ) )
one obtains the Gumbel cumulative distribution that was argued.
Observation 4.4. Consider a random sample drawn from the Type II GEV
(Fréchet) distribution with parameters α, µ, and σ, Y ∼ EV2(α, µ, σ). The
top order statistic – the maximum – is such that Y(N) ∼ EV2(α, µ, σN^{1/α}).
The result is identical in random sampling from the Type III GEV (reverse
Weibull) case: with Y ∼ EV3(α, µ, σ), it is Y(N) ∼ EV3(α, µ, σN^{1/α}).
Proof. Here, applying the formula for the distribution of the maximum:
F_{Y(N)}(y; α, µ, σ, N) = [F_Y(y; α, µ, σ)]^N
                       = [ exp( −((y − µ)/σ)^{−α} ) ]^N
                       = exp( −( N^{−1/α} (y − µ)/σ )^{−α} )
is equally valid to show both the Fréchet and the reverse Weibull results.
A last observation about the traditional Weibull distribution – which should
be intuitive in light of the relationships between the traditional Weibull, the
exponential, and the GEV distributions – is presented next.
Observation 4.5. Consider a random sample drawn from the traditional
Weibull distribution with parameters α, µ, and σ, W ∼ Weibull(α, µ, σ).
The sample minimum is such that W(1) ∼ Weibull(α, µ, σN^{1/α}).
Proof. Things proceed similarly to the Fréchet and reverse Weibull cases.
F_{W(1)}(w; α, µ, σ, N) = 1 − [1 − F_W(w; α, µ, σ)]^N
                        = 1 − [ exp( −((w − µ)/σ)^{−α} ) ]^N
                        = 1 − exp( −( N^{−1/α} (w − µ)/σ )^{−α} )
The difference is that here, the formula for the minimum is applied.
It is difficult to identify other situations where the exact distribution of
an order statistic of interest can be computed and related to some known
common distribution. In an asymptotic environment things are different, as
the so-called Extreme Value Theorem (see Lecture 6) allows to circumscribe
the set of sampling distribution of order statistics to the three different types
of the GEV family – so long as the sample size is large enough. Along with
the above “stability” results, the Theorem in question motivates the use of
the GEV distributions for modeling extreme order statistics.
4.4. Sufficient Statistics
Examples are useful to build intuition; it is best to start from simpler ones.
Example 4.1. A sufficient statistic for the Bernoulli parameter p.
Suppose that a random (i.i.d.) sample is obtained from a random variable
X following the Bernoulli distribution with parameter p, or X ∼ Be (p). It
turns out that the statistic counting the number of “successes,” defined as:
T = T(X1, . . . , XN) = Σ_{i=1}^{N} Xi
is sufficient for p. Since the sample is random, T ∼ Bin(N, p); denote by
q_T(t; p, N) its mass function and by t = Σ_{i=1}^{N} xi its realization in the sample.
Thus:
f_{X1,...,XN}(x1, . . . , xN; p) / q_T(t; p, N) = [ Π_{i=1}^{N} p^{xi} (1 − p)^{1−xi} ] / [ ( N!/(t!(N − t)!) ) p^t (1 − p)^{N−t} ]
    = [ p^t (1 − p)^{N−t} ] / [ ( N!/(t!(N − t)!) ) p^t (1 − p)^{N−t} ]
    = t!(N − t)!/N!
where the first equality follows since the sample is random and the second
from the definition of t. Thus it is proved that the distribution of the sample,
conditional on the sufficient statistic, does not depend on p – as postulated.
The intuition for this is that upon knowing the number of “successes” t over
N attempts, there is no other information in the sample that helps “learn”
(perform inference) about the parameter p.
Example 4.2. A sufficient statistic for µ in the normal distribution.
Suppose that a random (i.i.d.) sample is obtained from a random variable
X following the normal distribution with location parameter µ and scale
parameter σ2 , or X ∼ N (µ, σ2 ). The sample mean X̄ is a sufficient statistic
for the mean parameter µ. The demonstration proceeds as in the previous
case, recalling now that X̄ ∼ N (µ, σ2 /N ). Similarly as above, it is useful
to define the actual realization of the sample mean in the data.
x̄ = (1/N) Σ_{i=1}^{N} xi
for θ! Showing this result is quite simple: calling x(N) = max{x1, . . . , xN}
the realization of X(N), and writing the latter's density function as
q_{X(N)}(x(N); θ) = d/dx(N) [ (x(N)/θ)^N ] · 1[x(N) ∈ (0, θ)]
               = ( N x(N)^{N−1} / θ^N ) · 1[x(N) ∈ (0, θ)]
it is straightforward to see that:
f_{X1,...,XN}(x1, . . . , xN; θ) / q_{X(N)}(x(N); θ) = 1 / ( N x(N)^{N−1} ) · 1[x(N) ∈ (0, θ)]
as f_{X1,...,XN}(x1, . . . , xN; θ) = [f_X(x; θ)]^N = θ^{−N} because the sample is ran-
dom. Again, the result is intuitive: since the support of the uniform distri-
bution is bounded above, the highest value found in the sample is the most
informative about the limit of the support.
Sometimes, it is difficult to verify that a statistic is effectively sufficient
for a certain parameter of interest. Fortunately, the following theorem often
helps simplify the analysis.
Theorem 4.7. Fisher-Neyman’s Factorization Theorem. Consider a
sample generated by a list of random vectors (x1 , . . . , xN ), whose joint dis-
tribution has mass or density function fx1 ,...,xN (x1 , . . . , xN ; θ) that also de-
pends on some parameter θ. A statistic T = T (x1 , . . . , xN ) is sufficient for
θ if and only if it is possible to identify two functions g (T (x1 , . . . , xN ) ; θ)
and h(x1, . . . , xN) such that the following holds.
f_{x1,...,xN}(x1, . . . , xN; θ) = g(T(x1, . . . , xN); θ) · h(x1, . . . , xN)
where the second line follows from the definition of conditional probability
while the third just renames the previous probability function, noting that
the conditional probability of the sample given T can be expressed as some
generic function h (x1 , . . . , xN ) that does not depend on θ by definition of
sufficient statistic.
(Sketched.) In the continuous case the logic of the proof is analogous;
however, the proper demonstration requires the use of advanced measure
theory. Thus, the analysis is only outlined for a restricted case that can be
related to the discrete case above, and is still general enough to allow many
concrete situations. Suppose that there is a list of bijective and differentiable
transformations that does not depend on θ denoted as:
y1 = g1(x1, . . . , xN)
y2 = g2(x1, . . . , xN)
⋮
yN = gN(x1, . . . , xN)
where any element of this list, suppose the first element Y11 of y1 , is fixed as
Y11 = T (x1 , . . . , xN ) by construction. In addition, write the corresponding
inverse transformation as follows.
w1 = g1⁻¹(y1, . . . , yN)
w2 = g2⁻¹(y1, . . . , yN)
⋮
wN = gN⁻¹(y1, . . . , yN)
To show necessity, write the joint density of the transformation as:
f_{y1,...,yN}(y1, . . . , yN; θ) = f_{x1,...,xN}(w1, . . . , wN; θ) · |J*|
    = g(T(w1, . . . , wN); θ) · h(w1, . . . , wN) · |J*|
    = g(y11; θ) · h(w1, . . . , wN) · |J*|
where |J∗ | is shorthand notation for the absolute value of the Jacobian of
the inverse transformation, and the second line follows from hypothesis. It
is obvious that the marginal distribution of Y11 , that is the density function
qT (T (x1 , . . . , xN ) ; θ) of the statistic of interest T , inherits a factorization
analogous to the above and thus since y11 = T (x1 , . . . , xN ), it can be shown
that the ratio between the joint density of the sample and the density of T
does not depend on θ, hence T is sufficient. To prove that if T is sufficient
then a proper factorization can be expressed (the “sufficiency” part of the
Theorem), apply the definition of conditional density function to show that:
f_{y1,...,yN}(y1, . . . , yN; θ) = q_T(y11; θ) · f_{{y1,...,yN}\Y11}({y1, . . . , yN} \ y11 | Y11)
where the notation {·} \ Y11 denotes a list that excludes Y11 . Dividing both
sides of the above by |J∗ | returns the desired factorization for:
h(x1, . . . , xN) = f_{{y1,...,yN}\Y11}({y1, . . . , yN} \ y11 | Y11) / |J*|
and for g(T(x1, . . . , xN); θ) = q_T(T(x1, . . . , xN); θ).
The factorization theorem can be easily applied to cases like the previous
examples. However, it is particularly useful to show that multiple statistics
are simultaneously sufficient for a given number of associated parameters.
This is usually expressed through a vector of statistics t (x1 , . . . , xN ):
t(x1, . . . , xN) = ( T1(x1, . . . , xN), T2(x1, . . . , xN), . . . , TK(x1, . . . , xN) )ᵀ
which are said to be simultaneously sufficient for a vector of parameters θ:
θ = (θ1, θ2, . . . , θJ)ᵀ
where in general, it may as well be that K ≠ J. The factorization theorem
can be extended to allow for g (t (x1 , . . . , xN ) ; θ) to be the joint density of
all the statistics in question and for a multidimensional parameter vector.
Example 4.4. Two sufficient statistics for µ and σ2 in the normal
distribution. Let us revisit Example 4.2. There, the factorization theorem
can be expressed for:
g(x̄; µ) = exp( −N(x̄ − µ)² / (2σ²) )
and:
h(x1, . . . , xN) = ( 1/(2πσ²) )^{N/2} exp( −Σ_{i=1}^{N} (xi − x̄)² / (2σ²) )
and the product of these functions returns the joint density of the sample
X1 , . . . , XN . Observe that, however, both expressions still incorporate the
parameter σ². To obtain a sufficient statistic for it, it is intuitive to think
of the sample variance S², whose realization is usually written as follows.
s² = (1/(N − 1)) Σ_{i=1}^{N} (xi − x̄)²
where K is, as usual, the dimension of the random vector x; the interme-
diate steps involve some tedious linear algebra. The intuition is that every
random variable listed in x, say Xk , follows a marginal distribution which
is normal with location parameter µk ; hence the sample mean X̄k – which is
listed in x̄ – exhausts all the information contained in the sample about that
particular parameter, and this holds simultaneously for all k = 1, . . . , K.
In analogy with the univariate case, these observations can be pushed
even further by claiming that the vector of sample means x̄ and the sample
variance-covariance S are simultaneously sufficient for all the parameters of
the multivariate normal distribution, (µ, Σ). The realization of the sample
variance-covariance, written as S, is as follows.
S = (1/(N − 1)) Σ_{i=1}^{N} (xi − x̄)(xi − x̄)ᵀ
By means of some algebraic manipulation, one can show that the function:³
g(x̄, S; µ, Σ) = |Σ|^{−N/2} exp( −(N/2) (x̄ − µ)ᵀ Σ⁻¹ (x̄ − µ) − ((N − 1)/2) tr( Σ⁻¹ S ) )
complies with the factorization theorem along with h(x1, . . . , xN) = (2π)^{−NK/2}.
To develop intuition, it is important to recall that the matrix Σ not only
features the K variances of each normally distributed random variable listed
in x, but also the K (K − 1) /2 covariances. The sample variance-covariance
S provides appropriate sufficient statistics for all these parameters.
³ Spelling out the joint density of the sample gives f_{x1,...,xN}(x1, . . . , xN; µ, Σ) =
(2π)^{−NK/2} |Σ|^{−N/2} exp( −(1/2) Σ_{i=1}^{N} (xi − µ)ᵀ Σ⁻¹ (xi − µ) ), where, by recognizing that
(xi − µ) = (xi − x̄ + x̄ − µ), the term inside the exponential develops as follows.
Σ_{i=1}^{N} (xi − µ)ᵀ Σ⁻¹ (xi − µ) = Σ_{i=1}^{N} (xi − x̄)ᵀ Σ⁻¹ (xi − x̄) + N (x̄ − µ)ᵀ Σ⁻¹ (x̄ − µ) +
    + Σ_{i=1}^{N} (x̄ − µ)ᵀ Σ⁻¹ (xi − x̄) + Σ_{i=1}^{N} (xi − x̄)ᵀ Σ⁻¹ (x̄ − µ)
The last two terms are zero since Σ_{i=1}^{N} (xi − x̄) = 0. The first term in the decomposition
instead develops, by the property of the trace operator, as:
Σ_{i=1}^{N} (xi − x̄)ᵀ Σ⁻¹ (xi − x̄) = tr( Σ_{i=1}^{N} (xi − x̄)ᵀ Σ⁻¹ (xi − x̄) )
    = tr( Σ⁻¹ Σ_{i=1}^{N} (xi − x̄)(xi − x̄)ᵀ )
    = (N − 1) tr( Σ⁻¹ S )
where the last line follows from the definition of S. Collecting terms allows to verify that
the factorization fx1 ,...,xN (x1 , . . . , xN ; µ, Σ) = g (x̄, S; µ, Σ) · h (x1 , . . . , xN ) holds with
the expressions given in the text.
order statistics, the minimum X(1) (with x(1) = min {x1 , . . . , xN }) and the
maximum X(N ) (with x(N ) = max {x1 , . . . , xN } as in Example 4.3) are in
fact simultaneously sufficient for (α, β). This is shown by observing that
the joint density function of the sample is:
f_{X1,...,XN}(x1, . . . , xN; α, β) = (β − α)^{−N} · 1[α ≤ x1, . . . , xN ≤ β]
and by setting:
g(x(1), x(N); α, β) = (β − α)^{−N} · 1[α ≤ x(1)] · 1[x(N) ≤ β]
and h (x1 , . . . , xN ) = 1. Hence, the factorization theorem applies trivially.
One more time, the result is intuitive: if both bounds of the uniform distri-
bution are unknown, there is no better information contained in the sample
than the two extreme order statistics.
The factorization theorem allows to quickly verify that certain statistics
are sufficient for the specified parameters of an important “macrofamily” of
probability distributions, which is defined next.
Definition 4.20. Exponential Family. A family of probability distribu-
tions characterized by a vector of parameters θ = (θ1 , . . . , θJ ) is said to
belong to the exponential (macro)-family if the associated mass or density
functions can be written, for J ≤ L, as:
f_X(x; θ) = h(x) c(θ) exp( Σ_{ℓ=1}^{L} wℓ(θ) tℓ(x) )
where h(x) and tℓ(x) are functions of the realizations x while c(θ) ≥ 0 and
wℓ(θ) are functions of the parameters θ, and in both cases, ℓ = 1, . . . , L.
The exponential macrofamily is extensive: it comprises many of the distri-
bution families analyzed in Lecture 2. In particular, the discrete Bernoulli,
geometric, Poisson families; as well as the continuous normal, lognormal,
Beta and Gamma families – including the special cases of the Gamma fam-
ily, like the chi-squared and exponential distribution – are all sub-families
of the exponential macro-family.⁴ All these claims can be verified by ma-
nipulating the density functions of interest. Other distributions are said
to belong to the exponential family so long as certain parameters are “fixed”
(i.e. not part of θ in the above definition): this is the case of the binomial
and negative binomial families for a constant number of trials n or r.
⁴ One must be careful not to mistake the exponential (macro-)family for the more restricted subfamily of exponential distributions!
are simultaneously sufficient for θ, where the functions tℓ(x) are as in the
previous definition of the exponential family for ℓ = 1, . . . , L.
Proof. The joint density of the sample can be expressed as:
f_{X1,...,XN}(x1, . . . , xN; θ) = ( Π_{i=1}^{N} h(xi) ) · [c(θ)]^N · exp( Σ_{ℓ=1}^{L} wℓ(θ) Σ_{i=1}^{N} tℓ(xi) )
the context of normal distributions; it is soon shown how they relate to the
more frequent sample mean X̄ and sample variance S².
X̄ = (1/N) T1        and        S² = (1/(N − 1)) ( T2 − T1²/N )
a transformation that does not depend on the parameters.
Example 4.11. Two sufficient statistics for the Gamma distribu-
tion, revisited. The two sufficient statistics for α and β in random samples
drawn from the Gamma distribution are typically listed as:
T1′ = Π_{i=1}^{N} Xi        and        T2′ = Σ_{i=1}^{N} Xi
Lecture 5
Statistical Inference
This lecture develops the core concepts of statistical inference: the theory
and practice of the statistical evaluation of data. After having introduced
the concept of point estimator and two chief methods for constructing esti-
mators – the Method of Moments and Maximum Likelihood Estimation –
this lecture discusses a framework and associated results for the evaluation
of the statistical properties of different estimators. Finally, this lecture con-
cludes with an outline of the theory and the practice of hypothesis testing
in statistical inference and the associated methods to construct confidence
intervals for estimators, the so-called interval estimation.
5.1. Principles of Estimation
The notation θ̂, with the typical “hat,” is used to denote a point estimator for some parameters θ of a distribution (which are possibly multivalued). This notation is used both for estimators intended as statistics, that is random variables or vectors endowed with a sampling distribution, and for the estimates calculated in the data. The ensuing discussion treats the parameters θ = (θ_1, . . . , θ_K) that are sought after as vector-valued with dimension K; the univariate (scalar) case can be considered as a particular one (but with examples aplenty). Note that, depending on the context, some values may or may not be admissible estimates for certain parameters. For example, the scale parameter σ of location-scale families, or the parameters α and β of the Gamma distribution, cannot be negative. It is important to accurately define the set of values that are allowed in the estimation.
Definition 5.2. Parameter space. The set of admissible values for the
parameters θ is called parameter space and is usually denoted as Θ ⊆ RK .
The first of the two methods for finding or constructing estimators that are introduced here is both the most intuitive and the oldest one (unsurprisingly). The Method of Moments is based on the following idea, formulated as a statistical principle.
then a point estimator for θ can be obtained as the solution to the so-called
sample analogue of the zero moment condition, that is the condition that
equates the sample mean of m (xi ; θ) to zero.
\[
\frac{1}{N}\sum_{i=1}^{N} m\left(x_i; \hat{\theta}_{MM}\right) = 0 \tag{5.2}
\]
Observe that this estimator differs from the sample variance S² by a factor (N − 1)/N, hence its expectation does not equal the actual variance of X_i in a random sample. The method can be extended to other distributions; in the logistic case for example, under the standard parametrization the variance is Var[X_i] = σ²π²/3, therefore σ̂_MM = S·√(3(N − 1)/N)/π. Next, consider the multivariate normal distribution; there:
\[
E\left[x_i - \mu\right] = E\left[m_\mu\left(x_i;\mu,\Sigma\right)\right] = 0
\]
\[
E\left[x_i x_i^T - E\left[x_i\right]E\left[x_i\right]^T - \Sigma\right] = E\left[m_\Sigma\left(x_i;\mu,\Sigma\right)\right] = 0 \tag{5.9}
\]
Here, X̄ and Ȳ are the sample means of Xi and Yi respectively, and the two
estimators are also called the least squares estimators of the model, for
reasons to be elaborated in Lecture 7. Note that to derive these estimators
independence is not necessary, since the two quantities are obtained directly
from (3.10) and (3.11). Furthermore, no specification of the joint density of
Yi and Xi was made, except that the two variables are related via a linear
conditional expectation function E [ Yi | Xi ] = β0 + β1 Xi .
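To fix ideas, the following is a minimal sketch of how the two sample-analogue (least squares) estimators just described would be computed; the arrays x and y, and the parameter values used to simulate them, are purely illustrative.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)                      # hypothetical regressor
y = 1.0 + 2.0 * x + rng.normal(size=200)      # hypothetical outcome

# Sample analogues: slope = sample covariance over sample variance,
# intercept = mean of y minus slope times mean of x.
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()
print(beta0_hat, beta1_hat)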
where the subscript “MLE” has an obvious meaning. Since the likelihood
function is always positive, in practical settings it is often useful to maximize
its logarithm instead, which is called the log-likelihood function. In other
words, (5.17) is equivalent to the following.
\[
\hat{\theta}_{MLE} = \arg\max_{\theta \in \Theta} \log L\left(\theta \,\middle|\, x_1,\dots,x_N\right) \tag{5.18}
\]
If the sample has certain properties, the calculation of the MLE is simplified
further. First, if the observations are independent the joint mass or density
of the sample reduces to the product of the mass or density functions of all
the observations, and so maximizing the log-likelihood function amounts to maximizing a summation:
\[
\hat{\theta}_{MLE} = \arg\max_{\theta \in \Theta} \log\left[\prod_{i=1}^{N} f_X\left(x_i;\theta\right)\right] = \arg\max_{\theta \in \Theta} \sum_{i=1}^{N} \log f_X\left(x_i;\theta\right) \tag{5.19}
\]
Two observations are in order. First, since Xi ∈ {0, 1} it is X̄ ∈ [0, 1], hence
the MLE is restricted to valid values in the parameter space. Second, the
Second Order Condition is as follows:
\[
\frac{d^2 \log L\left(p\,\middle|\,x_1,\dots,x_N\right)}{dp^2} = -\frac{\sum_{i=1}^{N} x_i}{p^2} - \frac{N - \sum_{i=1}^{N} x_i}{(1-p)^2} < 0
\]
verifying that indeed pbM LE is the maximizer of the likelihood function.
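A quick numerical check of this result can be run along the following lines (a sketch with simulated data; the true probability 0.3 and the sample size are arbitrary choices). The bounded maximizer of the log-likelihood should coincide with the sample mean.

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.binomial(1, 0.3, size=500)            # hypothetical Bernoulli sample

# Negative log-likelihood of an i.i.d. Bernoulli(p) sample.
def neg_loglik(p):
    return -(x.sum() * np.log(p) + (len(x) - x.sum()) * np.log(1 - p))

res = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, x.mean())                        # the two values should coincide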
Note that while the solution looks like the paired Method of Moments esti-
mators for µ and σ2 , this is not generally true. To verify that the likelihood
function is indeed maximized, it is necessary to analyze the determinant of
the Hessian matrix of the log-likelihood function evaluated at the solution.
The Hessian matrix in question is the following.
\[
H\left(\mu,\sigma^2\,\middle|\,x_1,\dots,x_N\right) =
\begin{pmatrix}
\dfrac{\partial^2 \log L\left(\mu,\sigma^2\middle|x_1,\dots,x_N\right)}{\partial\mu^2} & \dfrac{\partial^2 \log L\left(\mu,\sigma^2\middle|x_1,\dots,x_N\right)}{\partial\mu\,\partial\sigma^2} \\[2ex]
\dfrac{\partial^2 \log L\left(\mu,\sigma^2\middle|x_1,\dots,x_N\right)}{\partial\sigma^2\,\partial\mu} & \dfrac{\partial^2 \log L\left(\mu,\sigma^2\middle|x_1,\dots,x_N\right)}{\partial\left(\sigma^2\right)^2}
\end{pmatrix}
\]
Note that the two second-order partial derivatives outside the diagonal are symmetric and equal, and:
\[
\frac{\partial^2 \log L\left(\mu,\sigma^2\middle|x_1,\dots,x_N\right)}{\partial\mu^2} = -\frac{N}{\sigma^2}
\]
\[
\frac{\partial^2 \log L\left(\mu,\sigma^2\middle|x_1,\dots,x_N\right)}{\partial\mu\,\partial\sigma^2} = -\sum_{i=1}^{N} \frac{x_i-\mu}{\sigma^4}
\]
\[
\frac{\partial^2 \log L\left(\mu,\sigma^2\middle|x_1,\dots,x_N\right)}{\partial\left(\sigma^2\right)^2} = \frac{N}{2\sigma^4} - \sum_{i=1}^{N} \frac{\left(x_i-\mu\right)^2}{\sigma^6}
\]
However, when evaluated at the solution the cross-derivative equals zero, because Σ_{i=1}^N (x_i − µ̂_MLE) = 0, while the second derivative with respect to σ² simplifies too. In fact, by the second of the two First Order Conditions:
\[
\frac{\partial^2 \log L\left(\hat{\mu}_{MLE},\hat{\sigma}^2_{MLE}\middle|x_1,\dots,x_N\right)}{\partial\left(\sigma^2\right)^2} = \frac{N}{2\hat{\sigma}^4_{MLE}} - \sum_{i=1}^{N} \frac{\left(x_i-\hat{\mu}_{MLE}\right)^2}{\hat{\sigma}^6_{MLE}} = \frac{N}{2\hat{\sigma}^4_{MLE}} - \frac{N}{\hat{\sigma}^4_{MLE}} = -\frac{N}{2\hat{\sigma}^4_{MLE}}
\]
and it follows that the Hessian matrix, evaluated at the solution, is:
\[
H\left(\hat{\mu}_{MLE},\hat{\sigma}^2_{MLE}\,\middle|\,x_1,\dots,x_N\right) =
\begin{pmatrix}
-\dfrac{N}{\hat{\sigma}^2_{MLE}} & 0 \\[2ex]
0 & -\dfrac{N}{2\hat{\sigma}^4_{MLE}}
\end{pmatrix}
\]
and its determinant is obviously always positive. Since at least one second
order partial derivative (in particular, the second derivative for µ) is always
negative, the solution is indeed a maximum.
Example 5.7. Maximum Likelihood Estimation of the parameters
of the multivariate normal distribution. Move next to a multivariate
environment, and consider sampling from a random vector x ∼ N (µ, Σ).
The likelihood function is:
\[
L\left(\mu,\Sigma\,\middle|\,x_1,\dots,x_N\right) = \prod_{i=1}^{N} \frac{1}{\sqrt{(2\pi)^K\left|\Sigma\right|}} \exp\left( -\frac{\left(x_i-\mu\right)^T \Sigma^{-1} \left(x_i-\mu\right)}{2} \right)
= \left[(2\pi)^K\left|\Sigma\right|\right]^{-\frac{N}{2}} \exp\left( -\sum_{i=1}^{N} \frac{\left(x_i-\mu\right)^T \Sigma^{-1} \left(x_i-\mu\right)}{2} \right)
\]
To find the MLE estimator for (µ, Σ) it is easiest to split the problem into
simpler bits: the estimation of µ and that of Σ (note that this is not always
possible). Here, the First Order Conditions with respect to µ are:
\[
\frac{\partial \log L\left(\hat{\mu}_{MLE},\Sigma\,\middle|\,x_1,\dots,x_N\right)}{\partial\mu} = \sum_{i=1}^{N} \Sigma^{-1}\left(x_i - \hat{\mu}_{MLE}\right) = 0
\]
which, if evaluated at the solution where µ = x̄ and set at zero, returns the
MLE of the variance-covariance matrix.
\[
\hat{\Sigma}_{MLE} = \frac{1}{N}\sum_{i=1}^{N} \left(x_i - \bar{x}\right)\left(x_i - \bar{x}\right)^T
\]
As in the Method of Moments and in the univariate MLE case, this estimator is a rescaled version of the sample variance-covariance matrix, Σ̂_MLE = ((N − 1)/N)·S.
Some tedious analysis, similar to that from the univariate case, would show
that the MLE solutions µ b M LE and Σ
b M LE indeed identify a maximum of the
(log-)likelihood function.
In the last few cases, the Method of Moments and Maximum Likelihood
estimators are seen to coincide. This, however, is generally not true, as the
following example shows.
Example 5.8. Maximum Likelihood Estimation of the parameters
of the Gamma distribution. Consider again random sampling from the
Gamma distribution as in Example 5.3. There, the likelihood function is:
\[
L\left(\alpha,\beta\,\middle|\,x_1,\dots,x_N\right) = \prod_{i=1}^{N} \frac{\beta^\alpha}{\Gamma(\alpha)}\, x_i^{\alpha-1} \exp\left(-\beta x_i\right)
= \left(\frac{\beta^\alpha}{\Gamma(\alpha)}\right)^{N} \left(\prod_{i=1}^{N} x_i\right)^{\alpha-1} \exp\left(-\beta \sum_{i=1}^{N} x_i\right)
\]
There is no closed form solution to this problem. Even if the solution must
clearly respect the property that:
\[
\frac{\hat{\alpha}_{MLE}}{\hat{\beta}_{MLE}} = \frac{1}{N}\sum_{i=1}^{N} X_i = \bar{X}
\]
as in the Method of Moments case, exact expressions of the estimators for α and β in terms of (X_1, . . . , X_N) – or of (x_1, . . . , x_N) – cannot be derived
from the First Order Conditions. It is then necessary to employ numerical
methods on a case-by-case basis in order to identify the estimates.
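A minimal numerical sketch of such a procedure follows (simulated data; the shape-and-rate parametrization of the text is used, and the true values α = 2, β = 3 are arbitrary). The negative log-likelihood is minimized directly with a derivative-free routine.

import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

rng = np.random.default_rng(2)
x = rng.gamma(shape=2.0, scale=1.0 / 3.0, size=1000)   # hypothetical sample, alpha = 2, beta = 3

# Negative log-likelihood of an i.i.d. Gamma(alpha, beta) sample (rate parametrization).
def neg_loglik(params):
    a, b = params
    if a <= 0 or b <= 0:
        return np.inf
    return -np.sum(a * np.log(b) - gammaln(a) + (a - 1) * np.log(x) - b * x)

res = minimize(neg_loglik, x0=np.array([1.0, 1.0]), method="Nelder-Mead")
a_hat, b_hat = res.x
print(a_hat, b_hat, a_hat / b_hat, x.mean())           # alpha_hat/beta_hat should match the sample mean

As expected from the First Order Conditions, the ratio of the two numerical estimates reproduces the sample mean up to optimization error.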
This example showed how difficult it can be to perform Maximum Like-
lihood Estimation in certain cases – and this is not uncommon! Sometimes,
the MLE of interest does not even exist.
Example 5.9. Maximum Likelihood Estimation of the parameter
of uniform distributions with fixed lower bound. Consider a random
sample drawn from a uniformly distributed random variable Xi ∼ U (0, θ)
with lower bound fixed at zero and closed support: X = [0, θ]. It is easy to
see that E [X] = θ/2 and thus the Method of Moments estimator is:
\[
\hat{\theta}_{MM} = \frac{2}{N}\sum_{i=1}^{N} X_i = 2\bar{X}
\]
while the MLE is the sample maximum:
\[
\hat{\theta}_{MLE} = X_{(N)}
\]
To see this, note that the likelihood function here is:
\[
L\left(\theta\,\middle|\,x_1,\dots,x_N\right) = \frac{1}{\theta^N} \cdot 1\left[0 \le x_1,\dots,x_N \le \theta\right]
\]
and there is no need to resort to the log-likelihood function to see that the above is maximized by the smallest value of θ such that θ ≥ max{x_1, . . . , x_N}, hence the sample maximum. Suppose now that the support of X is open, at least on the right: X = [0, θ). The Method of Moments estimator is unchanged, but the likelihood function now becomes:
\[
L\left(\theta\,\middle|\,x_1,\dots,x_N\right) = \frac{1}{\theta^N} \cdot 1\left[0 \le x_1,\dots,x_N < \theta\right]
\]
with only an inequality being changed within the indicator function. Note
that it is no longer possible to follow the reasoning above in order to identify
a statistic that maximizes the likelihood function (to gain intuition, compare
the two likelihood functions depicted next in Figure 5.1). In cases like that
with open support, one typically says that the MLE does not exist.
[Figure 5.1: the likelihood function L_N(θ) of the uniform sample under closed support (left panel) and open support (right panel).]
Note: N = 5 and x_(5) = 1 in both cases; X = [0, θ] in the left panel and X = [0, θ) in the right panel. L_N(θ) is shorthand notation for L(θ | x_1, . . . , x_N).
It thus might seem that thanks to its simplicity and flexibility, Method
of Moments estimation trumps Maximum Likelihood as a more convenient
method for constructing estimators, and this goes without mentioning that
the latter possibly does not even exist. As anticipated, however, Maximum
Likelihood estimators generally have conceptual and statistical advantages;
one of these is illustrated next.
Theorem 5.1. Invariance of Maximum Likelihood Estimators. Call θ̂_MLE the Maximum Likelihood Estimator for some parameter vector θ. Let φ = g(θ) be some transformation of the parameter vector θ. The Maximum Likelihood estimator of φ is simply the corresponding transformation of the Maximum Likelihood Estimator of θ:
\[
\hat{\varphi}_{MLE} = g\left(\hat{\theta}_{MLE}\right)
\]
Proof. In terms of the induced likelihood function L*:
\[
\begin{aligned}
L^*\left(\hat{\varphi}_{MLE}\,\middle|\,x_1,\dots,x_N\right) &= \max_{\varphi}\ \max_{\{\theta:\,g(\theta)=\varphi\}} L\left(\theta\,\middle|\,x_1,\dots,x_N\right) \\
&= \max_{\theta} L\left(\theta\,\middle|\,x_1,\dots,x_N\right) \\
&= L\left(\hat{\theta}_{MLE}\,\middle|\,x_1,\dots,x_N\right) \\
&= \max_{\{\theta:\,g(\theta)=g(\hat{\theta}_{MLE})\}} L\left(\theta\,\middle|\,x_1,\dots,x_N\right) \\
&= L^*\left(g\left(\hat{\theta}_{MLE}\right)\,\middle|\,x_1,\dots,x_N\right)
\end{aligned}
\]
where the first and last equalities follow from the definition of induced like-
lihood function, the second equality follows from the properties of iterated
maximizations, and the remaining ones follow by the definition of MLE.
Example 5.10. Maximum Likelihood Estimation of the “precision”
parameter of the normal distribution. Recall that the normal distri-
bution can be alternatively described in terms of the precision parameter
φ2 = σ−2 , where the density function is expressed as in (2.37). In that case,
the (induced) likelihood function would be as follows:
\[
L\left(\mu,\phi^2\,\middle|\,x_1,\dots,x_N\right) = \left(\frac{\phi^2}{2\pi}\right)^{\frac{N}{2}} \exp\left( -\sum_{i=1}^{N} \frac{\phi^2\left(x_i-\mu\right)^2}{2} \right)
\]
Maximizing this function would reveal that the MLE of µ is still the sample mean, while the MLE of the precision parameter is the following:
\[
\hat{\phi}^2_{MLE} = N\left[ \sum_{i=1}^{N} \left(X_i - \bar{X}\right)^2 \right]^{-1}
\]
once again shows that a closed form solution cannot be identified; however, the solution must be such that θ̂_MLE = β̂_MLE⁻¹.
Thus, for any vector of parameters and associated estimates the MSE is
simply the sum of K elements of the form:
\[
\text{MSE}_k = E\left[\left(\hat{\theta}_k - \theta_k\right)^2\right]
\]
Alternative criteria based on other loss functions, such as E[|θ̂_k − θ_k|], are certainly possible (the latter is called the Mean Absolute Error – MAE).
Nevertheless, the overwhelming majority of practical applications adopts
the MSE; among the various reasons (including analytical convenience) the
following property plays a fundamental role.
\[
\begin{aligned}
\text{MSE}_k &= E\left[\left(\hat{\theta}_k - E\left[\hat{\theta}_k\right] + E\left[\hat{\theta}_k\right] - \theta_k\right)^2\right] \\
&= E\left[\left(\hat{\theta}_k - E\left[\hat{\theta}_k\right]\right)^2\right] + \left(E\left[\hat{\theta}_k\right] - \theta_k\right)^2 + 2\,E\left[\left(\hat{\theta}_k - E\left[\hat{\theta}_k\right]\right)\left(E\left[\hat{\theta}_k\right] - \theta_k\right)\right] \\
&= \text{Var}\left[\hat{\theta}_k\right] + \left(E\left[\hat{\theta}_k\right] - \theta_k\right)^2
\end{aligned}
\]
Above, the last element in the second line is easily shown to vanish; this decomposition should be reminiscent of the analysis conducted in Lecture
1 about the mean as the “best guess” of a random variable. In words, the
MSE relative to a specific estimator θbk can be decomposed in two parts:
the variance of the estimator, and the squared deviation of its mean from
the parameter of interest. The last concept warrants a definition.
Definition 5.5. Bias and unbiasedness. Consider a unidimensional estimator θ̂ for some parameter of interest θ. Its bias is the quantity:
\[
\text{Bias}\left[\hat{\theta}\right] \equiv E\left[\hat{\theta}\right] - \theta
\]
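To make the decomposition and the bias concept concrete, here is a small Monte Carlo sketch using the uniform-distribution estimators of Example 5.9 (θ = 1, N = 20 and the number of replications are illustrative choices). The last two printed columns verify numerically that MSE = Var + Bias².

import numpy as np

rng = np.random.default_rng(3)
theta, N, reps = 1.0, 20, 10_000

samples = rng.uniform(0, theta, size=(reps, N))
mm = 2 * samples.mean(axis=1)          # Method of Moments: 2 * sample mean
mle = samples.max(axis=1)              # MLE: sample maximum

for name, est in [("MM", mm), ("MLE", mle)]:
    bias = est.mean() - theta
    var = est.var()
    mse = np.mean((est - theta) ** 2)
    print(name, bias, var, mse, var + bias ** 2)   # the last two columns should agree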
This example clarifies why the unbiasedness property does not guarantee
that an estimator performs better than others in practical situations. When
restricting their attention to unbiased estimators, researchers should at least
ensure that these are also those with the smallest possible variance. These
estimators have a proper name in the theory of statistical inference.
Definition 5.6. Best unbiased estimators. Consider the set of unbiased estimators θ̂ of a certain parameter θ:
\[
C_\theta = \left\{ \hat{\theta} : E\left[\hat{\theta}\right] = \theta \right\}
\]
An estimator θ̂* is called the best unbiased estimator, or the uniform minimum variance unbiased estimator of θ, if the following holds.
\[
\text{Var}\left[\hat{\theta}\right] - \text{Var}\left[\hat{\theta}^*\right] \ge 0 \quad \text{for all } \hat{\theta} \in C_\theta \tag{5.20}
\]
where the inequality is interpreted in the sense that the matrix on the left-
hand side is positive semi-definite, and Cθk is the set of unbiased estimators
of θk for k = 1, . . . , K.
A property of best unbiased estimators is that they are unique.
Theorem 5.2. Uniqueness of best unbiased estimators. Let θ̂* be a best unbiased estimator for some parameter θ. In this setting θ̂* is unique, in the sense that (5.20) holds sharply (without equality).
Proof. Suppose that there is another estimator θ̂** that is also a best unbiased estimator, in the sense that it has the same expectation and variance as θ̂*. Define the estimator:
\[
\hat{\theta}' \equiv \frac{1}{2}\hat{\theta}^* + \frac{1}{2}\hat{\theta}^{**}
\]
it is clear that E[θ̂'] = θ. As per the variance, it must be that:
\[
\begin{aligned}
\text{Var}\left[\hat{\theta}'\right] &= \frac{1}{4}\text{Var}\left[\hat{\theta}^*\right] + \frac{1}{4}\text{Var}\left[\hat{\theta}^{**}\right] + \frac{1}{2}\text{Cov}\left[\hat{\theta}^*,\hat{\theta}^{**}\right] \\
&\le \frac{1}{4}\text{Var}\left[\hat{\theta}^*\right] + \frac{1}{4}\text{Var}\left[\hat{\theta}^{**}\right] + \frac{1}{2}\left\{\text{Var}\left[\hat{\theta}^*\right]\text{Var}\left[\hat{\theta}^{**}\right]\right\}^{\frac{1}{2}} \\
&= \text{Var}\left[\hat{\theta}^*\right]
\end{aligned}
\]
where the inequality follows from the same argument as in Theorem 3.4, and the last line is due to the fact that θ̂* and θ̂** have the same variance. Note that the inequality must be replaced by an equality to avoid a contradiction! If the inequality were sharp, then θ̂* would not be a best unbiased estimator, as θ̂' would improve upon it. To have an equality, it must be – again by Theorem 3.4 – that θ̂** is a linear transformation of θ̂*, that is θ̂** = a + bθ̂*. But in this case it must also be that a = 0, or else θ̂** would be biased, and b = 1, or else θ̂** could not be unbiased with the same variance as θ̂*. Thus, θ̂*, θ̂** and θ̂' are all identical estimators, that is, θ̂* is the only best unbiased estimator.
The search for unbiased estimators with good properties is facilitated
by the following result, which is stated here in its multivariate version.
Theorem 5.3. The Rao-Blackwell Theorem. Consider an environment
where (x1 , . . . , xN ) is a sample drawn from some list of random vectors, θ is
some parameter vector of interest, θ b is any vector of unbiased estimators of
θ, and t = t (x1 , . . . , xN ) is a vector of statistics that are all simultaneously
sufficient for θ. Define the following statistic as a conditional expectation
function:
\[
\hat{\theta}^* \equiv E\left[\hat{\theta}\,\middle|\,t\right]
\]
In the multivariate case, the corresponding Cramér-Rao inequality reads:
\[
\text{Var}\left[\hat{\theta}\right] - \left(\frac{\partial}{\partial\theta^T} E\left[\hat{\theta}\right]\right) \left[I_N(\theta)\right]^{-1} \left(\frac{\partial}{\partial\theta^T} E\left[\hat{\theta}\right]\right)^T \ge 0
\]
which is to be interpreted in the sense that the K × K matrix on the left hand side is positive semi-definite, and where I_N(θ) is as follows.
\[
I_N(\theta) \equiv E\left[ \left(\frac{\partial}{\partial\theta}\log f\left(x_1,\dots,x_N;\theta\right)\right) \left(\frac{\partial}{\partial\theta}\log f\left(x_1,\dots,x_N;\theta\right)\right)^T \right]
\]
Proof. In analogy with the univariate case, define the random vectors:
\[
u = \hat{\theta}\left(x_1,\dots,x_N\right) \qquad v = \frac{\partial}{\partial\theta}\log f\left(x_1,\dots,x_N;\theta\right)
\]
that are related as follows, by the properties of multivariate moments.
\[
\text{Var}\left[u\right] - \left[\text{Cov}\left[u,v\right]\right]\left[\text{Var}\left[v\right]\right]^{-1}\left[\text{Cov}\left[u,v\right]\right]^T \ge 0
\]
where fX (x; θ) is the mass or density function that generates the sample.
Proof. Observe that:
\[
\begin{aligned}
E\left[\left(\frac{\partial}{\partial\theta}\log f\left(X_1,\dots,X_N;\theta\right)\right)^2\right]
&= E\left[\left(\frac{\partial}{\partial\theta}\log \prod_{i=1}^{N} f_X\left(X_i;\theta\right)\right)^2\right] \\
&= E\left[\left(\sum_{i=1}^{N}\frac{\partial}{\partial\theta}\log f_X\left(X_i;\theta\right)\right)^2\right] \\
&= E\left[\sum_{i=1}^{N}\left(\frac{\partial}{\partial\theta}\log f_X\left(X_i;\theta\right)\right)^2\right] \\
&= \sum_{i=1}^{N} E\left[\left(\frac{\partial}{\partial\theta}\log f_X\left(X_i;\theta\right)\right)^2\right] \\
&= N \cdot E\left[\left(\frac{\partial}{\partial\theta}\log f_X\left(X;\theta\right)\right)^2\right]
\end{aligned}
\]
where the first line follows from random sampling, the second line is a simple manipulation, the third and fourth lines are based on the linear properties of expectations and independence, as terms of the following form for i ≠ j:
\[
E\left[\frac{\partial}{\partial\theta}\log f_X\left(X_i;\theta\right) \cdot \frac{\partial}{\partial\theta}\log f_X\left(X_j;\theta\right)\right] = 0
\]
must be equal to the product of the respective means and therefore to zero, while the fifth line follows from identically distributed observations.
Multivariate case. In the (multivariate) version of the setup of Theorem 5.4, if the sample is random the inequality is based on the following version of the information matrix:
\[
I_N(\theta) = N \cdot E\left[ \left(\frac{\partial}{\partial\theta}\log f_x\left(x;\theta\right)\right) \left(\frac{\partial}{\partial\theta}\log f_x\left(x;\theta\right)\right)^T \right]
\]
where f_x(x; θ) is the mass or density function that generates the sample.
Proof. The proof is all but an extension of the univariate case. The information matrix is developed as:
\[
\begin{aligned}
I_N(\theta) &= E\left[ \left(\frac{\partial}{\partial\theta}\log f\left(x_1,\dots,x_N;\theta\right)\right) \left(\frac{\partial}{\partial\theta}\log f\left(x_1,\dots,x_N;\theta\right)\right)^T \right] \\
&= E\left[ \left(\frac{\partial}{\partial\theta}\log \prod_{i=1}^{N} f_x\left(x_i;\theta\right)\right) \left(\frac{\partial}{\partial\theta}\log \prod_{i=1}^{N} f_x\left(x_i;\theta\right)\right)^T \right] \\
&= E\left[ \left(\sum_{i=1}^{N}\frac{\partial}{\partial\theta}\log f_x\left(x_i;\theta\right)\right) \left(\sum_{i=1}^{N}\frac{\partial}{\partial\theta}\log f_x\left(x_i;\theta\right)\right)^T \right] \\
&= E\left[ \sum_{i=1}^{N}\left(\frac{\partial}{\partial\theta}\log f_x\left(x_i;\theta\right)\right) \left(\frac{\partial}{\partial\theta}\log f_x\left(x_i;\theta\right)\right)^T \right] \\
&= \sum_{i=1}^{N} E\left[ \left(\frac{\partial}{\partial\theta}\log f_x\left(x_i;\theta\right)\right) \left(\frac{\partial}{\partial\theta}\log f_x\left(x_i;\theta\right)\right)^T \right] \\
&= N \cdot E\left[ \left(\frac{\partial}{\partial\theta}\log f_x\left(x;\theta\right)\right) \left(\frac{\partial}{\partial\theta}\log f_x\left(x;\theta\right)\right)^T \right]
\end{aligned}
\]
where the crucial step is between the third and the fourth line, as the terms of the following form, for i ≠ j:
\[
E\left[ \left(\frac{\partial}{\partial\theta}\log f_x\left(x_i;\theta\right)\right) \left(\frac{\partial}{\partial\theta}\log f_x\left(x_j;\theta\right)\right)^T \right] = 0
\]
disappear due to independence; the other steps are simple manipulations or other implications of random sampling.
The mentioned additional simplifications are possible if the differentia-
tion operation with respect to the parameters of interest can pass through
the expectation operator twice (this is generally the case for a wide number
of distributions, including all those in the exponential macro-family). This
implies that in the univariate case, it is:
" 2 # 2
∂ ∂
E log fX (X; θ) = −E log fX (X; θ)
∂θ ∂θ2
while in the multivariate case the following two K × K matrices are equal:
a result known as the information matrix equality.
" T #
∂2
∂ ∂
E log fx (x; θ) log fx (x; θ) = −E log fx (x; θ)
∂θ ∂θ ∂θ∂θT
where φ (z), as usual, is the density function of the standard normal distri-
bution. In analogy with the calculation of the Hessian matrix from Example
5.6, the above information matrix is as follows.
\[
I_N\left(\mu,\sigma^2\right) = -N \cdot E\begin{pmatrix}
-\dfrac{1}{\sigma^2} & -\dfrac{X-\mu}{\sigma^4} \\[2ex]
-\dfrac{X-\mu}{\sigma^4} & \dfrac{1}{2\sigma^4} - \dfrac{\left(X-\mu\right)^2}{\sigma^6}
\end{pmatrix}
= \begin{pmatrix}
\dfrac{N}{\sigma^2} & 0 \\[2ex]
0 & \dfrac{N}{2\sigma^4}
\end{pmatrix}
\]
Clearly, X̄ is an unbiased estimator and its variance Var[X̄] = σ²/N attains the corresponding Cramér-Rao bound.
Following this discussion, one may be left wondering whether any math-
ematical result can help identify whether an unbiased estimator attains the
Cramér-Rao bound or not. Such a result exists and is the following.
Theorem 5.6. Attainment of the Cramér-Rao Bound – univariate
case. In the (univariate) setup of Theorem 5.4, if θ̂ is an unbiased estimator of θ, it attains the Cramér-Rao bound if and only if:
\[
a_N(\theta)\left[\hat{\theta} - \theta\right] = \frac{\partial}{\partial\theta}\log f_{X_1,\dots,X_N}\left(x_1,\dots,x_N;\theta\right)
\]
for some function a_N(θ) of the parameter.
Proof. Recall the proof of Theorem 5.4 as well as Theorem 3.4: the equality
is attained only if U (the estimator) is a linear function of V (the derivative
of the logarithmic joint mass or density function of the sample, i.e. the log-
likelihood function). By the Cauchy-Schwarz Inequality this can be phrased
as a (U − E [U ]) = V . As a can be a function of θ, write it as aN (θ).
Multivariate case. In the (multivariate) setup of Theorem 5.4, if θ̂ is an unbiased estimator of θ, it attains the Cramér-Rao bound if and only if:
\[
A_N(\theta)\left[\hat{\theta} - \theta\right] = \frac{\partial}{\partial\theta}\log f_{x_1,\dots,x_N}\left(x_1,\dots,x_N;\theta\right)
\]
for some K × K matrix AN (θ) which is a function of the parameters.
Proof. Similarly to the univariate case, the equality is only attained if u is
a linear function of v, i.e. A (u − E [u]) = v where A = AN (θ).
Example 5.16. Attainment of the Cramér-Rao bound for estima-
tors of the normal distribution. Consider again random sampling from
the normal distribution. The derivative of the joint density corresponds to
the MLE First Order Conditions as in Example 5.6; write them as:
\[
\begin{pmatrix}
\displaystyle\sum_{i=1}^{N} \frac{x_i-\mu}{\sigma^2} \\[3ex]
\displaystyle -\frac{N}{2\sigma^2} + \sum_{i=1}^{N} \frac{\left(x_i-\mu\right)^2}{2\sigma^4}
\end{pmatrix}
=
\underbrace{\begin{pmatrix}
\dfrac{N}{\sigma^2} & 0 \\[2ex]
0 & \dfrac{N}{2\sigma^4}
\end{pmatrix}}_{=A_N\left(\mu,\sigma^2\right)}
\begin{pmatrix}
\dfrac{1}{N}\displaystyle\sum_{i=1}^{N} x_i - \mu \\[3ex]
\dfrac{1}{N}\displaystyle\sum_{i=1}^{N} \left(x_i-\mu\right)^2 - \sigma^2
\end{pmatrix}
\]
as per Theorem 5.6. This decomposition not only shows again that X̄ is an unbiased estimator of µ that attains the bound; it also reveals that the only unbiased estimator of σ² that attains the bound is σ̃² = (1/N)·Σ_{i=1}^N (X_i − µ)².
5.3. Tests of Hypotheses
\[
H_0: \frac{\sigma^2_X}{\sigma^2_Y} \le C \qquad H_1: \frac{\sigma^2_X}{\sigma^2_Y} > C
\]
where σ²_X and σ²_Y are the variances of X and Y, respectively. Here, C < ∞ must be finite but is otherwise unrestricted. If C = 1 the test has an obvious interpretation: the null hypothesis represents the scenario where X has a variance smaller than or equal to that of Y, while the alternative hypothesis states that the variance of X is larger than that of Y. Naturally, two-sided tests about specific values of the ratio are perfectly possible.
Naturally, specifying the acceptance and rejection regions for large samples
can be quite complicated, and maybe not extremely useful. Therefore, it is
common to use univariate test statistics for this purpose.
In an ideal test, both types of errors never occur; clearly, this ideal cannot
be attained as otherwise it would not be necessary to conduct tests in the
first place. At the same time, it is not possible to identify a criterion which
is useful to simultaneously shrink both types of errors; since the probability
to commit either depends on the acceptance and rejection regions, reducing
one increases the other and vice versa. The following concept well represents
the trade-off in question.
Definition 5.10. Power Function. The probability that the test statistic
falls in the rejection region, as a function of the parameters θ, is the power
function of a test.
\[
P_T(\theta) = P\left(t \in T_0^{\,c};\theta\right) = 1 - P\left(t \in T_0;\theta\right)
\]
Clearly, a power function expresses the probability of committing a Type I error if θ ∈ Θ_0, and equals one minus the probability of committing a Type II error if θ ∈ Θ_0^c. This notion, in turn, is instrumental in the following definitions.
Definition 5.11. Level of a test. Given a number α ∈ [0, 1], a test with
power function PT (θ) has confidence level α if PT (θ) ≤ α for all θ ∈ Θ0 .
Definition 5.12. Size of a test. Given a number α ∈ [0, 1], a test with
power function PT (θ) has size α if supθ∈Θ0 PT (θ) = α.
The distinction between level and size is subtle, but it highlights aspects
of the testing procedure. Given that a trade-off between Type I and Type II
errors exists, the convention in statistical analysis is to restrict the attention
to tests that have a sufficiently small probability of Type I errors (rejecting
the null hypothesis when it is true), a value which is fixed at some α. These
tests are said to have confidence level α. The confidence level is always a discretionary choice of the researcher, but conventionally α is chosen to be one of the values 0.1, 0.05, or 0.01. The smaller the confidence
level, the more credible is the outcome of the test when the null hypothesis
is rejected (since that outcome has a smaller probability to occur under the
null hypothesis). Once a confidence level is decided, a conscious researcher
must recognize that attempting to further reduce the probability of Type I
errors might be counterproductive, due to an increased probability of Type
II errors. Thus, the attention is restricted to those tests whose maximum
probability of rejecting the null hypothesis when it is true is exactly α: the
size of the test.4 In most practical applications, this nominal distinction is of
little consequence, but it is important to make a correct use of terminology.
Example 5.23. Testing for the mean: level and size. Consider a test
about the mean of a certain distribution, say the normal. In the two-sided
case, Θ0 = {C} and Θc0 = R \ {C}, hence there is no practical distinction
between level and size. In the one-sided case, however, if the null hypothesis
is that the mean is smaller or equal than some constant C, Θ0 = (−∞, C]
and Θc0 = (C, ∞), and vice versa. Consequently, for a fixed level α there are
different rejection probabilities for different values in Θ0 . In typical testing
procedures, the maximum rejection probability is achieved at µ = C.
After conducting tests, researchers usually report the following informa-
tion enclosed to their statistical analyses: the confidence level, the decision
outcome (acceptance vs. rejection) and often, a statistic called p-value.
Definition 5.13. The p-value. In a test of hypothesis with given size α,
a p-value is a statistic P = P (x1 , . . . , xN ) such that for all θ ∈ Θ0 , it is
P (P (x1 , . . . , xN ) ≤ α) ≤ α.
4
Some advanced statistical theory of tests helps identify criteria to obtain the “opti-
mal” tests, that is, those tests that minimize the probability of Type II errors for a fixed
level α. This analysis is outside the scope of this discussion.
[Figure: two normal density functions f_X(x), centered at µ₀ = 0 and µ₁ = 3; see note below.]
Note: this figure represents the probabilities of both a type I and a type II error when H0 : µ ≤ 0, the
alternative hypothesis is true for µ1 = 3, and the testing protocol of the researcher is to reject the null
hypothesis if the realized standardized sample mean centered at µ0 = 0 exceeds 2. The probability of a
Type I error is thus the shaded area below the continuous density function centered at µ0 = 0 while the
probability of a Type II error is the dotted area below the dashed density function centered at µ1 = 3.
Figure 5.2: Test on the mean of a normal distribution: error types I & II
Consider now the general case of a one-sided test, for some given C.
H0 : µ ≤ C H1 : µ > C
In order to use the standardized sample mean as an appropriate test statistic
with a given size α, one must solve the following equation in terms of the
critical value z*_α, where the subscript indicates the size of the test.
\[
P\left(\bar{X} > C + \frac{\sigma}{\sqrt{N}}\,z^*_\alpha\right) = P\left(\sqrt{N}\,\frac{\bar{X}-C}{\sigma} > z^*_\alpha\right) = \alpha
\]
[Figure 5.3: standard normal densities φ(x) with the shaded rejection regions of a one-sided test (left panel) and a two-sided test (right panel).]
Note: the left panel depicts a one-sided test, the right panel a two-sided test, both with size α = 0.05. The shaded areas represent the corresponding rejection regions. The random variable X represented in both panels is the standardized sample mean centered at C; it follows the standard normal distribution. The critical values are, respectively, z*_{0.05} ≈ 1.64 in the left panel and z*_{0.025} ≈ 1.96 in the right panel.
Now suppose that the test is two-sided: the null hypothesis allows for
only one value C, while the alternative hypothesis admits all other values.
H0 : µ = C H1 : µ 6= C
The researcher must now look for two symmetric critical values: z*_{α/2} > 0 and its mirror image −z*_{α/2} < 0. Intuitively, the researcher is agnostic about
the sign of the deviation from C in case the alternative hypothesis is true;
5
Once again, the standardized sample mean is centered at µ = C because with this
choice, and for a fixed level α, the probability of a Type I error is maximized while the
probability of a Type II error is minimized.
hence, given a level α the probabilities of both the Type I and the Type II errors are minimized when:
\[
P\left(\bar{X} - C > \frac{\sigma}{\sqrt{N}}\,z^*_{\alpha/2}\right) = P\left(\sqrt{N}\,\frac{\bar{X}-C}{\sigma} > z^*_{\alpha/2}\right) = \frac{\alpha}{2}
\]
and the null hypothesis is rejected if √N·|x̄ − C|/σ > z*_{α/2}.6 This is visually represented in the right panel of Figure 5.3, where the rejection region is composed of two symmetric tails of the standard normal distribution. Here, the p-value is calculated as the sum of two symmetric probabilities:
\[
p\left(\bar{x}\right) = P\left(\bar{X} > \left|\bar{x}\right|\right) + P\left(\bar{X} < -\left|\bar{x}\right|\right) = 2 \cdot P\left(\bar{X} > \left|\bar{x}\right|\right) = 2 \cdot P\left(\bar{X} < -\left|\bar{x}\right|\right)
\]
where the last equality follows from the symmetry of the distribution.
Two-sided tests about the mean of the normal distribution are perhaps the most common kinds of tests of hypotheses. It is thus useful to memorize the critical values associated with conventional confidence levels: z*_{0.05} ≈ 1.64 if α = 0.1, z*_{0.025} ≈ 1.96 if α = 0.05, and z*_{0.005} ≈ 2.58 if α = 0.01.
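These critical values and the corresponding p-values are easily obtained from the standard normal distribution; a short sketch follows (the realized standardized statistic z_obs is a hypothetical number).

from scipy.stats import norm

alpha = 0.05
z_two_sided = norm.ppf(1 - alpha / 2)      # ~1.96: two-sided critical value
z_one_sided = norm.ppf(1 - alpha)          # ~1.64: one-sided critical value

z_obs = 2.3                                # hypothetical realized standardized sample mean
p_one_sided = norm.sf(z_obs)               # P(Z > z_obs)
p_two_sided = 2 * norm.sf(abs(z_obs))      # 2 * P(Z > |z_obs|)
print(z_one_sided, z_two_sided, p_one_sided, p_two_sided)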
The two-sided case bears symmetric analogies. One could graphically rep-
resent the two scenarios similarly as in Figure 5.3, but using the Student’s
t-distribution instead of the standard normal.
6
Here, the standardized sample mean calculated for evaluating the test is centered
at µ = C because this is the only value allowed by the null hypothesis.
\[
H_0: 0 < \sigma^2 \le C \qquad H_1: \sigma^2 > C
\]
Here, the test statistic is the rescaled sample variance (N − 1) S 2 /C, and
the critical value kα∗ for a test of size α is identified through the chi-squared
distribution with N − 1 degrees of freedom (see Figure 5.4 below).
\[
P\left(S^2 > \frac{C}{N-1}\,k^*_\alpha\right) = P\left((N-1)\,\frac{S^2}{C} > k^*_\alpha\right) = \alpha
\]
[Figure 5.4: density of the χ²₇ distribution with the shaded rejection region; see note below.]
Note: the shaded area represents the rejection region for a test with size α = 0.05 on the variance of a
normal distribution if N = 8. The represented random variable X ∼ χ27 is the rescaled sample variance.
Thus, the null hypothesis is rejected if (N − 1)s²/C > k*_α, and the p-value is calculated as p(s²) = P(S² ≥ s²).
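An analogous computational sketch for this variance test (the sample size, the hypothesized value C and the realized sample variance are hypothetical numbers):

from scipy.stats import chi2

N, C, alpha = 8, 2.0, 0.05
s2 = 3.1                                   # hypothetical realized sample variance

stat = (N - 1) * s2 / C                    # rescaled sample variance
k_crit = chi2.ppf(1 - alpha, df=N - 1)     # critical value of the chi-squared(N-1)
p_value = chi2.sf(stat, df=N - 1)          # P(chi-squared >= observed statistic)
print(stat, k_crit, stat > k_crit, p_value)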
Example 5.26. Testing for the variance ratio of two normal distributions: test statistic and p-values. Suppose now that the interest of the analyst falls on the variances of two independent normally distributed populations. The relevant hypotheses are as in Example 5.22:
\[
H_0: \frac{\sigma^2_X}{\sigma^2_Y} \le C \qquad H_1: \frac{\sigma^2_X}{\sigma^2_Y} > C
\]
and by the analysis conducted in Lecture 4, the relevant test statistic is the F-statistic F = S²_X/(C·S²_Y). Thus, the critical value k*_α for a test of size α is identified through the appropriate F-distribution (see the figure below).
[Figure: density of the F₁₁,₁₁ distribution with the shaded rejection region; see note below.]
Note: the shaded area represents the rejection region for a test with size α = 0.05 on the normal variance
ratio if NX = NY = 12. The represented random variable X ∼ F11,11 is the F -statistic.
In this scenario, the null hypothesis is rejected if (s²_X/s²_Y)/C > k*_α and the p-value is calculated as p(s²_X, s²_Y) = P(S²_X/S²_Y > s²_X/s²_Y).
Example 5.27. Testing multiple means of the multivariate normal
distribution: test statistic and p-values. Consider now some composite
hypotheses about multiple parameters – specifically, multiple means – of the
multivariate normal distribution (recall Example 5.20):
\[
H_0: \mu_k = C_k \qquad H_1: \mu_k \ne C_k
\]
for k = 1, . . . , K. This test is best expressed in vectorial form:
\[
H_0: \mu = c \qquad H_1: \mu \ne c
\]
[Figure: density of the F₄,₈ distribution with the shaded rejection region; see note below.]
Note: the shaded area represents the rejection region for a test with size α = 0.05 about K = 4 means
µ = (µ1 , . . . , µK ) of a multivariate normal distribution, with N = 12. The represented random variable
X ∼ F4,8 is the rescaled Hotelling’s t-squared statistic which is discussed above in the text.
The logic of the test can be generalized. For example, if the hypotheses
of interest were about the equality of the means of a random vector (X, Y )
that follows the bivariate normal distribution, like:
\[
H_0: \mu_X - \mu_Y = 0 \qquad H_1: \mu_X - \mu_Y \ne 0
\]
[Figure: density of the Γ(10, 2) distribution with the shaded rejection region; see note below.]
Note: the shaded area represents the rejection region for a test with size α = 0.05 on the parameter λ of
an exponential distribution if N = 10 and C = 5. The represented random variable X ∼ Γ (10, 2) is the
rescaled sample mean under the null hypothesis that λ = C. Note that both parameters depend on C.
5.4. Interval Estimation
Note that the probabilities defined above depend on the chosen statistics L
and U (two random variables), and are evaluated in the sample space defined
by the support of the sample. In practice, the distinction between the two
definitions is often irrelevant, because in many common cases the coverage
probability does not vary in the parameter space. When it varies, however,
an interval estimator is evaluated in terms of the confidence coefficient.
where k**_{1−α/2} and k*_{α/2} are two suitable critical values associated with, respectively, the two tail probabilities:
\[
P\left(T\left(x_1,\dots,x_N;C\right) < k^{**}_{1-\alpha/2}\right) = \frac{\alpha}{2} \qquad P\left(T\left(x_1,\dots,x_N;C\right) > k^{*}_{\alpha/2}\right) = \frac{\alpha}{2}
\]
This notation is chosen for the sake of consistency with the more general treatment of tests. The above acceptance region T_0 is associated with a size α which is defined in terms of the following probability.
\[
P\left(k^{**}_{1-\alpha/2} \le T\left(x_1,\dots,x_N;C\right) \le k^{*}_{\alpha/2}\right) = 1 - \alpha
\]
Note that this equals one minus the probability of a Type I error.
3. Construct the following two statistics by inverting the function that defines the test statistic, T(x_1, . . . , x_N; C), with respect to C, and by evaluating the inverse at the two critical values.
\[
I_1 = T^{-1}\left(x_1,\dots,x_N;k^{**}_{1-\alpha/2}\right) \qquad I_2 = T^{-1}\left(x_1,\dots,x_N;k^{*}_{\alpha/2}\right)
\]
\[
P\left(L\left(x_1,\dots,x_N\right) \le \theta \le U\left(x_1,\dots,x_N\right)\right) = 1 - \alpha
\]
This procedure appears abstract and convoluted, but since most test statistics are simple functions of the parameters, the inversion is generally straightforward and intuitive. The method is best illustrated via examples.
Example 5.29. The confidence interval for the mean of the normal
distribution. As described in Example 5.24, in two-sided tests about the
mean of the normal distribution in the case where the variance σ2 is known,
the acceptance region is defined in terms of the following interval:
\[
T_{0,\mu} = \left\{ \sqrt{N}\,\frac{\bar{X}-C}{\sigma} \in \left[-z^*_{\alpha/2},\,z^*_{\alpha/2}\right] \right\}
\]
since the null hypothesis is rejected if √N·|x̄ − C|/σ > z*_{α/2}. Note that in this setting one can seamlessly convert open intervals into closed ones and vice versa, since realizations equal to a specific value have probability zero. The two critical values −z*_{α/2} and z*_{α/2} are evaluated in terms of the standard
normal distribution, which is symmetric around zero. Thus, the confidence
interval for µ is:
\[
\mu \in \left[ \bar{X} - \frac{\sigma}{\sqrt{N}}\,z^*_{\alpha/2},\ \bar{X} + \frac{\sigma}{\sqrt{N}}\,z^*_{\alpha/2} \right]
\]
which is obtained easily, since the test statistic is a simple linear function
of the parameter. Note that the function is also a monotonically decreasing
one, hence L = I2 and U = I1 according to the procedure described earlier.
If instead the variance σ2 is unknown, the analogous procedure based on
the t-statistic results in the analogous confidence interval:
\[
\mu \in \left[ \bar{X} - \frac{S}{\sqrt{N}}\,t^*_{\alpha/2},\ \bar{X} + \frac{S}{\sqrt{N}}\,t^*_{\alpha/2} \right]
\]
where t*_{α/2} is evaluated in terms of the Student's t-distribution with N − 1 degrees of freedom, which is also symmetric around zero.
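A minimal sketch of the computation of this t-based interval (the data are simulated and α = 0.05 is an illustrative choice):

import numpy as np
from scipy.stats import t

rng = np.random.default_rng(4)
x = rng.normal(loc=5.0, scale=2.0, size=30)    # hypothetical normal sample

alpha = 0.05
xbar, s = x.mean(), x.std(ddof=1)              # sample mean and sample standard deviation
t_crit = t.ppf(1 - alpha / 2, df=len(x) - 1)   # two-sided critical value, N-1 degrees of freedom
half = t_crit * s / np.sqrt(len(x))
print(xbar - half, xbar + half)                # 95% confidence interval for the mean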
Example 5.30. The confidence interval for the variance of the normal distribution. By extending Example 5.25, a two-sided test about the variance of the normal distribution would have acceptance region:
\[
T_{0,\sigma^2} = \left\{ (N-1)\,\frac{S^2}{C} \in \left[k^{**}_{1-\alpha/2},\,k^{*}_{\alpha/2}\right] \right\}
\]
where k**_{1−α/2} and k*_{α/2} are evaluated in terms of the chi-squared distribution with N − 1 degrees of freedom.
[Figure: density of the chi-squared distribution with the two shaded tails marking the rejection region; see note below.]
Note: the shaded area displays the rejection region for a two-sided version of the test in Example 5.25.
Example 5.31. The confidence interval for the variance ratio from
two normal distributions. The case of the variance ratio σ2X /σ2Y from
two samples drawn from two independent normal distributions is analogous;
the two-sided version of Example 5.26 gives the following acceptance region:
\[
T_{0,\sigma^2_X/\sigma^2_Y} = \left\{ \frac{S^2_X}{S^2_Y}\,\frac{1}{C} \in \left[k^{**}_{1-\alpha/2},\,k^{*}_{\alpha/2}\right] \right\}
\]
where k**_{1−α/2} and k*_{α/2} are evaluated as quantiles of the F-distribution with N_X − 1 and N_Y − 1 degrees of freedom.
[Figure: density of the Γ(10, 2) distribution with the two shaded tails marking the rejection region; see note below.]
Note: the shaded area displays the rejection region for a two-sided version of the test in Example 5.28.
Lecture 6
Asymptotic Analysis
6.1. Convergence in Probability
The following two useful results show that whenever some random sequence
converges in quadratic or higher mean to some specific vector (such as, say,
its mean), it also converges in probability to it.
Theorem 6.2. Convergence in Lower Means. A random sequence xN
that converges in r-th mean to some constant vector c also converges in s-th
mean to c for s < r.
Proof. The proof is based on Jensen's Inequality:
\[
\lim_{N\to\infty} E\left[\left\|x_N-c\right\|^s\right] = \lim_{N\to\infty} E\left[\left(\left\|x_N-c\right\|^r\right)^{\frac{s}{r}}\right] \le \lim_{N\to\infty} \left\{E\left[\left\|x_N-c\right\|^r\right]\right\}^{\frac{s}{r}} = 0
\]
which also implies convergence in probability, x̄_N →^p E[x].
With the aid of some measure theory, it is possible to prove the intuitive result that almost sure convergence implies convergence in probability.
All concepts, definitions and results about convergence that have been discussed thus far apply to sequences of random matrices as well. A random sequence X_N of matrices converges in probability to some matrix C if, for every ε > 0, lim_{N→∞} P(‖X_N − C‖ ≥ ε) = 0 (where for any matrix B, ‖B‖ = √(tr(BᵀB))). This is denoted as follows.
\[
X_N \xrightarrow{\ p\ } C
\]
that is, convergence in probability and almost sure convergence are preserved
when functions are applied to random sequences.
Proof. (Sketched.) Only the case about convergence in probability is proved
here, with the purpose of illustrating the core argument (which is essentially
an extension of the properties of limits for continuous functions). For a given
positive number δ > 0, define the set:
\[
G_\delta = \left\{ x \in X \mid x \notin D_g : \exists\, y \in X : \left\|x-y\right\| < \delta,\ \left\|g(x)-g(y)\right\| > \varepsilon \right\}
\]
this is the set of points in X where g (·) “amplifies” the distance with some
other point y beyond a small neighborhood of ε. By this definition:
and notice that upon taking the limit of the right-hand side as N → ∞,
the second term vanishes by definition of a continuous function, while the
third term is zero by hypothesis. Therefore:
6.2. Laws of Large Numbers
where the third line follows from independence between observations, while
the fourth line relies on observations being identically distributed (so that
they have the same moment-generating function); essentially, this analysis
is an extension of Theorem 3.6. From a Taylor expansion around t₀ = 0:
\[
M_{\bar{x}_N}(t) = \left[ 1 + \frac{t^T E\left[x\right]}{N} + o\left(\frac{t^T \iota}{N}\right) \right]^N
\]
Observe that Markov’s version of the Strong Law of Large Numbers does
not impose finite second moments, but only that the absolute moments of
order slightly larger than one, i.e. 1 + δ > 1, are finite. This is a seemingly
complex, but actually weaker condition (an analogue of which is also used
in certain versions of the Central Limit Theorem, as it is discussed later).
Other, more general versions of the Law of Large Numbers also allow for
weakly dependent observations – that is, n.i.n.i.d. samples – which are a
prominent feature of socio-economic settings. These results are extensively
applied in econometrics, but are not elaborated here. To give more intuition
about the working of the Law of Large Numbers, Figure 6.1 below displays
the results of multiple simulations about the sample mean calculated from
random samples of increasing size drawn from X ∼ Pois (4) – see the notes.
[Figure 6.1: four histograms of the realized sample means, one panel for each sample size N = 1, 10, 100, 1000.]
Note: histograms of realizations of X̄N obtained from multiple i.i.d. samples drawn from X ∼ Pois (4).
Each histogram is obtained with 800 samples of the indicated size N . The realizations of X̄N are binned
on the x-axes with bins of length 0.02. For all histograms, the y-axes measure the density of their bins.
Figure 6.1: Simulation of the Law of Large Numbers for X ∼ Pois (4)
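The simulation underlying Figure 6.1 can be replicated along the following lines (a sketch using the same Poisson(4) setup and 800 replications per sample size; the printed dispersion of the sample means shrinks as N grows, as the Law of Large Numbers predicts).

import numpy as np

rng = np.random.default_rng(5)
reps = 800
for N in (1, 10, 100, 1000):
    xbar = rng.poisson(lam=4, size=(reps, N)).mean(axis=1)   # 800 realizations of the sample mean
    print(N, xbar.mean(), xbar.var())                        # the dispersion shrinks as N grows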
One can show that if the assumptions that motivate Method of Moments
or Maximum Likelihood estimators are correct, these are consistent. This
is shown next by some “heuristic” (i.e. intuitive, not too rigorous) proofs.
Theorem 6.8. Consistency of the Method of Moments. An estimator
θ
bM M defined as the solution of a set of sample moments (5.2) is consistent
for the parameter set θ0 that solves the corresponding population moments
(5.1), if such a solution exists (i.e. if the estimation problem is well defined).
Proof. (Heuristic.) By some applicable Law of Large Numbers:
\[
\frac{1}{N}\sum_{i=1}^{N} m\left(x_i;\hat{\theta}_{MM}\right) \xrightarrow{\ p\ } E\left[m\left(x_i;\hat{\theta}_{MM}\right)\right] = 0
\]
6.3. Convergence in Distribution
then the r-th moment of x_N (and by extension all lower moments) converges to the corresponding moment of x:
\[
\lim_{N\to\infty} E\left[\left|x_N\right|^r\right] = E\left[\left|x\right|^r\right] < \infty
\]
so long as it is finite. However, this is not enough to guarantee that higher moments, not to mention the distribution function itself, converge to those of x.
[Figure: cumulative distribution functions F_X(x) of the Student's t-distribution for ν = 1, ν = 3, and the limit ν → ∞.]
where:
\[
\lim_{\nu_2\to\infty} \frac{\Gamma\left(\frac{\nu_1+\nu_2}{2}\right)}{\Gamma\left(\frac{\nu_2}{2}\right)} \left(\frac{1}{\nu_2+w}\right)^{\frac{\nu_1}{2}} = 2^{-\frac{\nu_1}{2}}
\]
follows by the properties of the Gamma function.
Example 6.5. Convergence in distribution of Hotelling’s t-squared
statistic. Recall the formulation of Hotelling’s rescaled t-squared statistic
for a given K, and express it as a random sequence.
\[
\frac{N-K}{K(N-1)}\,t^2_N = \frac{(N-K)\,N}{K(N-1)}\,\left(\bar{x}-\mu\right)^T S^{-1}\left(\bar{x}-\mu\right) \sim F_{K,N-K}
\]
For a given N , this statistic follows the F -distribution with paired degrees
of freedom K and N − K. By Observation 6.2, however, as the sample size
N grows large one obtains the following result.
\[
t^2_N \xrightarrow{\ d\ } \chi^2_K
\]
In words, Hotelling’s t-squared statistic (non-rescaled) converges in distri-
bution to a chi-squared distribution with K degrees of freedom.4 Similarly
as in the univariate case, this has important implications for multivariate
tests of hypothesis, say about the means of a multivariate normal distribu-
tion. As the sample grows large, these can be based on the relatively simple
chi-squared distribution, instead of the more involved F -distribution.
4
The rescaling factor is removed for two reasons. First, to apply Observation 6.2 one
should multiply the sequence by ν1 = K. Second, as N → ∞ the term (N − K) / (N − 1)
becomes irrelevant, and the asymptotic result holds irrespectively of it.
where the left-hand side is the limit of the cumulative distribution of the standardized maximum, while the right-hand side is the expression of the cumulative standardized GEV distribution. By taking the logarithm of this expression, the above is:
\[
\lim_{N\to\infty} N \log F_X\left(a_N x - b_N\right) = -\left(1+\xi x\right)^{-\frac{1}{\xi}}
\]
6.4. Central Limit Theorems
While the Extreme Value Theory is outside the scope of this discussion, it is worth briefly commenting on some implications of the Fisher-Tippett-Gnedenko Theorem.
1. First, the Theorem does not state that a standardized maximum always converges to a GEV distribution; it states that if it converges, the limit-
ing distribution is GEV. In this respect, the Theorem differs from other
results such as the Central Limit Theorem.
2. The implications of this result are not restricted to the maximum, but
extend to the minimum too. By defining Y = −X, for every N it clearly
is Y(1) = −X(N ) , which helps identify the distribution of the minimum if
that of the maximum is known (think for instance about the relationship
between the reverse Weibull and the “traditional” Weibull distribution).
This explains why the name of the theorem references “extreme values”
and not just maxima.
3. As mentioned, the proof of the Theorem sets conditions that allow to
identify which Type of GEV distribution is a possible limiting distribu-
tion of the maximum, by inspecting the cumulative distribution FX (x)
that generates the data. These conditions are quite technical, but some
of their implications are quite useful. For example, it is known that the
limiting distribution of the maximum associated with a random sample
drawn from the normal distribution is the Gumbel distribution.
In econometrics, the Extreme Value Theorem is invoked as the motiva-
tion behind specific assumptions made in certain models of decision-making,
where the random component of choice is assumed to follow a GEV distribu-
tion. In fact, a GEV distribution is a natural choice to model the maximum
value between multiple options that are considered by a decision-maker.
6.4. Central Limit Theorems
by a derivation analogous to the one in the proof of the Weak Law of Large
Numbers (Theorem 6.5). As in that proof, apply a Taylor expansion of the
above expression around t₀ = 0, but account for the second order element:
\[
\begin{aligned}
M_{\bar{\bar{z}}_N}(t) &= \left[ 1 + \frac{t^T E\left[z\right]}{\sqrt{N}} + \frac{t^T E\left[zz^T\right]t}{2N} + o\left(\frac{t^T t}{2N}\right) \right]^N \\
&= \left[ 1 + \frac{t^T t}{2N} + o\left(\frac{t^T t}{2N}\right) \right]^N
\end{aligned}
\]
where the second line exploits the fact that E[z] = 0 and that E[zz^T] = I. The notation ∼^A indicates that the normal distribution in question, called the asymptotic distribution, is approximate and is valid for a fixed N, instead of being a "limiting" distribution (recall from the discussion of Example 6.6 that a limiting distribution cannot be expressed in terms of N).
To illustrate, Figure 6.3 plots the empirical distribution of the standardized
sample means obtained with the same simulation of samples drawn from
the Poisson distribution as in Figure 6.1. The standardization implies that
the limiting distribution is the standard normal.5 Despite the complications
entailed in the representation of a distribution via histograms, the simula-
tion highlights how the limiting distribution is approximated increasingly
better as the sample size increases.
5 For intuition, it is as if the univariate version of the sequence z̄̄_N, call it say Z̄̄_N, is plotted in Figure 6.1.
[Figure 6.3: four histograms of the standardized sample means, one panel for each sample size N = 1, 10, 100, 1000, each with the standard normal density superimposed.]
Note: histograms of realizations of √N(X̄_N − 4)/2 obtained from multiple i.i.d. samples drawn from X ∼ Pois(4). Each histogram is obtained with 800 samples of the indicated size N. All the realizations are binned on the x-axes with bins of length 0.25. For all histograms, the y-axes measure the density of their bins. Density functions of the standard normal distribution are superimposed upon each histogram.
Figure 6.3: Simulation of the Central Limit Theorem for X ∼ Pois (4)
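Similarly, the standardized sample means of Figure 6.3 can be generated as follows (a sketch; for the Poisson(4) distribution both the mean and the variance equal 4, so the standard deviation used in the standardization is 2). The Kolmogorov-Smirnov distance from the standard normal shrinks as N grows.

import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(6)
reps = 800
for N in (1, 10, 100, 1000):
    xbar = rng.poisson(lam=4, size=(reps, N)).mean(axis=1)
    z = np.sqrt(N) * (xbar - 4) / 2                  # standardized sample means
    print(N, kstest(z, "norm").statistic)            # distance from the standard normal shrinks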
for any two elements k, ℓ = 1, . . . , K of the random vector x and for all ob-
servations i. Under the hypothesis of independent observations, the asymp-
totic properties of most econometric estimators are obtained by invoking
Ljapunov’s Central Limit Theorem, hence conditions akin to (6.1) are rou-
tinely invoked and they are referred to as the “Ljapunov conditions.”
Example 6.7. Asymptotic normality of the linear regression esti-
mator. Let us return once again to the Method of Moments estimator of
the bivariate linear regression slope from Example 6.3. Rewrite it as:
\[
\begin{aligned}
\hat{\beta}_{1,MM} &= \frac{\sum_{i=1}^{N}\left(X_i-\bar{X}\right)Y_i}{\sum_{i=1}^{N}\left(X_i-\bar{X}\right)^2} \\
&= \frac{\sum_{i=1}^{N}\left(X_i-\bar{X}\right)\left(\beta_0+\beta_1 X_i\right)}{\sum_{i=1}^{N}\left(X_i-\bar{X}\right)^2} + \frac{\sum_{i=1}^{N}\left(X_i-\bar{X}\right)\left(Y_i-\beta_0-\beta_1 X_i\right)}{\sum_{i=1}^{N}\left(X_i-\bar{X}\right)^2} \\
&= \beta_1\,\frac{\sum_{i=1}^{N}\left(X_i-\bar{X}\right)X_i}{\sum_{i=1}^{N}\left(X_i-\bar{X}\right)^2} + \frac{\sum_{i=1}^{N}\left(X_i-\bar{X}\right)\varepsilon_i}{\sum_{i=1}^{N}\left(X_i-\bar{X}\right)^2} \\
&= \beta_1 + \frac{\frac{1}{N}\sum_{i=1}^{N}\left(X_i-\bar{X}\right)\varepsilon_i}{\frac{1}{N}\sum_{i=1}^{N}\left(X_i-\bar{X}\right)^2}
\end{aligned}
\]
where
\[
\varepsilon_i \equiv Y_i - \beta_0 - \beta_1 X_i
\]
is the so-called error term of the regression model – that is, the devia-
tion that occurs between Yi and the linear conditional expectation function
E [Yi | Xi ] = β0 + β1 Xi . The error term can be interpreted as a transformed
random variable defined as a linear combination of the “primitive” random
variables Yi and Xi . Note that E [εi ] = 0 by the hypotheses on β0 .
Recall that in the bivariate linear regression model, the Law of Iterated
Expectations implies E [Xi εi ] = 0. This observation provides another av-
enue for showing consistency of the MM estimator of the regression slope.
In fact, by the Continuous Mapping Theorem:
\[
\frac{1}{N}\sum_{i=1}^{N}\left(X_i-\bar{X}\right)\varepsilon_i \xrightarrow{\ p\ } \underbrace{E\left[X_i\varepsilon_i\right]}_{=0} - \underbrace{E\left[X_i\right]E\left[\varepsilon_i\right]}_{=0} = 0
\]
implying β̂_{1,MM} →^p β₁. Furthermore, since the expression on the left-hand side is a sample mean, under the proper assumptions about the sample an applicable Central Limit Theorem implies the following.
\[
\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\left(X_i-\bar{X}\right)\varepsilon_i \xrightarrow{\ d\ } N\left(0,\ E\left[\varepsilon_i^2\left(X_i-E\left[X_i\right]\right)^2\right]\right) \tag{6.2}
\]
In (6.2) the limiting variance takes the stated form because X̄ →^p E[X_i] at the probability limit. The limiting variance obtains as:
\[
\text{Var}\left[\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\left(X_i-E\left[X_i\right]\right)\varepsilon_i\right] = \frac{1}{N}\sum_{i=1}^{N}\text{Var}\left[\left(X_i-E\left[X_i\right]\right)\varepsilon_i\right] = E\left[\varepsilon_i^2\left(X_i-E\left[X_i\right]\right)^2\right]
\]
while in the even more specialized case where the squared deviations of X_i and ε_i from their respective means are mutually independent, it is:
\[
E\left[\varepsilon_i^2\left(X_i-E\left[X_i\right]\right)^2\right] = E\left[\varepsilon_i^2\right]E\left[\left(X_i-E\left[X_i\right]\right)^2\right] = \sigma^2_\varepsilon \cdot \text{Var}\left[X_i\right]
\]
where σ2ε ≡ E [ε2i ]. This latter case is the one where the conditional variance
function of εi given Xi is actually a constant – a scenario commonly defined
homoscedasticity (as opposed to heteroscedasticity, the general case).
The expression in (6.2), the above decomposition of the MM estimator of the bivariate linear regression slope, the Cramér-Wold device, as well as the following implication of the Continuous Mapping Theorem:
\[
\left[\frac{1}{N}\sum_{i=1}^{N}\left(X_i-\bar{X}\right)^2\right]^{-1} \xrightarrow{\ p\ } \left[\text{Var}\left[X_i\right]\right]^{-1}
\]
all imply that the limiting distribution of the MM estimator is:
\[
\sqrt{N}\left(\hat{\beta}_{1,MM}-\beta_1\right) \xrightarrow{\ d\ } N\left(0,\ \frac{E\left[\varepsilon_i^2\left(X_i-E\left[X_i\right]\right)^2\right]}{\left(\text{Var}\left[X_i\right]\right)^2}\right) \tag{6.3}
\]
and for some given N, its asymptotic distribution is as follows.
\[
\hat{\beta}_{1,MM} \overset{A}{\sim} N\left(\beta_1,\ \frac{1}{N}\,\frac{E\left[\varepsilon_i^2\left(X_i-E\left[X_i\right]\right)^2\right]}{\left(\text{Var}\left[X_i\right]\right)^2}\right) \tag{6.4}
\]
In the more specialized "homoscedastic" case, the limiting distribution is:
\[
\sqrt{N}\left(\hat{\beta}_{1,MM}-\beta_1\right) \xrightarrow{\ d\ } N\left(0,\ \frac{\sigma^2_\varepsilon}{\text{Var}\left[X_i\right]}\right) \tag{6.5}
\]
while the asymptotic distribution is derived consequently.
\[
\hat{\beta}_{1,MM} \overset{A}{\sim} N\left(\beta_1,\ \frac{1}{N}\,\frac{\sigma^2_\varepsilon}{\text{Var}\left[X_i\right]}\right) \tag{6.6}
\]
A proper econometric treatment of the linear regression model would discuss
the multivariate generalization of these expressions, while additionally intro-
ducing the appropriate estimators for the unknown variances (or variance-
covariances) of these distributions – that is, estimators of the asymptotic
variances in (6.4) and (6.6).
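A simulation sketch of these formulas follows, under a homoscedastic design so that (6.5)–(6.6) apply; all parameter values are illustrative. The empirical variance of the estimator across replications should be close to σ²_ε/(N·Var[X_i]).

import numpy as np

rng = np.random.default_rng(7)
beta0, beta1, N, reps = 1.0, 2.0, 200, 5000

slopes = np.empty(reps)
for r in range(reps):
    x = rng.normal(size=N)
    eps = rng.normal(size=N)                     # homoscedastic error, variance 1
    y = beta0 + beta1 * x + eps
    slopes[r] = np.sum((x - x.mean()) * y) / np.sum((x - x.mean()) ** 2)

# Empirical variance of the estimator vs. the asymptotic formula sigma_eps^2 / (N * Var[X]).
print(slopes.var(), 1.0 / (N * 1.0))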
The next result is instrumental for the analysis and derivation of the asymptotic properties of many estimators.
Theorem 6.16. Delta Method. Suppose that some random sequence of
dimension K, x_N, is asymptotically normal:
\[
\sqrt{N}\left(x_N - c\right) \xrightarrow{\ d\ } N\left(0,\Upsilon\right) \tag{6.7}
\]
for some K × 1 vector c and some K × K matrix Υ. In addition, consider some vector-valued function d(x) : R^K → R^J. If the latter is continuously differentiable at c and the J × K Jacobian matrix
\[
\Delta \equiv \frac{\partial}{\partial x^T} d(c)
\]
has full row rank J, the limiting distribution of d(x_N) is as follows.
\[
\sqrt{N}\left(d\left(x_N\right) - d(c)\right) \xrightarrow{\ d\ } N\left(0,\ \Delta\Upsilon\Delta^T\right) \tag{6.8}
\]
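A standard one-dimensional illustration (a sketch of how the theorem is typically applied; the choice of the logarithmic transformation is only an example): if √N(X̄_N − µ) →^d N(0, σ²) with µ > 0 and d(x) = log x, then ∆ = 1/µ and the Delta Method gives
\[
\sqrt{N}\left(\log \bar{X}_N - \log \mu\right) \xrightarrow{\ d\ } N\left(0,\ \frac{\sigma^2}{\mu^2}\right)
\]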
Proof. The proof applies the same logic as the Delta Method. By the mean
value theorem, the sample moment conditions are developed as:
\[
\begin{aligned}
0 &= \frac{1}{N}\sum_{i=1}^{N} m\left(x_i;\hat{\theta}_{MM}\right) \\
&= \frac{1}{N}\sum_{i=1}^{N} m\left(x_i;\theta_0\right) + \left[\frac{1}{N}\sum_{i=1}^{N} \frac{\partial}{\partial\theta^T} m\left(x_i;\tilde{\theta}_N\right)\right]\left(\hat{\theta}_{MM}-\theta_0\right)
\end{aligned}
\]
where the expression on the first line is equal to zero by construction of all Method of Moments estimators. After multiplying both sides by √N and some manipulation the above expression is rendered as follows.
\[
\sqrt{N}\left(\hat{\theta}_{MM}-\theta_0\right) = -\left[\frac{1}{N}\sum_{i=1}^{N} \frac{\partial}{\partial\theta^T} m\left(x_i;\tilde{\theta}_N\right)\right]^{-1} \frac{1}{\sqrt{N}}\sum_{i=1}^{N} m\left(x_i;\theta_0\right)
\]
Note that, since this is a random sample:
1. by a suitable Central Limit Theorem:
\[
-\frac{1}{\sqrt{N}}\sum_{i=1}^{N} m\left(x_i;\theta_0\right) \xrightarrow{\ d\ } N\left(0,\ \text{Var}\left[m\left(x_i;\theta_0\right)\right]\right)
\]
since E[m(x_i; θ_0)] = 0 by hypothesis;
2. while by the Weak Law of Large Numbers:
\[
\frac{1}{N}\sum_{i=1}^{N} \frac{\partial}{\partial\theta^T} m\left(x_i;\tilde{\theta}_N\right) \xrightarrow{\ p\ } E\left[\frac{\partial}{\partial\theta^T} m\left(x_i;\theta_0\right)\right]
\]
since θ̃_N →^p θ_0 by consistency of the estimator (at the limit, θ̃_N, θ̂_MM and θ_0 all coincide).
These intermediate results are combined via the Continuous Mapping Theorem, Slutskij's Theorem and the Cramér-Wold device so as to imply the statement. Consequently, for a fixed N the asymptotic distribution is:
\[
\hat{\theta}_{MM} \overset{A}{\sim} N\left(\theta_0,\ \frac{1}{N}\,M_0\Upsilon_0 M_0^T\right)
\]
which concludes the proof.
An analogous result holds for Maximum Likelihood estimators as well, and
the proof is almost identical. In this case, however, the result is especially
powerful, as the asymptotic variance coincides with the Cramér-Rao bound.
where I(θ_0) – without the N subscript – is the expression for the following "single-observation" information matrix evaluated at θ_0.
\[
I\left(\theta_0\right) \equiv E\left[ \left(\frac{\partial}{\partial\theta}\log f_x\left(x_i;\theta_0\right)\right) \left(\frac{\partial}{\partial\theta}\log f_x\left(x_i;\theta_0\right)\right)^T \right] = -E\left[ \frac{\partial^2}{\partial\theta\,\partial\theta^T}\log f_x\left(x_i;\theta_0\right) \right]
\]
Consequently, θ̂_MLE asymptotically attains the Cramér-Rao bound. In analogy with the Method of Moments case, a mean value expansion of the Maximum Likelihood First Order Conditions yields:
\[
\sqrt{N}\left(\hat{\theta}_{MLE}-\theta_0\right) = -\left[\frac{1}{N}\sum_{i=1}^{N} \frac{\partial^2}{\partial\theta\,\partial\theta^T}\log f_x\left(x_i;\tilde{\theta}_N\right)\right]^{-1} \frac{1}{\sqrt{N}}\sum_{i=1}^{N} \frac{\partial}{\partial\theta}\log f_x\left(x_i;\theta_0\right)
\]
but in this case some additional simplifications are possible, thanks to the
Information Matrix Equality. In fact, under the regularity conditions:
1. a suitable Central Limit Theorem implies that:
\[
-\frac{1}{\sqrt{N}}\sum_{i=1}^{N} \frac{\partial}{\partial\theta}\log f_x\left(x_i;\theta_0\right) \xrightarrow{\ d\ } N\left(0,\ I\left(\theta_0\right)\right)
\]
As the last example has shown, the specific formulae for the estimation of
the asymptotic variance are typically context- and assumption-dependent.
In random samples, however, it is easy to establish expressions with a more
general validity. Consider Method of Moments estimators first; in the i.i.d.
framework, a general expression for a consistent estimator of their asymptotic variance is given by N⁻¹·M̂_N Υ̂_N M̂_N^T, where
\[
\hat{M}_N \equiv \left[\frac{1}{N}\sum_{i=1}^{N} \frac{\partial}{\partial\theta^T} m\left(x_i;\hat{\theta}_{MM}\right)\right]^{-1} \xrightarrow{\ p\ } M_0
\]
is a consistent estimator of M_0 (by some applicable Law of Large Numbers and the Continuous Mapping Theorem), while
\[
\hat{\Upsilon}_N \equiv \frac{1}{N}\sum_{i=1}^{N} \left[m\left(x_i;\hat{\theta}_{MM}\right)\right]\left[m\left(x_i;\hat{\theta}_{MM}\right)\right]^T \xrightarrow{\ p\ } \Upsilon_0
\]
is also a consistent estimator of the variance of the zero moment conditions by some applicable Law of Large Numbers, since in a random sample the following holds.6
\[
\Upsilon_0 = \text{Var}\left[m\left(x_i;\theta_0\right)\right] = E\left[\left(m\left(x_i;\theta_0\right)\right)\left(m\left(x_i;\theta_0\right)\right)^T\right]
\]
These estimating matrices are not only based on sample analogues of their population counterparts (the objects of estimation), something which is indicated with the subscript N instead of 0; in addition, they are also evaluated at the estimated parameters θ̂, which is symbolized by the wide "hat" used to denote them. In the Maximum Likelihood case, the information matrix equality offers two alternative routes for estimating the asymptotic variance.
The first option is based on the Hessian of the mass or density function:
\[
\hat{H}_N \equiv -\frac{1}{N}\sum_{i=1}^{N} \frac{\partial^2}{\partial\theta\,\partial\theta^T} \log f_x\left(x_i;\hat{\theta}_{MLE}\right) \xrightarrow{\ p\ } I\left(\theta_0\right)
\]
while the second option exploits the "squared" score.
\[
\hat{J}_N \equiv \frac{1}{N}\sum_{i=1}^{N} \left(\frac{\partial}{\partial\theta}\log f_x\left(x_i;\hat{\theta}_{MLE}\right)\right) \left(\frac{\partial}{\partial\theta}\log f_x\left(x_i;\hat{\theta}_{MLE}\right)\right)^T \xrightarrow{\ p\ } I\left(\theta_0\right)
\]
Both matrices Ĥ_N and Ĵ_N are evaluated at the MLE solution and, in random samples, consistently estimate the information matrix; in practice the choice of a specific option is usually based on convenience. It is important to become familiar with this type of result and the associated notation, as they are typical of the standard treatment of econometric theory.
6 Under fairly general assumptions, it is Υ̂_N →^p Υ_0 also with i.n.i.d. observations.
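A minimal sketch of the two estimators Ĥ_N and Ĵ_N in a simple case: Maximum Likelihood estimation of the exponential-distribution rate λ, for which the score of one observation is 1/λ − x and its derivative is −1/λ²; the data, the true rate and the sample size are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(8)
lam0 = 2.0
x = rng.exponential(scale=1.0 / lam0, size=1000)   # hypothetical sample with rate lambda = 2

lam_hat = 1.0 / x.mean()                            # MLE of the exponential rate

# Hessian-based estimator of the information: -(1/N) * sum of second derivatives.
H_N = np.mean(np.full_like(x, 1.0 / lam_hat ** 2))
# Score-based ("squared score") estimator of the information.
J_N = np.mean((1.0 / lam_hat - x) ** 2)

# Either estimate yields an estimated asymptotic variance I(lambda)^(-1) / N.
print(H_N, J_N, 1.0 / (len(x) * H_N), 1.0 / (len(x) * J_N))

The two printed information estimates should be close to each other and to the theoretical value 1/λ², consistently with the information matrix equality.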
Part II
Econometric Theory
Lecture 7
7.1. Linear Socio-economic Relationships
Ci = c0 + c1 Yi + εi (7.4)
[Figure 7.1: the association between the income (Y) and consumption (C) time series, with observations labeled by year.]
Figure 7.1 depicts an example of the association between two time series
of Y and C based on actual macroeconomic data (both series are normalized
to 1992 prices). The relationship between the two variables in question ap-
pears to be robustly linear, so that at first sight a structural relationship like
(7.4) seems justified. Applied macroeconometric research has demonstrated
that, in fact, the strong linear association between the levels of income and
consumption which is typically observed in the data is spurious, in the sense
that the variation of both variables, while certainly reciprocally related, is
influenced by parallel trends that influence both income and consumption.
This finding has led to the development of more sophisticated linear and
non-linear models for the analysis of macroeconomic time series.
Si = γ0 + γ1 Zi + φ1 Xi + φ2 Xi2 + ψ0 αi + ηi (7.7)
240
7.1. Linear Socio-economic Relationships
13
12
11
10
7
6 8 10 12 14 16 18 20
Si : Years of education
241
7.1. Linear Socio-economic Relationships
panel or longitudinal data are typically more informative and useful for the
purposes of microeconometric analysis (while also being typically costlier
to gather and often not readily available); however, cross-sectional data –
by virtue of being simpler – have pedagogical value in selected settings.
The scatter plot displayed in Figure 7.2 represents the raw relationship
between (log) wages and the attained level of education, while ignoring other
variables (such as say experience and ability) which also might bear effects
on earnings. Unlike the association between consumption and income from
Figure 7.1, that the relationship between log-wages and education is linear
is not immediately clear through the visual representation of the data. How-
ever, repeated analyses have shown that this relationship is “more linear”
than the one between the level of wages and education (an observation that
helps motivate the popularity of the Mincer equation). In addition, observe
that the independent education variable Si takes values upon a discrete set,
which is a typical feature of many microeconometric datasets.
Both 7.1 and 7.2 are valid examples of economic structural relationships
framed through linear equations. However, as motivated at length over the
course of this lecture, the linear model is a powerful tool whose properties
are quite useful even when studying relationships that are not strictly struc-
tural or, under certain conditions, that are structural but not necessarily
linear. The analysis of the linear model often benefits from a mathematical
representation based on compact matrix notation. Define:
y1 xT1 x 11 x 21 . . . x K1 ε1
y2 xT x12 x22 . . . xK2 ε2
2
y = .. ; X = .. = .. .. . .. ; ε = ..
. . . . . . . .
T
yN xN x1N x2N . . . xKN εN
that is, y, X and ε are obtained by vertically stacking over all observations,
respectively, the realization yi of the dependent variable, the transpose of
the vector xi , and the error term εi . If the model features a constant term,
the first column of X is the ιN vector whose entries are all equal to 1. With
compact matrix notation, (7.1) can be conveniently written as follows.
y = Xβ + ε (7.8)
Econometric models are often written in terms of realizations, not in terms
of abstract random variables, vector or matrices (note that the distinction
does not apply to the error terms, which cannot be observed by definition).
Because of this convention, in what follows the notation adopted to describe
specific models alternates between the two cases, depending on purpose and
convenience (that is, which notation best clarifies a certain concept).
242
7.2. Optimal Linear Prediction
with ei ≡ yi − ybi and where L (ei ) has the properties that it is increasing in
|yi − ybi | and that L (0) = 0. If the researcher aims at specifying a predictor
that is consistent across different realizations of xi , a sensible criterion is to
choose the function m (xi ) that minimizes the expected loss:
h i
E L Yi − Ybi = E [L (Yi − my (xi ))] (7.10)
where the expectation is taken on the joint support of Yi and (X1i , . . . , XKi ).
This still leaves open the question about the choice of the working loss
function L (ei ). In general, this choice may depend on the context; here, the
analysis is focused on the quadratic loss L (ei ) = e2i . The quadratic loss is
appealing, since deviations of the prediction from the “true” realization of
Yi are disproportionately more “harmful” the higher they are. The expected
quadratic loss is the so-called mean squared error of prediction.
Alternative loss criteria exist. For example, the absolute loss L (ei ) = |ei |
differs from the quadratic loss in that it does not disproportionately punish
large mistakes. For some p ∈ (0, 1), the quantile loss
243
7.2. Optimal Linear Prediction
is asymmetric: for prediction errors of the same absolute size |ei |, it punishes
underprediction (ei > 0) more than overprediction (ei < 0) if p > 0.5 – and
vice versa; the asymmetry increases the farther p departs from 0.5. Observe
that the absolute loss is a special case of the quantile loss, for p = 0.5.
The remainder of this analysis focuses, as anticipated, on the quadratic
loss. The Mean Squared Error of prediction is associated with a well-known
statistical result.
The fact that the conditional expectation function – the population con-
ditional average of Yi – is the best predictor may appear intuitive; however,
this results does not generally hold for all loss functions! In fact, it is limited
to the quadratic loss. Under different loss functions the optimal predictor is
different: for example, the one associated with the absolute loss L (ei ) = |ei |
is the conditional median of Yi given xi ; with the quantile loss, the optimal
predictor is the p-th conditional quantile of Yi given xi . Nevertheless, the
main result should be reminiscent of the simpler observation that the mean
E [X] of a random variable X is the latter’s best “guess” (predictor) under a
quadratic loss criterion (Lecture 1). Theorem 7.1 generalizes that finding.
244
7.2. Optimal Linear Prediction
where h 2 i
β∗ ∈ arg min E Yi − xT i β (7.12)
β∈RK
that is, β∗ is one specific coefficient vector which, among all predictors that
are linear in xi , minimizes the Mean Squared Error (observe that β∗ needs
not be unique). The implications are summarized with the next result.
that is, any optimal linear predictor of Yi is also an optimal linear predictor
of the CEF, E [Yi | xi ], in the MSE sense.
Proof. The demonstration is analogous to that of Theorem 7.1:
h 2 i h 2 i
T T
E Yi − x i β = E Yi − E [Yi | xi ] + E [Yi | xi ] − xi β
h 2 i
= E (Yi − E [Yi | xi ])2 + E E [Yi | xi ] − xT
i β
T
+ 2 E (Yi − E [Yi | xi ]) E [ Yi | xi ] − xi β
h 2 i
= E (Yi − E [Yi | xi ])2 + E E [Yi | xi ] − xT
i β
h i
2
= Var [Yi | xi ] + E E [ Yi | xi ] − xT
i β
where again, the cross-term in the third line disappears by a proper applica-
tion of the Law of Iterated Expectations. Therefore, the two minimizers in
(7.12) and (7.13) are identical so long as Var [Yi | xi ] is a finite constant.
245
7.2. Optimal Linear Prediction
This observation has had a profound impact on motivating the use of linear
regression analysis in contexts where the true form of the statistical depen-
dence between a dependent variable Yi and a set of independent variables
(X1i , . . . , XKi ) is unknown. This important interpretation is revisited later
at the end of this lecture.
With the knowledge about the relationship between optimal predictors,
optimal linear predictors and the CEF at hand, it is convenient to express
the solution of the optimal linear prediction problem. One can rewrite the
First Order Conditions of the problem in (7.12) as:3
(7.14)
∗
E xi xT
i β = E [xi Yi ]
246
7.2. Optimal Linear Prediction
247
7.2. Optimal Linear Prediction
for any nonnegative integer r. As one can autonomously verify, the combi-
nation of all these hypotheses implies the following optimal linear predictor.
5 1
+ Xi
py (Xi ) =
12 2
Note that one would obtain a different optimal linear predictor if Xi were
to follow a different distribution, including say a uniform distribution with
a different support!
f Yi |Xi (yi | xi )
0.8
0.4
0
0 py (Xi ) 4
1 E [Yi | Xi ] 3
2 2
xi 3 1
yi
4 0
5 −1
Note: the continuous curve is the CEF: E [ Yi | Xi ] = Xi − Xi2 /10; the dash-dotted line is the optimal
linear predictor py (Xi ) = 5/12 + Xi /2. The conditional distribution Yi | Xi is normal, with parameters
that vary as a function of Xi . Selected density functions of Yi | Xi are displayed for xi = {1, 2, 3, 4}.
The result is illustrated graphically in Figure 7.3 above, where the contin-
uous curve represents the quadratic CEF while the dash-dotted (straight)
line is the optimal linear predictor, which is at the same time the best linear
approximation of the quadratic CEF. To help visualize the random nature
of the relationship between Yi and Xi , the conditional distribution of the
former given the latter is displayed as normal, but the analysis developed
in this example does not depend on this specific coincidence.
248
7.3. Analysis of Least Squares
N
1 X 2
b ∈ arg min y i − xT
i β (7.17)
β∈RK N i=1
249
7.3. Analysis of Least Squares
A careful reader will have noted that the derivation of the Least Squares
solution bears many analogies with Method of Moments estimation. For the
moment, however, it is better to abstract from any statistical assumptions
that might lead to make statements about estimation. A more immediately
useful exercise is to rather familiarize with both the analytic vector notation
(based on scalars like yi and vectors like xi ) and compact matrix notation.
In fact, both are useful in their own right: while the former provides more
visual information about certain computational details, the latter is better
suited to synthetically express some more convoluted formulae. For a start,
one should understand that the two K × K matrices i=1 xi xi and XT X
PN T
250
7.3. Analysis of Least Squares
where: −1
P X ≡ X XT X XT (7.25)
is called the projection matrix, which if pre-multiplied to y results in the
vector of fitted values y
b. Furthermore:
e=y−y b
= y − Xb
= (I − PX ) y
= MX y
where: −1
M X ≡ I − P X = I − X XT X XT (7.26)
is the so-called residual maker matrix. Pre-multiplying y by the residual
maker matrix clearly results in the vector of residuals e.
The projection and residual maker matrices have important properties.
They are both symmetric:
PX = PT
X
MX = MT
X
idempotent:4
PX PX = PX
MX MX = MX
and they are orthogonal to one another.
PX MX = MX PX = 0
In addition, it is easy to see that:
PX X = X
MX X = 0
with a straightforward interpretation: if one projects the columns of X onto
themselves, the projection is identical to X and the residuals, consequently,
are zero. Finally, observe that:
y = (I + PX − PX ) y
= PX y + MX y
=yb+e
−1 T −1 T −1 T
4
Note that PX PX = X XT X X X XT X X = X XT X X = PX and
MX MX = (I − PX ) (I − PX ) = I − 2PX + PX PX = I − PX = MX . The other results
follow easily from these observations.
251
7.3. Analysis of Least Squares
and:
bT e = yT PX MX y
y
= eT y
b = y T M X PX y
=0
that is, the decomposition of the vector y between the fitted values y b and
the residuals e is such that these two components are orthogonal to one
another. This fact relates to the all-important geometric interpretation
of the Least Squares solution b, seen as the vector which, through the linear
combination y b = PX y = Xb, results in the geometrical projection of y
onto the column space of X.5 In fact, inspecting the normal equations (7.19)
or (7.21) reveals how the residual vector e is by construction orthogonal to
the space S (X) spanned by the columns of X (the K explanatory variables).
This is graphically represented in Figure 7.4 for the case of K = 2.
e S (X)
X∗,2
90◦
y
b
X∗,1
5
Hence names such as “projection matrix” and “linear projection.”
252
7.3. Analysis of Least Squares
where:
X∗2 ≡ MX1 X2
and MX1 is the residual maker matrix of X1 .
−1 T
MX1 ≡ I − X1 XT 1 X1 X1
Furthermore, a symmetrical result is obtained for b1 .
Proof. By the algebra of partitioned matrices, one can write b1 as a function
of b2 as: −1 T −1 T
b1 = XT 1 X1 X1 y − XT 1 X1 X1 X2 b2
plugging the above in the lower block of K2 rows in (7.28) gives:
−1 T −1 T
XT T
2 X 1 X1 X1 X1 y − XT T
2 X1 X1 X 1 X1 X2 b2 + XT T
2 X2 b2 = X2 y
with solution:
h
T
T
−1 T i−1 h T
−1 T i
b2 = X2 I − X1 X1 X1 X1 X2 X 2 I − X 1 X1 X 1 X1 y
−1
= XT XT
2 MX1 X2 2 MX1 y
253
7.3. Analysis of Least Squares
While this theorem might, at a first glance, look like a bunch of trivial if
nasty-looking algebraic formulas, it does deliver quite a fundamental insight:
any component (b2 ) of the least squares solution is algebraically equivalent
to another least squares solution, which follows from a transformed model
where the explanatory variables in question (X2 ) are substituted with the
corresponding residuals (X∗2 = MX1 X2 ) that are obtained from projecting
them on the other explanatory variables (X1 ). Recall our earlier observation
that the least squares projection returns a vector of fitted values and a vector
of residuals that are reciprocally orthogonal. What the Frisch-Waugh-Lovell
Theorem means in this framework is that in a linear model with multiple
explanatory variables, each coefficient bk obtained via Least Squares can be
interpreted as the overall “contribution” of Xki to Yi , after the contributions
of the other K − 1 explanatory variables to Yi has been netted out or, using
more technical terminology, partialled out.
This property explains by a great deal the immense popularity of statisti-
cal estimators based on the Least Squares principle in econometric analysis.
In fact, it allows researchers to interpret the estimated coefficients associ-
ated with a single socio-economic variable of interest by pretending that all
other variables included in the model are taken “as given,” corresponding
with the typical ceteris paribus type of scientific thought experiments. Note
that the theorem does not motivate the exclusion of relevant explanatory
variables from the analysis, except for cases when they can be confidently
assumed to be statistically unrelated to the variables of interest (say, X2 ).
Notice, in fact, that only if X1 and X2 are orthogonal (XT 1 X2 = 0) it holds
that X∗2 = MX1 X2 = X2 , meaning that the least squares coefficients asso-
ciated with the X2 explanatory variables are identical whether one includes
the remaining factors X1 in the model or not.
The properties of partitioned Least Squares appear particularly powerful
when considering any partitioned model with K1 = K − 1 and K2 = 1. In
this case, let X2 = s; the K-th Least Squares coefficient is as follows.
sT MX1 y
bK = (7.30)
sT MX1 s
This quantity is related to a statistical object called the partial correla-
tion coefficient ρ∗Y S between variables Yi and XKi = Si :
sT MX1 y
ρ∗Y S = √ p (7.31)
sT MX1 s yT MX1 y
which is nothing else but the correlation coefficient of the residuals of both
Yi and Si , obtained by projecting these on the other K − 1 explanatory
254
7.3. Analysis of Least Squares
measure the correlation between two variables once the dependence of both
(in the linear projection sense) from other variables has been removed. The
algebraic relationship between (7.30) and (7.31) appears evident, which ex-
plains why Least Squares coefficients are often attributed an interpretation
in terms of partial correlation (although the two are not identical).6
Another oft-invoked application of the Frisch-Waugh-Lovell Theorem is
demeaning, that is the operation of subtracting the respective mean from
both the explanatory and dependent variables. Suppose that X1 = ι is the
“constant term” of the model (an N -sized vector of ones), so that X2 entails
K −1 columns like in model (7.3). In such a case, the residual maker matrix
reads as:
−1 T 1
D ≡ MX1 = I − ι ιT ι ι = I − ιιT (7.33)
N
hence for any vector a of length N , it is:
Da = a − āι
(7.34)
yi − ȳ = β1 (xi1 − x̄1 ) + · · · + β(K−1) xi(K−1) − x̄(K−1) + εi
PN PN
where x̄ ≡ N1 i=1 xi , ȳ ≡ N1 i=1 yi , while D is defined as in (7.33); the analogies of
the above expression with both (5.16) and (7.31) are obvious.
255
7.4. Evaluation of Least Squares
Thanks to the above properties, the inclusion of a constant term into the
model allows to employ a common criterion for the evaluation of the Least
Squares’ goodness of fit, defined as the extent by which the linear combina-
tion ybi = xTi b explains, in a statistical sense, the variation of the dependent
variable yi . This criterion is called coefficient of determination R2 and
is defined as: PN
ESS yi − ȳ)2
(b
2
R = = Pi=1N 2
∈ [0, 1] (7.35)
TSS i=1 (y i − ȳ)
where the term in the numerator relates to the variance of the fitted values
(note that N i=1 ybi = ȳ because of the inclusion of a constant term):
1
PN
N
X
ESS ≡ yi − ȳ)2 = bT XT DXb
(b
i=1
and is called instead Total Sum of Squares (TSS). The difference between
these two quantities is called Residual Sum of Squares (RSS).
N
X
RSS ≡ TSS − ESS = e2i = eT e
i=1
256
7.4. Evaluation of Least Squares
To see that the RSS equals the sum of the squared residuals, observe
first that with the inclusion of a constant term into the model the mean of
the residuals themselves is zero, hence De = e and
bT XT DXb = y
bT Db
y
bT D (y + e)
=y
bT Dy
=y
and therefore:
bT Db
y y bT Dy y
y bT Dy bT Dy
y
R2 = = · = =
yT Dy yT Dy ybT Dby yT Dy
| {z }
=1
hP i2
N
(yi − ȳ) (b
i=1 yi − ȳ)
= hP i hP i
N 2 N 2
yi − ȳ)
i=1 (b i=1 (yi − ȳ)
that is, the R2 coefficient is equal to the square of the correlation coefficient
between yi and the fitted values ybi (hence its name).
7
Another way to see this is:
N
X N
X N
X N
X
2 2 2
(yi − ȳ) = (yi − ybi + ybi − ȳ) = e2i + yi − ȳ)
(b
i=1 i=1 i=1 i=1
PN PN PK PN
noting that i=1 (yi − ybi ) (b
yi − ȳ) = i=1 (yi − ybi ) ybi = k=1 bk i=1 xik (yi − ybi ) = 0
follows from the normal equations.
257
7.4. Evaluation of Least Squares
where R20 and R21 are the two coefficients of determination calculated respectively prior
to and posterior to the inclusion of S.
258
7.4. Evaluation of Least Squares
92
91
90
3.5 89
88
87
86
85
3 84
83
82
81
80
79
78
2.5 77
76
75
74
73
72
2 70
71
69
68
67
66
65
1.5 63
64
62
61
60
59
1
1.5 2 2.5 3 3.5 4 4.5 5
Yi : Disposable income, 1996 $ trillions
In fact, econometric research has shown that strong linear fits of this sort
are standard properties of macroeconomic time series, which intuitively are
due to the co-movement of variables because of some other factors that are
possibly unaccounted by the model. Therefore, the estimate c1 ' 0.80 can
hardly be interpreted as the average increase in aggregate consumption that
follows from the increase of a country’s GDP.
Example 7.5. Human capital and wages, revisited. A simple linear fit
of the relationship between the logarithmic wage and the education variable
from example 7.2 returns a slope coefficient b ' 0.064 and an R2 coefficient
equal to 0.076 (see Figure 7.6, top panel). This does not imply that such a
relationship is meaningless, quite the contrary! By enriching the model with
a squared polynomial for the experience variable as in the proper Mincer
equation (7.6), one would obtain a higher slope coefficient, up to b0 ' 0.105;
while the R2 coefficient increases too, as expected, up to about 0.176. Note
that here K is small relative to the sample size (the selected cross section of
the original dataset has size N = 1, 241), hence the adjusted R2 is virtually
identical to the standard R2 in both calculations.
259
7.4. Evaluation of Least Squares
12
11
10
b ' 0.064
7
6 8 10 12 14 16 18 20
Si : Years of education
log Wi : Logarithm of the annual wage, 1987 $
12
11
b0 ' 0.105
10
b ' 0.064
7
−8 −6 −4 −2 0 2 4 6 8
Residuals from the projection of Si on Xi and Xi2
260
7.5. Least Squares and Linear Regression
E [Yi | xi ] = xT
i β0 (7.37)
261
7.5. Least Squares and Linear Regression
E [εi | xi ] = E Yi − xT
i β0 x i
= E [Yi | xi ] − xT
i β0
=0
E [xi εi ] = Ex [E [xi εi | xi ]]
= Ex [xi · E [εi | xi ]]
=0
(the opposite is not true, that is, E [xi εi ] = 0 does not imply E [εi | xi ] = 0).
By the standard properties of probability limits it can be shown that:9
N
!−1 N
X X p −1
xi x T xi yi → E xi xT E [xi Yi ] = β∗
bN = i i
i=1 i=1
that is, the Least Squares projection converges in probability to the optimal
linear predictor (7.12); this should not be too surprising, since such a result
generally holds for sample analogs of population moments. What is relevant
here is that under hypothesis (7.37), the optimal linear predictor coincides
with the (linear ) CEF for any given realization xi = xi :
∗ T −1
py (xi = xi ) = xT T
i β = xi E x i x i E [xi Yi ]
−1
= xT T
i E xi xi E [xi E [Yi | xi ]]
T −1
= xT E xi xT
i E xi xi i β0
= xT
i β0
= E [Yi | xi = xi ]
where the second line once again exploits the Law of Iterated Expectations.
In light of Theorems 7.1 and 7.2, this result should not be too surprising.
9
Compare with the analysis and the proof of consistency conducted for the bivariate
case in Example 6.3, Lecture 6. Also note that the sequence bN , whose probability limit
is taken, is defined here in terms of realizations xi and yi , This notation is conventional
in the analysis of econometric estimators.
262
7.5. Least Squares and Linear Regression
The implication of both observations is that if the CEF is linear, the cor-
responding Least Squares solution coincides asymptotically with the “true”
parameters of the regression model.
p
bN → β0 (7.38)
This property motivates the use of Least Squares as a statistical or econo-
metric estimator of the linear regression model, an estimator that takes
the name of Ordinary Least Squares (OLS), where ‘ordinary’ is meant to
distinguish the baseline estimator from its variations or extensions. Result
(7.38) is re-framed later as the consistency property of the OLS estimator.
In what follows, it is given an array of motivations for the use of the linear
regression model in practical contexts.
263
7.5. Least Squares and Linear Regression
264
7.5. Least Squares and Linear Regression
265
7.5. Least Squares and Linear Regression
266
7.5. Least Squares and Linear Regression
267
7.5. Least Squares and Linear Regression
Example 7.9. Human capital and wages – blacks and whites. Let
us return again to the analysis of returns to schooling, examining how they
may differ between the two major racial groups in the US: blacks and whites.
Figure 7.7 below helps visualize the racial split in the cross-sectional excerpt
from Examples 7.2 and 7.5: blacks, who constitute about 33% of the sample,
appear more prevalent amongst the lower income brackets.
log Wi : Logarithm of the annual wage, 1987 $
13 Whites, no interact.
Whites, interaction
Blacks, no interact.
12 Blacks, interaction
11
10
7
6 8 10 12 14 16 18 20
Si : Years of education
Figure 7.7: (Log) wage and education in 1987; node colors match race
268
7.5. Least Squares and Linear Regression
269
7.5. Least Squares and Linear Regression
hence it corresponds to the derivative µ0Y |S,x (si ; xi ) of the CEF, averaged
over the support of xi , after having been weighted through the support of
Si by the following term, which depends on si and varies with xi .
The term φ (si , xi ) is hard to interpret, but intuitively it takes larger values
around the median of Si – as an inspection of the formula would suggest.
The original derivation of (7.56) given by Angrist and Krueger was initially
applied to a discrete Si (say years of education); with some manipulation
of integrals this can be proved for a continuous Si too.11
11
The proof proceeds as follows. After having defined s∗ ≡ lim sup XS , develop:
where the first and third lines above are consequent to the Law of Iterated Expectations,
the second line follows from the Fundamental Theorem of Calculus, and the rest obtains
with some manipulation. In addition, by standard properties of conditional moments:
E [ Si | xi ] = E [ Si | Si ≥ si , xi ] P ( Si ≥ si | xi ) + E [ Si | Si < si , xi ] [1 − P ( Si ≥ si | xi )]
showing the numerator of (7.56) is as stated above; a similar, simpler analysis also applies
to the denominator of (7.56).
270
7.5. Least Squares and Linear Regression
0.4 Frequency 2
Average φ (si ; xi )
Average φ (si ; xi )
Frequency
0.2 1
0 0
10
0
9
6 8 10 12 14 16 18 20
Si : Years of education
271
Lecture 8
Having developed some general statistical and practical motivations for the
use of Least Squares, this lecture examines the statistical properties of the
OLS estimator, which are instrumental to statistical estimation and infer-
ence. While both small and large sample properties are analyzed, the latter
are discussed first as the standard choice for use in empirical research, data
permitting. Finally, the lecture develops the implications of departures from
the assumption on independence between observations, along with the op-
tions available for performing reliable inference under those conditions.
272
8.1. Large Sample Properties
To denote the OLS estimator, the notation β b OLS will be used through-
out this lecture and beyond. While the algebraic expression of the estimator
is identical to that of the Least Squares solution b in (7.20) and (7.22), the
above notation is preferred when the intention is to highlight that OLS is
being used as a proper statistic and econometric estimator. Under Assump-
tion 1, the OLS estimator can be decomposed as:
N
!−1 N
X X
xi xT x i x T β0 + εi
β
b OLS =
i i
i=1 i=1
!−1 (8.2)
N N
1 X 1 X
= β0 + xi x T
i x i εi
N i=1 N i=1
or equivalently, in compact matrix notation, as follows.
b OLS = XT X −1 XT (Xβ0 + ε)
β
1 T
−1
1 T (8.3)
= β0 + X X X ε
N N
This decomposition turns out to be very useful throughout this analysis.
Assumption 2. Independently but not identically distributed data.
The observations in the sample {(yi , xi )}N i=1 are independently, but not nec-
essarily identically, distributed (i.n.i.d.).
This assumption characterizes the data sample. Note how the conditions
imposed on it are less restrictive than those from most baseline statistical re-
sults, which usually require identically and independently distributed (i.i.d.)
observations. By letting observations to be not identically distributed, they
are not only allowed to have different absolute or conditional moments, but
also to be drawn from different distributions. The independence assumption
remains problematic in many practical contexts, and econometric solutions
for scenarios when it likely fails are discussed in the last part of this lecture.
Assumption 3. Moments and realizations of the regressors. The
regressors random vector xi has a finite second moment, and for some δ > 0:
h i
E |Xik Xi` |1+δ < ∞ (8.4)
273
8.1. Large Sample Properties
This assumption specifies the nature of the regressors X used for esti-
mation, and is composed of two parts. The first part is about its stochastic
properties. In fact, it implicitly allows the regressors to be actually stochas-
tic – which is not to be taken for granted, since in the classical treatment of
the linear regression model the regressors are assumed to be “fixed” (that is
identical in repeated samples, such as when regression is used to evaluate
some kind of experimental variation); in those classical treatments the only
random component of the model is the error term εi . Stochastic regressors
are assumed to have finite second (mixed) moments while conforming to
condition (8.4). All this implies that the following probability limit:
N N
1 T 1 X T p 1 X
E xi xT (8.6)
X X= xi xi → lim i ≡ K0
N N i=1 N →∞ N
i=1
of the assumption states that also the actual realizations of the regressors
must satisfy an analogous invertibility condition. Recall that this condition
is necessary for the Least Squares solution to be unique; it rules out issues
such as the dummy variable trap.1
Assumption 4. Exogeneity. Conditional on the regressors xi , the error
term εi has mean zero (with typical terminology, it is mean-independent
of the regressors xi ).
E [εi | xi ] = 0 (8.7)
This is the all-important assumption, against which one’s estimates are
evaluated, since it is the crucial one for obtaining consistency of the OLS
estimator. As it was already observed in the previous lecture, (8.7) amounts
to assume that the CEF is indeed linear in xi , and it implies E [xi εi ] = 0,
hence:
N
1 T 1 X p
X ε= x i εi → 0 (8.8)
N N i=1
if the conditions for the application of an appropriate Law of Large Numbers
are satisfied by the other assumptions, and so the residual element that adds
to β0 on the right-hand side of (8.2) and (8.3) vanishes asymptotically. The
intuition is that since xi does not provide information on εi , any variation
in Yi associated with a variation in xi must be due, on average, to xi alone.
1
As Lecture 9 clarifies, this is a standard identification condition specific to the OLS
estimator.
274
8.1. Large Sample Properties
Like much of the econometric terminology, the name “exogeneity” for this
assumption originates with the analysis of Simultaneous Equations Models,
(SEMs, see Lectures 9 and 10) although a more appropriate name is the
longer (and hence less popular) mean independence of the error term.
The motivation for the shorter name is best understood later in the context
of the analysis of SEMs. A discussion of those frequent scenarios where this
condition might fail are reviewed in later lectures.
Assumption 5. Heteroscedastic, Independent Errors. The variance
of the error term εi conditional on xi is left unrestricted (heteroscedasticity).
Since observations are independent, the conditional covariance between two
error terms from two different observations i, j = 1, . . . , N is zero.
E ε2i xi = σ2 (xi ) ≡ σ2i (8.9)
E [ εi εj | xi , xj ] = 0 (8.10)
In addition, for some δ > 0 the following holds for all i = 1, . . . , N .
h i
2 1+δ
E εi <∞ (8.11)
275
8.1. Large Sample Properties
which is semi-definite positive and has full rank by Assumption 6. Note that
i .
if the observations were also identically distributed, then Ξ0 = E ε2i xi xT
and by the Ljapunov condition (8.14), the following Central Limit Theorem
result holds too.
N
1 X d
√ xi εi → N (0, Ξ0 ) (8.18)
N i=1
Finally, notice that in the special case of homoscedasticity, the variance of
the error term is independent of the regressors, hence:
E ε2i xi xT
2 T
2
T
i = E ε i E x i x i = σ 0 E x i x i
276
8.1. Large Sample Properties
Having discussed all the six White’s Assumptions at length, proving the
large sample properties of the OLS estimator is straightforward.
Theorem 8.1. The Large Sample properties of the OLS Estimator.
Under Assumptions 1-6 the OLS estimator is consistent, that is:
p
b OLS →
β β0 (8.19)
Proof. The consistency result (8.19) was in a way already proved in the pre-
vious lecture by exploiting the properties of the linear projection when the
CEF is linear; under Assumptions 1-6 it can be alternatively seen by apply-
ing the probability limit (8.8) to the decomposition of the OLS estimator
in (8.2). Regarding asymptotic normality, it follows from “rephrasing” the
Central Limit Theorem result in (8.18) in terms of the random sequence:
N
!−1 N
√ 1 X 1 X
b OLS − β0 =
N β xi xT
i √ x i εi
N i=1 N i=1
277
8.1. Large Sample Properties
278
8.1. Large Sample Properties
H0 : βk0 = ck H1 : βk0 6= ck
where the expression in the denominator is the square root of the kk-th en-
try of the estimated asymptotic variance of the OLS estimates, also called
the standard error of the k-th estimated parameter (standard errors are
typically reported, along the estimated coefficients, in the output of regres-
sions performed by the main statistical computer packages).
After having estimated the whole variance-covariance matrix of the OLS
estimates, it is possible to test hypotheses that involve multiple parameters.
Consider, for example, the following L ≥ 0 linear hypotheses:
H0 : Rβ0 = c H1 : Rβ0 6= c
279
8.2. Small Sample Properties
280
8.2. Small Sample Properties
that is, in expectation the OLS estimator returns the true value β0 . Note
how the exogeneity (mean independence) assumption is instrumental for
obtaining this result – just like in the case of consistency – and that using
the Law of Iterated Expectations allows to sidestep the fact that regressors
are stochastic. The conditional variance of the OLS estimator, given a
specific realization of the regressors X, is calculated instead as follows.
h i h −1 T T −1 i
T T
Var βOLS X = E X X
b X εε X X X X
−1 (8.28)
−1
= XT X XT E εεT X X XT X
−1 −1
= XT X XT ΣX XT X
the OLS estimator is the element of B that yields the minimum variance
estimate of any element of β0 , as well as of all possible linear combinations
lT β0 of β0 , where l is a K × 1 vector.
281
8.2. Small Sample Properties
−1 T
where the third line follows from B1 X = B0 X − XT X X X = 0. Thus:
h i h i
lT Var β e X l ≥ lT Var β b OLS X l
which proves the conditional (on X) version of the theorem; the uncondi-
tional version is easily obtained by taking the expectation over the random
matrix X, of which X is a specific realization.
This result – which, note, has not required invoking Assumption 8 yet –
is the one for which the OLS estimator deserves the denomination of Best
Linear Unbiased Estimator (BLUE). In this phrase, “Best” must be
interpreted in the sense of efficient, that is of minimum variance. However,
even in small samples this result is no longer valid when homoscedasticity
does not hold, as it is observed later while analyzing the Generalized Least
Squares model. Since in the current empirical practice homoscedasticity is
seen more as an exception and researchers are advised to employ variance
estimates that are robust to heteroscedasticity in large samples – such as
the “robust” formula (8.22) – the Gauss-Markov Theorem has lost much of
its original significance. However, it is still seldom useful as a benchmark
for efficiency comparisons.
In order to obtain a distributional result that that is usable for inference
purposes, observe that by Assumption 8 it would hold exactly that:
b OLS X ∼ N β0 , σ2 XT X −1
β (8.31)
0
by the properties of the normal distribution, recalling that the OLS estima-
tor is a linear function of the error terms ε as per (8.3). This result can be
immediately used for inference so long as σ20 is known; since it is generally
unknown, one needs to estimate this parameter. Intuitively, one could use
the same estimator for (8.30) that follows from the large sample properties
282
8.2. Small Sample Properties
= E Tr εT MX ε X
= E Tr MX εεT X
= Tr MX E εεT X
= Tr σ20 MX
−1 T
2 T
= σ0 Tr I − X X X X
−1 T
= σ20 Tr (I) − σ20 Tr XT X X X
= σ20 (N − K)
where IK is the K × K identity matrix. In addition, the sixth equality follows from the
fact that the expectation conditions on X, hence it can pass through the trace operator
as well as matrix MX (the only function of X in the trace).
7
This is analogous to estimating the variance a random variable using the standard
sample variance S 2 , without applying the rescaling factor NN−1 (see Theorem 4.3).
283
8.2. Small Sample Properties
eT e εT M X ε
X = X ∼ χ2N −K (8.34)
σ20 σ20
is a chi-squared distribution with degrees of freedom equal to the rank of
MX ; this quantity equals the trace of MX , that is N − K as per the earlier
derivation. Furthermore, one can show that (8.33) and (8.34) are indepen-
dent;8 therefore, the actual t-statistic tH0 which is obtained by substituting
σ20 with its unbiased estimate eT e/ (N − K) discussed above:
q √ t∗H0 √ b k,OLS − ck
β
tH0 2
= σ0 N − K √ = N −K √ T (8.35)
eT e e e·exkk
follows a Student’s t-distribution with N − K degrees of freedom, condi-
tionally on X (as usual, this follows from Observation 3.2).
tH0 | X ∼ TN −K
This result is usable for inference purposes; in small but sizable samples
(N > 20), however, this is known to yield results that are not very different
from approximations based on the standard normal.
b OLS = β0 + XT X −1 XT ε is shown to be indepen-
8
The vector of OLS estimates β
dent of (8.34) by the following observation.
−1 T
σ−2
0 XT X X MX = 0
b OLS , from which t∗ is constructed.
This result also applies to each individual element of β H0
284
8.2. Small Sample Properties
does indeed follow an exact χ2L distribution with L degrees of freedom, but
again this is an expression that depends upon the unknown parameter σ20 .
Therefore, in small samples an F -statistic must be used instead:
h
T
−1 T i−1
N −K T R X X R
FH0 = RβOLS − c
b RβOLS − c (8.37)
b
L eT e
this quantity results from dividing (8.36) by (8.34) – which are independent
from one another9 – and multiplying the ratio in question by L−1 (N − K).
By Observation 3.3, this F -statistic follows exactly an F -distribution with
degrees of freedom L and N − K, conditionally on X.
FH0 | X ∼ FL,N −K
A customary use of the F -statistic is in the model F -test (or simply the
model test) corresponding to the null hypothesis H0 : β0 = 0 that all the
parameters of the model (except the constant term, if present) are jointly
meaningful. The F -statistic obtained from this test is typically part of the
default regression output returned by statistical computer packages.10
and it is easy to see that the central coefficient matrix of this quadratic form returns 0
whether it is pre- or post-multiplied to MX .
10
The F -statistic and the model F -test are typically evaluated even in large sample
environments, in which cases they are calculated through the appropriate estimates of
the asymptotic variance of the OLS estimator. The F -distribution might, in fact, provide
a better approximation of the true underlying probabilities.
285
8.2. Small Sample Properties
relied for the most part on its small sample properties, which motivated the
search for an adequate solution within the same framework. This resulted
in the development of the Generalized Least Squares (GLS) model.
The intuition behind GLS is simple. Suppose
that the errors are indeed
heteroscedastic, but matrix Σ = E εε X is known. If one performs
T
1 1
such that Σ 2 Σ 2 = Σ, then the Generalized Least Squares estimator:
−1
βb GLS = X e TX
e Xe Ty
e
= XT Σ−1 X
−1 T −1
X Σ y (8.39)
−1 T −1
= β0 + XT Σ−1 X X Σ ε
is easily seen to be unbiased with respect to β0 . Moreover, since:
1 1 1 1
E εeεeT X = Σ− 2 E εεT X Σ− 2 = Σ− 2 ΣΣ− 2 = I
286
8.2. Small Sample Properties
The main problem with the GLS estimator is that Σ is, clearly, generally
unknown, therefore this estimator is unfeasible in practice. A solution would
be to substitute Σ with some plausible estimate of it: this approach is called
Feasible Generalized Least Squares (FGLS) and it works as follows.
1. Assume a functional form for the dependence of the variance of the
error term on the covariates X; for example, a simple and popular
choice is the exponential conditional variance σ2 (xi ) = exp xT ;
i ψ
2. estimate the main regression model of interest via OLS, which returns
an unbiased and consistent estimate of β0 , and calculate the resulting
squared residuals (e21 , e22 , . . . , e2N );
3. estimate via OLS the assumed model for the conditional variance; in
i ψ + $i , where
the exponential case this model would be log e2i = xT
$i is some error term with E [$i | xi ] = 0;
4. construct matrix Σ,b the estimate of Σ, accordingly; in the exponential
case it would be, for example:
Tb
exp x1 ψOLS 0 ... 0
T
0 exp x2 ψOLS ... 0
b
Σ
b =
.. .. ..
..
. . . .
0 0 . . . exp xT ψN
b OLS
287
8.3. Dependent Errors
The problem with this approach is that it may fail if the conditional vari-
ance model is incorrectly specified. In this case, FGLS-WLS might be less
efficient than “traditional” OLS, even in small samples. Consequently, a the-
ory about tests for heteroscedesticity was developed, whose objective is to
guide researchers in search of the right specification of the heteroscedasticity
model. Nowadays, these tests and GLS altogether are seen as largely redun-
dant, since modern econometric practice relies on large samples, asymptotic
properties and “heteroscedasticity-robust” variance estimators.11 However,
learning GLS can still be useful, for both pedagogical and practical reasons.
The pedagogical reason is that it is instructive to make efficiency compar-
isons between certain estimators (like 3SLS or linear GMM) and the GLS
benchmark. The practical reason is that GLS is still used in some settings,
for example in models for panel data featuring so-called “random effects.”
288
8.3. Dependent Errors
yic = xT
ic β0 + εic (8.47)
289
8.3. Dependent Errors
290
8.3. Dependent Errors
291
8.3. Dependent Errors
292
8.3. Dependent Errors
293
8.3. Dependent Errors
This result, while quite convenient, is obtained under the condition that
the number of clusters C is large and grows to infinity. In general, however,
the number of clusters is finite and typically not very large. This is one of
the reasons that has motivated the frequent use of the CCE formula (8.55)
with a multiplicative “degrees of freedom correction” C−1 C N
N −K
that takes
into account the fact that both the number of clusters and the sample size
are finite.14 With a very low number of clusters C – usually between 20
and 50 – however, the CCE formula is employed along statistical tests for
small samples (based on the Student’s T and the F distributions). A paper
by Bester et al. (2011) provides theoretical foundations for this practice: if
C is small and fixed but Nc goes to infinity in all clusters, intuitively the C
within-group averages of xic εic are asymptotically normally distributed; in
addition they show that CCE estimation works even under some weak forms
of cross-cluster error dependence, so long as clusters are similar enough in
their observable and unobservable characteristics, as well as in their size.
In current microeconometric practice, the majority of non-experimental
studies feature some form of clustered covariance estimation. This is not in
minor part due to some influential papers (Moulton, 1986; Bertrand et al.,
2004) which observed that failing to account for within-group dependence
can lead to seriously biased inference results.15 In particular, panel data
estimates are routinely clustered at least at the level of panel units, however
it often makes sense to define clusters at an even higher level of aggregation
(for example, in a panel of firms one may want to consider industry-level
clusters, including all observations of firms of the same industry over all the
years T ). In ideal experimental studies, instead, it is not necessary to
cluster standard errors: intuitively, even if the errors are correlated within
groups, if xic is independent of εic , therefore Ξ0 = σ20 K0 holds and standard
estimation under homoscedasticity is asymptotically consistent.
14
This is similar to the standard practice of estimating “robust” standard errors with
a multiplicative degrees of freedom correction N N −K , a habit which is motivated however
more by customs than by either theory or data concerns.
15
In some stylized cases, it is possible to solve for the explicit analytic expression of
this bias. Consider, for example, the CSRE model with equal group sizes M = Nc = N/C
for c = 1, . . . , C and identical regressors across clusters Xc = Xg for c 6= g; in this case
the asymptotic variance of the OLS estimator can be shown to simplify as:
C
!−1
σ2α
X
Avar β
b OLS = 1 + (M − 1) · σ2 XT
c Xc
σ2α + σ2 c=1
therefore, a standard estimate of the homoscedastic variance of the OLS estimate would
be downward biased. The extent of the bias is expressed by the muliplicative term within
brackets, which is often referred to as the Moulton bias from Moulton (1986).
294
8.3. Dependent Errors
HAC Estimation
In large samples, alternatives to CCE exist under specific structures of the
cross-error dependence. Such estimators of the OLS variance-covariance go
by the name of heteroscedasticity-autocorrelation-consistent (HAC)
estimators, since they were originally devised for the case of autocorrelation
in time. Like CCE estimators as well as all asymptotic covariance estimators
of OLS more generally, HAC estimators are based on K ×K matrices Ξ b HAC
such that:
p
Ξb HAC → Ξ0
and if a Central Limit Theorem for dependent observations can be applied:
√
d
N βOLS − β0 → N 0, K−1 −1
0 Ξ0 K0
b
295
8.3. Dependent Errors
296
8.3. Dependent Errors
where eit = yit −xTit βOLS and so on. Clearly enough, in this case the kernel is
b
allowed to cover the entire panel length T , since consistent HAC estimation
follows from the asymptotic properties obtained as N grows larger. Observe
that if κT (s) = 1 for all observations of the same panel unit and equals zero
otherwise, (8.59) would coincide with the CCE formula when clusters are
defined at the panel unit level. The HAC estimator is also easily ported to
a setting featuring spatial correlation. Recall that in such a case, cross-error
dependence decays with some measure of distance dij between observations
i and j; in a standard (say, cross-sectional) model yi = xT i β0 + εi , the HAC
estimator is easily adapted as:
N N
1 XX
ΞHSC =
b κN (dij ) ei xi xT
j ej (8.60)
N i=1 j=1
297
Lecture 9
Econometric Models
298
9.1. Structural Models
γ11 Y1i + γ12 Y2i + . . . + γ1P YP i = φ11 Z1i + φ12 Z2i + . . . + φ1Q ZQi + ε1i
γ21 Y1i + γ22 Y2i + . . . + γ2P YP i = φ21 Z1i + φ22 Z2i + . . . + φ2Q ZQi + ε2i
... = ...
γP 1 Y1i + γP 2 Y2i + . . . + γP P YP i = φP 1 Z1i + φP 2 Z2i + . . . + φP Q ZQi + εP i
299
9.1. Structural Models
300
9.1. Structural Models
301
9.2. Model Identification
The objective of estimating a model of this kind would be, for example,
that of finding out what factors zi best predict the profitability of a market.
Extensions of such a stylized model might allow for heterogeneous firms, or
for cost factors that vary across markets, hence extending the scope of the
analysis towards supply factors that also affect profitability. Other models
introduce include endogenous variables, incomplete information, and more;
for an introduction to this literature, see Berry and Reiss (2007).
This completes the exposition of three quite different econometric mod-
els, each grounded on a specific piece of economic theory. The rest of these
lectures is devoted to the analysis of methods for the estimation of models
like these. Before proceeding to estimation, however, the careful econome-
tricians should ask themselves questions of the following sort.
1. Is it possible to use the results of my estimates for the sake of attribut-
ing unique values to each parameter within the set θ?
2. If so, is it possible to use these estimates in order to answer questions
about the “effect” of certain variables upon the others?
Questions like these lie at the core of econometric analysis. These relate,
respectively, to the notions of identification and causality – while inter-
twined, these two concepts are often confused for one another, and it is thus
useful to provide appropriate introductions to both. Most of the remainder
of this lecture is devoted to this objective.
302
9.2. Model Identification
303
9.2. Model Identification
Armed with these definitions, one can provide rigorous answers to the
question whether some models are “identified” or not.
Example 5.6 outlines the First and Second conditions for a maximum of this
log-likelihood function, and the consequent expressions for estimators of the
two parameters, call them µb x and σ
b2x . The question of identification here is:
“can these conditions [for a maximum] characterize a univocal association
304
9.2. Model Identification
Since σb2x 6= 0 the determinant is nonzero, hence the Hessian has full rank.
Thus, by the Implicit Function Theorem it is (almost) always possible to
solve for unique values of (µx , σ2x ): these parameters are identified.
Second, consider the log-likelihood function of ϑ = (β0 , β1 , σ2ε , σxε ) given
the information about Yi contained in the sample {(yi , xi )}Ni=1 (here one can
abstract from µx and σx as they are shown to be identified).
2
N
log 2π β21 σ2x + 2β1 σxε + σ2ε −
log L (ϑ| y1 , . . . , yN , x1 , . . . , xN ) = −
2
N
X (yi − β0 − β1 xi )2
−
i=1
2 (β21 σ2x + 2β1 σxε + σ2ε )
305
9.2. Model Identification
A useful exercise is to show, following the example above about the identifi-
cation of (µx , σ2x ), that the model is identified when the restriction σxε = 0
is imposed. It is easiest to start from a simpler case, where Xi is “fixed in
repeated samples” (that is, some N realizations occur with probability one)
which greatly simplifies the expression of the likelihood function.
The identification condition σxε = 0 from Example 9.3, which states that
the covariance between Xi and εi must be zero, is intimately connected with
the so-called “exogeneity” condition of the linear regression model, which is
abundantly discussed in other lectures, but it is also worth to revisit it here.
This condition requires that the expectation of the error term conditional
on the explanatory variables is zero (here, E [ εi | Xi = xi ] = 0 for all xi ∈ X)
and it implies that the CEF of Yi given Xi is linear as well:
σxε = Cov (Xi , εi ) = E [Xi εi ] − E [Xi ] E [εi ]
= EX [E [Xi εi | Xi ]]
=0
because E [εi ] = 0 and by the Law of Iterated Expectations (Example 3.11).
In abstract terms, the intuition can be formulated as follows: if σxε 6= 0, it is
apparent that Xi and Yi move together it is impossible to tell whether they
do because Xi affects Yi directly, or rather through the indirect influence of
εi . Clearly, here something has to give, and to properly interpret the data
it is necessary to place some “restriction” on the statistical model M.
The concept of identification is not restricted to fully parametric models.
Indeed, it can apply as well to semi-parametric models: that is, models
in which only some features of the joint probability distribution of (zi , εi ) is
specified. A full-fledged treatment of identification in the semi-parametric
case is outside the scope of this discussion, but it is still worth to illustrate
the main intuition via an example.
Example 9.4. Parametric Bivariate Regression. Consider once again
the bivariate linear model Yi = β0 + β1 Xi + εi , but unlike in Example 9.3,
abstain from imposing fully parametric assumptions: the joint distribution
of (Xi , εi ) is left unspecified. In this case one could re-define the concept of
“model” M as a set of structures of the kind
(9.6)
θ = β0 , β1 , Px , Pε , P ε|x
where Px , Pε and P ε|x are families of probability distributions, respectively
of Xi , of εi , and of εi conditional on Xi , that are allowed by the model M.
A straightforward restriction here is that all elements of Pε must conform
to E [εi ] = 0; clearly, an unrestricted mean is indistinguishable from β0 .
306
9.2. Model Identification
307
9.3. Linear Simultaneous Equations
QD D
i = α0 + α1 Pi + υi
(9.9)
QSi = β0 + β1 Pi + υiS
We learn from economic theory that, in a market, demand and supply meet
in equilibrium, thus:
QD S
i = Qi = Qi
308
9.3. Linear Simultaneous Equations
and that both equilibrium prices Pi and quantities Qi are determined si-
multaneously and interdependently, hence they are both endogenous.
The parameters θ = (α0 , α1 , β0 , β1 ) of model (9.9) are not identified.
This is easily shown via the reduced form of the structural model:
β1 α0 − α1 β0 β1 υiD − α1 υiS
Qi = +
β1 − α1 β1 − α1
D S
(9.10)
α0 − β0 υi − υi
Pi = +
β1 − α1 β1 − α1
Thus, the parameters of the two bivariate regression models featured in the
structural form (9.9) cannot be identified, neither in fully parametric nor
in semi-parametric environments (Theorems 9.1 and 9.2). The best one can
do is to exploit the reduced form to estimate the two unconditional moments
E [Qi ] = (β1 α0 − α1 β0 ) / (β1 − α1 ) and E [Pi ] = (α0 − β0 ) / (β1 − α1 ) that
are implied by (9.10), which obviously do not contain enough information
about each element of θ: the system is not identified in the sense that there
is an infinite number of θ combinations that predict these two unconditional
averages. This is represented in graphical form in Figure 9.1.
Pi QSi (Pi )
i QS` (P` )
QD
i (Pi ) `
QD
` (P` )
QSj (Pj )
QD
j (Pj )
Qi
Figure 9.1: Infinite supply and demand curves given the sample {i, j, `}
309
9.3. Linear Simultaneous Equations
The economic intuition behind this negative result is that changes in the
equilibrium price and quantity in one market cannot be attributed to either
demand or supply factors in isolation, absent any further information that
is specific to either supply or demand.
Qi = α0 + α1 Pi + α2 Mi + υiD
(9.11)
Qi = β0 + β1 Pi + υiS
310
9.3. Linear Simultaneous Equations
Pi
QS (P )
M` > Mj `
Mj > Mi j QD
` (P, M` )
QD
j (P, Mj )
i
QD
i (P, Mi )
Qi
Suppose one can also observe another variable denoted by Ci that repre-
sents a synthetic index of production costs in this market. Clearly, Ci affects
311
9.3. Linear Simultaneous Equations
supply but not demand, and therefore it can be treated as exogenous. The
model now reads as follows.
Qi = α0 + α1 Pi + α2 Mi + υiD
(9.13)
Qi = β0 + β1 Pi + β2 Ci + υiS
The reduced form can be expressed in terms of two multivariate regressions,
one for quantity Qi and the other for price Pi , with the exogenous variables
Mi and Ci showing up on the right-hand side in both cases.
β1 α0 − α1 β0 β1 α2 α1 β2 β1 υiD − α1 υiS
Qi = + Mi − Ci +
β1 − α1 β1 − α1 β1 − α1 β1 − α1
D S
(9.14)
α0 − β0 υ2 β2 υ − υi
Pi = + Mi − Ci + i
β1 − α1 β1 − α1 β1 − α1 β1 − α1
If E [υD , υS | Mi ] = E [ υD , υS | Ci ] = 0, multivariate regression techniques (as
discussed later) allow to back up all the six combined parameters of (9.14).
It is easy to verify that there is a unique solution that maps this set onto the
set of the original “structural” parameters θM C = (α0 , α1 , α2 , β0 , β1 , β2 ).
The model is thus exactly identified. This result fails either if Ci enters
the demand equation, or if Mi does not enter the model while Ci does (in
this case, however, α1 would be identified: Ci would act as a “supply shifter”
that allows to map the demand curve, symmetrically to the scenario above).
In order to appreciate an instance of overidentification, and how it
can coexist with partial identification, let us consider one final case, that
is a model without Ci – no demand shifter – but with two supply shifters:
consumers’ income Mi and the price of a competing product Pi∗ , which is
expected to affect demand positively. The model now reads as:
Qi = α0 + α1 Pi + α2 Mi + α3 Pi∗ + υiD
(9.15)
Qi = β0 + β1 Pi + υiS
and its reduced form as:
β1 α0 − α1 β0 β1 α2 β1 α3 β1 υiD − α1 υiS
Qi = + Mi + Pi∗ +
β1 − α1 β1 − α1 β1 − α1 β1 − α1
D S
(9.16)
α0 − β0 α2 α3 υ − υi
Pi = + Mi + Pi∗ + i
β1 − α1 β1 − α1 β1 − α1 β1 − α1
which, by reasoning analogous to those from the previous cases, has some in-
teresting implications. First, notice that there are multiple ways to calculate
β1 (by taking the ratio of the two coefficients for Mi and Pi∗ , respectively):
this means that parameter β1 is overidentified. Second, parameter α1 is
not identified (this is easy to verify): the intuitive reason is that there are
no longer any “demand shifters” in the model.
312
9.3. Linear Simultaneous Equations
313
9.3. Linear Simultaneous Equations
314
9.4. Causal Effects
values; and:
∂
Cqpi (zqi , z−qi , εi ) = rp (zqi , z−qi , εi ) (9.19)
∂zqi
if Xzq is a continuous set, and where rp (·) is the p-th equation of the reduced
form, the one that predicts Ypi . Therefore, the individual causal effect can
be interpreted as the “effect” of a ceteris paribus, marginal variation of Zqi
on the endogenous variable Ypi , obtained by keeping all the other observable
exogenous variables as well as the unobserved factors constant.
315
9.4. Causal Effects
316
9.4. Causal Effects
317
9.4. Causal Effects
and the answer is a conditional no; µqYp |z (zqi , z−qi ) = ACEqp (zqi , z−qi ) only
if: ˆ
∂
rp (zqi , z−qi , εi ) f ε|z (εi | zqi , z−qi ) dε1i . . . dεP i = 0
Xε ∂zqi
that is, if Zqi and the unobservables εi are independent, conditional on the
other exogenous variables z−i . This has a proper definition in statistics.
Definition 9.16. Conditional Independence Assumption (CIA). The
CIA is the hypothesis that the the unobservables εi and a specific exoge-
nous variable Zqi are statistically independent, conditional on all the other
exogenous variables z−qi .
Zqi ⊥ εi | z−qi (9.23)
For binary treatments, the CIA is often expressed in potential outcomes
notation as follows.
The importance of this concept lies in the fact that it provides a clear condi-
tion to verify if the parameters of an econometric model can be interpreted
causally. Suppose, for example, that the CEF of interest is linear:
y i = xT
i β0 + δ 0 s i + εi
318
9.4. Causal Effects
the interpretation of the optimal linear predictor for δ0 can be based on the
Yitzakhi-Angrist-Krueger decomposition (7.56), where the derivative of the
CEF conditional on xi is precisely the ATE:
µ0Y |S,x = E [Yi | Si = 1, xi ] − E [Yi | Si = 0, xi ] ≡ ∆ (xi ) (9.25)
while the weighting factor here is φ (xi ) = P (Si = 1| xi ) [1 − P (Si = 1| xi )].
In this case, (7.56) becomes as follows.
Ex [∆ (xi ) P (Si = 1| xi ) [1 − P (Si = 1| xi )]]
δ∗ = (9.26)
Ex [P (Si = 1| xi ) [1 − P (Si = 1| xi )]]
By inspecting the expression above, it appears that the linear projection
δ∗ identifies the average causal effect of Si on Yi in one of two cases:
• the causal effect does not vary with xi (∆ (xi ) is a constant);
• the probability to “take up the treatment” (Si = 1) is constant for xi .
While these conditions are unlikely to hold in practice, δ∗ still carries an
interpretation as an average causal effect that is weighted by φ (xi ); as in
the general case, in a practical setting it is important to learn about these
weights in order to inform the interpretation of one’s estimates.
Finally, it is important to observe that while causality is best framed in
terms of effects of exogenous variables on endogenous ones in the setting of
reduced form models, a selected class of structural models – the so-called
triangular models – allows for “endogenous-to-endogenous” causal effects.
Definition 9.17. Triangular Models. A triangular structural model is
one where its P equations and its P endogenous variables can be ordered
in such a way that, for any natural number P 0 < P , the first P 0 endogenous
variables never enter the last P − P 0 equations, or vice versa.
Possibly, the simplest triangular model is the following trivariate model:
Yi = β0 + β1 Xi + β2 Zi + εi
(9.27)
Xi = π0 + π1 Zi + ηi
where (Yi , Xi ) are the endogenous variables while Zi is the only exogenous
one: notice that Yi does not enter the second equation. In general, a SEM is
triangular if matrix Γ is either upper- or lower-triangular (hence the name).
A Mincer model enriched with a linear equation for education where wages
are absent, such as (7.7), is triangular too. In the case of triangular models,
it is sensible to talk about the causal effects of endogenous variables upon
other endogenous variables (Xi on Yi , education on wages, et cetera); all
definitions and considerations made above apply.
319
Lecture 10
Instrumental Variables
320
10.1. Endogeneity Problems
which relates to the population correlation between the omitted variable and
the explanatory variables.1 Recall that if there is good reason to believe that
either term is zero, omitting Si is not a problem at all! Lecture 7 already
provides a generalization of this concept to multiple omitted variables, as
well as an illustrative interpretation in terms of the Mincer equation.
321
10.1. Endogeneity Problems
With panel data, fixed effects can be easily addressed: even if T is small
and thus the individual effects αi cannot be consistently estimated by brute
force (e.g. as separate dummy variables) we know from the Frisch-Waugh-
Lovell theorem that an estimate of β0 based on a “demeaned” model:
where ȳi = T1 Tt=1 yit , x̄i = T1 Tt=1 xit , and ε̄i = T1 Tt=1 εit , is numerically
P P P
equivalent to the one from a brute force estimate that includes unit-specific
dummies; moreover, such an estimate would be consistent since αi is absent
from (10.4). An alternative, which is generally asymptotically equivalent,
is to estimate a model in “first differences:”
∆yit = ∆xT
it β0 + ∆εit (10.5)
Simultaneity
The problem of simultaneity is perhaps the archetypical type of endogene-
ity problem, and it derives from the classical analysis of linear simultane-
ous equations in the early days of econometrics. While already elaborated
in Lecture 9, this problem is revisited here from a more statistical angle.
Consider a Simultaneous Equations Model (SEM) written in compact form
(9.2): Γyi = Φzi + εi . Suppose that the exogenous variables zi are defensi-
bly exogenous, that is conditionally mean independent of the P error terms
εi of the model. Write:
E [εi | zi ] = 0 (10.6)
implying E [zi εpi ] = 0 for p = 1, . . . , P by (10.6) and by the
Law of Iterated
Expectations; this can be written more compactly as E zi εi = 0. T
322
10.1. Endogeneity Problems
The intuition is represented in Figure 10.1 below, which displays the graph
of structural relationships implied by two equations of a SEM, where any
two endogenous variables Y1i and Y2i show up in both equations. Since both
endogenous variables are also affected by the respective error terms ε1i and
ε2i , the latter are by construction correlated with both Y1i and Y2i , directly
or indirectly; this is an abstract representation of the identification problem
illustrated in Examples 9.5 and 9.6. Here the denomination simultaneity
is predicated on the concomitance of all these structural relationships.
Y1i ε1i
Y2i ε2i
Measurement Error
It is quite common that some variables contained in a dataset are, to some
degree, measured with error. This problem typically leads to inconsistent
estimates whenever it affects the model’s explanatory (exogenous) variables.
To illustrate, consider a bivariate regression model Yi = β0 + β1 Xi + εi with
“exogenous” Xi (E [εi | Xi ] = 0). However, the researcher cannot observe the
true variable Xi , but only its error-ridden version:
Xi∗ = Xi + υi (10.8)
where υi denotes the error in the measurement of Xi (with E [υi ] = 0). In
addition, suppose that the error υi is also completely random, that it is
independent from both the “true” Xi as well as of the original error term εi .
E [Xi υi ] = E [εi υi ] = 0
323
10.1. Endogeneity Problems
324
10.1. Endogeneity Problems
y = X∗ β0 + ε − Uβ0 (10.10)
The “actual” error term is now εi − K k=1 βk,0 υki ; again by construction, it
P
is: " #
XK
E εi − βk,0 υki x∗i = x∗i 6= 0 (10.11)
k=1
where x∗iis the i-th row of X , even under the maintained hypothesis that
∗
the errors υki are completely independent of both the “true” explanatory
variables and the primitive error term:
325
10.1. Endogeneity Problems
Structural Endogeneity
The final subcategory of endogeneity is somewhat residual, and it collects
those kinds of endogeneity problems that are somewhat “built-in” the very
structural model. Consider, for example, the following time-series model:
yt = β0 + β1 yt−1 + xT T
t γ0 + xt−1 γ1 + εt
(10.16)
εt = ρεt−1 + ξt
where the current realizations of the dependent variable yt depend on its
past, as well as on current and past realizations of some explanatory vari-
ables xt ; in addition, the error term presents an AR(1) structure with au-
toregressive parameter ρ ∈ (0, 1) and “innovation” shock ξt – not autocorre-
lated; this is a more specialized version of model (8.44). It is quite obvious
that:
E [εt | Yt−1 ] = E [ρεt−1 + ξt | Yt−1 ] = ρ E [ εt−1 | Yt−1 ] + E [ξt | Yt−1 ] 6= 0
even if ξt is conditionally mean-independent, the grand error term εt is not,
as its lag εt−1 affects the lag of the dependent variable Yt by construction.
Another not too dissimilar case is a spatial model written in compact
matrix notation as:
y = β0 ι + β1 Wy + Xγ0 + WXγ1 + ε (10.17)
where W is a N × N non-stochastic spatial weighting matrix with zero
diagonal:
0 w12 . . . w1N
w21 0 . . . w2N
W = .. .. ... ..
. . .
wN 1 wN 2 . . . 0
which collects the wij distances between two distinct units i and j in the
sample. A model of this sort is common in urban and regional econometrics,
as well as in the econometrics of networks and social interactions. In general,
it can be rewritten in terms of its solution for y as:
y = (I − β1 W)−1 (β0 ι + Xγ0 + WXγ1 + ε)
which suggests the existence of an endogeneity problem due to the feedback
mechanisms, that are built in the model, between the dependent variables
(and the error terms) of different economic units.
E [εi | Y1 , . . . , Yi−1 , Yi+1 , . . . , YN ] 6= 0
A careful reader will easily note an analogy between the endogeneity prob-
lem of spatial models and the issue of simultaneity in SEMs!
326
10.2. Instrumental Variables in Theory
Zi Xi Yi
ηi εi
327
10.2. Instrumental Variables in Theory
328
10.2. Instrumental Variables in Theory
Yi = β0 + π0 β1 + (β1 π1 + β2 ) Zi + εi + β1 ηi
(10.27)
Xi = π0 + π1 Zi + ηi
329
10.2. Instrumental Variables in Theory
330
10.2. Instrumental Variables in Theory
which generalizes (10.22), and which is itself a special case of the more
general “overidentified” Two-Stages Least Squares estimator, as discussed
below. By writing the N ×K matrix that collects the zi vectors of exogenous
variables as:
zT
1 x11 . . . x1K1 z11 . . . z1K2
zT x21 . . . x2K z21 . . . z2K2
2
(10.30)
1
Z ≡ .. = .. ... .. .. . . ..
. . . . . .
zT
N xN 1 . . . xN K1 zN 1 . . . zN K2
the IV estimator can be elegantly written in compact matrix notation.
b IV = ZT X −1 ZT y (10.31)
β
Both representations showPstraightforwardly that the IV estimator is well-
defined so long as matrix i=1 zi xi = Z X is invertible, and that it admits
N T T
331
10.2. Instrumental Variables in Theory
si zi
where xi is the experience Xi specific to observation i; si is instead his or
her education Si , while zi is her or his specific value of the instrument Zi .
IV estimation would then proceed as per (10.29) or (10.31).
332
10.2. Instrumental Variables in Theory
Until now, the formula for the IV estimator – and its asymptotic proper-
ties – have been presented without much of a motivation. To gain intuition,
it is useful to once again represent the structural relationships between the
endogenous and the exogenous variables in the form of a triangular model
of simultaneous equations – in a simple case. Suppose, in particular, that
K1 = K −1 and K2 = 1, with xi2 = xKi = si (this is similar to an analogous
partition from Lecture 7) and zi0 = zi . Thus, only one variable of the main
linear model is suspected to be endogenous, with only one instrument zi to
compensate for it. The triangular model representation of this setup is:
yi = x T
i1 β0\K + δ0 si + εi
(10.35)
si = x T
i1 π0\K + τ0 zi + ηi
T
where β0 = βT ; the reduced form of this model is as follows.
0\K δ 0
yi = xT
i1 β0\K + δ0 π0\K + δ0 τ0 zi + εi + δ0 ηi
(10.36)
s i = xT
i1 π0\K + τ0 zi + ηi
This model is, like (10.20), identified, thanks to the exclusion restriction
whereby the instrument zi does not enter the structural equation for yi in
(10.35); note that a consistent estimator for δ0 can be obtained, in analogy
with the trivariate case above (10.22), as:
−1
zT MX1 y zT MX1 s zT MX1 y
δτ
cOLS
δIV =
b = T = (10.37)
τbOLS z MX1 z zT MX1 z zT MX1 s
where y, s and z are the vectors of length
−1 TN that collect, respectively, yi , si
and zi ; while MX1 = I − X1 XT 1 X 1 X1 is the residual-maker matrix
obtained from the first K − 1 (exogenous) explanatory regressors xi1 . Note
that expression (10.37) is best understood as an application of (7.30).
It turns out that the estimator for δ0 given in (10.37) is numerically
equivalent to the one be obtained via the IV estimator; it is a good exercise
to develop the proof of this result, which is in its essence a variation of the
Frisch-Waugh-Lovell Theorem, but applied to the partitioned IV estimator.
The intuition is likewise analogous: the IV estimator for multivariate linear
models extends the simple IV estimator of the slope (10.22) by partialing out
the empirical correlations of the instruments with the endogenous variables,
as well as the empirical correlation of the instrument with the dependent
variable, from the empirical correlations of the other explanatory variables
included in the structural form. In this respect, the IV estimator inherits
the desirable properties of the least squares solution and the OLS estimator
that have been discussed in the previous lectures.
333
10.2. Instrumental Variables in Theory
334
10.2. Instrumental Variables in Theory
335
10.2. Instrumental Variables in Theory
=β b IV
where the third line is only possible if X and Z have the same (column)
dimensions. The usual decomposition gives:
N
!−1 N
1 X 1 X
β
b 2SLS = β0 + x bT
bi x i x
b i εi (10.49)
N i=1 N i=1
−1
1 T 1 T
= β0 + X PZ X X PZ ε (10.50)
N N
which shows, in conjuction with the geometric interpretation of the 2SLS
estimator, why the latter is consistent. In fact, the operation of projecting
the possibly endogenous regressors xi onto the space which is spanned by the
exogenous variables zi – call it S (Z) – generates a set of “fitted” regressors
bki that, by construction, lie on S (Z), and are consequently orthogonal to
x
the disturbance vector ε. Again, this is by construction because of (10.28);
see figure 10.3 for the geometric intuition. Since E [b
xi εi ] = 0, it is:
N
1 T 1 X p
X PZ ε = b i εi → 0
x (10.51)
N N i=1
p
implying consistency of 2SLS estimator, i.e. β
b 2SLS → β0 .
336
10.2. Instrumental Variables in Theory
X∗,k
X∗,k − X
b ∗,k S (Z)
Z∗,2
90◦ ε
X
b ∗,k
90◦
Z∗,1
Example 10.2. Mincer, Revisited (again). Let us return yet once more
to the Mincer equation, and the attempt to address the endogeneity bias
due to the omission of ability αi which is explored in example 10.1. One can
obtain the IV estimator in two alternative ways; both require specifying a
first stage equation for education, like:
Si = π0 + π1 Xi + π2 Xi2 + π3 Zi + ηi (10.52)
note that this is a linear version of the structural equation for ability (7.7)
from example 7.2, with the inclusion of a squared term for experience (as
in the “structural” Mincer equation) but the exclusion of ability itself. The
resulting reduced form of the model is:
log Wi = β0 + β3 π0 + (β1 + β3 π1 ) Xi + (β2 + β3 π2 ) Xi2
+ β3 π3 Zi + αi + i + β3 ηi (10.53)
Si = π0 + π1 Xi + π2 Xi2 + π3 Zi + ηi
337
10.2. Instrumental Variables in Theory
Sbi = π
b0 + π b2 Xi2 + π
b1 Xi + π b3 Zi
and the just-identified IV-2SLS estimator results from running OLS on the
following model.
log Wi = β0 + β1 Xi + β2 Xi2 + β3 Sbi + εi
However, the 2SLS estimator would work just as well in an overidentified
setting where the researcher has access to redundant instruments. Suppose
that there are two additional valid instruments Gi and Fi ; the First Stage
equation would then read as:
Si = π0 + π1 Xi + π2 Xi2 + π3 Zi + π4 Gi + π5 Fi + ηi (10.54)
and 2SLS estimation would proceed as described. In Lecture 12, which is
about the more general GMM framework, this case is revisited in order to
develop an example of an overidentification test, while making at the same
time some substantive examples of “additional” instruments Gi and Fi .
It remains to show the remaining asymptotic properties of the 2SLS
estimator, especially with regard to the variance. These are now presented
more formally than how it was done in the context of the just-identified IV
estimator; it is a good exercise to derive the asymptotic properties of the
latter as a particular case. In what follows, some additional assumptions of
the generalized linear model – one that allows for instrumental variables
and possibly overidentification – are stated more rigorously.
Assumption 9. Independently but not identically distributed IVs.
The observations in the sample {(yi , xi , zi )}Ni=1 are independently, but not
necessarily identically, distributed (i.n.i.d.).
Assumption 10. Exogeneity of the Instruments. Conditional on the
J ≥ K regressors zi , the error term εi has mean zero.
E [εi | zi ] = 0 (10.55)
338
10.2. Instrumental Variables in Theory
E [εi εj | zi , zj ] = 0 (10.58)
Assumption 13. Asymptotics of Projected Regressors interacted
with the Errors. Given the following diagonal matrix of squared errors:
ε21 0 · · · 0
0 ε2 · · · 0
2
E ≡ .. .. . . . (10.59)
. . . ..
0 0 · · · ε2N
the following probability limit exists, is finite, and has full rank.
1 T p
X PZ EPZ X → Ψ0 (10.60)
N
In addition, conditions hold so that the following Central Limit Theorem
result applies.
1 d
√ XT PZ ε → N (0, Ψ0 ) (10.61)
N
These assumption are worth to be discussed briefly in relationship with
the analogous (White’s) hypotheses of OLS. Assumption 9 simply extends
Assumption 2 to the instruments listed in zi . Assumption 10 was discussed
multiple times; it is necessary to obtain (10.51) and hence consistency of the
2SLS estimator. Assumption 11 allows to establish a condition analogous to
(8.6), for the sake of establishing and estimating the asymptotic variance.
Assumption 12 characterizes the concept of heteroscedasticity in the context
of instrumental variables. Finally, Assumption 13 ensures that some Central
Limit Theorem can be appropriately extended to the 2SLS estimator as well.
Observe that some of these assumptions might be founded – more rigorously
– onto more primitive hypotheses (like conditions on specific moments, e.g.
Ljapunov’s); this is not pursued here for the sake of conciseness.
In light of these assumptions, one can finally establish the asymptotic
properties of the 2SLS estimator.
339
10.2. Instrumental Variables in Theory
where:
b N ≡ 1 XT Z ZT Z −1 ZT X (10.66)
P
N
b N ≡ 1 XT Z ZT Z −1 ZT E b N Z ZT Z −1 ZT X (10.67)
Ψ
N
p p
with PbN → P0 , Ψ
bN → Ψ0 ; while E b N is the following diagonal matrix:
e21 0 · · · 0
0 e2 · · · 0
2
EN ≡ .. .. . . .. (10.68)
b
. . . .
0 0 · · · e2N
where ei ≡ yi − xT
i β2SLS for i = 1, . . . , N . Naturally, this would not work
b
with dependent observations as Assumption 13 would fail; estimating the
appropriate “meat” matrix Ψ0 would require a CCE approach, as follows:
" C #
1 X −1 −1 p
XT T
ZT T T
ZT
b CCE ≡
Ψ c Zc Zc Zc c ec ec Zc Zc Zc c X c → Ψ0
N c=1
(10.69)
where ec ≡ yc − Xc β2SLS . Clearly, analogous HAC estimators also exist.
b
340
10.2. Instrumental Variables in Theory
(this assumes the existence of some (x0 , ξ0 ) at t = 0). The above relation-
ship, lagged by one further period, applies to the endogenous yt−1 variable as
well. Hence, all valid lags xt−s for s ≥ 2 can be combined into the instru-
ments vector zt in a 2SLS framework. If, in addition, ξt is homoscedastic,
this leads to efficient estimates. In a similar vein, the solution of the spatial
model (10.17) can be rephrased, by standard results of linear algebra, as:
X∞
y= βs1 Ws (β0 ι + Xγ0 + WXγ1 + ε)
s=0
thus, all the linearly independent columns of the matrices in {Ws X}∞ s=2
can enter the instruments matrix Z (Kelejian and Prucha, 1998; Bramoullé
et al., 2009); with homoscedastic errors, this leads to efficient estimates.
341
10.3. Instrumental Variables in Practice
y i = xT bT
i β0 + ηi ρ0 + ςi (10.73)
T
where η bK2 i while ρ0 collects the K2 parameters
bi = η b2i . . . η
b1i η
associated with each set of residuals and ςi is some new error term.
The OLS estimates of (10.73) are actually consistent for both β0 and ρ0 .
A semi-formal argument, intuitive if convoluted, is provided next.
Consider, for k = 1, . . . , K2 , the first stage model for the k-th endoge-
nous variable (10.45); and note that, by construction:
therefore, the error term of the k-th equation ηki appears to contain some
statistical information about what makes variable Xki endogenous in the
342
10.3. Instrumental Variables in Practice
first place (in fact this is so by construction, since ηki is the residual from
the projection of Xki onto the exogenous instruments). One might want to
extend this intuition to all the other K2 first stage residuals ηki by specifying
a statistical model for the error term, also called a control function.
If one supposes that such a model is linear:
εi = η T
i ρ0 + ξi (10.74)
where, in the population:
−1
ρ0 ≡ E ηi ηiT
E [ηi εi ]
T
is the linear projection of εi onto ηi = η1i η2i . . . ηK2 i , while ξi is
yi = xT T
i β0 + ηi ρ0 + ξi (10.75)
which could be consistently estimated by OLS if only ηi could be observed.
While ηi cannot be observed by definition, it can be definitely estimated in
the first stage, hence (10.73) matches (10.75) for
K0
X
T
b i ) ρ0 + ξi = zT
ςi ≡ (ηi − η i · (πk0 − π
bk,OLS ) ρk0 + ξi
k=0
and it can be shown that the first component of this expression is condition-
ally mean independent of (xi , ηbi ), because it only depends on the statistical
noise in the estimation of the first stage models. Consequently:
E xTi ςi = 0 and E η
T
bi ςi = 0
hence, OLS estimation of (10.73) is both consistent and, for β0 , equivalent
to IV-2SLS. The classical, full-fledged proof of this result is based on a vari-
ation of the Frisch-Waugh-Lovell Theorem and the algebra of projections.
343
10.3. Instrumental Variables in Practice
In practice, control function approaches are seldom used for linear mod-
els. With respect to IV-2SLS, in fact, they entail a few shortcomings: first,
they might not work too well if the endogenous variables entered the struc-
tural form non-linearly (with higher-order terms, interactions etc.); second,
they can be shown to be less efficient and to produce larger standard errors.
One might wonder, then, what are control function approaches useful for.
Not only they play a role in the tests for endogeneity, as it is mentioned
below; but they can be actually convenient for extending instrumental vari-
ables to non-linear models. In fact, IVs can be combined with non-linear
models in a variety of ways (see e.g. the discussion in Lecture 12 about the
GMM approach) but in practical terms, these often entail complications of
computational or statistical kind. Conversely, control function approaches
are very flexible; they typically entail augmenting a non-linear model with
the inclusion of some function of the residuals obtained from the first-stage
regressions of the endogenous variables.
The above implies that OLS and IV-2SLS have the same probability limit;
thus, an operationally more convenient null hypothesis can be:
b OLS − plim β
H0 : plim β b 2SLS = 0
Under the null hypothesis, this quadratic form should be close to zero. The
problem is that, in general, it is hard to derive an exact expression for the
variance of the difference between two estimators, like:
h i h i h i h i
Var βOLS − β2SLS = Var βOLS + Var β2SLS − 2 Cov βOLS , β2SLS
b b b b b b
344
10.3. Instrumental Variables in Practice
the OLS estimator under i.i.d. homoscedastic errors (by the Gauss-Markov
Theorem), the unknown covariance is actually equal to the variance of the
less efficient estimator, therefore:
h i h i
Cov β b OLS , β
b 2SLS = Var βb 2SLS
p
with H → χ2K asymptotically. This statistic is easily calculated in the data.
Unfortunately, in regression analysis the Hausman test is limited to the
case of i.i.d. homoscedastic errors, or to other scenarios where an alternative
estimator is evaluated against an efficient benchmark (such as fixed effects
vs. random effects in panel data). Even with i.i.d. errors, however, it can be
shown that the Hausman test converges in probability to a Wald statistic
formed out of the ρ bOLS from the control function estimator of (10.73)! The
intuition is simple in light of the earlier discussion about control function:
the null hypothesis of “no endogeneity” is equivalent to:
H0 : ρ0 = 0
since it implies E ηiT εi = E xT i εi = 0. This observation is not only useful
345
10.3. Instrumental Variables in Practice
Some theoretical research has attempted to quantify this bias and to develop
procedures for testing it. The practice of empirical research which is based
on IV-2SLS estimation emphasizes the use of large datasets in order to rely
on the better asymptotic properties of the estimator.
Weak instruments
It was observed in several instances that the limiting variance of IV-2SLS
estimators is inversely proportional to the correlation between the endoge-
nous regressors and the exogenous instruments. This relates to the intuition
for identification in IV estimation: the “effect” of some endogenous variable
Xi on Yi is obtained through the indirect effect that the instrument Zi has
on Yi , since Zi affects Xi directly but it affects Yi only through Xi . It is then
self-evident that if the direct relationship between Xi and Zi is statistically
weak, the main effect of interest is hard to “capture” and it will be at best
imprecisely estimated. This is a problem of weak instruments.
Weak instruments have two main implications. First, IV-2SLS estimates
obtained with weak instruments might make predictions worse than the ones
obtained with inconsistent OLS estimates, in a Mean Squared Error sense.
Asymptotically, the latter reads (particularly for IV-2SLS) as:
2
plim MSEIV −2SLS = plim βIV −2SLS − β0 + Avar βIV −2SLS
b b
| {z } | {z }
= squared asymptotic bias = asymptotic variance
hence, the gains obtained in terms of lower bias might be more than offset
by the losses due to a higher variance. Second, if the instruments are weak
and only slightly endogenous, the “cure” to endogeneity achieved by IV-2SLS
estimation might be worse then the disease. To appreciate this, consider
the ratio between the asymptotic bias of the IV estimator (10.22) of the
trivariate triangular model, and the OLS estimator (5.15) from bivariate
regression:
346
10.3. Instrumental Variables in Practice
and assume further that it has i.i.d. errors. It is possible to show that:
plim δIV − δ0 Corr Si , εi
b
1
= ·
plim δOLS − δ0 Corr (Si , εi ) plim R2s,z|x
where R2s,z|x is the following partialed out R-squared coefficient.
sT PZ MX1 PZ s
R2s,z|x =
sT MX1 s
Notice that, by the Frisch-Waugh-Lovell Theorem, this is the R2 coefficient
that would be obtained from a regression of si on zi , after partialing out
the exogenous regressors xi1 , as follows (see Lecture 7).
MX1 s = MX1 Zπ0 + MX1 η
In light of this analysis, it may appear that embarking into an empirical
study based on Instrumental Variables is very risky, due to the high chance
of ending up with mildly endogenous and fairly weak instruments. For the
sake of mitigating this risk, it is best to follow some general guidelines.
1. It is always useful to test the statistical power of the instruments
via estimates of the First Stage models (10.45). In the applied econo-
metric practice, some rules of thumb apply: t-statistics for the ex-
ogenous instruments higher than 3, or model-wide F -statistics higher
than 10, are considered signs that the instruments are “satisfactorily
strong.” These numbers appear to be based on simulation studies and
surveys, see e.g. Stock et al. (2002); however, they must be taken with
a grain of salt, since the conditions that make an instrument “strong
enough” are really context- and data-dependent.
2. The earlier observation that 2SLS is more likely to hit the efficiency
bound the more instruments are used must be revisited. While this is
true in theory, in practice chances are that the more instruments one
is employing, the higher the probability to include mildly endogenous,
weak instruments – it is advisable to drop instruments from overi-
dentified 2SLS estimators whenever they are suspected to be weak.
347
10.4. Estimation of Simultaneous Equations
348
10.4. Estimation of Simultaneous Equations
349
10.4. Estimation of Simultaneous Equations
where yp is the N -dimensional vector obtained from stacking all the obser-
vations ypi for i = 1, . . . , N , while εp is constructed analogously. Similarly,
matrix Xp results from vertically stacking vectors xT pi for i = 1, . . . , N ; thus
(10.80) can be rephrased as follows.
yp = Xβp0 + εp (10.80)
Furthermore, consider the stacked instruments matrix Z as in (10.30); the
associated projection matrix PZ , and construct the P equation-specific ma-
trices of projected regressors as:
Xb p = P Z Xp
while
X
b1 0 ··· 0
0 X b2 ··· 0
b ≡
X .. .. ..
..
. . . .
0 0 · · · XP
b
350
10.4. Estimation of Simultaneous Equations
As hinted later in Lecture 12, there exist versions of the 3SLS estimator that
are robust to heteroscedasticity and to wider forms of error dependence.
The 3SLS estimator is the most efficient among all the semi-parametric
estimators of SEMs. Just like the 2SLS estimator, as it is discussed later, it
corresponds to the solution of a Generalized Method of Moments (GMM)
problem. Nonetheless, in the fully parametric case other methods are avail-
able for the estimation of SEMs (the so-called LIML, Limited Information
Maximum Likelihood, and FIML, Full Information Maximum Likelihood
methods). These maximum likelihood methods however, are not more effi-
cient than GMM-based or otherwise semi-parametric methods, and in ad-
dition they are liable to violations of the parametric assumptions. For this
reason, the current practice favors the use of semi-parametric methods for
the estimation of linear simultaneous equations.
351
Lecture 11
Maximum Estimation
352
11.1. Criterion Functions
thus the OLS estimator is just the minimizer of the sample analog:
N
b OLS = arg max − 1
X 2
β yi − x T
i β (11.5)
β∈RK N i=1
2
i β .
corresponding to an M-Estimator for q (Yi , xi ; β) = − Yi − xT
Example 11.2. Non-Linear Least Squares (NLLS). If, instead, the
CEF is non-linear, governed by some function denoted as h (xi ; θ):
E [Yi | xi ] = h (xi ; θ0 ) (11.6)
one can show again by extending Theorem 7.1 that:
N
1 X
− E [Yi − h (xi ; θ)]2 (11.7)
θ0 = arg max lim
θ∈Θ N →∞ N
i=1
353
11.1. Criterion Functions
For the present discussion, it is useful to define the following two objects.
Borrowing from Maximum Likelihood terminology, the observation-specific
score si (xi ; θ) for i = 1, . . . , N of an M-estimator is defined as the vector
of first derivatives of q (xi ; θ), with respect to the parameter set θ:
∂q(xi ;θ)
∂θ1
∂q(x i ;θ)
∂q (xi ; θ) ∂θ2
si (xi ; θ) = = .
∂θ ..
∂q(xi ;θ)
∂θK
again for i = 1, . . . , N . Notice that both the score vector and the Hessian
matrix only exist under certain conditions, specifically if the M-Estimation
objective function is, respectively, at least once or twice continuously differ-
entiable. These conditions might not be respected for the objective function
of some important econometric estimators, such as the quantile regression.
The score and the Hessian matrix are instrumental for the characterization
of the identification conditions for M-Estimators.
Theorem 11.1. Identification of M-Estimators. In any M-Estimation
environment, the “true” parameter set θ0 is locally point identified if the
following limiting average Hessian matrix evaluated at θ0 has full K rank.
N
1 X
Q0 ≡ lim E [Hi (xi ; θ0 )]
N →∞ N
i=1
354
11.1. Criterion Functions
si (yi , xi ; β) = 2xi εi
Hi (yi , xi ; β) = −2xi xT
i
and the requirement that the above matrix must have full rank in order for
the OLS estimator to be identified is quite a familiar condition.
Example 11.4. Identification of NLLS. In the NLLS case, by denoting
the error term by εi ≡ yi − h (xi ; θ), the score and the Hessian are:
∂
si (yi , xi ; θ) = 2 h (xi ; θ) · εi
∂θ
∂ ∂ ∂2
Hi (yi , xi ; θ) = −2 h (xi ; θ) T h (xi ; θ) + 2 h (xi ; θ) · εi
∂θ ∂θ ∂θ∂θT
and note that, by the Law of Iterated Expectations:
∂2 ∂2
E h (xi ; θ0 ) · εi = Ex h (xi ; θ0 ) · E [εi | xi ] = 0
∂θ∂θT ∂θ∂θT
where:
∂
h0i ≡ h (xi ; θ0 )
∂θ
is the derivative of the CEF evaluated at xi and at the true parameters θ0 .
The identification of NLLS is generally evaluated in terms of the matrices
E h0i h0i , and the probability limits of their sample averages.
T
In practical applications, it is customary to verify that the sample mean
of the Hessian (like N −1 XT X in the OLS case) has full rank, as an indication
that the model is identified. In addition, it is useful to check that the rows
or columns of the Hessian’s sample mean are not too correlated ; otherwise,
identification is said to be weak, and the estimates are usually very imprecise
with large standard errors. This problem is called quasi-multicollinearity
355
11.1. Criterion Functions
where here fz,ε (·) is the probability mass or density function associated with
Fz,ε (·), whereas s−1ε (·) is the solution of the structural relationship with
respect to the unobservable factors εi (assuming that such a unique inverse
exists). In this environment, MLE corresponds with the M-Estimator that
is defined for a criterion function equaling the logarithm of fz,ε (·); write
this function succintly as ` (xi ; θ) where again xi = (yi , zi ).
q (yi , zi ; θ) = log fz,ε zi ; s−1
ε (yi , zi ; θ) ; θ
≡ ` (xi ; θ)
This characterization perfectly complies with the definition of M-Estimators
since the true value of the parameters θ0 , by definition, satisfies:
N
1 X
θ0 = arg max lim E [` (xi ; θ)] (11.9)
θ∈Θ N →∞ N
i=1
356
11.1. Criterion Functions
where X is the joint support of xi = (yi , zi ), and dxi is the joint differential.
Note that the expectation must be evaluated by integrating over fx (xi ; θ0 ),
as this is the function which is assumed to generate the data. These rela-
tionships hold for any for any θ ∈ Θ; hence over the entire parameter space
it is E [` (xi ; θ0 )] ≥ E [` (xi ; θ)] for i = 1, . . . , N and (11.9) must hold.
A variant of this approach which is equally valid, and at the same time
typically more practical, is that where a researcher only specifies the condi-
tional distribution of the unobserved factors εi , given the realizations yi of
the exogenous variables: thus, the criterion function is specified as follows.
q (yi , zi ; θ) = log fz,ε s−1
ε (y i , zi ; θ) zi
357
11.1. Criterion Functions
N
!−1 N
X X
β
b M LE = xi xT
i x i yi
i=1 i=1
PN 2
i=1 yi − xT
i βM LE
b
b2M LE =
σ
N
and the Second Order Conditions are satisfied for a maximum. Therefore,
this estimator exists and is unique as long as matrix i=1 xi xT i has full
PN
rank. Note that the Maximum Likelihood estimator of β is identical to the
corresponding OLS estimator of the linear regression model. The estimator
for the error variance parameter σ b2M LE differs, however, from the unbiased
estimator σ b from the small sample analysis of OLS, as the latter is larger
2
by the factor N N −K
. This is just one particular example of a general feature
of MLE: while this approach might produced biased estimators, these are
in general consistent and at least as efficient as their unbiased counterparts.
This Maximum Likelihood estimator can alternatively be obtained un-
der more general assumptions. Suppose that xi is not fixed; without specify-
ing its full data generation process, assume that conditional on any realiza-
tion xi , the error term is normal with constant variance: εi | xi ∼ N (0, σ2 ).
Since εi = yi − xT i β, this implies the following conditional density function:
T
2 !
1 y i − x i β
f Y |x yi | xi ; β, σ2 = √
exp − 2
2πσ 2 2σ
with the same associated likelihood function as above. This shows that the
CMLE approach here delivers the same result as the simple (but unrealistic)
assumption that xi is fixed. In fact, xi is allowed to follow any distribution,
so long as εi is normal when conditioning on it.
358
11.2. Asymptotics of Maximum Estimators
Q (θ) Q0 (θ)
Q0 (θ) ±
θ
θ0 θ1
359
11.2. Asymptotics of Maximum Estimators
If, in this example, the sampling error increases at higher values of θ, the
local maximum θ1 could be mistaken for the actual global maximum θ0 .
Uniform convergence is ensured if these four conditions hold: i. q (xi ; θ)
is continuous; ii. Θ is a compact set; iii. E [|q (xi ; θ)|] < ∞: that is, q (xi ; θ)
has a bounded first absolute moment; iv. q (xi ; θ) is Borel-measureable on
its support. These conditions, together, allow to invoke a result known as
Uniform Weak Law of Large Numbers, implying uniform convergence.
These conditions are technical; notice, however, that i. and ii. relate to the
fact that M-Estimators are, in fact, maximum points; while iii. and iv. are
analogous to similar conditions from other Laws of Large Numbers. Armed
with the notion of uniform convergence, one can replicate the original proof
of M-Estimators’ consistency as it was given by Newey and McFadden (their
Theorem 2.1, which is reported here with minor variations).
Theorem 11.2. Consistency of M-Estimators. If i. Q0 (θ) is uniquely
maximized at θ0 , ii. Θ is a compact set, iii. Q0 (θ) is a continuous function,
and iv. Q
bN (θ) uniformly converges in probability to Q0 (θ), then it follows
that M-Estimators are consistent as per (11.10).
Proof. For any > 0, with probability approaching 1 (w.p.a. 1);
bN (θ0 ) −
by i.: Q
bN θ bM > Q (a)
3
by iv.: Q0 θbM > Q bN θbM − (b)
3
by iv.: bN (θ0 ) > Q0 (θ0 ) −
Q (c)
3
therefore, w.p.a. 1:
(a)
bN (θ0 ) − 2 > Q0 (θ0 ) −
(b) (c)
Q0 θ bM > Q bM − > Q
bN θ
3 3
hence, Q0 θ bM > Q0 (θ0 ) − w.p.a. 1. Now, denote by U any given open
neighborhood of θ0 and by Uc its complement in Θ. Also define,
Q0 (θ∗ ) = sup Q0 (θ)
θ∈Θ∩Uc
for some θ∗ , and notice that Q0 (θ∗ ) < Q0 (θ0 ) by i.-ii.-iii.: thus, by setting:
= Q0 (θ0 ) − Q0 (θ∗ )
it follows that:
Q0 θ bM > Q0 (θ∗ )
bM > Q0 (θ0 ) − ⇒ Q0 θ
p
implying that θ
bM ∈ U for any open neighborhood U. Thus, θ
bM → θ0 .
360
11.2. Asymptotics of Maximum Estimators
Proof. This derivation is reminiscent of the proofs for Theorems 6.17 and
6.18, respectively for MM and MLE estimators, in Lecture 6; this one is in a
way more general as it allows for possibly non i.i.d. data. Since, by condition
ii. the score function is assumed to be continuous and differentiable, then
by the Mean Value Theorem one can write:
si xi ; θM = si (xi ; θ0 ) + Hi xi ; θN θM − θ0
b e b
361
11.2. Asymptotics of Maximum Estimators
where θ
eN is some convex combination of θ
√
bM and θ0 . By summing over the
N observations and dividing by N , one gets:
N
1 X b
0= √ si xi ; θM
N i=1
N
" N
#
1 X 1 X e √ b
=√ si (xi ; θ0 ) + Hi xi ; θ N N θM − θ0
N i=1 N i=1
by recalling that the sample score evaluated at the solution is equal to zero
by definition of M-Estimators. The expression above can be rewritten as:
−1 1 X
" N
# N
√ 1 X
N θM − θ0 = −
b Hi xi ; θ N
e √ si (xi ; θ0 ) (11.15)
N i=1 N i=1
as v. lets invert the average Hessian matrix. Next, consider the following.
1. By i. and v. one can apply some suitable Law of Large Numbers to
the “sample-averaged” Hessian matrix, showing that:
N
1 X e p
H i x i ; θ N → Q0 (11.16)
N i=1
p
which follows from the Continuous Mapping Theorem since θ eN → θ0 .
2. Condition iii. implies ∂Q(θ 0)
= limN →∞ N1 Ni=1 E [si (xi ; θ0 )] = 0 and:
P
∂θ
N
1 X p
si (xi ; θ0 ) → 0
N i=1
hence, by condition iv. and the Continuous Mapping Theorem, it is
as follows.
N
1 X d
√ si (xi ; θ0 ) → N (0, Υ0 )
N i=1
Again, Slutskij’s Theorem and the Cramér-Wold Device allow to recombine
these intermediate results so to show (11.13).
As usual, the asymptotic sandwiched variance-covariance is not imme-
diately workable for statistical inference since matrices Q0 and Υ0 are un-
known and must be estimated. The “bread” Q0 is asymptotically evaluated
as:
N
1 X b p
QN ≡
b H i x i ; θ M → Q0 (11.17)
N i=1
362
11.2. Asymptotics of Maximum Estimators
where both observations and scores are also indexed by the group or cluster
c = 1, . . . , C which they belong to. Similar extensions to HAC estimation do
exist, although in order to account for dependent observations in practical
applications of M-Estimators, CCE is overwhelmingly preferred because it
p
is much easier to implement. For any appropriate estimator Υ bN → Υ0 , the
variance-covariance matrix of M-Estimators is estimated as follows.
bM = 1 Q
[ θ
Avar b −1 Υ b −1
b NQ (11.21)
N N N
363
11.2. Asymptotics of Maximum Estimators
364
11.3. The Trinity of Asymptotic Tests
d
fH0 →
W χ2L
that is, under the null hypothesis H0 the test statistic has a limiting χ2L
distribution with L degrees of freedom. This result appears more intuitive
upon comparing the generalized Wald Statistic to its original version from
the linear model, where v (β) = Rβ − c = 0 is some linear function. There,
the original Wald Statistic is just a particular case of Hotelling’s t-squared
statistic, which asymptotically follows the chi-squared distribution as per
Observation 6.2. This non-linear case is analogously derived after applying
the Delta Method to the central matrix of the quadratic form.
Example 11.8. Generalized Wald statistic and a linear constraint.
Suppose that interest lies in a specific hypothesis about the linear model:
K
X K
X
H0 : βk = 1 H1 : βk 6= 1
k=1 k=1
365
11.3. The Trinity of Asymptotic Tests
is the expression of the “Distance” Statistic in all cases but MLE, while
h i
d
LRH0 = 2 log QN θM − log QN θ
b b b bV → χ2L (11.26)
is the expression of the “Likelihood Ratio” for MLE, where QbN (θ) = LbN (θ)
is the empirical likelihood function (notice a difference in the scaling fac-
tor). Intuitively, the test is comparing how much gain is there to make, in
terms of explaining the data, by letting the model to be estimated “freely”
without the restriction. Clearly, the unrestricted model will always perform
statistically better at fitting the data; the question is “how much better”
with respect to the researcher’s a priori hypotheses.
Example 11.9. The distance test and a linear constraint. One can
test the same hypothesis as in Example 11.8 through the estimation of a
“restricted” model, such as:
K−1
!
X
Yi = β0 + β1 X1i + β2 X2i + · · · + βK−1 X(K−1)i + 1 − βk XKi + i
k=1
366
11.3. The Trinity of Asymptotic Tests
which can also be written as follows, with Ẍki ≡ Xki −XKi for k = 1, . . . , K.
Yi − XKi = β0 + β1 Ẍ1i + β2 Ẍ2i + · · · + βK−1 Ẍ(K−1)i + εi
In this example, the last coefficient of the original model is forced to conform
to the restriction that is implied by the null hypothesis; yet imposing the
restriction on any other coefficient (except β0 ) is equivalent. The Distance
Test is computed in this case as:
" N N
#
X 2 X 2
d
DH =
0 yi − xKi − ẍT β
i
bV − y i − xT β
i
b OLS → χ2
i=1 i=1
where β
b V are the parameter estimates from the restricted model above.
where, as appropriate:
N
" #
[ √1
X
Υ
b v = Avar si (xi ; θv )
N i=1
and it is indicative of how statistically relevant are the deviations in (11.27),
because it computes a quadratic form of their standardized values.
367
11.4. Quasi-Maximum Likelihood
tice, the method based on the so-called outer product of the gradients
(OPG), that is using Υ b N , is often favored due to computational consider-
ations. In fact, statistical softwares routinely compute scores in order to
perform MLE, and the OPG adds little computational cost to the problem
of estimating the sample variance-covariance.
1
In the treatment developed in Lecture 6, the two alternative options are expressed
through the notation Hb N and J
bN , respectively.
368
11.4. Quasi-Maximum Likelihood
which is symmetric. By plugging the MLE estimates into the above matrix,
summing it over all the observations and taking the inverse of the result
one obtains a consistent estimate of the opposite of the information matrix:
−1
" N # " P −1 #
b −1
Q N T
b2M LE
i=1 xi xi σ 0
X
N
= Hi y i , xi ; θ
bM LE =−
N i=1 0T 2 4
σ
N M LE
b
where the border elements (except the lower bottom one) are equal to zero
because they are proportional to the first K elements of the sample score –
that is, the K normal equations.2 An equivalent way to obtain the estimator
of interest is to calculate the outer product of the gradients. By the above
expression for si (yi , xi ; θ), it is easy to verify that:
−1
" N #
b −1
Υ X
N
= bM LE sT yi , xi ; θ
si yi , xi ; θ bM LE
i
N i=1
" P −1 #
N T 2
= i=1 xi xi σ
bM LE 0
T 2 4
0 σ
N M LE
b
2
The calculations to obtain the opposite of the bottom right element of this matrix
b2M LE ) are as follows.
(that is, the asymptotic variance of σ
N
!−1
N −4 X 2
−6
b2M LE = − Tb
[ σ
Avar σ − yi − xi βM LE σ
2 M LE i=1
b bM LE
−1 −1
N −4 N −4 2 4
=− σ b−4
bM LE − N σ = − − σ = σ
M LE
2 M LE N M LE
b b
2
For the outer product of the gradients the calculations are similar.
369
11.4. Quasi-Maximum Likelihood
that is, Υ
b −1 = −Q
N
b −1 as predicted by the information matrix equality. This
N
result highlights more clearly that the OLS estimator of the variance of β
in small samples (under the homoscedasticity assumption) differs from the
Cramér-Rao bound by a multiplicative factor of N N −K
.
Unfortunately, the desirable properties of MLE break down if the i.i.d.
hypothesis cannot be defended, since the information matrix equality fails.
To illustrate, let again xi = (yi , zi ) be the collection of all endogenous and
exogenous variables in the model, and allow for group dependence between
observations. In this case, the likelihood function can be factored between
clusters:
C
Y
L (θ| x1 , . . . , xN ) = fx1 ,...,xNc (x1c , . . . , xNc c ; θ)
c=1
but not further; the information matrix equality no longer applies. A simi-
lar argument applies to more general cases of spatial and time dependence,
as well as to the i.n.i.d. case where the observations are independent but not
identically distributed (for example, when the homoscedasticity assumption
which is implicit in the CMLE model described in Example 11.5 cannot be
maintained, even if the error terms are always conditionally normal). In all
these cases, MLE retains the sandwiched limiting variance of M-Estimators
as per (11.13), and Υ0 must be estimated according to the working assump-
tions – for example, by formula (11.20) under group dependence.
As it has been already observed, the almost ideal asymptotic properties
of MLE break down even if the data are generated from a random (i.i.d.)
sample, but the likelihood function is misspecified, that is it does not match
the “true” data generation process in the population under examination. It
is interesting to investigate the consequences of misspecifation, that is of es-
timating a model via MLE while assuming a wrong underlying distribution,
since this can occur frequently in practice. In such cases, the estimator of
interest is called the Quasi-Maximum Likelihood Estimator θ bQM LE ,
and it is useful to characterize its probability limit, which is commonly
called the pseudo-true value θ∗ :
p
bQM LE = arg max L (θ| x1 , . . . , xN ) →
θ θ∗ (11.30)
θ∈Θ
where the probability limit is evaluated with respect to the true distribution.
A relevant question is whether the QMLE is consistent, that is θ∗ = θ0 .
In fact, in the main example of MLE examined so far – the linear model
under normality assumptions – it can be observed that the ML estimator
of β0 coincides with the standard OLS estimator, so it is consistent if the
standard conditional mean assumption for linear models E [εi | xi ] = 0 holds,
370
11.4. Quasi-Maximum Likelihood
371
11.4. Quasi-Maximum Likelihood
The First Order Conditions of the MLE problem, expressed as the sum of
the individual scores, are:
N
X XN h i
si yi , xi ; βM LE =
b xi yi − exp xT
i β
b M LE = 0
i=1 i=1
and they lack a closed form solution; consequently, the estimator in question
must be obtained by numerical methods. The empirical Hessian matrix is:
N N
1 X 1 X
QN =
b Hi yi , xi ; βM LE = −
b exp xT
i
b M LE · xi xT
β i
N i=1 N i=1
372
11.4. Quasi-Maximum Likelihood
that is, the QMLE converges in probability to the pseudo-true value, which
is at the same time the maximizer of the pseudo-population likelihood and
the minimizer of the KLIC! In this sense, the probability limit of the QMLE
minimizes a well-defined criterion of distance between the assumed and the
true density, and as such it is thinkable as some “best approximation” of sort.
Mirroring the analogous discussion of Least Squares as best approximation
of the CEF, this is not an excuse for disregarding the problem of correctly
specifying the likelihood function! However, it motivates the applied prac-
tice of enriching MLE models with flexible parametric specifications (say,
polynomials of xi ) in order to best approximate the true distribution.
373
11.5. Introduction to Binary Outcome Models
374
11.5. Introduction to Binary Outcome Models
yi = x T
i β0 + i , yi ∈ {0, 1}
P i = 1 − xT T
i β0 x i = x i β0
P i = −xT T
i β0 xi = 1 − xi β0
E [i | xi ] = 1 − xT
T T T
i β 0 x i β0 − x i β0 1 − x i β0 = 0
hence OLS applied to this model would still produce unbiased and consistent
estimates of β0 even if the problem is naturally heteroscedastic (something
that is normally addressed either via “robust” standard errors or, in small
samples, via FGLS). The main issues of the LPM depend on the fact that
the linear conditional expectation xTi β0 cannot be constrained to lie within
the (0, 1) interval. This implies that:
1. the conditional variance of the error term might take negative values;
Var [i | xi ] = xT T
i β0 1 − x i β0 R 0
375
11.5. Introduction to Binary Outcome Models
G (xi , β0 ) = Fx xT (11.32)
i β0 = Fx (λi )
yi∗ = xT β0 + εi (11.33)
(i
1 if yi∗ > α0
yi = (11.34)
0 if yi∗ ≤ α0
376
11.5. Introduction to Binary Outcome Models
where the fourth line exploits the symmetry of Fx (·). This fact reconciles
the latent variable model with our specification of the conditional probabil-
ity for the outcome Yi .
Before getting to practical aspects and the MLE estimation of models
with binary outcomes, two observations need to be made.
• Latent variable models are not specific of binary outcomes: multinomial
LDV models are usually motivated by more complex versions, which are
outside the scope of this overview. Latent variable models are also used
in the structural analysis of empirical strategic games.
• Derivation (11.36) above shows why Fx (·) should not contain a variable
scale parameter, such as the variance (if Fx (·) is, say, normal, its variance
should be known or normalized, e.g. σ2 = 1); otherwise the K parameters
in β0 and the scale parameter would not be separately identified. To see
intuitively why, consider the case where α0 = 0 and Fx (·) features some
scale parameter, call it σ. Here, the two equations:
yi∗ = xT
i β0 + εi
σyi∗ = σ xT
i β0 + εi
intuition behind this is that one can only observe whether the latent
variable takes values above (Yi = 1) or below (Yi = 0) its hypothesized
threshold, and not its variation as a function of the variation of xi . For a
similar reason, the scale parameter could be identified if α0 = 0 and xi
did not include any constant term. In such a case the “scale” parameter
would be identified by changes in the average value of Yi that are not
explained by xi . However, the basic fact that scale parameters are not
independently identified remains. In general, there is seldom a reason to
include a scale parameter instead of a constant location parameter.
377
11.5. Introduction to Binary Outcome Models
YN
yi 1−yi
L β| {(yi , xi )}N F x xT 1 − Fx xT
i=1 = i β i β
i=1
where the First Order Conditions (the sum of the individual scores, evalu-
ated at the estimates) are:
N
∂
N
X
log L βM LE {(yi , xi )}i=1 =
b si yi , xi ; βM LE =
b
∂β i=1
XN y i f x xT i β
b M LE (1 − y i ) f x x Tb
i β M LE
= − xi = 0
Tb Tb
i=1 Fx xi βM LE 1 − Fx xi βM LE
378
11.6. Simulated Maximum Estimation
In practice, the two distributions are very similar (they are both bell-shaped,
although the logistic has fatter tails) and the two models usually produce
similar sets of results which are easily compared against one another.
After having estimated a probit or a logit model, one must be careful
at interpreting the estimates of β! In fact, while the linear specification of
the latent variable might induce some confusion, a coefficient βk is neither
the causal effect of Xik on Yi nor the predicted change in the probability to
get Yi = 1 following some unitary increase in variable Xik . The best way to
interpret the estimated parameters is by calculating the marginal effects.
For all the explanatory variables in xi , these are characterized as:
∂ P (Yi = 1| xi ) ∂ E [Yi | xi ]
=
∂xi ∂xi
∂Fx xT i β
=
∂xi
= fx x T
i β β
and they are a function of the data for any value of β. There are two ways
to calculate marginal effects that meaningful for interpretation’s sake:
379
11.6. Simulated Maximum Estimation
380
11.6. Simulated Maximum Estimation
where the expectation is taken over the support of ui , then by some suitable
Law of Large Numbers:
p
fbx (xi | θ) → fx (xi | θ) (11.43)
381
11.6. Simulated Maximum Estimation
382
11.6. Simulated Maximum Estimation
θ
bBCM SL =
h i2
XN XS f x|u (xi | us ; θ) − fx,S (xi | θ)
e b
= arg max log f (x | θ) +
x,S i
b h i2
θ∈Θ i=1 s=1 2S fbx,S (xi | θ)
(11.45)
given that the inner summation inside the brackets on the right-hand side
is easily motivated as a consistent estimator of the second-order term of
the above Taylor expansion for each observation i = 1, . . . , N . Researchers
shall consider this extended estimator if they are concerned about the size
of S relative to N in a practical environment.
The theory of simulated M-Estimators extends beyond Maximum Likeli-
hood: if a generic M-Estimator is defined in terms of an observation-specific
criterion q (xi ; θ) that is based upon integrals without closed form solution,
a simulation approach is rendered necessary. A Simulated M-Estimator
(SM), of which MSL is a special case, is defined as:
N
1 X
θSM = arg max
b qbS (xi ; θ) (11.46)
θ∈Θ N i=1
383
11.7. Applications of Maximum Estimation
where H b Si (xi ; θ) = bsSi (xTi ;θ) = ∂ qbS (xiT;θ) . Expressions that extend (11.47) to
∂θ ∂θ∂θ
the CCE or HAC cases can be derived. Notice that in the MSL case, the
estimator of the information matrix (under i.i.d. observations) is typically
obtained through the outer product of the gradients as:
XS ∂ fe
x|u x i | u s ; θ
b SM L X S ∂ fe
x|u x i | u s ; θ
b SM L
N ∂θ ∂θ T
X s=1 s=1
p
→ I (θ0 )
PS e P
S
i=1 f x | u ; θ f x | u ; θ
s=1 x|u i s SM L i s SM L
b b
s=1 x|u
e
384
11.7. Applications of Maximum Estimation
where Yi is output, Ki and Li are capital and labor, αK > 0 and αL > 0
are the respective so-called saliency parameters that determine the relative
importance of each input, ρ > 0 is a parameter related to the elasticity of
substitution between inputs, which as the model’s name goes in this model
is constant and writes σ = (1 + ρ)−1 ∈ (0, 1), while εi is an error term. It
can be shown that as ρ → 0, (11.49) becomes a Cobb-Douglas production
function like (7.42) where αK = βK and αL = βL .
This model must obviously be estimated by NLLS via numerical meth-
ods, and even the simplest case (11.49) is known to entail complications. A
typical estimation algorithm involves splitting the problem as follows:
" N
#
X 1
2
(b
ρ, α b L )N LLS ∈ arg min arg min
bK , α yi − [αK ki + αL li ] ρ
ρ∈R++ (αK ,αL )∈R2++ i=1
385
11.7. Applications of Maximum Estimation
hi = xT
i β + εi (11.50)
where hi represents worked hours over a given time frame or other measures
of labor supply (for example days or weeks in the case of seasonal jobs).
A problem is that hi is only observed for women who do actually work:
where there is some latent variable zi∗ , depending on a possibly different set
of characteristics wi , which represents the individual cost-benefit evaluation
on whether to work or not. The difference with binary outcome models is
that in this case if an individual does participate to the labor market, as
specified by the participation equation (11.51) and by the assignment rule
(11.52), the intensity of her work is observed as a continuous variable. In
fact, the ultimate objective of the researcher is to estimate a model such
as (11.50) for the determinants of the intensity variable hi , and not merely
a binary outcome model for participation per se. Notice that this model
could be alternatively specified for other intensity variables hi such as the
market wage for women; in other variations of this model interest may lie
in both the quantity (hours) and the price (wage) variables.
Unfortunately, OLS cannot estimate (11.50) consistently. Denoting by
Hi the random variable whence the observations of hi are drawn, it is:
where λ (wi ) ≡ E εi | υi > −wiT γ 6= 0 as long as the two error terms are
386
11.7. Applications of Maximum Estimation
For example, a woman with a wealthy husband who is happy supporting her
will be both less inclined to work and to work many hours if she works at
all (the husband’s income is an element of both xi and wi ). If, in addition
to this, the natural inclinations of the woman in question are correlated for
both the participation (υi ) and the intensity (εi ) decisions – a natural state
of things – all the conditions for a sample selection bias are present.
Heckman (1977) devised a solution to this problem that was worth him
the Nobel Prize in Economics. This solution, which is also known with the
name of heckit in analogy with probit, logit and other models with LDV
components, is based on a parametric assumption about the two error terms
(εi , υi ) such as a bivariate normal distribution:
2
εi 0 σ ρ
∼N ,
υi 0 ρ 1
where ρ is the correlation coefficient between the two errors, σ2 is the vari-
ance of the error of the intensity equation while the corresponding variance
for the participation equation is normalized to 1 since it is not identified in
binary LDV models. Thus, all the parameters of the model could be in prin-
ciple estimated via MLE by specifying an appropriate likelihood function
that accounts for the common dependence of the two equations.
However, Heckman also proposed an alternative procedure that is much
easier to implement, while still requiring the bivariate normal assumption:
1. run a probit on the participation equation (11.52) and obtain γ
b M LE ;
2. for each observation, calculate the inverse Mills ratio:
" #
φ wiT γ
b M LE
λi =
Φ (wiT γ
b M LE )
where φ (·) and Φ (·) are, respectively, the density and cumulative func-
tions of the standard normal distribution;
3. run OLS on a modified intensity equation:
hi = xT
i β + ρλi + εi (11.53)
where the correlation parameter ρ is the OLS coefficient for λi .
Under the assumptions of the heckit model this procedure produces consis-
tent estimates of β. A disadvantage of this approach is that the standard
errors of the OLS step are inconsistently estimated, because they do not ac-
count for the joint distribution of (εi , υi ). However, “resampling” techniques
such as the bootstrap can address this.
387
11.7. Applications of Maximum Estimation
a sentence that scared any economist who would think about actually trying
to test the suspicion. Quoting Samuelson’s words in the introduction of his
paper, Bresnahan (1987) developed a methodology for detecting collusion in
oligopolies that back then was quite innovative, becoming a starting point
of the “empirical revolution” in IO. In fact, he was able to show statistically that, for the 1955 episode, hypotheses other than a momentary price war were unlikely.
Bresnahan models the automobile industry as one with $N$ types of cars, each with quality $X_i = X(z_i, \beta)$ being a function of that car's characteristics
zi given parameters β. Qualities can be ordered from best to worst: without
loss of generality, Xi > Xh if i > h. He provides microfoundations for the
demand functions of each car, defined for each year t = 1, . . . , T as
$$Q^D_{it} = D\left(P_{ht}, P_{it}, P_{jt}, X_{ht}, X_{it}, X_{jt}, \gamma\right) \qquad (11.54)$$
where Qit is the quantity of product i, Pit its price, h, i, j are three consec-
utive products in the order of qualities and γ are some parameters. This
specification makes prices and quantities only dependent, in equilibrium, on
those of the “neighbors” of one product in the product space, and follows
from a particular specification of consumers’ utility.
with $c(X_{it}) = \mu \exp(X_{it})$; and he distinguishes the following two scenarios.
1. Competition: in this case each firm sets its own price Pit by taking the
price of neighbors h and j as given, with First Order Conditions:
2. Cooperation: in this case the firm(s) selling two products, say, $i$ and $j$, would set prices $P_{it}$ and $P_{jt}$ so as to maximize their joint profits, with First Order Conditions for the $i$-th price:
where the demand function parameters γ enter via the derivative of the
demand functions implied in the First Order Conditions.
By setting the equilibrium condition $Q^D_{it} = Q^S_{it} = Q^*_{it}$ (and similarly $P_{it} = P^*_{it}$ for equilibrium prices), one can derive the reduced form of the model,
which is more easily obtained by solving for both the demand functions and
the supply side First Order Conditions simultaneously. The last assumption
is that the actual prices and quantities differ from their theoretical, reduced
form values by a pair of normally distributed error terms:
$$\begin{pmatrix} P_{it} - P^*_{it} \\ Q_{it} - Q^*_{it} \end{pmatrix} = \begin{pmatrix} \xi^P_{it} \\ \xi^Q_{it} \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma^2_P & 0 \\ 0 & \sigma^2_Q \end{pmatrix} \right)$$
where the variances of the two error terms reflect heteroscedasticity. Hence,
the likelihood function can be written in terms of data realizations as:
$$\mathcal{L}\left(\beta, \gamma, \mu \,\middle|\, H_t, \{p_{it}, q_{it}, z_i\}_{i=1}^N\right) = \prod_{t=1}^{T}\prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma_P^2}} \exp\left(-\frac{\left(\xi^P_{it}\right)^2}{2\sigma_P^2}\right) \times \prod_{t=1}^{T}\prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma_Q^2}} \exp\left(-\frac{\left(\xi^Q_{it}\right)^2}{2\sigma_Q^2}\right)$$
The associated test statistic would reject the null if $H^1_t$ fits the data significantly better than $H^0_t$. Thanks to this procedure, Bresnahan was able to show statistically that some car producers had been colluding in the US market in all years but 1955: Paul Samuelson must not have been happy.
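In code, the log of the likelihood above is just a sum of normal log-densities of the price and quantity residuals. Below is a minimal sketch (assuming numpy); the $T \times N$ arrays xi_P and xi_Q of reduced-form deviations, which in practice depend on $\beta$, $\gamma$ and $\mu$ under the maintained hypothesis $H_t$, are hypothetical inputs.

```python
import numpy as np

def bresnahan_loglik(xi_P, xi_Q, sigma2_P, sigma2_Q):
    """Gaussian log-likelihood of the price and quantity deviations (sketch)."""
    ll_P = -0.5 * np.sum(np.log(2 * np.pi * sigma2_P) + xi_P ** 2 / sigma2_P)
    ll_Q = -0.5 * np.sum(np.log(2 * np.pi * sigma2_Q) + xi_Q ** 2 / sigma2_Q)
    return ll_P + ll_Q
```

Maximizing this function over the model's parameters under each competing hypothesis, year by year, delivers the fitted likelihoods that such a test compares.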
It must be remarked that while the Bresnahan model is still a nice exam-
ple of a structural model in IO that simultaneously incorporates both the
demand and supply side within an elegant MLE framework, by today’s stan-
dards it certainly feels antiquated and “mechanical.” The current practice
in Industrial Organization favors the use of random coefficients multinomial
LDV models that incorporate the supply side while attempting to correct
for endogeneity of prices and product characteristics through instrumental
variables – all within a larger Generalized Method of Moments framework.
Lecture 12
12.1. Generalizing the Method of Moments
Intuitively, the GMM estimator picks the value $\hat\theta_{GMM}$ that minimizes the distance of all the empirical moments from their expected “true” value (zero). Such a distance is measured as a quadratic form that employs the matrix $A_N$ in order to “weigh” the relative importance of different moments, as is clarified later. Some important observations are in order.
1. The First Order Conditions of the problem are as follows:
$$\frac{\partial}{\partial\theta} g_N^T\left(\hat\theta_{GMM}\right) \cdot A_N \cdot g_N\left(\hat\theta_{GMM}\right) = 0 \qquad (12.6)$$
note that the term that pre-multiplies $A_N$, the transposed Jacobian of the empirical moment conditions evaluated at the solution, is a $K \times J$ matrix. Given that an analytic solution is generally not available, the GMM estimator is typically obtained numerically (see the sketch after this list).
2. The GMM estimator resembles – indeed, is – an M-Estimator. Define:
$$G_0(\theta) \equiv \lim_{N\to\infty} \frac{1}{N}\sum_{i=1}^{N} \mathrm{E}\left[g\left(x_i;\theta\right)\right]^T A_0\, \mathrm{E}\left[g\left(x_i;\theta\right)\right] \ge 0 \qquad (12.7)$$
and one can see that a unique solution is obtained if the $J \times K$ matrix $G_0$, which is defined as:
$$G_0 \equiv \lim_{N\to\infty} \frac{1}{N}\sum_{i=1}^{N} \mathrm{E}\left[\frac{\partial}{\partial\theta^T} g\left(x_i;\theta_0\right)\right] \qquad (12.9)$$
has full column rank $K$, as otherwise many combinations of parameters are equally capable of minimizing $G_0(\theta)$. Note that under identically distributed observations, it is $G_0 = \mathrm{E}\left[\frac{\partial}{\partial\theta^T} g\left(x_i;\theta_0\right)\right]$.
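As announced in point 1 above, here is a minimal numerical sketch of the GMM minimization in Python (assuming numpy and scipy; the per-observation moment function g and the weight matrix A_N are hypothetical user inputs):

```python
import numpy as np
from scipy.optimize import minimize

def gmm_estimate(g, data, A_N, theta0):
    """Numerical GMM: minimize g_N(theta)' A_N g_N(theta)  (sketch).
    g      : function (x_i, theta) -> J-vector of moment contributions
    data   : iterable of observations x_i
    A_N    : J x J positive semi-definite weight matrix
    theta0 : starting value for the K-dimensional parameter
    """
    def g_N(theta):
        # empirical moments: average of g(x_i; theta) over the sample
        return np.mean([g(x, theta) for x in data], axis=0)

    def objective(theta):
        m = g_N(theta)
        return m @ A_N @ m

    return minimize(objective, theta0, method="BFGS").x
```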
where, as usual, $\tilde\theta_N$ is a convex combination of $\hat\theta_{GMM}$ and $\theta_0$, while $G_N(\theta)$ is defined as:
$$G_N(\theta) \equiv \frac{1}{N}\sum_{i=1}^{N} \frac{\partial}{\partial\theta^T} g\left(x_i;\theta\right)$$
in analogy with $G_0$. Plugging (12.15) into the First Order Conditions (12.6) delivers the expression:
$$G_N^T\left(\hat\theta_{GMM}\right) A_N \left[\sqrt{N}\, g_N\left(\theta_0\right) + G_N\left(\tilde\theta_N\right) \sqrt{N}\left(\hat\theta_{GMM} - \theta_0\right)\right] = 0$$
whereas the following applies under clustering (HAC extensions also exist):
$$\hat\Omega_{CCE} = \frac{1}{N}\sum_{c=1}^{C}\sum_{i=1}^{N_c}\sum_{j=1}^{N_c} g_{ic}\left(x_{ic}; \hat\theta_{GMM}\right) g_{jc}\left(x_{jc}; \hat\theta_{GMM}\right)^T \xrightarrow{p} \Omega_0 \qquad (12.19)$$
where, for $\tilde G_0 \equiv \Omega_0^{-\frac{1}{2}} G_0$, it is:
$$M_{\tilde G_0} \equiv I - \tilde G_0\left(\tilde G_0^T \tilde G_0\right)^{-1} \tilde G_0^T$$
and takes its name from the fact that, whenever $\hat G_N(\theta)$ is numerically optimized, the weight matrix changes at every iteration. Monte Carlo simulations show this estimator to perform better than the two-step procedure, and to make overidentification tests more reliable (Hansen et al., 1996). However, it is also very computationally demanding.
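For comparison, here is a minimal sketch of the two-step feasible procedure referenced above, which keeps the weight matrix fixed within each minimization; it reuses the hypothetical gmm_estimate helper from the earlier sketch and estimates $\Omega_0$ under independent observations.

```python
import numpy as np

def two_step_gmm(g, data, theta0, J):
    """Two-step feasible GMM (sketch): identity weight first, then A_N = Omega_hat^{-1}."""
    # Step 1: preliminary estimate with the identity weight matrix
    theta_1 = gmm_estimate(g, data, np.eye(J), theta0)

    # Step 2: plug-in estimate of Omega_0 (independent observations), then re-estimate
    G = np.array([g(x, theta_1) for x in data])      # N x J moment contributions
    omega_hat = G.T @ G / len(G)
    return gmm_estimate(g, data, np.linalg.inv(omega_hat), theta_1)
```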
12.2. GMM and Instrumental Variables
which clearly has an analytic solution. In fact, the First Order Conditions of the problem above are:
$$-2\left[\frac{1}{N}\sum_{i=1}^{N} x_i z_i^T\right] A_N \left[\frac{1}{N}\sum_{i=1}^{N} z_i\left(y_i - x_i^T \hat\beta_{GMM}\right)\right] = 0$$
therefore:
$$\hat\beta_{GMM} = \left[\left(\sum_{i=1}^{N} x_i z_i^T\right) A_N \left(\sum_{i=1}^{N} z_i x_i^T\right)\right]^{-1} \left(\sum_{i=1}^{N} x_i z_i^T\right) A_N \left(\sum_{i=1}^{N} z_i y_i\right) \qquad (12.29)$$
which already resembles the 2SLS estimator. Note, in fact, that if one were to choose the weighting matrix $A_N$ as:
$$\tilde A_N = \left(\frac{1}{N}\sum_{i=1}^{N} z_i z_i^T\right)^{-1} = \left(\frac{1}{N} Z^T Z\right)^{-1}$$
this estimator would correspond exactly to the standard 2SLS estimator. The actual estimate of its variance would depend on the assumptions made by the researcher (standard heteroscedasticity, homoscedasticity, group dependence, etc.) but would anyhow be easily relatable to the “long” expression of the GMM asymptotic variance-covariance (12.16).
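A minimal numpy sketch of the closed form (12.29), together with the weighting matrix $\tilde A_N$ that reproduces 2SLS, follows; X, Z and y are hypothetical data arrays with N rows.

```python
import numpy as np

def linear_gmm(X, Z, y, A_N):
    """Closed-form linear GMM estimator (12.29) for a given weight matrix A_N."""
    XtZ, ZtX, Zty = X.T @ Z, Z.T @ X, Z.T @ y
    return np.linalg.solve(XtZ @ A_N @ ZtX, XtZ @ A_N @ Zty)

# With the 2SLS-equivalent weight matrix:
# N = Z.shape[0]
# A_tilde = np.linalg.inv(Z.T @ Z / N)
# beta_2sls = linear_gmm(X, Z, y, A_tilde)
```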
The theory of GMM, however, allows for additional efficiency gains. In
fact, if the weighting matrix AN were chosen as the inverse of the estimated
variance-covariance matrix of the moment conditions which is obtained un-
der the assumption that the observations are independent:
$$\hat A_N = \hat\Omega_N^{-1} = \left\{\widehat{\mathrm{Avar}}\left[\frac{1}{\sqrt{N}}\sum_{i=1}^{N} z_i\left(y_i - x_i^T\hat\beta_{GMM}\right)\right]\right\}^{-1} = \left(\frac{1}{N}\sum_{i=1}^{N} \hat e_i^2 z_i z_i^T\right)^{-1} = \left(\frac{1}{N} Z^T \hat E_N Z\right)^{-1}$$
where $\hat e_i \equiv y_i - x_i^T\hat\beta_{GMM}$ for $i = 1, \dots, N$ and $\hat E_N$ is as in (10.68). The
GMM estimator (12.30) would thus become:
$$\hat\beta_{GMM} = \left[X^T Z \left(Z^T \hat E_N Z\right)^{-1} Z^T X\right]^{-1} X^T Z \left(Z^T \hat E_N Z\right)^{-1} Z^T y \qquad (12.31)$$
which differs slightly from standard 2SLS. In fact, this kind of GMM estima-
tion retrieves a generalized version (in the GLS sense) of the overidentified
2SLS estimator. To better appreciate this, consider the estimated asymp-
totic variance of (12.31):
$$\widehat{\mathrm{Avar}}\left[\hat\beta_{GMM}\right] = \left[\frac{1}{N} X^T Z \left(Z^T \hat E_N Z\right)^{-1} Z^T X\right]^{-1}$$
$$\tilde A_N^{-1} = \frac{1}{N}\sum_{i=1}^{N} z_i z_i^T \xrightarrow{p} \mathrm{E}\left[z_i z_i^T\right]$$
so that the two estimators would asymptotically coincide: observe that $\sigma^2$ would cancel in the expression of the probability limit of (12.22). In less ideal scenarios, the equivalence collapses. Nevertheless, it has been observed that for linear models, the efficiency gains obtained thanks to the optimal variance GMM are small in comparison to the computational and empirical costs associated with its implementation – for example, if observations are dependent, a different estimator of $\Omega_0$ may be necessary in order to construct the optimal weighting matrix, and the resulting GMM estimator may be even less efficient than 2SLS if wrong choices are made. It is no surprise,
then, that current practice favors the use of the standard 2SLS estimator
coupled with appropriate estimators of its variance – most typically, the
heteroscedasticity- or the cluster-robust ones.
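Still, for reference, here is a minimal sketch of the “efficient” two-step linear GMM estimator (12.31) and of its estimated variance, using the heteroscedasticity-based weight discussed above; X, Z, y and the first-step residual vector e_hat are hypothetical inputs.

```python
import numpy as np

def efficient_linear_gmm(X, Z, y, e_hat):
    """Two-step linear GMM (12.31) with weight proportional to (Z' E_N Z)^{-1} (sketch).
    e_hat : residuals from a preliminary estimate of beta (e.g. plain 2SLS)
    """
    W = np.linalg.inv(Z.T @ (Z * e_hat[:, None] ** 2))   # (sum_i e_i^2 z_i z_i')^{-1}
    XtZ = X.T @ Z
    beta = np.linalg.solve(XtZ @ W @ Z.T @ X, XtZ @ W @ Z.T @ y)
    var = np.linalg.inv(XtZ @ W @ Z.T @ X)   # finite-sample variance estimate (Avar / N)
    return beta, var
```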
For a system of $P$ equations that share the $N \times Q$ instrument matrix $Z$, the stacked instruments are collected in the block-diagonal matrix:
$$I \otimes Z^T = \begin{pmatrix} Z^T & 0 & \cdots & 0 \\ 0 & Z^T & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & Z^T \end{pmatrix}$$
Notice that this Kronecker product has dimension $PQ \times PN$, where $PN$ is also the length of the error terms vector $\varepsilon$. In this case, the GMM problem is written as:
$$\hat\beta_{GMM} = \arg\min_{\beta\in\mathbb{R}^B}\; \left[\frac{1}{N}\left(I \otimes Z^T\right)\left(y - X\beta\right)\right]^T A_N \left[\frac{1}{N}\left(I \otimes Z^T\right)\left(y - X\beta\right)\right]$$
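The block-diagonal instrument matrix is conveniently built with a Kronecker product; a one-line numpy sketch (with $P$ hypothetical equations and an $N \times Q$ instrument matrix Z):

```python
import numpy as np

def stacked_instruments(Z, P):
    """Return I_P (x) Z', of dimension PQ x PN, for a system of P equations."""
    return np.kron(np.eye(P), Z.T)

# e.g. Z = np.ones((100, 3)); stacked_instruments(Z, 2).shape  ->  (6, 200)
```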
as in Examples 11.4 and 11.7, while A0 and Ω0 are as in the standard theory
of GMM. The estimated asymptotic variance of the NL2SLS estimator is:
$$\widehat{\mathrm{Avar}}\left[\hat\theta_{NL2SLS}\right] = \frac{1}{N}\left(\hat J_N^T A_N \hat J_N\right)^{-1} \hat J_N^T A_N \hat\Omega_N A_N \hat J_N \left(\hat J_N^T A_N \hat J_N\right)^{-1}$$
where:
N
bN = 1
X
J bT
zi h i
N i=1
Note that if points 1. and 2. above are maintained and, in addition, in the moment conditions (12.34) one specifies $z_i = h_i^0$ – so that the instruments enter as $z_i = \hat h_i$ in the estimation problem – GMM returns the standard NLLS estimator which is examined in Lecture 11. This is clearly analogous to setting $z_i = x_i$ in linear models, which returns standard OLS.
Optimal Instruments
When operationalized via GMM, conditional moment conditions like (12.27) constitute a framework for the semi-parametric estimation of a wide class of econometric models: in fact, the theory for instrumental variables in non-linear models can be extended to non-linear systems of structural equations as well, similarly to how 3SLS generalizes 2SLS. However, conditional moment conditions are even more general, since any function of the instruments $l(z_i)$ which takes values in a $J'$-dimensional set makes for valid moment conditions of the kind:
$$\mathrm{E}\left[l\left(z_i\right) \otimes h\left(y_i, z_i; \theta_0\right)\right] = 0$$
so long as $PJ' \ge K$, where $K$ is the total number of parameters. A relevant question is to what extent it is possible to construct appropriate optimal instruments so that the resulting GMM problem delivers the most efficient estimate available with the information contained in the conditional moment conditions (12.27). A result proved by several authors is that this objective is achieved through the $K \times P$ matrix $L(y_i, z_i; \theta_0)$ defined as:
$$L\left(y_i, z_i; \theta_0\right) = \mathrm{E}\left[\frac{\partial}{\partial\theta} h^T\left(y_i, z_i; \theta_0\right) \,\middle|\, z_i\right] \left\{\mathrm{Var}\left[h\left(y_i, z_i; \theta_0\right) \middle| z_i\right]\right\}^{-1} \qquad (12.36)$$
where the first term (the conditional expectation) is a K × P matrix, while
the second term (the inverted conditional variance) is a P × P matrix. The
efficient estimate of θ0 is then obtained through the following K “optimal”
moment conditions:
$$\mathrm{E}\left[g\left(y_i, z_i; \theta_0\right)\right] = \mathrm{E}\left[L\left(y_i, z_i; \theta_0\right) \cdot h\left(y_i, z_i; \theta_0\right)\right] = 0$$
and the corresponding estimate $\hat\theta_{MM}$ solves a simple Method of Moments sample analog system of equations:
$$\frac{1}{N}\sum_{i=1}^{N} L\left(y_i, z_i; \hat\theta_{MM}\right) \cdot h\left(y_i, z_i; \hat\theta_{MM}\right) = 0$$
Clearly, the limiting variance for this estimator assumes a “sandwiched” ex-
pression which is simpler than (12.16): as the problem is exactly identified,
the weighting matrix is redundant.
12.3. Testing Overidentification
for:
$$J\left(\tilde\theta\right) = N\, g_{N_1}^T\left(\tilde\theta\right) \hat\Omega_{N_1}^{-1}\, g_{N_1}\left(\tilde\theta\right) \xrightarrow{d} \chi^2_{J_1 - K}$$
and:
$$\tilde\theta = \arg\min_{\theta\in\Theta}\; g_{N_1}^T\left(\theta\right) \hat\Omega_{N_1}^{-1}\, g_{N_1}\left(\theta\right)$$
4
It may well be the $J_1$-dimensional upper left square block of $\hat\Omega_N$.
In practice, the incremental Sargan test results from subtracting from the original Hansen J-statistic another Hansen J-statistic, where the latter is obtained from a “reduced” GMM model with only the $J_1$ “certain” moment conditions. This second Hansen J-statistic is always smaller by construction, because there are fewer moment conditions to match the zero vector. Therefore, the incremental Sargan test measures how much the other $J_2$ conditions deviate from zero once the $J_1$ “certain” ones are held constant. It is apparent that the intuition behind this test presents many analogies with the Distance or “Likelihood Ratio” test from the Trinity of statistical tests.
Consider again the Mincer equation encountered in previous examples:
$$\log W_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \beta_3 S_i + \alpha_i + \epsilon_i$$
however, suppose that there are now three potential instruments available.
The first is Zi , the already mentioned “distance from college” instrument
by Card. The second is Gi , representing past eligibility to some “fellowship
grant” for attending higher education programs. For example, Gi might be
motivated on some random (exogenous) past allocation of scholarship grants
by the government authorities. Clearly, in this case the eligible individuals
had obtained an advantage at the time of deciding whether or not to enroll
in college. This, however, did not likely affect their future wages other than
via better education (exclusion restriction). The last instrument is Fi , the
average education of one individual’s friends or close social network. One
may argue that one’s friends might have affected the individual decision on
whether to enroll in college, but not his or her wages: the latter statement
however is about exogeneity, and it is dubious at best.
The resulting set of moment conditions is
$$\mathrm{E}\left[\begin{pmatrix} 1 \\ X_i \\ X_i^2 \\ Z_i \\ G_i \\ F_i \end{pmatrix} \underbrace{\left(\log W_i - \beta_0 - \beta_1 X_i - \beta_2 X_i^2 - \beta_3 S_i\right)}_{=\,\alpha_i + \epsilon_i}\right] = 0 \qquad (12.39)$$
six conditions for four parameters of the Mincer equation. As we have discussed, this model can be estimated via 2SLS-GMM by setting $Y_i = \log W_i$, $x_i^T = \left(1, X_i, X_i^2, S_i\right)$ and $z_i^T = \left(1, X_i, X_i^2, Z_i, G_i, F_i\right)$. However, having three
and sixth rows of (12.39). In this context, Hansen’s J-statistic reads as
$$J\left(\hat\beta_{2SLS}\right) = \left[\sum_{i=1}^{N} z_i \hat e_i\right]^T \left[\sum_{i=1}^{N} \hat e_i^2 z_i z_i^T\right]^{-1} \left[\sum_{i=1}^{N} z_i \hat e_i\right] \xrightarrow{d} \chi^2_2$$
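Computationally, this statistic and its p-value take a few lines (a sketch assuming numpy and scipy; Z is the $N \times 6$ instrument matrix and e the vector of 2SLS residuals, both hypothetical):

```python
import numpy as np
from scipy.stats import chi2

def hansen_j(Z, e, n_params):
    """Hansen's J-statistic for overidentifying restrictions (sketch)."""
    m = Z.T @ e                            # sum_i z_i e_i
    S = Z.T @ (Z * e[:, None] ** 2)        # sum_i e_i^2 z_i z_i'
    J_stat = m @ np.linalg.solve(S, m)
    df = Z.shape[1] - n_params             # here: 6 moment conditions - 4 parameters = 2
    return J_stat, chi2.sf(J_stat, df)
```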
12.4. Methods of Simulated Moments
In such cases, Direct Monte Carlo Sampling based on a sample $\{u_s\}_{s=1}^S$ of $S$ random draws of $u_i$ from $H_u(u_i)$ allows one to construct a simulator that converges in probability to (12.40) for each observation $i = 1, \dots, N$:
$$\hat g_S\left(x_i; \theta\right) = \frac{1}{S}\sum_{s=1}^{S} \tilde g_u\left(x_i, u_s; \theta\right) \qquad (12.41)$$
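In code, the simulator (12.41) is nothing more than a Monte Carlo average; a minimal sketch with a hypothetical integrand g_u and draws from $H_u$ (assuming numpy):

```python
import numpy as np

def simulated_moment(g_u, x_i, theta, draws):
    """Direct Monte Carlo simulator (12.41): average of g~_u over the S draws of u."""
    return np.mean([g_u(x_i, u, theta) for u in draws], axis=0)

# draws from the (hypothetical) distribution H_u, e.g. standard normal:
# draws = np.random.default_rng(0).standard_normal(1000)
```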
logistic model predicts the outcome $Y_i$ without bias. Although (12.44) lends itself naturally to Method of Moments estimation, in practice the function $\Lambda(\cdot)$ might be hard to calculate under probabilistic assumptions on the random coefficient $\beta_{1i}$. Let again $\beta_{1i} \sim \mathcal{N}\left(\beta_1, \sigma^2\right)$ and $u_i = \left(\beta_{1i} - \beta_1\right)/\sigma$; then:
$$\mathrm{E}\left[\Lambda\left(\beta_0 + \beta_{1i} X_i\right) \middle|\, X_i\right] = \int_{\mathbb{R}} \Lambda\left(\beta_0 + \left(\beta_1 + \sigma u_i\right) X_i\right) \phi\left(u_i\right) \mathrm{d}u_i \qquad (12.45)$$
Estimation can still proceed with the help of some instruments $z_i$ that are exogenous to the unobservables and that correlate with the main regressor $X_i$. Thus, using the simulator
$$\hat g_S\left(y_i, x_i, z_i; \theta\right) = \frac{1}{N}\sum_{i=1}^{N} z_i\left[y_i - \frac{1}{S}\sum_{s=1}^{S}\tilde\Lambda\left(\beta_0 + \left(\beta_1 + \sigma u_s\right) x_i\right)\right]$$
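A sketch of this simulator for the random coefficient logit follows, with the same $S$ standard normal draws recycled across observations (the arrays y, x, Z and the draws are hypothetical):

```python
import numpy as np

def logistic(v):
    return 1.0 / (1.0 + np.exp(-v))

def simulated_moments_rc_logit(theta, y, x, Z, draws):
    """Simulated moment vector for the random coefficient logit (sketch).
    theta = (beta0, beta1, sigma); draws = S standard normal simulations of u.
    """
    beta0, beta1, sigma = theta
    # S x N matrix of simulated probabilities, then averaged over the draws
    probs = logistic(beta0 + np.outer(beta1 + sigma * draws, x)).mean(axis=0)
    return Z.T @ (y - probs) / len(y)
```

This averaged moment vector can then be plugged into the usual GMM quadratic form.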
12.5. Applications of GMM
$$\varepsilon_{it} = \alpha_i + \epsilon_{it}$$
where the upper bar applied to some variable $x_{ki}$ denotes observation-specific averages over time $t$, like $\bar x_{ki} = T^{-1}\sum_{t=1}^{T} x_{kit}$ (this particular approach has been called the “within transformation” in Lecture 10). Not even this method can, unfortunately, work for dynamic linear models, because the demeaned lagged outcome $Y_{i(t-1)} - \bar Y_i$ is mechanically correlated with the demeaned shock $\varepsilon_{it} - \bar\varepsilon_i$. In fact, past values of the outcome depend on all the past shocks, which in turn are all included in the average shock $\bar\varepsilon_i$!
$$\mathrm{E}\left[\left(Y_{i(t-1)} - \bar Y_i\right)\left(\varepsilon_{it} - \bar\varepsilon_i\right)\right] \neq 0$$
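A quick Monte Carlo makes the point concrete (a sketch with hypothetical dimensions, assuming numpy): even though the shocks are drawn independently of everything else, the sample covariance between the demeaned lagged outcome and the demeaned shock is systematically negative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, rho = 500, 6, 0.5

# simulate a simple dynamic panel: Y_it = alpha_i + rho * Y_i(t-1) + eps_it
alpha = rng.normal(size=N)
eps = rng.normal(size=(N, T))
Y = np.zeros((N, T))
for t in range(1, T):
    Y[:, t] = alpha + rho * Y[:, t - 1] + eps[:, t]

# demeaned lagged outcome vs. demeaned contemporaneous shock (periods t = 2, ..., T)
Y_lag = Y[:, :-1] - Y[:, :-1].mean(axis=1, keepdims=True)
e_dem = eps[:, 1:] - eps[:, 1:].mean(axis=1, keepdims=True)
print(np.mean(Y_lag * e_dem))   # noticeably below zero, illustrating the bias
```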
weak instruments type (that is, weak statistical correlation between the
instruments and the structural variables). The GMM framework, thanks to
its power to combine many instruments, optimally weigh them by their sta-
tistical relevance, and test their validity, is therefore well suited to address
these issues. Unsurprisingly, the GMM framework has become dominant in
the estimation of dynamic macroeconomic models or other dynamic models
for panel data. As the following discussion shows, however, approaches similar to that outlined above are also adopted to address other, more “classical” kinds of endogeneity problems.
where $\upsilon_{it} \equiv \xi_{it} + \epsilon_{it} - \rho\epsilon_{i(t-1)}$: the “backward looking” autoregressive endogenous shock is removed, and all that is left are components of the random shocks that are arguably exogenous to appropriate lags of the first differences of the capital and labor inputs. This shows how moment conditions of the kind
$$\mathrm{E}\left[\begin{pmatrix} \Delta k_{i(t-s)} \\ \Delta \ell_{i(t-s)} \end{pmatrix} \left(\alpha_i\left(1-\rho\right) + \xi_{it} + \epsilon_{it} - \rho\epsilon_{i(t-1)}\right)\right] = 0$$
are available for suitable lags $s$.
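A sketch of the corresponding sample moments for a given lag $s$ (assuming numpy; the $N \times T$ panels of first-differenced inputs dk, dl and of the composite residual v, evaluated at candidate parameter values, are hypothetical):

```python
import numpy as np

def lagged_difference_moments(dk, dl, v, s):
    """Sample analog of the moment conditions above (sketch).
    dk, dl : N x T arrays of first-differenced capital and labor inputs
    v      : N x T array of the composite residual alpha_i*(1-rho) + xi_it + eps_it - rho*eps_i(t-1)
    s      : lag at which the differenced inputs serve as instruments
    """
    m_k = np.mean(dk[:, :-s] * v[:, s:])   # average over i and t of dk_{i,t-s} * v_it
    m_l = np.mean(dl[:, :-s] * v[:, s:])
    return np.array([m_k, m_l])
```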
Since the capital input typically responds to the firm’s economic conditions with a lag (it takes time to invest in new capital and equip it), it makes sense to motivate a moment condition akin to
$$\mathrm{E}\left[k_{it}\, \xi_{it}\right] = 0$$
as firms cannot observe $\xi_{it}$ timely enough so as to affect their choice of $k_{it}$.
For additional discussion about the more modern practices in the estimation
of production functions, see Wooldridge (2009) and Ackerberg et al. (2015).
Bibliography
Arellano, Manuel and Stephen Bond, “Some tests of specification for panel data: Monte Carlo evidence and an application to employment equations,” The Review of Economic Studies, 1991, 58 (2), 277–297.
Moulton, Brent R., “Random Group Effects and the Precision of Regres-
sion Estimates,” Journal of Econometrics, 1986, 32 (3), 385–397.