
Statistics and Econometrics

Notes for a Graduate Sequence in Econometrics

Paolo Zacchia

January 28, 2022


Contents

I Probability and Statistics

1 Random Variables
  1.1 Events and Probabilities
  1.2 Conditional Probability
  1.3 Probability Distributions
  1.4 Relating Distributions
  1.5 Moments of Distributions

2 Common Distributions
  2.1 Discrete Distributions
  2.2 Continuous Distributions I
  2.3 Continuous Distributions II
  2.4 Continuous Distributions III

3 Random Vectors
  3.1 Multivariate Distributions
  3.2 Independence and Random Ratios
  3.3 Multivariate Moments
  3.4 Multivariate Moment Generation
  3.5 Conditional Distributions
  3.6 Two Multivariate Distributions

4 Samples and Statistics
  4.1 Random Samples
  4.2 Normal Sampling
  4.3 Order Statistics
  4.4 Sufficient Statistics

5 Statistical Inference
  5.1 Principles of Estimation
  5.2 Evaluation of Estimators
  5.3 Tests of Hypotheses
  5.4 Interval Estimation

6 Asymptotic Analysis
  6.1 Convergence in Probability
  6.2 Laws of Large Numbers
  6.3 Convergence in Distribution
  6.4 Central Limit Theorems

II Econometric Theory

7 The Linear Regression Model
  7.1 Linear Socio-economic Relationships
  7.2 Optimal Linear Prediction
  7.3 Analysis of Least Squares
  7.4 Evaluation of Least Squares
  7.5 Least Squares and Linear Regression

8 Least Squares Estimation
  8.1 Large Sample Properties
  8.2 Small Sample Properties
  8.3 Dependent Errors

9 Econometric Models
  9.1 Structural Models
  9.2 Model Identification
  9.3 Linear Simultaneous Equations
  9.4 Causal Effects

10 Instrumental Variables
  10.1 Endogeneity Problems
  10.2 Instrumental Variables in Theory
  10.3 Instrumental Variables in Practice
  10.4 Estimation of Simultaneous Equations

11 Maximum Estimation
  11.1 Criterion Functions
  11.2 Asymptotics of Maximum Estimators
  11.3 The Trinity of Asymptotic Tests
  11.4 Quasi-Maximum Likelihood
  11.5 Introduction to Binary Outcome Models
  11.6 Simulated Maximum Estimation
  11.7 Applications of Maximum Estimation

12 Generalized Method of Moments
  12.1 Generalizing the Method of Moments
  12.2 GMM and Instrumental Variables
  12.3 Testing Overidentification
  12.4 Methods of Simulated Moments
  12.5 Applications of GMM

Bibliography
Part I

Probability and Statistics

Lecture 1

Random Variables

This lecture is a self-contained introduction to basic probability theory, including random variables and univariate probability distribution functions. In reviewing concepts that are fundamental to later subjects, special care and emphasis are placed on establishing notation conventions that are adopted throughout all lectures. Examples are chosen so as to facilitate a more extensive treatment of univariate and multivariate probability distributions, which constitute the subject of later lectures.

1.1 Events and Probabilities


Probability theory is a branch of mathematics that concerns the analysis of phenomena of uncertain occurrence. It provides a common mathematical framework for measuring the odds that specific phenomena manifest themselves in the real world – their probability – at any point in time: past, present and future. As such, probability theory is the foundation of statistics and of related disciplines such as econometrics.
The elaboration of probability theory requires a common mathematical characterization of all the possible occurrences of the phenomena that are potentially subject to its analysis. To this end, probability theory borrows from set theory and models phenomena as, indeed, sets of alternatives.

Definition 1.1. Sample Space. The set S collecting all possible outcomes
associated with a certain phenomenon is called the sample space.

A basic example of a sample space is the one associated with the classical experiment of tossing a coin: Scoin = {Head, Tail}. A larger sample space is that of grades in a university exam: with letter grades for example (but not allowing for plus and minus), Sexam = {A, B, C, D, E, F}.


These are both examples of countable sample spaces characterized by a finite number of elements. There are also countable sample spaces with an infinite number of elements: for example, the number of emails that one receives during a day can be expressed as Semails = {0, 1, 2, . . . } = N0. Other phenomena are modeled through uncountable sample spaces with an infinite number of elements. For example, the sample space associated with the income of an individual is represented by the nonnegative portion of the real line, Sincome = R+, whereas net wealth (assets minus liabilities) can also be negative: Swealth = R. One can construct even more complex, multidimensional sample spaces. For example, the net wealth of a household of two with separate financial positions is the collection of two numbers; it follows that Shousehold = Swealth.1 × Swealth.2 = R2.
The characterization of phenomena as sets of occurrences allows for a suitable definition of events, that is, combinations of occurrences.

Definition 1.2. Event. Any subset of a sample space S, including S itself, is an event.

The definition of events as subsets allows one to think about the probability of well-defined groups of alternatives. In the coin case, there are four events: Anull = ∅, Ahead = {Head}, Atail = {Tail}, Afull = Scoin = {Head, Tail}: clearly the null event can never happen, while the “full” event (either head or tail) must always happen; but this is a matter of associating probabilities with events, not of defining events. In the case of grades, one can think about events such as being above (or below) a passing grade such as C, so that Apassing = {A, B, C}, Afailing = {D, E, F}, et cetera. Similarly, one can split the income sample space into segments like tax brackets.
Because events are subsets, standard set operations such as union (∪), intersection (∩) and complementation (Aᶜ) extend to them. Moreover, the following properties apply (the proof is left as an exercise).

Theorem 1.1. Properties of Events. Let AS, BS and CS be three events associated with the sample space S. The following properties hold.

a. Commutativity: AS ∪ BS = BS ∪ AS
                  AS ∩ BS = BS ∩ AS

b. Associativity: AS ∪ (BS ∪ CS) = (AS ∪ BS) ∪ CS
                  AS ∩ (BS ∩ CS) = (AS ∩ BS) ∩ CS

c. Distributive Laws: AS ∩ (BS ∪ CS) = (AS ∩ BS) ∪ (AS ∩ CS)
                      AS ∪ (BS ∩ CS) = (AS ∪ BS) ∩ (AS ∪ CS)

d. DeMorgan’s Laws: (AS ∪ BS)ᶜ = ASᶜ ∩ BSᶜ
                    (AS ∩ BS)ᶜ = ASᶜ ∪ BSᶜ
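Since events are ordinary sets, these identities can be checked mechanically. The following sketch verifies each property of Theorem 1.1 on a small sample space using Python's built-in `set` type; the specific events B and C below are illustrative choices, not from the text.

```python
# A quick numerical check of Theorem 1.1 using Python's built-in sets.
# Sample space: the letter grades; A, B, C are illustrative events.
S = {"A", "B", "C", "D", "E", "F"}
A, B, C = {"A", "B", "C"}, {"B", "C", "D"}, {"C", "E"}

def complement(E, S=S):
    """Complement of event E relative to the sample space S."""
    return S - E

# Commutativity
assert A | B == B | A and A & B == B & A
# Associativity
assert A | (B | C) == (A | B) | C and A & (B & C) == (A & B) & C
# Distributive Laws
assert A & (B | C) == (A & B) | (A & C)
assert A | (B & C) == (A | B) & (A | C)
# DeMorgan's Laws
assert complement(A | B) == complement(A) & complement(B)
assert complement(A & B) == complement(A) | complement(B)
print("All properties verified.")
```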


It is useful to characterize events that do not overlap, in the sense that no occurrence – that is, no element of the sample space S – contained in one is also contained in the other.

Definition 1.3. Disjoint Events. Two events A1 and A2 are disjoint or mutually exclusive if A1 ∩ A2 = ∅. The events in a collection A1, A2, . . . are pairwise disjoint or mutually exclusive if Ai ∩ Aj = ∅ for all i ≠ j.
Intuitively, disjoint events cannot simultaneously happen. It is also useful to
characterize collections of disjoint events covering the entire sample space.
Definition 1.4. Partition. The events in a collection A1, A2, . . . form a partition of the sample space S if they are pairwise disjoint and ⋃_{i=1}^{Z} Ai = S if the collection is of finite dimension Z, or ⋃_{i=1}^{∞} Ai = S if the collection has an infinite number of elements.

For example, the collection of events Apassing and Afailing is a partition of the (simplified) letter grades sample space. One can obtain an infinite partition of the income sample space by splitting the positive real line into an infinite number of non-overlapping segments, such as – given some positive integer K – the following sequence.

A1 = [0, K), A2 = [K, 2K), . . . , An = [(n − 1)K, nK), . . .
These notions are almost sufficient to provide a formal characterization
of a probability function, that is, a function assigning to each event of a
sample space a value that measures the chance of any occurrence allowed
by that event. The formal mathematical definition of probability functions
that is illustrated next follows the axiomatic foundations of probability
theory as originally developed by Andrej Nikolaevič Kolmogorov. However,
it is necessary to first discuss yet another mathematical notion, concerning
the properties of the (collection of) events that are the domain of probability
functions. This is the concept of sigma algebra (σ-algebra).
Definition 1.5. Sigma Algebra. Given some set S, a sigma algebra or a Borel field is a collection of subsets of S, denoted as B, that satisfies the following properties:

a. ∅ ∈ B;
b. for any subset A ∈ B, it is Aᶜ ∈ B;
c. for any countable sequence of subsets A1, A2, · · · ∈ B, it is ⋃_{i=1}^{∞} Ai ∈ B.

It is easy to see that properties b. and c. together with DeMorgan’s Laws also imply ⋂_{i=1}^{∞} Ai ∈ B for any appropriate countable sequence of subsets.


The notion of sigma algebra is general enough that it is usually possible to find many sigma algebras for a given set, such as a sample space S. For example, a trivial sigma algebra is the one constituted by just the empty set ∅ and the original set (e.g. S). For finite and/or countable sets, such as the two realizations of the coin experiment and the list of letter grades, the collection of all subsets is an adequate sigma algebra (it is a good exercise to prove this). For such sets, probability functions are usually formulated upon this largest sigma algebra, which contains all the subsets.
The case of uncountable sets is somewhat more complicated, as certain collections of subsets are not sigma algebras. In most cases, the uncountable set of interest is a subset of R^K for some given integer K ≥ 1, and the sigma algebra upon which probability functions are built is the collection of all connected sets together with their unions and intersections;¹ in the case of R, for example, connected sets take the form

[a, b], (a, b], [a, b), (a, b)

for any two a, b ∈ R with a ≤ b. The notion of sigma algebra is thus general enough to allow for a wide class of reasonable collections of subsets or events; note though that collections that are not sigma algebras exist, and probability functions cannot be applied to them.²
The definition of probability function is thus in order.

Definition 1.6. Probability Function. Given a sample space S and an associated sigma algebra B, a probability function P is a function with domain B that satisfies the three axioms of probability:

a. P(A) ≥ 0 ∀A ∈ B;

b. P(S) = 1;

c. given a countable sequence of pairwise disjoint subsets A1, A2, · · · ∈ B, P(⋃_{i=1}^{∞} Ai) = Σ_{i=1}^{∞} P(Ai).

¹ For readers unfamiliar with topology, a connected set is somewhat informally defined as a set that cannot be partitioned into two nonempty subsets such that each subset has no points in common with the set closure of the other subset. For example, the subset of R defined as A = {x : x ∈ [a, b) ∨ x ∈ (b, c], a < b < c} is not connected, because it can be partitioned in such a way that defies the above definition.
² Here is an example of a collection B′ of subsets of R which is not a sigma algebra. Suppose that B′ contains all the finite disjoint unions of sets of the form

(−∞, a], (a, b], (b, ∞), ∅, R

then ⋃_{i=1}^{∞} (0, (i − 1)/i] = (0, 1) ∉ B′, which contradicts the definition of sigma algebra.


Note from the definition that it is especially easy to construct probability functions that satisfy the three axioms for any suitable sigma algebra B of all finite and/or countable sample spaces S. In such cases, it is sufficient to assign to each element s ∈ S a number p(s) ≥ 0 such that Σ_{s∈S} p(s) = 1.³ For uncountable sample spaces S, it is best to see a probability function as a particular instance of a measure (a more general mathematical concept, whose treatment is outside the scope of this chapter).


Figure 1.1: Probabilities on a Dartboard

Example 1.1. Probabilities on a Dartboard. One useful example of a probability function for an uncountable sample space is that of the dartboard displayed in Figure 1.1. Consider dart players who score points depending on how closely to the center of the dartboard they land their darts; to each area delimited by two contiguous concentric circles corresponds a varying number of points, and zero points are attributed if a player fails to hit the dartboard. One may want to calculate the probability that a player
scores a specific number of points by throwing just a single dart. Clearly,
the sample space in this particular setting is the set of all the points that
the dart can potentially hit (on the dartboard as well as outside it) and it is
clearly uncountable, because there are infinitely many such points. Yet it is
intuitive to see each region of the dartboard as a separate event, that failure to hit the board is another event, that all such events are pairwise
disjoint, and that together with the empty set they form a sigma algebra.
Appropriate probability functions, depending say on the skill of each player,
can be based on the sample space partition defined by these events.
³ In the coin example, one has p(Head) ≥ 0, p(Tail) ≥ 0 and p(Head) + p(Tail) = 1. Similarly, a probability function for the simplified letter grades would assign to each letter a nonnegative number such that their sum equals 1.


It is easiest to think about the probability function for a naïve or unskilled player who, if he or she hits the dartboard at all, does so purely at random, so that the probability of scoring a given number of nonzero points is proportional to the area of the corresponding dartboard section. Suppose, for example, that the distance of all circles from the center of the dartboard is given by (I + 1 − i)r, where I is the maximum number of points attainable (5 in the Figure), i is the number of points associated with each “ring” of the dartboard, and r > 0 is the distance between any two contiguous rings (all equidistant from one another) as well as between the innermost circle and the center. In such a case, the area corresponding to each ring measures as follows.

Area associated with i points = πr²[(I + 1 − i)² − (I − i)²]

It would seem that in order to calculate the desired probabilities, one would need to divide this area by the total area of the dartboard – which equals πr²I² – however this is not quite enough, because one must take into account the event that a player fails to hit the board and scores 0 points, and the fact that all the probabilities must sum up to one. Let the area outside the dartboard measure T > 0; an appropriate probability function would thus be given by P(0 points) = T/(T + πI²r²) and the following expression for 0 < i ≤ I.

P(i points) = πr²[(I + 1 − i)² − (I − i)²](T + πI²r²)⁻¹

It is easy to verify that the three axioms by Kolmogorov are satisfied. ∎
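The verification at the end of Example 1.1 can be carried out numerically; the minimal sketch below (the values of I, r and T are illustrative assumptions, not from the text) builds the probability function and checks nonnegativity and summation to one.

```python
import math

# Sketch of Example 1.1: the naive player's probability function.
# I = maximum points, r = ring spacing, T = area outside the board.
def dartboard_probabilities(I=5, r=1.0, T=10.0):
    total = T + math.pi * I**2 * r**2
    p = {0: T / total}  # missing the board scores 0 points
    for i in range(1, I + 1):
        ring_area = math.pi * r**2 * ((I + 1 - i)**2 - (I - i)**2)
        p[i] = ring_area / total
    return p

p = dartboard_probabilities()
# Axiom a.: nonnegativity; axiom b.: total probability equals one.
assert all(v >= 0 for v in p.values())
assert abs(sum(p.values()) - 1.0) < 1e-12
```

The summation check works because the ring areas telescope: Σ_{i=1}^{I} [(I + 1 − i)² − (I − i)²] = I².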


Some general properties of probability functions deserve to be discussed.

Theorem 1.2. Properties of Probability Functions (a). If P is some probability function and A is a set in B, the following properties hold:

a. P(∅) = 0;
b. P(A) ≤ 1;
c. P(Aᶜ) = 1 − P(A).

Proof. The observation that A and Aᶜ form a partition of S, so that P(A) + P(Aᶜ) = P(S) = 1, proves c.; a. and b. follow from it.
Theorem 1.3. Properties of Probability Functions (b). If P is some probability function and A, B are sets in B, the following properties hold:

a. P(B ∩ Aᶜ) = P(B) − P(A ∩ B);
b. P(A ∪ B) = P(A) + P(B) − P(A ∩ B);
c. if A ⊂ B, it is P(A) ≤ P(B).

Proof. To prove a. note that B can be expressed as the union of two disjoint sets, B = {B ∩ A} ∪ {B ∩ Aᶜ}, thus P(B) = P(B ∩ A) + P(B ∩ Aᶜ). To show b. decompose the union of A and B as A ∪ B = A ∪ {B ∩ Aᶜ}, again two disjoint sets; hence:

P(A ∪ B) = P(A) + P(B ∩ Aᶜ) = P(A) + P(B) − P(A ∩ B)

where a. implies the second equality. Finally, c. follows from a. as A ⊂ B implies that P(A ∩ B) = P(A), thus P(B ∩ Aᶜ) = P(B) − P(A) ≥ 0.
Theorem 1.4. Properties of Probability Functions (c). If P is some probability function, the following properties hold:

a. P(A) = Σ_{i=1}^{∞} P(A ∩ Ci) for any A ∈ B and any partition C1, C2, . . . of the sample space such that Ci ∈ B for all i ∈ N;

b. P(⋃_{i=1}^{∞} Ai) ≤ Σ_{i=1}^{∞} P(Ai) for any sets A1, A2, . . . such that Ai ∈ B for all i ∈ N.
Proof. Regarding a. note that, by the Distributive Laws of events, it is

A = A ∩ S = A ∩ (⋃_{i=1}^{∞} Ci) = ⋃_{i=1}^{∞} (A ∩ Ci)

where the intersection sets of the form A ∩ Ci are pairwise disjoint as the Ci sets are, hence:

P(A) = P(⋃_{i=1}^{∞} (A ∩ Ci)) = Σ_{i=1}^{∞} P(A ∩ Ci)

as postulated. To establish b. it is useful to construct another collection of pairwise disjoint events A*1, A*2, . . . such that ⋃_{i=1}^{∞} A*i = ⋃_{i=1}^{∞} Ai and

P(⋃_{i=1}^{∞} Ai) = P(⋃_{i=1}^{∞} A*i) = Σ_{i=1}^{∞} P(A*i) ≤ Σ_{i=1}^{∞} P(Ai)

where the second equality would follow from the pairwise disjoint property. Such additional collection of events can be obtained as:

A*1 = A1,  A*i = Ai ∩ (⋃_{j=1}^{i−1} Aj)ᶜ = Ai ∩ (⋂_{j=1}^{i−1} Ajᶜ)  for i = 2, 3, . . .

which, by construction, are pairwise disjoint and satisfy ⋃_{i=1}^{∞} A*i = ⋃_{i=1}^{∞} Ai. Furthermore, by construction A*i ⊂ Ai for every i, implying P(A*i) ≤ P(Ai) and thus the inequality Σ_{i=1}^{∞} P(A*i) ≤ Σ_{i=1}^{∞} P(Ai).


1.2 Conditional Probability


Conditional probability is a fundamental concept in probability, statistics and econometrics. It formalizes the idea of calculating probabilities within a more restricted set of events than the original sample space, and it allows one to formulate arguments about the probability of certain events given that some other events have occurred alongside them.
Definition 1.7. Conditional Probability. Consider a sample space S, an associated sigma algebra B, and two events A, B ∈ B such that P(B) > 0. The conditional probability of A given B is written as P(A | B) and is defined as follows.

P(A | B) = P(A ∩ B) / P(B)    (1.1)
Example 1.2. Conditional Grades. Naturally, conditional probability is a moot concept for simple, binary scenarios like the coin tossing experiment. It is already a significant concept in the case of grades. Observe that the following two partitions of the grades sample space S belong to the same sigma algebra: the partition constituted by all singleton grades (AA = {A}, AB = {B}, etc.) and the “passing vs. failing” partition (where Apassing = {A, B, C} and Afailing = {D, E, F}).⁴ Thus it makes sense to talk about the probability that a student scores a specific grade given passing (or failing) an exam. Suppose for example that the probability function for individual grades is as follows:

P(A) = P(B) = P(C) = 0.3
P(D) = P(F) = 0.05
P(E) = 0

(nobody uses E anymore). Simple additions thus give P(passing) = 0.9 and P(failing) = 0.1. Also note that A ⊂ Apassing, so P(A ∩ passing) = P(A). Consequently, the probability of getting an A given that a student passes the exam can be expressed as the following conditional probability:

P(A | passing) = P(A ∩ passing) / P(passing) = 0.3 / 0.9 = 1/3

and similarly P(D | failing) = P(F | failing) = 0.5 express the odds that a student who fails the exam gets either a D or an F. ∎
⁴ Consider the sigma algebra for S based upon the maximal partition including all singleton grade sets; the two “passing” and “failing” sets are obtained by taking appropriate unions of the singleton grade sets, hence they must be part of the same sigma algebra.
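The calculations in Example 1.2 translate directly into code; the sketch below (the helper `conditional` is a hypothetical name) reproduces the two conditional probabilities from the grade probabilities given in the text.

```python
# Example 1.2 as code: individual grade probabilities from the text.
grades = {"A": 0.3, "B": 0.3, "C": 0.3, "D": 0.05, "E": 0.0, "F": 0.05}
passing = {"A", "B", "C"}
failing = set(grades) - passing

def conditional(event, given, p=grades):
    """P(event | given), with events expressed as sets of grades."""
    p_given = sum(p[g] for g in given)
    p_joint = sum(p[g] for g in event & given)
    return p_joint / p_given

print(conditional({"A"}, passing))  # ≈ 1/3
print(conditional({"D"}, failing))  # ≈ 0.5
```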


Example 1.3. Conditional Probabilities on a Dartboard. Endowed


with the concept of conditional probability, let us further expand Example
1.1 about dartboard play. Suppose that a player has learned never to miss
the dartboard, but still is not skilled enough to effectively target the center,
so that his or her darts still hit the dartboard quite randomly (this would be
an unusual scenario, but it serves for illustration). The probability that this
player scores a given number of points can be obtained from the probability
function from Example 1.1, conditional upon hitting the board. Remember that failure to hit the dartboard (and scoring zero points) has a probability of P(0 points) = T/(T + πI²r²) for a completely naïve player, meaning that hitting the board has a probability of P(i > 0) = πI²r²/(T + πI²r²). Thus,

P(i points | i > 0) = P(i ∩ i > 0) / P(i > 0) = I⁻²[(I + 1 − i)² − (I − i)²]

is a conditional probability function for the general case, and can be interpreted as the probability function of the slightly more experienced player. Suppose next that this player trains more and learns how to score at least 3 < I points with every dart, but still without becoming more effective at actually approaching the center. Therefore:

P(i points | i > 2) = P(i ∩ i > 2) / P(i > 2) = [(I + 1 − i)² − (I − i)²] / (I − 2)²

is the corresponding, more restrictive (conditional) probability function. ∎
Example 1.4. Preemptive Medical Treatment. All the examples so far are relatively simple, as they are based on nested sets, so that the numerator of (1.1) coincides with the probability of one of the two events under consideration. To better appreciate conditional probability, suppose that a population at risk for some kind of illness is offered a preemptive medical treatment that is not always 100% effective. Not all subjects at risk take up the treatment: some are “takers,” others are “hesitant.” In both groups, some individuals eventually become sick while others stay healthy. The probability measures associated with all possible intersections between these two simple partitions are as follows.
P (taker ∩ healthy) = 0.40
P (taker ∩ sick) = 0.20
P (hesitant ∩ healthy) = 0.15
P (hesitant ∩ sick) = 0.25
In terms of the original partitions, P (taker) = 0.60, P (hesitant) = 0.40,
P (healthy) = 0.55 and P (sick) = 0.45.


By looking at the elementary probabilities expressed by the intersected sets, it might seem that taking the treatment and falling sick is more likely than not taking it and still falling sick. This conclusion is fallacious, though: these comparisons must be expressed as conditional probabilities:

P(sick | taker) = P(taker ∩ sick) / P(taker) = 0.20 / 0.60 = 1/3

P(sick | hesitant) = P(hesitant ∩ sick) / P(hesitant) = 0.25 / 0.40 = 5/8

that is, the truth is actually the opposite of the original suggestion! ∎
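The reversal in Example 1.4 is easy to reproduce computationally; a minimal sketch using the joint probabilities from the text (the helper name `p_sick_given` is a hypothetical choice):

```python
# Example 1.4 as code: joint probabilities from the text.
joint = {
    ("taker", "healthy"): 0.40,
    ("taker", "sick"): 0.20,
    ("hesitant", "healthy"): 0.15,
    ("hesitant", "sick"): 0.25,
}

def p_sick_given(group):
    """P(sick | group) = P(group ∩ sick) / P(group)."""
    p_group = sum(v for (g, _), v in joint.items() if g == group)
    return joint[(group, "sick")] / p_group

print(p_sick_given("taker"))     # ≈ 1/3
print(p_sick_given("hesitant"))  # ≈ 5/8
```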
An important result based on conditional probabilities is Bayes’ Rule. Note from inspecting (1.1) that the roles of events A and B can be reversed so long as P(A) > 0. This observation allows one to rewrite the expression in a way that relates the two “reverse” conditional probabilities.

P(A | B) = P(B | A) P(A) / P(B)    (1.2)

The above expression is a simple version of Bayes’ Rule: a powerful result used to calculate conditional probabilities or statistical estimators in a wide variety of settings. The more general version is given as follows.
Theorem 1.5. Bayes’ Rule. Let A1, A2, . . . be a partition of the sample space S, and B some event B ⊂ S. For i = 1, 2, . . . the following holds.

P(Ai | B) = P(B | Ai) P(Ai) / Σ_{j=1}^{∞} P(B | Aj) P(Aj)

Proof. This follows from (1.2) for A = Ai and by observing that:

P(B) = Σ_{j=1}^{∞} P(B ∩ Aj) = Σ_{j=1}^{∞} P(B | Aj) P(Aj)

from Theorem 1.4 and the definition of conditional probability.


Example 1.5. Preemptive Medical Treatment, continued. Let us continue with Example 1.4, and suppose that one is interested in the conditional probability of finding a taker among the sick, knowing the conditional probability of getting sick after taking the treatment, the total probability measure of takers, and the total probability measure of getting sick. This is easily calculated from (1.2).

P(taker | sick) = P(sick | taker) P(taker) / P(sick) = (1/3 · 0.60) / 0.45 = 4/9

Note: this number says little about the effectiveness of the treatment! ∎
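Theorem 1.5 translates into a few lines of code; the sketch below (the helper `bayes` is a hypothetical name) reproduces the calculation of Example 1.5 for the finite partition {taker, hesitant}.

```python
# A minimal sketch of Theorem 1.5 (Bayes' Rule) for a finite partition.
def bayes(prior, likelihood, i):
    """P(A_i | B) given priors P(A_j) and likelihoods P(B | A_j)."""
    p_b = sum(likelihood[j] * prior[j] for j in prior)  # total probability of B
    return likelihood[i] * prior[i] / p_b

# Example 1.5: partition {taker, hesitant}, event B = sick.
prior = {"taker": 0.60, "hesitant": 0.40}
likelihood = {"taker": 1/3, "hesitant": 5/8}  # P(sick | group), from Example 1.4
print(bayes(prior, likelihood, "taker"))  # ≈ 4/9
```

The denominator computed inside `bayes` is exactly the total-probability decomposition P(B) = Σ_j P(B | Aj) P(Aj) used in the proof.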


On occasion, conditional probabilities do not differ from unconditional probabilities: for two events A and B, it may be that P(A | B) = P(A). Intuitively, this happens when the two events are completely unrelated, so that knowing that one of the two (say B) occurs does not change the odds that the other event (A) happens. In such cases, the two events are said to be independent.

Definition 1.8. Statistical Independence (two events). Two events A and B are statistically independent if the following holds.

P(A ∩ B) = P(A) P(B)
This intuition is easily extended to groups (collections) of events.

Definition 1.9. Mutual Statistical Independence (multiple events). The events of any collection A1, A2, . . . , AN are mutually independent if, for any subcollection Ai1, Ai2, . . . , AiN′ with N′ ≤ N, the following holds.

P(⋂_{j=1}^{N′} Aij) = ∏_{j=1}^{N′} P(Aij)

The notion of independence is also extended to the complements of the events involved.

Theorem 1.6. Independence and Complementary Events. Consider any two independent events A and B. It can be concluded that the following pairs of events are independent too:

a. A and Bᶜ;
b. Aᶜ and B;
c. Aᶜ and Bᶜ.

Proof. Case a. is easily shown as follows:

P(A ∩ Bᶜ) = P(A) − P(A ∩ B)
          = P(A) − P(A) P(B)
          = P(A) [1 − P(B)]
          = P(A) P(Bᶜ)

where the second equality follows from the definition of independence. Cases b. and c. are analogous.
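The algebra in the proof of case a. can be illustrated numerically; a minimal sketch with two independent events (the probabilities 0.3 and 0.7 below are illustrative assumptions, with independence imposed by construction):

```python
# Numerical illustration of Theorem 1.6, case a.
p_a, p_b = 0.3, 0.7
p_ab = p_a * p_b  # independence: P(A ∩ B) = P(A) P(B)

# P(A ∩ Bᶜ) via Theorem 1.3a, compared against P(A) P(Bᶜ)
p_a_not_b = p_a - p_ab
assert abs(p_a_not_b - p_a * (1 - p_b)) < 1e-12
print("A and B complement are independent as well.")
```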
As it is elaborated in later chapters, the primitive concept of statistical
independence and the associated mathematical definitions extend one-to-
one to probability distributions of random variables; as such they are crucial
for characterizing the properties of statistical samples and estimators.


1.3 Probability Distributions


The general concept of probability function applies to any suitable set that can be expressed as a sample space and endowed with a sigma algebra. In statistics and econometrics, however, it is often convenient to reformulate probability functions so that their domain of interest is the real numbers. This requires two fundamental probabilistic concepts: that of random variables and that of probability distribution functions.
Definition 1.10. Random Variables. A random variable X is a function from the sample space S onto the set of real numbers, X : S → R.

Conventionally, random variables are denoted by slanted capital letters, while their realizations, corresponding to specific outcomes in the original sample space(s) that are hypothesized to occur, are denoted by italic lower case letters. For example, the random variable Xcoin for the coin experiment is expressed as:

xcoin = 1 if Tail,  xcoin = 0 if Head

or vice versa. In more extended experiments where many (e.g. n ≥ 1) coins are tossed, the outcome of interest is the total count Xn.coins ∈ {0, 1, . . . , n} of heads (or tails): Xn.coins is another random variable. A random variable for letter grades – Xgrade – can correspond to each grade’s weight in the calculation of the GPA, such as xgrade = 4 for A, xgrade = 3 for B, et cetera. In the case of sample spaces like the number of received emails, individual income and individual wealth, the corresponding random variables X typically map onto the original sample spaces: N0, R+ and R respectively. However, the mapping itself may be non-trivial for reasons of interpretation.⁵
Definition 1.11. Probability Distribution. Given a random variable X, a cumulative (probability) distribution function (often abbreviated as c.d.f.) is a function FX(x) defined as follows.

FX(x) = P(X ≤ x) for all x ∈ R

A cumulative probability distribution function is the mathematical object that allows one to reformulate a primitive probability function, as expressed on a sample space, into a function that takes real numbers as arguments, yet conveys the same information about probability as the original function P. The subscript X associates a c.d.f. with a specific random variable X, and is often omitted when discussing generic distributions or their properties.
⁵ For example, one may want to convert heterogeneous monetary values expressed in different currencies into standardized monetary units.


Example 1.6. Tossing Two Coins. Consider the “extended” coin experiment for n = 2. The sample space for this scenario is the set

S2.coins = {Head & Head, Head & Tail, Tail & Head, Tail & Tail}

where, for each element, the two terms before and after the ‘&’ sign represent the outcome of the first and second coin respectively. The random variable of interest, X2.coins, takes values in the set {0, 1, 2} ⊂ R, and the mapping X equals the number of tails in each attempt.

X(Head & Head) = 0
X(Head & Tail) = 1
X(Tail & Head) = 1
X(Tail & Tail) = 2

Assuming that the coins in the experiment are “balanced” (that is, there are equal chances of obtaining heads or tails), and given that clearly the outcome of either coin cannot be predicted from the other (if treated as separate events, they would be independent), the probability associated with each element of S2.coins is 0.25, meaning that P(X2.coins = 0) = P(X2.coins = 2) = 0.25 while P(X2.coins = 1) = 0.50. This results in the following cumulative probability distribution:

FX2.coins(x) = 0 if x ∈ (−∞, 0);  0.25 if x ∈ [0, 1);  0.75 if x ∈ [1, 2);  1 if x ∈ [2, ∞)

which is easily represented graphically as in Figure 1.2 below.


Figure 1.2: Probability Distribution for the Two Coins Experiment
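The c.d.f. of Example 1.6 can also be constructed mechanically by enumerating the sample space; a minimal Python sketch:

```python
from itertools import product

# Example 1.6 as code: F_X for the number of tails in two fair coin tosses.
outcomes = list(product(["Head", "Tail"], repeat=2))  # the sample space
p_outcome = 1 / len(outcomes)                         # balanced coins: 0.25 each

def cdf(x):
    """F_X(x) = P(X <= x), where X counts the tails in an outcome."""
    return sum(p_outcome for o in outcomes if o.count("Tail") <= x)

print(cdf(-1), cdf(0), cdf(1), cdf(2))  # 0 0.25 0.75 1.0
```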


Figure 1.2 displays a step function which is non-decreasing, with image in [0, 1], and which is right-continuous (but left-discontinuous) at every step. As discussed next, these properties are not unique to this example. ∎
Theorem 1.7. Properties of Probability Distribution Functions. A function F(x) can be a probability distribution function if and only if the following three conditions hold:

a. lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1;
b. F(x) is a nondecreasing function of x;
c. F(x) is right-continuous, that is, lim_{x↓x0} F(x) = F(x0) ∀x0 ∈ R.

Proof. (Outline.) Necessity follows easily from the definition of probability functions. Sufficiency requires some reverse engineering: one must show that for each probability distribution function with the above properties, it is possible to find an appropriate sample space S, an associated probability function P and a corresponding random variable X.
While all probability distributions must conform to the conditions es-
tablished by Theorem 1.7, not all of them take the shape of step functions.
In fact, the latter is true only for a certain class of distributions: those for
discrete – as opposed to continuous – random variables.
Definition 1.12. Classes of Random Variables. A random variable X
is continuous if FX (x) is a continuous function of x, while it is discrete
if FX (x) is a step function of x.
The following example discusses two of the most illustrative continuous
probability distributions.
Example 1.7. Standard Logistic and Normal Distributions. While
perhaps not the most frequently found, the standard logistic probability
distribution

FX (x) = Λ (x) = 1 / [1 + exp (−x)]
is arguably the continuous probability distribution taking positive values on
the entire real line with the simplest mathematical expression. It is depicted
in Figure 1.3 (continuous line) and it is easy to verify that it satisfies the
conditions of Theorem 1.7 (note that its derivative is always positive). The
most important continuous distribution, however, is certainly the standard
normal or Gaussian distribution, a more complex function:

FX (x) = Φ (x) = ∫_{−∞}^{x} (1/√(2π)) exp (−t²/2) dt


while also taking positive values on the entire real line and complying with
Theorem 1.7. This distribution is also displayed in Figure 1.3 (dashed line);
observe how, relative to the standard logistic, the standard normal assigns
a higher probability to the values of x closer to zero.

Figure 1.3: Standard Logistic and Normal Probability Distributions

As elaborated later in this chapter, the standard logistic and the standard
normal are both special cases of more flexible specifications of the logistic
and normal distributions, which allow for parameters that determine their
exact shape (however, the notation Λ (x) and Φ (x) is typically reserved for
the “standard” versions of both distributions). Both distributions are often
used to model real world phenomena that are best represented on the entire
real line, like the deviations of a certain variable of interest (say, human
height) from some focal point (say, a group-specific average). The predominance
of the standard normal is motivated by a fundamental result in asymptotic
probability theory, the Central Limit Theorem, which is discussed at length
in Lecture 6. 
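Both cumulative distributions of Example 1.7 are easy to evaluate in code; a small sketch follows (the function names are mine, and Φ is computed through the standard identity Φ(x) = [1 + erf(x/√2)]/2, which the text does not state explicitly).

```python
import math

def Lambda(x):
    # standard logistic cumulative distribution
    return 1.0 / (1.0 + math.exp(-x))

def Phi(x):
    # standard normal cumulative distribution, via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

print(Lambda(0.0), Phi(0.0))          # both 0.5: symmetry around zero
print(1 - Lambda(3.0), 1 - Phi(3.0))  # the logistic right tail is thicker
```

The second line of output makes concrete the "thicker tails" comparison drawn in the text.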
Cumulative probability distributions are seldom handled directly; it is
usually more convenient to manipulate some associated mathematical ob-
jects that more directly relate to the underlying probability measures. Such
objects, the probability mass and density functions, are defined differently
for discrete and continuous distributions, respectively. These two concepts
make it easier to also characterize the support of a random variable, which
intuitively is the subset of R where all the probability is concentrated.
Definition 1.13. Probability Mass Function. Given a discrete random
variable X, its probability mass function fX (x) (which is often abbreviated
as p.m.f.) is defined as follows.
fX (x) = P (X = x) for all x ∈ R


Definition 1.14. Probability Density Function. Given a continuous
random variable X, its probability density function fX (x) (which is often
abbreviated as p.d.f.) is defined as the function that satisfies the following
relationship.

FX (x) = ∫_{−∞}^{x} fX (t) dt for all x ∈ R

If a continuous cumulative distribution is differentiable everywhere in R, the
associated density function is simply its first derivative: fX (x) = ∂FX (x) /∂x.
Definition 1.15. Support of a random variable. Given some random
variable X which is either discrete or continuous, its support X is defined
as the following set
X ≡ {x : x ∈ R, fX (x) > 0}
where fX (x) is the probability mass or density function associated with X,
as appropriate.
Clearly, the support of a discrete random variable is a countable set; thus
a probability mass function has an easy interpretation as a transposition of
the underlying probability function, implying for instance that:

P (a < X ≤ b) = FX (b) − FX (a) = Σ_{t ∈ X : a < t ≤ b} fX (t)

hence:

P (X ≤ b) = FX (b) = Σ_{t ∈ X : t ≤ b} fX (t)

and:

P (X ∈ X) = Σ_{t ∈ X} fX (t) = 1

which connects directly with the cumulative probability distribution FX (x).
Example 1.8. Tossing Two Coins, Revisited. Consider the cumulative
distribution function for the experiment about “tossing two coins” described
in Example 1.6. The associated probability mass function is obtained from
the original probability function:
fX2.coins (x) = .25   if x = 0
               .50   if x = 1
               .25   if x = 2
               0     otherwise
and it is visually represented in Figure 1.4, as displayed next. 


Figure 1.4: Probability Mass Function for the Two Coins Experiment

For density functions instead the support is an uncountable set, and the
interpretation of the quantity fX (x) ≥ 0 is subtler: it cannot be interpreted
as a probability, because any single point x has measure zero in the support.
However, when X is continuous the definition of cumulative distribution
functions implies that:

P (a ≤ X ≤ b) = FX (b) − FX (a) = ∫_{a}^{b} fX (t) dt

hence:

P (X ∈ X) = ∫_{X} fX (t) dt = 1

so density functions bear a probabilistic interpretation for segments of R.
Also observe that unlike in the case of mass functions, density functions can
generally take values larger than one, since their probabilistic interpretation
is based on the above integral formulations.
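The point that a density may exceed one while still integrating to one can be illustrated with a short computation (the example distribution is my own choice, not from the text): a uniform density on [0, 0.4] takes the constant value 2.5 > 1 on its support.

```python
def f(x):
    # density of the uniform distribution on [0, 0.4]: constant 2.5 on the support
    return 2.5 if 0.0 <= x <= 0.4 else 0.0

# crude midpoint Riemann sum of the density over its support
n = 4000
h = 0.4 / n
riemann = sum(f((i + 0.5) * h) * h for i in range(n))

print(f(0.2), round(riemann, 6))   # 2.5 1.0
```

The density value at any interior point is 2.5, yet the total probability mass is still exactly one.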

Example 1.9. Standard Logistic and Normal Density Functions.
Let us expand Example 1.7. The density function associated with the
standard logistic distribution Λ (x) is:

λ (x) = dΛ (x) /dx = exp (−x) / [1 + exp (−x)]²

while the density function of the standard normal distribution is as follows.

φ (x) = dΦ (x) /dx = (1/√(2π)) exp (−x²/2)

Both functions are displayed in Figure 1.5 below. The graphical comparison
between the two density functions highlights again that the standard logistic
is “thicker in the tails” relative to the standard normal, that is, the standard

logistic probability is more dispersed towards the outer values of the support
R. To better exemplify the probabilistic interpretation of density functions,
the Figure also displays – by means of distinct shaded areas – the probability
that x occurs between 1 and 3 for either distribution. 

Figure 1.5: Standard Logistic and Normal Probability Densities
Note: the shaded areas represent the probability that x falls in the [1, 3] interval for either distribution

One can summarize the properties of mass and density functions through
the following statement.

Theorem 1.8. Properties of mass and density functions. A function
fX (x) is an appropriate probability mass or density function of a random
variable X if and only if:

a. fX (x) ≥ 0 for all x ∈ R;

b. Σ_{x ∈ X} fX (x) = 1 or ∫_{X} fX (x) dx = 1 for mass and density functions
respectively.

Proof. (Outline.) Necessity follows by the definitions of cumulative distribution,
mass and density functions. Sufficiency follows by Theorem 1.7 after
having constructed the associated cumulative distribution FX (x).

It must be noted at this point that not all random variables are either exclu-
sively discrete, or exclusively continuous. In numerous situations of interest,
a random variable appears continuous only on a subset of the support and
discrete in other points. In such cases, the definition of cumulative proba-
bility distribution is still valid, however those of mass and density functions
are only valid upon a subset of the support. It is possible to describe these
mixed cases by using a generalized density which is formulated in terms of
a Lebesgue integral, but this is beyond the scope of this treatment.


Example 1.10. Truncated Standard Normal Distribution. Suppose
that a real world phenomenon is distributed according to a standard normal
distribution, but is not actually observed for negative values. An example
is that of an electronic detector of potential power overload, which would
measure any positive deviation from some optimal “average” (which is established
at zero) but would not detect negative deviations, which are recorded
simply as x = 0. In such a case, one typically says that the distribution in
question (here, the standard normal) is truncated at zero. The cumulative
distribution function would read here as:

Φ≥0 (x) = 0       if x < 0
          Φ (x)   if x ≥ 0

and would be drawn as in Figure 1.6 below.

Figure 1.6: Cumulative standard normal distribution truncated at zero

In this case, it is sensible to characterize the density function only for the
nonnegative part of the distribution’s support:

φ≥0 (x) = (1/√(2π)) exp (−x²/2)   if x ≥ 0

which allows one to calculate the probabilities about observations falling
within specified intervals in R+ (observe that ∫_{0}^{∞} φ (x) dx = 0.5). The
description of this particular truncated distribution is completed by specifying
that the rest of the probability mass is found at zero:

P (X = 0) = 0.5

since obviously P (X < 0) = 0. 


1.4 Relating Distributions


Many random variables are related to one another, in the sense that they
convey similar and sometimes identical probabilistic information for certain
events. It is useful to characterize when two random variables are identical,
and when they can be expressed as a transformation from one to another.
Definition 1.16. Identically Distributed Random Variables. Any
two random variables X and Y sharing a sample space S and an associated
sigma algebra B are said to be identically distributed if for every event
A ∈ B, it is P (X ∈ X (A)) = P (Y ∈ Y (A)).
The definition is quite straightforward: two identically distributed ran-
dom variables express the same information about the probability of funda-
mental events in the sample space. Observe that, however, two identically
distributed random variables need not be equal mappings. Suppose that,
for example, one defines X as the count of “Heads” and Y as the count of
“Tails” in the simple coin experiment (even if repeated n times). Clearly X
and Y are identically distributed even if X (A) ≠ Y (A) for all A ∈ B. If
the domain of the original mapping is a subset of R, the following holds.
Theorem 1.9. Identical Distribution. Given two random variables X
and Y whose primitive sample space is a subset of the real numbers S ⊆ R,
the following two statements are equivalent:
a. X and Y are identically distributed;
b. FX (x) = FY (x) for every x in the relevant support.
Proof. (Outline.) Clearly a. implies b. by construction. The converse – that
b. implies a. – requires showing that if the two distributions are identical,
they share a probability function defined for some sigma algebra B of S.
Two random variables also convey similar information if they are related
through some functional dependence. Suppose that given random variable
X, another random variable Y is defined as Y = g (X), where g (·) is some
function – one would commonly say that Y is a transformation of X. Thus,
for any two real numbers a ≤ b, it is:
P (Y ∈ [a, b]) = P (g (X) ∈ [a, b]) (1.3)
and similarly for intervals of the form [a, b), (a, b] and (a, b). The probabilis-
tic relationship between X and Y (and their distributions) becomes more
apparent when the function g is invertible on the interval of interest, hence:
P (Y ∈ [a, b]) = P (X ∈ g⁻¹ ([a, b]))     (1.4)



where g⁻¹ ([a, b]) ⊆ R is the subset of real numbers that is mapped by the
inverse function g⁻¹ (·).⁶ Also note that in general, a transformed random
variable Y has a support Y which differs from the support X of the original
random variable X; an obvious example is Y = exp (X), whereby if X = R,
it is Y = R++ ; conversely, if Y = log (X) and X = R++ , it is Y = R.
A relevant question is how to calculate the distribution and the mass or
density functions of Y starting from those of X. If X is discrete, so is Y,
and the calculation of mass functions is straightforward.

fY (y) = fX (g⁻¹ (y))     (1.5)

Thus, the cumulative distribution for Y can be obtained by summing all
the mass points for preceding values in the support.

FY (y) = Σ_{t ∈ Y : t ≤ y} fY (t)

For continuous random variables, things are slightly more complicated. Let
us start from the following result about cumulative distributions.
Theorem 1.10. Cumulative Distribution of Transformed Random
Variables. Let X and Y = g (X) be two random variables that are related
by a transformation g (·), X and Y their respective supports, and FX (x) the
cumulative distribution of X.
a. If g (·) is increasing in X, it is FY (y) = FX (g −1 (y)) for all y ∈ Y.
b. If g (·) is decreasing in X and X is a continuous random variable, it
is FY (y) = 1 − FX (g −1 (y)) for all y ∈ Y.
Proof. This is almost tautological: a. is shown as:

FY (y) = ∫_{−∞}^{g⁻¹(y)} fX (x) dx = FX (g⁻¹ (y))

where the first equality is motivated by (1.4) and the fact that an increasing
function applied upon some interval preserves its order. The demonstration
of b. is symmetric:

FY (y) = ∫_{g⁻¹(y)}^{∞} fX (x) dx = 1 − FX (g⁻¹ (y))

since a decreasing function upon an interval inverts the order and because
∫_{−∞}^{a} fX (x) dx + ∫_{a}^{∞} fX (x) dx = 1 if fX (x) is a density function.
⁶ Note that this subset may not equal [g⁻¹ (a) , g⁻¹ (b)] because the inverse mapping
g⁻¹ (·) may not preserve the order or the connectedness of the original interval.


To calculate the transformed density, another theorem – building on the
previous one – comes to the rescue.

Theorem 1.11. Density of Transformed Random Variables (1). Let
X and Y = g (X) be two random variables related by a transformation g (·),
X and Y their respective supports, and fX (x) the probability density function
of X, which is continuous on X. If the inverse of the transformation
function, g⁻¹ (·), is continuously differentiable on Y, the probability density
function of Y can be calculated as follows.

fY (y) = fX (g⁻¹ (y)) |d g⁻¹ (y) /dy|   if y ∈ Y
         0                              if y ∉ Y

Proof. Both increasing and decreasing functions are monotone; hence, since
g⁻¹ (·) is continuously differentiable on Y, for all y ∈ Y:

fY (y) = d FY (y) /dy = fX (g⁻¹ (y)) · (d g⁻¹ (y) /dy)    if g (·) is increasing
                        −fX (g⁻¹ (y)) · (d g⁻¹ (y) /dy)   if g (·) is decreasing

by Theorem 1.10 and the chain rule.

Example 1.11. Uniform to Exponential Transformation. Let X be
a random variable with a uniform distribution on the unit interval: a
random variable with support X = [0, 1], cumulative distribution

FX (x) = 0   if x ∈ (−∞, 0]
         x   if x ∈ (0, 1)
         1   if x ∈ [1, ∞)

and density function fX (x) = 1 [x ∈ [0, 1]], as depicted in Figure 1.7.

Note: cumulative distribution function FX (x) on the left, density function fX (x) on the right

Figure 1.7: Uniform distribution on the unit interval


Now consider the transformed random variable Y = − log X, which is
obtained by applying a decreasing monotone function to X. Observe how
the support of Y is the set of all nonnegative real numbers Y = R+ , and
that the transformation can be inverted: X = exp (−Y ). By applying
Theorem 1.10, the cumulative distribution of the transformed random
variable is:

FY (y) = 1 − exp (−y)

while its density function is:

fY (y) = exp (−y)

as it is easily verified through Theorem 1.11.

Note: cumulative distribution function FY (y) on the left, density function fY (y) on the right

Figure 1.8: Exponential distribution with unit parameter

As it shall be expanded later, the transformed random variable Y follows a
particular case of the exponential distribution, that with unit parameter.
Figure 1.8 shows its cumulative distribution and density function. 
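The transformation of Example 1.11 can be checked by simulation; the sketch below (seed and sample size are arbitrary choices of mine) draws X uniformly on (0, 1), sets Y = −log X, and compares the empirical frequency of {Y ≤ y} with the derived distribution FY (y) = 1 − exp (−y).

```python
import math
import random

random.seed(42)
# random.random() lies in [0, 1); the tiny guard avoids log(0) in the rare zero case
draws = [-math.log(random.random() or 1e-300) for _ in range(100_000)]

for y in (0.5, 1.0, 2.0):
    empirical = sum(d <= y for d in draws) / len(draws)
    print(y, round(empirical, 3), round(1 - math.exp(-y), 3))
```

Each printed pair should agree to roughly two decimal places, as expected from Monte Carlo error at this sample size.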

Theorem 1.11 is restricted to monotone transformations. However, it


can be extended to a more general class of transformations as follows.

Theorem 1.12. Density of Transformed Random Variables (2). Let


X and Y = g (X) be two random variables related by some transformation
g (·), X and Y their respective supports, and fX (x) the probability density
function of X. Suppose further that there exists a partition of the support
of X, X0 , X1 , . . . , XK such that ∪_{i=0}^{K} Xi = X, P (x ∈ X0 ) = 0, and fX (x) is
continuous on each Xi . Finally, suppose that there is a sequence of functions
g1 (x) , . . . , gK (x), each associated with a corresponding set in X1 , . . . , XK ,
satisfying the following conditions for i = 1, . . . , K:

i. g (x) = gi (x) for every x ∈ Xi ;

ii. gi (x) is monotone in Xi ;


iii. Y = {y : y = gi (x) for some x ∈ Xi }, that is, the image of gi (x) is
always equal to the support of Y ;

iv. gi⁻¹ (y) exists and is continuously differentiable on Y.

Then the density of Y can be calculated as follows.

fY (y) = Σ_{i=1}^{K} fX (gi⁻¹ (y)) |d gi⁻¹ (y) /dy|   if y ∈ Y
         0                                            if y ∉ Y

Proof. (Outline.) The logic of this result is that if g (·) is not monotone, but
it can be separated into a sequence of monotone subfunctions over different
intervals of the support of X, then the result of Theorem 1.11 can be applied
to each interval, and thus the density for each point y ∈ Y can be obtained
as the sum of the transformed densities associated with all points in x ∈ X
that map to y (note that this allows g (·) not to be invertible over the entire
support of X, it suffices that it is invertible on each interval in the partition).
The “dummy” set X0 with zero probability allows for discontinuity or even
saddle points separating the K subfunctions in the partition.
Example 1.12. Normal to Chi-squared Transformation. Let X be
a random variable that follows the standard normal distribution Φ (x); the
support of X is thus X = R. Consider the transformation Y = X 2 : function
g (x) = x2 is obviously not monotone over all R. However, it is respectively
decreasing in R− and increasing in R+ ; and it is easy to verify that it satisfies
the requirements of Theorem 1.12 for the following sets.
X0 = {0}
X1 = R−−
X2 = R++
Therefore, the density of Y is obtained, for y ∈ Y = R++ , as:

fY (y) = (1/√(2π)) exp (−(−√y)²/2) · (1/(2√y)) + (1/√(2π)) exp (−(√y)²/2) · (1/(2√y))
       = (1/√(2π)) (1/√y) exp (−y/2)

as g1⁻¹ (y) = −√y and g2⁻¹ (y) = √y. Its cumulative distribution is obtained
by integrating the density.

FY (y) = ∫_{0}^{y} (1/√(2π)) (1/√t) exp (−t/2) dt


Note: cumulative distribution function FY (y) on the left, density function fY (y) on the right.

Figure 1.9: Chi-squared distribution (χ²) with one degree of freedom

This turns out to be a particular case of the chi-squared (χ2 ) distribution,


that with one “degree of freedom” – its cumulative distribution and density
function are displayed in Figure 1.9. The general version of the chi-squared
distribution, its relationship with other distributions besides the standard
normal, and its role in statistical inference are discussed later at length. 
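Example 1.12 lends itself to a simulation check (a sketch of mine; seed, sample size, and tolerance are arbitrary): squaring standard normal draws and counting how often Y ≤ 1 should reproduce P (−1 ≤ X ≤ 1) = 2Φ(1) − 1 ≈ 0.683.

```python
import math
import random

random.seed(0)
# Y = X^2 where X is standard normal
y_draws = [random.gauss(0.0, 1.0) ** 2 for _ in range(200_000)]

empirical = sum(y <= 1.0 for y in y_draws) / len(y_draws)
theory = 2 * (0.5 * (1 + math.erf(1 / math.sqrt(2)))) - 1   # 2*Phi(1) - 1
print(round(empirical, 3), round(theory, 3))   # both close to 0.683
```

This is the chi-squared distribution with one degree of freedom seen "from the sampling side".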
It is worth concluding this general discussion of probability distributions
by introducing the concept of a random variable’s quantile function.
Definition 1.17. Quantile Function. The quantile function associated
with a random variable X is the following function with argument p ∈ (0, 1).
QX (p) = inf {x ∈ X : p ≤ FX (x)}
Observe that QX (p) corresponds to the inverse of FX (x) if the latter is
strictly increasing; otherwise – if FX (x) is flat on segments of the support
of X – the quantile function returns a “pseudo-inverse” with the property
P (QX [FX (X)] ≤ QX (p)) = P (X ≤ QX (p))
for all p ∈ (0, 1), by construction. This is best illustrated via an example.
Example 1.13. The quantile function of a non-strictly-monotonic
distribution. Consider a random variable X with the following cumulative
distribution function.

FX (x) = [1 + exp (−x)]⁻¹        if x ∈ (−∞, 0]
         0.5                     if x ∈ (0, 3)
         [1 + exp (−x + 3)]⁻¹    if x ∈ [3, ∞)

The shape of this distribution closely mimics that of the standard logistic
from Example 1.7, but with a crucial difference that makes it “non-strictly”
monotonic: over the (0, 3) interval this distribution is flat, and it resumes
a “standard logistic” behavior only for x ≥ 3, as if the support is shifted by
three units of measurement associated with the random variable X.


Note: cumulative distribution function FX (x) on the left, quantile function QX (p) on the right.

Figure 1.10: Quantile function for a non-strictly-monotonic distribution

According to the definition, the quantile function of X is as follows.

QX (p) = log (p) − log (1 − p)        if p ∈ (0, 0.5]
         log (p) − log (1 − p) + 3    if p ∈ (0.5, 1)
This “pseudo-inverse” is visibly discontinuous at p = 0.5, which is the value
that the cumulative distribution takes for x ∈ [0, 3] – an interval whose infi-
mum is the value that the quantile function itself takes at the discontinuity,
QX (0.5) = 0. The cumulative distribution as well as the quantile function
of X are both displayed in Figure 1.10. 
Quantile functions have several applications; a typical one is grounded
on the result that is expressed and demonstrated next.
Theorem 1.13. Cumulative Transformation. For any continuous ran-
dom variable X with cumulative distribution denoted as FX (x), the trans-
formation P = FX (X) follows a uniform distribution on the unit interval.
Proof. By the properties of quantile functions (including the fact that they
are monotone increasing by definition), for all p ∈ (0, 1) it holds that:
P (P ≤ p) = P (FX (X) ≤ p)
= P (QX [FX (X)] ≤ QX (p))
= P (X ≤ QX (p))
= FX (QX (p))
=p
where the fourth and fifth lines follow from the definition and continuity of
FX (x). Since FP (p) = 0 for p ≤ 0 and FP (p) = 1 for p ≥ 1 by construction,
P follows a uniform distribution on the interval (0, 1).


This theorem motivates the use of the uniform distribution for generating
random draws from any distribution FX (x). Given that it is easier to obtain
actual random draws from the uniform distribution, it is convenient to do so
and then apply the quantile function QX (p) in order to obtain the desired
random draws from FX (x), whatever the random variable X of interest.
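This inverse-transform recipe is easy to implement; the following is a sketch (the target distribution is my choice: the unit exponential, whose quantile function Q(p) = −log(1 − p) inverts FY (y) = 1 − exp (−y) from Example 1.11).

```python
import math
import random

random.seed(7)

def Q(p):
    # quantile function of F(y) = 1 - exp(-y), for p in [0, 1)
    return -math.log(1.0 - p)

# push uniform draws through the quantile function
sample = [Q(random.random()) for _ in range(100_000)]
mean = sum(sample) / len(sample)
print(round(mean, 2))   # close to 1, the unit exponential's mean
```

The same pattern works for any distribution whose quantile function is available in closed form or numerically.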

1.5 Moments of Distributions


The moments of a probability distribution are quantities that summarize
some of its properties. Moments can be either uncentered or centered.
Definition 1.18. Uncentered Moments. The r-th uncentered moment
of a random variable X with support X, denoted as E [X^r], is defined for
some positive integer r and for discrete random variables as

E [X^r] = Σ_{x ∈ X} x^r fX (x)

while it is defined as follows in the case of continuous random variables.

E [X^r] = ∫_{X} x^r fX (x) dx

The zero-th uncentered moment E [X⁰] is defined as the unweighted sum of
the probability mass function or the unweighted integral of the probability
density function over the support of X, and is thus by definition E [X⁰] = 1.
The most important uncentered moment is that for r = 1, E [X]; it is called
mean or expected value (or even expectation). The mean is a measure
of the “central” value of the distribution of X in a probabilistic sense, and
it is obtained by summing or integrating all the elements of the support,
weighted by their probability mass or density. The mean is instrumental in
the definition of the centered moments.
weighted by their probability mass or density. The mean is instrumental in
the definition of the centered moments.
Definition 1.19. Centered Moments. The r-th centered moment of a
random variable X with support X, denoted as E [(X − E [X])r ], is defined
for some positive integer r and for discrete random variables as:
X
E [(X − E [X])r ] = (x − E [X])r fX (x)
x∈X

while it is defined as follows in the case of continuous random variables.


ˆ
r
E [(X − E [X]) ] = (x − E [X])r fX (x) dx
X


The most important centered moment is that for r = 2, the variance:
it is always nonnegative and is denoted as follows.

Var [X] = E [(X − E [X])²] ≥ 0     (1.6)
The variance provides a measure of the “dispersion” of the distribution of
X around the mean.⁷ Higher centered moments provide the basis for other
relevant measures. For example, the centered moment for r = 3, when it is
standardized by the variance up to the power of 1.5, delivers the so-called
skewness:

Skew [X] = E [(X − E [X])³] / (E [(X − E [X])²])^{3/2}   ⋛ 0

a measure of the degree of asymmetry of a distribution. The skewness is a
positive number for asymmetric distributions that are “skewed” to the right
of the mean – and vice versa; it is equal to zero only for distributions that
are symmetric around the mean. The centered moment for r = 4 instead
is functional for calculating the kurtosis, which is defined through another
standardization – now, by the square of the variance:

Kurt [X] = E [(X − E [X])⁴] / (E [(X − E [X])²])² ≥ 0

and it is once again an always nonnegative number; indeed a measure of the
overall “thickness” of the distribution – the relative frequency of realizations
of X that are distant from the mean. Centered moments can be conveniently
expressed as functions of uncentered moments only. In the variance case,
for example:

Var [X] = E [(X − E [X])²]
        = E [X²] − 2 E [X] E [X] + E [X]²
        = E [X²] − E [X]²

while in general – for higher centered moments – a variation of the binomial
formula applies, where C(r, i) = r!/[i! (r − i)!] denotes the binomial coefficient.

E [(X − E [X])^r] = Σ_{i=0}^{r} C(r, i) (−1)^{r−i} E [X^i] E [X]^{r−i}

This is a useful fact since, as it is discussed later, the uncentered moments of
a distribution can be calculated through the associated moment generating
or characteristic functions.

⁷ Clearly, the variance is equal to zero only in the case of degenerate discrete distributions
where the entire probability mass is concentrated in one single point.


Example 1.14. Mean and variance for coin experiments. Consider
the usual simple example about tossing a coin whereby xcoin = 1 if Head and
xcoin = 0 if Tail. Suppose however that the coin is “unfair:” in particular,
Head occurs with probability 0.6 and Tail with probability 0.4. Thus:

E [Xcoin] = 1 · fXcoin (1) + 0 · fXcoin (0) = 1 · 0.6 + 0 · 0.4 = 0.6

whereas the variance can be calculated as follows.

Var [Xcoin] = (1 − E [Xcoin])² · fXcoin (1) + (0 − E [Xcoin])² · fXcoin (0)
            = (0.4)² · 0.6 + (−0.6)² · 0.4
            = 0.24

Now consider the more complex case of the random variable Xn.coins which
counts the heads out of n iterations of the simple coin experiment. In total,
there are 2^n possible sequences of outcomes; however, it is quite easy to
enumerate those corresponding to exactly x heads by using the binomial
coefficient C(n, x) = n!/[x! (n − x)!]. Because every single sequence counting
exactly x heads occurs with probability 0.6^x · 0.4^{n−x}, the probability mass
function for this specific random variable can be written as:

fXn.coins (x) = C(n, x) · 0.6^x · 0.4^{n−x}

and its expected value can be calculated as follows, for y = x − 1:

E [Xn.coins] = Σ_{x=0}^{n} x C(n, x) · 0.6^x · 0.4^{n−x}
             = Σ_{x=1}^{n} n C(n−1, x−1) · 0.6^x · 0.4^{n−x}
             = 0.6 · n Σ_{x=1}^{n} C(n−1, x−1) · 0.6^{x−1} · 0.4^{(n−1)−(x−1)}
             = 0.6 · n Σ_{y=0}^{n−1} C(n−1, y) · 0.6^y · 0.4^{n−1−y}
             = 0.6 · n

where the last simplification occurs because the summation in the second-to-last
line is recognized as the total probability mass of an analogous,
hypothetical experiment with n − 1 attempts and y successes.


The second uncentered moment is calculated similarly.

E [Xn.coins²] = Σ_{x=0}^{n} x² C(n, x) · 0.6^x · 0.4^{n−x}
             = Σ_{x=1}^{n} x n C(n−1, x−1) · 0.6^x · 0.4^{n−x}
             = n Σ_{y=0}^{n−1} (y + 1) C(n−1, y) · 0.6^{y+1} · 0.4^{n−1−y}
             = 0.6 · n Σ_{y=0}^{n−1} y C(n−1, y) · 0.6^y · 0.4^{n−1−y}
               + 0.6 · n Σ_{y=0}^{n−1} C(n−1, y) · 0.6^y · 0.4^{n−1−y}
             = 0.6 · n · [0.6 · (n − 1) + 1]

Here, the two summations in the second-to-last line also simplify, since they
correspond respectively to the mean – equaling 0.6 · (n − 1) – and the total
probability mass – equaling 1 – of the hypothetical experiment discussed
earlier. Exploiting the fact that Var [Xn.coins] = E [Xn.coins²] − E [Xn.coins]²,
it is easy to verify that the variance of Xn.coins equals 0.24 · n. 
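The closed forms of Example 1.14 can be verified by brute-force summation over the probability mass function (a sketch; the helper name is mine).

```python
import math

def moments(n, p=0.6):
    # pmf of the number of heads in n tosses of a coin with Head probability p
    pmf = [math.comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)]
    mean = sum(x * f for x, f in enumerate(pmf))
    var = sum((x - mean) ** 2 * f for x, f in enumerate(pmf))
    return mean, var

for n in (1, 10, 50):
    mean, var = moments(n)
    print(n, round(mean, 6), round(var, 6))   # 0.6*n and 0.24*n
```

The direct sums match the analytical results 0.6 · n and 0.24 · n for every n.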
Example 1.15. Mean and variance of the uniform distribution. Let
again some continuous random variable X follow the uniform distribution
on the (0, 1) interval. Its mean is given simply by:

E [X] = ∫_{0}^{1} x dx = [x²/2]_{0}^{1} = 1/2

whereas, since:

E [X²] = ∫_{0}^{1} x² dx = [x³/3]_{0}^{1} = 1/3

the variance is calculated as Var [X] = E [X²] − E [X]² = 1/3 − 1/4 = 1/12. 
Example 1.16. Mean and variance of the exponential distribution.
Consider some random variable Y that follows the exponential distribution
with unit parameter, which is connected with the uniform distribution on
the [0, 1] interval through the transformation Y = − log X as per Example
1.11. Its mean is calculated through integration by parts:

E [Y ] = ∫_{0}^{∞} y exp (−y) dy = [−y exp (−y)]_{0}^{∞} + ∫_{0}^{∞} exp (−y) dy = 1

since lim_{M→∞} [−y exp (−y)]_{0}^{M} = 0 and ∫_{0}^{∞} exp (−y) dy = 1. The variance is
calculated by noting that, integrating by parts again:

E [Y²] = ∫_{0}^{∞} y² exp (−y) dy = [−y² exp (−y)]_{0}^{∞} + 2 ∫_{0}^{∞} y exp (−y) dy = 2

hence Var [Y ] = E [Y²] − E [Y ]² = 2 − 1 = 1. 
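The exponential moments of Example 1.16 can be cross-checked numerically; the sketch below (grid choices are arbitrary) approximates E [Y] and E [Y²] with a midpoint rule on [0, 50], where the neglected tail is negligible.

```python
import math

def moment(r, upper=50.0, n=100_000):
    # midpoint-rule approximation of the integral of y^r * exp(-y) over [0, upper]
    h = upper / n
    return sum(((i + 0.5) * h) ** r * math.exp(-(i + 0.5) * h) * h
               for i in range(n))

m1, m2 = moment(1), moment(2)
print(round(m1, 4), round(m2, 4), round(m2 - m1 ** 2, 4))   # close to 1, 2 and 1
```

The three printed values approximate the mean, the second moment, and the variance derived analytically above.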
One is often interested in the analysis of the moments of a transformed
random variable Y = g (X) in terms of the moments of the original random
variable X. By applying the standard linear properties of summations and
integration, it is quite easy to see that if Y = a + bX, then:

E [Y ] = a + b E [X]

and, since E [(Y − E [Y ])²] = E [b² (X − E [X])²], the following holds too.

Var [Y ] = b² Var [X]

For non-linear functions g (X), Jensen’s Inequality can be extended to


probability distributions to show that, if g (·) is a concave function,

E [g (X)] ≤ g (E [X])

while if g (·) is a convex function instead, the converse applies.

E [g (X)] ≥ g (E [X])

This shows that, in general, a non-linear function does not pass through
the expectation operator. In addition, a first-order approximation of g (X)
based on a Taylor expansion around E [X]:

g (X) ≈ g (E [X]) + g′ (E [X]) [X − E [X]]

shows that E [g (X)] ≈ g (E [X]) is hardly an acceptable approximation.


However, one can rewrite the expansion as:

g (X) ≈ [g (E [X]) − g′ (E [X]) E [X]] + g′ (E [X]) X

showing that the approximation:

Var [g (X)] ≈ [g′ (E [X])]² Var [X]

is actually a decent one, as it accounts for the first order term of the series.
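Jensen's inequality is easy to see on a two-point distribution (the numbers are mine, chosen for illustration): for the concave g(x) = √x and X taking the values 1 and 4 with probability 0.5 each, E [g(X)] falls short of g(E [X]).

```python
import math

support, probs = [1.0, 4.0], [0.5, 0.5]
e_x = sum(x * p for x, p in zip(support, probs))              # E[X] = 2.5
e_gx = sum(math.sqrt(x) * p for x, p in zip(support, probs))  # E[sqrt(X)] = 1.5
print(e_gx, math.sqrt(e_x))   # 1.5 versus sqrt(2.5), about 1.58
```

Reversing the direction with a convex g (say, g(x) = x²) flips the inequality, as the text states.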
The next two properties of the mean and the variance are instrumental
in establishing some important results of asymptotic theory.


Theorem 1.14. Markov’s Inequality. Given a nonnegative random variable
X ∈ R+ and a constant k > 0, it must be P [X ≥ k] ≤ E [X] /k.

Proof. Apply the decomposition

E [X] = ∫_{0}^{+∞} x f (x) dx ≥ ∫_{k}^{+∞} x f (x) dx ≥ k ∫_{k}^{+∞} f (x) dx = k P [X ≥ k]

with the first equality requiring X to be nonnegative.


Theorem 1.15. Čebyšëv’s Inequality. Given a random variable Y ∈ R
and a number δ > 0, it must be P [|Y − E [Y ]| ≥ δ] ≤ Var [Y ] /δ².

Proof. Rephrase Markov’s inequality setting X = (Y − E [Y ])² and k = δ²,
and notice that:

P [|Y − E [Y ]| ≥ δ] = P [(Y − E [Y ])² ≥ δ²] ≤ E [(Y − E [Y ])²] /δ² = Var [Y ] /δ²

as postulated.
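A concrete check of Čebyšëv's inequality (the example distribution is my choice): for Y uniform on (0, 1), with E [Y] = 0.5 and Var [Y] = 1/12, the exact tail probability P (|Y − 0.5| ≥ δ) = 1 − 2δ never exceeds the bound.

```python
def exact_tail(delta):
    # P(|Y - 0.5| >= delta) for Y uniform on (0, 1), valid for 0 < delta <= 0.5
    return max(0.0, 1.0 - 2.0 * delta)

def chebyshev_bound(delta):
    return (1.0 / 12.0) / delta ** 2   # Var[Y] / delta^2

for d in (0.2, 0.3, 0.4):
    print(d, exact_tail(d), round(chebyshev_bound(d), 3))
```

As is typical, the bound is loose (for δ = 0.2 it even exceeds one) but always valid.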
An additional important relationship about the mean and the variance,
which helps build intuition about these two moments and interpreting them,
relates to the property of the mean as being the “best predictor” of a random
variable X under a quadratic criterion of distance. More concretely, suppose
that one is aiming to guess an unknown realization of X, whose distribution
is known, by solving the following problem:

min_{X̂} E [(X − X̂)²]

which is intuitively appealing, since unexpected deviations that are larger in magnitude, when squared, count more towards the evaluation of the above expectation. It is easy to see that the solution is found at X̂ = E [X]:

E [(X − X̂)²] = E [(X − E [X] + E [X] − X̂)²]
             = E [(X − E [X])²] + E [(E [X] − X̂)²]
               + 2 E [(X − E [X]) (E [X] − X̂)]                    (1.7)
             = Var [X] + E [(E [X] − X̂)²]


where the third term in the second line is zero because:

E [(X − E [X]) (E [X] − X̂)] = (E [X] − X̂) E [X − E [X]] = 0

since neither E [X] nor X̂ is random, so that (E [X] − X̂) can be taken out of the expectation operator, while E [X − E [X]] = 0 by definition. Of the two remaining terms in the last line of (1.7), the first one – the variance of X – is constant, while the second is shrunk to zero when X̂ = E [X]. Thus, in addition to the interpretation of the mean as "best predictor" under quadratic distances, the variance is intuitively interpreted as the prediction error that "cannot be removed." Later in Lecture 7 this property of the mean and variance is generalized in a setting where multiple random variables are used to jointly predict the realization of some other random variable of interest.
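The "best predictor" property can also be verified numerically. In the sketch below (illustrative only; the discrete distribution is an arbitrary choice) the quadratic criterion is computed exactly from a probability mass function and minimized over a grid of candidate predictions, and the minimizer lands on E [X].

```python
# Distribution of X: support points with probabilities (arbitrary illustrative choice)
support = [0, 1, 2, 5]
probs = [0.1, 0.4, 0.3, 0.2]

mean = sum(x * p for x, p in zip(support, probs))  # E[X] = 2.0

def expected_sq_error(guess):
    # E[(X - guess)^2], computed exactly from the mass function
    return sum(p * (x - guess) ** 2 for x, p in zip(support, probs))

# Search a fine grid of candidate predictors on [0, 6]
grid = [i / 100 for i in range(0, 601)]
best = min(grid, key=expected_sq_error)

assert abs(best - mean) < 0.01          # the minimizer sits at E[X]
# ... and the minimized criterion equals Var[X], the "irreducible" error
var = expected_sq_error(mean)
```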
The last concept covered in this lecture is about classes of functions that
are most useful to calculate the moments of a distribution.
Definition 1.20. Moment generating function. Given a random variable X with support X, the moment-generating function MX (t) is defined, for t ∈ R, as the expectation of the transformation g (X) = exp (tX), so long as it exists; for discrete random variables this is:

MX (t) = E [exp (tX)] = Σ_{x∈X} exp (tx) fX (x)

while for continuous random variables, it is as follows.

MX (t) = E [exp (tX)] = ∫_X exp (tx) fX (x) dx

Moment generating functions draw their name from the following result.
Theorem 1.16. Moment generation. If a random variable X has an associated moment generating function MX (t), its r-th uncentered moment can be calculated as the r-th derivative of the moment generating function evaluated at t = 0.

E [X^r] = [d^r MX (t)/dt^r]_{t=0}

Proof. Note that, for all r = 1, 2, . . . :

d^r MX (t)/dt^r = (d^r/dt^r) E [exp (tX)] = E [(d^r/dt^r) exp (tX)] = E [X^r exp (tX)]

so long as the r-th derivative with respect to t can pass through the expectation operator. If so, it is E [X^r exp (tX)] = E [X^r] for t = 0.


Obviously, this allows one to obtain centered moments as well, by the earlier observation that all centered moments can be expressed as functions of the uncentered ones. It is a useful exercise to calculate both the mean and the variance of the distributions from Examples 1.14, 1.15 and 1.16 by using the respective moment generating functions, which are provided in the following additional list of examples.
Example 1.17. Moment generating function for coin experiments. Let us return to the coin experiments from Example 1.14. In the one-attempt case, the moment generating function is simple to calculate.

MXcoin (t) = exp (t · 1) · fXcoin (1) + exp (t · 0) · fXcoin (0) = 0.6 · exp (t) + 0.4

With n attempts instead:

MXn.coins (t) = Σ_{x=0}^{n} C(n, x) exp (tx) · 0.6^x · 0.4^{n−x}
             = Σ_{x=0}^{n} C(n, x) [0.6 · exp (t)]^x · 0.4^{n−x}
             = [0.6 · exp (t) + 0.4]^n

where C(n, x) denotes the binomial coefficient "n choose x;" the result is obtained with another application of the binomial formula. 
Example 1.18. Moment generating function of the uniform distri-
bution. The moment generating function of a random variable X which is
uniformly distributed on the unit segment is simple to calculate.
MX (t) = ∫_0^1 exp (tx) dx = [exp (tx)/t]_0^1 = [exp (t) − 1]/t

Note: it might seem that this function is ill-defined at t = 0, but applying the Taylor expansion exp (t) = Σ_{n=0}^∞ t^n/n! for t = 0 into the above formula would reveal that actually, MX (0) = 1. 
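The closed form above can be cross-checked against a direct numerical evaluation of E [exp (tX)]; the sketch below (illustrative only; the grid size is an arbitrary choice) approximates the defining integral with a midpoint Riemann sum.

```python
import math

def mgf_uniform(t):
    # Closed form from the example, with the t = 0 limit handled separately
    return 1.0 if t == 0 else (math.exp(t) - 1.0) / t

def mgf_numeric(t, n=10_000):
    # Midpoint Riemann sum of exp(t*x) over the unit segment
    return sum(math.exp(t * (i + 0.5) / n) for i in range(n)) / n

for t in (-1.0, 0.5, 2.0):
    assert abs(mgf_uniform(t) - mgf_numeric(t)) < 1e-6
```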
Example 1.19. Moment generating function of the exponential dis-
tribution. Calculations are not that more difficult in the case of a random
variable Y following the exponential distribution with unit parameter, but
it must be noted that it only exists for t < 1.
MY (t) = ∫_0^∞ exp ((t − 1) y) dy = lim_{M→∞} [−exp (−(1 − t) y)/(1 − t)]_0^M = 1/(1 − t)
In fact, it is easy to see that the above integral diverges if t ≥ 1. 


The moment generating functions of two random variables Y and X that are related by a linear transformation, say Y = a + bX, are also themselves related through a simple formula. In fact:

MY (t) = exp (at) MX (bt)

where MX (t) and MY (t) are the moment generating functions of X and Y respectively. This fact is easily shown as follows.

E [exp (tY )] = E [exp (ta + tbX)] = exp (at) E [exp (btX)]
A fundamental property of moment generating functions is that they
uniquely characterize a distribution, in the sense that each distribution
FX (x) has its own distinct moment generating function MX (t). The proof
of this result requires more involved mathematics and is not developed here.
Observe that, in general, a sequence of moments does not identify a unique distribution. It can be shown that it does so only for a specific subset of distributions, which includes those with bounded support. In other words, there exist different distributions, necessarily with unbounded support, that have different moment generating functions but share identical moments.
As hinted, sometimes a moment-generating function does not exist, or
it is not defined within any open interval around t = 0. In such a case, the
alternative characteristic function ϕX (t) is guaranteed to always exist,
and to be unique for each distribution.
Definition 1.21. Characteristic function. For a given random variable X with support X, the characteristic function ϕX (t) is defined, for t ∈ R and for discrete random variables, as:

ϕX (t) = E [exp (itX)] = Σ_{x∈X} exp (itx) fX (x)

while for continuous random variables it is:

ϕX (t) = E [exp (itX)] = ∫_X exp (itx) fX (x) dx

where i is the imaginary unit.
A result analogous to Theorem 1.16 shows that it is possible to calculate all
the moments of a distribution by using the characteristic functions instead.
E [X^r] = (1/i^r) · [d^r ϕX (t)/dt^r]_{t=0}   for r ∈ N
As it involves complex numbers, the characteristic function is a much more
difficult object to handle than the moment-generating function. The char-
acteristic function is most useful to prove fundamental results of asymptotic
probability theory, such as the Central Limit Theorem.

36
Lecture 2

Common Distributions

This lecture is essentially an annotated list of the probability distributions that are most frequently encountered in practice. The list is organized by type of distribution (e.g. discrete vs. continuous) and, especially in the case of the continuous ones, by "family" – a term that denotes groups of distributions sharing similar characteristics. The objective of this lecture is not simply to become familiar with important and frequently found distributions, but also to highlight those relationships between distributions that are especially useful towards statistical inference and econometric analysis.

2.1 Discrete Distributions


Some examples of discrete probability distributions are developed in Lecture
1 starting from the simple coin experiment. In the first part of this section
these examples are more rigorously generalized, allowing for a parametriza-
tion of the hypothetical experiment. Subsequently, the discussion of other
discrete distributions follows. Starting from this section, some distributions
are associated with a specific notation, so that it is easier to interpret – for
example – the following conventional expression:
X ∼ Be (p)
as “the random variable X is distributed according to the Bernoulli distri-
bution with parameter p.” While allowing for parameters, it can be useful
to indicate them in the expression of mass or density functions, cumulative
distributions, et cetera. Therefore, a conventional expression such as:
fX (x; p) = px + (1 − p) (1 − x)
means that X follows the above probability mass function given a parameter
p. Note how a semicolon separates realizations and parameters in fX (x; p).


Bernoulli distribution
The Bernoulli distribution is the one describing dichotomous events akin to
those of the coin experiment. In general, one must allow for the two events
under consideration to occur with different probabilities (for example, coins
may not be “fair” or “balanced”). One writes X ∼ Be (p) if X = {0, 1} and:
P (X = 1) = p
P (X = 0) = 1 − p
implying a probability mass function that can be written as in the above
example about notation, or equivalently – but more elegantly – as:
fX (x; p) = p^x (1 − p)^{1−x} (2.1)
for x ∈ {0, 1} and p ∈ [0, 1]. The cumulative distribution writes:
FX (x; p) = (1 − p) · 1 [x ∈ [0, 1)] + 1 [x ∈ [1, ∞)] (2.2)
its moment generating function is:
MX (t; p) = p exp (t) + (1 − p) (2.3)
and this lets one obtain the mean and the variance easily as follows.
E [X] = p (2.4)
Var [X] = p (1 − p) (2.5)
The Bernoulli distribution is elementary; thus, it forms the basis for several
other discrete distributions.
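Since the support only has two points, all of the Bernoulli formulas can be verified by direct enumeration. The sketch below (illustrative only; p = 0.3 is an arbitrary choice) checks (2.1), (2.4) and (2.5).

```python
p = 0.3  # arbitrary illustrative parameter

# Probability mass function (2.1), evaluated on the whole support {0, 1}
pmf = {x: p ** x * (1 - p) ** (1 - x) for x in (0, 1)}
assert abs(sum(pmf.values()) - 1.0) < 1e-12  # probabilities sum to one

mean = sum(x * f for x, f in pmf.items())
var = sum((x - mean) ** 2 * f for x, f in pmf.items())

assert abs(mean - p) < 1e-12           # E[X] = p, equation (2.4)
assert abs(var - p * (1 - p)) < 1e-12  # Var[X] = p(1 - p), equation (2.5)
```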

Binomial distribution
The binomial distribution characterizes a random variable defined on a sample space constituted by all possible recombinations of n Bernoulli (binary) events with probability p, and it measures the probability of observing a given number x of realizations of one alternative and n − x of the other. Thus, this distribution corresponds to the hypothetical experiment from Lecture 1 about tossing several (say, n) possibly unbalanced coins. Conventionally, the outcomes counted as
x are defined as “successes” and those counted as n − x as “failures;” for this
reason, it is common to verbally describe the binomial distribution as the
one that measures the “probability of x successes of a binary phenomenon
out of n attempts.” In less verbal terms, a random variable X that follows
the binomial distribution is typically denoted as follows.
X ∼ Bn (p, n)


The above expression highlights the two parameters that characterize


the binomial distribution: the probability of a single trial p ∈ [0, 1] and the
number of trials n ∈ N. It is helpful to appreciate the following facts about
random variables X that follow a binomial distribution:
• a possible reformulation of the sample space is the set S = {0, 1}n ;
• the support of the random variable is X = {0, 1, . . . , n};
• if the probability of all Bernoulli realizations is p, all the events that
define the binomial distribution are mutually independent.
The binomial distribution owes its name to the fact that its probability mass function is described through the formula:

P (X = x; p, n) = fX (x; p, n) = C(n, x) p^x (1 − p)^{n−x}   (2.6)
as argued earlier in the particular context of Example 1.14, this is due to the fact that the count of possible outcomes featuring exactly x successes is given by the binomial coefficient C(n, x) = n!/[x! (n − x)!]. The cumulative probability function writes, for x ∈ [0, n], and given ⌊x⌋ the largest integer not exceeding x, as:

P (X ≤ x; p, n) = FX (x; p, n) = Σ_{i=0}^{⌊x⌋} C(n, i) p^i (1 − p)^{n−i}   (2.7)

with FX (n; p, n) = 1 per the binomial formula.

FX (n; p, n) = Σ_{x=0}^{n} C(n, x) p^x (1 − p)^{n−x} = [p + (1 − p)]^n = 1
The moment generating function is similarly obtained (see Example 1.17):

MX (t; p, n) = Σ_{x=0}^{n} C(n, x) exp (tx) p^x (1 − p)^{n−x} = [p exp (t) + (1 − p)]^n   (2.8)
while the mean and variance are as follows (one can calculate them through
the approach in Example 1.14, or with the moment generation function).
E [X] = np (2.9)
Var [X] = np (1 − p) (2.10)
The two distributions discussed next are variations on the idea of multiple
Bernoulli events or, as these are usually referred to, “Bernoulli trials.”
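The binomial formulas above can be verified by summing the mass function over its finite support. The sketch below (illustrative only; n and p are arbitrary choices) checks (2.6) together with the mean and variance formulas.

```python
import math

n, p = 10, 0.4  # arbitrary illustrative parameters

def pmf(x):
    # Binomial probability mass function, equation (2.6)
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

total = sum(pmf(x) for x in range(n + 1))
mean = sum(x * pmf(x) for x in range(n + 1))
var = sum((x - mean) ** 2 * pmf(x) for x in range(n + 1))

assert abs(total - 1.0) < 1e-12           # binomial formula: masses sum to one
assert abs(mean - n * p) < 1e-9           # E[X] = np
assert abs(var - n * p * (1 - p)) < 1e-9  # Var[X] = np(1 - p)
```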


Geometric distribution
Consider the sample space constructed out of all combinations of an infinite
number of Bernoulli trials with identical probability p. Suppose that these
trials are ordered; for example, the order might correspond to a sequence in
time when the trials are realized. Rather than defining a random variable
X that counts the number of successes, let X ∈ N denote the index of the
first Bernoulli trial in the sequence for which a “success” is observed. It is:
P (X = x; p) = fX (x; p) = p (1 − p)^{x−1}   (2.11)

because for a success with probability p to happen in the x-th trial, x − 1 failures must first occur, an event with probability (1 − p)^{x−1}. The probability mass function in (2.11) is characterized by a geometric series, which motivates the name for the distribution associated with X. By the properties of the geometric series, for x ∈ R the cumulative distribution function of X is obtained as:

P (X ≤ x; p) = FX (x; p) = Σ_{i=0}^{⌊x⌋−1} p (1 − p)^i = p · [1 − (1 − p)^{⌊x⌋}]/[1 − (1 − p)] = 1 − (1 − p)^{⌊x⌋}   (2.12)
which converges to 1 as x → ∞; while it is FX (x; p) = 0 for x < 1. Similarly,
the moment generating function is obtained, for t < − log (1 − p), as:

MX (t; p) = lim_{M→∞} Σ_{x=1}^{M} exp (tx) · p (1 − p)^{x−1}
          = p exp (t) · lim_{M→∞} Σ_{x=1}^{M} [(1 − p) · exp (t)]^{x−1}
          = p exp (t) · lim_{M→∞} [1 − [(1 − p) · exp (t)]^M]/[1 − (1 − p) · exp (t)]   (2.13)
          = p exp (t)/[1 − (1 − p) exp (t)]
allowing one to derive the mean and variance after some tedious calculations.

E [X] = 1/p   (2.14)
Var [X] = (1 − p)/p²   (2.15)


A very important feature of the geometric distribution is its memoryless property, which is defined as follows for any two integers s, t with s > t:

P (X > s | X > t) = P (X > s − t)   (2.16)

that is, the probability that success takes more than s trials, conditional on t trials having already occurred (implicitly, with failure), is equal to the ex ante probability that success takes more than s − t trials. In other words – keeping with the time interpretation of the sequence of Bernoulli trials – every failure is uninformative about future odds of success (or failure): it is as if the calculation of future odds "forgets" past realizations. This property of the geometric distribution is easily proved as follows.
P (X > s | X > t) = P (X > s ∩ X > t)/P (X > t)
                  = P (X > s)/P (X > t)
                  = (1 − p)^s/(1 − p)^t
                  = (1 − p)^{s−t}
                  = P (X > s − t)
Because of the memoryless property, the geometric distribution is used to
model the “waiting count” of homogeneous Bernoulli trials that occur before
some events of interest, on the assumption that the passing of time does not
affect their probabilities. It is not an adequate distribution to describe, say,
the probability that some physical objects stop to function or some living
beings die if it is likely that some aging process – of either physical objects
or living beings – may affect these probabilities.1
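The memoryless property can be checked directly from the survival function P (X > k) = (1 − p)^k, as in the sketch below (illustrative only; the value of p and the ranges are arbitrary choices).

```python
def survival(k, p):
    # P(X > k) for a geometric random variable: k consecutive failures
    return (1 - p) ** k

p = 0.25
for t in range(1, 10):
    for s in range(t + 1, 15):
        lhs = survival(s, p) / survival(t, p)  # P(X > s | X > t)
        rhs = survival(s - t, p)               # P(X > s - t)
        assert abs(lhs - rhs) < 1e-12          # memoryless property (2.16)
```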

Negative binomial distribution


Consider the same setup as that of the geometric distribution: all the possi-
ble combinations of an infinite number of ordered Bernoulli trials. However,
let the outcome of interest be not the number of trials it takes to achieve
one success, but instead the number of trials X ∈ N it takes to get a generic
number of successes r ∈ N. This probability is defined as:

P (X = x; p, r) = fX (x; p, r) = C(x − 1, r − 1) p^r (1 − p)^{x−r}   (2.17)
which is the probability mass function of the negative binomial distribution;
one would typically express this as follows.
X ∼ NB (p, r)
1
In these cases, the expression “success” of a Bernoulli trial is certainly a misnomer.


The motivation for (2.17) is readily explained: every unique combination of x Bernoulli trials featuring r successes and terminating with a success in the x-th trial must contain exactly r − 1 successes in the previous x − 1 trials; the number of those combinations is counted through the binomial coefficient C(x − 1, r − 1), and each of them occurs with probability p^r (1 − p)^{x−r}.

Observation 2.1. The geometric distribution is a special case of the nega-


tive binomial distribution, with r = 1; thus it is denoted as X ∼ NB (p, 1).
The negative binomial distribution can also be expressed by the alternative random variable Y = X − r which counts the number of failures that occur before the r-th success; by analogous reasoning:

P (Y = y; p, r) = fY (y; p, r) = C(r + y − 1, y) p^r (1 − p)^y   (2.18)

which, by the properties of the binomial coefficients with negative integers, can also be written as:

P (Y = y; p, r) = fY (y; p, r) = (−1)^y C(−r, y) p^r (1 − p)^y
which motivates the term negative binomial, thanks to the resemblance with
(2.6) if not for the (possibly) negative multiplicative term. The cumulative
distribution is obtained by appropriately summing over (2.17) – or (2.18)
– and it can be shown that, when its argument x (or y) goes to infinity, it
has limit one. The moment generating function is, for t < − log (1 − p):

MX (t; p, r) = [p exp (t)/(1 − (1 − p) exp (t))]^r   (2.19)
while the key moments of X are as follows (those of Y are derived easily).
E [X] = r/p   (2.20)
Var [X] = r (1 − p)/p²   (2.21)
The negative binomial distribution is frequently used in both statistics and
econometrics in order to model countable events of interest.
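The mass function (2.17) and the mean formula can be checked by summing over a suitably long truncation of the infinite support, as in the sketch below (illustrative only; p, r, and the truncation point are arbitrary choices).

```python
import math

p, r = 0.4, 3  # arbitrary illustrative parameters

def pmf(x):
    # Negative binomial mass function, equation (2.17): x trials until the r-th success
    return math.comb(x - 1, r - 1) * p ** r * (1 - p) ** (x - r)

# Truncate the support far enough out for the residual tail to be negligible
xs = range(r, 500)
total = sum(pmf(x) for x in xs)
mean = sum(x * pmf(x) for x in xs)

assert abs(total - 1.0) < 1e-9   # probabilities sum to one
assert abs(mean - r / p) < 1e-6  # E[X] = r/p, equation (2.20)
```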

Poisson distribution
The Poisson distribution, which is presented next, is an important discrete
distribution with numerous applications. Like the other distributions pre-
sented thus far, also the Poisson is related to the concept of Bernoulli trials,
although the connection is less immediate to intuitively appreciate. To that
end, it is helpful to provide first a formal description of the distribution.


The support of the Poisson distribution is the set of nonnegative integers


X = N0 = {0, 1, 2, . . . }; the distribution has one parameter – a nonnegative
real number – which is usually called “intensity” and is denoted by λ ∈ R+ ;
while the notation indicating that a random variable X follows the Poisson
distribution with intensity parameter λ is the following.
X ∼ Pois (λ)
The probability mass function of the Poisson distribution is:

P (X = x; λ) = fX (x; λ) = exp (−λ) · λ^x/x!   (2.22)

its cumulative distribution is:

P (X ≤ x; λ) = FX (x; λ) = exp (−λ) · Σ_{i=0}^{⌊x⌋} λ^i/i!   (2.23)

and by exploiting the Taylor expansion of the exponential function, it is:

lim_{M→∞} FX (M; λ) = exp (−λ) · lim_{M→∞} Σ_{x=0}^{M} λ^x/x! = exp (−λ) · exp (λ) = 1

showing compliance with the definition of probability distribution.
An important and well-known feature of the Poisson distribution is that
it resembles – i.e. it mathematically approximates – a binomial distribution
where n is “large” and p is “small.” To show this, suppose that X ∼ Bn (p, n)
and let λ = np be some fixed number. Consider the limit of the probability
mass function for some X = x as n goes to infinity; for λ fixed, this implies
that p goes to zero, therefore it is more convenient to express the limit solely
in terms of n and λ, as in the following derivation.

lim_{n→∞} fX (x; p, n) = lim_{n→∞} C(n, x) p^x (1 − p)^{n−x}
  = lim_{n→∞} [n!/(x! (n − x)!)] · (λ/n)^x · (1 − λ/n)^{n−x}
  = (λ^x/x!) · lim_{n→∞} [Π_{k=1}^{x} (n − k + 1)/n^x] · (1 − λ/n)^{−x} · (1 − λ/n)^n
  = (λ^x/x!) · exp (−λ)
  = fX (x; λ)

where, in the third line, the first two factors inside the limit converge to 1 while the third converges to exp (−λ).


The approximation is quite good even for moderately large values of n and
moderately small values of p, as shown in the example from Figure 2.1.

[Plot omitted: the two probability mass functions fX (x) over x = 0, 1, . . . , 10.]
Note: binomial probabilities are denoted with solid thin lines, smaller full points; Poisson probabilities are denoted with dashed thicker lines, larger hollow points. All probabilities for x > 10 are negligible.

Figure 2.1: Binomial vs. Poisson comparison with n = 20, p = 0.2, λ = 4
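The closeness displayed in Figure 2.1 is easy to reproduce; the sketch below (illustrative only) compares the two mass functions pointwise for the same parameter values.

```python
import math

n, p = 20, 0.2
lam = n * p  # λ = 4, matching Figure 2.1

def binom_pmf(x):
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

def pois_pmf(x):
    return math.exp(-lam) * lam ** x / math.factorial(x)

# The two mass functions stay close pointwise even for moderate n
gap = max(abs(binom_pmf(x) - pois_pmf(x)) for x in range(11))
assert gap < 0.03
```

With these values the largest pointwise discrepancy (around x = 4) is just over 0.02, consistent with the figure.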

This mathematical relationship provides both intuition and motivation


to support the use of the Poisson distribution as a model for the probability
that a number X of events occurs in a specified (usually small) interval of
time or space. Common examples are the number of phone calls (or emails)
received in a time interval of interest, or the number of sewing imperfections
found upon a uniform section of textile. The hypotheses that underpin the
use of the Poisson distribution in situations of this kind are that:
• the events of interest happen independently, they are all equally likely,
and they cannot overlap (in time, in space, et cetera);
• the larger the interval under examination, the higher the probability
to encounter a single event.
The Taylor expansion of the exponential function is also useful to derive the moment generating function of the Poisson distribution:

MX (t; λ) = lim_{M→∞} Σ_{x=0}^{M} exp (tx) · exp (−λ) · λ^x/x!
          = exp (−λ) · lim_{M→∞} Σ_{x=0}^{M} [λ · exp (t)]^x/x!   (2.24)
          = exp (−λ) · exp (λ · exp (t))
          = exp (λ (exp (t) − 1))

from which one can calculate the mean and variance as follows.
E [X] = λ (2.25)
Var [X] = λ (2.26)


Both moments are equal to one another, as well as to the parameter λ!


This motivates the name “intensity” for the latter: λ represents the average
rate at which the events occur; the more frequent they are, the larger is the
average number of “successes” as well as their dispersion on the support N0 .
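The equality of the mean and the variance can be checked numerically from the mass function; the sketch below (illustrative only; λ = 4 and the truncation point are arbitrary choices) sums (2.22) over a truncation of the support.

```python
import math

lam = 4.0
xs = range(100)  # truncation; the tail beyond is negligible for λ = 4

def pmf(x):
    # Poisson mass function, equation (2.22)
    return math.exp(-lam) * lam ** x / math.factorial(x)

total = sum(pmf(x) for x in xs)
mean = sum(x * pmf(x) for x in xs)
var = sum((x - mean) ** 2 * pmf(x) for x in xs)

assert abs(total - 1.0) < 1e-12
assert abs(mean - lam) < 1e-9
assert abs(var - lam) < 1e-9  # mean and variance coincide at λ
```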

Discrete uniform distribution


There are certainly discrete distributions unrelated to the notion of Bernoulli trials. Perhaps the simplest example of such a distribution is the "discrete uniform" one, where discrete marks a difference with the analogous continuous distribution already encountered in several examples from Lecture 1. This simple distribution has equal probability mass over N mass points:

P (X = x; N) = fX (x; N) = 1/N   (2.27)
where |X| = N . Naturally, the other features of this distribution depend on
the specific points in the support X. At one extreme, these points are the
first N integers: X = {1, 2, . . . , N }, in this case, calculating the cumulative
distribution and the moments of X is a simple exercise; at the other extreme,
these can be irregularly selected points, and those calculations may be non-
trivial and context-dependent.
Between these two extremes, a quite frequent scenario is that where X
contains all the integers that belong to some interval [a, b] of length N − 1:
X = {x ∈ Z : x = a, . . . , b}, b − a = N − 1. In this case, one usually writes:
X ∼ U {a, b}
denoting that the distribution has two parameters, a and b, which together
determine N = b − a + 1. The cumulative distribution is:
P (X ≤ x; a, b) = FX (x; a, b) = [(⌊x⌋ − a + 1)/(b − a + 1)] · 1 [a ≤ x ≤ b] + 1 [b < x]   (2.28)

the moment generating function is:

MX (t; a, b) = [exp (at) − exp ((b + 1) t)]/[N (1 − exp (t))]   (2.29)

while the mean and variance are as follows.

E [X] = (a + b)/2   (2.30)
Var [X] = (N² − 1)/12   (2.31)
In this case, it may be easier to derive the two moments by direct application
of their definition than by using the moment generating function.
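Indeed, direct application of the definition is short enough to fit in a few lines; the sketch below (illustrative only; a and b are arbitrary choices) checks (2.30) and (2.31) by brute force.

```python
a, b = 3, 9  # arbitrary illustrative parameters
support = range(a, b + 1)
N = b - a + 1

# Mean and variance computed directly from the definition, with mass 1/N per point
mean = sum(support) / N
var = sum((x - mean) ** 2 for x in support) / N

assert abs(mean - (a + b) / 2) < 1e-12        # equation (2.30)
assert abs(var - (N ** 2 - 1) / 12) < 1e-12   # equation (2.31)
```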


Hypergeometric distribution
The last discrete distribution considered here is a variation of the binomial
experiment of obtaining x successes out of n attempts of a binary outcome,
but in an environment where the probability of each success is not fixed,
and thus the correspondence with n independent Bernoulli trials cannot be
established. Specifically, the hypergeometric distribution is modeled on the idea of randomly selecting n ∈ N objects out of a population that contains N ∈ N of them in total, K ∈ N of which present a certain binary feature that is called a "success," where n < N and K < N. The two most common concrete representations of this mental experiment are:
• the urn model about the extractions of certain n items (“balls”) from
a container (“urn”) – the items-balls are N in total and K of them
present a certain feature (“color”);
• the concept of sampling without replacement from a finite population
of size N , K of whose elements present a certain feature; this corre-
sponds to the case of a statistical sample which is obtained by ran-
domly drawing n individual units from the population, one at a time,
and excluding them from additional draws once they are selected.
The random variable X that represents the number of “successes” x out
of the n “extractions” is said to follow the hypergeometric distribution, also
written as:
X ∼ H (N, K, n)
which presents three parameters: N , K, and n. Its support X must satisfy
the following four conditions: x ≥ 0, x ≤ n, (these two are obvious) x ≤ K
and n−x ≤ N −K (actual successes and failures cannot exceed the possible
maxima). Combining all these inequalities together, the support is written
as follows.
X = {max (0, n + K − N ) , . . . , min (n, K)}
The hypergeometric probability mass function presents an expression composed of three binomial coefficients:

P (X = x; N, K, n) = fX (x; N, K, n) = C(K, x) · C(N − K, n − x)/C(N, n)   (2.32)

which, however, has an intuitive explanation. There are in total C(N, n) ways to extract a sample of size n out of a population of size N; under the hypotheses


underpinning the hypergeometric distribution, each of these has an identical


probability to occur. Therefore, in order to calculate the probability that x
successes are achieved, one must multiply the total number of combinations
that x successes are obtained out of the K units with that defining feature,
times the number of ways that n − x failures realize out of the residual units
N − K lacking that feature. This is what is expressed in (2.32).
Yet manipulating (2.32) is complicated and it requires familiarity with combinatorics. The cumulative distribution function is obtained by appropriately summing over successive elements of the support and can be expressed in terms of the hypergeometric function, which gives its name to the distribution itself. Furthermore, it can be shown that this function equals one upon summing over all the elements of X. Likewise, the moment generating function is also expressed in terms of the hypergeometric function. The mean and the variance are best obtained through direct application of the definition; for example:

E [X] = Σ_{x∈X} x · C(K, x) C(N − K, n − x)/C(N, n)
      = Σ_{x∈X} K · C(K − 1, x − 1) C(N − K, n − x)/[(N/n) C(N − 1, n − 1)]
      = n (K/N) · Σ_{y∈Y} C(K − 1, y) C((N − 1) − (K − 1), n − 1 − y)/C(N − 1, n − 1)   (2.33)
      = n K/N

where the summation in the third line equals one because it corresponds to the total probability mass of another hypergeometric distribution with parameters (N − 1, K − 1, n − 1), with y = x − 1 and where Y is the "rescaled" support of Y = X − 1. Intuitively, the mean of X is proportional to the ratio of potential successes in the population, K/N, multiplied by the number of attempts n. To calculate the variance:

Var [X] = n (K/N) · [(N − K)/N] · [(N − n)/(N − 1)]   (2.34)

it is convenient to compute the second uncentered moment E [X²] first.
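Derivation (2.33) can be checked numerically by summing the mass function over its support, as in the sketch below (illustrative only; N, K, and n are arbitrary choices).

```python
import math

N, K, n = 50, 20, 10  # arbitrary illustrative parameters

lo = max(0, n + K - N)
hi = min(n, K)

def pmf(x):
    # Hypergeometric mass function, equation (2.32)
    return math.comb(K, x) * math.comb(N - K, n - x) / math.comb(N, n)

total = sum(pmf(x) for x in range(lo, hi + 1))
mean = sum(x * pmf(x) for x in range(lo, hi + 1))

assert abs(total - 1.0) < 1e-12      # probabilities sum to one
assert abs(mean - n * K / N) < 1e-9  # E[X] = nK/N, equation (2.33)
```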


2.2 Continuous Distributions I


The treatment of continuous distributions starts in this section with the general, parametric version of the normal distribution, and proceeds – still in this section – with other distributions that are similar to the normal, in the sense that they all present two key parameters: a location parameter and a scale parameter. These two parameters define two key features of each distribution: the location parameter characterizes the relative position of the distribution on R (intuitively, the area where most of the probability is found), while the scale parameter defines the actual form of the density function and the overall probability dispersion around the mean (which may or may not be linked to specific moments of the distribution). These concepts are worthy of a formal treatment.
Definition 2.1. Location and scale families. Let fZ (z) be a probability density function associated with some random variable Z. For any µ ∈ R and any σ ∈ R++, the family of probability density functions of the form

fX (x) = (1/σ) fZ ((x − µ)/σ)

for a generic random variable X is called the location-scale family with standard probability density function fZ (z); µ is called the location parameter while σ is called the scale parameter.
The distributions belonging to a location-scale family are inextricably
connected, in the sense expressed by the following theorem.
Theorem 2.1. Standardization of densities. Let f (·) be any probability density function, µ ∈ R and σ ∈ R++. Then, a random variable X follows a probability distribution with density function:

fX (x) = (1/σ) f ((x − µ)/σ)

if and only if there exists a continuous random variable Z whose probability density function is fZ (z) = f (z) and X = σZ + µ.

Proof. Necessity is shown by noting that X = g (Z) = σZ + µ, where g (·) is a monotone transformation with g⁻¹ (x) = (x − µ)/σ and (d/dx) g⁻¹ (x) = σ⁻¹, and thus the conditions of Theorem 1.11 apply. Sufficiency is shown through the converse exercise: define Z = g (X) = (X − µ)/σ, again a monotone transformation, with g⁻¹ (z) = σz + µ and (d/dz) g⁻¹ (z) = σ, and again one can extend the theorem for monotone transformations of density functions.


Some very useful implications of this result are that:

• a standard density function (with associated distribution) exists for every location-scale family;

• given that the distributions in a location-scale family are all linked through a linear transformation, their mean, variance, other moments and moment generating functions are related via simple functions;

• all probabilities specific to a distribution from a location-scale family can be expressed with reference to the standard distribution.

P (a ≤ X ≤ b) = P [(a − µ)/σ ≤ (X − µ)/σ ≤ (b − µ)/σ] = P [(a − µ)/σ ≤ Z ≤ (b − µ)/σ]
The usefulness of these facts is best understood when put in context.
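For instance, any normal probability can be computed from the standard normal cumulative distribution alone. The sketch below (illustrative only; the location, scale, and interval are arbitrary choices) standardizes an interval probability and cross-checks it against a direct numerical integration of the corresponding density.

```python
import math

def Phi(z):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu, sigma = 2.0, 3.0  # arbitrary location and scale
a, b = -1.0, 5.0

# P(a <= X <= b) for X ~ N(mu, sigma^2), computed through standardization
prob = Phi((b - mu) / sigma) - Phi((a - mu) / sigma)

# Cross-check against a midpoint Riemann sum of the N(mu, sigma^2) density
def density(x):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

m = 20_000
width = (b - a) / m
riemann = sum(density(a + (i + 0.5) * width) for i in range(m)) * width
assert abs(prob - riemann) < 1e-8
```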

Normal distribution
The all-important location-scale family of normal distributions, sometimes called Gaussian, includes all continuous distributions with support on the whole of R, density function of the form

fX (x; µ, σ²) = [1/√(2πσ²)] exp (−(x − µ)²/(2σ²))   (2.35)

and cumulative distribution obtained by appropriate integration as follows.

FX (x; µ, σ²) = ∫_{−∞}^{x} [1/√(2πσ²)] exp (−(t − µ)²/(2σ²)) dt   (2.36)

These distributions present two parameters, µ for location and σ for scale, although the latter is typically replaced with its square σ² for an immediate interpretation as variance (see below). The expression

X ∼ N (µ, σ²)


indicates that X follows the normal distribution with the specified parame-
ters. In the case of the standard normal distribution from Examples 1.7 and
1.9, with density φ (x) and cumulative distribution Φ (x), these parameters
are equal to 0 and 1 respectively, hence the above writes:
X ∼ N (0, 1)
although X is often replaced with Z for clearer indication of standardization
– e.g. Z ∼ N (0, 1).


Figure 2.2 shows three examples of normal density functions. The continuous line represents the familiar standard version. The dashed line displays a density with µ = 2 and σ² = 1: note how an increase of the location parameter produces a "shift to the right" of the distribution (if the location parameter is decreased, one would obtain a "shift to the left"). The dotted line depicts a density with µ = 0 and σ² = 4: while still centered at zero, an increase in the scale parameter produces a more "flattened out," dispersed distribution (conversely, one would obtain a more concentrated distribution if the scale parameter is decreased).

[Plot omitted: the three normal densities just described (standard; µ = 2, σ² = 1; µ = 0, σ² = 4) over x ∈ [−5, 7].]

Figure 2.2: Different normal probability densities

Occasionally, the normal distribution is expressed through the following alternative parametrization:

fX (x; µ, φ²) = √(φ²/(2π)) exp (−φ² (x − µ)²/2)   (2.37)

where φ² = σ⁻² is called the precision parameter. In Figure 2.2, the dotted density has φ² = 0.25.
Showing that the density function of a normal distribution integrates to
one is not an immediate task; thanks to Theorem 2.1, it is enough to show
that the result holds in the standardized case, that is:
\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{z^2}{2} \right) dz = 1

or, exploiting symmetry around zero, that the following holds.

\int_{0}^{\infty} \exp\left( -\frac{z^2}{2} \right) dz = \frac{\sqrt{2\pi}}{2} = \sqrt{\frac{\pi}{2}}    (2.38)


This is accomplished by squaring the left-hand side of (2.38), manipulating it, and showing that it equals π/2:

\left( \int_{0}^{\infty} \exp\left( -\frac{z^2}{2} \right) dz \right)^{2}
    = \left( \int_{0}^{\infty} \exp\left( -\frac{t^2}{2} \right) dt \right) \left( \int_{0}^{\infty} \exp\left( -\frac{u^2}{2} \right) du \right)
    = \int_{0}^{\infty} \int_{0}^{\infty} \exp\left( -\frac{t^2 + u^2}{2} \right) dt \, du
    = \int_{0}^{\infty} \int_{0}^{\pi/2} r \exp\left( -\frac{r^2}{2} \right) d\theta \, dr
    = \frac{\pi}{2} \int_{0}^{\infty} r \exp\left( -\frac{r^2}{2} \right) dr
    = \frac{\pi}{2} \left[ -\exp\left( -\frac{r^2}{2} \right) \right]_{0}^{\infty}
    = \frac{\pi}{2}

where the third line implements a change of variables in polar coordinates, t = r · cos(θ) and u = r · sin(θ).
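As a quick numerical sanity check of (2.38), a simple midpoint rule can approximate the integral and compare it against √(π/2); this is an illustrative sketch (the truncation point and step count are arbitrary choices):

```python
import math

def gaussian_half_integral(upper=10.0, steps=200_000):
    """Midpoint-rule approximation of the integral of exp(-z^2 / 2) on (0, upper).

    The integrand decays so fast that truncating at upper = 10 is harmless.
    """
    h = upper / steps
    total = 0.0
    for i in range(steps):
        z = (i + 0.5) * h  # midpoint of the i-th subinterval
        total += math.exp(-z * z / 2.0)
    return total * h

approx = gaussian_half_integral()
exact = math.sqrt(math.pi / 2.0)  # the right-hand side of (2.38)
print(approx, exact)  # the two values agree to several decimal places
```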
The derivation of the moment generating function is fortunately much easier. It is again best to work with the standardized random variable Z:

M_Z(t) = \int_{-\infty}^{+\infty} \exp(tz) \, \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{z^2}{2} \right) dz
       = \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{z^2 - 2zt + t^2 - t^2}{2} \right) dz
       = \exp\left( \frac{t^2}{2} \right) \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{(z - t)^2}{2} \right) dz
       = \exp\left( \frac{t^2}{2} \right)
where the integral in the third line equals one, because it corresponds to the density function of a normal distribution with µ = t and σ² = 1 integrated over its entire support. By the properties of moment generating functions, for any normally distributed random variable obtained as X = σZ + µ:

M_X(t; \mu, \sigma^2) = \exp\left( \mu t + \frac{\sigma^2 t^2}{2} \right)    (2.39)
which shows that:
E[X] = µ    (2.40)
Var[X] = σ²    (2.41)


that is, the two parameters of the normal distribution have an immediate
interpretation in terms of fundamental moments, a quite convenient fact!
In addition, it can be shown that:

Skew[X] = 0    (2.42)
Kurt[X] = 3    (2.43)

that is, all normal distributions have zero skewness (they are all perfectly
symmetric) and kurtosis equal to three, which makes this number a refer-
ence value for evaluating the kurtosis of other distributions.2
The normal distribution has ubiquitous applications that are motivated
by its numerous relationships with other probability distributions and, most
importantly, by the asymptotic result known as Central Limit Theorem, in
its various versions (Lecture 6). For these reasons, the normal distribution
is central in statistical inference as well as in econometric analysis.
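To make (2.40)–(2.43) concrete, a small simulation can estimate these moments for X = σZ + µ; the snippet below is an illustrative sketch, with arbitrary parameter values, sample size, and seed:

```python
import random

random.seed(42)
mu, sigma = 2.0, 3.0
sample = [random.gauss(mu, sigma) for _ in range(200_000)]

n = len(sample)
mean = sum(sample) / n
var = sum((x - mean) ** 2 for x in sample) / n
skew = sum((x - mean) ** 3 for x in sample) / n / var ** 1.5
kurt = sum((x - mean) ** 4 for x in sample) / n / var ** 2

# The four estimates should be close to mu = 2, sigma^2 = 9, 0, and 3.
print(mean, var, skew, kurt)
```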

Lognormal distribution
The lognormal distribution has the following probability density function,
for y ∈ R++ , µ ∈ R, and σ2 ∈ R++ .
f_Y(y; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \, \frac{1}{y} \exp\left( -\frac{(\log y - \mu)^2}{2\sigma^2} \right)    (2.44)

Observation 2.2. By Theorem 1.11, one can observe that the lognormal
distribution is evidently obtained through the transformation Y = exp (X),
where X ∼ N (µ, σ2 ). Conversely, X = log (Y ): the logarithm of a random
variable Y which follows the lognormal distribution is normally distributed,
hence the former distribution’s name.
The cumulative distribution of the lognormal distribution is:
F_Y(y; \mu, \sigma^2) = \int_{0}^{y} \frac{1}{\sqrt{2\pi\sigma^2}} \, \frac{1}{t} \exp\left( -\frac{(\log t - \mu)^2}{2\sigma^2} \right) dt    (2.45)

and the density must clearly integrate to one since the normal distribution
does. In order to specify that some random variable Y follows the lognormal
distribution, the most convenient way is surely as follows.

log (Y ) ∼ N (µ, σ2 )
2. In fact, the fourth standardized central moment of some random variable X is often expressed in terms of “excess kurtosis” Kurt[X] − 3, that is, as the difference between the kurtosis of X and that of the normal distribution.


Figure 2.3 displays examples of lognormal density functions that are analogous to those from Figure 2.2, highlighting how lognormal densities can assume different shapes, always asymmetric to some degree; for example, the standard version is recognizable by its characteristic hump. These characteristics make the lognormal distribution well suited to describe phenomena that only take positive values and that are characterized by apparent “inequality,” such as the income distribution or that of firm size.

[Figure 2.3: Different lognormal probability densities (standard; µ = 2; σ² = 4)]

The lognormal distribution is one for which the moment generating function does not exist (the integral that defines it diverges), and even the characteristic function takes a very complex expression. Fortunately, uncentered moments can be calculated easily:

E[Y^r] = E[(\exp(X))^r]
       = E[\exp(rX)]
       = \exp\left( \mu r + \frac{\sigma^2 r^2}{2} \right)

because the second line corresponds to the definition of the moment generating function of X ∼ N(µ, σ²) evaluated at t = r. The mean and variance, for example, are obtained as follows.

E[Y] = \exp\left( \mu + \frac{\sigma^2}{2} \right)    (2.46)

Var[Y] = \left[ \exp(\sigma^2) - 1 \right] \exp(2\mu + \sigma^2)    (2.47)

The lognormal distribution is always asymmetric: it has a positive skewness, which is an increasing function of σ².

Skew[Y] = \left[ \exp(\sigma^2) + 2 \right] \sqrt{\exp(\sigma^2) - 1} > 0    (2.48)

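The moment formulas (2.46)–(2.47) are easy to check by simulation; the sketch below uses Python's standard-library lognormal sampler, which is parametrized by the mean and standard deviation of the underlying normal (the parameter values and seed are arbitrary):

```python
import math
import random

random.seed(7)
mu, sigma = 0.5, 0.4
sample = [random.lognormvariate(mu, sigma) for _ in range(300_000)]

n = len(sample)
mean = sum(y for y in sample) / n
var = sum((y - mean) ** 2 for y in sample) / n

mean_theory = math.exp(mu + sigma ** 2 / 2)                              # equation (2.46)
var_theory = (math.exp(sigma ** 2) - 1) * math.exp(2 * mu + sigma ** 2)  # equation (2.47)
print(mean, mean_theory)
print(var, var_theory)
```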

Logistic distribution
Even the standard logistic distribution introduced with Examples 1.7 and
1.9 has a full-fledged location-scale family. With support X = R, a location
parameter µ ∈ R, and a scale parameter σ ∈ R++ , the probability density
function of a generic logistic distribution is written as follows.

f_X(x; \mu, \sigma) = \frac{1}{\sigma} \exp\left( -\frac{x - \mu}{\sigma} \right) \left[ 1 + \exp\left( -\frac{x - \mu}{\sigma} \right) \right]^{-2}    (2.49)
Also the logistic cumulative distribution has a closed form expression, which
can be obtained by an exercise in integrating the density above.

F_X(x; \mu, \sigma) = \left[ 1 + \exp\left( -\frac{x - \mu}{\sigma} \right) \right]^{-1}    (2.50)
making it obvious that limx→∞ FX (x; µ, σ) = 1. Expression (2.50) is easy
to manipulate and invert; consequently the logistic distribution has a simple
expression for its quantile function.

Q_X(p; \mu, \sigma) = \mu + \sigma \log\left( \frac{p}{1 - p} \right)   for p ∈ (0, 1)    (2.51)

[Figure 2.4: Different logistic probability densities (standard; µ = 2; σ = 2)]

As shown above in Figure 2.4, and as already observed with reference to


the respective standardized distributions, the logistic density function has a
symmetric bell shape similar to the normal’s. However, if the two distribu-
tions have equal variances the logistic is “thinner” than the normal around
the mean, and has “thicker” outer tails. A random variable X following the
logistic distribution can be denoted with the following notation.
X ∼ Logistic (µ, σ)


The moment generating function of the standard logistic distribution is obtained via a change of variable in the integral that defines the function, where the function being applied is the cumulative distribution itself:

u = \frac{1}{1 + \exp(-z)}

This change of variable restricts the range of integration to (0, 1) and implies that:

\exp(tz) = \left( \frac{1}{\exp(-z)} \right)^{t} = \left( \frac{u}{1 - u} \right)^{t} = u^{t} (1 - u)^{-t}

and that the differentials and the density function are related as follows.

du = \frac{\exp(-z)}{(1 + \exp(-z))^2} \, dz

All this implies that:

M_Z(t) = \int_{-\infty}^{\infty} \exp(tz) \frac{\exp(-z)}{(1 + \exp(-z))^2} \, dz
       = \int_{0}^{1} u^{t} (1 - u)^{-t} \, du
       = B(1 + t, 1 - t)

as the second line is recognized as a particular case of the Beta function, an important mathematical function with two arguments a > 0 and b > 0.

B(a, b) \equiv \int_{0}^{1} u^{a - 1} (1 - u)^{b - 1} \, du

Clearly, here the correspondence is obtained for a = 1 + t and b = 1 − t. The Beta function is notably symmetric in its two arguments, thus the moment generating function of the standard logistic is also (perhaps more commonly) written as:

M_Z(t) = \int_{0}^{1} u^{-t} (1 - u)^{t} \, du = B(1 - t, 1 + t)

and since the Beta function is defined only for positive values of a and b, the moment generating function is only defined for t ∈ (−1, 1).
For a general logistic distribution such that X = σZ + µ, by the properties of moment generating functions the one of X is:

M_X(t; \mu, \sigma) = \exp(\mu t) \int_{0}^{1} u^{-\sigma t} (1 - u)^{\sigma t} \, du    (2.52)


or, in terms of the Beta function:

M_X(t; \mu, \sigma) = \exp(\mu t) \cdot B(1 - \sigma t, 1 + \sigma t)

and its domain is thus restricted to t ∈ (−σ⁻¹, σ⁻¹).
By the properties of the Beta function, the mean and the variance of a
generic logistic distribution are obtained as follows.
E[X] = µ    (2.53)

Var[X] = \frac{\sigma^2 \pi^2}{3}    (2.54)
Observe that unlike the standard normal distribution, the standard logistic distribution with µ = 0 and σ = 1 has variance greater than one. In order to normalize the variance so that it equals 1, the reparametrization:

\sigma^{*} = \frac{\sqrt{3}}{\pi} \, \sigma
is occasionally employed. One can also demonstrate that:
Skew[X] = 0    (2.55)

Kurt[X] = \frac{21}{5}    (2.56)
showing that the logistic distribution always has a kurtosis higher than the normal’s, a fact that matches some previous observations about the shape of the two distributions. Thanks to its similarity with the normal, the logistic distribution is typically used as an approximation of the former when easy analytical manipulation of the cumulative distribution is necessary.
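A short sketch (with arbitrary evaluation points, seed, and sample size) can verify both that the quantile function (2.51) inverts the cumulative distribution (2.50) and that inverse-transform draws reproduce the variance in (2.54):

```python
import math
import random

def logistic_cdf(x, mu=0.0, sigma=1.0):
    # Equation (2.50)
    return 1.0 / (1.0 + math.exp(-(x - mu) / sigma))

def logistic_quantile(p, mu=0.0, sigma=1.0):
    # Equation (2.51)
    return mu + sigma * math.log(p / (1.0 - p))

# The quantile function inverts the CDF.
roundtrip = logistic_cdf(logistic_quantile(0.73, 2.0, 2.0), 2.0, 2.0)
print(roundtrip)  # 0.73 up to floating-point error

# Inverse-transform sampling: plugging U(0,1) draws into the quantile
# function yields standard logistic draws, whose variance should be
# close to pi^2 / 3.
random.seed(1)
draws = [logistic_quantile(random.random()) for _ in range(200_000)]
mean = sum(draws) / len(draws)
var = sum((x - mean) ** 2 for x in draws) / len(draws)
print(var, math.pi ** 2 / 3)
```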

Cauchy distribution
The Cauchy distribution is an interesting “pathological” case of a location-
scale family of symmetric, bell-shaped distributions having X = R as their
support. Its density function writes, given a location parameter µ ∈ R and a scale parameter σ ∈ R++, as:

f_X(x; \mu, \sigma) = \frac{1}{\pi\sigma} \left[ 1 + \left( \frac{x - \mu}{\sigma} \right)^{2} \right]^{-1}    (2.57)

while the cumulative distribution has the following closed form expression, displaying the necessary property lim_{x→∞} F_X(x; µ, σ) = 1.

F_X(x; \mu, \sigma) = \frac{1}{\pi} \arctan\left( \frac{x - \mu}{\sigma} \right) + \frac{1}{2}    (2.58)


Since the above is an invertible function, also the quantile function of the
Cauchy distribution has a simple closed form expression, for p ∈ (0, 1).

Q_X(p; \mu, \sigma) = \mu + \sigma \tan\left( \pi \left( p - \frac{1}{2} \right) \right)    (2.59)
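Since both (2.58) and (2.59) are in closed form, it is easy to check numerically that one inverts the other; this is an illustrative sketch with arbitrary parameter values:

```python
import math

def cauchy_cdf(x, mu, sigma):
    # Equation (2.58)
    return math.atan((x - mu) / sigma) / math.pi + 0.5

def cauchy_quantile(p, mu, sigma):
    # Equation (2.59)
    return mu + sigma * math.tan(math.pi * (p - 0.5))

mu, sigma = 2.0, 2.0
for p in (0.1, 0.25, 0.5, 0.9):
    # Round-tripping through the quantile function recovers p.
    assert abs(cauchy_cdf(cauchy_quantile(p, mu, sigma), mu, sigma) - p) < 1e-9
print("quantile function inverts the CDF")
```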

A random variable X which is distributed according to the Cauchy distri-


bution is denoted as follows.

X ∼ Cauchy (µ, σ)

The Cauchy probability density functions displayed in Figure 2.5 below appear not too dissimilar from the shapes of the normal and logistic distributions; however, the Cauchy is an interesting pathological case because its moment generating function, and its moments themselves, do not exist, in the sense that the integrals that characterize them have no defined solution.

[Figure 2.5: Different Cauchy probability densities (standard; µ = 2; σ = 2)]

To show this property, one typically works with the standardized case Z = (X − µ)/σ, whose expectation is rewritten as follows:

E[Z] = \int_{-\infty}^{0} \frac{1}{\pi} \frac{z}{1 + z^2} \, dz + \int_{0}^{+\infty} \frac{1}{\pi} \frac{z}{1 + z^2} \, dz

that is, the integral that defines the mean is split into two parts that, since the distribution is symmetric around z = 0, must have opposite signs but equal absolute value. Consider the integral defined on R++; note that:

\int_{0}^{+\infty} \frac{z}{1 + z^2} \, dz = \lim_{M \to \infty} \left[ \frac{\log(1 + z^2)}{2} \right]_{0}^{M} = \infty


the integral does not converge and lacks a finite solution. Because the same
applies to the other half of the partition above, the value of E [Z] remains
undefined. Similar arguments also apply to the non-standardized versions of
the distribution, the higher moments, and the moment generating function.
However, the Cauchy distribution – like all distributions – always has a
characteristic function, which can be shown to be the following.
ϕX (t; µ, σ) = exp (iµt − σ |t|) (2.60)
Note that this particular characteristic function is not differentiable at t = 0,
thus it cannot help derive the moments of the distribution. As it lacks the
moments, the Cauchy distribution has limited practical applications; it is
however notable for its links – to be illustrated later – with distributions of
more practical use such as the normal and the Student’s t-distribution.

Laplace distribution
The last location-scale family discussed here is that of the Laplace distribution, characterized by support X = R and the following density function.

f_X(x; \mu, \sigma) = \frac{1}{2\sigma} \exp\left( -\frac{|x - \mu|}{\sigma} \right)    (2.61)
Note that this density function is continuous but not differentiable at x = µ; despite this kink, the density is easy to integrate on the two complementary subsets of R that are split at x = µ. Hence, the cumulative distribution function has two distinct closed form expressions:

F_X(x; \mu, \sigma) = \begin{cases} \frac{1}{2} \exp\left( \frac{x - \mu}{\sigma} \right) & \text{if } x < \mu \\ 1 - \frac{1}{2} \exp\left( -\frac{x - \mu}{\sigma} \right) & \text{if } x \geq \mu \end{cases}    (2.62)

and is notably continuous at x = µ. This expression clearly conforms to the properties of a cumulative distribution function, and the corresponding density integrates to 1 over R. Since the distribution is symmetric, the quantile for p = 0.5 corresponds to the mean; hence the quantile function is also split in two expressions for values of p that are either below or above p = 0.5.

Q_X(p; \mu, \sigma) = \begin{cases} \mu + \sigma \log(2p) & \text{if } p \in \left( 0, \frac{1}{2} \right] \\ \mu - \sigma \log(2 - 2p) & \text{if } p \in \left( \frac{1}{2}, 1 \right) \end{cases}    (2.63)

Should a random variable X follow the Laplace distribution, this is usually


expressed with the following notation.
X ∼ Laplace (µ, σ)


[Figure 2.6: Different Laplace probability densities (standard; µ = 2; σ = 2)]

The three Laplace densities displayed in Figure 2.6 all take the typical
“tent shape” that characterizes this distribution. Unlike in the Cauchy case,
this distribution has well-defined moments; to see this, it is easiest to start
from calculating the moment generating function in the standardized case:
M_Z(t) = \int_{-\infty}^{+\infty} \frac{1}{2} \exp(tz - |z|) \, dz
       = \frac{1}{2} \int_{-\infty}^{0} \exp((1 + t) z) \, dz + \frac{1}{2} \int_{0}^{+\infty} \exp(-(1 - t) z) \, dz
       = \frac{1}{2} \left( \frac{1}{1 + t} + \frac{1}{1 - t} \right)
       = \frac{1}{1 - t^2}
which is only defined for |t| < 1 (or else the two integrals in the second line
diverge). By the properties of the moment generating functions for linearly
transformed random variables, it is easy to see that in the general case:
M_X(t; \mu, \sigma) = \frac{\exp(\mu t)}{1 - \sigma^2 t^2}    (2.64)
where, for analogous reasons, this moment generating function is defined
only for |t| < σ−1 . By (2.64) it is possible to obtain all the other moments;
the mean and variance for example are as follows.
E[X] = µ    (2.65)
Var[X] = 2σ²    (2.66)
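The quantile function (2.63) makes inverse-transform sampling straightforward; the sketch below (arbitrary parameters and seed) checks the mean and variance in (2.65)–(2.66):

```python
import math
import random

def laplace_quantile(p, mu=0.0, sigma=1.0):
    # Equation (2.63), split at p = 1/2
    if p <= 0.5:
        return mu + sigma * math.log(2.0 * p)
    return mu - sigma * math.log(2.0 - 2.0 * p)

random.seed(3)
mu, sigma = 2.0, 1.5
draws = [laplace_quantile(random.random(), mu, sigma) for _ in range(200_000)]
mean = sum(draws) / len(draws)
var = sum((x - mean) ** 2 for x in draws) / len(draws)
print(mean, var)  # should approach mu = 2 and 2 * sigma^2 = 4.5
```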
The Laplace distribution has a limited number of applications (it can be
used, for example, to model the growth rates of firms) but is mostly known


for its relationships with other distributions – in particular, the exponential distribution. In anticipation of a discussion developed in the next section: if a random variable X follows the Laplace distribution, the transformation Y = |X − µ| (which is obtained by “mirroring” X around its mean) follows the exponential distribution, a fact that has earned the Laplace distribution its alternative name of double exponential distribution.

2.3 Continuous Distributions II


The location-scale families of continuous distributions certainly do not exhaust the set of relevant continuous probability distributions. The distributions that follow next in the discussion are all continuous and do not generally fall into that category, yet they have theoretical or practical importance.
Some of these distributions have a special place in the theory and practice
of statistical inference. In the ensuing discussion, some emphasis is placed
on those relationships between distributions that can be described in terms
of univariate transformations. Other relationships, including some that are
especially relevant for statistical inference, require the development of con-
cepts such as those of joint probability distributions, independent random
variables, random samples, and convergence in distribution; for this reason,
the illustration of these relationships is postponed to later lectures.

Continuous uniform distribution


Lecture 1 already provides ample discussion of the uniform distribution on
the unit interval, including its moments and moment generating function.
Similarly to its discrete analogue, the continuous uniform distribution can
be generalized to have support on any segment X = [a, b] of the real line,
where the extremes a and b of the interval are expressed as two parameters.
The general form of the density function is:

f_X(x; a, b) = \frac{1}{b - a} \cdot 1[x \in [a, b]]    (2.67)

while that of the cumulative distribution is as follows.

F_X(x; a, b) = \frac{x - a}{b - a} \cdot 1[x \in (a, b)] + 1[x \in [b, \infty)]    (2.68)
Obviously, the two parameters a and b (where b > a) characterize both
the position and the overall “spread” of the distribution; for this reason, the
family of uniform distributions is occasionally classified among the location-
scale families. However, the two parameters a and b here play a symmetric


role; neither of them is more characteristic of the location or the scale of the
distribution (like µ and σ do for the distributions examined previously). If
a random variable X follows the uniform distribution on the [a, b] interval,
one usually writes:
X ∼ U (a, b)
where X ∼ U(0, 1) is but a special case. By generalizing the examples given in the previous lecture, it is straightforward to verify that the moment generating function of a generic uniform distribution is:

M_X(t; a, b) = \begin{cases} \frac{1}{t(b - a)} \left[ \exp(bt) - \exp(at) \right] & \text{if } t \neq 0 \\ 1 & \text{if } t = 0 \end{cases}    (2.69)
while the mean and variance are as follows.

E[X] = \frac{a + b}{2}    (2.70)

Var[X] = \frac{(b - a)^2}{12}    (2.71)
Depending on the context of interest, all uniform distributions can be alter-
natively defined on the open interval (a, b); the analysis of the distribution
is largely unaffected whether the support is an open or a closed set.
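A quick simulation sketch (arbitrary endpoints, seed, and sample size) confirms (2.70) and (2.71):

```python
import random

random.seed(11)
a, b = -1.0, 3.0
draws = [random.uniform(a, b) for _ in range(200_000)]
mean = sum(draws) / len(draws)
var = sum((x - mean) ** 2 for x in draws) / len(draws)
print(mean, (a + b) / 2)       # sample mean vs. (a + b) / 2 = 1, eq. (2.70)
print(var, (b - a) ** 2 / 12)  # sample variance vs. (b - a)^2 / 12, eq. (2.71)
```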

Beta distribution
With the expression Beta distributions one usually refers to a family of dis-
tributions that, like the uniform distribution, have support on a segment of
the real line, but unlike the uniform distribution, can take varying shapes.
The starting point in their description is the standard family of Beta dis-
tributions with support on the unit interval, X = (0, 1). These particular
distributions are defined by two positive parameters α ∈ R++ and β ∈ R++
that jointly define the shape of the density function, which reads:
f_X(x; \alpha, \beta) = \frac{x^{\alpha - 1} (1 - x)^{\beta - 1}}{\int_{0}^{1} u^{\alpha - 1} (1 - u)^{\beta - 1} du} = \frac{x^{\alpha - 1} (1 - x)^{\beta - 1}}{B(\alpha, \beta)}   for x ∈ (0, 1)    (2.72)
where B (α, β) is a Beta function with the parameters α and β as arguments
and serves as a normalization constant to ensure that the density integrates
to one on the unit interval. The Beta function is also related to the so-called
Gamma function Γ(γ), a function with one argument γ ∈ R++:³

\Gamma(\gamma) = \int_{0}^{\infty} u^{\gamma - 1} \exp(-u) \, du

3. Specifically, it can be shown that B(a, b) = Γ(a)Γ(b)/Γ(a + b).


and therefore (2.72) can be alternatively written as follows.

f_X(x; \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \, x^{\alpha - 1} (1 - x)^{\beta - 1}   for x ∈ (0, 1)    (2.73)
The cumulative distribution is:

F_X(x; \alpha, \beta) = \frac{\int_{0}^{x} t^{\alpha - 1} (1 - t)^{\beta - 1} dt}{\int_{0}^{1} u^{\alpha - 1} (1 - u)^{\beta - 1} du} = \frac{B(x; \alpha, \beta)}{B(\alpha, \beta)}    (2.74)

where B(x; a, b) ≡ \int_{0}^{x} t^{a - 1} (1 - t)^{b - 1} dt is the so-called lower incomplete Beta function; it is thus obvious that F_X(1; α, β) = 1, that is, the density integrates to one over the (0, 1) interval.
X ∼ Beta (α, β)
It must be observed that for many Beta distributions, the support can be
defined in terms of the closed interval [0, 1] without affecting the analysis.
Observation 2.3. X ∼ Beta (1, 1) is equivalent to X ∼ U (0, 1), that is,
the uniform distribution on the unit interval is a special case of the standard
Beta distribution, for parameters α = 1 and β = 1.

[Figure 2.7: Different Beta probability densities (α = 2, β = 2; α = .5, β = .5; α = 2, β = 5; α = 5, β = 2)]

Figure 2.7 displays some examples of the standard Beta distribution that are different from the uniform case. The shapes assumed by the different density functions in the figure are wildly different, and more can be obtained with different configurations of the parameters. It is easy to calculate the moments of this distribution directly. Observe that, because of (2.72):

E[X^r] = \frac{1}{B(\alpha, \beta)} \int_{0}^{1} x^{r + \alpha - 1} (1 - x)^{\beta - 1} dx = \frac{B(r + \alpha, \beta)}{B(\alpha, \beta)}


which can be alternatively expressed in terms of the Gamma function.

E[X^r] = \frac{\Gamma(r + \alpha) \, \Gamma(\alpha + \beta)}{\Gamma(r + \alpha + \beta) \, \Gamma(\alpha)}
Thanks to the property of the Gamma function that Γ(γ + 1) = γΓ(γ), one can show that:

E[X] = \frac{\alpha}{\alpha + \beta}    (2.75)

Var[X] = \frac{\alpha\beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}    (2.76)

and that the distribution is asymmetric if α ≠ β. This is actually the easier way to proceed for calculating the moments, since the moment generating function of the standard Beta distribution is particularly involved.
M_X(t; \alpha, \beta) = 1 + \sum_{q=1}^{\infty} \left( \prod_{k=0}^{q-1} \frac{\alpha + k}{\alpha + \beta + k} \right) \frac{t^q}{q!}    (2.77)
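The moment formula E[X^r] = B(r + α, β)/B(α, β) is easy to verify numerically with the standard library's Gamma function; the parameter values below are arbitrary illustrative choices:

```python
import math

def beta_fn(a, b):
    # B(a, b) = Gamma(a) * Gamma(b) / Gamma(a + b)
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

def beta_raw_moment(r, alpha, beta):
    # E[X^r] = B(r + alpha, beta) / B(alpha, beta)
    return beta_fn(r + alpha, beta) / beta_fn(alpha, beta)

alpha, beta = 2.0, 5.0
mean = beta_raw_moment(1, alpha, beta)
var = beta_raw_moment(2, alpha, beta) - mean ** 2
print(mean, alpha / (alpha + beta))  # both 2/7, equation (2.75)
print(var, alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1)))  # equation (2.76)
```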
The standard family of Beta distributions can be easily generalized to different segments in R. The nonstandard family of Beta distributions has support X = (a, b) – or in many cases, X = [a, b] – and the following density function.

f_X(x; \alpha, \beta, a, b) = \frac{(x - a)^{\alpha - 1} (b - x)^{\beta - 1}}{B(\alpha, \beta) \cdot (b - a)^{\alpha + \beta - 1}}   for x ∈ (a, b)    (2.78)
The analysis of this distribution proceeds similarly as in the standard case;
when α = β = 1, a nonstandard Beta distribution coincides with a uniform
distribution on the (a, b) interval. The Beta distribution is a useful one for
modeling the probability of events that are defined on a bounded interval
of the real line but are not uniform. Thanks to its relationships with other
distributions, the Beta distribution has a number of other applications in
statistical inference, one of which is mentioned later in Lecture 5.

Exponential distribution
Like the uniform distribution, even the exponential distribution has already
been introduced in Lecture 1 through a special case, defined as that with
“unit parameter.” The larger family of exponential distributions takes its
support on the set of nonnegative real numbers X = R+ , and it allows for
different values of the parameter λ ∈ R++ (where λ = 1 is clearly the “unit
parameter” case). The probability density function reads generally as:
f_X(x; \lambda) = \frac{1}{\lambda} \exp\left( -\frac{x}{\lambda} \right)   for x ≥ 0    (2.79)


while the cumulative distribution function is:

F_X(x; \lambda) = 1 - \exp\left( -\frac{x}{\lambda} \right)    (2.80)

which obviously tends to one as x → ∞. The functional form of these distributions is regular and simple, as shown below in Figure 2.8.

[Figure 2.8: Different exponential probability densities (λ = .5; λ = 1; λ = 2)]

By extending the examples from Lecture 1, it is straightforward to calculate the moment generating function (for t < λ⁻¹) as:

M_X(t; \lambda) = \frac{1}{1 - \lambda t}    (2.81)

and the mean and variance as:

E[X] = λ    (2.82)
Var[X] = λ²    (2.83)
which loads λ with interpretation. The exponential distribution is typically
utilized to model the “waiting time” before the occurrence of some particular
event, if time is measured as a positive real number X > 0. In this sense, the
exponential distribution is a continuous analogue of the discrete geometric
distribution. Therefore, the parameter λ is both a measure of the average
waiting time, and of its variation. Even the exponential distribution features
the memoryless property (2.16), that is, for any two positive real numbers s, t with s > t:

P(X > s | X > t) = P(X > s − t)
which is easy to show by following the same procedure as in the case of the
geometric distribution. Intuitively, this means that “waiting times” do not
depend upon the passing of time, and that the expectations are continuously
reset so long as the event of interest does not occur.
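A tiny deterministic check of the memoryless property, using the survival function P(X > x) = exp(−x/λ) implied by (2.80); the values of λ, s, and t are arbitrary:

```python
import math

def survival(x, lam):
    # P(X > x) = 1 - F_X(x) = exp(-x / lam), from equation (2.80)
    return math.exp(-x / lam)

lam, s, t = 2.0, 5.0, 3.0
conditional = survival(s, lam) / survival(t, lam)  # P(X > s | X > t)
unconditional = survival(s - t, lam)               # P(X > s - t)
print(conditional, unconditional)  # both equal exp(-1)
```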


If a random variable X follows the exponential distribution with parameter λ, this is usually written as follows.

X ∼ Exp(λ)

It is simple to verify that if a random variable Y = K · X is obtained by rescaling an exponentially distributed random variable X by some positive constant K, Y is also exponentially distributed with a rescaled parameter, as follows.

Y ∼ Exp(Kλ)

The exponential distribution has numerous relationships with other distri-


butions. The ones that relate it to distributions that were already discussed
in this Lecture are summarized next.

Observation 2.4. If X ∼ U (0, 1) and Y = −λ log (X) it is Y ∼ Exp (λ).

Observation 2.5. If X ∼ Exp(λ) and Y = exp(−X) it is Y ∼ Beta(1/λ, 1).




Observation 2.6. As it was already anticipated, if X ∼ Laplace (µ, σ) and


Y = |X − µ| it is Y ∼ Exp (σ).

Observation 2.7. If X ∼ Exp(1) and

Y = \mu - \sigma \log\left( \frac{\exp(-X)}{1 - \exp(-X)} \right)

it is Y ∼ Logistic (µ, σ). The interpretation of this result is as follows: the


logistic distribution can model a linear function of the logarithm of an odds
ratio between two probabilities – that is, the probabilities that the waiting
time for some event which can be modeled by the exponential distribution
with unit parameter is either longer or shorter than some given number X.

Important relationships between the exponential distribution and other dis-


tributions not discussed yet are illustrated as these are introduced.
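Observation 2.4 is the basis of inverse-transform sampling for the exponential distribution; a quick sketch (arbitrary λ and seed) compares the empirical distribution of −λ log(U) draws with the cumulative distribution (2.80):

```python
import math
import random

random.seed(5)
lam = 2.0
draws = [-lam * math.log(random.random()) for _ in range(200_000)]

# Empirical CDF at a few points vs. F_X(x) = 1 - exp(-x / lam)
for x in (0.5, 1.0, 3.0):
    empirical = sum(d <= x for d in draws) / len(draws)
    theoretical = 1.0 - math.exp(-x / lam)
    print(x, empirical, theoretical)
```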

Gamma distribution
The Gamma family is central in the taxonomy of continuous distributions,
since it relates directly or indirectly to many other such families. Its support is the set of positive real numbers, X = R++; like the Beta family, it is identified by two positive parameters α ∈ R++ and β ∈ R++ (but several different parametrizations are possible; here the focus is on a specific one).
The name of this distribution derives from the fact that its density function


features the Gamma function; according to the parametrization adopted in


these lectures, the density function reads as follows.
f_X(x; \alpha, \beta) = \frac{1}{\Gamma(\alpha)} \, \beta^{\alpha} x^{\alpha - 1} \exp(-\beta x)   for x > 0    (2.84)
The cumulative distribution lacks a closed form expression but, as in the Beta case, it is often expressed in terms of an ancillary function called the lower incomplete Gamma function, γ(a, b) ≡ \int_{0}^{b} u^{a - 1} \exp(-u) du:

F_X(x; \alpha, \beta) = \frac{1}{\Gamma(\alpha)} \int_{0}^{x} \beta^{\alpha} t^{\alpha - 1} \exp(-\beta t) \, dt = \frac{\gamma(\alpha, \beta x)}{\Gamma(\alpha)}    (2.85)
A property, which is easy to show, of the lower incomplete Gamma function
is that limb→∞ γ (a, b) = Γ (a); clearly, all Gamma distributions integrate to
1. If a random variable X follows a Gamma distribution with parameters
α and β, this is generally written in one of two ways.
X ∼ Γ (α, β)
X ∼ Gamma (α, β)

Observation 2.8. X ∼ Gamma(1, 1/λ) is equivalent to X ∼ Exp(λ), that is, exponential distributions are all special cases of the Gamma family.

[Figure 2.9: Different Gamma probability densities (α = 2, β = 2; α = 4, β = 2; α = 2, β = 8)]

Figure 2.9 displays different Gamma density functions for α > 1; they all display the asymmetric humped shape that usually characterizes these distributions. Similarly as with the Beta distributions, the direct calculation of the Gamma uncentered moments is easy. By inspecting (2.84), it is:

E[X^r] = \frac{1}{\Gamma(\alpha) \beta^r} \int_{0}^{\infty} \beta^{r + \alpha} x^{r + \alpha - 1} \exp(-\beta x) \, dx = \frac{\Gamma(r + \alpha)}{\Gamma(\alpha) \beta^r}


since the expression inside the integral, divided by Γ (r + α), is the density
function of yet another Gamma distribution with parameters r + α and β.
All the moments are thus easily obtained thanks to the Gamma function’s
property that Γ (γ + 1) = γΓ (γ); for example, the mean and the variance
are expressed as follows.
E[X] = \frac{\alpha}{\beta}    (2.86)

Var[X] = \frac{\alpha}{\beta^2}    (2.87)
Alternatively, one could have used the moment generating function, which is calculated in analogy with the uncentered moments.

M_X(t; \alpha, \beta) = \int_{0}^{\infty} \exp(tx) \frac{1}{\Gamma(\alpha)} \beta^{\alpha} x^{\alpha - 1} \exp(-\beta x) \, dx
                      = \frac{\beta^{\alpha}}{(\beta - t)^{\alpha}} \int_{0}^{\infty} \frac{1}{\Gamma(\alpha)} (\beta - t)^{\alpha} x^{\alpha - 1} \exp(-(\beta - t) x) \, dx
                      = \left( \frac{\beta}{\beta - t} \right)^{\alpha}    (2.88)
                      = \left( 1 - \frac{t}{\beta} \right)^{-\alpha}

Note that the integral in the second line is the density function of a Gamma distribution with parameters α and β − t > 0 integrated over its entire support, which again helps simplify the computation. This implies that the moment generating function is only defined for t < β.
The Gamma distribution has numerous applications in several branches
of science; but its direct applications in the social sciences are quite scant.
As mentioned, the main importance of this distribution lies in its relation-
ship with other distributions.
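Python's standard library samples from the Gamma family directly; note that random.gammavariate takes a shape and a scale parameter, so under the rate parametrization used here the scale is 1/β. The sketch below (arbitrary parameters and seed) checks (2.86) and (2.87):

```python
import random

random.seed(9)
alpha, beta = 4.0, 2.0
# random.gammavariate(shape, scale); scale = 1 / rate
draws = [random.gammavariate(alpha, 1.0 / beta) for _ in range(200_000)]
mean = sum(draws) / len(draws)
var = sum((x - mean) ** 2 for x in draws) / len(draws)
print(mean, alpha / beta)      # sample mean vs. alpha / beta = 2, eq. (2.86)
print(var, alpha / beta ** 2)  # sample variance vs. alpha / beta^2 = 1, eq. (2.87)
```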

Chi-squared distribution
The family of chi-squared distributions is central in the theory of statistical
inference. It has its support on the positive real numbers X = R++ , a single
positive parameter κ ∈ R++ , and its density function is as follows.
f_X(x; \kappa) = \frac{1}{\Gamma\left( \frac{\kappa}{2} \right) \cdot 2^{\kappa/2}} \, x^{\kappa/2 - 1} \exp\left( -\frac{x}{2} \right)   for x > 0    (2.89)
If a random variable X follows the chi-squared distribution, this is written
in the following way, which might slightly differ in the details.
X ∼ χ2 (κ) or X ∼ χ2κ


Notably, when κ is an integer it is called the number of degrees of freedom of


the chi-squared distribution, for reasons that relate to the interpretation of
the latter as the distribution of a sample variance obtained from normally
distributed, independent random variables (see Lecture 5).4
Observation 2.9. X ∼ Gamma(κ/2, 1/2) is equivalent to X ∼ χ²(κ), that is, chi-squared distributions are all special cases of the Gamma family.

Observation 2.10. X ∼ χ²(2) is equivalent to X ∼ Exp(2).


[Figure 2.10: Different chi-squared probability densities (κ = 3; κ = 5; κ = 7)]

The two observations above clarify that the chi-squared distributions are
a subfamily of the Gamma family, and are consequently related to the ex-
ponential distribution as well. Unsurprisingly, the chi-squared distributions
typically have humped shapes similar to the Gamma ones, as displayed in
Figure 2.10. For low values of κ however, as for Gamma distributions with
specific combinations of α and β, this is not the case; recall for example
Figure 1.9 which shows that fX (x, κ = 1) approaches the y-axis asymptoti-
cally as x → 0, and lacks a maximum.5 Since every chi-squared distribution
is a particular Gamma distribution, the analysis of the former follows that
of the latter. For example, the mean and variance are as follows.
E [X] = κ (2.90)
Var [X] = 2κ (2.91)
The moment generating function, defined for t < 0.5, is instead given below.

M_X(t; \kappa) = (1 - 2t)^{-\kappa/2}    (2.92)
4. In this case one says that some random variable X follows the chi-squared distribution “with κ degrees of freedom.”
5. The chi-squared density function with one degree of freedom given in Example 1.12 is reconciled with (2.89) by noting that Γ(1/2) = √π.


The most important property of the chi-squared distribution is certainly


its relationship with the standard normal distribution, as already introduced
in Example 1.12. This is reiterated here.
Observation 2.11. If X ∼ N (0, 1) and Y = X 2 , it is Y ∼ χ2 (1).
This fact plays a fundamental role in statistical inference, as elaborated in
Lecture 5 and – for the asymptotic case – Lecture 6.
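Observation 2.11 can be illustrated by squaring standard normal draws and comparing the sample mean and variance against (2.90) and (2.91) with κ = 1; seed and sample size below are arbitrary:

```python
import random

random.seed(13)
draws = [random.gauss(0.0, 1.0) ** 2 for _ in range(200_000)]
mean = sum(draws) / len(draws)
var = sum((y - mean) ** 2 for y in draws) / len(draws)
print(mean, var)  # should approach kappa = 1 and 2 * kappa = 2
```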

Snedecor’s F -distribution
The distribution named after Ronald Fisher and George W. Snedecor – for brevity, Snedecor’s F-distribution or just the F-distribution – is yet another quite involved family of distributions with support restricted to the positive real numbers, X = R++, which is defined by two positive parameters (ν1, ν2) ∈ R²++; its density is given as:

f_X(x; \nu_1, \nu_2) = \frac{1}{B\left( \frac{\nu_1}{2}, \frac{\nu_2}{2} \right)} \left( \frac{\nu_1}{\nu_2} \right)^{\nu_1 / 2} x^{\nu_1 / 2 - 1} \left( 1 + \frac{\nu_1}{\nu_2} x \right)^{-\frac{\nu_1 + \nu_2}{2}}   for x > 0    (2.93)
where B(ν1/2, ν2/2) is a Beta function of the (halved) parameters. Its cumulative distribution can be expressed in a compact way by using the previously introduced incomplete Beta function:

F_X(x; \nu_1, \nu_2) = \frac{B\left( \frac{\nu_1 x}{\nu_1 x + \nu_2}; \frac{\nu_1}{2}, \frac{\nu_2}{2} \right)}{B\left( \frac{\nu_1}{2}, \frac{\nu_2}{2} \right)}    (2.94)

showing that the distribution integrates to 1, since the first argument of the incomplete Beta function tends to 1 as x → ∞. Some random variable X that


follows the Fisher-Snedecor distribution is generally indicated with one of
two slightly different notations.

X ∼ F (ν1 , ν2 ) or X ∼ Fν1 ,ν2

Observation 2.12. If X ∼ F(ν1, ν2) and Y = X⁻¹, it is Y ∼ F(ν2, ν1).

Observation 2.13. If X ∼ F(ν1, ν2) and Y ∼ Beta(ν1/2, ν2/2), the random variables X and Y are related by the following reciprocal transformations: Y = (ν1X/ν2)/(1 + ν1X/ν2) and X = ν2Y/(ν1(1 − Y)).
Selected density functions conforming to (2.93) are displayed in Figure 2.11 below. For values of ν1 and ν2 around or smaller than 2, the shape of
the F -distribution resembles that of the chi-squared for values of κ around
or smaller than 1, respectively. In a similar vein, values of the parameters


larger than 2 lead to a typical hump-shaped density. Like in the lognormal


and Cauchy cases, also the F -distribution lacks a moment generating func-
tion (and even its characteristic function is quite involved). Fortunately, its
moments can be calculated by direct integration as in the Beta and Gamma
cases. This allows one to derive the mean and variance as:

E[X] = ν2 / (ν2 − 2)    (2.95)

Var[X] = 2 ν2^2 (ν1 + ν2 − 2) / [ν1 (ν2 − 2)^2 (ν2 − 4)]    (2.96)

but these two moments are only defined for ν2 > 2 and ν2 > 4 respectively
(or else the integral that defines them diverges).
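A quick way to double-check (2.95)–(2.96) is against a numerical implementation; the sketch below uses SciPy's f distribution, whose dfn and dfd arguments play the roles of ν1 and ν2 (the variable names are the library's conventions, not the text's):

```python
import numpy as np
from scipy.stats import f as f_dist

nu1, nu2 = 12, 12
X = f_dist(dfn=nu1, dfd=nu2)
assert np.isclose(X.mean(), nu2 / (nu2 - 2))                         # (2.95)
assert np.isclose(
    X.var(),
    2 * nu2**2 * (nu1 + nu2 - 2) / (nu1 * (nu2 - 2)**2 * (nu2 - 4))  # (2.96)
)
```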

Figure 2.11: Different F probability densities (ν1 = 2, ν2 = 2; ν1 = 2, ν2 = 6; ν1 = 12, ν2 = 12)

The Fisher-Snedecor distribution has multiple relationships with other


distributions; its various links to the chi-squared distribution are especially
relevant and are discussed at length in later lectures; the importance of these
links is due to the F -distribution’s role in statistical inference about samples
drawn from the normal distribution. In those contexts, the parameters ν1
and ν2 are integers, and are also referred to – as in the chi-squared case –
as two degrees of freedom.6

Student’s t-distribution
The distribution named after the pseudonym of William Sealy Gosset, that
is Student (while ‘t’ derives from the distribution’s use in statistical tests),
is the only one analyzed in this section which has support on the entire set of
6
In such settings, one would say that X follows the F -distribution “with ν1 and ν2
degrees of freedom.”


real numbers, X = R; it is symmetric and defined by one positive parameter


ν ∈ R++ . Its density function can be expressed in two alternative ways,
with the aid of either the Gamma or the Beta function. In the former case,
one has:

f_X(x; ν) = [Γ((ν + 1)/2) / (Γ(ν/2) √(πν))] (1 + x^2/ν)^(−(ν+1)/2)    (2.97)

while in the latter it is as follows.

f_X(x; ν) = [1 / (B(1/2, ν/2) √ν)] (1 + x^2/ν)^(−(ν+1)/2)    (2.98)
Its cumulative distribution can be expressed through the incomplete Beta
function:

F_X(x; ν) = { (1/2) B(ν/(x^2 + ν); ν/2, 1/2) / B(ν/2, 1/2)        if x ≤ 0
            { 1 − (1/2) B(ν/(x^2 + ν); ν/2, 1/2) / B(ν/2, 1/2)    if x > 0    (2.99)

this shows that the distribution must integrate to 1 over the real line. A
random variable X that follows the Student’s t-distribution can be indicated
in two similar ways.
X ∼ T (ν) or X ∼ Tν
Observation 2.14. X ∼ T (1) is equivalent to X ∼ Cauchy (0, 1).
The above observation states that for ν = 1, the Student’s t is the stan-
dard Cauchy distribution. For moderate values of ν, however, the Student’s
t appears almost identical to the standard normal distribution, up to the
point that the two distributions “coincide” in an asymptotic sense as shown
later in Lecture 6. For small values of ν larger than 1, however, the
Student's t-distribution can be thought of as a sort of "middle ground"
between the standard Cauchy and the standard Normal distributions. This
is illustrated in Figure 2.12, which displays the density function of the latter
two distributions along with the density of the Student’s t for ν = 3.
It is easy to guess that because of the similarity with the Cauchy, the
moments of the Student’s t-distribution may not always be defined. In fact,
the distribution lacks a moment generating function and its characteristic
function is also hard to manipulate, while given a value of the parameter
ν, the moments of order r ≥ ν do not exist. Yet one can show that:
    
E[X^r] = { [Γ((r + 1)/2) Γ((ν − r)/2) / (Γ(ν/2) √π)] ν^(r/2)    if r is even and 0 < r < ν
         { 0                                                    if r is odd and 0 < r < ν


Figure 2.12: Student's t (ν = 3), standard Cauchy, and standard normal densities

which implies that, for ν > 1, the mean is defined as:

E[X] = 0    (2.100)

for ν > 2, the variance is defined as:

Var[X] = ν / (ν − 2)    (2.101)

for ν > 3, the skewness is defined as:

Skew[X] = 0    (2.102)

for ν > 4, the kurtosis is defined as:

Kurt[X] = (3ν − 6) / (ν − 4)    (2.103)

and so on and so forth with the higher moments.
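Formulas (2.101) and (2.103) can be cross-checked against SciPy's t distribution; one caveat worth noting is that the library reports *excess* kurtosis, i.e. Kurt[X] − 3 = 6/(ν − 4), rather than the raw kurtosis given above (a sketch, using the library's parameter names):

```python
import numpy as np
from scipy.stats import t

nu = 7
var, ex_kurt = t(df=nu).stats(moments='vk')
assert np.isclose(var, nu / (nu - 2))                    # (2.101)
# SciPy reports excess kurtosis, i.e. Kurt[X] - 3
assert np.isclose(ex_kurt + 3, (3 * nu - 6) / (nu - 4))  # (2.103)
```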
The Student’s t-distribution is another important distribution in the
theory of statistical inference, as it is used to model the distribution of a
sample mean drawn from normally distributed random variables. In these
contexts, the parameter ν is an integer and takes the not-too-original name
of number of degrees of freedom.7 The Student’s t-distribution is also related
to the Snedecor’s F -distribution through some non-monotone transforma-
tions, as expressed in the following two observations.
Observation 2.15. If X ∼ T (ν) and Y = X 2 , it is Y ∼ F (1, ν).
Observation 2.16. If X ∼ T (ν) and Y = X −2 , it is Y ∼ F (ν, 1).
These transformations, too, play a role in statistical inference and are
worth keeping in mind.
7
Similarly to the chi-squared and F cases, in these contexts one would thus say that
X follows the Student’s t-distribution “with ν degrees of freedom.”


Pareto distribution
This section is concluded with the analysis of the distribution named after
Vilfredo Pareto, a famous distribution with support defined on a subset of
the set of positive real numbers, X = [α, ∞). Here, α ∈ R++ is a parameter
of the family of Pareto distributions; this role is shared with another positive
parameter β ∈ R++ . Given two such parameters, the density function of a
particular Pareto distribution is:
f_X(x; α, β) = βα^β / x^(β+1)    for x ≥ α    (2.104)

and the cumulative distribution is:

F_X(x; α, β) = 1 − (α/x)^β    for x ≥ α    (2.105)
and zero otherwise (x < α). Its cumulative distribution clearly tends to 1 as
X tends to infinity. The shape of the distribution is displayed in Figure 2.13
below for a fixed value α = 1 and three different values of β. Clearly, the
parameter β affects the shape of the distribution, with lower values making
the distribution flatter, and vice versa (similarly as, but contrarily to, λ in
the exponential distribution’s case). Instead, parameter α affects both the
location of the distribution (as it defines the support) and the overall scale.
A random variable X distributed according to some Pareto distribution is
denoted as follows.
X ∼ Pareto (α, β)

Figure 2.13: Different Pareto probability densities, α = 1 (β = 1, 2, 3)

Observation 2.17. If X ∼ Pareto(α, β) and Y ∼ Exp(β^(−1)), the two
random variables are related through the two symmetric transformations
X = α exp(Y) and Y = log(X/α).


The moment generating function of the Pareto distribution is:


M_X(t; α, β) = ∫_α^∞ exp(tx) · βα^β / x^(β+1) dx = β (−αt)^β · Γ(−β, −αt)    (2.106)

where Γ(a, b) = ∫_b^∞ u^(a−1) exp(−u) du is the so-called upper incomplete
Gamma function, another ancillary function; the equivalence between the
two expressions in (2.106) is obtained through the simple change of variable
u = −tx, and since Γ(a, b) is only defined for b > 0, the moment generating
functions of Pareto distributions are only defined for t < 0. To calculate the
moments of this distribution, it is easiest to proceed by direct integration
of the density, which results in finite solutions only for restricted values of
the “shape” parameter β. Specifically, the mean is such that:

E[X] = { ∞               for β ≤ 1
       { αβ / (β − 1)    for β > 1    (2.107)

while the variance is as follows.

Var[X] = { ∞                              for β ≤ 2
         { α^2 β / [(β − 1)^2 (β − 2)]    for β > 2    (2.108)
(β − 1)2 (β − 2)

Intuitively, the Pareto distribution should not be too flat, or else its right
tail becomes “too heavy” causing the relevant moments to diverge. This is
a well known property of the Pareto distribution.
Like the lognormal and the Gamma distributions, the Pareto distribu-
tion is typically used to model asymmetric phenomena that are associated
with positive real numbers; it may be more motivated than its competing
alternatives when it is necessary to study phenomena that display “fat tails”
(i.e. notable inequality), such as the distribution of wealth or that of cities’
size. Another attractive feature of the Pareto distribution is that the den-
sity function (2.104) satisfies a mathematical power law, which is easy to
describe as a linear function upon applying a logarithmic transformation.

log f_X(x; α, β) = log(βα^β) − (β + 1) log x    for x ≥ α




Furthermore, the power law implies that the cumulative distribution (2.105)
can be easily inverted, resulting in a quantile function of conveniently simple
manipulation.
Q_X(p; α, β) = α (1 − p)^(−1/β)
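This quantile function can be checked against SciPy's pareto distribution, whose shape argument b and scale argument correspond to β and α in the text's notation (an illustrative sketch with made-up parameter values):

```python
import numpy as np
from scipy.stats import pareto

alpha, beta = 1.5, 3.0           # scale and shape, in the text's notation
P = pareto(b=beta, scale=alpha)  # SciPy convention: shape b = beta, scale = alpha
for p in [0.1, 0.5, 0.9, 0.99]:
    assert np.isclose(P.ppf(p), alpha * (1 - p)**(-1 / beta))
# the mean alpha*beta/(beta - 1) is finite here because beta > 1
assert np.isclose(P.mean(), alpha * beta / (beta - 1))
```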


It must be mentioned that the family of distributions discussed so far is


just a particular case of a more general family called generalized Pareto
distribution, whose cumulative distribution function reads:
F_X(x; β, γ, µ, σ) = 1 − [1 + ((x − µ)/σ)^(1/γ)]^(−β)    for x ≥ µ    (2.109)
with parameters µ ∈ R, (β, γ, σ) ∈ R3++ and support X = [µ, ∞). In this
context, the particular subfamily defined by (2.104) and (2.105) is referred
to as “Type I” Pareto distribution.

2.4 Continuous Distributions III


This lecture is completed with the separate treatment of a specific wider
family of continuous distributions, all encompassed under the umbrella of
the Generalized Extreme Value (GEV) distribution. As the name sug-
gests, these distributions are especially well suited to model the probability
about “extreme” realizations of events, which are defined (depending on the
context) as the maxima or the minima of a certain sequence of realizations.
The understanding about this specific interpretation of GEV distributions
is better developed through a result of asymptotic theory known under var-
ious names, including that of “Extreme Value Theorem,” and discussed in
Lecture 6. The analysis in this section is limited to that of the mathematical
properties of the GEV distribution, and their mutual relationships.
A GEV distribution depends on three parameters: a location parameter
µ ∈ R, a scale parameter σ ∈ R++ , and a so-called shape parameter ξ ∈ R.
The support of a GEV distribution depends on the value of ξ. In particular,
if ξ > 0, it is:

X_{ξ>0} = [µ − σ/ξ, ∞)    (2.110)

if ξ < 0, it is:

X_{ξ<0} = (−∞, µ − σ/ξ]    (2.111)
while if ξ = 0, the support Xξ=0 = R is more simply equal to the entire set
of real numbers. The general expression of the density function of the GEV
distribution is best defined through the standardized value Z = (X − µ) /σ;
as a function of ξ, the density function of Z is:
f_Z(z; ξ) = { (1 + ξz)^(−1/ξ − 1) exp(−(1 + ξz)^(−1/ξ))    for ξ ≠ 0 and ξz > −1
            { exp(−z) exp(−exp(−z))                        for ξ = 0    (2.112)


and the corresponding cumulative distribution is:


 
F_Z(z; ξ) = { exp(−(1 + ξz)^(−1/ξ))    for ξ ≠ 0 and ξz > −1
            { exp(−exp(−z))            for ξ = 0    (2.113)

and one can easily verify that in all cases, the distribution integrates to 1.
Observe that for both the density function and the cumulative distribution,
the expression for ξ = 0 corresponds to the limit case of the expression for
ξ 6= 0. An important property of GEV distributions is that (2.113) is easy
to invert in all cases, so that the quantile function can be written as follows.
Q_Z(p; ξ) = { [(−log(p))^(−ξ) − 1] / ξ    for ξ > 0, p ∈ [0, 1); or ξ < 0, p ∈ (0, 1]
            { −log(−log(p))              for ξ = 0 and p ∈ (0, 1)    (2.114)
Note that the restrictions in the domain of the quantile function correspond
to the restrictions on the support of X.
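One practical caveat when working with (2.112)–(2.114) numerically: SciPy's genextreme implementation parametrizes the GEV shape with the opposite sign, c = −ξ. A small sketch verifying the quantile function (2.114) under that convention (the code is illustrative, not part of the notes):

```python
import numpy as np
from scipy.stats import genextreme

# SciPy's genextreme uses the opposite sign convention for the shape: c = -xi
for xi in [0.5, -0.5]:
    Z = genextreme(c=-xi)
    for p in [0.2, 0.5, 0.8]:
        assert np.isclose(Z.ppf(p), ((-np.log(p))**(-xi) - 1) / xi)
# the Gumbel limit xi = 0, where Q(p) = -log(-log p)
for p in [0.2, 0.5, 0.8]:
    assert np.isclose(genextreme(c=0).ppf(p), -np.log(-np.log(p)))
```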

Figure 2.14: Different GEV probability densities with µ = 0 and σ = 1 (ξ = .5, 0, −.5)

Some examples of GEV density functions, for constant parameters µ = 0


and σ = 1 but different values of ξ, are displayed in Figure 2.14 above. All
these examples of distribution are clearly symmetric; also observe that for
ξ 6= 0, the support of the distribution is bounded accordingly. In order to
denote that a random variable X follows a GEV distribution with specified
parameters µ, σ and ξ, one usually writes as follows.

X ∼ GEV (µ, σ, ξ)

A general expression for the moment generating (or characteristic) function


of GEV distributions is difficult to derive, but the moments of interest can


be obtained by direct integration; their definition (or lack thereof) depends


again on the value of the parameter ξ. The mean, for example, is given as:

E[X] = { µ + (σ/ξ) [Γ(1 − ξ) − 1]    if ξ ≠ 0, ξ < 1
       { µ + σγ                      if ξ = 0
       { ∞                           if ξ ≥ 1    (2.115)

where Γ (·) is the Gamma function and γ ' 0.577 is the Euler-Mascheroni
constant, while the variance is as follows.
 2
σ 
Γ (1 − 2ξ) − (Γ (1 − ξ))2 if ξ 6= 0, ξ < 12

 
 2
ξ

Var [X] = π2 (2.116)
 σ 2
if ξ = 0
 6


if ξ ≥ 21

The remainder of this section (and of this lecture alike) discusses in more
detail three particular cases of GEV distributions. These are of particular
interest for economists and econometricians, as they feature prominently in
both theoretical economic models with a stochastic component and in the
closely related structural econometric models. These restricted subfamilies
of the larger GEV family are typically named according to their discoverers,
but are also distinguished by a number from a classification between types.

Type I GEV distribution (ξ = 0): Gumbel


The GEV subfamily restricted to ξ = 0 is the simplest one, and its members
are said to be following the Gumbel or Type I GEV distribution. Its support
X = R is the entire set of real numbers and it is defined only in terms of the
location and scale parameters µ and σ respectively. In the non-standardized
cases, the Gumbel density function is given by:
    
f_X(x; µ, σ) = (1/σ) exp(−(x − µ)/σ) exp(−exp(−(x − µ)/σ))    (2.117)
its cumulative distribution is:
  
F_X(x; µ, σ) = exp(−exp(−(x − µ)/σ))    (2.118)
and its quantile function is simply as follows, for p ∈ (0, 1).

QX (p; µ, σ) = µ − σ log (− log (p)) (2.119)
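The quantile function (2.119) is what one would use for inverse-transform sampling from the Gumbel distribution; a short check against SciPy's gumbel_r (whose loc and scale arguments are µ and σ here):

```python
import numpy as np
from scipy.stats import gumbel_r

mu, sigma = 2.0, 2.0
G = gumbel_r(loc=mu, scale=sigma)
for p in [0.1, 0.5, 0.9]:
    assert np.isclose(G.ppf(p), mu - sigma * np.log(-np.log(p)))  # (2.119)
# the mean is mu + sigma*gamma, with gamma the Euler-Mascheroni constant
assert np.isclose(G.mean(), mu + sigma * np.euler_gamma)
```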


Figure 2.15: Different Gumbel probability densities (standardized; µ = 2; σ = 2)

Figure 2.15 shows the “standardized” version of the Gumbel density


function along with versions for different values of µ and σ; their respective
roles as location and scale parameters are well evident. The mean and the
variance of the Gumbel distribution are given as the central cases for ξ = 0
in (2.115) and (2.116) respectively.8 If some random variable X follows the
Gumbel distribution, this is indicated in one of two ways:
X ∼ Gumbel (µ, σ)
or
X ∼ EV1 (µ, σ)
where “EV1” stands for “Extreme Value (Type) 1.” For reasons to be clari-
fied in later lectures, the Gumbel distribution is typically used to model the
maximum value among many different realizations, and it is the backbone
of several “structural” econometric techniques used to model the behavior
of socio-economic variables of interest that take values on a countable set.

Type II GEV distribution (ξ > 0): Fréchet


Restricting the wider class of GEV distributions to ξ > 0 defines the sub-
family of so-called Fréchet or Type II GEV distributions. A Fréchet distri-
bution has bounded support as per (2.110), and its defining equations have
been given earlier for the standardized (µ = 0 and σ = 1) case. Often, the
distribution is reparametrized through the inverse of the shape parameter,
α = ξ^(−1), and the following transformation is applied.
Y = σ + µ (1 − ξ) + ξX (2.120)
8
The Gumbel distribution also features a relatively simple expression for the moment
generating function, that is MX (t; µ, σ) = exp (µt) Γ (1 − σt).


In such a case, the support is Y = [µ, ∞), the density function reads as:
f_Y(y; α, µ, σ) = (α/σ) ((y − µ)/σ)^(−α−1) exp(−((y − µ)/σ)^(−α))    for y > µ    (2.121)
the cumulative distribution function as:
F_Y(y; α, µ, σ) = exp(−((y − µ)/σ)^(−α))    for y > µ    (2.122)
and the quantile function as follows.
Q_Y(p; α, µ, σ) = µ + σ (−log(p))^(−1/α)    for p ∈ [0, 1)    (2.123)
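In SciPy the Fréchet distribution is available under the name invweibull, with the shape c playing the role of α; a brief sketch checking the quantile function under that identification (parameter values are illustrative):

```python
import numpy as np
from scipy.stats import invweibull  # SciPy's name for the Frechet distribution

alpha, mu, sigma = 2.0, 2.0, 2.0
Y = invweibull(c=alpha, loc=mu, scale=sigma)
for p in [0.1, 0.5, 0.9]:
    # quantile: mu + sigma * (-log p)^(-1/alpha)
    assert np.isclose(Y.ppf(p), mu + sigma * (-np.log(p))**(-1 / alpha))
```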

Figure 2.16: Different Fréchet probability densities with α = 2 (standardized; µ = 2; σ = 2)

Three different Fréchet densities for a fixed value of the shape parameter
are shown above in Figure 2.16, again highlighting the role of the location
and scale parameters; observe how the former (µ) also affects the bound of
the distribution’s support. The mean and the variance of the transformed
random variable, if they are finite, are obtained by applying the standard
properties of simple moments to (2.115) and (2.116) respectively. If it fol-
lows the Fréchet distribution, a random variable Y is indicated either as:
Y ∼ Frechet (α, µ, σ)
or as:
Y ∼ EV2 (α, µ, σ)
where the latter notation now refers to “Type II.” The most frequent use of
the Fréchet distribution is similar to the Gumbel’s, that is, for modeling the
maximum value of many realizations. Furthermore, the Fréchet distribution
features prominently in several economic and econometric models, especially
in the field of international trade.


Type III GEV distribution (ξ < 0): (reverse) Weibull


The last subfamily of GEV distributions is defined for ξ < 0, and results in
the reverse Weibull or Type III GEV distribution. Once more, the support
of a reverse Weibull distribution is bounded as per (2.111) and its defining
functions follow from the general GEV case. This distribution is typically
reparametrized with α = −ξ^(−1); with the transformation (2.120) the analysis
proceeds as in the Fréchet case (e.g. the support becomes Y = (−∞, µ]).
With the alternative transformation instead:

W = − [σ + µ (1 − ξ) + ξX] (2.124)

one obtains the (traditional) Weibull distribution, which predates the the-
ory of GEV distributions (thus explaining the name reverse Weibull for the
Type III GEV case, since W = −Y ). The traditional Weibull distribution
has support W = [µ, ∞); its density function is:
f_W(w; α, µ, σ) = (α/σ) ((w − µ)/σ)^(α−1) exp(−((w − µ)/σ)^α)    for w > µ    (2.125)
its cumulative distribution function as:
F_W(w; α, µ, σ) = 1 − exp(−((w − µ)/σ)^α)    for w > µ    (2.126)

and its quantile function is as follows.


Q_W(p; α, µ, σ) = µ + σ (−log(1 − p))^(1/α)    for p ∈ [0, 1)    (2.127)
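The quantile function (2.127) can likewise be checked against SciPy's weibull_min, whose shape, loc, and scale arguments correspond to α, µ, and σ (an illustrative sketch):

```python
import numpy as np
from scipy.stats import weibull_min

alpha, mu, sigma = 2.0, 2.0, 2.0
W = weibull_min(c=alpha, loc=mu, scale=sigma)
for p in [0.1, 0.5, 0.9]:
    # quantile: mu + sigma * (-log(1 - p))^(1/alpha), cf. (2.127)
    assert np.isclose(W.ppf(p), mu + sigma * (-np.log(1 - p))**(1 / alpha))
```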

Figure 2.17: Different reverse Weibull probability densities with α = 2 (standardized; µ = 2; σ = 2)


Figure 2.18: Different Weibull probability densities with α = 2 (standardized; µ = 2; σ = 2)

Figures 2.17 and 2.18 display the reverse Weibull and the Weibull dis-
tribution respectively, for α = 2 and the same perturbations of the location
and scale parameters. The two figures clarify that each distribution is the
perfect “mirror image” of the other, as their names and mathematical relationship
suggest. The moments of both the reverse Weibull and the Weibull
distribution can again be appropriately derived from the expressions in the
GEV case. If a random variable Y follows the reverse Weibull distribution,
this is best written with an explicit reference to the “Type III” GEV:
Y ∼ EV3 (α, µ, σ)
while one writes:
W ∼ Weibull (α, µ, σ)
if a random variable W follows the (traditional) Weibull distribution.
This section is concluded by highlighting some relationships between the
Weibull distribution, other types of GEV distributions, and the exponential
distribution (and by extension, all distributions related to the exponential).
Observation 2.18. If X ∼ Exp(1), Y = µ − σ log(X), and W = µ + σX^(1/α),
it is Y ∼ Gumbel(µ, σ) and W ∼ Weibull(α, µ, σ).
Observation 2.19. If X ∼ Exp(√α) and W ∼ Weibull(1/2, 0, α), the two
random variables are symmetrically related: X = √W and W = X^2.
Observation 2.20. If Y ∼ Frechet(α, µ_Y, σ) and W = (Y − µ_Y)^(−1) + µ_W,
it is W ∼ Weibull(α, µ_W, σ^(−1)).
The (traditional) Weibull distribution is most often used to model the min-
imum value among multiple realizations, contrary to the Gumbel and the
Fréchet cases. A frequent application is in survival analysis (the statistical
study of waiting times) along with the related exponential distribution.

Lecture 3

Random Vectors

This lecture introduces those conceptual and mathematical tools that are
necessary to handle multiple random variables: these notions include those
of random vector, joint versus marginal probability distribution, multivari-
ate transformation of a random vector, independence, covariance and cor-
relation, conditional distribution and conditional moments. While develop-
ing these concepts, this lecture introduces additional relationships between
common univariate distributions, concluding with the treatment of two important
multivariate distributions. The mathematical notation is chosen so as
to facilitate the later treatment of econometric theory.

3.1 Multivariate Distributions


In most practical settings of interest, a statistical analyst is often interested
in describing the occurrence of multiple events, each best expressed through
a single random variable, that are possibly related to one another through a
probabilistic relationship of dependence. It is therefore important to extend
the theory of probability distributions to a multivariate environment. The
first step in this direction is the definition of a random vector.
Definition 3.1. Random Vector. A random vector x of length K is a
collection of K random variables X1 , . . . , XK :
 
x = (X1, . . . , XK)^T
each with support Xk ⊆ R for k = 1, . . . , K.
In light of the definition, the name random vector is admittedly uninspiring.


It is worth to observe one specific choice about notation: random vectors


are indicated with lower case, bold faced italic characters. In these lectures,
a similar notation (x) with roman – instead of italic – characters is instead
reserved for the realizations of random vectors: that is, the collection of the
specific realizations of each of the K random variables that define x.
 
x = (x1, . . . , xK)^T
It is also useful to specify what the support of a random vector is.
Definition 3.2. Support of a Random Vector. The support X ⊆ RK
of a random vector x is the Cartesian product of all the supports of the
random variables featured in x.
X = X1 × · · · × XK
Note that the definition of random vector imposes no restriction on the
original sample space on which the K random variables are based: it may
well be that it is the same for all of them. In such a case, one should allow
for the fact that the variables in question are probabilistically dependent,
that is, the realization of one random variable provides information about
the odds of another random variable’s realizations (as expressed through the
definition of conditional probability). A joint probability distribution
is a mathematical function that easily handles such circumstances.
Definition 3.3. Joint Probability Cumulative Distribution. Given
a random vector x, its joint probability cumulative distribution is defined
as the following function.
Fx (x) = P (x ≤ x) = P (X1 ≤ x1 ∩ · · · ∩ XK ≤ xK )
Obviously, a joint probability cumulative distribution takes values in the
[0, 1] interval. One can show that the properties of cumulative probability
distributions for single random variables (Theorem 1.7) are extended to the
joint, multivariate case; in particular, when the limit of all its arguments
tends to minus infinity (plus infinity) the function tends to zero (one); the
function is non-decreasing and right-continuous in all its arguments.
Even mass and density functions have their multivariate analogues.
Definition 3.4. Joint Probability Mass Function. Given any random
vector x composed by discrete random variables only, its joint probability
mass function fx (x) is defined as follows, for all x = (x1 , . . . , xK ) ∈ RK .
fx (x1 , . . . , xK ) = P (X1 = x1 ∩ · · · ∩ XK = xK )


A joint probability mass function is related to the cumulative function via


the following relationship:
P(x ≤ x) = F_x(x) = Σ_{t ∈ X: t ≤ x} f_x(t)

where the summation is taken over the vectors t in X whose all elements are
smaller or equal than all the elements of x. In addition, the joint probability
mass function must obviously satisfy the following condition.
P(x ∈ X) = Σ_{x ∈ X} f_x(x) = 1

Definition 3.5. Joint Probability Density Function. Given any ran-


dom vector x composed by continuous random variables only, its joint pro-
bability density function fx (x) is defined as the function that satisfies the
following relationship, for all x = (x1 , . . . , xK ) ∈ RK .
F_x(x1, . . . , xK) = ∫_{−∞}^{x1} · · · ∫_{−∞}^{xK} f_x(x1, . . . , xK) dx1 . . . dxK

In analogy with its univariate counterpart, a joint probability density func-


tion only takes nonnegative values (possibly larger than one), and allows to
express probabilities about events occurring within specified intervals – now,
intervals in RK . Given two vectors a = (a1 , . . . , aK ) and b = (b1 , . . . , bK )
with a, b ∈ RK and bk ≥ ak for k = 1, . . . , K, it is as follows.

P(a1 ≤ x1 ≤ b1 ∩ · · · ∩ aK ≤ xK ≤ bK) = ∫_{a1}^{b1} · · · ∫_{aK}^{bK} f_x(x1, . . . , xK) dx1 . . . dxK

Clearly, the joint density integrates to one over the entire support of x.
P(x ∈ X) = ∫_{X1} · · · ∫_{XK} f_x(x1, . . . , xK) dx1 . . . dxK = 1

In a multivariate environment, the probability distributions of the indi-


vidual random variables that compose a random vector are called marginal
distributions, and can be derived from the joint mass and density functions.
Definition 3.6. Marginal Distribution (discrete case). For a given
random vector x made of discrete random variables only, the probability
mass function of Xk – the k-th random variable in x – is obtained as:
f_{Xk}(xk) = Σ_{x1 ∈ X1} · · · Σ_{xk−1 ∈ Xk−1} Σ_{xk+1 ∈ Xk+1} · · · Σ_{xK ∈ XK} f_x(x1, . . . , xK)

and thus F_{Xk}(xk) = Σ_{t = inf Xk}^{xk} f_{Xk}(t).


The above relationship expresses a summation over all the supports of all
the other random variables in x, excluding Xk . It has to be interpreted in
a general sense, whatever the dimension of K and the actual index k are: if
K is small and/or k is at either extreme of the list, it must be reformulated
accordingly. This is best seen with small values of K.
Example 3.1. Joint Medical Outcomes. Recall Example 1.4 about the
probability of getting sick following the take-up (or lack thereof) of some
preemptive medical treatment. One could reformulate that example via a
random vector (X, Y ) where: x = 1 if an individual is a taker, x = 0 if he
or she hesitates, y = 1 if an individual stays healthy, y = 0 if he or she gets
sick. This is a bivariate Bernoulli distribution with:
fX,Y (x = 1, y = 1) = 0.40, fX,Y (x = 1, y = 0) = 0.20,
fX,Y (x = 0, y = 1) = 0.15, fX,Y (x = 0, y = 0) = 0.25,
and the marginal mass function of either Bernoulli-distributed random vari-
able is obtained by appropriately summing over the support of the other.
fX (x) = fX,Y (x, y = 1) + fX,Y (x, y = 0) for x = 0, 1
fY (y) = fX,Y (x = 1, y) + fX,Y (x = 0, y) for y = 0, 1
Bivariate Bernoulli distributions are typically represented through frequency tables,
where joint probabilities are displayed at the center, and marginal proba-
bilities at the margins; a frequency table for this example is shown below.
          Y = 0    Y = 1    Total
X = 0     0.25     0.15     0.40
X = 1     0.20     0.40     0.60
Total     0.45     0.55     1
The denomination “marginal” clearly derives from this graphical device. 
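Marginalization amounts to summing the joint table along one dimension; a minimal sketch reproducing the table's margins with NumPy (the array layout is an illustrative choice):

```python
import numpy as np

# joint mass function from Example 3.1: rows index X = 0, 1; columns index Y = 0, 1
joint = np.array([[0.25, 0.15],
                  [0.20, 0.40]])
fX = joint.sum(axis=1)  # marginal of X: sum over the support of Y
fY = joint.sum(axis=0)  # marginal of Y: sum over the support of X
assert np.allclose(fX, [0.40, 0.60])
assert np.allclose(fY, [0.45, 0.55])
assert np.isclose(joint.sum(), 1.0)  # the joint probabilities sum to one
```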
Analogously, the density functions of continuous marginal distributions
can be obtained by integrating the joint density over the support of all the
random variables in the random vector, except the one of interest.
Definition 3.7. Marginal Distribution (continuous case). For a given
random vector x made of continuous random variables only, the probability
density function of Xk – the k-th random variable in x – is obtained as:
f_{Xk}(xk) = ∫_{×_{ℓ≠k} X_ℓ} f_x(x) dx_{−k}

and thus F_{Xk}(xk) = ∫_{−∞}^{xk} f_{Xk}(t) dt.


In this more compact definition, the notation ×_{ℓ≠k} X_ℓ indicates the Cartesian
product of all the supports of each random variable in x excluding Xk:
e.g. ×_{ℓ≠k} X_ℓ = X1 × · · · × Xk−1 × Xk+1 × · · · × XK; similarly, the expression
dx_{−k} for the differential is to be interpreted as the product of all differentials
except the one for xk: dx_{−k} = dx1 . . . dxk−1 dxk+1 . . . dxK.

Example 3.2. Bivariate Normal Distribution. A two-dimensional ran-


dom vector x = (X1 , X2 ) follows a bivariate normal distribution if its joint
density function fX1 ,X2 (x1 , x2 ) is, for some parameters µ1 ∈ R, µ2 ∈ R,
σ1 ∈ R++ , σ2 ∈ R++ , and ρ ∈ [−1, 1], expressed as follows.

f_{X1,X2}(x1, x2; µ1, µ2, σ1, σ2, ρ) = [1 / (2πσ1σ2 √(1 − ρ^2))] ×
    × exp( −(x1 − µ1)^2 / [2σ1^2 (1 − ρ^2)] − (x2 − µ2)^2 / [2σ2^2 (1 − ρ^2)] + ρ(x1 − µ1)(x2 − µ2) / [σ1σ2 (1 − ρ^2)] )    (3.1)

Figure 3.1: Bivariate normal for µ1 = 1, µ2 = 2, σ1 = .5, σ2 = 1, ρ = .4 (three-dimensional plot, with the marginal densities of X1 and X2 projected at the margins).


Figure 3.1 represents a bivariate normal distribution (for selected values of


the parameters) in a three-dimensional plot, with domain restricted to the
[−1, 4]2 area. Furthermore, it projects the density functions of the marginal
distributions for X1 and X2 onto planes defined at the margins of the space
under consideration. A somewhat tedious exercise in integration of the joint
density would reveal that X1 ∼ N(µ1, σ1^2) and X2 ∼ N(µ2, σ2^2). 
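Rather than carrying out that integration by hand, the claim about the marginals can be verified numerically: integrating the joint density (3.1) over x2 should return the N(µ1, σ1²) density at each point. A sketch using SciPy's quadrature routine (parameter values as in Figure 3.1):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

m1, m2, s1, s2, rho = 1.0, 2.0, 0.5, 1.0, 0.4

def joint(x1, x2):
    # the bivariate normal density (3.1)
    z1, z2 = (x1 - m1) / s1, (x2 - m2) / s2
    c = 1.0 / (2 * np.pi * s1 * s2 * np.sqrt(1 - rho**2))
    return c * np.exp(-(z1**2 + z2**2 - 2 * rho * z1 * z2) / (2 * (1 - rho**2)))

# integrating x2 out must recover the N(m1, s1^2) marginal density of X1
for x1 in [0.0, 1.0, 2.0]:
    marginal, _ = quad(lambda x2: joint(x1, x2), -np.inf, np.inf)
    assert np.isclose(marginal, norm(loc=m1, scale=s1).pdf(x1))
```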

In numerous cases of practical interest, a random vector features both


continuous and discrete random variables. In such situations, obtaining the
marginal probability mass or density function for a specific random variable
is an exercise in combining summation and integration as appropriate.

Example 3.3. Mixed discrete-continuous random vector. Consider


a simple example of a random vector x = (H, G) where H is a continuous
random variable representing human height, while G is a Bernoulli random
variable describing gender (e.g. G = 1 for females and G = 0 for males).
It turns out that human height is normally distributed but with different
parameters in the population. One can provide a full probability description
of human height in the population as follows:
 
f_{H,G}(h, g = 1; µF, µM, σF, σM, p) = (1/σF) φ((h − µF)/σF) · p

f_{H,G}(h, g = 0; µF, µM, σF, σM, p) = (1/σM) φ((h − µM)/σM) · (1 − p)

where subscripts below parameters refer to genders (female or male) and


φ (·) is the standard normal density function. The marginal density function
of H is obtained by simply summing the two expressions:
   
f_H(h; µF, µM, σF, σM, p) = (1/σF) φ((h − µF)/σF) · p + (1/σM) φ((h − µM)/σM) · (1 − p)

while the observation that both densities must integrate to one gives:

fG (g = 1; p) = p
fG (g = 0; p) = 1 − p

thus returning the marginal mass function for G. 
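As a consistency check, the marginal density of H is a two-component normal mixture and must integrate to one; a short sketch with illustrative parameter values (the figures below are made up for the example, not taken from the text):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# illustrative (hypothetical) parameter values for heights in centimeters
muF, muM, sF, sM, p = 165.0, 178.0, 6.0, 7.0, 0.5

def fH(h):
    # marginal density of height: a gender-weighted mixture of two normals
    return norm(muF, sF).pdf(h) * p + norm(muM, sM).pdf(h) * (1 - p)

# integrating over a range containing essentially all the mass returns 1
total, _ = quad(fH, 100.0, 250.0)
assert np.isclose(total, 1.0)
```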

The results about transformations of random variables can be gener-


alized to random vectors. In this case, the interest falls on a transformed
random vector y = g (x) where g (·) is a function taking K arguments and
returning J values, with possibly J 6= K (thus y would have dimension J).


Once again, this problem is tractable only so long as the transformation is


invertible; in a multivariate setting this means that one can define a set of
functions g1^(−1)(·), . . . , gK^(−1)(·) such that:

Xk = gk^(−1)(Y1, . . . , YJ)
for k = 1, . . . , K. In such a case, a transformed discrete joint mass function
can be obtained by generalizing (1.5):

f_y(y) = f_x(g1^(−1)(y), . . . , gK^(−1)(y))    (3.2)
while the cumulative distribution Fy (y) is derived consequently. For con-
tinuous random vectors, Theorem 1.11 is extended as follows, by imposing
the additional restrictions that the transformation g (·) is bijective (both
injective and surjective, i.e. “one-to-one and onto”) and that K = J, that
is, the transformation does not affect the length of the random vector.
Theorem 3.1. Joint Density of Transformed Random Vectors. Let
x and y = g (x) be two random vectors of length K that are related by a
bijective transformation g (·) which preserves vector length, X and Y their
respective supports, and fx (x) the joint probability density function of x,
which is continuous on X. If the inverse of the transformation function,
gk^(−1)(·), is continuously differentiable on Y for k = 1, . . . , K, the joint
probability density function of y can be calculated as:

f_y(y) = { f_x(g1^(−1)(y), . . . , gK^(−1)(y)) · |det(∂g^(−1)(y)/∂y^T)|    if y ∈ Y
         { 0                                                              if y ∉ Y

where g^(−1)(y) = (g1^(−1)(y), . . . , gK^(−1)(y))^T, with the following K × K
Jacobian matrix:

∂g^(−1)(y)/∂y^T = [ ∂g1^(−1)(y)/∂y1    ∂g1^(−1)(y)/∂y2    . . .    ∂g1^(−1)(y)/∂yK
                    ∂g2^(−1)(y)/∂y1    ∂g2^(−1)(y)/∂y2    . . .    ∂g2^(−1)(y)/∂yK
                    . . .
                    ∂gK^(−1)(y)/∂y1    ∂gK^(−1)(y)/∂y2    . . .    ∂gK^(−1)(y)/∂yK ]

the absolute value of whose determinant is denoted as |det(∂g^(−1)(y)/∂y^T)|.

Proof. This result is a particular application of Jacobian transformations from multivariate calculus.
This result can be additionally generalized to transformations that are not
bijective, but are bijective on each element with positive probability of some
partition of X, similarly as in Theorem 1.12 for the univariate case.


Example 3.4. Bivariate lognormal distribution. Consider the bivariate normally distributed random vector from Example 3.2, and the random vector y = (Y₁, Y₂) which is obtained through the following transformation.

\[ y = \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} = g\!\begin{pmatrix} X_1 \\ X_2 \end{pmatrix} = \begin{pmatrix} \exp(X_1) \\ \exp(X_2) \end{pmatrix} \]

This implies $X_1 = g_1^{-1}(Y_1, Y_2) = \log(Y_1)$ and $X_2 = g_2^{-1}(Y_1, Y_2) = \log(Y_2)$, hence $\left| \det\!\left( \frac{\partial}{\partial y^T} g^{-1}(y_1, y_2) \right) \right| = (y_1 y_2)^{-1} > 0$ and:

\[ f_{Y_1,Y_2}(y_1, y_2; \mu_1, \mu_2, \sigma_1, \sigma_2, \rho) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \frac{1}{y_1 y_2} \times \]
\[ \times \exp\left( -\frac{(\log y_1 - \mu_1)^2}{2\sigma_1^2(1-\rho^2)} - \frac{(\log y_2 - \mu_2)^2}{2\sigma_2^2(1-\rho^2)} + \frac{\rho(\log y_1 - \mu_1)(\log y_2 - \mu_2)}{\sigma_1\sigma_2(1-\rho^2)} \right) \]

showing that y = (Y₁, Y₂) follows a bivariate lognormal distribution on $\mathbb{R}^2_{++}$ with parameters (µ₁, µ₂, σ₁, σ₂, ρ), as displayed in Figure 3.2 below.

[Figure: surface plot of the joint density f_{Y₁,Y₂}(y₁, y₂), together with the marginal densities f_{Y₁}(y₁) and f_{Y₂}(y₂).]

Figure 3.2: Bivariate lognormal, µ1 = 1, µ2 = 2, σ1 = .5, σ2 = 1, ρ = .4.


This example is relatively simple, because the two original random variables X₁ and X₂ do not interact in the transformation (that is, the Jacobian is diagonal). Some more elaborate cases are discussed in the next section. However, this example is also useful in itself as an occasion to graphically visualize another bivariate distribution (in this case, the lognormal). ∎
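The result can also be checked numerically. The sketch below (an editorial addition, reusing the parameter values of Figure 3.2) exponentiates bivariate normal draws and compares the simulated marginal means against the known lognormal formula E[Y_k] = exp(µ_k + σ_k²/2).

```python
import numpy as np

# Parameters as in Figure 3.2: mu1 = 1, mu2 = 2, s1 = .5, s2 = 1, rho = .4.
mu1, mu2, s1, s2, rho = 1.0, 2.0, 0.5, 1.0, 0.4
cov = [[s1**2, rho * s1 * s2],
       [rho * s1 * s2, s2**2]]

rng = np.random.default_rng(1)
x = rng.multivariate_normal([mu1, mu2], cov, size=500_000)
y = np.exp(x)                    # the componentwise transformation of Example 3.4

# Known lognormal marginal means: E[Y_k] = exp(mu_k + s_k^2 / 2).
m1_theory = np.exp(mu1 + s1**2 / 2)
m2_theory = np.exp(mu2 + s2**2 / 2)
print(y[:, 0].mean(), m1_theory)
print(y[:, 1].mean(), m2_theory)
```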

All the concepts and ideas discussed until this point in this lecture extend easily to random matrices, that is, arrayed combinations of L random vectors, written e.g. X = (x₁ ⋯ x_L). In analogy with random vectors, the realizations of random matrices adopt a romanized notation too, being written for example as X = (x₁ ⋯ x_L). The necessity to deal with random matrices explains why uppercase letters are not used to denote random vectors. Random matrices do not involve particular conceptual hurdles, being nothing else but a different algebraic way of arraying random variables. However, they are necessary in multivariate statistical analysis and econometrics as a means to describe statistical estimators and their properties more elegantly, compactly, and often more clearly.

3.2 Independence and Random Ratios


When a random vector involves some underlying random variables that
are probabilistically unrelated from one another, we say that these random
variables are independent. Intuitively, this means that the realization of
one specific random variable is uninformative with respect to the potential
realization of the other(s). In other words (tracing back to a definition given
in Lecture 1), any event described by either random variable is statistically
independent to any event described by the other. Because the probability of
events expressed by random variables are completely characterized by their
mass or density functions, it is possible to establish the following definitions.

Definition 3.8. Independent Random Variables. Let x = (X, Y ) be


a random vector with joint probability mass or density function fX,Y (x, y),
and marginal mass or density functions fX (x) and fY (y). Let uppercase F
denote corresponding cumulative distributions instead (joint or marginal).
The two random variables X and Y are independent if the two equivalent
conditions below hold.

fX,Y (x, y) = fX (x) fY (y) ⇐⇒ FX,Y (x, y) = FX (x) FY (y)

The definition of independence clearly extends to multiple random variables. However, one must be careful to distinguish between independence between random vectors and independence between multiple random variables that might belong to the same random vector. Thus, it is important to keep the following two definitions separate.
Definition 3.9. Mutually – or Pairwise – Independent Random
Variables. Let x = (X1 , . . . , XK ) be a random vector with joint probabil-
ity mass or density function fx (x), and marginal mass or density functions
fX1 (x1 ) , . . . , fXK (xK ). Let uppercase F denote corresponding cumulative
distributions instead (joint or marginal). The random variables X1 , . . . , XK
are pairwise independent if every pair of random variables listed in x are
independent, and they are mutually independent if the two equivalent con-
ditions below hold.
\[ f_x(x) = \prod_{k=1}^{K} f_{X_k}(x_k) \iff F_x(x) = \prod_{k=1}^{K} F_{X_k}(x_k) \]

Observe that while mutual independence implies pairwise independence,


the converse is not true.
Definition 3.10. Independent Random Vectors. Let (x1 , . . . , xJ ) be
a sequence of J random vectors whose joint probability mass or density
function is written as fx1 ,...,xJ (x1 , . . . , xJ ). Let the joint probability mass or
density functions of an individual nested random vector be fxi (xi ), where
i = 1, . . . , J, and the joint probability mass or density functions of any
two random vectors indexed i, j (with i 6= j) as fxi ,xj (xi , xj ). Finally, let
uppercase F denote corresponding cumulative distributions instead (joint
or marginal). Any pair of random vectors indexed i and j are independent
if the two equivalent conditions below hold.
fxi ,xj (xi , xj ) = fxi (xi ) fxj (xj ) ⇐⇒ Fxi ,xj (xi , xj ) = Fxi (xi ) Fxj (xj )
If the above holds for any i, j distinct pair, the J random vectors are said to
be pairwise independent. The J random vectors are mutually independent
if the two equivalent conditions below hold.
\[ f_{x_1,\dots,x_J}(x_1, \dots, x_J) = \prod_{i=1}^{J} f_{x_i}(x_i) \iff F_{x_1,\dots,x_J}(x_1, \dots, x_J) = \prod_{i=1}^{J} F_{x_i}(x_i) \]

Note that within each random vector, the underlying random variables are
not necessarily independent. Moreover, if all the random vectors in question
have length one, these definitions reduce to those given above.
Two results about independent random variables are well worth discussing: the first helps the interpretation of independence; the second is of more practical use, and instrumental in deriving other properties and results.


Theorem 3.2. Independence of Events. Any two events mapped by two


independent random variables X and Y are statistically independent.
Proof. (Outline.) This requires showing that, for any two events A ⊂ S_X and B ⊂ S_Y – where S_X and S_Y are the primitive sample spaces of X and Y respectively – it is:

\[ P\left( \{X \in X(A)\} \cap \{Y \in Y(B)\} \right) = P(X \in X(A)) \cdot P(Y \in Y(B)) \]

which follows from the definitions of (joint) cumulative distribution, mass and density functions, and that of independent events.
Generalization: Mutual Independence between Events. The events in any combination mapped by a sequence of J mutually independent random vectors (x₁, ..., x_J) are mutually independent.

Proof. (Outline.) Extending the reasoning above, consider a collection of J events denoted by A_i ⊂ S_{x_i} for i = 1, ..., J, where S_{x_i} is the primitive sample space of x_i. It must be shown that:

\[ P\left( \bigcap_{i=1}^{J} \{x_i \in x_i(A_i)\} \right) = \prod_{i=1}^{J} P(x_i \in x_i(A_i)) \]

which follows by analogous considerations.


Theorem 3.3. Independence of Functions of Random Variables.
Consider two independent random variables X and Y , and let U = gX (X)
be a transformation of X and V = gY (Y ) a transformation of Y . The two
transformed random variables U and V are independent.
Proof. (Outline.) This requires showing that

\[ f_{U,V}(u, v) = f_U(u) f_V(v) \iff F_{U,V}(u, v) = F_U(u) F_V(v) \]

which is achieved by manipulating the inverse mappings $g_X^{-1}([a, b])$ and $g_Y^{-1}([a, b])$ for any appropriate interval $[a, b] \subset \mathbb{R}$, with $a \le b$.

Generalization: Independence of Functions of Random Vectors.


Consider a sequence of mutually independent random vectors (x1 , . . . , xJ ),
as well as a sequence of transformations (y1 , . . . , yJ ) such that yi = gi (xi )
for i = 1, . . . , J. The J transformed random vectors (y1 , . . . , yJ ) are also
themselves mutually independent.
Proof. (Outline.) The proof extends the logic of the bivariate case to higher
dimensions; it requires manipulating the J Jacobian transformations.
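Theorem 3.3 is easy to probe by simulation. The following sketch (an editorial addition; the transformations g_X(x) = x² and g_Y(y) = eʸ are arbitrary choices) checks both the zero sample correlation of the transformed variables and the factorization of joint event probabilities.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400_000
x = rng.normal(size=n)            # X and Y drawn independently
y = rng.normal(size=n)

u = x**2                          # U = g_X(X), an arbitrary transformation
v = np.exp(y)                     # V = g_Y(Y), another one

# Independence of U and V implies zero correlation ...
corr_uv = np.corrcoef(u, v)[0, 1]
print(corr_uv)

# ... and factorization of joint probabilities for any pair of events.
a = u > 1.0
b = v < 1.0
joint, product = np.mean(a & b), a.mean() * b.mean()
print(joint, product)
```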


The concept of independence is central in theoretical and applied statis-


tics. In probability theory, it helps identify the distribution of functions of
random variables that can be expressed as a random ratio (Y = X1 /X2 ),
or as a random product (Y = X1 X2 ). Some results about random ratios
are especially important in applied statistics, and they are developed next
in the form of observations about distributions.
Observation 3.1. If X1 ∼ N (0, 1) and X2 ∼ N (0, 1), and the two random
variables X1 and X2 are independent, the random variable Y obtained as
Y = X1 /X2 is such that Y ∼ Cauchy (0, 1).
Proof. To demonstrate this assertion, consider first that if the two standard
normal random variables X1 and X2 in question are independent, their joint
density function is by definition a simplified version of (3.1).
\[ f_{X_1,X_2}(x_1, x_2) = \frac{1}{2\pi} \exp\left( -\frac{x_1^2 + x_2^2}{2} \right) \]
Consider the random vectors x = (X1 , X2 ) and y = (Y, Z) where Z = |X2 |
and the support of y is Y = R × R+ . It is straightforward to verify that the
transformation that relates x with y is not bijective, however the support
of x can be partitioned in such a way that it is bijective on each component
with positive probability, in the spirit of Theorem 1.12, as follows.
X0 = {(x1 , x2 ) : x2 = 0}
X1 = {(x1 , x2 ) : x2 < 0}
X2 = {(x1 , x2 ) : x2 > 0}
The intermediate objective is to derive the joint probability density of y.
To this end, one must analyze the Jacobian matrices defined on both sets
X1 and X2 , apply Theorem 3.1 on them, and sum the results. The inverse
transformations on X1 , where Z = −X2 , are:
−1
X1 = g1,X1
(Y, Z) = −Y Z
−1
X2 = g2,X1
(Y, Z) = −Z
therefore the determinant of the Jacobian matrix is as follows.
   
∂ −1 −z −y
det g (y, z) = det =z>0
∂y T X1 0 −1
In the case of X2 it is Z = X2 , hence the analysis proceeds as:
−1
X1 = g1,X 2
(Y, Z) = Y Z
−1
X2 = g2,X 2
(Y, Z) = Z

93
3.2. Independence and Random Ratios

and:    
∂ −1 z y
det g (y, z) = det =z>0
∂y T X2 0 1
and in both cases the result is equal to z, which is positive by construction, leaving no need to take absolute values. In addition, since the two sets X₁ and X₂ in the partition are both symmetric around x₂ = 0 – just like the transformation that defines z – the joint density of y can be obtained by applying Theorem 3.1 once to the joint density of x, and multiplying the result by two.

\[ f_{Y,Z}(y, z) = \frac{z}{\pi} \exp\left( -\frac{(y^2 + 1) z^2}{2} \right) \]
The final objective is to show that the marginal density of Y indeed follows the standard Cauchy distribution. To achieve this, the route is to integrate the joint density of y over the support of Z, which is $\mathbb{R}_+$:

\[ \begin{aligned} f_Y(y) &= \int_0^{+\infty} f_{Y,Z}(y, z)\, dz \\ &= \int_0^{+\infty} \frac{z}{\pi} \exp\left( -\frac{(y^2+1) z^2}{2} \right) dz \\ &= \int_0^{+\infty} \frac{1}{2\pi} \exp\left( -\frac{(y^2+1)}{2} u \right) du \\ &= \frac{1}{\pi (y^2+1)} \int_0^{+\infty} \frac{(y^2+1)}{2} \exp\left( -\frac{(y^2+1)}{2} u \right) du \\ &= \frac{1}{\pi (y^2+1)} \end{aligned} \]

where in the third line the change of variable u = z² is applied, while the integral in the fourth line equals one – being the total probability of an exponential distribution with parameter λ = (y² + 1)/2 – and thus drops out. The final result is indeed the probability density function of a standard Cauchy distribution, as originally postulated.
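Observation 3.1 lends itself to a direct Monte Carlo check (an editorial addition). Since the Cauchy distribution has no mean, quantiles are the appropriate summary; the quartiles of Cauchy(0, 1) are −1, 0 and 1.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = x1 / x2                       # ratio of independent standard normals

# Theoretical Cauchy(0, 1) CDF: F(y) = 1/2 + arctan(y)/pi, so the
# 25%, 50% and 75% quantiles are -1, 0 and 1 respectively.
q25, q50, q75 = np.quantile(y, [0.25, 0.5, 0.75])
print(q25, q50, q75)
```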
Observation 3.2. If Z ∼ N(0, 1) and X ∼ χ²(ν), and the two random variables Z and X are independent, the random variable Y obtained as $Y = Z / \sqrt{X/\nu}$ is such that Y ∼ T(ν).
Proof. The first step is to show what the distribution of the transformed random variable $W = \sqrt{X/\nu}$ is. The distribution of the square root of a random variable following the chi-squared distribution is called (unsurprisingly) the chi distribution, and W just follows a rescaled version of it. This transformation is monotone, it preserves the support X = W = $\mathbb{R}_+$, its inverse is $X = g^{-1}(W) = \nu W^2$ and thus $\frac{dx}{dw} = 2\nu w > 0$, and consequently the density function of W is:

\[ f_W(w; \nu) = \frac{\nu^{\nu/2}}{\Gamma\!\left(\frac{\nu}{2}\right) \cdot 2^{\frac{\nu}{2}-1}}\, w^{\nu-1} \exp\left( -\frac{\nu w^2}{2} \right) \quad \text{for } w > 0 \]
and therefore the joint density function of the random vector w = (Z, W) is, given φ(z) the density function of the standard normal distribution:

\[ f_w(z, w; \nu) = \phi(z) f_W(w; \nu) = \frac{1}{\sqrt{2\pi}} \frac{\nu^{\nu/2}}{\Gamma\!\left(\frac{\nu}{2}\right) \cdot 2^{\frac{\nu}{2}-1}}\, w^{\nu-1} \exp\left( -\frac{z^2 + \nu w^2}{2} \right) \]

for z ∈ $\mathbb{R}$ and w ∈ $\mathbb{R}_{++}$; note that a product of the two densities is sufficient here because, by Theorem 3.3, if Z is independent of X it is also independent of W. The next step is to obtain the joint density function of the random vector y = (Y, W), which is related to w by a support-preserving, bivariate, bijective transformation whose inverse has a Jacobian matrix with determinant w > 0, similarly to the analysis of Observation 3.1. Since Z = YW, the density function in question is, for y ∈ $\mathbb{R}$ and w ∈ $\mathbb{R}_{++}$:

\[ f_y(y, w; \nu) = w \cdot f_w(yw, w; \nu) = \frac{\nu^{\frac{\nu+1}{2}}}{\Gamma\!\left(\frac{\nu}{2}\right) \cdot 2^{\frac{\nu-1}{2}}} \frac{1}{\sqrt{\pi\nu}}\, w^{\nu} \exp\left( -\frac{(y^2 + \nu) w^2}{2} \right) \]
and the marginal density of Y is obtained by integrating W over $\mathbb{R}_{++}$. The result is the density function of a Student t-distribution with parameter ν.

\[ \begin{aligned} f_Y(y; \nu) &= \int_0^{+\infty} f_{Y,W}(y, w)\, dw \\ &= \frac{1}{\Gamma\!\left(\frac{\nu}{2}\right)} \frac{1}{\sqrt{\pi\nu}} \frac{\nu^{\frac{\nu+1}{2}}}{2^{\frac{\nu-1}{2}}} \int_0^{+\infty} w^{\nu} \exp\left( -\frac{(\nu + y^2) w^2}{2} \right) dw \\ &= \frac{1}{\Gamma\!\left(\frac{\nu}{2}\right)} \frac{1}{\sqrt{\pi\nu}} \left( \frac{\nu}{2} \right)^{\frac{\nu+1}{2}} \int_0^{+\infty} u^{\frac{\nu-1}{2}} \exp\left( -\frac{\nu}{2}\left( 1 + \frac{y^2}{\nu} \right) u \right) du \\ &= \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)} \frac{1}{\sqrt{\pi\nu}} \left( 1 + \frac{y^2}{\nu} \right)^{-\frac{\nu+1}{2}} \times \\ &\qquad \times \int_0^{+\infty} \frac{1}{\Gamma\!\left(\frac{\nu+1}{2}\right)} \left( \frac{\nu + y^2}{2} \right)^{\frac{\nu+1}{2}} u^{\frac{\nu-1}{2}} \exp\left( -\frac{\nu + y^2}{2} u \right) du \\ &= \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right) \sqrt{\pi\nu}} \left( 1 + \frac{y^2}{\nu} \right)^{-\frac{\nu+1}{2}} \end{aligned} \]


In the above analysis, the third line applies the change of variable u = w²; the fourth line is obtained through some manipulation, whereas the integral therein is recognized as the total probability of a Gamma distribution with parameters α = (ν + 1)/2 and β = (ν + y²)/2, and thus equals one.
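The construction can be verified against numpy's own Student-t sampler (an editorial sketch; ν = 5 is an arbitrary choice).

```python
import numpy as np

rng = np.random.default_rng(4)
n, nu = 1_000_000, 5
z = rng.normal(size=n)
x = rng.chisquare(nu, size=n)
y = z / np.sqrt(x / nu)           # the construction in Observation 3.2

# Compare quantiles against a direct Student-t sample with the same nu.
probs = [0.1, 0.25, 0.5, 0.75, 0.9]
q_ratio = np.quantile(y, probs)
q_direct = np.quantile(rng.standard_t(nu, size=n), probs)
print(q_ratio)
print(q_direct)
```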
Observation 3.3. If X1 ∼ χ2 (ν1 ) and X2 ∼ χ2 (ν2 ), and the two random
variables X1 and X2 are independent, the random variable Y obtained as
Y = (X1 /ν1 ) / (X2 /ν2 ) is such that Y ∼ F (ν1 , ν2 ).
Proof. (Outline.) This proceeds as in the previous two observations. First, define W₁ = X₁/ν₁; the density function of this transformation is X₁'s density evaluated at ν₁w₁ and rescaled by ν₁, and similarly for W₂ = X₂/ν₂. The next step is the transformation Y = W₁/W₂ and Z = |W₂|; the joint density function of Y and Z can be derived consequently from that of W₁ and W₂. Some manipulation would then reveal that the marginal density of the ratio Y is that of an F-distribution with parameters ν₁ and ν₂.
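An analogous simulation check for Observation 3.3 (an editorial addition; the degrees of freedom are arbitrary choices).

```python
import numpy as np

rng = np.random.default_rng(5)
n, nu1, nu2 = 1_000_000, 4, 8
x1 = rng.chisquare(nu1, size=n)
x2 = rng.chisquare(nu2, size=n)
y = (x1 / nu1) / (x2 / nu2)       # the ratio in Observation 3.3

# Compare quantiles against a direct F(nu1, nu2) sample.
probs = [0.1, 0.25, 0.5, 0.75, 0.9]
q_ratio = np.quantile(y, probs)
q_direct = np.quantile(rng.f(nu1, nu2, size=n), probs)
print(q_ratio)
print(q_direct)
```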
The last observation is presented completely without proof.
Observation 3.4. If X1 ∼ Γ (α, γ) and X2 ∼ Γ (β, γ), and the two random
variables X1 and X2 are independent, the random variable Y obtained as
Y = X1 / (X1 + X2 ) is such that Y ∼ Beta (α, β), and is independent of the
random variable W obtained as W = X1 + X2 such that W ∼ Γ (α + β, γ).
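Observation 3.4 can likewise be probed by simulation (an editorial addition; note that numpy parametrizes the Gamma distribution by shape and scale, so the common parameter γ of the Observation corresponds to a shared scale below).

```python
import numpy as np

rng = np.random.default_rng(6)
n, alpha, beta = 500_000, 2.0, 3.0
scale = 1.5       # the common second Gamma parameter (a scale, in numpy)

x1 = rng.gamma(alpha, scale, size=n)
x2 = rng.gamma(beta, scale, size=n)
y = x1 / (x1 + x2)
w = x1 + x2

# Y should match a direct Beta(alpha, beta) sample ...
probs = [0.25, 0.5, 0.75]
q_y = np.quantile(y, probs)
q_beta = np.quantile(rng.beta(alpha, beta, size=n), probs)
print(q_y, q_beta)

# ... while W ~ Gamma(alpha + beta, scale), uncorrelated with Y.
corr_yw = np.corrcoef(y, w)[0, 1]
print(w.mean(), (alpha + beta) * scale, corr_yw)
```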
This completes the picture about the best known results on random ratios.
Among these Observations, 3.2 and 3.3 play an important role in statistical
inference, as elaborated in the next lectures.

3.3 Multivariate Moments


To every random vector x = (X₁, ..., X_K) is associated some list of moments, both uncentered and centered, that pertain to all the random variables featured in x (insofar as these moments exist). There is no distinct definition for these. However, it is interesting to analyze how they formally relate to the joint distribution of x, through the link between joint and marginal mass or density functions. Beginning with the mean, the r-th uncentered moments of discrete and continuous random variables belonging to the random vector x can be written, respectively, as:

\[ E[X_k^r] = \sum_{x_1 \in X_1} \cdots \sum_{x_K \in X_K} x_k^r\, f_x(x) \]
\[ E[X_k^r] = \int_{X_1} \cdots \int_{X_K} x_k^r\, f_x(x)\, dx \]


while the r-th centered moments, including the variance, as:

\[ E[(X_k - E[X_k])^r] = \sum_{x_1 \in X_1} \cdots \sum_{x_K \in X_K} (x_k - E[X_k])^r f_x(x) \]
\[ E[(X_k - E[X_k])^r] = \int_{X_1} \cdots \int_{X_K} (x_k - E[X_k])^r f_x(x)\, dx \]

where in both cases, dx = dx₁ ⋯ dx_K is the product of all the differentials.
In a multivariate context it is interesting to describe the degree by which
two random variables tend to deviate from their mean in the same direction,
in a probabilistic sense. This concept is expressed by the covariance (an
absolute measure) and the correlation (a normalized one).
Definition 3.11. Covariance. For any two random variables X_k and X_ℓ belonging to a random vector x, their specific covariance is defined as the expectation of a particular function of X_k and X_ℓ, that is, the product of both variables' deviations from their respective means.

\[ \mathrm{Cov}[X_k, X_\ell] = E[(X_k - E[X_k])(X_\ell - E[X_\ell])] \]

The full expression is written as follows, for discrete and continuous random variables respectively.

\[ \mathrm{Cov}[X_k, X_\ell] = \sum_{x_1 \in X_1} \cdots \sum_{x_K \in X_K} (x_k - E[X_k])(x_\ell - E[X_\ell])\, f_x(x) \]
\[ \mathrm{Cov}[X_k, X_\ell] = \int_{X_1} \cdots \int_{X_K} (x_k - E[X_k])(x_\ell - E[X_\ell])\, f_x(x)\, dx \]

The covariance takes positive values if the two variables (X_k and X_ℓ) tend to deviate from their means in the same direction, and negative values vice versa. It must be observed, however, that the covariance expresses a relationship of dependence which is essentially linear; if two random variables tend to move together in a very non-linear or irregular way, this may not be captured at all by the covariance. Similarly to the variance, the definition of the covariance can be rewritten in a way that is often more convenient to handle.

\[ \begin{aligned} \mathrm{Cov}[X_k, X_\ell] &= E[(X_k - E[X_k])(X_\ell - E[X_\ell])] \\ &= E[X_k X_\ell] - E[X_k E[X_\ell]] - E[X_\ell E[X_k]] + E[X_k] E[X_\ell] \\ &= E[X_k X_\ell] - E[X_k] E[X_\ell] \end{aligned} \]
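The equality between the defining expression and the rearranged formula holds exactly in its sample analogue as well, as the following sketch (an editorial addition, on simulated data) illustrates.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000
xk = rng.normal(size=n)
xl = 0.5 * xk + rng.normal(size=n)     # build in some linear dependence

# The defining expression, approximated on the sample ...
cov_centered = np.mean((xk - xk.mean()) * (xl - xl.mean()))
# ... versus the rearranged "shortcut" formula: identical up to rounding.
cov_shortcut = np.mean(xk * xl) - xk.mean() * xl.mean()
print(cov_centered, cov_shortcut)
```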
Definition 3.12. Correlation. For any two random variables X_k and X_ℓ belonging to a random vector x, their population correlation is defined as follows.

\[ \mathrm{Corr}[X_k, X_\ell] = \frac{\mathrm{Cov}[X_k, X_\ell]}{\sqrt{\mathrm{Var}[X_k]}\sqrt{\mathrm{Var}[X_\ell]}} \]


The correlation is a “normalized covariance” which is comparable across different pairs of random variables, thanks to the following result.

Theorem 3.4. Properties of Correlation. For any two random variables X and Y, it is:

a. Corr[X, Y] ∈ [−1, 1], and

b. |Corr[X, Y]| = 1 if and only if there are some real numbers a ≠ 0 and b such that P(Y = aX + b) = 1. If Corr[X, Y] = 1 it is a > 0; if Corr[X, Y] = −1 it is a < 0.
Proof. Define the following function:

\[ C(t) = E\left[ \left( (X - E[X]) \cdot t + (Y - E[Y]) \right)^2 \right] = \mathrm{Var}[X] \cdot t^2 + 2\,\mathrm{Cov}[X, Y] \cdot t + \mathrm{Var}[Y] \]

and note that it is nonnegative for every t, because it is defined as the expectation of the square of a random variable. Thus the discriminant of this quadratic in t must satisfy:

\[ (2\,\mathrm{Cov}[X, Y])^2 - 4\,\mathrm{Var}[X]\,\mathrm{Var}[Y] \le 0 \tag{3.3} \]

or, equivalently:

\[ -\sqrt{\mathrm{Var}[X]}\sqrt{\mathrm{Var}[Y]} \le \mathrm{Cov}[X, Y] \le \sqrt{\mathrm{Var}[X]}\sqrt{\mathrm{Var}[Y]} \]

thus a. is proved. Next, consider that |Corr[X, Y]| = 1 only if (3.3) holds with equality, or equivalently, C(t) = 0 for some value of t. For this to happen, it must be:

\[ P\left( \left( (X - E[X]) t + (Y - E[Y]) \right)^2 = 0 \right) = 1 \]

or equivalently:

\[ P\left( (X - E[X]) t + (Y - E[Y]) = 0 \right) = 1 \]

which only occurs if, given Y = aX + b:

\[ a = -t, \qquad b = E[X] \cdot t + E[Y], \qquad t = -\frac{\mathrm{Cov}[X, Y]}{\mathrm{Var}[X]} \]

and the proof of b. is completed by showing that a and Corr[X, Y] must also share the same sign.
Result a. in the above Theorem characterizes the normalized interpretation
of correlation. Result b. instead specifies the linear nature of the relation-
ship captured by measures of correlation, which equal either 1 or −1 if and
only if the two random variables under consideration are connected through
an exact linear dependence.


Additional insights can be acquired by looking at specific moments of


independent random variables.
Theorem 3.5. Cross-expectation of independent random variables.
Given two independent random variables X and Y , it is

E [XY ] = E [X] E [Y ]

if the above moments exist.


Proof. The left-hand side of the above relationship is the expectation of a random variable which is defined as the product of X and Y:

\[ \int_X \int_Y x y\, f_{X,Y}(x, y)\, dx\, dy = \int_X \int_Y x y\, f_X(x) f_Y(y)\, dx\, dy = \int_X x f_X(x)\, dx \cdot \int_Y y f_Y(y)\, dy \]

falling back to the product of two expressions corresponding to the definition of the mean (for X and Y respectively); the first equality exploits the definition of independent random variables. The discrete case is analogous, with sums replacing integrals.
Corollary 1. of Theorem 3.5. Both the covariance and the correlation
between two independent random variables X and Y equal zero.
Corollary 2. of Theorem 3.5. Given two transformations U = g_X(X) and V = g_Y(Y) of two independent random variables X and Y, it is:

\[ E[UV] = E[U]\, E[V] \]

because U and V are also independent (so long as all the relevant moments exist); this also implies that U and V have zero covariance and correlation, and that all higher moments of X and Y inherit this property. For example:

\[ \begin{aligned} \mathrm{Var}\left[ (X - E[X])(Y - E[Y]) \right] &= E\left[ (X - E[X])^2 (Y - E[Y])^2 \right] \\ &= E\left[ (X - E[X])^2 \right] E\left[ (Y - E[Y])^2 \right] \\ &= \mathrm{Var}[X]\, \mathrm{Var}[Y] \end{aligned} \]

which is best seen by setting U = (X − E[X])² and V = (Y − E[Y])², and by noting that the first equality holds because E[(X − E[X])(Y − E[Y])] = Cov[X, Y] = 0.
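A brief simulation (an editorial addition) confirms both the cross-expectation result and the factorization of the variance of the centered product; note that with nonzero means it is the centered product, not XY itself, whose variance factorizes.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 1_000_000
x = rng.normal(2.0, 1.0, size=n)       # independent, with nonzero means
y = rng.normal(-1.0, 0.5, size=n)

# E[XY] = E[X] E[Y] under independence ...
print(np.mean(x * y), x.mean() * y.mean())

# ... and the variance of the *centered* product factorizes:
prod_c = (x - x.mean()) * (y - y.mean())
print(prod_c.var(), x.var() * y.var())
```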


Example 3.5. Covariances and Correlations of Bivariate Normal
Distributions. For every bivariate normal distribution such as the one in
Example 3.2, the following holds.

E [X1 X2 ] = ρσ1 σ2 + µ1 µ2


The demonstration of this result is simplified by a transformation of the random vector x = (X₁, X₂). Define the random vector y = (Y, Z) as:

\[ Y = \frac{X_1 - \mu_1}{\sigma_1} \cdot \frac{X_2 - \mu_2}{\sigma_2}, \qquad Z = \frac{X_1 - \mu_1}{\sigma_1} \]

the support of y is Y = $\mathbb{R}^2$ and the transformation is clearly bijective; the inverse transformation is:

\[ X_1 = g_1^{-1}(Y, Z) = \sigma_1 Z + \mu_1, \qquad X_2 = g_2^{-1}(Y, Z) = \sigma_2 \frac{Y}{Z} + \mu_2 \]

whose Jacobian has the following absolute value of the determinant.

\[ \left| \det\!\left( \frac{\partial}{\partial y^T} g^{-1}(y, z) \right) \right| = \left| \det\begin{pmatrix} 0 & \sigma_1 \\ \sigma_2 z^{-1} & -\sigma_2 y z^{-2} \end{pmatrix} \right| = \frac{\sigma_1 \sigma_2}{|z|} = \frac{\sigma_1 \sigma_2}{\sqrt{z^2}} \]
By Theorem 3.1, the joint density function of y = (Y, Z) is:

\[ f_{Y,Z}(y, z; \rho) = \frac{1}{2\pi \sqrt{(1-\rho^2) z^2}} \exp\left( -\frac{z^2 - 2\rho y + y^2 z^{-2}}{2(1-\rho^2)} \right) \]

and, observing that $z^2 - 2\rho y + y^2 z^{-2} = (1-\rho^2) z^2 + (y - \rho z^2)^2 z^{-2}$, it is:

\[ \begin{aligned} E[Y] &= \int_{-\infty}^{+\infty} \phi(z) \left[ \int_{-\infty}^{+\infty} \frac{y}{\sqrt{2\pi(1-\rho^2) z^2}} \exp\left( -\frac{(y - \rho z^2)^2}{2(1-\rho^2) z^2} \right) dy \right] dz \\ &= \rho \int_{-\infty}^{+\infty} z^2 \phi(z)\, dz \end{aligned} \]

where φ(z) is the density function of the standard normal distribution, and where the second line follows from the observation that the inner integral in the first line is the mean of a normally distributed random variable with mean ρz² and variance (1 − ρ²)z²; since ∫ z²φ(z) dz = 1 is the variance of a standard normal random variable, this gives E[Y] = ρ. Therefore, exploiting again the inverse transformation above:

\[ \begin{aligned} E[X_1 X_2] &= \sigma_1 \sigma_2\, E[Y] + \sigma_1 \mu_2\, E[Z] + \sigma_2 \mu_1\, E\!\left[ \frac{Y}{Z} \right] + \mu_1 \mu_2 \\ &= \rho \sigma_1 \sigma_2 + \mu_1 \mu_2 \end{aligned} \]

as postulated, because E[Z] = E[Y/Z] = 0 are both expectations of random variables that follow the standard normal distribution.


In light of this result, it is:

\[ \mathrm{Cov}[X_1, X_2] = \rho \sigma_1 \sigma_2, \qquad \mathrm{Corr}[X_1, X_2] = \rho \]

hence, the parameter ρ has an immediate interpretation as a correlation (and in fact its range is confined to the [−1, 1] interval).¹ For the sake of illustration, consider the bivariate distribution depicted in Figure 3.1, and suppose the sign of ρ is inverted: the result would be as in Figure 3.3 below, which quite clearly manifests – in graphical form – an inversion of the linear relationship that connects X₁ and X₂ in Example 3.5.

[Figure: surface plot of the joint density f_{X₁,X₂}(x₁, x₂), together with the marginal densities f_{X₁}(x₁) and f_{X₂}(x₂).]

Figure 3.3: Bivariate normal as in Example 3.5, but with ρ = −.4.

If the two arguments of a bivariate normal distribution – the two “marginal” random variables – are independent, it must thus be that ρ = 0. This case is instead represented in Figure 3.4, which is obtained by again substituting only that parameter. There, no linear dependence between the two random variables can be visually detected, neither positive nor negative.
¹ Note that, by construction of the previous transformation, Corr[X₁, X₂] = E[Y].


[Figure: surface plot of the joint density f_{X₁,X₂}(x₁, x₂), together with the marginal densities f_{X₁}(x₁) and f_{X₂}(x₂).]

Figure 3.4: Bivariate normal as in Example 3.5, but with ρ = 0.

Note that in all these cases the marginal distributions stay the same, since they do not depend upon the parameter ρ (as already mentioned, this is a unique feature of the bivariate normal distribution). ∎
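These population quantities are easily checked by simulation. The sketch below (an editorial addition, reusing the parameters of the previous examples) recovers both E[X₁X₂] = ρσ₁σ₂ + µ₁µ₂ and Corr[X₁, X₂] = ρ for the three values of ρ just displayed.

```python
import numpy as np

rng = np.random.default_rng(9)
mu1, mu2, s1, s2 = 1.0, 2.0, 0.5, 1.0
n = 500_000

results = {}
for rho in (-0.4, 0.0, 0.4):           # the three cases of Figures 3.1-3.4
    cov = [[s1**2, rho * s1 * s2],
           [rho * s1 * s2, s2**2]]
    x = rng.multivariate_normal([mu1, mu2], cov, size=n)
    cross_moment = np.mean(x[:, 0] * x[:, 1])   # ~ rho*s1*s2 + mu1*mu2
    sample_corr = np.corrcoef(x[:, 0], x[:, 1])[0, 1]
    results[rho] = (cross_moment, sample_corr)
    print(rho, cross_moment, sample_corr)
```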

Having characterized all the relevant moments of a given random vector x = (X₁, ..., X_K), it is useful to establish some notation for expressing all the relevant moments of a certain type for all the random variables involved. This is accomplished by means of tools typical of linear algebra, which turn out to be extremely useful for handling the moments of multivariate distributions. In particular, the mean vector – usually denoted as E[x] – is the collection of the means of all the random variables in the random vector x.

\[ E[x] = \begin{pmatrix} E[X_1] \\ E[X_2] \\ \vdots \\ E[X_K] \end{pmatrix} \]


The variance-covariance matrix instead, which is commonly denoted by Var[x], collects the variances of each random variable along the diagonal, and the covariances between each pair of elements of x outside the diagonal.

\[ \mathrm{Var}[x] = \begin{pmatrix} \mathrm{Var}[X_1] & \mathrm{Cov}[X_1, X_2] & \cdots & \mathrm{Cov}[X_1, X_K] \\ \mathrm{Cov}[X_2, X_1] & \mathrm{Var}[X_2] & \cdots & \mathrm{Cov}[X_2, X_K] \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{Cov}[X_K, X_1] & \mathrm{Cov}[X_K, X_2] & \cdots & \mathrm{Var}[X_K] \end{pmatrix} \]

Given how each element is calculated, it is often useful to express the variance of x as the expectation of a specific random matrix, as follows.

\[ \mathrm{Var}[x] = E\left[ (x - E[x])(x - E[x])^T \right] \]
It is worth making a few observations about the variance-covariance matrix:

• it has dimension K × K and is symmetric;

• the elements along its diagonal, the variances, are always nonnegative; those outside the diagonal, the covariances, can instead be negative;

• in analogy with the univariate case, one can establish the following:

\[ \mathrm{Var}[x] = E\left[ (x - E[x])(x - E[x])^T \right] = E[x x^T] - E[x]E[x]^T - E[x]E[x]^T + E[x]E[x]^T = E[x x^T] - E[x]E[x]^T \]

• finally, Var[x] is positive semi-definite: for any non-zero vector a of length K, the quadratic form aᵀ Var[x] a is nonnegative. This property is demonstrated later while analyzing the moments of linear transformations of random vectors.
Example 3.6. Summarizing the Moments of Bivariate Normal Distributions. All the moments of the bivariate normal distribution from the previous examples can be summarized using the following notation:

\[ \mu \equiv \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} = E[x] \qquad \text{and} \qquad \Sigma \equiv \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix} = \mathrm{Var}[x] \]

and it is straightforward to verify that Σ complies with the properties of all variance-covariance matrices. If x = (X₁, X₂) follows the bivariate normal distribution, one can write x ∼ N(µ, Σ). ∎


With the aid of some linear algebra, the usual properties of means and
variances are generalized to a multivariate environment. Consider a random
vector x with mean E [x] and variance Var [x] in the three following cases.
• Linear Transformations returning Scalars. Consider some vector a = (a₁, ..., a_K)ᵀ of length K which, multiplied with x, returns the random variable Y = aᵀx as a linear combination. Because expectations are linear operators, the mean of Y is:

\[ E[Y] = E[a^T x] = E[a_1 X_1 + \dots + a_K X_K] = a_1 E[X_1] + \dots + a_K E[X_K] = a^T E[x] \]

as for the variance of Y instead:

\[ \begin{aligned} \mathrm{Var}[Y] = \mathrm{Var}[a^T x] &= E\left[ \left( a^T x - E[a^T x] \right) \left( a^T x - E[a^T x] \right)^T \right] \\ &= E\left[ a^T (x - E[x])(x - E[x])^T a \right] \\ &= a^T E\left[ (x - E[x])(x - E[x])^T \right] a \\ &= a^T \mathrm{Var}[x]\, a \end{aligned} \]

where the last expression is a quadratic form that cannot be negative (showing that Var[x] is positive semi-definite); in particular:

\[ \mathrm{Var}[Y] = \sum_{k=1}^{K} \left[ a_k^2\, \mathrm{Var}[X_k] + 2 \sum_{\ell=1}^{k-1} a_k a_\ell\, \mathrm{Cov}[X_k, X_\ell] \right] \]

which exemplifies how it can be easier to work with matrices.


• Linear Transformations returning Vectors. Consider now a vector a of length J > 1, a matrix B of dimension J × K, and the random vector y = a + Bx = (Y₁, ..., Y_J)ᵀ resulting from J different linear combinations of x. Since expectations are linear operators, it is:

\[ E[y] = E[a + Bx] = a + B\, E[x] \]

while the variance-covariance matrix of y is given by:

\[ \mathrm{Var}[y] = \mathrm{Var}[a + Bx] = B\, \mathrm{Var}[x]\, B^T \]

noting that, if b_i and b_j are the i-th and the j-th rows of B, then the ij-th element of Var[y] equals $\mathrm{Cov}[b_i x, b_j x] = b_i\, \mathrm{Var}[x]\, b_j^T$.


• Non-linear Transformations of Random Vectors. Finally, look at the case of a J-dimensional non-linear vector-valued function g(x). A first-order Taylor expansion of g(·) around E[x] gives:

\[ g(x) \approx g(E[x]) + \frac{\partial}{\partial x^T} g(E[x])\, [x - E[x]] = g(E[x]) - \frac{\partial}{\partial x^T} g(E[x])\, E[x] + \frac{\partial}{\partial x^T} g(E[x])\, x \]

where $\frac{\partial}{\partial x^T} g(E[x])$ is a J × K Jacobian matrix containing, in each jk-th element, the derivative of the j-th equation of g(x) with respect to the k-th element of x. Hence, in analogy with the univariate case:

\[ E[g(x)] \approx g(E[x]) \]

is a generally poor approximation, but

\[ \mathrm{Var}[g(x)] \approx \left( \frac{\partial}{\partial x^T} g(E[x]) \right) \mathrm{Var}[x] \left( \frac{\partial}{\partial x^T} g(E[x]) \right)^T \]

is a considerably better one.
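Both the exact linear formulas and the delta-method style approximation can be illustrated numerically. In the sketch below (an editorial addition), all the parameter matrices are arbitrary choices and g(x) = X₁·X₃ is a hypothetical example function.

```python
import numpy as np

rng = np.random.default_rng(10)
n = 500_000
mu = np.array([1.0, -0.5, 2.0])
Sigma = np.array([[1.0, 0.3, 0.2],
                  [0.3, 2.0, -0.5],
                  [0.2, -0.5, 1.5]])
x = rng.multivariate_normal(mu, Sigma, size=n)

# Linear case: y = a + B x, so E[y] = a + B mu and Var[y] = B Sigma B^T.
a = np.array([0.5, -1.0])
B = np.array([[1.0, 2.0, 0.0],
              [0.0, -1.0, 3.0]])
y = a + x @ B.T
mean_err = np.abs(y.mean(axis=0) - (a + B @ mu)).max()
var_err = np.abs(np.cov(y.T) - B @ Sigma @ B.T).max()
print(mean_err, var_err)                # both small

# Non-linear case, g(x) = X1 * X3, with small noise so the first-order
# Taylor expansion around E[x] is accurate.
xs = rng.multivariate_normal(mu, 0.01 * Sigma, size=n)
g = xs[:, 0] * xs[:, 2]
J = np.array([mu[2], 0.0, mu[0]])       # Jacobian (gradient) of g at E[x]
delta_var = J @ (0.01 * Sigma) @ J
print(g.var(), delta_var)               # close, since Var[x] is small
```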
In a multivariate environment, it is sometimes useful to summarize the covariances between the elements of two random vectors x and y of length K_x and K_y respectively. This is best done via a cross-covariance matrix of dimension K_x × K_y (one should always be careful with terminology and not mistake it for a variance-covariance matrix). Such a matrix collects the covariances between every i-th element of x and every j-th element of y in its ij-th entries; note that, unlike a variance-covariance matrix, it is not symmetric in general, although it satisfies Cov[x, y]ᵀ = Cov[y, x].

\[ \mathrm{Cov}[x, y] = \begin{pmatrix} \mathrm{Cov}[X_1, Y_1] & \mathrm{Cov}[X_1, Y_2] & \cdots & \mathrm{Cov}[X_1, Y_{K_y}] \\ \mathrm{Cov}[X_2, Y_1] & \mathrm{Cov}[X_2, Y_2] & \cdots & \mathrm{Cov}[X_2, Y_{K_y}] \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{Cov}[X_{K_x}, Y_1] & \mathrm{Cov}[X_{K_x}, Y_2] & \cdots & \mathrm{Cov}[X_{K_x}, Y_{K_y}] \end{pmatrix} \]
Like a variance-covariance matrix, a cross-covariance matrix can be recast as the expectation of a random matrix, which can be simplified.

\[ \mathrm{Cov}[x, y] = E\left[ (x - E[x])(y - E[y])^T \right] = E[x y^T] - E[x]E[y]^T - E[x]E[y]^T + E[x]E[y]^T = E[x y^T] - E[x]E[y]^T \]

Clearly, the cross-covariance matrix is a collection of zeros if x and y are independent, as E[x yᵀ] = E[x]E[y]ᵀ.


Other properties are obtained by extending some previous observations to the case of two random vectors; here, they are summarized briefly.

• If two linearly transformed vectors u = a_x + B_x x and v = a_y + B_y y have length J_u and J_v, their J_u × J_v cross-covariance matrix is:

\[ \mathrm{Cov}[u, v] = \mathrm{Cov}[a_x + B_x x,\, a_y + B_y y] = B_x\, \mathrm{Cov}[x, y]\, B_y^T \]

• however, if u = g_x(x) and v = g_y(y) are obtained via two non-linear transformations, the following approximation can be useful:

\[ \mathrm{Cov}[u, v] \approx \left( \frac{\partial}{\partial x^T} g_x(E[x]) \right) \mathrm{Cov}[x, y] \left( \frac{\partial}{\partial y^T} g_y(E[y]) \right)^T \]
Furthermore, the cross-covariance matrix of x and y relates to the respective variance-covariance matrices through the following relationship.

\[ \mathrm{Var}[x] - \mathrm{Cov}[x, y]\, [\mathrm{Var}[y]]^{-1}\, \mathrm{Cov}[x, y]^T \ge 0 \]

The above inequality is to be interpreted in the sense that the matrix on the left-hand side is positive semi-definite. This relationship is tedious to prove, but it essentially represents a bound on the cross-covariance matrix, extending the logic of the intermediate result (3.3) from Theorem 3.4.
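The first of these linearity properties holds exactly even in its sample analogue, as the following sketch (an editorial addition; `cross_cov` is a helper defined here, not notation from the text) shows.

```python
import numpy as np

def cross_cov(u, v):
    """Sample analogue of Cov[u, v] = E[(u - E u)(v - E v)^T]; rows = draws."""
    return (u - u.mean(axis=0)).T @ (v - v.mean(axis=0)) / len(u)

rng = np.random.default_rng(11)
n = 400_000
# Draw a joint vector of length 4 and split it into x and y (length 2 each);
# the chosen covariance makes every entry of Cov[x, y] equal to 0.5.
C = 0.5 * np.eye(4) + 0.5 * np.ones((4, 4))
z = rng.multivariate_normal(np.zeros(4), C, size=n)
x, y = z[:, :2], z[:, 2:]

Bx = np.array([[1.0, 2.0], [0.0, 1.0]])
By = np.array([[3.0, -1.0], [1.0, 1.0]])
u = x @ Bx.T
v = y @ By.T

lhs = cross_cov(u, v)
rhs = Bx @ cross_cov(x, y) @ By.T   # Cov[Bx x, By y] = Bx Cov[x,y] By^T
print(lhs)
print(rhs)
```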

3.4 Multivariate Moment Generation


Both the moment-generating and the characteristic functions are easily generalized to a multivariate environment, where they are especially useful for deriving the distribution of certain linear combinations of random variables.
Definition 3.13. Moment generating function (multivariate case). Given a random vector x = (X₁, ..., X_K) with support X, the moment generating function M_x(t) is defined, for t = (t₁, ..., t_K) ∈ $\mathbb{R}^K$, as the expectation of the transformation g(x) = exp(tᵀx).

\[ M_x(t) = E\left[ \exp\left( t^T x \right) \right] = E\left[ \exp\left( \sum_{k=1}^{K} t_k X_k \right) \right] \]

Definition 3.14. Characteristic function (multivariate case). Given
a random vector x = (X1, . . . , XK) with support X, the characteristic function
ϕx(t) is defined, for t = (t1, . . . , tK) ∈ R^K, as the expectation of the
transformation g(x) = exp(it^T x).

ϕx(t) = E[exp(it^T x)] = E[exp( i Σ_{k=1}^K tk Xk )]


The r-th (uncentered) moments of each k-th element of the random vector x
can be calculated in analogy with the univariate case.

E[Xk^r] = ∂^r Mx(t)/∂tk^r |_{t=0} = (1/i^r) · ∂^r ϕx(t)/∂tk^r |_{t=0}

Furthermore, the cross-moments are obtained, for two integers r and s, as:

E[Xk^r Xℓ^s] = ∂^{r+s} Mx(t)/(∂tk^r ∂tℓ^s) |_{t=0} = (1/i^{r+s}) · ∂^{r+s} ϕx(t)/(∂tk^r ∂tℓ^s) |_{t=0}

which follows since:

∂^{r+s} Mx(t)/(∂tk^r ∂tℓ^s) = E[ ∂^{r+s}/(∂tk^r ∂tℓ^s) exp( Σ_{k=1}^K tk Xk ) ]
                            = E[ Xk^r Xℓ^s exp( Σ_{k=1}^K tk Xk ) ]

and the case of the characteristic function is analogous. This fact allows one to
calculate covariances using these two important functions.
Example 3.7. Moment Generating Function and Covariance of the
Bivariate Normal Distribution. The moment generating function of the
bivariate normal distribution is the following.

MX1,X2(t1, t2) = E[exp(t1 X1 + t2 X2)]
              = exp( t1 µ1 + t2 µ2 + (1/2)(t1² σ1² + 2 t1 t2 ρ σ1 σ2 + t2² σ2²) )

Obtaining this expression while keeping track of all these parameters is not
as difficult as it is annoying, therefore a proper and more elegant derivation
is postponed to the later, more general analysis of the multivariate normal
distribution. Here the point is to show how the covariance between X1 and
X2 can be derived via the moment generating function. It is not difficult to
see that E[Xk] = µk and E[Xk²] = σk² + µk² for k = 1, 2, as in the univariate
case. As per the first cross-moment, some calculations show that:

∂² MX1,X2(t1, t2)/(∂t1 ∂t2) = [ (µ1 + t1 σ1² + t2 ρ σ1 σ2)(µ2 + t2 σ2² + t1 ρ σ1 σ2) + ρ σ1 σ2 ] ·
                              · exp( t1 µ1 + t2 µ2 + (1/2)(t1² σ1² + 2 t1 t2 ρ σ1 σ2 + t2² σ2²) )

and evaluating the above expression for t1 = 0 and t2 = 0 gives:

E[X1 X2] = ∂² MX1,X2(t1, t2)/(∂t1 ∂t2) |_{t1,t2=0} = ρ σ1 σ2 + µ1 µ2

which is exactly as derived in Example 3.5. 
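The cross-moment just derived can also be recovered without any calculus, by differentiating the moment generating function numerically at the origin; the parameter values below are arbitrary.

```python
import numpy as np

# Arbitrary illustrative parameters of the bivariate normal distribution
mu1, mu2, s1, s2, rho = 0.5, -1.0, 1.2, 0.8, 0.6

def M(t1, t2):
    # bivariate normal moment generating function, as stated above
    return np.exp(t1 * mu1 + t2 * mu2
                  + 0.5 * (t1**2 * s1**2 + 2 * t1 * t2 * rho * s1 * s2
                           + t2**2 * s2**2))

# mixed partial derivative at (0, 0) via central finite differences
h = 1e-4
EX1X2 = (M(h, h) - M(h, -h) - M(-h, h) + M(-h, -h)) / (4 * h**2)

# should match E[X1 X2] = rho*sigma1*sigma2 + mu1*mu2 up to discretization error
assert abs(EX1X2 - (rho * s1 * s2 + mu1 * mu2)) < 1e-4
```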


Like in the univariate case, both the moment generating and the char-
acteristic functions uniquely characterize a distribution, but only the more
“complex” characteristic function is guaranteed to always exist. In addition,
in the multivariate context it is possible to derive some results of extreme
importance about independent random variables.
Theorem 3.6. Moment generating and characteristic functions of
independent random variables. If the random variables from a random
vector x = (X1, . . . , XK) are mutually independent, the moment generating
function (if it exists) and the characteristic function of x equal respectively
the product of the K moment generating functions (if they exist) and the K
characteristic functions of the random variables involved.

Mx(t) = ∏_{k=1}^K MXk(tk)

ϕx(t) = ∏_{k=1}^K ϕXk(tk)

Proof. This is an application of Theorem 3.3 and Theorem 3.5 upon a sequence
of K transformed random variables exp(t1 X1), . . . , exp(tK XK), which are
themselves mutually independent. For moment generating functions:

Mx(t) = E[exp(t^T x)] = E[exp( Σ_{k=1}^K tk Xk )]
      = E[ ∏_{k=1}^K exp(tk Xk) ]
      = ∏_{k=1}^K E[exp(tk Xk)]
      = ∏_{k=1}^K MXk(tk)

and the case of characteristic functions is analogous.


Theorem 3.7. Moment generating and characteristic functions of
linear combinations of independent random variables. Consider a
random variable Y obtained as the sum of N linearly transformed, mutually
independent random variables x = (X1, . . . , XN):

Y = Σ_{i=1}^N (ai + bi Xi)


where (ai, bi) ∈ R² for i = 1, . . . , N. The moment generating and characteristic
functions of Y are obtained as follows.

MY(t) = exp( t Σ_{i=1}^N ai ) ∏_{i=1}^N MXi(bi t)

ϕY(t) = exp( t Σ_{i=1}^N ai ) ∏_{i=1}^N ϕXi(bi t)
Proof. For moment generating functions, this result is obtained as:

MY(t) = E[exp(tY)] = E[exp( t · Σ_{i=1}^N (ai + bi Xi) )]
      = exp( t Σ_{i=1}^N ai ) E[exp( Σ_{i=1}^N t bi Xi )]
      = exp( t Σ_{i=1}^N ai ) E[ ∏_{i=1}^N exp(t bi Xi) ]
      = exp( t Σ_{i=1}^N ai ) ∏_{i=1}^N E[exp(t bi Xi)]
      = exp( t Σ_{i=1}^N ai ) ∏_{i=1}^N MXi(bi t)

where the second-to-last line follows from observing that b1 X1, . . . , bN XN
are N mutually independent random variables (as per Theorem 3.3) as well.
The case of characteristic functions is analogous.

This powerful result often allows one to easily obtain the distribution
of some linear combination of random variables x = (X1, . . . , XK), if
their underlying distribution is known and its moment generating or characteristic
function can be manipulated in such a way that it returns the moment
generating function of another known random variable. A list of important
cases follows; for all results, the proof is either provided or outlined. Below,
the notation Xi indicates one of N random variables (for i = 1, . . . , N) that
are all mutually independent and follow the indicated distribution.

Observation 3.5. If Xi ∼ Be(p), it is Σ_{i=1}^N Xi ∼ BN(p, N).

Proof. If MXi(t) = p exp(t) + (1 − p), it suffices to multiply the N identical
moment generating functions: M_{Σ Xi}(t) = [p exp(t) + (1 − p)]^N.
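A direct numerical check of this observation, with illustrative values p = 0.3 and N = 6: the N-fold product of the Bernoulli moment generating function coincides with the moment generating function computed from the binomial probability mass function.

```python
import math

p, N = 0.3, 6  # illustrative parameter values

for t in (-1.0, -0.2, 0.0, 0.5, 1.3):
    # product of N identical Bernoulli moment generating functions
    product_form = (p * math.exp(t) + (1 - p)) ** N
    # MGF computed directly from the binomial probability mass function
    pmf_form = sum(math.comb(N, k) * p**k * (1 - p) ** (N - k) * math.exp(t * k)
                   for k in range(N + 1))
    assert math.isclose(product_form, pmf_form)
```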


The next five results are easily demonstrated through the same approach as
in the previous observation: that is, by multiplying the moment generating
functions of the N specified primitive, independent random variables Xi .
Observation 3.6. If Xi ∼ NB(p, 1), it is Σ_{i=1}^N Xi ∼ NB(p, N).

Observation 3.7. If Xi ∼ Pois(λ), it is Σ_{i=1}^N Xi ∼ Pois(Nλ).

Observation 3.8. If Xi ∼ Exp(λ), it is Σ_{i=1}^N Xi ∼ Γ(N, λ^{−1}).

Observation 3.9. If Xi ∼ χ²(κi), it is Σ_{i=1}^N Xi ∼ χ²( Σ_{i=1}^N κi ).

Observation 3.10. If Xi ∼ Γ(αi, β), it is Σ_{i=1}^N Xi ∼ Γ( Σ_{i=1}^N αi, β ).

Observation 3.11. If Xi ∼ N(µi, σi²), for all real ai, bi it is as follows.

Y = Σ_{i=1}^N (ai + bi Xi) ∼ N( Σ_{i=1}^N (ai + bi µi), Σ_{i=1}^N bi² σi² )

Proof. This requires a few steps:

MY(t) = exp( t Σ_{i=1}^N ai ) ∏_{i=1}^N MXi(bi t)
      = exp( t Σ_{i=1}^N ai ) exp( t Σ_{i=1}^N bi µi + t² Σ_{i=1}^N bi² σi²/2 )
      = exp( t Σ_{i=1}^N (ai + bi µi) + t² Σ_{i=1}^N bi² σi²/2 )

and the second line is recognized as the moment generating function of a
normal distribution with the parameters indicated for Y.
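A Monte Carlo sketch of Observation 3.11, with N = 3 and arbitrary illustrative parameters; the tolerances are deliberately loose relative to the simulation error.

```python
import numpy as np

rng = np.random.default_rng(42)

a = np.array([1.0, -0.5, 2.0])      # illustrative shifts
b = np.array([2.0, 1.0, -1.0])      # illustrative scales
mu = np.array([0.0, 1.0, -1.0])     # illustrative means
sigma = np.array([1.0, 0.5, 2.0])   # illustrative standard deviations

X = rng.normal(mu, sigma, size=(200_000, 3))  # independent normal columns
Y = (a + b * X).sum(axis=1)                   # Y = sum of (ai + bi Xi)

mean_theory = (a + b * mu).sum()        # sum of (ai + bi mu_i)
var_theory = (b**2 * sigma**2).sum()    # sum of bi^2 sigma_i^2

assert abs(Y.mean() - mean_theory) < 0.05
assert abs(Y.var() - var_theory) < 0.2
```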
Observation 3.12. If log(Xi) ∼ N(µi, σi²), for all real ai, bi it is as follows.

log( ∏_{i=1}^N exp(ai) Xi^{bi} ) ∼ N( Σ_{i=1}^N (ai + bi µi), Σ_{i=1}^N bi² σi² )

Proof. Since log( ∏_{i=1}^N exp(ai) Xi^{bi} ) = Σ_{i=1}^N [ai + bi log(Xi)], the previous
observation extends easily.
Observation 3.13. If X1 ∼ Exp (λ1 ) and X2 ∼ Exp (λ2 ), and the two ran-
dom variables X1 and X2 are independent, the random variable Y obtained
as Y = X1 /λ1 − X2 /λ2 is such that Y ∼ Laplace (0, 1).


Proof. Define the two random variables W1 = X1 /λ1 and W2 = −X2 /λ2 ,
which are obviously independent. By the properties of moment generating
functions for linear transformations, the two transformed random variables
have moment generating functions:

MW1(t) = (1 − t)^{−1}
MW2(t) = (1 + t)^{−1}

and since Y = W1 + W2, the moment generating function of Y is:

MY(t) = MW1(t) MW2(t) = (1 − t²)^{−1}

that is, that of a standard Laplace distribution.
Observation 3.14. If X1 ∼ Gumbel (µ1 , σ) and X2 ∼ Gumbel (µ2 , σ), and
the two random variables X1 and X2 are independent, the random variable
Y obtained as Y = X1 − X2 is such that Y ∼ Logistic (µ1 − µ2 , σ).
Proof. The moment generating function of Xi – for i = 1, 2 – is given by
MXi (t) = exp (µi t) Γ (1 − σt). Similarly, the transformed random variables
Wi = −Xi – again for i = 1, 2 – have moment generating functions given
by MWi (t) = exp (−µi t) Γ (1 + σt). It is easy to see that X1 is independent
of W2 and vice versa. Since Y = X1 + W2 , the moment generating function
of Y is therefore obtained as:
MY(t) = MX1(t) MW2(t)
      = exp(µ1 t) Γ(1 − σt) · exp(−µ2 t) Γ(1 + σt)
      = exp(µ1 t − µ2 t) Γ(1 − σt) Γ(1 + σt)/Γ(2)
      = exp((µ1 − µ2) t) · B(1 − σt, 1 + σt)

which is indeed the moment generating function of the logistic distribution
with the specified parameters (note that Γ(2) = 1! = 1).
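The last two steps lean on properties of the gamma function that can be verified numerically: Γ(2) = 1! = 1, and (via Euler's reflection formula, an outside identity not derived in these notes) Γ(1 − s)Γ(1 + s) = πs/sin(πs), which is the kernel appearing in the logistic moment generating function.

```python
import math

# Euler's reflection formula implies Gamma(1 - s)*Gamma(1 + s) = pi*s/sin(pi*s);
# check it at a few admissible values of s = sigma*t with 0 < |s| < 1.
for s in (-0.7, -0.2, 0.1, 0.5, 0.9):
    lhs = math.gamma(1 - s) * math.gamma(1 + s)
    rhs = math.pi * s / math.sin(math.pi * s)
    assert math.isclose(lhs, rhs, rel_tol=1e-12)

assert math.isclose(math.gamma(2), 1.0)  # Gamma(2) = 1! = 1, as used above
```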

3.5 Conditional Distributions


Many conceptual and practical exercises involving multivariate distributions
involve “fixing” the value of specific random variables, or “restricting” them
to a subset of their support, and analyzing the resulting distribution of the
remaining random variables. Such exercises are also called conditioning and
result in conditional distributions; these, together with the conditional
moments that are derived from them, are of extreme practical importance
in statistics and econometrics. For simplicity, only conditioning on specific
values (as opposed to subsets of the support) is discussed here.


The ensuing discussion considers two generic random vectors x and y
of dimension Kx ≥ 1 and Ky ≥ 1 (with possibly Kx ≠ Ky) and supports
X and Y respectively, which are expressed as follows.

x = (X1 . . . XKx)^T
y = (Y1 . . . YKy)^T

In what follows, it is presumed for simplicity's sake that both x and y are
composed either of discrete random variables only or of continuous random
variables only, but the two types need not coincide between vectors (that
is, x might include only discrete random variables and y only continuous
ones, or vice versa). The definition of conditional mass or density function
is the point of departure of the discussion, as it allows one to subsequently
define the conditional cumulative distribution.
Definition 3.15. Conditional mass or density function. Consider the
combined random vector (x, y) with joint mass/density function fx,y(x, y).
Suppose that the random vector x has a probability mass or density function
fx(x). The conditional mass or density function of y, given x = x, is
defined as follows for all x ∈ X:

f_{y|x}(y | x = x) = fx,y(x, y) / fx(x)
It is a conditional mass function if all the random variables in y are discrete,
and a conditional density function if they are all continuous.
Definition 3.16. Conditional cumulative distribution. Consider the
combined random vector (x, y) with joint mass/density function fx,y(x, y).
The conditional cumulative distribution of y, given x = x, is defined as:

F_{y|x}(y | x = x) = Σ_{t∈Y: t≤y} f_{y|x}(t | x = x)

if all the random variables in y are discrete, and

F_{y|x}(y | x = x) = ∫_{−∞}^{y1} · · · ∫_{−∞}^{yKy} f_{y|x}(t | x = x) dt

if all the random variables in y are continuous.


When x is some generic (undetermined) realization of x – as in the virtual
entirety of the conditional probabilities and moments analyzed in these lec-
tures – the conditional density and the conditional cumulative distribution
are often written more simply as f y|x (y| x) and F y|x (y| x), respectively.


If both x and y are all-discrete random vectors, the interpretation of
a conditional mass function in terms of conditional probability of y given
x = x is obvious from the definition. If the two random vectors are instead
both all-continuous, the definition restricts the analysis to a hyperplane of
the original space considered by the joint distribution, and allows one to make
conditional probability statements of the following sort.

P( a1 ≤ Y1 ≤ b1 ∩ · · · ∩ aKy ≤ YKy ≤ bKy | X1 = x1 ∩ · · · ∩ XKx = xKx ) =
  = ∫_{a1}^{b1} · · · ∫_{aKy}^{bKy} f_{y|x}(y | x = x) dy

Example 3.8. Conditional Normal Distribution. Consider the bivariate
normal distribution from Example 3.2. Some tedious algebraic calculations
would reveal that the conditional distribution of one variable, say X1,
conditional on the other variable, say X2, is obtained as:

f_{X1|X2}(x1 | x2) = [2πσ1²(1 − ρ²)]^{−1/2} exp( −[x1 − µ1 − ρ(σ1/σ2)(x2 − µ2)]² / [2σ1²(1 − ρ²)] )

where the parameters are dropped in the expression on the left hand side
for the sake of brevity. One can observe that the resulting density is that
of another univariate normal distribution with different parameters, which
can be expressed in compact form as follows for any X2 = x2.

X1 | X2 = x2 ∼ N( µ1 + ρ(σ1/σ2)(x2 − µ2), σ1²(1 − ρ²) )

Clearly, the expression for the distribution of X2 conditional on X1 = x1,
for any x1 ∈ R, is symmetrical. 
If y is an all-continuous random vector and x is an all-discrete one, the
definition of conditional density function may not be directly applicable –
short of resorting to a more general mathematical definition of joint density
that allows for discrete mass points. However, the concept is still valid as
much as it is useful, and it is best illustrated with an example.
Example 3.9. Conditional height distribution. Remember Example
3.3 about the height distribution with mixed genders. If one aims to describe
the density function of height for females only, the appropriate concept is
that of a conditional distribution:

f_{H|G=1}(h | g = 1) = (1/σF) φ( (h − µF)/σF )

and symmetrically for males. 


It is important to observe that the concept of conditional distribution is


in a sense moot for independent random variables. Suppose, indeed, that
every single random variable in y is independent from every single random
variable in x. This implies that the joint mass/density function of the two
random vectors can be described as the product of the two “marginal” joint
densities for each separate random vector:
fx,y (x, y) = fx (x) fy (y)
by the definition of conditional mass/density function, this implies that:
f y|x (y| x) = fy (y)
f x|y (x| y) = fx (x)
in other words, the conditional distribution of y given x is equal to the
unconditional distribution of y, and vice versa.
The mean, the variance and other moments are appropriately defined
for conditional distributions as well. The conditional expectation, which
is also called regression, is defined for discrete random variables as:
E[y | x] = Σ_{y1∈Y1} · · · Σ_{yKy∈YKy} y f_{y|x}(y | x)

and for continuous random variables as:


E[y | x] = ∫_{Y1} · · · ∫_{YKy} y f_{y|x}(y | x) dy

and it is a vector if Ky > 1. Similarly, the conditional variance is, in the


discrete case:
Var[y | x] = Σ_{y1∈Y1} · · · Σ_{yKy∈YKy} (y − E[y | x])(y − E[y | x])^T f_{y|x}(y | x)

and in the continuous case:


Var[y | x] = ∫_{Y1} · · · ∫_{YKy} (y − E[y | x])(y − E[y | x])^T f_{y|x}(y | x) dy

and it is a (conditional) variance-covariance matrix if Ky > 1. As usual with


variances, even its conditional version can be expressed in a more compact
form and decomposed into simpler uncentered moments.
Var[y | x] = E[ (y − E[y | x])(y − E[y | x])^T | x ]
           = E[yy^T | x] − E[y | x] E[y | x]^T

The results analyzed above relative to the moments of functions of random


vectors naturally extend to conditional moments as well.


Example 3.10. Conditional Moments of the Bivariate Normal Distribution.
The mean and variance for the conditional distribution X1 | X2
examined in Example 3.8 are clearly the following.

E[X1 | X2 = x2] = µ1 + ρ(σ1/σ2)(x2 − µ2)     (3.4)
Var[X1 | X2 = x2] = σ1²(1 − ρ²)              (3.5)

Symmetric expressions apply to the moments of X2 | X1. 


It is quite common to express conditional moments as functions of the
conditioned variables. In such a case they take names such as conditional
expectation function (CEF) or conditional variance function (CVF),
and the conditioned objects are best indicated with the notation for random
variables/vectors (e.g. X or y) as opposed to their realizations (e.g. x or y).
Two results about conditional moment functions turn out to be extremely
useful for the sake (among others) of analyzing econometric estimators.
Theorem 3.8. Law of Iterated Expectations. Given any two random
vectors x and y, it is:

E[y] = Ex[E[y | x]]

where Ex[·] denotes an expectation taken over the support of x.

Proof. In the continuous case, apply the following decomposition:

E[y] = ∫_X ∫_Y y fx,y(x, y) dy dx
     = ∫_X ∫_Y y f_{y|x}(y | x) fx(x) dy dx
     = ∫_X fx(x) [ ∫_Y y f_{y|x}(y | x) dy ] dx
     = Ex[E[y | x]]

and the discrete case is analogous (summations substitute integrals).


It must be observed that the Law of Iterated Expectations is often applied
in situations where y is a function of x itself. In these cases, the evaluation
of the inner expectation is often simplified since if x is conditioned upon it is
treated as constant, hence it can be taken outside the expectation operator.
Example 3.11. The Bivariate Linear Regression Model. Economet-
rics revolves around the analysis of statistical models, which are based upon
economic theory, that specify the response of certain endogenous variables


Yi to some other endogenous variables Xi (or Zi ), where the subscript i de-


notes the unit of observation of a sample – see lecture 5. These relationships
are best framed via conditional distributions and conditional moments. The
point of departure for much of econometrics are linear regression models,
that is linear specifications of the conditional expectation function, like:
E[Yi | X1i, . . . , XKi] = β0 + β1 X1i + · · · + βK XKi
where (β0 , β1 , . . . , βK ) are the parameters of interest.
The simplest linear regression model is the bivariate one, which involves
two random variables Yi and Xi characterized by the following relationship:
E [Yi | Xi ] = β0 + β1 Xi (3.6)
which is equivalently written as follows.
E [Yi − β0 − β1 Xi | Xi ] = 0 (3.7)
The parameter β0 is typically loaded with the interpretation as the condi-
tional mean of Yi given Xi = 0, or equivalently, as the “constant” coefficient
that satisfies the following relationship.
E [Yi − β0 − β1 Xi ] = 0
Consider the following application of the Law of Iterated Expectations: the
objective is to analyze the mean of the random variable Xi (Yi − β0 − β1 Xi ).
E [Xi (Yi − β0 − β1 Xi )] = EX [E [Xi (Yi − β0 − β1 Xi )| Xi ]]
= EX [Xi · E [(Yi − β0 − β1 Xi )| Xi ]]
=0
In the second line Xi can be taken outside the inner expectation operator
since there it is treated as a constant; the result is ultimately zero because
of (3.7). Consequently, one can establish the following system featuring two
equations and two unknown parameters, β0 and β1 .
E [Yi − β0 − β1 Xi ] = 0 (3.8)
E [Xi (Yi − β0 − β1 Xi )] = 0 (3.9)
After some manipulation, the solution for β0 and β1 obtains as:
Cov [Xi , Yi ]
β0 = E [Yi ] − · E [Xi ] (3.10)
Var [Xi ]
Cov [Xi , Yi ]
β1 = (3.11)
Var [Xi ]

116
3.5. Conditional Distributions

although (3.10) is more commonly written as β0 = E [Yi ] − β1 E [Xi ]. Note


that these are relationships that hold in the population as a consequence of
the initial assumption (3.6) about linearity of the CEF. Parameter β1 , in
particular, is called the regression slope and it bears an interpretation as
the average response of Yi to marginal variations in Xi . It is related to the
correlation between Xi and Yi through the following relationship.
β1 = Corr[Xi, Yi] · √(Var[Yi]) / √(Var[Xi])
In later lectures, these results are generalized to the multivariate case.

[Figure 3.5 plots the conditional densities f_{Yi|Xi}(yi | xi) over the (xi, yi) plane, together with the regression line E[Yi | Xi].]
Note: the conditional distribution Yi | Xi is normal, but with parameters that vary as a function of Xi.
Selected density functions of Yi | Xi are displayed for xi = {0, 1, 2, 3}.

Figure 3.5: A bivariate linear regression model for β0 = 1 and β1 = 1/3

Figure 3.5 depicts an example of a bivariate linear regression model where


the conditional distribution of Yi | Xi is normal; in this case, the parameters
of the normal distribution vary visibly along the support of Xi . Note that
the conditional distribution of Yi | Xi can be left unrestricted, so long as the
conditional moment E [Yi | Xi ] exists and complies with (3.6). 
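The population relationships (3.10)–(3.11) have exact finite-sample analogues: for any dataset, the least-squares slope equals the ratio of the sample covariance to the sample variance, and the intercept satisfies the sample version of (3.10). A quick sketch on simulated data (all numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulated data from a hypothetical model with beta0 = 1 and beta1 = 1/3,
# matching the parameters of Figure 3.5
x = rng.uniform(0, 4, size=500)
y = 1.0 + x / 3.0 + 0.5 * rng.standard_normal(500)

slope, intercept = np.polyfit(x, y, 1)   # least-squares fit of y on x

beta1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)  # sample analogue of (3.11)
beta0 = y.mean() - beta1 * x.mean()                     # sample analogue of (3.10)

assert np.isclose(slope, beta1)
assert np.isclose(intercept, beta0)
```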


The other important result about conditional moment functions follows.


Theorem 3.9. Law of Total Variance. This result, otherwise known as
the variance decomposition, states that given any two random vectors x
and y:
Var [y] = Varx [E [y| x]] + Ex [Var [y| x]]
where Varx [·] denotes that the variance is obtained by summing/integrating
over the support of x; while Ex [·] represents a summation or an integral
applied to a matrix and returning, element-by-element, yet another matrix.
Proof. The proof repeatedly applies the Law of Iterated Expectations.

Var[y] = E[yy^T] − E[y] E[y]^T
       = Ex[E[yy^T | x]] − Ex[E[y | x]] [Ex[E[y | x]]]^T
       = Ex[ E[yy^T | x] − E[y | x] E[y | x]^T ] +
         + Ex[ E[y | x] E[y | x]^T ] − Ex[E[y | x]] [Ex[E[y | x]]]^T
       = Ex[Var[y | x]] + Varx[E[y | x]]

The result follows by adding and subtracting Ex[ E[y | x] E[y | x]^T ].

Example 3.12. Variance decomposition. Suppose that a researcher is


examining how a certain continuous random variable Y (say, the logarithm
of income) differs across four specific groups in the population. These groups
are coded as X = 1, 2, 3, 4 – X is clearly a discrete random variable – and
may correspond to, say, different ethnicities or combinations of two binary
characteristics such as age and education. The researcher finds out that the
log-income distribution, conditional on each demographic group, is normal:
Y | X = 1 ∼ N(3, 1.5)
Y | X = 2 ∼ N(4.5, 2.5)
Y | X = 3 ∼ N(5.5, 3)
Y | X = 4 ∼ N(7, 4)
therefore, it appears that to a higher conditional mean corresponds a higher
conditional variance. Furthermore, each group is found with equal proba-
bility in the population, meaning that P (X = x) = 0.25 for x = 1, 2, 3, 4.
Hence, by the Law of Iterated Expectations, the grand mean of Y in the
population can be calculated as follows.
E[Y] = EX[E[Y | X]] = (1/4) Σ_{x=1}^4 E[Y | X = x] = 5


The researcher, however, seems more intent on analyzing how the variation
of Y differs across groups with respect to the variation of Y in the
population as a whole. The interest of the researcher can be, for example,
to gauge to what extent the resentment against inequality interacts with issues
about ethnicity. The concern of the researcher is only heightened after
having visualized a plot of the four conditional distributions, which is
reported in Figure 3.6 below. The figure shows how not only the means of Y
markedly differ across the four groups, but also the variation of log-income
is quite heterogeneous. Problems like this are the domain of the so-called
analysis of the variance, a set of statistical techniques for assessing differences
about certain characteristics between groups of a population.
[Figure 3.6 plots the four conditional densities f_{Yi|Xi}(yi | xi), one per group.]
Note: the figure displays the four conditionally normal distributions of Yi | Xi for every Xi = 1, 2, 3, 4.
The four straight lines delimited by circles and placed beneath each density function denote the range
of all values of Yi | Xi within two standard deviations below or above each group's conditional mean.

Figure 3.6: Hypothetical log-income distribution by four groups

To the rescue of the researcher comes the Law of Total Variance, which
in this bivariate case reads as follows.

Var [Y ] = VarX [E [Y | X]] + EX [Var [Y | X]]


The first component on the right hand side, VarX [E [Y | X]] is the so-called
between group variation and is interpreted as the “variance of the con-
ditional means” – that is, how much do the four groups differ on average (a
more direct measure of cross-group inequality). Here, this is:
VarX[E[Y | X]] = (1/4) Σ_{x=1}^4 (E[Y | X = x] − EX[E[Y | X]])² = 2.125

where EX[E[Y | X]] = 5 – as previously calculated by averaging the conditional
means over X. The second component on the right hand side of the
variance decomposition, EX[Var[Y | X]], is called the within group variation
and is interpreted as the "mean of the conditional variances," identifying
the part of the variance that depends on the relative position of all individuals
against their own group average (that is, their conditional mean) over
all the groups. In this case this is:

EX[Var[Y | X]] = (1/4) Σ_{x=1}^4 Var[Y | X = x] = 2.75

with about the same magnitude as the between group variation. Therefore,
the researcher concludes that the overall inequality has a very strong group
component, which is likely to bear social and political consequences. 

3.6 Two Multivariate Distributions


This lecture is concluded with the discussion of two important multivariate
distributions, a discrete and a continuous one, which play a special role in
both statistics and econometrics. The discrete one is the multinomial dis-
tribution, a generalization of the binomial distribution. The continuous one
cannot be anything but the generalized multivariate normal distribution,
whose relationship with the univariate case is obvious from its name.

Multinomial Distribution
The multinomial distribution describes a variation of the binomial experi-
ment with many, mutually exclusive, alternatives. Specifically, suppose one
is making n draws, and each of these can end up with the realization of one
and only one event among K ≥ 2 that are possible. All these events have
probability pk ∈ [0, 1] of happening (for k = 1, . . . , K), with Σ_{k=1}^K pk = 1.
After the n draws, the result is a list of success counts for each alternative,


a list that one could write as x = (X1, . . . , XK) with Xk ∈ {0, 1, . . . , n} for
k = 1, . . . , K. The support of this random vector is thus the following set.

X = { x = (x1, . . . , xK) ∈ {0, 1, . . . , n}^K : Σ_{k=1}^K xk = n }

The probability mass function of this particular random vector is given by:

fx(x1, . . . , xK; n, p1, . . . , pK) = ( n! / ∏_{k=1}^K xk! ) ∏_{k=1}^K pk^{xk}     (3.12)

where the multinomial coefficient n! · ( ∏_{k=1}^K xk! )^{−1} counts, in what appears
to be an extension of the binomial coefficient, the number of realizations
containing exactly (x1, . . . , xK) successes for each alternative out of n draws.
The cumulative distribution clearly sums the mass function over points in
the support as follows, for t = (t1 , . . . , tK ).
Fx(x1, . . . , xK; n, p1, . . . , pK) = Σ_{t∈X: t≤x} ( n! / ∏_{k=1}^K tk! ) ∏_{k=1}^K pk^{tk}     (3.13)

This distribution draws its name from the multinomial theorem, which is
useful to analyze it. It helps show that the total probability mass equals 1:

P(x ∈ X) = Σ_{x∈X} fx(x1, . . . , xK; n, p1, . . . , pK)
         = Σ_{x∈X} ( n! / ∏_{k=1}^K xk! ) ∏_{k=1}^K pk^{xk}
         = ( Σ_{k=1}^K pk )^n
         = 1
as well as the calculation of the moment generating function:

Mx(t1, . . . , tK) = Σ_{x∈X} exp( Σ_{k=1}^K tk xk ) · fx(x1, . . . , xK; n, p1, . . . , pK)
                 = Σ_{x∈X} ( n! / ∏_{k=1}^K xk! ) ∏_{k=1}^K (pk · exp(tk))^{xk}     (3.14)
                 = ( Σ_{k=1}^K pk · exp(tk) )^n

in both cases, the multinomial theorem is applied at the third line.


Through appropriate differentiation of the moment generating function


and additional calculations, one can see that for all k = 1, . . . , K:

E [Xk ] = npk (3.15)


Var [Xk ] = npk (1 − pk ) (3.16)

and for all k, ℓ = 1, . . . , K with k ≠ ℓ:

Cov[Xk, Xℓ] = −npk pℓ     (3.17)

and the covariance is always negative because an increasing number of successes
for one alternative implies a decreasing number for another alternative.
By writing the vector of probabilities as:

p ≡ (p1, . . . , pK)^T

one can write the mean vector and the variance-covariance matrix for
this distribution in more compact and elegant form as follows.

E[x] = np     (3.18)
Var[x] = n( diag(p) − pp^T )     (3.19)
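Since the support is finite, formulas (3.18)–(3.19) can be verified by brute-force enumeration; the values n = 4 and p = (0.2, 0.3, 0.5) are illustrative.

```python
import math
from itertools import product

import numpy as np

n, p = 4, np.array([0.2, 0.3, 0.5])  # illustrative parameter values
K = len(p)

# enumerate the support: all count vectors in {0, ..., n}^K summing to n
support = [x for x in product(range(n + 1), repeat=K) if sum(x) == n]

def pmf(x):
    """Multinomial probability mass function, as in (3.12)."""
    coef = math.factorial(n)
    for xk in x:
        coef //= math.factorial(xk)
    return coef * np.prod(p ** np.array(x))

total = sum(pmf(x) for x in support)
mean = sum(pmf(x) * np.array(x, dtype=float) for x in support)
cov = sum(pmf(x) * np.outer(np.array(x) - n * p, np.array(x) - n * p)
          for x in support)

assert np.isclose(total, 1.0)                               # mass sums to one
assert np.allclose(mean, n * p)                             # (3.18)
assert np.allclose(cov, n * (np.diag(p) - np.outer(p, p)))  # (3.19)
```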


Multivariate Normal Distribution


A generic random vector x = (X1 , . . . , XK ) of length K whose support is
X = RK is said to follow the multivariate normal distribution if its joint
probability density function is given by:

fx(x; µ, Σ) = [(2π)^K |Σ|]^{−1/2} exp( −(1/2)(x − µ)^T Σ^{−1} (x − µ) )     (3.20)

with two collections of parameters: a vector µ of length K and a symmetric,
positive semi-definite matrix Σ of dimension K × K and full rank:

µ ≡ | µ1 |             | σ11  σ12  . . .  σ1K |
    | µ2 |             | σ21  σ22  . . .  σ2K |
    | .. |   and   Σ ≡ |  ..   ..  . . .   .. |
    | µK |             | σK1  σK2  . . .  σKK |

with σij = σji for i, j = 1, . . . , K and where |Σ| is the determinant of Σ.


The expression
x ∼ N (µ, Σ)
typically indicates that the random vector x follows the multivariate normal
distribution with parameters µ and Σ. A particular case of the multivariate
normal distribution is the standardized one, with µ = 0 and Σ = I. If a
random vector z follows the standard multivariate normal distribution, this
is written as follows.
z ∼ N (0, I)
Note that since Σ is symmetric, positive semi-definite and of full rank, a
Cholesky decomposition can always be applied to it so as to find some matrix
Σ^{1/2} such that Σ^{−1/2} Σ Σ^{−1/2} = I. Therefore, a random vector that follows
a generic normal distribution with parameters µ and Σ is always related to
a random vector z that follows the standard multivariate normal via the
transformations:

z = Σ^{−1/2}(x − µ)
x = Σ^{1/2} z + µ
which is analogous to the univariate case; also observe that the two trans-
formations together comply with Theorem 3.1 about the transformation of
continuous random vectors.
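A numerical sketch of the standardization above, with illustrative parameters. One caveat: numpy's Cholesky factor L is lower-triangular rather than symmetric, so it is one valid choice of Σ^{1/2} satisfying L L^T = Σ; a symmetric square root would instead come from the eigendecomposition of Σ.

```python
import numpy as np

# Illustrative parameters of a trivariate normal distribution
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
mu = np.array([1.0, -1.0, 0.0])

L = np.linalg.cholesky(Sigma)   # lower-triangular, with L @ L.T = Sigma
L_inv = np.linalg.inv(L)

assert np.allclose(L @ L.T, Sigma)
assert np.allclose(L_inv @ Sigma @ L_inv.T, np.eye(3))  # "standardizes" Sigma

# The transformations x = mu + L z and z = L^{-1}(x - mu) are mutual inverses
z = np.array([0.3, -0.7, 1.1])
x = mu + L @ z
assert np.allclose(L_inv @ (x - mu), z)
```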
As usual, the cumulative distribution of the normal distribution lacks a
closed form, hence it must be expressed as a multiple integral.

Fx(x1, . . . , xK; µ, Σ) =
  = ∫_{−∞}^{x1} · · · ∫_{−∞}^{xK} [(2π)^K |Σ|]^{−1/2} exp( −(1/2)(t − µ)^T Σ^{−1} (t − µ) ) dt     (3.21)
Like in the univariate case, it is not immediate to show that the joint density
function integrates to one; the demonstration is a tedious extension of the
one from Lecture 2 for K = 1. However, obtaining the moment generating
function is again a relatively simpler task if one starts from the standardized
case, z ∼ N (0, I):
Mz(t) = ∫_{R^K} exp(t^T z) (2π)^{−K/2} exp( −z^T z/2 ) dz
      = exp( t^T t/2 ) ∫_{R^K} (2π)^{−K/2} exp( −(z − t)^T (z − t)/2 ) dz
      = exp( t^T t/2 )


where the integral in the second line equates that of another multivariate
normal distribution with µ = t and Σ = I, hence it integrates to one. To
obtain the general expression of the moment generating function, note that:

Mx(t) = E[exp(t^T x)]
      = E[exp(t^T (Σ^{1/2} z + µ))]
      = exp(t^T µ) · E[exp(t^T Σ^{1/2} z)]
      = exp( t^T µ + (t^T Σ t)/2 )
since the expectation in the third line corresponds to the definition of moment
generating function for the standardized normal distribution, but with
a rescaled argument: Σ^{1/2} t instead of t (recall that Σ^{1/2} is symmetric).
By analyzing the above moment generating function, one can conclude
the following about the moments of a multivariate normal distribution.
E [x] = µ (3.22)
Var [x] = Σ (3.23)
Note that this is a different parametrization to that of the particular bivari-
ate case (see e.g. Example 3.6); there, σ1 and σ2 correspond for convenience
to standard deviations (square roots of variances) instead of just variances.
Here instead, the variances are denoted as Var [Xk ] = σkk for k = 1, . . . , K
and the covariances as Cov [Xk , X` ] = σk` for k, ` = 1, . . . , K and k 6= `.
An additional observation about the moment generating function is that for
any J-dimensional linear combination of x – write it y = a + Bx where a is
a vector of length J and B is a J × K matrix – one can obtain the moment
generating function for y: My (t) (where now t has length J) following the
same procedure as above:
My(t) = E[exp(t^T y)]
      = E[exp(t^T (a + Bx))]
      = exp(t^T a) · E[exp(t^T Bx)]
      = exp( t^T (Bµ + a) + (t^T BΣB^T t)/2 )

which is the moment generating function of another multivariate normal
distribution, that is, y ∼ N(Bµ + a, BΣB^T). In plain words, any collection
of linear combinations of some possibly dependent normal distributions itself
follows a multivariate normal distribution. This result is frequently applied
to derive the distribution of just one single linear combination (J = 1).

A final observation concerns the marginal and conditional distributions obtained from multivariate normal distributions. Suppose that $x = (x_1, x_2)$ can be split into two subvectors $x_1$ and $x_2$ of length $K_1$ and $K_2$ respectively, with $K_1 + K_2 = K$. Partition the original collection of parameters as follows:

\[
\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}
\qquad\text{and}\qquad
\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}
\]
where µ1 is a vector of length K1 , µ2 one of length K2 , Σ11 a symmetric
K1 × K1 matrix, Σ22 a symmetric K2 × K2 matrix, while Σ12 and Σ21 are
two matrices, one being the transpose of the other, of dimension K1 × K2
and $K_2 \times K_1$ respectively. By the properties of partitioned inverse matrices:

\[
\Sigma^{-1} = \begin{pmatrix}
\bar{\Sigma}_1^{-1} & -\bar{\Sigma}_1^{-1}\,\Sigma_{12}\,\Sigma_{22}^{-1} \\
-\bar{\Sigma}_2^{-1}\,\Sigma_{21}\,\Sigma_{11}^{-1} & \bar{\Sigma}_2^{-1}
\end{pmatrix}
\]

where:

\[
\bar{\Sigma}_1 \equiv \Sigma_{11} - \Sigma_{12}\,\Sigma_{22}^{-1}\,\Sigma_{21},
\qquad
\bar{\Sigma}_2 \equiv \Sigma_{22} - \Sigma_{21}\,\Sigma_{11}^{-1}\,\Sigma_{12}
\]

and:

\[
\left|\Sigma\right| = \left|\bar{\Sigma}_1\right|\cdot\left|\Sigma_{22}\right| = \left|\bar{\Sigma}_2\right|\cdot\left|\Sigma_{11}\right|
\]
relating the determinant of Σ to those of the matrices expressing its partitioned inverse. With all this in mind, (3.20) can be rewritten as:

\[
\begin{aligned}
f_{x}\left(x_1, x_2; \mu_1, \mu_2, \Sigma_{11}, \Sigma_{12}, \Sigma_{21}, \Sigma_{22}\right)
&= \frac{1}{\sqrt{(2\pi)^K\left|\bar{\Sigma}_1\right|\cdot\left|\Sigma_{22}\right|}}\,\times \\
&\quad\times\exp\bigg(\frac{1}{2}\left(x_1-\mu_1\right)^{\mathsf T}\bar{\Sigma}_1^{-1}\Sigma_{12}\Sigma_{22}^{-1}\left(x_2-\mu_2\right) - \frac{1}{2}\left(x_1-\mu_1\right)^{\mathsf T}\bar{\Sigma}_1^{-1}\left(x_1-\mu_1\right) \\
&\qquad\quad + \frac{1}{2}\left(x_2-\mu_2\right)^{\mathsf T}\bar{\Sigma}_2^{-1}\Sigma_{21}\Sigma_{11}^{-1}\left(x_1-\mu_1\right) - \frac{1}{2}\left(x_2-\mu_2\right)^{\mathsf T}\bar{\Sigma}_2^{-1}\left(x_2-\mu_2\right)\bigg)
\end{aligned}
\]
or alternatively, by inverting the ‘1’ and ‘2’ subscripts. The above expression
of the joint density and its symmetric version can be exploited to show, after
more calculations, that the “marginalized” distributions for x1 and x2 are:
\[
x_1 \sim N\left(\mu_1, \Sigma_{11}\right), \qquad x_2 \sim N\left(\mu_2, \Sigma_{22}\right)
\]

while the conditional ones, for one subvector given the other, are:

\[
\begin{aligned}
x_1 \,|\, x_2 &\sim N\left(\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}\left(x_2-\mu_2\right),\ \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right) \\
x_2 \,|\, x_1 &\sim N\left(\mu_2 + \Sigma_{21}\Sigma_{11}^{-1}\left(x_1-\mu_1\right),\ \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}\right)
\end{aligned}
\]

generalizing the observations already made for the simpler bivariate case.
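The partitioned-inverse and determinant identities used in this derivation can be verified numerically. The sketch below (an illustrative aside with an arbitrarily chosen positive definite Σ) compares the blocks of Σ⁻¹ computed directly with the closed-form expressions above.

```python
import numpy as np

# Illustrative check of the partitioned inverse and |Sigma| identities for an
# arbitrary positive definite Sigma, split into K1 = 2 and K2 = 2 blocks.
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
Sigma = A @ A.T + 4 * np.eye(4)            # positive definite by construction

S11, S12 = Sigma[:2, :2], Sigma[:2, 2:]
S21, S22 = Sigma[2:, :2], Sigma[2:, 2:]

Sbar1 = S11 - S12 @ np.linalg.inv(S22) @ S21   # Schur complement of Sigma_22
Sbar2 = S22 - S21 @ np.linalg.inv(S11) @ S12   # Schur complement of Sigma_11

inv = np.linalg.inv(Sigma)
blocks_ok = (
    np.allclose(inv[:2, :2], np.linalg.inv(Sbar1)) and
    np.allclose(inv[:2, 2:], -np.linalg.inv(Sbar1) @ S12 @ np.linalg.inv(S22)) and
    np.allclose(inv[2:, :2], -np.linalg.inv(Sbar2) @ S21 @ np.linalg.inv(S11)) and
    np.allclose(inv[2:, 2:], np.linalg.inv(Sbar2))
)
det_ok = np.isclose(np.linalg.det(Sigma),
                    np.linalg.det(Sbar1) * np.linalg.det(S22))
```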

Lecture 4

Samples and Statistics

This lecture introduces some core concepts associated with the practice of
statistical analysis: samples and statistics that are calculated from samples.
A short introduction to the general notion of statistical sample is followed by
a focused analysis of an important special case: that of random sample, with
special emphasis on random samples drawn from the normal distribution. In
developing these concepts, the first half of the lecture introduces properties
of common sample statistics, while the second half is devoted to the analysis
of sample statistics with specific properties: order and sufficient statistics.

4.1 Random Samples


The practice of all statistical analysis is based on the analysis of some data
that are extracted from some population of interest. All data are organized
by samples, that is collections of information associated with different units
of analysis of the population (individuals, firms, polities, etc.). The practice
of statistical analysis requires to make probabilistic evaluations about data
samples, which must thus be connected to probability theory. The first step
in this direction is a probabilistic definition of a sample.

Definition 4.1. Sample. A sample is a collection $\{x_i\}_{i=1}^{N}$ of realizations of some $N$ random vectors $\{x_i\}_{i=1}^{N}$ associated with some population of interest. Each unit of such a population is typically named a unit of observation, and its associated realization $x_i$ is identified by a unique subscript $i$.

This definition is already general enough that it allows for the observation
of multiple variables for every unit of analysis; in the simpler cases when
each of these is associated with the realization of just one random variable,
a sample is written as $\{x_i\}_{i=1}^{N}$. Conversely, the definition can be extended to


realizations drawn from random matrices, in which case a sample is written as $\{X_i\}_{i=1}^{N}$. In all these cases, the following terminology applies.

Definition 4.2. Sample size. The dimension N of a sample is called size.

A sample can be called in various ways, depending on the commonalities


between the random variables from which the observations are drawn.

Definition 4.3. Random sample. A sample is said to be random if all the realizations that compose it are drawn from independent and identically distributed (i.i.d.) random vectors $\{x_i\}_{i=1}^{N}$ (or variables, or matrices).

The hypothesis motivating random samples is that all the realizations are
obtained from a given population characterized by some joint probability
distribution expressed by some random vector x. A sample complies with
this framework if, for example, each realization is obtained by extracting
every unit of observation from some specific population through a protocol
that assigns to all such units the same probability of being drawn into the
sample, a process known as sampling with replacement (this name derives
from sampling protocols applied to finite populations where a unit is allowed
to produce multiple realizations $x_i$). Conversely, other protocols, such as
sampling without replacement, where every realization is drawn sequentially
from a population and is not allowed to be extracted again, do not comply
with the random sample framework. It is important to realize that not all
samples are random.

Definition 4.4. Non-random sample. A sample is non-random if the


realizations that compose it are not drawn from i.i.d. random variables, or
vectors, or matrices. Instead, these may be:

• independent and not identically distributed (i.n.i.d.);


• not independent and identically distributed (n.i.i.d.);
• not independent and not identically distributed (n.i.n.i.d.).

Intuitively, the more a sample departs from the i.i.d. benchmark, the more complicated statistical inference becomes. Yet in social sciences non-random samples are common. For example, the data may be composed of observations obtained from recognizably different distributions (i.n.i.d.), or observations that exhibit statistical dependence due to some underlying socio-economic phenomenon, like group behavior or the response to economic events (n.i.i.d.). While statistical inference is still possible on non-random samples, the asymptotic framework is better suited to these settings. The present lecture only focuses on random samples. Yet it must


be mentioned that the modern theory of econometrics extends to the general n.i.n.i.d. case because of the necessity to deal with real data that hardly
fit the i.n.i.d. case, and almost never (except special cases) the i.i.d. one.
The rest of this section henceforth focuses on the random, i.i.d. case. A
first observation about random samples is that since their realizations are
independent, it is easy to express their associated joint probability mass or
density function, as the product of every unit of observation’s joint density:
\[
f_{x_1,\dots,x_N}\left(x_1,\dots,x_N;\theta\right) = f_{x}\left(x_1;\theta\right)\times\dots\times f_{x}\left(x_N;\theta\right) = \prod_{i=1}^{N} f_{x}\left(x_i;\theta\right)
\]

where θ is the collection of parameters that are associated with the proba-
bility distribution of x. This is an extremely useful fact aiding the analysis
of selected statistics of a random sample.
Definition 4.5. Statistic. A function of the N random variables, vectors or matrices that are specific to each i-th unit of observation and that generate a sample is called a statistic. Any statistic is itself a random variable, vector or matrix.
Definition 4.6. Sampling distribution. The probability distribution of
a statistic is called its sampling distribution.
The two most common and best-known statistics are the following.
Definition 4.7. Sample mean. In samples derived from random vectors,
the sample mean is a vector-valued statistic which is usually denoted as x̄
and defined as follows.
\[
\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i
\]
This definition can be reduced to samples that are drawn from univariate
random variables, in which case the usual notation is X̄:
\[
\bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i
\]
or extended to samples drawn from random matrices, where one can write
X̄ and the definition is again analogous.
Definition 4.8. Sample variance-covariance. In samples collected from
random vectors, the sample variance-covariance is a matrix-valued statistic
which is usually denoted by S and defined as follows.
\[
S = \frac{1}{N-1}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)\left(x_i - \bar{x}\right)^{\mathsf T}
\]


In samples from univariate random variables, this statistic is simply called sample variance, its associated notation is $S^2$, and it is a scalar.

\[
S^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(X_i - \bar{X}\right)^2
\]

In this case, the square root of the sample variance is written $S = \sqrt{S^2}$ and called the standard deviation. In order to extend this definition to sampling from random matrices it is necessary to develop three-dimensional arrays.
These statistics have some important properties, which are proved here
in the vector-valued case.
Theorem 4.1. Properties of simple sample statistics (1). Consider a sample $\{x_i\}_{i=1}^{N}$, its sample mean $\bar{x}$, and its sample variance-covariance $S$. The following two properties are true:

a. $\bar{x} = \arg\min_{a\in\mathbb{R}^K}\sum_{i=1}^{N}\left(x_i - a\right)^{\mathsf T}\left(x_i - a\right)$;

b. $\left(N-1\right)S = \sum_{i=1}^{N} x_i x_i^{\mathsf T} - N\cdot\bar{x}\bar{x}^{\mathsf T}$.

Proof. To show point a. note that:

\[
\begin{aligned}
\sum_{i=1}^{N}\left(x_i - a\right)\left(x_i - a\right)^{\mathsf T}
&= \sum_{i=1}^{N}\left(x_i - \bar{x} + \bar{x} - a\right)\left(x_i - \bar{x} + \bar{x} - a\right)^{\mathsf T} \\
&= \sum_{i=1}^{N}\left(x_i - \bar{x}\right)\left(x_i - \bar{x}\right)^{\mathsf T} + \sum_{i=1}^{N}\left(\bar{x} - a\right)\left(\bar{x} - a\right)^{\mathsf T} \\
&\quad + \sum_{i=1}^{N}\left(x_i - \bar{x}\right)\left(\bar{x} - a\right)^{\mathsf T} + \sum_{i=1}^{N}\left(\bar{x} - a\right)\left(x_i - \bar{x}\right)^{\mathsf T} \\
&= \sum_{i=1}^{N}\left(x_i - \bar{x}\right)\left(x_i - \bar{x}\right)^{\mathsf T} + \sum_{i=1}^{N}\left(\bar{x} - a\right)\left(\bar{x} - a\right)^{\mathsf T}
\end{aligned}
\]

where the two cross terms in the second line are both equal to zero by definition of sample mean; in the last line, the first term does not depend on $a$ while the second is minimized at $a = \bar{x}$. To show b. simply note that:

\[
\begin{aligned}
\sum_{i=1}^{N}\left(x_i - \bar{x}\right)\left(x_i - \bar{x}\right)^{\mathsf T}
&= \sum_{i=1}^{N} x_i x_i^{\mathsf T} - \sum_{i=1}^{N} x_i \bar{x}^{\mathsf T} - \sum_{i=1}^{N}\bar{x} x_i^{\mathsf T} + N\cdot\bar{x}\bar{x}^{\mathsf T} \\
&= \sum_{i=1}^{N} x_i x_i^{\mathsf T} - N\cdot\bar{x}\bar{x}^{\mathsf T}
\end{aligned}
\]

and the result again follows from the definition of a sample mean.
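Both identities in Theorem 4.1 are exact algebraic facts, so they can be confirmed on any sample; a minimal sketch with arbitrary simulated data (an illustrative aside, not part of the notes):

```python
import numpy as np

# Illustrative check of Theorem 4.1 on an arbitrary simulated sample.
rng = np.random.default_rng(3)
x = rng.standard_normal((50, 3))          # N = 50 draws of a K = 3 vector
N = x.shape[0]
xbar = x.mean(axis=0)

def ssd(a):
    # sum_i (x_i - a)'(x_i - a), the objective in point a.
    return ((x - a) ** 2).sum()

# (a) any perturbation away from xbar weakly increases the objective
perturbed = xbar + 0.1 * rng.standard_normal(3)
min_ok = ssd(xbar) <= ssd(perturbed)

# (b) (N - 1) S = sum_i x_i x_i' - N xbar xbar'
S = np.cov(x, rowvar=False)               # numpy uses the N - 1 divisor
decomp_ok = np.allclose((N - 1) * S, x.T @ x - N * np.outer(xbar, xbar))
```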


Theorem 4.2. Properties of simple sample statistics (2). Consider a random sample $\{x_i\}_{i=1}^{N}$ drawn from a random vector $x$, a transformation of this vector $y = g\left(x\right)$, and suppose that all the moments expressed in the mean vector $\mathrm{E}\left[y\right]$ and in the variance-covariance matrix $\mathrm{Var}\left[y\right]$ are defined. The following two properties are true:

a. $\mathrm{E}\left[\sum_{i=1}^{N} y_i\right] = N\cdot\mathrm{E}\left[y_i\right]$;

b. $\mathrm{Var}\left[\sum_{i=1}^{N} y_i\right] = N\cdot\mathrm{Var}\left[y_i\right]$.

Proof. To show a. simply observe that:

\[
\mathrm{E}\left[\sum_{i=1}^{N} y_i\right] = \sum_{i=1}^{N}\mathrm{E}\left[y_i\right] = N\cdot\mathrm{E}\left[y_i\right]
\]

where the first equality follows from the linear properties of the expectation operator and the second equality follows from the fact that the distributions of $y_i$ for $i = 1,\dots,N$ are identical (this particular result does not require independence and is also valid for n.i.i.d. samples). Regarding b. it is:

\[
\begin{aligned}
\mathrm{Var}\left[\sum_{i=1}^{N} y_i\right]
&= \mathrm{E}\left[\left(\sum_{i=1}^{N} y_i - \mathrm{E}\left[\sum_{i=1}^{N} y_i\right]\right)\left(\sum_{i=1}^{N} y_i - \mathrm{E}\left[\sum_{i=1}^{N} y_i\right]\right)^{\mathsf T}\right] \\
&= \mathrm{E}\left[\left(\sum_{i=1}^{N}\left(y_i - \mathrm{E}\left[y_i\right]\right)\right)\left(\sum_{i=1}^{N}\left(y_i - \mathrm{E}\left[y_i\right]\right)\right)^{\mathsf T}\right] \\
&= \mathrm{E}\left[\sum_{i=1}^{N}\left(y_i - \mathrm{E}\left[y_i\right]\right)\left(y_i - \mathrm{E}\left[y_i\right]\right)^{\mathsf T}\right] \\
&= \sum_{i=1}^{N}\mathrm{E}\left[\left(y_i - \mathrm{E}\left[y_i\right]\right)\left(y_i - \mathrm{E}\left[y_i\right]\right)^{\mathsf T}\right] \\
&= N\cdot\mathrm{Var}\left[y_i\right]
\end{aligned}
\]

where the first line is just the definition of variance for $\sum_{i=1}^{N} y_i$, the second line applies the linear properties of expectations while also rearranging terms, the third line rearranges terms again after observing that, for $i \neq j$:

\[
\mathrm{E}\left[\left(y_i - \mathrm{E}\left[y_i\right]\right)\left(y_j - \mathrm{E}\left[y_j\right]\right)^{\mathsf T}\right] = 0
\]

which follows from the independence of the realizations in the random sample, the fourth line is another application of the linear properties of expectations, while the fifth line again exploits the fact that all the realizations follow from identically distributed random variables.


Theorem 4.3. Properties of simple sample statistics (3). Consider a random sample $\{x_i\}_{i=1}^{N}$ drawn from a random vector $x$ whose mean vector is $\mathrm{E}\left[x\right]$ and whose variance-covariance matrix $\mathrm{Var}\left[x\right]$ is finite. The following three properties are true:

a. $\mathrm{E}\left[\bar{x}\right] = \mathrm{E}\left[x\right]$;

b. $\mathrm{Var}\left[\bar{x}\right] = \mathrm{Var}\left[x\right]/N$;

c. $\mathrm{E}\left[S\right] = \mathrm{Var}\left[x\right]$.
Proof. To show a. it is sufficient to apply Theorem 4.2, point a. for $y = x$:

\[
\mathrm{E}\left[\bar{x}\right] = \mathrm{E}\left[\frac{1}{N}\sum_{i=1}^{N} x_i\right] = \frac{1}{N}\sum_{i=1}^{N}\mathrm{E}\left[x_i\right] = \frac{1}{N}\cdot N\cdot\mathrm{E}\left[x_i\right] = \mathrm{E}\left[x\right]
\]

and point b. proceeds similarly.

\[
\mathrm{Var}\left[\bar{x}\right] = \mathrm{Var}\left[\frac{1}{N}\sum_{i=1}^{N} x_i\right] = \frac{1}{N^2}\sum_{i=1}^{N}\mathrm{Var}\left[x_i\right] = \frac{1}{N^2}\cdot N\cdot\mathrm{Var}\left[x_i\right] = \frac{\mathrm{Var}\left[x\right]}{N}
\]

The proof of point c. is as follows:

\[
\begin{aligned}
\mathrm{E}\left[S\right]
&= \mathrm{E}\left[\frac{1}{N-1}\left(\sum_{i=1}^{N} x_i x_i^{\mathsf T} - N\cdot\bar{x}\bar{x}^{\mathsf T}\right)\right] \\
&= \frac{1}{N-1}\left(\sum_{i=1}^{N}\mathrm{E}\left[x_i x_i^{\mathsf T}\right] - N\cdot\mathrm{E}\left[\bar{x}\bar{x}^{\mathsf T}\right]\right) \\
&= \frac{1}{N-1}\left(N\cdot\mathrm{Var}\left[x_i\right] - N\cdot\mathrm{Var}\left[\bar{x}\right]\right) \\
&= \frac{N}{N-1}\left(1 - \frac{1}{N}\right)\mathrm{Var}\left[x\right] \\
&= \mathrm{Var}\left[x\right]
\end{aligned}
\]

where the third line follows after adding and subtracting $N\cdot\mathrm{E}\left[x\right]\mathrm{E}\left[x\right]^{\mathsf T}$.
The theorem examined last is the culmination of the results analyzed previously, and specifies how to obtain quantities that can be used to evaluate – more precisely, estimate – the moments of the underlying distribution. By selecting the sample mean and the sample variance-covariance for the sake of estimating the corresponding moments, one can rely on the property that the expectation of those quantities is indeed the moment sought after, in both cases. Later, this property is termed unbiasedness.
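The unbiasedness properties of Theorem 4.3 can be illustrated with a small Monte Carlo experiment (a sketch with arbitrary parameter choices, not part of the notes):

```python
import numpy as np

# Monte Carlo sketch of Theorem 4.3 (arbitrary parameters): over R replicated
# samples of size N, the sample mean averages to E[X], its variance across
# replications is Var[X]/N, and the sample variance averages to Var[X].
rng = np.random.default_rng(4)
mu, sigma2, N, R = 2.0, 9.0, 10, 200_000

samples = rng.normal(mu, np.sqrt(sigma2), size=(R, N))
xbars = samples.mean(axis=1)
s2s = samples.var(axis=1, ddof=1)         # the N - 1 divisor of Definition 4.8

mean_ok = np.isclose(xbars.mean(), mu, atol=0.05)
var_ok = np.isclose(xbars.var(), sigma2 / N, atol=0.05)
unbiased_ok = np.isclose(s2s.mean(), sigma2, atol=0.05)
```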


4.2 Normal Sampling


Once the expectations of both the sample mean and the sample variance-
covariance are known, the next step is to identify the sampling distribution
of the sampling mean. This is necessary for the sake of characterizing the
probability about the occurrence of different samples. In general, there is no
unique answer to this problem, as it ultimately depends on the underlying
distribution of the random vector that generates a random sample. In some
specific cases, however, it is possible to leverage upon Theorem 3.7 in order
to derive the distribution of univariate sample means obtained from selected
distributions. The most important of these cases is the one about sampling
from the normal distribution. Specifically, if a random sample $\{x_i\}_{i=1}^{N}$ is drawn from a random variable $X \sim N\left(\mu, \sigma^2\right)$, it is easy to see that:

\[
\bar{X} \sim N\left(\mu, \frac{\sigma^2}{N}\right) \tag{4.1}
\]
which is an extension of Observation 3.11 from Lecture 3. The above result
can be alternatively expressed in a standardized fashion, which can be more
convenient for calculating probabilities about the sample mean.
\[
\sqrt{N}\,\frac{\bar{X} - \mu}{\sigma} \sim N\left(0, 1\right) \tag{4.2}
\]
In practical settings, those analysts who are interested in evaluating
– better, estimating – a specific value for the mean of a normal distribution
find results (4.1) and (4.2) of limited use. The reason is that, to manipulate
the density function of the normal distribution in order to make statements
about the probability that the parameter µ falls within specified ranges –
that is to perform hypothesis tests, see Lecture 5 – it is necessary to know
the parameter σ beforehand. In actual practice this information is generally
inaccessible to researchers. An intuitive workaround is to substitute σ with
its associated sample statistic, that is the standard deviation S. Doing so
is tantamount to working with a well-known statistic.
Definition 4.9. The t-statistic. Given some univariate sample $\{x_i\}_{i=1}^{N}$ of size $N$ drawn from a sequence of random variables $X_1,\dots,X_N$, a t-statistic is defined as the following quantity:

\[
t = \sqrt{N}\,\frac{\bar{X} - \mu}{S}
\]

where $\bar{X}$ is the sample mean whose expectation is $\mu = \mathrm{E}\left[\bar{X}\right]$, and $S$ is the sample standard deviation.


The next result is central in statistical inference and allows to derive the
sampling distribution of the t-statistic when the sample is drawn from the
normal distribution.
Theorem 4.4. Sampling from the Normal Distribution. Consider a random sample $\{x_i\}_{i=1}^{N}$ which is drawn from a random variable following the normal distribution $X \sim N\left(\mu, \sigma^2\right)$, and the random variables corresponding to the two sample statistics $\bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i$ and $S^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(X_i - \bar{X}\right)^2$. The following three properties are true:

a. $\bar{X}$ and $S^2$ are independent;

b. $\bar{X} \sim N\left(\mu, \sigma^2/N\right)$;

c. $\left(N-1\right)S^2/\sigma^2 \sim \chi^2_{N-1}$.
Proof. Point a. is the most crucial to show. To this end, it is useful to start from the observation that the sample variance can be expressed in terms of only $N - 1$ of the original random variables, say $X_2,\dots,X_N$:

\[
\begin{aligned}
S^2 &= \frac{1}{N-1}\sum_{i=1}^{N}\left(X_i - \bar{X}\right)^2 \\
&= \frac{1}{N-1}\left[\left(X_1 - \bar{X}\right)^2 + \sum_{i=2}^{N}\left(X_i - \bar{X}\right)^2\right] \\
&= \frac{1}{N-1}\left[\left(\sum_{i=2}^{N}\left(X_i - \bar{X}\right)\right)^2 + \sum_{i=2}^{N}\left(X_i - \bar{X}\right)^2\right]
\end{aligned}
\]

where the last line follows from $\sum_{i=1}^{N}\left(X_i - \bar{X}\right) = 0$. Consequently, proving that the sample mean is independent of the sample variance requires to show that $\bar{X}$ is independent of $N - 1$ out of the $N$ demeaned normally distributed random variables, say $X_2 - \bar{X},\dots,X_N - \bar{X}$. To do so, a convenient approach is to define the following random vector $\tilde{z}$ of length $N$, which is a function of the standardized random variables $Z_i = \left(X_i - \mu\right)/\sigma$ for $i = 1,\dots,N$.

\[
\tilde{z} =
\begin{pmatrix} \bar{Z} \\ \tilde{Z}_2 \\ \vdots \\ \tilde{Z}_N \end{pmatrix} =
\begin{pmatrix} \bar{Z} \\ Z_2 - \bar{Z} \\ \vdots \\ Z_N - \bar{Z} \end{pmatrix} =
\begin{pmatrix}
N^{-1} & N^{-1} & \dots & N^{-1} \\
-N^{-1} & 1 - N^{-1} & \dots & -N^{-1} \\
\vdots & \vdots & \ddots & \vdots \\
-N^{-1} & -N^{-1} & \dots & 1 - N^{-1}
\end{pmatrix} z
\]

Here, $z = \left(Z_1,\dots,Z_N\right)$. One can prove that this linear transformation has Jacobian determinant equaling $1/N$, and therefore it is invertible; it follows that the Jacobian determinant of the inverse transformation is equal to $N$. Thus, the joint distribution of $\tilde{z}$ can be obtained from that of $z$ by applying Theorem 3.1 quite straightforwardly:

\[
\begin{aligned}
f_{\tilde{z}}\left(\bar{z}, \tilde{z}_2,\dots,\tilde{z}_N\right)
&= \frac{N}{\sqrt{(2\pi)^N}}\exp\left(-\frac{1}{2}\left(\bar{z} - \sum_{i=2}^{N}\tilde{z}_i\right)^2 - \frac{1}{2}\sum_{i=2}^{N}\left(\bar{z} + \tilde{z}_i\right)^2\right) \\
&= \sqrt{\frac{N}{2\pi}}\exp\left(-\frac{N\bar{z}^2}{2}\right)\times \\
&\quad\times\sqrt{\frac{N}{(2\pi)^{N-1}}}\exp\left(-\frac{1}{2}\left(\sum_{i=2}^{N}\tilde{z}_i\right)^2 - \frac{1}{2}\sum_{i=2}^{N}\tilde{z}_i^2\right) \\
&= f_{\bar{Z}}\left(\bar{z}\right)\cdot f_{\tilde{z}_{-1}}\left(\tilde{z}_2,\dots,\tilde{z}_N\right)
\end{aligned}
\]

and it can be clearly decomposed into the product of two components: the density function of $\bar{Z}$ and that of all the other elements of $\tilde{z}$, implying that $\bar{X}$ is independent of $X_2 - \bar{X},\dots,X_N - \bar{X}$, and consequently of $S^2$.
To continue the proof about the other points in the statement, note that point b. is, as said, a consequence of Theorem 3.7. In order to demonstrate point c. instead, it is easiest to proceed as follows:

\[
\begin{aligned}
\left(N-1\right)\frac{S^2}{\sigma^2}
&= \sum_{i=1}^{N}\frac{\left(X_i - \bar{X}\right)^2}{\sigma^2} \\
&= \sum_{i=1}^{N}\frac{\left(X_i - \mu + \mu - \bar{X}\right)^2}{\sigma^2} \\
&= \sum_{i=1}^{N}\frac{\left(X_i - \mu\right)^2}{\sigma^2} + \frac{N\left(\bar{X} - \mu\right)^2}{\sigma^2} - 2\,\frac{\bar{X} - \mu}{\sigma^2}\sum_{i=1}^{N}\left(X_i - \mu\right) \\
&= \sum_{i=1}^{N}\left(\frac{X_i - \mu}{\sigma}\right)^2 - \left(\sqrt{N}\,\frac{\bar{X} - \mu}{\sigma}\right)^2
\end{aligned}
\]

that is, the statistic $\left(N-1\right)S^2/\sigma^2$ is shown to be the sum of the squares of $N$ independent random variables all of which follow the standard normal distribution (the standardized versions of $X_1,\dots,X_N$) minus the square of another random variable that follows the standard normal distribution (the standardized version of the sample mean $\bar{X}$). By the demonstration of point a. the latter is independent of the former. Consequently, the distribution of the statistic of interest can be obtained by a variation of Observation 3.9:

\[
M_{\bar{\bar{Z}}^2}\left(t\right)\,M_{\left(N-1\right)\frac{S^2}{\sigma^2}}\left(t\right) = \prod_{i=1}^{N} M_{Z_i^2}\left(t\right)
\]

where $\bar{\bar{Z}} \equiv \sqrt{N}\left(\bar{X} - \mu\right)/\sigma$ and $Z_i \equiv \left(X_i - \mu\right)/\sigma$, or equivalently:

\[
\begin{aligned}
M_{\left(N-1\right)\frac{S^2}{\sigma^2}}\left(t\right)
&= \frac{1}{M_{\bar{\bar{Z}}^2}\left(t\right)}\prod_{i=1}^{N} M_{Z_i^2}\left(t\right) \\
&= \left(1 - 2t\right)^{-\frac{1}{2}\left(N-1\right)}
\end{aligned}
\]

which follows since all the $N + 1$ moment generating functions involved are those of a chi-squared distribution with one degree of freedom; consequently the end result is the moment-generating function of a chi-squared distribution with $N - 1$ degrees of freedom.
This result took some effort to show, but it allows to derive the sampling distribution of the t-statistic. In fact, the ratio:

\[
t = \sqrt{N}\,\frac{\bar{X}-\mu}{S}
= \frac{\sqrt{N}\,\left(\bar{X}-\mu\right)/\sigma}{\sqrt{\left(N-1\right)\dfrac{S^2}{\sigma^2}}\cdot\dfrac{1}{\sqrt{N-1}}} \sim T_{N-1}
\]

is easily seen as the ratio between two independent random variables; the one in the numerator follows the standard normal distribution, while the one in the denominator equals the square root of a random variable that follows the chi-squared distribution with $N - 1$ degrees of freedom, divided by the square root of $N - 1$. Hence, by Observation 3.2, a t-statistic follows the Student's t-distribution with $N - 1$ degrees of freedom. Theorem 4.4 is also useful to obtain the sampling distribution of another important statistic.
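Before moving on, the claims of Theorem 4.4 and the resulting t-distribution can be checked by simulation (an illustrative sketch with arbitrary parameters; the zero-correlation check is a necessary implication of the independence in point a.):

```python
import numpy as np

# Monte Carlo sketch of Theorem 4.4 (arbitrary parameters): in normal samples
# the sample mean and variance are independent (hence uncorrelated), and the
# t-statistic matches Student's t with N - 1 degrees of freedom, whose
# variance is (N - 1)/(N - 3).
rng = np.random.default_rng(5)
mu, sigma, N, R = 0.0, 1.0, 6, 200_000

samples = rng.normal(mu, sigma, size=(R, N))
xbars = samples.mean(axis=1)
s2s = samples.var(axis=1, ddof=1)
tstats = np.sqrt(N) * (xbars - mu) / np.sqrt(s2s)

corr = np.corrcoef(xbars, s2s)[0, 1]          # should be close to zero
var_ok = np.isclose(tstats.var(), (N - 1) / (N - 3), atol=0.08)
```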
Definition 4.10. Normal variance ratio. Consider two univariate random samples $\{x_i\}_{i=1}^{N_X}$ and $\{y_i\}_{i=1}^{N_Y}$ of sizes $N_X$ and $N_Y$ respectively, each drawn from two independent sequences of random variables $\left(X_1,\dots,X_{N_X}\right)$ and $\left(Y_1,\dots,Y_{N_Y}\right)$ whose distributions are as follows.

\[
X_i \sim N\left(\mu_X, \sigma^2_X\right) \quad\text{for } i = 1,\dots,N_X
\]
\[
Y_j \sim N\left(\mu_Y, \sigma^2_Y\right) \quad\text{for } j = 1,\dots,N_Y
\]

The normal variance ratio is defined as the following F-statistic:

\[
F = \frac{S^2_X/\sigma^2_X}{S^2_Y/\sigma^2_Y}
\]

where $S^2_X$ and $S^2_Y$ are the sample variances of the two random samples.


The variance ratio is the statistic used by analysts to evaluate whether two populations of interest, call them X and Y by the denomination of the respective random variables, have the same variance – or more generally, whether the ratio of their variances is equal to some given quantity $\sigma^2_X/\sigma^2_Y$. Once again, these evaluations – these exercises in statistical inference – require knowledge about the distribution of the statistic in question. Theorem 4.4 comes again to the rescue, ensuring that both the numerator and the denominator of $F$, if multiplied by $N_X - 1$ and $N_Y - 1$ respectively, follow a chi-squared distribution with those numbers as degrees of freedom; so, by Observation 3.3:

\[
F = \frac{S^2_X/\sigma^2_X}{S^2_Y/\sigma^2_Y} \sim F_{N_X - 1, N_Y - 1}
\]

that is, the variance ratio $F$ follows the Snedecor F-distribution with paired degrees of freedom given by $N_X - 1$ and $N_Y - 1$. This fact is exploited in statistical tests about the variances of different populations drawn from the normal distribution, as discussed in Lecture 5.
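A quick Monte Carlo sketch of the variance ratio (illustrative only, with arbitrary sample sizes; the benchmark uses the fact that the mean of an F distribution with d₂ denominator degrees of freedom is d₂/(d₂ − 2)):

```python
import numpy as np

# Monte Carlo sketch of the normal variance ratio (arbitrary sample sizes):
# with sigma2_X = sigma2_Y = 1 the statistic is S2_X / S2_Y, and its mean
# should match that of the F(NX - 1, NY - 1) distribution, d2/(d2 - 2).
rng = np.random.default_rng(6)
NX, NY, R = 10, 20, 100_000

s2x = rng.normal(size=(R, NX)).var(axis=1, ddof=1)
s2y = rng.normal(size=(R, NY)).var(axis=1, ddof=1)
F = s2x / s2y

d2 = NY - 1
mean_ok = np.isclose(F.mean(), d2 / (d2 - 2), atol=0.03)
```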
Most of these ideas can be extended to multivariate normal sampling, where analysts have access to a vector-valued random sample $\{x_i\}_{i=1}^{N}$ drawn from some multivariate normal distribution $x \sim N\left(\mu, \Sigma\right)$. Even in this environment, statistical inference requires the development of sample statistics with a clearly recognizable distribution. Clearly, the multivariate sample mean in this case also follows a multivariate normal distribution, thanks to the linear properties of the latter.

\[
\bar{x} \sim N\left(\mu, \frac{\Sigma}{N}\right) \tag{4.3}
\]
However, the above statistic is a random vector, which makes it unsuitable
in some specific stages of statistical inference, like tests of hypotheses. This
observation led to the development of the following statistic.
Definition 4.11. u-statistic. Given some multivariate sample $\{x_i\}_{i=1}^{N}$ of size $N$ drawn from a sequence of random vectors $x_1,\dots,x_N$, a u-statistic¹ is defined as the following quantity:

\[
\begin{aligned}
u &= N\left(\bar{x} - \mu\right)^{\mathsf T}\Sigma^{-1}\left(\bar{x} - \mu\right) \\
&= N\sum_{k=1}^{K}\sum_{\ell=1}^{K}\sigma^{*-1}_{k\ell}\left(\bar{X}_k - \mu_k\right)\left(\bar{X}_\ell - \mu_\ell\right)
\end{aligned}
\]

where $\bar{x}$ is the sample mean whose expectation and variance are $\mu = \mathrm{E}\left[x_i\right]$ and $\Sigma/N = \mathrm{Var}\left[\bar{x}\right]$ respectively, and where $\sigma^{*-1}_{k\ell}$ is the $k\ell$-th element of $\Sigma^{-1}$.

¹ Note that this nomenclature is not standard, but is applied throughout the lectures.


To better interpret the u-statistic, it is useful to analyze its development


as a quadratic form in the second line of the above definition: the statistic
is a second degree polynomial of the K deviations of all univariate sample
means from their respective mean parameters, normalized by the population
variance-covariance. In this respect, the u-statistic intuitively appears to be
a multivariate generalization of the squared standardized univariate sample
mean. The following result should also be intuitive enough upon recalling
the relationship between the normal and chi-squared distributions.
Theorem 4.5. Sampling from the Multivariate Normal Distribution. Consider a random sample $\{x_i\}_{i=1}^{N}$ drawn from a $K$-dimensional random vector following the multivariate normal distribution, $x \sim N\left(\mu, \Sigma\right)$. In this environment, the u-statistic follows the chi-squared distribution with $K$ degrees of freedom.

\[
u = N\left(\bar{x} - \mu\right)^{\mathsf T}\Sigma^{-1}\left(\bar{x} - \mu\right) \sim \chi^2_K
\]

Proof. Derive the moment-generating function of the u-statistic, exploiting $\bar{x} \sim N\left(\mu, \Sigma/N\right)$. With the aid of some linear algebra:

\[
\begin{aligned}
M_u\left(t\right)
&= \int_{\mathbb{R}^K}\exp\left(N\left(\bar{x} - \mu\right)^{\mathsf T}\Sigma^{-1}\left(\bar{x} - \mu\right)t\right)f_{\bar{x}}\left(\bar{x}; \mu, \frac{\Sigma}{N}\right)\mathrm{d}\bar{x} \\
&= \int_{\mathbb{R}^K}\sqrt{\frac{N^K}{(2\pi)^K\left|\Sigma\right|}}\exp\left(-\frac{N}{2}\left(\bar{x} - \mu\right)^{\mathsf T}\left(1 - 2t\right)\Sigma^{-1}\left(\bar{x} - \mu\right)\right)\mathrm{d}\bar{x} \\
&= \sqrt{\frac{1}{\left(1 - 2t\right)^K}}\times\int_{\mathbb{R}^K}\sqrt{\frac{\left[\left(1 - 2t\right)N\right]^K}{(2\pi)^K\left|\Sigma\right|}}\,\times \\
&\qquad\qquad\times\exp\left(-\frac{N}{2}\left(\bar{x} - \mu\right)^{\mathsf T}\left(1 - 2t\right)\Sigma^{-1}\left(\bar{x} - \mu\right)\right)\mathrm{d}\bar{x} \\
&= \left(1 - 2t\right)^{-\frac{K}{2}}
\end{aligned}
\]

where the integral in the third line disappears because it is the expression of the probability density function of a multivariate normal distribution with mean $\mu$ and variance-covariance $\Sigma/\left[\left(1 - 2t\right)N\right]$. The result is the moment-generating function of a chi-squared distribution with parameter $K$.
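A simulation sketch of Theorem 4.5 with an arbitrary bivariate normal population (an illustrative aside; the χ²_K benchmark is summarized by its mean K and variance 2K):

```python
import numpy as np

# Monte Carlo sketch of Theorem 4.5 with an arbitrary bivariate normal
# population: the u-statistic should behave like a chi-squared with K = 2
# degrees of freedom (mean K, variance 2K).
rng = np.random.default_rng(7)
K, N, R = 2, 15, 100_000

mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
Sinv = np.linalg.inv(Sigma)

x = rng.multivariate_normal(mu, Sigma, size=(R, N))    # R samples of size N
dev = x.mean(axis=1) - mu                              # R x K deviations
u = N * np.einsum('rk,kl,rl->r', dev, Sinv, dev)       # quadratic form per sample

mean_ok = np.isclose(u.mean(), K, atol=0.05)
var_ok = np.isclose(u.var(), 2 * K, atol=0.15)
```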
While this is a nice theoretical result, its direct applications are rather
limited, since in applied statistical analysis the variance-covariance matrix
Σ is seldom known a-priori. Just like in the univariate case, the immediate
temptation is to substitute Σ with its corresponding sample statistic, the
sample variance-covariance S. This gives rise to yet another statistic, which
intuitively is the multivariate generalization of the t-statistic.


Definition 4.12. Hotelling's "t-squared" statistic. Given some multivariate sample $\{x_i\}_{i=1}^{N}$ of size $N$ drawn from a sequence of random vectors $x_1,\dots,x_N$, Hotelling's t-squared statistic is defined as the random variable:

\[
\begin{aligned}
t^2 &= N\left(\bar{x} - \mu\right)^{\mathsf T}S^{-1}\left(\bar{x} - \mu\right) \\
&= N\sum_{k=1}^{K}\sum_{\ell=1}^{K}S^{*-1}_{k\ell}\left(\bar{X}_k - \mu_k\right)\left(\bar{X}_\ell - \mu_\ell\right)
\end{aligned}
\]

where $\bar{x}$ is the sample mean whose expectation is $\mu = \mathrm{E}\left[x_i\right]$, $S$ is the sample variance-covariance, and $S^{*-1}_{k\ell}$ is the $k\ell$-th element of $S^{-1}$.
Another result, which is not proved here, states that the following rescaled version of Hotelling's t-squared statistic:

\[
\frac{N - K}{K\left(N - 1\right)}\,t^2 = \frac{N\left(N - K\right)}{K\left(N - 1\right)}\left(\bar{x} - \mu\right)^{\mathsf T}S^{-1}\left(\bar{x} - \mu\right) \sim F_{K, N-K}
\]

follows the F-distribution with paired degrees of freedom $K$ and $N - K$.²
This finding allows to conduct statistical inference on multivariate normal
samples through a well-known univariate distribution (see Lecture 5).
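The F-distribution result for the rescaled Hotelling statistic can likewise be sketched by simulation (illustrative, arbitrary choices of K and N; the benchmark is the F(K, N − K) mean, (N − K)/(N − K − 2)):

```python
import numpy as np

# Monte Carlo sketch (arbitrary K and N): the rescaled Hotelling statistic
# (N - K) / (K (N - 1)) * t^2 should follow F(K, N - K), whose mean is
# (N - K) / (N - K - 2).
rng = np.random.default_rng(11)
K, N, R = 2, 12, 20_000

mu = np.zeros(K)
x = rng.multivariate_normal(mu, np.eye(K), size=(R, N))
xbar = x.mean(axis=1)

stats = np.empty(R)
for r in range(R):
    S = np.cov(x[r], rowvar=False)                 # sample variance-covariance
    d = xbar[r] - mu
    stats[r] = (N - K) / (K * (N - 1)) * N * d @ np.linalg.inv(S) @ d

d2 = N - K
mean_ok = np.isclose(stats.mean(), d2 / (d2 - 2), atol=0.05)
```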

4.3 Order Statistics


In statistical analysis it is often useful to study the values that the realizations of a random variable take at a given position of the order of realizations (e.g. the smallest value, the largest value, et cetera). These are themselves realizations of random variables: more precisely, of the following statistics.
Definition 4.13. Order statistics. Consider some sample $\{x_i\}_{i=1}^{N}$ of realizations obtained from univariate random variables, $\{X_i\}_{i=1}^{N}$. Suppose that these values are placed in ascending order, where subscripts surrounded by parentheses denote one observation's position in the order:

\[
x_{(1)} \leq x_{(2)} \leq \dots \leq x_{(N)}
\]

thus $x_{(1)} = \min\{x_i\}_{i=1}^{N}$ and $x_{(N)} = \max\{x_i\}_{i=1}^{N}$. The $j$-th order statistic is the random variable, denoted as $X_{(j)}$, that generates the $j$-th realization in the above sequence, that is $x_{(j)}$. Any univariate sample has $N$ associated order statistics that must satisfy the following property.

\[
X_{(1)} \leq X_{(2)} \leq \dots \leq X_{(N)}
\]
² To prove this result one must develop the distribution of the random matrix S: the so-called Wishart distribution, which is outside the scope of this analysis.


Definition 4.14. Sample Minimum. The sample minimum is the first


order statistic, X(1) .

Definition 4.15. Sample Maximum. The sample maximum is the N -th


order statistic, X(N ) .

Definition 4.16. Sample Range. The sample range R is the difference


between the sample maximum and the sample minimum: R = X(N ) − X(1) .

Definition 4.17. Sample Median. The sample median M is a function of a sample's most central order statistics.

\[
M = \begin{cases}
X_{\left(\frac{N+1}{2}\right)} & \text{if } N \text{ is odd} \\[1ex]
\dfrac{1}{2}\left(X_{\left(\frac{N}{2}\right)} + X_{\left(\frac{N}{2}+1\right)}\right) & \text{if } N \text{ is even}
\end{cases}
\]

It is occasionally useful to analyze the sampling distribution of selected


order statistics. Fortunately, this task is simplified greatly if the sample is
random. In fact, the cumulative distribution of the j-th order statistic can
be expressed in terms of the following joint probability:

\[
F_{X_{(j)}}\left(x\right) = P\left(X_{(1)} \leq x \,\cap\, \dots \,\cap\, X_{(j)} \leq x\right)
\]

that is, for the j-th order statistic to be less or equal than some x, all the
inferior order statistics must also be less or equal than x (while the superior
ones can be larger, equal or lower than x). In a random sample, where all
the realizations obtain from independent and identically distributed random
variables, the above expression is considerably easier to evaluate.

Theorem 4.6. Sampling distribution of order statistics in a random sample. In a univariate random sample, the cumulative distribution of the $j$-th order statistic is based on the binomial distribution:

\[
F_{X_{(j)}}\left(x\right) = \sum_{k=j}^{N}\binom{N}{k}\left[F_X\left(x\right)\right]^k\left[1 - F_X\left(x\right)\right]^{N-k}
\]

where $F_X\left(x\right)$ is the cumulative distribution of the random variable $X$ that generates the sample. As two particular cases, the cumulative distributions of the minimum and the maximum are as follows.

\[
F_{X_{(1)}}\left(x\right) = 1 - \left[1 - F_X\left(x\right)\right]^N
\]
\[
F_{X_{(N)}}\left(x\right) = \left[F_X\left(x\right)\right]^N
\]


Proof. For at least j realizations to be less or equal than x, the event defined
as Xi ≤ x must occur an integer number of j ≤ k ≤ N times, whereas the
complementary event Xi > x must instead occur N −k times. If the sample
is random (i.i.d.), these two fundamental events occur with probabilities
that are constant across all realizations:
P (Xi ≤ x) = FX (x)
P (Xi > x) = 1 − FX (x)
and since realizations are independent, any joint combination of said events
can be expressed as the appropriate product of those probabilities. Clearly,
for a given k any joint events can be expressed through a binomial distribu-
tion, with the binomial coefficient counting all potential combinations with
k “successes” ($X_i \leq x$) and $N - k$ “failures” ($X_i > x$). Summing over the eligible values of k delivers the result sought after, of which the distributions
for the minimum and the maximum are special cases.
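The binomial formula of Theorem 4.6 can be checked against simulated order statistics; the sketch below (illustrative only, with arbitrary N, j, and evaluation point) uses uniform draws, for which F_X(x) = x:

```python
import numpy as np
from math import comb

# Monte Carlo sketch of Theorem 4.6 with uniform draws (F_X(x) = x): the
# simulated frequency of {X_(j) <= x0} is compared to the binomial formula.
rng = np.random.default_rng(8)
N, j, x0, R = 5, 3, 0.5, 200_000

order_stat = np.sort(rng.uniform(size=(R, N)), axis=1)[:, j - 1]
empirical = (order_stat <= x0).mean()
theoretical = sum(comb(N, k) * x0**k * (1 - x0)**(N - k)
                  for k in range(j, N + 1))

cdf_ok = np.isclose(empirical, theoretical, atol=0.01)
```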
Corollary. If $X$ is a continuous distribution with density function $f_X\left(x\right)$, the density function of the $j$-th order statistic is the following.

\[
f_{X_{(j)}}\left(x\right) = \frac{N!}{\left(j-1\right)!\left(N-j\right)!}\,f_X\left(x\right)\left[F_X\left(x\right)\right]^{j-1}\left[1 - F_X\left(x\right)\right]^{N-j}
\]

Proof. This follows from manipulating the derivative of the cumulative distribution $F_{X_{(j)}}\left(x\right)$. By the chain rule one obtains the density function:

\[
\begin{aligned}
f_{X_{(j)}}\left(x\right)
&= \frac{\mathrm{d}F_{X_{(j)}}\left(x\right)}{\mathrm{d}x} \\
&= \sum_{k=j}^{N}\binom{N}{k}\Big(k\left[F_X\left(x\right)\right]^{k-1}\left[1 - F_X\left(x\right)\right]^{N-k}f_X\left(x\right) \\
&\qquad\qquad - \left(N-k\right)\left[F_X\left(x\right)\right]^{k}\left[1 - F_X\left(x\right)\right]^{N-k-1}f_X\left(x\right)\Big) \\
&= \frac{N!}{\left(j-1\right)!\left(N-j\right)!}\,f_X\left(x\right)\left[F_X\left(x\right)\right]^{j-1}\left[1 - F_X\left(x\right)\right]^{N-j} \\
&\quad + \sum_{k=j+1}^{N}\binom{N}{k}k\left[F_X\left(x\right)\right]^{k-1}\left[1 - F_X\left(x\right)\right]^{N-k}f_X\left(x\right) \\
&\quad - \sum_{k=j}^{N}\binom{N}{k}\left(N-k\right)\left[F_X\left(x\right)\right]^{k}\left[1 - F_X\left(x\right)\right]^{N-k-1}f_X\left(x\right)
\end{aligned}
\]

where the third line obtains by isolating the term for $k = j$ in the summation that results from taking the derivative. All that is left to do is to show that the two "residual" summations in the third line cancel out. To this end, some additional manipulation is necessary. In particular, a re-indexing of the first of the two concerned residual summations, as well as the observation that the term for $k = N$ in the second residual summation equals zero, gives:

\[
\begin{aligned}
f_{X_{(j)}}\left(x\right)
&= \frac{N!}{\left(j-1\right)!\left(N-j\right)!}\,f_X\left(x\right)\left[F_X\left(x\right)\right]^{j-1}\left[1 - F_X\left(x\right)\right]^{N-j} \\
&\quad + \sum_{k=j}^{N-1}\binom{N}{k+1}\left(k+1\right)\left[F_X\left(x\right)\right]^{k}\left[1 - F_X\left(x\right)\right]^{N-k-1}f_X\left(x\right) \\
&\quad - \sum_{k=j}^{N-1}\binom{N}{k}\left(N-k\right)\left[F_X\left(x\right)\right]^{k}\left[1 - F_X\left(x\right)\right]^{N-k-1}f_X\left(x\right)
\end{aligned}
\]

and since, by simple manipulation of factorials, it is:

\[
\binom{N}{k+1}\left(k+1\right) = \frac{N!}{k!\left(N-k-1\right)!} = \binom{N}{k}\left(N-k\right)
\]

it clearly follows that the two terms do indeed cancel out.
This theorem is important, but its practical use is circumscribed as the
application of the above formulae seldom returns expressions that relate to
known distributions. One particular result, however, stands out.
Observation 4.1. Consider a random sample drawn from the standard continuous uniform distribution, $X \sim U\left(0, 1\right)$. The $j$-th order statistic is such that $X_{(j)} \sim \mathrm{Beta}\left(j, N - j + 1\right)$.

Proof. Since $F_X\left(x\right) = x$ and $f_X\left(x\right) = 1$ for $x \in \left(0, 1\right)$, while $f_X\left(x\right) = 0$ outside this interval, the density function of $X_{(j)}$ is, for $x \in \left(0, 1\right)$:

\[
\begin{aligned}
f_{X_{(j)}}\left(x\right)
&= \frac{N!}{\left(j-1\right)!\left(N-j\right)!}\,x^{j-1}\left(1 - x\right)^{N-j} \\
&= \frac{\Gamma\left(N+1\right)}{\Gamma\left(j\right)\Gamma\left(N-j+1\right)}\,x^{j-1}\left(1 - x\right)^{\left(N-j+1\right)-1} \\
&= \frac{1}{B\left(j, N-j+1\right)}\,x^{j-1}\left(1 - x\right)^{\left(N-j+1\right)-1}
\end{aligned}
\]

where the second line follows from the properties of the Gamma function; the result is the density function of the postulated Beta distribution.
This result, in conjunction with Theorem 1.13, allows to derive the sampling
distribution of the order statistic of a sample of percentiles p drawn from
some known distribution, as p ∼ U (0, 1).
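A simulation sketch of Observation 4.1 (illustrative, with arbitrary N and j), comparing the j-th uniform order statistic to the mean j/(N + 1) of the Beta(j, N − j + 1) distribution:

```python
import numpy as np

# Monte Carlo sketch of Observation 4.1 (arbitrary N and j): the j-th order
# statistic of N standard uniform draws is Beta(j, N - j + 1), whose mean
# is j / (N + 1).
rng = np.random.default_rng(10)
N, j, R = 5, 2, 200_000

stat = np.sort(rng.uniform(size=(R, N)), axis=1)[:, j - 1]
beta_mean_ok = np.isclose(stat.mean(), j / (N + 1), atol=0.01)
```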


Other relevant results are, instead, restricted to the two extreme order statistics: the minimum and the maximum. In particular, certain distributions have the useful feature of returning – in random samples – minima and maxima that follow a distribution in the same sub-family.
Definition 4.18. Extreme order statistics (min-max) stability. Con-
sider a random sample drawn from some known distribution. If the sample
minimum (maximum) follows another distribution of the same family, that
distribution is said to be min-stable (max-stable).
For example, the exponential distribution is notoriously min-stable.
Observation 4.2. Consider a random sample drawn from the exponential distribution with parameter λ, X ∼ Exp(λ). The first order statistic – the minimum – is such that X_(1) ∼ Exp(N^{-1}λ).

Proof. By applying the formula for the distribution of the minimum:

\[
\begin{aligned}
F_{X_{(1)}}(x; \lambda, N) &= 1 - \left[1 - F_X(x; \lambda)\right]^N\\
&= 1 - \left[\exp\left(-\frac{1}{\lambda}x\right)\right]^N\\
&= 1 - \exp\left(-\frac{N}{\lambda}x\right)
\end{aligned}
\]

the postulated (cumulative) distribution is obtained straightforwardly.
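A quick Monte Carlo check of this observation (the values of λ, N, the replication count, and the seed are illustrative; note that Python's `random.expovariate` takes the rate 1/λ rather than the mean λ used in the text's parametrization):

```python
import random
import statistics

# Monte Carlo check of Observation 4.2: the minimum of N i.i.d. draws
# from Exp(λ), with λ parametrized as the mean (CDF 1 - exp(-x/λ)),
# should again be exponential with mean λ/N.
random.seed(0)
lam, N, reps = 2.0, 4, 100_000

minima = [min(random.expovariate(1 / lam) for _ in range(N)) for _ in range(reps)]

mean_min = statistics.fmean(minima)
print(mean_min)  # should be close to λ/N = 0.5
```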
All three types of distributions in the GEV family, instead, are max-stable:
this is another motivation for their collective name.
Observation 4.3. Consider a random sample drawn from the Type I GEV (Gumbel) distribution with parameters µ and σ, X ∼ EV1(µ, σ). The top order statistic – the maximum – is such that X_(N) ∼ EV1(µ + σ log(N), σ).

Proof. By applying the formula for the distribution of the maximum:

\[
\begin{aligned}
F_{X_{(N)}}(x; \mu, \sigma, N) &= \left[F_X(x; \mu, \sigma)\right]^N\\
&= \left[\exp\left(-\exp\left(-\frac{x-\mu}{\sigma}\right)\right)\right]^N\\
&= \exp\left(-N\exp\left(-\frac{x-\mu}{\sigma}\right)\right)\\
&= \exp\left(-\exp\left(-\frac{x-\mu-\sigma\log(N)}{\sigma}\right)\right)
\end{aligned}
\]

one obtains the Gumbel cumulative distribution that was argued.


Observation 4.4. Consider a random sample drawn from the Type II GEV (Fréchet) distribution with parameters α, µ, and σ, Y ∼ EV2(α, µ, σ). The top order statistic – the maximum – is such that Y_(N) ∼ EV2(α, µ, σN^{1/α}). The result is identical in random sampling from the Type III GEV (reverse Weibull) case: with Y ∼ EV3(α, µ, σ), it is Y_(N) ∼ EV3(α, µ, σN^{1/α}).

Proof. Here, applying the formula for the distribution of the maximum:

\[
\begin{aligned}
F_{Y_{(N)}}(y; \alpha, \mu, \sigma, N) &= \left[F_Y(y; \alpha, \mu, \sigma)\right]^N\\
&= \left[\exp\left(-\left(\frac{y-\mu}{\sigma}\right)^{-\alpha}\right)\right]^N\\
&= \exp\left(-\left(N^{-\frac{1}{\alpha}}\,\frac{y-\mu}{\sigma}\right)^{-\alpha}\right)
\end{aligned}
\]

is equally valid to show both the Fréchet and the reverse Weibull results.
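The Fréchet part of Observation 4.4 can be checked numerically via its CDF. The sketch below assumes the Fréchet CDF F(y) = exp(−((y − µ)/σ)^{−α}) with µ = 0, draws Fréchet variates by inverting the CDF, and compares the empirical probability that the sample maximum falls below a threshold with the CDF of the rescaled distribution (all parameter values, the threshold, and the seed are illustrative choices):

```python
import math
import random

# Monte Carlo check: with α = 2, µ = 0, σ = 1 and N = 4, the maximum
# should be Fréchet with scale σ N^(1/α) = 2, so that
# P(max ≤ t) = exp(-(t / 2)^(-α)).
random.seed(0)
alpha, sigma, N, reps, t = 2.0, 1.0, 4, 100_000, 3.0

def frechet_draw():
    # Invert F(y) = exp(-(y / sigma) ** (-alpha)) at a uniform draw
    u = random.random()
    return sigma * (-math.log(u)) ** (-1 / alpha)

hits = sum(max(frechet_draw() for _ in range(N)) <= t for _ in range(reps))

empirical = hits / reps
scale_max = sigma * N ** (1 / alpha)  # scale of the maximum's distribution
theoretical = math.exp(-((t / scale_max) ** (-alpha)))
print(f"{empirical:.4f} vs {theoretical:.4f}")
```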
A last observation about the traditional Weibull distribution – which should
be intuitive in light of the relationships between the traditional Weibull, the
exponential, and the GEV distributions – is presented next.
Observation 4.5. Consider a random sample drawn from the traditional Weibull distribution with parameters α, µ, and σ, W ∼ Weibull(α, µ, σ). The sample minimum is such that W_(1) ∼ Weibull(α, µ, σN^{1/α}).

Proof. Things proceed similarly to the Fréchet and reverse Weibull cases:

\[
\begin{aligned}
F_{W_{(1)}}(w; \alpha, \mu, \sigma, N) &= 1 - \left[1 - F_W(w; \alpha, \mu, \sigma)\right]^N\\
&= 1 - \left[\exp\left(-\left(\frac{w-\mu}{\sigma}\right)^{-\alpha}\right)\right]^N\\
&= 1 - \exp\left(-\left(N^{-\frac{1}{\alpha}}\,\frac{w-\mu}{\sigma}\right)^{-\alpha}\right)
\end{aligned}
\]

The difference is that here, the formula for the minimum is applied.
It is difficult to identify other situations where the exact distribution of an order statistic of interest can be computed and related to some known common distribution. In an asymptotic environment things are different, as the so-called Extreme Value Theorem (see Lecture 6) allows one to circumscribe the set of sampling distributions of order statistics to the three different types of the GEV family – so long as the sample size is large enough. Along with the above "stability" results, the theorem in question motivates the use of the GEV distributions for modeling extreme order statistics.


4.4 Sufficient Statistics


After introducing the concept of a sample and the statistics that summarize
its characteristics – the sample mean, the sample variance-covariance, and
the order statistics – the next logical step is to use the sample in order to
learn important facts about the population from which the sample is drawn.
The final objective is to perform evaluations and tests about certain features
of the probability distribution that generates the sample, such as selected
parameters or moments; these exercises are known as statistical estimation
and inference. Before reaching that point, however, it is useful to recognize
that in selected situations – quite frequent ones actually – certain statistics
help simplify statistical evaluations. These are called sufficient statistics
and this long section is specifically devoted to them. A definition follows.
Definition 4.19. Sufficient statistics. Consider a sample generated by
a list of random vectors (x1 , . . . , xN ). Suppose that the joint distribution
of the sample depends, among other things, on some parameter θ; write the
associated probability mass or density function as fx1 ,...,xN (x1 , . . . , xN ; θ).
A statistic T = T (x1 , . . . , xN ) is said to be sufficient if the joint distribution
of the sample, conditional on it, does not depend on θ:
\[
f_{\mathbf{x}_1,\dots,\mathbf{x}_N \mid T}\left(\mathbf{x}_1,\dots,\mathbf{x}_N \mid T(\mathbf{x}_1,\dots,\mathbf{x}_N)\right) = \frac{f_{\mathbf{x}_1,\dots,\mathbf{x}_N}(\mathbf{x}_1,\dots,\mathbf{x}_N; \theta)}{q_T(T(\mathbf{x}_1,\dots,\mathbf{x}_N); \theta)}
\]
where qT (T (x1 , . . . , xN ) ; θ) is the probability mass or density function of
the sufficient statistic in question.
Note that the parameter θ disappears from the expression of the conditional
density on the left-hand side above. Equivalently, this can be expressed by
saying that the joint conditional density is constant as a function of θ.
The usefulness of sufficient statistics is that they intuitively “exhaust” all
the information about θ that is contained in a sample. This aids estimation
and inference in various ways which are best appreciated in more advanced
treatments (some related results are mentioned in the subsequent Lecture).
The role of sufficient statistics in inference is summarized by the following
statistical principle, that is, a postulate (axiom) of statistical analysis.
Statistical Principle 1. Sufficiency. If T = T (x1 , . . . , xN ) is a sufficient
statistic for a parameter θ, any evaluation about the latter should depend
solely on the sufficient statistic or a function thereof. That is, if (x1 , . . . , xN )
and (y1 , . . . , yN ) are two – possibly different – sample realizations such that
T (x1 , . . . , xN ) = T (y1 , . . . , yN ), all evaluations about θ should be identical
regardless of the exact observed values in either realization.


Examples are useful to build intuition; it is best to start from simpler ones.
Example 4.1. A sufficient statistic for the Bernoulli parameter p.
Suppose that a random (i.i.d.) sample is obtained from a random variable
X following the Bernoulli distribution with parameter p, or X ∼ Be (p). It
turns out that the statistic counting the number of "successes," defined as:
\[
T = T(X_1, \dots, X_N) = \sum_{i=1}^{N} X_i
\]

is a sufficient statistic for p. This is shown by applying the definition, after


observing that T ∼ BN (p, N ). Call t the realization of T , that is the value
of the sufficient statistic evaluated in terms of the N actual realizations of
X that are observed in the sample:
\[
t = T(x_1, \dots, x_N) = \sum_{i=1}^{N} x_i
\]

thus:

\[
\begin{aligned}
\frac{f_{X_1,\dots,X_N}(x_1,\dots,x_N; p)}{q_T(t; p, N)} &= \frac{\prod_{i=1}^{N} p^{x_i}(1-p)^{1-x_i}}{\binom{N}{t}\, p^{t}(1-p)^{N-t}}\\
&= \frac{p^{t}(1-p)^{N-t}}{\binom{N}{t}\, p^{t}(1-p)^{N-t}}\\
&= \frac{t!\,(N-t)!}{N!}
\end{aligned}
\]
where the first equality follows since the sample is random and the second
from the definition of t. Thus it is proved that the distribution of the sample,
conditional on the sufficient statistic, does not depend on p – as postulated.
The intuition for this is that upon knowing the number of “successes” t over
N attempts, there is no other information in the sample that helps “learn”
(perform inference) about the parameter p. 
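The conclusion of Example 4.1 – that the conditional distribution of the sample given T is free of p – can also be verified by brute-force enumeration. In the sketch below, N, t, and the two values of p are arbitrary illustrative choices:

```python
from itertools import product
from math import comb, isclose

N, t = 4, 2

def conditional_prob(sample, p):
    # Joint Bernoulli mass of the sample, divided by the mass of T ~ BN(p, N)
    joint = 1.0
    for x in sample:
        joint *= p if x == 1 else 1 - p
    q_T = comb(N, t) * p**t * (1 - p) ** (N - t)
    return joint / q_T

# Every arrangement with t "successes" has conditional probability
# t!(N - t)!/N! = 1/C(N, t), whatever the value of p.
for sample in product([0, 1], repeat=N):
    if sum(sample) == t:
        for p in (0.3, 0.7):
            assert isclose(conditional_prob(sample, p), 1 / comb(N, t))
print("conditional probabilities equal t!(N - t)!/N! for every p")
```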
Example 4.2. A sufficient statistic for µ in the normal distribution.
Suppose that a random (i.i.d.) sample is obtained from a random variable
X following the normal distribution with location parameter µ and scale
parameter σ2 , or X ∼ N (µ, σ2 ). The sample mean X̄ is a sufficient statistic
for the mean parameter µ. The demonstration proceeds as in the previous
case, recalling now that X̄ ∼ N (µ, σ2 /N ). Similarly as above, it is useful
to define the actual realization of the sample mean in the data.
\[
\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i
\]


Proving the claim thus proceeds as above:

\[
\begin{aligned}
\frac{f_{X_1,\dots,X_N}(x_1,\dots,x_N; \mu, \sigma^2)}{q_{\bar{X}}(\bar{x}; \mu, \sigma^2/N)} &= \frac{\displaystyle\prod_{i=1}^{N} \sqrt{(2\pi\sigma^2)^{-1}}\, \exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right)}{\sqrt{(2\pi\sigma^2)^{-1} N}\, \exp\left(-\frac{N(\bar{x}-\mu)^2}{2\sigma^2}\right)}\\
&= \frac{\sqrt{(2\pi\sigma^2)^{-N}}\, \exp\left(-\displaystyle\sum_{i=1}^{N}\frac{(x_i-\mu)^2}{2\sigma^2}\right)}{\sqrt{(2\pi\sigma^2)^{-1} N}\, \exp\left(-\frac{N(\bar{x}-\mu)^2}{2\sigma^2}\right)}\\
&= \frac{\exp\left(-\displaystyle\sum_{i=1}^{N}\frac{(x_i-\bar{x})^2}{2\sigma^2} - \frac{N(\bar{x}-\mu)^2}{2\sigma^2}\right)}{\sqrt{(2\pi\sigma^2)^{N-1} N}\, \exp\left(-\frac{N(\bar{x}-\mu)^2}{2\sigma^2}\right)}\\
&= \frac{\exp\left(-\displaystyle\sum_{i=1}^{N}\frac{(x_i-\bar{x})^2}{2\sigma^2}\right)}{\sqrt{(2\pi\sigma^2)^{N-1} N}}
\end{aligned}
\]

where the third line obtains since:

\[
\sum_{i=1}^{N}(x_i-\mu)^2 = \sum_{i=1}^{N}(x_i-\bar{x})^2 + 2(\bar{x}-\mu)\underbrace{\sum_{i=1}^{N}(x_i-\bar{x})}_{=0} + N(\bar{x}-\mu)^2
\]

which is a decomposition similar to that used to prove part c. of Theorem


4.4. Again, the final expression of the ratio does not depend on the location
parameter µ, however it does depend on σ2 . The intuition is that the sample
mean provides all the information that the sample can deliver about the
average of the population being sampled, but it does not provide enough
knowledge about its overall spread or variation. 
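The decomposition used in the last step of the derivation is easy to verify numerically on any concrete sample (the sample values and µ below are arbitrary illustrative choices):

```python
from math import isclose

# Check that Σ(x_i - µ)² = Σ(x_i - x̄)² + N(x̄ - µ)², since the cross
# term 2(x̄ - µ)Σ(x_i - x̄) vanishes by construction of the sample mean.
sample = [1.4, -0.3, 2.1, 0.8, 0.0]
mu = 0.5
N = len(sample)
x_bar = sum(sample) / N

lhs = sum((x - mu) ** 2 for x in sample)
rhs = sum((x - x_bar) ** 2 for x in sample) + N * (x_bar - mu) ** 2

assert isclose(lhs, rhs)
print(lhs, rhs)
```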

Example 4.3. A sufficient statistic for the uniform distribution. An order statistic can be sufficient as well. Suppose that a random (i.i.d.) sample is obtained from a random variable X following the uniform distribution whose support has infimum zero and supremum equal to an unknown parameter θ: X ∼ U(0, θ). The maximum X_(N) is indeed a sufficient statistic


for θ! Showing this result is quite simple: calling x_(N) = max{x_1, ..., x_N} the realization of X_(N), and writing the latter's density function as

\[
\begin{aligned}
q_{X_{(N)}}\left(x_{(N)}; \theta\right) &= \frac{\mathrm{d}}{\mathrm{d}x_{(N)}}\left(\frac{x_{(N)}}{\theta}\right)^{N} \cdot \mathbb{1}\left[x_{(N)} \in (0, \theta)\right]\\
&= \frac{N x_{(N)}^{N-1}}{\theta^{N}} \cdot \mathbb{1}\left[x_{(N)} \in (0, \theta)\right]
\end{aligned}
\]

it is straightforward to see that:

\[
\frac{f_{X_1,\dots,X_N}(x_1,\dots,x_N; \theta)}{q_{X_{(N)}}\left(x_{(N)}; \theta\right)} = \frac{1}{N x_{(N)}^{N-1}} \cdot \mathbb{1}\left[x_{(N)} \in (0, \theta)\right]
\]

as f_{X_1,...,X_N}(x_1, ..., x_N; θ) = [f_X(x; θ)]^N = θ^{−N} because the sample is random. Again, the result is intuitive: since the support of the uniform distribution is bounded above, the highest value found in the sample is the most informative about the limit of the support.
Sometimes, it is difficult to verify that a statistic is effectively sufficient
for a certain parameter of interest. Fortunately, the following theorem often
helps simplify the analysis.
Theorem 4.7. Fisher-Neyman’s Factorization Theorem. Consider a
sample generated by a list of random vectors (x1 , . . . , xN ), whose joint dis-
tribution has mass or density function fx1 ,...,xN (x1 , . . . , xN ; θ) that also de-
pends on some parameter θ. A statistic T = T (x1 , . . . , xN ) is sufficient for
θ if and only if it is possible to identify two functions g (T (x1 , . . . , xN ) ; θ)
and h (x1 , . . . , xN ) such that the following holds.

fx1 ,...,xN (x1 , . . . , xN ; θ) = g (T (x1 , . . . , xN ) ; θ) · h (x1 , . . . , xN )

Observe that function g (T (x1 , . . . , xN ) ; θ) depends on θ, however function


h (x1 , . . . , xN ) does not.
Proof. The logic of the proof is best illustrated in the discrete case, and it is helpful to start from there, beginning with the "necessity" part of the theorem: if the above factorization exists, then T is sufficient for θ. Write the mass function of T
as qT (T (x1 , . . . , xN ) ; θ). Furthermore, define the set of vectors spanning
the same space as (x1 , . . . , xN ) and that result in the same values for T , as
follows.

AT (x1 , . . . , xN ) ≡ {y1 , . . . , yN : T (x1 , . . . , xN ) = T (y1 , . . . , yN )}

147
4.4. Sufficient Statistics

By the property of probability functions, it is:

\[
\begin{aligned}
q_T(T(\mathbf{x}_1,\dots,\mathbf{x}_N); \theta) &= \sum_{\mathbf{y}_1,\dots,\mathbf{y}_N \in A_T} f_{\mathbf{x}_1,\dots,\mathbf{x}_N}(\mathbf{y}_1,\dots,\mathbf{y}_N; \theta)\\
&= \sum_{\mathbf{y}_1,\dots,\mathbf{y}_N \in A_T} g(T(\mathbf{y}_1,\dots,\mathbf{y}_N); \theta) \cdot h(\mathbf{y}_1,\dots,\mathbf{y}_N)
\end{aligned}
\]

and since T(y_1, ..., y_N) is constant in A_T(x_1, ..., x_N), it is:

\[
q_T(T(\mathbf{x}_1,\dots,\mathbf{x}_N); \theta) = g(T(\mathbf{x}_1,\dots,\mathbf{x}_N); \theta) \sum_{\mathbf{y}_1,\dots,\mathbf{y}_N \in A_T} h(\mathbf{y}_1,\dots,\mathbf{y}_N)
\]

where in both cases A_T is shorthand notation for A_T(x_1, ..., x_N). Then:

\[
\begin{aligned}
\frac{f_{\mathbf{x}_1,\dots,\mathbf{x}_N}(\mathbf{x}_1,\dots,\mathbf{x}_N; \theta)}{q_T(T(\mathbf{x}_1,\dots,\mathbf{x}_N); \theta)} &= \frac{g(T(\mathbf{x}_1,\dots,\mathbf{x}_N); \theta) \cdot h(\mathbf{x}_1,\dots,\mathbf{x}_N)}{q_T(T(\mathbf{x}_1,\dots,\mathbf{x}_N); \theta)}\\
&= \frac{h(\mathbf{x}_1,\dots,\mathbf{x}_N)}{\sum_{\mathbf{y}_1,\dots,\mathbf{y}_N \in A_T} h(\mathbf{y}_1,\dots,\mathbf{y}_N)}
\end{aligned}
\]

as g(T(x_1, ..., x_N); θ) simplifies in the right-hand side's ratio; the latter no longer depends on θ, indicating that T is a sufficient statistic.
To prove that if T is sufficient then the factorization of interest holds – the "sufficiency" part of the theorem, a term which is unfortunate here – it is convenient, in the discrete case, to recall the interpretation of a joint mass function as a probability function:

\[
\begin{aligned}
f_{\mathbf{x}_1,\dots,\mathbf{x}_N}(\mathbf{x}_1,\dots,\mathbf{x}_N; \theta) &= \mathbb{P}\left(\bigcap_{i=1}^{N} \left\{\mathbf{x}_i = \mathbf{x}_i\right\}; \theta\right)\\
&= \mathbb{P}\left(\bigcap_{i=1}^{N} \left\{\mathbf{x}_i = \mathbf{x}_i\right\} \,\middle|\, T = T(\mathbf{x}_1,\dots,\mathbf{x}_N)\right) \times \mathbb{P}(T = T(\mathbf{x}_1,\dots,\mathbf{x}_N); \theta)\\
&= h(\mathbf{x}_1,\dots,\mathbf{x}_N) \cdot q_T(T(\mathbf{x}_1,\dots,\mathbf{x}_N); \theta)
\end{aligned}
\]

where the second line follows from the definition of conditional probability, while the third just renames the previous probability functions, noting that the conditional probability of the sample given T can be expressed as some generic function h(x_1, ..., x_N) that does not depend on θ, by definition of a sufficient statistic.
(Sketched.) In the continuous case the logic of the proof is analogous;
however, the proper demonstration requires the use of advanced measure


theory. Thus, the analysis is only outlined for a restricted case that can be related to the discrete case above, and is still general enough to cover many concrete situations. Suppose that there is a list of bijective and differentiable transformations that does not depend on θ, denoted as:

\[
\begin{pmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \\ \vdots \\ \mathbf{y}_N \end{pmatrix} = \begin{pmatrix} g_1(\mathbf{x}_1,\dots,\mathbf{x}_N) \\ g_2(\mathbf{x}_1,\dots,\mathbf{x}_N) \\ \vdots \\ g_N(\mathbf{x}_1,\dots,\mathbf{x}_N) \end{pmatrix}
\]

where one element of this list, say the first element Y_11 of y_1, is fixed as Y_11 = T(x_1, ..., x_N) by construction. In addition, write the corresponding inverse transformation as follows.

\[
\begin{pmatrix} \mathbf{w}_1 \\ \mathbf{w}_2 \\ \vdots \\ \mathbf{w}_N \end{pmatrix} = \begin{pmatrix} g_1^{-1}(\mathbf{y}_1,\dots,\mathbf{y}_N) \\ g_2^{-1}(\mathbf{y}_1,\dots,\mathbf{y}_N) \\ \vdots \\ g_N^{-1}(\mathbf{y}_1,\dots,\mathbf{y}_N) \end{pmatrix}
\]
To show necessity, write the joint density of the transformation as:

\[
\begin{aligned}
f_{\mathbf{y}_1,\dots,\mathbf{y}_N}(\mathbf{y}_1,\dots,\mathbf{y}_N; \theta) &= f_{\mathbf{x}_1,\dots,\mathbf{x}_N}(\mathbf{w}_1,\dots,\mathbf{w}_N; \theta) \cdot |\mathbf{J}^*|\\
&= g(T(\mathbf{w}_1,\dots,\mathbf{w}_N); \theta) \cdot h(\mathbf{w}_1,\dots,\mathbf{w}_N) \cdot |\mathbf{J}^*|\\
&= g(y_{11}; \theta) \cdot h(\mathbf{w}_1,\dots,\mathbf{w}_N) \cdot |\mathbf{J}^*|
\end{aligned}
\]

where |J*| is shorthand notation for the absolute value of the Jacobian of the inverse transformation, and the second line follows from the hypothesis. It is obvious that the marginal distribution of Y_11 – that is, the density function q_T(T(x_1, ..., x_N); θ) of the statistic of interest T – inherits a factorization analogous to the above; thus, since y_11 = T(x_1, ..., x_N), it can be shown that the ratio between the joint density of the sample and the density of T does not depend on θ, hence T is sufficient. To prove that if T is sufficient then a proper factorization can be expressed (the "sufficiency" part of the theorem), apply the definition of conditional density functions to show that:

\[
f_{\mathbf{y}_1,\dots,\mathbf{y}_N}(\mathbf{y}_1,\dots,\mathbf{y}_N; \theta) = q_T(y_{11}; \theta) \cdot f_{\{\mathbf{y}_1,\dots,\mathbf{y}_N\}\setminus Y_{11}}\left(\{\mathbf{y}_1,\dots,\mathbf{y}_N\} \setminus y_{11} \mid Y_{11}\right)
\]

where the notation {·} \ Y_11 denotes a list that excludes Y_11. Dividing both sides of the above by |J*| returns the desired factorization for:

\[
h(\mathbf{x}_1,\dots,\mathbf{x}_N) = \frac{f_{\{\mathbf{y}_1,\dots,\mathbf{y}_N\}\setminus Y_{11}}\left(\{\mathbf{y}_1,\dots,\mathbf{y}_N\} \setminus y_{11} \mid Y_{11}\right)}{|\mathbf{J}^*|}
\]

and for g(T(x_1, ..., x_N); θ) = q_T(T(x_1, ..., x_N); θ).


The factorization theorem can be easily applied to cases like the previous
examples. However, it is particularly useful to show that multiple statistics
are simultaneously sufficient for a given number of associated parameters.
This is usually expressed through a vector of statistics t(x_1, ..., x_N):

\[
\mathbf{t}(\mathbf{x}_1,\dots,\mathbf{x}_N) = \begin{pmatrix} T_1(\mathbf{x}_1,\dots,\mathbf{x}_N) \\ T_2(\mathbf{x}_1,\dots,\mathbf{x}_N) \\ \vdots \\ T_K(\mathbf{x}_1,\dots,\mathbf{x}_N) \end{pmatrix}
\]

which are said to be simultaneously sufficient for a vector of parameters θ:

\[
\boldsymbol{\theta} = \begin{pmatrix} \theta_1 \\ \theta_2 \\ \vdots \\ \theta_J \end{pmatrix}
\]

where, in general, it may well be that K ≠ J. The factorization theorem can be extended to allow for g(t(x_1, ..., x_N); θ) to be the joint density of all the statistics in question and for a multidimensional parameter vector.
Example 4.4. Two sufficient statistics for µ and σ² in the normal distribution. Let us revisit Example 4.2. There, the factorization theorem can be expressed for:

\[
g(\bar{x}; \mu) = \exp\left(-\frac{N(\bar{x}-\mu)^2}{2\sigma^2}\right)
\]

and:

\[
h(x_1,\dots,x_N) = \left(\frac{1}{2\pi\sigma^2}\right)^{\frac{N}{2}} \exp\left(-\sum_{i=1}^{N}\frac{(x_i-\bar{x})^2}{2\sigma^2}\right)
\]

and the product of these functions returns the joint density of the sample X_1, ..., X_N. Observe, however, that both expressions still incorporate the parameter σ². To obtain a sufficient statistic for it, it is intuitive to think of the sample variance S², whose realization is usually written as follows.

\[
s^2 = \frac{1}{N-1}\sum_{i=1}^{N}(x_i-\bar{x})^2
\]

It is easy to verify that the factorization theorem here applies with:

\[
g\left(\bar{x}, s^2; \mu, \sigma^2\right) = \left(\frac{1}{\sigma^2}\right)^{\frac{N}{2}} \exp\left(-\frac{N(\bar{x}-\mu)^2 + (N-1)s^2}{2\sigma^2}\right)
\]

and h(x_1, ..., x_N) = (2π)^{−N/2}, implying that the pair of statistics (X̄, S²) is simultaneously sufficient for the vector of parameters (µ, σ²). This result is intuitive again, since σ² equals the variance in the normal case.
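The factorization claimed in this example can be checked numerically: the sketch below (the sample values and parameters are arbitrary illustrative choices) compares the joint normal density of a sample with the product g(x̄, s²; µ, σ²) · h(x_1, ..., x_N):

```python
import math

# Numerical check of the normal factorization: joint density = g * h,
# with g depending on the data only through x̄ and s².
sample = [0.2, -1.1, 0.7, 1.5]
mu, sigma2 = 0.3, 1.6
N = len(sample)

x_bar = sum(sample) / N
s2 = sum((x - x_bar) ** 2 for x in sample) / (N - 1)

# Joint density of the sample under N(mu, sigma2)
joint = math.prod(
    math.exp(-((x - mu) ** 2) / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)
    for x in sample
)

g = (1 / sigma2) ** (N / 2) * math.exp(
    -(N * (x_bar - mu) ** 2 + (N - 1) * s2) / (2 * sigma2)
)
h = (2 * math.pi) ** (-N / 2)

assert math.isclose(joint, g * h)
print(joint, g * h)
```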
Example 4.5. Sufficient statistics for µ and Σ in the multivariate normal distribution. Suppose that a random (i.i.d.) sample is obtained from a random vector x following the multivariate normal distribution with parameters collected as µ and Σ: x ∼ N(µ, Σ). The multivariate sample mean x̄ = (X̄_1, ..., X̄_K) is a collection of sufficient statistics for the mean vector µ = (µ_1, ..., µ_K), thus extending the result from Example 4.2 to the multivariate case. More accurately, one should say that the vector of sample means x̄ features K statistics that are simultaneously sufficient for the K parameters contained in the mean vector µ.

In order to show this, one must observe that x̄ ∼ N(µ, Σ/N) by the previous results on the multivariate sample mean and the linear properties of the normal distribution. Also define – similarly to the univariate case – the realization of the sample mean as follows.

\[
\bar{\mathbf{x}} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i
\]

The result from Example 4.2 can now be expressed as:

\[
\frac{f_{\mathbf{x}_1,\dots,\mathbf{x}_N}(\mathbf{x}_1,\dots,\mathbf{x}_N; \boldsymbol{\mu}, \boldsymbol{\Sigma})}{q_{\bar{\mathbf{x}}}(\bar{\mathbf{x}}; \boldsymbol{\mu}, \boldsymbol{\Sigma}/N)} = \frac{\exp\left(-\dfrac{1}{2}\displaystyle\sum_{i=1}^{N}(\mathbf{x}_i-\bar{\mathbf{x}})^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}(\mathbf{x}_i-\bar{\mathbf{x}})\right)}{\sqrt{\left[(2\pi)^K |\boldsymbol{\Sigma}|\right]^{N-1} N^K}}
\]

where K is, as usual, the dimension of the random vector x; the intermediate steps involve some tedious linear algebra. The intuition is that every random variable listed in x, say X_k, follows a marginal distribution which is normal with location parameter µ_k; hence the sample mean X̄_k – which is listed in x̄ – exhausts all the information contained in the sample about that particular parameter, and this holds simultaneously for all k = 1, ..., K.
In analogy with the univariate case, these observations can be pushed even further by claiming that the vector of sample means x̄ and the sample variance-covariance S are simultaneously sufficient for all the parameters of the multivariate normal distribution, (µ, Σ). The realization of the sample variance-covariance, written as S, is as follows.

\[
\mathbf{S} = \frac{1}{N-1}\sum_{i=1}^{N}(\mathbf{x}_i-\bar{\mathbf{x}})(\mathbf{x}_i-\bar{\mathbf{x}})^{\mathrm{T}}
\]

By means of some algebraic manipulation, one can show that the function:³

\[
g(\bar{\mathbf{x}}, \mathbf{S}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{|\boldsymbol{\Sigma}|^{\frac{N}{2}}} \exp\left(-\frac{N}{2}(\bar{\mathbf{x}}-\boldsymbol{\mu})^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}(\bar{\mathbf{x}}-\boldsymbol{\mu}) - \frac{N-1}{2}\operatorname{tr}\left(\boldsymbol{\Sigma}^{-1}\mathbf{S}\right)\right)
\]

complies with the factorization theorem along with h(x_1, ..., x_N) = (2π)^{−NK/2}. To develop intuition, it is important to recall that the matrix Σ not only features the K variances of each normally distributed random variable listed in x, but also the K(K − 1)/2 covariances. The sample variance-covariance S provides appropriate sufficient statistics for all these parameters.

Example 4.6. Two sufficient statistics for the uniform distribution.


Suppose that a random (i.i.d.) sample is obtained from a random variable
X following the uniform distribution with unknown support X ∼ U (α, β).
Here, the bounds of the support are written with the upright Greek letters α
and β to remark that they shall be treated as parameters. The two extreme
³ The calculations are as follows. The joint density of the sample is:

\[
f_{\mathbf{x}_1,\dots,\mathbf{x}_N}(\mathbf{x}_1,\dots,\mathbf{x}_N; \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \left(\frac{1}{(2\pi)^K|\boldsymbol{\Sigma}|}\right)^{\frac{N}{2}} \exp\left(-\frac{1}{2}\sum_{i=1}^{N}(\mathbf{x}_i-\boldsymbol{\mu})^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}(\mathbf{x}_i-\boldsymbol{\mu})\right)
\]

where, by recognizing that (x_i − µ) = (x_i − x̄ + x̄ − µ), the term inside the exponential develops as follows.

\[
\begin{aligned}
\sum_{i=1}^{N}(\mathbf{x}_i-\boldsymbol{\mu})^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}(\mathbf{x}_i-\boldsymbol{\mu}) ={}& \sum_{i=1}^{N}(\mathbf{x}_i-\bar{\mathbf{x}})^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}(\mathbf{x}_i-\bar{\mathbf{x}}) + N(\bar{\mathbf{x}}-\boldsymbol{\mu})^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}(\bar{\mathbf{x}}-\boldsymbol{\mu}) +{}\\
&+ \underbrace{\sum_{i=1}^{N}(\bar{\mathbf{x}}-\boldsymbol{\mu})^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}(\mathbf{x}_i-\bar{\mathbf{x}})}_{=0} + \underbrace{\sum_{i=1}^{N}(\mathbf{x}_i-\bar{\mathbf{x}})^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}(\bar{\mathbf{x}}-\boldsymbol{\mu})}_{=0}
\end{aligned}
\]

The last two terms are zero since Σ_{i=1}^{N}(x_i − x̄) = 0. The first term in the decomposition instead develops, by the property of the trace operator, as:

\[
\begin{aligned}
\sum_{i=1}^{N}(\mathbf{x}_i-\bar{\mathbf{x}})^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}(\mathbf{x}_i-\bar{\mathbf{x}}) &= \operatorname{tr}\left(\sum_{i=1}^{N}(\mathbf{x}_i-\bar{\mathbf{x}})^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}(\mathbf{x}_i-\bar{\mathbf{x}})\right)\\
&= \operatorname{tr}\left(\boldsymbol{\Sigma}^{-1}\sum_{i=1}^{N}(\mathbf{x}_i-\bar{\mathbf{x}})(\mathbf{x}_i-\bar{\mathbf{x}})^{\mathrm{T}}\right)\\
&= (N-1)\operatorname{tr}\left(\boldsymbol{\Sigma}^{-1}\mathbf{S}\right)
\end{aligned}
\]

where the last line follows from the definition of S. Collecting terms allows one to verify that the factorization f_{x_1,...,x_N}(x_1, ..., x_N; µ, Σ) = g(x̄, S; µ, Σ) · h(x_1, ..., x_N) holds with the expressions given in the text.


order statistics, the minimum X(1) (with x(1) = min {x1 , . . . , xN }) and the
maximum X(N ) (with x(N ) = max {x1 , . . . , xN } as in Example 4.3) are in
fact simultaneously sufficient for (α, β). This is shown by observing that
the joint density function of the sample is:
\[
f_{X_1,\dots,X_N}(x_1,\dots,x_N; \alpha, \beta) = \left(\frac{1}{\beta-\alpha}\right)^{N} \cdot \mathbb{1}\left[\alpha \le x_1, \dots, x_N \le \beta\right]
\]

and by setting:

\[
g\left(x_{(1)}, x_{(N)}; \alpha, \beta\right) = \left(\frac{1}{\beta-\alpha}\right)^{N} \cdot \mathbb{1}\left[\alpha \le x_{(1)}\right] \cdot \mathbb{1}\left[x_{(N)} \le \beta\right]
\]

and h(x_1, ..., x_N) = 1. Hence, the factorization theorem applies trivially.
One more time, the result is intuitive: if both bounds of the uniform distri-
bution are unknown, there is no better information contained in the sample
than the two extreme order statistics. 
The factorization theorem makes it possible to quickly verify that certain statistics are sufficient for the specified parameters of an important "macrofamily" of probability distributions, which is defined next.
Definition 4.20. Exponential Family. A family of probability distribu-
tions characterized by a vector of parameters θ = (θ1 , . . . , θJ ) is said to
belong to the exponential (macro)-family if the associated mass or density
functions can be written, for J ≤ L, as:
\[
f_X(x; \boldsymbol{\theta}) = h(x)\, c(\boldsymbol{\theta}) \exp\left(\sum_{\ell=1}^{L} w_\ell(\boldsymbol{\theta})\, t_\ell(x)\right)
\]

where h(x) and t_ℓ(x) are functions of the realizations x, while c(θ) ≥ 0 and w_ℓ(θ) are functions of the parameters θ; in both cases, ℓ = 1, ..., L.
The exponential macrofamily is extensive: it comprises many of the distribution families analyzed in Lecture 2. In particular, the discrete Bernoulli, geometric, and Poisson families, as well as the continuous normal, lognormal, Beta, and Gamma families – including the special cases of the Gamma family, like the chi-squared and exponential distributions – are all sub-families of the exponential macro-family.⁴ All these claims can be verified by manipulating the density functions of interest. Other distributions are said to belong to the exponential family so long as certain parameters are "fixed" (i.e. not part of θ in the above definition): this is the case of the binomial and negative binomial families for a constant number of trials n or r.
⁴ One must be careful not to mistake the exponential (macro-)family for the more restricted subfamily of exponential distributions!


The connection between sufficient statistics and the exponential family


is expressed through the following result.
Theorem 4.8. Sufficient statistics and the exponential family. If a
random sample is obtained from any random variable X whose distribution
belongs to the exponential family, the L statistics in the vector:

\[
\mathbf{t}(X_1,\dots,X_N) = \begin{pmatrix} \sum_{i=1}^{N} t_1(X_i) \\ \sum_{i=1}^{N} t_2(X_i) \\ \vdots \\ \sum_{i=1}^{N} t_L(X_i) \end{pmatrix}
\]

are simultaneously sufficient for θ, where the functions t_ℓ(x) are as in the previous definition of the exponential family, for ℓ = 1, ..., L.
Proof. The joint density of the sample can be expressed as:

\[
f_{X_1,\dots,X_N}(x_1,\dots,x_N; \boldsymbol{\theta}) = \left(\prod_{i=1}^{N} h(x_i)\right) [c(\boldsymbol{\theta})]^{N} \exp\left(\sum_{\ell=1}^{L} w_\ell(\boldsymbol{\theta}) \sum_{i=1}^{N} t_\ell(x_i)\right)
\]

and applying the factorization theorem is straightforward.
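As a concrete illustration of Theorem 4.8, consider the Poisson family, one of the exponential-family members listed above: its mass function e^{−λ}λ^x/x! can be written with t(x) = x, so T = Σ X_i is sufficient for λ. The sketch below checks, for a sample of size N = 2, that the conditional distribution of the sample given T = t is Binomial(t, 1/2), free of λ (the values of t and λ are arbitrary illustrative choices):

```python
from math import comb, exp, factorial, isclose

def poisson_pmf(k, lam):
    return exp(-lam) * lam**k / factorial(k)

t = 5
for lam in (0.7, 2.5):
    for k in range(t + 1):
        # Joint mass of (X1 = k, X2 = t - k), over the mass of T ~ Poisson(2λ)
        cond = (
            poisson_pmf(k, lam) * poisson_pmf(t - k, lam)
            / poisson_pmf(t, 2 * lam)
        )
        assert isclose(cond, comb(t, k) * 0.5**t)
print("the conditional distribution is Binomial(t, 1/2), free of lambda")
```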


Example 4.7. A sufficient statistic for the Bernoulli distribution, revisited. It might not appear too obvious at first, but the Bernoulli family of distributions for p ∈ (0, 1) is a full member of the exponential family. Its density function can be rewritten, for x ∈ {0, 1}, as:

\[
\begin{aligned}
f_X(x; p) &= (1-p)\left(\frac{p}{1-p}\right)^{x}\\
&= (1-p)\exp\left(\log\left(\frac{p}{1-p}\right)x\right)
\end{aligned}
\]

and so the sufficient statistic T = Σ_{i=1}^{N} X_i from Example 4.1 complies with Theorem 4.8.
Example 4.8. Two sufficient statistics for the normal distribution, a member of the exponential family. Consider some random variable X ∼ N(µ, σ²). Its density function can be rewritten as:

\[
f_X\left(x; \mu, \sigma^2\right) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{\mu^2}{2\sigma^2}\right) \exp\left(\frac{\mu}{\sigma^2}x - \frac{1}{2\sigma^2}x^2\right)
\]

therefore, by Theorem 4.8, two statistics that are simultaneously sufficient for µ and σ² are T_1 = Σ_{i=1}^{N} X_i and T_2 = Σ_{i=1}^{N} X_i². These are unfamiliar in the context of normal distributions; it is soon shown how they relate to the more frequent sample mean X̄ and sample variance S².


Example 4.9. Two sufficient statistics for the Gamma distribution. The density function of a random variable X ∼ Gamma(α, β) can be written as:

\[
f_X(x; \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} \exp\left[(\alpha-1)\log(x) - \beta x\right]
\]

for x > 0. By Theorem 4.8, two statistics that are simultaneously sufficient for α and β are T_1 = Σ_{i=1}^{N} log(X_i) and T_2 = Σ_{i=1}^{N} X_i. A simpler analysis extends to the two special cases of the Gamma distribution, the exponential and the chi-squared distributions. In both cases, T = Σ_{i=1}^{N} X_i is a sufficient statistic for the unknown parameter (λ and κ, respectively).
The treatment of sufficient statistics – and with it the entire Lecture – is concluded by observing that it is easy to obtain alternative sufficient statistics for the same parameters through appropriate transformations: if T(x_1, ..., x_N) is a sufficient statistic for some parameter of interest θ, the transformation T′(x_1, ..., x_N) = g(T(x_1, ..., x_N)) also results in a sufficient statistic for θ if g(·) does not depend on θ. This follows from the definition of sufficient statistics and the theorems about the transformation of random variables. These considerations extend to transformations of vectors of sufficient statistics, t′(x_1, ..., x_N) = g(t(x_1, ..., x_N)), as in the following examples.
Example 4.10. Two sufficient statistics for the normal distribution, a member of the exponential family, revisited. The two seemingly different results from Examples 4.4 and 4.8 can be reconciled by observing that, given T_1 = Σ_{i=1}^{N} X_i and T_2 = Σ_{i=1}^{N} X_i², it is:

\[
\bar{X} = \frac{1}{N}\, T_1 \qquad \text{and} \qquad S^2 = \frac{1}{N-1}\left(T_2 - \frac{T_1^2}{N}\right)
\]

a transformation that does not depend on the parameters.
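The transformation in Example 4.10 is straightforward to check numerically against the usual formulas for the sample mean and sample variance (the sample values below are arbitrary illustrative choices):

```python
import statistics

sample = [0.5, 2.0, -1.2, 0.9, 1.4]
N = len(sample)

T1 = sum(sample)                 # exponential-family statistic: sum of x_i
T2 = sum(x**2 for x in sample)   # exponential-family statistic: sum of x_i squared

# Recover the sample mean and the (N - 1)-denominator sample variance
x_bar = T1 / N
s2 = (T2 - T1**2 / N) / (N - 1)

assert abs(x_bar - statistics.mean(sample)) < 1e-12
assert abs(s2 - statistics.variance(sample)) < 1e-12
print(x_bar, s2)
```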
Example 4.11. Two sufficient statistics for the Gamma distribution, revisited. The two sufficient statistics for α and β in random samples drawn from the Gamma distribution are typically listed as:

\[
T_1' = \prod_{i=1}^{N} X_i \qquad \text{and} \qquad T_2' = \sum_{i=1}^{N} X_i
\]

which is easily verified via direct application of the factorization theorem. This can be related to Example 4.9 since T_1' = exp(T_1) and T_2' = T_2.
With the help of examples, this section has displayed multiple methods to
obtain the sufficient statistics of interest. The most appropriate method is
generally context-dependent, and it is often useful to verify that alternative
routes can lead to the same result.

Lecture 5

Statistical Inference

This lecture develops the core concepts of statistical inference: the theory
and practice of the statistical evaluation of data. After having introduced
the concept of point estimator and two chief methods for constructing esti-
mators – the Method of Moments and Maximum Likelihood Estimation –
this lecture discusses a framework and associated results for the evaluation
of the statistical properties of different estimators. Finally, this lecture con-
cludes with an outline of the theory and the practice of hypothesis testing
in statistical inference and the associated methods to construct confidence
intervals for estimators, the so-called interval estimation.

5.1 Principles of Estimation


The term estimation refers to a broad concept that includes different types
of evaluations about certain features of the probability distributions that are
hypothesized to generate a sample of data. Here the focus is on parameter
estimation: the use of selected statistics for evaluating the parameters of
such a distribution. There are multiple methods to perform parameter es-
timation, and each is motivated by some statistical principle. This section
introduces two of them: the Method of Moments and Maximum Likelihood
Estimation. Other methods, such as the so-called “Bayesian” ones, are in-
stead outside the scope of this lecture. It must be mentioned that not every
type of statistical estimation concerns the parameters of a distribution. The
theory and the practice of non-parametric estimation, for example, concerns
the direct evaluation of the density or the mass functions that are believed
to generate the data, without making specific hypotheses about the func-
tional form – and the parameters – of these distributions. Following this
general introduction, a first definition is in order.


Definition 5.1. (Point) estimators, and their estimates. Any statistic, if used to make evaluations about certain features of a probability distribution, is called a point estimator (or, more simply, an estimator). The sample realization of such a statistic is called an estimate.

The notation θ̂, with the typical "hat," is typically used to denote a point estimator for some parameters θ of a distribution (which are possibly multivalued). This notation is used both for estimators intended as statistics – that is, random variables or vectors endowed with a sampling distribution – and for the estimates calculated in the data. The ensuing discussion treats the parameters θ = (θ_1, ..., θ_K) that are sought after as vector-valued with dimension K; the univariate (scalar) case can be considered as a particular one (but with examples aplenty). Note that, depending on the context, some values may or may not be admissible estimates for certain parameters. For example, the scale parameter σ of location-scale families, or the parameters α and β of the Gamma distribution, cannot be negative. It is important to accurately define the set of values that are allowed in the estimation.

Definition 5.2. Parameter space. The set of admissible values for the
parameters θ is called parameter space and is usually denoted as Θ ⊆ RK .

The first of the two methods for finding or constructing estimators that are introduced here is both the most intuitive and (unsurprisingly) the oldest one. The Method of Moments is based on the following idea, formulated as a statistical principle.

Statistical Principle 2. Analogy. The analogy principle states that, if the random variables that generate the sample and the parameters are related via some vector-valued function m(x_i; θ) of dimension K, such that for i = 1, ..., N a zero moment condition can be established:

\[
\mathbb{E}\left[\mathbf{m}(\mathbf{x}_i; \boldsymbol{\theta})\right] = \mathbf{0} \tag{5.1}
\]

then a point estimator for θ can be obtained as the solution to the so-called sample analogue of the zero moment condition, that is, the condition that equates the sample mean of m(x_i; θ) to zero.

\[
\frac{1}{N}\sum_{i=1}^{N}\mathbf{m}\left(\mathbf{x}_i; \hat{\boldsymbol{\theta}}_{MM}\right) = \mathbf{0} \tag{5.2}
\]

Here, the estimator θ̂_MM is denoted by the subscript that identifies it as a Method of Moments (MM) estimator.

157
5.1. Principles of Estimation

The intuition behind the method of moments estimator is simple: because


m (xi ; θ) is a random vector with mean zero, the expectation of its sample
mean must also be zero (Theorem 4.3). The most intuitive way to simulate
this property with real data is to select a value for θ that satisfies the mean
zero requirement in the sample. Note that this definition is not restricted to
random samples. Also observe that the method of moments does not exploit
any characteristic of the joint distribution of the sample other than the zero
moment condition (which is expressed as a function of the parameters).
Example 5.1. Estimation of the mean. There are several probability
distributions such that their mean exactly equals one of the parameters.
These include the Bernoulli, Poisson, normal, logistic, Laplace, exponential
and others. Write the parameter in question as µ (which corresponds to p
in the Bernoulli case, λ in the Poisson and exponential cases, etc.). Suppose
that a random sample is obtained from one of these distributions. The zero
moment condition here is simply:
E [Xi − µ] = E [m (Xi ; µ)] = 0 (5.3)
and the Method of Moments estimator is obtained as follows.
µ̂_MM = (1/N) Σ_{i=1}^N X_i = X̄                                     (5.4)
Consider next a multivariate distribution whose mean satisfies the following:
E [xi − µ] = E [m (xi ; µ)] = 0 (5.5)
as for the multivariate normal distribution, and others. Once again:
µ̂_MM = (1/N) Σ_{i=1}^N x_i = x̄                                     (5.6)
the Method of Moments estimator is the multivariate sample mean. 
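The sample mean as a Method of Moments estimator can be checked numerically. The following Python sketch is not part of the original notes; the Bernoulli distribution, the value p = 0.3, and the sample size are invented for illustration.

```python
import random

random.seed(42)

def mm_mean(sample):
    """Method of Moments estimator of the mean: the sample average."""
    return sum(sample) / len(sample)

# Hypothetical Bernoulli(p = 0.3) random sample.
p = 0.3
sample = [1 if random.random() < p else 0 for _ in range(100_000)]
p_hat = mm_mean(sample)
print(p_hat)  # close to 0.3
```

The same computation applies to the other distributions mentioned in the example, since in each case the mean coincides with the parameter of interest.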
Example 5.2. Estimation of the variance (and covariance). The zero
moment conditions naturally extend to moments higher than the mean, as
all moments are ultimately expectations. For example, in random sampling
from the normal distribution it holds that:
E[X_i − µ] = E[m₁(X_i; µ, σ²)] = 0
                                                                     (5.7)
E[(X_i − E[X])² − σ²] = E[m₂(X_i; µ, σ²)] = 0

where the second condition can also be expressed as E[X_i²] − µ² − σ² = 0.


Here, the moment condition (5.3) is combined with another moment about
the variance to result in a system of two equations and two unknowns.


The solution of this particular system is a pair of Method of Moments


estimators expressed by (5.4) for the estimator of the location parameter
µ, and the following expression for the estimator of the scale parameter σ2 .
σ̂²_MM = (1/N) Σ_{i=1}^N (X_i − X̄)² = ((N − 1)/N) S²                (5.8)

Observe that this estimator differs from the sample variance S² by a factor
(N − 1)/N, hence its expectation does not equal the actual variance of X_i in a
random sample. The method can be extended to other distributions; in the
logistic case, for example, under the standard parametrization the variance
is Var[X_i] = σ²π²/3, therefore σ̂_MM = S√(3(N − 1)/N)/π. Next, consider
the multivariate normal distribution; there:
E[x_i − µ] = E[m_µ(x_i; µ, Σ)] = 0
                                                                     (5.9)
E[x_i x_iᵀ − E[x_i] E[x_i]ᵀ − Σ] = E[m_Σ(x_i; µ, Σ)] = 0

where the second set of conditions also writes as E[x_i x_iᵀ − µµᵀ − Σ] = 0
(note that this is a matrix-valued condition). The sample analogue of these
moment conditions delivers as solution another set of Method of Moments
estimators. These estimators are the sample mean as per (5.6) for µ, and:
Σ̂_MM = (1/N) Σ_{i=1}^N x_i x_iᵀ − x̄x̄ᵀ = ((N − 1)/N) S             (5.10)
that is a rescaled version of the sample variance-covariance, for Σ. 
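A quick numerical check of equation (5.8) in Python follows (a hypothetical simulation, not part of the original notes; µ = 2 and σ² = 9 are arbitrary choices).

```python
import random
import statistics

random.seed(0)

# Hypothetical sample from N(mu = 2, sigma^2 = 9).
sample = [random.gauss(2.0, 3.0) for _ in range(10_000)]
N = len(sample)
xbar = sum(sample) / N

# Method of Moments estimator of the variance: mean squared deviation.
var_mm = sum((x - xbar) ** 2 for x in sample) / N

# It equals (N - 1)/N times the unbiased sample variance S^2, as in (5.8).
S2 = statistics.variance(sample)    # statistics.variance divides by N - 1
print(var_mm)  # close to 9
```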
Example 5.3. Combined estimation of moments. Consider sampling
from the Gamma distribution, where two zero moment conditions:

E[X_i − α/β] = E[m₁(X_i; α, β)] = 0                                  (5.11)

E[X_i² − α(α + 1)/β²] = E[m₂(X_i; α, β)] = 0                         (5.12)
deliver a system of two equations in two unknown parameters. The Method
of Moments estimators of α and β are:
α̂_MM = X̄² / ((1/N) Σ_{i=1}^N X_i² − X̄²)                            (5.13)

β̂_MM = X̄ / ((1/N) Σ_{i=1}^N X_i² − X̄²)                             (5.14)

where X̄ = (1/N) Σ_{i=1}^N X_i is the sample mean.
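The two Gamma estimators can be sketched numerically as follows (a hypothetical Python simulation, not part of the original notes; the true values α = 2 and β = 0.5 are invented; note that Python's random.gammavariate takes the scale 1/β, not the rate β, as its second argument).

```python
import random

random.seed(1)

# Hypothetical Gamma sample with shape alpha = 2 and rate beta = 0.5
# (random.gammavariate takes the scale 1/beta as its second argument).
alpha, beta = 2.0, 0.5
sample = [random.gammavariate(alpha, 1.0 / beta) for _ in range(200_000)]
N = len(sample)

xbar = sum(sample) / N
var_hat = sum(x * x for x in sample) / N - xbar ** 2  # (1/N) sum of x_i^2 - xbar^2

alpha_mm = xbar ** 2 / var_hat   # as in equation (5.13)
beta_mm = xbar / var_hat         # as in equation (5.14)
print(alpha_mm, beta_mm)  # close to (2, 0.5)
```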


Example 5.4. Estimation of the bivariate linear regression model.


Remember the bivariate linear regression model from Example 3.11. There,
the application of the analogy principle to the two parameters of interest is
straightforward, because the covariance and the variance that define (3.11)
have simple sample analogues.
β̂_{0,MM} = Ȳ − X̄ · β̂_{1,MM}                                       (5.15)

β̂_{1,MM} = Σ_{i=1}^N (X_i − X̄)(Y_i − Ȳ) / Σ_{i=1}^N (X_i − X̄)²    (5.16)

Here, X̄ and Ȳ are the sample means of Xi and Yi respectively, and the two
estimators are also called the least squares estimators of the model, for
reasons to be elaborated in Lecture 7. Note that to derive these estimators
independence is not necessary, since the two quantities are obtained directly
from (3.10) and (3.11). Furthermore, no specification of the joint density of
Yi and Xi was made, except that the two variables are related via a linear
conditional expectation function E [ Yi | Xi ] = β0 + β1 Xi . 
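The least squares formulas of this example translate directly into code. The sketch below is hypothetical (the coefficients β₀ = 1 and β₁ = 2, the error distribution, and the sample size are invented) and recovers the parameters of a simulated linear conditional expectation function.

```python
import random

random.seed(2)

# Hypothetical data with E[Y | X] = beta0 + beta1 * X, beta0 = 1, beta1 = 2.
N = 50_000
X = [random.gauss(0.0, 1.0) for _ in range(N)]
Y = [1.0 + 2.0 * x + random.gauss(0.0, 0.5) for x in X]

xbar = sum(X) / N
ybar = sum(Y) / N

# Sample analogues of the covariance and the variance, as in (5.15)-(5.16).
b1 = (sum((x - xbar) * (y - ybar) for x, y in zip(X, Y))
      / sum((x - xbar) ** 2 for x in X))
b0 = ybar - xbar * b1
print(b0, b1)  # close to (1, 2)
```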

The Method of Moments is simple and often convenient, since it requires


fewer assumptions than its leading competitor for parametric estimation:
Maximum Likelihood Estimation. Furthermore, the Method of Moments is
applicable to virtually all statistical settings (those cases where the moments
are infinite or undefined, as with the Cauchy distribution, are exceptions).
As discussed later, though, Maximum Likelihood generally has statistical
properties superior to those of the Method of Moments: thus a trade-off between
simplicity and flexibility on one side, and improved properties on the other
side, arises. To understand this, it is necessary to introduce the Maximum
Likelihood Estimator, starting from the definition of likelihood function.

Definition 5.3. The Likelihood Function. Suppose that some sample of


observations {x₁, . . . , x_N} is observed. For fixed values of those realizations,
the likelihood function is defined as the joint mass or density function of
the sample, f_{x₁,...,x_N}(x₁, . . . , x_N; θ), as a function of the parameters; it is
generally written as follows.

L(θ| x₁, . . . , x_N) = f_{x₁,...,x_N}(x₁, . . . , x_N; θ) > 0

The likelihood function is by definition always positive because only values


in the support of (x1 , . . . , xN ) can be observed.

The likelihood function cannot be interpreted as a sort of probability


function, since it certainly does not integrate to 1 (and parameters are not


functions of events either). However, it bears an interpretation in terms of


“how plausible” alternative parameter sets are for generating the observed
realizations. This function is associated with the following principle.
Statistical Principle 3. Likelihood. Suppose that two samples of obser-
vations, {x1 , . . . , xN } and {y1 , . . . , yN }, are observed, and they are obtained
from distributions with the same unknown parameters θ. Suppose that the
likelihood functions associated with the two realizations are proportional,
in the sense that there exists a constant, expressed as a function of the ob-
servations C (x1 , . . . , xN ; y1 , . . . , yN ), such that the two likelihood functions
are always identical up to this constant:
L (θ| x1 , . . . , xN ) = C (x1 , . . . , xN ; y1 , . . . , yN ) · L (θ| y1 , . . . , yN )
where “always” means for every admissible value of the parameters θ. Then,
any evaluation about the parameters should be identical in the two samples.
The likelihood principle has two main interpretations. The first is that if
C (x1 , . . . , xN ; y1 , . . . , yN ) = 1 uniformly for all alternative pairs of samples,
then two observations with the same value of the likelihood function imply
identical “evaluations” (that is, estimations) about the parameters θ. The
second implication is that for any two alternative values of the parameters,
say θ0 and θ00 , the ratio
L (θ0 | x1 , . . . , xN ) L (θ0 | y1 , . . . , yN )
=
L (θ00 | x1 , . . . , xN ) L (θ00 | y1 , . . . , yN )
must be constant across any different sets of observations {x1 , . . . , xN } and
{y1 , . . . , yN }. Consequently, if one treats the value expressed by the likeli-
hood function as a measure of “plausibility” that the parameter in question
is the one that generates the observations, the highest such value does not
depend on the particular observations. The next logical step is to define an
estimator which does select the value in question.
A Maximum Likelihood Estimator is a statistic that maximizes the
observed likelihood function. Such estimator is usually specified as:
θ̂_MLE = arg max_{θ∈Θ} L(θ| x₁, . . . , x_N)                         (5.17)

where the subscript “MLE” has an obvious meaning. Since the likelihood
function is always positive, in practical settings it is often useful to maximize
its logarithm instead, which is called the log-likelihood function. In other
words, (5.17) is equivalent to the following.
θ̂_MLE = arg max_{θ∈Θ} log L(θ| x₁, . . . , x_N)                     (5.18)


If the sample has certain properties, the calculation of the MLE is simplified
further. First, if the observations are independent the joint mass or density
of the sample reduces to the product of the mass or density functions of all
the observations, and so maximizing the log-likelihood function amounts to
maximize a summation:
"N # N
Y X
θ
bM LE = arg max log fx (xi ; θ) = arg max
i
log fx (θ| xi ) (5.19)
i
θ∈Θ i=1 θ∈Θ i=1

where log fxi (xi ; θ) is the so-called “observation-specific component” of the


log-likelihood function. Furthermore, if the observations are also identically
distributed (the sample is random) it is log fxi (θ| xi ) = log fx (θ| xi ): the
observation-specific component is identical for all i = 1, . . . , N .
Example 5.5. Maximum Likelihood Estimation of N independent
Bernoulli trials. Suppose one is interested in estimating the parameter p
that generates the realizations {x1 , . . . , xN } out of N independent Bernoulli
trials. Note that here the parameter space of p is Θ = [0, 1]. The likelihood
function is:
L(p| x₁, . . . , x_N) = ∏_{i=1}^N p^{x_i} (1 − p)^{1−x_i}
and the log-likelihood function is as follows.
log L(p| x₁, . . . , x_N) = (Σ_{i=1}^N x_i) · log(p) + (N − Σ_{i=1}^N x_i) · log(1 − p)

The First Order Condition with respect to p is:


d log L(p̂_MLE| x₁, . . . , x_N)/dp = (Σ_{i=1}^N x_i)/p̂_MLE − (N − Σ_{i=1}^N x_i)/(1 − p̂_MLE) = 0
solving for which allows one to verify that the Maximum Likelihood Estimator
for this problem is the sample mean.

p̂_MLE = (1/N) Σ_{i=1}^N X_i = X̄

Two observations are in order. First, since X_i ∈ {0, 1} it follows that X̄ ∈ [0, 1], hence
the MLE is restricted to valid values in the parameter space. Second, the
Second Order Condition is as follows:
d² log L(p| x₁, . . . , x_N)/dp² = −(Σ_{i=1}^N x_i)/p² − (N − Σ_{i=1}^N x_i)/(1 − p)² < 0

verifying that indeed p̂_MLE is the maximizer of the likelihood function. 
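The claim that the sample mean maximizes the Bernoulli likelihood can also be verified numerically. The sketch below is a hypothetical Python illustration (p = 0.6 and the grid are invented): it compares the log-likelihood at the sample mean against a fine grid of alternative values of p.

```python
import math
import random

random.seed(3)

# Hypothetical Bernoulli(p = 0.6) sample.
sample = [1 if random.random() < 0.6 else 0 for _ in range(1_000)]
N = len(sample)
s = sum(sample)

def loglik(p):
    # Log-likelihood of independent Bernoulli trials, as in the example.
    return s * math.log(p) + (N - s) * math.log(1 - p)

p_mle = s / N                                # the sample mean
grid = [k / 1000 for k in range(1, 1000)]    # candidate values of p
best = max(grid, key=loglik)
print(p_mle, best)  # the grid optimum coincides with the sample mean
```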


Example 5.6. Maximum Likelihood Estimation of the parameters


of the normal distribution. Suppose now that some random sample is
drawn from a normally distributed random variable X ∼ N (µ, σ2 ). The
parameter space is Θ = R × R++ : while the mean can take any real value,
the variance is allowed to take only positive values. The likelihood function
equals the usual joint density of a normally distributed sample:
N
!
2
Y 1 (x i − µ)
L µ, σ2 x1 , . . . , xN =

√ exp −
i=1 2πσ2 2σ2
N
!
1 X (xi − µ)2
= N exp −
(2πσ2 ) 2 i=1
2σ2

but in the log-likelihood form, it simplifies as follows.


log L(µ, σ²| x₁, . . . , x_N) = −(N/2) log(2π) − (N/2) log(σ²) − Σ_{i=1}^N (x_i − µ)²/(2σ²)

The First Order Conditions, evaluated at the solution, are:


∂ log L(µ̂_MLE, σ̂²_MLE| x₁, . . . , x_N)/∂µ = Σ_{i=1}^N (x_i − µ̂_MLE)/σ̂²_MLE = 0

∂ log L(µ̂_MLE, σ̂²_MLE| x₁, . . . , x_N)/∂σ² = −N/(2σ̂²_MLE) + Σ_{i=1}^N (x_i − µ̂_MLE)²/(2σ̂⁴_MLE) = 0

which is a system of two equations in two unknowns. Solving it delivers the


paired MLE’s for the normal distribution, which fit the parameter space.
µ̂_MLE = (1/N) Σ_{i=1}^N X_i

σ̂²_MLE = (1/N) Σ_{i=1}^N (X_i − X̄)²

Note that while the solution coincides with the paired Method of Moments
estimators for µ and σ², such a coincidence does not hold in general. To verify that the likelihood
function is indeed maximized, it is necessary to analyze the determinant of
the Hessian matrix of the log-likelihood function evaluated at the solution.
The Hessian matrix in question is the following.
H(µ, σ²| x₁, . . . , x_N) = ⎡ ∂²ℓ/∂µ²      ∂²ℓ/∂µ∂σ²  ⎤
                            ⎣ ∂²ℓ/∂σ²∂µ    ∂²ℓ/∂(σ²)² ⎦

where ℓ is shorthand for log L(µ, σ²| x₁, . . . , x_N).


Note that the two second-order partial derivatives outside the diagonal are
symmetric and equal, and:
∂² log L(µ, σ²| x₁, . . . , x_N)/∂µ² = −N/σ²

∂² log L(µ, σ²| x₁, . . . , x_N)/∂µ∂σ² = −Σ_{i=1}^N (x_i − µ)/σ⁴

∂² log L(µ, σ²| x₁, . . . , x_N)/∂(σ²)² = N/(2σ⁴) − Σ_{i=1}^N (x_i − µ)²/σ⁶
however, when evaluated at the solution, the cross-derivative equals zero,
because Σ_{i=1}^N (x_i − µ̂_MLE) = 0, while the second derivative with respect to σ²
simplifies too. In fact, by the second of the two First Order Conditions:
∂² log L(µ̂_MLE, σ̂²_MLE| x₁, . . . , x_N)/∂(σ²)² = N/(2σ̂⁴_MLE) − Σ_{i=1}^N (x_i − µ̂_MLE)²/σ̂⁶_MLE
                                                 = N/(2σ̂⁴_MLE) − N/σ̂⁴_MLE
                                                 = −N/(2σ̂⁴_MLE)
and it follows that the Hessian matrix, evaluated at the solution, is:

H(µ̂_MLE, σ̂²_MLE| x₁, . . . , x_N) = ⎡ −N/σ̂²_MLE          0          ⎤
                                     ⎣      0         −N/(2σ̂⁴_MLE)   ⎦
and its determinant is obviously always positive. Since at least one second
order partial derivative (in particular, the second derivative for µ) is always
negative, the solution is indeed a maximum. 
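The closed-form MLEs just derived are straightforward to compute. The following hypothetical Python sketch (µ = −1 and σ² = 4 are invented for illustration) shows that the formulas recover the parameters in a large normal sample, with the variance estimator dividing by N rather than N − 1.

```python
import random

random.seed(4)

# Hypothetical sample from N(mu = -1, sigma^2 = 4).
sample = [random.gauss(-1.0, 2.0) for _ in range(100_000)]
N = len(sample)

mu_mle = sum(sample) / N
sigma2_mle = sum((x - mu_mle) ** 2 for x in sample) / N  # divides by N, not N - 1
print(mu_mle, sigma2_mle)  # close to (-1, 4)
```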
Example 5.7. Maximum Likelihood Estimation of the parameters
of the multivariate normal distribution. Move next to a multivariate
environment, and consider sampling from a random vector x ∼ N (µ, Σ).
The likelihood function is:
L(µ, Σ| x₁, . . . , x_N) = ∏_{i=1}^N [(2π)^K |Σ|]^{−1/2} exp(−(x_i − µ)ᵀ Σ⁻¹ (x_i − µ)/2)
                         = [(2π)^K |Σ|]^{−N/2} exp(−Σ_{i=1}^N (x_i − µ)ᵀ Σ⁻¹ (x_i − µ)/2)


which becomes simpler again if transformed into a log-likelihood function.


log L(µ, Σ| x₁, . . . , x_N) = −(NK/2) log(2π) − (N/2) log(|Σ|) − (1/2) Σ_{i=1}^N (x_i − µ)ᵀ Σ⁻¹ (x_i − µ)

To find the MLE estimator for (µ, Σ) it is easiest to split the problem into
simpler bits: the estimation of µ and that of Σ (note that this is not always
possible). Here, the First Order Conditions with respect to µ:
∂ log L(µ̂_MLE, Σ| x₁, . . . , x_N)/∂µ = Σ⁻¹ Σ_{i=1}^N (x_i − µ̂_MLE) = 0

constitute a system of K equations in K unknowns µ = (µ₁, . . . , µ_K) whose
solution does not depend on Σ!¹ It follows that the MLE estimator of the
location parameters is, again, the vector of sample means.

µ̂_MLE = (1/N) Σ_{i=1}^N x_i = x̄

To obtain the maximum for Σ it is easiest to differentiate the log-likelihood
function with respect to its inverse Σ⁻¹; this must return the same solution.
Differentiating a scalar with respect to a matrix returns yet another matrix;
here this operation gives the following K × K matrix:²

∂ log L(µ, Σ| x₁, . . . , x_N)/∂Σ⁻¹ = (N/2) Σ − (1/2) Σ_{i=1}^N (x_i − µ)(x_i − µ)ᵀ
¹ This is so because Σ is positive semi-definite, a property that extends to its inverse.
Thus, those First Order Conditions can only be equal to zero if Σ_{i=1}^N (x_i − µ̂_MLE) = 0.
² To get into the algebraic details, observe that:

∂ log |Σ⁻¹| / ∂Σ⁻¹ = Σ

and that the derivative of the summation component is as follows.

∂/∂Σ⁻¹ [Σ_{i=1}^N (x_i − µ)ᵀ Σ⁻¹ (x_i − µ)] = ∂/∂Σ⁻¹ tr[Σ_{i=1}^N (x_i − µ)ᵀ Σ⁻¹ (x_i − µ)]
                                             = ∂/∂Σ⁻¹ tr[Σ_{i=1}^N (x_i − µ)(x_i − µ)ᵀ Σ⁻¹]
                                             = Σ_{i=1}^N (x_i − µ)(x_i − µ)ᵀ


which, if evaluated at the solution where µ = x̄ and set to zero, returns the
MLE of the variance-covariance matrix.

Σ̂_MLE = (1/N) Σ_{i=1}^N (x_i − x̄)(x_i − x̄)ᵀ

As in the Method of Moments and in the univariate MLE case, this estimator
is a rescaled version of the sample variance-covariance, Σ̂_MLE = ((N − 1)/N) S.
Some tedious analysis, similar to that from the univariate case, would show
that the MLE solutions µ̂_MLE and Σ̂_MLE indeed identify a maximum of the
(log-)likelihood function. 
In the last few cases, the Method of Moments and Maximum Likelihood
estimators are seen to coincide. This, however, is generally not true, as the
following example shows.
Example 5.8. Maximum Likelihood Estimation of the parameters
of the Gamma distribution. Consider again random sampling from the
Gamma distribution as in Example 5.3. There, the likelihood function is:
L(α, β| x₁, . . . , x_N) = ∏_{i=1}^N [β^α/Γ(α)] x_i^{α−1} exp(−βx_i)
                         = [β^α/Γ(α)]^N (∏_{i=1}^N x_i^{α−1}) exp(−β Σ_{i=1}^N x_i)

and the log-likelihood function is:

log L(α, β| x₁, . . . , x_N) = Nα log(β) − N log[Γ(α)] + (α − 1) Σ_{i=1}^N log(x_i) − β Σ_{i=1}^N x_i

with the following First Order Conditions.³

∂ log L(α, β| x₁, . . . , x_N)/∂α = N log(β) − N [∂Γ(α)/∂α]/Γ(α) + Σ_{i=1}^N log(x_i)

∂ log L(α, β| x₁, . . . , x_N)/∂β = Nα/β − Σ_{i=1}^N x_i
³ The derivative of the logarithm of the Gamma function is known as the digamma
function (the polygamma function of order zero), and unless the argument (e.g. α here) is an integer, it only admits an integral
representation, which makes it difficult to solve the First Order Conditions for α and β.


There is no closed form solution to this problem. Even though the solution must
clearly respect the property that:

α̂_MLE / β̂_MLE = (1/N) Σ_{i=1}^N X_i = X̄

as in the Method of Moments case, exact expressions of the estimators for
α and β in terms of (X₁, . . . , X_N) – or of (x₁, . . . , x_N) – cannot be derived
from the First Order Conditions. It is then necessary to employ numerical
methods on a case-by-case basis in order to identify the estimates. 
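The numerical approach the example alludes to can be sketched as follows (a hypothetical Python illustration, not part of the original notes; α = 3, β = 2, and the grid bounds are invented). Since the second First Order Condition pins down β = α/X̄, β can be concentrated out and the resulting profile log-likelihood maximized over α alone, here by simple grid search with the standard library's math.lgamma.

```python
import math
import random

random.seed(5)

# Hypothetical Gamma sample with shape alpha = 3 and rate beta = 2
# (random.gammavariate takes the scale 1/beta as its second argument).
sample = [random.gammavariate(3.0, 0.5) for _ in range(100_000)]
N = len(sample)
xbar = sum(sample) / N
slog = sum(math.log(x) for x in sample)

def profile_loglik(a):
    # Log-likelihood with beta concentrated out via the FOC beta = alpha/xbar:
    # N*a*log(a/xbar) - N*lgamma(a) + (a - 1)*sum(log x_i) - a*N.
    return N * a * math.log(a / xbar) - N * math.lgamma(a) + (a - 1.0) * slog - N * a

# No closed form exists, so maximize numerically over a grid of alpha values.
grid = [k / 100 for k in range(50, 1001)]    # alpha between 0.5 and 10
alpha_mle = max(grid, key=profile_loglik)
beta_mle = alpha_mle / xbar                  # the FOC links the two estimates
print(alpha_mle, beta_mle)  # close to (3, 2)
```

In practice a Newton-type optimizer would replace the grid, but the logic of profiling out β is the same.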
This example showed how difficult it can be to perform Maximum Like-
lihood Estimation in certain cases – and this is not uncommon! Sometimes,
the MLE of interest does not even exist.
Example 5.9. Maximum Likelihood Estimation of the parameter
of uniform distributions with fixed lower bound. Consider a random
sample drawn from a uniformly distributed random variable Xi ∼ U (0, θ)
with lower bound fixed at zero and closed support: X = [0, θ]. It is easy to
see that E [X] = θ/2 and thus the Method of Moments estimator is:
θ̂_MM = (2/N) Σ_{i=1}^N X_i = 2X̄
while the MLE is the sample maximum.
θ̂_MLE = X_(N)
To see this, note that the likelihood function here is:
L(θ| x₁, . . . , x_N) = (1/θ^N) · 1[0 ≤ x₁, . . . , x_N ≤ θ]
and there is no need to resort to the log-likelihood function to see that the above
is maximized at the smallest value of θ such that θ ≥ max{x₁, . . . , x_N},
hence at the sample maximum. Suppose now that the support of X is open, at least
on the right: X = [0, θ). The Method of Moments estimator is unchanged,
but the likelihood function now becomes:
L(θ| x₁, . . . , x_N) = (1/θ^N) · 1[0 ≤ x₁, . . . , x_N < θ]
with only an inequality being changed within the indicator function. Note
that it is no longer possible to follow the reasoning above in order to identify
a statistic that maximizes the likelihood function (to gain intuition, compare
the two likelihood functions depicted next in Figure 5.1). In cases like that
with open support, one typically says that the MLE does not exist. 
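For the closed-support case, both estimators are one-liners. The hypothetical Python sketch below (θ = 5 is invented) also illustrates that the MLE, the sample maximum, can never exceed θ, while 2X̄ can fall on either side of it.

```python
import random

random.seed(6)

# Hypothetical sample from U(0, theta) with theta = 5.
theta = 5.0
sample = [random.uniform(0.0, theta) for _ in range(10_000)]

theta_mm = 2.0 * sum(sample) / len(sample)   # twice the sample mean
theta_mle = max(sample)                      # the sample maximum
print(theta_mm, theta_mle)  # both close to 5; the MLE never exceeds theta
```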


[Two panels, each plotting L_N(θ) against θ.]
Note: N = 5 and x_(5) = 1 in both cases; X = [0, θ] in the left panel and X = [0, θ) in the right panel.
L_N(θ) is shorthand notation for L(θ| x₁, . . . , x_N).

Figure 5.1: Compared likelihood functions for the uniform distribution

It thus might seem that thanks to its simplicity and flexibility, Method
of Moments estimation trumps Maximum Likelihood as a more convenient
method for constructing estimators, not to mention that
the latter may not even exist in some cases. As anticipated, however, Maximum
Likelihood estimators generally have conceptual and statistical advantages;
one of these is illustrated next.
Theorem 5.1. Invariance of Maximum Likelihood Estimators. Call
θ̂_MLE the Maximum Likelihood Estimator for some parameter vector θ. Let
ϕ = g(θ) be some transformation of parameter vector θ. The Maximum
Likelihood estimator of ϕ is simply the corresponding transformation of the
Maximum Likelihood Estimator of θ.

ϕ̂_MLE = g(θ̂_MLE)

Proof. The Maximum Likelihood Estimator of ϕ is obtained as the maximizer
of the following, so-called induced likelihood function.

L*(ϕ| x₁, . . . , x_N) = max_{θ: g(θ)=ϕ} L(θ| x₁, . . . , x_N)

Call such maximizer ϕ̂_MLE, and observe that:

L*(ϕ̂_MLE| x₁, . . . , x_N) = max_ϕ max_{θ: g(θ)=ϕ} L(θ| x₁, . . . , x_N)
                           = max_θ L(θ| x₁, . . . , x_N)
                           = L(θ̂_MLE| x₁, . . . , x_N)
                           = max_{θ: g(θ)=g(θ̂_MLE)} L(θ| x₁, . . . , x_N)
                           = L*(g(θ̂_MLE)| x₁, . . . , x_N)


where the first and last equalities follow from the definition of induced like-
lihood function, the second equality follows from the properties of iterated
maximizations, and the remaining ones follow by the definition of MLE.
Example 5.10. Maximum Likelihood Estimation of the “precision”
parameter of the normal distribution. Recall that the normal distri-
bution can be alternatively described in terms of the precision parameter
φ² = σ⁻², where the density function is expressed as in (2.37). In that case,
the (induced) likelihood function would be as follows.
L(µ, φ²| x₁, . . . , x_N) = (φ²/(2π))^{N/2} exp(−(φ²/2) Σ_{i=1}^N (x_i − µ)²)

By an analysis similar to that of Example 5.6, the First Order Conditions


of the log-likelihood function evaluated at the solution:
 
∂ log L(µ̂_MLE, φ̂²_MLE| x₁, . . . , x_N)/∂µ = φ̂²_MLE Σ_{i=1}^N (x_i − µ̂_MLE) = 0

∂ log L(µ̂_MLE, φ̂²_MLE| x₁, . . . , x_N)/∂φ² = N/(2φ̂²_MLE) − Σ_{i=1}^N (x_i − µ̂_MLE)²/2 = 0
would reveal that the MLE of µ is still the sample mean, while the MLE of
the precision parameter is the following:

φ̂²_MLE = N [Σ_{i=1}^N (X_i − X̄)²]⁻¹

which is obviously nothing else but the inverse of σ̂²_MLE. 
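The invariance property of this example is easy to verify numerically. The hypothetical Python sketch below (σ² = 2.25 is an invented value) computes σ̂²_MLE directly and checks that the MLE of the precision is exactly its reciprocal.

```python
import random

random.seed(7)

# Hypothetical normal sample with sigma^2 = 2.25; the MLE of the precision
# phi^2 = 1/sigma^2 is, by invariance, the reciprocal of the MLE of sigma^2.
sample = [random.gauss(0.0, 1.5) for _ in range(10_000)]
N = len(sample)
xbar = sum(sample) / N

sigma2_mle = sum((x - xbar) ** 2 for x in sample) / N
phi2_mle = 1.0 / sigma2_mle   # g(sigma2_mle) with g(t) = 1/t
print(sigma2_mle, phi2_mle)   # close to 2.25 and 1/2.25
```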
Example 5.11. Maximum Likelihood Estimation of an alternative
parameter of the Gamma distribution. Also the Gamma distribution
admits an alternative parametrization, for θ = β⁻¹: the two parameters are
distinguished by the names rate parameter for β and scale parameter for
θ (while α is the shape parameter). The reparametrized density function
is:

f_X(x; α, θ) = [1/(Γ(α) θ^α)] x^{α−1} exp(−x/θ)    for x > 0
hence in a random sample, the (induced) likelihood function is as follows.
L(α, θ| x₁, . . . , x_N) = [1/(Γ(α) θ^α)]^N (∏_{i=1}^N x_i^{α−1}) exp(−(1/θ) Σ_{i=1}^N x_i)


Analyzing the First Order Conditions of the log-likelihood function:


∂ log L(α, θ| x₁, . . . , x_N)/∂α = −N log(θ) − N [∂Γ(α)/∂α]/Γ(α) + Σ_{i=1}^N log(x_i)

∂ log L(α, θ| x₁, . . . , x_N)/∂θ = −Nα/θ + (1/θ²) Σ_{i=1}^N x_i

once again shows that a closed form solution cannot be identified; however,
the solution must be such that θ̂_MLE = (β̂_MLE)⁻¹. 

The invariance property does not extend to Method of Moments estimators.


While this is of little consequence in those cases where the latter coincide with the
MLE, as in the various estimators of the mean considered in the examples,
this raises concerns about the Method of Moments in those cases where the
two approaches differ.

5.2 Evaluation of Estimators


Whether obtained through the Method of Moments, Maximum Likelihood
or other means, all estimators are statistics – functions of the random
variables or vectors from which samples are drawn – and thus they are endowed
with a sampling distribution. While it is often difficult to derive selected
sampling distributions, it is often possible to analyze some of their properties,
especially certain moments of the estimators, in order to inform the choice
between different estimators. In fact, some estimators are better than others
(that are meant for the same parameters), having better statistical proper-
ties: in practical settings, using the sample values of the “better” estimators
results in more accurate conjectures about the parameters of interest.
This discussion is begun by introducing an important criterion that is
used to compare different estimators.

Definition 5.4. Mean Squared Error (MSE). Consider an estimator
θ̂ for some parameters of interest θ, where both θ̂ and θ have dimension
K. The mean squared error is defined as the following quantity:

MSE ≡ E[(θ̂ − θ)ᵀ(θ̂ − θ)] = Σ_{k=1}^K E[(θ̂_k − θ_k)²]

where k = 1, . . . , K indexes the parameters and associated estimators listed
in θ and θ̂, respectively.


Thus, for any vector of parameters and associated estimates the MSE is
simply the sum of K elements of the form:

MSE_k = E[(θ̂_k − θ_k)²]

where θ̂_k is a statistic, θ_k is a parameter, and both are unidimensional. As
θ̂_k has a sampling distribution, the quantity expressed above is a measure
of the average size of squared deviations from the parameter of interest that
result from the particular estimator θ̂_k. The use of a squared deviation, as
mentioned in Lecture 1 for an analogous context, is intuitive and motivated
by the fact that larger deviations are less desirable than smaller deviations,
and if the MSE is used to compare estimators, those estimators that produce
larger deviations are more penalized by this criterion.
Alternative criteria, such as one that adopts absolute deviations like:

MAE_k = E[|θ̂_k − θ_k|]

are certainly possible (the above is called Mean Absolute Error – MAE).
Nevertheless, the overwhelming majority of practical applications adopts
the MSE; among the various reasons (including analytical convenience) the
following property plays a fundamental role.
MSE_k = E[(θ̂_k − E[θ̂_k] + E[θ̂_k] − θ_k)²]
      = E[(θ̂_k − E[θ̂_k])²] + E[(E[θ̂_k] − θ_k)²] + 2 E[(θ̂_k − E[θ̂_k])(E[θ̂_k] − θ_k)]
      = Var[θ̂_k] + (E[θ̂_k] − θ_k)²

Above, the last element in the second line is easily shown to vanish;
this decomposition should be reminiscent of the analysis conducted in Lecture
1 about the mean as the “best guess” of a random variable. In words, the
MSE relative to a specific estimator θ̂_k can be decomposed into two parts:
the variance of the estimator, and the squared deviation of its mean from
the parameter of interest. The last concept warrants a definition.
Definition 5.5. Bias and unbiasedness. Consider a unidimensional
estimator θ̂ for some parameter of interest θ. Its bias is the quantity:

Bias[θ̂] ≡ E[θ̂] − θ

and the estimator is unbiased if its bias is zero.


Unbiased estimators are certainly appealing, because they can be
interpreted as estimators whose “average” value (in the population of all possible
samples) equals the parameter under estimation. However, an unbiased estimator
might produce an MSE which is larger than that of a biased estimator. Thus,
a researcher who wants to compare the MSE of different estimators, so as to
choose the one with the smallest MSE, must be aware that a bias-variance
a researcher who wants to compare the MSE of different estimators, so to
choose the one with the smallest MSE, must be aware that a bias-variance
trade-off might arise: choosing an estimator with smaller bias might imply
accepting a higher variance than the alternative. If a given estimator has a
smaller variance relative to another estimator it is compared against, it is
said that the former is more efficient than the latter, and vice versa.
Example 5.12. Estimation of the parameters of the normal distri-
bution: the bias-variance trade-off. Consider a random sample drawn
from some normally distributed random variable X ∼ N (µ, σ2 ), and the
following two alternative estimators of the parameters:

(µ̂₁, σ̂₁²) = (X̄, S²)    and    (µ̂₂, σ̂₂²) = (X̄, ((N − 1)/N) S²)
where σ̂₂² = ((N − 1)/N) S² is motivated by either the Method of Moments or
Maximum Likelihood. The estimators of the location parameter are identical, so
they do not contribute to the difference between the MSEs of the two
alternatives. It was shown in Theorem 4.3 that E[S² − σ²] = 0, implying that
σ̂₁² is an unbiased estimator of σ². Consequently, E[((N − 1)/N) S² − σ²] = −σ²/N:
σ̂₂² is a biased estimator of σ². Notably, the MSE associated with the second
setup is lower! In fact, σ̂₂² is more efficient than σ̂₁²:
Var[σ̂₁²] − Var[σ̂₂²] = Var[S²] − ((N − 1)/N)² Var[S²]
                     = ((2N − 1)/N²) Var[S²]
                     = ((2N − 1) σ⁴)/(N² (N − 1)²) · Var[(N − 1) S²/σ²]
                     = (2 (2N − 1) σ⁴)/(N² (N − 1))

where the last line follows from the fact that if W ∼ χ²_κ, then Var[W] = 2κ
(here, κ = N − 1). Clearly:
Var[σ̂₁²] − Var[σ̂₂²] = (2 (2N − 1) σ⁴)/(N² (N − 1)) > σ⁴/N² = (E[σ̂₂²] − σ²)²

that is, the difference between the variances of the two alternative estimators
of the variance is larger than the square of the bias of σ̂₂². 
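The bias-variance trade-off of this example can be illustrated by Monte Carlo simulation (a hypothetical Python sketch, not part of the original notes; σ² = 1, N = 10, and the number of replications are invented). With these values the calculations above predict MSE₁ = 2σ⁴/(N − 1) ≈ 0.222 for the unbiased estimator and MSE₂ = 2(N − 1)σ⁴/N² + σ⁴/N² = 0.19 for the biased one.

```python
import random

random.seed(8)

# Monte Carlo comparison of MSEs (hypothetical setup): the unbiased S^2
# versus the biased but more efficient (N - 1)/N * S^2.
mu, sigma2, N, reps = 0.0, 1.0, 10, 50_000
mse1 = mse2 = 0.0
for _ in range(reps):
    x = [random.gauss(mu, sigma2 ** 0.5) for _ in range(N)]
    xbar = sum(x) / N
    s2 = sum((xi - xbar) ** 2 for xi in x) / (N - 1)    # unbiased S^2
    mse1 += (s2 - sigma2) ** 2
    mse2 += ((N - 1) / N * s2 - sigma2) ** 2
mse1 /= reps
mse2 /= reps
print(mse1, mse2)  # the biased estimator attains the smaller MSE
```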


This example clarifies why the unbiasedness property does not guarantee
that an estimator performs better than others in practical situations. When
restricting their attention to unbiased estimators, researchers should at least
ensure that these are also those with the smallest possible variance. These
estimators have a proper name in the theory of statistical inference.
Definition 5.6. Best unbiased estimators. Consider the set of unbiased
estimators θ̂ of a certain parameter θ:

C_θ = {θ̂ : E[θ̂] = θ}

An estimator θ̂* is called the best unbiased estimator, or the uniform minimum
variance unbiased estimator of θ, if the following holds.

Var[θ̂] − Var[θ̂*] ≥ 0    for all θ̂ ∈ C_θ                            (5.20)

In a multidimensional environment, if θ̂ is a vector of estimators that are
all unbiased for a vector of parameters θ, this definition is recast in terms
of a vector θ̂* of best unbiased estimators such that:

Var[θ̂] − Var[θ̂*] ≥ 0    for all θ̂ ∈ C_{θ₁} × . . . × C_{θ_K}       (5.21)

where the inequality is interpreted in the sense that the matrix on the left-hand
side is positive semi-definite, and C_{θ_k} is the set of unbiased estimators
of θ_k for k = 1, . . . , K.
A property of best unbiased estimators is that they are unique.
Theorem 5.2. Uniqueness of best unbiased estimators. Let θ̂* be a
best unbiased estimator for some parameter θ. In this setting θ̂* is unique,
in the sense that (5.20) holds sharply (without equality).
Proof. Suppose that there is another estimator θ̂** that is also a best
unbiased estimator, in the sense that it has the same expectation and variance
as θ̂*. Define the estimator:

θ̂′ ≡ (1/2) θ̂* + (1/2) θ̂**

it is clear that E[θ̂′] = θ. As per the variance, it must be that:

Var[θ̂′] = (1/4) Var[θ̂*] + (1/4) Var[θ̂**] + (1/2) Cov[θ̂*, θ̂**]
         ≤ (1/4) Var[θ̂*] + (1/4) Var[θ̂**] + (1/2) {Var[θ̂*] Var[θ̂**]}^{1/2}
         = Var[θ̂*]


where the inequality follows from the same argument as in Theorem 3.4, and
the last line is due to the fact that θ̂* and θ̂** have the same variance. Note
that the inequality must be replaced by an equality to avoid a contradiction!
If the inequality were sharp, then θ̂* would not be a best unbiased estimator,
as θ̂′ would improve upon it. To have an equality, it must be – again by Theorem
3.4 – that θ̂** is a linear transformation of θ̂*, that is θ̂** = a + bθ̂*. But in
this case it must also be that a = 0, or else θ̂** would be biased, and b = 1,
since the following chain of equalities must also hold.

Var[θ̂*] = Cov[θ̂*, θ̂**] = Cov[θ̂*, bθ̂*] = b Var[θ̂*]

Thus, θ̂*, θ̂** and θ̂′ are all identical estimators, that is, θ̂* is the only best
unbiased estimator.
The search for unbiased estimators with good properties is facilitated
by the following result, which is stated here in its multivariate version.
Theorem 5.3. The Rao-Blackwell Theorem. Consider an environment
where (x₁, . . . , x_N) is a sample drawn from some list of random vectors, θ is
some parameter vector of interest, θ̂ is any vector of unbiased estimators of
θ, and t = t(x₁, . . . , x_N) is a vector of statistics that are all simultaneously
sufficient for θ. Define the following statistic as a conditional expectation
function.

θ̂* ≡ E[θ̂| t]

The statistic θ̂* is a uniformly better unbiased estimator of θ, that is, it is
an unbiased estimator with lower variance than θ̂.

Proof. The Law of Iterated Expectations:

θ = E[θ̂] = E_t[E[θ̂| t]] = E[θ̂*]

along with the Law of Total Variance:

Var[θ̂] = Var_t[E[θ̂| t]] + E_t[Var[θ̂| t]]
        = Var[θ̂*] + E_t[Var[θ̂| t]]
        ≥ Var[θ̂*]

simultaneously show that if θ̂* is an estimator of θ, it is unbiased and it
also has a lower variance than θ̂. The definition of sufficiency and that of
θ̂*, however, jointly imply that the latter is a legitimate estimator of θ, as
its joint distribution by construction does not depend on θ.


This result makes it possible to “improve” already known unbiased estimators by constructing an appropriate CEF (conditional expectation function) with sufficient statistics of the parameters of interest as arguments. This is generally
not easy. Another use of this result is to show that certain unbiased estima-
tors cannot be further improved. This is the case of well-known estimators
that are already expressed as simple functions of sufficient statistics.
Example 5.13. Rao-Blackwell applied to the normal distribution.
Consider again the pair of unbiased estimators for the parameters $(\mu, \sigma^2)$ of the normal distribution: the two statistics $(\bar{X}, S^2)$. It turns out that these two statistics are also sufficient (Example 4.4), hence they can be expressed as trivial conditional expectation functions of themselves. Therefore, one cannot find better unbiased estimators that are based on the same sufficient statistics (but estimators with a smaller MSE are possible, as shown). ∎
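The mechanics of Rao-Blackwellization are easy to see in a small simulation. The sketch below is an illustration added here, not taken from the text; it assumes Python with NumPy, and all names are chosen for this example. It starts from a crude unbiased estimator of $e^{-\lambda} = \Pr(X = 0)$ in a Poisson random sample, the indicator $\mathbb{1}\{X_1 = 0\}$, and conditions it on the sufficient statistic $T = \sum_i X_i$; since $X_1 \mid T = t$ is Binomial$(t, 1/N)$, the conditioned estimator is $(1 - 1/N)^t$.

```python
import numpy as np

rng = np.random.default_rng(0)
lam, N, reps = 2.0, 20, 50_000

# reps independent Poisson random samples of size N
x = rng.poisson(lam, size=(reps, N))

# Crude unbiased estimator of exp(-lam): the indicator that X_1 equals zero
crude = (x[:, 0] == 0).astype(float)

# Rao-Blackwellized version: condition on the sufficient statistic T = sum(X_i);
# X_1 | T = t is Binomial(t, 1/N), so E[1{X_1 = 0} | T = t] = (1 - 1/N) ** t
t = x.sum(axis=1)
rb = (1.0 - 1.0 / N) ** t

print(np.exp(-lam), crude.mean(), rb.mean())  # all close to exp(-2) ≈ 0.1353
print(crude.var(), rb.var())                  # conditioning shrinks the variance
```

Both estimators are (approximately, in the Monte Carlo averages) unbiased for $e^{-\lambda}$, but the Rao-Blackwellized one has a far smaller sampling variance, exactly as the theorem predicts.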
According to the MSE criterion, evaluating the properties of estimators,
whether they are unbiased or not, requires some understanding about how
their variances compare with those of competing estimators. A fundamental
result in Statistics, known as the Cramér-Rao Inequality, characterizes
a lower bound on the variance that estimators can achieve. Therefore, the
closer an estimator is to that value, called the Cramér-Rao lower bound, the
more confident a researcher should be about using that estimator. In what
follows, the Theorem is stated and proved in both its general version, and
in the case restricted ti random samples. To facilitate understanding, the
two statements are expressed both in the univariate and multivariate cases;
the proofs begin by demonstrating the results in the univariate cases, and
then discuss how they are modified or extended in the multivariate cases.
Theorem 5.4. Cramér-Rao Inequality (general) – univariate case. Consider a sample drawn from a list of random variables $(X_1, \ldots, X_N)$ with a joint mass or density function written as $f(x_1, \ldots, x_N; \theta)$ in shorthand notation. Also consider a parameter of interest $\theta$, as well as some estimator $\hat{\theta} = \hat{\theta}(X_1, \ldots, X_N)$ for $\theta$, such that its variance is finite and additionally – in the continuous case only – that the differentiation operation taken with respect to $\theta$ can pass through the expectation operator as shown below.
$$\frac{\partial}{\partial \theta}\mathbb{E}[\hat{\theta}] = \int_{\mathbb{X}_1} \cdots \int_{\mathbb{X}_N} \hat{\theta}(x_1, \ldots, x_N) \cdot \frac{\partial}{\partial \theta} f(x_1, \ldots, x_N; \theta)\, \mathrm{d}x_1 \ldots \mathrm{d}x_N$$
In this environment, the variance of $\hat{\theta}$ must satisfy the following inequality.
$$\operatorname{Var}[\hat{\theta}] \geq \frac{\left[\frac{\partial}{\partial \theta}\mathbb{E}[\hat{\theta}]\right]^2}{\mathbb{E}\left[\left(\frac{\partial}{\partial \theta}\log f(X_1, \ldots, X_N; \theta)\right)^2\right]}$$


Proof. If one defines the following transformed random variables:
$$U = \hat{\theta}(X_1, \ldots, X_N)$$
$$V = \frac{\partial}{\partial \theta}\log f(X_1, \ldots, X_N; \theta)$$
the result follows through as a simple implication of Theorem 3.4.
$$\operatorname{Var}[U] \geq \frac{\left[\operatorname{Cov}[U, V]\right]^2}{\operatorname{Var}[V]}$$
Note that if $\mathbb{E}[V] = 0$ the above is recast as:
$$\operatorname{Var}[U] \geq \frac{\left[\mathbb{E}[UV]\right]^2}{\mathbb{E}[V^2]}$$
and in fact, in the continuous case (the discrete case is analogous) it is:
$$\begin{aligned}
\mathbb{E}[V] &= \mathbb{E}\left[\frac{\partial}{\partial \theta}\log f(X_1, \ldots, X_N; \theta)\right] \\
&= \mathbb{E}\left[\frac{1}{f(X_1, \ldots, X_N; \theta)}\, \frac{\partial}{\partial \theta} f(X_1, \ldots, X_N; \theta)\right] \\
&= \int_{\mathbb{X}_1} \cdots \int_{\mathbb{X}_N} \frac{\partial}{\partial \theta} f(x_1, \ldots, x_N; \theta)\, \mathrm{d}x_1 \ldots \mathrm{d}x_N \\
&= \frac{\partial}{\partial \theta} \int_{\mathbb{X}_1} \cdots \int_{\mathbb{X}_N} f(x_1, \ldots, x_N; \theta)\, \mathrm{d}x_1 \ldots \mathrm{d}x_N \\
&= \frac{\partial}{\partial \theta} \cdot 1 \\
&= 0
\end{aligned}$$
where the second line applies the chain rule while the fourth line is based on the hypotheses about differentiation and expectation. Similarly:
$$\begin{aligned}
\mathbb{E}[UV] &= \mathbb{E}\left[\hat{\theta}(X_1, \ldots, X_N) \cdot \frac{\partial}{\partial \theta}\log f(X_1, \ldots, X_N; \theta)\right] \\
&= \mathbb{E}\left[\frac{\hat{\theta}(X_1, \ldots, X_N)}{f(X_1, \ldots, X_N; \theta)}\, \frac{\partial}{\partial \theta} f(X_1, \ldots, X_N; \theta)\right] \\
&= \int_{\mathbb{X}_1} \cdots \int_{\mathbb{X}_N} \hat{\theta}(x_1, \ldots, x_N) \cdot \frac{\partial}{\partial \theta} f(x_1, \ldots, x_N; \theta)\, \mathrm{d}x_1 \ldots \mathrm{d}x_N \\
&= \frac{\partial}{\partial \theta} \int_{\mathbb{X}_1} \cdots \int_{\mathbb{X}_N} \hat{\theta}(x_1, \ldots, x_N) \cdot f(x_1, \ldots, x_N; \theta)\, \mathrm{d}x_1 \ldots \mathrm{d}x_N \\
&= \frac{\partial}{\partial \theta}\mathbb{E}[\hat{\theta}]
\end{aligned}$$
and collecting terms, the postulated result is obtained. ∎


Multivariate case. Consider a sample drawn from a list of random vectors $(\mathbf{x}_1, \ldots, \mathbf{x}_N)$ with joint mass or density function written as $f(\mathbf{x}_1, \ldots, \mathbf{x}_N; \boldsymbol{\theta})$ in shorthand notation. Also consider a vector of parameters of interest $\boldsymbol{\theta}$ with length $K$, as well as some estimator $\hat{\boldsymbol{\theta}} = \hat{\boldsymbol{\theta}}(\mathbf{x}_1, \ldots, \mathbf{x}_N)$ for $\boldsymbol{\theta}$, such that its variance is finite and additionally – in the continuous case only – that the differentiation operation taken with respect to $\boldsymbol{\theta}$ can pass through the expectation operator as shown below.
$$\frac{\partial}{\partial \boldsymbol{\theta}^{\mathsf{T}}}\mathbb{E}[\hat{\boldsymbol{\theta}}] = \int_{\mathbb{X}_1} \cdots \int_{\mathbb{X}_N} \hat{\boldsymbol{\theta}}(\mathbf{x}_1, \ldots, \mathbf{x}_N) \cdot \frac{\partial}{\partial \boldsymbol{\theta}^{\mathsf{T}}} f(\mathbf{x}_1, \ldots, \mathbf{x}_N; \boldsymbol{\theta})\, \mathrm{d}\mathbf{x}_1 \ldots \mathrm{d}\mathbf{x}_N$$
In this environment, the variance of $\hat{\boldsymbol{\theta}}$ must satisfy the following inequality:
$$\operatorname{Var}[\hat{\boldsymbol{\theta}}] - \left[\frac{\partial}{\partial \boldsymbol{\theta}^{\mathsf{T}}}\mathbb{E}[\hat{\boldsymbol{\theta}}]\right] \left[\mathcal{I}_N(\boldsymbol{\theta})\right]^{-1} \left[\frac{\partial}{\partial \boldsymbol{\theta}^{\mathsf{T}}}\mathbb{E}[\hat{\boldsymbol{\theta}}]\right]^{\mathsf{T}} \geq 0$$
which is to be interpreted in the sense that the $K \times K$ matrix on the left-hand side is positive semi-definite, and where $\mathcal{I}_N(\boldsymbol{\theta})$ is as follows.
$$\mathcal{I}_N(\boldsymbol{\theta}) \equiv \mathbb{E}\left[\left(\frac{\partial}{\partial \boldsymbol{\theta}}\log f(\mathbf{x}_1, \ldots, \mathbf{x}_N; \boldsymbol{\theta})\right)\left(\frac{\partial}{\partial \boldsymbol{\theta}}\log f(\mathbf{x}_1, \ldots, \mathbf{x}_N; \boldsymbol{\theta})\right)^{\mathsf{T}}\right]$$

Proof. In analogy with the univariate case, define the random vectors:
$$\mathbf{u} = \hat{\boldsymbol{\theta}}(\mathbf{x}_1, \ldots, \mathbf{x}_N)$$
$$\mathbf{v} = \frac{\partial}{\partial \boldsymbol{\theta}}\log f(\mathbf{x}_1, \ldots, \mathbf{x}_N; \boldsymbol{\theta})$$
that are related as follows, by the properties of multivariate moments.
$$\operatorname{Var}[\mathbf{u}] - \left[\operatorname{Cov}[\mathbf{u}, \mathbf{v}]\right]\left[\operatorname{Var}[\mathbf{v}]\right]^{-1}\left[\operatorname{Cov}[\mathbf{u}, \mathbf{v}]\right]^{\mathsf{T}} \geq 0$$
If $\mathbb{E}[\mathbf{v}] = \mathbf{0}$, the above simplifies as:
$$\operatorname{Var}[\mathbf{u}] - \mathbb{E}\left[\mathbf{u}\mathbf{v}^{\mathsf{T}}\right]\left[\mathbb{E}\left[\mathbf{v}\mathbf{v}^{\mathsf{T}}\right]\right]^{-1}\left[\mathbb{E}\left[\mathbf{u}\mathbf{v}^{\mathsf{T}}\right]\right]^{\mathsf{T}} \geq 0$$
where $\mathbb{E}[\mathbf{v}\mathbf{v}^{\mathsf{T}}] = \mathcal{I}_N(\boldsymbol{\theta})$. Consequently, the stated result follows through if, in addition, the following holds too.
$$\mathbb{E}\left[\mathbf{u}\mathbf{v}^{\mathsf{T}}\right] = \frac{\partial}{\partial \boldsymbol{\theta}^{\mathsf{T}}}\mathbb{E}[\hat{\boldsymbol{\theta}}]$$
Note that if the above relationship is proved, then $\mathbb{E}[\mathbf{v}] = \mathbf{0}$ follows easily by replacing $\mathbf{u}$ with the unit vector $\boldsymbol{\iota}_K = (1, \ldots, 1)^{\mathsf{T}}$ having the same length $K$ as $\boldsymbol{\theta}$. To avoid repeating similar arguments as it was done (for illustrative purposes) in the univariate case, only the more complex case is developed.
$$\begin{aligned}
\mathbb{E}\left[\mathbf{u}\mathbf{v}^{\mathsf{T}}\right] &= \mathbb{E}\left[\hat{\boldsymbol{\theta}}(\mathbf{x}_1, \ldots, \mathbf{x}_N) \cdot \frac{\partial}{\partial \boldsymbol{\theta}^{\mathsf{T}}}\log f(\mathbf{x}_1, \ldots, \mathbf{x}_N; \boldsymbol{\theta})\right] \\
&= \mathbb{E}\left[\frac{\hat{\boldsymbol{\theta}}(\mathbf{x}_1, \ldots, \mathbf{x}_N)}{f(\mathbf{x}_1, \ldots, \mathbf{x}_N; \boldsymbol{\theta})}\, \frac{\partial}{\partial \boldsymbol{\theta}^{\mathsf{T}}} f(\mathbf{x}_1, \ldots, \mathbf{x}_N; \boldsymbol{\theta})\right] \\
&= \int_{\mathbb{X}_1} \cdots \int_{\mathbb{X}_N} \hat{\boldsymbol{\theta}}(\mathbf{x}_1, \ldots, \mathbf{x}_N) \cdot \frac{\partial}{\partial \boldsymbol{\theta}^{\mathsf{T}}} f(\mathbf{x}_1, \ldots, \mathbf{x}_N; \boldsymbol{\theta})\, \mathrm{d}\mathbf{x}_1 \ldots \mathrm{d}\mathbf{x}_N \\
&= \frac{\partial}{\partial \boldsymbol{\theta}^{\mathsf{T}}} \int_{\mathbb{X}_1} \cdots \int_{\mathbb{X}_N} \hat{\boldsymbol{\theta}}(\mathbf{x}_1, \ldots, \mathbf{x}_N) \cdot f(\mathbf{x}_1, \ldots, \mathbf{x}_N; \boldsymbol{\theta})\, \mathrm{d}\mathbf{x}_1 \ldots \mathrm{d}\mathbf{x}_N \\
&= \frac{\partial}{\partial \boldsymbol{\theta}^{\mathsf{T}}}\mathbb{E}[\hat{\boldsymbol{\theta}}]
\end{aligned}$$
As a consequence, $\mathbb{E}[\mathbf{v}] = \mathbf{0}$ as well as the main result also follow. ∎
Once stated and proved, the expressions of the Cramér-Rao inequalities
surely look formidable, and it is worthwhile to analyze them carefully. In the
univariate case, the main determinant of the lower bound is the denominator
of the ratio, which is called the Fisher information number:
$$\mathcal{I}_N(\theta) = \mathbb{E}\left[\left(\frac{\partial}{\partial \theta}\log f(X_1, \ldots, X_N; \theta)\right)^2\right]$$
note that this number is a different function of the parameter θ for each pos-
sible distribution that generates the data. The name “information” is based
on the interpretation of this number as the overall “amount of knowledge”
that a certain distribution f (X1 , . . . , XN ; θ) can provide about a parameter
of interest θ (the higher the number, the lower the bound on the variance of
θ). Its multivariate analogue is the matrix IN (θ), which is unsurprisingly
called Fisher information matrix.
As useful as this intuition can be, the expressions that characterize the
bound appear still difficult to operationalize in practice. However, they can
be simplified in a number of different ways.
1. If the estimators are unbiased, the two terms $\frac{\partial}{\partial \theta}\mathbb{E}[\hat{\theta}]$ and $\frac{\partial}{\partial \boldsymbol{\theta}^{\mathsf{T}}}\mathbb{E}[\hat{\boldsymbol{\theta}}]$ clearly reduce to 1 and the identity matrix $\mathbf{I}$, respectively.
2. If the sample is random, the information number and the information
matrix can be simplified as expressed in the theorem stated next.
3. Additional simplifications are possible under some fairly general con-
ditions that are detailed later.


Theorem 5.5. Cramér-Rao Inequality (i.i.d.) – univariate case. In the (univariate) setup of Theorem 5.4, if the sample is random the inequality can be expressed as follows:
$$\operatorname{Var}[\hat{\theta}] \geq \frac{\left[\frac{\partial}{\partial \theta}\mathbb{E}[\hat{\theta}]\right]^2}{N \cdot \mathbb{E}\left[\left(\frac{\partial}{\partial \theta}\log f_X(X; \theta)\right)^2\right]}$$
where $f_X(x; \theta)$ is the mass or density function that generates the sample.
Proof. Observe that:
$$\begin{aligned}
\mathbb{E}\left[\left(\frac{\partial}{\partial \theta}\log f(X_1, \ldots, X_N; \theta)\right)^2\right] &= \mathbb{E}\left[\left(\frac{\partial}{\partial \theta}\log \prod_{i=1}^{N} f_X(X_i; \theta)\right)^2\right] \\
&= \mathbb{E}\left[\left(\sum_{i=1}^{N} \frac{\partial}{\partial \theta}\log f_X(X_i; \theta)\right)^2\right] \\
&= \mathbb{E}\left[\sum_{i=1}^{N}\left(\frac{\partial}{\partial \theta}\log f_X(X_i; \theta)\right)^2\right] \\
&= \sum_{i=1}^{N} \mathbb{E}\left[\left(\frac{\partial}{\partial \theta}\log f_X(X_i; \theta)\right)^2\right] \\
&= N \cdot \mathbb{E}\left[\left(\frac{\partial}{\partial \theta}\log f_X(X; \theta)\right)^2\right]
\end{aligned}$$
where the first line follows from random sampling, the second line is a simple manipulation, the third and fourth lines are based on the linear properties of expectations and independence, as terms of the following form for $i \neq j$:
$$\mathbb{E}\left[\left(\frac{\partial}{\partial \theta}\log f_X(X_i; \theta)\right)\left(\frac{\partial}{\partial \theta}\log f_X(X_j; \theta)\right)\right] = 0$$
must be equal to the product of the respective means and therefore to zero, while the fifth line follows from identically distributed observations. ∎
Multivariate case. In the (multivariate) version of the setup of Theorem 5.4, if the sample is random the inequality is based on the following version of the information matrix:
$$\mathcal{I}_N(\boldsymbol{\theta}) = N \cdot \mathbb{E}\left[\left(\frac{\partial}{\partial \boldsymbol{\theta}}\log f_{\mathbf{x}}(\mathbf{x}; \boldsymbol{\theta})\right)\left(\frac{\partial}{\partial \boldsymbol{\theta}}\log f_{\mathbf{x}}(\mathbf{x}; \boldsymbol{\theta})\right)^{\mathsf{T}}\right]$$
where $f_{\mathbf{x}}(\mathbf{x}; \boldsymbol{\theta})$ is the mass or density function that generates the sample.


Proof. The proof is a simple extension of the univariate case. The information matrix is developed as:
$$\begin{aligned}
\mathcal{I}_N(\boldsymbol{\theta}) &= \mathbb{E}\left[\left(\frac{\partial}{\partial \boldsymbol{\theta}}\log f(\mathbf{x}_1, \ldots, \mathbf{x}_N; \boldsymbol{\theta})\right)\left(\frac{\partial}{\partial \boldsymbol{\theta}}\log f(\mathbf{x}_1, \ldots, \mathbf{x}_N; \boldsymbol{\theta})\right)^{\mathsf{T}}\right] \\
&= \mathbb{E}\left[\left(\frac{\partial}{\partial \boldsymbol{\theta}}\log \prod_{i=1}^{N} f_{\mathbf{x}}(\mathbf{x}_i; \boldsymbol{\theta})\right)\left(\frac{\partial}{\partial \boldsymbol{\theta}}\log \prod_{i=1}^{N} f_{\mathbf{x}}(\mathbf{x}_i; \boldsymbol{\theta})\right)^{\mathsf{T}}\right] \\
&= \mathbb{E}\left[\left(\sum_{i=1}^{N} \frac{\partial}{\partial \boldsymbol{\theta}}\log f_{\mathbf{x}}(\mathbf{x}_i; \boldsymbol{\theta})\right)\left(\sum_{i=1}^{N} \frac{\partial}{\partial \boldsymbol{\theta}}\log f_{\mathbf{x}}(\mathbf{x}_i; \boldsymbol{\theta})\right)^{\mathsf{T}}\right] \\
&= \mathbb{E}\left[\sum_{i=1}^{N}\left(\frac{\partial}{\partial \boldsymbol{\theta}}\log f_{\mathbf{x}}(\mathbf{x}_i; \boldsymbol{\theta})\right)\left(\frac{\partial}{\partial \boldsymbol{\theta}}\log f_{\mathbf{x}}(\mathbf{x}_i; \boldsymbol{\theta})\right)^{\mathsf{T}}\right] \\
&= \sum_{i=1}^{N} \mathbb{E}\left[\left(\frac{\partial}{\partial \boldsymbol{\theta}}\log f_{\mathbf{x}}(\mathbf{x}_i; \boldsymbol{\theta})\right)\left(\frac{\partial}{\partial \boldsymbol{\theta}}\log f_{\mathbf{x}}(\mathbf{x}_i; \boldsymbol{\theta})\right)^{\mathsf{T}}\right] \\
&= N \cdot \mathbb{E}\left[\left(\frac{\partial}{\partial \boldsymbol{\theta}}\log f_{\mathbf{x}}(\mathbf{x}; \boldsymbol{\theta})\right)\left(\frac{\partial}{\partial \boldsymbol{\theta}}\log f_{\mathbf{x}}(\mathbf{x}; \boldsymbol{\theta})\right)^{\mathsf{T}}\right]
\end{aligned}$$
where the crucial step is between the third and the fourth line, as the terms of the following form, for $i \neq j$:
$$\mathbb{E}\left[\left(\frac{\partial}{\partial \boldsymbol{\theta}}\log f_{\mathbf{x}}(\mathbf{x}_i; \boldsymbol{\theta})\right)\left(\frac{\partial}{\partial \boldsymbol{\theta}}\log f_{\mathbf{x}}(\mathbf{x}_j; \boldsymbol{\theta})\right)^{\mathsf{T}}\right] = \mathbf{0}$$
disappear due to independence; the other steps are simple manipulations or other implications of random sampling. ∎
The mentioned additional simplifications are possible if the differentiation operation with respect to the parameters of interest can pass through the expectation operator twice (this is generally the case for a wide number of distributions, including all those in the exponential macro-family). This implies that in the univariate case, it is:
$$\mathbb{E}\left[\left(\frac{\partial}{\partial \theta}\log f_X(X; \theta)\right)^2\right] = -\mathbb{E}\left[\frac{\partial^2}{\partial \theta^2}\log f_X(X; \theta)\right]$$
while in the multivariate case the following two $K \times K$ matrices are equal: a result known as the information matrix equality.
$$\mathbb{E}\left[\left(\frac{\partial}{\partial \boldsymbol{\theta}}\log f_{\mathbf{x}}(\mathbf{x}; \boldsymbol{\theta})\right)\left(\frac{\partial}{\partial \boldsymbol{\theta}}\log f_{\mathbf{x}}(\mathbf{x}; \boldsymbol{\theta})\right)^{\mathsf{T}}\right] = -\mathbb{E}\left[\frac{\partial^2}{\partial \boldsymbol{\theta}\, \partial \boldsymbol{\theta}^{\mathsf{T}}}\log f_{\mathbf{x}}(\mathbf{x}; \boldsymbol{\theta})\right]$$


These results are essentially mathematical properties of the logarithm of density and mass functions; these properties can facilitate the calculation of the Cramér-Rao bound. Only the multivariate continuous case is proven here (the univariate and the discrete cases are respectively a particular and an analogous version of it). Observe that:
$$\begin{aligned}
\mathbf{0} &= \frac{\partial}{\partial \boldsymbol{\theta}^{\mathsf{T}}}\frac{\partial}{\partial \boldsymbol{\theta}} 1 \\
&= \frac{\partial}{\partial \boldsymbol{\theta}^{\mathsf{T}}}\frac{\partial}{\partial \boldsymbol{\theta}} \int_{\mathbb{X}} f_{\mathbf{x}}(\mathbf{x}; \boldsymbol{\theta})\, \mathrm{d}\mathbf{x} \\
&= \frac{\partial}{\partial \boldsymbol{\theta}^{\mathsf{T}}} \int_{\mathbb{X}} \frac{\partial f_{\mathbf{x}}(\mathbf{x}; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\, \mathrm{d}\mathbf{x} \\
&= \frac{\partial}{\partial \boldsymbol{\theta}^{\mathsf{T}}} \int_{\mathbb{X}} \frac{\partial \log f_{\mathbf{x}}(\mathbf{x}; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\, f_{\mathbf{x}}(\mathbf{x}; \boldsymbol{\theta})\, \mathrm{d}\mathbf{x} \\
&= \int_{\mathbb{X}} \frac{\partial}{\partial \boldsymbol{\theta}^{\mathsf{T}}}\left[\frac{\partial \log f_{\mathbf{x}}(\mathbf{x}; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\, f_{\mathbf{x}}(\mathbf{x}; \boldsymbol{\theta})\right] \mathrm{d}\mathbf{x} \\
&= \int_{\mathbb{X}} \left[\frac{\partial^2 \log f_{\mathbf{x}}(\mathbf{x}; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}\, \partial \boldsymbol{\theta}^{\mathsf{T}}}\, f_{\mathbf{x}}(\mathbf{x}; \boldsymbol{\theta}) + \frac{\partial \log f_{\mathbf{x}}(\mathbf{x}; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\, \frac{\partial f_{\mathbf{x}}(\mathbf{x}; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}^{\mathsf{T}}}\right] \mathrm{d}\mathbf{x} \\
&= \int_{\mathbb{X}} \frac{\partial^2 \log f_{\mathbf{x}}(\mathbf{x}; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}\, \partial \boldsymbol{\theta}^{\mathsf{T}}}\, f_{\mathbf{x}}(\mathbf{x}; \boldsymbol{\theta})\, \mathrm{d}\mathbf{x} \;+ \\
&\quad + \int_{\mathbb{X}} \frac{\partial \log f_{\mathbf{x}}(\mathbf{x}; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\, \frac{\partial \log f_{\mathbf{x}}(\mathbf{x}; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}^{\mathsf{T}}}\, f_{\mathbf{x}}(\mathbf{x}; \boldsymbol{\theta})\, \mathrm{d}\mathbf{x}
\end{aligned}$$
where the first line is almost a tautology, the second line follows from the definition of joint density function, the third and fourth lines are just manipulations, the fifth line takes advantage of the fact that the differentiation operator can pass through the integral twice, the sixth line applies the chain rule, and finally the seventh and last line applies one last manipulation that makes the two sides of the information matrix equality distinct and visible. As all the lines equal a $K \times K$ matrix of zeros, the result must hold.
Collecting all these results together, if all the simplifications described apply, the Cramér-Rao Inequality can be written in the univariate case as:
$$\operatorname{Var}[\hat{\theta}] \geq -\frac{1}{N}\left[\mathbb{E}\left(\frac{\partial^2}{\partial \theta^2}\log f_X(X; \theta)\right)\right]^{-1} \qquad (5.22)$$
and in the multivariate case as follows.
$$\operatorname{Var}[\hat{\boldsymbol{\theta}}] + \frac{1}{N}\left[\mathbb{E}\left(\frac{\partial^2}{\partial \boldsymbol{\theta}\, \partial \boldsymbol{\theta}^{\mathsf{T}}}\log f_{\mathbf{x}}(\mathbf{x}; \boldsymbol{\theta})\right)\right]^{-1} \geq 0 \qquad (5.23)$$
Some examples help illustrate the usefulness of these conclusions.


Example 5.14. Comparison of estimators for the parameter of the Poisson distribution. Recall that if $X \sim \text{Pois}(\lambda)$, then $\mathbb{E}[X] = \lambda$ and $\operatorname{Var}[X] = \lambda$. This fact straightforwardly suggests two unbiased estimators for the Poisson parameter $\lambda$: the sample mean $\bar{X}$ and the sample variance $S^2$. A natural subsequent question is: which of the two estimators is better according to the MSE criterion, that is, which of the two has the smallest sampling variance? To answer this question, one could proceed by calculating the variance associated with either statistic. In the case of the sample mean this is simple: $\operatorname{Var}[\bar{X}] = \lambda/N$ by Theorem 4.3. However, calculating the variance of the sample variance is not as straightforward.

If one is working with a random sample, however, a shortcut is possible: the question can be answered in favor of the sample mean $\bar{X}$ by calculating the Fisher information number. Recall that $\mathbb{E}[X^2] = \lambda + \lambda^2$ and note that:
$$\begin{aligned}
\mathcal{I}_N(\lambda) &= N \cdot \mathbb{E}\left[\left(\frac{\partial}{\partial \lambda}\log \frac{\exp(-\lambda) \cdot \lambda^X}{X!}\right)^2\right] \\
&= N \cdot \mathbb{E}\left[\left(\frac{\partial}{\partial \lambda}\left(-\lambda + X \log \lambda - \log X!\right)\right)^2\right] \\
&= N \cdot \mathbb{E}\left[\left(-1 + \frac{X}{\lambda}\right)^2\right] \\
&= N\left(1 - \frac{2}{\lambda} \cdot \mathbb{E}[X] + \frac{1}{\lambda^2} \cdot \mathbb{E}[X^2]\right) \\
&= N\left(\frac{\lambda + \lambda^2}{\lambda^2} - 1\right) \\
&= \frac{N}{\lambda}
\end{aligned}$$
The Cramér-Rao bound is thus $\operatorname{Var}[\hat{\lambda}] \geq \lambda/N$ and is attained by $\hat{\lambda} = \bar{X}$. Also:
$$\begin{aligned}
\mathcal{I}_N(\lambda) &= -N \cdot \mathbb{E}\left[\frac{\partial^2}{\partial \lambda^2}\log \frac{\exp(-\lambda) \cdot \lambda^X}{X!}\right] \\
&= -N \cdot \mathbb{E}\left[\frac{\partial}{\partial \lambda}\left(-1 + \frac{X}{\lambda}\right)\right] \\
&= -N \cdot \mathbb{E}\left[-\frac{X}{\lambda^2}\right] \\
&= \frac{N}{\lambda}
\end{aligned}$$
showcasing the alternative procedure used to calculate the information number, which is often more straightforward. ∎


Example 5.15. The information matrix of the normal distribution. The normal distribution has two parameters: hence, the Cramér-Rao bound is evaluated in a multivariate setting. Suppose one is working with a random sample; in this case, the information matrix is most easily calculated through the Hessian matrix of the logarithmic density function:
$$\mathcal{I}_N(\mu, \sigma^2) = -N \cdot \mathbb{E}\begin{bmatrix} \dfrac{\partial^2}{\partial \mu^2}\log \dfrac{1}{\sigma}\phi\!\left(\dfrac{X - \mu}{\sigma}\right) & \dfrac{\partial^2}{\partial \mu\, \partial \sigma^2}\log \dfrac{1}{\sigma}\phi\!\left(\dfrac{X - \mu}{\sigma}\right) \\[2ex] \dfrac{\partial^2}{\partial \sigma^2\, \partial \mu}\log \dfrac{1}{\sigma}\phi\!\left(\dfrac{X - \mu}{\sigma}\right) & \dfrac{\partial^2}{\partial (\sigma^2)^2}\log \dfrac{1}{\sigma}\phi\!\left(\dfrac{X - \mu}{\sigma}\right) \end{bmatrix}$$
where $\phi(z)$, as usual, is the density function of the standard normal distribution. In analogy with the calculation of the Hessian matrix from Example 5.6, the above information matrix is as follows.
$$\mathcal{I}_N(\mu, \sigma^2) = -N \cdot \mathbb{E}\begin{bmatrix} -\dfrac{1}{\sigma^2} & -\dfrac{X - \mu}{\sigma^4} \\[2ex] -\dfrac{X - \mu}{\sigma^4} & \dfrac{1}{2\sigma^4} - \dfrac{(X - \mu)^2}{\sigma^6} \end{bmatrix} = \begin{bmatrix} \dfrac{N}{\sigma^2} & 0 \\[2ex] 0 & \dfrac{N}{2\sigma^4} \end{bmatrix}$$
Clearly, $\bar{X}$ is an unbiased estimator and its variance $\operatorname{Var}[\bar{X}] = \sigma^2/N$ attains the Cramér-Rao bound. However, while the estimator $S^2$ is unbiased:
$$\operatorname{Var}\left[S^2\right] = \frac{\sigma^4}{(N-1)^2}\operatorname{Var}\left[(N-1)\frac{S^2}{\sigma^2}\right] = \frac{2\sigma^4}{N-1}$$
it does not attain the Cramér-Rao bound, which is calculated as $2\sigma^4/N$ for unbiased estimators of $\sigma^2$. Consider instead the rescaled, biased estimator of the variance $\hat{\sigma}^2 = \frac{N-1}{N}S^2$. Its variance is:
$$\operatorname{Var}\left[\frac{N-1}{N}S^2\right] = \frac{\sigma^4}{N^2}\operatorname{Var}\left[(N-1)\frac{S^2}{\sigma^2}\right] = \frac{2(N-1)\sigma^4}{N^2}$$
and to verify whether the Cramér-Rao bound is attained, one must calculate the latter by taking into account the bias. Calling $\mathcal{I}_N(\sigma^2)$ the bottom-right element of the information matrix $\mathcal{I}_N(\mu, \sigma^2)$, the bound is expressed as:
$$\begin{aligned}
\operatorname{Var}\left[\frac{N-1}{N}S^2\right] &\geq \frac{1}{\mathcal{I}_N(\sigma^2)}\left[\frac{\partial}{\partial \sigma^2}\mathbb{E}\left(\frac{N-1}{N}S^2\right)\right]^2 \\
&= \frac{2\sigma^4}{N}\left[\frac{\partial}{\partial \sigma^2}\left(\frac{N-1}{N}\sigma^2\right)\right]^2 \\
&= \frac{2(N-1)^2 \sigma^4}{N^3}
\end{aligned}$$
and not even in this case is it attained. For both estimators, the actual value of the Cramér-Rao bound is equal to $\frac{N-1}{N}$ times their effective variance. ∎
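These comparisons can also be checked numerically. The following sketch (an illustration added here, assuming Python with NumPy) simulates normal random samples and compares the Monte Carlo variance of $S^2$ with its exact value $2\sigma^4/(N-1)$ and with the Cramér-Rao bound $2\sigma^4/N$.

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma2, N, reps = 0.0, 2.0, 25, 40_000

x = rng.normal(mu, np.sqrt(sigma2), size=(reps, N))
s2 = x.var(axis=1, ddof=1)       # unbiased estimator S² in each sample

exact = 2 * sigma2**2 / (N - 1)  # Var[S²] = 2σ⁴/(N-1)
bound = 2 * sigma2**2 / N        # Cramér-Rao bound for unbiased estimators of σ²

print(s2.var(), exact, bound)    # simulated variance ≈ exact value > bound
```

The simulated variance of $S^2$ matches the exact formula and stays strictly above the bound, with the ratio bound/variance equal to $(N-1)/N$ as stated in the example.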


Following this discussion, one may be left wondering whether any mathematical result can help identify whether an unbiased estimator attains the Cramér-Rao bound or not. Such a result exists and is the following.

Theorem 5.6. Attainment of the Cramér-Rao Bound – univariate case. In the (univariate) setup of Theorem 5.4, if $\hat{\theta}$ is an unbiased estimator of $\theta$, it attains the Cramér-Rao bound if and only if:
$$a_N(\theta)\left[\hat{\theta} - \theta\right] = \frac{\partial}{\partial \theta}\log f_{X_1, \ldots, X_N}(x_1, \ldots, x_N; \theta)$$
for some function $a_N(\theta)$ of the parameter.

Proof. Recall the proof of Theorem 5.4 as well as Theorem 3.4: the equality is attained only if $U$ (the estimator) is a linear function of $V$ (the derivative of the logarithmic joint mass or density function of the sample, i.e. the log-likelihood function). By the Cauchy-Schwarz Inequality this can be phrased as $a(U - \mathbb{E}[U]) = V$. As $a$ can be a function of $\theta$, write it as $a_N(\theta)$. ∎

Multivariate case. In the (multivariate) setup of Theorem 5.4, if $\hat{\boldsymbol{\theta}}$ is an unbiased estimator of $\boldsymbol{\theta}$, it attains the Cramér-Rao bound if and only if:
$$\mathbf{A}_N(\boldsymbol{\theta})\left[\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}\right] = \frac{\partial}{\partial \boldsymbol{\theta}}\log f_{\mathbf{x}_1, \ldots, \mathbf{x}_N}(\mathbf{x}_1, \ldots, \mathbf{x}_N; \boldsymbol{\theta})$$
for some $K \times K$ matrix $\mathbf{A}_N(\boldsymbol{\theta})$ which is a function of the parameters.

Proof. Similarly to the univariate case, the equality is only attained if $\mathbf{u}$ is a linear function of $\mathbf{v}$, i.e. $\mathbf{A}(\mathbf{u} - \mathbb{E}[\mathbf{u}]) = \mathbf{v}$ where $\mathbf{A} = \mathbf{A}_N(\boldsymbol{\theta})$. ∎
Example 5.16. Attainment of the Cramér-Rao bound for estimators of the normal distribution. Consider again random sampling from the normal distribution. The derivative of the joint density corresponds to the MLE First Order Conditions as in Example 5.6; write them as:
$$\begin{pmatrix} \displaystyle\sum_{i=1}^{N} \frac{x_i - \mu}{\sigma^2} \\[2ex] \displaystyle\sum_{i=1}^{N} \frac{(x_i - \mu)^2}{2\sigma^4} - \frac{N}{2\sigma^2} \end{pmatrix} = \underbrace{\begin{pmatrix} \dfrac{N}{\sigma^2} & 0 \\[2ex] 0 & \dfrac{N}{2\sigma^4} \end{pmatrix}}_{= \mathbf{A}_N(\mu, \sigma^2)} \begin{pmatrix} \dfrac{1}{N}\displaystyle\sum_{i=1}^{N} x_i - \mu \\[2ex] \dfrac{1}{N}\displaystyle\sum_{i=1}^{N} (x_i - \mu)^2 - \sigma^2 \end{pmatrix}$$
as per Theorem 5.6. This decomposition not only shows again that $\bar{X}$ is an unbiased estimator of $\mu$ that attains the bound; it also reveals that the only unbiased estimator of $\sigma^2$ that attains the bound is $\tilde{\sigma}^2 = \sum_{i=1}^{N} (X_i - \mu)^2/N$: an estimator that is usually unfeasible because it requires an ex-ante perfect knowledge of the location parameter $\mu$, an unlikely occurrence. ∎


5.3 Tests of Hypotheses


The estimation of a parameter (or of other features of a probability distri-
bution), which is performed by calculating the associated estimate obtained
in the data, is usually only the first step of statistical analysis. In practical
contexts, researchers usually aim to perform statistical inference, that
is, probabilistic evaluations that concern the estimates and that are aimed
at answering some real-world questions of interest. In particular, researchers might be interested in evaluating the implications of their estimates for some
hypotheses that they have formulated. The methods by which these eval-
uations are performed fall under the name of tests of hypotheses.
Tests of hypotheses are formulated as follows. Researchers first formulate some null hypothesis about their parameters of interest, that is, a statement about the baseline scenario representing some initial belief or situation to be evaluated. In general, this is written as:
$$H_0: \theta \in \Theta_0$$
where $H_0$ is a common notation to represent the null hypothesis, $\theta$ are the possibly multidimensional parameters of interest, and $\Theta_0 \subset \Theta$ is the set of values in the parameter space $\Theta$ that are permitted by the null hypothesis. By contrast, the alternative hypothesis is a statement that negates the null hypothesis:
$$H_1: \theta \in \Theta_0^{c}$$
where $H_1$ is the usual notation for the alternative hypothesis, while $\Theta_0^{c}$ is the complement of $\Theta_0$ in the parameter space. It is helpful to represent the dichotomy between null and alternative hypotheses via some examples.
Example 5.17. Test on the mean of the normal distribution. The simplest case of a hypothesis test is that about the value of a single parameter, say the mean of the normal distribution. In this case, the null and alternative hypotheses read respectively as:
$$H_0: \mu = C \qquad H_1: \mu \neq C$$
where $|C| < \infty$ is a finite value. In a slightly more nuanced case, the two hypotheses are represented by two complementary inequalities. If $\Theta = \mathbb{R}$, these can for example be:
$$H_0: \mu \geq C \qquad H_1: \mu < C$$
or vice versa. The two cases are usually referred to as a two-sided test and a one-sided test, respectively. A common scenario is the one for $C = 0$; in this case, the one-sided test is a test about the sign of the parameter. ∎


Example 5.18. Test on the regression slope. A very common example of a test is the one about the slope parameter of the linear regression model. In this case, the two hypotheses read, respectively:
$$H_0: \beta_1 = C \qquad H_1: \beta_1 \neq C$$
for the two-sided test, and:
$$H_0: \beta_1 \geq C \qquad H_1: \beta_1 < C$$
or vice versa for the one-sided test. In this specific case, the test for $C = 0$ is about the relevance of the exogenous variable $X_i$ as an explanation of the endogenous variable $Y_i$ (or as a “predictor” as it is sometimes said). ∎
Example 5.19. Test on the parameter of the exponential distribution. The parameter space for the parameter $\lambda$ of the exponential distribution is the set of positive values. Thus, the test with:
$$H_0: 0 < \lambda \leq C \qquad H_1: \lambda > C$$
is a proper formulation for testing whether the waiting time of some phenomenon of interest that can be modeled through the exponential distribution is lower or higher than some positive number $C > 0$. ∎
Example 5.20. Test on the equality of the means of the multivariate normal distribution. Suppose that one is analyzing a phenomenon that can be modeled via the bivariate normal distribution, and is wondering whether the means of the two random variables involved (call them $X$ and $Y$) are equal. In this case, the two hypotheses are formulated as:
$$H_0: \mu_X - \mu_Y = 0 \qquad H_1: \mu_X - \mu_Y \neq 0$$
which is a well-defined restriction in the parameter space. Next, consider the multivariate case, where it might be interesting to verify whether all location parameters are equal to some specified values. Here, the hypotheses are:
$$H_0: \mu_k = C_k \qquad H_1: \mu_k \neq C_k$$
for $k = 1, \ldots, K$, where the restricted set $\Theta_0$ is a specific point in $\mathbb{R}^K$. ∎
Example 5.21. Test on the variance of the normal distribution. A researcher who aims to test the parameter representing the variance of a distribution, say the normal, must be conscious of the associated parameter space, e.g. $\sigma^2 > 0$. A sensible test in this case is:
$$H_0: 0 < \sigma^2 \leq C \qquad H_1: \sigma^2 > C$$
where, again, the constant $C > 0$ must be positive. ∎


Example 5.22. Test on the variance ratio of two independent normal distributions. One may wonder about the relationship between the variances of two independent normal random variables $X$ and $Y$. A test which is adequate for this environment is:
$$H_0: \frac{\sigma^2_X}{\sigma^2_Y} \leq C \qquad H_1: \frac{\sigma^2_X}{\sigma^2_Y} > C$$
where $\sigma^2_X$ and $\sigma^2_Y$ are the variances of $X$ and $Y$, respectively. Here, $C < \infty$ must be finite but is otherwise unrestricted. If $C = 1$ the test has an obvious interpretation: the null hypothesis represents the scenario where $X$ has a variance smaller than or equal to that of $Y$, while the alternative hypothesis states that the variance of $X$ is larger than that of $Y$. Naturally, two-sided tests about specific values of the ratio are perfectly possible. ∎

A test of hypothesis is conventionally conducted as follows.

1. The researcher establishes the two competing hypotheses, H0 and H1 .

2. The researcher identifies ex ante those values of the sample realizations that are associated with acceptance of the null hypothesis and rejection of the alternative hypothesis – this is called the acceptance region – as well as those values that are associated with acceptance of the alternative hypothesis and rejection of the null hypothesis – the rejection region. The two sets must be complementary in the support of the sample.

3. The researcher examines the sample and makes a decision according to the criteria established at point 2 above.

Naturally, specifying the acceptance and rejection regions for large samples
can be quite complicated, and maybe not extremely useful. Therefore, it is
common to use univariate test statistics for this purpose.

Definition 5.7. Test statistic. In the context of some test of hypothesis, a test statistic $T = T(\mathbf{x}_1, \ldots, \mathbf{x}_N)$ is a statistic with support $\mathbb{T}$, whose sample realization is written as $t = T(\mathbf{x}_1, \ldots, \mathbf{x}_N)$, and which is such that, given a binary partition of the support $\mathbb{T}_0 \cup \mathbb{T}_0^{c} = \mathbb{T}$, the test is resolved as follows.
$$t \in \begin{cases} \mathbb{T}_0 & \Rightarrow\ H_0 \text{ is accepted, and } H_1 \text{ is rejected} \\ \mathbb{T}_0^{c} & \Rightarrow\ H_1 \text{ is accepted, and } H_0 \text{ is rejected} \end{cases}$$
In this setting, $\mathbb{T}_0$ and $\mathbb{T}_0^{c}$ are respectively called the acceptance region and the rejection region associated with the test statistic.
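To make this definition concrete, the sketch below (an illustration added here; plain Python, with all function names chosen for this example) implements the resolution rule for a one-sided test on the mean of a normal population with known variance: the test statistic is the standardized sample mean, and the rejection region is $\{z : z > z^*\}$, with the conventional 5% critical value $z^* \approx 1.645$ as a default.

```python
import math

def std_normal_cdf(t):
    # Φ(t) computed via the error function (standard library only)
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def one_sided_mean_test(xbar, mu0, sigma, n, z_star=1.645):
    """Resolve H0: mu <= mu0 against H1: mu > mu0 when sigma is known.

    The test statistic is z = sqrt(n)(xbar - mu0)/sigma; the rejection
    region is {z : z > z_star}."""
    z = math.sqrt(n) * (xbar - mu0) / sigma
    p_value = 1.0 - std_normal_cdf(z)   # tail probability of the realization
    return z, p_value, z > z_star

z, p, reject = one_sided_mean_test(xbar=0.5, mu0=0.0, sigma=1.0, n=25)
print(z, round(p, 4), reject)  # 2.5, about 0.0062, True
```

With a sample mean of 0.5 from 25 observations and unit variance, the realized statistic lands deep in the rejection region, so the null hypothesis is rejected; the returned tail probability anticipates the notion of p-value discussed below.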


A proper test statistic is one whose probability distribution varies along


with the parameters being tested, so that it is possible to associate probabil-
ities for the acceptance and rejection regions that also vary as a function of
the parameters under examination. The choice of a specific test statistic is
usually test dependent. Before illustrating – with the aid of some examples
– different test statistics for different hypotheses, it is necessary to discuss
how the acceptance and rejection regions are determined. Ultimately, these are always discretionary choices of the researcher, which are however typically made according to certain conventions well grounded in statistical theory.
This discussion requires some additional definitions.
Definition 5.8. Type I Error. In the framework of a test of hypotheses,
the type I error is the circumstance whereby the null hypothesis is rejected,
and the alternative hypothesis is accepted, while the null hypothesis is true.
Definition 5.9. Type II Error. In the framework of a test of hypotheses,
the type II error is the circumstance whereby the null hypothesis is accepted,
and the alternative hypothesis is rejected, while the null hypothesis is false.
The various outcomes of a test are commonly schematized as follows.

                              t ∈ T0                t ∈ T0^c
    θ ∈ Θ0              Correct decision        Type I error
    θ ∈ Θ0^c            Type II error           Correct decision

In an ideal test, both types of errors never occur; clearly, this ideal cannot
be attained as otherwise it would not be necessary to conduct tests in the
first place. At the same time, it is not possible to identify a criterion which
is useful to simultaneously shrink both types of errors; since the probability
to commit either depends on the acceptance and rejection regions, reducing
one increases the other and vice versa. The following concept well represents
the trade-off in question.
Definition 5.10. Power Function. The probability that the test statistic
falls in the rejection region, as a function of the parameters θ, is the power
function of a test.
$$P_T(\theta) = \Pr(t \in \mathbb{T}_0^{c}; \theta) = 1 - \Pr(t \in \mathbb{T}_0; \theta)$$
Clearly, a power function expresses the probability of committing a Type I error if $\theta \in \Theta_0$, and equals one minus the probability of committing a Type II error if $\theta \in \Theta_0^{c}$. This notion, in turn, is instrumental in the following definitions.


Definition 5.11. Level of a test. Given a number $\alpha \in [0, 1]$, a test with power function $P_T(\theta)$ has confidence level $\alpha$ if $P_T(\theta) \leq \alpha$ for all $\theta \in \Theta_0$.

Definition 5.12. Size of a test. Given a number $\alpha \in [0, 1]$, a test with power function $P_T(\theta)$ has size $\alpha$ if $\sup_{\theta \in \Theta_0} P_T(\theta) = \alpha$.
The distinction between level and size is subtle, but it highlights aspects
of the testing procedure. Given that a trade-off between Type I and Type II
errors exists, the convention in statistical analysis is to restrict the attention
to tests that have a sufficiently small probability of Type I errors (rejecting
the null hypothesis when it is true), a value which is fixed at some α. These
tests are said to have confidence level α. The confidence level is always a discretionary choice of the researcher, but conventionally, α is chosen to be equal to one of the values 0.1, 0.05, and 0.01. The smaller the confidence level, the more credible is the outcome of the test when the null hypothesis
is rejected (since that outcome has a smaller probability to occur under the
null hypothesis). Once a confidence level is decided, a conscious researcher
must recognize that attempting to further reduce the probability of Type I
errors might be counterproductive, due to an increased probability of Type
II errors. Thus, the attention is restricted to those tests whose maximum
probability of rejecting the null hypothesis when it is true is exactly α: the
size of the test.4 In most practical applications, this nominal distinction is of
little consequence, but it is important to make a correct use of terminology.
Example 5.23. Testing for the mean: level and size. Consider a test about the mean of a certain distribution, say the normal. In the two-sided case, $\Theta_0 = \{C\}$ and $\Theta_0^{c} = \mathbb{R} \setminus \{C\}$, hence there is no practical distinction between level and size. In the one-sided case, however, if the null hypothesis is that the mean is smaller than or equal to some constant $C$, then $\Theta_0 = (-\infty, C]$ and $\Theta_0^{c} = (C, \infty)$, and vice versa. Consequently, for a fixed level $\alpha$ there are different rejection probabilities for different values in $\Theta_0$. In typical testing procedures, the maximum rejection probability is achieved at $\mu = C$. ∎
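The point of this example can be evaluated explicitly through the power function of the one-sided z-test. In the sketch below (an illustration added here; plain Python, with the 5% critical value 1.645 taken as given), the rejection probability is increasing in µ, so over Θ0 = (−∞, C] its supremum is attained at µ = C, where it equals the size α ≈ 0.05.

```python
import math

def std_normal_cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def power(mu, C=0.0, sigma=1.0, n=25, z_star=1.645):
    # P(reject H0: mu <= C) when the true mean is mu; the statistic
    # sqrt(n)(X̄ - C)/σ is normal with mean sqrt(n)(mu - C)/σ and unit variance
    return 1.0 - std_normal_cdf(z_star - math.sqrt(n) * (mu - C) / sigma)

for mu in (-0.4, -0.2, 0.0, 0.2, 0.4):
    print(mu, round(power(mu), 4))
# The rejection probabilities are increasing in mu, and at mu = C = 0 the
# power equals roughly 0.05: the size of the test.
```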
After conducting tests, researchers usually report the following information alongside their statistical analyses: the confidence level, the decision outcome (acceptance vs. rejection) and, often, a statistic called the p-value.

Definition 5.13. The p-value. In a test of hypothesis with given size $\alpha$, a p-value is a statistic $P = P(\mathbf{x}_1, \ldots, \mathbf{x}_N)$ such that, for all $\theta \in \Theta_0$, $\Pr(P(\mathbf{x}_1, \ldots, \mathbf{x}_N) \leq \alpha) \leq \alpha$.
4. Some advanced statistical theory of tests helps identify criteria to obtain the “optimal” tests, that is, those tests that minimize the probability of Type II errors for a fixed level α. This analysis is outside the scope of this discussion.


The definition of p-value is cumbersome and recursive, but delivers an in-


tuition: the smaller the p-value associated with a sample, the smaller is the
probability to observe that sample when the null hypothesis is true. Hence,
a smaller p-value is interpreted in terms of less favorable evidence in favor
of the null hypothesis. This concept allows researchers to evaluate the out-
comes of tests on a more continuous scale, instead of being constrained by
the “acceptance” vs. “rejection” dichotomy. Usually, p-values are obtained
through the test statistics, by measuring the probability of observing real-
izations of the test statistic that are even less favorable to the null hypothesis
than the actual realization t. This is best illustrated via some examples.
Example 5.24. Testing for the mean µ of the normal distribution:
test statistics, error types and p-values. Suppose that some researcher
is conducting a test about the mean of a population which is described by a
normally distributed random variable X ∼ N (µ, σ2 ). To conduct the test,
the researcher has collected a random sample drawn from X. Suppose for
the moment that the researcher knows the variance σ2 of the distribution
of X. Additionally suppose, again for the moment, that the null hypothesis
is that the mean is not a positive number.
$$H_0: \mu \leq 0 \qquad H_1: \mu > 0$$
As discussed in Lecture 4, in this environment the standardized sample mean
is a statistic which follows the standard normal distribution: thus, a logical
test protocol is to reject the null hypothesis if the observed standardized
sample mean surpasses a certain critical value, call it $z^{*}$. This means that the test, for a given $\mu_0 \leq 0$, is resolved as follows:
$$\sqrt{N}\, \frac{\bar{X} - \mu_0}{\sigma} \begin{cases} \leq z^{*} \ \Rightarrow\ \mu \leq 0 \\ > z^{*} \ \Rightarrow\ \mu > 0 \end{cases}$$
where the arrows directed to the right indicate alternative conclusions about
the value of µ. The ideal test for these hypotheses would be based on µ0 = 0:
the reason is best illustrated by the following two observations.
1. The probability of committing a Type I error:
$$\Pr(\text{Type I error}) = \Pr\left(\sqrt{N}\, \frac{\bar{X} - \mu_0}{\sigma} > z^{*} \ \Big|\ H_0 \text{ is true}\right)$$
depends on the actual value of $\mu_0$ in the expression above. It is obvious that the highest Type I error probability is attained for the supremum value of $\mu$ in the set defined by the null hypothesis: this value is clearly $\mu_0 = 0$. The probability of rejecting the null hypothesis associated with that value is the size $\alpha$ of the test.


2. Similarly, the probability of committing a Type II error:

   P(Type II error) = P( √N (X̄ − µ0)/σ ≤ z* | H1 is true )

   is a function of the actual value of µ1 > 0 if the alternative hypothesis
   is true. The researcher is ignorant about this value, but it is clear that,
   whatever the value, the probability increases along with z*.
To illustrate the trade-off between the two error probabilities, suppose
that the alternative hypothesis is true and that the actual mean is µ1 = 3,
while the decision cutoff is set by the researcher at √N · x̄/σ > z* = 2. The
standardized sample mean is centered at µ0 = 0 since that is the value that
maximizes the probability of a Type I error, and simultaneously minimizes
the probability of a Type II error. The two probabilities are respectively a
decreasing and an increasing function of the threshold value z*, and they
are illustrated in Figure 5.2 below for the given threshold z* = 2.

[Figure omitted: two normal densities, a solid one centered at µ0 = 0 and a dashed one centered at µ1 = 3, with the cutoff z* = 2 marked on the horizontal axis.]
Note: this figure represents the probabilities of both a Type I and a Type II error when H0 : µ ≤ 0, the alternative hypothesis is true with µ1 = 3, and the testing protocol of the researcher is to reject the null hypothesis if the realized standardized sample mean centered at µ0 = 0 exceeds 2. The probability of a Type I error is thus the shaded area below the continuous density function centered at µ0 = 0, while the probability of a Type II error is the dotted area below the dashed density function centered at µ1 = 3.

Figure 5.2: Test on the mean of a normal distribution: error types I & II

Consider now the general case of a one-sided test, for some given C.
H0 : µ ≤ C H1 : µ > C
In order to use the standardized sample mean as an appropriate test statistic
with a given size α, one must solve the following equation in terms of the
critical value z*_α, where the subscript indicates the size of the test.

P( X̄ > C + (σ/√N) z*_α ) = P( √N (X̄ − C)/σ > z*_α ) = α


In the above expression, the second probability is evaluated with reference
to the standard normal cumulative distribution Φ(z).⁵ This procedure implies
that the null hypothesis is rejected if the observed realization of the sample
mean is such that √N (x̄ − C)/σ > z*_α; whatever the outcome of the test,
the p-value is calculated as follows:

p(x̄) = P( X̄ ≥ x̄ )

where again, the calculation is performed through the cumulative standard
normal. To illustrate, consider the left panel of Figure 5.3, which depicts a
standard normal's density function. The shaded area in the right tail repre-
sents the rejection region: the area corresponding to the realizations of the
standardized sample mean that lead to the rejection of the null hypothesis,
if α = 0.05. In this case, the critical value is z*_{0.05} ≈ 1.64.

[Figure omitted: two standard normal densities; the left panel shades the right tail beyond 1.64, the right panel shades both symmetric tails beyond ±1.96.]
Note: the left panel depicts a one-sided test, the right panel a two-sided test, both with size α = 0.05. The shaded areas represent the corresponding rejection regions. The random variable X represented in both panels is the standardized sample mean centered at C; it follows the standard normal distribution. The critical values are, respectively, z*_{0.05} ≈ 1.64 in the left panel and z*_{0.025} ≈ 1.96 in the right panel.

Figure 5.3: Mean of the normal distribution: rejection region

Now suppose that the test is two-sided: the null hypothesis allows for
only one value C, while the alternative hypothesis admits all other values.

H0 : µ = C    H1 : µ ≠ C

The researcher must now look for two symmetric critical values: z*_{α/2} > 0
and its mirror image −z*_{α/2} < 0. Intuitively, the researcher is agnostic about
the sign of the deviation from C in case the alternative hypothesis is true;
⁵ Once again, the standardized sample mean is centered at µ = C because with this choice, and for a fixed level α, the probability of a Type I error is maximized while the probability of a Type II error is minimized.


hence, given a level α the probabilities of both the Type I and the Type II
errors are minimized when:

P( X̄ − C > (σ/√N) z*_{α/2} ) = P( √N (X̄ − C)/σ > z*_{α/2} ) = α/2

and the null hypothesis is rejected if √N |x̄ − C|/σ > z*_{α/2}.⁶ This is visually
represented in the right panel of Figure 5.3, where the rejection region is
composed of two symmetric tails of the standard normal distribution. Here,
the p-value is calculated as the sum of two symmetric probabilities:

p(x̄) = P( X̄ > |x̄| ) + P( X̄ < −|x̄| )
     = 2 · P( X̄ > |x̄| )
     = 2 · P( X̄ < −|x̄| )

where the last two equalities follow from the symmetry of the standard
normal distribution.

Two-sided tests about the mean of the normal distribution are perhaps the
most common kinds of tests of hypotheses. It is thus useful to memorize the
critical values associated with conventional significance levels: z*_{0.05} ≈ 1.64
if α = 0.1, z*_{0.025} ≈ 1.96 if α = 0.05, and z*_{0.005} ≈ 2.58 if α = 0.01.
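In practice these quantiles and p-values come from statistical software rather than from tables. The following minimal sketch (assuming Python with SciPy; the sample numbers N, x̄, σ are made up for illustration) reproduces the one-sided and two-sided calculations above:

```python
from scipy.stats import norm

alpha = 0.05
N, xbar, C, sigma = 25, 0.9, 0.0, 2.0      # hypothetical sample values

z = (N ** 0.5) * (xbar - C) / sigma        # realized standardized sample mean

# one-sided test, H0: mu <= C -- reject when z exceeds the (1 - alpha) quantile
z_one = norm.ppf(1 - alpha)                # critical value z*_0.05, about 1.64
p_one = norm.sf(z)                         # p-value: P(Z >= z)

# two-sided test, H0: mu = C -- reject when |z| exceeds the (1 - alpha/2) quantile
z_two = norm.ppf(1 - alpha / 2)            # critical value z*_0.025, about 1.96
p_two = 2 * norm.sf(abs(z))                # p-value: 2 * P(Z >= |z|)
```

With these hypothetical numbers z = 2.25, so both the one-sided and the two-sided null hypotheses are rejected at the 5% level.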

Suppose instead that the variance σ² is unknown to the researcher. In
this case, the test statistic is unsurprisingly the t-statistic, where “t” stands
for test. For example, in the previous case of a one-sided test the researcher
should derive a critical value t*_α according to the expression:

P( X̄ > C + (S/√N) t*_α ) = P( √N (X̄ − C)/S > t*_α ) = α

where the second probability in the above display is evaluated in terms of a
Student's t-distribution with N − 1 degrees of freedom. Similarly as above,
the null hypothesis is rejected if √N (x̄ − C)/s > t*_α, and the p-value is
calculated as the following function of both the observed sample mean and
variance.

p(x̄, s²) = P( (X̄ − C)/S > (x̄ − C)/s )

The two-sided case bears symmetric analogies. One could graphically rep-
resent the two scenarios similarly as in Figure 5.3, but using the Student's
t-distribution instead of the standard normal. □
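The unknown-variance case is equally mechanical in code. Below is a sketch of the one-sided t-test (simulated data with an assumed seed; the same computation is packaged in SciPy's ttest_1samp):

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(0)                # assumed seed, for reproducibility
x = rng.normal(loc=1.0, scale=2.0, size=20)   # hypothetical sample, true mu = 1
N, C, alpha = x.size, 0.0, 0.05

xbar, s = x.mean(), x.std(ddof=1)             # sample mean and standard deviation
t_stat = np.sqrt(N) * (xbar - C) / s          # realized t-statistic

t_crit = t.ppf(1 - alpha, df=N - 1)           # one-sided critical value t*_alpha
p_value = t.sf(t_stat, df=N - 1)              # p-value: P(T >= t_stat) under H0
reject = t_stat > t_crit
```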
⁶ Here, the standardized sample mean calculated for evaluating the test is centered at µ = C because this is the only value allowed by the null hypothesis.


Example 5.25. Testing for the variance σ² of the normal distribu-
tion: test statistic and p-values. To further elaborate on the previous
example, suppose that the researcher is testing the variance parameter σ²
of the normally distributed population, with the same hypotheses outlined
in Example 5.21 in the introduction to this section.

H0 : 0 < σ² ≤ C    H1 : σ² > C

Here, the test statistic is the rescaled sample variance (N − 1) S²/C, and
the critical value k*_α for a test of size α is identified through the chi-squared
distribution with N − 1 degrees of freedom (see Figure 5.4 below).

P( S² > C k*_α/(N − 1) ) = P( (N − 1) S²/C > k*_α ) = α

[Figure omitted: chi-squared density with 7 degrees of freedom, shaded right tail beyond the critical value.]
Note: the shaded area represents the rejection region for a test with size α = 0.05 on the variance of a normal distribution if N = 8. The represented random variable X ∼ χ²₇ is the rescaled sample variance.

Figure 5.4: Variance of the normal distribution: rejection region

Thus, the null hypothesis is rejected if (N − 1) s²/C > k*_α, and the p-value
is calculated as p(s²) = P( S² ≥ s² ). □
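As a sketch in code (SciPy assumed; the realized sample variance s² and the bound C are hypothetical):

```python
from scipy.stats import chi2

N, s2, C, alpha = 8, 3.1, 2.0, 0.05      # hypothetical sample; H0: sigma^2 <= C

stat = (N - 1) * s2 / C                  # rescaled sample variance, ~ chi2(N-1) at sigma^2 = C
k_crit = chi2.ppf(1 - alpha, df=N - 1)   # critical value k*_alpha
reject = stat > k_crit                   # test outcome
p_value = chi2.sf(stat, df=N - 1)        # p-value in rescaled form
```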
Example 5.26. Testing for the variance ratio of two normal dis-
tributions: test statistic and p-values. Suppose now that the interest
of the analyst falls on the variances of two independent normally distributed
populations. The relevant hypotheses are as in Example 5.22:

H0 : σ²_X/σ²_Y ≤ C    H1 : σ²_X/σ²_Y > C

and by the analysis conducted in Lecture 4, the relevant test statistic is the
F-statistic F = (S²_X/S²_Y)/C. Thus, the critical value k*_α for a test of size α is


obtained by evaluating an F-distribution with paired NX − 1 and NY − 1
degrees of freedom, as per the expression:

P( S²_X/S²_Y > C k*_α ) = P( (S²_X/S²_Y)(1/C) > k*_α ) = α

and the illustration is given in Figure 5.5 below.

[Figure omitted: F density with (11, 11) degrees of freedom, shaded right tail beyond the critical value.]
Note: the shaded area represents the rejection region for a test with size α = 0.05 on the normal variance ratio if NX = NY = 12. The represented random variable X ∼ F₁₁,₁₁ is the F-statistic.

Figure 5.5: Ratio of two normal distributions' variances: rejection region

In this scenario, the null hypothesis is rejected if (s²_X/s²_Y)/C > k*_α and the
p-value is calculated as p(s²_X, s²_Y) = P( S²_X/S²_Y > s²_X/s²_Y ). □
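A sketch in code (hypothetical sample variances; SciPy assumed):

```python
from scipy.stats import f

NX, NY, alpha = 12, 12, 0.05
s2x, s2y, C = 4.0, 1.5, 1.0                        # hypothetical sample variances; H0: ratio <= C

stat = (s2x / s2y) / C                             # F-statistic, ~ F(NX-1, NY-1) at the boundary
k_crit = f.ppf(1 - alpha, dfn=NX - 1, dfd=NY - 1)  # critical value k*_alpha
reject = stat > k_crit
p_value = f.sf(stat, dfn=NX - 1, dfd=NY - 1)
```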
Example 5.27. Testing multiple means of the multivariate normal
distribution: test statistic and p-values. Consider now some composite
hypotheses about multiple parameters – specifically, multiple means – of the
multivariate normal distribution (recall Example 5.20):
H0 : µk = Ck    H1 : µk ≠ Ck
for k = 1, . . . , K. This test is best expressed in vectorial form:
H0 : µ = c    H1 : µ ≠ c

where µ is the vector of means and c = (C1, . . . , CK)ᵀ. A researcher who


knows the matrix Σ of variance-covariance parameters is in the position to
compute the following so-called u-statistic:
u = N (x̄ − c)ᵀ Σ⁻¹ (x̄ − c) ∼ χ²_K
which, by the properties of the sample mean x̄ drawn from a multivariate
normal distribution, follows the chi-squared distribution with K degrees of
freedom. The u-statistic can thus be used to test the hypothesis of interest
when the parameter matrix Σ is known.


In general, however, Σ is unknown and the sample variance-covariance
matrix S is used in its place, hence the test statistic adopted is the rescaled
Hotelling's t-squared statistic, already introduced in Lecture 4.

[(N − K)/(K(N − 1))] t² = [(N − K) N/(K(N − 1))] (x̄ − c)ᵀ S⁻¹ (x̄ − c) ∼ F_{K,N−K}

This test statistic follows the F-distribution with paired degrees of freedom
K and N − K. While this is a two-sided test, the null hypothesis is rejected if
the observed test statistic surpasses a certain critical value k*_α, an indication
that the true means likely differ from c.⁷ Given a size α, this critical value
is defined as follows.

P( [(N − K)/(K(N − 1))] t² > k*_α ) = α

This critical value is obtained as a quantile of an appropriate F-distribution,
as shown in Figure 5.6 below.

[Figure omitted: F density with (4, 8) degrees of freedom, shaded right tail beyond the critical value.]
Note: the shaded area represents the rejection region for a test with size α = 0.05 about K = 4 means µ = (µ1, . . . , µK) of a multivariate normal distribution, with N = 12. The represented random variable X ∼ F₄,₈ is the rescaled Hotelling's t-squared statistic which is discussed above in the text.

Figure 5.6: Mean of the multivariate normal distribution: rejection region

The p-value in this case is the probability of observing a Hotelling's t-squared
statistic which is larger than the actual realization.

p(x̄, S) = P( t² > N (x̄ − c)ᵀ S⁻¹ (x̄ − c) )

Clearly, it is perfectly possible to test only a subset L < K of the means of


the multivariate normal distribution, for L of its variables. In this case, the
relevant sample variance-covariance S has dimension L × L, and the test
statistic follows an F -distribution with degrees of freedom L and N − L.
⁷ Negative deviations of the kind X̄k − Ck for k = 1, . . . , K contribute positively to the calculation of the test statistic, since they are squared.


The logic of the test can be generalized. For example, if the hypotheses
of interest were about the equality of the means of a random vector (X, Y )
that follows the bivariate normal distribution, like:

H0 : µX − µY = 0    H1 : µX − µY ≠ 0

the appropriate test statistic is the following version of Hotelling's t-squared
statistic:⁸

t² = N (X̄ − Ȳ)² / (S²_X + S²_Y − 2S_XY) ∼ F_{1,N−1}

which follows the F-distribution with paired degrees of freedom 1 and N − 1,
and where:

S_XY = [1/(N − 1)] Σ_{i=1}^{N} (Xi − X̄)(Yi − Ȳ)

is the sample covariance between X and Y. This obtains from applying the
testing procedure to the following transformed random variable.

W = X − Y ∼ N( µX − µY , σ²_X + σ²_Y − 2ρ_XY σX σY )




Note that it would be inappropriate to test this specific hypothesis by simply


analyzing the difference between the two standardized sample means while
disregarding their covariance, unless there are good reasons to believe that
the two random variables X and Y are independent. 
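The statistics of this example translate directly into linear-algebra routines; the following sketch simulates data under the null hypothesis (assumed seed and sizes; NumPy and SciPy assumed available):

```python
import numpy as np
from scipy.stats import f

rng = np.random.default_rng(1)
N, K = 12, 4
X = rng.normal(size=(N, K))                    # hypothetical multivariate normal sample
c = np.zeros(K)                                # H0: mu = c = 0

xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False)                    # sample variance-covariance matrix (ddof = 1)

t2 = N * (xbar - c) @ np.linalg.inv(S) @ (xbar - c)   # Hotelling's t-squared
stat = (N - K) / (K * (N - 1)) * t2                   # rescaled: ~ F(K, N-K) under H0
p_value = f.sf(stat, dfn=K, dfd=N - K)
reject = stat > f.ppf(0.95, dfn=K, dfd=N - K)
```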
Example 5.28. Testing for the parameter λ of the exponential dis-
tribution: test statistic and p-values. Abandon for once the familiar
framework of a random sample drawn from the normal distribution, and
consider instead that of a random sample drawn from a random variable
X ∼ Exp(λ) that follows the exponential distribution; in this setting the
sample mean X̄ has an easily identifiable distribution. By the properties of
moment generating functions:

M_X̄(t) = [ M_X(t/N) ]ᴺ = ( 1 − λt/N )⁻ᴺ

and therefore:

X̄ ∼ Γ( N, N/λ )

that is, X̄ follows the Gamma distribution with the given parameters.
⁸ The name t-squared associated with Hotelling's statistic comes from the relationship between the Student's t-distribution and the F-distribution. As it was already observed in Lecture 2, if X ∼ T(ν) and Y = X², it is Y ∼ F(1, ν).


It is thus intuitive to use X̄ as a test statistic for λ based on that Gamma
distribution, should a researcher be interested in testing that parameter.
Specifically, let the hypotheses at hand be the following (Example 5.19).

H0 : 0 < λ ≤ C    H1 : λ > C
Given a test size α, the critical value g*_α is evaluated as:

P( X̄ > C (g*_α/√N + 1) ) = P( √N (X̄ − C)/C > g*_α ) = α

and the null hypothesis is rejected if x̄ > C g*_α/√N + C; the illustration is
given in Figure 5.7 below.

[Figure omitted: Gamma density with shape 10 and rate 2, shaded right tail beyond the critical value.]
Note: the shaded area represents the rejection region for a test with size α = 0.05 on the parameter λ of an exponential distribution if N = 10 and C = 5. The represented random variable X ∼ Γ(10, 2) is the rescaled sample mean under the null hypothesis that λ = C. Note that both parameters depend on C.

Figure 5.7: Exponential distribution's parameter λ: rejection region

Here, the p-value is calculated similarly as in the one-sided normal test,
that is p(x̄) = P( X̄ ≥ x̄ ). □
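A sketch of the computation (SciPy assumed; the observed x̄ is hypothetical, while N = 10 and C = 5 match Figure 5.7): under λ = C the sample mean follows Γ(N, N/C), i.e. a Gamma distribution with shape N and mean C.

```python
from scipy.stats import gamma

N, C, alpha = 10, 5.0, 0.05
xbar = 7.2                                # hypothetical observed sample mean

# under lambda = C, Xbar ~ Gamma(shape = N, rate = N/C), i.e. scale = C/N
dist = gamma(a=N, scale=C / N)
crit = dist.ppf(1 - alpha)                # reject H0: lambda <= C when xbar > crit
p_value = dist.sf(xbar)                   # p-value: P(Xbar >= observed xbar)
reject = xbar > crit
```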


This example almost exhausts the analysis of the alternative hypotheses
introduced at the beginning of this section: only the test about the linear
regression slope parameter β1 (Example 5.18) has been left out. The reason
is that the distribution of any estimator of β1 (say, the Method of Moments
estimator from Example 5.4) depends on the underlying assumptions about
the conditional distribution of Yi | Xi . If, for example, such a distribution is
normal, one can show that the distribution of the estimator is also normal,
and therefore tests would proceed as in Example 5.24. Observe that most of
the previous examples analyze one-sided tests; a useful exercise is to recast
them as two-sided tests, derive expressions for their critical values, and plot
the rejection regions. Finally, it must be observed that some of these tests
can be simplified in an asymptotic environment, as discussed in Lecture 6.


5.4 Interval Estimation


However precise, any exercise in parameter estimation is always uncertain.
Acknowledging this fact, statistical analysts usually supplement point esti-
mates with other “likely” values of their parameters of interest, so as to
better inform the understanding of real-world phenomena. This exercise
falls under the name of interval estimation. A formal definition follows.
Definition 5.14. Interval estimators. Consider the statistical inference
about a scalar parameter θ. An interval estimator is a pair of statistics that
are functions of the sample: the “lower” bound statistic L = L (x1 , . . . , xN )
and the “upper” bound statistic U = U (x1 , . . . , xN ), such that L ≤ U and
that if the values (x1 , . . . , xN ) are observed, the conclusion of the statistical
inference is that θ falls in the interval defined by the two realized statistics.
θ ∈ [L (x1 , . . . , xN ) , U (x1 , . . . , xN )]
The interval in question is also called confidence interval.
Naturally, a confidence interval can be made arbitrarily large so as to increase
the chances that the “true” parameter θ falls inside it; however, the larger
the interval, the less informative it is! At the extreme, the confidence interval
can encompass the entire parameter space for θ, which clearly makes the
entire exercise moot. Therefore, a good confidence interval is one that is as
small as possible while having a probability to include the true parameter θ
which is as high as possible. To evaluate this property, one must take into
account the following concepts.
Definition 5.15. Coverage probability. The coverage probability that
is associated with an interval estimator is the probability that the associated
confidence interval covers the true parameter, for a given parameter θ.
Coverage Probability = P (L (x1 , . . . , xN ) ≤ θ ≤ U (x1 , . . . , xN ))
Definition 5.16. Confidence coefficient. The confidence coefficient that
is associated with an interval estimator is the infimum of all the coverage
probabilities over the parameter space of θ (write it as Θ).

Confidence Coefficient = inf_{θ∈Θ} P( L(x1, . . . , xN) ≤ θ ≤ U(x1, . . . , xN) )

Note that the probabilities defined above depend on the chosen statistics L
and U (two random variables), and are evaluated in the sample space defined
by the support of the sample. In practice, the distinction between the two
definitions is often irrelevant, because in many common cases the coverage
probability does not vary in the parameter space. When it varies, however,
an interval estimator is evaluated in terms of the confidence coefficient.


This problem should be reminiscent of the analogous issue discussed in
the setting of tests of hypotheses: how to select the acceptance and rejection
regions while minimizing the combined chance of errors? The analogy is self-
evident and thus, it should not be too surprising that the statistical methods
for constructing confidence intervals are intimately related to techniques for
the conduct of hypothesis tests. In fact, all methods for the construction
of confidence intervals are related to tests; here, only one of these methods
is described, as all the others can be related to it. This method goes by the
name of inversion of test statistics and it proceeds as follows.
1. Start from a two-sided hypothesis about θ.
H0 : θ = C    H1 : θ ≠ C

2. Construct an acceptance region for C based on a test statistic T that
   is a function of C:

   T0 = { T(x1, . . . , xN; C) ∈ [k**_{1−α/2}, k*_{α/2}] }

   where k**_{1−α/2} and k*_{α/2} are two suitable critical values associated with,
   respectively, the (α/2)-th and (1 − α/2)-th quantiles of the distribu-
   tion of the test statistic; if the latter is symmetric around zero it holds
   that k**_{1−α/2} = −k*_{α/2}. The notation here is somewhat counterintuitive,
   since k**_{1−α/2} corresponds to the (α/2)-th quantile:

   P( T(x1, . . . , xN; C) > k**_{1−α/2} ) = 1 − α/2

   and symmetrically k*_{α/2} corresponds to the (1 − α/2)-th quantile:

   P( T(x1, . . . , xN; C) > k*_{α/2} ) = α/2

   This notation is chosen for the sake of consistency with the more gen-
   eral treatment of tests. The above acceptance region T0 is associated
   with a size α which is defined in terms of the following probability.

   P( k**_{1−α/2} ≤ T(x1, . . . , xN; C) ≤ k*_{α/2} ) = 1 − α

   Note that this equals one minus the probability of a Type I error.
3. Construct the following two statistics by inverting the function that
defines the test statistic, T (x1 , . . . , xN ; C), with respect to C, and by
evaluating the inverse at the two critical values.
   I1 = T⁻¹( x1, . . . , xN ; k**_{1−α/2} )
   I2 = T⁻¹( x1, . . . , xN ; k*_{α/2} )


4. The interval estimator is finally obtained as:

L (x1 , . . . , xN ) = min {I1 , I2 }


U (x1 , . . . , xN ) = max {I1 , I2 }

where typically, L = I2 and U = I1 . Note that the coverage probabil-


ity associated with the interval estimator is also 1 − α, since for any
C = θ the procedure just described implies the following.

P (L (x1 , . . . , xN ) ≤ θ ≤ U (x1 , . . . , xN )) = 1 − α

This procedure appears abstract and convoluted, but since most test statis-
tics are simple functions of the parameters, the inversion is generally straight-
forward and intuitive. The method is best illustrated via examples.
Example 5.29. The confidence interval for the mean of the normal
distribution. As described in Example 5.24, in two-sided tests about the
mean of the normal distribution in the case where the variance σ² is known,
the acceptance region is defined in terms of the following interval:

T0,µ = { √N (X̄ − C)/σ ∈ [−z*_{α/2}, z*_{α/2}] }

since the null hypothesis is rejected if √N |x̄ − C|/σ > z*_{α/2}. Note that in
this setting one can seamlessly convert open intervals into closed ones and
vice versa, since realizations equal to a specific value have probability zero.
The two critical values −z*_{α/2} and z*_{α/2} are evaluated in terms of the standard
normal distribution, which is symmetric around zero. Thus, the confidence
interval for µ is:

µ ∈ [ X̄ − (σ/√N) z*_{α/2} , X̄ + (σ/√N) z*_{α/2} ]

which is obtained easily, since the test statistic is a simple linear function
of the parameter. Note that the function is also a monotonically decreasing
one, hence L = I2 and U = I1 according to the procedure described earlier.
If instead the variance σ² is unknown, the same procedure based on the
t-statistic results in the analogous confidence interval:

µ ∈ [ X̄ − (S/√N) t*_{α/2} , X̄ + (S/√N) t*_{α/2} ]

where t*_{α/2} is evaluated in terms of the Student's t-distribution with N − 1
degrees of freedom, which is also symmetric around zero. □
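In code, both intervals amount to a quantile and a couple of arithmetic operations (simulated sample with an assumed seed; NumPy and SciPy assumed):

```python
import numpy as np
from scipy.stats import norm, t

rng = np.random.default_rng(2)
x = rng.normal(loc=1.0, scale=2.0, size=25)   # hypothetical sample
N, alpha = x.size, 0.05
xbar, s = x.mean(), x.std(ddof=1)

# known sigma: invert the z-statistic
sigma = 2.0                                   # assumed known standard deviation
zq = norm.ppf(1 - alpha / 2)
ci_z = (xbar - sigma / np.sqrt(N) * zq, xbar + sigma / np.sqrt(N) * zq)

# unknown sigma: invert the t-statistic instead
tq = t.ppf(1 - alpha / 2, df=N - 1)
ci_t = (xbar - s / np.sqrt(N) * tq, xbar + s / np.sqrt(N) * tq)
```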


Example 5.30. The confidence interval for the variance of the nor-
mal distribution. By extending Example 5.25, a two-sided test about the
variance of the normal distribution would have acceptance region:

T0,σ² = { (N − 1) S²/C ∈ [k**_{1−α/2}, k*_{α/2}] }

where k**_{1−α/2} and k*_{α/2} are evaluated in terms of the chi-squared distribution
with N − 1 degrees of freedom. To appreciate the difference with Example
5.25, see Figure 5.8 below.

[Figure omitted: the chi-squared density of Figure 5.4 with both tails shaded.]
Note: the shaded area displays the rejection region for a two-sided version of the test in Example 5.25.

Figure 5.8: Variance of the normal distribution: two-sided rejection region

Therefore, the confidence interval for σ² is:

σ² ∈ [ (N − 1) S²/k*_{α/2} , (N − 1) S²/k**_{1−α/2} ]

and again, it is L = I2 and U = I1 because the test statistic is decreasing
in the parameter of interest σ². □
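A sketch in code (hypothetical s², with N = 8 as in Figure 5.4; SciPy assumed):

```python
from scipy.stats import chi2

N, s2, alpha = 8, 3.1, 0.05                 # hypothetical sample variance
k_hi = chi2.ppf(1 - alpha / 2, df=N - 1)    # k*_{alpha/2}, upper critical value
k_lo = chi2.ppf(alpha / 2, df=N - 1)        # k**_{1-alpha/2}, lower critical value

# the lower bound uses the upper quantile and vice versa
ci = ((N - 1) * s2 / k_hi, (N - 1) * s2 / k_lo)
```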

Example 5.31. The confidence interval for the variance ratio from
two normal distributions. The case of the variance ratio σ²_X/σ²_Y from
two samples drawn from two independent normal distributions is analogous;
the two-sided version of Example 5.26 gives the following acceptance region:

T0,σ²_X/σ²_Y = { (S²_X/S²_Y)(1/C) ∈ [k**_{1−α/2}, k*_{α/2}] }

where k**_{1−α/2} and k*_{α/2} are evaluated as quantiles of the F-distribution with
paired degrees of freedom NX − 1 and NY − 1. Consequently, the confidence


interval for the variance ratio is:

σ²_X/σ²_Y ∈ [ (S²_X/S²_Y)(1/k*_{α/2}) , (S²_X/S²_Y)(1/k**_{1−α/2}) ]

and it is once again L = I2 and U = I1. □


Example 5.32. The confidence interval for the parameter λ of the
exponential distribution. The inference about the parameter λ of the
exponential distribution bears similarities with that about the mean of the
normal distribution; the differences are that the parameter λ characterizes
both the mean and the variance of the exponential distribution, and that the
distribution of the test statistic is asymmetric and has support on positive
values only. By elaborating the analysis from Example 5.28 as a two-sided
test, the acceptance region becomes:

T0,λ = { √N (X̄ − C)/C ∈ [g**_{1−α/2}, g*_{α/2}] }

where g**_{1−α/2} and g*_{α/2} are appropriate quantiles of the Gamma distribution
with parameters α = N and β = N/λ, as shown in Figure 5.9.

[Figure omitted: the Gamma density of Figure 5.7 with both tails shaded.]
Note: the shaded area displays the rejection region for a two-sided version of the test in Example 5.28.

Figure 5.9: Exponential distribution's par. λ: rejection region, two-sided

As a result, by the usual procedure one obtains:

λ ∈ [ X̄/(1 + N^{−1/2} g*_{α/2}) , X̄/(1 + N^{−1/2} g**_{1−α/2}) ]

that is, a confidence interval for λ with coverage probability 1 − α. □
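Since 1 + g/√N is just the corresponding quantile of X̄/λ ∼ Γ(N, N), the interval can be sketched in code as follows (hypothetical x̄, with N = 10; NumPy and SciPy assumed):

```python
import numpy as np
from scipy.stats import gamma

N, xbar, alpha = 10, 7.2, 0.05                    # hypothetical sample
G = gamma(a=N, scale=1.0 / N)                     # distribution of Xbar / lambda
g_hi = np.sqrt(N) * (G.ppf(1 - alpha / 2) - 1)    # g*_{alpha/2}
g_lo = np.sqrt(N) * (G.ppf(alpha / 2) - 1)        # g**_{1-alpha/2}

# note that 1 + g / sqrt(N) recovers the Gamma quantile itself
ci = (xbar / (1 + g_hi / np.sqrt(N)), xbar / (1 + g_lo / np.sqrt(N)))
```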


These examples show that, in general, the expression of the confidence
interval as a function of the test statistic is contextual: it must be derived on
a case-by-case basis. However, applications about the normal distribution
dominate, thanks to the asymptotic results developed in the next Lecture.

Lecture 6

Asymptotic Analysis

This lecture introduces the fundamental concepts of asymptotic probability


theory, along with associated results that allow one to expand and simplify statistical
analysis in settings where data samples of sufficiently large size are available.
More specifically, this lecture builds up the set of definitions and properties
that are necessary for an appropriate analysis of the Laws of Large Numbers
and the Central Limit Theorems; subsequently, it proves a relatively simple
version of both sets of results and discusses their implications in terms of
the asymptotic behavior of simple sample statistics and other estimators.

6.1 Convergence in Probability


The concept of convergence in probability relates to the idea of some random
variables “approaching” specific values in the support of their distribution
when the size of the sample grows very large. Convergence in probability is
suited to characterize the “asymptotic” behavior of statistics and estimators
in so-called large samples. Most typically, interest falls on those statistics
that converge in probability to certain population parameters or moments
of interest. To better introduce this concept it is necessary to formalize the
notion of sequences of random variables and vectors.
Definition 6.1. Random sequence. Any random vector expressed as an
N -indexed sequence, write it as xN = (X1N , . . . , XKN )T , is called a random
sequence. In the univariate context (K = 1), one can write it simply as XN .
The definition can be further extended to sequences of random matrices
with dimension J × K, that combine J vectorial sequences xjN of length
K for j = 1, . . . , J. Such a matrix is indicated for example as follows.
XN = ( x1N  x2N  · · ·  xJN )ᵀ


Example 6.1. Common random sequences. Both the sample mean (a
random vector) and the sample variance-covariance (a random matrix):

x̄N = (1/N) Σ_{i=1}^{N} xi    and    SN = [1/(N − 1)] Σ_{i=1}^{N} (xi − x̄N)(xi − x̄N)ᵀ

are two random sequences, as they are statistics that depend on the sample
size N. Their univariate versions are usually written as X̄N and S²_N. □
Endowed with the definition of random sequence, it is possible to express
the concept of convergence more formally. A first, intuitive requirement for
convergent random sequences is that the latter are somehow “bounded,” in
the sense that as N grows the probability distribution of xN concentrates
around a subset of the support. This is expressed with the following concept.
Definition 6.2. Boundedness in Probability. A sequence xN of random
vectors is bounded in probability if and only if, for any ε > 0, there exists
some number δε < ∞ and an integer Nε such that
P( ‖xN‖ ≥ δε ) < ε    ∀N ≥ Nε
which is also written as xN = Op (1) and read as “xN is big p-oh one.”
However this is not quite enough, because this definition still allows for the
probability distribution of xN to remain “dense” within some specific inter-
val but without shrinking into a unique point, even if N grows very large.
In fact, the concept of convergence in probability that is most frequently
adopted in these lectures is stronger than boundedness in probability.
Definition 6.3. Convergence in Probability. A sequence xN of random
vectors converges in probability to a constant vector c if
lim_{N→∞} P( ‖xN − c‖ > δ ) = 0

for any positive real number δ > 0.


The above definition formalizes the idea that as the sample size N grows
increasingly larger, the probability distribution of xN concentrates within
an increasingly smaller neighborhood of c. In terms of notation, convergence
in probability is usually denoted in the two following alternative ways:
xN →p c    or    plim xN = c
among which the former is preferred in these lectures. In fact, it is easy to
see that in analogy with real sequences, convergence in probability is just
a special case of boundedness in probability.

205
6.1. Convergence in Probability

Theorem 6.1. Convergent Random Sequences are also Bounded.
If some sequence xN of random vectors converges in probability to some
constant c, that is xN →p c, then it is also bounded: xN = Op(1).

Proof. By the triangle inequality, ‖xN‖ ≥ δ + ‖c‖ implies ‖xN − c‖ ≥ δ >
δ/2, hence P( ‖xN‖ ≥ δ + ‖c‖ ) ≤ P( ‖xN − c‖ > δ/2 ). By the definition of
convergence in probability, for any ε > 0 there is always an integer Nε such
that:

P( ‖xN − c‖ > δ/2 ) < ε    ∀N ≥ Nε

thus by setting δε = δ + ‖c‖ one gets that xN = Op(1).
This statement and its proof properly clarify the difference between bound-
edness and convergence in probability: while the former only requires the
condition to hold for some specific constant δε once N is large enough (pro-
vided such a constant exists), the latter must be true for any δ instead.
In the specific case of convergence in probability (Definition 6.3) where
c = 0, one can also write:
xN = op (1)
which is read as “xN is little p-oh one.” The use of the “probability” version
of the the big-oh and little-oh notation facilitates outlining the properties
of probability limits with respect to real sequences, which are analogous to
the non-stochastic case.
Definition 6.4. Convergence of Random to Real Sequences. Con-
sider a random sequence xN and a non-random sequence aN of the same di-
mension K. Moreover, define the random sequence zN = (Z1N , . . . , ZKN )T
where Zkn = Xkn /akn for k = 1, . . . , K and for n = 1, 2, . . . to infinity.
1. If zN = Op (1), then xN is said to be bounded in probability by aN ,
which one can write as xN = Op (aN ).
2. If zN = op (1), then xN is said to converge in probability to aN , which
one can write as xN = op (aN ).
There are further definitions of “convergence” in probabilistic sense that
are even stronger than convergence in probability.
Definition 6.5. Convergence in r-th Mean. A sequence xN of random
vectors is said to converge in r-th mean to a constant vector c under the
following condition.
lim_{N→∞} E[ ‖xN − c‖ʳ ] = 0

In the special case where r = 2, this concept is known as Convergence in
Quadratic Mean and is also expressed as follows.

xN →qm c


The following two useful results show that whenever some random sequence
converges in quadratic or higher mean to some specific vector (such as, say,
its mean), it also converges in probability to it.
Theorem 6.2. Convergence in Lower Means. A random sequence xN
that converges in r-th mean to some constant vector c also converges in s-th
mean to c for s < r.
Proof. The proof is based on Jensen's Inequality, since x ↦ x^{s/r} is concave
for s < r:

lim_{N→∞} E[ ‖xN − c‖ˢ ] = lim_{N→∞} E[ (‖xN − c‖ʳ)^{s/r} ] ≤ lim_{N→∞} { E[ ‖xN − c‖ʳ ] }^{s/r} = 0

since lim_{N→∞} E[ ‖xN − c‖ʳ ] = 0.


Theorem 6.3. Convergence in Quadratic Mean and Probability. If
a random sequence xN converges in r-th mean to a constant vector c for
r ≥ 2 (that is, at least xN →qm c), then it also converges in probability to c.

Proof. Define the (one-dimensional) nonnegative random sequence QN as:

QN = ‖xN − c‖ = √( (xN − c)ᵀ(xN − c) ) ∈ ℝ₊

and notice that by Theorem 6.2 it must converge in first mean:

lim_{N→∞} E[QN] = lim_{N→∞} E[ ‖xN − c‖ ] = 0

and therefore, since Var[QN] = E[Q²N] − (E[QN])², quadratic mean con-
vergence also implies the following.

lim_{N→∞} Var[QN] = lim_{N→∞} E[Q²N] = lim_{N→∞} E[ ‖xN − c‖² ] = 0

At the same time, by Čebyšëv's Inequality:

P( |QN − E[QN]| > δ ) ≤ Var[QN]/δ²

therefore, taking limits on both sides gives:

lim_{N→∞} P( ‖xN − c‖ > δ ) = lim_{N→∞} P( |QN − E[QN]| > δ ) ≤ lim_{N→∞} Var[QN]/δ² = 0

implying convergence in probability: xN →p c.
This result is useful for verifying that in random samples drawn from some
random vector x with finite variance Var[x] < ∞, the sample mean x̄N
converges in probability to the mean of the population, E[x].


Example 6.2. Convergence in Probability of the Sample Mean. In a random sample drawn from some random variable X:

lim_{N→∞} E[X̄N] = lim_{N→∞} (N/N) E[X] = E[X]

and in addition, if Var[X] < ∞:

lim_{N→∞} E[(X̄N − E[X̄N])^2] = lim_{N→∞} Var[X̄N] = lim_{N→∞} Var[X]/N = 0

and therefore, X̄N →qm E[X], which also implies X̄N →p E[X]. This is easily generalized to a multivariate context: for an N-dimensional random sample drawn from a random vector x with Var[x] < ∞, it holds that:

x̄N = (1/N) Σ_{i=1}^N xi →qm E[x]

which also implies convergence in probability, x̄N →p E[x].
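The quadratic mean convergence of the sample mean is easy to check numerically. The sketch below is a hypothetical illustration (not part of the notes), written in Python with NumPy; the bivariate normal population and all numerical values are arbitrary choices. It estimates E[‖x̄N − E[x]‖^2] by Monte Carlo and verifies that it shrinks roughly at rate 1/N:

```python
import numpy as np

rng = np.random.default_rng(0)
mean = np.array([1.0, -2.0])  # population mean E[x] (illustrative choice)

def mse_of_sample_mean(n, reps=2000):
    # Draw `reps` independent samples of size n from a bivariate normal with
    # independent unit-variance components, then estimate E[||xbar_N - E[x]||^2].
    draws = rng.normal(loc=mean, scale=1.0, size=(reps, n, 2))
    xbar = draws.mean(axis=1)  # one sample mean per replication
    return np.mean(np.sum((xbar - mean) ** 2, axis=1))

mse = {n: mse_of_sample_mean(n) for n in (10, 100, 1000)}
# Theory: E[||xbar_N - E[x]||^2] = tr(Var[x]) / N = 2 / N here.
```

The estimated mean squared error falls by a factor of ten each time N is multiplied by ten, exactly as the 2/N theoretical rate predicts.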

The last concept of convergence defined here, “almost sure convergence,”


is also stronger than convergence in probability, and it is mentioned here for
the sake of completeness. This notion is typically employed in the analysis
of time series, which deals with “sequences” of observations over time.

Definition 6.6. Almost Sure Convergence. A sequence xN of random vectors converges almost surely, or with probability one, to a constant vector c if it holds that:

P(lim_{N→∞} xN = c) = 1

where lim_{N→∞} xN is a random vector. This is also expressed as xN →a.s. c.

With the aid of some measure theory, it is possible to prove the intuitive
result that almost sure convergence implies convergence in probability.
All concepts, definitions and results about convergence that have been discussed thus far apply to sequences of random matrices as well. A random sequence XN of matrices converges in probability to some matrix C if:

lim_{N→∞} P(‖XN − C‖ > δ) = 0

(where for any matrix B, ‖B‖ = √tr(B^T B)). This is denoted as follows.

XN →p C


The following result about convergent random sequences is fundamental for easily deriving the asymptotic properties of many statistics and estimators, and it is applied extensively in econometrics. The result is stated in terms of vectorial random sequences, but it applies to matricial ones too.

Theorem 6.4. Continuous Mapping Theorem. Consider a vectorial random sequence xN ∈ X, a vector c ∈ X with the same length as xN, and a vector-valued continuous function g(·) with a set of discontinuity points Dg such that:

P(x ∈ Dg) = 0

(the probability mass at the discontinuities is zero). Then, it holds that:

xN →p c ⇒ g(xN) →p g(c)
xN →a.s. c ⇒ g(xN) →a.s. g(c)

that is, convergence in probability and almost sure convergence are preserved when functions are applied to random sequences.

Proof. (Sketched.) Only the case of convergence in probability is proved here, with the purpose of illustrating the core argument (which is essentially an extension of the properties of limits for continuous functions). For a given positive number δ > 0, define the set:

Gδ = {x ∈ X : x ∉ Dg, ∃y ∈ X : ‖x − y‖ < δ, ‖g(x) − g(y)‖ > ε}

this is the set of points in X where g(·) "amplifies" the distance with some other point y beyond a small neighborhood of ε. By this definition:

P(‖g(xN) − g(c)‖ > ε) ≤ P(‖xN − c‖ ≥ δ) + P(c ∈ Gδ) + P(c ∈ Dg)

and notice that upon taking the limit of the right-hand side as N → ∞, the second term vanishes by definition of a continuous function, while the third term is zero by hypothesis. Therefore:

lim_{N→∞} P(‖g(xN) − g(c)‖ > ε) ≤ lim_{N→∞} P(‖xN − c‖ ≥ δ)

which proves the theorem in the case of convergence in probability.


The importance of this result is that it allows one to easily derive the asymptotic properties of many statistical estimators. For example, it is generally not possible to derive the expected value of some function g(µ̂N) of a given unbiased estimator µ̂N such that E[µ̂N] = µ0 for some µ0. In fact, as it has been observed in Lecture 1, the best one can do about E[g(µ̂N)] is to make approximations based on Jensen's Inequality. However, if µ̂N also converges in probability to µ0, the Continuous Mapping Theorem ensures that in large samples g(µ̂N) converges in probability to g(µ0).

The most frequent applications of the Continuous Mapping Theorem are the simple transformations that are summarized in what follows, separately for scalar, vectorial and matricial random sequences.1
1. Scalars. Given two scalar random sequences XN →p x and YN →p y, the following holds.

(XN + YN) →p x + y
XN YN →p xy
XN/YN →p x/y if y ≠ 0

2. Vectors. Given two vector random sequences xN →p x and yN →p y of equal length, the following holds.

xN^T yN →p x^T y
xN yN^T →p x y^T

3. Matrices. Given two matrix random sequences XN →p X and YN →p Y of appropriate dimension it holds that:

XN YN →p XY

while for sequences of random full rank square matrices ZN →p Z, it is as follows.

ZN^{−1} →p Z^{−1}

4. Combinations of the Above. Consider the random sequences XN and xN above, and suppose that both the row and the column dimension of XN correspond with the length of xN. Then, the following holds.

xN^T XN xN →p x^T X x

Clearly, all these properties apply to almost sure convergence as well. It is


fairly easy to construct examples about these properties that involve, say,
convergent sample means. Once again, comparing these properties to those
of expectations sheds an unfavorable light upon the latter: for example,
it is difficult to calculate the expectation of a ratio, E [X/Y ], even if both
E [X] and E [Y ] are known quantities!
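For instance, the ratio property can be checked by simulation. The following is a hypothetical sketch (Python with NumPy, illustrative parameter values): the ratio of two sample means converges in probability to the ratio of the population means, even though its finite-sample expectation is not the ratio of expectations:

```python
import numpy as np

rng = np.random.default_rng(42)

def ratio_of_means(n):
    # X has population mean 3, Y has population mean 2:
    # the probability limit of Xbar/Ybar is 3/2 by the Continuous Mapping Theorem.
    x = rng.exponential(scale=3.0, size=n)
    y = rng.exponential(scale=2.0, size=n)
    return x.mean() / y.mean()

small, large = ratio_of_means(50), ratio_of_means(1_000_000)
# `large` should be very close to 1.5, while `small` can deviate noticeably.
```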
1
Observe that the random sequence xN in the statement of the theorem can be arrayed however wished; hence the Theorem equally applies to all desired combinations of scalars, vectors and matrices.


6.2 Laws of Large Numbers


The definitions and results about convergence presented thus far allow one to formulate some fundamental results in probability theory, with crucial implications for estimation and statistical analysis. Surely, the most important
of these results are the various theorems known as Laws of Large Numbers;
these posit that in a random sample, any scalar-, vector- or matrix-valued
sample mean converges in probability to its corresponding population mean,
and it does so under conditions that are more general than the previously
discussed result about convergence in quadratic mean. The Laws of Large
Numbers are only presented here for the case of vector-valued sample means,
of which scalars are a special case and matrices a more general one.

Theorem 6.5. Weak Law of Large Numbers (Khinčin's). The sample mean x̄N = (1/N) Σ_{i=1}^N xi associated with a random (i.i.d.) sample drawn from the distribution of a random vector x with finite mean E[x] < ∞ converges in probability to the population mean of x.

x̄N = (1/N) Σ_{i=1}^N xi →p E[x]

Proof. (Sketched.) The analysis is restricted to random vectors x for which the moment-generating function Mx(t) is defined. The moment-generating function of the sample mean x̄N is, for a given N:

Mx̄N(t) = E[exp(t^T x̄N)]
= E[exp((1/N) Σ_{i=1}^N t^T xi)]
= Π_{i=1}^N E[exp((1/N) t^T xi)]
= [Mx(t/N)]^N

where the third line follows from independence between observations, while the fourth line relies on observations being identically distributed (so that they have the same moment-generating function); essentially, this analysis is an extension of Theorem 3.6. From a Taylor expansion around t0 = 0:

Mx̄N(t) = [1 + t^T E[x]/N + o(t^T ι/N)]^N

hence, taking the limit gives the following result.

lim_{N→∞} Mx̄N(t) = exp(t^T E[x])

This is a trivial moment-generating function: that of a degenerate discrete random vector where the entire probability mass is concentrated at E[x]! Therefore, exploiting the result that moment-generating functions uniquely characterize their distributions, one can conclude that the sample mean converges in probability to its mean as N grows larger. If the random vector x lacks a moment-generating function, one can extend an analogous proof based on the characteristic function φx̄N(t) of the sample mean; this proof is obviously more complex and the pun here is intended.
Unlike the result about convergence of sample means in quadratic mean, the Weak Law of Large Numbers does not impose a finite variance of x, and is thus more general. The "stronger" version of the Law of Large Numbers is presented next, but in this case without proof. This version shows that, under stricter conditions, the sample mean approaches the population mean increasingly more closely, without deviating from it to an appreciable extent and with appreciable probability: it "converges with probability one."
Theorem 6.6. Strong Law of Large Numbers (Kolmogorov's). If in a random (i.i.d.) sample drawn from the distribution of some random vector x it simultaneously holds that: i. E[x] < ∞, ii. Var[x] < ∞, and iii. Σ_{n=1}^∞ n^{−2} Var[xn] < ∞, the sample mean x̄N = (1/N) Σ_{i=1}^N xi converges almost surely to its population mean.

x̄N = (1/N) Σ_{i=1}^N xi →a.s. E[x]
An even stronger version allows for independently, but not identically distributed observations (i.n.i.d.).

Theorem 6.7. Strong Law of Large Numbers (Markov's). Consider a non-random sample with independent, non-identically distributed observations (i.n.i.d.), where the random vectors xi that generate it have possibly heterogeneous moments E[xi] and Var[xi]. If for some δ > 0 it holds that:

lim_{N→∞} Σ_{i=1}^N (1/i^{1+δ}) E[|xi − E[xi]|^{1+δ}] < ∞

then the following almost sure convergence result holds.

(1/N) Σ_{i=1}^N (xi − E[xi]) →a.s. 0


Observe that Markov’s version of the Strong Law of Large Numbers does
not impose finite second moments, but only that the absolute moments of
order slightly larger than one, i.e. 1 + δ > 1, are finite. This is a seemingly
complex, but actually weaker condition (an analogue of which is also used
in certain versions of the Central Limit Theorem, as it is discussed later).
Other, more general versions of the Law of Large Numbers also allow for
weakly dependent observations – that is, n.i.n.i.d. samples – which are a
prominent feature of socio-economic settings. These results are extensively
applied in econometrics, but are not elaborated here. To give more intuition
about the working of the Law of Large Numbers, Figure 6.1 below displays
the results of multiple simulations about the sample mean calculated from
random samples of increasing size drawn from X ∼ Pois (4) – see the notes.

[Figure: four histograms of realizations of the sample mean, one panel each for N = 1, N = 10, N = 100, and N = 1000.]
Note: histograms of realizations of X̄N obtained from multiple i.i.d. samples drawn from X ∼ Pois (4).
Each histogram is obtained with 800 samples of the indicated size N . The realizations of X̄N are binned
on the x-axes with bins of length 0.02. For all histograms, the y-axes measure the density of their bins.

Figure 6.1: Simulation of the Law of Large Numbers for X ∼ Pois (4)
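A simulation in the spirit of Figure 6.1 can be run in a few lines. The sketch below is a hypothetical illustration (Python with NumPy): for each sample size, 800 sample means are drawn from X ∼ Pois(4), and their dispersion around E[X] = 4 collapses as N grows, at the theoretical rate √(4/N):

```python
import numpy as np

rng = np.random.default_rng(7)
REPS = 800  # number of simulated samples per size, as in Figure 6.1

spread = {}
for n in (1, 10, 100, 1000):
    # Each row is one i.i.d. sample of size n from a Poisson(4) distribution.
    samples = rng.poisson(lam=4.0, size=(REPS, n))
    means = samples.mean(axis=1)
    spread[n] = means.std()  # dispersion of the sample mean across samples

# Theory: sd of the sample mean is sqrt(4/N): about 2, 0.63, 0.2 and 0.063.
```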


It is useful to exemplify how the Laws of Large Numbers can be extended to estimators that can be expressed as functions of sample means. To this end, a first definition is in order.

Definition 6.7. Consistent Estimators. An estimator θ̂N is consistent if it converges in probability to the true population parameters θ0 which it is meant to estimate.

θ̂N →p θ0

Here the subscript "0" is used again to denote the true value of the parameter of interest. This is a standard convention in asymptotic analysis.
Example 6.3. Consistency of the linear regression estimators. Consider the bivariate linear regression model from Example 3.11 and the subsequent references. Suppose that a researcher has access to a random sample drawn from (Xi, Yi). The Method of Moments (MM) estimator of the true slope parameter β1 is defined – see Example 5.4 – as:

β̂1,MM = [Σ_{i=1}^N (Xi − X̄)(Yi − Ȳ)] / [Σ_{i=1}^N (Xi − X̄)^2]

where X̄ = N^{−1} Σ_{i=1}^N Xi and Ȳ = N^{−1} Σ_{i=1}^N Yi. This estimator can also be obtained via Maximum Likelihood under certain assumptions.

Observe that this estimator is defined as the ratio between the "sample covariance" of Xi and Yi and the sample variance of Xi. These sample statistics are obvious extensions of the sample mean; by the Weak Law of Large Numbers, their probability limits are:

(1/N) Σ_{i=1}^N (Xi − X̄)(Yi − Ȳ) →p Cov[Xi, Yi]
(1/N) Σ_{i=1}^N (Xi − X̄)^2 →p Var[Xi]

that is, the corresponding population moments. Therefore, by the properties of probability limits that derive from the Continuous Mapping Theorem, it follows that the MM estimator of the regression slope is consistent.

β̂1,MM →p β1

An extension of this analysis shows that the MM estimator of the regression constant β0:

β̂0,MM = Ȳ − β̂1,MM · X̄

is also consistent; by the Continuous Mapping Theorem, β̂0,MM →p β0.
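The consistency claim in Example 6.3 can be illustrated with a small Monte Carlo experiment. The sketch below is hypothetical (Python with NumPy); the true parameters β0 = 1 and β1 = 2 and the normal designs are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
beta0, beta1 = 1.0, 2.0  # true regression parameters (illustrative values)

def mm_estimates(n):
    x = rng.normal(0.0, 1.0, size=n)
    y = beta0 + beta1 * x + rng.normal(0.0, 1.0, size=n)  # linear model plus noise
    # Method of Moments: sample covariance over sample variance, then the constant.
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    return b0, b1

b0_hat, b1_hat = mm_estimates(200_000)
# In such a large sample, both estimates land very close to the true values.
```

Increasing N further shrinks the sampling error of both estimators at the usual 1/√N rate.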


One can show that if the assumptions that motivate Method of Moments or Maximum Likelihood estimators are correct, these estimators are consistent. This is shown next by some "heuristic" (i.e. intuitive, not fully rigorous) proofs.

Theorem 6.8. Consistency of the Method of Moments. An estimator θ̂MM defined as the solution of a set of sample moments (5.2) is consistent for the parameter set θ0 that solves the corresponding population moments (5.1), if such a solution exists (i.e. if the estimation problem is well defined).

Proof. (Heuristic.) By some applicable Law of Large Numbers:

(1/N) Σ_{i=1}^N m(xi; θ̂MM) →p E[m(xi; θ̂MM)] = 0

where the equality to 0 is given by the definition of a Method of Moments estimator, which is maintained throughout the sequence as N → ∞. Since by hypothesis the zero moment conditions have only one admissible solution, at the probability limit it is plim θ̂MM = θ0.

For Maximum Likelihood, an analysis as general as in the Method of Moments case above would hardly be simple. Thus, the (heuristic) proof is given here for random samples only. Extensions of this result are possible, and they apply more generalized versions of the Law of Large Numbers.

Theorem 6.9. Consistency of Maximum Likelihood Estimators. In a random sample, an estimator θ̂MLE which is defined as the maximizer of a log-likelihood function as per (5.19) is consistent for the parameter set θ0 that maximizes the corresponding population moment function.

θ0 = arg max_{θ∈Θ} E[log fx(x; θ)]

If such a maximum exists, by the likelihood principle it corresponds to the true parameter of the distribution under analysis.

Proof. (Heuristic.) By the Weak Law of Large Numbers, for any θ ∈ Θ, including θ̂MLE and θ0:

(1/N) Σ_{i=1}^N log fx(xi; θ) →p E[log fx(x; θ)]

moreover, by the definition of MLE the following holds for all N ∈ N.

(1/N) Σ_{i=1}^N log fx(xi; θ̂MLE) ≥ (1/N) Σ_{i=1}^N log fx(xi; θ0) →p E[log fx(x; θ0)]

Consequently, given that θ0 maximizes the expected log-density or log-mass function in the population:

lim_{N→∞} P(E[log fx(x; θ0)] ≥ E[log fx(x; θ̂MLE)]) = 1

all these facts can be reconciled only if, at the limit:

(1/N) Σ_{i=1}^N log fx(xi; θ̂MLE) →p E[log fx(x; θ̂MLE)] = E[log fx(x; θ0)]

and at the limit, plim θ̂MLE = θ0 by the Continuous Mapping Theorem.
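Theorem 6.9 can also be visualized numerically. The hypothetical sketch below (Python with NumPy, illustrative parameter values) uses the ML estimator of a normal variance, σ̂²_N = (1/N) Σ (Xi − X̄N)², which is biased in finite samples yet consistent: its realizations concentrate around the true σ² = 4 as N grows:

```python
import numpy as np

rng = np.random.default_rng(3)
sigma2 = 4.0  # true variance (illustrative value)

def mle_variance(n):
    x = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=n)
    # ML estimator of the variance: divides by N, not N - 1, hence biased.
    return np.mean((x - x.mean()) ** 2)

estimates = {n: mle_variance(n) for n in (10, 100_000)}
# E[sigma2_hat] = (N - 1)/N * sigma2: the bias vanishes as N grows, and the
# estimator converges in probability to sigma2 = 4.
```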

6.3 Convergence in Distribution


So far, all concepts of convergence that have been discussed have involved a random sequence converging to some constant c (or a matrix C). However, one can extend the concept of convergence to a random sequence converging to another random vector x (or another random matrix X). For example, writing the statement

xN →p x

is equivalent to saying that the random sequence xN converges in probability to the random vector x in the following sense.

lim_{N→∞} P(‖xN − x‖ > δ) = 0

Intuitively, in the limit the probabilistic behavior of xN "becomes similar" to that of x up to δ.2 A relevant question is: "How similar?" – that is, do all moments and the distribution of xN converge to those of x? In general, the answer to this kind of question is "No," unless one also invokes the concept of convergence in distribution.3
2
Parallel notions exist for r-th moment and almost sure convergence.
3
One can show – but it is intuitive – that if xN converges in r-th mean to x, that is:

lim_{N→∞} E[‖xN − x‖^r] = 0

then the r-th moments of xN (and by extension all lower moments) converge to those of x:

lim_{N→∞} E[|xN|^r] = E[|x|^r] < ∞

so long as they are finite. However, this is not enough to guarantee that higher moments, not to mention the distribution function itself, converge to those of x.


Definition 6.8. Convergence in Distribution. Consider a sequence of random vectors xN, whose every element has a cumulative distribution function FxN(x), as well as a random vector x with cumulative distribution function Fx(x). The random sequence xN is said to converge in distribution to x if:

lim_{N→∞} |FxN(x) − Fx(x)| = 0

at all continuity points x ∈ X in the support of x. This is usually expressed with the following formalism.

xN →d x

Definition 6.9. Limiting Distribution. If xN →d x, that is, some random sequence xN converges in distribution to a random vector x, then Fx(x) is said to be the limiting distribution of xN.
Intuitively, convergence in distribution requires that the probability den-
sity of xN tends to become identical to that of x everywhere in the support
of the random vectors in question. This is a stronger condition than conver-
gence in probability to a random vector, which more simply requires that
the random sequence xN and its limit x are “very likely to produce close
observations” as the index of the sequence grows very large.
Some important examples of convergence in distribution are intimately
related to certain relationships between probability distributions, where one
distribution is identified as the “limit case” of another distribution when
some parameter that characterizes the latter tends to a specific limit value.
These relationships not only serve as excellent illustrations of convergence in
distribution, but are also important by themselves. They are illustrated as
follows in the form of observations about common probability distributions.
Observation 6.1. Asymptotics of Student's t-distribution. Consider a random variable that follows the Student's t-distribution with parameter ν, X ∼ T(ν). As ν → ∞, the probability distribution of X tends to that of the standard normal distribution, i.e. limν→∞ X = Z ∼ N(0, 1).

Proof. Taking the limit of the probability density function of the Student's t-distribution as ν → ∞:

lim_{ν→∞} [1/(B(1/2, ν/2) √ν)] (1 + x^2/ν)^{−(ν+1)/2} = (1/√(2π)) exp(−x^2/2)

as lim_{ν→∞} √ν B(1/2, ν/2) = √(2π) by the properties of the Beta function; while by more standard arguments, lim_{ν→∞} (1 + x^2/ν)^{−(ν+1)/2} = exp(−x^2/2).


This observation substantiates the claim, already put forward in Lecture 2, that the Student's t-distribution becomes increasingly more similar to the standard normal distribution as ν increases. It is useful to report again the graphical intuition, similarly to Figure 2.12. In Figure 6.2 below, however, the result is instead represented in terms of cumulative distributions.

[Plot of the cumulative distribution FX(x) for ν = 1, ν = 3, and the limit ν → ∞, over the range −5 ≤ x ≤ 5.]

Figure 6.2: Convergence of the Student’s t-distribution as ν → ∞

Example 6.4. Convergence in distribution of the t-statistic. Consider a random sample drawn from some normally distributed random variable X ∼ N(µ, σ^2). By the arguments advanced in Lecture 4, the random sequence of t-statistics:

tN = √N (X̄N − µ)/SN ∼ T(N−1)

follows the Student's t-distribution with degrees of freedom given by N − 1. Therefore, by Observation 6.1 it follows that:

tN →d N(0, 1)

that is, the sequence tN converges in distribution to a random variable that follows the standard normal distribution (note in the expression above that the notation of the limiting distribution is represented on the right-hand side; this is conventional). Again, the intuition is developed in Figure 6.2. An implication of this result is that all tests of hypotheses – as well as all interval estimators – that are based on the Student's t-distribution become increasingly similar, as the sample size increases, to those that are based on the standard normal distribution. In fact, the difference becomes negligible already for N > 20, which motivates the ubiquitous use of the critical values derived from the standard normal in applied statistical analysis.
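The practical content of this example can be checked by simulation. The hypothetical sketch below (Python with NumPy) estimates the 97.5% quantile of T(ν) from Monte Carlo draws for increasing ν, and shows it approaching the standard normal critical value 1.96:

```python
import numpy as np

rng = np.random.default_rng(5)
DRAWS = 2_000_000  # Monte Carlo draws per distribution

# Empirical 97.5% quantile of T(nu) for increasing degrees of freedom.
q975 = {nu: np.quantile(rng.standard_t(df=nu, size=DRAWS), 0.975)
        for nu in (3, 30, 300)}
# The exact quantiles are about 3.18, 2.04 and 1.97, approaching 1.96.
```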


Observation 6.2. Asymptotics of Snedecor's F-distribution. Consider a random variable that follows Snedecor's F-distribution with parameters ν1 and ν2, X ∼ F(ν1, ν2). As ν2 → ∞, the probability distribution of W = ν1 X tends to that of a chi-squared distribution with parameter ν1, i.e. limν2→∞ ν1 X = W ∼ χ^2(ν1).

Proof. It is easy to derive the probability density function fW(w) of the transformation W = ν1 X. Taking its limit as ν2 → ∞ gives:

lim_{ν2→∞} fW(w) = lim_{ν2→∞} [1/B(ν1/2, ν2/2)] (1/ν2)^{ν1/2} w^{ν1/2 − 1} (1 + w/ν2)^{−(ν1+ν2)/2}
= lim_{ν2→∞} [Γ((ν1+ν2)/2)/(Γ(ν2/2) Γ(ν1/2))] (1/(ν2 + w))^{ν1/2} w^{ν1/2 − 1} (1 + w/ν2)^{−ν2/2}
= [1/(Γ(ν1/2) · 2^{ν1/2})] w^{ν1/2 − 1} exp(−w/2)

where:

lim_{ν2→∞} [Γ((ν1+ν2)/2)/Γ(ν2/2)] (1/(ν2 + w))^{ν1/2} = 2^{−ν1/2}

follows by the properties of the Gamma function.
Example 6.5. Convergence in distribution of Hotelling's t-squared statistic. Recall the formulation of Hotelling's rescaled t-squared statistic for a given K, and express it as a random sequence.

[(N − K)/(K(N − 1))] t^2_N = [(N − K) N/(K(N − 1))] (x̄ − µ)^T S^{−1} (x̄ − µ) ∼ F(K, N−K)

For a given N, this statistic follows the F-distribution with paired degrees of freedom K and N − K. By Observation 6.2, however, as the sample size N grows large one obtains the following result.

t^2_N →d χ^2(K)

In words, Hotelling's t-squared statistic (non-rescaled) converges in distribution to a chi-squared distribution with K degrees of freedom.4 Similarly as in the univariate case, this has important implications for multivariate tests of hypothesis, say about the means of a multivariate normal distribution. As the sample grows large, these can be based on the relatively simple chi-squared distribution, instead of the more involved F-distribution.
4
The rescaling factor is removed for two reasons. First, to apply Observation 6.2 one
should multiply the sequence by ν1 = K. Second, as N → ∞ the term (N − K) / (N − 1)
becomes irrelevant, and the asymptotic result holds irrespectively of it.
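Observation 6.2 and Example 6.5 can likewise be verified by simulation. The hypothetical sketch below (Python with NumPy, illustrative parameter values) shows that draws of ν1·F(ν1, ν2) match the χ^2(ν1) distribution increasingly well as ν2 grows, here through the first moment:

```python
import numpy as np

rng = np.random.default_rng(11)
K = 3            # plays the role of nu1 (and of K in Example 6.5)
DRAWS = 1_000_000

def mean_of_scaled_f(nu2):
    # nu1 * F(nu1, nu2) has mean nu1 * nu2 / (nu2 - 2), which tends to nu1.
    return K * rng.f(dfnum=K, dfden=nu2, size=DRAWS).mean()

m_small, m_large = mean_of_scaled_f(5), mean_of_scaled_f(5000)
chi2_mean = rng.chisquare(df=K, size=DRAWS).mean()  # equals K = 3 in theory
```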


Observation 6.3. Asymptotics of the Gamma distribution. Consider a random variable that follows the Gamma distribution with parameters α and β, X ∼ Γ(α, β). Let µ = α/β and σ^2 = α/β^2. As α → ∞, the probability distribution of X tends to that of a normal distribution with parameters µ and σ^2, i.e. limα→∞ X ∼ N(µ, σ^2).

Proof. Unlike the previous two observations, this one can be proved via the moment-generating function (which both the Student's t-distribution and the F-distribution lack), and more handily so. Define the standardized random variable Z = (X − µ)/σ = (β/√α) X − √α; by the properties of moment-generating functions, and recalling (2.88), it is:

MZ(t) = exp(−√α t) · MX((β/√α) t) = exp(−√α t) (1 − t/√α)^{−α}

and after some manipulation, the limit as α → ∞ gives:

lim_{α→∞} MZ(t) = lim_{α→∞} exp(−√α t) (1 − t/√α)^{−α} = exp(t^2/2)

showing that at the limit, Z ∼ N(0, 1) and therefore X ∼ N(µ, σ^2).
Example 6.6. Convergence in distribution of the sample mean X̄ drawn from the exponential distribution. As elaborated in previous lectures, in a random sample drawn from a random variable X ∼ Exp(λ) the sample mean follows the Gamma distribution, X̄ ∼ Γ(N, N/λ). Thus, by Observation 6.3:

√N (X̄N − λ) →d N(0, λ^2)

a statement interpreted in the sense that for a fixed value of N:

X̄N ∼A N(λ, λ^2/N)

where A stands for "approximation" (by definition, the sequence index N cannot appear in the formulation of the limiting distribution, therefore the former expression is a more rigorous characterization of convergence in distribution). This result allows the use of the normal distribution in statistical tests about the exponential distribution. As discussed later at length, this result is more general and goes by the name of Central Limit Theorem; what is interesting about the exponential setting is that the exact distribution of the sample mean is known to be the Gamma, and by Observation 6.3, this fact is reconciled with the Central Limit Theorem.
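Because the exact distribution of X̄N is known here, the normal approximation can be checked directly. The hypothetical sketch below (Python with NumPy) fixes λ = 2 and N = 400, draws from the exact Γ(N, N/λ) law of the sample mean (shape N, scale λ/N), and compares it with the asymptotic N(λ, λ^2/N):

```python
import numpy as np

rng = np.random.default_rng(13)
lam = 2.0       # mean of the exponential distribution (illustrative value)
N = 400         # fixed sample size for the approximation
DRAWS = 500_000

# Exact distribution of the sample mean: Gamma with shape N and scale lam/N.
xbar = rng.gamma(shape=N, scale=lam / N, size=DRAWS)

mean_err = abs(xbar.mean() - lam)            # should be near 0
sd_ratio = xbar.std() / (lam / np.sqrt(N))   # should be near 1
# Fraction of draws within 1.96 asymptotic standard errors of lam:
coverage = np.mean(np.abs(xbar - lam) <= 1.96 * lam / np.sqrt(N))
```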


The following results about convergence in distribution are necessary to derive the asymptotic properties of many econometric estimators.

Theorem 6.10. Continuous Mapping Theorem (continued). Under the hypotheses of Theorem 6.4:

xN →d x ⇒ g(xN) →d g(x)

that is, a random sequence which is obtained from the application of a transformation g(·) to some original random sequence xN converges in distribution to the distribution resulting from applying the transformation g(·) to the random vector x associated with the limiting distribution of xN.

The proof of this statement is omitted as it involves some advanced measure theory. The continuous mapping theorem for convergence in distribution is an important result, as it allows one to prove the following properties of random sequences, which are heavily exploited in statistics and econometrics.
Theorem 6.11. Slutskij's Theorem. Consider any two (scalar) random sequences XN and YN such that:

XN →d X
YN →p c

that is, XN converges in distribution to that of the random variable X, while YN converges in probability to a constant c. Then, the following holds.

(XN + YN) →d X + c
XN YN →d cX
XN/YN →d X/c if c ≠ 0

Proof. It is enough to recognize that, as YN →p c, YN has a degenerate limiting distribution, and thus the (vector) random sequence (XN, YN) converges in distribution to that of the random vector (X, c). The results above then follow from the application of the Continuous Mapping Theorem to the three given continuous functions of XN and YN.
Corollary. Cramér-Wold Device. Given a random sequence xN and a constant vector a of the same dimension:

xN →d x ⇒ a^T xN →d a^T x

that is, if a vectorial random sequence has a limiting distribution, any linear combination of its elements will converge in distribution to the distribution of the corresponding "limiting" linear combination.


Before moving to the extensive treatment of the Central Limit Theorem, this section is concluded with the analysis of another important result about convergence in distribution. This result is foundational for an entire branch of statistics called Extreme Value Theory, which concerns the analysis of extreme order statistics (maxima and minima).

Theorem 6.12. Extreme Value Theorem. This result is also called the Fisher-Tippett-Gnedenko Theorem after the names of its discoverers. It states that, given a random (i.i.d.) sample (X1, . . . , XN), if a convergence in distribution result of the kind

(X(N) − bN)/aN →d W

can be established – where X(N) is the maximum order statistic while aN > 0 and bN are sequences of real constants – then:

W ∼ GEV(0, 1, ξ)

for some real ξ; that is, the limiting distribution of the normalized maximum is some standardized type of the Generalized Extreme Value distribution.
Proof. (Outline.) The objective of the proof is to show that, given a random variable X from which the random sample is drawn, for all the points x ∈ X in its support where the distribution FX(x) is continuous:

lim_{N→∞} [FX(aN x − bN)]^N = exp(−(1 + ξx)^{−1/ξ})

where the left-hand side is the limit of the cumulative distribution of the standardized maximum, while the right-hand side is the expression of the cumulative standardized GEV distribution. By taking the logarithm of this expression, the above is:

lim_{N→∞} N log FX(aN x − bN) = −(1 + ξx)^{−1/ξ}

showing that FX(aN x − bN) → 1 as N → ∞. Since −log(x) ≈ 1 − x for any x close to 1, the above expression approximates the following.

lim_{N→∞} 1/(N [1 − FX(aN x − bN)]) = (1 + ξx)^{1/ξ}

The rest of the proof is mathematically involved, and it proceeds to i. show that the right-hand side of the above expression is the only admissible limit and ii. establish conditions under which ξ = 0 (Type I GEV, Gumbel), ξ > 0 (Type II GEV, Fréchet) and ξ < 0 (Type III GEV, reverse Weibull), where ξ = 0 is interpreted as a limit case (see Lecture 2).


While Extreme Value Theory is outside the scope of this discussion, it is worth briefly commenting on some implications of the Fisher-Tippett-Gnedenko Theorem.

1. First, the Theorem does not state that a standardized maximum always converges to a GEV distribution; it states that if it converges, the limiting distribution is GEV. In this respect, the Theorem differs from other results such as the Central Limit Theorem.

2. The implications of this result are not restricted to the maximum, but extend to the minimum too. By defining Y = −X, for every N it clearly is Y(1) = −X(N), which helps identify the distribution of the minimum if that of the maximum is known (think for instance about the relationship between the reverse Weibull and the "traditional" Weibull distribution). This explains why the name of the theorem references "extreme values" and not just maxima.

3. As mentioned, the proof of the Theorem sets conditions that allow one to identify which Type of GEV distribution is a possible limiting distribution of the maximum, by inspecting the cumulative distribution FX(x) that generates the data. These conditions are quite technical, but some of their implications are quite useful. For example, it is known that the limiting distribution of the maximum associated with a random sample drawn from the normal distribution is the Gumbel distribution.

In econometrics, the Extreme Value Theorem is invoked as the motivation behind specific assumptions made in certain models of decision-making, where the random component of choice is assumed to follow a GEV distribution. In fact, a GEV distribution is a natural choice to model the maximum value between multiple options that are considered by a decision-maker.
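A quick simulation makes the theorem concrete. The hypothetical sketch below (Python with NumPy) uses a standard exponential sample, for which the maximum centered by bN = log N (with aN = 1) converges to the standard Gumbel (Type I GEV) distribution; the Gumbel mean is the Euler–Mascheroni constant γ ≈ 0.577, and its CDF at zero is exp(−1) ≈ 0.368:

```python
import numpy as np

rng = np.random.default_rng(17)
N = 10_000      # size of each underlying exponential sample
REPS = 200_000  # number of simulated maxima

# The maximum of N i.i.d. Exp(1) draws has CDF (1 - exp(-x))^N, so it can be
# sampled exactly by inversion: M = -log(1 - U**(1/N)) for U uniform on (0, 1).
u = rng.random(REPS)
maxima = -np.log(1.0 - u ** (1.0 / N))

centered = maxima - np.log(N)            # a_N = 1, b_N = log N here
emp_mean = centered.mean()               # Gumbel mean: gamma = 0.5772...
emp_cdf_at_0 = np.mean(centered <= 0.0)  # Gumbel CDF at 0: exp(-1) = 0.3679...
```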

6.4 Central Limit Theorems


Convergence in distribution has little practical content if one does not know,
or is not able to derive, the limiting distribution of some random sequence of
interest. Nevertheless, in a specific but fundamental situation the limiting
distribution is typically known: it is the case of generalized sample means.
In fact, thanks to a set of results known as the Central Limit Theorems, it
is possible to establish that the limiting distribution of a sample mean is a
normal distribution, regardless of the original distribution from which the data originate. It is because of this last point that these results are so
fundamental in statistics and econometrics.


Theorem 6.13. Central Limit Theorem (Lindeberg and Lévy's). The sample mean x̄N = (1/N) Σ_{i=1}^N xi associated with a random (i.i.d.) sample drawn from the distribution of a random vector x with mean and variance that are both finite, E[x] < ∞ and Var[x] < ∞, is such that the random sequence defined as the centered sample mean multiplied by √N converges in distribution to a multivariate normal distribution.

√N ((1/N) Σ_{i=1}^N xi − E[x]) →d N(0, Var[x])
Proof. (Sketched.) Consider the standardized random vector:
\[
  \mathbf{z} = [\mathrm{Var}[\mathbf{x}]]^{-\frac{1}{2}} (\mathbf{x} - \mathrm{E}[\mathbf{x}])
\]
where the matrix [Var [x]]^(−1/2) and its inverse [Var [x]]^(1/2) satisfy the following.
\[
  [\mathrm{Var}[\mathbf{x}]]^{-\frac{1}{2}} \, \mathrm{Var}[\mathbf{x}] \, [\mathrm{Var}[\mathbf{x}]]^{-\frac{1}{2}} = \mathbf{I}
  \qquad
  [\mathrm{Var}[\mathbf{x}]]^{\frac{1}{2}} [\mathrm{Var}[\mathbf{x}]]^{\frac{1}{2}} = \mathrm{Var}[\mathbf{x}]
\]
Such a matrix can always be constructed because variance-covariance ma-
trices are positive semi-definite. The objective of the proof is to show that:
\[
  \bar{\mathbf{z}}_N \equiv \frac{1}{\sqrt{N}} \sum_{i=1}^{N} \mathbf{z}_i \xrightarrow{d} \mathcal{N}(\mathbf{0}, \mathbf{I})
\]
that is, the random sequence z̄_N defined above converges in distribution to
a standard multivariate normal distribution. If this is true, the main result
also follows by the linear properties of the multivariate normal distribution
after recognizing the following relationship between random sequences.
\[
  \sqrt{N} (\bar{\mathbf{x}}_N - \mathrm{E}[\mathbf{x}]) = \sqrt{N} \left( \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i - \mathrm{E}[\mathbf{x}] \right) = [\mathrm{Var}[\mathbf{x}]]^{\frac{1}{2}} \left( \frac{1}{\sqrt{N}} \sum_{i=1}^{N} \mathbf{z}_i \right)
\]
To show this, suppose that a moment-generating function of z exists. If
so, one can express the moment generating function of z̄_N, for fixed N, as:
\[
\begin{aligned}
  M_{\bar{\mathbf{z}}_N}(\mathbf{t}) &= \mathrm{E}\left[ \exp\left( \mathbf{t}^{\mathsf{T}} \bar{\mathbf{z}}_N \right) \right] \\
  &= \mathrm{E}\left[ \exp\left( \frac{1}{\sqrt{N}} \sum_{i=1}^{N} \mathbf{t}^{\mathsf{T}} \mathbf{z}_i \right) \right] \\
  &= \prod_{i=1}^{N} \mathrm{E}\left[ \exp\left( \frac{1}{\sqrt{N}} \mathbf{t}^{\mathsf{T}} \mathbf{z} \right) \right] \\
  &= \left[ M_{\mathbf{z}}\left( \frac{1}{\sqrt{N}} \mathbf{t} \right) \right]^{N}
\end{aligned}
\]


by a derivation analogous to the one in the proof of the Weak Law of Large
Numbers (Theorem 6.5). As in that proof, apply a Taylor expansion of the
above expression around t₀ = 0, but account for the second order element:
\[
\begin{aligned}
  M_{\bar{\mathbf{z}}_N}(\mathbf{t}) &= \left[ 1 + \frac{\mathbf{t}^{\mathsf{T}} \mathrm{E}[\mathbf{z}]}{\sqrt{N}} + \frac{\mathbf{t}^{\mathsf{T}} \mathrm{E}\left[\mathbf{z}\mathbf{z}^{\mathsf{T}}\right] \mathbf{t}}{2N} + o\left( \frac{\mathbf{t}^{\mathsf{T}} \mathbf{t}}{2N} \right) \right]^{N} \\
  &= \left[ 1 + \frac{\mathbf{t}^{\mathsf{T}} \mathbf{t}}{2N} + o\left( \frac{\mathbf{t}^{\mathsf{T}} \mathbf{t}}{2N} \right) \right]^{N}
\end{aligned}
\]
where the second line exploits the fact that E [z] = 0 and that E [zz^T] = I
by construction of z. Clearly, taking the limit of the above expression for
N → ∞ gives:
\[
  \lim_{N \to \infty} M_{\bar{\mathbf{z}}_N}(\mathbf{t}) = \exp\left( \frac{\mathbf{t}^{\mathsf{T}} \mathbf{t}}{2} \right)
\]
and this is nothing else but the moment-generating function of the standard
multivariate normal, as postulated. Should z lack a moment-generating
function, a similar derivation that leverages the characteristic function
ϕ_{z̄_N}(t) applies instead.
How is a Central Limit Theorem actually useful in practice? The result
is to be interpreted in the sense that for some specific value of N, the sample
mean is “approximately” normally distributed with a variance-covariance
matrix which is decreasing in the sample size:
\[
  \bar{\mathbf{x}}_N = \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i \overset{A}{\sim} \mathcal{N}\left( \mathrm{E}[\mathbf{x}], \frac{1}{N} \mathrm{Var}[\mathbf{x}] \right)
\]
where the notation ∼^A indicates that the normal distribution in question,
called the asymptotic distribution, is approximate and is valid for a fixed
N, instead of being a “limiting” distribution (recall from the discussion of
Example 6.6 that a limiting distribution cannot be expressed in terms of N).
To illustrate, Figure 6.3 plots the empirical distribution of the standardized
sample means obtained with the same simulation of samples drawn from
the Poisson distribution as in Figure 6.1. The standardization implies that
the limiting distribution is the standard normal.5 Despite the complications
entailed in the representation of a distribution via histograms, the simulation
highlights how the limiting distribution is approximated increasingly well
as the sample size increases.
⁵For intuition, it is as if the univariate version of the sequence z̄_N, call it say Z̄_N, is
plotted in Figure 6.1.


[Figure: four histogram panels, one each for N = 1, N = 10, N = 100, and N = 1000.]
Note: histograms of realizations of √N (X̄_N − 4)/2 obtained from multiple i.i.d. samples drawn from
X ∼ Pois (4). Each histogram is obtained with 800 samples of the indicated size N. All the realizations
are binned on the x-axes with bins of length 0.25. For all histograms, the y-axes measure the density of
their bins. Density functions of the standard normal distribution are superimposed upon each histogram.

Figure 6.3: Simulation of the Central Limit Theorem for X ∼ Pois (4)
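A simulation along the lines of Figure 6.3 can be sketched in a few lines of code. The number of replications (800), the Poisson mean (4) and the sample sizes follow the note to the figure; the generator, seed and all other implementation details are illustrative assumptions of this sketch, not taken from the text.

```python
# Sketch of the simulation behind Figure 6.3. For X ~ Pois(4), E[X] = Var[X] = 4,
# so the standardized sample mean is sqrt(N) * (x_bar - 4) / 2.
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, assumed for reproducibility

def standardized_means(N, reps=800, lam=4.0):
    draws = rng.poisson(lam, size=(reps, N))
    # one realization of sqrt(N) * (sample mean - E[X]) / sd(X) per replication
    return np.sqrt(N) * (draws.mean(axis=1) - lam) / np.sqrt(lam)

for N in (1, 10, 100, 1000):
    z = standardized_means(N)
    print(f"N = {N:4d}: mean = {z.mean():+.2f}, sd = {z.std():.2f}")
```

As N grows, the empirical mean and standard deviation of the standardized sample means approach 0 and 1, consistent with the standard normal densities superimposed in the figure.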

Some more general formulations of the (multivariate) Central Limit The-


orem follow next. Their proofs, however, are not presented. These versions
are especially important as they extend the main result to samples whose
observations are possibly not identically distributed (i.n.i.d.). As such, it is
useful to become familiar with their statements and hypotheses since the
asymptotic properties of many estimators, especially in econometrics, are based
on them. As in the case of the Laws of Large Numbers, more Central Limit
Theorems exist which additionally allow dependent observations (n.i.n.i.d.
samples), but these are outside the scope of this discussion. In economet-
rics, these are invoked in order to motivate the use of covariance estimators
such as the cluster-robust and HAC estimators.


Theorem 6.14. Central Limit Theorem (Lindeberg and Feller’s).
Consider a non-random (i.n.i.d.) sample where the random vectors x_i that
generate it have possibly heterogeneous finite means E [x_i] < ∞, variances
Var [x_i] < ∞, and all mixed third moments are finite too. If:
\[
  \lim_{N \to \infty} \left( \sum_{i=1}^{N} \mathrm{Var}[\mathbf{x}_i] \right)^{-1} \mathrm{Var}[\mathbf{x}_i] = \mathbf{0}
\]
then it holds that:
\[
  \frac{1}{\sqrt{N}} \sum_{i=1}^{N} \left( \mathbf{x}_i - \mathrm{E}[\mathbf{x}_i] \right) \xrightarrow{d} \mathcal{N}\left( \mathbf{0}, \mathrm{Var}[\mathbf{x}] \right)
\]
where \( \frac{1}{N} \sum_{i=1}^{N} \mathrm{Var}[\mathbf{x}_i] \xrightarrow{p} \mathrm{Var}[\mathbf{x}] \), that is, the positive semi-definite matrix
Var [x] is the probability limit of the average of the observations’ variances.
Theorem 6.15. Central Limit Theorem (Ljapunov’s). Consider a
non-random (i.n.i.d.) sample where the random vectors x_i that generate it
have possibly heterogeneous finite moments E [x_i] < ∞ and Var [x_i] < ∞.
If:
\[
  \lim_{N \to \infty} \left( \sum_{i=1}^{N} \mathrm{Var}[\mathbf{x}_i] \right)^{-\left(1 + \frac{\delta}{2}\right)} \sum_{i=1}^{N} \mathrm{E}\left[ \left| \mathbf{x}_i - \mathrm{E}[\mathbf{x}_i] \right|^{2+\delta} \right] = \mathbf{0}
\]
for some δ > 0, then:
\[
  \frac{1}{\sqrt{N}} \sum_{i=1}^{N} \left( \mathbf{x}_i - \mathrm{E}[\mathbf{x}_i] \right) \xrightarrow{d} \mathcal{N}\left( \mathbf{0}, \mathrm{Var}[\mathbf{x}] \right)
\]
where Var [x] is the probability limit of the variances as in Theorem 6.14.
Ljapunov’s version of the Central Limit Theorem establishes that the
asymptotic normality result also holds with non-identically distributed data.
In econometrics, this is of particular importance since it allows observations
to be drawn from different distributions with heteroscedastic disturbances.
Note that with respect to the classical Central Limit Theorem by Lindeberg
and Lévy, Ljapunov’s version only bears the additional requirements that
variances be finite and that some absolute moment of order higher than two
exists but is asymptotically dominated by the variances. The latter condi-
tion appears similar to that from Markov’s Law of Large Numbers and, like
that, convoluted; however, in most econometric applications E [xi ] = 0 for
all i = 1, . . . , N , and therefore that assumption specializes to:
\[
  \mathrm{E}\left[ \left| X_{ik} X_{i\ell} \right|^{1+\delta} \right] < \infty \qquad (6.1)
\]


for any two elements k, ℓ = 1, . . . , K of the random vector x and for all
observations i. Under the hypothesis of independent observations, the
asymptotic properties of most econometric estimators are obtained by invoking
Ljapunov’s Central Limit Theorem, hence conditions akin to (6.1) are rou-
tinely invoked and they are referred to as the “Ljapunov conditions.”
Example 6.7. Asymptotic normality of the linear regression esti-
mator. Let us return once again to the Method of Moments estimator of
the bivariate linear regression slope from example 6.3. Rewrite it as:
\[
\begin{aligned}
  \hat{\beta}_{1,MM} &= \frac{\sum_{i=1}^{N} (X_i - \bar{X}) Y_i}{\sum_{i=1}^{N} (X_i - \bar{X})^2} \\
  &= \frac{\sum_{i=1}^{N} (X_i - \bar{X})(\beta_0 + \beta_1 X_i)}{\sum_{i=1}^{N} (X_i - \bar{X})^2} + \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \beta_0 - \beta_1 X_i)}{\sum_{i=1}^{N} (X_i - \bar{X})^2} \\
  &= \beta_1 \frac{\sum_{i=1}^{N} (X_i - \bar{X}) X_i}{\sum_{i=1}^{N} (X_i - \bar{X})^2} + \frac{\sum_{i=1}^{N} (X_i - \bar{X}) \varepsilon_i}{\sum_{i=1}^{N} (X_i - \bar{X})^2} \\
  &= \beta_1 + \frac{\frac{1}{N} \sum_{i=1}^{N} (X_i - \bar{X}) \varepsilon_i}{\frac{1}{N} \sum_{i=1}^{N} (X_i - \bar{X})^2}
\end{aligned}
\]
where
\[
  \varepsilon_i \equiv Y_i - \beta_0 - \beta_1 X_i
\]
is the so-called error term of the regression model – that is, the devia-
tion that occurs between Y_i and the linear conditional expectation function
E [Y_i | X_i] = β_0 + β_1 X_i. The error term can be interpreted as a transformed
random variable defined as a linear combination of the “primitive” random
variables Y_i and X_i. Note that E [ε_i] = 0 by the hypotheses on β_0.
Recall that in the bivariate linear regression model, the Law of Iterated
Expectations implies E [Xi εi ] = 0. This observation provides another av-
enue for showing consistency of the MM estimator of the regression slope.
In fact, by the Continuous Mapping Theorem:
\[
  \frac{1}{N} \sum_{i=1}^{N} (X_i - \bar{X}) \varepsilon_i \xrightarrow{p} \underbrace{\mathrm{E}[X_i \varepsilon_i]}_{=0} - \underbrace{\mathrm{E}[X_i] \mathrm{E}[\varepsilon_i]}_{=0} = 0
\]
implying β̂_{1,MM} →p β_1. Furthermore, since the expression on the left-hand
side is a sample mean, under the proper assumptions about the sample an
applicable Central Limit Theorem implies the following.
\[
  \frac{1}{\sqrt{N}} \sum_{i=1}^{N} (X_i - \bar{X}) \varepsilon_i \xrightarrow{d} \mathcal{N}\left( 0, \mathrm{E}\left[ \varepsilon_i^2 (X_i - \mathrm{E}[X_i])^2 \right] \right) \qquad (6.2)
\]


In (6.2) the limiting variance takes the stated form because X̄ →p E [X_i]
at the probability limit. The limiting variance obtains as:
\[
  \mathrm{Var}\left[ \frac{1}{\sqrt{N}} \sum_{i=1}^{N} (X_i - \mathrm{E}[X_i]) \varepsilon_i \right]
  = \frac{1}{N} \sum_{i=1}^{N} \mathrm{Var}\left[ (X_i - \mathrm{E}[X_i]) \varepsilon_i \right]
  = \mathrm{E}\left[ \varepsilon_i^2 (X_i - \mathrm{E}[X_i])^2 \right]
\]
while in the even more specialized case where the squared deviations of X_i
and ε_i from their respective means are mutually independent, it is:
\[
  \mathrm{E}\left[ \varepsilon_i^2 (X_i - \mathrm{E}[X_i])^2 \right] = \mathrm{E}\left[ \varepsilon_i^2 \right] \mathrm{E}\left[ (X_i - \mathrm{E}[X_i])^2 \right] = \sigma_{\varepsilon}^2 \cdot \mathrm{Var}[X_i]
\]
where σ²_ε ≡ E [ε²_i]. This latter case is the one where the conditional variance
function of ε_i given X_i is actually a constant – a scenario commonly called
homoscedasticity (as opposed to heteroscedasticity, the general case).
The expression in (6.2), the above decomposition of the MM estimator
of the bivariate linear regression slope, the Cramér-Wold device, as well as
the following implication of the Continuous Mapping Theorem:
\[
  \left[ \frac{1}{N} \sum_{i=1}^{N} (X_i - \bar{X})^2 \right]^{-1} \xrightarrow{p} \left[ \mathrm{Var}[X_i] \right]^{-1}
\]
all imply that the limiting distribution of the MM estimator is:
\[
  \sqrt{N} \left( \hat{\beta}_{1,MM} - \beta_1 \right) \xrightarrow{d} \mathcal{N}\left( 0, \frac{\mathrm{E}\left[ \varepsilon_i^2 (X_i - \mathrm{E}[X_i])^2 \right]}{\left( \mathrm{Var}[X_i] \right)^2} \right) \qquad (6.3)
\]
and for some given N, its asymptotic distribution is as follows.
\[
  \hat{\beta}_{1,MM} \overset{A}{\sim} \mathcal{N}\left( \beta_1, \frac{1}{N} \frac{\mathrm{E}\left[ \varepsilon_i^2 (X_i - \mathrm{E}[X_i])^2 \right]}{\left( \mathrm{Var}[X_i] \right)^2} \right) \qquad (6.4)
\]
In the more specialized “homoscedastic” case, the limiting distribution is:
\[
  \sqrt{N} \left( \hat{\beta}_{1,MM} - \beta_1 \right) \xrightarrow{d} \mathcal{N}\left( 0, \frac{\sigma_{\varepsilon}^2}{\mathrm{Var}[X_i]} \right) \qquad (6.5)
\]
while the asymptotic distribution is derived consequently.
\[
  \hat{\beta}_{1,MM} \overset{A}{\sim} \mathcal{N}\left( \beta_1, \frac{1}{N} \frac{\sigma_{\varepsilon}^2}{\mathrm{Var}[X_i]} \right) \qquad (6.6)
\]
A proper econometric treatment of the linear regression model would discuss
the multivariate generalization of these expressions, while additionally intro-
ducing the appropriate estimators for the unknown variances (or variance-
covariances) of these distributions – that is, estimators of the asymptotic
variances in (6.4) and (6.6). 
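A quick Monte Carlo sketch makes the homoscedastic result in (6.5) concrete: simulate many samples, compute the MM slope in each, and compare the dispersion of √N(β̂₁,MM − β₁) with the theoretical standard deviation σ_ε/√Var[X_i]. All numerical choices below (β₀ = 1, β₁ = 2, σ_ε = 0.5, standard normal X_i, the seed) are illustrative assumptions, not taken from the text.

```python
# Hypothetical Monte Carlo check of the homoscedastic limiting variance in
# (6.5): sqrt(N) * (b1_hat - beta1) should have variance sigma_eps^2 / Var[X].
import numpy as np

rng = np.random.default_rng(1)
beta0, beta1, sigma_eps = 1.0, 2.0, 0.5
N, reps = 500, 2000

slopes = np.empty(reps)
for r in range(reps):
    x = rng.normal(0.0, 1.0, N)                  # Var[X] = 1
    eps = rng.normal(0.0, sigma_eps, N)          # homoscedastic errors
    y = beta0 + beta1 * x + eps
    xd = x - x.mean()
    slopes[r] = (xd * y).sum() / (xd ** 2).sum() # MM / OLS slope

z = np.sqrt(N) * (slopes - beta1)
print(f"mean = {z.mean():+.3f}, sd = {z.std():.3f}  (theory: 0 and {sigma_eps})")
```

The empirical standard deviation of the scaled, centered slopes should be close to σ_ε = 0.5, the square root of the limiting variance σ²_ε/Var[X_i].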


The next result is instrumental for the analysis and derivation of the
asymptotic properties of many estimators.
Theorem 6.16. Delta Method. Suppose that some random sequence of
dimension K, x_N, is asymptotically normal:
\[
  \sqrt{N} \left( \mathbf{x}_N - \mathbf{c} \right) \xrightarrow{d} \mathcal{N}\left( \mathbf{0}, \boldsymbol{\Upsilon} \right) \qquad (6.7)
\]
for some K × 1 vector c and some K × K matrix Υ. In addition, consider
some vector-valued function d (x) : ℝ^K → ℝ^J. If the latter is continuously
differentiable at c and the J × K Jacobian matrix
\[
  \boldsymbol{\Delta} \equiv \frac{\partial}{\partial \mathbf{x}^{\mathsf{T}}} \mathbf{d}(\mathbf{c})
\]
has full row rank J, the limiting distribution of d (x_N) is as follows.
\[
  \sqrt{N} \left( \mathbf{d}(\mathbf{x}_N) - \mathbf{d}(\mathbf{c}) \right) \xrightarrow{d} \mathcal{N}\left( \mathbf{0}, \boldsymbol{\Delta} \boldsymbol{\Upsilon} \boldsymbol{\Delta}^{\mathsf{T}} \right) \qquad (6.8)
\]


Proof. From the mean value theorem:
\[
  \mathbf{d}(\mathbf{x}_N) = \mathbf{d}(\mathbf{c}) + \frac{\partial}{\partial \mathbf{x}^{\mathsf{T}}} \mathbf{d}(\tilde{\mathbf{x}}_N) \left( \mathbf{x}_N - \mathbf{c} \right)
\]
where x̃_N is a convex combination of x_N and c. However, as x_N →p c:
\[
  \frac{\partial}{\partial \mathbf{x}^{\mathsf{T}}} \mathbf{d}(\tilde{\mathbf{x}}_N) \xrightarrow{p} \frac{\partial}{\partial \mathbf{x}^{\mathsf{T}}} \mathbf{d}(\mathbf{c}) = \boldsymbol{\Delta}
\]
hence, at the probability limit:
\[
  \sqrt{N} \left( \mathbf{d}(\mathbf{x}_N) - \mathbf{d}(\mathbf{c}) \right) \xrightarrow{p} \boldsymbol{\Delta} \cdot \sqrt{N} \left( \mathbf{x}_N - \mathbf{c} \right)
\]
which, together with (6.7), implies (6.8).
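As a concrete univariate illustration (my own example, not from the text): for i.i.d. exponential draws with mean c = 2 one has Var[X] = c² = 4, so the CLT gives √N(x̄_N − 2) →d N(0, 4); applying the Delta Method with d(x) = log x, whose Jacobian at c is 1/c = 1/2, yields √N(log x̄_N − log 2) →d N(0, (1/2)² · 4) = N(0, 1). A quick simulation confirms the unit limiting variance.

```python
# Hypothetical Delta Method check: X ~ exponential with mean c = 2, so
# Var[X] = c^2 = 4; with d(x) = log(x) the Jacobian at c is 1/c, and the
# limiting variance of sqrt(N) * (log(x_bar) - log(c)) is (1/c)^2 * c^2 = 1.
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed, assumed for reproducibility
c, N, reps = 2.0, 2000, 3000

means = rng.exponential(c, size=(reps, N)).mean(axis=1)
z = np.sqrt(N) * (np.log(means) - np.log(c))
print(f"mean = {z.mean():+.3f}, sd = {z.std():.3f}  (theory: 0 and 1)")
```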
With these results at hand, one can demonstrate that the classes of estima-
tors introduced in Lecture 5 (Method of Moments and Maximum Likelihood
estimators) achieve asymptotic normality under quite general assumptions.
To facilitate the analysis, this is restricted to random (i.i.d.) samples.
Theorem 6.17. Asymptotically, Method of Moments estimators
are normally distributed. An estimator θ̂_MM defined as the solution of a
set of sample moments (5.2) is asymptotically normal. If the sample is ran-
dom and the moment conditions are differentiable, the limiting distribution
is:
\[
  \sqrt{N} \left( \hat{\boldsymbol{\theta}}_{MM} - \boldsymbol{\theta}_0 \right) \xrightarrow{d} \mathcal{N}\left( \mathbf{0}, \mathbf{M}_0 \boldsymbol{\Upsilon}_0 \mathbf{M}_0^{\mathsf{T}} \right)
\]
so long as the following matrices exist, are finite and nonsingular.
\[
  \boldsymbol{\Upsilon}_0 = \mathrm{Var}\left[ \mathbf{m}(\mathbf{x}_i; \boldsymbol{\theta}_0) \right]
  \qquad
  \mathbf{M}_0 \equiv \left( \mathrm{E}\left[ \frac{\partial}{\partial \boldsymbol{\theta}^{\mathsf{T}}} \mathbf{m}(\mathbf{x}_i; \boldsymbol{\theta}_0) \right] \right)^{-1}
\]

Proof. The proof applies the same logic as the Delta Method. By the mean
value theorem, the sample moment conditions are developed as:
\[
  \mathbf{0} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{m}\left( \mathbf{x}_i; \hat{\boldsymbol{\theta}}_{MM} \right)
  = \frac{1}{N} \sum_{i=1}^{N} \mathbf{m}(\mathbf{x}_i; \boldsymbol{\theta}_0)
  + \left[ \frac{1}{N} \sum_{i=1}^{N} \frac{\partial}{\partial \boldsymbol{\theta}^{\mathsf{T}}} \mathbf{m}\left( \mathbf{x}_i; \tilde{\boldsymbol{\theta}}_N \right) \right] \left( \hat{\boldsymbol{\theta}}_{MM} - \boldsymbol{\theta}_0 \right)
\]
where the left-hand side is equal to zero by construction of all Method of
Moments estimators. After multiplying both sides by √N and some manipu-
lation, the above expression is rendered as follows.
\[
  \sqrt{N} \left( \hat{\boldsymbol{\theta}}_{MM} - \boldsymbol{\theta}_0 \right)
  = - \left[ \frac{1}{N} \sum_{i=1}^{N} \frac{\partial}{\partial \boldsymbol{\theta}^{\mathsf{T}}} \mathbf{m}\left( \mathbf{x}_i; \tilde{\boldsymbol{\theta}}_N \right) \right]^{-1} \frac{1}{\sqrt{N}} \sum_{i=1}^{N} \mathbf{m}(\mathbf{x}_i; \boldsymbol{\theta}_0)
\]
Note that, since this is a random sample:

1. by a suitable Central Limit Theorem:
\[
  - \frac{1}{\sqrt{N}} \sum_{i=1}^{N} \mathbf{m}(\mathbf{x}_i; \boldsymbol{\theta}_0) \xrightarrow{d} \mathcal{N}\left( \mathbf{0}, \mathrm{Var}\left[ \mathbf{m}(\mathbf{x}_i; \boldsymbol{\theta}_0) \right] \right)
\]
since E [m (x_i; θ_0)] = 0 by hypothesis;

2. while by the Weak Law of Large Numbers:
\[
  \frac{1}{N} \sum_{i=1}^{N} \frac{\partial}{\partial \boldsymbol{\theta}^{\mathsf{T}}} \mathbf{m}\left( \mathbf{x}_i; \tilde{\boldsymbol{\theta}}_N \right) \xrightarrow{p} \mathrm{E}\left[ \frac{\partial}{\partial \boldsymbol{\theta}^{\mathsf{T}}} \mathbf{m}(\mathbf{x}_i; \boldsymbol{\theta}_0) \right]
\]
since θ̃_N →p θ_0 by consistency of the estimator (at the limit, θ̃_N, θ̂_MM
and θ_0 all coincide).

These intermediate results are combined via the Continuous Mapping
Theorem, Slutskij’s Theorem and the Cramér-Wold device so as to imply
the statement. Consequently, for a fixed N the asymptotic distribution is:
\[
  \hat{\boldsymbol{\theta}}_{MM} \overset{A}{\sim} \mathcal{N}\left( \boldsymbol{\theta}_0, \frac{1}{N} \mathbf{M}_0 \boldsymbol{\Upsilon}_0 \mathbf{M}_0^{\mathsf{T}} \right)
\]
which concludes the proof.
An analogous result holds for Maximum Likelihood estimators as well, and
the proof is almost identical. In this case, however, the result is especially
powerful, as the asymptotic variance coincides with the Cramér-Rao bound.


Theorem 6.18. Asymptotically, Maximum Likelihood estimators
are normally distributed and they attain the Cramér-Rao bound.
An estimator θ̂_MLE defined as the maximizer of a log-likelihood function
as per (5.19) is asymptotically normal. If the sample is random and some
so-called regularity conditions hold:

i. the problem is well defined, i.e. θ_0 is the maximizer of the population
expression E [log f_x (x_i; θ)] – where f_x (x_i; θ) is the probability mass
or density function that generates the data;

ii. f_x (x_i; θ) is three times continuously differentiable and its derivatives
are bounded in absolute value;

iii. the support of x_i does not depend on θ, so that derivatives for θ can
pass at least twice through an integral defined in terms of f_x (x_i; θ);

then the limiting distribution is expressible as:
\[
  \sqrt{N} \left( \hat{\boldsymbol{\theta}}_{MLE} - \boldsymbol{\theta}_0 \right) \xrightarrow{d} \mathcal{N}\left( \mathbf{0}, \left[ \mathcal{I}(\boldsymbol{\theta}_0) \right]^{-1} \right)
\]
where I (θ_0) – without the N subscript – is the expression for the following
“single-observation” information matrix evaluated at θ_0.
\[
  \mathcal{I}(\boldsymbol{\theta}_0) \equiv \mathrm{E}\left[ \left( \frac{\partial}{\partial \boldsymbol{\theta}} \log f_{\mathbf{x}}(\mathbf{x}_i; \boldsymbol{\theta}_0) \right) \left( \frac{\partial}{\partial \boldsymbol{\theta}} \log f_{\mathbf{x}}(\mathbf{x}_i; \boldsymbol{\theta}_0) \right)^{\mathsf{T}} \right]
  = - \mathrm{E}\left[ \frac{\partial^2}{\partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^{\mathsf{T}}} \log f_{\mathbf{x}}(\mathbf{x}_i; \boldsymbol{\theta}_0) \right]
\]
Consequently, θ̂_MLE asymptotically attains the Cramér-Rao bound.

Proof. The proof proceeds similarly to the Method of Moments case. By
the mean value theorem, the MLE First Order Conditions can be stated as:
\[
  \mathbf{0} = \frac{1}{N} \sum_{i=1}^{N} \frac{\partial}{\partial \boldsymbol{\theta}} \log f_{\mathbf{x}}\left( \mathbf{x}_i; \hat{\boldsymbol{\theta}}_{MLE} \right)
  = \frac{1}{N} \sum_{i=1}^{N} \frac{\partial}{\partial \boldsymbol{\theta}} \log f_{\mathbf{x}}(\mathbf{x}_i; \boldsymbol{\theta}_0)
  + \left[ \frac{1}{N} \sum_{i=1}^{N} \frac{\partial^2}{\partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^{\mathsf{T}}} \log f_{\mathbf{x}}\left( \mathbf{x}_i; \tilde{\boldsymbol{\theta}}_N \right) \right] \left( \hat{\boldsymbol{\theta}}_{MLE} - \boldsymbol{\theta}_0 \right)
\]
where the entire expression is zero by definition of the MLE. Once again:
\[
  \sqrt{N} \left( \hat{\boldsymbol{\theta}}_{MLE} - \boldsymbol{\theta}_0 \right)
  = - \left[ \frac{1}{N} \sum_{i=1}^{N} \frac{\partial^2}{\partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^{\mathsf{T}}} \log f_{\mathbf{x}}\left( \mathbf{x}_i; \tilde{\boldsymbol{\theta}}_N \right) \right]^{-1} \frac{1}{\sqrt{N}} \sum_{i=1}^{N} \frac{\partial}{\partial \boldsymbol{\theta}} \log f_{\mathbf{x}}(\mathbf{x}_i; \boldsymbol{\theta}_0)
\]


but in this case some additional simplifications are possible, thanks to the
Information Matrix Equality. In fact, under the regularity conditions:

1. a suitable Central Limit Theorem implies that:
\[
  - \frac{1}{\sqrt{N}} \sum_{i=1}^{N} \frac{\partial}{\partial \boldsymbol{\theta}} \log f_{\mathbf{x}}(\mathbf{x}_i; \boldsymbol{\theta}_0) \xrightarrow{d} \mathcal{N}\left( \mathbf{0}, \mathcal{I}(\boldsymbol{\theta}_0) \right)
\]
since θ_0 maximizes E [log f_x (x_i; θ_0)], hence E [∂/∂θ log f_x (x_i; θ_0)] = 0;

2. while by the Weak Law of Large Numbers:
\[
  \frac{1}{N} \sum_{i=1}^{N} \frac{\partial^2}{\partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^{\mathsf{T}}} \log f_{\mathbf{x}}\left( \mathbf{x}_i; \tilde{\boldsymbol{\theta}}_N \right) \xrightarrow{p} - \mathcal{I}(\boldsymbol{\theta}_0)
\]
again since θ̃_N →p θ_0 by consistency of MLE as per Theorem 6.9.

Hence, here the application of the Delta Method results in a simplified ex-
pression of the limiting variance, as given in the statement of the Theorem.
Collecting terms, for some fixed N the asymptotic distribution is:
\[
  \hat{\boldsymbol{\theta}}_{MLE} \overset{A}{\sim} \mathcal{N}\left( \boldsymbol{\theta}_0, \left[ \mathcal{I}_N(\boldsymbol{\theta}_0) \right]^{-1} \right)
\]
where I_N (θ_0) is the grand (sample) information matrix for some fixed N.
Since the MLE is asymptotically consistent, at the probability limit its bias
is zero, hence the estimator attains the Cramér-Rao bound.
This result is celebrated, since it motivates the reputation of Maximum
Likelihood as the method for constructing estimators with the best statisti-
cal properties. Yet one should be careful about overusing Maximum Likeli-
hood on the expectation that the resulting estimators are asymptotically
efficient. In fact, Maximum Likelihood is very sensitive to the assumptions
about the distributions that generate the sample, and it can fail utterly (i.e.
produce inconsistent estimates) if the assumptions are incorrect. On the other
hand, the Method of Moments is generally more robust. This creates some
tension between efficiency and robustness when choosing between estimators.
It is worth concluding this section by summarizing the advantages of
conducting statistical analysis in large samples such that asymptotic results
apply. In these settings, standard estimators are known to be normally dis-
tributed; this facilitates statistical inference immensely since it is generally
very difficult to derive their exact distributions in small samples. While the
variances of these estimators are usually unknown quantities, they can be
easily and consistently estimated via their sample analogues; for example,
under normal sampling it is \( \frac{N}{N-1} S^2 \xrightarrow{p} \sigma^2 \). Another example follows suit.


Example 6.8. Asymptotically testing for the regression slope β_1
of the bivariate regression model. Let us return one more time to the
bivariate regression model. Recall the two-sided hypotheses from Example
5.18 in the previous lecture:
\[
  H_0 : \beta_1 = C \qquad H_1 : \beta_1 \neq C
\]
and similarly the analogous one-sided ones. In applied regression analysis,
the most common test is the one about C = 0, also called significance test
of the regression, since it is effectively a test about whether the explanatory
variable X_i affects the mean of Y_i in a conditional sense.
As discussed briefly in Lecture 5, with small samples this test is prob-
lematic, since it requires making specific assumptions about the conditional
distribution of Y_i given X_i. In an asymptotic environment, however, it is
possible to rely on the implications of the Central Limit Theorem discussed
in Example 6.7. Thus, by Observation 6.1 the t-statistic defined as:
\[
  t_N = \sqrt{N} \, \frac{\hat{\beta}_{1,MM} - C}{S_{\beta_1}} \xrightarrow{d} \mathcal{N}(0, 1)
\]
asymptotically follows – under the null hypothesis – the standard normal
distribution. Above, S_{β_1} is the sample standard deviation of the estimator;
the corresponding sample variance S²_{β_1} is calculated as the sample analogue
of the relevant limiting variance. In the general heteroscedastic case, it is:
\[
  S_{\beta_1}^2 = N \, \frac{\sum_{i=1}^{N} \left( Y_i - \hat{\beta}_{0,MM} - \hat{\beta}_{1,MM} X_i \right)^2 (X_i - \bar{X})^2}{\left[ \sum_{i=1}^{N} (X_i - \bar{X})^2 \right]^2}
\]
while in the more restricted homoscedastic case S²_{β_1} is as follows.
\[
  S_{\beta_1}^2 = \frac{\sum_{i=1}^{N} \left( Y_i - \hat{\beta}_{0,MM} - \hat{\beta}_{1,MM} X_i \right)^2}{\sum_{i=1}^{N} (X_i - \bar{X})^2}
\]
The quantity S_{β_1}/√N is called the standard error of the estimate β̂_{1,MM}.
With these results at hand, it is possible to conduct tests of hypotheses and
construct confidence intervals under the familiar framework of the normal
distribution. For example, the confidence interval of β_1 would be as follows.
\[
  \beta_1 \in \left( \hat{\beta}_{1,MM} - z_{\alpha/2}^{*} \frac{S_{\beta_1}}{\sqrt{N}}, \; \hat{\beta}_{1,MM} + z_{\alpha/2}^{*} \frac{S_{\beta_1}}{\sqrt{N}} \right)
\]
This example concludes the analysis of the bivariate linear regression model.
All concepts and ideas related to it which were developed in various exam-
ples extend easily to the multivariate version of the model. 
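The computations in this example can be sketched on simulated data. The heteroscedasticity-robust formula for S²_{β_1}, the standard error S_{β_1}/√N, the t-statistic for H₀: β₁ = 0 and a 95% confidence interval follow the expressions above; the data-generating process, sample size and seed are illustrative assumptions of this sketch.

```python
# Hypothetical illustration of the robust standard error and t-statistic for
# the bivariate slope; the simulated errors are heteroscedastic by design.
import numpy as np

rng = np.random.default_rng(3)
N = 400
x = rng.normal(0.0, 1.0, N)
eps = rng.normal(0.0, 1.0, N) * (1.0 + 0.5 * np.abs(x))  # variance depends on x
y = 1.0 + 2.0 * x + eps

xd = x - x.mean()
b1 = (xd * y).sum() / (xd ** 2).sum()     # MM / OLS slope
b0 = y.mean() - b1 * x.mean()             # intercept
e = y - b0 - b1 * x                       # fitted residuals

# Heteroscedasticity-robust sample variance of the slope, then the SE
S2_b1 = N * ((e ** 2) * (xd ** 2)).sum() / ((xd ** 2).sum()) ** 2
se_b1 = np.sqrt(S2_b1 / N)

t = b1 / se_b1                            # significance test, H0: beta1 = 0
ci = (b1 - 1.96 * se_b1, b1 + 1.96 * se_b1)
print(f"b1 = {b1:.3f}, se = {se_b1:.3f}, t = {t:.1f}, 95% CI = {ci}")
```

With a true slope of 2, the t-statistic is large and the significance test rejects H₀ at any conventional level.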


As the last example has shown, the specific formulae for the estimation of
the asymptotic variance are typically context- and assumption-dependent.
In random samples, however, it is easy to establish expressions with a more
general validity. Consider Method of Moments estimators first; in the i.i.d.
framework, a general expression for a consistent estimator of their asymp-
totic variance is given by \( N^{-1} \widehat{\mathbf{M}}_N \widehat{\boldsymbol{\Upsilon}}_N \widehat{\mathbf{M}}_N^{\mathsf{T}} \), where
\[
  \widehat{\mathbf{M}}_N \equiv \left[ \frac{1}{N} \sum_{i=1}^{N} \frac{\partial}{\partial \boldsymbol{\theta}^{\mathsf{T}}} \mathbf{m}\left( \mathbf{x}_i; \hat{\boldsymbol{\theta}}_{MM} \right) \right]^{-1} \xrightarrow{p} \mathbf{M}_0
\]
is a consistent estimator of M_0 (by some applicable Law of Large Numbers
and the Continuous Mapping Theorem), while
\[
  \widehat{\boldsymbol{\Upsilon}}_N \equiv \frac{1}{N} \sum_{i=1}^{N} \left[ \mathbf{m}\left( \mathbf{x}_i; \hat{\boldsymbol{\theta}}_{MM} \right) \right] \left[ \mathbf{m}\left( \mathbf{x}_i; \hat{\boldsymbol{\theta}}_{MM} \right) \right]^{\mathsf{T}} \xrightarrow{p} \boldsymbol{\Upsilon}_0
\]
is also a consistent estimator of the variance of the zero moment conditions
by some applicable Law of Large Numbers, since in a random sample the
following holds.⁶
\[
  \boldsymbol{\Upsilon}_0 = \mathrm{Var}\left[ \mathbf{m}(\mathbf{x}_i; \boldsymbol{\theta}_0) \right] = \mathrm{E}\left[ \left( \mathbf{m}(\mathbf{x}_i; \boldsymbol{\theta}_0) \right) \left( \mathbf{m}(\mathbf{x}_i; \boldsymbol{\theta}_0) \right)^{\mathsf{T}} \right]
\]
These estimating matrices are not only based on sample analogues of their
population counterparts (the object of estimation), something which is indi-
cated with the subscript N instead of 0. In addition, they are also evaluated
at the estimated parameters θ̂, which is symbolized by the wide “hat” used
to denote them. In the Maximum Likelihood case, the information matrix
equality offers two alternative routes for estimating the asymptotic variance.
The first option is based on the Hessian of the mass or density function:
\[
  \widehat{\mathbf{H}}_N \equiv - \frac{1}{N} \sum_{i=1}^{N} \frac{\partial^2}{\partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^{\mathsf{T}}} \log f_{\mathbf{x}}\left( \mathbf{x}_i; \hat{\boldsymbol{\theta}}_{MLE} \right) \xrightarrow{p} \mathcal{I}(\boldsymbol{\theta}_0)
\]
while the second option exploits the “squared” score.
\[
  \widehat{\mathbf{J}}_N \equiv \frac{1}{N} \sum_{i=1}^{N} \left( \frac{\partial}{\partial \boldsymbol{\theta}} \log f_{\mathbf{x}}\left( \mathbf{x}_i; \hat{\boldsymbol{\theta}}_{MLE} \right) \right) \left( \frac{\partial}{\partial \boldsymbol{\theta}} \log f_{\mathbf{x}}\left( \mathbf{x}_i; \hat{\boldsymbol{\theta}}_{MLE} \right) \right)^{\mathsf{T}} \xrightarrow{p} \mathcal{I}(\boldsymbol{\theta}_0)
\]
Both matrices Ĥ_N and Ĵ_N are evaluated at the MLE solution and, in ran-
dom samples, allow one to consistently estimate the information matrix; in
practice the choice of a specific option is usually based on convenience. It is
important to become familiar with this type of results and the associated
notation, as they are typical of the standard treatment of econometric theory.
⁶Under fairly general assumptions, it is Υ̂_N →p Υ_0 also with i.n.i.d. observations.
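As a hypothetical single-parameter illustration of Ĥ_N and Ĵ_N (the exponential density and all numbers below are my own choices, not taken from the text): for f(x; θ) = θe^(−θx) the score is 1/θ − x, the negative of the Hessian is 1/θ², the information is I(θ₀) = 1/θ₀², and the MLE is θ̂ = 1/x̄.

```python
# Hypothetical sketch of the two information-matrix estimators for the
# exponential density f(x; theta) = theta * exp(-theta * x), where
# I(theta0) = 1 / theta0^2 and the MLE is theta_hat = 1 / x_bar.
import numpy as np

rng = np.random.default_rng(4)
theta0, N = 2.0, 5000
x = rng.exponential(1.0 / theta0, N)   # NumPy parameterizes by the mean

theta_hat = 1.0 / x.mean()             # Maximum Likelihood estimate

score = 1.0 / theta_hat - x            # d/dtheta log f(x_i; theta) at theta_hat
H_N = 1.0 / theta_hat ** 2             # -(1/N) sum of Hessians (constant here)
J_N = np.mean(score ** 2)              # average "squared" score

print(f"theta_hat = {theta_hat:.3f}, H_N = {H_N:.3f}, "
      f"J_N = {J_N:.3f}, I(theta0) = {1 / theta0 ** 2:.3f}")
```

Both estimates approximate I(θ₀) = 0.25 in a large sample, as the information matrix equality suggests; the two routes differ only in finite samples.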

Part II

Econometric Theory

Lecture 7

The Linear Regression Model

This lecture introduces a workhorse model of statistics and econometrics:


the linear regression model. To this end, the lecture develops the method of
Least Squares as the algebraic solution to some linear prediction problem,
and subsequently discusses its properties and relationships to the (possibly
linear) population regression function. The treatment and the terminology
adopted are purposefully typical of the econometric approach.

7.1 Linear Socio-economic Relationships


Suppose that a researcher is studying an economic or social phenomenon,
and postulates the existence of a linear relationship between a dependent
variable Yi ; K independent or explanatory variables x = (X1i , . . . , XKi );
and finally an unobserved “disturbance” or error term εi .
\[
  Y_i = \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_K X_{Ki} + \varepsilon_i \qquad (7.1)
\]
The researcher has access to a sample \( \{(y_i, \mathbf{x}_i)\}_{i=1}^{N} \) of size N > K, where y_i
is the realization of Y_i for the i-th observation, while the vector x_i:
\[
  \mathbf{x}_i = \begin{pmatrix} x_{1i} \\ x_{2i} \\ \vdots \\ x_{Ki} \end{pmatrix}
\]
has length K and collects all the realizations of the explanatory variables
in x for the i-th observation. One can write (7.1) in terms of realizations:
\[
  y_i = \mathbf{x}_i^{\mathsf{T}} \boldsymbol{\beta} + \varepsilon_i \qquad (7.2)
\]
where \( \boldsymbol{\beta} = \begin{pmatrix} \beta_1 & \beta_2 & \ldots & \beta_K \end{pmatrix}^{\mathsf{T}} \) is the parameter vector of the model.



According to typical econometric terminology, (7.1) can be given a so-


called “structural” interpretation in the sense that the endogenous variable
Yi is assumed to depend linearly upon some K independent, exogenous
variables (X1i , . . . , XKi ) because of a priori knowledge of the setting under
analysis or theoretical reasoning. The linear relationship is augmented with
the inclusion of the unobserved error εi , which on the one hand represents
all the other, generally unknown factors that also determine Yi , and on the
other hand it deprives the linear relationship of any deterministic content.
In fact, social phenomena are not determined according to fixed rules, thus
any exact relationship is bound to be rejected by the empirical observation.
Observe that no statement has been made yet about the joint probability
distribution that determines (xi , εi ) – and thus (yi , xi ).
Typically, researchers introduce a constant term as an independent
parameter into the specification of linear relationships such as (7.1):
\[
  Y_i = \beta_0 + \beta_1 X_{1i} + \cdots + \beta_{K-1} X_{(K-1)i} + \varepsilon_i \qquad (7.3)
\]
observe that in addition to the constant parameter β_0, the model has K − 1
independent variables X_ki, so that in total the model still has K parameters.
Here, the model can still be written as in (7.2), while the vector x_i becomes:
\[
  \mathbf{x}_i = \begin{pmatrix} 1 \\ x_{1i} \\ \vdots \\ x_{(K-1)i} \end{pmatrix}
\]
in practice a “constant” variable X_{0i}, normalized for convenience as x_{0i} = 1
for each observation i, is included in the model. This offers the advantage
of letting researchers confidently assume that E [εi ] = 0: intuitively, if the
average influence of the unobserved factors on Yi were different from zero,
the extent of such an effect would be conceptually equivalent to a higher
constant coefficient β0 . In other words, introducing constant terms allows
researchers to disregard all those unobserved or uninteresting factors that
affect the unconditional average of the dependent variable Yi .
According to a traditional view, the econometric analysis of linear rela-
tionships such as (7.1) or (7.3) has the objective of giving empirical content
to parameters like β, so that economists can relate the abstract relationships
featured in their theories to actual quantities governing variables that can
be observed in the real world. This would enable economists to make state-
ments about causal mechanisms, ascertain the effect of economic policies,
analyze counterfactuals, provide forecasts, and more. Such an intellectual
tradition can be traced back to the most archaeological econometric models.


Example 7.1. The Keynesian consumption function. The diffusion


of Keynes’ General Theory in the middle of the twentieth century and the
birth of macroeconomics as a distinct subdiscipline are associated with the
diffusion of new theories about macroeconomic relationships. Among those,
perhaps the most exemplary case is the “Keynesian” consumption function:

Ci = c0 + c1 Yi + εi (7.4)

where Ci represents aggregate consumption, Yi aggregate disposable income,


εi is a disturbance term and (c0 , c1 ) are the two parameters of interest; c1 ,
in particular, measures the dependence of Ci on Yi and it bears the name
of marginal propensity of consumption, a parameter that played an
important role in the early macroeconomic theories following the Keynesian
legacy. To evaluate (7.4) with real world data one should have access to a
sample of economic regions with plausibly the same marginal propensity to
consume, or instead to multiple observations on the same region or country.
Sequences of observations about the same unit of analysis tracked over time
are called time series and feature preminently in macroeconometrics.
[Figure: scatter plot of annual observations, each labeled by year (1959–1995); x-axis: Y_i, disposable income, 1996 $ trillions; y-axis: C_i, nondurable consumption, 1996 $ trillions.]

Figure 7.1: Disposable Income and Consumption, 1959-1995 US data


Figure 7.1 depicts an example of the association between two time series
of Y and C based on actual macroeconomic data (both series are normalized
to 1992 prices). The relationship between the two variables in question ap-
pears to be robustly linear, so that at first sight a structural relationship like
(7.4) seems justified. Applied macroeconometric research has demonstrated
that, in fact, the strong linear association between the levels of income and
consumption which is typically observed in the data is spurious, in the sense
that the variation of both variables, while certainly reciprocally related, is
influenced by parallel trends that influence both income and consumption.
This finding has led to the development of more sophisticated linear and
non-linear models for the analysis of macroeconomic time series. 

Example 7.2. Human capital and wages. A much celebrated theory in


labor economics postulates that wages, being a function of the individual
(marginal) productivity, are a function of those factors that makes workers
more productive. Collectively, these factors fall under the name of human
capital; while nowadays this is a common expression, the idea of extending
to individuals a concept analogous to that of physical capital (Walsh, 1935;
Becker, 1962) was initially quite an original theoretical contribution. The
seminal framework for the empirical analysis of human capital is originally
due to Mincer (1958), who introduced the following relationship between the
wages Wi of workers, their experience in the workplace Xi , their education
Si, their ability αi, and finally some “residual” factors ϵi that influence
their labor market outcomes, say sheer luck in landing a good job.

\[
  W_i = \exp\left( \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \beta_3 S_i \right) \exp\left( \alpha_i + \epsilon_i \right) \qquad (7.5)
\]




While this relationship is certainly non-linear, it can be easily transformed


so that it becomes linear in the parameters βW = {β0 , β1 , β2 , β3 }, a rela-
tionship that is best known as the Mincer equation.

\[
  \log W_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \beta_3 S_i + \alpha_i + \epsilon_i \qquad (7.6)
\]

The functional form of the Mincer equation was originally motivated on


empirical observation, and it still is nowadays the workhorse model for the
analysis of the returns to education, which in the model are subsumed
by the parameter β3 associated with the education variable Si . The reason
is that, by keeping the analogy with physical capital, human capital too is
something that can be enhanced by investment – in this case, by acquiring
more education. The latter can also be modeled, for example as:

\[
  S_i = \gamma_0 + \gamma_1 Z_i + \phi_1 X_i + \phi_2 X_i^2 + \psi_0 \alpha_i + \eta_i \qquad (7.7)
\]


that is, as some model for Si, which in this case is linear in the parameters


like γ1 , φ1 et cetera, and that includes a squared term for experience Xi ,
some generic factors Zi that affect the individual choice in education, ability
αi , as well as other unobserved factors ηi . Empirical labor economists typ-
ically augment their analyses about returns to schooling with the inclusion
of a specification of the education model (7.7), with the aim of addressing a
typical econometric problem dubbed “endogeneity” which ultimately stems
from the inability of econometricians to observe individual ability αi ; later
lectures elaborate on this topic and elucidate the advantages of specifying
a linear-in-the-parameters function for education Si .
[Figure: scatter plot of individual workers; x-axis: S_i, years of education (6 to 20); y-axis: log W_i, logarithm of the annual wage, 1987 $.]

Figure 7.2: (Log) wage and education in 1987, excerpt of a survey

It is instructive to display in graphical form the relationship between


(log) wages and education. Figure 7.2 is obtained from an excerpt of a lon-
gitudinal survey of workers, by isolating observations from one specific year.
Using some more technical terminology, a single cross section of individ-
uals was isolated from a larger longitudinal or panel dataset.1 Certainly,
¹This is based on the “Keane” panel, originally spanning the years 1981 to 1987, available
online at http://fmwww.bc.edu/ec-p/data/wooldridge/datasets.list.html.


panel or longitudinal data are typically more informative and useful for the
purposes of microeconometric analysis (while also being typically costlier
to gather and often not readily available); however, cross-sectional data –
by virtue of being simpler – have pedagogical value in selected settings.
The scatter plot displayed in Figure 7.2 represents the raw relationship
between (log) wages and the attained level of education, while ignoring other
variables (such as say experience and ability) which also might bear effects
on earnings. Unlike the association between consumption and income from
Figure 7.1, that the relationship between log-wages and education is linear
is not immediately clear through the visual representation of the data. How-
ever, repeated analyses have shown that this relationship is “more linear”
than the one between the level of wages and education (an observation that
helps motivate the popularity of the Mincer equation). In addition, observe
that the independent education variable Si takes values upon a discrete set,
which is a typical feature of many microeconometric datasets. 
Both Examples 7.1 and 7.2 are valid examples of economic structural relationships
framed through linear equations. However, as motivated at length over the
course of this lecture, the linear model is a powerful tool whose properties
are quite useful even when studying relationships that are not strictly struc-
tural or, under certain conditions, that are structural but not necessarily
linear. The analysis of the linear model often benefits from a mathematical
representation based on compact matrix notation. Define:
$$
y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}; \qquad
X = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{bmatrix}
= \begin{bmatrix} x_{11} & x_{21} & \dots & x_{K1} \\ x_{12} & x_{22} & \dots & x_{K2} \\ \vdots & \vdots & \ddots & \vdots \\ x_{1N} & x_{2N} & \dots & x_{KN} \end{bmatrix}; \qquad
\varepsilon = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_N \end{bmatrix}
$$
that is, y, X and ε are obtained by vertically stacking over all observations,
respectively, the realization yi of the dependent variable, the transpose of
the vector xi , and the error term εi . If the model features a constant term,
the first column of X is the ιN vector whose entries are all equal to 1. With
compact matrix notation, (7.1) can be conveniently written as follows.
y = Xβ + ε (7.8)
Econometric models are often written in terms of realizations, not in terms
of abstract random variables, vectors, or matrices (note that the distinction does not apply to the error terms, which cannot be observed by definition).
Because of this convention, in what follows the notation adopted to describe
specific models alternates between the two cases, depending on purpose and
convenience (that is, which notation best clarifies a certain concept).


7.2 Optimal Linear Prediction


Having specified a linear economic relationship, the objective of the econo-
metrician is that of assigning a value to the parameters β that “make sense”
on the basis of the real world observations of (xi , yi ). For the moment I shall
not speak of “estimating” parameters, since this term involves dealing with
distributional assumptions and statistical inference. In what follows, I dis-
cuss how the Least Squares solution – on which the Ordinary Least Squares
(OLS) estimator for linear regression is based – can be derived as the sam-
ple analog of the solution of a population prediction problem which is
restricted to linear prediction functions. This analysis allows one to appreciate certain properties of the Least Squares solution that are often invoked to motivate linear regression analysis.
Suppose that the researcher aims at specifying a prediction function $\hat{y}_i = m_y(x_i)$, meant as the “best guess” of some unknown value of $Y_i = y_i$ based on the observation of a vector of independent factors $x_i$. Clearly, the farther away the prediction $\hat{y}_i$ is from the actual realization of $Y_i$, the worse it is for the researcher. This implies the existence of some loss function:

$$L(e_i) = L(y_i - \hat{y}_i) \tag{7.9}$$

with $e_i \equiv y_i - \hat{y}_i$, and where $L(e_i)$ has the properties that it is increasing in $|y_i - \hat{y}_i|$ and that $L(0) = 0$. If the researcher aims at specifying a predictor that is consistent across different realizations of $x_i$, a sensible criterion is to choose the function $m_y(x_i)$ that minimizes the expected loss:
$$E\left[L\left(Y_i - \hat{Y}_i\right)\right] = E\left[L\left(Y_i - m_y(x_i)\right)\right] \tag{7.10}$$

where the expectation is taken on the joint support of Yi and (X1i , . . . , XKi ).
This still leaves open the question about the choice of the working loss
function L (ei ). In general, this choice may depend on the context; here, the
analysis is focused on the quadratic loss L (ei ) = e2i . The quadratic loss is
appealing, since deviations of the prediction from the “true” realization of
Yi are disproportionately more “harmful” the higher they are. The expected
quadratic loss is the so-called mean squared error of prediction.

$$\text{MSE} = E\left[\left(Y_i - m_y(x_i)\right)^2\right] \tag{7.11}$$
Alternative loss criteria exist. For example, the absolute loss L (ei ) = |ei |
differs from the quadratic loss in that it does not disproportionately punish
large mistakes. For some p ∈ (0, 1), the quantile loss

$$L(e_i) = p\,|e_i| \cdot \mathbf{1}[e_i \ge 0] + (1 - p)\,|e_i| \cdot \mathbf{1}[e_i < 0]$$


is asymmetric: for prediction errors of the same absolute size |ei |, it punishes
underprediction (ei > 0) more than overprediction (ei < 0) if p > 0.5 – and
vice versa; the asymmetry increases the farther p departs from 0.5. Observe
that the absolute loss is a special case of the quantile loss, for p = 0.5.
The remainder of this analysis focuses, as anticipated, on the quadratic
loss. The Mean Squared Error of prediction is associated with a well-known
statistical result.

Theorem 7.1. CEF as Optimal Predictor under Quadratic Loss. If $\mathrm{Var}[Y_i \mid x_i] < \infty$, the predictor $m_y(x_i)$ that minimizes the Mean Squared Error is the Conditional Expectation Function (CEF): $m_y(x_i) = E[Y_i \mid x_i]$.
Proof. By the standard decomposition of the MSE:
$$
\begin{aligned}
E\left[(Y_i - m_y(x_i))^2\right]
&= E\left[(Y_i - E[Y_i \mid x_i] + E[Y_i \mid x_i] - m_y(x_i))^2\right] \\
&= E\left[(Y_i - E[Y_i \mid x_i])^2\right] + E\left[(E[Y_i \mid x_i] - m_y(x_i))^2\right] \\
&\quad + 2\,E\left[(Y_i - E[Y_i \mid x_i])(E[Y_i \mid x_i] - m_y(x_i))\right] \\
&= E\left[(Y_i - E[Y_i \mid x_i])^2\right] + E\left[(E[Y_i \mid x_i] - m_y(x_i))^2\right] \\
&= \mathrm{Var}[Y_i \mid x_i] + E\left[(E[Y_i \mid x_i] - m_y(x_i))^2\right]
\end{aligned}
$$
which is minimized if $E[Y_i \mid x_i] = m_y(x_i)$, so long as $\mathrm{Var}[Y_i \mid x_i] < \infty$. Note that the last term in the third line vanishes since:
$$
E\left[(Y_i - E[Y_i \mid x_i])(E[Y_i \mid x_i] - m_y(x_i))\right]
= E\Big[(E[Y_i \mid x_i] - m_y(x_i)) \cdot \underbrace{E\left[\,Y_i - E[Y_i \mid x_i] \,\big|\, x_i\right]}_{=0}\Big] = 0
$$
an observation that carefully exploits the Law of Iterated Expectations. ∎

The fact that the conditional expectation function – the population con-
ditional average of Yi – is the best predictor may appear intuitive; however,
this result does not generally hold for all loss functions! In fact, it is limited
to the quadratic loss. Under different loss functions the optimal predictor is
different: for example, the one associated with the absolute loss L (ei ) = |ei |
is the conditional median of Yi given xi ; with the quantile loss, the optimal
predictor is the p-th conditional quantile of Yi given xi . Nevertheless, the
main result should be reminiscent of the simpler observation that the mean
E [X] of a random variable X is the latter’s best “guess” (predictor) under a
quadratic loss criterion (Lecture 1). Theorem 7.1 generalizes that finding.
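The correspondence between loss functions and optimal predictors can be checked numerically. Below is a minimal sketch (the simulated skewed outcome, seed, and search grid are all illustrative assumptions) that evaluates constant predictors on a grid and confirms that the quadratic, absolute, and quantile losses are minimized, respectively, near the sample mean, the sample median, and the sample p-th quantile.

```python
import numpy as np

rng = np.random.default_rng(0)                        # illustrative seed
y = rng.lognormal(mean=0.0, sigma=1.0, size=50_000)   # a skewed outcome

grid = np.linspace(0.01, 5.0, 1000)   # candidate constant predictors
p = 0.9                               # quantile-loss parameter

mse = [np.mean((y - c) ** 2) for c in grid]    # quadratic loss
mae = [np.mean(np.abs(y - c)) for c in grid]   # absolute loss
qls = [np.mean(np.where(y - c >= 0,            # quantile loss
                        p * (y - c), (p - 1) * (y - c))) for c in grid]

best_mse = grid[np.argmin(mse)]   # close to the sample mean
best_mae = grid[np.argmin(mae)]   # close to the sample median
best_qls = grid[np.argmin(qls)]   # close to the sample 0.9 quantile
```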


With this result at hand, return to the researcher's prediction problem, maintaining the choice of a quadratic loss function. Given that the relationship (7.1) under analysis is hypothesized linear, it is natural to examine the consequences of restricting the analysis to an optimal linear predictor. In other words, let $m_y(x_i) = p_y(x_i) = x_i^T \beta^*$,² where
$$\beta^* \in \arg\min_{\beta \in \mathbb{R}^K} E\left[\left(Y_i - x_i^T \beta\right)^2\right] \tag{7.12}$$

that is, $\beta^*$ is one specific coefficient vector which, among all predictors that are linear in $x_i$, minimizes the Mean Squared Error (observe that $\beta^*$ need not be unique). The implications are summarized with the next result.

Theorem 7.2. Optimal Linear Predictor as best approximation to the CEF. Consider any vector $\beta^*$ as defined in (7.12). If $\mathrm{Var}[Y_i \mid x_i] < \infty$, then:
$$\beta^* \in \arg\min_{\beta \in \mathbb{R}^K} E\left[\left(E[Y_i \mid x_i] - x_i^T \beta\right)^2\right] \tag{7.13}$$
that is, any optimal linear predictor of $Y_i$ is also an optimal linear predictor of the CEF, $E[Y_i \mid x_i]$, in the MSE sense.
Proof. The demonstration is analogous to that of Theorem 7.1:
$$
\begin{aligned}
E\left[\left(Y_i - x_i^T \beta\right)^2\right]
&= E\left[\left(Y_i - E[Y_i \mid x_i] + E[Y_i \mid x_i] - x_i^T \beta\right)^2\right] \\
&= E\left[(Y_i - E[Y_i \mid x_i])^2\right] + E\left[\left(E[Y_i \mid x_i] - x_i^T \beta\right)^2\right] \\
&\quad + 2\,E\left[(Y_i - E[Y_i \mid x_i])\left(E[Y_i \mid x_i] - x_i^T \beta\right)\right] \\
&= E\left[(Y_i - E[Y_i \mid x_i])^2\right] + E\left[\left(E[Y_i \mid x_i] - x_i^T \beta\right)^2\right] \\
&= \mathrm{Var}[Y_i \mid x_i] + E\left[\left(E[Y_i \mid x_i] - x_i^T \beta\right)^2\right]
\end{aligned}
$$
where again, the cross-term in the third line disappears by a proper application of the Law of Iterated Expectations. Therefore, the two minimizers in (7.12) and (7.13) are identical so long as $\mathrm{Var}[Y_i \mid x_i]$ is a finite constant. ∎

The interpretation of Theorem 7.2 is that even if the CEF is unknown, choosing an optimal linear predictor results in the best approximation to the true CEF that can be attained with a linear function, where “best approximation” shall be translated as “the minimal expected squared loss.”
2 The asterisk is related to an alternative notation for the optimal linear predictor, which is sometimes written as $E^*[Y_i \mid x_i] = x_i^T \beta^*$ with $\beta^*$ defined as in (7.12). Instead, the notation $p_y(x_i)$ specifies that the prediction function $m_y(x_i)$ is linear.


This observation has had a profound impact on motivating the use of linear
regression analysis in contexts where the true form of the statistical depen-
dence between a dependent variable Yi and a set of independent variables
(X1i , . . . , XKi ) is unknown. This important interpretation is revisited at the end of this lecture.
With the knowledge about the relationship between optimal predictors,
optimal linear predictors and the CEF at hand, it is convenient to express
the solution of the optimal linear prediction problem. One can rewrite the
First Order Conditions of the problem in (7.12) as:3

$$E\left[x_i x_i^T\right] \beta^* = E[x_i Y_i] \tag{7.14}$$
therefore a unique solution exists if the matrix $E\left[x_i x_i^T\right]$ is nonsingular, and it reads as:
$$\beta^* = \left(E\left[x_i x_i^T\right]\right)^{-1} E[x_i Y_i] \tag{7.15}$$
implying that the optimal linear predictor is as follows.
$$p_y(x_i) = x_i^T \left(E\left[x_i x_i^T\right]\right)^{-1} E[x_i Y_i] \tag{7.16}$$

This expression is also called the (population) linear projection of Yi given xi, for reasons that will appear clearer after examining the algebraic and geometric properties of its sample analog: the Least Squares solution.

Example 7.3. Linear approximation of a particular quadratic CEF. It is useful to illustrate this concept by developing and visualizing an example. Suppose that some variable Yi of interest for prediction only depends on a single explanatory variable Xi. In addition, suppose that the true CEF of Yi given Xi is quadratic; in particular:
$$E[Y_i \mid X_i] = X_i - \frac{1}{10} X_i^2$$
Quadratic relationships similar to the above are easy to identify in social
sciences. For example, the Mincer equation from Example 7.2 is quadratic
in experience Xi, and its second-degree parameter is usually estimated to be small and negative in empirical studies. In this example the above CEF does not include a term of degree zero, but this is only so for convenience.
3 Note that (7.14) is equivalent to the First Order Condition of the problem in (7.13), that of finding the MSE-best linear approximation to the CEF. The latter FOC is
$$E\left[x_i x_i^T\right] \beta^* = E\left[x_i \cdot E[Y_i \mid x_i]\right]$$
while $E\left[x_i \cdot E[Y_i \mid x_i]\right] = E[x_i Y_i]$ follows from the Law of Iterated Expectations.


Presume that the researcher who aims at predicting Yi using Xi is unaware of the true form of the CEF. Conscious of the result from Theorem 7.2, the researcher sets out to establish an optimal linear predictor which incorporates a constant term, as follows:
$$p_y(X_i) = \beta_0^* + \beta_1^* X_i$$
such that:
$$(\beta_0^*, \beta_1^*) \in \arg\min_{(\beta_0, \beta_1) \in \mathbb{R}^2} E\left[(Y_i - \beta_0 - \beta_1 X_i)^2\right]$$
as in (7.12). In order to find the coefficients of the optimal linear predictor, it is necessary to minimize the above MSE. The First Order Conditions are:
$$
\begin{aligned}
E[Y_i - \beta_0^* - \beta_1^* X_i] &= 0 \\
E[X_i (Y_i - \beta_0^* - \beta_1^* X_i)] &= 0
\end{aligned}
$$
which are identical to the two equations (3.8)-(3.9) that determine the coefficients of the bivariate linear regression model from Example 3.11! Therefore, the solution can be expressed as follows, after some manipulation.
$$
\beta_0^* = E[Y_i] - \frac{E[X_i Y_i] - E[X_i]\,E[Y_i]}{E[X_i^2] - (E[X_i])^2} \cdot E[X_i], \qquad
\beta_1^* = \frac{E[X_i Y_i] - E[X_i]\,E[Y_i]}{E[X_i^2] - (E[X_i])^2}
$$
Observe that under the hypothesis of a quadratic CEF, the moments of the form $E[X_i^r Y_i]$ – for any nonnegative integer r – can be obtained easily; in this specific case it is:
$$
\begin{aligned}
E[X_i^r Y_i] &= E\left[E[X_i^r Y_i \mid X_i]\right] \\
&= E\left[X_i^r \cdot E[Y_i \mid X_i]\right] \\
&= E\left[X_i^r \left(X_i - \frac{1}{10} X_i^2\right)\right] \\
&= E\left[X_i^{r+1}\right] - \frac{1}{10} E\left[X_i^{r+2}\right]
\end{aligned}
$$
which is yet another application of the Law of Iterated Expectations. There-
fore, in this specific example the two coefficients β∗0 and β∗1 are ultimately
functions of some uncentered moments of Xi ; as a consequence, in order to
calculate the optimal linear predictor one must know the distribution of Xi
(or make assumptions about it). To simplify, suppose that $X_i \sim U[0, 5]$; in this case it is easy to see that:
$$E[X_i^r] = \frac{5^r}{r + 1}$$


for any nonnegative integer r. As one can verify autonomously, the combination of all these hypotheses implies the following optimal linear predictor.
$$p_y(X_i) = \frac{5}{12} + \frac{1}{2} X_i$$
Note that one would obtain a different optimal linear predictor if Xi were
to follow a different distribution, including say a uniform distribution with
a different support!
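These closed-form values can be double-checked numerically. The sketch below (the seed and the noise standard deviation are illustrative choices, not part of the example) reproduces $\beta_0^* = 5/12$ and $\beta_1^* = 1/2$ exactly from the moment formulas, and approximately from a least-squares fit on simulated data.

```python
import numpy as np
from fractions import Fraction

# Uncentered moments of X ~ U[0, 5]: E[X^r] = 5^r / (r + 1)
def m(r):
    return Fraction(5**r, r + 1)

# Cross moments under the quadratic CEF: E[X^r Y] = E[X^(r+1)] - E[X^(r+2)]/10
def mxy(r):
    return m(r + 1) - m(r + 2) / 10

var_x = m(2) - m(1) ** 2
cov_xy = mxy(1) - m(1) * mxy(0)        # mxy(0) = E[Y]
beta1 = cov_xy / var_x                 # slope of the optimal linear predictor
beta0 = mxy(0) - beta1 * m(1)          # intercept

# Monte Carlo cross-check: fit a line to data simulated from the quadratic CEF
rng = np.random.default_rng(42)        # illustrative seed
x = rng.uniform(0, 5, size=200_000)
y = x - x**2 / 10 + rng.normal(scale=0.3, size=x.size)  # noise sd is arbitrary
b0_hat, b1_hat = np.polynomial.polynomial.polyfit(x, y, 1)
```

The exact computation yields `beta0 == 5/12` and `beta1 == 1/2`, and the fitted coefficients approach these values as the sample grows, regardless of the noise level.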
[Figure 7.3 appears here: a three-dimensional plot of selected conditional densities $f_{Y_i \mid X_i}(y_i \mid x_i)$ over the $(x_i, y_i)$ plane.]
Note: the continuous curve is the CEF: $E[Y_i \mid X_i] = X_i - X_i^2/10$; the dash-dotted line is the optimal linear predictor $p_y(X_i) = 5/12 + X_i/2$. The conditional distribution $Y_i \mid X_i$ is normal, with parameters that vary as a function of $X_i$. Selected density functions of $Y_i \mid X_i$ are displayed for $x_i = \{1, 2, 3, 4\}$.
Figure 7.3: The optimal linear predictor approximating a quadratic CEF

The result is illustrated graphically in Figure 7.3 above, where the contin-
uous curve represents the quadratic CEF while the dash-dotted (straight)
line is the optimal linear predictor, which is at the same time the best linear
approximation of the quadratic CEF. To help visualize the random nature
of the relationship between Yi and Xi , the conditional distribution of the
former given the latter is displayed as normal, but the analysis developed
in this example does not depend on this specific distributional choice. □


7.3 Analysis of Least Squares


The analysis of optimal predictors under specific loss functions is admittedly quite abstract – it is grounded in statistical decision theory. However, it is useful to motivate the Least Squares criterion and the associated statistical estimators on sound theoretical bases. By the analogy principle, in fact, one can establish an appropriate sample version of the population optimal linear predictor problem under a quadratic loss function. The Least Squares problem applied to a sample $\{(y_i, x_i)\}_{i=1}^N$ is:
$$b \in \arg\min_{\beta \in \mathbb{R}^K} \frac{1}{N} \sum_{i=1}^{N} \left(y_i - x_i^T \beta\right)^2 \tag{7.17}$$

or equivalently, using compact matrix notation, as follows.


$$b \in \arg\min_{\beta \in \mathbb{R}^K} \frac{1}{N} (y - X\beta)^T (y - X\beta) \tag{7.18}$$

In a linear framework, the Least Squares problem is about finding a vector


of coefficients b that minimizes the sum of the quadratic deviations between
the dependent variable yi and the corresponding linear combination of the
independent variables xi of each observation in the sample. Note that the
N −1 factor is redundant towards the determination of the solution.
The K First Order Conditions of the problem (7.17), also called normal
equations, are expressed below.
$$-\frac{2}{N} \sum_{i=1}^{N} x_i \left(y_i - x_i^T b\right) = 0 \tag{7.19}$$

In analogy with the population optimal linear predictor, a unique solution $b$ exists if the $K \times K$ matrix $\sum_{i=1}^{N} x_i x_i^T$ is invertible, and such a solution reads as follows.
$$b = \left(\sum_{i=1}^{N} x_i x_i^T\right)^{-1} \left(\sum_{i=1}^{N} x_i y_i\right) \tag{7.20}$$

The Least Squares solution $b$ is perhaps more elegantly expressed by using compact matrix notation: in this case, the K normal equations would read as:
$$-\frac{2}{N} X^T (y - Xb) = 0 \tag{7.21}$$
while the solution is written as follows.
$$b = \left(X^T X\right)^{-1} X^T y \tag{7.22}$$
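As an illustration (the data-generating process below is invented for the demonstration), the following sketch solves the normal equations (7.22) directly and compares the result with NumPy's built-in least-squares routine; the latter, which relies on a more stable matrix factorization instead of forming and inverting $X^T X$, is numerically preferable in practice.

```python
import numpy as np

rng = np.random.default_rng(7)   # illustrative seed
N, K = 500, 3
# Design matrix: a constant term plus two standard-normal regressors
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
beta = np.array([1.0, 0.5, -2.0])            # illustrative "true" coefficients
y = X @ beta + rng.normal(size=N)

# Solving the normal equations X'X b = X'y, as in (7.22)
b_normal = np.linalg.solve(X.T @ X, X.T @ y)

# The same solution via a dedicated least-squares solver
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Both routes return the same vector $b$, and the residuals $y - Xb$ are orthogonal to every column of $X$, as required by the normal equations (7.21).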


A careful reader will have noted that the derivation of the Least Squares
solution bears many analogies with Method of Moments estimation. For the
moment, however, it is better to abstract from any statistical assumptions
that might lead to make statements about estimation. A more immediately
useful exercise is to rather familiarize with both the analytic vector notation
(based on scalars like yi and vectors like xi ) and compact matrix notation.
In fact, both are useful in their own right: while the former provides more
visual information about certain computational details, the latter is better
suited to synthetically express some more convoluted formulae. For a start,
one should understand that the two $K \times K$ matrices $\sum_{i=1}^{N} x_i x_i^T$ and $X^T X$ are in fact the same thing!


The rest of this section illustrates some central algebraic and geometric
properties that are typical of the Least Squares solution, and it culminates
with a fundamental result known as the Frisch-Waugh-Lovell Theorem. All
these properties aid the interpretation of Least Squares in practical appli-
cations. Before proceeding, some more definitions are in order. Let
$$\hat{y}_i \equiv x_i^T b \tag{7.23}$$
be the fitted value for the i-th observation, that is, the value of the dependent variable that corresponds to $x_i$ in the hyperplane implied by the Least Squares solution. Clearly, since the observations of $y_i$ incorporate random, unobserved factors, they do not generally coincide with $\hat{y}_i$. For each observation in the sample, the difference between the actual observation $y_i$ of the dependent variable and the associated fitted value $\hat{y}_i$ is called the residual:
$$e_i \equiv y_i - \hat{y}_i = y_i - x_i^T b \tag{7.24}$$
note that the Least Squares problem can be equivalently expressed as that
of minimizing the sum of the squared residuals (hence its name).
One can vertically stack both sample fitted values and residuals so as to adapt them to the convenient use of compact matrix notation.
$$\hat{y} = \begin{bmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_N \end{bmatrix}; \qquad e = \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_N \end{bmatrix}$$
The vector of fitted values can be expressed compactly as:
$$\hat{y} = Xb = X\left(X^T X\right)^{-1} X^T y = P_X\, y$$


where:
$$P_X \equiv X\left(X^T X\right)^{-1} X^T \tag{7.25}$$
is called the projection matrix, which if pre-multiplied to $y$ results in the vector of fitted values $\hat{y}$. Furthermore:
$$e = y - \hat{y} = y - Xb = (I - P_X)\, y = M_X\, y$$
where:
$$M_X \equiv I - P_X = I - X\left(X^T X\right)^{-1} X^T \tag{7.26}$$
is the so-called residual maker matrix. Pre-multiplying $y$ by the residual maker matrix clearly results in the vector of residuals $e$.
The projection and residual maker matrices have important properties.
They are both symmetric:
$$P_X = P_X^T, \qquad M_X = M_X^T$$
idempotent:4
$$P_X P_X = P_X, \qquad M_X M_X = M_X$$
and they are orthogonal to one another.
$$P_X M_X = M_X P_X = 0$$
In addition, it is easy to see that:
$$P_X X = X, \qquad M_X X = 0$$
with a straightforward interpretation: if one projects the columns of X onto
themselves, the projection is identical to X and the residuals, consequently,
are zero. Finally, observe that:
$$y = (I + P_X - P_X)\, y = P_X\, y + M_X\, y = \hat{y} + e$$
4 Note that $P_X P_X = X(X^T X)^{-1} X^T X (X^T X)^{-1} X^T = X(X^T X)^{-1} X^T = P_X$ and $M_X M_X = (I - P_X)(I - P_X) = I - 2P_X + P_X P_X = I - P_X = M_X$. The other results follow easily from these observations.


and:
$$\hat{y}^T e = y^T P_X M_X\, y = 0, \qquad e^T \hat{y} = y^T M_X P_X\, y = 0$$

that is, the decomposition of the vector $y$ between the fitted values $\hat{y}$ and the residuals $e$ is such that these two components are orthogonal to one another. This fact relates to the all-important geometric interpretation of the Least Squares solution $b$, seen as the vector which, through the linear combination $\hat{y} = P_X y = Xb$, results in the geometrical projection of $y$ onto the column space of $X$.⁵ In fact, inspecting the normal equations (7.19) or (7.21) reveals how the residual vector $e$ is by construction orthogonal to the space $S(X)$ spanned by the columns of $X$ (the K explanatory variables). This is graphically represented in Figure 7.4 for the case of K = 2.
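The listed properties are easy to verify numerically. The sketch below (with an arbitrary simulated design matrix) constructs $P_X$ and $M_X$ explicitly and checks symmetry, idempotency, mutual orthogonality, and the decomposition $y = \hat{y} + e$; note that forming the full $N \times N$ projection matrix is done here purely for illustration and is avoided in serious computation.

```python
import numpy as np

rng = np.random.default_rng(1)   # illustrative seed
N, K = 60, 3
X = rng.normal(size=(N, K))
y = rng.normal(size=N)

P = X @ np.linalg.inv(X.T @ X) @ X.T   # projection matrix, as in (7.25)
M = np.eye(N) - P                      # residual maker, as in (7.26)

y_hat = P @ y                          # fitted values
e = M @ y                              # residuals

symmetric = np.allclose(P, P.T) and np.allclose(M, M.T)
idempotent = np.allclose(P @ P, P) and np.allclose(M @ M, M)
orthogonal = np.allclose(P @ M, np.zeros((N, N)))
annihilator = np.allclose(M @ X, np.zeros((N, K)))   # M_X X = 0
decomposition = np.allclose(y_hat + e, y)            # y = y_hat + e
fit_resid = np.isclose(y_hat @ e, 0.0)               # fitted values ⊥ residuals
```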

[Figure 7.4 appears here: the vector $y$ is decomposed into its projection $\hat{y}$ onto the plane $S(X)$ spanned by the columns $X_{*,1}$ and $X_{*,2}$, and the residual vector $e$, which forms a 90° angle with that plane.]
Figure 7.4: The geometric interpretation of the Least Squares solution

5 Hence names such as “projection matrix” and “linear projection.”


In order to appreciate some properties of the Least Squares solution that are especially relevant in settings with multiple explanatory variables (K > 1), it is useful to split these into smaller subsets. For example, one could rewrite the linear relationship (7.8) as:
$$y = X\beta + \varepsilon = X_1 \beta_1 + X_2 \beta_2 + \varepsilon \tag{7.27}$$
where $X = \begin{bmatrix} X_1 & X_2 \end{bmatrix}$ and $\beta^T = \begin{bmatrix} \beta_1^T & \beta_2^T \end{bmatrix}$. This amounts to “partitioning” the coefficient vector $\beta$ into two smaller subvectors $\beta_1$ and $\beta_2$, respectively of length $K_1$ and $K_2$ (with $K_1 + K_2 = K$), each pertaining to a corresponding subset of explanatory variables. It is interesting to examine how the partitioned components of the Least Squares solution $b$ compare to one another.
To this end, rewrite the normal equations in (7.21) as:
$$\begin{bmatrix} X_1^T X_1 & X_1^T X_2 \\ X_2^T X_1 & X_2^T X_2 \end{bmatrix} \begin{bmatrix} b_1 \\ b_2 \end{bmatrix} = \begin{bmatrix} X_1^T y \\ X_2^T y \end{bmatrix} \tag{7.28}$$
where $b^T = \begin{bmatrix} b_1^T & b_2^T \end{bmatrix}$. A fundamental result then follows.
Theorem 7.3. Frisch-Waugh-Lovell Theorem. The solution for $b_2$ can be written as:
$$b_2 = \left(X_2^{*T} X_2^{*}\right)^{-1} X_2^{*T} y \tag{7.29}$$
where:
$$X_2^{*} \equiv M_{X_1} X_2$$
and $M_{X_1}$ is the residual maker matrix of $X_1$.
$$M_{X_1} \equiv I - X_1\left(X_1^T X_1\right)^{-1} X_1^T$$
Furthermore, a symmetrical result is obtained for $b_1$.
Proof. By the algebra of partitioned matrices, one can write $b_1$ as a function of $b_2$ as:
$$b_1 = \left(X_1^T X_1\right)^{-1} X_1^T y - \left(X_1^T X_1\right)^{-1} X_1^T X_2 b_2$$
plugging the above into the lower block of $K_2$ rows in (7.28) gives:
$$X_2^T X_1 \left(X_1^T X_1\right)^{-1} X_1^T y - X_2^T X_1 \left(X_1^T X_1\right)^{-1} X_1^T X_2 b_2 + X_2^T X_2 b_2 = X_2^T y$$
with solution:
$$
\begin{aligned}
b_2 &= \left[X_2^T \left(I - X_1\left(X_1^T X_1\right)^{-1} X_1^T\right) X_2\right]^{-1} \left[X_2^T \left(I - X_1\left(X_1^T X_1\right)^{-1} X_1^T\right) y\right] \\
&= \left(X_2^T M_{X_1} X_2\right)^{-1} X_2^T M_{X_1} y
\end{aligned}
$$
which is equivalent to (7.29) since $M_{X_1}$ is symmetric and idempotent. The result for $b_1$ is symmetrical. ∎
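A quick numerical check of the theorem is sketched below (the design, with $X_2$ deliberately correlated with $X_1$, is invented for the demonstration): the last $K_2$ entries of the full least-squares solution coincide with the coefficients obtained by first residualizing $X_2$ on $X_1$ and then running a smaller regression.

```python
import numpy as np

rng = np.random.default_rng(3)   # illustrative seed
N = 400
X1 = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])  # K1 = 3, with constant
X2 = rng.normal(size=(N, 2)) + X1[:, 1:2]                    # K2 = 2, correlated with X1
X = np.hstack([X1, X2])
y = X @ np.array([1.0, 0.5, -1.0, 2.0, -0.5]) + rng.normal(size=N)

# Full least-squares solution; b2 is the last K2 entries
b_full, *_ = np.linalg.lstsq(X, y, rcond=None)
b2_full = b_full[3:]

# FWL route: residualize X2 (and y) on X1, then regress
M1 = np.eye(N) - X1 @ np.linalg.inv(X1.T @ X1) @ X1.T
X2_star = M1 @ X2
b2_fwl, *_ = np.linalg.lstsq(X2_star, M1 @ y, rcond=None)

# Per (7.29), y need not be residualized, since M1 is symmetric and idempotent
b2_direct = np.linalg.solve(X2_star.T @ X2_star, X2_star.T @ y)
```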


While this theorem might, at first glance, look like a collection of tedious algebraic manipulations, it delivers quite a fundamental insight:
any component (b2 ) of the least squares solution is algebraically equivalent
to another least squares solution, which follows from a transformed model
where the explanatory variables in question (X2 ) are substituted with the
corresponding residuals (X∗2 = MX1 X2 ) that are obtained from projecting
them on the other explanatory variables (X1 ). Recall our earlier observation
that the least squares projection returns a vector of fitted values and a vector
of residuals that are reciprocally orthogonal. What the Frisch-Waugh-Lovell
Theorem means in this framework is that in a linear model with multiple
explanatory variables, each coefficient bk obtained via Least Squares can be
interpreted as the overall “contribution” of Xki to Yi , after the contributions of the other K − 1 explanatory variables to Yi have been netted out or, using more technical terminology, partialled out.
This property goes a long way toward explaining the immense popularity of statistical estimators based on the Least Squares principle in econometric analysis.
In fact, it allows researchers to interpret the estimated coefficients associ-
ated with a single socio-economic variable of interest by pretending that all
other variables included in the model are taken “as given,” corresponding
with the typical ceteris paribus type of scientific thought experiments. Note
that the theorem does not motivate the exclusion of relevant explanatory
variables from the analysis, except for cases when they can be confidently
assumed to be statistically unrelated to the variables of interest (say, X2 ).
Notice, in fact, that only if $X_1$ and $X_2$ are orthogonal ($X_1^T X_2 = 0$) does it hold that $X_2^{*} = M_{X_1} X_2 = X_2$, meaning that the least squares coefficients associated with the $X_2$ explanatory variables are identical whether one includes the remaining factors $X_1$ in the model or not.
The properties of partitioned Least Squares appear particularly powerful
when considering any partitioned model with K1 = K − 1 and K2 = 1. In
this case, let $X_2 = s$; the K-th Least Squares coefficient is as follows.
$$b_K = \frac{s^T M_{X_1} y}{s^T M_{X_1} s} \tag{7.30}$$
This quantity is related to a statistical object called the partial correlation coefficient $\rho_{YS}^{*}$ between variables $Y_i$ and $X_{Ki} = S_i$:
$$\rho_{YS}^{*} = \frac{s^T M_{X_1} y}{\sqrt{s^T M_{X_1} s}\,\sqrt{y^T M_{X_1} y}} \tag{7.31}$$
which is nothing else but the correlation coefficient of the residuals of both
Yi and Si , obtained by projecting these on the other K − 1 explanatory


variables (including a constant term). This quantity can be shown to be the sample counterpart of the partial correlation, a variation of the population correlation expressed in terms of conditional moments:
$$\mathrm{Corr}[Y_i, S_i \mid x_1] = \frac{\mathrm{Cov}[Y_i, S_i \mid x_1]}{\sqrt{\mathrm{Var}[Y_i \mid x_1]}\,\sqrt{\mathrm{Var}[S_i \mid x_1]}} \tag{7.32}$$
where $x_1 = \left(X_{1i}, \dots, X_{(K-1)i}\right)$. Intuitively, partial correlation coefficients




measure the correlation between two variables once the dependence of both
(in the linear projection sense) on other variables has been removed. The
algebraic relationship between (7.30) and (7.31) appears evident, which ex-
plains why Least Squares coefficients are often attributed an interpretation
in terms of partial correlation (although the two are not identical).6
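The link between (7.30) and (7.31) can be sketched in a simulated dataset (the data-generating process below is invented for the illustration): the coefficient on $s$ from the full regression equals $b_K$ as in (7.30), and the partial correlation coefficient equals the plain correlation between the two residual vectors.

```python
import numpy as np

rng = np.random.default_rng(11)   # illustrative seed
N = 300
X1 = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])  # other regressors + constant
s = rng.normal(size=N) + X1[:, 1]                            # variable of interest
y = 0.7 * s + X1 @ np.array([1.0, -0.5, 0.3]) + rng.normal(size=N)

M1 = np.eye(N) - X1 @ np.linalg.inv(X1.T @ X1) @ X1.T
s_res, y_res = M1 @ s, M1 @ y      # residuals of s and y on the other regressors

# Least-squares coefficient on s in the full model, as in (7.30)
b_K = (s @ M1 @ y) / (s @ M1 @ s)

# Partial correlation coefficient, as in (7.31)
rho_star = (s @ M1 @ y) / np.sqrt((s @ M1 @ s) * (y @ M1 @ y))

# Cross-checks: full regression and plain correlation of the residuals
b_full, *_ = np.linalg.lstsq(np.column_stack([X1, s]), y, rcond=None)
rho_check = np.corrcoef(s_res, y_res)[0, 1]
```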
Another oft-invoked application of the Frisch-Waugh-Lovell Theorem is
demeaning, that is the operation of subtracting the respective mean from
both the explanatory and dependent variables. Suppose that X1 = ι is the
“constant term” of the model (an N -sized vector of ones), so that X2 entails
K −1 columns like in model (7.3). In such a case, the residual maker matrix
reads as:
$$D \equiv M_{X_1} = I - \iota\left(\iota^T \iota\right)^{-1} \iota^T = I - \frac{1}{N}\,\iota\iota^T \tag{7.33}$$
hence for any vector $a$ of length N, it is:
$$Da = a - \bar{a}\,\iota$$
where $\bar{a} \equiv \frac{1}{N}\sum_{i=1}^{N} a_i$. Thus $b_2$ can be equivalently obtained as the solution
of a Least Squares problem applied to the demeaned model:
$$y_i - \bar{y} = \beta_1\left(x_{i1} - \bar{x}_1\right) + \dots + \beta_{(K-1)}\left(x_{i(K-1)} - \bar{x}_{(K-1)}\right) + \varepsilon_i \tag{7.34}$$
where $\bar{y} \equiv \frac{1}{N}\sum_{i=1}^{N} y_i$ and $\bar{x}_k \equiv \frac{1}{N}\sum_{i=1}^{N} x_{ik}$ for all $k = 1, \dots, K - 1$. This fact is exploited in linear models for panel data, which routinely include a separate intercept (a fixed effect) for each panel unit in the sample.
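The demeaning result can be sketched as follows (simulated data with illustrative coefficients): the slopes from the model with an explicit constant term coincide with those from the demeaned model (7.34) estimated without a constant.

```python
import numpy as np

rng = np.random.default_rng(5)   # illustrative seed
N = 250
x = rng.normal(size=(N, 2))
y = 1.0 + x @ np.array([0.8, -1.2]) + rng.normal(size=N)

# Least squares with an explicit constant term
Xc = np.column_stack([np.ones(N), x])
b_const, *_ = np.linalg.lstsq(Xc, y, rcond=None)

# Demeaned model (7.34): slopes only, no constant term
x_dm = x - x.mean(axis=0)
y_dm = y - y.mean()
b_dm, *_ = np.linalg.lstsq(x_dm, y_dm, rcond=None)

# The intercept can be recovered from the sample means
intercept = y.mean() - x.mean(axis=0) @ b_dm
```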
6 This observation resonates well with the previous discussion of the bivariate regression model, the relationship between the correlation coefficient between any two random variables $Y_i$ and $X_i$ and their corresponding regression slope, and the MM/OLS estimator for the latter (Examples 3.11 and 5.4). In fact, in the sample the observed correlation coefficient between $Y_i$ and $X_i$ is defined as:
$$\rho_{YX} = \frac{\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{N}(y_i - \bar{y})^2}} = \frac{x^T D y}{\sqrt{x^T D x}\,\sqrt{y^T D y}}$$
where $\bar{x} \equiv \frac{1}{N}\sum_{i=1}^{N} x_i$, $\bar{y} \equiv \frac{1}{N}\sum_{i=1}^{N} y_i$, while $D$ is defined as in (7.33); the analogies of the above expression with both (5.16) and (7.31) are obvious.


7.4 Evaluation of Least Squares


As mentioned, in actual applications of Least Squares researchers typically
include a constant term into their linear specifications. Two motivations for
this choice have already been introduced: first, the inclusion of a constant term allows one to confidently assume that E[εi] = 0, which greatly simplifies the statistical modeling of econometric estimators; second, it allows one to interpret the calculated coefficients in terms of partial correlations. Some additional
important implications are apparent from the inspection of the first normal
equation in (7.19); in particular:
1. all the residuals sum up to zero: $\sum_{i=1}^{N} e_i = 0$, and so. . .
2. . . . the mean of the fitted values $\hat{y}_i$ coincides with that of the dependent variable $y_i$: $\frac{1}{N}\sum_{i=1}^{N} y_i = \frac{1}{N}\sum_{i=1}^{N} x_i^T b = \frac{1}{N}\sum_{i=1}^{N} \hat{y}_i = \bar{y}$, and. . .
3. . . . the point $(\bar{y}, \bar{x}_1, \bar{x}_2, \dots, \bar{x}_K)$ lies on the $p(x_i) = x_i^T b$ hyperplane.

Thanks to the above properties, the inclusion of a constant term into the model allows one to employ a common criterion for the evaluation of the Least Squares' goodness of fit, defined as the extent by which the linear combination $\hat{y}_i = x_i^T b$ explains, in a statistical sense, the variation of the dependent variable $y_i$. This criterion is called the coefficient of determination $R^2$ and is defined as:
$$R^2 = \frac{\text{ESS}}{\text{TSS}} = \frac{\sum_{i=1}^{N}(\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{N}(y_i - \bar{y})^2} \in [0, 1] \tag{7.35}$$
where the term in the numerator relates to the variance of the fitted values (note that $\frac{1}{N}\sum_{i=1}^{N}\hat{y}_i = \bar{y}$ because of the inclusion of a constant term):
$$\text{ESS} \equiv \sum_{i=1}^{N}(\hat{y}_i - \bar{y})^2 = b^T X^T D X b$$

the above is called the Explained Sum of Squares (ESS). The expression in the denominator, on the other hand, corresponds with the overall empirical variance of $Y_i$ (that is, the sample variance of the observations $y_i$):
$$\text{TSS} \equiv \sum_{i=1}^{N}(y_i - \bar{y})^2 = y^T D y$$

and is called instead Total Sum of Squares (TSS). The difference between
these two quantities is called Residual Sum of Squares (RSS).
$$\text{RSS} \equiv \text{TSS} - \text{ESS} = \sum_{i=1}^{N} e_i^2 = e^T e$$


To see that the RSS equals the sum of the squared residuals, observe first that with the inclusion of a constant term into the model the mean of the residuals themselves is zero, hence $De = e$ and
$$\underbrace{y^T D y}_{=\text{TSS}} = (Xb + e)^T D (Xb + e) = \underbrace{b^T X^T D X b}_{=\text{ESS}} + \underbrace{e^T e}_{=\text{RSS}}$$
follows from the fact that $\hat{y} = Xb$ and $e$ are orthogonal by construction.⁷ Consequently, the coefficient of determination $R^2$ can also be written as:
$$R^2 = 1 - \frac{\text{RSS}}{\text{TSS}} = 1 - \frac{\sum_{i=1}^{N} e_i^2}{\sum_{i=1}^{N}(y_i - \bar{y})^2} \in [0, 1]$$

intuitively, this coefficient is close to 1 if the projection explains the overwhelming majority of the variation in $Y_i$, while it is close to 0 in the opposite case where the explanatory variables relate to only a small portion of it. To better appreciate this, observe that
$$b^T X^T D X b = \hat{y}^T D \hat{y} = \hat{y}^T D (y - e) = \hat{y}^T D y$$
and therefore:
$$R^2 = \frac{\hat{y}^T D \hat{y}}{y^T D y} = \frac{\hat{y}^T D \hat{y}}{y^T D y} \cdot \underbrace{\frac{\hat{y}^T D y}{\hat{y}^T D \hat{y}}}_{=1} = \frac{\left(\hat{y}^T D y\right)^2}{\left(\hat{y}^T D \hat{y}\right)\left(y^T D y\right)} = \frac{\left[\sum_{i=1}^{N}(y_i - \bar{y})(\hat{y}_i - \bar{y})\right]^2}{\left[\sum_{i=1}^{N}(\hat{y}_i - \bar{y})^2\right]\left[\sum_{i=1}^{N}(y_i - \bar{y})^2\right]}$$

that is, the $R^2$ coefficient is equal to the square of the correlation coefficient between $y_i$ and the fitted values $\hat{y}_i$ (hence its name).
7 Another way to see this is:
$$\sum_{i=1}^{N}(y_i - \bar{y})^2 = \sum_{i=1}^{N}(y_i - \hat{y}_i + \hat{y}_i - \bar{y})^2 = \sum_{i=1}^{N} e_i^2 + \sum_{i=1}^{N}(\hat{y}_i - \bar{y})^2$$
noting that $\sum_{i=1}^{N}(y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) = \sum_{i=1}^{N}(y_i - \hat{y}_i)\hat{y}_i = \sum_{k=1}^{K} b_k \sum_{i=1}^{N} x_{ik}(y_i - \hat{y}_i) = 0$ follows from the normal equations.


A couple of warnings about the interpretation of $R^2$ in practical applications are in order. First, this coefficient is not an absolute measure of a
projection’s overall “quality.” In fact, the size of the variances of both the
dependent and explanatory variables – as well as the statistical relationship
between those – are specific to every particular empirical setting. Second,
one must be careful even when comparing the R2 from different projections
applied to the same setting or even dataset. The reason is that as it is easy
to see, this coefficient increases mechanically with the inclusion of each ad-
ditional explanatory variable into the model.8 A measure that takes the
last observation into account is the adjusted $R^2$ coefficient, written as:
$$\bar{R}^2 = 1 - \frac{\sum_{i=1}^{N} e_i^2}{\sum_{i=1}^{N}(y_i - \bar{y})^2} \cdot \frac{N - 1}{N - K} = 1 - \left(1 - R^2\right)\frac{N - 1}{N - K} \tag{7.36}$$
clearly, this variation of the coefficient of determination includes a term for
the total number of explanatory variables included into the model; if the
contribution of one of these variables towards the explained sum of squares
is negligible, the adjusted R2 might decrease or even turn negative. Finally,
it should go without saying that the above observations and interpretations of the R2 coefficient are only valid if the linear model features a constant term.
The R2 coefficient can be computed by computer packages even for models
lacking a constant term; however, since the calculated residuals would not
sum up to zero, the resulting coefficient cannot be compared to that from
a model featuring a constant (again, the calculated R2 can be negative).
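The goodness-of-fit identities above can be sketched numerically (simulated data; the coefficients and seed below are arbitrary): the three ways of computing $R^2$ agree, the decomposition TSS = ESS + RSS holds, and the adjusted version (7.36) is never larger than the plain $R^2$.

```python
import numpy as np

rng = np.random.default_rng(9)   # illustrative seed
N, K = 200, 4                    # K includes the constant term
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
y = X @ np.array([2.0, 1.0, 0.0, -0.5]) + rng.normal(size=N)

b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b
e = y - y_hat

TSS = np.sum((y - y.mean()) ** 2)
ESS = np.sum((y_hat - y.mean()) ** 2)   # valid because the model has a constant
RSS = np.sum(e ** 2)

R2 = ESS / TSS
R2_alt = 1 - RSS / TSS
R2_corr = np.corrcoef(y, y_hat)[0, 1] ** 2    # R² as a squared correlation
R2_adj = 1 - (1 - R2) * (N - 1) / (N - K)     # adjusted R², as in (7.36)
```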
Some of the results and intuitions developed with the support of linear
algebra can be better appreciated by putting them in practical context. It is
thus helpful to review the initial examples of linear economic relationships.
Example 7.4. The Keynesian consumption function, revisited. A simple linear fit of the relationship from Example 7.1 is displayed in Figure 7.5. The slope of the line, which is meant to evaluate the parameter for the marginal propensity of consumption $c_1$, is calculated to be about $c_1 \simeq 0.80$. In this particular case, the $R^2$ coefficient is, staggeringly, virtually equal to 1! However, nowadays macroeconomists give little weight to a result like this.
8 It can be shown that this increase is a function of the partial correlation $\rho_{YS}^{*}$ between the dependent variable $Y$ and some given newly added variable $S$:
$$R_1^2 = R_0^2 + \left(1 - R_0^2\right)\left(\rho_{YS}^{*}\right)^2$$
where $R_0^2$ and $R_1^2$ are the two coefficients of determination calculated respectively prior to and after the inclusion of $S$.

7.4. Evaluation of Least Squares

[Figure: scatter of US nondurable consumption C_i (1996 $ trillions) against
disposable income Y_i (1996 $ trillions), annual observations for 1959–1995,
with the fitted line: c_1 ≈ 0.80, R2 = 0.999.]

Figure 7.5: Fitted Keynesian Consumption Function

In fact, econometric research has shown that strong linear fits of this sort
are standard properties of macroeconomic time series; intuitively, they are
due to the co-movement of variables driven by some other factors that are
possibly unaccounted for by the model. Therefore, the estimate c1 ≈ 0.80 can
hardly be interpreted as the average increase in aggregate consumption that
follows from an increase in a country's GDP.

Example 7.5. Human capital and wages, revisited. A simple linear fit
of the relationship between the logarithmic wage and the education variable
from example 7.2 returns a slope coefficient b ≈ 0.064 and an R2 coefficient
equal to 0.076 (see Figure 7.6, top panel). This does not imply that such a
relationship is meaningless, quite the contrary! By enriching the model with
a quadratic polynomial in the experience variable, as in the proper Mincer
equation (7.6), one would obtain a higher slope coefficient, up to b′ ≈ 0.105,
while the R2 coefficient increases too, as expected, up to about 0.175. Note
that here K is small relative to the sample size (the selected cross section of
the original dataset has size N = 1,241), hence the adjusted R2 is virtually
identical to the standard R2 in both calculations.


[Figure, two panels. Top: scatter of the logarithm of the annual wage (1987 $)
against years of education S_i, with fitted line b ≈ 0.064 and R2 = 0.076
(X_i omitted). Bottom: scatter of log wages against the residuals from the
projection of S_i on X_i and X_i², with fitted lines for the full model
(b′ ≈ 0.105, R2 = 0.175) and for the model with X_i omitted (b ≈ 0.064).]

Figure 7.6: Fitted Mincer Equation, two versions


How are these changes in the output of Least Squares to be interpreted?


First, one can visualize the Frisch-Waugh-Lovell theorem at work through
the bottom panel of Figure 7.6, which represents the linear fit between log
wages and the residuals obtained from projecting education on experience
and its square (plus a constant vector ι); by partialing out the contribution
of the polynomial in experience, the slope attributed to the education variable
is to be interpreted in terms of partial correlation. The socio-economic
intuition that explains the increase in the evaluated returns to schooling is
as follows: without appropriately incorporating experience into the model,
the coefficient for education is dragged down, because the correlation between
education and experience is mechanically negative (by studying for longer,
one enters the labor market later) while experience itself positively affects
labor market outcomes like wages. Second, the two measured values for R2
suggest that the variation of individual wages is explained by several factors,
of which education and experience are two prominent ones; however, a large
portion of this variation is due to other, idiosyncratic characteristics of
individuals (like their ability, their personal connections, or simply their
luck in life) that are unaccounted for by the model.

Through these two examples it is possible to draw one final observation


about the R2 coefficient: while certainly useful to evaluate “goodness of fit,”
it can be a poor criterion for identifying socio-economic explanations about
phenomena of interest. Instead, the issue of carefully selecting explanatory
variables appears to be one of more immediate importance.
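The partialing-out logic described above can also be checked numerically. The following Python sketch (simulated data; all names and coefficient values are illustrative assumptions) verifies the Frisch-Waugh-Lovell result: the Least Squares coefficient on education from the full fit equals the slope of a fit of log wages on the residuals from projecting education on the remaining regressors:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 500
exper = rng.uniform(0, 30, N)
educ = 14 - 0.2 * exper + rng.normal(size=N)   # education correlated with experience
logw = (1.0 + 0.10 * educ + 0.03 * exper - 0.0005 * exper**2
        + rng.normal(scale=0.3, size=N))

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

iota = np.ones(N)
X_full = np.column_stack([iota, educ, exper, exper**2])
b_full = ols(X_full, logw)                     # coefficient on educ: b_full[1]

# FWL: residualize educ on the other regressors, then fit logw on the residuals
X_rest = np.column_stack([iota, exper, exper**2])
s_tilde = educ - X_rest @ ols(X_rest, educ)
b_fwl = (s_tilde @ logw) / (s_tilde @ s_tilde)
print(b_full[1], b_fwl)                        # identical up to rounding
```

The equality is exact in any sample, since the residualized regressor is orthogonal to the partialed-out columns by construction.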

7.5 Least Squares and Linear Regression


Throughout this introduction to Least Squares, the magic word "regression"
has never been used. In fact, Least Squares has been treated largely within
a merely algebraic framework, albeit one initially motivated by a prediction
problem. With this framework at hand, it is now convenient to examine the
implications of augmenting the postulated linear relationship with some
distributional assumptions. Suppose that the CEF that generates the data
is effectively linear:

E[Y_i | x_i] = x_i^T β_0    (7.37)

where β0 represents the supposed “true” vector of parameters from which


the data are generated. A linear model like (7.1), when enriched with this
hypothesis about the joint distribution of (xi , Yi ), is called a linear regres-
sion model, xi are called the regressors, and Yi is called the regressand.


An implication of (7.37) is that if β = β_0, the expectation of the error
terms in (7.1), conditional on the explanatory variables x_i, is zero:

E[ε_i | x_i] = E[Y_i − x_i^T β_0 | x_i]
             = E[Y_i | x_i] − x_i^T β_0
             = 0

which, by the Law of Iterated Expectations, implies:

E[x_i ε_i] = E_x[E[x_i ε_i | x_i]]
           = E_x[x_i · E[ε_i | x_i]]
           = 0

(the opposite is not true, that is, E[x_i ε_i] = 0 does not imply E[ε_i | x_i] = 0).
By the standard properties of probability limits it can be shown that:9

b_N = (Σ_{i=1}^N x_i x_i^T)^{-1} Σ_{i=1}^N x_i y_i →^p E[x_i x_i^T]^{-1} E[x_i Y_i] = β*

so long as the matrices N^{-1} Σ_{i=1}^N x_i x_i^T and E[x_i x_i^T] have full rank. This means
that:

x_i^T b_N →^p x_i^T β* = p*_y(x_i = x_i)

that is, the Least Squares projection converges in probability to the optimal
linear predictor (7.12); this should not be too surprising, since such a result
generally holds for sample analogs of population moments. What is relevant
here is that under hypothesis (7.37), the optimal linear predictor coincides
with the (linear) CEF for any given realization x_i = x_i:

p*_y(x_i = x_i) = x_i^T β* = x_i^T E[x_i x_i^T]^{-1} E[x_i Y_i]
               = x_i^T E[x_i x_i^T]^{-1} E[x_i E[Y_i | x_i]]
               = x_i^T E[x_i x_i^T]^{-1} E[x_i x_i^T] β_0
               = x_i^T β_0
               = E[Y_i | x_i = x_i]

where the second line once again exploits the Law of Iterated Expectations.
In light of Theorems 7.1 and 7.2, this result should not be too surprising.
9
Compare with the analysis and the proof of consistency conducted for the bivariate
case in Example 6.3, Lecture 6. Also note that the sequence b_N, whose probability limit
is taken, is defined here in terms of realizations x_i and y_i. This notation is conventional
in the analysis of econometric estimators.


The implication of both observations is that, if the CEF is linear, the cor-
responding Least Squares solution coincides asymptotically with the "true"
parameters of the regression model:

b_N →^p β_0    (7.38)
This property motivates the use of Least Squares as a statistical or econo-
metric estimator of the linear regression model, an estimator that takes
the name of Ordinary Least Squares (OLS), where ‘ordinary’ is meant to
distinguish the baseline estimator from its variations or extensions. Result
(7.38) is re-framed later as the consistency property of the OLS estimator.
In what follows, an array of motivations is given for the use of the linear
regression model in practical contexts.
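The convergence in (7.38) is easy to visualize by simulation. A minimal Python sketch (all parameter values are illustrative assumptions) computes the Least Squares solution for growing sample sizes under a linear CEF with mean-independent errors:

```python
import numpy as np

rng = np.random.default_rng(2)
beta0 = np.array([1.0, -2.0, 0.5])       # "true" parameter vector

def b_N(N):
    """Least Squares solution from a simulated sample of size N."""
    X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
    y = X @ beta0 + rng.normal(size=N)   # E[eps | x] = 0 by construction
    return np.linalg.solve(X.T @ X, X.T @ y)

for N in (100, 10_000, 1_000_000):
    print(N, np.linalg.norm(b_N(N) - beta0))   # the distance tends to shrink with N
```

Each run is random, but the distance from β_0 is of order N^{-1/2}, consistently with the asymptotic theory developed in Lecture 8.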

Linear Regression, indeed


If the researcher can confidently assume that the CEF of the relationship
under analysis is indeed linear, by (7.38) the Least Squares estimator for the
linear regression model is the most natural choice. However, the researcher
must be careful to correctly specify the linear model, and to include all the
variables that might be correlated with the relevant explanatory variables
of interest. In fact, if the true CEF is linear but, unlike (7.37), it reads:
E[Y_i | x_i, s_i] = x_i^T β_0 + s_i^T δ_0    (7.39)
that is, it is only linear conditional on some additional variables s_i, unac-
counted for by the researcher, which enter the CEF with associated "true"
parameters δ_0 ≠ 0, then the probability limit of Least Squares reads:
b_N →^p E[x_i x_i^T]^{-1} E[x_i Y_i]
     = E[x_i x_i^T]^{-1} E[x_i E[Y_i | x_i, s_i]]
     = E[x_i x_i^T]^{-1} E[x_i x_i^T β_0 + x_i s_i^T δ_0]
     = β_0 + E[x_i x_i^T]^{-1} E[x_i s_i^T] δ_0

and it coincides with β_0 only if E[x_i s_i^T] = 0, that is, if the variables x_i and s_i are
uncorrelated.10 The intuition is better developed when s_i = S_i is a single
variable (with associated parameter δ_0 in the CEF):

b_N →^p β_0 + δ_0 E[x_i x_i^T]^{-1} E[x_i S_i]    (7.40)
yielding the (in)famous formula for the omitted variable bias.
10
Once again, the second line exploits the Law of Iterated Expectations, where the
outer expectation in E[x_i E[Y_i | x_i, s_i]] is taken with respect to both x_i and s_i.


Expression (7.40) indicates that the omission of a relevant explanatory
variable affects the probability limit of the calculated Least Squares coef-
ficients, which are shifted by a term that equals δ_0 multiplied by the
coefficients of the population projection of S_i on x_i. For instance, should
experience be omitted from the estimated Mincer equation, this "bias term"
would be negative: while the contribution of experience to log wages is likely
positive, its correlation with education is mechanically negative. Ability, on
the other hand, is a difficult variable to observe, and its omission from the
estimated Mincer equation is likely to affect the education coefficient with
a positive asymptotic bias, as more skilled individuals are likely to attain
more education and also to earn more money.

Non-Linear Models and Regression


Many relationships of interest between socio-economic variables are likely to
be non-linear. However, the linear regression framework is flexible enough
to accommodate many of these. In fact, so long as a non-linear model can be
transformed to be linear in the parameters, it can be treated econometri-
cally like a linear model. This is better illustrated with some examples.
Example 7.6. A log-lin model. The model of human capital discussed
in example 7.2 is non-linear in the variables, but it is easily shown that it is
linear in the parameters. Taking logarithms on both sides of (7.5) returns
the so-called Mincer equation (7.6), which is rewritten here for ε_i defined as
the sum of α_i and the idiosyncratic disturbance:

log W_i = β_0 + β_1 X_i + β_2 X_i² + β_3 S_i + ε_i    (7.41)

which is a famous case of a log-lin model: that is, logarithmic on the left-
hand side, linear on the right-hand side. In a log-lin model, the regression
coefficients associated with the regressors entering linearly in the equation
measure the semi-elasticity of the dependent variable relative to the re-
gressor in question: specifically, the relative increase in the former following
from a unitary increase in the latter. For example, the estimated coefficient
for education b′ ≈ 0.105 from Figure 7.6 indicates that wages are expected
to increase by about 10.5% if an individual acquires one additional year of
education. It is quite easy to see that OLS would estimate the β parameters
of the model consistently, so long as E[ε_i | X_i, S_i] = 0 holds – an assumption
that can be quite problematic in the analysis of the returns to education.
Example 7.7. A log-log model. A common model used by economists for
describing the production process is the Cobb-Douglas production function:
Y_i = A_i K_i^{β_K} L_i^{β_L}    (7.42)


where Yi is total output, Ki is the capital input, Li is the labor input, Ai is


total factor productivity, and the unit of analysis can range from plants to
countries. Taking logarithms on both sides results in an adequately linear-
in-the-parameters equation:
log Yi = α + βK log Ki + βL log Li + ωi (7.43)
where log A_i = α + ω_i: here ω_i is an unobserved productivity shock, with
E[ω_i] = 0. Any transformed equation of this kind is called a log-log model
(that is, logarithmic on both sides), and coefficients such as β_K and β_L are
interpreted as elasticities, that is, the ratio between the relative variations
of the dependent variable and of the associated regressor that is implied by the
model. For example, β_L = 0.6 means that a 1% increase in the labor input
induces, on average, a 0.6% increase in total output. Note that OLS would
estimate the elasticity parameters consistently under E[ω_i | K_i, L_i] = 0,
which is another problematic assumption in econometric analysis.
In other cases, however, an adequate transformation is not possible.
Example 7.8. Distance Decay. Suppose one is interested in studying
how knowledge spillovers generated by universities affect the productivity
of local firms, and is modeling the phenomenon by augmenting the linearized
Cobb-Douglas function (7.43) as follows:
log Yi = α + βK log Ki + βL log Li + δ exp (−λDi ) Ui + ωi (7.44)
where Ui is the size of some local university, while Di is its distance from
firm i. Here, parameter δ measures the semi-elasticity of firm productivity
with respect to the university’s size, weighted by the distance decay factor
exp (−λDi ) as parametrized by λ. In this model, the two parameters δ and
λ enter non-linearly in the equation, and irreducibly so.
The linear regression model and Ordinary Least Squares cannot be applied
to a structural equation such as (7.44); however, they can be adapted for
this purpose. The Non-Linear Least Squares (NLLS) estimator is often
employed to address these types of problems, and is introduced in
subsequent lectures. Other times, instead, a researcher is simply uncertain
about the exact shape of the structural relationship and of the corresponding
CEF that generate the data. In those circumstances, using a linear regression
as the baseline model is most often a sensible idea: in fact, the Least Squares
projection is known to converge in probability to the population projection,
which by Theorem 7.2 is known to be the best approximation to the true
CEF in the MSE sense. While this should not be used as an excuse to give
up on the search for the right specification, a good approximation is a
sensible point of departure for the analysis of complex economic relationships.
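As a preview of how such irreducibly non-linear parameters can nonetheless be estimated, the Python sketch below (simulated data; all parameter values are made-up assumptions, and this is not the general NLLS algorithm treated in later lectures) exploits the fact that, for a fixed λ, model (7.44) is linear in the remaining parameters: it concentrates them out via OLS and minimizes the resulting sum of squared residuals over a grid for λ:

```python
import numpy as np

rng = np.random.default_rng(4)
N = 2000
logK = rng.normal(3, 1, N)
logL = rng.normal(2, 1, N)
U = rng.uniform(1, 10, N)                  # university size (illustrative)
D = rng.uniform(0, 5, N)                   # distance to the university
lam_true, delta_true = 0.7, 0.4
logY = (1.0 + 0.3 * logK + 0.6 * logL
        + delta_true * np.exp(-lam_true * D) * U
        + rng.normal(scale=0.1, size=N))

def ssr(lam):
    """Concentrated sum of squared residuals: OLS on the linear parameters given lambda."""
    X = np.column_stack([np.ones(N), logK, logL, np.exp(-lam * D) * U])
    b = np.linalg.lstsq(X, logY, rcond=None)[0]
    e = logY - X @ b
    return e @ e, b

grid = np.linspace(0.05, 2.0, 400)
lam_hat = min(grid, key=lambda lam: ssr(lam)[0])
delta_hat = ssr(lam_hat)[1][3]
print(lam_hat, delta_hat)                  # close to the assumed (0.7, 0.4)
```

Concentrating out the parameters that enter linearly is a common simplification when only one parameter is genuinely non-linear; general-purpose NLLS routines iterate over all parameters at once.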


Groups and Dummies


The linear regression model is especially useful for handling grouped data,
that is, samples that are drawn from populations which can be partitioned
into identifiable sub-populations. If the distribution of the dependent vari-
able Yi is expected to be heterogeneous across such sub-populations, it is
convenient to account for this heterogeneity in regression analysis; typically,
this is done through devices known as dummy variables. The latter are
simply explanatory variables that equal one (Di = 1) if the i-th observation
belongs to a specific group of interest, and zero (Di = 0) otherwise.
To illustrate how dummy variables operate, consider the simple bivariate
model
Yi = π0 + π1 Di + ηi (7.45)
where Yi is some outcome of interest, Di is a dummy variable that identifies
some group of interest (e.g. females, blacks, foreigners, young people) and
η_i is an error term. The vector of Least Squares coefficients (p_0, p_1) which
is obtained through a sample {y_i, d_i}_{i=1}^N is calculated as:

(p_0, p_1)^T = [N, N_D; N_D, N_D]^{-1} (N ȳ, N_D ȳ_D)^T
            = (1/(N − N_D)) · (N ȳ − N_D ȳ_D, −N ȳ + N ȳ_D)^T
            = (1/(N − N_D)) · (N ȳ − N_D ȳ_D, (N − N_D) ȳ_D − N ȳ + N_D ȳ_D)^T
            = (ȳ_\D, ȳ_D − ȳ_\D)^T

where N_D is the number of observations with d_i = 1, ȳ is the grand average
of y_i in the sample, ȳ_D ≡ N_D^{-1} Σ_{i=1}^N y_i d_i is the average of y_i in the "dummy"
group with d_i = 1, whereas

ȳ_\D ≡ (1/(N − N_D)) Σ_{i=1}^N y_i (1 − d_i) = (N ȳ − N_D ȳ_D)/(N − N_D)

is the average of y_i in the complementary group with d_i = 0. By the Laws
of Large Numbers and the properties of linear projections it follows that:

(p_0, p_1)^T = (ȳ_\D, ȳ_D − ȳ_\D)^T →^p (E[Y_i | D_i = 0], E[Y_i | D_i = 1] − E[Y_i | D_i = 0])^T = (π_0, π_1)^T

endowing the two regression parameters π_0 and π_1 with a clear interpreta-
tion in terms of group-specific population averages.
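This algebra holds exactly (not just asymptotically) in any sample, which a quick Python check with simulated data confirms (the group indicator and outcome are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
N = 300
d = (rng.uniform(size=N) < 0.4).astype(float)   # membership dummy
y = 1.0 + 0.5 * d + rng.normal(size=N)

X = np.column_stack([np.ones(N), d])
p0, p1 = np.linalg.solve(X.T @ X, X.T @ y)      # Least Squares on constant + dummy

y_not_d = y[d == 0].mean()                      # group average, d_i = 0
y_d = y[d == 1].mean()                          # group average, d_i = 1
print(p0 - y_not_d, p1 - (y_d - y_not_d))       # both differences are zero
```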


A similar result can be obtained through the following – perhaps simpler
– alternative model, with two dummy variables and no constant term:

Y_i = π_1′ D_i + π_2′ (1 − D_i) + η_i′    (7.46)

and it is even easier to show that:

(p_1′, p_2′)^T = (ȳ_D, ȳ_\D)^T →^p (E[Y_i | D_i = 1], E[Y_i | D_i = 0])^T = (π_1′, π_2′)^T
with an even more straightforward interpretation. Observe, however, that
it is impossible to run a model like (7.46) with the addition of a constant
term:

Y_i = π_0″ + π_1″ D_i + π_2″ (1 − D_i) + η_i″    (?)

because no unique vector of Least Squares coefficients can possibly be computed.
The reason is that the columns of the regressors matrix X are by construc-
tion linearly dependent, implying that the 3 × 3 matrix X^T X is singular.
This problem, which generalizes to higher dimensions, is popularly
known as the "dummy variable trap."
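The trap is easy to exhibit numerically: with a constant and both dummies, the first column of X equals the sum of the other two, so X^T X has no inverse. A minimal Python check (simulated group indicator):

```python
import numpy as np

rng = np.random.default_rng(6)
N = 100
d = (rng.uniform(size=N) < 0.5).astype(float)
X = np.column_stack([np.ones(N), d, 1.0 - d])   # constant + both dummies

# the first column equals the sum of the other two, so rank(X) = 2 < 3,
# and the 3x3 Gram matrix X'X is singular (zero determinant up to rounding)
rank = np.linalg.matrix_rank(X)
det = np.linalg.det(X.T @ X)
print(rank, det)
```

Dropping either the constant or one of the two dummies restores full column rank and a unique Least Squares solution.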
Another important observation about the two dummy variable models
(7.45) and (7.46) is that no statistical assumption was necessary in order to
attribute to them an interpretation as group means (both in the sample and,
asymptotically, in the population). This can be generalized: suppose that
a population is partitioned into K non-overlapping groups, which may
also represent the intersections of more aggregate divisions; for example, a
partition that intersects two binary groups like "gender" and "age" is the set
G = {male&young, female&young, male&old, female&old}, with K = 4.
If a dummy variable X_ki is associated with each group for k = 1, . . . , K,
taking care that the dummy variable trap is avoided, it holds that:

E[Y_i | X_1i, . . . , X_Ki] = E[Y_i | x_i] = x_i^T π_0    (7.47)

for every dependent variable Y_i, and the parameters π_0 are interpreted as
the K group-specific means of Y_i. Such a model is called a fully saturated
regression; its primary use is in those statistical exercises that go by the
name of "Analysis of Variance" (ANOVA), whose objective is to examine
group differences in selected populations.
Dummy variables are routinely used in econometrics in order to account
for group heterogeneity. A dummy variable D_i can be added autonomously
to a regression model (so that the associated parameter is interpreted as a
group-specific shifter of the constant term) and also "interacted" with some
regressor of interest X_ik, that is, added as an additional regressor of the
form X_ik D_i (whose parameter is thus interpreted as a group-specific
variation of the contribution of X_ik to the CEF). An example follows.


Example 7.9. Human capital and wages – blacks and whites. Let
us return again to the analysis of returns to schooling, examining how they
may differ between the two major racial groups in the US: blacks and whites.
Figure 7.7 below helps visualize the racial split in the cross-sectional excerpt
from Examples 7.2 and 7.5: blacks, who constitute about 33% of the sample,
appear more prevalent amongst the lower income brackets.
[Figure: scatter of the logarithm of the annual wage (1987 $) against years
of education S_i, with node colors marking race; four fitted lines are shown:
whites and blacks, each with and without the interaction term.]

Figure 7.7: (Log) wage and education in 1987; node colors match race

Consider the simple regression model:


log Wi = β0 + π0 Di + β1 Si + εi (7.48)
where experience is ignored for simplicity. The estimates are represented as
the two parallel, solid lines in Figure 7.7; the parameter π_0 is evaluated
as p_0 ≈ −0.283 and represents the average difference in log-wages between
the two groups across all education levels. Adding an "interaction term":

log W_i = β_0 + π_0 D_i + β_1 S_i + π_1 D_i S_i + ε_i    (7.49)

is equivalent to allowing for one regression line per group. The associated
estimates are displayed through the dashed lines in Figure 7.7, suggesting
that the earning gap tends to close at higher levels of education.


Regression and the CEF Derivative


The view and use of Linear Regression as an "approximation" of some un-
known CEF is also motivated by another property of Least Squares: that
they provide a good approximation to the average derivative of the CEF.
This is especially important in light of the relationship between the CEF
and the causal effects that are discussed later in Lecture 9. This property
was observed first by Yitzhaki (1996) in a bivariate setting; it was expanded
later by Angrist and Krueger (1999) to a multivariate model where the CEF
of some dependent variable Y_i given some regressor of interest S_i with
a continuous support X_S is unknown, but S_i is expected to depend linearly
upon a set of other explanatory variables x_i according to a linear CEF:

E[S_i | x_i] = x_i^T π_0    (7.50)
Relationship (7.50) above is satisfied, for example, in a fully saturated en-
vironment where xi represents a complete dummy variable partition of the
population. Denote the following derivative of the CEF as µ′_{Y|S,x}(s_i; x_i):

µ′_{Y|S,x}(s_i; x_i) ≡ ∂E[Y_i | S_i; x_i]/∂S_i |_{S_i = s_i}    (7.51)

which here measures the expected marginal increase of Y_i for observations
with S_i = s_i, as a function of x_i. If S_i has a discrete support, an analogous
definition in terms of discrete variation applies.

Observe that in a Least Squares fit of Y_i on x_i and S_i:

Y_i = x_i^T β_0 + δ_0 S_i + ε_i    (7.52)

the coefficient associated with education is the following variation of a par-
tial correlation coefficient:

δ̂_OLS = (s^T M_X y) / (s^T M_X s)    (7.53)
Furthermore, if the CEF of S_i conditional on x_i is linear, it is possible to
show through some algebraic manipulation that the population projection
coefficient associated with education S_i – the probability limit of (7.53) –
is the following ratio of conditional moments:

δ̂_OLS →^p δ* = Cov[Y_i, S_i | x_i] / Var[S_i | x_i]    (7.54)

By noting that E[S_i − E[S_i | x_i]] = 0, the above simplifies as follows:

δ* = E[Y_i (S_i − E[S_i | x_i])] / E[S_i (S_i − E[S_i | x_i])]    (7.55)


The property in question is that (7.55) is also equal to:

δ* = E_x[∫_{X_S} µ′_{Y|S,x}(s_i; x_i) φ(s_i; x_i) ds_i] / E_x[∫_{X_S} φ(s_i; x_i) ds_i]    (7.56)

hence it corresponds to the derivative µ′_{Y|S,x}(s_i; x_i) of the CEF, averaged
over the support of x_i, after having been weighted through the support of
S_i by the following term, which depends on s_i and varies with x_i:

φ(s_i; x_i) ≡ {E[S_i | S_i ≥ s_i, x_i] − E[S_i | S_i < s_i, x_i]} ×
              × {P(S_i ≥ s_i | x_i) [1 − P(S_i ≥ s_i | x_i)]}

The term φ(s_i; x_i) is hard to interpret, but intuitively it takes larger values
around the median of S_i – as an inspection of the formula suggests.
The original derivation of (7.56) given by Angrist and Krueger was initially
applied to a discrete S_i (say, years of education); with some manipulation
of integrals it can be proved for a continuous S_i too.11
11
The proof proceeds as follows. After having defined s* ≡ lim sup X_S, develop:

E[Y_i(S_i − E[S_i | x_i])] = E[E[Y_i | S_i, x_i](S_i − E[S_i | x_i])]
    = E[(∫_{X_S} µ′_{Y|S,x}(s_i) 1{s_i ≤ S_i} ds_i)(S_i − E[S_i | x_i])]
    = E_x[E_S[∫_{X_S} µ′_{Y|S,x}(s_i) 1{s_i ≤ S_i}(S_i − E[S_i | x_i]) ds_i | x_i]]
    = E_x[∫_{X_S} µ′_{Y|S,x}(s_i)(∫_{s_i}^{s*} (t − E[S_i | x_i]) f_{S|x}(t | x_i) dt) ds_i]

where the first and third lines are consequent to the Law of Iterated Expectations, the
second line follows from the Fundamental Theorem of Calculus, and the last obtains by
swapping the order of integration (t denotes the variable of the inner integral). In
addition, by standard properties of conditional moments:

E[S_i | x_i] = E[S_i | S_i ≥ s_i, x_i] P(S_i ≥ s_i | x_i) + E[S_i | S_i < s_i, x_i][1 − P(S_i ≥ s_i | x_i)]

and by repeated substitution one can verify that:

∫_{s_i}^{s*} (t − E[S_i | x_i]) f_{S|x}(t | x_i) dt = {E[S_i | S_i ≥ s_i, x_i] − E[S_i | x_i]} P(S_i ≥ s_i | x_i)
    = φ(s_i; x_i)

showing that the numerator of (7.56) is as stated above; a similar, simpler analysis also
applies to the denominator of (7.56).


Example 7.10. Human capital and wages: average extra return to


schooling. The usefulness of this interpretation rests on the understanding
of the φ (si ; xi ) weights, and is consequently context-specific. The top panel
from Figure 7.8 below reports the calculation of these weights for different
years of schooling Si , using the same data about education and wages from
previous examples; it shows how the weights are largest for those values of
Si between 12 and 16, that is between high school and college graduation.12
In this context, the weighting scheme attributes more weight to those values
of Si that are arguably most consequential for current education policy.

[Figure, two panels, by years of education S_i. Top: the average weights
φ(s_i; x_i) and the sample frequency of each value of S_i. Bottom: the
empirical CEF (average log W_i), its variation, and the regression line
(X_i omitted).]

Figure 7.8: Regression as the average derivative: returns to schooling

Here, µ′_{log W|S,x}(s_i; x_i) ≡ E[log W_i | S_i = s_i; x_i] − E[log W_i | S_i = s_i − 1; x_i]
is the discrete variation of the CEF which expresses the expected log-wage gain
for one extra year of education. The associated estimates are represented in the
bottom panel of Figure 7.8, along with the empirical (non-parametric) CEF
and the regression line. The average CEF variation, weighted by φ(s_i; x_i),
is about 0.064, equal to the regression slope that was found earlier.
12
All calculations replicate those from Angrist and Krueger (1999) with different data;
the results are similar. All averages in both panels are computed conditionally on S_i.
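In the bivariate discrete case (no covariates x_i), the weighted-average representation of the slope holds exactly in any sample, and can be verified directly. The Python sketch below (simulated schooling and wage data; all numbers are illustrative assumptions) computes the OLS slope of Y on S and rebuilds it as the φ-weighted average of the one-year increments of the empirical CEF:

```python
import numpy as np

rng = np.random.default_rng(7)
N = 5000
S = rng.integers(8, 17, N).astype(float)   # "years of schooling": integers 8..16
Y = 9.0 + 0.05 * S + 0.005 * (S - 8) ** 2 + rng.normal(scale=0.3, size=N)

Sc = S - S.mean()
slope = (Sc @ Y) / (Sc @ Sc)               # bivariate OLS slope of Y on S

vals = np.arange(9, 17)                    # values at which an increment is defined
phi = np.empty(len(vals))
dmu = np.empty(len(vals))
for j, s in enumerate(vals):
    above = S >= s
    p = above.mean()
    phi[j] = (S[above].mean() - S[~above].mean()) * p * (1 - p)
    dmu[j] = Y[S == s].mean() - Y[S == s - 1].mean()   # increment of the empirical CEF

slope_weighted = (dmu @ phi) / phi.sum()
print(slope, slope_weighted)               # identical up to floating point error
```

The identity follows from a discrete (Abel summation) version of the argument in the footnote above, applied to the empirical distribution of the sample.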

Lecture 8

Least Squares Estimation

Having developed some general statistical and practical motivations for the
use of Least Squares, this lecture examines the statistical properties of the
OLS estimator, which are instrumental to statistical estimation and infer-
ence. While both small and large sample properties are analyzed, the latter
are discussed first as the standard choice for use in empirical research, data
permitting. Finally, the lecture develops the implications of departures from
the assumption of independence between observations, along with the op-
tions available for performing reliable inference under those conditions.

8.1 Large Sample Properties


The starting point is the analysis of the large sample properties of OLS,
which rely on asymptotic results and thus – as their name suggests – a
sample size N that tends to infinity. These properties depend upon
a number of specific statistical assumptions, which are sequentially in-
troduced next by adapting the original assumptions from the treatment by
White (1980). The motivation and implications of all such assumptions are
discussed at length.
Assumption 1. Linearity. The data are generated by a linear model with
“true” parameter vector β0 .
y_i = x_i^T β_0 + ε_i    (8.1)
This assumption may seem obvious, but is necessary to rule out all kinds
of specification errors, such as mistaking the functional form that relates yi
and xi – in which case a linear regression model estimated via OLS may not
be the best econometric choice. In addition, it is conceptually important to
specify that there exists a “true” parameter vector β0 of interest.


To denote the OLS estimator, the notation β̂_OLS will be used through-
out this lecture and beyond. While the algebraic expression of the estimator
is identical to that of the Least Squares solution b in (7.20) and (7.22), the
above notation is preferred when the intention is to highlight that OLS is
being used as a proper statistical and econometric estimator. Under Assump-
tion 1, the OLS estimator can be decomposed as:

β̂_OLS = (Σ_{i=1}^N x_i x_i^T)^{-1} Σ_{i=1}^N x_i (x_i^T β_0 + ε_i)
      = β_0 + (N^{-1} Σ_{i=1}^N x_i x_i^T)^{-1} N^{-1} Σ_{i=1}^N x_i ε_i    (8.2)
or equivalently, in compact matrix notation, as follows:

β̂_OLS = (X^T X)^{-1} X^T (Xβ_0 + ε) = β_0 + (N^{-1} X^T X)^{-1} N^{-1} X^T ε    (8.3)

This decomposition turns out to be very useful throughout this analysis.
Assumption 2. Independently but not identically distributed data.
The observations in the sample {(y_i, x_i)}_{i=1}^N are independently, but not nec-
essarily identically, distributed (i.n.i.d.).

This assumption characterizes the data sample. Note how the conditions
imposed on it are less restrictive than those of most baseline statistical re-
sults, which usually require independently and identically distributed (i.i.d.)
observations. By allowing observations to be non-identically distributed, they
are permitted not only to have different absolute or conditional moments, but
also to be drawn from altogether different distributions. The independence
assumption remains problematic in many practical contexts, and econometric
solutions for scenarios where it likely fails are discussed in the last part of this
lecture.
Assumption 3. Moments and realizations of the regressors. The
regressors random vector x_i has a finite second moment, and for some δ > 0:

E[|X_ik X_iℓ|^{1+δ}] < ∞    (8.4)

for k, ℓ = 1, . . . , K and i = 1, . . . , N. In addition, its realizations x_i are
such that, for any two K × 1 vectors β′ and β″:

Xβ′ = Xβ″ if and only if β′ = β″    (8.5)

thus, X has full column rank and X^T X = Σ_{i=1}^N x_i x_i^T is nonsingular.


This assumption specifies the nature of the regressors X used for esti-
mation, and is composed of two parts. The first part is about its stochastic
properties. In fact, it implicitly allows the regressors to be actually stochas-
tic – which is not to be taken for granted, since in the classical treatment of
the linear regression model the regressors are assumed to be “fixed” (that is
identical in repeated samples, such as when regression is used to evaluate
some kind of experimental variation); in those classical treatments the only
random component of the model is the error term εi . Stochastic regressors
are assumed to have finite second (mixed) moments while conforming to
condition (8.4). All this implies that the following probability limit:
N^{-1} X^T X = N^{-1} Σ_{i=1}^N x_i x_i^T →^p lim_{N→∞} N^{-1} Σ_{i=1}^N E[x_i x_i^T] ≡ K_0    (8.6)

is a K × K matrix, written as K_0, which is of full rank K and as such invert-
ible (observe incidentally that when observations are identically distributed
this matrix takes the simpler expression K_0 = E[x_i x_i^T]). The second part


of the assumption states that the actual realizations of the regressors must
also satisfy an analogous invertibility condition. Recall that this condition
is necessary for the Least Squares solution to be unique; it rules out issues
such as the dummy variable trap.1
Assumption 4. Exogeneity. Conditional on the regressors xi , the error
term εi has mean zero (with typical terminology, it is mean-independent
of the regressors xi ).
E [εi | xi ] = 0 (8.7)
This is the all-important assumption against which one's estimates are
evaluated, since it is the crucial one for obtaining consistency of the OLS
estimator. As already observed in the previous lecture, (8.7) amounts
to assuming that the CEF is indeed linear in x_i, and it implies E[x_i ε_i] = 0,
hence:

N^{-1} X^T ε = N^{-1} Σ_{i=1}^N x_i ε_i →^p 0    (8.8)

if the conditions for the application of an appropriate Law of Large Numbers
are satisfied by the other assumptions, and so the residual element that adds
to β_0 on the right-hand side of (8.2) and (8.3) vanishes asymptotically. The
intuition is that, since x_i provides no information on ε_i, any variation
in Y_i associated with a variation in x_i must be due, on average, to x_i alone.
1
As Lecture 9 clarifies, this is a standard identification condition specific to the OLS
estimator.


Like much of the econometric terminology, the name "exogeneity" for this
assumption originates with the analysis of Simultaneous Equations Models
(SEMs; see Lectures 9 and 10), although a more appropriate name is the
longer (and hence less popular) mean independence of the error term.
The motivation for the shorter name is best understood later in the context
of the analysis of SEMs. Those frequent scenarios where this condition
might fail are reviewed in later lectures.
Assumption 5. Heteroscedastic, Independent Errors. The variance of the error term ε_i conditional on x_i is left unrestricted (heteroscedasticity). Since observations are independent, the conditional covariance between the error terms of two different observations i, j = 1, . . . , N is zero.

E[ε_i^2 | x_i] = σ^2(x_i) ≡ σ_i^2        (8.9)

E[ε_i ε_j | x_i, x_j] = 0        (8.10)

In addition, for some δ > 0 the following holds for all i = 1, . . . , N.

E[(ε_i^2)^(1+δ)] < ∞        (8.11)

The above is written in compact matrix notation as follows.

Σ ≡ E[εε^T | X] = diag(σ_1^2, σ_2^2, . . . , σ_N^2)        (8.12)
This assumption specifies the second moments of the error term ε_i. First, the conditional variances are allowed to vary on the support of x_i, a feature known in econometrics as heteroscedasticity. Heteroscedasticity is a natural property of most empirical settings of interest in economics and in the social sciences more generally. For example, an inspection of Figure 7.2 from the previous lecture suggests that log-wages are differentially dispersed for individuals with different levels of education, in a way that might not be explained solely by regressors such as education, and hence must be due to some inherent variation of the other “residual” factors (the error term). The circumstance where σ^2(x_i) = σ_0^2 is independent of x_i, and thus identical for all observations i = 1, . . . , N:

Σ = E[εε^T | X] = E[εε^T] = σ_0^2 I        (8.13)

is called homoscedasticity and must be seen as an exceptional case. In the classical analysis of the linear regression model, instead, homoscedasticity is traditionally adopted as a working assumption.
Assumption 5 entails some additional conditions. Actually, the condition that the cross-observation conditional covariance of the errors is zero is not technically part of Assumption 5, since this property follows directly from Assumption 2 (independent observations); yet it is useful to state it here for a better understanding of (8.12). The usefulness of property (8.11) is clarified below.
Assumption 6. Moments of x_i ε_i. For i = 1, . . . , N, the matrix E[ε_i^2 x_i x_i^T] exists, is finite, positive semi-definite and has full rank K. Furthermore, for some δ > 0, for i = 1, . . . , N, and for k, ℓ = 1, . . . , K, the following Ljapunov condition holds.

E[|ε_i^2 X_ik X_iℓ|^(1+δ)] < ∞        (8.14)
This assumption allows one to establish the asymptotic normality of OLS. To see this, hereinafter denote the following limiting variance by Ξ_0.

Ξ_0 ≡ lim_{N→∞} Var[ (1/√N) Σ_{i=1}^N x_i ε_i ]        (8.15)
Clearly, under Assumption 2 (independent observations) matrix Ξ_0 assumes a more straightforward expression:

Ξ_0 = lim_{N→∞} Var[ (1/√N) Σ_{i=1}^N x_i ε_i ] = lim_{N→∞} (1/N) Σ_{i=1}^N Var[x_i ε_i]
    = lim_{N→∞} (1/N) Σ_{i=1}^N E[ε_i^2 x_i x_i^T]        (8.16)

which is positive semi-definite and has full rank by Assumption 6. Note that if the observations were also identically distributed, then Ξ_0 = E[ε_i^2 x_i x_i^T]. An implication of (8.16) is that by some Law of Large Numbers:

(1/N) Σ_{i=1}^N ε_i^2 x_i x_i^T  →p  lim_{N→∞} (1/N) Σ_{i=1}^N E[ε_i^2 x_i x_i^T] = Ξ_0        (8.17)

and by the Ljapunov condition (8.14), the following Central Limit Theorem result holds too.

(1/√N) Σ_{i=1}^N x_i ε_i  →d  N(0, Ξ_0)        (8.18)
Finally, notice that in the special case of homoscedasticity, the variance of the error term is independent of the regressors, hence:

E[ε_i^2 x_i x_i^T] = E[ε_i^2] E[x_i x_i^T] = σ_0^2 E[x_i x_i^T]

for i = 1, . . . , N; this implies, by (8.6) and (8.17), that Ξ_0 = σ_0^2 K_0.
Having discussed all six of White’s Assumptions at length, proving the large sample properties of the OLS estimator is straightforward.
Theorem 8.1. The Large Sample Properties of the OLS Estimator. Under Assumptions 1-6 the OLS estimator is consistent, that is:

β̂_OLS  →p  β_0        (8.19)

and asymptotically normal, that is:

√N (β̂_OLS − β_0)  →d  N(0, K_0^{-1} Ξ_0 K_0^{-1})        (8.20)

hence its asymptotic distribution is, for a given N, as follows.

β̂_OLS  ∼ᴬ  N(β_0, (1/N) K_0^{-1} Ξ_0 K_0^{-1})        (8.21)
Proof. The consistency result (8.19) was in a way already proved in the previous lecture by exploiting the properties of the linear projection when the CEF is linear; under Assumptions 1-6 it can be alternatively seen by applying the probability limit (8.8) to the decomposition of the OLS estimator in (8.2). Regarding asymptotic normality, it follows from “rephrasing” the Central Limit Theorem result in (8.18) in terms of the random sequence:

√N (β̂_OLS − β_0) = ( (1/N) Σ_{i=1}^N x_i x_i^T )^{-1} (1/√N) Σ_{i=1}^N x_i ε_i

which, by Slutskij’s Theorem and the Cramér-Wold Device, gives – under Assumptions 1-6 – results (8.20) and (8.21).²
This result constitutes the motivation for the use of the OLS estimator, thanks to its consistency property, and for performing statistical tests (on its estimated parameters) that are based on the normal distribution and the associated test statistics. Just like all asymptotic results that follow from the Central Limit Theorem, (8.20) was derived regardless of the underlying
² This conclusion is but a special case of more general results (which themselves extend Theorems 6.8 and 6.17) about the asymptotic behavior of Method of Moments estimators for possibly non-i.i.d. data. As is expanded upon later, the OLS estimator can in fact be seen as the Method of Moments estimator based on the following moment conditions.

E[x_i ε_i] = E[x_i (Y_i − x_i^T β_0)] = 0

Clearly, the sample analogues of the above moment conditions are the K normal equations (7.19) which solve the Least Squares problem.
distribution that generates the sample, a result whose importance cannot be stressed enough. In the classical regression model, conversely, the analogous result is obtained under the assumption of normally distributed errors, which is very restrictive since – for example – it disqualifies the use of the linear regression model in those settings where the error terms are known to follow other distributions by construction (for example, when the dependent variable is discrete).
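The point can be illustrated with a short Monte Carlo simulation in numpy (the data-generating design, sample sizes and seed below are purely illustrative, not taken from the text): even with markedly skewed, heteroscedastic errors, the sampling distribution of the OLS slope concentrates around the true coefficient.

```python
import numpy as np

# Illustrative Monte Carlo: OLS with non-normal (centered chi-squared)
# and heteroscedastic errors; the design below is hypothetical
rng = np.random.default_rng(0)
beta0 = np.array([1.0, 2.0])      # true intercept and slope
N, reps = 500, 2000

draws = np.empty(reps)
for r in range(reps):
    x = rng.uniform(0.0, 2.0, N)
    X = np.column_stack([np.ones(N), x])
    eps = (rng.chisquare(3, N) - 3.0) * (0.5 + 0.5 * x)  # mean zero, skewed
    y = X @ beta0 + eps
    b = np.linalg.solve(X.T @ X, X.T @ y)  # OLS via the normal equations
    draws[r] = b[1]

# the slope estimates center on the true value 2.0, and a histogram of
# `draws` would look approximately Gaussian despite the skewed errors
print(draws.mean(), draws.std())
```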
A practical problem with employing the asymptotic variance in (8.21) for estimation and testing purposes is that it contains the generally unknown quantities K_0 and Ξ_0. However, the Laws of Large Numbers suggest a way to address the issue: to estimate these unknown expressions with the quantities that are known to asymptotically converge to them as per (8.6) and (8.17). This results in the following estimator of the asymptotic variance:

Âvar[β̂_OLS] = [ Σ_{i=1}^N x_i x_i^T ]^{-1} [ Σ_{i=1}^N (y_i − x_i^T β̂_OLS)^2 x_i x_i^T ] [ Σ_{i=1}^N x_i x_i^T ]^{-1}        (8.22)
or simply “robust” estimator of the OLS asymptotic variance.3 Note that
in the meat of this “sandwich expression,” the squared error terms ε2i are
substituted with the N squared residuals that are calculated with the actual
OLS estimates; this is not an exercise about estimating the residuals, but
an approach for consistent estimation of the limiting matrix Ξ0 . Since this
estimation problem has fixed dimension K (the number of estimated OLS
parameters), it does indeed converge in probability to the desired result as
N goes to infinity – a result due to Eicker (1967).
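In matrix terms, (8.22) is straightforward to compute directly; the following numpy sketch (the function name and simulated data are hypothetical) builds the “sandwich” from the squared OLS residuals:

```python
import numpy as np

def ols_robust(X, y):
    """OLS estimates with the heteroscedasticity-consistent
    (Huber-Eicker-White) variance estimator of equation (8.22)."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    e = y - X @ beta                        # OLS residuals
    meat = (X * e[:, None] ** 2).T @ X      # sum_i e_i^2 x_i x_i^T
    avar = XtX_inv @ meat @ XtX_inv         # the 'sandwich' formula
    return beta, avar

# illustrative data whose error variance increases with the regressor
rng = np.random.default_rng(1)
N = 1000
x = rng.uniform(1.0, 3.0, N)
X = np.column_stack([np.ones(N), x])
y = X @ np.array([0.5, 1.5]) + rng.normal(0.0, x)  # error sd equal to x
beta, avar = ols_robust(X, y)
se = np.sqrt(np.diag(avar))  # robust standard errors
```

The square roots of the diagonal of `avar` are the robust standard errors discussed below.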
While the “robust” formula should be the preferred default option in empirical research, some additional insights can be gained by assuming that the errors are actually homoscedastic. If the variance of the error terms is independent of the regressors, Ξ_0 = σ_0^2 K_0 holds and the asymptotic variance simplifies as:

β̂_OLS  ∼ᴬ  N(β_0, (σ_0^2/N) K_0^{-1})        (8.23)
which is consistently estimated by:

Âvar[β̂_OLS] = [ (1/N) Σ_{i=1}^N (y_i − x_i^T β̂_OLS)^2 ] [ Σ_{i=1}^N x_i x_i^T ]^{-1}        (8.24)
³ That name rose in popularity due to the robust option of many STATA commands. Observe that this option computes (8.22) and its extensions by applying a multiplicative degrees of freedom correction N/(N − K), although there is no theoretical basis for this.
therefore, the variance of the OLS estimator is inversely proportional to the variance of the regressors. Intuitively, if the linear dependence of Y_i on x_i is measured over a larger empirical support for the independent variables x_i, the estimated relationship would appear more credible – this intuition extends naturally to the heteroscedasticity-consistent “robust” formula (8.22). It is useful to observe how the formulas for the limiting (or asymptotic) variance of the OLS estimator differ, between the heteroscedastic and the homoscedastic case, in a way that resembles the parallel formulas for the bivariate regression model, which are examined in Example 6.7.
The estimation of the variance of the OLS estimates allows one to perform statistical inference about the linear regression model. Consider the simple case where the hypotheses of interest are specific to one parameter, as follows.⁴

H0 : β_k0 = c_k        H1 : β_k0 ≠ c_k
In this case, the statistic of interest is the following t-statistic:⁵

t_{H0} = (β̂_{k,OLS} − c_k) / √( Âvar[β̂_{k,OLS}] )  →d  N(0, 1)        (8.25)

where the expression in the denominator is the square root of the kk-th entry of the estimated asymptotic variance of the OLS estimates, also called the standard error of the k-th estimated parameter (standard errors are typically reported, along with the estimated coefficients, in the output of regressions performed by the main statistical computer packages).
After having estimated the whole variance-covariance matrix of the OLS estimates, it is possible to test hypotheses that involve multiple parameters. Consider, for example, the following L ≥ 1 linear hypotheses:

H0 : Rβ_0 = c        H1 : Rβ_0 ≠ c

where R is an L × K matrix of full row rank L, while c is an L × 1 vector. This setup affords great flexibility for recombining the K parameters into a set of multiple linear hypotheses. It is easy to verify that under the null hypothesis, the following asymptotic result holds.

√N (Rβ̂_OLS − c)  →d  N(0, R K_0^{-1} Ξ_0 K_0^{-1} R^T)
⁴ In most practical applications c_k = 0: these are tests about the significance of a particular regressor which is included in the model.
⁵ This denomination is traditional and derives from the classical linear regression model with normally distributed errors; technically, this is in fact a z-statistic.
Such L hypotheses are typically tested simultaneously through the so-called Wald statistic:

W_{H0} = (Rβ̂_OLS − c)^T [ R Âvar[β̂_OLS] R^T ]^{-1} (Rβ̂_OLS − c)  →d  χ^2_L        (8.26)

which is nothing else but a quite particular case of Hotelling’s t-squared statistic. Therefore, by Observation 6.2 the Wald statistic asymptotically follows a chi-squared distribution with L degrees of freedom. A variation of the Wald statistic can be adapted for testing multiple nonlinear hypotheses; however, nonlinear hypotheses are treated later as part of the more general discussion of tests in the M-Estimation framework (Lecture 11), of which the linear regression model is a particular case.
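A minimal numpy sketch of the Wald statistic (8.26) follows; the helper name and the simulated data are hypothetical, and the “robust” estimator of the previous pages stands in for Âvar:

```python
import numpy as np

def wald_stat(beta_hat, avar_hat, R, c):
    """Wald statistic of (8.26) for H0: R beta = c, given an estimate
    avar_hat of the variance of beta_hat (L = rows of R d.o.f.)."""
    diff = R @ beta_hat - c
    return float(diff @ np.linalg.inv(R @ avar_hat @ R.T) @ diff)

# illustrative joint test that the second and third coefficients are zero
rng = np.random.default_rng(2)
N = 2000
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
y = X @ np.array([1.0, 0.0, 0.0]) + rng.normal(size=N)  # H0 is true here
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
e = y - X @ beta
avar = XtX_inv @ ((X * e[:, None] ** 2).T @ X) @ XtX_inv  # robust variance
R = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
W = wald_stat(beta, avar, R, np.zeros(2))
# under H0, W is compared with chi-squared(2) critical values (5.991 at 5%)
```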

8.2 Small Sample Properties
Multiple references to a “classical” or “traditional” linear regression model have been made throughout this discussion. This model is characterized by fixed (non-stochastic) regressors, as well as spherical (homoscedastic) and normally distributed errors; the model’s statistical properties are evaluated in terms of exact moments (expectation and variance) of the OLS estimates. Thus, it allows one to perform statistical inference even in small samples, that is, when asymptotic properties do not apply. In the early days of econometrics this was especially important, since the availability of large economic datasets was limited and much of applied work revolved around the analysis of macroeconomic time series, cross-country regressions or other contexts typically characterized by small sample sizes N.
For both pedagogical and practical reasons (sometimes, one does need to work with small samples) it is worth examining the exact moments of the OLS estimator while the assumptions made so far – stochastic regressors, heteroscedasticity et cetera – are all maintained. It is quite convenient to perform this analysis using compact matrix notation. First, it is easy to see that the OLS estimator is unbiased; with compact notation:
E[β̂_OLS] = β_0 + E[(X^T X)^{-1} X^T ε]
         = β_0 + E_X[ E[(X^T X)^{-1} X^T ε | X] ]
         = β_0 + E_X[ (X^T X)^{-1} X^T E[ε | X] ]        (8.27)
         = β_0

where the last equality follows because E[ε | X] = 0;
that is, in expectation the OLS estimator returns the true value β_0. Note how the exogeneity (mean independence) assumption is instrumental for obtaining this result – just as in the case of consistency – and that using the Law of Iterated Expectations allows one to sidestep the fact that the regressors are stochastic. The conditional variance of the OLS estimator, given a specific realization of the regressors X, is calculated instead as follows.
Var[β̂_OLS | X] = E[ (X^T X)^{-1} X^T εε^T X (X^T X)^{-1} | X ]
              = (X^T X)^{-1} X^T E[εε^T | X] X (X^T X)^{-1}        (8.28)
              = (X^T X)^{-1} X^T Σ X (X^T X)^{-1}
In small samples it is more convenient to work with the conditional variance of the OLS estimator; the unconditional variance can be obtained by taking the corresponding expectation over the random matrix X. These results are not immediately applicable for performing statistical inference. To this end, the model needs to be augmented with some variations of the classical assumptions.
Assumption 7. Spherical Errors. The errors are homoscedastic, that is σ^2(x_i) = Var[ε_i | x_i] = σ_0^2, or equivalently Σ = E[εε^T | X] = σ_0^2 I.

Assumption 8. Conditionally Normal Errors. The error term follows, given a regressor matrix X, a conditionally normal distribution.

Together, Assumptions 7 and 8 can be expressed as follows.

ε | X ∼ N(0, σ_0^2 I)        (8.29)
Note that the regressors are still maintained stochastic. By Assumption 7, (8.28) can be rewritten as:

Var[β̂_OLS | X] = σ_0^2 (X^T X)^{-1}        (8.30)

a formula associated with a fundamental, celebrated result about the classical regression model.
Theorem 8.2. Gauss-Markov Theorem. Consider the linear regression model under Assumptions 1-7. Within the class of all linear, unbiased estimators, defined as:

B = { β̃ = B_0 y : E[B_0 y | X] = B_0 X β_0 + B_0 E[ε | X] = β_0 }

the OLS estimator is the element of B that yields the minimum variance estimate of any element of β_0, as well as of all possible linear combinations l^T β_0 of β_0, where l is a K × 1 vector.
Proof. By the definition of B and by Assumption 4, it follows that B_0 X = I for all estimators in B. Define B_1 ≡ B_0 − (X^T X)^{-1} X^T and observe that:

Var[β̃ | X] = B_0 E[εε^T | X] B_0^T
           = σ_0^2 [B_1 + (X^T X)^{-1} X^T] [B_1 + (X^T X)^{-1} X^T]^T
           = σ_0^2 B_1 B_1^T + σ_0^2 (X^T X)^{-1} X^T X (X^T X)^{-1}
           = σ_0^2 B_1 B_1^T + σ_0^2 (X^T X)^{-1}

where the third line follows from B_1 X = B_0 X − (X^T X)^{-1} X^T X = 0. Thus:

l^T Var[β̃ | X] l ≥ l^T Var[β̂_OLS | X] l

which proves the conditional (on X) version of the theorem; the unconditional version is easily obtained by taking the expectation over the random matrix X, of which X is a specific realization.
This result – which, note, has not required invoking Assumption 8 yet – is the one for which the OLS estimator deserves the denomination of Best Linear Unbiased Estimator (BLUE). In this phrase, “Best” must be interpreted in the sense of efficient, that is, of minimum variance. However, even in small samples this result is no longer valid when homoscedasticity does not hold, as is observed later while analyzing the Generalized Least Squares model. Since in current empirical practice homoscedasticity is seen more as an exception and researchers are advised to employ variance estimates that are robust to heteroscedasticity in large samples – such as the “robust” formula (8.22) – the Gauss-Markov Theorem has lost much of its original significance. However, it still occasionally proves useful as a benchmark for efficiency comparisons.
In order to obtain a distributional result that is usable for inference purposes, observe that by Assumption 8 it holds exactly that:

β̂_OLS | X ∼ N(β_0, σ_0^2 (X^T X)^{-1})        (8.31)

by the properties of the normal distribution, recalling that the OLS estimator is a linear function of the error terms ε as per (8.3). This result can be immediately used for inference so long as σ_0^2 is known; since it is generally unknown, one needs to estimate this parameter. Intuitively, one could use the same estimator for (8.30) that follows from the large sample properties
under homoscedasticity, which in this case would be adapted, by writing it in compact matrix notation, as follows.

V̂ar[β̂_OLS | X] = (e^T e / N) (X^T X)^{-1}

Note that N^{-1} e^T e is a consistent estimator of σ_0^2; however, it is also a biased one, as one can show by taking its expectation conditional on X:
E[e^T e | X] = E[y^T M_X y | X]
  = E[(y − Xβ_0)^T M_X (y − Xβ_0) | X]
  = E[ε^T M_X ε | X]
  = E[Tr(ε^T M_X ε) | X]
  = E[Tr(M_X εε^T) | X]
  = Tr(M_X E[εε^T | X])
  = Tr(σ_0^2 M_X)
  = σ_0^2 Tr(I − X (X^T X)^{-1} X^T)
  = σ_0^2 Tr(I) − σ_0^2 Tr(X (X^T X)^{-1} X^T)
  = σ_0^2 (N − K)

hence E[N^{-1} e^T e | X] = ((N − K)/N) σ_0^2 < σ_0^2.⁶ Thus in small samples, even under homoscedasticity, the above estimator of (8.30) underestimates the true variance, and is inappropriate if estimators are to be evaluated in terms of their exact moments. The appropriate variance-covariance estimator for this setup would instead be:
V̂ar[β̂_OLS | X] = (e^T e / (N − K)) (X^T X)^{-1}        (8.32)

which, with respect to the previous estimator, applies a multiplicative “degrees of freedom” correction N/(N − K).⁷ This estimator of the OLS variance-covariance is, under
⁶ This derivation makes use of the properties of the trace operator, here applied to the scalar ε^T M_X ε. Note that the order of matrices that are arguments of a trace operator can be changed so long as the resulting matrix is conformable; moreover:

Tr(X (X^T X)^{-1} X^T) = Tr((X^T X)^{-1} X^T X) = Tr(I_K) = K

where I_K is the K × K identity matrix. In addition, the sixth equality follows from the fact that the expectation conditions on X, hence it can pass through the trace operator as well as matrix M_X (the only function of X in the trace).
⁷ This is analogous to estimating the variance of a random variable using the standard sample variance S^2, which applies the rescaling factor N/(N − 1) to the uncorrected estimator (see Theorem 4.3).
homoscedasticity, both consistent and unbiased, and is the default formula calculated by most statistical computer packages.
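The trace argument above, E[e^T e | X] = σ_0^2 (N − K), can be checked numerically; the following sketch (sizes, seed and parameter values are illustrative) averages the sum of squared residuals over many error draws for a fixed design matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
N, K, sigma0 = 40, 4, 1.5
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])  # fixed design
M = np.eye(N) - X @ np.linalg.inv(X.T @ X) @ X.T  # annihilator matrix M_X

ssr = np.empty(20000)
for r in range(ssr.size):
    eps = rng.normal(0.0, sigma0, N)
    e = M @ eps            # OLS residuals for this error draw
    ssr[r] = e @ e

# e'e / (N - K) averages to sigma0^2, matching the unbiasedness of (8.32);
# dividing by N instead would be biased downward by the factor (N - K)/N
print(ssr.mean() / (N - K))
```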
In small samples, tests of hypotheses cannot rely on asymptotic properties. As a consequence, while t-statistics are calculated similarly as in large samples (using appropriate estimates of the OLS variance-covariance), under Assumptions 1-8 they follow a Student’s t distribution with N − K degrees of freedom. To better appreciate this, consider that under (8.31) and some null hypothesis H0 : β_k0 = c_k, the following “unfeasible” t-statistic, denoted t*_{H0}, follows a standard normal distribution conditionally on X:

t*_{H0} | X = (β̂_{k,OLS} − c_k) / √(σ_0^2 x̃^{kk}) | X ∼ N(0, 1)        (8.33)

where x̃^{kk} is shorthand notation for the kk-th element of (X^T X)^{-1}. Once again, a problem arises as σ_0^2 is unknown and must be estimated; however, it is easy to see that under the normality Assumption 8, the distribution of the standardized sum of squared residuals σ_0^{-2} e^T e, conditionally on X:

e^T e / σ_0^2 = ε^T M_X ε / σ_0^2  | X ∼ χ^2_{N−K}        (8.34)

is a chi-squared distribution with degrees of freedom equal to the rank of M_X; this quantity equals the trace of M_X, that is, N − K as per the earlier derivation. Furthermore, one can show that (8.33) and (8.34) are independent;⁸ therefore, the actual t-statistic t_{H0}, which is obtained by substituting σ_0^2 with its unbiased estimate e^T e/(N − K) discussed above:
t_{H0} = √(σ_0^2 (N − K)) · t*_{H0} / √(e^T e) = √(N − K) · (β̂_{k,OLS} − c_k) / √(e^T e · x̃^{kk})        (8.35)

follows a Student’s t-distribution with N − K degrees of freedom, conditionally on X (as usual, this follows from Observation 3.2).

t_{H0} | X ∼ T_{N−K}
This result is usable for inference purposes; in small but sizable samples
(N > 20), however, this is known to yield results that are not very different
from approximations based on the standard normal.
⁸ The vector of OLS estimates β̂_OLS = β_0 + (X^T X)^{-1} X^T ε is shown to be independent of (8.34) by the following observation.

σ_0^{-2} (X^T X)^{-1} X^T M_X = 0

This result also applies to each individual element of β̂_OLS, from which t*_{H0} is constructed.
By a similar argument it is shown that multiple linear hypotheses cannot be tested with a Wald statistic in small samples: here, the analogue of (8.26)

W_{H0} = (Rβ̂_OLS − c)^T [ R (X^T X)^{-1} R^T ]^{-1} (Rβ̂_OLS − c) / σ_0^2 ∼ χ^2_L        (8.36)

does indeed follow an exact χ^2_L distribution with L degrees of freedom, but again this is an expression that depends upon the unknown parameter σ_0^2. Therefore, in small samples an F-statistic must be used instead:

F_{H0} = ((N − K)/L) · (Rβ̂_OLS − c)^T [ R (X^T X)^{-1} R^T ]^{-1} (Rβ̂_OLS − c) / (e^T e)        (8.37)

this quantity results from dividing (8.36) by (8.34) – which are independent from one another⁹ – and multiplying the ratio in question by L^{-1}(N − K). By Observation 3.3, this F-statistic follows exactly an F-distribution with degrees of freedom L and N − K, conditionally on X.

F_{H0} | X ∼ F_{L,N−K}
A customary use of the F-statistic is in the model F-test (or simply the model test) corresponding to the null hypothesis H0 : β_0 = 0, under which all the parameters of the model (except the constant term, if present) are jointly equal to zero. The F-statistic obtained from this test is typically part of the default regression output returned by statistical computer packages.¹⁰

Generalized Least Squares
All the results derived so far for the “traditional” model are only valid under
the restrictive assumption that the errors are homoscedastic. This problem
was well acknowledged even in those days when the use of linear regression
⁹ The argument is similar to that of footnote 8: since W*_{H0} is a quadratic form in the OLS estimates, its random component is proportional to the random variable:

W*_N = ε^T X (X^T X)^{-1} R^T [ R (X^T X)^{-1} R^T ]^{-1} R (X^T X)^{-1} X^T ε

and it is easy to see that the central coefficient matrix of this quadratic form returns 0 whether it is pre- or post-multiplied by M_X.
¹⁰ The F-statistic and the model F-test are typically evaluated even in large sample environments, in which case they are calculated through the appropriate estimates of the asymptotic variance of the OLS estimator. The F-distribution might, in fact, provide a better approximation of the true underlying probabilities.
relied for the most part on its small sample properties, which motivated the
search for an adequate solution within the same framework. This resulted
in the development of the Generalized Least Squares (GLS) model.
The intuition behind GLS is simple. Suppose that the errors are indeed heteroscedastic, but the matrix Σ = E[εε^T | X] is known. If one performs OLS estimation on the generalized linear model:

ỹ = X̃ β_0 + ε̃        (8.38)

where:

ỹ ≡ Σ^{-1/2} y;   X̃ ≡ Σ^{-1/2} X;   ε̃ ≡ Σ^{-1/2} ε

and:

Σ^{-1/2} ≡ diag(σ_1^{-1}, σ_2^{-1}, . . . , σ_N^{-1}) = diag(1/√(σ_1^2), 1/√(σ_2^2), . . . , 1/√(σ_N^2))

such that Σ^{1/2} Σ^{1/2} = Σ, then the Generalized Least Squares estimator:

β̂_GLS = (X̃^T X̃)^{-1} X̃^T ỹ
      = (X^T Σ^{-1} X)^{-1} X^T Σ^{-1} y        (8.39)
      = β_0 + (X^T Σ^{-1} X)^{-1} X^T Σ^{-1} ε

is easily seen to be unbiased with respect to β_0. Moreover, since:

E[ε̃ε̃^T | X] = Σ^{-1/2} E[εε^T | X] Σ^{-1/2} = Σ^{-1/2} Σ Σ^{-1/2} = I

the generalized model is homoscedastic by construction; the conditional variance of the GLS estimator is:

Var[β̂_GLS | X] = (X^T Σ^{-1} X)^{-1}        (8.40)
and it is easy to show that, by an extension of the Gauss-Markov theorem, the GLS estimator is efficient under heteroscedasticity. In addition, under Assumptions 1-6 and Assumption 8, the GLS estimator follows an exact conditional normal distribution.

β̂_GLS | X ∼ N(β_0, (X^T Σ^{-1} X)^{-1})        (8.41)

Obviously, in large samples one could evaluate the asymptotic properties of the GLS estimator as well: it is consistent and asymptotically normal as per (8.41), whether the error terms are conditionally normal or not.
The main problem with the GLS estimator is that Σ is, clearly, generally unknown; therefore this estimator is unfeasible in practice. A solution is to substitute Σ with some plausible estimate of it: this approach is called Feasible Generalized Least Squares (FGLS) and it works as follows.
1. Assume a functional form for the dependence of the variance of the error term on the covariates X; for example, a simple and popular choice is the exponential conditional variance σ^2(x_i) = exp(x_i^T ψ);
2. estimate the main regression model of interest via OLS, which returns an unbiased and consistent estimate of β_0, and calculate the resulting squared residuals (e_1^2, e_2^2, . . . , e_N^2);
3. estimate via OLS the assumed model for the conditional variance; in the exponential case this model would be log e_i^2 = x_i^T ψ + ϖ_i, where ϖ_i is some error term with E[ϖ_i | x_i] = 0;
4. construct the matrix Σ̂, the estimate of Σ, accordingly; in the exponential case it would be, for example:

Σ̂ = diag( exp(x_1^T ψ̂_OLS), exp(x_2^T ψ̂_OLS), . . . , exp(x_N^T ψ̂_OLS) )
5. finally, calculate the FGLS estimator as follows.

β̂_FGLS = (X^T Σ̂^{-1} X)^{-1} X^T Σ̂^{-1} y        (8.42)
Note that by denoting as σ̂_i the square root of the i-th element of the diagonal of Σ̂, the above could be equivalently obtained by running OLS on the following transformed model (possibly, with x_i1 = 1 for all i).

y_i/σ̂_i = β_1 x_i1/σ̂_i + β_2 x_i2/σ̂_i + · · · + β_K x_iK/σ̂_i + ε_i/σ̂_i        (8.43)

This approach is known as Weighted Least Squares (WLS).
Under Assumptions 1-6 the FGLS-WLS estimator is both unbiased and
consistent. Moreover, if the conditional variance model is correctly specified,
in small samples it provides efficiency gains relative to naive OLS when the
errors are heteroscedastic, while in large samples its unconditional variance
converges in probability to the “theoretical” GLS conditional variance on the
right-hand side of (8.40). Therefore, under these ideal conditions, inference
is more reliable when using the FGLS-WLS estimator instead of OLS.
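Steps 1-5 can be condensed into a short routine; the numpy sketch below (function name and simulated data are hypothetical) implements FGLS-WLS under the exponential conditional variance model assumed in step 1:

```python
import numpy as np

def fgls_exponential(X, y):
    """Feasible GLS via steps 1-5 with sigma^2(x_i) = exp(x_i' psi)."""
    # step 2: first-stage OLS and its squared residuals
    b_ols = np.linalg.solve(X.T @ X, X.T @ y)
    e2 = (y - X @ b_ols) ** 2
    # step 3: regress log squared residuals on the covariates
    psi = np.linalg.solve(X.T @ X, X.T @ np.log(e2))
    # step 4: fitted conditional variances = diagonal of Sigma-hat
    sig2_hat = np.exp(X @ psi)
    # step 5: equivalently, weighted least squares as in (8.43)
    w = 1.0 / np.sqrt(sig2_hat)
    Xw, yw = X * w[:, None], y * w
    return np.linalg.solve(Xw.T @ Xw, Xw.T @ yw)

# illustrative data with exponential heteroscedasticity
rng = np.random.default_rng(4)
N = 5000
x = rng.uniform(0.0, 2.0, N)
X = np.column_stack([np.ones(N), x])
sd = np.exp(0.5 * (0.2 + 0.8 * x))   # true sd, exponential in x
y = X @ np.array([1.0, 2.0]) + rng.normal(0.0, sd)
b_fgls = fgls_exponential(X, y)
```

Note that a multiplicative bias in the fitted variances (which affects the intercept of ψ) leaves the final estimates unchanged, since rescaling all weights by a constant does not alter the WLS solution.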
The problem with this approach is that it may fail if the conditional variance model is incorrectly specified. In this case, FGLS-WLS might be less efficient than “traditional” OLS, even in small samples. Consequently, a theory of tests for heteroscedasticity was developed, whose objective is to guide researchers in search of the right specification of the heteroscedasticity model. Nowadays, these tests and GLS altogether are seen as largely redundant, since modern econometric practice relies on large samples, asymptotic properties and “heteroscedasticity-robust” variance estimators.¹¹ However, learning GLS can still be useful, for both pedagogical and practical reasons. The pedagogical reason is that it is instructive to make efficiency comparisons between certain estimators (like 3SLS or linear GMM) and the GLS benchmark. The practical reason is that GLS is still used in some settings, for example in models for panel data featuring so-called “random effects.”

8.3 Dependent Errors
A fundamental assumption that has been maintained throughout the discussion of both large and small sample properties of OLS is that of independent observations. This hypothesis allows one to assume independent errors: E[ε_i ε_j | x_i, x_j] = 0 for any two observations i and j. In large samples, this property is especially convenient for applying the Central Limit Theorem, because it implies a convenient estimator for Ξ_0 as per (8.17); in small samples, it is instrumental for establishing the Gauss-Markov efficiency bound. Yet error independence is quite as useful an assumption as it is unlikely to be tenable in a wide array of situations, which can be classified as follows.
• Autocorrelation in Time. Traditionally, this was the original cause for concern about dependent observations in econometrics, and it is particularly relevant in time-series and macroeconometric analysis. In a time-series model where time is indexed by t = 1, . . . , T:

  y_t = x_t^T β_0 + ε_t        (8.44)

  the unobserved “shock” ε_t of today can be related to that of the past:

  E[ε_t ε_{t−s} | x_t, x_{t−s}] ≠ 0        (8.45)

  where s ≠ 0. This circumstance is called autocorrelation and must be considered an inherent feature of time series data rather than an exception, since the external factors that affect observations of different kinds – from countries to stocks – change slowly over time.
¹¹ Remarkably, a command for direct implementation of GLS is missing from STATA.
• Spatial Correlation. The concept of autocorrelation can be naturally extended from time to some notion of “space.” Suppose that in a standard cross-sectional model where observations are indexed by i = 1, . . . , N, pairs of observations can be characterized by some measure of reciprocal “distance” d_ij ≥ 0. This concept can be given different interpretations, from actual distance in physical space to more abstract notions such as network distance. Spatial correlation is the scenario in which the errors of two different observations i and j are increasingly more correlated the closer the two observations are:

  E[ε_i ε_j | x_i, x_j] = g(d_ij) ≠ 0        (8.46)

  where g(d_ij) is some decreasing function of distance, possibly yielding zero for d_ij large enough. Intuitively, individuals, firms or cities that are closer in physical space might be subject to more similar external circumstances, and so might “friends” in a network.
• Within-Group Correlation. Suppose that the sample can be split into a number C < N of groups or clusters indexed by c = 1, . . . , C; in addition, each observation belongs to one and only one group or cluster. Hence, observations can be indexed by group as well.

  y_ic = x_ic^T β_0 + ε_ic        (8.47)

  Within-group correlation is the case where the individual errors can be split into two components:

  ε_ic = α_c + u_ic        (8.48)

  where the idiosyncratic shock u_ic is independent across any pair of observations i, j = 1, . . . , N, regardless of their group:

  E[u_ic u_jg | x_ic, x_jg] = σ^2(x_i)  if i = j;   = 0  if i ≠ j        (8.49)

  while the group or cluster shock α_c correlates within groups, but not across groups.

  E[α_c α_g | x_ic, x_jg] ≠ 0  if c = g;   = 0  if c ≠ g        (8.50)

  This setup is suited to describe similar “shocks” that affect groups of individuals (e.g. classmates, compatriots) or “categories” (firms in the same industry, cities in the same administrative unit and so on).
• Combinations of the Above. These scenarios can co-exist at the same time. In a panel data model indexed by panel unit i = 1, . . . , N, time t = 1, . . . , T and group c = 1, . . . , C, for example:

  y_itc = x_itc^T β_0 + ε_itc        (8.51)

  autocorrelation in time, spatial correlation, and group correlation can all be simultaneously present. In particular, in panel data “groups” may coincide with panel units (C = N); in this case the group shocks α_c are called random effects and are interpreted as those specific factors affecting the same individual unit repeatedly (albeit possibly not to the same extent in different moments).
In these circumstances, inference based on any of the variance estimators examined so far is unlikely to be accurate. However, a number of solutions exist, which may vary by context. In small samples, the GLS framework can be adapted to allow for dependent errors too. In large samples, cluster-based covariance estimators are especially suited to the case of within-group correlation, while the more complex cases of spatial and temporal dependence can also be addressed with appropriate “heteroscedasticity-autocorrelation-consistent” (HAC) covariance estimators. All of these are reviewed in turn.

Generalized Least Squares, Revisited


The GLS framework can incorporate dependent errors. The reason is that
the GLS estimator (8.39) is well defined even if Σ is non-diagonal. Likewise,
a Cholesky-type decomposition of Σ such that:

    \Sigma^{-1/2} \Sigma \Sigma^{-1/2} = I

is always possible since Σ is positive definite, although \Sigma^{-1/2} may be
non-diagonal. A classical application of GLS in the early days of econometrics
was to autocorrelated time series. Suppose, for example, that in (8.44)
the shock ε_t follows a first-order autoregressive, AR(1), process:

    \varepsilon_t = \rho \varepsilon_{t-1} + \xi_t

where |ρ| < 1 and ξ_t is i.i.d. (homoscedastic, uncorrelated over time). Then:

    \Sigma = \sigma^2 \begin{pmatrix}
      1 & \rho & \cdots & \rho^{T-1} \\
      \rho & 1 & \cdots & \rho^{T-2} \\
      \vdots & \vdots & \ddots & \vdots \\
      \rho^{T-1} & \rho^{T-2} & \cdots & 1
    \end{pmatrix}
and OLS estimation of (8.44) is inefficient. The FGLS approach in this case
is about estimating ρ and thereby the non-diagonal part of Σ.
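This two-step FGLS logic can be sketched numerically. The simulation below is a Cochrane-Orcutt-style illustration; all names, seeds and parameter values are arbitrary choices for the example, not part of the original treatment. It estimates ρ from OLS residuals and then quasi-differences the data before re-running OLS:

```python
import numpy as np

rng = np.random.default_rng(42)
T = 2_000
beta_true = np.array([1.0, 2.0])
X = np.column_stack([np.ones(T), rng.normal(size=T)])

# Simulate AR(1) errors: eps_t = rho * eps_{t-1} + xi_t, with xi_t i.i.d.
rho_true = 0.6
eps = np.zeros(T)
for t in range(1, T):
    eps[t] = rho_true * eps[t - 1] + rng.normal()
y = X @ beta_true + eps

# Step 1: OLS, then estimate rho by regressing residuals on their own lag
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b_ols
rho_hat = (e[1:] @ e[:-1]) / (e[:-1] @ e[:-1])

# Step 2: quasi-difference the data so that the transformed errors are
# (approximately) i.i.d., then run OLS again on the transformed model
y_star = y[1:] - rho_hat * y[:-1]
X_star = X[1:] - rho_hat * X[:-1]
b_fgls = np.linalg.lstsq(X_star, y_star, rcond=None)[0]
```

Note that the coefficient on the transformed constant column (which equals 1 − ρ̂) recovers the original intercept β₀ directly.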


Similar extensions of GLS for spatial correlation and group dependence
are possible, so long as a specific parametric form of the structure of error
dependence is assumed. To illustrate, consider the so-called cluster-specific
random effects (CSRE) model, where (8.50) is specified as:

    E[\alpha_c \alpha_g \,|\, x_{ic}, x_{jg}]
      \begin{cases} = \sigma_\alpha^2 & \text{if } c = g \\ = 0 & \text{if } c \neq g \end{cases}

which amounts to assuming a constant within-group covariance. If, in addition,
standard homoscedasticity is assumed, that is \sigma^2(x_i) = \sigma^2 is equal for all
observations, then Σ is block-diagonal over the C clusters:

    \Sigma = \begin{pmatrix}
      \Sigma_1 & 0 & \cdots & 0 \\
      0 & \Sigma_2 & \cdots & 0 \\
      \vdots & \vdots & \ddots & \vdots \\
      0 & 0 & \cdots & \Sigma_C
    \end{pmatrix}
where, given an identity matrix I and a unit vector ι of the same dimension
as the size N_c of some cluster c:

    \Sigma_c = \begin{pmatrix}
      \sigma^2 + \sigma_\alpha^2 & \sigma_\alpha^2 & \cdots & \sigma_\alpha^2 \\
      \sigma_\alpha^2 & \sigma^2 + \sigma_\alpha^2 & \cdots & \sigma_\alpha^2 \\
      \vdots & \vdots & \ddots & \vdots \\
      \sigma_\alpha^2 & \sigma_\alpha^2 & \cdots & \sigma^2 + \sigma_\alpha^2
    \end{pmatrix} = \sigma^2 I + \sigma_\alpha^2 \, \iota \iota^{\top}
for c = 1, . . . , C. This setup is restrictive due to the assumptions of constant
group covariance and homoscedasticity, but is particularly easy to handle by
GLS. Intuitively, a first set of OLS estimates is used to consistently estimate
the within-cluster covariance σ2α and thereby construct the appropriate es-
timate of Σ, then FGLS estimation follows.12 In fact, the CSRE model is
still widely used in practice, especially in panel data models featuring unit-
specific random effects. In general, however, clustered covariance and HAC
estimators that rely on large sample asymptotic results instead of paramet-
ric variance models are considered preferable whenever applicable.
12 Note that under dependent errors, FGLS cannot be interpreted in terms of an equiv-
alent Weighted Least Squares model: the transformed model implied by GLS, in this
like in other cases of error dependence, involves linear combinations of observations
constructed in such a way that the resulting errors \tilde{\varepsilon} are both homoscedastic and
uncorrelated. In the case of the CSRE model, for example, the transformation is quite
convenient:

    y_{ic} - \varpi_c \bar{y}_c = (x_{ic} - \varpi_c \bar{x}_c)^{\top} \beta_0 + (\varepsilon_{ic} - \varpi_c \bar{\varepsilon}_c)

where \varpi_c \equiv 1 - \sigma \big( \sigma^2 + N_c \sigma_\alpha^2 \big)^{-1/2}, and where \bar{y}_c, \bar{x}_c and \bar{\varepsilon}_c are the
cluster-specific sample means of y_{ic}, x_{ic} and \varepsilon_{ic} respectively.
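The footnote's claim can be checked numerically: premultiplying the within-cluster errors by the quasi-demeaning matrix implied by ϖ_c turns their covariance matrix into exactly σ²I. A minimal sketch, with illustrative parameter values:

```python
import numpy as np

sigma2, sigma2_a, Nc = 1.0, 0.5, 4        # illustrative variance parameters
I = np.eye(Nc)
iota = np.ones((Nc, 1))

# Within-cluster error covariance: Sigma_c = sigma^2 I + sigma_a^2 iota iota'
Sigma_c = sigma2 * I + sigma2_a * (iota @ iota.T)

# Quasi-demeaning weight from the footnote
w = 1 - np.sqrt(sigma2 / (sigma2 + Nc * sigma2_a))

# eps_ic - w * mean(eps_c) corresponds to premultiplying the error vector by P
P = I - (w / Nc) * (iota @ iota.T)

# Covariance of the transformed errors: homoscedastic and uncorrelated,
# i.e. exactly sigma^2 * I
V = P @ Sigma_c @ P.T
```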


Clustered Covariance Estimation


In large samples, and in the presence of a high number of groups or clusters C,
within-group dependence of any kind is elegantly addressed by an extension
of the “robust” heteroscedasticity-consistent formula (8.22). To appreciate
this, the development of some additional notation is necessary. Consider a
single group or cluster c, index its observations as i = 1, . . . , Nc , and stack
them vertically as follows:

    y_c = \begin{pmatrix} y_{1c} \\ y_{2c} \\ \vdots \\ y_{N_c c} \end{pmatrix}; \quad
    X_c = \begin{pmatrix} x_{1c}^{\top} \\ x_{2c}^{\top} \\ \vdots \\ x_{N_c c}^{\top} \end{pmatrix}
        = \begin{pmatrix}
            x_{11c} & x_{21c} & \cdots & x_{K1c} \\
            x_{12c} & x_{22c} & \cdots & x_{K2c} \\
            \vdots  & \vdots  & \ddots & \vdots \\
            x_{1N_c c} & x_{2N_c c} & \cdots & x_{KN_c c}
          \end{pmatrix}; \quad
    \varepsilon_c = \begin{pmatrix} \varepsilon_{1c} \\ \varepsilon_{2c} \\ \vdots \\ \varepsilon_{N_c c} \end{pmatrix}
In the case of model (8.47) featuring groups or clusters, the usual compact
matrix notation equation y = Xβ0 + ε is therefore obtained by vertically
stacking the following system of equations:
    y_c = X_c \beta_0 + \varepsilon_c
over each of the C clusters. Note that these C groups are allowed to have
different sizes Nc . Thus, the OLS estimator of model (8.47) can be expressed
in the following three equivalent ways:

    \hat{\beta}_{OLS} = \big( X^{\top} X \big)^{-1} X^{\top} y
                      = \Big( \sum_{c=1}^{C} X_c^{\top} X_c \Big)^{-1} \sum_{c=1}^{C} X_c^{\top} y_c
                      = \Big( \sum_{c=1}^{C} \sum_{i=1}^{N_c} x_{ic} x_{ic}^{\top} \Big)^{-1} \sum_{c=1}^{C} \sum_{i=1}^{N_c} x_{ic} y_{ic}

With group dependence, standard inference is invalid because now:

    \Xi_0 = \lim_{N \to \infty} \mathrm{Var}\Big[ \frac{1}{\sqrt{N}} \sum_{c=1}^{C} \sum_{i=1}^{N_c} x_{ic} \varepsilon_{ic} \Big]
          = \lim_{N \to \infty} \frac{1}{N} \sum_{c=1}^{C} \mathrm{Var}\Big[ \sum_{i=1}^{N_c} x_{ic} \varepsilon_{ic} \Big]
          = \lim_{N \to \infty} \frac{1}{N} \sum_{c=1}^{C} \sum_{i=1}^{N_c} \sum_{j=1}^{N_c} \mathrm{E}\big[ \varepsilon_{ic} x_{ic} x_{jc}^{\top} \varepsilon_{jc} \big]    (8.52)
but this expression cannot be reduced to (8.16); this invalidates the “meat”
matrix of the heteroscedasticity-consistent estimator of the OLS variance.
Intuitively, under group dependence the appropriate estimate of Ξ0 should
be a sample version of the ultimate expression in the derivation (8.52) above.


Conveniently, under some appropriate modifications of Assumptions 2-6,
a Central Limit Theorem result for dependent observations holds:

    \frac{1}{\sqrt{N}} \sum_{c=1}^{C} \sum_{i=1}^{N_c} x_{ic} \varepsilon_{ic} \xrightarrow{d} N(0, \Xi_0)    (8.53)
Furthermore, a consistent estimator \hat{\Xi}_{CCE} \xrightarrow{p} \Xi_0 of the limiting variance
above obtains by letting two realizations x_{ic}\varepsilon_{ic} and x_{jc}\varepsilon_{jc} from the same
cluster c interact in the estimation:13

    \hat{\Xi}_{CCE} = \frac{1}{N} \sum_{c=1}^{C} X_c^{\top} e_c e_c^{\top} X_c
                    = \frac{1}{N} \sum_{c=1}^{C} \sum_{i=1}^{N_c} \sum_{j=1}^{N_c}
                      \big( y_{ic} - x_{ic}^{\top} \hat{\beta}_{OLS} \big) \, x_{ic} x_{jc}^{\top} \, \big( y_{jc} - x_{jc}^{\top} \hat{\beta}_{OLS} \big)    (8.54)

where e_c \equiv y_c - X_c \hat{\beta}_{OLS} substitutes for \varepsilon_c by arguments analogous to those
in the standard case. Since the “bread” matrices of the limiting variance of
OLS are unchanged, the estimator of the corresponding asymptotic variance
can be obtained by substituting the meat matrix of (8.22) with (8.54) above.
The resulting variance-covariance estimator is:

    \widehat{\mathrm{Avar}}\big[ \hat{\beta}_{OLS} \big]
      = \Big[ \sum_{c=1}^{C} X_c^{\top} X_c \Big]^{-1}
        \Big[ \sum_{c=1}^{C} X_c^{\top} e_c e_c^{\top} X_c \Big]
        \Big[ \sum_{c=1}^{C} X_c^{\top} X_c \Big]^{-1}    (8.55)

and is called cluster-robust or clustered covariance estimator (CCE).


Observe how this result is obtained without placing any restrictions on the
within-group correlation, unlike the CSRE-FGLS case where some paramet-
ric assumptions on the structure of error dependence are necessary.
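A compact implementation of the CCE formula (8.55) might look as follows; all names are illustrative, and the simulated DGP adds a cluster-level shock, so the clustered variance of the intercept estimate visibly exceeds the naive homoscedastic one:

```python
import numpy as np

def cluster_cov(X, e, groups):
    # Clustered covariance (8.55): bread = (X'X)^{-1},
    # meat = sum_c X_c' e_c e_c' X_c
    bread = np.linalg.inv(X.T @ X)
    K = X.shape[1]
    meat = np.zeros((K, K))
    for c in np.unique(groups):
        s = X[groups == c].T @ e[groups == c]   # K-vector: sum_i x_ic * e_ic
        meat += np.outer(s, s)
    return bread @ meat @ bread

# Simulated data with cluster random effects: eps_ic = alpha_c + eps_ic
rng = np.random.default_rng(0)
C, Nc = 200, 5
g = np.repeat(np.arange(C), Nc)
X = np.column_stack([np.ones(C * Nc), rng.normal(size=C * Nc)])
eps = np.repeat(rng.normal(size=C), Nc) + rng.normal(size=C * Nc)
y = X @ np.array([1.0, 2.0]) + eps

b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b
V_clustered = cluster_cov(X, e, g)
```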
13 To gain further intuition, it is useful to think of X = ι as a single constant vector,
so that the only parameter that OLS tries to estimate is the unconditional mean of Y_i:
\beta_0 = E[Y_i], with variance Var[Y_i]. The estimation challenge lies in the possibility that
the variation of Y_i is correlated within groups. The OLS estimator here is just the sample
mean \bar{Y}; normally, one would estimate its asymptotic variance as follows:

    \widehat{\mathrm{Var}}\big[ \bar{Y} \big] = \frac{1}{N} \sum_{c=1}^{C} \sum_{i=1}^{N_c} \big( Y_{ic} - \bar{Y} \big)^2

Under within-group dependence, the appropriate asymptotic estimator of the variance
of \bar{Y} is instead a rescaled sum of the C squared total within-cluster deviations:

    \widehat{\mathrm{Var}}\big[ \bar{Y} \big] = \frac{1}{N} \sum_{c=1}^{C} \Big[ \sum_{i=1}^{N_c} \big( Y_{ic} - \bar{Y} \big) \Big]^2

The above is identical to (8.54) under the maintained hypothesis that X = ι.


This result, while quite convenient, is obtained under the condition that
the number of clusters C is large and grows to infinity. In general, however,
the number of clusters is finite and typically not very large. This is one of
the reasons that have motivated the frequent use of the CCE formula (8.55)
with a multiplicative “degrees of freedom correction” \frac{C}{C-1} \frac{N}{N-K}, which takes
into account the fact that both the number of clusters and the sample size
are finite.14 With a very low number of clusters C (conventionally, below a
threshold between 20 and 50), the CCE formula is instead employed along with
statistical tests for small samples (based on the Student’s t and the F
distributions). A paper by Bester et al. (2011) provides theoretical foundations
for this practice: if C is small and fixed but N_c goes to infinity in all clusters,
intuitively the C within-group averages of x_{ic}\varepsilon_{ic} are asymptotically
normally distributed; in addition, they show that CCE estimation works even
under some weak forms of cross-cluster error dependence, so long as clusters
are similar enough in their observable and unobservable characteristics, as
well as in their size.
In current microeconometric practice, the majority of non-experimental
studies feature some form of clustered covariance estimation. This is in no
small part due to some influential papers (Moulton, 1986; Bertrand et al.,
2004) which observed that failing to account for within-group dependence
can lead to seriously biased inference results.15 In particular, panel data
estimates are routinely clustered at least at the level of panel units, but it
often makes sense to define clusters at an even higher level of aggregation
(for example, in a panel of firms one may want to consider industry-level
clusters, including all observations of firms in the same industry over all the
T years). In ideal experimental studies, instead, it is not necessary to
cluster standard errors: intuitively, even if the errors are correlated within
groups, if x_{ic} is independent of \varepsilon_{ic}, then \Xi_0 = \sigma_0^2 K_0 holds and the standard
variance estimator derived under homoscedasticity remains consistent.
14 This is similar to the standard practice of computing “robust” standard errors with
a multiplicative degrees of freedom correction \frac{N}{N-K}, a habit motivated more by
custom than by either theory or data concerns.
15 In some stylized cases, it is possible to solve for the explicit analytic expression of
this bias. Consider, for example, the CSRE model with equal group sizes M = N_c = N/C
for c = 1, . . . , C and identical regressors across clusters X_c = X_g for c \neq g; in this case,
the asymptotic variance of the OLS estimator can be shown to simplify to:

    \mathrm{Avar}\big[ \hat{\beta}_{OLS} \big]
      = \Big[ 1 + (M - 1) \, \frac{\sigma_\alpha^2}{\sigma_\alpha^2 + \sigma^2} \Big] \cdot \sigma^2 \Big( \sum_{c=1}^{C} X_c^{\top} X_c \Big)^{-1}

therefore, a standard estimate of the homoscedastic variance of the OLS estimate would
be downward biased. The extent of the bias is expressed by the multiplicative term within
brackets, which is often referred to as the Moulton bias after Moulton (1986).
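The inflation factor in brackets is easy to tabulate. A small helper (an illustrative sketch, with hypothetical parameter values) shows how quickly even modest within-group correlation degrades naive standard errors:

```python
def moulton_factor(M, sigma2_alpha, sigma2):
    """Multiplicative variance inflation from footnote 15:
    1 + (M - 1) * rho, with rho = sigma_alpha^2 / (sigma_alpha^2 + sigma^2)."""
    return 1 + (M - 1) * sigma2_alpha / (sigma2_alpha + sigma2)

# With intra-cluster correlation rho = 0.1 and groups of size M = 50, the true
# variance is almost six times the naive homoscedastic estimate, i.e. naive
# standard errors are understated by a factor of about sqrt(5.9) ~ 2.4
print(round(moulton_factor(M=50, sigma2_alpha=0.1, sigma2=0.9), 3))  # -> 5.9
```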


Clustered Covariance Estimation can be generalized to multiple group
dimensions within which errors are expected to be dependent. To put things
in practical perspective, in a panel data model the errors may be correlated
within the same panel unit over different time periods, and at the same time
they may be correlated across multiple panel units observed in the same time
period. Since it is clearly infeasible to treat the entire panel as one giant
cluster, a possibility is to estimate the variance-covariance matrix of
interest through the two-way clustering formula. Without any loss of
generality, call I the first set of relevant groups (e.g. all panel units), T the
second (e.g. all time periods), and J the set of all elements defined by
all possible intersections I ∩ T (in a panel dataset, these would be unique
observations at the i-t level). The two-way clustering formula is:

    \widehat{\mathrm{Avar}}_{I,T}\big[\hat{\beta}_{OLS}\big]
      = \widehat{\mathrm{Avar}}_{I}\big[\hat{\beta}_{OLS}\big]
      + \widehat{\mathrm{Avar}}_{T}\big[\hat{\beta}_{OLS}\big]
      - \widehat{\mathrm{Avar}}_{J}\big[\hat{\beta}_{OLS}\big]    (8.56)

where \widehat{\mathrm{Avar}}_{I}(\cdot), \widehat{\mathrm{Avar}}_{T}(\cdot) and \widehat{\mathrm{Avar}}_{J}(\cdot) are, respectively, expressions of the
general CCE formula (8.55) based on the groups defined by the sets I, T and
J; clearly, in a panel setting the third expression (which enters negatively
in (8.56)) is identical to the standard heteroscedasticity-robust covariance
estimator of OLS as given in (8.22). Two-way clustering is actually a par-
ticular case of the more general multi-way clustering; see Cameron et al.
(2011) for an extended discussion. In all cases of multi-way clustering, the
relevant number of clusters to look at in order to gauge the goodness of the
asymptotic approximation is that of the smallest set under consideration.
For example, in panel data with |I| = N and |T| = T, it is usually the case
that T < N and that T is small (rarely larger than 20); hence, the previous
discussion about the theory and practice of clustering with few groups applies.
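The two-way formula (8.56) can be sketched by computing the one-way CCE three times, once per group structure (function and variable names are illustrative):

```python
import numpy as np

def avar_cce(X, e, groups):
    # One-way clustered covariance, as in (8.55)
    bread = np.linalg.inv(X.T @ X)
    meat = np.zeros((X.shape[1], X.shape[1]))
    for c in np.unique(groups):
        s = X[groups == c].T @ e[groups == c]
        meat += np.outer(s, s)
    return bread @ meat @ bread

def avar_two_way(X, e, g1, g2):
    # Two-way clustering (8.56): by g1, plus by g2, minus by the intersection
    g12 = np.array([f"{a}|{b}" for a, b in zip(g1, g2)])
    return avar_cce(X, e, g1) + avar_cce(X, e, g2) - avar_cce(X, e, g12)
```

In a panel, g1 would collect unit identifiers and g2 time identifiers; the intersection then indexes the unique i-t observations, so the third term coincides with the heteroscedasticity-robust estimator.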

HAC Estimation
In large samples, alternatives to CCE exist under specific structures of the
cross-error dependence. Such estimators of the OLS variance-covariance go
by the name of heteroscedasticity-autocorrelation-consistent (HAC)
estimators, since they were originally devised for the case of autocorrelation
in time. Like CCE estimators as well as all asymptotic covariance estimators
of OLS more generally, HAC estimators are based on K × K matrices \hat{\Xi}_{HAC}
such that:

    \hat{\Xi}_{HAC} \xrightarrow{p} \Xi_0

and, if a Central Limit Theorem for dependent observations can be applied:

    \sqrt{N} \big( \hat{\beta}_{OLS} - \beta_0 \big) \xrightarrow{d} N\big( 0, \; K_0^{-1} \Xi_0 K_0^{-1} \big)


then, HAC estimation of the asymptotic variance-covariance of OLS can be
performed as follows:

    \widehat{\mathrm{Avar}}\big[ \hat{\beta}_{OLS} \big]
      = N \Big[ \sum_{i=1}^{N} x_i x_i^{\top} \Big]^{-1} \hat{\Xi}_{HAC} \Big[ \sum_{i=1}^{N} x_i x_i^{\top} \Big]^{-1}    (8.57)

where N (or T) usually cancels with the N^{-1} (T^{-1}) in the expression of \hat{\Xi}_{HAC}.


The idea of Newey and West (1987) and Andrews and Monahan (1991),
in the case of autocorrelated time series from a model such as y_t = x_t^{\top}\beta_0 + \varepsilon_t
(with t = 1, . . . , T), is that an appropriate solution can be as follows:

    \hat{\Xi}_{NW} = \sum_{s=-(T-1)}^{T-1} \kappa_T(s) \, \frac{1}{T} \sum_{t=1}^{T} e_t \, x_t x_{t+s}^{\top} \, e_{t+s}    (8.58)

where NW stands for “Newey-West,” e_t = y_t - x_t^{\top}\hat{\beta}_{OLS} (and similarly for
e_{t+s}), while \kappa_T(s) is a weighting kernel decreasing in |s|, and such that
κT (s) = 0 if t + s < 1 or t + s > T . The most popular weighting kernel is
the Bartlett kernel, named after Bartlett (1950):

    \kappa_{B_T}(s) = \left( 1 - \frac{|s|}{B_T} \right)^{+}

where 2 B_T is the base of the kernel; note that \kappa_{B_T}(0) = 1, and that the kernel
decreases uniformly in |s| until \kappa_{B_T}(\tilde{s}) = 0 for \tilde{s} \geq B_T. The intuition
behind this estimator is that of assigning to each observation in the series
an interval of a certain length (like the base of the Bartlett kernel) within
which autocorrelation can be nonzero; through the kernel, two close enough,
possibly autocorrelated observations “interact” in the HAC estimator (8.58)
of the variance-covariance of xt εt , just like observations of the same cluster
interact in the CCE formula (8.55).
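A sketch of the Newey-West “meat” matrix (8.58) with a Bartlett kernel; here the base is parameterized so that lags up to L receive positive weight (this parameterization, like the names, is an illustrative choice):

```python
import numpy as np

def newey_west_meat(X, e, L):
    # Bartlett-kernel HAC meat (8.58): the lag-s terms get weight 1 - s/(L + 1)
    T = X.shape[0]
    Xe = X * e[:, None]                  # row t holds x_t * e_t
    S = (Xe.T @ Xe) / T                  # s = 0 term
    for s in range(1, L + 1):
        w = 1 - s / (L + 1)
        G = (Xe[s:].T @ Xe[:-s]) / T     # sum over t of (x_{t+s} e_{t+s})(x_t e_t)'
        S += w * (G + G.T)               # contributions of lags +s and -s
    return S
```

With L = 0 the formula collapses to the heteroscedasticity-only (White) meat matrix; the triangular Bartlett weights also keep the estimate positive semi-definite.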
In the mentioned theoretical contributions, conditions are established in
order for the HAC estimator to be a consistent estimator of Ξ0 and for the
applicability of a Central Limit Theorem. Clearly, the true autocorrelation
must be zero for observation pairs not captured by κT (s); with the Bartlett
kernel, this means that the base BT must be long enough so to capture the
actual extent of the autocorrelation. This creates an empirical tension, since
of course a longer base implies a larger estimated variance and less precise
estimates, while shortening the base entails the risk of underestimating the
true variance. In addition, consistency of the HAC estimator requires that,
for integers s > 0, the kernel weights tend to zero sufficiently fast, so that the
sampling noise in the overall variance estimate vanishes as T itself grows larger.


The Newey-West-Andrews-Monahan estimator can be easily extended to
autocorrelated panel data. In a doubly-indexed (i-t) model y_{it} = x_{it}^{\top}\beta_0 + \varepsilon_{it},
(8.58) rewrites as:

    \hat{\Xi}_{NW} = \sum_{s=-(T-1)}^{T-1} \kappa_T(s) \, \frac{1}{NT} \sum_{t=1}^{T} \sum_{i=1}^{N} e_{it} \, x_{it} x_{i(t+s)}^{\top} \, e_{i(t+s)}    (8.59)

where e_{it} = y_{it} - x_{it}^{\top}\hat{\beta}_{OLS} and so on. Clearly enough, in this case the kernel is
allowed to cover the entire panel length T, since consistent HAC estimation
follows from the asymptotic properties obtained as N grows larger. Observe
that if \kappa_T(s) = 1 for all observations of the same panel unit, and equals zero
otherwise, (8.59) would coincide with the CCE formula when clusters are
defined at the panel unit level. The HAC estimator is also easily ported to
a setting featuring spatial correlation. Recall that in such a case, cross-error
dependence decays with some measure of distance d_{ij} between observations
i and j; in a standard (say, cross-sectional) model y_i = x_i^{\top}\beta_0 + \varepsilon_i, the HAC
estimator is easily adapted as:
    \hat{\Xi}_{HSC} = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{N} \kappa_N(d_{ij}) \, e_i \, x_i x_j^{\top} \, e_j    (8.60)

where HSC stands for heteroscedasticity and spatial correlation consistent,


and the kernel is sufficiently decreasing in dij . This estimator was analyzed
principally by Conley (1999) and Kelejian and Prucha (2007); the intuition
is as in the time series case; the difference is that instead of a time interval,
the kernel captures an “area” (possibly, in an abstract sense) around each
observation. Similarly, (8.60) coincides with the CCE if the kernel captures
only observations within segregated groups, and weighs them equally. In a
panel data environment with both temporal and spatial correlation, (8.59)
and (8.60) can be combined:
    \hat{\Xi}_{HASC} = \sum_{s=-(T-1)}^{T-1} \kappa_T(s) \, \frac{1}{NT} \sum_{t=1}^{T} \sum_{i=1}^{N} \sum_{j=1}^{N} \kappa_N(d_{ij}) \, e_{it} \, x_{it} x_{j(t+s)}^{\top} \, e_{j(t+s)}    (8.61)

The appropriate comparison in this case would be with two-way clustering.
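The spatial variant (8.60) is analogous to the time-series case, with a kernel that decays in distance rather than in time. The sketch below uses a Bartlett-type distance kernel κ_N(d) = max(1 − d/c, 0) for some cutoff c, and an O(N²) loop for clarity; all of these choices are illustrative assumptions:

```python
import numpy as np

def spatial_hac_meat(X, e, coords, cutoff):
    # Conley-style meat (8.60); the kernel decays linearly to zero at the cutoff
    N, K = X.shape
    Xe = X * e[:, None]
    S = np.zeros((K, K))
    for i in range(N):
        d = np.linalg.norm(coords - coords[i], axis=1)   # distances d_ij
        w = np.clip(1 - d / cutoff, 0.0, None)           # kappa_N(d_ij)
        S += np.outer(Xe[i], w @ Xe)                     # sum_j kappa * xe_i xe_j'
    return S / N
```

If no pair of distinct observations falls within the cutoff, only the i = j terms survive and the formula reduces to the heteroscedasticity-only meat matrix.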


To conclude this discussion, both CCE and HAC estimators are flexible
tools that can be used in econometric practice to perform correct inference
even under dependent errors. The choice between them depends on the context
and the data, and is informed by the conditions that are necessary for the
appropriate asymptotic results to be applicable. While CCE can be seen as a
special case of HAC estimation, it is considerably more popular, mostly
because it is easier to implement in practice.

Lecture 9

Econometric Models

This lecture provides an introduction to structural models in econometrics,


while contextually discussing the two fundamental concepts of identification
and causality, which govern the choice of empirical models in the applied
econometric practice. These notions are purposely introduced following the
treatment of the single-equation linear model from previous lectures, which
can thus be exploited as a useful source of examples.

9.1 Structural Models


A structural econometric model is a set of relationships regarding some
socio-economic variables relative to some unit of observation (the latter is
denoted here by i). The following treatment distinguishes between:
• some P endogenous variables: yi = (Y1i , Y2i , . . . , YP i );
• some Q exogenous variables: zi = (Z1i , Z2i , . . . , ZQi );
• and some R unobserved variables (or factors) εi = (ε1i , ε2i , . . . , εRi ).
A structural model relates endogenous variables to themselves, to exogenous
variables and to unobserved factors via P functional relationships:

    y_i = s(y_i, z_i, \varepsilon_i; \theta)    (9.1)

Typically, these relationships combine into a system of P equations, although
inequalities are occasionally included. Relying either on economic theory or
on a priori knowledge of the setting under analysis, econometricians specify
the functions expressed through s (·). The parameters that govern these
relationships are collected in the vector θ, which here is given dimension K
(|θ| = K), and whose parameter space is written as Θ.


The objective of econometric analysis is to characterize techniques for


performing statistical inference about the true value θ0 ∈ Θ, from a
data sample {(yi , zi )}N i=1 made of N observations, and on the basis of the
a-priori knowledge (informed by economic theory) of the structural function
s (·). To this end, the econometrician postulates adequate distributional
assumptions about the vector of unobservables ε_i that allow one to transform
(9.1) into a statistical model whose parameters θ can actually be estimated.
Such distributional assumptions can be of different kinds: they range from
simple restrictions on the first moment of the unobservables (e.g. the value
of E [εi ]), through features of their conditional distribution given zi (say, the
value of E [εi | zi ]), up to the full-fledged specification of the joint probability
distribution function of (zi , εi ). This has implications in terms of what set
of statistical and econometric techniques is available for estimation.
Leading examples of structural models are the (linear) Simultaneous
Equations Models (SEMs) which, for P = R, generalize as follows.

γ11 Y1i + γ12 Y2i + . . . + γ1P YP i = φ11 Z1i + φ12 Z2i + . . . + φ1Q ZQi + ε1i
γ21 Y1i + γ22 Y2i + . . . + γ2P YP i = φ21 Z1i + φ22 Z2i + . . . + φ2Q ZQi + ε2i
... = ...
γP 1 Y1i + γP 2 Y2i + . . . + γP P YP i = φP 1 Z1i + φP 2 Z2i + . . . + φP Q ZQi + εP i

In this model, the parameter set is given by θ = (γ1 , . . . , γP ; φ1 , . . . , φP ),


where γp = (γp1 , . . . , γpP ) and φp = (φp1 , . . . , φpQ ) for p = 1, . . . , P ; for
the sake of making the model meaningful, certain parameters are typically
normalized – usually γpp = 1 for p = 1, . . . , P . A SEM can be conveniently
written in compact vectorial notation:

    \Gamma y_i = \Phi z_i + \varepsilon_i    (9.2)

where Γ and Φ are, respectively, matrices of dimension P × P and P × Q,


which collect the γ_p and φ_p parameter vectors along their rows; while:

    y_i = \begin{pmatrix} Y_{1i} \\ Y_{2i} \\ \vdots \\ Y_{Pi} \end{pmatrix}, \quad
    z_i = \begin{pmatrix} Z_{1i} \\ Z_{2i} \\ \vdots \\ Z_{Qi} \end{pmatrix}, \quad
    \varepsilon_i = \begin{pmatrix} \varepsilon_{1i} \\ \varepsilon_{2i} \\ \vdots \\ \varepsilon_{Pi} \end{pmatrix}

collect the observation-specific realizations of yi and zi as well as the values


of εi . A simple example of a SEM was already given in Lecture 7, as the
combinations of the two models (7.6) and (7.7) for the analysis of the returns
to education. There, log-wages Wi and Si are the endogenous variables, the


random vector z_i = (X_i, X_i^2, Z_i) collects the exogenous variables, whereas
ε_{1i} = α_i + ϵ_i and ε_{2i} = ψ_0 α_i + η_i are two “combined” unobserved factors. It
is useful to give some other examples of structural econometric models.
Example 9.1. The Klein I Model. SEMs were introduced by the famous
“Cowles Commission” back in the 1940s, at a time when econometrics was being
developed with the ambitious aim of creating a large macroeconomic model
of the entire economy that would guarantee both full employment and no
repetition of the trauma of the Great Depression. The underlying idea was
that the system would let policymakers control the “endogenous” variables,
such as the GDP, via the manipulation of “exogenous” policy variables such
as, say, government expenditures. The legacy of this intellectual undertaking
is controversial, but one of its heritages is a set of small, self-contained SEMs
that are still used for illustrative purposes. A famous one is the “Klein I”
model (Klein, 1950), which features three structural equations:
Ct = α0 + α1 Pt + α2 Pt−1 + α3 (Wtp + Wtg ) + ε1t
It = β0 + β1 Pt + β2 Pt−1 + β3 Kt−1 + ε2t
Wtp = γ0 + γ1 Xt + γ2 Xt−1 + γ3 At + ε3t
and which is accompanied by the following identities (the first one is also
usually seen as an equilibrium condition1 ):
Xt = Ct + It + Gt
Pt = Xt − Tt − Wtp
Kt = Kt−1 + It
where: i. C_t is consumption; ii. I_t is investment; iii. G_t is the government’s
nonwage expenditure; iv. X_t is the aggregate demand or GDP; v. T_t are the
indirect business taxes plus net exports; vi. K_t is the aggregate capital stock
and K_{t-1} is its lagged value; vii. P_t is the aggregate level of profits realized
in the private sector and P_{t-1} is its lagged value; viii. W_t^p are wages paid in
the private sector; ix. W_t^g are wages paid in the government sector; and x. A_t
is a constant time trend. A simplified version of the first structural equation
– the one for consumption – is the Keynesian consumption function from
Examples 7.1 and 7.4. Observe that in this model, variables are denoted
by the time subscript t, rather than the standard subscript i. The reason
is that macroeconomic models such as this one are typically estimated on
time series data. Here, time lags of specific variables, that are represented
by a subscript such as t − 1, introduce dynamics into the system. □
1
Specifically, Ct + It + Gt represents the aggregate demand in the economy which, in
equilibrium, must equal supply, resulting in an equilibrium level of output or GDP Xt .

300
9.1. Structural Models

Example 9.2. Entry Models. Among the fields of economics, Industrial
Organization (the one that focuses on the analysis of specific markets) makes
the most intensive use of non-linear structural models. The set of
applications ranges from the estimation of demand functions, through that of
supply and production functions, to the analysis of oligopolies, auctions, and
more. Perhaps the archetypal structural models of industrial organization
are the entry models (or entry games), which concern the analysis
of market structure (the number of, and the degree of competition among,
firms in a market) as a function of factors that relate to both demand and
supply. A very minimalistic entry model is sketched next.
Consider N separate markets indexed as i = 1, . . . , N , each populated
by an endogenous number Fi of identical firms. These may be, for example,
geographically segregated markets for homogeneous goods or services. The
average profit of a firm in market i, as a function of Fi , can be written as:

    \pi_i(F_i) = \pi_{Vi}(F_i, z_i, \nu_i; \theta_M) - C_i    (9.3)

where πV i (·) is a variable profits function, whose arguments are Fi , the


market’s various exogenous characteristics zi (as parameterized by θM ), and
some unobserved factors νi ; instead, Ci are the market-specific fixed costs.
With standard characterizations of the demand and supply functions, and of
the modes of competition, πV i (·) is decreasing in Fi . Furthermore, economic
theory predicts that, under complete information, in equilibrium as many
firms will enter the market as the possibility to make positive profits allows.
Therefore, the endogenous variable Fi relates to the exogenous variables zi
and to the unobserved factor νi as follows:

    F_i \in \arg\min_{F \in \mathbb{N}} \; \pi_{Vi}(F, z_i, \nu_i; \theta_M)
      \quad \text{s.t.} \quad \pi_{Vi}(F, z_i, \nu_i; \theta_M) - C_i \geq 0    (9.4)

which clearly results in a non-linear relationship with a specific “step func-


tion” shape – yet, the parameters θM can be estimated with the appropriate
econometric techniques under specific distributional assumptions. Clearly,
this requires a functional form for πV i (·), which in turn depends on specific
hypotheses. For example, suppose that the demand function has a constant
elasticity ζ and is directly proportional to a measure of “market size” z_i^{\top}\theta_D + \nu_i,
where \theta_M = (\theta_D, \zeta) and z_i are factors that affect demand (e.g. demographic
characteristics), while firms have constant marginal costs and compete à la
Cournot; then:

    \pi_{Vi}(F_i, z_i, \nu_i; \theta_M) = \frac{\zeta \, (z_i^{\top}\theta_D + \nu_i)}{F_i^2}    (9.5)
which is similar to Berry (1992), and convenient for the sake of estimation.
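Under the functional form (9.5), the entry condition (9.4) has a closed-form solution: the equilibrium number of firms is the integer part of the square root of (scaled) market size over fixed costs. A minimal sketch (names and values are illustrative):

```python
import math

def equilibrium_entrants(zeta, size, fixed_cost):
    # Free entry under (9.4)-(9.5): the largest integer F such that
    # zeta * size / F^2 >= fixed_cost; zero if even a monopolist loses money
    if zeta * size < fixed_cost:
        return 0
    return int(math.sqrt(zeta * size / fixed_cost))

# The number of firms is a step function of market size:
sizes = [0.5, 1.0, 4.0, 9.5, 25.0]
print([equilibrium_entrants(1.0, s, 1.0) for s in sizes])  # -> [0, 1, 2, 3, 5]
```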


The objective of estimating a model of this kind would be, for example,
that of finding out what factors zi best predict the profitability of a market.
Extensions of such a stylized model might allow for heterogeneous firms, or
for cost factors that vary across markets, hence extending the scope of the
analysis towards supply factors that also affect profitability. Other models
introduce additional endogenous variables, incomplete information, and more;
for an introduction to this literature, see Berry and Reiss (2007). □
This completes the exposition of three quite different econometric mod-
els, each grounded on a specific piece of economic theory. The rest of these
lectures is devoted to the analysis of methods for the estimation of models
like these. Before proceeding to estimation, however, careful econome-
tricians should ask themselves questions of the following sort.
1. Is it possible to use the results of my estimates for the sake of attribut-
ing unique values to each parameter within the set θ?
2. If so, is it possible to use these estimates in order to answer questions
about the “effect” of certain variables upon the others?
Questions like these lie at the core of econometric analysis. These relate,
respectively, to the notions of identification and causality – while inter-
twined, these two concepts are often confused for one another, and it is thus
useful to provide appropriate introductions to both. Most of the remainder
of this lecture is devoted to this objective.

9.2 Model Identification


There are several informal definitions of “identification,” all of them some-
how expressing the notion that, for a certain parameter set θ ∈ Θ to be
identified in a statistical model, no other set θ′ ∈ Θ should have the same
probabilistic implications in terms of “generating” a certain data sample. In
econometrics, the concept of identification originates with the analysis of
Simultaneous Equations Models where, as is elaborated later, short of
assuming specific restrictions ex ante, an infinite number of parameter sets
is typically equally capable of rationalizing a posteriori the same data. This
idea can be intuitively connected to Example 7.2: if education S_i and wages
W_i are related in the data, is it because of some “effect” of the former on
the latter (β_3), or due to the indirect effect of ability α_i? A classical, formal
definition of identification is the one by Rothenberg (1971), developed in
the context of a fully parametric model, one where the joint probability
distribution generating the data is fully specified.


Definition 9.1. A data generation process (DGP) is the joint proba-
bility distribution F_θ(z_i, ε_i) parametrized by θ or, given (9.1), G_θ(z_i, y_i).

Definition 9.2. A family P of DGPs is some given set of similar DGPs.

Definition 9.3. A structure θ_0 is a specific restriction on θ that uniquely
determines a particular DGP P_{θ_0}(z_i, ε_i) ∈ P.

Definition 9.4. A statistical model M is the set of valid structures, which
need not coincide with the family of DGPs P. A statistical model
M is best understood as the set of structures M ⊂ P compatible with the
restrictions implied in the “structural” model (9.1).
Example 9.3. Parametric Bivariate Regression. Consider a bivariate
linear model analogous to the one from Examples 3.11, 6.3 and 6.7:

    Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i

where the data are generated according to a well-known family P of DGPs,
a bivariate normal distribution:

    \begin{pmatrix} X_i \\ \varepsilon_i \end{pmatrix} \sim
    N\left( \begin{pmatrix} \mu_x \\ \mu_\varepsilon \end{pmatrix};
    \begin{pmatrix} \sigma_x^2 & \sigma_{x\varepsilon} \\ \sigma_{x\varepsilon} & \sigma_\varepsilon^2 \end{pmatrix} \right)

implying Y_i ∼ N(\beta_0 + \beta_1\mu_x + \mu_\varepsilon, \; \beta_1^2\sigma_x^2 + 2\beta_1\sigma_{x\varepsilon} + \sigma_\varepsilon^2). To operationalize
this model, one usually imposes the restriction \mu_\varepsilon = E[\varepsilon_i] = 0. The statisti-
cal model M is the set of admissible structures θ = (\beta_0, \beta_1, \mu_x, \sigma_x^2, \sigma_\varepsilon^2, \sigma_{x\varepsilon}).
An example of structure is the restriction θ_0 = (5, 2, 0, 2, 2, 1). □
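The identification problem in this example can be previewed numerically. Since (X_i, Y_i) is bivariate normal, its distribution is fully summarized by its first two moments; the sketch below (the alternative structure is an illustrative choice) shows that a second, different structure implies exactly the same moments as θ_0 = (5, 2, 0, 2, 2, 1), and hence the same distribution of the data:

```python
import numpy as np

def implied_moments(b0, b1, mu_x, s2_x, s2_e, s_xe):
    # Mean vector and covariance matrix of the observables (X_i, Y_i)
    # implied by a structure (with mu_eps restricted to zero)
    mean = np.array([mu_x, b0 + b1 * mu_x])
    cov = np.array([
        [s2_x, b1 * s2_x + s_xe],
        [b1 * s2_x + s_xe, b1 ** 2 * s2_x + 2 * b1 * s_xe + s2_e],
    ])
    return mean, cov

m1, V1 = implied_moments(5.0, 2.0, 0.0, 2.0, 2.0, 1.0)   # theta_0: sigma_xe = 1
m2, V2 = implied_moments(5.0, 2.5, 0.0, 2.0, 1.5, 0.0)   # different beta_1!
print(np.allclose(m1, m2) and np.allclose(V1, V2))        # -> True
```

Intuitively, only the sum β₁σ_x² + σ_xε is pinned down by Cov(X_i, Y_i), so β₁ and σ_xε cannot be separated without further restrictions.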
Quoting Rothenberg, “the identification problem concerns the existence
of a unique inverse association” from the data to the structure. That is, it
is a question about the possibility of recovering one exact structure θ when
knowing the complete probability distribution of the data. Whether this is
possible in theory determines what econometric techniques are available for
estimation, if any. Some further definitions are in order.
Definition 9.5. Observational Equivalence. Two structures θ′ and θ″
are observationally equivalent if P(y_i, z_i | θ′) = P(y_i, z_i | θ″).

Definition 9.6. Global Identification. A structure θ_0 ∈ Θ is globally
point identified if there is no other structure θ ∈ Θ that is observationally
equivalent to it.

Definition 9.7. Local Identification. A structure θ_0 ∈ Θ is locally point
identified if there is no other structure in an open neighborhood of θ_0 that
is observationally equivalent to it.


One additional notion, that of set identification, is outside the scope of


this discussion and thus left aside. For simplicity, the term “identification”
henceforth denotes more generally the notion of global point identification.

Definition 9.8. Model Identification. An econometric model M is iden-


tified if all its Structures θ ∈ Θ are identified.

Armed with these definitions, one can provide rigorous answers to the
question whether some models are “identified” or not.

Theorem 9.1. Identification of a fully parametric bivariate regres-
sion. The statistical model M from Example 9.3 is not (point) identified.
However, the restricted model given by M0 = {θ ∈ M : σxε = 0} is instead
(point) identified.
Proof. (Sketched.) Here it is most convenient to proceed in steps. The first
step is about showing how parameters (µx , σ2x ) are always identified given
appropriate observations of Xi . To this end, a specific rule that associates
observations to parameter values is necessary; since the model in question
is fully parametric, the likelihood principle (see Lecture 5) appears the most
straightforward choice. Specifically, to any (set of) observations of Xi that
are drawn from some distribution FXi (xi | θ), the parameters θ chosen to
rationalize the data are those that maximize the (log-)likelihood function
log L (θ| x1 , . . . , xN ). One can resort to the Implicit Function Theorem to
establish that such an association exists under mild conditions. By following
a similar approach one can show that (β0, β1, σ²ε, σxε) are not identified
under analogous conditions. The full-fledged demonstration of identification
under the restriction that σxε = 0 is however left as an exercise.
First, consider the information about Xi contained in a sample of size
N , that is the collection of realizations {x1 , . . . , xN } of Xi which – impor-
tantly – must not be all identical to each other (however, this occurrence
has probability zero under the maintained hypotheses). The following log-
likelihood function should look familiar, as the sample is drawn from
the normal distribution.
$$\log L\left(\mu_x, \sigma_x^2 \,\middle|\, x_1, \dots, x_N\right) = -\frac{N}{2}\log 2\pi\sigma_x^2 - \sum_{i=1}^{N} \frac{(x_i - \mu_x)^2}{2\sigma_x^2}$$

Example 5.6 outlines the First and Second Order Conditions for a maximum of this
log-likelihood function, and the consequent expressions for estimators of the
two parameters, call them µ̂x and σ̂²x. The question of identification here is:
“can these conditions [for a maximum] characterize a univocal association


from the sample to the parameters, (µ̂x, σ̂²x) : Sx → R × R++?” To answer
this question, recall again from Example 5.6 that the Jacobian matrix of the
score, that is the Hessian matrix of the log-likelihood function, is evaluated
at the solution as follows.
$$H\left(\hat{\mu}_x, \hat{\sigma}_x^2 \,\middle|\, x_1, \dots, x_N\right) = -N \begin{pmatrix} \hat{\sigma}_x^{-2} & 0 \\ 0 & \tfrac{1}{2}\hat{\sigma}_x^{-4} \end{pmatrix}$$

Since σ̂²x ≠ 0, the determinant is nonzero, hence the Hessian has full rank.
Thus, by the Implicit Function Theorem it is (almost) always possible to
solve for unique values of (µx , σ2x ): these parameters are identified.
Second, consider the log-likelihood function of ϑ = (β0, β1, σ²ε, σxε) given
the information about Yi contained in the sample {(yi, xi)}, i = 1, . . . , N
(here one can abstract from µx and σ²x, as these were just shown to be identified).
$$\log L\left(\vartheta \,\middle|\, y_1, \dots, y_N, x_1, \dots, x_N\right) = -\frac{N}{2}\log 2\pi\left(\beta_1^2\sigma_x^2 + 2\beta_1\sigma_{x\varepsilon} + \sigma_\varepsilon^2\right) - \sum_{i=1}^{N} \frac{(y_i - \beta_0 - \beta_1 x_i)^2}{2\left(\beta_1^2\sigma_x^2 + 2\beta_1\sigma_{x\varepsilon} + \sigma_\varepsilon^2\right)}$$

The First Order Conditions now read as:

$$\frac{\partial \log L(\hat{\vartheta} \mid y_1, \dots, y_N, x_1, \dots, x_N)}{\partial \beta_0} = \sum_{i=1}^{N} \frac{e_i}{\hat{\sigma}_y^2} = 0$$

$$\frac{\partial \log L(\hat{\vartheta} \mid y_1, \dots, y_N, x_1, \dots, x_N)}{\partial \beta_1} = -\frac{N(\hat{\beta}_1 \hat{\sigma}_x^2 + \hat{\sigma}_{x\varepsilon})}{\hat{\sigma}_y^2} + \sum_{i=1}^{N} \frac{e_i x_i}{\hat{\sigma}_y^2} + \sum_{i=1}^{N} \frac{e_i^2 (\hat{\beta}_1 \hat{\sigma}_x^2 + \hat{\sigma}_{x\varepsilon})}{\hat{\sigma}_y^4} = 0$$

$$\frac{\partial \log L(\hat{\vartheta} \mid y_1, \dots, y_N, x_1, \dots, x_N)}{\partial \sigma_\varepsilon^2} = -\frac{N}{2\hat{\sigma}_y^2} + \sum_{i=1}^{N} \frac{e_i^2}{2\hat{\sigma}_y^4} = 0$$

$$\frac{\partial \log L(\hat{\vartheta} \mid y_1, \dots, y_N, x_1, \dots, x_N)}{\partial \sigma_{x\varepsilon}} = -2\hat{\beta}_1 \left( \frac{N}{2\hat{\sigma}_y^2} - \sum_{i=1}^{N} \frac{e_i^2}{2\hat{\sigma}_y^4} \right) = 0$$

where $\hat{\sigma}_y^2 \equiv \hat{\beta}_1^2 \hat{\sigma}_x^2 + 2\hat{\beta}_1 \hat{\sigma}_{x\varepsilon} + \hat{\sigma}_\varepsilon^2$ and $e_i \equiv y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i$ for $i = 1, \dots, N$.
Clearly, as the two derivatives with respect to σ2ε and σxε are linearly depen-
dent, no Jacobian matrix of full rank can be formed out of the First Order
Conditions. Therefore, parameters ϑ = (β0 , β1 , σ2ε , σxε ) are not identified.


A useful exercise is to show, following the example above about the identifi-
cation of (µx , σ2x ), that the model is identified when the restriction σxε = 0
is imposed. It is easiest to start from a simpler case, where Xi is “fixed in
repeated samples” (that is, some N realizations occur with probability one)
which greatly simplifies the expression of the likelihood function.
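The failure of identification asserted by Theorem 9.1 can also be made concrete: since (Xi, Yi) is jointly normal, the distribution of the data is fully described by five quantities (two means, two variances, one covariance), while the unrestricted model carries six parameters. The sketch below exhibits two distinct structures implying exactly the same joint distribution; the second structure is a hypothetical one chosen for illustration, not one from the text:

```python
import numpy as np

def implied_xy_distribution(theta):
    """Mean vector and covariance matrix of (X_i, Y_i) implied by a structure
    theta = (beta0, beta1, mu_x, sigma2_x, sigma2_e, sigma_xe), with mu_e = 0."""
    beta0, beta1, mu_x, s2x, s2e, sxe = theta
    mean = np.array([mu_x, beta0 + beta1 * mu_x])
    cov = np.array([
        [s2x,               beta1 * s2x + sxe],
        [beta1 * s2x + sxe, beta1**2 * s2x + 2 * beta1 * sxe + s2e],
    ])
    return mean, cov

theta_a = (5.0, 2.0, 0.0, 2.0, 2.0, 1.0)   # the structure theta0 from Example 9.3
theta_b = (5.0, 2.5, 0.0, 2.0, 1.5, 0.0)   # a different, hypothetical structure

mean_a, cov_a = implied_xy_distribution(theta_a)
mean_b, cov_b = implied_xy_distribution(theta_b)
equivalent = bool(np.allclose(mean_a, mean_b) and np.allclose(cov_a, cov_b))
```

Both structures imply Cov(Xi, Yi) = 5 and Var(Yi) = 14, so they are observationally equivalent in the sense of Definition 9.5 even though they differ; note that the second structure satisfies σxε = 0, consistently with the restricted model being identified.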
The identification condition σxε = 0 from Example 9.3, which states that
the covariance between Xi and εi must be zero, is intimately connected with
the so-called “exogeneity” condition of the linear regression model, which is
abundantly discussed in other lectures, but is also worth revisiting here.
This condition requires that the expectation of the error term conditional
on the explanatory variables is zero (here, E [ εi | Xi = xi ] = 0 for all xi ∈ X)
and it implies that the CEF of Yi given Xi is linear as well:
σxε = Cov (Xi , εi ) = E [Xi εi ] − E [Xi ] E [εi ]
= EX [E [Xi εi | Xi ]]
=0
because E [εi ] = 0 and by the Law of Iterated Expectations (Example 3.11).
In abstract terms, the intuition can be formulated as follows: if σxε ≠ 0, then
although it is apparent that Xi and Yi move together, it is impossible to tell
whether they do so because Xi affects Yi directly, or rather through the indirect influence of
εi . Clearly, here something has to give, and to properly interpret the data
it is necessary to place some “restriction” on the statistical model M.
The concept of identification is not restricted to fully parametric models.
Indeed, it can apply as well to semi-parametric models: that is, models
in which only some features of the joint probability distribution of (zi, εi) are
specified. A full-fledged treatment of identification in the semi-parametric
case is outside the scope of this discussion, but it is still worth illustrating
the main intuition via an example.
Example 9.4. Semi-parametric Bivariate Regression. Consider once again
the bivariate linear model Yi = β0 + β1 Xi + εi , but unlike in Example 9.3,
abstain from imposing fully parametric assumptions: the joint distribution
of (Xi , εi ) is left unspecified. In this case one could re-define the concept of
“model” M as a set of structures of the kind

θ = (β0, β1, Px, Pε, Pε|x)    (9.6)
where Px , Pε and P ε|x are families of probability distributions, respectively
of Xi , of εi , and of εi conditional on Xi , that are allowed by the model M.
A straightforward restriction here is that all elements of Pε must conform
to E [εi ] = 0; clearly, an unrestricted mean is indistinguishable from β0 . 


The definitions of observational equivalence and identification are ex-


tended here to unique values of the exact parameters – like (β0 , β1 ) – and
families of distributions that are compatible with the data. In analogy with
the fully parametric case, identification of a semi-parametric bivariate lin-
ear model is possible by imposing a restriction on the elements from family
P ε|x , one with the same intuitive interpretation as in the fully parametric
case. Unsurprisingly, the restriction in question is of the form E [εi | Xi ] = 0.

Theorem 9.2. Identification of a semi-parametric bivariate regres-
sion. Consider the semi-parametric model M from Example 9.4, incorpo-
rating the restriction E [εi ] = 0; this model is not identified. However, the
restricted model M0 = {θ ∈ M : E [εi | Xi ] = 0} is identified.
Proof. (Outline.) This case cannot be evaluated by the likelihood principle
because clearly, without fully parametric assumptions, a likelihood function
cannot be specified. Instead, it is necessary to analyze the cross moments
of the model (like covariances) that involve Xi and εi , and evaluate whether
a unique association from the data to the parameters can be established by
the analogy principle (see Lecture 5 again). In general, M is not identified
because it allows for E[εi | Xi] = g(Xi), where g(·) is some unknown function
of Xi. Consequently, a moment condition of the form E[Xiεi] = 0 cannot be
imposed, as the analyst does not generally know g(·) – or else this
knowledge could be used to impose some restriction. On the other hand, M0 is
identified by familiar arguments: moments (3.8) and (3.9) and their sample
analogues can be used to establish a unique association from the data to β0
and β1 . The identification of Px , Pε , and P ε|x under the stated restriction,
which is necessary to formally show identification of θ, is then trivial, since
all non-degenerate joint distributions that allow for mean independence of
the error term comply with the definition.
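The analogy principle invoked in the proof can be illustrated with a short simulation: under E[εi | Xi] = 0, the population moments pin down β1 = Cov(Xi, Yi)/Var(Xi) and β0 = E[Yi] − β1E[Xi], and their sample analogues recover the structural parameters. A sketch with hypothetical parameter values:

```python
import numpy as np

# Under E[eps_i | X_i] = 0, the moment conditions identify
# beta1 = Cov(X, Y)/Var(X) and beta0 = E[Y] - beta1*E[X];
# the analogy principle replaces these moments by their sample analogues.
rng = np.random.default_rng(1)
beta0, beta1 = 5.0, 2.0
n = 500_000
x = rng.normal(0.0, np.sqrt(2.0), size=n)
eps = rng.normal(0.0, np.sqrt(2.0), size=n)   # drawn independently of x
y = beta0 + beta1 * x + eps

b1_hat = np.cov(x, y)[0, 1] / x.var(ddof=1)
b0_hat = y.mean() - b1_hat * x.mean()
```

With a large sample, the two sample-analogue estimators concentrate around the true (β0, β1) = (5, 2).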

In many situations, however – whether one is working with a fully para-


metric model or a semi-parametric one – the problems of identification that
emerge do not depend on the statistical relationship between observed and
unobserved variables, but rather on the very structural relationships im-
plied by one’s model. An example of this sort is the “dummy variable trap”
that was briefly introduced in Lecture 7; one cannot estimate a linear model
where the columns of X are linearly dependent, because there is no unique
Least Squares solution. This is intuitively related to the formal definition
of identification: in such cases, there exists an infinite number of parameter
combinations which are equally capable of making sense of the data. It is
for this reason that White’s Assumption 3, the identification assumption of
OLS, must be imposed in order to explicitly rule out issues of this kind.


9.3 Linear Simultaneous Equations


The implications of a structural model’s functional relationships on iden-
tification are especially meaningful for those models that feature multiple
endogenous variables (P ≥ 2) such as linear SEMs. Analyzing the latter is
especially useful for both their pedagogical value and their ubiquitous ap-
pearance in the applied practice, whether implicit or explicit. All methods
of instrumental variable estimation like those discussed in Lecture 10 are,
in fact, based on SEMs (although often implicitly). This section is devoted
specifically to the analysis of identification in SEMs, which is illustrated via
a classical example about partial equilibrium in a market. To begin, some
definitions (which are not specific to SEMs) are in order.
Definition 9.9. The reduced form of a structural econometric model is
its solution for yi .
yi = r (zi , εi ; θ) (9.7)
Definition 9.10. A separable structural model is one that possesses a
reduced form representation like (9.7).
For example, SEMs are separable if the parameter matrix Γ from expression
(9.2) is invertible; in such a case, the reduced form can be written as:
$$y_i = \Gamma^{-1}\left(\Phi z_i + \varepsilon_i\right) = \Pi z_i + \eta_i \tag{9.8}$$
where Π ≡ Γ⁻¹Φ is a P × Q matrix of reduced form parameters, whose entries
πpq are indexed by p (rows) and q (columns), and ηi ≡ Γ⁻¹εi.
Example 9.5. Demand and Supply. Consider a standard microeconomic
model of partial equilibrium in a single market, which an econometrician
is trying to analyze by looking at a sample of N different markets, which
contain information about prices and quantities, and that – similarly to
Example 9.2 – are indexed by i = 1, . . . , N. We know that both the demand
QiD and the supply QiS of a specific good are functions of the price Pi. The
econometrician assumes that, in particular, both the demand and the supply
functions are linear up to some unobserved factors, expressed as (υiD, υiS):

$$Q_i^D = \alpha_0 + \alpha_1 P_i + \upsilon_i^D \qquad\qquad Q_i^S = \beta_0 + \beta_1 P_i + \upsilon_i^S \tag{9.9}$$
We learn from economic theory that, in a market, demand and supply meet
in equilibrium, thus:
$$Q_i^D = Q_i^S = Q_i$$


and that both equilibrium prices Pi and quantities Qi are determined si-
multaneously and interdependently, hence they are both endogenous.
The parameters θ = (α0 , α1 , β0 , β1 ) of model (9.9) are not identified.
This is easily shown via the reduced form of the structural model:

$$Q_i = \frac{\beta_1 \alpha_0 - \alpha_1 \beta_0}{\beta_1 - \alpha_1} + \frac{\beta_1 \upsilon_i^D - \alpha_1 \upsilon_i^S}{\beta_1 - \alpha_1}, \qquad P_i = \frac{\alpha_0 - \beta_0}{\beta_1 - \alpha_1} + \frac{\upsilon_i^D - \upsilon_i^S}{\beta_1 - \alpha_1} \tag{9.10}$$

exhibiting how, by construction, E[(υiD, υiS) | Pi] ≠ 0 and E[(υiD, υiS) | Qi] ≠ 0.

Thus, the parameters of the two bivariate regression models featured in the
structural form (9.9) cannot be identified, neither in fully parametric nor
in semi-parametric environments (Theorems 9.1 and 9.2). The best one can
do is to exploit the reduced form to estimate the two unconditional moments
E [Qi ] = (β1 α0 − α1 β0 ) / (β1 − α1 ) and E [Pi ] = (α0 − β0 ) / (β1 − α1 ) that
are implied by (9.10), which obviously do not contain enough information
about each element of θ: the system is not identified in the sense that there
is an infinite number of θ combinations that predict these two unconditional
averages. This is represented in graphical form in Figure 9.1.

Pi QSi (Pi )

i QS` (P` )

QD
i (Pi ) `

QD
` (P` )

QSj (Pj )

QD
j (Pj )

Qi

Figure 9.1: Infinite supply and demand curves given the sample {i, j, `}


The economic intuition behind this negative result is that changes in the
equilibrium price and quantity in one market cannot be attributed to either
demand or supply factors in isolation, absent any further information that
is specific to either supply or demand. 
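A short simulation makes this negative result concrete: generating data from the reduced form (9.10) with hypothetical parameter values and regressing Qi on Pi by OLS yields a slope that coincides with neither the demand slope α1 nor the supply slope β1:

```python
import numpy as np

# A sketch of the non-identification in Example 9.5: data are generated from
# the equilibrium reduced form (9.10), with hypothetical parameters.
alpha0, alpha1 = 10.0, -1.0   # demand: downward sloping
beta0, beta1 = 2.0, 1.0       # supply: upward sloping

rng = np.random.default_rng(2)
n = 200_000
u_d = rng.normal(0.0, 1.0, size=n)   # demand shock
u_s = rng.normal(0.0, 1.0, size=n)   # supply shock

# Reduced form (9.10): both P and Q load on both shocks.
p = (alpha0 - beta0) / (beta1 - alpha1) + (u_d - u_s) / (beta1 - alpha1)
q = (beta1 * alpha0 - alpha1 * beta0) / (beta1 - alpha1) \
    + (beta1 * u_d - alpha1 * u_s) / (beta1 - alpha1)

# An OLS regression of Q on P recovers neither alpha1 nor beta1.
slope = np.cov(p, q)[0, 1] / p.var(ddof=1)
```

With shocks of equal variance the fitted slope is close to zero, far from both the demand slope (−1) and the supply slope (+1): the scatter of equilibria traces out neither curve.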

In situations of this kind, which are ubiquitous in econometrics, the task
that econometricians face is to elaborate extensions or restrictions of their
model such that certain information contained in the data can improve on
identification. Depending on the circumstances, this may result in only
some parameters being identified, or rather in some parameters being
“redundantly” identified, or both! Here, more definitions are in order.

Definition 9.11. Exact identification. An econometric model is exactly


or just identified (the two expressions are interchangeable) if there exists a
unique association from the data to the parameter set θ.

Definition 9.12. Partial identification. An econometric model is par-


tially identified if there exists a unique association from the data to a subset
of the parameter set θ (θ∗ ⊂ θ), but not so for the other parameters.

Definition 9.13. Overidentification. In an econometric model, a subset


θ∗∗ of the parameter set θ (θ∗∗ ⊂ θ) is overidentified if there exist multiple
associations from the data to the parameter subset in question.

In a partially identified model, there may as well be some overidentified pa-


rameters coexisting with non-identified ones. This is again best illustrated
via the previous example about markets in partial equilibrium.

Example 9.6. Demand and Supply (continued). Introduce a new variable
to Example 9.5: the exogenous income Mi of all consumers in a market.
By standard microeconomic theory, consumers’ demand depends positively
on Mi . Furthermore, there is no theoretical reason why consumers’ income
should affect the production process and the supply function directly. Write:

Qi = α0 + α1 Pi + α2 Mi + υiD
(9.11)
Qi = β0 + β1 Pi + υiS

with reduced form:


$$Q_i = \frac{\beta_1 \alpha_0 - \alpha_1 \beta_0}{\beta_1 - \alpha_1} + \frac{\beta_1 \alpha_2}{\beta_1 - \alpha_1} M_i + \frac{\beta_1 \upsilon_i^D - \alpha_1 \upsilon_i^S}{\beta_1 - \alpha_1}$$
$$P_i = \frac{\alpha_0 - \beta_0}{\beta_1 - \alpha_1} + \frac{\alpha_2}{\beta_1 - \alpha_1} M_i + \frac{\upsilon_i^D - \upsilon_i^S}{\beta_1 - \alpha_1} \tag{9.12}$$


note that if E[(υiD, υiS) | Mi] = 0, the two equations in (9.12) can be treated as
two separate bivariate linear regressions whose parameters are identified; in
particular, through the slopes one can identify the following two parameter
combinations: β1α2/(β1 − α1) as well as α2/(β1 − α1). This revised model
is partially identified: β1, the slope of the structural supply equation, is
backed out as the unique ratio between those two quantities.
The intuition for this result is that now there is an “exogenous demand
shifter” that allows to isolate supply effects as demand changes: this idea
is illustrated graphically in Figure 9.2. Specifically, the variation in Mi (in
the Figure, from Mi to Mj to M` ), taking the supply curve as given, allows
to identify different market equilibrium points, which in turn help trace the
supply curve itself. It is easy to see that if Mi had been included into the
supply equation, identification of β1 would not be achieved, since it would
be impossible in principle to distinguish between the influence that Mi has
on demand against its effect on supply. Therefore, identification hinges on
a restriction imposed upon the model, justified here by economic theory.
Also observe that other parameters of the model, such as α1 – the slope of
the demand function – are not identified (hence “partial” identification).

[Figure 9.2: Identification of the supply curve via demand shifters – the figure plots three equilibria i, j and ℓ traced along a single supply curve as income rises from Mi to Mj to Mℓ, shifting the demand curve outward.]
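The ratio argument can be checked numerically: simulate model (9.11) with hypothetical parameters, estimate the two reduced-form slopes of (9.12) by their sample analogues, and take their ratio:

```python
import numpy as np

# Partial identification in model (9.11): with an exogenous demand shifter
# M_i, the ratio of the two reduced-form slopes in (9.12) recovers the supply
# slope beta1. Parameter values are hypothetical.
alpha0, alpha1, alpha2 = 10.0, -1.0, 0.5
beta0, beta1 = 2.0, 1.0

rng = np.random.default_rng(3)
n = 500_000
m = rng.normal(5.0, 2.0, size=n)
u_d = rng.normal(0.0, 1.0, size=n)
u_s = rng.normal(0.0, 1.0, size=n)

d = beta1 - alpha1
p = (alpha0 - beta0) / d + (alpha2 / d) * m + (u_d - u_s) / d
q = beta0 + beta1 * p + u_s          # equilibria lie on the supply curve

slope_qm = np.cov(m, q)[0, 1] / m.var(ddof=1)   # estimates beta1*alpha2/(beta1-alpha1)
slope_pm = np.cov(m, p)[0, 1] / m.var(ddof=1)   # estimates alpha2/(beta1-alpha1)
beta1_hat = slope_qm / slope_pm
```

The ratio concentrates around the true β1 = 1 regardless of the unknown demand parameters, exactly as the figure suggests.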

Suppose one can also observe another variable denoted by Ci that repre-
sents a synthetic index of production costs in this market. Clearly, Ci affects


supply but not demand, and therefore it can be treated as exogenous. The
model now reads as follows.
Qi = α0 + α1 Pi + α2 Mi + υiD
(9.13)
Qi = β0 + β1 Pi + β2 Ci + υiS
The reduced form can be expressed in terms of two multivariate regressions,
one for quantity Qi and the other for price Pi , with the exogenous variables
Mi and Ci showing up on the right-hand side in both cases.
$$Q_i = \frac{\beta_1 \alpha_0 - \alpha_1 \beta_0}{\beta_1 - \alpha_1} + \frac{\beta_1 \alpha_2}{\beta_1 - \alpha_1} M_i - \frac{\alpha_1 \beta_2}{\beta_1 - \alpha_1} C_i + \frac{\beta_1 \upsilon_i^D - \alpha_1 \upsilon_i^S}{\beta_1 - \alpha_1}$$
$$P_i = \frac{\alpha_0 - \beta_0}{\beta_1 - \alpha_1} + \frac{\alpha_2}{\beta_1 - \alpha_1} M_i - \frac{\beta_2}{\beta_1 - \alpha_1} C_i + \frac{\upsilon_i^D - \upsilon_i^S}{\beta_1 - \alpha_1} \tag{9.14}$$
If E[(υiD, υiS) | Mi] = E[(υiD, υiS) | Ci] = 0, multivariate regression techniques (as
discussed later) allow one to back out all six combined parameters of (9.14).
It is easy to verify that there is a unique solution that maps this set onto the
set of the original “structural” parameters θM C = (α0 , α1 , α2 , β0 , β1 , β2 ).
The model is thus exactly identified. This result fails either if Ci enters
the demand equation, or if Mi does not enter the model while Ci does (in
this case, however, α1 would be identified: Ci would act as a “supply shifter”
that allows to map the demand curve, symmetrically to the scenario above).
In order to appreciate an instance of overidentification, and how it
can coexist with partial identification, let us consider one final case, that
is a model without Ci – hence, no supply shifter – but with two demand shifters:
consumers’ income Mi and the price of a competing product Pi∗ , which is
expected to affect demand positively. The model now reads as:
Qi = α0 + α1 Pi + α2 Mi + α3 Pi∗ + υiD
(9.15)
Qi = β0 + β1 Pi + υiS
and its reduced form as:
$$Q_i = \frac{\beta_1 \alpha_0 - \alpha_1 \beta_0}{\beta_1 - \alpha_1} + \frac{\beta_1 \alpha_2}{\beta_1 - \alpha_1} M_i + \frac{\beta_1 \alpha_3}{\beta_1 - \alpha_1} P_i^* + \frac{\beta_1 \upsilon_i^D - \alpha_1 \upsilon_i^S}{\beta_1 - \alpha_1}$$
$$P_i = \frac{\alpha_0 - \beta_0}{\beta_1 - \alpha_1} + \frac{\alpha_2}{\beta_1 - \alpha_1} M_i + \frac{\alpha_3}{\beta_1 - \alpha_1} P_i^* + \frac{\upsilon_i^D - \upsilon_i^S}{\beta_1 - \alpha_1} \tag{9.16}$$
which, by reasoning analogous to that of the previous cases, has some interesting
implications. First, notice that there are multiple ways to calculate β1
(by taking, for either Mi or Pi∗, the ratio of its coefficient in the Qi equation
to its coefficient in the Pi equation): this means that parameter β1 is
overidentified. Second, parameter α1 is not identified (this is easy to verify):
the intuitive reason is that there are no longer any “supply shifters” in the model. □
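A sketch of the overidentification just described: with both demand shifters in the model, β1 can be computed from either ratio of reduced-form slopes in (9.16), and under correct specification the two computations agree (parameter values below are hypothetical):

```python
import numpy as np

# Overidentification in model (9.15): beta1 is recoverable from either the
# M_i coefficients or the P*_i coefficients of the reduced form (9.16).
alpha0, alpha1, alpha2, alpha3 = 10.0, -1.0, 0.5, 0.8
beta0, beta1 = 2.0, 1.0

rng = np.random.default_rng(4)
n = 500_000
m = rng.normal(5.0, 2.0, size=n)
p_star = rng.normal(3.0, 1.5, size=n)
u_d = rng.normal(0.0, 1.0, size=n)
u_s = rng.normal(0.0, 1.0, size=n)

d = beta1 - alpha1
p = (alpha0 - beta0) / d + (alpha2 / d) * m + (alpha3 / d) * p_star + (u_d - u_s) / d
q = beta0 + beta1 * p + u_s

def slope(z, w):
    """Bivariate OLS slope of w on z (valid here since m and p_star are independent)."""
    return np.cov(z, w)[0, 1] / z.var(ddof=1)

beta1_via_m = slope(m, q) / slope(m, p)
beta1_via_pstar = slope(p_star, q) / slope(p_star, p)
```

Both ratios concentrate around β1 = 1; a large discrepancy between them would be the kind of "signal that something is wrong" with the model that the overidentifying-restrictions logic exploits.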


This example is useful in two major respects. First, it showcases examples
of both partial identification and overidentification, and shows how these
two – as anticipated – can actually coexist in the same model. Some general
implications of overidentification are elaborated upon in the analysis of the
Generalized Method of Moments (Lecture 12), a framework which includes
SEMs as a particular case. In particular, it is discussed how the so-called
“overidentifying restrictions” – that is, the implicit constraints featured in
the model that give rise to overidentification2 – can actually be tested sta-
tistically. This may help identify problems with the structural assumptions
of the model. In the last model from Example 9.6, for instance, if the two
ratios that identify β1 turn out very different from one another, it may be
a signal about “something wrong” with the setup of the model.
Second, the example outlines a strategy for identifying SEMs. The main
issue is that due to the simultaneous structure of the model, the equations
of the structural form do not satisfy the requirements for semi-parametric
identification (a more statistical argument is developed later in Lecture 10).
Conversely, the model’s reduced form is identified; its P equations read as:

Ypi = πp1 Z1i + πp2 Z2i + · · · + πpQ ZQi + ηpi (9.17)

for p = 1, . . . , P . To verify whether any SEM is identified, one must check


whether some association from the parameters Π of the reduced form to the
structural parameters (Γ, Φ) can be established. The main problem is one
of dimensionality: while Π has P Q elements, the structural form features
up to P (P + Q) parameters. Unless one imposes some more “constraints”
on the model, establishing such an association is mathematically impossible.
Even under the typical normalization of the diagonal of Γ, that is γpp = 1
for p = 1, . . . , P , one needs at least P (P − 1) more such constraints.
The most typical kind of constraint, or restriction, that is placed on a
SEM to obtain identification is an exclusion restriction, that is to force
certain structural parameters to equal zero, thus ruling out specific struc-
tural relationships between variables. For example, setting some element of
Φ equal to zero (φpq = 0) means that the q-th exogenous variable bears no
effect on the p-th endogenous variable. The example on demand and supply
showcases several exclusion restrictions that involve the exogenous variables
Mi , Ci and Pi∗ . Exclusion restrictions and the associated terminology are
common in the theory and practice of Instrumental Variables (Lecture 10);
the reason is that, as already hinted, the latter are grounded on SEMs.
² In model (9.15), overidentification of parameter β1 actually follows from the fact
that both demand shifters Mi and Pi∗ are restricted to the demand function – that is, they
do not show up in the structural supply function. This idea is clearly more general.


After imposing some exclusion restrictions, one can evaluate identifica-


tion for each equation of a SEM through a pair of intertwined mathematical
conditions. They are the so-called order condition, which is a necessary
one, and the associated rank condition, which completes the other. They
jointly characterize the algebraic properties that allow to solve for each row
of the following P × (P + Q) matrix, which “horizontally binds” Γ and Φ.
 
F≡ Γ Φ
In what follows, the identification conditions are formulated in terms of the
number of exclusion restrictions imposed on each structural equation; they
are ultimately adapted from standard (but tedious) linear algebra results.
Order condition for SEM identification. Define ϱp as the number
of restrictions applied to the p-th equation of the structural form.
• if ϱp < P − 1, the p-th equation is not identified;
• if ϱp = P − 1, the p-th equation is exactly identified, as long as the
rank condition holds;
• if ϱp > P − 1, the p-th equation is overidentified, as long as the rank
condition holds.
Rank condition for SEM identification. If the order condition holds,
the p-th equation is identified if at least one nonzero determinant of order
(P − 1) × (P − 1) can be constructed out of the coefficients of the variables
excluded from that equation but included in other equations in the model.
To check the rank condition for the p-th equation, one should:
1. delete from F the columns corresponding to the variables included in
the p-th equation; and
2. delete row p too, which results in some submatrix Fp .
Submatrix Fp should have full row rank for the rank condition to hold in
the p-th equation. Note that while the two conditions are formulated with
respect to the number of endogenous variables P, they can alternatively
be expressed in terms of the number of exogenous variables Q. Together,
the order and the rank condition determine a sufficient condition for the
identification of SEMs, which is stated next. Its central implication is that
the identification of SEMs rests on “equation-exclusive” exogenous variables,
which are analogous to the demand and supply shifters from Example 9.6,
and that are also called instruments for reasons that are clarified through
the discussion of Instrumental Variables in Lecture 10.
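The two-step deletion procedure above can be carried out mechanically. A minimal numerical sketch for model (9.13), with hypothetical parameter values and columns of F ordered as (Qi, Pi, 1, Mi, Ci) – an ordering chosen here for illustration:

```python
import numpy as np

# Rank condition check for model (9.13). Columns of F = [Gamma Phi]
# are ordered as (Q, P, 1, M, C); parameter values are hypothetical.
alpha0, alpha1, alpha2 = 10.0, -1.0, 0.5
beta0, beta1, beta2 = 2.0, 1.0, -0.7

F = np.array([
    [1.0, -alpha1, alpha0, alpha2, 0.0],    # demand: C excluded
    [1.0, -beta1,  beta0,  0.0,    beta2],  # supply: M excluded
])
P = F.shape[0]   # number of endogenous variables (= equations)

def rank_condition(F, eq, included_cols):
    """Delete the columns of the variables included in equation `eq`, then
    delete row `eq`; the remaining submatrix must have rank P - 1."""
    keep = [j for j in range(F.shape[1]) if j not in included_cols]
    sub = np.delete(F[:, keep], eq, axis=0)
    return np.linalg.matrix_rank(sub) == P - 1

demand_ok = rank_condition(F, eq=0, included_cols={0, 1, 2, 3})   # only C excluded
supply_ok = rank_condition(F, eq=1, included_cols={0, 1, 2, 4})   # only M excluded

# If beta2 = 0 (C drops out of supply), the demand equation loses identification:
F_bad = F.copy()
F_bad[1, 4] = 0.0
demand_fails = not rank_condition(F_bad, eq=0, included_cols={0, 1, 2, 3})
```

Each equation has ϱp = 1 = P − 1 exclusion restrictions (the order condition), and the rank check verifies that the excluded variable actually enters the other equation with a nonzero coefficient.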


Theorem 9.3. Sufficient Condition for Exact Identification. A SEM


is at least exactly identified if every equation of the structural form features
an exogenous variable that does not show up in any other equation.
Proof. (Exercise!) This proof is a straightforward and instructive applica-
tion of the order and rank conditions, and it is best left as an exercise.
Having learned that a SEM is identified, a researcher may want to esti-
mate its parameters. If all the equations are exactly identified, an intuitive
approach is to estimate the reduced form parameters Π via OLS and then
solve for the structural parameters Γ and Φ; this method is called Indirect
Least Squares (ILS) but is clearly unsuited to overidentified models. The
general approach for estimating a SEM is based on the Two-Stage and the
Three-Stage Least Squares estimators, and it is described in Lecture 10.
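The ILS idea can be sketched on population coefficients: compute Π = Γ⁻¹Φ for the exactly identified model (9.13) from hypothetical structural parameters, then invert the mapping back to the structure (the inversion formulas below follow from the reduced form (9.14)):

```python
import numpy as np

# Indirect Least Squares on population coefficients for model (9.13).
# Hypothetical structural parameters:
alpha0, alpha1, alpha2 = 10.0, -1.0, 0.5
beta0, beta1, beta2 = 2.0, 1.0, -0.7

Gamma = np.array([[1.0, -alpha1],
                  [1.0, -beta1]])           # rows: demand, supply; cols: Q, P
Phi = np.array([[alpha0, alpha2, 0.0],      # exogenous variables: 1, M, C
                [beta0,  0.0,    beta2]])
Pi = np.linalg.solve(Gamma, Phi)            # reduced-form matrix, as in (9.14)

# Invert the mapping from Pi back to the structure (the ILS step):
b1 = Pi[0, 1] / Pi[1, 1]        # ratio of the M_i coefficients
a1 = Pi[0, 2] / Pi[1, 2]        # ratio of the C_i coefficients
a2 = Pi[1, 1] * (b1 - a1)
b2 = -Pi[1, 2] * (b1 - a1)
a0 = Pi[0, 0] - a1 * Pi[1, 0]
b0 = Pi[0, 0] - b1 * Pi[1, 0]
```

In practice Π would be estimated equation-by-equation with OLS before the inversion step; here the exact Π makes the unique invertibility of the mapping transparent.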

9.4 Causal Effects


Separable structural models that possess a reduced form representation such
as (9.7) allow to characterize the concept of “causality” in econometrics. In
order to define population-wide, “average” causal effects, it is necessary to
start from the individual-level concept. Suppose then that some exogenous
variable of interest Zqi , q = 1, . . . , Q can be varied on its own support Xzq
independently of other exogenous variables as well as of unobservables. In
this environment it is possible to define the causal effects of interest.
Definition 9.14. Individual Causal Effect. Consider the unit of obser-
vation i, whose realizations of the observable and unobservable factors are
written as (zi, εi). Let zqi be the q-th element of zi and z−qi be the collection
of all the other Q − 1 elements in that vector. The individual causal effect
of the exogenous variable Zqi on the endogenous variable Ypi for unit i is:

$$C_{qpi}(z_{qi}, z_{-qi}, \varepsilon_i) = r_p\!\left(z'_{qi}, z_{-qi}, \varepsilon_i\right) - r_p\!\left(z_{qi}, z_{-qi}, \varepsilon_i\right) \tag{9.18}$$

if Xzq is a countable discrete set, with zqi, z′qi ∈ X²zq being two consecutive
values; and:

$$C_{qpi}(z_{qi}, z_{-qi}, \varepsilon_i) = \frac{\partial}{\partial z_{qi}} r_p\!\left(z_{qi}, z_{-qi}, \varepsilon_i\right) \tag{9.19}$$
if Xzq is a continuous set, and where rp (·) is the p-th equation of the reduced
form, the one that predicts Ypi . Therefore, the individual causal effect can
be interpreted as the “effect” of a ceteris paribus, marginal variation of Zqi
on the endogenous variable Ypi , obtained by keeping all the other observable
exogenous variables as well as the unobserved factors constant.


Example 9.7. Causal Effects in the Mincer Equation. Consider again


equation (7.6) from the Mincer model of human capital. There, the causal
effect of education on the log-wage of any individual i equals the parameter
for education in the model: CSW i (si , ·) = β3 . The causal effect of experience
is, instead:
CXW i (xi , ·) = β1 + 2β2 xi
which is a function of the current experience xi of the i-th observation. In
linear models, in general, the causal effect of variables that enter the model
without higher order terms (unlike experience Xi in the Mincer equation)
and without “interaction terms” with other variables (for example, dummy
variables that allow for group-varying slopes, as in Example 7.9) is equal to
their associated structural parameter. 
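The formula for the causal effect of experience can be verified numerically: a central finite difference of the experience terms of the Mincer equation (with hypothetical parameter values) reproduces β1 + 2β2xi exactly, since the function is quadratic:

```python
# Hypothetical returns to experience in the Mincer equation.
beta1, beta2 = 0.08, -0.002

def experience_terms(x):
    """The experience part of the Mincer log-wage equation, holding
    education and all other terms fixed (ceteris paribus)."""
    return beta1 * x + beta2 * x**2

x0, h = 10.0, 1e-5
numeric_effect = (experience_terms(x0 + h) - experience_terms(x0 - h)) / (2 * h)
analytic_effect = beta1 + 2 * beta2 * x0   # the causal effect from Example 9.7
```

At xi = 10 years of experience the effect is 0.08 − 0.04 = 0.04 log points per additional year.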
By themselves, identification and causality are two unrelated concepts.
There can be identifiable models (or, with more proper terminology, models
with identified structures) in which causal effects are not defined, like non-
separable models. The converse is likewise true: the model in Example (9.3)
might not be identified, but the causal effect of Xi on Yi certainly exists as a
theoretical construct. The common confusion between the two terms stems
from their frequent mix in the professional parlance of applied economists.
In fact, in order to compute causal effects for some econometric model, it is
typically necessary to estimate first some of its parameters: which is a task
that requires said parameters to be identified in the first place.
When a variable of interest is a so-called binary treatment Si ∈ {0, 1},
that takes value Si = 1 if a certain condition is realized for observation i and
Si = 0 otherwise (in the education context, this could be, say, an indicator
for the achievement of a university degree), causality is often expressed via
the so-called potential outcomes notation, originating from the famous
Rubin (1974) causal model, where for some endogenous variable Yi :
$$Y_i = \begin{cases} Y_i(1) & \text{if } S_i = 1 \\ Y_i(0) & \text{if } S_i = 0 \end{cases} \tag{9.20}$$
which follows from some implicit or explicit model that makes the endoge-
nous variable Yi dependent on the treatment Si . For individual i, the causal
effect in question is simply given by CSY i = Yi (1) − Yi (0), so long as Si is
effectively an exogenous variable and the error term is mean-independent
of it. If this condition cannot be attained, the prevalent empirical practice
is to look for ways to approximate it (borrowing on the statistical literature
on causal inference, which is outside the scope of these lectures) or to search
for appropriate instruments (see Lecture 10), implicitly treating the model
as a larger structural simultaneous equations model.
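A minimal sketch of notation (9.20): when Si is exogenous – here, randomly assigned – the difference in observed group means recovers the average of Yi(1) − Yi(0). All numbers below are hypothetical:

```python
import numpy as np

# Potential outcomes (9.20) with an exogenous (randomly assigned) treatment.
rng = np.random.default_rng(5)
n = 200_000
y0 = rng.normal(1.0, 1.0, size=n)      # potential outcome without treatment
y1 = y0 + 2.0                          # constant individual effect C_SYi = 2
s = rng.integers(0, 2, size=n)         # exogenous binary treatment
y = np.where(s == 1, y1, y0)           # the observed outcome, as in (9.20)

diff_in_means = y[s == 1].mean() - y[s == 0].mean()
```

Because assignment is independent of (Yi(1), Yi(0)), the difference in means concentrates around the true effect of 2.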


The concept of individual causal effect is not very useful, since εi is by


definition unobserved for a single individual observation. Better results are
achieved through the population-wide generalization of the concept.
Definition 9.15. Average Causal Effect. In the population, the average
causal effect of varying variable Zq is the expected value of the individual
causal effects conditional on the other exogenous variables z−q .
$$\mathrm{ACE}_{qp}(z_{qi}, z_{-qi}) \equiv \mathrm{E}_{\varepsilon}\left[\,C_{qpi}(z_{qi}, z_{-qi}, \varepsilon_i) \mid z_{qi}, z_{-qi}\,\right] = \int_{\mathbb{X}_{\varepsilon}} C_{qpi}(z_{qi}, z_{-qi}, \varepsilon_i)\, f_{\varepsilon|z}(\varepsilon_i \mid z_{qi}, z_{-qi})\, \mathrm{d}\varepsilon_{1i} \cdots \mathrm{d}\varepsilon_{Pi}$$

Notice how the expectation above is taken with respect to the unobservables
εi (whose support is denoted here by Xε ), and conditional on the exogenous
variables zi . For a binary treatment, the average causal effect is called the
Average Treatment Effect (ATE).
$$\mathrm{ATE}_Y \equiv \mathrm{E}\left[\,Y_i(1) - Y_i(0) \mid z_{1i}, \dots, z_{(Q-1)i}\,\right] \tag{9.21}$$
A related quantity is the Average Treatment on the Treated (ATT):
$$\mathrm{ATT}_Y \equiv \mathrm{E}\left[\,Y_i(1) - Y_i(0) \mid S_i = 1;\, z_{1i}, \dots, z_{(Q-1)i}\,\right] \tag{9.22}$$
which conditions on that part of the population that does actually receive
the treatment (often a more interesting quantity for policy purposes). Note
how both expectations condition on the other Q − 1 exogenous variables.
An intermediate objective of much econometric analysis is the estima-
tion of the Conditional Expectation Function (CEF) of some endogenous
variable Ypi conditional on the observable exogenous variables zi . An inter-
esting question is whether the derivative of the CEF with respect to some
q-th exogenous variable:

$$\mu^{q}_{Y_p|z}(z_{qi}, z_{-qi}) \equiv \left.\frac{\partial}{\partial z_q}\, \mathrm{E}\left[Y_{pi} \mid Z_{1i}, Z_{2i}, \dots, Z_{Qi}\right]\right|_{Z_{qi} = z_{qi}}$$

with q = 1, . . . , Q, also equals the Average Causal Effect in the population.


In order to answer this question, consider that in typical cases derivatives
for an exogenous variable can pass through integrals taken over εi ; so:
$$\begin{aligned}
\mu^{q}_{Y_p|z}(z_{qi}, z_{-qi}) &= \frac{\partial}{\partial z_{qi}} \int_{\mathbb{X}_{\varepsilon}} r_p(z_{qi}, z_{-qi}, \varepsilon_i)\, f_{\varepsilon|z}(\varepsilon_i \mid z_{qi}, z_{-qi})\, \mathrm{d}\varepsilon_{1i} \cdots \mathrm{d}\varepsilon_{Pi} \\
&= \int_{\mathbb{X}_{\varepsilon}} r_p(z_{qi}, z_{-qi}, \varepsilon_i) \left[\frac{\partial}{\partial z_{qi}} f_{\varepsilon|z}(\varepsilon_i \mid z_{qi}, z_{-qi})\right] \mathrm{d}\varepsilon_{1i} \cdots \mathrm{d}\varepsilon_{Pi} \\
&\quad + \underbrace{\int_{\mathbb{X}_{\varepsilon}} \left[\frac{\partial}{\partial z_{qi}} r_p(z_{qi}, z_{-qi}, \varepsilon_i)\right] f_{\varepsilon|z}(\varepsilon_i \mid z_{qi}, z_{-qi})\, \mathrm{d}\varepsilon_{1i} \cdots \mathrm{d}\varepsilon_{Pi}}_{=\,\mathrm{ACE}_{qp}(z_{qi},\, z_{-qi})}
\end{aligned}$$


and the answer is a conditional no; µqYp|z(zqi, z−qi) = ACEqp(zqi, z−qi) only
if:

$$\int_{\mathbb{X}_{\varepsilon}} r_p(z_{qi}, z_{-qi}, \varepsilon_i) \left[\frac{\partial}{\partial z_{qi}} f_{\varepsilon|z}(\varepsilon_i \mid z_{qi}, z_{-qi})\right] \mathrm{d}\varepsilon_{1i} \cdots \mathrm{d}\varepsilon_{Pi} = 0$$
that is, if Zqi and the unobservables εi are independent, conditional on the
other exogenous variables z−qi . This has a proper definition in statistics.
Definition 9.16. Conditional Independence Assumption (CIA). The
CIA is the hypothesis that the unobservables εi and a specific exogenous
variable Zqi are statistically independent, conditional on all the other
exogenous variables z−qi .
Zqi ⊥ εi | z−qi (9.23)
For binary treatments, the CIA is often expressed in potential outcomes
notation as follows.

Yi (1) − Yi (0) ⊥ Si | Z1i , . . . , Z(Q−1)i (9.24)

The importance of this concept lies in the fact that it provides a clear condi-
tion to verify if the parameters of an econometric model can be interpreted
causally. Suppose, for example, that the CEF of interest is linear:

E [Ypi | Z1i , Z2i , . . . , ZQi ] = β0 + β1 Z1i + β2 Z2i + . . . + βQ ZQi

if the CIA holds for an exogenous variable of interest Zqi with q = 1, . . . , Q,


it follows βq = ACEqp (zqi , z−qi ). A practical strategy which is instrumental
to achieve such a result, is that of “enriching” a model with many exoge-
nous variables so that conditional independence becomes a more credible
hypothesis. One final observation is in order: while the CIA is weaker than
full statistical independence, it strongly resembles the central condition for
identification of linear models, that is mean independence of the error term
(e.g. E [ εi | Xi ] = 0). While the latter is definitely not identical to the CIA,
this resemblance is another cause for confusion between the two concepts of
identification and causality. This mistake is to be avoided!
In a specific circumstance average causal effects are well approximated,
if not exactly estimated, by a linear model. This is the case if the outcome
of interest is Yi , the exogenous variable of interest is binary (write it Si ),
and the other exogenous variables xi satisfy E [Si | xi ] = xT i π0 , i.e. the CEF
of Si is linear (this holds trivially in some cases – for example if xi represents
a full dummy variable group partition, like in fully saturated regressions).
Then, by estimating the following linear model:

yi = xT i β0 + δ0 si + εi


the interpretation of the optimal linear predictor for δ0 can be based on the
Yitzhaki-Angrist-Krueger decomposition (7.56), where the derivative of the
CEF conditional on xi is precisely the ATE:
µ0Y |S,x = E [Yi | Si = 1, xi ] − E [Yi | Si = 0, xi ] ≡ ∆ (xi ) (9.25)
while the weighting factor here is φ (xi ) = P (Si = 1| xi ) [1 − P (Si = 1| xi )].
In this case, (7.56) becomes as follows.
δ∗ = Ex [ ∆ (xi ) P (Si = 1| xi ) [1 − P (Si = 1| xi )] ] / Ex [ P (Si = 1| xi ) [1 − P (Si = 1| xi )] ]        (9.26)
By inspecting the expression above, it appears that the linear projection
δ∗ identifies the average causal effect of Si on Yi in one of two cases:
• the causal effect does not vary with xi (∆ (xi ) is a constant);
• the probability to “take up the treatment” (Si = 1) does not vary with xi .
While these conditions are unlikely to hold in practice, δ∗ still carries an
interpretation as an average causal effect that is weighted by φ (xi ); as in
the general case, in a practical setting it is important to learn about these
weights in order to inform the interpretation of one’s estimates.
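The weighting in (9.26) can be checked with a short simulation. The sketch below (Python; the group shares, take-up probabilities and effect sizes are all invented for illustration) uses a fully saturated set of group dummies, so that E [Si | xi ] is trivially linear, and compares the OLS coefficient on Si with the variance-weighted average of the group-specific effects:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Three discrete groups; fully saturated dummies make E[Si | xi] trivially linear
g = rng.integers(0, 3, size=n)
p_take = np.array([0.2, 0.5, 0.8])       # P(Si = 1 | group): take-up varies by group
effect = np.array([1.0, 2.0, 3.0])       # heterogeneous causal effect, one per group
s = (rng.random(n) < p_take[g]).astype(float)
y = 0.5 * g + effect[g] * s + rng.normal(size=n)

# OLS of y on a constant, two group dummies, and the treatment s
X = np.column_stack([np.ones(n), (g == 1), (g == 2), s]).astype(float)
delta_ols = np.linalg.lstsq(X, y, rcond=None)[0][-1]

# Variance-weighted average (9.26): weights are P(S=1|x)[1 - P(S=1|x)]
shares = np.array([(g == k).mean() for k in range(3)])
w = p_take * (1 - p_take) * shares
delta_star = (effect * w).sum() / w.sum()
```

Up to sampling noise, the two quantities agree: the OLS coefficient over-weights the group whose take-up probability is closest to one half, where Var [Si | xi ] is largest.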
Finally, it is important to observe that while causality is best framed in
terms of effects of exogenous variables on endogenous ones in the setting of
reduced form models, a selected class of structural models – the so-called
triangular models – allows for “endogenous-to-endogenous” causal effects.
Definition 9.17. Triangular Models. A triangular structural model is
one where its P equations and its P endogenous variables can be ordered
in such a way that, for any natural number P 0 < P , the first P 0 endogenous
variables never enter the last P − P 0 equations, or vice versa.
Possibly, the simplest triangular model is the following trivariate model:
Yi = β0 + β1 Xi + β2 Zi + εi
(9.27)
Xi = π0 + π1 Zi + ηi
where (Yi , Xi ) are the endogenous variables while Zi is the only exogenous
one: notice that Yi does not enter the second equation. In general, a SEM is
triangular if matrix Γ is either upper- or lower-triangular (hence the name).
A Mincer model enriched with a linear equation for education where wages
are absent, such as (7.7), is triangular too. In the case of triangular models,
it is sensible to talk about the causal effects of endogenous variables upon
other endogenous variables (Xi on Yi , education on wages, et cetera); all
definitions and considerations made above apply.

Lecture 10

Instrumental Variables

This lecture is chiefly devoted to the most defining element of econometrics:


the method of Instrumental Variables, which is intended for addressing the
most typical problem of empirical economic studies: endogeneity. Following
an introduction to the different types of endogeneity, Instrumental Variables
and the related ideas are discussed through the presentation of increasingly
more general estimators for linear models: IV regression, Two Stages Least
Squares, and finally Three Stages Least Squares for simultaneous equations.

10.1 Endogeneity Problems


In the previous lectures, a single concept has been stated repeatedly and in
different forms: that if the “exogeneity” condition E [εi | xi ] = 0 (White’s As-
sumption 2) fails, the OLS estimator is inconsistent. As already argued, this
is the main condition against which the quality of an econometric analysis
is evaluated (beyond the relevance of the research question, of course) and
for a very good reason: this condition is likely to fail in most observational
data that are not generated via quasi-experiment or actual randomized ex-
periments. The circumstance where exogeneity fails:
E [εi | xi ] ≠ 0        (10.1)
is called endogeneity or failure of identification. Both expressions are
largely conventional; as hinted, the former comes from the theory of simul-
taneous equation models while the latter is due to the intimate relationship
between the exogeneity condition and identification in linear models (see
example 9.3 and the subsequent discussion), although technically (10.1) is
really an issue of conditional moments. Another implication of endogeneity
is that it implies a failure of the Conditional Independence Assumption too,
and hence the impossibility of discussing the causal effects of xi on Yi .


It is worthwhile to provide a taxonomy of endogeneity problems, so as
to learn to easily recognize these issues in empirical studies. What follows
is a separate discussion of the four scenarios that comprise a taxonomy of
endogeneity: 1. the omitted variable bias (including a discussion of fixed
effects); 2. simultaneity; 3. measurement error and 4. other forms of
structural endogeneity.

Omitted Variable Bias and Fixed Effects


The omitted variable bias is described earlier in Lecture 7, and it is certainly
the most common cause for concern about research papers in economics:
many of these only feature one structural equation, which is often linear. It
is worthwhile to summarize the problem again: suppose that the “true” CEF
is E [Yi | xi , Si ] = xT i β0 + δ0 Si where Si is some relevant variable omitted
from the empirical model; then, the probability limit of the OLS estimator
is:

β̂OLS →p β0 + δ0 E [ xi xT i ]−1 E [ xi Si ]

which is affected by a bias term that can be decomposed as the product


between: i. δ0 , that is the coefficient of Si in the “true” CEF of Yi , which
represents the “effect” of the omitted variable on the dependent variable; and
ii. the population linear projection of Si on xi – that is E [ xi xT i ]−1 E [ xi Si ] –


which relates to the population correlation between the omitted variable and
the explanatory variables.1 Recall that if there is good reason to believe that
either term is zero, omitting Si is not a problem at all! Lecture 7 already
provides a generalization of this concept to multiple omitted variables, as
well as an illustrative interpretation in terms of the Mincer equation.
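To make the decomposition concrete, here is a minimal simulation sketch (Python; the coefficients 2.0, 1.5 and 0.7 are arbitrary illustrative values): the short-regression slope equals the true coefficient plus δ0 times the projection coefficient of Si on xi , up to sampling noise.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000
x = rng.normal(size=n)
s = 0.7 * x + rng.normal(size=n)   # omitted variable, correlated with x
e = rng.normal(size=n)
y = 2.0 * x + 1.5 * s + e          # "true" CEF: beta = 2, delta0 = 1.5

beta_short = (x @ y) / (x @ x)     # OLS omitting s (all variables are mean zero)
gamma = (x @ s) / (x @ x)          # projection coefficient of s on x

# The bias term is delta0 times gamma, as in the plim decomposition
bias_prediction = 1.5 * gamma
```

If either δ0 or γ were zero, the short regression would be consistent; here both are nonzero and the bias is roughly 1.5 × 0.7 ≈ 1.05.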

Fixed Effects. In applied economic research, a specific form of omitted


variable bias is routinely addressed with panel data. Suppose that the true
model is:
yit = αi + xT
it β0 + εit (10.2)
where E [εit | xit ] = 0, while αi is an additional unobserved error, typically
called the individual fixed effect, which is constant in time and possi-
bly “endogenous” in the following sense.
E [αi | xit ] ≠ 0        (10.3)
Individual fixed effects represent the “unobserved heterogeneity” of the
data, that is factors that are constant in time, unobserved but vary across
units of observations (e.g. the ability of workers, the “know-how” of firms).
1 If, say, E [ Si | xi ] = xT i γ0 , then clearly E [ xi xT i ]−1 E [ xi Si ] = γ0 .


With panel data, fixed effects can be easily addressed: even if T is small
and thus the individual effects αi cannot be consistently estimated by brute
force (e.g. as separate dummy variables) we know from the Frisch-Waugh-
Lovell theorem that an estimate of β0 based on a “demeaned” model:

yit − ȳi = (xit − x̄i )T β0 + (εit − ε̄i ) (10.4)

where ȳi = (1/T ) ΣT t=1 yit , x̄i = (1/T ) ΣT t=1 xit , and ε̄i = (1/T ) ΣT t=1 εit , is numerically
equivalent to the one from a brute force estimate that includes unit-specific
dummies; moreover, such an estimate would be consistent since αi is absent
from (10.4). An alternative, which is generally asymptotically equivalent,
is to estimate a model in “first differences:”

∆yit = ∆xT
it β0 + ∆εit (10.5)

where ∆ is the first-differences operator (∆ait = ait − ai(t−1) for scalars, and
analogously for vectors: ∆ait = ait − ai(t−1) ).
In panel data, (10.4) is called the within transformation of (10.2),
while (10.5) is called the first-difference transformation. The two approaches
result in the removal of fixed effects; however, it is typically observed that
the resulting models are “identified from the time-series variation in the ex-
planatory variables.” What this means is that the information on β0 is now
inferred from how variations of xit in time relate to variations of Yit in time.
This naturally shrinks the set of applications to the analysis of explanatory
variables that show variability over time. For example, this rules out panel
data as a solution to the omitted-ability bias of the Mincer equation, since
the main variable of interest – education – is typically constant in panels of
workers, and would thus disappear from both (10.4) and (10.5).
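The within transformation is straightforward to implement. The following sketch (Python, invented parameter values) generates a panel whose fixed effects αi are correlated with the regressor, and compares pooled OLS with OLS on the demeaned model (10.4):

```python
import numpy as np

rng = np.random.default_rng(2)
N, T = 500, 5
alpha = rng.normal(size=N)                               # individual fixed effects
x = alpha[:, None] + rng.normal(size=(N, T))             # regressor correlated with alpha
y = alpha[:, None] + 1.0 * x + rng.normal(size=(N, T))   # true beta = 1

# Pooled OLS, ignoring alpha: inconsistent
xv, yv = x.ravel(), y.ravel()
xc, yc = xv - xv.mean(), yv - yv.mean()
b_pooled = (xc @ yc) / (xc @ xc)

# Within transformation (10.4): demean within each unit, then OLS
xd = (x - x.mean(axis=1, keepdims=True)).ravel()
yd = (y - y.mean(axis=1, keepdims=True)).ravel()
b_within = (xd @ yd) / (xd @ xd)
```

Regressing the raw data on x plus a full set of unit dummies reproduces b_within exactly, as implied by the Frisch-Waugh-Lovell theorem; the demeaned version is simply cheaper to compute.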

Simultaneity
The problem of simultaneity is perhaps the archetypical type of endogene-
ity problem, and it derives from the classical analysis of linear simultane-
ous equations in the early days of econometrics. While already elaborated
in Lecture 9, this problem is revisited here from a more statistical angle.
Consider a Simultaneous Equations Model (SEM) written in compact form
(9.2): Γyi = Φzi + εi . Suppose that the exogenous variables zi are defensi-
bly exogenous, that is conditionally mean independent of the P error terms
εi of the model. Write:
E [εi | zi ] = 0 (10.6)
implying E [ zi εpi ] = 0 for p = 1, . . . , P by (10.6) and by the Law of Iterated
Expectations; this can be written more compactly as E [ zi εT i ] = 0.


Unfortunately, by construction the same cannot be argued for the P


endogenous variables; by exploiting the reduced form (9.8), also expressed
as yi = Πzi + Γ−1 εi , it is easy to show how identification fails.
E [ yi εT i ] = Π · E [ zi εT i ] + Γ−1 · E [ εi εT i ] = Γ−1 · Var [εi ] ≠ 0        (10.7)

where the first term vanishes because E [ zi εT i ] = 0.

The intuition is represented in Figure 10.1 below, which displays the graph
of structural relationships implied by two equations of a SEM, where any
two endogenous variables Y1i and Y2i show up in both equations. Since both
endogenous variables are also affected by the respective error terms ε1i and
ε2i , the latter are by construction correlated with both Y1i and Y2i , directly
or indirectly; this is an abstract representation of the identification problem
illustrated in Examples 9.5 and 9.6. Here the denomination simultaneity
is predicated on the concomitance of all these structural relationships.

[Figure 10.1: Graphical representation of Simultaneity – two endogenous
variables Y1i and Y2i , each linked to both error terms ε1i and ε2i ]

The analysis conducted in the previous Lecture elaborates upon conditions


for the identification of a SEM. The final part of this Lecture completes the
discussion of SEMs by discussing methods for their estimation.
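The mechanics of (10.7) can be reproduced in a toy two-equation SEM (Python sketch; the structural coefficients 0.5 and 0.4 are arbitrary): after solving for the reduced form, each endogenous regressor is mechanically correlated with the other equation's error term, so OLS misses the structural coefficient.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400_000
e1 = rng.normal(size=n)
e2 = rng.normal(size=n)

# Structural system: y1 = 0.5*y2 + e1 and y2 = 0.4*y1 + e2
# Reduced form: y = Gamma^{-1} * errors, here solved by hand
d = 1.0 - 0.5 * 0.4
y1 = (e1 + 0.5 * e2) / d
y2 = (0.4 * e1 + e2) / d

# The regressor y2 is correlated with the error of the first equation ...
cov_y2_e1 = np.cov(y2, e1)[0, 1]
# ... so OLS of y1 on y2 does not recover the structural coefficient 0.5
b_ols = np.cov(y2, y1)[0, 1] / np.var(y2)
```

In population, Cov [Y2i , ε1i ] = 0.4/d > 0 here, and the OLS slope converges to a value strictly above the structural 0.5.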

Measurement Error
It is quite common that some variables contained in a dataset are, to some
degree, measured with error. This problem typically leads to inconsistent
estimates whenever it affects the model’s explanatory (exogenous) variables.
To illustrate, consider a bivariate regression model Yi = β0 + β1 Xi + εi with
“exogenous” Xi (E [εi | Xi ] = 0). However, the researcher cannot observe the
true variable Xi , but only its error-ridden version:
Xi∗ = Xi + υi (10.8)
where υi denotes the error in the measurement of Xi (with E [υi ] = 0). In
addition, suppose that the error υi is also completely random, that is, it is
independent of both the “true” Xi and of the original error term εi .
E [Xi υi ] = E [εi υi ] = 0


This circumstance is known as classical measurement error. For a more


agile notation, write σ2x ≡ Var [Xi ] and σ2υ ≡ Var [υi ].
Given actual data {(yi , x∗i )}N i=1 (with x̄∗ = (1/N ) ΣN i=1 x∗i ), the OLS estima-
tor of the regression slope is inconsistent because:


β̂1,OLS = [ ΣN i=1 (x∗i − x̄∗ ) yi ] / [ ΣN i=1 (x∗i − x̄∗ )2 ]  →p  Cov [Xi + υi , β0 + β1 Xi + εi ] / Var [Xi + υi ]
        = β1 Var [Xi ] / ( Var [Xi ] + Var [υi ] ) = β1 σ2x / ( σ2x + σ2υ )
its probability limit is actually smaller than β1 in absolute value, to an extent
that depends on the following multiplicative constant.
σ2x / ( σ2x + σ2υ ) = 1 / ( 1 + σ2υ σ−2x ) = 1 − σ2υ σ−2x / ( 1 + σ2υ σ−2x ) ∈ (0, 1)

This is the infamous attenuation bias of classical measurement error; the


intuition for this result is that, even if υi is completely random, the relative
importance of the covariance between Yi and Xi∗ is obfuscated by the error-
inflated variance of Xi∗ . The magnitude of the problem depends on the size
of the term σ2υ σ−2x , which is called the noise-to-signal ratio.
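A quick simulation sketch of the attenuation factor (Python; here σ2x = 4 and σ2υ = 1, so the shrinkage factor is 0.8 and the naive slope converges to 1.6 instead of the true 2):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1_000_000
x = rng.normal(scale=2.0, size=n)        # sigma_x^2 = 4
u = rng.normal(scale=1.0, size=n)        # sigma_u^2 = 1 (classical measurement error)
x_star = x + u                           # observed, error-ridden regressor
y = 1.0 + 2.0 * x + rng.normal(size=n)   # true slope beta1 = 2

b_naive = np.cov(x_star, y)[0, 1] / np.var(x_star)
shrinkage = 4.0 / (4.0 + 1.0)            # sigma_x^2 / (sigma_x^2 + sigma_u^2) = 0.8
```

Even though the measurement error is completely random, the naive slope is shrunk toward zero by the error-inflated denominator.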
Classical measurement error is an instance of failure of the exogeneity
assumption as per (10.1). To see this, consider the model:
Yi = β0 + β1 Xi + εi
= β0 + β1 Xi∗ + εi − β1 υi
where the actual regressor being used in practice by the analyst is Xi∗ and
the actual error term is therefore εi − β1 υi . Clearly, by construction it is:
E [ εi − β1 υi | Xi∗ ] ≠ 0
because Xi∗ incorporates υi . No similar problem affects the dependent vari-
able Yi : if the researcher really observes the latter with error: Yi∗ = Yi + υi ,
then the model:
Yi∗ = β0 + β1 Xi + εi + υi
is still estimated consistently under the previous assumptions about υi be-
ing “completely random” (although certainly the additional source of noise
would make the estimates overall less precise).
These facts are easily generalized to the multivariate model y = Xβ0 +ε.
Each explanatory variable Xki may or may not be affected by measurement
error, in the sense that the actually observed variables are:

X∗ki = Xki + υki        (10.9)


for k = 1, . . . , K and i = 1, . . . , N . In compact matrix notation, the actual


realizations of the explanatory variables as well as the measurement errors
      ⎡ x∗11  x∗21  . . .  x∗K1 ⎤        ⎡ υ11  υ21  . . .  υK1 ⎤
X∗ =  ⎢ x∗12  x∗22  . . .  x∗K2 ⎥ ;  U = ⎢ υ12  υ22  . . .  υK2 ⎥
      ⎢  ...   ...   . . .   ...  ⎥        ⎢  ...   ...  . . .   ...  ⎥
      ⎣ x∗1N  x∗2N  . . .  x∗KN ⎦        ⎣ υ1N  υ2N  . . .  υKN ⎦
x1N x2N . . . xKN υ1N υ2N . . . υKN

implying the following “true estimated model.”

y = X∗ β0 + ε − Uβ0 (10.10)

The “actual” error term is now εi − ΣK k=1 βk,0 υki ; again by construction, it
is:

E [ εi − ΣK k=1 βk,0 υki | x∗i ] ≠ 0        (10.11)

where x∗i is the i-th row of X∗ , even under the maintained hypothesis that

the errors υki are completely independent of both the “true” explanatory
variables and the primitive error term:

E [ Xk′ i υki ] = E [ εi υki ] = 0        (10.12)

for k, k′ = 1, . . . , K. For simplicity, define the following probability limits:


(1/N ) XT X →p Σx ,        (1/N ) UT U →p Συ        (10.13)
and note that the “actual” OLS estimator can be written as follows.
β̂OLS = ( X∗T X∗ )−1 X∗T y
      = β0 + [ (1/N ) (X + U)T (X + U) ]−1 (1/N ) (X + U)T (ε − Uβ0 )        (10.14)
Since, under the maintained assumptions, it is (1/N ) UT ε →p 0; and similarly
the probability limit of (1/N ) XT U is a matrix full of zeros, the probability
limit of the OLS estimator can be written as:

β̂OLS →p [ I − (Σx + Συ )−1 Συ ] β0 ≤ β0        (10.15)

which is a generalization of the earlier formula about the attenuation bias


to the multivariate case.


Structural Endogeneity
The final subcategory of endogeneity is somewhat residual, and it collects
those kinds of endogeneity problems that are somewhat “built into” the very
structural model. Consider, for example, the following time-series model:
yt = β0 + β1 yt−1 + xT t γ0 + xT t−1 γ1 + εt
εt = ρεt−1 + ξt        (10.16)
where the current realizations of the dependent variable yt depend on its
past, as well as on current and past realizations of some explanatory vari-
ables xt ; in addition, the error term presents an AR(1) structure with au-
toregressive parameter ρ ∈ (0, 1) and “innovation” shock ξt – not autocorre-
lated; this is a more specialized version of model (8.44). It is quite obvious
that:
E [εt | Yt−1 ] = E [ρεt−1 + ξt | Yt−1 ] = ρ E [ εt−1 | Yt−1 ] + E [ξt | Yt−1 ] ≠ 0
even if ξt is conditionally mean-independent, the grand error term εt is not,
as its lag εt−1 affects the lag of the dependent variable Yt by construction.
Another not too dissimilar case is a spatial model written in compact
matrix notation as:
y = β0 ι + β1 Wy + Xγ0 + WXγ1 + ε (10.17)
where W is a N × N non-stochastic spatial weighting matrix with zero
diagonal:  
0 w12 . . . w1N
 w21 0 . . . w2N 
W =  .. .. ... .. 
 
 . . . 
wN 1 wN 2 . . . 0
which collects the wij distances between two distinct units i and j in the
sample. A model of this sort is common in urban and regional econometrics,
as well as in the econometrics of networks and social interactions. In general,
it can be rewritten in terms of its solution for y as:
y = (I − β1 W)−1 (β0 ι + Xγ0 + WXγ1 + ε)
which suggests the existence of an endogeneity problem due to the feedback
mechanisms, that are built in the model, between the dependent variables
(and the error terms) of different economic units.
E [εi | Y1 , . . . , Yi−1 , Yi+1 , . . . , YN ] ≠ 0
A careful reader will easily note an analogy between the endogeneity prob-
lem of spatial models and the issue of simultaneity in SEMs!


10.2 Instrumental Variables in Theory


Instrumental variables are the chief solution that econometric studies adopt
to address endogeneity problems. Some methods employed in econometrics
and causal statistics (like the different “discontinuity” designs) can be seen
as specific applications of instrumental variables. As it is detailed in later
lectures, Instrumental Variables (IVs) or more simply instruments can be
generalized to multi-equation non-linear models; however, it is more useful
to start characterizing them in the setting of single-equation linear regres-
sion models such as (7.1). In that context, IVs are exogenous variables Zi
such that the regression error is conditionally mean-independent of them:
E [εi | Zi ] = 0 (10.18)
while at the same time, IVs Zi do not show up in the assumed model that
“explains” the endogenous variable of interest Yi . This last statement needs
some more qualification. It is useful for illustrative purposes to think of IVs
as part of an augmented structural model which features relationships like
the ones exemplified by the following graph:

[Figure 10.2: Graphical representation of IVs – Zi → Xi → Yi , with ηi
pointing into Xi and εi pointing into Yi ]

where Xi is the main explanatory variable that a researcher is interested


in, and that is possibly endogenous (as indicated by the bidirectional arrow
that connects it with the error term εi ), and which is itself “explained” by
both the IV Zi as well as some other unobserved factor ηi . Observe that Zi
is itself unrelated to both error terms (εi , ηi ) and – importantly – it does
not itself “explain” Yi , at least not directly: any “effect” that Zi might have
occurs through the Xi channel, which is indirectly represented through the
lack of a direct arrow starting from Zi and terminating in Yi . In addition,
even the error term for Xi , that is ηi , is conditionally mean independent of
the instrument Zi , in the following sense.
E [ηi | Zi ] = 0 (10.19)
A model with such features would clearly be a triangular structural model
as defined at the end of Lecture 9, and the plausibility of such a scenario
would depend on the socio-economic context of interest.


Trivariate triangular models


In order to explain how IVs can help address endogeneity problems, it is use-
ful to start from the simplest triangular model, to move gradually towards
the analysis of the more general Two-stages Least Squares estimator.
Consider the very simple “trivariate” triangular model (9.27), by imposing
the additional restriction that the exogenous variable (here, the IV) does
not “explain” Yi in the structural model, as per Figure 10.2.
Yi = β0 + β1 Xi + εi
(10.20)
Xi = π0 + π1 Zi + ηi
In econometric terminology, the first of these two equations is named the
structural form (implicitly, the structural form for Yi – that is the main
relationship of interest) while the second equation is called the first stage
for Xi . Let us examine the properties of this model under the assumption
that Xi is endogenous, that is E [εi | Xi ] ≠ 0.
Since the IV is exogenous, it is E [εi , ηi | Zi ] = 0, hence (π0 , π1 ) are
consistently estimated via OLS performed on the first stage equation. It
turns out that, in this model, even parameters (β0 , β1 ) are identified! To
see this, consider the reduced form of the model:
Yi = β0 + π0 β1 + β1 π1 Zi + εi + β1 ηi
(10.21)
Xi = π0 + π1 Zi + ηi
the first equation too can be consistently estimated via OLS! Moreover,
there clearly is a unique mapping from the reduced form to the structural
parameters, making the model exactly identified. In particular, a consistent
estimate of β1 is obtained – for a given sample {(yi , xi , zi )}N i=1 – as the ratio
between the OLS estimates of the two coefficients for Zi from the reduced
form:
β̂1,IV = [ Σi (zi − z̄)(yi − ȳ) / Σi (zi − z̄)2 ] / [ Σi (zi − z̄)(xi − x̄) / Σi (zi − z̄)2 ]
      = Σi (zi − z̄)(yi − ȳ) / Σi (zi − z̄)(xi − x̄)        (10.22)

where z̄ = (1/N ) ΣN i=1 zi while ȳ and x̄ are analogous, as usual. The expression
above implicitly defines the IV estimator for β1 in this simple trivariate
model. Instead, the intercept β0 of the structural form is estimated by IV
as:
β̂0,IV = ȳ − π̂0,OLS β̂1,IV − π̂1,OLS β̂1,IV · z̄        (10.23)
that is, its consistent IV estimator is obtained by plugging the appropriate
estimates for (π0 , π1 ) and β1 in the “average” reduced form equation for Yi .
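Formulas (10.22)-(10.23) can be sketched in a few lines (Python; all coefficient values are invented, and endogeneity is induced by letting εi load on the first-stage error ηi ):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 300_000
z = rng.normal(size=n)
eta = rng.normal(size=n)
eps = 0.8 * eta + rng.normal(size=n)   # endogeneity: eps correlated with eta
x = 1.0 + 0.6 * z + eta                # first stage (pi0 = 1, pi1 = 0.6)
y = 2.0 + 1.5 * x + eps                # structural form (beta0 = 2, beta1 = 1.5)

# IV slope (10.22): covariance with z in the reduced form over the first stage
zc = z - z.mean()
b1_iv = (zc @ (y - y.mean())) / (zc @ (x - x.mean()))
# Intercept: numerically identical to (10.23), because the OLS first-stage
# fitted line passes through (z-bar, x-bar)
b0_iv = y.mean() - b1_iv * x.mean()

# OLS benchmark, inconsistent because of the endogeneity of x
b1_ols = np.cov(x, y)[0, 1] / np.var(x)
```

The IV slope recovers 1.5 while the OLS slope is pulled upward by Cov [Xi , εi ] > 0.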


That the IV estimators (β0 , β1 ) are consistent should appear self-evident


by the properties of probability limits upon acknowledging that the reduced
form estimators are consistent too; it is useful to also show it as follows:
β̂1,IV = [ (1/N ) ΣN i=1 (zi − z̄)(yi − ȳ) ] / [ (1/N ) ΣN i=1 (zi − z̄)(xi − x̄) ]
      →p Cov [Zi , Yi ] / Cov [Zi , Xi ]
      = β1 + Cov [Zi , εi ] / Cov [Zi , Xi ] = β1        (10.24)
where Cov [Zi , εi ] = E [Zi εi ] = 0 follows by (10.18) and the Law of Iterated
Expectations;2 the consistency of β̂0,IV follows easily. The decomposition in
the second line of (10.24) and an analysis akin to the one from Example 6.7
imply that, if the observations are independently, not identically distributed
(i.n.i.d.), it follows that:
√N ( β̂1,IV − β1 ) →d N ( 0, E [ ε2i (Zi − E [Zi ])2 ] · Cov [Zi , Xi ]−2 )        (10.25)

while, if the conditional variance of the error term εi – given the IV Zi – is


independent of the latter (homoscedasticity), the above simplifies as:
√N ( β̂1,IV − β1 ) →d N ( 0, σ20 · Var [Zi ] · Cov [Zi , Xi ]−2 )        (10.26)

where σ20 = E [ε2i ]. The asymptotic distributions as well as the estimators of


their variances are obtained accordingly. Expressions (10.25) and (10.26),
however, both show an important point: the asymptotic variance of the IV
estimator of the structural form’s slope is inversely proportional to the
squared covariance between the instrument Zi and the endogenous
variable Xi . This is a crucial aspect that is discussed later at length.
Note that an analogous result would not obtain if the IV Zi showed up
as an explanatory variable for Yi in the structural form. This is shown via
the reduced form of the unrestricted triangular model (9.27), that is:

Yi = β0 + π0 β1 + (β1 π1 + β2 ) Zi + εi + β1 ηi
(10.27)
Xi = π0 + π1 Zi + ηi

which is easily shown not to be identified, due to the additional parameter


β2 (the one associated with Zi in the unrestricted structural equation for
Yi ). An analogous failure of identification would follow if the model were
not triangular, and Yi showed up on the right-hand side of the second struc-
tural equation. These observations help illustrate the intuition about the
2
The second-to-last equality follows from Cov [Zi , Yi ] = β1 Cov [Zi , Xi ] + Cov [Zi , εi ].


identification and the estimation of β1 : the “effect” of Xi on Yi is backed up


by the component of the variation in Xi that is predicted by the exogenous
instrument Zi . Obviously, this would be impossible to disentangle from any
“direct” structural effect of Zi on Yi , if this were not assumed to be zero; or
if an additional “effect” of Yi on Xi were to be accounted for.
It is worth summarizing the four conditions that Instrumental Variables
must conform to in order to be adequate and effective in practical contexts.
These are expressed as follows in the context of the simple trivariate trian-
gular model, but they extend easily to higher-dimensional environments.
1. Exogeneity: conditions (10.18)-(10.19) hold, i.e. E [εi , ηi | Zi ] = 0.
2. Exclusion Restriction: the instrument Zi does not affect the main
endogenous variable Yi of the structural form, that is β2 = 0 in (9.27).
3. No Reverse Causality: the structural relationship between the two
endogenous variables is unidirectional: Yi does not affect Xi directly,
that is, the system is indeed triangular.
4. Relevance: the covariance Cov [Zi , Xi ] between the endogenous ex-
planatory variable Xi and the IV Zi must be “sufficiently strong,” or
else the IV estimates would be so imprecise to be useless (a problem
commonly referred to as the one of weak instruments).
This terminology is typically adopted in the applied practice of economet-
rics; it commonly appears whenever the adequacy of a specific instrumental
variable for estimation is under scrutiny.
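The consequences of weak relevance are easy to visualize by Monte Carlo. The sketch below (Python, arbitrary parameter values) draws the simple IV slope repeatedly under a strong and a very weak first stage, and compares the dispersion of the two sampling distributions:

```python
import numpy as np

def iv_slope_draws(pi1, n=500, reps=2000, seed=6):
    """Monte Carlo draws of the simple IV slope for a given instrument strength."""
    rng = np.random.default_rng(seed)
    draws = np.empty(reps)
    for r in range(reps):
        z = rng.normal(size=n)
        eta = rng.normal(size=n)
        eps = 0.8 * eta + rng.normal(size=n)   # endogeneity through eta
        x = pi1 * z + eta                      # first stage
        y = 1.5 * x + eps                      # structural slope beta1 = 1.5
        zc = z - z.mean()
        draws[r] = (zc @ (y - y.mean())) / (zc @ (x - x.mean()))
    return draws

strong = iv_slope_draws(pi1=1.0)
weak = iv_slope_draws(pi1=0.05)

def iqr(a):
    """Interquartile range: a dispersion measure robust to the fat tails of weak-IV draws."""
    q75, q25 = np.percentile(a, [75, 25])
    return q75 - q25
```

With π1 = 0.05 the draws are not only wildly dispersed (the denominator of the ratio is close to zero) but also tend to concentrate away from the true slope, which is the essence of the weak-instruments problem.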

The Multivariate IV Estimator


The IV estimator and the related ideas are easily generalized to multivariate
regression models, with the usual aid of vector- and matrix-based notation.
Our departure point is the usual linear model yi = xT i β0 + εi with K ≥ 2,
however, now the set of explanatory variables is partitioned as:
xi = ( xT i1 , xT i2 )T
where xi1 is a subset of K1 exogenous regressors with:
E [ εi | xi1 ] = 0
while xi2 is a subset of K2 possibly endogenous regressors such that:
E [ εi | xi2 ] ≠ 0
is allowed by the researcher, and K1 + K2 = K.


Suppose that a vector of K2 instrumental variables, which is written


as zi2 , is available, and these IVs are exogenous in the sense that:
E [εi | zi2 ] = 0
if one groups these along the original K1 exogenous regressors, writing their
realizations as:

zi = ( xT i1 , zT i2 )T
one obtains that upon conditioning on the resulting vector of dimension K,
the expectation of the error term is zero as well:
E [εi | zi ] = 0 (10.28)
with the usual covariance implication by the Law of Iterated Expectations.
Cov [zi , εi ] = E [zi εi ] = Ez [zi · E [ εi | zi ]] = 0
In this context, the (just-identified) IV estimator is defined as:
β̂IV = ( ΣN i=1 zi xT i )−1 ΣN i=1 zi yi        (10.29)

which generalizes (10.22), and which is itself a special case of the more
general “overidentified” Two-Stages Least Squares estimator, as discussed
below. By writing the N ×K matrix that collects the zi vectors of exogenous
variables as:
    ⎡ zT 1 ⎤   ⎡ x11   . . .  x1K1   z11   . . .  z1K2 ⎤
    ⎢ zT 2 ⎥   ⎢ x21   . . .  x2K1   z21   . . .  z2K2 ⎥
Z ≡ ⎢  ...  ⎥ = ⎢  ...    . . .    ...     ...    . . .    ...  ⎥        (10.30)
    ⎣ zT N ⎦   ⎣ xN 1  . . .  xN K1  zN 1  . . .  zN K2 ⎦
the IV estimator can be elegantly written in compact matrix notation.
β̂IV = ( ZT X )−1 ZT y        (10.31)
Both representations show straightforwardly that the IV estimator is well-
defined so long as matrix ΣN i=1 zi xT i = ZT X is invertible, and that it admits

a decomposition in terms of the “true” parameters β0 and of the error terms


which is analogous to that of the OLS estimator, (8.2)-(8.3).
β̂IV = β0 + ( (1/N ) ΣN i=1 zi xT i )−1 (1/N ) ΣN i=1 zi εi        (10.32)
     = β0 + ( (1/N ) ZT X )−1 (1/N ) ZT ε        (10.33)


Clearly, under the exogeneity assumption (10.28) the “remainder” terms on


the right-hand sides of both (10.32) and (10.33) converge in probability to
E [zi εi ] = 0: the IV estimator is consistent.
β̂IV →p β0        (10.34)
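A compact implementation sketch of (10.29)/(10.31) (Python; one endogenous regressor, one instrument, and all coefficient values invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000
x1 = rng.normal(size=n)                    # exogenous regressor
z2 = rng.normal(size=n)                    # instrument for the endogenous regressor
eta = rng.normal(size=n)
eps = 0.7 * eta + rng.normal(size=n)       # endogeneity through eta
x2 = 0.3 * x1 + 0.8 * z2 + eta             # endogenous regressor
y = 1.0 + 0.5 * x1 + 1.5 * x2 + eps

X = np.column_stack([np.ones(n), x1, x2])  # regressors of the structural form
Z = np.column_stack([np.ones(n), x1, z2])  # exogenous regressors plus instrument

beta_iv = np.linalg.solve(Z.T @ X, Z.T @ y)       # (10.31): (Z'X)^{-1} Z'y
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]   # inconsistent benchmark
```

Note how the exogenous regressors instrument themselves: Z repeats the columns of X except for the endogenous one, which is replaced by the instrument.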
Example 10.1. Mincer, Revisited. Let us return to the Mincer Equa-
tion from Example 7.2, which is rewritten as (7.41) in Lecture 7:
log Wi = β0 + β1 Xi + β2 Xi2 + β3 Si + εi
where the error term εi = αi + ϵi is the sum of unobserved “ability” αi and
some additional residual error ϵi . It is, however, very difficult to justify any
hypothesis of the form E [αi | Si ] = 0 (which would imply E [εi | Si ] = 0 if ϵi
is conditionally mean independent as well). It is clear that education and
individual ability are correlated: more skilled individuals tend to get more
education, and vice versa – this is essentially a case of omitted variable bias.
Hence, one cannot consistently estimate the Mincer equation by OLS.
Suppose that, however, researchers have some exogenous instrument Zi
at their disposal, one that:
1. is exogenous, in the sense that it does not correlate with unobserved
ability and (10.18) holds;
2. satisfies the exclusion restriction: it does not affect wages directly;
3. fits a setting that rules out reverse causality: education is itself hardly
affected by future individual wages;
4. is relevant, that is, it correlates with education.
In the literature, countless instruments have been proposed
for the education variable Si ; one famous example for Zi is the “distance of
one’s home from a college” from a celebrated study by Card (1995). A good
exercise is to ask oneself if the four conditions above hold in this example.
Regardless of the specific instrument Zi being chosen, a consistent estimate
of the Mincer equation’s parameters β0 = (β0 , β1 , β2 , β3 ) can be obtained
by setting:

xi = ( 1, xi , x2i , si )T   and   zi = ( 1, xi , x2i , zi )T
where xi is the experience Xi specific to observation i; si is instead his or
her education Si , while zi is her or his specific value of the instrument Zi .
IV estimation would then proceed as per (10.29) or (10.31). 


Until now, the formula for the IV estimator – and its asymptotic proper-
ties – have been presented without much of a motivation. To gain intuition,
it is useful to once again represent the structural relationships between the
endogenous and the exogenous variables in the form of a triangular model
of simultaneous equations – in a simple case. Suppose, in particular, that
K1 = K −1 and K2 = 1, with xi2 = xKi = si (this is similar to an analogous
partition from Lecture 7) and zi2 = zi . Thus, only one variable of the main
linear model is suspected to be endogenous, with only one instrument zi to
compensate for it. The triangular model representation of this setup is:
$$\begin{aligned} y_i &= \mathbf{x}_{i1}^T\boldsymbol\beta_{0\backslash K} + \delta_0\, s_i + \varepsilon_i \\ s_i &= \mathbf{x}_{i1}^T\boldsymbol\pi_{0\backslash K} + \tau_0\, z_i + \eta_i \end{aligned} \tag{10.35}$$

where $\boldsymbol\beta_0 = (\boldsymbol\beta_{0\backslash K}^T,\ \delta_0)^T$; the reduced form of this model is as follows.

$$\begin{aligned} y_i &= \mathbf{x}_{i1}^T\left(\boldsymbol\beta_{0\backslash K} + \delta_0\boldsymbol\pi_{0\backslash K}\right) + \delta_0\tau_0\, z_i + \varepsilon_i + \delta_0\eta_i \\ s_i &= \mathbf{x}_{i1}^T\boldsymbol\pi_{0\backslash K} + \tau_0\, z_i + \eta_i \end{aligned} \tag{10.36}$$

This model is, like (10.20), identified, thanks to the exclusion restriction
whereby the instrument zi does not enter the structural equation for yi in
(10.35); note that a consistent estimator for δ0 can be obtained, in analogy
with the trivariate case above (10.22), as:

$$\hat{\delta}_{IV} = \frac{\widehat{\delta\tau}_{OLS}}{\hat{\tau}_{OLS}} = \frac{\mathbf{z}^T\mathbf{M}_{X_1}\mathbf{y}}{\mathbf{z}^T\mathbf{M}_{X_1}\mathbf{z}} \bigg/ \frac{\mathbf{z}^T\mathbf{M}_{X_1}\mathbf{s}}{\mathbf{z}^T\mathbf{M}_{X_1}\mathbf{z}} = \frac{\mathbf{z}^T\mathbf{M}_{X_1}\mathbf{y}}{\mathbf{z}^T\mathbf{M}_{X_1}\mathbf{s}} \tag{10.37}$$

where y, s and z are the vectors of length N that collect, respectively, yi , si
and zi ; while $\mathbf{M}_{X_1} = \mathbf{I} - \mathbf{X}_1\left(\mathbf{X}_1^T\mathbf{X}_1\right)^{-1}\mathbf{X}_1^T$ is the residual-maker matrix
obtained from the first K − 1 (exogenous) explanatory regressors xi1 . Note
that expression (10.37) is best understood as an application of (7.30).
It turns out that the estimator for δ0 given in (10.37) is numerically
equivalent to the one obtained via the IV estimator; it is a good exercise
to develop the proof of this result, which is in its essence a variation of the
Frisch-Waugh-Lovell Theorem, applied to the partitioned IV estimator.
The intuition is likewise analogous: the IV estimator for multivariate linear
models extends the simple IV estimator of the slope (10.22) by partialing out
the contribution of the other explanatory variables included in the structural
form from the empirical correlations of the instruments with the endogenous
variables, as well as from the empirical correlation of the instrument with
the dependent variable. In this respect, the IV estimator inherits the
desirable properties of the least squares solution and of the OLS estimator
that have been discussed in the previous lectures.
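The numerical equivalence just described can be checked directly. The sketch below computes the partialed-out ratio (10.37) and the δ-coefficient of the full just-identified IV estimator on the same simulated dataset (all data-generating numbers are illustrative assumptions) and finds them identical up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 500
X1 = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])   # exogenous x_{i1}
z = rng.normal(size=N)                                        # excluded instrument
eta = rng.normal(size=N)
s = X1 @ np.array([0.5, 1.0, -1.0]) + 2.0 * z + eta           # first stage
y = X1 @ np.array([1.0, 0.3, 0.7]) + 0.5 * s + 0.9 * eta + rng.normal(size=N)

# Residual-maker M_{X1} = I - X1 (X1'X1)^{-1} X1'
M = np.eye(N) - X1 @ np.linalg.solve(X1.T @ X1, X1.T)
delta_ratio = (z @ M @ y) / (z @ M @ s)                       # formula (10.37)

# Full just-identified IV estimator (Z'X)^{-1} Z'y; delta is its last entry
X = np.column_stack([X1, s])
Z = np.column_stack([X1, z])
beta_iv = np.linalg.solve(Z.T @ X, Z.T @ y)

print(delta_ratio, beta_iv[-1])   # numerically identical
```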

How can one perform statistical hypothesis tests based on IV estimates
of a linear model? The asymptotic properties of the IV estimator
are obtained under very general conditions, which can be deduced as special
cases of those of the Two-Stages Least Squares estimator and of the even more
general GMM estimators to be discussed later, and which resemble the six
White’s assumptions for the large sample properties of OLS. Semi-formally, if the matrix
$\mathbf{Z}^T\mathbf{X}$ is invertible, if the following probability limits:

$$\frac{1}{N}\sum_{i=1}^{N}\mathbf{z}_i\mathbf{x}_i^T \ \xrightarrow{\ p\ }\ \lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}\left[\mathbf{z}_i\mathbf{x}_i^T\right] \equiv \tilde{\mathbf{P}}_0 \tag{10.38}$$

$$\frac{1}{N}\sum_{i=1}^{N}\varepsilon_i^2\,\mathbf{z}_i\mathbf{z}_i^T \ \xrightarrow{\ p\ }\ \lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}\left[\varepsilon_i^2\,\mathbf{z}_i\mathbf{z}_i^T\right] \equiv \tilde{\boldsymbol\Psi}_0 \tag{10.39}$$
are finite and of full rank, and if the observations are independent so that
a suitable Central Limit Theorem can be extended to the random sequence
$\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\mathbf{z}_i\varepsilon_i$, then the limiting distribution of the IV estimator is:

$$\sqrt{N}\left(\hat{\boldsymbol\beta}_{IV}-\boldsymbol\beta_0\right)\ \xrightarrow{\ d\ }\ \mathcal{N}\left(\mathbf{0},\ \tilde{\mathbf{P}}_0^{-1}\,\tilde{\boldsymbol\Psi}_0\left(\tilde{\mathbf{P}}_0^{T}\right)^{-1}\right) \tag{10.40}$$
from which the asymptotic distribution follows accordingly. The asymptotic
variance is estimated, for inference purposes, through the following analogue
of the heteroscedasticity-consistent (HC) formula of OLS.
" N #−1 " N #" N #−1
  X X 2 X
[ HC β
Avar b IV = zi xT i yi − x T β
i
b IV zi zT
i xi z T i
i=1 i=1 i=1
(10.41)
It is a good exercise to work out the expression of the limiting variance
under homoscedasticity as well. Under instances of dependent observations,
(10.41) would obviously no longer work; in analogy with OLS, the clustering
case (CCE) allows the following estimator of the asymptotic variance:
" C #−1 " C #" C #−1
  X X X
[ CCE β
Avar b IV = Z T Xc
c ZT ec eT Zc
c c XT Zc c(10.42)
c=1 c=1 c=1

where $\mathbf{e}_c \equiv \mathbf{y}_c - \mathbf{X}_c\hat{\boldsymbol\beta}_{IV}$, whereas $\mathbf{Z}_c$ is the cluster-specific sub-matrix of Z,
in a similar way as $\mathbf{y}_c$ and $\mathbf{X}_c$ are the cluster-specific collections of the de-
pendent and explanatory variables, respectively. Similarly, HAC estimators
for time-series dependence, spatial correlation, or a combination of both,
are easily extended to the just-identified IV estimator as well. Importantly,
observe how in all these cases the estimated standard errors are inversely
proportional to the empirical covariance between the endogenous variables
and the instruments, which is subsumed in the matrix $\frac{1}{N}\sum_{i=1}^{N}\mathbf{z}_i\mathbf{x}_i^T$.
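A minimal sketch of the sandwich formula (10.41) follows; the simulated design (one endogenous regressor, one instrument, error variance depending on the instrument) and all its numbers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 2_000
z = rng.normal(size=N)                          # instrument
eta = rng.normal(size=N)
x = 1.0 + 1.5 * z + eta                         # endogenous regressor
eps = 0.7 * eta + (0.5 + 0.5 * z**2) * rng.normal(size=N)   # heteroscedastic error
y = 2.0 + 1.0 * x + eps

X = np.column_stack([np.ones(N), x])
Z = np.column_stack([np.ones(N), z])

beta_iv = np.linalg.solve(Z.T @ X, Z.T @ y)
e = y - X @ beta_iv                             # IV residuals

bread = np.linalg.inv(Z.T @ X)                  # [sum z_i x_i']^{-1}
meat = (Z * e[:, None] ** 2).T @ Z              # sum e_i^2 z_i z_i'
avar = bread @ meat @ bread.T                   # sandwich (10.41)
se = np.sqrt(np.diag(avar))                     # HC standard errors

print(beta_iv, se)
```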

Two-Stages Least Squares

It turns out, as already hinted, that the IV estimator is a particular case
of the more general (overidentified) Two-Stages Least Squares (2SLS)
estimator. This estimator typically applies to cases where the researcher
has more instruments available than endogenous variables. In other
terms, the vector of exogenous instruments has redundant dimensions: this
can be written as $|\mathbf{z}_{i2}| = J_2 > K_2$ or, equivalently, as $|\mathbf{z}_i| = J > K$. While
one could in principle obtain different IV estimators for each appropriate
subset of zi – in this respect the model is overidentified – intuitively it
appears useful to exploit the additional information that is contained in the
“redundant” instruments, simultaneously. The 2SLS estimator was designed
to perform precisely this task; it is traditionally attributed to Theil (1953).
The estimator is best illustrated in terms of the two abstract steps, or
stages, by which it is constructed, and which give it its name.
1. In the first stage, one shall perform a set of linear projections, one
for each of the (possibly endogenous) regressors xi of the main model of
interest, onto the set of exogenous variables (including the instrumental
variables) zi . This results in a set of “projection” vectors $\hat{\mathbf{x}}_i$ equal to:

$$\hat{\mathbf{x}}_i^T = \mathbf{z}_i^T\left(\sum_{i=1}^{N}\mathbf{z}_i\mathbf{z}_i^T\right)^{-1}\left(\sum_{i=1}^{N}\mathbf{z}_i\mathbf{x}_i^T\right) \tag{10.43}$$

to be collected, in compact matrix notation, as the N × K matrix:

$$\hat{\mathbf{X}} = \mathbf{Z}\left(\mathbf{Z}^T\mathbf{Z}\right)^{-1}\mathbf{Z}^T\mathbf{X} = \mathbf{P}_Z\,\mathbf{X} \tag{10.44}$$

where $\mathbf{P}_Z \equiv \mathbf{Z}\left(\mathbf{Z}^T\mathbf{Z}\right)^{-1}\mathbf{Z}^T$ is the projection matrix based on the
exogenous variables, collected by the matrix Z of dimension N × J.
This is equivalent to running some first stage regressions:

$$x_{ki} = \mathbf{z}_i^T\boldsymbol\pi_{k0} + \eta_{ki} \tag{10.45}$$

and calculating the fitted values $\hat{x}_{ki} = \mathbf{z}_i^T\hat{\boldsymbol\pi}_{k,OLS}$. Observe that if xki is
contained in zi , it must be $\hat{x}_{ki} = x_{ki}$, since a vector projected onto it-
self returns the input vector. Also note the similarity between (10.45)
and the second equations of (10.20) and (10.35). This explains why, in
the terminology of triangular SEMs, the relationships that “explain”
endogenous regressors xki are called First Stage equations.
2. In the second stage, the 2SLS estimator of β0 is obtained by running
an OLS regression of yi onto the “projected regressors” $\hat{\mathbf{x}}_i$.

$$y_i = \hat{\mathbf{x}}_i^T\boldsymbol\beta_0 + \varepsilon_i \tag{10.46}$$

A 2SLS estimator can be written simply as the OLS estimator of (10.46):

$$\hat{\boldsymbol\beta}_{2SLS} = \left(\sum_{i=1}^{N}\hat{\mathbf{x}}_i\hat{\mathbf{x}}_i^T\right)^{-1}\sum_{i=1}^{N}\hat{\mathbf{x}}_i\, y_i \tag{10.47}$$

however, it is far more convenient to write it in compact matrix notation:

$$\hat{\boldsymbol\beta}_{2SLS} = \left(\hat{\mathbf{X}}^T\hat{\mathbf{X}}\right)^{-1}\hat{\mathbf{X}}^T\mathbf{y} = \left(\mathbf{X}^T\mathbf{P}_Z\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{P}_Z\,\mathbf{y} \tag{10.48}$$
To understand why the two expressions above are equivalent, recall that a
projection matrix is symmetric and idempotent. As it has been mentioned,
in the just-identified case (J = K), the 2SLS and the IV estimators coincide:

$$\begin{aligned}\hat{\boldsymbol\beta}_{2SLS} &= \left(\mathbf{X}^T\mathbf{P}_Z\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{P}_Z\,\mathbf{y} \\ &= \left[\mathbf{X}^T\mathbf{Z}\left(\mathbf{Z}^T\mathbf{Z}\right)^{-1}\mathbf{Z}^T\mathbf{X}\right]^{-1}\mathbf{X}^T\mathbf{Z}\left(\mathbf{Z}^T\mathbf{Z}\right)^{-1}\mathbf{Z}^T\mathbf{y} \\ &= \left(\mathbf{Z}^T\mathbf{X}\right)^{-1}\left(\mathbf{Z}^T\mathbf{Z}\right)\left(\mathbf{X}^T\mathbf{Z}\right)^{-1}\mathbf{X}^T\mathbf{Z}\left(\mathbf{Z}^T\mathbf{Z}\right)^{-1}\mathbf{Z}^T\mathbf{y} \\ &= \left(\mathbf{Z}^T\mathbf{X}\right)^{-1}\mathbf{Z}^T\mathbf{y} \\ &= \hat{\boldsymbol\beta}_{IV}\end{aligned}$$

where the third line is only possible if X and Z have the same (column)
dimensions. The usual decomposition gives:
$$\hat{\boldsymbol\beta}_{2SLS} = \boldsymbol\beta_0 + \left(\frac{1}{N}\sum_{i=1}^{N}\hat{\mathbf{x}}_i\hat{\mathbf{x}}_i^T\right)^{-1}\frac{1}{N}\sum_{i=1}^{N}\hat{\mathbf{x}}_i\varepsilon_i \tag{10.49}$$

$$\hat{\boldsymbol\beta}_{2SLS} = \boldsymbol\beta_0 + \left(\frac{1}{N}\mathbf{X}^T\mathbf{P}_Z\mathbf{X}\right)^{-1}\frac{1}{N}\mathbf{X}^T\mathbf{P}_Z\,\boldsymbol\varepsilon \tag{10.50}$$
N N
which shows, in conjuction with the geometric interpretation of the 2SLS
estimator, why the latter is consistent. In fact, the operation of projecting
the possibly endogenous regressors xi onto the space which is spanned by the
exogenous variables zi – call it S (Z) – generates a set of “fitted” regressors
bki that, by construction, lie on S (Z), and are consequently orthogonal to
x
the disturbance vector ε. Again, this is by construction because of (10.28);
see figure 10.3 for the geometric intuition. Since E [b
xi εi ] = 0, it is:
N
1 T 1 X p
X PZ ε = b i εi → 0
x (10.51)
N N i=1
p
implying consistency of 2SLS estimator, i.e. β
b 2SLS → β0 .
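The two-stage construction can be verified numerically. The sketch below, on simulated overidentified data (two instruments, one endogenous regressor; all data-generating numbers are illustrative assumptions), computes the 2SLS estimator both by literally running the two stages and through the one-shot formula (10.48).

```python
import numpy as np

rng = np.random.default_rng(3)
N = 1_000
z1, z2 = rng.normal(size=N), rng.normal(size=N)   # two excluded instruments
eta = rng.normal(size=N)
x = 0.8 * z1 + 0.5 * z2 + eta                     # one endogenous regressor
y = 1.0 + 2.0 * x + 0.9 * eta + rng.normal(size=N)

X = np.column_stack([np.ones(N), x])              # K = 2
Z = np.column_stack([np.ones(N), z1, z2])         # J = 3 > K: overidentified

# First stage: X_hat = P_Z X (projection of X onto the column space of Z)
X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
# Second stage: OLS of y on X_hat
beta_two_stage = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)
# One-shot formula (10.48): (X' P_Z X)^{-1} X' P_Z y
beta_2sls = np.linalg.solve(X.T @ X_hat, X_hat.T @ y)

print(beta_two_stage, beta_2sls)   # identical up to floating-point error
```

The two routes coincide exactly because $\mathbf{P}_Z$ is symmetric and idempotent, so $\hat{\mathbf{X}}^T\hat{\mathbf{X}} = \mathbf{X}^T\mathbf{P}_Z\mathbf{X}$.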

[Figure 10.3 depicts a column X∗,k of X together with its projection X̂∗,k onto the plane S (Z) spanned by Z∗,1 and Z∗,2 ; both the residual X∗,k − X̂∗,k and the disturbance vector ε are orthogonal (at 90◦ ) to S (Z).]
Figure 10.3: The geometric interpretation of the 2SLS estimator

Example 10.2. Mincer, Revisited (again). Let us return yet once more
to the Mincer equation, and the attempt to address the endogeneity bias
due to the omission of ability αi which is explored in example 10.1. One can
obtain the IV estimator in two alternative ways; both require specifying a
first stage equation for education, like:

$$S_i = \pi_0 + \pi_1 X_i + \pi_2 X_i^2 + \pi_3 Z_i + \eta_i \tag{10.52}$$

note that this is a linear version of the structural equation for ability (7.7)
from example 7.2, with the inclusion of a squared term for experience (as
in the “structural” Mincer equation) but the exclusion of ability itself. The
resulting reduced form of the model is:
$$\begin{aligned} \log W_i &= \beta_0 + \beta_3\pi_0 + (\beta_1 + \beta_3\pi_1)\,X_i + (\beta_2 + \beta_3\pi_2)\,X_i^2 \\ &\quad + \beta_3\pi_3\, Z_i + \alpha_i + \upsilon_i + \beta_3\eta_i \\ S_i &= \pi_0 + \pi_1 X_i + \pi_2 X_i^2 + \pi_3 Z_i + \eta_i \end{aligned} \tag{10.53}$$

which, once again, is just identified as long as $\mathbb{E}[\,\alpha_i, \upsilon_i, \eta_i \,|\, X_i, Z_i\,] = 0$.

By the Frisch-Waugh-Lovell Theorem, a consistent estimate of β3 which
is numerically identical to the IV estimator can be obtained – see (10.37) –
as:

$$\hat{\beta}_3 = \frac{\widehat{\beta_3\pi_3}}{\hat{\pi}_3} = \frac{\mathbf{z}^T\mathbf{M}_X\mathbf{y}}{\mathbf{z}^T\mathbf{M}_X\mathbf{s}}$$

and the other coefficients of the Mincer equation can similarly be backed
out through the reduced form estimates. Alternatively, one could obtain
exactly the same numerical estimate through the linear projection of Si that
results from the First Stage estimation of (10.52):

$$\hat{S}_i = \hat{\pi}_0 + \hat{\pi}_1 X_i + \hat{\pi}_2 X_i^2 + \hat{\pi}_3 Z_i$$
and the just-identified IV-2SLS estimator results from running OLS on the
following model.
$$\log W_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \beta_3\hat{S}_i + \varepsilon_i$$
However, the 2SLS estimator would work just as well in an overidentified
setting where the researcher has access to redundant instruments. Suppose
that there are two additional valid instruments Gi and Fi ; the First Stage
equation would then read as:
$$S_i = \pi_0 + \pi_1 X_i + \pi_2 X_i^2 + \pi_3 Z_i + \pi_4 G_i + \pi_5 F_i + \eta_i \tag{10.54}$$
and 2SLS estimation would proceed as described. In Lecture 12, which is
about the more general GMM framework, this case is revisited in order to
develop an example of an overidentification test, while making at the same
time some substantive examples of “additional” instruments Gi and Fi . 
It remains to establish the remaining asymptotic properties of the 2SLS
estimator, especially with regard to the variance. These are now presented
more formally than how it was done in the context of the just-identified IV
estimator; it is a good exercise to derive the asymptotic properties of the
latter as a particular case. In what follows, some additional assumptions of
the generalized linear model – one that allows for instrumental variables
and possibly overidentification – are stated more rigorously.
Assumption 9. Independently but not identically distributed IVs.
The observations in the sample $\{(y_i, \mathbf{x}_i, \mathbf{z}_i)\}_{i=1}^{N}$ are independently, but not
necessarily identically, distributed (i.n.i.d.).
Assumption 10. Exogeneity of the Instruments. Conditional on the
J ≥ K instruments $\mathbf{z}_i$, the error term εi has mean zero.

$$\mathbb{E}[\,\varepsilon_i \,|\, \mathbf{z}_i\,] = 0 \tag{10.55}$$


Assumption 11. Asymptotics of the Projected Regressors. The following
probability limit exists, is finite, and has full rank.

$$\frac{1}{N}\mathbf{X}^T\mathbf{P}_Z\mathbf{X} \ \xrightarrow{\ p\ }\ \mathbf{P}_0 \tag{10.56}$$
Assumption 12. Heteroscedastic, Independent Errors. The variance
of the error term εi , conditional on the instruments zi , is unrestricted (heteroscedasticity),
while the conditional covariance between two error terms
from two different observations i, j = 1, . . . , N is zero.

$$\mathbb{E}\left[\varepsilon_i^2\,\middle|\,\mathbf{z}_i\right] = \sigma^2(\mathbf{z}_i) \equiv \sigma_i^2 \tag{10.57}$$

$$\mathbb{E}\left[\varepsilon_i\varepsilon_j\,\middle|\,\mathbf{z}_i,\mathbf{z}_j\right] = 0 \quad \text{for } i \neq j \tag{10.58}$$
Assumption 13. Asymptotics of the Projected Regressors interacted
with the Errors. Given the following diagonal matrix of squared errors:

$$\mathbf{E} \equiv \begin{pmatrix} \varepsilon_1^2 & 0 & \cdots & 0 \\ 0 & \varepsilon_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \varepsilon_N^2 \end{pmatrix} \tag{10.59}$$

the following probability limit exists, is finite, and has full rank.

$$\frac{1}{N}\mathbf{X}^T\mathbf{P}_Z\,\mathbf{E}\,\mathbf{P}_Z\mathbf{X} \ \xrightarrow{\ p\ }\ \boldsymbol\Psi_0 \tag{10.60}$$

In addition, conditions hold so that the following Central Limit Theorem
result applies.

$$\frac{1}{\sqrt{N}}\mathbf{X}^T\mathbf{P}_Z\,\boldsymbol\varepsilon \ \xrightarrow{\ d\ }\ \mathcal{N}\left(\mathbf{0},\ \boldsymbol\Psi_0\right) \tag{10.61}$$
These assumptions are worth discussing briefly in relation to the analogous
(White’s) hypotheses for OLS. Assumption 9 simply extends
Assumption 2 to the instruments listed in zi . Assumption 10 was discussed
multiple times; it is necessary to obtain (10.51) and hence the consistency of the
2SLS estimator. Assumption 11 allows one to establish a condition analogous to
(8.6), for the sake of establishing and estimating the asymptotic variance.
Assumption 12 characterizes the concept of heteroscedasticity in the context
of instrumental variables. Finally, Assumption 13 ensures that some Central
Limit Theorem can be appropriately extended to the 2SLS estimator as well.
Observe that some of these assumptions might be founded – more rigorously
– on more primitive hypotheses (like conditions on specific moments, e.g.
Ljapunov’s); this is not pursued here for the sake of conciseness.
In light of these assumptions, one can finally establish the asymptotic
properties of the 2SLS estimator.


Theorem 10.1. Large Sample properties of the 2SLS Estimator.
Under Assumptions 1, 3 and 9-13, the 2SLS estimator is consistent:

$$\hat{\boldsymbol\beta}_{2SLS}\ \xrightarrow{\ p\ }\ \boldsymbol\beta_0 \tag{10.62}$$

and it is asymptotically normal, that is:

$$\sqrt{N}\left(\hat{\boldsymbol\beta}_{2SLS}-\boldsymbol\beta_0\right)\ \xrightarrow{\ d\ }\ \mathcal{N}\left(\mathbf{0},\ \mathbf{P}_0^{-1}\boldsymbol\Psi_0\mathbf{P}_0^{-1}\right) \tag{10.63}$$

hence its asymptotic distribution, for a given N, is as follows.

$$\hat{\boldsymbol\beta}_{2SLS}\ \overset{A}{\sim}\ \mathcal{N}\left(\boldsymbol\beta_0,\ \frac{1}{N}\mathbf{P}_0^{-1}\boldsymbol\Psi_0\mathbf{P}_0^{-1}\right) \tag{10.64}$$

Proof. The proof is analogous to the one for the OLS case; it exploits the
decomposition (10.50), the asymptotic results from Assumptions 11 and 13,
as well as Slutskij’s Theorem and the Cramér-Wold device.
At this point, asymptotic properties such as these should appear familiar,
at least by comparison with OLS. In this case, though, the expressions
for the estimators of the asymptotic variance are less straightforward. The
heteroscedasticity-robust estimator of the asymptotic variance follows from
(10.56) and (10.60), and it reads:

$$\widehat{\mathrm{Avar}}\left[\hat{\boldsymbol\beta}_{2SLS}\right] = \frac{1}{N}\hat{\mathbf{P}}_N^{-1}\,\hat{\boldsymbol\Psi}_N\,\hat{\mathbf{P}}_N^{-1} \tag{10.65}$$

where:

$$\hat{\mathbf{P}}_N \equiv \frac{1}{N}\mathbf{X}^T\mathbf{Z}\left(\mathbf{Z}^T\mathbf{Z}\right)^{-1}\mathbf{Z}^T\mathbf{X} \tag{10.66}$$

$$\hat{\boldsymbol\Psi}_N \equiv \frac{1}{N}\mathbf{X}^T\mathbf{Z}\left(\mathbf{Z}^T\mathbf{Z}\right)^{-1}\mathbf{Z}^T\hat{\mathbf{E}}_N\mathbf{Z}\left(\mathbf{Z}^T\mathbf{Z}\right)^{-1}\mathbf{Z}^T\mathbf{X} \tag{10.67}$$

with $\hat{\mathbf{P}}_N \xrightarrow{\ p\ } \mathbf{P}_0$ and $\hat{\boldsymbol\Psi}_N \xrightarrow{\ p\ } \boldsymbol\Psi_0$; while $\hat{\mathbf{E}}_N$ is the following diagonal matrix:

$$\hat{\mathbf{E}}_N \equiv \begin{pmatrix} e_1^2 & 0 & \cdots & 0 \\ 0 & e_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & e_N^2 \end{pmatrix} \tag{10.68}$$

where $e_i \equiv y_i - \mathbf{x}_i^T\hat{\boldsymbol\beta}_{2SLS}$ for i = 1, . . . , N. Naturally, this would not work
with dependent observations, as Assumption 13 would fail; estimating the
appropriate “meat” matrix $\boldsymbol\Psi_0$ would require a CCE approach, as follows:
" C #
1 X −1 −1 p
XT T
ZT T T
ZT
 
b CCE ≡
Ψ c Zc Zc Zc c ec ec Zc Zc Zc c X c → Ψ0
N c=1
(10.69)
where ec ≡ yc − Xc β2SLS . Clearly, analogous HAC estimators also exist.
b
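A minimal sketch of the estimator (10.65)-(10.68) on simulated data follows; the overidentified design and all its data-generating numbers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 2_000
Zr = rng.normal(size=(N, 2))                      # two excluded instruments
eta = rng.normal(size=N)
x = Zr @ np.array([1.0, 0.6]) + eta               # endogenous regressor
eps = 0.8 * eta + (1.0 + 0.5 * Zr[:, 0] ** 2) * rng.normal(size=N)
y = 0.5 + 1.5 * x + eps

X = np.column_stack([np.ones(N), x])
Z = np.column_stack([np.ones(N), Zr])

PzX = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)       # P_Z X
beta = np.linalg.solve(X.T @ PzX, PzX.T @ y)      # 2SLS estimator
e = y - X @ beta                                  # residuals e_i

P_hat = (X.T @ PzX) / N                           # (10.66)
Psi_hat = (PzX * e[:, None] ** 2).T @ PzX / N     # (10.67), as E_N is diagonal
avar = np.linalg.inv(P_hat) @ Psi_hat @ np.linalg.inv(P_hat) / N   # (10.65)
se = np.sqrt(np.diag(avar))

print(beta, se)
```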


The case of homoscedasticity is of particular interest in the 2SLS setting.
Here, homoscedasticity implies the variance-independence of the error term
with respect to the instruments zi in the following sense:

$$\mathbb{E}\left[\boldsymbol\varepsilon\boldsymbol\varepsilon^T\,\middle|\,\mathbf{Z}\right] = \sigma_0^2\,\mathbf{I}$$
and one can show that:

$$\frac{1}{N}\mathbf{X}^T\mathbf{P}_Z\,\mathbf{E}\,\mathbf{P}_Z\mathbf{X} \ \xrightarrow{\ p\ }\ \sigma_0^2\cdot\lim_{N\to\infty}\frac{1}{N}\mathbf{X}^T\mathbf{P}_Z\mathbf{X} = \sigma_0^2\,\mathbf{P}_0 \tag{10.70}$$

hence, $\boldsymbol\Psi_0 = \sigma_0^2\mathbf{P}_0$ and:

$$\hat{\boldsymbol\beta}_{2SLS}\ \overset{A}{\sim}\ \mathcal{N}\left(\boldsymbol\beta_0,\ \frac{\sigma_0^2}{N}\mathbf{P}_0^{-1}\right) \tag{10.71}$$
which, for $\mathbf{e} \equiv \mathbf{y} - \mathbf{X}\hat{\boldsymbol\beta}_{2SLS}$, is estimated as follows.

$$\widehat{\mathrm{Avar}}\left[\hat{\boldsymbol\beta}_{2SLS}\right] = \frac{\mathbf{e}^T\mathbf{e}}{N}\left[\mathbf{X}^T\mathbf{Z}\left(\mathbf{Z}^T\mathbf{Z}\right)^{-1}\mathbf{Z}^T\mathbf{X}\right]^{-1} \tag{10.72}$$
A result analogous to the Gauss-Markov Theorem for OLS shows that,
under homoscedasticity, the 2SLS estimator is the most efficient linear estimator
that can be constructed under the “exogeneity” hypothesis (10.55).
The traditional practice of econometrics emphasizes the importance of gathering
as many instrumental variables as possible, in order to approximate as
closely as possible the efficiency bound represented by (10.71).
Example 10.3. 2SLS and structural endogeneity. This efficiency re-
sult is especially useful to address issues of “structural endogeneity.” Con-
sider the time series model (10.16); through iterated substitution, it can be
restated as:
$$y_t = \sum_{s=0}^{t-1}\beta_1^s\left(\beta_0 + \mathbf{x}_{t-s}^T\boldsymbol\gamma_0 + \mathbf{x}_{t-1-s}^T\boldsymbol\gamma_1\right) + \sum_{s=0}^{t-1}\beta_1^s\sum_{z=0}^{t-s}\rho^z\,\xi_{t-s-z}$$

(this assumes the existence of some (x0 , ξ0 ) at t = 0). The above relation-
ship, lagged by one further period, applies to the endogenous yt−1 variable as
well. Hence, all valid lags xt−s for s ≥ 2 can be combined into the instru-
ments vector zt in a 2SLS framework. If, in addition, ξt is homoscedastic,
this leads to efficient estimates. In a similar vein, the solution of the spatial
model (10.17) can be rephrased, by standard results of linear algebra, as:
$$\mathbf{y} = \sum_{s=0}^{\infty}\beta_1^s\,\mathbf{W}^s\left(\beta_0\boldsymbol\iota + \mathbf{X}\boldsymbol\gamma_0 + \mathbf{W}\mathbf{X}\boldsymbol\gamma_1 + \boldsymbol\varepsilon\right)$$

thus, all the linearly independent columns of the matrices in $\{\mathbf{W}^s\mathbf{X}\}_{s=2}^{\infty}$
can enter the instruments matrix Z (Kelejian and Prucha, 1998; Bramoullé
et al., 2009); with homoscedastic errors, this leads to efficient estimates. 


10.3 Instrumental Variables in Practice


The previous section develops the basic theory of IV-2SLS estimation. Due
to the importance of these estimators in applied econometric practice,
it is worth separately discussing a series of technical aspects that are often
useful in actual econometric research. In this section, the following topics
are discussed: control function approaches and tests for endogeneity
(these are options available to researchers), as well as the small sample
bias of IV-2SLS and the issue of weak instruments (these, instead, are
cautionary warnings about the possible misuse of instrumental variables).

Control function approaches


A control function approach is a method to address endogeneity in an
econometric model which is predicated on a specification of the endogeneity
problem that is “built into” the error term εi . This definition is quite general,
but in the case of linear models the control function approach is actually
complementary to IV-2SLS, and it is possible so long as there are at least as
many exogenous instruments as there are endogenous explanatory variables.
With J2 ≥ K2 instruments, it works as follows:

1. one shall first run some K2 first stage regressions like (10.45) (one for
each of the endogenous variables) and calculate the residuals from
these equations: $\hat\eta_{ki} \equiv x_{ki} - \mathbf{z}_i^T\hat{\boldsymbol\pi}_{k,OLS} = x_{ki} - \hat{x}_{ki}$ for k = 1, . . . , K2 ;

2. in the alternative second stage, one would run OLS regressions on the
structural form augmented with the estimated residuals:

$$y_i = \mathbf{x}_i^T\boldsymbol\beta_0 + \hat{\boldsymbol\eta}_i^T\boldsymbol\rho_0 + \varsigma_i \tag{10.73}$$

where $\hat{\boldsymbol\eta}_i = \left(\hat\eta_{1i}\ \hat\eta_{2i}\ \dots\ \hat\eta_{K_2 i}\right)^T$, while ρ0 collects the K2 parameters
associated with each set of residuals and ςi is some new error term.
The OLS estimates of (10.73) are actually consistent for both β0 and ρ0 .
A semi-formal argument, intuitive if convoluted, is provided next.
Consider, for k = 1, . . . , K2 , the first stage model for the k-th endogenous
variable (10.45); and note that, by construction:

$$\mathbb{E}[\eta_{ki}\varepsilon_i] = \mathbb{E}\left[\left(X_{ki} - \mathbf{z}_i^T\boldsymbol\pi_{k0}\right)\varepsilon_i\right] = \underbrace{\mathbb{E}[X_{ki}\varepsilon_i]}_{\neq 0} - \underbrace{\mathbb{E}[\mathbf{z}_i\varepsilon_i]^T}_{=\mathbf{0}}\boldsymbol\pi_{k0} \neq 0$$

therefore, the error term of the k-th equation ηki appears to contain some
statistical information about what makes variable Xki endogenous in the


first place (in fact this is so by construction, since ηki is the residual from
the projection of Xki onto the exogenous instruments). One might want to
extend this intuition to all the other K2 first stage residuals ηki by specifying
a statistical model for the error term, also called a control function.
If one supposes that such a model is linear:

$$\varepsilon_i = \boldsymbol\eta_i^T\boldsymbol\rho_0 + \xi_i \tag{10.74}$$

where, in the population:

$$\boldsymbol\rho_0 \equiv \left(\mathbb{E}\left[\boldsymbol\eta_i\boldsymbol\eta_i^T\right]\right)^{-1}\mathbb{E}[\boldsymbol\eta_i\varepsilon_i]$$

is the linear projection of εi onto $\boldsymbol\eta_i = \left(\eta_{1i}\ \eta_{2i}\ \dots\ \eta_{K_2 i}\right)^T$, while ξi is
defined residually. The expression $\boldsymbol\eta_i^T\boldsymbol\rho_0$ represents the extent of the endogeneity
contained in the model. Note that, by assumption:

$$\mathbb{E}[\mathbf{z}_i\xi_i] = \underbrace{\mathbb{E}[\mathbf{z}_i\varepsilon_i]}_{=\mathbf{0}} - \underbrace{\mathbb{E}\left[\mathbf{z}_i\boldsymbol\eta_i^T\right]\boldsymbol\rho_0}_{=\mathbf{0}} = \mathbf{0}$$

because the instruments are exogenous; moreover, for k = 1, . . . , K2 :

$$\mathbb{E}[\eta_{ki}\xi_i] = 0$$

by definition of linear projection, that is by construction, and:

$$\mathbb{E}[X_{ki}\xi_i] = \underbrace{\mathbb{E}[\mathbf{z}_i\xi_i]^T\boldsymbol\pi_{k0}}_{=0} + \underbrace{\mathbb{E}[\eta_{ki}\xi_i]}_{=0} = 0$$

as implied by the observations above. Substituting (10.74) into the original
structural equation $y_i = \mathbf{x}_i^T\boldsymbol\beta_0 + \varepsilon_i$ gives:

$$y_i = \mathbf{x}_i^T\boldsymbol\beta_0 + \boldsymbol\eta_i^T\boldsymbol\rho_0 + \xi_i \tag{10.75}$$

which could be consistently estimated by OLS if only $\boldsymbol\eta_i$ could be observed.
While $\boldsymbol\eta_i$ cannot be observed by definition, it can definitely be estimated in
the first stage; hence, (10.73) matches (10.75) for

$$\varsigma_i \equiv \left(\boldsymbol\eta_i - \hat{\boldsymbol\eta}_i\right)^T\boldsymbol\rho_0 + \xi_i = \sum_{k=1}^{K_2}\mathbf{z}_i^T\left(\boldsymbol\pi_{k0} - \hat{\boldsymbol\pi}_{k,OLS}\right)\rho_{k0} + \xi_i$$

and it can be shown that the first component of this expression is conditionally
mean independent of $(\mathbf{x}_i, \hat{\boldsymbol\eta}_i)$, because it only depends on the statistical
noise in the estimation of the first stage models. Consequently:

$$\mathbb{E}[\mathbf{x}_i\varsigma_i] = \mathbf{0} \qquad\text{and}\qquad \mathbb{E}[\hat{\boldsymbol\eta}_i\varsigma_i] = \mathbf{0}$$
hence, OLS estimation of (10.73) is both consistent and, for β0 , equivalent
to IV-2SLS. The classical, full-fledged proof of this result is based on a vari-
ation of the Frisch-Waugh-Lovell Theorem and the algebra of projections.
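The equivalence between the control function approach and 2SLS can be verified numerically. The sketch below uses one endogenous regressor and two instruments; all data-generating numbers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 1_000
Z = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])   # constant + 2 IVs
eta = rng.normal(size=N)
x = Z @ np.array([0.2, 1.0, 0.7]) + eta                      # first stage
y = 1.0 + 0.5 * x + 0.6 * eta + rng.normal(size=N)           # endogenous by design

# Step 1: first-stage OLS residuals eta_hat
eta_hat = x - Z @ np.linalg.solve(Z.T @ Z, Z.T @ x)

# Step 2: OLS of y on (1, x, eta_hat), as in (10.73)
W = np.column_stack([np.ones(N), x, eta_hat])
coef_cf = np.linalg.solve(W.T @ W, W.T @ y)                  # (beta0, delta, rho)

# Benchmark: 2SLS on the same data
Xmat = np.column_stack([np.ones(N), x])
PzX = Z @ np.linalg.solve(Z.T @ Z, Z.T @ Xmat)
beta_2sls = np.linalg.solve(Xmat.T @ PzX, PzX.T @ y)

print(coef_cf[:2], beta_2sls)   # structural coefficients coincide exactly
```

The exact coincidence follows from the projection algebra mentioned in the text: the fitted values and the first-stage residuals are orthogonal by construction.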


In practice, control function approaches are seldom used for linear mod-
els. With respect to IV-2SLS, in fact, they entail a few shortcomings: first,
they might not work too well if the endogenous variables entered the struc-
tural form non-linearly (with higher-order terms, interactions etc.); second,
they can be shown to be less efficient and to produce larger standard errors.
One might wonder, then, what control function approaches are useful for.
Not only do they play a role in the tests for endogeneity, as mentioned
below, but they can actually be convenient for extending instrumental
variables to non-linear models. In fact, IVs can be combined with non-linear
models in a variety of ways (see e.g. the discussion in Lecture 12 about the
GMM approach) but in practical terms, these often entail complications of
computational or statistical kind. Conversely, control function approaches
are very flexible; they typically entail augmenting a non-linear model with
the inclusion of some function of the residuals obtained from the first-stage
regressions of the endogenous variables.

Tests for endogeneity


After performing IV-2SLS estimation, researchers might ask themselves if
anything was gained in terms of the overall quality of their estimates, that
is, whether they differ from the OLS results in a way that reveals the presence
of substantial endogeneity in the original model. This can be formulated as a
test where the null hypothesis is as follows.

$$H_0:\ \mathbb{E}[\mathbf{x}_i\varepsilon_i] = \mathbf{0}$$

The above implies that OLS and IV-2SLS have the same probability limit;
thus, an operationally more convenient null hypothesis can be:

$$H_0:\ \operatorname{plim}\hat{\boldsymbol\beta}_{OLS} - \operatorname{plim}\hat{\boldsymbol\beta}_{2SLS} = \mathbf{0}$$

which suggests a natural test statistic.

$$\tilde{H} = \left(\hat{\boldsymbol\beta}_{OLS}-\hat{\boldsymbol\beta}_{2SLS}\right)^T\left\{\widehat{\mathrm{Avar}}\left[\hat{\boldsymbol\beta}_{OLS}-\hat{\boldsymbol\beta}_{2SLS}\right]\right\}^{-1}\left(\hat{\boldsymbol\beta}_{OLS}-\hat{\boldsymbol\beta}_{2SLS}\right)\ \xrightarrow{\ d\ }\ \chi^2_K$$

Under the null hypothesis, this quadratic form should be close to zero. The
problem is that, in general, it is hard to derive an exact expression for the
variance of the difference between two estimators, like:

$$\mathrm{Var}\left[\hat{\boldsymbol\beta}_{OLS}-\hat{\boldsymbol\beta}_{2SLS}\right] = \mathrm{Var}\left[\hat{\boldsymbol\beta}_{OLS}\right] + \mathrm{Var}\left[\hat{\boldsymbol\beta}_{2SLS}\right] - 2\,\mathrm{Cov}\left[\hat{\boldsymbol\beta}_{OLS},\hat{\boldsymbol\beta}_{2SLS}\right]$$

since the covariance component is generally unknown. However, Hausman
(1978) showed that if one of the two estimators is efficient, as is the case for
the OLS estimator under i.i.d. homoscedastic errors (by the Gauss-Markov
Theorem), the unknown covariance is actually equal to the variance of the
efficient estimator, therefore:

$$\mathrm{Cov}\left[\hat{\boldsymbol\beta}_{OLS},\,\hat{\boldsymbol\beta}_{2SLS}\right] = \mathrm{Var}\left[\hat{\boldsymbol\beta}_{OLS}\right]$$

which allows one to formulate the Hausman test statistic:³

$$H = \left(\hat{\boldsymbol\beta}_{OLS}-\hat{\boldsymbol\beta}_{2SLS}\right)^T\left\{\widehat{\mathrm{Avar}}\left[\hat{\boldsymbol\beta}_{2SLS}\right]-\widehat{\mathrm{Avar}}\left[\hat{\boldsymbol\beta}_{OLS}\right]\right\}^{-1}\left(\hat{\boldsymbol\beta}_{OLS}-\hat{\boldsymbol\beta}_{2SLS}\right)$$

with $H \xrightarrow{\ d\ } \chi^2_K$ asymptotically. This statistic is easily calculated in the data.
Unfortunately, in regression analysis the Hausman test is limited to the
case of i.i.d. homoscedastic errors, or to other scenarios where an alternative
estimator is evaluated against an efficient benchmark (such as fixed effects
vs. random effects in panel data). Even with i.i.d. errors, however, it can be
shown that the Hausman test converges in probability to a Wald statistic
formed out of the ρ bOLS from the control function estimator of (10.73)! The
intuition is simple in light of the earlier discussion about control function:
the null hypothesis of “no endogeneity” is equivalent to:
$$H_0:\ \boldsymbol\rho_0 = \mathbf{0}$$

since it implies $\mathbb{E}[\boldsymbol\eta_i\varepsilon_i] = \mathbf{0}$ and hence $\mathbb{E}[\mathbf{x}_i\varepsilon_i] = \mathbf{0}$. This observation is not only useful
   

as an alternative route for calculating the Hausman test. In fact, it suggests


methods for performing “regression-based” tests for endogeneity with
heteroscedastic or dependent errors, which are allowed by control function
approaches. Various tests of this sort exist; see e.g. Wooldridge (1995).
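A sketch of the simplest regression-based check follows: with i.i.d. homoscedastic errors, a t-test on the coefficient of the first-stage residual in (10.73) serves as an endogeneity test. The simulated design (with endogeneity deliberately built in, so that the test should reject) is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(6)
N = 2_000
z = rng.normal(size=N)
eta = rng.normal(size=N)
x = 1.2 * z + eta
y = 1.0 + 0.5 * x + 0.7 * eta + rng.normal(size=N)    # endogenous by design

Z = np.column_stack([np.ones(N), z])
eta_hat = x - Z @ np.linalg.solve(Z.T @ Z, Z.T @ x)   # first-stage residual

W = np.column_stack([np.ones(N), x, eta_hat])         # augmented regression (10.73)
coef = np.linalg.solve(W.T @ W, W.T @ y)
resid = y - W @ coef
s2 = resid @ resid / (N - W.shape[1])                 # homoscedastic sigma^2
var_coef = s2 * np.linalg.inv(W.T @ W)
t_rho = coef[2] / np.sqrt(var_coef[2, 2])             # t-statistic on rho_hat

print(t_rho)   # a large |t| leads to rejecting H0 of exogeneity
```

With heteroscedastic or clustered errors, the same regression would simply be paired with the corresponding robust standard errors.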

Small sample bias of IV-2SLS


One might think that, since the IV-2SLS estimator is consistent, in analogy
with OLS it is also unbiased in small samples. This is false. Observe that:

$$\begin{aligned}\mathbb{E}\left[\hat{\boldsymbol\beta}_{2SLS}\right] &= \boldsymbol\beta_0 + \mathbb{E}\left[\left(\mathbf{X}^T\mathbf{P}_Z\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{P}_Z\,\boldsymbol\varepsilon\right]\\ &= \boldsymbol\beta_0 + \mathbb{E}_{X,Z}\left[\,\mathbb{E}\left[\left(\mathbf{X}^T\mathbf{P}_Z\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{P}_Z\,\boldsymbol\varepsilon\,\middle|\,\mathbf{X},\mathbf{Z}\right]\right]\\ &= \boldsymbol\beta_0 + \mathbb{E}_{X,Z}\left[\left(\mathbf{X}^T\mathbf{P}_Z\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{P}_Z\cdot\mathbb{E}\left[\boldsymbol\varepsilon\,\middle|\,\mathbf{X},\mathbf{Z}\right]\right]\end{aligned}$$
and it is impossible to simplify the expression further, since $\mathbb{E}[\boldsymbol\varepsilon|\mathbf{X},\mathbf{Z}] \neq \mathbf{0}$
because of endogeneity: the IV-2SLS estimator is biased in small samples,
and the bias goes towards the direction of the inconsistent OLS estimator.
³ The Hausman test is sometimes also called the Durbin-Wu-Hausman test, in light of two
earlier contributions (Durbin, 1954; Wu, 1973) that proposed tests similar to Hausman’s.


Some theoretical research has attempted to quantify this bias and to develop
procedures for testing it. The practice of empirical research which is based
on IV-2SLS estimation emphasizes the use of large datasets in order to rely
on the better asymptotic properties of the estimator.
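A small Monte Carlo sketch of this phenomenon follows; the design numbers (sample size, degree of endogeneity, instrument strength) are illustrative assumptions. The median of the IV draws is used because the just-identified IV estimator need not possess finite moments.

```python
import numpy as np

rng = np.random.default_rng(7)
R, N, delta = 2_000, 30, 1.0          # replications, (small) sample size, truth
iv_draws, ols_draws = np.empty(R), np.empty(R)
for r in range(R):
    z = rng.normal(size=N)
    eta = rng.normal(size=N)
    x = z + eta
    y = delta * x + 0.8 * eta + 0.6 * rng.normal(size=N)
    zc, xc, yc = z - z.mean(), x - x.mean(), y - y.mean()
    iv_draws[r] = (zc @ yc) / (zc @ xc)       # simple IV slope
    ols_draws[r] = (xc @ yc) / (xc @ xc)      # simple OLS slope

# OLS concentrates near delta + Cov(x, eps)/Var(x) = 1 + 0.8/2 = 1.4, while
# the IV median sits close to delta, with a small finite-sample bias in the
# direction of the OLS probability limit.
print(np.median(iv_draws), np.mean(ols_draws))
```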

Weak instruments
It was observed in several instances that the limiting variance of IV-2SLS
estimators is inversely proportional to the correlation between the endoge-
nous regressors and the exogenous instruments. This relates to the intuition
for identification in IV estimation: the “effect” of some endogenous variable
Xi on Yi is obtained through the indirect effect that the instrument Zi has
on Yi , since Zi affects Xi directly but it affects Yi only through Xi . It is then
self-evident that if the direct relationship between Xi and Zi is statistically
weak, the main effect of interest is hard to “capture” and it will be at best
imprecisely estimated. This is a problem of weak instruments.
Weak instruments have two main implications. First, IV-2SLS estimates
obtained with weak instruments might make predictions worse than the ones
obtained with inconsistent OLS estimates, in a Mean Squared Error sense.
Asymptotically, the latter reads (particularly for IV-2SLS) as:

$$\operatorname{plim}\,\mathrm{MSE}_{IV\text{-}2SLS} = \underbrace{\left(\operatorname{plim}\hat{\boldsymbol\beta}_{IV\text{-}2SLS} - \boldsymbol\beta_0\right)^2}_{\text{squared asymptotic bias}} + \underbrace{\mathrm{Avar}\left[\hat{\boldsymbol\beta}_{IV\text{-}2SLS}\right]}_{\text{asymptotic variance}}$$

hence, the gains obtained in terms of lower bias might be more than offset
by the losses due to a higher variance. Second, if the instruments are weak
and only slightly endogenous, the “cure” to endogeneity achieved by IV-2SLS
estimation might be worse then the disease. To appreciate this, consider
the ratio between the asymptotic bias of the IV estimator (10.22) of the
trivariate triangular model, and the OLS estimator (5.15) from bivariate
regression:

plim β1,IV − β1 Cov (Zi , εi ) Var (Xi )


= ·
plim β1,OLS − β1 Cov (Zi , Xi ) Cov (Xi , εi )
Corr (Zi , εi ) 1
= ·
Corr (Xi , εi ) Corr (Zi , Xi )
which is obtained by elaborating the probability limit of the two estimators
under the assumptions that Cov (Zi , εi ) ≠ 0 and Cov (Xi , εi ) ≠ 0. Observe
that even if the instrument, while not completely exogenous, is somewhat
“less endogenous” – in the sense that |Cov (Zi , εi )| < |Cov (Xi , εi )| – a weak


instrument might actually amplify the endogeneity problem. This intuition


is easily generalized to higher dimensional problems. Consider for example
a triangular model with one endogenous regressor si , similar to (10.35) but
possibly overidentified:
$$\begin{aligned} y_i &= \mathbf{x}_{i1}^T\boldsymbol\beta_{0\backslash K} + \delta_0\, s_i + \varepsilon_i \\ s_i &= \mathbf{z}_i^T\boldsymbol\pi_0 + \eta_i \end{aligned}$$

and assume further that it has i.i.d. errors. It is possible to show that:

$$\frac{\operatorname{plim}\hat{\delta}_{IV} - \delta_0}{\operatorname{plim}\hat{\delta}_{OLS} - \delta_0} = \frac{\mathrm{Corr}(\hat{S}_i,\varepsilon_i)}{\mathrm{Corr}(S_i,\varepsilon_i)}\cdot\frac{1}{\operatorname{plim} R^2_{s,z|x}}$$

where $R^2_{s,z|x}$ is the following partialed-out R-squared coefficient.

$$R^2_{s,z|x} = \frac{\mathbf{s}^T\mathbf{P}_Z\mathbf{M}_{X_1}\mathbf{P}_Z\,\mathbf{s}}{\mathbf{s}^T\mathbf{M}_{X_1}\mathbf{s}}$$
Notice that, by the Frisch-Waugh-Lovell Theorem, this is the R2 coefficient
that would be obtained from a regression of si on zi , after partialing out
the exogenous regressors xi1 , as follows (see Lecture 7).
$$\mathbf{M}_{X_1}\mathbf{s} = \mathbf{M}_{X_1}\mathbf{Z}\boldsymbol\pi_0 + \mathbf{M}_{X_1}\boldsymbol\eta$$
In light of this analysis, it may appear that embarking on an empirical
study based on Instrumental Variables is very risky, due to the high chance
of ending up with mildly endogenous and fairly weak instruments. For the
sake of mitigating this risk, it is best to follow some general guidelines.
1. It is always useful to test the statistical power of the instruments
via estimates of the First Stage models (10.45). In the applied econo-
metric practice, some rules of thumb apply: t-statistics for the ex-
ogenous instruments higher than 3, or model-wide F -statistics higher
than 10, are considered signs that the instruments are “satisfactorily
strong.” These numbers appear to be based on simulation studies and
surveys, see e.g. Stock et al. (2002); however, they must be taken with
a grain of salt, since the conditions that make an instrument “strong
enough” are really context- and data-dependent.
2. The earlier observation that 2SLS is more likely to hit the efficiency
bound the more instruments are used must be revisited. While this is
true in theory, in practice chances are that the more instruments one
is employing, the higher the probability to include mildly endogenous,
weak instruments – it is advisable to drop instruments from overi-
dentified 2SLS estimators whenever they are suspected to be weak.
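The first rule of thumb can be sketched as follows: an F statistic for the excluded instrument in the first-stage regression (10.45), computed from restricted and unrestricted residual sums of squares. The simulated data and all numbers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(8)
N = 500
x1 = rng.normal(size=N)                         # included exogenous regressor
z = rng.normal(size=N)                          # excluded instrument
s = 0.3 * x1 + 0.6 * z + rng.normal(size=N)     # first-stage relationship

Wr = np.column_stack([np.ones(N), x1])          # restricted model: without z
Wu = np.column_stack([np.ones(N), x1, z])       # unrestricted model: with z

def rss(W):
    # residual sum of squares of an OLS regression of s on the columns of W
    b = np.linalg.solve(W.T @ W, W.T @ s)
    r = s - W @ b
    return r @ r

q = 1                                           # number of excluded instruments
F = ((rss(Wr) - rss(Wu)) / q) / (rss(Wu) / (N - Wu.shape[1]))
print(F)   # rule of thumb: F > 10 suggests the instrument is not weak
```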


To summarize, IV-2SLS are indeed powerful “instruments,” but they must


be used with care. Any researcher employing Instrumental Variables should
first make sure that the exogeneity assumptions can be credibly defended in
the context at hand, and then show that the instruments are strong enough.

10.4 Estimation of Simultaneous Equations


The 2SLS estimator is not only a solution to the endogeneity problem in
single-equation models, but is also the traditional tool for the estimation of
linear simultaneous equations models. As it has been argued, the very usage
of the word “endogeneity” originates in econometrics from the terminology
of linear SEMs. This section describes how SEMs are estimated via 2SLS
or, possibly better, via its extension which is specifically tailored to multiple
equations: the so-called Three-Stages Least Squares (3SLS) estimator.
The ensuing discussion builds on the analysis of the identification of SEMs
which is developed in Lecture 9.
A single equation of a SEM that is identified can be easily estimated
via 2SLS. To see this, rewrite the p-th equation of interest as:
$$y_{pi} = \mathbf{x}_{pi}^T\boldsymbol\beta_{p0} + \varepsilon_{pi} \tag{10.76}$$
where the normalized endogenous variable ypi is isolated on the left-hand
side, while on the right-hand side, xpi collects the realizations of all variables
– both exogenous and endogenous – that are not excluded from the
structural form; the appropriately restricted parameter set βp0 is defined
accordingly. If (10.76) has K2 ≤ P −1 endogenous variables, it is associated
to as many first stage models from the reduced form of the SEM:
$$x_{ki} = \mathbf{z}_i^T\boldsymbol\pi_{k0} + \eta_{ki} \tag{10.77}$$
for k = 1, . . . , K2 . Whether the p-th equation (10.76) is exactly identified
or overidentified, it can always be estimated by IV-2SLS (while the Indirect
Least Squares approach cannot work under overidentification).
A problem with the 2SLS estimation of SEMs, if separately performed
equation-by-equation, has to do with the estimation of the variance of the estimator of βp0. In
fact, separate estimation disregards any potential statistical variation that
is common across equations. To put it more concretely, in the likely circum-
stance where the errors (ε1i , . . . , εP i ) are correlated across the P equations,
2SLS estimation is inefficient. An analogous problem arises when estimat-
ing a set of P linear regressions with no endogenous variables on the right
hand side – basically, SEMs with Γ = I – but with correlated error terms,
models which are known as SURs (Seemingly Unrelated Regressions).

348
10.4. Estimation of Simultaneous Equations

Example 10.4. Household labor supply. Consider the following model:


Hhi = α0 + α1 Hwi + α2 Shi + α3 Swi + α4 log Whi + α5 log Wwi + εhi
Hwi = β0 + β1 Hhi + β2 Shi + β3 Swi + β4 log Whi + β5 log Wwi + εwi
where subscript i denotes a household (the unit of observation), w denotes
variables relative to the wife, h to the husband, and for s ∈ {h, w}, Hsi, Ssi
and log Wsi denote “hours worked,” education, and the logarithm of the wage,
respectively. This model, which characterizes the interdependence of labor
supply choices between the two members of the household, is identified with
at least one restriction per equation (in particular, if α1 = β1 = 0 this SEM
becomes a SUR). If, for example, both the husband’s and the wife’s labor
supply choices are not influenced by their partner’s level of education Ssi ,
one can impose the restrictions α3 = β2 = 0, and separate 2SLS estimation
of both α = (α0, α1, α2, α4, α5)ᵀ and β = (β0, β1, β3, β4, β5)ᵀ is possible, by using
the wife’s education as an instrument for her labor supply – and vice versa.
It is quite likely, however, that the members of a family experience shared
external circumstances, such as for example the quality of social connections
or individual and cultural attitudes about work. In this case:
\[ \mathrm{Cov}\left( \varepsilon_{hi}, \varepsilon_{wi} \right) \neq 0 \]
therefore, separate 2SLS estimation of α̂_2SLS and β̂_2SLS would be inefficient:
intuitively, statistical inferences would not take into account that both sets
of point estimates are generated by some common statistical variation. □
Thus, methods for the joint estimation of SEMs (or SURs) have been de-
veloped. These methods are known as full information approaches, which
contrast with equation-by-equation limited information approaches (such
as the equation-by-equation 2SLS). The most straightforward full informa-
tion method is the so-called “Three Stages Least Squares” (3SLS) estimator,
which extends 2SLS by adding a further “third stage” meant to obtain more
efficient estimates. This extended procedure aims to address cross-equation
error correlation, and it is analogous to a GLS approach to correct for het-
eroscedasticity or error dependence in single equation models.
The 3SLS estimator is best described using compact matrix notation.
Write any SEM whose equations are all at least exactly identified as:
y = Xβ0 + ε (10.78)
or:       
\[
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_P \end{pmatrix}
=
\begin{pmatrix}
X_1 & 0 & \cdots & 0 \\
0 & X_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & X_P
\end{pmatrix}
\begin{pmatrix} \beta_{10} \\ \beta_{20} \\ \vdots \\ \beta_{P0} \end{pmatrix}
+
\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_P \end{pmatrix}
\tag{10.79}
\]

349
10.4. Estimation of Simultaneous Equations

where yp is the N -dimensional vector obtained from stacking all the obser-
vations ypi for i = 1, . . . , N , while εp is constructed analogously. Similarly,
matrix Xp results from vertically stacking the vectors x_pi^T for i = 1, . . . , N; thus
the p-th equation (10.76) can be rephrased as follows.
\[ y_p = X_p \beta_{p0} + \varepsilon_p \tag{10.80} \]
Furthermore, consider the stacked instrument matrix Z as in (10.30) and the
associated projection matrix P_Z, and construct the P equation-specific matrices
of projected regressors as:
\[ \hat{X}_p = P_Z X_p \]
while
\[
\hat{X} \equiv
\begin{pmatrix}
\hat{X}_1 & 0 & \cdots & 0 \\
0 & \hat{X}_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \hat{X}_P
\end{pmatrix}
\]
results from diagonally stacking, block by block, all the X̂_p matrices. Given
this notation, the 2SLS estimator for all P equations can be written more
compactly as follows.
\[ \hat{\beta}_{2SLS} = \left( \hat{X}^T \hat{X} \right)^{-1} \hat{X}^T y \tag{10.81} \]
Assume for simplicity that the error terms ε are i.i.d. within equations,
but are correlated across equations:
\[ E\left[ \varepsilon \mid Z \right] = 0 \qquad E\left[ \varepsilon \varepsilon^T \mid Z \right] = \Sigma \otimes I \]
where Σ is the symmetric P × P matrix containing the error variance of each
equation along the diagonal, and the cross-equation covariance terms outside
the diagonal.
\[
\Sigma =
\begin{pmatrix}
\sigma_{11} & \sigma_{12} & \cdots & \sigma_{1P} \\
\sigma_{21} & \sigma_{22} & \cdots & \sigma_{2P} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{P1} & \sigma_{P2} & \cdots & \sigma_{PP}
\end{pmatrix}
\tag{10.82}
\]
Therefore, by the definition of Kronecker product:
 
\[
E\left[ \varepsilon \varepsilon^T \mid Z \right] = \Sigma \otimes I =
\begin{pmatrix}
\sigma_{11} I & \sigma_{12} I & \cdots & \sigma_{1P} I \\
\sigma_{21} I & \sigma_{22} I & \cdots & \sigma_{2P} I \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{P1} I & \sigma_{P2} I & \cdots & \sigma_{PP} I
\end{pmatrix}
\]


meaning that, in addition to the equation-specific variance terms σpp along


the diagonal, for any observation i (i = 1, . . . , N ) the conditional covariance
of the two shocks εpi and εqi for the p-th and the q-th equations respectively
is equal to σpq . All the other cross-equation covariance terms, that is those
for any two different observations, are assumed to be equal to zero.
Thus, the 3SLS estimator can be easily defined as just the GLS gener-
alization of the 2SLS estimator for this model. It is computed by iterating
the following “three stages” (the first two of which correspond to the two
stages of the 2SLS estimator):
1. project xpi (Xp ) onto zi (Z) equation-by-equation for p = 1, . . . , P ;

2. compute the 2SLS estimator β̂_2SLS from (10.81), as well as estimates
of the P(P + 1)/2 distinct parameters contained in matrix Σ:
\[ \hat{\sigma}_{pq} = \frac{ \left( y_p - X_p \hat{\beta}_{p,2SLS} \right)^T \left( y_q - X_q \hat{\beta}_{q,2SLS} \right) }{N} \tag{10.83} \]

for p, q = 1, . . . , P, resulting in an estimate Σ̂_N of matrix Σ;

3. finally, compute the 3SLS estimator as:
\[ \hat{\beta}_{3SLS} = \left[ \hat{X}^T \left( \hat{\Sigma}_N^{-1} \otimes I \right) \hat{X} \right]^{-1} \hat{X}^T \left( \hat{\Sigma}_N^{-1} \otimes I \right) y \tag{10.84} \]
and its asymptotic variance as follows – compare it with (8.40).
\[ \widehat{\mathrm{Avar}}\left( \hat{\beta}_{3SLS} \right) = \left[ \hat{X}^T \left( \hat{\Sigma}_N^{-1} \otimes I \right) \hat{X} \right]^{-1} \tag{10.85} \]
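The three stages above are straightforward to implement directly. The sketch below does so in numpy for a hypothetical two-equation system with one endogenous regressor per equation (all data are simulated; the instrument matrix, coefficient values and error covariance are illustrative assumptions, not taken from the text), following (10.81), (10.83) and (10.84) step by step.

```python
import numpy as np

rng = np.random.default_rng(42)
N = 500

# Hypothetical two-equation SEM: structural errors correlated across equations
Z = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])   # instruments
e = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.7], [0.7, 1.0]], size=N)
x1 = Z[:, 1] + 0.5 * e[:, 0] + rng.normal(size=N)            # endogenous in eq. 1
x2 = Z[:, 2] + 0.5 * e[:, 1] + rng.normal(size=N)            # endogenous in eq. 2
y1 = 1.0 + 2.0 * x1 + e[:, 0]
y2 = -0.5 + 1.5 * x2 + e[:, 1]

X1 = np.column_stack([np.ones(N), x1])
X2 = np.column_stack([np.ones(N), x2])
y = np.concatenate([y1, y2])

# Stage 1: project each equation's regressors onto the instruments
PZ = Z @ np.linalg.solve(Z.T @ Z, Z.T)
Xhat = np.zeros((2 * N, 4))                                  # block-diagonal X-hat
Xhat[:N, :2] = PZ @ X1
Xhat[N:, 2:] = PZ @ X2

# Stage 2: system 2SLS as in (10.81), then Sigma-hat from residuals as in (10.83)
b2sls = np.linalg.solve(Xhat.T @ Xhat, Xhat.T @ y)
r1 = y1 - X1 @ b2sls[:2]
r2 = y2 - X2 @ b2sls[2:]
Sigma = np.array([[r1 @ r1, r1 @ r2],
                  [r1 @ r2, r2 @ r2]]) / N

# Stage 3: GLS step with weight Sigma^{-1} (x) I, as in (10.84)-(10.85)
W = np.kron(np.linalg.inv(Sigma), np.eye(N))
A = Xhat.T @ W @ Xhat
b3sls = np.linalg.solve(A, Xhat.T @ W @ y)
avar = np.linalg.inv(A)

print(np.round(b3sls, 2))   # estimates close to the true (1.0, 2.0, -0.5, 1.5)
```

With more than two equations the Kronecker product can become memory-heavy; production implementations exploit the block structure of X̂ rather than forming Σ̂⁻¹ ⊗ I explicitly.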

As hinted later in Lecture 12, there exist versions of the 3SLS estimator that
are robust to heteroscedasticity and to wider forms of error dependence.
The 3SLS estimator is the most efficient among all the semi-parametric
estimators of SEMs. Just like the 2SLS estimator, as it is discussed later, it
corresponds to the solution of a Generalized Method of Moments (GMM)
problem. Nonetheless, in the fully parametric case other methods are avail-
able for the estimation of SEMs (the so-called LIML, Limited Information
Maximum Likelihood, and FIML, Full Information Maximum Likelihood
methods). These maximum likelihood methods, however, are not more efficient
than GMM-based or otherwise semi-parametric methods, and in addition
they are vulnerable to violations of their parametric assumptions. For this
reason, the current practice favors the use of semi-parametric methods for
the estimation of linear simultaneous equations.

Lecture 11

Maximum Estimation

This lecture illustrates the general estimation framework that encompasses


the three most common estimation methods in econometrics: Least Squares
methods (which include OLS, its generalizations, as well as their non-linear
versions), Maximum Likelihood Estimation, and the Generalized Method of
Moments. This framework is usually referred to as “Maximum Estimation”
(in short, “M-Estimation”) or as “Extremum Estimation.” All M-Estimators
are based on the optimization of some objective function, hence their name.
This lecture develops the theoretical and statistical framework that is common to all M-Estimators; while doing so, it introduces the more specialized
Non-Linear Least Squares (NLLS) and especially the Maximum Likelihood
Estimation (MLE) framework, along with illustrative applied examples.

11.1 Criterion Functions


Consider a structural model such as (9.1). Suppose that given some specific
assumptions about the joint probability distribution of (yi , zi , εi ) (either
fully parametric or semi-parametric), the model’s true parametric structure
θ0 in the population is the solution to a maximization problem of the kind:
\[ \theta_0 = \arg\max_{\theta \in \Theta} Q_0(\theta) = \arg\max_{\theta \in \Theta} \lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} E\left[ q(y_i, z_i; \theta) \right] \tag{11.1} \]

where q (yi , zi ; θ) is an observation-specific criterion for i = 1, . . . , N . For


the sake of a simpler notation, this function is heretofore written as q (xi ; θ),
where xi = (yi , zi ). Observe that with identically distributed observations,
it is Q0 (θ) = E [q (xi ; θ)]. To anticipate the discussion about identification,
notice how the definition in (11.1) implies that at the limit, function Q0 (θ)
should have a unique maximum, or else θ0 would be ill-defined.


An appropriate estimator of θ0 is the maximizer of the sample version


of (11.1), which is known as the sample criterion function Q̂_N(θ):
\[ \hat{\theta}_M = \arg\max_{\theta \in \Theta} \hat{Q}_N(\theta) = \arg\max_{\theta \in \Theta} \frac{1}{N} \sum_{i=1}^{N} q(x_i; \theta) \tag{11.2} \]
such an object is called an M-Estimator. The analysis of M-Estimation
that follows concerns the identification conditions of M-Estimators and their
asymptotic properties. The full-fledged analysis is provided by Newey and
McFadden (1994) in a chapter of the Handbook of Econometrics.
Example 11.1. Ordinary Least Squares (OLS). Let the CEF of some
endogenous variable Yi , given some K exogenous variables xi , be linear.
\[ E\left[ Y_i \mid x_i \right] = x_i^T \beta_0 \tag{11.3} \]
In such a case, the “true” parameter vector β0 is shown, by an extension of
Theorem 7.1, to minimize the “limiting” mean squared error (MSE):
\[ \beta_0 = \arg\max_{\beta \in \mathbb{R}^K} \lim_{N \to \infty} -\frac{1}{N} \sum_{i=1}^{N} E\left[ \left( Y_i - x_i^T \beta \right)^2 \right] \tag{11.4} \]

thus the OLS estimator is just the minimizer of the sample analog:
\[ \hat{\beta}_{OLS} = \arg\max_{\beta \in \mathbb{R}^K} -\frac{1}{N} \sum_{i=1}^{N} \left( y_i - x_i^T \beta \right)^2 \tag{11.5} \]
corresponding to an M-Estimator for \( q(Y_i, x_i; \beta) = -\left( Y_i - x_i^T \beta \right)^2 \). □
Example 11.2. Non-Linear Least Squares (NLLS). If, instead, the
CEF is non-linear, governed by some function denoted as h (xi ; θ):
\[ E\left[ Y_i \mid x_i \right] = h(x_i; \theta_0) \tag{11.6} \]
one can show again by extending Theorem 7.1 that:
\[ \theta_0 = \arg\max_{\theta \in \Theta} \lim_{N \to \infty} -\frac{1}{N} \sum_{i=1}^{N} E\left[ Y_i - h(x_i; \theta) \right]^2 \tag{11.7} \]

and the Non-Linear Least Squares (NLLS) estimator is defined as the


sample analog of the above problem, so long as h (xi ; θ) is invertible in θ:
\[ \hat{\theta}_{NLLS} = \arg\max_{\theta \in \Theta} -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i - h(x_i; \theta) \right]^2 \tag{11.8} \]
where here \( q(Y_i, x_i; \theta) = -\left[ Y_i - h(x_i; \theta) \right]^2 \). In typical applications, (11.8)
lacks an explicit solution; estimates must thus be obtained numerically. □
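A minimal numerical illustration of (11.8) on simulated data, assuming a hypothetical exponential CEF h(x; θ) = θ₁ exp(θ₂x): scipy's `curve_fit` minimizes exactly this sum of squared residuals.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)
n = 400
x = rng.uniform(0.0, 2.0, size=n)

def h(x, a, b):
    # Hypothetical non-linear CEF: h(x; theta) = a * exp(b * x)
    return a * np.exp(b * x)

# Simulated data with true theta = (1.5, 0.8)
y = h(x, 1.5, 0.8) + rng.normal(scale=0.3, size=n)

# NLLS: numerically minimize sum_i [y_i - h(x_i; theta)]^2, as in (11.8)
theta_hat, _ = curve_fit(h, x, y, p0=(1.0, 0.5))
print(theta_hat)   # estimates close to the true (1.5, 0.8)
```

The starting value `p0` matters in general: with a non-convex sum of squares, a poor initial guess can send the optimizer to a local optimum.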


For the present discussion, it is useful to define the following two objects.
Borrowing from Maximum Likelihood terminology, the observation-specific
score si (xi ; θ) for i = 1, . . . , N of an M-estimator is defined as the vector
of first derivatives of q(x_i; θ) with respect to the parameter set θ:
\[ s_i(x_i; \theta) = \frac{\partial q(x_i; \theta)}{\partial \theta} =
\begin{pmatrix}
\partial q(x_i; \theta) / \partial \theta_1 \\
\partial q(x_i; \theta) / \partial \theta_2 \\
\vdots \\
\partial q(x_i; \theta) / \partial \theta_K
\end{pmatrix} \]

while H_i(x_i; θ) is the K × K Hessian matrix of the second derivatives
of q(x_i; θ) or – equivalently – the Jacobian matrix of the score’s first
derivatives, with respect to the parameter set θ:
\[ H_i(x_i; \theta) = \frac{\partial s_i(x_i; \theta)}{\partial \theta^T} = \frac{\partial^2 q(x_i; \theta)}{\partial \theta\, \partial \theta^T} =
\begin{pmatrix}
\frac{\partial^2 q(x_i; \theta)}{\partial \theta_1 \partial \theta_1} & \frac{\partial^2 q(x_i; \theta)}{\partial \theta_1 \partial \theta_2} & \cdots & \frac{\partial^2 q(x_i; \theta)}{\partial \theta_1 \partial \theta_K} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 q(x_i; \theta)}{\partial \theta_K \partial \theta_1} & \frac{\partial^2 q(x_i; \theta)}{\partial \theta_K \partial \theta_2} & \cdots & \frac{\partial^2 q(x_i; \theta)}{\partial \theta_K \partial \theta_K}
\end{pmatrix} \]

again for i = 1, . . . , N . Notice that both the score vector and the Hessian
matrix only exist under certain conditions, specifically if the M-Estimation
objective function is, respectively, at least once or twice continuously differ-
entiable. These conditions might not be respected for the objective function
of some important econometric estimators, such as the quantile regression.
The score and the Hessian matrix are instrumental for the characterization
of the identification conditions for M-Estimators.
Theorem 11.1. Identification of M-Estimators. In any M-Estimation
environment, the “true” parameter set θ0 is locally point identified if the
following limiting average Hessian matrix, evaluated at θ0, has full rank K.
\[ Q_0 \equiv \lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} E\left[ H_i(x_i; \theta_0) \right] \]

Proof. This is indeed a simple application of the Implicit Function Theorem.


In a well-defined M-Estimator, the true parameter vector θ0 sets the K First
Order Conditions of the empirical criterion function Q̂_N(θ0) equal to zero
at the probability limit, that is, at some limiting average score.
\[ \frac{\partial}{\partial \theta} \hat{Q}_N(\theta_0) = \frac{1}{N} \sum_{i=1}^{N} \frac{\partial}{\partial \theta} q(x_i; \theta_0) \xrightarrow{p} \lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} E\left[ s_i(x_i; \theta_0) \right] = 0 \]

If the Jacobian Q0 has full rank, there is a unique local solution θ0 .

354
11.1. Criterion Functions

Example 11.3. Identification of OLS. In OLS, the observation-specific


score and the Hessian matrix are as follows.

\[ s_i(y_i, x_i; \beta) = 2 x_i \varepsilon_i \qquad H_i(y_i, x_i; \beta) = -2 x_i x_i^T \]

Consequently, the limiting average Hessian of OLS is just:


\[ Q_0 = -2 K_0 = \lim_{N \to \infty} -\frac{2}{N} \sum_{i=1}^{N} E\left[ x_i x_i^T \right] \]

and the requirement that the above matrix must have full rank in order for
the OLS estimator to be identified is quite a familiar condition. 
Example 11.4. Identification of NLLS. In the NLLS case, by denoting
the error term by εi ≡ yi − h (xi ; θ), the score and the Hessian are:

\[ s_i(y_i, x_i; \theta) = 2 \frac{\partial}{\partial \theta} h(x_i; \theta) \cdot \varepsilon_i \]
\[ H_i(y_i, x_i; \theta) = -2 \frac{\partial}{\partial \theta} h(x_i; \theta) \frac{\partial}{\partial \theta^T} h(x_i; \theta) + 2 \frac{\partial^2}{\partial \theta\, \partial \theta^T} h(x_i; \theta) \cdot \varepsilon_i \]
and note that, by the Law of Iterated Expectations:
\[ E\left[ \frac{\partial^2}{\partial \theta\, \partial \theta^T} h(x_i; \theta_0) \cdot \varepsilon_i \right] = E_x\left[ \frac{\partial^2}{\partial \theta\, \partial \theta^T} h(x_i; \theta_0) \cdot E\left[ \varepsilon_i \mid x_i \right] \right] = 0 \]

as E[εi | xi] = E[Yi | xi] − h(xi; θ0) = 0. Thus, any expected Hessian matrix
of NLLS is just:
\[ E\left[ H_i(x_i; \theta_0) \right] = -2\, E\left[ h_{0i} h_{0i}^T \right] \]
where:
\[ h_{0i} \equiv \frac{\partial}{\partial \theta} h(x_i; \theta_0) \]
is the derivative of the CEF evaluated at xi and at the true parameters θ0.
The identification of NLLS is generally evaluated in terms of the matrices
E[h_{0i} h_{0i}^T], and the probability limits of their sample averages. □

In practical applications, it is customary to verify that the sample mean
of the Hessian (like N −1 XT X in the OLS case) has full rank, as an indication
that the model is identified. In addition, it is useful to check that the rows
or columns of the Hessian’s sample mean are not too correlated; otherwise,
identification is said to be weak, and the estimates are usually very imprecise
with large standard errors. This problem is called quasi-multicollinearity


and is intuitively due to the statistical difficulty of distinguishing between


two “factors” (like different explanatory variables, columns in X) if they are
very similar. In the IV/2SLS case, this corresponds to the problem of weak
instruments, which appears when X^T P_Z X is nearly singular, so that its inverse is too large.
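The consequences of quasi-multicollinearity are easy to simulate. In the sketch below (all data artificial), a second regressor is made progressively closer to an exact copy of the first: both the condition number of XᵀX and the implied variance of the slope estimate blow up.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
x1 = rng.normal(size=n)
noise = rng.normal(size=n)

def slope_var_and_cond(scale):
    # The second regressor nearly duplicates the first when `scale` is small
    x2 = x1 + scale * noise
    X = np.column_stack([np.ones(n), x1, x2])
    XtX = X.T @ X
    # With homoscedastic unit-variance errors, Var(beta_hat) = (X'X)^{-1};
    # take the diagonal entry for the first slope
    return np.linalg.inv(XtX)[1, 1], np.linalg.cond(XtX)

v_ok, c_ok = slope_var_and_cond(1.0)     # clearly distinct regressors
v_bad, c_bad = slope_var_and_cond(0.01)  # nearly collinear regressors
print(v_bad / v_ok, c_bad / c_ok)        # both ratios are very large
```

The same diagnostic (inspecting the condition number of the sample Hessian) applies to any M-Estimator, not just OLS.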
A relevant subclass of M-Estimation is constituted by the Maximum
Likelihood Estimation (MLE) framework. The general approach to MLE
in Statistics is introduced in Lecture 5; in econometrics, this framework is
utilized to construct estimators for fully parametric structural models.
To substantiate, suppose that a structural model reads, for every unit of ob-
servation i = 1, . . . , N , as in (9.1): yi = s (yi , zi , εi ; θ). If an analyst either
knows a priori or can confidently assume the joint probability distribution
Fz,ε (zi , εi ) of the exogenous variables (observable and unobservable), it is
possible to characterize the distribution of xi = (yi , zi ) as:
\[ f_x(x_i; \theta) = f_{z,\varepsilon}\left( z_i, s_\varepsilon^{-1}(y_i, z_i; \theta); \theta \right) \]

where here fz,ε (·) is the probability mass or density function associated with
Fz,ε (·), whereas s−1ε (·) is the solution of the structural relationship with
respect to the unobservable factors εi (assuming that such a unique inverse
exists). In this environment, MLE corresponds with the M-Estimator that
is defined for a criterion function equaling the logarithm of f_{z,ε}(·); write
this function succinctly as ℓ(x_i; θ), where again x_i = (y_i, z_i).
\[ q(y_i, z_i; \theta) = \log f_{z,\varepsilon}\left( z_i, s_\varepsilon^{-1}(y_i, z_i; \theta); \theta \right) \equiv \ell(x_i; \theta) \]
This characterization perfectly complies with the definition of M-Estimators
since the true value of the parameters θ0 , by definition, satisfies:
\[ \theta_0 = \arg\max_{\theta \in \Theta} \lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} E\left[ \ell(x_i; \theta) \right] \tag{11.9} \]

which is a natural consequence of choosing this particular criterion function.


In fact, for i = 1, . . . , N one can maintain the following relationships, thanks
to an application of Jensen’s inequality (second line below):
 
\[
\begin{aligned}
E\left[ \ell(x_i; \theta) \right] - E\left[ \ell(x_i; \theta_0) \right]
&= E\left[ \log \frac{f_x(x_i; \theta)}{f_x(x_i; \theta_0)} \right] \\
&\le \log E\left[ \frac{f_x(x_i; \theta)}{f_x(x_i; \theta_0)} \right] \\
&= \log \int_{\mathcal{X}} \frac{f_x(x_i; \theta)}{f_x(x_i; \theta_0)}\, f_x(x_i; \theta_0)\, dx_i \\
&= 0
\end{aligned}
\]


where X is the joint support of xi = (yi , zi ), and dxi is the joint differential.
Note that the expectation must be evaluated by integrating over fx (xi ; θ0 ),
as this is the function which is assumed to generate the data. These relationships hold for any θ ∈ Θ; hence, over the entire parameter space
it is E[ℓ(x_i; θ0)] ≥ E[ℓ(x_i; θ)] for i = 1, . . . , N, and (11.9) must hold.
A variant of this approach which is equally valid, and at the same time
typically more practical, is that where a researcher only specifies the condi-
tional distribution of the unobserved factors εi , given the realizations yi of
the exogenous variables: thus, the criterion function is specified as follows.
q (yi , zi ; θ) = log fz,ε s−1

ε (y i , zi ; θ) zi

This method is called Conditional Maximum Likelihood Estimation


(CMLE) and in econometrics it prevails over its unconditional version; it is
best illustrated via an example.
Example 11.5. Maximum Likelihood and Linear Regression. Consider a linear regression model like y_i = x_i^T β + ε_i. Suppose there is reason
to believe that the error term of this model is homoscedastic, independent
to believe that the error term of this model is homoscedastic, independent
across observations and normally distributed:
\[ \varepsilon \sim N\left( 0, \sigma^2 I \right) \]


where σ2 is an unknown parameter of the model. In addition, assume that


the right-hand side variables xi are “fixed” (or non-stochastic): that is, one
realization xi appears with probability one in every sample. Together, these
assumptions specify the entire joint probability distribution of (xi , εi ), char-
acterizing the so-called “classical” regression model with spherical/normal
disturbances. Notice that one implication is that the covariance between the
error term and the regressors is zero, that is E[x_i ε_i] = 0. The probability
density function for the error term can then be written as:
\[ f_\varepsilon\left( \varepsilon_i \mid \beta, \sigma^2 \right) = f_\varepsilon\left( y_i - x_i^T \beta \mid \beta, \sigma^2 \right) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{\left( y_i - x_i^T \beta \right)^2}{2\sigma^2} \right) \]
which is just the expression of a univariate normal distribution after having
applied a change of variable. Write the collection of the unknown parameters as θ = (β; σ²). In this model, the likelihood function is:
\[
L\left( \theta \mid \{y_i, x_i\}_{i=1}^N \right)
= \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{\left( y_i - x_i^T \beta \right)^2}{2\sigma^2} \right)
= \left( \frac{1}{2\pi\sigma^2} \right)^{\frac{N}{2}} \exp\left( -\frac{\sum_{i=1}^{N} \left( y_i - x_i^T \beta \right)^2}{2\sigma^2} \right)
\]

357
11.1. Criterion Functions

and correspondingly, the log-likelihood function is:


\[ \log L\left( \theta \mid \{y_i, x_i\}_{i=1}^N \right) = -\frac{N}{2} \left( \log 2\pi + \log \sigma^2 \right) - \frac{\sum_{i=1}^{N} \left( y_i - x_i^T \beta \right)^2}{2\sigma^2} \]
which clearly has a unique maximum (the Maximum Likelihood Estimator
of θ is well defined). In fact, the First Order Conditions are:
\[ \frac{\partial}{\partial \theta} \log L\left( \theta \mid \{y_i, x_i\}_{i=1}^N \right) =
\begin{pmatrix}
\sum_{i=1}^{N} x_i \left( y_i - x_i^T \beta \right) \sigma^{-2} \\
-\frac{N}{2} \sigma^{-2} + \frac{1}{2} \sum_{i=1}^{N} \left( y_i - x_i^T \beta \right)^2 \sigma^{-4}
\end{pmatrix} = 0 \]
whose solution is the ML estimator \( \hat{\theta}_{MLE} = \left( \hat{\beta}_{MLE}; \hat{\sigma}^2_{MLE} \right) \), given by:

\[ \hat{\beta}_{MLE} = \left( \sum_{i=1}^{N} x_i x_i^T \right)^{-1} \sum_{i=1}^{N} x_i y_i \qquad \hat{\sigma}^2_{MLE} = \frac{\sum_{i=1}^{N} \left( y_i - x_i^T \hat{\beta}_{MLE} \right)^2}{N} \]
and the Second Order Conditions are satisfied for a maximum. Therefore,
this estimator exists and is unique as long as the matrix Σ_{i=1}^N x_i x_i^T has full
rank. Note that the Maximum Likelihood estimator of β is identical to the
corresponding OLS estimator of the linear regression model. The estimator
for the error variance parameter σ̂²_MLE differs, however, from the unbiased
estimator σ̂² from the small sample analysis of OLS, as the latter is larger
by the factor N/(N − K). This is just one particular example of a general feature
of MLE: while this approach might produce biased estimators, these are
in general consistent and at least as efficient as their unbiased counterparts.
This Maximum Likelihood estimator can alternatively be obtained under more general assumptions. Suppose that x_i is not fixed; without specifying its full data generation process, assume that conditional on any realization x_i, the error term is normal with constant variance: ε_i | x_i ∼ N(0, σ²).
Since ε_i = y_i − x_i^T β, this implies the following conditional density function:
\[ f_{Y|x}\left( y_i \mid x_i; \beta, \sigma^2 \right) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{\left( y_i - x_i^T \beta \right)^2}{2\sigma^2} \right) \]

with the same associated likelihood function as above. This shows that the
CMLE approach here delivers the same result as the simple (but unrealistic)
assumption that xi is fixed. In fact, xi is allowed to follow any distribution,
so long as εi is normal when conditioning on it. 
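The equivalence claimed in the example can be checked numerically: minimizing the negative log-likelihood above (on simulated data, parameterizing log σ² so that the variance stays positive) recovers the OLS coefficients, with σ̂²_MLE = SSR/N rather than SSR/(N − K).

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=1.5, size=n)

def neg_loglik(params):
    # Parameterize log(sigma^2) so the optimizer is unconstrained
    beta, log_s2 = params[:-1], params[-1]
    r = y - X @ beta
    return 0.5 * n * (np.log(2 * np.pi) + log_s2) + (r @ r) / (2 * np.exp(log_s2))

res = minimize(neg_loglik, np.zeros(4), method="BFGS")
beta_mle, s2_mle = res.x[:-1], np.exp(res.x[-1])

# Closed-form comparisons: OLS coefficients, and SSR/N (not SSR/(N-K))
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
s2_ssr_over_n = np.mean((y - X @ beta_ols) ** 2)
print(beta_mle - beta_ols, s2_mle - s2_ssr_over_n)   # both essentially zero
```

In this Gaussian linear model the numerical optimization is, of course, unnecessary; the point of the sketch is that the same routine applies unchanged to likelihoods without closed-form maximizers.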


11.2 Asymptotics of Maximum Estimators


In their leading article, Newey and McFadden (1994) establish conditions
for consistency and asymptotic normality of all M-Estimators; their results
are summarized here. In order to prove consistency, it is necessary to show
that the maximizer of the sample average criterion function converges in
probability to the unique maximizer of the population expected criterion.
\[ \hat{\theta}_M = \arg\max_{\theta \in \Theta} \hat{Q}_N(\theta) \xrightarrow{p} \arg\max_{\theta \in \Theta} Q_0(\theta) = \theta_0 \tag{11.10} \]
θ∈Θ θ∈Θ

Intuitively, pointwise convergence of Q̂_N(θ) to Q_0(θ) for all the possible
values θ ∈ Θ is necessary:
\[ \frac{1}{N} \sum_{i=1}^{N} \left\{ q(x_i; \theta) - E\left[ q(x_i; \theta) \right] \right\} \xrightarrow{p} 0 \tag{11.11} \]

still, it is not sufficient. A sufficient condition is uniform convergence:
\[ \sup_{\theta \in \Theta} \left| \frac{1}{N} \sum_{i=1}^{N} \left\{ q(x_i; \theta) - E\left[ q(x_i; \theta) \right] \right\} \right| \xrightarrow{p} 0 \tag{11.12} \]

this condition is stronger than (11.11), as it requires that Q̂_N(θ) converges
in probability towards Q_0(θ) “at the same speed” over the entire parameter
space Θ. The intuition is represented graphically in Figure 11.1.

[Figure 11.1: Uniform convergence: intuition (for Θ = ℝ₊). The plot shows Q_0(θ) together with the band Q_0(θ) ± ε, and two maxima: θ_0 (global) and θ_1 (local).]

For (11.12) to hold it is necessary that Q̂_N(θ) ∈ [Q_0(θ) − ε, Q_0(θ) + ε]
for any small ε > 0 and for all parameter values θ ∈ Θ: the sample criterion
must remain confined within a “sleeve” of the population expected criterion
over the entire parameter space Θ, like the area represented in Figure 11.1.


If, in this example, the sampling error increases at higher values of θ, the
local maximum θ1 could be mistaken for the actual global maximum θ0 .
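The idea can also be illustrated with a quick simulation (artificial data; the grid stands in for a compact Θ). Take q(x; θ) = −(x − θ)², for which Q₀(θ) = −(θ − μ)² − σ², and track the sup-norm distance between the sample and population criteria as N grows.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, var = 2.0, 1.0
grid = np.linspace(-5.0, 5.0, 201)          # a compact parameter space Theta

def sup_gap(N):
    x = rng.normal(mu, np.sqrt(var), size=N)
    # Sample criterion: Q_N(theta) = -(1/N) * sum_i (x_i - theta)^2
    QN = np.array([-np.mean((x - t) ** 2) for t in grid])
    # Population criterion: Q_0(theta) = -(theta - mu)^2 - sigma^2
    Q0 = -(grid - mu) ** 2 - var
    return np.max(np.abs(QN - Q0))

gaps = [sup_gap(N) for N in (100, 10_000, 1_000_000)]
print(gaps)   # the sup-norm distance shrinks as N grows
```

A finite grid can only approximate the supremum over Θ, but it conveys the essential point: the whole curve Q̂_N(·), not just its value at a single θ, approaches Q₀(·).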
Uniform convergence is ensured if these four conditions hold: i. q (xi ; θ)
is continuous; ii. Θ is a compact set; iii. E [|q (xi ; θ)|] < ∞: that is, q (xi ; θ)
has a bounded first absolute moment; iv. q(x_i; θ) is Borel-measurable on
its support. These conditions, together, allow one to invoke a result known as
the Uniform Weak Law of Large Numbers, implying uniform convergence.
These conditions are technical; notice, however, that i. and ii. relate to the
fact that M-Estimators are, in fact, maximum points; while iii. and iv. are
analogous to similar conditions from other Laws of Large Numbers. Armed
with the notion of uniform convergence, one can replicate the original proof
of M-Estimators’ consistency as it was given by Newey and McFadden (their
Theorem 2.1, which is reported here with minor variations).
Theorem 11.2. Consistency of M-Estimators. If i. Q0 (θ) is uniquely
maximized at θ0 , ii. Θ is a compact set, iii. Q0 (θ) is a continuous function,
and iv. Q̂_N(θ) uniformly converges in probability to Q_0(θ), then it follows
that M-Estimators are consistent as per (11.10).
Proof. For any ε > 0, with probability approaching 1 (w.p.a. 1):
\[ \text{by i.:} \quad \hat{Q}_N\left( \hat{\theta}_M \right) > \hat{Q}_N(\theta_0) - \frac{\varepsilon}{3} \quad (a) \]
\[ \text{by iv.:} \quad Q_0\left( \hat{\theta}_M \right) > \hat{Q}_N\left( \hat{\theta}_M \right) - \frac{\varepsilon}{3} \quad (b) \]
\[ \text{by iv.:} \quad \hat{Q}_N(\theta_0) > Q_0(\theta_0) - \frac{\varepsilon}{3} \quad (c) \]
therefore, w.p.a. 1:
\[ Q_0\left( \hat{\theta}_M \right) \overset{(b)}{>} \hat{Q}_N\left( \hat{\theta}_M \right) - \frac{\varepsilon}{3} \overset{(a)}{>} \hat{Q}_N(\theta_0) - \frac{2\varepsilon}{3} \overset{(c)}{>} Q_0(\theta_0) - \varepsilon \]
hence, Q_0(θ̂_M) > Q_0(θ_0) − ε w.p.a. 1. Now, denote by U any given open
neighborhood of θ_0 and by U^c its complement in Θ. Also define:
\[ Q_0(\theta^*) = \sup_{\theta \in \Theta \cap U^c} Q_0(\theta) \]
for some θ^*, and notice that Q_0(θ^*) < Q_0(θ_0) by i.-ii.-iii.; thus, by setting:
\[ \varepsilon = Q_0(\theta_0) - Q_0(\theta^*) \]
it follows that:
\[ Q_0\left( \hat{\theta}_M \right) > Q_0(\theta_0) - \varepsilon \;\Rightarrow\; Q_0\left( \hat{\theta}_M \right) > Q_0(\theta^*) \]
implying that θ̂_M ∈ U for any open neighborhood U. Thus, θ̂_M →p θ_0.


The result about asymptotic normality, which is instrumental for statistical inference, is derived in a perhaps more familiar way.
Theorem 11.3. Asymptotic Normality of M-Estimators. A generic
M-Estimator θ̂_M follows an asymptotically normal distribution if the following five conditions hold simultaneously:
i. θ̂_M is a consistent estimator of θ_0;

ii. q (xi ; θ) is a concave and twice continuously differentiable function in


an open neighborhood of θ0 ;

iii. ∂/∂θ E[q(x_i; θ)] = E[∂/∂θ q(x_i; θ)]: the derivative can pass through the
expectation integral;
iv. the data meet the requirements for the application of a Central Limit
Theorem (the data are “well behaved”);
v. the Hessian matrix is nonsingular, it is continuous in θ and it has a
bounded absolute first moment.
The limiting distribution is:
\[ \sqrt{N} \left( \hat{\theta}_M - \theta_0 \right) \xrightarrow{d} N\left( 0,\, Q_0^{-1} \Upsilon_0 Q_0^{-1} \right) \tag{11.13} \]
where Q_0 and Υ_0 are defined as the following probability limits:
\[ \lim_{N \to \infty} \mathrm{Var}\left[ \frac{1}{\sqrt{N}} \sum_{i=1}^{N} s_i(x_i; \theta_0) \right] \xrightarrow{p} \Upsilon_0 \qquad \lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} E\left[ H_i(x_i; \theta_0) \right] \xrightarrow{p} Q_0 \]
which implies the following asymptotic distribution, for a fixed N.
\[ \hat{\theta}_M \overset{A}{\sim} N\left( \theta_0,\, \frac{1}{N} Q_0^{-1} \Upsilon_0 Q_0^{-1} \right) \tag{11.14} \]
N 0 0

Proof. This derivation is reminiscent of the proofs for Theorems 6.17 and
6.18, respectively for MM and MLE estimators, in Lecture 6; this one is in a
way more general as it allows for possibly non i.i.d. data. Since, by condition
ii. the score function is assumed to be continuous and differentiable, then
by the Mean Value Theorem one can write:
\[ s_i\left( x_i; \hat{\theta}_M \right) = s_i(x_i; \theta_0) + H_i\left( x_i; \tilde{\theta}_N \right) \left( \hat{\theta}_M - \theta_0 \right) \]

361
11.2. Asymptotics of Maximum Estimators

where θ̃_N is some convex combination of θ̂_M and θ_0. By summing over the
N observations and dividing by √N, one gets:
\[ 0 = \frac{1}{\sqrt{N}} \sum_{i=1}^{N} s_i\left( x_i; \hat{\theta}_M \right) = \frac{1}{\sqrt{N}} \sum_{i=1}^{N} s_i(x_i; \theta_0) + \left[ \frac{1}{N} \sum_{i=1}^{N} H_i\left( x_i; \tilde{\theta}_N \right) \right] \sqrt{N} \left( \hat{\theta}_M - \theta_0 \right) \]
by recalling that the sample score evaluated at the solution is equal to zero
by definition of M-Estimators. The expression above can be rewritten as:
\[ \sqrt{N} \left( \hat{\theta}_M - \theta_0 \right) = -\left[ \frac{1}{N} \sum_{i=1}^{N} H_i\left( x_i; \tilde{\theta}_N \right) \right]^{-1} \frac{1}{\sqrt{N}} \sum_{i=1}^{N} s_i(x_i; \theta_0) \tag{11.15} \]
since condition v. allows the average Hessian matrix to be inverted. Next, consider the following.
as v. lets invert the average Hessian matrix. Next, consider the following.
1. By i. and v. one can apply some suitable Law of Large Numbers to
the “sample-averaged” Hessian matrix, showing that:
\[ \frac{1}{N} \sum_{i=1}^{N} H_i\left( x_i; \tilde{\theta}_N \right) \xrightarrow{p} Q_0 \tag{11.16} \]
which follows from the Continuous Mapping Theorem, since θ̃_N →p θ_0.
2. Condition iii. implies ∂Q_0(θ_0)/∂θ = lim_{N→∞} (1/N) Σ_{i=1}^N E[s_i(x_i; θ_0)] = 0 and:
\[ \frac{1}{N} \sum_{i=1}^{N} s_i(x_i; \theta_0) \xrightarrow{p} 0 \]
hence, by condition iv. and the Continuous Mapping Theorem, it is
as follows.
\[ \frac{1}{\sqrt{N}} \sum_{i=1}^{N} s_i(x_i; \theta_0) \xrightarrow{d} N(0, \Upsilon_0) \]
Again, Slutskij’s Theorem and the Cramér-Wold Device allow one to recombine
these intermediate results so as to show (11.13).
As usual, the asymptotic sandwich variance-covariance matrix is not immediately workable for statistical inference, since matrices Q_0 and Υ_0 are unknown and must be estimated. The “bread” Q_0 is asymptotically evaluated
as:
\[ \hat{Q}_N \equiv \frac{1}{N} \sum_{i=1}^{N} H_i\left( x_i; \hat{\theta}_M \right) \xrightarrow{p} Q_0 \tag{11.17} \]


which is straightforward; like in linear models, it is the estimation of the


“meat” matrix Υ0 that requires more care. Note that condition iv. from the
Theorem guarantees that some Central Limit Theorem can be applied, but
it is silent as to which version of it is being invoked. This, in turn, depends
on the assumptions regarding the data that the researcher feels confident
about making. If the observations are assumed independent (but possibly
not identically distributed) it is:
\[ \Upsilon_0 = \lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} E\left[ s_i(x_i; \theta_0)\, s_i^T(x_i; \theta_0) \right] \tag{11.18} \]

which specializes further to Υ0 = E si (xi ; θ0 ) sT i (xi , θ0 ) if, in addition,


 

the data are identically distributed (i.i.d.). In such cases, Υ0 is consistently


estimated as follows.
N
1 X  b  T b  p
ΥN =
b si xi ; θM si xi , θM → Υ0 (11.19)
N i=1

In analogy with the discussion from Lecture 8 about the consequences


of dependent observations, the above might not be a consistent estimator of
Υ_0 if the observations are dependent. Yet the CCE estimator can be easily
adapted to the case of within-group dependence even for M-Estimators:
\[ \hat{\Upsilon}_{CCE} = \frac{1}{N} \sum_{c=1}^{C} \sum_{i=1}^{N_c} \sum_{j=1}^{N_c} s_{ic}\left( x_{ic}; \hat{\theta}_M \right) s_{jc}^T\left( x_{jc}; \hat{\theta}_M \right) \xrightarrow{p} \Upsilon_0 \tag{11.20} \]

where both observations and scores are also indexed by the group or cluster
c = 1, . . . , C which they belong to. Similar extensions to HAC estimation do
exist, although in order to account for dependent observations in practical
applications of M-Estimators, CCE is overwhelmingly preferred because it
is much easier to implement. For any appropriate estimator Υ̂_N →p Υ_0, the
variance-covariance matrix of M-Estimators is estimated as follows.
\[ \widehat{\mathrm{Avar}}\left( \hat{\theta}_M \right) = \frac{1}{N} \hat{Q}_N^{-1} \hat{\Upsilon}_N \hat{Q}_N^{-1} \tag{11.21} \]

The above expression can be used to perform inference about θ̂_M.
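Formula (11.21) is mechanical to compute once scores and Hessians are available. For OLS viewed as an M-Estimator (q_i = −(y_i − x_iᵀβ)², so s_i = 2x_iε_i and H_i = −2x_ix_iᵀ as in Example 11.3), the factors of 2 cancel and the sandwich reproduces the familiar heteroscedasticity-robust (HC0) estimator; the simulated data below make the check concrete.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, -2.0]) + X[:, 1] * rng.normal(size=n)  # heteroscedastic errors

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b

Q_hat = -2.0 * (X.T @ X) / n          # average Hessian, as in (11.17)
S = 2.0 * X * e[:, None]              # row i holds the score s_i'
Ups_hat = (S.T @ S) / n               # as in (11.19)
Qi = np.linalg.inv(Q_hat)
avar = Qi @ Ups_hat @ Qi / n          # the sandwich (11.21)

# Direct HC0 computation for comparison
XtXi = np.linalg.inv(X.T @ X)
hc0 = XtXi @ (X.T * e**2) @ X @ XtXi
print(np.max(np.abs(avar - hc0)))     # zero up to floating-point error
```

Replacing `Ups_hat` with the clustered sum from (11.20) yields, by the same algebra, cluster-robust standard errors.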

Example 11.6. Asymptotics of OLS. These results are best understood


by making appropriate comparisons with OLS. Consistency is easily estab-
lished under E [εi | xi ] = 0. The results on asymptotic normality are clearly
related to the discussion above by observing that the score and the Hessian
matrix of OLS are as in Example 11.3; therefore, Υ_0 corresponds with 4Ξ_0,
while the “bread” matrix Q_0 equals −2K_0 in the OLS case. □


Example 11.7. Asymptotics of NLLS. For the discussion of the asymptotic properties of NLLS, it is useful to define the following K × 1 vector:
\[ \hat{h}_i \equiv \frac{\partial}{\partial \theta} h\left( x_i; \hat{\theta}_{NLLS} \right) \]
that is, the derivative of the CEF evaluated at x_i and at the estimate θ̂_NLLS.
To evaluate the consistency of the NLLS estimator, recall that by construction
the latter sets the average score at zero for every value of N.
\[ 0 = \frac{1}{N} \sum_{i=1}^{N} s_i\left( y_i, x_i; \hat{\theta}_{NLLS} \right) = \frac{1}{N} \sum_{i=1}^{N} 2 \hat{h}_i \hat{\varepsilon}_i \xrightarrow{p} 0 \tag{11.22} \]
The NLLS is consistent under its motivating assumption about the CEF of
Yi given xi , which for clarity’s sake is reported again below.
\[ E\left[ \varepsilon_i \mid x_i \right] = E\left[ y_i \mid x_i \right] - h(x_i; \theta_0) = 0 \]
From this condition, along with its direct implication (11.7), it follows that:
\[ \lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} E\left[ \frac{\partial}{\partial \theta} h(x_i; \theta_0)\, \varepsilon_i \right] = \lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} E\left[ h_{0i} \varepsilon_i \right] = 0 \]
which can be reconciled with (11.22) so long as ĥ_i →p h_{0i} for i = 1, . . . , N, as
per some applicable Law of Large Numbers. Thus, the Continuous Mapping
Theorem also implies that:
\[ \hat{\theta}_{NLLS} \xrightarrow{p} \theta_0 \]
that is, the NLLS estimator is indeed consistent. Regarding the asymptotic
distribution, note that with independent observations:
\[ Q_0 = \lim_{N \to \infty} -\frac{1}{N} \sum_{i=1}^{N} 2 \cdot E\left[ h_{0i} h_{0i}^T \right] \qquad \Upsilon_0 = \lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} 4 \cdot E\left[ \varepsilon_i^2 h_{0i} h_{0i}^T \right] \]
 
and therefore, by defining the residual e_i ≡ y_i − h(x_i; θ̂_NLLS):
\[ \widehat{\mathrm{Avar}}\left( \hat{\theta}_{NLLS} \right) = \left[ \sum_{i=1}^{N} \hat{h}_i \hat{h}_i^T \right]^{-1} \left[ \sum_{i=1}^{N} e_i^2\, \hat{h}_i \hat{h}_i^T \right] \left[ \sum_{i=1}^{N} \hat{h}_i \hat{h}_i^T \right]^{-1} \tag{11.23} \]
i=1 i=1 i=1

is the heteroscedasticity-consistent estimator of the variance-covariance of


NLLS. Note how it parallels estimator (8.22) for OLS. The homoscedastic,
CCE and HAC versions are analogous to their OLS counterparts too. 
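Estimator (11.23) is easy to compute once the gradient of the CEF is known. A sketch on simulated, heteroscedastic data, assuming the hypothetical exponential CEF h(x; θ) = a·exp(bx), whose gradient (exp(bx), ax·exp(bx)) plays the role of ĥ_i:

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(5)
n = 1000
x = rng.uniform(0.0, 2.0, size=n)

def h(x, a, b):
    return a * np.exp(b * x)        # hypothetical exponential CEF

# Simulated data with heteroscedastic errors; true theta = (1.5, 0.8)
y = h(x, 1.5, 0.8) + (0.2 + 0.2 * x) * rng.normal(size=n)

(a, b), _ = curve_fit(h, x, y, p0=(1.0, 0.5))
e = y - h(x, a, b)                  # NLLS residuals

# Rows of H are the gradients h_i-hat' = (dh/da, dh/db) at the estimates
H = np.column_stack([np.exp(b * x), a * x * np.exp(b * x)])

bread = np.linalg.inv(H.T @ H)
meat = (H * (e**2)[:, None]).T @ H
avar = bread @ meat @ bread         # (11.23)
se = np.sqrt(np.diag(avar))
print(se)                           # robust standard errors for (a, b)
```

Note the parallel with the OLS HC0 formula: ĥ_i simply replaces x_i in each of the three factors.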


11.3 The Trinity of Asymptotic Tests


After estimating an econometric model, a researcher is often interested in
performing some tests of hypothesis, which are possibly non-linear:
\[ H_0 \colon v(\theta_0) = 0 \qquad H_1 \colon v(\theta_0) \neq 0 \]
where v (·), a vector-valued function, has length L (for multiple hypotheses).
There are three alternative methods to perform such tests; these are known
together as the “Trinity.” Under the unifying framework of M-Estimation,
these methods have definitions that are uniform across Least Squares, MLE,
GMM and other estimators. Here, these methods are briefly reviewed.
Asymptotic Test 1. The Generalized Wald Statistic. Consider the
following scalar, called the generalized Wald Statistic:
\[ \widetilde{W}_{H_0} = v^T\left( \hat{\theta}_M \right) \cdot \left[ \hat{V} \cdot \widehat{\mathrm{Avar}}\left( \hat{\theta}_M \right) \cdot \hat{V}^T \right]^{-1} \cdot v\left( \hat{\theta}_M \right) \tag{11.24} \]
where \( \hat{V} \equiv \frac{\partial}{\partial \theta^T} v\left( \hat{\theta}_M \right) \). The limiting distribution of this statistic is:
\[ \widetilde{W}_{H_0} \xrightarrow{d} \chi^2_L \]
that is, under the null hypothesis $H_0$ the test statistic has a limiting chi-squared distribution with $L$ degrees of freedom. This result appears more intuitive upon comparing the generalized Wald Statistic to its original version from the linear model, where $v(\beta) = R\beta - c = 0$ is some linear function. There, the original Wald Statistic is just a particular case of Hotelling's t-squared statistic, which asymptotically follows the chi-squared distribution as per Observation 6.2. This non-linear case is analogously derived after applying the Delta Method to the central matrix of the quadratic form.
Example 11.8. Generalized Wald statistic and a linear constraint. Suppose that interest lies in a specific hypothesis about the linear model:

$$ H_0: \sum_{k=1}^{K}\beta_k = 1 \qquad H_1: \sum_{k=1}^{K}\beta_k \neq 1 $$

corresponding, for example, to the hypothesis of constant returns to scale in production functions (with the constant parameter being $\beta_0$). In this case:

$$ \widetilde{W}_{H_0} = \frac{\left(\sum_{k=1}^{K}\hat{\beta}_{k,OLS} - 1\right)^2}{\sum_{k=1}^{K}\sum_{q=1}^{K}\hat{\sigma}_{\beta_{kq}}} \xrightarrow{d} \chi^2_1 $$

where $\hat{\sigma}_{\beta_{kq}}$ is the $kq$-th element of the estimated variance-covariance of the OLS estimates. □
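As a minimal numerical sketch (simulated data that satisfy the null; all values are illustrative), the Wald statistic of this example can be assembled directly from the OLS output:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 500, 3                               # sample size and number of slopes

X = rng.normal(size=(N, K))
beta_true = np.array([0.5, 0.3, 0.2])       # slopes sum to one: H0 holds
y = 1.0 + X @ beta_true + rng.normal(size=N)

# OLS with a constant term
Z = np.column_stack([np.ones(N), X])
b = np.linalg.solve(Z.T @ Z, Z.T @ y)
e = y - Z @ b
V = (e @ e / (N - K - 1)) * np.linalg.inv(Z.T @ Z)   # homoscedastic Avar-hat

# Wald statistic for H0: the K slopes sum to one; the denominator adds up
# all kq-th elements of the slopes' variance-covariance block
R = np.r_[0.0, np.ones(K)]                  # picks out the slope coefficients
W = (R @ b - 1.0) ** 2 / (R @ V @ R)
print(W)
```

Under the null, W is a draw from (approximately) a chi-squared distribution with one degree of freedom.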


The Wald Test is particularly easy to implement, as it only requires recombining some already calculated estimates. It has, however, two relevant drawbacks in the case of non-linear hypothesis tests. First, it performs quite poorly in small samples: this is a consequence of applying the Delta Method (which is an asymptotic result). The second problem is that the Wald test is not transformation-invariant: in fact, it computes different values of the Wald Statistic for two equivalent hypotheses such as, say, $H_0: \beta_k = 0$ and $H_0: \exp(\beta_k) = 1$. For these reasons, the Wald Test should preferably be used only when performing simple tests about linear hypotheses.
Asymptotic Test 2. The Distance, or Likelihood Ratio test. The "Distance Test" was and still is also called "Likelihood Ratio Test," as it was originally conceived in the context of MLE. With respect to the Generalized Wald Test, it has two major advantages: first, it is transformation-invariant; second, it deals with non-linear hypotheses quite well. This comes at a cost: it is the most computationally demanding of all tests, since in addition to the main estimate of $\theta$ it requires computing an additional "restricted" estimate

$$ \hat{\theta}_V = \arg\max_{\theta\in\Theta_V} \hat{Q}_N(\theta) $$

where $\Theta_V = \{\theta\in\Theta : v(\theta) = 0\}$ is the "restricted parameter space." Then:

$$ D_{H_0} = N\left[\hat{Q}_N\big(\hat{\theta}_M\big) - \hat{Q}_N\big(\hat{\theta}_V\big)\right] \xrightarrow{d} \chi^2_L \qquad (11.25) $$

is the expression of the "Distance" Statistic in all cases but MLE, while

$$ LR_{H_0} = 2\left[\log\hat{Q}_N\big(\hat{\theta}_M\big) - \log\hat{Q}_N\big(\hat{\theta}_V\big)\right] \xrightarrow{d} \chi^2_L \qquad (11.26) $$

is the expression of the "Likelihood Ratio" for MLE, where $\hat{Q}_N(\theta) = \hat{L}_N(\theta)$ is the empirical likelihood function (notice the difference in the scaling factor). Intuitively, the test compares how much there is to gain, in terms of explaining the data, by letting the model be estimated "freely," without the restriction. Clearly, the unrestricted model will always perform statistically better at fitting the data; the question is "how much better" relative to the researcher's a priori hypotheses.
Example 11.9. The distance test and a linear constraint. One can test the same hypothesis as in Example 11.8 through the estimation of a "restricted" model, such as:

$$ Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_{K-1} X_{(K-1)i} + \left(1 - \sum_{k=1}^{K-1}\beta_k\right) X_{Ki} + \varepsilon_i $$

which can also be written as follows, with $\ddot{X}_{ki} \equiv X_{ki} - X_{Ki}$ for $k = 1,\dots,K$.

$$ Y_i - X_{Ki} = \beta_0 + \beta_1\ddot{X}_{1i} + \beta_2\ddot{X}_{2i} + \dots + \beta_{K-1}\ddot{X}_{(K-1)i} + \varepsilon_i $$

In this example, the last coefficient of the original model is forced to conform to the restriction that is implied by the null hypothesis; yet imposing the restriction on any other coefficient (except $\beta_0$) is equivalent. The Distance Test is computed in this case as:

$$ D_{H_0} = \sum_{i=1}^{N}\left(y_i - x_{Ki} - \ddot{x}_i^T\hat{\beta}_V\right)^2 - \sum_{i=1}^{N}\left(y_i - x_i^T\hat{\beta}_{OLS}\right)^2 \xrightarrow{d} \chi^2_1 $$

where $\hat{\beta}_V$ are the parameter estimates from the restricted model above. □
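The restricted-versus-unrestricted comparison of this example can be sketched as follows (simulated data satisfying the null; all numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 400, 3
X = rng.normal(size=(N, K))
beta_true = np.array([0.6, 0.3, 0.1])       # slopes sum to one: H0 holds
y = 0.5 + X @ beta_true + rng.normal(size=N)

def ssr(Z, w):
    """Sum of squared OLS residuals from regressing w on Z."""
    b = np.linalg.lstsq(Z, w, rcond=None)[0]
    r = w - Z @ b
    return r @ r

# Unrestricted model: y on a constant and all K regressors
ssr_u = ssr(np.column_stack([np.ones(N), X]), y)

# Restricted model: substitute the constraint, regressing (y - X_K) on a
# constant and the differences X_k - X_K, k = 1, ..., K-1
ssr_r = ssr(np.column_stack([np.ones(N), X[:, :-1] - X[:, [-1]]]), y - X[:, -1])

D = ssr_r - ssr_u      # restricted fit can never beat the unrestricted one
print(D)
```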

Asymptotic Test 3. The score – or Lagrange multiplier – test. The last type of test in the Trinity, which also goes by different names, presents the same advantages as the Distance/LR test, with the extra benefit that it does not require the "unrestricted" model to be estimated at all. Thus, it is computationally more parsimonious. The test is based on the properties of the sample average score function evaluated at one specific parameter value $\theta_v$ implied by the null hypothesis. Recall that the average score:

$$ \frac{1}{N}\sum_{i=1}^{N} s_i\big(x_i; \hat{\theta}_M\big) = 0 $$

always equals zero when evaluated at the unrestricted estimate $\hat{\theta}_M$, by the definition of M-Estimators. Consider, however, a restricted parameter value $\theta_v$ such that $v(\theta_v) = 0$. It follows that:

$$ \left\| \frac{1}{N}\sum_{i=1}^{N} s_i(x_i;\theta_v) \right\| > 0 \qquad (11.27) $$

that is, the $K$-dimensional sample score vector deviates from zero when evaluated at any "suboptimal" parameter choice. The Lagrange Multiplier statistic is:

$$ LM_{H_0} = \frac{1}{N}\left[\sum_{i=1}^{N} s_i(x_i;\theta_v)\right]^T \hat{\Upsilon}_v^{-1}\left[\sum_{i=1}^{N} s_i(x_i;\theta_v)\right] \xrightarrow{d} \chi^2_K \qquad (11.28) $$

where, as appropriate:

$$ \hat{\Upsilon}_v = \widehat{\mathrm{Avar}}\left[\frac{1}{\sqrt{N}}\sum_{i=1}^{N} s_i(x_i;\theta_v)\right] $$

and it is indicative of how statistically relevant the deviations in (11.27) are, because it computes a quadratic form of their standardized values.


Example 11.10. The score test and a linear constraint. Continuing the running Examples 11.8-11.9, a set of parameters $\beta_v$ of dimension $K$ that satisfies the hypothesized restriction can be computed from the estimates $\hat{\beta}_V$ of the restricted model as follows.

$$ \beta_v = \left(\hat{\beta}_{0V},\, \hat{\beta}_{1V},\, \hat{\beta}_{2V},\, \dots,\, \hat{\beta}_{(K-1)V},\ 1 - \sum_{k=1}^{K-1}\hat{\beta}_{kV}\right)^T $$

Hence, the Lagrange Multiplier Statistic would read as:

$$ LM_{H_0} = \left[\sum_{i=1}^{N} x_i e_i(\beta_v)\right]^T \left[\sum_{i=1}^{N} e_i^2(\beta_v)\, x_i x_i^T\right]^{-1} \left[\sum_{i=1}^{N} x_i e_i(\beta_v)\right] \xrightarrow{d} \chi^2_K $$

given the "restricted" residuals $e_i(\beta_v) = y_i - x_i^T\beta_v$. □
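The construction above can be sketched numerically as follows (simulated data without an intercept, for simplicity; all values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
N, K = 400, 3
X = rng.normal(size=(N, K))
beta_true = np.array([0.6, 0.3, 0.1])       # slopes sum to one: H0 holds
y = X @ beta_true + rng.normal(size=N)

# Restricted estimates: regress (y - X_K) on (X_k - X_K) for k < K, then
# recover the last coefficient from the restriction itself
b_r = np.linalg.lstsq(X[:, :-1] - X[:, [-1]], y - X[:, -1], rcond=None)[0]
beta_v = np.r_[b_r, 1.0 - b_r.sum()]        # satisfies the constraint exactly

# Score statistic built from the restricted residuals, as in Example 11.10
e = y - X @ beta_v
g = X.T @ e                                 # sum of x_i * e_i(beta_v)
S = (X * (e ** 2)[:, None]).T @ X           # sum of e_i^2 * x_i x_i'
LM = g @ np.linalg.solve(S, g)
print(LM)
```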

11.4 Quasi-Maximum Likelihood


The asymptotic properties of Maximum Likelihood estimators are especially
desirable in light of the considerations that are made in Lecture 6 following
the analysis of Theorem 6.18. The main implications can be rephrased using
the notation developed in this Lecture; with i.i.d. data the MLE framework
has the property that:
Υ0 = −Q0
which follows from the Information Matrix Equality. Therefore, for a fixed
N the asymptotic distribution of any Maximum Likelihood estimator is:
$$ \hat{\theta}_{MLE} \overset{A}{\sim} N\big(\theta_0,\, [I_N(\theta_0)]^{-1}\big) \qquad (11.29) $$
where $I_N(\theta_0)$ is the information matrix; thus the variance-covariance attains the Cramér-Rao bound, making MLE the most suitable choice for estimation so long as the distributional assumptions can be believed. This also delivers the convenient implication that the information matrix can be consistently estimated in two alternative ways, by either $\hat{\Upsilon}_N$ or $-\hat{Q}_N$,¹ where matrix $\hat{\Upsilon}_N$ is evaluated according to expression (11.19). In econometric practice, the method based on the so-called outer product of the gradients (OPG), that is, using $\hat{\Upsilon}_N$, is often favored due to computational considerations. In fact, statistical software packages routinely compute scores in order to perform MLE, and the OPG adds little computational cost to the problem of estimating the sample variance-covariance.
¹ In the treatment developed in Lecture 6, the two alternative options are expressed through the notation $\hat{H}_N$ and $\hat{J}_N$, respectively.


Example 11.11. MLE and Regression, continued. Continue the analysis of the MLE estimator of a linear regression model with normal disturbances from Example 11.5. To perform statistical inference, it is necessary to estimate the variance of the estimates. In this context, the observation-specific score – the individual contribution to the log-likelihood function – is:

$$ s_i(y_i, x_i; \theta) = \begin{pmatrix} x_i\left(y_i - x_i^T\beta\right)\sigma^{-2} \\ -\frac{1}{2}\sigma^{-2} + \frac{1}{2}\left(y_i - x_i^T\beta\right)^2\sigma^{-4} \end{pmatrix} $$
hence the individual Hessian matrix, for some given value of $\theta$, is:

$$ H_i(y_i, x_i; \theta) = \begin{pmatrix} -x_i x_i^T\sigma^{-2} & -x_i\left(y_i - x_i^T\beta\right)\sigma^{-4} \\ -x_i^T\left(y_i - x_i^T\beta\right)\sigma^{-4} & \frac{1}{2}\sigma^{-4} - \left(y_i - x_i^T\beta\right)^2\sigma^{-6} \end{pmatrix} $$

which is symmetric. By plugging the MLE estimates into the above matrix, summing it over all the observations and taking the inverse of the result
one obtains a consistent estimate of the inverse of the information matrix:

$$ -\frac{\hat{Q}_N^{-1}}{N} = \left[-\sum_{i=1}^{N} H_i\big(y_i, x_i; \hat{\theta}_{MLE}\big)\right]^{-1} = \begin{pmatrix} \left(\sum_{i=1}^{N} x_i x_i^T\right)^{-1}\hat{\sigma}^2_{MLE} & 0 \\ 0^T & \frac{2}{N}\hat{\sigma}^4_{MLE} \end{pmatrix} $$

where the border elements (except the bottom-right one) are equal to zero because they are proportional to the first $K$ elements of the sample score – that is, the $K$ normal equations.² An equivalent way to obtain the estimator of interest is to calculate the outer product of the gradients. By the above expression for $s_i(y_i, x_i; \theta)$, it is easy to verify that:

$$ \frac{\hat{\Upsilon}_N^{-1}}{N} = \left[\sum_{i=1}^{N} s_i\big(y_i, x_i; \hat{\theta}_{MLE}\big)\, s_i^T\big(y_i, x_i; \hat{\theta}_{MLE}\big)\right]^{-1} = \begin{pmatrix} \left(\sum_{i=1}^{N} x_i x_i^T\right)^{-1}\hat{\sigma}^2_{MLE} & 0 \\ 0^T & \frac{2}{N}\hat{\sigma}^4_{MLE} \end{pmatrix} $$
² The calculations to obtain the opposite of the bottom right element of this matrix (that is, the asymptotic variance of $\hat{\sigma}^2_{MLE}$) are as follows.

$$ \widehat{\mathrm{Avar}}\big[\hat{\sigma}^2_{MLE}\big] = -\left(\frac{N}{2}\hat{\sigma}^{-4}_{MLE} - \sum_{i=1}^{N}\left(y_i - x_i^T\hat{\beta}_{MLE}\right)^2\hat{\sigma}^{-6}_{MLE}\right)^{-1} = -\left(\frac{N}{2}\hat{\sigma}^{-4}_{MLE} - N\hat{\sigma}^{-4}_{MLE}\right)^{-1} = \frac{2}{N}\hat{\sigma}^4_{MLE} $$

For the outer product of the gradients the calculations are similar.


that is, $\hat{\Upsilon}_N^{-1} = -\hat{Q}_N^{-1}$, as predicted by the information matrix equality. This result highlights more clearly that the OLS estimator of the variance of $\beta$ in small samples (under the homoscedasticity assumption) differs from the Cramér-Rao bound by a multiplicative factor of $\frac{N}{N-K}$. □
Unfortunately, the desirable properties of MLE break down if the i.i.d. hypothesis cannot be defended, since the information matrix equality fails. To illustrate, let again $x_i = (y_i, z_i)$ be the collection of all endogenous and exogenous variables in the model, and allow for group dependence between observations. In this case, the likelihood function can be factored between clusters:

$$ L(\theta \,|\, x_1, \dots, x_N) = \prod_{c=1}^{C} f_{x_{1c},\dots,x_{N_c c}}\big(x_{1c}, \dots, x_{N_c c};\, \theta\big) $$
but not further; the information matrix equality no longer applies. A simi-
lar argument applies to more general cases of spatial and time dependence,
as well as to the i.n.i.d. case where the observations are independent but not
identically distributed (for example, when the homoscedasticity assumption
which is implicit in the CMLE model described in Example 11.5 cannot be
maintained, even if the error terms are always conditionally normal). In all
these cases, MLE retains the sandwiched limiting variance of M-Estimators
as per (11.13), and Υ0 must be estimated according to the working assump-
tions – for example, by formula (11.20) under group dependence.
As has been already observed, the almost ideal asymptotic properties of MLE break down even if the data are generated from a random (i.i.d.) sample, whenever the likelihood function is misspecified, that is, it does not match the "true" data generating process in the population under examination. It is interesting to investigate the consequences of misspecification – that is, of estimating a model via MLE while assuming a wrong underlying distribution – since this can occur frequently in practice. In such cases, the estimator of interest is called the Quasi-Maximum Likelihood Estimator $\hat{\theta}_{QMLE}$, and it is useful to characterize its probability limit, which is commonly called the pseudo-true value $\theta^*$:

$$ \hat{\theta}_{QMLE} = \arg\max_{\theta\in\Theta} L(\theta \,|\, x_1, \dots, x_N) \xrightarrow{p} \theta^* \qquad (11.30) $$

where the probability limit is evaluated with respect to the true distribution.
A relevant question is whether the QMLE is consistent, that is θ∗ = θ0 .
In fact, in the main example of MLE examined so far – the linear model
under normality assumptions – it can be observed that the ML estimator
of β0 coincides with the standard OLS estimator, so it is consistent if the
standard conditional mean assumption for linear models E [εi | xi ] = 0 holds,


even if the true underlying distribution is not normal. This observation is actually more general: if the assumed distribution is $f_{Y,x}(y_i, x_i; \theta)$ for some scalar endogenous variable $Y_i$, and in addition it belongs to the exponential macro-family of distributions, that is, it can be decomposed in terms of some primitive scalar functions $a[\cdot]$, $b[\cdot]$ and $c[\cdot]$ as:

$$ f_{Y,x}(y_i, x_i; \theta) = \exp\Big( a\big[\mu_{Y|x}(x_i;\theta)\big] + b[y_i] + c\big[\mu_{Y|x}(x_i;\theta)\big]\, y_i \Big) $$

where $\mu_{Y|x}(x_i;\theta) \equiv \mathrm{E}[Y_i|x_i;\theta]$ is a parametric specification of the CEF of $Y_i$ given $x_i$, then MLE consistently estimates $\mu_{Y|x}(x_i;\theta)$ if the CEF is correctly specified in the model. For linear models this is quite a familiar condition, but the importance of this general statistical result goes beyond linear models, since it applies to all non-linear models that are estimated via MLE and are based on a distribution belonging to the exponential macro-family. In fact, quite a relevant number of the distributions that are commonly employed in fully parametric econometric models belong to the exponential family, as already observed in Lecture 5.
Example 11.12. Poisson Regression. A count data model is a model suited for explaining some variable of interest $Y_i$ that only assumes non-negative integer values $Y_i = 0, 1, 2, \dots$ and where smaller values occur with higher frequency than larger ones. A simple example of a count data model is the Poisson regression, which is based on the Poisson distribution:

$$ \mathrm{P}(Y_i \,|\, x_i) = \frac{\lambda_i(x_i)^{Y_i}\exp(-\lambda_i(x_i))}{Y_i!} $$

where the count $Y_i$ for each observation $i = 1, \dots, N$ of an $N$-dimensional sample is assumed to be Poisson-distributed, each with a distinct Poisson parameter $\lambda_i(x_i)$, treated as a function of the individual characteristics $x_i$. By the properties of the Poisson distribution:

$$ \lambda_i(x_i) = \mathrm{E}[Y_i|x_i] = \mathrm{Var}[Y_i|x_i] $$

that is, $\lambda_i(x_i)$ equals both the conditional mean and the conditional variance of $Y_i$ given $x_i$. The most common choice is $\lambda_i(x_i) = \exp\big(x_i^T\beta_0\big)$; this results in the following conditional distribution.

$$ \mathrm{P}(Y_i \,|\, x_i) = \frac{\exp\big(Y_i\cdot x_i^T\beta_0\big)\exp\big(-\exp\big(x_i^T\beta_0\big)\big)}{Y_i!} $$
An implication of this assumption is that:

$$ \frac{\partial\,\mathrm{E}[Y_i|x_i]}{\partial x_i} = \frac{\partial\exp\big(x_i^T\beta_0\big)}{\partial x_i} = \exp\big(x_i^T\beta_0\big)\beta_0 \quad\Rightarrow\quad \beta_0 = \frac{1}{\mathrm{E}[Y_i|x_i]}\frac{\partial\,\mathrm{E}[Y_i|x_i]}{\partial x_i} $$

hence, $\beta_0$ can be interpreted as a semi-elasticity, just like in a log-lin model.


A Poisson regression may be more convenient than a simple linear regression
of log Yi on xT
i β as it allows for the frequent observations Yi = 0.
Given a random sample {(yi , xi )}N i=1 , the log-likelihood function of the
Poisson Regression model is the following.
  N
X
β| {(yi , xi )}N y i · xT T
  
log L i=1 = i β − exp x i β − log y i !
i=1

The First Order Conditions of the MLE problem, expressed as the sum of the individual scores, are:

$$ \sum_{i=1}^{N} s_i\big(y_i, x_i; \hat{\beta}_{MLE}\big) = \sum_{i=1}^{N} x_i\left[y_i - \exp\big(x_i^T\hat{\beta}_{MLE}\big)\right] = 0 $$

and they lack a closed form solution; consequently, the estimator in question must be obtained by numerical methods. The empirical Hessian matrix is:

$$ \hat{Q}_N = \frac{1}{N}\sum_{i=1}^{N} H_i\big(y_i, x_i; \hat{\beta}_{MLE}\big) = -\frac{1}{N}\sum_{i=1}^{N} \exp\big(x_i^T\hat{\beta}_{MLE}\big)\cdot x_i x_i^T $$

the opposite of which – multiplied by $N$ – is an appropriate estimator of the information matrix. Since the Poisson distribution belongs to the exponential family, even if the likelihood is misspecified the CEF of $Y_i$ given $x_i$ is consistently estimated provided it is itself well specified; together with the "exponential" specification for $\lambda_i(x_i)$, this implies that the MLE estimates can still be interpreted in terms of semi-elasticities. □
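A minimal numerical sketch of this example (simulated data; Newton-Raphson is one of several possible numerical methods, and all values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 1000
X = np.column_stack([np.ones(N), rng.normal(size=N)])
beta_true = np.array([0.2, 0.5])
y = rng.poisson(np.exp(X @ beta_true)).astype(float)

# Newton-Raphson: the Poisson log-likelihood is globally concave in beta,
# so the iterations converge reliably
beta = np.zeros(2)
for _ in range(100):
    mu = np.exp(X @ beta)
    score = X.T @ (y - mu)              # sum of the individual scores
    hess = -(X * mu[:, None]).T @ X     # sum of the individual Hessians
    step = np.linalg.solve(hess, score)
    beta = beta - step
    if np.max(np.abs(step)) < 1e-10:
        break

se = np.sqrt(np.diag(np.linalg.inv(-hess)))   # inverse information estimate
print(beta, se)
```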
The Poisson regression is quite an extreme example where MLE works even if the likelihood is misspecified. In the case of the conditionally normal linear model, even if the estimates of $\beta_0$ survive the misspecification, the MLE estimate of $\sigma^2$ is rendered meaningless even by the smallest deviation from the parametric assumptions – for example, if the errors are still normal, but heteroscedastic – which affects the estimate of the variance of $\hat{\beta}_{MLE}$. This is a consequence of the more general fact that under misspecification of the likelihood function the information matrix equality fails, and it can be shown that the limiting distribution of the QMLE is:
$$ \sqrt{N}\big(\hat{\theta}_{QMLE} - \theta^*\big) \xrightarrow{d} N\big(0,\, [Q^*]^{-1}\Upsilon^*[Q^*]^{-1}\big) \qquad (11.31) $$

where $Q^*$ and $\Upsilon^*$ are the analogues of $Q_0$ and $\Upsilon_0$ respectively, but evaluated at $\theta^*$ instead of $\theta_0$. Hence, even if the QMLE is consistent, it is safest to

estimate its covariance matrix as if the true asymptotic variance-covariance


took the standard sandwiched formula (11.14) of M-Estimators, even if the
observations are indeed both independent and identically distributed. This
is analogous to cautiously estimating the variance of OLS estimates by the
heteroscedasticity-robust formula when homoscedasticity is not too certain.
In more general cases where the assumed distribution does not belong to
the exponential family, the QMLE is actually inconsistent: $\theta^* \neq \theta_0$. Yet not
all is lost! It turns out that the QMLE can be still given an interpretation in
terms of “best approximation” to the true distribution, similarly as OLS can
be interpreted as the best approximation to the true CEF in the population.
The theory underlying this approach relies on the analysis of the so-called
Kullback-Leibler Information Criterion (KLIC), defined as:

$$ K_x(\theta) \equiv \mathrm{E}_g\left[\log\frac{g_x(x_1,\dots,x_N;\theta_0)}{f_x(x_1,\dots,x_N;\theta)}\right] = \int_{\mathcal{X}} \log\left[\frac{g_x(x_1,\dots,x_N;\theta_0)}{f_x(x_1,\dots,x_N;\theta)}\right] g_x(x_1,\dots,x_N;\theta_0)\,\mathrm{d}x_1\dots\mathrm{d}x_N $$
where fx (·) is the assumed joint mass or density function generating the
data, while gx (·) is the true function, which is taken as given; the expecta-
tion is taken with respect to gx (·). Note that by construction, Kx (θ) ≥ 0
for all θ ∈ Θ; but if the distribution is correctly specified, it is fx (·) = gx (·),
and thus Kx (θ0 ) = 0: the KLIC would attain its minimum. In addition:
$$ \begin{aligned} K_x(\theta) &= \mathrm{E}_g[\log g_x(x_1,\dots,x_N;\theta_0) - \log f_x(x_1,\dots,x_N;\theta)] \\ &= \mathrm{E}_g[\log g_x(x_1,\dots,x_N;\theta_0)] - \log L_{g0}(\theta \,|\, x_1,\dots,x_N) \end{aligned} $$
where log Lg0 (θ| x1 , . . . , xN ) is the pseudo-population likelihood that results
if the assumed distribution is fx (·), but the true one is gx (·). Clearly, under
general assumptions:
$$ \hat{\theta}_{QMLE} \xrightarrow{p} \theta^* = \arg\max_{\theta\in\Theta} \log L_{g0}(\theta \,|\, x_1,\dots,x_N) = \arg\min_{\theta\in\Theta} K_x(\theta) $$

that is, the QMLE converges in probability to the pseudo-true value, which
is at the same time the maximizer of the pseudo-population likelihood and
the minimizer of the KLIC! In this sense, the probability limit of the QMLE
minimizes a well-defined criterion of distance between the assumed and the
true density, and as such it can be thought of as a "best approximation" of sorts.
Mirroring the analogous discussion of Least Squares as best approximation
of the CEF, this is not an excuse for disregarding the problem of correctly
specifying the likelihood function! However, it motivates the applied practice of enriching MLE models with flexible parametric specifications (say,
polynomials of xi ) in order to best approximate the true distribution.


11.5 Introduction to Binary Outcome Models


A case where the choice of MLE over semi-parametric methods is well moti-
vated is when some dependent variable Yi is limited, meaning that it only
takes a discrete set of values. In the most extreme, as well as most common
cases of limited dependent variable (LDV) models, Yi is a dummy or
binary variable: Yi ∈ {0, 1}. The reason is that economists are constantly
interested in the determinants of binary outcomes, that are usually framed
as microeconomic problems of choice over two alternatives. Examples are:
• What are the determinants of individual enrollment in college?
• Which factors influence on the probability of default of a firm?
• What are the causes of the eruption of civil war in a country?
in all these cases, either outcome takes one value (say college, default and
civil war take Yi = 1) while the alternative takes the other (Yi = 0).
Other LDV models allow for multiple outcomes:
• What means of transportations do individuals choose for commuting?
• Which characteristics of a country determine its political regime?
• What type of insurance contract is preferred by different individuals?
• Which individual characteristics predict people’s responses in surveys?
and the choice of LDV models often depends on whether alternatives can
be nested in groups (e.g. public vs. private transportation; democratic vs.
authoritarian regimes), and on whether they can be ranked or ordered (e.g.
insurance contracts from minimal to maximal coverage; “strongly disagree”
to "strongly agree" types of answer in surveys). The objective here is neither to review all types of LDV models nor to summarize the immense literature
about their econometric estimation. Instead, the aim is to provide a mini-
mal introduction to the most common binary outcome models, in order
to make a case for M-Estimation and specifically MLE in concrete economic
settings. Moreover, this is useful towards the eventual discussion of other
applications of MLE in the next section.
Let us consider a problem with binary outcomes Yi ∈ {0, 1}. Treating
the latter as random (Bernoulli) events, it is natural to think about their
realization probability as a conditional function of some variables xi :
P (Yi = 1| xi ) = G (xi , β0 )
P (Yi = 0| xi ) = 1 − G (xi , β0 )


where G (·) is some function of the variables xi parametrized by vector β0 .


Notice that as the problem is binary, the probability of either outcome can
be treated residually with respect to the other’s. In fact, one can write the
conditional expectation of Yi given xi as

$$ \mathrm{E}[Y_i|x_i] = 1\cdot[G(x_i,\beta_0)] + 0\cdot[1 - G(x_i,\beta_0)] = G(x_i,\beta_0) $$

as it happens with any Bernoulli distribution.


A natural question is whether this model is estimable via linear regres-
sion: this is called the linear probability model (LPM):

yi = x T
i β0 + i , yi ∈ {0, 1}

which is equivalent to assuming G (xi , β0 ) = xT i β0 . Note that by definition


of regression it is i = Yi − E [Yi | xi ] = Yi − xi β0 ; and so it follows that:
T

P i = 1 − xT T

i β0 x i = x i β0
P i = −xT T

i β0 xi = 1 − xi β0

yet this instance of "natural heteroscedasticity" also implies:

$$ \mathrm{E}[\epsilon_i|x_i] = \big(1 - x_i^T\beta_0\big)x_i^T\beta_0 - x_i^T\beta_0\big(1 - x_i^T\beta_0\big) = 0 $$

hence OLS applied to this model would still produce unbiased and consistent estimates of $\beta_0$, even if the problem is naturally heteroscedastic (something that is normally addressed either via "robust" standard errors or, in small samples, via FGLS). The main issues of the LPM depend on the fact that the linear conditional expectation $x_i^T\beta_0$ cannot be constrained to lie within the $(0,1)$ interval. This implies that:

1. the conditional variance of the error term might take negative values:

$$ \mathrm{Var}[\epsilon_i|x_i] = x_i^T\beta_0\big(1 - x_i^T\beta_0\big) \gtrless 0 $$

2. the predicted probabilities $\hat{\mathrm{E}}[Y_i|x_i] = x_i^T\hat{\beta}_{LPM} = \hat{y}_i$ might take values outside the $[0,1]$ interval.
Both are facts that make no probabilistic sense. Unsurprisingly, the LPM is
used in practice only in a few limited circumstances, that is for the sake of
comparison with other LDV models or in well-defined quasi-experimental
settings where interest falls, in particular, on the transparent estimation of
the causal effect of some variable Xi upon a binary outcome Yi .
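The second problem is easy to exhibit on simulated data (a sketch with illustrative numbers; the steep logistic CEF is chosen so as to make out-of-range predictions likely):

```python
import numpy as np

rng = np.random.default_rng(4)
N = 500
x = rng.normal(size=N)
p_true = 1.0 / (1.0 + np.exp(-(0.2 + 2.0 * x)))     # a steep logistic CEF
y = (rng.uniform(size=N) < p_true).astype(float)

# Linear probability model: OLS of the binary outcome on a constant and x
Z = np.column_stack([np.ones(N), x])
b = np.linalg.lstsq(Z, y, rcond=None)[0]
fitted = Z @ b

# Count the fitted "probabilities" that escape the unit interval
n_below, n_above = (fitted < 0.0).sum(), (fitted > 1.0).sum()
print(b, n_below, n_above)
```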


In general, however, econometricians tend to prefer non-linear models that produce coherent predictions of the outcome probabilities. For any given realization $x_i$, an obvious choice is:

$$ G(x_i,\beta_0) = F_x\big(x_i^T\beta_0\big) = F_x(\lambda_i) \qquad (11.32) $$

where $F_x(\cdot)$ is a probability distribution function with a single "free" parameter (typically a location parameter) $\lambda_i = x_i^T\beta_0$. Furthermore, most applications make use of a symmetric distribution function, that is, one for which $F_x(\lambda_i) = 1 - F_x(-\lambda_i)$. It is apparent that this choice solves the problem of the predicted probabilities, given that by the very definition of probability distribution function, conditionally on any realization $x_i$:

$$ \lim_{x_i^T\beta_0\to+\infty} \mathrm{P}(Y_i=1|x_i) = \lim_{x_i^T\beta_0\to+\infty} F_x\big(x_i^T\beta_0\big) = 1 $$

$$ \lim_{x_i^T\beta_0\to-\infty} \mathrm{P}(Y_i=1|x_i) = \lim_{x_i^T\beta_0\to-\infty} F_x\big(x_i^T\beta_0\big) = 0 $$

and symmetrically for $Y_i = 0$. In addition, this model has a clear advantage that appeals to econometricians: it can be motivated by a "structural" model of individual choice, which describes the (micro-)economics of the problem. Specifically, in its simplest form a latent variable model for a binary outcome is a model that reads like

$$ y_i^* = x_i^T\beta_0 + \varepsilon_i \qquad (11.33) $$
$$ y_i = \begin{cases} 1 & \text{if } y_i^* > \alpha_0 \\ 0 & \text{if } y_i^* \leq \alpha_0 \end{cases} \qquad (11.34) $$

where Yi∗ is a latent variable that represents the cost-benefit evaluation of


the binary choice by the i-th individual. This latent variable is a theoretical
construct that cannot be observed by the econometrician – much like any
error term – but that is assumed to determine the choice of the outcome
according to a simple rule. In particular, if Yi∗ is larger than some unknown
“threshold” parameter α0 , then Yi = 1 is chosen; otherwise Yi = 0 is opted
for. Notably, the latent variable is a linear function of the individual char-
acteristics xi and a specific error term εi , which is distributed according to
Fx (·). In such a model, conditionally on any realization xi :

$$ \mathrm{P}(Y_i=1|x_i) = \mathrm{P}(Y_i^* > \alpha_0 \,|\, x_i) = \mathrm{P}\big(\varepsilon_i > -x_i^T\beta_0 + \alpha_0 \,\big|\, x_i\big) \qquad (11.35) $$

where intuitively, if $x_i$ contains a constant element, its associated "intercept" parameter and $\alpha_0$ are not separately identified (they would have the same implications for the conditional probability of choosing $Y_i = 1$ over $Y_i = 0$ for any given $x_i$). Hence, the normalization $\alpha_0 = 0$ is typically imposed (this has no practical implications as the "constant probability" is included in $\beta_0$). Moreover, if $F_x(\cdot)$ is a symmetric distribution, (11.35) reshapes as:
in β0 ). Moreover, if Fx (·) is a symmetric distribution, (11.35) reshapes as:
P (Yi = 1| xi ) = P (Yi∗ > 0| xi )
= P εi > −xT

i β0 xi
= 1 − P εi ≤ −xT (11.36)

i β0 xi
= P εi < x T

i β0 x i
= Fx x T

i β0

where the fourth line exploits the symmetry of Fx (·). This fact reconciles
the latent variable model with our specification of the conditional probabil-
ity for the outcome Yi .
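A quick simulation (with illustrative parameter values and a logistic error, so that $F_x = \Lambda$) confirms that data generated by the latent variable model reproduce the probability in (11.36):

```python
import numpy as np

rng = np.random.default_rng(8)
N = 200_000
x_val = 0.7                                   # one fixed realization of X_i
b0, b1 = 0.3, 0.9                             # illustrative parameter values

eps = rng.logistic(size=N)                    # symmetric, scale-normalized errors
y_star = b0 + b1 * x_val + eps                # latent variable, with alpha_0 = 0
y = (y_star > 0.0).astype(float)              # observed binary outcome

# The empirical frequency of Y = 1 should match Lambda(x'beta), as in (11.36)
p_model = 1.0 / (1.0 + np.exp(-(b0 + b1 * x_val)))
print(y.mean(), p_model)
```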
Before getting to practical aspects and the MLE estimation of models
with binary outcomes, two observations need to be made.
• Latent variable models are not specific to binary outcomes: multinomial
LDV models are usually motivated by more complex versions, which are
outside the scope of this overview. Latent variable models are also used
in the structural analysis of empirical strategic games.
• Derivation (11.36) above shows why $F_x(\cdot)$ should not contain a variable scale parameter, such as the variance (if $F_x(\cdot)$ is, say, normal, its variance should be known or normalized, e.g. $\sigma^2 = 1$); otherwise the $K$ parameters in $\beta_0$ and the scale parameter would not be separately identified. To see intuitively why, consider the case where $\alpha_0 = 0$ and $F_x(\cdot)$ features some scale parameter, call it $\sigma$. Here, the two equations:

$$ y_i^* = x_i^T\beta_0 + \varepsilon_i $$
$$ \sigma y_i^* = \sigma\big(x_i^T\beta_0 + \varepsilon_i\big) $$

are observationally equivalent, that is, $F_x\big(x_i^T\beta_0\big) = F_x\big(\sigma\cdot x_i^T\beta_0\big)$. The
intuition behind this is that one can only observe whether the latent
variable takes values above (Yi = 1) or below (Yi = 0) its hypothesized
threshold, and not its variation as a function of the variation of xi . For a
similar reason, the scale parameter could be identified if $\alpha_0 \neq 0$ and $x_i$
did not include any constant term. In such a case the “scale” parameter
would be identified by changes in the average value of Yi that are not
explained by xi . However, the basic fact that scale parameters are not
independently identified remains. In general, there is seldom a reason to
include a scale parameter instead of a constant location parameter.


Given the choice of a symmetric probability distribution $F_x(\cdot)$ and a sample $\{(y_i,x_i)\}_{i=1}^N$, the likelihood function of a binary choice model is:

$$ L\big(\beta \,\big|\, \{(y_i,x_i)\}_{i=1}^N\big) = \prod_{i=1}^{N} \big[F_x\big(x_i^T\beta\big)\big]^{y_i}\big[1 - F_x\big(x_i^T\beta\big)\big]^{1-y_i} $$

which is just a generalization of the likelihood function for a Bernoulli sample. The corresponding log-likelihood function is:

$$ \log L\big(\beta \,\big|\, \{(y_i,x_i)\}_{i=1}^N\big) = \sum_{i=1}^{N}\Big[ y_i\log F_x\big(x_i^T\beta\big) + (1-y_i)\log\big(1 - F_x\big(x_i^T\beta\big)\big) \Big] $$

where the First Order Conditions (the sum of the individual scores, evaluated at the estimates) are:

$$ \frac{\partial}{\partial\beta}\log L\big(\hat{\beta}_{MLE} \,\big|\, \{(y_i,x_i)\}_{i=1}^N\big) = \sum_{i=1}^{N} s_i\big(y_i,x_i;\hat{\beta}_{MLE}\big) = \sum_{i=1}^{N}\left[\frac{y_i\, f_x\big(x_i^T\hat{\beta}_{MLE}\big)}{F_x\big(x_i^T\hat{\beta}_{MLE}\big)} - \frac{(1-y_i)\, f_x\big(x_i^T\hat{\beta}_{MLE}\big)}{1 - F_x\big(x_i^T\hat{\beta}_{MLE}\big)}\right] x_i = 0 $$

where $f_x\big(x_i^T\beta\big)$ is the probability density function associated with the – implicitly continuous – distribution $F_x\big(x_i^T\beta\big)$. Like in the Poisson regression, there is no closed form solution, thus the estimator must be calculated via numerical methods; its variance-covariance is more conveniently estimated via the OPG.
Clearly, the exact solution depends on the assumptions made on $F_x(\cdot)$. Even if other possibilities exist, the most common choices are:

• the probit model, in which $F_x\big(x_i^T\beta_0\big) = \Phi\big(x_i^T\beta_0\big)$, where the function $\Phi(\cdot)$ is the cumulative standard normal distribution:

$$ \mathrm{P}(Y_i=1|x_i) = \Phi\big(x_i^T\beta_0\big) = \int_{-\infty}^{x_i^T\beta_0} \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{t^2}{2}\right)\mathrm{d}t $$

• the logit model, in which $F_x\big(x_i^T\beta_0\big) = \Lambda\big(x_i^T\beta_0\big)$, where the function $\Lambda(\cdot)$ is a scale-normalized cumulative logistic distribution:

$$ \mathrm{P}(Y_i=1|x_i) = \Lambda\big(x_i^T\beta_0\big) = \frac{\exp\big(x_i^T\beta_0\big)}{1 + \exp\big(x_i^T\beta_0\big)} $$

which is much easier to manipulate and handle computationally.


In practice, the two distributions are very similar (they are both bell-shaped,
although the logistic has fatter tails) and the two models usually produce
similar sets of results which are easily compared against one another.
After having estimated a probit or a logit model, one must be careful in interpreting the estimates of $\beta$! In fact, while the linear specification of the latent variable might induce some confusion, a coefficient $\beta_k$ is neither the causal effect of $X_{ik}$ on $Y_i$ nor the predicted change in the probability that $Y_i = 1$ following some unitary increase in variable $X_{ik}$. The best way to interpret the estimated parameters is by calculating the marginal effects. For all the explanatory variables in $x_i$, these are characterized as:

$$ \frac{\partial\,\mathrm{P}(Y_i=1|x_i)}{\partial x_i} = \frac{\partial\,\mathrm{E}[Y_i|x_i]}{\partial x_i} = \frac{\partial F_x\big(x_i^T\beta\big)}{\partial x_i} = f_x\big(x_i^T\beta\big)\beta $$

and they are a function of the data for any value of $\beta$. There are two ways to calculate marginal effects that are meaningful for interpretation's sake:

• to evaluate $f_x\big(x_i^T\beta\big)\beta$ at $\hat{\beta}_{MLE}$ and at $x = \bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$, the average characteristics in the sample;

• to evaluate $f_x\big(x_i^T\beta\big)\beta$ at $\hat{\beta}_{MLE}$ and at $x_i$ for every observation, and then average the resulting individual marginal effects over all observations.

It can be shown that these two approaches are asymptotically equivalent.
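Both the Newton-Raphson estimation of a logit model and the two versions of the marginal effects can be sketched as follows (simulated data; all values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
N = 2000
X = np.column_stack([np.ones(N), rng.normal(size=N)])
beta_true = np.array([-0.5, 1.0])
p = 1.0 / (1.0 + np.exp(-(X @ beta_true)))
y = (rng.uniform(size=N) < p).astype(float)

# Newton-Raphson on the logit log-likelihood (globally concave)
beta = np.zeros(2)
for _ in range(100):
    Lam = 1.0 / (1.0 + np.exp(-(X @ beta)))       # Lambda(x_i' beta)
    score = X.T @ (y - Lam)                       # logit scores take this form
    w = Lam * (1.0 - Lam)                         # logistic density at x_i' beta
    hess = -(X * w[:, None]).T @ X
    step = np.linalg.solve(hess, score)
    beta = beta - step
    if np.max(np.abs(step)) < 1e-10:
        break

# Marginal effects of the non-constant regressor, computed both ways
xbar = X.mean(axis=0)
L_bar = 1.0 / (1.0 + np.exp(-(xbar @ beta)))
me_at_mean = L_bar * (1.0 - L_bar) * beta[1]       # at average characteristics
L_i = 1.0 / (1.0 + np.exp(-(X @ beta)))
me_average = np.mean(L_i * (1.0 - L_i) * beta[1])  # averaged over observations
print(beta, me_at_mean, me_average)
```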

11.6 Simulated Maximum Estimation


There are instances in which the numerical evaluation of the M-Estimation
criterion function (11.2) is so complicated as to make practical applications
of the estimators discussed so far unfeasible. In such cases, theoretical and
applied econometricians alike advocate the use of estimators that make use
of simulation methods to approximate the evaluations in question. The
leading techniques of this kind lie in the domain
of MLE, and are adopted extensively in a subset of LDV models – those with
so-called random coefficients – that are especially popular in some fields of
economics such as Industrial Organization. For the sake of exposition, the
following discussion starts from such particular cases of MLE and is later
generalized to all M-Estimators.


Suppose that the probability mass or density function of all observable
variables xi can be written, given the model parameters θ, as follows:
$$
f_x\left(\mathbf{x}_i \mid \boldsymbol{\theta}\right) = \int_{\mathcal{U}} f_{x|u}\left(\mathbf{x}_i \mid \mathbf{u}_i; \boldsymbol{\theta}\right) \mathrm{d}H_u\left(\mathbf{u}_i\right) \tag{11.37}
$$

where ui is a random vector with cumulative distribution Hu (ui ) that is


integrated out over its support U. Now, suppose that there is no closed form
solution for the integral expressing (11.37), even if the conditional mass or
density function f x|u (xi | ui ; θ) is by itself tractable. It is obvious that the
MLE problem based on (11.37) cannot be easily solved.
Example 11.13. Random coefficients logit. Consider the following bivariate logit model:
$$
\mathrm{P}\left[Y_i = 1 \mid X_i\right] = \Lambda\left(\beta_0 + \beta_{1i} X_i\right) \tag{11.38}
$$
which looks like a simple version – with one binary dependent variable
Yi ∈ {0, 1} and one possibly continuous independent variable Xi – of one of
the LDV models discussed previously. Notice, however, that the parameter
β1i is observation-specific: clearly, if only N observations are available, an
econometrician cannot identify (let alone estimate) all the N + 1 parameters
implicitly expressed in (11.38). Yet, there are many real-world applications
where it is sensible to allow for variation in the individual response of Yi
to Xi , that is, to allow for individual heterogeneity in the regression slope.
Such models are called random coefficients models; in particular, (11.38)
is a (bivariate) random coefficients logit.
Random coefficients models are typically handled by assuming that the
individual parameters themselves follow a probability distribution whose
parameters can be estimated – so that it is possible to evaluate the extent
of individual heterogeneity. In the context of the current example, a typical
assumption is that of normality:
$$
\beta_{1i} \sim \mathcal{N}\left(\beta_1, \sigma^2\right) \tag{11.39}
$$
where θ = (β0 , β1 , σ²) is the parameter set of interest. Notice that (11.38)
and (11.39) together imply that, defining ui ≡ (β1i − β1)/σ and with φ(·)
being the probability density function of the standard normal distribution,
the conditional probability mass function of Yi is:
$$
f_{Y_i|X_i}\left(y_i \mid x_i; \beta_0, \beta_1, \sigma^2\right) = \int_{\mathbb{R}} \Lambda\left[\beta_0 + (\beta_1 + \sigma u_i) x_i\right]^{y_i} \left\{1 - \Lambda\left[\beta_0 + (\beta_1 + \sigma u_i) x_i\right]\right\}^{1 - y_i} \phi(u_i) \, \mathrm{d}u_i \tag{11.40}
$$
and the resulting integral has no closed form solution. □
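To see concretely how the integral in (11.40) can be evaluated numerically, the following sketch approximates it by Gauss–Hermite quadrature (the function name is illustrative and not part of the text's formal development):

```python
import numpy as np

def logit(z):
    return 1.0 / (1.0 + np.exp(-z))

def rc_logit_prob(y, x, beta0, beta1, sigma, n_nodes=40):
    """P(Y_i = y | X_i = x) in the random coefficients logit of (11.40),
    integrating u ~ N(0, 1) out via Gauss-Hermite quadrature."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    u = np.sqrt(2.0) * nodes              # change of variables for N(0, 1)
    p = logit(beta0 + (beta1 + sigma * u) * x)
    vals = p**y * (1.0 - p)**(1 - y)
    return float(weights @ vals / np.sqrt(np.pi))
```

When σ = 0 the mixture degenerates and the expression collapses to the ordinary logit probability, which provides a useful sanity check.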


Random coefficients LDV models exist in more complicated forms (like
multiple regressors, multinomial outcomes, distributions of the latent error
other than the logistic) than the one exposed in the previous example. All
such models pose the econometric problem of how to evaluate the likelihood
function resulting from integrals with no closed form solutions.³ One brute-force
approach can be that of using numerical methods to evaluate the integrals
at each parameter value visited by the optimization routine.
However, this can be computationally overwhelming in practical settings.
The alternative, which is more common, is that of simulating the values of
(11.37) that are used to compute the likelihood function. Specifically, the
typical method, called Direct Monte Carlo Sampling, consists of taking
a sample $\{u_s\}_{s=1}^S$ of S random draws of ui from Hu(ui),⁴ and constructing
for each observation a simulator as the Monte Carlo estimate
$$
\hat{f}_{x,S}\left(\mathbf{x}_i \mid \boldsymbol{\theta}\right) = \frac{1}{S} \sum_{s=1}^{S} \tilde{f}_{x|u}\left(\mathbf{x}_i \mid \mathbf{u}_s; \boldsymbol{\theta}\right) \tag{11.41}
$$

where the function $\tilde{f}_{x|u}\left(\mathbf{x}_i \mid \mathbf{u}_s; \boldsymbol{\theta}\right)$ is called a subsimulator. If the latter is an
unbiased predictor of the true density of interest, that is:
$$
\mathrm{E}\left[\tilde{f}_{x|u}\left(\mathbf{x}_i \mid \mathbf{u}_s; \boldsymbol{\theta}\right)\right] = f_x\left(\mathbf{x}_i \mid \boldsymbol{\theta}\right) \tag{11.42}
$$

where the expectation is taken over the support of ui , then by some suitable
Law of Large Numbers:
$$
\hat{f}_{x,S}\left(\mathbf{x}_i \mid \boldsymbol{\theta}\right) \xrightarrow{\;p\;} f_x\left(\mathbf{x}_i \mid \boldsymbol{\theta}\right) \tag{11.43}
$$
that is, the simulator is consistent insofar as S → ∞. This approach allows
one to derive the Maximum Simulated Likelihood (MSL) estimator as:
$$
\hat{\boldsymbol{\theta}}_{MSL} = \arg\max_{\boldsymbol{\theta} \in \Theta} \frac{1}{N} \sum_{i=1}^{N} \log \hat{f}_{x,S}\left(\mathbf{x}_i \mid \boldsymbol{\theta}\right) \tag{11.44}
$$

where the summation on the right-hand side is based on simulators as in


expression (11.41).
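A minimal sketch of the direct Monte Carlo simulator (11.41) and of the resulting MSL objective (11.44), written for the random coefficients logit of Example 11.13 (all names are illustrative):

```python
import numpy as np

def logit(z):
    return 1.0 / (1.0 + np.exp(-z))

def simulated_loglik(theta, y, x, u_draws):
    """MSL objective (11.44) for the random coefficients logit: for each
    observation, average the subsimulator over the same S draws u_s ~ N(0,1)."""
    beta0, beta1, sigma = theta
    # S x N matrix of subsimulator values f~(x_i | u_s; theta)
    p = logit(beta0 + (beta1 + sigma * u_draws[:, None]) * x[None, :])
    sub = np.where(y[None, :] == 1, p, 1.0 - p)
    f_hat = sub.mean(axis=0)                 # simulator (11.41)
    return np.mean(np.log(f_hat))            # (1/N) sum_i log f_hat
```

Passing a fixed `u_draws` array implements the recommendation above of using the same draws for every evaluation of the objective.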
³ Random coefficients linear models also exist, naturally. They are however typically
easier to handle, as slope deviations like ui = (β1i − β1)/σ in Example 11.13 are
subsumed into the error term, and multiple approaches to handle this case exist. In the MLE
case, the inability to evaluate the likelihood function is a more fundamental problem.
⁴ More precisely, computational techniques of this sort are based on “pseudo-random”
draws. In the univariate case, sequences are typically drawn from the standard uniform
distribution and projected back onto the support of interest through the quantile function
of Hu(ui), where |u| = 1. The logic is easily extended to the multivariate case.


Some practical considerations apply to the given MSL estimator. First,
it is clearly easier to compute if the simulator is differentiable with respect
to the parameter vector θ. It is easy to check that this is verified in, say, the
MSL estimator arising from the problem of Example 11.13 where, given S
draws $\{u_s\}_{s=1}^S$ from the standard normal, the subsimulator reads as follows.
$$
\tilde{f}_{Y_i, X_i | U_s}\left(y_i, x_i \mid u_s; \beta_0, \beta_1, \sigma^2\right) = \Lambda\left[\beta_0 + (\beta_1 + \sigma u_s) x_i\right]^{y_i} \left\{1 - \Lambda\left[\beta_0 + (\beta_1 + \sigma u_s) x_i\right]\right\}^{1 - y_i}
$$


Second, it is more convenient to calculate every element of the summation
on the right-hand side of (11.44) using the same draws $\{u_s\}_{s=1}^S$: if the
simulator is consistent, this allows one to confine the statistical uncertainty of the
simulator to noise in the observable, rather than simulated, variables. More
generally, the asymptotic properties of the estimator clearly depend on the
number of simulation draws S. In this regard, the following is a key result,
originally provided by Gouriéroux and Monfort (1991), and adapted here
for better consistency with the exposition in this and previous Lectures.
Theorem 11.4. Asymptotic Efficiency of Maximum Simulated Likelihood.
Suppose that the mass or density function fx(xi | θ) describing
the model's data generation process meets the requirements of Theorem 6.18,
and thus the corresponding “theoretical” MLE has a limiting distribution as
per the statement of that Theorem. An MSL estimator based on an unbiased
subsimulator as in (11.42) is asymptotically equivalent to the “theoretical”
MLE and has the same limiting distribution:
$$
\sqrt{N}\left(\hat{\boldsymbol{\theta}}_{MSL} - \boldsymbol{\theta}_0\right) \xrightarrow{\;d\;} \mathcal{N}\left(\mathbf{0}, \left[\mathcal{I}(\boldsymbol{\theta}_0)\right]^{-1}\right)
$$
if S, N → ∞ (a condition that is sufficient for consistency) and $\sqrt{N}/S \to 0$.
Proof. (Outline.) Gouriéroux and Monfort (1991) work out the standard
Taylor expansion of the First Order Conditions of (11.44) and detail how
it depends on two sources of noise: the one coming from the data $\{x_i\}_{i=1}^N$
and the one due to the simulation draws $\{u_s\}_{s=1}^S$. They show that the latter
vanishes asymptotically if S grows at a rate higher than that of $\sqrt{N}$.
While this is an important result, it is not a panacea. In fact, for large
datasets it implies that S be set at very large values, which at times could
be unfeasible due to the high computational costs that this implies. If S is
effectively finite, then the MSL is inconsistent: even if the simulator
is unbiased, its logarithm is not, that is
$$
\mathrm{E}\left[\log \hat{f}_{x,S}\left(\mathbf{x}_i \mid \boldsymbol{\theta}\right)\right] \neq \log f_x\left(\mathbf{x}_i \mid \boldsymbol{\theta}\right)
$$


by Jensen's inequality applied to expectations. Following the work by Gouriéroux
and Monfort (1991), it has been suggested to leverage a second-order Taylor
expansion of $\log \hat{f}_{x,S}(\mathbf{x}_i \mid \boldsymbol{\theta})$ around $\log f_x(\mathbf{x}_i \mid \boldsymbol{\theta})$, which writes as:
$$
\log f_x\left(\mathbf{x}_i \mid \boldsymbol{\theta}\right) \simeq \mathrm{E}\left[\log \hat{f}_{x,S}\left(\mathbf{x}_i \mid \boldsymbol{\theta}\right)\right] + \frac{\mathrm{Var}\left[\hat{f}_{x,S}\left(\mathbf{x}_i \mid \boldsymbol{\theta}\right)\right]}{2\left[f_x\left(\mathbf{x}_i \mid \boldsymbol{\theta}\right)\right]^2}
$$
(where both moments are taken over the support of ui) in order to provide
an approximate correction for the asymptotic bias which is due to a finite
S. This lets one define the first-order asymptotic bias-corrected MSL as:
$$
\hat{\boldsymbol{\theta}}_{BCMSL} = \arg\max_{\boldsymbol{\theta} \in \Theta} \sum_{i=1}^{N} \left\{ \log \hat{f}_{x,S}\left(\mathbf{x}_i \mid \boldsymbol{\theta}\right) + \sum_{s=1}^{S} \frac{\left[\tilde{f}_{x|u}\left(\mathbf{x}_i \mid \mathbf{u}_s; \boldsymbol{\theta}\right) - \hat{f}_{x,S}\left(\mathbf{x}_i \mid \boldsymbol{\theta}\right)\right]^2}{2S^2\left[\hat{f}_{x,S}\left(\mathbf{x}_i \mid \boldsymbol{\theta}\right)\right]^2} \right\} \tag{11.45}
$$
given that the inner summation inside the braces on the right-hand side
is easily motivated as a consistent estimator of the second-order term of
the above Taylor expansion for each observation i = 1, . . . , N . Researchers
should consider this extended estimator if they are concerned about the size
of S relative to N in a practical environment.
The theory of simulated M-Estimators extends beyond Maximum Likelihood:
if a generic M-Estimator is defined in terms of an observation-specific
criterion q(xi ; θ) that is based upon integrals without closed form solution,
a simulation approach is rendered necessary. A Simulated M-Estimator
(SM), of which MSL is a special case, is defined as:
$$
\hat{\boldsymbol{\theta}}_{SM} = \arg\max_{\boldsymbol{\theta} \in \Theta} \frac{1}{N} \sum_{i=1}^{N} \hat{q}_S\left(\mathbf{x}_i; \boldsymbol{\theta}\right) \tag{11.46}
$$
where $\hat{q}_S(\mathbf{x}_i; \boldsymbol{\theta}) = \frac{1}{S} \sum_{s=1}^{S} \tilde{q}_s(\mathbf{x}_i, \mathbf{u}_s; \boldsymbol{\theta})$ is typically an average of subsimulators,
written as $\tilde{q}_s(\mathbf{x}_i, \mathbf{u}_s; \boldsymbol{\theta})$ and based upon pseudo-random
draws of ui , in analogy with the MSL case. All previous considerations about
MSL extend to this more general case too: the SM estimator is consistent if
S, N → ∞; furthermore, it is as efficient as the corresponding non-simulated
M-Estimator if $\sqrt{N}/S \to 0$; if S is too small, an approximate bias correction
may be necessary. Clearly, all this applies so long as the conditions
underpinning consistency and asymptotic normality of M-Estimators hold
for the simulated estimator as they would in the standard case.


To conclude this brief overview of simulation-based M-Estimators, it is
useful to make some remarks on how the components Υ0 and Q0 of the
asymptotic variance-covariance matrix are estimated. Clearly, standard formulae
such as (11.17) and (11.19) are unfeasible, because the elements of the
summations involved cannot be evaluated. The solution is, unsurprisingly, that
of simulating them. Specifically, a consistent estimator for Υ0 is:
$$
\hat{\boldsymbol{\Upsilon}}_{M,S} = \frac{1}{N} \sum_{i=1}^{N} \hat{\mathbf{s}}_{Si}\left(\mathbf{x}_i; \hat{\boldsymbol{\theta}}_M\right) \hat{\mathbf{s}}_{Si}^{\mathsf{T}}\left(\mathbf{x}_i; \hat{\boldsymbol{\theta}}_M\right) \xrightarrow{\;p\;} \boldsymbol{\Upsilon}_0 \tag{11.47}
$$
where $\hat{\mathbf{s}}_{Si}(\mathbf{x}_i; \boldsymbol{\theta}) = \partial \hat{q}_S(\mathbf{x}_i; \boldsymbol{\theta}) / \partial \boldsymbol{\theta}$, while a consistent estimator for Q0 is:
$$
\hat{\mathbf{Q}}_N \equiv \frac{1}{N} \sum_{i=1}^{N} \hat{\mathbf{H}}_{Si}\left(\mathbf{x}_i; \hat{\boldsymbol{\theta}}_M\right) \xrightarrow{\;p\;} \mathbf{Q}_0 \tag{11.48}
$$
where $\hat{\mathbf{H}}_{Si}(\mathbf{x}_i; \boldsymbol{\theta}) = \partial \hat{\mathbf{s}}_{Si}(\mathbf{x}_i; \boldsymbol{\theta}) / \partial \boldsymbol{\theta}^{\mathsf{T}} = \partial^2 \hat{q}_S(\mathbf{x}_i; \boldsymbol{\theta}) / \partial \boldsymbol{\theta} \partial \boldsymbol{\theta}^{\mathsf{T}}$. Expressions that extend (11.47) to
the CCE or HAC cases can be derived. Notice that in the MSL case, the
estimator of the information matrix (under i.i.d. observations) is typically
obtained through the outer product of the gradients as:
    
$$
\frac{1}{N} \sum_{i=1}^{N} \frac{\left( \displaystyle\sum_{s=1}^{S} \frac{\partial \tilde{f}_{x|u}\left(\mathbf{x}_i \mid \mathbf{u}_s; \hat{\boldsymbol{\theta}}_{MSL}\right)}{\partial \boldsymbol{\theta}} \right) \left( \displaystyle\sum_{s=1}^{S} \frac{\partial \tilde{f}_{x|u}\left(\mathbf{x}_i \mid \mathbf{u}_s; \hat{\boldsymbol{\theta}}_{MSL}\right)}{\partial \boldsymbol{\theta}^{\mathsf{T}}} \right)}{\left( \displaystyle\sum_{s=1}^{S} \tilde{f}_{x|u}\left(\mathbf{x}_i \mid \mathbf{u}_s; \hat{\boldsymbol{\theta}}_{MSL}\right) \right)^{2}} \xrightarrow{\;p\;} \mathcal{I}\left(\boldsymbol{\theta}_0\right)
$$

because of the particular mathematical properties of likelihood functions; a


similar expression can also be derived in the Hessian’s case.
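The outer-product-of-gradients computation can be sketched as follows; this is an illustrative routine (not from the text) that approximates the simulated scores by central finite differences, and `subsim` is a hypothetical user-supplied function returning the S subsimulator values for one observation:

```python
import numpy as np

def opg_information(subsim, theta, u_draws, x_data, eps=1e-6):
    """Outer-product-of-gradients estimate of the information matrix from
    simulated likelihoods: for each observation i the score is approximated
    by (sum_s grad f~_s) / (sum_s f~_s), gradients by central differences.
    `subsim(theta, u_draws, obs)` must return the S subsimulator values."""
    K = len(theta)
    info = np.zeros((K, K))
    for obs in x_data:
        f = subsim(np.array(theta, float), u_draws, obs)   # S values
        grad = np.empty(K)
        for k in range(K):
            tp, tm = np.array(theta, float), np.array(theta, float)
            tp[k] += eps
            tm[k] -= eps
            grad[k] = (subsim(tp, u_draws, obs).sum()
                       - subsim(tm, u_draws, obs).sum()) / (2.0 * eps)
        score = grad / f.sum()              # simulated score for observation i
        info += np.outer(score, score)
    return info / len(x_data)
```

Being a sum of outer products, the resulting matrix is symmetric and positive semi-definite by construction.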

11.7 Applications of Maximum Estimation


Maximum Estimators, and in particular Maximum Likelihood Estimators,
are applied in all fields of economics. Especially when they are adapted from
a specific structural economic model, these estimators are often unique, and
it is difficult to provide a taxonomy (although some models – like certain
LDV ones – can be applied in more diverse contexts). In order to illustrate
this diversity, this section briefly delineates a few different applications of
M-Estimation. This overview starts by describing the NLLS estimator for
CES production functions, and subsequently provides a succinct summary
of two different models based on MLE: Heckman’s sample selection model
and Bresnahan’s test for collusion in oligopolistic industries.


Estimation of the CES Production Function


Lecture 7 briefly describes how the Cobb-Douglas production function, written
as (7.42) in its simplest form (that is, with only two inputs: capital and
labor), can be easily transformed into a log-log model, endowed with an error
term, and estimated via OLS. This is quite a clean example of a structural
econometric model! It turns out, however, that the Cobb-Douglas production
function is a limiting case of the more general Constant Elasticity of
Substitution (CES) production function, a fundamental ingredient of many
economic models. In its simplest form this function writes as:
$$
Y_i = \left[\alpha_K K_i^{-\rho} + \alpha_L L_i^{-\rho}\right]^{-\frac{1}{\rho}} + \varepsilon_i \tag{11.49}
$$
where Yi is output, Ki and Li are capital and labor, αK > 0 and αL > 0
are the respective so-called saliency parameters that determine the relative
importance of each input, ρ > 0 is a parameter related to the elasticity of
substitution between inputs – which, as the model's name suggests, is
constant and writes σ = (1 + ρ)⁻¹ ∈ (0, 1) – while εi is an error term. It
can be shown that as ρ → 0, (11.49) becomes a Cobb-Douglas production
function like (7.42) where αK = βK and αL = βL .
This model must obviously be estimated by NLLS via numerical methods,
and even the simplest case (11.49) is known to entail complications. A
typical estimation algorithm involves splitting the problem as follows:
$$
\left(\hat{\rho}, \hat{\alpha}_K, \hat{\alpha}_L\right)_{NLLS} \in \arg\min_{\rho \in \mathbb{R}_{++}} \left[ \min_{(\alpha_K, \alpha_L) \in \mathbb{R}^2_{++}} \sum_{i=1}^{N} \left( y_i - \left[\alpha_K k_i^{-\rho} + \alpha_L l_i^{-\rho}\right]^{-\frac{1}{\rho}} \right)^2 \right]
$$
where (yi , ki , li ) denote observations of (Yi , Ki , Li ). In words, numerical
algorithms feature an inner minimizer over (αK , αL ) given a value of ρ, and an
outer minimizer over ρ; the parameter combination that minimizes the sum
of squared residuals is ultimately selected. Unfortunately, applied practice
has shown that the solution is quite unstable and very dependent on
the value of ρ; for this reason, practitioners often prefer to estimate a linear
approximation of the model via OLS (see Kmenta, 1967). The problem is
amplified further for more complicated versions of the model (multiple inputs,
nested CES structures), which explains why CES production functions are
seldom encountered in the empirical practice.
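A rough sketch of the nested "inner/outer" logic follows; names are hypothetical, and the inner step is deliberately simplified to a linear fit on the transformed equation $y^{-\rho} = \alpha_K k^{-\rho} + \alpha_L l^{-\rho}$ (exact only absent noise), which a full implementation would use merely to initialize the inner NLLS:

```python
import numpy as np

def ces_nlls_grid(y, k, l, rho_grid):
    """Sketch of nested estimation of the CES function (11.49): for each
    candidate rho, the inner step obtains (alpha_K, alpha_L) by least squares
    on transformed variables, and the outer step keeps the rho that minimizes
    the sum of squared residuals in levels. Assumes y, k, l > 0."""
    best = None
    for rho in rho_grid:
        Z = np.column_stack([k**(-rho), l**(-rho)])
        alphas, *_ = np.linalg.lstsq(Z, y**(-rho), rcond=None)
        if np.any(alphas <= 0):
            continue                      # keep only admissible saliencies
        fitted = (Z @ alphas)**(-1.0 / rho)
        ssr = np.sum((y - fitted)**2)
        if best is None or ssr < best[0]:
            best = (ssr, rho, alphas[0], alphas[1])
    return best   # (ssr, rho_hat, alpha_K_hat, alpha_L_hat)
```

On noiseless data generated from (11.49), the grid search recovers the true parameters exactly when the true ρ lies on the grid, which illustrates the mechanics (though not the instability that real, noisy data produce).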

The Heckit Model of Sample Selection


The determinants of individual labor supply decisions have always interested
economists. In particular, due to women's lower labor market participation


rates, female labor supply is of particular interest from a policy perspective.


Consider an equation that describes the intensity of a woman's participation
as a function of her individual characteristics xi :
$$
h_i = \mathbf{x}_i^{\mathsf{T}} \boldsymbol{\beta} + \varepsilon_i \tag{11.50}
$$
where hi represents worked hours over a given time frame or other measures
of labor supply (for example days or weeks in the case of seasonal jobs).
A problem is that hi is only observed for women who do actually work:
$$
z_i^* = \mathbf{w}_i^{\mathsf{T}} \boldsymbol{\gamma} + \upsilon_i \tag{11.51}
$$
$$
h_i \begin{cases} > 0 & \text{if } z_i^* > 0 \\ = 0 & \text{if } z_i^* \leq 0 \end{cases} \tag{11.52}
$$

where there is some latent variable zi* , depending on a possibly different set
of characteristics wi , which represents the individual cost-benefit evaluation
of whether to work or not. The difference with binary outcome models is
that in this case, if an individual does participate in the labor market, as
specified by the participation equation (11.51) and by the assignment rule
(11.52), the intensity of her work is observed as a continuous variable. In
fact, the ultimate objective of the researcher is to estimate a model such
as (11.50) for the determinants of the intensity variable hi , and not merely
a binary outcome model for participation per se. Notice that this model
could alternatively be specified for other intensity variables hi , such as the
market wage for women; in other variations of this model, interest may lie
in both the quantity (hours) and the price (wage) variables.
Unfortunately, OLS cannot estimate (11.50) consistently. Denoting by
Hi the random variable whence the observations of hi are drawn, it is:
$$
\begin{aligned}
\mathrm{E}\left[H_i \mid \mathbf{x}_i, h_i > 0\right] &= \mathrm{E}\left[H_i \mid \mathbf{x}_i, z_i^* > 0\right] \\
&= \mathbf{x}_i^{\mathsf{T}} \boldsymbol{\beta} + \mathrm{E}\left[\varepsilon_i \mid \mathbf{x}_i, z_i^* > 0\right] \\
&= \mathbf{x}_i^{\mathsf{T}} \boldsymbol{\beta} + \mathrm{E}\left[\varepsilon_i \mid \mathbf{x}_i, \upsilon_i > -\mathbf{w}_i^{\mathsf{T}} \boldsymbol{\gamma}\right]
\end{aligned}
$$
where $\lambda(\mathbf{w}_i) \equiv \mathrm{E}\left[\varepsilon_i \mid \upsilon_i > -\mathbf{w}_i^{\mathsf{T}} \boldsymbol{\gamma}\right] \neq 0$ as long as the two error terms are
correlated. The quantity λ(wi) is in all effects an omitted variable of the
equation, and reflects the fact that individuals who are more inclined to
work – or otherwise more favored by the circumstances to be able to work
– will likely participate in the labor market at a higher intensity. Since this
quantity is a function of wi , if some of the elements (variables) of vectors xi
and wi are the same, we are in the presence of an omitted variable bias type of
problem: one that takes the well-known name of sample selection bias.


For example, a woman with a wealthy husband who is happy to support her
will be both less inclined to work and to work many hours if she works at
all (the husband's income is an element of both xi and wi). If, in addition
to this, the natural inclinations of the woman in question are correlated across
both the participation (υi) and the intensity (εi) decisions – a natural state
of things – all the conditions for a sample selection bias are present.
Heckman (1977) devised a solution to this problem that earned him
the Nobel Prize in Economics. This solution, which is also known by the
name of heckit, in analogy with probit, logit and other models with LDV
components, is based on a parametric assumption about the two error terms
(εi , υi ), namely a bivariate normal distribution:
$$
\begin{pmatrix} \varepsilon_i \\ \upsilon_i \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma^2 & \rho \\ \rho & 1 \end{pmatrix} \right)
$$
where ρ is the covariance between the two errors, σ² is the variance
of the error of the intensity equation, while the corresponding variance
for the participation equation is normalized to 1 since it is not identified in
binary LDV models. Thus, all the parameters of the model could in principle
be estimated via MLE by specifying an appropriate likelihood function
that accounts for the common dependence of the two equations.
However, Heckman also proposed an alternative procedure that is much
easier to implement, while still requiring the bivariate normal assumption:

1. run a probit on the participation equation (11.52) and obtain $\hat{\boldsymbol{\gamma}}_{MLE}$;

2. for each observation, calculate the inverse Mills ratio:
$$
\hat{\lambda}_i = \frac{\phi\left(\mathbf{w}_i^{\mathsf{T}} \hat{\boldsymbol{\gamma}}_{MLE}\right)}{\Phi\left(\mathbf{w}_i^{\mathsf{T}} \hat{\boldsymbol{\gamma}}_{MLE}\right)}
$$
where φ(·) and Φ(·) are, respectively, the density and cumulative distribution
functions of the standard normal distribution;

3. run OLS on a modified intensity equation:
$$
h_i = \mathbf{x}_i^{\mathsf{T}} \boldsymbol{\beta} + \rho \hat{\lambda}_i + \varepsilon_i \tag{11.53}
$$
where the parameter ρ is the OLS coefficient on $\hat{\lambda}_i$.
Under the assumptions of the heckit model, this procedure produces consistent
estimates of β. A disadvantage of this approach is that the standard
errors of the OLS step are inconsistently estimated, because they do not
account for the joint distribution of (εi , υi ). However, “resampling” techniques
such as the bootstrap can address this.
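The two-step procedure can be sketched as follows, with the first-step probit fitted by Fisher scoring; all names are illustrative, and the snippet is a simplified sketch under the model's assumptions rather than a production implementation:

```python
import numpy as np
from math import erf

def norm_pdf(z):
    return np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)

# Standard normal cdf, vectorized through math.erf to avoid extra dependencies
norm_cdf = np.vectorize(lambda z: 0.5 * (1.0 + erf(z / np.sqrt(2.0))))

def probit_mle(W, z, iters=30):
    """Step 1: probit of the participation dummy z on W, by Fisher scoring."""
    gamma = np.zeros(W.shape[1])
    for _ in range(iters):
        xb = W @ gamma
        P = np.clip(norm_cdf(xb), 1e-9, 1.0 - 1e-9)
        phi = norm_pdf(xb)
        score = W.T @ (phi * (z - P) / (P * (1.0 - P)))
        info = W.T @ (W * (phi**2 / (P * (1.0 - P)))[:, None])
        gamma = gamma + np.linalg.solve(info, score)
    return gamma

def heckit_two_step(y, X, z, W):
    """Steps 2-3: inverse Mills ratio from the probit index, then OLS of the
    intensity y on X plus the Mills ratio over the selected (z = 1) sample."""
    gamma = probit_mle(W, z)
    xb = W @ gamma
    mills = norm_pdf(xb) / np.clip(norm_cdf(xb), 1e-12, None)
    sel = z == 1
    Xa = np.column_stack([X[sel], mills[sel]])
    coef, *_ = np.linalg.lstsq(Xa, y[sel], rcond=None)
    return coef  # (beta..., last entry: coefficient on the Mills ratio)
```

On data simulated with positively correlated errors, the slope of the intensity equation is recovered despite selection, and the coefficient on the Mills ratio is positive, mirroring the discussion above.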


Detecting Collusion in Oligopolies


For decades, Industrial Organization developed as a mainly theoretical
field that analyzed, on the basis of strategic microeconomic analysis (mostly
game theory), the implications of imperfectly competitive markets such as
monopolies and oligopolies. Until 20-30 years ago there was a lamentable
dearth of empirical studies in IO, as the available econometric methods were
too simplistic relative to the theoretical models developed in the field.
With a new generation of structural models introduced since the late
'80s, the field has been revolutionized to the point that it is now mostly an
empirical one, the “structural” field par excellence.
Before then, though, important empirical questions were left unsolved:
a famous example was the sudden 45% increase in US automobile production
and sales (together with a corresponding fall in prices) in 1955, with
a rebound the following year. As demand was not that strong in 1955, most
economists suspected the existence of a secret collusive agreement that for
some reason broke apart in 1955 and was eventually resumed in 1956. For a
long time, however, they did not possess any methodological tool to prove
this hypothesis. In fact, Paul Samuelson had allegedly once said that he:

“would flunk any econometrics paper that claimed to provide an
explanation of 1955 auto sales”

a sentence that scared any economist who would think about actually trying
to test the suspicion. Quoting Samuelson's words in the introduction of his
paper, Bresnahan (1987) developed a methodology for detecting collusion in
oligopolies that was quite innovative at the time, becoming a starting point
of the “empirical revolution” in IO. In fact, he was able to show statistically
that hypotheses other than a momentary price war were unlikely.
Bresnahan models the automobile industry as one with N types of cars,
each with quality Xi = X(zi , β) being a function of the car's characteristics
zi given parameters β. Qualities can be ordered from best to worst: without
loss of generality, Xi > Xh if i > h. He provides microfoundations for the
demand functions of each car, defined for each year t = 1, . . . , T as
$$
Q_{it}^{D} = D\left(P_{ht}, P_{it}, P_{jt}, X_{ht}, X_{it}, X_{jt}, \boldsymbol{\gamma}\right) \tag{11.54}
$$
where Qit is the quantity of product i, Pit its price, h, i, j are three consecutive
products in the quality ordering, and γ are some parameters. This
specification makes prices and quantities depend, in equilibrium, only on
those of the “neighbors” of a product in the product space, and follows
from a particular specification of consumers' utility.


As for supply, Bresnahan develops a standard framework of a profit-


maximizing firm, whose profits from the sale of product i are:

πit = Pit Qit − c (Xit ) Qit

with c (Xit ) = µ exp (Xit ); and he distinguishes the following two scenarios.

1. Competition: in this case each firm sets its own price Pit by taking the
prices of neighbors h and j as given, with First Order Conditions:
$$
\frac{\partial \pi_{it}}{\partial P_{it}} = Q_{it} + \left(P_{it} - c(X_{it})\right) \frac{\partial Q_{it}(\cdot)}{\partial P_{it}} = 0
$$
as Qit is a function of Pit as per (11.54).

2. Cooperation: in this case the firm(s) selling two products, say, i and j
would set prices Pit and Pjt so as to maximize the joint profits, with First
Order Conditions for the i-th price:
$$
\frac{\partial \left[\pi_{it} + \pi_{jt}\right]}{\partial P_{it}} = Q_{it} + \left(P_{it} - c(X_{it})\right) \frac{\partial Q_{it}(\cdot)}{\partial P_{it}} + \left(P_{jt} - c(X_{jt})\right) \frac{\partial Q_{jt}(\cdot)}{\partial P_{it}} = 0
$$
and symmetrically for the j-th price.

Bresnahan then defines several matrices Ht such that, in each year,
$$
h_{(ij)t} = \begin{cases} 1 & \text{cooperation between products } i \text{ and } j \\ 0 & \text{competition between products } i \text{ and } j \end{cases}
$$

and characterizes several hypothetical scenarios for 1955 and surrounding
years, to which correspond associated sets of matrices Ht . Thus, for a given
choice of matrix Ht the supply function can be written as
$$
Q_{it}^{S} = S\left(P_{ht}, P_{it}, P_{jt}, X_{ht}, X_{it}, X_{jt}, \mathbf{H}_t, \boldsymbol{\gamma}, \mu\right) \tag{11.55}
$$
where the demand function parameters γ enter via the derivatives of the
demand functions implied by the First Order Conditions.
By setting the equilibrium condition $Q_{it}^{D} = Q_{it}^{S} = Q_{it}^{*}$ (and similarly
for prices) for each product i = 1, . . . , N in every year t = 1, . . . , T , it is
possible to obtain the reduced form of this model:
$$
P_{it} = P^*\left(X_{ht}, X_{it}, X_{jt}, \mathbf{H}_t, \boldsymbol{\beta}, \boldsymbol{\gamma}, \mu\right) \tag{11.56}
$$
$$
Q_{it} = Q^*\left(X_{ht}, X_{it}, X_{jt}, \mathbf{H}_t, \boldsymbol{\beta}, \boldsymbol{\gamma}, \mu\right) \tag{11.57}
$$


which is more easily obtained by solving both the demand functions and
the supply-side First Order Conditions simultaneously. The last assumption
is that the actual prices and quantities differ from their theoretical, reduced-form
values by a pair of normally distributed error terms:
$$
\begin{pmatrix} \xi_{it}^{P} \\ \xi_{it}^{Q} \end{pmatrix} = \begin{pmatrix} P_{it} - P^* \\ Q_{it} - Q^* \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_P^2 & 0 \\ 0 & \sigma_Q^2 \end{pmatrix} \right)
$$
where the variances of the two error terms reflect heteroscedasticity. Hence,
the likelihood function can be written in terms of data realizations as:
$$
L\left(\boldsymbol{\beta}, \boldsymbol{\gamma}, \mu \,\middle|\, \mathbf{H}_t, \{p_{it}, q_{it}, \mathbf{z}_i\}_{i=1}^{N}\right) = \prod_{t=1}^{T} \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma_P^2}} \exp\left( -\frac{\left(\xi_{it}^{P}\right)^2}{2\sigma_P^2} \right) \times \prod_{t=1}^{T} \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma_Q^2}} \exp\left( -\frac{\left(\xi_{it}^{Q}\right)^2}{2\sigma_Q^2} \right)
$$

which can be estimated just by observing products’ characteristics, quanti-


ties and prices in every year for a given value of Ht .
Notice that in this model the structural parameters might not be separately
identified, but this is not the objective of the analysis, which is, in fact,
to evaluate the performance of the model for two alternative choices of the
“cooperation” matrix Ht . Suppose that the hypothesis to be tested is:
$$
H_0: \mathbf{H}_t^0 \text{ for competition} \qquad\qquad H_1: \mathbf{H}_t^1 \text{ for collusion}
$$
where “competition” and “collusion” change over a set of products selected
by the researcher. The test performed by Bresnahan to evaluate each scenario
is of the likelihood ratio type:
$$
CH = 2\left[ \log L\left(\tilde{\boldsymbol{\beta}}, \tilde{\boldsymbol{\gamma}}, \tilde{\mu} \,\middle|\, \mathbf{H}_t^1, \mathbf{z}_1, \ldots, \mathbf{z}_N\right) - \log L\left(\hat{\boldsymbol{\beta}}, \hat{\boldsymbol{\gamma}}, \hat{\mu} \,\middle|\, \mathbf{H}_t^0, \mathbf{z}_1, \ldots, \mathbf{z}_N\right) \right]
$$

the associated test statistic would reject the null if Ht1 fits the data signif-
icantly better than Ht0 . Thanks to this procedure, Bresnahan has statisti-
cally shown that some car producers have been colluding in the US market
in all years but 1955: Paul Samuelson must have not been happy.
It must be remarked that while the Bresnahan model is still a nice exam-
ple of a structural model in IO that simultaneously incorporates both the
demand and supply side within an elegant MLE framework, by today’s stan-
dards it certainly feels antiquated and “mechanical.” The current practice
in Industrial Organization favors the use of random coefficients multinomial
LDV models that incorporate the supply side while attempting to correct
for endogeneity of prices and product characteristics through instrumental
variables – all within a larger Generalized Method of Moments framework.

Lecture 12

Generalized Method of Moments

This lecture introduces the Generalized Method of Moments (GMM): an
encompassing framework for estimating semi-parametric econometric models.
To illustrate and motivate its applicability, GMM is shown to be a generalization
of standard IV estimators for linear models, with 2SLS and 3SLS
emerging as particular cases of this framework. Analogous considerations
are then extended to non-linear models. This lecture also overviews
some theoretical and practical issues, such as methods for estimating the
variance-covariance matrix and their implementation, tests for overidentification,
and simulation-based approaches. Lastly, this lecture provides some
examples of applications of GMM drawn from actual economic research.

12.1 Generalizing the Method of Moments


The Method of Moments estimator introduced in Lecture 5, and motivated
by the Analogy Principle, is suited to address many parameter estimation
problems in Statistics. It turns out that most econometric estimators
examined thus far can also be reformulated as Method of Moments estimators!
Consider, for example, the following zero moment conditions:
$$
\mathrm{E}\left[\mathbf{z}_i \left(Y_i - \mathbf{x}_i^{\mathsf{T}} \boldsymbol{\beta}_0\right)\right] = \mathbf{0} \tag{12.1}
$$
where zi and xi are two random vectors of equal dimension (say K)
and Yi is another random variable; these conditions follow naturally from
the exogeneity assumption in an IV setting, that is E[εi | zi] = 0. From the
above conditions one retrieves the parameter vector β0 as:
$$
\boldsymbol{\beta}_0 = \left(\mathrm{E}\left[\mathbf{z}_i \mathbf{x}_i^{\mathsf{T}}\right]\right)^{-1} \mathrm{E}\left[\mathbf{z}_i Y_i\right] \tag{12.2}
$$
whose sample analogue is precisely the IV estimator. By replacing zi with
xi one obtains the standard OLS estimator instead. All M-Estimators can


be similarly formulated as Method of Moments estimators where the motivating
zero moment conditions are the First Order Conditions of the population
criterion maximization (11.1). In fact, the asymptotic properties of
all these estimators are derived following an approach which mirrors that
of Theorems 6.8 and 6.17, but extending it to possibly non-i.i.d. data.¹
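As an illustration, the sample analogue of (12.2) is one line of linear algebra; the function name is hypothetical:

```python
import numpy as np

def iv_estimator(Z, X, y):
    """Sample analogue of (12.2): beta_hat = (Z'X / N)^{-1} (Z'y / N),
    the just-identified IV estimator; passing Z = X yields OLS."""
    N = len(y)
    return np.linalg.solve(Z.T @ X / N, Z.T @ y / N)
```

With `Z = X`, the routine reproduces the ordinary least squares solution exactly, as the text notes.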
Other econometric estimators, however, cannot be phrased as Method
of Moments estimators: this framework, in fact, allows for a number of zero
moment conditions equal to the dimensionality of the problem (the number
of parameters K). While, clearly, fewer moment conditions than parameters
make for an unsolvable problem (that is, an unidentified model), some
econometric estimators are overidentified: one can posit more moment
conditions than parameters. These are situations in which “redundant
information” is available to the econometrician for estimation purposes. This is, for
example, the case of the 2SLS estimator that emerges if, say, the random
vector zi in (12.1) has dimension J > K. It turns out that the Method of
Moments can be generalized to allow for overidentification. Moreover,
the “restrictions” in excess² can be tested to “evaluate” their contribution
to parameter identification.
To elaborate, suppose that a researcher postulates the validity of J ≥ K
zero moment conditions described by a vector-valued function g(·):
$$
\mathrm{E}\left[\mathbf{g}\left(\mathbf{x}_i; \boldsymbol{\theta}_0\right)\right] = \mathbf{0} \tag{12.3}
$$
where xi = (yi , zi ) is the collection of all the variables of the model (both
exogenous and endogenous) while θ0 is the K-dimensional vector collecting
the true values of the parameters. The sample analogue of (12.3), motivated
by the Analogy Principle, is the following J-dimensional vector.
$$
\mathbf{g}_N(\boldsymbol{\theta}) \equiv \frac{1}{N} \sum_{i=1}^{N} \mathbf{g}\left(\mathbf{x}_i; \boldsymbol{\theta}\right) \tag{12.4}
$$

The Generalized Method of Moments (GMM) estimator is defined as
the minimizer of a quadratic form $\hat{G}_N(\boldsymbol{\theta})$ based on these empirical moments:
$$
\hat{\boldsymbol{\theta}}_{GMM} = \arg\min_{\boldsymbol{\theta} \in \Theta} \hat{G}_N(\boldsymbol{\theta}) = \arg\min_{\boldsymbol{\theta} \in \Theta} \mathbf{g}_N^{\mathsf{T}}(\boldsymbol{\theta}) \, \mathbf{A}_N \, \mathbf{g}_N(\boldsymbol{\theta}) \tag{12.5}
$$
for any full rank positive semi-definite J-dimensional square matrix AN .
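Because the moment conditions (12.1) are linear in β, the quadratic form (12.5) has a closed-form minimizer in that case; the following sketch (with hypothetical names) also covers the overidentified case J > K:

```python
import numpy as np

def linear_gmm(Z, X, y, A=None):
    """GMM estimator (12.5) for the linear moments E[z_i (y_i - x_i'b)] = 0.
    Since g_N(b) = Z'y/N - (Z'X/N) b is linear in b, the quadratic form is
    minimized by b = (B'AB)^{-1} B'A a, with B = Z'X/N and a = Z'y/N."""
    N = len(y)
    B = Z.T @ X / N
    a = Z.T @ y / N
    if A is None:
        A = np.eye(Z.shape[1])          # identity weighting as a default
    BtA = B.T @ A
    return np.linalg.solve(BtA @ B, BtA @ a)
```

In the just-identified case (J = K) the weighting matrix drops out and the formula collapses to the IV estimator, illustrating that GMM nests it.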


¹ This was already observed in Lecture 8, footnote 2, with reference to OLS. Recall
that Theorems 6.8 and 6.17 are derived under the assumption of a random sample.
² In econometrics, the expression “restriction” is sometimes used to indicate a single
moment condition like (12.1). The reason for it is that in a SEM setting, such a condition
is typically associated with an “exclusion restriction” of Zi on Yi . In principle, however,
a “restriction” and a “moment condition” are two conceptually distinct notions.


Intuitively, the GMM estimator picks the value $\hat{\boldsymbol{\theta}}_{GMM}$ that minimizes
the distance of all the empirical moments from their expected “true” value
(zero). Such a distance is measured as a quadratic form that employs matrix
AN in order to “weigh” the relative importance of different moments, as is
clarified later. Some important observations are in order.
1. The First Order Conditions of the problem are as follows:
$$
\frac{\partial}{\partial \boldsymbol{\theta}} \mathbf{g}_N^{\mathsf{T}}\left(\hat{\boldsymbol{\theta}}_{GMM}\right) \cdot \mathbf{A}_N \cdot \mathbf{g}_N\left(\hat{\boldsymbol{\theta}}_{GMM}\right) = \mathbf{0} \tag{12.6}
$$
note that the term that pre-multiplies AN , the transposed Jacobian of
the empirical moment conditions evaluated at the solution, is a K × J
matrix. Given that an analytic solution is generally not available, the
GMM estimator is typically obtained numerically.
2. The GMM estimator resembles – indeed, is – an M-Estimator. Define:
$$
G_0(\boldsymbol{\theta}) \equiv \lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} \mathrm{E}\left[\mathbf{g}\left(\mathbf{x}_i; \boldsymbol{\theta}\right)\right]^{\mathsf{T}} \mathbf{A}_0 \, \mathrm{E}\left[\mathbf{g}\left(\mathbf{x}_i; \boldsymbol{\theta}\right)\right] \geq 0 \tag{12.7}
$$
for some full rank positive semi-definite J-dimensional matrix A0 such
that $\mathbf{A}_N \xrightarrow{p} \mathbf{A}_0$. One can easily see that the GMM objective function
$\hat{G}_N(\boldsymbol{\theta})$ converges in probability to G0(θ) for all θ ∈ Θ:
$$
\hat{G}_N(\boldsymbol{\theta}) \xrightarrow{\;p\;} G_0(\boldsymbol{\theta})
$$
as in M-Estimators, with $\hat{G}_N(\boldsymbol{\theta}) = -\hat{Q}_N(\boldsymbol{\theta})$ and $G_0(\boldsymbol{\theta}) = -Q_0(\boldsymbol{\theta})$.

3. The model is identified if the population criterion G0(θ) has a unique
(local) minimum, which must be equal to the true parameter θ0 , since
G0(θ) ≥ 0 for all θ ∈ Θ and G0(θ0) = 0 by (12.3) and (12.7) – that
is, by construction. The minimization of (12.7) implies the following
First Order Conditions:
$$
\lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} \mathrm{E}\left[\frac{\partial}{\partial \boldsymbol{\theta}} \mathbf{g}^{\mathsf{T}}\left(\mathbf{x}_i; \boldsymbol{\theta}_0\right)\right] \cdot \mathbf{A}_0 \cdot \mathrm{E}\left[\mathbf{g}\left(\mathbf{x}_i; \boldsymbol{\theta}_0\right)\right] = \mathbf{0} \tag{12.8}
$$
and one can see that a unique solution is obtained if the J × K matrix
G0 , which is defined as:
$$
\mathbf{G}_0 \equiv \lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} \mathrm{E}\left[\frac{\partial}{\partial \boldsymbol{\theta}^{\mathsf{T}}} \mathbf{g}\left(\mathbf{x}_i; \boldsymbol{\theta}_0\right)\right] \tag{12.9}
$$
has full column rank K, as otherwise many combinations of parameters
are equally capable of minimizing G0(θ). Note that under identically
distributed observations, it is $\mathbf{G}_0 = \mathrm{E}\left[\partial \mathbf{g}\left(\mathbf{x}_i; \boldsymbol{\theta}_0\right) / \partial \boldsymbol{\theta}^{\mathsf{T}}\right]$.


Under some fairly general conditions, GMM estimators are consistent
and asymptotically normal. While these results could be adapted from the
analysis of M-Estimators, it is worth showing them – especially asymptotic
normality – via an alternative route. This allows one to better highlight certain
peculiar aspects of GMM while simultaneously circumventing issues related to
the calculation of Hessian matrices.
Theorem 12.1. Asymptotic Properties of GMM. If a GMM estimator based on some zero moment conditions like (12.4) is identified and meets the uniform convergence requirements of M-Estimators from Theorem 11.2, it is consistent:
$$\hat\theta_{GMM}=\arg\min_{\theta\in\Theta}\hat{G}_N(\theta)\xrightarrow{p}\arg\min_{\theta\in\Theta}G_0(\theta)=\theta_0\tag{12.10}$$

Furthermore, if conditions analogous to those from Theorem 11.3 are met, the GMM estimator is also asymptotically normal and its limiting variance presents the following sandwiched expression:
$$\sqrt{N}\big(\hat\theta_{GMM}-\theta_0\big)\xrightarrow{d}\mathcal{N}\left(0,\left(G_0^TA_0G_0\right)^{-1}G_0^TA_0\Omega_0A_0G_0\left(G_0^TA_0G_0\right)^{-1}\right)\tag{12.11}$$
where $\Omega_0$ is the following $J\times J$ limiting matrix.
$$\Omega_0\equiv\lim_{N\to\infty}\mathrm{Var}\left[\frac{1}{\sqrt{N}}\sum_{i=1}^Ng(x_i;\theta_0)\right]\tag{12.12}$$
Note. Before proceeding with a “sketched” proof, it is useful to observe that, as in similar cases, when the observations are independent it is:
$$\Omega_0=\lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^N\mathrm{E}\left[g(x_i;\theta_0)g^T(x_i;\theta_0)\right]\tag{12.13}$$
simplifying further to $\Omega_0=\mathrm{E}\left[g(x_i;\theta_0)g^T(x_i;\theta_0)\right]$ when the observations are also identically distributed.


Proof. (Sketched.) A heuristic argument is useful here to show consistency. By some Weak Law of Large Numbers it must be that:
$$\frac{\partial}{\partial\theta}\hat{G}_N(\theta_0)=\frac{\partial}{\partial\theta}g_N^T(\theta_0)\cdot A_N\cdot g_N(\theta_0)\xrightarrow{p}G_0^TA_0\cdot\mathrm{E}\left[g(x_i;\theta_0)\right]=0$$
since $\mathrm{E}\left[g(x_i;\theta_0)\right]=0$ by assumption; by the First Order Conditions (12.6) this implies that:
$$\hat{G}_N\big(\hat\theta_{GMM}\big)\xrightarrow{p}\hat{G}_N(\theta_0)\tag{12.14}$$
which entails $\hat\theta_{GMM}\xrightarrow{p}\theta_0$ if the model is identified ($G_0$ is of full rank).


To show asymptotic normality, apply the Mean Value Theorem directly to the empirical moments (12.4); with very little manipulation:
$$\sqrt{N}g_N\big(\hat\theta_{GMM}\big)=\sqrt{N}g_N(\theta_0)+G_N\big(\tilde\theta_N\big)\sqrt{N}\big(\hat\theta_{GMM}-\theta_0\big)\tag{12.15}$$
where as usual $\tilde\theta_N$ is a convex combination of $\hat\theta_{GMM}$ and $\theta_0$, while $G_N(\theta)$ is defined as:
$$G_N(\theta)\equiv\frac{1}{N}\sum_{i=1}^N\frac{\partial}{\partial\theta^T}g(x_i;\theta)$$
in analogy with $G_0$. Plugging (12.15) into the First Order Conditions (12.6) delivers the expression:
$$G_N^T\big(\hat\theta_{GMM}\big)A_N\left[\sqrt{N}g_N(\theta_0)+G_N\big(\tilde\theta_N\big)\sqrt{N}\big(\hat\theta_{GMM}-\theta_0\big)\right]=0$$
which can be manipulated so as to return the following equation.
$$\sqrt{N}\big(\hat\theta_{GMM}-\theta_0\big)=-\left[G_N^T\big(\hat\theta_{GMM}\big)A_NG_N\big(\tilde\theta_N\big)\right]^{-1}G_N^T\big(\hat\theta_{GMM}\big)A_N\sqrt{N}g_N(\theta_0)$$
Since $G_N\big(\hat\theta_{GMM}\big)\xrightarrow{p}G_0$ and $G_N\big(\tilde\theta_N\big)\xrightarrow{p}G_0$ by consistency of GMM, if
$$\sqrt{N}g_N(\theta_0)\xrightarrow{d}\mathcal{N}(0,\Omega_0)$$
that is, if some Central Limit Theorem can be applied to the data at hand, these results can be combined via Slutsky's Theorem to deliver (12.11).
As usual, the matrices in the limiting variance (12.11) are unknown and must be estimated. The asymptotic variance-covariance is calculated as:
$$\widehat{\mathrm{Avar}}\big[\hat\theta_{GMM}\big]=\frac{1}{N}\left(\hat{G}_N^TA_N\hat{G}_N\right)^{-1}\hat{G}_N^TA_N\hat\Omega_NA_N\hat{G}_N\left(\hat{G}_N^TA_N\hat{G}_N\right)^{-1}\tag{12.16}$$
where $\hat{G}_N$ is a consistent estimator of $G_0$:
$$\hat{G}_N=\frac{1}{N}\sum_{i=1}^N\frac{\partial}{\partial\theta^T}g\big(x_i;\hat\theta_{GMM}\big)\xrightarrow{p}G_0\tag{12.17}$$
while the estimator of $\Omega_0$, denoted as $\hat\Omega_N$, once again depends on the specific assumptions. For example, if the observations are independent:
$$\hat\Omega_N=\frac{1}{N}\sum_{i=1}^Ng\big(x_i;\hat\theta_{GMM}\big)g^T\big(x_i;\hat\theta_{GMM}\big)\xrightarrow{p}\Omega_0\tag{12.18}$$


whereas the following applies under clustering (HAC extensions also exist).
$$\hat\Omega_{CCE}=\frac{1}{N}\sum_{c=1}^C\sum_{i=1}^{N_c}\sum_{j=1}^{N_c}g_{ic}\big(x_{ic};\hat\theta_{GMM}\big)g_{jc}^T\big(x_{jc};\hat\theta_{GMM}\big)\xrightarrow{p}\Omega_0\tag{12.19}$$

It is apparent that the asymptotic variance of the GMM estimator $\hat\theta_{GMM}$ depends on the choice of the weighting matrix $A_N$, as it converges to $A_0$. A celebrated, yet a posteriori intuitive result in econometrics (Hansen, 1982) shows that the most efficient GMM estimator is the one for which the optimal “weighting matrix” of the moment conditions is:
$$A_N=\hat\Omega_N^{-1}\tag{12.20}$$
where $\hat\Omega_N^{-1}$ is a matrix that converges in probability to the inverse of $\Omega_0$.
$$\hat\Omega_N^{-1}\xrightarrow{p}\Omega_0^{-1}=A_0\tag{12.21}$$
Under this circumstance, the limiting distribution of the GMM estimator reads in a conveniently simpler fashion.
$$\sqrt{N}\big(\hat\theta_{GMM}-\theta_0\big)\xrightarrow{d}\mathcal{N}\left(0,\left(G_0^T\Omega_0^{-1}G_0\right)^{-1}\right)\tag{12.22}$$

One can demonstrate that the limiting variance in (12.22) is efficient by noting that the difference between the standard (12.11) and the optimal variance-covariance matrices of GMM is given by:
$$\left(G_0^TA_0G_0\right)^{-1}G_0^TA_0\Omega_0A_0G_0\left(G_0^TA_0G_0\right)^{-1}-\left(G_0^T\Omega_0^{-1}G_0\right)^{-1}=\\=\left(G_0^TA_0G_0\right)^{-1}G_0^TA_0\Omega_0^{\frac{1}{2}}\cdot M_{\tilde{G}_0}\cdot\Omega_0^{\frac{1}{2}}A_0G_0\left(G_0^TA_0G_0\right)^{-1}\tag{12.23}$$
where, for $\tilde{G}_0\equiv\Omega_0^{-\frac{1}{2}}G_0$, it is:
$$M_{\tilde{G}_0}\equiv I-\tilde{G}_0\left(\tilde{G}_0^T\tilde{G}_0\right)^{-1}\tilde{G}_0^T$$
which is a symmetric and idempotent matrix; hence the overall difference is a positive semi-definite matrix whatever $A_0$ is. This observation is analogous to the proof of the Gauss-Markov Theorem for OLS, and in fact the statistical intuition is best given through a comparison with Generalized Least Squares (GLS): the most efficient linear estimator under heteroscedasticity. In GLS, observations are reweighted by the inverse of the variance of the respective error terms, as per the Weighted Least Squares formulation (8.43). Analogously, in the GMM problem the moment conditions are weighted by the inverse of their respective statistical variance: the larger the statistical variance of a single moment condition $g_j(x_i;\theta_0)$ – for $j=1,\dots,J$ – the smaller its contribution towards the GMM objective function.


A fundamental result associated with GMM is the following theorem, which was originally proved by Chamberlain (1987).
Theorem 12.2. Semi-Parametric Efficiency Bound of GMM. If the moment conditions (12.3) hold, the GMM estimator derived through the optimal weighting matrix $\Omega_0^{-1}$ hits the efficiency bound which applies to the class of all semi-parametric estimators of $\theta_0$.
Proof. (Outline.) The argument by Chamberlain proceeds as follows. Suppose that the data $\{x_i\}_{i=1}^N$ are drawn from a discrete support of dimension $D$ denoted as $\mathcal{X}_D=\{\chi_1,\chi_2,\dots,\chi_D\}$, where $\chi_d$ for $d=1,\dots,D$ is a given point in the support. If the moment conditions (12.3) hold, it follows that:
$$\mathrm{E}\left[g(x_i;\theta_0)\right]=\sum_{d=1}^Dg(\chi_d;\theta_0)\,p_d=0\tag{12.24}$$
where $p_d$ is the probability attached to the $d$-th element of $\mathcal{X}_D$. Thus, estimating $\theta_0$ by optimal GMM is equivalent to solving a parametric maximum likelihood problem based on (12.24), meaning that the Cramér-Rao bound from mathematical probability (recall the discussions in Lectures 5, 6, 11) applies to this context, and the bound is obviously the limiting variance in (12.22), because no efficiency gains can be obtained with any other weighting matrices. In addition, Chamberlain shows that the result does not depend upon the granularity of $\mathcal{X}_D$ in a fundamental way, and that it approximately holds even when the data have a continuous support.
Chamberlain's result is extremely powerful, since it provides a rationale for the practical use of GMM as the least-variance estimator under minimal semi-parametric working assumptions – especially as many econometric estimators can be rephrased as GMM estimators, including (as is discussed later) 2SLS and 3SLS. An outstanding issue remains, though: matrix $\Omega_0$ is not known ex ante, since it is a function of $\theta_0$. To circumvent this, Hansen (1982) proposed the two-step GMM estimation procedure. The steps entailed in this method are illustrated next under the assumption of independent observations, although they apply more generally.
1. Obtain a first step estimate $\hat\theta_1$ with some arbitrary weighting matrix: usually the identity matrix $I$, which implies that the GMM objective function reduces to $\hat{G}_1(\theta)=g_N^T(\theta)g_N(\theta)$. The resulting estimate $\hat\theta_1$ is consistent but inefficient; yet being consistent it allows one to compute the following consistent estimator of $\Omega_0$.
$$\hat\Omega_N=\frac{1}{N}\sum_{i=1}^Ng\big(x_i;\hat\theta_1\big)g^T\big(x_i;\hat\theta_1\big)\xrightarrow{p}\Omega_0\tag{12.25}$$


2. Obtain a second step, final GMM estimate $\hat\theta_{GMM}=\hat\theta_2$ by minimizing the objective function $\hat{G}_2(\theta)=g_N^T(\theta)\,\hat\Omega_N^{-1}\,g_N(\theta)$.

Finally, the estimation of the asymptotic variance is obtained as follows.
$$\widehat{\mathrm{Avar}}\big[\hat\theta_{GMM}\big]=\frac{1}{N}\left(\hat{G}_N^T\hat\Omega_N^{-1}\hat{G}_N\right)^{-1}\tag{12.26}$$
Possibly, $\hat\Omega_N$ can be re-evaluated at the final estimate $\hat\theta_{GMM}=\hat\theta_2$. While it remains perhaps the most popular procedure to estimate GMM models, Hansen's two-step is not the only one – especially as simulations have shown that it is biased in small samples. The main available alternatives are the following two algorithms.

• The iterated GMM estimation: this is practically an “infinite steps” GMM estimation procedure. The idea is not to stop Hansen's algorithm at its second step, but to re-compute $\hat\Omega_N$ instead by making use of the second step estimates $\hat\theta_2$. Then, one would obtain a “third step” vector of parameter estimates $\hat\theta_3$, re-compute $\hat\Omega_N$ once again, and so forth, until convergence is achieved. This computationally demanding approach has been shown to be asymptotically equivalent to the two-step procedure, although it may perform better in small samples.
• The continuously updating GMM estimation (CUGMM): the idea of this approach is to estimate the weighting matrix (which is a function of the parameters) jointly with the parameters themselves. The CUGMM estimator is computed as the minimizer of the objective function
$$\hat{G}_N(\theta)=\left[\sum_{i=1}^Ng(x_i;\theta)\right]^T\left[\sum_{i=1}^Ng(x_i;\theta)g^T(x_i;\theta)\right]^{-1}\left[\sum_{i=1}^Ng(x_i;\theta)\right]$$
and takes its name from the fact that whenever $\hat{G}_N(\theta)$ is numerically optimized, the weight matrix changes at every iteration. Monte Carlo simulations show this approach to perform better than the two-step procedure, and to make overidentification tests more reliable (Hansen et al., 1996). However, it is also very computationally demanding.

In both cases, the asymptotic variance of $\hat\theta_{GMM}$ is estimated via (12.26), by combining the last available estimate of $\Omega_0$ with an estimate of $G_0$ as per (12.17). In practical applications, numerical optimization is typically inevitable for the implementation of all these procedures; the choice of the most appropriate optimization method is case-dependent and it is best left to the specific evaluation of the practitioner.
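Hansen's two-step procedure can be sketched in a few lines for a linear overidentified model, where the closed-form linear GMM solution replaces numerical optimization; the data-generating process below is purely hypothetical:

```python
import numpy as np

rng = np.random.default_rng(42)
N = 5_000
beta0 = np.array([1.0, 2.0])               # true parameters (assumed for the simulation)

# a constant plus two valid instruments, one endogenous regressor
Z = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
v = rng.normal(size=N)
x = Z[:, 1] + 0.5 * Z[:, 2] + v            # regressor correlated with the error via v
X = np.column_stack([np.ones(N), x])
e = v + rng.normal(size=N) * (1 + 0.5 * np.abs(x))   # heteroscedastic error
y = X @ beta0 + e

def gmm_beta(A):
    """Closed-form linear GMM estimator for a given weighting matrix A."""
    XZ, ZX, Zy = X.T @ Z, Z.T @ X, Z.T @ y
    return np.linalg.solve(XZ @ A @ ZX, XZ @ A @ Zy)

# Step 1: identity weighting -> consistent but inefficient estimate
b1 = gmm_beta(np.eye(Z.shape[1]))
# Step 2: optimal weighting, A_N = Omega_hat^{-1} as in (12.25)
u1 = y - X @ b1
Omega_hat = (Z * u1[:, None] ** 2).T @ Z / N
b2 = gmm_beta(np.linalg.inv(Omega_hat))

# estimated asymptotic variance as in (12.26)
G_hat = Z.T @ X / N
Avar = np.linalg.inv(G_hat.T @ np.linalg.inv(Omega_hat) @ G_hat) / N
```

Iterated GMM would simply keep alternating between re-estimating `Omega_hat` and recomputing `gmm_beta` until the estimates stabilize.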


12.2 GMM and Instrumental Variables


In most applications, GMM estimators are based on so-called conditional moment conditions, which take the form:
$$\mathrm{E}\left[\left.h(y_i,z_i;\theta_0)\right|z_i\right]=0\tag{12.27}$$
where $h(\cdot)$ is a $P$-valued function and $z_i$ is a vector of $J$ instrumental variables. By the Law of Iterated Expectations, (12.27) delivers $PJ$ moment conditions that are usable for estimation.
$$\mathrm{E}\left[g(x_i;\theta_0)\right]=\mathrm{E}\left[z_i\otimes h(y_i,z_i;\theta_0)\right]=0\tag{12.28}$$
The discussion developed next shows that GMM estimators based on this class of moments encompass and generalize many common econometric estimators, including 2SLS, 3SLS and extensions of NLLS, such as Instrumental Variables Non-Linear Least Squares (IV-NLLS).

2SLS as a GMM Estimator


The initial motivation given for GMM is to address the scenario of “more instruments than parameters” (overidentification) in linear models. In such a situation, moment conditions like (12.1) are based on the linear functions $h(Y_i,x_i;\theta_0)=Y_i-x_i^T\beta_0$ with $P=1$ (hence the moments are $J$ in total), and their sample analogs are as follows.
$$g_N(\beta)=\frac{1}{N}\sum_{i=1}^Nz_i\left(y_i-x_i^T\beta\right)$$
The associated GMM estimator is:
$$\hat\beta_{GMM}=\arg\min_{\beta\in\mathbb{R}^K}\left[\frac{1}{N}\sum_{i=1}^Nz_i\left(y_i-x_i^T\beta\right)\right]^TA_N\left[\frac{1}{N}\sum_{i=1}^Nz_i\left(y_i-x_i^T\beta\right)\right]$$

which clearly has an analytic solution. In fact, the First Order Conditions of the problem above are:
$$-2\left[\frac{1}{N}\sum_{i=1}^Nx_iz_i^T\right]A_N\left[\frac{1}{N}\sum_{i=1}^Nz_i\left(y_i-x_i^T\hat\beta_{GMM}\right)\right]=0$$
therefore:
$$\hat\beta_{GMM}=\left[\left(\sum_{i=1}^Nx_iz_i^T\right)A_N\left(\sum_{i=1}^Nz_ix_i^T\right)\right]^{-1}\left(\sum_{i=1}^Nx_iz_i^T\right)A_N\left(\sum_{i=1}^Nz_iy_i\right)\tag{12.29}$$


or, in compact matrix notation:
$$\hat\beta_{GMM}=\left(X^TZA_NZ^TX\right)^{-1}X^TZA_NZ^Ty\tag{12.30}$$
which already resembles the 2SLS estimator. Note, in fact, that if one were to choose the weighting matrix $A_N$ as:
$$\tilde{A}_N=\left(\frac{1}{N}\sum_{i=1}^Nz_iz_i^T\right)^{-1}=\left(\frac{1}{N}Z^TZ\right)^{-1}$$
this estimator would correspond exactly to the standard 2SLS estimator. The actual estimate of its variance would depend on the assumptions made by the researcher (standard heteroscedasticity, homoscedasticity, group dependence, etc.) but would anyhow be easily related to the “long” expression of the GMM asymptotic variance-covariance (12.16).
The theory of GMM, however, allows for additional efficiency gains. In fact, the weighting matrix $A_N$ may be chosen as the inverse of the estimated variance-covariance matrix of the moment conditions, obtained under the assumption that the observations are independent:
$$A_N=\hat\Omega_N^{-1}=\left\{\widehat{\mathrm{Avar}}\left[\frac{1}{\sqrt{N}}\sum_{i=1}^Nz_i\left(y_i-x_i^T\hat\beta_{GMM}\right)\right]\right\}^{-1}=\left(\frac{1}{N}\sum_{i=1}^Ne_i^2z_iz_i^T\right)^{-1}=\left(\frac{1}{N}Z^T\hat{E}_NZ\right)^{-1}$$
where $e_i\equiv y_i-x_i^T\hat\beta_{GMM}$ for $i=1,\dots,N$ and $\hat{E}_N$ is as in (10.68). The GMM estimator (12.30) would thus become:
$$\hat\beta_{GMM}=\left[X^TZ\left(Z^T\hat{E}_NZ\right)^{-1}Z^TX\right]^{-1}X^TZ\left(Z^T\hat{E}_NZ\right)^{-1}Z^Ty\tag{12.31}$$
which differs slightly from standard 2SLS. In fact, this kind of GMM estimation retrieves a generalized version (in the GLS sense) of the overidentified 2SLS estimator. To better appreciate this, consider the estimated asymptotic variance of (12.31):
$$\widehat{\mathrm{Avar}}\big[\hat\beta_{GMM}\big]=\frac{1}{N}\left[X^TZ\left(Z^T\hat{E}_NZ\right)^{-1}Z^TX\right]^{-1}$$


which no longer takes a typical sandwiched form akin to (10.65). Thus, by a decomposition analogous to (12.23), linear GMM can be shown to be a more efficient estimator – yet likewise consistent – than standard 2SLS.
As a follow-up to these observations, one might wonder “how well” the standard 2SLS fares, in terms of efficiency, relative to linear GMM. In the highly ideal case of independent, identically distributed, and homoscedastic observations, the probability limit of the optimal weighting matrix for linear GMM is as follows:
$$A_N^{-1}=\hat\Omega_N=\frac{1}{N}\sum_{i=1}^Ne_i^2z_iz_i^T\xrightarrow{p}\sigma^2\,\mathrm{E}\left[z_iz_i^T\right]$$
where $\sigma^2\equiv\mathrm{Var}\left[Y_i-x_i^T\beta_0\right]$ does not depend on $z_i$; at the same time:
$$\tilde{A}_N^{-1}=\frac{1}{N}\sum_{i=1}^Nz_iz_i^T\xrightarrow{p}\mathrm{E}\left[z_iz_i^T\right]$$
so that the two estimators would asymptotically coincide: observe that $\sigma^2$ would cancel in the expression of the probability limit of (12.22). In less ideal scenarios, the equivalence collapses. Nevertheless, it has been observed that for linear models, the efficiency gains obtained thanks to the optimal variance GMM are small in comparison to the computational and empirical costs associated with its implementation – for example, if observations are dependent, a different estimator of $\Omega_0$ may be necessary in order to construct the optimal weighting matrix, and the resulting GMM estimator may be even less efficient than 2SLS if wrong choices are made. It is no surprise, then, that current practice favors the use of the standard 2SLS estimator coupled with appropriate estimators of its variance – most typically, the heteroscedasticity- or the cluster-robust ones.
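The exact correspondence between the linear GMM solution under $\tilde{A}_N$ and textbook 2SLS can be verified numerically: on any simulated dataset, (12.30) evaluated at $\tilde{A}_N$ coincides with the coefficients from regressing $y$ on the first-stage projection $P_ZX$ (the design below is an arbitrary assumption):

```python
import numpy as np

rng = np.random.default_rng(7)
N = 1_000
Z = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])   # J = 3 instruments
v = rng.normal(size=N)
X = np.column_stack([np.ones(N), Z[:, 1] + 0.5 * Z[:, 2] + v])
y = X @ np.array([1.0, 2.0]) + v + rng.normal(size=N)

# GMM solution (12.30) with the 2SLS weighting matrix (Z'Z/N)^{-1}
A = np.linalg.inv(Z.T @ Z / N)
b_gmm = np.linalg.solve(X.T @ Z @ A @ Z.T @ X, X.T @ Z @ A @ Z.T @ y)

# textbook 2SLS: regress y on the projection of X onto the column space of Z
PZX = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]               # first stage: P_Z X
b_2sls = np.linalg.lstsq(PZX, y, rcond=None)[0]              # second stage
```

Note that rescaling the weighting matrix by any positive constant leaves the estimator unchanged, since the scale cancels in (12.30).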

3SLS as a GMM Estimator


In analogy with 2SLS, the 3SLS estimator for SEMs also results from the solution of a GMM problem. I report here the SEM model that is analyzed in Lecture 10, already formulated in compact matrix notation:
$$y=X\beta_0+\varepsilon$$
or, more extensively:
$$\begin{pmatrix}y_1\\y_2\\\vdots\\y_P\end{pmatrix}=\begin{pmatrix}X_1&0&\cdots&0\\0&X_2&\cdots&0\\\vdots&\vdots&\ddots&\vdots\\0&0&\cdots&X_P\end{pmatrix}\begin{pmatrix}\beta_{10}\\\beta_{20}\\\vdots\\\beta_{P0}\end{pmatrix}+\begin{pmatrix}\varepsilon_1\\\varepsilon_2\\\vdots\\\varepsilon_P\end{pmatrix}$$


a model that has $N\times P$ dimension ($N$ observations for $P$ equations). Recall that each of the matrices $X_p$ for $p=1,\dots,P$ combines all the endogenous and exogenous variables not excluded from the $p$-th equation. The moment conditions, however, single out the exogenous variables $z_i$:
$$\mathrm{E}\left[\left.\varepsilon_{pi}\right|z_i\right]=0\quad\Rightarrow\quad\mathrm{E}\left[z_i\varepsilon_{pi}\right]=0$$
for all equations $p=1,\dots,P$. The sample analog of the moment conditions is in this case:
$$g_N(\beta)=\frac{1}{N}\sum_{i=1}^N\begin{pmatrix}y_{1i}-x_{1i}^T\beta_1\\y_{2i}-x_{2i}^T\beta_2\\\vdots\\y_{Pi}-x_{Pi}^T\beta_P\end{pmatrix}\otimes z_i=0$$
a $PQ$-dimensional vector ($P$ equations for $Q$ “exogenous” instruments). A more practical and elegant way to write these moment conditions is to use compact matrix notation:
$$\frac{1}{N}\left(I\otimes Z^T\right)(y-X\beta)=0$$
where the Kronecker product operates by diagonally stacking the transpose of the exogenous variables' matrix $Z=\begin{pmatrix}z_1&z_2&\dots&z_N\end{pmatrix}^T$ just $P$ times.
$$I\otimes Z^T=\begin{pmatrix}Z^T&0&\cdots&0\\0&Z^T&\cdots&0\\\vdots&\vdots&\ddots&\vdots\\0&0&\cdots&Z^T\end{pmatrix}$$
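The dimensions and block-diagonal structure of $I\otimes Z^T$ are easy to verify numerically; the toy sizes below are hypothetical:

```python
import numpy as np

N, Q, P = 6, 2, 3                         # observations, instruments, equations
rng = np.random.default_rng(1)
Z = rng.normal(size=(N, Q))               # rows are z_i'

IZt = np.kron(np.eye(P), Z.T)             # diagonally stacks Z' just P times
print(IZt.shape)                          # (P*Q, P*N) = (6, 18)

# stacked errors eps = (eps_1', ..., eps_P')' of length P*N
eps = rng.normal(size=P * N)
# (I kron Z') eps stacks Z' eps_p equation by equation
blocks = [Z.T @ eps[p * N:(p + 1) * N] for p in range(P)]
```

Applying `IZt` to the stacked error vector therefore returns exactly the $P$ blocks $Z^T\varepsilon_p$, one per equation.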
Notice that this Kronecker product has dimension $PQ\times PN$, where $PN$ is also the length of the error terms vector $\varepsilon$. In this case, the GMM problem is written as:
$$\hat\beta_{GMM}=\arg\min_{\beta\in\mathbb{R}^B}\left[\frac{1}{N}\left(I\otimes Z^T\right)(y-X\beta)\right]^TA_N\left[\frac{1}{N}\left(I\otimes Z^T\right)(y-X\beta)\right]$$
where $B=|\beta|$. The First Order Conditions are:
$$-2\left[\frac{1}{N}X^T\left(I\otimes Z\right)\right]A_N\left[\frac{1}{N}\left(I\otimes Z^T\right)\left(y-X\hat\beta_{GMM}\right)\right]=0$$
with the following “general” solution.
$$\hat\beta_{GMM}=\left[X^T\left(I\otimes Z\right)A_N\left(I\otimes Z^T\right)X\right]^{-1}X^T\left(I\otimes Z\right)A_N\left(I\otimes Z^T\right)y\tag{12.32}$$


The above estimator is equivalent to 3SLS if the weighting matrix is:
$$\tilde{A}_N=\left[\left(I\otimes Z^T\right)\left(\hat\Sigma_N\otimes I\right)\left(I\otimes Z\right)\right]^{-1}=\left[\hat\Sigma_N\otimes Z^TZ\right]^{-1}=\hat\Sigma_N^{-1}\otimes\left(Z^TZ\right)^{-1}$$
where $\hat\Sigma_N$ is still the estimate of the conditional error covariance matrix $\Sigma$ as defined in (10.82), and it obtains from 2SLS estimates as per (10.83) in the previous discussion of the 3SLS estimator. Furthermore, consider that projection matrices are symmetric and idempotent, thus:
$$\left(I\otimes Z\right)\left[\hat\Sigma_N^{-1}\otimes\left(Z^TZ\right)^{-1}\right]\left(I\otimes Z^T\right)=\hat\Sigma_N^{-1}\otimes Z\left(Z^TZ\right)^{-1}Z^T=\left(\hat\Sigma_N^{-1}\otimes I\right)\left(I\otimes P_Z\right)=\left(I\otimes P_Z\right)\left(\hat\Sigma_N^{-1}\otimes I\right)\left(I\otimes P_Z\right)$$
where $P_Z=Z\left(Z^TZ\right)^{-1}Z^T$. Therefore, plugging this weighting matrix $\tilde{A}_N$ into (12.32) yields the 3SLS estimator in (10.84), which is obtained under the hypotheses of within-equation homoscedasticity and cross-equation dependence; the expression of its variance is given in (10.85).
Further efficiency gains are obtained with an optimal weighting matrix. Under the hypothesis of independent observations, this is estimated as:
$$A_N=\hat\Omega_N^{-1}=\left\{\widehat{\mathrm{Avar}}\left[\frac{1}{\sqrt{N}}\left(I\otimes Z^T\right)\left(y-X\hat\beta_{GMM}\right)\right]\right\}^{-1}=\left[\frac{1}{N}\left(I\otimes Z^T\right)\hat{S}_N\left(I\otimes Z\right)\right]^{-1}$$
where $\hat{S}_N$ is an object analogous to $\hat\Sigma_N\otimes I$, but slightly more complex:
$$\hat{S}_N=\begin{pmatrix}\hat{S}_{N,11}&\hat{S}_{N,12}&\cdots&\hat{S}_{N,1P}\\\hat{S}_{N,21}&\hat{S}_{N,22}&\cdots&\hat{S}_{N,2P}\\\vdots&\vdots&\ddots&\vdots\\\hat{S}_{N,P1}&\hat{S}_{N,P2}&\cdots&\hat{S}_{N,PP}\end{pmatrix}$$
and, for any $p,q=1,\dots,P$:
$$\hat{S}_{N,pq}=\begin{pmatrix}e_{pq1}&0&\cdots&0\\0&e_{pq2}&\cdots&0\\\vdots&\vdots&\ddots&\vdots\\0&0&\cdots&e_{pqN}\end{pmatrix}$$
with $e_{pqi}=\left(y_{pi}-x_{pi}^T\hat\beta_{p,GMM}\right)\left(y_{qi}-x_{qi}^T\hat\beta_{q,GMM}\right)$ for $i=1,\dots,N$.


Despite its algebraic complexity, this estimate of the variance-covariance of the moment conditions has a straightforward interpretation: in addition to allowing for cross-equation correlation (like standard 3SLS), it is robust to heteroscedasticity in the within-equation variances as well as in the cross-equation covariances. The resulting GMM estimator is:
$$\hat\beta_{GMM}=\left\{X^T\left(I\otimes Z\right)\left[\left(I\otimes Z^T\right)\hat{S}_N\left(I\otimes Z\right)\right]^{-1}\left(I\otimes Z^T\right)X\right\}^{-1}\times\\ \times X^T\left(I\otimes Z\right)\left[\left(I\otimes Z^T\right)\hat{S}_N\left(I\otimes Z\right)\right]^{-1}\left(I\otimes Z^T\right)y\tag{12.33}$$
with estimated asymptotic variance:
$$\widehat{\mathrm{Avar}}\big[\hat\beta_{GMM}\big]=\frac{1}{N}\left\{X^T\left(I\otimes Z\right)\left[\left(I\otimes Z^T\right)\hat{S}_N\left(I\otimes Z\right)\right]^{-1}\left(I\otimes Z^T\right)X\right\}^{-1}$$
and it is asymptotically identical to standard 3SLS under homoscedasticity in both the within-equation variances and the cross-equation covariances. Just like the GMM version of 2SLS, this estimator is seldom used; however, it is of theoretical importance as the GMM generalization of the main full information semi-parametric estimator for simultaneous equations.

Instrumental Variables in Non-Linear Models


The GMM framework is well suited to extend the Instrumental Variables approach also to non-linear estimators like, for example, generalizations of the MLE score that allow for “exogenous” instruments. The general GMM problem associated with the moment conditions (12.28) is:
$$\hat\theta_{GMM}=\arg\min_{\theta\in\Theta}\left[\frac{1}{N}\sum_{i=1}^Nz_i\cdot h(y_i,z_i;\theta)\right]^TA_N\left[\frac{1}{N}\sum_{i=1}^Nz_i\cdot h(y_i,z_i;\theta)\right]$$
which in general lacks an explicit solution. However, an expression for both the limiting and asymptotic variances is easily obtained from (12.11) and (12.16) by noting that, in this case:
$$G_0=\lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^N\mathrm{E}\left[z_i\,\frac{\partial}{\partial\theta^T}h(y_i,z_i;\theta_0)\right]$$
and similarly for $G_N(\theta)$. A typical application of this is in single-equation non-linear models where:
$$\mathrm{E}\left[\frac{\partial}{\partial\theta}h(x_i;\theta_0)\cdot\left(y_i-h(x_i;\theta_0)\right)\right]\neq0$$


that is, the error term $\varepsilon_i=y_i-h(x_i;\theta_0)$ is not mean-independent of the implicit set of instruments defined by the Non-Linear Least Squares estimator – the $K$-dimensional vector of derivatives of $h(\cdot)$ with respect to the parameters $\theta$, itself a function of the explanatory variables $x_i$ – because of some type of endogeneity problem (just as for linear models).
Thus, the solution to the problem would be to look for a $J$-dimensional vector of instrumental variables $z_i$ such that:
$$\mathrm{E}\left[z_i\cdot h(y_i,z_i;\theta_0)\right]=\mathrm{E}\left[z_i\left(Y_i-h(x_i;\theta_0)\right)\right]=0\tag{12.34}$$
and the Non-Linear Two-Stage Least Squares (NL2SLS) estimator which follows from the solution of this GMM problem crucially depends on the choice of $A_N$. Its limiting variance is:
$$\sqrt{N}\big(\hat\theta_{NL2SLS}-\theta_0\big)\xrightarrow{d}\mathcal{N}\left(0,\left(J_0^TA_0J_0\right)^{-1}J_0^TA_0\Omega_0A_0J_0\left(J_0^TA_0J_0\right)^{-1}\right)\tag{12.35}$$
where $J_0$, the analogue of $G_0$, is defined as the following probability limit:
$$\frac{1}{N}\sum_{i=1}^Nz_ih_{0i}^T\xrightarrow{p}\lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^N\mathrm{E}\left[z_ih_{0i}^T\right]\equiv J_0$$
(where $J_0=\mathrm{E}\left[z_ih_{0i}^T\right]$ if the observations are identically distributed), $h_{0i}$ is as in Examples 11.4 and 11.7, while $A_0$ and $\Omega_0$ are as in the standard theory of GMM. The estimated asymptotic variance of the NL2SLS estimator is:
$$\widehat{\mathrm{Avar}}\big[\hat\theta_{NL2SLS}\big]=\frac{1}{N}\left(\hat{J}_N^TA_N\hat{J}_N\right)^{-1}\hat{J}_N^TA_N\hat\Omega_NA_N\hat{J}_N\left(\hat{J}_N^TA_N\hat{J}_N\right)^{-1}$$
where:
$$\hat{J}_N=\frac{1}{N}\sum_{i=1}^Nz_i\hat{h}_i^T$$
and $\hat{h}_i$ is as in Example 11.7. The particular choice of the weighting matrix entails the same considerations as in the case of “linear” GMM-2SLS:
1. choosing $A_N=\left(N^{-1}\sum_{i=1}^Nz_iz_i^T\right)^{-1}$ delivers the traditional (standard) version of the NL2SLS estimator, which is akin to standard 2SLS;
2. with independent observations, $\hat\Omega_N=N^{-1}\sum_{i=1}^Ne_i^2z_iz_i^T$, where $e_i$ is the usual residual for each $i$-th observation;
3. in such a case, however, the optimal NL2SLS estimator is obtained by setting $A_N=\hat\Omega_N^{-1}$, and it is asymptotically equivalent to standard NL2SLS only under homoscedasticity.


Note that if points 1. and 2. above are maintained and, in addition, in the moment conditions (12.34) one specifies $z_i=h_{0i}$ – so that the instruments enter as $z_i=\hat{h}_i$ in the estimation problem – the GMM returns the standard NLLS estimator which is examined in Lecture 11. This is clearly analogous to setting $z_i=x_i$ in linear models, which returns standard OLS.
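A minimal NL2SLS sketch for a hypothetical single-parameter logistic model, using $A_N=I$ and a crude grid search in place of a proper numerical optimizer (the model, the instrument, and all numbers below are assumptions made for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
N, theta0 = 20_000, 0.7                    # sample size and true parameter (assumed)

z = rng.normal(size=N)                     # instrument, independent of the error
v = rng.normal(size=N)
x = z + v                                  # endogenous regressor
u = 0.5 * v + 0.3 * rng.normal(size=N)     # error correlated with x through v
h = lambda x, t: 1.0 / (1.0 + np.exp(-t * x))
y = h(x, theta0) + u                       # non-linear outcome equation

# moment conditions E[(1, z_i)' (y_i - h(x_i; theta))] = 0: J = 2, K = 1
def objective(t):
    resid = y - h(x, t)
    g = np.array([resid.mean(), (z * resid).mean()])
    return g @ g                           # GMM criterion with A_N = I

grid = np.linspace(0.0, 1.5, 301)
theta_hat = grid[np.argmin([objective(t) for t in grid])]
```

In practice one would use a gradient-based or derivative-free optimizer rather than a grid, but the logic, minimizing a quadratic form in the sample moments, is the same.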

Optimal Instruments
When operationalized via GMM, conditional moment conditions like (12.27) constitute a framework for the semi-parametric estimation of a wide class of econometric models: in fact, the theory for instrumental variables in non-linear models can be extended to non-linear systems of structural equations as well, similarly to how 3SLS generalizes 2SLS. However, conditional moment conditions are even more general, since any function of the instruments $l(z_i)$ which takes values upon a $J'$-dimensional set makes for valid moment conditions of the kind:
$$\mathrm{E}\left[l(z_i)\otimes h(y_i,z_i;\theta_0)\right]=0$$
so long as $PJ'\geq K$, where $K$ is the total number of parameters. A relevant question is to what extent it is possible to construct appropriate optimal instruments so that the resulting GMM problem delivers the most efficient estimate available with the information enclosed in the conditional moment conditions (12.27). A result proved by several authors is that this objective is achieved through the $K\times P$ matrix $L(y_i,z_i;\theta_0)$ defined as:
$$L(y_i,z_i;\theta_0)=\mathrm{E}\left[\left.\frac{\partial}{\partial\theta}h^T(y_i,z_i;\theta_0)\right|z_i\right]\left\{\mathrm{Var}\left[\left.h(y_i,z_i;\theta_0)\right|z_i\right]\right\}^{-1}\tag{12.36}$$
where the first term (the conditional expectation) is a $K\times P$ matrix, while the second term (the inverted conditional variance) is a $P\times P$ matrix. The efficient estimate of $\theta_0$ is then obtained through the following $K$ “optimal” moment conditions:
$$\mathrm{E}\left[g(y_i,z_i;\theta_0)\right]=\mathrm{E}\left[L(y_i,z_i;\theta_0)\cdot h(y_i,z_i;\theta_0)\right]=0$$
and the corresponding estimate $\hat\theta_{MM}$ solves a simple Method of Moments sample analog system of equations.
$$\frac{1}{N}\sum_{i=1}^NL\big(y_i,z_i;\hat\theta_{MM}\big)\cdot h\big(y_i,z_i;\hat\theta_{MM}\big)=0$$
Clearly, the limiting variance for this estimator assumes a “sandwiched” expression which is simpler than (12.16): as the problem is exactly identified, the weighting matrix is redundant.


Unfortunately, the practical use of the optimal instruments defined in (12.36) requires either a priori knowledge or the formulation of assumptions about specific moments conditional on the instruments, and any mistakes would jeopardize the entire approach. This is analogous to the problem of specifying the variance of the error term conditional on the regressors in linear models in order to make GLS “feasible.” In fact, the parallel is more than just intuitive: if $z_i=x_i$, $P=1$ and $h(Y_i,x_i;\beta_0)=Y_i-x_i^T\beta_0$, then:
$$L_{GLS}(Y_i,x_i;\beta_0)=-x_i\cdot\frac{1}{\sigma_L^2(x_i)}$$
and the resulting Method of Moments estimator is just GLS! Similarly, if $z_i=x_i$, $P=1$ but the model is non-linear, $h(Y_i,x_i;\theta_0)=Y_i-h(x_i;\theta_0)$:
$$L_{GNLLS}(Y_i,x_i;\theta_0)=-h_{0i}\cdot\frac{1}{\sigma_{NL}^2(x_i)}$$
which yields the “Generalized” version of Non-Linear Least Squares instead.³ Because of the complications entailed in specifying the conditional moments that compose $L(Y_i,x_i;\theta_0)$, in more general cases the current practice favors using moment conditions based on the simple vector of instruments $z_i$.
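The GLS parallel can be checked numerically: with $z_i=x_i$, $P=1$ and a known skedastic function $\sigma_L^2(x_i)$, the Method of Moments system built on $L_{GLS}$ has the familiar weighted least squares solution (the skedastic function and coefficients below are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(11)
N = 2_000
X = np.column_stack([np.ones(N), rng.normal(size=N)])
sigma2 = 0.5 + X[:, 1] ** 2                # assumed known conditional variance
y = X @ np.array([1.0, -0.5]) + np.sqrt(sigma2) * rng.normal(size=N)

# optimal-instrument moments: sum_i x_i (y_i - x_i' b) / sigma2_i = 0,
# whose solution is exactly the GLS/WLS estimator (X' W X)^{-1} X' W y
W = 1.0 / sigma2
b_mm = np.linalg.solve((X * W[:, None]).T @ X, (X * W[:, None]).T @ y)

# the sample moment conditions vanish (numerically) at the solution
m = (X * W[:, None]).T @ (y - X @ b_mm) / N
```

The same construction with $h_{0i}$ in place of $x_i$ yields the “generalized” NLLS estimator, although the system then has to be solved iteratively.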

12.3 Testing Overidentification


While discussing identification in the context of SEMs (Lecture 9), it was already mentioned that it is possible to statistically test the conditions that give rise to overidentification. The idea is that, when multiple restrictions allow the identification of a single parameter, a researcher may think about holding one of them “constant” and verify, by performing a formal statistical test, how likely the other ones are in probabilistic terms, given the estimated parameters. Thus, if the statistical test is not favorable to the null hypothesis associated with the additional identifying conditions, the researcher may consider revising the model. It turns out that overidentification tests are well integrated in the GMM framework. In what follows the most common of such tests, the Sargan-Hansen test, is described briefly. The starting point is the following “Hansen J” statistic, which is a quadratic form of the $J$ estimated moment conditions.
$$J\big(\hat\theta_{GMM}\big)=N\,g_N^T\big(\hat\theta_{GMM}\big)\,\hat\Omega_N^{-1}\,g_N\big(\hat\theta_{GMM}\big)\xrightarrow{d}\chi_{J-K}^2\tag{12.37}$$
T b (12.37)
 
³ Here $\sigma_L^2(x_i)=\mathrm{Var}\left[\left.Y_i-x_i^T\beta_0\right|x_i\right]$ and $\sigma_{NL}^2(x_i)=\mathrm{Var}\left[\left.Y_i-h(x_i;\theta_0)\right|x_i\right]$ are the two conditional variances of the two models' error terms. Clearly, under homoscedasticity the two MM estimators result in OLS and standard NLLS, respectively.


Observe that when the GMM estimation is performed with an optimal weighting matrix $A_N$, the Hansen J-statistic corresponds to the estimated GMM objective function evaluated at $\hat\theta_{GMM}$ and multiplied by $N$. Under standard conditions this statistic is asymptotically $\chi^2$ distributed with $J-K$ degrees of freedom, where $K$ degrees are subtracted to account for the fact that $\hat\theta_{GMM}$ has been estimated. The intuition behind this test statistic is that if the moment conditions accurately describe the real world, their sample analogues, when evaluated at $\hat\theta_{GMM}$, should be empirically close to zero. The test evaluates exactly this hypothesis, taking care of normalizing the sample moment conditions by their empirically observed variance. Hence, if the measured Hansen J-statistic following some GMM estimation is too large relative to some critical value, the test may lead to rejecting the null hypothesis that all moments are zero in the population.
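As a sketch, the statistic (12.37) can be computed after efficient two-step estimation of a linear overidentified model with valid instruments (the simulated design is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)
N = 4_000
Z = np.column_stack([np.ones(N), rng.normal(size=(N, 3))])   # J = 4 valid instruments
v = rng.normal(size=N)
x = Z[:, 1:] @ np.array([1.0, 0.6, 0.4]) + v                 # endogenous regressor
X = np.column_stack([np.ones(N), x])                          # K = 2 parameters
y = X @ np.array([0.5, 1.5]) + v + rng.normal(size=N)

def gmm(A):
    return np.linalg.solve(X.T @ Z @ A @ Z.T @ X, X.T @ Z @ A @ Z.T @ y)

b1 = gmm(np.linalg.inv(Z.T @ Z / N))          # first step: 2SLS weighting
u = y - X @ b1
Omega = (Z * u[:, None] ** 2).T @ Z / N       # (12.18) at the first-step estimate
b2 = gmm(np.linalg.inv(Omega))                # efficient second step

g = Z.T @ (y - X @ b2) / N                    # sample moments at the final estimate
J_stat = N * g @ np.linalg.inv(Omega) @ g     # (12.37)
df = Z.shape[1] - X.shape[1]                  # J - K degrees of freedom
```

Since the instruments here are valid by construction, `J_stat` should typically fall below the usual $\chi_2^2$ critical values.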
The Hansen J-statistic is mostly useful when there are only a few more moment conditions than parameters. In such a case, if the null hypothesis is rejected, the researcher might selectively remove moment conditions and evaluate the consequent performance of the modified model. However, when the $J$ moment conditions exceed the number of parameters $K$ by a substantial amount, one might be interested in testing a subset of moment conditions in block. This task is performed by the so-called “incremental” Sargan test. In order to characterize this test, suppose one can divide the moment conditions in two subsets:
$$\mathrm{E}\left[g(x_i;\theta_0)\right]=\mathrm{E}\begin{bmatrix}g_1(x_i;\theta_0)\\g_2(x_i;\theta_0)\end{bmatrix}=0$$
where $|g_1(x_i;\theta_0)|=J_1>K$, $|g_2(x_i;\theta_0)|=J_2$, and $J_1+J_2=J$. Suppose that $\mathrm{E}\left[g_1(x_i;\theta_0)\right]=0$ can be confidently believed, because of either prior knowledge or previous testing. Instead, one is interested in testing the null hypothesis $H_0:\mathrm{E}\left[g_2(x_i;\theta_0)\right]=0$.
The “incremental” Sargan statistic is defined as follows:
$$J_S\big(\hat\theta_{GMM},\tilde\theta\big)=J\big(\hat\theta_{GMM}\big)-J\big(\tilde\theta\big)\xrightarrow{d}\chi_{J_2}^2\tag{12.38}$$
for:
$$J\big(\tilde\theta\big)=N\,g_{N1}^T\big(\tilde\theta\big)\,\hat\Omega_{N1}^{-1}\,g_{N1}\big(\tilde\theta\big)\xrightarrow{d}\chi_{J_1-K}^2$$
and:
$$\tilde\theta=\arg\min_{\theta\in\Theta}\,g_{N1}^T(\theta)\,\hat\Omega_{N1}^{-1}\,g_{N1}(\theta)$$
where here $g_{N1}(\theta)$ is the sample mean of $g_1(x_i;\theta)$, $\hat\Omega_{N1}$ is some consistent estimate of its variance-covariance matrix,⁴ and $J(\cdot)$ is Hansen's J-statistic.

⁴ It may well be the $J_1$-dimensional upper left square block of $\hat\Omega_N$.


In practice, the incremental Sargan test results from subtracting from the original Hansen J-statistic another Hansen J-statistic, where the latter is obtained from a “reduced” GMM model of the $J_1$ “certain” moment conditions. This second Hansen J-statistic is always smaller by construction, because there are fewer moment conditions to match the zero vector. Therefore, the incremental Sargan test measures how much the other $J_2$ conditions deviate from zero once the $J_1$ “certain” ones are held constant. It is apparent how the intuition behind this test presents many analogies with the Distance or “Likelihood Ratio” test from the Trinity of statistical tests.

Example 12.1. Overidentified Mincer Equation. Return once more to the recurring example about the Mincer Equation:
$$\log W_i=\beta_0+\beta_1X_i+\beta_2X_i^2+\beta_3S_i+\alpha_i+\varepsilon_i$$
however, suppose that there are now three potential instruments available. The first is $Z_i$, the already mentioned “distance from college” instrument by Card. The second is $G_i$, representing past eligibility for some “fellowship grant” for attending higher education programs. For example, $G_i$ might be motivated by some random (exogenous) past allocation of scholarship grants by the government authorities. Clearly, in this case the eligible individuals had obtained an advantage at the time of deciding whether or not to enroll in college. This, however, did not likely affect their future wages other than via better education (exclusion restriction). The last instrument is $F_i$, the average education of an individual's friends or close social network. One may argue that one's friends might have affected the individual decision on whether to enroll in college, but not his or her wages: the latter statement, however, is about exogeneity, and it is dubious at best.
The resulting set of moment conditions is
$$\mathrm{E}\left[\begin{pmatrix}1\\X_i\\X_i^2\\Z_i\\G_i\\F_i\end{pmatrix}\underbrace{\left(\log W_i-\beta_0-\beta_1X_i-\beta_2X_i^2-\beta_3S_i\right)}_{=\alpha_i+\varepsilon_i}\right]=0\tag{12.39}$$
six conditions for the four parameters of the Mincer Equation. As we have discussed, this model can be estimated via 2SLS-GMM by setting $Y_i=\log W_i$, $x_i^T=\left(1,X_i,X_i^2,S_i\right)$ and $z_i^T=\left(1,X_i,X_i^2,Z_i,G_i,F_i\right)$. However, having three instruments for the education variable provides the interesting opportunity to test the exogeneity conditions implied, respectively, by the fourth, fifth

and sixth rows of (12.39). In this context, Hansen's J-statistic reads as
$$J\left(\hat{\beta}_{2SLS}\right)=N\left[\frac{1}{N}\sum_{i=1}^{N}z_i e_i\right]^{T}\left[\frac{1}{N}\sum_{i=1}^{N}e_i^2\,z_i z_i^{T}\right]^{-1}\left[\frac{1}{N}\sum_{i=1}^{N}z_i e_i\right]\xrightarrow{d}\chi^2\left(2\right)$$
where, as usual, $e_i=y_i-x_i^{T}\hat{\beta}_{2SLS}$. A rejection of the associated test would cast doubt on the validity of (12.39). Which, however, are the very "false instruments" responsible for the rejection of Hansen's J test? An answer to this question can be given by sequentially removing each of the instruments Zi, Gi and Fi, estimating a smaller 2SLS model with five moments for four parameters, and computing the associated Hansen J-statistic, which would be asymptotically χ² distributed with only one degree of freedom. Such a battery of incremental Sargan tests may reveal, for example, that the moment condition for the "friends" instrument Fi is patently violated, arguably because the quality of an individual's connections is correlated with determinants of his or her wage in the labor market.
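The incremental Sargan procedure just described is mechanical enough to code directly. The snippet below is a minimal sketch, not code from these notes: the function name `hansen_j` and the array layout are invented for the illustration, and the function computes the two-step GMM J-statistic for the linear moment conditions E[z_i e_i] = 0; re-running it with one candidate instrument column removed at a time yields the battery of incremental χ²(1) tests.

```python
import numpy as np

def hansen_j(y, X, Z):
    """Two-step GMM J-statistic for the linear moments E[z_i e_i] = 0.

    y: (N,) outcome; X: (N, K) regressors; Z: (N, J) instruments, J >= K.
    Returns the J-statistic and its degrees of freedom J - K.
    """
    K, J = X.shape[1], Z.shape[1]
    ZX, Zy = Z.T @ X, Z.T @ y
    # First step: 2SLS, i.e. GMM with weighting matrix (Z'Z)^{-1}.
    W = np.linalg.inv(Z.T @ Z)
    b = np.linalg.solve(ZX.T @ W @ ZX, ZX.T @ W @ Zy)
    # Second step: efficient weighting by the inverse of S = sum_i e_i^2 z_i z_i'.
    e = y - X @ b
    S = (Z * e[:, None]).T @ (Z * e[:, None])
    b = np.linalg.solve(ZX.T @ np.linalg.solve(S, ZX),
                        ZX.T @ np.linalg.solve(S, Zy))
    # J-statistic evaluated at the efficient estimate.
    e = y - X @ b
    S = (Z * e[:, None]).T @ (Z * e[:, None])
    g = Z.T @ e                        # sum_i z_i e_i
    return float(g @ np.linalg.solve(S, g)), J - K
```

Under valid instruments the statistic is asymptotically χ²(J − K); an incremental comparison simply recomputes it after dropping each suspect instrument column of Z.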

12.4 Methods of Simulated Moments


The discussion in this lecture has thus far assumed that functions g (xi ; θ) –
or more restrictively, zi ⊗h (yi , zi ; θ) – that are used to define and construct
GMM estimators can always be computed given values for the parameters
θ and realizations of the random variables involved. Sometimes, however,
this is not the case: for example, these functions may be expressed in terms
of integrals lacking a closed form solution. This is analogous to the problem
discussed in Lecture 11 for MLE and, more generally, for all M-Estimators,
whose typical solution is the theoretical development and practical
implementation of simulation-based estimators. Solutions of this sort also
extend to estimators based on moment conditions.
To better introduce such approaches to methods of moments, it is useful
to rephrase some concepts and notation previously introduced in the discus-
sion of simulated M-Estimators from Lecture 11. Suppose that theoretical
analysis postulates some moment conditions E [g (xi ; θ0 )] = 0, where:
$$g\left(x_i;\theta\right)=\int_{\mathcal{U}}g_u\left(x_i,u_i;\theta\right)dH_u\left(u_i\right) \tag{12.40}$$
where, like in (11.37), $u_i$ is a random vector with cumulative distribution $H_u\left(u_i\right)$ that is integrated out over its support $\mathcal{U}$. The integral in (12.40)
could lack a closed form solution for any given values of the parameters θ
and realizations of the observable variables xi : naturally, this would prevent
the straightforward implementation of (G)MM.


In such cases, Direct Monte Carlo Sampling based on a sample $\{u_s\}_{s=1}^{S}$ of S random draws of $u_i$ from $H_u\left(u_i\right)$ allows one to construct a simulator that converges in probability to (12.40) for each observation $i=1,\dots,N$:
$$\hat{g}_S\left(x_i;\theta\right)=\frac{1}{S}\sum_{s=1}^{S}\tilde{g}_u\left(x_i,u_s;\theta\right) \tag{12.41}$$
where $\tilde{g}_u\left(x_i,u_s;\theta\right)$ is a subsimulator that is ideally an unbiased estimator of (12.40), so as to guarantee $\hat{g}_S\left(x_i;\theta\right)\xrightarrow{p}g\left(x_i;\theta\right)$ by standard asymptotic
arguments. Thus, the Method of Simulated Moments (MSM) estimator
is defined as:
$$\hat{\theta}_{MSM}=\arg\min_{\theta\in\Theta}\left[\frac{1}{N}\sum_{i=1}^{N}\hat{g}_S\left(x_i;\theta\right)\right]^{T}A_N\left[\frac{1}{N}\sum_{i=1}^{N}\hat{g}_S\left(x_i;\theta\right)\right] \tag{12.42}$$

where AN is a J × J weighting matrix whose probability limit is A0 , as in


standard GMM. The MSM estimator expressed in (12.42) naturally accommodates overidentification; notice, however, that this estimator is extensively used even in just-identified cases (J = K) where the expression of the moment conditions calls for simulation and where AN = A0 = I. In those cases
where the moment conditions are conditional upon instrumental variables
as in (12.27), the simulator takes the form
$$\hat{g}_S\left(x_i;\theta\right)=\frac{1}{S}\sum_{s=1}^{S}z_i\,\tilde{h}_u\left(y_i,z_i,u_s;\theta\right) \tag{12.43}$$
where $\tilde{h}_u\left(y_i,z_i,u_s;\theta\right)$ is again a suitable subsimulator, here for $h\left(y_i,z_i;\theta\right)$.


The following example illustrates the need for MSM estimators using a
particular case of the LDV settings that originally motivated the seminal
article on MSM by McFadden (1989); the case builds upon Example 11.13.

Example 12.2. Random coefficients logit with instrumental variables. Recall the random coefficients logit model from Example 11.13. The
assumptions of that model naturally lead to the moment condition

$$\mathrm{E}\left[\left.Y_i-\Lambda\left(\beta_0+\beta_{1i}X_i\right)\right|X_i\right]=0 \tag{12.44}$$

where Λ (·) is the cumulative logistic distribution of the binary outcome Yi


conditional on Xi . The interpretation of (12.44) is analogous to that of the
mean independence condition of linear models: conditional on the regressor
Xi , the “error” Yi − Λ (β0 + β1i Xi ) must equal zero on average, that is, the

411
12.4. Methods of Simulated Moments

logistic model predicts the outcome Yi without bias. Although (12.44) lends
itself naturally to Method of Moments estimation, in practice function Λ (·)
might be hard to calculate under probabilistic assumptions on the random
coefficients $\beta_{1i}$. Let again $\beta_{1i}\sim N\left(\beta_1,\sigma^2\right)$ and $u_i=\left(\beta_{1i}-\beta_1\right)/\sigma$; then
$$\Lambda\left(\beta_0+\beta_{1i}X_i\right)=\int_{\mathbb{R}}\Lambda\left(\beta_0+\left(\beta_1+\sigma u_i\right)X_i\right)\phi\left(u_i\right)du_i \tag{12.45}$$

which is an integral without closed form solution, similarly as in (11.40). It is easy to see how a simulation allows one to construct an MSM estimator, with the particular expression of (12.43) here being
$$\hat{g}_S\left(y_i,x_i;\theta\right)=\frac{1}{N}\sum_{i=1}^{N}x_i\left[y_i-\frac{1}{S}\sum_{s=1}^{S}\tilde{\Lambda}\left(\beta_0+\left(\beta_1+\sigma u_s\right)x_i\right)\right]$$
given the subsimulator $\tilde{\Lambda}\left(\beta_0+\left(\beta_1+\sigma u_s\right)x_i\right)$ and $\theta=\left(\beta_0,\beta_1,\sigma^2\right)$. Notice that this would be a case of just-identified MSM.
Now suppose that (12.44) does not hold with equality because of some
instance of endogeneity: for whatever reason, the regressor Xi is correlated
with the model error Yi − Λ (β0 + β1i Xi ), thereby invalidating the moment
condition! A straightforward solution to this problem in the GMM spirit is
to use J ≥ K instrumental variables zi that satisfy

$$\mathrm{E}\left[\left.Y_i-\Lambda\left(\beta_0+\beta_{1i}X_i\right)\right|z_i\right]=0 \tag{12.46}$$

and that correlate with the main regressor Xi . Thus, using the simulator
$$\hat{g}_S\left(y_i,x_i,z_i;\theta\right)=\frac{1}{N}\sum_{i=1}^{N}z_i\left[y_i-\frac{1}{S}\sum_{s=1}^{S}\tilde{\Lambda}\left(\beta_0+\left(\beta_1+\sigma u_s\right)x_i\right)\right]$$

allows one to construct an overidentified MSM estimator suited to this setting,
so long as assumption (12.46) can be maintained. This example illustrates
the kind of flexibility that MSM affords for LDV models, which has helped
make this estimation framework popular in relatively more structural fields
of economics such as Industrial Organization.
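A bare-bones implementation of this just-identified MSM estimator may look as follows. This is a sketch under assumptions added for the example: the moment vector uses (1, Xi, Xi²) as instruments so as to match the three parameters (β0, β1, σ), the frozen draws {us} are standard normal, and all function names and constants are invented.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic CDF, Λ(·)

def msm_objective(theta, y, x, draws):
    """MSM criterion with identity weighting (A_N = I).

    theta = (b0, b1, s); `draws` holds S frozen N(0,1) simulation draws
    that stay fixed across optimizer iterations, so the criterion is a
    smooth function of theta.
    """
    b0, b1, s = theta
    # Simulated choice probability: (1/S) sum_s Λ(b0 + (b1 + s*u_s) x_i).
    p = expit(b0 + (b1 + s * draws[:, None]) * x[None, :]).mean(axis=0)
    e = y - p
    Z = np.column_stack([np.ones_like(x), x, x ** 2])  # three moments, three parameters
    g = Z.T @ e / len(y)
    return g @ g

# Simulated data from the random coefficients logit (invented constants).
rng = np.random.default_rng(42)
N, S = 2000, 100
x = rng.normal(size=N)
b1i = 1.0 + 0.5 * rng.normal(size=N)      # beta_1i ~ N(1, 0.25)
y = (rng.uniform(size=N) < expit(-0.5 + b1i * x)).astype(float)
draws = rng.normal(size=S)                # frozen simulation draws
res = minimize(msm_objective, x0=np.array([0.0, 0.5, 0.3]),
               args=(y, x, draws), method="Nelder-Mead")
```

Keeping the draws frozen across iterations is the key practical detail: redrawing them at every trial value of θ would make the objective discontinuous in the eyes of the optimizer.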
The discussion so far has borne many similarities to that of MSL in Lecture 11. There are fewer analogies, however, as far as the asymptotic properties of the two estimators are concerned. To appreciate this, define
$$\tilde{\Omega}_0\equiv\lim_{N\to\infty}\mathrm{Var}\left[\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\hat{g}_S\left(x_i;\theta_0\right)\right] \tag{12.47}$$


as the J × J limiting variance-covariance of the simulators, akin to (12.12).


Note that this matrix also features noise due to the simulation; indeed one
can show by the Law of Total Variance that if the simulator is unbiased:
$$\tilde{\Omega}_0=\Omega_0+\lim_{N\to\infty}\mathrm{E}_x\left[\mathrm{Var}_u\left[\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\hat{g}_S\left(x_i;\theta_0\right)\right]\right] \tag{12.48}$$
where with regard to the second element on the right-hand side, the outer
expectation is taken with respect to the observable variables, whereas the
inner variance-covariance is taken with respect to the simulation draws. It
must be remarked, however, that as S → ∞ this second element vanishes!
Intuitively, the larger the simulation the smaller the noise associated with
it. Having these considerations in mind, it is easier to discuss the following
result, which extends one originally given by McFadden (1989).
Theorem 12.3. Asymptotic Efficiency of the Method of Simulated
Moments estimators. If some MSM estimator is based upon an unbiased
simulator $\hat{g}_S\left(x_i;\theta\right)$, and all conditions implicit in the statement of Theorem
12.1 are met, even with fixed S the estimator is consistent and its limiting
distribution is as follows.
$$\sqrt{N}\left(\hat{\theta}_{MSM}-\theta_0\right)\xrightarrow{d}\mathcal{N}\left(\mathbf{0},\left(G_0^{T}A_0G_0\right)^{-1}G_0^{T}A_0\tilde{\Omega}_0A_0G_0\left(G_0^{T}A_0G_0\right)^{-1}\right) \tag{12.49}$$
Proof. (Outline.) The proof proceeds by decomposing and reworking the
First Order Conditions of (12.42) as in the proof of Theorem 12.1, relying on
the unbiasedness of the simulator for simplifying some key expressions.
The key phrase of the Theorem's statement is "even with fixed S." The
key result is that a large simulation size is not necessary in order to guarantee
consistency, unlike in the case of MSL! The reason is that the simulation is
"washed out" in the First Order Conditions (it has no effect on average if the
simulator is unbiased), unlike in MSL, where the need for taking logarithms
complicates things. This is quite a relevant advantage of MSM, even though
it has been documented that the solution might be numerically unstable for
small values of S. Hence, there are still practical advantages in increasing
the size of the simulation wherever possible; among others, as already
mentioned, as $S\to\infty$ it follows that $\tilde{\Omega}_0\xrightarrow{p}\Omega_0$; this implies that (12.49)
and (12.11) coincide in the limit, making the estimation of the asymptotic
variance-covariance easier. When comparing MSM against MSL (which can
be relevant in contexts like that of Example 12.2), all other considerations
already made about GMM versus MLE still apply: in particular, MSM is
more robust to assumption failures, while MSL may be more efficient.


When conducting inference about the MSM estimator, it is necessary
to consistently estimate the elements of the variance-covariance matrix in (12.49).
While A0 is naturally estimated by AN whenever necessary, G0 and Ω0 are
estimated via versions of (12.17), (12.18) and (12.19) that substitute the
function g(xi; θ) evaluated at the GMM estimator with the simulator $\hat{g}_S\left(x_i;\theta\right)$
evaluated at the MSM estimator (and similarly when dealing with the HAC
case). With small S it is also necessary to estimate the component of matrix
$\tilde{\Omega}_0$ that depends on the simulation; this is done by taking the appropriate
sample analogue of the second element on the right-hand side of (12.48).
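The vanishing of the simulation component as S grows is easy to check with a toy computation. In the sketch below (illustrative; the subsimulator Λ(x + u) and all constants are made up) the pure simulation variance of ĝS at one fixed observation is estimated across independent replications for S = 1 and S = 100, and it shrinks at the 1/S rate implied by the decomposition in (12.48).

```python
import numpy as np
from scipy.special import expit  # logistic CDF, used here as a toy subsimulator

rng = np.random.default_rng(4)
x0 = 0.3        # a single fixed "observation", so only simulation noise remains
reps = 20000    # independent replications of the whole simulator

def simulator_variance(S):
    """Variance across replications of the simulator (1/S) sum_s Λ(x0 + u_s)."""
    draws = rng.normal(size=(reps, S))
    g_hat = expit(x0 + draws).mean(axis=1)
    return g_hat.var()

v1, v100 = simulator_variance(1), simulator_variance(100)
# v100 is roughly v1 / 100: the simulation component of (12.48) dies out.
```

In the full variance-covariance this term enters averaged over observations; a single point suffices to see the 1/S decay.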

12.5 Applications of GMM


Beyond providing the foundation for estimators of linear models such as
2SLS and 3SLS, what exactly are the applications of the GMM framework
in economics? One could actually make the case that all estimators in
econometrics are particular cases of GMM: as observed in passing, the
Maximum Likelihood score at the true parameter value θ0 can be seen as
a set of moment conditions. In order to better illustrate both the usefulness
and flexibility of GMM, this section discusses three particular settings
that are especially well suited to the application of this framework: generic
dynamic linear models for panel data; the estimation of production functions,
with its associated longstanding issues; and non-linear macroeconomic
models based on the rational expectations hypothesis.

Dynamic Linear Models for Panel Data


A common hypothesis made when dealing with panel data is that the model is dynamic, that is, the endogenous variables depend on their own past realizations:
$$y_{it}=\alpha+x_{it}^{T}\beta+\gamma y_{i(t-1)}+\varepsilon_{it} \tag{12.50}$$
In a model of this kind, parameter γ measures the dependence of the dependent variable on its immediate past: a recurring theme for macroeconomic variables such as, say, GDP, unemployment or inflation. A typical problem is that the error term εit is often also autoregressive, that is, it too depends on its own past:
$$\varepsilon_{it}=\rho\varepsilon_{i(t-1)}+\xi_{it}$$
for some $\rho\in\left[-1,1\right]$. This fact alone clearly invalidates the use of a simple estimator like OLS: past outcomes depend on both current and past shocks, leading to endogeneity: $\mathrm{E}\left[Y_{i(t-1)}\varepsilon_{it}\right]\neq 0$.


Another problem specific to dynamic panels is that the standard solutions to the issue of unobserved heterogeneity (observation-specific omitted factors) are not available. Suppose that α = 0 and that, instead of being autoregressive, the error term contains a constant observation-specific factor αi: a so-called fixed effect.
$$\varepsilon_{it}=\alpha_i+\epsilon_{it}$$
The fixed effect αi is by construction correlated with the lagged outcome: $\mathrm{E}\left[Y_{i(t-1)}\alpha_i\right]\neq 0$. In panel data models that are not "dynamic," the typical solution to "endogenous" fixed effects is a transformation like
$$y_{it}-\bar{y}_i=\left(x_{it}-\bar{x}_i\right)^{T}\beta+\gamma\left(y_{i(t-1)}-\bar{y}_i\right)+\varepsilon_{it}-\bar{\varepsilon}_i$$
where the upper bar applied to some variable $x_{ki}$ denotes observation-specific averages over time t, as in $\bar{x}_{ki}=T^{-1}\sum_{t=1}^{T}x_{kit}$ (this particular approach was called the "within transformation" in Lecture 10). Unfortunately, not even this method works for dynamic linear models, because the demeaned lagged outcome $Y_{i(t-1)}-\bar{Y}_i$ is mechanically correlated with the demeaned shock $\varepsilon_{it}-\bar{\varepsilon}_i$. In fact, past values of the outcome depend on all past shocks, which are in turn all included in the average shock $\bar{\varepsilon}_i$!
$$\mathrm{E}\left[\left(Y_{i(t-1)}-\bar{Y}_i\right)\left(\varepsilon_{it}-\bar{\varepsilon}_i\right)\right]\neq 0$$

The typical solution to this problem is to estimate dynamic linear panel data models via GMM, where the moment conditions include either:

• moments in levels, featuring a product between the first difference of the error term and the higher lags (one period and beyond) of the dependent variable (see Arellano and Bond, 1991; Arellano and Bover, 1995); these can be expressed as
$$\mathrm{E}\left[Y_{i(t-s)}\Delta\varepsilon_{it}\right]=0\quad\text{for }s\geq 2;$$

• moments in differences, featuring a product between the error term and the higher lags (one period and beyond) of the first difference of the dependent variable (Blundell and Bond, 1998); these write as
$$\mathrm{E}\left[\Delta Y_{i(t-s)}\varepsilon_{it}\right]=0\quad\text{for }s\geq 2;$$
or both. Notice that both approaches generally result in overidentified models. A major problem with this type of moment conditions, however, is that while they work in theory, in practice they present endemic problems of the weak instruments type (that is, weak statistical correlation between the instruments and the structural variables). The GMM framework, thanks to its ability to combine many instruments, weigh them optimally by their statistical relevance, and test their validity, is therefore well suited to addressing these issues. Unsurprisingly, the GMM framework has become dominant in the estimation of dynamic macroeconomic models and other dynamic models for panel data. As the following discussion shows, however, similar approaches are also adopted to address other, more "classical" kinds of endogeneity problems.
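To make the moments in levels concrete, here is a minimal sketch (not code from these notes) of their simplest just-identified instance: an Anderson-Hsiao type IV estimator that instruments Δy_{i(t-1)} in the first-differenced equation with the level y_{i(t-2)}. The full Arellano-Bond estimator would stack all available lags per period and weight them optimally via GMM; the balanced (N, T) array layout and the pure AR(1) specification are assumptions made for the example.

```python
import numpy as np

def ah_gamma(y):
    """IV estimate of gamma in y_it = gamma*y_i(t-1) + alpha_i + eps_it.

    Works off the first-differenced equation
        dy_it = gamma * dy_i(t-1) + d(eps_it),
    using the level y_i(t-2) as instrument, which is valid because
    E[y_i(t-2) * d(eps_it)] = 0 (a single "moment in levels" with s = 2).

    y: balanced panel, shape (N, T) with T >= 4.
    """
    dy = np.diff(y, axis=1)      # dy_it, columns correspond to t = 1..T-1
    lhs = dy[:, 1:].ravel()      # dy_it       for t = 2..T-1, pooled
    rhs = dy[:, :-1].ravel()     # dy_i(t-1)
    iv = y[:, :-2].ravel()       # y_i(t-2), the level instrument
    return (iv @ lhs) / (iv @ rhs)
```

The instrument is strong here only because γ is far from one; as γ approaches one the correlation between y_{i(t-2)} and Δy_{i(t-1)} fades, which is exactly the weak-instrument problem mentioned above.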

Estimation of Production Functions


Recall the log-log production function model (7.43) described in Lecture 7, based on the Cobb-Douglas functional form. That model can be adapted to panel data by writing, for convenience, $y_{it}\equiv\log Y_{it}$, $k_{it}\equiv\log K_{it}$, $\ell_{it}\equiv\log L_{it}$ and
$$y_{it}=\alpha_i+\beta_K k_{it}+\beta_L\ell_{it}+\omega_{it}+\varepsilon_{it} \tag{12.51}$$
where
$$\log A_{it}=\alpha_i+\omega_{it}+\varepsilon_{it}$$
that is, the logarithm of the productivity measure Ait has been separated into three components: a unit-constant factor αi and two time-varying factors ωit and εit. Why make this distinction? It is usually impossible to observe all the specific factors affecting a firm's productivity, and thus they must be treated as random shocks. The problem here is that while some of these factors can be thought of as independent of the firm's idiosyncratic decisions (exogenous), some others cannot. In fact, standard microeconomic analysis suggests that if the management of a firm experiences higher productivity Ait, at the same costs it will find it more convenient to hire more workers Lit and invest in more capital Kit.
A researcher interested in recovering empirical values for the parameters βK and βL should think twice before estimating a standard regression model for (12.51). In fact, while the exogeneity assumption may sound plausible for some components of the "random" shocks to productivity, like εit:
$$\mathrm{E}\left[\left.\varepsilon_{it}\right|k_{it},\ell_{it}\right]=0$$
(think about lucky events), it is not so for the rest:
$$\mathrm{E}\left[\left.\alpha_i,\omega_{it}\right|k_{it},\ell_{it}\right]\neq 0$$
thereby making OLS estimates inconsistent. While αi might be removed by, say, applying the within transformation, one is still left with the so-called


endogenous unobserved productivity shock ωit . To complicate things,


this shock is typically assumed (on the basis of the empirical evidence) to
present positive autocorrelation: ωit = ρωi(t−1) + ξit , where ρ ∈ [0, 1].
There are several proposed solutions to the “unobserved productivity
shock” problem in the estimation of production functions, and none of them
is perfect. The method by Blundell and Bond (1998) devised for dynamic
panel data models is one of these potential solutions, despite the fact that a
production function is not strictly speaking a dynamic model. The intuition
goes as follows: subtract $\rho y_{i(t-1)}$ from both sides of (12.51); the result is the transformed model
$$y_{it}-\rho y_{i(t-1)}=\alpha_i\left(1-\rho\right)+\beta_K\left(k_{it}-\rho k_{i(t-1)}\right)+\beta_L\left(\ell_{it}-\rho\ell_{i(t-1)}\right)+\upsilon_{it}$$
where $\upsilon_{it}\equiv\xi_{it}+\varepsilon_{it}-\rho\varepsilon_{i(t-1)}$: the "backward looking" autoregressive endogenous shock is removed, and all that is left are components of the random shocks that are arguably exogenous to appropriate lags of the first differences of the capital and labor inputs. This shows how moment conditions of the kind
$$\mathrm{E}\left[\begin{pmatrix}\Delta k_{i(t-s)}\\ \Delta\ell_{i(t-s)}\end{pmatrix}\left(\alpha_i\left(1-\rho\right)+\xi_{it}+\varepsilon_{it}-\rho\varepsilon_{i(t-1)}\right)\right]=0$$
can be used to form GMM estimators for s ≥ 2.
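A numerical check of these quasi-differencing moments is straightforward. The sketch below evaluates the sample analogue of the s = 2 conditions on a simulated balanced panel in logs (all data-generating constants are invented for the illustration); the moments should be near zero at the true parameters and visibly nonzero at distorted ones.

```python
import numpy as np

def qd_moments(theta, y, k, l):
    """Sample analogue of E[(dk_i(t-2), dl_i(t-2))' (a_i(1-rho) + u_it)] = 0,
    the quasi-differencing conditions for the production function, s = 2 only.

    theta = (rho, bK, bL); y, k, l are balanced (N, T) panels in logs.
    """
    rho, bK, bL = theta
    # Quasi-differenced residual, usable for t = 3..T-1.
    u = (y[:, 3:] - rho * y[:, 2:-1]
         - bK * (k[:, 3:] - rho * k[:, 2:-1])
         - bL * (l[:, 3:] - rho * l[:, 2:-1]))
    dk2 = k[:, 1:-2] - k[:, :-3]   # first difference of capital, lagged twice
    dl2 = l[:, 1:-2] - l[:, :-3]   # first difference of labor, lagged twice
    return np.array([(dk2 * u).mean(), (dl2 * u).mean()])
```

The differencing of the instruments is what removes the fixed effect αi; the second lag is what keeps them clear of the innovations ξit and εit.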


In practice, however, the Blundell-Bond approach applied to production
functions is known to work poorly in small samples, and it is generally less
precise than other competing estimators, similarly to the weak instruments
problem associated with GMM estimators for dynamic panel data models. For
this reason, the frontier method for the estimation of production functions
nowadays is based on a combination of moment conditions such as
$$\mathrm{E}\left[\begin{pmatrix}k_{i(t-s)}\\ \ell_{i(t-s)}\end{pmatrix}\left(y_{it}-\beta_K k_{it}-\beta_L\ell_{it}-g\left(\hat{\varphi}_{it}-\beta_K k_{i(t-1)}-\beta_L\ell_{i(t-1)}\right)\right)\right]=0$$
for s = 2, . . . , t; where $\hat{\varphi}_{it}=\hat{\varphi}\left(k_{i(t-s)},\ell_{i(t-s)},m_{i(t-s)}\right)$ is a non-parametric prediction function (e.g. a polynomial approximation) of $\alpha_i+\omega_{i(t-1)}$, obtained via suitable lags of $k_{it}$, $\ell_{it}$ and of another instrument or "shifter" variable $m_{it}$ (for example, variable input materials); and g(·) is yet another non-parametric function. In this control function approach to production function estimation, a "first step" aimed at estimating $\hat{\varphi}_{it}$ is necessary before
proceeding to GMM estimation based on the above conditions. The implied
exclusion restrictions are motivated by careful assumptions about the timing of firms' decisions; for example, if capital investment reflects changes in the firm's economic conditions with a lag (it takes time to invest in new capital and equip it), it makes sense to motivate a moment condition akin to
$$\mathrm{E}\left[\left.\varepsilon_{it},\xi_{it}\right|k_{it}\right]=0$$
as firms cannot observe ξit in time to affect their choice of kit.
For additional discussion about the more modern practices in the estimation
of production functions, see Wooldridge (2009) and Ackerberg et al. (2015).

Estimation of Rational Expectations Models


While GMM is currently seen as a wide-encompassing framework which in-
cludes many econometric estimators, its initial popularity was largely due to
its flexibility in estimating – without resorting to fully parametric assump-
tions – the possibly non-linear moment conditions derived from economic
theory. While nowadays GMM is employed for empirical research in many
fields of economics, it is traditionally especially relevant in macroeconomics,
and in particular for modeling theories based on the hypothesis of rational
expectations. An example of such an application of GMM, based on the
model of permanent income by Hall (1978) is discussed next.5
Suppose that a consumer aims at maximizing his lifetime utility from the consumption of various goods and services, as per the following intertemporal utility function:
$$U_t\left(C_t,C_{t+1},\dots,C_T\right)=\mathrm{E}_t\left[\left.\sum_{\tau=0}^{T-t}\left(\frac{1}{1+\delta}\right)^{\tau}U\left(C_{t+\tau}\right)\right|I_t\right] \tag{12.52}$$
subject to the intertemporal budget constraint
$$\sum_{\tau=0}^{T-t}\left(\frac{1}{1+r}\right)^{\tau}\left(C_{t+\tau}-W_{t+\tau}\right)=A_t \tag{12.53}$$

where: Cs are aggregate consumption expenditures at time s; U(Cs) is the associated per-period utility; Ws are the individual's earnings at time s; As is the amount of assets owned by the individual at time s; r is the interest rate (assumed constant for simplicity); and δ is the discount rate, a parameter that denotes an individual's "impatience" towards the idea of postponing consumption. Finally, Is represents quite an abstract economic concept, the information set: a set of variables (possibly also written as zs) whose value at time s affects individual expectations about future economic outcomes.5 The substance of the problem is that individuals' lifetime consumption cannot exceed, in net present value terms, the sum of their current assets and what they actually earn. However, they do not even know for certain the value of their future earnings: consequently, their future utility is also subject to stochastic uncertainty. Individuals can, however, form expectations of the form Et[Wt+τ | It] about their future earnings, conditional on their information set.

5 The treatment here adapts the one provided by William H. Greene in his leading econometrics textbook.
Intertemporal utility is clearly separable as the sum of distinct period-specific utility functions: a crucial assumption. Hall's main result was to show that, in such a case, the problem's solution is conveniently expressed as the Euler equation linking two consecutive marginal utilities:
$$\mathrm{E}_t\left[\left.U'\left(C_{t+1}\right)\right|I_t\right]=\frac{1+\delta}{1+r}\,U'\left(C_t\right) \tag{12.54}$$
which can be operationalized if one is willing to make assumptions about the functional form of U(·). The "Constant Relative Risk Aversion" (CRRA) utility function is a popular choice; it reads as follows.
$$U\left(C_t\right)=\frac{1}{1-\alpha}\,C_t^{1-\alpha}$$
With this assumption, (12.54) becomes
$$\mathrm{E}_t\left[\left.\beta\left(1+r\right)R_{t+1}^{\lambda}-1\right|I_t\right]=0 \tag{12.55}$$
where $\beta\equiv\left(1+\delta\right)^{-1}$, $\lambda\equiv-\alpha$ and $R_{t+1}=C_{t+1}/C_t$.
A researcher might be interested in the GMM estimation of the two parameters (β, λ). A natural set of moment conditions that makes the model just identified is given by the following expression.
$$\mathrm{E}_t\left[\begin{pmatrix}1\\ R_t\end{pmatrix}\left(\beta\left(1+r\right)R_{t+1}^{\lambda}-1\right)\right]=0 \tag{12.56}$$
However, if the researcher knows about some other variables zt that work as good predictors of future earnings, extra moment conditions of the form
$$\mathrm{E}_t\left[z_t\left(\beta\left(1+r\right)R_{t+1}^{\lambda}-1\right)\right]=0 \tag{12.57}$$
result in overidentification, and thus the GMM estimator of (12.57) follows from the previous discussion of instrumental variables in non-linear models. It must be appreciated that this GMM estimator has been constructed using just: i. some predictions of economic theory, and ii. a specific mean assumption about the stream of future earnings, conditional on current information It. No more detailed parametric assumptions about the distribution of future income have proven necessary for deriving the moment conditions.
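The Euler-equation moments lend themselves to a quick numerical check. The sketch below (all constants invented for the illustration) assumes i.i.d. lognormal consumption growth and backs out the constant interest rate that makes the Euler restriction hold exactly, so the sample analogue of the just-identified conditions with instruments (1, Rt) should vanish at the true (β, λ) but not at a distorted λ.

```python
import numpy as np

def euler_moments(beta, lam, R, r):
    """Sample analogue of E[(1, R_t)'(beta*(1+r)*R_{t+1}^lam - 1)] = 0,
    where R is the series of consumption growth ratios C_{t+1}/C_t."""
    u = beta * (1.0 + r) * R[1:] ** lam - 1.0            # pricing error at t+1
    Z = np.column_stack([np.ones_like(R[:-1]), R[:-1]])  # instruments in I_t
    return Z.T @ u / len(u)

rng = np.random.default_rng(3)
T, mu, sigma = 200_000, 0.02, 0.05
beta, lam = 0.96, -2.0                 # lam = -alpha, a CRRA curvature of 2
R = np.exp(mu + sigma * rng.normal(size=T))
# Interest rate consistent with beta*(1+r)*E[R^lam] = 1 under lognormality:
r = np.exp(-lam * mu - 0.5 * lam ** 2 * sigma ** 2) / beta - 1.0
g = euler_moments(beta, lam, R, r)
```

Adding further predictors zt to the instrument matrix turns these into the overidentified conditions, which can then be fed to a standard GMM minimization of a quadratic form in the moments.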

Bibliography

Ackerberg, Daniel A., Kevin Caves, and Garth Frazer, “Identifica-


tion Properties of Recent Production Function Estimators,” Economet-
rica, 2015, 83 (6), 2411–2451.

Andrews, Donald W. K. and J. Christopher Monahan, "An Improved Heteroskedasticity and Autocorrelation Consistent Covariance Matrix Estimator," Econometrica, 1992, 60 (4), 953–966.

Angrist, Joshua D. and Alan B. Krueger, “Empirical Strategies in


Labor Economics,” in O. C. Ashenfelter and D. Card, eds., Handbook of
Labor Economics, Vol. 3, Elsevier, 1999, pp. 1277–1366.

Arellano, Manuel and Olympia Bover, “Another look at the instrumen-


tal variable estimation of error-components models,” Journal of Econo-
metrics, 1995, 68 (1), 29–51.

and Stephen Bond, “Some tests of specification for panel data: Monte
Carlo evidence and an application to employment equations,” The Review
of Economic Studies, 1991, 58 (2), 277–297.

Bartlett, Maurice S., “Periodogram Analysis and Continuous Spectra,”


Biometrika, 1950, 37 (1/2), 1–16.

Becker, Gary S., “Investment in Human Capital: A Theoretical Analysis,”


Journal of Political Economy, 1962, 70 (5), 9–49.

Berry, Steven, “Estimation of a Model of Entry in the Airline Industry,”


Econometrica, 1992, 60 (4), 889–917.

and Peter Reiss, “Empirical Models of Entry and Market Structure,”


in Mark Armstrong and Robert Porter, eds., Handbook of Industrial Or-
ganization, Vol. 3, North Holland, 2007, pp. 1845–1886.


Bertrand, Marianne, Esther Duflo, and Sendhil Mullainathan,


“How Much Should We Trust Differences-In-Differences Estimates?,” The
Quarterly Journal of Economics, 2004, 119 (1), 249–275.

Bester, C. Alan, Timothy G. Conley, and Christian B. Hansen, “In-


ference with dependent data using cluster covariance estimators,” Journal
of Econometrics, 2011, 165 (2), 137–151.

Blundell, Richard and Stephen Bond, “Initial conditions and moment


restrictions in dynamic panel data models,” Journal of Econometrics,
1998, 87 (1), 115–143.

Bramoullé, Yann, Habiba Djebbari, and Bernard Fortin, “Identifi-


cation of peer effects through social networks,” Journal of Econometrics,
2009, 150 (1), 41–55.

Bresnahan, Timothy F., “Competition and collusion in the American


automobile industry: The 1955 price war,” The Journal of Industrial
Economics, 1987, 35 (4), 457–482.

Cameron, A. Colin, Jonah B. Gelbach, and Douglas L. Miller, "Robust Inference with Multiway Clustering," Journal of Business & Economic Statistics, 2011, 29 (2), 238–249.

Card, David, "Using geographic variation in college proximity to estimate the return to schooling," in L. N. Christofides, E. K. Grant, and R. Swidinsky, eds., Aspects of Labour Market Behaviour: Essays in Honour of John Vanderkamp, University of Toronto Press, 1995.

Chamberlain, Gary, “Asymptotic efficiency in estimation with condi-


tional moment restrictions,” Journal of Econometrics, 1987, 34 (3), 305–
334.

Conley, Timothy G., "GMM estimation with cross sectional dependence," Journal of Econometrics, 1999, 92 (1), 1–45.

Durbin, James, “Errors in Variables,” Review of the International Statis-


tical Institute, 1954, 22 (1/3), 23–32.

Eicker, Friedhelm, "Limit Theorems for Regressions with Unequal and Dependent Errors," in L. Le Cam and J. Neyman, eds., Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, University of California Press, 1967, pp. 59–82.


Gouriéroux, Christian and Alain Monfort, "Simulation Based Inference in Models with Heterogeneity," Annales d'Économie et de Statistique, 1991, 20/21, 69–107.

Hall, Robert E., “Stochastic implications of the life cycle-permanent in-


come hypothesis: theory and evidence,” Journal of Political Economy,
1978, 86 (6), 971–987.

Hansen, Lars P., “Large sample properties of Generalized Method of Mo-


ments estimators,” Econometrica, 1982, 50 (4), 1029–1054.

, John Heaton, and Amir Yaron, “Finite-sample properties of some


alternative GMM estimators,” Journal of Business & Economic Statis-
tics, 1996, 14 (3), 262–280.

Hausman, Jerry A., “Specification Tests in Econometrics,” Econometrica,


1978, 46 (6), 1251–1271.

Heckman, James J., “Sample selection bias as a specification error (with


an application to the estimation of labor supply functions),” Economet-
rica, 1977, 47 (1), 153–161.

Kelejian, Harry H. and Ingmar R. Prucha, “A generalized spatial


two-stage least squares procedure for estimating a spatial autoregressive
model with autoregressive disturbances,” The Journal of Real Estate Fi-
nance and Economics, 1998, 17 (1), 99–121.

and , “HAC estimation in a spatial framework,” Journal of Econo-


metrics, 2007, 140 (1), 131–154.

Klein, Lawrence Robert, Economic Fluctuations in the United States,


1921-1941, John Wiley & Sons, 1950.

Kmenta, Jan, “On Estimation of the CES Production Function,” Inter-


national Economic Review, 1967, 8 (2), 180–189.

McFadden, Daniel, “A Method of Simulated Moments for Estimation of


Discrete Response Models Without Numerical Integration,” Economet-
rica, 1989, 57 (5), 995–1026.

Mincer, Jacob A., “Investment in Human Capital and Personal Income


Distribution,” Journal of Political Economy, 1958, 66 (4), 281–302.

Moulton, Brent R., “Random Group Effects and the Precision of Regres-
sion Estimates,” Journal of Econometrics, 1986, 32 (3), 385–397.


Newey, Whitney K. and Daniel McFadden, “Large Sample Estimation


and Hypothesis Testing,” in Robert Engle and Daniel McFadden, eds.,
Handbook of Econometrics, Vol. 4, North Holland, 1994, pp. 2111–2245.

Newey, Whitney K. and Kenneth D. West, “A Simple, Positive Semi-


Definite, Heteroskedasticity and Autocorrelation Consistent Covariance
Matrix,” Econometrica, 1987, 55 (3), 703–708.

Rothenberg, Thomas J., “Identification in parametric models,” Econo-


metrica, 1971, 39 (3), 577–591.

Rubin, Donald, “Estimating Causal Effects of Treatments in Randomized


and Nonrandomized Studies,” Journal of Educational Psychology, 1974,
66 (5), 688–701.

Stock, James H., Jonathan H. Wright, and Motohiro Yogo, “A Sur-


vey of Weak Instruments and Weak Identification in Generalized Method
of Moments,” Journal of Business & Economic Statistics, 2002, 20 (4),
518–529.

Theil, Henri, "Repeated least squares applied to complete equation systems," Technical Report, Central Planning Bureau, The Hague, 1953.

Walsh, J. R., “Capital Concept Applied to Man,” Quarterly Journal of


Economics, 1935, 49 (2), 255–285.

White, Halbert, “A Heteroskedasticity-Consistent Covariance Matrix Es-


timator and a Direct Test for Heteroskedasticity,” Econometrica, 1980,
48 (4), 817–838.

Wooldridge, Jeffrey M., “Score diagnostics for linear models estimated


by two stage least squares,” in G. S. Maddala, P. C. B. Phillips, and T. N.
Srinivasan, eds., Advances in Econometrics and Quantitative Economics:
Essays in Honor of Professor C. R. Rao, Oxford: Blackwell, 1995.

, "On estimating firm-level production functions using proxy variables to control for unobservables," Economics Letters, 2009, 104 (3), 112–114.

Wu, De-Min, “Alternative Tests of Independence between Stochastic Re-


gressors and Disturbances,” Econometrica, 1973, 41 (4), 733–750.

Yitzhaki, Shlomo, "On Using Linear Regressions in Welfare Economics," Journal of Business & Economic Statistics, 1996, 14 (4), 478–486.
