Applied Economics: Instrumental Variables
Philipp Ager
University of Mannheim and CEPR
IV Lecture Notes
September 10, 16, 23 and 24, 2024
1 / 64
Non-technical Introduction to IV-Estimation
In many applications of the linear regression model, we
suspect that some regressors are endogenous
How can we solve the problem of endogeneity of one or
more explanatory variables?
Sometimes, we can find exogenous variables that are correlated
with the endogenous regressor but not correlated with the
error term
Those variables are called instrumental variables or instruments
Inconsistency of OLS
y = βx + u
Under the assumption that the regressors are uncorrelated
with the errors in the model, the OLS estimate is unbiased
and consistent
In many circumstances the regressor is correlated with the
error term, and the OLS estimate is biased and inconsistent
Ex.: if wages are regressed only on years of schooling, unobserved
ability ends up in the error term and is correlated with schooling
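A minimal simulation sketch of the wage example. All parameter values are assumptions chosen for illustration (a true return of 0.10 and an unobserved "ability" term that raises both schooling and wages); nothing here is estimated from real data.

```python
import random


def ols_slope(x, y):
    """Slope from a bivariate OLS regression of y on x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


rng = random.Random(0)
n = 50_000
true_return = 0.10  # assumed causal effect of one year of schooling on log wages

# Ability is unobserved by the researcher and raises both schooling and wages.
ability = [rng.gauss(0, 1) for _ in range(n)]
schooling = [12 + a + rng.gauss(0, 1) for a in ability]
log_wage = [1 + true_return * s + 0.5 * a + rng.gauss(0, 0.5)
            for s, a in zip(schooling, ability)]

# Regressing wages on schooling alone leaves ability in the error term.
b_ols = ols_slope(schooling, log_wage)
print(f"true return: {true_return:.2f}, OLS estimate: {b_ols:.3f}")
```

Under these assumed values the OLS slope converges to 0.10 + 0.5 · cov(S, A)/var(S) = 0.35, so a larger sample does not remove the bias.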
Threats to Identification
Applied researchers are often confronted with so-called
“endogeneity problems”
The most common threats to identification are:
1 Omitted variables
2 Reverse causality
3 Measurement error
One potential solution to circumvent these problems is the
instrumental variable approach
Definition of an Instrument
Typically derived from natural or random experiments
The aim is to find a variable z that generates only exogenous
variation in x and is not correlated with the error term
z can be used as instrumental variable for the regressor
x in the scalar regression model y = βx + u ONLY IF
(1) z is uncorrelated with the error u (exogeneity)
(2) z is correlated with the regressor (relevance)
Instrument Validity
1 Instrument is as good as randomly assigned
2 Instrument Exogeneity:
No direct effect of the instruments on the dependent variable
or through omitted variables (exclusion restriction)
No reverse effect of the dependent variable on the instruments
Convincingly describe why the instruments only influence
the endogenous regressors
3 Instrument Relevance (can be tested):
Relevant instruments are strongly correlated with the endogenous
regressors, even after controlling for exogenous regressors
Diagnostics for the detection of weak instruments
Example
Yi = α + ρSi + γAi + ei
Aim is to estimate the returns to schooling
Problem of omitted variable bias: ability Ai is unobserved, and
a good proxy for it is rarely available
IV solution: use variation in Si which is unrelated to Ai and
other unobserved variables
Three important causal effects
Call the instrumental variable Zi
There are three causal effects we can think about:
1 The causal effect of Zi on Si
2 The causal effect of Zi on Yi
3 The causal effect of Si on Yi
The last effect is the one we are ultimately interested in, the
returns to schooling ρ
Three important equations
Some “IV language” which relates to the three causal effects
of the previous slide
1 First stage: Regression of schooling on the instrument
(causal effect #1)
Si = π10 + π11 Zi + ϵ1i
2 Reduced Form: Regression of earnings on the instrument
(causal effect #2)
Yi = π20 + π21 Zi + ϵ2i
3 Structural Equation: Regression of earnings on schooling
(causal effect #3)
Yi = α + ρSi + ηi
ηi = γAi + ei is the structural error term
Important to know (1)
Conditions 1 and 3 of instrument validity (random assignment and
relevance) are sufficient for the first stage (equation #1) and the
reduced form (equation #2) to have a causal interpretation
Note: The reduced form coefficient can be interesting in
itself. For example, the instrument might be a policy variable
(e.g., compulsory schooling laws) in which case it is the
policy effect
Important to know (2)
To get the causal effect of equation (#3), i.e., the structural
parameter ρ, we also need condition 2, the exclusion restriction
This condition is often the most difficult requirement on an
instrument. It is distinct from random assignment, so having
experimental variation does not guarantee a valid interpretation
of the IV estimates
Linking the three equations
The coefficients of the three regressions are indeed linked:
Yi = α + ρSi + ηi
= α + ρ[π10 + π11 Zi + ϵ1i ] + ηi
= (α + ρπ10 ) + ρπ11 Zi + (ρϵ1i + ηi )
= π20 + π21 Zi + ϵ2i
The reduced form coefficients are:
π20 = α + ρπ10
π21 = ρπ11
Indirect Least Squares (ILS)
The IV estimate is equal to the ratio of the reduced form
coefficient on the instrument to the first stage coefficient
This is called indirect least squares
ρ = π21 / π11
Only works with one endogenous regressor and one instrument.
In this case one says the model is just identified
If there are multiple instruments for a single endogenous
regressor the model is over-identified
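A hedged sketch of indirect least squares on simulated data. The data-generating process (binary instrument, true ρ = 0.10, first-stage coefficient of 1) is an assumption made for illustration only.

```python
import random


def slope(x, y):
    """Slope from a bivariate OLS regression of y on x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


rng = random.Random(1)
n = 100_000
rho = 0.10  # assumed structural return to schooling

z = [1 if rng.random() < 0.5 else 0 for _ in range(n)]  # instrument, independent of ability
ability = [rng.gauss(0, 1) for _ in range(n)]            # unobserved
s = [12 + 1.0 * zi + a + rng.gauss(0, 1) for zi, a in zip(z, ability)]
y = [1 + rho * si + 0.5 * a + rng.gauss(0, 0.5) for si, a in zip(s, ability)]

pi_11 = slope(z, s)        # first stage: schooling on the instrument
pi_21 = slope(z, y)        # reduced form: earnings on the instrument
rho_ils = pi_21 / pi_11    # indirect least squares

print(f"first stage {pi_11:.3f}, reduced form {pi_21:.3f}, ILS {rho_ils:.3f}")
```

Because π21 = ρπ11, the ratio recovers ρ in large samples even though ability is never observed.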
Two-stage Least Squares Estimation
A standard estimation technique used by applied researchers
when using instrumental variables is the so-called two-stage
least squares estimation (2SLS)
The 2SLS estimator gets its name from the result that it can
be obtained by two consecutive OLS regressions:
OLS regression of x on the z's to get the fitted values x̂
Followed by OLS of y on x̂, which gives β̂2SLS
Standard statistical programs, such as Stata, have a routine
to estimate 2SLS
One can show that in case of one single endogenous variable
and one instrument, the 2SLS estimator is the same as the
corresponding ILS estimator.
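The two consecutive OLS regressions can be sketched on a toy dataset. The numbers below are hypothetical and serve only to show that the two-step estimate coincides with the ILS ratio when there is one instrument and one endogenous regressor.

```python
def ols(x, y):
    """Return (intercept, slope) from a bivariate OLS regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((a - mx) * (c - my) for a, c in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - b * mx, b


# Hypothetical toy data: z = instrument, x = endogenous regressor, y = outcome
z = [0, 0, 0, 0, 1, 1, 1, 1]
x = [10, 12, 11, 13, 12, 14, 13, 15]
y = [5.0, 5.6, 5.2, 5.9, 5.5, 6.1, 5.8, 6.4]

# Stage 1: regress x on z and form the fitted values x_hat.
a1, b1 = ols(z, x)
x_hat = [a1 + b1 * zi for zi in z]

# Stage 2: regress y on the fitted values.
_, beta_2sls = ols(x_hat, y)

# ILS for comparison: reduced-form slope divided by first-stage slope.
_, reduced_form = ols(z, y)
beta_ils = reduced_form / b1

print(beta_2sls, beta_ils)
```

The point estimates agree, but the standard errors from a hand-run second stage are not valid, because they are based on residuals computed with x̂ rather than x. Built-in 2SLS routines correct for this.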
Example: Angrist and Krueger (1991, QJE)
Quarter-of-birth as instruments
Most U.S. states require students to enter school in the
calendar year when they turn 6
School start age is therefore a function of date of birth (children
born in the 1st quarter enter school at an older age than
those born in the 4th quarter)
Compulsory schooling laws typically require students to remain
in school until their 16th birthday
The combination of entry-age and compulsory schooling laws
creates a natural experiment in the length of schooling students
complete before they can drop out, depending on their birthdays
Example: Angrist and Krueger (1991, QJE)
Data are from the 1980 US Census
Sample includes 329,509 men born 1930 to 1939 (i.e., they
are in their 40s when observed)
For these men, the data contain information on year of birth,
quarter of birth, years of schooling, and earnings in 1979
Example: Angrist and Krueger (1991, QJE)
Men born earlier in the calendar year tend to have lower average
schooling levels
Source: Angrist and Pischke (2009)
Example: Angrist and Krueger (1991, QJE)
Men born in early quarters almost always earned less, on average,
than those born later in the year
Source: Angrist and Pischke (2009)
Example: Angrist and Krueger (1991, QJE)
Related pattern in reduced-form and first-stage
Key assumption: Individual’s date of birth should be unrelated
to innate ability, family connections, motivation . . .
⇒ The only reason for the up-and-down quarter-of-birth pattern in
earnings is the quarter-of-birth pattern in schooling
Example: Angrist and Krueger (1991, QJE)
One way to look at IV with a binary instrument and no covariates
(e.g., Angrist and Krueger use a dummy if born in the first
quarter (Q1)), is the following:
βIV = cov(lnYi , Q1i) / cov(Si , Q1i)
= (E[lnYi |Q1i = 1] − E[lnYi |Q1i = 0]) / (E[Si |Q1i = 1] − E[Si |Q1i = 0])
Rescaling of the reduced-form difference in means by the corresponding
first-stage difference in means (= Wald estimator)
Example: Angrist and Krueger (1991, QJE)
The first stage and reduced form are:
Si = π11 + π12 Q1i + ϵ1i ,
lnYi = π21 + π22 Q1i + ϵ2i
Taking expectation conditionally on Q1i , we obtain:
E[lnYi |Q1i = 1] = π21 + π22 ,
E[lnYi |Q1i = 0] = π21
Same for the first stage. The differences in group means
with the instrument switched on/off are π12 and π22
The ratio between the two is the IV-estimator
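A small sketch of the Wald estimator with a binary instrument. The eight observations are made up for illustration and are not taken from Angrist and Krueger's data; the point is only that the ratio of differences in means equals the ratio of covariances.

```python
def mean(v):
    return sum(v) / len(v)


def cov(a, b):
    """Population covariance of two equally long lists."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)


def group_mean(values, group, level):
    """Mean of `values` over observations whose group indicator equals `level`."""
    return mean([v for v, g in zip(values, group) if g == level])


# Hypothetical toy sample: Q1 dummy, years of schooling, log earnings
q1 = [1, 1, 1, 0, 0, 0, 0, 0]
school = [11, 12, 12, 12, 13, 12, 13, 14]
ln_y = [5.80, 5.90, 5.85, 5.90, 6.00, 5.95, 6.00, 6.05]

reduced_form = group_mean(ln_y, q1, 1) - group_mean(ln_y, q1, 0)     # pi_22
first_stage = group_mean(school, q1, 1) - group_mean(school, q1, 0)  # pi_12

wald = reduced_form / first_stage
iv_cov = cov(ln_y, q1) / cov(school, q1)

print(f"Wald: {wald:.4f}, cov ratio: {iv_cov:.4f}")
```

Both differences in means are negative in this toy sample (the Q1 group has less schooling and lower earnings), so their ratio is positive.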
Example: Angrist and Krueger (1991, QJE)
Wald estimator: the return to education is the ratio of the difference
in earnings between men born in the 1st quarter and men born in other
quarters of the year to the corresponding difference in schooling
Source: Angrist and Krueger (1991)
Example: Angrist and Krueger (1991, QJE)
Source: Angrist and Pischke (2009)
Example: Angrist and Krueger (1991, QJE)
Is quarter-of-birth a good instrument?
1 Random Assignment: Births are almost uniformly spaced
over the year ... but
2 Exclusion restriction: variation in maternal characteristics
– women giving birth in winter are more likely to be teenagers
and less likely to be married or to have a high school diploma
(Buckles and Hungerman, 2013)
3 Relevance: issue of QOB being a weak instrument (see
Bound et al., 1995). Detection possible by looking at first
stage statistics
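One first-stage statistic that is often inspected is the F-statistic on the instrument. The sketch below computes it for a single instrument (where it equals the squared t-statistic of the first-stage slope) on simulated data. The strong and weak first-stage coefficients are assumed values, and the threshold of about 10 is a commonly cited rule of thumb rather than something stated in these slides.

```python
import random


def first_stage_f(z, x):
    """F-statistic for the slope in an OLS regression of x on one instrument z."""
    n = len(z)
    mz, mx = sum(z) / n, sum(x) / n
    szz = sum((zi - mz) ** 2 for zi in z)
    b = sum((zi - mz) * (xi - mx) for zi, xi in zip(z, x)) / szz
    a = mx - b * mz
    rss = sum((xi - a - b * zi) ** 2 for zi, xi in zip(z, x))
    se = (rss / (n - 2) / szz) ** 0.5  # homoskedastic standard error of the slope
    return (b / se) ** 2


rng = random.Random(2)
n = 2_000
z = [1 if rng.random() < 0.25 else 0 for _ in range(n)]  # e.g. a first-quarter dummy
noise = [rng.gauss(0, 1) for _ in range(n)]

x_strong = [12 + 1.00 * zi + e for zi, e in zip(z, noise)]  # assumed strong first stage
x_weak = [12 + 0.02 * zi + e for zi, e in zip(z, noise)]    # assumed weak first stage

f_strong = first_stage_f(z, x_strong)
f_weak = first_stage_f(z, x_weak)
print(f"F (strong): {f_strong:.1f}, F (weak): {f_weak:.1f}")
```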
Buckles and Hungerman (2013, RESTAT)
Source: Buckles and Hungerman (2013)
IV with Heterogeneous Treatment Effects
Recap: IV with a constant causal effect
Assume one binary endogenous regressor and one binary
instrument
Let Yi be the outcome of interest for unit i, Di the endogenous
regressor, and Zi the instrument
Linear model:
Yi = α + ρDi + ηi
If the instrument Zi is both uncorrelated with ηi and correlated
with the endogenous regressor Di
⇒ use 2SLS (or ILS) to obtain ρ̂IV
Heterogeneous Treatment Effects
Let’s allow for treatment effect heterogeneity (i.e., a distribution
of causal effects across individuals)
The main questions there are:
1 What is IV estimating when we have heterogeneous treatment
effects?
2 Under what assumptions will IV identify a causal effect with
heterogeneous treatment effects?
Heterogeneous Treatment Effects
Why is treatment heterogeneity important?
With heterogeneous treatment effects, we introduce a distinction
between the internal validity of a study and its external validity
Internal validity means our strategy identified a causal effect
for the population we studied (e.g., a randomized clinical trial
has a strong claim to internal validity)
External validity is the predictive value of the study’s findings
in a different context
Under homogeneous treatment effects, there is no tension
between external and internal validity because everyone
has the same treatment effect, ρ
Local average treatment effects (LATE) – (1)
Let’s adopt a generalized potential outcomes concept, indexed
against both instruments and treatment status
Let Yi (d, z) denote the potential outcome of individual i were
this person to have treatment status Di = d and instrument
value Zi = z
This tells us, what the outcome of i would be given alternative
combinations of Di and Zi
We can think of instrumental variables as initiating a causal
chain where the instrument Zi affects the variable of interest,
Di , which in turn affects outcome, Yi
Local average treatment effects (LATE) – (2)
Let D1i be i’s treatment status when Zi = 1 and D0i the
treatment status when Zi = 0
The sub-populations
LATE framework partitions any population with an instrument
into a set of three instrument-dependent subgroups:
Compliers: the subpopulation with D1i = 1 and D0i = 0
Always-takers: the subpopulation with D1i = D0i = 1
Never-takers: the subpopulation with D1i = D0i = 0
There is also a group called defiers, who do exactly the
opposite of what the assignment (instrument) wants them to
do (we will rule them out later)
Local average treatment effects (LATE) – (3)
Let’s think about the compliance behavior of the different
units, that is how they respond to different values of the
instrument in terms of the treatment received
There are four possible pairs of values (Di (0), Di (1)), i.e., (D0i , D1i )
in the notation above, given the binary nature of the treatment and instrument
Problem: we only see the pair (Zi , Di ), not the pair (Di (0), Di (1))
Local average treatment effects (LATE) – (4)
Only one of the potential treatment assignments, D1i
and D0i , is ever observed for any one person
The observed treatment status is therefore:
Di = D0i + (D1i − D0i )Zi
= π0 + π1i Zi + ηi
π0 ≡ E[D0i ]
π1i ≡ (D1i - D0i ) is the heterogeneous causal effect of the
instrument on Di
The average causal effect of Zi on Di is E[π1i ]
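The switching equation Di = D0i + (D1i − D0i)Zi can be illustrated with a small hypothetical population. The type labels and instrument values below are made up, and defiers are left out.

```python
# Potential treatment status (D0, D1) with the instrument off / on, by type.
TYPES = {
    "complier": (0, 1),
    "always-taker": (1, 1),
    "never-taker": (0, 0),
}

# Hypothetical population and instrument assignment
units = ["complier", "complier", "always-taker", "never-taker", "complier", "never-taker"]
z = [1, 0, 1, 1, 1, 0]

d_obs = []  # observed treatment D_i
pi_1 = []   # individual effect of the instrument on treatment, D1i - D0i
for label, zi in zip(units, z):
    d0, d1 = TYPES[label]
    pi_1.append(d1 - d0)
    d_obs.append(d0 + (d1 - d0) * zi)  # only one of D0i, D1i is revealed

avg_effect_of_z_on_d = sum(pi_1) / len(pi_1)  # E[pi_1i]
share_compliers = units.count("complier") / len(units)

print(d_obs, avg_effect_of_z_on_d, share_compliers)
```

Without defiers, π1i is 1 for compliers and 0 for everyone else, so E[π1i] is the share of compliers.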
Local average treatment effects (LATE) – (5)
Table 2 summarizes the information about compliance behavior
from observed treatment status and instrument
Assumptions for Identification (1)
1 Independence assumption: the instrument is as good as
randomly assigned – it is independent of the vector of potential
outcomes and potential treatment assignments
sufficient for a causal interpretation of the reduced form (i.e.,
the effect of the instrument on Y)
the first stage captures the causal effect of Zi on Di
2 Exclusion restriction: the instrument affects the outcome only
through the treatment, i.e., Yi (d, 0) = Yi (d, 1) for d = 0, 1
this is distinct from the claim that the instrument is (as good
as) randomly assigned
Assumptions for Identification (2)
Example: Draft lottery to estimate causal effect of military
service on earnings (Angrist, 1990)
The exclusion restriction would be violated if low lottery numbers
affected schooling by people avoiding the draft
If this was the case, then the lottery number would be correlated
with earnings through at least two channels
One, through the instrument’s effect on military service, and
two, through the instrument’s effect on schooling
The implication is that a randomly assigned lottery number
(independence) does not by itself imply that the exclusion
restriction is satisfied
It is a claim about a unique channel for causal effects of the
instrument
Assumptions for Identification (3)
3 Existence of a first stage
E[D1i − D0i ] is not 0
Z needs to have some statistically significant effect on the
average probability of treatment
Example: having a low lottery number. Does it increase the
average probability of military service? If so, then it satisfies
the first stage requirement
Note: this can be tested
Assumptions for Identification (4)
4 Monotonicity assumption
The instrument may have no effect on some people, but all those
who are affected are affected in the same way (D1i ≥ D0i for
all i, or D1i ≤ D0i for all i)
It is not the case that the instrument pushes some people
into treatment while pushing others out ⇒ no defiers
Local average treatment effects (LATE) – (6)
Given these four assumptions one can interpret the coefficient
of interest, ρ as the local average treatment effect (LATE)
ρIV,LATE = (Effect of Z on Y) / (Effect of Z on D)
= (E[Yi |Zi = 1] − E[Yi |Zi = 0]) / (E[Di |Zi = 1] − E[Di |Zi = 0])
= E[Yi (D1i , 1) − Yi (D0i , 0)] / E[D1i − D0i ]
= E[(Y1i − Y0i )|D1i − D0i = 1]
Local average treatment effects (LATE) – (6)
ρIV,LATE is the average causal effect of D on Y for those
whose treatment status was changed by the instrument Z
We can see this from the conditioning term in the last line
of the derivation: D1i − D0i
So, for those people for whom that is equal to 1, we calculate
the difference in potential outcomes, which means we are
only averaging over treatment effects for compliers
This is why the parameter we are estimating is “local”
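A deterministic sketch of why the estimand is local. The population below is hypothetical: each compliance type has its own treatment effect, and each type is split evenly across Zi = 0 and Zi = 1 so that independence holds exactly.

```python
def mean(v):
    return sum(v) / len(v)


# (type, D0, D1, Y0, Y1, Z); all values are made up for illustration.
population = (
    [("complier", 0, 1, 10.0, 12.0, z) for z in (0, 0, 1, 1)]
    + [("always-taker", 1, 1, 20.0, 25.0, z) for z in (0, 1)]
    + [("never-taker", 0, 0, 5.0, 6.0, z) for z in (0, 1)]
)

rows = []
for _, d0, d1, y0, y1, z in population:
    d = d0 + (d1 - d0) * z  # observed treatment
    y = y0 + (y1 - y0) * d  # observed outcome; Z affects Y only through D
    rows.append((z, d, y))

reduced_form = mean([y for z, d, y in rows if z == 1]) - mean([y for z, d, y in rows if z == 0])
first_stage = mean([d for z, d, y in rows if z == 1]) - mean([d for z, d, y in rows if z == 0])
wald = reduced_form / first_stage

late = mean([y1 - y0 for t, _, _, y0, y1, _ in population if t == "complier"])
ate = mean([y1 - y0 for _, _, _, y0, y1, _ in population])

print(f"Wald/IV: {wald}, complier effect: {late}, ATE: {ate}")
```

In this constructed population the IV ratio equals the complier effect (2.0) and differs from the average treatment effect over everyone (2.5).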
Local average treatment effects (LATE) – (7)
Some more formalities . . .
Consider the least squares regression of Y on a constant
and Z (reduced form). The slope coefficient from this regression
is E[Yi |Zi = 1] − E[Yi |Zi = 0]
Consider the first term:
E[Yi |Zi = 1] = E[Yi |Zi = 1, c] · P(c|Zi = 1) + E[Yi |Zi = 1, n] · P(n|Zi = 1)
+ E[Yi |Zi = 1, a] · P(a|Zi = 1)
= E[Y1i |c] · πc + E[Y0i |n] · πn + E[Y1i |a] · πa
where πc = share of compliers in the population; πn = share of never-takers in the
population; and πa = share of always-takers in the population (*)
Local average treatment effects (LATE) – (8)
Consider the second term:
E[Yi |Zi = 0] = E[Yi |Zi = 0, c] · P(c|Zi = 0) + E[Yi |Zi = 0, n] · P(n|Zi = 0)
+ E[Yi |Zi = 0, a] · P(a|Zi = 0)
= E[Y0i |c] · πc + E[Y0i |n] · πn + E[Y1i |a] · πa
Hence the difference is:
E[Yi |Zi = 1] − E[Yi |Zi = 0] = E[Y1i − Y0i |compliers] · πc
The same argument can be used to show that the slope
coefficient in the regression of D on Z is:
E[Di |Zi = 1] − E[Di |Zi = 0] = πc
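The cancellation behind both differences can be written out term by term. This is a sketch in the notation of the slides, using the fact that always-takers have Di = 1 and never-takers have Di = 0 whatever the value of Zi:

```latex
\begin{align*}
E[Y_i \mid Z_i = 1] - E[Y_i \mid Z_i = 0]
  &= \bigl(E[Y_{1i} \mid c] - E[Y_{0i} \mid c]\bigr)\,\pi_c
   + \bigl(E[Y_{0i} \mid n] - E[Y_{0i} \mid n]\bigr)\,\pi_n
   + \bigl(E[Y_{1i} \mid a] - E[Y_{1i} \mid a]\bigr)\,\pi_a \\
  &= E[Y_{1i} - Y_{0i} \mid c]\,\pi_c , \\[4pt]
E[D_i \mid Z_i = 1] - E[D_i \mid Z_i = 0]
  &= (\pi_c + \pi_a) - \pi_a = \pi_c .
\end{align*}
```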
Local average treatment effects (LATE) – (9)
Hence, the instrumental variables estimand, the ratio of these
two reduced form estimands, is equal to the local average
treatment effect
βIV,LATE = (E[Yi |Zi = 1] − E[Yi |Zi = 0]) / (E[Di |Zi = 1] − E[Di |Zi = 0])
= E[Y1i − Y0i |compliers]
The key insight is that the data are informative solely about
the average effect for compliers
The data are not informative about the average effect for
never-takers because they are never seen receiving treatment
The data are also not informative about the average effect for
always-takers because they are never seen without treatment
Distribution of Compliance Types* (p.41)
Under random assignment and monotonicity we can estimate
the distribution of compliance types
πa = P(D1i = D0i = 1) = E[Di |Zi = 0]
πc = P(D1i = 1, D0i = 0) = E[Di |Zi = 1] − E[Di |Zi = 0]
πn = P(D1i = D0i = 0) = 1 − E[Di |Zi = 1]
We can then consider average outcomes by instrument and
treatment
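The three shares can be computed directly from observed (Di, Zi) pairs. A minimal sketch with made-up data; the function name `compliance_shares` is hypothetical:

```python
def mean(v):
    return sum(v) / len(v)


def compliance_shares(d, z):
    """Return (pi_a, pi_c, pi_n) from observed treatment d and binary instrument z.

    Relies on random assignment of z and monotonicity (no defiers).
    """
    d_z1 = mean([di for di, zi in zip(d, z) if zi == 1])  # E[D | Z = 1]
    d_z0 = mean([di for di, zi in zip(d, z) if zi == 0])  # E[D | Z = 0]
    pi_a = d_z0          # treated even with the instrument off
    pi_c = d_z1 - d_z0   # the first stage
    pi_n = 1 - d_z1      # untreated even with the instrument on
    return pi_a, pi_c, pi_n


# Hypothetical data
z = [0, 0, 0, 0, 1, 1, 1, 1]
d = [0, 0, 1, 0, 1, 1, 1, 0]

pi_a, pi_c, pi_n = compliance_shares(d, z)
print(pi_a, pi_c, pi_n)
```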
Average Outcomes by Instrument and Treatment
E[Yi |Di = 0, Zi = 1] = E[Y0i |n]
E[Yi |Di = 1, Zi = 0] = E[Y1i |a]
E[Yi |Di = 0, Zi = 0] = πc /(πc + πn ) · E[Y0i |c] + πn /(πc + πn ) · E[Y0i |n]
E[Yi |Di = 1, Zi = 1] = πc /(πc + πa ) · E[Y1i |c] + πa /(πc + πa ) · E[Y1i |a]
From this we can infer the average outcomes for compliers,
E[Y0i |c] and E[Y1i |c], and thus the average effect for compliers
E[Y1i − Y0i |compliers]
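Solving the two mixture equations for the complier means gives a simple recipe, sketched below on hypothetical (Zi, Di, Yi) observations. The rearranged formulas in the comments follow from the four cell means above; the data are made up.

```python
def mean(v):
    return sum(v) / len(v)


def cell_mean(rows, d, z):
    """Average outcome among observations with D = d and Z = z."""
    return mean([y for zi, di, y in rows if di == d and zi == z])


# Hypothetical (Z, D, Y) observations
rows = [
    (0, 0, 10.0), (0, 0, 10.0), (0, 1, 25.0), (0, 0, 5.0),
    (1, 1, 12.0), (1, 1, 12.0), (1, 1, 25.0), (1, 0, 5.0),
]

# Compliance shares from the first stage
d_z1 = mean([d for z, d, _ in rows if z == 1])
d_z0 = mean([d for z, d, _ in rows if z == 0])
pi_a, pi_c, pi_n = d_z0, d_z1 - d_z0, 1 - d_z1

y0_n = cell_mean(rows, d=0, z=1)  # E[Y0 | n]: untreated although Z = 1
y1_a = cell_mean(rows, d=1, z=0)  # E[Y1 | a]: treated although Z = 0

# E[Y0 | c] = ((pi_c + pi_n) * E[Y | D=0, Z=0] - pi_n * E[Y0 | n]) / pi_c
y0_c = ((pi_c + pi_n) * cell_mean(rows, d=0, z=0) - pi_n * y0_n) / pi_c
# E[Y1 | c] = ((pi_c + pi_a) * E[Y | D=1, Z=1] - pi_a * E[Y1 | a]) / pi_c
y1_c = ((pi_c + pi_a) * cell_mean(rows, d=1, z=1) - pi_a * y1_a) / pi_c

late = y1_c - y0_c
print(y0_c, y1_c, late)
```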
Proportion of treated who are compliers
We can also tell what proportion of the treated are compliers:
πc|Di =1 = P(Zi = 1)(E[Di |Zi = 1] − E[Di |Zi = 0]) / P(Di = 1)
that is, the first stage times the probability that the instrument is switched on,
divided by the proportion treated
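A short derivation sketch of this expression via Bayes' rule. It uses two facts from the LATE setup: compliers are treated exactly when Zi = 1, and the instrument is independent of compliance type.

```latex
\begin{align*}
\pi_{c \mid D_i = 1} = P(c \mid D_i = 1)
  &= \frac{P(D_i = 1 \mid c)\, P(c)}{P(D_i = 1)} \\
  &= \frac{P(Z_i = 1 \mid c)\, \pi_c}{P(D_i = 1)} \\
  &= \frac{P(Z_i = 1)\,\bigl(E[D_i \mid Z_i = 1] - E[D_i \mid Z_i = 0]\bigr)}{P(D_i = 1)} .
\end{align*}
```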
Example: Probability of Compliance
General Remarks (1)
E[Y1i − Y0i |Di = 1] — the effect of treatment on the treated
is a weighted average of effects on always-takers and compliers
(see Angrist and Pischke (2009, p. 159))
E[Y1i − Y0i |Di = 0] — the average effect of treatment on the
non-treated is a weighted average of effects on never-takers
and compliers
E[Y1i − Y0i ] — the unconditional average treatment effect is
a weighted average of the effects on compliers, always-takers,
and never-takers
Because an IV is not directly informative about effects on always-takers and
never-takers, instruments do not usually capture the average causal effect
on all of the treated or on all of the non-treated
Exceptions: instruments that allow no always-takers or no never-takers
General Remarks (2)
Although we cannot consistently estimate the average effect of
the treatment for always-takers and never-takers, we do have
some information and can estimate E[Y0i |n] and E[Y1i |a]
To check for heterogeneity in outcomes by compliance status,
compare average Y0i of never-takers and compliers, and average
Y1i of always-takers and compliers
If these outcomes are found to be substantially different in levels
⇒ less plausible that the average effect for compliers is indicative
of average effects for other compliance types
If these outcomes are similar between never-takers and compliers
and between always-takers and compliers ⇒ more plausible that average
treatment effects for these groups are also comparable
General Remarks (3)
The basic LATE result generalizes in three important ways:
Multiple instruments (e.g., a set of quarter-of-birth dummies)
Models with covariates (e.g., controls for year of birth)
Models with variable and continuous treatment intensity (e.g.,
years of schooling)
In all three cases: the IV estimand is a weighted average of
causal effects for instrument-specific compliers
The econometric tool remains 2SLS and the interpretation
remains fundamentally similar to the basic LATE result, with
a few bells and whistles
IV — More Applications
Example: Devereux and Hart (2010)
Do students benefit from compulsory schooling?
Previous studies provide mixed evidence (high returns in
US but low returns in Europe)
Devereux and Hart find modest effects for men and no positive
returns for women
Strategy: exploit a major change in the compulsory schooling law (CSL)
in Britain in 1947 ⇒ school leaving age increased from 14 to 15
Example: Devereux and Hart (2010)
Persons born before April 1933 faced a minimum leaving age of 14,
while persons born after April 1933 faced a minimum leaving age of 15
The fraction leaving school before age 15 fell from over
60% for the 1932 cohort to about 10% for the 1934 cohort
Example: Devereux and Hart (2010)
Aim: Estimate the relationships between the law change
and schooling, wages and earnings
Estimation approach: Two-stage least squares (2SLS)
using the CSL change as IV for schooling
Data: General Household Survey (GHS) for 1979-98 – national
survey of people living in private households
Sample: includes individuals born between 1921 and 1951
(age 28-64 in survey)
Variables of interest
Schooling: age at which person left school
Wages: weekly earnings and hourly earnings
Example: Devereux and Hart (2010)
Reduced-form: regresses log weekly earnings (Y ) on a
quartic function of year-of-birth (YOB) and the LAW variable
ln(Yi ) = γ0 + γ1 LAWi + f (YOBi ) + ei
First-stage: regresses age left school (SCH) on a quartic
function of year-of-birth (YOB) and the LAW variable
SCHi = α0 + α1 LAWi + f (YOBi ) + ϵi
Structural equation: regresses log weekly earnings on
age left school and a quartic function of year-of-birth
ln(Yi ) = β0 + β1 SCHi + f (YOBi ) + µi
Example: Devereux and Hart (2010)
Col. 1-3 (first row): the law increased the average school
leaving age by about a half of a year
Col. 4-6 (first row): the law has a positive but insignificant
effect on weekly earnings
Col. 7-9 (first row): the 2SLS coefficients imply that one
extra year of schooling increases earnings by about 2%, but the
coefficients are always statistically insignificant
Homework: Check if 2SLS coefficients equal the reduced
form coefficient divided by the first stage coefficient
Other rows are robustness checks to compare results to
Oreopoulos (2006)
Example: Devereux and Hart (2010)
Summary
Evaluates how change of school leaving age in Britain in
1947 from age 14 to age 15 affected earnings
Modest effect of an extra year of schooling for men and no
extra return for women
Indicates rather low returns to schooling for those who dropped
out of school early
Pros and Cons of IV (Becker 2016)
Homework
Read Becker’s article “Using instrumental variables to establish
causality”, available on Ilias
Questions
1 What conditions does a credible instrument need to satisfy?
2 Which of the conditions are easier to satisfy and why?
3 Can you think of examples that would invalidate the IV used
in Devereux and Hart (2010)?
4 What is a local average treatment effect (LATE)?
5 When can the IV estimate be interpreted as LATE?