
Causal Inference Course 1

September 2019 Potsdam

Causal Inference and Machine Learning


Guido Imbens, imbens@stanford.edu

Course Description
The course will cover topics on the intersection of causal inference and machine learning.
There will be particular emphasis on the use of machine learning methods for estimating
causal effects. In addition there will be some discussion of basic machine learning methods
that we view as useful tools for empirical economists.

Lectures
There will be six lectures.

Background Reading
We strongly recommend that participants read these articles in preparation for the course.

• Athey, Susan, and Guido W. Imbens. ”The state of applied econometrics: Causality
and policy evaluation.” Journal of Economic Perspectives 31.2 (2017): 3-32.

Course Outline

1. Monday September 9th, 14.30-16.00: Introduction to Causal Inference

(a) Holland, Paul W. ”Statistics and causal inference.” Journal of the American sta-
tistical Association 81.396 (1986): 945-960.
(b) Imbens, Guido W., and Donald B. Rubin. Causal inference in statistics, social,
and biomedical sciences. Cambridge University Press, 2015.
(c) Imbens, Guido W., and Jeffrey M. Wooldridge. ”Recent developments in the
econometrics of program evaluation.” Journal of economic literature 47.1 (2009):
5-86.

2. Monday, September 9th 16.30-18.00: Introduction to Machine Learning Concepts

(a) S. Athey (2018, January) “The Impact of Machine Learning on Economics,” Sec-
tions 1-2. http://bit.ly/2EENtvy
(b) H. R. Varian (2014) “Big data: New tricks for econometrics.” The Journal of
Economic Perspectives, 28 (2):3-27. http://pubs.aeaweb.org/doi/pdfplus/
10.1257/jep.28.2.3
(c) S. Mullainathan and J. Spiess (2017) “Machine learning: an applied econometric
approach” Journal of Economic Perspectives, 31(2):87-106 http://pubs.aeaweb.
org/doi/pdfplus/10.1257/jep.31.2.87

(d) L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen (1984) “Classification
and regression trees,” CRC Press.
(e) Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. The elements of statis-
tical learning. Vol. 1. No. 10. New York, NY, USA:: Springer series in statistics,
2001.
(f) I. Goodfellow, Y. Bengio, and A. Courville (2016) “Deep Learning.” MIT Press.
3. Tuesday, September 10th, 10.30-12.00: Causal Inference: Average Treatment Effects
with Many Covariates
(a) A. Belloni, V. Chernozhukov, and C. Hansen (2014) “High-dimensional methods
and inference on structural and treatment effects.” The Journal of Economic Per-
spectives, 28(2):29-50. http://pubs.aeaweb.org/doi/pdfplus/10.1257/jep.
28.2.29
(b) V. Chernozhukov, D. Chetverikov, M. Demirer, E. Duflo, C. Hansen, W. Newey,
and J. Robins (2017, December) “Double/Debiased Machine Learning for Treat-
ment and Causal Parameters.” https://arxiv.org/abs/1608.00060.
(c) Athey, Susan, Guido W. Imbens, and Stefan Wager. ”Approximate residual
balancing: debiased inference of average treatment effects in high dimensions.”
Journal of the Royal Statistical Society: Series B (Statistical Methodology) 80.4
(2018): 597-623.
(d) S. Athey, G. Imbens, and S. Wager (2016) “Estimating Average Treatment Effects:
Supplementary Analyses and Remaining Challenges.” http://arXiv.org/abs/
1702.01250. Forthcoming, Journal of the Royal Statistical Society, Series B.
4. Tuesday, September 10th, 13.15-14.45: Causal Inference: Heterogeneous Treatment
Effects
(a) S. Wager and S. Athey (2017) “Estimation and inference of heterogeneous treat-
ment effects using random forests.” Journal of the American Statistical Associa-
tion http://arxiv.org/abs/1510.04342
(b) S. Athey, Tibshirani, J., and S. Wager (2017, July) “Generalized Random Forests”
http://arxiv.org/abs/1610.01271
5. Tuesday, September 10th, 15.15-16.45: Causal Inference: Experimental Design and
Multi-armed Bandits
(a) S. Athey and S. Wager (2017) “Efficient Policy Learning.” http://arXiv.org/
abs/1702.02896.
(b) M. Dudik, D. Erhan, J. Langford, and L. Li, (2014) “Doubly Robust Policy
Evaluation and Optimization” Statistical Science, Vol 29(4):485-511.
(c) S. Scott (2010), “A modern Bayesian look at the multi-armed bandit,” Applied
Stochastic Models in Business and Industry, vol 26(6):639–658.
(d) M. Dimakopoulou, S. Athey, and G. Imbens (2017). “Estimation Considerations
in Contextual Bandits.” http://arXiv.org/abs/1711.07077.

6. Wednesday, September 11th, 10.00-11.30: Synthetic Control Methods and Matrix Com-
pletion

(a) S. Athey, M. Bayati, N. Doudchenko, G. Imbens, and K. Khosravi (2017) “Matrix
Completion Methods for Causal Panel Data Models.” http://arXiv.org/abs/
1710.10251.
(b) J. Bai (2009), “Panel data models with interactive fixed effects.” Econometrica,
77(4): 1229–1279.
(c) E. Candès and B. Recht (2009) “Exact matrix completion via convex optimiza-
tion.” Foundations of Computational mathematics, 9(6):717-730.
Causal Inference
and Machine Learning

Guido Imbens – Stanford University

Lecture 1:

Introduction to Causal Inference

Potsdam Center for Quantitative Research


Monday September 9th, 14.30-16.00
Outline

1. Causality: Potential Outcomes, Multiple Units, and the Assignment Mechanism

2. Fisher Randomization Tests

3. Neyman’s Repeated Sampling Approach

4. Stratified Randomized Experiments

1
1. Causality: Potential Outcomes, Multiple Units, and
the Assignment Mechanism

Three key notions underlie the general approach to causality.

First, potential outcomes, each corresponding to one of the various levels of a
treatment or manipulation.

Second, the presence of multiple units, and the related stability assumption.

Third, the central role of the assignment mechanism, which is crucial for
inferring causal effects and serves as the organizing principle.

2
1.1 Potential Outcomes

Given a unit and a set of actions, we associate each action/unit pair with a
potential outcome: “potential” because only one will ultimately be realized and
therefore possibly observed: the potential outcome corresponding to the action
actually taken at that time.

The causal effect of the action or treatment involves the comparison of these
potential outcomes, some realized (and perhaps observed) and others not
realized and thus not observed.

Y(0) denotes the outcome given the control treatment,
Y(1) denotes the outcome given the active treatment.
W ∈ {0, 1} denotes the treatment indicator;
we observe W and Y^obs = Y(W) = W · Y(1) + (1 − W) · Y(0).
3
Is this useful?

• The potential outcome notion is consistent with the way economists think
about demand functions: quantities demanded at different prices.

• Some causal questions become trickier: e.g., the causal effect of race on
economic outcomes. One solution is to make the manipulation precise: change the
names on CVs for job applications (Bertrand and Mullainathan).

• What is the causal effect of physical appearance, height, or gender on
earnings, or of obesity on health? Strong statistical correlations, but what do
they mean? Many manipulations are possible, probably all with different causal
effects.
4
1.2 Multiple Units

Because we cannot learn about causal effects from a single observed outcome, we
must rely on multiple units exposed to different treatments to make causal
inferences.

By itself, however, the presence of multiple units does not solve the problem
of causal inference. Consider a drug (aspirin) example with two units—you and
I—and two possible treatments for each unit—aspirin or no aspirin.

There are now a total of four treatment levels: you take an aspirin and I do
not, I take an aspirin and you do not, we both take an aspirin, or we both do
not.

5
In many situations it may be reasonable to assume that treatments applied to
one unit do not affect the outcome for another (Stable Unit Treatment Value
Assumption, Rubin, 1978).

• In agricultural fertilizer experiments, researchers have taken care to
separate plots using “guard rows,” unfertilized strips of land between
fertilized areas.

• In large scale job training programs the outcomes for one individual may
well be affected by the number of people trained when that number is
sufficiently large to create increased competition for certain jobs (Crepon,
Duflo et al.)

• In the peer effects / social interactions literature these interaction
effects are the main focus.
6
Six Observations from the GAIN Experiment in Los Angeles

Individual    Potential Outcomes     Actual        Observed Outcome
              Yi(0)      Yi(1)       Treatment     Yi^obs

1             66         ?           0             66
2             0          ?           0             0
3             0          ?           0             0
4             ?          0           1             0
5             ?          607         1             607
6             ?          436         1             436

Note: (Yi(0), Yi(1)) fixed for i = 1, . . . , 6. (W1, . . . , W6) is stochastic.

7
1.3 The Assignment Mechanism

The key piece of information is how each individual came to receive the
treatment level received: in our language of causation, the assignment
mechanism,

Pr(W | Y(0), Y(1), X)

Known, with no dependence on Y(0), Y(1): randomized experiment (first three
lectures).

Unknown, with no dependence on Y(0), Y(1): unconfounded assignment / selection
on observables (later in the course).

• Compare with the conventional focus on the distribution of outcomes given
explanatory variables. Here it is the other way around, e.g.,

Yi^obs | Wi ∼ N(α + β · Wi, σ²)
8
1.4 Graphical Models for Causality

In graphical models the causal relationships are captured by arrows (Pearl,
1995, 2000).

[Figure: a directed acyclic graph with nodes Z0, Z1, Z2, Z3, B, X, and Y
connected by arrows.]
9
Differences between Directed Acyclic Graphs (DAGs)
and the Potential Outcome Framework

• DAGs are all about identification, not about estimation.

• Causes need not be manipulable in DAGs.

• No special role for randomized experiments.

• Difficult to capture shape restrictions, e.g., monotonicity or convexity,
that are common in economics, for example in instrumental variables.

• Pearl views DAG assumptions as more accessible than potential outcome
assumptions.
10
2. Randomized Experiments: Fisher Exact P-values

Given data from a randomized experiment, Fisher was interested in testing sharp
null hypotheses, that is, null hypotheses under which all values of the
potential outcomes for the units in the experiment are either observed or can
be inferred.

Notice that this is distinct from the question of whether the average treatment
effect across units is zero.

The null of a zero average is a much weaker hypothesis because the average
effect of the treatment may be zero even if for some units the treatment has a
positive effect, as long as for others the effect is negative.

11
2.1 Basics

Because the null hypothesis is sharp we can determine the distribution of any
test statistic T (a function of the stochastic assignment vector, W, the
observed outcomes, Y^obs, and pretreatment variables, X) generated by the
randomization of units across treatments.

The test statistic is stochastic solely through the stochastic nature of the
assignment vector, leading to the randomization distribution of the test
statistic.

Using this distribution, we can compare the observed test statistic, T^obs,
against its distribution under the null hypothesis.

The Fisher exact test approach entails two choices: (i) the choice of the sharp
null hypothesis, and (ii) the choice of test statistic.
12
We will test the sharp null hypothesis that the program had absolutely no
effect on earnings, that is:

H0 : Yi(0) = Yi(1) for all i = 1, . . . , 6.

Under this null hypothesis, the unobserved potential outcomes are equal to the
observed outcomes for each unit. Thus we can fill in all six of the missing
entries using the observed data.

This is the first key point of the Fisher approach: under the sharp null
hypothesis all the missing values can be inferred from the observed ones.

13
Six Observations from the GAIN Experiment in Los Angeles

Individual    Potential Outcomes     Actual        Observed Outcome
              Yi(0)      Yi(1)       Treatment     Yi^obs

1             66         (66)        0             66
2             0          (0)         0             0
3             0          (0)         0             0
4             (0)        0           1             0
5             (607)      607         1             607
6             (436)      436         1             436

14
Now consider testing this null against the alternative hypothesis that
Yi(0) ≠ Yi(1) for some units, based on the test statistic:

T1 = T(W, Y^obs) = (1/3) Σ_{i=1}^{6} Wi · Yi^obs − (1/3) Σ_{i=1}^{6} (1 − Wi) · Yi^obs

= (1/3) Σ_{i=1}^{6} Wi · Yi(1) − (1/3) Σ_{i=1}^{6} (1 − Wi) · Yi(0).

For the observed data the value of the test statistic is
(Y4^obs + Y5^obs + Y6^obs − Y1^obs − Y2^obs − Y3^obs)/3 = 325.6.

Suppose, for example, that instead of the observed assignment vector
W^obs = (0, 0, 0, 1, 1, 1)′ the assignment vector had been
W̃ = (0, 1, 1, 0, 1, 0). Under this assignment vector the test statistic would
have been (−Y1^obs + Y2^obs + Y3^obs − Y4^obs + Y5^obs − Y6^obs)/3 = 35.
15
Randomization Distribution for six observations from GAIN data

W1   W2   W3   W4   W5   W6     levels     ranks

0    0    0    1    1    1      325.6      1.00
0    0    1    0    1    1      325.6      1.67
0    0    1    1    0    1      -79.0      -1.67
0    0    1    1    1    0      35.0       -1.00
0    1    0    0    1    1      325.6      2.33
0    1    0    1    0    1      -79.0      -1.00
0    1    0    1    1    0      35.0       -0.33
...  ...  ...  ...  ...  ...    ...        ...
1    1    1    0    0    0      -325.6     -1.00

16
Given the distribution of the test statistic, how unusual is this observed
average difference (325.6), assuming the null hypothesis is true?

One way to formalize this question is to ask how likely it is (under the
randomization distribution) to observe a value of the test statistic that is as
large in absolute value as the one actually observed.

Simply counting, we see that there are eight assignment vectors with a
difference between treated and control averages of at least 325.6 in absolute
value, out of a set of twenty possible assignment vectors. This implies a
p-value of 8/20 = 0.40.

17
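A minimal sketch of this calculation: enumerating all C(6, 3) = 20 assignment
vectors for the six GAIN observations under the sharp null and counting how
often the absolute difference in means is at least as large as the observed
325.6.

```python
# Fisher exact p-value for the six-unit GAIN example, enumerating all
# C(6,3) = 20 assignment vectors under the sharp null Yi(0) = Yi(1).
from itertools import combinations

y_obs = [66, 0, 0, 0, 607, 436]   # observed outcomes, units 1-6
treated_obs = {3, 4, 5}           # units 4, 5, 6 (0-indexed)

def diff_in_means(treated):
    t = [y_obs[i] for i in range(6) if i in treated]
    c = [y_obs[i] for i in range(6) if i not in treated]
    return sum(t) / len(t) - sum(c) / len(c)

t_obs = diff_in_means(treated_obs)            # 325.67
draws = [diff_in_means(set(s)) for s in combinations(range(6), 3)]
p_value = sum(abs(t) >= abs(t_obs) for t in draws) / len(draws)
print(f"T_obs = {t_obs:.1f}, exact p-value = {p_value:.2f}")  # 0.40
```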
2.2 The Choice of Null Hypothesis

The first question when considering a Fisher Exact P-value calculation is the
choice of null hypothesis. Typically the most interesting sharp null hypothesis
is that of no effect of the treatment: Yi(0) = Yi(1) for all units.

Although Fisher’s approach cannot accommodate a null hypothesis of an average
treatment effect of zero, it can accommodate sharp null hypotheses other than
the null hypothesis of no effect whatsoever, e.g.,

H0 : Yi(1) = Yi(0) + ci, for all i = 1, . . . , N,

for known ci.


18
2.3 The Choice of Statistic

The second decision, the choice of test statistic, is typically more difficult
than the choice of the null hypothesis. First let us formally define a
statistic:

A statistic T is a known function T(W, Y^obs, X) of assignments, W, observed
outcomes, Y^obs, and pretreatment variables, X.

Any statistic that satisfies this definition is valid for use in Fisher’s
approach and we can derive its distribution under the null hypothesis.

19
The most standard choice of statistic is the difference in average outcomes by
treatment status:

T = Σi Wi · Yi^obs / Σi Wi − Σi (1 − Wi) · Yi^obs / Σi (1 − Wi).

An obvious alternative to the simple difference in average outcomes by
treatment status is to transform the outcomes before comparing average
differences between treatment levels, e.g., by taking logarithms, leading to
the following test statistic:

T = Σi Wi · ln(Yi^obs) / Σi Wi − Σi (1 − Wi) · ln(Yi^obs) / Σi (1 − Wi).

20
An important class of statistics involves transforming the outcomes to ranks
before considering differences by treatment status. This improves robustness.

We also often subtract (N + 1)/2 from each rank to obtain a normalized rank
that has average zero in the population:

Ri(Y1^obs, . . . , YN^obs) = Σ_{j=1}^{N} 1{Yj^obs ≤ Yi^obs} − (N + 1)/2.

Given the ranks Ri, an attractive test statistic is the difference in average
ranks for treated and control units:

T = Σi Wi · Ri / Σi Wi − Σi (1 − Wi) · Ri / Σi (1 − Wi).

21
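A short sketch of this rank transformation and the resulting statistic,
implementing the definition of Ri above on the six GAIN observations; note that
tie-handling conventions differ across implementations, so other texts may
report slightly different rank statistics.

```python
# Normalized ranks Ri = #{j : Yj <= Yi} - (N + 1)/2 and the
# difference-in-average-ranks test statistic.
import numpy as np

def normalized_ranks(y):
    y = np.asarray(y, dtype=float)
    r = np.array([(y <= yi).sum() for yi in y])  # tied values share a rank
    return r - (len(y) + 1) / 2

def rank_statistic(y, w):
    r = normalized_ranks(y)
    return r[w == 1].mean() - r[w == 0].mean()

y = np.array([66, 0, 0, 0, 607, 436])
w = np.array([0, 0, 0, 1, 1, 1])
print(rank_statistic(y, w))
```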
2.4 Computation of p-values

The p-value calculations presented so far have been exact. With N and M both
sufficiently large, however, the number of possible assignment vectors grows so
quickly that it becomes unwieldy to calculate the test statistic for every
value of the assignment vector.

In that case we rely on numerical approximations to the p-value.

Formally, randomly draw an N-dimensional vector with N − M zeros and M ones
from the set of assignment vectors. Calculate the statistic for this draw
(denoted T1). Repeat this process K − 1 times, in each instance drawing another
vector of assignments and calculating the statistic Tk, for k = 2, . . . , K.
We then approximate the p-value for our test statistic by the fraction of these
K statistics that are more extreme than T^obs.
22
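A self-contained sketch of this Monte Carlo approximation on the same six GAIN
observations: draw K random assignments rather than enumerating all twenty.

```python
# Monte Carlo approximation to the Fisher exact p-value.
import random

y = [66, 0, 0, 0, 607, 436]                       # observed outcomes
t_obs = (0 + 607 + 436) / 3 - (66 + 0 + 0) / 3    # observed statistic

def stat(treated):
    """Difference in mean outcomes, treated minus control."""
    t = [y[i] for i in treated]
    c = [y[i] for i in range(6) if i not in treated]
    return sum(t) / 3 - sum(c) / 3

K = 10_000
hits = sum(abs(stat(random.sample(range(6), 3))) >= abs(t_obs)
           for _ in range(K))
print(f"approximate p-value: {hits / K:.3f}")     # close to 0.40
```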
Comparison to the p-value based on the normal approximation to the distribution
of the t-statistic:

t = (Y̅1 − Y̅0) / √( s0²/(N − M) + s1²/M )

where

s0² = (1/(N − M − 1)) Σ_{i:Wi=0} (Yi^obs − Y̅0)²,   s1² = (1/(M − 1)) Σ_{i:Wi=1} (Yi^obs − Y̅1)²

and

p = 2 × Φ(−|t|), where Φ(a) = ∫_{−∞}^{a} (1/√(2π)) exp(−x²/2) dx
23
P-values for Fisher Exact Tests: Ranks versus Levels

                   sample size               p-values
Prog   Loc     controls   treated    t-test   FET (levels)   FET (ranks)

GAIN   AL      601        597        0.835    0.836          0.890
GAIN   LA      1400       2995       0.544    0.531          0.561
GAIN   RI      1040       4405       0.000    0.000          0.000
GAIN   SD      1154       6978       0.057    0.068          0.018

WIN    AR      37         34         0.750    0.753          0.805
WIN    BA      260        222        0.339    0.339          0.286
WIN    SD      257        264        0.136    0.137          0.024
WIN    VI      154        331        0.960    0.957          0.249

24
Exact P-values: Take-Aways

• Randomization-based p-values underlie tests for treatment effects.

• In practice, p-values based on the t-statistic are often similar to exact
p-values based on the difference-in-averages test.

• With very skewed distributions, rank-based tests are much better.

• See recent Alwyn Young papers on inference and leverage.

25
3. Randomized Experiments: Neyman’s Repeated Sampling Approach

During the same period in which Fisher was developing his p-value calculations,
Jerzy Neyman was focusing on methods for estimating average treatment effects.

His approach was to consider an estimator and derive its distribution under
repeated sampling by drawing from the randomization distribution of W, the
assignment vector.

• Y(0), Y(1) are still fixed in this repeated sampling thought experiment.

26
3.1 Unbiased Estimation of the Average Treatment Effect

Neyman was interested in the population average treatment effect:

τ = (1/N) Σ_{i=1}^{N} (Yi(1) − Yi(0)) = Y̅(1) − Y̅(0).

Suppose that we observed data from a completely randomized experiment in which
M units were assigned to treatment and N − M assigned to control. Given
randomization, the intuitive estimator for the average treatment effect is the
difference in the average outcomes for those assigned to the treatment versus
those assigned to the control:

τ̂ = (1/M) Σ_{i:Wi=1} Yi^obs − (1/(N − M)) Σ_{i:Wi=0} Yi^obs = Y̅1^obs − Y̅0^obs.
27
To see that this measure, Y̅1^obs − Y̅0^obs, is an unbiased estimator of τ,
consider the statistic

Ti = Wi · Yi^obs / (M/N) − (1 − Wi) · Yi^obs / ((N − M)/N).

The average of this statistic over the population is equal to our estimator,
τ̂ = Σi Ti / N = Y̅1^obs − Y̅0^obs.

28
Using the fact that Yi^obs is equal to Yi(1) if Wi = 1 and to Yi(0) if Wi = 0,
we can rewrite this statistic as:

Ti = Wi · Yi(1) / (M/N) − (1 − Wi) · Yi(0) / ((N − M)/N).

The only element in this statistic that is random is the treatment assignment,
Wi, with E[Wi] = M/N, so that E[1 − Wi] = 1 − E[Wi] = (N − M)/N.

Using these results we can show that the expectation of Ti is equal to the
unit-level causal effect, Yi(1) − Yi(0):

E[Ti] = E[Wi] · Yi(1) / (M/N) − (1 − E[Wi]) · Yi(0) / ((N − M)/N) = Yi(1) − Yi(0)

29
3.2 The Variance of the Unbiased Estimator Y̅1^obs − Y̅0^obs

Neyman was also interested in the variance of this unbiased estimator of the
average treatment effect.

This involved two steps: first, deriving the variance of the estimator for the
average treatment effect; and second, developing unbiased estimators of this
variance.

In addition, Neyman sought to create confidence intervals for the population
average treatment effect, which also requires an appeal to the central limit
theorem for large sample normality.

30
Consider a completely randomized experiment of N units, M assigned to
treatment. To calculate the variance of Y̅1^obs − Y̅0^obs, we need the second
and cross moments of the random variable Wi, E[Wi²] and E[Wi · Wj].

E[Wi²] = E[Wi] = M/N.

E[Wi · Wj] = Pr(Wi = 1) · Pr(Wj = 1 | Wi = 1)

= (M/N) · (M − 1)/(N − 1) ≠ E[Wi] · E[Wj],

for i ≠ j, since conditional on Wi = 1 there are M − 1 treated units remaining
out of N − 1 total remaining.

31
The variance of Y̅1^obs − Y̅0^obs is equal to:

Var(Y̅1^obs − Y̅0^obs) = S0²/(N − M) + S1²/M − S01²/N,   (1)

where Sw² is the variance of Yi(w) in the population, defined as:

Sw² = (1/(N − 1)) Σ_{i=1}^{N} (Yi(w) − Y̅(w))²,

for w = 0, 1, and S01² is the population variance of the unit-level treatment
effect, defined as:

S01² = (1/(N − 1)) Σ_{i=1}^{N} (Yi(1) − Yi(0) − τ)².

32
The numerator of the first term, the population variance of the potential
control outcome vector, Y(0), is equal to

S0² = (1/(N − 1)) Σ_{i=1}^{N} (Yi(0) − Y̅(0))².

An unbiased estimator for S0² is

s0² = (1/(N − M − 1)) Σ_{i:Wi=0} (Yi^obs − Y̅0^obs)².

33
The third term, S01² (the population variance of the unit-level treatment
effect), is more difficult to estimate because we cannot observe both Yi(1) and
Yi(0) for any unit.

We have no direct observations on the variation in the treatment effect across
the population and cannot directly estimate S01².

As noted previously, if the treatment effect is additive (Yi(1) − Yi(0) = c for
all i), then this variance is equal to zero and the third term vanishes.

Under this circumstance we can obtain an unbiased estimator for the variance
as:

V̂(Y̅1^obs − Y̅0^obs) = s0²/(N − M) + s1²/M.   (2)
34
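A minimal sketch of this machinery on data: the difference in means, the
conservative variance estimator from equation (2), and the resulting
large-sample 95% confidence interval.

```python
# Neyman's estimator, the variance estimator from equation (2), and a
# 95% normal-approximation confidence interval.
import numpy as np

def neyman(y_obs, w):
    """y_obs: outcomes; w: 0/1 treatment indicators (numpy arrays)."""
    y1, y0 = y_obs[w == 1], y_obs[w == 0]
    tau_hat = y1.mean() - y0.mean()
    # s1^2/M + s0^2/(N - M), with sample variances (ddof=1)
    v_hat = y1.var(ddof=1) / len(y1) + y0.var(ddof=1) / len(y0)
    se = np.sqrt(v_hat)
    return tau_hat, se, (tau_hat - 1.96 * se, tau_hat + 1.96 * se)

# the six GAIN observations from the earlier slides
y = np.array([66.0, 0, 0, 0, 607, 436])
w = np.array([0, 0, 0, 1, 1, 1])
print(neyman(y, w))
```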
This estimator for the variance is widely used, even when the assumption of an
additive treatment effect is inappropriate. There are two main reasons for this
estimator’s popularity.

First, confidence intervals generated using this estimator of the variance will
be conservative, with actual coverage at least as large as, but not necessarily
equal to, the nominal coverage.

The second reason for using this estimator for the variance is that it is
always unbiased for the variance of τ̂ = Y̅1^obs − Y̅0^obs when this statistic
is interpreted as the estimator of the average treatment effect in the
super-population from which the N observed units are a random sample. (We
return to this interpretation later.)

35
Confidence Intervals

Given the estimator τ̂ and the variance estimator V̂, how do we think about
confidence intervals?

Let’s consider the case where E[Wi] = 1/2, and define

Di = 2Wi − 1, so that E[Di] = 0, Di² = 1.

Write

τ̂ = Y̅1 − Y̅0 = (1/(N/2)) Σ_{i=1}^{N} Wi Yi(1) − (1/(N/2)) Σ_{i=1}^{N} (1 − Wi) Yi(0)

= (1/N) Σ_{i=1}^{N} (Yi(1) − Yi(0)) + (1/N) Σ_{i=1}^{N} Di (Yi(1) + Yi(0))
36
The stochastic part, normalized by the sample size, is

(1/√N) Σ_{i=1}^{N} Di (Yi(1) + Yi(0))

It has mean zero and variance

V = (1/N) Σ_{i=1}^{N} (Yi(1) + Yi(0))².

Under conditions on the sequence of σi² = (Yi(1) + Yi(0))², we can use a
central limit theorem for independent but not identically distributed random
variables to get

[ (1/√N) Σ_{i=1}^{N} Di (Yi(1) + Yi(0)) ] / √( (1/N) Σ_{i=1}^{N} σi² )  →d  N(0, 1)
37
Neyman Repeated Sampling Thought Experiments

• Basis for estimating causal effects

• Finite population argument

• Uncertainty based on assignment mechanism, not sampling.

38
4. Stratified Randomized Experiments

• Suppose we have N units, we observe some covariates on each unit, and we wish
to evaluate a binary treatment.

• Should we randomize the full sample, or should we stratify the sample first,
or even pair the units up?

Recommendation In Literature:

• In large samples, and if the covariates are strongly associated with the
outcomes, definitely stratify or pair.

• In small samples, with weak association between covariates and outcomes, the
literature offers mixed advice.

39
Quotes from the Literature

Snedecor and Cochran (1989, page 101) write, comparing paired randomization and
complete randomization:

“If the criterion [the covariate used for constructing the pairs] has no
correlation with the response variable, a small loss in accuracy results from
the pairing due to the adjustment for degrees of freedom. A substantial loss
may even occur if the criterion is badly chosen so that members of a pair are
negatively correlated.”

40
Box, Hunter and Hunter (2005, page 93) also suggest that there is a tradeoff in
terms of accuracy or variance in the decision to pair, writing:

“Thus you would gain from the paired design only if the
reduction in variance from pairing outweighed the effect
of the decrease in the number of degrees of freedom of
the t distribution.”

41
Klar and Donner (1997) raise additional issues that make them
concerned about pairwise randomized experiments (in the con-
text of randomization at the cluster level):

“We show in this paper that there are also several analytic limitations
associated with pair-matched designs. These include: the restriction of
prediction models to cluster-level baseline risk factors (for example, cluster
size), the inability to test for homogeneity of odds ratios, and difficulties
in estimating the intracluster correlation coefficient. These limitations lead
us to present arguments that favour stratified designs in which there are more
than two clusters in each stratum.”

42
Imai, King and Nall (2009) claim there are no tradeoffs at all between pairing
and complete randomization, and summarily dismiss all claims in the literature
to the contrary:

“Claims in the literature about problems with matched-pair cluster
randomization designs are misguided: clusters should be paired prior to
randomization when considered from the perspective of efficiency, power, bias,
or robustness.”

and then exhort researchers to randomize matched pairs:

“randomization by cluster without prior construction of matched pairs, when
pairing is feasible, is an exercise in self-destruction.”

43
How Do We Reconcile These Statements?

• Be careful and explicit about goals: precision of estimators versus power of
tests.

• Be careful about estimands: population versus sample, average over clusters
or average over individuals.

44
4.1 Expected Squared Error Calculations for Completely
Randomized vs Stratified Randomized Experiments

Suppose we have a single binary covariate Xi ∈ {f, m}. Define

τ(x) = E[Yi(1) − Yi(0) | Xi = x]

where the expectations denote expectations taken over the superpopulation.

The estimand we focus on is the (super-)population version of the finite-sample
average treatment effect,

τ = E[Yi(1) − Yi(0)] = E[τ(Xi)]

46
Notation

µ(w, x) = E[Yi(w) | Wi = w, Xi = x],

σ²(w, x) = V(Yi(w) | Wi = w, Xi = x),

for w = 0, 1, and x ∈ {f, m}, and

σ01²(x) = E[ (Yi(1) − Yi(0) − (µ(1, x) − µ(0, x)))² | Xi = x ].
47
Three Estimators: τ̂dif, τ̂reg, and τ̂strata

First, the simple difference:

τ̂dif = Y̅1^obs − Y̅0^obs

Second, use the regression function

Yi^obs = α + τ · Wi + β · 1{Xi = f} + εi.

Then estimate τ by least squares regression. This leads to τ̂reg.

The third estimator we consider is based on first estimating the average
treatment effects within each stratum, and then weighting these by the relative
stratum sizes:

τ̂strata = ((N0f + N1f)/N) · (Y̅1f^obs − Y̅0f^obs) + ((N0m + N1m)/N) · (Y̅1m^obs − Y̅0m^obs).
48
Large (infinitely large) superpopulation.

We draw a stratified random sample of size 4N from this population, where N is
an integer. Half the units come from the Xi = f subpopulation, and half come
from the Xi = m subpopulation.

Two experimental designs. First, a completely randomized design (C) where 2N
units are randomly assigned to the treatment group, and the remaining 2N are
assigned to the control group.

Second, a stratified randomized design (S) where N units are randomly selected
from the Xi = f subsample and assigned to the treatment group, and N units are
randomly selected from the Xi = m subsample and assigned to the treatment
group.

In both designs the conditional probability of a unit being assigned to the
treatment group, given the covariate, is the same: pr(Wi = 1|Xi = x) = 1/2, for
both types x = f, m.
49
VS = E[(τ̂dif − τ)² | S]

= (q/N) · ( σ²(1, f)/p + σ²(0, f)/(1 − p) ) + ((1 − q)/N) · ( σ²(1, m)/p + σ²(0, m)/(1 − p) )

VC = E[(τ̂dif − τ)² | C]

= q(1 − q)(µ(0, f) − µ(0, m))² + q σ²(0, f)/((1 − p)N) + (1 − q) σ²(0, m)/((1 − p)N)

+ q(1 − q)(µ(1, f) − µ(1, m))² + q σ²(1, f)/(pN) + (1 − q) σ²(1, m)/(pN)

VC − VS = q(1 − q) · ( (µ(0, f) − µ(0, m))² + (µ(1, f) − µ(1, m))² ) ≥ 0
50
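A simulation sketch (not from the slides) illustrating this ranking: with a
covariate that shifts the potential outcomes, the difference-in-means estimator
has visibly lower variance under the stratified design.

```python
# Monte Carlo comparison of Var(tau_dif) under complete (C) vs
# stratified (S) randomization with a binary covariate.
import numpy as np

rng = np.random.default_rng(0)
n, reps = 200, 5_000            # 4N = 200 units, half f and half m
x = np.repeat([0, 1], n // 2)   # 0 = f, 1 = m
tau_c, tau_s = [], []
for _ in range(reps):
    y0 = 2.0 * x + rng.normal(size=n)   # mu(0, m) - mu(0, f) = 2
    y1 = y0 + 1.0                        # constant treatment effect
    # complete randomization: n/2 treated overall
    wc = rng.permutation(np.repeat([0, 1], n // 2))
    # stratified randomization: n/4 treated within each stratum
    ws = np.concatenate([rng.permutation(np.repeat([0, 1], n // 4))
                         for _ in range(2)])
    for w, out in ((wc, tau_c), (ws, tau_s)):
        y = np.where(w == 1, y1, y0)
        out.append(y[w == 1].mean() - y[w == 0].mean())
print("Var complete:  ", np.var(tau_c))
print("Var stratified:", np.var(tau_s))   # smaller
```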
Comment 1:

The stratified randomized design has lower expected squared error than the
completely randomized design.

It is strictly lower if the covariate predicts the potential outcomes in the
population.

• True irrespective of sample size.

51
Comment 2: For this result it is important that we compare the marginal
variances, not conditional variances. There is no general ranking of the
conditional variances

E[(τ̂dif − τ)² | Y(0), Y(1), X, C]

versus

E[(τ̂dif − τ)² | Y(0), Y(1), X, S].

It is possible that stratification leads to larger variances because of
negative correlations within strata in a finite sample (the Snedecor and
Cochran quote). That is not possible on average, that is, over repeated
samples.

In practice this means that if the primary interest is in the most precise
estimate of the average effect of the treatment, stratification dominates
complete randomization, even in small samples.
52
Comment 3: Under a stratified design the three estimators τ̂reg, τ̂strata, and
τ̂dif are identical, so their variances are the same.

Under a completely randomized experiment, the estimators are generally
different. In sufficiently large samples, if there is some correlation between
the outcomes and the covariates that underlie the stratification, the
regression estimator τ̂reg will have a lower variance than τ̂dif.

However, for any fixed sample size, if the correlation is sufficiently weak,
the variance of τ̂reg will actually be strictly higher than that of τ̂dif.

53
Think through analyses in advance

Thus for ex post adjustment there is a potentially complicated tradeoff: in
small samples one should not adjust, and in large samples one should adjust if
the objective is to minimize the expected squared error.

If one wishes to adjust for differences in particular covariates, do so by
design: randomize in a way such that τ̂dif = τ̂reg (e.g., stratify, or
rerandomize).

54
4.2 Analytic Limitations of Pairwise Randomization

Compare two designs with 4N units.

• N strata with 4 units each (S).

• 2N pairs with 2 units each (P).

What are costs and benefits of S versus P?

55
Benefits of Pairing

• The paired design will lead to lower expected squared error than the
stratified design in finite samples (similar argument as before).

• In sufficiently large samples the power of the paired design will be higher
(but not in very small samples; similar argument as before).

56
Difference with Stratified Randomized Experiments

• Suppose we have a stratum with size ≥ 4 and conduct a randomized experiment
within the stratum with ≥ 2 treated and ≥ 2 controls.

• Within each stratum we can estimate the average effect and its variance (and
thus the intraclass variance). The variance may be imprecisely estimated, but
we can estimate it without bias.

• Suppose we have a stratum (that is, a pair) with 2 units. We can estimate the
average effect in each pair (with the difference in outcomes by treatment
status), but we cannot estimate the variance.

57
Difference with Stratified Randomized Experiments (ctd)

• From data on outcomes and pairs alone we cannot establish whether there is
heterogeneity in treatment effects.

• We can establish the presence of heterogeneity if we have data on the
covariates used to create the pairs (compare “similar” pairs).

• The efficiency gains from going from strata with 4 units to strata with 2
units are likely to be small.

58
Recommendation

• Use small strata, rather than pairs (but not a big deal either
way)

• Largely agree with Klar & Donner

59
4.3 Power Comparisons for t-statistic Based Tests

The basic calculation underlying the concern with pairwise randomization is
based on the calculation of t-statistics.

Randomly sample N units from a large population, with covariate
Xi ∼ N(µX, σX²). We then draw another set of N units, with exactly the same
values for the covariates. Assume the covariates are irrelevant.

The distribution of the potential control outcome is

Yi(0) | Xi ∼ N(µ, σ²) and Yi(1) = Yi(0) + τ

Completely randomized design (C): randomly pick N units to receive the
treatment.

Pairwise randomized design (P): pair the units by covariate and randomly assign
one unit from each pair to the treatment.
60
The estimator for τ under both designs is

τ̂ = Y̅1^obs − Y̅0^obs.

Its distribution under the two designs is the same as well (because the
covariate is independent of the outcomes):

τ̂ | C ∼ N(τ, 2σ²/N) and τ̂ | P ∼ N(τ, 2σ²/N)
61
The natural estimator for the variance of the estimator given the pairwise
randomized experiment is

V̂P = (1/(N(N − 1))) Σ_{i=1}^{N} (τ̂i − τ̂)² ∼ (2σ²/N) · χ²(N − 1)/(N − 1)

The variance estimator for the completely randomized design, exploiting
homoskedasticity, is

V̂C = (2/N) · ((N − 1) s²(0) + (N − 1) s²(1))/(2N − 2) ∼ (2σ²/N) · χ²(2N − 2)/(2N − 2)
62
Under normality the expected values of the variance estimators are the same,

E[V̂P] = E[V̂C] = 2σ²/N,

but their variances differ:

V(V̂P) = 2 · V(V̂C) = 8σ⁴ / (N²(N − 1))
63
This leads to the t-statistics

tP = τ̂ / √V̂P, and tC = τ̂ / √V̂C.

If we wish to test the null hypothesis of τ = 0 against the alternative of
τ ≠ 0 at level α, we reject the null hypothesis if |t| exceeds the critical
value cα (different for the two designs):

cαP = q^t_{1−α/2}(N − 1), cαC = q^t_{1−α/2}(2N − 2)
64
For any τ ≠ 0, and for any N ≥ 2, the power of the test based on the
t-statistic tC is strictly greater than the power based on the t-statistic tP
(assuming the covariates are irrelevant).

(At N = 1 we cannot test the hypothesis without knowledge of the variances.)

By extension, the power of the test based on the completely randomized design
is still greater than the power based on the pairwise randomized experiment if
the association between the covariate and the potential outcomes is weak, at
least in small samples.

This is the formal argument against doing a pairwise (or, by extension, a
stratified) randomized experiment if the covariates are only weakly associated
with the potential outcomes.
65
Limitations

• The test comparison relies on normality. Without normality we cannot directly
rank the power, and the actual size of the tests need not even be equal to the
nominal size.

• The homoskedastic case is the most favorable to the completely randomized
experiment (but it features most often in power comparisons). In the case of
heteroskedasticity, the loss in power for the pairwise randomized experiment is
smaller.

66
Conclusion

• Stratify, with small strata, but at least two treated and two control units.

• Don’t worry about power; use a variance estimator that takes into account the
stratification.

67
Causal Inference
and Machine Learning

Guido Imbens – Stanford University

Lecture 2:

Introduction to Machine Learning Concepts

Potsdam Center for Quantitative Research


Monday September 9th, 16.30-18.00
Outline

1. Nonparametric Regression

2. Regression Trees

3. Multiple Covariates/Features

4. Pruning

5. Random Forests

6. Boosting

7. Neural Nets

8. Generative Adversarial Nets

1


1. Nonparametric Regression

Data:

(Xi, Yi), i = 1, . . . , N, i.i.d.

where Xi ∈ R^d, Yi ∈ R, or Yi ∈ {0, 1}

Define

g(x) = E[Yi|Xi = x]

Goal: estimate g(x), minimizing

E[(g(Xi) − ĝ(Xi))²]

2
The regression/prediction problem is special:

Suppose we put one randomly chosen observation aside: (Yi, Xi), and use the
rest of the sample to estimate g(·) as ĝ(i)(·).

Then we can assess the quality of the estimator by calculating the squared
error

(Yi − ĝ(i)(Xi))²

We can use this out-of-sample cross-validation to rank different estimators
ĝ1(·) and ĝ2(·).

This is not true directly for estimators of average causal effects, or when we
want to estimate the regression function at a point, g(x0).
3
Many methods satisfy:

ĝ(x) = Σ_{i=1}^{N} ωi · Yi, often with Σ_{i=1}^{N} ωi = 1, sometimes ωi ≥ 0.

• Question: how to choose the weights ωi?

• Is it important or not to do inference on g(·) (confidence intervals /
standard errors)?

• How well do estimators perform in terms of out-of-sample mean squared error
(as opposed to in-sample fit)?

• What to do if dim(Xi) is high (relative to N)?

4
First: scalar case, Xi ∈ [0, 1]

1. Define knots κ_{j,J} = j/J, for j = 1, . . . , J − 1, J = 2, 3, . . ., and

ĝJ(x) = Σ_{j=1}^{J} 1{x ∈ [κ_{j−1,J}, κ_{j,J}]} · ĉ_{j,J}

where ĉ_{j,J} is the average outcome within the interval [κ_{j−1,J}, κ_{j,J}]:

ĉ_{j,J} = Σ_{i=1}^{N} 1{Xi ∈ [κ_{j−1,J}, κ_{j,J}]} · Yi / Σ_{i=1}^{N} 1{Xi ∈ [κ_{j−1,J}, κ_{j,J}]}

Also define the number of observations in each interval:

N_{j,J} = Σ_{i=1}^{N} 1{Xi ∈ [κ_{j−1,J}, κ_{j,J}]}
5
For fixed x the bias of this estimator depends on the derivative of g(·) around
x, and on the density of Xi around x. Given some smoothness, the squared bias
is approximately equal to the square of g′(x)/J, and the variance is
approximately V(Yi|Xi = x)/(N f(x)/J).

So, as a function of the number of intervals J:

Bias²(J) + Var(J) = g′(x)²/J² + J · V(Yi|Xi = x)/(N f(x))

The optimal choice for J is

J^opt = N^{1/3} · ( 2 g′(x)² f(x) / V(Yi|Xi = x) )^{1/3}
6
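A quick sketch of this piecewise-constant estimator (a “regressogram”) with
equally spaced knots at j/J; it assumes every interval contains at least one
observation.

```python
# Piecewise-constant regression with J equal-width intervals on [0, 1].
import numpy as np

def fit_regressogram(x, y, J):
    """Return the J within-interval means c_hat[j], j = 0..J-1."""
    bins = np.clip((x * J).astype(int), 0, J - 1)  # interval index
    return np.array([y[bins == j].mean() for j in range(J)])

def predict_regressogram(c_hat, x_new):
    J = len(c_hat)
    return c_hat[np.clip((x_new * J).astype(int), 0, J - 1)]

rng = np.random.default_rng(1)
x = rng.uniform(size=500)
y = np.sin(4 * x) + rng.normal(scale=0.3, size=500)
c_hat = fit_regressogram(x, y, J=8)
print(predict_regressogram(c_hat, np.array([0.1, 0.5, 0.9])))
```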
If we let J increase slightly more slowly than proportionally to N^{1/3} we get
asymptotic normality without bias; without such under-smoothing there are no
valid confidence intervals.

(You can do better than this by centering the interval at x, which lowers the
bias; then the optimal rate is J ∝ N^{1/5}.)
7
How can we modify this to improve properties?

1. Use more sophisticated ways of averaging:

(a) Use weights that give more weight to nearby observations (kernel
estimation)

(b) Instead of using means within intervals, use a polynomial approximation
within each interval (e.g., local linear regression, splines)

2. Choose the knots in a data dependent way, but then we need to give up easy
asymptotic properties (regression trees)
8
2. Regression Trees

Define for a, b the set of indices such that Xi is in [a, b):

I_{a,b} = {i = 1, . . . , N | Xi ∈ [a, b)}

Define the average within an interval:

Y̅_{a,b} = Σ_{i∈I_{a,b}} Yi / Σ_{i∈I_{a,b}} 1

Define the sum of squared deviations from the means:

Q(x) = Σ_{i:Xi∈I_{0,x}} (Yi − Y̅_{0,x})² + Σ_{i:Xi∈I_{x,1}} (Yi − Y̅_{x,1})²
9
Find the split point that minimizes the sum of squared deviations:

c1 = argmin_{x∈[0,1]} Q(x)

Then:

ĝ(x) = Y̅_{0,c1} if x ≤ c1,
ĝ(x) = Y̅_{c1,1} if x > c1.

• This is a tree with two leaves: [0, c1] and [c1, 1].

• ĝ(·) is a step function.
10
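A sketch of this first split: scan the candidate split points and pick the one
minimizing Q(x).

```python
# First regression-tree split on a scalar covariate.
import numpy as np

def best_split(x, y):
    """Return (c1, Q(c1)), scanning splits between adjacent sorted x's."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best_q, best_c = np.inf, None
    for k in range(1, len(xs)):          # left leaf = first k points
        left, right = ys[:k], ys[k:]
        q = ((left - left.mean()) ** 2).sum() \
            + ((right - right.mean()) ** 2).sum()
        if q < best_q:
            best_q, best_c = q, (xs[k - 1] + xs[k]) / 2
    return best_c, best_q

rng = np.random.default_rng(2)
x = rng.uniform(size=200)
y = (x > 0.6).astype(float) + rng.normal(scale=0.2, size=200)
print(best_split(x, y))                  # split point near 0.6
```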
We can do this again: consider all possible split points c2 ∈ [0, 1] and
calculate the sum of squares as a function of the additional split point. For
example, for c2 ∈ (c1, 1), the sum of squares is

Q(c1, c2) = Σ_{i:Xi∈I_{0,c1}} (Yi − Y̅_{0,c1})² + Σ_{i:Xi∈I_{c1,c2}} (Yi − Y̅_{c1,c2})² + Σ_{i:Xi∈I_{c2,1}} (Yi − Y̅_{c2,1})²

Now we have a tree with three leaves. ĝ(·) is still a step function:

ĝ(x) = Y̅_{0,c1} if x ∈ [0, c1),
ĝ(x) = Y̅_{c1,c2} if x ∈ [c1, c2),
ĝ(x) = Y̅_{c2,1} if x ∈ [c2, 1].
11
Note:

• We can keep doing this, each time adding a leaf to the tree.

• For every new potential split the sum of squares is lower than without the
additional split, until we have only the same value of Xi within each interval.
12
1. Given J splits, this looks very similar to just dividing the interval [0, 1]
into J equal subintervals.

2. It is more adaptive: it will be more likely to divide for values of x where

(a) there are more observations (where the variance is smaller – nearest
neighbor estimators also do that)

(b) the derivative of g(x) is larger (where the bias is bigger)
13
In both cases (simply dividing [0, 1] into J equal intervals, or a tree with J
leaves), we need to choose the smoothing parameter J.

• Leave-one-out cross-validation: leave out observation i, re-estimate the
model with J pieces/leaves, predict Yi as ĝ_{J,(i)}(Xi), and calculate the
error Yi − ĝ_{J,(i)}(Xi).

Minimize over J:

CV(J) = (1/N) Σ_{i=1}^{N} (Yi − ĝ_{J,(i)}(Xi))²

To make this computationally easier, do 10-fold cross-validation: partition the
sample into ten subsamples, estimate 10 times on samples of size N × 0.9, and
validate on the 10% samples.
14
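A sketch of the 10-fold criterion for choosing J, continuing the regressogram
sketch above (it reuses fit_regressogram and predict_regressogram, and assumes
every interval is nonempty in each training fold).

```python
# 10-fold cross-validation for the number of intervals J.
import numpy as np

def cv_error(x, y, J, n_folds=10, seed=0):
    idx = np.random.default_rng(seed).permutation(len(x))
    folds = np.array_split(idx, n_folds)
    sse = 0.0
    for fold in folds:
        train = np.setdiff1d(idx, fold)
        c_hat = fit_regressogram(x[train], y[train], J)
        sse += ((y[fold] - predict_regressogram(c_hat, x[fold])) ** 2).sum()
    return sse / len(x)

# choose J minimizing the criterion, e.g.:
# best_J = min(range(2, 30), key=lambda J: cv_error(x, y, J))
```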
This is how cross-validation is often done for kernel and nearest neighbor type
regression estimators. Note: this means the squared bias and the variance are
balanced, and so confidence intervals are not valid.

Cross-validation is not implemented this way for regression trees, partly for
computational reasons, and partly because the criterion is not necessarily
unimodal in J.

Instead the criterion is to choose the tree T that minimizes the sum of squared
deviations plus a penalty term, typically a constant times the number of leaves
in the tree:

Q(T) + λ|T|

Now the penalty parameter λ is chosen through cross-validation, say 10-fold
cross-validation.
15
3. Multiple Covariates

Suppose we have multiple covariates or features, say x = (x1, x2, . . . , xp) ∈
[0, 1]^p.

Suppose Xi has a uniform distribution.

Suppose we want to estimate E[Yi|Xi = (0, 0, . . . , 0)] by an average over
nearby observations:

ĝ(0) = Σ_{i=1}^{N} Yi · 1{Xik ≤ ε, all k} / Σ_{i=1}^{N} 1{Xik ≤ ε, all k}.

The problem is that the expected number of observations close by,

E[ Σ_{i=1}^{N} 1{Xik ≤ ε, all k} ] = N ε^p,

shrinks rapidly with p: the curse of dimensionality.
16
For kernel methods we typically use multivariate kernels that are simply the
product of univariate kernels:

K(x1, x2) = K0(x1) × K0(x2),

possibly with different bandwidths, but similar rates for the different
bandwidths.

This works poorly in high dimensions: the rate of convergence declines rapidly
with the dimension of Xi.
17
Trees deal with multiple covariates differently.

Now, for the first split, we consider all subsets of [0, 1] × [0, 1] of the
form

[0, c) × [0, 1] (split on x1)

or

[0, 1] × [0, c) (split on x2)

Repeat this after the first split.

18
• This means that some covariates may never be used to split the sample: the
method will deal better with cases where the regression function is flat in
some covariates (sparsity).

• It can deal with high dimensional covariates, as long as the regression
function does not depend too much on too many of them. (It will not perform
uniformly well, but it performs well in important parts of the parameter
space.)

19
This difference in the way trees (and forests) deal with multiple covariates
compared to kernel methods is important in practice. There is some tension
there:

• For asymptotic properties (the focus of much of the econometrics literature)
it is key that eventually the leaves are small in all dimensions. Kernel type
methods do this automatically. With trees and forests it can be imposed by
forcing the splits to depend on any covariate with probability bounded away
from zero (or even with equal probability).

• But for finite sample properties with many covariates (the focus of much of
the machine learning literature) you don’t want to split very often on
covariates that do not matter much.

20
Comparison with linear additive models:

• Trees allow for complex nonlinearity and non-monotonicity.

• With social science data conditional expectations are often monotone, so
linear additive models may provide a good fit. If the conditional mean of Yi is
increasing in Xi2 given Xi1 < c, it is likely to be increasing in Xi2 given
Xi1 ≥ c. Trees do not exploit this. You could fit linear models within leaves,
but then you need to be careful with many covariates.

21
4. Pruning

If we grow a tree as just described, we may stop too early and miss important
features of the joint distribution.

Suppose (x1, x2) ∈ {(−1, −1), (−1, 1), (1, −1), (1, 1)}, and

g(x1, x2) = x1 × x2

No first split (either on x1 or on x2) improves the expected squared error
compared to no split, but two or three splits improve the expected squared
error substantially.

How do we get there if the first split delivers no benefit?
22
• First grow a “big” tree, with many leaves, even if they do not improve the
sum of squared errors enough given λ, up to the point that the leaves are all
very small in terms of the number of observations per leaf.

• Then “prune” the tree: consider dropping splits (and combining all the
subsequent leaves) to see if that improves the criterion function.

23
5. Random Forests

Trees are step function approximations to the true regression function. They
are not smooth, and a single observation may affect the tree substantially. We
may want smoother estimates, and ones that are more robust to single
observations.

Random forests achieve this through two modifications that introduce randomness
into the trees.

24
Random Forests

1. Create B trees based on bootstrap samples. Start by constructing a bootstrap
sample of size N from the original sample. Grow a tree on the bootstrap sample
(this part is known as bagging); this leads to smoother estimates.

2. For each split (in each tree) only a random subset of size m of the p
covariates is considered for the split (typically m = √p or m = p/3; this is a
heuristic, with no formal result).

3. Average the estimates ĝb(·) over the B bootstrap-sample-based trees.

A flexible, simple, and effective out-of-the-box method in many cases. Not a
lot of tuning to be done.
25
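A quick sketch using scikit-learn’s implementation: B bootstrap trees
(n_estimators) with a fraction m/p of the features considered at each split
(max_features).

```python
# Random forest regression on simulated data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X = rng.uniform(size=(1000, 6))
y = np.sin(4 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.3, size=1000)

forest = RandomForestRegressor(n_estimators=500,    # B trees
                               max_features=1 / 3,  # m = p/3
                               random_state=0)
forest.fit(X[:800], y[:800])
mse = np.mean((forest.predict(X[800:]) - y[800:]) ** 2)
print(f"out-of-sample MSE: {mse:.3f}")
```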
6. Gradient Boosting

Initial estimate: Ĝ0(x) = 0.

First estimate g(·) using a very simple method (a simple base learner), for
example a tree with a single split on (Yi − Ĝ0(Xi), Xi). Call this estimate
ĝ1(x), and define Ĝ1(x) = Ĝ0(x) + ĝ1(x).

Then calculate the residual ε̂1i = Yi − Ĝ1(Xi).

Apply the same simple method again to ε̂1i, with estimator ĝ2(x). The
estimator for g(x) is now Ĝ2(x) = Ĝ1(x) + ĝ2(x).

Apply the same simple method again to ε̂2i = Yi − Ĝ2(Xi), with estimator
ĝ3(x). The estimator for g(x) is now Ĝ3(x) = Ĝ2(x) + ĝ3(x).
26
What does this do?

Each ĝk(x) depends only on a single element of x (a single covariate/feature).

Hence ĝ(x) is always an additive function of x1, . . . , xp.

What if we want the approximation to allow for some but not all higher order
interactions?

If we want only first order interactions, we can use a base learner that allows
for two splits. Then the approximation allows for a sum of general functions of
two variables, but not more.
27
Boosting refers to the repeated application of a simple base estimation method
to the residuals.

One can use methods other than trees as base learners.

28
For each split, we can calculate the improvement in mean squared error, and
assign that to the variable that we split on.

Sum this up over all splits, and over all trees.

This is informative about the importance of the different variables in the
prediction.

29
Modification

Three tuning parameters: the number of trees B, the depth of the trees d, and a
shrinkage factor ε ∈ (0, 1].

Initial estimate: Ĝ0(x) = 0, for all x.

First grow a tree of depth d on (Yi − Ĝ0(Xi), Xi); call this ĝ1(x).

New estimate: Ĝ1(x) = Ĝ0(x) + ε·ĝ1(x).

Next, grow a tree of depth d on (Yi − Ĝb(Xi), Xi); call this ĝb+1(x), and set
Ĝb+1(x) = Ĝb(x) + ε·ĝb+1(x).

ε = 1 is regular boosting. ε < 1 slows down the learning and spreads the
importance over more variables.
30
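A from-scratch sketch of this procedure with depth-1 trees (stumps) as the base
learner and shrinkage factor eps.

```python
# Gradient boosting of regression stumps with shrinkage.
import numpy as np

def fit_stump(x, y):
    """Best single split on a scalar covariate: returns (c, left, right)."""
    xs = np.unique(x)
    best_q, pars = np.inf, None
    for c in (xs[:-1] + xs[1:]) / 2:
        left, right = y[x <= c].mean(), y[x > c].mean()
        q = ((y - np.where(x <= c, left, right)) ** 2).sum()
        if q < best_q:
            best_q, pars = q, (c, left, right)
    return pars

def boost(x, y, B=200, eps=0.1):
    g_hat, stumps = np.zeros_like(y), []
    for _ in range(B):
        c, l, r = fit_stump(x, y - g_hat)       # fit stump to residuals
        g_hat += eps * np.where(x <= c, l, r)   # shrunken update
        stumps.append((c, l, r))
    return stumps

def predict(stumps, x, eps=0.1):
    return sum(eps * np.where(x <= c, l, r) for c, l, r in stumps)

rng = np.random.default_rng(4)
x = rng.uniform(size=300)
y = np.sin(6 * x) + rng.normal(scale=0.2, size=300)
stumps = boost(x, y)
print(predict(stumps, np.array([0.25, 0.75])))
```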
Generalized Boosting

We can do this in more general settings. Suppose we are interested in
estimating a binary response model with a high-dimensional covariate. Start
again with Ĝ0(x) = 0 and specify a parametric model g(x; γ).

Minimize over γ: Σ_{i=1}^{N} L(Yi, Ĝk(Xi) + g(Xi; γ))

and update Ĝ_{k+1}(x) = Ĝk(x) + ε·g(x; γ̂).

L(·) could be the log likelihood, with g the log odds ratio:

L(y, g) = y (g − ln(1 + exp(g))) − (1 − y) ln(1 + exp(g))
31
7. Neural Nets

Scalar outcome Yi, p-dimensional vector of covariates/inputs/features Xi, with
j-th element equal to Xij.

Let’s focus on the case where Yi is ordered (not discrete unordered).

Interest is in the conditional mean

h(x) = E[Yi|Xi = x]
32
Linear Model:

f(x) = β0 + Σ_{j=1}^{p} βj xj

OLS estimator: minimize

Σ_{i=1}^{N} (Yi − β0 − Σ_{j=1}^{p} βj Xij)²

• Does not work well if p is large relative to N.

• Restrictive if p << N.
33
Let’s make this more flexible:

Single index model:

h(x) = g( Σ_{j=1}^{p} βj xj )

Estimate p parameters βj and a single function g(·).
34
Additive model:

h(x) = Σ_{j=1}^{p} gj(xj)

Estimate p functions gj(·).
35
Projection Pursuit:

h(x) = Σ_{l=1}^{L} gl( Σ_{j=1}^{p} βlj xj )

Estimate L functions gl(·) and L × p parameters βlj.
36
Neural net with a single hidden layer:

f(x) = β0^(2) + Σ_{l=1}^{L} βl^(2) · g( Σ_{j=1}^{p} βlj^(1) xj )

Fix g(·) and estimate the L × p parameters βlj^(1) and the L + 1 parameters
βl^(2).

(Note that (1) and (2) index layers; they do not mean “to the power of”.)
37
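A tiny numpy sketch of this single-hidden-layer architecture with a fixed
nonlinearity g:

```python
# Forward pass of a single-hidden-layer network.
import numpy as np

def g(a):
    return np.tanh(a)            # one common choice of nonlinearity

def forward(x, beta1, beta2):
    """x: (p,) input; beta1: (L, p); beta2: (L + 1,) incl. intercept."""
    hidden = g(beta1 @ x)        # L hidden activations
    return beta2[0] + beta2[1:] @ hidden

rng = np.random.default_rng(5)
p, L = 3, 4
beta1 = rng.normal(size=(L, p))
beta2 = rng.normal(size=L + 1)
print(forward(rng.normal(size=p), beta1, beta2))
```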
General neural net with K hidden layers, one observed input layer, and one
observed output layer: K + 2 layers in total.

We observe y and x1, . . . , xp. The z_k^(m) are hidden variables.

Layer k: pk input variables, pk+1 output variables; p1 = p, pK+2 = 1.

K and pk, k = 2, . . . , K + 1, are tuning parameters.
38
First layer: p2 hidden elements, l = 1, . . . , p2. Model:

z_l^(2) = ω_{l0}^(1) + Σ_{j=1}^{p} ω_{lj}^(1) xj,

Transformation of the output variables:

α_l^(2) = g( z_l^(2) )
39
Layer k, with pk+1 hidden elements and pk hidden inputs, for layers
k = 2, . . . , K + 1, l = 1, . . . , pk+1. Model:

z_l^(k+1) = ω_{l0}^(k) + Σ_{j=1}^{pk} ω_{lj}^(k) α_j^(k)

Transformation:

α_l^(k+1) = g( z_l^(k+1) )

Final layer (layer K + 2), the output layer, with a single observed output
variable, pK+2 = 1. Model:

y = ω_0^(K+1) + Σ_{j=1}^{pK+1} ω_j^(K+1) α_j^(K+1)
40
Naive approach: minimize

Σ_{i=1}^{N} (Yi − f(Xi; ω))²

This is badly behaved: multiple solutions, numerically unstable.

Instead, regularize, and minimize:

Σ_{i=1}^{N} (Yi − f(Xi; ω))² + λ Σ_{k=1}^{K} Σ_{j=1}^{pk} Σ_{l=1}^{p_{k−1}} (ω_{jl}^(k))²

over all ω_{lj}^(k).

Choose the penalty factor λ through cross-validation.


41
Common choices for the transformation g(·) (pre-specified, not chosen by
optimization):

1. sigmoid: g(a) = (1 + exp(−a))^{−1}

2. tanh: g(a) = (exp(a) − exp(−a))/(exp(a) + exp(−a))

3. rectified linear: g(a) = a · 1{a > 0}

4. leaky rectified linear: g(a) = a · (1{a > 0} + γ · 1{a < 0})

It is important to have nonlinearity in the transformation, but the exact
nature of the nonlinearity appears to be less important.
42
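The four transformations, as a quick numpy sketch:

```python
# Common activation functions.
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def tanh(a):
    return np.tanh(a)

def relu(a):
    return a * (a > 0)

def leaky_relu(a, gamma=0.01):   # gamma is the slope for a < 0
    return a * ((a > 0) + gamma * (a < 0))

print(relu(np.array([-1.0, 2.0])), leaky_relu(np.array([-1.0, 2.0])))
```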
Lots of complexity is allowed for in neural nets, but it comes with lots of
choices.

Not easy to use out-of-the-box, but very successful in complex settings.

Computationally tricky because of multi-modality.

• Neural nets can approximate smooth functions accurately (they are universal
approximators) with many layers and many hidden units.
43
Interpretation

We can think of the layers up to the last one as constructing regressors:
z^(K+1) = h(ω, x).

The alternative is to choose functions of the regressors ourselves, e.g.,
polynomials such as zij = xi1 · xi4 · xi7².

In what sense is this better? Is this a statement about the type of functions
we encounter?
44
Multiple layers versus multiple hidden units

“We observe that shallow models [models with few layers] in this context
overfit at around 20 million parameters while deep ones can benefit from having
over 60 million. This suggests that using a deep model expresses a useful
preference over the space of functions the model can learn.”

Goodfellow, Bengio, and Courville, Deep Learning
45
Convolutional Neural Nets

Recall the model for layer k:

z_l^(k+1) = ω_{l0}^(k) + Σ_{j=1}^{pk} ω_{lj}^(k) α_j^(k)

We can set some of the ω_{lj}^(k) equal to zero. This obviously simplifies
computations and makes estimation more precise. But how do we choose the
restrictions?
46
Example (digit recognition): the xj are black/white scale measures on pixels.
Suppose we have 16 by 16 pixels, 256 in total, so xij, i = 1, . . . , 16,
j = 1, . . . , 16. We could make the nodes in the first hidden layer functions
of only sets of pixels close together:

z_1^(2) = ω_{10}^(1) + Σ_{i=1}^{3} Σ_{j=1}^{3} ω_{1ij}^(1) xij

z_2^(2) = ω_{20}^(1) + Σ_{i=4}^{6} Σ_{j=1}^{3} ω_{2ij}^(1) xij

et cetera.
47
Estimating the Parameters of a Neural Network: Backpropagation

Define the objective function (without regularization) for a single
observation:

Ji(ω, x, y) = (yi − f(xi; ω))²

For N observations:

J(ω, x, y) = Σ_{i=1}^{N} Ji(ω, xi, yi) = Σ_{i=1}^{N} (yi − f(xi; ω))²

We wish to minimize this over ω.
48
Recall: K hidden layers, one input and one output layer, K + 2 layers in total.

First layer: p1 observed elements,

z_l^(1) = xl,   α_l^(1) = g^(1)( z_l^(1) ) = z_l^(1),   l = 1, . . . , p1.

Hidden layer k, with pk hidden elements, for k = 2, . . . , K + 1:

z_l^(k) = ω_{l0}^(k−1) + Σ_{j=1}^{p_{k−1}} ω_{lj}^(k−1) α_j^(k−1),   α_l^(k) = g^(k)( z_l^(k) ).

Final layer (layer K + 2), with pK+2 = 1:

z^(K+2) = ω_0^(K+1) + Σ_{j=1}^{pK+1} ω_j^(K+1) α_j^(K+1),

f(x) = g^(K+2)( z^(K+2) ) = z^(K+2)
49
We can write

J(ω, x, y) = J( ω^(1), ω^(2), . . . , ω^(K+1), x, y )

We can also write the vector

z^(k) = h^(k)( ω^(k−1), z^(k−1) )

This function h^(k)(·) does not depend on ω beyond ω^(k−1). By further
substitution we can write this as

z^(k) = h̃^(k)( ω^(k−1), . . . , ω^(1), x )
50
Now start with the last layer.

Write

f(x) = g^(K+2)( z^(K+2) ) = z^(K+2)

Define

δi^(K+2) = ∂/∂zi^(K+2) ( yi − g^(K+2)( zi^(K+2) ) )²

= −2 ( yi − g^(K+2)( zi^(K+2) ) ) g^(K+2)′( zi^(K+2) )

= −2 ( yi − zi^(K+2) )

(this is just the scaled residual).
51
We can write:

zi^(K+2) = ω_0^(K+1) + Σ_{j=1}^{pK+1} ω_j^(K+1) g( z_{ji}^(K+1) )

so

∂zi^(K+2) / ∂z_{ji}^(K+1) = ω_j^(K+1) g′( z_{ji}^(K+1) )

Now consider writing the objective function in terms of z_{ji}^(K+1):

(yi − f(xi))² = ( yi − zi^(K+2) )²

= ( yi − ω_0^(K+1) − Σ_{j=1}^{pK+1} ω_j^(K+1) g( z_{ji}^(K+1) ) )²
52
Then define:

δ_{li}^(K+1) = ∂/∂z_{li}^(K+1) ( yi − f(xi) )²

= ∂( yi − f(xi) )² / ∂zi^(K+2) × ∂zi^(K+2) / ∂z_{li}^(K+1)

= δi^(K+2) ω_l^(K+1) g′( z_{li}^(K+1) )
53
Go down the layers:

Define

δ_{li}^(k) = ( Σ_{j=1}^{p_{k+1}} ω_{jl}^(k) δ_{ji}^(k+1) ) g′( z_{li}^(k) )

Then the derivatives are

∂Ji(ω, xi, yi) / ∂ω_{lj}^(k) = g( z_{ji}^(k) ) δ_{li}^(k+1)
54
Given the derivatives, iterate over

ω_{m+1} = ω_m − α × ∂J(ω, x, y)/∂ω

where α is the “learning rate,” often set at 0.01.
55
• Often stochastic gradient descent is used: instead of calculating

∂J(ω, x, y)/∂ω

use

Σ_{i=1}^{N} Ri · ∂Ji(ω, xi, yi)/∂ω / Σ_{i=1}^{N} Ri

for a random selection of units (Ri ∈ {0, 1}, e.g., pr(Ri = 1) = 0.01),
because it is faster.
56
Regularization:

• Add a penalty term, λ Σ_{j,l,k} (ω_{jl}^(k))².

• Early stopping rule: stop iterating when the test error deteriorates.

57
8. Generative Adversarial Nets (GANs)

Given a data set Xi, i = 1, . . . , N:

• Generate data that are indistinguishable from the real data.

• Use a two stage procedure:

– Generate artificial data.

– Use a classifier/discriminator/critic to see if it is possible to distinguish
between real and artificial data.

• Successful in generating fake pictures.
58
General set up:

• Real observations X1, . . . , X_{NR}, with empirical distribution F̂X(·), in X.

• Noise distribution FZ(·), e.g., multivariate normal, in Z.

• Generator gθ : Z → X, which can be quite flexible.

• Discriminator/critic to tell fake and real data apart, which can also be
quite flexible.

The goal is to find θ so that gθ(Z) ∼ F̂X according to the
discriminator/critic.
59
Distances and Divergences. Consider two distributions f(·) and g(·):

• Kullback-Leibler

KL(f, g) = ∫ ln( f(x)/g(x) ) f(x) dµ(x)

• Jensen-Shannon

JS(f, g) = KL( f, (f + g)/2 ) + KL( g, (f + g)/2 )

• Earth-Mover / Wasserstein Distance

W(f, g) = inf_{γ∈Π(f,g)} E_{(X,Y)∼γ} ‖Y − X‖

where Π(f, g) is the set of joint distributions with marginals f and g.

60
Original GAN proposal (Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley,
Ozair, Courville, & Bengio, 2014)

KL divergence with discriminator Dφ : X → [0, 1]:

inf_θ sup_φ { Σ_{i:Yi=F} ln Dφ( gθ(Zi) ) + Σ_{i:Yi=R} ln( 1 − Dφ(Xi) ) }

This is awkward if the supports of Xi and gθ(Zi) do not agree.
61
Wasserstein GAN (WGAN; Arjovsky, Chintala, & Bottou, 2017):

WGAN uses the Earth-Mover distance, through a function fφ : X → R,
parametrized by φ, called the critic.

• Find a function fφ(x) so that the difference between the expected values of
fφ(Xi) and fφ(gθ(Zi)) is maximized.

• Then choose θ to minimize this difference.

Formally:

inf_θ sup_{φ: ‖fφ‖_L ≤ 1} { (1/NF) Σ_{i:Yi=F} fφ( gθ(Zi) ) − (1/NR) Σ_{i:Yi=R} fφ(Xi) }

(where ‖f‖_L denotes the Lipschitz constant of f)
62
Lalonde Data: Summary Statistics

NSW Treated NSW Controls CPS Controls

mean (s.d.) mean (s.d.) mean (s.d.)


Black 0.84 (0.36) 0.83 (0.38) 0.07 (0.26)
Hisp 0.06 (0.24) 0.11 (0.31) 0.07 (0.26)
Age 25.8 (7.2) 25.1 (7.1) 33.2 (11.0)
Married 0.19 (0.39) 0.15 (0.36) 0.71 (0.45)
Nodegree 0.71 (0.46) 0.83 (0.37) 0.30 (0.46)
Education 10.3 (2.0) 10.1 (1.6) 12.0 (2.9)
E’74 2.10 (4.89) 2.11 (5.69) 14.02 (9.57)
U’74 0.71 (0.46) 0.75 (0.43) 0.12 (0.32)
E’75 1.53 (3.22) 1.27 (3.10) 13.65 (9.27)
U’75 0.60 (0.49) 0.68 (0.47) 0.11 (0.31)
E’78 6.35 (7.87) 4.55 (5.48) 14.85 (9.65)
U’78 0.24 (0.43) 0.35 (0.48) 0.16 (0.34)

63
Simulating the Lalonde Data

For the generator we use 11-dimensional normally distributed


noise.

Three hidden layers:

1. 11 inputs, 64 outputs, rectified linear

2. 64 inputs, 128 outputs, rectified linear

3. 128 inputs, 256 outputs, rectified linear

Final layer, 256 inputs, 11 outputs: For binary variables,


use sigmoid, for censored variables use rectified linear, for
continuous variables use linear.
64
For the discriminator, three hidden layers

1. 11 inputs, 256 outputs, rectified linear

2. 256 inputs, 128 outputs, rectified linear

3. 128 inputs, 64 outputs, rectified linear

Final layer, 64 inputs, 1 output, linear.

65
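A sketch of the generator and critic just described, in PyTorch; the split of the 11 generator outputs into binary, censored, and continuous blocks is illustrative, with the actual partition following the Lalonde variable types.

import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(11, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, 11),
        )

    def forward(self, z):
        out = self.body(z)
        binary = torch.sigmoid(out[:, :5])    # e.g., the indicator variables
        censored = torch.relu(out[:, 5:10])   # e.g., the earnings variables
        contin = out[:, 10:]                  # e.g., age or education
        return torch.cat([binary, censored, contin], dim=1)

critic = nn.Sequential(
    nn.Linear(11, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 1),   # linear output
)

fake = Generator()(torch.randn(32, 11))   # 11-dimensional normal noise
score = critic(fake)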
How do the generated data compare to the actual data?

• Marginal distributions are close

• Correlations are close

• Conditional distributions (conditional on one variable at


a time) are close

66
[Slides 67-71: figures comparing the actual and generated data.]
Causal Inference
and Machine Learning

Guido Imbens – Stanford University

Lecture 3:

Average Treatment Effects with Many Covariates

Potsdam Center for Quantitative Research


Tuesday September 10th, 10.30-12.00
Outline

1. Unconfoundedness

2. Efficiency Bound

3. Outcome Modeling, Propensity Score Modeling, and Dou-


ble Robust Methods

4. Many Covariates

5. Efficient Score Methods


1
6. Balancing Methods

7. Comparisons of Estimators
1. Unconfoundedness Set up:

Treatment indicator: Wi ∈ {0, 1}

Potential Outcomes Yi(0), Yi(1)

Covariates Xi

Observed outcome: Yiobs = Wi · Yi(1) + (1 − Wi) · Yi(0).

2
Estimand: average effect for treated:

τ = E[Yi(1) − Yi(0)|Wi = 1]

Key Assumptions: unconfoundedness:

$$W_i \perp \bigl(Y_i(0), Y_i(1)\bigr) \,\big|\, X_i.$$

Overlap

pr(Wi = 1|Xi = x) ∈ (0, 1)

3
If there are concerns with overlap, we may need to trim the sample based on the propensity score

$$e(x) = \text{pr}(W_i = 1 | X_i = x) \qquad \text{(propensity score)}$$

Trim if $e(x) \notin [0.1, 0.9]$.

See Crump, Hotz, Imbens & Mitnik (Biometrika, 2008) for


optimal trimming.

Important in practice.

4
Define the conditional mean of potential outcomes

µw (x) = E[Yi (w)|Xi = x]

and the conditional variance

$$\sigma_w^2(x) = \mathbb{V}[Y_i(w) | X_i = x]$$

Under unconfoundedness the conditional potential outcome


mean is equal to conditional mean of observed outcome:

µw (x) = E[Yiobs|Wi = w, Xi = x]

5
2. Semi-parametric efficiency bound for average treat-
ment effect

τ = E[Yi(1) − Yi(0)]

under unconfoundedness

$$\mathbb{E}\left[\frac{\sigma_1^2(X_i)}{e(X_i)} + \frac{\sigma_0^2(X_i)}{1 - e(X_i)} + \bigl(\mu_1(X_i) - \mu_0(X_i) - \tau\bigr)^2\right] = \mathbb{E}\Bigl[\bigl(\psi(Y_i, W_i, X_i)\bigr)^2\Bigr]$$

where the efficient influence function is

$$\psi(y, w, x) = \mu_1(x) - \mu_0(x) + w\, \frac{y - \mu_1(x)}{e(x)} - (1 - w)\, \frac{y - \mu_0(x)}{1 - e(x)} - \tau$$
6
How can we estimate τ efficiently?

Let µ̂w (x) and ê(x) be nonparametric estimators for µw (x) and
e(x). Then the following three estimators are efficient for τ :

A. based on estimation of the regression function

$$\hat\tau_{reg} = \frac{1}{N} \sum_{i=1}^{N} \bigl(\hat\mu_1(X_i) - \hat\mu_0(X_i)\bigr)$$

B. based on estimation of the propensity score

$$\hat\tau_{ipw} = \frac{1}{N} \sum_{i=1}^{N} \left(\frac{W_i \cdot Y_i^{obs}}{\hat e(X_i)} - \frac{(1 - W_i) \cdot Y_i^{obs}}{1 - \hat e(X_i)}\right)$$

7

C. based on estimation of the efficient score

$$\hat\tau_{es} = \frac{1}{N} \sum_{i=1}^{N} \left(\frac{W_i \cdot (Y_i - \hat\mu_1(X_i))}{\hat e(X_i)} - \frac{(1 - W_i) \cdot (Y_i - \hat\mu_0(X_i))}{1 - \hat e(X_i)} + \hat\mu_1(X_i) - \hat\mu_0(X_i)\right)$$

8
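A sketch of the three estimators in Python, taking as inputs fitted values mu0_hat, mu1_hat, and e_hat from whatever first-stage method is used (these argument names are mine, not from the slides).

import numpy as np

def ate_estimators(y, w, mu0_hat, mu1_hat, e_hat):
    tau_reg = np.mean(mu1_hat - mu0_hat)
    tau_ipw = np.mean(w * y / e_hat - (1 - w) * y / (1 - e_hat))
    tau_es = np.mean(
        w * (y - mu1_hat) / e_hat
        - (1 - w) * (y - mu0_hat) / (1 - e_hat)
        + mu1_hat - mu0_hat
    )
    return tau_reg, tau_ipw, tau_es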
• Single nearest neighbor matching also possible, but not effi-
cient.

• Estimators seem very different.

• How should we think about choosing between them and what


are their properties?

9
τ̂reg , τ̂ipw , and τ̂es are efficient in the sense that they achieve
the semiparametric efficiency bound, for fixed dimension of the
covariates, but irrespective of what that dimension is.

Define:

$$\hat\tau^{\,infeasible} = \frac{1}{N} \sum_{i=1}^{N} \left(\frac{W_i \cdot (Y_i - \mu_1(X_i))}{e(X_i)} - \frac{(1 - W_i) \cdot (Y_i - \mu_0(X_i))}{1 - e(X_i)} + \mu_1(X_i) - \mu_0(X_i)\right)$$

Then:

$$\hat\tau_{reg} = \hat\tau_{ipw} + o_p(N^{-1/2}) = \hat\tau_{es} + o_p(N^{-1/2}) = \hat\tau^{\,infeasible} + o_p(N^{-1/2})$$


10
Why are these estimators first order equivalent?

Suppose single binary regressor: Xi ∈ {0, 1}

Simple nonparametric estimators are available for $e(x)$ and $\mu_w(x)$:

$$\hat e(x) = \frac{\sum_{i=1}^{N} \mathbf{1}_{W_i = 1, X_i = x}}{\sum_{i=1}^{N} \mathbf{1}_{X_i = x}} \qquad \hat\mu_w(x) = \frac{\sum_{i=1}^{N} Y_i\, \mathbf{1}_{W_i = w, X_i = x}}{\sum_{i=1}^{N} \mathbf{1}_{W_i = w, X_i = x}}$$

Then all three estimators are identical:

$$\hat\tau_{reg} = \hat\tau_{ipw} = \hat\tau_{es}$$


11
How do they do this with continuous covariates?

• Assume lots of smoothness of the conditional expectations


µw (x) and e(x) (existence of derivatives up to high order)

• Use bias reduction techniques: higher order kernels, or local


polynomial regression. The order of the kernel required is
related to the dimension of the covariates.

12
• Regression estimator based on series estimator for µw (x).

Suppose $X_i$ is an element of a compact subset of $\mathbb{R}^d$. We can approximate $\mu_w(x)$ by a polynomial series including all terms up to $x_j^k$, where $x_j$ is the $j$th element of $x \in \mathbb{R}^d$. (Other series are possible.)

The approximation error is small if µw (·) has many derivatives


relative to the dimension of x.

13
• Regression estimator based on kernel estimator for µw (x).

$$\hat\mu_w(x) = \sum_{i=1}^{N} \mathbf{1}_{W_i = w}\, Y_i\, K\!\left(\frac{X_i - x}{h}\right) \Big/ \sum_{i=1}^{N} \mathbf{1}_{W_i = w}\, K\!\left(\frac{X_i - x}{h}\right)$$

This estimator is consistent under weak conditions, but to


make the bias vanish from the asymptotic distribution we need
to use higher order kernels (kernels with negative weights).

14
4. What do we do with many covariates?

Kernel regression and series methods do not work well in high


dimensions.

A. Propensity score methods. Estimate e(·) using machine


learning methods, e.g., LASSO, random forests, deep learning
methods, to minimize something like

$$\mathbb{E}\bigl[(\hat e(X_i) - e(X_i))^2\bigr],$$

leading to $\hat e(\cdot)$. Then use inverse propensity score weighting:

$$\hat\tau = \frac{1}{N_t} \sum_{i: W_i = 1} Y_i \;-\; \sum_{i: W_i = 0} \frac{\hat e(X_i)}{1 - \hat e(X_i)}\, Y_i \Big/ \sum_{i: W_i = 0} \frac{\hat e(X_i)}{1 - \hat e(X_i)}$$

Problem is that this does not select covariates that are highly
correlated with Yi
15
B. Regression methods. Estimate µ0(x) = E[Yi |Wi = 0, Xi = x]
using machine learning methods, e.g., LASSO, random forests,
deep learning methods, to minimize something like

$$\mathbb{E}\bigl[(\hat\mu_0(X_i) - \mu_0(X_i))^2\bigr],$$

leading to $\hat\mu_0(\cdot)$. Then use the regression difference:

$$\hat\tau = \frac{1}{N_t} \sum_{i: W_i = 1} \bigl(Y_i - \hat\mu_0(X_i)\bigr)$$

Problem is that this does not select covariates that are highly
correlated with Wi

16
Recall omitted variable bias:

$$Y_i = \alpha + \tau W_i + \beta^\top X_i + \varepsilon_i$$

Omitting $X_i$ from the regression leads to bias in $\tau$ that is proportional to $\beta$ and to the correlation between $W_i$ and $X_i$.

Selecting covariates only on basis of correlation with Yi, or


only on the basis of correlation with Wi is not effective.

• As in case with few covariates, it is better to work both


with the correlations between Wi and Xi and the correlations
between Yi(w) and Xi.

17
First improvement, use selection methods that select covari-
ates that are correlated with Wi or Yi (double selection, Belloni
et al, 2012).

E.g., use lasso to select covariates that predict Yi. Use lasso
to select covariates that predict Wi.

Take union of two sets of covariates, and then regress Yi on


that set of covariates.

• works better than single selection methods.

18
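A sketch of double selection in Python with scikit-learn; the cross-validated lasso for the outcome and the l1-penalized logit for the treatment are illustrative first-stage choices.

import numpy as np
from sklearn.linear_model import LassoCV, LogisticRegressionCV

def double_selection(y, w, X):
    # covariates that predict the outcome
    sel_y = LassoCV(cv=5).fit(X, y).coef_ != 0
    # covariates that predict the treatment
    sel_w = LogisticRegressionCV(cv=5, penalty="l1",
                                 solver="liblinear").fit(X, w).coef_[0] != 0
    keep = sel_y | sel_w   # union of the two selected sets
    Z = np.column_stack([np.ones(len(y)), w, X[:, keep]])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return beta[1]         # coefficient on the treatment W_i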
5. Efficient Score Methods and Double Robustness (Robins & Rotnitzky, 1996; Van der Laan & Rubin, 2006; Imbens & Rubin, 2015; and others)

We do not need $e(\cdot)$ to be estimated consistently as long as $\mu_0(\cdot)$ and $\mu_1(\cdot)$ are estimated consistently, because

$$\mathbb{E}\left[W_i\, \frac{Y_i - \mu_1(X_i)}{a(X_i)} - (1 - W_i)\, \frac{Y_i - \mu_0(X_i)}{1 - a(X_i)} + \mu_1(X_i) - \mu_0(X_i)\right] = \tau$$

for any function $a(\cdot)$.

Also, we do not need $\mu_0(\cdot)$ and $\mu_1(\cdot)$ to be estimated consistently, as long as $e(\cdot)$ is estimated consistently, because

$$\mathbb{E}\left[W_i\, \frac{Y_i - c(X_i)}{e(X_i)} - (1 - W_i)\, \frac{Y_i - b(X_i)}{1 - e(X_i)} + c(X_i) - b(X_i)\right] = \tau$$

for any functions $b(\cdot)$ and $c(\cdot)$.


19
But we can improve on these estimators (e.g., Chernozhukov et al., 2016):

Split the sample randomly into two equal parts, $i = 1, \ldots, N/2$ and $i = N/2 + 1, \ldots, N$.

Estimate $\mu_0(\cdot)$, $\mu_1(\cdot)$ and $e(\cdot)$ on the first subsample, and then estimate $\tau$ on the second subsample as

$$\hat\tau_1 = \frac{1}{N/2} \sum_{i=N/2+1}^{N} \left( W_i\, \frac{Y_i - \hat\mu_1^{(1)}(X_i)}{\hat e^{(1)}(X_i)} - (1 - W_i)\, \frac{Y_i - \hat\mu_0^{(1)}(X_i)}{1 - \hat e^{(1)}(X_i)} + \hat\mu_1^{(1)}(X_i) - \hat\mu_0^{(1)}(X_i) \right)$$

This is consistent, but not efficient.


20
Do the reverse to get

$$\hat\tau_2 = \frac{1}{N/2} \sum_{i=1}^{N/2} \left( W_i\, \frac{Y_i - \hat\mu_1^{(2)}(X_i)}{\hat e^{(2)}(X_i)} - (1 - W_i)\, \frac{Y_i - \hat\mu_0^{(2)}(X_i)}{1 - \hat e^{(2)}(X_i)} + \hat\mu_1^{(2)}(X_i) - \hat\mu_0^{(2)}(X_i) \right)$$

Finally, combine:

$$\hat\tau = \frac{\hat\tau_1 + \hat\tau_2}{2}$$

21
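A sketch of this two-fold cross-fitting procedure in Python; the random forests are placeholder first-stage learners, and the clipping of the estimated propensity score is a practical safeguard, not part of the slides.

import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

def dml_ate(y, w, X, seed=0):
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(y)) % 2        # random split into two halves
    taus = []
    for a in (0, 1):
        tr, te = folds == a, folds == 1 - a    # fit on tr, evaluate on te
        mu0 = RandomForestRegressor().fit(X[tr & (w == 0)], y[tr & (w == 0)])
        mu1 = RandomForestRegressor().fit(X[tr & (w == 1)], y[tr & (w == 1)])
        e = RandomForestClassifier().fit(X[tr], w[tr])
        m0, m1 = mu0.predict(X[te]), mu1.predict(X[te])
        eh = np.clip(e.predict_proba(X[te])[:, 1], 0.01, 0.99)
        psi = (w[te] * (y[te] - m1) / eh
               - (1 - w[te]) * (y[te] - m0) / (1 - eh) + m1 - m0)
        taus.append(psi.mean())
    return np.mean(taus)                        # average of tau_1 and tau_2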
Key Assumptions

Estimators for $\mu_0(\cdot)$, $\mu_1(\cdot)$ and $e(\cdot)$ need to converge fast enough, e.g., faster than the $N^{-1/4}$ rate.

That is not as fast as parametric models, which converge at the $N^{-1/2}$ rate, but still faster than simple nonparametric (non-negative) kernel estimators, which converge at a rate that depends on the dimension of $X_i$. Using kernel estimators one would need higher order kernels. Other methods, e.g., random forests or deep neural nets, may work, but no easily interpretable assumptions are available.

22
6. Balancing Methods

Suppose we are interested in $\tau = \mathbb{E}[Y_i(1) - Y_i(0) | W_i = 1]$, so that we need to estimate

$$\mathbb{E}\bigl[\,\mathbb{E}[Y_i | W_i = 0, X_i]\,\big|\, W_i = 1\bigr]$$

Note that, with $e(\cdot)$ the propensity score,

$$\mathbb{E}\left[\frac{e(X_i)}{1 - e(X_i)}\,(1 - W_i)\, Y_i\right] = \mathbb{E}\bigl[\,\mathbb{E}[Y_i | W_i = 0, X_i]\,\big|\, W_i = 1\bigr] \cdot \text{pr}(W_i = 1)$$

So, we could estimate $e(\cdot)$ by $\hat e(\cdot)$, and then use

$$\frac{1}{N_1} \sum_{i=1}^{N} (1 - W_i)\, Y_i\, \gamma_i, \qquad \text{where } \gamma_i = \frac{\hat e(X_i)}{1 - \hat e(X_i)}$$
23
The key insight is that for any function $h: \mathbb{X} \mapsto \mathbb{R}^p$,

$$\mathbb{E}\left[\frac{e(X_i)}{1 - e(X_i)}\,(1 - W_i)\, h(X_i)\right] = \mathbb{E}[h(X_i) | W_i = 1] \cdot \text{pr}(W_i = 1),$$

including for $h(X_i) = X_i$:

$$\mathbb{E}\left[\frac{e(X_i)}{1 - e(X_i)}\,(1 - W_i)\, X_i\right] = \mathbb{E}[X_i | W_i = 1] \cdot \text{pr}(W_i = 1)$$

24
Zubizarreta (2012) suggests directly focusing on the balance
in covariates. Find weights γi that solve

$$\min_{\gamma_1, \ldots, \gamma_N} \sum_{i: W_i = 0} \gamma_i^2, \qquad \text{subject to} \quad \sum_{i=1}^{N} (1 - W_i)\, \gamma_i\, X_i = \overline{X}_t$$

See also Hainmueller (2010), and Abadie, Diamond and Hain-


mueller (2012) in a different context.

γi = e(Xi)/(1 − e(Xi)) solves the restriction in expectation, but


not in sample.

We may get better balance directly focusing on balance in


sample than by propensity score weighting.

25
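A sketch of the minimum-norm version of this problem in Python: dropping Zubizarreta's additional constraints (e.g., non-negativity of the weights), the weights minimizing the sum of squared gamma_i subject to exact balance have a closed form.

import numpy as np

def balancing_weights(X_c, xbar_t):
    # X_c: Nc x p covariate matrix for controls; xbar_t: treated averages.
    # Solution to min sum gamma_i^2 s.t. X_c' gamma = xbar_t:
    gamma = X_c @ np.linalg.solve(X_c.T @ X_c, xbar_t)
    assert np.allclose(X_c.T @ gamma, xbar_t)   # exact balance in sample
    return gamma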
Athey, Imbens and Wager (2015) combine this with a linear
regression for the potential outcomes.

In their setting there are too many covariates to balance the averages exactly: there is no solution for $\gamma$ that solves

$$\sum_{i=1}^{N} (1 - W_i)\, \gamma_i\, X_i = \overline{X}_t$$

So the objective function for $\gamma$ is

$$\min_{\gamma_1, \ldots, \gamma_N} \ \zeta \times \frac{1}{N_c} \sum_{i=1}^{N} \gamma_i^2 \;+\; (1 - \zeta) \times \Bigl\| \frac{1}{N_c} \sum_{i=1}^{N} (1 - W_i)\, \gamma_i\, X_i - \overline{X}_t \Bigr\|^2$$

where $\zeta \in (0, 1)$ is a tuning parameter, e.g., $1/2$.


26
Suppose that the conditional expectation of $Y_i(0)$ given $X_i$ is linear:

$$\mu_0(x) = \beta^\top x$$

AIW estimate $\beta$ using the lasso or elastic net:

$$\min_\beta \sum_{i: W_i = 0} \bigl(Y_i - \beta^\top X_i\bigr)^2 + \lambda \left( \alpha \sum_{k=1}^{p} |\beta_k| + (1 - \alpha) \sum_{k=1}^{p} \beta_k^2 \right)$$

27
A standard estimator for the average effect for the treated would be

$$\hat\tau = \overline{Y}_t - \overline{X}_t^\top \hat\beta$$

A simple weighting estimator would be

$$\hat\tau = \overline{Y}_t - \sum_{i=1}^{N} (1 - W_i)\, \gamma_i\, Y_i$$

The residual balancing estimator for the average effect for the treated is

$$\hat\tau = \overline{Y}_t - \left( \overline{X}_t^\top \hat\beta + \sum_{i=1}^{N} (1 - W_i)\, \gamma_i \bigl(Y_i - X_i^\top \hat\beta\bigr) \right)$$

28
• does not require estimation of the propensity score.

• relies on approximate linearity of the regression function.

29
7. Comparison of Estimators

1. Methods based on Outcome Modeling

(a) Generalized Linear Models (Linear and Logistic Models)

(b) Random Forests

(c) Neural Nets

2. Methods based on Propensity Score Modeling

3. Doubly Robust Methods

30
Experimental CPS PSID
est s.e. est s.e. est s.e.
DIM 1.79 (0.67) -8.50 (0.58) -15.20 (0.66)
BCM 1.90 () 2.35 () 1.47 ()

Outcome Models
L 1.00 (0.57) 0.69 (0.60) 0.79 (0.60)
RF 1.73 (0.58) 0.92 (0.6) 0.06 (0.63)
NN 2.07 (0.59) 1.43 (0.59) 2.12 (0.59)

Propensity Score Weighting


L 1.81 (0.83) 1.18 (0.77) 1.26 (1.13)
RF 1.78 (0.94) 0.65 (0.77) -0.46 (1.00)
NN 1.92 (0.87) 1.26 (0.93) 0.10 (1.28)

Double Robust Methods


L 1.80 (0.67) 1.27 (0.65) 1.50 (0.97)
RF 1.84 (0.8) 1.46 (0.63) 1.34 (0.85)
NN 2.15 (0.74) 1.52 (0.75) 1.14 (1.08)
31
32
Estimator rmse (rmse rank) bias sdev coverage

Difference in Means 0.62 9 -0.29 0.55 0.90


Bias Corrected Matching 0.64 10 -0.08 0.64

Outcome Models
Linear 0.56 2 -0.06 0.56 0.90
Random Forest 0.58 4 -0.15 0.56 0.89
Neural Nets 0.65 11 -0.17 0.63 0.85

Propensity Score Weighting


Linear 0.56 3 -0.04 0.56 0.99
Random Forest 0.60 7 -0.17 0.58 0.99
Neural Nets 0.59 5 -0.11 0.58 0.99

Double Robust Methods


Linear 0.56 1 -0.04 0.56 0.95
Random Forest 0.60 8 -0.08 0.60 0.95
Neural Nets 0.59 6 -0.09 0.59 0.95
33
Estimator rmse rank bias sdev coverage
Difference in Means 10.50 11 -10.49 0.40 0.00
Bias Corrected Matching 0.71 7 -0.37 0.61 0.00

Outcome Models
Linear 0.77 8 -0.62 0.45 0.67
Random Forest 0.80 9 -0.67 0.44 0.62
Neural Nets 0.51 3 -0.10 0.50 0.89

Propensity Score Weighting


Linear 0.66 6 -0.47 0.46 0.95
Random Forest 0.89 10 -0.77 0.45 0.86
Neural Nets 0.52 4 0.09 0.51 0.98

Double Robust Methods


Linear 0.64 5 -0.45 0.45 0.84
Random Forest 0.50 1 -0.13 0.48 0.92
Neural Nets 0.50 2 0.01 0.50 0.96
34
185 Treated, 2,490 Controls; 945 Treated, 12,450 Controls
Estimator rmse rank bias sdev cov rmse rank bias s
DIM 15.18 12 -15.17 0.48 0.00 15.17 11 -15.17 0
BCM 0.88 8 0.42 0.77 0.00 0.51 7 0.38 0

Outcome Models
L 0.57 1 0.09 0.56 0.88 0.27 1 0.09 0
RF 0.97 10 -0.79 0.57 0.52 0.63 10 -0.57 0
NN 1.20 11 0.85 0.85 0.48 0.62 9 0.50 0

Propensity Score Weighting


L 0.67 2 -0.01 0.67 0.98 0.29 2 -0.02 0
RF 0.91 9 -0.65 0.64 0.94 0.53 8 -0.44 0
NN 0.83 7 -0.40 0.73 0.96 0.31 3 0.06 0

Double Robust Methods


L 0.72 4 0.27 0.67 0.93 0.38 6 0.23 0
RF 0.70 3 0.11 0.69 0.91 0.33 4 0.07 0
NN 0.76 6 0.35 0.67 0.90 0.34 5 0.17 0
35
CPS Original Estimator Ave Simulations Standard Dev
RMSE Bias sdev RMSE Bias sdev RMSE Bias

DIM 10.50 -10.49 0.40 10.46 -10.45 0.38 0.52 0.52


BCM 0.71 -0.37 0.61 0.78 -0.11 0.59 0.15 0.55

Outcome Models
L 0.77 -0.62 0.45 0.89 -0.66 0.42 0.40 0.61
RF 0.80 -0.67 0.44 0.66 -0.50 0.40 0.19 0.24
NN 0.51 -0.10 0.50 0.50 -0.10 0.46 0.05 0.17

Propensity Score Weighting


L 0.66 -0.47 0.46 0.66 -0.44 0.43 0.24 0.35
RF 0.89 -0.77 0.45 0.74 -0.60 0.40 0.23 0.27
NN 0.52 0.09 0.51 0.46 0.06 0.45 0.05 0.06

Double Robust Methods


L 0.64 -0.45 0.45 0.66 -0.44 0.43 0.23 0.34
RF 0.50 -0.13 0.48 0.45 -0.09 0.43 0.05 0.13
NN 0.50 0.01 0.50 0.45 -0.02 0.44 0.05 0.05
Causal Inference
and Machine Learning

Guido Imbens – Stanford University

Lecture 4:

Heterogenous Treatment Effects

Potsdam Center for Quantitative Research


Tuesday September 10th, 13.15-14.45
Heterogenous Treatment Effects

• Given experimental data with binary treatment, how can we


flexibly estimate the average effect conditional on pretreat-
ment variables, in settings with a large number of pretreat-
ment variables, and large samples?

• Adapt machine learning / supervised learning methods de-


signed for prediction.

• Focus mainly on tree methods, because they lead to the


partitioning of the covariate space into interpretable subsets
with approximately constant conditional treatment effects.

1
Potential Outcome Set Up for Causal Inference

Binary treatment Wi ∈ {0, 1}, randomly assigned.

Pair of potential outcomes (Yi(0), Yi(1))

Vector-valued pre-treatment variable Xi

Observe for a random sample from a large population the triple $(W_i, Y_i^{obs}, X_i)$, where

$$Y_i^{obs} = Y_i(W_i) = \begin{cases} Y_i(0) & \text{if } W_i = 0, \\ Y_i(1) & \text{if } W_i = 1. \end{cases}$$

Unit-level treatment effect and conditional average treatment


effect are

τi = Yi(1) − Yi(0), τ (x) = E[Yi (1) − Yi(0)|Xi = x]


2
Application: effect of placement of answers to search queries
on screen on click rates.

• Units are search queries.

• Treatment is moving the answer that is rated first by the


search engine algorithm from the first place on the screen to
the third place on the screen.

• Outcome is indicator that the answer that is rated first by


search is clicked on (“click rate”).

• Pre-treatment variables are characteristics of the search


queries. Many of these are binary indicators, e.g., is the query
about consumer electronics, celebrities, clothes and shoes, is
it from the Safari operating system, is it about movie times,
is it for images, etc.
3
Application (ctd)

• 499,486 search queries.

• 60 pre-treatment variables

• Experimental estimate of overall average effect on click-


through rate: τ̂ = Y t − Y c = −0.13.

Moving answer down the list lowers substantially the click-


through rate.

4
Question: are there search queries where the effect is small
or large relative to this?

• If I search for “ebay” (and my algorithm ranks “ebay.com”


first) it probably does not make much difference whether I
put ebay.com on the first line or the fifth line, people know
where they want to go.

• If I search for “econometrics textbook” and the algorithm


ranks ‘mostly harmless” first, it probably does make a differ-
ence for the click-through rate whether I actually put “mostly
harmless” on the first line or on the fifth line.

5
Naive Solution

• If all we had was one or two binary covariates, we would


just partition the sample by covariate values and estimate the
average causal effects for each subsample and we would be
done.

• Too many covariates to do that: $2^{60}$ different cells.

• No strong prior beliefs on where ranking matters.

Approach: be flexible / nonparametric about estimating


τ (x).

6
Regression Trees (conventional, non-causal, set up, e.g.,
Breiman book - also possible to use lasso or other flexible
prediction methods)

Suppose we have a random sample of (Xi, Yi), i = 1, . . . , N ,


and we wish to estimate µ(x) = E[Yi|Xi = x].

Trees recursively partition covariate space into “leaves.” Within


a leaf the average outcome is estimated as the subsample av-
erage (could do something more sophisticated, model-based,
within leaves).

• trees are easy to interpret.

7
Start with a single “leaf.” The predicted outcome is $\hat\mu(x) = \overline{Y}$. The average in-sample squared error is

$$Q(\hat\mu) = \frac{1}{N} \sum_{i=1}^{N} (Y_i - \hat\mu(X_i))^2$$

Next we consider splitting this leaf into two leaves, in order


to optimize in-sample fit.

We need to choose which covariate we split on, and what


threshold we split at. Split covariate $k$ at threshold $c$:

Leaf 1: $X_{ik} \le c$

Leaf 2: $X_{ik} > c$

8
Consider splitting leaf into two parts, depending on whether
kth covariate is below or above threshold c: Now the pre-
dicted outcome is
$$\hat\mu(x) = \begin{cases} \overline{Y}_L & \text{if } x_k \le c, \\ \overline{Y}_H & \text{if } x_k > c, \end{cases}$$

where

$$\overline{Y}_L = \frac{1}{N_L} \sum_{i: X_{ik} \le c} Y_i, \qquad \overline{Y}_H = \frac{1}{N_H} \sum_{i: X_{ik} > c} Y_i$$

Choose the covariate to split on and the threshold to mini-


mize

$$Q(\hat\mu) = \frac{1}{N} \sum_{i=1}^{N} (Y_i - \hat\mu(X_i))^2$$
9
Next, consider splitting either of the two leaves, and find the
optimal threshold, the optimal covariate and the optimal leaf
to split.

Keep doing this to minimize

Q(µ̂) + α · M

where M is the number of leaves.

The penalty rate $\alpha$ is chosen by out-of-sample cross-validation to avoid over-fitting.

• many variations on simple trees, boosting, bagging, random


forests. All tend to work better in terms of out-of-sample
performance than kernel regression.
10
Alternative Representation of Goodness-of-Fit Measure

We compare two possible splits leading to estimates µ̂1 and


µ̂2 by comparing Q(µ̂1) and Q(µ̂2), where

$$Q(\hat\mu) = \frac{1}{N} \sum_{i=1}^{N} (Y_i - \hat\mu(X_i))^2$$

If the models include an intercept, as they usually do, most estimation methods would ensure that the average of $(Y_i - \hat\mu(X_i)) \cdot \hat\mu(X_i)$ is equal to zero.

Then we get an identical ranking by comparing $\tilde Q(\hat\mu_1)$ and $\tilde Q(\hat\mu_2)$, where

$$\tilde Q(\hat\mu) = -\frac{1}{N} \sum_{i=1}^{N} \hat\mu(X_i)^2.$$
11
Trees for Causal Effects

• We would like to construct trees for τ (x)

• Problem 1: we do not observe τi = Yi(1) − Yi(0), so cannot


directly use standard tree methods. Need other estimation
methods.

• Problem 2: given two candidate estimators, trees or lasso or


otherwise, we cannot directly use out-of-sample comparisons
between methods because we do not observe τi = Yi(1) −
Yi(0). Need other validation methods.

12
Simple, Non-causal, Solutions to First Problem

Solution I (single tree): use conventional tree methods


to construct tree for µ(w, x) = E[Yiobs|Wi = w, Xi = x] and
estimate

τ̂ (x) = µ̂(1, x) − µ̂(0, x)

May never split on treatment w. E.g., Imai and Ratkovic


(2013), Foster, Tailor, Ruberg (2011)

13
Solution II (two trees): construct separate trees for µ0(x) =
E[Yiobs|Wi = 0, Xi = x] and µ1(x) = E[Yiobs|Wi = 1, Xi = x],
and estimate

τ̂ (x) = µ̂1(x) − µ̂0(x)

µw (x) may vary much more with pretreatment variables than


τ (x).

Still second problem: How do we compare the two


methods in test sample?

14
Insight

Define

$$Y_i^* = Y_i^{obs} \cdot \frac{W_i - \mathbb{E}[W_i]}{\mathbb{E}[W_i] \cdot (1 - \mathbb{E}[W_i])}$$

$Y_i^*$ is unbiased for the treatment effect $Y_i(1) - Y_i(0)$ (based on a single observation!), but quite noisy.

Then

$$\tau(x) = \mathbb{E}[Y_i^* | X_i = x]$$

15
Generalization to observational studies with unconfounded-
ness:

$$Y_i^* = Y_i^{obs} \cdot \frac{W_i - e(X_i)}{e(X_i) \cdot (1 - e(X_i))}$$

where $e(x)$ is the propensity score:

$$e(x) = \text{pr}(W_i = 1 | X_i = x)$$

This suggests an out-of-sample goodness-of-fit measure:

$$Q_1(\hat\tau) = \frac{1}{N} \sum_{i=1}^{N} \bigl(Y_i^* - \hat\tau(X_i)\bigr)^2$$

16
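A sketch of this idea in Python, anticipating Solution III below: fit an off-the-shelf regression tree to (X_i, Y_i*) and evaluate Q_1 on a test sample; the minimum leaf size is an illustrative tuning choice.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def transformed_outcome_tree(y, w, X, e_hat, X_test, y_star_test):
    y_star = y * (w - e_hat) / (e_hat * (1 - e_hat))
    tree = DecisionTreeRegressor(min_samples_leaf=100).fit(X, y_star)
    tau_hat = tree.predict(X_test)
    Q1 = np.mean((y_star_test - tau_hat) ** 2)   # out-of-sample criterion
    return tree, Q1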
Alternative out-of-sample goodness-of-fit measure based on
matching. Replace τi by

$$\hat\tau_i = (2 \cdot W_i - 1) \cdot \bigl(Y_i^{obs} - Y_{\ell(i)}^{obs}\bigr),$$

where $\ell(i)$ is the closest match:

$$\ell(i) = \arg\min_{j: W_j \ne W_i} \|X_i - X_j\|$$

Then use

$$Q_2(\hat\tau) = \frac{1}{N} \sum_{i=1}^{N} \bigl(\hat\tau_i - \hat\tau(X_i)\bigr)^2 = \frac{1}{N} \sum_{i=1}^{N} \Bigl( (2 \cdot W_i - 1) \cdot \bigl(Y_i^{obs} - Y_{\ell(i)}^{obs}\bigr) - \hat\tau(X_i) \Bigr)^2$$
17
Solution III (transformed outcome tree): Use conven-
tional tree methods to construct tree based on (Xi, Yi∗) data
(discarding Wi).

(Not necessarily efficient: suppose V(Yi(w)|Xi = x) is very


small, but treatment effect is substantial and heterogenous.
Then Solutions I and II will be better than Solution III.)

18
Solution IV (causal tree 1):

Start with a single leaf. Consider splitting it based on a


particular covariate and a particular threshold, leading to two
potential new leaves.

Estimate within each potential leaf the average treatment


effect, as well as the overall average treatment effect:

$$\hat\tau = \overline{Y}^{obs}_1 - \overline{Y}^{obs}_0, \qquad \hat\tau_H = \overline{Y}^{obs}_{H,1} - \overline{Y}^{obs}_{H,0}, \qquad \hat\tau_L = \overline{Y}^{obs}_{L,1} - \overline{Y}^{obs}_{L,0},$$

with

$$\overline{Y}^{obs}_w = \frac{1}{N_w} \sum_{i: W_i = w} Y_i^{obs}, \quad \overline{Y}^{obs}_{L,w} = \frac{1}{N_{L,w}} \sum_{i: W_i = w, X_i \le c} Y_i^{obs}, \quad \overline{Y}^{obs}_{H,w} = \frac{1}{N_{H,w}} \sum_{i: W_i = w, X_i > c} Y_i^{obs}$$

If NL,w or NH,w is zero for w = 0, 1, we do not consider this


potential split.
19
To assess the improvement of goodness of fit we would like
to calculate the difference
 
$$\sum_{i=1}^{N} (\tau_i - \hat\tau)^2 - \left( \sum_{i: X_i \le c} (\tau_i - \hat\tau_L)^2 + \sum_{i: X_i > c} (\tau_i - \hat\tau_H)^2 \right).$$

This is not feasible because we do not observe $\tau_i$. We replace $\tau_i$ by $Y_i^*$, which is unbiased for $\tau_i$, and calculate the difference

$$Q_1(\hat\tau) = \sum_{i=1}^{N} (Y_i^* - \hat\tau)^2 - \left( \sum_{i: X_i \le c} (Y_i^* - \hat\tau_L)^2 + \sum_{i: X_i > c} (Y_i^* - \hat\tau_H)^2 \right)$$

The difference with Solution III is that $\hat\tau_H$ and $\hat\tau_L$ are not calculated as the average of $Y_i^*$ within the leaves, but as the difference in average outcomes by treatment status.

20
Solution V (causal tree 2):

Same approach to leaf splitting, but now with a modified criterion, along the lines of

$$\tilde Q(\hat\mu) = -\frac{1}{N} \sum_{i=1}^{N} \hat\mu(X_i)^2.$$

Now we choose the split to minimize

$$\tilde Q_1(\hat\tau) = \frac{1}{N} \cdot \left( N \cdot \hat\tau^2 - \bigl( N_L \cdot \hat\tau_L^2 + N_H \cdot \hat\tau_H^2 \bigr) \right)$$

Does not rely on transformed outcome, less noisy.

21
Application

Training sample: $N_{train} = 249{,}742$. Test sample: $N_{test} = 249{,}742$.

Single Two Transf. Causal Modified


Tree Trees Outc. Tree Tree Causal Tree

OOS I 0.8053 0.8053 0.8046 0.8048 0.8044


OOS II 0.3111 0.3111 0.3107 0.3106 0.3105

# Leaves 52 36 26 21 21 24

22
Correlation Between Predictions and Yi∗ in Test Sample

Yi∗ Single Two Transf. Causal Mod.


Tree Trees Outc. Tree Causal

Yi∗ 1.000 0.026 0.027 0.034 0.030 0.037


Single Tree 0.026 1.000 0.963 0.638 0.729 0.664
Two Trees 0.027 0.963 1.000 0.671 0.734 0.685
Transf. Outc. 0.034 0.638 0.671 1.000 0.733 0.864
Causal Tree 0.030 0.729 0.734 0.733 1.000 0.791
Mod Causal 0.037 0.664 0.685 0.864 0.791 1.000

23
leaf training sample test sample
est se share est se share
1 -0.1235 0.0036 0.2022 -0.1236 0.0036 0.2018
2 -0.1339 0.0099 0.0247 -0.1349 0.0102 0.0240
3 -0.0099 0.0044 0.0129 -0.0073 0.0044 0.0132
4 -0.2148 0.0128 0.0214 -0.2467 0.0126 0.0216
5 -0.1453 0.0030 0.3049 -0.1480 0.0030 0.3044
6 -0.1109 0.0056 0.0628 -0.1100 0.0055 0.0635
7 -0.2303 0.0283 0.0036 -0.2675 0.0284 0.0037
8 -0.0575 0.0096 0.0165 -0.0324 0.0095 0.0168
9 -0.0868 0.0307 0.0026 -0.0559 0.0294 0.0025
10 -0.1505 0.0048 0.1191 -0.1693 0.0047 0.1191
11 -0.1741 0.0236 0.0045 -0.1682 0.0239 0.0046
12 0.0255 0.1267 0.0003 0.2857 0.1235 0.0002
13 -0.0297 0.0264 0.0019 -0.0085 0.0250 0.0022
14 -0.1352 0.0142 0.0106 -0.1139 0.0147 0.0100
15 -0.1591 0.0552 0.0010 -0.1432 0.0526 0.0011
16 -0.0135 0.0260 0.0005 0.0080 0.0502 0.0004
17 -0.0809 0.0118 0.0131 -0.0498 0.0124 0.0132
18 -0.0453 0.0231 0.0014 -0.0454 0.0208 0.0014
19 -0.1694 0.0158 0.0105 -0.1997 0.0162 0.0106
24
20 -0.2072 0.0304 0.0031 -0.2790 0.0305 0.0030
21 -0.0955 0.0106 0.0226 -0.0834 0.0108 0.0223
Substantial variation in conditional treatment effects

Let’s look at two specific leaves out of the 24.

Leaf 3: -0.0073 (s.e. 0.0044), proportion 0.0132


What is the query in this leaf: Image & Celebrity

If I search for “image of Chuck Manski” it does not matter


whether the image is ranked first or third.

Leaf 4: -0.2467 (s.e. 0.0126), proportion 0.0216


Not image & not search bot & navigation & wikipedia refer-
ence.

If I search for “machine learning” or “instrumental variables”


the ranking may be very important.
25
• leaves defined through interactions, not through simple main effects; even second order effects may not be sufficient to capture all effects of interest.

• interpretable leaves.

26
Simulations

$$Y_i(w) = \sum_{k=1}^{5} X_{ik} \cdot \beta_{kw} + \varepsilon_{iw}$$

$$\beta_0 = (1, -1, 0, 0, 0)^\top \qquad \beta_1 = (-1, 1, 0, 0, 0)^\top$$

$$\varepsilon_{i0} = \varepsilon_{i1} \sim \mathcal{N}(0, 0.4^2)$$

$$X_{ik} \sim \text{Bernoulli}(0.5) \qquad W_i \sim \text{Bernoulli}(0.4)$$

$$N^{train} = N^{test} = 100$$


27
Simulation Results

Single Two Transformed Causal Modified


Tree Trees Outcome Tree Tree Causal Tree

OOS I 1.31 1.29 1.28 1.25 1.24


OOS II 0.45 0.43 0.42 0.40 0.39
Oracle 0.14 0.12 0.13 0.10 0.07

# Leaves 12.7 5.1 8.4 5.6 6.7 5.8

28
Causal Inference
and Machine Learning

Guido Imbens – Stanford University

Lecture 5:

Experimental Design and Multi-armed Bandits

Potsdam Center for Quantitative Research


Tuesday September 10th, 15.15-16.45
Outline

1. Re-randomization

2. Experiments in Networks

3. Multi-armed Bandits

1
1. Re-Randomization

Sometimes researchers randomize assignment to treatment,


then assess the (im)balance the specific assignment would gen-
erate, and decide to re-randomize if the initial assignment failed
to lead to sufficient balance.

What to make of that?

Re-randomization can improve precision of estimates and


power of tests considerably, but needs to be done carefully
to maintain ability to do inference.

2
Re-randomization is conceptually similar to completely
randomized experiment:

Consider a sample of 2N units.

Randomize treatment to each unit by flipping a fair coin.

Re-randomize till the number of treated units is exactly equal


to N .

This leads to the same design as randomly selecting N units


for assignment to treatment in a completely randomized ex-
periment.

3
Formal Analysis of Re-Randomization

Suppose we have 2N units. We observe a K-vector of covari-


ates Xi. Without taking into account the covariate values, N
units are randomly selected to receive the treatment, and the
remaining units are assigned to the control group.

Calculate

$$\overline{X}_w = \frac{1}{N} \sum_{i: W_i = w} X_i, \qquad t_X = \frac{\overline{X}_1 - \overline{X}_0}{\sqrt{s^2_{X,0}/N + s^2_{X,1}/N}}$$

What to do if |tX | is large, if discovered before assignment is


implemented?

4
Two Cases

• Decide a priori to randomize M times, and implement as-


signment vector that minimizes some criterion e.g., mini-
mize the maximum of the t-statistics for the K covariates.

• Re-randomize until the criterion meets some threshold: e.g.,


with two covariates, until both t-statistics are below 1.
(need to be careful here: the threshold should be feasible).

Key:
1. articulate strategy a priori, so randomization inference is
possible.

2. Do not search over all assignments for optimal value for


criterion because then there is little randomness left.
5
Cautionary Note

• Suppose with 2N units, Xi earnings, 2N − 1 units have


Xi ∈ [0, 10], and one unit has Xi = 1000.

• Minimizing the t-statistic leads to one treatment group containing the individual with $X_i = 1000$ and the $N - 1$ individuals with the lowest earnings, and the other group containing the $N$ richest individuals after the very richest individual.

• Irrespective of the design, estimation of the average effect is difficult.

• Rank-based tests may still have substantial power.

• Maybe remove outlier unit for estimation purposes.

6
Conclusion

• Instead of re-randomization, lay out acceptable set of random


assignments.

7
2. Experiments in Networks

• We are interested in testing complex hypotheses on causal


effects in settings where individuals are connected through a
network and there may be spillovers.

Bond, Fariss, Jones, Kramer, Marlow, Settle and Fowler (“A


61-million-person experiment ...”, 2012) write:

“Furthermore, the messages not only influenced the users who received them but also the users’ friends, and friends of friends.”

Christakis and Fowler (2008, “the spread of obesity in a large


social network”) claim changes in weight spread beyond friends.
8
How can we test such claims, in the presence of unre-
stricted homophily, in a single network?
Clearly there is some evidence in the data

Compare two individuals, both in the control group, both with


one friend, one with a treated friend, one with a friend in the
control group.

Finding a correlation between outcomes for egos and treatment


for alters is evidence of spillovers.

• Does that rely on large sample approximations?

• Can we test hypotheses about friends of friends?

9
2.1 Causal Effects and Potential Outcomes

We have a finite population P with N units. These units may


be linked through a network with adjacency matrix A. We also
measure covariates on the individuals, with X the matrix of
covariates.

The units are exposed to a treatment W, where W is an N -


vector with ith element Wi. W takes on values in W.

For each unit there is a set of potential outcomes Yi(w), one


for each w ∈ W. We observe Yiobs = Yi(W).

Causal effects are comparisons $Y_i(\mathbf{w}) - Y_i(\mathbf{w}')$ for any pair $\mathbf{w} \ne \mathbf{w}' \in \mathbb{W}$.
10
Most of the literature: Stable-unit-treatment-value-assumption
(sutva, Rubin), so that Yi(w) depends only on own treatment
wi .

Without sutva there are lots of causal effects.

• Here we focus on exact tests whether these spillovers are


present and detectable.

• Ultimately tests are not that interesting on their own, but


it demonstrates that the randomization allows researcher to
detect these effects.

11
2.2 Three Null Hypotheses of Interest

No treatment effects:
$Y_i(\mathbf{w}) = Y_i(\mathbf{w}')$ for all units $i$, and all pairs of assignments $\mathbf{w}, \mathbf{w}' \in \mathbb{W}$.
(straightforward because this hypothesis is sharp)

No spillover effects (but own treatment effects):
$Y_i(\mathbf{w}) = Y_i(\mathbf{w}')$ for all units $i$, and all pairs of assignment vectors $\mathbf{w}, \mathbf{w}' \in \mathbb{W}$ such that $w_i = w_i'$.

No higher order effects (but effects of own treatment and friends' treatment):
$Y_i(\mathbf{w}) = Y_i(\mathbf{w}')$ for all units $i$, and for all pairs of assignment vectors $\mathbf{w}, \mathbf{w}' \in \mathbb{W}$ such that $w_j = w_j'$ for all units $j$ such that $d(i, j) < 2$ (distance in the network).
12
Problem with second and third null hypothesis is that they are
not sharp:

We see one pair $(\mathbf{w}, \mathbf{Y}(\mathbf{w}))$; based on that we cannot infer the value of $\mathbf{Y}(\mathbf{w}')$ for some other $\mathbf{w}'$ under the null.

Using the standard approach, we cannot calculate exact p-


values without that.

Aronow (2012) shows how to calculate p-values for second


hypothesis.

We develop general way to deal with hypotheses like the third


one, as well as others.

13
2.3 What Not To Do

Bond et al (2012) focus on the statistic that averages ego


outcomes over friendships with treated alter and control alter:

$$T = \frac{1}{N_1} \sum_{i,j: G_{ij} = 1} Y_i^{obs} \cdot W_j \;-\; \frac{1}{N_0} \sum_{i,j: G_{ij} = 1} Y_i^{obs} \cdot (1 - W_j)$$

They then calculate p-values as if the null hypothesis is that


of no treatment effects whatsoever (which is sharp so they can
evaluate the distribution of the statistic).

This is not valid: the rejection rates under the null for a 5%
test can be much larger than 5%.

14
Randomization inference:

Specify sharp null hypothesis of direct effects, no spillovers:

$$Y_i(w_i, \mathbf{w}) = Y_i(0, \mathbf{0}) + \tau w_i,$$

Possible alternative hypothesis that is also sharp (for some value of $\beta$):

$$Y_i(w_i, \mathbf{w}) = Y_i(0, \mathbf{0}) + \tau w_i + \tilde A_i' \mathbf{w}\, \beta,$$

15
2.4 Artificial Experiments

Think of an experiment E as a combination of a set of treat-


ment values W, a population P of units characterized by poten-
tial outcomes, and a distribution of assignments, p : W 7→ (0, 1).

We will analyze a different, artificial, experiment.

Take a subset of units PF , the focal units. We will only use


outcome data for these individuals.

Now for individual $i$ in the focal group, given the actual treatment $\mathbf{W}$, figure out the set of treatments $\mathbb{W}_i(\mathbf{W}, H_0)$ that would lead to the same outcome under the null hypothesis.

If the null is no effect of the treatment whatsoever, then


Wi (W, H0) = W. If H0 allows for own treatment effects, but
no spillovers, Wi(W, H0) = {w ∈ W|wi = Wi}.
16
Take the union over all focal individuals:

$$\mathbb{W}_R = \bigcup_{i \in P_F} \mathbb{W}_i(\mathbf{W}, H_0).$$

The new assignment probability is

$$p'(\mathbf{w}) = \frac{p(\mathbf{w})}{\sum_{\mathbf{w}' \in \mathbb{W}_R} p(\mathbf{w}')}$$

Analyze the experiment $E' = (\mathbb{W}_R, P_F, p'(\cdot))$.

This artificial experiment is valid, because of the randomization


underlying E, and the null hypothesis is now sharp.

17
Example I: $H_0$ is no treatment effect whatsoever. $P_F = P$, $E' = (\mathbb{W}, P, p(\cdot)) = E$.

Example II: $H_0$ is no spillover effects. Choose $P_F \subset P$ arbitrarily. Then the restricted set of assignments is

$$\mathbb{W}_R(\mathbf{W}, P_F) = \{\mathbf{w} \in \mathbb{W} \,|\, w_i = W_i \text{ for all } i \in P_F\},$$

with assignment probabilities

$$p'(\mathbf{w}) = \frac{p(\mathbf{w})}{\sum_{\mathbf{w}' \in \mathbb{W}_R} p(\mathbf{w}')}$$

The artificial experiment is

$$E' = (\mathbb{W}_R, P_F, p'(\cdot))$$
18
The artificial experiment takes a set of focal units, and looks
at randomization distribution induced by changing assignment
for a set of auxiliary units.

• for conventional null of no effects, set of auxiliary units is


identical to set of focal units

• in the Aronow case of the test for no spillovers, the population partitions into a set of focal units and a set of auxiliary units

• with high order spillover null, the set of auxiliary units is a


subset of the complement of set of focal units: the population
partitions into set of focal units, set of auxiliary units, and the
rest.

19
Statistics

Any statistic $T: \mathbb{W}_R \times \mathbb{Y}^N \mapsto \mathbb{R}$ is valid as a statistic.

Edge-level contrast: average outcome for focal egos with non-focal treated alters minus average outcome for focal egos with non-focal control alters (where $F_i$ is the indicator for unit $i$ being focal, $i \in P_F$):

$$T = \frac{1}{N_1} \sum_{i,j: G_{ij} = 1} Y_i^{obs} \cdot F_i \cdot (1 - F_j) \cdot W_j \;-\; \frac{1}{N_0} \sum_{i,j: G_{ij} = 1} Y_i^{obs} \cdot F_i \cdot (1 - F_j) \cdot (1 - W_j)$$

20
Score statistic based on the linear model

$$Y_i^{obs} = \alpha_0 + \alpha_w \cdot W_i + \alpha_y \cdot \overline{Y}^{obs}_{(i)} + \varepsilon_i$$

where $G$ is the row-normalized adjacency matrix:

$$T_{score} = \frac{1}{N_F} \sum_{i \in P_F} \Bigl\{ Y_i^{obs} - \overline{Y}^{obs}_{F,0} - W_i \cdot \bigl( \overline{Y}^{obs}_{F,1} - \overline{Y}^{obs}_{F,0} \bigr) \Bigr\} \times \Bigl( \sum_{j=1}^{N} G_{ij} \cdot W_j - \overline{G \mathbf{W}} \Bigr)$$

$T_A$ is the average of the indicator of having at least one treated friend.

21
2.5 Some Simulations

We took a network of high school friends from AddHealth (599)


individuals.

$$Y_i(\mathbf{0}) \sim \mathcal{N}(0, 1), \text{ independent across all units}$$

$$Y_i(\mathbf{w}) = Y_i(\mathbf{0}) + w_i \cdot \tau_{direct} + \frac{K_{i,1}}{K_i} \cdot \tau_{spill}.$$

$K_i$ and $K_{i,1}$ are the number of friends and the number of treated friends.

Focal units are selected at random, to maximize number of


contrasts between focal and auxiliary units, or based on epsilon
nets.
22
Own Spillover Focal Node Selection
Network Statistic Effect Effect Random ε-net δN,i

AddHealth Tscore 0 0 0.059 0.056 0.045


Telc 0 0 0.058 0.054 0.044
TA 0 0 0.059 0.039 0.046

Tscore 4 0 0.056 0.053 0.051


Telc 4 0 0.051 0.048 0.059
TA 4 0 0.050 0.053 0.051

Tscore 0 0.4 0.362 0.463 0.527


Telc 0 0.4 0.174 0.299 0.413
TA 0 0.4 0.141 0.296 0.327

Tscore 4 0.4 0.346 0.461 0.529


Telc 4 0.4 0.083 0.102 0.123
TA 4 0.4 0.069 0.088 0.116

23
• Also looked at test for second order spillover effect.

• The power there may be very low.

• lots of design questions: proportion of focal individuals, dis-


tribution of focal individuals through network.

24
Rejection Rates of Null Hypothesis of No Spillovers
Beyond the First Order Spillovers from the Sparsified Network

Proportion of Links Dropped


Network Statistic αw αspill λ q = 0.9 q = 0.5

AddHealth Tcorr 4 0.4 0 0.047 0.046


Telc 4 0.4 0 0.048
Tcorr 4 0.1 0 0.050 0.049
Telc 4 0.1 0 0.046

Tcorr 4 0.4 0.5 0.216 0.120


Telc 4 0.4 0.5 0.059
Tcorr 0 0.4 0.5
Telc 0 0.4 0.5 0.123 0.087
Tcorr 4 0.1 0.5 0.059 0.061
25
3. Multi-armed Bandits

• In many cases we wish to evaluate multiple treatments:


putting the button on the left or on the right, making it green
or red, making it big or small.

• We could run experiments with multiple treatments and test


various null hypotheses.

• This is cumbersome, and not effective for answering the


question: which is the best treatment out of a set.

26
Suppose there are K treatments, with binary outcomes Yi ∼
B(1, pk ).

We are interested in identifying the treatment arm k with the


highest value of pk .

Suppose we start by observing 100 draws for each arm, and


get p̂k for each arm. Then our best guess is the arm with the
highest p̂k .

Now suppose we have the opportunity to allocate another 1000


units to these K treatment arms, how should we do that?

E.g., p̂1 = 0.10, p̂2 = 0.80, p̂3 = 0.81, p̂4 = 0.70

27
Allocating a lot of units to treatment arm 1 does not serve
much of a purpose: it is unlikely that arm 1 is the best arm.

To learn about the optimal arm, we should assign more units


to treatment arms 2, 3 and 4.

But: how many units to each?

Should we assign a lot to arm 4?

28
Thompson Sampling and Upper Confidence Bound Meth-
ods

Two approaches to determining assignment for next unit.

In both cases we assign more units to arms that look promising,


in slightly different ways.

1. Thompson sampling: calculate posterior probability that


arm k is the optimal arm, and assign to this arm with
probability proportional to that.

2. Upper Confidence Bound method: calculate confidence intervals for each $p_k$, with confidence level $\alpha_N$ ($N$ is the total sample size so far, $\alpha_N \to 1$ as $N \to \infty$).

29
Thompson Sampling

• Calculate the posterior distribution of p1, . . . , pK given prior


(say flat prior). Easy here because these are Beta posterior
distributions. In other cases this may require some numerical
approximations.

• Allocate to each arm with probability proportional to the posterior probability that $p_k = \max_{m=1}^{K} p_m$. Easy to implement by drawing $p_k$ from its posterior for each $k$, and then assigning the next unit to the arm with the highest draw.

30
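A sketch of Thompson sampling for the binomial bandit in Python, with flat Beta(1, 1) priors; the arm probabilities are those from the example above.

import numpy as np

def thompson_arm(successes, failures, rng):
    # one posterior draw per arm; assign the next unit to the largest draw
    draws = rng.beta(successes + 1, failures + 1)
    return int(np.argmax(draws))

rng = np.random.default_rng(0)
p_true = np.array([0.10, 0.80, 0.81, 0.70])
s = np.zeros(4)
f = np.zeros(4)
for _ in range(1000):
    k = thompson_arm(s, f, rng)
    reward = rng.random() < p_true[k]
    s[k] += reward
    f[k] += 1 - reward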
This balances exploration: learn more about the arms by al-
locating units to them, and exploitation: send more units to
arms that do well.

In the example, arm 1 does very poorly, so we don't send more units to that arm. We are not sure about the other arms, so we send units to all of them, but more to arms 2 and 3 than to arm 4.

Very effective way to experiment in settings with many


treatments, and with sequential assignment.

31
Consider a case with two arms, and p1 = 0.04, p2 = 0.05.

Consider a classical experiment with testing at the 0.05 level,


for 95% power.

We need 22,000 observations for this.

The regret is 11,165 × (0.05 − 0.04) ≈ 111 lost conversions.

Suppose we get 100 observations per day, the experiment will


take 220 days.

32
(The following material is from Steven L. Scott (Google), "Bayesian Bandits", May 11, 2016.)

How bad is equal allocation?

• Consider two arms: θ1 = .04, and θ2 = .05.

• Plan a classical experiment to detect this change with 95% power at 5% significance.

> power.prop.test(p1 = .04, p2 = .05, power = .95)
n = 11165.99
NOTE: n is number in *each* group

• We need over 22,000 observations.

• Regret is 11,165 × .01 = 111 lost conversions.

• At 100 visits per day, the experiment will take over 220 days.

[Figures: two-armed experiment, bandit shown 100 visits per day; savings vs. equal allocation in terms of time and conversions. Source: https://support.google.com/analytics/answer/2844870?hl=en]

Bandits' advantage grows with experiment size. Now consider 6 arms (formerly the limit of GA Content Experiments).

• Compare the original arm to the "best" competitor.

• Bonferroni correction says divide the significance level by 5.

> power.prop.test(p1 = .04, p2 = .05, power = .95, sig.level = .01)
n = 15307.8
NOTE: n is number in *each* group

• In theory we only need this sample size in the largest arm, but we don't know ahead of time which arm that will be.

• The experiment needs 91,848 observations.

• At 100 per day that is 2.5 years.

[Figures: 6-arm experiment, still 100 observations per day; huge savings vs. equal allocation, partly due to ending early and partly due to lower cost per day.]


Now suppose we have 6 arms, the best arm is 0.05, the second best is 0.04. We now need to test each comparison at the 1% level as a Bonferroni correction, because we do 5 tests.

We need roughly 90,000 observations, and the experiment will take 2.5 years.

Huge savings!

33
Upper Confidence Bounds

Construct confidence bound for pk with confidence level αN .


Let αN go to 1 slowly.

Pick arm with the highest upper confidence limit, and assign
next unit to that arm.

• if that is a poor arm, the upper confidence bound will shrink


relative to the others, and it will get less traffic subsequently.

34
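A sketch of the UCB rule in Python for binary outcomes; the normal-approximation bound with a sqrt(2 log N) multiplier is one common stand-in for the alpha_N -> 1 schedule on the slide.

import numpy as np

def ucb_arm(successes, trials):
    N = trials.sum()
    p_hat = successes / np.maximum(trials, 1)
    width = np.sqrt(2 * np.log(max(N, 2)) / np.maximum(trials, 1))
    ucb = p_hat + width
    ucb[trials == 0] = np.inf   # try every arm at least once
    return int(np.argmax(ucb))  # assign the next unit to this arm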
Contextual Bandits

Suppose we also have covariates Xi for each unit, and want to


find the function that assigns each unit to the treatment with
the highest expected return as a function of the covariates.

• Given a parametric model for the expected return, we can


directly use Thompson sampling.

• We may wish to build increasingly flexible models to avoid


basing assignment on a misspecified model =⇒ can use random
forests, but need to account for variation in propensity score.

35
Causal Inference
and Machine Learning

Guido Imbens – Stanford University

Lecture 6:

Synthetic Control and Matrix Completion Methods

Potsdam Center for Quantitative Research


Wednesday September 11th, 10.00-11.30
Based on:

Doudchenko, Nikolay, and Guido W. Imbens. Balancing, re-


gression, difference-in-differences and synthetic control meth-
ods: A synthesis. No. w22791. National Bureau of Eco-
nomic Research, 2016.

Arkhangelsky, Dmitry, Susan Athey, David A. Hirshberg, Guido


W. Imbens, and Stefan Wager. Synthetic difference in dif-
ferences, 2019.

Athey, Susan, Mohsen Bayati, Mohsen, Nick Doudchenko,


Guido Imbens, and Khashayar Khosravi, (2018). Matrix com-
pletion methods for causal panel data models.

0
• California’s anti-smoking legislation (Proposition 99) took
effect in 1989.

• What is the causal effect of the legislation on smoking


rates in California in 1989?

• We observe smoking rates in California in 1989 given the


legislation. We need to impute the counterfactual smok-
ing rates in California in 1989 had the legislation not been
enacted.

• We have data in the absence of smoking legislation in Cal-


ifornia prior to 1989, and for other states both before and in
1989. (and other variables, but not of essence)
Set Up: we observe (in addition to covariates):

$$\mathbf{Y} = \begin{pmatrix} Y_{11} & Y_{12} & Y_{13} & \ldots & Y_{1T} \\ Y_{21} & Y_{22} & Y_{23} & \ldots & Y_{2T} \\ Y_{31} & Y_{32} & Y_{33} & \ldots & Y_{3T} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ Y_{N1} & Y_{N2} & Y_{N3} & \ldots & Y_{NT} \end{pmatrix} \quad \text{(realized outcome)},$$

$$\mathbf{W} = \begin{pmatrix} 0 & 0 & 0 & \ldots & 0 & 0 \\ 0 & 0 & 0 & \ldots & 0 & 0 \\ 0 & 0 & 0 & \ldots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \ldots & 1 & 1 \\ 0 & 0 & 0 & \ldots & 1 & 1 \end{pmatrix} \quad \text{(binary treatment)}.$$

• rows of $\mathbf{Y}$ and $\mathbf{W}$ correspond to units (e.g., states), columns correspond to time periods (years).
In terms of potential outcome matrices $\mathbf{Y}(0)$ and $\mathbf{Y}(1)$:

$$\mathbf{Y}(0) = \begin{pmatrix} X & X & \ldots & X & X \\ X & X & \ldots & X & X \\ X & X & \ldots & X & X \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ X & X & \ldots & ? & ? \\ X & X & \ldots & ? & ? \end{pmatrix} \qquad \mathbf{Y}(1) = \begin{pmatrix} ? & ? & \ldots & ? & ? \\ ? & ? & \ldots & ? & ? \\ ? & ? & \ldots & ? & ? \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ ? & ? & \ldots & X & X \\ ? & ? & \ldots & X & X \end{pmatrix}$$

$$Y_{it} = (1 - W_{it})\, Y_{it}(0) + W_{it}\, Y_{it}(1).$$

In order to estimate the average treatment effect for the treated (or another average, e.g., the overall average effect)

$$\tau = \frac{\sum_{i,t} W_{it} \bigl(Y_{it}(1) - Y_{it}(0)\bigr)}{\sum_{i,t} W_{it}},$$

we impute the missing potential outcomes in $\mathbf{Y}(0)$.


Alternative Possible Structures on W:

Staggered adoption (e.g., adoption of technology, Athey and Stern, 1998):

$$\mathbf{W} = \begin{pmatrix} 0 & 0 & 0 & 0 & \ldots & 0 \\ 0 & 0 & 0 & 0 & \ldots & 1 \\ 0 & 0 & 0 & 0 & \ldots & 1 \\ 0 & 0 & 1 & 1 & \ldots & 1 \\ 0 & 0 & 1 & 1 & \ldots & 1 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 1 & 1 & 1 & \ldots & 1 \end{pmatrix}$$

with the first row a never adopter, the second row a late adopter, the middle rows medium adopters, and the last row an early adopter.
For part of the talk I will focus on the case with a single treated unit/time-period:

$$\mathbf{W} = \begin{pmatrix} 0 & 0 & 0 & \ldots & 0 & 0 \\ 0 & 0 & 0 & \ldots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \ldots & 0 & 0 \\ 0 & 0 & 0 & \ldots & 0 & 1 \end{pmatrix}$$

Challenge: trying to predict $Y_{NT}(0)$ based on the observed values $Y_{it}(0)$ for $(i, t) \ne (N, T)$.
In empirical studies there is a wide range of values for

• N0, the number of control units

• N1, the number of treated units

• T0, the number of pre-treatment periods

• T1, the number of post-treatment periods

This is important for guiding choice of analyses.


1. Mariel Boatlift Study (Card,1990), N1 = 1, N0 = 44, T0 =
7, T1 = 6

2. Minimum wage study (Card-Krueger 1994), N1 = 321, N0 =


78, T0 = 1, T1 = 1

3. California smoking example (Abadie, Diamond, Hainmueller,


2010) N1 = 1, N0 = 29, T0 = 17, T1 = 13

4. German unification (Abadie, Diamond, Hainmueller, 2014)


N1 = 1, N0 = 16, T0 = 30, T1 = 14

5. Lalonde study (1986) N1 = 185, N0 = 15992, T0 = 2, T1 =


1
Three related literatures on causal inference for this setting:

1. causal literature with unconfoundedness / horizontal re-


gression

2. synthetic control literature / vertical regression

3. difference-in-differences and factor models

Here: doubly robust methods that combine weighting and


outcome modeling
Unconfoundedness Methods / Horizontal Regression

Typical setting: $N_0$ and $N_1$ large, $T_0$ modest, $T_1 = 1$.

$$\mathbf{W} = \begin{pmatrix} 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}$$
Linear Model

$$\hat\tau_{UNC} = \frac{1}{N_1} \sum_{i: W_{iT} = 1} \bigl(Y_{iT}(1) - \hat Y_{iT}(0)\bigr)$$

where

$$\hat Y_{iT}(0) = \hat\alpha + \sum_{t=1}^{T-1} \hat\lambda_t Y_{it}$$

and $\hat\alpha$ and $\hat\lambda$ are estimated by least squares ("horizontal" regression):

$$\min_{\alpha, \lambda} \sum_{i=1}^{N_0} \Bigl( Y_{iT} - \alpha - \sum_{t=1}^{T-1} \lambda_t Y_{it} \Bigr)^2$$

Note: this is a regression with $N_0$ observations and $T_0$ regressors. We may need regularization if $T_0$ is big.
Fancier methods:

Matching: for each treated unit $i$ with $W_{iT} = 1$, find the closest match $j(i)$:

$$j(i) = \arg\min_{j: W_{jT} = 0} \bigl\| Y_{i,1:T_0} - Y_{j,1:T_0} \bigr\|$$

Then:

$$\hat Y_{iT}(0) = Y_{j(i),T}$$

Double robust methods:

Estimate the propensity score

$$e(y) = \text{pr}(W_{iT} = 1 | Y_{i,1:T_0} = y)$$

Estimate the conditional mean for controls:

$$\mu(y) = \mathbb{E}[Y_{iT} | W_{iT} = 0, Y_{i,1:T_0} = y]$$

Then for all treated units:

$$\hat Y_{iT}(0) = \mu(Y_{i,1:T_0}) + \frac{1}{N_0} \sum_{j: W_{jT} = 0} \frac{e(Y_{j,1:T_0})}{1 - e(Y_{j,1:T_0})} \bigl( Y_{jT} - \mu(Y_{j,1:T_0}) \bigr)$$
Abadie-Diamond-Hainmueller Synthetic Control Method

Typical setting: $T_0$ and $T_1$ modest, $N_0$ small, $N_1 = 1$.

$$\mathbf{W} = \begin{pmatrix} 0 & 0 & \ldots & 0 & 0 & \ldots & 0 \\ 0 & 0 & \ldots & 0 & 0 & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & 0 & 0 & \ldots & 0 \\ 0 & 0 & \ldots & 0 & 1 & \ldots & 1 \end{pmatrix}$$
For simplicity focus on the case with $T_1 = 1$, $T_0 = T - 1$.

ADH suggest using a weighted average of outcomes for the other states:

$$\hat Y_{NT}(0) = \sum_{j=1}^{N-1} \omega_j Y_{jT}$$

ADH restrict the weights $\omega_j$ to be non-negative, and restrict them to sum to one.

• $Y_{i,1:T_0}$ is the vector of lagged values $Y_{it}$ for $t \le T_0$.

• $X_i$ are other covariates, unit specific.

Let $Z_i$ be the vector of functions of the covariates $X_i$, possibly including some lagged outcomes $Y_{it}$ for pre-treatment periods. Let the norm be $\|a\|_V = a' V^{-1} a$, for a positive semi-definite square matrix $V$.

ADH first solve, for given $V$,

$$\omega(V) = \arg\min_\omega \Bigl\| Z_N - \sum_{i=1}^{N-1} \omega_i \cdot Z_i \Bigr\|_V$$

This finds, for a given weight matrix $V$, the optimal weights $\omega$.

• But: how do we choose $V$? Equal weights is not right; it would not be invariant to linear transformations of the covariates.

ADH find the optimal positive semi-definite $V$ by minimizing

$$\hat V = \arg\min_V \Bigl\| Y_{N,1:T_0} - \sum_{i=1}^{N-1} \omega_i(V) \cdot Y_{i,1:T_0} \Bigr\|$$

Then they use the optimal weights $\omega$ based on that $\hat V$:

$$\omega^* = \omega(\hat V) = \arg\min_\omega \Bigl\| Z_N - \sum_{i=1}^{N-1} \omega_i \cdot Z_i \Bigr\|_{\hat V}$$
Doudchenko-Imbens:

$$\hat\tau_{DI} = Y_{NT} - \hat Y_{NT}(0), \qquad \hat Y_{NT}(0) = \hat\alpha + \sum_{i=1}^{N-1} \hat\omega_i Y_{iT}$$

where ("vertical" regression)

$$\min_{\alpha, \omega} \sum_{t=1}^{T-1} \Bigl( Y_{Nt} - \alpha - \sum_{i=1}^{N-1} \omega_i Y_{it} \Bigr)^2$$

Regularization is important here if $N$ is large relative to $T$, partly because of the lack of restrictions on $\omega$.

Note: this is a regression with $T_0$ observations and $N_0$ regressors.
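A sketch of this regularized vertical regression in Python with scikit-learn's elastic net; Y is the N x T outcome matrix with the treated unit in the last row, and cross-validated penalties are an illustrative regularization choice.

import numpy as np
from sklearn.linear_model import ElasticNetCV

def vertical_regression(Y):
    Y_pre_controls = Y[:-1, :-1].T    # T0 observations, N0 regressors
    y_pre_treated = Y[-1, :-1]
    enet = ElasticNetCV(cv=5).fit(Y_pre_controls, y_pre_treated)
    # impute Y_NT(0) from the controls' period-T outcomes
    y_NT_0 = enet.predict(Y[:-1, -1].reshape(1, -1))[0]
    return Y[-1, -1] - y_NT_0         # estimated effect tau_DI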
Comparison Unconfoundedness vs Synthetic Controls in
Case with N1 = T1 = 1

• Unconfoundedness requires N0 > T0 =⇒ horizontal regression

• Synthetic Control requires N0 < T0 =⇒ vertical regression

But, with regularization on regression coefficients we can


use either unconfoundedness or synthetic control methods,
irrespective of relative magnitude of N0 and T0.
Difference-In-Differences / Factor Models

Model for $Y_{it}(0)$:

$$Y_{it}(0) = \alpha_i + \gamma_t + \varepsilon_{it}$$

leading to

$$\min_{\alpha, \gamma} \sum_{i=1}^{N} \sum_{t=1}^{T} (1 - W_{it}) (Y_{it} - \gamma_t - \alpha_i)^2$$

$$\hat\tau = \frac{1}{N_1 T_1} \sum_{i=N_0+1}^{N} \sum_{t=T_0+1}^{T} Y_{it} - \frac{1}{N_1 T_0} \sum_{i=N_0+1}^{N} \sum_{t=1}^{T_0} Y_{it} - \left( \frac{1}{N_0 T_1} \sum_{i=1}^{N_0} \sum_{t=T_0+1}^{T} Y_{it} - \frac{1}{N_0 T_0} \sum_{i=1}^{N_0} \sum_{t=1}^{T_0} Y_{it} \right)$$
More general, factor models:

$$Y_{it}(0) = \sum_{r=1}^{R} \gamma_{tr} \alpha_{ir} + \varepsilon_{it}$$

(Athey, Bayati, Doudchenko, Imbens, Khosravi, 2018)

$$\arg\min_{\alpha, \gamma, \mathbf{L}} \sum_{i=1}^{N} \sum_{t=1}^{T} (1 - W_{it}) (Y_{it} - \alpha_i - \gamma_t - L_{it})^2 + \lambda \|\mathbf{L}\|_*$$

with nuclear norm regularization on $\mathbf{L}$ to lead to a low rank solution.
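A sketch of nuclear-norm regularized matrix completion in Python via iterative soft-thresholding of singular values (soft-impute); the fixed effects are omitted for brevity and the penalty level is illustrative.

import numpy as np

def soft_impute(Y, observed, lam=1.0, iters=200):
    # observed: boolean matrix with True where Y_it(0) is observed
    L = np.where(observed, Y, 0.0)
    for _ in range(iters):
        # fill missing entries with the current fit, keep observed entries
        Z = np.where(observed, Y, L)
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        s = np.maximum(s - lam, 0.0)   # shrink the singular values
        L = (U * s) @ Vt
    return L                            # low-rank estimate of Y(0)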
• Challenge: How to choose between these methods (verti-
cal/horizontal regression, factor models), or how to tie them
together?

• Relative merits of these methods

Comparison of

1. unconfoundedness (horizontal) regression with elastic net


regularization (EN-H)

2. synthetic control (vertical) regression with elastic net reg-


ularization and no restrictions (EN-V)

3. matrix completion with nuclear norm (MC-NNM)


Illustration: Stock Market Data

We use daily returns for 2453 stocks over 10 years (3082


days). We create sub-samples by looking at the first T daily
returns of N randomly sampled stocks for pairs of (N, T ) such
that N × T = 4900, ranging from fat to thin:
(N, T ) = (10, 490), . . . , (70, 70), . . . , (490, 10).

Given the sample, we pretend that half the stocks are treated
at the mid point over time, so that 25% of the entries in the
matrix are missing.
 
$$\mathbf{Y}_{N \times T} = \begin{pmatrix} X & X & X & X & \ldots & X \\ X & X & X & X & \ldots & X \\ X & X & X & X & \ldots & X \\ X & X & X & ? & \ldots & ? \\ X & X & X & ? & \ldots & ? \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ X & X & X & ? & \ldots & ? \end{pmatrix}$$
[Figure: average normalized RMSE against log(N) for EN-H, EN-V, and MC-NNM, with N × T = 4900 and fraction missing 0.25.]
Results

• MC-NNM does better than EN-H and EN-V, adapts to


shape of matrix

• ADH restrictions (non-negativity of weights, and summing


to one, and no intercept) sometimes improve things relative
to Elastic-Net estimator, more so for the vertical regressions
than for the horizontal regressions.
Combining Synthetic Control Methods and Matrix Com-
pletion: Observation I

Synthetic Control is weighted linear regression without


unit fixed effects:

$$\hat\tau^{ADH} = \arg\min_{\tau, \gamma} \sum_{i=1}^{N} \sum_{t=1}^{T} (Y_{it} - \gamma_t - \tau W_{it})^2 \times \omega_i^{ADH}$$

• regression with time fixed effects and ADH weights (easy


to include covariates).

• under some conditions standard errors can be based on re-


gression interpretation taking weights as given (even though
the weights depend on outcome data).
Combining Synthetic Control Methods and Matrix Com-
pletion: Observation II

DID is unweighted regression with unit and time fixed


effects:

$$\hat\tau^{DID} = \arg\min_{\tau, \gamma, \alpha} \sum_{i=1}^{N} \sum_{t=1}^{T} (Y_{it} - \gamma_t - \alpha_i - \tau W_{it})^2$$

• regression with time fixed effects and unit fixed effects, no


weights.
Synthetic Difference In Differences

$$\hat\tau^{SDID} = \arg\min_{\tau, \gamma, \alpha} \sum_{i=1}^{N} \sum_{t=1}^{T} (Y_{it} - \gamma_t - \alpha_i - \tau W_{it})^2 \times \omega_i^{ADH} \times \lambda_t^{ADH}$$

Regression with unit and time fixed effects, and with unit
and time weights.
Time weights satisfy:

$$\lambda = \arg\min_\lambda \sum_{i=1}^{N-1} \Bigl( Y_{iT} - \sum_{t=1}^{T-1} \lambda_t Y_{it} \Bigr)^2 + \text{regularization term},$$

subject to

$$\lambda_t \ge 0, \qquad \sum_{t=1}^{T-1} \lambda_t = 1.$$

(or down-weight observations from the distant past.)


Generalization: Synthetic Factor Models (SFM)

$$\hat\tau^{SFM} = \arg\min_{\mathbf{L}, \alpha, \gamma, \tau} \sum_{i=1}^{N} \sum_{t=1}^{T} (Y_{it} - \alpha_i - \gamma_t - L_{it} - \tau W_{it})^2\, \omega_i^{ADH} \lambda_t^{ADH} + \lambda \|\mathbf{L}\|_*,$$
Double Robustness

• If a factor model holds, but the weights are good (e.g.,


ADH weights), SDID is consistent.

• If the DID model holds, but we use arbitrary weights, SDID


is consistent.
California smoking data calculations

Take pre-1988 data for all states, so we observe all Yit (0) for
all unit/time pairs.

We pretend unit i was treated in periods T0 +1, . . . , T , impute


the “missing” values and compare them to actual values using
SC (blue), DID (teal), SDID (red).

We average squared error by state for 8 periods (T − T0 = 8)


to get RMSEs for each state.
[Figures: smoking (packs per capita) by year, 1980-1988, for three example states: actual series versus the series imputed by SC (blue), DID (teal), and SDID (red).]
[Figure: scatter plot of synthetic control RMSE against synthetic diff-in-diff RMSE, by state.]
