Causal Inference and Machine Learning
Course Description
The course will cover topics on the intersection of causal inference and machine learning.
There will be particular emphasis on the use of machine learning methods for estimating
causal effects. In addition there will be some discussion of basic machine learning methods
that we view as useful tools for empirical economists.
Lectures
There will be six lectures.
Background Reading
We strongly recommend that participants read these articles in preparation for the course.
• Athey, Susan, and Guido W. Imbens. "The state of applied econometrics: Causality and policy evaluation." Journal of Economic Perspectives 31.2 (2017): 3-32.
Course Outline
(a) Holland, Paul W. "Statistics and causal inference." Journal of the American Statistical Association 81.396 (1986): 945-960.
(b) Imbens, Guido W., and Donald B. Rubin. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.
(c) Imbens, Guido W., and Jeffrey M. Wooldridge. "Recent developments in the econometrics of program evaluation." Journal of Economic Literature 47.1 (2009): 5-86.
(a) S. Athey (2018, January) "The Impact of Machine Learning on Economics," Sections 1-2. http://bit.ly/2EENtvy
(b) H. R. Varian (2014) "Big data: New tricks for econometrics." The Journal of Economic Perspectives, 28(2): 3-27. http://pubs.aeaweb.org/doi/pdfplus/10.1257/jep.28.2.3
(c) S. Mullainathan and J. Spiess (2017) "Machine learning: an applied econometric approach." Journal of Economic Perspectives, 31(2): 87-106. http://pubs.aeaweb.org/doi/pdfplus/10.1257/jep.31.2.87
6. Wednesday, September 11th, 10.00-11.30: Synthetic Control Methods and Matrix Completion
Lecture 1:
1. Causality: Potential Outcomes, Multiple Units, and the Assignment Mechanism
1.1 Potential Outcomes
In many situations it may be reasonable to assume that treatments applied to one unit do not affect the outcome for another (the Stable Unit Treatment Value Assumption, Rubin, 1978).
• In large-scale job training programs the outcomes for one individual may well be affected by the number of people trained, when that number is sufficiently large to create increased competition for certain jobs (Crepon, Duflo et al.).
Unit   Yi(0)   Yi(1)   Wi   Yi^obs
1      66      ?       0    66
2      0       ?       0    0
3      0       ?       0    0
4      ?       0       1    0
5      ?       607     1    607
6      ?       436     1    436
1.3 The Assignment Mechanism
Pr(W | Y(0), Y(1), X)
[Figure: DAG with nodes Z0, Z1, Z2, Z3, B, X, and Y]
Differences between Directed Acyclic Graphs (DAGs) and the Potential Outcome Framework
2.1 Basics
The Fisher exact test approach entails two choices: (i) the choice of the sharp null hypothesis, and (ii) the choice of test statistic.
We will test the sharp null hypothesis that the program had absolutely no effect on earnings, that is:
H0: Yi(0) = Yi(1) for all i = 1, . . . , N.
This is the first key point of the Fisher approach: under the sharp null hypothesis all the missing values can be inferred from the observed ones.
Six Observations from the GAIN Experiment in Los Angeles
Unit   Yi(0)    Yi(1)   Wi   Yi^obs
1      66       (66)    0    66
2      0        (0)     0    0
3      0        (0)     0    0
4      (0)      0       1    0
5      (607)    607     1    607
6      (436)    436     1    436
(Values in parentheses are imputed under the sharp null.)
Now consider testing this null against the alternative hypothesis that Yi(0) ≠ Yi(1) for some units, based on the test statistic:
T1 = T(W, Y^obs) = (1/3) Σ_{i=1}^6 Wi · Yi^obs − (1/3) Σ_{i=1}^6 (1 − Wi) · Yi^obs
   = (1/3) Σ_{i=1}^6 Wi · Yi(1) − (1/3) Σ_{i=1}^6 (1 − Wi) · Yi(0).
For the observed data the value of the test statistic is (Y4^obs + Y5^obs + Y6^obs − Y1^obs − Y2^obs − Y3^obs)/3 = 325.6.
W1  W2  W3  W4  W5  W6    levels    ranks
0   0   0   1   1   1     325.6     1.00
0   0   1   0   1   1     325.6     1.67
0   0   1   1   0   1     -79.0    -1.67
0   0   1   1   1   0      35.0    -1.00
0   1   0   0   1   1     325.6     2.33
0   1   0   1   0   1     -79.0    -1.00
0   1   0   1   1   0      35.0    -0.33
...                        ...      ...
1   1   1   0   0   0    -325.6    -1.00
Given the distribution of the test statistic, how unusual is this observed average difference (325.6), assuming the null hypothesis is true?
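This p-value calculation can be carried out by brute-force enumeration. A minimal Python sketch, using the six GAIN observations from the table above (the tie tolerance is an implementation choice):

```python
from itertools import combinations

# Six observations from the GAIN experiment (values from the table above).
y = [66.0, 0.0, 0.0, 0.0, 607.0, 436.0]   # observed earnings
w_obs = (0, 0, 0, 1, 1, 1)                 # observed assignment

def diff_in_means(w):
    treated = [yi for yi, wi in zip(y, w) if wi == 1]
    control = [yi for yi, wi in zip(y, w) if wi == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)

t_obs = diff_in_means(w_obs)

# Under the sharp null Yi(0) = Yi(1), the outcomes are fixed and only W
# varies: enumerate all C(6, 3) = 20 assignments with three treated units.
dist = [diff_in_means(tuple(1 if i in tr else 0 for i in range(6)))
        for tr in combinations(range(6), 3)]

# Two-sided exact p-value: fraction of assignments with |T| >= |T_obs|.
p_value = sum(abs(t) >= abs(t_obs) - 1e-9 for t in dist) / len(dist)
```

Because the randomization distribution is enumerated exactly, no distributional approximation is involved.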
2.2 The Choice of Null Hypothesis
The most standard choice of statistic is the difference in average outcomes by treatment status:
T = Σ_i Wi Yi^obs / Σ_i Wi − Σ_i (1 − Wi) Yi^obs / Σ_i (1 − Wi).
An important class of statistics involves transforming the outcomes to ranks before considering differences by treatment status. This improves robustness to outliers.
Ri(Y1^obs, . . . , YN^obs) = Σ_{j=1}^N 1{Yj^obs ≤ Yi^obs} − (N + 1)/2.
T = Σ_i Wi Ri / Σ_i Wi − Σ_i (1 − Wi) Ri / Σ_i (1 − Wi).
2.4 Computation of p-values
t = (Ȳ1 − Ȳ0) / √(s0²/(N − M) + s1²/M)
where
s0² = (1/(N − M − 1)) Σ_{i:Wi=0} (Yi^obs − Ȳ0)²,   s1² = (1/(M − 1)) Σ_{i:Wi=1} (Yi^obs − Ȳ1)²
and
p = 2 × Φ(−|t|),  where  Φ(a) = ∫_{−∞}^a (1/√(2π)) exp(−x²/2) dx.
P-values for Fisher Exact Tests: Ranks versus Levels
Exact P-values: Take Aways
3. Randomized Experiments: Neyman's Repeated Sampling Approach
3.1 Unbiased Estimation of the Average Treatment Effect
τ = (1/N) Σ_{i=1}^N (Yi(1) − Yi(0)) = Ȳ(1) − Ȳ(0).
τ̂ = (1/M) Σ_{i:Wi=1} Yi^obs − (1/(N − M)) Σ_{i:Wi=0} Yi^obs = Ȳ1^obs − Ȳ0^obs.
To see that this estimator, Ȳ1^obs − Ȳ0^obs, is an unbiased estimator of τ, consider the statistic Ti for each unit. Using the fact that Yi^obs is equal to Yi(1) if Wi = 1 and Yi(0) if Wi = 0, we can rewrite this statistic as:
Ti = Wi · Yi(1)/(M/N) − (1 − Wi) · Yi(0)/((N − M)/N),
so that τ̂ = (1/N) Σ_i Ti, and E[Ti] = Yi(1) − Yi(0), which establishes unbiasedness.
3.2 The Variance of the Unbiased Estimator Ȳ1^obs − Ȳ0^obs
This involves two steps: first, deriving the variance of the estimator for the average treatment effect; and second, developing unbiased estimators of this variance.
Consider a completely randomized experiment with N units, M assigned to treatment. To calculate the variance of Ȳ1^obs − Ȳ0^obs, we need the second and cross moments of the random variable Wi, E[Wi²] and E[Wi · Wj].
The variance of Ȳ1^obs − Ȳ0^obs is equal to:
V(Ȳ1^obs − Ȳ0^obs) = S0²/(N − M) + S1²/M − S01²/N,
where
Sw² = (1/(N − 1)) Σ_{i=1}^N (Yi(w) − Ȳ(w))²,  for w = 0, 1,
S01² = (1/(N − 1)) Σ_{i=1}^N (Yi(1) − Yi(0) − τ)².
The numerator of the first term, the population variance of the potential control outcome vector Y(0), is equal to
S0² = (1/(N − 1)) Σ_{i=1}^N (Yi(0) − Ȳ(0))²,
and it can be estimated by the sample variance in the control group,
s0² = (1/(N − M − 1)) Σ_{i:Wi=0} (Yi^obs − Ȳ0^obs)².
The third term, S01² (the population variance of the unit-level treatment effect), is more difficult to estimate because we cannot observe both Yi(1) and Yi(0) for any unit. This leads to the standard variance estimator
V̂(Ȳ1^obs − Ȳ0^obs) = s0²/(N − M) + s1²/M.   (2)
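A small simulation sketch of the difference-in-means estimator and the variance estimator in (2); the design (N = 1000, M = 500), the constant effect τ = 5, and the seed are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Completely randomized experiment: N units, M treated, constant effect.
N, M, tau = 1000, 500, 5.0
y0 = rng.normal(0.0, 2.0, size=N)     # potential outcomes under control
y1 = y0 + tau                         # potential outcomes under treatment
w = np.zeros(N, dtype=int)
w[rng.choice(N, size=M, replace=False)] = 1
y_obs = np.where(w == 1, y1, y0)      # Yi^obs

# Difference-in-means estimator and the variance estimator in (2).
tau_hat = y_obs[w == 1].mean() - y_obs[w == 0].mean()
s1_sq = y_obs[w == 1].var(ddof=1)     # s1^2
s0_sq = y_obs[w == 0].var(ddof=1)     # s0^2
se = np.sqrt(s0_sq / (N - M) + s1_sq / M)
```

With a constant treatment effect, S01² = 0 and the estimator in (2) is unbiased for the true variance.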
This estimator for the variance is widely used, even when the
assumption of an additive treatment effect is inappropriate.
There are two main reasons for this estimator’s popularity.
The second reason for using this estimator for the variance is that it is always unbiased for the variance of τ̂ = Ȳ1^obs − Ȳ0^obs when this statistic is interpreted as the estimator of the average treatment effect in the super-population from which the N observed units are a random sample. (We return to this interpretation later.)
Confidence Intervals
Write (with M = N/2 and Di = 2Wi − 1):
τ̂ = Ȳ1 − Ȳ0 = (1/(N/2)) Σ_{i=1}^N Wi Yi(1) − (1/(N/2)) Σ_{i=1}^N (1 − Wi) Yi(0)
  = (1/N) Σ_{i=1}^N (Yi(1) − Yi(0)) + (1/N) Σ_{i=1}^N Di (Yi(1) + Yi(0)).
The stochastic part, normalized by the sample size, is
(1/√N) Σ_{i=1}^N Di (Yi(1) + Yi(0)),   with   V = (1/N) Σ_{i=1}^N (Yi(1) + Yi(0))².
4. Stratified Randomized Experiments
Recommendation In Literature:
Quotes from the Literature
Box, Hunter and Hunter (2005, page 93) also suggest that
there is a tradeoff in terms of accuracy or variance in the de-
cision to pair, writing:
“Thus you would gain from the paired design only if the
reduction in variance from pairing outweighed the effect
of the decrease in the number of degrees of freedom of
the t distribution.”
Klar and Donner (1997) raise additional issues that make them
concerned about pairwise randomized experiments (in the con-
text of randomization at the cluster level):
“We show in this paper that there are also several analytic limitations associated with pair-matched designs.
These include: the restriction of prediction models to
cluster-level baseline risk factors (for example, cluster
size), the inability to test for homogeneity of odds ra-
tios, and difficulties in estimating the intracluster corre-
lation coefficient. These limitations lead us to present
arguments that favour stratified designs in which there
are more than two clusters in each stratum.”
Imai, King and Nall (2009) claim there are no tradeoffs at all
between pairing and complete randomization, and summarily
dismiss all claims in the literature to the contrary:
How Do We Reconcile These Statements?
4.1 Expected Squared Error Calculations for Completely Randomized vs Stratified Randomized Experiments
Notation
µ(w, x) = E[Yi(w) | Wi = w, Xi = x],
σ²(w, x) = V(Yi(w) | Wi = w, Xi = x).
Three Estimators: τ̂dif, τ̂reg, and τ̂strata
τ̂dif = Ȳ1^obs − Ȳ0^obs
VC − VS = q(1 − q) · [(µ(0, f) − µ(0, m))² + (µ(1, f) − µ(1, m))²] ≥ 0
Comment 1:
Comment 2: For this result it is important that we compare the marginal variances, not the conditional variances. There is no general ranking of the conditional variances
E[(τ̂dif − τ)² | Y(0), Y(1), X, C]
versus
E[(τ̂dif − τ)² | Y(0), Y(1), X, S].
Think through analyses in advance
4.2 Analytic Limitations of Pairwise Randomization
Benefits of Pairing
Difference with Stratified Randomized Experiments
Difference with Stratified Randomized Experiments (ctd)
Recommendation
• Use small strata, rather than pairs (but not a big deal either way)
4.3 Power Comparisons for t-statistic Based Tests
τ̂ = Ȳ1^obs − Ȳ0^obs.
Its distribution under the two designs is the same as well (because the covariate is independent of the outcomes):
τ̂ | C ∼ N(τ, 2σ²/N)   and   τ̂ | P ∼ N(τ, 2σ²/N)
The natural estimator for the variance of the estimator given the pairwise randomized experiment is
V̂P = (1/(N(N − 1))) Σ_{i=1}^N (τ̂i − τ̂)² ∼ (2σ²/N) · χ²(N − 1)/(N − 1)
Under normality the expected values of the variance estimators are the same,
E[V̂P] = E[V̂C] = 2σ²/N,
but
V(V̂P) = 2 · V(V̂C) = 8σ⁴/(N²(N − 1)).
This leads to the t-statistics
tP = τ̂/√V̂P   and   tC = τ̂/√V̂C,
with critical values
cα^P = q^t_{1−α/2}(N − 1),   cα^C = q^t_{1−α/2}(2N − 2).
For any τ ≠ 0, and for any N ≥ 2, the power of the test based on the t-statistic tC is strictly greater than the power based on the t-statistic tP (assuming covariates are irrelevant).
Conclusion
• Stratify, with small strata, but at least two treated and two control units.
Causal Inference and Machine Learning
Lecture 2:
1. Nonparametric Regression
2. Regression Trees
3. Multiple Covariates/Features
4. Pruning
5. Random Forests
6. Boosting
7. Neural Nets
Data:
Define
g(x) = E[Yi | Xi = x]
The regression/prediction problem is special:
ĝ(x) = Σ_{i=1}^N ωi Yi,   often with Σ_{i=1}^N ωi = 1, sometimes ωi ≥ 0.
First: scalar case, Xi ∈ [0, 1]
ĝJ(x) = Σ_{j=1}^J 1{x ∈ [κ_{j−1,J}, κ_{j,J}]} ĉ_{jJ}
where ĉ_{jJ} is the average outcome within the interval [κ_{j−1,J}, κ_{j,J}]:
ĉ_{jJ} = Σ_{i=1}^N 1{Xi ∈ [κ_{j−1,J}, κ_{j,J}]} Yi / Σ_{i=1}^N 1{Xi ∈ [κ_{j−1,J}, κ_{j,J}]},
N_{jJ} = Σ_{i=1}^N 1{Xi ∈ [κ_{j−1,J}, κ_{j,J}]}.
For fixed x the bias of this estimator depends on the derivative of g(·) around x, and the density of Xi around x. Given some smoothness, the bias-squared is approximately equal to the square of g′(x)/J and the variance is equal to V(Yi|Xi = x)/(N f(x)/J):
Bias²(J) + Var(J) = g′(x)²/J² + J · V(Yi|Xi = x)/(N f(x)),
minimized at
J^opt = N^{1/3} (2 g′(x)² f(x)/V(Yi|Xi = x))^{1/3}.
If we let J increase slightly faster than proportional to N^{1/3} we get asymptotic normality without bias; without undersmoothing there are no valid confidence intervals.
How can we modify this to improve properties?
2. Regression Trees
Ȳ_{a,b} = Σ_{i∈I_{a,b}} Yi / Σ_{i∈I_{a,b}} 1
Q(x) = Σ_{i:Xi∈I_{0,x}} (Yi − Ȳ_{0,x})² + Σ_{i:Xi∈I_{x,1}} (Yi − Ȳ_{x,1})²
Find the split point that minimizes the sum of squared deviations. Then:
ĝ(x) = Ȳ_{0,c1} if x ≤ c1,   Ȳ_{c1,1} if x > c1.
• This is a tree with two leaves: [0, c1] and [c1, 1].
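The single-split search can be sketched as follows; the jump design and seed are illustrative assumptions, and the O(N²) search is kept deliberately simple:

```python
import numpy as np

rng = np.random.default_rng(2)

# Data with a jump in the regression function at x = 0.5.
N = 500
x = rng.uniform(0.0, 1.0, size=N)
y = np.where(x <= 0.5, 0.0, 2.0) + rng.normal(0.0, 0.1, size=N)

def sse(v):
    # Sum of squared deviations from the leaf mean.
    return ((v - v.mean()) ** 2).sum() if v.size else 0.0

# Q(c): sum of within-leaf squared deviations for a split at c.
candidates = np.sort(x)[1:-1]
q = [sse(y[x <= c]) + sse(y[x > c]) for c in candidates]
c1 = candidates[int(np.argmin(q))]

def g_hat(x0):                      # two-leaf tree prediction
    return y[x <= c1].mean() if x0 <= c1 else y[x > c1].mean()
```

The chosen split lands at the discontinuity, and each leaf predicts its within-leaf average.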
We can do this again: consider all possible split points c2 ∈ [0, 1] and calculate the sum of squares as a function of the additional split point. For example, for c2 ∈ (c1, 1), the sum of squares is
Q(c1, c2) = Σ_{i:Xi∈I_{0,c1}} (Yi − Ȳ_{0,c1})² + Σ_{i:Xi∈I_{c1,c2}} (Yi − Ȳ_{c1,c2})² + Σ_{i:Xi∈I_{c2,1}} (Yi − Ȳ_{c2,1})²
1. Given J splits, this looks very similar to just dividing the interval [0, 1] into J equal subintervals.
In both cases (simply dividing [0, 1] into J equal intervals, or a tree with J leaves), we need to choose the smoothing parameter J.
Minimize over J:
CV(J) = (1/N) Σ_{i=1}^N (Yi − ĝ_{J,(i)}(Xi))²
or use the cost-complexity criterion
Q(T) + λ|T|
For a kernel-type estimator at x = 0 with multiple covariates:
ĝ(0) = Σ_{i=1}^N Yi 1{‖Xi‖ ≤ ε} / Σ_{i=1}^N 1{‖Xi‖ ≤ ε}.
Trees deal with multiple covariates differently. Now, for the first split, we consider all subsets of [0, 1] × [0, 1] of the form {x : x1 ≤ c} or {x : x2 ≤ c}.
• This means that some covariates may never be used to split the sample: the method will deal better with cases where the regression function is flat in some covariates (sparsity).
This difference in the way trees (and forests) deal with multiple covariates compared to kernel methods is important in practice. There is some tension there:
Comparison with linear additive models:
4. Pruning
Suppose (x1, x2) ∈ {(−1, −1), (−1, 1), (1, −1), (1, 1)}, and g(x1, x2) = x1 × x2.
• First grow a "big" tree, with many leaves, even if they do not improve the sum of squared errors enough given λ, up to the point where the leaves are all very small in terms of the number of observations per leaf.
5. Random Forests
2. For each split (in each tree) only a subset of size m of the p covariates is considered in the split (typically m = √p, or m = p/3; a heuristic, with no formal result).
6. Boosting
Boosting refers to the repeated application of a simple basic estimation method to the residuals.
For each split, we can calculate the improvement in mean squared error, and assign that to the variable that we split on.
Modification
First grow a tree of depth d on (Yi − Ĝ0(Xi), Xi); call this ĝ1(x).
Minimize over γ:  Σ_{i=1}^N L(Yi, Ĝ0(Xi) + g(Xi; γ))
7. Neural Nets
h(x) = E[Yi | Xi = x]
Linear Model:
f(x) = β0 + Σ_{j=1}^p βj xj
• Restrictive if p << N
Let’s make this more flexible:
34
Additive model:
h(x) = Σ_{j=1}^p gj(xj)
Projection Pursuit:
h(x) = Σ_{l=1}^L gl(Σ_{j=1}^p βlj xj)
Neural net with single hidden layer:
f(x) = β0^(2) + Σ_{l=1}^L βl^(2) g(Σ_{j=1}^p βlj^(1) xj)
Fix g(·) and estimate the L × p parameters βlj^(1) and the L + 1 parameters βl^(2).
(Note that (1) and (2) index layers; they do not mean "to the power of".)
General neural net with K hidden layers, one observed input layer, one observed output layer, K + 2 layers total.
Observe y and x1, . . . , xp. The zk^(m) are hidden variables.
First layer: p2 hidden elements, l = 1, . . . , p2. Model:
zl^(2) = ωl0^(1) + Σ_{j=1}^p ωlj^(1) xj
Layer k, with pk+1 hidden elements and pk hidden inputs, for layers k = 2, . . . , K + 1, l = 1, . . . , pk+1. Model:
zl^(k+1) = ωl0^(k) + Σ_{j=1}^{pk} ωlj^(k) αj^(k)
Transformation:
αl^(k+1) = g(zl^(k+1))
Output:
y = ω0^(K+1) + Σ_{j=1}^{pK+1} ωj^(K+1) αj^(K+1)
Naive approach: minimize
Σ_{i=1}^N (Yi − f(Xi; ω))²
Penalized version: minimize
Σ_{i=1}^N (Yi − f(Xi; ω))² + λ Σ_{k=1}^K Σ_{j=1}^{pk} Σ_{l=1}^{p_{k−1}} (ωjl^(k))²
over all ωlj^(k).
Interpretation
Multiple layers versus multiple hidden units
Convolutional Neural Nets
zl^(k+1) = ωl0^(k) + Σ_{j=1}^{pk} ωlj^(k) αj^(k)
We can set some of the ωlj^(k) equal to zero. This obviously simplifies computations and makes estimation more precise. But how do we choose the restrictions?
Example: Digit recognition: xj are black/white scale measures on pixels. Suppose we have 16 by 16 pixels, 256 total. So, xij, i = 1, . . . , 16, j = 1, . . . , 16. We could make the nodes in the first hidden layer functions of only sets of pixels close together:
z1^(2) = ω10^(1) + Σ_{i=1}^3 Σ_{j=1}^3 ω1ij^(1) xij
z2^(2) = ω20^(1) + Σ_{i=4}^6 Σ_{j=1}^3 ω2ij^(1) xij
et cetera.
Estimating the Parameters of a Neural Network: Backpropagation
For N observations:
J(ω, x, y) = Σ_{i=1}^N Ji(ω, xi, yi) = Σ_{i=1}^N (yi − f(xi; ω))²
Recall: K hidden layers, one input and output layer, K + 2 layers total.
First layer: p1 observed elements,
zl^(1) = xl,   αl^(1) = g^(1)(zl^(1)) = zl^(1),   l = 1, . . . , p1
Last layer:
z^(K+2) = ω0^(K+1) + Σ_{j=1}^{pK+1} ωj^(K+1) αj^(K+1),
f(x) = g^(K+2)(z^(K+2)) = z^(K+2)
We can write
J(ω, x, y) = J(ω^(1), ω^(2), . . . , ω^(K+1), x, y)
Now start with the last layer. Write
f(x) = g^(K+2)(z^(K+2)) = z^(K+2)
Define
δi^(K+2) = ∂/∂zi^(K+2) (yi − g^(K+2)(zi^(K+2)))²
  = −2 (yi − g^(K+2)(zi^(K+2))) g^(K+2)′(zi^(K+2))
  = −2 (yi − zi^(K+2)),
so
∂zi^(K+2)/∂zji^(K+1) = ωj^(K+1) g′(zji^(K+1)).
Now consider writing the objective function in terms of zi^(K+1):
(yi − f(xi))² = (yi − zi^(K+2))² = (yi − ω0^(K+1) − Σ_{j=1}^{pK+1} ωj^(K+1) g(zji^(K+1)))²
Then define:
δli^(K+1) = ∂/∂zli^(K+1) (yi − f(xi))²
  = ∂/∂zi^(K+2) (yi − f(xi))² × ∂zi^(K+2)/∂zli^(K+1)
  = δi^(K+2) ωl^(K+1) g′(zli^(K+1))
Go down the layers. Define
δli^(k) = Σ_{j=1}^{p_{k+1}} ωjl^(k) δji^(k+1) g′(zli^(k)),
so that
∂/∂ωlj^(k) Ji(ω, xi, yi) = g(zji^(k)) δli^(k+1).
Given the derivatives, iterate over
ω_{m+1} = ω_m − α × ∂/∂ω J(ω, x, y)
• Often stochastic gradient descent: instead of calculating
∂/∂ω J(ω, x, y)
use
Σ_{i=1}^N Ri ∂/∂ω Ji(ω, xi, yi) / Σ_{i=1}^N Ri,
where the Ri ∈ {0, 1} randomly select a subsample.
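The delta recursions above can be checked numerically on a tiny one-hidden-layer network with g = tanh; the dimensions and seed are illustrative assumptions, and the gradient from the recursions is compared with a central finite difference:

```python
import numpy as np

rng = np.random.default_rng(3)

# One hidden layer (L = 3 units, p = 2 inputs), g = tanh, linear output.
p, L = 2, 3
W1 = rng.normal(size=(L, p)); b1 = rng.normal(size=L)   # omega^(1)
W2 = rng.normal(size=L); b2 = rng.normal()              # omega^(2)
xi, yi = rng.normal(size=p), 0.7

def forward(W1_):
    z = b1 + W1_ @ xi           # hidden pre-activations z
    a = np.tanh(z)              # alpha = g(z)
    return z, a, b2 + W2 @ a    # scalar output f(x)

z, a, f = forward(W1)

# Backpropagation: delta at the output layer, pushed down one layer.
delta_out = -2.0 * (yi - f)                  # d(yi - f)^2 / df
delta_hid = delta_out * W2 * (1.0 - a ** 2)  # chain rule through g = tanh
grad_W1 = np.outer(delta_hid, xi)            # d(yi - f)^2 / dW1

# Central finite-difference check of grad_W1.
eps, num = 1e-6, np.zeros_like(W1)
for i in range(L):
    for j in range(p):
        Wp, Wm = W1.copy(), W1.copy()
        Wp[i, j] += eps; Wm[i, j] -= eps
        fp, fm = forward(Wp)[2], forward(Wm)[2]
        num[i, j] = ((yi - fp) ** 2 - (yi - fm) ** 2) / (2 * eps)
```

Agreement between the two gradients confirms the recursion for a single observation; stochastic gradient descent then averages such per-observation gradients over a random subsample.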
Regularization:
• add a penalty term, λ Σ_{j,l,k} (ωjl^(k))²
8. Generative Adversarial Nets (GANs)
General set up:
• Kullback-Leibler divergence:
KL(f, g) = ∫ ln(f(x)/g(x)) f(x) dµ(x)
• Jensen-Shannon divergence:
JS(f, g) = KL(f, (f + g)/2) + KL(g, (f + g)/2)
• Wasserstein distance:
W(f, g) = inf_{π∈Π(f,g)} E_{(X,Y)∼π} ‖X − Y‖,
where Π(f, g) is the set of joint distributions with marginals f and g.
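A minimal sketch of the first two divergences for discrete distributions; the example pmfs are illustrative, and the JS formula follows the version displayed above (without 1/2 factors):

```python
import numpy as np

def kl(f, g):
    # Kullback-Leibler divergence for discrete pmfs with full support.
    f, g = np.asarray(f, float), np.asarray(g, float)
    return float(np.sum(f * np.log(f / g)))

def js(f, g):
    # Jensen-Shannon divergence, as displayed above (no 1/2 factors).
    m = (np.asarray(f, float) + np.asarray(g, float)) / 2.0
    return kl(f, m) + kl(g, m)

f = [0.2, 0.5, 0.3]   # two example pmfs on three points
g = [0.4, 0.4, 0.2]
```

Unlike KL, JS is symmetric in its arguments and remains finite when the supports differ, which is one reason it appears in the GAN objective below.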
Original GAN proposal (Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville, & Bengio, 2014):
inf_θ sup_φ [Σ_{i:Yi=F} ln Dφ(gθ(Zi)) + Σ_{i:Yi=R} ln(1 − Dφ(Xi))]
Wasserstein GAN (WGAN, Arjovsky, Chintala, & Bottou, 2017). Formally:
inf_θ sup_{φ: ‖fφ‖L ≤ 1} [(1/NF) Σ_{i:Yi=F} fφ(gθ(Zi)) − (1/NR) Σ_{i:Yi=R} fφ(Xi)]
Simulating the Lalonde Data
How do the generated data compare to the actual data?
Causal Inference and Machine Learning
Lecture 3:
1. Unconfoundedness
2. Efficiency Bound
4. Many Covariates
7. Comparisons of Estimators
1. Unconfoundedness Set up:
Covariates Xi
Estimand: average effect for the treated:
τ = E[Yi(1) − Yi(0) | Wi = 1]
Overlap
If there are concerns with overlap, we may need to trim the sample based on the propensity score:
Trim if e(x) ∉ [0.1, 0.9]
Important in practice.
Define the conditional mean of potential outcomes
µw(x) = E[Yi^obs | Wi = w, Xi = x]
2. Semi-parametric efficiency bound for the average treatment effect
τ = E[Yi(1) − Yi(0)]
under unconfoundedness:
V = E[σ1²(Xi)/e(Xi) + σ0²(Xi)/(1 − e(Xi)) + (µ1(Xi) − µ0(Xi) − τ)²] = E[ψ(Yi, Wi, Xi)²],
where
ψ(y, w, x) = µ1(x) − µ0(x) + w (y − µ1(x))/e(x) − (1 − w)(y − µ0(x))/(1 − e(x)) − τ
How can we estimate τ efficiently?
Let µ̂w(x) and ê(x) be nonparametric estimators for µw(x) and e(x). Then the following three estimators are efficient for τ:
τ̂reg = (1/N) Σ_{i=1}^N (µ̂1(Xi) − µ̂0(Xi))
C. based on estimation of the efficient score:
τ̂es = (1/N) Σ_{i=1}^N [Wi (Yi − µ̂1(Xi))/ê(Xi) − (1 − Wi)(Yi − µ̂0(Xi))/(1 − ê(Xi)) + (µ̂1(Xi) − µ̂0(Xi))]
• Single nearest neighbor matching is also possible, but not efficient.
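A sketch of the efficient-score estimator on simulated data satisfying unconfoundedness; here the true µw and e are plugged in to keep the example short, whereas in practice they would be replaced by flexible ML estimates, and the design and seed are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated data satisfying unconfoundedness; tau = 2 is the estimand.
N, tau = 5000, 2.0
x = rng.normal(size=N)
e = 1.0 / (1.0 + np.exp(-x))          # propensity score e(x)
w = rng.binomial(1, e)
y = x + tau * w + rng.normal(size=N)  # mu0(x) = x, mu1(x) = x + tau

# Efficient-score estimator with (here) the true nuisance functions.
mu0_hat, mu1_hat, e_hat = x, x + tau, e
tau_es = np.mean(
    w * (y - mu1_hat) / e_hat
    - (1 - w) * (y - mu0_hat) / (1 - e_hat)
    + (mu1_hat - mu0_hat)
)
```

The same code runs unchanged if mu0_hat, mu1_hat, and e_hat are replaced by fitted values from, say, a random forest and a logistic regression.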
τ̂reg, τ̂ipw, and τ̂es are efficient in the sense that they achieve the semiparametric efficiency bound, for fixed dimension of the covariates, but irrespective of what that dimension is.
Define:
τ̂infeasible = (1/N) Σ_{i=1}^N [Wi (Yi − µ1(Xi))/e(Xi) − (1 − Wi)(Yi − µ0(Xi))/(1 − e(Xi)) + (µ1(Xi) − µ0(Xi))]
Then:
µ̂w(x) = Σ_{i=1}^N Yi 1{Wi = w, Xi = x} / Σ_{i=1}^N 1{Wi = w, Xi = x}
• Regression estimator based on series estimator for µw(x).
• Regression estimator based on kernel estimator for µw(x):
µ̂w(x) = Σ_{i=1}^N 1{Wi = w} Yi K((Xi − x)/h) / Σ_{i=1}^N 1{Wi = w} K((Xi − x)/h)
4. What do we do with many covariates?
The problem is that this does not select covariates that are highly correlated with Yi.
B. Regression methods. Estimate µ0(x) = E[Yi | Wi = 0, Xi = x] using machine learning methods, e.g., LASSO, random forests, deep learning methods, to minimize something like
E[(µ̂0(Xi) − µ0(Xi))²]
and then estimate
τ̂ = (1/Nt) Σ_{i:Wi=1} (Yi − µ̂0(Xi)).
The problem is that this does not select covariates that are highly correlated with Wi.
Recall omitted variable bias:
Yi = α + τ Wi + β′Xi + εi
First improvement: use selection methods that select covariates that are correlated with Wi or Yi (double selection, Belloni et al., 2012).
E.g., use the lasso to select covariates that predict Yi, and use the lasso to select covariates that predict Wi.
5. Efficient Score Methods and Double Robustness (Robins & Rotnitzky, 1996; Van Der Laan and Rubin, 2006; Imbens and Rubin, 2015; and others)
Estimate µ0(·), µ1(·) and e(·) on the first subsample, and then estimate τ on the second subsample as
τ̂1 = (1/(N/2)) Σ_{i=N/2+1}^N [Wi (Yi − µ̂1^(1)(Xi))/ê^(1)(Xi) − (1 − Wi)(Yi − µ̂0^(1)(Xi))/(1 − ê^(1)(Xi)) + µ̂1^(1)(Xi) − µ̂0^(1)(Xi)]
and similarly, with the roles of the subsamples reversed,
τ̂2 = (1/(N/2)) Σ_{i=1}^{N/2} [Wi (Yi − µ̂1^(2)(Xi))/ê^(2)(Xi) − (1 − Wi)(Yi − µ̂0^(2)(Xi))/(1 − ê^(2)(Xi)) + µ̂1^(2)(Xi) − µ̂0^(2)(Xi)]
Finally, combine:
τ̂ = (τ̂1 + τ̂2)/2
Key Assumptions
Estimators for µ0(·), µ1(·) and e(·) need to converge fast enough, e.g., faster than the N^{−1/4} rate.
6. Balancing Methods
(1/N1) Σ_{i=1}^N (1 − Wi) Yi γi,   where γi = e(Xi)/(1 − e(Xi))
The key insight is that for any function h : X → R^p,
E[(e(Xi)/(1 − e(Xi))) (1 − Wi) h(Xi)] = E[h(Xi) | Wi = 1]
Zubizarreta (2012) suggests directly focusing on the balance in covariates. Find weights γi that solve
min_{γ1,...,γN} Σ_{i:Wi=0} γi²,   subject to   Σ_{i=1}^N (1 − Wi) γi Xi = X̄t
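This quadratic program has a closed-form minimum-norm solution via the pseudo-inverse; a sketch in which the dimensions and covariate shift are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)

# Control covariates and the treated means they should reproduce.
Nc, Nt, p = 200, 50, 3
Xc = rng.normal(0.0, 1.0, size=(Nc, p))   # controls
Xt = rng.normal(0.5, 1.0, size=(Nt, p))   # treated, shifted covariates
xbar_t = Xt.mean(axis=0)

# min sum_i gamma_i^2  subject to  Xc' gamma = xbar_t:
# the minimum-norm solution applies the pseudo-inverse to xbar_t.
A = Xc.T                                  # p x Nc constraint matrix
gamma = np.linalg.pinv(A) @ xbar_t
err = float(np.max(np.abs(A @ gamma - xbar_t)))
```

Practical implementations typically add further constraints (e.g., nonnegative weights that sum to one), which require a quadratic-programming solver rather than a pseudo-inverse.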
Athey, Imbens and Wager (2015) combine this with a linear regression for the potential outcomes.
Σ_{i=1}^N (1 − Wi) γi Xi = X̄t
min_{γ1,...,γN} ζ × (1/Nc) Σ_{i:Wi=0} γi² + (1 − ζ) × ‖(1/Nc) Σ_{i=1}^N (1 − Wi) γi Xi − X̄t‖²
µ0(x) = β′x
min_β Σ_{i:Wi=0} (Yi − β′Xi)² + λ (α Σ_{k=1}^p |βk| + (1 − α) Σ_{k=1}^p βk²)
A standard estimator for the average effect for the treated would be
τ̂ = Ȳt − X̄t′β̂
or the weighting estimator
τ̂ = Ȳt − Σ_{i=1}^N (1 − Wi) γi Yi.
The residual balancing estimator for the average effect for the treated is
τ̂ = Ȳt − (X̄t′β̂ + Σ_{i=1}^N (1 − Wi) γi (Yi − Xi′β̂))
• does not require estimation of the propensity score.
7. Comparison of Estimators
       Experimental        CPS               PSID
       est      s.e.       est      s.e.     est       s.e.
DIM    1.79    (0.67)     -8.50    (0.58)   -15.20    (0.66)
BCM    1.90    ()          2.35    ()         1.47    ()
Outcome Models
L      1.00    (0.57)      0.69    (0.60)     0.79    (0.60)
RF     1.73    (0.58)      0.92    (0.60)     0.06    (0.63)
NN     2.07    (0.59)      1.43    (0.59)     2.12    (0.59)
Outcome Models
Linear          0.56    2   -0.06   0.56   0.90
Random Forest   0.58    4   -0.15   0.56   0.89
Neural Nets     0.65   11   -0.17   0.63   0.85
Outcome Models
Linear          0.77    8   -0.62   0.45   0.67
Random Forest   0.80    9   -0.67   0.44   0.62
Neural Nets     0.51    3   -0.10   0.50   0.89
Outcome Models
L    0.57    1    0.09   0.56   0.88   0.27    1    0.09   0
RF   0.97   10   -0.79   0.57   0.52   0.63   10   -0.57   0
NN   1.20   11    0.85   0.85   0.48   0.62    9    0.50   0
Outcome Models
L    0.77   -0.62   0.45   0.89   -0.66   0.42   0.40   0.61
RF   0.80   -0.67   0.44   0.66   -0.50   0.40   0.19   0.24
NN   0.51   -0.10   0.50   0.50   -0.10   0.46   0.05   0.17
Lecture 4:
Potential Outcome Set Up for Causal Inference
• 60 pre-treatment variables
Question: are there search queries where the effect is small or large relative to this?
Naive Solution
Regression Trees (conventional, non-causal, set up; e.g., the Breiman book. Also possible to use the lasso or other flexible prediction methods)
Start with a single “leaf.” The predicted outcome is µ̂(x) = Ȳ. The average in-sample squared error is
Q(µ̂) = (1/N) Σ_{i=1}^N (Yi − µ̂(Xi))²
Leaf 1: Xik ≤ c
Consider splitting the leaf into two parts, depending on whether the kth covariate is below or above a threshold c. Now the predicted outcome is
µ̂(x) = ȲL if xk ≤ c,   ȲH if xk > c,
where
ȲL = (1/NL) Σ_{i:Xik≤c} Yi,   ȲH = (1/NH) Σ_{i:Xik>c} Yi,
and again
Q(µ̂) = (1/N) Σ_{i=1}^N (Yi − µ̂(Xi))²
Next, consider splitting either of the two leaves, and find the optimal threshold, the optimal covariate and the optimal leaf to split. Penalize by the number of leaves M:
Q(µ̂) + α · M,
where
Q(µ̂) = (1/N) Σ_{i=1}^N (Yi − µ̂(Xi))²
or, equivalently (up to a term that does not depend on the tree, since µ̂ is the within-leaf average),
Q̃(µ̂) = −(1/N) Σ_{i=1}^N µ̂(Xi)².
Trees for Causal Effects
Simple, Non-causal, Solutions to First Problem
Solution II (two trees): construct separate trees for µ0(x) = E[Yi^obs | Wi = 0, Xi = x] and µ1(x) = E[Yi^obs | Wi = 1, Xi = x], and estimate τ̂(x) = µ̂1(x) − µ̂0(x).
Insight
Define
Yi* = Yi^obs · (Wi − E[Wi]) / (E[Wi] · (1 − E[Wi]))
Then
τ(x) = E[Yi* | Xi = x]
Generalization to observational studies with unconfoundedness:
Yi* = Yi^obs · (Wi − e(Xi)) / (e(Xi) · (1 − e(Xi)))
Q1(τ̂) = (1/N) Σ_{i=1}^N (Yi* − τ̂(Xi))²
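A simulation sketch checking that E[Yi*|Xi = x] recovers τ(x) in a randomized experiment; the design, effect sizes, and seed are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)

# Randomized experiment with P(Wi = 1) = 0.5 and a heterogeneous effect.
N = 200000
x = rng.integers(0, 2, size=N)           # binary covariate
tau_x = np.where(x == 0, 1.0, 3.0)       # tau(x)
w = rng.binomial(1, 0.5, size=N)
y = rng.normal(0.0, 1.0, size=N) + tau_x * w

prob = 0.5                               # P(Wi = 1), known by design
y_star = y * (w - prob) / (prob * (1.0 - prob))  # transformed outcome

est0 = y_star[x == 0].mean()   # estimates tau(0) = 1
est1 = y_star[x == 1].mean()   # estimates tau(1) = 3
```

Any off-the-shelf regression of Yi* on Xi therefore estimates the conditional treatment effect, which is what Solution III below exploits.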
Alternative out-of-sample goodness-of-fit measure based on matching. Replace τi by
τ̂i = (2 · Wi − 1) · (Yi^obs − Y_ℓ(i)^obs),
where ℓ(i) is the nearest neighbor of i with the opposite treatment. Then use
Q2(τ̂) = (1/N) Σ_{i=1}^N (τ̂i − τ̂(Xi))² = (1/N) Σ_{i=1}^N ((2 · Wi − 1) · (Yi^obs − Y_ℓ(i)^obs) − τ̂(Xi))²
Solution III (transformed outcome tree): Use conventional tree methods to construct a tree based on the (Xi, Yi*) data (discarding Wi).
Solution IV (causal tree 1):
τ̂ = Ȳ1^obs − Ȳ0^obs,   τ̂H = ȲH,1^obs − ȲH,0^obs,   τ̂L = ȲL,1^obs − ȲL,0^obs,
where
Ȳw^obs = (1/Nw) Σ_{i:Wi=w} Yi^obs,   ȲL,w^obs = (1/NL,w) Σ_{i:Wi=w, Xik≤c} Yi^obs,   ȲH,w^obs = (1/NH,w) Σ_{i:Wi=w, Xik>c} Yi^obs.
The difference with Solution III is that τ̂H and τ̂L are not calculated as the average of Yi* within the leaves, but as the difference in average outcomes by treatment status.
Solution V (causal tree 2): instead of
Q̃(µ̂) = −(1/N) Σ_{i=1}^N µ̂(Xi)²
use
Q̃1(τ̂) = (1/N) · (N · τ̂² − (NL · τ̂L² + NH · τ̂H²))
Application
# Leaves:  52  36  26  21  21  24
Correlation Between Predictions and Yi* in Test Sample
leaf    training sample             test sample
        est       se      share     est       se      share
1      -0.1235  0.0036   0.2022   -0.1236  0.0036   0.2018
2      -0.1339  0.0099   0.0247   -0.1349  0.0102   0.0240
3      -0.0099  0.0044   0.0129   -0.0073  0.0044   0.0132
4      -0.2148  0.0128   0.0214   -0.2467  0.0126   0.0216
5      -0.1453  0.0030   0.3049   -0.1480  0.0030   0.3044
6      -0.1109  0.0056   0.0628   -0.1100  0.0055   0.0635
7      -0.2303  0.0283   0.0036   -0.2675  0.0284   0.0037
8      -0.0575  0.0096   0.0165   -0.0324  0.0095   0.0168
9      -0.0868  0.0307   0.0026   -0.0559  0.0294   0.0025
10     -0.1505  0.0048   0.1191   -0.1693  0.0047   0.1191
11     -0.1741  0.0236   0.0045   -0.1682  0.0239   0.0046
12      0.0255  0.1267   0.0003    0.2857  0.1235   0.0002
13     -0.0297  0.0264   0.0019   -0.0085  0.0250   0.0022
14     -0.1352  0.0142   0.0106   -0.1139  0.0147   0.0100
15     -0.1591  0.0552   0.0010   -0.1432  0.0526   0.0011
16     -0.0135  0.0260   0.0005    0.0080  0.0502   0.0004
17     -0.0809  0.0118   0.0131   -0.0498  0.0124   0.0132
18     -0.0453  0.0231   0.0014   -0.0454  0.0208   0.0014
19     -0.1694  0.0158   0.0105   -0.1997  0.0162   0.0106
20     -0.2072  0.0304   0.0031   -0.2790  0.0305   0.0030
21     -0.0955  0.0106   0.0226   -0.0834  0.0108   0.0223
Substantial variation in conditional treatment effects
• interpretable leaves.
Simulations
Yi(w) = Σ_{k=1}^5 Xik · βkw + εiw
β0 = (1, −1, 0, 0, 0)′,   β1 = (−1, 1, 0, 0, 0)′
Causal Inference and Machine Learning
Lecture 5:
1. Re-randomization
2. Experiments in Networks
3. Multi-armed Bandits
1. Re-Randomization
Re-randomization is conceptually similar to a completely randomized experiment:
Formal Analysis of Re-Randomization
Calculate
X̄w = (1/Nw) Σ_{i:Wi=w} Xi,   tX = (X̄1 − X̄0) / √(s²X,0/N0 + s²X,1/N1)
Two Cases
Key:
1. articulate the strategy a priori, so randomization inference is possible.
Conclusion
2. Experiments in Networks
2.1 Causal Effects and Potential Outcomes
2.2 Three Null Hypotheses of Interest
No treatment effects:
Yi(w) = Yi(w′) for all units i, and all pairs of assignments w, w′ ∈ W.
(straightforward because this hypothesis is sharp)
We see one pair (w, Y(w)); based on that we cannot infer the value of Y(w′) for some other w′ under the null.
2.3 What Not To Do
T = (1/N1) Σ_{i,j:Gij=1} Yi^obs · Wj − (1/N0) Σ_{i,j:Gij=1} Yi^obs · (1 − Wj)
This is not valid: the rejection rates under the null for a 5% test can be much larger than 5%.
Randomization inference:
2.4 Artificial Experiments
Now for individual i in the focal group, given the actual treatment W, figure out the set of treatments Wi(W, H0) that would lead to the same outcome under the null.
p′(w) = p(w) / Σ_{w′∈WR} p(w′)
Example I: H0 is no treatment effect whatsoever. PF = P, E′ = (W, P, p(·)) = E.
p′(w) = p(w) / Σ_{w′∈WR} p(w′)
E′ = (WR, PF, p′(·))
The artificial experiment takes a set of focal units, and looks at the randomization distribution induced by changing the assignment for a set of auxiliary units.
Statistics
T = (1/N1) Σ_{i,j:Gij=1} Yi^obs · Fi · (1 − Fj) · Wj − (1/N0) Σ_{i,j:Gij=1} Yi^obs · Fi · (1 − Fj) · (1 − Wj)
Score statistic based on the linear model
Yi^obs = α0 + αw · Wi + αy · Ȳ(i)^obs + εi:
Tscore = (1/NF) Σ_{i∈PF} {Yi^obs − ȲF,0^obs − Wi · (ȲF,1^obs − ȲF,0^obs)} × Σ_{j=1}^N (Gij · Wj − Ḡ · W̄)
2.5 Some Simulations
Yi(w) = Yi(w0) + wi · τdirect + (Ki,1/Ki) · τspill.
Rejection Rates of Null Hypothesis of No Spillovers Beyond the First Order Spillovers from the Sparsified Network
3. Multi-armed Bandits
Suppose there are K treatments, with binary outcomes Yi ∼ B(1, pk).
Allocating a lot of units to treatment arm 1 does not serve much of a purpose: it is unlikely that arm 1 is the best arm.
Thompson Sampling and Upper Confidence Bound Methods
Thompson Sampling
This balances exploration (learning more about the arms by allocating units to them) and exploitation (sending more units to arms that do well).
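A minimal Thompson sampling sketch for the binomial bandit with Beta(1, 1) priors; the horizon and seed are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)

p_true = np.array([0.04, 0.05])      # two-armed example
successes = np.ones(2)                # Beta(1, 1) prior for each arm
failures = np.ones(2)
n_pulls = np.zeros(2, dtype=int)

for _ in range(20000):
    draws = rng.beta(successes, failures)  # sample each arm's posterior
    k = int(np.argmax(draws))              # play the arm with best draw
    reward = rng.binomial(1, p_true[k])
    successes[k] += reward                 # conjugate Beta update
    failures[k] += 1 - reward
    n_pulls[k] += 1
```

Early on the posteriors are diffuse and both arms get played (exploration); as evidence accumulates, the allocation concentrates on the arm with the higher posterior mean (exploitation).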
Consider a case with two arms, and p1 = 0.04, p2 = 0.05.
Thompson sampling: the binomial bandit
Two-armed experiment, with the bandit shown 100 visits per day (source: https://support.google.com/analytics/answer/2844870?hl=en)
• In theory we only need this sample size in the largest arm, but we don't know ahead of time which arm that will be.
• A classical experiment needs 91848 observations; at 100 per day that is 2.5 years.
6-arm experiment, still 100 observations per day: huge savings!
Upper Confidence Bounds
Pick the arm with the highest upper confidence limit, and assign the next unit to that arm.
Contextual Bandits
Causal Inference and Machine Learning
Lecture 6:
• California's anti-smoking legislation (Proposition 99) took effect in 1989.
W =
  0 0 0 ... 0 0
  0 0 0 ... 0 0
  0 0 0 ... 0 0
  .. .. .. ... .. ..
  0 0 0 ... 1 1
  0 0 0 ... 1 1
(binary treatment)
Challenge:
τ̂UNC = (1/N1) Σ_{i:WiT=1} (YiT(1) − ŶiT(0))
where
ŶiT(0) = α̂ + Σ_{t=1}^{T−1} λ̂t Yit
Then:
ŶiT(0) = µ(Yi,1:T0) + (1/N0) Σ_{j:WjT=0} (e(Xj)/(1 − e(Xj))) (YjT − µ(Yj,1:T0))
Abadie-Diamond-Hainmueller Synthetic Control Method
ŶNT(0) = Σ_{j=1}^{N−1} ωj YjT
ω(V) = argmin_ω ‖ZN − Σ_{i=1}^{N−1} ωi · Zi‖V
V̂ = argmin_V ‖YN,1:T0 − Σ_{i=1}^{N−1} ωi(V) · Yi,1:T0‖
ω* = ω(V̂) = argmin_ω ‖ZN − Σ_{i=1}^{N−1} ωi · Zi‖V̂
Doudchenko-Imbens:
τ̂DI = YNT − ŶNT(0),   ŶNT(0) = α + Σ_{i=1}^{N−1} ωi YiT,
where α and ω solve the "vertical" regression
min_{α,ω} Σ_{t=1}^{T−1} (YNt − α − Σ_{i=1}^{N−1} ωi Yit)²
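The vertical regression can be sketched with ordinary least squares; the data-generating weights, noise level, and seed are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(8)

# N - 1 = 5 control units observed for T0 pre-periods plus period T.
N, T0 = 6, 40
controls = rng.normal(size=(T0 + 1, N - 1))       # periods x controls
true_omega = np.array([0.5, 0.3, 0.2, 0.0, 0.0])  # assumed weights
treated = 1.0 + controls @ true_omega + rng.normal(0.0, 0.01, T0 + 1)

# Fit alpha and omega on the T0 pre-periods by least squares.
Xpre = np.column_stack([np.ones(T0), controls[:T0]])
coef, *_ = np.linalg.lstsq(Xpre, treated[:T0], rcond=None)
alpha_hat, omega_hat = coef[0], coef[1:]

# Impute the treated unit's period-T counterfactual.
y_hat_T = float(alpha_hat + controls[T0] @ omega_hat)
```

In practice the number of control units can exceed the number of pre-periods, which is why Doudchenko-Imbens add regularization to this regression.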
Model for Yit(0):
Yit(0) = αi + γt + εit,
leading to
min_{α,γ} Σ_{i=1}^N Σ_{t=1}^T (1 − Wit)(Yit − γt − αi)²
τ̂ = [1/(N1T1) Σ_{i=N0+1}^N Σ_{t=T0+1}^T Yit − 1/(N1T0) Σ_{i=N0+1}^N Σ_{t=1}^{T0} Yit]
  − [1/(N0T1) Σ_{i=1}^{N0} Σ_{t=T0+1}^T Yit − 1/(N0T0) Σ_{i=1}^{N0} Σ_{t=1}^{T0} Yit]
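In a panel with exactly additive unit and time effects, this four-means formula recovers τ; a sketch in which the dimensions and parameter values are illustrative assumptions:

```python
import numpy as np

# Panel with additive unit and time effects plus a constant effect tau.
N0, N1, T0, T1, tau = 3, 2, 4, 2, 1.5
alpha = np.arange(N0 + N1, dtype=float)        # unit fixed effects
gamma = 0.5 * np.arange(T0 + T1, dtype=float)  # time fixed effects
W = np.zeros((N0 + N1, T0 + T1))
W[N0:, T0:] = 1                                # treated block, post-period
Y = alpha[:, None] + gamma[None, :] + tau * W

# Difference-in-differences: four group-by-period means.
tau_did = ((Y[N0:, T0:].mean() - Y[N0:, :T0].mean())
           - (Y[:N0, T0:].mean() - Y[:N0, :T0].mean()))
```

The unit and time effects cancel in the double difference, leaving τ.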
More generally, factor models:
Yit(0) = Σ_{r=1}^R γtr αir + εit
argmin_{α,γ,L} Σ_{i=1}^N Σ_{t=1}^T (1 − Wit)(Yit − αi − γt − Lit)² + λ‖L‖
Comparison of methods: given the sample, we pretend that half the stocks are treated at the midpoint over time, so that 25% of the entries in the matrix are missing.
Y (N × T) =
  X X X X ... X
  X X X X ... X
  X X X X ... X
  X X X ? ... ?
  X X X ? ... ?
  ...
  X X X ? ... ?
N × T = 4900, fraction missing = 0.25
[Figure: average RMSE (normalized) versus log(N) for the EN-H, EN-V, and MC-NNM methods]
Results
τ̂ADH = argmin_{τ,γ} Σ_{i=1}^N Σ_{t=1}^T (Yit − γt − τWit)² × ωi^ADH
τ̂DID = argmin_{τ,γ,α} Σ_{i=1}^N Σ_{t=1}^T (Yit − γt − αi − τWit)²
τ̂SDID = argmin_{τ,γ,α} Σ_{i=1}^N Σ_{t=1}^T (Yit − γt − αi − τWit)² × ωi^ADH × λt^ADH
Regression with unit and time fixed effects, and with unit and time weights.
Time weights satisfy:
λ = argmin_λ Σ_{i=1}^{N−1} (YiT − Σ_{t=1}^{T−1} λt Yit)² + regularization term,
subject to
λt ≥ 0,   Σ_{t=1}^{T−1} λt = 1.
τ̂SFM = argmin_{L,α,γ,τ} Σ_{i=1}^N Σ_{t=1}^T (Yit − αi − γt − Lit − τWit)² ωi^ADH λt^ADH + λ‖L‖
Double Robustness
Take pre-1988 data for all states, so we observe all Yit (0) for
all unit/time pairs.
[Figures: per-capita cigarette sales (packs per capita), 1980-1988]
[Figure: scatter of synthetic control RMSE versus synthetic diff-in-diff RMSE, log scale]