A Survey of Causal Inference Framework
September 2022
Abstract
Causal inference is a science with multi-disciplinary evolution and applications. On the one hand, it measures the effects of treatments in observational data based on experimental designs and rigorous statistical inference to draw causal statements. One of the most influential frameworks for quantifying causal effects is the potential outcomes framework. On the other hand, causal graphical models utilize directed edges to represent causal relationships and encode conditional independence relations among the variables in a graph. A body of research has addressed both reading off conditional independencies from graphs and reconstructing causal structures. In recent years, state-of-the-art research in causal inference has started unifying the different causal inference frameworks. This survey aims to provide a review of past work on causal inference, focusing mainly on the potential outcomes framework and causal graphical models. We hope that this survey will help accelerate the understanding of causal inference in different domains.
1 Introduction
Causal reasoning is widely used in everyday language interchangeably with correlation or association. Based on observations, one might draw informal conclusions like "Aspirin helps reduce my pain" or "She got a good grade because she is smart". In many cases, "correlation does not imply causation"; however, an action (or manipulation) can cause an effect. Therefore, many causal statements are implausible if no explicit intervention is considered. Traditional statistical analysis, such as estimation and hypothesis testing, typically emphasizes associations among variables and bypasses the language of causality. Nevertheless, the questions that scientific researchers try to answer are more often about causality than correlation. For instance, will smoking cause lung cancer? Will a decrease in housing demand cause housing prices to drop? Can a certain drug prolong survival for cancer patients? Causal inference goes beyond association and identifies the causal effect when a cause of the outcome variable is intervened on, which makes it a critical research area across many fields, such as computer science, economics, epidemiology, and social science.
In scientific research, the controlled randomized experiment, which uses a random assignment mechanism to assign subjects to different treatment groups, has a long history as the gold standard for establishing causality. In many situations, however, randomized experiments are neither feasible nor ethical in practice, so researchers need to rely on observational data to infer causal relationships. To this end, several research communities have developed various frameworks for causal identification based on observational data.
Historically, there are three major origins of causal inference, developed separately for different purposes: the potential outcomes (counterfactuals) community, the graphs community, and the structural equations community. The potential outcomes framework provides a way to identify causal effects using statistical inference. This framework was first introduced by Neyman (1923) for randomized experiments and generalized by Rubin (1974) to observational studies.
Later, Holland (1986) published an influential paper and named the general potential outcomes approach the Rubin Causal Model. The potential outcomes approach associates causality with manipulations applied to units, and compares the causal effects of different treatments via their corresponding potential outcomes. Potential outcomes are convenient for statistical inference on a single cause-effect pair, yet they have drawbacks when the system becomes complicated. On the other hand, graph analysis and structural equations in causal inference trace back to the work on path analysis introduced by the geneticist Wright (1918). Wright's path analysis combines graphs with linear structural equation models (SEMs) to represent causal relationships by directed edges, which is a useful tool for differentiating correlation from causation when the graph structure is given. Pearl (1988) later relaxed the linearity assumption and formalized causal graphical models for representing conditional independence relations among random variables using directed acyclic graphs (DAGs). A variety of criteria have been developed for reading off independencies from a given graph, including the most famous result, the completeness of d-separation. Nowadays, researchers have been making efforts to unify the theories of causality at the intersection of these communities (Richardson and Robins, 2013), and to use causal inference to make estimation more reliable (Rojas-Carulla et al., 2018; Arjovsky et al., 2019).
Both estimands can be viewed as a function of {Y(0), Y(1), X, Z}, where X is the observed covariate matrix. The distinction between the two, however, lies in whether the potential outcomes are assumed to be stochastic. For the PATE, the N subjects, each associated with a quadruple {Y_i(0), Y_i(1), Z_i, X_i}, are viewed as a random sample drawn from a super-population, so that the potential outcomes are random variables with a distribution. When the FATE is the target estimand, the potential outcomes are treated as a fixed vector and inference is conditional on this vector; the randomness comes from the treatment assignments Z_i. With random sampling from the target population, the SATE is in expectation equal to the PATE. The SATE is typically of interest in randomized experiments, whereas observational studies often use the PATE as the target causal estimand. Intuitively, suppose an investigator randomly samples a hundred subjects from the U.S. to test the efficacy of a vaccine. Under the finite-population view, the question causal inference tries to answer is whether this vaccine is effective for this sample of subjects; under the super-population view, the question shifts to whether the efficacy of the vaccine generalizes to the whole U.S. population.
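For reference, the usual definitions of these estimands (a sketch in the standard notation; the finite-sample estimand is taken here as the average over the N sampled units) are

$$\tau_{PATE} = \mathbb{E}\left[Y_i(1) - Y_i(0)\right], \qquad \tau_{SATE} = \frac{1}{N}\sum_{i=1}^{N}\left[Y_i(1) - Y_i(0)\right].$$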
Under the potential outcomes model with a binary treatment, the values (Y_i^obs, Z_i, X_i) ∈ R × {0, 1} × 𝒳 of N independent and identically distributed units are observed, where Z_i and X_i denote the treatment assignment and a vector of covariates for each unit. Under the Stable Unit Treatment Value Assumption (SUTVA), which rules out different forms of the treatment level and interference among sample subjects, the observed outcome is determined by the treatment assignment as Y_i^obs = Y_i(Z_i), or equivalently, Y_i^obs = Z_i Y_i(1) + (1 − Z_i) Y_i(0) (Rubin, 1980). Similarly, the unobserved potential outcome can be denoted Y_i^miss = Y_i(1 − Z_i).
In addition to the observed outcomes, Rubin (1990) argues that the assignment mechanism, defined as the process of determining which subject receives which treatment level, is also an essential piece of information for inferring reliable causal estimands. In the potential outcomes framework, causal estimands are disentangled from the probabilistic models of the assignment mechanism, which is the primary feature that distinguishes the potential outcomes model from other frameworks (Imbens and Rubin, 2015; Frangakis and Rubin, 2002). A classical randomized controlled trial is a randomized experiment with a controlled and random assignment mechanism, which guarantees unconfoundedness in the sense that {Y_i(0), Y_i(1)} ⊥ Z_i | X_i (Rubin, 1978).
In a randomized experiment, randomization balances both observed and unobserved baseline characteristics (Rubin et al., 2008), so that a simple difference-in-means estimator, also known as Neyman's inference, formulated as

$$\hat{\tau}_{DIM} = \frac{1}{N_1}\sum_{i: Z_i = 1} Y_i - \frac{1}{N_0}\sum_{i: Z_i = 0} Y_i,$$

where $N_1 = \sum_{i=1}^{N} Z_i$ and $N_0 = \sum_{i=1}^{N} (1 - Z_i)$, consistently and unbiasedly estimates the ATE and establishes a credible causal link at the population level (Imbens and Rubin, 2015).
Research on randomized experiments has attracted much attention. Fisher (1992) considered the randomization test for the sharp null hypothesis that Y_i(0) − Y_i(1) = β for all i, which in practice is often approximated by Monte Carlo simulation. Aronow et al. (2014) constructed a sharper bound on the variance of the difference-in-means estimator. In modern causal inference, theory indicates that appropriately incorporating pre-treatment covariates can increase precision in the analysis of randomized experiments (Li and Ding, 2020; Ye et al., 2020), and many researchers have further investigated ordinary least squares regression-adjusted estimators that do not depend on linear-model assumptions to improve asymptotic precision, including the discussions by Freedman (2008), Tsiatis et al. (2008), Lin et al. (2013), and Ye et al. (2020). Inference about causal effects from randomized experiments has been viewed as the gold standard; however, randomized experiments can be infeasible or unethical in practice. Therefore, studying observational data for causality serves as an alternative when randomized experiments are impossible.
The biggest challenge in observational studies is that the treatment assignment mechanism is unknown and the subjects assigned to different groups might differ systematically in some unobserved characteristics. Thus, additional assumptions are necessary to identify causality. To obtain credible causal effects from an observational study, one needs to consider how well it emulates a randomization-like scenario. The majority of causal inference for observational studies is based on the strong ignorability assumption on the assignment mechanism, which requires unconfoundedness and positivity to be free from latent bias (Rosenbaum and Rubin, 1983). The unconfoundedness assumption states that, given a set of controlled covariates X_i, the assignment mechanism is independent of the potential outcomes. When one is willing to assume unconfoundedness, an observational study can be viewed as a randomized controlled trial in which the treatment assignment is random within each level of x. In addition to unconfoundedness, the positivity assumption further requires the propensity score, defined as e(x) = P(Z_i = 1 | X_i = x), to be strictly between 0 and 1 for all x in the support of X_i, so that each unit has a positive probability of being assigned to either the treatment or the control group. Without the positivity assumption, the probability of some subpopulation being assigned to one of the groups might be zero, and inference on treatment effects for that subpopulation would rely on extrapolation (Imbens and Rubin, 2015).
By assuming no unmeasured confounders, one popular way to derive an estimator for the ATE is via regression models. Defining µ_z(x) = E[Y_i(z) | X_i = x], z = 0, 1, as the outcome regressions among the treated and the controls respectively, and µ̂_z(X_i) as the corresponding fitted models, an alternative representation of the PATE can be written in terms of these outcome regressions.
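In the standard notation these take the following forms (a sketch; the regression-imputation and inverse-probability-weighting estimators below are the usual ones, with ê(x) an estimate of the propensity score defined above):

$$\tau_{PATE} = \mathbb{E}\left[\mu_1(X_i) - \mu_0(X_i)\right], \qquad \hat{\tau}_{reg} = \frac{1}{N}\sum_{i=1}^{N}\left[\hat{\mu}_1(X_i) - \hat{\mu}_0(X_i)\right],$$

$$\hat{\tau}_{IPW} = \frac{1}{N}\sum_{i=1}^{N}\left[\frac{Z_i Y_i}{\hat{e}(X_i)} - \frac{(1 - Z_i)\, Y_i}{1 - \hat{e}(X_i)}\right].$$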
The IPW estimator τ̂_IPW coincides with the regression estimator τ̂_reg when X is discrete, but when X is continuous they generally differ. Inspecting τ̂_IPW, the estimator weights the observed outcomes by the inverse of the probability of receiving treatment or control. Intuitively, a subject represents a larger portion of the population if he or she had a lower chance of receiving the observed treatment. In practice, it is common to normalize the weights to reduce the variance of the weighting estimator, leading to a more stable estimate (Hirano et al., 2003). From the perspective of semiparametric inference, τ̂_IPW admits an asymptotically linear expansion and attains the semiparametric efficiency bound, so that no regular estimator can improve on its asymptotic performance; in other words, τ̂_IPW is an asymptotically optimal estimator of the ATE (Hirano et al., 2003; Tsiatis, 2007). Even though the IPW estimator has good theoretical foundations, its unbiasedness and consistency rely heavily on the propensity score model being correctly specified (Funk et al., 2011). Some researchers have proposed using machine learning models to estimate propensity scores, and Dorie et al. (2019) showed improvements in simulations compared with traditional logistic models; however, Keele and Small (2021) pointed out that black-box machine learning methods make little difference in real data. Additionally, like other propensity-score-based methods, τ̂_IPW does not perform well when the estimated propensity scores are close to 0 or 1. To address the instability of the IPW estimator caused by extreme weights, Crump et al. (2009) suggested further adjustments, such as trimming, to exclude subjects outside the common support region.
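As a minimal illustration of the weighting logic just described (a sketch, not taken from the original text: the function name, the use of scikit-learn's LogisticRegression for the propensity model, and the clipping threshold are illustrative choices):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_ate(X, z, y, clip=0.05):
    """Normalized (Hajek-style) IPW estimate of the ATE from arrays of
    covariates X, binary treatment z, and outcomes y."""
    X, z, y = np.asarray(X), np.asarray(z), np.asarray(y)
    # Estimate propensity scores; any probabilistic classifier could be used here.
    e_hat = LogisticRegression(max_iter=1000).fit(X, z).predict_proba(X)[:, 1]
    # Clip extreme scores as a simple stabilization device; Crump et al. (2009)
    # instead suggest excluding units outside the common support region.
    e_hat = np.clip(e_hat, clip, 1 - clip)
    w1, w0 = z / e_hat, (1 - z) / (1 - e_hat)
    # Normalizing the weights within each arm reduces variance (Hirano et al., 2003).
    return np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)
```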
The doubly robust (DR) estimator, also called the augmented IPW (AIPW) estimator, was first introduced by Robins et al. (1994) from the missing-data perspective; it combines the regression-based approach and the weighting method to provide double protection against misspecification (Lunceford and Davidian, 2004).
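In its standard form (a sketch using the fitted outcome regressions µ̂_z and propensity score ê defined above), the estimator is

$$\hat{\tau}_{DR} = \frac{1}{N}\sum_{i=1}^{N}\left[\hat{\mu}_1(X_i) - \hat{\mu}_0(X_i) + \frac{Z_i\,(Y_i - \hat{\mu}_1(X_i))}{\hat{e}(X_i)} - \frac{(1 - Z_i)\,(Y_i - \hat{\mu}_0(X_i))}{1 - \hat{e}(X_i)}\right].$$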
The DR estimator can be interpreted as first estimating the PATE using the outcome regressions and then applying IPW to the residuals to adjust for bias. In this approach, one first estimates the nuisance parameters µ_z(x) and e(x), possibly non-parametrically, and then estimates the parametric part, the average treatment effect. As long as at least one of the two postulated models is correctly specified, τ̂_DR remains consistent for the PATE, a property referred to as double robustness (Waernbaum, 2010).
A sequence of papers has studied the properties of the DR estimator. Glynn and Quinn (2010) showed that the DR estimator is more stable than the IPW estimator when the propensity scores are close to 0 or 1. Farrell (2015) examined the behavior of the DR estimator with high-dimensional regression adjustments. Robins et al. (1994) laid the foundation for later research on the semiparametric efficiency of the DR estimator for ATEs, and Hahn (1998) subsequently demonstrated the role of the propensity score in efficient semiparametric estimation of ATEs and ATTs. In recent years, research on incorporating machine learning algorithms to estimate the nuisance components of the DR estimator has become popular. Westreich et al. (2010) proposed machine learning models such as neural networks, support vector machines, and decision trees (CART) as alternatives to logistic regression for estimating propensity scores. Recently, several authors, such as Chernozhukov et al. (2018) and Newey and Robins (2018), have discussed estimating the nuisance parameters in the DR estimator via cross-fitting so as to attain efficiency under certain conditions. Specifically, the idea of cross-fitting is to split the data into K folds; for each fold, the nuisance parameters, namely the outcome regressions and propensity scores, are estimated on the remaining folds, the treatment-effect contribution is evaluated on the held-out fold, and the results are averaged across folds.
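The following sketch illustrates cross-fitted AIPW estimation of this kind (an illustrative implementation, not from the original text: the helper name, the linear/logistic nuisance models, and the clipping constant are assumptions; any flexible learners could be substituted):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import KFold

def cross_fit_aipw(X, z, y, n_splits=5, clip=0.01):
    """Cross-fitted AIPW (doubly robust) estimate of the ATE."""
    X, z, y = np.asarray(X), np.asarray(z), np.asarray(y)
    scores = np.zeros(len(y))
    for train, test in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X):
        # Nuisance models (propensity score and outcome regressions) are fit
        # on the training folds only.
        ps = LogisticRegression(max_iter=1000).fit(X[train], z[train])
        mu1 = LinearRegression().fit(X[train][z[train] == 1], y[train][z[train] == 1])
        mu0 = LinearRegression().fit(X[train][z[train] == 0], y[train][z[train] == 0])
        # The AIPW score is evaluated on the held-out fold.
        e = np.clip(ps.predict_proba(X[test])[:, 1], clip, 1 - clip)
        m1, m0 = mu1.predict(X[test]), mu0.predict(X[test])
        zt, yt = z[test], y[test]
        scores[test] = (m1 - m0
                        + zt * (yt - m1) / e
                        - (1 - zt) * (yt - m0) / (1 - e))
    # Averaging the per-unit scores across folds gives the cross-fitted estimate.
    return scores.mean()
```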
Related work has considered the use of augmented balancing estimators. Moreover, Chernozhukov et al. (2016) generalized balancing conditions to broader settings and developed AIPW-based estimators for general functionals that can be written in terms of a Riesz representer.
2.4 Instrumental Variables Approach
Traditionally, unbiased estimation of causal effects has primarily rested on the assumption of unconfoundedness. However, in certain circumstances there are variables, so-called instrumental variables (IVs), that affect the treatment assignment and thereby indirectly affect the outcome of interest. For instance, the variable Z in Figure 1 can be considered an instrumental variable that has no direct effect on the outcome of interest Y but affects the outcome through the variable W. In such settings the unconfoundedness assumption seems implausible.
The IV approach has long been discussed in the econometric literature, dominated by structural equation models that rely on linear parametric specifications to model constant treatment effects (Wright, 1921; Haavelmo, 1943; Morgan et al., 1990; Stock and Trebbi, 2003). Since the influential work by Imbens and Angrist (1994), Angrist et al. (1996), and Imbens and Rubin (1997b), which embedded instrumental variables into the potential outcomes framework and formulated the assumptions in a more tractable way, there has been growing interest in IV methods in the recent statistics literature. A common IV estimand in the statistics literature is the Local Average Treatment Effect (LATE), introduced by Imbens and Angrist (1994), which is interchangeably referred to as the Complier Average Treatment Effect (CATE) (Imbens and Rubin, 1997b), since inference is based on the average effect within the subpopulation of compliers.
The finite-sample LATE is defined as

$$\tau_{SLATE} = \frac{1}{N_{co}} \sum_{i: G_i = co} \left[Y_i(1) - Y_i(0)\right],$$

where N_co denotes the number of compliers.
In this sense, the causal effect of the treatment received might be of more interest than the causal effect of the treatment assigned. The IV approach provides an alternative that not only allows researchers to relax the assumption of unconfoundedness, but also enables estimation of the effects of the receipt of treatment.
In the basic setup of a randomized experiment with noncompliance, in addition to the binary treatment assignment Z_i and the outcome of interest Y_i^obs of the usual potential outcomes model, the actual receipt of treatment W_i^obs = W_i(Z_i) is also observed. Each unit is associated with two potential outcomes for the treatment received, W_i(0) and W_i(1), so that units can be partitioned into compliers (co), nevertakers (nt), defiers (df), and alwaystakers (at) according to their compliance status G_i, which is determined by the pair of values (W_i(0), W_i(1)) (Angrist et al., 1996). More specifically, the four compliance types are defined as

$$G_i = \begin{cases} \text{nevertaker (nt)} & \text{if } W_i(0) = 0,\ W_i(1) = 0, \\ \text{complier (co)} & \text{if } W_i(0) = 0,\ W_i(1) = 1, \\ \text{defier (df)} & \text{if } W_i(0) = 1,\ W_i(1) = 0, \\ \text{alwaystaker (at)} & \text{if } W_i(0) = 1,\ W_i(1) = 1. \end{cases}$$
Because of the fundamental problem of causal inference, W_i(Z_i) and W_i(1 − Z_i) are not jointly observable, so the compliance type cannot be inferred directly from the observed assignment Z_i and the treatment received W_i^obs without extra assumptions. The key assumptions for identifying the estimand CATE are SUTVA, cross-world counterfactual independence, the exclusion restriction, and monotonicity. Under the exclusion restriction, Y_i(z, w), the double-indexed potential outcome for the primary outcome of interest, can be written as Y_i(w). CATE can then be estimated nonparametrically by a method-of-moments estimator, the ratio of the ITT effects on the primary outcome and on the treatment received, as sketched at the end of this paragraph (Angrist et al., 1996; Imbens and Rubin, 2015). Later, Frangakis and Rubin (2002) referred to the compliance types as principal strata and treated noncompliance as a special case of a post-treatment variable in their principal stratification approach. Apart from noncompliance, the principal stratification method has been widely used in other applications, such as censoring by death (Rubin et al., 2006; Zhang and Little, 2009), fuzzy regression discontinuity (FRD) designs (Hahn et al., 2001; Chib and Jacobi, 2016), and mediation analysis (Elliott et al., 2010).
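The moment-based (Wald-type) estimator referred to above can be written, in the standard notation, as the ratio of the ITT effects on the outcome and on the receipt of treatment:

$$\hat{\tau}_{CATE} = \frac{\widehat{ITT}_Y}{\widehat{ITT}_W} = \frac{\frac{1}{N_1}\sum_{i: Z_i = 1} Y_i^{obs} - \frac{1}{N_0}\sum_{i: Z_i = 0} Y_i^{obs}}{\frac{1}{N_1}\sum_{i: Z_i = 1} W_i^{obs} - \frac{1}{N_0}\sum_{i: Z_i = 0} W_i^{obs}}.$$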
The main pitfall of the moment-based estimator is the difficulty of incorporating covariates. Therefore, Imbens and Rubin (1997a) and Hirano et al. (2000) outlined Bayesian inference for principal stratification that can incorporate pre-treatment covariates into the analysis. For each subject, only W_i^obs = W_i(Z_i) and Y_i^obs = Y_i(W_i^obs) can possibly be observed, while W_i^miss = W_i(1 − Z_i) and Y_i^miss = Y_i(1 − W_i^obs) are missing. Intrinsically, causal inference can be seen as a missing-data problem. Let Y^obs, Y^miss, W^obs, and W^miss be the N-vectors of observed and missing outcomes and receipts of treatment, respectively. The goal in the Bayesian paradigm is to derive the predictive distribution of (Y^miss, W^miss) given the observed data (Y^obs, W^obs, Z, X) so as to compute the estimand τ_SLATE = τ(Y^obs, Y^miss, W^obs, W^miss, Z, X) as a function of the observed and missing variables.
The core inputs that need to be specified in the model-based approach are the distribution of the compliance types f(G_i | X_i, θ) and the joint distribution of the potential outcomes f(Y_i(0), Y_i(1) | W_i(0), W_i(1), X_i; θ), or equivalently, f(Y_i(0), Y_i(1) | G_i, X_i; θ). Rubin et al. (2010) suggested using two binary models, one for being a complier or not, and one for being a nevertaker conditional on not being a complier, to model the three-valued compliance status indicator. The main idea of this framework is that the observed pairs (Z_i, W_i^obs) consist of a mixture of subjects from different compliance types, so that methods such as the EM algorithm or Data Augmentation (DA), which are typically used for mixture-model inference, can be used to infer causal effects under principal stratification (Imbens and Rubin, 2015). For the DA method, Richardson et al. (2011) further pointed out that the posterior is sensitive to the model specification of the compliance types and proposed a transparent parametrization that separates the identified and non-identified parameters.
3 Causal Graphical Models
The potential outcomes framework is powerful for recovering the effects of causes: causal questions are answered through specific manipulations of treatments (Holland, 1986). However, when it comes to identifying causal pathways or visualizing causal networks, the potential outcomes model has its own limitations. As an alternative, causal graphical models provide an intriguing tool for representing causal effects by directed edges that describe dependency relationships, which not only allows interventions to be modeled more generally, but also describes the data-generating mechanism of the random variables.
or equivalently,

$$X_{[p]} = \beta_0 + M^{\top} X_{[p]} + \epsilon_{[p]}, \qquad (4)$$

where pa(X_i) denotes the set of variables that are parents (or direct predecessors) of X_i, ε_1, ..., ε_p are mutually independent noise terms with zero mean, the β_{ji} are the so-called path coefficients that quantify the causal effect of X_j on X_i, and M is the matrix of path coefficients.
This framework is structural in that, even if we intervene on several variables, the functional form among the variables in Equation (3) does not change. Random variables X_{[p]} satisfying the model structure of Equation (3) can be represented by a directed acyclic graph (DAG) G = (V, E), where V is the set of vertices, each corresponding to one variable of interest X_i, and E ⊆ V × V is the edge set. Because a DAG contains no cycles and admits a topological ordering of the vertices, the matrix Id − M is invertible, and the structural assignments therefore induce a unique solution X_{[p]} = (Id − M^⊤)^{-1}(β_0 + ε_{[p]}).
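The closed-form solution can be checked numerically on a small example (an illustrative sketch; the three-variable DAG and coefficient values are made up for the illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 3
# Path-coefficient matrix M with M[j, i] = beta_{ji}, the effect of X_j on X_i.
# Hypothetical DAG: X1 -> X2 -> X3 and X1 -> X3.
M = np.zeros((p, p))
M[0, 1] = 0.8   # X1 -> X2
M[1, 2] = 0.5   # X2 -> X3
M[0, 2] = -0.3  # X1 -> X3
beta0 = np.array([1.0, 0.0, 2.0])
eps = rng.normal(size=p)

# Recursive generation following the topological order (X1, X2, X3).
x = np.zeros(p)
for i in range(p):
    x[i] = beta0[i] + M[:, i] @ x + eps[i]

# Closed-form solution X = (Id - M^T)^{-1} (beta0 + eps) from the text.
x_closed = np.linalg.solve(np.eye(p) - M.T, beta0 + eps)
assert np.allclose(x, x_closed)
```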
Path analysis can be used to differentiate correlation from causation. Applying Wright's path analysis, when the variances of the X_i are all standardized, the covariance between X_i and X_j is obtained by summing, over all d-connected paths, the product of the path coefficients along each path, that is,

$$\mathrm{Cov}(X_i, X_j) = \sum_{(d_0, \ldots, d_m) \in D(i,j)} \; \prod_{k=1}^{m} \beta_{d_{k-1} d_k}, \qquad (5)$$

where D(i, j) is the collection of d-connected paths between i and j. Additionally, in a linear SEM, the causal effect of X_i on X_j can be defined as

$$c(X_i \to X_j) = \sum_{(d_0, \ldots, d_m) \in G(i,j)} \; \prod_{k=1}^{m} \beta_{d_{k-1} d_k}, \qquad (6)$$
where G(i, j) is the collection of directed paths between i and j. As can be seen from Equation (5), if Cov(X_i, X_j) ≠ 0 then there exists a d-connected path between X_i and X_j. A d-connected path between X_i and X_j only introduces dependence between the two variables, which does not imply causation. The correlation between X_i and X_j implies causation only if D(i, j) = G(i, j), in other words, only if all the d-connected paths between i and j are directed paths.
With the underlying graph structure given, confirmatory factor analysis focuses on the statistical inference part. Suppose the random vector X_{[p]} satisfies the linear SEM with Gaussian errors and the matrix of path coefficients M is identifiable; then M can generally be estimated by maximum likelihood or by the generalized method of moments (Browne, 1984). Linear SEMs are also useful for identifying causal effects between unobserved variables using the "three indicator rule" (O'Brien, 1994). With a pre-specified DAG and assumptions on the latent variables, the path coefficients between the latent variables are identifiable (Kuroki and Pearl, 2014).
With the structure given by a DAG G, the associated joint probability distribution can be expressed compactly through conditional distributions. For instance, without knowing the relationships among the variables, a naive description of a probability distribution over p binary random variables requires O(2^p) parameters. Given a DAG structure in which the maximum number of incoming edges is m, the number of parameters needed to describe the entire distribution can be reduced to O(p · 2^m) (Russell and Norvig, 2002). The compact expression also induces certain independence assumptions in the model.
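For concreteness, a worked instance of these counts (using 2^p − 1 free parameters for an unrestricted joint distribution of p binary variables and at most 2^m parameters per node under the DAG): with p = 20 and m = 3,

$$2^{p} - 1 = 2^{20} - 1 = 1{,}048{,}575 \qquad \text{versus} \qquad p \cdot 2^{m} = 20 \cdot 2^{3} = 160.$$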
Several criteria have been developed for reading off conditional independencies from DAGs. The global Markov property (Pearl, 1986; Lauritzen et al., 1990), together with the Hammersley-Clifford theorem (Cowell et al., 1999), provides a tool for reading off conditional independencies in undirected graphs. For DAGs, the Hammersley-Clifford theorem cannot be applied directly to recover independencies among variables, since the densities in general do not factorize with respect to cliques (subsets of nodes in which every two nodes are directly connected in the undirected graph). The undirected moral graph G^m of a DAG G, obtained by adding edges between parents sharing a common child and dropping edge directions, suggests an alternative for checking factorization with respect to a DAG (Cowell et al., 1999). The theorem states that if a probability distribution factorizes with respect to a DAG G, it also factorizes with respect to G^m. The Hammersley-Clifford theorem can then be used to check whether the distribution satisfies the global Markov property with respect to G^m so as to obtain conditional independencies in the original DAG.
The d-separation rule is another criterion for obtaining conditional independencies from DAGs. The criterion emphasizes the role of colliders in introducing dependencies. A node is said to be a collider if it is a common child of two or more other nodes. For instance, in Figure 3, the variable X3 is a collider, since it is directly influenced by the variables X1 and X2. Conditioning in general removes dependencies between variables. As can be seen from this example, the variables X1 and X2 are originally independent, i.e., X1 ⊥ X2. However, conditioning on the collider X3, or on a descendant of the collider, introduces dependence between them, i.e., X1 ⊥̸ X2 | X3 and X1 ⊥̸ X2 | X4.
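This collider effect is easy to reproduce in simulation (a sketch; the linear structural equations below are a hypothetical instantiation of the DAG in Figure 3, with X4 a descendant of the collider X3):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                   # X1 and X2 are generated independently
x3 = x1 + x2 + 0.5 * rng.normal(size=n)   # X3 is a collider (common child of X1, X2)
x4 = x3 + 0.5 * rng.normal(size=n)        # X4 is a descendant of the collider

def partial_corr(a, b, c):
    """Correlation of a and b after linearly removing c (a proxy for conditioning)."""
    ra = a - np.polyfit(c, a, 1)[0] * c
    rb = b - np.polyfit(c, b, 1)[0] * c
    return np.corrcoef(ra, rb)[0, 1]

print(np.corrcoef(x1, x2)[0, 1])   # close to 0: marginally independent
print(partial_corr(x1, x2, x3))    # clearly negative: conditioning on the collider
print(partial_corr(x1, x2, x4))    # also negative: conditioning on its descendant
```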
In this sense, the d-separation rule states the following. In a DAG G = (V, E), a path is blocked by a set C ⊂ V if the path contains a non-collider node c1 ∈ C, or if it contains a collider node c2 ∉ C with des(c2) ∩ C = ∅, where des(c2) denotes the set of descendants of c2 (Pearl and Verma, 1991). For any sets V1, V2, V3 ⊂ V, if every path from a node in V1 to a node in V2 is blocked by V3, then V1 and V2 are d-separated by V3, denoted V1 ⊥ V2 | V3 [G]. The theorem states that, in a DAG G = (V, E) with respect to a probability distribution of X_{[p]}, for any disjoint sets V1, V2, V3 ⊂ V, d-separation of V1 and V2 by V3 implies conditional independence, i.e., X_{V1} ⊥ X_{V2} | X_{V3}, if and only if the probability distribution factorizes with respect to the DAG. In other words, a probability distribution that factorizes with respect to the DAG satisfies the global Markov property given by the d-separation criterion. Compared with constructing intermediate moral graphs, the d-separation criterion does not require the probability distribution to have a positive density and can be applied directly to the original DAG to read off independencies.
3.3 Non-parametric Structural Equation Models (NPSEMs)
Linear SEMs impose a strong linearity assumption. Pearl and Verma (1991) adapted structural equation models and relaxed the linearity assumption by introducing non-parametric structural equation models (NPSEMs). In NPSEMs, the arrows in the DAG represent functional relationships rather than linear ones. Formally, random variables X_{[p]} = (X_1, ..., X_p) with respect to a DAG G = (V, E) can be expressed as deterministic functions

$$X_i = f_i(X_{pa(i)}, \epsilon_i), \quad i = 1, \ldots, p. \qquad (8)$$

When the noise terms ε_i are further assumed to be mutually independent, the model is referred to as a non-parametric structural equation model with independent errors (NPSEM-IE). In the NPSEM framework, the sampling distribution of the random variables can be expressed by both Equations (7) and (8), while the latter describes the whole data-generating mechanism and provides an essential tool for reasoning about causal effects through the concept of do-calculus (Pearl, 1995, 1998, 2000).
The functional mechanism X_i = f_i(X_{pa(i)}, ε_i) provides a general method for querying causal effects when there are exogenous interventions on several variables. One of the most famous tools for answering such causal queries is the "do-calculus" advocated by Pearl (1995). The idea is as follows. The functional characterization in Equation (8) defines how the values of the X_i would change if external interventions were made on some of the variables in the system. Specifically, for any two subsets of nodes X_A, X_B ⊆ X_{[p]} with X_A ∩ X_B = ∅, when intervening to set X_A to specified values x_A, the structural equations are modified by replacing the corresponding assignments with X_A = x_A, and the causal effect on X_B can be computed from the modified system as E[X_B | do(X_A = x_A)].
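A small simulation illustrates the replacement of structural equations under an intervention, and how it differs from conditioning (a sketch; the three-variable SEM and its coefficients are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n, do_x2=None):
    """Draw from the SEM X1 -> X2 -> X3 with X1 -> X3; optionally intervene on X2."""
    x1 = rng.normal(size=n)
    # do(X2 = c) replaces the structural equation for X2 with the constant c.
    x2 = 2.0 * x1 + rng.normal(size=n) if do_x2 is None else np.full(n, do_x2)
    x3 = x1 + x2 + rng.normal(size=n)
    return x1, x2, x3

# E[X3 | do(X2 = 1)] is about 1.0, since E[X1] = 0 and X3 = X1 + X2 + noise.
_, _, x3_do = sample(200_000, do_x2=1.0)
print(x3_do.mean())

# Conditioning instead gives E[X3 | X2 ~ 1] of about 1.4, because observing X2
# carries information about X1 through the unblocked path X2 <- X1 -> X3.
x1, x2, x3 = sample(200_000)
print(x3[np.abs(x2 - 1.0) < 0.1].mean())
```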
Counterfactuals from a DAG can also be modeled using NPSEMs. For an intervention setting X_A = x_A, where A ⊆ [p], the counterfactual variables {X_i(X_A = x_A) : i ∈ [p]}, abbreviated {X_i(x_A) : i ∈ [p]}, can be defined by recursive substitution (Malinsky et al., 2019). Specifically, for an intervention on the parental nodes of X_i, i.e., setting X_{pa(i)} = x_{pa(i)}, the counterfactual of X_i is defined as

$$X_i(X_{pa(i)} = x_{pa(i)}) = f_i(x_{pa(i)}, \epsilon_i). \qquad (9)$$

On the other hand, if the intervention is made on an ancestor set A with A ≠ pa(i), the counterfactual is defined as

$$X_i(X_A = x_A) = X_i\big(X_{pa(i) \cap A} = x_{pa(i) \cap A},\; X_{pa(i) \setminus A} = X_{pa(i) \setminus A}(x_A)\big). \qquad (10)$$

Defined in this way, the consistency property follows (Malinsky et al., 2019): for any disjoint sets A, B ⊆ V and i ∈ V \ (A ∪ B),

$$X_B(x_A) = x_B \ \text{ implies } \ X_i(x_A, x_B) = X_i(x_A). \qquad (11)$$

The intuition behind the consistency property is that if the system is intervened on by setting variables to the values they would have taken anyway, then this additional intervention makes no difference to the system.
The multiple-world independence assumption states that the variables {X_i(x_{pa(i)}) : x_{pa(i)}} are mutually independent for all i ∈ [p], while the single-world independence assumption states that the variables {X_i(x_{pa(i)})} are mutually independent for all i ∈ [p] (Richardson and Robins, 2013). Intuitively, multiple-world independence makes hypotheses about potential outcomes that can never occur simultaneously, while single-world independence assumes that all of the potential outcomes are independent of each other when the specified variables are intervened on in the graph. For instance, consider the graph structure in Figure 4 with binary variable X1. Multiple-world independence makes assumptions about counterfactuals of X1 that can never occur simultaneously, e.g., X2(X1 = 1) ⊥ X3(X1 = 0), while single-world independence only requires X2(X1 = 0) ⊥ X3(X1 = 0). The NPSEM-IE framework induces the multiple-world independence assumption by assuming independence of the ε_i. In general, the multiple-world independence assumption is untestable, since it concerns counterfactuals that can never be realized at the same time.
Figure 4: An Illustration for Multiple-world and Single-world Independence Assumptions
Consider a specific intervention (X1 = x1, X3 = x3) made on the system. In the first step of constructing the SWIG, each node in the intervention set {X1, X3} is split into a random part and a fixed part, with the random component inheriting all the incoming edges and the fixed component inheriting all the outgoing edges. In the second step, all the random vertices in the resulting graph are relabeled. Figure 6 shows the SWIG for the graphical model in Figure 5, denoted G[X(X1 = x1, X3 = x3)] or G[X(x1, x3)]. As can be seen from the SWIG, the variables X2(x1), X3(x1), and X4(x1, x3) are now counterfactual variables.
Figure 6: The SWIG of the Graph Structure in Figure 5 for intervention (X1 = x1 , X3 = x3 )
A SWIG consists of both random and fixed components. By conditioning on the fixed parts, one can construct the subgraph G*[X(x_A)] that inherits only the random parts of the SWIG G[X(x_A)], and then apply the d-separation criterion (Pearl, 1988) to read off conditional independencies among the counterfactual variables. Alternatively, in the case of no unmeasured variables, the g-computation formula (Robins, 1986; Pearl, 2009) can also be used to derive identification formulas for counterfactuals. Specifically, for continuous variables X_{[p]} satisfying a single-world causal model with respect to a DAG G = (V, E) and any disjoint sets A, B ⊆ V,

$$P(X_B(x_A) = \tilde{x}_B) = \int \prod_{i \in V \setminus A} P\big(X_i = \tilde{x}_i \mid X_{pa(i) \cap A} = x_{pa(i) \cap A},\; X_{pa(i) \setminus A} = \tilde{x}_{pa(i) \setminus A}\big)\, d\tilde{x}_C, \qquad (12)$$

where C = V \ (A ∪ B). The identification formula for the discrete case is obtained analogously by replacing the integral with a summation. If the graphical structure contains hidden variables, the front-door and back-door criteria advocated by Pearl (1995), as well as the potential outcomes calculus proposed by Malinsky et al. (2019), are commonly used to derive identification formulas.
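As a worked illustration of Equation (12), consider a hypothetical three-node DAG Z → X → Y with Z → Y and the intervention A = {X}, B = {Y} (so C = {Z}); the formula then reduces to the familiar back-door adjustment

$$P(Y(x) = \tilde{y}) = \sum_{\tilde{z}} P(Y = \tilde{y} \mid X = x, Z = \tilde{z})\, P(Z = \tilde{z}).$$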
4 Conclusion
The theoretical development of causal inference has been driven by its diverse applications to observational data for understanding the impact of treatments on outcome variables. The potential outcomes (counterfactuals) framework is convenient for statistical inference and for making causal arguments. For observational analyses, this framework often requires non-verifiable assumptions such as unconfoundedness. Causal graphical models represent causal effects among variables using directed edges, which is powerful for visualizing whole causal networks and learning causal pathways. With the graphical structure given, a series of theories has been developed for reading off conditional independencies among variables from graphs. Structure discovery, on the other hand, provides a way to recover the Markov equivalence class from the independencies in observational data; it is useful for prioritizing experiments or when intervening on all the variables is impractical. In recent years, research on unifying these methods has become popular. More research bridging the perspectives of graphical models and the counterfactual framework would be valuable.
References
Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996). Identification of causal effects using
instrumental variables. Journal of the American statistical Association, 91(434):444–455.
Arjovsky, M., Bottou, L., Gulrajani, I., and Lopez-Paz, D. (2019). Invariant risk minimization.
arXiv preprint arXiv:1907.02893.
Aronow, P. M., Green, D. P., Lee, D. K., et al. (2014). Sharp bounds on the variance in
randomized experiments. Annals of Statistics, 42(3):850–871.
Athey, S., Imbens, G. W., and Wager, S. (2016). Approximate residual balancing: De-biased
inference of average treatment effects in high dimensions. arXiv preprint arXiv:1604.07125.
Browne, M. W. (1984). Asymptotically distribution-free methods for the analysis of covariance
structures. British journal of mathematical and statistical psychology, 37(1):62–83.
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68.
Chernozhukov, V., Escanciano, J. C., Ichimura, H., Newey, W. K., and Robins, J. M. (2016).
Locally robust semiparametric estimation. arXiv preprint arXiv:1608.00033.
Chib, S. and Jacobi, L. (2016). Bayesian fuzzy regression discontinuity analysis and returns to
compulsory schooling. Journal of Applied Econometrics, 31(6):1026–1047.
Chickering, D. M. (2002). Optimal structure identification with greedy search. Journal of
machine learning research, 3(Nov):507–554.
Cowell, R. G., Dawid, A. P., Lauritzen, S. L., and Spiegelhalter, D. J. (1999). Building and
using probabilistic networks. Probabilistic Networks and Expert Systems, pages 25–41.
Crump, R. K., Hotz, V. J., Imbens, G. W., and Mitnik, O. A. (2009). Dealing with limited
overlap in estimation of average treatment effects. Biometrika, 96(1):187–199.
Ding, P. and Dasgupta, T. (2016). A potential tale of two-by-two tables from completely
randomized experiments. Journal of the American Statistical Association, 111(513):157–
168.
Ding, P., Li, F., et al. (2018). Causal inference: A missing data perspective. Statistical Science,
33(2):214–237.
Dorie, V., Hill, J., Shalit, U., Scott, M., Cervone, D., et al. (2019). Automated versus do-
it-yourself methods for causal inference: Lessons learned from a data analysis competition.
Statistical Science, 34(1):43–68.
Elliott, M. R., Raghunathan, T. E., and Li, Y. (2010). Bayesian inference for causal mediation
effects using principal stratification with dichotomous mediators and outcomes. Biostatistics,
11(2):353–372.
Fan, J., Imai, K., Liu, H., Ning, Y., and Yang, X. (2016). Improving covariate balancing
propensity score: A doubly robust and efficient approach. Technical report, Technical report,
Princeton University.
Farrell, M. H. (2015). Robust inference on average treatment effects with possibly more co-
variates than observations. Journal of Econometrics, 189(1):1–23.
Frangakis, C. E. and Rubin, D. B. (2002). Principal stratification in causal inference. Biomet-
rics, 58(1):21–29.
Freedman, D. A. (2008). On regression adjustments to experimental data. Advances in Applied
Mathematics, 40(2):180–193.
Funk, M. J., Westreich, D., Wiesen, C., Stürmer, T., Brookhart, M. A., and Davidian, M.
(2011). Doubly robust estimation of causal effects. American journal of epidemiology,
173(7):761–767.
Glynn, A. N. and Quinn, K. M. (2010). An introduction to the augmented inverse propensity
weighted estimator. Political analysis, pages 36–56.
Gustafson, P. (2015). Bayesian inference for partially identified models: Exploring the limits
of limited data, volume 140. CRC Press.
Haavelmo, T. (1943). The statistical implications of a system of simultaneous equations.
Econometrica, Journal of the Econometric Society, pages 1–12.
Hahn, J. (1998). On the role of the propensity score in efficient semiparametric estimation of
average treatment effects. Econometrica, pages 315–331.
Hahn, J., Todd, P., and Van der Klaauw, W. (2001). Identification and estimation of treatment
effects with a regression-discontinuity design. Econometrica, 69(1):201–209.
Hainmueller, J. (2012). Entropy balancing for causal effects: A multivariate reweighting
method to produce balanced samples in observational studies. Political analysis, pages
25–46.
Hirano, K., Imbens, G. W., and Ridder, G. (2003). Efficient estimation of average treatment
effects using the estimated propensity score. Econometrica, 71(4):1161–1189.
Hirano, K., Imbens, G. W., Rubin, D. B., and Zhou, X.-H. (2000). Assessing the effect of an
influenza vaccine in an encouragement design. Biostatistics, 1(1):69–88.
Holland, P. W. (1986). Statistics and causal inference. Journal of the American statistical
Association, 81(396):945–960.
Horvitz, D. G. and Thompson, D. J. (1952). A generalization of sampling without replacement
from a finite universe. Journal of the American statistical Association, 47(260):663–685.
Imai, K. and Ratkovic, M. (2014). Covariate balancing propensity score. Journal of the Royal
Statistical Society: Series B: Statistical Methodology, pages 243–263.
Imbens, G. (2014). Instrumental variables: an econometrician’s perspective. Technical report,
National Bureau of Economic Research.
Imbens, G. W. and Angrist, J. D. (1994). Identification and estimation of local average treat-
ment effects. Econometrica: Journal of the Econometric Society, pages 467–475.
Imbens, G. W. and Rubin, D. B. (1997a). Bayesian inference for causal effects in randomized
experiments with noncompliance. The annals of statistics, pages 305–327.
Imbens, G. W. and Rubin, D. B. (1997b). Estimating outcome distributions for compliers in
instrumental variables models. The Review of Economic Studies, 64(4):555–574.
Imbens, G. W. and Rubin, D. B. (2015). Causal inference in statistics, social, and biomedical
sciences. Cambridge University Press.
Kang, J. D., Schafer, J. L., et al. (2007). Demystifying double robustness: A comparison of
alternative strategies for estimating a population mean from incomplete data. Statistical
science, 22(4):523–539.
Keele, L. and Small, D. S. (2021). Comparing covariate prioritization via matching to ma-
chine learning methods for causal inference using five empirical applications. The American
Statistician, pages 1–9.
Kuroki, M. and Pearl, J. (2014). Measurement bias and effect restoration in causal inference.
Biometrika, 101(2):423–437.
Lauritzen, S. L., Dawid, A. P., Larsen, B. N., and Leimer, H.-G. (1990). Independence prop-
erties of directed markov fields. Networks, 20(5):491–505.
Lee, Y. J., Ellenberg, J. H., Hirtz, D. G., and Nelson, K. B. (1991). Analysis of clinical trials by
treatment actually received: is it really an option? Statistics in medicine, 10(10):1595–1605.
Li, F., Morgan, K. L., and Zaslavsky, A. M. (2018). Balancing covariates via propensity score
weighting. Journal of the American Statistical Association, 113(521):390–400.
Li, X. and Ding, P. (2020). Rerandomization and regression adjustment. Journal of the Royal
Statistical Society: Series B (Statistical Methodology), 82(1):241–268.
Lin, W. et al. (2013). Agnostic notes on regression adjustments to experimental data: Reex-
amining freedman’s critique. Annals of Applied Statistics, 7(1):295–318.
Lindley, D. V. (1972). Bayesian statistics: A review. SIAM.
Lunceford, J. K. and Davidian, M. (2004). Stratification and weighting via the propensity
score in estimation of causal treatment effects: a comparative study. Statistics in medicine,
23(19):2937–2960.
Malinsky, D., Shpitser, I., and Richardson, T. (2019). A potential outcomes calculus for iden-
tifying conditional path-specific effects. In The 22nd International Conference on Artificial
Intelligence and Statistics, pages 3080–3088. PMLR.
McNamee, R. (2009). Intention to treat, per protocol, as treated and instrumental variable es-
timators given non-compliance and effect heterogeneity. Statistics in medicine, 28(21):2639–
2652.
Pearl, J. (2000). Causality: Models, Reasoning and Inference. Cambridge University Press, Cambridge.
Pearl, J. (2009). Causality. Cambridge university press.
Pearl, J. and Mackenzie, D. (2018). The book of why: the new science of cause and effect.
Basic books.
Pearl, J. and Verma, T. (1991). A theory of inferred causation. In Allen, J., Fikes, R., and Sandewall, E., editors, KR-91: Principles of Knowledge Representation and Reasoning: Proceedings of the Second International Conference.
Richardson, T. S., Evans, R. J., and Robins, J. M. (2011). Transparent parameterizations of
models for potential outcomes. Bayesian Statistics, 9:569–610.
Richardson, T. S. and Robins, J. M. (2013). Single world intervention graphs (swigs): A unifi-
cation of the counterfactual and graphical approaches to causality. Center for the Statistics
and the Social Sciences, University of Washington Series. Working Paper, 128(30):2013.
Robins, J. (1986). A new approach to causal inference in mortality studies with a sustained
exposure period-application to control of the healthy worker survivor effect. Mathematical
modelling, 7(9-12):1393–1512.
Robins, J. M., Rotnitzky, A., and Zhao, L. P. (1994). Estimation of regression coefficients when
some regressors are not always observed. Journal of the American statistical Association,
89(427):846–866.
Rojas-Carulla, M., Schölkopf, B., Turner, R., and Peters, J. (2018). Invariant models for causal
transfer learning. The Journal of Machine Learning Research, 19(1):1309–1342.
Rosenbaum, P. R. (1987). Model-based direct adjustment. Journal of the American Statistical
Association, 82(398):387–394.
Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in
observational studies for causal effects. Biometrika, 70(1):41–55.
Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized
studies. Journal of educational Psychology, 66(5):688.
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3):581–592.
Rubin, D. B. (1977). Assignment to treatment group on the basis of a covariate. Journal of
educational Statistics, 2(1):1–26.
Rubin, D. B. (1978). Bayesian inference for causal effects: The role of randomization. The
Annals of statistics, pages 34–58.
Rubin, D. B. (1979). Using multivariate matched sampling and regression adjustment to control
bias in observational studies. Journal of the American Statistical Association, 74(366a):318–
328.
Rubin, D. B. (1980). Randomization analysis of experimental data: The fisher randomization
test comment. Journal of the American Statistical Association, 75(371):591–593.
Rubin, D. B. (1990). Comment: Neyman (1923) and causal inference in experiments and
observational studies. Statistical Science, 5(4):472–480.
Rubin, D. B. et al. (2006). Causal inference through potential outcomes and principal stratifica-
tion: application to studies with ”censoring” due to death. Statistical Science, 21(3):299–309.
Rubin, D. B. et al. (2008). For objective causal inference, design trumps analysis. Annals of
Applied Statistics, 2(3):808–840.
Rubin, D. B., Wang, X., Yin, L., and Zell, E. (2010). Bayesian causal inference: Approaches to estimating the effect of treating hospital type on cancer survival in Sweden using principal stratification.
Tsiatis, A. (2007). Semiparametric theory and missing data. Springer Science & Business
Media.
Tsiatis, A. A., Davidian, M., Zhang, M., and Lu, X. (2008). Covariate adjustment for two-
sample treatment comparisons in randomized clinical trials: a principled yet flexible ap-
proach. Statistics in medicine, 27(23):4658–4677.
Waernbaum, I. (2010). Propensity score model specification for estimation of average treatment
effects. Journal of Statistical Planning and Inference, 140(7):1948–1956.
Westreich, D., Lessler, J., and Funk, M. J. (2010). Propensity score estimation: neural net-
works, support vector machines, decision trees (cart), and meta-classifiers as alternatives to
logistic regression. Journal of clinical epidemiology, 63(8):826–833.
Ye, T., Shao, J., and Zhao, Q. (2020). Principles for covariate adjustment in analyzing ran-
domized clinical trials. arXiv preprint arXiv:2009.11828.
Zelen, M. (1979). A new design for randomized clinical trials. New England Journal of Medicine, 300:1242–1245.
Zhang, G. and Little, R. (2009). Extensions of the penalized spline of propensity prediction
method of imputation. Biometrics, 65(3):911–918.
Zhao, Q. et al. (2019). Covariate balancing propensity score by tailored loss functions. Annals
of Statistics, 47(2):965–993.
Zubizarreta, J. R. (2015). Stable weights that balance covariates for estimation with incomplete
outcome data. Journal of the American Statistical Association, 110(511):910–922.