A Survey of Causal Inference Framework
September 2022
Abstract
Causal inference is a science with multi-disciplinary evolution and applications. On the one hand, it measures the effects of treatments in observational data based on experimental designs and rigorous statistical inference to draw causal statements. One of the most influential frameworks for quantifying causal effects is the potential outcomes framework. On the other hand, causal graphical models utilize directed edges to represent causal relationships and encode conditional independence relations among the variables in a graph. A body of research has addressed both reading off conditional independencies from graphs and reconstructing causal structures. In recent years, state-of-the-art research in causal inference has started unifying the different causal inference frameworks. This survey aims to provide a review of past work on causal inference, focusing mainly on the potential outcomes framework and causal graphical models. We hope that this survey will help accelerate the understanding of causal inference in different domains.
1 Introduction
Causal reasoning is widely used in everyday language interchangeably with correlation or association. Based on observations, one might draw informal conclusions like "Aspirin helps reduce my pain" or "She got a good grade because she is smart". In many cases, "correlation does not imply causation"; however, an action (or manipulation) can cause an effect. Therefore, many causal statements are implausible if no explicit intervention is considered. Traditional statistical analysis, such as estimation and hypothesis testing, typically emphasizes associations among variables and bypasses the language of causality. Nevertheless, the questions that scientific researchers try to answer are more often about causality than correlation. For instance, will smoking cause lung cancer? Will a decrease in housing demand cause housing prices to drop? Can a certain drug prolong survival for cancer patients? Causal inference goes beyond association and identifies the causal effect when a cause of the outcome variable is intervened on, which makes it a critical research area across many fields, such as computer science, economics, epidemiology, and social science.
In scientific research, the controlled randomized experiment, which uses a random assignment mechanism to assign subjects to different treatment groups, has a long history as the gold standard for establishing causality. In many situations, however, randomized experiments are neither feasible nor ethical in practice, so researchers need to rely on observational data to infer causal relationships. To this end, several research communities have developed various frameworks for causal identification based on observational data.
Historically, there are three major origins of causal inference, developed separately for different purposes: the potential outcomes (counterfactuals) community, the graphs community, and the structural equations community. The potential outcomes framework provides a way to identify causal effects using statistical inference. This framework was first introduced by Neyman (1923) for randomized experiments and generalized by Rubin (1974) to observational studies.
Later, Holland (1986) published an influential paper and named the general potential outcomes approach the Rubin Causal Model. The potential outcomes approach associates causality with manipulations applied to units, and compares the causal effects of different treatments via their corresponding potential outcomes. Potential outcomes are convenient for statistical inference on a single cause-effect pair, yet they have drawbacks when the system becomes complicated. On the other hand, graph analysis and structural equations in causal inference trace back to the work on path analysis introduced by the geneticist Wright (1918). Wright's path analysis combines graphs with linear structural equation models (SEMs) to represent causal relationships by directed edges, which is a useful tool for differentiating correlation from causation when the graph structure is given. Pearl (1988) later relaxed the linearity assumption and formalized causal graphical models for representing conditional independence relations among random variables using directed acyclic graphs (DAGs). A variety of criteria have been developed for reading off independencies from a given graph, including the most famous result, the completeness of d-separation. Nowadays, researchers have been making efforts to unify the theories of causality at the intersection of these communities (Richardson and Robins, 2013), and to use causal inference to make estimation more reliable (Rojas-Carulla et al., 2018; Arjovsky et al., 2019).
Both estimands can be viewed as a function of {Y(0), Y(1), X, Z}, where X is the observed covariate matrix. The distinction between the two, however, lies in whether the potential outcomes are assumed to be stochastic. For the PATE, the N subjects, each associated with a quadruple {Y_i(0), Y_i(1), Z_i, X_i}, are viewed as a random sample drawn from a super-population, so that the potential outcomes are random variables with a distribution. When the FATE is the target estimand, the potential outcomes are treated as a fixed vector and inference is conditional on this vector; the randomness comes from the treatment assignments Z_i. With random sampling from the target population, the SATE is in expectation equal to the PATE. The SATE is typically of interest in randomized experiments, whereas observational studies often use the PATE as the target causal estimand. Intuitively, suppose an investigator randomly samples a hundred subjects from the U.S. to test the efficacy of a vaccine. Under the finite-population view, the question causal inference tries to answer is whether this vaccine is effective for this sample of subjects; under the super-population view, the question shifts to whether the efficacy of the vaccine generalizes to the whole U.S. population.
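For reference, the usual definitions of these estimands (a sketch in the standard notation; the finite-sample estimand is taken here as the average over the N sampled units) are

$$\tau_{PATE} = \mathbb{E}\left[Y_i(1) - Y_i(0)\right], \qquad \tau_{SATE} = \frac{1}{N}\sum_{i=1}^{N}\left[Y_i(1) - Y_i(0)\right].$$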
Under the potential outcomes model with a binary treatment, the values (Y_i^obs, Z_i, X_i) ∈ R × {0, 1} × 𝒳 of N independent and identically distributed units are observed, where Z_i and X_i denote the treatment assignment and a vector of covariates for each unit. Under the Stable Unit Treatment Value Assumption (SUTVA), which rules out different forms of the treatment level and interference among sample subjects, the observed outcome is determined by the treatment assignment as Y_i^obs = Y_i(Z_i), or equivalently, Y_i^obs = Z_i Y_i(1) + (1 − Z_i) Y_i(0) (Rubin, 1980). Similarly, the unobserved potential outcome can be denoted Y_i^miss = Y_i(1 − Z_i).
In addition to the observed outcomes, Rubin (1990) argues that the assignment mechanism, defined as the process of determining which subject receives which treatment level, is also an essential piece of information for inferring reliable causal estimands. In the potential outcomes framework, causal estimands are disentangled from the probabilistic models of the assignment mechanism, which is the primary feature that distinguishes the potential outcomes model from other frameworks (Imbens and Rubin, 2015; Frangakis and Rubin, 2002). A classical randomized controlled trial is a randomized experiment with a controlled and random assignment mechanism, which guarantees unconfoundedness in the sense that {Y_i(0), Y_i(1)} ⊥ Z_i | X_i (Rubin, 1978).
In a randomized experiment, randomization balances both observed and unobserved baseline characteristics (Rubin et al., 2008), so that a simple difference-in-means estimator, also known as Neyman's inference, formulated as

$$\hat{\tau}_{DIM} = \frac{1}{N_1}\sum_{i: Z_i = 1} Y_i - \frac{1}{N_0}\sum_{i: Z_i = 0} Y_i,$$

where $N_1 = \sum_{i=1}^{N} Z_i$ and $N_0 = \sum_{i=1}^{N} (1 - Z_i)$, consistently and unbiasedly estimates the ATE and establishes a credible causal link at the population level (Imbens and Rubin, 2015).
Research on randomized experiments has attracted much attention. Fisher (1992) considered the randomization test for the sharp null hypothesis that Y_i(0) − Y_i(1) = β for all i, which in practice is often approximated by Monte Carlo simulation. Aronow et al. (2014) constructed a sharper bound on the variance of the difference-in-means estimator. In modern causal inference, theory indicates that appropriately incorporating pre-treatment covariates can increase precision in the analysis of randomized experiments (Li and Ding, 2020; Ye et al., 2020), and many researchers have further investigated ordinary least squares regression-adjusted estimators that do not depend on linear-model assumptions to improve asymptotic precision, including the discussions by Freedman (2008), Tsiatis et al. (2008), Lin et al. (2013), and Ye et al. (2020). Inference about causal effects from randomized experiments has been viewed as the gold standard; however, randomized experiments can be infeasible or unethical in practice. Therefore, studying observational data for causality serves as an alternative when randomized experiments are impossible.
The biggest challenge in observational studies is that the treatment assignment mechanism is unknown and the subjects assigned to different groups might differ systematically in some unobserved characteristics. Thus, additional assumptions are necessary to identify causality. To obtain credible causal effects from an observational study, one needs to consider how well it emulates a randomization-like scenario. The majority of causal inference for observational studies is based on the strong ignorability assumption on the assignment mechanism, which requires unconfoundedness and positivity to be free from latent bias (Rosenbaum and Rubin, 1983). The unconfoundedness assumption states that, given a set of controlled covariates X_i, the assignment mechanism is independent of the potential outcomes. When one is willing to assume unconfoundedness, an observational study can be viewed as a randomized controlled trial in which the treatment assignment is random within each level of x. In addition to unconfoundedness, the positivity assumption further requires the propensity score, defined as e(x) = P(Z_i = 1 | X_i = x), to be strictly between 0 and 1 for all x in the support of X_i, so that each unit has a positive probability of being assigned to either the treatment or the control group. Without the positivity assumption, the probability of some subpopulation being assigned to one of the groups might be zero, and inference on treatment effects for that subpopulation would rely on extrapolation (Imbens and Rubin, 2015).
By assuming no unmeasured confounders, one popular way to derive an estimator for the ATE is via regression models. Defining µ_z(x) = E[Y_i(z) | X_i = x], z = 0, 1, as the outcome regressions among the treated and the controls respectively, and µ̂_z(X_i) as the corresponding fitted models, an alternative representation of the PATE can be written in terms of these outcome regressions.
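In the standard notation these take the following forms (a sketch; the regression-imputation and inverse-probability-weighting estimators below are the usual ones, with ê(x) an estimate of the propensity score defined above):

$$\tau_{PATE} = \mathbb{E}\left[\mu_1(X_i) - \mu_0(X_i)\right], \qquad \hat{\tau}_{reg} = \frac{1}{N}\sum_{i=1}^{N}\left[\hat{\mu}_1(X_i) - \hat{\mu}_0(X_i)\right],$$

$$\hat{\tau}_{IPW} = \frac{1}{N}\sum_{i=1}^{N}\left[\frac{Z_i Y_i}{\hat{e}(X_i)} - \frac{(1 - Z_i)\, Y_i}{1 - \hat{e}(X_i)}\right].$$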
The IPW estimator τ̂_IPW coincides with the regression estimator τ̂_reg when X is discrete, but when X is continuous they generally differ. Inspecting τ̂_IPW, the estimator weights the observed outcomes by the inverse of the probability of receiving treatment or control. Intuitively, a subject represents a larger portion of the population if he or she had a lower chance of receiving the observed treatment. In practice, it is common to normalize the weights to reduce the variance of the weighting estimator, leading to a more stable estimate (Hirano et al., 2003). From the perspective of semiparametric inference, τ̂_IPW admits an asymptotically linear expansion and attains the semiparametric efficiency bound, so that no regular estimator can improve on its asymptotic performance; in other words, τ̂_IPW is an asymptotically optimal estimator of the ATE (Hirano et al., 2003; Tsiatis, 2007). Even though the IPW estimator has good theoretical foundations, its unbiasedness and consistency rely heavily on the propensity score model being correctly specified (Funk et al., 2011). Some researchers have proposed using machine learning models to estimate propensity scores, and Dorie et al. (2019) showed improvements in simulations compared with traditional logistic models; however, Keele and Small (2021) pointed out that black-box machine learning methods make little difference in real data. Additionally, like other propensity-score-based methods, τ̂_IPW does not perform well when the estimated propensity scores are close to 0 or 1. To address the instability of the IPW estimator caused by extreme weights, Crump et al. (2009) suggested further adjustments, such as trimming, to exclude subjects outside the common support region.
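As a minimal illustration of the weighting logic just described (a sketch, not taken from the original text: the function name, the use of scikit-learn's LogisticRegression for the propensity model, and the clipping threshold are illustrative choices):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_ate(X, z, y, clip=0.05):
    """Normalized (Hajek-style) IPW estimate of the ATE from arrays of
    covariates X, binary treatment z, and outcomes y."""
    X, z, y = np.asarray(X), np.asarray(z), np.asarray(y)
    # Estimate propensity scores; any probabilistic classifier could be used here.
    e_hat = LogisticRegression(max_iter=1000).fit(X, z).predict_proba(X)[:, 1]
    # Clip extreme scores as a simple stabilization device; Crump et al. (2009)
    # instead suggest excluding units outside the common support region.
    e_hat = np.clip(e_hat, clip, 1 - clip)
    w1, w0 = z / e_hat, (1 - z) / (1 - e_hat)
    # Normalizing the weights within each arm reduces variance (Hirano et al., 2003).
    return np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)
```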
The doubly robust (DR) estimator, also called the augmented IPW (AIPW) estimator, was first introduced by Robins et al. (1994) from the missing-data perspective; it combines the regression-based approach and the weighting method to provide double protection against misspecification (Lunceford and Davidian, 2004).
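In its standard form (a sketch using the fitted outcome regressions µ̂_z and propensity score ê defined above), the estimator is

$$\hat{\tau}_{DR} = \frac{1}{N}\sum_{i=1}^{N}\left[\hat{\mu}_1(X_i) - \hat{\mu}_0(X_i) + \frac{Z_i\,(Y_i - \hat{\mu}_1(X_i))}{\hat{e}(X_i)} - \frac{(1 - Z_i)\,(Y_i - \hat{\mu}_0(X_i))}{1 - \hat{e}(X_i)}\right].$$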
The DR estimator can be interpreted as first estimating the PATE using the outcome regressions and then applying IPW to the residuals to adjust for bias. In this approach, one first estimates the nuisance parameters µ_z(x) and e(x), possibly non-parametrically, and then estimates the parametric part, the average treatment effect. As long as at least one of the two postulated models is correctly specified, τ̂_DR remains consistent for the PATE, a property referred to as double robustness (Waernbaum, 2010).
A sequence of papers has studied the properties of the DR estimator. Glynn and Quinn (2010) showed that the DR estimator is more stable than the IPW estimator when the propensity scores are close to 0 or 1. Farrell (2015) examined the behavior of the DR estimator with high-dimensional regression adjustments. Robins et al. (1994) laid the foundation for later research on the semiparametric efficiency of the DR estimator for ATEs, and Hahn (1998) subsequently demonstrated the role of the propensity score in efficient semiparametric estimation of ATEs and ATTs. In recent years, research on incorporating machine learning algorithms to estimate the nuisance components of the DR estimator has become popular. Westreich et al. (2010) proposed machine learning models such as neural networks, support vector machines, and decision trees (CART) as alternatives to logistic regression for estimating propensity scores. Recently, several authors, such as Chernozhukov et al. (2018) and Newey and Robins (2018), have discussed estimating the nuisance parameters in the DR estimator via cross-fitting so as to attain efficiency under certain conditions. Specifically, the idea of cross-fitting is to split the data into K folds; for each fold, the nuisance parameters, namely the outcome regressions and propensity scores, are estimated on the remaining folds, the treatment-effect contribution is evaluated on the held-out fold, and the results are averaged across folds.
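The following sketch illustrates cross-fitted AIPW estimation of this kind (an illustrative implementation, not from the original text: the helper name, the linear/logistic nuisance models, and the clipping constant are assumptions; any flexible learners could be substituted):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import KFold

def cross_fit_aipw(X, z, y, n_splits=5, clip=0.01):
    """Cross-fitted AIPW (doubly robust) estimate of the ATE."""
    X, z, y = np.asarray(X), np.asarray(z), np.asarray(y)
    scores = np.zeros(len(y))
    for train, test in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X):
        # Nuisance models (propensity score and outcome regressions) are fit
        # on the training folds only.
        ps = LogisticRegression(max_iter=1000).fit(X[train], z[train])
        mu1 = LinearRegression().fit(X[train][z[train] == 1], y[train][z[train] == 1])
        mu0 = LinearRegression().fit(X[train][z[train] == 0], y[train][z[train] == 0])
        # The AIPW score is evaluated on the held-out fold.
        e = np.clip(ps.predict_proba(X[test])[:, 1], clip, 1 - clip)
        m1, m0 = mu1.predict(X[test]), mu0.predict(X[test])
        zt, yt = z[test], y[test]
        scores[test] = (m1 - m0
                        + zt * (yt - m1) / e
                        - (1 - zt) * (yt - m0) / (1 - e))
    # Averaging the per-unit scores across folds gives the cross-fitted estimate.
    return scores.mean()
```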
Related work has considered the use of augmented balancing estimators. Moreover, Chernozhukov et al. (2016) generalized balancing conditions to broader settings and developed AIPW-based estimators for general functionals that can be written in terms of a Riesz representer.
2.4 Instrumental Variables Approach
Traditionally, unbiased estimation of causal effects has primarily rested on the assumption of unconfoundedness. However, in certain circumstances there are variables, so-called instrumental variables (IVs), that affect the treatment assignment and thereby indirectly affect the outcome of interest. For instance, the variable Z in Figure 1 can be considered an instrumental variable that has no direct effect on the outcome of interest Y but affects the outcome through the variable W. In such settings the unconfoundedness assumption seems implausible.
The IV approach has long been discussed in the econometric literature, dominated by structural equation models that rely on linear parametric specifications to model constant treatment effects (Wright, 1921; Haavelmo, 1943; Morgan et al., 1990; Stock and Trebbi, 2003). Since the influential work by Imbens and Angrist (1994), Angrist et al. (1996), and Imbens and Rubin (1997b), which embedded instrumental variables into the potential outcomes framework and formulated the assumptions in a more tractable way, there has been growing interest in IV methods in the recent statistics literature. A common IV estimand in the statistics literature is the Local Average Treatment Effect (LATE), introduced by Imbens and Angrist (1994), which is interchangeably referred to as the Complier Average Treatment Effect (CATE) (Imbens and Rubin, 1997b), since inference is based on the average effect within the subpopulation of compliers.
The finite-sample LATE is defined as

$$\tau_{SLATE} = \frac{1}{N_{co}} \sum_{i: G_i = co} \left[Y_i(1) - Y_i(0)\right],$$

where N_co denotes the number of compliers.
In this sense, the causal effect of the treatment received might be of more interest than the causal effect of the treatment assigned. The IV approach provides an alternative that not only allows researchers to relax the assumption of unconfoundedness, but also enables estimation of the effects of the receipt of treatment.
In the basic setup of a randomized experiment with noncompliance, in addition to the binary treatment assignment Z_i and the outcome of interest Y_i^obs of the usual potential outcomes model, the actual receipt of treatment W_i^obs = W_i(Z_i) is also observed. Each unit is associated with two potential outcomes for the treatment received, W_i(0) and W_i(1), so that units can be partitioned into compliers (co), nevertakers (nt), defiers (df), and alwaystakers (at) according to their compliance status G_i, which is determined by the pair of values (W_i(0), W_i(1)) (Angrist et al., 1996). More specifically, the four compliance types are defined as

$$G_i = \begin{cases} \text{nevertaker (nt)} & \text{if } W_i(0) = 0,\ W_i(1) = 0, \\ \text{complier (co)} & \text{if } W_i(0) = 0,\ W_i(1) = 1, \\ \text{defier (df)} & \text{if } W_i(0) = 1,\ W_i(1) = 0, \\ \text{alwaystaker (at)} & \text{if } W_i(0) = 1,\ W_i(1) = 1. \end{cases}$$
Because of the fundamental problem of causal inference, W_i(Z_i) and W_i(1 − Z_i) are not jointly observable, so the compliance type cannot be inferred directly from the observed assignment Z_i and the treatment received W_i^obs without extra assumptions. The key assumptions for identifying the estimand CATE are SUTVA, cross-world counterfactual independence, the exclusion restriction, and monotonicity. Under the exclusion restriction, Y_i(z, w), the double-indexed potential outcome for the primary outcome of interest, can be written as Y_i(w). CATE can then be estimated nonparametrically by a method-of-moments estimator, the ratio of the ITT effects on the primary outcome and on the treatment received, as sketched at the end of this paragraph (Angrist et al., 1996; Imbens and Rubin, 2015). Later, Frangakis and Rubin (2002) referred to the compliance types as principal strata and treated noncompliance as a special case of a post-treatment variable in their principal stratification approach. Apart from noncompliance, the principal stratification method has been widely used in other applications, such as censoring by death (Rubin et al., 2006; Zhang and Little, 2009), fuzzy regression discontinuity (FRD) designs (Hahn et al., 2001; Chib and Jacobi, 2016), and mediation analysis (Elliott et al., 2010).
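The moment-based (Wald-type) estimator referred to above can be written, in the standard notation, as the ratio of the ITT effects on the outcome and on the receipt of treatment:

$$\hat{\tau}_{CATE} = \frac{\widehat{ITT}_Y}{\widehat{ITT}_W} = \frac{\frac{1}{N_1}\sum_{i: Z_i = 1} Y_i^{obs} - \frac{1}{N_0}\sum_{i: Z_i = 0} Y_i^{obs}}{\frac{1}{N_1}\sum_{i: Z_i = 1} W_i^{obs} - \frac{1}{N_0}\sum_{i: Z_i = 0} W_i^{obs}}.$$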
The main pitfall of the moment-based estimator is the difficulty of incorporating covariates. Therefore, Imbens and Rubin (1997a) and Hirano et al. (2000) outlined Bayesian inference for principal stratification that can incorporate pre-treatment covariates into the analysis. For each subject, only W_i^obs = W_i(Z_i) and Y_i^obs = Y_i(W_i^obs) can possibly be observed, while W_i^miss = W_i(1 − Z_i) and Y_i^miss = Y_i(1 − W_i^obs) are missing. Intrinsically, causal inference can be seen as a missing-data problem. Let Y^obs, Y^miss, W^obs, and W^miss be the N-vectors of observed and missing outcomes and receipts of treatment, respectively. The goal in the Bayesian paradigm is to derive the predictive distribution of (Y^miss, W^miss) given the observed data (Y^obs, W^obs, Z, X) so as to compute the estimand τ_SLATE = τ(Y^obs, Y^miss, W^obs, W^miss, Z, X) as a function of the observed and missing variables.
The core inputs that need to be specified in the model-based approach are the distribution of the compliance types f(G_i | X_i, θ) and the joint distribution of the potential outcomes f(Y_i(0), Y_i(1) | W_i(0), W_i(1), X_i; θ), or equivalently, f(Y_i(0), Y_i(1) | G_i, X_i; θ). Rubin et al. (2010) suggested using two binary models, one for being a complier or not, and one for being a nevertaker conditional on not being a complier, to model the three-valued compliance status indicator. The main idea of this framework is that the observed pairs (Z_i, W_i^obs) consist of a mixture of subjects from different compliance types, so that methods such as the EM algorithm or Data Augmentation (DA), which are typically used for mixture-model inference, can be used to infer causal effects under principal stratification (Imbens and Rubin, 2015). For the DA method, Richardson et al. (2011) further pointed out that the posterior is sensitive to the model specification of the compliance types and proposed a transparent parametrization that separates the identified and non-identified parameters.
3 Causal Graphical Models
The potential outcomes framework is powerful for recovering the effects of causes: causal questions are answered through specific manipulations of treatments (Holland, 1986). However, when it comes to identifying causal pathways or visualizing causal networks, the potential outcomes model has its own limitations. As an alternative, causal graphical models provide an intriguing tool for representing causal effects by directed edges that describe dependency relationships, which not only allows interventions to be modeled more generally, but also describes the data-generating mechanism of the random variables.
or equivalently,

$$X_{[p]} = \beta_0 + M^{\top} X_{[p]} + \epsilon_{[p]}, \qquad (4)$$

where pa(X_i) denotes the set of variables that are parents (or direct predecessors) of X_i, ε_1, ..., ε_p are mutually independent noise terms with zero mean, the β_{ji} are the so-called path coefficients that quantify the causal effect of X_j on X_i, and M is the matrix of path coefficients.
This framework is structural in that, even if we intervene on several variables, the functional form among the variables in Equation (3) does not change. Random variables X_{[p]} satisfying the model structure of Equation (3) can be represented by a directed acyclic graph (DAG) G = (V, E), where V is the set of vertices, each corresponding to one variable of interest X_i, and E ⊆ V × V is the edge set. Because a DAG contains no cycles and admits a topological ordering of the vertices, the matrix Id − M is invertible, and the structural assignments therefore induce a unique solution X_{[p]} = (Id − M^⊤)^{-1}(β_0 + ε_{[p]}).
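The closed-form solution can be checked numerically on a small example (an illustrative sketch; the three-variable DAG and coefficient values are made up for the illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 3
# Path-coefficient matrix M with M[j, i] = beta_{ji}, the effect of X_j on X_i.
# Hypothetical DAG: X1 -> X2 -> X3 and X1 -> X3.
M = np.zeros((p, p))
M[0, 1] = 0.8   # X1 -> X2
M[1, 2] = 0.5   # X2 -> X3
M[0, 2] = -0.3  # X1 -> X3
beta0 = np.array([1.0, 0.0, 2.0])
eps = rng.normal(size=p)

# Recursive generation following the topological order (X1, X2, X3).
x = np.zeros(p)
for i in range(p):
    x[i] = beta0[i] + M[:, i] @ x + eps[i]

# Closed-form solution X = (Id - M^T)^{-1} (beta0 + eps) from the text.
x_closed = np.linalg.solve(np.eye(p) - M.T, beta0 + eps)
assert np.allclose(x, x_closed)
```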
Path analysis can be used to differentiate correlation from causation. Applying Wright's path analysis, when the variances of the X_i are all standardized, the covariance between X_i and X_j is obtained by summing, over all d-connected paths, the product of the path coefficients along each path, that is,

$$\mathrm{Cov}(X_i, X_j) = \sum_{(d_0, \ldots, d_m) \in D(i,j)} \; \prod_{k=1}^{m} \beta_{d_{k-1} d_k}, \qquad (5)$$

where D(i, j) is the collection of d-connected paths between i and j. Additionally, in a linear SEM, the causal effect of X_i on X_j can be defined as

$$c(X_i \to X_j) = \sum_{(d_0, \ldots, d_m) \in G(i,j)} \; \prod_{k=1}^{m} \beta_{d_{k-1} d_k}, \qquad (6)$$
where G(i, j) is the collection of directed paths between i and j. As can be seen from Equation (5), if Cov(X_i, X_j) ≠ 0 then there exists a d-connected path between X_i and X_j. A d-connected path between X_i and X_j only introduces dependence between the two variables, which does not imply causation. The correlation between X_i and X_j implies causation only if D(i, j) = G(i, j), in other words, only if all the d-connected paths between i and j are directed paths.
With the underlying graph structure given, confirmatory factor analysis focuses on the statistical inference part. Suppose the random vector X_{[p]} satisfies the linear SEM with Gaussian errors and the matrix of path coefficients M is identifiable; then M can generally be estimated by maximum likelihood or by the generalized method of moments (Browne, 1984). Linear SEMs are also useful for identifying causal effects between unobserved variables using the "three indicator rule" (O'Brien, 1994). With a pre-specified DAG and assumptions on the latent variables, the path coefficients between the latent variables are identifiable (Kuroki and Pearl, 2014).
With the structure given by a DAG G, the associated joint probability distribution can be expressed compactly through conditional distributions. For instance, without knowing the relationships among the variables, a naive description of a probability distribution over p binary random variables requires O(2^p) parameters. Given a DAG structure in which the maximum number of incoming edges is m, the number of parameters needed to describe the entire distribution can be reduced to O(p · 2^m) (Russell and Norvig, 2002). The compact expression also induces certain independence assumptions in the model.
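For concreteness, a worked instance of these counts (using 2^p − 1 free parameters for an unrestricted joint distribution of p binary variables and at most 2^m parameters per node under the DAG): with p = 20 and m = 3,

$$2^{p} - 1 = 2^{20} - 1 = 1{,}048{,}575 \qquad \text{versus} \qquad p \cdot 2^{m} = 20 \cdot 2^{3} = 160.$$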
Several criteria have been developed for reading off conditional independencies from DAGs. The global Markov property (Pearl, 1986; Lauritzen et al., 1990), together with the Hammersley-Clifford theorem (Cowell et al., 1999), provides a tool for reading off conditional independencies in undirected graphs. For DAGs, the Hammersley-Clifford theorem cannot be applied directly to recover independencies among variables, since the densities in general do not factorize with respect to cliques (subsets of nodes in which every two nodes are directly connected in the undirected graph). The undirected moral graph G^m of a DAG G, obtained by adding edges between parents sharing a common child and dropping edge directions, suggests an alternative for checking factorization with respect to a DAG (Cowell et al., 1999). The theorem states that if a probability distribution factorizes with respect to a DAG G, it also factorizes with respect to G^m. The Hammersley-Clifford theorem can then be used to check whether the distribution satisfies the global Markov property with respect to G^m so as to obtain conditional independencies in the original DAG.
The d-separation rule is another criterion for obtaining conditional independencies from DAGs. The criterion emphasizes the role of colliders in introducing dependencies. A node is said to be a collider if it is a common child of two or more other nodes. For instance, in Figure 3, the variable X3 is a collider, since it is directly influenced by the variables X1 and X2. Conditioning in general removes dependencies between variables. As can be seen from this example, the variables X1 and X2 are originally independent, i.e., X1 ⊥ X2. However, conditioning on the collider X3, or on a descendant of the collider, introduces dependence between them, i.e., X1 ⊥̸ X2 | X3 and X1 ⊥̸ X2 | X4.
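This collider effect is easy to reproduce in simulation (a sketch; the linear structural equations below are a hypothetical instantiation of the DAG in Figure 3, with X4 a descendant of the collider X3):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                   # X1 and X2 are generated independently
x3 = x1 + x2 + 0.5 * rng.normal(size=n)   # X3 is a collider (common child of X1, X2)
x4 = x3 + 0.5 * rng.normal(size=n)        # X4 is a descendant of the collider

def partial_corr(a, b, c):
    """Correlation of a and b after linearly removing c (a proxy for conditioning)."""
    ra = a - np.polyfit(c, a, 1)[0] * c
    rb = b - np.polyfit(c, b, 1)[0] * c
    return np.corrcoef(ra, rb)[0, 1]

print(np.corrcoef(x1, x2)[0, 1])   # close to 0: marginally independent
print(partial_corr(x1, x2, x3))    # clearly negative: conditioning on the collider
print(partial_corr(x1, x2, x4))    # also negative: conditioning on its descendant
```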
In this sense, the d-separation rule states the following. In a DAG G = (V, E), a path is blocked by a set C ⊂ V if the path contains a non-collider node c1 ∈ C, or if it contains a collider node c2 ∉ C with des(c2) ∩ C = ∅, where des(c2) denotes the set of descendants of c2 (Pearl and Verma, 1991). For any sets V1, V2, V3 ⊂ V, if every path from a node in V1 to a node in V2 is blocked by V3, then V1 and V2 are d-separated by V3, denoted V1 ⊥ V2 | V3 [G]. The theorem states that, in a DAG G = (V, E) with respect to a probability distribution of X_{[p]}, for any disjoint sets V1, V2, V3 ⊂ V, d-separation of V1 and V2 by V3 implies conditional independence, i.e., X_{V1} ⊥ X_{V2} | X_{V3}, if and only if the probability distribution factorizes with respect to the DAG. In other words, a probability distribution that factorizes with respect to the DAG satisfies the global Markov property given by the d-separation criterion. Compared with constructing intermediate moral graphs, the d-separation criterion does not require the probability distribution to have a positive density and can be applied directly to the original DAG to read off independencies.
3.3 Non-parametric Structural Equation Models (NPSEMs)
Linear SEMs impose a strong linearity assumption. Pearl and Verma (1991) adapted structural equation models and relaxed the linearity assumption by introducing non-parametric structural equation models (NPSEMs). In NPSEMs, the arrows in the DAG represent functional relationships rather than linear ones. Formally, random variables X_{[p]} = (X_1, ..., X_p) with respect to a DAG G = (V, E) can be expressed as deterministic functions

$$X_i = f_i(X_{pa(i)}, \epsilon_i), \quad i = 1, \ldots, p. \qquad (8)$$

When the noise terms ε_i are further assumed to be mutually independent, the model is referred to as a non-parametric structural equation model with independent errors (NPSEM-IE). In the NPSEM framework, the sampling distribution of the random variables can be expressed by both Equations (7) and (8), while the latter describes the whole data-generating mechanism and provides an essential tool for reasoning about causal effects through the concept of do-calculus (Pearl, 1995, 1998, 2000).
The functional mechanism X_i = f_i(X_{pa(i)}, ε_i) provides a general method for querying causal effects when there are exogenous interventions on several variables. One of the most famous tools for answering such causal queries is the "do-calculus" advocated by Pearl (1995). The idea is as follows. The functional characterization in Equation (8) defines how the values of the X_i would change if external interventions were made on some of the variables in the system. Specifically, for any two subsets of nodes X_A, X_B ⊆ X_{[p]} with X_A ∩ X_B = ∅, when intervening to set X_A to specified values x_A, the structural equations are modified by replacing the corresponding assignments with X_A = x_A, and the causal effect on X_B can be computed from the modified system as E[X_B | do(X_A = x_A)].
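A small simulation illustrates the replacement of structural equations under an intervention, and how it differs from conditioning (a sketch; the three-variable SEM and its coefficients are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n, do_x2=None):
    """Draw from the SEM X1 -> X2 -> X3 with X1 -> X3; optionally intervene on X2."""
    x1 = rng.normal(size=n)
    # do(X2 = c) replaces the structural equation for X2 with the constant c.
    x2 = 2.0 * x1 + rng.normal(size=n) if do_x2 is None else np.full(n, do_x2)
    x3 = x1 + x2 + rng.normal(size=n)
    return x1, x2, x3

# E[X3 | do(X2 = 1)] is about 1.0, since E[X1] = 0 and X3 = X1 + X2 + noise.
_, _, x3_do = sample(200_000, do_x2=1.0)
print(x3_do.mean())

# Conditioning instead gives E[X3 | X2 ~ 1] of about 1.4, because observing X2
# carries information about X1 through the unblocked path X2 <- X1 -> X3.
x1, x2, x3 = sample(200_000)
print(x3[np.abs(x2 - 1.0) < 0.1].mean())
```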
Counterfactuals from a DAG can also be modeled using NPSEMs. For an intervention setting X_A = x_A, where A ⊆ [p], the counterfactual variables {X_i(X_A = x_A) : i ∈ [p]}, abbreviated {X_i(x_A) : i ∈ [p]}, can be defined by recursive substitution (Malinsky et al., 2019). Specifically, for an intervention on the parental nodes of X_i, i.e., setting X_{pa(i)} = x_{pa(i)}, the counterfactual of X_i is defined as

$$X_i(X_{pa(i)} = x_{pa(i)}) = f_i(x_{pa(i)}, \epsilon_i). \qquad (9)$$

On the other hand, if the intervention is made on an ancestor set A with A ≠ pa(i), the counterfactual is defined as

$$X_i(X_A = x_A) = X_i\big(X_{pa(i) \cap A} = x_{pa(i) \cap A},\; X_{pa(i) \setminus A} = X_{pa(i) \setminus A}(x_A)\big). \qquad (10)$$

Defined in this way, the consistency property follows (Malinsky et al., 2019): for any disjoint sets A, B ⊆ V and i ∈ V \ (A ∪ B),

$$X_B(x_A) = x_B \ \text{ implies } \ X_i(x_A, x_B) = X_i(x_A). \qquad (11)$$

The intuition behind the consistency property is that if the system is intervened on by setting variables to the values they would have taken anyway, then this additional intervention makes no difference to the system.
The multiple-world independence assumption states that the variables {X_i(x_{pa(i)}) : x_{pa(i)}} are mutually independent for all i ∈ [p], while the single-world independence assumption states that the variables {X_i(x_{pa(i)})} are mutually independent for all i ∈ [p] (Richardson and Robins, 2013). Intuitively, multiple-world independence makes hypotheses about potential outcomes that can never occur simultaneously, while single-world independence assumes that all of the potential outcomes are independent of each other when the specified variables are intervened on in the graph. For instance, consider the graph structure in Figure 4 with binary variable X1. Multiple-world independence makes assumptions about counterfactuals of X1 that can never occur simultaneously, e.g., X2(X1 = 1) ⊥ X3(X1 = 0), while single-world independence only requires X2(X1 = 0) ⊥ X3(X1 = 0). The NPSEM-IE framework induces the multiple-world independence assumption by assuming independence of the ε_i. In general, the multiple-world independence assumption is untestable, since it concerns counterfactuals that can never be realized at the same time.
Figure 4: An Illustration for Multiple-world and Single-world Independence Assumptions
Consider a specific intervention (X1 = x1, X3 = x3) made on the system. In the first step of constructing the SWIG, each node in the intervention set {X1, X3} is split into a random part and a fixed part, with the random component inheriting all the incoming edges and the fixed component inheriting all the outgoing edges. In the second step, all the random vertices in the resulting graph are relabeled. Figure 6 shows the SWIG for the graphical model in Figure 5, denoted G[X(X1 = x1, X3 = x3)] or G[X(x1, x3)]. As can be seen from the SWIG, the variables X2(x1), X3(x1), and X4(x1, x3) are now counterfactual variables.
Figure 6: The SWIG of the Graph Structure in Figure 5 for intervention (X1 = x1 , X3 = x3 )
A SWIG consists of both random and fixed components. By conditioning on the fixed parts, one can construct the subgraph G*[X(x_A)] that inherits only the random parts of the SWIG G[X(x_A)], and then apply the d-separation criterion (Pearl, 1988) to read off conditional independencies among the counterfactual variables. Alternatively, in the case of no unmeasured variables, the g-computation formula (Robins, 1986; Pearl, 2009) can also be used to derive identification formulas for counterfactuals. Specifically, for continuous variables X_{[p]} satisfying a single-world causal model with respect to a DAG G = (V, E) and any disjoint sets A, B ⊆ V,

$$P(X_B(x_A) = \tilde{x}_B) = \int \prod_{i \in V \setminus A} P\big(X_i = \tilde{x}_i \mid X_{pa(i) \cap A} = x_{pa(i) \cap A},\; X_{pa(i) \setminus A} = \tilde{x}_{pa(i) \setminus A}\big)\, d\tilde{x}_C, \qquad (12)$$

where C = V \ (A ∪ B). The identification formula for the discrete case is obtained analogously by replacing the integral with a summation. If the graphical structure contains hidden variables, the front-door and back-door criteria advocated by Pearl (1995), as well as the potential outcomes calculus proposed by Malinsky et al. (2019), are commonly used to derive identification formulas.
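As a worked illustration of Equation (12), consider a hypothetical three-node DAG Z → X → Y with Z → Y and the intervention A = {X}, B = {Y} (so C = {Z}); the formula then reduces to the familiar back-door adjustment

$$P(Y(x) = \tilde{y}) = \sum_{\tilde{z}} P(Y = \tilde{y} \mid X = x, Z = \tilde{z})\, P(Z = \tilde{z}).$$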
4 Conclusion
The theoretical development of causal inference has been driven by its diverse applications to observational data for understanding the impact of treatments on outcome variables. The potential outcomes (counterfactuals) framework is convenient for statistical inference and for making causal arguments. For observational analyses, this framework often requires non-verifiable assumptions such as unconfoundedness. Causal graphical models represent causal effects among variables using directed edges, which is powerful for visualizing whole causal networks and learning causal pathways. With the graphical structure given, a series of theories has been developed for reading off conditional independencies among variables from graphs. Structure discovery, on the other hand, provides a way to recover the Markov equivalence class from the independencies in observational data; it is useful for prioritizing experiments or when intervening on all the variables is impractical. In recent years, research on unifying these methods has become popular. More research bridging the perspectives of graphical models and the counterfactual framework would be valuable.
References
Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996). Identification of causal effects using
instrumental variables. Journal of the American statistical Association, 91(434):444–455.
Arjovsky, M., Bottou, L., Gulrajani, I., and Lopez-Paz, D. (2019). Invariant risk minimization.
arXiv preprint arXiv:1907.02893.
Aronow, P. M., Green, D. P., Lee, D. K., et al. (2014). Sharp bounds on the variance in
randomized experiments. Annals of Statistics, 42(3):850–871.
Athey, S., Imbens, G. W., and Wager, S. (2016). Approximate residual balancing: De-biased
inference of average treatment effects in high dimensions. arXiv preprint arXiv:1604.07125.
Browne, M. W. (1984). Asymptotically distribution-free methods for the analysis of covariance
structures. British journal of mathematical and statistical psychology, 37(1):62–83.
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68.
Chernozhukov, V., Escanciano, J. C., Ichimura, H., Newey, W. K., and Robins, J. M. (2016).
Locally robust semiparametric estimation. arXiv preprint arXiv:1608.00033.
Chib, S. and Jacobi, L. (2016). Bayesian fuzzy regression discontinuity analysis and returns to
compulsory schooling. Journal of Applied Econometrics, 31(6):1026–1047.
Chickering, D. M. (2002). Optimal structure identification with greedy search. Journal of
machine learning research, 3(Nov):507–554.
Cowell, R. G., Dawid, A. P., Lauritzen, S. L., and Spiegelhalter, D. J. (1999). Building and
using probabilistic networks. Probabilistic Networks and Expert Systems, pages 25–41.
Crump, R. K., Hotz, V. J., Imbens, G. W., and Mitnik, O. A. (2009). Dealing with limited
overlap in estimation of average treatment effects. Biometrika, 96(1):187–199.
Ding, P. and Dasgupta, T. (2016). A potential tale of two-by-two tables from completely
randomized experiments. Journal of the American Statistical Association, 111(513):157–
168.
Ding, P., Li, F., et al. (2018). Causal inference: A missing data perspective. Statistical Science,
33(2):214–237.
Dorie, V., Hill, J., Shalit, U., Scott, M., Cervone, D., et al. (2019). Automated versus do-
it-yourself methods for causal inference: Lessons learned from a data analysis competition.
Statistical Science, 34(1):43–68.
Elliott, M. R., Raghunathan, T. E., and Li, Y. (2010). Bayesian inference for causal mediation
effects using principal stratification with dichotomous mediators and outcomes. Biostatistics,
11(2):353–372.
Fan, J., Imai, K., Liu, H., Ning, Y., and Yang, X. (2016). Improving covariate balancing
propensity score: A doubly robust and efficient approach. Technical report, Technical report,
Princeton University.
Farrell, M. H. (2015). Robust inference on average treatment effects with possibly more co-
variates than observations. Journal of Econometrics, 189(1):1–23.
Frangakis, C. E. and Rubin, D. B. (2002). Principal stratification in causal inference. Biomet-
rics, 58(1):21–29.
Freedman, D. A. (2008). On regression adjustments to experimental data. Advances in Applied
Mathematics, 40(2):180–193.
Funk, M. J., Westreich, D., Wiesen, C., Stürmer, T., Brookhart, M. A., and Davidian, M.
(2011). Doubly robust estimation of causal effects. American journal of epidemiology,
173(7):761–767.
Glynn, A. N. and Quinn, K. M. (2010). An introduction to the augmented inverse propensity
weighted estimator. Political analysis, pages 36–56.
Gustafson, P. (2015). Bayesian inference for partially identified models: Exploring the limits
of limited data, volume 140. CRC Press.
Haavelmo, T. (1943). The statistical implications of a system of simultaneous equations.
Econometrica, Journal of the Econometric Society, pages 1–12.
Hahn, J. (1998). On the role of the propensity score in efficient semiparametric estimation of
average treatment effects. Econometrica, pages 315–331.
Hahn, J., Todd, P., and Van der Klaauw, W. (2001). Identification and estimation of treatment
effects with a regression-discontinuity design. Econometrica, 69(1):201–209.
Hainmueller, J. (2012). Entropy balancing for causal effects: A multivariate reweighting
method to produce balanced samples in observational studies. Political analysis, pages
25–46.
Hirano, K., Imbens, G. W., and Ridder, G. (2003). Efficient estimation of average treatment
effects using the estimated propensity score. Econometrica, 71(4):1161–1189.
Hirano, K., Imbens, G. W., Rubin, D. B., and Zhou, X.-H. (2000). Assessing the effect of an
influenza vaccine in an encouragement design. Biostatistics, 1(1):69–88.
Holland, P. W. (1986). Statistics and causal inference. Journal of the American statistical
Association, 81(396):945–960.
Horvitz, D. G. and Thompson, D. J. (1952). A generalization of sampling without replacement
from a finite universe. Journal of the American statistical Association, 47(260):663–685.
Imai, K. and Ratkovic, M. (2014). Covariate balancing propensity score. Journal of the Royal
Statistical Society: Series B: Statistical Methodology, pages 243–263.
Imbens, G. (2014). Instrumental variables: an econometrician’s perspective. Technical report,
National Bureau of Economic Research.
Imbens, G. W. and Angrist, J. D. (1994). Identification and estimation of local average treat-
ment effects. Econometrica: Journal of the Econometric Society, pages 467–475.
Imbens, G. W. and Rubin, D. B. (1997a). Bayesian inference for causal effects in randomized
experiments with noncompliance. The annals of statistics, pages 305–327.
Imbens, G. W. and Rubin, D. B. (1997b). Estimating outcome distributions for compliers in
instrumental variables models. The Review of Economic Studies, 64(4):555–574.
Imbens, G. W. and Rubin, D. B. (2015). Causal inference in statistics, social, and biomedical
sciences. Cambridge University Press.
Kang, J. D., Schafer, J. L., et al. (2007). Demystifying double robustness: A comparison of
alternative strategies for estimating a population mean from incomplete data. Statistical
science, 22(4):523–539.
Keele, L. and Small, D. S. (2021). Comparing covariate prioritization via matching to ma-
chine learning methods for causal inference using five empirical applications. The American
Statistician, pages 1–9.
Kuroki, M. and Pearl, J. (2014). Measurement bias and effect restoration in causal inference.
Biometrika, 101(2):423–437.
Lauritzen, S. L., Dawid, A. P., Larsen, B. N., and Leimer, H.-G. (1990). Independence prop-
erties of directed markov fields. Networks, 20(5):491–505.
Lee, Y. J., Ellenberg, J. H., Hirtz, D. G., and Nelson, K. B. (1991). Analysis of clinical trials by
treatment actually received: is it really an option? Statistics in medicine, 10(10):1595–1605.
Li, F., Morgan, K. L., and Zaslavsky, A. M. (2018). Balancing covariates via propensity score
weighting. Journal of the American Statistical Association, 113(521):390–400.
Li, X. and Ding, P. (2020). Rerandomization and regression adjustment. Journal of the Royal
Statistical Society: Series B (Statistical Methodology), 82(1):241–268.
Lin, W. et al. (2013). Agnostic notes on regression adjustments to experimental data: Reex-
amining freedman’s critique. Annals of Applied Statistics, 7(1):295–318.
Lindley, D. V. (1972). Bayesian statistics: A review. SIAM.
Lunceford, J. K. and Davidian, M. (2004). Stratification and weighting via the propensity
score in estimation of causal treatment effects: a comparative study. Statistics in medicine,
23(19):2937–2960.
Malinsky, D., Shpitser, I., and Richardson, T. (2019). A potential outcomes calculus for iden-
tifying conditional path-specific effects. In The 22nd International Conference on Artificial
Intelligence and Statistics, pages 3080–3088. PMLR.
McNamee, R. (2009). Intention to treat, per protocol, as treated and instrumental variable es-
timators given non-compliance and effect heterogeneity. Statistics in medicine, 28(21):2639–
2652.
Pearl, J. (2000). Causality: Models, Reasoning and Inference. Cambridge University Press, Cambridge.
Pearl, J. (2009). Causality. Cambridge university press.
Pearl, J. and Mackenzie, D. (2018). The book of why: the new science of cause and effect.
Basic books.
Pearl, J. and Verma, T. (1991). A theory of inferred causation. In Allen, J., Fikes, R., and Sandewall, E., editors, KR-91: Principles of Knowledge Representation and Reasoning: Proceedings of the Second International Conference.
Richardson, T. S., Evans, R. J., and Robins, J. M. (2011). Transparent parameterizations of
models for potential outcomes. Bayesian Statistics, 9:569–610.
Richardson, T. S. and Robins, J. M. (2013). Single world intervention graphs (swigs): A unifi-
cation of the counterfactual and graphical approaches to causality. Center for the Statistics
and the Social Sciences, University of Washington Series. Working Paper, 128(30):2013.
Robins, J. (1986). A new approach to causal inference in mortality studies with a sustained
exposure period-application to control of the healthy worker survivor effect. Mathematical
modelling, 7(9-12):1393–1512.
Robins, J. M., Rotnitzky, A., and Zhao, L. P. (1994). Estimation of regression coefficients when
some regressors are not always observed. Journal of the American statistical Association,
89(427):846–866.
Rojas-Carulla, M., Schölkopf, B., Turner, R., and Peters, J. (2018). Invariant models for causal
transfer learning. The Journal of Machine Learning Research, 19(1):1309–1342.
Rosenbaum, P. R. (1987). Model-based direct adjustment. Journal of the American Statistical
Association, 82(398):387–394.
Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in
observational studies for causal effects. Biometrika, 70(1):41–55.
Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized
studies. Journal of educational Psychology, 66(5):688.
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3):581–592.
Rubin, D. B. (1977). Assignment to treatment group on the basis of a covariate. Journal of
educational Statistics, 2(1):1–26.
Rubin, D. B. (1978). Bayesian inference for causal effects: The role of randomization. The
Annals of statistics, pages 34–58.
Rubin, D. B. (1979). Using multivariate matched sampling and regression adjustment to control
bias in observational studies. Journal of the American Statistical Association, 74(366a):318–
328.
Rubin, D. B. (1980). Randomization analysis of experimental data: The fisher randomization
test comment. Journal of the American Statistical Association, 75(371):591–593.
Rubin, D. B. (1990). Comment: Neyman (1923) and causal inference in experiments and
observational studies. Statistical Science, 5(4):472–480.
Rubin, D. B. et al. (2006). Causal inference through potential outcomes and principal stratifica-
tion: application to studies with ”censoring” due to death. Statistical Science, 21(3):299–309.
Rubin, D. B. et al. (2008). For objective causal inference, design trumps analysis. Annals of
Applied Statistics, 2(3):808–840.
Rubin, D. B., Wang, X., Yin, L., and Zell, E. (2010). Bayesian causal inference: Approaches to estimating the effect of treating hospital type on cancer survival in Sweden using principal stratification.
Tsiatis, A. (2007). Semiparametric theory and missing data. Springer Science & Business
Media.
Tsiatis, A. A., Davidian, M., Zhang, M., and Lu, X. (2008). Covariate adjustment for two-
sample treatment comparisons in randomized clinical trials: a principled yet flexible ap-
proach. Statistics in medicine, 27(23):4658–4677.
Waernbaum, I. (2010). Propensity score model specification for estimation of average treatment
effects. Journal of Statistical Planning and Inference, 140(7):1948–1956.
Westreich, D., Lessler, J., and Funk, M. J. (2010). Propensity score estimation: neural net-
works, support vector machines, decision trees (cart), and meta-classifiers as alternatives to
logistic regression. Journal of clinical epidemiology, 63(8):826–833.
Ye, T., Shao, J., and Zhao, Q. (2020). Principles for covariate adjustment in analyzing ran-
domized clinical trials. arXiv preprint arXiv:2009.11828.
Zelen, M. (1979). A new design for randomized clinical trials. New England Journal of Medicine, 300:1242–1245.
Zhang, G. and Little, R. (2009). Extensions of the penalized spline of propensity prediction
method of imputation. Biometrics, 65(3):911–918.
Zhao, Q. et al. (2019). Covariate balancing propensity score by tailored loss functions. Annals
of Statistics, 47(2):965–993.
Zubizarreta, J. R. (2015). Stable weights that balance covariates for estimation with incomplete
outcome data. Journal of the American Statistical Association, 110(511):910–922.