CAUSES OF EFFECTS: LEARNING INDIVIDUAL RESPONSES FROM POPULATION DATA
arXiv:2104.13730v2 [stat.ME] 2 May 2021
Ang Li
University of California, Los Angeles
Computer Science Department
angli@cs.ucla.edu
Scott Mueller
University of California, Los Angeles
Computer Science Department
scott@cs.ucla.edu
Judea Pearl
University of California, Los Angeles
Computer Science Department
judea@cs.ucla.edu
April 28, 2021
ABSTRACT
The problem of individualization is recognized as crucial in almost every field. Identifying causes
of effects in specific events is likewise essential for accurate decision making. However, such
estimates invoke counterfactual relationships, and are therefore indeterminable from population
data. For example, the probability of benefiting from a treatment concerns an individual having a
favorable outcome if treated and an unfavorable outcome if untreated. Experiments conditioning
on fine-grained features are fundamentally inadequate because we can’t test both possibilities for
an individual. Tian and Pearl provided bounds on this and other probabilities of causation using a
combination of experimental and observational data. Even though those bounds were proven tight,
narrower bounds, sometimes significantly so, can be achieved when structural information is available
in the form of a causal model. This has the power to solve central problems, such as explainable AI,
legal responsibility, and personalized medicine, all of which demand counterfactual logic. We analyze
and expand on existing research by applying bounds to the probability of necessity and sufficiency
(PNS) along with graphical criteria and practical applications.
1 INTRODUCTION
Machine learning advances have enabled tremendous capabilities of learning functions accurately and efficiently from
enormous quantities of data. These functions allow for better policies, like whether a surgery, chemotherapy, or radiation
therapy is most effective for a population of given characteristics such as age, sex, and type of symptoms. However,
this mapping from characteristics to efficacy can be quite misleading when applied to individual decision making,
even when the data originate from a randomized controlled trial (RCT). To see why, let's follow the example treated in
[Mueller and Pearl, 2020]. Imagine a novel vaccine for a deadly virus in the midst of a pandemic is in short supply. We
want to administer the vaccine to people most likely to benefit from it. In other words, we need to identify the group
most likely to both survive if vaccinated and succumb if unvaccinated.
A clinical study is conducted to test the effectiveness of the vaccine. A machine learning algorithm trained on data
from this RCT learns a correlation between age and recovery. For simplicity, let’s assume a binary age classification:
sixty years old and under and over sixty years old. Older people survive 57% of the time when vaccinated and 37%
of the time when unvaccinated, while younger people survive 55% of the time when vaccinated and 45% of the time
when unvaccinated. A naive interpretation is that the vaccine is 20 − 10 = 10 percentage points more effective for
older people.
Before deciding to vaccinate the elderly, we need to assess the percentage of elderly patients who would actually benefit
from the treatment (PNS) and compare it to the percentage of beneficiaries among the young. Such assessment requires
counterfactual analysis such as in [Tian and Pearl, 2000] and, based on the data above, yields the following bounds: the
probability of over-sixties benefiting from the vaccine is between 20% and 57%, while the under-sixties’ probability is
between 10% and 55%. We see that it’s anything but clear which group should be vaccinated first.
What is more remarkable is that these bounds can be narrowed significantly if data from observational studies is also
available, and may even flip priority from the elderly to the young. Observational studies reflect individuals’ willingness
to get vaccinated in the two age groups. In our example, one can show that the bounds for over-sixties and
under-sixties may become [20%, 40%] and [40%, 55%] respectively, thus reversing the naïve priorities above.
Clearly, when a subpopulation with a particular set of characteristics is analyzed for PNS, those covariates should be
conditioned on. However, this is not always possible, as in the case of ancestral knowledge or a mediating effect of a
vaccine. We may have data on the population, but not at the level an individual can make a decision from. After all, the
individual doesn’t know a potential side-effect of the vaccine until after it’s been administered. We present a method to
potentially obtain narrower bounds by utilizing population-level data and mild structural assumptions.
Since Tian and Pearl [Tian and Pearl, 2000], the problem of bounding probabilities of causation has been analyzed by
combining only two sources of information: experimental data and observational studies. It's surprising that knowing
the structure of the causal graph allows us to narrow the bounds, despite the fact that the graph may seem redundant; i.e.,
we already know the causal effects. Moreover, the graph adds information about an individual, although it describes
properties of the population. The analysis of causes of effects can now take advantage of the causal diagram.
2 PRELIMINARIES AND RELATED WORK
In this section, we review the definitions for the three aspects of causation as defined in [Pearl, 1999]. We use the causal
diagrams [Pearl, 1995, Spirtes et al., 2000, Pearl, 2009, Koller and Friedman, 2009] and the language of counterfactuals in its structural model semantics, as given in [Balke and Pearl, 2013, Galles and Pearl, 1998, Halpern, 2000].
We use Y_x = y to denote the counterfactual sentence "Variable Y would have the value y, had X been x". For simplicity,
in the rest of the paper, we use y_x to denote the event Y_x = y, y_{x′} to denote the event Y_{x′} = y, y′_x to denote
the event Y_x = y′, and y′_{x′} to denote the event Y_{x′} = y′. For notational simplicity, we limit the discussion to binary X
and Y; extensions to multi-valued variables are straightforward [Pearl, 2009].
Three prominent probabilities of causation are the following:
Definition 1 (Probability of necessity (PN)). Let X and Y be two binary variables in a causal model M, let x and y
stand for the propositions X = true and Y = true, respectively, and x′ and y′ for their complements. The probability
of necessity is defined as the expression [Pearl, 1999]

PN ≜ P(Y_{x′} = false | X = true, Y = true) ≜ P(y′_{x′} | x, y)    (1)
In other words, PN stands for the probability that event y would not have occurred in the absence of event x, given that
x and y did in fact occur.
Note that lower case letters (e.g., x, y) stand for propositions (or events). PN has applications in epidemiology, legal
reasoning, and artificial intelligence. Epidemiologists have long been concerned with estimating the probability that a
certain case of disease is attributable to a particular exposure, which is normally interpreted counterfactually as “the
probability that disease would not have occurred in the absence of exposure, given that disease and exposure did in
fact occur." This counterfactual notion is also used frequently in lawsuits, where legal responsibility is at the center of
contention.
Definition 2 (Probability of sufficiency (PS)). [Pearl, 1999]

PS ≜ P(y_x | y′, x′)    (2)
PS finds applications in policy analysis, artificial intelligence, and psychology. A policy maker may well be interested
in the dangers that a certain exposure may present to the healthy population [Khoury et al., 1989]. Counterfactually,
this notion is expressed as the “probability that a healthy unexposed individual would have gotten the disease had he/she
been exposed." In psychology, PS serves as the basis for Cheng’s [Cheng, 1997] causal power theory [Glymour, 2013],
which attempts to explain how humans judge causal strength among events. In artificial intelligence, PS plays a major
role in the generation of explanations [Pearl, 2009].
Definition 3 (Probability of necessity and sufficiency (PNS)). [Pearl, 1999]

PNS ≜ P(y_x, y′_{x′})    (3)
PNS stands for the probability that y would respond to x both ways, and therefore measures both the sufficiency and
necessity of x to produce y.
Tian and Pearl [Tian and Pearl, 2000] provide tight bounds for PNS, PN, and PS without a causal diagram using Balke’s
program [Balke and Pearl, 1997]. Li and Pearl [Li and Pearl, 2019] provide a theoretical proof of the tight bounds for
PNS, PS, PN, and other probabilities of causation without a causal diagram.
PNS, PN, and PS have the following tight bounds:

max{0, P(y_x) − P(y_{x′}), P(y) − P(y_{x′}), P(y_x) − P(y)} ≤ PNS    (4)

PNS ≤ min{P(y_x), P(y′_{x′}), P(x, y) + P(x′, y′), P(y_x) − P(y_{x′}) + P(x, y′) + P(x′, y)}    (5)

max{0, [P(y) − P(y_{x′})] / P(x, y)} ≤ PN    (6)

PN ≤ min{1, [P(y′_{x′}) − P(x′, y′)] / P(x, y)}    (7)
Note that we only consider PNS and PN here because the bounds of PS can be easily obtained by exchanging x with x′
and y with y ′ in the bounds of PN.
To obtain bounds for a specific population, defined by a set C of characteristics, the expressions above should be
modified by conditioning each term on C = c. This would normally yield narrower bounds because, when C is not
affected by X, it reduces variations among units in the subpopulation considered. In this paper, however, we obtain
narrower bounds of PNS by leveraging another source of knowledge – the causal diagram behind the data, together with
measurements of a set Z of covariates in that diagram. We provide graphical conditions under which the availability
of such measurements would improve the bounds and demonstrate, both analytically and by simulation, the degree
of improvement achieved. Narrower bounds and graphical criteria can be obtained for PN and PS through the same
mechanism detailed in the proofs in the appendix.
3 BOUNDS WITH CAUSAL DIAGRAM
3.1 With additional covariate Z
3.1.1 Non-descendant Z
Theorems 4 and 5 below provide bounds for PNS when a set Z of variables can be measured which satisfies only
one simple condition: Z contains no descendant of X. This condition is important because if X were set to x and Z
contained a descendant of X, then Z could be altered as well and P(y_x|z) would be unmeasurable. If the descendant were
independent of Y_x, then P(y_x|z) would be measurable, but that descendant wouldn't contribute to any narrowing of the
bounds. These bounds are always contained within the Tian-Pearl bounds of equations 4, 5, 6, and 7.
Theorem 4. Given a causal diagram G and a distribution compatible with G, let Z be a set of variables that does not
contain any descendant of X in G. Then PNS is bounded as follows:

Σ_z max{0, P(y_x|z) − P(y_{x′}|z), P(y|z) − P(y_{x′}|z), P(y_x|z) − P(y|z)} × P(z) ≤ PNS    (8)

PNS ≤ Σ_z min{P(y_x|z), P(y′_{x′}|z), P(y, x|z) + P(y′, x′|z), P(y_x|z) − P(y_{x′}|z) + P(y, x′|z) + P(y′, x|z)} × P(z)    (9)
Proof. See Appendix.
Note that, unlike the population-specific bounds, where each term was conditioned on C = c, here Z = z enters only
some of the terms. This is because the measurement of Z is conducted in the study, but may not be available for the
individual seeking advice. Examples are illustrated in Section 4.
Note also that if only experimental data are available (i.e., P(Y), P(Y, X), P(Y|Z), P(Y, X|Z) are not measured),
arguments to the max or min functions involving observational data can be disregarded. For example, the lower bound
of theorem 4 would become max{P(y_x) − P(y_{x′}), Σ_z max{0, P(y_x|z) − P(y_{x′}|z)} × P(z)}.
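To make theorem 4 concrete, here is a minimal sketch (naming is ours) that evaluates equations 8 and 9 from z-specific data, including the observational arguments only when they are measured:

```python
def theorem4_bounds(strata):
    """PNS bounds of equations 8 and 9.

    strata: one dict per value z of Z, with keys
      'pz'   : P(z)
      'pyx'  : P(y_x | z),  'pyxp' : P(y_{x'} | z)     (experimental)
      'py', 'pxy', 'pxyp', 'pxpy', 'pxpyp'             (observational, optional)
    """
    lower = upper = 0.0
    for s in strata:
        lo = [0.0, s['pyx'] - s['pyxp']]
        hi = [s['pyx'], 1.0 - s['pyxp']]     # P(y'_{x'}|z) = 1 - P(y_{x'}|z)
        if 'py' in s:                        # observational data measured
            lo += [s['py'] - s['pyxp'], s['pyx'] - s['py']]
            hi += [s['pxy'] + s['pxpyp'],
                   s['pyx'] - s['pyxp'] + s['pxyp'] + s['pxpy']]
        lower += s['pz'] * max(lo)
        upper += s['pz'] * min(hi)
    return lower, upper
```

With experimental data only, the call reduces to the simplification noted above; fed the ancestry RCT data of example 4.3, it returns (0.275, 0.5).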
3.1.2 Sufficient Covariate Z

[Figure 1: Z is not a descendant of X. (a) Confounder Z. (b) Outcome-affecting covariate Z.]

In figures 1a and 1b, Z is not a descendant of X and further satisfies the back-door criterion. For such cases the PNS
bounds can be simplified to read:
Theorem 5. Given a causal diagram G and a distribution compatible with G, let Z be a set of variables satisfying the
back-door criterion [Pearl, 1993] in G. Then PNS is bounded as follows:

Σ_z max{0, P(y|x, z) − P(y|x′, z)} × P(z) ≤ PNS    (10)

PNS ≤ Σ_z min{P(y|x, z), P(y′|x′, z)} × P(z)    (11)
Proof. See Appendix.
The significance of theorem 5 is due to the ability to compute bounds using purely observational data.
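As a sketch (naming is ours), equations 10 and 11 need only the conditional distributions P(y|x, z) and the marginal P(z):

```python
def theorem5_bounds(strata):
    """PNS bounds of equations 10 and 11 from purely observational data.

    strata: iterable of (P(z), P(y|x,z), P(y|x',z)) triples.
    Y is binary, so P(y'|x',z) = 1 - P(y|x',z).
    """
    lower = sum(pz * max(0.0, pyxz - pyxpz) for pz, pyxz, pyxpz in strata)
    upper = sum(pz * min(pyxz, 1.0 - pyxpz) for pz, pyxz, pyxpz in strata)
    return lower, upper
```

Fed the drug-study proportions of section 4.1 (table 1), this returns a lower bound of 0 and an upper bound of roughly 0.015.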
[Figure 2: Mediator Z with direct effect.]
3.2 Mediation
3.2.1 Z as a PARTIAL MEDIATOR
In figure 2, Z is a descendant of X, so we cannot use theorems 4 and 5. However, the absence of confounders between
Z and Y and between X and Y permits us to bound PNS as follows:
Theorem 6. Given a causal diagram G and a distribution compatible with G, let Z be a set of variables such that
∀x, x′ ∈ X : x ≠ x′, (Y_x ⊥⊥ X ∪ Z_{x′} | Z_x) in G. Then PNS is bounded as follows:

max{0, P(y_x) − P(y_{x′}), P(y) − P(y_{x′}), P(y_x) − P(y)} ≤ PNS    (12)

PNS ≤ min{P(y_x), P(y′_{x′}), P(y, x) + P(y′, x′), P(y_x) − P(y_{x′}) + P(y, x′) + P(y′, x),
          Σ_z Σ_{z′} min{P(y|z, x), P(y′|z′, x′)} × min{P(z_x), P(z′_{x′})}}    (13)

Proof. See Appendix.
Note that although this lower bound is unchanged from Tian and Pearl, the upper bound contains a vital additional
argument to the min function. This new term can significantly reduce the upper bound. The rest of the terms are
included because sometimes Tian and Pearl's bounds are superior. The following theorem has the same property.
3.2.2 PURE MEDIATOR

[Figure 3: Mediator Z with no direct effect.]

Figure 3 is a special case of figure 2, in which X has no direct effect on Y. The resulting bounds for PNS read:

Theorem 7. Given the causal diagram G in figure 3 and a distribution compatible with G, PNS is bounded as follows:

max{0, P(y_x) − P(y_{x′}), P(y) − P(y_{x′}), P(y_x) − P(y)} ≤ PNS    (14)

PNS ≤ min{P(y_x), P(y′_{x′}), P(y, x) + P(y′, x′), P(y_x) − P(y_{x′}) + P(y, x′) + P(y′, x),
          Σ_z Σ_{z′≠z} min{P(y|z), P(y′|z′)} × min{P(z|x), P(z′|x′)}}    (15)
Proof. See Appendix.
Notably, the new terms added to the upper bounds in theorems 6 and 7 require only observational data.
4 EXAMPLES
4.1 CREDIT TO THE TREATMENT
The manufacturer of a drug wants to claim that a non-trivial number of recovered patients who were given access to
the drug owe their recovery to the drug. So they conduct an observational study; they record the recovery rates of
700 patients. 464 patients chose to take the drug and 236 patients did not. The results of the study are in table 1. The
manufacturer claims success for their drug because the overall recovery rate in the observational study increased
from 54% among non-drug-takers to 68% among drug-takers.
           Drug                        No Drug
Women      1 of 110 recovered (1%)     13 of 120 recovered (11%)
Men        313 of 354 recovered (88%)  114 of 116 recovered (98%)
Overall    314 of 464 recovered (68%)  127 of 236 recovered (54%)

Table 1: Results of a drug study with gender taken into account
The patients who should credit the drug for their recovery are those who would have recovered had they
taken the drug and would not have recovered had they not taken it. This is the PNS.
Let X = x denote the event that the patient took the drug and X = x′ denote the event that the patient did not take the
drug. Let Y = y denote the event that the patient has recovered and Y = y ′ denote the event that the patient has not
recovered. Let Z = z represent female patients and Z = z′ represent male patients. Suppose we know an additional
fact: estrogen has a negative effect on recovery, so women are less likely to recover than men, regardless of the drug.
Additionally, as we can see from the data, men are significantly more likely to take the drug than women are. The
causal diagram is shown in Figure 1a.
Node Z on the graph satisfies the back-door criterion, therefore we can compute the causal effect P (yx ) and P (yx′ ) via
the adjustment formula [Pearl, 1993] and observational data from table 1:

P(y_x) = Σ_z P(y|x, z)P(z) = 0.597,
P(y_{x′}) = Σ_z P(y|x′, z)P(z) = 0.696,
P(y′_{x′}) = 1 − P(y_{x′}) = 0.304.
Therefore, the bounds of PNS computed using equations 4 and 5 are 0 ≤ PNS ≤ 0.297, where the diagram was used
only to identify the causal effects y_x and y_{x′}. These bounds aren't informative enough to conclude whether or not the
drug was the cause of recovery for a meaningful number of patients. They suggest that the fraction of beneficiaries can
be as low as 0% or as high as 29.7%. Now, consider the bounds in theorem 5, which take into account the position of Z
in the diagram. Since Z satisfies the back-door criterion, we can use equations 10 and 11 to compute 0 ≤ PNS ≤ 0.01.
The conclusion now is obvious. At most 7 out of 314 patients' recoveries can be credited to the drug. This is strong
evidence against the manufacturer's claim.
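The computation above can be reproduced directly from the raw counts of table 1. A minimal sketch (variable names are ours):

```python
# counts from table 1: (recovered, total) by gender and drug status
drug    = {'women': (1, 110),  'men': (313, 354)}
no_drug = {'women': (13, 120), 'men': (114, 116)}
n = 700

pz     = {g: (drug[g][1] + no_drug[g][1]) / n for g in drug}  # P(z)
py_xz  = {g: drug[g][0] / drug[g][1] for g in drug}           # P(y|x,z)
py_xpz = {g: no_drug[g][0] / no_drug[g][1] for g in no_drug}  # P(y|x',z)

# causal effects via the adjustment formula (Z satisfies the back-door criterion)
pyx  = sum(py_xz[g]  * pz[g] for g in pz)   # P(y_x)   ~ 0.597
pyxp = sum(py_xpz[g] * pz[g] for g in pz)   # P(y_x')  ~ 0.696

# Tian-Pearl bounds (equations 4 and 5)
py    = (1 + 313 + 13 + 114) / n            # P(y) = 441/700
pxy   = (1 + 313) / n                       # P(x, y)
pxyp  = (110 - 1 + 354 - 313) / n           # P(x, y')
pxpy  = (13 + 114) / n                      # P(x', y)
pxpyp = (120 - 13 + 116 - 114) / n          # P(x', y')
tp_lo = max(0.0, pyx - pyxp, py - pyxp, pyx - py)
tp_hi = min(pyx, 1 - pyxp, pxy + pxpyp, pyx - pyxp + pxyp + pxpy)  # ~ 0.297

# theorem 5 bounds (equations 10 and 11)
t5_lo = sum(pz[g] * max(0.0, py_xz[g] - py_xpz[g]) for g in pz)
t5_hi = sum(pz[g] * min(py_xz[g], 1 - py_xpz[g]) for g in pz)      # ~ 0.015
```

The theorem 5 upper bound evaluates to about 0.015; multiplied by the 464 drug-takers, this is about 7 patients, matching the conclusion above.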
4.2 INFLAMMATION MEDIATOR
As before, let X and Y represent drug consumption and recovery. Let Z represent acute inflammation with z being
present and z ′ being absent. In some people, the drug causes acute inflammation, which has adverse effects on recovery,
so the causal structure is depicted in figure 3. We observe the following proportions among drug takers, non-takers,
with inflammation, and without inflammation:
P(y|z) = 0.5,    P(y|z′) = 0.5,
P(z|x) = 0.1,    P(z|x′) = 0.1.
The Tian-Pearl PNS upper bound is:

PNS ≤ min{P(y|x), P(y′|x′)} = 0.5.
Given that the lower bound is 0, these bounds are not very informative. If we knew that an individual would react to the
drug with acute inflammation, we would only look at the data comprising people who react to the drug with acute
inflammation. Since we are conditioning on z, P N S = 0 because the outcome, Y , will have the same result regardless
of whether the person consumed the drug. So knowing a person’s inflammation response to the drug narrows PNS from
a wide [0, 0.5] to a point estimate of 0. Imagine, for this drug, that we can’t know ahead of time how a person will react
inflammation-wise. We can only observe acute inflammation after the drug is administered. Since we have population
data from patients who have already taken the drug, we can utilize this mediator to bound the PNS for new patients who
haven’t yet taken the drug:
Using the added term of equation 15, the mediator-based upper bound is

PNS ≤ Σ_z Σ_{z′≠z} min{P(y|z), P(y′|z′)} × min{P(z|x), P(z′|x′)}
    = min{P(y|z), P(y′|z′)} × min{P(z|x), P(z′|x′)} + min{P(y|z′), P(y′|z)} × min{P(z′|x), P(z|x′)}
    = 0.1.
The mediator-improved PNS upper bound is significantly smaller than what the Tian-Pearl upper bound provides, 0.1
vs 0.5. The new upper bound can now be effectively weighed against other factors like cost and side-effects.
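A sketch of this computation (variable names are ours; values from the example above):

```python
# observational data from the inflammation example
pyz  = {'z': 0.5, 'zp': 0.5}     # P(y | Z)
pzx  = {'z': 0.1, 'zp': 0.9}     # P(Z | x),  drug taken
pzxp = {'z': 0.1, 'zp': 0.9}     # P(Z | x'), drug not taken

# Tian-Pearl upper bound: min{P(y|x), P(y'|x')}
pyx  = sum(pyz[z] * pzx[z]  for z in pyz)    # P(y|x)
pyxp = sum(pyz[z] * pzxp[z] for z in pyz)    # P(y|x')
tp_upper = min(pyx, 1.0 - pyxp)

# mediator term of theorem 7 (sum over pairs z != z')
t7_upper = sum(min(pyz[z], 1.0 - pyz[zp]) * min(pzx[z], pzxp[zp])
               for z in pyz for zp in pyz if z != zp)
```

Here `tp_upper` evaluates to 0.5 while `t7_upper` evaluates to 0.1, reproducing the improvement discussed above.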
4.3 ANCESTRAL COVARIATE
Let's continue from the introduction, where X represents vaccination, with x being vaccinated and x′ unvaccinated,
and Y represents survival, with y surviving and y′ succumbing to the pandemic. Instead of classifying by age, let's
assume our machine learning algorithm uncovers a correlation between survival and ancestry. Let Z represent ancestry
and, for simplicity, there are only two ancestries, z and z ′ . Either graph of figure 1 is representative of this. Our RCT
data reveals:

P(Z = z) = 0.5,
P(y_x|Z = z) = 0.75,    P(y_x|Z = z′) = 0.25,
P(y_{x′}|Z = z) = 0.2,    P(y_{x′}|Z = z′) = 0.6.
We now have four different bounds on PNS:

Tian-Pearl =⇒ 0.1 ≤ PNS ≤ 0.5
Covariate-improved =⇒ 0.275 ≤ PNS ≤ 0.5
Person has ancestry z =⇒ 0.55 ≤ PNS ≤ 0.75
Person has ancestry z′ =⇒ 0 ≤ PNS ≤ 0.25
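Each of these four bound sets follows mechanically from the RCT numbers above; a minimal sketch (variable names are ours):

```python
# RCT data from the ancestry example
pz   = {'z': 0.5,  'zp': 0.5}    # P(Z)
pyx  = {'z': 0.75, 'zp': 0.25}   # P(y_x  | Z)
pyxp = {'z': 0.2,  'zp': 0.6}    # P(y_x' | Z)

# Tian-Pearl bounds from the marginals P(y_x), P(y_x')
PYX  = sum(pz[z] * pyx[z]  for z in pz)      # P(y_x)  = 0.5
PYXP = sum(pz[z] * pyxp[z] for z in pz)      # P(y_x') = 0.4
tian_pearl = (max(0.0, PYX - PYXP), min(PYX, 1.0 - PYXP))

# covariate-improved bounds (theorem 4, experimental terms only)
cov = (sum(pz[z] * max(0.0, pyx[z] - pyxp[z]) for z in pz),
       sum(pz[z] * min(pyx[z], 1.0 - pyxp[z]) for z in pz))

# z-specific bounds for a person of known ancestry
zspec = {z: (max(0.0, pyx[z] - pyxp[z]), min(pyx[z], 1.0 - pyxp[z]))
         for z in pz}
```

This yields `tian_pearl` ≈ (0.1, 0.5), `cov` ≈ (0.275, 0.5), `zspec['z']` ≈ (0.55, 0.75), and `zspec['zp']` ≈ (0, 0.25), matching the list above.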
As expected, using the causal diagram and ancestral Z yields narrower bounds than the Tian-Pearl bounds. However,
it’s surprising that knowing a person has either ancestry z or z ′ gives us bounds outside of our new bounds. In fact, they
are completely outside the wider Tian-Pearl bounds. This is discussed in section 6.
In the meantime, it’s important to recognize that the last two ancestry-specific PNS bounds are what should be referred
to if an individual knows their ancestry. The covariate-improved PNS bounds should only be referred to if a person’s
ancestry is unknown. This might be because the person was adopted with no hint as to whether they’re from ancestry z
or z ′ (physical features are right in between or indistinguishable).
5 SIMULATION RESULTS
In this section, we illustrate that the bounds of PNS are improved in four different simple causal diagrams.
The first causal diagram is the simplest one, figure 1a, with binary Z satisfying the back-door criterion. The second
causal diagram is figure 4, where {Z1, Z2} satisfies the back-door criterion and both Z1 and Z2 are binary. The third
causal diagram is figure 5, with six subsets of {Z1, Z2, Z3} satisfying the back-door criterion. The last causal diagram
is figure 1a again, except that Z has 1024 instantiations.
For each of the causal diagrams, we randomly generated 100000 sample distributions compatible with the causal
diagram. We compared the average increased lower bound (i.e., lower bound with causal diagram - lower bound without
causal diagram), the average decreased upper bound (i.e., upper bound without causal diagram - upper bound with
causal diagram), the average gap without causal diagram (i.e., upper bound without causal diagram - lower bound
without causal diagram), and the average gap with causal diagram (i.e., upper bound with causal diagram - lower bound
with causal diagram). The results are summarized in table 2. For each of the causal diagrams, we then randomly picked
100 of the 100,000 sample distributions to plot the bounds with and without the causal diagram. The results are in
figures 6 to 9.
[Figure 4: causal diagram with covariates Z1 and Z2.]

[Figure 5: causal diagram with covariates Z1, Z2, and Z3.]
From figures 6 to 9, we see that the bounds of PNS are improved in most of the samples with each causal diagram.
Surprisingly, we see in table 2 that the average gap without a causal diagram fluctuates substantially (between 0.219
and 0.483) across causal diagrams, whereas the average gap with a causal diagram is consistently 0.166 among all
causal diagrams. If a set Z satisfying theorem 4 is available, the resulting PNS bounds are useful: the gap is narrow,
with low variability across causal structures.
6 DISCUSSION
We have shown that knowledge of a causal structure enables narrower PNS bounds to be estimated, compared with the
tight bounds of Tian and Pearl, which were derived without such knowledge. However, it must be emphasized that this
narrowing applies only to individuals who cannot determine their relevant characteristics at decision time. If their Z
values are known, the bounds of equations 4 and 5, conditioned on those values, should be consulted. Example 4.3
provides a scenario where people who know their ancestry have very different PNS bounds than people who don’t
know their ancestry. One would expect the additional information of ancestral knowledge to further narrow the
[Figure 6: Bounds of PNS in figure 1a, where general bounds are obtained from equations 4 and 5 and the bounds with causal diagram are obtained by theorem 5.]

[Figure 7: Bounds of PNS in figure 4, where general bounds are obtained from equations 4 and 5 and the bounds with causal diagram are obtained by theorem 5.]

[Figure 8: Bounds of PNS in figure 5, where general bounds are obtained from equations 4 and 5 and the bounds with causal diagram are obtained by theorem 5.]
                                      Fig 1a   Fig 4   Fig 5   Fig 1a with
                                                               nonbinary Z
Average increased lower bound          0.026    0.050   0.032   0.158
Average decreased upper bound          0.026    0.050   0.032   0.158
Average gap without causal diagram     0.219    0.267   0.231   0.483
Average gap with causal diagram        0.166    0.166   0.166   0.166

Table 2: Bounds of PNS with and without causal diagram
[Figure 9: Bounds of PNS in figure 1a with Z having 1024 instantiations, where general bounds are obtained from equations 4 and 5 and the bounds with causal diagram are obtained by theorem 5.]
bounds, but they change the bounds to a different non-overlapping range. This violates the heuristic that additional
information should narrow the bounds or, at worst, not widen them. To rephrase, if you don’t know someone’s ancestry,
the probability they benefit from this drug is between 0.275 and 0.5. Once you acquire the additional information that
the person is of ancestry z, the probability they benefit from this treatment becomes between 0.55 and 0.75. How is
this possible? Was the person's probability of benefiting never really between the 0.275 and 0.5 that we calculated
before knowing their ancestry?
The reason for this seeming inconsistency is that we’re asking different questions. When we didn’t know the ancestry,
we were asking, “what is the probability of benefiting for a person regardless of ancestry?” When we found out the
person is of ancestry z, we then asked a different question, “what is the probability of benefiting for a person of ancestry
z?” The additional information of the person’s ancestry didn’t help the first question and the second question isn’t
answerable without the additional information.
The following example will illuminate the reasons for this phenomenon [Pearl, 2009, p. 296]. Let the covariate Z
stand for the outcome of a fair coin toss, so P(Z = heads) = 0.5. Without knowing what treatment X and success Y
represent, let's assume the following measurements are taken:

P(y_x) = 0.5,    P(y_{x′}) = 0.5,
P(y_x|Z = heads) = 1,    P(y_{x′}|Z = heads) = 0,
P(y_x|Z = tails) = 0,    P(y_{x′}|Z = tails) = 1.
The Tian-Pearl bounds give us 0 ≤ PNS ≤ 0.5, while the bounds utilizing Z give 0.5 ≤ PNS ≤ 0.5, i.e., PNS = 0.5.
Now, let's uncover the functional mechanism: x represents betting $1 on heads, x′ represents betting $1 on tails, y
represents winning $1, and y′ represents losing $1. It should now be clear why P(y_x) = P(y_{x′}) = 0.5. Without
knowing the coin toss result, Z, the odds of winning $1 are 50/50 whether you bet on heads or tails. PNS is also 0.5
because benefiting from betting on heads is true only when the coin toss was heads. The coin toss is heads 50% of the
time.
This brings us back to the PNS bounds when we have the additional information of what the coin toss result was. If we
know the coin toss resulted in heads, then the probability of benefiting from betting on heads is 100%. Similarly, if
we know the coin toss resulted in tails, then the probability of benefiting from betting on heads is 0%. In other words
PNS(heads) = 1 and PNS(tails) = 0. If the coin toss is heads, winning only happens when betting on heads. Even
though the bounds are completely different when we are provided with the very useful additional information of the coin
toss, there is clearly no contradiction here. There was a 50% probability of benefiting from betting on heads when we
didn’t know the coin toss result and a 100% probability of benefiting from betting on heads when we knew the coin toss
resulted in heads. We were asking two separate questions. The first question was, “what is the probability of benefiting
regardless of coin toss result?” The second question was, “what is the probability of benefiting for a coin toss of heads?”
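For concreteness, the coin-toss arithmetic can be checked in a few lines (variable names are ours):

```python
# fair coin: Z in {heads, tails}; x = bet on heads, x' = bet on tails
pz   = {'H': 0.5, 'T': 0.5}
pyx  = {'H': 1.0, 'T': 0.0}      # P(y_x  | Z): win $1 betting on heads
pyxp = {'H': 0.0, 'T': 1.0}      # P(y_x' | Z): win $1 betting on tails

PYX  = sum(pz[z] * pyx[z]  for z in pz)      # P(y_x)  = 0.5
PYXP = sum(pz[z] * pyxp[z] for z in pz)      # P(y_x') = 0.5
tian_pearl = (max(0.0, PYX - PYXP), min(PYX, 1.0 - PYXP))        # (0.0, 0.5)

# bounds using the coin toss Z (theorem 4, experimental terms)
with_z = (sum(pz[z] * max(0.0, pyx[z] - pyxp[z]) for z in pz),
          sum(pz[z] * min(pyx[z], 1.0 - pyxp[z]) for z in pz))   # (0.5, 0.5)

# z-specific bounds collapse to point estimates
pns_heads = (max(0.0, pyx['H'] - pyxp['H']),
             min(pyx['H'], 1.0 - pyxp['H']))                     # (1.0, 1.0)
pns_tails = (max(0.0, pyx['T'] - pyxp['T']),
             min(pyx['T'], 1.0 - pyxp['T']))                     # (0.0, 0.0)
```

The Z-utilizing bounds pin PNS at 0.5 while the toss-specific bounds give 1 and 0, exactly the two answers discussed above.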
7 CONCLUSION
In this work, we have developed a graphical method of learning individualized functions (representing PNS, PN, and
PS) from population data, based on the structure of the causal graph. This generalizes both the PN bounds derived in
[Dawid et al., 2017] and those derived in [Tian and Pearl, 2000]. Often these functions return bounds as opposed to
point estimates. Nevertheless, these bounds can be tremendously informative. Machine learning algorithms need to
incorporate these techniques in order to understand, interpret, and apply the underlying probabilities of causation of
their data. Identifying causes of effects and decision making benefit greatly from using population data for individual
cases.
References
[Balke and Pearl, 1997] Balke, A. and Pearl, J. (1997). Probabilistic Counterfactuals: Semantics, Computation, and
Applications. PhD thesis, University of California, Los Angeles.
[Balke and Pearl, 2013] Balke, A. and Pearl, J. (2013). Counterfactuals and policy analysis in structural models. arXiv
preprint arXiv:1302.4929.
[Cheng, 1997] Cheng, P. W. (1997). From covariation to causation: A causal power theory. Psychological review,
104(2):367.
[Dawid et al., 2017] Dawid, P., Musio, M., and Murtas, R. (2017). The probability of causation. Law, Probability and
Risk, (16):163–179.
[Galles and Pearl, 1998] Galles, D. and Pearl, J. (1998). An axiomatic characterization of causal counterfactuals.
Foundations of Science, 3(1):151–182.
[Glymour, 2013] Glymour, C. (2013). Psychological and normative theories of causal power and the probabilities of
causes. arXiv preprint arXiv:1301.7377.
[Halpern, 2000] Halpern, J. Y. (2000). Axiomatizing causal reasoning. Journal of Artificial Intelligence Research,
12:317–337.
[Khoury et al., 1989] Khoury, M. J., Flanders, W. D., Greenland, S., and Adams, M. J. (1989). On the measurement of
susceptibility in epidemiologic studies. American Journal of Epidemiology, 129(1):183–190.
[Koller and Friedman, 2009] Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and
Techniques. MIT press.
[Li and Pearl, 2019] Li, A. and Pearl, J. (2019). Unit selection based on counterfactual logic. In Proceedings of the
28th International Joint Conference on Artificial Intelligence, pages 1793–1799. AAAI Press.
[Mueller and Pearl, 2020] Mueller, S. and Pearl, J. (2020). Which Patients are in Greater Need: A counterfactual
analysis with reflections on COVID-19. https://ucla.in/39Ey8sU+.
[Pearl, 1993] Pearl, J. (1993). Aspects of graphical models connected with causality. Proceedings of the 49th Session
of the International Statistical Institute, Italy, pages 399–401.
[Pearl, 1995] Pearl, J. (1995). Causal diagrams for empirical research. Biometrika, 82(4):669–688.
[Pearl, 1999] Pearl, J. (1999). Probabilities of Causation: Three counterfactual interpretations and their identification.
Synthese, 121(1-2):93–149.
[Pearl, 2009] Pearl, J. (2009). Causality. Cambridge University Press, Second edition.
[Spirtes et al., 2000] Spirtes, P., Glymour, C. N., Scheines, R., and Heckerman, D. (2000). Causation, Prediction, and
Search. MIT press.
[Tian and Pearl, 2000] Tian, J. and Pearl, J. (2000). Probabilities of causation: Bounds and identification. Annals of
Mathematics and Artificial Intelligence, 28(1-4):287–313.
A Proof of Theorem 4
Proof.
PNS = P(y_x, y′_{x′}) = Σ_z P(y_x, y′_{x′}|z) × P(z)    (16)
From [Li and Pearl, 2019], we have the z-specific PNS as follows:
max{0, P(y_x|z) − P(y_{x′}|z), P(y|z) − P(y_{x′}|z), P(y_x|z) − P(y|z)} ≤ z-PNS    (17)

z-PNS ≤ min{P(y_x|z), P(y′_{x′}|z), P(y, x|z) + P(y′, x′|z), P(y_x|z) − P(y_{x′}|z) + P(y, x′|z) + P(y′, x|z)}    (18)

Substituting 17 and 18 into 16, theorem 4 holds.

Note that since we have

Σ_z max{0, P(y_x|z) − P(y_{x′}|z), P(y|z) − P(y_{x′}|z), P(y_x|z) − P(y|z)} × P(z) ≥ Σ_z 0 × P(z) = 0,

Σ_z max{0, P(y_x|z) − P(y_{x′}|z), P(y|z) − P(y_{x′}|z), P(y_x|z) − P(y|z)} × P(z) ≥ Σ_z [P(y_x|z) − P(y_{x′}|z)] × P(z) = P(y_x) − P(y_{x′}),

Σ_z max{0, P(y_x|z) − P(y_{x′}|z), P(y|z) − P(y_{x′}|z), P(y_x|z) − P(y|z)} × P(z) ≥ Σ_z [P(y|z) − P(y_{x′}|z)] × P(z) = P(y) − P(y_{x′}),

Σ_z max{0, P(y_x|z) − P(y_{x′}|z), P(y|z) − P(y_{x′}|z), P(y_x|z) − P(y|z)} × P(z) ≥ Σ_z [P(y_x|z) − P(y|z)] × P(z) = P(y_x) − P(y),
then the lower bound in theorem 4 is guaranteed to be no worse than the Tian-Pearl lower bound in equation 4. Similarly,
the upper bound in theorem 4 is guaranteed to be no worse than the Tian-Pearl upper bound in equation 5. Also note
that, since Z does not contain a descendant of X, the term P (yx |z) refers to experimental data under population z.
B Proof of Theorem 5
Proof. Since Z satisfies the back-door criterion, equations 8 and 9 still hold and P(y_x|z) = P(y|x, z),
P(y_{x′}|z) = P(y|x′, z), and P(y′_{x′}|z) = P(y′|x′, z). We further have
P(y_x|z) − P(y_{x′}|z)
= P(y|x, z) − P(y|x′, z)
≥ [P(y|x, z) − P(y|x′, z)] × P(x|z)
= P(y|x, z) × P(x|z) − P(y|x′, z) × (1 − P(x′|z))
= P(y, x|z) + P(y, x′|z) − P(y|x′, z)
= P(y|z) − P(y|x′, z)
= P(y|z) − P(y_{x′}|z)    (19)

and

P(y_x|z) − P(y_{x′}|z)
= P(y|x, z) − P(y|x′, z)
≥ [P(y|x, z) − P(y|x′, z)] × P(x′|z)
= P(y|x, z) × (1 − P(x|z)) − P(y|x′, z) × P(x′|z)
= P(y|x, z) − P(y, x|z) − P(y, x′|z)
= P(y|x, z) − P(y|z)
= P(y_x|z) − P(y|z).    (20)
With equations 19 and 20, equation 8 reduces to equation 10 in theorem 5.
We also have

min{P(y_x|z), P(y′_{x′}|z)}
= min{P(y|x, z), P(y′|x′, z)}
≤ P(y|x, z) × P(x|z) + P(y′|x′, z) × (1 − P(x|z))
= P(y|x, z) × P(x|z) + P(y′|x′, z) × P(x′|z)
= P(y, x|z) + P(y′, x′|z)    (21)

and

min{P(y_x|z), P(y′_{x′}|z)}
= min{P(y|x, z), P(y′|x′, z)}
≤ P(y|x, z) × (1 − P(x|z)) + P(y′|x′, z) × P(x|z)
= P(y|x, z) × (1 − P(x|z)) + P(y′|x′, z) × (1 − P(x′|z))
= P(y|x, z) − P(y, x|z) + P(y′|x′, z) − P(y′, x′|z)
= P(y|x, z) − P(y|x′, z) + P(y, x′|z) + P(y′, x|z)
= P(y_x|z) − P(y_{x′}|z) + P(y, x′|z) + P(y′, x|z).    (22)

With equations 21 and 22, equation 9 reduces to equation 11 in theorem 5.
C Proof of Theorem 6
Proof.

PNS = P(y_x, y′_{x′})
= Σ_z Σ_{z′} P(y_x, y′_{x′}, z_x, z′_{x′})
= Σ_z Σ_{z′} P(y_x, y′_{x′} | z_x, z′_{x′}) × P(z_x, z′_{x′})
≤ Σ_z Σ_{z′} min{P(y_x | z_x, z′_{x′}), P(y′_{x′} | z_x, z′_{x′})} × min{P(z_x), P(z′_{x′})}
= Σ_z Σ_{z′} min{P(y_x | z_x), P(y′_{x′} | z′_{x′})} × min{P(z_x), P(z′_{x′})}    (23)
= Σ_z Σ_{z′} min{P(y | z_x, x), P(y′ | z′_{x′}, x′)} × min{P(z_x), P(z′_{x′})}    (24)
= Σ_z Σ_{z′} min{P(y | z, x), P(y′ | z′, x′)} × min{P(z_x), P(z′_{x′})}.

Combined with the Tian-Pearl bounds in equations 4 and 5, theorem 6 holds. Note that equation 23 is due to
Y_x ⊥⊥ Z_{x′} | Z_x and Y_{x′} ⊥⊥ Z_x | Z_{x′}. Equation 24 is due to ∀x, Y_x ⊥⊥ X | Z_x.
D Proof of Theorem 7
Proof. First we show that in graph G, if an individual is a complier from X to Y, then Z_x and Z_{x′} must have different
values. This is because the structural equations for Y and Z are f_y(z, u_y) and f_z(x, u_z), respectively. If an individual
has the same Z_x and Z_{x′} value, then f_z(x, u_z) = f_z(x′, u_z). This means f_y(f_z(x, u_z), u_y) = f_y(f_z(x′, u_z), u_y), i.e.,
Y_x and Y_{x′} must have the same value. Thus this individual is not a complier. Therefore,

PNS = P(y_x, y′_{x′})
= Σ_z Σ_{z′≠z} P(y_z, y′_{z′}) × P(z_x, z′_{x′})
≤ Σ_z Σ_{z′≠z} min{P(y_z), P(y′_{z′})} × min{P(z_x), P(z′_{x′})}
= Σ_z Σ_{z′≠z} min{P(y|z), P(y′|z′)} × min{P(z|x), P(z′|x′)}.

Combined with the Tian-Pearl bounds in equations 4 and 5, theorem 7 holds.
E Simulation Algorithm
We used the following algorithm to generate samples and conduct the simulations in section 5:
Algorithm 1: Generate PNS simulation data with theorem 5
input :Number of output samples n
Causal diagram G
Covariates to condition on Z
output :List of 4-tuples consisting of general lower bound, lower bound with causal graph, upper bound with
causal graph, and general upper bound
begin
for i ← 1 to n do
cpt ← generate-cpt(G,random-uniform) ;
// Lower/upper Tian-Pearl bounds
lb, ub ← pns-bounds(cpt);
// Lower/upper bounds with graph
lb_graph, ub_graph ← pns-graph(cpt, Z);
append-result(lb, lb_graph, ub_graph, ub);
end
end
Procedure generate-cpt
input :n causal diagram nodes (X1 , ..., Xn )
Distribution D
output :n conditional probability tables for P (Xi |P arents(Xi ))
begin
for i ← 1 to n do
s ← num-instantiates(Xi )
p ← num-instantiates(P arents(Xi ))
for k ← 1 to p do
sum ← 0
for j ← 1 to s do
aj ← sample(D)
sum ← sum + aj
end
for j ← 1 to s do
P (xij |P arents(Xi )k ) ← aj /sum
end
end
end
end
For figure 4, the binary variables Z1 and Z2 were treated as a single covariate with 4 instantiations. Similarly, figure 5's Z
variables were treated as a single covariate with 8 instantiations.
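For reference, the simulation loop for figure 1a can be sketched in Python (naming is ours; as a simplification of generate-cpt, each binary conditional probability is drawn uniformly at random):

```python
import random

def tian_pearl(pyx, pyxp, py, pxy, pxyp, pxpy, pxpyp):
    """General PNS bounds (equations 4 and 5)."""
    lo = max(0.0, pyx - pyxp, py - pyxp, pyx - py)
    hi = min(pyx, 1.0 - pyxp, pxy + pxpyp, pyx - pyxp + pxyp + pxpy)
    return lo, hi

def theorem5(pz, py_xz, py_xpz):
    """PNS bounds with the causal diagram (equations 10 and 11)."""
    lo = sum(pz[z] * max(0.0, py_xz[z] - py_xpz[z]) for z in pz)
    hi = sum(pz[z] * min(py_xz[z], 1.0 - py_xpz[z]) for z in pz)
    return lo, hi

def average_gaps(n=10_000, seed=0):
    """Average bound gaps with and without the diagram of figure 1a."""
    rng = random.Random(seed)
    gap_plain = gap_graph = 0.0
    for _ in range(n):
        # random CPTs for figure 1a: P(z), P(x|z), P(y|x,z), Z binary
        q = rng.random()
        pz = {0: 1.0 - q, 1: q}
        px_z = {z: rng.random() for z in pz}
        py_xz = {(x, z): rng.random() for x in (0, 1) for z in pz}
        # causal effects via the adjustment formula (Z is a back-door set)
        pyx  = sum(py_xz[1, z] * pz[z] for z in pz)
        pyxp = sum(py_xz[0, z] * pz[z] for z in pz)
        # observational joint over (X, Y)
        pxy   = sum(py_xz[1, z] * px_z[z] * pz[z] for z in pz)
        pxyp  = sum((1 - py_xz[1, z]) * px_z[z] * pz[z] for z in pz)
        pxpy  = sum(py_xz[0, z] * (1 - px_z[z]) * pz[z] for z in pz)
        pxpyp = sum((1 - py_xz[0, z]) * (1 - px_z[z]) * pz[z] for z in pz)
        lo, hi = tian_pearl(pyx, pyxp, pxy + pxpy, pxy, pxyp, pxpy, pxpyp)
        lo_g, hi_g = theorem5(pz, {z: py_xz[1, z] for z in pz},
                              {z: py_xz[0, z] for z in pz})
        gap_plain += hi - lo
        gap_graph += hi_g - lo_g
    return gap_plain / n, gap_graph / n
```

Under this sampling scheme the average gap with the diagram comes out close to the 0.166 reported in table 2; the exact figure depends on how the CPTs are sampled.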