CAUSES OF EFFECTS: LEARNING INDIVIDUAL RESPONSES FROM POPULATION DATA

arXiv:2104.13730v2 [stat.ME] 2 May 2021

Ang Li
University of California, Los Angeles
Computer Science Department
angli@cs.ucla.edu

Scott Mueller
University of California, Los Angeles
Computer Science Department
scott@cs.ucla.edu

Judea Pearl
University of California, Los Angeles
Computer Science Department
judea@cs.ucla.edu

April 28, 2021

ABSTRACT

The problem of individualization is recognized as crucial in almost every field. Identifying causes of effects in specific events is likewise essential for accurate decision making. However, such estimates invoke counterfactual relationships, and are therefore indeterminable from population data. For example, the probability of benefiting from a treatment concerns an individual having a favorable outcome if treated and an unfavorable outcome if untreated. Experiments conditioning on fine-grained features are fundamentally inadequate because we can't test both possibilities for an individual. Tian and Pearl provided bounds on this and other probabilities of causation using a combination of experimental and observational data. Even though those bounds were proven tight, narrower bounds, sometimes significantly so, can be achieved when structural information is available in the form of a causal model. This has the power to solve central problems, such as explainable AI, legal responsibility, and personalized medicine, all of which demand counterfactual logic. We analyze and expand on existing research by applying bounds to the probability of necessity and sufficiency (PNS) along with graphical criteria and practical applications.

1 INTRODUCTION

Machine learning advances have enabled tremendous capabilities of learning functions accurately and efficiently from enormous quantities of data.
These functions allow for better policies, like whether a surgery, chemotherapy, or radiation therapy is most effective for a population of given characteristics such as age, sex, and type of symptoms. However, this mapping from characteristics to efficacy can be quite misleading when applied to individual decision making, even when the data originate from a randomized controlled trial (RCT).

To see why, let's follow the example treated in [Mueller and Pearl, 2020]. Imagine a novel vaccine for a deadly virus in the midst of a pandemic is in short supply. We want to administer the vaccine to people most likely to benefit from it. In other words, we need to identify the group most likely to both survive if vaccinated and succumb if unvaccinated. A clinical study is conducted to test the effectiveness of the vaccine. A machine learning algorithm trained on data from this RCT learns a correlation between age and recovery. For simplicity, let's assume a binary age classification: sixty years old and under, and over sixty years old. Older people survive 57% of the time when vaccinated and 37% of the time when unvaccinated, while younger people survive 55% of the time when vaccinated and 45% of the time when unvaccinated. A naive interpretation is that the vaccine is 20 − 10 = 10 percentage points more effective for older people.

Before deciding to vaccinate the elders, we need to assess the percentage of elderly patients who would actually benefit from the treatment (PNS) and compare it to the percentage of beneficiaries among the young. Such an assessment requires counterfactual analysis such as in [Tian and Pearl, 2000] and, based on the data above, yields the following bounds: the probability of over-sixties benefiting from the vaccine is between 20% and 57%, while the under-sixties' probability is between 10% and 55%. We see that it's anything but clear which group should be vaccinated first.
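As a rough illustration of the arithmetic behind these intervals, the sketch below assumes that only the RCT survival rates are available and applies the experimental-data form of the Tian and Pearl bounds, max{0, P(y_x) − P(y_x')} ≤ PNS ≤ min{P(y_x), 1 − P(y_x')}. The function name `experimental_pns_bounds` and the dictionary layout are illustrative choices, not part of the original analysis.

```python
def experimental_pns_bounds(p_survive_treated, p_survive_untreated):
    """Bounds on PNS when only experimental (RCT) data are available.

    lower = max(0, P(y_x) - P(y_x')), upper = min(P(y_x), 1 - P(y_x')).
    """
    lower = max(0.0, p_survive_treated - p_survive_untreated)
    upper = min(p_survive_treated, 1.0 - p_survive_untreated)
    return lower, upper


# Survival rates quoted in the vaccine example: (vaccinated, unvaccinated).
rct_survival = {
    "over sixty": (0.57, 0.37),
    "sixty and under": (0.55, 0.45),
}

for group, (treated, untreated) in rct_survival.items():
    lower, upper = experimental_pns_bounds(treated, untreated)
    print(f"{group}: PNS between {lower:.0%} and {upper:.0%}")
```

The two intervals overlap almost entirely, which is why the RCT data alone cannot settle which group to prioritize.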
What is more remarkable is that these bounds can be narrowed significantly if data from observational studies are also available, and the priority may even flip from the elderly to the young. Observational studies reflect individuals' willingness to get vaccinated in the two age groups. In our example, one can show that the bounds for over-sixties and under-sixties may become [20%, 40%] and [40%, 55%] respectively, thus reversing the naive priorities above.

Clearly, when a subpopulation with a particular set of characteristics is analyzed for PNS, those covariates should be conditioned on. However, this is not always possible, as in the case of ancestral knowledge or a mediating effect of a vaccine. We may have data on the population, but not at the level an individual can make a decision from. After all, the individual doesn't know a potential side-effect of the vaccine until after it's been administered. We present a method to potentially obtain narrower bounds by utilizing population-level data and mild structural assumptions.

Since Tian and Pearl [Tian and Pearl, 2000], the problem of bounding probabilities of causation has been analyzed by combining only two sources of information: experimental data and observational studies. It's surprising that knowing the structure of the causal graph allows us to narrow the bounds, despite the fact that the graph may seem redundant; i.e., we already know the causal effects. Moreover, the graph adds information about an individual, although it describes properties of the population. The analysis of causes of effects can now take advantage of the causal diagram.

2 PRELIMINARIES AND RELATED WORK

In this section, we review the definitions for the three aspects of causation as defined in [Pearl, 1999].
We use causal diagrams [Pearl, 1995, Spirtes et al., 2000, Pearl, 2009, Koller and Friedman, 2009] and the language of counterfactuals in its structural model semantics, as given in [Balke and Pearl, 2013, Galles and Pearl, 1998, Halpern, 2000].

We use $Y_x = y$ to denote the counterfactual sentence "Variable $Y$ would have the value $y$, had $X$ been $x$". For simplicity, in the rest of the paper we use $y_x$ to denote the event $Y_x = y$, $y_{x'}$ to denote the event $Y_{x'} = y$, $y'_x$ to denote the event $Y_x = y'$, and $y'_{x'}$ to denote the event $Y_{x'} = y'$. For notational simplicity, we limit the discussion to binary $X$ and $Y$; extensions to multi-valued variables are straightforward [Pearl, 2009].

Three prominent probabilities of causation are the following:

Definition 1 (Probability of necessity (PN)). Let $X$ and $Y$ be two binary variables in a causal model $M$, let $x$ and $y$ stand for the propositions $X = true$ and $Y = true$, respectively, and $x'$ and $y'$ for their complements. The probability of necessity is defined as the expression [Pearl, 1999]

$$\mathrm{PN} \triangleq P(Y_{x'} = false \mid X = true, Y = true) \triangleq P(y'_{x'} \mid x, y) \tag{1}$$

In other words, PN stands for the probability that event $y$ would not have occurred in the absence of event $x$, given that $x$ and $y$ did in fact occur. Note that lower case letters (e.g., $x$, $y$) stand for propositions (or events).

PN has applications in epidemiology, legal reasoning, and artificial intelligence. Epidemiologists have long been concerned with estimating the probability that a certain case of disease is attributable to a particular exposure, which is normally interpreted counterfactually as "the probability that disease would not have occurred in the absence of exposure, given that disease and exposure did in fact occur." This counterfactual notion is also used frequently in lawsuits, where legal responsibility is at the center of contention.

Definition 2 (Probability of sufficiency (PS)).
[Pearl, 1999]

$$\mathrm{PS} \triangleq P(y_x \mid y', x') \tag{2}$$

PS finds applications in policy analysis, artificial intelligence, and psychology. A policy maker may well be interested in the dangers that a certain exposure may present to the healthy population [Khoury et al., 1989]. Counterfactually, this notion is expressed as the "probability that a healthy unexposed individual would have gotten the disease had he/she been exposed." In psychology, PS serves as the basis for Cheng's [Cheng, 1997] causal power theory [Glymour, 2013], which attempts to explain how humans judge causal strength among events. In artificial intelligence, PS plays a major role in the generation of explanations [Pearl, 2009].

Definition 3 (Probability of necessity and sufficiency (PNS)). [Pearl, 1999]

$$\mathrm{PNS} \triangleq P(y_x, y'_{x'}) \tag{3}$$

PNS stands for the probability that $y$ would respond to $x$ both ways, and therefore measures both the sufficiency and necessity of $x$ to produce $y$.

Tian and Pearl [Tian and Pearl, 2000] provide tight bounds for PNS, PN, and PS without a causal diagram using Balke's program [Balke and Pearl, 1997]. Li and Pearl [Li and Pearl, 2019] provide a theoretical proof of the tight bounds for PNS, PS, PN, and other probabilities of causation without a causal diagram. PNS, PN, and PS have the following tight bounds:

$$\max\left\{\begin{array}{c} 0, \\ P(y_x) - P(y_{x'}), \\ P(y) - P(y_{x'}), \\ P(y_x) - P(y) \end{array}\right\} \le \mathrm{PNS} \tag{4}$$

$$\mathrm{PNS} \le \min\left\{\begin{array}{c} P(y_x), \\ P(y'_{x'}), \\ P(x, y) + P(x', y'), \\ P(y_x) - P(y_{x'}) + P(x, y') + P(x', y) \end{array}\right\} \tag{5}$$

$$\max\left\{\begin{array}{c} 0, \\ \dfrac{P(y) - P(y_{x'})}{P(x, y)} \end{array}\right\} \le \mathrm{PN} \tag{6}$$

$$\mathrm{PN} \le \min\left\{\begin{array}{c} 1, \\ \dfrac{P(y'_{x'}) - P(x', y')}{P(x, y)} \end{array}\right\} \tag{7}$$

Note that we only consider PNS and PN here because the bounds of PS can be easily obtained by exchanging $x$ with $x'$ and $y$ with $y'$ in the bounds of PN.

To obtain bounds for a specific population, defined by a set $C$ of characteristics, the expressions above should be modified by conditioning each term on $C = c$.
Conditioning on $C = c$ would normally yield narrower bounds because, when $C$ is not affected by $X$, it reduces variations among units in the subpopulation considered. In this paper, however, we obtain narrower bounds of PNS by leveraging another source of knowledge: the causal diagram behind the data, together with measurements of a set $Z$ of covariates in that diagram. We provide graphical conditions under which the availability of such measurements would improve the bounds and demonstrate, both analytically and by simulation, the degree of improvement achieved. Narrower bounds and graphical criteria can be obtained for PN and PS through the same mechanism detailed in the proofs in the appendix.

3 BOUNDS WITH CAUSAL DIAGRAM

3.1 With additional covariate Z

3.1.1 Non-descendant Z

Theorems 4 and 5 below provide bounds for PNS when a set $Z$ of variables can be measured that satisfies only one simple condition: $Z$ contains no descendant of $X$. This condition is important because if $X$ were set to $x$ and $Z$ contained a descendant of $X$, then $Z$ could be altered as well and $P(y_x \mid z)$ would be unmeasurable. If the descendant is independent of $Y_x$, then $P(y_x \mid z)$ would be measurable, but that descendant wouldn't contribute to any narrowing of bounds. These bounds are always contained within the Tian-Pearl bounds of equations 4 and 5.

Theorem 4. Given a causal diagram $G$ and distribution compatible with $G$, let $Z$ be a set of variables that does not contain any descendant of $X$ in $G$, then PNS is bounded as follows:

$$\sum_z \max\left\{\begin{array}{c} 0, \\ P(y_x \mid z) - P(y_{x'} \mid z), \\ P(y \mid z) - P(y_{x'} \mid z), \\ P(y_x \mid z) - P(y \mid z) \end{array}\right\} \times P(z) \le \mathrm{PNS} \tag{8}$$

$$\sum_z \min\left\{\begin{array}{c} P(y_x \mid z), \\ P(y'_{x'} \mid z), \\ P(y, x \mid z) + P(y', x' \mid z), \\ P(y_x \mid z) - P(y_{x'} \mid z) + P(y, x' \mid z) + P(y', x \mid z) \end{array}\right\} \times P(z) \ge \mathrm{PNS} \tag{9}$$

Proof. See Appendix.

Note that, unlike the population-specific bounds, where each term was conditioned on $C = c$, here $Z = z$ enters only some of the terms.
This is because the measurement of $Z$ is conducted in the study, but may not be available for the individual seeking advice. Examples are illustrated in Section 4.

Note also that if only experimental data are available (i.e., $P(Y)$, $P(Y, X)$, $P(Y \mid Z)$, $P(Y, X \mid Z)$ are not measured), arguments to the max or min functions involving observational data can be disregarded. For example, the lower bound of theorem 4 would become $\max\{P(y_x) - P(y_{x'}),\ \sum_z \max\{0, P(y_x \mid z) - P(y_{x'} \mid z)\} \times P(z)\}$.

3.1.2 Sufficient Covariate Z

[Figure 1: Z is not a descendant of X. (a) Confounder Z. (b) Outcome-affecting covariate Z.]

In figures 1a and 1b, $Z$ is not a descendant of $X$ and further satisfies the back-door criterion. For such cases the PNS bounds can be simplified to read:

Theorem 5. Given a causal diagram $G$ and distribution compatible with $G$, let $Z$ be a set of variables satisfying the back-door criterion [Pearl, 1993] in $G$, then the PNS is bounded as follows:

$$\sum_z \max\{0,\ P(y \mid x, z) - P(y \mid x', z)\} \times P(z) \le \mathrm{PNS} \tag{10}$$

$$\sum_z \min\{P(y \mid x, z),\ P(y' \mid x', z)\} \times P(z) \ge \mathrm{PNS} \tag{11}$$

Proof. See Appendix.

The significance of theorem 5 is due to the ability to compute bounds using purely observational data.

3.2 Mediation

3.2.1 Z as a PARTIAL MEDIATOR

[Figure 2: Mediator Z with direct effect (nodes U, Z, X, Y).]

In figure 2, $Z$ is a descendant of $X$, so we cannot use theorems 4 and 5. However, the absence of confounders between $Z$ and $Y$ and between $X$ and $Y$ permits us to bound PNS as follows:

Theorem 6. Given a causal diagram $G$ and distribution compatible with $G$, let $Z$ be a set of variables such that $\forall x, x' \in X : x \neq x',\ (Y_x \perp\!\!\!\perp X \cup Z_{x'} \mid Z_x)$ in $G$, then the PNS is bounded as follows:

$$\max\left\{\begin{array}{c} 0, \\ P(y_x) - P(y_{x'}), \\ P(y) - P(y_{x'}), \\ P(y_x) - P(y) \end{array}\right\} \le \mathrm{PNS} \tag{12}$$

$$\min\left\{\begin{array}{c} P(y_x), \\ P(y'_{x'}), \\ P(y, x) + P(y', x'), \\ P(y_x) - P(y_{x'}) + P(y, x') + P(y', x), \\ \sum_z \sum_{z'} \min\{P(y \mid z, x),\ P(y' \mid z', x')\} \times \min\{P(z_x),\ P(z'_{x'})\} \end{array}\right\} \ge \mathrm{PNS} \tag{13}$$

Proof. See Appendix.

Note that although this lower bound is unchanged from Tian and Pearl, the upper bound contains a vital additional argument to the min function. This new term can significantly reduce the upper bound. The rest of the terms are included because sometimes Tian and Pearl's bounds are superior. The following theorem has the same quality.

3.2.2 PURE MEDIATOR

[Figure 3: Mediator Z with no direct effect (nodes Z, X, Y).]

Figure 3 is a special case of figure 2, in which $X$ has no direct effect on $Y$. The resulting bounds for PNS read:

Theorem 7. Given the causal diagram $G$ in figure 3 and a distribution that is compatible with $G$, PNS is bounded as follows:

$$\max\left\{\begin{array}{c} 0, \\ P(y_x) - P(y_{x'}), \\ P(y) - P(y_{x'}), \\ P(y_x) - P(y) \end{array}\right\} \le \mathrm{PNS} \tag{14}$$

$$\min\left\{\begin{array}{c} P(y_x), \\ P(y'_{x'}), \\ P(y, x) + P(y', x'), \\ P(y_x) - P(y_{x'}) + P(y, x') + P(y', x), \\ \sum_z \sum_{z' \neq z} \min\{P(y \mid z),\ P(y' \mid z')\} \times \min\{P(z \mid x),\ P(z' \mid x')\} \end{array}\right\} \ge \mathrm{PNS} \tag{15}$$

Proof. See Appendix.

Notably, the core terms added to the upper bounds in theorems 6 and 7 require only observational data.

4 EXAMPLES

4.1 CREDIT TO THE TREATMENT

The manufacturer of a drug wants to claim that a non-trivial number of recovered patients who were given access to the drug owe their recovery to the drug. So they conduct an observational study in which they record the recovery rates of 700 patients. 464 patients chose to take the drug and 236 patients did not. The results of the study are in table 1. The manufacturer claims success for their drug because the overall recovery rate in the observational study is 68% among drug-takers versus 54% among non-drug-takers.
          Women                           Men                              Overall
Drug      1 out of 110 recovered (1%)     313 out of 354 recovered (88%)   314 out of 464 recovered (68%)
No Drug   13 out of 120 recovered (11%)   114 out of 116 recovered (98%)   127 out of 236 recovered (54%)

Table 1: Results of a drug study with gender taken into account

The recovered patients who should credit the drug for their recovery are those who would recover if they had taken the drug and would not recover if they had not taken the drug. The probability of this event is the PNS.

Let $X = x$ denote the event that the patient took the drug and $X = x'$ denote the event that the patient did not take the drug. Let $Y = y$ denote the event that the patient has recovered and $Y = y'$ denote the event that the patient has not recovered. Let $Z = z$ represent female patients and $Z = z'$ represent male patients.

Suppose we know an additional fact: estrogen has a negative effect on recovery, so women are less likely to recover than men, regardless of the drug. Additionally, as we can see from the data, men are significantly more likely to take the drug than women are. The causal diagram is shown in figure 1a.

Node $Z$ on the graph satisfies the back-door criterion, therefore we can compute the causal effects $P(y_x)$ and $P(y_{x'})$ via the adjustment formula [Pearl, 1993] and observational data from table 1, where

$$P(y_x) = \sum_z P(y \mid x, z) P(z) = 0.597,$$

$$P(y_{x'}) = \sum_z P(y \mid x', z) P(z) = 0.696,$$

$$P(y'_{x'}) = 1 - P(y_{x'}) = 0.304.$$

Therefore, the bounds of PNS computed using equations 4 and 5 are $0 \le \mathrm{PNS} \le 0.297$, where the diagram was used only to identify the causal effects $P(y_x)$ and $P(y_{x'})$. These bounds aren't informative enough to conclude whether or not the drug was the cause of recovery for a meaningful number of patients. They suggest that the fraction of beneficiaries can be as low as 0% or as high as 29.7%.

Now, consider the bounds in theorem 5, which take into account the position of $Z$ in the diagram.
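As a rough check on the numbers in this example, the following Python sketch recomputes the causal effects with the adjustment formula and evaluates both the Tian-Pearl bounds (equations 4 and 5) and the back-door bounds of theorem 5 (equations 10 and 11) directly from the counts in table 1. The variable names and the dictionary encoding of the table are illustrative assumptions, not part of the original study.

```python
# (recovered, total) counts from table 1, keyed by (treatment, sex).
counts = {
    ("drug", "women"): (1, 110),
    ("drug", "men"): (313, 354),
    ("no drug", "women"): (13, 120),
    ("no drug", "men"): (114, 116),
}
sexes = ("women", "men")
n_total = sum(total for _, total in counts.values())  # 700 patients

# P(z): share of each sex in the study.
p_z = {s: (counts[("drug", s)][1] + counts[("no drug", s)][1]) / n_total for s in sexes}
# P(y | x, z): recovery rate within each treatment-by-sex cell.
p_y_given = {cell: recovered / total for cell, (recovered, total) in counts.items()}

# Adjustment formula: P(y_x) = sum_z P(y | x, z) P(z).
p_yx = sum(p_y_given[("drug", s)] * p_z[s] for s in sexes)
p_yxp = sum(p_y_given[("no drug", s)] * p_z[s] for s in sexes)


def joint(treatment, recovered):
    """Observational joint probability P(treatment, recovery status)."""
    hits = sum(
        counts[(treatment, s)][0] if recovered
        else counts[(treatment, s)][1] - counts[(treatment, s)][0]
        for s in sexes
    )
    return hits / n_total


p_y = joint("drug", True) + joint("no drug", True)

# Tian-Pearl bounds (equations 4 and 5).
tp_lower = max(0.0, p_yx - p_yxp, p_y - p_yxp, p_yx - p_y)
tp_upper = min(
    p_yx,
    1.0 - p_yxp,
    joint("drug", True) + joint("no drug", False),
    p_yx - p_yxp + joint("drug", False) + joint("no drug", True),
)

# Back-door bounds of theorem 5 (equations 10 and 11).
bd_lower = sum(
    max(0.0, p_y_given[("drug", s)] - p_y_given[("no drug", s)]) * p_z[s] for s in sexes
)
bd_upper = sum(
    min(p_y_given[("drug", s)], 1.0 - p_y_given[("no drug", s)]) * p_z[s] for s in sexes
)

print(f"P(y_x) = {p_yx:.3f}, P(y_x') = {p_yxp:.3f}")
print(f"Tian-Pearl bounds: [{tp_lower:.3f}, {tp_upper:.3f}]")
print(f"Theorem 5 bounds:  [{bd_lower:.3f}, {bd_upper:.3f}]")
```

Small differences in the last digit relative to the values quoted in the text can arise from rounding.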
Since $Z$ satisfies the back-door criterion, we can use equations 10 and 11 to compute $0 \le \mathrm{PNS} \le 0.01$. The conclusion now is obvious. At most 7 out of 314 patients' recoveries can be credited to the drug. This is strong evidence that counters the manufacturer's claim.

4.2 INFLAMMATION MEDIATOR

As before, let $X$ and $Y$ represent drug consumption and recovery. Let $Z$ represent acute inflammation with $z$ being present and $z'$ being absent. In some people, the drug causes acute inflammation, which has adverse effects on recovery, so the causal structure is depicted in figure 3. We observe the following proportions among drug takers, non-takers, people with inflammation, and people without inflammation:

$$P(y \mid z) = 0.5, \quad P(y \mid z') = 0.5, \quad P(z \mid x) = 0.1, \quad P(z \mid x') = 0.1.$$

The Tian-Pearl PNS upper bound is:

$$\mathrm{PNS} \le \min\{P(y \mid x),\ P(y' \mid x')\} = 0.5.$$

Given that the lower bound is 0, these bounds are not very informative. If we knew that an individual would react to the drug with acute inflammation, we would only look at the data comprising people reacting to the drug with acute inflammation. Since we are conditioning on $z$, PNS = 0 because the outcome, $Y$, will have the same result regardless of whether the person consumed the drug. So knowing a person's inflammation response to the drug narrows PNS from a wide [0, 0.5] to a point estimate of 0.

Imagine, for this drug, that we can't know ahead of time how a person will react inflammation-wise. We can only observe acute inflammation after the drug is administered. Since we have population data from patients who have already taken the drug, we can utilize this mediator to bound the PNS for new patients who haven't yet taken the drug:

$$\mathrm{PNS} \le \min\left\{\begin{array}{c} P(y \mid z) \cdot P(z \mid x) + P(y \mid z') \cdot P(z \mid x'), \\ P(y \mid z) \cdot P(z' \mid x') + P(y \mid z') \cdot P(z' \mid x), \\ P(y' \mid z') \cdot P(z \mid x) + P(y' \mid z) \cdot P(z \mid x'), \\ P(y' \mid z') \cdot P(z' \mid x') + P(y' \mid z) \cdot P(z' \mid x) \end{array}\right\} = 0.1.$$
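The mediator term can also be evaluated directly from its definition in theorem 7, the double sum over $z \neq z'$ of $\min\{P(y \mid z), P(y' \mid z')\} \times \min\{P(z \mid x), P(z' \mid x')\}$. The sketch below does this for the proportions of the inflammation example. It follows the text in comparing only $\min\{P(y \mid x), P(y' \mid x')\}$ with the new term rather than every argument of equation 15, and the function name `mediator_upper_term` is a hypothetical helper introduced for illustration.

```python
from itertools import permutations


def mediator_upper_term(p_y_given_z, p_z_given_x, p_z_given_xp):
    """Sum over z != z' of min{P(y|z), P(y'|z')} * min{P(z|x), P(z'|x')}."""
    total = 0.0
    for z, z_other in permutations(p_y_given_z, 2):  # ordered pairs with z != z'
        outcome_term = min(p_y_given_z[z], 1.0 - p_y_given_z[z_other])
        mediator_term = min(p_z_given_x[z], p_z_given_xp[z_other])
        total += outcome_term * mediator_term
    return total


# Proportions from the inflammation example; "z" = inflammation, "z'" = none.
p_y_given_z = {"z": 0.5, "z'": 0.5}
p_z_given_x = {"z": 0.1, "z'": 0.9}   # P(z | x), P(z' | x)
p_z_given_xp = {"z": 0.1, "z'": 0.9}  # P(z | x'), P(z' | x')

# In figure 3 there is no confounding, so P(y | x) = sum_z P(y | z) P(z | x).
p_y_given_x = sum(p_y_given_z[z] * p_z_given_x[z] for z in p_y_given_z)
p_y_given_xp = sum(p_y_given_z[z] * p_z_given_xp[z] for z in p_y_given_z)

tian_pearl_upper = min(p_y_given_x, 1.0 - p_y_given_xp)
mediator_upper = min(
    tian_pearl_upper,
    mediator_upper_term(p_y_given_z, p_z_given_x, p_z_given_xp),
)
print(f"Tian-Pearl upper bound: {tian_pearl_upper:.2f}")
print(f"Mediator upper bound:   {mediator_upper:.2f}")
```

Writing the term as a sum over ordered pairs keeps the sketch valid for a mediator with more than two values, which the four-term expansion above does not cover.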
The mediator-improved PNS upper bound is significantly smaller than what the Tian-Pearl upper bound provides, 0.1 vs 0.5. The new upper bound can now be effectively weighed against other factors like cost and side-effects.

4.3 ANCESTRAL COVARIATE

Let's continue from the introduction, where $X$ represents vaccination, with $x$ being vaccinated and $x'$ being unvaccinated, and $Y$ represents survival, with $y$ being surviving and $y'$ being succumbing to the pandemic. Instead of classifying by age, let's assume our machine learning algorithm uncovers a correlation between survival and ancestry. Let $Z$ represent ancestry and, for simplicity, assume there are only two ancestries, $z$ and $z'$. Either graph of figure 1 is representative of this. Our RCT data reveals:

$$P(Z = z) = 0.5,$$

$$P(y_x \mid Z = z) = 0.75, \quad P(y_x \mid Z = z') = 0.25,$$

$$P(y_{x'} \mid Z = z) = 0.2, \quad P(y_{x'} \mid Z = z') = 0.6.$$

We now have four different bounds on PNS:

Tian-Pearl              ⟹  $0.1 \le \mathrm{PNS} \le 0.5$
Covariate-improved      ⟹  $0.275 \le \mathrm{PNS} \le 0.5$
Person has ancestry z   ⟹  $0.55 \le \mathrm{PNS} \le 0.75$
Person has ancestry z′  ⟹  $0 \le \mathrm{PNS} \le 0.25$

As expected, using the causal diagram and ancestral $Z$ yields narrower bounds than the Tian-Pearl bounds. However, it's surprising that knowing a person has either ancestry $z$ or $z'$ gives us bounds outside of our new bounds. In fact, the bounds for ancestry $z$ are completely outside even the wider Tian-Pearl bounds. This is discussed in section 6.

In the meantime, it's important to recognize that the last two ancestry-specific PNS bounds are what should be referred to if an individual knows their ancestry. The covariate-improved PNS bounds should only be referred to if a person's ancestry is unknown. This might be because the person was adopted with no hint as to whether they're from ancestry $z$ or $z'$ (physical features are right in between or indistinguishable).

5 SIMULATION RESULTS

In this section, we illustrate that the bounds of PNS are improved in four different simple causal diagrams.
The first causal diagram is the simplest one in figure 1a with binary $Z$ satisfying the back-door criterion. The second causal diagram is figure 4, where $\{Z_1, Z_2\}$ satisfies the back-door criterion and both $Z_1$ and $Z_2$ are binary. The third causal diagram is figure 5, with six subsets of $Z_1$, $Z_2$, and $Z_3$ satisfying the back-door criterion. The last causal diagram is figure 1a, the same as the first causal diagram, except $Z$ has 1024 instantiates.

For each of the causal diagrams, we randomly generated 100000 sample distributions compatible with the causal diagram. We compared the average increased lower bound (i.e., lower bound with causal diagram - lower bound without causal diagram), the average decreased upper bound (i.e., upper bound without causal diagram - upper bound with causal diagram), the average gap without causal diagram (i.e., upper bound without causal diagram - lower bound without causal diagram), and the average gap with causal diagram (i.e., upper bound with causal diagram - lower bound with causal diagram). The results are summarized in table 2. For each of the causal diagrams, we then randomly picked 100 out of the 100000 sample distributions to draw the graph of bounds with and without causal diagram. The results are in figures 6 to 9.

[Figure 4: causal diagram with nodes Z1, Z2, X, Y.]

[Figure 5: causal diagram with nodes Z1, Z2, Z3, X, Y.]

From figures 6 to 9, we see that the bounds of PNS are improved in most of the samples with each causal diagram. Surprisingly, we see in table 2 that the average gap without a causal diagram fluctuates substantially (between 0.219 and 0.483) between causal diagrams. However, the average gap with a causal diagram is consistently 0.166 among all causal diagrams. If subsets $Z$ are available that satisfy theorem 4, the bounds of PNS are useful because the gap is narrow and has low variability across causal structures.
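A minimal sketch of this kind of simulation, restricted to figure 1a with binary $Z$, is given below. It draws random distributions, computes the Tian-Pearl bounds (equations 4 and 5, with causal effects obtained by back-door adjustment) and the theorem 5 bounds (equations 10 and 11), and averages the gaps. The sampling scheme, which normalizes two uniform draws for each conditional probability, is an assumption modeled on the algorithm in the appendix, and all function names are illustrative, so the averages need not match table 2 exactly.

```python
import random


def random_probability(rng):
    """Draw two positive uniform numbers and return the normalized first share."""
    a, b = 1.0 - rng.random(), 1.0 - rng.random()  # values in (0, 1]
    return a / (a + b)


def sample_model(rng):
    """Random P(z), P(x|z), P(y|x,z) for binary Z -> X, Z -> Y, X -> Y."""
    p_z = random_probability(rng)
    p_x_given_z = {z: random_probability(rng) for z in (0, 1)}
    p_y_given_xz = {(x, z): random_probability(rng) for x in (0, 1) for z in (0, 1)}
    return p_z, p_x_given_z, p_y_given_xz


def pns_bounds(p_z, p_x_given_z, p_y_given_xz):
    """Return (Tian-Pearl lower, theorem 5 lower, theorem 5 upper, Tian-Pearl upper)."""
    pz = {1: p_z, 0: 1.0 - p_z}

    def joint(x, y):  # observational P(X = x, Y = y); 1 encodes x or y, 0 encodes x' or y'
        total = 0.0
        for z in (0, 1):
            px = p_x_given_z[z] if x == 1 else 1.0 - p_x_given_z[z]
            py = p_y_given_xz[(x, z)] if y == 1 else 1.0 - p_y_given_xz[(x, z)]
            total += pz[z] * px * py
        return total

    # Causal effects via back-door adjustment on Z.
    p_yx = sum(p_y_given_xz[(1, z)] * pz[z] for z in (0, 1))
    p_yxp = sum(p_y_given_xz[(0, z)] * pz[z] for z in (0, 1))
    p_y = joint(1, 1) + joint(0, 1)

    tp_lower = max(0.0, p_yx - p_yxp, p_y - p_yxp, p_yx - p_y)
    tp_upper = min(p_yx, 1.0 - p_yxp, joint(1, 1) + joint(0, 0),
                   p_yx - p_yxp + joint(1, 0) + joint(0, 1))
    g_lower = sum(max(0.0, p_y_given_xz[(1, z)] - p_y_given_xz[(0, z)]) * pz[z]
                  for z in (0, 1))
    g_upper = sum(min(p_y_given_xz[(1, z)], 1.0 - p_y_given_xz[(0, z)]) * pz[z]
                  for z in (0, 1))
    return tp_lower, g_lower, g_upper, tp_upper


def simulate(n_samples, seed=0):
    """Bounds for n_samples randomly generated distributions."""
    rng = random.Random(seed)
    return [pns_bounds(*sample_model(rng)) for _ in range(n_samples)]


results = simulate(10_000)
avg_gap_without = sum(ub - lb for lb, _, _, ub in results) / len(results)
avg_gap_with = sum(g_ub - g_lb for _, g_lb, g_ub, _ in results) / len(results)
print(f"average gap without causal diagram: {avg_gap_without:.3f}")
print(f"average gap with causal diagram:    {avg_gap_with:.3f}")
```

Each returned 4-tuple is ordered as lower bound, lower bound with diagram, upper bound with diagram, upper bound, so the containment of the theorem 5 interval inside the Tian-Pearl interval can be checked sample by sample.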
Figure 6: Bounds of PNS in figure 1a, where general bounds are obtained from equations 4 and 5 and the bounds with causal diagram are obtained by theorem 5.

Figure 7: Bounds of PNS in figure 4, where general bounds are obtained from equations 4 and 5 and the bounds with causal diagram are obtained by theorem 5.

Figure 8: Bounds of PNS in figure 5, where general bounds are obtained from equations 4 and 5 and the bounds with causal diagram are obtained by theorem 5.

Figure 9: Bounds of PNS in figure 1a with Z having 1024 instantiates, where general bounds are obtained from equations 4 and 5 and the bounds with causal diagram are obtained by theorem 5.

                                     Fig 1a   Fig 4   Fig 5   Fig 1a with nonbinary Z
Average increased lower bound        0.026    0.050   0.032   0.158
Average decreased upper bound        0.026    0.050   0.032   0.158
Average gap without causal diagram   0.219    0.267   0.231   0.483
Average gap with causal diagram      0.166    0.166   0.166   0.166

Table 2: Bounds of PNS with and without causal diagram

6 DISCUSSION

We have shown that knowledge of a causal structure enables narrower PNS bounds to be estimated, compared with the tight bounds of Tian and Pearl, which were derived without such knowledge. However, it must be emphasized that this narrowing applies only to individuals whose characteristics cannot be determined at decision time. If their $Z$ values are known, the bounds of equations 4 and 5, conditioned on those values, should be consulted.

Example 4.3 provides a scenario where people who know their ancestry have very different PNS bounds than people who don't know their ancestry. You would expect the additional information of ancestral knowledge to further narrow the bounds, but instead it changes the bounds to a different non-overlapping range. This violates the heuristic that additional information should narrow the bounds or, at worst, not widen them. To rephrase, if you don't know someone's ancestry, the probability they benefit from this vaccine is between 0.275 and 0.5.
Once you acquire the additional information that the person is of ancestry $z$, the probability they benefit from this treatment becomes between 0.55 and 0.75. How is this possible? Was the person's probability of benefiting never really in the range of 0.275 to 0.5 that we calculated before knowing their ancestry?

The reason for this seeming inconsistency is that we're asking different questions. When we didn't know the ancestry, we were asking, "what is the probability of benefiting for a person regardless of ancestry?" When we found out the person is of ancestry $z$, we then asked a different question, "what is the probability of benefiting for a person of ancestry $z$?" The additional information of the person's ancestry didn't help the first question and the second question isn't answerable without the additional information.

The following example will illuminate the reasons for this phenomenon [Pearl, 2009, p. 296]. Let the covariate $Z$ stand for the outcome of a fair coin toss, so $P(Z = \text{heads}) = 0.5$. Without knowing what treatment $X$ and success $Y$ represent, let's assume the following measurements are taken:

$$P(y_x) = 0.5, \quad P(y_{x'}) = 0.5,$$

$$P(y_x \mid Z = \text{heads}) = 1, \quad P(y_{x'} \mid Z = \text{heads}) = 0,$$

$$P(y_x \mid Z = \text{tails}) = 0, \quad P(y_{x'} \mid Z = \text{tails}) = 1.$$

The Tian-Pearl bounds give us $0 \le \mathrm{PNS} \le 0.5$ and the bounds utilizing $Z$ are $0.5 \le \mathrm{PNS} \le 0.5$, or PNS = 0.5.

Now, let's uncover the functional mechanism: $x$ represents betting $1 on heads, $x'$ represents betting $1 on tails, $y$ represents winning $1, and $y'$ represents losing $1. It should now be clear why $P(y_x) = P(y_{x'}) = 0.5$. Without knowing the coin toss result, $Z$, the odds of winning $1 are 50/50 whether you bet on heads or tails. PNS is also 0.5 because benefiting from betting on heads is true only when the coin toss was heads. The coin toss is heads 50% of the time.

This brings us back to the PNS bounds when we have the additional information of what the coin toss result was.
If we know the coin toss resulted in heads, then the probability of benefiting from betting on heads is 100%. Similarly, if we know the coin toss resulted in tails, then the probability of benefiting from betting on heads is 0%. In other words, PNS(heads) = 1 and PNS(tails) = 0. If the coin toss is heads, winning only happens when betting on heads.

Even though the bounds are completely different when we are provided with the very useful additional information of the coin toss, there is clearly no contradiction here. There was a 50% probability of benefiting from betting on heads when we didn't know the coin toss result and a 100% probability of benefiting from betting on heads when we knew the coin toss resulted in heads. We were asking two separate questions. The first question was, "what is the probability of benefiting regardless of coin toss result?" The second question was, "what is the probability of benefiting for a coin toss of heads?"

7 CONCLUSION

In this work, we have developed a graphical method of learning individualized functions (representing PNS, PN, and PS) from population data, based on the structure of the causal graph. This generalizes both the PN bounds derived in [Dawid et al., 2017] and those derived in [Tian and Pearl, 2000]. Often these functions return bounds as opposed to point estimates. Nevertheless, these bounds can be tremendously informative. Machine learning algorithms need to incorporate these techniques in order to understand, interpret, and apply the underlying probabilities of causation of their data. Identifying causes of effects and decision making benefit greatly from using population data for individual cases.

References

[Balke and Pearl, 1997] Balke, A. and Pearl, J. (1997). Probabilistic Counterfactuals: Semantics, Computation, and Applications. PhD thesis, University of California, Los Angeles.
[Balke and Pearl, 2013] Balke, A. and Pearl, J. (2013). Counterfactuals and policy analysis in structural models. arXiv preprint arXiv:1302.4929.
[Cheng, 1997] Cheng, P. W. (1997). From covariation to causation: A causal power theory. Psychological Review, 104(2):367.
[Dawid et al., 2017] Dawid, P., Musio, M., and Murtas, R. (2017). The probability of causation. Law, Probability and Risk, (16):163–179.
[Galles and Pearl, 1998] Galles, D. and Pearl, J. (1998). An axiomatic characterization of causal counterfactuals. Foundations of Science, 3(1):151–182.
[Glymour, 2013] Glymour, C. (2013). Psychological and normative theories of causal power and the probabilities of causes. arXiv preprint arXiv:1301.7377.
[Halpern, 2000] Halpern, J. Y. (2000). Axiomatizing causal reasoning. Journal of Artificial Intelligence Research, 12:317–337.
[Khoury et al., 1989] Khoury, M. J., Flanders, W. D., Greenland, S., and Adams, M. J. (1989). On the measurement of susceptibility in epidemiologic studies. American Journal of Epidemiology, 129(1):183–190.
[Koller and Friedman, 2009] Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press.
[Li and Pearl, 2019] Li, A. and Pearl, J. (2019). Unit selection based on counterfactual logic. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 1793–1799. AAAI Press.
[Mueller and Pearl, 2020] Mueller, S. and Pearl, J. (2020). Which Patients are in Greater Need: A counterfactual analysis with reflections on COVID-19. https://ucla.in/39Ey8sU+.
[Pearl, 1993] Pearl, J. (1993). Aspects of graphical models connected with causality. Proceedings of the 49th Session of the International Statistical Institute, Italy, pages 399–401.
[Pearl, 1995] Pearl, J. (1995). Causal diagrams for empirical research. Biometrika, 82(4):669–688.
[Pearl, 1999] Pearl, J. (1999). Probabilities of Causation: Three counterfactual interpretations and their identification. Synthese, 121(1-2):93–149.
[Pearl, 2009] Pearl, J. (2009). Causality. Cambridge University Press, Second edition.
[Spirtes et al., 2000] Spirtes, P., Glymour, C. N., Scheines, R., and Heckerman, D. (2000). Causation, Prediction, and Search. MIT Press.
[Tian and Pearl, 2000] Tian, J. and Pearl, J. (2000). Probabilities of causation: Bounds and identification. Annals of Mathematics and Artificial Intelligence, 28(1-4):287–313.

A Proof of Theorem 4

Proof.

$$\mathrm{PNS} = P(y_x, y'_{x'}) = \sum_z P(y_x, y'_{x'} \mid z) \times P(z) \tag{16}$$

From [Li and Pearl, 2019], we have the z-specific PNS as follows:

$$\max\left\{\begin{array}{c} 0, \\ P(y_x \mid z) - P(y_{x'} \mid z), \\ P(y \mid z) - P(y_{x'} \mid z), \\ P(y_x \mid z) - P(y \mid z) \end{array}\right\} \le z\text{-PNS} \tag{17}$$

$$\min\left\{\begin{array}{c} P(y_x \mid z), \\ P(y'_{x'} \mid z), \\ P(y, x \mid z) + P(y', x' \mid z), \\ P(y_x \mid z) - P(y_{x'} \mid z) + P(y, x' \mid z) + P(y', x \mid z) \end{array}\right\} \ge z\text{-PNS} \tag{18}$$

Substituting 17 and 18 into 16, theorem 4 holds.

Note that since we have,

$$\sum_z \max\{0,\ P(y_x \mid z) - P(y_{x'} \mid z),\ P(y \mid z) - P(y_{x'} \mid z),\ P(y_x \mid z) - P(y \mid z)\} \times P(z) \ \ge\ \sum_z 0 \times P(z) = 0,$$

$$\sum_z \max\{0,\ P(y_x \mid z) - P(y_{x'} \mid z),\ P(y \mid z) - P(y_{x'} \mid z),\ P(y_x \mid z) - P(y \mid z)\} \times P(z) \ \ge\ \sum_z [P(y_x \mid z) - P(y_{x'} \mid z)] \times P(z) = P(y_x) - P(y_{x'}),$$

$$\sum_z \max\{0,\ P(y_x \mid z) - P(y_{x'} \mid z),\ P(y \mid z) - P(y_{x'} \mid z),\ P(y_x \mid z) - P(y \mid z)\} \times P(z) \ \ge\ \sum_z [P(y \mid z) - P(y_{x'} \mid z)] \times P(z) = P(y) - P(y_{x'}),$$

$$\sum_z \max\{0,\ P(y_x \mid z) - P(y_{x'} \mid z),\ P(y \mid z) - P(y_{x'} \mid z),\ P(y_x \mid z) - P(y \mid z)\} \times P(z) \ \ge\ \sum_z [P(y_x \mid z) - P(y \mid z)] \times P(z) = P(y_x) - P(y),$$

then the lower bound in theorem 4 is guaranteed to be no worse than the Tian-Pearl lower bound in equation 4. Similarly, the upper bound in theorem 4 is guaranteed to be no worse than the Tian-Pearl upper bound in equation 5.

Also note that, since $Z$ does not contain a descendant of $X$, the term $P(y_x \mid z)$ refers to experimental data under population $z$.

B Proof of Theorem 5

Proof. Since $Z$ satisfies the back-door criterion, equations 8 and 9 still hold and $P(y_x \mid z) = P(y \mid x, z)$, $P(y_{x'} \mid z) = P(y \mid x', z)$, and $P(y'_{x'} \mid z) = P(y' \mid x', z)$.
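We further have,

$$\begin{aligned}
P(y_x \mid z) - P(y_{x'} \mid z) &= P(y \mid x, z) - P(y \mid x', z) \\
&\ge [P(y \mid x, z) - P(y \mid x', z)] \times P(x \mid z) \\
&= P(y \mid x, z) \times P(x \mid z) - P(y \mid x', z) \times (1 - P(x' \mid z)) \\
&= P(y, x \mid z) + P(y, x' \mid z) - P(y \mid x', z) \\
&= P(y \mid z) - P(y \mid x', z) \\
&= P(y \mid z) - P(y_{x'} \mid z)
\end{aligned} \tag{19}$$

and

$$\begin{aligned}
P(y_x \mid z) - P(y_{x'} \mid z) &= P(y \mid x, z) - P(y \mid x', z) \\
&\ge [P(y \mid x, z) - P(y \mid x', z)] \times P(x' \mid z) \\
&= P(y \mid x, z) \times (1 - P(x \mid z)) - P(y \mid x', z) \times P(x' \mid z) \\
&= P(y \mid x, z) - P(y, x \mid z) - P(y, x' \mid z) \\
&= P(y \mid x, z) - P(y \mid z) \\
&= P(y_x \mid z) - P(y \mid z).
\end{aligned} \tag{20}$$

The inequality steps in 19 and 20 use $P(y \mid x, z) - P(y \mid x', z) \ge 0$. When this difference is negative, the other two differences, which equal it scaled by $P(x \mid z)$ and $P(x' \mid z)$, are nonpositive as well, so the 0 term of the maximum dominates and the reduction is unaffected.

With equations 19 and 20, equation 8 reduces to equation 10 in theorem 5.

We also have,

$$\begin{aligned}
\min\{P(y_x \mid z), P(y'_{x'} \mid z)\} &= \min\{P(y \mid x, z), P(y' \mid x', z)\} \\
&\le P(y \mid x, z) \times P(x \mid z) + P(y' \mid x', z) \times (1 - P(x \mid z)) \\
&= P(y \mid x, z) \times P(x \mid z) + P(y' \mid x', z) \times P(x' \mid z) \\
&= P(y, x \mid z) + P(y', x' \mid z)
\end{aligned} \tag{21}$$

and

$$\begin{aligned}
\min\{P(y_x \mid z), P(y'_{x'} \mid z)\} &= \min\{P(y \mid x, z), P(y' \mid x', z)\} \\
&\le P(y \mid x, z) \times (1 - P(x \mid z)) + P(y' \mid x', z) \times P(x \mid z) \\
&= P(y \mid x, z) \times (1 - P(x \mid z)) + P(y' \mid x', z) \times (1 - P(x' \mid z)) \\
&= P(y \mid x, z) - P(y, x \mid z) + P(y' \mid x', z) - P(y', x' \mid z) \\
&= P(y \mid x, z) - P(y \mid x', z) + P(y, x' \mid z) + P(y', x \mid z) \\
&= P(y_x \mid z) - P(y_{x'} \mid z) + P(y, x' \mid z) + P(y', x \mid z).
\end{aligned} \tag{22}$$

With equations 21 and 22, equation 9 reduces to equation 11 in theorem 5.

C Proof of Theorem 6

Proof.

$$\begin{aligned}
\text{PNS} &= P(y_x, y'_{x'}) \\
&= \sum_z \sum_{z'} P(y_x, y'_{x'}, z_x, z'_{x'}) \\
&= \sum_z \sum_{z'} P(y_x, y'_{x'} \mid z_x, z'_{x'}) \times P(z_x, z'_{x'}) \\
&\le \sum_z \sum_{z'} \min\{P(y_x \mid z_x, z'_{x'}), P(y'_{x'} \mid z_x, z'_{x'})\} \times \min\{P(z_x), P(z'_{x'})\} \\
&= \sum_z \sum_{z'} \min\{P(y_x \mid z_x), P(y'_{x'} \mid z'_{x'})\} \times \min\{P(z_x), P(z'_{x'})\} && (23) \\
&= \sum_z \sum_{z'} \min\{P(y \mid z_x, x), P(y' \mid z'_{x'}, x')\} \times \min\{P(z_x), P(z'_{x'})\} && (24) \\
&= \sum_z \sum_{z'} \min\{P(y \mid z, x), P(y' \mid z', x')\} \times \min\{P(z_x), P(z'_{x'})\}.
\end{aligned}$$

Combined with the Tian-Pearl bounds in equations 4 and 5, theorem 6 holds. Note that equation 23 is due to $Y_x \perp\!\!\!\perp Z_{x'} \mid Z_x$ and $Y_{x'} \perp\!\!\!\perp Z_x \mid Z_{x'}$. Equation 24 is due to $\forall x,\; Y_x \perp\!\!\!\perp X \mid Z_x$.

D Proof of Theorem 7

Proof.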
First we show that in graph G, if an individual is a complier from X to Y, then $Z_x$ and $Z_{x'}$ must have different values. This is because the structural equations for Y and Z are $f_y(z, u_y)$ and $f_z(x, u_z)$, respectively. If an individual has the same $Z_x$ and $Z_{x'}$ value, then $f_z(x, u_z) = f_z(x', u_z)$. This means $f_y(f_z(x, u_z), u_y) = f_y(f_z(x', u_z), u_y)$, i.e., $Y_x$ and $Y_{x'}$ must have the same value. Thus this individual is not a complier. Therefore,

$$\begin{aligned}
\text{PNS} &= P(y_x, y'_{x'}) \\
&= \sum_z \sum_{z' \ne z} P(y_z, y'_{z'}) \times P(z_x, z'_{x'}) \\
&\le \sum_z \sum_{z' \ne z} \min\{P(y_z), P(y'_{z'})\} \times \min\{P(z_x), P(z'_{x'})\} \\
&= \sum_z \sum_{z' \ne z} \min\{P(y \mid z), P(y' \mid z')\} \times \min\{P(z \mid x), P(z' \mid x')\}
\end{aligned}$$

Combined with the Tian-Pearl bounds in equations 4 and 5, theorem 7 holds.

E Simulation Algorithm

We used the following algorithm to generate samples and conduct the simulations in section 5:

Algorithm 1: Generate PNS simulation data with theorem 5
    input:  Number of output samples n
            Causal diagram G
            Covariates to condition on Z
    output: List of 4-tuples consisting of general lower bound, lower bound with causal graph, upper bound with causal graph, and general upper bound
    begin
        for i ← 1 to n do
            cpt ← generate-cpt(G, random-uniform);
            // Lower/upper Tian-Pearl bounds
            lb, ub ← pns-bounds(cpt);
            // Lower/upper bounds with graph
            lb_graph, ub_graph ← pns-graph(cpt, Z);
            append-result(lb, lb_graph, ub_graph, ub);
        end
    end

Procedure generate-cpt
    input:  n causal diagram nodes (X_1, ..., X_n)
            Distribution D
    output: n conditional probability tables for P(X_i | Parents(X_i))
    begin
        for i ← 1 to n do
            s ← num-instantiates(X_i)
            p ← num-instantiates(Parents(X_i))
            for k ← 1 to p do
                sum ← 0
                for j ← 1 to s do
                    a_j ← sample(D)
                    sum ← sum + a_j
                end
                for j ← 1 to s do
                    P(x_ij | Parents(X_i)_k) ← a_j / sum
                end
            end
        end
    end

For figure 4, binary variables Z1 and Z2 were considered as a single covariate with 4 instantiations. Similarly, figure 5's Z variables were considered as a single covariate with 8 instantiations.
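The simulation loop of Algorithm 1 can be sketched in Python for one small case. This is an illustrative sketch, not the authors' implementation. It assumes a three-node graph Z → X, Z → Y, X → Y with binary X and Y and a single discrete covariate Z, so that Z satisfies the back-door criterion; this graph is an assumption and is not taken from the paper's figures. The function names `generate_cpt`, `simulate_once`, and `simulate` are hypothetical. The Tian-Pearl expressions are assumed to be the bounds 17 and 18 with the conditioning on z dropped, and the graph-based bounds are assumed to be the reduced forms that Appendix B derives.

```python
import random


def generate_cpt(num_states, num_parent_configs, rng):
    """Mirror of generate-cpt: one normalized row per parent configuration."""
    table = []
    for _ in range(num_parent_configs):
        draws = [rng.random() for _ in range(num_states)]  # a_j <- sample(D), D uniform
        total = sum(draws)
        table.append([a / total for a in draws])
    return table


def simulate_once(rng, z_states=4):
    """Return (lb, lb_graph, ub_graph, ub) for one randomly drawn model.

    Assumed graph (hypothetical, for illustration): Z -> X, Z -> Y, X -> Y,
    with binary X and Y, so Z satisfies the back-door criterion.
    """
    zs = range(z_states)
    p_z = generate_cpt(z_states, 1, rng)[0]                    # P(z)
    p_x = [row[1] for row in generate_cpt(2, z_states, rng)]   # P(x | z)
    y_rows = generate_cpt(2, 2 * z_states, rng)
    p_y_x = [y_rows[2 * z][1] for z in zs]                     # P(y | x, z)
    p_y_nx = [y_rows[2 * z + 1][1] for z in zs]                # P(y | x', z)

    # Experimental quantities via back-door adjustment over Z.
    yx = sum(p_y_x[z] * p_z[z] for z in zs)                    # P(y_x)
    ynx = sum(p_y_nx[z] * p_z[z] for z in zs)                  # P(y_x')

    # Observational joint probabilities.
    y_x = sum(p_y_x[z] * p_x[z] * p_z[z] for z in zs)                    # P(y, x)
    y_nx = sum(p_y_nx[z] * (1 - p_x[z]) * p_z[z] for z in zs)            # P(y, x')
    ny_x = sum((1 - p_y_x[z]) * p_x[z] * p_z[z] for z in zs)             # P(y', x)
    ny_nx = sum((1 - p_y_nx[z]) * (1 - p_x[z]) * p_z[z] for z in zs)     # P(y', x')
    y = y_x + y_nx                                                       # P(y)

    # General (Tian-Pearl) bounds: unconditioned forms of 17 and 18.
    lb = max(0.0, yx - ynx, y - ynx, yx - y)
    ub = min(yx, 1 - ynx, y_x + ny_nx, yx - ynx + y_nx + ny_x)

    # Back-door bounds: z-specific bounds after the Appendix B reduction.
    lb_graph = sum(max(0.0, p_y_x[z] - p_y_nx[z]) * p_z[z] for z in zs)
    ub_graph = sum(min(p_y_x[z], 1 - p_y_nx[z]) * p_z[z] for z in zs)
    return lb, lb_graph, ub_graph, ub


def simulate(n, seed=0, z_states=4):
    """Outer loop of Algorithm 1: collect n 4-tuples of bounds."""
    rng = random.Random(seed)
    return [simulate_once(rng, z_states) for _ in range(n)]
```

Calling `simulate(1000)` returns a list of 4-tuples in the order used by Algorithm 1. Up to floating-point error, each tuple satisfies lb ≤ lb_graph ≤ ub_graph ≤ ub, which matches the guarantee noted in Appendix A that the covariate-specific bounds are no worse than the Tian-Pearl bounds.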