[go: up one dir, main page]

0% found this document useful (0 votes)
39 views11 pages

A Tutorial On Applying The Difference-in-Differenc

This document provides a tutorial on applying the difference-in-differences (DiD) method to health data, aimed at epidemiologists and biomedical researchers. It highlights the importance of DiD in estimating the effects of group-level interventions and offers practical guidance, including synthetic data and code for implementation. The tutorial also discusses the assumptions necessary for valid causal inference using DiD, particularly in the context of health services research.

Uploaded by

ana santos
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
39 views11 pages

A Tutorial On Applying The Difference-in-Differenc

This document provides a tutorial on applying the difference-in-differences (DiD) method to health data, aimed at epidemiologists and biomedical researchers. It highlights the importance of DiD in estimating the effects of group-level interventions and offers practical guidance, including synthetic data and code for implementation. The tutorial also discusses the assumptions necessary for valid causal inference using DiD, particularly in the context of health services research.

Uploaded by

ana santos
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 11

Current Epidemiology Reports

https://doi.org/10.1007/s40471-023-00327-x

A Tutorial on Applying the Difference‑in‑Differences Method to Health


Data
Sarah Rothbard1,2 · James C. Etheridge3,4 · Eleanor J. Murray2

Accepted: 21 June 2023


© The Author(s) 2023

Abstract
Purpose of Review Difference-in-differences analyses are a useful tool for estimating group-level decisions, such as policy
changes, training programs, or other non-randomized interventions, on outcomes which occur within the intervention group.
However, there is little practical advice on how to apply difference-in-differences to epidemiologic and health data. Here,
we provide a tutorial on applying the difference-in-differences method to health services data, targeted at epidemiologists
and other biomedical researchers.
Recent Findings As epidemiologists increasingly engage in policy discussions, familiarity with difference-in-differences will
be increasingly important. However, much of the literature on difference-in-differences is limited to econometrics examples
where the types of data and questions encountered may differ from health research. There remain limited resources for epi-
demiologists and other medical researchers to learn how to implement difference-in-differences analyses without first having
to familiarize themselves with econometric terminology and concepts.
Summary This tutorial contains synthetic data, code, and worksheets for class instruction. We provide a step-by-step descrip-
tion of the difference-in-differences analysis including sensitivity checks, modeling decisions, and interpretation. In addition,
we supply novel guidance on modeling difference-in-differences outcomes for count or score outcomes.

Keywords Difference-in-differences · Causal inference · Statistical methods · Causal effect estimation

Introduction application of methods which have been developed based on


theory to the messy world of actual data can be challenging, and
Over the past several decades, significant advances in statistical, the complexities of data differ between fields of specialization.
epidemiologic, and econometric methods for estimating causal An economist and a public health researcher cannot necessarily
effects have been made. These new methods provide useful expect to encounter the same types of analytic problems nor to
tools for improving the efficiency of data analysis. However, analyze their data using the exact same means and underlying
the introduction of these methods to applied research has not assumptions when the goals of their research are often different.
been uniform across scientific disciplines, and even within a As a result, tutorial material specifically tailored to public health
field of study, causal inference methods have often been limited and medical questions are needed in order to address the types
to particular application areas. This is a problem because the of data and analytic issues that arise in these fields and connect
econometric methods with other methods more well-known
* Eleanor J. Murray in epidemiological research. Here, we provide a step-by-step
ejmurray@bu.edu walk-through of using the difference-in-differences method for
epidemiologists and medical researchers. We apply this method to
1
Department of Biostatistics, Boston University School estimate the impact of a department-level intervention on surgical
of Public Health, Boston, MA, USA
safety outcomes at the department-level within a single hospital.
2
Department of Epidemiology, Boston University School The difference-in-differences (DiD) method, although orig-
of Public Health, Boston, MA, USA
inally developed by an epidemiologist [1], is a widely used tool
3
Department of Surgery, Brigham and Women’s Hospital, for estimating causal effects in econometrics. The DiD method
Boston, MA, USA
compares changes in an outcome over time between an inter-
4
Ariadne Labs, Harvard T.H. Chan School of Public Health, vention and control group and is thus ideal for estimating the
Brigham and Women’s Hospital, Boston, MA, USA

13
Vol.:(0123456789)
Current Epidemiology Reports

effects of policies or interventions that are applied at the group study, the investigators aimed to test the impact of a Device
level rather than at the individual level. An important caveat Briefing Tool (DBT), a short communication instrument
to this method is that rather than estimating the average effect designed to promote discussion of safe device use among
of the intervention on the outcome in the entire population members of surgical teams to improve surgical safety. The
(which is a common estimand in epidemiologic research), DiD DBT was implemented in four general surgery departments
estimates the average effect that the intervention has had on the (colorectal surgery, upper gastrointestinal surgery, hepato-
outcome in the group exposed to the intervention. pancreato-biliary surgery, and acute care surgery) in a large
There have recently been calls for this method to be more academic referral center. Four additional surgical depart-
widely used in epidemiology [2•, 3] as well as attempts to frame ments (cardiothoracic surgery, head and neck surgery,
DiD in the language of epidemiologic methods [4•]. However, gynecology, and urology) were enrolled as a comparator
there remain relatively few applications of DiD to epidemiologic group. The impact of the DBT on non-technical skills was
problems [5] and a dearth of training materials targeted specifi- assessed using the NOTECHS behavioral marker system.
cally towards epidemiologists and medical researchers. The NOTECHS system evaluates four behaviors across three
Recently, our group, composed of epidemiologists and clin- surgical sub-scores, with total possible scores ranging from
ical researchers, collaborated on a study using DiD to estimate 4 to 48 points. Trained observers rated teams in live opera-
the impact of a pre-operative device briefing tool on surgi- tions before and after implementation of the DBT. Observers
cal team performance in departments where surgeons had the also recorded features of each operation such as case com-
opportunity to use this tool [6]. Surgical quality was measured plexity (rated 1–3, least challenging to most challenging)
with an ordinal outcome which summarized the score on an and type of surgical device in use (tissue sealing device,
observer-recorded survey. Although this initially seemed like a surgical stapler, or other). However, baseline observations
straightforward use case for DiD, we rapidly discovered a chal- were interrupted by the COVID-19 pandemic. Therefore,
lenge: despite the widespread use of DiD in the econometrics this study had three distinct time periods: pre-COVID base-
literature, we were unable to find guidance or examples on the line, post-COVID baseline, and post-intervention.
use of DiD for bounded, categorical, or count outcomes. We created the synthetic dataset in SAS using the
In retrospect, this is unsurprising. Epidemiologic data empirical distribution of covariates and NOTECHS scores
most often deals with binary outcomes, such as case versus observed in the real data. The use of a synthetic dataset
control, or alive versus dead, but categorical outcomes are allows us to share the data publicly, without concerns about
also frequently of interest. These may include outcomes such violating patient confidentiality. However, it is important to
as our survey score variable, or similar survey outcomes remember that since the data in this tutorial are simulated,
such as pain level, as well as count outcomes such as num- any conclusions from the analysis of this tutorial dataset
bered events, such as number of migraines experienced in should not be taken as evidence for or against real world
a month. In contrast, these types of outcomes are less com- causal effects.
mon in econometrics, where the primary outcomes are often Although the original study only included 210 surger-
continuous variables such as costs. Furthermore, even when ies distributed across 8 simulated departments, we created
the outcome is binary, econometricians commonly model the a larger simulated dataset with 20,000 simulated surgeries
outcome with a linear probability model. As a result, there reflecting the distributions observed in the original study [6].
has been little need for econometrics to address the question Simulated operations were then assigned to a time period
of how to handle categorical outcomes in DiD models. and intervention group, based on the three periods of obser-
In this tutorial paper, we describe the basics of the DiD vation, and using a pseudo-linear time scale in months with
method, walk through the assumptions required to apply it t = 0 denoting the post-COVID baseline, t = 1 denoting
to a research question, and explore the different options for the post-intervention period, and t = −12 denoting the pre-
implementing DiD in data with categorical outcomes. We COVID baseline period. The covariates case complexity
provide synthetic data and code in SAS, Stata, and R along (case_challenge) and type of surgical device (device) were
with an explanation of the steps in the DiD modeling pro- simulated by drawing a random uniform number and assign-
cess, including sensitivity checks and assessment of mod- ing the complexity or device category based on the empirical
eling assumptions. distribution of these variables in the original dataset. Finally,
dummy variables for department and case complexity and
their interactions were created. We modeled the expected
Getting to Know the Data average NOTECHS score in the actual data as a function of
these dummy variables using a log-linear regression model
Our tutorial is based on synthetic data created to mirror and then generated synthetic NOTECHS scores for our sim-
the real data collected in a study by an academic medical ulated data by exponentiating the linear predictor for each
center in collaboration with Johnson & Johnson [6]. In that surgery from the estimated regression coefficients. Scores

13
Current Epidemiology Reports

were then rounded to the nearest whole number using the would have been if they had not received the intervention.
round (<>, 0) function in SAS, and truncated at the mini- But we do not require the reverse (i.e., we do not require
mum and maximum possible values of 4 and 48. Code for that the intervention group’s experience reflect what would
simulating the data is provided in the Online Supplement have happened to the control group had they been given the
(Code Section A1, SAS code only), as well as data files in intervention).
SAS (did_sim.sas7bdat), Stata (did_sim.dta), and comma-
separated (did_sim.csv) formats.
Causal Assumptions

Defining the Causal Question Table 1 outlines the assumptions required for validity of our
causal effect estimate from the DiD analysis. These assump-
DiD answers a causal question which may be unfamiliar to tions have been described in more detail for epidemiologic
some epidemiologists and medical researchers — rather than audiences elsewhere [4•], but we briefly review them here.
the more commonly used average treatment effect (or ATE), First, we require the assumption of causal consistency,
DiD provides an estimate of the causal effect of the change which can be expressed as the requirement that we are ask-
in treatment status for the group which actually experienced ing a well-defined causal question. Mathematically, this
that change (Fig. 1). That is, it estimates the average treat- assumption requires that the intervention we have in the data
ment effect among the treated (often called the ATT). In is the one that we intended to assess with our counterfactual
the context of our example, the ATT can be described as comparison, such that the observed outcome under interven-
the impact on surgical performance of the DBT for those tion is equal to the counterfactual outcome under that same
operations which occurred in departments where the DBT intervention. In the context of our tutorial example, consist-
was available for use. ency is expected to hold at the level of being trained on the
This causal question may not seem much different to the DBT, since the same tool was given to all surgeons in the
general reader, but it has important implications for our intervention departments and all surgeons in those depart-
required causal assumptions — namely, we must have a ments were provided with tool training. However, surgeons
control group whose trend over time provides a valid coun- had discretion about choosing to use the DBT or not, as well
terfactual for what the experience of the intervention group as there being potential variation in how the DBT was used.

Fig. 1  Potential contrasts which could be estimated. The semi-cir- tion. The average treatment effect (ATE) estimates what would have
cles represent study groups; shading represents (real or hypothetical) happened if everyone had been assigned to each of the intervention and
assigned conditions: gray = intervention; white = control. Line type control conditions. The average treatment effect in the treated (ATT)
represents observed (solid) or counterfactual (dashed) estimates. When estimates the difference between what happened to the intervention
observed outcomes in each group are compared, based on their actual group under their assigned condition, and what would have happened
assigned conditions, this provides an estimate of the observed associa- to the intervention group if they had been under the control condition

13
Current Epidemiology Reports

Table 1  Difference-in-differences assumptions


Assumption Definition Application to the example

Consistency Well-defined causal question, such that the intervention in our The intervention of interest is the option to use the device
counterfactual is the same as the intervention in our data. briefing tool, and variation in how it is used is not expected to
affect its impact on surgical quality
Positivity Each study subgroup has positive chance of being assigned to All eight surgical departments were eligible to be selected for
the treatment group. use of the DBT.
Parallel trends The difference in outcomes from before to after exposure in The change in NOTECHS scores from the pre-DBT baselines to
the control group is a counterfactual ideal representation of post-DBT for the control group is the assumed to be the same
the difference in outcomes in the exposed group had they as the counterfactual change in scores for the intervention
been unexposed. group had the DBT not been implemented.

In order for the consistency assumption to hold, we must group, had the intervention group not received the interven-
further assume that the reasons a surgeon chose to use or not tion. Changes in trends of covariates over time might indi-
use the DBT, and the various ways in which the DBT was cate a potential for the outcome trend to change over time
used in practice, did not alter the causal effect of the DBT’s but are not in themselves violations of the exchangeability
use on surgical quality as measured by NOTECHS score. assumption. If this assumption holds, we can use the con-
Second, we require positivity, sometimes called “over- trol group’s trend to estimate the counterfactual NOTECHS
lap,” such that all departments in the study had a non-zero scores for the intervention group. Figure 2 displays the
probability of being selected for either the intervention or assumed pattern of actual and counterfactual trends that the
control group. Since this was an intervention study, posi- difference-in-differences method relies upon.
tivity is expected by design — even though the interven- The parallel trends assumption can never be empirically
tion departments were selected as units, based on shared verified, but we can assess how reasonable this assumption
nursing load, it was possible, pre-assignment, that either set is by looking at the pre-intervention time-trends in the
of departments could have been assigned to the control or control and intervention group. Although a DiD can be
intervention group. estimated with only two time points — one each before and
Finally, we need a version of the causal exchangeability after the intervention — having a third time point allows
assumption — the parallel trends assumption. This assump- quantitative assessment of the parallel trend assumption. In
tion specifies that the trend in the outcome over time in the studies with only one pre-intervention time-point, subject
control group is parallel to the trend in the outcome over matter knowledge must be used to support the validity of
time that would have been observed in the intervention this assumption. In the tutorial code, the first section (Code

Fig. 2  Difference-in-differences example graph. The two lines repre- had not received the intervention, the average outcome in that group
sent the trends over time in outcome measures for the intervention (Z would have continued to follow the trend observed in the control
= 1, orange) group and the control (Z = 0, blue) group. The x-axis group (and in the intervention group, pre-intervention). This counter-
represents time in months, from t = − 12 months (pre-COVID base- factual trend is indicated with the gray line. The difference-in-differ-
line) to t = 1 (end of follow-up). The vertical line indicates the switch ences estimates the outcome change between the counterfactual inter-
from pre-intervention (T = 0) to post-intervention (T = 1). The paral- vention line (gray) and the actual intervention line (orange) between
lel trends assumption relies on the idea that if the intervention group baseline (t = 0) and the end of follow-up (t = 1)

13
Current Epidemiology Reports

Section 1.1) estimates the change in NOTECHS scores covariates, and our outcome variable over time in a directed
between the pre- and post-COVID baseline periods for acyclic graph (DAG).
the control and intervention departments and calculates We can represent the DiD estimator in expectation nota-
the difference-in-differences for this contrast. The results, tion with Equation (1) (Table 3):
displayed in Table 2, show that there is very little difference in ( ( ) ( ))
z=1 | z=1 |
the pre-intervention trends between the two groups (trend pre-/ 𝛿DiD = E Yt=1 |Z = 1 − E Yt=0
|
|Z = 1
|
( ( ) ( ))
post-COVID differs by 0.4 points between groups), providing − E Yt=1 z=0 | z=0 |
|Z = 1 − E Yt=0 |Z = 1 (1)
| |
support for (but not proof of) the parallel trends assumption. (
|
) (
|
)
= E ΔY z=1 |Z = 1 − E ΔY z=0 |Z = 1
| |

where Z indicates group assignment (Z = 1 for intervention,


Difference‑in‑Differences Estimator Z = 0 for control), T indicates time period (T = 1 for post-
intervention, T = 0 for baseline). Y represents the outcome
Now that we have discussed the data, questions, and causal
such that Y indicates observed NOTECHS score, Yz indicates
assumptions required, we briefly review the estimator. As
NOTECHS score counterfactual on intervention status, Yt
described above, the DiD estimator, 𝛿DiD, is a comparison
indicates observed NOTECHS score at a particular time
between two average counterfactuals: the expected counter-
point, Ytz indicates the counterfactual NOTECHS score at a
factual outcome for the intervention group after the inter-
particular time point. ∆Y & ∆Yz are the change in outcome
vention time point had the intervention been delivered, and
and counterfactual between the baseline and follow-up time
the expected counterfactual outcome for the intervention
points.
group after the intervention time point had the interven-
To understand how the choice of modeling assumptions
tion not been delivered. Figure 3 summarizes the relation-
impact the estimated difference-in-differences value, it is
ships between group assignment, intervention application,
useful to walk through the estimation of 𝛿DiD. Under the
assumptions described in Table 1, we can estimate the two
pieces of 𝛿DiD from our synthetic data as follows.
Table 2  Parallel trends evaluation: observed average outcome score If consistency holds, the average potential outcome scores
in the synthetic dataset comparing the two baseline periods (t≤0; N = for the intervention group after intervention, E(Yt=1
z=1
|Z = 1),
9951) and assignment group
and before the intervention, E(Yt=0 |Z = 1), can be estimated
z=1

Pre- Post- Row differences directly from the observed average scores in the interven-
COVID COVID tion group, E(Y|Z = 1, T = 1) and E(Y|Z = 1, T = 1), or the
baseline baseline
observed change, E(∆Y|Z = 1).
Intervention 34.48 36.38 1.9 Under the assumption of consistency, the observed out-
Control 34.37 35.87 1.5 comes under no intervention in the control group are direct
Column differences 0.11 0.51 Trend difference = 0.4 estimates of the potential outcomes under no intervention for

Fig. 3  Directed acyclic graph (DAG) representing the assumed pools between departments within a given group assignment (hence
data structure for the tutorial example. Yt is the average NOTECHS the department --> Z arrow). The specific device type and case com-
score at time t, with t = −12 at the pre-COVID baseline, t = 0 at the plexity of a given surgery will also be an independent determinant
post-COVID (main) baseline, and t = 1 at follow-up. Z is interven- of NOTECHS score for a given surgery. While NOTECHS score is
tion group assignment; Z = 1 for departments assigned to the DBT, measured at three time points, each component observation comes
and Z = 0 for control departments. NOTECHS score is expected to from a different surgery for which the device types used and case
vary between departments due to several factors, including the typi- complexity are time-invariant characteristics of that surgery, unaf-
cal complexity of surgeries within a department, and the typical fected by the characteristics of prior surgeries
devices used within the department, as well as due to shared nursing

13
Current Epidemiology Reports

Table 3  Notation Variables found in equations Symbols

Difference-in-differences estimator, marginal 𝛿DiD


Difference-in-differences estimator, conditional 𝛿DiD|L
Group assignment
Control Z=0
Intervention Z=1
Time period
Pre-COVID baseline time T = -12
Baseline (post-COVID) time T=0
Post-intervention time T=1
Indicator variable for post-intervention (1) versus baseline (0) I(T=1)
Exposure status
No availability of Device Briefing Tool A=0
Device Briefing Tool available for use A=1
Outcome
Observed NOTECHS score Y
NOTECHS score counterfactual on intervention status YZ
Observed NOTECHS score at a particular time point Yt
Counterfactual NOTECHS score at a particular time point YZt
Change in outcome between baseline and follow-up time points ∆Y = Yt − Y0
Change in counterfactual between baseline and follow-up time points ΔYZ = Ytz − Y0z

the control group. In contrast, the potential outcome at fol- interest into the linear predictor to derive the difference-in-
low-up under no intervention is unobservable for the inter- differences estimator. We see that in this case, the estima-
vention group. However, the parallel trends assumption, if tor is simply the regression coefficient for the interaction
true, allows us to estimate the counterfactual outcomes under between intervention status (time) and group.
no intervention for the intervention group. If we assume that
𝛿̂DiD =[E( Y|Z = 1, T = 1) − E( Y|Z = 1, T = 0)]
the (counterfactual) trends over time, under no intervention,
are parallel between the two groups, then the counterfac- − [E( Y|Z = 0, T = 1) − E( Y|Z = 0, T = 0)]
tual average change in outcome that the intervention group
[( ) ( )]
= 𝛽0 + 𝛽1 (1) + 𝛽2 (1) + 𝛽3 (1)(1) − 𝛽0 + 𝛽1 (0) + 𝛽2 (1) + 𝛽3 (0)(1)
would have had, between baseline and follow-up time, if [( ) (
− 𝛽0 + 𝛽1 (1) + 𝛽2 (0) + 𝛽3 (1)(0) − 𝛽0 + 𝛽1 (0) + 𝛽2 (0) + 𝛽3 (0)(0)
)]
they had not received the intervention, can be estimated from [( ) ( )] [( ) ( )]
= 𝛽0 + 𝛽1 + 𝛽2 + 𝛽3 − 𝛽0 + 𝛽2 − 𝛽0 + 𝛽1 − 𝛽0 = 𝛽3
the observed change over time in the control group, E(∆Y|Z
(3)
= 0). The parallel trends assumption is equivalent to assum-
ing that E(∆Yz=0|Z = 1) = E(∆Yz=0|Z = 0), while consistency Code Section 2.1 demonstrates the estimation of the
tells us that E(∆Yz=0|Z = 0) = E(∆Y|Z = 0). unadjusted difference-in-differences using Equation (2). The
Given these assumptions, we can estimate 𝛿̂DiD as E(∆Y|Z results are displayed in Table 4. We can see that this model
= 1) − E(∆Y|Z = 0). This can be done via hand calculation, estimates a roughly 1.7-point decrease in NOTECHS score
but since we are also interested in adjusting for covariates, in the intervention group, after receiving the DBT. We note
we present code to calculate the unadjusted estimator using
the regression equation:
E( Y|Z, T, T ≥ 0) = 𝛽0 + 𝛽1 I(T = 1) + 𝛽2 Z + 𝛽3 I(T = 1)Z Table 4  Difference-in-differences from simple crude linear model,
(2) discarding pre-COVID baseline time-point (t≥0, N = 15,086) (Equa-
tion (2))
In this equation, I(T = 1) represents an indicator vari-
able for pre- versus post-intervention. When the data are Post- Post-intervention Row differences
COVID
restricted to T ≥ 0, the regression model estimates four baseline
groups: control group pre-intervention; control group
post-intervention; intervention group pre-intervention; and Intervention 36.38 37.19 0.81
intervention group post-intervention. Using this regression Control 35.87 38.37 2.5
model, we can substitute the four values of T and Z of Column differences 0.51 −1.18 𝛿̂DiD = −1.7

13
Current Epidemiology Reports

here that since the intervention assignment was made at the assignment was made at the department level, we should
department level, we should use robust (sandwich) estima- use robust (sandwich) estimators or bootstrapping. The
tors to obtain the variance accounting for any clustering confidence intervals in Table 5 were obtained using 500
by department. Alternatively, the variance or confidence non-parametric bootstrap samples.
interval can be obtained via bootstrapping.

Modeling Assumption: Outcome


Modeling Assumptions: Time Trends Distribution

The model in Equation (2) makes some simplifying sta- The second modeling assumption we have made in Equa-
tistical assumptions, in addition to the causal inference tions (2) and (4) is a linear link and normal residual distri-
assumptions. First, this model discards information from bution between our independent and dependent variables.
the pre-COVID baseline period (t=−12). However, we However, as we described above, the NOTECHS score is
can potentially improve the estimation of the counter- an integer bounded between 4 and 48, and it may be more
factual trend line by including the data from this time appropriate to assume log-link and a Poisson error distribu-
point, but doing so requires making assumptions about tion. This would result in the following regression model:
the appropriate shape of the trend-line. Since we are mak-
logE( Y|Z, T) = 𝛼0,t + 𝛼1 AI(T = 1) + 𝛼2 Z + 𝛼3 I(T = 1)Z
ing the parallel trends assumption, and have only three
(5)
time points, here it seems reasonable to assume that
the trend is linear, and update the regression equation In Equation (5), we use the Greek letter alpha to clarify
to include time and data from the pre-COVID baseline that the coefficients on the variables will have different val-
period (t=−12): ues from those estimated with Equations (2) and (4). Now,
the regression model estimates the natural logarithm of the
E( Y|Z, T) = 𝛽0 + 𝛽1 I(T = 1) + 𝛽2 Z + 𝛽3 I(T = 1)Z + 𝛽4 t expected NOTECHS score, and we can no longer use the
(4a) coefficient on the interaction term as the estimator of our
= 𝛽0,t + 𝛽1 I(T = 1) + 𝛽2 Z + 𝛽3 I(T = 1)Z (4b) causal effect. However, we can use the same process as in
Equation (3) to derive our estimator, recognizing that the
Code Section 2.2 demonstrates the estimation of the estimated expected value of the NOTECHS score is obtained
unadjusted difference-in-differences using Equation (4a). by exponentiating the linear predictor:
Note that time is not expected to interact with our effect, ( )
and so we can also consider the coefficient for time as a E Y ̂∣ Z, T = e𝛼0t +𝛼1 I(T=1)+𝛼2 Z+𝛼3 I(T=1)Z (6)
component of the slope, as in Equation (4b). With this
modification, it is clear that our estimator is still obtained Now, plugging the appropriate values of A and Z into our
via the coefficient on the I(T = 1)Z interaction term as in estimator from Equation (3) gives us:
Equation (3) (i.e., 𝛿̂DiD = 𝛽̂3).
The results are displayed in row 1 of Table 5. This 𝛿̂DiD =[E( Y|Z = 1, T = 1) − E( Y|Z = 1, T = 0)]
model estimates a similar, but slightly smaller, 1.5-point − [E( Y|Z = 0, T = 1) − E( Y|Z = 0, T = 0)] (7)
decrease in score in the intervention group, after receiving [( ) ( )] [( )
= e𝛼0t +𝛼1 +𝛼2 +𝛼3 − e𝛼0t +𝛼2 − e𝛼0t +𝛼1 − (e𝛼0t )
]
the DBT. As before, we note that since the intervention

Table 5  Summary of marginal effect estimates and predicted mean outcome scores using the full dataset, including pre-COVID baseline. 95%
confidence intervals from empirical distributions of results across 500 bootstrap samples
Model Group Post-COVID baseline Post-intervention 𝛿̂DiD

Unadjusted Linear model Intervention 36.33 (36.31, 36.37) 37.18 (37.16, 37.21) −1.49 (−1.58, −1.41)
Control 36.02 (35.97, 36.07) 38.37 (38.30, 38.44)
Unadjusted Poisson model Intervention 36.34 (36.31, 36.37) 37.18 (37.16, 37.21) −1.50 (−1.59, −1.42)
Control 36.02 (35.97, 36.07) 38.37 (38.30, 38.44)
Adjusted Linear model Intervention 36.22 (36.19, 36.25) 37.25 (37.22, 37.37) −1.24 (−1.33, −1.16)
Control 36.08 (36.02, 36.14) 38.34 (38.28, 38.41)
Adjusted Poisson model Intervention 36.22 (36.18,36.25) 37.25 (37.22, 37.27) −1.24 (−1.33, −1.16)
Control 36.07 (36.02, 36.13) 38.34 (38.28, 38.41)

13
Current Epidemiology Reports

The rules of exponents mean that Equation (7) does not for confounding control. Specifically, the parallel trends
readily simplify. Instead, we must compute each expected assumption is not violated by any group-specific confound-
value and then calculate the estimate, as in Code Section 2.3. ers which remain at a constant frequency over time. Instead,
Note that, in SAS, this necessitates the use of bootstraps the confounders of interest for a difference-in-differences
to obtain valid confidence intervals (Supplementary Code analysis are those variables which differ between groups,
Section A2, SAS code only), while in Stata, careful applica- impact the outcome of interest, and for which the frequency
tion of the margins command can provide robust variance changes over time in different ways between the groups. As
estimates. a result, such variables are potentially alternative causes of
The results of the analysis using Equations (6) and (7) the change in the outcome.
are presented in row 2 of Table 5. Comparing the estimates In our tutorial example, there are two potential covari-
in Table 5, we see that the choice of modeling assumptions ates to consider: device type and case complexity. Device
does not affect our effect estimate. Modeling the time trend type and case complexity both vary between departments. In
with additional data did slightly attenuate our effect esti- addition, because surgeons in intervention departments were
mate, but changing the regression model link and error dis- provided the opportunity to choose which surgeries to use
tribution had essentially no effect on our point estimate. the DBT in and only these surgeries had recorded post-inter-
We can reconcile this by understanding that even if our vention DBT group NOTECHS scores, both device type and
categorical or count outcome may be better represented by case complexity may vary before and after the intervention
a Poisson error distribution, the outcome of interest for the within the intervention departments. As a result, we should
difference-in-differences estimator is actually the change consider both device type and case complexity as potential
in the outcome (∆Y). The distribution of the difference of confounders. Note that the average value of these variables
two Poisson distributed variables is not itself expected to across surgeries within a department are time-varying, but
be Poisson-distributed, since the value is now bounded by our data are collected at the level of surgeries, and for each
a minimum change of −44 points and a maximum of +44 individual surgery, the device type and case complexity are
points. Instead, the change outcome is expected to follow a time-invariant.
Skellam distribution, which describes the difference between The dataset also contains a third covariate: department.
two Poisson-distributed variables [7]. When the average Although NOTECHS score may vary by department, depart-
value of the two Poisson-distributed variables is similar, the ment is not a confounding variable. This is because the
resulting Skellam distribution will be approximately normal group assignment was made at the department level and thus
(although with a sharp peak). In addition, the distribution of all operations within a department share group assignment.
interest for the regression model is not the distribution of the It is important to note, however, that if we had assigned
outcome per se but rather the distribution of the residuals or the intervention to individual operations, we might need to
error terms. Since our true outcome may be expected to be consider confounding by department because the number of
approximately (sharply) normally distributed, it is reason- surgeries in each department does change over time.
able to assume that the residuals will also be approximately
normal, and we can use a linear regression model.
As we can see in our synthetic data, and as was also the Estimating Conditional Adjusted
case in sensitivity analyses on the original study data, linear Difference‑in‑Differences
regression can be an appropriate choice for modeling the
DiD estimator even when the outcome data are categori- Adjustment for confounders requires using a multivariable
cal, bounded, or count data since the target for estimation is regression model, but as we shall see, such a model is only
the change in outcome. This result has also been described the first step in obtaining an adjusted DiD estimate. Since
by the economist Dr Jeff Wooldridge [8–10] but does not our initial model-building process indicated that a linear
appear to be widely known. regression model was reasonable, we focus the discussion
of adjustment on linear models, Code Section 3. Equations
for adjusted Poisson regression and DiD estimation are pre-
Selecting Confounders sented in the Supplementary Materials and in Supplemen-
tary Code Section A3. Note that as with the unadjusted Pois-
The final step in modeling our DiD estimator is to consider son estimator, the conditional adjusted Poisson estimator
whether any confounders are needed to meet the paral- cannot be read directly off the model output.
lel trends assumption. The parallel trends assumption is a For the adjusted linear regression, we model the expected
weaker assumption than the exchangeability assumption NOTECHS value conditional on the covariates (Equation
more commonly used in epidemiologic research, and it (8)). This model strongly resembles that of the crude lin-
reduces the types of variables which are relevant to consider ear analysis (despite the use of a new Greek letter, gamma,

13
Current Epidemiology Reports

for the coefficients), except for the addition of a new term: Table 6  Summary of stratum-specific conditional effect estimates
γ4’L. This term is bolded and has an apostrophe after the using the full dataset, including pre-COVID baseline. The synthetic
data has no additive effect measure modification, but does have log-
coefficient to indicate that rather than representing a single scale (multiplicative) effect measure modification. Conditional esti-
variable and its single coefficient the term γ4’L represents a mates are equal between strata when no effect measure modification
sequence (or matrix) of covariate-coefficient product terms is present
added together for as many confounders (and potentially Device type Case complexity Linear Poisson
their interactions with each other) as are to be included in model 𝛿̂DiD|L model 𝛿̂DiD|L
the model. This short hand makes the derivation of the DiD
All* All* −1.24 −1.23
estimator simpler, without impacting its validity. Code for
Sealer Overall* −1.24 −1.23
the regression in Equation (8) is found in Code Section 3.1.
Very challenging −1.24 −1.21
Somewhat challenging −1.24 −1.25

E( Y|Z, T, L) = 𝛾0t + 𝛾1 I(T = 1) + 𝛾2 Z + 𝛾3 I(T = 1)Z + 𝜸 4 L
(8) Not challenging −1.24 −1.23
As with the unadjusted linear model, we can compute a Stapler Overall* −1.24 −1.25
DiD estimator from the coefficient on the I(T = 1)Z inter- Very challenging −1.24 −1.24
action term (Equation (9)). However, because our adjusted Somewhat challenging −1.24 −1.24
regression model estimates the conditional outcome for each Not challenging −1.24 −1.27
T-Z grouping, E(Y|Z,T,L) not the unconditional outcome, Other device Overall* −1.24 −1.19
E(Y|Z,T), this DiD estimator in Equation (9), δ̂ did|L, is not nec- Very challenging −1.24 −1.19
essarily the same as the one estimated in Equation (3), δ̂ did. Somewhat challenging −1.24 −1.20
Not challenging −1.24 −1.19
𝛿̂did|𝐋 =[E( Y|Z = 1, T = 1, L = 𝓵) − E( Y|Z = 1, T = 0, L = 𝓵)]
*Unweighted average of stratum-specific estimates
− [E( Y|Z = 0, T = 1, L = 𝓵) − E( Y|Z = 0, T = 0, L = 𝓵)]
[( ) ( )] (9)
= 𝛾0t + 𝛾1 + 𝛾2 + 𝛾3 + 𝜸 𝟒 ′ 𝓵 − 𝛾0t + 𝛾2 + 𝜸 𝟒 ′ 𝓵

Estimating the Unconditional (Marginal)


[( ) ( )]
− 𝛾0t + 𝛾1 + 𝜸 𝟒 ′ 𝓵 − 𝛾0t + 𝜸 𝟒 ′ 𝓵 = ̂
𝛾3
Effect After Adjustment
If we are interested in the δ̂ did|L estimator, however, we
can use the (robust) variance estimate directly from the It is likely, however, that despite the need to adjust for confound-
regression output to obtain our variance and confidence ers, we may actually be interested in the marginal average treat-
interval around this value. Since this estimator is conditional ment effect in the treated; that is, in δ̂ did not δ̂ did|L. When we are
on the value of the covariates, Table 6 presents the δ̂ did|L confident about a lack of effect modification, the conditional
estimates for each joint case complexity — device stratum. DiD estimate can provide this information as we have seen.
The table also presents the simple average within device However, we can also estimate the unconditional DiD estimator
strata, and case complexity strata, and overall, for both the directly using the output of the above regression model without
linear and the Poisson models (no weighting). We can see making any assumptions about whether or not effect modifica-
that the linear model returns the same estimate ( γ̂ 3) for both tion exists. This process, called standardization in epidemiol-
the overall δ̂ did|L estimate and the stratum-specific estimates. ogy and marginalization in econometrics, requires some brief
This is because our data does not have any additive effect explanation.
measure modification, so despite the use of a product term Epidemiologists are likely familiar with direct and indirect
in the regression to allow for assessment of stratum-spe- standardization, where an outcome is updated to estimate the
cific effects, we find no variation. On the other hand, the outcome that would have been obtained if one or more key
data were simulated with log-scale (multiplicative) effect covariates had been distributed to match another population.
measure modification between case complexity and device Age-standardized mortality rates are a common example of
type; as a result, we see that the stratum-specific δ̂ did|L esti- this type of standardization. Here, we are interested in knowing
mates from the Poisson model vary slightly between strata what the DiD estimate would have been in this population if
and differ from the overall estimate, holding all covariates our confounders had been distributed randomly between the
constant. In addition, as we shall see in the next section, two groups. As such, we need to standardize the estimate to the
because of the multiplicative effect measure modification, distribution of the confounders in our study sample. When this
the (unweighted) average δ̂ did|L estimate holding covariates type of standardization is used to estimate a causal effect, it is
constant differs slightly from the (weighted) average 𝛿̂DiD called the g-formula or g-computation [11] — although those
estimate which accounts for the frequency of each stratum terms are more often used to refer more complex implementa-
in the sample. tions with time-varying exposures.

13
Current Epidemiology Reports

To obtain our standardized or marginalized estimate, we standardization components. The R code uses robust vari-
need to use the covariate distributions to estimate E(Y|Z,T) ance estimation, but we encourage researchers to consider
from the values of E(Y|Z,T,L) produced by our regression the use of bootstrapping to obtain valid, non-conservative
model. We could do this by hand using Equation (10): confidence intervals.
∑ ∑ [ ]
E[Y|T, Z] = E[Y|T, Z, L = l] ∗ P(L = l) = E Y|T, Z, Case complexity, Device ∗ f (Case complexity, Device) (10)
l l

Since we only have two confounders, this would be fairly Table 5 row 3 reports the estimated values of δ̂ did obtained
a reasonable approach, and code to obtain values of P(L = from the modeling procedure described above, as well as
l) for all confounder strata is included in Code Section 3.2. for the adjusted Poisson regression model (Table 5 row 4)
However, hand-calculation can be time-consuming and, described in the Supplementary Materials. Compared to the
whenever any confounder is continuous or the number of unadjusted analyses, we can see that there is a small amount
confounders is large, may be essentially impossible. of confounding bias away from the null which is reduced
A simpler approach is provided in Code Section 3.3. First, in our adjusted model, resulting in a slightly less negative
we make copies of our data such that there is one copy for effect estimate in the adjusted analyses (rows 3–4). Second,
each strata of E[Y|T,Z] that we need to compute; here, and in when there is no effect measure modification on the scale
most DiD analyses, we have 4 T-Z groups, so we need 4 cop- of the model by any of the covariates, we expect that the
ies of the dataset. Remember that the pre-COVID time point marginal and conditional estimates will be equal, and this is
is only for getting a better pre-intervention trend and not for what we see in the data with the conditional DiD estimates
estimating the DiD, so we must exclude surgeries in this (Table 6). Our synthetic data was created with effect
time point from our copies. For each copy, we delete their measure modification on the log scale (i.e., multiplicative
outcome value, and replace their period and group assign- effect measure modification) but not on the linear scale
ment with one of the four T-Z combinations. This provides (i.e., no additive effect measure modification), and therefore
us with a new version of our dataset in which every indi- the marginal and conditional estimates differ when using
vidual has been assigned to every group of interest. Now, we adjusted Poisson models (row 4) but not when using adjusted
can compute, for each copy of each observation, the linear linear models (row 3).
predictor (E[Y|T,Z,L]) based on the covariate values of the
original observation and the new T-Z assignment. Finally,
we simply average the predicted value within each set of Summary
copies, such that, for example, the predictors for copies with
T = 0 and Z = 0 are averaged to provide an estimate of E[Y|T Overall, using the synthetic dataset, we estimate that
= 0, Z = 0]. Since the predicted value of the outcome within the effect of the DBT intervention on the intervention
T-Z levels will be the same for all individuals who have the departments was to decrease their average NOTECHS
same confounder values, this process is simply an automa- score by about 1.2 points on a scale from 4 to 48 (95%
tion of the weighted summation in Equation (10). confidence interval: −1.3, −1.16). If this were the real
In Stata, this process has been automated in the mar- data, we might conclude that the DBT slightly decreased
gins package. However, one note of caution: since we have surgical performance immediately following adoption and
included time as a variable in our model, and since time recommend additional training. However, despite significant
varies in a deterministic way with the value of A, we must be result, a change in 1.2 points on a 44-point scale is quite
careful in how we apply the margins command. Specifically, small, so we might also conclude that the DBT did not
we need to exclude the pre-COVID baseline surgeries from meaningfully alter surgical quality in the month immediately
the marginalization. Code Section 3.4 in Stata provides the following introduction of the DBT in the intervention group.
appropriate margins command and selects only the relevant It is important to remember that our synthetic data is not
pieces of the command output for our example. the real data — the original study found a null result and
Finally, we can use our standardized predicted outcomes concluded the DBT had no short-term impact on surgical
to compute the desired estimator, δ̂ did, using Equation (3). quality in the first month of use [6].
Due to the multistep nature of this process, we will need to This manuscript is intended to provide a worked example of
use bootstraps to obtain valid confidence intervals in SAS the estimation of the average causal effect of treatment among
(Supplementary Code Section A2, SAS code only), while the treated using the difference-in-differences methods when
in Stata, the margins command uses the Delta method to applied to epidemiologic or medical data. All data and code
obtain a valid variance that accounts for the regression and used in this tutorial are available online at www.​github.​com/​

13
Current Epidemiology Reports

elean​ormur​ray/​DiDTu​torial and as Online Supplementary 2.• Wing C, Simon K, Bello-Gomez RA. Designing difference
Materials accompanying this article. The GitHub repository in difference studies: best practices for public health policy
research. Annu Rev Public Health. 2018;39(1):453–69. https://​
also includes Student and Teacher handouts, and the version doi.​org/​10.​1146/​annur​ev-​publh​ealth-​040617-​013507. Refer-
at the time of publication acceptance are archived on Zenodo ence 2 explains the benefits of using DiD methods to estimate
at https://​doi.​org/​10.​5281/​zenodo.​81789​59. effects in public health contexts, particularly policy settings.
3. Lopez Bernal J, Cummins S, Gasparrini A. Difference in differ-
Supplementary Information The online version contains supplemen- ence, controlled interrupted time series and synthetic controls.
tary material available at https://d​ oi.o​ rg/1​ 0.1​ 007/s​ 40471-0​ 23-0​ 0327-x. Int J Epidemiol. 2019;48(6):2062–3. https://​acade​mic.​oup.​com/​
ije/​artic​le/​48/6/​2062/​54190​48.
Acknowledgements The authors would also like to thank the Boston 4.• Caniglia EC, Murray EJ. Difference-in-difference in the time
University IdeaHub and Gwenn Fairall for their support in coordinating of cholera: a gentle introduction for epidemiologists. Curr Epi-
this project. No code is expected to be perfect, and while we have vetted demiol Rep. 2020;7(4):203–11. https://​doi.​org/​10.​1007/​s40471-​
the code intensively, we expect there remain residual bugs or possible 020-​00245-2. Reference 4 explains in detail the causal infer-
ways to improve the code; if you find a bug or an improvement for our ence assumptions underlying DiD from an epidemiologic
code, please submit a ticket via GitHub or contact the corresponding perspective.
author at ejmurray@bu.edu. We would also like to thank Dr Jeff Wool- 5. Hamad R, Collin DF, Rehkopf DH. Estimating the short-term
ridge and others at #econtwitter for engaging with our query about the effects of the earned income tax credit on child health. Am J
use of non-linear models for DiD in econometrics. Epidemiol. 2018;187(12):2633–41. https://​acade​mic.​oup.​com/​
aje/​artic​le/​187/​12/​2633/​50909​55.
Funding This work was supported in part by funding from Johnson & 6. Etheridge J, Moyal-Smith R, Lim S, Yong T, Han H, Sonnay Y, et al.
Johnson and Ariadne Labs. Utility of a device briefing tool to improve surgical safety. 2022;
7. Tomy L, Veena G. A retrospective study on Skellam and related
Compliance with Ethical Standards distributions. Aust J Stat. 2022;51(1):102–11. https://​ajs.​or.​at/​
index.​php/​ajs/​artic​le/​view/​1224.
Conflicts of Interest Neither Johnson & Johnson nor Ariadne Labs re- 8. Wooldridge JM. Two-way fixed effects, the two-way Mundlak
quired approval of current manuscript. regression, and difference-in-differences estimators. Rochester,
NY; 2021. https://​papers.​ssrn.​com/​abstr​act=​39063​45. Accessed
Human and Animal Rights and Informed Consent This article does not 24 July 2023.
contain any studies with human or animal subjects performed by any 9. Wooldridge JM. I spent an entire session on this in my recent
of the authors. DiD webinar. You do not want to use a linear model. Ellie, your
outcome is almost certainly not Poisson distributed. But you
should use the Poisson QMLE. And the TEs and proper standard
Open Access This article is licensed under a Creative Commons Attri- errors are not hard to get. 2021. https://​twitt​er.​com/​jmwoo​ldrid​
bution 4.0 International License, which permits use, sharing, adapta- ge/​status/​14759​77204​76073​1653. Accessed 24 July 2023.
tion, distribution and reproduction in any medium or format, as long 10. Wooldridge JM. This is actually an important misconception
as you give appropriate credit to the original author(s) and the source, and, in the binary case, is at the heart of the debate between the
provide a link to the Creative Commons licence, and indicate if changes Ai-Norton “interaction effect” and the Puhani treatment effect.
were made. The images or other third party material in this article are The Ai-Norton is exactly the same as the linear no matter which
included in the article's Creative Commons licence, unless indicated nonlinear model you use. 2021. https://​twitt​er.​com/​jmwoo​ldrid​
otherwise in a credit line to the material. If material is not included in ge/​status/​14762​27817​43624​1920. Accessed 24 July 2023.
the article's Creative Commons licence and your intended use is not 11. Hernan MA, Robins J. Causal inference: what if. Boca Raton:
permitted by statutory regulation or exceeds the permitted use, you will Chapman & Hill/CRC; 2020.
need to obtain permission directly from the copyright holder. To view a
copy of this licence, visit http://​creat​iveco​mmons.​org/​licen​ses/​by/4.​0/. Publisher’s Note Springer Nature remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.

References

Papers of particular interest, published recently, have


been highlighted as:
• Of importance

1. Snow J. On the mode of communication of cholera. John


Churchill; 1855. p. 216.

13

You might also like