Missing Values
Missing Values
Abstract
In multilevel settings such as individual participant data meta-analysis, a variable is ‘systematically missing’ if it is wholly
missing in some clusters and ‘sporadically missing’ if it is partly missing in some clusters. Previously proposed methods to
impute incomplete multilevel data handle either systematically or sporadically missing data, but frequently both patterns
are observed. We describe a new multiple imputation by chained equations (MICE) algorithm for multilevel data with
arbitrary patterns of systematically and sporadically missing variables. The algorithm is described for multilevel normal
data but can easily be extended for other variable types. We first propose two methods for imputing a single incomplete
variable: an extension of an existing method and a new two-stage method which conveniently allows for heteroscedastic
data. We then discuss the difficulties of imputing missing values in several variables in multilevel data using MICE, and
show that even the simplest joint multilevel model implies conditional models which involve cluster means and
heteroscedasticity. However, a simulation study finds that the proposed methods can be successfully combined in
a multilevel MICE procedure, even when cluster means are not included in the imputation models.
Keywords
Missing data, multilevel model, multiple imputation, chained equations, fully conditional specification, individual patient
data meta-analysis
1 Introduction
Multiple imputation (MI) is a popular approach for handling the pervasive problem of missing data in
biostatistics.1 MI uses the distribution of the observed data to estimate a set of plausible values for the missing
data, usually under a missing at random (MAR) assumption.2 MI is a Bayesian procedure: given a joint prior
distribution for the observed data and a specific data model, we obtain a posterior distribution of the missing
values given the observed data. Missing data in a single variable are straightforward to impute, for example using
an appropriate generalised linear model,3,4 but missing data in several variables present a more complex problem
except on the rare occasions when data are monotone missing. There are two standard approaches. Firstly, a joint
model may be assumed for the data. Most commonly, a multivariate normal model is assumed, as implemented in
standard software.5–7 This is surprisingly effective despite being mis-specified for categorical data.8 Secondly,
multiple imputation by chained equations (MICE), also known as multiple imputation by fully conditional
specification, specifies a suitable conditional imputation model for each incomplete variable and iteratively
imputes until convergence.3,9 The theoretical properties of MICE are not well understood: except in simple
cases, conditional imputation models do not correspond to any joint model.10,11 Despite this it performs well
1
Service de Biostatistique et Information Médicale, Hôpital Saint-Louis, Paris, France
2
Université Paris Diderot – Paris 7, Sorbonne Paris Cité, Paris, France
3
ECSTRA Team, INSERM, Paris, France
4
MRC Biostatistics Unit, Cambridge Institute of Public Health, Cambridge, UK
Corresponding author:
Matthieu Resche-Rigon, Service de Biostatistique et Information Médicale, Hôpital Saint-Louis, AP-HP, 1 avenue, Claude Vellefaux 75475, Paris Cedex
10, France.
Email: matthieu.resche-rigon@univ-paris-diderot.fr
Resche-Rigon and White 1635
in practice,12,13 especially when the conditional imputation models are well accommodated to the substantive
model.14 MICE is also implemented in standard software.6,15,16
Many datasets have a multilevel or clustered structure. Imputation that ignores this clustering leads to
underestimation of the magnitude of clustering and hence underestimated standard errors, even if the analysis
does allow for clustering.17 However, imputation that allows for the clustering through fixed effects of cluster
overestimates the magnitude of the clustering and hence overestimates standard errors.18,19
There is therefore a need for advanced imputation methods that allow for the clustering. Schafer
and Yucel proposed a Gibbs sampler to generate MIs of continuous missing variables from a joint
multivariate linear mixed model:20,21 their method is implemented in the PAN package.22 REALCOM
software23 and the R package jomo24 extended this approach allowing missing data at any level and
handling categorical data through latent normal variables. van Buuren proposed extending MICE to
perform multilevel imputation by a Bayesian procedure.25,26 More recently, extensions were proposed to
impute level 2 variables.27
In this paper we focus on imputation of two-level clustered data, such as are found in individual participant
data (IPD) meta-analysis28 (where the cluster is the study) or in multi-centre studies (where the cluster is the
centre). A feature of such data is that some variables may be ‘systematically missing’ – that is, missing for all
individuals in one or more clusters.29 This arises in particular in IPD meta-analysis when a potential confounding
variable is not collected in one or more studies. Variables may also be ‘sporadically missing’ – that is, missing for
some but not all individuals in one or more clusters.29 Unfortunately, the multiple imputation methods and
packages presented above are currently not able to handle systematically missing data except for jomo.
Therefore, methods to handle systematically missing data have been proposed for continuous data29 and
recently extended to binary and discrete data.30 These approaches handle the clustering by using generalized
linear mixed models to impute data; they differ only in how they model the uncertainty around the between-
cluster covariance parameters. However, they only deal with systematically missing data. The aim of this paper is
to propose methods that simultaneously handle systematically and sporadically missing data.
The paper is organised as follows. Section 2 describes imputation of a single incomplete variable and makes two
innovations. Firstly, it extends the method previously described for systematically missing data29 to also handle
sporadically missing data. Secondly, it proposes a new two-stage algorithm which conveniently allows for
heteroscedasticity between clusters in the linear mixed model, as recommended by van Buuren in the
sporadically missing case.25 The two-stage algorithm also offers an easy extension to handle categorical data
alongside normal data, although we here evaluate it only for normal data. In section 3, we explore the sorts of
conditional models needed in multilevel MICE and explain why it is important to account for heteroscedasticity.
In section 4, we evaluate the performance of the multilevel imputation algorithms in multilevel MICE. We
illustrate the methods in an IPD meta-analysis in cardiovascular disease in section 5. Finally, section 6 provides
a discussion.
where Xi is a ni p matrix of variables associated with the outcome yi via , a p 1 vector of fixed effects, and Zi
(typically equal to or contained within Xi) is a ni q matrix of variables associated with yi via bi, a q 1 vector of
random effects. We model bi Nð0, b Þ where b is the q q variance-covariance matrix of the random effects,
and the ni 1 error vector as ei Nð0, i Þ with i ¼ i2 Iðni Þ. Under this model, the posterior distribution of bi
given yi is bi j yi Nðmð yi Þ, vð yi ÞÞ, with mð yi Þ ¼ b ZTi ðZi b ZTi þ i Þ1 ð yi Xi Þ and vð yi Þ ¼ b
b ZTi ðZi b ZTi þ i Þ1 Zi b .20,31,32
Suppose y ¼ ð yT1 , yT2 , . . . , yTN ÞT contains missing data ymis. yi is systematically missing if it is missing for all
individuals in cluster i; it is sporadically missing if it is only partly incomplete.
Our goal is to generate independent R draws under a MAR assumption from a posterior predictive distribution
for the missing data pð ymis j yobs Þ ¼ pð ymis j yobs , Þ pðj yobs Þd where ¼ ð, b , fi gÞ is the vector of parameters in
1636 Statistical Methods in Medical Research 27(6)
the linear mixed model (1) and pðj yobs Þ is the observed data posterior density of .1 In practice, this may be
achieved (with implicit vague priors) by:
(1) fitting the model pð yjÞ to the units with observed y, yielding an estimate (typically an MLE) ^ with an
estimated variance-covariance matrix S ;
^ S Þ;
(2) drawing a value of , , from its posterior, approximated as Nð,
mis
(3) drawing values of y from pð yj Þ.
In the next two subsections, we propose two approaches to perform step 1. The first approach uses one-stage
estimation of the model parameters.33 The second approach, two-stage estimation, first estimates parameters
within each cluster separately using the same model. Then, it applies multivariate meta-analysis methods using
the summary statistics obtained at the first stage.34 One-stage and two-stage approaches give similar results in
individual patient data meta-analyses,35 except when there are relatively few studies or the studies are small
(especially with binary outcomes).36 The two approaches both fit versions of model (1), but extra assumptions
are convenient in each approach: in one-stage estimation we consider a homoscedastic model, i.e. a constant i for
all clusters, while in two-stage estimation we assume Zi ¼ Xi.
This approach, recently extended by Jolani in a generalised linear model framework,30 is close to van Buuren’s
method for sporadically missing data which used a Gibbs sampler to generate and bi .25 However, van Buuren
argued that the imputation quality could be improved by allowing the within cluster variance to vary over the
0 2
clusters, and modelled i 1= , where 0 and are hyperparameters for the location of prior belief about
residual variance and a measure of variability respectively.
Implementation
We fit the mixed model using the lme() function from the nlme package in R 3.1.1.39,40 This does not report the
covariance between and ð b , Þ, so in practice we draw and ð b , Þ independently.
with ei Nð0, i2 Iðni ÞÞ. This is model (1) with Zi ¼ Xi (i.e. every variable in the model has a random effect) and
i ¼ þ bi ; we propose how to allow Zi 6¼ Xi in the discussion. We approximate ^i Nði , Si Þ and
log ^i Nðlog i , Si Þ. The between-studies model is i Nð, Þ, where is the same as b in section 2.1,
and log i Nðlog , Þ, where is an average within-cluster standard deviation. Then the marginal models are
^i Nð, Si þ Þ and log ^i Nðlog , Si þ Þ.
The main steps of the two-stage imputation procedure are:
(1) (a) For each cluster i without systematically missing data, fit model (2) using yobs i (complete case analysis).
This gives ^i with variance Si , and log ^i whose variance Si is obtained by noting that ^i2 follows a 2
distribution and using the delta method.
(b) Perform a multivariate meta-analysis using the ^i and Si from each cluster without systematically
missing data. Using a REML estimator, obtain , ^
^ and their variance-covariance matrices S and
S . is parameterised via its Cholesky decomposition chol .
(c) Perform a univariate meta-analysis using the ^ i and Si from each cluster without systematically missing
data. Using a REML estimator, obtain log , ^ ^ and their variance-covariance matrices S and S .
^ chol ^
(2) Draw using Nð, S Þ and using Nð , Schol Þ. Hence set ¼ chol
chol
ðchol ÞT : the use of the Cholesky
decomposition ensures a positive semi-definite . Similarly draw log and using Nðlog ,
^ S Þ and
Nð^ , S Þ.
(3) Draw the missing yij:
(a) For clusters with systematically missing data, draw i using Nð , Þ and draw i using
log i Nðlog , Þ.
(b) For clusters with sporadically missing data, draw i from its posterior given and ,
1 1 1 b 1 1
i N ð1 þ Si Þ ð 1
þ S
i i Þ, ð 1
þ Si Þ
and draw log i from its posterior given and log
log = þ log ^i =Si 1
log i N , :
1= þ 1=Si 1= þ 1=Si
(c) For each missing yij, draw eij Nð0, i2 Þ and set yij ¼ xij i þ eij .
The REML estimation proposed in step 1 can be computationally slow. We therefore propose the method of
moments (MM) as a simple and computationally efficient alternative. In step 1(c), we use the univariate MM
described in Dersimonian and Laird.43 In step 1(b), we use the multivariate extension proposed by Jackson et al.38
^ and ¼ .
The MM does not allow estimation of S and S , so we are forced to modify Step 2 by setting ¼ ^
Implementation
Multivariate meta-analysis is performed using the mvmeta package in R 3.1.1.34,44,45
We prove in Appendix 1 that the conditional expectation of x2ij depends on x1ij and x 1i , the mean of x1ij within
cluster i, and on ni, the size of cluster i (equation (8)). Similarly, the conditional variance of x2ij depends on ni
(equation (9)). Appendix 2 generalises this result to the case of many incomplete variables by letting x1ij and x2ij be
vectors. These results suggest (1) that we should incorporate the cluster means of the predictors in the imputation
model and (2) that if cluster sizes vary then the level-1 variance should vary between clusters, i.e. that the
imputation model should be heteroscedastic. Thus, a simple joint model implies complex conditional models:
this is unlike the situation with independent observations, when a joint normal model implies simple conditional
models.11
Many settings, including the simulation setting of section 4, include random slopes: for example, we might
combine model (3) with a random slopes model regressing yij on x1ij and x2ij . Under this joint model, the cluster-
specific density of (x, y) is multivariate normal, but the marginal density is not. Deriving the conditional
distributions in this model is intractable, and we anticipate that the problems identified above will be joined
by others.
4 Simulation study
The methods proposed in section 2 make certain approximations, and section 3 suggests that combining them
into a MICE procedure involves mis-specification of the conditional models. We therefore conduct a
simulation study aiming (a) to explore and compare the performance of the methods in a MICE procedure,
and (b) to explore whether their performance is improved by including cluster means or heteroscedasticity in
the imputation model.
with eij Nð0, i2 Þ, log i Nðlog , Þ and bi ¼ ðb0i , b1i , b2i Þ Nð0, b Þ. Our parameter values for the base case
were 0 ¼ 0, 1 ¼ 0:5 and 2 ¼ 1; b had all diagonal elements 0:52 and all correlations 0.3; ¼ 0:5 and ¼ 0,
so the outcome model was homoscedastic with intra-cluster correlation (conditional on x1ij ¼ x2ij ¼ 0) of 0.5.
Following Appendices 1 and 2, Eðx1ij jx2ij , yij Þ and Eðx2ij jx1ij , yij Þ may depend on y i and on x 2i and x 1i
respectively. We estimate this dependency using a large simulated data set of 1000 clusters of size 100.
Resche-Rigon and White 1639
Estimated coefficients of y i and x 2i in the model for x1ij and of yi and x 1i in the model for x2ij were 0.020, 0.049,
0.007 and 0.020 respectively, much smaller than the coefficients of yi, x1i , x2i which were all greater than 0.12.
The total number of patients was fixed at 2000. The number of clusters was N ¼ 20 yielding ni ¼ 100 patients per
cluster. A total of 1000 independent datasets were generated for each setting.
4.4 Analysis
In every case, the analysis model was equation (4). The data were analysed before data deletion as a benchmark for
the multiple imputation procedure, and after data deletion using complete cases. After imputation, the model was
estimated in each imputed data set using the REML estimator of the lmer() function of the lme4 R-package47 and
the estimates were combined using Rubin’s rules.1 The performance of each method was assessed by computing
the empirical mean of the parameter estimates, root mean square estimated standard error (Model SE), empirical
Monte Carlo standard error (Emp SE), the coverage of nominal 95% confidence intervals (95%CI), and the mean
run time per simulated data set.
4.5 Results
In the base case (Table 1), point estimates are slightly negatively biased for 1 ¼ 0:5 and unbiased for 2 ¼ 1 for
two-stage methods. Empirical standard errors are lower for multiple imputation approaches than for complete
case analysis, reflecting the better use made of the data. Model-based standard errors are below empirical standard
errors, in particular for the one-stage approach. The underestimated standard errors appear to arise from
underestimated random effect variances. Moderate under-coverage is observed, and is only partly explained by
the underestimated standard errors: only two-stage methods achieve coverage within 5% of the nominal 95%.
Mean run times are in favour of the two-stage MM especially compared to the two-stage REML. Including cluster
means in imputation models has little impact on results (lower part of Table 1).
We next modified the data generating mechanism to explore performance when thecoefficients of cluster means
0:172 0:52 0:17
are clearly different from 0. To do this we set ":1 ¼ 0:52 , ":2 ¼ 0:172 and u ¼ .
0:52 0:17 0:52
Using a large simulated data set, the estimated coefficients of y i and x 2i in the model for x1ij and of yi and x 1i in the
model for x2ij were 0.16, 0.87, 0.22 and 0.95 respectively, larger than the corresponding coefficients of yi,
x1i , x2i which were 0.30, 0.60, 0.07 and 0.12, respectively.
1640 Statistical Methods in Medical Research 27(6)
1 0:5
Table 1. Base-case simulations. u ¼ 0:52 , ":1 ¼ ":2 ¼ 0:5, ":1 ¼ ":2 ¼ 0.
0:5 1
pffiffiffiffiffiffiffiffiffi pffiffiffiffiffiffiffiffiffi
1 b11 2 b22
Mean
Est Model Emp Est Est Model Emp Est run
Method mean SE SE Cover mean mean SE SE Cover mean time
Est mean: mean estimate over imputed data sets; Model SE: standard error derived from Rubin’s rules (mean over imputed data sets); Emp SE: standard
deviation of estimate over imputed data sets; Cover: coverage of nominal 95% confidence interval.
0:172 0:52 0:17
Table 2. Simulations when the coefficients of cluster means are clearly different from 0. u ¼ ,
0:52 0:17 0:52
":1 ¼ 0:52 , ":2 ¼ 0:172 , ":1 ¼ ":2 ¼ 0.
pffiffiffiffi pffiffiffiffi
1 b11 2 b22
Mean
Est Model Emp Est Est Model Emp Est run
Method mean SE SE Cover mean mean SE SE Cover mean time
The bias in 1 and 2 remains small, but seems smaller for two-stage approaches (Table 2). The MI methods
underestimate b11 (especially with one-stage approaches) and overestimate b22 (especially with two-stage
approaches), leading to mis-estimated model-based standard errors. In this setting, in which cluster means
could have an impact on the efficiency of the imputation model, including cluster means has very little effect on
the final results, with only a small reduction in model-based standard errors.
We next modified the data generating mechanism of Table 2 by introducing unequal cluster sizes, with 10
clusters of size 160 and 10 clusters of size 40. Results (Table 3) were very close to those obtained in Table 2, and
two-stage methods seem to be less biased than one-stage approach.
Finally, Table 4 reports a simulation study using the same parameter values as in Table 2, but with
heteroscedasticity on X1 and X2 introduced by setting ":1 ¼ 0:172 , ":2 ¼ 0:062 . Results were still less biased
Resche-Rigon and White 1641
Table 3. Simulations when the coefficients of cluster means are clearly different from 0 and cluster sizes are unequal (10 clusters of
0:172 0:52 0:17
size 160, 10 clusters of size 40). u ¼ , ":1 ¼ 0:52 , ":2 ¼ 0:172 , ":1 ¼ ":1 ¼ 0.
0:52 0:17 0:52
pffiffiffiffi pffiffiffiffi
1 b11 2 b22
Mean
Est Model Emp Est Est Model Emp Est run
Method mean SE SE Cover mean mean SE SE Cover mean time
with two-stage methods than with the one-stage approach. b22 was largely overestimated. Model-based standard
errors obtained with the two-stage approaches were overestimated.
6 Discussion
There is increasing need for ways to handle missing values in multilevel structures, notably with the development
of meta-analysis of IPD.50 IPD, whether from randomised clinical trials or observational studies, have the
1642 Statistical Methods in Medical Research 27(6)
Table 4. Simulations when the coefficients of cluster means are clearly different from 0 and with heteroscedasticity on X1 and X2.
0:172 0:52 0:17
u ¼ , ":1 ¼ 0:52 , ":2 ¼ 0:172 , ":1 ¼ 0:172 , ":2 ¼ 0:062 .
0:52 0:17 0:52
pffiffiffiffi pffiffiffiffi
1 b11 2 b22
Mean
Est Model Emp Est Est Model Emp Est run
Method mean SE SE Cover mean mean SE SE Cover mean time
Patients
Variables Values (N ¼ 4546) Statistics
Note: Categorical variables are presented by counts and percentages and quantitative variable by
their median and quartiles.
BMI: body mass index; SBP: systolic blood pressure; DBP: diastolic blood pressure; HR: heart rate;
BNP: brain natriuretic peptide; LVEF: left ventricular ejection fraction.
Cohort 1 2 3 4 6 7 8 9 10 11 12 13
Sample size 410 567 210 375 107 267 203 354 137 48 1790 78
Percentage missing
Gender 0 0 0 0 0 0 0 0 0 0 0 0
Age 0 0 0 0 0 0 0 1 0 0 0 0
BMI 36 19 43 2 1 100 0 44 1 100 21 60
SBP 1 2 1 1 0 100 1 16 0 0 1 0
DBP 2 3 2 1 0 100 2 16 0 0 1 0
HR 3 1 1 2 0 0 1 19 0 4 0 0
BNP 57 1 0 4 100 100 0 12 0 100 93 100
LVEF 0 0 0 0 0 0 0 0 0 0 0 0
SE SE SE SE SE SE SE
Complete case 6.869 0.803 0.200 0.030 0.184 0.064 0.167 0.016 0.176 0.028 0.060 0.016 9.887 1.176
Imputation methods, excluding cohort mean in imputation model
One-stage 5.669 0.475 0.165 0.019 0.233 0.043 0.124 0.012 0.124 0.023 0.048 0.009 10.359 1.112
Two-stage 5.795 0.474 0.154 0.019 0.226 0.048 0.129 0.012 0.132 0.023 0.050 0.009 9.313 2.034
Two-stage MM 5.824 0.463 0.156 0.020 0.243 0.043 0.127 0.010 0.127 0.017 0.051 0.009 9.816 1.900
Imputation methods, including cohort mean in imputation model
One-stage 5.662 0.503 0.170 0.020 0.230 0.049 0.123 0.011 0.123 0.019 0.049 0.009 10.096 1.666
Two-stage 5.737 0.455 0.158 0.020 0.243 0.045 0.125 0.010 0.125 0.020 0.050 0.009 6.440 2.803
Two-stage MM 5.950 0.484 0.152 0.020 0.250 0.047 0.137 0.013 0.141 0.023 0.049 0.009 6.302 2.672
advantage of facilitating consistent definitions of outcomes, exposures and confounders and consistent analyses
between studies. Nevertheless, the availability of confounders typically varies between studies. In this paper, we
proposed a method to handle both systematically and sporadically missing covariates in a two-level structure using
MICE. We explored the conditional imputation models needed in multilevel MICE and showed that cluster means
and heteroscedasticity should be considered in the imputation model. We proposed a new two-stage approach in
which we first fit the imputation model within each cluster and then combine the results using multivariate random
effects models. Our simulation studies showed broadly good performance for the MICE methods, a very low
impact of including the cluster means, and an advantage for heteroscedastic models. Small biases and slight under-
coverage were observed in the presence of only systematically missing data.29 These biases disappeared when only
sporadically missing covariates were considered, suggesting that they are linked to the presence of systematically
missing covariates (Appendix 3). Moreover, the two-stage approach using the MM was more computationally
efficient.
Our work extended a previous algorithm that we proposed to handle systematically missing data.29 In recent
work, Jolani et al. also extended our one-stage approach to handle systematically missing binary data.30 The
approaches differ in the way uncertainty is introduced around the estimated between-cluster covariance
parameters, whose sampling distribution is skewed: we used a general log-transformation for the variances and
Fisher’s transformation for correlations, as proposed by Pinheiro and Bates37 and applied in some current
software packages for fitting non-linear mixed models (e.g. nlme in R), whereas Jolani et al. used a Bayesian
framework with a Wishart distribution for the inverse between-cluster covariance.30
To our knowledge, no other package can handle both systematically and sporadically missing data except the
recent jomo R-package.24,51 Other methods developed to impute one type of missing data could be modified to
handle the other type. In particular, the 2 l.norm routine15 for sporadically missing data in MICE R-package could
be adapted using step 3(a) and 3(b) of the one stage approach, although using a MCMC algorithm at each step of
the chained equations algorithm seems to be computationally laborious.30
Another alternative is Gibbs sampling. Schafer proposed a Gibbs sampler to generate multiple imputations of a
single continuous incomplete variable from a joint multivariate linear mixed model;20 this approach was
implemented in the PAN package22 and also in the MICE package.15 The REALCOM package23 and the
recent jomo R-package24 perform multilevel joint modelling multiple imputations, handling binary and
categorical variables through latent normal variables. The jomo R-package is also able to use heteroscedastic
framework, and should be in that sense close to our two-stage approach.
The multilevel MICE approach has several problems, especially the heteroscedasticity of the conditional
imputation models demonstrated in section 3.2. Our simulation studies showed that with heteroscedastic
models our two-stage methods (which naturally allow heteroscedasticity) outperform the one-stage method. It
would be possible to introduce heteroscedasticity in the one-stage model, thus avoiding any issues due to small
studies or small number of studies,36 but in our current view this is computationally laborious and unrealistic
within a MICE procedure. To our knowledge only the 2 l.norm routine in the MICE package allows for
heteroscedasticity.25 More recently, Yucel described a heteroscedastic imputation model for imputing
1644 Statistical Methods in Medical Research 27(6)
multivariate multilevel continuous data.21 Similarly, Jolani argued for a cluster-specific error variance which they
related to cluster-level characteristics.30
Another potential problem is compatibility between the full conditional distribution of each incomplete variable
(represented by the specification of variable-by-variable imputation models) and the global joint distribution of the
multivariate missing data. Ideally, the conditional distribution should be derived from a joint probability
distribution.26,52 Recent studies have identified compatibility in simple cases, notably for the linear model
without any random effect,10 but the MICE approach is usually validated only by simulation studies, and the
results appear to be robust even with incompatible models.26,53 In this work, we studied simulations with only two
missing covariates but extension of our results with more missing variables should be valid provided that
conditional imputation models are well specified. Moreover, having a valid joint distribution may be
‘‘less important than incorporating information from other variables and unique features of the dataset’’.54
This could partially explain the good results observed for our method even without including the cluster means
in the imputation model. Another possible explanation is that x 1i and x 2i are normally distributed and can
therefore be absorbed into the random intercept term.
The number of random effects that we can consider is a further problem, especially when most studies have
systematically missing data. We previously showed that random effects on the intercept and on the factor of
interest could be sufficient to produce good results.29 A large number of random effects can make the
computational time prohibitive. Nevertheless, we prefer to put a random effect on all the variables, and the
two-stage method with the simple and computationally efficient MM is a good alternative despite the fact that
we are forced to ignore uncertainty in the variance components and . The computational advantage becomes
much greater as the number of random effects increases. Of course, our methods could benefit from parsimonious
modelling of the variance-covariance matrix of random effects in the imputation model (e.g. modelling
independent random effects). One could also constrain some components of the random effect covariance
matrix to zero, based on empirical covariance matrix estimates.23 Such constraints could be directly
implemented in the two-stage methods by forcing some between-studies variances to be zero in the second step.
In practice, this corresponds to considering a smaller set of variables in Zi than in Xi. The two stage approach also
allows level-2 variables to be included through meta-regression in the second stage.
An important extension of our imputation model is to impute categorical variables. In the past, this has
been done in a MICE framework using random-effects logistic imputation models,30 and in a joint model
framework using latent continuous normally distributed variables.24 Our two-stage approach would be
particularly simple to modify for any generalised linear or other regression model, and this is the subject of
future work.
In conclusion, MICE seems to be a valuable approach to handle both systematically and sporadically missing
data. We proposed three methods, but only the two-stage methods easily take account of the potential
heteroscedasticity in the imputation model, and they outperform the one-stage approach in some settings.
The two-stage methods perform quite similarly, but the computational time needed to fit a REML estimator is
a disadvantage. Therefore, the two-stage method with the MM approach appears to be a solid and efficient
alternative.
Acknowledgements
We thank the Global Research on Acute Conditions Team (GREAT, http://www.greatnetwork.org) for allowing the use of
their data.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article:
MR-R was supported by Agence Nationale de Sécurité du Médicament et des Produits de Santé. IRW was supported by the
Medical Research Council [Unit Programme number U105260558].
Resche-Rigon and White 1645
References
1. Rubin DB. Multiple Imputation for Nonresponse in Surveys, Wiley Series in Probability and Statistics. New York: Wiley,
1987.
2. Little R and Rubin D. Statistical analysis with missing data, 2nd ed. Hoboken, NJ: Wiley, 2002.
3. White I, Royston P and Wood A. Multiple imputation using chained equations: issues and guidance for practice. Stat Med
2011; 30: 377–399.
4. Carpenter J and Kenward M. Multiple imputation and its application. New York: John Wiley & Sons, 2012.
5. Novo A and Schafer J. Package NORM, R package version 1.0-9.5, 2015, http://CRAN.R-project.org/package¼norm
6. StataCorp. Stata multiple-imputation reference manual: release 12. College Station, TX: Stata Press, 2011.
7. SAS Institute Inc. SAS/STAT 9.1 user s guide. Chapter 46. Cary, NC: SAS Institute Inc., 2004.
8. Schafer J. Analysis of incomplete multivariate data. Boca Raton: Chapman & Hall/CRC, 1997.
9. van Buuren S, Boshuizen H and Knook D. Multiple imputation of missing blood pressure covariates in survival analysis.
Stat Med 1999; 18: 681–694.
10. Chen H. Compatibility of conditionally specified models. Stat Probab Lett 2010; 80: 670–677.
11. Hughes R, White I, Seaman S, et al. Joint modelling rationale for chained equations. BMC Med Res Meth 2014; 14: 28.
12. van Buuren S, Brand J, Groothuis-Oudshoorn C, et al. Fully conditional specification in multivariate imputation. J Stat
Comput Simul 2006; 76: 1049–1064.
13. van Buuren S. Multiple imputation of discrete and continuous data by fully conditional specification. Stat Meth Med Res
2007; 16: 219–242.
14. Bartlett JW, Seaman SR, White IR, et al. Multiple imputation of covariates by fully conditional specification:
accommodating the substantive model. Stat Meth Med Res 2015; 24: 462–487.
15. van Buuren S and Groothuis-Oudshoorn K. MICE: multivariate imputation by chained equations in R. J Stat Softw 2011;
45: 1–67.
16. Raghunathan T, Solenberger P and van Hoewyk J. IVEware: imputation and variance estimation software, 2016, http://
www.isr.umich.edu/src/smp/ive
17. Taljaard M, Donner A and Klar N. Imputation strategies for missing continuous outcomes in cluster randomized trials.
Biometr J 2008; 50: 329–345.
18. Andridge R. Quantifying the impact of fixed effects modeling of clusters in multiple imputation for cluster randomized
trials. Biometr J 2011; 53: 57–74.
19. Drechsler J. Multiple imputation of multilevel missing datarigor versus simplicity. J Educ Behav Stat 2015; 40: 69–95.
20. Schafer J and Yucel R. Computational strategies for multivariate linear mixed-effects models with missing values.
J Comput Graph Stat 2002; 11: 437–457.
21. Yucel R. Random covariances and mixed-effects models for imputing multivariate multilevel continuous data. Stat Model
2011; 11: 351–370.
22. Zhao J and Schafer J. pan: Multiple imputation for multivariate panel or clustered data. R package version 1.3, 2015.
23. Carpenter J, Goldstein H and Kenward M. REALCOM-IMPUTE software for multilevel multiple imputation with mixed
response types. J Stat Softw 2011; 45: 1–14.
24. Quartagno M and Carpenter J. jomo: a package for multilevel joint modelling multiple imputation, 2016, R package version
2.2, http://CRAN.R-project.org/package¼jomo
25. van Buuren S. Multiple imputation of multilevel data. In: Hox J and Roberts J (eds) The handbook of advanced multilevel
analysis. New York: Taylor & Francis, Ltd, 2010, pp.173–196.
26. van Buuren S. Flexible imputation of missing data. Boca Raton: CRC Press, 2012.
27. van Buuren S and Groothuis-Oudshoorn K. MICE: multivariate imputation by chained equations in R, 2015, R package
version 2.25, http://CRAN.R-project.org/package¼mice
28. Stewart L and Parmar M. Meta-analysis of the literature or of individual patient data: is there a difference? Lancet 1993;
341: 418–22.
29. Resche-Rigon M, White I, Bartlett J, et al. Multiple imputation for handling systematically missing confounders in meta-
analysis of individual participant data. Stat Med 2013; 32: 4890–4905.
30. Jolani S, Debray T, Koffijberg H, et al. Imputation of systematically missing predictors in an individual participant data
meta-analysis: a generalized approach using MICE. Stat Med 2015; 34: 1841–1863.
31. West B, Welch K and Galecki A. Linear mixed models: a practical guide using statistical software. Boca Raton: CRC Press,
2007.
32. Lee Y, Nelder J and Pawitan Y. Generalized linear models with random effects: unified analysis via H-likelihood. Boca
Raton: CRC Press, 2006.
33. Laird N and Ware J. Random-effects models for longitudinal data. Biometrics 1982; 38: 963–974.
34. Jackson D, Riley R and White I. Multivariate meta-analysis: potential and promise. Stat Med 2011; 30: 2481–2498.
35. Riley R, Lambert P, Staessen J, et al. Meta-analysis of continuous outcomes combining individual patient data and
aggregate data. Stat Med 2008; 27: 1870–1893.
1646 Statistical Methods in Medical Research 27(6)
36. Debray TP, Moons KG, Abo-Zaid GMA, et al. Individual participant data meta-analysis for a binary outcome: one-stage
or two-stage? PLoS One 2013; 8: e60650.
37. Pinheiro J and Bates D. Mixed-effects models in S and S-PLUS. New York: Springer Science & Business Media, 2000.
38. Jackson D, White I and Thompson S. Extending DerSimonian and Laird’s methodology to perform multivariate random
effects meta-analyses. Stat Med 2010; 29: 1282–1297.
39. Pinheiro J, Bates D, DebRoy S, et al. nlme: linear and nonlinear mixed effects models, 2016, R package version 3.1-28,
http://CRAN.R-project.org/package¼nlme
40. R Core Team. R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna,
Austria, 2016, http://www.R-project.org/
41. Simmonds M, Higgins J, Stewart L, et al. Meta-analysis of individual patient data from randomized trials: a review of
methods used in practice. Clinical Trials 2005; 2: 209–217.
42. Thomas D, Radji S and Benedetti A. Systematic review of methods for individual patient data meta-analysis with binary
outcomes. BMC Med Res Meth 2014; 14: 79.
43. DerSimonian R and Laird N. Meta-analysis in clinical trials. Control Clin Trials 1986; 7: 177–188.
44. Gasparrini A. mvmeta: multivariate meta-analysis and meta-regression, R package version 0.4.5, 2011.
45. Gasparrini A, Armstrong B and Kenward M. Multivariate meta-analysis for non-linear and other multi-parameter
associations. Stat Med 2012; 31: 3821–3839.
46. Van Buuren S and Oudshoorn K. Flexible multivariate imputation by MICE. Technical report, Leiden, The Netherlands:
TNO Prevention Center, 1999, http://www.stefvanbuuren.nl/publications/Flexible%20multivariate%20-%20TNO99054%
201999.pdf
47. Bates D, Maechler M, Bolker B, et al. lme4: linear mixed-effects models using Eigen and S4, R package version 1.1-12, 2016,
http://CRAN.R-project.org/package¼lme4
48. Mebazaa A, Gayat E, Lassus J, et al. Association between elevated blood glucose and outcome in acute heart failure:
results from an international observational cohort. J Am Coll Cardiol 2013; 61: 820–829.
49. Lassus J, Gayat E, Mueller C, et al. Incremental value of biomarkers to clinical variables for mortality prediction in acutely
decompensated heart failure: the multinational observational cohort on acute heart failure (MOCA) study. Int J Cardiol
2013; 168: 2186–2194.
50. Riley R, Lambert P and Abo-Zaid G. Meta-analysis of individual participant data: rationale, conduct, and reporting.
Br Med J 2010; 340: 521–525.
51. Quartagno M and Carpenter J. Multiple imputation for IPD meta-analysis: allowing for heterogeneity and studies with
missing covariates. Stat Med 2016; 35: 2954–2938.
52. Gilks W, Richardson S and Spiegelhalter D. Full conditional distributions. In: Markov Chain Monte Carlo in practice.
Boca Raton: Springer, 1996, pp.75–88.
53. Liu J, Gelman A, Hill J, et al. On the stationary distribution of iterative imputations. Biometrika 2014; 101: 155–173.
54. Gelman A. Parameterization and Bayesian modeling. J Am Stat Assoc 2004; 99: 537–545.
where 1 is a ni-vector of ones, J is a ni ni matrix of ones, and I is the identity matrix of size ni. Then the
conditional expectation and covariance matrix of x2i are:
E½x2i jx1i ¼ 2 1 þ ðu12 J þ "12 IÞðu11 J þ "11 IÞ1 ðx1i 1 1Þ, ð5Þ
Var½x2i jx1i ¼ ðu22 J þ "22 IÞ ðu12 J þ "12 IÞðu11 J þ "11 IÞ1 ðu12 J þ "12 IÞ: ð6Þ
Lemma 1
For scalars a and b, and identity I and matrix of ones J both of order n n,
a 1
ðaJ þ bIÞ1 ¼ J þ I:
bðan þ bÞ b
Resche-Rigon and White 1647
Proof
Note J2 ¼ nJ and ðaJ þ bIÞðcJ þ dIÞ ¼ ðad þ bc þ nacÞJ þ bdI. So ðcJ þ dIÞ ¼ ðaJ þ bIÞ1 if bd ¼ 1 and
ad þ bc þ nac ¼ 0. This is true when d ¼ 1b and c ¼ bðbþnaÞ
a
.
Using Lemma 1, we obtain the regression coefficient of x2i on x1i as
ðu12 J þ "12 IÞðu11 J þ "11 IÞ1 ¼
ðni ÞJ þ I ð7Þ
where
u12 "11 "12 u11 "12
ðni Þ ¼ and ¼ :
"11 ð"11 þ ni u11 Þ "11
where
"12 u12 ðu12 "11 "12 u11 Þð"12 þ ni u12 Þ
ðni Þ ¼ u22
"11 "11 ð"11 þ ni u11 Þ
and
2"12
¼ "22 :
"11
Note that dependence on cluster size disappears as cluster sizes become large, since as ni ! 1,
2
Now let Xki be the ni pk matrix whose jth row is xkij . Let xki ¼ vecðXki Þ where vecð:Þ denotes the operation of
stacking the columns. This gives
! !
x1i lT1 1 u11 J þ "11 I u12 J þ "12 I
N , ð11Þ
x2i lT2 1 u12 J þ "12 I u22 J þ "22 I
1648 Statistical Methods in Medical Research 27(6)
where 1 is a ni-vector of ones, J is a ni ni matrix of ones, and I is the identity matrix of size ni.
Some useful identities for dealing with the Kronecker product A B are
(1) For column vectors a and b, vec aT b ¼ a b.
(2) For conformable matrices A, B and X, ðBT AÞvecðXÞ ¼ vecðAXBÞ.
(3) For conformable matrices A, B, C and D, ðA BÞðC DÞ ¼ ðACÞ ðBDÞ.
(4) For q q matrices A and B, and identity I and 1-matrix J both of order n n,
ðA J þ B I Þ1 ¼ ðnA þ BÞ1 AB1 J þ B1 I:
We can now express the conditional expectation and covariance matrix of x2i as:
E ½x2i jx1i ¼ lT2 1 þ ðx1i lT1 1Þ ð12Þ
Var½x2i jx1i ¼ ðu22 J þ "22 I Þ ðu12 J þ "12 I Þ ð13Þ
where ¼ ðu12 J þ "12 I Þðu11 J þ "11 I Þ1 is the regression coefficient of x2i on x1i . Using identity 4,
we obtain ¼ aðni Þ J þ b I where
1
aðni Þ ¼ ðu21 1 1
u11 "11 "21 Þð"11 þ ni u11 Þ u11 "11
b ¼ "12 1
"11 :
Now using the identities above, and recalling x1i ¼ vecðX1i Þ, we can show
E ½X2i jX1i ¼ l2 1 þ JðX1i l1 1Þaðni ÞT þ ðX1i l1 1ÞbT :
Finally, since JX1i ¼ ni x1i 1 where x1i is a row-vector of means, and Jðl1 1Þ ¼ ni l1 1,
E ½X2i jX1i ¼ l2 1 þ ni fðx1 l1 Þ 1gaðni ÞT þ fX1i l1 1gbT
and
E ½x2ij jX1i ¼ l2 þ ni ðx1 l1 Þaðni ÞT þ ðx1ij l1 ÞbT : ð14Þ
where
cðni Þ ¼ u22 u21 1
"11 "12
þ ðni u21 þ "21 Þðni u11 þ "11 Þ1 u11 1
"11 "12 u12
d ¼ "22 "21 1
"11 "12 :
As in Appendix 1, we note that dependence on cluster size disappears if either (1) ni ! 1, since then
ni aðni Þ ! u21 1 1 1 1 1
u11 "21 "11 and cðni Þ ! u22 u21 u11 u12 , or (2) u21 u11 ¼ "21 "11 , since then aðni Þ ¼ 0
and cðni Þ ¼ u22 u21 1
u11 u12 (independent of n i ).
In a MICE
procedure,
we are interested in all the conditional distributions, not just ½X2i jX1i . If
u11 u12 "11 "12
¼k then condition (2) above holds for all the conditional distributions.
u21 u22 "21 "22
Resche-Rigon and White 1649
2 1 0:5
Table 8. Base-case simulations with only sporadically missing data. u ¼ 0:5 , ":1 ¼ ":2 ¼ 0:5, ":1 ¼ ":2 ¼ 0.
0:5 1
pffiffiffiffiffiffiffiffiffi pffiffiffiffiffiffiffiffiffi
1 b11 2 b22
Mean
Est Model Emp Est Est Model Emp Est run
Method mean SE SE Cover mean mean SE SE Cover mean time