MIDA: Multiple Imputation using Denoising Autoencoders

Lovedeep Gondara and Ke Wang

Department of Computing Science
Simon Fraser University
lgondara@sfu.ca, wangk@cs.sfu.ca

arXiv:1705.02737v3 [cs.LG] 17 Feb 2018

Abstract. Missing data is a significant problem impacting all domains. The
state-of-the-art framework for minimizing missing data bias is multiple
imputation, for which the choice of an imputation model remains nontrivial.
We propose a multiple imputation model based on overcomplete deep denoising
autoencoders. Our proposed model is capable of handling different data types,
missingness patterns, missingness proportions and distributions. Evaluation
on several real life datasets shows that our proposed model significantly
outperforms current state-of-the-art methods under varying conditions while
simultaneously improving end of the line analytics.

1 Introduction

Missing data is an important issue: even small proportions of missing data can
adversely impact the quality of the learning process, leading to biased
inference [14,4]. Many methods have been proposed over the past decades to
minimize missing data bias [14,9]. They can be divided into two categories:
those that attempt to model the missing data process and use all available
partial data for directly estimating model parameters, and those that attempt
to fill in (impute) missing values with plausible predicted values. Imputation
methods are preferred for their obvious advantage, that is, providing users
with a complete dataset that can be analyzed using user specified models.
Methods for imputing missing data range from replacing missing values by
the column average to complex imputations based on various statistical and
machine learning models. All standalone methods share a common drawback: they
impute a single value for each missing observation, which is then treated as
the gold standard, the same as the observed data, in any subsequent analysis.
This implicitly assumes that the imputation model is perfect and fails to
account for error and uncertainty in the imputation process. This is overcome
by replacing each missing value with several slightly different imputed values,
reflecting our uncertainty about the imputation process. This approach is
called multiple imputation [10,15] and is the most widely used framework for
missing data analytics. The biggest challenge with multiple imputation is the
correct specification of an imputation model [11]. It is a nontrivial task
because of the varying model capabilities
and underlying assumptions. Some imputation models are incapable of handling
mixed data types (categorical and continuous), some have strict distributional
assumptions (multivariate normality) and/or cannot handle arbitrary missing
data patterns. Existing models capable of overcoming aforementioned issues are
further limited in their ability to model highly nonlinear relationships, high vol-
ume data and complex interactions while preserving inter-variable dependencies.
Recent advancements in deep learning have established state-of-the-art re-
sults in many fields [6]. Deep architectures have the capability to automatically
learn latent representations and complex inter-variable associations, which is not
possible using classical models. Part of the deep learning framework, Denoising
Autoencoders (DAEs) [18] are designed to recover clean output from noisy input.
Missing data is a special case of noisy input, making DAEs ideal as an
imputation model. However, missing data can depend on interactions or latent
representations that are not observable in the input dataset space. Hence, we
propose to use an overcomplete DAE as an imputation model, projecting our input
data to a higher dimensional subspace from which we then recover the missing
information. We propose a multiple imputation framework with an overcomplete
DAE as the base model, where we simulate multiple predictions by initializing
our model with a different set of random weights at each run. Details of our
method are
presented in Section 3. Our proposed method has several advantages over the
current methods, some of which we outline below:
– Previous studies on imputing missing data using machine learning methods
use complete observations for the training phase. We show that our model
outperforms state-of-the-art methods even when users do not have the luxury
of having complete observations for initial training, a common scenario in
real life.
– Our model is capable of preserving attribute correlations, which are a
concern with traditional imputation methods and can significantly affect
end of the line analytics.
– Our model is better equipped to deal with different missing data generation
processes, such as data missing not at random, which is a performance
bottleneck for other imputation methods.
Experimental results using real life datasets show that our model outperforms
state-of-the-art methods under varying dataset and missingness conditions and
improves end of the line analytics.
The rest of the paper is organized as follows. Section 2 provides preliminary
background on missing data terminology and introduces denoising autoencoders.
Section 3 introduces our model, and Section 4 presents the empirical evaluation
and the effect of imputation on end of the line analytics, followed by our
conclusions.

2 Background
Missing data is a well researched topic in statistics. Most of the early work
on missing data, including definitions, multiple imputation and subsequent
analysis, is attributed to the works of Little and Rubin [10,9,14]. From a
machine learning perspective, it has been shown that auto-associative neural
networks are better at imputing missing data when attribute interdependencies
are of concern [12], a common scenario in real life datasets. Denoising
autoencoders have recently been used in completing traffic and health records
data [5,1] and in collaborative filtering [8]. Below we provide a preliminary
introduction to missing data mechanisms and denoising autoencoders.

2.1 Missing data

Mechanisms: The impact of missing data depends on the underlying missing data
generating mechanism. We define three missing data categories [10] with the aid
of data from Table 1, representing an income questionnaire in a survey where we
denote missing data with '?'.

Table 1: Data snippet for income questionnaire with missing data represented
using '?'
Id  Age  Sex  Income  Postal  Job  Marital status
1   50   M    100     123     a    single
2   45   ?    ?       456     ?    married
3   ?    F    ?       789     ?    ?

Data is Missing Completely At Random (MCAR) if missingness does not depend on
observed or unobserved data. Example: survey participants flip a coin to decide
whether to answer questions or not. Data is Missing At Random (MAR) if
missingness can be explained using observed data. Example: survey participants
that live in postal codes 456 and 789 refuse to fill in the questionnaire. Data
is Missing Not At Random (MNAR) if missingness depends on an unobserved
attribute or on the missing attribute itself. Example: everyone who owns a six
bedroom house refuses the questionnaire; a bigger house is an indirect
indicator of greater wealth and a better paying job, but we don't have the
related data. When data are MAR or MCAR, the missingness is known as ignorable,
as observed data can be used to account for it. However, given the observed
data, it is impossible to distinguish between MNAR and MAR [17], and sometimes
missing data can be a combination of both.
Multiple Imputation: In a multiple imputation scenario, we create multiple
copies of the dataset presented in Table 1, with '?' replaced by slightly
different imputed values in each copy. Multiple imputation accounts for the
uncertainty in predicting missing data from observed data by modelling
variability into the imputed values, as the true values for missing data are
never known. The multiply imputed datasets are then analyzed independently and
the results combined. A single statistic such as classification accuracy or
root mean square error (RMSE) can simply be averaged over the multiple
imputations.
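The pooling of a single statistic described above can be sketched in a few
lines of Python. This is only an illustration: the function name
`pool_statistic` and the five RMSE values are hypothetical, and reporting the
range alongside the mean follows the way results are reported later in the
paper.

```python
from statistics import mean


def pool_statistic(values):
    """Pool one scalar statistic (e.g. accuracy or RMSE) that was computed
    separately on each imputed dataset: mean plus (min, max) range."""
    values = list(values)
    if not values:
        raise ValueError("need at least one imputed dataset")
    return mean(values), (min(values), max(values))


# Hypothetical RMSE values obtained from k = 5 imputed copies of one dataset.
rmse_per_imputation = [2.9, 3.0, 2.9, 2.9, 3.0]
pooled, value_range = pool_statistic(rmse_per_imputation)
```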

2.2 Autoencoders and Denoising autoencoders

An autoencoder takes an input $x \in [0, 1]^d$ and maps (encodes) it to an
intermediate representation $h \in [0, 1]^{d'}$ using an encoder, where $d'$ is
the dimensionality of the intermediate subspace, which may differ from $d$. The
assumption is that, in the dataset, h captures the coordinates along the main
factors of variation. The encoded representation is then decoded back to the
original d dimensional space using a decoder. Encoder and decoder are both
artificial neural networks. The two stages are represented as

$$h = s(W x + b) \tag{1}$$

$$z = s(W' h + b') \tag{2}$$

where z is the decoded result and s is any nonlinear function. The
reconstruction error between x and z is minimized during the training phase.
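As a minimal sketch of the encode and decode stages in equations (1) and (2),
the following plain Python code runs one forward pass. It assumes a logistic
sigmoid for s (so that h and z stay in the unit interval), randomly drawn
untrained weights, and mean squared error as the reconstruction error; the
helper names and the sizes d = 4 and d' = 6 are illustrative choices, not taken
from the paper.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def affine(W, x, b):
    """Compute W x + b for a weight matrix W given as a list of rows."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]


def encode(x, W, b):
    # Equation (1): h = s(W x + b)
    return [sigmoid(a) for a in affine(W, x, b)]


def decode(h, W_prime, b_prime):
    # Equation (2): z = s(W' h + b')
    return [sigmoid(a) for a in affine(W_prime, h, b_prime)]


def reconstruction_error(x, z):
    """Mean squared error between the input x and its reconstruction z."""
    return sum((x_i - z_i) ** 2 for x_i, z_i in zip(x, z)) / len(x)


rng = random.Random(0)
d, d_prime = 4, 6  # illustrative input and hidden dimensions
W = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(d_prime)]
b = [0.0] * d_prime
W_prime = [[rng.uniform(-0.5, 0.5) for _ in range(d_prime)] for _ in range(d)]
b_prime = [0.0] * d

x = [0.2, 0.9, 0.5, 0.1]
h = encode(x, W, b)
z = decode(h, W_prime, b_prime)
error = reconstruction_error(x, z)
```

Training would adjust W, b, W' and b' to reduce `error` over the dataset; that
step is omitted here.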
Denoising autoencoders are a natural extension of autoencoders [18]. Corrupting
the input data and forcing the network to reconstruct the clean output forces
the hidden layers to learn robust features. Corruption can be applied in
different ways, such as randomly setting some input values to zero or using
distributional additive noise. The reconstruction capabilities of DAEs can be
explained by thinking of DAEs as implicitly estimating the data distribution as
the asymptotic distribution of the Markov chain that alternates between
corruption and denoising [2].
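The two corruption schemes mentioned above can be illustrated as follows. This
is a sketch under assumptions: the function names are hypothetical, masking is
implemented by zeroing an exact fraction of the entries (a per-entry Bernoulli
draw would be an equally valid reading), and the additive noise is taken to be
zero-mean Gaussian.

```python
import random


def masking_noise(x, ratio, rng):
    """Set a randomly chosen fraction `ratio` of the entries of x to zero."""
    n_corrupt = round(ratio * len(x))
    corrupt_idx = set(rng.sample(range(len(x)), n_corrupt))
    return [0.0 if i in corrupt_idx else v for i, v in enumerate(x)]


def additive_gaussian_noise(x, sigma, rng):
    """Add independent zero-mean Gaussian noise to every entry of x."""
    return [v + rng.gauss(0.0, sigma) for v in x]


rng = random.Random(0)
x = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
x_masked = masking_noise(x, 0.5, rng)
x_noisy = additive_gaussian_noise(x, 0.1, rng)
# A DAE is trained on pairs (corrupted input, clean target), e.g. (x_masked, x).
```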

3 Models

This section introduces our multiple imputation model and the competitors used
for comparison.

3.1 Our model

Architecture: Our default architecture is shown in Figure 1. We employ an
atypical overcomplete representation of DAEs, that is, more units in successive
hidden layers during the encoding phase compared to the input layer. This
mapping of our input data to a higher dimensional subspace creates
representations capable of adding lateral connections, aiding in data recovery.
The usefulness of this approach is empirically validated in the supplemental
material. We start with an initial n dimensional input; then, at each
successive hidden layer, we add Θ nodes, increasing the dimensionality to
n + Θ. For initial comparisons, we use Θ = 7. We tried different values of Θ
for various datasets and decided to use 7 as it provided consistently better
results. It is an arbitrary choice and can be dealt with by viewing Θ as
another tuning hyperparameter. Our model inputs are standardized between 0 and
1 to facilitate faster convergence for small to moderate sample sizes. Our
model is trained for 500 epochs using an adaptive learning rate with a time
decay factor of 0.99 and Nesterov's accelerated gradient [13]. The input
dropout ratio to induce corruption is set to 0.5, so that in a given training
batch, half of the inputs are set to zero. Tanh is used as the activation
function as we find it performs better than ReLU for small to moderate sample
sizes when some inputs are closer to zero. We use an early stopping rule to
terminate training if the desired mean squared error (MSE) of 1e-06 is achieved
or if the simple moving average of length 5 of the error deviance does not
improve. A training-test split of 70-30 is used, with all results reported on
the test set. Multiple imputation is accomplished by using multiple runs of the
model with a different set of random initial weights at each run. This provides
us with the variation needed for multiple imputations. Algorithm 1 explains our
multiple imputation process.

[Figure 1: encoder layers I, H1, H2, H3 of widths n, n+Θ, n+2Θ, n+3Θ; decoder
layers H3, H4, H5, O of widths n+3Θ, n+2Θ, n+Θ, n]

Fig. 1: Our basic architecture. The encoder block increases dimensionality at
every hidden layer by adding Θ units, with the decoder symmetrically scaling it
back to the original dimensions. Crossed out inputs represent stochastic
corruption at input by setting random inputs to zero. H1, H2, H3, H4, H5 are
hidden layers, with I and O being the input and output layers respectively.
Encoder and decoder are constructed using fully connected artificial neural
networks.
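The layer widths of the architecture in Figure 1 follow directly from n and Θ.
The short sketch below computes them; the function name `layer_widths` and its
parameter names are illustrative, and three encoder hidden layers are assumed
because that is what Figure 1 shows.

```python
def layer_widths(n, theta=7, n_encoder_layers=3):
    """Widths of the layers I, H1, H2, H3, H4, H5, O for an n dimensional
    input: the encoder adds theta units per hidden layer and the decoder
    mirrors it back down to n."""
    encoder = [n + i * theta for i in range(n_encoder_layers + 1)]
    decoder = encoder[-2::-1]  # mirror of the encoder without the widest layer
    return encoder + decoder


# Example: a dataset with 14 attributes and the default theta of 7.
widths = layer_widths(14)
```

For 14 attributes this gives widths 14, 21, 28, 35, 28, 21, 14, so the widest
hidden layer has n + 3Θ units.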
Usage: We start by imitating a real life scenario, where the user only has a
dataset with pre-existing missing values. That is, the user does not have the
luxury of access to the clean data and does not know the underlying missing
data generating mechanism or the distribution. In scenarios where missingness
is inherent to the datasets, training imputation models using complete data can
bias the learner. But as DAEs require complete data at initialization, we
initially use the respective column average in the case of continuous variables
and the most frequent label in the case of categorical variables as
placeholders for missing data. The training phase is then initiated with a
stochastic corruption process setting random inputs to zero, where our model
learns to map corrupted input to clean output. Our approach is based on one
assumption, that is, we have enough complete data to train our model, so that
the model learns to recover true data using stochastic corruption on inputs and
does not learn to map placeholders as valid imputations. The results show this
assumption is readily satisfied in real life scenarios: even datasets with
small sample sizes are enough for a DAE based imputation model to achieve
better performance compared to the state-of-the-art.

Algorithm 1 Multiple imputation using DAEs
Require: k: Number of imputations needed
1: for i = 1 → k do
2:   Initialize DAE based imputation model using weights from random uniform
     distribution
3:   Fit the imputation model to training partition using stochastic corruption
4:   Reconstruct test set using the trained model
5: end for
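The loop of Algorithm 1 can be sketched in plain Python as below. This is a
deliberately simplified stand-in and not the authors' implementation: it
assumes continuous attributes only, a single tanh hidden layer of width n + Θ
with a linear output instead of the full architecture, plain per-sample
gradient descent instead of Nesterov's method with a decaying learning rate,
and it feeds column-mean placeholders for missing test entries at
reconstruction time. All function names, the weight range and the default
hyperparameters are hypothetical.

```python
import math
import random


def column_means(rows):
    """Mean of the observed (non-None) values in each column."""
    means = []
    for col in zip(*rows):
        observed = [v for v in col if v is not None]
        means.append(sum(observed) / len(observed))
    return means


def fill(rows, means):
    """Replace None by the column-mean placeholder."""
    return [[means[c] if v is None else v for c, v in enumerate(r)]
            for r in rows]


def train_dae(rows, theta, seed, epochs=50, lr=0.05, drop=0.5):
    """Fit a one-hidden-layer overcomplete DAE and return its reconstruction
    function. Weights start from a uniform distribution seeded by `seed`."""
    rng = random.Random(seed)
    n, h = len(rows[0]), len(rows[0]) + theta
    W1 = [[rng.uniform(-0.1, 0.1) for _ in range(n)] for _ in range(h)]
    W2 = [[rng.uniform(-0.1, 0.1) for _ in range(h)] for _ in range(n)]
    b1, b2 = [0.0] * h, [0.0] * n

    def forward(x):
        hid = [math.tanh(sum(w * v for w, v in zip(W1[j], x)) + b1[j])
               for j in range(h)]
        out = [sum(w * v for w, v in zip(W2[i], hid)) + b2[i]
               for i in range(n)]
        return hid, out

    for _ in range(epochs):
        for x in rows:
            # Stochastic corruption: zero each input with probability `drop`.
            x_tilde = [0.0 if rng.random() < drop else v for v in x]
            hid, out = forward(x_tilde)
            g_out = [o - t for o, t in zip(out, x)]  # clean x is the target
            g_hid = [(1 - hid[j] ** 2)
                     * sum(g_out[i] * W2[i][j] for i in range(n))
                     for j in range(h)]
            for i in range(n):
                for j in range(h):
                    W2[i][j] -= lr * g_out[i] * hid[j]
                b2[i] -= lr * g_out[i]
            for j in range(h):
                for c in range(n):
                    W1[j][c] -= lr * g_hid[j] * x_tilde[c]
                b1[j] -= lr * g_hid[j]
    return lambda x: forward(x)[1]


def multiple_imputation(train, test, k=5, theta=7):
    """Return k completed copies of `test`, one per random initialization."""
    means = column_means(train)
    train_filled, test_filled = fill(train, means), fill(test, means)
    imputations = []
    for i in range(k):                                          # step 1
        reconstruct = train_dae(train_filled, theta, seed=i)    # steps 2-3
        completed = []
        for raw, filled in zip(test, test_filled):              # step 4
            z = reconstruct(filled)
            completed.append([z[c] if v is None else v
                              for c, v in enumerate(raw)])
        imputations.append(completed)
    return imputations
```

Observed test values are kept as they are and only the missing entries are
taken from the reconstruction, so the k completed datasets differ exactly in
the imputed cells.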

3.2 Competitors and comparison

Competitor: For multiple imputation, we need methods that can inject variation
in successive imputations, providing slightly different imputation results at
each iteration. Simple models such as linear/logistic regression or
deterministic methods based on matrix decomposition fail to take this into
account. The current state-of-the-art in multiple imputation is Multivariate
Imputation by Chained Equations (MICE) [3], which is a fully conditional
specification approach and works better than Joint Modelling approaches where
multivariate distributions cannot provide a reasonable description of the data
or where no suitable multivariate distributions can be found. MICE specifies a
multivariate model on a variable by variable basis using a set of conditional
densities, one for each variable with missing data. MICE draws imputations by
iterating over the conditional densities, and it has the added advantage of
being able to model different densities for different variables. The internal
imputation model in MICE is vital, and a model with properties to handle
different data types and distributions is essential for effective imputations.
Predictive mean matching and Random Forest are the best available options
within the MICE framework [16]. We compared both and found predictive mean
matching to provide more consistent results with varying dataset types and
sizes. Hence it is used as the internal component of our competitor MICE model.
Comparison: Imputation results are compared using the sum of root mean squared
errors calculated per attribute on the test set, given as

$$RMSE_{sum} = \sum_{j=1}^{m} \sqrt{E\left( \sum_{i=1}^{n} (\hat{t}_{ij} - t_{ij})^2 \right)} \tag{3}$$

where we have m attributes (indexed by j), n observations (indexed by i),
$\hat{t}$ is the imputed value and $t$ is the observed value. $RMSE_{sum}$ is
calculated on scaled datasets to avoid disproportionate attribute
contributions. $RMSE_{sum}$ provides us a measure of relative distance, that
is, how far the dataset completed with imputed values is from the original
complete dataset. For multiple imputation scenarios with k imputations, we have
k values for $RMSE_{sum}$ per dataset. The results are then reported using the
average $RMSE_{sum}$ along with the range.
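A small sketch of this metric, assuming the usual squared-error definition of
RMSE and reading E(·) as the mean over the n observations, could look as
follows. The function name and the toy data are illustrative; both datasets are
given as lists of rows that have already been scaled.

```python
import math


def rmse_sum(imputed, original):
    """Sum over attributes of the per-attribute RMSE between two equally
    shaped datasets, each given as a list of rows."""
    n = len(original)
    total = 0.0
    for col_hat, col_true in zip(zip(*imputed), zip(*original)):
        mse = sum((a - b) ** 2 for a, b in zip(col_hat, col_true)) / n
        total += math.sqrt(mse)
    return total


# Toy example with 2 observations and 2 attributes; one imputed cell is off.
original = [[0.0, 0.0], [1.0, 1.0]]
imputed = [[0.0, 0.5], [1.0, 1.0]]
score = rmse_sum(imputed, original)
```

With k imputed datasets, calling `rmse_sum` once per dataset yields the k
values whose mean and range are reported.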

4 Experiments

We start our empirical evaluation for multiple imputation on several publicly
available real life datasets under varying missingness conditions.

4.1 Datasets

Table 2 shows the properties of various real life publicly available datasets [7]
used for model evaluation. Models based on deep architectures are known to
perform well on large sample, high dimensional datasets. Here we include some
extremely low dimensional and low sample size datasets to test the extremes and
to prove that our model has real world applications. Most of the datasets have
a combination of continuous, categorical and ordinal attributes, which further
challenges the convergence of our model using small training samples.

Table 2: Datasets used for evaluation. Dataset acronyms, shown in parentheses,
are used in the results section.
Dataset               Observations  Attributes
Boston Housing (BH)   506           14
Breast Cancer (BC)    699           11
DNA (DN)              3186          180
Glass (GL)            214           10
House votes (HV)      435           17
Ionosphere (IS)       351           35
Ozone (ON)            366           13
Satellite (SL)        6435          37
Servo (SR)            167           5
Shuttle (ST)          58000         9
Sonar (SN)            208           61
Soybean (SB)          683           36
Vehicle (VC)          846           19
Vowel (VW)            990           10
Zoo (ZO)              101           17

4.2 Inducing missingness


To provide a wide range of comparisons, initially for each dataset, we
introduce missingness in four different ways (steps 2 to 5 below), with a fixed
missingness proportion of 20%, using the steps detailed below.
1. Append a uniform random vector v with n observations to the dataset with
values between 0 and 1, where n is the number of observations in the dataset.
2. MCAR, uniform: Set all attributes to have missing values where v_i ≤ t,
i ∈ 1 : n, where t is the missingness threshold, 20% in our case.
3. MCAR, random: Set a randomly sampled half of the attributes to have missing
values where v_i ≤ t, i ∈ 1 : n.
4. MNAR, uniform: Randomly sample two attributes x_1 and x_2 from the dataset
and calculate their medians m_1 and m_2. Set all attributes to have missing
values where v_i ≤ t, i ∈ 1 : n and (x_1 ≤ m_1 or x_2 ≥ m_2).
5. MNAR, random: Randomly sample two attributes x_1 and x_2 from the dataset
and calculate their medians m_1 and m_2. Set a randomly sampled half of the
attributes to have missing values where v_i ≤ t, i ∈ 1 : n and (x_1 ≤ m_1 or
x_2 ≥ m_2).
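The steps above can be sketched as a single Python function. This is an
illustration under assumptions: the dataset is a list of equal-length numeric
rows, missing values are marked with None, the MNAR condition is evaluated on
the original values, and the function and parameter names are hypothetical.

```python
import random
import statistics


def induce_missingness(rows, t=0.2, mechanism="MCAR", pattern="uniform",
                       seed=0):
    """Return a copy of `rows` with None marking induced missing values."""
    rng = random.Random(seed)
    n_rows, n_cols = len(rows), len(rows[0])

    # Step 1: one uniform draw v_i in [0, 1) per observation.
    v = [rng.random() for _ in range(n_rows)]

    # "uniform" pattern hits every attribute, "random" a sampled half.
    if pattern == "uniform":
        target_cols = list(range(n_cols))
    else:
        target_cols = rng.sample(range(n_cols), n_cols // 2)

    # MNAR additionally conditions on two sampled attributes and their medians.
    if mechanism == "MNAR":
        c1, c2 = rng.sample(range(n_cols), 2)
        m1 = statistics.median(r[c1] for r in rows)
        m2 = statistics.median(r[c2] for r in rows)

    out = [list(r) for r in rows]
    for i, r in enumerate(rows):
        selected = v[i] <= t
        if mechanism == "MNAR":
            selected = selected and (r[c1] <= m1 or r[c2] >= m2)
        if selected:
            for c in target_cols:
                out[i][c] = None
    return out


# Example: 200 observations, 6 attributes, MCAR with the uniform pattern.
data_rng = random.Random(1)
data = [[data_rng.random() for _ in range(6)] for _ in range(200)]
mcar_uniform = induce_missingness(data, t=0.2, mechanism="MCAR",
                                  pattern="uniform", seed=3)
```

Note that under the MNAR variants the extra median condition removes some of
the selected rows, so the realized missingness proportion can fall below t.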

4.3 Main results


Table 3 shows the multiple imputation results on real life datasets, comparing
five imputations by our model with five imputations by MICE, that is, each
missing value is imputed five times with a slightly different value. The
results show that our model outperforms MICE in 100% of cases with data MCAR
and MNAR with the uniform missing pattern and in > 70% of cases with the random
missing pattern. Our model's superior performance in this scenario, using small
to moderate dataset sizes with constrained dimensionality, is indicative of its
utility when datasets are large and of higher dimensionality, which is a
performance bottleneck for other multiple imputation models, whereas our model
is capable of handling massive data by design. Another advantage is that our
model does not need a certain proportion of available data to predict missing
values. For example, in the case of dataset VW-MNAR, MICE was unable to provide
complete imputations.
Computational cost associated with our model is on par with or better than that
of imputations based on MICE for small to moderate sized datasets. This might
seem counter-intuitive to some readers as our model is much more complex.
However, computational gains are significant when we are modelling a complete
dataset in a single attempt compared to iterative variable by variable
imputation in MICE.

Table 3: Imputation results comparing our model and MICE. Results are displayed
using the sum of root mean square error ($RMSE_{sum}$), providing a measure of
relative distance of imputation from original data. As results are from
multiple imputation (5 imputations), the mean $RMSE_{sum}$ from 5 imputations
is displayed outside, with the min and max $RMSE_{sum}$ inside parentheses
providing a range for imputation performance. Value for MNAR is NA for dataset
VW as MICE was unable to impute a complete dataset.
Mech. Data  Uniform missingness               Random missingness
            DAE              MICE             DAE              MICE
MCAR  BH    2.9(2.9,3)       3.7(3.5,3.8)     0.9(0.9,1)       0.9(0.7,1)
MCAR  BC    2.9(2.9,2.9)     3.9(3.6,4.2)     1.2(1.2,1.3)     1.3(1.1,1.4)
MCAR  DN    25.7(25.7,25.7)  36.5(36.3,36.6)  13.1(13.1,13.2)  16.9(16.9,17)
MCAR  GL    1.1(1,1.1)       1.5(1.3,1.7)     1.3(1.2,1.4)     1.4(1.3,1.6)
MCAR  HV    2.4(2.4,2.4)     3.4(3.1,3.7)     1.1(1.1,1.2)     1.2(0.9,1.3)
MCAR  IS    13(12.9,13.1)    17.1(16.2,17.7)  5.8(5.6,6.2)     7(6.7,7.5)
MCAR  ON    2.1(2.1,2.1)     3.1(3,3.3)       0.9(0.9,1)       1(1,1.2)
MCAR  SL    3.6(3.6,3.7)     4.5(4.4,4.6)     1.8(1.7,1.8)     0.7(0.7,0.7)
MCAR  SR    1.2(1.,1.2)      1.5(1.4,1.7)     0.4(0.4,0.5)     0.4(0.4,0.5)
MCAR  ST    16.5(16.5,16.7)  27.9(27.5,28.2)  6.5(6.4,6.7)     13(12.5,13.8)
MCAR  SN    5.1(5,5.1)       7.3(7.2,7.5)     2.3(2.2,2.3)     3.2(3.2,3.3)
MCAR  SB    1.8(1.8,1.8)     2.4(2.3,2.4)     1.2(1.1,1.2)     1.1(1,1.1)
MCAR  VC    4.1(4,4.1)       5.6(5.5,5.7)     1.6(1.6,1.6)     2.2(2.1,2.3)
MCAR  VW    5.8(5.7,6.2)     7.7(7,8.1)       2.6(2.4,2.9)     3.8(3.3,4.2)
MCAR  ZO    2.1(2.1,2.1)     3.4(3.1,4.3)     1.1(1.1,1.2)     1.1(1.1,1.1)
MNAR  BH    2.3(2.2,2.4)     3.2(2.9,3.4)     0.9(0.8,1)       0.7(0.7,0.8)
MNAR  BC    2.9(2.8,3)       3.6(3.4,3.8)     1.7(1.7,1.8)     1.4(1.3,1.5)
MNAR  DN    25.3(25.2,25.3)  34.5(34.5,34.7)  5.7(5.7,5.8)     7.2(7.1,7.2)
MNAR  GL    1.3(1.3,1.4)     1.5(1.3,1.8)     0.4(0.3,0.4)     0.2(0.11,0.2)
MNAR  HV    2.6(2.6,2.6)     3.5(3.3,3.7)     1.3(1.2,1.3)     1.3(1.3,1.4)
MNAR  IS    11.7(11.5,11.8)  15.4(14.9,16.5)  4.8(4.5,5.1)     6.3(5.6,6.8)
MNAR  ON    1.5(1.5,1.5)     2.2(2,2.4)       1.2(1.1,1.2)     1.3(1.1,1.5)
MNAR  SL    3.4(3.4,3.4)     3.8(3.8,3.9)     1.6(1.6,1.6)     0.5(0.5,0.5)
MNAR  SR    1.2(1.2,1.2)     1.6(1.5,1.7)     0.4(0.3,0.4)     0.3(0.2,0.3)
MNAR  ST    11.8(11.7,11.9)  22.4(22.1,22.7)  4.5(4.3,4.7)     9.5(8.4,10.3)
MNAR  SN    4.6(4.6,4.6)     6.8(6.5,7.1)     2.3(2.3,2.4)     3.1(3,3.2)
MNAR  SB    1.7(1.7,1.7)     2.3(2.2,2.4)     0.6(0.6,0.6)     0.9(0.9,0.9)
MNAR  VC    3.5(3.4,3.7)     4.6(4.4,4.8)     1.7(1.7,1.8)     2.4(2.3,2.4)
MNAR  VW    5.9(5.9,5.9)     NA               2.3(2.1,2.5)     NA
MNAR  ZO    3.3(2.8,5.5)     3.9(3.6,4.6)     0.9(0.8,1.0)     1.1(0.7,1.7)

4.4 Increased missingness proportion

Missing data proportion is known to affect imputation performance, which
deteriorates with increasing missing data proportion. To test the impact of
varying missing data proportion on our model, we introduce missingness in all
15 datasets with the missingness proportion set at 40% and 60%, using the
methods described for inducing missingness (Section 4.2). Keeping all model
parameters the same for our model and MICE, we multiply imputed the datasets
with five imputations each. For a better visual representation, we compare the
imputation results between our model and MICE using the mean error ratio
$E_R$, given as

$$E_R = \frac{\frac{1}{n} \sum_{i=1}^{n} E_{D_i}}{\frac{1}{n} \sum_{i=1}^{n} E_{P_i}} \tag{4}$$

where $E_D$ is the imputation error of our model, $E_P$ is the imputation error
of MICE and n is the number of imputations. $E_R$ values of less than one
signify average superior performance of our model over MICE, whereas values
greater than one signify MICE performing better.

[Figure 2: eight panels (a) to (h); x-axis: datasets 1-15, y-axis: E_R]

Fig. 2: Results for imputation with increased missingness proportions. Panels
(a), (b), (c), (d) show imputation results with 40% missing data and panels
(e), (f), (g), (h) show imputation results with 60% missing data. A red
reference line is drawn at y = 1 to signify superior/inferior performance of
our model vs MICE. Results are displayed using $E_R$, where values less than
one signify our model performing better and values greater than one signify
MICE's superior performance. The x-axis shows the different datasets (1-15) and
the y-axis displays $E_R$. In some cases, MICE has trouble imputing dataset 14
(VW), whereas our model provides consistent imputations.
Figure 2 shows the results; a reference line at y = 1 is drawn to aid visual
comparisons. Our model performs better on average compared to MICE,
irrespective of missing data proportion. Results echo the findings of our main
results, where we observe our model performing better than MICE in > 85% of
cases on average.
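The mean error ratio used in this section is straightforward to compute. The
sketch below assumes one imputation error per imputed dataset for each method;
the function name and the error values are hypothetical and are not taken from
the reported results.

```python
from statistics import mean


def error_ratio(dae_errors, mice_errors):
    """Mean imputation error of our model divided by that of MICE, computed
    over the same number of imputations. Values below 1 favour our model."""
    if len(dae_errors) != len(mice_errors):
        raise ValueError("both methods need the same number of imputations")
    return mean(dae_errors) / mean(mice_errors)


# Hypothetical errors from n = 5 imputations of one dataset.
e_dae = [2.9, 3.0, 2.9, 2.9, 3.0]
e_mice = [3.5, 3.8, 3.7, 3.6, 3.9]
ratio = error_ratio(e_dae, e_mice)
```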

4.5 Impact on final analysis

The main goal of imputing missing data is to generate complete datasets that
can be used for analytics. While imputation accuracy provides us with a measure
of how close the imputed dataset is to the complete dataset, we still do not
know how well inter-variable correlations are preserved, which severely impacts
end of the line analytics. To check the imputation quality in relation to a
dataset's overall structure and to quantify the impact of imputation on end of
the line analytics, we use all imputed datasets as the input to
classification/regression models based on random forest with 5 times 5 fold
cross validation. The task is to predict the target variable of each dataset
and store the classification accuracy/RMSE for each dataset imputed using our
model and MICE. Higher values for classification accuracy and lower RMSE
signify a better preserved predictive dataset structure. We calculate mean
accuracy/RMSE from all five runs of multiple imputation. Datasets with data
MNAR (uniform and random) are used, as MNAR datasets pose the greatest
challenges for imputation.
Results in Table 4 show that multiple imputation using our model provides
higher predictive power for end of the line analytics compared to MICE imputed
data. The difference is even more significant when data are MNAR uniform
compared to when data are MNAR random.

Table 4: Average accuracy and RMSE estimates for end of the line analytics
using random forest on imputed datasets (data MNAR). As we have used multiple
imputation, results are averaged over all imputed datasets. * signifies where
RMSE is reported because the target variable is numeric, hence lower values are
better. All other datasets report average classification accuracy, where higher
is better. For dataset VW, as MICE was unable to impute a full dataset, end of
the line analytics is not possible.
Data  Uniform missingness  Random missingness
      DAE    MICE          DAE    MICE
BH*   3.9    4.5           3.7    4.1
BC    96.0   96.0          97.0   96.1
DN    91.6   87.6          93.3   93.7
GL    70.2   64.1          74.6   70.4
HV    98.5   99.2          95.0   98.2
IS    90.5   86.6          90.7   90.3
ON*   4      3.8           3.6    4.2
SL    90.0   80.9          89.6   89.6
SR*   6.6    8.1           6.9    6.4
ST    86.4   80.8          76.5   72.9
SN    90.9   70.5          85.3   84.2
SB    72.6   62.7          73.7   77.1
VC    74.7   63.8          72.7   70.6
VW    93.9   NA            77.7   NA
ZO    99.9   98.5          99.9   99.9

5 Conclusion

We have presented a new method for multiple imputation based on deep de-
noising autoencoders. We have shown that our proposed method outperforms
current state-of-the-art using various real life datasets and missingness mecha-
nisms. We have shown that our model performs well, even with small sample
sizes, which is thought to be a hard task for deep architectures. In addition to
not requiring a complete dataset for training, we have shown that our proposed
model improves end of the line analytics.

References
1. Brett K Beaulieu-Jones, Jason H Moore, The pooled resource open-access ALS, and
clinical trials consortium. Missing data imputation in the electronic health record
using deeply learned autoencoders. In Pacific Symposium on Biocomputing. Pacific
Symposium on Biocomputing, volume 22, page 207. NIH Public Access, 2016.
2. Yoshua Bengio, Li Yao, Guillaume Alain, and Pascal Vincent. Generalized de-
noising auto-encoders as generative models. In Advances in Neural Information
Processing Systems, pages 899–907, 2013.
3. Stef van Buuren and Karin Groothuis-Oudshoorn. mice: Multivariate imputation
by chained equations in R. Journal of statistical software, 45(3), 2011.
4. Pei Chen. Optimization algorithms on subspaces: Revisiting missing data problem
in low-rank matrix. International Journal of Computer Vision, 80(1):125–142,
2008.
5. Yanjie Duan, Yisheng Lv, Wenwen Kang, and Yifei Zhao. A deep learning based
approach for traffic data imputation. In Intelligent Transportation Systems (ITSC),
2014 IEEE 17th International Conference on, pages 912–917. IEEE, 2014.
6. Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature,
521(7553):436–444, 2015.
7. Friedrich Leisch and Evgenia Dimitriadou. Machine learning benchmark problems.
2010.
8. Sheng Li, Jaya Kawale, and Yun Fu. Deep collaborative filtering via marginalized
denoising auto-encoder. In Proceedings of the 24th ACM International on Confer-
ence on Information and Knowledge Management, pages 811–820. ACM, 2015.
9. Roderick JA Little. Missing-data adjustments in large surveys. Journal of Business
& Economic Statistics, 6(3):287–296, 1988.
10. Roderick JA Little and Donald B Rubin. Statistical analysis with missing data.
John Wiley & Sons, 2014.
11. Tim P Morris, Ian R White, and Patrick Royston. Tuning multiple imputation
by predictive mean matching and local residual draws. BMC medical research
methodology, 14(1):75, 2014.
12. Fulufhelo V Nelwamondo, Shakir Mohamed, and Tshilidzi Marwala. Missing data:
A comparison of neural network and expectation maximisation techniques. arXiv
preprint arXiv:0704.3474, 2007.
13. Yurii Nesterov. A method of solving a convex programming problem with
convergence rate O(1/k^2). 1983.
14. Donald B Rubin. Inference and missing data. Biometrika, pages 581–592, 1976.
15. Joseph L Schafer. Multiple imputation: a primer. Statistical methods in medical
research, 8(1):3–15, 1999.
16. Anoop D Shah, Jonathan W Bartlett, James Carpenter, Owen Nicholas, and Harry
Hemingway. Comparison of random forest and parametric imputation models for
imputing missing data using mice: a caliber study. American journal of epidemi-
ology, 179(6):764–774, 2014.
17. Jonathan AC Sterne, Ian R White, John B Carlin, Michael Spratt, Patrick Roys-
ton, Michael G Kenward, Angela M Wood, and James R Carpenter. Multiple
imputation for missing data in epidemiological and clinical research: potential and
pitfalls. Bmj, 338:b2393, 2009.
18. Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol.
Extracting and composing robust features with denoising autoencoders. In Proceed-
ings of the 25th international conference on Machine learning, pages 1096–1103.
ACM, 2008.
