MIDA: Multiple Imputation using Denoising Autoencoders
Lovedeep Gondara, Ke Wang
1 Introduction
Missing data is an important issue; even small proportions of missing data can
adversely impact the quality of the learning process, leading to biased inference
[14,4]. Many methods have been proposed over the past decades to minimize
missing data bias [14,9]. They can be divided into two categories: those that
attempt to model the missing data process and use all available partial data to
directly estimate model parameters, and those that attempt to fill in/impute
missing values with plausible predicted values. Imputation methods are preferred
for their obvious advantage: they provide users with a complete dataset that
can be analyzed using user-specified models.
Methods for imputing missing data range from replacing missing values with
the column average to complex imputations based on various statistical and
machine learning models. All standalone methods share a common drawback:
they impute a single value for each missing observation, which is then treated
as the gold standard, the same as observed data, in any subsequent analysis.
This implicitly assumes that the imputation model is perfect and fails to account
for error/uncertainty in the imputation process. This is overcome by replacing
each missing value with several slightly different imputed values, reflecting our
uncertainty about the imputation process. This approach is called multiple
imputation [10,15] and is the most widely used framework for missing data
analytics. The biggest challenge with multiple imputation is the correct
specification of the imputation model [11]. It is a nontrivial task because of
the varying model capabilities
2 Background
Missing data is a well-researched topic in statistics. Most of the early work on
missing data, including definitions, multiple imputation, and subsequent analysis
Table 1: Data snippet for an income questionnaire, with missing data represented
by '?'

Id  Age  Sex  Income  Postal  Job  Marital status
1   50   M    100     123     a    single
2   45   ?    ?       456     ?    married
3   ?    F    ?       789     ?    ?
h = s(Wx + b)   (1)
z = s(W'h + b')   (2)
where x is the input, h is the encoded representation, z is the decoded result,
and s is any nonlinear function; the reconstruction error between x and z is
minimized during the training phase.
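For concreteness, Eqs. (1) and (2) amount to the following forward pass. This is a minimal numpy sketch; the use of tanh for the generic nonlinearity s and the layer dimensions are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    d, k = 10, 5                                    # illustrative input/hidden dimensions
    W, b = rng.normal(size=(k, d)), np.zeros(k)     # encoder parameters
    W2, b2 = rng.normal(size=(d, k)), np.zeros(d)   # decoder parameters (W', b')

    def autoencode(x):
        h = np.tanh(W @ x + b)      # Eq. (1): h = s(Wx + b)
        z = np.tanh(W2 @ h + b2)    # Eq. (2): z = s(W'h + b')
        return z

    x = rng.normal(size=d)
    z = autoencode(x)
    reconstruction_error = np.mean((x - z) ** 2)    # quantity minimized during training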
Denoising autoencoders (DAEs) are a natural extension of autoencoders [18].
Corrupting the input data and forcing the network to reconstruct the clean
output compels the hidden layers to learn robust features. Corruption can be
applied in different ways, such as randomly setting some input values to zero or
adding distributional noise. The reconstruction capability of DAEs can be
explained by viewing a DAE as implicitly estimating the data distribution,
namely as the asymptotic distribution of the Markov chain that alternates
between corruption and denoising [2].
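The corruption-and-denoising idea reduces to a simple training step: corrupt the input, then penalize the distance to the clean input. The following PyTorch snippet is a minimal sketch; the toy network, zero-masking rate, and optimizer are our assumptions, not a specific published configuration.

    import torch
    import torch.nn as nn

    net = nn.Sequential(nn.Linear(10, 5), nn.Tanh(), nn.Linear(5, 10))  # toy DAE
    opt = torch.optim.SGD(net.parameters(), lr=0.01)

    x = torch.randn(32, 10)                    # clean input batch
    mask = (torch.rand_like(x) > 0.5).float()  # randomly zero some inputs
    x_corrupt = x * mask                       # corrupted input

    loss = nn.functional.mse_loss(net(x_corrupt), x)  # reconstruct the clean x
    opt.zero_grad()
    loss.backward()
    opt.step()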
3 Models
This section introduces our multiple imputation model and the competitors used
for comparison.
[Figure: Model architecture. The encoder maps the input I through hidden layers H1, H2, H3; the decoder maps H3 through hidden layers H4, H5 to the output O.]
In each training batch, half of the inputs are set to zero. Tanh is used as the
activation function, as we find it performs better than ReLU for small to
moderate sample sizes when some inputs are close to zero. We use an early
stopping rule that terminates training if the desired mean squared error (MSE)
of 1e-06 is achieved, or if a simple moving average of length 5 of the error
deviance does not improve. A 70-30 training-test split is used, with all results
reported on the test set. Multiple imputation is accomplished through multiple
runs of the model, each with a different set of random initial weights; this
provides the variation needed for multiple imputations. Algorithm 1 explains
our multiple imputation process.
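A compact sketch of this multiple-run procedure follows. The train_dae helper is hypothetical, standing in for the training loop just described (corruption, tanh activations, early stopping); the essential point is that each run starts from different random initial weights. For simplicity, the sketch treats all attributes as continuous.

    import torch

    def multiple_impute(x_filled, miss_mask, m=5):
        """Produce m imputed datasets from one incomplete dataset.

        x_filled:  data tensor with placeholders in the missing cells
        miss_mask: boolean tensor, True where values were originally missing
        """
        imputed_sets = []
        for run in range(m):
            torch.manual_seed(run)        # different random initial weights per run
            model = train_dae(x_filled)   # hypothetical: trains a DAE as described above
            with torch.no_grad():
                z = model(x_filled)       # denoised reconstruction of the data
            # keep observed values; take model output only where data were missing
            imputed_sets.append(torch.where(miss_mask, z, x_filled))
        return imputed_sets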
Usage: We start by imitating a real-life scenario, where the user only has
a dataset with pre-existing missing values. That is, the user does not have the
luxury of access to the clean data and does not know the underlying missing
data generating mechanism or distribution. In scenarios where missingness
is inherent to the dataset, training imputation models using complete data can
bias the learner. But as DAEs require complete data at initialization, we
initially use the respective column average for continuous variables and the
most frequent label for categorical variables as placeholders for missing data
at initialization. The training phase is then initiated with a stochastic
corruption process setting random inputs to zero, where our model learns to
map corrupted input to clean output. Our approach rests on one assumption:
we have enough complete data to train our model, so that the model learns to
recover the true data using stochastic corruption on the inputs, and does not
learn to map the placeholders as valid imputations. The results show this
assumption is readily satisfied in real-life scenarios; even datasets with small
sample sizes are enough for the DAE-based imputation model to achieve better
performance than the state-of-the-art.
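The placeholder initialization described above might look like the following pandas sketch; the toy columns are hypothetical.

    import pandas as pd

    df = pd.DataFrame({"age": [50, 45, None], "sex": ["M", None, "F"]})  # toy data

    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].mean())     # column average for continuous
        else:
            df[col] = df[col].fillna(df[col].mode()[0])  # most frequent label for categorical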
4 Experiments
4.1 Datasets
Table 2 shows the properties of the various real-life, publicly available datasets
[7] used for model evaluation. Models based on deep architectures are known to
perform well on large-sample, high-dimensional datasets. Here we include some
extremely low-dimensional and low-sample-size datasets to test the extremes
and to show that our model has real-world applications. Most of the datasets
have a combination of continuous, categorical, and ordinal attributes, which
further challenges the convergence of our model on small training samples.
Table 2: Datasets used for evaluation. Dataset acronyms, shown in parentheses,
are used in the results section.

Dataset               Observations  Attributes
Boston Housing (BH)   506           14
Breast Cancer (BC)    699           11
DNA (DN)              3186          180
Glass (GL)            214           10
House Votes (HV)      435           17
Ionosphere (IS)       351           35
Ozone (ON)            366           13
Satellite (SL)        6435          37
Servo (SR)            167           5
Shuttle (ST)          58000         9
Sonar (SN)            208           61
Soybean (SB)          683           36
Vehicle (VC)          846           19
Vowel (VW)            990           10
Zoo (ZO)              101           17
Table 3: Imputation results comparing our model and MICE. Results are
displayed as the sum of root mean squared error (RMSE_sum), providing a
measure of the relative distance of the imputation from the original data. As
the results come from multiple imputation (5 imputations), the mean RMSE_sum
over the 5 imputations is displayed, with the min and max RMSE_sum in
parentheses giving a range of imputation performance. The value is NA for
dataset VW under MNAR, as MICE was unable to impute a complete dataset.
             Uniform missingness               Random missingness
Data    DAE              MICE             DAE              MICE
MCAR
BH      2.9(2.9,3)       3.7(3.5,3.8)     0.9(0.9,1)       0.9(0.7,1)
BC      2.9(2.9,2.9)     3.9(3.6,4.2)     1.2(1.2,1.3)     1.3(1.1,1.4)
DN      25.7(25.7,25.7)  36.5(36.3,36.6)  13.1(13.1,13.2)  16.9(16.9,17)
GL      1.1(1,1.1)       1.5(1.3,1.7)     1.3(1.2,1.4)     1.4(1.3,1.6)
HV      2.4(2.4,2.4)     3.4(3.1,3.7)     1.1(1.1,1.2)     1.2(0.9,1.3)
IS      13(12.9,13.1)    17.1(16.2,17.7)  5.8(5.6,6.2)     7(6.7,7.5)
ON      2.1(2.1,2.1)     3.1(3,3.3)       0.9(0.9,1)       1(1,1.2)
SL      3.6(3.6,3.7)     4.5(4.4,4.6)     1.8(1.7,1.8)     0.7(0.7,0.7)
SR      1.2(1.,1.2)      1.5(1.4,1.7)     0.4(0.4,0.5)     0.4(0.4,0.5)
ST      16.5(16.5,16.7)  27.9(27.5,28.2)  6.5(6.4,6.7)     13(12.5,13.8)
SN      5.1(5,5.1)       7.3(7.2,7.5)     2.3(2.2,2.3)     3.2(3.2,3.3)
SB      1.8(1.8,1.8)     2.4(2.3,2.4)     1.2(1.1,1.2)     1.1(1,1.1)
VC      4.1(4,4.1)       5.6(5.5,5.7)     1.6(1.6,1.6)     2.2(2.1,2.3)
VW      5.8(5.7,6.2)     7.7(7,8.1)       2.6(2.4,2.9)     3.8(3.3,4.2)
ZO      2.1(2.1,2.1)     3.4(3.1,4.3)     1.1(1.1,1.2)     1.1(1.1,1.1)
MNAR
BH      2.3(2.2,2.4)     3.2(2.9,3.4)     0.9(0.8,1)       0.7(0.7,0.8)
BC      2.9(2.8,3)       3.6(3.4,3.8)     1.7(1.7,1.8)     1.4(1.3,1.5)
DN      25.3(25.2,25.3)  34.5(34.5,34.7)  5.7(5.7,5.8)     7.2(7.1,7.2)
GL      1.3(1.3,1.4)     1.5(1.3,1.8)     0.4(0.3,0.4)     0.2(0.11,0.2)
HV      2.6(2.6,2.6)     3.5(3.3,3.7)     1.3(1.2,1.3)     1.3(1.3,1.4)
IS      11.7(11.5,11.8)  15.4(14.9,16.5)  4.8(4.5,5.1)     6.3(5.6,6.8)
ON      1.5(1.5,1.5)     2.2(2,2.4)       1.2(1.1,1.2)     1.3(1.1,1.5)
SL      3.4(3.4,3.4)     3.8(3.8,3.9)     1.6(1.6,1.6)     0.5(0.5,0.5)
SR      1.2(1.2,1.2)     1.6(1.5,1.7)     0.4(0.3,0.4)     0.3(0.2,0.3)
ST      11.8(11.7,11.9)  22.4(22.1,22.7)  4.5(4.3,4.7)     9.5(8.4,10.3)
SN      4.6(4.6,4.6)     6.8(6.5,7.1)     2.3(2.3,2.4)     3.1(3,3.2)
SB      1.7(1.7,1.7)     2.3(2.2,2.4)     0.6(0.6,0.6)     0.9(0.9,0.9)
VC      3.5(3.4,3.7)     4.6(4.4,4.8)     1.7(1.7,1.8)     2.4(2.3,2.4)
VW      5.9(5.9,5.9)     NA               2.3(2.1,2.5)     NA
ZO      3.3(2.8,5.5)     3.9(3.6,4.6)     0.9(0.8,1.0)     1.1(0.7,1.7)
For both our model and MICE, we created multiply imputed datasets with five
imputations each. For a better visual representation, we compare the imputation
results of our model and MICE using the mean error ratio E_R, given as
E_R = ( (1/n) Σ_{i=1}^{n} E_{D_i} ) / ( (1/n) Σ_{i=1}^{n} E_{P_i} )   (4)
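In code, Eq. (4) is a simple ratio of averaged errors. A sketch, under the assumption that E_D and E_P denote the per-imputation errors of the DAE model and its competitor, respectively; the numbers are hypothetical.

    import numpy as np

    e_dae = np.array([2.9, 2.9, 3.0, 2.9, 2.9])   # hypothetical E_D, one value per imputation
    e_mice = np.array([3.7, 3.5, 3.8, 3.6, 3.7])  # hypothetical E_P, one value per imputation

    error_ratio = e_dae.mean() / e_mice.mean()    # Eq. (4); values below 1 favor the DAE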
The main goal of imputing missing data is to generate complete datasets that
can be used for analytics. While imputation accuracy provides a measure of how
close the imputed dataset is to the complete dataset, it does not tell us how
well inter-variable correlations are preserved, which can severely impact
end-of-the-line analytics. To check imputation quality in relation to a dataset's
overall structure, and to quantify the impact of imputation on end-of-the-line
analytics, we use all imputed datasets as input to classification/regression
models based on random forests with 5 times 5-fold cross-validation. The task
is to predict the target variable of each dataset and to record the classification
accuracy/RMSE for each dataset imputed using our model and MICE.
Table 4: Average accuracy and RMSE estimates for end-of-the-line analytics
using random forests on imputed datasets. As we have used multiple imputation,
results are averaged over all imputed datasets. An asterisk (*) marks datasets
where RMSE is reported because the target variable is numeric, hence lower
values are better; all other datasets report average classification accuracy,
where higher is better. For dataset VW, as MICE was unable to impute a full
dataset, end-of-the-line analytics is not possible.
        Uniform missingness    Random missingness
Data    DAE      MICE          DAE      MICE
MNAR
BH*     3.9      4.5           3.7      4.1
BC      96.0     96.0          97.0     96.1
DN      91.6     87.6          93.3     93.7
GL      70.2     64.1          74.6     70.4
HV      98.5     99.2          95.0     98.2
IS      90.5     86.6          90.7     90.3
ON*     4        3.8           3.6      4.2
SL      90.0     80.9          89.6     89.6
SR*     6.6      8.1           6.9      6.4
ST      86.4     80.8          76.5     72.9
SN      90.9     70.5          85.3     84.2
SB      72.6     62.7          73.7     77.1
VC      74.7     63.8          72.7     70.6
VW      93.9     NA            77.7     NA
ZO      99.9     98.5          99.9     99.9
Higher values for classification accuracy and lower RMSE signify a
better-preserved predictive dataset structure. We calculate the mean
accuracy/RMSE from all five runs of multiple imputation. Datasets with data
MNAR (uniform and random) are used, as MNAR datasets pose the greatest
challenge for imputation.
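This downstream check can be sketched with scikit-learn along the following lines; the synthetic data stands in for an imputed dataset and is purely illustrative.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

    X, y = make_classification(n_samples=200, random_state=0)  # stand-in for an imputed dataset

    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=0)  # 5 times 5-fold CV
    scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=cv)
    print(f"mean accuracy: {scores.mean():.3f}")  # averaged over runs for each imputed dataset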
The results in Table 4 show that multiple imputation using our model provides
higher predictive power for end-of-the-line analytics than MICE-imputed data.
The difference is even more pronounced when data are MNAR uniform than
when data are MNAR random.
5 Conclusion
We have presented a new method for multiple imputation based on deep
denoising autoencoders. We have shown that our proposed method outperforms
the current state-of-the-art on various real-life datasets and missingness
mechanisms. We have shown that our model performs well even with small
sample sizes, which is thought to be a hard task for deep architectures. In
addition to not requiring a complete dataset for training, our proposed model
improves end-of-the-line analytics.
References
1. Brett K Beaulieu-Jones, Jason H Moore, and the Pooled Resource Open-Access
ALS Clinical Trials Consortium. Missing data imputation in the electronic health
record using deeply learned autoencoders. In Pacific Symposium on Biocomputing,
volume 22, page 207. NIH Public Access, 2016.
2. Yoshua Bengio, Li Yao, Guillaume Alain, and Pascal Vincent. Generalized de-
noising auto-encoders as generative models. In Advances in Neural Information
Processing Systems, pages 899–907, 2013.
3. Stef van Buuren and Karin Groothuis-Oudshoorn. mice: Multivariate imputation
by chained equations in R. Journal of Statistical Software, 45(3), 2011.
4. Pei Chen. Optimization algorithms on subspaces: Revisiting missing data problem
in low-rank matrix. International Journal of Computer Vision, 80(1):125–142,
2008.
5. Yanjie Duan, Yisheng Lv, Wenwen Kang, and Yifei Zhao. A deep learning based
approach for traffic data imputation. In Intelligent Transportation Systems (ITSC),
2014 IEEE 17th International Conference on, pages 912–917. IEEE, 2014.
6. Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature,
521(7553):436–444, 2015.
7. Friedrich Leisch and Evgenia Dimitriadou. Machine learning benchmark problems.
2010.
8. Sheng Li, Jaya Kawale, and Yun Fu. Deep collaborative filtering via marginalized
denoising auto-encoder. In Proceedings of the 24th ACM International on Confer-
ence on Information and Knowledge Management, pages 811–820. ACM, 2015.
9. Roderick JA Little. Missing-data adjustments in large surveys. Journal of Business
& Economic Statistics, 6(3):287–296, 1988.
10. Roderick JA Little and Donald B Rubin. Statistical analysis with missing data.
John Wiley & Sons, 2014.
11. Tim P Morris, Ian R White, and Patrick Royston. Tuning multiple imputation
by predictive mean matching and local residual draws. BMC medical research
methodology, 14(1):75, 2014.
12. Fulufhelo V Nelwamondo, Shakir Mohamed, and Tshilidzi Marwala. Missing data:
A comparison of neural network and expectation maximisation techniques. arXiv
preprint arXiv:0704.3474, 2007.
13. Yurii Nesterov. A method of solving a convex programming problem with conver-
gence rate O(1/k^2). Soviet Mathematics Doklady, 27(2):372–376, 1983.
14. Donald B Rubin. Inference and missing data. Biometrika, pages 581–592, 1976.
15. Joseph L Schafer. Multiple imputation: a primer. Statistical methods in medical
research, 8(1):3–15, 1999.
16. Anoop D Shah, Jonathan W Bartlett, James Carpenter, Owen Nicholas, and Harry
Hemingway. Comparison of random forest and parametric imputation models for
imputing missing data using MICE: a CALIBER study. American Journal of
Epidemiology, 179(6):764–774, 2014.
17. Jonathan AC Sterne, Ian R White, John B Carlin, Michael Spratt, Patrick
Royston, Michael G Kenward, Angela M Wood, and James R Carpenter. Multiple
imputation for missing data in epidemiological and clinical research: potential and
pitfalls. BMJ, 338:b2393, 2009.
18. Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol.
Extracting and composing robust features with denoising autoencoders. In Proceed-
ings of the 25th international conference on Machine learning, pages 1096–1103.
ACM, 2008.