Missing Data Analysis: University College London, 2015

This document discusses various approaches for handling missing data in datasets. It begins by introducing different types of missing data mechanisms and how missing values can bias analyses if not addressed properly. It then summarizes methods that discard missing data like complete-case analysis and available-case analysis. Next, it covers simple imputation methods like mean substitution and hot deck imputation. More advanced techniques like regression imputation, multiple imputation using the EM algorithm or Bayesian methods, and machine learning approaches are also summarized. The document concludes by describing the robust imputation based on the GMDH algorithm to handle missing data containing noise.

Missing data analysis

University College London, 2015


Contents

1. Introduction
2. Missing-data mechanisms
3. Missing-data methods that discard data
4. Simple approaches that retain all the data
5. RIBG
6. Conclusion
Introduction

• Databases are often corrupted by missing values
• Most data mining algorithms cannot be applied directly to incomplete data
• The simplest way to deal with missing data is data reduction, which deletes the instances with missing values; however, it can lead to a great loss of information
Why are data missing

• Random error
– someone forgot to write down a number, to fill in a questionnaire item, etc.

• Systematic bias
– certain types of people could not, or preferred not to, answer certain types of questions
Basic notions

• Let D denote an incomplete dataset with r variables, D = {A1, A2, ..., Ar}, and n instances.
• Each variable consists of an observed and a missing part: Aj = {Aj^obs, Aj^mis}.
• The entire dataset likewise consists of two components: D = {D^obs, D^mis}.
• Let's introduce a response indicator matrix:

  R_ij = 0 if v_ij is missing, 1 if v_ij is observed
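In R, the response indicator matrix can be obtained directly from the missingness pattern; a toy illustration (the dataset values here are ours, not from the slides):

```r
# Toy incomplete dataset D with r = 2 variables and n = 4 instances
D <- data.frame(A1 = c(1.2, NA, 3.4, 5.6),
                A2 = c(NA, 2.0, 4.0, NA))

# Response indicator matrix: R_ij = 1 if v_ij is observed, 0 if missing
R <- 1 * !is.na(D)
R
```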
Types of missing data mechanisms (Rubin)
• Missing Completely At Random (MCAR)
Pr(R | D^mis, D^obs) = Pr(R): the missingness is unrelated to both the missing and the observed values in the dataset.
• Missing At Random (MAR)
Pr(R | D^mis, D^obs) = Pr(R | D^obs): the missingness depends only on observed values.
• Not Missing At Random (NMAR)
Pr(R | D^mis, D^obs) ≠ Pr(R | D^obs): the missingness depends on D^mis.
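These mechanisms are easy to illustrate by simulation; a base-R sketch (the variable names and probabilities are illustrative choices of ours):

```r
set.seed(1)
n <- 10000
x <- rnorm(n)                      # fully observed variable
y <- x + rnorm(n)                  # variable subject to non-response

# R = 1 means "observed", matching the response indicator convention
r_mcar <- rbinom(n, 1, 0.7)        # MCAR: constant probability
r_mar  <- rbinom(n, 1, plogis(x))  # MAR: depends only on observed x
r_nmar <- rbinom(n, 1, plogis(y))  # NMAR: depends on y itself

# Under MCAR the observed mean of y stays unbiased; under NMAR it does not
mean(y[r_mcar == 1])
mean(y[r_nmar == 1])
```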
Missing-data methods that discard data

• Complete-case analysis
– exclude all units for which the outcome or any of the inputs are missing

Problems with this approach:
– if the units with missing values differ systematically from the completely observed cases, the complete-case analysis will be biased
– if many variables are included in a model, there may be very few complete cases, so most of the data would be discarded for the sake of a simple analysis
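In R, complete-case analysis is a one-liner; a quick illustration on the built-in airquality dataset (our example, not from the slides):

```r
# Complete-case analysis: keep only rows with no missing values
airq <- airquality                    # built-in dataset with NAs
cc   <- airq[complete.cases(airq), ]  # same as na.omit(airq)

nrow(airq)   # 153 rows in total
nrow(cc)     # 111 complete cases: 42 rows are discarded
```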
Missing-data methods that discard data

• Available-case analysis
– study different aspects of a problem with different subsets of the data

Example: in the 2001 Social Indicators Survey, all 1501 respondents stated their education level, but 16% refused to state their earnings. This allows summarizing the distribution of education levels using all the responses, and the distribution of earnings using the 84% of respondents who answered the question.

Problems with this approach:
– different analyses will be based on different subsets of the data and may not be consistent with each other
– if non-respondents differ systematically from the respondents, the available-case summaries will be biased
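The same idea in R, again on airquality (our example): each summary uses every row that is observed for that particular quantity.

```r
# Available-case analysis: each summary uses all the rows observed for it
mean(airquality$Ozone,   na.rm = TRUE)   # based on 116 responses
mean(airquality$Solar.R, na.rm = TRUE)   # based on 146 responses

# Pairwise-complete covariances: every entry is computed on a different
# subset of the data, so the matrix need not be internally consistent
cov(airquality, use = "pairwise.complete.obs")
```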
Approaches that retain the data

• Mean substitution
– replace the missing values of a variable by the mean of all its observed values

Problems with this approach:
– the variance of the variable is underestimated, since every imputed value sits exactly at the mean
– if the units with missing values differ systematically from the completely observed cases, the imputed means will be biased
Mean substitution

• The regression line always passes through the mean of X and the mean of Y
• Missing values of X can be placed at the mean of X without affecting the slope of the line
Mean substitution

Advantages:
• all subjects have data for all variables

Disadvantages:
• false impression of the sample size N
• the variance decreases
• what if data are missing for a reason?
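The variance shrinkage is easy to demonstrate; a short base-R illustration (the dataset choice is ours):

```r
# Mean substitution and its effect on variance
x <- airquality$Ozone                    # 37 of 153 values are missing
x_imp <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)

var(x, na.rm = TRUE)  # variance of the observed values
var(x_imp)            # smaller: imputed values add no spread,
                      # while n appears to grow
```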
Approaches that retain the data
• Hot deck imputation
– replace missing values with values from a "similar" responding unit; usually used with survey data. It involves replacing the missing values of one or more variables for a non-respondent (the recipient) with observed values from a respondent (the donor) that is similar to the non-respondent with respect to characteristics observed by both cases.

Types of hot deck imputation:
– random hot deck methods (the donor is selected randomly from a set of potential donors)
– deterministic hot deck methods (a single donor, "nearest" in some sense, is identified and values are imputed from that case)
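A minimal random hot deck sketch in base R, using cylinder count as the adjustment class (the dataset and class variable are our illustrative choices, not from the slides):

```r
# Random hot deck (sketch): draw a donor at random from respondents in
# the same adjustment class -- here, cars with the same cylinder count
set.seed(1)
d <- mtcars
d$wt[c(3, 8, 15)] <- NA            # artificial non-response in wt

for (i in which(is.na(d$wt))) {
  donors <- d$wt[!is.na(d$wt) & d$cyl == d$cyl[i]]  # observed, same class
  d$wt[i] <- sample(donors, 1)     # random hot deck draw
}
anyNA(d$wt)                        # FALSE: every recipient got a donor value
```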
Other imputation methods

• Regression imputation. It uses regression models (of various forms) to predict missing values.
R package: "VIM"

• EM imputation. It uses the iterative Expectation-Maximization algorithm to calculate the sufficient statistics; missing values are produced in the process.

Amelia
An Expectation-Maximization Bootstrap-based algorithm (EMB). It assumes that the complete data are multivariate normal.

Advantages:
• fast
• can deal with time-series data
• never crashes (according to the official description)
Approaches that retain the data

• Multiple imputation. A way to handle missing data first proposed by Rubin. It produces m complete datasets; each of them is analyzed by a complete-data method, and finally the results derived from the m datasets are combined.
Multiple imputation
Basic steps:
1. Build a model that predicts every missing data item (linear or logistic regression, non-linear models, etc.)
2. Use these models to create a "complete" dataset.
3. Each time a "complete" dataset is created, analyze it, keeping the mean and SE of each parameter of interest.
4. Repeat this between 2 and tens of thousands of times.
5. To form final inferences, average the means across repetitions, and combine the within- and between-imputation variances for each parameter.

R package: "mi"
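Step 5 corresponds to Rubin's combining rules; a base-R sketch with hypothetical per-dataset estimates (the function name and the numbers are ours):

```r
# Rubin's rules for pooling m complete-data analyses of one parameter
pool <- function(est, se) {
  m    <- length(est)
  qbar <- mean(est)               # pooled point estimate
  ubar <- mean(se^2)              # within-imputation variance
  b    <- var(est)                # between-imputation variance
  t    <- ubar + (1 + 1/m) * b    # total variance
  c(estimate = qbar, se = sqrt(t))
}

# Hypothetical estimates from m = 5 imputed datasets
pool(est = c(2.1, 1.9, 2.0, 2.2, 1.8),
     se  = c(0.30, 0.31, 0.29, 0.30, 0.32))
```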
Machine learning-based imputation

• Machine-learning-based approaches. Decision tree approaches, clustering procedures, the k-nearest neighbours approach and others can be used to fill in the missing data.

Example: function "impute.knn" from the package "impute"
Example in R

library(impute); library(Amelia); library(mi); library(VIM)

data(mtcars)
mtcars <- as.matrix(mtcars[, c(1, 3:7)])   # mpg, disp, hp, drat, wt, qsec
mtcars_imp <- mtcars
mis_level <- 0.3                           # 30% missing rate

# inject missing values into columns 2 (disp) and 5 (wt)
x1 <- sample(1:nrow(mtcars), round(nrow(mtcars) * mis_level), replace = FALSE)
x2 <- sample(1:nrow(mtcars), round(nrow(mtcars) * mis_level), replace = FALSE)
mtcars_imp[x1, 2] <- NA
mtcars_imp[x2, 5] <- NA

# k-nearest neighbours
knn_res <- rep(0, nrow(mtcars))
for (i in 1:nrow(mtcars)) {
  knn <- impute.knn(mtcars_imp, k = i)
  knn_res[i] <- sqrt(sum((mtcars[x1, 2] - knn$data[x1, 2])^2,
                         (mtcars[x2, 5] - knn$data[x2, 5])^2)) /
                sum(length(x1), length(x2))
}

# Amelia
am <- amelia(mtcars_imp, m = 5)
amelia_imp <- (am$imputations$imp1 + am$imputations$imp2 + am$imputations$imp3 +
               am$imputations$imp4 + am$imputations$imp5) / 5
amelia_res <- sqrt(sum((mtcars[x1, 2] - amelia_imp[x1, 2])^2,
                       (mtcars[x2, 5] - amelia_imp[x2, 5])^2)) /
              sum(length(x1), length(x2))

# Multiple imputation
mult_imp <- mi(missing_data.frame(as.data.frame(mtcars_imp)), n.chains = 5)
mi_imp <- (complete(mult_imp)[[1]][, 1:6] + complete(mult_imp)[[2]][, 1:6] +
           complete(mult_imp)[[3]][, 1:6] + complete(mult_imp)[[4]][, 1:6] +
           complete(mult_imp)[[5]][, 1:6]) / 5
mi_res <- sqrt(sum((mtcars[x1, 2] - mi_imp[x1, 2])^2,
                   (mtcars[x2, 5] - mi_imp[x2, 5])^2)) /
          sum(length(x1), length(x2))

# Regression imputation
imp1 <- regressionImp(disp ~ mpg + hp + drat + qsec, data = as.data.frame(mtcars_imp))
imp2 <- regressionImp(wt ~ mpg + hp + drat + qsec, data = as.data.frame(mtcars_imp))
reg_imp <- cbind(mtcars_imp[, 1], imp1$disp, mtcars_imp[, 3:4],
                 imp2$wt, mtcars_imp[, 6])
reg_res <- sqrt(sum((mtcars[x1, 2] - reg_imp[x1, 2])^2,
                    (mtcars[x2, 5] - reg_imp[x2, 5])^2)) /
           sum(length(x1), length(x2))

knn_res; amelia_res; mi_res; reg_res
GMDH algorithm

• The Group Method of Data Handling is an inductive method that constructs a hierarchical (multi-layered) network structure to identify a complex input-output functional relationship from data.

• The process of GMDH is based on sorting out gradually more complicated models and selecting the best solution by an external criterion.
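A single GMDH building block ("partial description") is commonly a two-input quadratic polynomial fitted by least squares and scored on held-out data, which plays the role of the external criterion; a base-R sketch under that assumption (data and coefficients are ours):

```r
# One GMDH partial description: quadratic polynomial in two inputs,
# fitted on a training subset and scored on a validation subset
set.seed(1)
x1 <- runif(100); x2 <- runif(100)
y  <- 1 + 2 * x1 - x2 + 0.5 * x1 * x2 + rnorm(100, sd = 0.05)
d  <- data.frame(y, x1, x2)

train <- d[1:70, ]; valid <- d[71:100, ]
fit <- lm(y ~ x1 + x2 + I(x1 * x2) + I(x1^2) + I(x2^2), data = train)

# External criterion: error on data the model was not fitted on;
# GMDH keeps the candidate models that score best here
ext <- mean((valid$y - predict(fit, valid))^2)
ext
```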
RIBG (robust imputation based on GMDH) algorithm
• The main idea of RIBG is to use the GMDH mechanism to impute missing data even when the data contain noise.
• Consider an incomplete dataset D = {A1, A2, ..., Ar}.
• First, RIBG fills in the original dataset by simple mean imputation to get an initial complete dataset.
• Then the GMDH mechanism is used to predict and update these initial estimates of the missing values in an iterative process.
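The two steps above can be sketched in base R; here lm() stands in for the GMDH model, which is a simplifying assumption of ours (RIBG builds GMDH networks instead):

```r
# RIBG-style iteration (sketch): lm() is a stand-in for the GMDH model
set.seed(1)
D <- as.matrix(airquality[, 1:4])
miss <- is.na(D)

# Step 1: initial complete dataset via simple mean imputation
for (j in 1:ncol(D)) D[miss[, j], j] <- mean(D[, j], na.rm = TRUE)

# Step 2: iteratively re-predict and update the missing entries
for (iter in 1:10) {
  old <- D
  for (j in which(colSums(miss) > 0)) {
    fit <- lm(D[, j] ~ D[, -j])               # model for variable j
    D[miss[, j], j] <- fitted(fit)[miss[, j]] # update imputed values only
  }
  if (max(abs(D - old)) < 1e-6) break         # stop when estimates stabilise
}
```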
RIBG criterion

• The criterion integrates the systematic regularity criterion (SR) and the minimum bias criterion (MB):

RM = SR + MB = [ ∑_{i∈B} (y_i − ŷ_i^C)² + ∑_{i∈C} (y_i − ŷ_i^B)² ] + ∑_{i∈B∪C} (ŷ_i^B − ŷ_i^C)²

where B, C are two disjoint subsets with B ∪ C = D, and ŷ_i^B, ŷ_i^C are the estimated outputs of the models built on subsets B and C.
Simulations
Data  sets:
• Housing  (economics)

• Breast  (medical  science)

• Bupa,  Cmc,  Iris  (life  sciences)

• Glass2,  Ionosphere,  Wine  (physics)


Missingness and noise

Levels of missing rate: 5%, 10%, 20%
Levels of noise (δ): 0%, 10%, 20%

Every value of each variable had a δ chance to be changed to any other random value.
Methods to compare

• Regression imputation
• EM imputation
• GBNN imputation (based on the kNN method)
• Multiple imputation
Performance measure

NMAE_j = (1 / n_j^mis) ∑_{i=1}^{n_j^mis} |v̂_ij − v_ij| / (v_j^max − v_j^min)   if the variable is numerical

NMAE_j = 1 − n_j^cor / n_j^mis   if the variable is nominal

where n_j^mis is the number of missing values; v_ij, v̂_ij are the true and imputed values; v_j^max, v_j^min are the maximum and minimum of this variable; n_j^cor is the number of correctly predicted nominal values.
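The numerical branch of NMAE can be written directly in R (the function name is ours):

```r
# NMAE for a numerical variable: mean absolute imputation error,
# normalised by the observed range of the variable
nmae_num <- function(v_true, v_imp, v_all) {
  mean(abs(v_imp - v_true)) / (max(v_all) - min(v_all))
}

# True values 1 and 3 imputed as 2 and 2; the variable ranges over 0..10
nmae_num(c(1, 3), c(2, 2), c(0, 10))   # 0.1
```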
Literature

1. Andridge R.R., Little R.J.A. A review of hot deck imputation for survey non-response. International Statistical Review, 78, 2010, pp. 40-64.
2. Honaker J., King G., Blackwell M. Amelia II: A program for missing data, 2014.
3. Zhu B., He C., Liatsis P. A robust missing value imputation method for noisy data. Applied Intelligence, 36(1), 2012, pp. 61-74.
4. R packages: "HotDeckImputation", "Amelia", "mi"
Questions
