Missing Data Analysis: University College London, 2015

This document discusses various approaches for handling missing data in datasets. It begins by introducing different types of missing data mechanisms and how missing values can bias analyses if not addressed properly. It then summarizes methods that discard missing data like complete-case analysis and available-case analysis. Next, it covers simple imputation methods like mean substitution and hot deck imputation. More advanced techniques like regression imputation, multiple imputation using the EM algorithm or Bayesian methods, and machine learning approaches are also summarized. The document concludes by describing the robust imputation based on the GMDH algorithm to handle missing data containing noise.

Missing data analysis

University College London, 2015


Contents

1. Introduction
2. Missing-data mechanisms
3. Missing-data methods that discard data
4. Simple approaches that retain all the data
5. RIBG
6. Conclusion
Introduction

• Databases are often corrupted by missing values
• Most data mining algorithms cannot be applied directly to incomplete data
• The simplest way to deal with missing data is data reduction, which deletes the instances with missing values; however, it can lead to a great loss of information
Why are data missing

• Random error
– someone forgot to write down a number, to fill in a questionnaire item, etc.

• Systematic bias
– certain types of people could not, or preferred not to, answer certain types of questions
Basic notions

• Let D denote an incomplete dataset with r variables, D = {A1, A2, ..., Ar}, and n instances.
• Each variable consists of an observed and a missing part: Aj = {Aj^obs, Aj^mis}.
• The entire dataset likewise consists of two components: D = {D^obs, D^mis}.
• Let's introduce a response indicator matrix:

  R_ij = 0 if v_ij is missing, 1 if v_ij is observed
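In R, the response indicator matrix can be obtained directly from the missingness pattern; a toy illustration (the dataset values here are ours, not from the slides):

```r
# Toy incomplete dataset D with r = 2 variables and n = 4 instances
D <- data.frame(A1 = c(1.2, NA, 3.4, 5.6),
                A2 = c(NA, 2.0, 4.0, NA))

# Response indicator matrix: R_ij = 1 if v_ij is observed, 0 if missing
R <- 1 * !is.na(D)
R
```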
Types of missing data mechanisms (Rubin)
• Missing Completely At Random (MCAR)
Pr(R | D^mis, D^obs) = Pr(R): the missingness is unrelated to both the missing and the observed values in the dataset.
• Missing At Random (MAR)
Pr(R | D^mis, D^obs) = Pr(R | D^obs): the missingness depends only on observed values.
• Not Missing At Random (NMAR)
Pr(R | D^mis, D^obs) ≠ Pr(R | D^obs): the missingness depends on D^mis.
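These mechanisms are easy to illustrate by simulation; a base-R sketch (the variable names and probabilities are illustrative choices of ours):

```r
set.seed(1)
n <- 10000
x <- rnorm(n)                      # fully observed variable
y <- x + rnorm(n)                  # variable subject to non-response

# R = 1 means "observed", matching the response indicator convention
r_mcar <- rbinom(n, 1, 0.7)        # MCAR: constant probability
r_mar  <- rbinom(n, 1, plogis(x))  # MAR: depends only on observed x
r_nmar <- rbinom(n, 1, plogis(y))  # NMAR: depends on y itself

# Under MCAR the observed mean of y stays unbiased; under NMAR it does not
mean(y[r_mcar == 1])
mean(y[r_nmar == 1])
```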
Missing-data methods that discard data

• Complete-case analysis
– exclude all units for which the outcome or any of the inputs are missing

Problems with this approach:
– if the units with missing values differ systematically from the completely observed cases, the complete-case analysis will be biased
– if many variables are included in a model, there may be very few complete cases, so most of the data would be discarded for the sake of a simple analysis
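In R, complete-case analysis is a one-liner; a quick illustration on the built-in airquality dataset (our example, not from the slides):

```r
# Complete-case analysis: keep only rows with no missing values
airq <- airquality                    # built-in dataset with NAs
cc   <- airq[complete.cases(airq), ]  # same as na.omit(airq)

nrow(airq)   # 153 rows in total
nrow(cc)     # 111 complete cases: 42 rows are discarded
```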
Missing-data methods that discard data

• Available-case analysis
– study different aspects of a problem with different subsets of the data

Example: in the 2001 Social Indicators Survey, all 1501 respondents stated their education level, but 16% refused to state their earnings. This allows summarizing the distribution of education levels using all the responses, and the distribution of earnings using the 84% of respondents who answered the question.

Problems with this approach:
– different analyses will be based on different subsets of the data and may not be consistent with each other
– if non-respondents differ systematically from the respondents, the available-case summaries will be biased
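The same idea in R, again on airquality (our example): each summary uses every row that is observed for that particular quantity.

```r
# Available-case analysis: each summary uses all the rows observed for it
mean(airquality$Ozone,   na.rm = TRUE)   # based on 116 responses
mean(airquality$Solar.R, na.rm = TRUE)   # based on 146 responses

# Pairwise-complete covariances: every entry is computed on a different
# subset of the data, so the matrix need not be internally consistent
cov(airquality, use = "pairwise.complete.obs")
```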
Approaches that retain the data

• Mean substitution
– replace the missing values of a variable by the mean of all its observed values

Problems with this approach:
– the variance of the variable is underestimated, since every imputed value sits exactly at the mean
– if the units with missing values differ systematically from the completely observed cases, the imputed means will be biased
Mean substitution

• The regression line always passes through the mean of X and the mean of Y
• Missing values of X can be placed at the mean of X without affecting the slope of the line
Mean substitution

Advantages:
• all subjects have data for all variables

Disadvantages:
• false impression of the sample size N
• the variance decreases
• what if data are missing for a reason?
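The variance shrinkage is easy to demonstrate; a short base-R illustration (the dataset choice is ours):

```r
# Mean substitution and its effect on variance
x <- airquality$Ozone                    # 37 of 153 values are missing
x_imp <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)

var(x, na.rm = TRUE)  # variance of the observed values
var(x_imp)            # smaller: imputed values add no spread,
                      # while n appears to grow
```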
Approaches that retain the data
• Hot deck imputation
– replace missing values with values from a "similar" responding unit; usually used with survey data. It involves replacing the missing values of one or more variables for a non-respondent (the recipient) with observed values from a respondent (the donor) that is similar to the non-respondent with respect to characteristics observed by both cases.

Types of hot deck imputation:
– random hot deck methods (the donor is selected randomly from a set of potential donors)
– deterministic hot deck methods (a single donor, "nearest" in some sense, is identified and values are imputed from that case)
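A minimal random hot deck sketch in base R, using cylinder count as the adjustment class (the dataset and class variable are our illustrative choices, not from the slides):

```r
# Random hot deck (sketch): draw a donor at random from respondents in
# the same adjustment class -- here, cars with the same cylinder count
set.seed(1)
d <- mtcars
d$wt[c(3, 8, 15)] <- NA            # artificial non-response in wt

for (i in which(is.na(d$wt))) {
  donors <- d$wt[!is.na(d$wt) & d$cyl == d$cyl[i]]  # observed, same class
  d$wt[i] <- sample(donors, 1)     # random hot deck draw
}
anyNA(d$wt)                        # FALSE: every recipient got a donor value
```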
Other imputation methods

• Regression imputation. It uses regression models (of various forms) to predict missing values.
R package: "VIM"

• EM imputation. It uses the iterative Expectation-Maximization algorithm to calculate the sufficient statistics; missing values are produced in the process.

Amelia
An Expectation-Maximization Bootstrap-based algorithm (EMB). It assumes that the complete data are multivariate normal.

Advantages:
• fast
• can deal with time-series data
• never crashes (according to the official description)
Approaches that retain the data

• Multiple imputation. A way to handle missing data first proposed by Rubin. It produces m complete datasets; each of them is analyzed by a complete-data method, and finally the results derived from the m datasets are combined.
Multiple imputation
Basic steps:
1. Build a model that predicts every missing data item (linear or logistic regression, non-linear models, etc.)
2. Use these models to create a "complete" dataset.
3. Each time a "complete" dataset is created, analyze it, keeping the mean and SE of each parameter of interest.
4. Repeat this between 2 and tens of thousands of times.
5. To form final inferences, average the means across repetitions, and combine the within- and between-imputation variances for each parameter.

R package: "mi"
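Step 5 corresponds to Rubin's combining rules; a base-R sketch with hypothetical per-dataset estimates (the function name and the numbers are ours):

```r
# Rubin's rules for pooling m complete-data analyses of one parameter
pool <- function(est, se) {
  m    <- length(est)
  qbar <- mean(est)               # pooled point estimate
  ubar <- mean(se^2)              # within-imputation variance
  b    <- var(est)                # between-imputation variance
  t    <- ubar + (1 + 1/m) * b    # total variance
  c(estimate = qbar, se = sqrt(t))
}

# Hypothetical estimates from m = 5 imputed datasets
pool(est = c(2.1, 1.9, 2.0, 2.2, 1.8),
     se  = c(0.30, 0.31, 0.29, 0.30, 0.32))
```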
Machine learning-based imputation

• Machine-learning-based approaches. Decision tree approaches, clustering procedures, the k-nearest neighbours approach and others can be used to fill in the missing data.

Example: function "impute.knn" from the package "impute"
Example in R

library(impute); library(Amelia); library(mi); library(VIM)

data(mtcars)
mtcars <- as.matrix(mtcars[, c(1, 3:7)])   # mpg, disp, hp, drat, wt, qsec
mtcars_imp <- mtcars
mis_level <- 0.3                           # 30% missing rate

# inject missing values into columns 2 (disp) and 5 (wt)
x1 <- sample(1:nrow(mtcars), round(nrow(mtcars) * mis_level), replace = FALSE)
x2 <- sample(1:nrow(mtcars), round(nrow(mtcars) * mis_level), replace = FALSE)
mtcars_imp[x1, 2] <- NA
mtcars_imp[x2, 5] <- NA

# k-nearest neighbours
knn_res <- rep(0, nrow(mtcars))
for (i in 1:nrow(mtcars)) {
  knn <- impute.knn(mtcars_imp, k = i)
  knn_res[i] <- sqrt(sum((mtcars[x1, 2] - knn$data[x1, 2])^2,
                         (mtcars[x2, 5] - knn$data[x2, 5])^2)) /
                sum(length(x1), length(x2))
}

# Amelia
am <- amelia(mtcars_imp, m = 5)
amelia_imp <- (am$imputations$imp1 + am$imputations$imp2 + am$imputations$imp3 +
               am$imputations$imp4 + am$imputations$imp5) / 5
amelia_res <- sqrt(sum((mtcars[x1, 2] - amelia_imp[x1, 2])^2,
                       (mtcars[x2, 5] - amelia_imp[x2, 5])^2)) /
              sum(length(x1), length(x2))

# Multiple imputation
mult_imp <- mi(missing_data.frame(as.data.frame(mtcars_imp)), n.chains = 5)
mi_imp <- (complete(mult_imp)[[1]][, 1:6] + complete(mult_imp)[[2]][, 1:6] +
           complete(mult_imp)[[3]][, 1:6] + complete(mult_imp)[[4]][, 1:6] +
           complete(mult_imp)[[5]][, 1:6]) / 5
mi_res <- sqrt(sum((mtcars[x1, 2] - mi_imp[x1, 2])^2,
                   (mtcars[x2, 5] - mi_imp[x2, 5])^2)) /
          sum(length(x1), length(x2))

# Regression imputation
imp1 <- regressionImp(disp ~ mpg + hp + drat + qsec, data = as.data.frame(mtcars_imp))
imp2 <- regressionImp(wt ~ mpg + hp + drat + qsec, data = as.data.frame(mtcars_imp))
reg_imp <- cbind(mtcars_imp[, 1], imp1$disp, mtcars_imp[, 3:4],
                 imp2$wt, mtcars_imp[, 6])
reg_res <- sqrt(sum((mtcars[x1, 2] - reg_imp[x1, 2])^2,
                    (mtcars[x2, 5] - reg_imp[x2, 5])^2)) /
           sum(length(x1), length(x2))

knn_res; amelia_res; mi_res; reg_res
GMDH algorithm

• The Group Method of Data Handling is an inductive method that constructs a hierarchical (multi-layered) network structure to identify a complex input-output functional relationship from data.

• The process of GMDH is based on sorting out gradually more complicated models and selecting the best solution by an external criterion.
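A single GMDH building block ("partial description") is commonly a two-input quadratic polynomial fitted by least squares and scored on held-out data, which plays the role of the external criterion; a base-R sketch under that assumption (data and coefficients are ours):

```r
# One GMDH partial description: quadratic polynomial in two inputs,
# fitted on a training subset and scored on a validation subset
set.seed(1)
x1 <- runif(100); x2 <- runif(100)
y  <- 1 + 2 * x1 - x2 + 0.5 * x1 * x2 + rnorm(100, sd = 0.05)
d  <- data.frame(y, x1, x2)

train <- d[1:70, ]; valid <- d[71:100, ]
fit <- lm(y ~ x1 + x2 + I(x1 * x2) + I(x1^2) + I(x2^2), data = train)

# External criterion: error on data the model was not fitted on;
# GMDH keeps the candidate models that score best here
ext <- mean((valid$y - predict(fit, valid))^2)
ext
```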
RIBG (robust imputation based on GMDH) algorithm
• The main idea of RIBG is to use the GMDH mechanism to impute missing data even when the data contain noise.
• Consider an incomplete dataset D = {A1, A2, ..., Ar}.
• First, RIBG fills in the original dataset by simple mean imputation to get an initial complete dataset.
• Then the GMDH mechanism is used to predict and update these initial estimates of the missing values in an iterative process.
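The two steps above can be sketched in base R; here lm() stands in for the GMDH model, which is a simplifying assumption of ours (RIBG builds GMDH networks instead):

```r
# RIBG-style iteration (sketch): lm() is a stand-in for the GMDH model
set.seed(1)
D <- as.matrix(airquality[, 1:4])
miss <- is.na(D)

# Step 1: initial complete dataset via simple mean imputation
for (j in 1:ncol(D)) D[miss[, j], j] <- mean(D[, j], na.rm = TRUE)

# Step 2: iteratively re-predict and update the missing entries
for (iter in 1:10) {
  old <- D
  for (j in which(colSums(miss) > 0)) {
    fit <- lm(D[, j] ~ D[, -j])               # model for variable j
    D[miss[, j], j] <- fitted(fit)[miss[, j]] # update imputed values only
  }
  if (max(abs(D - old)) < 1e-6) break         # stop when estimates stabilise
}
```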
RIBG criterion

• The criterion integrates the systematic regularity criterion (SR) and the minimum bias criterion (MB):

RM = SR + MB = [ ∑_{i∈B} (y_i − ŷ_i^C)² + ∑_{i∈C} (y_i − ŷ_i^B)² ] + ∑_{i∈B∪C} (ŷ_i^B − ŷ_i^C)²

where B, C are two disjoint subsets with B ∪ C = D, and ŷ_i^B, ŷ_i^C are the estimated outputs of the models built on subsets B and C.
Simulations
Data  sets:
• Housing  (economics)

• Breast  (medical  science)

• Bupa,  Cmc,  Iris  (life  sciences)

• Glass2,  Ionosphere,  Wine  (physics)


Missingness and noise

Levels of missing rate: 5%, 10%, 20%
Levels of noise (δ): 0%, 10%, 20%

Every value of each variable had a δ chance to be changed to any other random value.
Methods to compare

• Regression imputation
• EM imputation
• GBNN imputation (based on the kNN method)
• Multiple imputation
Performance measure

NMAE_j = (1 / n_j^mis) ∑_{i=1}^{n_j^mis} |v̂_ij − v_ij| / (v_j^max − v_j^min)   if the variable is numerical

NMAE_j = 1 − n_j^cor / n_j^mis   if the variable is nominal

where n_j^mis is the number of missing values; v_ij, v̂_ij are the true and imputed values; v_j^max, v_j^min are the maximum and minimum of this variable; n_j^cor is the number of correctly predicted nominal values.
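The numerical branch of NMAE can be written directly in R (the function name is ours):

```r
# NMAE for a numerical variable: mean absolute imputation error,
# normalised by the observed range of the variable
nmae_num <- function(v_true, v_imp, v_all) {
  mean(abs(v_imp - v_true)) / (max(v_all) - min(v_all))
}

# True values 1 and 3 imputed as 2 and 2; the variable ranges over 0..10
nmae_num(c(1, 3), c(2, 2), c(0, 10))   # 0.1
```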
Literature

1. Andridge R.R., Little R.J.A. A review of hot deck imputation for survey non-response. International Statistical Review, 78, 2010, pp. 40-64.
2. Honaker J., King G., Blackwell M. Amelia II: A program for missing data, 2014.
3. Zhu B., He C., Liatsis P. A robust missing value imputation method for noisy data. Applied Intelligence, 36(1), 2012, pp. 61-74.
4. R packages: "HotDeckImputation", "Amelia", "mi"
Questions
