Presented for the ICEAA 2021 Online Workshop - www.iceaaonline.com
Dealing with Missing Data: The Art and Science of Imputation
May 2021
For the International Cost Estimating and Analysis Association Conference - May 2021
IMPUTATION
FILLING IN HOLES IN DATASETS
THE PROBLEM OF MISSING DATA
A significant problem, especially for small datasets
Often dealt with by removing observations with missing data
TECHNIQUES FOR HANDLING MISSING DATA
A variety of techniques exist for filling in missing data, though
some perform better than others
FILLING IN HOLES WITH STATISTICS
Recognizing the inherent uncertainty in missing data, we
adopt and advocate the method of multiple imputation
using Bayesian methods (“chained equations”)
Why Imputation?
Is it worth it?
Impute and Assess Risk!
Preserves Data
Imputation prevents the reduction of sample size due to missing values. This helps to preserve all responses in the sample
Fooled by Randomness
Having more data prevents us from falling prey to overly optimistic models that are fit to more noise than signal
Preserves Structure of Data
When we remove data points, we could be missing important patterns in the data, which can cause our analysis to distort true patterns within the data
Predictive Accuracy
Reducible uncertainty can be reduced by increasing sample size. This helps to improve predictive accuracy
DATA
Foundation of All Analyses
How Should We Handle It?
The bulk of the time in analytics should be spent on collecting, normalizing and verifying data. In defense and aerospace applications, datasets are small. Data should be preserved when possible
"The goal is to turn data into information, and information into insight." - Carly Fiorina
IMPUTATION
To impute or not to impute, that is the question
01 Understand the available data
02 Determine variables that would benefit from imputation
03 Know when blanks are intentional
Imputation is a powerful method for filling blanks in a dataset when values are missing
An analyst must understand the data intimately to know whether a blank means that the factor is not applicable for that data point
Sometimes a blank does not reflect a nonresponse and should be observed “as is”
Is the response missing at random?
The US Census Bureau deals with missing data all the time. If no response is provided for the name of Person 7 on the Census form from a household of six members, this missing value is not an omission; the response is “Not Applicable”
ISSUES WITH DATA GAPS
What can go wrong?
Fewer Degrees of Freedom
Removing observations with missing values results in fewer degrees of freedom in models
Reduction of Predictive Power
Predictive power is diminished when degrees of freedom are small
Inability to Use Advanced Methods
Certain Machine Learning methods cannot be applied when missing values are prevalent
METHODS ALLOWING
MISSING DATA
Complete-Case Analysis
An approach that excludes any record with missing data. Disadvantage: bias can be introduced into the analysis because the removed data may provide insight into the population
Available-Case Analysis
An approach that analyzes subsets of the complete dataset so that multiple aspects of a problem can be studied. Disadvantage: bias is again introduced if data are missing in a pattern
Alternative to Allowing Missingness
Though these methods allow analysis to continue after removing missing data, better alternatives exist for filling data gaps
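The difference between the two approaches can be sketched in base R. The values are three columns of the engine dataset shown later in this deck; the object names (`engines`, `n_complete`, `n_bhp_uc`) are illustrative and not part of the original analysis.

```r
# Three columns of the deck's engine dataset; NA marks a blank cell.
engines <- data.frame(
  bHP    = c(290, 330, 330, 515, 675, 675, 500, 362, 340),
  DryWGT = c(NA, 1296, 1905, 3090, NA, NA, NA, 3230, 912),
  UC     = c(40079, 40927, 29563, 63931, 111976, 120661, 47873, NA, 40661)
)

# Complete-case analysis: drop every row that has any missing value.
complete_cases <- na.omit(engines)
n_complete <- nrow(complete_cases)

# Available-case analysis: each pairwise statistic uses all rows in which
# that particular pair of variables is observed.
n_bhp_uc <- sum(complete.cases(engines[, c("bHP", "UC")]))
cor_available <- cor(engines, use = "pairwise.complete.obs")

cat("Rows kept by complete-case analysis:", n_complete, "\n")
cat("Rows available for the bHP-UC pair: ", n_bhp_uc, "\n")
```

Complete-case analysis keeps far fewer rows than the bHP-UC pair alone would need, which is the loss of information the slide warns about.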
IMPUTATION METHODS
Mean Imputation
Filling missing values with the mean of the observed values
Imputing using Related Observations
Filling missing values with responses from related observations
Regression Imputation
Replacing missing values with a predicted value based on the results of fitting a regression line to the available data
Expectation Maximization
Replacing missing values by exploring the covariation among variables in order to infer values for the missing data
To retain as much of the precious gold (data) as possible, we should consider using imputation methods. There are several methods you can choose from to make a best statistical inference of a response that will close a data gap
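As a rough illustration of the first and third methods, the base R sketch below fills the one blank unit cost (UC) in the deck's engine dataset, first with the observed mean and then with a regression prediction from brake horsepower (bHP). The object names are illustrative, and the choice of bHP as the single predictor is an assumption made for the example.

```r
# bHP and unit cost (UC) from the deck's engine dataset; engine 8 has no UC.
bHP <- c(290, 330, 330, 515, 675, 675, 500, 362, 340)
UC  <- c(40079, 40927, 29563, 63931, 111976, 120661, 47873, NA, 40661)
is_missing <- is.na(UC)

# Mean imputation: replace the blank with the mean of the observed values.
UC_mean <- UC
UC_mean[is_missing] <- mean(UC, na.rm = TRUE)

# Regression imputation: fit UC ~ bHP on the observed rows (lm() drops
# rows with NA by default), then predict the blank from its bHP.
fit <- lm(UC ~ bHP)
UC_reg <- UC
UC_reg[is_missing] <- predict(fit, newdata = data.frame(bHP = bHP[is_missing]))

print(round(c(mean_imputed = UC_mean[is_missing],
              regression_imputed = UC_reg[is_missing])))
```

The two fills differ because the regression uses the engine's own horsepower, while the mean ignores it.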
IMPUTATION METHODS
How do they compare?
Mean Imputation
This method helps to restrict the variability of the data
Disadvantage: It weakens covariances and correlations among features
Related Observations
This method also helps to restrict variability in the data
Disadvantage: Introduces measurement error
Regression Imputation
This method uses regression to predict missing values. MICE is a regression imputation method
Advantage: Produces unbiased estimates with data that are Missing At Random (MAR)
Expectation Maximization
This method uses the maximum likelihood method to estimate missing values
Advantage: Increases precision and decreases parameter bias
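The effect of mean imputation on variability and correlation can be checked with a small base R sketch. It uses the bHP and DryWGT columns of the deck's engine dataset; the object names are illustrative.

```r
# bHP is complete; DryWGT has four blanks in the deck's engine dataset.
bHP    <- c(290, 330, 330, 515, 675, 675, 500, 362, 340)
DryWGT <- c(NA, 1296, 1905, 3090, NA, NA, NA, 3230, 912)

# Fill every blank with the mean of the observed dry weights.
DryWGT_mean <- ifelse(is.na(DryWGT), mean(DryWGT, na.rm = TRUE), DryWGT)

# Spread of DryWGT before and after mean imputation.
var_observed <- var(DryWGT, na.rm = TRUE)
var_imputed  <- var(DryWGT_mean)

# Correlation with bHP before and after mean imputation.
cor_observed <- cor(bHP, DryWGT, use = "complete.obs")
cor_imputed  <- cor(bHP, DryWGT_mean)

cat("Variance:   ", round(var_observed), "->", round(var_imputed), "\n")
cat("Correlation:", round(cor_observed, 3), "->", round(cor_imputed, 3), "\n")
```

Because every filled value sits exactly at the mean, the filled rows add no spread and no co-movement with bHP, so both statistics shrink.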
Tools for Imputation
R
R is a language and environment for statistical computing and graphics. It is an integrated suite of software facilities for data manipulation, calculation and graphical display
Python
Python is a high-level programming language with dynamic semantics. Like R, Python supports modules and packages to help with analysis
MICE
MULTIPLE IMPUTATION BY
CHAINED EQUATIONS
MICE
Method
This method creates multiple imputations for each missing value, which accounts for the statistical uncertainty in the imputation
Assumptions
This method operates under the assumption that the missing data are Missing At Random (MAR). MAR occurs when a data gap is fully accounted for by variables where there is complete information
Iterations
Multiple regression models are fitted, and each variable with missing data is modeled conditionally on the responses of the other variables within the dataset. With this method, each variable is modeled according to its own distribution
HOW MICE FILLS GAPS
Several imputed versions of the data are created using plausible data values
01 Multiple imputation is a series of stochastic regression imputations
02 The first step is an imputation step (I-step) that fills data gaps using stochastic regression
03 The number of iterations, m, is specified as the number of imputations that are conducted in the I-step
04 In the posterior step (P-step), the mean and covariance distributions are calculated from the filled-in data
05 The P-step proceeds by taking a random draw from the mean and covariance distributions, which are used to calculate regression coefficients
06 The coefficients of the individual equations are averaged using a simple, unweighted mean. Goodness-of-fit measures are calculated using the pooled results
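The idea behind a stochastic regression imputation can be sketched in base R as a regression prediction plus a random residual draw, repeated m times. This is a deliberately simplified illustration: it omits the P-step draw of the regression parameters and it is not the predictive mean matching used by the 'mice' package. The single predictor bHP and all object names are assumptions made for the example.

```r
set.seed(23109)  # fixed seed so the sketch is repeatable

# bHP and unit cost (UC) from the deck's engine dataset; engine 8 has no UC.
bHP <- c(290, 330, 330, 515, 675, 675, 500, 362, 340)
UC  <- c(40079, 40927, 29563, 63931, 111976, 120661, 47873, NA, 40661)
is_missing <- is.na(UC)
m <- 5  # number of imputed datasets

# Regression fitted to the observed rows.
fit      <- lm(UC ~ bHP)
resid_sd <- summary(fit)$sigma   # residual standard error
pred     <- as.numeric(predict(fit, newdata = data.frame(bHP = bHP[is_missing])))

# One stochastic imputation per dataset: prediction plus random noise.
imputed_UC <- pred + rnorm(m, mean = 0, sd = resid_sd)

# Build the m completed datasets.
completed <- lapply(imputed_UC, function(v) {
  data.frame(bHP = bHP, UC = replace(UC, is_missing, v))
})

print(round(imputed_UC))
```

The spread among the m filled-in values is what carries the imputation uncertainty into the pooled results.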
THE MICE PROCESS
Given the multiple imputations, the coefficients of the individual equations are averaged (using a simple, unweighted mean). The other parameters, including the degrees of freedom, standard errors, and R2 values, are combined using what are known as Rubin’s Rules, after the statistician who developed them
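A minimal base R sketch of Rubin's Rules for one coefficient follows. The function name `rubin_pool` and the example slopes and standard errors are hypothetical, chosen only to show the arithmetic. The degrees of freedom use Rubin's large-sample formula; to the best of our knowledge the 'mice' package applies a small-sample adjustment, so its reported value can differ.

```r
# Pool m estimates of one coefficient and their standard errors.
rubin_pool <- function(estimates, std_errors) {
  m     <- length(estimates)
  q_bar <- mean(estimates)            # pooled point estimate
  w     <- mean(std_errors^2)         # within-imputation variance
  b     <- var(estimates)             # between-imputation variance
  total <- w + (1 + 1 / m) * b        # total variance
  # Rubin's large-sample degrees of freedom
  df    <- (m - 1) * (1 + w / ((1 + 1 / m) * b))^2
  list(estimate = q_bar, std_error = sqrt(total), df = df)
}

# Hypothetical slope estimates and standard errors from five imputed datasets.
slopes <- c(205, 198, 212, 201, 190)
ses    <- c(30, 32, 29, 31, 35)
pooled <- rubin_pool(slopes, ses)
print(pooled)
```

The pooled standard error is larger than a plain average of the five standard errors because the between-imputation variance is added on top of the within-imputation variance.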
UNDERSTANDING THE DATA
Exploring engine data
Dataset
The data used for analysis is a Wheeled and Tracked Vehicle
Engine dataset. The dataset is small, which makes the use of
imputation very important
Included Features
Identification (ID), Brake Horsepower (bHP), Displacement
(DISP), Engine Speed (EngSP), Cylinders (CYL), Unit Cost in
Dollars (UC), Dry Weight (DryWGT)
Missing Counts
Four of the seven features in the dataset have missing values.
N=9
Dataset Example
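Four variables have missing data (blank cells shown as NA)
ID | bHP | EngSP | CYL | DryWGT | DISP | UC
1 | 290 | 2600 | 6 | NA | 7.2 | $40,079
2 | 330 | 2400 | 6 | 1296 | 7.2 | $40,927
3 | 330 | 2200 | 6 | 1905 | 8.8 | $29,563
4 | 515 | 1500 | 6 | 3090 | 15.2 | $63,931
5 | 675 | 2101 | 8 | NA | 14.8 | $111,976
6 | 675 | 2101 | 8 | NA | 14.8 | $120,661
7 | 500 | 2100 | 8 | NA | 12.1 | $47,873
8 | 362 | 2300 | NA | 3230 | 12.1 | NA
9 | 340 | NA | 8 | 912 | 6.6 | $40,661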
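For readers who want to follow along, the dataset can be entered directly in base R and the blanks counted per column. The placement of each blank is inferred from the slide layout and from the imputed cells on the MICE Results slide, so treat it as an assumption; the object names are illustrative.

```r
# Engine dataset from the slide; NA marks a blank cell.
engines <- data.frame(
  ID     = 1:9,
  bHP    = c(290, 330, 330, 515, 675, 675, 500, 362, 340),
  EngSP  = c(2600, 2400, 2200, 1500, 2101, 2101, 2100, 2300, NA),
  CYL    = c(6, 6, 6, 6, 8, 8, 8, NA, 8),
  DryWGT = c(NA, 1296, 1905, 3090, NA, NA, NA, 3230, 912),
  DISP   = c(7.2, 7.2, 8.8, 15.2, 14.8, 14.8, 12.1, 12.1, 6.6),
  UC     = c(40079, 40927, 29563, 63931, 111976, 120661, 47873, NA, 40661)
)

# Number of blanks in each column, and number of fully observed rows.
missing_counts  <- colSums(is.na(engines))
n_complete_rows <- sum(complete.cases(engines))

print(missing_counts)
cat("Fully observed rows:", n_complete_rows, "of", nrow(engines), "\n")
```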
IMPLEMENTING MICE
01 We used the statistical programming platform R and the ‘mice’ package to calculate imputed data
R code:
install.packages('mice')
library(mice)
data <- read.csv("Example.csv")
imputedata <- mice(data, m = 5, method = 'pmm', seed = 23109)
Fixed seed to ensure the analysis is repeatable
The default in mice is m=5. This parameter will need to be included if another number of imputations is desired
02 Conduct linear regression on each of the five imputed datasets
To view each of the imputed datasets, we use the complete() function:
R code:
completedData <- complete(imputedata, 1)
The number one in the complete function indicates that you want to see the first imputed dataset. To see datasets 2 through 5, you will need to write functions to create and view those datasets
03 Pooling Results
Combining the results of these separate analyses is referred to as pooling
The pooled regression equation has coefficients that are the arithmetic means of the coefficients for the five individual regressions
Let $m$ denote the number of imputed datasets, $\beta_i$ denote the ith coefficient, and $\beta_{ij}$ denote the ith coefficient for the jth imputed dataset; then:
$\beta_i = \frac{1}{m} \sum_{j=1}^{m} \beta_{ij}$
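The coefficient averaging can be tried without the 'mice' package by using the five imputed unit costs for engine 8 that appear on the MICE Results slide. The base R sketch below fits UC ~ bHP to each completed dataset and takes the arithmetic mean of the coefficients; the object names are illustrative, and only the point estimates are pooled here.

```r
# bHP and unit cost (UC) from the deck's engine dataset; engine 8 has no UC.
bHP <- c(290, 330, 330, 515, 675, 675, 500, 362, 340)
UC  <- c(40079, 40927, 29563, 63931, 111976, 120661, 47873, NA, 40661)

# The five imputed unit costs for engine 8 reported on the MICE Results slide.
uc8 <- c(47873, 47873, 40079, 40927, 111976)

# One regression per completed dataset; each column of `coefs` holds the
# intercept and slope from one imputed dataset.
coefs <- sapply(uc8, function(v) {
  completed <- data.frame(bHP = bHP, UC = replace(UC, is.na(UC), v))
  coef(lm(UC ~ bHP, data = completed))
})

# Pooled coefficients: the arithmetic mean across the m = 5 regressions.
pooled <- rowMeans(coefs)
print(round(coefs, 2))
print(round(pooled, 2))
```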
IMPLEMENTING MICE
04 Pooling Results - 2
To fit a linear model to a dataset, use the lm() function. Then, pool the m estimates $Q^{(1)}, \ldots, Q^{(m)}$ into one model $\bar{Q}$.
R code:
Fit1 <- with(imputedata, lm(UC ~ bHP))
summary(pool(Fit1))
05 Goodness-of-Fit Statistics
Unlike the coefficients, you cannot simply average the R2 values, standard errors, the F-stats, etc., in order to calculate the goodness-of-fit statistics
R code:
library(miceadds)  # provides mi.anova()
pool.r.squared(Fit1, adjusted = FALSE)
poolF <- mi.anova(mi.res = imputedata, formula = "UC~bHP")
06 Compare Results
Compare the results from the imputed dataset to the original dataset with missing values removed
ANALYZING RESULTS
Creating plots to determine reasonableness of imputations
Scatterplot Analysis
There is a linear relationship
between UC and bHP. The pattern
of the relationship seems plausible
for the imputed values (pink) as
compared to the observed values
(blue)
Density Plot Analysis
Density plots provide a visual into
the shapes of each imputation. The
plot is useful to determine outlier
imputations and works for variables
with two or more missing values
MICE Results
ID | bHP | EngSP | CYL | DryWGT | DISP | UC
1 | 290 | 2600 | 6 | 3090; 1296; 1905; 1905; 912 | 7.2 | $40,079
2 | 330 | 2400 | 6 | 1296 | 7.2 | $40,927
3 | 330 | 2200 | 6 | 1905 | 8.8 | $29,563
4 | 515 | 1500 | 6 | 3090 | 15.2 | $63,931
5 | 675 | 2101 | 8 | 912; 3230; 1296; 3090; 1905 | 14.8 | $111,976
6 | 675 | 2101 | 8 | 3090; 1905; 3090; 912; 912 | 14.8 | $120,661
7 | 500 | 2100 | 8 | 912; 3090; 1296; 3090; 912 | 12.1 | $47,873
8 | 362 | 2300 | 8; 8; 8; 6 | 3230 | 12.1 | $47,873; $47,873; $40,079; $40,927; $111,976
9 | 340 | 2400; 2400; 2300; 2300; 2400 | 8 | 912 | 6.6 | $40,661
FIT RESULTS
Comparing results from the original dataset to the imputed (pooled) dataset
Linear Model
The model is a solid one with a statistically significant p-value less than alpha = 0.05 and an R2 equal to 87.5%. One data point was removed due to missing a unit cost value
MICE Imputed Model
Though the R2 statistic is lower than for the original dataset, imputation gained some degrees of freedom and still produced a statistically significant model. The model does not gain a full degree of freedom since the imputations are pooled
EXPECTATION
MAXIMIZATION
Expectation
Maximization
Imputing by optimizing
Maximum Likelihood
The maximum likelihood method is used to impute missing values.
This method uses available data to impute a value and then checks
to determine the reasonableness of the guess
Covariance
The covariation among variables is used to infer probable values for
the missing data
Two-Step Process
The method follows a two-step process to fill in missing data
EM TWO-STEP PROCESS
How EM fills data gaps
Iterative Process
EM is an iterative process used to fill data gaps
STEP #01: First Pass at Filling Gaps
The algorithm begins by filling the gaps with the conditional mean of the missing values.
STEP #02
The maximum likelihood estimates of the mean vector and covariance matrix are calculated. The covariance matrix is then used to derive regression equations for the next iteration and the cycle continues until the difference between the covariance matrices in subsequent runs falls below the convergence criteria
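The two steps can be sketched by hand in base R for the simplest case: two variables, with unit cost (UC) blank for one engine and bHP fully observed. This is an illustrative implementation under a bivariate normal assumption, not the code inside the 'norm' package; the starting values, tolerance, and object names are choices made for the example. If the sketch is right, the estimated mean of UC should land close to the $59,771 shown for engine 8 in the EM table later in the deck.

```r
bHP  <- c(290, 330, 330, 515, 675, 675, 500, 362, 340)                  # x, complete
UC   <- c(40079, 40927, 29563, 63931, 111976, 120661, 47873, NA, 40661) # y, one blank
miss <- is.na(UC)

# Starting values taken from the observed data.
mu    <- c(mean(UC, na.rm = TRUE), mean(bHP))
sigma <- cov(cbind(UC, bHP), use = "complete.obs")
converged <- FALSE

for (iter in 1:500) {
  # Regression of UC on bHP implied by the current mean and covariance.
  slope     <- sigma[1, 2] / sigma[2, 2]
  intercept <- mu[1] - slope * mu[2]
  resid_var <- sigma[1, 1] - slope * sigma[1, 2]

  # Step 1: expected UC and UC^2 for rows where UC is blank.
  y  <- ifelse(miss, intercept + slope * bHP, UC)
  y2 <- y^2 + ifelse(miss, resid_var, 0)

  # Step 2: maximum likelihood mean vector and covariance matrix.
  mu_new    <- c(mean(y), mean(bHP))
  s_yy      <- mean(y2) - mu_new[1]^2
  s_xy      <- mean(y * bHP) - mu_new[1] * mu_new[2]
  s_xx      <- mean(bHP^2) - mu_new[2]^2
  sigma_new <- matrix(c(s_yy, s_xy, s_xy, s_xx), nrow = 2)

  # Stop when the covariance matrix no longer changes between runs.
  converged <- max(abs(sigma_new - sigma) / abs(sigma)) < 1e-10
  mu    <- mu_new
  sigma <- sigma_new
  if (converged) break
}

em_slope     <- sigma[1, 2] / sigma[2, 2]
em_intercept <- mu[1] - em_slope * mu[2]
cat("Iterations:", iter, " mean UC:", round(mu[1]), " slope:", round(em_slope, 2), "\n")
```

With only one variable incomplete, the converged slope should match an ordinary regression on the eight observed rows, which makes a convenient check on the loop.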
IMPLEMENTING EM
01 Show missingness patterns
The function prelim.norm is used on a matrix of the x (bHP) and y (cost) variables to sort rows according to the missingness patterns
Fixed seed to ensure the analysis is repeatable
R code:
library(norm)  # provides prelim.norm(), em.norm(), getparam.norm()
a <- prelim.norm(cbind(y, x))
02 Performing maximum likelihood estimation using EM algorithm
R code:
b <- em.norm(a)
c1 <- getparam.norm(a, b)
em.norm produces a parameter vector, which getparam.norm then turns into a list of parameters
03 Pooling Results
The average of the imputations is calculated for the variable with missing values
R code:
c1$mu[1]
The coefficients of the model are then estimated
b.est <- c(c1$mu[1] - (c1$sigma[1,2] / c1$sigma[2,2]) * c1$mu[2], c1$sigma[1,2] / c1$sigma[2,2])
The model can then be used to calculate the missing values for the dataset
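The two elements of `b.est` are the intercept and slope of the regression of y on x implied by the estimated mean vector and covariance matrix. Writing $\mu_1, \mu_2$ for `c1$mu[1]`, `c1$mu[2]` and $\sigma_{12}, \sigma_{22}$ for `c1$sigma[1,2]`, `c1$sigma[2,2]` (notation introduced here for illustration, not taken from the slides), and assuming y is the first column and x the second:

```latex
\beta_1 = \frac{\sigma_{12}}{\sigma_{22}}, \qquad
\beta_0 = \mu_1 - \beta_1 \mu_2, \qquad
E[\, y \mid x \,] = \beta_0 + \beta_1 x
```

A missing y is then filled by evaluating $\beta_0 + \beta_1 x$ at that row's observed x.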
EM
ID bHP UC
1 290 $40,079
2 330 $40,927
3 330 $29,563
4 515 $63,931
5 675 $111,976
6 675 $120,661
7 500 $47,873
8 362 $59,771
9 340 $40,661
FIT RESULTS - 2
Comparing results from the original dataset to the EM imputed dataset
Linear Model
The model is a solid one with a statistically significant p-value less than alpha = 0.05 and an R2 equal to 87.5%. One data point was removed due to missing a unit cost value
EM Imputed Model
Compared to the results produced from removing the data points with missing values, this is a better performing model. A degree of freedom was gained and the R2 metric increased while the model retained statistical significance
EXPECTATION MAXIMIZATION
Why choose EM?
ADVANTAGES
EM preserves the relationship with other variables, unlike mean imputation
DISADVANTAGES
EM can sometimes underestimate standard error
COMPARING METHODS
MICE VERSUS EM
MICE and EM are based on similar assumptions and in practice they often produce similar results. The Bayesian estimation in MICE is asymptotically equivalent to the maximum likelihood estimates in EM, so for large data sets the two methods should provide similar results
For small data sets, it is wise to run both and compare the results, as small differences in the methods could have an outsized impact when the number of data points is limited
There are multiple methods which can be used to impute data. Two of the strongest techniques, MICE and EM, should be considered first as they preserve relationships between independent and dependent variables and estimate error more accurately.
The MICE method for imputation has an edge over EM since MICE calculates multiple imputations for the missing values instead of one single estimate.
Q&A
THE FUTURE. DELIVERED.
Galorath provides solutions that help organizational leaders make complex business decisions
with confidence. Our predictive analytics products and services give complete insight into the
implications of significant technical or financial decisions, allowing organizations to execute a
plan with assurance and reach their goals with absolute certainty.
Learn more or schedule a demo
(310) 906-6320 • sales@galorath.com
Kimberly Roye: kroye@galorath.com
Christian Smart, PhD, CCEA: csmart@galorath.com
Presenters
Kimberly Roye, Senior Data Scientist, kroye@galorath.com
Christian Smart, Chief Scientist, csmart@galorath.com
Dustin Hilton, Senior Cost Analyst, dhilton@galorath.com