@@ -9,19 +9,19 @@ Imputation of missing values
For various reasons, many real world datasets contain missing values, often
encoded as blanks, NaNs or other placeholders. Such datasets, however, are
incompatible with scikit-learn estimators, which assume that all values in an
- array are numerical, and that all have and hold meaning. A basic strategy to use
- incomplete datasets is to discard entire rows and/or columns containing missing
- values. However, this comes at the price of losing data which may be valuable
- (even though incomplete). A better strategy is to impute the missing values,
- i.e., to infer them from the known part of the data. See the :ref:`glossary`
- entry on imputation.
+ array are numerical, and that all have and hold meaning. A basic strategy to
+ use incomplete datasets is to discard entire rows and/or columns containing
+ missing values. However, this comes at the price of losing data which may be
+ valuable (even though incomplete). A better strategy is to impute the missing
+ values, i.e., to infer them from the known part of the data. See the
+ :ref:`glossary` entry on imputation.

Univariate vs. Multivariate Imputation
======================================

- One type of imputation algorithm is univariate, which imputes values in the i-th
- feature dimension using only non-missing values in that feature dimension
+ One type of imputation algorithm is univariate, which imputes values in the
+ i-th feature dimension using only non-missing values in that feature dimension
(e.g. :class:`impute.SimpleImputer`). By contrast, multivariate imputation
algorithms use the entire set of available feature dimensions to estimate the
missing values (e.g. :class:`impute.IterativeImputer`).
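+
+ For example, a minimal univariate sketch with :class:`impute.SimpleImputer`,
+ filling missing entries with the column mean (the toy data here is only for
+ illustration)::
+
+     >>> import numpy as np
+     >>> from sklearn.impute import SimpleImputer
+     >>> imp = SimpleImputer(missing_values=np.nan, strategy='mean')
+     >>> print(imp.fit_transform([[1, 2], [np.nan, 3], [7, 6]]))
+     [[1. 2.]
+      [4. 3.]
+      [7. 6.]]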
@@ -66,9 +66,9 @@ The :class:`SimpleImputer` class also supports sparse matrices::
    [6. 3.]
    [7. 6.]]

- Note that this format is not meant to be used to implicitly store missing values
- in the matrix because it would densify it at transform time. Missing values encoded
- by 0 must be used with dense input.
+ Note that this format is not meant to be used to implicitly store missing
+ values in the matrix because it would densify it at transform time. Missing
+ values encoded by 0 must be used with dense input.
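+
+ For example, with missing values encoded by 0 the input must be dense; a
+ minimal sketch (toy data for illustration only)::
+
+     >>> import numpy as np
+     >>> from sklearn.impute import SimpleImputer
+     >>> imp = SimpleImputer(missing_values=0, strategy='mean')
+     >>> print(imp.fit_transform(np.array([[1, 2], [0, 3], [7, 6]])))
+     [[1. 2.]
+      [4. 3.]
+      [7. 6.]]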

The :class:`SimpleImputer` class also supports categorical data represented as
string values or pandas categoricals when using the ``'most_frequent'`` or
@@ -110,31 +110,43 @@ round are returned.
    IterativeImputer(imputation_order='ascending', initial_strategy='mean',
               max_value=None, min_value=None, missing_values=nan, n_iter=10,
               n_nearest_features=None, predictor=None, random_state=0,
-              sample_posterior=False, verbose=False)
+              sample_posterior=False, verbose=0)
    >>> X_test = [[np.nan, 2], [6, np.nan], [np.nan, 6]]
    >>> # the model learns that the second feature is double the first
    >>> print(np.round(imp.transform(X_test)))
    [[ 1.  2.]
     [ 6. 12.]
     [ 3.  6.]]

- Both :class:`SimpleImputer` and :class:`IterativeImputer` can be used in a Pipeline
- as a way to build a composite estimator that supports imputation.
+ Both :class:`SimpleImputer` and :class:`IterativeImputer` can be used in a
+ Pipeline as a way to build a composite estimator that supports imputation.
See :ref:`sphx_glr_auto_examples_impute_plot_missing_values.py`.
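+
+ A minimal sketch of such a pipeline (the toy data and the choice of
+ :class:`linear_model.LinearRegression` are only illustrative)::
+
+     >>> import numpy as np
+     >>> from sklearn.pipeline import make_pipeline
+     >>> from sklearn.impute import SimpleImputer
+     >>> from sklearn.linear_model import LinearRegression
+     >>> pipe = make_pipeline(SimpleImputer(strategy='mean'), LinearRegression())
+     >>> X, y = [[1, 2], [np.nan, 4], [3, 6]], [3, 5, 7]
+     >>> print(np.round(pipe.fit(X, y).predict([[np.nan, 4]])))
+     [5.]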

+ Flexibility of IterativeImputer
+ -------------------------------
+
+ There are many well-established imputation packages in the R data science
+ ecosystem: Amelia, mi, mice, missForest, etc. missForest is popular, and turns
+ out to be a particular instance of different sequential imputation algorithms
+ that can all be implemented with :class:`IterativeImputer` by passing in
+ different regressors to be used for predicting missing feature values. In the
+ case of missForest, this regressor is a Random Forest.
+ See :ref:`sphx_glr_auto_examples_plot_iterative_imputer_variants_comparison.py`.
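+
+ A minimal sketch of such a missForest-style variant, using the ``predictor``
+ parameter shown in the example above (the forest size here is an arbitrary
+ choice, and the toy data is illustrative)::
+
+     >>> import numpy as np
+     >>> from sklearn.ensemble import RandomForestRegressor
+     >>> from sklearn.impute import IterativeImputer
+     >>> imp = IterativeImputer(predictor=RandomForestRegressor(n_estimators=10),
+     ...                        random_state=0)
+     >>> X = [[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]]
+     >>> imp.fit_transform(X).shape
+     (5, 2)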
+

.. _multiple_imputation:

Multiple vs. Single Imputation
- ==============================
+ ------------------------------

- In the statistics community, it is common practice to perform multiple imputations,
- generating, for example, ``m`` separate imputations for a single feature matrix.
- Each of these ``m`` imputations is then put through the subsequent analysis pipeline
- (e.g. feature engineering, clustering, regression, classification). The ``m`` final
- analysis results (e.g. held-out validation errors) allow the data scientist
- to obtain understanding of how analytic results may differ as a consequence
- of the inherent uncertainty caused by the missing values. The above practice
- is called multiple imputation.
+ In the statistics community, it is common practice to perform multiple
+ imputations, generating, for example, ``m`` separate imputations for a single
+ feature matrix. Each of these ``m`` imputations is then put through the
+ subsequent analysis pipeline (e.g. feature engineering, clustering, regression,
+ classification). The ``m`` final analysis results (e.g. held-out validation
+ errors) allow the data scientist to understand how analytic results may differ
+ as a consequence of the inherent uncertainty caused by the missing values. The
+ above practice is called multiple imputation.

Our implementation of :class:`IterativeImputer` was inspired by the R MICE
package (Multivariate Imputation by Chained Equations) [1]_, but differs from
@@ -144,13 +156,13 @@ it repeatedly to the same dataset with different random seeds when
``sample_posterior=True``. See [2]_, chapter 4 for more discussion on multiple
vs. single imputations.
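+
+ A minimal sketch of that repeated application, generating ``m`` imputations of
+ the same toy matrix (illustrative data only; assumes the default predictor,
+ which supports ``return_std``)::
+
+     >>> import numpy as np
+     >>> from sklearn.impute import IterativeImputer
+     >>> X = [[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]]
+     >>> m = 5
+     >>> imputations = [
+     ...     IterativeImputer(sample_posterior=True,
+     ...                      random_state=i).fit_transform(X)
+     ...     for i in range(m)]
+     >>> len(imputations)
+     5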

- It is still an open problem as to how useful single vs. multiple imputation is in
- the context of prediction and classification when the user is not interested in
- measuring uncertainty due to missing values.
+ It is still an open problem how useful single vs. multiple imputation is in
+ the context of prediction and classification when the user is not interested
+ in measuring uncertainty due to missing values.

- Note that a call to the ``transform`` method of :class:`IterativeImputer` is not
- allowed to change the number of samples. Therefore multiple imputations cannot be
- achieved by a single call to ``transform``.
+ Note that a call to the ``transform`` method of :class:`IterativeImputer` is
+ not allowed to change the number of samples. Therefore multiple imputations
+ cannot be achieved by a single call to ``transform``.
References
==========