@@ -9,28 +9,12 @@ Imputation of missing values
For various reasons, many real world datasets contain missing values, often
encoded as blanks, NaNs or other placeholders. Such datasets however are
incompatible with scikit-learn estimators which assume that all values in an
- array are numerical, and that all have and hold meaning. A basic strategy to
- use incomplete datasets is to discard entire rows and/or columns containing
- missing values. However, this comes at the price of losing data which may be
- valuable (even though incomplete). A better strategy is to impute the missing
- values, i.e., to infer them from the known part of the data. See the
- :ref:`glossary` entry on imputation.
-
-
- Univariate vs. Multivariate Imputation
- ======================================
-
- One type of imputation algorithm is univariate, which imputes values in the
- i-th feature dimension using only non-missing values in that feature dimension
- (e.g. :class:`impute.SimpleImputer`). By contrast, multivariate imputation
- algorithms use the entire set of available feature dimensions to estimate the
- missing values (e.g. :class:`impute.IterativeImputer`).
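
To make the contrast concrete, here is a minimal sketch (reusing the toy data
from the :class:`IterativeImputer` example below) of what a univariate mean
imputer does: each column is filled from that column alone, so the evident
rule that the second feature is double the first is ignored::

    >>> import numpy as np
    >>> from sklearn.impute import SimpleImputer
    >>> SimpleImputer(strategy='mean').fit_transform(
    ...     [[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]])
    array([[1.  , 2.  ],
           [3.  , 6.  ],
           [4.  , 8.  ],
           [3.75, 3.  ],
           [7.  , 4.75]])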
-
-
- .. _single_imputer:
-
- Univariate feature imputation
- =============================
+ array are numerical, and that all have and hold meaning. A basic strategy to use
+ incomplete datasets is to discard entire rows and/or columns containing missing
+ values. However, this comes at the price of losing data which may be valuable
+ (even though incomplete). A better strategy is to impute the missing values,
+ i.e., to infer them from the known part of the data. See the :ref:`glossary`
+ entry on imputation.
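
As a quick illustration of this trade-off, here is a minimal sketch on made-up
data: discarding removes an entire (partially informative) row, while
imputation preserves it::

    >>> import numpy as np
    >>> from sklearn.impute import SimpleImputer
    >>> X = np.array([[1, 2], [np.nan, 4], [7, 6]])
    >>> X[~np.isnan(X).any(axis=1)]                      # discard strategy
    array([[1., 2.],
           [7., 6.]])
    >>> SimpleImputer(strategy='mean').fit_transform(X)  # impute strategy
    array([[1., 2.],
           [4., 4.],
           [7., 6.]])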

The :class:`SimpleImputer` class provides basic strategies for imputing missing
values. Missing values can be imputed with a provided constant value, or using
@@ -66,9 +50,9 @@ The :class:`SimpleImputer` class also supports sparse matrices::
     [6. 3.]
     [7. 6.]]

- Note that this format is not meant to be used to implicitly store missing
- values in the matrix because it would densify it at transform time. Missing
- values encoded by 0 must be used with dense input.
+ Note that this format is not meant to be used to implicitly store missing values
+ in the matrix because it would densify it at transform time. Missing values encoded
+ by 0 must be used with dense input.
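
For 0-encoded missing values, a dense array therefore works as expected (a
small illustrative sketch)::

    >>> import numpy as np
    >>> from sklearn.impute import SimpleImputer
    >>> X = np.array([[1, 0], [3, 4], [5, 6]])
    >>> SimpleImputer(missing_values=0, strategy='mean').fit_transform(X)
    array([[1., 5.],
           [3., 4.],
           [5., 6.]])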

The :class:`SimpleImputer` class also supports categorical data represented as
string values or pandas categoricals when using the ``'most_frequent'`` or
@@ -87,92 +71,9 @@ string values or pandas categoricals when using the ``'most_frequent'`` or
     ['a' 'y']
     ['b' 'y']]
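
The code producing output like the above is elided from this hunk; a
self-contained sketch of such a call (assuming a pandas categorical frame)
might look like::

    >>> import numpy as np
    >>> import pandas as pd
    >>> from sklearn.impute import SimpleImputer
    >>> df = pd.DataFrame([['a', 'x'], [np.nan, 'y'],
    ...                    ['a', np.nan], ['b', 'y']], dtype='category')
    >>> imp = SimpleImputer(strategy='most_frequent')
    >>> print(imp.fit_transform(df))
    [['a' 'x']
     ['a' 'y']
     ['a' 'y']
     ['b' 'y']]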

- .. _iterative_imputer:
-
-
- Multivariate feature imputation
- ===============================
-
- A more sophisticated approach is to use the :class:`IterativeImputer` class,
- which models each feature with missing values as a function of other features,
- and uses that estimate for imputation. It does so in an iterated round-robin
- fashion: at each step, a feature column is designated as output ``y`` and the
- other feature columns are treated as inputs ``X``. A regressor is fit on ``(X,
- y)`` for known ``y``. Then, the regressor is used to predict the missing values
- of ``y``. This is done for each feature in an iterative fashion, and then is
- repeated for ``max_iter`` imputation rounds. The results of the final
- imputation round are returned.
-
-     >>> import numpy as np
-     >>> from sklearn.impute import IterativeImputer
-     >>> imp = IterativeImputer(max_iter=10, random_state=0)
-     >>> imp.fit([[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]])  # doctest: +NORMALIZE_WHITESPACE
-     IterativeImputer(estimator=None, imputation_order='ascending',
-                      initial_strategy='mean', max_iter=10, max_value=None,
-                      min_value=None, missing_values=nan, n_nearest_features=None,
-                      random_state=0, sample_posterior=False, tol=0.001, verbose=0)
-     >>> X_test = [[np.nan, 2], [6, np.nan], [np.nan, 6]]
-     >>> # the model learns that the second feature is double the first
-     >>> print(np.round(imp.transform(X_test)))
-     [[ 1.  2.]
-      [ 6. 12.]
-      [ 3.  6.]]
-
- Both :class:`SimpleImputer` and :class:`IterativeImputer` can be used in a
- Pipeline as a way to build a composite estimator that supports imputation.
- See :ref:`sphx_glr_auto_examples_impute_plot_missing_values.py`.
-
- Flexibility of IterativeImputer
- -------------------------------
-
- There are many well-established imputation packages in the R data science
- ecosystem: Amelia, mi, mice, missForest, etc. missForest is popular, and turns
- out to be a particular instance of different sequential imputation algorithms
- that can all be implemented with :class:`IterativeImputer` by passing in
- different regressors to be used for predicting missing feature values. In the
- case of missForest, this regressor is a Random Forest.
- See :ref:`sphx_glr_auto_examples_plot_iterative_imputer_variants_comparison.py`.
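
For instance, a missForest-like imputer can be sketched by plugging a random
forest into :class:`IterativeImputer` (the estimator choice and its settings
below are illustrative, not a recommendation)::

    >>> import numpy as np
    >>> from sklearn.ensemble import RandomForestRegressor
    >>> from sklearn.impute import IterativeImputer
    >>> imp = IterativeImputer(estimator=RandomForestRegressor(n_estimators=10,
    ...                                                        random_state=0),
    ...                        random_state=0)
    >>> X_filled = imp.fit_transform([[1, 2], [3, 6], [4, 8],
    ...                               [np.nan, 3], [7, np.nan]])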
-
-
- .. _multiple_imputation:
-
- Multiple vs. Single Imputation
- ------------------------------
-
- In the statistics community, it is common practice to perform multiple
- imputations, generating, for example, ``m`` separate imputations for a single
- feature matrix. Each of these ``m`` imputations is then put through the
- subsequent analysis pipeline (e.g. feature engineering, clustering, regression,
- classification). The ``m`` final analysis results (e.g. held-out validation
- errors) allow the data scientist to obtain understanding of how analytic
- results may differ as a consequence of the inherent uncertainty caused by the
- missing values. The above practice is called multiple imputation.
-
- Our implementation of :class:`IterativeImputer` was inspired by the R MICE
- package (Multivariate Imputation by Chained Equations) [1]_, but differs from
- it by returning a single imputation instead of multiple imputations. However,
- :class:`IterativeImputer` can also be used for multiple imputations by applying
- it repeatedly to the same dataset with different random seeds when
- ``sample_posterior=True``. See [2]_, chapter 4 for more discussion on multiple
- vs. single imputations.
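
A minimal sketch of that repeated-application pattern (the number of
imputations, ``m = 5`` here, is arbitrary)::

    >>> import numpy as np
    >>> from sklearn.impute import IterativeImputer
    >>> X = [[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]]
    >>> imputations = [
    ...     IterativeImputer(sample_posterior=True,
    ...                      random_state=seed).fit_transform(X)
    ...     for seed in range(5)]  # five plausible completions of X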
-
- It is still an open problem as to how useful single vs. multiple imputation is
- in the context of prediction and classification when the user is not
- interested in measuring uncertainty due to missing values.
-
- Note that a call to the ``transform`` method of :class:`IterativeImputer` is
- not allowed to change the number of samples. Therefore multiple imputations
- cannot be achieved by a single call to ``transform``.
-
- References
- ==========
-
- .. [1] Stef van Buuren, Karin Groothuis-Oudshoorn (2011). "mice: Multivariate
-    Imputation by Chained Equations in R". Journal of Statistical Software 45:
-    1-67.

- .. [2] Roderick J A Little and Donald B Rubin (1986). "Statistical Analysis
-    with Missing Data". John Wiley & Sons, Inc., New York, NY, USA.
+ :class:`SimpleImputer` can be used in a Pipeline as a way to build a composite
+ estimator that supports imputation. See :ref:`sphx_glr_auto_examples_plot_missing_values.py`.
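
A minimal sketch of such a composite estimator (the classifier choice is
illustrative)::

    >>> import numpy as np
    >>> from sklearn.pipeline import make_pipeline
    >>> from sklearn.impute import SimpleImputer
    >>> from sklearn.linear_model import LogisticRegression
    >>> X = [[1, 2], [np.nan, 3], [7, 6]]
    >>> y = [0, 1, 0]
    >>> clf = make_pipeline(SimpleImputer(strategy='mean'), LogisticRegression())
    >>> clf = clf.fit(X, y)                # NaNs are imputed before the classifier fits
    >>> pred = clf.predict([[np.nan, 4]])  # and again at prediction time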

.. _missing_indicator: