Revert "FEA Add IterativeImputer (#11977)" · xhluca/scikit-learn@c2f5106 · GitHub

Commit c2f5106

Xing committed
Revert "FEA Add IterativeImputer (scikit-learn#11977)"
This reverts commit 518b183.
1 parent cf39de0 commit c2f5106

File tree

9 files changed: +45 -1346 lines changed


doc/modules/classes.rst

Lines changed: 1 addition & 2 deletions
@@ -655,9 +655,8 @@ Kernels:
    :template: class.rst
 
    impute.SimpleImputer
-   impute.IterativeImputer
    impute.MissingIndicator
-
+
 .. _kernel_approximation_ref:
 
 :mod:`sklearn.kernel_approximation` Kernel Approximation

doc/modules/impute.rst

Lines changed: 11 additions & 110 deletions
@@ -9,28 +9,12 @@ Imputation of missing values
 For various reasons, many real world datasets contain missing values, often
 encoded as blanks, NaNs or other placeholders. Such datasets however are
 incompatible with scikit-learn estimators which assume that all values in an
-array are numerical, and that all have and hold meaning. A basic strategy to
-use incomplete datasets is to discard entire rows and/or columns containing
-missing values. However, this comes at the price of losing data which may be
-valuable (even though incomplete). A better strategy is to impute the missing
-values, i.e., to infer them from the known part of the data. See the
-:ref:`glossary` entry on imputation.
-
-
-Univariate vs. Multivariate Imputation
-======================================
-
-One type of imputation algorithm is univariate, which imputes values in the
-i-th feature dimension using only non-missing values in that feature dimension
-(e.g. :class:`impute.SimpleImputer`). By contrast, multivariate imputation
-algorithms use the entire set of available feature dimensions to estimate the
-missing values (e.g. :class:`impute.IterativeImputer`).
-
-
-.. _single_imputer:
-
-Univariate feature imputation
-=============================
+array are numerical, and that all have and hold meaning. A basic strategy to use
+incomplete datasets is to discard entire rows and/or columns containing missing
+values. However, this comes at the price of losing data which may be valuable
+(even though incomplete). A better strategy is to impute the missing values,
+i.e., to infer them from the known part of the data. See the :ref:`glossary`
+entry on imputation.
 
 The :class:`SimpleImputer` class provides basic strategies for imputing missing
 values. Missing values can be imputed with a provided constant value, or using

@@ -66,9 +50,9 @@ The :class:`SimpleImputer` class also supports sparse matrices::
     [6. 3.]
     [7. 6.]]
 
-Note that this format is not meant to be used to implicitly store missing
-values in the matrix because it would densify it at transform time. Missing
-values encoded by 0 must be used with dense input.
+Note that this format is not meant to be used to implicitly store missing values
+in the matrix because it would densify it at transform time. Missing values encoded
+by 0 must be used with dense input.
 
 The :class:`SimpleImputer` class also supports categorical data represented as
 string values or pandas categoricals when using the ``'most_frequent'`` or

@@ -87,92 +71,9 @@ string values or pandas categoricals when using the ``'most_frequent'`` or
     ['a' 'y']
     ['b' 'y']]
 
-.. _iterative_imputer:
-
-
-Multivariate feature imputation
-===============================
-
-A more sophisticated approach is to use the :class:`IterativeImputer` class,
-which models each feature with missing values as a function of other features,
-and uses that estimate for imputation. It does so in an iterated round-robin
-fashion: at each step, a feature column is designated as output ``y`` and the
-other feature columns are treated as inputs ``X``. A regressor is fit on ``(X,
-y)`` for known ``y``. Then, the regressor is used to predict the missing values
-of ``y``. This is done for each feature in an iterative fashion, and then is
-repeated for ``max_iter`` imputation rounds. The results of the final
-imputation round are returned.
-
-    >>> import numpy as np
-    >>> from sklearn.impute import IterativeImputer
-    >>> imp = IterativeImputer(max_iter=10, random_state=0)
-    >>> imp.fit([[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]]) # doctest: +NORMALIZE_WHITESPACE
-    IterativeImputer(estimator=None, imputation_order='ascending',
-                     initial_strategy='mean', max_iter=10, max_value=None,
-                     min_value=None, missing_values=nan, n_nearest_features=None,
-                     random_state=0, sample_posterior=False, tol=0.001, verbose=0)
-    >>> X_test = [[np.nan, 2], [6, np.nan], [np.nan, 6]]
-    >>> # the model learns that the second feature is double the first
-    >>> print(np.round(imp.transform(X_test)))
-    [[ 1.  2.]
-     [ 6. 12.]
-     [ 3.  6.]]
-
-Both :class:`SimpleImputer` and :class:`IterativeImputer` can be used in a
-Pipeline as a way to build a composite estimator that supports imputation.
-See :ref:`sphx_glr_auto_examples_impute_plot_missing_values.py`.
-
-Flexibility of IterativeImputer
--------------------------------
-
-There are many well-established imputation packages in the R data science
-ecosystem: Amelia, mi, mice, missForest, etc. missForest is popular, and turns
-out to be a particular instance of different sequential imputation algorithms
-that can all be implemented with :class:`IterativeImputer` by passing in
-different regressors to be used for predicting missing feature values. In the
-case of missForest, this regressor is a Random Forest.
-See :ref:`sphx_glr_auto_examples_plot_iterative_imputer_variants_comparison.py`.
-
-
-.. _multiple_imputation:
-
-Multiple vs. Single Imputation
-------------------------------
-
-In the statistics community, it is common practice to perform multiple
-imputations, generating, for example, ``m`` separate imputations for a single
-feature matrix. Each of these ``m`` imputations is then put through the
-subsequent analysis pipeline (e.g. feature engineering, clustering, regression,
-classification). The ``m`` final analysis results (e.g. held-out validation
-errors) allow the data scientist to obtain understanding of how analytic
-results may differ as a consequence of the inherent uncertainty caused by the
-missing values. The above practice is called multiple imputation.
-
-Our implementation of :class:`IterativeImputer` was inspired by the R MICE
-package (Multivariate Imputation by Chained Equations) [1]_, but differs from
-it by returning a single imputation instead of multiple imputations. However,
-:class:`IterativeImputer` can also be used for multiple imputations by applying
-it repeatedly to the same dataset with different random seeds when
-``sample_posterior=True``. See [2]_, chapter 4 for more discussion on multiple
-vs. single imputations.
-
-It is still an open problem as to how useful single vs. multiple imputation is
-in the context of prediction and classification when the user is not
-interested in measuring uncertainty due to missing values.
-
-Note that a call to the ``transform`` method of :class:`IterativeImputer` is
-not allowed to change the number of samples. Therefore multiple imputations
-cannot be achieved by a single call to ``transform``.
-
-References
-==========
-
-.. [1] Stef van Buuren, Karin Groothuis-Oudshoorn (2011). "mice: Multivariate
-    Imputation by Chained Equations in R". Journal of Statistical Software 45:
-    1-67.
 
-.. [2] Roderick J A Little and Donald B Rubin (1986). "Statistical Analysis
-    with Missing Data". John Wiley & Sons, Inc., New York, NY, USA.
+:class:`SimpleImputer` can be used in a Pipeline as a way to build a composite
+estimator that supports imputation. See :ref:`sphx_glr_auto_examples_plot_missing_values.py`.
 
 .. _missing_indicator:
 
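The removed "Multivariate feature imputation" prose above fully specifies the round-robin algorithm, so its behavior can be reproduced without the class itself. Below is a minimal sketch built only from pieces that survive this revert (SimpleImputer plus any scikit-learn regressor); the function name round_robin_impute and the choice of BayesianRidge are illustrative assumptions, not scikit-learn API.

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import BayesianRidge

def round_robin_impute(X, estimator=None, max_iter=10):
    """Sketch of the removed docs' scheme: each feature with missing values
    is regressed on the remaining features, in turn, for max_iter rounds."""
    estimator = estimator if estimator is not None else BayesianRidge()
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Initial fill, mirroring initial_strategy='mean' from the removed repr.
    X_filled = SimpleImputer(strategy="mean").fit_transform(X)
    for _ in range(max_iter):
        for j in range(X.shape[1]):
            if not missing[:, j].any():
                continue
            # Designate column j as output y; the other columns are inputs X.
            inputs = np.delete(X_filled, j, axis=1)
            known = ~missing[:, j]
            estimator.fit(inputs[known], X_filled[known, j])
            X_filled[missing[:, j], j] = estimator.predict(inputs[missing[:, j]])
    return X_filled

# Toy matrix from the removed doctest: the second feature is double the first.
X = [[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]]
print(np.round(round_robin_impute(X)))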

doc/whats_new/v0.21.rst

Lines changed: 0 additions & 9 deletions
@@ -125,15 +125,6 @@ Support for Python 3.4 and below has been officially dropped.
 - |API| Deprecated :mod:`externals.six` since we have dropped support for
   Python 2.7. :issue:`12916` by :user:`Hanmin Qin <qinhanmin2014>`.
 
-:mod:`sklearn.impute`
-.....................
-
-- |MajorFeature| Added :class:`impute.IterativeImputer`, which is a strategy
-  for imputing missing values by modeling each feature with missing values as a
-  function of other features in a round-robin fashion. :issue:`8478` and
-  :issue:`12177` by :user:`Sergey Feldman <sergeyf>` :user:`Ben Lawson
-  <benlawson>`.
-
 :mod:`sklearn.linear_model`
 ...........................
 
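The changelog entry removed here documents the same feature the rest of this commit strips out. Its multiple-imputation use, per the removed impute.rst prose, is to run the imputer m times with sample_posterior=True and different random seeds; both parameters appear in the removed doctest's repr. A hedged sketch of that recipe, assuming a checkout predating this revert (m = 5 is an arbitrary illustrative choice):

import numpy as np
# Assumes a checkout predating this revert, where IterativeImputer still exists.
from sklearn.impute import IterativeImputer

X = np.array([[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]])

m = 5  # number of imputations; illustrative choice
imputations = [
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X)
    for seed in range(m)
]
# Each completed matrix is analyzed separately; the spread across the m
# results reflects the uncertainty caused by the missing values.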

examples/impute/README.txt

Lines changed: 0 additions & 6 deletions
This file was deleted.

examples/impute/plot_iterative_imputer_variants_comparison.py

Lines changed: 0 additions & 126 deletions
This file was deleted.
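This deleted example compared IterativeImputer variants obtained by swapping the underlying regressor, which is the point of the removed "Flexibility of IterativeImputer" section: passing a random forest recovers a missForest-style imputer. A rough sketch of that pattern, again assuming a checkout predating this revert (the estimator parameter appears in the removed doctest's repr; n_estimators=50 is an arbitrary choice):

import numpy as np
# Assumes a checkout predating this revert, where IterativeImputer still exists.
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

X = np.array([[1, 2], [3, 6], [4, 8], [np.nan, 3], [7, np.nan]])

# A random forest as the per-feature regressor yields a missForest-style
# imputer, per the removed prose.
imp = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=10,
    random_state=0,
)
print(imp.fit_transform(X))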
