[MRG] IterativeImputer extended example by sergeyf · Pull Request #12100 · scikit-learn/scikit-learn · GitHub

[MRG] IterativeImputer extended example #12100


Merged
Changes from all commits · 42 commits
cdc7183
first commit
sergeyf Sep 17, 2018
493a27a
undoing space
sergeyf Sep 17, 2018
75ad2da
newline
sergeyf Sep 17, 2018
1465f77
fixing bug in plot_missing_values and adding a bit more performance
sergeyf Sep 17, 2018
ea84910
slight clarification
sergeyf Sep 17, 2018
753f60d
another example
sergeyf Sep 18, 2018
2ec0401
fixing tests
sergeyf Sep 18, 2018
39fb7f7
modularizing plot_missing_values
sergeyf Sep 26, 2018
de2c307
fixing cut off plot
sergeyf Sep 26, 2018
c135653
updating narrative docs
sergeyf Sep 26, 2018
a8768c4
default for verbose should be 0
sergeyf Sep 26, 2018
4368c31
fixing doc error
sergeyf Sep 27, 2018
c5595b8
addressing some comments
sergeyf Oct 16, 2018
e0e90be
making example more interesting
sergeyf Oct 17, 2018
75f94af
Reverting v0.21 to master
sergeyf Jan 16, 2019
0dff9a0
Merge branch 'iterativeimputer' into iterativeimputer_missforest
sergeyf Jan 16, 2019
755dde5
Responding to reviewer comments.
sergeyf Jan 16, 2019
7a6a7b4
Merge branch 'iterativeimputer' into iterativeimputer_missforest
sergeyf Jan 16, 2019
22dc1ef
Updating v0.20
sergeyf Jan 16, 2019
3b0fae9
Merge branch 'iterativeimputer' into iterativeimputer_missforest
sergeyf Jan 17, 2019
3f059df
Revert changes to v0.21.rst
sergeyf Jan 17, 2019
2442272
Merge branch 'iterativeimputer_missforest' of https://github.com/serg…
sergeyf Jan 17, 2019
2ed49af
fixing doctest impute.rst
sergeyf Jan 17, 2019
dae2e3a
fixing doctest impute.rst v2
sergeyf Jan 17, 2019
cc9ae8b
fixing doctest impute.rst v3
sergeyf Jan 17, 2019
bd0be11
One more try with expected/actual issue in impute.rst
sergeyf Jan 17, 2019
d2c3357
updating to 26
sergeyf Jan 17, 2019
af3a056
Merge branch 'iterativeimputer_missforest' of https://github.com/serg…
sergeyf Jan 17, 2019
1862379
Merge branch 'iterativeimputer' into iterativeimputer_missforest
sergeyf Jan 24, 2019
b2f2b54
Updating RidgeCV to BayesianRidge to be more in line with the default
sergeyf Jan 24, 2019
b305620
updating to glemaitre's plot
sergeyf Jan 24, 2019
c8dccb4
line lengths for impute.rst
sergeyf Jan 24, 2019
a102cec
Update examples/impute/plot_iterative_imputer_variants_comparison.py
glemaitre Jan 24, 2019
57b83d6
fixing y-axis labels
sergeyf Jan 24, 2019
5e9b45f
Merge branch 'iterativeimputer_missforest' of https://github.com/serg…
sergeyf Jan 24, 2019
a86b535
Update examples/impute/plot_iterative_imputer_variants_comparison.py
glemaitre Jan 24, 2019
c0b7439
Update examples/impute/plot_iterative_imputer_variants_comparison.py
glemaitre Jan 24, 2019
332a01d
minor change
sergeyf Jan 24, 2019
3ab4686
Merge branch 'iterativeimputer_missforest' of https://github.com/serg…
sergeyf Jan 24, 2019
46e21dc
addressing reviewer comments
sergeyf Jan 24, 2019
303d026
Merge branch 'iterativeimputer' into iterativeimputer_missforest
sergeyf Jan 24, 2019
ddab109
reordering predictors
sergeyf Jan 25, 2019
70 changes: 41 additions & 29 deletions doc/modules/impute.rst
@@ -9,19 +9,19 @@ Imputation of missing values
For various reasons, many real world datasets contain missing values, often
encoded as blanks, NaNs or other placeholders. Such datasets however are
incompatible with scikit-learn estimators which assume that all values in an
array are numerical, and that all have and hold meaning. A basic strategy to use
incomplete datasets is to discard entire rows and/or columns containing missing
values. However, this comes at the price of losing data which may be valuable
(even though incomplete). A better strategy is to impute the missing values,
i.e., to infer them from the known part of the data. See the :ref:`glossary`
entry on imputation.
array are numerical, and that all have and hold meaning. A basic strategy to
use incomplete datasets is to discard entire rows and/or columns containing
missing values. However, this comes at the price of losing data which may be
valuable (even though incomplete). A better strategy is to impute the missing
values, i.e., to infer them from the known part of the data. See the
:ref:`glossary` entry on imputation.


Univariate vs. Multivariate Imputation
======================================

One type of imputation algorithm is univariate, which imputes values in the i-th
feature dimension using only non-missing values in that feature dimension
One type of imputation algorithm is univariate, which imputes values in the
i-th feature dimension using only non-missing values in that feature dimension
(e.g. :class:`impute.SimpleImputer`). By contrast, multivariate imputation
algorithms use the entire set of available feature dimensions to estimate the
missing values (e.g. :class:`impute.IterativeImputer`).
@@ -66,9 +66,9 @@ The :class:`SimpleImputer` class also supports sparse matrices::
[6. 3.]
[7. 6.]]

Note that this format is not meant to be used to implicitly store missing values
in the matrix because it would densify it at transform time. Missing values encoded
by 0 must be used with dense input.
Note that this format is not meant to be used to implicitly store missing
values in the matrix because it would densify it at transform time. Missing
values encoded by 0 must be used with dense input.

The :class:`SimpleImputer` class also supports categorical data represented as
string values or pandas categoricals when using the ``'most_frequent'`` or
@@ -110,31 +110,43 @@ round are returned.
IterativeImputer(imputation_order='ascending', initial_strategy='mean',
max_value=None, min_value=None, missing_values=nan, n_iter=10,
n_nearest_features=None, predictor=None, random_state=0,
sample_posterior=False, verbose=False)
sample_posterior=False, verbose=0)
>>> X_test = [[np.nan, 2], [6, np.nan], [np.nan, 6]]
>>> # the model learns that the second feature is double the first
>>> print(np.round(imp.transform(X_test)))
[[ 1. 2.]
[ 6. 12.]
[ 3. 6.]]

Both :class:`SimpleImputer` and :class:`IterativeImputer` can be used in a Pipeline
as a way to build a composite estimator that supports imputation.
Both :class:`SimpleImputer` and :class:`IterativeImputer` can be used in a
Pipeline as a way to build a composite estimator that supports imputation.
See :ref:`sphx_glr_auto_examples_impute_plot_missing_values.py`.
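As a rough illustration (a minimal sketch, not part of this PR's diff; it assumes
the direct ``sklearn.impute.IterativeImputer`` import path used in this branch),
such a composite estimator can be built with :func:`~sklearn.pipeline.make_pipeline`::

    import numpy as np
    from sklearn.impute import IterativeImputer
    from sklearn.linear_model import BayesianRidge
    from sklearn.pipeline import make_pipeline

    X = np.array([[1., 2.], [3., 6.], [4., 8.], [np.nan, 3.], [7., np.nan]])
    y = np.array([3., 9., 12., 4., 14.])

    # the imputer fills in missing entries before the regressor sees the data
    model = make_pipeline(IterativeImputer(random_state=0), BayesianRidge())
    model.fit(X, y)
    print(model.predict([[np.nan, 5.]]))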

Flexibility of IterativeImputer
-------------------------------

There are many well-established imputation packages in the R data science
ecosystem: Amelia, mi, mice, missForest, etc. missForest is popular, and turns
out to be a particular instance of different sequential imputation algorithms
that can all be implemented with :class:`IterativeImputer` by passing in
different regressors to be used for predicting missing feature values. In the
case of missForest, this regressor is a Random Forest.
See :ref:`sphx_glr_auto_examples_plot_iterative_imputer_variants_comparison.py`.
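For example (a minimal sketch, not part of the diff; the ``predictor`` parameter
name follows the API introduced in this PR), a missForest-like imputer can be
obtained by passing a forest of randomized trees as the regressor::

    import numpy as np
    from sklearn.impute import IterativeImputer
    from sklearn.ensemble import ExtraTreesRegressor

    # ExtraTreesRegressor stands in for missForest's random forest; it is
    # typically faster while behaving similarly
    imp = IterativeImputer(
        predictor=ExtraTreesRegressor(n_estimators=10, random_state=0),
        random_state=0)
    X = np.array([[1., 2.], [3., 6.], [4., 8.], [np.nan, 3.], [7., np.nan]])
    print(np.round(imp.fit_transform(X), 1))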


.. _multiple_imputation:

Multiple vs. Single Imputation
==============================
------------------------------

In the statistics community, it is common practice to perform multiple imputations,
generating, for example, ``m`` separate imputations for a single feature matrix.
Each of these ``m`` imputations is then put through the subsequent analysis pipeline
(e.g. feature engineering, clustering, regression, classification). The ``m`` final
analysis results (e.g. held-out validation errors) allow the data scientist
to obtain understanding of how analytic results may differ as a consequence
of the inherent uncertainty caused by the missing values. The above practice
is called multiple imputation.
In the statistics community, it is common practice to perform multiple
imputations, generating, for example, ``m`` separate imputations for a single
feature matrix. Each of these ``m`` imputations is then put through the
subsequent analysis pipeline (e.g. feature engineering, clustering, regression,
classification). The ``m`` final analysis results (e.g. held-out validation
errors) allow the data scientist to obtain understanding of how analytic
results may differ as a consequence of the inherent uncertainty caused by the
missing values. The above practice is called multiple imputation.

Our implementation of :class:`IterativeImputer` was inspired by the R MICE
package (Multivariate Imputation by Chained Equations) [1]_, but differs from
@@ -144,13 +156,13 @@ it repeatedly to the same dataset with different random seeds when
``sample_posterior=True``. See [2]_, chapter 4 for more discussion on multiple
vs. single imputations.
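A rough sketch of this usage (not part of the diff; it relies on the default
``BayesianRidge`` predictor, which can return a predictive standard deviation
and therefore supports ``sample_posterior=True``)::

    import numpy as np
    from sklearn.impute import IterativeImputer

    X = np.array([[1., 2.], [3., 6.], [4., 8.], [np.nan, 3.], [7., np.nan]])

    # m complete datasets, one per random seed; downstream analyses can be run
    # on each and the spread of their results reflects imputation uncertainty
    m = 5
    imputations = [
        IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X)
        for seed in range(m)
    ]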

It is still an open problem as to how useful single vs. multiple imputation is in
the context of prediction and classification when the user is not interested in
measuring uncertainty due to missing values.
It is still an open problem as to how useful single vs. multiple imputation is
in the context of prediction and classification when the user is not
interested in measuring uncertainty due to missing values.

Note that a call to the ``transform`` method of :class:`IterativeImputer` is not
allowed to change the number of samples. Therefore multiple imputations cannot be
achieved by a single call to ``transform``.
Note that a call to the ``transform`` method of :class:`IterativeImputer` is
not allowed to change the number of samples. Therefore multiple imputations
cannot be achieved by a single call to ``transform``.

References
==========
126 changes: 126 additions & 0 deletions examples/impute/plot_iterative_imputer_variants_comparison.py
@@ -0,0 +1,126 @@
"""
=========================================================
Imputing missing values with variants of IterativeImputer
=========================================================

The :class:`sklearn.impute.IterativeImputer` class is very flexible - it can be
used with a variety of predictors to do round-robin regression, treating every
variable as an output in turn.

In this example we compare some predictors for the purpose of missing feature
imputation with :class:`sklearn.impute.IterativeImputer`::

:class:`~sklearn.linear_model.BayesianRidge`: regularized linear regression
:class:`~sklearn.tree.DecisionTreeRegressor`: non-linear regression
:class:`~sklearn.ensemble.ExtraTreesRegressor`: similar to missForest in R
:class:`~sklearn.neighbors.KNeighborsRegressor`: comparable to other KNN
imputation approaches

Of particular interest is the ability of
:class:`sklearn.impute.IterativeImputer` to mimic the behavior of missForest, a
popular imputation package for R. In this example, we have chosen to use
:class:`sklearn.ensemble.ExtraTreesRegressor` instead of
:class:`sklearn.ensemble.RandomForestRegressor` (as in missForest) due to its
increased speed.

Note that :class:`sklearn.neighbors.KNeighborsRegressor` is different from KNN
imputation, which learns from samples with missing values by using a distance
metric that accounts for missing values, rather than imputing them.

The goal is to compare different predictors to see which one is best for the
:class:`sklearn.impute.IterativeImputer` when using a
:class:`sklearn.linear_model.BayesianRidge` estimator on the California housing
dataset with a single value randomly removed from each row.

For this particular pattern of missing values we see that
:class:`sklearn.ensemble.ExtraTreesRegressor` and
:class:`sklearn.linear_model.BayesianRidge` give the best results.
"""
print(__doc__)

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.datasets import fetch_california_housing
from sklearn.impute import SimpleImputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

N_SPLITS = 5

rng = np.random.RandomState(0)

X_full, y_full = fetch_california_housing(return_X_y=True)
n_samples, n_features = X_full.shape

# Estimate the score on the entire dataset, with no missing values
br_estimator = BayesianRidge()
score_full_data = pd.DataFrame(
cross_val_score(
br_estimator, X_full, y_full, scoring='neg_mean_squared_error',
cv=N_SPLITS
),
columns=['Full Data']
)

# Add a single missing value to each row
X_missing = X_full.copy()
y_missing = y_full
missing_samples = np.arange(n_samples)
missing_features = rng.choice(n_features, n_samples, replace=True)
X_missing[missing_samples, missing_features] = np.nan

# Estimate the score after imputation (mean and median strategies)
score_simple_imputer = pd.DataFrame()
for strategy in ('mean', 'median'):
estimator = make_pipeline(
SimpleImputer(missing_values=np.nan, strategy=strategy),
br_estimator
)
score_simple_imputer[strategy] = cross_val_score(
estimator, X_missing, y_missing, scoring='neg_mean_squared_error',
cv=N_SPLITS
)

# Estimate the score after iterative imputation of the missing values
# with different predictors
predictors = [
BayesianRidge(),
Member:
Is there a reason for this ordering? It doesn't match the one above

Contributor Author:
I'm confused. It's the same order as in the docstring?

    :class:`sklearn.linear_model.BayesianRidge`: regularized linear regression
    :class:`sklearn.tree.DecisionTreeRegressor`: non-linear regression
    :class:`sklearn.neighbors.KNeighborsRegressor`: comparable to other KNN imputation approaches
    :class:`sklearn.ensemble.ExtraTreesRegressor`: similar to missForest in R

Member:
Oh, maybe I misread.

DecisionTreeRegressor(max_features='sqrt', random_state=0),
ExtraTreesRegressor(n_estimators=10, n_jobs=-1, random_state=0),
KNeighborsRegressor(n_neighbors=15)
]
score_iterative_imputer = pd.DataFrame()
for predictor in predictors:
estimator = make_pipeline(
IterativeImputer(random_state=0, predictor=predictor),
br_estimator
)
score_iterative_imputer[predictor.__class__.__name__] = \
cross_val_score(
estimator, X_missing, y_missing, scoring='neg_mean_squared_error',
cv=N_SPLITS
)

scores = pd.concat(
[score_full_data, score_simple_imputer, score_iterative_imputer],
keys=['Original', 'SimpleImputer', 'IterativeImputer'], axis=1
)

# plot california housing results
fig, ax = plt.subplots(figsize=(13, 6))
means = -scores.mean()
errors = scores.std()
means.plot.barh(xerr=errors, ax=ax)
ax.set_title('California Housing Regression with Different Imputation Methods')
ax.set_xlabel('MSE (smaller is better)')
ax.set_yticks(np.arange(means.shape[0]))
ax.set_yticklabels([" w/ ".join(label) for label in means.index.get_values()])
plt.tight_layout(pad=1)
plt.show()
58 changes: 32 additions & 26 deletions examples/impute/plot_missing_values.py
@@ -12,12 +12,13 @@
round-robin linear regression, treating every variable as an output in
turn. The version implemented assumes Gaussian (output) variables. If your
features are obviously non-Normal, consider transforming them to look more
Normal so as to improve performance.
Normal so as to potentially improve performance.

In addition to using an imputing method, we can also keep an indication of which
values were missing using :class:`sklearn.impute.MissingIndicator`, which might
carry some information.
"""
print(__doc__)

import numpy as np
import matplotlib.pyplot as plt
@@ -31,16 +32,29 @@

rng = np.random.RandomState(0)

N_SPLITS = 5
REGRESSOR = RandomForestRegressor(random_state=0, n_estimators=100)


def get_scores_for_imputer(imputer, X_missing, y_missing):
estimator = make_pipeline(
make_union(imputer, MissingIndicator(missing_values=0)),
REGRESSOR)
impute_scores = cross_val_score(estimator, X_missing, y_missing,
scoring='neg_mean_squared_error',
cv=N_SPLITS)
return impute_scores


def get_results(dataset):
X_full, y_full = dataset.data, dataset.target
n_samples = X_full.shape[0]
n_features = X_full.shape[1]

# Estimate the score on the entire dataset, with no missing values
estimator = RandomForestRegressor(random_state=0, n_estimators=100)
full_scores = cross_val_score(estimator, X_full, y_full,
scoring='neg_mean_squared_error', cv=5)
full_scores = cross_val_score(REGRESSOR, X_full, y_full,
scoring='neg_mean_squared_error',
cv=N_SPLITS)

# Add missing values in 75% of the lines
missing_rate = 0.75
@@ -51,35 +65,27 @@ def get_results(dataset):
dtype=np.bool)))
rng.shuffle(missing_samples)
missing_features = rng.randint(0, n_features, n_missing_samples)

# Estimate the score after replacing missing values by 0
X_missing = X_full.copy()
X_missing[np.where(missing_samples)[0], missing_features] = 0
y_missing = y_full.copy()
estimator = RandomForestRegressor(random_state=0, n_estimators=100)
zero_impute_scores = cross_val_score(estimator, X_missing, y_missing,
scoring='neg_mean_squared_error',
cv=5)

# Estimate the score after replacing missing values by 0
imputer = SimpleImputer(missing_values=0,
strategy='constant',
fill_value=0)
zero_impute_scores = get_scores_for_imputer(imputer, X_missing, y_missing)

# Estimate the score after imputation (mean strategy) of the missing values
X_missing = X_full.copy()
X_missing[np.where(missing_samples)[0], missing_features] = 0
y_missing = y_full.copy()
estimator = make_pipeline(
make_union(SimpleImputer(missing_values=0, strategy="mean"),
MissingIndicator(missing_values=0)),
RandomForestRegressor(random_state=0, n_estimators=100))
mean_impute_scores = cross_val_score(estimator, X_missing, y_missing,
scoring='neg_mean_squared_error',
cv=5)
imputer = SimpleImputer(missing_values=0, strategy="mean")
mean_impute_scores = get_scores_for_imputer(imputer, X_missing, y_missing)

# Estimate the score after iterative imputation of the missing values
estimator = make_pipeline(
make_union(IterativeImputer(missing_values=0, random_state=0),
MissingIndicator(missing_values=0)),
RandomForestRegressor(random_state=0, n_estimators=100))
iterative_impute_scores = cross_val_score(estimator, X_missing, y_missing,
scoring='neg_mean_squared_error')
imputer = IterativeImputer(missing_values=0,
random_state=0,
n_nearest_features=5)
iterative_impute_scores = get_scores_for_imputer(imputer,
X_missing,
y_missing)

return ((full_scores.mean(), full_scores.std()),
(zero_impute_scores.mean(), zero_impute_scores.std()),
2 changes: 1 addition & 1 deletion sklearn/impute.py
@@ -556,7 +556,7 @@ def __init__(self,
initial_strategy="mean",
min_value=None,
max_value=None,
verbose=False,
verbose=0,
random_state=None):

self.missing_values = missing_values