EXA Add IterativeImputer extended example (#12100) · scikit-learn/scikit-learn@dc304a4 · GitHub

Commit dc304a4

sergeyf authored and jnothman committed
EXA Add IterativeImputer extended example (#12100)
1 parent cf4670c commit dc304a4

4 files changed: +200 -56 lines changed

doc/modules/impute.rst

Lines changed: 41 additions & 29 deletions
@@ -9,19 +9,19 @@ Imputation of missing values
 For various reasons, many real world datasets contain missing values, often
 encoded as blanks, NaNs or other placeholders. Such datasets however are
 incompatible with scikit-learn estimators which assume that all values in an
-array are numerical, and that all have and hold meaning. A basic strategy to use
-incomplete datasets is to discard entire rows and/or columns containing missing
-values. However, this comes at the price of losing data which may be valuable
-(even though incomplete). A better strategy is to impute the missing values,
-i.e., to infer them from the known part of the data. See the :ref:`glossary`
-entry on imputation.
+array are numerical, and that all have and hold meaning. A basic strategy to
+use incomplete datasets is to discard entire rows and/or columns containing
+missing values. However, this comes at the price of losing data which may be
+valuable (even though incomplete). A better strategy is to impute the missing
+values, i.e., to infer them from the known part of the data. See the
+:ref:`glossary` entry on imputation.
 
 
 Univariate vs. Multivariate Imputation
 ======================================
 
-One type of imputation algorithm is univariate, which imputes values in the i-th
-feature dimension using only non-missing values in that feature dimension
+One type of imputation algorithm is univariate, which imputes values in the
+i-th feature dimension using only non-missing values in that feature dimension
 (e.g. :class:`impute.SimpleImputer`). By contrast, multivariate imputation
 algorithms use the entire set of available feature dimensions to estimate the
 missing values (e.g. :class:`impute.IterativeImputer`).
@@ -66,9 +66,9 @@ The :class:`SimpleImputer` class also supports sparse matrices::
 [6. 3.]
 [7. 6.]]
 
-Note that this format is not meant to be used to implicitly store missing values
-in the matrix because it would densify it at transform time. Missing values encoded
-by 0 must be used with dense input.
+Note that this format is not meant to be used to implicitly store missing
+values in the matrix because it would densify it at transform time. Missing
+values encoded by 0 must be used with dense input.
 
 The :class:`SimpleImputer` class also supports categorical data represented as
 string values or pandas categoricals when using the ``'most_frequent'`` or
@@ -110,31 +110,43 @@ round are returned.
 IterativeImputer(imputation_order='ascending', initial_strategy='mean',
            max_value=None, min_value=None, missing_values=nan, n_iter=10,
            n_nearest_features=None, predictor=None, random_state=0,
-           sample_posterior=False, verbose=False)
+           sample_posterior=False, verbose=0)
 >>> X_test = [[np.nan, 2], [6, np.nan], [np.nan, 6]]
 >>> # the model learns that the second feature is double the first
 >>> print(np.round(imp.transform(X_test)))
 [[ 1. 2.]
  [ 6. 12.]
  [ 3. 6.]]
 
-Both :class:`SimpleImputer` and :class:`IterativeImputer` can be used in a Pipeline
-as a way to build a composite estimator that supports imputation.
+Both :class:`SimpleImputer` and :class:`IterativeImputer` can be used in a
+Pipeline as a way to build a composite estimator that supports imputation.
 See :ref:`sphx_glr_auto_examples_impute_plot_missing_values.py`.
 
+Flexibility of IterativeImputer
+-------------------------------
+
+There are many well-established imputation packages in the R data science
+ecosystem: Amelia, mi, mice, missForest, etc. missForest is popular, and turns
+out to be a particular instance of different sequential imputation algorithms
+that can all be implemented with :class:`IterativeImputer` by passing in
+different regressors to be used for predicting missing feature values. In the
+case of missForest, this regressor is a Random Forest.
+See :ref:`sphx_glr_auto_examples_plot_iterative_imputer_variants_comparison.py`.
+
+
 .. _multiple_imputation:
 
 Multiple vs. Single Imputation
-==============================
+------------------------------
 
-In the statistics community, it is common practice to perform multiple imputations,
-generating, for example, ``m`` separate imputations for a single feature matrix.
-Each of these ``m`` imputations is then put through the subsequent analysis pipeline
-(e.g. feature engineering, clustering, regression, classification). The ``m`` final
-analysis results (e.g. held-out validation errors) allow the data scientist
-to obtain understanding of how analytic results may differ as a consequence
-of the inherent uncertainty caused by the missing values. The above practice
-is called multiple imputation.
+In the statistics community, it is common practice to perform multiple
+imputations, generating, for example, ``m`` separate imputations for a single
+feature matrix. Each of these ``m`` imputations is then put through the
+subsequent analysis pipeline (e.g. feature engineering, clustering, regression,
+classification). The ``m`` final analysis results (e.g. held-out validation
+errors) allow the data scientist to obtain understanding of how analytic
+results may differ as a consequence of the inherent uncertainty caused by the
+missing values. The above practice is called multiple imputation.
 
 Our implementation of :class:`IterativeImputer` was inspired by the R MICE
 package (Multivariate Imputation by Chained Equations) [1]_, but differs from
@@ -144,13 +156,13 @@ it repeatedly to the same dataset with different random seeds when
 ``sample_posterior=True``. See [2]_, chapter 4 for more discussion on multiple
 vs. single imputations.
 
-It is still an open problem as to how useful single vs. multiple imputation is in
-the context of prediction and classification when the user is not interested in
-measuring uncertainty due to missing values.
+It is still an open problem as to how useful single vs. multiple imputation is
+in the context of prediction and classification when the user is not
+interested in measuring uncertainty due to missing values.
 
-Note that a call to the ``transform`` method of :class:`IterativeImputer` is not
-allowed to change the number of samples. Therefore multiple imputations cannot be
-achieved by a single call to ``transform``.
+Note that a call to the ``transform`` method of :class:`IterativeImputer` is
+not allowed to change the number of samples. Therefore multiple imputations
+cannot be achieved by a single call to ``transform``.
 
 References
 ==========
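
The documentation changes above describe two usage patterns without spelling them out in code: passing a custom regressor to get missForest-like behaviour, and producing multiple imputations by fitting the imputer more than once. A minimal sketch of both follows, assuming the ``predictor`` and ``sample_posterior`` parameters of :class:`IterativeImputer` exactly as they appear in this diff; the snippet is illustrative only and is not part of the commit:

    import numpy as np
    from sklearn.ensemble import ExtraTreesRegressor
    from sklearn.impute import IterativeImputer

    X = np.array([[1.0, 2.0], [3.0, 6.0], [4.0, 8.0],
                  [np.nan, 3.0], [7.0, np.nan]])

    # missForest-like behaviour: use a forest of trees as the round-robin
    # regressor instead of the default BayesianRidge predictor.
    forest_imputer = IterativeImputer(
        predictor=ExtraTreesRegressor(n_estimators=10, random_state=0),
        random_state=0)
    print(forest_imputer.fit_transform(X))

    # Multiple imputation: a single call to transform() returns one completed
    # dataset, so m imputations are obtained by fitting m imputers with
    # sample_posterior=True and different random seeds.
    m = 5
    imputations = [
        IterativeImputer(sample_posterior=True,
                         random_state=seed).fit_transform(X)
        for seed in range(m)
    ]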
examples/plot_iterative_imputer_variants_comparison.py

Lines changed: 126 additions & 0 deletions

@@ -0,0 +1,126 @@
+"""
+=========================================================
+Imputing missing values with variants of IterativeImputer
+=========================================================
+
+The :class:`sklearn.impute.IterativeImputer` class is very flexible - it can be
+used with a variety of predictors to do round-robin regression, treating every
+variable as an output in turn.
+
+In this example we compare some predictors for the purpose of missing feature
+imputation with :class:`sklearn.impute.IterativeImputer`::
+
+  :class:`~sklearn.linear_model.BayesianRidge`: regularized linear regression
+  :class:`~sklearn.tree.DecisionTreeRegressor`: non-linear regression
+  :class:`~sklearn.ensemble.ExtraTreesRegressor`: similar to missForest in R
+  :class:`~sklearn.neighbors.KNeighborsRegressor`: comparable to other KNN
+  imputation approaches
+
+Of particular interest is the ability of
+:class:`sklearn.impute.IterativeImputer` to mimic the behavior of missForest, a
+popular imputation package for R. In this example, we have chosen to use
+:class:`sklearn.ensemble.ExtraTreesRegressor` instead of
+:class:`sklearn.ensemble.RandomForestRegressor` (as in missForest) due to its
+increased speed.
+
+Note that :class:`sklearn.neighbors.KNeighborsRegressor` is different from KNN
+imputation, which learns from samples with missing values by using a distance
+metric that accounts for missing values, rather than imputing them.
+
+The goal is to compare different predictors to see which one is best for the
+:class:`sklearn.impute.IterativeImputer` when using a
+:class:`sklearn.linear_model.BayesianRidge` estimator on the California housing
+dataset with a single value randomly removed from each row.
+
+For this particular pattern of missing values we see that
+:class:`sklearn.ensemble.ExtraTreesRegressor` and
+:class:`sklearn.linear_model.BayesianRidge` give the best results.
+"""
+print(__doc__)
+
+import numpy as np
+import matplotlib.pyplot as plt
+import pandas as pd
+
+from sklearn.datasets import fetch_california_housing
+from sklearn.impute import SimpleImputer
+from sklearn.impute import IterativeImputer
+from sklearn.linear_model import BayesianRidge
+from sklearn.tree import DecisionTreeRegressor
+from sklearn.ensemble import ExtraTreesRegressor
+from sklearn.neighbors import KNeighborsRegressor
+from sklearn.pipeline import make_pipeline
+from sklearn.model_selection import cross_val_score
+
+N_SPLITS = 5
+
+rng = np.random.RandomState(0)
+
+X_full, y_full = fetch_california_housing(return_X_y=True)
+n_samples, n_features = X_full.shape
+
+# Estimate the score on the entire dataset, with no missing values
+br_estimator = BayesianRidge()
+score_full_data = pd.DataFrame(
+    cross_val_score(
+        br_estimator, X_full, y_full, scoring='neg_mean_squared_error',
+        cv=N_SPLITS
+    ),
+    columns=['Full Data']
+)
+
+# Add a single missing value to each row
+X_missing = X_full.copy()
+y_missing = y_full
+missing_samples = np.arange(n_samples)
+missing_features = rng.choice(n_features, n_samples, replace=True)
+X_missing[missing_samples, missing_features] = np.nan
+
+# Estimate the score after imputation (mean and median strategies)
+score_simple_imputer = pd.DataFrame()
+for strategy in ('mean', 'median'):
+    estimator = make_pipeline(
+        SimpleImputer(missing_values=np.nan, strategy=strategy),
+        br_estimator
+    )
+    score_simple_imputer[strategy] = cross_val_score(
+        estimator, X_missing, y_missing, scoring='neg_mean_squared_error',
+        cv=N_SPLITS
+    )
+
+# Estimate the score after iterative imputation of the missing values
+# with different predictors
+predictors = [
+    BayesianRidge(),
+    DecisionTreeRegressor(max_features='sqrt', random_state=0),
+    ExtraTreesRegressor(n_estimators=10, n_jobs=-1, random_state=0),
+    KNeighborsRegressor(n_neighbors=15)
+]
+score_iterative_imputer = pd.DataFrame()
+for predictor in predictors:
+    estimator = make_pipeline(
+        IterativeImputer(random_state=0, predictor=predictor),
+        br_estimator
+    )
+    score_iterative_imputer[predictor.__class__.__name__] = \
+        cross_val_score(
+            estimator, X_missing, y_missing, scoring='neg_mean_squared_error',
+            cv=N_SPLITS
+        )
+
+scores = pd.concat(
+    [score_full_data, score_simple_imputer, score_iterative_imputer],
+    keys=['Original', 'SimpleImputer', 'IterativeImputer'], axis=1
+)
+
+# plot California housing results
+fig, ax = plt.subplots(figsize=(13, 6))
+means = -scores.mean()
+errors = scores.std()
+means.plot.barh(xerr=errors, ax=ax)
+ax.set_title('California Housing Regression with Different Imputation Methods')
+ax.set_xlabel('MSE (smaller is better)')
+ax.set_yticks(np.arange(means.shape[0]))
+ax.set_yticklabels([" w/ ".join(label) for label in means.index.get_values()])
+plt.tight_layout(pad=1)
+plt.show()
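
As a usage note on the example above (not part of the committed file): the docstring points out that missForest itself uses a Random Forest, so a reader who wants the closer analogue can add one more entry to the ``predictors`` list, at the cost of slower fits. A hypothetical variation:

    from sklearn.ensemble import RandomForestRegressor

    # Hypothetical addition, closer to missForest than ExtraTreesRegressor.
    predictors.append(
        RandomForestRegressor(n_estimators=10, n_jobs=-1, random_state=0))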

examples/impute/plot_missing_values.py

Lines changed: 32 additions & 26 deletions
@@ -12,12 +12,13 @@
 round-robin linear regression, treating every variable as an output in
 turn. The version implemented assumes Gaussian (output) variables. If your
 features are obviously non-Normal, consider transforming them to look more
-Normal so as to improve performance.
+Normal so as to potentially improve performance.
 
 In addition to using an imputing method, we can also keep an indication of the
 missing information using :func:`sklearn.impute.MissingIndicator` which might
 carry some information.
 """
+print(__doc__)
 
 import numpy as np
 import matplotlib.pyplot as plt
@@ -31,16 +32,29 @@
 
 rng = np.random.RandomState(0)
 
+N_SPLITS = 5
+REGRESSOR = RandomForestRegressor(random_state=0, n_estimators=100)
+
+
+def get_scores_for_imputer(imputer, X_missing, y_missing):
+    estimator = make_pipeline(
+        make_union(imputer, MissingIndicator(missing_values=0)),
+        REGRESSOR)
+    impute_scores = cross_val_score(estimator, X_missing, y_missing,
+                                    scoring='neg_mean_squared_error',
+                                    cv=N_SPLITS)
+    return impute_scores
+
 
 def get_results(dataset):
     X_full, y_full = dataset.data, dataset.target
     n_samples = X_full.shape[0]
     n_features = X_full.shape[1]
 
     # Estimate the score on the entire dataset, with no missing values
-    estimator = RandomForestRegressor(random_state=0, n_estimators=100)
-    full_scores = cross_val_score(estimator, X_full, y_full,
-                                  scoring='neg_mean_squared_error', cv=5)
+    full_scores = cross_val_score(REGRESSOR, X_full, y_full,
+                                  scoring='neg_mean_squared_error',
+                                  cv=N_SPLITS)
 
     # Add missing values in 75% of the lines
     missing_rate = 0.75
@@ -51,35 +65,27 @@ def get_results(dataset):
                                      dtype=np.bool)))
     rng.shuffle(missing_samples)
     missing_features = rng.randint(0, n_features, n_missing_samples)
-
-    # Estimate the score after replacing missing values by 0
    X_missing = X_full.copy()
    X_missing[np.where(missing_samples)[0], missing_features] = 0
    y_missing = y_full.copy()
-    estimator = RandomForestRegressor(random_state=0, n_estimators=100)
-    zero_impute_scores = cross_val_score(estimator, X_missing, y_missing,
-                                         scoring='neg_mean_squared_error',
-                                         cv=5)
+
+    # Estimate the score after replacing missing values by 0
+    imputer = SimpleImputer(missing_values=0,
+                            strategy='constant',
+                            fill_value=0)
+    zero_impute_scores = get_scores_for_imputer(imputer, X_missing, y_missing)
 
     # Estimate the score after imputation (mean strategy) of the missing values
-    X_missing = X_full.copy()
-    X_missing[np.where(missing_samples)[0], missing_features] = 0
-    y_missing = y_full.copy()
-    estimator = make_pipeline(
-        make_union(SimpleImputer(missing_values=0, strategy="mean"),
-                   MissingIndicator(missing_values=0)),
-        RandomForestRegressor(random_state=0, n_estimators=100))
-    mean_impute_scores = cross_val_score(estimator, X_missing, y_missing,
-                                         scoring='neg_mean_squared_error',
-                                         cv=5)
+    imputer = SimpleImputer(missing_values=0, strategy="mean")
+    mean_impute_scores = get_scores_for_imputer(imputer, X_missing, y_missing)
 
     # Estimate the score after iterative imputation of the missing values
-    estimator = make_pipeline(
-        make_union(IterativeImputer(missing_values=0, random_state=0),
-                   MissingIndicator(missing_values=0)),
-        RandomForestRegressor(random_state=0, n_estimators=100))
-    iterative_impute_scores = cross_val_score(estimator, X_missing, y_missing,
-                                              scoring='neg_mean_squared_error')
+    imputer = IterativeImputer(missing_values=0,
+                               random_state=0,
+                               n_nearest_features=5)
+    iterative_impute_scores = get_scores_for_imputer(imputer,
+                                                     X_missing,
+                                                     y_missing)
 
     return ((full_scores.mean(), full_scores.std()),
             (zero_impute_scores.mean(), zero_impute_scores.std()),

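A small aside on the refactoring above (not from the commit): ``make_union(imputer, MissingIndicator(missing_values=0))`` stacks the imputed features next to a boolean mask of where values were missing, which is what lets the downstream forest still see the missingness pattern. A minimal sketch, assuming the same zero-encoding of missing values as in the example:

    import numpy as np
    from sklearn.impute import MissingIndicator, SimpleImputer
    from sklearn.pipeline import make_union

    X = np.array([[7., 0., 3.],
                  [4., 5., 0.],
                  [0., 8., 9.]])  # 0 encodes a missing entry here

    union = make_union(SimpleImputer(missing_values=0, strategy='mean'),
                       MissingIndicator(missing_values=0))
    # Three imputed feature columns followed by three indicator columns,
    # one per feature that contained a missing value.
    print(union.fit_transform(X))
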
sklearn/impute.py

Lines changed: 1 addition & 1 deletion
@@ -556,7 +556,7 @@ def __init__(self,
                  initial_strategy="mean",
                  min_value=None,
                  max_value=None,
-                 verbose=False,
+                 verbose=0,
                  random_state=None):
 
         self.missing_values = missing_values

0 commit comments