[MRG] IterativeImputer extended example #12100
Merged: jnothman merged 42 commits into scikit-learn:iterativeimputer from sergeyf:iterativeimputer_missforest on Jan 25, 2019
Commits
cdc7183 first commit (sergeyf)
493a27a undoing space (sergeyf)
75ad2da newline (sergeyf)
1465f77 fixing bug in plot_missing_values and adding a bit more performance (sergeyf)
ea84910 slight clarification (sergeyf)
753f60d another example (sergeyf)
2ec0401 fixing tests (sergeyf)
39fb7f7 modularizing plot_missing_values (sergeyf)
de2c307 fixing cut off plot (sergeyf)
c135653 updating narrative docs (sergeyf)
a8768c4 default for verbose should be 0 (sergeyf)
4368c31 fixing doc error (sergeyf)
c5595b8 addressing some comments (sergeyf)
e0e90be making example more interesting (sergeyf)
75f94af Reverting v0.21 to master (sergeyf)
0dff9a0 Merge branch 'iterativeimputer' into iterativeimputer_missforest (sergeyf)
755dde5 Responding to reviewer comments. (sergeyf)
7a6a7b4 Merge branch 'iterativeimputer' into iterativeimputer_missforest (sergeyf)
22dc1ef Updating v0.20 (sergeyf)
3b0fae9 Merge branch 'iterativeimputer' into iterativeimputer_missforest (sergeyf)
3f059df Revert changes to v0.21.rst (sergeyf)
2442272 Merge branch 'iterativeimputer_missforest' of https://github.com/serg… (sergeyf)
2ed49af fixing doctest impute.rst (sergeyf)
dae2e3a fixing doctest impute.rst v2 (sergeyf)
cc9ae8b fixing doctest impute.rst v3 (sergeyf)
bd0be11 One more try with expected/actual issue in impute.rst (sergeyf)
d2c3357 updating to 26 (sergeyf)
af3a056 Merge branch 'iterativeimputer_missforest' of https://github.com/serg… (sergeyf)
1862379 Merge branch 'iterativeimputer' into iterativeimputer_missforest (sergeyf)
b2f2b54 Updating RidgeCV to BayesianRidge to be more in line with the default (sergeyf)
b305620 updating to glemaitre's plot (sergeyf)
c8dccb4 line lengths for impute.rst (sergeyf)
a102cec Update examples/impute/plot_iterative_imputer_variants_comparison.py (glemaitre)
57b83d6 fixing y-axis labels (sergeyf)
5e9b45f Merge branch 'iterativeimputer_missforest' of https://github.com/serg… (sergeyf)
a86b535 Update examples/impute/plot_iterative_imputer_variants_comparison.py (glemaitre)
c0b7439 Update examples/impute/plot_iterative_imputer_variants_comparison.py (glemaitre)
332a01d minor change (sergeyf)
3ab4686 Merge branch 'iterativeimputer_missforest' of https://github.com/serg… (sergeyf)
46e21dc addressing reviewer comments (sergeyf)
303d026 Merge branch 'iterativeimputer' into iterativeimputer_missforest (sergeyf)
ddab109 reordering predictors (sergeyf)
126 changes: 126 additions & 0 deletions
examples/impute/plot_iterative_imputer_variants_comparison.py
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,126 @@ | ||
""" | ||
========================================================= | ||
Imputing missing values with variants of IterativeImputer | ||
========================================================= | ||
|
||
The :class:`sklearn.impute.IterativeImputer` class is very flexible - it can be | ||
used with a variety of predictors to do round-robin regression, treating every | ||
variable as an output in turn. | ||
|
||
In this example we compare some predictors for the purpose of missing feature | ||
imputation with :class:`sklearn.impute.IterativeImputer`:: | ||
|
||
:class:`~sklearn.linear_model.BayesianRidge`: regularized linear regression | ||
:class:`~sklearn.tree.DecisionTreeRegressor`: non-linear regression | ||
:class:`~sklearn.ensemble.ExtraTreesRegressor`: similar to missForest in R | ||
:class:`~sklearn.neighbors.KNeighborsRegressor`: comparable to other KNN | ||
imputation approaches | ||
|
||
Of particular interest is the ability of | ||
:class:`sklearn.impute.IterativeImputer` to mimic the behavior of missForest, a | ||
popular imputation package for R. In this example, we have chosen to use | ||
:class:`sklearn.ensemble.ExtraTreesRegressor` instead of | ||
:class:`sklearn.ensemble.RandomForestRegressor` (as in missForest) due to its | ||
increased speed. | ||
|
||
Note that :class:`sklearn.neighbors.KNeighborsRegressor` is different from KNN | ||
imputation, which learns from samples with missing values by using a distance | ||
metric that accounts for missing values, rather than imputing them. | ||
|
||
The goal is to compare different predictors to see which one is best for the | ||
:class:`sklearn.impute.IterativeImputer` when using a | ||
:class:`sklearn.linear_model.BayesianRidge` estimator on the California housing | ||
dataset with a single value randomly removed from each row. | ||
|
||
For this particular pattern of missing values we see that | ||
:class:`sklearn.ensemble.ExtraTreesRegressor` and | ||
:class:`sklearn.linear_model.BayesianRidge` give the best results. | ||
""" | ||
print(__doc__) | ||
|
||
import numpy as np | ||
import matplotlib.pyplot as plt | ||
import pandas as pd | ||
|
||
from sklearn.datasets import fetch_california_housing | ||
from sklearn.impute import SimpleImputer | ||
from sklearn.impute import IterativeImputer | ||
from sklearn.linear_model import BayesianRidge | ||
from sklearn.tree import DecisionTreeRegressor | ||
from sklearn.ensemble import ExtraTreesRegressor | ||
from sklearn.neighbors import KNeighborsRegressor | ||
from sklearn.pipeline import make_pipeline | ||
from sklearn.model_selection import cross_val_score | ||
|
||
N_SPLITS = 5 | ||
|
||
rng = np.random.RandomState(0) | ||
|
||
X_full, y_full = fetch_california_housing(return_X_y=True) | ||
n_samples, n_features = X_full.shape | ||
|
||
# Estimate the score on the entire dataset, with no missing values | ||
br_estimator = BayesianRidge() | ||
score_full_data = pd.DataFrame( | ||
cross_val_score( | ||
br_estimator, X_full, y_full, scoring='neg_mean_squared_error', | ||
cv=N_SPLITS | ||
), | ||
columns=['Full Data'] | ||
) | ||
|
||
# Add a single missing value to each row | ||
X_missing = X_full.copy() | ||
y_missing = y_full | ||
missing_samples = np.arange(n_samples) | ||
missing_features = rng.choice(n_features, n_samples, replace=True) | ||
X_missing[missing_samples, missing_features] = np.nan | ||
|
||
# Estimate the score after imputation (mean and median strategies) | ||
score_simple_imputer = pd.DataFrame() | ||
for strategy in ('mean', 'median'): | ||
estimator = make_pipeline( | ||
SimpleImputer(missing_values=np.nan, strategy=strategy), | ||
br_estimator | ||
) | ||
score_simple_imputer[strategy] = cross_val_score( | ||
estimator, X_missing, y_missing, scoring='neg_mean_squared_error', | ||
cv=N_SPLITS | ||
) | ||
|
||
# Estimate the score after iterative imputation of the missing values | ||
# with different predictors | ||
predictors = [ | ||
BayesianRidge(), | ||
DecisionTreeRegressor(max_features='sqrt', random_state=0), | ||
ExtraTreesRegressor(n_estimators=10, n_jobs=-1, random_state=0), | ||
KNeighborsRegressor(n_neighbors=15) | ||
] | ||
score_iterative_imputer = pd.DataFrame() | ||
for predictor in predictors: | ||
estimator = make_pipeline( | ||
IterativeImputer(random_state=0, predictor=predictor), | ||
br_estimator | ||
) | ||
score_iterative_imputer[predictor.__class__.__name__] = \ | ||
cross_val_score( | ||
estimator, X_missing, y_missing, scoring='neg_mean_squared_error', | ||
cv=N_SPLITS | ||
) | ||
|
||
scores = pd.concat( | ||
[score_full_data, score_simple_imputer, score_iterative_imputer], | ||
keys=['Original', 'SimpleImputer', 'IterativeImputer'], axis=1 | ||
) | ||
|
||
# plot california housing results | ||
fig, ax = plt.subplots(figsize=(13, 6)) | ||
means = -scores.mean() | ||
errors = scores.std() | ||
means.plot.barh(xerr=errors, ax=ax) | ||
ax.set_title('California Housing Regression with Different Imputation Methods') | ||
ax.set_xlabel('MSE (smaller is better)') | ||
ax.set_yticks(np.arange(means.shape[0])) | ||
ax.set_yticklabels([" w/ ".join(label) for label in means.index]) | ||
plt.tight_layout(pad=1) | ||
plt.show() |
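The docstring above describes round-robin regression, where every variable is treated as an output in turn. The following is a minimal NumPy-only sketch of that idea, not the `IterativeImputer` implementation: the helper name `round_robin_impute` is hypothetical, it uses ordinary least squares rather than any of the predictors compared in the example, and it runs on small synthetic data instead of the California housing dataset. The missing-value pattern (one value removed per row) mirrors the one used in the example.

```python
import numpy as np


def round_robin_impute(X, n_iter=10):
    """Toy round-robin imputation using ordinary least squares.

    Hypothetical helper for illustration only; it is not how
    IterativeImputer is implemented.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    n_samples, n_features = X.shape

    # Initial fill: replace each missing entry with its column mean.
    X_filled = np.where(missing, np.nanmean(X, axis=0), X)

    for _ in range(n_iter):
        for j in range(n_features):
            rows = missing[:, j]
            if not rows.any() or rows.all():
                continue  # nothing to impute, or nothing to learn from
            others = [k for k in range(n_features) if k != j]
            # Regress feature j on all other features (plus an intercept),
            # fitting only on rows where feature j was observed.
            A = np.column_stack([np.ones(n_samples), X_filled[:, others]])
            coef, *_ = np.linalg.lstsq(
                A[~rows], X_filled[~rows, j], rcond=None
            )
            # Overwrite only the entries that were originally missing.
            X_filled[rows, j] = A[rows] @ coef
    return X_filled


# Synthetic data (an assumption for this sketch): x2 depends on x0 and x1.
rng = np.random.RandomState(0)
n_samples = 300
x0 = rng.normal(size=n_samples)
x1 = rng.normal(size=n_samples)
x2 = 2.0 * x0 + x1 + 0.1 * rng.normal(size=n_samples)
X_true = np.column_stack([x0, x1, x2])

# Remove a single value from each row, mirroring the example's pattern.
X_missing = X_true.copy()
missing_features = rng.choice(X_true.shape[1], n_samples, replace=True)
X_missing[np.arange(n_samples), missing_features] = np.nan

mask = np.isnan(X_missing)
X_mean = np.where(mask, np.nanmean(X_missing, axis=0), X_missing)
X_rr = round_robin_impute(X_missing)

# Compare imputation error on the entries that were removed.
rmse_mean = np.sqrt(np.mean((X_mean[mask] - X_true[mask]) ** 2))
rmse_rr = np.sqrt(np.mean((X_rr[mask] - X_true[mask]) ** 2))
print("mean imputation RMSE:       ", rmse_mean)
print("round-robin imputation RMSE:", rmse_rr)
```

Because the synthetic features are strongly related, the round-robin fill should recover the removed values far better than the column mean; on real data the gap depends on how predictable each feature is from the others, which is what the example's comparison of predictors is probing.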
Is there a reason for this ordering? It doesn't match the one above
I'm confused. It's the same order as in the docstring?
Oh, maybe I misread.