MNT accelerate plot_iterative_imputer_variants_comparison.py by siavrez · Pull Request #21748 · scikit-learn/scikit-learn · GitHub
MNT accelerate plot_iterative_imputer_variants_comparison.py #21748


Merged
merged 20 commits into main from accelerate_examples7 on Feb 23, 2022
Commits
a2dce00
accelerate plot_iterative_imputer_variants_comparison.py added bootst…
siavrez Nov 22, 2021
7893fc9
Added more params to ETregressor to reduce runtime
siavrez Nov 23, 2021
2452df2
removed bootstrap=True, with max_sample param bootstrap is changed to…
siavrez Nov 23, 2021
64eec52
change the folds to 5
siavrez Nov 23, 2021
4415648
changing tree with random forest add tolerance for each model, change…
siavrez Nov 24, 2021
0f70fdb
removed comment
siavrez Nov 25, 2021
f591ec0
added bootstrap=True to randomforest
siavrez Nov 25, 2021
8e36e9b
added comment for tolerance
siavrez Nov 25, 2021
a4855cc
added bootstrap=True to ET
siavrez Nov 25, 2021
4509282
changed tolerance comment place
siavrez Nov 25, 2021
b8947b6
change ET with Ny-Ridge pipeline
siavrez Nov 25, 2021
84e77a2
Changed docstring and added a comment about HistGradientBoosting abil…
siavrez Dec 8, 2021
5bc1ae2
Changed docstring
siavrez Dec 9, 2021
bbf468c
Update examples/impute/plot_iterative_imputer_variants_comparison.py
siavrez Dec 15, 2021
e8817d5
Update examples/impute/plot_iterative_imputer_variants_comparison.py
siavrez Dec 15, 2021
4f0187b
Merge branch 'main' into accelerate_examples7
adrinjalali Feb 8, 2022
0babf50
Update examples/impute/plot_iterative_imputer_variants_comparison.py
jeremiedbb Feb 23, 2022
804b3a0
Update examples/impute/plot_iterative_imputer_variants_comparison.py
jeremiedbb Feb 23, 2022
b30b128
Update examples/impute/plot_iterative_imputer_variants_comparison.py
jeremiedbb Feb 23, 2022
2d6144d
Merge branch 'main' into accelerate_examples7
jeremiedbb Feb 23, 2022
54 changes: 39 additions & 15 deletions examples/impute/plot_iterative_imputer_variants_comparison.py
@@ -13,17 +13,16 @@
 imputation with :class:`~impute.IterativeImputer`:
 
 * :class:`~linear_model.BayesianRidge`: regularized linear regression
-* :class:`~tree.DecisionTreeRegressor`: non-linear regression
-* :class:`~ensemble.ExtraTreesRegressor`: similar to missForest in R
+* :class:`~ensemble.RandomForestRegressor`: forests of randomized trees for regression
+* :func:`~pipeline.make_pipeline` (:class:`~kernel_approximation.Nystroem`,
+  :class:`~linear_model.Ridge`): a pipeline with the expansion of a degree-2
+  polynomial kernel and regularized linear regression
 * :class:`~neighbors.KNeighborsRegressor`: comparable to other KNN
   imputation approaches
 
 Of particular interest is the ability of
 :class:`~impute.IterativeImputer` to mimic the behavior of missForest, a
-popular imputation package for R. In this example, we have chosen to use
-:class:`~ensemble.ExtraTreesRegressor` instead of
-:class:`~ensemble.RandomForestRegressor` (as in missForest) due to its
-increased speed.
+popular imputation package for R.
 
 Note that :class:`~neighbors.KNeighborsRegressor` is different from KNN
 imputation, which learns from samples with missing values by using a distance
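
To make the new kernel-approximation variant concrete, here is a minimal, self-contained sketch (not taken from the diff; the toy data and variable names are illustrative) of the Nystroem + Ridge pipeline plugged into IterativeImputer the same way the example does:

# Sketch only: degree-2 polynomial kernel approximation + ridge regression
# used as the internal estimator of IterativeImputer.
import numpy as np

# This import is required to unlock the still-experimental IterativeImputer.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)
X = rng.rand(300, 4)
X[rng.rand(*X.shape) < 0.1] = np.nan  # knock out roughly 10% of the entries

imputer = IterativeImputer(
    estimator=make_pipeline(
        Nystroem(kernel="polynomial", degree=2, random_state=0), Ridge(alpha=1e3)
    ),
    max_iter=25,
    tol=1e-1,
    random_state=0,
)
X_filled = imputer.fit_transform(X)  # every NaN replaced by a model prediction
assert not np.isnan(X_filled).any()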
@@ -35,8 +34,13 @@
 dataset with a single value randomly removed from each row.
 
 For this particular pattern of missing values we see that
-:class:`~ensemble.ExtraTreesRegressor` and
-:class:`~linear_model.BayesianRidge` give the best results.
+:class:`~linear_model.BayesianRidge` and
+:class:`~ensemble.RandomForestRegressor` give the best results.
+
+It should be noted that some estimators such as
+:class:`~ensemble.HistGradientBoostingRegressor` can natively deal with
+missing features and are often recommended over building pipelines with
+complex and costly missing-values imputation strategies.
 
 """

@@ -49,9 +53,9 @@
 from sklearn.datasets import fetch_california_housing
 from sklearn.impute import SimpleImputer
 from sklearn.impute import IterativeImputer
-from sklearn.linear_model import BayesianRidge
-from sklearn.tree import DecisionTreeRegressor
-from sklearn.ensemble import ExtraTreesRegressor
+from sklearn.linear_model import BayesianRidge, Ridge
+from sklearn.kernel_approximation import Nystroem
+from sklearn.ensemble import RandomForestRegressor
 from sklearn.neighbors import KNeighborsRegressor
 from sklearn.pipeline import make_pipeline
 from sklearn.model_selection import cross_val_score
@@ -97,14 +101,34 @@
 # with different estimators
 estimators = [
     BayesianRidge(),
-    DecisionTreeRegressor(max_features="sqrt", random_state=0),
-    ExtraTreesRegressor(n_estimators=10, random_state=0),
+    RandomForestRegressor(
+        # We tuned the hyperparameters of the RandomForestRegressor to get a good
+        # enough predictive performance for a restricted execution time.
+        n_estimators=4,
+        max_depth=10,
+        bootstrap=True,
+        max_samples=0.5,
+        n_jobs=2,
+        random_state=0,
+    ),
+    make_pipeline(
+        Nystroem(kernel="polynomial", degree=2, random_state=0), Ridge(alpha=1e3)
+    ),
     KNeighborsRegressor(n_neighbors=15),
 ]
 score_iterative_imputer = pd.DataFrame()
-for impute_estimator in estimators:
+# The iterative imputer is sensitive to the tolerance, and this sensitivity
+# depends on the estimator used internally.
+# We tuned the tolerance to keep this example running with limited
+# computational resources while not changing the results too much compared
+# to keeping the stricter default value for the tolerance parameter.
+tolerances = (1e-3, 1e-1, 1e-1, 1e-2)
+for impute_estimator, tol in zip(estimators, tolerances):
     estimator = make_pipeline(
-        IterativeImputer(random_state=0, estimator=impute_estimator), br_estimator
+        IterativeImputer(
+            random_state=0, estimator=impute_estimator, max_iter=25, tol=tol
+        ),
+        br_estimator,
     )
     score_iterative_imputer[impute_estimator.__class__.__name__] = cross_val_score(
         estimator, X_missing, y_missing, scoring="neg_mean_squared_error", cv=N_SPLITS
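
In the loop above, zip pairs each estimator with its own tolerance: 1e-3 for BayesianRidge, 1e-1 for the RandomForestRegressor and the Nystroem pipeline, and 1e-2 for KNeighborsRegressor. Since the speedup hinges on these loosened tolerances, here is a standalone sketch (made-up data; the default BayesianRidge is used internally) of how tol trades imputation rounds for runtime; the fitted attribute n_iter_ reports how many rounds ran before the stopping criterion triggered:

# Sketch only: a looser tol stops the imputation rounds earlier, which is
# where most of the example's runtime saving comes from.
import time

import numpy as np

from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.RandomState(0)
X = rng.rand(500, 8)
X[rng.rand(*X.shape) < 0.2] = np.nan

for tol in (1e-3, 1e-1):
    imputer = IterativeImputer(max_iter=25, tol=tol, random_state=0)
    start = time.perf_counter()
    imputer.fit_transform(X)
    elapsed = time.perf_counter() - start
    print(f"tol={tol:g}: {imputer.n_iter_} rounds in {elapsed:.2f}s")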