[MRG] Speed up plot_stack_predictors.py by chritter · Pull Request #21726 · scikit-learn/scikit-learn

Conversation

@chritter
Contributor

Reference Issues/PRs

Towards #21598
What does this implement/fix? Explain your changes.

These changes speed up the example ../examples/ensemble/plot_stack_predictors.py

Any other comments?

Working with @norbusan

#DataUmbrella Sprint

The current code uses three predictors (Random Forest, Lasso,
Gradient Boosting) and then combines them with a StackingRegressor
to demonstrate how stacking combines them automatically.

Currently, with the default of 5 folds for cross-validation, the
run time can get considerably long, in particular in settings where
only one (v)CPU is available (like CI tests).

Experiments have shown that switching from the default 5 folds to 2
decreases the execution time by roughly a factor of 10:

The following table gives run times (wall-clock, in seconds) on my system,
with n_jobs=1 set for all computations (to make the numbers comparable),
varying the `cv` parameter passed to the `cross_validate`/`cross_val_predict`
calls (rows) and to the stacking regressor (columns):

| cv (cross_val) \ cv (stacking) | 5     | 2    |
|--------------------------------|-------|------|
| 5                              | 22 s  | 11 s |
| 2                              | 7.5 s | 2 s  |

In particular, changing the `cv` value didn't have any influence on
the final outcome.
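
For reference, a minimal self-contained sketch of how such a timing comparison can be run; the dataset here is synthetic and the estimators only loosely mirror the example, so absolute times will differ from the table above:

```python
import time

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import cross_validate

X, y = make_regression(n_samples=1000, n_features=20, random_state=0)

for stacking_cv in (5, 2):  # columns of the table above
    stacker = StackingRegressor(
        estimators=[
            ("random_forest", RandomForestRegressor(random_state=0)),
            ("lasso", LassoCV()),
        ],
        final_estimator=RidgeCV(),
        cv=stacking_cv,
    )
    for outer_cv in (5, 2):  # rows of the table above
        start_time = time.time()
        cross_validate(
            stacker, X, y,
            scoring=["r2", "neg_mean_absolute_error"],
            cv=outer_cv, n_jobs=1, verbose=0,
        )
        print(
            f"stacking cv={stacking_cv}, cross_validate cv={outer_cv}: "
            f"{time.time() - start_time:.1f}s"
        )
```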
@chritter changed the title from "[WI] Speed up plot-stack-predictor.py" to "[WIP] Speed up plot-stack-predictor.py" on Nov 21, 2021
@chritter
Contributor Author
chritter commented Nov 21, 2021

Modified from @norbusan commit message:

(Same text and timing table as in the PR description above.)

@adrinjalali changed the title from "[WIP] Speed up plot-stack-predictor.py" to "[WIP] Speed up plot_stack_predictors.py" on Nov 22, 2021
@adrinjalali mentioned this pull request on Nov 22, 2021
@adrinjalali
Member

Please run black on the example to make your PR pass the CI: https://scikit-learn.org/dev/developers/contributing.html#how-to-contribute

@chritter changed the title from "[WIP] Speed up plot_stack_predictors.py" to "[MRG] Speed up plot_stack_predictors.py" on Dec 3, 2021
@chritter
Contributor Author
chritter commented Jan 2, 2022

@adrinjalali I would appreciate your help with the review. I addressed your comment. Thanks!

Member
@thomasjpfan left a comment

Thank you for the PR @chritter !

  start_time = time.time()
  score = cross_validate(
-     est, X, y, scoring=["r2", "neg_mean_absolute_error"], n_jobs=2, verbose=0
+     est, X, y, scoring=["r2", "neg_mean_absolute_error"], n_jobs=2, verbose=0, cv=2
Member

I think setting cv=2 in cross_validate would promote bad ML practices. In this case, I prefer to have a longer running example with the default cv=5.

Member

That is fine with me; it would still cut the execution time roughly in half, so it would be a win anyway.

Member

I pushed a commit that removes the above cv=2 part.

Member
@thomasjpfan left a comment

Thanks for the update!

  elapsed_time = time.time() - start_time

- y_pred = cross_val_predict(est, X, y, n_jobs=2, verbose=0)
+ y_pred = cross_val_predict(est, X, y, n_jobs=2, verbose=0, cv=2)
Member

I feel like the cv setting should be consistent with cross_validate. Maybe @glemaitre has some input on this?

Member

yep. let's do that.

Contributor Author

I addressed this, consistently using 5-fold cv for both cross_val_predict and cross_validate.

Member
@thomasjpfan Jan 4, 2022

Comparing the example on dev and on this PR, there no longer appears to be a timing difference when we use the default cv=5. Do you see a difference locally?

Contributor Author

Yes, I confirm that most of the performance gain is lost once we put the cv=5 defaults back in place.

@chritter
Contributor Author
chritter commented Jan 7, 2022

Two alternative options to get a significant speedup:

A) Remove HistGradientBoostingRegressor, as two estimators (RidgeCV and RandomForestRegressor) should be enough to demo the stacking approach. This would reduce the runtime from 41s to 21s.
B) Reduce the number of trees for HistGradientBoostingRegressor from 100 to 10, which leads to a 24s runtime.

Let me know which approach you deem reasonable while still following good practices; a rough sketch of both options is below.
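
The preprocessing pipelines of the actual example are omitted here, so the constructors below are illustrative only:

```python
from sklearn.ensemble import (
    HistGradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.linear_model import RidgeCV

# Option A: drop the gradient boosting estimator; two base learners are
# still enough to demonstrate stacking.
estimators_a = [
    ("Random Forest", RandomForestRegressor(random_state=42)),
    ("Ridge", RidgeCV()),
]

# Option B: keep it, but cut its boosting iterations from the default 100
# down to 10 (HistGradientBoostingRegressor exposes this as max_iter).
estimators_b = estimators_a + [
    ("Gradient Boosting", HistGradientBoostingRegressor(max_iter=10, random_state=0)),
]

stacking_regressor = StackingRegressor(
    estimators=estimators_b, final_estimator=RidgeCV()
)
```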

@thomasjpfan
Member

> Reducing the number of trees for HistGradientBoostingRegressor from 100 to 10 leads to 24s runtime.

I think I am okay with this, as long as we leave a comment saying this is to reduce runtime. Although making it 10 feels too low since we semi-recently moved the default from 10 to 100 in RandomForest.

How much lower is the runtime when n_estimators=50 is set for both RandomForest and HistGradientBoosting?

@chritter
Contributor Author

> I think I am okay with this, as long as we leave a comment saying this is to reduce runtime. Although making it 10 feels too low since we semi-recently moved the default from 10 to 100 in RandomForest.
>
> How much lower is the runtime when n_estimators=50 is set for both RandomForest and HistGradientBoosting?

It takes only 28s, compared to the original 41s. If that is sufficient, I would implement only this change. Thanks!
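
For clarity, a sketch of that change (estimator names and random states here are illustrative). One caveat: RandomForestRegressor takes `n_estimators`, while HistGradientBoostingRegressor controls its number of boosting iterations through `max_iter`:

```python
# Illustrative only: halve the tree/iteration count from the default of 100
# to reduce the example's runtime.
from sklearn.ensemble import HistGradientBoostingRegressor, RandomForestRegressor

rf = RandomForestRegressor(n_estimators=50, random_state=42)
gbdt = HistGradientBoostingRegressor(max_iter=50, random_state=0)
```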

@chritter requested a review from thomasjpfan on January 17, 2022, 14:14
@thomasjpfan
Member

> It takes only 28s, compared to the original 41s. If that is sufficient, I would implement only this change. Thanks!

That sounds reasonable. Going down to 25 also sounds fine to me, if the metrics still hold.

@siavrez
Contributor
siavrez commented Jan 19, 2022

I mistakenly worked on this for a while. @chritter, using cross_validate with return_estimator=True instead of cross_val_predict helped with the runtime: https://github.com/scikit-learn/scikit-learn/pull/21733/files
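
A minimal sketch of that idea (self-contained, with a synthetic dataset; the actual change in #21733 may differ): a single cross_validate call with return_estimator=True both scores the model and hands back the per-fold estimators, which can then produce the out-of-fold predictions that a separate cross_val_predict call would otherwise recompute.

```python
import numpy as np

from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold, cross_validate

X, y = make_regression(n_samples=500, n_features=10, random_state=0)
est = RidgeCV()

cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_results = cross_validate(
    est, X, y,
    scoring=["r2", "neg_mean_absolute_error"],
    cv=cv,
    return_estimator=True,  # keep the fitted per-fold estimators
    n_jobs=2,
)

# Reuse the fitted fold estimators to assemble out-of-fold predictions,
# instead of refitting everything through a separate cross_val_predict call.
y_pred = np.empty_like(y, dtype=float)
for fold_est, (_, test_idx) in zip(cv_results["estimator"], cv.split(X, y)):
    y_pred[test_idx] = fold_est.predict(X[test_idx])
```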

@adrinjalali
Member

@chritter could you take suggestions from @siavrez into account?

@adrinjalali added the "Stalled", "good first issue", and "help wanted" labels on Mar 18, 2022
@adrinjalali
Member
adrinjalali commented Mar 18, 2022

Making this available for any contributor who wants to finish the work.

@siavrez
Contributor
siavrez commented Mar 18, 2022

> Making this available for any contributor who wants to finish the work.

I'll work on it if that's okay.

@adrinjalali
Member

Of course, I just didn't see anything here since this comment: #21726 (comment)

@adrinjalali removed the "Stalled", "good first issue", and "help wanted" labels on Mar 18, 2022
@glemaitre
Member

superseded

@glemaitre closed this on May 30, 2022