[MRG] Speed up plot_stack_predictors.py by chritter · Pull Request #21726 · scikit-learn/scikit-learn

Conversation

@chritter
Contributor

Reference Issues/PRs

Towards #21598
What does this implement/fix? Explain your changes.

These changes speed up the example ../examples/ensemble/plot_stack_predictors.py

Any other comments?

Working with @norbusan

#DataUmbrella Sprint

The current code uses three predictors (Random Forest, Lasso,
Gradient Boosting) and then combines them with a StackingRegressor
to demonstrate how stacking combines them automatically.

Currently, with the default of 5 folds for cross-validation, the
run time can get considerably long, in particular in settings where
only one (v)CPU is available (like CI tests).

Experiments have shown that switching from the default 5 folds to 2
decreases the execution time by roughly a factor of 10:

The following table gives run times (wall-clock, in seconds) on my system,
with n_jobs=1 set for all computations (to make the numbers comparable),
varying the `cv` parameter passed to the `cross_validate`/`cross_val_predict`
calls (rows) and to the stacking regressor (columns):

| cv (cross_val) \ cv (stacking) | 5     | 2    |
|--------------------------------|-------|------|
| 5                              | 22 s  | 11 s |
| 2                              | 7.5 s | 2 s  |

In particular, changing the `cv` value didn't have any influence on
the final outcome.
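
For reference, a minimal self-contained sketch of how such a timing comparison can be run; the dataset here is synthetic and the estimators only loosely mirror the example, so absolute times will differ from the table above:

```python
import time

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import cross_validate

X, y = make_regression(n_samples=1000, n_features=20, random_state=0)

for stacking_cv in (5, 2):  # columns of the table above
    stacker = StackingRegressor(
        estimators=[
            ("random_forest", RandomForestRegressor(random_state=0)),
            ("lasso", LassoCV()),
        ],
        final_estimator=RidgeCV(),
        cv=stacking_cv,
    )
    for outer_cv in (5, 2):  # rows of the table above
        start_time = time.time()
        cross_validate(
            stacker, X, y,
            scoring=["r2", "neg_mean_absolute_error"],
            cv=outer_cv, n_jobs=1, verbose=0,
        )
        print(
            f"stacking cv={stacking_cv}, cross_validate cv={outer_cv}: "
            f"{time.time() - start_time:.1f}s"
        )
```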
@chritter changed the title from "[WI] Speed up plot-stack-predictor.py" to "[WIP] Speed up plot-stack-predictor.py" on Nov 21, 2021
@chritter
Contributor Author
chritter commented Nov 21, 2021

Modified from @norbusan commit message:

(Same text and timing table as in the PR description above.)

@adrinjalali changed the title from "[WIP] Speed up plot-stack-predictor.py" to "[WIP] Speed up plot_stack_predictors.py" on Nov 22, 2021
@adrinjalali mentioned this pull request on Nov 22, 2021
@adrinjalali
Member

Please run black on the example to make your PR pass the CI: https://scikit-learn.org/dev/developers/contributing.html#how-to-contribute

@chritter changed the title from "[WIP] Speed up plot_stack_predictors.py" to "[MRG] Speed up plot_stack_predictors.py" on Dec 3, 2021
@chritter
Contributor Author
chritter commented Jan 2, 2022

@adrinjalali I would appreciate your help with the review. I addressed your comment. Thanks!

Member
@thomasjpfan left a comment

Thank you for the PR @chritter !

  start_time = time.time()
  score = cross_validate(
-     est, X, y, scoring=["r2", "neg_mean_absolute_error"], n_jobs=2, verbose=0
+     est, X, y, scoring=["r2", "neg_mean_absolute_error"], n_jobs=2, verbose=0, cv=2
Member

I think setting cv=2 in cross_validate would promote bad ML practices. In this case, I prefer to have a longer running example with the default cv=5.

Member

That is fine with me; it would still cut the execution time roughly in half, so it would be a win anyway.

Member

I pushed a commit that removes the above cv=2 part.

Member
@thomasjpfan left a comment

Thanks for the update!

  elapsed_time = time.time() - start_time

- y_pred = cross_val_predict(est, X, y, n_jobs=2, verbose=0)
+ y_pred = cross_val_predict(est, X, y, n_jobs=2, verbose=0, cv=2)
Member

I feel like the cv setting should be consistent with cross_validate. Maybe @glemaitre has some input on this?

Member

yep. let's do that.

Contributor Author

I addressed this, consistently using 5-fold cv for both cross_val_predict and cross_validate.

Member
@thomasjpfan Jan 4, 2022

Comparing the example on dev and on this PR, there no longer appears to be a timing difference when we use the default cv=5. Do you see a difference locally?

Contributor Author

Yes, I confirm that most of the performance gain is lost once we put the cv=5 defaults back in place.

@chritter
Contributor Author
chritter commented Jan 7, 2022

Two alternative options to get a significant speedup:

A) Remove HistGradientBoostingRegressor, as two estimators (RidgeCV and RandomForestRegressor) should be enough to demo the stacking approach. This would reduce the runtime from 41s to 21s.
B) Reduce the number of trees for HistGradientBoostingRegressor from 100 to 10, which leads to a 24s runtime.

Let me know which approach you deem reasonable while still following good practices; a rough sketch of both options is below.
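
The preprocessing pipelines of the actual example are omitted here, so the constructors below are illustrative only:

```python
from sklearn.ensemble import (
    HistGradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.linear_model import RidgeCV

# Option A: drop the gradient boosting estimator; two base learners are
# still enough to demonstrate stacking.
estimators_a = [
    ("Random Forest", RandomForestRegressor(random_state=42)),
    ("Ridge", RidgeCV()),
]

# Option B: keep it, but cut its boosting iterations from the default 100
# down to 10 (HistGradientBoostingRegressor exposes this as max_iter).
estimators_b = estimators_a + [
    ("Gradient Boosting", HistGradientBoostingRegressor(max_iter=10, random_state=0)),
]

stacking_regressor = StackingRegressor(
    estimators=estimators_b, final_estimator=RidgeCV()
)
```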

@thomasjpfan
Member

> Reducing the number of trees for HistGradientBoostingRegressor from 100 to 10 leads to 24s runtime.

I think I am okay with this, as long as we leave a comment saying this is to reduce runtime. Although making it 10 feels too low since we semi-recently moved the default from 10 to 100 in RandomForest.

How much lower is the runtime when n_estimators=50 is set for both RandomForest and HistGradientBoosting?

@chritter
Contributor Author

> I think I am okay with this, as long as we leave a comment saying this is to reduce runtime. Although making it 10 feels too low since we semi-recently moved the default from 10 to 100 in RandomForest.
>
> How much lower is the runtime when n_estimators=50 is set for both RandomForest and HistGradientBoosting?

It takes only 28s, compared to the original 41s. If that is sufficient, I would implement only this change. Thanks!
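
For clarity, a sketch of that change (estimator names and random states here are illustrative). One caveat: RandomForestRegressor takes `n_estimators`, while HistGradientBoostingRegressor controls its number of boosting iterations through `max_iter`:

```python
# Illustrative only: halve the tree/iteration count from the default of 100
# to reduce the example's runtime.
from sklearn.ensemble import HistGradientBoostingRegressor, RandomForestRegressor

rf = RandomForestRegressor(n_estimators=50, random_state=42)
gbdt = HistGradientBoostingRegressor(max_iter=50, random_state=0)
```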

@chritter requested a review from thomasjpfan on January 17, 2022, 14:14
@thomasjpfan
Member

> It takes only 28s, compared to the original 41s. If that is sufficient, I would implement only this change. Thanks!

That sounds reasonable. Going down to 25 also sounds fine to me, if the metrics still hold.

@siavrez
Contributor
siavrez commented Jan 19, 2022

I mistakenly worked on this for a while. @chritter, using cross_validate with return_estimator=True instead of cross_val_predict helped with the runtime: https://github.com/scikit-learn/scikit-learn/pull/21733/files
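
A minimal sketch of that idea (self-contained, with a synthetic dataset; the actual change in #21733 may differ): a single cross_validate call with return_estimator=True both scores the model and hands back the per-fold estimators, which can then produce the out-of-fold predictions that a separate cross_val_predict call would otherwise recompute.

```python
import numpy as np

from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold, cross_validate

X, y = make_regression(n_samples=500, n_features=10, random_state=0)
est = RidgeCV()

cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_results = cross_validate(
    est, X, y,
    scoring=["r2", "neg_mean_absolute_error"],
    cv=cv,
    return_estimator=True,  # keep the fitted per-fold estimators
    n_jobs=2,
)

# Reuse the fitted fold estimators to assemble out-of-fold predictions,
# instead of refitting everything through a separate cross_val_predict call.
y_pred = np.empty_like(y, dtype=float)
for fold_est, (_, test_idx) in zip(cv_results["estimator"], cv.split(X, y)):
    y_pred[test_idx] = fold_est.predict(X[test_idx])
```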

@adrinjalali
Member

@chritter could you take suggestions from @siavrez into account?

@adrinjalali added the "Stalled", "good first issue", and "help wanted" labels on Mar 18, 2022
@adrinjalali
Member
adrinjalali commented Mar 18, 2022

Making this available for any contributor who wants to finish the work.

@siavrez
Contributor
siavrez commented Mar 18, 2022

> Making this available for any contributor who wants to finish the work.

I'll work on it if that's okay.

@adrinjalali
Member

Of course, I just didn't see anything here since this comment: #21726 (comment)

@adrinjalali removed the "Stalled", "good first issue", and "help wanted" labels on Mar 18, 2022
@glemaitre
Member

superseded

@glemaitre closed this on May 30, 2022