[MRG] Speed up plot_stack_predictors.py #21726
Conversation
The current code uses three predictors (Random Forest, Lasso, Gradient Boosting) and then combines them with `StackingRegressor` to show an automatic selection. Currently, with the default of 5 folds for cross-validation, the run time can get considerably long, in particular in settings where only one (v)CPU is available (like CI tests). Experiments have shown that switching to 2 folds instead of the default 5 decreases the execution time by roughly a factor of 10.

The following table gives run times (real run time) on my system, with `n_jobs=1` added to all computations (to have comparable output), for the `cv` value passed to the `cross_validate`/`cross_val_predict` calls (rows) and to the stacking regressor (columns):

| cv in cross_val calls \ cv in StackingRegressor | 5 | 2 |
|---|---|---|
| 5 | 22 | 11 |
| 2 | 7.5 | 2 |

In particular, changing the `cv` value didn't have any influence on the final outcome.
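For illustration, a minimal sketch of where such a `cv` value would be threaded through. This is not the actual example code: a synthetic `make_regression` dataset stands in for the Ames housing data used by plot_stack_predictors.py, and only two base estimators are kept.

```python
# Sketch only: synthetic data and a reduced estimator list stand in for the
# real example; the point is where cv=2 can be passed.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import cross_val_predict, cross_validate

X, y = make_regression(n_samples=500, n_features=20, random_state=0)

estimators = [
    ("Random Forest", RandomForestRegressor(random_state=42)),
    ("Lasso", LassoCV()),
]

# cv on the stacking regressor controls the internal cross_val_predict used
# to build the out-of-fold predictions fed to the final estimator.
stacking_regressor = StackingRegressor(
    estimators=estimators, final_estimator=RidgeCV(), cv=2
)

# cv on the evaluation utilities controls the outer evaluation folds
# (the rows/columns of the timing table above).
score = cross_validate(
    stacking_regressor, X, y,
    scoring=["r2", "neg_mean_absolute_error"], cv=2, n_jobs=1,
)
y_pred = cross_val_predict(stacking_regressor, X, y, cv=2, n_jobs=1)
```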
Modified from @norbusan's commit message.
Please run …
@adrinjalali I would appreciate your help with the review. I addressed your comment. Thanks!
Thank you for the PR, @chritter!
```diff
 start_time = time.time()
 score = cross_validate(
-    est, X, y, scoring=["r2", "neg_mean_absolute_error"], n_jobs=2, verbose=0
+    est, X, y, scoring=["r2", "neg_mean_absolute_error"], n_jobs=2, verbose=0, cv=2
```
I think setting cv=2 in cross_validate would promote bad ML practices. In this case, I prefer to have a longer-running example with the default cv=5.
That is fine with me; it would still cut the execution time roughly in half, so it is a win either way.
I pushed a commit that removes the above cv=2 part.
Thanks for the update!
```diff
 elapsed_time = time.time() - start_time
```

```diff
-y_pred = cross_val_predict(est, X, y, n_jobs=2, verbose=0)
+y_pred = cross_val_predict(est, X, y, n_jobs=2, verbose=0, cv=2)
```
I feel like the cv setting should be consistent with cross_validate. Maybe @glemaitre has some input on this?
Yep, let's do that.
I addressed this, consistently using 5-fold CV for both cross_val_predict and cross_validate.
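A sketch of what keeping the folds consistent could look like; the `N_FOLDS` name and the `Ridge` stand-in estimator are illustrative, not taken from the actual commit.

```python
# Sketch: define the fold count once so cross_validate and cross_val_predict
# always use the same number of folds.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict, cross_validate

X, y = make_regression(n_samples=300, n_features=10, random_state=0)
est = Ridge()  # stand-in for the stacking regressor in the example

N_FOLDS = 5  # scikit-learn's default, made explicit so both calls agree

score = cross_validate(
    est, X, y, scoring=["r2", "neg_mean_absolute_error"],
    cv=N_FOLDS, n_jobs=2, verbose=0,
)
y_pred = cross_val_predict(est, X, y, cv=N_FOLDS, n_jobs=2, verbose=0)
```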
Yes, I confirm that most of the performance gain is lost once we put the cv=5 settings back in place.
Alternative options to get a significant speedup: A) Remove … Let me know which approach you deem reasonable while still following good practices.
I think I am okay with this, as long as we leave a comment saying this is to reduce runtime. Although making it 10 feels too low, since we semi-recently moved the default from 10 to 100 in RandomForest. How much does the runtime drop when …
It takes only 28s, compared to the original 41s. If that is sufficient, I would implement only this change. Thanks!
That sounds reasonable. Going down to 25 works for me, if the metrics still hold.
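For illustration, a sketch of that change with the runtime comment requested above; the value 25 is the one floated in this thread, and the synthetic data is a stand-in rather than the example's real pipeline.

```python
# Sketch only: synthetic data stands in for the example's dataset.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=10, random_state=0)

# n_estimators is lowered from the default (100) purely to keep the example's
# runtime short; the default or a higher value is usually preferable in practice.
rf = RandomForestRegressor(n_estimators=25, random_state=42)
rf.fit(X, y)
```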
I mistakenly worked on this for a while. @chritter, using …
I'll work on it if that's okay.
Of course, I just didn't see anything here since this comment: #21726 (comment)
Superseded.
Reference Issues/PRs
Towards #21598
What does this implement/fix? Explain your changes.
These changes accelerate the example ../examples/ensemble/plot_stack_predictors.py
Any other comments?
Working with @norbusan
#DataUmbrella Sprint