Accelerate slow examples · Issue #21598 · scikit-learn/scikit-learn · GitHub
Accelerate slow examples #21598

Closed
38 of 41 tasks
adrinjalali opened this issue Nov 8, 2021 · 63 comments · Fixed by #21773
Labels
good first issue (Easy with clear instructions to resolve) · Meta-issue (General issue associated to an identified list of tasks)

Comments

adrinjalali (Member) commented Nov 8, 2021

These examples take quite a long time to run, and they make our documentation CI fail quite frequently due to timeouts. It'd be nice to speed them up a bit.

To contributors: if you want to work on an example, first have a look at it; if you're comfortable working on it and have found a potential way to speed up its execution time while preserving the educational message of the example, please mention which one you're working on in the comments below.

Please open a dedicated PR for each individual example you have found a fix for (with a new git branch off of main for each example) to make reviews faster.

Please focus on the longest running examples first (e.g. 30s or more). Examples that run in less than 15s are probably fine.

Please also keep in mind that we want to keep the example code as simple as possible for educational reasons, while keeping the main points expressed in the text of the example valid and well illustrated by the result of the execution (plots or text outputs).

Finally, we expect that some examples cannot really be accelerated while preserving their educational value (the integrity of the message and the simplicity of the code). In that case, we may decide to keep them as they are if they run in less than 60s.

To maintainers: I'm running a script which automatically updates the following list with connected PRs and "done" checkboxes, so there is no need to update it manually.

Examples to Update

adrinjalali added the good first issue and help wanted labels on Nov 8, 2021
adrinjalali (Member Author) commented:

@hhnnhh or @marenwestermann may be interested in this.

cakiki commented Nov 9, 2021

Hi @adrinjalali. Could you elaborate on what "speeding up" entails? Is this about choosing a more reasonable setup of the parameters, or a more substantial refactoring of the examples?

adrinjalali (Member Author) commented:

@cakiki ideally you'd be able to speed them up just by changing some parameters or reducing the size of the data, while still presenting the same outcome; but changing the examples a bit is not necessarily out of scope either, if required.

adrinjalali (Member Author) commented:

For instance, you could switch from the digits dataset to the iris dataset in the first and slowest example and speed it up by almost 100-fold. The question is then whether that still demonstrates the benefit of RandomizedSearchCV. Or you could try HistGradientBoostingClassifier instead of SGDClassifier and see if it runs much faster. Then open a PR, and through discussion we'll figure out the best choice.
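For illustration, here is a minimal sketch of the two kinds of change suggested above (a smaller dataset and a faster estimator). The search space and settings below are made up for the sketch and are not taken from the actual example:

```python
# Hypothetical sketch of the suggested speed-ups, not the actual example code:
# (1) swap the digits dataset (~1800 samples) for the much smaller iris dataset,
# (2) use HistGradientBoostingClassifier instead of SGDClassifier.
from scipy.stats import loguniform
from sklearn.datasets import load_iris  # was load_digits
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features

# Made-up search space, only to keep the sketch runnable.
param_distributions = {
    "learning_rate": loguniform(1e-2, 1e0),
    "max_leaf_nodes": [15, 31, 63],
}

search = RandomizedSearchCV(
    HistGradientBoostingClassifier(random_state=0),
    param_distributions,
    n_iter=10,  # fewer sampled candidates also cuts runtime
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```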

sply88 (Contributor) commented Nov 9, 2021

I will work on ../examples/ensemble/plot_gradient_boosting_early_stopping.py

ghost commented Nov 9, 2021

Would work on the ../examples/linear_model/plot_sgd_comparison.py

sply88 (Contributor) commented Nov 9, 2021

Will look at ../examples/ensemble/plot_gradient_boosting_regularization.py next.

sply88 (Contributor) commented Nov 9, 2021

Will look at ../examples/model_selection/plot_successive_halving_iterations.py next

johgreen commented:

I'll try ../examples/linear_model/plot_sgd_early_stopping.py

Iglesys347 (Contributor) commented:

I'll work on ../examples/cluster/plot_cluster_comparison.py

ghost commented Nov 10, 2021

Will work on ../examples/linear_model/plot_sparse_logistic_regression_20newsgroups.py next

ghost commented Nov 10, 2021

Will work on ../examples/linear_model/plot_tweedie_regression_insurance_claims.py next

ghost commented Nov 10, 2021

Now working on ../examples/svm/plot_svm_scale_c.py

ghost commented Nov 10, 2021

Am working on ../examples/model_selection/plot_multi_metric_evaluation.py

ghost commented Nov 10, 2021

Am working on ../examples/model_selection/plot_learning_curve.py right now

adrinjalali pushed a commit that referenced this issue Nov 29, 2021
…21612)

* accelerate plot_successive_halving_iterations.py example #21598

* n_estimators back to 20
adrinjalali pushed a commit that referenced this issue Nov 29, 2021
#21611)

* accelerate plot_gradient_boosting_regularization.py example #21598

* speed up by less samples and less trees

* use train_test_split instead of slicing
glemaitre pushed a commit to glemaitre/scikit-learn that referenced this issue Nov 29, 2021
…t-learn#21678)

* Reduce num of samples in plot-digit-linkage example

* Remove unnecessary random_state

* Remove nudge_images

* Address PR comment, elaborate analysis
ojeda-e (Contributor) commented Dec 11, 2021

As suggested by @adrinjalali on Gitter, I am commenting on my attempt to accelerate /examples/linear_model/plot_sparse_logistic_regression_mnist.py.

This example runs a logistic regression with the saga solver and an l1 penalty. In trying to optimize it, I found that most of the running time is spent fetching the data rather than running the regression itself. The best acceleration I found was 5%, which translates to about a 2-second difference (running locally on my laptop, not on CI). Unfortunately, reducing train_samples does not meaningfully decrease the running time either. Another option I considered was running LogisticRegression with tol=0.9, n_jobs=2, and max_iter=40; unfortunately, I can't get more than 5%.
One tweak I did manage was in the plot itself: doing the reshape outside the plotting loop with a list comprehension. That makes the plot faster overall, but it's completely independent of the regression.
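As a rough illustration, here is a minimal sketch of that plotting tweak with made-up stand-in data (not the actual example code): reshape the coefficients once, in a list comprehension, before the plotting loop.

```python
# Hypothetical sketch of the plotting tweak, not the actual example code.
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.RandomState(0)
coef = rng.randn(10, 784)  # stand-in for clf.coef_ (10 classes, 28x28 pixels)

# Reshape all coefficient vectors up front instead of inside the loop.
images = [c.reshape(28, 28) for c in coef]

fig, axes = plt.subplots(2, 5, figsize=(10, 5))
for ax, image, label in zip(axes.ravel(), images, range(10)):
    ax.imshow(image, cmap=plt.cm.RdBu, interpolation="nearest")
    ax.set_title(f"Class {label}")
    ax.set_xticks(())
    ax.set_yticks(())
plt.show()
```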

Compared to other PRs in this issue, the acceleration obtained here is quite low. I'm not even sure this example can be further improved, considering that the most expensive operation here is fetch_openml.

amrutha-mukkatira commented:
For now, you can switch from the digits dataset to the iris dataset in the first and slowest example and speed it up by almost 100-fold. The question is then whether that still demonstrates the benefit of RandomizedSearchCV; alternatively, try HistGradientBoostingClassifier instead of SGDClassifier and see if it runs much faster.

For reference, see: accelerate plot_successive_halving_iterations.py example (scikit-learn#21598)

aad1tya (Contributor) commented Jan 31, 2022

I am working on
..\examples\ensemble\plot_stack_predictors.py

andrewkrawczyk commented Feb 1, 2022

As far as /examples/linear_model/plot_sparse_logistic_regression_mnist.py goes, I agree with @ojeda-e in #21598 (comment). I ran some timing tests, and the largest time taken by far is data fetching from OpenML:
(screenshot: section timings)

My placement of timing "checkpoints" can be seen here:
plot_sparse_logistic_regression_mnist_timed.txt

I'm a first-time contributor, so sorry that I submitted that .py file as a .txt file; I'm still getting used to how GitHub works. Please let me know if there is a better way to share my work, since it's not a file that should be merged into the main project.

Unless this example can be switched to a different dataset that is faster to fetch, I don't think performance can be improved, since fetching the data is ~85% of the total runtime.
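For anyone wanting to reproduce this kind of measurement, here is a minimal sketch of such timing checkpoints (the checkpoint placement and regression settings are assumptions for the sketch; the attached .txt file above is the authoritative version):

```python
# Hypothetical sketch of timing "checkpoints" around the two main sections,
# not the attached file: wrap each section with time.perf_counter().
import time

from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression

t0 = time.perf_counter()
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
t1 = time.perf_counter()
print(f"fetch_openml: {t1 - t0:.1f}s")

# l1-penalised logistic regression with the saga solver, as described above;
# the C value, tolerance, and subsample size are made up for the sketch.
clf = LogisticRegression(C=50.0 / 5000, penalty="l1", solver="saga", tol=0.1)
clf.fit(X[:5000], y[:5000])
t2 = time.perf_counter()
print(f"fit: {t2 - t1:.1f}s")
```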

glemaitre (Member) commented Feb 1, 2022

We are aware of the slow fetch_openml function and we have a separate PR to address it: #21938

So for the moment I would ignore the fact that the fetcher takes too much time.

Aditi840 commented Mar 9, 2022

Hello everyone, I saw this issue is open and I would like to work on it. Is there any help needed?

adrinjalali (Member Author) commented:

@Aditi840 for now I think no more help is needed here. Thanks for offering.

glemaitre (Member) commented:

Closing after merging #21938, which accelerates the remaining examples.
