8000 DOC Add demo on parallelization with context manager using different backends by ArturoAmorQ · Pull Request #25714 · scikit-learn/scikit-learn · GitHub
Draft · wants to merge 8 commits into base: main
Conversation

@ArturoAmorQ (Member)

Reference Issues/PRs

What does this implement/fix? Explain your changes.

We currently have some very insightful information about parallelism in the user guide, where we even link to this document from @thomasjpfan. But IMHO a demo using different backends would be the cherry on the cake.

Any other comments?

I tried benchmarking on Spark, but I ran into several compatibility problems, and it did not seem to improve the computing time on a local computer.

This work is a joint effort with @Vincent-Maladiere.

@thomasjpfan (Member) left a comment

Thanks for the benchmark! Since we can not run the benchmark on the CI when building examples, this feels like something that belongs in the benchmark directory. In that case, we can reference the benchmark from doc/computing/parallelism.rst.

from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier

X, y = make_classification(n_samples=1_000, random_state=0)
hist_gbdt = HistGradientBoostingClassifier(random_state=0)
Member

Does joblib automatically set OMP_NUM_THREADS for backends such as dask or ray?

Contributor

Hi @thomasjpfan! I don't understand your question; could you rephrase it?

Member

Does joblib control the number of OpenMP threads for other backends? There can be over-subscription because of dask parallelism + OpenMP parallelism.

XREF: I know for loky, joblib will control the environment variables, but I have not checked for other backends.
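Whether joblib propagates thread limits to third-party backends is exactly the open question here. What can always be done explicitly is capping the native (OpenMP/BLAS) thread pools with threadpoolctl, a scikit-learn dependency. A minimal sketch:

```python
import numpy as np  # importing numpy loads a BLAS whose pool threadpoolctl can see
from threadpoolctl import threadpool_info, threadpool_limits

# Inside this block, every thread pool threadpoolctl knows about
# (OpenMP, OpenBLAS, MKL, ...) is capped at 1 thread, which is one way
# to avoid oversubscription when an outer backend already uses all cores.
with threadpool_limits(limits=1):
    pools = threadpool_info()  # list of dicts describing loaded thread pools
    for pool in pools:
        print(pool["internal_api"], pool["num_threads"])
```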

Member

That's a good question. I don't think so but I think we should investigate. It's possible that ray and dask do something smart by default outside of the of joblib connector itself.

@ArturoAmorQ (Member, Author)

Thanks for the benchmark! Since we can not run the benchmark on the CI when building examples, this feels like something that belongs in the benchmark directory.

Indeed, I was wondering about the best place to add this file, and thought this case is similar to the approximate nearest neighbors example, which is both a benchmark and a demo of a scikit-learn tool (in the case of the current PR, the parameter n_jobs).

I am certainly open to moving it to the benchmark directory, though I tried to keep a beginner-friendly narrative which I find more appropriate for the example gallery. WDYT?

@thomasjpfan (Member)

I am certainly open to moving it to the benchmark directory, though I tried to keep a beginner-friendly narrative which I find more appropriate for the example gallery. WDYT?

Currently, examples run nightly to ensure that they work with the latest version of scikit-learn. The example in this PR is a snapshot of performance with fixed versions of scikit-learn and dependencies.

  • If the goal is to render the example as HTML, then we can create a "performance" directory that does not run nightly but instead keeps a snapshot in time of performance. I am unsure about adding a "performance" directory, because it will get out of date: updates to scikit-learn, ray, or dask may change the performance results.

  • The other way to render the HTML and share the example is to make it a blog post on https://blog.scikit-learn.org/.

In both cases, I'd like to see the library versions used to generate the example so others can reproduce it.
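Recording the versions is easy to do from the standard library. A small sketch of a reproducibility header for the example (the package list is hypothetical, and `report_versions` is a helper name invented here, not a scikit-learn API):

```python
import sys
from importlib.metadata import PackageNotFoundError, version


def report_versions(packages):
    """Collect interpreter and package versions so a benchmark is reproducible."""
    versions = {"python": sys.version.split()[0]}
    for name in packages:
        try:
            versions[name] = version(name)
        except PackageNotFoundError:
            versions[name] = "not installed"
    return versions


# Hypothetical package list for this benchmark.
for name, ver in report_versions(["scikit-learn", "joblib", "dask", "ray"]).items():
    print(f"{name}: {ver}")
```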

@ArturoAmorQ (Member, Author)
  • If the goal is to render the example as HTML, then we can create a "performance" directory that does not run nightly

As far as I understand, the example gallery does not run nightly for files without the plot_ prefix.

thomasjpfan commented Mar 7, 2023

As far as I understand, the example gallery does not run nightly for files without the plot_ prefix.

That is true, and I prefer not to have them, because we cannot be sure they still work. I do not want to add more examples that do not run nightly.

Edit: I am +0.25 for adding this example, as long as scikit-learn's version and dependencies' versions are displayed in the beginning of this example.

@thomasjpfan (Member)

Thinking it over, I am +0.25 for adding this example, as long as scikit-learn's version and the dependencies' versions are displayed at the beginning of this example. With non-running examples, I'm still concerned that either the code will not work anymore or the benchmark results will be out of date when scikit-learn or the dependencies update.

Given that the examples are rendered into HTML and public facing, we could be showcasing code that may not run anymore.

Comment on lines +110 to +122
from time import time

import numpy as np
from sklearn.model_selection import cross_validate
from threadpoolctl import threadpool_limits

n_threads_grid = 2 ** np.arange(np.log2(2 * N_CORES).astype(np.int32) + 1)

for n_threads in n_threads_grid:
    tic = time()
    with threadpool_limits(limits=int(n_threads)):
        cross_validate(clf, X, y, cv=cv)
    toc = time()
    print(f"n_threads: {n_threads}, elapsed time: {toc - tic:.3f} sec")
Member Author

On my laptop (4 physical cores and 8 logical cores) I observe the following:

n_threads: 1, elapsed time: 3.381 sec
n_threads: 2, elapsed time: 3.261 sec
n_threads: 4, elapsed time: 3.235 sec
n_threads: 8, elapsed time: 26.078 sec

The last line reveals a serious oversubscription problem when the number of threads is not limited to the number of physical cores.
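The physical/logical distinction that explains the jump above can be queried directly from joblib, which can report the number of physical cores (ignoring SMT/hyper-threading); that count is usually the right cap for OpenMP thread pools in numeric code. A minimal sketch:

```python
import joblib

# Logical cores include hyper-threads; physical cores do not. If detection
# of physical cores fails, joblib falls back to the logical count.
logical = joblib.cpu_count()
physical = joblib.cpu_count(only_physical_cores=True)

print(f"logical cores: {logical}, physical cores: {physical}")
```

On the 4-physical/8-logical laptop above, this would report 8 and 4, matching the thread count at which the slowdown appears.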

Member Author

Furthermore, we can observe that the dataset is too small for multi-threading to give a significant speed-up.

Member

n_threads: 8, elapsed time: 26.078 sec

ouch :)

@ArturoAmorQ ArturoAmorQ marked this pull request as draft March 21, 2023 17:30
ogrisel commented Mar 21, 2023

The timings measured in #25714 (comment) are another case of the problem reported by @adrinjalali in #25822: note that the datasets are small in both cases.

Furthermore, irrespective of the size of the dataset, we might want to make scikit-learn automatically limit the number of OpenMP threads used in Cython code to the number of physical cores instead of logical cores.

The latter point could be a follow-up to @jeremiedbb's #25918 that changes _openmp_effective_n_threads to make it possible to use the number of physical cores. We should probably change the default to True in a dedicated PR once #25918 is merged.

ogrisel commented Mar 21, 2023

Let's put this PR on hold for now, because there will not be any simple educational message as long as #25822 is not fixed. Once that is the case, I think it would be interesting to continue this work to show how to use threadpoolctl and various joblib backends, as a tutorial on how to configure the different levels of parallelism in scikit-learn and their pros and cons.

@adrinjalali (Member)

To me this is more of a user-guide kind of material, which doesn't reproduce on CI but is there as a fixed state, and for that I think it makes sense to have it. Users should be able to reproduce it on their machines, and if we do it in a user guide, then the dataset can be large enough for the points raised here to be valid.

@ArturoAmorQ (Member, Author)

Somewhat related to #30662, as we want the message to hold regardless of the size of the dataset.

ogrisel commented Mar 20, 2025

as we want the message to hold regardless of the size of the dataset.

That might be difficult to achieve, but I agree that we should try to protect against the most pathological slowdowns with the default settings.

6 participants