DOC Add demo on parallelization with context manager using different backends #25714
Conversation
Thanks for the benchmark! Since we cannot run the benchmark on the CI when building examples, this feels like something that belongs in the benchmark directory. In that case, we can reference the benchmark from doc/computing/parallelism.rst.
```python
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier

X, y = make_classification(n_samples=1_000, random_state=0)
hist_gbdt = HistGradientBoostingClassifier(random_state=0)
```
Does joblib automatically set OMP_NUM_THREADS for backends such as dask or ray?
Hi @thomasjpfan! I don't understand your question; could you rephrase it?
Does joblib control the number of OpenMP threads for other backends? There can be over-subscription because of dask parallelism + OpenMP parallelism. XREF: I know that for loky, joblib will control the environment variables, but I have not checked for other backends.
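One way to check this empirically is to have each worker report the thread-limiting environment variables it sees. This is a sketch, assuming joblib is installed; which variables loky sets, and to what values, is an implementation detail that may vary across joblib versions:

```python
import os

from joblib import Parallel, delayed


def worker_thread_env():
    """Report the thread-limiting variables visible inside a worker."""
    keys = ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS")
    return {k: os.environ.get(k) for k in keys}


# With the default loky backend, joblib spawns worker processes and can
# set these variables in them to avoid oversubscription; other backends
# (dask, ray) may leave them unset.
results = Parallel(n_jobs=2, backend="loky")(
    delayed(worker_thread_env)() for _ in range(2)
)
print(results)
```

Running the same snippet with `backend="dask"` or the ray joblib backend would show whether those backends propagate the limits.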
That's a good question. I don't think so, but I think we should investigate. It's possible that ray and dask do something smart by default, outside of the joblib connector itself.
Indeed, I was wondering about the best place to add this file and thought that this case is similar to the approximate nearest neighbors example, which is both a benchmark and a demo of a scikit-learn tool (in the case of the current PR, the parameter). I am certainly open to moving it to the benchmark directory, though I tried to keep a beginner-friendly narrative, which I find more appropriate for the example gallery. WDYT?
Currently, examples run nightly to ensure that they work with the latest version of scikit-learn. The example in this PR is a snapshot of performance with fixed versions of scikit-learn and dependencies.
In both cases, I'd like to see the library versions used to generate the example so others can reproduce it.
As far as I understand, the example gallery does not run nightly for files without the
That is true, and I prefer not to have them, because we cannot be sure they still work. Edit: I am +0.25 for adding this example, as long as scikit-learn's version and the dependencies' versions are displayed at the beginning of this example.
Thinking it over, I am +0.25 for adding this example, as long as scikit-learn's version and the dependencies' versions are displayed at the beginning of this example. With non-running examples, I'm still concerned that either the code no longer works or the benchmark results will be out of date when scikit-learn or the dependencies update. Given that the examples are rendered into HTML and are public-facing, we could be showcasing code that may no longer run.
```python
from time import time

import numpy as np
from sklearn.model_selection import cross_validate
from threadpoolctl import threadpool_limits

# N_CORES, clf, X, y and cv are defined earlier in the example.
# Grid of thread counts: powers of two from 1 up to 2 * N_CORES.
n_threads_grid = 2 ** np.arange(np.log2(2 * N_CORES).astype(np.int32) + 1)

for n_threads in n_threads_grid:
    tic = time()
    with threadpool_limits(limits=int(n_threads)):
        cross_validate(clf, X, y, cv=cv)
    toc = time()
    print(f"n_threads: {n_threads}, elapsed time: {toc - tic:.3f} sec")
```
On my laptop (4 physical cores and 8 logical cores) I observe the following:
n_threads: 1, elapsed time: 3.381 sec
n_threads: 2, elapsed time: 3.261 sec
n_threads: 4, elapsed time: 3.235 sec
n_threads: 8, elapsed time: 26.078 sec
The last line reveals a serious oversubscription problem when the number of threads is not limited to the number of physical cores.
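A minimal sketch of the mitigation discussed here: cap the native thread pools at the number of physical cores rather than logical cores. This assumes joblib and threadpoolctl are installed (`only_physical_cores` requires a reasonably recent joblib); the matrix product only stands in for any BLAS/OpenMP-backed workload:

```python
import numpy as np
from joblib import cpu_count
from threadpoolctl import threadpool_limits

# Number of physical cores, ignoring SMT/hyper-threading siblings.
n_physical = cpu_count(only_physical_cores=True)

# Any threadpool-backed computation run under this context manager is
# limited to n_physical threads, avoiding the slowdown observed above
# at n_threads = 8 on a 4-physical-core machine.
rng = np.random.default_rng(0)
a = rng.standard_normal((500, 500))
with threadpool_limits(limits=n_physical):
    result = a @ a.T
print(n_physical, result.shape)
```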
Furthermore, we can observe that the dataset is too small for multi-threading to give a significant speed-up.
n_threads: 8, elapsed time: 26.078 sec
ouch :)
The timings measured in #25714 (comment) are another case of the problem reported by @adrinjalali in #25822: note that the datasets are small in both cases. Furthermore, irrespective of the size of the dataset, we might want to make scikit-learn automatically limit the number of OpenMP threads used in the Cython code to the number of physical cores instead of logical cores. The latter point could be a follow-up to @jeremiedbb's #25918 that changes
Let's put this PR on hold for now because there will not be any simple educational message as long as #25822 is not fixed. Once this is the case, I think it would be interesting to continue this work to show how to use
To me this is more of a user-guide kind of material, which doesn't reproduce on CI but is there as a fixed state, and for that I think it makes sense to have it. Users should be able to reproduce it on their machines, and if we do it in a user guide, then the dataset can be large enough for the points raised here to be valid.
Somewhat related to #30662, as we want the message to hold regardless of the size of the dataset. |
That might be difficult to achieve, but I agree that we should try to protect against the most pathological slowdowns with the default settings. |
Reference Issues/PRs
What does this implement/fix? Explain your changes.
We currently have some very insightful information about parallelism in the user guide, where we even link to this document from @thomasjpfan. But IMHO a demo using different backends might be the cherry on the cake.
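For reference, a minimal sketch of the pattern such a demo would showcase: the same estimator code runs under different joblib backends, with only the context manager changing. This assumes scikit-learn is installed; `"threading"` stands in here for dask or ray, which require extra packages and backend registration, and the estimator and sizes are illustrative, not the ones from the PR:

```python
from joblib import parallel_backend
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=200, random_state=0)
clf = RandomForestClassifier(n_estimators=20, n_jobs=2, random_state=0)

# Switch the joblib backend used by the estimator's internal
# parallelism; third-party backends (dask, ray) plug in the same way
# once registered with joblib.
with parallel_backend("threading"):
    scores = cross_validate(clf, X, y, cv=3)
print(scores["test_score"])
```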
Any other comments?
I tried benchmarking on Spark, but I ran into several compatibility problems, and it did not seem to improve the computing time on a local machine.
This work is a joint effort with @Vincent-Maladiere.