DOC Update "Parallelism, resource management, and configuration" sect… · scikit-learn/scikit-learn@7d1c318 · GitHub

Commit 7d1c318

jjerphan, betatim, and ogrisel authored
DOC Update "Parallelism, resource management, and configuration" section (#24997)
Co-authored-by: Tim Head <betatim@gmail.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
1 parent df14322 commit 7d1c318

File tree

3 files changed: +116 -70 lines changed

doc/computing/parallelism.rst

Lines changed: 91 additions & 59 deletions
@@ -10,38 +10,46 @@ Parallelism, resource management, and configuration
 Parallelism
 -----------
 
-Some scikit-learn estimators and utilities can parallelize costly operations
-using multiple CPU cores, thanks to the following components:
+Some scikit-learn estimators and utilities parallelize costly operations
+using multiple CPU cores.
 
-- via the `joblib <https://joblib.readthedocs.io/en/latest/>`_ library. In
-  this case the number of threads or processes can be controlled with the
-  ``n_jobs`` parameter.
-- via OpenMP, used in C or Cython code.
+Depending on the type of estimator and sometimes the values of the
+constructor parameters, this is either done:
 
-In addition, some of the numpy routines that are used internally by
-scikit-learn may also be parallelized if numpy is installed with specific
-numerical libraries such as MKL, OpenBLAS, or BLIS.
+- with higher-level parallelism via `joblib <https://joblib.readthedocs.io/en/latest/>`_.
+- with lower-level parallelism via OpenMP, used in C or Cython code.
+- with lower-level parallelism via BLAS, used by NumPy and SciPy for generic operations
+  on arrays.
 
-We describe these 3 scenarios in the following subsections.
+The `n_jobs` parameter of estimators always controls the amount of parallelism
+managed by joblib (processes or threads depending on the joblib backend).
+The thread-level parallelism managed by OpenMP in scikit-learn's own Cython code
+or by BLAS & LAPACK libraries used by NumPy and SciPy operations in scikit-learn
+is always controlled by environment variables or `threadpoolctl` as explained below.
+Note that some estimators can leverage all three kinds of parallelism at different
+points of their training and prediction methods.
 
-Joblib-based parallelism
-........................
+We describe these 3 types of parallelism in the following subsections in more detail.
+
+Higher-level parallelism with joblib
+....................................
 
 When the underlying implementation uses joblib, the number of workers
 (threads or processes) that are spawned in parallel can be controlled via the
 ``n_jobs`` parameter.
 
 .. note::
 
-    Where (and how) parallelization happens in the estimators is currently
-    poorly documented. Please help us by improving our docs and tackle `issue
-    14228 <https://github.com/scikit-learn/scikit-learn/issues/14228>`_!
+    Where (and how) parallelization happens in the estimators using joblib by
+    specifying `n_jobs` is currently poorly documented.
+    Please help us by improving our docs and tackle `issue 14228
+    <https://github.com/scikit-learn/scikit-learn/issues/14228>`_!
 
 Joblib is able to support both multi-processing and multi-threading. Whether
 joblib chooses to spawn a thread or a process depends on the **backend**
 that it's using.
 
-Scikit-learn generally relies on the ``loky`` backend, which is joblib's
+scikit-learn generally relies on the ``loky`` backend, which is joblib's
 default backend. Loky is a multi-processing backend. When doing
 multi-processing, in order to avoid duplicating the memory in each process
 (which isn't reasonable with big datasets), joblib will create a `memmap
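As an aside, a minimal sketch of this joblib-controlled layer; the estimator, dataset, and parameter values below are illustrative assumptions, not something prescribed by this commit::

    from joblib import parallel_backend
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=1_000, random_state=0)

    # n_jobs controls the joblib-managed parallelism: here, 2 workers
    # (processes or threads, depending on the active joblib backend).
    clf = RandomForestClassifier(n_estimators=100, n_jobs=2, random_state=0)
    clf.fit(X, y)

    # Override the default loky (multi-processing) backend with the
    # multi-threading backend for the duration of the block.
    with parallel_backend("threading", n_jobs=2):
        clf.fit(X, y)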
@@ -70,40 +78,57 @@ that increasing the number of workers is always a good thing. In some cases
 it can be highly detrimental to performance to run multiple copies of some
 estimators or functions in parallel (see oversubscription below).
 
-OpenMP-based parallelism
-........................
+Lower-level parallelism with OpenMP
+...................................
 
 OpenMP is used to parallelize code written in Cython or C, relying on
-multi-threading exclusively. By default (and unless joblib is trying to
-avoid oversubscription), the implementation will use as many threads as
-possible.
+multi-threading exclusively. By default, the implementations using OpenMP
+will use as many threads as possible, i.e. as many threads as logical cores.
 
-You can control the exact number of threads that are used via the
-``OMP_NUM_THREADS`` environment variable:
+You can control the exact number of threads that are used either:
 
-.. prompt:: bash $
+- via the ``OMP_NUM_THREADS`` environment variable, for instance when
+  running a Python script:
+
+  .. prompt:: bash $
+
+      OMP_NUM_THREADS=4 python my_script.py
 
-  OMP_NUM_THREADS=4 python my_script.py
+- or via `threadpoolctl` as explained by `this piece of documentation
+  <https://github.com/joblib/threadpoolctl/#setting-the-maximum-size-of-thread-pools>`_.
 
-Parallel Numpy routines from numerical libraries
-................................................
+Parallel NumPy and SciPy routines from numerical libraries
+..........................................................
 
-Scikit-learn relies heavily on NumPy and SciPy, which internally call
-multi-threaded linear algebra routines implemented in libraries such as MKL,
-OpenBLAS or BLIS.
+scikit-learn relies heavily on NumPy and SciPy, which internally call
+multi-threaded linear algebra routines (BLAS & LAPACK) implemented in libraries
+such as MKL, OpenBLAS or BLIS.
 
-The number of threads used by the OpenBLAS, MKL or BLIS libraries can be set
-via the ``MKL_NUM_THREADS``, ``OPENBLAS_NUM_THREADS``, and
-``BLIS_NUM_THREADS`` environment variables.
+You can control the exact number of threads used by BLAS for each library
+using environment variables, namely:
+
+- ``MKL_NUM_THREADS`` sets the number of threads MKL uses,
+- ``OPENBLAS_NUM_THREADS`` sets the number of threads OpenBLAS uses,
+- ``BLIS_NUM_THREADS`` sets the number of threads BLIS uses.
+
+Note that BLAS & LAPACK implementations can also be impacted by
+`OMP_NUM_THREADS`. To check whether this is the case in your environment,
+you can inspect how the number of threads effectively used by those libraries
+is affected when running the following command in a bash or zsh terminal
+for different values of `OMP_NUM_THREADS`:
+
+.. prompt:: bash $
 
-Please note that scikit-learn has no direct control over these
-implementations. Scikit-learn solely relies on Numpy and Scipy.
+    OMP_NUM_THREADS=2 python -m threadpoolctl -i numpy scipy
 
 .. note::
-    At the time of writing (2019), NumPy and SciPy packages distributed on
-    pypi.org (used by ``pip``) and on the conda-forge channel are linked
-    with OpenBLAS, while conda packages shipped on the "defaults" channel
-    from anaconda.org are linked by default with MKL.
+    At the time of writing (2022), NumPy and SciPy packages which are
+    distributed on pypi.org (i.e. the ones installed via ``pip install``)
+    and on the conda-forge channel (i.e. the ones installed via
+    ``conda install --channel conda-forge``) are linked with OpenBLAS, while
+    NumPy and SciPy packages shipped on the ``defaults`` conda
+    channel from Anaconda.org (i.e. the ones installed via ``conda install``)
+    are linked by default with MKL.
 
 
 Oversubscription: spawning too many threads
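As a complement to the environment variables above, a minimal sketch of inspecting and limiting these native thread pools from Python, assuming `threadpoolctl` is installed::

    from threadpoolctl import threadpool_info, threadpool_limits

    # List the native thread pools (OpenMP runtimes, BLAS libraries)
    # loaded in the current process, with their current sizes.
    for pool in threadpool_info():
        print(pool["user_api"], pool["num_threads"])

    # Temporarily cap all OpenMP thread pools at 2 threads.
    with threadpool_limits(limits=2, user_api="openmp"):
        pass  # run OpenMP-parallelized code here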
@@ -120,8 +145,8 @@ with ``n_jobs=8`` over a
 OpenMP). Each instance of
 :class:`~sklearn.ensemble.HistGradientBoostingClassifier` will spawn 8 threads
 (since you have 8 CPUs). That's a total of ``8 * 8 = 64`` threads, which
-leads to oversubscription of physical CPU resources and to scheduling
-overhead.
+leads to oversubscription of threads for physical CPU resources and thus
+to scheduling overhead.
 
 Oversubscription can arise in the exact same fashion with parallelized
 routines from MKL, OpenBLAS or BLIS that are nested in joblib calls.
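The scenario described above can be written down as the following sketch; the parameter grid is a made-up example and 8 CPUs are assumed, as in the text::

    from sklearn.datasets import make_classification
    from sklearn.ensemble import HistGradientBoostingClassifier
    from sklearn.model_selection import GridSearchCV

    X, y = make_classification(n_samples=1_000, random_state=0)

    # 8 joblib workers each fit an estimator whose OpenMP code spawns
    # 8 threads on an 8-core machine: up to 8 * 8 = 64 threads compete
    # for 8 CPUs unless the thread pools are limited.
    search = GridSearchCV(
        HistGradientBoostingClassifier(),
        param_grid={"max_leaf_nodes": [15, 31, 63]},
        n_jobs=8,
    )
    search.fit(X, y)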
@@ -146,38 +171,34 @@ Note that:
   only use ``<LIB>_NUM_THREADS``. Joblib exposes a context manager for
   finer control over the number of threads in its workers (see joblib docs
   linked below).
-- Joblib is currently unable to avoid oversubscription in a
-  multi-threading context. It can only do so with the ``loky`` backend
-  (which spawns processes).
+- When joblib is configured to use the ``threading`` backend, there is no
+  mechanism to avoid oversubscription when calling into parallel native
+  libraries in the joblib-managed threads.
+- All scikit-learn estimators that explicitly rely on OpenMP in their Cython code
+  always use `threadpoolctl` internally to automatically adapt the numbers of
+  threads used by OpenMP and potentially nested BLAS calls so as to avoid
+  oversubscription.
 
 You will find additional details about joblib mitigation of oversubscription
 in `joblib documentation
 <https://joblib.readthedocs.io/en/latest/parallel.html#avoiding-over-subscription-of-cpu-resources>`_.
 
+You will find additional details about parallelism in numerical Python libraries
+in `this document from Thomas J. Fan <https://thomasjpfan.github.io/parallelism-python-libraries-design/>`_.
 
 Configuration switches
 ----------------------
 
-Python runtime
-..............
+Python API
+..........
 
-:func:`sklearn.set_config` controls the following behaviors:
-
-`assume_finite`
-~~~~~~~~~~~~~~~
-
-Used to skip validation, which enables faster computations but may lead to
-segmentation faults if the data contains NaNs.
-
-`working_memory`
-~~~~~~~~~~~~~~~~
-
-The optimal size of temporary arrays used by some algorithms.
+:func:`sklearn.set_config` and :func:`sklearn.config_context` can be used to change
+parameters of the configuration which control aspects of parallelism.
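For reference, a minimal sketch of both entry points; the `assume_finite` and `working_memory` values are illustrative::

    import sklearn

    # Change the global configuration in place.
    sklearn.set_config(assume_finite=True)

    # Or scope a change to a block with the context manager; the
    # previous configuration is restored on exit.
    with sklearn.config_context(working_memory=128):
        pass  # chunked operations here use at most ~128 MiB of temporary arrays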

 .. _environment_variable:
 
 Environment variables
-......................
+.....................
 
 These environment variables should be set before importing scikit-learn.
@@ -277,3 +298,14 @@ float64 data.
 When this environment variable is set to a non zero value, the `Cython`
 derivative, `boundscheck` is set to `True`. This is useful for finding
 segfaults.
+
+`SKLEARN_PAIRWISE_DIST_CHUNK_SIZE`
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This sets the chunk size to be used by the underlying `PairwiseDistancesReductions`
+implementations. The default value is `256`, which has been shown to be adequate on
+most machines.
+
+Users looking for the best performance might want to tune this variable using
+powers of 2 so as to get the best parallelism behavior for their hardware,
+especially with respect to their caches' sizes.
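Because these variables must be set before scikit-learn is imported, one way to experiment with this setting from Python is a sketch such as the following; the value `512` is an illustrative power of 2, not a recommendation::

    import os

    # Must happen before the first scikit-learn import.
    os.environ["SKLEARN_PAIRWISE_DIST_CHUNK_SIZE"] = "512"

    import sklearn  # noqa: E402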

sklearn/metrics/pairwise.py

Lines changed: 12 additions & 6 deletions
@@ -688,9 +688,12 @@ def pairwise_distances_argmin_min(
         values = values.flatten()
         indices = indices.flatten()
     else:
-        # TODO: once BaseDistanceReductionDispatcher supports distance metrics
-        # for boolean datasets, we won't need to fallback to
-        # pairwise_distances_chunked anymore.
+        # Joblib-based backend, which is used when user-defined callables
+        # are passed as metric.
+
+        # This won't be used in the future once PairwiseDistancesReductions support:
+        # - DistanceMetrics which work on supposedly binary data
+        # - CSR-dense and dense-CSR case if 'euclidean' in metric.
 
         # Turn off check for finiteness because this is costly and because arrays
         # have already been validated.
@@ -800,9 +803,12 @@ def pairwise_distances_argmin(X, Y, *, axis=1, metric="euclidean", metric_kwargs
         )
         indices = indices.flatten()
     else:
-        # TODO: once BaseDistanceReductionDispatcher supports distance metrics
-        # for boolean datasets, we won't need to fallback to
-        # pairwise_distances_chunked anymore.
+        # Joblib-based backend, which is used when user-defined callables
+        # are passed as metric.
+
+        # This won't be used in the future once PairwiseDistancesReductions support:
+        # - DistanceMetrics which work on supposedly binary data
+        # - CSR-dense and dense-CSR case if 'euclidean' in metric.
 
         # Turn off check for finiteness because this is costly and because arrays
         # have already been validated.
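For illustration, a user-defined callable metric routes `pairwise_distances_argmin_min` through this joblib-based fallback; a sketch with made-up data::

    import numpy as np
    from sklearn.metrics import pairwise_distances_argmin_min

    rng = np.random.RandomState(0)
    X, Y = rng.rand(5, 3), rng.rand(7, 3)

    def manhattan(a, b):
        # User-defined callable metric: triggers the joblib-based backend.
        return np.abs(a - b).sum()

    argmin, distances = pairwise_distances_argmin_min(X, Y, metric=manhattan)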

sklearn/neighbors/_base.py

Lines changed: 13 additions & 5 deletions
@@ -839,9 +839,13 @@ class from an array representing our data set and ask who's
             )
 
         elif self._fit_method == "brute":
-            # TODO: should no longer be needed once ArgKmin
-            # is extended to accept sparse and/or float32 inputs.
+            # Joblib-based backend, which is used when user-defined callables
+            # are passed as metric.
 
+            # This won't be used in the future once PairwiseDistancesReductions
+            # support:
+            # - DistanceMetrics which work on supposedly binary data
+            # - CSR-dense and dense-CSR case if 'euclidean' in metric.
             reduce_func = partial(
                 self._kneighbors_reduce_func,
                 n_neighbors=n_neighbors,
@@ -1173,9 +1177,13 @@ class from an array representing our data set and ask who's
             )
 
         elif self._fit_method == "brute":
-            # TODO: should no longer be needed once we have Cython-optimized
-            # implementation for radius queries, with support for sparse and/or
-            # float32 inputs.
+            # Joblib-based backend, which is used when user-defined callables
+            # are passed as metric.
+
+            # This won't be used in the future once PairwiseDistancesReductions
+            # support:
+            # - DistanceMetrics which work on supposedly binary data
+            # - CSR-dense and dense-CSR case if 'euclidean' in metric.
 
             # for efficiency, use squared euclidean distances
             if self.effective_metric_ == "euclidean":
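Similarly, a sketch of a brute-force query with a user-defined callable metric, which exercises the joblib-based path annotated above; the data is made up::

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    rng = np.random.RandomState(0)
    X = rng.rand(20, 3)

    def chebyshev(a, b):
        # User-defined callable metric: handled by the joblib-based backend.
        return np.max(np.abs(a - b))

    nn = NearestNeighbors(n_neighbors=3, algorithm="brute", metric=chebyshev)
    nn.fit(X)
    distances, indices = nn.kneighbors(X[:2])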

0 commit comments