[MRG] DOC More details about parallelism (joblib, openMP, MKL...) (#1… · scikit-learn/scikit-learn@60ce68c
Commit 60ce68c

NicolasHug authored and ogrisel committed
[MRG] DOC More details about parallelism (joblib, openMP, MKL...) (#15116)
1 parent d0d8f20 commit 60ce68c

File tree

4 files changed: +171 −66 lines changed

doc/faq.rst

Lines changed: 11 additions & 14 deletions
@@ -299,23 +299,20 @@ documentation <https://docs.python.org/3/library/multiprocessing.html#contexts-a
 
 .. _faq_mkl_threading:
 
-Why does my job use more cores than specified with n_jobs under OSX or Linux?
------------------------------------------------------------------------------
+Why does my job use more cores than specified with n_jobs?
+----------------------------------------------------------
 
-This happens when vectorized numpy operations are handled by libraries such
-as MKL or OpenBLAS.
+This is because ``n_jobs`` only controls the number of jobs for
+routines that are parallelized with ``joblib``, but parallel code can come
+from other sources:
 
-While scikit-learn adheres to the limit set by ``n_jobs``,
-numpy operations vectorized using MKL (or OpenBLAS) will make use of multiple
-threads within each scikit-learn job (thread or process).
+- some routines may be parallelized with OpenMP (for code written in C or
+  Cython).
+- scikit-learn relies a lot on numpy, which in turn may rely on numerical
+  libraries like MKL, OpenBLAS or BLIS which can provide parallel
+  implementations.
 
-The number of threads used by the BLAS library can be set via an environment
-variable. For example, to set the maximum number of threads to some integer
-value ``N``, the following environment variables should be set:
-
-* For MKL: ``export MKL_NUM_THREADS=N``
-
-* For OpenBLAS: ``export OPENBLAS_NUM_THREADS=N``
+For more details, please refer to our :ref:`Parallelism notes <parallelism>`.
 
 
 Why is there no support for deep or reinforcement learning / Will there be support for deep or reinforcement learning in scikit-learn?
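
To illustrate the distinction drawn by the new FAQ entry, here is a minimal, purely illustrative Python sketch (the estimator, dataset and thread counts are arbitrary choices, not part of this commit): the environment variables from the removed text cap the numerical libraries' thread pools, while ``n_jobs`` only caps joblib workers::

    import os

    # Must be set before numpy/scipy/scikit-learn are imported to reliably
    # take effect; these variables cap the numerical libraries' thread pools.
    os.environ["OMP_NUM_THREADS"] = "1"        # OpenMP (Cython/C code)
    os.environ["MKL_NUM_THREADS"] = "1"        # numpy/scipy linked with MKL
    os.environ["OPENBLAS_NUM_THREADS"] = "1"   # numpy/scipy linked with OpenBLAS

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=1000, random_state=0)
    # n_jobs only controls joblib workers (here, trees fitted in parallel);
    # the thread pools above are governed by the environment variables.
    RandomForestClassifier(n_estimators=100, n_jobs=2, random_state=0).fit(X, y)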

doc/glossary.rst

Lines changed: 17 additions & 39 deletions
@@ -1508,45 +1508,23 @@ functions or non-estimator constructors.
     early.
 
 ``n_jobs``
-    This is used to specify how many concurrent processes/threads should be
-    used for parallelized routines. Scikit-learn uses one processor for
-    its processing by default, although it also makes use of NumPy, which
-    may be configured to use a threaded numerical processor library (like
-    MKL; see :ref:`FAQ <faq_mkl_threading>`).
-
-    ``n_jobs`` is an int, specifying the maximum number of concurrently
-    running jobs. If set to -1, all CPUs are used. If 1 is given, no
-    joblib level parallelism is used at all, which is useful for
-    debugging. Even with ``n_jobs = 1``, parallelism may occur due to
-    numerical processing libraries (see :ref:`FAQ <faq_mkl_threading>`).
-    For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. Thus for
-    ``n_jobs = -2``, all CPUs but one are used.
-
-    ``n_jobs=None`` means *unset*; it will generally be interpreted as
-    ``n_jobs=1``, unless the current :class:`joblib.Parallel` backend
-    context specifies otherwise.
-
-    The use of ``n_jobs``-based parallelism in estimators varies:
-
-    * Most often parallelism happens in :term:`fitting <fit>`, but
-      sometimes parallelism happens in prediction (e.g. in random forests).
-    * Some parallelism uses a multi-threading backend by default, some
-      a multi-processing backend. It is possible to override the default
-      backend by using :func:`sklearn.utils.parallel_backend`.
-    * Whether parallel processing is helpful at improving runtime depends
-      on many factors, and it's usually a good idea to experiment rather
-      than assuming that increasing the number of jobs is always a good
-      thing. *It can be highly detrimental to performance to run multiple
-      copies of some estimators or functions in parallel.*
-
-    Nested uses of ``n_jobs``-based parallelism with the same backend will
-    result in an exception.
-    So ``GridSearchCV(OneVsRestClassifier(SVC(), n_jobs=2), n_jobs=2)``
-    won't work.
-
-    When ``n_jobs`` is not 1, the estimator being parallelized must be
-    picklable. This means, for instance, that lambdas cannot be used
-    as estimator parameters.
+    This parameter is used to specify how many concurrent processes or
+    threads should be used for routines that are parallelized with
+    :term:`joblib`.
+
+    ``n_jobs`` is an integer, specifying the maximum number of concurrently
+    running workers. If 1 is given, no joblib parallelism is used at all,
+    which is useful for debugging. If set to -1, all CPUs are used. For
+    ``n_jobs`` below -1, (n_cpus + 1 + n_jobs) are used. For example with
+    ``n_jobs=-2``, all CPUs but one are used.
+
+    ``n_jobs`` is ``None`` by default, which means *unset*; it will
+    generally be interpreted as ``n_jobs=1``, unless the current
+    :class:`joblib.Parallel` backend context specifies otherwise.
+
+    For more details on the use of ``joblib`` and its interactions with
+    scikit-learn, please refer to our :ref:`parallelism notes
+    <parallelism>`.
 
 ``pos_label``
     Value with which positive labels must be encoded in binary
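
As a purely illustrative aside (not part of the glossary diff), the ``n_jobs`` values described above are typically passed like this; the estimator and dataset are arbitrary examples::

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_iris(return_X_y=True)
    clf = RandomForestClassifier(n_estimators=50, random_state=0)

    cross_val_score(clf, X, y, cv=5, n_jobs=2)     # at most 2 joblib workers
    cross_val_score(clf, X, y, cv=5, n_jobs=-1)    # use all CPUs
    cross_val_score(clf, X, y, cv=5, n_jobs=-2)    # all CPUs but one
    cross_val_score(clf, X, y, cv=5, n_jobs=None)  # unset, usually means 1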

doc/modules/computing.rst

Lines changed: 141 additions & 9 deletions
@@ -504,20 +504,152 @@ Links
 - `Scipy sparse matrix formats documentation <https://docs.scipy.org/doc/scipy/reference/sparse.html>`_
 
 Parallelism, resource management, and configuration
-=====================================================
+===================================================
 
 .. _parallelism:
 
-Parallel and distributed computing
------------------------------------
+Parallelism
+-----------
 
-Scikit-learn uses the `joblib <https://joblib.readthedocs.io/en/latest/>`__
-library to enable parallel computing inside its estimators. See the
-joblib documentation for the switches to control parallel computing.
+Some scikit-learn estimators and utilities can parallelize costly operations
+using multiple CPU cores, thanks to the following components:
+
+- via the `joblib <https://joblib.readthedocs.io/en/latest/>`_ library. In
+  this case the number of threads or processes can be controlled with the
+  ``n_jobs`` parameter.
+- via OpenMP, used in C or Cython code.
+
+In addition, some of the numpy routines that are used internally by
+scikit-learn may also be parallelized if numpy is installed with specific
+numerical libraries such as MKL, OpenBLAS, or BLIS.
+
+We describe these 3 scenarios in the following subsections.
+
+Joblib-based parallelism
+........................
+
+When the underlying implementation uses joblib, the number of workers
+(threads or processes) that are spawned in parallel can be controlled via the
+``n_jobs`` parameter.
+
+.. note::
+
+    Where (and how) parallelization happens in the estimators is currently
+    poorly documented. Please help us by improving our docs and tackle `issue
+    14228 <https://github.com/scikit-learn/scikit-learn/issues/14228>`_!
+
+Joblib is able to support both multi-processing and multi-threading. Whether
+joblib chooses to spawn a thread or a process depends on the **backend**
+that it's using.
+
+Scikit-learn generally relies on the ``loky`` backend, which is joblib's
+default backend. Loky is a multi-processing backend. When doing
+multi-processing, in order to avoid duplicating the memory in each process
+(which isn't reasonable with big datasets), joblib will create a `memmap
+<https://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html>`_
+that all processes can share, when the data is bigger than 1MB.
+
+In some specific cases (when the code that is run in parallel releases the
+GIL), scikit-learn will indicate to ``joblib`` that a multi-threading
+backend is preferable.
+
+As a user, you may control the backend that joblib will use (regardless of
+what scikit-learn recommends) by using a context manager::
+
+    from joblib import parallel_backend
+
+    with parallel_backend('threading', n_jobs=2):
+        # Your scikit-learn code here
+
+Please refer to the `joblib's docs
+<https://joblib.readthedocs.io/en/latest/parallel.html#thread-based-parallelism-vs-process-based-parallelism>`_
+for more details.
+
+In practice, whether parallelism is helpful at improving runtime depends on
+many factors. It is usually a good idea to experiment rather than assuming
+that increasing the number of workers is always a good thing. In some cases
+it can be highly detrimental to performance to run multiple copies of some
+estimators or functions in parallel (see oversubscription below).
+
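As a purely illustrative sketch of how the ``n_jobs`` parameter and the backend override above typically combine in user code (the estimator, parameter grid and data are arbitrary choices, not part of this commit)::

    from joblib import parallel_backend
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=500, random_state=0)
    search = GridSearchCV(SVC(), {"C": [0.1, 1.0, 10.0]}, cv=5, n_jobs=2)

    # Override scikit-learn's preferred backend: use threads instead of the
    # default loky processes for the joblib calls made during fit.
    with parallel_backend('threading', n_jobs=2):
        search.fit(X, y)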
+OpenMP-based parallelism
+........................
+
+OpenMP is used to parallelize code written in Cython or C, relying on
+multi-threading exclusively. By default (and unless joblib is trying to
+avoid oversubscription), the implementation will use as many threads as
+possible.
+
+You can control the exact number of threads that are used via the
+``OMP_NUM_THREADS`` environment variable::
+
+    OMP_NUM_THREADS=4 python my_script.py
+
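As a hedged aside (not part of the diff): if exporting the variable in the shell is inconvenient, setting it from Python before scikit-learn is imported generally has the same effect, assuming the OpenMP runtime has not been initialized yet; a minimal sketch::

    import os

    # Set before importing scikit-learn so that the OpenMP runtime used by
    # its compiled extensions picks up the limit.
    os.environ["OMP_NUM_THREADS"] = "4"

    import sklearn  # imported only after the variable is set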
+Parallel Numpy routines from numerical libraries
+................................................
+
+Scikit-learn relies heavily on NumPy and SciPy, which internally call
+multi-threaded linear algebra routines implemented in libraries such as MKL,
+OpenBLAS or BLIS.
+
+The number of threads used by the OpenBLAS, MKL or BLIS libraries can be set
+via the ``MKL_NUM_THREADS``, ``OPENBLAS_NUM_THREADS``, and
+``BLIS_NUM_THREADS`` environment variables.
+
+Please note that scikit-learn has no direct control over these
+implementations. Scikit-learn solely relies on Numpy and Scipy.
+
+.. note::
+    At the time of writing (2019), NumPy and SciPy packages distributed on
+    pypi.org (used by ``pip``) and on the conda-forge channel are linked
+    with OpenBLAS, while conda packages shipped on the "defaults" channel
+    from anaconda.org are linked by default with MKL.
+
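As an illustrative aside (not part of the diff), the `threadpoolctl <https://github.com/joblib/threadpoolctl>`_ package, if installed, can be used to inspect which of these libraries numpy is linked against and to cap their threads at runtime instead of via environment variables; a minimal sketch::

    import numpy as np
    from threadpoolctl import threadpool_info, threadpool_limits

    # List the native thread pools (OpenBLAS, MKL, OpenMP...) loaded in the
    # process, with their current number of threads.
    for pool in threadpool_info():
        print(pool["internal_api"], pool["num_threads"])

    a = np.random.RandomState(0).rand(2000, 2000)
    # Temporarily cap the BLAS thread pools to 2 threads for this product.
    with threadpool_limits(limits=2, user_api="blas"):
        a @ a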
+Oversubscription: spawning too many threads
+...........................................
+
+It is generally recommended to avoid using significantly more processes or
+threads than the number of CPUs on a machine. Over-subscription happens when
+a program is running too many threads at the same time.
+
+Suppose you have a machine with 8 CPUs. Consider a case where you're running
+a :class:`~GridSearchCV` (parallelized with joblib) with ``n_jobs=8`` over
+a :class:`~HistGradientBoostingClassifier` (parallelized with OpenMP). Each
+instance of :class:`~HistGradientBoostingClassifier` will spawn 8 threads
+(since you have 8 CPUs). That's a total of ``8 * 8 = 64`` threads, which
+leads to oversubscription of physical CPU resources and to scheduling
+overhead.
+
+Oversubscription can arise in the exact same fashion with parallelized
+routines from MKL, OpenBLAS or BLIS that are nested in joblib calls.
+
+Starting from ``joblib >= 0.14``, when the ``loky`` backend is used (which
+is the default), joblib will tell its child **processes** to limit the
+number of threads they can use, so as to avoid oversubscription. In practice
+the heuristic that joblib uses is to tell the processes to use ``max_threads
+= n_cpus // n_jobs``, via their corresponding environment variable. Back to
+our example from above, since the joblib backend of :class:`~GridSearchCV`
+is ``loky``, each process will only be able to use 1 thread instead of 8,
+thus mitigating the oversubscription issue.
+
+Note that:
+
+- Manually setting one of the environment variables (``OMP_NUM_THREADS``,
+  ``MKL_NUM_THREADS``, ``OPENBLAS_NUM_THREADS``, or ``BLIS_NUM_THREADS``)
+  will take precedence over what joblib tries to do. The total number of
+  threads will be ``n_jobs * <LIB>_NUM_THREADS``. Note that setting this
+  limit will also impact your computations in the main process, which will
+  only use ``<LIB>_NUM_THREADS``. Joblib exposes a context manager for
+  finer control over the number of threads in its workers (see joblib docs
+  linked below).
+- Joblib is currently unable to avoid oversubscription in a
+  multi-threading context. It can only do so with the ``loky`` backend
+  (which spawns processes).
+
+You will find additional details about joblib mitigation of oversubscription
+in `joblib documentation
+<https://joblib.readthedocs.io/en/latest/parallel.html#avoiding-over-subscription-of-cpu-ressources>`_.
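
For illustration only (not part of the diff), a hedged sketch of the "context manager for finer control" mentioned in the first bullet, assuming ``joblib >= 0.14`` where ``parallel_backend`` accepts an ``inner_max_num_threads`` argument; the estimator, grid and data are arbitrary examples::

    from joblib import parallel_backend
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    X, y = make_classification(n_samples=1000, random_state=0)
    search = GridSearchCV(RandomForestClassifier(random_state=0),
                          {"max_depth": [2, 4, 8]}, cv=5, n_jobs=8)

    # Each loky worker process is told to use at most 2 threads in its own
    # nested thread pools (OpenMP, BLAS), i.e. 8 * 2 threads in total.
    with parallel_backend('loky', inner_max_num_threads=2):
        search.fit(X, y)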
 
-Note that, by default, scikit-learn uses its embedded (vendored) version
-of joblib. A configuration switch (documented below) controls this
-behavior.
 
 Configuration switches
 -----------------------

doc/modules/ensemble.rst

Lines changed: 2 additions & 4 deletions
@@ -946,10 +946,8 @@ Low-level parallelism
 
 :class:`HistGradientBoostingClassifier` and
 :class:`HistGradientBoostingRegressor` have implementations that use OpenMP
-for parallelization through Cython. The number of threads that is used can
-be changed using the ``OMP_NUM_THREADS`` environment variable. By default,
-all available cores are used. Please refer to the OpenMP documentation for
-details.
+for parallelization through Cython. For more details on how to control the
+number of threads, please refer to our :ref:`parallelism` notes.
 
 The following parts are parallelized:
 