@@ -504,20 +504,152 @@ Links
- `Scipy sparse matrix formats documentation <https://docs.scipy.org/doc/scipy/reference/sparse.html>`_

Parallelism, resource management, and configuration
- =====================================================
+ ===================================================

.. _parallelism:

- Parallel and distributed computing
- -----------------------------------
+ Parallelism
+ -----------

- Scikit-learn uses the `joblib <https://joblib.readthedocs.io/en/latest/>`__
- library to enable parallel computing inside its estimators. See the
- joblib documentation for the switches to control parallel computing.
+ Some scikit-learn estimators and utilities can parallelize costly operations
+ using multiple CPU cores, thanks to the following components:
+
+ - via the `joblib <https://joblib.readthedocs.io/en/latest/>`_ library. In
+   this case the number of threads or processes can be controlled with the
+   ``n_jobs`` parameter.
+ - via OpenMP, used in C or Cython code.
+
+ In addition, some of the numpy routines that are used internally by
+ scikit-learn may also be parallelized if numpy is installed with specific
+ numerical libraries such as MKL, OpenBLAS, or BLIS.
+
+ We describe these 3 scenarios in the following subsections.
+
+ Joblib-based parallelism
+ ........................
+
+ When the underlying implementation uses joblib, the number of workers
+ (threads or processes) that are spawned in parallel can be controlled via the
+ ``n_jobs`` parameter.
+
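+ For example (an illustrative sketch only; the estimator and values are
+ arbitrary), many estimators and utilities accept ``n_jobs`` directly::
+
+     from sklearn.datasets import make_classification
+     from sklearn.ensemble import RandomForestClassifier
+
+     X, y = make_classification(n_samples=1000, random_state=0)
+
+     # Fit the trees using up to 4 joblib workers.
+     clf = RandomForestClassifier(n_estimators=100, n_jobs=4)
+     clf.fit(X, y)
+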
+ .. note::
+
+     Where (and how) parallelization happens in the estimators is currently
+     poorly documented. Please help us by improving our docs and tackling
+     `issue 14228 <https://github.com/scikit-learn/scikit-learn/issues/14228>`_!
+
+ Joblib is able to support both multi-processing and multi-threading. Whether
+ joblib chooses to spawn a thread or a process depends on the **backend**
+ that it's using.
+
+ Scikit-learn generally relies on the ``loky`` backend, which is joblib's
+ default backend. Loky is a multi-processing backend. When doing
+ multi-processing, in order to avoid duplicating the memory in each process
+ (which isn't reasonable with big datasets), joblib will create a `memmap
+ <https://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html>`_
+ that all processes can share, when the data is bigger than 1MB.
+
+ In some specific cases (when the code that is run in parallel releases the
+ GIL), scikit-learn will indicate to ``joblib`` that a multi-threading
+ backend is preferable.
+
+ As a user, you may control the backend that joblib will use (regardless of
+ what scikit-learn recommends) by using a context manager::
+
+     from joblib import parallel_backend
+
+     with parallel_backend('threading', n_jobs=2):
+         # Your scikit-learn code here
+
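+ For illustration, a concrete (and arbitrary) example of scikit-learn code
+ run under a user-chosen backend; the dataset and estimator are placeholders::
+
+     from joblib import parallel_backend
+     from sklearn.datasets import make_classification
+     from sklearn.linear_model import LogisticRegression
+     from sklearn.model_selection import cross_val_score
+
+     X, y = make_classification(n_samples=1000, random_state=0)
+
+     # Run the cross-validation fits with the threading backend and 2 workers.
+     with parallel_backend('threading', n_jobs=2):
+         scores = cross_val_score(LogisticRegression(), X, y, cv=5)
+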
+ Please refer to `joblib's docs
+ <https://joblib.readthedocs.io/en/latest/parallel.html#thread-based-parallelism-vs-process-based-parallelism>`_
+ for more details.
+
+ In practice, whether parallelism is helpful for improving runtime depends on
+ many factors. It is usually a good idea to experiment rather than assuming
+ that increasing the number of workers is always a good thing. In some cases
+ it can be highly detrimental to performance to run multiple copies of some
+ estimators or functions in parallel (see oversubscription below).
+
+ OpenMP-based parallelism
+ ........................
+
+ OpenMP is used to parallelize code written in Cython or C, relying on
+ multi-threading exclusively. By default (and unless joblib is trying to
+ avoid oversubscription), the implementation will use as many threads as
+ possible.
+
+ You can control the exact number of threads that are used via the
+ ``OMP_NUM_THREADS`` environment variable::
+
+     OMP_NUM_THREADS=4 python my_script.py
+
+ Parallel NumPy routines from numerical libraries
+ ................................................
+
+ Scikit-learn relies heavily on NumPy and SciPy, which internally call
+ multi-threaded linear algebra routines implemented in libraries such as MKL,
+ OpenBLAS or BLIS.
+
+ The number of threads used by the OpenBLAS, MKL or BLIS libraries can be set
+ via the ``MKL_NUM_THREADS``, ``OPENBLAS_NUM_THREADS``, and
+ ``BLIS_NUM_THREADS`` environment variables.
+
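+ As a sketch, these variables can also be set from Python before NumPy is
+ imported (assuming here a NumPy build linked against OpenBLAS; adapt the
+ variable name to your BLAS implementation)::
+
+     import os
+
+     # The BLAS library reads this variable when it is loaded, so it must be
+     # set before importing numpy (or scikit-learn, which imports numpy).
+     os.environ['OPENBLAS_NUM_THREADS'] = '2'
+
+     import numpy as np
+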
+ Please note that scikit-learn has no direct control over these
+ implementations. Scikit-learn solely relies on NumPy and SciPy.
+
+ .. note::
+     At the time of writing (2019), NumPy and SciPy packages distributed on
+     pypi.org (used by ``pip``) and on the conda-forge channel are linked
+     with OpenBLAS, while conda packages shipped on the "defaults" channel
+     from anaconda.org are linked by default with MKL.
+
+ Oversubscription: spawning too many threads
+ ...........................................
+
+ It is generally recommended to avoid using significantly more processes or
+ threads than the number of CPUs on a machine. Over-subscription happens when
+ a program is running too many threads at the same time.
+
+ Suppose you have a machine with 8 CPUs. Consider a case where you're running
+ a :class:`~sklearn.model_selection.GridSearchCV` (parallelized with joblib)
+ with ``n_jobs=8`` over a
+ :class:`~sklearn.ensemble.HistGradientBoostingClassifier` (parallelized with
+ OpenMP). Each instance of
+ :class:`~sklearn.ensemble.HistGradientBoostingClassifier` will spawn 8
+ threads (since you have 8 CPUs). That's a total of ``8 * 8 = 64`` threads,
+ which leads to oversubscription of physical CPU resources and to scheduling
+ overhead.
+
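+ A minimal sketch of this nested-parallelism setup (the dataset and parameter
+ grid are arbitrary placeholders; on older scikit-learn versions the
+ histogram-based estimators additionally require the ``sklearn.experimental``
+ enabling import)::
+
+     from sklearn.datasets import make_classification
+     from sklearn.ensemble import HistGradientBoostingClassifier
+     from sklearn.model_selection import GridSearchCV
+
+     X, y = make_classification(n_samples=1000, random_state=0)
+
+     # Outer level: joblib (loky backend) spawns up to 8 worker processes.
+     # Inner level: each HistGradientBoostingClassifier fit uses OpenMP threads.
+     search = GridSearchCV(
+         HistGradientBoostingClassifier(),
+         param_grid={'max_iter': [50, 100], 'learning_rate': [0.05, 0.1]},
+         n_jobs=8,
+     )
+     search.fit(X, y)
+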
+ Oversubscription can arise in the exact same fashion with parallelized
+ routines from MKL, OpenBLAS or BLIS that are nested in joblib calls.
+
+ Starting from ``joblib >= 0.14``, when the ``loky`` backend is used (which
+ is the default), joblib will tell its child **processes** to limit the
+ number of threads they can use, so as to avoid oversubscription. In practice
+ the heuristic that joblib uses is to tell the processes to use
+ ``max_threads = n_cpus // n_jobs``, via their corresponding environment
+ variable. Back to our example from above, since the joblib backend of
+ :class:`~sklearn.model_selection.GridSearchCV` is ``loky``, each process
+ will only be able to use 1 thread instead of 8, thus mitigating the
+ oversubscription issue.
+
+ Note that:
+
+ - Manually setting one of the environment variables (``OMP_NUM_THREADS``,
+   ``MKL_NUM_THREADS``, ``OPENBLAS_NUM_THREADS``, or ``BLIS_NUM_THREADS``)
+   will take precedence over what joblib tries to do. The total number of
+   threads will be ``n_jobs * <LIB>_NUM_THREADS``. Note that setting this
+   limit will also impact your computations in the main process, which will
+   only use ``<LIB>_NUM_THREADS``. Joblib exposes a context manager for
+   finer control over the number of threads in its workers (see the joblib
+   docs linked below and the sketch after this list).
+ - Joblib is currently unable to avoid oversubscription in a
+   multi-threading context. It can only do so with the ``loky`` backend
+   (which spawns processes).
+
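+ For instance, a sketch of that finer control (this uses the
+ ``inner_max_num_threads`` argument of ``parallel_backend`` documented by
+ joblib at the time of writing; check the joblib documentation for your
+ version)::
+
+     from joblib import parallel_backend
+
+     # Ask joblib for 4 worker processes, and cap the number of threads
+     # (OpenMP, MKL, OpenBLAS, BLIS) that each worker may use to 2.
+     with parallel_backend('loky', n_jobs=4, inner_max_num_threads=2):
+         # Your scikit-learn code here
+         ...
+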
+ You will find additional details about joblib's mitigation of
+ oversubscription in the `joblib documentation
+ <https://joblib.readthedocs.io/en/latest/parallel.html#avoiding-over-subscription-of-cpu-ressources>`_.

- Note that, by default, scikit-learn uses its embedded (vendored) version
- of joblib. A configuration switch (documented below) controls this
- behavior.

Configuration switches
-----------------------