Set number of threads after numpy import #11826
We would have to make this pluggable somehow to adapt to different linalg implementations. For instance, OpenBLAS seems to expose a …
MKL has a function https://software.intel.com/en-us/mkl-developer-reference-c-mkl-set-num-threads which can be called through https://pypi.org/project/mkl/. But it seems to me it would be cleaner to have a generic Numpy function for this. Most Numpy users don't even know which linalg library is used in the background.
If it's alright, I would like to work on this. I haven't contributed to numpy before, so I will have to familiarize myself a bit with the codebase. Hence, it would be great if I could get some directions.
@touqir14 see the developer documentation. You should write some tests: tests that try out the new functions, verifying that they indeed set the number of threads you desire (perhaps by writing a C-level tests-only function that calls …).
Note that this is actually quite tricky to tackle. As there unfortunately is no standard API to determine information on the provider, numpy would be reliant on runtime introspection to determine the actual provider of the functions. As there are not very many of them (some I recall are OpenBLAS, ATLAS, BLIS, MKL and reference BLAS) it might be possible, but still difficult to get working portably.
Thanks for the heads up!
This is an extremely annoying problem because I want to explicitly turn off multithreading for my background workers, and as far as I can see there is no way of properly doing that (except downgrading numpy to < 1.14). OMP_NUM_THREADS=1 will solve my problem for the most part, but it will prevent my entire python process, including the main thread, from using OpenMP multithreading, while I would like to disable it only in the background workers.

As of now, I have 8 background workers (multiprocessing.Process), each of which will spawn another 8 threads to do np.dot computation (which is only a very small part of what they are actually doing). That clogs up the entire CPU. Any suggestions on how I could solve this?

Edit: downgrading to numpy 1.14.5 solves the problem. Starting from numpy 1.14.6 it's there.
@FabianIsensee The problem is that NumPy does not know right now what linalg backend you are using: OpenBLAS, MKL, ATLAS, ... Did you see the stackoverflow question mentioned above? It suggests functions you can wrap and call to control this for your backend.
@mattip Thanks for pointing that out! That is a good solution to the problem, provided that the system you are running on uses OpenBLAS. Unfortunately this solution has no effect on my system. What's especially annoying for me is that I am providing an open source framework that runs into this problem, and I cannot know what BLAS library each and every one of the users is going to use.
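For reference, the ctypes approach from that stackoverflow discussion looks roughly like the sketch below. It is best-effort by design: it only works when `openblas_set_num_threads` is actually visible in the process's global symbol table, which depends on how numpy's BLAS was loaded (with RTLD_LOCAL it fails silently), which may explain why it had no effect on some systems:

```python
import ctypes

def try_set_openblas_threads(n):
    """Best-effort call to openblas_set_num_threads on an already-loaded
    OpenBLAS. Returns True on success, False if the symbol is not visible
    (e.g. BLAS was loaded with RTLD_LOCAL, or this is not a POSIX system).
    """
    try:
        # CDLL(None) exposes symbols already loaded into this process (POSIX).
        whole_process = ctypes.CDLL(None)
        func = whole_process.openblas_set_num_threads
    except (OSError, AttributeError, TypeError):
        return False
    func(ctypes.c_int(n))
    return True
```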
The change in 1.14.6 was building manylinux1 against OpenBLAS instead of ATLAS; we were already using OpenBLAS for Windows. As far as the build goes, 1.14.6 is pretty much 1.15.x. It sounds like we need to find a solution to this that doesn't depend on the user knowing which library is in use, probably adding some function in numpy that stores its info during the build.
Thank you for the clarification!
I don't know, but it sounds like we need it. If nothing else we can pass info upstream and try to encourage the libraries themselves to add such a feature, ideally as some sort of standard library interface in BLAS (LAPACK) itself. @njsmith IIRC, you reported that there was some work going on to produce a new standard? In any case, @matthew-brett, I think we could have some effect on OpenBLAS. Maybe there is already such a feature.
Although with a single dynamic library I don't see how one could coordinate between callers. Hmm, not a simple problem, almost something that needs to be handled at the OS level. This is getting beyond my expertise. |
@FabianIsensee this seems troubling
What exactly did you try, and what was the result?
@mattip I did exactly what was described in the comment you referenced above.
These are the two examples I compared. I ran them while looking at htop to see CPU usage. For both of them a number of threads was spawned, and CPU usage was above 100% for the main thread. Running it like this: …
Whether or not a numpy API is feasible for this feature, perhaps we can crowdsource a new section in the numpy docs to explain this issue and offer advice with respect to the different environment variables.

IIUC, the basic advice for OpenBLAS and MKL users would be: if you're planning to use multiple processes (e.g. via …) …

Beyond that, maybe an MKL expert can offer more fine-grained advice with respect to these variables: …

For ATLAS users, apparently the number of threads is predetermined at compile time: …

@charris wrote:
I don't quite grok this point, which is all the more reason I would love to see some docs on this general topic. In which section of the numpy docs should I start a PR on the topic of multithreading control?
Correct me if I'm wrong, but it's not a big problem with multiprocessing, right? At worst you pay some extra overhead for parallelizing more than you have CPU cores, but each process spawns its own threads and works correctly. It's multithreading that is really problematic and causes bugs like #11046.
I would very much disagree here. In the specific situation that I am in, I have a pool of background workers (multiprocessing.Process) that generate batches for a deep learning algorithm. These batches contain images, some of which need to be (among other things) rotated for data augmentation (which is implemented via matrix multiplication of image coordinates).

The machine I am working on is a dgx1 computer with 8 graphics cards and 80 CPU threads. Usually I train 8 different networks on it simultaneously, using 10 workers and 1 GPU each. Now each of these workers (80 in total) will attempt to do these matrix multiplications (which are quite tiny, by the way) in a multithreaded way, and since each worker sees 80 CPUs they will spawn 80 threads each, resulting in the system being completely clogged up. That effectively breaks everything for me, and the only way I can continue my work is to downgrade to numpy 1.14.5.
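One workaround for exactly this over-subscription pattern is to pin the thread-count environment variables in a pool initializer, so each worker's BLAS starts single-threaded. A sketch with hypothetical worker names; the key assumption is that numpy is imported in the child *after* the initializer runs:

```python
import multiprocessing as mp
import os

def limit_blas_threads():
    """Pool initializer: runs in each worker before any work is done.

    Only effective if numpy is imported in the child *after* this runs
    (e.g. with the 'spawn' start method, or a fork taken before the parent
    imported numpy); a fork of a parent whose BLAS thread pool is already
    initialized inherits that pool regardless of these variables.
    """
    for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
        os.environ[var] = "1"

def augment(seed):
    # Hypothetical worker body: the real code would import numpy here and
    # do its small matrix multiplications single-threaded.
    return os.environ.get("OMP_NUM_THREADS")

if __name__ == "__main__":
    with mp.Pool(processes=2, initializer=limit_blas_threads) as pool:
        print(pool.map(augment, range(4)))  # each worker reports "1"
```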
I think we're on the same page, @bbbbbbbbba. In the case of multiprocessing, I'm not worried about incorrect results. As you said, I'm worried about poor performance due to spawning more threads than you can schedule onto your CPU cores. But the penalty is not trivial! In one of my recent use-cases on a 16-core machine, the performance was 6x worse due to the extra threads. On an 80-core machine like @FabianIsensee's, the overhead must be even worse. The threading-related issue you referred to is troubling, but that sounds like an outright bug, not an issue with the …
This code, which is MIT licensed and based on other BSD-licensed routines, probes the loaded DLLs (shared objects) at runtime to find which implementation is relevant, and calls the implementation-specific routine to set the number of threads. Thanks to @ogrisel for this comment pointing it out.
Sorry for being late to the party; I had not seen this issue. Indeed, we started to investigate over-subscription issues a bit in the context of scikit-learn / joblib, but this is still work in progress. Having numpy expose a uniform API to control the behavior of the underlying BLAS thread pool would be nice. Note that @anton-malakhov and @tomMoral are the primary authors of those dynamic ctypes-based accesses to the underlying runtime libraries. Ping @jeremiedbb, who might also be interested in following this discussion.
@seberg the script I linked to uses ctypes and OS-provided functions to walk down the loaded shared objects looking for the one we want. Isn't that what your script does, only using `Popen(['ldd', ...])`? It seems that if the C library provides [dl_iterate_phdr](https://linux.die.net/man/3/dl_iterate_phdr) for Linux, `_dyld_image_count` for macOS, and `GetModuleFileNameExW` for Windows, we should use them.
Ah sorry, forget about the ldd stuff; I just added it because I kept looking at it. No, the first function just loads the multiarray.so with ctypes and checks if certain function symbols are defined... That seems to work for OpenBLAS, MKL, BLIS and ATLAS. But I have no idea if just trying to load function symbols should work, or if e.g. Accelerate is identifiable by the existence of such a symbol. EDIT: OK, never mind my rambling. Tried on Windows, and the stuff probably just randomly works on Linux.
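The symbol-probing idea described above can be sketched with ctypes as follows. The probe symbols are well-known vendor entry points, but whether they are visible in a given shared object depends on platform and linking — as noted, this is an assumption that happens to hold on Linux more often than elsewhere, not a portable guarantee:

```python
import ctypes

# Well-known vendor-specific entry points used as fingerprints.
PROBES = {
    "openblas": "openblas_set_num_threads",
    "mkl": "MKL_Set_Num_Threads",
    "blis": "bli_thread_set_num_threads",
    "atlas": "ATL_buildinfo",
}

def guess_blas_vendor(shared_object_path):
    """Load a shared object and probe for vendor symbols.

    Returns the vendor name, or None if the library cannot be loaded
    or exposes none of the probed symbols.
    """
    try:
        lib = ctypes.CDLL(shared_object_path)
    except OSError:
        return None
    for vendor, symbol in PROBES.items():
        # Attribute lookup on a CDLL raises AttributeError for
        # undefined symbols, so hasattr works as a probe.
        if hasattr(lib, symbol):
            return vendor
    return None
```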
Can we mark this as Closed? Maybe we should pivot it to "document use of …"?
Closing, since the …
This is not a bug report but just an enhancement proposal.
I think it would be useful and important to be able to easily set the number of threads used by Numpy after Numpy import.
From the perspective of library developers, it is often useful to be able to control the number of threads used by Numpy, see for example biopython/biopython#1401.
It is not difficult to do in a simple script when we are sure that Numpy or Scipy have not been imported previously with something like:
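The snippet elided here presumably set threading-related environment variables before importing Numpy; a minimal sketch of that approach, using the conventional variable names for OpenMP, OpenBLAS and MKL:

```python
import os

# These variables must be set before numpy (and hence the BLAS library)
# is imported, because most BLAS implementations size their thread pool
# once, at library load time.
os.environ["OMP_NUM_THREADS"] = "1"        # OpenMP builds (MKL, OpenMP OpenBLAS)
os.environ["OPENBLAS_NUM_THREADS"] = "1"   # pthreads builds of OpenBLAS
os.environ["MKL_NUM_THREADS"] = "1"        # MKL-specific override

import numpy as np  # BLAS now starts single-threaded
```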
However, in a library there is a good chance that the user has already imported Numpy in its main script, with something like …
In this case, I don't see how to set the number of threads used by Numpy from the fluidimage code.
Thus, it would be very convenient to have a function `np.set_num_threads`.