Set number of threads after numpy import #11826
We would have to make this pluggable somehow to adapt to different linalg implementations. For instance, OpenBLAS seems to expose a …
MKL has a function https://software.intel.com/en-us/mkl-developer-reference-c-mkl-set-num-threads which can be called through https://pypi.org/project/mkl/. But it seems to me it would be cleaner to have a generic Numpy function for this. Most Numpy users don't even know which linalg library is used in the background.
If it's alright, I would like to work on this. I haven't contributed to numpy before, so I will have to familiarize myself a bit with the codebase. Hence, it would be great if I could get some directions.
@touqir14 see the developer documentation. You should write some tests: tests that try out the new functions, verifying that they indeed set the number of threads you desire (perhaps by writing a C-level tests-only function that calls …).
Note that this is actually quite tricky to tackle. As there unfortunately is no standard API to determine information on the provider, numpy would be reliant on runtime introspection to determine the actual provider of the functions. As there are not very many of them (some I recall are OpenBLAS, ATLAS, BLIS, MKL and reference BLAS) it might be possible, but still difficult to get working portably.
Thanks for the heads up!
This is an extremely annoying problem because I want to explicitly turn off multithreading for my background workers, and as far as I can see there is no way of properly doing that (except downgrading numpy to < 1.14). OMP_NUM_THREADS=1 will solve my problem for the most part, but it will prevent my entire python process, including the main thread, from using OpenMP multithreading, while I would like to disable it only in the background workers.

As of now, I have 8 background workers (multiprocessing.Process), each of which will spawn another 8 threads to do np.dot computation (which is only a very small part of what they are actually doing). That clogs up the entire CPU. Any suggestions on how I could solve this?

Edit: downgrading to numpy 1.14.5 solves the problem. Starting from numpy 1.14.6 it's there.
@FabianIsensee The problem is that NumPy does not know right now what linalg backend you are using: OpenBLAS, MKL, ATLAS, ... Did you see the stackoverflow question mentioned above? It suggests functions you can wrap and call to control this for your backend.
@mattip Thanks for pointing that out! That is a good solution to the problem, provided that the system you are running on uses OpenBLAS. Unfortunately this solution has no effect on my system. What's especially annoying for me is that I am providing an open source framework that runs into this problem, and I cannot know what BLAS library each and every one of the users is going to use.
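For reference, the ctypes approach from that stackoverflow discussion looks roughly like the sketch below. It is best-effort by design: it only works when `openblas_set_num_threads` is actually visible in the process's global symbol table, which depends on how numpy's BLAS was loaded (with RTLD_LOCAL it fails silently), which may explain why it had no effect on some systems:

```python
import ctypes

def try_set_openblas_threads(n):
    """Best-effort call to openblas_set_num_threads on an already-loaded
    OpenBLAS. Returns True on success, False if the symbol is not visible
    (e.g. BLAS was loaded with RTLD_LOCAL, or this is not a POSIX system).
    """
    try:
        # CDLL(None) exposes symbols already loaded into this process (POSIX).
        whole_process = ctypes.CDLL(None)
        func = whole_process.openblas_set_num_threads
    except (OSError, AttributeError, TypeError):
        return False
    func(ctypes.c_int(n))
    return True
```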
The change in 1.14.6 was building manylinux1 against OpenBLAS instead of ATLAS; we were already using OpenBLAS for Windows. As far as the build goes, 1.14.6 is pretty much 1.15.x. It sounds like we need to find a solution to this that doesn't depend on the user knowing which library is in use, probably adding some function in numpy that stores its info during the build.
Thank you for the clarification!
I don't know, but it sounds like we need it. If nothing else we can pass info upstream and try to encourage the libraries themselves to add such a feature, ideally as some sort of standard library interface in BLAS (LAPACK) itself. @njsmith IIRC, you reported that there was some work going on to produce a new standard? In any case, @matthew-brett, I think we could have some effect on OpenBLAS. Maybe there is already such a feature.
Although with a single dynamic library I don't see how one could coordinate between callers. Hmm, not a simple problem, almost something that needs to be handled at the OS level. This is getting beyond my expertise. |
@FabianIsensee this seems troubling
What exactly did you try, and what was the result?
@mattip I did exactly what was described in the comment you referenced above.
These are the two examples I compared. I ran them while looking at htop to see CPU usage. For both of them a number of threads was spawned, and CPU usage was above 100% for the main thread. Running it like this: …
Whether or not a numpy API is feasible for this feature, perhaps we can crowdsource a new section in the numpy docs to explain this issue and offer advice with respect to the different environment variables.

IIUC, the basic advice for OpenBLAS and MKL users would be: if you're planning to use multiple processes (e.g. via …) …

Beyond that, maybe an MKL expert can offer more fine-grained advice with respect to these variables: …

For ATLAS users, apparently the number of threads is predetermined at compile time: …

@charris wrote:
I don't quite grok this point, which is all the more reason I would love to see some docs on this general topic. In which section of the numpy docs should I start a PR on the topic of multithreading control?
Correct me if I'm wrong, but it's not a big problem with multiprocessing, right? At worst you pay some extra overhead for parallelizing more than you have CPU cores, but each process spawns its own threads and works correctly. It's multithreading that is really problematic and causes bugs like #11046.
I would very much disagree here. In the specific situation that I am in, I have a pool of background workers (multiprocessing.Process) that generate batches for a deep learning algorithm. These batches contain images, some of which need to be (among other things) rotated for data augmentation (which is implemented via matrix multiplication of image coordinates).

The machine I am working on is a dgx1 computer with 8 graphics cards and 80 CPU threads. Usually I train 8 different networks on it simultaneously, using 10 workers and 1 GPU each. Now each of these workers (80 in total) will attempt to do these matrix multiplications (which are quite tiny, by the way) in a multithreaded way, and since each worker sees 80 CPUs they will spawn 80 threads each, resulting in the system being completely clogged up. That effectively breaks everything for me, and the only way I can continue my work is to downgrade to numpy 1.14.5.
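One workaround for exactly this over-subscription pattern is to pin the thread-count environment variables in a pool initializer, so each worker's BLAS starts single-threaded. A sketch with hypothetical worker names; the key assumption is that numpy is imported in the child *after* the initializer runs:

```python
import multiprocessing as mp
import os

def limit_blas_threads():
    """Pool initializer: runs in each worker before any work is done.

    Only effective if numpy is imported in the child *after* this runs
    (e.g. with the 'spawn' start method, or a fork taken before the parent
    imported numpy); a fork of a parent whose BLAS thread pool is already
    initialized inherits that pool regardless of these variables.
    """
    for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
        os.environ[var] = "1"

def augment(seed):
    # Hypothetical worker body: the real code would import numpy here and
    # do its small matrix multiplications single-threaded.
    return os.environ.get("OMP_NUM_THREADS")

if __name__ == "__main__":
    with mp.Pool(processes=2, initializer=limit_blas_threads) as pool:
        print(pool.map(augment, range(4)))  # each worker reports "1"
```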
I think we're on the same page, @bbbbbbbbba. In the case of multiprocessing, I'm not worried about incorrect results. As you said, I'm worried about poor performance due to spawning more threads than you can schedule onto your CPU cores. But the penalty is not trivial! In one of my recent use-cases on a 16-core machine, the performance was 6x worse due to the extra threads. On an 80-core machine like @FabianIsensee's, the overhead must be even worse. The threading-related issue you referred to is troubling, but that sounds like an outright bug, not an issue with the …
This code, which is MIT licensed and based on other BSD-licensed routines, probes the loaded DLLs (shared objects) at runtime to find which implementation is relevant, and calls the implementation-specific routine to set the number of threads. Thanks to @ogrisel for this comment pointing it out.
Sorry for being late to the party; I had not seen this issue. Indeed, we started to investigate over-subscription issues a bit in the context of scikit-learn / joblib, but this is still work in progress. Having numpy expose a uniform API to control the behavior of the underlying BLAS thread pool would be nice. Note that @anton-malakhov and @tomMoral are the primary authors of those dynamic ctypes-based accesses to the underlying runtime libraries. Ping @jeremiedbb, who might also be interested in following this discussion.
@seberg the script I linked to uses ctypes and OS-provided functions to walk down the loaded shared objects looking for the one we want. Isn't that what your script does, only using `Popen(['ldd', ...])`? It seems that if the C library provides [dl_iterate_phdr](https://linux.die.net/man/3/dl_iterate_phdr) for Linux, `_dyld_image_count` for macOS, and `GetModuleFileNameExW` for Windows, we should use them.
Ah sorry, forget about the ldd stuff; I just added it because I kept looking at it. No, the first function just loads the multiarray.so with ctypes and checks if certain function symbols are defined... That seems to work for OpenBLAS, MKL, BLIS and ATLAS. But I have no idea if just trying to load function symbols should work, or if e.g. Accelerate is identifiable by the existence of such a symbol. EDIT: OK, never mind my rambling. Tried on Windows, and the stuff probably just randomly works on Linux.
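The symbol-probing idea described above can be sketched with ctypes as follows. The probe symbols are well-known vendor entry points, but whether they are visible in a given shared object depends on platform and linking — as noted, this is an assumption that happens to hold on Linux more often than elsewhere, not a portable guarantee:

```python
import ctypes

# Well-known vendor-specific entry points used as fingerprints.
PROBES = {
    "openblas": "openblas_set_num_threads",
    "mkl": "MKL_Set_Num_Threads",
    "blis": "bli_thread_set_num_threads",
    "atlas": "ATL_buildinfo",
}

def guess_blas_vendor(shared_object_path):
    """Load a shared object and probe for vendor symbols.

    Returns the vendor name, or None if the library cannot be loaded
    or exposes none of the probed symbols.
    """
    try:
        lib = ctypes.CDLL(shared_object_path)
    except OSError:
        return None
    for vendor, symbol in PROBES.items():
        # Attribute lookup on a CDLL raises AttributeError for
        # undefined symbols, so hasattr works as a probe.
        if hasattr(lib, symbol):
            return vendor
    return None
```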
Can we mark this as Closed? Maybe we should pivot it to "document use of …"?
Closing, since the …
This is not a bug report but just an enhancement proposal.
I think it would be useful and important to be able to easily set the number of threads used by Numpy after Numpy import.
From the perspective of library developers, it is often useful to be able to control the number of threads used by Numpy, see for example biopython/biopython#1401.
It is not difficult to do in a simple script when we are sure that Numpy or Scipy have not been imported previously with something like:
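The snippet elided here presumably set threading-related environment variables before importing Numpy; a minimal sketch of that approach, using the conventional variable names for OpenMP, OpenBLAS and MKL:

```python
import os

# These variables must be set before numpy (and hence the BLAS library)
# is imported, because most BLAS implementations size their thread pool
# once, at library load time.
os.environ["OMP_NUM_THREADS"] = "1"        # OpenMP builds (MKL, OpenMP OpenBLAS)
os.environ["OPENBLAS_NUM_THREADS"] = "1"   # pthreads builds of OpenBLAS
os.environ["MKL_NUM_THREADS"] = "1"        # MKL-specific override

import numpy as np  # BLAS now starts single-threaded
```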
However, in a library there is a good chance that the user has already imported Numpy in its main script, with something like …
In this case, I don't see how to set the number of threads used by Numpy from the fluidimage code.
Thus, it would be very convenient to have a function `np.set_num_threads`.