Improve control over number of threads used in an mne call · Issue #10522 · mne-tools/mne-python · GitHub

Improve control over number of threads used in an mne call #10522

Closed · dafrose opened this issue Apr 13, 2022 · 15 comments · Fixed by #10567

Comments

@dafrose commented Apr 13, 2022

Description

I propose to use threadpoolctl to improve control over the number of threads used throughout mne and in calls to external libraries like numpy. This is apparently the direction that numpy has moved in, as discussed in numpy/numpy#11826.

Reasoning

I have had trouble completely controlling the number of threads used by various mne functions. Many mne functions have n_jobs arguments that control the number of threads used in that function, but there are cases where code within that function can escape this limit due to externally defined reference values. And then there are functions like mne.chpi.filter_chpi that do not have the n_jobs argument, but can still parallelize. It is possible to control this with environment variables as discussed here, but that only works if they are set before you import the respective library, e.g. numpy. The easiest way to control thread limits after an import has happened appears to be threadpoolctl.

Proposed Implementation

I have successfully used the syntax

from threadpoolctl import threadpool_limits
with threadpool_limits(limits=n_jobs, user_api="blas"):
    mne.do_something()

to control threads used in an mne call. The same could be used internally to make better use of the existing n_jobs argument without forcing the user to do it themselves. If this proves successful, it might make sense to add the n_jobs argument in even more places.

dafrose added the ENH label Apr 13, 2022
@agramfort (Member)

I like this idea but it's tricky to make work out of the box. I suspect the optimal behavior depends on the length of the files, the number of channels, epochs...

@dafrose (Author) commented Apr 13, 2022

Thanks for the reply @agramfort. In contrast, I think it would be straightforward. The current n_jobs argument already allows the user to define the number of threads to use. The only change would be that this would actually apply to all libraries called by mne. I don't see why you would need to implement an "optimal" behaviour. In most cases (e.g. numpy) the default is a maximum of 128 threads per job. I do not think that this would need to be overridden unless the user explicitly wants to.

And there are good reasons why you would want to consistently define a maximum thread number, e.g. when you work on shared resources or want to run multiple jobs on the same compute server. As it is now, a user needs to go out of their way to make sure that all restrictions via environment variables are set before anything else is imported - or do some additional research to find what I referenced above.

@agramfort (Member) commented Apr 13, 2022 via email

@dafrose (Author) commented Apr 13, 2022

> how would you do this? if n_jobs=1 you make sure one thread is used?
> So suddenly all computations in MNE are monothread?

Well, I guess the question would then be whether n_jobs=1 should be the default. I have seen other libraries use a default of 0 or -1 to signify that the user does not want to change the default behaviour. Then you could use a globally set default or decide not to change anything. In my opinion, being able to set n_jobs implies that the code being run would use at most that number of available CPU threads. However, in mne that only applies to some of the code, while other code (e.g. some numpy calls) relies on externally defined variables that don't change when you set n_jobs. This does not matter when you set n_jobs to the total number of CPU threads, but it does when you explicitly want it to be less. In particular, if you do not modify the default of n_jobs=1 as a user, the code being run might still parallelize, which may be undesired.

Apart from functions that have the argument n_jobs, what should be the default behaviour for functions that don't use it? As a user, I would expect that these functions always run monothreaded. However, mne.chpi.filter_chpi by default fills up to 128 CPU threads and there is no apparent way to control it. I think the same can happen for reading functions, am I correct? The problem here is that you can't efficiently run multiple jobs in parallel on a large compute server if some of them once in a while try to take all available resources. Once your jobs start competing for CPU threads, everything runs a lot less efficiently. At the same time, other code may not be able to make good use of 128 threads, which is why it makes sense to run multiple jobs in parallel with a defined maximum number of threads per job.

As a user, I assumed that n_jobs would do exactly that. If you do not want that to be the case, an alternative could be to explicitly document on mne.tools how users can do it themselves, and to mention it in the documentation of n_jobs as well.

If you decide to empower n_jobs as suggested, a new default of 0 or -1 or "auto" could mean automatically inferring the maximum number of available CPU threads and using that number, possibly up to a maximum of 128. That appears to be the current default for e.g. OPENBLAS_NUM_THREADS.
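As a rough illustration of what that inference could look like (a sketch only; resolve_n_jobs and the cap of 128 are made up for this example, not existing mne behaviour):

import os

def resolve_n_jobs(n_jobs=-1, cap=128):
    """Hypothetical helper: map -1/0/"auto" to the machine's CPU thread count."""
    if n_jobs in (-1, 0, "auto"):
        return min(os.cpu_count() or 1, cap)
    return n_jobs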

> note that we use usually processes and not threads for parallel

I actually meant available CPU resources, not threads or processes as they are used in the context of python. I have used "CPU threads" now to clarify.

> it means we would need to add n_jobs in many places?

Essentially yes. But that could be done gradually.

@dafrose (Author) commented Apr 13, 2022

I guess it comes down to what you would like n_jobs to mean. As long as it is consistent, a valid choice is to define it as "max number of threads/processes used in places that we control". But in that case, some guidance in the form of an example would be helpful to users who would like more/full control.

In my current code, the use of threadpoolctl as described above seems to do the job. However, if used on an mne function that also accepts n_jobs, the limit needs to be specified twice this way: once for threadpool_limits and once in the mne function call. Reducing that to one place would be more elegant and less ambiguous, but I can understand if you do not wish to change the meaning of n_jobs.
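For example (raw here stands for a previously loaded mne.io.Raw object, and the limit of 2 is arbitrary):

from threadpoolctl import threadpool_limits

with threadpool_limits(limits=2, user_api="blas"):
    raw.filter(l_freq=1.0, h_freq=40.0, n_jobs=2)  # the same limit has to be repeated here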

@agramfort (Member) commented Apr 13, 2022 via email

@dafrose (Author) commented Apr 14, 2022

> I would try to avoid a behavior that deviates to big libraries like scikit-learn.

According to this resource the default for sklearn is n_jobs=-1 in which case all available resources are used.

> what is unclear to me is how big of the change is the change you suggest.

  1. The function mne.parallel.check_n_jobs already contains the necessary code: If a negative n_jobs is passed, the number of CPU cores is used to calculate the actual n_jobs. So the first step would be to set all defaults to n_jobs=-1. That can be done very easily with most IDEs. Of course, we should ensure that check_n_jobs is actually called wherever necessary.
  2. If we want to achieve full control over thread numbers, it would make sense to add the threadpool_limits context manager immediately after the call to check_n_jobs to set the limit for everything that comes after (see the sketch below). To begin with, we could do that wherever the n_jobs argument is already present. There might also be alternatives to explicitly using threadpoolctl, but that might require some research.
  3. Whatever the solution is, it could be gradually rolled out to every other function in mne that implicitly parallelizes.

All these changes are non-breaking in the sense that actively setting n_jobs still produces the same result, but more consistently (which would appear like a bug fix). On the other hand, the default behaviour for "not setting n_jobs" changes to "use all cores". Whether that is a bad thing or not depends on expectation, but it is consistent with scikit-learn.
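To make points 1 and 2 concrete, here is a rough sketch of how this could look inside a function that already takes n_jobs (some_mne_function and _do_heavy_lifting are placeholders, not existing mne code):

from threadpoolctl import threadpool_limits

from mne.parallel import check_n_jobs  # the helper mentioned in point 1


def some_mne_function(data, n_jobs=-1):
    # point 1: resolve negative values to an actual CPU count
    n_jobs = check_n_jobs(n_jobs)
    # point 2: cap BLAS/OpenMP threads for everything below, including
    # implicit parallelism inside numpy/scipy calls
    with threadpool_limits(limits=n_jobs):
        return _do_heavy_lifting(data, n_jobs=n_jobs)  # placeholder helper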

> in the lab we ask users to set OMP_NUM_THREADS to 1 in their .bashrc
> on the shared machines and to nice their jobs with "nice -5 python ..."

That would be enforcing monothreading, assuming it catches all cases (it might need more environment variables, see this stackoverflow answer). Even so, wouldn't it make more sense to use a Python API instead of having users manipulate their .bashrc?
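For reference, the Python-side equivalent of that environment-variable approach looks roughly like this (the set of variables follows the Stack Overflow answer linked above and may not be exhaustive):

import os

# must happen before numpy/scipy are imported for the first time
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS",
            "VECLIB_MAXIMUM_THREADS", "NUMEXPR_NUM_THREADS"):
    os.environ[var] = "1"

import numpy as np  # imported only after the limits are in place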

@agramfort (Member)

let me think about this. We will discuss this in the next MNE core dev meeting. You're welcome to join. It will be on Friday the 22nd at 5 PM CET on the MNE Discord channel.

2 remarks:

  • in scikit-learn we default to 1 which is for me much less dangerous than defaulting to -1
  • using a threadpool_limits context would mean indenting huge code blocks, and I am not a fan of adding this everywhere, as I don't see where we would not do this. We have numerical code everywhere. I've seen this done elsewhere, but maybe you can point to other packages in the pydata ecosystem that have done what you suggest?

@larsoner (Member)

I have not thought about threadpoolctl much, but I've seen it used in SciPy (with modifications from sklearn):

scipy/scipy#14441

And they mention there that it's also what's being used by NumPy. Given that scikit-learn uses joblib to spawn new processes as well, we can probably learn from their experience and try to do the same things.

> using a threadpool_limits context would mean indenting huge code blocks, and I am not a fan of adding this everywhere.

Two ideas (and I think the second is better):

  1. We've gotten around this before by changing things like fid = open(...) to with open(...) via:
    def my_fun(...):
        do_something_slow
        on_many_lines
    
    to
    def my_fun(...):
        with context():
            _my_fun(...)
    
    def _my_fun(...):
        do_something_slow
        on_many_lines
    
  2. But really we should just do it by adding a @threadpool_controlled decorator or so that uses a context manager. Then we "just" need to add this to functions that need it. In practice we'd get it almost for free if we add it to verbose, which already decorates most slow functions anyway...
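For idea 2, a minimal sketch of what such a decorator could look like, assuming the wrapped function takes n_jobs as a keyword argument (the name threadpool_controlled comes from the comment above; everything else is an assumption):

import functools

from threadpoolctl import threadpool_limits


def threadpool_controlled(func):
    """Sketch: limit BLAS/OpenMP threads to the wrapped function's n_jobs."""
    @functools.wraps(func)
    def wrapper(*args, n_jobs=None, **kwargs):
        if n_jobs is None:
            # no explicit request: leave any external/global limits untouched
            return func(*args, n_jobs=n_jobs, **kwargs)
        with threadpool_limits(limits=n_jobs):
            return func(*args, n_jobs=n_jobs, **kwargs)
    return wrapper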

@larsoner (Member)

@dafrose we usually try to follow what sklearn does, under the assumption that they have thought about this stuff a lot. It sounds like they, in turn, mostly delegate to joblib. With that in mind, I propose we follow their model by:

  1. Setting n_jobs=None which means "use 1 if nothing configured". By using joblib contexts, this can be effectively changed to other values by users. So you could do:

    with joblib.parallel_backend('threading', n_jobs=4):
        raw.filter(..., n_jobs=None)
    

    And by using the (now default) value of n_jobs=None, you'll end up using 4 threads. This uses threadpoolctl under the hood, so should play nicely with linalg libraries according to their docs.

  2. Update filter_chpi to have an n_jobs argument.

This seems like it would allow MNE functions that use n_jobs to take as many threads as available according to the joblib.parallel_backend param.
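As a small standalone illustration of the joblib behaviour point 1 relies on (plain joblib, no mne code involved): Parallel(n_jobs=None) picks up the n_jobs configured on the enclosing parallel_backend context.

from joblib import Parallel, delayed, parallel_backend


def _square(x):
    return x * x


with parallel_backend("threading", n_jobs=4):
    # n_jobs is left at its default (None), so the backend's n_jobs=4 applies
    results = Parallel()(delayed(_square)(i) for i in range(10))

print(results)  # [0, 1, 4, 9, ...]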

@agramfort (Member) commented Apr 16, 2022 via email

@dafrose (Author) commented Apr 19, 2022

@agramfort thanks for the invite. I will see whether I can make it on Friday.

@larsoner +1 for adding the context manager as a decorator or to an existing decorator.

Regarding n_jobs=None: I like the idea, because it does not change the current default behaviour but allows handing control over to a lower-tier context. However, it should always be clear what takes precedence. Unless otherwise explained, I would expect that the call raw.filter(..., n_jobs=<a_number>) should overrule whatever an external context defines, unless the value is None as you defined above. Do I understand correctly that setting n_jobs=-1 would still mean that all available cores are used?

@agramfort (Member) commented Apr 19, 2022 via email

@larsoner (Member)

@dafrose do you want to take a stab at a PR to implement this?

@dafrose (Author) commented Apr 19, 2022

@larsoner thanks for the offer. I would love to, but I am afraid it would take some time. I already have a few PRs on my todo list, one of them already for mne, and I haven't gotten to do any of them yet... So if it can wait a few weeks™, maybe. But I won't mind if someone else does it before then.
