Add `allow_override` kwarg to Parallel by jcrist · Pull Request #524 · joblib/joblib · GitHub

Add allow_override kwarg to Parallel #524


Closed
jcrist wants to merge 4 commits into master from allow_override

Conversation

jcrist
Contributor
@jcrist jcrist commented May 23, 2017

If true, wrapping a call to Parallel with a parallel_backend contextmanager will use the active backend instead of the backend specified in the Parallel call. Default is False (existing behavior).

This allows overriding what backend is used inside libraries that depend on joblib (e.g. scikit-learn), and is motivated by the dask.distributed joblib backend (see scikit-learn/scikit-learn#8804).

Example

# Uses threading
with parallel_backend('threading'):
    Parallel(backend='multiprocessing', allow_override=True, ...)(...)

# Uses multiprocessing since `allow_override=False`
with parallel_backend('threading'):
    Parallel(backend='multiprocessing', ...)(...)

# Uses specified n_jobs in Parallel (10)
with parallel_backend('threading', n_jobs=2):
    Parallel(backend='multiprocessing', allow_override=True, n_jobs=10)

# Uses context specified n_jobs (2), since none specified in Parallel
with parallel_backend('threading', n_jobs=2):
    Parallel(backend='multiprocessing', allow_override=True)

# Uses default n_jobs (1), since none specified in Parallel and allow_override=False
with parallel_backend('threading', n_jobs=2):
    Parallel(backend='multiprocessing')

The n_jobs behavior with allow_override is a bit wonky, as parallel_backend may or may not specify n_jobs, and the default for parallel_backend is -1 but the default for Parallel is 1. The approach taken here is that allow_override only means the backend can be overridden, not n_jobs in the call to Parallel (if specified). Alternative solutions would be:

  • Remove the n_jobs kwarg from parallel_backend, as n_jobs and backend are two separate concepts

  • Switch the kwarg default for n_jobs in parallel_backend to None. This would allow checking if a global n_jobs was set, and use that to override the n_jobs specified in Parallel if allow_override is True. The logic in Parallel would then look like:

    if allow_override or backend is None:
        backend, default_n_jobs = get_active_backend()
        if default_n_jobs is not None:  # n_jobs set in parallel_backend, use that
            n_jobs = default_n_jobs
        elif n_jobs is None:            # otherwise if specified locally use that
            n_jobs = 1                  # fallback to 1

    The downside of this is it would change the behavior of

    with parallel_backend('some_backend'):
        Parallel()(...)

    from using n_jobs=-1 (fallback to parallel_backend defaults) to n_jobs=1 (fallback to Parallel defaults). Since the parallel_backend contextmanager has a warning that it may change any time this may be acceptable.
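The approach taken in this PR can be sketched as a small resolution function. This is a hedged illustration only: `resolve`, `active_backend`, and `active_n_jobs` are hypothetical names standing in for the state `parallel_backend` would set, not joblib's actual API.

```python
# Sketch of the resolution rule taken in this PR: allow_override lets the
# parallel_backend context replace the backend, but an n_jobs passed
# explicitly to Parallel always wins over the context's n_jobs.
def resolve(backend=None, n_jobs=None, allow_override=False,
            active_backend=None, active_n_jobs=None):
    if allow_override and active_backend is not None:
        backend = active_backend            # context overrides the backend
        if n_jobs is None:                  # ...but not an explicit n_jobs
            n_jobs = active_n_jobs
    if backend is None:
        backend = 'multiprocessing'         # assumed DEFAULT_BACKEND
    if n_jobs is None:
        n_jobs = 1                          # Parallel's default
    return backend, n_jobs
```

For instance, `resolve('multiprocessing', n_jobs=10, allow_override=True, active_backend='threading', active_n_jobs=2)` returns `('threading', 10)`, matching the third example above.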

@codecov
codecov bot commented May 23, 2017

Codecov Report

Merging #524 into master will increase coverage by 0.03%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #524      +/-   ##
==========================================
+ Coverage   93.72%   93.75%   +0.03%     
==========================================
  Files          37       37              
  Lines        4779     4807      +28     
==========================================
+ Hits         4479     4507      +28     
  Misses        300      300
Impacted Files Coverage Δ
joblib/test/test_parallel.py 95.48% <100%> (+0.26%) ⬆️
joblib/parallel.py 97.1% <100%> (+0.06%) ⬆️
joblib/test/test_memory.py 98.09% <0%> (-0.43%) ⬇️
joblib/test/test_pool.py 99.61% <0%> (+0.77%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 549ec6f...e3244a4.

jcrist added 2 commits May 23, 2017 13:12
The new release of pytest adds a plugin to capture all warnings and
output them at the end. This was causing a test failure, as we were
checking that a warning was raised in an external process by checking
the captured stderr. Since the warning is raised in an external process
we can't use `pytest.warns`, and none of the current pytest-warnings
configuration settings allow turning off capturing in a way that works
for this test. For now we disable this plugin entirely.
If true, wrapping a call to `Parallel` with a `parallel_backend`
contextmanager will use the active backend instead of the backend
specified in the `Parallel` call. Default is False (existing behavior).

This allows overriding what backend is used inside libraries that depend
on joblib (e.g. scikit-learn), and is motivated by the
`dask.distributed` joblib backend.

This also changes the kwarg default for `n_jobs` to None to allow for
checking if the user specified a number in the call to `Parallel` (which
is then used), otherwise the global default is used. This is better than
the existing logic, which checked if it was `1`, which meant that if no
backend was specified but `n_jobs` was set to `1`,
`n_jobs=DEFAULT_N_JOBS` would be used instead. The default behavior
remains the same as before.
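The `None`-default change described in this commit message can be illustrated with a minimal sketch (the helper name `effective_n_jobs` and the context default of 2 are illustrative assumptions, not joblib code):

```python
# With a default of n_jobs=1, an explicit n_jobs=1 was indistinguishable
# from "not specified", so it could be silently replaced by a global
# default. A None default preserves the user's intent.
def effective_n_jobs(n_jobs=None, context_n_jobs=2):
    if n_jobs is None:           # user did not specify n_jobs
        return context_n_jobs    # fall back to the global/context default
    return n_jobs                # explicit value (including 1) is respected
```

Here `effective_n_jobs()` returns 2 (the fallback) while `effective_n_jobs(1)` returns 1; under the old `n_jobs == 1` check, the explicit 1 would have been replaced.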
@aabadie
Contributor
aabadie commented May 26, 2017

I followed a bit the discussion in scikit-learn and dask.

The second option you propose in the summary of this PR makes more sense to me: allow the parallel_backend context manager to override both the backend and n_jobs, and use the same default value for n_jobs as the one in Parallel. From the examples you provided, I think this is the most intuitive.

@with_multiprocessing
def test_n_jobs_defaults():
    # n_jobs specified, use that
    p = Parallel(n_jobs=1)
Contributor

Maybe use a value different from the default (1)

Contributor Author

The point of this test was to ensure that passing in n_jobs=1 wasn't ignored. In the existing code providing no backend but specifying n_jobs=1 results in n_jobs=-1.

Now ``allow_override`` dictates if both the backend and n_jobs can be
overridden. This required changing the default value of ``n_jobs`` for
``parallel_backend``, but that should be fine as the docs provide a
warning that it may change at any time.

This also changes the defaults for ``backend`` and ``n_jobs`` to ignore
``parallel_backend`` if not provided. This is more consistent, as the
``parallel_backend`` contextmanager is now always ignored unless
``allow_override=True``.
@jcrist
Contributor Author
jcrist commented May 30, 2017

I have adjusted the code to have allow_override indicate if both the backend and n_jobs should be overridden. This required changing the default value for n_jobs in parallel_backend to None to match Parallel, but that seems fine to me since there was a warning in the parallel_backend docstring that it may change any time.

I also changed Parallel to now ignore parallel_backend unless allow_override=True. This seems more consistent to me, but breaks with the previous behavior of no backend passed meaning use the active backend. I'm not sure if this was intended though, as the docstring makes it look like no backend should mean use DEFAULT_BACKEND.

Example

# Uses threading
with parallel_backend('threading'):
    Parallel(backend='multiprocessing', allow_override=True, ...)(...)

# Uses multiprocessing since `allow_override=False`
with parallel_backend('threading'):
    Parallel(backend='multiprocessing', ...)(...)

# Uses specified n_jobs in parallel_backend (2)
with parallel_backend('threading', n_jobs=2):
    Parallel(backend='multiprocessing', allow_override=True, n_jobs=10)(...)

# Uses default n_jobs (1), since none specified in Parallel and allow_override=False
with parallel_backend('threading', n_jobs=2):
    Parallel(backend='multiprocessing')(...)

# Uses DEFAULT_BACKEND (multiprocessing) since allow_override=False
# This is different than the current release where `threading` would be used
with parallel_backend('threading'):
    Parallel()(...)
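Under these revised semantics the decision rule changes in two ways, sketched below (again a hypothetical illustration with assumed names, not the actual joblib code): the context can now override n_jobs as well, and it is ignored entirely unless allow_override=True.

```python
def resolve_revised(backend=None, n_jobs=None, allow_override=False,
                    active_backend=None, active_n_jobs=None):
    # parallel_backend is ignored entirely unless allow_override=True
    if allow_override and active_backend is not None:
        backend = active_backend
        if active_n_jobs is not None:   # context n_jobs now wins too
            n_jobs = active_n_jobs
    if backend is None:
        backend = 'multiprocessing'     # assumed DEFAULT_BACKEND
    if n_jobs is None:
        n_jobs = 1
    return backend, n_jobs
```

This reproduces the examples above, e.g. `resolve_revised('multiprocessing', n_jobs=10, allow_override=True, active_backend='threading', active_n_jobs=2)` gives `('threading', 2)`, and a bare `Parallel()` inside the context now falls back to the default backend rather than the active one.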

@ogrisel
Contributor
ogrisel commented Jun 1, 2017

I am wondering if it would make more sense to have:

with parallel_backend('threading', n_jobs=2, override=True):
    Parallel(backend='multiprocessing', ...)(...)

instead. It would give more control to the end user (over the library writer). WDYT?

@GaelVaroquaux
Member
GaelVaroquaux commented Jun 1, 2017 via email

@ogrisel
Contributor
ogrisel commented Jun 1, 2017

Indeed, it would be saner for the random forest code to state that the code under the parallel loop is mostly GIL-free, and then let joblib select the threading backend as the default as a consequence, while at the same time making it possible for users to use the with parallel_backend construct to explicitly select their own backend from their calling program.

Any suggestion for the GIL-free declaration?

@ogrisel
Contributor
ogrisel commented Jun 1, 2017

Possible suggestions:

  • To state that it's safe to use a thread-based backend without being penalized by GIL contention:

    Parallel(n_jobs, gil_free=True)(delayed(func)(x) for x in task_data)

  • To state that it's required to use a single-machine single-process thread-based backend because the code inside the loop relies on shared memory for concurrent state mutation (never used in any sklearn code as far as I know and should be discouraged in general):

    Parallel(n_jobs, require_shared_mem=True)(delayed(func)(x) for x in task_data)

@jcrist
Contributor Author
jcrist commented Jun 1, 2017

My main issue with that (as I mentioned in the originating issue) is that I find it more confusing as a joblib user (and admittedly a novice one). The intended use here is twofold:

  • Allow users to change the joblib backend inside library code when/where it makes sense to do so
  • Provide a simple way library maintainers can control this override to prevent unwanted behavior

For both of these I find a simple boolean "can the backend be overridden by parallel_backend" to be easier to understand and reason about. I think a more fine-grained solution is a bit overkill.

The two constraints you mentioned with the existing backends boil down to:

  • gil_free: can threading be used? Note that False doesn't mean threading can't be used, it'll just be slower.
  • requires_shared_mem: is threading/sequential mandatory? This is a mandatory constraint, as you'll get incorrect results otherwise.

Going through the 4 possible combinations of these:

  • gil_free=True, requires_shared_mem=False

    This means that any backend should work fine, but the most efficient default is probably threading. This would be equivalent to Parallel('threading', allow_override=True).

  • gil_free=True, requires_shared_mem=True

    This requires either threading or sequential. A reasonable choice here would be Parallel('threading', allow_override=False). Given n_jobs=1, this is equivalent to sequential.

  • gil_free=False, requires_shared_mem=False

    This means that any backend should work fine, but the threading backend is slower than desired. In this case I'd use Parallel(allow_override=True), and note in the library documentation that the threading backend will be slower for this algorithm.

  • gil_free=False, requires_shared_mem=True

    This is a weird state, as you need shared memory but can't achieve true parallelism with threads. This requires sequential (just a for loop), and is unlikely to come up (and if it did, I'd say Parallel('sequential', allow_override=False)).

I'm not 100% against the more fine-grained solution, but I don't see how it solves any problems better than the simple boolean keyword and some documentation. The constraint-based solution would also need docs so users could know whether wrapping a block in with parallel_backend(...) would use the requested backend or not.
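The four cases above can be summarized in one small mapping. This is a hypothetical sketch: `gil_free` and `requires_shared_mem` are proposed names from this thread, and the returned kwargs use the `allow_override` API from this PR; none of it is released joblib API.

```python
def constraints_to_parallel_kwargs(gil_free, requires_shared_mem):
    # shared memory forces a non-overridable in-process backend
    if requires_shared_mem:
        backend = 'threading' if gil_free else 'sequential'
        return {'backend': backend, 'allow_override': False}
    # otherwise any backend works; prefer threading when GIL-free
    backend = 'threading' if gil_free else None  # None -> library default
    return {'backend': backend, 'allow_override': True}
```

For example, `constraints_to_parallel_kwargs(True, False)` yields `{'backend': 'threading', 'allow_override': True}`, i.e. the `Parallel('threading', allow_override=True)` case above.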

@jcrist
Contributor Author
jcrist commented Jun 1, 2017

Any suggestion for the GIL-free declaration?

If we do settle on the constraint based approach, I think releases_gil would be better than gil_free. gil_free to me is a bit vague - is the backend "gil-free" (e.g. multiprocessing), or is the function "gil-free"? allow_threads might be another good choice.

@lesteve
Member
lesteve commented Jun 2, 2017

because the code inside the loop relies on shared memory for concurrent state mutation (never used in any sklearn code as far as I know and should be discouraged in general)

@ogrisel actually I think scikit-learn/scikit-learn#8672 does rely on this for example.

@ogrisel
Contributor
ogrisel commented Jun 2, 2017

@lesteve indeed, scikit-learn/scikit-learn#8672 makes it impossible to use a non-threading backend for random forest in scikit-learn master...

The clean solution would be to implement a new "map / reduce" or possibly "map / combine / reduce" semantics into joblib.

@GaelVaroquaux
Member
GaelVaroquaux commented Jun 2, 2017 via email

@jcrist
Contributor Author
jcrist commented Jun 2, 2017

Ok, the forward thinking argument convinced me. I'll write up a PR implementing this.

@GaelVaroquaux
Member
GaelVaroquaux commented Jun 2, 2017 via email

@jcrist
Contributor Author
jcrist commented Jul 19, 2017

Superseded by #537. Closing.

@jcrist jcrist closed this Jul 19, 2017
@jcrist jcrist deleted the allow_override branch July 19, 2017 20:29
5 participants