Add `allow_override` kwarg to Parallel by jcrist · Pull Request #524 · joblib/joblib · GitHub

Add allow_override kwarg to Parallel #524


Closed
jcrist wants to merge 4 commits into master from allow_override

Conversation

jcrist
Contributor
@jcrist jcrist commented May 23, 2017

If true, wrapping a call to Parallel with a parallel_backend contextmanager will use the active backend instead of the backend specified in the Parallel call. Default is False (existing behavior).

This allows overriding what backend is used inside libraries that depend on joblib (e.g. scikit-learn), and is motivated by the dask.distributed joblib backend (see scikit-learn/scikit-learn#8804).

Example

# Uses threading
with parallel_backend('threading'):
    Parallel(backend='multiprocessing', allow_override=True, ...)(...)

# Uses multiprocessing since `allow_override=False`
with parallel_backend('threading'):
    Parallel(backend='multiprocessing', ...)(...)

# Uses specified n_jobs in Parallel (10)
with parallel_backend('threading', n_jobs=2):
    Parallel(backend='multiprocessing', allow_override=True, n_jobs=10)

# Uses context specified n_jobs (2), since none specified in Parallel
with parallel_backend('threading', n_jobs=2):
    Parallel(backend='multiprocessing', allow_override=True)

# Uses default n_jobs (1), since none specified in Parallel and allow_override=False
with parallel_backend('threading', n_jobs=2):
    Parallel(backend='multiprocessing')

The n_jobs behavior with allow_override is a bit wonky, as parallel_backend may or may not specify n_jobs, and the default for parallel_backend is -1 but the default for Parallel is 1. The approach taken here is that allow_override only means the backend can be overridden, not n_jobs in the call to Parallel (if specified). Alternative solutions would be:

  • Remove the n_jobs kwarg from parallel_backend, as n_jobs and backend are two separate concepts

  • Switch the kwarg default for n_jobs in parallel_backend to None. This would allow checking if a global n_jobs was set, and use that to override the n_jobs specified in Parallel if allow_override is True. The logic in Parallel would then look like:

    if allow_override or backend is None:
        backend, default_n_jobs = get_active_backend()
        if default_n_jobs is not None:  # n_jobs set in parallel_backend, use that
            n_jobs = default_n_jobs
        elif n_jobs is None:            # otherwise if specified locally use that
            n_jobs = 1                  # fallback to 1

    The downside of this is it would change the behavior of

    with parallel_backend('some_backend'):
        Parallel()(...)

    from using n_jobs=-1 (fallback to parallel_backend defaults) to n_jobs=1 (fallback to Parallel defaults). Since the parallel_backend contextmanager has a warning that it may change any time this may be acceptable.
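The approach taken in this PR can be sketched as a small resolution function. This is a hedged illustration only: `resolve`, `active_backend`, and `active_n_jobs` are hypothetical names standing in for the state `parallel_backend` would set, not joblib's actual API.

```python
# Sketch of the resolution rule taken in this PR: allow_override lets the
# parallel_backend context replace the backend, but an n_jobs passed
# explicitly to Parallel always wins over the context's n_jobs.
def resolve(backend=None, n_jobs=None, allow_override=False,
            active_backend=None, active_n_jobs=None):
    if allow_override and active_backend is not None:
        backend = active_backend            # context overrides the backend
        if n_jobs is None:                  # ...but not an explicit n_jobs
            n_jobs = active_n_jobs
    if backend is None:
        backend = 'multiprocessing'         # assumed DEFAULT_BACKEND
    if n_jobs is None:
        n_jobs = 1                          # Parallel's default
    return backend, n_jobs
```

For instance, `resolve('multiprocessing', n_jobs=10, allow_override=True, active_backend='threading', active_n_jobs=2)` returns `('threading', 10)`, matching the third example above.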

@codecov
codecov bot commented May 23, 2017

Codecov Report

Merging #524 into master will increase coverage by 0.03%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #524      +/-   ##
==========================================
+ Coverage   93.72%   93.75%   +0.03%     
==========================================
  Files          37       37              
  Lines        4779     4807      +28     
==========================================
+ Hits         4479     4507      +28     
  Misses        300      300
Impacted Files Coverage Δ
joblib/test/test_parallel.py 95.48% <100%> (+0.26%) ⬆️
joblib/parallel.py 97.1% <100%> (+0.06%) ⬆️
joblib/test/test_memory.py 98.09% <0%> (-0.43%) ⬇️
joblib/test/test_pool.py 99.61% <0%> (+0.77%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 549ec6f...e3244a4.

jcrist added 2 commits May 23, 2017 13:12
The new release of pytest adds a plugin to capture all warnings and
output them at the end. This was causing a test failure, as we were
checking that a warning was raised in an external process by checking
the captured stderr. Since the warning is raised in an external process
we can't use `pytest.warns`, and none of the current pytest-warnings
configuration settings allow turning off capturing in a way that works
for this test. For now we disable this plugin entirely.
If true, wrapping a call to `Parallel` with a `parallel_backend`
contextmanager will use the active backend instead of the backend
specified in the `Parallel` call. Default is False (existing behavior).

This allows overriding what backend is used inside libraries that depend
on joblib (e.g. scikit-learn), and is motivated by the
`dask.distributed` joblib backend.

This also changes the kwarg default for `n_jobs` to None to allow for
checking if the user specified a number in the call to `Parallel` (which
is then used), otherwise the global default is used. This is better than
the existing logic, which checked if it was `1`, which meant that if no
backend was specified but `n_jobs` was set to `1`,
`n_jobs=DEFAULT_N_JOBS` would be used instead. The default behavior
remains the same as before.
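The `None`-default change described in this commit message can be illustrated with a minimal sketch (the helper name `effective_n_jobs` and the context default of 2 are illustrative assumptions, not joblib code):

```python
# With a default of n_jobs=1, an explicit n_jobs=1 was indistinguishable
# from "not specified", so it could be silently replaced by a global
# default. A None default preserves the user's intent.
def effective_n_jobs(n_jobs=None, context_n_jobs=2):
    if n_jobs is None:           # user did not specify n_jobs
        return context_n_jobs    # fall back to the global/context default
    return n_jobs                # explicit value (including 1) is respected
```

Here `effective_n_jobs()` returns 2 (the fallback) while `effective_n_jobs(1)` returns 1; under the old `n_jobs == 1` check, the explicit 1 would have been replaced.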
@aabadie
Contributor
aabadie commented May 26, 2017

I followed a bit the discussion in scikit-learn and dask.

The second option you propose in the summary of this PR makes more sense to me: allow the parallel_backend context manager to override both the backend and n_jobs, and use the same default value for n_jobs as the one in Parallel. From the examples you provided, I think this is the most intuitive.

@with_multiprocessing
def test_n_jobs_defaults():
    # n_jobs specified, use that
    p = Parallel(n_jobs=1)
Contributor

Maybe use a value different from the default (1)

Contributor Author

The point of this test was to ensure that passing in n_jobs=1 wasn't ignored. In the existing code providing no backend but specifying n_jobs=1 results in n_jobs=-1.

Now ``allow_override`` dictates if both the backend and n_jobs can be
overridden. This required changing the default value of ``n_jobs`` for
``parallel_backend``, but that should be fine as the docs provide a
warning that it may change at any time.

This also changes the defaults for ``backend`` and ``n_jobs`` to ignore
``parallel_backend`` if not provided. This is more consistent, as the
``parallel_backend`` contextmanager is now always ignored unless
``allow_override=True``.
@jcrist
Contributor Author
jcrist commented May 30, 2017

I have adjusted the code to have allow_override indicate if both the backend and n_jobs should be overridden. This required changing the default value for n_jobs in parallel_backend to None to match Parallel, but that seems fine to me since there was a warning in the parallel_backend docstring that it may change any time.

I also changed Parallel to now ignore parallel_backend unless allow_override=True. This seems more consistent to me, but breaks with the previous behavior of no backend passed meaning use the active backend. I'm not sure if this was intended though, as the docstring makes it look like no backend should mean use DEFAULT_BACKEND.

Example

# Uses threading
with parallel_backend('threading'):
    Parallel(backend='multiprocessing', allow_override=True, ...)(...)

# Uses multiprocessing since `allow_override=False`
with parallel_backend('threading'):
    Parallel(backend='multiprocessing', ...)(...)

# Uses specified n_jobs in parallel_backend (2)
with parallel_backend('threading', n_jobs=2):
    Parallel(backend='multiprocessing', allow_override=True, n_jobs=10)(...)

# Uses default n_jobs (1), since none specified in Parallel and allow_override=False
with parallel_backend('threading', n_jobs=2):
    Parallel(backend='multiprocessing')(...)

# Uses DEFAULT_BACKEND (multiprocessing) since allow_override=False
# This is different than the current release where `threading` would be used
with parallel_backend('threading'):
    Parallel()(...)
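Under these revised semantics the decision rule changes in two ways, sketched below (again a hypothetical illustration with assumed names, not the actual joblib code): the context can now override n_jobs as well, and it is ignored entirely unless allow_override=True.

```python
def resolve_revised(backend=None, n_jobs=None, allow_override=False,
                    active_backend=None, active_n_jobs=None):
    # parallel_backend is ignored entirely unless allow_override=True
    if allow_override and active_backend is not None:
        backend = active_backend
        if active_n_jobs is not None:   # context n_jobs now wins too
            n_jobs = active_n_jobs
    if backend is None:
        backend = 'multiprocessing'     # assumed DEFAULT_BACKEND
    if n_jobs is None:
        n_jobs = 1
    return backend, n_jobs
```

This reproduces the examples above, e.g. `resolve_revised('multiprocessing', n_jobs=10, allow_override=True, active_backend='threading', active_n_jobs=2)` gives `('threading', 2)`, and a bare `Parallel()` inside the context now falls back to the default backend rather than the active one.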

@ogrisel
Contributor
ogrisel commented Jun 1, 2017

I am wondering if it would make more sense to have:

with parallel_backend('threading', n_jobs=2, override=True):
    Parallel(backend='multiprocessing', ...)(...)

instead. It would give more control to the end user (over the library writer). WDYT?

@GaelVaroquaux
Member
GaelVaroquaux commented Jun 1, 2017 via email

@ogrisel
Contributor
ogrisel commented Jun 1, 2017

Indeed, it would be saner for the random forest code to state that the code under the parallel loop is mostly GIL-free, and then let joblib select the threading backend as the default as a consequence, while at the same time making it possible for users to use the with parallel_backend construct to explicitly select their own backend from their calling program.

Any suggestion for the GIL-free declaration?

@ogrisel
Contributor
ogrisel commented Jun 1, 2017

Possible suggestions:

  • To state that it's safe to use a thread-based backend without being penalized by GIL contention:

    Parallel(n_jobs, gil_free=True)(delayed(func)(x) for x in task_data)

  • To state that it's required to use a single-machine single-process thread-based backend because the code inside the loop relies on shared memory for concurrent state mutation (never used in any sklearn code as far as I know and should be discouraged in general):

    Parallel(n_jobs, require_shared_mem=True)(delayed(func)(x) for x in task_data)

@jcrist
Contributor Author
jcrist commented Jun 1, 2017

My main issue with that (as I mentioned in the originating issue) is that I find it more confusing as a joblib user (and admittedly a novice one). The intended use here is twofold:

  • Allow users to change the joblib backend inside library code when/where it makes sense to do so
  • Provide a simple way library maintainers can control this override to prevent unwanted behavior

For both of these I find a simple boolean "can the backend be overridden by parallel_backend" to be easier to understand and reason about. I think a more fine-grained solution is a bit overkill.

The two constraints you mentioned with the existing backends boil down to:

  • gil_free: can threading be used? Note that False doesn't mean threading can't be used, it'll just be slower.
  • requires_shared_mem: is threading/sequential mandatory? This is a mandatory constraint, as you'll get incorrect results otherwise.

Going through the 4 possible combinations of these:

  • gil_free=True, requires_shared_mem=False

    This means that any backend should work fine, but the most efficient default is probably threading. This would be equivalent to Parallel('threading', allow_override=True).

  • gil_free=True, requires_shared_mem=True

    This requires either threading or sequential. A reasonable choice here would be Parallel('threading', allow_override=False). Given n_jobs=1, this is equivalent to sequential.

  • gil_free=False, requires_shared_mem=False

    This means that any backend should work fine, but the threading backend is slower than desired. In this case I'd use Parallel(allow_override=True), and note in the library documentation that the threading backend will be slower for this algorithm.

  • gil_free=False, requires_shared_mem=True

    This is a weird state, as you need shared memory but can't achieve true parallelism with threads. This requires sequential (just a for loop), and is unlikely to come up (and if it did, I'd say Parallel('sequential', allow_override=False)).

I'm not 100% against the more fine-grained solution, but I don't see how it solves any problems better than the simple boolean keyword and some documentation. The constraint-based solution would also need docs so users could know whether wrapping a block in with parallel_backend(...) would use the requested backend or not.
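The four cases above can be summarized in one small mapping. This is a hypothetical sketch: `gil_free` and `requires_shared_mem` are proposed names from this thread, and the returned kwargs use the `allow_override` API from this PR; none of it is released joblib API.

```python
def constraints_to_parallel_kwargs(gil_free, requires_shared_mem):
    # shared memory forces a non-overridable in-process backend
    if requires_shared_mem:
        backend = 'threading' if gil_free else 'sequential'
        return {'backend': backend, 'allow_override': False}
    # otherwise any backend works; prefer threading when GIL-free
    backend = 'threading' if gil_free else None  # None -> library default
    return {'backend': backend, 'allow_override': True}
```

For example, `constraints_to_parallel_kwargs(True, False)` yields `{'backend': 'threading', 'allow_override': True}`, i.e. the `Parallel('threading', allow_override=True)` case above.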

@jcrist
Contributor Author
jcrist commented Jun 1, 2017

Any suggestion for the GIL-free declaration?

If we do settle on the constraint based approach, I think releases_gil would be better than gil_free. gil_free to me is a bit vague - is the backend "gil-free" (e.g. multiprocessing), or is the function "gil-free"? allow_threads might be another good choice.

@lesteve
Member
lesteve commented Jun 2, 2017

because the code inside the loop relies on shared memory for concurrent state mutation (never used in any sklearn code as far as I know and should be discouraged in general)

@ogrisel actually I think scikit-learn/scikit-learn#8672 does rely on this for example.

@ogrisel
Contributor
ogrisel commented Jun 2, 2017

@lesteve indeed, scikit-learn/scikit-learn#8672 makes it impossible to use a non-threading backend for random forest in scikit-learn master...

The clean solution would be to implement a new "map / reduce" or possibly "map / combine / reduce" semantics into joblib.

@GaelVaroquaux
Member
GaelVaroquaux commented Jun 2, 2017 via email

@jcrist
Contributor Author
jcrist commented Jun 2, 2017

Ok, the forward thinking argument convinced me. I'll write up a PR implementing this.

@GaelVaroquaux
Member
GaelVaroquaux commented Jun 2, 2017 via email

@jcrist
Contributor Author
jcrist commented Jul 19, 2017

Superseded by #537. Closing.

@jcrist jcrist closed this Jul 19, 2017
@jcrist jcrist deleted the allow_override branch July 19, 2017 20:29
5 participants