Add allow_override kwarg to Parallel #524
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master     #524      +/-   ##
==========================================
+ Coverage   93.72%   93.75%   +0.03%
==========================================
  Files          37       37
  Lines        4779     4807      +28
==========================================
+ Hits         4479     4507      +28
  Misses        300      300
Continue to review full report at Codecov.
The new release of pytest adds a plugin to capture all warnings and output them at the end. This was causing a test failure, as we were checking that a warning was raised in an external process by checking the captured stderr. Since the warning is raised in an external process we can't use `pytest.warns`, and none of the current pytest-warnings configuration settings allow turning off capturing in a way that works for this test. For now we disable this plugin entirely.
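For reference, pytest allows disabling a named plugin outright with the -p no:<name> option; a minimal sketch of doing this for the warnings plugin from setup.cfg (the exact mechanism used in this PR may differ):

[tool:pytest]
# disable pytest's warnings-capture plugin for the whole test suite
addopts = -p no:warnings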
If true, wrapping a call to `Parallel` with a `parallel_backend` contextmanager will use the active backend instead of the backend specified in the `Parallel` call. Default is False (existing behavior). This allows overriding what backend is used inside libraries that depend on joblib (e.g. scikit-learn), and is motivated by the `dask.distributed` joblib backend.

This also changes the kwarg default for `n_jobs` to None, to allow checking whether the user specified a number in the call to `Parallel` (which is then used); otherwise the global default is used. This is better than the existing logic, which checked if it was `1`, meaning that if no backend was specified but `n_jobs` was set to `1`, `n_jobs=DEFAULT_N_JOBS` would be used instead. The default behavior remains the same as before.
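A minimal standalone sketch of the default-handling change described above (illustrative only; names and structure are simplified and are not joblib's actual code):

def old_effective_n_jobs(n_jobs=1, fallback_n_jobs=-1):
    # Old behavior: n_jobs defaulted to 1, so an explicit n_jobs=1 could not be
    # distinguished from "unspecified" and was silently replaced by the fallback.
    if n_jobs == 1:
        return fallback_n_jobs
    return n_jobs

def new_effective_n_jobs(n_jobs=None, fallback_n_jobs=-1):
    # New behavior: n_jobs defaults to None, so any explicit value (even 1) wins;
    # only a truly unspecified n_jobs falls back to the global default.
    if n_jobs is not None:
        return n_jobs
    return fallback_n_jobs

print(old_effective_n_jobs(n_jobs=1))  # -1: the explicit 1 is ignored
print(new_effective_n_jobs(n_jobs=1))  # 1: the explicit 1 is respected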
I followed a bit the discussion in scikit-learn and dask. The second option you propose in the summary of this PR makes more sense to me: allow the parallel_backend context manager to override the backend and the n_jobs, and use the same default value for n_jobs as the one in Parallel. From the example you provided, I think this is the most intuitive.
@with_multiprocessing
def test_n_jobs_defaults():
    # n_jobs specified, use that
    p = Parallel(n_jobs=1)
Maybe use a value different from the default (1)
The point of this test was to ensure that passing in `n_jobs=1` wasn't ignored. In the existing code, providing no backend but specifying `n_jobs=1` results in `n_jobs=-1`.
Now ``allow_override`` dictates whether both the backend and ``n_jobs`` can be overridden. This required changing the default value of ``n_jobs`` for ``parallel_backend``, but that should be fine, as the docs warn that it may change at any time. This also changes the defaults for ``backend`` and ``n_jobs`` so that ``parallel_backend`` is ignored when they are not provided. This is more consistent, as the ``parallel_backend`` contextmanager is now always ignored unless ``allow_override=True``.
I have adjusted the code accordingly. Example:

# Uses threading
with parallel_backend('threading'):
    Parallel(backend='multiprocessing', allow_override=True, ...)(...)

# Uses multiprocessing since `allow_override=False`
with parallel_backend('threading'):
    Parallel(backend='multiprocessing', ...)(...)

# Uses specified n_jobs in parallel_backend (2)
with parallel_backend('threading', n_jobs=2):
    Parallel(backend='multiprocessing', allow_override=True, n_jobs=10)(...)

# Uses default n_jobs (1), since none specified in Parallel and allow_override=False
with parallel_backend('threading', n_jobs=2):
    Parallel(backend='multiprocessing')(...)

# Uses DEFAULT_BACKEND (multiprocessing) since allow_override=False
# This is different than the current release where `threading` would be used
with parallel_backend('threading'):
    Parallel()(...)
I am wondering if it would make more sense to have:

with parallel_backend('threading', n_jobs=2, override=True):
    Parallel(backend='multiprocessing', ...)(...)

instead. It would give more control to the end user (over the library writer). WDYT?
I am thinking that the concept of "override" is very general, and doesn't really say what the actual constraint is. It seems to me that it is too weak a semantic.

Could we consider instead specifying backends in a richer way? I see two constraints that seem important: 1) does the code need GIL-free execution? 2) does the code need shared memory?

If each backend specifies whether it is GIL-free or not, and whether it is shared-mem or not, the user can specify not the backend but their requirements on the parallel code, and the override will then make sense.

What do people think?
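A self-contained sketch of how such declared constraints could map onto joblib's existing backends (a hypothetical helper to illustrate the idea, not a proposed API):

def pick_backend(code_releases_gil=False, needs_shared_memory=False):
    # Hypothetical mapping of declared constraints onto existing backends.
    if needs_shared_memory:
        return 'threading'        # threads share the parent process's memory
    if code_releases_gil:
        return 'threading'        # GIL-free inner code parallelizes well with threads
    return 'multiprocessing'      # GIL-bound code needs separate processes to scale

print(pick_backend(code_releases_gil=True))   # threading
print(pick_backend())                         # multiprocessing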
Indeed it would be saner for the random forest code to state that the code under the parallel loop is mostly GIL-free, and then let joblib select the threading backend as the default backend as a consequence, while at the same time making it possible for users to select a different backend. Any suggestion for the GIL-free declaration?
Possible suggestions:
My main issue with that (as I mentioned in the originating issue) is that I find it more confusing as a joblib user (and admittedly a novice one). The intended use here is twofold:
For both of these I find a simple boolean "can the backend be overridden by `parallel_backend`" to be easier to understand and reason about. I think a more fine-grained solution is a bit overkill.

The two constraints you mentioned with the existing backends boil down to:
Going through the 4 possible combinations of these:
I'm not 100% against the more fine-grained solution, but I don't see how it solves any problems better than the simple boolean keyword and some documentation. The constraint-based solution would also need docs so users could know whether wrapping a block in `parallel_backend` would have any effect.
If we do settle on the constraint-based approach, I think
@ogrisel actually I think scikit-learn/scikit-learn#8672 does rely on this for example.
@lesteve indeed scikit-learn/scikit-learn#8672 makes it impossible to use a non-threading backend for random forest in scikit-learn master... The clean solution would be to implement a new "map / reduce" or possibly "map / combine / reduce" semantics into joblib.
For both of these I find a simple boolean "can the backend be overridden by parallel_backend" to be easier to understand and reason about. I think a more fine-grained solution is a bit overkill.

As an algorithm developer, I need to know what I can assume on the call to Parallel. There are many possible variants, from the safest (no shared memory, execution-order constraint, releasing the GIL) to the wildest (no control on the order of results, no shared disks or memory). By specifying the constraints here we only explore a small fraction of these. If, as an algorithm developer, I have just a binary choice on whether or not the environment will override my choice, I don't know what kind of assumptions that I am making will be broken. This will probably become more and more of a problem as more backends arrive that break my assumptions. If we finally manage to make Parallel an iterator, there would be real value in returning results in an unordered way. For map-reduce-like applications it would enable emptying the return queue faster. But if new backends that do this are applied to code that assumed ordered returns, they could very well break it.

I am trying to think in terms of the future, so that existing code can keep working as Parallel evolves.

Maybe the right thing to do would be, in the long term, to deprecate the "backend" argument and focus only on a variety of constraints that make sense.
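To make the ordering concern concrete, a small standalone example (using concurrent.futures rather than joblib) showing the difference between submission-ordered and completion-ordered results:

import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def work(i):
    time.sleep((5 - i) * 0.01)  # later submissions finish first
    return i

with ThreadPoolExecutor(max_workers=5) as ex:
    # Ordered: results come back in submission order, as Parallel guarantees today.
    ordered = list(ex.map(work, range(5)))                    # [0, 1, 2, 3, 4]
    # Unordered: results come back as they complete.
    futures = [ex.submit(work, i) for i in range(5)]
    unordered = [f.result() for f in as_completed(futures)]   # likely [4, 3, 2, 1, 0]

# Code that indexes results positionally would silently break if a backend
# switched from the first behavior to the second.
print(ordered, unordered)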
Ok, the forward-thinking argument convinced me. I'll write up a PR implementing this.
Ok, the forward-thinking argument convinced me. I'll write up a PR implementing this.

Thanks! You rock!

And bear in mind that I love argued disagreement, so if you feel that I am pushing forward wrong choices, I want to hear it.
Superseded by #537. Closing.
If true, wrapping a call to `Parallel` with a `parallel_backend` contextmanager will use the active backend instead of the backend specified in the `Parallel` call. Default is False (existing behavior). This allows overriding what backend is used inside libraries that depend on joblib (e.g. scikit-learn), and is motivated by the `dask.distributed` joblib backend (see scikit-learn/scikit-learn#8804).

The `n_jobs` behavior with `allow_override` is a bit wonky, as `parallel_backend` may or may not specify `n_jobs`, and the default for `parallel_backend` is `-1` but the default for `Parallel` is `1`. The approach taken here is that `allow_override` only means the backend can be overridden, not the `n_jobs` in the call to `Parallel` (if specified). Alternative solutions would be:

- Remove the `n_jobs` kwarg from `parallel_backend`, as `n_jobs` and `backend` are two separate concepts.
- Switch the kwarg default for `n_jobs` in `parallel_backend` to `None`. This would allow checking if a global `n_jobs` was set, and use that to override the `n_jobs` specified in `Parallel` if `allow_override` is True. The logic in `Parallel` would then look roughly like the sketch below. The downside of this is that it would change the fallback behavior from using `n_jobs=-1` (falling back to the `parallel_backend` default) to `n_jobs=1` (falling back to the `Parallel` default). Since the `parallel_backend` contextmanager has a warning that it may change at any time, this may be acceptable.
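A rough sketch of what that resolution logic might look like (a hypothetical standalone helper, not the code actually written for this PR):

DEFAULT_BACKEND = 'multiprocessing'

def resolve(backend=None, n_jobs=None, allow_override=False,
            ctx_backend=None, ctx_n_jobs=None):
    # Hypothetical resolution of Parallel's arguments against an enclosing
    # parallel_backend(ctx_backend, n_jobs=ctx_n_jobs) context manager.
    if allow_override and ctx_backend is not None:
        backend = ctx_backend        # the context manager wins over the call site
        if ctx_n_jobs is not None:
            n_jobs = ctx_n_jobs      # ...and so does its global n_jobs
    if backend is None:
        backend = DEFAULT_BACKEND
    if n_jobs is None:
        n_jobs = 1                   # fall back to Parallel's default, not -1
    return backend, n_jobs

backend, n_jobs = resolve(backend='multiprocessing', allow_override=True,
                          ctx_backend='threading', ctx_n_jobs=2)
print(backend, n_jobs)               # threading 2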