Are there any pitfalls by combining n_jobs and random_state? #30811
Comments
We've had many discussions in this regard. You can have a look at these: scikit-learn/enhancement_proposals#24, scikit-learn/enhancement_proposals#88 and other related discussions. This is a rather complex issue we've been dealing with. Closing as duplicate.
@adrinjalali From the linked discussion and the links therein, there is no mention of what happens when ...
Somewhere in those discussions we also touch on issues with parallelism. You can check for yourself pretty easily how things go south in some cases:
In [1]: import numpy as np
In [2]: from joblib import delayed, Parallel
In [3]: def f(rng):
   ...:     return rng.integers(0, 10, 10)
   ...:
In [4]: rng = np.random.default_rng()
In [5]: Parallel(n_jobs=5)(delayed(f)(rng) for x in range(10))
Out[5]:
[array([0, 7, 1, 6, 3, 6, 7, 9, 1, 6]),
array([0, 7, 1, 6, 3, 6, 7, 9, 1, 6]),
array([0, 7, 1, 6, 3, 6, 7, 9, 1, 6]),
array([0, 7, 1, 6, 3, 6, 7, 9, 1, 6]),
array([0, 7, 1, 6, 3, 6, 7, 9, 1, 6]),
array([0, 7, 1, 6, 3, 6, 7, 9, 1, 6]),
array([0, 7, 1, 6, 3, 6, 7, 9, 1, 6]),
array([0, 7, 1, 6, 3, 6, 7, 9, 1, 6]),
array([0, 7, 1, 6, 3, 6, 7, 9, 1, 6]),
array([0, 7, 1, 6, 3, 6, 7, 9, 1, 6])]
As you see, those are identical results, since the ...
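One common way to avoid the identical draws shown above is to give each task its own child seed instead of passing the same generator to every task. The following is only a minimal sketch, not what scikit-learn does internally: the helper names `draw` and `independent_draws` are hypothetical, and it assumes NumPy's `SeedSequence.spawn` is an acceptable way to derive per-task streams.

```python
import numpy as np
from joblib import Parallel, delayed


def draw(seed_seq):
    # Each task builds its own Generator from its own child SeedSequence,
    # so no two tasks start from the same state.
    rng = np.random.default_rng(seed_seq)
    return rng.integers(0, 10, 10)


def independent_draws(seed, n_tasks, n_jobs=2):
    # Hypothetical helper: spawn() derives one child sequence per task.
    # The children are reproducible from `seed` and are designed to yield
    # non-overlapping streams with very high probability.
    children = np.random.SeedSequence(seed).spawn(n_tasks)
    return Parallel(n_jobs=n_jobs)(delayed(draw)(child) for child in children)


if __name__ == "__main__":
    for arr in independent_draws(seed=0, n_tasks=10):
        print(arr)
```

Because the children are derived from a single root seed, rerunning with the same `seed` should reproduce the same arrays, while the arrays themselves differ from task to task.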
Probably this joblib example is relevant as well.
Discussed in #30809
Originally posted by adosar February 11, 2025
In Controlling randomness, the guide discusses how to properly control randomness for an estimator, for CV, or when using both. However, there is no mention of whether random_state and n_jobs > 1 interact in any unexpected way.

Let's consider a typical use case where a user cross-validates a RandomForestClassifier with KFold:

Since n_jobs=-1, multiple cores will be used for cross validation (e.g. 1 core per fold).

Would the same state be used for the different folds, since during multiprocessing the estimator, and hence the rng passed to it, is copied via fork?
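The snippet from the original post is not preserved in this copy. The kind of setup the question describes can be sketched roughly as follows; this is an illustration under stated assumptions, not the poster's code. The synthetic dataset, the integer seeds, the small forest size, and the helper name `cv_scores` are all choices made here, and the sketch only compares a serial run against a parallel one when integer seeds are used.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score


def cv_scores(n_jobs):
    # Assumption: a small synthetic dataset stands in for the user's data.
    X, y = make_classification(n_samples=200, random_state=0)
    # Assumption: integer seeds for both the estimator and the splitter.
    clf = RandomForestClassifier(n_estimators=10, random_state=0)
    cv = KFold(n_splits=5, shuffle=True, random_state=0)
    # n_jobs controls how many folds are evaluated in parallel.
    return cross_val_score(clf, X, y, cv=cv, n_jobs=n_jobs)


if __name__ == "__main__":
    serial = cv_scores(n_jobs=1)
    parallel = cv_scores(n_jobs=2)
    print(serial)
    print(parallel)
    print("same scores:", np.allclose(serial, parallel))
```

With integer seeds, every clone of the estimator is seeded the same way regardless of which process fits it, so this comparison does not exercise the case of passing a shared generator instance, which is the case the joblib session in the comments illustrates.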