Are there any pitfalls by combining `n_jobs` and `random_state`? · Issue #30811 · scikit-learn/scikit-learn · GitHub

Closed
adosar opened this issue Feb 11, 2025 · 4 comments
Labels
Needs Triage Issue requires triage

Comments

@adosar
adosar commented Feb 11, 2025

Discussed in #30809

Originally posted by adosar February 11, 2025
The Controlling randomness guide discusses how to properly control randomness for an estimator, for a CV splitter, or for both. However, it does not mention whether random_state and n_jobs > 1 interact in any unexpected way.

Let's consider a typical use case where a user cross-validates a RandomForestClassifier with KFold:

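import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score
estimator = RandomForestClassifier(random_state=np.random.RandomState(1))  # Recommended to pass RandomState instance.
kfold = KFold(shuffle=True, random_state=42)  # Recommended to pass int.
cross_val_score(estimator, ..., cv=kfold, n_jobs=-1)  # `...` stands for the elided data arguments.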

Since n_jobs=-1, multiple cores are used for cross-validation (e.g. one core per fold).

Would the same RNG state be used for the different folds, since during multiprocessing the estimator, and hence the rng passed to it, are copied via fork?

github-actions bot added the Needs Triage label Feb 11, 2025
@adrinjalali
Member

We've had many discussions in this regard. You can have a look at these: scikit-learn/enhancement_proposals#24, scikit-learn/enhancement_proposals#88 and other related discussions. This is a rather complex issue we've been dealing with.

Closing as duplicate.

@adosar
Author
adosar commented Feb 11, 2025

@adrinjalali In the linked discussion and the links therein, there is no mention of what happens when n_jobs > 1. That was the motivation behind this question/issue.

@adrinjalali
Member

Somewhere in those discussions we also touch on issues with parallelism.

You can check for yourself pretty easily, how things go south in some cases:

In [1]: import numpy as np

In [2]: from joblib import delayed, Parallel

In [3]: def f(rng):
   ...:     return rng.integers(0, 10, 10)
   ...: 

In [4]: rng = np.random.default_rng()

In [5]: Parallel(n_jobs=5)(delayed(f)(rng) for x in range(10))
Out[5]: 
[array([0, 7, 1, 6, 3, 6, 7, 9, 1, 6]),
 array([0, 7, 1, 6, 3, 6, 7, 9, 1, 6]),
 array([0, 7, 1, 6, 3, 6, 7, 9, 1, 6]),
 array([0, 7, 1, 6, 3, 6, 7, 9, 1, 6]),
 array([0, 7, 1, 6, 3, 6, 7, 9, 1, 6]),
 array([0, 7, 1, 6, 3, 6, 7, 9, 1, 6]),
 array([0, 7, 1, 6, 3, 6, 7, 9, 1, 6]),
 array([0, 7, 1, 6, 3, 6, 7, 9, 1, 6]),
 array([0, 7, 1, 6, 3, 6, 7, 9, 1, 6]),
 array([0, 7, 1, 6, 3, 6, 7, 9, 1, 6])]

As you can see, the results are identical, since the rng is cloned for each task. Right now, the only way to make sure things stay random, with separate parallel processes generating different numbers, is to pass None as the random seed.
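The cloning effect can be reproduced without spawning any workers. The sketch below assumes that a process-based backend ships each task's arguments by pickling them, and imitates that with an explicit pickle round trip; the helper name `draw` and the choice of three copies are illustrative only.

```python
import pickle

import numpy as np


def draw(rng):
    # Same kind of task as `f` in the session above.
    return rng.integers(0, 10, 10)


rng = np.random.default_rng()

# Serialize the generator once per task. The parent generator is never
# advanced in between, so every payload carries the same internal state.
payloads = [pickle.dumps(rng) for _ in range(3)]
copies = [pickle.loads(p) for p in payloads]

# Each copy starts from the same state and therefore yields the same draws.
results = [draw(c) for c in copies]
all_identical = all(np.array_equal(results[0], r) for r in results[1:])
print(all_identical)  # True
```

Every copy is a distinct object with the same state, which is why the ten arrays in the output above match.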

@lesteve
Member
lesteve commented Feb 20, 2025

Probably this joblib example is relevant as well.
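For plain NumPy code outside scikit-learn's `random_state` handling, one way to get streams that are both reproducible and different per task is to spawn child seeds from a single `np.random.SeedSequence`. This is a minimal sketch of that NumPy feature, not a description of how scikit-learn or the linked example works; the seed `42`, the four tasks, and the helper `draw` are assumptions made for illustration.

```python
import numpy as np


def draw(rng):
    # One unit of work that consumes random numbers.
    return rng.integers(0, 10, 10)


def make_task_rngs(seed, n_tasks):
    # Hypothetical helper: derive one independent child seed per task
    # from a single root seed, then build a generator from each child.
    children = np.random.SeedSequence(seed).spawn(n_tasks)
    return [np.random.default_rng(child) for child in children]


# Each task gets its own generator instead of a copy of a shared one.
results = [draw(r) for r in make_task_rngs(42, 4)]

# Rebuilding from the same root seed reproduces the same per-task draws.
again = [draw(r) for r in make_task_rngs(42, 4)]
reproducible = all(np.array_equal(a, b) for a, b in zip(results, again))
print(reproducible)  # True
```

With joblib, each generator from `make_task_rngs` could be passed to its own `delayed` call, so that a pickled copy in a worker is a copy of that task's stream only.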
