Initializing new random instances for each estimator instead of passing around the same one · Issue #25395 · scikit-learn/scikit-learn · GitHub


Open
joelostblom opened this issue Jan 14, 2023 · 8 comments

Comments

@joelostblom
joelostblom commented Jan 14, 2023

Describe the issue linked to the documentation

After reading the detailed and helpful section on "Controlling randomness", my understanding is that the benefit of passing an np.random.RandomState instance to each estimator is that different folds see different randomness during the CV process, which passing an integer would not provide. In the sample code, the random instance is defined globally and then passed to the random_state parameter of every function call. The drawback of this is that the order of operations now matters (as mentioned in the documentation): adding any code that uses the same globally defined random instance advances the random number generator and changes the results of any code that runs afterwards, which can be quite confusing and is not ideal for reproducibility.
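The order-of-operations issue described above can be illustrated with a minimal sketch (plain NumPy, no scikit-learn needed; the "unrelated line" stands in for any code added later that draws from the shared instance):

```python
import numpy as np

rng = np.random.RandomState(0)
baseline = np.random.RandomState(0).randint(100)  # first value of the stream

rng.randint(100)  # an unrelated line added later consumes one draw
shifted = rng.randint(100)  # now yields the *second* value of the stream

# the same seed always reproduces the same first value, but any extra draw
# from the shared instance shifts every result downstream
print(baseline, shifted)
```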

Instead of defining this random state globally and re-using it, would it not be more beneficial to pass a new instance to each estimator? It seems to me that this would combine the benefits of CV exploring different randomness in each fold with code that stays reproducible even when new code using the same random instance is inserted.

Suggest a potential alternative/fix

Instead of the current recommendation in the docs:

from numpy.random import RandomState
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = RandomState(0)
X, y = make_classification(random_state=rng)
rf = RandomForestClassifier(random_state=rng)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng)
rf.fit(X_train, y_train).score(X_test, y_test)

I would expect this to combine the benefits I mentioned above:

from numpy.random import RandomState
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng_init = 0
X, y = make_classification(random_state=RandomState(rng_init))
rf = RandomForestClassifier(random_state=RandomState(rng_init))
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=RandomState(rng_init))
rf.fit(X_train, y_train).score(X_test, y_test)

Apart from the increased verbosity, is there any downside to this approach (or is it effectively the same as passing an int, so it has the same downsides)? If not, would you be open to a PR mentioning this in the docs? I am not that well-versed in pseudo-random number generators in general. While I know that the sequence a generator produces can be beneficial for some applications, I don't know whether there is any benefit to advancing this sequence as a global variable between different estimator calls, or whether the sequence only matters within each estimator, in which case it would be fine to reinitialize it as I suggest above.

And in this particular example, isn't it only the estimator that needs the RandomState instance (so the calls to make_classification and train_test_split could receive an int instead, which would be less verbose)?

@glemaitre
Member

I think the current recommendation is fine when we deal with random generators. In the case you propose, I would recommend passing only an integer seed, which will internally generate the instance. But then this is out of the scope of the recommendation, since that section is about RandomState instances.

@joelostblom
Author

Thank you for the reply @glemaitre! Could you elaborate on this sentence:

I would recommend only passing an integral seed that will internally generate the instance

What would I pass to random_state in this case, a numpy Generator?

@glemaitre
Member

What would I pass to random_state in this case, a numpy Generator?

An integer. In your case, just pass rng_init. Internally, we are going to call np.random.RandomState(rng_init).
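For context, scikit-learn exposes this conversion as sklearn.utils.check_random_state, which estimators use internally to resolve their random_state parameter:

```python
import numpy as np
from sklearn.utils import check_random_state

# an int is wrapped into a freshly seeded RandomState instance
rng = check_random_state(0)
assert isinstance(rng, np.random.RandomState)

# an existing RandomState instance is passed through unchanged
existing = np.random.RandomState(0)
assert check_random_state(existing) is existing
```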

@glemaitre
Member

To be specific.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng_init = 0
X, y = make_classification(random_state=rng_init)
rf = RandomForestClassifier(random_state=rng_init)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng_init)
rf.fit(X_train, y_train).score(X_test, y_test)

@betatim
Member
betatim commented Jan 16, 2023

I think it is tricky. The top comment made me think of https://scikit-learn.org/stable/common_pitfalls.html#id2 which highlights the difference between passing an integer and a RandomState instance when performing cross-validation. Basically, when you clone an estimator that uses an integer as its seed, the cloned estimator and the original will see the same sequence of random numbers. This matters when you do a grid search (where estimators are cloned for each new parameter value). As the docs state, it probably makes sense not to use an integer in this case.

For your specific question: I'm not sure what the effect is of using the same sequence of random numbers (but independent instances), as in your second example. The reason all the consumers will receive the same sequence of numbers is that they all start from the same seed. Does this matter? I don't know. Probably the answer is "it depends".
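A small sketch of the int-versus-instance difference, using sklearn.utils.check_random_state (the helper estimators use to resolve random_state):

```python
import numpy as np
from sklearn.utils import check_random_state

# an integer seed re-creates the same stream on every use ...
print(check_random_state(0).randint(100, size=3))
print(check_random_state(0).randint(100, size=3))  # identical to the line above

# ... while a shared RandomState instance keeps advancing its single stream
rng = np.random.RandomState(0)
print(check_random_state(rng).randint(100, size=3))
print(check_random_state(rng).randint(100, size=3))  # stream continues, so it differs
```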

@thomasjpfan removed the Needs Triage label Feb 9, 2023
@avm19
Contributor
avm19 commented Apr 9, 2023

The top comment made me think of https://scikit-learn.org/stable/common_pitfalls.html#id2 which highlights the difference between passing an integer and a RandomState instance when performing cross-validation. Basically, when you clone an estimator that uses an integer as its seed, the cloned estimator and the original will see the same sequence of random numbers. This matters when you do a grid search (where estimators are cloned for each new parameter value). As the docs state, it probably makes sense not to use an integer in this case.

I think this documentation is out of date. The section on cloning says:

Moreover, a and b will influence each-other since they share the same internal RNG: calling a.fit will consume b’s RNG, and calling b.fit will consume a’s RNG, since they are the same.

However, I tried the code provided therein, and it seems like the RNG is somehow cloned together with the estimator. Am I missing anything?

from sklearn import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
import numpy as np

X, y = make_classification(n_samples=100, random_state=0)
rng = np.random.RandomState(0)
a = RandomForestClassifier(random_state=rng)
b = clone(a)
print(repr(a.random_state)) # RandomState(MT19937) at 0x7EFF69071A40
print(repr(b.random_state)) # RandomState(MT19937) at 0x7EFF688ED840 # Expected the same address
print(a.random_state.randint(10, size=10)) # [5 0 3 3 7 9 3 5 2 4]
print(b.random_state.randint(10, size=10)) # [5 0 3 3 7 9 3 5 2 4] # Expected a different sequence

I tried this on version 1.0.2 and on a two-week-old dev build. Should I open a DOC issue?


P.S. This behaviour (the RNG being automatically cloned) is desirable for reproducibility: we don't want parallel jobs to affect each other unpredictably and uncontrollably. Does this sound right? Also, I wonder if using a single RNG instance creates a bottleneck in multithreading or multiprocessing.
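As far as I understand, clone deep-copies estimator parameters, which would explain the observation above: the RandomState instance is duplicated along with its current internal state, so the copy is a distinct object emitting the same stream. A minimal sketch of that copying behaviour (using copy.deepcopy directly, not scikit-learn's actual clone code):

```python
import copy
import numpy as np

rng = np.random.RandomState(0)
rng_copy = copy.deepcopy(rng)

# distinct objects ...
assert rng is not rng_copy
# ... that start from identical internal state, so they emit the same stream
assert (rng.randint(10, size=10) == rng_copy.randint(10, size=10)).all()
```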

@betatim
Member
betatim commented Apr 11, 2023

Trying to bring us back on track to a concrete suggestion for a change (so we can decide yes or no on it). I think the suggestion from the top comment is probably a good idea:

from numpy.random import RandomState
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng_init = 0
X, y = make_classification(random_state=RandomState(rng_init))
rf = RandomForestClassifier(random_state=RandomState(rng_init))
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=RandomState(rng_init))
rf.fit(X_train, y_train).score(X_test, y_test)

However, I'm not sure that using the same seed for all the RNGs is a good or a bad thing. I think as long as you are using the instances in different applications it doesn't matter. But yeah, maybe not?

Something related is https://numpy.org/devdocs/reference/random/parallel.html#seedsequence-spawning (in particular the end of the linked section), which talks about how to spawn new generators from an existing one. It also links to more material on why schemes like seed + fold_idx for generating "reproducible seeds" are error-prone.
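For reference, the spawning pattern from that page looks roughly like this. Note that scikit-learn's random_state accepts an int or a RandomState instance rather than a new-style Generator, so this sketch derives an integer seed from each child sequence (my own adaptation, not an official recipe):

```python
from numpy.random import RandomState, SeedSequence

ss = SeedSequence(12345)
children = ss.spawn(3)  # statistically independent child sequences, e.g. one per fold

# derive a 32-bit seed from each child and build independent RandomState instances
rngs = [RandomState(child.generate_state(1)[0]) for child in children]
print([rng.randint(100) for rng in rngs])
```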

@avm19
Contributor
avm19 commented Apr 11, 2023

Trying to bring us back on track to a concrete suggestion for a change (so we can decide yes or no on it)

@betatim I am afraid that the current "Controlling randomness" reference does not accurately describe the current behaviour. I am bringing this up to prevent making a decision based on a wrong premise. To be specific, this premise:

After reading the detailed and helpful section on "Controlling randomness", my understanding is that the benefit of passing an np.random.RandomState instance to each estimator is the different randomness for different folds during the CV process, which passing an integer would not provide.

I believe that estimator instances used for different folds do not have different randomness (so there is no difference in that regard between passing an RNG instance and passing an integer as random_state). Here is code that I believe demonstrates this:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestClassifier
import numpy as np

rng = np.random.RandomState(0)  # from numpy.random import RandomState
X, y = make_classification(random_state=rng)
rf = RandomForestClassifier(random_state=rng)

d = cross_validate(rf, X, y, return_estimator=True, cv=2)
rngs = [e.random_state for e in d['estimator']]
# estimators corresponding to different CV runs have DIFFERENT but IDENTICAL RNGs:
print(rngs[0] is rngs[1]) # False
print(all(rngs[0].randint(10, size=10) == rngs[1].randint(10, size=10))) # True
