Initializing new random instances for each estimator instead of passing around the same one #25395
I think the current recommendation is fine when we deal with random generators. In the case that you propose, I would recommend only passing an integral seed that will internally generate the instance. But then, this is out of the scope of the recommendation, since the section is about …
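A minimal sketch of what "an integral seed that will internally generate the instance" means in practice: scikit-learn's public `check_random_state` helper performs this conversion (whether every estimator takes exactly this code path internally is an assumption here):

```python
from sklearn.utils import check_random_state

# an int is converted into a brand-new RandomState instance each time
rng_a = check_random_state(42)
rng_b = check_random_state(42)
print(rng_a is rng_b)             # False -- two independent instances...
print(rng_a.randint(10, size=3))  # ...that start from the same seed,
print(rng_b.randint(10, size=3))  # so they produce the same numbers
```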
Thank you for the reply @glemaitre! Could you elaborate on this sentence:

> I would recommend only passing an integral seed that will internally generate the instance.

What would I pass to `random_state`?
An integer. In your case, just passing …
To be specific:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng_init = 0
X, y = make_classification(random_state=rng_init)
rf = RandomForestClassifier(random_state=rng_init)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng_init)
rf.fit(X_train, y_train).score(X_test, y_test)
```
I think it is tricky. The top comment made me think of https://scikit-learn.org/stable/common_pitfalls.html#id2, which highlights the difference between passing an integer and a `RandomState` instance when performing cross-validation. Basically, when you clone an estimator that uses an integer as seed, the cloned estimator and the original will see the same sequence of random numbers. This matters when you do grid search (where estimators are cloned for each new parameter value). As the docs state, it probably makes sense to not use an integer in this case.

For your specific question: I'm not sure what the result is of using the same sequence of random numbers (but independent instances) like in your second example. The reason all the users will receive the same sequence of numbers is because they all start from the same seed. Does this matter? I don't know. Probably the answer is "it depends".
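To make that cloning point concrete, a tiny sketch (my own illustration, not taken from the linked page):

```python
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier

a = RandomForestClassifier(random_state=0)
b = clone(a)
# the integer seed is copied verbatim, so the original and the clone
# will see the exact same stream of random numbers when fitted
print(a.random_state, b.random_state)  # 0 0
```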
I think this documentation is out of date. The section on cloning says that an estimator and its clone share the same `RandomState` instance, so that using one advances the RNG of the other. However, I tried the code provided therein, and it seems like the RNG is somehow copied together with the estimator. Am I missing anything?

```python
import numpy as np
from sklearn import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, random_state=0)

rng = np.random.RandomState(0)
a = RandomForestClassifier(random_state=rng)
b = clone(a)
print(repr(a.random_state))  # RandomState(MT19937) at 0x7EFF69071A40
print(repr(b.random_state))  # RandomState(MT19937) at 0x7EFF688ED840 -- expected the same address
print(a.random_state.randint(10, size=10))  # [5 0 3 3 7 9 3 5 2 4]
print(b.random_state.randint(10, size=10))  # [5 0 3 3 7 9 3 5 2 4] -- expected a different sequence
```

I tried this on version …

P.S. This behaviour (the RNG being copied automatically) is desirable for reproducibility: we don't want parallel jobs to affect each other unpredictably and uncontrollably. Does this sound right? Also, I wonder if using a single RNG instance creates a bottleneck in multithreading or multiprocessing.
Trying to bring us back on track to a concrete suggestion for a change (so we can decide yes or no on it). I think the suggestion from the top comment is probably a good idea:

```python
from numpy.random import RandomState
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng_init = 0
X, y = make_classification(random_state=RandomState(rng_init))
rf = RandomForestClassifier(random_state=RandomState(rng_init))
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=RandomState(rng_init))
rf.fit(X_train, y_train).score(X_test, y_test)
```

However, I'm not sure whether using the same seed for all the RNGs is a good or a bad thing. I think as long as you are using the instances in different applications it doesn't matter. But yeah, maybe not? Something related is https://numpy.org/devdocs/reference/random/parallel.html#seedsequence-spawning (in particular the end of the linked-to section), which talks about how to spawn new generators from an existing one. It also links to more material about why …
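For reference, a minimal sketch of the `SeedSequence` spawning pattern described at that link (standard NumPy API; the variable names are just illustrative):

```python
import numpy as np

# one root seed deterministically produces independent child seeds
root = np.random.SeedSequence(0)
children = root.spawn(3)

# each child seeds its own generator, so the streams don't overlap
generators = [np.random.default_rng(child) for child in children]
for g in generators:
    print(g.integers(10, size=5))  # three different, reproducible streams
```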
@betatim I am afraid that the current "Controlling randomness" reference does not truthfully describe the current behaviour. I am bringing this up to prevent making a decision based on a wrong premise. To be specific, the premise that passing a `RandomState` instance gives the estimators different randomness across CV folds.

I believe that estimator instances used for different folds do not have different randomness (therefore, there is no difference in that regard between passing an RNG instance or an integer as `random_state`):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

rng = np.random.RandomState(0)
X, y = make_classification(random_state=rng)
rf = RandomForestClassifier(random_state=rng)

d = cross_validate(rf, X, y, return_estimator=True, cv=2)
rngs = [e.random_state for e in d['estimator']]
# estimators corresponding to different CV runs have DIFFERENT but IDENTICAL RNGs:
print(rngs[0] is rngs[1])  # False
print(all(rngs[0].randint(10, size=10) == rngs[1].randint(10, size=10)))  # True
```
Describe the issue linked to the documentation
After reading the detailed and helpful section on "Controlling randomness", my understanding is that the benefit of passing an `np.random.RandomState` instance to each estimator is the different randomness for different folds during the CV process, which passing an integer would not provide. In the sample code, the random instance is defined globally and then passed to the `random_state` parameter of all function calls. The drawback of this is that the order of operations now matters (as mentioned in the documentation), and the addition of any code that uses the same globally defined random instance variable will advance the random number generator and change the results of any code that is run afterwards, which can be quite confusing and not ideal for reproducibility.

Instead of defining this random state globally and re-using it, would it not be more beneficial to pass a new instance to each estimator? It seems to me that this would combine the benefits of the CV exploring different randomness in each fold and of having the code reproducible regardless of the insertion of new code that uses the same random instance.
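As a minimal illustration of that ordering problem (plain NumPy; the leading numbers match the clone example above):

```python
import numpy as np

rng = np.random.RandomState(0)
print(rng.randint(10, size=3))  # [5 0 3]

# any newly inserted code that consumes the shared instance...
_ = rng.randint(10, size=100)

# ...changes what every later call sees
print(rng.randint(10, size=3))  # no longer [5 0 3]
```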
Suggest a potential alternative/fix
Instead of the current recommendation in the docs:
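Roughly, the shared-instance pattern being referred to (reconstructed from the snippets elsewhere in this thread, so a sketch rather than a verbatim quote of the docs):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)  # one global instance shared by every call
X, y = make_classification(random_state=rng)
rf = RandomForestClassifier(random_state=rng)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=rng)
rf.fit(X_train, y_train).score(X_test, y_test)
```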
I would expect this to combine the benefits I mentioned above:
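That is, a fresh `RandomState` per call, built from the same seed each time (this is the snippet @betatim quotes above):

```python
from numpy.random import RandomState
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng_init = 0
X, y = make_classification(random_state=RandomState(rng_init))
rf = RandomForestClassifier(random_state=RandomState(rng_init))
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=RandomState(rng_init))
rf.fit(X_train, y_train).score(X_test, y_test)
```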
Apart from the increased verbosity, is there any downside to this approach (or is it even the same as passing an `int`, so it has the same downsides)? If not, would you be open to a PR mentioning this in the docs? I am not that well-versed with pseudo-random number generators in general, so while I know that the sequence generated by the number generator can be beneficial for some applications, I don't know if there are any benefits to advancing this sequence as a global variable between different estimator calls, or if it is more important within each estimator and it is therefore fine to reinitialize the sequence as I suggest above.

And in this particular example, isn't it only the estimator that we need to pass the `RandomState` instance to (so the calls to `make_classification` and `train_test_split` could receive an `int` instead, which would make it less verbose)?