Cloned estimators have identical randomness but different RNG instances #26148
As far as I could see, when an estimator is cloned, the clone ends up with identical randomness but a different RNG instance. There are several components to the issue: (1) cloned estimators have identical randomness, which is a problem for cross-validation, and (2) the documentation claims that clones share one and the same RNG instance, which they do not.
As for (2), the documentation says that clones should refer to one and the same RNG instance. However, if we implement this, don't we risk opening the Pandora's box of concurrency and race-condition issues, and thereby killing any hope of reproducibility? I think it is the documentation that should be changed for (2), not the code. As for (1), cross-validation routines seem to follow the "parallel-delayed-clone" pattern: scikit-learn/sklearn/model_selection/_validation.py, lines 266 to 268 in 9aaed49.
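As an illustration of that pattern, here is a rough, self-contained sketch (not the exact scikit-learn source; `fit_and_score` is a placeholder for the internal helper): each fold fits its own clone, and every clone receives a deep copy of the user-supplied `RandomState`.

```python
# Rough sketch of the "parallel-delayed-clone" pattern used by cross-validation routines
# (illustrative only, not the exact scikit-learn source).
import numpy as np
from joblib import Parallel, delayed
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import KFold

X, y = make_classification(random_state=0)
estimator = SGDClassifier(random_state=np.random.RandomState(0))

def fit_and_score(est, train, test):
    # each fold works on an independent clone; clone() deep-copies the RandomState
    return clone(est).fit(X[train], y[train]).score(X[test], y[test])

scores = Parallel(n_jobs=2)(
    delayed(fit_and_score)(estimator, train, test)
    for train, test in KFold(n_splits=5).split(X)
)
print(scores)
```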
Thus, one solution for (1) would be to ensure that cloned estimators have different randomness. This randomness should be controllable. @betatim mentioned what appears to be a nice solution. I quote him:
I did not have a chance to look deeper into all this, but the solution could be a special treatment of `random_state` in `clone`:

```python
for name, param in new_object_params.items():
    if name == "random_state" and type(param) is RandomState:
        new_object_params[name] = param.spawn()
    else:
        new_object_params[name] = clone(param, safe=False)
```

P.S. To sum up, the proposed solution is to ensure different randomness and different RNG instances. |
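Worth noting: `np.random.RandomState` has no `spawn()` method, so the snippet above reads as pseudocode. NumPy's newer `Generator` API does provide spawning (`Generator.spawn`, available in NumPy >= 1.25); a minimal sketch of what spawning looks like there:

```python
# Sketch of "spawning" with NumPy's Generator API (NumPy >= 1.25);
# RandomState itself does not provide spawn(), so the quoted proposal is pseudocode.
import numpy as np

parent = np.random.default_rng(42)
child_a, child_b = parent.spawn(2)         # statistically independent child generators
print(child_a.random(), child_b.random())  # different values, reproducibly derived from seed 42
```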
So @avm19 is right about what actually happens when a `RandomState` instance is passed. Re: random state in scikit-learn, @NicolasHug has done a lot of work, which unfortunately didn't end up anywhere since we didn't manage to reach a consensus (xref SLEP proposal: https://github.com/scikit-learn/enhancement_proposals/pull/24/files). For now, a PR to fix the documentation to match the actual code is welcome. Also, note that if you want the estimators to have different random states, simply don't pass a random state object to them, or pass different seeds. |
Are you sure? I don't think any copy of the `RandomState` instance happens in `clone`: the clone should end up holding a reference to the same object, not a new one. I think we just need to improve our doc. |
@ogrisel I think yes; the following code prints `False` twice:

```python
# %%
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_features=5, random_state=42)

# %%
rng = np.random.RandomState(10)
rng2 = clone(rng, safe=False)
print(id(rng) == id(rng2))

# %%
# According to the docs https://scikit-learn.org/stable/common_pitfalls.html#id2
# in particular the subsection on cloning for estimators these two estimators
# should influence each other. They share the `rng` instance.
sgd = SGDClassifier(random_state=rng)
sgd2 = clone(sgd)
print(id(sgd.random_state) == id(sgd2.random_state))
```

If the user passes an RNG instance, the clone ends up with its own copy of it rather than a reference to the same object. |
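Continuing that snippet, one can also check the other half of the issue title (an added illustration, not part of the original comment): the clones hold different RNG instances but identical randomness, so fitting both yields identical models.

```python
# Continuing the snippet above: the deep-copied RNGs start from the same state,
# so the two estimators are trained identically despite being distinct objects.
sgd.fit(X, y)
sgd2.fit(X, y)
print(np.allclose(sgd.coef_, sgd2.coef_))  # True: identical randomness, different instances
```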
I agree with what Olivier said in terms of what should happen (shared reference). This is also what the docs say should happen. However, I don't think this is what actually happens. Otherwise code like https://gist.github.com/betatim/66b6ee6a780ec2e1c54f653eff198d9d would work (in the sense of giving different results for each clone). As a result I don't think the docs need improving; instead we need to improve the code. |
And because I've forgotten twice already: thanks for digging into this, reading the docs and reporting this @avm19. This is a topic with many subtleties where we need to bring what should happen, what does happen, and what the docs say into alignment. |
For that we'd need to revive the SLEP. |
I am extremely confused. We've had plenty of discussions with @GaelVaroquaux, @amueller, @ogrisel, @agramfort and @adrinjalali [1] about this behaviour and we all had the same shared (wrong???) understanding that this snippet passes, i.e. that the RNG instance is shared between clones:

```python
rng = np.random.RandomState(0)
rf = RandomForestClassifier(random_state=rng)
rf2 = clone(rf)
assert check_random_state(rf.random_state) is check_random_state(rf2.random_state)
```

So we either got it all wrong all this time, or something changed in the way we clone. The current state (docs and code disagreeing) is unfortunate; I'm still in the denial phase right now - it has to be a change somewhere in the code.

Whether we consider this a bugfix or not, implementing the behaviour claimed by the docs is going to be a massive behaviour change - and a silent one. As for SLEP 11, I never submitted it but I wrote a whole new version in https://github.com/NicolasHug/enhancement_proposals/blob/random_state_v2/slep011/proposal.rst. It (wrongly?) assumes the behaviour described in the docs.

[1] in scikit-learn/enhancement_proposals#24, in #18363, and offline |
I think it is worth slowing down, because it is a hairy issue and the impact is also big. The good thing is that it's probably been this way for a while, so spending an extra moment to disentangle things is not a big cost. I also thought that what the docs describe is how it should be as well as how it is. However, Nicolas' code fails for me on version 1.0.2 (same with the SGD based example). This is doubly weird given what the docstring says. |
That's very interesting. I also tried with an old (2019) set of versions, and the code fails:

```python
>>> import numpy as np
>>> from sklearn.base import clone
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.utils import check_random_state
>>> rng = np.random.RandomState(0)
>>> rf = RandomForestClassifier(random_state=rng)
>>> rf2 = clone(rf)
>>> assert check_random_state(rf.random_state) is check_random_state(rf2.random_state)
    assert check_random_state(rf.random_state) is check_random_state(rf2.random_state)
AssertionError
>>> sklearn.show_versions()
/home/adrin/miniforge3/envs/delete-me/lib/python3.7/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")

System:
    python: 3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 06:08:21) [GCC 9.4.0]
executable: /home/adrin/miniforge3/envs/delete-me/bin/python
   machine: Linux-6.2.10-arch1-1-x86_64-with-arch

BLAS:
    macros: HAVE_CBLAS=None
  lib_dirs: /home/adrin/miniforge3/envs/delete-me/lib
cblas_libs: blas, cblas, lapack, blas, cblas, lapack, blas, cblas, lapack

Python deps:
       pip: 23.0.1
setuptools: 67.6.1
   sklearn: 0.20.4
     numpy: 1.15.4
     scipy: 1.5.1
    Cython: None
    pandas: None
```
|
my expectations are:
```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import check_random_state
rng = np.random.RandomState(0)
rf = RandomForestClassifier(random_state=rng)
rf2 = clone(rf)
assert rf.random_state.rand() == rf2.random_state.rand()
```
and this works. So instances are different but they behave the same.
Now indeed they don't share the same state. Fitting rf will not affect the
fit of rf2.
|
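Continuing the snippet above (an added illustration, not part of the original comment): since each estimator holds its own copy of the RNG, advancing one stream does not advance the other.

```python
# rf and rf2 are the estimators from the snippet above; their RandomState copies
# started from the same state, but they are independent objects.
rf.random_state.rand()                                   # extra draw from rf's copy only
print(rf.random_state.rand(), rf2.random_state.rand())   # the two streams have now diverged
```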
@agramfort so your expectations are what the code does. It means that all the RandomForest clones will all have the same RNG across all folds of a CV procedure / grid search. |
> It means that all the RandomForest clones will all have the same RNG across all folds of a CV procedure / grid search.
This is indeed sub-optimal.
For instance it can lead to underestimating variance when benchmarking a learning procedure.
An example of where this is problematic is when using a DummyClassifier(strategy='stratified') to sample the distribution under a null. We'll clearly get an under-estimated variance in a cross-validation.
Another example of a problem is when we want to use BaggingClassifier on trees to manually implement a random forest. Individual decision trees have randomness in splitting (either with splitter="random", or even with splitter="best", to choose between ties). In extreme cases where we have exactly duplicated informative features and we use splitter="best", the decision tree in every bag will select the same feature out of these duplicated informative features, because it goes through the features in the same order.
With regards to the actions to take, I agree with Tim that we can avoid any rush here: maybe fix the docs to be explicit about this, but not immediately change the behavior without stopping to think about it and communicating. We should explain why this behavior, while clearly reproducible, gives a false sense of security and is not desirable.
In terms of changes to our code, one option would be to special-case RandomState objects in clone so as not to deep copy them. This would lead to maximum entropy (different random number generation in different clones). It would probably mean that it is harder to get bitwise reproducibility (I often find such reproducibility less interesting than statistical reproducibility). One example of a threat to reproducibility is when the different cloned estimators are used in parallel in multiple threads. In such a setting, the specific order in which they access the random number generator will vary from one execution to another (parallel computing gives orderings that are not strictly reproducible), and thus the stream of random numbers that each sees will vary.
|
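A minimal sketch of that option (assuming a hypothetical helper named `clone_param`; this is not sklearn's actual implementation, where parameters are currently deep-copied inside `clone`):

```python
# Sketch of special-casing RandomState objects during parameter cloning so that
# clones share the same instance instead of receiving an independent deep copy.
import copy
import numpy as np

def clone_param(param):
    """Clone a single constructor parameter (illustrative helper, not sklearn API)."""
    if isinstance(param, np.random.RandomState):
        return param             # share the reference: each clone draws from the same stream
    return copy.deepcopy(param)  # everything else keeps the current deep-copy behaviour

rng = np.random.RandomState(0)
shared = clone_param(rng)
print(shared is rng)  # True: a clone built this way would consume from the same RNG
```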
I've added this to the agenda for the next monthly meeting (24th April). I think having a short sync discussion would be helpful, even if it won't answer all questions "once and for all". |
During the Drafting meeting, @adrinjalali @ogrisel @glemaitre @betatim and I discussed many aspects of random state. Here is my summary of the discussion:

1. We agree on `random_state=None` and `random_state=int` behavior. The hard part is the random state objects.
2. We agree that we should leave the current behavior alone. I think the docs should be updated to reflect the actual behavior.
3. We want to use the move to NumPy Generators to support the desired behavior. We do not need to worry about backward compatibility, because NumPy Generators are new objects to pass around.
4. We reconsidered SLEP011 (scikit-learn/enhancement_proposals#24) and are undecided.
5. There is no API contract about randomness for multiple calls to predict. In scikit-learn, I think we are consistent and that multiple calls to `predict` with the same data give the same output.
6. We have a conflict about what the desired behavior of the following should be:

```python
rng = np.random.RandomState(42)
rf1 = RandomForestClassifier(random_state=rng).fit(X, y)
rf2 = RandomForestClassifier(random_state=rng).fit(X, y)
# Should the above give the same model, and thus the same predictions?
np.assert_array_equal(rf1.predict(X), rf2.predict(X))
```

### Cross validation and max entropy

We agree that any cross validation procedure such as GridSearchCV should give the maximum amount of entropy, with a way to make it deterministic. As described in #26148 (comment) there is no way to make it possible now **and** be deterministic. We did not come up with a good way to do it with `Generators` either. Estimators that know what they are spawning (RandomForestClassifier) do not have this issue and can use [SeedSequences](https://numpy.org/doc/stable/reference/random/parallel.html). Concretely, `RandomForestClassifier` can pass the random state objects to the trees directly because the forest knows it is dealing with trees. On the other hand, `GridSearchCV` can not pass this random state down because there is no API for passing these objects to an estimator. |
Thanks for this excellent summary.
I wonder about point 6: people might use a predictor that samples the posterior, and be surprised if multiple calls give the same samples.
|
The point we made, which is lost in the summary, is that people should be able to have randomized predict, but they should also be able to control that randomness if needed. |
My take on point 6 in #26148 (comment) is that it should result in 2 different random forests, thus the assert should fail. Otherwise we make it hard for someone who knows what she is doing with the rng. Can't we make that possible? |
I think you can argue that this is a case that is different from what the code snippet shows. At least for me what you point out is more akin to this:

```python
clf = FooWithSamplingClassifier(random_state=<int or random_state>).fit(X, y)
# calling once
pred1 = clf.predict(X)
# calling twice
pred2 = clf.predict(X)
np.assert_array_not_equal(pred1, pred2)
```

Where `FooWithSamplingClassifier` is an estimator whose `predict` draws samples from a distribution. For me the take away is that there are several legitimate use-cases: repeated `predict` calls that return identical output, `predict` calls that return fresh samples every time, and randomized `predict` whose randomness the user can still control. These use-cases all seem legitimate so I think we can't enforce a particular contract; it will depend on the estimator. It also means that whether a repeated call to `predict` returns the same output is up to each estimator. |
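For concreteness, here is a purely hypothetical sketch of the kind of estimator being discussed (the class name and its behaviour are invented for illustration): its `predict` draws fresh samples on every call, while the randomness is still controllable through the constructor.

```python
# Purely hypothetical sketch: a classifier whose predict() samples from a distribution,
# so repeated calls give different outputs, yet the randomness is seedable/controllable.
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils import check_random_state

class PosteriorSamplingClassifier(ClassifierMixin, BaseEstimator):
    def __init__(self, random_state=None):
        self.random_state = random_state

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.rng_ = check_random_state(self.random_state)  # materialize the RNG once, at fit
        return self

    def predict(self, X):
        # every call draws new samples (here: uniform over the classes, for simplicity)
        return self.rng_.choice(self.classes_, size=len(X))
```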
I think a person's expectation of what happens will depend on whether they think of RNG objects as mutable state that gets consumed or as something immutable. This makes me wonder if life would be simpler for everyone if we considered them immutable. Using the JAX notation (because you can go and read about it), to get two estimators that are the same you'd do:

```python
from jax import random

key = random.PRNGKey(0)
rf1 = RandomForestClassifier(random_state=key)
rf2 = RandomForestClassifier(random_state=key)
```

To have two different estimators you'd split the key before passing it to the constructor:

```python
from jax import random

key = random.PRNGKey(0)
key, rf1_key, rf2_key = random.split(key, 3)
rf1 = RandomForestClassifier(random_state=rf1_key)
rf2 = RandomForestClassifier(random_state=rf2_key)
```

I quite like this. It is only marginally more complex to understand for a user than using integers as seeds, and like with seeds you can have both kinds of behaviours. I think it is also possible to do this with the new random number generator infrastructure of NumPy?!

(n.b. I have rewritten my comment three or four times over the course of the last hour, and each time I changed my conclusion. I think I like this version now, but who knows, if I think some more about it I might change my mind again??) |
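In answer to the question about NumPy above: a similar key-splitting workflow is possible with NumPy's `SeedSequence` (a sketch of the analogue; this is plain NumPy, not an existing scikit-learn API):

```python
# Rough NumPy analogue of JAX key splitting, using SeedSequence.spawn to derive
# independent child seeds from one root seed.
import numpy as np

root = np.random.SeedSequence(0)
rf1_seed, rf2_seed = root.spawn(2)       # deterministically split the "key"
rng1 = np.random.default_rng(rf1_seed)   # would be passed to the first estimator
rng2 = np.random.default_rng(rf2_seed)   # would be passed to the second estimator
print(rng1.random(), rng2.random())      # independent, reproducible streams
```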
I hope I'm not bringing additional confusion here, but I think the last few comments may have been mixing up point 5 and point 6. As far as I understand, point 5 is about "randomized predict", i.e. whether repeated calls to `predict` on the same data should return the same output, while point 6 is about whether two estimators constructed with the same `RandomState` instance and fit on the same data should end up as the same model. |
I think you are right Nicolas :-/ Sorry. Having thought about my previous comment regarding JAX Keys and splitting and having a discussion with @seberg, I think I am on the cusp of changing my mind again compared to what I wrote :-/ Sebastian pointed out that either option (identical models or independent models in the point 6 snippet) can be supported. We can make both options happen, one of them by teaching the minority of users who need the other behaviour what they have to call. |
OK. I've been reluctant to publish scikit-learn/enhancement_proposals#88 (rendered docs are here) so far because I won't be able to champion it and it has become slightly out of date. But I think the SLEP does a decent job at describing the key requirements we need to support for a solution, so I hope it can help better frame the discussion. FWIW I think the solution advocated in the SLEP is quite similar to what you suggested above @betatim. The main idea is to 1) make estimators (and splitters) fully stateless across calls to `fit` and `split`; the remaining points are in the SLEP itself. For completeness, a toy implementation of the solution is implemented in this notebook. |
Not coming from the sklearn boat, so I may well be completely off and happy to be ignored, but on both things I am not sure I can fully agree. I would probably argue that users should never re-use identical pseudo randomness unless they are explicit about it. Because of that, I will ask what the benefit is if calling `fit` a second time (or fitting a clone) re-uses the identical random stream. |
My suspicion would be that most (especially inexperienced?) users will almost never need identical streams and will end up with a subtly bad rng if that is what you give them. But maybe I am underestimating how often it is required? OTOH, maybe when you do need it there are often other ways (such as explicitly setting the initial conditions). TBH, I am not sure it is important whether users lean towards "of course calling `fit` twice gives the same result" or towards the opposite expectation. |
@seberg The question is more about what the result should be of calling `fit` on the same data with a cloned estimator, or with an identically parametrized estimator when the identical object is passed as the rng parameter. Meanwhile, I think discussing options for "point 6" in #26148 (comment) does not solve our issue(s). Both options are fine and can be used to implement the same functional behaviour. It's then more a question of having it documented and tested. IMHO, @NicolasHug initiated the right approach with his draft SLEP: write down requirements and search for solutions by extending estimators use case by use case. Ideally, we don't bother users at all with randomness (and thereby protect them from the danger of serious foot injuries) and still provide swiss-army-knife flexibility for experts. I hope, for instance, that we all agree on requirement 1: it should be very easy, or be the default, to have exact reproducibility (when running the same python code in different python processes in the same order). |
Right, I had more things but it's complicated and of course mapping out things is important. I suppose I have a few questions (opinions) about the details. |
Describe the bug
Cloned estimators have identical randomness but different RNG instances. According to documentation, it should be the other way around: different randomness but identical RNG instances.
Related #25395
The User Guide (the section on controlling randomness, and in particular its subsection on cloning) says that when a `RandomState` instance is passed, the original estimator and its clone share that instance, and therefore influence each other and yield different randomness.
The actual behaviour does not follow this description.
Steps/Code to Reproduce
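A minimal sketch along these lines, consistent with the Expected/Actual results below (the exact estimators and checks in the original snippet may have differed; `SGDClassifier` and `Perceptron` are used here purely for illustration):

```python
# For each of two estimators, check (a) whether the clone shares the RNG instance,
# and (b) whether the original and the clone end up as identical fitted models.
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron, SGDClassifier

X, y = make_classification(n_features=5, random_state=42)

for Est in (SGDClassifier, Perceptron):
    est = Est(random_state=np.random.RandomState(10))
    est2 = clone(est)
    print(est.random_state is est2.random_state)                   # docs say True, actually False
    print(np.allclose(est.fit(X, y).coef_, est2.fit(X, y).coef_))  # docs imply False, actually True
```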
Expected Results
True False True False
Actual Results
False True False True
Versions