Description
Describe the bug
Cloned estimators have identical randomness but different RNG instances. According to documentation, it should be the other way around: different randomness but identical RNG instances.
Related #25395
The User Guide says:
For an optimal robustness of cross-validation (CV) results, pass RandomState instances when creating estimators
rf_inst = RandomForestClassifier(random_state=np.random.RandomState(0))
cross_val_score(rf_inst, X, y)
...
Since rf_inst was passed a RandomState instance, each call to fit starts from a different RNG. As a result, the random subset of features will be different for each folds
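The distinction drawn in this passage can be illustrated with NumPy alone, without scikit-learn. In the sketch below, `draw` is a hypothetical stand-in for a `fit` call that consumes random numbers; it is not part of any library, and it only assumes that an integer seed is turned into a fresh RNG on every call while a RandomState instance is used as-is.

```python
import numpy as np


def draw(random_state):
    """Hypothetical stand-in for `fit`: consume five random integers."""
    # An integer seed yields a fresh RNG on every call, while a
    # RandomState instance is used directly and therefore advanced.
    if isinstance(random_state, np.random.RandomState):
        rng = random_state
    else:
        rng = np.random.RandomState(random_state)
    return rng.randint(10, size=5)


# Integer seed: every call starts from the same RNG state.
int_draws_equal = np.array_equal(draw(0), draw(0))

# Shared instance: each call continues where the previous one stopped.
shared = np.random.RandomState(0)
first, second = draw(shared), draw(shared)
instance_draws_equal = np.array_equal(first, second)

print(int_draws_equal)       # True
print(instance_draws_equal)  # False
```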
Regarding cloning, the same reference says:
rng = np.random.RandomState(0)
a = RandomForestClassifier(random_state=rng)
b = clone(a)
Moreover, a and b will influence each-other since they share the same internal RNG: calling a.fit will consume b’s RNG, and calling b.fit will consume a’s RNG, since they are the same.
The actual behaviour does not follow this description.
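One possible explanation, stated here as an assumption rather than a confirmed reading of the scikit-learn source, is that cloning deep-copies constructor parameters. The sketch below uses only `copy.deepcopy` and NumPy to show what a deep copy of a RandomState looks like: a distinct object that produces the same stream as the original.

```python
import copy

import numpy as np

rng = np.random.RandomState(0)
# Assumption: this mimics what cloning does to the random_state parameter.
rng_copy = copy.deepcopy(rng)

is_same_object = rng is rng_copy
same_stream = np.array_equal(
    rng.randint(10, size=10), rng_copy.randint(10, size=10)
)

print(is_same_object)  # False
print(same_stream)     # True
```

If that assumption holds, the copies would start from the same state and then evolve independently, which matches the pattern reported in this issue.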
Steps/Code to Reproduce
import numpy as np
from sklearn import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestClassifier
rng = np.random.RandomState(0)
X, y = make_classification(random_state=rng)
rf = RandomForestClassifier(random_state=rng)
d = cross_validate(rf, X, y, return_estimator=True, cv=2)
rngs = [e.random_state for e in d['estimator']]
# estimators from different CV splits hold distinct RNG objects with identical state:
print(rngs[0] is rngs[1]) # False
print(all(rngs[0].randint(10, size=10) == rngs[1].randint(10, size=10))) # True
rf_clone = clone(rf)
rngs = [rf.random_state, rf_clone.random_state]
print(rngs[0] is rngs[1]) # False
print(all(rngs[0].randint(10, size=10) == rngs[1].randint(10, size=10))) # True
Expected Results
True False True False
Actual Results
False True False True
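The reproduction compares RNGs by drawing numbers from them, which advances both generators. A non-consuming alternative is to compare their internal states via `get_state()`. In this sketch, `same_state` is a hypothetical helper, not a scikit-learn or NumPy function.

```python
import numpy as np


def same_state(rng_a, rng_b):
    """Hypothetical helper: True if two RandomState objects hold equal state."""
    state_a = rng_a.get_state()
    state_b = rng_b.get_state()
    # Legacy state tuple: (name, key array, pos, has_gauss, cached_gaussian)
    return bool(
        state_a[0] == state_b[0]
        and np.array_equal(state_a[1], state_b[1])
        and state_a[2:] == state_b[2:]
    )


rng_a = np.random.RandomState(0)
rng_b = np.random.RandomState(0)

before = same_state(rng_a, rng_b)  # both freshly seeded with 0
rng_a.randint(10)                  # advance only rng_a
after = same_state(rng_a, rng_b)

print(before)  # True
print(after)   # False
```

The same helper could be applied to `rf.random_state` and `rf_clone.random_state` from the reproduction to check whether they share state without altering either generator.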
Versions
Tested on a two-week-old dev build and also on the following version (Kaggle)
System:
python: 3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 06:08:53) [GCC 9.4.0]
executable: /opt/conda/bin/python3.7
machine: Linux-5.15.90+-x86_64-with-debian-bullseye-sid
Python dependencies:
pip: 22.3.1
setuptools: 59.8.0
sklearn: 1.0.2
numpy: 1.21.6
scipy: 1.7.3
Cython: 0.29.34
pandas: 1.3.5
matplotlib: 3.5.3
joblib: 1.2.0
threadpoolctl: 3.1.0
Built with OpenMP: True