Cloned estimators have identical randomness but different RNG instances · Issue #26148 · scikit-learn/scikit-learn · GitHub

Cloned estimators have identical randomness but different RNG instances #26148
Open
@avm19

Description

Describe the bug

Cloned estimators have identical randomness but different RNG instances. According to the documentation, it should be the other way around: different randomness but identical RNG instances.

Related #25395

The User Guide says:

For an optimal robustness of cross-validation (CV) results, pass RandomState instances when creating estimators

rf_inst = RandomForestClassifier(random_state=np.random.RandomState(0))
cross_val_score(rf_inst, X, y)

...
Since rf_inst was passed a RandomState instance, each call to fit starts from a different RNG. As a result, the random subset of features will be different for each folds

With regard to cloning, the same reference says:

rng = np.random.RandomState(0)
a = RandomForestClassifier(random_state=rng)
b = clone(a)

Moreover, a and b will influence each-other since they share the same internal RNG: calling a.fit will consume b’s RNG, and calling b.fit will consume a’s RNG, since they are the same.

The actual behaviour does not follow this description.
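
To make the distinction concrete, here is a minimal sketch that uses only NumPy and does not call scikit-learn at all. The names `shared` and `copied` are illustrative, and `copy.deepcopy` is used as an assumed stand-in for an operation that duplicates the RNG; it is not meant as a description of what `clone` does internally. A shared instance gives the pattern the documentation describes (same object, different draws), while a copy gives the opposite pattern (different object, same draws).

```python
import copy

import numpy as np

rng = np.random.RandomState(0)

# Case 1: a second name bound to the very same RandomState instance.
shared = rng
# Case 2: an independent copy taken before any numbers are drawn
# (assumption: deepcopy stands in for "duplicating" the RNG).
copied = copy.deepcopy(rng)

shared_is_same_object = shared is rng
copied_is_same_object = copied is rng

# Draw from the copy first; this does not touch the original's state.
draw_copy = copied.randint(10, size=10)
# Draw from the original, then from its alias: the alias continues
# the same stream, so it yields the *next* ten numbers.
draw_orig = rng.randint(10, size=10)
draw_shared = shared.randint(10, size=10)

shared_same_draws = bool(np.array_equal(draw_orig, draw_shared))
copied_same_draws = bool(np.array_equal(draw_orig, draw_copy))

print(shared_is_same_object, shared_same_draws)  # True False
print(copied_is_same_object, copied_same_draws)  # False True
```

The first printed pair matches the "same instance, different randomness" behaviour quoted from the User Guide; the second pair matches what a duplicated RNG would produce.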

Steps/Code to Reproduce

import numpy as np
from sklearn import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X, y = make_classification(random_state=rng)
rf = RandomForestClassifier(random_state=rng)

d = cross_validate(rf, X, y, return_estimator=True, cv=2)
rngs = [e.random_state for e in d['estimator']]
# estimators from different CV runs hold distinct RNG objects that produce identical streams:
print(rngs[0] is rngs[1]) # False
print(all(rngs[0].randint(10, size=10) == rngs[1].randint(10, size=10))) # True

rf_clone = clone(rf)
rngs = [rf.random_state, rf_clone.random_state]
print(rngs[0] is rngs[1]) # False
print(all(rngs[0].randint(10, size=10) == rngs[1].randint(10, size=10))) # True

Expected Results

True False True False

Actual Results

False True False True
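
The checks above compare RNGs by drawing numbers from them, which advances their state. A non-consuming alternative is to compare the internal states directly. The sketch below assumes NumPy's legacy `RandomState.get_state()` tuple layout; the helper name `same_state` is hypothetical, and `copy.deepcopy` is used only to obtain a second RNG with equal state for the demonstration.

```python
import copy

import numpy as np


def same_state(a, b):
    """Return True if two RandomState objects have equal internal state.

    Hypothetical helper: compares the tuples returned by get_state()
    without drawing any numbers from either generator.
    """
    sa, sb = a.get_state(), b.get_state()
    # sa[0] is the algorithm name, sa[1] the key array, and sa[2:] the
    # position and cached-Gaussian fields.
    return sa[0] == sb[0] and np.array_equal(sa[1], sb[1]) and sa[2:] == sb[2:]


rng = np.random.RandomState(0)
rng_copy = copy.deepcopy(rng)  # distinct object, equal state

before = same_state(rng, rng_copy)
rng.randint(10, size=10)       # advance only the original
after = same_state(rng, rng_copy)

print(rng is rng_copy, before, after)
```

With such a helper, the `random_state` attributes of the estimators returned by `cross_validate` or `clone` could be compared without altering them.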

Versions

Tested on a two-week-old dev build and also on the following version (Kaggle):

System:
    python: 3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 06:08:53)  [GCC 9.4.0]
executable: /opt/conda/bin/python3.7
   machine: Linux-5.15.90+-x86_64-with-debian-bullseye-sid

Python dependencies:
          pip: 22.3.1
   setuptools: 59.8.0
      sklearn: 1.0.2
        numpy: 1.21.6
        scipy: 1.7.3
       Cython: 0.29.34
       pandas: 1.3.5
   matplotlib: 3.5.3
       joblib: 1.2.0
threadpoolctl: 3.1.0

Built with OpenMP: True
