8000 Should all test cases set the random_state? · Issue #8194 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content
Should all test cases set the random_state? #8194
@erikcs

Description

@erikcs

In this PR some newly added test cases made some unrelated tests fail. The culprit was mutating state due to missing random_state(s) in the unrelated tests' function/class invocations. @amueller also mention that:

Ideally we would explicitly pass random states everywhere in all tests. To fix the test failure, maybe fix the random states in the test that's failing. You can also add it in the test that you added.

I made a simple python script here that goes through each of sklearn's test files and look for estimators/classes that take a random_state parameter, and see if this was passed in the invocation(s). If not then print out the class/function name followed by a list of line numbers with missing invocations.

On the latest master branch (ca687ba) there are 1 155 cases in 154 test files where a class/function that takes a random_state is called without setting it to the canonical 42:

erik  scikit-learn$ find * -type f | grep "/test_.*py$" | python rcheck.py
sklearn/tree/tests/test_tree.py
	DecisionTreeClassifier [398, 1173, 1210, 1231, 1237, 1531]
	DecisionTreeRegressor [1532]
sklearn/tree/tests/test_export.py
        DecisionTreeClassifier [197, 212]
.
.
.
sklearn/model_selection/tests/test_search.py
	DecisionTreeClassifier [868]
	DecisionTreeRegressor [868]
	GroupShuffleSplit [242]
	KFold [241, 250, 491, 503, 919, 1172, 1193]
	LinearSVC [292, 356, 363, 376, 383]
	LinearSVCNoScore [101, 210]
	ParameterSampler [1093, 1096, 1113]
	RandomizedSearchCV [265, 704, 708, 753, 798, 839, 886, 946, 983]
	SGDClassifier [1126, 1141]
	SVC [292, 301, 305, 325, 333, 340, 348, 356, 363, 376, 383, 417, 441, 644, 647, 704, 708, 752, 753, 795, 798, 838, 839, 1003]
	StratifiedKFold [250, 919]
	StratifiedShuffleSplit [250]

Found 1155 missing in 154 files

The full output is here

If desirable, I could work on a PR for a find/replace on all these cases? If so, then the scikit-learn contributing docs should perhaps be explicit in stating that all added test cases that invoke an estimator that takes a random_state, should also set it / or add some kind of flake plugin that checks this?

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions

      0