RandomizedLogisticRegression not repeatable · Issue #7895 · scikit-learn/scikit-learn · GitHub
Closed
@jhaney5930

Description

sklearn.linear_model.RandomizedLogisticRegression does not handle random state in a way that allows for repeatable results. In particular, the _randomized_logistic() helper function makes a call to LogisticRegression() without passing the random_state parameter. Since the default value for random_state is None, a new RandomState instance is generated over which the user has no control. When the user specifies a random seed or random state, there is no guarantee that the output will be the same over multiple runs.

A workaround is to set tol low enough that the random numbers used by LogisticRegression are less likely to affect the final outcome, but full repeatability is still preferable for testing and validation of models.
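
The failure mode described above can be illustrated without scikit-learn. This is only a minimal sketch: `inner_draw`, `helper_dropping_seed`, and `helper_forwarding_seed` are hypothetical names standing in for the inner estimator and the helper, not the library's actual code.

```python
import numpy as np


def inner_draw(random_state=None):
    # Hypothetical stand-in for an inner estimator that consumes random numbers.
    # With random_state=None, NumPy seeds a fresh generator unpredictably.
    rng = np.random.RandomState(random_state)
    return rng.uniform(size=5)


def helper_dropping_seed(random_state=None):
    # Mirrors the reported behaviour: the caller's seed is accepted
    # but never forwarded to the inner call.
    return inner_draw()


def helper_forwarding_seed(random_state=None):
    # Forwarding the seed makes the inner draw repeatable.
    return inner_draw(random_state=random_state)


if __name__ == "__main__":
    print(np.array_equal(helper_dropping_seed(1234), helper_dropping_seed(1234)))
    print(np.array_equal(helper_forwarding_seed(1234), helper_forwarding_seed(1234)))
```

In this sketch only the forwarding helper returns the same numbers on every call with the same seed, which is the behaviour the issue asks for.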

Steps/Code to Reproduce

from pprint import pprint

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import RandomizedLogisticRegression


def test():
    # With shuffle=False the informative features come first (columns 0 and 1).
    X, y = make_classification(n_samples=3000,
                               n_features=402,
                               n_informative=2,
                               n_redundant=0,
                               n_repeated=0,
                               n_classes=2,
                               random_state=123,
                               shuffle=False)

    # Remove the 2 informative columns and keep only the random columns.
    pprint(X)
    X2 = np.delete(X, [0, 1], axis=1)
    pprint(X2)

    # Fit the same model 10 times with the same seed; one column of scores per run.
    df = pd.DataFrame()
    for i in range(10):
        clf = RandomizedLogisticRegression(
            random_state=1234, C=10, verbose=True, n_resampling=100)
        clf.fit(X2, y)
        print(clf.get_params(deep=True))
        df[i] = clf.scores_

    # Identical runs would give a standard deviation of 0 for every feature.
    print("Maximum standard deviation of selection probabilities:")
    print(df.std(axis=1).max())
    df.to_csv('rlr_comparison.csv')


if __name__ == "__main__":
    test()

Expected Results

Maximum standard deviation of selection probabilities:
0.0

Actual Results

Maximum standard deviation of selection probabilities:
0.0052704627669473035
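
The check used above (largest per-feature standard deviation across repeated runs) can be factored into a small helper. This is a sketch under assumptions: `max_score_std`, `seeded_scores`, and `unseeded_scores` are hypothetical names, and the two score functions are random stand-ins rather than real model fits.

```python
import numpy as np


def max_score_std(fit_scores, n_runs=10):
    """Largest per-feature sample standard deviation across repeated runs."""
    # One column per run, one row per feature.
    scores = np.column_stack([fit_scores() for _ in range(n_runs)])
    # ddof=1 matches the default of pandas' DataFrame.std used in the report.
    return scores.std(axis=1, ddof=1).max()


def seeded_scores():
    # Hypothetical stand-in for a fit whose randomness is fully seeded.
    return np.random.RandomState(1234).uniform(size=20)


def unseeded_scores():
    # Hypothetical stand-in for a fit that ignores the caller's seed.
    return np.random.RandomState().uniform(size=20)


if __name__ == "__main__":
    print(max_score_std(seeded_scores))
    print(max_score_std(unseeded_scores))
```

When turning this into an automated test, comparing against a small tolerance may be safer than requiring exactly 0.0, because rounding in the mean can leave a tiny nonzero value even for identical runs.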

Versions

Windows-7-6.1.7601-SP1
Python 3.5.1 |Anaconda 4.0.0 (64-bit)| (default, Feb 16 2016, 09:49:46) [MSC v.1900 64 bit (AMD64)]
NumPy 1.10.4
SciPy 0.17.0
Scikit-Learn 0.17.1
