RandomizedLogisticRegression not repeatable

Description

sklearn.linear_model.RandomizedLogisticRegression does not handle random state in a way that allows for repeatable results. In particular, the _randomized_logistic() helper function calls LogisticRegression() without passing the random_state parameter. Since the default value for random_state is None, a new RandomState instance is generated over which the user has no control. As a result, even when the user specifies a random seed or random state, there is no guarantee that the output will be the same over multiple runs.
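The pattern described above can be illustrated with a minimal, standard-library sketch. It is not scikit-learn code: inner_solver, resample_buggy and resample_fixed are hypothetical stand-ins for LogisticRegression and _randomized_logistic(), and deriving a child seed from the outer generator is only one possible way to forward the state, assumed here for illustration.

```python
import random


def inner_solver(data, random_state=None):
    # Hypothetical stand-in for an estimator that uses randomness internally.
    # random.Random(None) is seeded from the OS or the clock, so the caller
    # has no control over it.
    rng = random.Random(random_state)
    return sum(x * rng.random() for x in data)


def resample_buggy(data, n_resampling, random_state=None):
    # The outer loop is seeded, but the seed is not forwarded to the solver.
    rng = random.Random(random_state)
    scores = []
    for _ in range(n_resampling):
        subsample = rng.sample(data, len(data) // 2)
        scores.append(inner_solver(subsample))
    return scores


def resample_fixed(data, n_resampling, random_state=None):
    # Same loop, but each solver call gets a seed drawn from the outer RNG.
    rng = random.Random(random_state)
    scores = []
    for _ in range(n_resampling):
        subsample = rng.sample(data, len(data) // 2)
        child_seed = rng.randrange(2 ** 32)
        scores.append(inner_solver(subsample, random_state=child_seed))
    return scores


data = [float(i) for i in range(1, 21)]

# Identical seeds give identical scores once the state is forwarded.
print(resample_fixed(data, 5, random_state=1234)
      == resample_fixed(data, 5, random_state=1234))

# Without forwarding, two runs with the same seed will almost surely differ.
print(resample_buggy(data, 5, random_state=1234)
      == resample_buggy(data, 5, random_state=1234))
```

In this sketch the subsamples are the same in both variants for a given seed; only the unseeded inner call makes the buggy version drift between runs, which mirrors the behaviour reported here.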

A workaround is to set tol low enough that the random numbers used by LogisticRegression are less likely to affect the final outcome, but full repeatability is still preferable for testing and validation of models.

Steps/Code to Reproduce

from pprint import pprint

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import RandomizedLogisticRegression


def test():
    X, y = make_classification(n_samples=3000,
                               n_features=402,
                               n_informative=2,
                               n_redundant=0,
                               n_repeated=0,
                               n_classes=2,
                               random_state=123,
                               shuffle=False)

    # With shuffle=False the two informative columns come first;
    # drop them so that only the random columns remain.
    pprint(X)
    X2 = np.delete(X, [0, 1], axis=1)
    pprint(X2)

    # Fit the same model ten times with an identical random_state and
    # store the selection scores of each run as one column.
    df = pd.DataFrame()
    for i in range(10):
        clf = RandomizedLogisticRegression(
            random_state=1234, C=10, verbose=True, n_resampling=100)
        clf.fit(X2, y)
        print(clf.get_params(deep=True))
        df[i] = clf.scores_

    # A repeatable estimator would give a standard deviation of 0.0
    # across runs for every feature.
    print("Maximum standard deviation of selection probabilities:")
    print(df.std(axis=1).max())
    df.to_csv('rlr_comparison.csv')


if __name__ == "__main__":
    test()

Expected Results

Maximum standard deviation of selection probabilities:
0.0

Actual Results

Maximum standard deviation of selection probabilities:
0.0052704627669473035

Versions

Windows-7-6.1.7601-SP1
Python 3.5.1 |Anaconda 4.0.0 (64-bit)| (default, Feb 16 2016, 09:49:46) [MSC v.1900 64 bit (AMD64)]
NumPy 1.10.4
SciPy 0.17.0
Scikit-Learn 0.17.1
