Description
sklearn.linear_model.RandomizedLogisticRegression does not handle random state in a way that allows for repeatable results. In particular, the _randomized_logistic() helper function calls LogisticRegression() without passing the random_state parameter. Since the default value for random_state is None, LogisticRegression draws its random numbers from a source over which the user has no control. As a result, even when the user specifies a random seed or random state, there is no guarantee that the output will be the same over multiple runs.
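The pattern described above can be illustrated with a minimal sketch that uses only the standard library. The functions inner_solver, helper_without_forwarding and helper_with_forwarding are hypothetical stand-ins, not scikit-learn code; they only mimic an outer routine that accepts a seed and an inner routine whose random_state defaults to None.

```python
import random


def inner_solver(data, random_state=None):
    """Hypothetical stand-in for a solver that uses randomness internally."""
    # random.Random(None) seeds itself from OS entropy or the clock,
    # so the caller cannot reproduce its output.
    rng = random.Random(random_state)
    return [x + rng.random() for x in data]


def helper_without_forwarding(data, random_state=None):
    """Seeds its own resampling but drops the seed before the inner call."""
    rng = random.Random(random_state)
    sample = rng.sample(data, k=len(data) // 2)
    return inner_solver(sample)  # random_state is not passed on


def helper_with_forwarding(data, random_state=None):
    """Derives a seed for the inner call from the caller's random state."""
    rng = random.Random(random_state)
    sample = rng.sample(data, k=len(data) // 2)
    return inner_solver(sample, random_state=rng.getrandbits(32))


data = list(range(10))
repeatable = (helper_with_forwarding(data, 1234)
              == helper_with_forwarding(data, 1234))
not_repeatable = (helper_without_forwarding(data, 1234)
                  != helper_without_forwarding(data, 1234))
print(repeatable, not_repeatable)
```

With forwarding, two calls with the same seed agree exactly; without it, the resampling step is repeatable but the inner randomness is not, which is the behaviour reported here.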
A workaround is to set tol low enough that the random numbers used by LogisticRegression are less likely to affect the final outcome, but full repeatability is still preferable for testing and validation of models.
Steps/Code to Reproduce
from pprint import pprint

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import RandomizedLogisticRegression


def test():
    # iris = load_iris()
    # X, y = iris.data, iris.target
    X, y = make_classification(n_samples=3000,
                               n_features=402,
                               n_informative=2,
                               n_redundant=0,
                               n_repeated=0,
                               n_classes=2,
                               random_state=123,
                               shuffle=False)
    # With shuffle=False the informative columns come first, so dropping
    # the first two columns keeps only the random (noise) columns.
    pprint(X)
    X2 = np.delete(X, [0, 1], axis=1)
    pprint(X2)

    # Fit ten times with the same random_state and collect the scores.
    df = pd.DataFrame()
    for i in range(10):
        clf = RandomizedLogisticRegression(
            random_state=1234, C=10, verbose=True, n_resampling=100)
        clf.fit(X2, y)
        print(clf.get_params(deep=True))
        df[i] = clf.scores_

    # If the fits were repeatable, every row would have zero spread.
    pprint("Maximum standard deviation of selection probabilities:")
    pprint(df.std(axis=1).max())
    df.to_csv('rlr_comparison.csv')


if __name__ == "__main__":
    test()
Expected Results
Maximum standard deviation of selection probabilities:
0.0
Actual Results
Maximum standard deviation of selection probabilities:
0.0052704627669473035
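For readers who want to check the reported figure without pandas, here is a small sketch of the same metric: the largest per-feature sample standard deviation across runs. The function name max_std_across_runs and the toy score lists are illustrative assumptions, not output from the reproduction script.

```python
from statistics import stdev


def max_std_across_runs(runs):
    """runs: list of per-run score lists, all of the same length."""
    # zip(*runs) regroups the data so each tuple holds one feature's
    # scores across all runs; stdev is the sample standard deviation,
    # which matches the pandas default (ddof=1).
    return max(stdev(scores) for scores in zip(*runs))


# Toy data: three identical runs versus three runs that disagree
# on the middle feature.
identical = [[0.1, 0.5, 0.9]] * 3
drifting = [[0.1, 0.5, 0.9], [0.1, 0.6, 0.9], [0.1, 0.4, 0.9]]

print(max_std_across_runs(identical))
print(max_std_across_runs(drifting))
```

A value of exactly zero means every run produced the same score for every feature; any positive value means at least one feature's selection probability changed between runs.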
Versions
Windows-7-6.1.7601-SP1
Python 3.5.1 |Anaconda 4.0.0 (64-bit)| (default, Feb 16 2016, 09:49:46) [MSC v.1900 64 bit (AMD64)]
NumPy 1.10.4
SciPy 0.17.0
Scikit-Learn 0.17.1