linear_model with normalize and StandardScaler lead to faulty results with weighted constant features #19450
I cannot reproduce with:

```python
>>> import numpy as np
>>> from sklearn.preprocessing import StandardScaler
>>> X = np.random.RandomState(1).rand(100, 3)
>>> X[X < 0.5] = 0.
>>> X[:, 2] = 1.
>>> StandardScaler().fit_transform(X).std(axis=0)
array([1., 1., 0.])
```

I tried with many random seeds and with an increasing number of samples.
Ok, I misread @maikia's reports and now understand the failing test in #19426 better. As she said, the problem only happens for sparse data and with weights. Here is a minimal reproducer:

```python
>>> import numpy as np
>>> from scipy.sparse import csr_matrix
>>> from sklearn.preprocessing import StandardScaler
>>> n_samples = 100
>>> sw = np.random.rand(n_samples)
>>> X = np.zeros(shape=(n_samples, 2))
>>> X[:, 1] = 1.
>>> StandardScaler(with_mean=False).fit(X, sample_weight=sw).scale_
array([1., 1.])  # everything is fine, both features are detected as constant
>>> StandardScaler(with_mean=False).fit(csr_matrix(X), sample_weight=sw).scale_
array([1.00000000e+00, 1.20671561e-08])
```

The problem happens with `mean_variance_axis`:

```python
>>> from sklearn.utils.sparsefuncs import mean_variance_axis
>>> mean_variance_axis(csr_matrix(X), axis=0, weights=sw)[1]
array([0.00000000e+00, 1.45616256e-16])
```

This kind of rounding error is expected: it is below what we can expect from the numerical precision of typical float64 operations:

```python
>>> np.finfo(np.float64).eps
2.220446049250313e-16
```

However, `StandardScaler` should be robust to this kind of rounding error and still detect such features as constant (scale of 1), as it does for dense input.
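A possible mitigation, sketched below under assumptions (this is not necessarily the fix adopted in scikit-learn, and the tolerance is illustrative only): zero out variances that are within rounding noise of zero before deriving the scale, so that near-constant features get a scale of 1 as in the dense case:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.utils.sparsefuncs import mean_variance_axis

n_samples = 100
rng = np.random.RandomState(0)
sw = rng.rand(n_samples)
X = np.zeros((n_samples, 2))
X[:, 1] = 1.0  # constant non-zero feature

means, variances = mean_variance_axis(csr_matrix(X), axis=0, weights=sw)

# Treat variances that are indistinguishable from float64 rounding noise
# as exactly zero. The tolerance is illustrative: a few ulps relative to
# the squared mean, the natural magnitude of the accumulated error here.
eps = np.finfo(variances.dtype).eps
tol = 10 * eps * np.maximum(means ** 2, 1.0)
variances[variances <= tol] = 0.0

scale = np.sqrt(variances)
scale[scale == 0.0] = 1.0  # constant features get a scale of 1
print(scale)  # expected: [1. 1.]
```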
When testing the results of the linear models with `normalize` set to True and `sample_weight` in PR #19426, we noted that for sparse data the result is not correct when there is a constant non-zero feature (e.g. a column of all ones). The normalization is close to 0 but never exactly 0 due to round-off errors, so we don't replace it with 1s. Therefore, when X (which still contains its mean, since sparse data cannot be centered) is divided by this normalization, we get very large values. The same happens if we call `StandardScaler` with the same data and `sample_weight`, possibly because both code paths use `mean_variance_axis()`.
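To make the failure mode concrete, here is a minimal sketch (my illustration, not part of the original report) of why a spurious near-zero scale is harmful for sparse input: with `with_mean=False` the constant feature is not centered, so dividing by the tiny `scale_` reported above, instead of by 1, inflates it by several orders of magnitude:

```python
import numpy as np

x = np.ones(5)                   # constant non-zero feature, left uncentered
spurious_scale = 1.20671561e-08  # the scale_ value from the sparse path above
print(x / spurious_scale)        # ~8.3e+07 everywhere instead of staying at 1.0
```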
cc @ogrisel