linear_model with normalize and StandardScaler lead to faulty results with weighted constant features · Issue #19450 · scikit-learn/scikit-learn · GitHub

linear_model with normalize and StandardScaler lead to faulty results with weighted constant features #19450

Closed
maikia opened this issue Feb 12, 2021 · 3 comments · Fixed by #19527

Comments

@maikia
Contributor
maikia commented Feb 12, 2021

When testing the results of the linear models with normalize=True and sample_weight in PR #19426, we noticed that for sparse data the result is not correct when there is a constant non-zero feature, for example:

X = rng.rand(n_samples, n_features)  # rng is a NumPy RandomState; shapes as in the test setup
X[X < 0.5] = 0.  # zero out roughly half of the entries to make the data sparse
X[:, 2] = 1.  # constant non-zero feature

The normalization (the scale computed for the constant feature) is close to 0 but never exactly 0 due to roundoff errors, so it is not replaced by 1.

Therefore, when the centered X is divided by this near-zero normalization, the values of that feature blow up.

The same thing happens when StandardScaler is called on the same data with sample_weight.

This is possibly because both code paths rely on mean_variance_axis(). A rough sketch of the kind of call that is affected is given below.
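
For illustration, this is roughly the kind of call that is affected (a hypothetical sketch, not the actual test from #19426; the estimator choice, seed, sizes and target are made up, and normalize=True refers to the parameter that still existed at the time of this issue):

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
n_samples, n_features = 100, 5
X = rng.rand(n_samples, n_features)
X[X < 0.5] = 0.  # sparsify the data
X[:, 2] = 1.     # constant non-zero feature
y = rng.rand(n_samples)
sw = rng.rand(n_samples)

# The dense and sparse fits should agree, but the near-zero scale computed
# for the constant column corrupts the sparse code path, so the two sets of
# coefficients may disagree on this kind of data.
coef_dense = Ridge(normalize=True).fit(X, y, sample_weight=sw).coef_
coef_sparse = Ridge(normalize=True).fit(csr_matrix(X), y, sample_weight=sw).coef_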

cc @ogrisel

@ogrisel
Member
ogrisel commented Feb 19, 2021

I cannot reproduce with StandardScaler:

>>> import numpy as np
>>> from sklearn.preprocessing import StandardScaler
>>> X = np.random.RandomState(1).rand(100, 3)
>>> X[X < 0.5] = 0.
>>> X[:, 2] = 1.
>>> StandardScaler().fit_transform(X).std(axis=0)
array([1., 1., 0.])

@ogrisel
Member
ogrisel commented Feb 19, 2021

I tried many random seeds and increased n_samples to 200 to be closer to the test in #19426, and I still cannot reproduce. Could you please give a minimal reproducer of the problem you observed with StandardScaler (including the traceback)?

@ogrisel
Member
ogrisel commented Feb 22, 2021

Ok, I misread @maikia's report and now understand the failing test in #19426 better. As she said, the problem only happens for sparse data and with weights. Here is a minimal reproducer:

>>> import numpy as np
>>> from scipy.sparse import csr_matrix
>>> n_samples = 100
>>> from sklearn.preprocessing import StandardScaler
>>> sw = np.random.rand(n_samples)
>>> X = np.zeros(shape=(n_samples, 2))
>>> X[:, 1] = 1.
>>> StandardScaler(with_mean=False).fit(X, sample_weight=sw).scale_
array([1., 1.])   # everything is fine, both features are detected as constant
>>> StandardScaler(with_mean=False).fit(csr_matrix(X), sample_weight=sw).scale_
array([1.00000000e+00, 1.20671561e-08])

The problem comes from sklearn.utils.sparsefuncs.mean_variance_axis, which returns a variance close to 1e-16 instead of exactly 0 for the weighted constant feature:

>>> from sklearn.utils.sparsefuncs import mean_variance_axis
>>> mean_variance_axis(csr_matrix(X), axis=0, weights=sw)[1]
array([0.00000000e+00, 1.45616256e-16])

This kind of rounding error is expected: it is below the precision we can hope for from typical float64 operations (machine epsilon):

>>> np.finfo(np.float64).eps
2.220446049250313e-16

However, StandardScaler takes the square root of the variance to get the scale, and this is what causes the problem: sqrt(1.46e-16) ≈ 1.2e-08, which is exactly the spurious scale_ value seen above and is no longer negligible relative to 1. We should probably test for constant features before taking the square root.
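
For illustration only, here is a rough sketch of the kind of guard that could go in front of the square root (a hypothetical helper with a made-up threshold, not necessarily what the actual fix in #19527 does):

import numpy as np

def scale_from_variance(var, mean, n_samples):
    # Hypothetical helper: treat variances that are only rounding noise as
    # exactly zero, so that (near-)constant features get a scale of 1
    # instead of something like 1e-08.
    eps = np.finfo(var.dtype).eps
    # Heuristic threshold (an assumption): a few ulps of mean**2, scaled by n_samples.
    tiny = eps * n_samples * np.maximum(mean ** 2, 1.0)
    var = np.where(var <= tiny, 0.0, var)
    scale = np.sqrt(var)
    scale[scale == 0.0] = 1.0  # same convention StandardScaler uses for zero scales
    return scale

With the mean, variance and n_samples from the reproducer above, both features would fall below the threshold and end up with a scale of 1.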

@ogrisel ogrisel changed the title linear_model with normalize and StandardScaler lead to faulty results in edge case scenario linear_model with normalize and StandardScaler lead to faulty results with weighted constant features Feb 22, 2021