linear_model with normalize and StandardScaler lead to faulty results with weighted constant features · Issue #19450 · scikit-learn/scikit-learn · GitHub

linear_model with normalize and StandardScaler lead to faulty results with weighted constant features #19450

Closed
maikia opened this issue Feb 12, 2021 · 3 comments · Fixed by #19527

Comments

@maikia
Contributor
maikia commented Feb 12, 2021

When testing the results of the linear models with normalize=True and sample_weight in PR #19426, we noticed that for sparse data the result is not correct when there is a constant non-zero feature, for example:

X = rng.rand(n_samples, n_features)  # rng is a NumPy RandomState; shapes as in the test setup
X[X < 0.5] = 0.  # zero out roughly half of the entries to make the data sparse
X[:, 2] = 1.  # constant non-zero feature

The normalization (the scale computed for the constant feature) is close to 0 but never exactly 0 due to roundoff errors, so it is not replaced by 1.

Therefore, when the centered X is divided by this near-zero normalization, the values of that feature blow up.

The same thing happens when StandardScaler is called on the same data with sample_weight.

This is possibly because both code paths rely on mean_variance_axis(). A rough sketch of the kind of call that is affected is given below.
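
For illustration, this is roughly the kind of call that is affected (a hypothetical sketch, not the actual test from #19426; the estimator choice, seed, sizes and target are made up, and normalize=True refers to the parameter that still existed at the time of this issue):

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
n_samples, n_features = 100, 5
X = rng.rand(n_samples, n_features)
X[X < 0.5] = 0.  # sparsify the data
X[:, 2] = 1.     # constant non-zero feature
y = rng.rand(n_samples)
sw = rng.rand(n_samples)

# The dense and sparse fits should agree, but the near-zero scale computed
# for the constant column corrupts the sparse code path, so the two sets of
# coefficients may disagree on this kind of data.
coef_dense = Ridge(normalize=True).fit(X, y, sample_weight=sw).coef_
coef_sparse = Ridge(normalize=True).fit(csr_matrix(X), y, sample_weight=sw).coef_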

cc @ogrisel

@ogrisel
Member
ogrisel commented Feb 19, 2021

I cannot reproduce with StandardScaler:

>>> import numpy as np
>>> from sklearn.preprocessing import StandardScaler
>>> X = np.random.RandomState(1).rand(100, 3)
>>> X[X < 0.5] = 0.
>>> X[:, 2] = 1.
>>> StandardScaler().fit_transform(X).std(axis=0)
array([1., 1., 0.])

@ogrisel
Member
ogrisel commented Feb 19, 2021

I tried many random seeds and increased n_samples to 200 to be closer to the test in #19426, and I still cannot reproduce. Could you please give a minimal reproducer of the problem you observed with StandardScaler (including the traceback)?

@ogrisel
Member
ogrisel commented Feb 22, 2021

Ok, I misread @maikia's report and now understand the failing test in #19426 better. As she said, the problem only happens for sparse data and with weights. Here is a minimal reproducer:

>>> import numpy as np
>>> from scipy.sparse import csr_matrix
>>> n_samples = 100
>>> from sklearn.preprocessing import StandardScaler
>>> sw = np.random.rand(n_samples)
>>> X = np.zeros(shape=(n_samples, 2))
>>> X[:, 1] = 1.
>>> StandardScaler(with_mean=False).fit(X, sample_weight=sw).scale_
array([1., 1.])   # everything is fine, both features are detected as constant
>>> StandardScaler(with_mean=False).fit(csr_matrix(X), sample_weight=sw).scale_
array([1.00000000e+00, 1.20671561e-08])

The problem comes from sklearn.utils.sparsefuncs.mean_variance_axis, which returns a variance close to 1e-16 instead of exactly 0 for the weighted constant feature:

>>> from sklearn.utils.sparsefuncs import mean_variance_axis
>>> mean_variance_axis(csr_matrix(X), axis=0, weights=sw)[1]
array([0.00000000e+00, 1.45616256e-16])

This kind of rounding error is expected: it is below the precision we can hope for from typical float64 operations (machine epsilon):

>>> np.finfo(np.float64).eps
2.220446049250313e-16

However, StandardScaler takes the square root of the variance to get the scale, and this is what causes the problem: sqrt(1.46e-16) ≈ 1.2e-08, which is exactly the spurious scale_ value seen above and is no longer negligible relative to 1. We should probably test for constant features before taking the square root.
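
For illustration only, here is a rough sketch of the kind of guard that could go in front of the square root (a hypothetical helper with a made-up threshold, not necessarily what the actual fix in #19527 does):

import numpy as np

def scale_from_variance(var, mean, n_samples):
    # Hypothetical helper: treat variances that are only rounding noise as
    # exactly zero, so that (near-)constant features get a scale of 1
    # instead of something like 1e-08.
    eps = np.finfo(var.dtype).eps
    # Heuristic threshold (an assumption): a few ulps of mean**2, scaled by n_samples.
    tiny = eps * n_samples * np.maximum(mean ** 2, 1.0)
    var = np.where(var <= tiny, 0.0, var)
    scale = np.sqrt(var)
    scale[scale == 0.0] = 1.0  # same convention StandardScaler uses for zero scales
    return scale

With the mean, variance and n_samples from the reproducer above, both features would fall below the threshold and end up with a scale of 1.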

@ogrisel ogrisel changed the title linear_model with normalize and StandardScaler lead to faulty results in edge case scenario linear_model with normalize and StandardScaler lead to faulty results with weighted constant features Feb 22, 2021