BUG: Regression with StandardScaler due to #19527 · Issue #19726 · scikit-learn/scikit-learn · GitHub

Closed
larsoner opened this issue Mar 19, 2021 · 14 comments · Fixed by #19788
Comments

@larsoner (Contributor) commented Mar 19, 2021

Describe the bug

#19527 introduced a regression with StandardScaler when dealing with data with small magnitudes.

Steps/Code to Reproduce

In MNE-Python some of our data channels have magnitudes in the ~1e-13 range. On 638b768 or before, this code (which uses random data of different scales) returns all True, which seems correct:

import numpy as np
from sklearn.preprocessing import StandardScaler

for scale in (1e15, 1e10, 1e5, 1, 1e-5, 1e-10, 1e-15):
    data = np.random.RandomState(0).rand(1000, 4) - 0.5
    data *= scale
    scaler = StandardScaler(with_mean=True, with_std=True)
    X = scaler.fit_transform(data)
    stds = np.std(data, axis=0)
    means = np.mean(data, axis=0)
    print(np.allclose(X, (data - means) / stds, rtol=1e-7, atol=1e-7 * scale))

But on c748e46 / after #19527, anything "too small" starts to fail: I get True for the first five scales and False for the last two (1e-10, 1e-15). Hence StandardScaler no longer standardizes the data.

cc @ogrisel since this came from your PR and @maikia @rth @agramfort since you approved the PR

@agramfort (Member)

I suggest this fix in MNE (mne-tools/mne-python#9173), and suggest we avoid treating this as a regression.

@larsoner (Contributor, Author)

We can work around it in MNE, but I consider it a (substantial) regression if scaling data by 1e-10 makes a class whose primary (only?) job is to normalize data stop normalizing it. Surely MNE will not be the only project to hit this issue, and it happens silently, so we wouldn't even have known if we didn't have a regression test.

It seems like there should at least be a way to disable this behavior (strict=True?) with a visible deprecation warning, but even that seems like dangerous, and incorrect, behavior to me.

One potential fix for the issue the breaking PR tried to address is to add another processing step that zeros out low-variance columns, where the user can specify what "low variance" means. This could be added to StandardScaler with a (backward compatible) default of 0, and 1e-10 or whatever could be used in the case that drove the PR.
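
For illustration, here is a minimal sketch of what such an opt-in step could look like; the helper name and the near_constant_threshold parameter are hypothetical, not an existing scikit-learn API:

import numpy as np

def zero_near_constant_columns(X, near_constant_threshold=0.0):
    # Hypothetical pre-processing step: zero out columns whose variance
    # falls below a user-chosen threshold. The default of 0.0 is a no-op
    # (backward compatible); the case that motivated #19527 could opt in
    # with e.g. 1e-10.
    X = np.array(X, dtype=float)  # copy so the caller's array is untouched
    variances = X.var(axis=0)
    X[:, variances < near_constant_threshold] = 0.0
    return X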

@ogrisel (Member) commented Mar 23, 2021

Thanks for the report @larsoner. Let me flag this as a potential regression to investigate.

We can probably change the default criterion of 10 * np.finfo(X.dtype).eps down to np.finfo(X.dtype).eps to make this change less disruptive, but even this would still be a problem I guess.

In private discussions, @jeremiedbb suggested that we should not decide whether a feature is constant based on an absolute criterion but using a relative criterion.

Maybe that would mean that for StandardScaler we should change the line: c748e46#diff-9b7f4ea2b7691c59cdf3c72e68074a0a86651d23bb1dbdd233b2dc40f6d686a6R869

and compare the variance to eps * (squared mean), for instance. I would need to check if this would make the fix for #19450 still work.
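
As a rough per-feature sketch of the difference between the two criteria (a simplification for illustration, not the actual scikit-learn code):

import numpy as np

X = np.random.RandomState(0).rand(1000, 4) * 1e-10  # tiny but informative features
eps = np.finfo(X.dtype).eps
variances = X.var(axis=0)
means = X.mean(axis=0)

# Absolute criterion (the 10 * eps threshold from #19527): all True,
# so these features would be treated as constant.
print(variances < 10 * eps)
# Relative criterion (variance compared to eps * squared mean): all False,
# so uniformly rescaling the data no longer triggers the check.
print(variances < eps * means ** 2)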

@larsoner (Contributor, Author) commented Mar 23, 2021

In private discussions, @jeremiedbb suggested that we should not decide whether a feature is constant based on an absolute criterion but using a relative criterion.

I think this is not enough to avoid breaking things for people with existing workflows that worked before #19527 / #19450 and should continue to work. If you change my MWE above to multiply just the last column rather than all columns by the given scale factor, even this newly proposed solution will silently break things (well, maybe depending on how you calculate what to be relative to, you need to up it to 100 or more columns to see it break, but hopefully you get the idea).
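
For concreteness, a sketch of that modified MWE, where only the last column is scaled down (whether a relative criterion flags that column as constant then depends on what its variance is compared against):

import numpy as np
from sklearn.preprocessing import StandardScaler

for scale in (1e-10, 1e-15):
    data = np.random.RandomState(0).rand(1000, 4) - 0.5
    data[:, -1] *= scale  # only the last column has a tiny magnitude
    scaler = StandardScaler(with_mean=True, with_std=True)
    X = scaler.fit_transform(data)
    stds = np.std(data, axis=0)
    means = np.mean(data, axis=0)
    # Check only the rescaled column; the other columns are O(1) and unaffected.
    print(np.allclose(X[:, -1], (data[:, -1] - means[-1]) / stds[-1],
                      rtol=1e-7, atol=1e-7 * scale))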

To me it seems like the simplest backward compatible solution is to add atol=0, rtol=0 default values, and allow people to specify these (absolute and relative-to-RMS-or-whatever tolerances) according to their use case. Or you could just get away with one or the other of atol=0 or rtol=0, whatever solves the problem from #19450. But I still don't see a good way to (silently) set the default to be non-zero while avoiding breaking people's code...

@ogrisel added this to the 1.0 milestone Mar 23, 2021
@jeremiedbb (Member)

The computed variance is accurate up to some precision. When the variance is very small and the computed value lies within the error bounds of the algorithm used to compute it, the returned value can't be trusted. In that case, the relative error on the computed variance is > 100% and it is thus indistinguishable from a 0 variance. I think having a threshold below which the variance is considered 0 is correct (provided that the threshold corresponds to the error bounds).

I think it will not break code that was not correct in the first place, e.g. relying on a computed variance smaller than the error bounds of the algorithm.

However the bound set in #19527 is far too big. For the current implementation of the variance, it should be n_samples**3 * mean(x)**2 * eps**2. But I found that we can improve the accuracy of our implementation with a small change to have an upper bound of n_samples**4 * mean(x)**3 * eps**3. I'm working on this.
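
For context, the precision limit comes from floating-point cancellation when computing a tiny variance on data with a much larger mean. A hedged illustration (not necessarily the code path scikit-learn uses internally):

import numpy as np

x = 1 + np.random.RandomState(0).rand(1000) * 1e-10  # mean ~1, true variance ~8e-22

# "Mean of squares minus squared mean": both terms are ~1, so the tiny
# difference is swamped by rounding error and the result cannot be trusted.
naive_var = np.mean(x ** 2) - np.mean(x) ** 2
print(naive_var)

# Two-pass formula: subtract the mean first, so the small deviations are
# preserved and the result is close to the true ~8e-22.
two_pass_var = np.mean((x - np.mean(x)) ** 2)
print(two_pass_var)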

@ogrisel (Member) commented Mar 25, 2021

But I found that we can improve the accuracy of our implementation with a small change to have an upper bound of n_samples**4 * mean(x)**3 * eps**3. I'm working on this.

Including the case where sample_weight is not None?

@jeremiedbb (Member)

Including the case where sample_weight is not None?

Yes, the sample weights have no impact on the precision of the variance algorithm.

@jeremiedbb (Member)

I opened a first PR (#19766) to fix the issue described in #19450 (comment). This would by itself have solved the issues encountered with normalize and StandardScaler.

I still think it's valuable to detect near-constant features, as done in #19527, but with a more appropriate threshold.

@larsoner (Contributor, Author)

n_samples**4 * mean(x)**3 * eps**3

I haven't thought about this too much, but zero mean data is going to be a problem here, no?

I opened a first PR (#19766) to fix the issue described in #19450 (comment). This would by itself have solved the issues encountered with normalize and StandardScaler.

Do you know if the PR that was opened works on the code snippet I pasted above? To me that's the critical test, and what I would add as a unit test of StandardScaler to ensure it continues to work. I don't think there should be any issues calculating variance at any of the scale factors I pasted above. And this should even hold true if I modify the example to be zero mean in all columns before passing it to StandardScaler.

@jeremiedbb (Member) commented Mar 25, 2021

Do you know if the PR that was opened works on the code snippet I pasted above? To me that's the critical test

I did not fix StandardScaler yet, the threshold is still there.

I haven't thought about this too much, but zero mean data is going to be a problem here, no?

Sure, there is another term which dominates when the mean is also very small.

@larsoner (Contributor, Author)

I did not fix StandardScaler yet, the threshold is still there.

Should we revert the changes from that PR, then?

@ogrisel (Member) commented Mar 29, 2021

Should we revert the changes from that PR, then?

Before considering reverting, we will try to see if we can both fix the original bug and automatically detect constant features based on the tighter theoretical bound of our variance computation algorithm that @jeremiedbb mentioned above.

@jeremiedbb (Member)

I opened #19788 to fix this issue. The criterion to detect near-constant features has been improved.

The snippet in the description now always returns True.

For those interested, here's an example showing that having a criterion to detect near-constant features (other than strict 0 variance) is a good thing:

import numpy as np 
from sklearn.preprocessing import StandardScaler 

for scale in (1e15, 1e10, 1e5, 1, 1e-5, 1e-10, 1e-15): 
    data = 1 + np.random.RandomState(0).rand(1000, 4) * scale 
    stds = np.std(data, axis=0) 
    means = np.mean(data, axis=0) 
    X_scaled = (data - means) / stds 
    print("means", X_scaled.mean(axis=0)) 
    print("stds", X_scaled.std(axis=0))
means [ 2.57571742e-16  1.92246219e-15  1.33759670e-15 -1.16728849e-15]
stds [1. 1. 1. 1.]
means [ 2.18391971e-15 -7.41851025e-16 -1.60316205e-16 -7.54285523e-16]
stds [1. 1. 1. 1.]
means [ 1.61509695e-15 -7.01660952e-16  1.75193193e-16  6.30828723e-16]
stds [1. 1. 1. 1.]
means [ 6.30229202e-15 -1.02293729e-14  4.46975790e-16 -9.92428362e-16]
stds [1. 1. 1. 1.]
means [ 1.42086466e-10  6.99443579e-10  2.91957548e-10 -1.42117233e-10]
stds [1. 1. 1. 1.]
means [-2.57954591e-05  1.82366224e-05 -2.49108951e-05 -3.29294194e-05]
stds [1. 1. 1. 1.]
means [0.69852162 0.69162771 0.85114366 0.85133945]
stds [0.71558895 0.72225418 0.52493282 0.52461523]

Everything is as expected for the first 6 scales: the mean of the scaled X is approximately 0 and the std is 1.
However, for the last one, the mean and std of the scaled X are garbage. In that case, StandardScaler will not scale.

@ogrisel (Member) commented Mar 30, 2021

Very nice! Thanks so much @jeremiedbb for taking the time to understand the root cause of the problem and design this fix.
