BUG: Regression with StandardScaler due to #19527 · Issue #19726 · scikit-learn/scikit-learn · GitHub

Closed
larsoner opened this issue Mar 19, 2021 · 14 comments · Fixed by #19788
Comments

@larsoner (Contributor) commented Mar 19, 2021

Describe the bug

#19527 introduced a regression with StandardScaler when dealing with data with small magnitudes.

Steps/Code to Reproduce

In MNE-Python some of our data channels have magnitudes in the ~1e-13 range. On 638b768 or before, this code (which uses random data of different scales) returns all True, which seems correct:

import numpy as np
from sklearn.preprocessing import StandardScaler

for scale in (1e15, 1e10, 1e5, 1, 1e-5, 1e-10, 1e-15):
    data = np.random.RandomState(0).rand(1000, 4) - 0.5
    data *= scale
    scaler = StandardScaler(with_mean=True, with_std=True)
    X = scaler.fit_transform(data)
    stds = np.std(data, axis=0)
    means = np.mean(data, axis=0)
    print(np.allclose(X, (data - means) / stds, rtol=1e-7, atol=1e-7 * scale))

But on c748e46 / after #19527, anything "too small" starts to fail: I get True for the first five scales and False for the last two (1e-10, 1e-15). Hence StandardScaler no longer standardizes the data.

cc @ogrisel since this came from your PR and @maikia @rth @agramfort since you approved the PR

@agramfort (Member)

I suggest this fix in MNE (mne-tools/mne-python#9173), and suggest we avoid treating this as a regression.

@larsoner (Contributor, Author)

We can work around it in MNE, but I consider it a (substantial) regression if scaling data by 1e-10 makes a class whose primary (only?) job is to normalize data stop normalizing it. Surely MNE will not be the only project to hit this issue, and it happens silently, so we wouldn't even have known if we didn't have a regression test.

It seems like there should at least be a way to disable this behavior (strict=True?) with a visible deprecation warning, but even that seems like dangerous, and incorrect, behavior to me.

One potential fix for the issue the breaking PR tried to address is to add another processing step that zeros out low-variance columns, where the user can specify what "low variance" means. This could be added to StandardScaler with a (backward compatible) default of 0, and 1e-10 or whatever could be used in the case that drove the PR.
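
For illustration, here is a minimal sketch of what such an opt-in step could look like; the helper name and the near_constant_threshold parameter are hypothetical, not an existing scikit-learn API:

import numpy as np

def zero_near_constant_columns(X, near_constant_threshold=0.0):
    # Hypothetical pre-processing step: zero out columns whose variance
    # falls below a user-chosen threshold. The default of 0.0 is a no-op
    # (backward compatible); the case that motivated #19527 could opt in
    # with e.g. 1e-10.
    X = np.array(X, dtype=float)  # copy so the caller's array is untouched
    variances = X.var(axis=0)
    X[:, variances < near_constant_threshold] = 0.0
    return X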

@ogrisel (Member) commented Mar 23, 2021

Thanks for the report @larsoner. Let me flag this as a potential regression to investigate.

We can probably change the default criterion of 10 * np.finfo(X.dtype).eps down to np.finfo(X.dtype).eps to make this change less disruptive, but even this would still be a problem I guess.

In private discussions, @jeremiedbb suggested that we should not decide whether a feature is constant based on an absolute criterion but using a relative criterion.

Maybe that would mean that for StandardScaler we should change the line: c748e46#diff-9b7f4ea2b7691c59cdf3c72e68074a0a86651d23bb1dbdd233b2dc40f6d686a6R869

and compare the variance to eps * (squared mean), for instance. I would need to check if this would make the fix for #19450 still work.
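
As a rough per-feature sketch of the difference between the two criteria (a simplification for illustration, not the actual scikit-learn code):

import numpy as np

X = np.random.RandomState(0).rand(1000, 4) * 1e-10  # tiny but informative features
eps = np.finfo(X.dtype).eps
variances = X.var(axis=0)
means = X.mean(axis=0)

# Absolute criterion (the 10 * eps threshold from #19527): all True,
# so these features would be treated as constant.
print(variances < 10 * eps)
# Relative criterion (variance compared to eps * squared mean): all False,
# so uniformly rescaling the data no longer triggers the check.
print(variances < eps * means ** 2)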

@larsoner (Contributor, Author) commented Mar 23, 2021

In private discussions, @jeremiedbb suggested that we should not decide whether a feature is constant based on an absolute criterion but using a relative criterion.

I think this is not enough to avoid breaking things for people with existing workflows that worked before #19527 / #19450 and should continue to work. If you change my MWE above to multiply just the last column rather than all columns by the given scale factor, even this newly proposed solution will silently break things (well, maybe depending on how you calculate what to be relative to, you need to up it to 100 or more columns to see it break, but hopefully you get the idea).
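
For concreteness, a sketch of that modified MWE, where only the last column is scaled down (whether a relative criterion flags that column as constant then depends on what its variance is compared against):

import numpy as np
from sklearn.preprocessing import StandardScaler

for scale in (1e-10, 1e-15):
    data = np.random.RandomState(0).rand(1000, 4) - 0.5
    data[:, -1] *= scale  # only the last column has a tiny magnitude
    scaler = StandardScaler(with_mean=True, with_std=True)
    X = scaler.fit_transform(data)
    stds = np.std(data, axis=0)
    means = np.mean(data, axis=0)
    # Check only the rescaled column; the other columns are O(1) and unaffected.
    print(np.allclose(X[:, -1], (data[:, -1] - means[-1]) / stds[-1],
                      rtol=1e-7, atol=1e-7 * scale))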

To me it seems like the simplest backward compatible solution is to add atol=0, rtol=0 default values, and allow people to specify these (absolute and relative-to-RMS-or-whatever tolerances) according to their use case. Or you could just get away with one or the other of atol=0 or rtol=0, whatever solves the problem from #19450. But I still don't see a good way to (silently) set the default to be non-zero while avoiding breaking people's code...

@ogrisel added this to the 1.0 milestone Mar 23, 2021
@jeremiedbb (Member)

The computed variance is accurate up to some precision. When the variance is very small and the computed value lies within the error bounds of the algorithm used to compute it, the returned value can't be trusted. In that case, the relative error on the computed variance is > 100% and it is thus indistinguishable from a 0 variance. I think having a threshold below which the variance is considered 0 is correct (provided that the threshold corresponds to the error bounds).

I think it will not break code that was not correct in the first place, e.g. relying on a computed variance smaller than the error bounds of the algorithm.

However the bound set in #19527 is far too big. For the current implementation of the variance, it should be n_samples**3 * mean(x)**2 * eps**2. But I found that we can improve the accuracy of our implementation with a small change to have an upper bound of n_samples**4 * mean(x)**3 * eps**3. I'm working on this.
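
For context, the precision limit comes from floating-point cancellation when computing a tiny variance on data with a much larger mean. A hedged illustration (not necessarily the code path scikit-learn uses internally):

import numpy as np

x = 1 + np.random.RandomState(0).rand(1000) * 1e-10  # mean ~1, true variance ~8e-22

# "Mean of squares minus squared mean": both terms are ~1, so the tiny
# difference is swamped by rounding error and the result cannot be trusted.
naive_var = np.mean(x ** 2) - np.mean(x) ** 2
print(naive_var)

# Two-pass formula: subtract the mean first, so the small deviations are
# preserved and the result is close to the true ~8e-22.
two_pass_var = np.mean((x - np.mean(x)) ** 2)
print(two_pass_var)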

@ogrisel (Member) commented Mar 25, 2021

But I found that we can improve the accuracy of our implementation with a small change to have an upper bound of n_samples**4 * mean(x)**3 * eps**3. I'm working on this.

Including the case where sample_weight is not None?

@jeremiedbb (Member)

Including the case where sample_weight is not None?

Yes, the sample weights have no impact on the precision of the variance algorithm.

@jeremiedbb (Member)

I opened a first PR (#19766) to fix the issue described in #19450 (comment). This would by itself have solved the issues encountered with normalize and StandardScaler.

I still think it's valuable to detect near-constant features, as done in #19527, but with a more appropriate threshold.

@larsoner (Contributor, Author)

n_samples**4 * mean(x)**3 * eps**3

I haven't thought about this too much, but zero mean data is going to be a problem here, no?

I opened a first PR (#19766) to fix the issue described in #19450 (comment). This would by itself have solved the issues encountered with normalize and StandardScaler.

Do you know if the PR that was opened works on the code snippet I pasted above? To me that's the critical test, and what I would add as a unit test of StandardScaler to ensure it continues to work. I don't think there should be any issues calculating variance at any of the scale factors I pasted above. And this should even hold true if I modify the example to be zero mean in all columns before passing it to StandardScaler.

@jeremiedbb (Member) commented Mar 25, 2021

Do you know if the PR that was opened works on the code snippet I pasted above? To me that's the critical test

I did not fix StandardScaler yet, the threshold is still there.

I haven't thought about this too much, but zero mean data is going to be a problem here, no?

Sure, there is another term which dominates when the mean is also very small.

@larsoner (Contributor, Author)

I did not fix StandardScaler yet, the threshold is still there.

Should we revert the changes from that PR, then?

@ogrisel (Member) commented Mar 29, 2021

Should we revert the changes from that PR, then?

Before considering reverting, we will try to see if we can both fix the original bug and automatically detect constant features based on the tighter theoretical bound of our variance computation algorithm that @jeremiedbb mentioned above.

@jeremiedbb (Member)

I opened #19788 to fix this issue. The criterion to detect near-constant features has been improved.

The snippet in the description now always returns True.

For those interested, here's an example showing that having a criterion to detect near-constant features (other than strict 0 variance) is a good thing:

import numpy as np 
from sklearn.preprocessing import StandardScaler 

for scale in (1e15, 1e10, 1e5, 1, 1e-5, 1e-10, 1e-15): 
    data = 1 + np.random.RandomState(0).rand(1000, 4) * scale 
    stds = np.std(data, axis=0) 
    means = np.mean(data, axis=0) 
    X_scaled = (data - means) / stds 
    print("means", X_scaled.mean(axis=0)) 
    print("stds", X_scaled.std(axis=0))
means [ 2.57571742e-16  1.92246219e-15  1.33759670e-15 -1.16728849e-15]
stds [1. 1. 1. 1.]
means [ 2.18391971e-15 -7.41851025e-16 -1.60316205e-16 -7.54285523e-16]
stds [1. 1. 1. 1.]
means [ 1.61509695e-15 -7.01660952e-16  1.75193193e-16  6.30828723e-16]
stds [1. 1. 1. 1.]
means [ 6.30229202e-15 -1.02293729e-14  4.46975790e-16 -9.92428362e-16]
stds [1. 1. 1. 1.]
means [ 1.42086466e-10  6.99443579e-10  2.91957548e-10 -1.42117233e-10]
stds [1. 1. 1. 1.]
means [-2.57954591e-05  1.82366224e-05 -2.49108951e-05 -3.29294194e-05]
stds [1. 1. 1. 1.]
means [0.69852162 0.69162771 0.85114366 0.85133945]
stds [0.71558895 0.72225418 0.52493282 0.52461523]

Everything is as expected for the first 6 scales: the mean of the scaled X is approximately 0 and the std is 1.
However, for the last one, the mean and std of the scaled X are garbage. In that case, StandardScaler will not scale.

@ogrisel (Member) commented Mar 30, 2021

Very nice! Thanks so much @jeremiedbb for taking the time to understand the root cause of the problem and design this fix.
