BUG: Regression with StandardScaler due to #19527 #19726
I suggest this fix in MNE: mne-tools/mne-python#9173
We can work around it in MNE, but I consider it a (substantial) regression if scaling data by 1e-10 makes a class whose primary (only?) job is to normalize data no longer normalize it. Surely MNE will not be the only project to hit this issue, and it happens silently, so we wouldn't even have known except that we had a regression test. It seems like there should at least be a way to disable this behavior (strict=True?) with a visible deprecation warning, but even that seems like dangerous and incorrect behavior to me. One potential fix for the issue the breaking PR tried to address is to add another processing step that zeros out low variance columns, where the user can specify what "low variance" means. This could be added to StandardScaler with a (backward compatible) default of 0, and 1e-10 or whatever could be used in the case that drove the PR.
Thanks for the report @larsoner. Let me flag this as a potential regression to investigate. We can probably change the default criterion. In private discussions, @jeremiedbb suggested that we should not decide whether a feature is constant based on an absolute criterion but using a relative criterion: for instance, compare the variance to eps * (square mean). I would need to check if this would make the fix for #19450 still work.
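A minimal sketch of what such a relative criterion could look like (the function name and the exact tolerance are hypothetical illustrations, not sklearn's actual API):

```python
import numpy as np

def is_near_constant(x, rtol=np.finfo(np.float64).eps):
    # Hypothetical relative criterion: treat the feature as constant when
    # its variance is below eps * mean**2. Note that this degenerates when
    # the mean is close to 0, which is exactly the concern about zero mean
    # data raised later in this thread.
    return np.var(x) <= rtol * np.mean(x) ** 2

# A strictly constant feature is flagged regardless of its scale...
print(is_near_constant(np.full(1000, 1e-10)))
# ...while a genuinely varying feature is not, even at tiny magnitudes.
print(is_near_constant(np.random.RandomState(0).rand(1000) * 1e-10))
```

Because the criterion is relative to the mean, it does not break for small-magnitude data the way an absolute threshold does.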
I think this is not enough to avoid breaking things for people with existing workflows that worked before #19527. To me it seems like the simplest backward compatible solution is to add an option to disable this behavior, as suggested above.
The computed variance is accurate up to some precision. When the variance is very small and the computed value lies within the error bounds of the algorithm used to compute it, the returned value can't be trusted. In that case, the relative error on the computed variance is > 100% and thus it is indistinguishable from a 0 variance. I think having a threshold where the variance is considered 0 is correct (provided that the threshold corresponds to the error bounds). It will not break code that was not correct in the first place, e.g. code relying on a computed variance smaller than the error bounds of the algorithm. However, the bound set in #19527 is far too big; for the current implementation of the variance, it should correspond to the error bounds of the algorithm.
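To illustrate the point (this is just a demonstration using a naive one-pass formula, not the algorithm sklearn actually uses): when the true variance lies below the floating point error bound of the computation, the computed value is dominated by rounding error:

```python
import numpy as np

# A feature with a large mean (~1) but a tiny true variance (~8e-20).
x = 1.0 + np.random.RandomState(0).rand(1000) * 1e-9

# Naive one-pass formula E[x^2] - E[x]^2: both terms are ~1, so the
# subtraction cancels catastrophically and the result is only accurate
# to about eps * 1 ~ 2e-16, far larger than the true variance itself.
naive = np.mean(x ** 2) - np.mean(x) ** 2

# NumPy's shifted two-pass computation stays accurate at this scale.
two_pass = np.var(x)

print(naive, two_pass)
```

Here the one-pass result has a relative error far above 100%, which is exactly the regime where a computed variance is indistinguishable from 0.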
Including in the case where we have sample weights?
Yes, the sample weights have no impact on the precision of the variance algorithm.
I opened a first PR (#19766) to fix the issue described here: #19450 (comment). This would by itself have solved the issues encountered with normalize and StandardScaler. I still think it's valuable to detect near constant features as done in #19527, but with a more appropriate threshold.
I haven't thought about this too much, but zero mean data is going to be a problem here, no?
Do you know if the PR that was opened works on the code snippet I pasted above? To me that's the critical test, and what I would add as a unit test of StandardScaler to ensure it continues to work. I don't think there should be any issues calculating variance at any of the scale factors I pasted above. And this should even hold true if I modify the example to be zero mean in all columns before passing it to StandardScaler.
I did not fix StandardScaler yet; the threshold is still there.
Sure, there is another term which dominates when the mean is also very small.
Should we revert the changes from that PR, then?
Before considering reverting, we will try to see if we can fix both the original bug and automatically detect constant features based on the tighter theoretical bound of our variance computation algorithm @jeremiedbb described above.
I opened #19788 to fix this issue. The criterion to detect near constant features has been improved; the snippet in the description now always returns True. For those interested, here's an example showing that having a criterion to detect near constant features (other than strict 0 variance) is a good thing:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

for scale in (1e15, 1e10, 1e5, 1, 1e-5, 1e-10, 1e-15):
    data = 1 + np.random.RandomState(0).rand(1000, 4) * scale
    stds = np.std(data, axis=0)
    means = np.mean(data, axis=0)
    X_scaled = (data - means) / stds
    print("means", X_scaled.mean(axis=0))
    print("stds", X_scaled.std(axis=0))
```
Everything is as expected for the first 6 scales, but not for the last one.
Very nice! Thanks so much @jeremiedbb for taking the time to understand the root cause of the problem and design this fix.
Describe the bug
#19527 introduced a regression with StandardScaler when dealing with data with small magnitudes.
Steps/Code to Reproduce
In MNE-Python some of our data channels have magnitudes in the ~1e-13 range. On 638b768 or before, this code (which uses random data of different scales) returns all True, which seems correct:
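The original snippet is not preserved in this thread; the following is a plausible reconstruction based on the description (random data at several scales, checking that every transformed feature has unit standard deviation) — the exact shapes and scale factors are assumptions:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

results = []
for scale in (1e15, 1e10, 1e5, 1, 1e-5, 1e-10, 1e-15):
    data = np.random.RandomState(0).rand(1000, 4) * scale
    X = StandardScaler().fit_transform(data)
    # After standardization, every feature should have unit std.
    results.append(np.allclose(X.std(axis=0), 1.0))

# Reported all True on 638b768; the last two scales were False after
# #19527, and are True again after the fix in #19788.
print(results)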
But on c748e46 / after #19527, anything "too small" starts to fail: I get 5 True and the last two scale factors (1e-10, 1e-15) False. Hence StandardScaler no longer standardizes the data.

cc @ogrisel since this came from your PR, and @maikia @rth @agramfort since you approved the PR