BUG: StandardScaler partial_fit overflows #5602
Comments
A simple workaround is to keep track of the sample mean itself, instead of the sample sum. Here is a gist. However, there is a caveat: the computation of the sample variance is less numerically accurate.
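The workaround described above can be sketched as an incremental mean update; this is a minimal illustration of the idea, not scikit-learn's actual code, and the function name `running_mean` is hypothetical:

```python
def running_mean(stream):
    # Track the mean directly instead of the raw sum:
    #   mean_n = mean_{n-1} + (x_n - mean_{n-1}) / n
    # The accumulator stays on the scale of the data, so it cannot
    # overflow the way a growing raw sum can.
    mean, n = 0.0, 0
    for x in stream:
        n += 1
        mean += (x - mean) / n
    return mean

print(running_mean([1.0, 2.0, 3.0, 4.0]))  # 2.5
```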
Notice that while in the first code snippet I use large numbers to simulate the online stream, in the second case the length of the stream does not matter; it is the absolute value of the mean itself that is problematic. Indeed, you can see this by running the second script with smaller values. I think this means that we want this change in the code, but I would like to hear other opinions first. Maybe @jakevdp? I have no idea if other estimators with `partial_fit` are affected.
We'd probably want to use something like Welford's algorithm to avoid overflow and precision loss in this case.
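For reference, here is a minimal sketch of Welford's algorithm, which computes the mean and variance in one pass without ever forming a raw sum; this is an illustration of the technique, not scikit-learn's implementation, and the function name `welford` is an assumption:

```python
def welford(stream):
    # Welford's online algorithm: numerically stable running mean and
    # variance, tracking the sum of squared deviations from the current
    # mean (m2) instead of a raw sum of squares.
    count = 0
    mean = 0.0
    m2 = 0.0
    for x in stream:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)  # note: uses the *updated* mean
    variance = m2 / count if count else float("nan")
    return mean, variance

mean, var = welford([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(mean, var)  # 5.0 4.0 (population variance)
```

Because `mean` and `m2` stay on the scale of the data, the accumulators do not blow up on long streams the way a running sum does.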
Started to take care of this in #11549
Is part of the issue that previously we were using Python ints and now we're using numpy ints?
No idea.
This has been fixed already. The partial_fit documentation says it has been stabilized: https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/preprocessing/_data.py#L879-L889
I can still reproduce it using the snippet in the OP's post. @ogrisel was recently looking at a similar thing, I think. Would you happen to know if this is solvable now?
The recent implementation of `partial_fit` for `StandardScaler` can overflow. A use case here is to transform an indefinitely long stream of data, but that is problematic with the current implementation. The reason is that, to compute the running mean, we keep track of the sample sum. Here is the code to reproduce the behavior. Simulating a long stream of data would take a long time; instead, I use samples with a very large norm, but the effect is the same. The same batch is presented to the transformer many times, so the mean should stay the same.
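The failure mode described here can be illustrated without scikit-learn; this is a hypothetical sketch of the sum-based approach, not the library's actual code. Accumulating a raw float32 sum over repeated large-norm batches overflows to infinity, even though the mean itself is perfectly representable:

```python
import numpy as np

# Same large-norm batch, presented to the accumulator many times.
batch = np.full(10, 1e37, dtype=np.float32)

total = np.float32(0.0)
count = 0
with np.errstate(over="ignore"):  # silence the expected overflow warning
    for _ in range(1000):
        total += batch.sum(dtype=np.float32)  # ~1e38 per batch
        count += batch.size

# The running sum exceeds float32's max (~3.4e38) after a few batches,
# so the recovered mean is inf rather than 1e37.
print(total / count)  # inf
```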