BUG: StandardScaler partial_fit overflows #5602
Comments
A simple workaround is to keep track of the sample mean itself, instead of the sample sum. Here is a gist. However, there is a caveat: the computation of the sample variance is less numerically accurate.
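The workaround described above can be sketched as an incremental mean update; this is a minimal illustration of the idea, not scikit-learn's actual code, and the function name `running_mean` is hypothetical:

```python
def running_mean(stream):
    # Track the mean directly instead of the raw sum:
    #   mean_n = mean_{n-1} + (x_n - mean_{n-1}) / n
    # The accumulator stays on the scale of the data, so it cannot
    # overflow the way a growing raw sum can.
    mean, n = 0.0, 0
    for x in stream:
        n += 1
        mean += (x - mean) / n
    return mean

print(running_mean([1.0, 2.0, 3.0, 4.0]))  # 2.5
```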
Notice that while in the first code snippet I use large numbers to simulate the online stream, in the second case the length of the stream does not matter; it is the absolute value of the mean itself that is problematic. Indeed, you can see this by running the second script with smaller values. I think this means that we want this change in the code, but I would like to hear other opinions first. Maybe @jakevdp? I have no idea if other estimators with `partial_fit` are affected.
We'd probably want to use something like Welford's algorithm to avoid overflow and precision loss in this case.
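For reference, here is a minimal sketch of Welford's algorithm, which computes the mean and variance in one pass without ever forming a raw sum; this is an illustration of the technique, not scikit-learn's implementation, and the function name `welford` is an assumption:

```python
def welford(stream):
    # Welford's online algorithm: numerically stable running mean and
    # variance, tracking the sum of squared deviations from the current
    # mean (m2) instead of a raw sum of squares.
    count = 0
    mean = 0.0
    m2 = 0.0
    for x in stream:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)  # note: uses the *updated* mean
    variance = m2 / count if count else float("nan")
    return mean, variance

mean, var = welford([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(mean, var)  # 5.0 4.0 (population variance)
```

Because `mean` and `m2` stay on the scale of the data, the accumulators do not blow up on long streams the way a running sum does.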
Started to take care of this in #11549
Is part of the issue that previously we were using Python ints and now we're using numpy ints?
No idea.
This has been fixed already. The partial_fit documentation says it has been stabilized: https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/preprocessing/_data.py#L879-L889
I can still reproduce it using the snippet in the OP's post. @ogrisel was recently looking at a similar thing, I think. Would you happen to know if this is solvable now?
The recent implementation of `partial_fit` for `StandardScaler` can overflow. A use case here is to transform an indefinitely long stream of data, but that is problematic with the current implementation. The reason is that, to compute the running mean, we keep track of the sample sum. Here is the code to reproduce the behavior. Simulating a long stream of data would take a long time; instead, I use samples with a very large norm, but the effect is the same. The same batch is presented to the transformer many times, so the mean should stay the same.
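The failure mode described here can be illustrated without scikit-learn; this is a hypothetical sketch of the sum-based approach, not the library's actual code. Accumulating a raw float32 sum over repeated large-norm batches overflows to infinity, even though the mean itself is perfectly representable:

```python
import numpy as np

# Same large-norm batch, presented to the accumulator many times.
batch = np.full(10, 1e37, dtype=np.float32)

total = np.float32(0.0)
count = 0
with np.errstate(over="ignore"):  # silence the expected overflow warning
    for _ in range(1000):
        total += batch.sum(dtype=np.float32)  # ~1e38 per batch
        count += batch.size

# The running sum exceeds float32's max (~3.4e38) after a few batches,
# so the recovered mean is inf rather than 1e37.
print(total / count)  # inf
```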