Is it possible to reduce StandardScaler.fit() memory consumption? · Issue #5651 · scikit-learn/scikit-learn · GitHub

Is it possible to reduce StandardScaler.fit() memory consumption? #5651


Closed
enoonIT opened this issue Oct 31, 2015 · 14 comments · Fixed by #20652
Labels
Documentation · Easy (Well-defined and straightforward way to resolve) · help wanted

Comments

@enoonIT
enoonIT commented Oct 31, 2015

The issue:
When applying StandardScaler to a big matrix, the memory requirements are very high.

Example:

big = np.random.random([495982, 4098])  # this is around 8GB
scaler = StandardScaler()
scaler.fit(big)  # this will require nearly another 16GB of RAM

I guess it uses some lookup tables to speed up the standard deviation computation, but doubling the required RAM might be too much in some cases. A flag to enable a slower but less memory-intensive version would be nice.
Are there any solutions to reduce memory consumption?

@olologin
Contributor

Hmm, try enabling the "copy" parameter of the constructor.
I can't test it on my machine right now.

@enoonIT
Author
enoonIT commented Oct 31, 2015

As far as I've seen, the copy parameter is useful for the transform() method rather than the fit() one.

@enoonIT
Author
enoonIT commented Oct 31, 2015

I just noticed that simply by using numpy's implementation of std, the memory consumption can be reduced by half:

big = np.random.random([495982, 4098])  # this is around 8GB
std = big.std(0)  # this _only_ requires another 8GB
mean = big.mean(0)
scaler = StandardScaler(copy=False)
scaler.std_ = std
scaler.mean_ = mean
big = scaler.transform(big)

This works, and while it might be slightly slower (not sure), it uses significantly less memory.

@giorgiop
Contributor

As far as I've seen, the copy parameter is useful for the transform() method rather than the fit() one.

This is correct.

If your goal is just to scale big matrices without allocating too much memory, 0.17 now has partial_fit methods.

@enoonIT
Author
enoonIT commented Nov 1, 2015

@giorgiop Thanks, good to know.

@giorgiop
Contributor
giorgiop commented Nov 3, 2015

I have done some memory profiling. You are right: we are doubling memory consumption when calling fit, even with copy=False. However, numpy.var is the reason. Look here:

@profile
def numpy_var():
    shape = (2 ** 16, 2 ** 12)
    X = np.random.RandomState(0).uniform(-1, 1, size=shape)  # ~2000 MiB
    X.var(axis=0)

[figure_1: memory profile of numpy_var]

and indeed your example above doubles the memory requirement too:

@profile
def enoonIT_example():
    shape = (2 ** 16, 2 ** 12)
    big = np.random.RandomState(0).uniform(-1, 1, size=shape) # ~2000 MiB
    std = big.std(0) # + more ~2000 MiB
    mean = big.mean(0)
    scaler = StandardScaler(copy=False)
    scaler.scale_ = std
    scaler.mean_ = mean
    big = scaler.transform(big)

[figure_2: memory profile of enoonIT_example]

As a rule of thumb, if you do not care about overwriting your input data, use copy=False. If that's not enough, call partial_fit in a loop over batches of the data. This may be slower, but it will fit in memory.
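For reference, here is a minimal sketch of that batched approach; the batch size is an arbitrary choice and should simply be small enough that one batch and its temporaries fit comfortably in RAM:

import numpy as np
from sklearn.preprocessing import StandardScaler

big = np.random.random([495982, 4098])  # or your own data
batch_size = 10000  # arbitrary; pick so that one batch fits easily in memory

scaler = StandardScaler(copy=False)

# First pass: accumulate the running mean/variance statistics batch by batch.
for start in range(0, big.shape[0], batch_size):
    scaler.partial_fit(big[start:start + batch_size])

# Second pass: scale each batch in place.
for start in range(0, big.shape[0], batch_size):
    big[start:start + batch_size] = scaler.transform(big[start:start + batch_size])

The same two-pass loop also works when the batches are streamed from disk instead of sliced from an in-memory array.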

@enoonIT
Author
enoonIT commented Nov 3, 2015

Hi,
how do you profile memory consumption?

Could you try:

@profile
def enoonIT_fit_example():
    shape = (2 ** 16, 2 ** 12)
    big = np.random.RandomState(0).uniform(-1, 1, size=shape) # ~2000 MiB
    scaler = StandardScaler(copy=False).fit(big) # this should add another 4000MiB
    big = scaler.transform(big)

Thanks!

@giorgiop
Contributor
giorgiop commented Nov 3, 2015

You can use memory_profiler. Your last code only doubles the memory. The additional 4000 MiB are required if you call with copy=True (second plot below).

[figure_3: memory profile with copy=False; figure_4: memory profile with copy=True]
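For completeness, a minimal sketch of how reports and plots like these can be produced with the memory_profiler package (the file name is hypothetical):

# profile_scaler.py
# pip install memory_profiler
from memory_profiler import profile

import numpy as np
from sklearn.preprocessing import StandardScaler

@profile
def enoonIT_fit_example():
    shape = (2 ** 16, 2 ** 12)
    big = np.random.RandomState(0).uniform(-1, 1, size=shape)  # ~2000 MiB
    scaler = StandardScaler(copy=False).fit(big)
    big = scaler.transform(big)

if __name__ == "__main__":
    enoonIT_fit_example()

Running python profile_scaler.py prints a line-by-line memory report for the decorated function; mprof run profile_scaler.py followed by mprof plot produces memory-over-time figures like the ones above.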

@enoonIT
Author
enoonIT commented Nov 16, 2015

Thanks for the clarification, and sorry I couldn't answer sooner.
I think my confusion was due to the documentation of the copy parameter and the fit method:
Copy parameter:

If False, try to avoid a copy and do inplace scaling instead. This is not guaranteed to always work inplace; e.g. if the data is not a NumPy array or scipy.sparse CSR matrix, a copy may still be returned.

And fit()

Compute the mean and std to be used for later scaling.

It doesn't explicitly say that the copy parameter also influences the fit() method, which I personally wouldn't expect to make copies of the data.
Maybe this could be explained better in the docs?

@giorgiop
Contributor
giorgiop commented Feb 9, 2016

@enoonIT would you like to work on a PR to fix the documentation?

@amueller added the Easy (Well-defined and straightforward way to resolve), Documentation, and Need Contributor labels on Oct 27, 2016
@ychennay
ychennay commented Dec 7, 2017

FYI everyone, I believe the solution provided by @enoonIT on October 31st, 2015 is now deprecated:

big = np.random.random([495982, 4098])  # this is around 8GB
std = big.std(0)  # this _only_ requires another 8GB
mean = big.mean(0)
scaler = StandardScaler(copy=False)
scaler.std_ = std
scaler.mean_ = mean
big = scaler.transform(big)

It fails because the transform() method performs check_is_fitted() and will throw a NotFittedError, since the instance expects us to call fit() before transform(). Also, scaler.std_ has been replaced by scaler.scale_.

My recommendation is to create your own scaler class that inherits from StandardScaler, overrides transform(), and skips the check_is_fitted() call.
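A rough sketch of that idea, assuming the current attribute names (note that the next comment suggests simply setting the fitted attributes may already be enough, making the subclass unnecessary):

import numpy as np
from sklearn.preprocessing import StandardScaler

class ManualStandardScaler(StandardScaler):
    # Hypothetical subclass: scale with mean_/scale_ set directly by the
    # caller, bypassing the is-fitted validation in StandardScaler.transform.
    def transform(self, X, copy=None):
        X = np.asarray(X)
        X -= self.mean_   # in-place, in the spirit of copy=False
        X /= self.scale_
        return X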

@jnothman
Member
jnothman commented Dec 7, 2017

I think if you set std_ and scale_, check_is_fitted will pass without complaint.

But it's an interesting point that variance could be computed with better memory efficiency than numpy does. I would consider a PR which limits memory consumption here to a fixed buffer size.
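To illustrate the fixed-buffer idea, here is a rough sketch (not scikit-learn's implementation) that computes the column means and standard deviations in blocks of columns, so the deviation temporaries numpy allocates are capped at n_samples x block_size instead of n_samples x n_features; the statistics are then attached to the scaler by hand, as discussed above (whether transform() accepts attributes set this way depends on the scikit-learn version):

import numpy as np
from sklearn.preprocessing import StandardScaler

def blockwise_mean_std(X, block_size=256):
    # Per-column mean and std, computed one block of columns at a time so
    # that numpy's internal temporaries are only n_samples x block_size.
    n_features = X.shape[1]
    mean = np.empty(n_features)
    std = np.empty(n_features)
    for start in range(0, n_features, block_size):
        stop = min(start + block_size, n_features)
        block = X[:, start:stop]             # a view, no copy
        mean[start:stop] = block.mean(axis=0)
        std[start:stop] = block.std(axis=0)  # temporaries are block-sized
    return mean, std

big = np.random.random([495982, 4098])

scaler = StandardScaler(copy=False)
scaler.mean_, scaler.scale_ = blockwise_mean_std(big)
# Guard against constant columns (std == 0) before dividing, if your data may contain them.
big = scaler.transform(big)

With block_size=256, the largest temporary is roughly 4098/256 ≈ 16 times smaller than the full-width deviation array that numpy.var would otherwise allocate.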

@amueller
Member

It's unclear to me what is to be done for this issue: copy is not used in fit, and it sounds like the memory consumption comes from numpy?

@hosjiu1702

So after all, which solution are you going to go with for this problem? @amueller @giorgiop
