Is it possible to reduce StandardScaler.fit() memory consumption? #5651
Comments
Hmm, try enabling the "copy" parameter of the constructor.
As far as I've seen, the copy parameter is useful for the transform() method rather than the fit() one.
I just noticed that by simply using numpy's implementation of std, memory consumption can be reduced by half.
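The idea is roughly the following sketch (the exact snippet isn't reproduced here; `data` stands in for the big matrix, with an illustrative size):

```python
import numpy as np

# Sketch: compute the statistics with plain numpy and scale in place,
# instead of letting StandardScaler.fit() work on a second copy.
data = np.random.rand(10000, 50)   # stand-in for the large float64 matrix

mean = data.mean(axis=0)
std = data.std(axis=0)
std[std == 0.0] = 1.0              # guard against constant columns

data -= mean                        # in-place centering
data /= std                         # in-place scaling
```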
This works, and while it might be slightly slower (not sure), it uses significantly less memory.
This is correct. If your goal is just to scale big matrices without allocating too much memory, 0.17 now has a new option for that.
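Assuming the 0.17 addition meant here is StandardScaler.partial_fit (an assumption on my part), usage is roughly:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Assumption: the 0.17 feature alluded to above is partial_fit.
# Statistics are accumulated chunk by chunk, so the full matrix is
# never copied just to compute mean and variance.
data = np.random.rand(100000, 50)   # illustrative size
scaler = StandardScaler(copy=False)

chunk = 10000
for start in range(0, data.shape[0], chunk):
    scaler.partial_fit(data[start:start + chunk])

data = scaler.transform(data)       # copy=False scales in place when possible
```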
@giorgiop Thanks, good to know.
I have done some memory profiling. You are right: we are doubling memory consumption when calling fit(), and indeed your example above doubles the memory requirement too.
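For reference, that kind of measurement can be reproduced with something like the memory_profiler package (a sketch; the array shape here is illustrative, not the one used above):

```python
import numpy as np
from memory_profiler import memory_usage
from sklearn.preprocessing import StandardScaler

def fit_scaler():
    data = np.random.rand(20000, 500)   # ~80 MB of float64
    StandardScaler().fit(data)

# Peak resident memory while fit() runs, sampled by memory_profiler.
peak = max(memory_usage((fit_scaler, (), {})))
print("peak memory: %.1f MiB" % peak)
```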
As a rule of thumb, if you do not care about overwriting your input data, use copy=False.
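For example, with the functional interface (a sketch; this assumes a float array you do not mind overwriting):

```python
import numpy as np
from sklearn.preprocessing import scale

data = np.random.rand(10000, 50)
# copy=False overwrites `data` instead of allocating a scaled copy.
scale(data, copy=False)
```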
Hi, could you try:
Thanks!
You can use memory_profiler. Your last code only doubles the memory; the additional 4000 are required if you call it with copy=True.
Thanks for the clarification, and sorry I couldn't answer sooner.
And the fit() documentation doesn't say explicitly that the copy parameter will also influence the fit() function, which I personally wouldn't expect to make copies of the data.
@enoonIT would you like to work on a PR to fix the documentation? |
FYI everyone, I believe the solution provided by @enoonIT on October 31st, 2015 is now deprecated:
It fails because the transform() method performs a check_is_fitted() validation. My recommendation is to create your own scaler class that inherits from StandardScaler.
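A rough sketch of such a subclass (the class name and attribute choices here are illustrative, not code taken from this thread):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

class LowMemoryScaler(StandardScaler):
    """Illustrative subclass: compute the statistics with plain numpy and
    set the fitted attributes directly, so transform() sees a fitted scaler."""

    def fit(self, X, y=None):
        # Setting mean_ and scale_ ourselves means check_is_fitted()
        # inside transform() finds the attributes it expects.
        self.mean_ = X.mean(axis=0)
        self.scale_ = X.std(axis=0)
        self.scale_[self.scale_ == 0.0] = 1.0   # guard constant columns
        self.var_ = self.scale_ ** 2
        self.n_samples_seen_ = X.shape[0]
        return self

# usage: X = LowMemoryScaler(copy=False).fit(X).transform(X)
```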
I think if you set std_ and scale_, check_is_fitted will pass without complaint. But it's an interesting point that variance could be computed with better memory efficiency than numpy does. I would consider a PR which limits memory consumption here to a fixed buffer size. |
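To make the fixed-buffer idea concrete, a two-pass mean/std over row blocks could look roughly like this (an illustration only, not an actual proposal from this thread):

```python
import numpy as np

def chunked_mean_std(X, chunk_rows=1024):
    """Two-pass mean/std over row blocks: only about `chunk_rows` rows'
    worth of temporaries are alive at any time."""
    n_samples, n_features = X.shape

    mean = np.zeros(n_features)
    for start in range(0, n_samples, chunk_rows):
        mean += X[start:start + chunk_rows].sum(axis=0)
    mean /= n_samples

    var = np.zeros(n_features)
    for start in range(0, n_samples, chunk_rows):
        diff = X[start:start + chunk_rows] - mean   # small temporary only
        var += (diff * diff).sum(axis=0)
    var /= n_samples

    return mean, np.sqrt(var)
```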
It's unclear to me what is to be done for this issue.
The issue:
When applying StandardScaler to a big matrix, the memory requirements are high.
Example:
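Something along these lines (the original snippet isn't reproduced here; the sizes below are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A roughly 4 GB float64 matrix; fit() roughly doubles the process's
# memory footprint while it runs.
data = np.random.rand(1000000, 500)
scaler = StandardScaler()
scaler.fit(data)
```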
I guess it uses some lookup tables to speed up the standard deviation computation, but doubling the required RAM might be too much in some cases. A flag to enable a slower but less memory-intensive version would be nice.
Are there any solutions to reduce memory consumption?