8000 enhancement: sklearn.utils.shuffle consume 2x memory, better do it in-place · Issue #7754 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

enhancement: sklearn.utils.shuffle consume 2x memory, better do it in-place #7754

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
mw66 opened this issue Oct 26, 2016 · 10 comments
Open

Comments

@mw66
Copy link
mw66 commented Oct 26, 2016

Description

sklearn.utils.shuffle use the double amount of memory in my test. I'd like the in-place implementation without using 2x times memory of the input arrays.

numpy/numpy#8204

Steps/Code to Reproduce

Expected Results

Actual Results

Versions

@jnothman
Copy link
Member

Do you mean that we should provide shuffle with a copy=False option?

@mw66
Copy link
Author
mw66 commented 8000 Oct 26, 2016

I think it should be in-place shuffling to reduce memory consumption.

If you do not want to break existing code using sklearn, you can provide such copy=False option.

@amueller
Copy link
Member

@mingwugmail are you concerned about the use in your own code or within scikit-learn estimators?
I think we'd accept a PR for adding a copy option to shuffle. How to add it to the estimators is probably a case-by-case decision.

@mw66
Copy link
Author
mw66 commented Oct 26, 2016

Not just in my own code.

This function in sklearn library should better do in-place shuffling, everybody's training data getting bigger and bigger these days, so anyone can use the library without worry about memory consumption.

@amueller
Copy link
Member

My question was about people using sklearn.utils.shuffle directly or when it is used within the models.

@jnothman
Copy link
Member 8000

(I don't think it's true that everyone's training data is getting bigger
and bigger; and to the extent that it is, they should probably find other
ways to work with it to avoid memory issues, rather than non-copying
shuffles.)

On 27 October 2016 at 07:27, mingwugmail notifications@github.com wrote:

Not just in my own code.

This function in sklearn library should better do in-place shuffling,
everybody's training data getting bigger and bigger these days, so anyone
can use the library without worry about memory consumption.


You are receiving this because you commented.
Reply to this email directly, view it on GitHub
#7754 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAEz6-MgpPk4CKoy1oVaBqMTV8zOYmsSks5q37efgaJpZM4KgshQ
.

@jnothman
Copy link
Member

I don't think we currently use sklearn.utils.shuffle internally.

On 27 October 2016 at 02:19, Andreas Mueller notifications@github.com
wrote:

@mingwugmail https://github.com/mingwugmail are you concerned about the
use in your own code or within scikit-learn estimators?
I think we'd accept a PR for adding a copy option to shuffle. How to add
it to the estimators is probably a case-by-case decision.


You are receiving this because you commented.
Reply to this email directly, view it on GitHub
#7754 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAEz6-DyIWaHa4tthRKGydYElHM-wZcyks5q32-bgaJpZM4KgshQ
.

@amueller
Copy link
Member

Git grep says you're right (to my slight surprise as we don't often add utils we don't use).
So the impact of doing this is small in sklearn but it might help user code. So I'm +1 for a PR.

@ogrisel
Copy link
Member
ogrisel commented Dec 20, 2021

I put a comment here to explain why this is not so simple to implement inplace shuffling for heterogeneous datastructures: #22003 (comment)

For homogeneous datastructures (all arrays) it would be possible to use numpy's in-place shuffle repeatedly using the an RNG in the same state (to apply the same permutation several times):

for a in arrays:
    rng = np.random.RandomState(seed)
    rng.shuffle(a)
return arrays

Actually it seems that this function can also work a mix of numpy arrays and Python lists for instance. However it does not work on scipy CSR sparse matrices and pandas dataframes.

Maybe we could still write an in-place implementation of shuffle based on numpy.random.shuffle by wrapping scipy sparse matrices and pandas dataframes with an API wrapper that makes them look like numpy arrays but converts the inplace assignments internally.

@ogrisel
Copy link
Member
ogrisel commented Dec 20, 2021

Row-wise pandas swaps might be expensive though. I am not so sure it would be worth to invest time in this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants
0