enhancement: sklearn.utils.shuffle consume 2x memory, better do it in-place #7754
Comments
Do you mean that we should provide |
I think it should shuffle in place to reduce memory consumption. If you do not want to break existing code that uses sklearn, you could provide a copy=False option. |
@mingwugmail are you concerned about the use in your own code or within scikit-learn estimators? |
Not just in my own code. This function in the sklearn library would be better off shuffling in place. Everybody's training data is getting bigger and bigger these days, so in-place shuffling would let anyone use the library without worrying about memory consumption. |
My question was about people using |
(I don't think it's true that everyone's training data is getting bigger
|
I don't think we currently use sklearn.utils.shuffle internally.
|
Git grep says you're right (to my slight surprise as we don't often add utils we don't use). |
I put a comment here to explain why in-place shuffling is not so simple to implement for heterogeneous data structures: #22003 (comment)
For homogeneous data structures (all arrays), it would be possible to use numpy's in-place shuffle repeatedly, using an RNG in the same state each time (to apply the same permutation several times):
    for a in arrays:
        rng = np.random.RandomState(seed)
        rng.shuffle(a)
    return arrays
Actually, it seems that this approach can also work on a mix of numpy arrays and Python lists, for instance. However, it does not work on scipy CSR sparse matrices and pandas dataframes. Maybe we could still write an in-place implementation of shuffle based on |
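The loop in the comment above can be turned into a small runnable sketch. The helper name `shuffle_in_place`, its `seed` parameter, and the length check are assumptions made for illustration; nothing here is an existing scikit-learn function. It relies on `RandomState.shuffle` drawing the same swap sequence for any container of the same length when the RNG starts from the same seed.

```python
import numpy as np


def shuffle_in_place(*arrays, seed=0):
    """Hypothetical helper: apply one permutation to every input, in place."""
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all inputs must have the same length")
    for a in arrays:
        # Re-create the RNG so each input sees the identical swap sequence.
        rng = np.random.RandomState(seed)
        rng.shuffle(a)


# Row i of X starts as [2*i, 2*i + 1]; y[i] is 2*i; labels[i] is the i-th letter.
X = np.arange(10).reshape(5, 2)
y = X[:, 0].copy()
labels = list("abcde")
X_buffer = X.ctypes.data  # address of X's data before shuffling

shuffle_in_place(X, y, labels, seed=42)

rows_still_aligned = bool(np.array_equal(X[:, 0], y))
labels_still_aligned = labels == ["abcde"[v // 2] for v in y]
same_buffer = X.ctypes.data == X_buffer
```

As the comment notes, this only covers inputs that `RandomState.shuffle` can handle, such as numpy arrays and Python lists; sparse matrices and dataframes would need a different approach.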
Row-wise pandas swaps might be expensive though. I am not so sure it would be worth investing time in this. |
Description
sklearn.utils.shuffle uses double the amount of memory in my test. I'd like an in-place implementation that does not use 2x the memory of the input arrays.
numpy/numpy#8204
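One plausible source of the doubled footprint, offered here as an assumption rather than a trace of scikit-learn's code, is that a shuffled copy is built by indexing with a permuted index array. The numpy-only sketch below contrasts that with `RandomState.shuffle`, which reorders rows inside the existing buffer. The array shape and seed are arbitrary.

```python
import numpy as np

rng = np.random.RandomState(0)
X = np.arange(12, dtype=np.float64).reshape(6, 2)

# Copying approach: index with a shuffled permutation of the row indices.
idx = rng.permutation(X.shape[0])
X_copy = X[idx]  # integer-array indexing allocates a new array
copy_shares_memory = np.shares_memory(X, X_copy)

# In-place approach: rows are swapped inside the existing buffer.
buffer_before = X.ctypes.data
rng.shuffle(X)
same_buffer = X.ctypes.data == buffer_before
```

While `X` and `X_copy` are both alive, the process holds two arrays of `X.nbytes` each, which matches the roughly 2x usage described above.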
Steps/Code to Reproduce
Expected Results
Actual Results
Versions