8000 ENH Implement utils.shuffle without copy #7754 by murata-yu · Pull Request #22003 · scikit-learn/scikit-learn · GitHub
[go: up one dir, main page]

Skip to content

ENH Implement utils.shuffle without copy #7754 #22003

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 10 commits into from

Conversation

murata-yu
Copy link
Contributor

Reference Issues/PRs

Enhances #7754

What does this implement/fix? Explain your changes.

Added copy parameter in utils.shuffle.
If copy=False, indexable data-structures in *arrays are destructed instead of consuming less memory.

Any other comments?

utils._mocking.MockDataFrame is entered in utils.shuffle, but I can't implement shuffle MockDataFrame without copy.
So, it raise ValueError when shuffle MockDataFrame without copy.

memory consume

The purpose of this PR is to reduce utils.shuffle's memory consumption.
I compared the memory consumption with and without copying

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     7     71.2 MiB     71.2 MiB           1   @profile
     8                                         def profile_samples(length: int):
     9     79.3 MiB      8.1 MiB           1       base_array = np.array(list(range(length)))
    10     79.3 MiB      0.0 MiB           1       base_id = id(base_array)
    11                                         
    12     95.3 MiB     16.0 MiB           1       copied = utils.shuffle(base_array)
    13    103.3 MiB      8.1 MiB           1       uncopied = utils.shuffle(base_array, copy=False)

utils.shuffle without copy consumes half the memory of utils.shuffle with copy

Copy link
Member
@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR, unfortunately I think the problem is much more complex as explained below:

resampled_arrays = [_safe_indexing(a, indices) for a in arrays]
else:
for a in arrays:
a[:] = _safe_indexing(a, indices)
Copy link
Member
@ogrisel ogrisel Dec 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately I am pretty sure that _safe_indexing(a, indices) still produces an intermediate allocation of the size of a.

I it is not visible in the memory profile because this way of profiling the memory measures the memory usage before and after line 618, but not during the execution of line 618.

To confirm this (if you do not trust me :) you can try to write a Python script that calls shuffle(X) on a larger X (e.g. 1GB) and run it with memory_profiler in "sampling" mode (also known as "time-based mode") with the mprof run command instead of the line-by-line tracing mode.

I think the only way to implement in-place shuffling memory efficiently would be to implement the Durstenfeld variant of the Fisher–Yates algorithm:

https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle#The_modern_algorithm

EDIT: or Sattolo's algorithm:

https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle#Sattolo's_algorithm

This has several drawbacks though:

  • first it needs an explicit loop of size n_samples which might be slow to do in Python and would be much faster in Cython.
  • however while implementing it Cython should be easily doable when all inputs are numpy arrays, this would be much more challenging for other input data structures (e.g. Python lists, pandas DataFrames/Series, and so on.
  • finally inplace assignments for pairs of indices that would work for all the datastructures supported by _safe_indexing would require a similar utility function (e.g. _safe_index_assign(X, indices, values)) because index based assignment has no consistent API for all those libraries.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your comment.
I understand that the problem is complicated.

This issue seems difficult for me to solve, so I'm sorry but I would like to close this PR...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no problem.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There might be a way forward to use numpy.random.shuffle as explained in #7754 (comment) but I am not sure how complex it would be to get the details right.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants
0