ENH Implement utils.shuffle without copy #7754 #22003
Conversation
Thanks for the PR. Unfortunately, I think the problem is much more complex, as explained below:
```python
    resampled_arrays = [_safe_indexing(a, indices) for a in arrays]
else:
    for a in arrays:
        a[:] = _safe_indexing(a, indices)
```
Unfortunately, I am pretty sure that `_safe_indexing(a, indices)` still produces an intermediate allocation of the size of `a`.
It is not visible in the memory profile because this way of profiling measures the memory usage before and after line 618, but not during the execution of line 618.
To confirm this (if you do not trust me :) you can write a Python script that calls shuffle(X) on a larger X (e.g. 1GB) and run it with memory_profiler in "sampling" mode (also known as "time-based" mode) using the `mprof run` command instead of the line-by-line tracing mode.
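To illustrate the hidden temporary, here is a hedged sketch (not the scikit-learn code: `fancy_index_shuffle` is a hypothetical helper, and plain numpy fancy indexing stands in for `_safe_indexing`):

```python
# shuffle_mem.py -- run with `mprof run shuffle_mem.py` to see the peak.
import numpy as np

def fancy_index_shuffle(a, rng):
    """Shuffle `a` "in place" via fancy indexing.

    The right-hand side `a[indices]` materializes a full temporary copy
    of `a` before the assignment, so peak memory is still ~2x the array
    size, even though a line-by-line profile shows no growth afterwards.
    """
    indices = rng.permutation(len(a))
    a[:] = a[indices]
    return a

if __name__ == "__main__":
    # Scale this up (e.g. to ~1 GB) so the temporary shows up clearly
    # in mprof's time-based sampling.
    X = np.zeros(1_000_000)
    fancy_index_shuffle(X, np.random.default_rng(0))
```
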
I think the only way to implement in-place shuffling memory efficiently would be to implement the Durstenfeld variant of the Fisher–Yates algorithm:
https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle#The_modern_algorithm
EDIT: or Sattolo's algorithm:
https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle#Sattolo's_algorithm
This has several drawbacks though:
- first, it needs an explicit loop of size `n_samples`, which might be slow to do in Python and would be much faster in Cython.
- however, while implementing it in Cython should be easily doable when all inputs are numpy arrays, this would be much more challenging for other input data structures (e.g. Python lists, pandas DataFrames/Series, and so on).
- finally, in-place assignments for pairs of indices that would work for all the data structures supported by `_safe_indexing` would require a similar utility function (e.g. `_safe_index_assign(X, indices, values)`), because index-based assignment has no consistent API across all those libraries.
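For reference, a minimal sketch of the Durstenfeld variant for a 1-D numpy array or Python list (illustrative only; a real implementation would move the loop to Cython and would need a `_safe_index_assign`-style helper for other containers):

```python
import numpy as np

def fisher_yates_inplace(a, rng):
    """Durstenfeld's in-place Fisher-Yates shuffle: O(1) extra memory.

    The explicit Python loop over n_samples is the slow part that would
    be moved to Cython; element swaps like this only work cleanly for
    1-D arrays and mutable sequences, not e.g. DataFrame rows.
    """
    for i in range(len(a) - 1, 0, -1):
        j = rng.integers(0, i + 1)  # pick j uniformly with 0 <= j <= i
        a[i], a[j] = a[j], a[i]
    return a
```
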
Thank you for your comment.
I understand that the problem is complicated.
This issue seems too difficult for me to solve, so I'm sorry, but I would like to close this PR...
No problem.
There might be a way forward using `numpy.random.shuffle` as explained in #7754 (comment), but I am not sure how complex it would be to get the details right.
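One possible shape of that idea (a sketch only, not necessarily the approach worked out in #7754; `shuffle_arrays_inplace` is a hypothetical name): shuffling each array with an identically seeded generator applies the same permutation to all of them, with `numpy.random` doing each shuffle in place.

```python
import numpy as np

def shuffle_arrays_inplace(*arrays, seed=0):
    """Shuffle several same-length arrays in place, consistently.

    Each freshly seeded Generator draws the same random stream, so
    Generator.shuffle applies the same permutation along axis 0 to each
    array -- no index array and no temporary copy is allocated.
    """
    for a in arrays:
        np.random.default_rng(seed).shuffle(a)
    return arrays
```

Note this only works for containers that numpy's shuffle accepts (arrays and mutable sequences), which is part of why the details are non-trivial for e.g. pandas DataFrames.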
Reference Issues/PRs
Enhances #7754
What does this implement/fix? Explain your changes.
Added a `copy` parameter to `utils.shuffle`. If `copy=False`, the indexable data structures in `*arrays` are modified in place (destroyed) instead, consuming less memory.
Any other comments?
`utils._mocking.MockDataFrame` can be passed to `utils.shuffle`, but I can't implement shuffling a `MockDataFrame` without a copy. So it raises a `ValueError` when shuffling a `MockDataFrame` without copy.
Memory consumption
The purpose of this PR is to reduce `utils.shuffle`'s memory consumption. I compared the memory consumption with and without copying: `utils.shuffle` without copy consumes half the memory of `utils.shuffle` with copy.