-
Notifications
You must be signed in to change notification settings - Fork 181
[WIP] No-copy semantics for large memoryviews #138
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #138 +/- ##
========================================
+ Coverage 83.85% 85.86% +2%
========================================
Files 2 1 -1
Lines 539 658 +119
Branches 98 121 +23
========================================
+ Hits 452 565 +113
- Misses 64 67 +3
- Partials 23 26 +3
Continue to review full report at Codecov.
|
@HyukjinKwon I would love to get your feedback on this PR if you have the time to review it. I still have a couple of things this to check:
Question:
|
@ogrisel, I would hold on on this PR before the CPython changes get accepted and merged. In particular, one should study the possible issues with memoryview lifetime. |
cloudpickle/cloudpickle.py
Outdated
# - data is a temporary variable that has just been allocated from | ||
# reading from the pickle stream and will never be used outside of | ||
# the scope of this reducer. | ||
# - bytes objects are not subject to interning. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really?
>>> "x".encode() is b"x"
True
>>> "xy".encode() is b"xy"
False
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hum. I concluded this from the following:
>>> sys.intern(b'x')
Traceback (most recent call last):
File "<ipython-input-5-02fdb271b026>", line 1, in <module>
sys.intern(b'x')
TypeError: intern() argument 1 must be str, not bytes
Do you know the maximum length of interned bytes objects?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apparently (from experiments), only instances with zero or one-byte length are "interned" although I could not google that this is supposed to be officially the case. I am having a hard time understanding the C code of the bytesobject.c
file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In pandas we just check the refcount of the object. Regardless of how interning happens, this will be safe if the refcount is exactly 1 AND you steal the reference. To actually steal the reference we implement the buffer protocol in C; right now you can still access your newly mutable bytestring by digging the the chain of base objects on the memory view and array. We just take a copy if the refcount is not 1. If the refcount is > 1 and the len is > 1, we raise a performance warning stating that something odd happened and a copy was done; however that doesn't happen. We don't warn under len 1 because a copy of 1 or two bytes is trivial.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In cloudpickle we have to stay 100% pure python (hence this hack in ctypes). Furthermore we have to respect the pickle protocol and this function is called by the load function when hitting the REDUCE opcode. At this point the bytes
buffer has already been read and is poped from the unpickler stack, but it's still in the lexical scope of the caller function (load_reduce
IIRC).
However this bytes object will never be used anywhere else, so I am still quite confident that this ctypes hack, when called in the context of the unpickler is "safe".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ogrisel the language doesn't guarantee anything here, so you are on your own if you try to mutate a bytes object without taking the kind of precaution @llllllllll talks about.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not saying that we need to use the C implementation shown in pandas, I am saying that we should copy the refcount checks that it does. We may also want to try harder to hide the mutable bytes object
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking more about it, it might be possible to get the refcount down from 2 to 1 by nesting 2 reducers and doing the "mutate_move" in the wrapping reducer. I can also try to hide the _base
reference with a property that raises or warn on getattr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have pushed a fix. It still lacks a more complete test harness though. Will work on that next week.
cloudpickle/cloudpickle.py
Outdated
@@ -249,6 +252,93 @@ def _walk_global_ops(code): | |||
yield op, instr.arg | |||
|
|||
|
|||
def _memoryview_from_bytes(data, format, readonly): | |||
if not readonly: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have written something very similar for pandas: https://github.com/pandas-dev/pandas/blob/master/pandas/util/move.c#L160
The test suite exercises some interesting edge cases we may want to replicate: https://github.com/pandas-dev/pandas/blob/master/pandas/tests/util/test_util.py#L324
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks I will have a look.
Thanks @ogrisel. I have been trying to follow it. Will try to take a look to check / double check bit by bit when I have some time as well, probably from the next week. |
cloudpickle/cloudpickle.py
Outdated
# the memory buffer of the bytes object as writeable buffer to back the | ||
# memoryview: this buffer is no longer unreachable from anywhere else. | ||
if hasattr(sys, 'getrefcount'): | ||
safe_to_mutate = sys.getrefcount(data_holder[0]) <= 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When making calls like this, it may be nice to call out what the references are:
- The reference owned by
data_holder
- The reference pushed onto the data stack from the expression
data_holder[0]
Many people don't consider the second reference, so they would be confused why 2 references is safe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what I explained in the above comment. Also, I reorganized the code to make it more readable and easier to test.
@llllllllll thanks for the feedback. I think I have addressed the safety concerns in the latest batch of commits. This ctypes-based implementation should mirror all the checks done in C in pandas. |
This is really impressive work! |
This PR was a fun learning experience but it's definitely a hack. I think it's better to wait for PEP 574 to be implemented in upstream Python and adopted by numpy (see: numpy/numpy#11161). |
Here is a PR that backports some of the work done for large bytes in upstream CPython 3.7 or 3.8 in the following PR: python/cpython#4353.
The goal is to make the
memoryview
support of cloudpickle benefit from it and implement nocopy semantics and later any nested numpy arrays datastructures (e.g. pandas dataframes, scipy sparse arrays, large scikit-learn RandomForestClassifier...).In particular, this would make it possible for dask distributed to spill any large numpy-based datastructure to disk without making any temporary in-memory copy on workers that are close to their memory usage limit.
Please do not merge this PR while the upstream CPython PR is still under review.
_memoryview_from_bytes
to ever mutate single bytes interned bytes instances,/cc @pitrou @mrocklin.