8000 [WIP] No-copy semantics for large memoryviews by ogrisel · Pull Request #138 · cloudpipe/cloudpickle · GitHub
[go: up one dir, main page]

Skip to content

[WIP] No-copy semantics for large memoryviews #138

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 31 commits into from

Conversation

ogrisel
Copy link
Contributor
@ogrisel ogrisel commented Dec 4, 2017

Here is a PR that backports some of the work done for large bytes in upstream CPython 3.7 or 3.8 in the following PR: python/cpython#4353.

The goal is to make the memoryview support of cloudpickle benefit from it and implement nocopy semantics and later any nested numpy arrays datastructures (e.g. pandas dataframes, scipy sparse arrays, large scikit-learn RandomForestClassifier...).

In particular, this would make it possible for dask distributed to spill any large numpy-based datastructure to disk without making any temporary in-memory copy on workers that are close to their memory usage limit.

Please do not merge this PR while the upstream CPython PR is still under review.

  • prevent _memoryview_from_bytes to ever mutate single bytes interned bytes instances,
  • add a battery of unittests to check edge cases with bytes mutability,
  • take the numpy serializer out of the test suite and make it available to users,
  • fix pickling for numpy arrays with the object dtype,
  • wait for the final review of bpo-31993: do not allocate large temporary buffers in pickle dump python/cpython#4353,
  • disable memoization of bytes when serializing a non-contiguous memoryview,
  • refactor the new tests to make them less redundant.

/cc @pitrou @mrocklin.

@ogrisel ogrisel added this to the 0.6 milestone Dec 4, 2017
@codecov-io
Copy link
codecov-io commented Dec 4, 2017

Codecov Report

Merging #138 into master will increase coverage by 2%.
The diff coverage is 95.23%.

Impacted file tree graph

@@           Coverage Diff            @@
##           master     #138    +/-   ##
========================================
+ Coverage   83.85%   85.86%    +2%     
========================================
  Files           2        1     -1     
  Lines         539      658   +119     
  Branches       98      121    +23     
========================================
+ Hits          452      565   +113     
- Misses         64       67     +3     
- Partials       23       26     +3
Impacted Files Coverage Δ
cloudpickle/cloudpickle.py 85.86% <95.23%> (+2.09%) ⬆️
cloudpickle/__init__.py

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update abeb3fb...0dbff61. Read the comment docs.

@ogrisel
Copy link
Contributor Author
ogrisel commented Dec 6, 2017

@HyukjinKwon I would love to get your feedback on this PR if you have the time to review it. I still have a couple of things this to check:

  • serve the shape of the memoryview themselves (instead of just at the numpy array level)
  • I would like to see if we can get nocopy semantics for pandas and scipy sparse matrices for free (early quick tests show that it does not seem to work for pandas, we have to investigate). Edit fixed: I just needed to handle f-contiguous arrays properly.

Question:

  • shall we provide the nocopy memoryview-based reducer for numpy arrays by default (instead of just providing an example in the tests)?

@pitrou
Copy link
Member
pitrou commented Dec 6, 2017

@ogrisel, I would hold on on this PR before the CPython changes get accepted and merged. In particular, one should study the possible issues with memoryview lifetime.

# - data is a temporary variable that has just been allocated from
# reading from the pickle stream and will never be used outside of
# the scope of this reducer.
# - bytes objects are not subject to interning.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really?

>>> "x".encode() is b"x"
True
>>> "xy".encode() is b"xy"
False

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hum. I concluded this from the following:

>>> sys.intern(b'x')
Traceback (most recent call last):
  File "<ipython-input-5-02fdb271b026>", line 1, in <module>
    sys.intern(b'x')
TypeError: intern() argument 1 must be str, not bytes

Do you know the maximum length of interned bytes objects?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently (from experiments), only instances with zero or one-byte length are "interned" although I could not google that this is supposed to be officially the case. I am having a hard time understanding the C code of the bytesobject.c file.

Copy link
Contributor
@llllllllll llllllllll Dec 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In pandas we just check the refcount of the object. Regardless of how interning happens, this will be safe if the refcount is exactly 1 AND you steal the reference. To actually steal the reference we implement the buffer protocol in C; right now you can still access your newly mutable bytestring by digging the the chain of base objects on the memory view and array. We just take a copy if the refcount is not 1. If the refcount is > 1 and the len is > 1, we raise a performance warning stating that something odd happened and a copy was done; however that doesn't happen. We don't warn under len 1 because a copy of 1 or two bytes is trivial.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In cloudpickle we have to stay 100% pure python (hence this hack in ctypes). Furthermore we have to respect the pickle protocol and this function is called by the load function when hitting the REDUCE opcode. At this point the bytes buffer has already been read and is poped from the unpickler stack, but it's still in the lexical scope of the caller function (load_reduce IIRC).

However this bytes object will never be used anywhere else, so I am still quite confident that this ctypes hack, when called in the context of the unpickler is "safe".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ogrisel the language doesn't guarantee anything here, so you are on your own if you try to mutate a bytes object without taking the kind of precaution @llllllllll talks about.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not saying that we need to use the C implementation shown in pandas, I am saying that we should copy the refcount checks that it does. We may also want to try harder to hide the mutable bytes object

Copy link
Contributor Author
@ogrisel ogrisel Dec 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking more about it, it might be possible to get the refcount down from 2 to 1 by nesting 2 reducers and doing the "mutate_move" in the wrapping reducer. I can also try to hide the _base reference with a property that raises or warn on getattr.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have pushed a fix. It still lacks a more complete test harness though. Will work on that next week.

@@ -249,6 +252,93 @@ def _walk_global_ops(code):
yield op, instr.arg


def _memoryview_from_bytes(data, format, readonly):
if not readonly:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have written something very similar for pandas: https://github.com/pandas-dev/pandas/blob/master/pandas/util/move.c#L160

The test suite exercises some interesting edge cases we may want to replicate: https://github.com/pandas-dev/pandas/blob/master/pandas/tests/util/test_util.py#L324

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks I will have a look.

@HyukjinKwon
Copy link
Member

Thanks @ogrisel. I have been trying to follow it. Will try to take a look to check / double check bit by bit when I have some time as well, probably from the next week.

# the memory buffer of the bytes object as writeable buffer to back the
# memoryview: this buffer is no longer unreachable from anywhere else.
if hasattr(sys, 'getrefcount'):
safe_to_mutate = sys.getrefcount(data_holder[0]) <= 2
Copy link
Contributor
@llllllllll llllllllll Dec 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When making calls like this, it may be nice to call out what the references are:

  1. The reference owned by data_holder
  2. The reference pushed onto the data stack from the expression data_holder[0]

Many people don't consider the second reference, so they would be confused why 2 references is safe.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what I explained in the above comment. Also, I reorganized the code to make it more readable and easier to test.

@ogrisel
Copy link
Contributor Author
ogrisel commented Dec 12, 2017

@llllllllll thanks for the feedback. I think I have addressed the safety concerns in the latest batch of commits. This ctypes-based implementation should mirror all the checks done in C in pandas.

@llllllllll
Copy link
Contributor

This is really impressive work!

@ogrisel
Copy link
Contributor Author
ogrisel commented Jun 4, 2018

This PR was a fun learning experience but it's definitely a hack. I think it's better to wait for PEP 574 to be implemented in upstream Python and adopted by numpy (see: numpy/numpy#11161).

@ogrisel ogrisel closed this Jun 4, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants
0