[WIP] No-copy semantics for large memoryviews #138

ogrisel · 2017-12-04T15:36:19Z

Here is a PR that backports some of the work done for large bytes in upstream CPython 3.7 or 3.8 in the following PR: python/cpython#4353.

The goal is for cloudpickle's memoryview support to benefit from it and implement nocopy semantics, first for memoryviews and later for any data structure that nests numpy arrays (e.g. pandas dataframes, scipy sparse arrays, large scikit-learn RandomForestClassifier...).

In particular, this would make it possible for dask distributed to spill any large numpy-based datastructure to disk without making any temporary in-memory copy on workers that are close to their memory usage limit.
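
As a rough illustration of the upstream behaviour being backported, the sketch below records the size of every `write` call that `pickle.dump` makes on a file-like object. `RecordingFile` and `dump_and_record` are hypothetical helpers introduced here, not part of cloudpickle or CPython. On interpreters that include the bpo-31993 change, a large `bytes` payload is expected to reach `write` as its own call instead of being copied into an intermediate frame buffer; the sketch only reports the call sizes and does not assume a particular result.

```python
import io
import pickle


class RecordingFile:
    """Hypothetical file-like wrapper that records the size of each write."""

    def __init__(self):
        self._buffer = io.BytesIO()
        self.write_sizes = []

    def write(self, chunk):
        # memoryview(...).nbytes works for bytes, bytearray and memoryview.
        self.write_sizes.append(memoryview(chunk).nbytes)
        return self._buffer.write(chunk)

    def getvalue(self):
        return self._buffer.getvalue()


def dump_and_record(obj, protocol=4):
    """Pickle obj into a RecordingFile; return (pickle bytes, write sizes)."""
    f = RecordingFile()
    pickle.dump(obj, f, protocol=protocol)
    return f.getvalue(), f.write_sizes


large = b"x" * (10 * 1024 * 1024)  # 10 MiB payload
pickled, sizes = dump_and_record(large)
print(sizes)
```

If one of the reported sizes equals the payload size, the payload went to the file in a single call, which is the situation the spill-to-disk use case relies on.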

Please do not merge this PR while the upstream CPython PR is still under review.

prevent _memoryview_from_bytes from ever mutating interned single-byte bytes instances,
add a battery of unittests to check edge cases with bytes mutability,
take the numpy serializer out of the test suite and make it available to users,
fix pickling for numpy arrays with the object dtype,
wait for the final review of bpo-31993: do not allocate large temporary buffers in pickle dump python/cpython#4353,
disable memoization of bytes when serializing a non-contiguous memoryview,
refactor the new tests to make them less redundant.
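
The contiguous versus non-contiguous distinction in the list above can be shown with the standard library alone. This is a minimal sketch, not code from the PR: a strided slice shares memory with its parent but is not contiguous, so it cannot be written out as one block and `tobytes()` has to build a temporary copy.

```python
buffer = bytearray(range(10))

full = memoryview(buffer)  # contiguous view over all 10 bytes
strided = full[::2]        # every other byte: same memory, not contiguous

# tobytes() materializes the strided view into a new, contiguous bytes object.
flat_copy = strided.tobytes()

print(full.contiguous, strided.contiguous, strided.nbytes)
```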

/cc @pitrou @mrocklin.

codecov-io · 2017-12-04T15:40:05Z

Codecov Report

Merging #138 into master will increase coverage by 2%.
The diff coverage is 95.23%.

@@           Coverage Diff            @@
##           master     #138    +/-   ##
========================================
+ Coverage   83.85%   85.86%    +2%     
========================================
  Files           2        1     -1     
  Lines         539      658   +119     
  Branches       98      121    +23     
========================================
+ Hits          452      565   +113     
- Misses         64       67     +3     
- Partials       23       26     +3

Impacted Files	Coverage Δ
cloudpickle/cloudpickle.py	`85.86% <95.23%> (+2.09%)`	⬆️
cloudpickle/__init__.py

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update abeb3fb...0dbff61. Read the comment docs.

ogrisel · 2017-12-06T17:22:10Z

@HyukjinKwon I would love to get your feedback on this PR if you have the time to review it. I still have a couple of things to check:

preserve the shape at the level of the memoryviews themselves (instead of just at the numpy array level)
I would like to see if we can get nocopy semantics for pandas and scipy sparse matrices for free (early quick tests show that it does not seem to work for pandas; we have to investigate). Edit: fixed, I just needed to handle F-contiguous arrays properly.
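
To make the first item concrete, here is a small standard-library sketch (not the PR's code) of what keeping the shape at the memoryview level means. A multi-dimensional view loses its shape and format once it is flattened to raw bytes, and `memoryview.cast` can restore them as long as both are recorded next to the data.

```python
import array

values = array.array("i", range(6))
flat = memoryview(values)                  # 1-D view, format 'i'
shaped = flat.cast("B").cast("i", (2, 3))  # same memory viewed as 2 x 3

raw = shaped.tobytes()                     # a bytes-only dump keeps no shape
restored = memoryview(raw).cast("i", shaped.shape)

print(shaped.shape, memoryview(raw).ndim, restored.tolist())
```

Casting goes through the byte format `'B'` because `cast` requires either the source or the target format to be a byte format.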

Question:

shall we provide the nocopy memoryview-based reducer for numpy arrays by default (instead of just providing an example in the tests)?

pitrou · 2017-12-06T17:55:08Z

@ogrisel, I would hold off on this PR until the CPython changes get accepted and merged. In particular, one should study the possible issues with memoryview lifetime.

pitrou · 2017-12-06T18:49:01Z

cloudpickle/cloudpickle.py

+        # - data is a temporary variable that has just been allocated from
+        #   reading from the pickle stream and will never be used outside of
+        #   the scope of this reducer.
+        # - bytes objects are not subject to interning.


Really?

>>> "x".encode() is b"x"
True
>>> "xy".encode() is b"xy"
False

Hum. I concluded this from the following:

>>> sys.intern(b'x')
Traceback (most recent call last):
  File "<ipython-input-5-02fdb271b026>", line 1, in <module>
    sys.intern(b'x')
TypeError: intern() argument 1 must be str, not bytes

Do you know the maximum length of interned bytes objects?

Apparently (from experiments), only instances of length zero or one are "interned", although I could not find any documentation stating that this is officially the case. I am having a hard time understanding the C code of the bytesobject.c file.
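
The experiment described above can be repeated with a short script. This is only a sketch: `shares_identity` is a hypothetical helper, and whatever it reports for short lengths is an implementation detail of the interpreter in use, not something the language guarantees. The objects are built at runtime so that the compiler cannot merge identical literals.

```python
def shares_identity(length):
    """Return True if two equal bytes objects built separately are one object."""
    first = ("x" * length).encode()
    second = ("x" * length).encode()
    return first is second


for length in (0, 1, 2, 64):
    print(length, shares_identity(length))
```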

In pandas we just check the refcount of the object. Regardless of how interning happens, this will be safe if the refcount is exactly 1 AND you steal the reference. To actually steal the reference we implement the buffer protocol in C; right now you can still access your newly mutable bytestring by digging through the chain of base objects on the memoryview and array. We just take a copy if the refcount is not 1. If the refcount is > 1 and the len is > 1, we raise a performance warning stating that something odd happened and a copy was done; however, that doesn't happen in practice. We don't warn for len <= 1 because a copy of one or two bytes is trivial.

In cloudpickle we have to stay 100% pure Python (hence this hack in ctypes). Furthermore we have to respect the pickle protocol and this function is called by the load function when hitting the REDUCE opcode. At this point the bytes buffer has already been read and is popped from the unpickler stack, but it's still in the lexical scope of the caller function (load_reduce IIRC).

However, this bytes object will never be used anywhere else, so I am still quite confident that this ctypes hack, when called in the context of the unpickler, is "safe".
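
For readers less familiar with the opcode mentioned above, the sketch below shows where REDUCE appears in a pickle stream. `Payload` is a hypothetical class introduced for illustration and `bytearray` stands in for the PR's reducer function: the unpickler pushes the callable and its argument tuple (holding the freshly read bytes object), and REDUCE pops both and performs the call.

```python
import pickle
import pickletools


class Payload:
    """Hypothetical wrapper whose pickle is a call to bytearray(data)."""

    def __init__(self, data):
        self.data = data

    def __reduce__(self):
        # (callable, args): both are pushed on the unpickler stack, then
        # REDUCE pops them and calls bytearray(data).
        return bytearray, (self.data,)


pickled = pickle.dumps(Payload(b"abc"), protocol=4)
opcodes = [op.name for op, arg, pos in pickletools.genops(pickled)]
restored = pickle.loads(pickled)

print(opcodes)
```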

@ogrisel the language doesn't guarantee anything here, so you are on your own if you try to mutate a bytes object without taking the kind of precaution @llllllllll talks about.

I am not saying that we need to use the C implementation shown in pandas, I am saying that we should copy the refcount checks that it does. We may also want to try harder to hide the mutable bytes object.

Thinking more about it, it might be possible to get the refcount down from 2 to 1 by nesting 2 reducers and doing the "mutate_move" in the wrapping reducer. I can also try to hide the _base reference with a property that raises or warns on getattr.

I have pushed a fix. It still lacks a more complete test harness though. Will work on that next week.

llllllllll · 2017-12-06T18:54:11Z

cloudpickle/cloudpickle.py

@@ -249,6 +252,93 @@ def _walk_global_ops(code):
                yield op, instr.arg


+def _memoryview_from_bytes(data, format, readonly):
+    if not readonly:


I have written something very similar for pandas: https://github.com/pandas-dev/pandas/blob/master/pandas/util/move.c#L160

The test suite exercises some interesting edge cases we may want to replicate: https://github.com/pandas-dev/pandas/blob/master/pandas/tests/util/test_util.py#L324

Thanks I will have a look.
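
For contrast with the no-copy helper in the hunk above, here is a copying stand-in with the same signature. It is a sketch of the safe fallback only, not the PR's ctypes-based implementation, and the name `memoryview_from_bytes_with_copy` is hypothetical. It never mutates `data`, at the cost of one copy when a writable view is requested.

```python
def memoryview_from_bytes_with_copy(data, format, readonly):
    """Copying stand-in for a no-copy helper: never mutates `data`."""
    if readonly:
        # bytes already exposes a read-only buffer, so no copy is needed.
        return memoryview(data).cast(format)
    # A writable view needs mutable storage; bytearray(data) copies once.
    return memoryview(bytearray(data)).cast(format)


source = b"\x00" * 8
read_only_view = memoryview_from_bytes_with_copy(source, "B", True)
writable_view = memoryview_from_bytes_with_copy(source, "B", False)
writable_view[0] = 1  # modifies the private copy, not `source`
```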

HyukjinKwon · 2017-12-07T07:42:31Z

Thanks @ogrisel. I have been trying to follow it. I will try to check and double check it bit by bit when I have some time as well, probably starting next week.

llllllllll · 2017-12-08T20:30:08Z

cloudpickle/cloudpickle.py

+    # the memory buffer of the bytes object as writeable buffer to back the
+    # memoryview: this buffer is no longer unreachable from anywhere else.
+    if hasattr(sys, 'getrefcount'):
+        safe_to_mutate = sys.getrefcount(data_holder[0]) <= 2


When making calls like this, it may be nice to call out what the references are:

The reference owned by data_holder

The reference pushed onto the data stack from the expression data_holder[0]

Many people don't consider the second reference, so they would be confused about why a count of 2 is safe.

This is what I explained in the above comment. Also, I reorganized the code to make it more readable and easier to test.
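
The two references discussed in this thread can be observed directly. The function below is a minimal sketch modelled on the check in the hunk, assuming CPython-style reference counting; the name `safe_to_mutate` and the choice to return `False` when `sys.getrefcount` is unavailable are assumptions made for this example.

```python
import sys


def safe_to_mutate(data_holder):
    """Return True if data_holder[0] has no reference other than the holder."""
    if not hasattr(sys, "getrefcount"):
        # No reference counts to inspect (other interpreters): assume unsafe.
        return False
    # Expected count of 2 on CPython:
    #   1. the reference owned by data_holder,
    #   2. the temporary reference created by evaluating data_holder[0]
    #      and passing it to getrefcount.
    return sys.getrefcount(data_holder[0]) <= 2


holder = [bytes(bytearray(1000))]
print(safe_to_mutate(holder))

alias = holder[0]  # a third reference now exists
print(safe_to_mutate(holder))
```

Passing a one-element list instead of the object itself keeps the function's own parameter from adding yet another reference to the count.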

ogrisel · 2017-12-12T20:51:20Z

@llllllllll thanks for the feedback. I think I have addressed the safety concerns in the latest batch of commits. This ctypes-based implementation should mirror all the checks done in C in pandas.

llllllllll · 2017-12-12T20:59:19Z

This is really impressive work!

ogrisel · 2018-06-04T13:01:56Z

This PR was a fun learning experience but it's definitely a hack. I think it's better to wait for PEP 574 to be implemented in upstream Python and adopted by numpy (see: numpy/numpy#11161).
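
For reference, PEP 574 was later implemented as pickle protocol 5 in Python 3.8. The sketch below shows its out-of-band buffer mechanism using only the standard library; it illustrates the general API and makes no claim about how numpy or cloudpickle use it.

```python
import pickle

data = bytearray(b"x" * 100_000)
buffers = []

# With buffer_callback set, the PickleBuffer is handed to the callback
# instead of being copied into the pickle stream.
payload = pickle.dumps(
    pickle.PickleBuffer(data), protocol=5, buffer_callback=buffers.append
)

# The same buffers must be supplied again, in order, when loading.
restored = pickle.loads(payload, buffers=buffers)

print(len(payload), len(buffers))
```

The pickle stream itself stays tiny because the 100 kB payload travels separately, which is what lets a transport layer send or spill it without an intermediate copy.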

ogrisel added 3 commits December 4, 2017 14:32

TST non-regression test for no-copy dump of bytes

d02f066

Backport nocopy dump for large bytes object from bpo-31993

516e464

Nocopy support for contiguous memoryviews

63b9b79

ogrisel added this to the 0.6 milestone Dec 4, 2017

ogrisel added 11 commits December 4, 2017 18:57

from struct import pack

3e6f98e

WIP reduce memoryviews

558c33a

WIP add test for memoryview of integer array

05a6320

Workaround bug in Python 3.4 and partial impl of memview in PyPy

f7341ac

Better comments and tests

ed2b2af

test_nocopy_readonly_bytes works on Py34

5cfe1e3

More precise memory measurements and more tolerant check

070d2bf

Fix segfault by missing reference + wrong nbytes measurement

5437bf5

tighter memory tests

224a548

Try to make memory checks more deterministic on travis

ec85e75

nocopy numpy arrays

5185527

ogrisel mentioned this pull request Dec 6, 2017

Pandas serialization dask/distributed#931

Closed

pitrou reviewed Dec 6, 2017

llllllllll reviewed Dec 6, 2017

ogrisel added 5 commits December 7, 2017 22:51

Preserve shape at the memoryview level when possible

d0ecba6

Fix support for transposed numpy arrays: nocopy pandas and scipy!

4a46a98

skip broken test with old PyPy and make test_nocopy_pydata faster

52309ae

do not build pandas from source on travis for Python 3.4

5c49c51

Use reference counting to ensure safe mutability of bytes buffer

fca7ec7

llllllllll reviewed Dec 8, 2017

ogrisel added 2 commits December 11, 2017 11:26

Safely mutate non-interned Python 2 strings

f3a4560

Add missing test file + protect safety of bytes singleton

dc3a861

ogrisel added 8 commits December 11, 2017 15:35

Extract _safe_to_mutate to improve testability

f24fb62

wording

02f21f5

Small reorg, better wording.

401fb2d

cosmit

4d198fb

Better comment

80867ed

Fixed wrong assertions in tests for PyPy

82e508a

Single element bytes mutation is actually safe

78cb82f

Improve tests

53ef282

ogrisel added 2 commits December 12, 2017 22:52

Cosmetics.

137dd34

Nocopy semantics for numpy arrays for Python 3 enabled by default

0dbff61

ogrisel mentioned this pull request Dec 18, 2017

WIP Experimental integration of nocopy cloudpickle with bytelist API dask/distributed#1643

Closed

ogrisel closed this Jun 4, 2018
