Use new buffer interface to unpack #195

jfolz · 2016-05-12T18:43:41Z

This PR adds support for unpacking/feeding from any object that supports the new buffer interface.

For compatibility, an attempt is made to use the old buffer interface if that fails. On success, a RuntimeWarning is issued to inform users of possible errors and future removal of this feature.

jfolz · 2016-05-12T18:59:25Z

Other than possibly allowing ExtType.data to be memoryview, this should finish memoryview support.

methane · 2016-05-13T05:01:43Z

Default to the new buffer interface methods to access data in Cython unpackb

Why?
I think old buffer API can be used for objects supporting new buffer protocol. Am I wrong?
Is there sample code shows some object which current Cython unpackb can't accept?

jfolz · 2016-05-13T07:19:18Z

For 3.x, that is correct, as described here: https://docs.python.org/3/c-api/objbuffer.html

As always 2.7 is the culprit. Since both protocols exist there, objects have to support them separately. Ironically, memoryview is one of the objects that only supports the new protocol. It's simple to test:

>>> buffer(memoryview(b'abc'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: buffer object expected

Besides technical reasons, since these functions are deprecated since 3.0, we should IMO refrain from using them if possible. The only relevant builtin object that I know does not support the new protocol is array.array.

methane · 2016-05-13T08:47:55Z

@jfolz Thanks. How do you think dropping old buffer protocol?
I want to keep code simple as possible, especially Cython codes.

methane · 2016-05-13T08:49:31Z

msgpack 0.5 --- Add memoryview support and deprecate old buffer protocol (warn if possible).
msgpack 1.0 --- Remove old buffer protocol support.

jfolz · 2016-05-13T16:05:47Z

@methane Simple code is good code :)

To sum it up:

Only unpackb uses the old protocol
All types support the new protocol in 3.x
The only (semi-)relevant builtin type that doesn't support the new protocol in 2.7 is array.array

Very few people will ever see a warning, but backward compatibility is kept in 0.5. Even less people will have to change their custom types that export the old protocol for 1.0.

jfolz · 2016-05-13T16:08:50Z

msgpack/_unpacker.pyx

@@ -145,6 +147,9 @@ def unpackb(object packed, object object_hook=None, object list_hook=None,
        else:
            raise
    else:
+        warnings.warn("unpacking %s requires old buffer protocol,"


Should this warning be issued after the attempt to get the buffer? If the object does not support any protocol (old or new), it will still result in a TypeError, but the message may be a bit confusing.

jfolz · 2016-05-15T08:53:39Z

msgpack/fallback.py

@@ -259,7 +259,8 @@ def feed(self, next_bytes):
                              DeprecationWarning)
            else:
                raise
-        assert view.itemsize == 1
+        if view.itemsize != 1:
+            raise ValueError("cannot unpack from multi-byte object")


I decided to raise a proper exception with explanation here. Multi-byte types are very problematic for fallback code, since there is not way to "cast" memoryview to a different type.

Usually I'm all for developer responsibility, but honestly, if you're trying to unpack floats, I'm pretty sure you're doing it wrong.

jfolz · 2016-05-15T09:51:36Z

These last few commits improve the consistency between Cython and fallback.

Something that's still inconsistent is the behavior of Unpacker.feed:

Fallback attempts to use the old buffer protocol if possible/necessary and does not accept multi-byte types. Cython code only uses the new protocol and accepts multi-byte types. Should Cython be aligned with fallback here?

ThomasWaldmann · 2016-05-15T11:48:42Z

msgpack/_unpacker.pyx

+    else:
+        PyObject_AsReadBuffer(packed, <const void**>&buf, &buf_len)
+        PyErr_WarnEx(DeprecationWarning,
+                     "unpacking %s requires old buffer protocol,"


maybe add a blank at the end of first string.

jfolz · 2016-05-17T09:04:59Z

Last commit moves buffer access in Cython to a separate function to simplify code as well as guarantee unpack/unpackb/feed react identical to data.

methane · 2016-05-17T09:15:25Z

msgpack/_unpacker.pyx

@@ -20,6 +19,7 @@ from cpython.exc cimport PyErr_WarnEx
 cdef extern from "Python.h":
    ctypedef struct PyObject
    cdef int PyObject_AsReadBuffer(object o, const void** buff, Py_ssize_t* buf_len) except -1
+    cdef int PyObject_GetBuffer(object o, Py_buffer *view, int flags) except -1


Why do you move this?

Exceptions were ignored (e.g., if array is not C-contiguous), which led to access violations.
Should I move related functions as well?

Exceptions were ignored

Why?

https://github.com/cython/cython/blob/970c2fc0e676ca22016e14147ada0edba937dc6b/Cython/Includes/cpython/buffer.pxd#L31
Cython's declaration has except -1. I don't know why exception were ignored.

You are right. I reverted this change and it had no effect.
At first I could not reproduce the problem, but I think the control flow was wrong, because Cython does not automatically exit a cdef function when exceptions occur. Checking for errors and explicitly re-raising does the job.

jfolz · 2016-05-18T11:22:42Z

These commits should make memory handling more solid.

The following kinds of data are rejected:

Multi-byte types
Multidimensional data
Sliced data

What I'm not satisfied with:

A test for contiguity rejection (commented out) does not run on 2.7, because memoryview slicing with step is not implemented

Overall, a lot of added/moved code. This was a lot less finished than I originally thought, really sorry about that.

methane · 2016-05-18T19:13:10Z

Thanks for your job. I'll review in this week hopefully.

But my first impression is... I forget why we need such a big code.
Basically, accepting binary data is easy. Pass y* to PyArg_ParseTuple.
https://docs.python.org/3/c-api/arg.html

I'm afraid about we're over engineering.

methane · 2016-05-18T19:22:53Z

FYI, This is the code in getargs.c in cpython:

static int
getbuffer(PyObject *arg, Py_buffer *view, const char **errmsg)
{
    if (PyObject_GetBuffer(arg, view, PyBUF_SIMPLE) != 0) {
        *errmsg = "bytes-like object";
        return -1;
    }
    if (!PyBuffer_IsContiguous(view, 'C')) {
        PyBuffer_Release(view);
        *errmsg = "contiguous buffer";
        return -1;
    }
    return 0;
}

jfolz · 2016-05-19T07:32:48Z

I agree that the buffer code is more complex than it could be.
However, other than the contiguity and itemsize checks I don't see any opportunities to make it any simpler. Half of the code is need for the legacy protocol to prevent regressions.

On contiguity checks: Sadly there is no simple way to check for it in 2.7/PyPy. I could add a pure Python port of the C code that does these checks, though that would be a lot of code:

https://github.com/python/cpython/blob/master/Objects/abstract.c

methane · 2016-05-19T07:40:49Z

I think there are no reasy about rejecting data which file.write() accepts.
Isn't PyBuffer_IsContiguous() enough?

jfolz · 2016-05-19T07:54:18Z

Technically it is enough, but my main goal with this was to keep consistency between different runtimes.

Idea: Fallback should be fine with sliced data. For Cython where we need contiguous data, we can try to get it using PyBuffer_FromContiguous.

methane · 2016-05-19T09:38:48Z

One important purpose of fallback module is supporting PyPy.
Have you benchmarked memoryview based unpacker on PyPy?

methane · 2016-05-19T09:49:47Z

msgpack/fallback.py

-        self._fb_buf_n += len(next_bytes)
-        self._fb_buffers.append(next_bytes)
+        self._fb_buf_n += L
+        self._fb_buffers.append(view)


Do copy here.
Supporting memoryview makes it possible to avoid memory copy on user side, not here.

>>> import io >>> f = io.BytesIO(b"foobar") >>> buf = bytearray(4) >>> n = f.readinto(buf) >>> n 4 >>> buf bytearray(b'foob') >>> a = [memoryview(buf)[:n]] >>> n = f.readinto(buf) >>> n 2 >>> buf bytearray(b'arob') >>> a += [memoryview(buf)[:n]] >>> a [<memory at 0x10ee60288>, <memory at 0x10ee60348>] >>> bytes(a[0]) b'arob' # Overwritten!!!

More realistic sample code of .feed() usage with memoryview.

unpacker = msgpack.Unpacker() buf = bytearray(16*1024) while True: n = f.readinto(buf) if n is None: break unpacker.feed(memoryview(buf)[:n]) # more efficient than buf[:n] for obj in unpacker: callback(obj)

Since we will be copying anyway it would make sense to use a single, larger buffer. I experimented in that direction, but it was quite complex.

jfolz · 2016-05-19T10:27:21Z

Good point. We should suspend this PR until I can do that.

jfolz · 2016-05-19T11:13:55Z

Using PyPy3 2.4.0.

PyPI (v0.4.7)

unpacking integers (fallback)    0.132916, 0.069149, 0.068334, 0.067976
unpacking bytes (fallback)       0.102578, 0.073634, 0.073475, 0.073621
unpacking lists (fallback)       0.307713, 0.297894, 0.295595, 0.296820
unpacking dicts (fallback)       0.280711, 0.263227, 0.274489, 0.262999

This PR (+copy in feed)

unpacking integers (fallback)    0.162345, 0.093845, 0.092142, 0.090567
unpacking bytes (fallback)       0.118643, 0.088598, 0.088944, 0.105694
unpacking lists (fallback)       0.342774, 0.324450, 0.345009, 0.324881
unpacking dicts (fallback)       0.351656, 0.316005, 0.326773, 0.315644

This PR (revert to bytes as buffers)

unpacking integers (fallback)    0.128763, 0.063703, 0.061099, 0.061204
unpacking bytes (fallback)       0.104795, 0.076680, 0.076643, 0.076292
unpacking lists (fallback)       0.313370, 0.303259, 0.302231, 0.300040
unpacking dicts (fallback)       0.287460, 0.266801, 0.277002, 0.266382

Seems like PyPy is well-optimized for bytes, not so much for memoryviews.

methane · 2016-05-19T13:38:29Z

@jfolz See #196. I haven't benchmarked it yet.
But I think we should rewrite buffer implementation before implement accepting memoryview.

jfolz · 2016-05-19T13:47:11Z

msgpack/_unpacker.pyx

+            PyObject_GetBuffer(contiguous, &tmp, PyBUF_SIMPLE)
+            PyBuffer_Release(view)
+            PyBuffer_FillInfo(view, contiguous, tmp.buf, tmp.len, 1, 0)
+            Py_DECREF(contiguous)


What this does:

PyMemoryView_GetContiguous creates a memoryview object to a contiguous copy of the non-contiguous memory

Get a Py_buffer for this contiguous memory and transfer the information to the target buffer

Release resources as necessary

methane · 2016-05-28T09:27:59Z

Maybe related issue: http://bugs.python.org/issue20699

jfolz · 2016-05-29T08:54:03Z

2.7 is a mess with regards to buffer handling. Working with this stuff makes me want to move to 3.5.
Though this code using the old interface as required should deal with this.

jfolz · 2016-05-29T10:00:43Z

I briefly tested Cython's typed memoryviews. Sadly, they require write access, so they don't work for most bytes-like objects.
I did however figure out how to simplify the contiguous object code and added some comments for clarity.

jfolz · 2016-06-13T13:20:26Z

@methane Any updates on this?

methane · 2016-06-13T13:26:12Z

msgpack/_unpacker.pyx

@@ -150,6 +195,8 @@ def unpackb(object packed, object object_hook=None, object list_hook=None,
             use_list, cenc, cerr,
             max_str_len, max_bin_len, max_array_len, max_map_len, max_ext_len)
    ret = unpack_construct(&ctx, buf, buf_len, &off)
+    if new_protocol:
+        PyBuffer_Release(&view);


try ~ finally is required, since unpack_construct, encoding.encode() and unicode_errors.encode() may raise Exception.

methane · 2016-06-13T13:27:49Z

LGTM, except one comment.

jfolz · 2016-06-13T14:20:35Z

I added try-finally and squashed.

Thanks for the help, looking forward to 0.5 :)

methane · 2016-06-13T17:29:26Z

Thanks

jfolz reviewed May 13, 2016
View reviewed changes

jfolz force-pushed the master branch 2 times, most recently from a47f0f7 to 4b8bd7f Compare May 14, 2016 17:59

jfolz reviewed May 15, 2016
View reviewed changes

jfolz mentioned this pull request May 15, 2016

memoryview objects are not fully supported #126

Closed

jfolz force-pushed the master branch from 705b5bc to 249e051 Compare May 15, 2016 09:58

ThomasWaldmann reviewed May 15, 2016
View reviewed changes

jfolz force-pushed the master branch from 57de67b to dd98f3e Compare May 17, 2016 08:35

methane reviewed May 17, 2016
View reviewed changes

jfolz force-pushed the master branch 2 times, most recently from b0fdcec to 4d40e65 Compare May 18, 2016 10:37

methane reviewed May 19, 2016
View reviewed changes

jfolz reviewed May 19, 2016
View reviewed changes

jfolz force-pushed the master branch 2 times, most recently from e573df7 to 31f76fb Compare May 22, 2016 10:14

jfolz changed the title ~~Default to new buffer interface for unpacking~~ Use new buffer interface to unpack May 22, 2016

jfolz force-pushed the master branch from 31f76fb to 12ed5d3 Compare May 25, 2016 10:05

methane reviewed Jun 13, 2016
View reviewed changes

enable unpacking from memoryview

2b63e9f

jfolz force-pushed the master branch from f233b68 to 2b63e9f Compare June 13, 2016 13:37

methane merged commit 0ef5f4d into msgpack:master Jun 13, 2016

wushuyi mentioned this pull request Oct 12, 2016

TypeError: ord() expected string of length 1 on clear windows install 0rpc/zerorpc-python#151

Closed

Use new buffer interface to unpack #195

Use new buffer interface to unpack #195

Uh oh!

Conversation

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!