8000 Use new buffer interface to unpack by jfolz · Pull Request #195 · msgpack/msgpack-python · GitHub
[go: up one dir, main page]

Skip to content

Use new buffer interface to unpack #195

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jun 13, 2016
Merged

Use new buffer interface to unpack #195

merged 1 commit into from
Jun 13, 2016

Conversation

jfolz
Copy link
Contributor
@jfolz jfolz commented May 12, 2016

This PR adds support for unpacking/feeding from any object that supports the new buffer interface.

For compatibility, an attempt is made to use the old buffer interface if that fails. On success, a RuntimeWarning is issued to inform users of possible errors and future removal of this feature.

@jfolz
Copy link
Contributor Author
jfolz commented May 12, 2016

Other than possibly allowing ExtType.data to be memoryview, this should finish memoryview support.

@methane
Copy link
Member
methane commented May 13, 2016

Default to the new buffer interface methods to access data in Cython unpackb

Why?
I think old buffer API can be used for objects supporting new buffer protocol. Am I wrong?
Is there sample code shows some object which current Cython unpackb can't accept?

@jfolz
Copy link
Contributor Author
jfolz commented May 13, 2016

For 3.x, that is correct, as described here: https://docs.python.org/3/c-api/objbuffer.html

As always 2.7 is the culprit. Since both protocols exist there, objects have to support them separately. Ironically, memoryview is one of the objects that only supports the new protocol. It's simple to test:

>>> buffer(memoryview(b'abc'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: buffer object expected

Besides technical reasons, since these functions are deprecated since 3.0, we should IMO refrain from using them if possible. The only relevant builtin object that I know does not support the new protocol is array.array.

@methane
Copy link
Member
methane commented May 13, 2016

@jfolz Thanks. How do you think dropping old buffer protocol?
I want to keep code simple as possible, especially Cython codes.

@methane
Copy link
Member
methane commented May 13, 2016
  • msgpack 0.5 --- Add memoryview support and deprecate old buffer protocol (warn if possible).
  • msgpack 1.0 --- Remove old buffer protocol support.

@jfolz
Copy link
Contributor Author
jfolz commented May 13, 2016

@methane Simple code is good code :)

To sum it up:

  • Only unpackb uses the old protocol
  • All types support the new protocol in 3.x
  • The only (semi-)relevant builtin type that doesn't support the new protocol in 2.7 is array.array

Very few people will ever see a warning, but backward compatibility is kept in 0.5. Even less people will have to change their custom types that export the old protocol for 1.0.

@@ -145,6 +147,9 @@ def unpackb(object packed, object object_hook=None, object list_hook=None,
else:
raise
else:
warnings.warn("unpacking %s requires old buffer protocol,"
Copy link
Contributor Author
@jfolz jfolz May 13, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this warning be issued after the attempt to get the buffer? If the object does not support any protocol (old or new), it will still result in a TypeError, but the message may be a bit confusing.

@jfolz jfolz force-pushed the master branch 2 times, most recently from a47f0f7 to 4b8bd7f Compare May 14, 2016 17:59
@@ -259,7 +259,8 @@ def feed(self, next_bytes):
DeprecationWarning)
else:
raise
assert view.itemsize == 1
if view.itemsize != 1:
raise ValueError("cannot unpack from multi-byte object")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided to raise a proper exception with explanation here. Multi-byte types are very problematic for fallback code, since there is not way to "cast" memoryview to a different type.

Usually I'm all for developer responsibility, but honestly, if you're trying to unpack floats, I'm pretty sure you're doing it wrong.

@jfolz
Copy link
Contributor Author
jfolz commented May 15, 2016

These last few commits improve the consistency between Cython and fallback.

Something that's still inconsistent is the behavior of Unpacker.feed:

Fallback attempts to use the old buffer protocol if possible/necessary and does not accept multi-byte types. Cython code only uses the new protocol and accepts multi-byte types. Should Cython be aligned with fallback here?

else:
PyObject_AsReadBuffer(packed, <const void**>&buf, &buf_len)
PyErr_WarnEx(DeprecationWarning,
"unpacking %s requires old buffer protocol,"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add a blank at the end of first string.

@jfolz
Copy link
Contributor Author
jfolz commented May 17, 2016

Last commit moves buffer access in Cython to a separate function to simplify code as well as guarantee unpack/unpackb/feed react identical to data.

@@ -20,6 +19,7 @@ from cpython.exc cimport PyErr_WarnEx
cdef extern from "Python.h":
ctypedef struct PyObject
cdef int PyObject_AsReadBuffer(object o, const void** buff, Py_ssize_t* buf_len) except -1
cdef int PyObject_GetBuffer(object o, Py_buffer *view, int flags) except -1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you move this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exceptions were ignored (e.g., if array is not C-contiguous), which led to access violations.
Should I move related functions as well?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exceptions were ignored

Why?

https://github.com/cython/cython/blob/970c2fc0e676ca22016e14147ada0edba937dc6b/Cython/Includes/cpython/buffer.pxd#L31
Cython's declaration has except -1. I don't know why exception were ignored.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. I reverted this change and it had no effect.
At first I could not reproduce the problem, but I think the control flow was wrong, because Cython does not automatically exit a cdef function when exceptions occur. Checking for errors and explicitly re-raising does the job.

@jfolz jfolz force-pushed the master branch 2 times, most recently from b0fdcec to 4d40e65 Compare May 18, 2016 10:37
@jfolz
Copy link
Contributor Author
jfolz commented May 18, 2016

These commits should make memory handling more solid.

The following kinds of data are rejected:

  • Multi-byte types
  • Multidimensional data
  • Sliced data

What I'm not satisfied with:

  • A test for contiguity rejection (commented out) does not run on 2.7, because memoryview slicing with step is not implemented

Overall, a lot of added/moved code. This was a lot less finished than I originally thought, really sorry about that.

@methane
Copy link
Member
methane commented May 18, 2016

Thanks for your job. I'll review in this week hopefully.

But my first impression is... I forget why we need such a big code.
Basically, accepting binary data is easy. Pass y* to PyArg_ParseTuple.
https://docs.python.org/3/c-api/arg.html

I'm afraid about we're over engineering.

@methane
Copy link
Member
methane commented May 18, 2016

FYI, This is the code in getargs.c in cpython:

static int
getbuffer(PyObject *arg, Py_buffer *view, const char **errmsg)
{
    if (PyObject_GetBuffer(arg, view, PyBUF_SIMPLE) != 0) {
        *errmsg = "bytes-like object";
        return -1;
    }
    if (!PyBuffer_IsContiguous(view, 'C')) {
        PyBuffer_Release(view);
        *errmsg = "contiguous buffer";
        return -1;
    }
    return 0;
}

@jfolz
Copy link
Contributor Author
628C
jfolz commented May 19, 2016

I agree that the buffer code is more complex than it could be.
However, other than the contiguity and itemsize checks I don't see any opportunities to make it any simpler. Half of the code is need for the legacy protocol to prevent regressions.

On contiguity checks: Sadly there is no simple way to check for it in 2.7/PyPy. I could add a pure Python port of the C code that does these checks, though that would be a lot of code:

https://github.com/python/cpython/blob/master/Objects/abstract.c

@methane
Copy link
Member
methane commented May 19, 2016

I think there are no reasy about rejecting data which file.write() accepts.
Isn't PyBuffer_IsContiguous() enough?

@jfolz
Copy link
Contributor Author
jfolz commented May 19, 2016

Technically it is enough, but my main goal with this was to keep consistency between different runtimes.

Idea: Fallback should be fine with sliced data. For Cython where we need contiguous data, we can try to get it using PyBuffer_FromContiguous.

@methane
Copy link
Member
methane commented May 19, 2016

One important purpose of fallback module is supporting PyPy.
Have you benchmarked memoryview based unpacker on PyPy?

self._fb_buf_n += len(next_bytes)
self._fb_buffers.append(next_bytes)
self._fb_buf_n += L
self._fb_buffers.append(view)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do copy here.
Supporting memoryview makes it possible to avoid memory copy on user side, not here.

>>> import io
>>> f = io.BytesIO(b"foobar")
>>> buf = bytearray(4)
>>> n = f.readinto(buf)
>>>  n
4
>>> buf
bytearray(b'foob')
>>> a = [memoryview(buf)[:n]]
>>> n = f.readinto(buf)
>>> n
2
>>> buf
bytearray(b'arob')
>>> a += [memoryview(buf)[:n]]
>>> a
[<memory at 0x10ee60288>, <memory at 0x10ee60348>]
>>> bytes(a[0])
b'arob'   # Overwritten!!!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More realistic sample code of .feed() usage with memoryview.

unpacker = msgpack.Unpacker()
buf = bytearray(16*1024)
while True:
    n = f.readinto(buf)
    if n is None:
        break
    unpacker.feed(memoryview(buf)[:n])   # more efficient than buf[:n]
    for obj in unpacker:
        callback(obj)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we will be copying anyway it would make sense to use a single, larger buffer. I experimented in that direction, but it was quite complex.

@jfolz
Copy link
Contributor Author
jfolz commented May 19, 2016

Good point. We should suspend this PR until I can do that.

@jfolz
Copy link
Contributor Author
jfolz commented May 19, 2016

Using PyPy3 2.4.0.

PyPI (v0.4.7)

unpacking integers (fallback)    0.132916, 0.069149, 0.068334, 0.067976
unpacking bytes (fallback)       0.102578, 0.073634, 0.073475, 0.073621
unpacking lists (fallback)       0.307713, 0.297894, 0.295595, 0.296820
unpacking dicts (fallback)       0.280711, 0.263227, 0.274489, 0.262999

This PR (+copy in feed)

unpacking integers (fallback)    0.162345, 0.093845, 0.092142, 0.090567
unpacking bytes (fallback)       0.118643, 0.088598, 0.088944, 0.105694
unpacking lists (fallback)       0.342774, 0.324450, 0.345009, 0.324881
unpacking dicts (fallback)       0.351656, 0.316005, 0.326773, 0.315644

This PR (revert to bytes as buffers)

unpacking integers (fallback)    0.128763, 0.063703, 0.061099, 0.061204
unpacking bytes (fallback)       0.104795, 0.076680, 0.076643, 0.076292
unpacking lists (fallback)       0.313370, 0.303259, 0.302231, 0.300040
unpacking dicts (fallback)       0.287460, 0.266801, 0.277002, 0.266382

Seems like PyPy is well-optimized for bytes, not so much for memoryviews.

@methane
Copy link
Member
methane commented May 19, 2016

@jfolz See #196. I haven't benchmarked it yet.
But I think we should rewrite buffer implementation before implement accepting memoryview.

PyObject_GetBuffer(contiguous, &tmp, PyBUF_SIMPLE)
PyBuffer_Release(view)
PyBuffer_FillInfo(view, contiguous, tmp.buf, tmp.len, 1, 0)
Py_DECREF(contiguous)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What this does:

  • PyMemoryView_GetContiguous creates a memoryview object to a contiguous copy of the non-contiguous memory
  • Get a Py_buffer for this contiguous memory and transfer the information to the target buffer
  • Release resources as necessary

@jfolz jfolz force-pushed the master branch 2 times, most recently from e573df7 to 31f76fb Compare May 22, 2016 10:14
@jfolz jfolz changed the title Default to new buffer interface for unpacking Use new buffer interface to unpack May 22, 2016
@methane
Copy link
Member
methane commented May 28, 2016

Maybe related issue: http://bugs.python.org/issue20699

@jfolz
Copy link
Contributor Author
jfolz commented May 29, 2016

2.7 is a mess with regards to buffer handling. Working with this stuff makes me want to move to 3.5.
Though this code using the old interface as required should deal with this.

@jfolz
Copy link
Contributor Author
jfolz commented May 29, 2016

I briefly tested Cython's typed memoryviews. Sadly, they require write access, so they don't work for most bytes-like objects.
I did however figure out how to simplify the contiguous object code and added some comments for clarity.

@jfolz
Copy link
Contributor Author
jfolz commented Jun 13, 2016

@methane Any updates on this?

@@ -150,6 +195,8 @@ def unpackb(object packed, object object_hook=None, object list_hook=None,
use_list, cenc, cerr,
max_str_len, max_bin_len, max_array_len, max_map_len, max_ext_len)
ret = unpack_construct(&ctx, buf, buf_len, &off)
if new_protocol:
PyBuffer_Release(&view);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try ~ finally is required, since unpack_construct, encoding.encode() and unicode_errors.encode() may raise Exception.

@methane
Copy link
Member
methane commented Jun 13, 2016

LGTM, except one comment.

@jfolz
Copy link
Contributor Author
jfolz commented Jun 13, 2016

I added try-finally and squashed.

Thanks for the help, looking forward to 0.5 :)

@methane methane merged commit 0ef5f4d into msgpack:master Jun 13, 2016
@methane
Copy link
Member
methane commented Jun 13, 2016

Thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
0