Add raw_as_bytes option to Unpacker. (#265) · guoyu07/msgpack-python@5534d0c

Commit 5534d0c

Add raw_as_bytes option to Unpacker. (msgpack#265)
1 parent 50ea49c commit 5534d0c

11 files changed: +199 -93 lines changed

Makefile

Lines changed: 2 additions & 1 deletion
@@ -8,7 +8,8 @@ cython:
 
 .PHONY: test
 test:
-	py.test -v test
+	pytest -v test
+	MSGPACK_PUREPYTHON=1 pytest -v test
 
 .PHONY: serve-doc
 serve-doc: all

README.rst

Lines changed: 56 additions & 22 deletions
@@ -10,8 +10,21 @@ MessagePack for Python
    :target: https://msgpack-python.readthedocs.io/en/latest/?badge=latest
    :alt: Documentation Status
 
-IMPORTANT: Upgrading from msgpack-0.4
---------------------------------------
+
+What's this
+-----------
+
+`MessagePack <https://msgpack.org/>`_ is an efficient binary serialization format.
+It lets you exchange data among multiple languages like JSON.
+But it's faster and smaller.
+This package provides CPython bindings for reading and writing MessagePack data.
+
+
+Very important notes for existing users
+---------------------------------------
+
+PyPI package name
+^^^^^^^^^^^^^^^^^
 
 TL;DR: When upgrading from msgpack-0.4 or earlier, don't do `pip install -U msgpack-python`.
 Do `pip uninstall msgpack-python; pip install msgpack` instead.
@@ -24,13 +37,37 @@ Sadly, this doesn't work for upgrade install. After `pip install -U msgpack-pyt
 msgpack is removed and `import msgpack` fail.
 
 
-What's this
------------
+Deprecating encoding option
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+encoding and unicode_errors options are deprecated.
+
+In case of packer, use UTF-8 always. Storing other than UTF-8 is not recommended.
+
+For backward compatibility, you can use ``use_bin_type=False`` and pack ``bytes``
+object into msgpack raw type.
+
+In case of unpacker, there is new ``raw_as_bytes`` option. It is ``True`` by default
+for backward compatibility, but it is changed to ``False`` in near future.
+You can use ``raw_as_bytes=False`` instead of ``encoding='utf-8'``.
+
+Planned backward incompatible changes
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+When msgpack 1.0, I planning these breaking changes:
+
+* packer and unpacker: Remove ``encoding`` and ``unicode_errors`` option.
+* packer: Change default of ``use_bin_type`` option from False to True.
+* unpacker: Change default of ``raw_as_bytes`` option from True to False.
+* unpacker: Reduce all ``max_xxx_len`` options for typical usage.
+* unpacker: Remove ``write_bytes`` option from all methods.
+
+To avoid these breaking changes breaks your application, please:
+
+* Don't use deprecated options.
+* Pass ``use_bin_type`` and ``raw_as_bytes`` options explicitly.
+* If your application handle large (>1MB) data, specify ``max_xxx_len`` options too.
 
-`MessagePack <https://msgpack.org/>`_ is an efficient binary serialization format.
-It lets you exchange data among multiple languages like JSON.
-But it's faster and smaller.
-This package provides CPython bindings for reading and writing MessagePack data.
 
 Install
 -------
@@ -76,14 +113,14 @@ msgpack provides ``dumps`` and ``loads`` as an alias for compatibility with
    >>> import msgpack
    >>> msgpack.packb([1, 2, 3], use_bin_type=True)
    '\x93\x01\x02\x03'
-   >>> msgpack.unpackb(_)
+   >>> msgpack.unpackb(_, raw_as_bytes=False)
    [1, 2, 3]
 
 ``unpack`` unpacks msgpack's array to Python's list, but can also unpack to tuple:
 
 .. code-block:: pycon
 
-   >>> msgpack.unpackb(b'\x93\x01\x02\x03', use_list=False)
+   >>> msgpack.unpackb(b'\x93\x01\x02\x03', use_list=False, raw_as_bytes=False)
    (1, 2, 3)
 
 You should always specify the ``use_list`` keyword argument for backward compatibility.
@@ -109,7 +146,7 @@ stream (or from bytes provided through its ``feed`` method).
 
    buf.seek(0)
 
-   unpacker = msgpack.Unpacker(buf)
+   unpacker = msgpack.Unpacker(buf, raw_as_bytes=False)
    for unpacked in unpacker:
        print(unpacked)
 
@@ -142,7 +179,7 @@ It is also possible to pack/unpack custom data types. Here is an example for
 
 
    packed_dict = msgpack.packb(useful_dict, default=encode_datetime, use_bin_type=True)
-   this_dict_again = msgpack.unpackb(packed_dict, object_hook=decode_datetime)
+   this_dict_again = msgpack.unpackb(packed_dict, object_hook=decode_datetime, raw_as_bytes=False)
 
 ``Unpacker``'s ``object_hook`` callback receives a dict; the
 ``object_pairs_hook`` callback may instead be used to receive a list of
@@ -172,7 +209,7 @@ It is also possible to pack/unpack custom data types using the **ext** type.
    ...
    >>> data = array.array('d', [1.2, 3.4])
    >>> packed = msgpack.packb(data, default=default, use_bin_type=True)
-   >>> unpacked = msgpack.unpackb(packed, ext_hook=ext_hook)
+   >>> unpacked = msgpack.unpackb(packed, ext_hook=ext_hook, raw_as_bytes=False)
    >>> data == unpacked
    True
 
@@ -217,14 +254,10 @@ Early versions of msgpack didn't distinguish string and binary types (like Pytho
 The type for representing both string and binary types was named **raw**.
 
 For backward compatibility reasons, msgpack-python will still default all
-strings to byte strings, unless you specify the `use_bin_type=True` option in
+strings to byte strings, unless you specify the ``use_bin_type=True`` option in
 the packer. If you do so, it will use a non-standard type called **bin** to
 serialize byte arrays, and **raw** becomes to mean **str**. If you want to
-distinguish **bin** and **raw** in the unpacker, specify `encoding='utf-8'`.
-
-**In future version, default value of ``use_bin_type`` will be changed to ``True``.
-To avoid this change will break your code, you must specify it explicitly
-even when you want to use old format.**
+distinguish **bin** and **raw** in the unpacker, specify ``raw_as_bytes=False``.
 
 Note that Python 2 defaults to byte-arrays over Unicode strings:
 
@@ -234,7 +267,7 @@ Note that Python 2 defaults to byte-arrays over Unicode strings:
    >>> msgpack.unpackb(msgpack.packb([b'spam', u'eggs']))
    ['spam', 'eggs']
    >>> msgpack.unpackb(msgpack.packb([b'spam', u'eggs'], use_bin_type=True),
-                       encoding='utf-8')
+                       raw_as_bytes=False)
    ['spam', u'eggs']
 
 This is the same code in Python 3 (same behaviour, but Python 3 has a
@@ -246,7 +279,7 @@ different default):
    >>> msgpack.unpackb(msgpack.packb([b'spam', u'eggs']))
    [b'spam', b'eggs']
    >>> msgpack.unpackb(msgpack.packb([b'spam', u'eggs'], use_bin_type=True),
-                       encoding='utf-8')
+                       raw_as_bytes=False)
    [b'spam', 'eggs']
 
 
@@ -277,6 +310,7 @@ You can use ``gc.disable()`` when unpacking large message.
 
 use_list option
 ^^^^^^^^^^^^^^^
+
 List is the default sequence type of Python.
 But tuple is lighter than list.
 You can use ``use_list=False`` while unpacking when performance is important.
@@ -295,7 +329,7 @@ Test
 MessagePack uses `pytest` for testing.
 Run test with following command:
 
-    $ pytest -v test
+    $ make test
 
 
 ..
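
Reviewer note: the README text in this diff describes ``raw_as_bytes`` as the switch that decides whether msgpack's raw/str payloads come back as ``bytes`` or as decoded text. The sketch below illustrates that behaviour for a single fixstr value only. It is not msgpack-python's decoder: ``unpack_fixstr`` is a hypothetical helper written for this note, and the option name is taken from this commit, so check the release you actually have installed before relying on it.

```python
def unpack_fixstr(data: bytes, raw_as_bytes: bool = True):
    """Decode one MessagePack fixstr value (hypothetical helper, not msgpack's API).

    A fixstr is a header byte 0b101xxxxx whose low 5 bits give the payload
    length, followed by that many payload bytes.
    """
    if not data:
        raise ValueError("empty input")
    header = data[0]
    if (header & 0xE0) != 0xA0:
        raise ValueError("not a fixstr header")
    length = header & 0x1F
    payload = data[1:1 + length]
    if len(payload) != length:
        raise ValueError("truncated payload")
    # raw_as_bytes=True mirrors the backward-compatible default described in
    # the README: hand the payload back untouched.
    if raw_as_bytes:
        return payload
    # raw_as_bytes=False treats the payload as UTF-8 text, which the README
    # presents as the replacement for encoding='utf-8'.
    return payload.decode("utf-8")


packed = b"\xa4eggs"  # fixstr header 0xa4 (length 4) followed by "eggs"
as_bytes = unpack_fixstr(packed)
as_text = unpack_fixstr(packed, raw_as_bytes=False)
```

Passing the flag explicitly, as the README recommends, keeps the result type stable even if the default flips in a later release.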

ci/runtests.bat

Lines changed: 3 additions & 1 deletion
@@ -3,5 +3,7 @@
 %PYTHON%\python.exe setup.py install
 %PYTHON%\python.exe -c "import sys; print(hex(sys.maxsize))"
 %PYTHON%\python.exe -c "from msgpack import _packer, _unpacker"
-%PYTHON%\python.exe -m pytest -v test
 %PYTHON%\python.exe setup.py bdist_wheel
+%PYTHON%\python.exe -m pytest -v test
+SET EL=%ERRORLEVEL%
+exit /b %EL%

msgpack/_packer.pyx

Lines changed: 12 additions & 6 deletions
@@ -2,7 +2,7 @@
 #cython: embedsignature=True
 
 from cpython cimport *
-#from cpython.exc cimport PyErr_WarnEx
+from cpython.exc cimport PyErr_WarnEx
 
 from msgpack.exceptions import PackValueError, PackOverflowError
 from msgpack import ExtType
@@ -39,7 +39,7 @@ cdef extern from "pack.h":
     int msgpack_pack_ext(msgpack_packer* pk, char typecode, size_t l)
 
 cdef int DEFAULT_RECURSE_LIMIT=511
-cdef size_t ITEM_LIMIT = (2**32)-1
+cdef long long ITEM_LIMIT = (2**32)-1
 
 
 cdef inline int PyBytesLike_Check(object o):
@@ -110,9 +110,13 @@ cdef class Packer(object):
         self.pk.buf_size = buf_size
         self.pk.length = 0
 
-    def __init__(self, default=None, encoding='utf-8', unicode_errors='strict',
+    def __init__(self, default=None, encoding=None, unicode_errors=None,
                  bint use_single_float=False, bint autoreset=True, bint use_bin_type=False,
                  bint strict_types=False):
+        if encoding is not None:
+            PyErr_WarnEx(PendingDeprecationWarning, "encoding is deprecated.", 1)
+        if unicode_errors is not None:
+            PyErr_WarnEx(PendingDeprecationWarning, "unicode_errors is deprecated.", 1)
         self.use_float = use_single_float
         self.strict_types = strict_types
         self.autoreset = autoreset
@@ -122,7 +126,7 @@ cdef class Packer(object):
                 raise TypeError("default must be a callable.")
         self._default = default
         if encoding is None:
-            self.encoding = NULL
+            self.encoding = 'utf_8'
             self.unicode_errors = NULL
         else:
             if isinstance(encoding, unicode):
@@ -134,7 +138,8 @@ cdef class Packer(object):
                 self._berrors = unicode_errors.encode('ascii')
             else:
                 self._berrors = unicode_errors
-            self.unicode_errors = PyBytes_AsString(self._berrors)
+            if self._berrors is not None:
+                self.unicode_errors = PyBytes_AsString(self._berrors)
 
     def __dealloc__(self):
         PyMem_Free(self.pk.buf)
@@ -149,7 +154,7 @@ cdef class Packer(object):
         cdef char* rawval
         cdef int ret
         cdef dict d
-        cdef size_t L
+        cdef Py_ssize_t L
         cdef int default_used = 0
         cdef bint strict_types = self.strict_types
         cdef Py_buffer view
@@ -203,6 +208,7 @@ cdef class Packer(object):
             elif PyUnicode_CheckExact(o) if strict_types else PyUnicode_Check(o):
                 if not self.encoding:
                     raise TypeError("Can't encode unicode string: no encoding is specified")
+                #TODO: Use faster API for UTF-8
                 o = PyUnicode_AsEncodedString(o, self.encoding, self.unicode_errors)
                 L = len(o)
                 if L > ITEM_LIMIT:
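
Reviewer note: the ``Packer.__init__`` change above switches both ``encoding`` and ``unicode_errors`` to a ``None`` default, warns with ``PendingDeprecationWarning`` when either is passed, and falls back to UTF-8 otherwise. A plain-Python sketch of that pattern follows. ``SketchPacker`` and ``encode_text`` are hypothetical names used only here, and the fallback to ``'strict'`` error handling is a simplification of the Cython code, which leaves the C-level pointer as ``NULL``.

```python
import warnings


class SketchPacker:
    """Illustrates the deprecation pattern from the diff (not msgpack's Packer)."""

    def __init__(self, encoding=None, unicode_errors=None):
        # None means "caller did not pass the option", so only explicit use
        # of a deprecated keyword triggers a warning.
        if encoding is not None:
            warnings.warn("encoding is deprecated.",
                          PendingDeprecationWarning, stacklevel=2)
        if unicode_errors is not None:
            warnings.warn("unicode_errors is deprecated.",
                          PendingDeprecationWarning, stacklevel=2)
        # Fall back to UTF-8 when no encoding is given, as the diff does.
        self.encoding = "utf_8" if encoding is None else encoding
        # Assumption: 'strict' stands in for the NULL errors pointer.
        self.unicode_errors = "strict" if unicode_errors is None else unicode_errors

    def encode_text(self, text):
        # Hypothetical helper: encode a str the way the packer would before
        # writing it as a raw/str payload.
        return text.encode(self.encoding, self.unicode_errors)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    default_packer = SketchPacker()                   # no warning expected
    legacy_packer = SketchPacker(encoding="latin-1")  # one warning expected

warning_categories = [w.category for w in caught]
```

``PendingDeprecationWarning`` is ignored by Python's default warning filters, so existing callers typically see nothing unless they opt in (for example with ``-W always``).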
