ENH: Add encoding option to numpy text IO. #10054
Conversation
OK, should be ready to go.
@@ -268,6 +282,10 @@ The new ``chebinterpolate`` function interpolates a given function at the
 Chebyshev points of the first kind. A new ``Chebyshev.interpolate`` class
 method adds support for interpolation over arbitrary intervals using the scaled
 and shifted Chebyshev points of the first kind.
+Support for reading lzma compressed text files in Python 3
nit: line break here and below
Yeah, I decided to leave off the rest of the release notes editing to just before the branch.
I think all the formatting changes could do with collecting together coherently too, when we get around to that.
numpy/lib/_iotools.py (Outdated)
""" decode bytes from binary input streams, default to latin1 """ | ||
if type(line) is bytes: | ||
if encoding is None: | ||
line = line.decode('latin1') |
It seems to me that this type of encoding-guessing goes against the python3 unicode model, which requires this kind of thing to be explicit. Can we perhaps add a warning here in python 3?
This is trying to preserve compatibility with previous behavior and is only called in three functions, two of which pass 'bytes' as the default. I went ahead and documented the default values in those places. I think we can revisit this after Python 2 is dropped. There has also been a long time desire to replace all these functions, perhaps by bringing over the Pandas functionality, so this may all become moot in the long term.
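For reference, a minimal sketch of the fallback pattern under discussion (a simplified rendering of the helper in `numpy/lib/_iotools.py`, not its exact source):

```python
def _decode_line(line, encoding=None):
    """Decode bytes from binary input streams, defaulting to latin1.

    latin1 maps every byte 0x00-0xFF to a code point, so decoding
    can never fail and re-encoding round-trips the original bytes.
    """
    if type(line) is bytes:
        if encoding is None:
            encoding = 'latin1'
        line = line.decode(encoding)
    return line
```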
numpy/lib/tests/test__iotools.py (Outdated)
-    return date(*time.strptime(s, "%Y-%m-%d")[:3])
+    if type(s) == bytes:
+        s = s.decode("latin1")
+    return date(*time.strptime(s, "%Y-%m-%d")[:3])
Shouldn't the contract be that `s` is now always `unicode`, and so the decode is unconditional?
`_bytes_to_date` is a helper function called by several other tests, so it probably needs to do the check. Mind that we are also trying to make things look as much as possible like they did before, so that check can be looked at as a compatibility check. In particular, I believe that `encoding='bytes'` means 'latin1' for this PR.
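A quick illustration of why latin1 can stand in for 'bytes': it is the one common codec that round-trips every possible byte value, so no input can make the decode step fail:

```python
raw = bytes(range(256))               # every possible byte value
text = raw.decode('latin1')           # never raises
assert text.encode('latin1') == raw   # lossless round-trip
```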
@@ -294,7 +296,7 @@ def load(file, mmap_mode=None, allow_pickle=True, fix_imports=True,
         used in Python 3.
     encoding : str, optional
         What encoding to use when reading Python 2 strings. Only useful when
-        loading Python 2 generated pickled files on Python 3, which includes
+        loading Python 2 generated pickled files in Python 3, which includes
         npy/npz files containing object arrays. Values other than 'latin1',
         'ASCII', and 'bytes' are not allowed, as they can corrupt numerical
Is this really true? It seems like we should now allow and encourage utf8.
I wondered about that too, but the problem is the pickled case.
    if encoding not in ('ASCII', 'latin1', 'bytes'):
        # The 'encoding' value for pickle also affects what encoding
        # the serialized binary data of NumPy arrays is loaded
        # in. Pickle does not pass on the encoding information to
        # NumPy. The unpickling code in numpy.core.multiarray is
        # written to assume that unicode data appearing where binary
        # should be is in 'latin1'. 'bytes' is also safe, as is 'ASCII'.
        #
        # Other encoding values can corrupt binary data, and we
        # purposefully disallow them. For the same reason, the errors=
        # argument is not exposed, as values other than 'strict' can
        # similarly silently corrupt numerical data.
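For context, the restriction bites when loading Python 2 pickles on Python 3; a hedged sketch (file name hypothetical):

```python
import numpy as np

# Object arrays pickled under Python 2 must be loaded with one of the
# three safe encodings; 'latin1' recovers Python 2 str data as unicode.
arr = np.load('py2_data.npy', allow_pickle=True, encoding='latin1')

# Other values are rejected rather than risk corrupting binary data:
# np.load('py2_data.npy', allow_pickle=True, encoding='utf-8')  # ValueError
```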
numpy/lib/npyio.py (Outdated)
@@ -731,7 +733,7 @@ def _getconv(dtype):

     def floatconv(x):
         x.lower()
-        if b'0x' in x:
+        if '0x' in x:
             return float.fromhex(asstr(x))
No need for the `asstr` here, since the above line already requires `x` to be `unicode` in py3, and py2 doesn't care either way.
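In other words, the cleaned-up function could plausibly read (a sketch, assuming `x` is already `str` on py3; not the exact merged code):

```python
def floatconv(x):
    # '0x' membership works directly on unicode, and float.fromhex
    # accepts unicode, so no asstr() conversion is needed.
    if '0x' in x:
        return float.fromhex(x)
    return float(x)
```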
numpy/lib/npyio.py (Outdated)
    if conv is bytes:
        user_conv = asbytes
    elif byte_converters:
        # converters may use decode to workaround numpy's oldd behaviour,
Typo in `oldd`
numpy/lib/npyio.py (Outdated)
    if byte_converters and strcolidx:
        # convert strings back to bytes for backward compatibility
        warnings.warn(
            "Reading strings without specifying the encoding argument is "
Probably good to call them `unicode` strings to avoid ambiguity on python 2.
numpy/lib/npyio.py (Outdated)
        for i in strcolidx:
            if isinstance(row[i], bytes):
                row[i] = row[i].decode('latin1')
        data[k] = tuple(row)
IMO this block would be clearer as:
try:
    new_data = []
    for row_tup in data:
        row = list(row_tup)
        for i in strcolidx:
            row[i] = row[i].encode('latin1')
        row_tup = tuple(row)
        new_data.append(row_tup)
    type_str = np.bytes_
except UnicodeEncodeError:
    type_str = np.unicode_
else:
    data = new_data
Or even better
def encode_unicode_cols(row_tup):
    row = list(row_tup)
    for i in strcolidx:
        row[i] = row[i].encode('latin1')
    return tuple(row)

try:
    data = [encode_unicode_cols(r) for r in data]
except UnicodeEncodeError:
    type_str = np.unicode_
else:
    type_str = np.bytes_
`type_str` needs to be a numpy type, 'U' or 'S'. The 'U' is the default here, overridden if the `try` succeeds.
numpy/lib/npyio.py (Outdated)
     # ... and take the largest number of chars.
     for i in strcolidx:
-        column_types[i] = "|S%i" % max(len(row[i]) for row in data)
+        column_types[i] = "|%s%i" % (typestr, max(len(row[i]) for row in data))
Would be nice to use `np.dtype((the_type, max(len(row[i]) for row in data)))` here instead of string formatting.
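The tuple form builds sized string dtypes directly, e.g. (spellings per the discussion above; `np.unicode_` was the name of the era):

```python
import numpy as np

# (flexible_dtype, itemsize) yields a sized string dtype:
assert np.dtype((np.bytes_, 5)) == np.dtype('S5')     # 5-byte bytestring
assert np.dtype((np.unicode_, 5)) == np.dtype('U5')   # 5-char unicode
```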
numpy/lib/npyio.py (Outdated)
-        if v in (type('S'), np.string_)]
+        if v == np.unicode_]

+    typestr = 'U'
Change to `np.unicode_`, and the `'S'` below to `np.bytes_`.
force-pushed from 3726112 to 1a5e37d
Most review comments have been addressed.
numpy/lib/npyio.py (Outdated)
@@ -1992,7 +2149,7 @@ def genfromtxt(fname, dtype=float, comments='#', delimiter=None,
     if usemask and names:
         for (name, conv) in zip(names or (), converters):
This `or ()` is useless, as we already check that `names` is truthy above.
numpy/lib/_iotools.py (Outdated)
         except (TypeError, ValueError):
             tester = None
         self.type = self._dtypeortype(self._getdtype(tester))
         # Add the missing values to the existing set
         if missing_values is not None:
-            if _is_bytes_like(missing_values):
+            if isinstance(missing_values, basestring):
This now rejects `bytes` on python 3. This is perhaps fine, but it does so silently, as there is no `else` case to fail on this `if`.
Hmm, the whole function looks a bit bogus as it never checks what the iterator returns. `_is_bytes_like` also doesn't do much except check that it can be combined with the bytes type, which does not include `str` in Python 3.

The documentation does say that `missing_values` is a str or sequence of str, although it would originally accept unicode also. I think in Python 3 after this PR it should not accept bytes, but that might not be backwards compatible with third party converters as the original only worked with byte strings in Python 3.

Maybe the thing to do here is make `missing_values` iterable, if not already, check each value for basestring, and fail if not.
numpy/lib/npyio.py (Outdated)
-        comments = [asbytes(comment) for comment in comments]
         comments = [comments]

+        comments = [_decode_line(x) for x in comments]
Should this forward the `encoding` argument? (after the handling of `"bytes"` below)
I don't think so. The `encoding` parameter refers to the file encoding, while the comment strings are passed as an argument. A better option might be to not accept bytes strings for this.
I'll update the docstring to mention the default encoding. The previously used `asbytes` assumed 'latin1'.
Hmm, maybe not. There are a bunch of other passed string parameters, none of which are explicitly decoded if they are byte strings. The only thing special about `comments` is that it may be a sequence. I think we should simply drop the byte string option.
OK, what I have done is decoded the `delimiter` also and added a note to the documentation that if either of `delimiter` or `comments` is passed as a byte string, it will be decoded as 'latin1'. We might run into some backwards compatibility problems here, but I don't expect many folks were passing byte strings, especially as that would assume that the input file has some strange encoding that couldn't be handled by the latin1 assumption. We should maybe warn if either is a byte string.
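A hedged illustration of the intended behavior (byte-string arguments decoded as latin1; the unicode spellings are the preferred form):

```python
import numpy as np
from io import StringIO

s = StringIO("# header\n1;2\n3;4\n")
# b';' and b'#' are decoded as latin1 before use; ';' and '#'
# are the equivalent, preferred unicode spellings.
a = np.genfromtxt(s, delimiter=b';', comments=b'#')
```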
Looking pretty good now.
I'm going to put the final bits of this off till tomorrow.
OK, I think this is done, but I will be surprised, and pleased, if it doesn't turn up some problems after it is merged :(.
force-pushed from 4d493d3 to c9a44b1
numpy/lib/tests/test__iotools.py (Outdated)
    assert_equal(converter._status, len(converter._mapper) - 1)
    # test str TODO
    #assert_equal(converter.upgrade(b'a'), b'a')
    #assert_equal(converter._status, len(converter._mapper) - 1)
Can you elaborate on this TODO?
Not sure what it was supposed to test. However, the commented out portions work, so I have uncommented them and added tests with str and unicode. The upgrade is supposed to find and add a converter for the passed type.
Was running the wrong tests. So I am still not sure what the comment intended, but the (new?) upshot is that all input data not recognized as a boolean or numeric type is converted to unicode. I've made the test check that that is the case.
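A small check of that behavior (sketch; `StringConverter` is a private helper in `numpy.lib._iotools`, so details may shift):

```python
from numpy.lib._iotools import StringConverter

conv = StringConverter()
assert conv.upgrade('1') == 1         # upgraded to integer
assert conv.upgrade('1.0') == 1.0     # upgraded to float
assert conv.upgrade('abc') == 'abc'   # unrecognized -> unicode
```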
numpy/lib/tests/test__iotools.py (Outdated)
@@ -164,30 +162,36 @@ def test_upgrade(self):
     status_offset = int(nx.dtype(nx.integer).itemsize < nx.dtype(nx.int64).itemsize)
This should be `nx.int_`, not `nx.integer`, and the comment should say `long`, I think.
Curiously, `numpy.integer` is valid and is a long. Probably an obsolete usage going back to Numeric.
`np.integer` is the base class of `np.int_` and friends. But as you observe, `np.dtype(np.integer)` promotes to `long` in the same way that `np.dtype(np.number)` promotes to `float`.
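Easy to check (output shown for a typical 64-bit Linux build of the era; conversion of abstract types to dtypes has since been deprecated):

```python
import numpy as np

# Abstract scalar types promote to a concrete default in np.dtype:
print(np.dtype(np.integer))  # int64, i.e. the platform 'long'
print(np.dtype(np.number))   # float64
```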
numpy/lib/tests/test__iotools.py (Outdated)
     assert_equal(test, date(2000, 1, 1))

     def test_string_to_object(self):
         "Make sure that string-to-object functions are properly recognized"
         conv = StringConverter(_bytes_to_date)
-        assert_equal(conv._mapper[-2][0](0), 0j)
+        assert_equal(conv._mapper[-3][0](0), 0j)
What happened here? A comment saying what `-3` means would be helpful.
It's a list index into a list of tuples. The contents of the list were changed; it's self-explanatory if you take a look at what `_mapper` is.
Well, this test is garbage - `conv._mapper[-3][0]` is `np.longdouble`, yet it seems to be expecting it to be a complex value.
Also `_mapper` is a class property, not an instance one - making this test even more bizarre.
Maybe the test should be:
old_map = StringConverter._mapper.copy()
new_map = StringConverter(_bytes_to_date)._mapper
assert_equal(new_map, old_map)
Yeah, I'm not going to spend a lot of time fixing up crappy old tests/codes as long as they are no worse than they were. Note that the long double converter is also broken.
And the bytes converter is never reached.
Seems like a fair stance.
Note that my comments below are about new code though!
    if not np.iterable(missing_values):
        missing_values = [missing_values]
    if not all(isinstance(v, basestring) for v in missing_values):
        raise TypeError("missing_values must be strings or unicode")
Message is a little odd in python 3 where string and unicode mean the same thing, but not sure how to fix that.
Yeah, needed the unicode for Python 2.
numpy/lib/_iotools.py (Outdated)
    if missing_values is None:
        # Clear all missing values even though the ctor initializes it to
        # {''} when the argument is None.
        self.missing_values = {}
This is a bug - it was presumably supposed to use `set()`, but ends up producing a `dict` instead. Does this code path ever get taken?
I don't think so, and it's certainly not tested, as it previously used `append` for a `set`.
I thought it used `update`, which is defined on both? Maybe we get away with it because later on we test it for truthiness.
I think originally it was all written using a list, then someone upgraded to sets and missed fixing up this bit of code.
Nope, it used `append` in a loop.
force-pushed from 3d6fe25 to 9386ade
numpy/lib/_iotools.py (Outdated)
    # Add the missing values to the existing set or clear it.
    if missing_values is None:
        # Clear all missing values even though the ctor initializes it to
        # set('') when the argument is None.
This should say `set([''])`, not `set('')`, since the latter is just `set()`. It was correct as `{''}`.
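The distinction in plain Python:

```python
assert set('') == set()   # iterating an empty string yields no elements
assert set(['']) == {''}  # one-element set containing the empty string
assert {} == dict()       # and bare {} is a dict, hence the bug above
```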
numpy/lib/tests/test_io.py (Outdated)
        v.encode(locale.getpreferredencoding())
        return False  # no skipping
    except UnicodeEncodeError:
        return True
I am super confused by this function:

- Comment says `decode`, name says `encode`
- Input is `unicode`, comment says bytes
- Function returns `False` if it can decode
- Comment says `default encoding`, but code doesn't use `sys.getdefaultencoding()`
Removed. It was only used to determine if `test_utf8_file_nodtype_unicode` could be run. The test has been modified.
Minor cleanups of old code to reflect more modern usage.
More fixes and added a bit more documentation.
     # ... and take the largest number of chars.
     for i in strcolidx:
-        column_types[i] = "|S%i" % max(len(row[i]) for row in data)
+        max_line_length = max(len(row[i]) for row in data)
+        column_types[i] = np.dtype((type_str, max_line_length))
This chunk needs to run for `bytes` too.
Fixes numpygh-10394, due to regression in numpygh-10054
Rebase and squash of #4208
This modifies loadtxt and genfromtxt in several ways intended to add unicode support for text files by adding an `encoding` keyword to np.load, np.genfromtxt, np.savetxt, and np.fromregex. The original treatment of the relevant files was to open them as byte files, whereas they are now opened as text files with an encoding. When read, they are decoded to unicode strings for Python3 compatibility, and when written, they are encoded as specified. For backward compatibility, the default encoding in both cases is latin1.
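As merged, typical usage of the new keyword looks something like this (illustrative sketch; requires numpy >= 1.14):

```python
import numpy as np

names = np.array(['café', 'naïve'])
vals = np.array([1.5, 2.5])

# Write a UTF-8 text file explicitly, then read it back.
np.savetxt('demo.txt', np.column_stack([names, vals]),
           fmt='%s', encoding='utf-8')
data = np.genfromtxt('demo.txt', dtype=None, encoding='utf-8')
```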