ENH: Add encoding option to numpy text IO. #10054
Conversation
OK, should be ready to go.
@@ -268,6 +282,10 @@ The new ``chebinterpolate`` function interpolates a given function at the
 Chebyshev points of the first kind. A new ``Chebyshev.interpolate`` class
 method adds support for interpolation over arbitrary intervals using the scaled
 and shifted Chebyshev points of the first kind.
+Support for reading lzma compressed text files in Python 3
nit: line break here and below
Yeah, I decided to leave off the rest of the release notes editing to just before the branch.
I think all the formatting changes could do with collecting together coherently too, when we get around to that.
numpy/lib/_iotools.py (Outdated)
""" decode bytes from binary input streams, default to latin1 """ | ||
if type(line) is bytes: | ||
if encoding is None: | ||
line = line.decode('latin1') |
It seems to me that this type of encoding-guessing goes against the python3 unicode model, which requires this kind of thing to be explicit. Can we perhaps add a warning here in python 3?
This is trying to preserve compatibility with previous behavior and is only called in three functions, two of which pass 'bytes' as the default. I went ahead and documented the default values in those places. I think we can revisit this after Python 2 is dropped. There has also been a long time desire to replace all these functions, perhaps by bringing over the Pandas functionality, so this may all become moot in the long term.
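For reference, a minimal sketch of the fallback pattern under discussion (a simplified rendering of the helper in `numpy/lib/_iotools.py`, not its exact source):

```python
def _decode_line(line, encoding=None):
    """Decode bytes from binary input streams, defaulting to latin1.

    latin1 maps every byte 0x00-0xFF to a code point, so decoding
    can never fail and re-encoding round-trips the original bytes.
    """
    if type(line) is bytes:
        if encoding is None:
            encoding = 'latin1'
        line = line.decode(encoding)
    return line
```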
numpy/lib/tests/test__iotools.py (Outdated)
-    return date(*time.strptime(s, "%Y-%m-%d")[:3])
+    if type(s) == bytes:
+        s = s.decode("latin1")
+    return date(*time.strptime(s, "%Y-%m-%d")[:3])
Shouldn't the contract be that `s` is now always `unicode`, and so the decode is unconditional?
`_bytes_to_date` is a helper function called by several other tests, so it probably needs to do the check. Mind that we are also trying to make things look as much as possible like they did before, so that check can be looked at as a compatibility check. In particular, I believe that `encoding='bytes'` means 'latin1' for this PR.
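A quick illustration of why latin1 can stand in for 'bytes': it is the one common codec that round-trips every possible byte value, so no input can make the decode step fail:

```python
raw = bytes(range(256))               # every possible byte value
text = raw.decode('latin1')           # never raises
assert text.encode('latin1') == raw   # lossless round-trip
```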
@@ -294,7 +296,7 @@ def load(file, mmap_mode=None, allow_pickle=True, fix_imports=True,
         used in Python 3.
     encoding : str, optional
         What encoding to use when reading Python 2 strings. Only useful when
-        loading Python 2 generated pickled files on Python 3, which includes
+        loading Python 2 generated pickled files in Python 3, which includes
         npy/npz files containing object arrays. Values other than 'latin1',
         'ASCII', and 'bytes' are not allowed, as they can corrupt numerical
Is this really true? It seems like we should now allow and encourage utf8.
I wondered about that too, but the problem is the pickled case.
    if encoding not in ('ASCII', 'latin1', 'bytes'):
        # The 'encoding' value for pickle also affects what encoding
        # the serialized binary data of NumPy arrays is loaded
        # in. Pickle does not pass on the encoding information to
        # NumPy. The unpickling code in numpy.core.multiarray is
        # written to assume that unicode data appearing where binary
        # should be is in 'latin1'. 'bytes' is also safe, as is 'ASCII'.
        #
        # Other encoding values can corrupt binary data, and we
        # purposefully disallow them. For the same reason, the errors=
        # argument is not exposed, as values other than 'strict' can
        # similarly silently corrupt numerical data.
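For context, the restriction bites when loading Python 2 pickles on Python 3; a hedged sketch (file name hypothetical):

```python
import numpy as np

# Object arrays pickled under Python 2 must be loaded with one of the
# three safe encodings; 'latin1' recovers Python 2 str data as unicode.
arr = np.load('py2_data.npy', allow_pickle=True, encoding='latin1')

# Other values are rejected rather than risk corrupting binary data:
# np.load('py2_data.npy', allow_pickle=True, encoding='utf-8')  # ValueError
```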
numpy/lib/npyio.py (Outdated)
@@ -731,7 +733,7 @@ def _getconv(dtype):

     def floatconv(x):
         x.lower()
-        if b'0x' in x:
+        if '0x' in x:
             return float.fromhex(asstr(x))
No need for the `asstr` here, since the above line already requires `x` to be `unicode` in py3, and py2 doesn't care either way.
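In other words, the cleaned-up function could plausibly read (a sketch, assuming `x` is already `str` on py3; not the exact merged code):

```python
def floatconv(x):
    # '0x' membership works directly on unicode, and float.fromhex
    # accepts unicode, so no asstr() conversion is needed.
    if '0x' in x:
        return float.fromhex(x)
    return float(x)
```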
numpy/lib/npyio.py (Outdated)
    if conv is bytes:
        user_conv = asbytes
    elif byte_converters:
        # converters may use decode to workaround numpy's oldd behaviour,
Typo in `oldd`
numpy/lib/npyio.py (Outdated)
    if byte_converters and strcolidx:
        # convert strings back to bytes for backward compatibility
        warnings.warn(
            "Reading strings without specifying the encoding argument is "
Probably good to call them `unicode` strings to avoid ambiguity on python 2.
numpy/lib/npyio.py (Outdated)
        for i in strcolidx:
            if isinstance(row[i], bytes):
                row[i] = row[i].decode('latin1')
        data[k] = tuple(row)
IMO this block would be clearer as:
try:
    new_data = []
    for row_tup in data:
        row = list(row_tup)
        for i in strcolidx:
            row[i] = row[i].encode('latin1')
        row_tup = tuple(row)
        new_data.append(row_tup)
    type_str = np.bytes_
except UnicodeEncodeError:
    type_str = np.unicode_
else:
    data = new_data
Or even better
def encode_unicode_cols(row_tup):
    row = list(row_tup)
    for i in strcolidx:
        row[i] = row[i].encode('latin1')
    return tuple(row)

try:
    data = [encode_unicode_cols(r) for r in data]
except UnicodeEncodeError:
    type_str = np.unicode_
else:
    type_str = np.bytes_
`type_str` needs to be a numpy type, 'U' or 'S'. The 'U' is the default here, overridden if the `try` succeeds.
numpy/lib/npyio.py (Outdated)
     # ... and take the largest number of chars.
     for i in strcolidx:
-        column_types[i] = "|S%i" % max(len(row[i]) for row in data)
+        column_types[i] = "|%s%i" % (typestr, max(len(row[i]) for row in data))
Would be nice to use `np.dtype((the_type, max(len(row[i]) for row in data)))` here instead of string formatting.
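The tuple form builds sized string dtypes directly, e.g. (spellings per the discussion above; `np.unicode_` was the name of the era):

```python
import numpy as np

# (flexible_dtype, itemsize) yields a sized string dtype:
assert np.dtype((np.bytes_, 5)) == np.dtype('S5')     # 5-byte bytestring
assert np.dtype((np.unicode_, 5)) == np.dtype('U5')   # 5-char unicode
```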
numpy/lib/npyio.py (Outdated)
-        if v in (type('S'), np.string_)]
+        if v == np.unicode_]

+    typestr = 'U'
Change to `np.unicode_`, and the `'S'` below to `np.bytes_`.
force-pushed from 3726112 to 1a5e37d
Most review comments have been addressed.
numpy/lib/npyio.py (Outdated)
@@ -1992,7 +2149,7 @@ def genfromtxt(fname, dtype=float, comments='#', delimiter=None,
     if usemask and names:
         for (name, conv) in zip(names or (), converters):
This `or ()` is useless, as we already check that `names` is truthy above.
numpy/lib/_iotools.py (Outdated)
         except (TypeError, ValueError):
             tester = None
         self.type = self._dtypeortype(self._getdtype(tester))
         # Add the missing values to the existing set
         if missing_values is not None:
-            if _is_bytes_like(missing_values):
+            if isinstance(missing_values, basestring):
This now rejects `bytes` on python 3. This is perhaps fine, but it does so silently, as there is no `else` case to fail on this `if`.
Hmm, the whole function looks a bit bogus as it never checks what the iterator returns. `_is_bytes_like` also doesn't do much except check that it can be combined with the bytes type, which does not include `str` in Python 3.

The documentation does say that `missing_values` is a str or sequence of str, although it would originally accept unicode also. I think in Python 3 after this PR it should not accept bytes, but that might not be backwards compatible with third party converters as the original only worked with byte strings in Python 3.

Maybe the thing to do here is make `missing_values` iterable, if not already, check each value for basestring, and fail if not.
numpy/lib/npyio.py (Outdated)
-        comments = [asbytes(comment) for comment in comments]
         comments = [comments]

+        comments = [_decode_line(x) for x in comments]
Should this forward the `encoding` argument? (after the handling of `"bytes"` below)
I don't think so. The `encoding` parameter refers to the file encoding, while the comment strings are passed as an argument. A better option might be to not accept bytes strings for this.
I'll update the docstring to mention the default encoding. The previously used `asbytes` assumed 'latin1'.
Hmm, maybe not. There are a bunch of other passed string parameters, none of which are explicitly decoded if they are byte strings. The only thing special about `comments` is that it may be a sequence. I think we should simply drop the byte string option.
OK, what I have done is decoded the `delimiter` also and added a note to the documentation that if either of `delimiter` or `comments` is passed as a byte string, it will be decoded as 'latin1'. We might run into some backwards compatibility problems here, but I don't expect many folks were passing byte strings, especially as that would assume that the input file has some strange encoding that couldn't be handled by the latin1 assumption. We should maybe warn if either is a byte string.
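A hedged illustration of the intended behavior (byte-string arguments decoded as latin1; the unicode spellings are the preferred form):

```python
import numpy as np
from io import StringIO

s = StringIO("# header\n1;2\n3;4\n")
# b';' and b'#' are decoded as latin1 before use; ';' and '#'
# are the equivalent, preferred unicode spellings.
a = np.genfromtxt(s, delimiter=b';', comments=b'#')
```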
Looking pretty good now.
I'm going to put the final bits of this off till tomorrow.
OK, I think this is done, but I will be surprised, and pleased, if it doesn't turn up some problems after it is merged :(.
force-pushed from 4d493d3 to c9a44b1
numpy/lib/tests/test__iotools.py (Outdated)
    assert_equal(converter._status, len(converter._mapper) - 1)
    # test str TODO
    #assert_equal(converter.upgrade(b'a'), b'a')
    #assert_equal(converter._status, len(converter._mapper) - 1)
Can you elaborate on this TODO?
Not sure what it was supposed to test. However, the commented out portions work, so I have uncommented them and added tests with str and unicode. The upgrade is supposed to find and add a converter for the passed type.
Was running the wrong tests. So I am still not sure what the comment intended, but the (new?) upshot is that all input data not recognized as a boolean or numeric type is converted to unicode. I've made the test check that that is the case.
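A small check of that behavior (sketch; `StringConverter` is a private helper in `numpy.lib._iotools`, so details may shift):

```python
from numpy.lib._iotools import StringConverter

conv = StringConverter()
assert conv.upgrade('1') == 1         # upgraded to integer
assert conv.upgrade('1.0') == 1.0     # upgraded to float
assert conv.upgrade('abc') == 'abc'   # unrecognized -> unicode
```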
numpy/lib/tests/test__iotools.py (Outdated)
@@ -164,30 +162,36 @@ def test_upgrade(self):
     status_offset = int(nx.dtype(nx.integer).itemsize < nx.dtype(nx.int64).itemsize)
This should be `nx.int_`, not `nx.integer`, and the comment should say `long`, I think.
Curiously, `numpy.integer` is valid and is a long. Probably an obsolete usage going back to Numeric.
`np.integer` is the base class of `np.int_` and friends. But as you observe, `np.dtype(np.integer)` promotes to `long` in the same way that `np.dtype(np.number)` promotes to `float`.
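Easy to check (output shown for a typical 64-bit Linux build of the era; conversion of abstract types to dtypes has since been deprecated):

```python
import numpy as np

# Abstract scalar types promote to a concrete default in np.dtype:
print(np.dtype(np.integer))  # int64, i.e. the platform 'long'
print(np.dtype(np.number))   # float64
```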
numpy/lib/tests/test__iotools.py (Outdated)
     assert_equal(test, date(2000, 1, 1))

     def test_string_to_object(self):
         "Make sure that string-to-object functions are properly recognized"
         conv = StringConverter(_bytes_to_date)
-        assert_equal(conv._mapper[-2][0](0), 0j)
+        assert_equal(conv._mapper[-3][0](0), 0j)
What happened here? A comment saying what `-3` means would be helpful.
It's a list index into a list of tuples. The contents of the list were changed; it's self-explanatory if you take a look at what `_mapper` is.
Well, this test is garbage - `conv._mapper[-3][0]` is `np.longdouble`, yet it seems to be expecting it to be a complex value.
Also `_mapper` is a class property, not an instance one - making this test even more bizarre.
Maybe the test should be:
old_map = StringConverter._mapper.copy()
new_map = StringConverter(_bytes_to_date)._mapper
assert_equal(new_map, old_map)
Yeah, I'm not going to spend a lot of time fixing up crappy old tests/codes as long as they are no worse than they were. Note that the long double converter is also broken.
And the bytes converter is never reached.
Seems like a fair stance.
Note that my comments below are about new code though!
    if not np.iterable(missing_values):
        missing_values = [missing_values]
    if not all(isinstance(v, basestring) for v in missing_values):
        raise TypeError("missing_values must be strings or unicode")
Message is a little odd in python 3 where string and unicode mean the same thing, but not sure how to fix that.
Yeah, needed the unicode for Python 2.
numpy/lib/_iotools.py (Outdated)
    if missing_values is None:
        # Clear all missing values even though the ctor initializes it to
        # {''} when the argument is None.
        self.missing_values = {}
This is a bug - it was presumably supposed to use `set()`, but ends up producing a `dict` instead. Does this code path ever get taken?
I don't think so, and it's certainly not tested, as it previously used `append` for a `set`.
I thought it used `update`, which is defined on both? Maybe we get away with it because later on we test it for truthiness.
I think originally it was all written using a list, then someone upgraded to sets and missed fixing up this bit of code.
Nope, it used `append` in a loop.
force-pushed from 3d6fe25 to 9386ade
numpy/lib/_iotools.py (Outdated)
    # Add the missing values to the existing set or clear it.
    if missing_values is None:
        # Clear all missing values even though the ctor initializes it to
        # set('') when the argument is None.
This should say `set([''])`, not `set('')`, since the latter is just `set()`. It was correct as `{''}`.
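The distinction in plain Python:

```python
assert set('') == set()   # iterating an empty string yields no elements
assert set(['']) == {''}  # one-element set containing the empty string
assert {} == dict()       # and bare {} is a dict, hence the bug above
```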
numpy/lib/tests/test_io.py (Outdated)
        v.encode(locale.getpreferredencoding())
        return False  # no skipping
    except UnicodeEncodeError:
        return True
I am super confused by this function:

- Comment says `decode`, name says `encode`
- Input is `unicode`, comment says bytes
- Function returns `False` if it can decode
- Comment says `default encoding`, but code doesn't use `sys.getdefaultencoding()`
Removed. It was only used to determine if `test_utf8_file_nodtype_unicode` could be run. The test has been modified.
Minor cleanups of old code to reflect more modern usage.
More fixes and added a bit more documentation.
     # ... and take the largest number of chars.
     for i in strcolidx:
-        column_types[i] = "|S%i" % max(len(row[i]) for row in data)
+        max_line_length = max(len(row[i]) for row in data)
+        column_types[i] = np.dtype((type_str, max_line_length))
This chunk needs to run for `bytes` too.
Fixes numpygh-10394, due to regression in numpygh-10054
Rebase and squash of #4208
This modifies loadtxt and genfromtxt in several ways intended to add unicode support for text files by adding an `encoding` keyword to np.load, np.genfromtxt, np.savetxt, and np.fromregex. The original treatment of the relevant files was to open them as byte files, whereas they are now opened as text files with an encoding. When read, they are decoded to unicode strings for Python3 compatibility, and when written, they are encoded as specified. For backward compatibility, the default encoding in both cases is latin1.
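As merged, typical usage of the new keyword looks something like this (illustrative sketch; requires numpy >= 1.14):

```python
import numpy as np

names = np.array(['café', 'naïve'])
vals = np.array([1.5, 2.5])

# Write a UTF-8 text file explicitly, then read it back.
np.savetxt('demo.txt', np.column_stack([names, vals]),
           fmt='%s', encoding='utf-8')
data = np.genfromtxt('demo.txt', dtype=None, encoding='utf-8')
```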