Bug with NumPy `loadtxt()` and unicode strings #4600

saullocastro · 2014-04-08T13:15:06Z

Please refer to this question posted on Stack Overflow::

http://stackoverflow.com/q/22936790/832621

The OP uses Windows and an ISO-8859 text file created on Linux, with very long lines and CRLF line terminators.

The goal is to read the file into NumPy, skipping the first line, which contains labels (with special characters, usually only the Greek mu).

With Python 2.7.6 and NumPy 1.8.0, this works perfectly::

data = np.loadtxt('input_file.txt', skiprows=1)

With Python 3.4.0 and NumPy 1.8.0, the same call gives an error::

np.loadtxt('input_file.txt', skiprows=1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.4/site-packages/numpy/lib/npyio.py", line 796, in loadtxt
    next(fh)
  File "/usr/lib/python3.4/codecs.py", line 313, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 4158: invalid start byte

It worked with genfromtxt().
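
For context, a minimal sketch of why the byte in the traceback fails: in ISO-8859-1 the micro sign is stored as the single byte 0xb5, which is not a valid start byte in UTF-8. The label text used here is a hypothetical stand-in, not the OP's actual header.

```python
# Hypothetical header label; only the byte 0xb5 is taken from the traceback.
label = "t [µs]"
raw = label.encode("iso-8859-1")  # 'µ' becomes the single byte 0xb5

try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    # Same failure reason as reported by loadtxt under Python 3
    utf8_error = exc.reason

print(utf8_error)
print(raw.decode("iso-8859-1"))  # decoding with the matching codec succeeds
```

So the header line cannot be decoded as UTF-8 even though it is skipped afterwards, because it has to be read before it can be discarded.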


lsvermeer · 2014-04-08T13:49:56Z

OP here. Just to correct/clarify the above: I used Python/NumPy on Linux, but the files were created on a Windows PC.

juliantaylor · 2014-04-08T18:24:35Z

The text loading functions are broken with respect to Unicode and non-Latin encodings, especially in Python 3. Please check whether gh-4208 helps, and make sure to give the function the right encoding.
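
As a rough illustration of "giving the function the right encoding": the sketch below assumes a NumPy release in which `loadtxt` accepts an `encoding` argument (added in NumPy 1.14, so not available in the 1.8.0 from the report). The file contents are a hypothetical stand-in for the OP's data: an ISO-8859-1 header containing a micro sign, CRLF line endings, and numeric rows.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in for the OP's file (not the real data).
header = "time [µs]\tvalue\r\n"
rows = "1.0\t2.0\r\n3.0\t4.0\r\n"

with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write((header + rows).encode("iso-8859-1"))
    path = tmp.name

try:
    # The header is still decoded before being skipped, so the codec must match.
    data = np.loadtxt(path, skiprows=1, encoding="iso-8859-1")
finally:
    os.remove(path)

print(data)
```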


tkamishima · 2016-04-27T17:55:31Z

I also ran into a problem reading files with non-ASCII content, such as files written in Japanese.
In my case, the Stack Overflow post "Loading UTF-8 file in Python 3 using numpy.genfromtxt" was helpful.
Explicitly adding converters may help you resolve your problem.
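
On NumPy 1.14 or later, passing `encoding` to `genfromtxt` may be a simpler alternative to writing converters. The following is only a sketch with made-up sample rows; the field names `f0` and `f1` are the defaults `genfromtxt` assigns when `dtype=None` and no names are given.

```python
import io

import numpy as np

# Made-up sample: one text column (Japanese) and one numeric column.
text = "東京,1.5\n大阪,2.5\n"
fh = io.BytesIO(text.encode("utf-8"))

# dtype=None lets genfromtxt infer a type per column;
# encoding tells it how to decode the raw bytes.
table = np.genfromtxt(fh, delimiter=",", dtype=None, encoding="utf-8")

print(table["f0"])  # text column
print(table["f1"])  # numeric column
```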

fzh0917 · 2018-11-28T04:40:52Z

Nice, replacing the loadtxt function with genfromtxt is a useful workaround.

rossbar · 2021-08-04T13:54:06Z

This is a pretty old issue and I'm not sure how much of it is still relevant. The closest thing I can find to a reproducer is from one of the linked SO posts, where a user has trouble loading "Côte d'Ivoire" from an ISO-8859-encoded file. This should work using loadtxt's encoding parameter:

>>> import io
>>> import numpy as np
>>> fh = io.BytesIO("Côte d'Ivoire".encode('iso-8859-1'))
>>> fh.getvalue()
b"C\xf4te d'Ivoire"
>>> # Note: use delimiter=',' to prevent a split at the space
>>> np.loadtxt(fh, dtype="U", delimiter=",", encoding='iso-8859-1')
array("Côte d'Ivoire", dtype='<U13')

Note that the default value for encoding eventually resolves to sys.getdefaultencoding() in many cases, so users will have to supply the correct encoding if it's different than whatever the current system default is.
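
To see which defaults apply on a given machine, something like the following can be used. This is only a quick check of the interpreter and locale settings; which of them a particular NumPy version falls back to is not shown here.

```python
import locale
import sys

# Python 3 always reports 'utf-8' here.
default_enc = sys.getdefaultencoding()

# Depends on the platform and locale settings (e.g. 'UTF-8' or 'cp1252').
locale_enc = locale.getpreferredencoding(False)

print(default_enc, locale_enc)
```

If the file was written with a different codec than these, pass that codec explicitly via `encoding=`.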

I'm going to close this hoping that the original issue is either obsolete or resolved by e.g. the above example. If the issue persists or there are related file encoding issues, please reopen or open a new issue with a minimal reproducing example.

rgommers added 00 - Bug component: numpy.lib labels May 27, 2016

rossbar closed this as completed Aug 4, 2021
