Bug with NumPy `loadtxt()` and unicode strings · Issue #4600 · numpy/numpy · GitHub

Bug with NumPy loadtxt() and unicode strings #4600


Closed
saullocastro opened this issue Apr 8, 2014 · 5 comments

Comments

@saullocastro
Contributor

Please refer to this question posted on Stack Overflow:

http://stackoverflow.com/q/22936790/832621

The OP uses Windows and an ISO-8859 text file created on Linux, with very long lines and CRLF line terminators.

The file is read into NumPy skipping the first line, which contains labels (with special characters, usually only the Greek mu):

With Python 2.7.6 and NumPy 1.8.0, this works perfectly::

data = np.loadtxt('input_file.txt', skiprows=1)

With Python 3.4.0 and NumPy 1.8.0, the same call gives an error::

np.loadtxt('input_file.txt', skiprows=1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.4/site-packages/numpy/lib/npyio.py", line 796, in loadtxt
    next(fh)
  File "/usr/lib/python3.4/codecs.py", line 313, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 4158: invalid start byte

It worked with genfromtxt().
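
For illustration, the failure in the traceback can be reproduced at the byte level without NumPy. This is only a minimal sketch: it assumes the file is ISO-8859-1, where byte 0xb5 is the micro sign, and the header text `Time [µs]` is a made-up stand-in for the OP's labels. The traceback shows the error raised from `next(fh)`, i.e. while the header line is being skipped, so `skiprows=1` does not avoid the decode.

```python
# Hypothetical header label; in ISO-8859-1 the micro sign is the single byte 0xb5.
raw = "Time [µs]".encode("iso-8859-1")

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as exc:
    # 0xb5 is a UTF-8 continuation byte, so it cannot start a character.
    utf8_ok = False
    bad_byte = raw[exc.start]

# Decoding with the encoding the file was actually written in succeeds.
latin1_text = raw.decode("iso-8859-1")

print(utf8_ok, hex(bad_byte), latin1_text)
```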

@lsvermeer

OP here. Just to correct/clarify the above: I used Python/NumPy on Linux, but the files were created on a Windows PC.

@juliantaylor
Contributor

The text loading functions are broken with respect to Unicode and non-Latin encodings, especially in Python 3. Please check whether gh-4208 helps, and make sure to give the function the right encoding.
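
As a rough sketch of what "giving the function the right encoding" can look like: NumPy releases from 1.14 onward accept an `encoding` argument in `loadtxt`. The file contents below are a hypothetical stand-in for the OP's data (an ISO-8859-1 header with micro signs, CRLF line endings) and are not taken from the real file.

```python
import os
import tempfile

import numpy as np

# Made-up stand-in for the OP's file: Latin-1 header, CRLF line endings,
# whitespace-separated numeric rows.
content = "t [µs] U [µV]\r\n1.0 2.0\r\n3.0 4.0\r\n"

with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as tmp:
    tmp.write(content.encode("iso-8859-1"))
    path = tmp.name

try:
    # Assumes NumPy >= 1.14, where loadtxt has the encoding parameter.
    data = np.loadtxt(path, skiprows=1, encoding="iso-8859-1")
finally:
    os.remove(path)

print(data)
```

The header is still decoded while it is skipped, which is why the encoding has to match the file even though only numeric rows end up in the array.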


@tkamishima
Contributor

I also ran into a problem reading files with non-ASCII content, such as files written in Japanese.
In my case, the Stack Overflow post "Loading UTF-8 file in Python 3 using numpy.genfromtxt" was helpful.
Explicitly adding converters may help you resolve your problem.

@fzh0917
fzh0917 commented Nov 28, 2018

Nice, replacing the loadtxt function with genfromtxt is a useful workaround.

@rossbar
Contributor
rossbar commented Aug 4, 2021

This is a pretty old issue and I'm not sure how much of it is still relevant. The closest thing I can find to a reproducer is from one of the linked SO posts, where a user has trouble loading "Côte d'Ivoire" from an ISO-8859-encoded file. This should work using loadtxt's encoding parameter:

>>> import io
>>> import numpy as np
>>> fh = io.BytesIO("Côte d'Ivoire".encode('iso-8859-1'))
>>> fh.getvalue()
b"C\xf4te d'Ivoire"
>>> # Note: use delimiter=',' to prevent a split at the space
>>> np.loadtxt(fh, dtype="U", delimiter=",", encoding='iso-8859-1')
array("Côte d'Ivoire", dtype='<U13')

Note that the default value for encoding eventually resolves to sys.getdefaultencoding() in many cases, so users will have to supply the correct encoding if it's different than whatever the current system default is.
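
One way to avoid depending on the default, sketched here under the assumption of a recent NumPy, is to open the file yourself with an explicit encoding and pass the text-mode handle to loadtxt. The file name and contents are made up for the example.

```python
import os
import sys
import tempfile

import numpy as np

# The interpreter-wide default mentioned above.
print(sys.getdefaultencoding())

with tempfile.TemporaryDirectory() as tmpdir:
    # Made-up ISO-8859-1 file: a header with micro signs, then numeric rows.
    path = os.path.join(tmpdir, "input_file.txt")
    with open(path, "wb") as out:
        out.write("x [µm] y [µm]\n0.5 1.5\n2.5 3.5\n".encode("iso-8859-1"))

    # Decoding happens in open(), so loadtxt only receives str lines.
    with open(path, encoding="iso-8859-1") as fh:
        data = np.loadtxt(fh, skiprows=1)

print(data)
```

With this approach the choice of encoding stays visible at the call site instead of being inherited from the environment.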

I'm going to close this hoping that the original issue is either obsolete or resolved by e.g. the above example. If the issue persists or there are related file encoding issues, please reopen or open a new issue with a minimal reproducing example.

@rossbar rossbar closed this as completed Aug 4, 2021