Strange problem when creating a pandas.Series from void ndarray since 1.15.0 · Issue #11668 · numpy/numpy

Closed
konlai opened this issue Aug 3, 2018 · 16 comments

@konlai
konlai commented Aug 3, 2018

I don't know whether this is a NumPy bug or a pandas bug. Since upgrading to NumPy 1.15.0, I'm getting garbage in my pandas.Series when I construct it from a void ndarray. This worked fine in NumPy 1.14.5.

Reproducing code example:

import numpy
import pandas

a = numpy.array(['abcd'], "V4")
#print a
s = pandas.Series(a).apply(lambda x: str(x))
print s

Error message:

With Numpy 1.14.5, it prints:

0    abcd
dtype: object

With Numpy 1.15.0, it prints:

0    �7�5
dtype: object

However, if you uncomment the print a line, it prints:

[b'\x61\x62\x63\x64']
0    abcd
dtype: object

Note that the original "V4" ndarray is produced by another library. Ideally, they should probably be producing an "a4" ndarray. But that's another matter.
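
As a possible workaround on the consumer side (not something proposed in this thread), a "V4" array can be reinterpreted as fixed-width bytes without copying. This is only a sketch: it assumes the void records really hold byte strings, and it uses "S4", the same fixed-width bytes type that the comment above writes as "a4".

```python
import numpy as np

# Same kind of array as in the report: one 4-byte unstructured void record.
raw = np.array([b"abcd"], dtype="V4")

# Reinterpret the same memory as 4-byte byte strings (no data is copied).
as_bytes = raw.view("S4")

values = as_bytes.tolist()
print(values)
```

Note that an "S" dtype drops trailing NUL bytes when items are read back, so this view is not equivalent to the void array for arbitrary binary payloads.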

Numpy/Python version information:

Python 2.7.5
Numpy 1.15.0
Pandas 0.23.3

@charris
Member
charris commented Aug 3, 2018

Out of curiosity, can you try with Python 3 or a recent version of 2.7? Python 2.7.5 is known to have some issues, see #10524. @jreback Any idea? That does seem strange.

charris added this to the 1.15.1 release milestone on Aug 3, 2018
@konlai
Author
konlai commented Aug 3, 2018

I ran this with Python 3.4.8 and Pandas 0.22.0:

import numpy
import pandas

a = numpy.array([b'abcd'], "V4")
#print(a)
s = pandas.Series(a).apply(lambda x: str(x))
print(s)

With Numpy 1.14.5:

0    [ 97  98  99 100]
dtype: object

With Numpy 1.15.0:

0    b'\xbd7\x865'
dtype: object

With Numpy 1.15.0 and uncommented "print(a)":

[b'\x61\x62\x63\x64']
0    b'abcd'
dtype: object

Still very different behaviours.

@charris
Member
charris commented Aug 3, 2018

I think that there is something uninitialized in the printing function for that dtype.

EDIT: Or maybe not.

Python 3.3.6, Pandas 0.22.0

In [1]: import pandas

In [2]: a = numpy.array([b'abcd'], "V4")

In [3]: s = pandas.Series(a).apply(lambda x: str(x)); s
Out[3]: 
0    b'`\xc9M/'
dtype: object

In [4]: s = str(a)

In [5]: s = pandas.Series(a).apply(lambda x: str(x)); s
Out[5]: 
0    b'abcd'
dtype: object

In [6]: a = numpy.array([b'abcd'], "V4")

In [7]: s = pandas.Series(a).apply(lambda x: str(x)); s
Out[7]: 
0    b'abcd'
dtype: object

@ahaldane Can you think of anything?

@eric-wieser
Member
eric-wieser commented Aug 3, 2018

So to be clear, the act of printing the array changes the behavior of the following call to pandas?

@konlai
Author
konlai commented Aug 3, 2018

> So to be clear, the act of printing the array changes the behavior?

Yes.

> I think that there is something uninitialized in the printing function for that dtype.

I don't know how the printing function is involved in the construction of the pandas.Series, but the object inside the Series is different depending on whether you've printed the ndarray first, e.g. (with Python 3.4.8, Pandas 0.22.0, Numpy 1.15.0):

a = numpy.array([b'abcd'], "V4")
#print(a)
s = pandas.Series(a).apply(lambda x: x.decode('latin-1'))
print(s[0] == 'abcd')

This prints False, but it prints True if print(a) is uncommented.
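
As a side note on why decode('latin-1') is a convenient probe here: latin-1 maps every byte value to a character, so the comparison cannot fail with a decoding error even when the bytes are garbage. The sketch below uses the garbage bytes reported earlier in this thread for NumPy 1.15.0; it needs neither NumPy nor pandas, and the variable names are only illustrative.

```python
# Bytes reported in this thread for NumPy 1.15.0 when print(a) is commented out.
garbage = b"\xbd7\x865"

# latin-1 decoding always succeeds: one byte becomes one character.
as_text = garbage.decode("latin-1")
matches = as_text == "abcd"

# UTF-8 decoding of the same bytes is rejected instead of returning text.
try:
    garbage.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(matches, utf8_ok)
```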

@eric-wieser
Member

Is accessing the element sufficient? So replacing print(a) with tmp = a[0]?

@konlai
Author
konlai commented Aug 3, 2018

> Is accessing the element sufficient? So replacing print(a) with tmp = a[0]?

No. The pandas.Series still contains garbage if I do that.

@eric-wieser
Member
eric-wieser commented Aug 3, 2018

How about tmp = str(a[0])? Edit: I can reproduce this on master with pandas 0.23.1.

@konlai
Author
konlai commented Aug 3, 2018

> How about tmp = str(a[0])?

Yes. That fixes the problem.

@eric-wieser
Member
eric-wieser commented Aug 3, 2018

Something very strange is going on. After some poking around, I got this behavior:

>>> b = numpy.array([b'abce'], "V4")  # note the e
>>> pandas.Series(b).apply(lambda x: x.decode('latin-1'))
0    abcd
dtype: object

Seems we're using uninitialized memory from somewhere.

@eric-wieser
Member

Printing pandas.Series(a) seems to error in pandas for an unrelated reason.

@eric-wieser
Member
eric-wieser commented Aug 3, 2018

Well, I've found one bug:

return PyBytes_FromStringAndSize(PyArray_DATA(ap), descr->elsize);

should be

return PyBytes_FromStringAndSize(ip, descr->elsize);

For length-1 arrays like the one in your example, though, I can't see why that would matter.

This causes:

>>> b = numpy.array([b'abcd', b'efgh'], "V4")
>>> b.astype('O')
array([b'abcd', b'abcd'], dtype=object)

I think this slipped through the cracks since I removed almost all the occurrences of TYPE_getitem that weren't used on 0d arrays, where PyArray_DATA(ap) == ip. .astype is one of the few places where that call remains.
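
To illustrate why reading from PyArray_DATA(ap) rather than ip yields the duplicated elements shown above, here is a pure-Python toy model. It is not NumPy's actual code: the flat buffer, the function names getitem_buggy, getitem_fixed and to_objects, and the use of slicing in place of pointer arithmetic are all assumptions made for illustration. It only models the multi-element symptom, not the length-1 pandas case.

```python
ITEMSIZE = 4
# Two 4-byte void records stored back to back, like the V4 array above.
buffer = b"abcd" + b"efgh"


def getitem_buggy(buf, offset, itemsize=ITEMSIZE):
    # Stands in for PyArray_DATA(ap): ignores the element offset and
    # always reads from the start of the array's data.
    return bytes(buf[0:itemsize])


def getitem_fixed(buf, offset, itemsize=ITEMSIZE):
    # Stands in for ip: reads from the position of the requested element.
    return bytes(buf[offset:offset + itemsize])


def to_objects(buf, getitem, itemsize=ITEMSIZE):
    # Stand-in for .astype('O'): convert every record to a bytes object.
    return [getitem(buf, off, itemsize) for off in range(0, len(buf), itemsize)]


buggy = to_objects(buffer, getitem_buggy)
fixed = to_objects(buffer, getitem_fixed)
print(buggy)  # [b'abcd', b'abcd']
print(fixed)  # [b'abcd', b'efgh']
```

For a 0d array the two reads coincide, which is consistent with the remark that PyArray_DATA(ap) == ip in that case.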

@charris
Member
charris commented Aug 3, 2018

The bad commit is a83af93 in #8157.

That seems to be the same change that @eric-wieser identified above.

@charris
Member
charris commented Aug 3, 2018

Fixing the bug found by @eric-wieser fixes the immediate problem. The pandas.Series(a) printing problem also occurs with NumPy v1.10.0, so I suspect it is a pandas bug.

eric-wieser changed the title from "Strange problem when creating a pandas.Series from ndarray since 1.15.0" to "Strange problem when creating a pandas.Series from void ndarray since 1.15.0" on Aug 3, 2018
@charris
Member
charris commented Aug 3, 2018

@eric-wieser Any idea how to test the fix without using pandas? I'm coming up dry.

@eric-wieser
Member

I gave a failing test case in my comment above.
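
A pandas-free regression check built on the two-element astype('O') case from the earlier comment might look like the sketch below. The helper name void_to_object is hypothetical, and this is not necessarily the test that was added to NumPy; it assumes a NumPy version that includes the fix.

```python
import numpy as np


def void_to_object(values, itemsize=4):
    # Hypothetical helper: build an unstructured void array from byte
    # strings, cast it to object, and return the elements as a list.
    arr = np.array(values, dtype="V%d" % itemsize)
    return list(arr.astype("O"))


result = void_to_object([b"abcd", b"efgh"])

# With the bug, every element repeats the first record; with the fix,
# each element keeps its own bytes.
assert result == [b"abcd", b"efgh"], result
```

Using at least two distinct elements matters here, because a single-element array cannot distinguish a read from the start of the data from a read at the element's own position.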

charris added a commit to charris/numpy that referenced this issue Aug 6, 2018
The return value for a void array was not correct.

Closes numpy#11668.
charris added a commit to charris/numpy that referenced this issue Aug 7, 2018
The return value for a void array was not correct.

Closes numpy#11668.