Strange problem when creating a pandas.Series from void ndarray since 1.15.0 #11668

konlai · 2018-08-03T02:41:47Z

I don't know whether this is a NumPy bug or a pandas bug. Since upgrading to NumPy 1.15.0, I'm getting garbage in my pandas.Series when I construct it from a void ndarray. This worked fine in NumPy 1.14.5.

Reproducing code example:

import numpy
import pandas

a = numpy.array(['abcd'], "V4")
#print a
s = pandas.Series(a).apply(lambda x: str(x))
print s

Error message:

With Numpy 1.14.5, it prints:

0    abcd
dtype: object

With Numpy 1.15.0, it prints:

0    �7�5
dtype: object

However, if you uncomment the print a, then it will print:

[b'\x61\x62\x63\x64']
0    abcd
dtype: object

Note that the original "V4" ndarray is produced by another library. Ideally, they should probably be producing an "a4" ndarray. But that's another matter.
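A minimal sketch of that idea, assuming the upstream array is plain unstructured void data: reinterpreting the buffer as fixed-width byte strings with ndarray.view avoids the void-to-object conversion altogether. The helper name void_as_bytes is hypothetical, and the frombuffer call only stands in for the other library's output. Keep in mind that NumPy's S dtype drops trailing NUL bytes when elements are read back, so this is only appropriate when the payload is text-like.

```python
import numpy as np

def void_as_bytes(arr):
    """Reinterpret an unstructured void array ("V<n>") as byte strings ("S<n>")."""
    if arr.dtype.kind != "V" or arr.dtype.names is not None:
        raise TypeError("expected an unstructured void array")
    # Same item size, so this reinterprets the existing buffer without copying.
    return arr.view("S%d" % arr.dtype.itemsize)

# Stand-in for the "V4" array produced by the other library (assumption).
raw = np.frombuffer(b"abcdefgh", dtype="V4")
print(void_as_bytes(raw))
```

The resulting array can then be passed to pandas.Series like any other bytes array.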

Numpy/Python version information:

Python 2.7.5
Numpy 1.15.0
Pandas 0.23.3


charris · 2018-08-03T03:49:44Z

Out of curiosity, can you try with Python 3 or a recent version of 2.7? Python 2.7.5 is known to have some issues, see #10524. @jreback Any idea? That does seem strange.

konlai · 2018-08-03T04:19:52Z

I ran this with Python 3.4.8 and Pandas 0.22.0:

import numpy
import pandas

a = numpy.array([b'abcd'], "V4")
#print(a)
s = pandas.Series(a).apply(lambda x: str(x))
print(s)

With Numpy 1.14.5:

0    [ 97  98  99 100]
dtype: object

With Numpy 1.15.0:

0    b'\xbd7\x865'
dtype: object

With Numpy 1.15.0 and uncommented "print(a)":

[b'\x61\x62\x63\x64']
0    b'abcd'
dtype: object

Still very different behaviours.

charris · 2018-08-03T04:43:25Z

I think that there is something uninitialized in the printing function for that dtype.

EDIT: Or maybe not.

Python 3.3.6, Pandas 0.22.0

In [1]: import numpy, pandas

In [2]: a = numpy.array([b'abcd'], "V4")

In [3]: s = pandas.Series(a).apply(lambda x: str(x)); s
Out[3]: 
0    b'`\xc9M/'
dtype: object

In [4]: s = str(a)

In [5]: s = pandas.Series(a).apply(lambda x: str(x)); s
Out[5]: 
0    b'abcd'
dtype: object

In [6]: a = numpy.array([b'abcd'], "V4")

In [7]: s = pandas.Series(a).apply(lambda x: str(x)); s
Out[7]: 
0    b'abcd'
dtype: object

@ahaldane Can you think of anything?

eric-wieser · 2018-08-03T05:28:25Z

So to be clear, the act of printing the array changes the behavior of the following call to pandas?

konlai · 2018-08-03T06:05:08Z

> So to be clear, the act of printing the array changes the behavior?

Yes.

> I think that there is something uninitialized in the printing function for that dtype.

I don't know how the printing function is involved in the construction of the pandas.Series, but the object inside the Series is different depending on whether you've printed the ndarray first, e.g. (with Python 3.4.8, Pandas 0.22.0, Numpy 1.15.0):

a = numpy.array([b'abcd'], "V4")
#print(a)
s = pandas.Series(a).apply(lambda x: x.decode('latin-1'))
print(s[0] == 'abcd')

Prints False. But prints True if print(a) is uncommented.

eric-wieser · 2018-08-03T06:09:14Z

Is accessing the element sufficient? So replacing print(a) with tmp = a[0]?

konlai · 2018-08-03T06:12:33Z

> Is accessing the element sufficient? So replacing print(a) with tmp = a[0]?

No. The pandas.Series still contains garbage if I do that.

eric-wieser · 2018-08-03T06:14:49Z

How about tmp = str(a[0])? Edit: I can repro this on master with pandas 0.23.1


konlai · 2018-08-03T06:17:06Z

> How about tmp = str(a[0])?

Yes. That fixes the problem.

eric-wieser · 2018-08-03T06:18:28Z

Something very strange is going on. After some poking around, I got this behavior:

>>> b = numpy.array([b'abce'], "V4")  # note the e
>>> pandas.Series(b).apply(lambda x: x.decode('latin-1'))
0    abcd
dtype: object

Seems we're using uninitialized memory from somewhere.

eric-wieser · 2018-08-03T06:21:07Z

Printing pandas.Series(a) seems to error in pandas for an unrelated reason

eric-wieser · 2018-08-03T06:24:56Z

Well, I've found one bug

return PyBytes_FromStringAndSize(PyArray_DATA(ap), descr->elsize);

should be

return PyBytes_FromStringAndSize(ip, descr->elsize);

For length-1 arrays like the one in your example, though, I can't see why that would matter.

This causes:

>>> b = numpy.array([b'abcd', b'efgh'], "V4")
>>> b.astype('O')
array([b'abcd', b'abcd'], dtype=object)

I think this slipped through the cracks since I removed almost all the occurrences of TYPE_getitem that weren't used on 0d arrays, where PyArray_DATA(ap) == ip. .astype is one of the few places where that call remains.
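The pointer mix-up can be modelled in pure Python, which may make the symptom easier to see. This is only an illustrative sketch, not NumPy's implementation: the function names are hypothetical, a bytes object stands in for the array buffer, and an integer offset plays the role of the element pointer ip.

```python
def getitem_buggy(buffer, offset, elsize):
    # Models reading from the array's base data pointer:
    # the element offset is ignored, so every call sees element 0.
    return bytes(buffer[0:elsize])

def getitem_fixed(buffer, offset, elsize):
    # Models reading from the element pointer (ip): each call sees its own bytes.
    return bytes(buffer[offset:offset + elsize])

def cast_to_objects(buffer, elsize, getitem):
    # Models an element-by-element cast: one getitem call per element.
    return [getitem(buffer, off, elsize) for off in range(0, len(buffer), elsize)]

data = b"abcdefgh"  # two 4-byte elements, like the "V4" array above
print(cast_to_objects(data, 4, getitem_buggy))  # [b'abcd', b'abcd']
print(cast_to_objects(data, 4, getitem_fixed))  # [b'abcd', b'efgh']
```

The buggy variant reproduces the repeated first element shown in the astype('O') output above. It does not by itself explain the garbage seen with a single-element array.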

charris · 2018-08-03T15:19:31Z

The bad commit is a83af93 in #8157.

That seems to be the same change that @eric-wieser points to above.

charris · 2018-08-03T16:02:18Z

Fixing the bug found by @eric-wieser fixes the immediate problem. The pandas.Series(a) printing problem is also in NumPy v1.10.0 and I suspect a Pandas bug.

charris · 2018-08-03T17:42:49Z

@eric-wieser Any idea how to test the fix without using pandas? I'm coming up dry.

eric-wieser · 2018-08-03T17:54:11Z

I gave a failing test case in my comment above
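For reference, the two-element astype example can be turned into a pandas-free check along these lines. This is a sketch, assuming a NumPy build that includes the fix. The helper name void_elements is hypothetical, and bytes() is applied to each element so that the comparison does not depend on the exact object type the cast returns.

```python
import numpy as np

def void_elements(arr):
    """Cast a void array to object dtype and return each element's raw bytes."""
    return [bytes(item) for item in arr.astype(object)]

b = np.array([b"abcd", b"efgh"], dtype="V4")

# With the pointer bug, every element repeats the first one.
# With the fix, each element round-trips to its own bytes.
assert void_elements(b) == [b"abcd", b"efgh"]
```

Using more than one element matters here, because a single-element array cannot show the repeated-first-element symptom.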

The return value for a void array was not correct. Closes numpy#11668.

charris added the 06 - Regression label Aug 3, 2018

charris added this to the 1.15.1 release milestone Aug 3, 2018

eric-wieser changed the title ~~Strange problem when creating a pandas.Series from ndarray since 1.15.0~~ Strange problem when creating a pandas.Series from void ndarray since 1.15.0 Aug 3, 2018

charris mentioned this issue Aug 3, 2018

BUG: Fix regression in void_getitem #11669

Merged

charris added a commit to charris/numpy that referenced this issue Aug 6, 2018

BUG: Fix regression in void_getitem

ec1e79f

The return value for a void array was not correct. Closes numpy#11668.

eric-wieser closed this as completed in #11669 Aug 7, 2018

charris added a commit to charris/numpy that referenced this issue Aug 7, 2018

BUG: Fix regression in void_getitem

e5a6812

The return value for a void array was not correct. Closes numpy#11668.

charris mentioned this issue Aug 7, 2018

BUG: Fix regression in void_getitem #11682

Merged
