Structured arrays with offsets not working in 1.14.2 but OK in 1.13.3 #10752

rfrenchseti · 2018-03-15T17:06:21Z

NumPy 1.14.x always uses a zero offset for strings in a structured array.

Python 2.7.13 |Anaconda 4.4.0 (64-bit)| (default, Dec 20 2016, 23:09:15)
>>> import numpy as np
>>> np.version.full_version
'1.13.3'
>>> s='abcdefghijkl'
>>> dt=(np.dtype({'f1':('S3',2),'f2':('S4',7)}))
>>> np.array(s, dtype=dt)
array(('cde', 'hijk'),     <<== OK
      dtype={'names':['f1','f2'], 'formats':['S3','S4'], 'offsets':[2,7], 'itemsize':11})

but:

>>> import numpy as np
>>> np.version.full_version
'1.14.2'
>>> s='abcdefghijkl'
>>> dt=(np.dtype({'f1':('S3',2),'f2':('S4',7)}))
>>> np.array(s, dtype=dt)
array(('abc', 'abcd'),   <<== BROKEN
      dtype={'names':['f1','f2'], 'formats':['S3','S4'], 'offsets':[2,7], 'itemsize':11})

charris · 2018-03-15T17:20:41Z

Hmm ...

eric-wieser · 2018-03-15T17:31:20Z

Some more exploration:

>>> np.__version__
'1.14.1'
>>> dt = np.dtype({'f1': ('S3', 2), 'f2': ('S4', 7)})

>>> s12 = b'abcdefghijkl'
>>> s12_arr = np.array(s12)
>>> np.array(s12, dt)
array((b'abc', b'abcd'),
      dtype={'names':['f1','f2'], 'formats':['S3','S4'], 'offsets':[2,7], 'itemsize':11})
>>> np.array(s12_arr, dt)
array((b'abc', b'abcd'),
      dtype={'names':['f1','f2'], 'formats':['S3','S4'], 'offsets':[2,7], 'itemsize':11})
>>> s12_arr.view(dt)
ValueError: Changing the dtype of a 0d array is only supported if the itemsize is unchanged

>>> s11 = b'abcdefghijk'
>>> s11_arr = np.array(s11)
>>> np.array(s11, dt)
array((b'abc', b'abcd'),
      dtype={'names':['f1','f2'], 'formats':['S3','S4'], 'offsets':[2,7], 'itemsize':11})
>>> np.array(s11_arr, dt)
array((b'abc', b'abcd'),
      dtype={'names':['f1','f2'], 'formats':['S3','S4'], 'offsets':[2,7], 'itemsize':11})
>>> s11_arr.view(dt)
array((b'cde', b'hijk'),
      dtype={'names':['f1','f2'], 'formats':['S3','S4'], 'offsets':[2,7], 'itemsize':11})

At least this last case works...

eric-wieser · 2018-03-15T17:33:53Z

The problem occurs for unstructured np.void_ types (V3) too

eric-wieser · 2018-03-15T17:36:22Z

Oh, actually this is by design. Running the above on 1.13 gives:

>>> s12 = b'abcdefghijkl'
>>> s12_arr = np.array(s12)
>>> np.array(s12, dt)
array((b'abc', b'abcd'),
      dtype={'names':['f1','f2'], 'formats':['S3','S4'], 'offsets':[2,7], 'itemsize':11})
>>> np.array(s12_arr, dt)
array((b'cde', b'hijk'),
      dtype={'names':['f1','f2'], 'formats':['S3','S4'], 'offsets':[2,7], 'itemsize':11})

1.14 fixed this inconsistency by treating string arrays the same way as string scalars.

ahaldane · 2018-03-15T17:38:34Z

This is sort of expected, as recently discussed here, and the new behavior is the "desired" behavior:
https://mail.python.org/pipermail/numpy-discussion/2018-January/077642.html

The problem is that your original code relies on dangerous behavior that caused many hard-to-debug bugs, such as #7058, #6314, #2353, and #3351.

In 1.13, numpy would assign to void arrays as if reading from a byte buffer; in other words, it ignored casting, endianness, and itemsize.

While it does seem like we should have worked harder to have a round of deprecation, the new behavior avoids these problems by (more correctly) casting instead of viewing. asarray and np.array are supposed to cast, not view.

You can achieve your old behavior using a view:

import numpy as np

s = 'abcdefghijkl'
dt = np.dtype({'f1':('S3',2),'f2':('S4',7)})
# Store the string as 11 raw bytes, then reinterpret those bytes as dt
np.array([s], dtype='S11').view(dt)
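
A minimal sketch contrasting the two paths, assuming NumPy 1.14 or later and a bytes input. The names `cast` and `viewed` are illustrative and not from the thread, and the dtype is written in the names/formats/offsets form rather than the form used in the original report:

```python
import numpy as np

dt = np.dtype({'names': ['f1', 'f2'], 'formats': ['S3', 'S4'],
               'offsets': [2, 7], 'itemsize': 11})
s = b'abcdefghijkl'

# Casting: the value is assigned to each field, so the offsets play no role.
cast = np.array(s, dtype=dt)

# Viewing: the stored 11 bytes are reinterpreted in place, so the offsets apply.
viewed = np.array([s], dtype='S11').view(dt)

print(cast)    # both fields are filled from the start of s
print(viewed)  # fields are read from byte offsets 2 and 7
```

The cast result matches what 1.14 prints in the original report, and the view result matches what 1.13 printed.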

rfrenchseti · 2018-03-15T18:05:07Z

Hmmmm, OK. Thanks for the example code, which appears to work the same under both 1.13 and 1.14. I don't fully understand the implications of what changed, but it still seems strange to me that you can give an offset in a dtype as in the original example and it is ignored. Is there some way to make it throw an exception instead?

However, the bigger problem seems to be that I can't access the entire string this way. You're using S11 for a 12-character string, and using S12 doesn't work. If I use S11 and change the last field to be 5 characters long, I still don't get the final "l" in the string. Why is this and is there some kind of workaround?

ahaldane · 2018-03-15T21:30:29Z

The idea is the byte-offsets are a property of the memory layout of the data, and memory layout should not matter when assigning values. This is like endianness: The integer "1" is the same value on big-endian and little-endian machines even though the byte layout is different.
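
The endianness analogy can be sketched as follows. This is only an illustration with explicit byte-order codes; the variable names are made up for the example:

```python
import numpy as np

little = np.array(1, dtype='<i4')
big = np.array(1, dtype='>i4')

print(little == big)       # the values compare equal
print(little.tobytes())    # b'\x01\x00\x00\x00'
print(big.tobytes())       # b'\x00\x00\x00\x01'

# A cast keeps the value and rewrites the bytes.
recast = big.astype('<i4')

# A view keeps the bytes and changes the value they are read as.
reinterpreted = big.view('<i4')
print(int(recast), int(reinterpreted))
```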

When you do np.array([s], dtype=dt), you are assigning the value "s" to the first element of the array. We want the result of the assignment not to depend on the details of the memory layout. Your particular example assigns a scalar (string) to a structured array, which is a scenario that we made fall under the rule here (link) in 1.14: The value "s" gets assigned to each field separately. (This actually already happens in numpy 1.13 in many situations, we just made your particular assignment behave like the rest).
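
As a small illustration of that rule (a sketch assuming NumPy 1.14 or later; the field names `a`, `b`, `f1`, `f2` and the packed dtypes are chosen only for the example), a scalar assigned to a structured array ends up in every field:

```python
import numpy as np

# Numeric case: the scalar 3 is assigned to each field of each element.
arr = np.zeros(2, dtype=[('a', 'i4'), ('b', 'f8')])
arr[:] = 3
print(arr)

# String case: each field receives the string, truncated to the field width.
rec = np.array(b'abcdefghijkl', dtype=[('f1', 'S3'), ('f2', 'S4')])
print(rec)
```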

Views in numpy, on the other hand, do depend on the memory layout, so they are affected by the offset. The offset is not ignored for views. If you want to view an S12 string as a structured type, you have to create a structured type that is 12 bytes long. So you can do:

import numpy as np

s = 'abcdefghijkl'
# itemsize must match the 12-byte string for the view to be valid
dt = np.dtype({'names': ['f1', 'f2'], 'formats': ['S3', 'S4'],
               'offsets': [2, 7], 'itemsize': 12})
np.array([s], dtype='S12').view(dt)
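
To reach the final "l" as well, one option is to add a field that covers byte 11. The sketch below assumes a hypothetical third field `f3`, which is not part of the original dtype, and also shows why the earlier S11 workaround could never see that byte:

```python
import numpy as np

s = b'abcdefghijkl'

# Converting to S11 truncates the value, so the last byte is gone before any view.
truncated = np.array([s], dtype='S11')
print(truncated[0])

# A 12-byte structured dtype; 'f3' is a hypothetical field for the last byte.
dt12 = np.dtype({'names': ['f1', 'f2', 'f3'],
                 'formats': ['S3', 'S4', 'S1'],
                 'offsets': [2, 7, 11],
                 'itemsize': 12})
rec = np.array([s], dtype='S12').view(dt12)
print(rec['f1'][0], rec['f2'][0], rec['f3'][0])
```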

Some other notes:

First, and somewhat unrelated, you are using a special dictionary-based form of dtype specification ({'f1':('S3',2),'f2':('S4',7)}) which is discouraged; see format 4 in the user guide (link). In the code snippet just above I switched to the other dictionary-based specification, which is more reliable. I recommend avoiding style 4 (where the keys are field names).
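
For comparison, here is a sketch of the two dictionary forms side by side. The names `discouraged` and `preferred` are illustrative, and it assumes the installed NumPy still accepts the field-name-keyed form:

```python
import numpy as np

# Style 4: keys are field names, values are (format, offset) pairs.
discouraged = np.dtype({'f1': ('S3', 2), 'f2': ('S4', 7)})

# Names/formats/offsets form: the layout is spelled out explicitly.
preferred = np.dtype({'names': ['f1', 'f2'], 'formats': ['S3', 'S4'],
                      'offsets': [2, 7]})

# Print each field's format and byte offset for the preferred form.
for name in preferred.names:
    fmt, offset = preferred.fields[name][:2]
    print(name, fmt, offset)
```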

Second, note that your code does not work in Python 3 even with numpy 1.13, which is another sign of the bugginess we fixed.

And third, just to illustrate the weirdness of the 1.13 behavior in Python 2, consider:

>>> s = 'abcdefghijkl'
>>> np.array(s, dtype='i4,S1,f4')
array((1684234849, 'e',   1.75599422e+25),
      dtype=[('f0', '<i4'), ('f1', 'S1'), ('f2', '<f4')])

rfrenchseti · 2018-03-15T21:57:49Z

Thank you for taking the time to explain fully. I've updated my code and it is working properly now under Python 3.

I'm a little confused by your final example, though. That behavior is actually what I would expect.

eric-wieser · 2018-03-16T02:30:40Z

> That behavior is actually what I would expect.

The behavior there is reasonable, but np.array is the wrong name for it. np.frombuffer is for converting from a buffer (bytes) to an arbitrary dtype via memory interpretation.
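
A rough sketch of that approach, reusing the 'i4,S1,f4' layout from the example above. It assumes explicit little-endian codes so the result does not depend on the machine, and uses `count=1` because the 12-byte buffer is not a multiple of the packed 9-byte item size:

```python
import numpy as np

buf = b'abcdefghijkl'

# Packed layout: 4 + 1 + 4 = 9 bytes per item.
dt = np.dtype('<i4,S1,<f4')

# Interpret the first 9 bytes of the buffer as one record of dtype dt.
rec = np.frombuffer(buf, dtype=dt, count=1)
print(rec)
```

The same call with the structured dtype from the original report reads the fields from their byte offsets, which is the memory-interpretation behavior that 1.13 gave through np.array.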

ahaldane · 2018-04-25T16:08:29Z

I'm going to close this one because it was an intentional change, to get it off the 1.14.3 issue list.

Feel free to reopen if there is more to discuss.

charris added this to the 1.14.3 release milestone Mar 15, 2018

charris added the component: numpy._core label Mar 15, 2018

eric-wieser added 00 - Bug 06 - Regression labels Mar 15, 2018

eric-wieser removed the 00 - Bug label Mar 15, 2018

ahaldane mentioned this issue Mar 23, 2018

BUG: Array copy does not zero out gaps in structured dtypes #10789

Closed

ahaldane closed this as completed Apr 25, 2018
