Structured arrays with offsets not working in 1.14.2 but OK in 1.13.3 · Issue #10752 · numpy/numpy · GitHub

Structured arrays with offsets not working in 1.14.2 but OK in 1.13.3 #10752

Closed
rfrenchseti opened this issue Mar 15, 2018 · 10 comments

Comments

@rfrenchseti
rfrenchseti commented Mar 15, 2018

NumPy 1.14.x always uses a zero offset for strings in a structured array.

Python 2.7.13 |Anaconda 4.4.0 (64-bit)| (default, Dec 20 2016, 23:09:15)
>>> import numpy as np
>>> np.version.full_version
'1.13.3'
>>> s='abcdefghijkl'
>>> dt=(np.dtype({'f1':('S3',2),'f2':('S4',7)}))
>>> np.array(s, dtype=dt)
array(('cde', 'hijk'),     <<== OK
      dtype={'names':['f1','f2'], 'formats':['S3','S4'], 'offsets':[2,7], 'itemsize':11})

but:

>>> import numpy as np
>>> np.version.full_version
'1.14.2'
>>> s='abcdefghijkl'
>>> dt=(np.dtype({'f1':('S3',2),'f2':('S4',7)}))
>>> np.array(s, dtype=dt)
array(('abc', 'abcd'),   <<== BROKEN
      dtype={'names':['f1','f2'], 'formats':['S3','S4'], 'offsets':[2,7], 'itemsize':11})
@charris
Member
charris commented Mar 15, 2018

Hmm ...

charris added this to the 1.14.3 release milestone Mar 15, 2018
@eric-wieser
Member
eric-wieser commented Mar 15, 2018

Some more exploration:

>>> np.__version__
'1.14.1'
>>> dt = np.dtype({'f1': ('S3', 2), 'f2': ('S4', 7)})

>>> s12 = b'abcdefghijkl'
>>> s12_arr = np.array(s12)
>>> np.array(s12, dt)
array((b'abc', b'abcd'),
      dtype={'names':['f1','f2'], 'formats':['S3','S4'], 'offsets':[2,7], 'itemsize':11})
>>> np.array(s12_arr, dt)
array((b'abc', b'abcd'),
      dtype={'names':['f1','f2'], 'formats':['S3','S4'], 'offsets':[2,7], 'itemsize':11})
>>> s12_arr.view(dt)
ValueError: Changing the dtype of a 0d array is only supported if the itemsize is unchanged

>>> s11 = b'abcdefghijk'
>>> s11_arr = np.array(s11)
>>> np.array(s11, dt)
array((b'abc', b'abcd'),
      dtype={'names':['f1','f2'], 'formats':['S3','S4'], 'offsets':[2,7], 'itemsize':11})
>>> np.array(s11_arr, dt)
array((b'abc', b'abcd'),
      dtype={'names':['f1','f2'], 'formats':['S3','S4'], 'offsets':[2,7], 'itemsize':11})
>>> s11_arr.view(dt)
array((b'cde', b'hijk'),
      dtype={'names':['f1','f2'], 'formats':['S3','S4'], 'offsets':[2,7], 'itemsize':11})

At least this last case works...
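
One possible reading of the two `.view` results above, offered as a sketch rather than something stated in the thread: the dtype spans 11 bytes (its last field ends at offset 7 + 4), so a 0-d `S11` array can be reinterpreted in place while a 0-d `S12` array cannot. The variable names are illustrative, and the dtype is written in the names/formats/offsets form instead of the field-name-keyed dict.

```python
import numpy as np

# Same layout as in the session above: f1 starts at byte 2, f2 at byte 7.
dt = np.dtype({'names': ['f1', 'f2'], 'formats': ['S3', 'S4'],
               'offsets': [2, 7]})
print(dt.itemsize)  # 11

s11_arr = np.array(b'abcdefghijk')   # 0-d array, dtype S11
s12_arr = np.array(b'abcdefghijkl')  # 0-d array, dtype S12

# Itemsizes match (11 == 11), so the bytes are reinterpreted in place.
viewed = s11_arr.view(dt)

# Itemsizes differ (12 != 11), and a 0-d array has no axis to resize.
try:
    s12_arr.view(dt)
    view_failed = False
except ValueError:
    view_failed = True
```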

@eric-wieser
Member

The problem also occurs for unstructured np.void_ types (e.g. V3).

@eric-wieser
Member
eric-wieser commented Mar 15, 2018

Oh, actually this is by design. Running the above on 1.13 gives:

>>> s12 = b'abcdefghijkl'
>>> s12_arr = np.array(s12)
>>> np.array(s12, dt)
array((b'abc', b'abcd'),
      dtype={'names':['f1','f2'], 'formats':['S3','S4'], 'offsets':[2,7], 'itemsize':11})
>>> np.array(s12_arr, dt)
array((b'cde', b'hijk'),
      dtype={'names':['f1','f2'], 'formats':['S3','S4'], 'offsets':[2,7], 'itemsize':11})

1.14 fixed this inconsistency by treating string arrays the same way as string scalars.

@ahaldane
Member
ahaldane commented Mar 15, 2018

This is sort of expected, as recently discussed here, and the new behavior is the "desired" behavior:
https://mail.python.org/pipermail/numpy-discussion/2018-January/077642.html

The problem is that your original code relies on some dangerous behavior that was causing many hard-to-debug bugs, such as #7058, #6314, #2353, and #3351.

In 1.13, numpy would assign to void arrays as if reading from a byte buffer; in other words, it ignored casting, endianness, and itemsize.

While it does seem like we should have worked harder to have a round of deprecation, the new behavior avoids these problems by (more correctly) casting instead of viewing. np.asarray and np.array are supposed to cast, not view.

You can achieve your old behavior using a view:

s = 'abcdefghijkl'
dt = np.dtype({'f1':('S3',2),'f2':('S4',7)})
np.array([s], dtype='S11').view(dt)

@rfrenchseti
Author

Hmmmm, OK. Thanks for the example code, which appears to work the same under both 1.13 and 1.14. I don't fully understand the implications of what changed, but it still seems strange to me that you can give an offset in a dtype as in the original example and it is ignored. Is there some way to make it throw an exception instead?

However, the bigger problem seems to be that I can't access the entire string this way. You're using S11 for a 12-character string, and using S12 doesn't work. If I use S11 and change the last field to be 5 characters long, I still don't get the final "l" in the string. Why is this and is there some kind of workaround?

@ahaldane
Member

The idea is that the byte offsets are a property of the memory layout of the data, and memory layout should not matter when assigning values. This is like endianness: the integer 1 is the same value on big-endian and little-endian machines even though the byte layout is different.

When you do np.array([s], dtype=dt), you are assigning the value "s" to the first element of the array. We want the result of the assignment not to depend on the details of the memory layout. Your particular example assigns a scalar (string) to a structured array, which is a scenario that we made fall under the rule here (link) in 1.14: The value "s" gets assigned to each field separately. (This actually already happens in numpy 1.13 in many situations, we just made your particular assignment behave like the rest).
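
To illustrate the rule that assignment should not depend on layout, here is a minimal sketch using a numeric scalar instead of the string from the issue. The two dtypes, `packed` and `padded`, are made up for this example: they have the same fields but different byte offsets, and assigning a scalar fills every field with the same value in both.

```python
import numpy as np

# Two hypothetical dtypes with identical fields but different memory layouts.
packed = np.dtype({'names': ['a', 'b'], 'formats': ['i4', 'f8']})
padded = np.dtype({'names': ['a', 'b'], 'formats': ['i4', 'f8'],
                   'offsets': [2, 7], 'itemsize': 16})

x = np.zeros(2, dtype=packed)
y = np.zeros(2, dtype=padded)

# Assigning a scalar casts it into each field separately;
# the offsets only decide where the bytes end up, not what the values are.
x[:] = 3
y[:] = 3

same_values = all(np.array_equal(x[name], y[name]) for name in packed.names)
print(same_values)
```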

Views in numpy, on the other hand, do depend on the memory layout, so they are affected by the offset; the offset is not ignored for views. If you want to view an S12 string as a structured type, you have to create a structured type that is 12 bytes long. So you can do:

s = 'abcdefghijkl'
dt = np.dtype({'names': ['f1', 'f2'], 'formats': ['S3', 'S4'], 
               'offsets': [2, 7], 'itemsize': 12})
np.array([s], dtype='S12').view(dt)
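
As a follow-up to the question about reaching the final "l", the same approach can be sketched with the second field widened to 5 bytes. The `S5` width and the variable names are assumptions made for illustration, and a bytes literal is used so the snippet also runs on Python 3.

```python
import numpy as np

s = b'abcdefghijkl'  # 12 bytes

# f2 is widened to S5 so that it covers bytes 7..11, including the final "l".
# The itemsize is 12 so that it matches the S12 array being viewed.
dt = np.dtype({'names': ['f1', 'f2'], 'formats': ['S3', 'S5'],
               'offsets': [2, 7], 'itemsize': 12})

rec = np.array([s], dtype='S12').view(dt)
f1 = rec['f1'][0]
f2 = rec['f2'][0]
print(f1, f2)
```

Because this is a view, no data is copied: `rec` shares its buffer with the `S12` array it was created from.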

Some other notes:

First, and somewhat unrelated, you are using a special dictionary-based form of dtype specification ({'f1':('S3',2),'f2':('S4',7)}) which is discouraged; see format 4 in the user guide (link). In the code snippet just above I switched it to the other dictionary-based specification, which is more reliable. I recommend avoiding style 4 (keys are field names).

Second, note that your code does not work in Python 3 even with numpy 1.13, which is another sign of the bugginess we fixed.

And third, just to illustrate the weirdness of the 1.13 behavior in Python 2, consider:

>>> s = 'abcdefghijkl'
>>> np.array(s, dtype='i4,S1,f4')
array((1684234849, 'e',   1.75599422e+25),
      dtype=[('f0', '<i4'), ('f1', 'S1'), ('f2', '<f4')])

@rfrenchseti
Author

Thank you for taking the time to explain fully. I've updated my code and it is working properly now under Python 3.

I'm a little confused by your final example, though. That behavior is actually what I would expect.

@eric-wieser
Member

> That behavior is actually what I would expect.

The behavior there is reasonable, but np.array is the wrong name for it. np.frombuffer is for converting from a buffer (bytes) to an arbitrary dtype via memory interpretation.
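
For readers who do want the byte-reinterpreting behavior, a minimal sketch with np.frombuffer might look like this. The variable names are illustrative, and the byte order is spelled out explicitly (`<i4`, `<f4`) so the result does not depend on the machine.

```python
import numpy as np

buf = b'abcdefghijkl'  # 12 raw bytes

# Reinterpret the buffer using the offsets from the original report.
dt = np.dtype({'names': ['f1', 'f2'], 'formats': ['S3', 'S4'],
               'offsets': [2, 7], 'itemsize': 12})
rec = np.frombuffer(buf, dtype=dt)

# The mixed-type example: the first 9 bytes read as int32, 1 byte, float32.
# count=1 is needed because 12 is not a multiple of the 9-byte itemsize.
mixed = np.dtype([('f0', '<i4'), ('f1', 'S1'), ('f2', '<f4')])
first = np.frombuffer(buf, dtype=mixed, count=1)

print(rec['f1'][0], rec['f2'][0])
print(first['f0'][0], first['f1'][0])
```

Arrays returned by np.frombuffer on a bytes object are read-only, so copy the result before modifying it.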

@ahaldane
Member

I'm going to close this one because it was an intentional change, to get it off the 1.14.3 issue list.

Feel free to reopen if there is more to discuss.
