Numpy does not recognize ctypes arrays with c_wchar field · Issue #10100 · numpy/numpy · GitHub

Closed
macat opened this issue Nov 27, 2017 · 28 comments

@macat
macat commented Nov 27, 2017

Casting an array of ctypes Structure objects to a numpy array usually results in a numpy structured array. But if the Structure contains a c_wchar array (string) field, the result is an object array instead. I could not find the reason why this is happening.

Example:

import numpy as np
import ctypes

class A(ctypes.Structure):
    _fields_ = [('s', ctypes.c_wchar * 5)]

np.array((A*2)())
# array([<__main__.A object at 0x7fa99678f9d8>,
#       <__main__.A object at 0x7fa98804f1e0>], dtype=object)

While:

class B(ctypes.Structure):
    _fields_ = [('s', ctypes.c_char * 5)]

np.array((B*2)())
# array([([b'', b'', b'', b'', b''],), ([b'', b'', b'', b'', b''],)],
#      dtype=[('s', 'S1', (5,))])

Environment:
Python 3.6, numpy 1.13.1, Ubuntu 16.04

@macat
Author
macat commented Nov 27, 2017

I'm not familiar with the numpy source code. I tried to trace the code to the point where it checks the type of the ctypes array field, and found this:

else if (type == (PyObject *) &PyCharacterArrType_Type) {

Is it possible that this line does not work with c_wchar type?

@eric-wieser
Member
eric-wieser commented Nov 27, 2017

Numpy is trying and failing to parse this type using the PEP3118 buffer formats:

>>> memoryview((A*2)()).format
 'T{(5)<u:s:}'

Because parsing fails, numpy falls back to an object array.

Unfortunately, numpy doesn't yet have a wide (UCS2) character type - it only has bytes, and UCS4. I suppose we could fall back on either np.uint16 or np.void here, but nothing's ideal.
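As a rough illustration of the point above, the format strings for the two structures from the original report can be compared side by side. This is only a sketch using the standard library; the exact strings may differ between Python versions and platforms.

```python
import ctypes

class A(ctypes.Structure):
    _fields_ = [('s', ctypes.c_wchar * 5)]

class B(ctypes.Structure):
    _fields_ = [('s', ctypes.c_char * 5)]

# The buffer protocol (PEP 3118) format string that ctypes reports for each array.
fmt_wide = memoryview((A * 2)()).format
fmt_bytes = memoryview((B * 2)()).format

print("c_wchar struct:", fmt_wide)   # the thread reports 'T{(5)<u:s:}'
print("c_char struct: ", fmt_bytes)
```

The only difference between the two strings should be the type code of the field, which is the part numpy has to understand.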

@ahaldane
Member
ahaldane commented Nov 27, 2017

Just some initial thoughts looking at this:

This has something to do with the PEP3118 "buffer" interface. See https://www.python.org/dev/peps/pep-3118/

What happens is that the ctypes object exposes a PEP 3118 interface, and numpy tries to read the data using that interface. Python's memoryview gives you some access to that raw intermediate interface:

import numpy as np
import ctypes

class A(ctypes.Structure):
    _fields_ = [('s', ctypes.c_wchar * 5)]

m = memoryview((A*2)())
m.format
# 'T{(5)<u:s:}'

The memoryview format attribute gives you the datatype using the PEP 3118 format mini-language. Ctypes encodes your struct using the u PEP 3118 character. (See the character table in the 3118 doc.)

Numpy has to translate this mini-language into a numpy dtype. This happens in numpy/core/src/multiarray/buffer.c with some code in numpy/core/_internal.py. You can see functions in both files which convert pep3118 format chars to numpy dtypes. I'm not sure why there is so much duplicate code. But I also see that we don't have any code to handle the u character, which the 3118 doc says is a ucs-2 character. We only know how to handle the c and w characters, for ucs-1 (latin-1) and ucs-4 respectively.

Numpy can't handle ucs-2 internally, so it can't represent that format, so it fails and returns an object array (which is kind of ugly..).
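A toy version of such a translation step might look like the sketch below. The table and the translate function are hypothetical and heavily simplified; they are not numpy's actual code and only cover a few type codes, but they show why an unknown code such as u leaves the converter with nothing to return.

```python
# Simplified, hypothetical lookup table: only a few codes are listed, and this
# is not numpy's actual implementation.
PEP3118_TO_DTYPE = {
    "c": "S1",  # single byte character
    "w": "U1",  # UCS-4 character
    "b": "i1",  # signed 8-bit integer
    "f": "f4",  # 32-bit float
}

def translate(code):
    """Return a numpy-style dtype string for one PEP 3118 type code."""
    try:
        return PEP3118_TO_DTYPE[code]
    except KeyError:
        raise NotImplementedError(f"no dtype for PEP 3118 code {code!r}") from None

print(translate("c"))
print(translate("w"))
try:
    translate("u")
except NotImplementedError as exc:
    print("unsupported:", exc)
```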

So one train of thought is that numpy just can't convert that ctypes object to an array, and we should be raising an error message somewhere instead of converting to object.

But it also occurs to me that maybe this is a ctypes bug. Why is it encoding things as ucs-2? Is it really doing that, or is it incorrectly setting the pep3118 format?

Do you happen to know what the encoding of unicode wchar strings is in ctypes?

@ahaldane
Member
ahaldane commented Nov 27, 2017

wchar_t seems to have portability problems because it is compiler-dependent, but at least on my system I get:

>>> ctypes.sizeof(ctypes.c_wchar)
4
>>> memoryview(ctypes.c_wchar()).format
'<u'

The first line suggests it is actually ucs-4.

This makes me think this is a ctypes bug, and ctypes should use w in this case.

@eric-wieser
Member
eric-wieser commented Nov 27, 2017

It gives 2 on my system, suggesting that ctypes is not checking which implementation-defined wchar it is using.
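A quick way to check which case applies on a given machine is to print both values together. This is a minimal sketch using only ctypes; the width descriptions in the dictionary are my own labels, not anything ctypes reports.

```python
import ctypes

# Size of the platform's wchar_t as seen by ctypes, and the format code it reports.
wchar_size = ctypes.sizeof(ctypes.c_wchar)
wchar_format = memoryview(ctypes.c_wchar()).format

# Descriptive labels only (an assumption for illustration).
widths = {2: "16-bit (UCS-2/UTF-16 code units)", 4: "32-bit (UCS-4)"}
print("sizeof(c_wchar):", wchar_size, "->", widths.get(wchar_size, "unexpected width"))
print("reported format:", wchar_format)
```

If the reported format code is the same on a 2-byte and a 4-byte platform, the format string alone cannot tell a consumer how wide each character is.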

@pv
Member
pv commented Nov 27, 2017 via email

@alanhdu
alanhdu commented Nov 27, 2017

I'm not sure this is the same bug, but there's also something funky with ctypes.c_byte and ctypes.c_char:

In [1]: import ctypes

In [2]: import numpy as np

In [3]: class Test(ctypes.Structure):
   ...:    # using ctypes.c_float works fine
   ...:     _fields_ = [("a", ctypes.c_byte * 10), ("b", ctypes.c_float)]
   ...:     

In [4]: x = (Test * 5)()

In [5]: x[0].b = 3

In [6]: np.array(x)
/home/alan/workspace/scratch/py3.venv/bin/ipython:1: RuntimeWarning: Item size computed from the PEP 3118 buffer format string does not match the actual item size.
  #!/home/alan/workspace/scratch/py3.venv/bin/python3.6
Out[6]: 
array([([0, 0, 0, 0, 0, 0, 0, 0, 0, 0],  0.),
       ([0, 0, 0, 0, 0, 0, 0, 0, 0, 0],  0.),
       ([0, 0, 0, 0, 0, 0, 0, 0, 0, 0],  0.),
       ([0, 0, 0, 0, 0, 0, 0, 0, 0, 0],  0.),
       ([0, 0, 0, 0, 0, 0, 0, 0, 0, 0],  0.)],
      dtype=[('a', 'i1', (10,)), ('b', '<f4')])

(Notice that the values are all 0 instead of having one of the b values be 3.)

This is with NumPy 1.13.3 and Python 3.6

@eric-wieser
Member
eric-wieser commented Nov 27, 2017

Numpy 1.14 and python 3.5 give:

ValueError: invalid literal for int() with base 10: b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'

My guess is that structure padding is causing your problem there.

@pv
Member
pv commented Nov 27, 2017 via email

@alanhdu
alanhdu commented Nov 27, 2017

Yup -- it's definitely a padding issue (I guess we'll just make sure that all our byte and char arrays have lengths that are divisible by 4 for now 😆).

Possibly a dumb question: should ctypes output the padding bytes as part of its format explicitly or should that be inferred on the NumPy side? The only documentation I could find was about the struct module, which says:

Padding is only automatically added between successive structure members. No padding is added at the beginning or the end of the encoded struct.

and struct.unpack("cccf", ...) automatically adds the padding byte for you (i.e. it's equivalent to struct.unpack("cccxf", ...) AFAICT).

@ahaldane
Member

Padding bytes are a bit of a mess too. I tried working on this in #7798, which is on hiatus, but we should get back to it one day.

You can see that I progressively discovered new understandings of how padding bytes are supposed to be interpreted, and some fundamental problems with the 3118 interface.

@pv
Member
pv commented Nov 27, 2017

@alanhdu: note that "<cccf" is not equivalent to "cccxf" --- the issue is with the alignment specifiers. The implicit padding is only for the "native" alignment mode, but ctypes emits format strings that switch away from that. ctypes should emit a valid description of the actual layout of the object in memory (in any way allowed by the spec), but currently it does not.
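To make the difference concrete, here is a small sketch using the standard struct module and the Test structure from earlier in the thread. It only prints sizes and offsets; the exact ctypes format string depends on the Python version, so it is shown without an expected value.

```python
import ctypes
import struct

# Native mode ("@", the default) inserts alignment padding between members;
# standard mode ("<") never does.
native = struct.calcsize("cccf")
explicit = struct.calcsize("cccxf")
standard = struct.calcsize("<cccf")
print("cccf:", native, " cccxf:", explicit, " <cccf:", standard)

class Test(ctypes.Structure):
    _fields_ = [("a", ctypes.c_byte * 10), ("b", ctypes.c_float)]

# The compiler-style layout places b at an aligned offset, leaving a gap after a.
print("offset of b:", Test.b.offset)
print("sizeof(Test):", ctypes.sizeof(Test))
print("format:", memoryview(Test()).format)
```

If the printed format uses a "<" prefix but contains no explicit padding bytes, the size implied by the format is smaller than sizeof(Test), which matches the RuntimeWarning quoted above.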

@alanhdu
alanhdu commented Nov 29, 2017

@ahaldane I read through #7798, and now I think I'm just more confused 😆. If I understand correctly, the "solution" here is to patch cpython to output padding bytes in ctypes to force things to be aligned, right?

@pv
Member
pv commented Nov 30, 2017 via email

@eric-wieser
Member
eric-wieser commented Feb 6, 2018

@pv: Little slow on the uptake here, but I've attempted that in python/cpython#5561

@pv
Member
pv commented Feb 6, 2018 via email

@eric-wieser
Member
eric-wieser commented Feb 6, 2018

That comment might be better placed on that PR.

Isn't Q required to be 8 bytes? I chose it deliberately to try to avoid the mess that is int/long having multiple representations. So I suppose the issue only exists on platforms where sizeof(long long) == 4, and I may as well disable the test entirely there, since it relies on having a fixed-width type.

@pv
Member
pv commented Feb 6, 2018 via email

@eric-wieser
Member

Right, but it's still required to serialize as Q in that case, because the endianness specifier kicks us into standard-size mode.

@pv
Member
pv commented Feb 6, 2018 via email

@pv
Member
pv commented Feb 6, 2018 via email

@eric-wieser
Member
eric-wieser commented Feb 6, 2018

All that matters is whether sizeof(c_ulonglong) == 8. If sizeof(c_ulong) == 8 too (such as when they are identical), then it's required to serialize to <Q as well.

Either way, I've updated the PR to use c_uint64, which will at least NameError on platforms where sizeof(c_ulonglong) == 4 rather than producing a confusingly failing test.
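The standard-size rules being discussed can be checked with the struct module. The sketch below compares standard and native sizes and prints what ctypes reports for a few unsigned types; the ctypes format strings are printed without expected values because they vary by platform and Python version.

```python
import ctypes
import struct

# With an explicit byte order, struct uses standard sizes regardless of platform.
print("<Q:", struct.calcsize("<Q"))
print("<L:", struct.calcsize("<L"), " <I:", struct.calcsize("<I"))

# In native mode, L follows the platform's C unsigned long.
print("@L:", struct.calcsize("@L"), " sizeof(c_ulong):", ctypes.sizeof(ctypes.c_ulong))

# What ctypes itself reports for each type (on some platforms two of these
# names refer to the same class).
for name in ("c_uint", "c_ulong", "c_ulonglong"):
    ctype = getattr(ctypes, name)
    print(name, ctypes.sizeof(ctype), memoryview(ctype()).format)
```

A format that pairs "<" with L for an 8-byte type would be inconsistent with the standard sizes printed in the first two lines.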

@pv
Member
pv commented Feb 6, 2018 via email

@eric-wieser
Member

Since you're here - do you know which of i and l c_int and c_long produce in standard size mode, if they are both 4 bytes?

@pv
Member
pv commented Feb 6, 2018 via email

@eric-wieser
Member

Seems weird to me not to drop l/L entirely and just use i/I in all cases - neither PEP3118 nor the struct docs attach any meaning to the distinction.

@eric-wieser
Member

Looking at _ctypes_alloc_format_string_for_type, ctypes assumes that sizeof(long long) == 8 anyway, so a lot more than just my test will fail on such a hypothetical platform.

@eric-wieser
Member
eric-wieser commented Nov 19, 2018

In 1.16, as of #12254, this now gives:

> np.array((A*2)())
NotImplementedError: Unrepresentable PEP 3118 data type 'u' (UCS-2 strings)

The above exception was the direct cause of the following exception:

ValueError: 'T{(5)<u:s:}' is not a valid PEP 3118 buffer format string

A little weirdly, the following also fails with the same error:

np.array((A*2)(), dtype=object)

However, constructing object arrays with ctypes values at each entry was never supported anyway.
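For anyone who needs the data in numpy regardless, one possible workaround (a sketch, not something proposed in this thread beyond the earlier np.uint16 idea) is to skip the PEP 3118 format string and describe the layout by hand with np.frombuffer. The dtype below is an assumption that each wchar_t can be viewed as an unsigned integer of the same width; the numpy import is guarded so the snippet still runs where numpy is not installed.

```python
import ctypes

try:
    import numpy as np
except ImportError:  # numpy is optional for this sketch
    np = None

class A(ctypes.Structure):
    _fields_ = [('s', ctypes.c_wchar * 5)]

arr = (A * 2)()
arr[0].s = "abc"

if np is not None:
    # Pick an unsigned integer as wide as one wchar_t on this platform.
    unit = np.dtype("u%d" % ctypes.sizeof(ctypes.c_wchar))
    dtype = np.dtype([("s", unit, (5,))])
    # frombuffer reads the raw bytes and ignores the buffer's format string.
    view = np.frombuffer(arr, dtype=dtype)
    print(view["s"])
```

The result holds raw code units rather than strings, so any decoding to text is left to the caller.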
