Numpy does not recognize ctypes arrays with c_wchar field #10100

macat · 2017-11-27T15:44:15Z

Casting an array of ctypes Structure objects to a numpy array usually results in a numpy structured array. But if the Structure contains a c_wchar array (string) field, the result is an object array instead. I could not find the reason why this is happening.

Example:

import numpy as np
import ctypes

class A(ctypes.Structure):
    _fields_ = [('s', ctypes.c_wchar * 5)]

np.array((A*2)())
# array([<__main__.A object at 0x7fa99678f9d8>,
#       <__main__.A object at 0x7fa98804f1e0>], dtype=object)

While:

class B(ctypes.Structure):
    _fields_ = [('s', ctypes.c_char * 5)]

np.array((B*2)())
# array([([b'', b'', b'', b'', b''],), ([b'', b'', b'', b'', b''],)],
#      dtype=[('s', 'S1', (5,))])

Environment:
Python 3.6, numpy 1.13.1, Ubuntu 16.04


macat · 2017-11-27T16:47:22Z

I'm not familiar with the numpy source code. Tried to trace the code to the point where it checks the type of the ctype array field, and found this:

numpy/numpy/core/src/multiarray/scalarapi.c

Line 463 in a2bddfa

else if (type == (PyObject *) &PyCharacterArrType_Type) {

Is it possible that this line does not work with c_wchar type?

eric-wieser · 2017-11-27T17:16:03Z

Numpy is trying and failing to parse this type using the PEP3118 buffer formats:

>>> memoryview((A*2)()).format
 'T{(5)<u:s:}'

Because parsing fails, it falls back to an object array.

Unfortunately, numpy doesn't yet have a wide (UCS2) character type - it only has bytes, and UCS4. I suppose we could fall back on either np.uint16 or np.void here, but nothing's ideal.
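
A minimal sketch of the unsigned-integer fallback mentioned above, done by hand on the caller's side rather than inside numpy. It assumes np.frombuffer is an acceptable route here because it takes the dtype from the caller instead of parsing the PEP 3118 format string. The names buf, raw_dtype and units are illustrative only, and the code-unit width is read from ctypes.sizeof(ctypes.c_wchar), which is platform dependent.

```python
import ctypes
import numpy as np

class A(ctypes.Structure):
    _fields_ = [('s', ctypes.c_wchar * 5)]

buf = (A * 2)()
buf[0].s = "abc"

# Width of one wchar_t on this platform (commonly 2 on Windows, 4 on Linux).
wchar_size = ctypes.sizeof(ctypes.c_wchar)

# Describe each character as a raw unsigned code unit of that width.
raw_dtype = np.dtype([('s', 'u%d' % wchar_size, (5,))])
assert raw_dtype.itemsize == ctypes.sizeof(A)

# frombuffer shares memory with the ctypes array; no copy is made.
units = np.frombuffer(buf, dtype=raw_dtype)
print(units['s'][0])
```

This yields integer code units, not strings, so decoding them is still left to the caller.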

ahaldane · 2017-11-27T17:17:25Z

Just some initial thoughts looking at this:

This has something to do with the PEP3118 "buffer" interface. See https://www.python.org/dev/peps/pep-3118/

What happens is that the ctypes object exposes a pep3118 interface, and numpy tries to read the data using that interface. 'memoryviews' in python give you some kind of access to the raw interface intermediate:

[1]: import numpy as np
...: import ctypes
...: 
...: class A(ctypes.Structure):
...:     _fields_ = [('s', ctypes.c_wchar * 5)]
...: 
...: m = memoryview((A*2)())
...: m.format
...: 
'T{(5)<u:s:}'

The memoryview format attribute gives you the datatype using the pep3118 format mini-language. Ctypes encodes your struct using the u pep3118 character (see the character table in the 3118 doc).

Numpy has to translate this mini-language into a numpy dtype. This happens in numpy/core/src/multiarray/buffer.c with some code in numpy/core/_internal.py. You can see functions in both files which convert pep3118 format chars to numpy dtypes. I'm not sure why there is so much duplicate code. But I also see that we don't have any code to handle the u character, which the 3118 doc says is a ucs-2 character. We only know how to handle the c and w characters, for ucs-1 (latin-1) and ucs-4 respectively.

Numpy can't handle ucs-2 internally, so it can't represent that format, so it fails and returns an object array (which is kind of ugly..).
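
To make the failure mode concrete, here is a hypothetical, heavily simplified version of such a translation table. PEP3118_TO_DTYPE and dtype_for_code are made-up names for illustration, not numpy's actual code, and the table lists only a few codes. The point is only that a code with no entry (here u) has nothing to map to.

```python
import numpy as np

# Hypothetical lookup table for illustration; numpy's real tables are larger.
PEP3118_TO_DTYPE = {
    'b': np.dtype('i1'),  # signed char
    'f': np.dtype('f4'),  # float
    'c': np.dtype('S1'),  # one-byte character
    'w': np.dtype('U1'),  # UCS-4 character
}

def dtype_for_code(code):
    """Return the dtype for a single PEP 3118 type character."""
    try:
        return PEP3118_TO_DTYPE[code]
    except KeyError:
        raise NotImplementedError(
            "no numpy dtype for PEP 3118 code %r" % code) from None

print(dtype_for_code('c'))
try:
    dtype_for_code('u')
except NotImplementedError as exc:
    print(exc)
```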

So one train of thought is that numpy just can't convert that ctypes object to an array, and we should be raising an error message somewhere instead of converting to object.

But it also occurs to me that maybe this is a ctypes bug. Why is it encoding things as ucs-2? Is it really doing that, or is it incorrectly setting the pep3118 format?

Do you happen to know what the encoding of unicode wchar strings is in ctypes?

ahaldane · 2017-11-27T17:25:31Z

wchar_t seems to have portability problems because it is compiler-dependent, but at least on my system I get:

>>> ctypes.sizeof(ctypes.c_wchar)
4
>>> memoryview(ctypes.c_wchar()).format
'<u'

The first line suggests it is actually ucs-4.

This makes me think this is a ctypes bug, and ctypes should use w in this case.

eric-wieser · 2017-11-27T17:33:09Z

It gives 2 on my system, suggesting that ctypes is not checking to see which implementation-defined wchar it is using

pv · 2017-11-27T17:33:43Z

It's quite possible the pep3118 type codes in ctypes are buggy (its type codes for integer types are buggy in Python < 3.7.dev).

alanhdu · 2017-11-27T17:49:10Z

I'm not sure this is the same bug, but there's also something funky with the ctypes.c_byte and ctypes.c_char too:

In [1]: import ctypes

In [2]: import numpy as np

In [3]: class Test(ctypes.Structure):
   ...:    # using ctypes.c_float works fine
   ...:     _fields_ = [("a", ctypes.c_byte * 10), ("b", ctypes.c_float)]
   ...:     

In [4]: x = (Test * 5)()

In [5]: x[0].b = 3

In [6]: np.array(x)
/home/alan/workspace/scratch/py3.venv/bin/ipython:1: RuntimeWarning: Item size computed from the PEP 3118 buffer format string does not match the actual item size.
  #!/home/alan/workspace/scratch/py3.venv/bin/python3.6
Out[6]: 
array([([0, 0, 0, 0, 0, 0, 0, 0, 0, 0],  0.),
       ([0, 0, 0, 0, 0, 0, 0, 0, 0, 0],  0.),
       ([0, 0, 0, 0, 0, 0, 0, 0, 0, 0],  0.),
       ([0, 0, 0, 0, 0, 0, 0, 0, 0, 0],  0.),
       ([0, 0, 0, 0, 0, 0, 0, 0, 0, 0],  0.)],
      dtype=[('a', 'i1', (10,)), ('b', '<f4')])

(Notice that the values are all 0 instead of having one of the b values be 3.)

This is with NumPy 1.13.3 and Python 3.6

eric-wieser · 2017-11-27T17:56:32Z

Numpy 1.14 and python 3.5 gives:

ValueError: invalid literal for int() with base 10: b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'

My guess is that structure padding is causing your problem there
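
One way to check that guess, as a sketch: compare the size ctypes reports for the structure with the size of the same fields packed back to back. The format string "<10bf" is an assumption chosen to mirror the two fields of Test with no padding between them.

```python
import ctypes
import struct

class Test(ctypes.Structure):
    _fields_ = [("a", ctypes.c_byte * 10), ("b", ctypes.c_float)]

# Size and field offsets as ctypes lays the struct out in memory.
print(ctypes.sizeof(Test), Test.a.offset, Test.b.offset)

# Size of the same fields packed back to back with no padding.
packed_size = struct.calcsize("<10bf")
print(packed_size)
```

On a typical platform where float is 4-byte aligned, the first line should show a float offset of 12 and a total size of 16, while the packed size is 14; the two-byte difference is the padding.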

pv · 2017-11-27T18:27:47Z

Ctypes seems to produce a struct format that doesn't contain padding:

```
>>> memoryview(x).format
'T{(10)<b:a:<f:b:}'
```

At the moment it seems the ctypes pep3118 implementation needs some work before it can be relied on. In my experience, the only way to get this fixed is to do it yourself and then send the PRs to https://github.com/python/cpython; the cpython maintainers are of course up to their necks in PRs, but there is hope to get patches through even as a random contributor.

alanhdu · 2017-11-27T21:44:47Z

Yup -- it's definitely a padding issue (I guess we'll just make sure that all our byte and char arrays have lengths that are divisible by 4 for now 😆).

Possibly a dumb question: should ctypes output the padding bytes as part of its format explicitly or should that be inferred on the NumPy side? The only documentation I could find was about the struct module, which says:

Padding is only automatically added between successive structure members. No padding is added at the beginning or the end of the encoded struct.

and struct.unpack("cccf", ...) automatically adds the padding byte for you (i.e. it's equivalent to struct.unpack("cccxf", ...) AFAICT).
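
As an alternative to constraining array lengths, a possible workaround sketch is to skip format-string parsing entirely and describe the layout to numpy explicitly, using the offsets that ctypes itself reports. This assumes np.frombuffer is acceptable for the use case (it shares memory with the ctypes array); padded_dtype is an illustrative name.

```python
import ctypes
import numpy as np

class Test(ctypes.Structure):
    _fields_ = [("a", ctypes.c_byte * 10), ("b", ctypes.c_float)]

x = (Test * 5)()
x[0].b = 3

# Build the dtype from the offsets ctypes reports, so padding is honoured.
padded_dtype = np.dtype({
    'names': ['a', 'b'],
    'formats': [('i1', (10,)), 'f4'],
    'offsets': [Test.a.offset, Test.b.offset],
    'itemsize': ctypes.sizeof(Test),
})

# frombuffer takes the dtype from the caller and ignores the buffer format.
arr = np.frombuffer(x, dtype=padded_dtype)
print(arr['b'])
```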

ahaldane · 2017-11-27T21:50:47Z

Padding bytes are a bit of a mess too. I tried working on this here #7798, which is on hiatus but we should get back to it one day.

You can see that I progressively discovered new understandings of how padding bytes are supposed to be interpreted, and some fundamental problems with the 3118 interface.

pv · 2017-11-27T21:52:28Z

@alanhdu: note that "<cccf" is not equivalent to "cccxf" --- the issue is with the alignment specifiers. The implicit padding is only for the "native" alignment mode, but ctypes emits format strings that switch away from that. ctypes should emit a valid description of the actual layout of the object in memory (in any way allowed by the spec), but currently it does not.
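
The difference between the alignment modes can be seen with the struct module alone. A small sketch, using only struct.calcsize:

```python
import struct

# Native mode (no prefix): members are aligned, so padding is implied.
native_size = struct.calcsize("cccf")

# Standard mode ("<" prefix): no implicit padding at all.
standard_size = struct.calcsize("<cccf")

# Standard mode with the pad byte written out explicitly.
explicit_size = struct.calcsize("<cccxf")

print(native_size, standard_size, explicit_size)
```

In standard mode the sizes are 7 and 8 respectively; the native size depends on the platform's alignment rules and is 8 where float is 4-byte aligned.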

alanhdu · 2017-11-29T22:44:19Z

@ahaldane I read through #7798, and now I think I'm just more confused 😆. If I understand correctly, the "solution" here is to patch cpython to output padding bytes in ctypes to force things to be aligned, right?

pv · 2017-11-30T09:11:43Z

The solution is to fix the bugs in ctypes that make it output invalid format strings.

eric-wieser · 2018-02-06T08:58:45Z

@pv: Little slow on the uptake here, but I've attempted that in python/cpython#5561

pv · 2018-02-06T09:15:45Z

@eric-wieser: note that it needs "..Q..".replace("Q", s_ulonglong) in the tests --- on some platforms in ctypes "c_ulonglong is c_long" so that there's no type that produces a "Q" code.

eric-wieser · 2018-02-06T09:44:33Z

That comment might be better placed on that PR.

Isn't Q required to be 8 bytes? I chose it deliberately to try and avoid the mess that is int/long having multiple representations. So I suppose the issue only exists on platforms where sizeof(long long) == 4, on which I may as well disable the test entirely, since it relies on having a fixed-width type.

pv · 2018-02-06T10:03:07Z

It's possible that `unsigned long` and `unsigned long long` are both 64 bits in size, in which case I think you get `c_ulong is c_ulonglong` in ctypes.

eric-wieser · 2018-02-06T10:03:52Z

Right, but it's still required to serialize as Q in that case, because the endianness specifier kicks us into standard-size mode

pv · 2018-02-06T10:04:58Z

Yes, but if you write `c_ulonglong` in ctypes it means the corresponding c type which does not necessarily produce "Q" in the format string.

pv · 2018-02-06T10:06:28Z

If `sizeof(unsigned long long) == sizeof(unsigned long)` then it's literally true that `assert ctypes.c_ulong is ctypes.c_ulonglong`

eric-wieser · 2018-02-06T10:08:40Z

All that matters is whether sizeof(c_ulonglong) == 8. If sizeof(c_ulong) == 8 too (such as when they are identical), then it's required to serialize to <Q as well.

Either way, I've updated the PR to use c_uint64, which will at least NameError on platforms where sizeof(c_ulonglong) == 4 rather than producing a confusingly failing test.

pv · 2018-02-06T10:09:28Z

Ok, sorry yes, now I see the statements are not contradictory, it's indeed guaranteed to produce "Q" in the standard size mode.

eric-wieser · 2018-02-06T10:10:19Z

Since you're here - do you know which of i and l c_int and c_long produce in standard size mode, if they are both 4 bytes?

pv · 2018-02-06T10:13:03Z

It produces `l` (the logic written in the test describes the actual situation).

eric-wieser · 2018-02-06T10:17:22Z

Seems weird to me not to drop l/L entirely and just use i/I in all cases - neither PEP3118 nor the struct docs attach any meaning to the distinction.
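
The standard-size points in this exchange can be checked with the struct module, which uses the same type codes. A short sketch; it shows only struct's behaviour, not what ctypes emits:

```python
import struct

# With an endianness prefix, sizes are fixed regardless of platform.
q_size = struct.calcsize("<Q")
i_size = struct.calcsize("<i")
l_size = struct.calcsize("<l")
print(q_size, i_size, l_size)

# In standard mode, i and l also pack to identical bytes.
same_bytes = struct.pack("<i", -2) == struct.pack("<l", -2)
print(same_bytes)

# Native mode is where long varies between platforms.
print(struct.calcsize("@l"))
```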

eric-wieser · 2018-02-06T10:18:35Z

Looking at _ctypes_alloc_format_string_for_type, ctypes assumes that sizeof(long long) == 8 anyway, so a lot more than just my test will fail on such a hypothetical platform

eric-wieser · 2018-11-19T02:35:02Z

In 1.16, as of #12254, this now gives:

> np.array((A*2)())
NotImplementedError: Unrepresentable PEP 3118 data type 'u' (UCS-2 strings)

The above exception was the direct cause of the following exception:

ValueError: 'T{(5)<u:s:}' is not a valid PEP 3118 buffer format string

A little weirdly, the following also fails with the same error:

np.array((A*2)(), dtype=object)

However, constructing object arrays with ctypes values at each entry was never supported anyway.

eric-wieser closed this as completed Nov 19, 2018
