BUG: NEP 42 user dtype has type number set to -1 and this causes various failures. #22900

l-johnston · 2022-12-29T13:50:42Z

Describe the issue:

Custom user dtypes in the new NEP 42 DTypeMeta have a type number set to -1 and this leads to various failures.

In the 'experimental_public_dtype_api.c', the code is:

/* invalid type num. Ideally, we get away with it! */
    DType->type_num = -1;

Well, it seems we can't get away with it. For the code example, I can provide an example from my package microohm.

Reproduce the code example:

import numpy as np
import numpy.core._dtype as _dtype
from microohm import QuantityDType

arr = np.array([1.0, 2.0], dtype=QuantityDType("m"))
print(arr.dtype.num)
print(f"{arr.dtype.kind!r}")
_dtype._name_get(arr.dtype)

Error message:

RuntimeError                              Traceback (most recent call last)
Cell In[11], line 1
----> 1 _dtype._name_get(arr.dtype)

File ~/github/microohm/.venv/lib/python3.11/site-packages/numpy/core/_dtype.py:355, in _name_get(dtype)
    353     name = dtype.type.__name__
    354 else:
--> 355     name = _kind_name(dtype)
    357 # append bit counts
    358 if _name_includes_bit_suffix(dtype):

File ~/github/microohm/.venv/lib/python3.11/site-packages/numpy/core/_dtype.py:28, in _kind_name(dtype)
     26     return _kind_to_stem[dtype.kind]
     27 except KeyError as e:
---> 28     raise RuntimeError(
     29         "internal dtype error, unknown kind {!r}"
     30         .format(dtype.kind)
     31     ) from None

RuntimeError: internal dtype error, unknown kind '\x00'

Runtime information:

1.25.0.dev0+272.gbf20c55a2
3.11.1 (main, Dec 9 2022, 10:58:57) [GCC 11.3.0]
[{'numpy_version': '1.25.0.dev0+272.gbf20c55a2',
'python': '3.11.1 (main, Dec 9 2022, 10:58:57) [GCC 11.3.0]',
'uname': uname_result(system='Linux', node='curro2', release='5.15.0-56-generic', version='#62-Ubuntu SMP Tue Nov 22 19:54:14 UTC 2022', machine='x86_64')},
{'simd_extensions': {'baseline': ['SSE', 'SSE2', 'SSE3'],
'found': ['SSSE3',
'SSE41',
'POPCNT',
'SSE42',
'AVX',
'F16C',
'FMA3',
'AVX2',
'AVX512F',
'AVX512CD',
'AVX512_SKX'],
'not_found': ['AVX512_KNL',
'AVX512_KNM',
'AVX512_CLX',
'AVX512_CNL',
'AVX512_ICL']}},
{'architecture': 'SkylakeX',
'filepath': '/usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so',
'internal_api': 'openblas',
'num_threads': 8,
'prefix': 'libopenblas',
'threading_layer': 'pthreads',
'user_api': 'blas',
'version': '0.3.20'}]
None

Context for the issue:

This causes failures in downstream libraries like Pandas, etc.

seberg · 2023-01-01T12:22:25Z

Sorry about not answering earlier, I was travelling a bit and barely checked on updates. Right now, I think you are seeing this as an error during printing.

You can avoid this by implementing tp_repr and tp_str on your DType. There may be places where you cannot get away with this, and maybe we need to define a type number, but I suspect using valid type numbers will come with its own set of problems (for the new style DTypes).

seberg · 2023-01-01T12:29:28Z

Hmm, I see you have repr, what is the thing you are expecting to work? The dtype.kind call?

l-johnston · 2023-01-01T14:47:56Z

There are two failures I have seen related to these dtype attributes.
1.

In [1]: from microohm import *
In [2]: import numpy as np
In [3]: import pandas as pd
In [4]: arr = np.array([1.0, 2.0], dtype=QuantityDType("m"))
In [5]: s1 = pd.Series(arr)
In [6]: s1
Out[6]: ---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)

In [7]: np.ctypeslib.ndpointer(dtype=QuantityDType('m'))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[7], line 1
----> 1 np.ctypeslib.ndpointer(dtype=QuantityDType('m'))

File ~/github/microohm/.venv/lib/python3.11/site-packages/numpy/ctypeslib.py:343, in ndpointer(dtype, ndim, shape, flags)
    340 else:
    341     base = _ndptr
--> 343 klass = type("ndpointer_%s"%name, (base,),
    344              {"_dtype_": dtype,
    345               "_shape_" : shape,
    346               "_ndim_" : ndim,
    347               "_flags_" : num})
    348 _pointer_type_cache[cache_key] = klass
    349 return klass

ValueError: type name must not contain null characters

seberg · 2023-01-01T14:51:58Z

OK, I can't say I am surprised about either. But I am not sure that having a type number available will fix them as well.
Pandas clearly needs fixing up for full dtype support, for ctypes, the question is how to export it, since there is no C equivalent type available.

l-johnston · 2023-01-01T15:56:51Z

Is it possible for DTypeMeta to be a user defined type such that existing code works as is?

seberg · 2023-01-01T16:06:54Z

The question is whether that same "existing code" actually works with a user dtype that is defined the old way. Pandas does a lot of weird stuff, if pandas works with existing user dtypes like rational and the only problem is kind and type_num being defined to something slightly saner, then I don't mind defining it.

Feel free to just override type_num on the instance (needs to be part of the __new__/tp_new) to see if it helps, but I doubt it will.

Until now, I have never dug into what pandas dislikes exactly. I doubt it is much that needs to change in pandas to make things work, but I don't know what it is.

l-johnston · 2023-01-01T16:48:52Z

I tried overriding type_num but this doesn't work. In the original dtype scheme, new user defined types are registered and this increments a global variable NPY_NUMUSERTYPES.
usertypes.c

/*NUMPY_API
  Register Data type
  Does not change the reference count of descr
*/
NPY_NO_EXPORT int
PyArray_RegisterDataType(PyArray_Descr *descr)
{
    PyArray_Descr *descr2;
    int typenum;
    int i;
    PyArray_ArrFuncs *f;

    /* See if this type is already registered */
    for (i = 0; i < NPY_NUMUSERTYPES; i++) {
        descr2 = userdescrs[i];
        if (descr2 == descr) {
            return descr->type_num;
        }
    }
   ...
   userdescrs[NPY_NUMUSERTYPES++] = descr;

This value is then used in various places:
_dtype.py

def _name_get(dtype):
    # provides dtype.name.__get__, documented as returning a "bit name"

    if dtype.isbuiltin == 2:
        # user dtypes don't promise to do anything special
        return dtype.type.__name__

descriptor.c

/*
 * returns 1 for a builtin type
 * and 2 for a user-defined data-type descriptor
 * return 0 if neither (i.e. it's a copy of one)
 */
static PyObject *
arraydescr_isbuiltin_get(PyArray_Descr *self, void *NPY_UNUSED(ignored))
{
    long val;
    val = 0;
    if (self->fields == Py_None) {
        val = 1;
    }
    if (PyTypeNum_ISUSERDEF(self->type_num)) {
        val = 2;
    }
    return PyLong_FromLong(val);
}

ndarraytypes.h

#define PyTypeNum_ISUSERDEF(type) (((type) >= NPY_USERDEF) && \
                                   ((type) < NPY_USERDEF+     \
                                    NPY_NUMUSERTYPES))

l-johnston · 2023-01-01T17:15:50Z

rational works:

>>> import numpy as np
>>> from numpy.core._rational_tests import rational
>>> import pandas as pd
>>> arr = np.array([rational(1), rational(2)])
>>> s1 = pd.Series(arr)
>>> s1
0    1
1    2
dtype: rational

seberg · 2023-01-01T17:36:26Z

@ngoldbaum dunno if you are interested in having a look at why pandas seems to be fine wit hold-style, but not new style user dtypes. The question though is if it is a simple thing or not. Since new-style dtypes are parametric, much of the old infrastructure seems probably to not work in either case.
(of course for non-parametric that isn't an issue, which could still be good to just support in that case, the dtype here is of course parametric...)

ngoldbaum · 2023-01-02T20:54:14Z

So the full error from pandas is here:

Traceback (most recent call last):
  File "/home/nathan/Documents/numpy-experiments/test-pandas.py", line 11, in <module>
    print(s1)
  File "/home/nathan/.pyenv/versions/3.10.8-debug/lib/python3.10/site-packages/pandas/core/series.py", line 1594, in __repr__
    return self.to_string(**repr_params)
  File "/home/nathan/.pyenv/versions/3.10.8-debug/lib/python3.10/site-packages/pandas/core/series.py", line 1687, in to_string
    result = formatter.to_string()
  File "/home/nathan/.pyenv/versions/3.10.8-debug/lib/python3.10/site-packages/pandas/io/formats/format.py", line 391, in to_string
    footer = self._get_footer()
  File "/home/nathan/.pyenv/versions/3.10.8-debug/lib/python3.10/site-packages/pandas/io/formats/format.py", line 353, in _get_footer
    dtype_name = getattr(self.tr_series.dtype, "name", None)
  File "/home/nathan/Documents/numpy/numpy/core/_dtype.py", line 355, in _name_get
    name = _kind_name(dtype)
  File "/home/nathan/Documents/numpy/numpy/core/_dtype.py", line 28, in _kind_name
    raise RuntimeError(
RuntimeError: internal dtype error, unknown kind '\x00'

I patched numpy to avoid the error in _dtype.py to fall back on dtype.__name__, I get a different error from pandas because the isnan function isn't defined for the dtype I'm trying. Currently I don't have a dtype that defines isnan, so not sure what else is broken after that.

I think we probably just want to define dtype.name:

diff --git a/numpy/core/_dtype.py b/numpy/core/_dtype.py
index 3db80c17e..c4d3594d2 100644
--- a/numpy/core/_dtype.py
+++ b/numpy/core/_dtype.py
@@ -344,7 +344,7 @@ def _name_includes_bit_suffix(dtype):
 def _name_get(dtype):
     # provides dtype.name.__get__, documented as returning a "bit name"
 
-    if dtype.isbuiltin == 2:
+    if dtype.isbuiltin == 2 or dtype.kind == '\x00':
         # user dtypes don't promise to do anything special
         return dtype.type.__name__

The built-in dtypes have names that show the width of the dtype in bytes, so maybe there also needs to be some code to do that but that's not really a big deal and should be doable in Python based on dtype.itemsize.

l-johnston · 2023-01-04T17:15:16Z

Perhaps go through all of the dtype attributes and define the desired behavior for the new DTypeMeta. For example, the current values for the following don't seem quite right to me:
dtype.num -> -1 # invalid type number
dtype.kind -> "\x00" # non-printable character
dtype.isbuiltin -> 0 # suggests isn't built-in and not user defined (2)

Is DTypeMeta a user-defined data type?

ngoldbaum · 2023-01-04T17:52:08Z

Is DTypeMeta a user-defined data type?

Not necessarily, no. DTypeMeta defines a metaclass for creating dtypes. Currently all of numpy's dtypes are created via a DTypeMeta. So, for example, type(np.dtype('float32')) is a numpy._DTypeMeta instance that can create float32 dtype instances.

l-johnston · 2023-01-04T18:51:47Z

In my case, QuantityDType is a DTypeMeta that is instaniated as in QuantityDType("m"), so to me it is a user defined type. Are we supposed to subclass it to a DType? at runtime based on the unit expression?

l-johnston added the 00 - Bug label Dec 29, 2022

charris added this to the 1.24.2 release milestone Jan 3, 2023

ngoldbaum mentioned this issue Jan 19, 2023

MAINT: dtype.name works for NEP 42 dtypes #23047

Merged

seberg closed this as completed in #23047 Jan 20, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: NEP 42 user dtype has type number set to -1 and this causes various failures. #22900

BUG: NEP 42 user dtype has type number set to -1 and this causes various failures. #22900

BUG: NEP 42 user dtype has type number set to -1 and this causes various failures. #22900

BUG: NEP 42 user dtype has type number set to -1 and this causes various failures. #22900

Comments

Describe the issue:

Reproduce the code example:

Error message:

Runtime information:

Context for the issue: