8000 BUG: NEP 42 user dtype has type number set to -1 and this causes various failures. · Issue #22900 · numpy/numpy · GitHub
[go: up one dir, main page]

Skip to content

BUG: NEP 42 user dtype has type number set to -1 and this causes various failures. #22900

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
l-johnston opened this issue Dec 29, 2022 · 13 comments · Fixed by #23047
Closed

BUG: NEP 42 user dtype has type number set to -1 and this causes various failures. #22900

l-johnston opened this issue Dec 29, 2022 · 13 comments · Fixed by #23047

Comments

@l-johnston
Copy link
Contributor

Describe the issue:

Custom user dtypes in the new NEP 42 DTypeMeta have a type number set to -1 and this leads to various failures.

In the 'experimental_public_dtype_api.c', the code is:

/* invalid type num. Ideally, we get away with it! */
    DType->type_num = -1;

Well, it seems we can't get away with it. For the code example, I can provide an example from my package microohm.

Reproduce the code example:

import numpy as np
import numpy.core._dtype as _dtype
from microohm import QuantityDType

arr = np.array([1.0, 2.0], dtype=QuantityDType("m"))
print(arr.dtype.num)
print(f"{arr.dtype.kind!r}")
_dtype._name_get(arr.dtype)

Error message:

RuntimeError                              Traceback (most recent call last)
Cell In[11], line 1
----> 1 _dtype._name_get(arr.dtype)

File ~/github/microohm/.venv/lib/python3.11/site-packages/numpy/core/_dtype.py:355, in _name_get(dtype)
    353     name = dtype.type.__name__
    354 else:
--> 355     name = _kind_name(dtype)
    357 # append bit counts
    358 if _name_includes_bit_suffix(dtype):

File ~/github/microohm/.venv/lib/python3.11/site-packages/numpy/core/_dtype.py:28, in _kind_name(dtype)
     26     return _kind_to_stem[dtype.kind]
     27 except KeyError as e:
---> 28     raise RuntimeError(
     29         "internal dtype error, unknown kind {!r}"
     30         .format(dtype.kind)
     31     ) from None

RuntimeError: internal dtype error, unknown kind '\x00'

Runtime information:

1.25.0.dev0+272.gbf20c55a2
3.11.1 (main, Dec 9 2022, 10:58:57) [GCC 11.3.0]
[{'numpy_version': '1.25.0.dev0+272.gbf20c55a2',
'python': '3.11.1 (main, Dec 9 2022, 10:58:57) [GCC 11.3.0]',
'uname': uname_result(system='Linux', node='curro2', release='5.15.0-56-generic', version='#62-Ubuntu SMP Tue Nov 22 19:54:14 UTC 2022', machine='x86_64')},
{'simd_extensions': {'baseline': ['SSE', 'SSE2', 'SSE3'],
'found': ['SSSE3',
'SSE41',
'POPCNT',
'SSE42',
'AVX',
'F16C',
'FMA3',
'AVX2',
'AVX512F',
'AVX512CD',
'AVX512_SKX'],
'not_found': ['AVX512_KNL',
'AVX512_KNM',
'AVX512_CLX',
'AVX512_CNL',
'AVX512_ICL']}},
{'architecture': 'SkylakeX',
'filepath': '/usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so',
'internal_api': 'openblas',
'num_threads': 8,
'prefix': 'libopenblas',
'threading_layer': 'pthreads',
'user_api': 'blas',
'version': '0.3.20'}]
None

Context for the issue:

This causes failures in downstream libraries like Pandas, etc.

@seberg
Copy link
Member
seberg commented Jan 1, 2023

Sorry about not answering earlier, I was travelling a bit and barely checked on updates. Right now, I think you are seeing this as an error during printing.

You can avoid this by implementing tp_repr and tp_str on your DType. There may be places where you cannot get away with this, and maybe we need to define a type number, but I suspect using valid type numbers will come with its own set of problems (for the new style DTypes).

@seberg
8000
Copy link
Member
seberg commented Jan 1, 2023

Hmm, I see you have repr, what is the thing you are expecting to work? The dtype.kind call?

@l-johnston
Copy link
Contributor Author

There are two failures I have seen related to these dtype attributes.
1.

In [1]: from microohm import *
In [2]: import numpy as np
In [3]: import pandas as pd
In [4]: arr = np.array([1.0, 2.0], dtype=QuantityDType("m"))
In [5]: s1 = pd.Series(arr)
In [6]: s1
Out[6]: ---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
In [7]: np.ctypeslib.ndpointer(dtype=QuantityDType('m'))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[7], line 1
----> 1 np.ctypeslib.ndpointer(dtype=QuantityDType('m'))

File ~/github/microohm/.venv/lib/python3.11/site-packages/numpy/ctypeslib.py:343, in ndpointer(dtype, ndim, shape, flags)
    340 else:
    341     base = _ndptr
--> 343 klass = type("ndpointer_%s"%name, (base,),
    344              {"_dtype_": dtype,
    345               "_shape_" : shape,
    346               "_ndim_" : ndim,
    347               "_flags_" : num})
    348 _pointer_type_cache[cache_key] = klass
    349 return klass

ValueError: type name must not contain null characters

@seberg
Copy link
Member
seberg commented Jan 1, 2023

OK, I can't say I am surprised about either. But I am not sure that having a type number available will fix them as well.
Pandas clearly needs fixing up for full dtype support, for ctypes, the question is how to export it, since there is no C equivalent type available.

@l-johnston
Copy link
Contributor Author

Is it possible for DTypeMeta to be a user defined type such that existing code works as is?

@seberg
Copy link
Member
seberg commented Jan 1, 2023

The question is whether that same "existing code" actually works with a user dtype that is defined the old way. Pandas does a lot of weird stuff, if pandas works with existing user dtypes like rational and the only problem is kind and type_num being defined to something slightly saner, then I don't mind defining it.

Feel free to just override type_num on the instance (needs to be part of the __new__/tp_new) to see if it helps, but I doubt it will.

Until now, I have never dug into what pandas dislikes exactly. I doubt it is much that needs to change in pandas to make things work, but I don't know what it is.

@l-johnston
Copy link
Contributor Author

I tried overriding type_num but this doesn't work. In the original dtype scheme, new user defined types are registered and this increments a global variable NPY_NUMUSERTYPES.
usertypes.c

/*NUMPY_API
  Register Data type
  Does not change the reference count of descr
*/
NPY_NO_EXPORT int
PyArray_RegisterDataType(PyArray_Descr *descr)
{
    PyArray_Descr *descr2;
    int typenum;
    int i;
    PyArray_ArrFuncs *f;

    /* See if this type is already registered */
    for (i = 0; i < NPY_NUMUSERTYPES; i++) {
        descr2 = userdescrs[i];
        if (descr2 == descr) {
            return descr->type_num;
        }
    }
   ...
   userdescrs[NPY_NUMUSERTYPES++] = descr;

This value is then used in various places:
_dtype.py

def _name_get(dtype):
    # provides dtype.name.__get__, documented as returning a "bit name"

    if dtype.isbuiltin == 2:
        # user dtypes don't promise to do anything special
        return dtype.type.__name__

descriptor.c

/*
 * returns 1 for a builtin type
 * and 2 for a user-defined data-type descriptor
 * return 0 if neither (i.e. it's a copy of one)
 */
static PyObject *
arraydescr_isbuiltin_get(PyArray_Descr *self, void *NPY_UNUSED(ignored))
{
    long val;
    val = 0;
    if (self->fields == Py_None) {
        val = 1;
    }
    if (PyTypeNum_ISUSERDEF(self->type_num)) {
        val = 2;
    }
    return PyLong_FromLong(val);
}

ndarraytypes.h

#define PyTypeNum_ISUSERDEF(type) (((type) >= NPY_USERDEF) && \
                                   ((type) < NPY_USERDEF+     \
                                    NPY_NUMUSERTYPES))

@l-johnston
Copy link
Contributor Author

rational works:

>>> import numpy as np
>>> from numpy.core._rational_tests import rational
>>> import pandas as pd
>>> arr = np.array([rational(1), rational(2)])
>>> s1 = pd.Series(arr)
>>> s1
0    1
1    2
dtype: rational

@seberg
Copy link
Member
seberg commented Jan 1, 2023

@ngoldbaum dunno if you are interested in having a look at why pandas seems to be fine wit hold-style, but not new style user dtypes. The question though is if it is a simple thing or not. Since new-style dtypes are parametric, much of the old infrastructure seems probably to not work in either case.
(of course for non-parametric that isn't an issue, which could still be good to just support in that case, the dtype here is of course parametric...)

@ngoldbaum
Copy link
Member

So the full error from pandas is here:

Traceback (most recent call last):
  File "/home/nathan/Documents/numpy-experiments/test-pandas.py", line 11, in <module>
    print(s1)
  File "/home/nathan/.pyenv/versions/3.10.8-debug/lib/python3.10/site-packages/pandas/core/series.py", line 1594, in __repr__
    return self.to_string(**repr_params)
  File "/home/nathan/.pyenv/versions/3.10.8-debug/lib/python3.10/site-packages/pandas/core/series.py", line 1687, in to_string
    result = formatter.to_string()
  File "/home/nathan/.pyenv/versions/3.10.8-debug/lib/python3.10/site-packages/pandas/io/formats/format.py", line 391, in to_string
    footer = self._get_footer()
  File "/home/nathan/.pyenv/versions/3.10.8-debug/lib/python3.10/site-packages/pandas/io/formats/format.py", line 353, in _get_footer
    dtype_name = getattr(self.tr_series.dtype, "name", None)
  File "/home/nathan/Documents/numpy/numpy/core/_dtype.py", line 355, in _name_get
    name = _kind_name(dtype)
  File "/home/nathan/Documents/numpy/numpy/core/_dtype.py", line 28, in _kind_name
    raise RuntimeError(
RuntimeError: internal dtype error, unknown kind '\x00'

I patched numpy to avoid the error in _dtype.py to fall back on dtype.__name__, I get a different error from pandas because the isnan function isn't defined for the dtype I'm trying. Currently I don't have a dtype that defines isnan, so not sure what else is broken after that.

I think we probably just want to define dtype.name:

diff --git a/numpy/core/_dtype.py b/numpy/core/_dtype.py
index 3db80c17e..c4d3594d2 100644
--- a/numpy/core/_dtype.py
+++ b/numpy/core/_dtype.py
@@ -344,7 +344,7 @@ def _name_includes_bit_suffix(dtype):
 def _name_get(dtype):
     # provides dtype.name.__get__, documented as returning a "bit name"
 
-    if dtype.isbuiltin == 2:
+    if dtype.isbuiltin == 2 or dtype.kind == '\x00':
         # user dtypes don't promise to do anything special
         return dtype.type.__name__

The built-in dtypes have names that show the width of the dtype in bytes, so maybe there also needs to be some code to do that but that's not really a big deal and should be doable in Python based on dtype.itemsize.

@charris charris added this to the 1.24.2 release milestone Jan 3, 2023
@l-johnston
Copy link
Contributor Author
8000

Perhaps go through all of the dtype attributes and define the desired behavior for the new DTypeMeta. For example, the current values for the following don't seem quite right to me:
dtype.num -> -1 # invalid type number
dtype.kind -> "\x00" # non-printable character
dtype.isbuiltin -> 0 # suggests isn't built-in and not user defined (2)

Is DTypeMeta a user-defined data type?

@ngoldbaum
Copy link
Member

Is DTypeMeta a user-defined data type?

Not necessarily, no. DTypeMeta defines a metaclass for creating dtypes. Currently all of numpy's dtypes are created via a DTypeMeta. So, for example, type(np.dtype('float32')) is a numpy._DTypeMeta instance that can create float32 dtype instances.

@l-johnston
Copy link
Contributor Author

In my case, QuantityDType is a DTypeMeta that is instaniated as in QuantityDType("m"), so to me it is a user defined type. Are we supposed to subclass it to a DType? at runtime based on the unit expression?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants
0