
WIP: refactor dtype to be a type subclass #12585


Closed · wants to merge 6 commits

Conversation

@mattip (Member) commented Dec 18, 2018

Replaces #12430. This first step refactors the global descr creation to use PyObject_New. Note the new define USE_DTYPE_AS_PYOBJECT, which preserves the old behavior. Once the new dtype works, we can flip the default.
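
As a rough Python-level analogy of the two modes (the names below are illustrative, not NumPy's actual code):

# Old behavior (USE_DTYPE_AS_PYOBJECT): dtype instances are plain objects.
class dtype_old:
    pass

# New behavior: dtype subclasses type, so each dtype instance is itself
# a class, and np.dtype becomes a metaclass.
class dtype_new(type):
    pass

i_old = dtype_old()
isinstance(i_old, type)               # False

i_new = dtype_new('int32', (), {})    # three-argument type() form
isinstance(i_new, type)               # True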

@mattip force-pushed the dtype-refactor2 branch 14 times, most recently from 21dcbcf to 2efd78b, on December 18, 2018 17:07
@mattip (Member Author) commented Dec 18, 2018

The first commit, 59ad4d3, is a refactor of existing code and passes all tests except the codecov check.

The next batch of commits are attempts to get a job running on Azure Pipelines with USE_DTYPE_AS_PYOBJECT=0, which exercises the new code paths and makes dtype instances instances of Python's type rather than of object. In 2efd78b I finally seem to have got it right.

The code errors out; somehow the class methods are not being bound. The new Azure job fails since np.dtype(scalar).newbyteorder(endian) resolves to an unbound method:

ERROR collecting build/testenv/lib/python3.5/site-packages/numpy/lib/tests/test_format.py 
    numpy/lib/tests/test_format.py:330: in <module>
    dtype = np.dtype(scalar).newbyteorder(endian)
E   TypeError: descriptor 'newbyteorder' requires a 'numpy.dtype' object but received a 'str'

This commit produces the following output (note unbound methods need a class object):

>>> np.dtype.mro(np.dtype)
[<class 'numpy.dtype'>, <class 'type'>, <class 'object'>]

>>> issubclass(np.dtype, type)
True

>>> i = np.dtype('int32')
>>> type(i)
<class 'numpy.dtype'>

>>> isinstance(i, type)
True

Now if I could only figure out how to bind those class methods ...

@eric-wieser (Member) commented Dec 19, 2018

Now if I could only figure out how to bind those class methods ...

Perhaps METH_CLASS? Although thinking about the equivalent python, that shouldn't be necessary...
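
For reference, the Python-level behavior being appealed to (hypothetical names): a plain method defined on a metaclass already binds when accessed through an instance of that metaclass, i.e. through the class itself, so no classmethod/METH_CLASS marker should be needed:

class Meta(type):
    def newbyteorder(cls, endian='='):
        # binds with cls = the class being accessed, because that
        # class is an *instance* of Meta
        return '%s reordered to %r' % (cls.__name__, endian)

class Int32(metaclass=Meta):
    pass

Int32.newbyteorder('<')   # works: bound method, cls is Int32
Meta.newbyteorder         # unbound plain function on the metaclass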

@eric-wieser (Member):

Can you show the output of

np.dtype(scalar).newbyteorder

and

np.dtype.newbyteorder

@eric-wieser (Member):

I think USE_DTYPE_AS_PYTYPEOBJECT would be clearer than the inverse USE_DTYPE_AS_PYOBJECT, since PyTypeObjects are themselves PyObjects

@mattip changed the title from "MAINT: refactor descr creation from stack to using PyObject_New" to "WIP: refactor dtype to be a python type" on Dec 19, 2018
@eric-wieser changed the title from "WIP: refactor dtype to be a python type" to "WIP: refactor dtype to be a type subclass" on Dec 19, 2018
@mattip (Member Author) commented Dec 19, 2018

I think USE_DTYPE_AS_PYTYPEOBJECT would be clearer

I want to use a value that means "use old behaviour," so that the default (off) means "use new behaviour"; we can leave it in for a few deprecation cycles for backward compatibility.

@mattip (Member Author) commented Dec 19, 2018

New (broken) code:

>>> import numpy as np
>>> np.dtype.newbyteorder
<method 'newbyteorder' of 'numpy.dtype' objects>
>>> np.dtype(int).newbyteorder
<method 'newbyteorder' of 'numpy.dtype' objects>
>>> np.dtype.mro()
Traceback (most recent call last):
  File "<console>", line 1, in <module>
TypeError: descriptor 'mro' of 'type' object needs an argument 

Old (working) code:

>>> np.dtype.newbyteorder
<method 'newbyteorder' of 'numpy.dtype' objects>
>>> np.dtype(int).newbyteorder
<built-in method newbyteorder of numpy.dtype object at 0x7fceac911580>
>>> np.dtype.mro()
[<class 'numpy.dtype'>, <class 'object'>]

The only change to np.dtype is the base class, so maybe I need to call something after/before PyType_Ready to bind the methods.

@mattip (Member Author) commented Dec 19, 2018

Also
New code

>>> np.dtype.mro
<method 'mro' of 'type' objects>

Old code

>>> np.dtype.mro
<built-in method mro of type object at 0x7fceac914280>

@eric-wieser (Member) commented Dec 19, 2018

dtype.mro() breaking is expected here, I think - you need to spell it as type.mro(dtype). It happens with all Python metaclasses.

dtype(int).mro() should work though.
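
A plain-Python illustration of both points (Meta is a hypothetical stand-in):

class Meta(type):
    pass

Meta.mro          # <method 'mro' of 'type' objects> -- found unbound in
                  # Meta's own MRO, so Meta.mro() raises the TypeError above
type.mro(Meta)    # [Meta, type, object] -- the explicit spelling works

class C(metaclass=Meta):
    pass

C.mro()           # [C, object] -- fine: 'mro' binds via C's metaclass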

@eric-wieser (Member):

I think your copy of the PyTypeObject is causing the breakage - it's copying the class methods onto the instances, which you don't want to do.

@mattip (Member Author) commented Dec 19, 2018

It is more convoluted than that. Attribute lookup usually goes to the object __dict__, which gets filled out at object instantiation. That is not happening here; the lookup is going to the instance's descrtype (the PyType_Type member of the PyArray_Descr struct). I think the object instantiation is not creating a __dict__.

@eric-wieser (Member) commented Dec 19, 2018

What happens if you remove the memcpy?

the lookup is going to the instance's descrtype (the PyType_Type member of the PyArray_Descr struct).

Yes, exactly. I think the problem is that descr->descrtype should have no methods, but you're copying in the ones from the type object. I think what you're doing is equivalent to:

class dtype(type):
    def __new__(metacls, name, bases, dict):
        # this copy breaks methods - don't do it!
        for k in dir(metacls):
            try:
                dict[k] = getattr(metacls, k)
            except AttributeError:
                pass

        return super().__new__(metacls, name, bases, dict)

class bad_class(metaclass=dtype):
    pass

class good_class:
    pass

good_class.mro
#<built-in method mro of type object at 0x0000021A4A6F6F78>
bad_class.mro
#<method 'mro' of 'type' objects>

@mattip (Member Author) commented Dec 19, 2018

No more crashes; tests run to completion. There are failures around pickling and creating new instances.

@mattip (Member Author) commented Dec 20, 2018

Next step: get pickle.dumps(np.dtype(int)) to work. The problem is this logic in _pickle.c, which has a different path for type objects. It seems I need to register a function in a dispatch_table to work around that.

@mhvk (Contributor) commented Dec 20, 2018

Is it going wrong in save_global? Maybe it is just a matter of setting attributes that pickle expects to be present? (Sorry, ignore if these are just silly wild guesses.)

@mattip (Member Author) commented Dec 20, 2018

@mhvk: it is because the function resolution goes:

  1. Look up a __reduce__ function in the dispatch_table
  2. If the object is a type object, simply save it by reference
  3. If the object has a __reduce__, use it

The old code falls through to 3; I registered a dispatch_table entry for the new code to avoid 2. A sketch of the workaround follows.
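
A minimal pure-Python sketch of that workaround (Meta and Widget are hypothetical stand-ins; NumPy registers the equivalent from C):

import copyreg
import pickle

class Meta(type):
    pass

Widget = Meta('Widget', (), {})   # an instance of Meta, i.e. a class

def reduce_meta(cls):
    # Rebuild the class by calling the metaclass, instead of letting
    # pickle save it by reference as a global (step 2 above).
    ns = {k: v for k, v in vars(cls).items()
          if k not in ('__dict__', '__weakref__')}
    return (Meta, (cls.__name__, cls.__bases__, ns))

copyreg.pickle(Meta, reduce_meta)   # step 1 now wins over step 2

clone = pickle.loads(pickle.dumps(Widget))
assert clone.__name__ == 'Widget'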

@mattip (Member Author) commented Dec 20, 2018

Down to one failing test, which seems to need an implementation of dtype('f8').__call__.

The inline review below is on this hunk:

     * #name2 = HALF, DATETIME, TIMEDELTA#
     */
    #if USE_DTYPE_AS_PYOBJECT
        dtype = PyObject_New(PyArray_Descr, &PyArrayDescr_Type);
Member:

What is the advantage of heap allocating here vs the previous static allocation?

@mattip (Member Author):

Personal preference. At some point subinterpreter support may become mature enough that we will want to allow reloading the NumPy module, and thinking about what that means for static allocation gets me confused.

@eric-wieser (Member) commented Dec 21, 2018

down to one failing test

The problem is that this patch changes the value of isinstance(np.dtype(int), collections.Callable).

The test is just bad here. It should use for attrib, value in inspect.getmembers(a, inspect.isroutine) rather than filtering by callable, since types are not methods. A sketch of the distinction is below.
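
A toy illustration of the distinction (Meta is a hypothetical stand-in for the patched np.dtype):

import inspect
from collections.abc import Callable

class Meta(type):
    pass

Int32 = Meta('Int32', (), {})      # a "dtype instance" is now itself a class

isinstance(Int32, Callable)         # True: classes are callable, so the
                                    # test's callable-based filter matches it

class HasAttrs:
    dtype = Int32
    def method(self):
        pass

names = [n for n, v in inspect.getmembers(HasAttrs, inspect.isroutine)]
'method' in names                   # True
'dtype' in names                    # False: a class is not a routine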

The inline review below is on this hunk:

        dtype = PyObject_New(PyArray_Descr, &PyArrayDescr_Type);
    #ifndef USE_DTYPE_AS_PYOBJECT
        /* Don't copy PyObject_HEAD part */
        memset((char *)dtype + sizeof(PyObject),
Contributor:

For me (and thus for future people looking at this...), it would be helpful to expand on why this is done. Does it get filled out later by actually instantiating? I realize links to documentation can get out of sync, but might still be better than nothing.

Looking at the documentation for PyObject_New, I see "Fields not defined by the Python object header are not initialized" - so, is the point here to undo the initialization that is done? But why?

Member:

I think the goal here is to ensure all the tp_ slots are initialized to null, without having to list them all.

Contributor:

My question was whether it should be zeroed: presumably PyObject_New initializes this section (since it states that other parts are not initialized, suggesting that these parts are), and, again presumably, this is with stuff from PyArrayDescr_Type. If those presumptions hold, are those items wrong? Or might they have, e.g., init/new functions that can just be used?

Member:

No, PyObject_New does not initialize this section, because it's not part of the PyObject head - in the same way that PyObject_New does not initialize a new instance of PyArrayObject.

Member:

The comment here is wrong though - the word "copy" is no longer appropriate.

Contributor:

Arggh, I just read it too quickly - of course the code zeros everything beyond PyObject, which definitely makes sense.

@mhvk (Contributor) commented Dec 22, 2018

Had a quick look at the basic ideas for this again, and wondered whether perhaps something is missing: shouldn't the subclassing be more hierarchical, so that, e.g., all integer dtypes are subclasses of a base integer type? That would allow one to replace the current dtype.kind == 'i' style of checks with a simple issubclass (for dtypes; isinstance for scalars), and would also allow easy registration in numbers.Integral (see the sketch below). Though I'm not sure, in that case, what an instance of that general integer dtype would look like.
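
A sketch of that hierarchy in plain Python (all names hypothetical):

import numbers

class DtypeMeta(type):
    pass

class integer(metaclass=DtypeMeta):
    """Hypothetical abstract base for all integer dtypes."""

class int32(integer):
    pass

# dtype.kind == 'i' checks become subclass/instance checks:
issubclass(int32, integer)           # True

# and registration in the numeric tower becomes a one-liner:
numbers.Integral.register(integer)
issubclass(int32, numbers.Integral)  # True, via the registered base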

More generally, it seems that there really should be a NEP, arguably about this change but certainly about the larger plans. Just as one example, some of the ideas about having dtype provide __array_ufunc__ might be resolved if array instances became instances of both a base array class and the dtype.

@eric-wieser (Member) commented Dec 27, 2018

@mattip, I think this is really close - just needs the test change I mention above, and a better comment by the memset.

@mhvk: I think the hierarchy can come later. Lifting dtype to a metatype seems like the right first step to me. Some questions we should answer down the road:

  • Should we merge the scalar types and dtypes sooner rather than later?
  • Do we need np.integer, np.floating, and the other abstract classes to be dtypes themselves? Or can they become abstract base classes? Making them instances of dtypes is perhaps nonsensical, as np.integer.itemsize doesn't make any sense. Making them metaclasses (subclasses of dtype) would solve this problem, but might be confusing.

@njsmith (Member) commented Dec 27, 2018

@eric-wieser

Should we merge the scalar types and dtypes sooner rather than later?

I can't tell if merging them makes sense at all. It'd be nice to have a description of this plan written down somewhere, e.g. as a NEP :-).

@mhvk

Just as one example, some of the ideas about having dtype provide __array_ufunc__ might be solved if array instances became instances of both a base array class and the dtype.

I don't see the benefit of dtypes providing __array_ufunc__, and making array instances a subclass of dtypes doesn't make sense to me at all.

It sounds like we're definitely not all on the same page here, which usually means we need to stop and talk things out just so everyone knows what's going on.

@eric-wieser (Member):

It'd be nice to have a description of this plan written down somewhere, e.g. as a NEP :-).

I think it's worth acknowledging that there are a bunch of orthogonal and sequential changes being proposed, and I think it would be wise to try to segregate them into multiple NEPs.

@mattip (Member Author) commented Dec 27, 2018

I am working on drafting a NEP to describe the motivation for changing instances of np.dtype into type subclasses.

@mhvk (Contributor) commented Dec 27, 2018

@mattip - great to have something concrete to look at!

@njsmith - the comment I made came from the suggestion for an __dtype_ufunc__ - I should have been clearer... I wondered if any array could be a mixed subclass of an array container - which essentially deals with shapes - and the dtype - which tells how to do arithmetic on the elements. In some sense this was already somewhat the case, with things like comparison (and dot) living on the dtype; this would take it to the extreme. But I'm not at all sure this would be a good idea! So, again, 👍 to a NEP.

@mattip (Member Author) commented Jan 3, 2019

In light of the discussion on NEP 29 (#12630), I will refactor this to define a DtypeMeta class. numpy.dtype will use it as its metaclass, and numpy.dtype(x) will go through its __call__ slot. At startup, we will still create singleton instances of the builtin dtypes. A rough sketch of the intended shape is below.
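
Illustrative only - the real work happens in C, and this sketch interns singletons on first use rather than creating them up front at startup:

_singletons = {}

class DtypeMeta(type):
    # numpy.dtype(x) enters here via the metaclass __call__ slot
    def __call__(cls, obj):
        try:
            return _singletons[obj]        # reuse the cached singleton
        except KeyError:
            inst = super().__call__(obj)   # normal __new__/__init__ path
            _singletons[obj] = inst
            return inst

class dtype(metaclass=DtypeMeta):
    def __init__(self, obj):
        self.obj = obj

assert dtype(int) is dtype(int)            # singleton behavior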

The inline review below is on this hunk:

        PyObject_HEAD
    #else
        /* may need to be PyHeapTypeObject for new tp_as_* functions? */
        PyTypeObject descrtype;
Member:

Yes, this needs to be PyHeapTypeObject. The Cython team explored this and ran into the same problem until they made meta-types inherit from PyHeapTypeObject. See, for example, https://mail.python.org/pipermail/cython-devel/2013-September/003813.html

Suggested change:

    -    PyTypeObject descrtype;
    +    PyHeapTypeObject descrtype;

Member:

I'm not sure this applies to us. From that thread:

However, when MyClass is created, an instance of struct __pyx_obj_4meta_MetaClass (cc. 200 + 4 bytes) is dynamically allocated by the Python memory management machinery. The machinery then tries to initialize the allocated memory.

In our case, the instance of the dtype is malloc'd and initialized by arraydescr_new, so we're in complete control.

As far as I can tell, the main purpose of PyHeapTypeObject is to convert builtin magic methods into slots, something we probably don't care about too much for dtypes.

@mattip (Member Author) commented Sep 3, 2019

Closing. This seems to not be the right approach.

@mattip closed this on Sep 3, 2019
@mattip deleted the dtype-refactor2 branch on June 8, 2020