
WIP: refactor dtype to be a type subclass #12585


Closed · wants to merge 6 commits

Conversation

@mattip (Member) commented Dec 18, 2018

Replaces #12430. This first step refactors the global descr creation to use PyObject_New. Note the new define USE_DTYPE_AS_PYOBJECT, which preserves the old behavior. Once the new dtype works, we can flip the default.
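
As a rough Python-level analogy of the two modes (the names below are illustrative, not NumPy's actual code):

# Old behavior (USE_DTYPE_AS_PYOBJECT): dtype instances are plain objects.
class dtype_old:
    pass

# New behavior: dtype subclasses type, so each dtype instance is itself
# a class, and np.dtype becomes a metaclass.
class dtype_new(type):
    pass

i_old = dtype_old()
isinstance(i_old, type)               # False

i_new = dtype_new('int32', (), {})    # three-argument type() form
isinstance(i_new, type)               # True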

@mattip force-pushed the dtype-refactor2 branch 14 times, most recently from 21dcbcf to 2efd78b, on December 18, 2018 17:07
@mattip (Member Author) commented Dec 18, 2018

The first commit, 59ad4d3, is a refactor of existing code and passes all tests except the codecov check.

The next batch of commits are attempts to get a job running on Azure Pipelines with USE_DTYPE_AS_PYOBJECT=0, which exercises the new code paths and makes dtype instances instances of Python's type rather than of object. In 2efd78b I finally seem to have got it right.

The code errors out; somehow the class methods are not being bound. The new Azure job fails since np.dtype(scalar).newbyteorder(endian) resolves to an unbound method:

ERROR collecting build/testenv/lib/python3.5/site-packages/numpy/lib/tests/test_format.py 
    numpy/lib/tests/test_format.py:330: in <module>
    dtype = np.dtype(scalar).newbyteorder(endian)
E   TypeError: descriptor 'newbyteorder' requires a 'numpy.dtype' object but received a 'str'

This commit produces the following output (note unbound methods need a class object):

>>> np.dtype.mro(np.dtype)
[<class 'numpy.dtype'>, <class 'type'>, <class 'object'>]

>>> issubclass(np.dtype, type)
True

>>> i = np.dtype('int32')
>>> type(i)
<class 'numpy.dtype'>

>>> isinstance(i, type)
True

Now if I could only figure out how to bind those class methods ...

@eric-wieser (Member) commented Dec 19, 2018

Now if I could only figure out how to bind those class methods ...

Perhaps METH_CLASS? Although thinking about the equivalent python, that shouldn't be necessary...
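
For reference, the Python-level behavior being appealed to (hypothetical names): a plain method defined on a metaclass already binds when accessed through an instance of that metaclass, i.e. through the class itself, so no classmethod/METH_CLASS marker should be needed:

class Meta(type):
    def newbyteorder(cls, endian='='):
        # binds with cls = the class being accessed, because that
        # class is an *instance* of Meta
        return '%s reordered to %r' % (cls.__name__, endian)

class Int32(metaclass=Meta):
    pass

Int32.newbyteorder('<')   # works: bound method, cls is Int32
Meta.newbyteorder         # unbound plain function on the metaclass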

@eric-wieser (Member):

Can you show the output of

np.dtype(scalar).newbyteorder

and

np.dtype.newbyteorder

@eric-wieser (Member):

I think USE_DTYPE_AS_PYTYPEOBJECT would be clearer than the inverse USE_DTYPE_AS_PYOBJECT, since PyTypeObjects are themselves PyObjects

@mattip changed the title from "MAINT: refactor descr creation from stack to using PyObject_New" to "WIP: refactor dtype to be a python type" on Dec 19, 2018
@eric-wieser changed the title from "WIP: refactor dtype to be a python type" to "WIP: refactor dtype to be a type subclass" on Dec 19, 2018
@mattip (Member Author) commented Dec 19, 2018

I think USE_DTYPE_AS_PYTYPEOBJECT would be clearer

I want to use a value that means "use old behaviour," so that the default (off) means "use new behaviour"; we can leave it in for a few deprecation cycles for backward compatibility.

@mattip (Member Author) commented Dec 19, 2018

New (broken) code:

>>> import numpy as np
>>> np.dtype.newbyteorder
<method 'newbyteorder' of 'numpy.dtype' objects>
>>> np.dtype(int).newbyteorder
<method 'newbyteorder' of 'numpy.dtype' objects>
>>> np.dtype.mro()
Traceback (most recent call last):
  File "<console>", line 1, in <module>
TypeError: descriptor 'mro' of 'type' object needs an argument 

Old (working) code:

>>> np.dtype.newbyteorder
<method 'newbyteorder' of 'numpy.dtype' objects>
>>> np.dtype(int).newbyteorder
<built-in method newbyteorder of numpy.dtype object at 0x7fceac911580>
>>> np.dtype.mro()
[<class 'numpy.dtype'>, <class 'object'>]

The only change to np.dtype is the base class, so maybe I need to call something after/before PyType_Ready to bind the methods.

@mattip (Member Author) commented Dec 19, 2018

Also
New code

>>> np.dtype.mro
<method 'mro' of 'type' objects>

Old code

>>> np.dtype.mro
<built-in method mro of type object at 0x7fceac914280>

@eric-wieser (Member) commented Dec 19, 2018

dtype.mro() breaking is expected here, I think - you need to spell it as type.mro(dtype). It happens with all Python metaclasses.

dtype(int).mro() should work though.
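
A plain-Python illustration of both points (Meta is a hypothetical stand-in):

class Meta(type):
    pass

Meta.mro          # <method 'mro' of 'type' objects> -- found unbound in
                  # Meta's own MRO, so Meta.mro() raises the TypeError above
type.mro(Meta)    # [Meta, type, object] -- the explicit spelling works

class C(metaclass=Meta):
    pass

C.mro()           # [C, object] -- fine: 'mro' binds via C's metaclass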

@eric-wieser (Member):

I think your copy of the PyTypeObject is causing the breakage - it's copying the class methods onto the instances, which you don't want to do.

@mattip (Member Author) commented Dec 19, 2018

It is more convoluted than that. Attribute lookup usually goes to the object __dict__, which gets filled out at object instantiation. That is not happening here; the lookup is going to the instance's descrtype (the PyType_Type member of the PyArray_Descr struct). I think the object instantiation is not creating a __dict__.

@eric-wieser (Member) commented Dec 19, 2018

What happens if you remove the memcpy?

the lookup is going to the instance's descrtype (the PyType_Type member of the PyArray_Descr struct).

Yes, exactly. I think the problem is that descr->descrtype should have no methods, but you're copying in the ones from the type object. I think what you're doing is equivalent to:

class dtype(type):
    def __new__(metacls, name, bases, dict):
        # this copy breaks methods - don't do it!
        for k in dir(metacls):
            try:
                dict[k] = getattr(metacls, k)
            except AttributeError:
                pass

        return super().__new__(metacls, name, bases, dict)

class bad_class(metaclass=dtype):
    pass

class good_class:
    pass

good_class.mro
#<built-in method mro of type object at 0x0000021A4A6F6F78>
bad_class.mro
#<method 'mro' of 'type' objects>

@mattip (Member Author) commented Dec 19, 2018

No more crashes; tests run to completion. There are failures around pickling and creating new instances.

@mattip (Member Author) commented Dec 20, 2018

Next step: get pickle.dumps(np.dtype(int)) to work. The problem is this logic in _pickle.c, which has a different path for type objects. It seems I need to register a function in a dispatch_table to work around that.

@mhvk (Contributor) commented Dec 20, 2018

Is it going wrong in save_global? Maybe it is just a matter of setting attributes that pickle expects to be present? (Sorry, ignore if these are just silly wild guesses.)

@mattip (Member Author) commented Dec 20, 2018

@mhvk: it is because the function resolution goes:

  1. Look up a __reduce__ function in the dispatch_table
  2. If the object is a type object, simply save it by reference
  3. If the object has a __reduce__, use it

The old code falls through to 3; I registered a dispatch_table entry for the new code to avoid 2. A sketch of the workaround follows.
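
A minimal pure-Python sketch of that workaround (Meta and Widget are hypothetical stand-ins; NumPy registers the equivalent from C):

import copyreg
import pickle

class Meta(type):
    pass

Widget = Meta('Widget', (), {})   # an instance of Meta, i.e. a class

def reduce_meta(cls):
    # Rebuild the class by calling the metaclass, instead of letting
    # pickle save it by reference as a global (step 2 above).
    ns = {k: v for k, v in vars(cls).items()
          if k not in ('__dict__', '__weakref__')}
    return (Meta, (cls.__name__, cls.__bases__, ns))

copyreg.pickle(Meta, reduce_meta)   # step 1 now wins over step 2

clone = pickle.loads(pickle.dumps(Widget))
assert clone.__name__ == 'Widget'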

@mattip (Member Author) commented Dec 20, 2018

Down to one failing test, which seems to need an implementation of dtype('f8').__call__.

The inline review below is on this hunk:

     * #name2 = HALF, DATETIME, TIMEDELTA#
     */
    #if USE_DTYPE_AS_PYOBJECT
        dtype = PyObject_New(PyArray_Descr, &PyArrayDescr_Type);
Member:

What is the advantage of heap allocating here vs the previous static allocation?

@mattip (Member Author):

Personal preference. At some point subinterpreter support may become mature enough that we will want to allow reloading the NumPy module, and thinking about what that means for static allocation gets me confused.

@eric-wieser (Member) commented Dec 21, 2018

down to one failing test

The problem is that this patch changes the value of isinstance(np.dtype(int), collections.Callable).

The test is just bad here. It should use for attrib, value in inspect.getmembers(a, inspect.isroutine) rather than filtering by callable, since types are not methods. A sketch of the distinction is below.
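
A toy illustration of the distinction (Meta is a hypothetical stand-in for the patched np.dtype):

import inspect
from collections.abc import Callable

class Meta(type):
    pass

Int32 = Meta('Int32', (), {})      # a "dtype instance" is now itself a class

isinstance(Int32, Callable)         # True: classes are callable, so the
                                    # test's callable-based filter matches it

class HasAttrs:
    dtype = Int32
    def method(self):
        pass

names = [n for n, v in inspect.getmembers(HasAttrs, inspect.isroutine)]
'method' in names                   # True
'dtype' in names                    # False: a class is not a routine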

The inline review below is on this hunk:

        dtype = PyObject_New(PyArray_Descr, &PyArrayDescr_Type);
    #ifndef USE_DTYPE_AS_PYOBJECT
        /* Don't copy PyObject_HEAD part */
        memset((char *)dtype + sizeof(PyObject),
Contributor:

For me (and thus for future people looking at this...), it would be helpful to expand on why this is done. Does it get filled out later by actually instantiating? I realize links to documentation can get out of sync, but might still be better than nothing.

Looking at the documentation for PyObject_New, I see "Fields not defined by the Python object header are not initialized" - so, is the point here to undo the initialization that is done? But why?

Member:

I think the goal here is to ensure all the tp_ slots are initialized to null, without having to list them all.

Contributor:

My question was whether it should be zeroed: presumably PyObject_New initializes this section (since it states that other parts are not initialized, suggesting that these parts are), and, again presumably, this is with stuff from PyArrayDescr_Type. If those presumptions hold, are those items wrong? Or might they have, e.g., init/new functions that can just be used?

Member:

No, PyObject_New does not initialize this section, because it's not part of the PyObject head - in the same way that PyObject_New does not initialize a new instance of PyArrayObject.

Member:

The comment here is wrong though - the word "copy" is no longer appropriate.

Contributor:

Arggh, I just read it too quickly - of course the code zeros everything beyond PyObject, which definitely makes sense.

@mhvk (Contributor) commented Dec 22, 2018

Had a quick look at the basic ideas for this again, and wondered whether perhaps something is missing: shouldn't the subclassing be more hierarchical, so that, e.g., all integer dtypes are subclasses of a base integer type? That would allow one to replace the current dtype.kind == 'i' style of checks with a simple issubclass (for dtypes; isinstance for scalars), and would also allow easy registration in numbers.Integral (see the sketch below). Though I'm not sure, in that case, what an instance of that general integer dtype would look like.
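
A sketch of that hierarchy in plain Python (all names hypothetical):

import numbers

class DtypeMeta(type):
    pass

class integer(metaclass=DtypeMeta):
    """Hypothetical abstract base for all integer dtypes."""

class int32(integer):
    pass

# dtype.kind == 'i' checks become subclass/instance checks:
issubclass(int32, integer)           # True

# and registration in the numeric tower becomes a one-liner:
numbers.Integral.register(integer)
issubclass(int32, numbers.Integral)  # True, via the registered base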

More generally, it seems that there really should be a NEP, arguably about this change but certainly about the larger plans. Just as one example, some of the ideas about having dtype provide __array_ufunc__ might be resolved if array instances became instances of both a base array class and the dtype.

@eric-wieser (Member) commented Dec 27, 2018

@mattip, I think this is really close - just needs the test change I mention above, and a better comment by the memset.

@mhvk: I think the hierarchy can come later. Lifting dtype to a metatype seems like the right first step to me. Some questions we should answer down the road:

  • Should we merge the scalar types and dtypes sooner rather than later?
  • Do we need np.integer, np.floating, and the other abstract classes to be dtypes themselves? Or can they become abstract base classes? Making them instances of dtypes is perhaps nonsensical, as np.integer.itemsize doesn't make any sense. Making them metaclasses (subclasses of dtype) would solve this problem, but might be confusing.

@njsmith (Member) commented Dec 27, 2018

@eric-wieser

Should we merge the scalar types and dtypes sooner rather than later?

I can't tell if merging them makes sense at all. It'd be nice to have a description of this plan written down somewhere, e.g. as a NEP :-).

@mhvk

Just as one example, some of the ideas about having dtype provide __array_ufunc__ might be solved if array instances became instances of both a base array class and the dtype.

I don't see the benefit of dtypes providing __array_ufunc__, and making array instances a subclass of dtypes doesn't make sense to me at all.

It sounds like we're definitely not all on the same page here, which usually means we need to stop and talk things out just so everyone knows what's going on.

@eric-wieser (Member):

It'd be nice to have a description of this plan written down somewhere, e.g. as a NEP :-).

I think it's worth acknowledging that there are a bunch of orthogonal and sequential changes being proposed, and I think it would be wise to try to segregate them into multiple NEPs.

@mattip (Member Author) commented Dec 27, 2018

I am working on drafting a NEP to describe the motivation for changing instances of np.dtype into type subclasses.

@mhvk (Contributor) commented Dec 27, 2018

@mattip - great to have something concrete to look at!

@njsmith - the comment I made came from the suggestion for an __dtype_ufunc__ - I should have been clearer... I wondered if any array could be a mixed subclass of an array container - which essentially deals with shapes - and the dtype - which tells how to do arithmetic on the elements. In some sense this was already somewhat the case, with things like comparison (and dot) living on the dtype; this would take it to the extreme. But I'm not at all sure this would be a good idea! So, again, 👍 to a NEP.

@mattip (Member Author) commented Jan 3, 2019

In light of the discussion on NEP 29 (#12630), I will refactor this to define a DtypeMeta class. numpy.dtype will use it as its metaclass, and numpy.dtype(x) will go through its __call__ slot. At startup, we will still create singleton instances of the builtin dtypes. A rough sketch of the intended shape is below.
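
Illustrative only - the real work happens in C, and this sketch interns singletons on first use rather than creating them up front at startup:

_singletons = {}

class DtypeMeta(type):
    # numpy.dtype(x) enters here via the metaclass __call__ slot
    def __call__(cls, obj):
        try:
            return _singletons[obj]        # reuse the cached singleton
        except KeyError:
            inst = super().__call__(obj)   # normal __new__/__init__ path
            _singletons[obj] = inst
            return inst

class dtype(metaclass=DtypeMeta):
    def __init__(self, obj):
        self.obj = obj

assert dtype(int) is dtype(int)            # singleton behavior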

The inline review below is on this hunk:

        PyObject_HEAD
    #else
        /* may need to be PyHeapTypeObject for new tp_as_* functions? */
        PyTypeObject descrtype;
Member:

Yes, this needs to be PyHeapTypeObject. The Cython team explored this and ran into the same problem until they made meta-types inherit from PyHeapTypeObject. See, for example, https://mail.python.org/pipermail/cython-devel/2013-September/003813.html

Suggested change:

    -    PyTypeObject descrtype;
    +    PyHeapTypeObject descrtype;

Member:

I'm not sure this applies to us. From that thread:

However, when MyClass is created, an instance of struct __pyx_obj_4meta_MetaClass (cc. 200 + 4 bytes) is dynamically allocated by the Python memory management machinery. The machinery then tries to initialize the allocated memory.

In our case, the instance of the dtype is malloc'd and initialized by arraydescr_new, so we're in complete control.

As far as I can tell, the main purpose of PyHeapTypeObject is to convert builtin magic methods into slots, something we probably don't care about too much for dtypes.

@mattip (Member Author) commented Sep 3, 2019

Closing. This seems to not be the right approach.

@mattip closed this on Sep 3, 2019
@mattip deleted the dtype-refactor2 branch on June 8, 2020