The epic dtype cleanup plan

[this is very drafty notes-to-self that I'm hitting save on now because I was dumb and typed it into a webform where it might get lost, and because I'm lazy and don't want to pull it out to stick in some file on my hard-drive where I'll lose it. So no guarantees any of this actually makes sense or anything, but since we've been running into a number of nasty aspects of dtypes recently I wanted to write down a bunch of ideas together so we can try and organize them somewhat. Feel free to comment even at this early date. Also, don't tell anyone, but part of the secret goal here is enabling the NA dtype work too (though not all of this is a prerequisite for that by any means).]

End goal: dtypes basically function like Python objects with respect to subclassing etc.; isinstance/issubclass work in a useful way; parametric dtypes are not horrible; user-defined dtypes are on closer-to-equal footing with built-in dtypes.

Pass dtypes to ufunc inner loops (see "What should be the calling convention for ufunc inner loop signatures?", #12518).
Make dtypes immutable, and make sure that we don't actually allocate new ones on ufunc return or when passing cast data to inner loops, but instead just incref-and-go. (Immutability is a prerequisite for this.)
Move all the special case operations for void/string/datetime dtypes out of the core ufunc code and into ufunc loops.
Merge umath.so and multiarray.so; the split just ends up making us contort code to let them interface with each other (see "ENH: implement nep 0015: merge multiarray and umath", #10915).
Make overflow/etc. error flag clearing/setting/checking into methods on dtypes. Use this to split the integer and float error flag handling. ("Stop mixing up integer and floating point overflow", #2898, is meant to make this possible, but it leaves the problem of arranging for the correct functions to be called on the correct loops. This is the clean way to call the correct pieces at the correct time, with the bonus of allowing custom dtypes to do their own error signaling, or not, as they prefer.) (This is an example of a place where having to export this interface between multiarray.so, where dtypes live, and umath.so, where ufuncs live, would be pointlessly cumbersome.)
Make dtype (the class) inherit from type and define __instancecheck__ and __subclasscheck__ methods, so isinstance(foo, dtype_instance) can do something useful. (This will completely change the dtype struct's memory layout, though: type is a huge object, and all of dtype's fields will get shifted down in memory.)
Give all the dtype functions self arguments. Turn them into the equivalent of cpdef methods. (Not sure how we make this work -- follow the MRO when doing dtype operations? Or, in dtype.__new__, copy parent fn ptrs to undefined child fn ptrs and, ha ha, no, multiple inheritance is not supported?)
Make it possible to add new dtype methods without breaking the memory representation in the future (right now they have to go in the middle of the dtype struct...)
Make it possible to define new dtypes in Python. (A set of C-level dtype methods that just call python-level methods)
Re-arrange all the weird stuff in the current dtype memory representation so the parts that are specific to a single dtype go into that subclass (e.g. the field names, the datetime unit, etc., should not be part of the core dtype object)
...figure out how to do this without completely messing up existing code. Really, only code that actually messes with dtypes should be affected (like code that defines new dtypes). Binary compatibility is much harder than source compatibility -- can we tell people to recompile? Possibility: make sure we can distinguish a pointer-to-old-dtype-object from a pointer-to-new-dtype-object (should be easy, all PyObject*'s are compatible enough to let you get their type), and at all the public entry points do if (!(arg = CheckOrConvertThisDtypeThing(arg))) return -1;, where that function sets up a new-style dtype in place of the old-style one while raising a DeprecationWarning. This might not even be so horrible, though we would want to make sure we did all the binary-compatibility-breaking changes at once. (This still won't help for user code that reaches into dtype objects by hand, though, as opposed to just creating them. It's possible that all user dtypes do this, so eh.)
Rationalize the dtype constructor arguments. Ideally move the horrible 'build a dtype from strings and tuples and sealing wax' stuff to Python code.
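
To make the "dtype inherits from type" item above a bit more concrete, here is a rough pure-Python sketch of the idea, not anything NumPy actually implements. The names DType, scalar_types, number, integer and floating are all made up for illustration. Each dtype is itself a class (an instance of DType), so issubclass works through ordinary inheritance, and __instancecheck__ is overridden so that isinstance(value, some_dtype) answers "could this value belong to that dtype or one of its children?".

```python
class DType(type):
    """Hypothetical stand-in for the dtype class: every dtype is itself a class."""

    def __new__(mcls, name, bases=(), *, scalar_types=()):
        cls = super().__new__(mcls, name, bases, {})
        # Python scalar types this dtype accepts directly (an assumption of the sketch).
        cls.scalar_types = tuple(scalar_types)
        return cls

    def __init__(cls, name, bases=(), *, scalar_types=()):
        super().__init__(name, bases, {})

    def __instancecheck__(cls, obj):
        # A value is an instance if this dtype accepts it directly...
        if isinstance(obj, cls.scalar_types):
            return True
        # ...or if any dtype derived from this one accepts it.
        return any(isinstance(obj, sub) for sub in cls.__subclasses__())


# A tiny hierarchy: an abstract parent with two concrete children.
number = DType("number")
integer = DType("integer", (number,), scalar_types=(int,))
floating = DType("floating", (number,), scalar_types=(float,))
```

With this, isinstance(3, integer) and isinstance(3, number) both hold, and issubclass(integer, number) falls out of the default type machinery, so only __instancecheck__ needs custom logic in the sketch. The memory-layout concern from the list does not show up at this level; it only bites in the C struct.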

Note: there is an argument for getting all the dtype struct changes out of the way before exposing them to ufunc inner loops, since that will open the door to more user code that depends on the contents of dtype structs. Of course if the problem is already as bad as it could possibly be, then this doesn't matter :-). And letting ufuncs get at dtypes is probably the single biggest win on this list, so it would be nice to prioritize it.
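
The check-or-convert idea for public entry points can also be sketched in Python, just to show the control flow. Everything here is hypothetical: OldStyleDescr, NewStyleDType, NewStyleDatetime, check_or_convert and public_itemsize are invented names, and the real thing would be C code working on PyObject* pointers. The point is only that old-style objects are recognised by their type, swapped for a new-style equivalent, and flagged with a DeprecationWarning.

```python
import warnings


class OldStyleDescr:
    """Hypothetical legacy layout: dtype-specific data lives on the core object."""

    def __init__(self, kind, itemsize, unit=None):
        self.kind = kind
        self.itemsize = itemsize
        self.unit = unit  # only meaningful for datetimes


class NewStyleDType:
    """Hypothetical new layout: only the fields every dtype needs."""

    def __init__(self, kind, itemsize):
        self.kind = kind
        self.itemsize = itemsize


class NewStyleDatetime(NewStyleDType):
    """Datetime-specific state (the unit) moves into the subclass."""

    def __init__(self, itemsize, unit):
        super().__init__("M", itemsize)
        self.unit = unit


def check_or_convert(arg):
    """Return a new-style dtype, converting (and warning) if given an old one."""
    if isinstance(arg, NewStyleDType):
        return arg
    if isinstance(arg, OldStyleDescr):
        warnings.warn(
            "old-style dtype objects are deprecated; please recompile",
            DeprecationWarning,
            stacklevel=2,
        )
        if arg.unit is not None:
            return NewStyleDatetime(arg.itemsize, arg.unit)
        return NewStyleDType(arg.kind, arg.itemsize)
    raise TypeError(f"not a dtype: {arg!r}")


def public_itemsize(arg):
    """Example public entry point: convert first, then use only the new layout."""
    dtype = check_or_convert(arg)
    return dtype.itemsize
```

As noted in the list, this only covers code that creates dtypes and hands them over; code that reaches into the old struct by hand gets no help from a shim like this.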

The epic dtype cleanup plan #2899
