WIP, NEP: add dtype design NEP #12630
Conversation
The NEP has two alternatives. I am not sure which is easier to follow, or whether there is a better third choice.
```python
class IntDtype(Dtype):
    def __repr__(self):
        if self is IntDescr:
            return type.__repr__(self)
```
I don't think this branch is reachable.
Please do replace `self` with `cls` here. I think that also answers @eric-wieser's doubt. (Though I think the `if` statement is reversed in that case.)
It is hit in `a = Dtype.Int8Descr; print(a.mro(a))`
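For anyone following along, here is a minimal standalone sketch (with illustrative names, not the NEP's actual classes) of when a metaclass `__repr__` branch like this fires — it is reached only when repr-ing the base class itself:

```python
class Meta(type):
    # __repr__ on a metaclass receives the *class* being represented,
    # so the first argument is really a class ("cls"), not an instance.
    def __repr__(cls):
        if cls.__name__ == "Base":
            # fall back to the default type repr for the base class itself
            return type.__repr__(cls)
        return "dtype(%s)" % cls.__name__.lower()

class Base(metaclass=Meta):
    pass

class Int8(Base):
    pass

print(repr(Int8))   # the metaclass __repr__ is used for subclasses
```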
I'm confusing `IntDescr` with `IntDtype`.
```python
            return type.__repr__(self)
        return 'dtype(%s%d)' % (self.kind, self.itemsize)


class GenericDescr(type, metaclass=Dtype):
```
This looks off to me - do you really want a second layer of metaclasses? I think `type` here should be `object`.
Then the descriptor is not a type, it is an object with a `__call__` method. Does that make more sense?

```python
>>> a = dtype.Dtype(np.uint8)
>>> a.mro()
[dtype(uint8), <class 'dtype.IntDescr'>, <class 'dtype.GenericDescr'>, <class 'object'>]
>>> a.__call__
<bound method Dtype.__call__ of dtype(uint8)>
```
I think I am confused like @eric-wieser -- subclassing `object` makes oneself a class, which I think is what we want here: the `GenericDescr` class is meant (eventually) to produce instances, not other classes. I think the `mro` that you show makes sense.
ok
@mattip - thanks for putting this together! I think for someone like me it would be very helpful to provide a bit more background (which will likely end up straight in the documentation, so worth writing!). If I understand correctly, … What confuses me, however, is the parallel chain of subclasses. Why is there an …
Some more detailed comments. It may be worth mentioning that a disadvantage of the "`array.dtype` is a descriptor instance" option would be that the future extension of making numpy scalars instances of the descriptor becomes much harder.
```python
        else:
            return super().__new__(cls, obj, *args, **kwargs)

    def __call__(self, *args, **kwargs):
```
Maybe replace `self` with `cls` here, since it is the class that is calling this. Also, for clarity, might it make more sense to define this as `GenericDescr.__new__` rather than as a `__call__` method here? As written, it ensures none of the descriptor classes can ever define a useful `__new__`.
This is where the scalar instance is created: `a = dtype.Int8Descr; s = a(10)`
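To make the control flow concrete, here is a hedged, self-contained sketch (illustrative names only, not the NEP's code) of the pattern: calling a descriptor class routes through the metaclass `__call__`, which is where the instance is built.

```python
class Dtype(type):
    def __call__(cls, *args, **kwargs):
        # `cls` is the descriptor class being called (e.g. Int8Descr);
        # super().__call__ runs the usual __new__/__init__ machinery.
        instance = super().__call__(*args, **kwargs)
        instance.created_by = cls.__name__  # marker to show we intercepted
        return instance

class Int8Descr(int, metaclass=Dtype):
    pass

s = Int8Descr(10)   # goes through Dtype.__call__, not a plain int() call
```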
Yes, but it took me a long time to realize that is what was happening. Which is why I suggested it were done with `GenericDescr.__new__` instead. It seems something the class should do, not the metaclass.
```python
class Dtype(type):

    def __new__(cls, obj, *args, **kwargs):
```
Maybe write `mcls` to avoid confusion (by me, at least...)
+1
```python
            # Dtype('int8') or Dtype('S10') or record descr
            return create_new_descr(cls, obj, *args, **kwargs)
        else:
            return super().__new__(cls, obj, *args, **kwargs)
```
This would be the place where someone is subclassing, say, `IntDescr`, correct?
It is hit when instantiating the `IntDescr` class itself during module import (creating the `IntDescr` class type).
OK, maybe good to add a comment...
This NEP proposes a different class hierarchy. Objects of ``np.dtype`` will become type objects with a hierarchical ``mro`` like scalars. They will support subclassing. A future NEP may propose that instantiating a dtype type object
I think this might need clarification: if I understand correctly, one would subclass a descriptor to create something new; one would generally not subclass `np.dtype` itself, since that is the metaclass.
Correct. Is it clearer to say "`ndarray` descriptors will have `np.dtype` as their metaclass with a hierarchical ..."? I am not sure how to phrase the idea.
which is more difficult to reason about than object instances. For instance, note that in the ``Colors`` example, we did not instantiate an object of the ``Colors`` type, rather used that type directly in the ndarray creation. Also the ``format`` function is not a bound method of a class instance, rather an
I think this can be circumvented by thinking of it as `__format__`, and that what the print machinery would actually do (once we allow instances to be scalars) is `Color(array_scalar).__format__()`.
The idea is that `arr.dtype.format` would replace the `formatfunction` in arrayprint. I don't think we should be requiring scalar instantiation in each iteration of a loop to stringify array values.
This is a precursor to more extensively using method overloading to replace much of the case-based lookups scattered around the code, which is key to making subclasses of dtypes work.
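As a hedged illustration of what "method lookup on the dtype" could look like (this is not NumPy's actual arrayprint code; all class and function names here are hypothetical):

```python
# Dispatch stringification through the descriptor class hierarchy
# instead of a case-based lookup keyed on dtype kind.
class GenericDescr:
    @classmethod
    def format(cls, value):
        return str(value)

class IntDescr(GenericDescr):
    @classmethod
    def format(cls, value):
        return '%d' % value

class FloatDescr(GenericDescr):
    @classmethod
    def format(cls, value):
        return '%.2f' % value

def stringify(descr, values):
    # no if/elif chain here: the descriptor's own method is looked up,
    # so subclasses override formatting automatically
    return [descr.format(v) for v in values]
```

A subclass of `IntDescr` that overrides `format` would then change array printing without touching `stringify` at all.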
(I agree we would not need to instantiate; I only meant it as an example of what would be possible.) Note that `Color.__format__(array_scalar)` would do exactly the same thing, so my real question is why we are introducing a new class method here when a standard python one would do. Especially since surely eventually we are going to allow passing on format strings, just like for `__format__`...

p.s. The discussion with @njsmith really needs a solution before moving much further. In his scheme, I think one would format using `dtype.type.__format__(value)`
```python
        else:
            return super().__new__(cls, obj, *args, **kwargs)

    def __call__(self, args, kwargs):
```
This can be removed.
```python
        return super().__call__(self.__name__, args, kwargs)


class GenericDescr(type, metaclass=Dtype):
    def __new__(cls, *args, **kwargs):
```
This should also be removed, as it doesn't do anything. Why is `GenericDescr` a metaclass? Probably, I just do not understand this example...
I removed `__new__`. `GenericDescr` is the base class of the descriptor hierarchy; it is not meant to be a metaclass.
```python
        return self.typeobj(*args, **kwargs)


class IntDescr(GenericDescr):
    def format(value):
```
Now it would be `format(self, value)`, no?
```python
        return 'dtype(%s%d)' % (self.kind, self.itemsize)


class UInt8Descr(IntDescr):
```
In this example, wouldn't `uint8descr` be an instance of `IntDescr`, i.e., `uint8descr = IntDescr(<something>)`?
This NEP proposes a different class hierarchy. Objects of ``np.dtype`` will become type objects with a hierarchical ``mro`` like scalars. They will support subclassing. A future NEP may propose that instantiating a dtype type object will produce a scalar reflecting that dtype, but that is not a goal of this NEP.
It may be nice to clarify this a bit further: the descriptors will then be similar to built-in classes like `float` and `int`.
I don't think I quite follow all the ideas here (only did a quick skim), but I think it's helpful to clarify one point: the use of metaclasses here is AFAICT totally orthogonal to the way I've been proposing to use metaclasses for dtypes. Specifically: in this proposal, dtype objects are themselves types, i.e. … In my proposal, dtype objects are regular instances of the …

```
class DtypeMeta(type):
    cdef int (*unbox_method)(PyObject* self, PyObject* source, char* dest)
    cdef PyObject* (*box_method)(PyObject* self, char* source)
    ...

class dtype(metaclass=DtypeMeta):
    ...

class categorical(dtype):
    ...

np.array(..., dtype=categorical(...))
```
@njs Are we talking about something beyond naming? The code in the NEP does define … The way I have it laid out, … Note that …
@mattip - @njsmith's proposal is a bit more what your alternative should look like: I actually also wondered about this scheme as it is what we use for units in astropy, to ensure that, say, the meter is only defined once; see https://github.com/astropy/astropy/blob/master/astropy/units/core.py#L1758 (note that looking at this again, if I would have written it, some of what is in the …
Perhaps the main difference between descriptors as classes and instances is the future extension of using them to create scalars. If descriptors are classes, those scalars could be their instances, created via … So, the question is whether it is useful to eventually move to a state where …
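A small sketch contrasting the two options (hypothetical names, not the NEP's code):

```python
# Design A: the descriptor is a class, so a future scalar can simply be
# an instance of it, created by calling the descriptor.
class Int8DescrA(int):
    kind, itemsize = 'i', 1

s = Int8DescrA(7)            # a scalar; type(s) is the descriptor itself

# Design B: the descriptor is an instance of a dtype class, so a scalar
# could only point back at it via a separate attribute (e.g. .dtype).
class DescrB:
    def __init__(self, kind, itemsize):
        self.kind, self.itemsize = kind, itemsize

int8_descr = DescrB('i', 1)  # the descriptor object itself
```

In design A, `type(scalar)` and the descriptor coincide; in design B, the link between scalar and descriptor has to be maintained by hand.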
Well, naming is important :-). But, take an actual dtype object. By that I mean, an object you might find looking at the … In my proposal, these are always true: …
I should probably say clearly: the idea of unifying scalars and dtypes strikes me as an awful, conceptually incoherent idea. Now, the problem is, scalars themselves are the most confusing and incoherent part of the classic numpy design, but we have to do something with them, so I'm willing to hear arguments that unifying them with dtypes is the least awful option :-).

But conceptually, scalars are duck arrays with a fixed dtype and shape. So unifying them with dtypes requires violating the classic split between array/ufunc/dtype. It doesn't make any more sense a priori than it would to, I don't know, unify ufuncs and arrays.
In the proposal as stated both of those are True since …
The problem is not how to define slot functions, but the need to move away from code like … Could you explain more about …? We can leave the discussion about scalars for later; I will remove the …
I can see the logic that since a dtype defines the meaning of a single element, it should really also define how, e.g., their …

More generally, it might be helpful to have a clearer definition of the problem that we are trying to solve. I thought it was mainly to make it possible to define a new dtype in python, but above you also mention interfacing with …

For subclassing, right now, there are C implementations of rational and quaternion dtypes [1]; could we use those as more concrete examples of something that one might want to do in pure python, giving perhaps a skeleton of how that should look? It might well be that this brings up the scalar version again as well, since presumably those need repr/str as well.
There are a few things to say here :-)

First, well this isn't the point but... FWIW categorical shouldn't inherit from any kind of int-related type, for the same reasons numpy bool doesn't. Categorical and bool both use an int-compatible storage layout in memory, but semantically they behave really differently. Anyway, that's not the point, let's move on.

Whatever design we end up with is going to need a distinction between a concrete, fully specified dtype like string-of-length-7 or categorical-with-options-setosa/virginica/versicolor, and an abstract category of dtypes with shared implementation, like fixed-length-string or categorical. This is basically the same as the instance-versus-type distinction in Python. To make it more confusing, the traditional name for the instance objects is "dtype", which shares 4 letters with the python word "type", even though from the python perspective they're not types, they're instances. So I've been calling the instance objects "dtypes" and the new type-of-dtype objects "dtypetypes", which is silly terminology but at least it makes the distinction.

Compared to how numpy works now, dtype objects remain dtype objects, and they're still instances of …
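A pure-Python sketch of this split (illustrative names, reusing the categorical example from above): the "dtypetype" is a class, and the concrete dtype is an *instance* of it.

```python
class dtype:                      # stand-in for np.dtype
    pass

class categorical(dtype):         # a "dtypetype": an abstract category
    def __init__(self, categories):
        self.categories = list(categories)

# a concrete, fully specified dtype is an instance, not a type
c = categorical(['setosa', 'virginica', 'versicolor'])
```

Here `categorical` plays the role of the abstract category with shared implementation, while `c` is the fully specified dtype object.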
For …
Ok, I think I get it. Then … The only reason we need a … Does this make sense to the others in the conversation? It seems in the C code we refer to instances of dtypes as array descriptors.
I think it also facilitates subclassing of dtypes. EDIT: And possibly registration. Are we going to do that?
@mattip - the meta-class can both register specific instances and intercept them. In a way, that gives a nice separation of concerns - …
I pushed a revised version, which incorporates the suggestions. I am not sure, but I think circleci publishes it somewhere. Thanks for the reviews, it is much simpler now. Should I close this PR and resubmit the NEP as a new PR?
@mattip: Can you revert the last change, and make a new NEP with the alternate proposal you submitted in that change? To me it looks like a completely different design, but it would be nice to have a record of the first design, even if we end up rejecting the NEP.
I don't understand what you mean by this
I don't really buy this argument - it seems to be just "I'm used to dtype not being a type, it will surprise me when it becomes one". One proposal that has been mentioned before is just …
I think this is a useful way of thinking about it, but it's sort of orthogonal to @mattip's original proposal. There are sort of three levels here: …
The dtype-types would have C level slots as follows:

```c
// this is the bit we currently disagree on - but it's orthogonal
#ifdef dtype_is_metaclass
#  define DTYPE_BASE PyTypeObject
#else
#  define DTYPE_BASE PyObject
#endif

// the current PyArray_Descr
struct DtypeObject {
    DTYPE_BASE head;
    int itemsize;
    int alignment;
    // push the other slots to subclasses
};

// the current PyArray_DescrType
PyTypeObject DtypeType = {
    .tp_name = "dtype",
    .tp_basicsize = sizeof(DtypeObject),
};

// new subclasses of `np.dtype` with C level slots
struct IntegerDtypeObject {
    DtypeObject head;  // C-level inheritance
    int min_value;
    int max_value;  // could include a bunch of stuff currently in `iinfo` here if we wanted
    int bits;       // or a getset descriptor returning `8 * itemsize` in the line below
};
PyTypeObject IntegerDtypeType = {
    .tp_base = &DtypeType,
    .tp_name = "dtype_types.integral",
    .tp_basicsize = sizeof(IntegerDtypeObject),
};

struct TimeDeltaDtypeObject {
    DtypeObject head;
    int max_value;
};
PyTypeObject TimeDeltaDtypeType = {
    .tp_base = &DtypeType,
    .tp_name = "dtype_types.timedelta",
    .tp_basicsize = sizeof(TimeDeltaDtypeObject),
};

struct StructuredDtypeObject {
    DtypeObject head;
    struct {
        char const *name;
        DtypeObject *format;
        int offset;
    } *fields;
};
PyTypeObject StructuredDtypeType = {  // name fixed: was a duplicate TimeDeltaDtypeType
    .tp_base = &DtypeType,
    .tp_name = "dtype_types.structured",
    .tp_basicsize = sizeof(StructuredDtypeObject),
};
```

I think we're on the same page about this part of the change - but note that it does not require the introduction of any new metaclasses. I'll perhaps write more about the other half of this (values vs dtypes) later.
@mattip - I think the class ordering now makes a lot more sense. What still would help is to make a better case. I think for this NEP, that should not necessarily be printing (I don't actually think a dtype should be able to format anything except a scalar). Instead, I think it would be most useful to focus on subclassing, which this would enable more immediately. So, perhaps in the `Color` example, override one of those internal C methods like setting or interpreting memory?
I mean that the object returned by calling …
Well, sure, obviously we could decide to make dtype objects also be instances of …
There is a real tricky question about what to do about the scalar types. Conceptually, the way it works historically is:
(This is an impressionistic caricature; no one literally sat down and made this decision, but that's roughly how it turned out.)

So...... It's just a really tangled mess. If we were designing a system from scratch, we wouldn't have scalars; we'd just have a sensible dtype system, + we'd optimize ndarray so zero-dim arrays were fast enough to make scalars unnecessary. And the current design causes real problems. Obviously it's super complicated and confusing. And, it's totally possible, today, to have a dtype that doesn't have a corresponding scalar type; if this happens the core code simply returns a zero-dim ndarray whenever it would otherwise return a scalar. And this is a good thing, because asking everyone who implements a custom dtype to also implement a custom scalar duck array type that's exactly like … OTOH it's not like we can simply …

I think Matti is right and we should avoid touching scalars for now. We already have a lot of tricky questions to answer just to clean up the core dtype machinery, and I don't think any of it will block future changes to scalars, if all we're doing is rationalizing the existing architecture.

[1] Arithmetic operations aren't quite the same as ndarrays (e.g. int wraparound is handled differently), they're read-only and auto-copy on inplace operators, and the …
@njsmith - While I agree it is probably best not to consider numpy scalars too much, there are things related to them that a descriptor probably has to provide (if perhaps indirectly). In particular, since it is what describes what a piece of memory actually means, it makes sense that it would also provide some way to implement converters for a single element, whether to standard python scalars via …

In thinking about class or instance, perhaps it is useful to think of …

@eric-wieser - One thing that might help is if the proposal of …
Right now dtype objects already have this: … (There are some subtleties to figure out around the different ways you can convert raw memory into a Python object – scalar objects vs. builtin objects vs. 0d arrays – and which one you want to use in which situations, but I'm sure we can figure it out.)
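For context (not a reconstruction of the elided snippet above), the boxing hooks that exist today are visible from Python via the current NumPy API:

```python
import numpy as np

dt = np.dtype('int16')
# every dtype already records its associated scalar ("box") type
assert dt.type is np.int16

a = np.zeros(3, dtype=dt)
x = a[0]        # raw memory boxed as a numpy scalar
assert isinstance(x, np.int16)
y = a.item(0)   # the same memory boxed as a builtin Python int instead
assert isinstance(y, int) and not isinstance(y, np.int16)
```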
I am not proposing to invent a new set, rather to refactor the existing code: … with method lookup on the dtypes (not on scalars). For examples, see arrayprint.
As requested, I moved the new design to PR #12660 and removed those changes from this PR.
@eric-wieser said
The only reason I was looking at metaclasses is because of the strange NumPy syntax for creating dtypes.
I know @eric-wieser requested it, but I think with this move we risk losing what little cohesion there is to the discussion... I wonder if it is now best to move over to #12660 altogether, and have the design here as the …
@mhvk: Perhaps it would have been better to do the inverse, and move the old revision over to a spin-off PR, leaving the evolving one here. I'd like to see if I can evolve the old revision to be orthogonal to this one. Maybe I should just create a local branch at that revision, and revive a competitor NEP in the next few days, rather than leaving mattip to manage both.
An argument for making dtype instances themselves types - it allows us to use …

```python
def dtype_struct(name, bases, dict):
    cls = type(name, (np.void,), {})
    return np.dtype((cls, list(dict['__annotations__'].items())))

class X(metaclass=dtype_struct):
    a: np.int8
    b: np.int32
```

Which actually works, producing … However, it will confuse users to no end that …
@eric-wieser - in principle, we can work with this branch: as long as @mattip allows edits (which is the default), you can just push to his branch.
I understand the next steps in this are …
Ping. I would like to move this forward but am hesitant to proceed before approval of the last comment.
@eric-wieser can you confirm the re-shuffling of the PRs is what you intended?
I'm afraid I don't have the context in my cache right now, and am low on cycles to refresh it. I'll try to find time this weekend to reread. Ultimately - how you manage the PRs to best preserve discussion is up to you, and I think the way my suggestion fragmented it was a mistake. I think there are two orthogonal NEPs to be developed here though.
I think we need more informal discussion before diving into a NEP. I threw up some thoughts here; maybe that format is more conducive to interchange? Anyone can edit (there are revision checkpoints) and there is also a comment feature. The first thing I think needs to be clear is "what do we mean by …"
Closing, it seems starting over will be easier than resuscitating this. Hopefully the discussion will have been useful as a starting point for a replacement.
Add a NEP describing the design of the descriptor type, using a dtype metaclass.
The idea is that we design a new dtype hierarchy to eventually align the multiple class hierarchies of scalars and dtypes. Very much a WIP, and I am not opposed to a totally different design that meets the goals.
My simple understanding of 1. is to design the classes with subclassing from python in mind, rather than adding new protocols, and that 2. means that instantiating an object from a dtype will create a scalar, so that `type(np.int32(0)) is np.dtype('int32')`.

Edit: 2. is not a goal.