WIP, NEP: add dtype design NEP by mattip · Pull Request #12630 · numpy/numpy · GitHub

WIP, NEP: add dtype design NEP #12630


Closed · wants to merge 1 commit

Conversation

@mattip (Member) commented Dec 29, 2018

Add a NEP describing the design of the descriptor type, using a dtype metaclass.

The idea is that we design a new dtype hierarchy to

  1. allow convenient expansion of the builtin dtypes
  2. eventually align the multiple class hierarchies of scalars and dtypes
  3. modernize the codebase with more Pythonic class layouts.

Very much a WIP, and I am not opposed to a totally different design that meets the goals.

My simple understanding of goal 1 is to design the classes with subclassing from Python in mind, rather than adding new protocols. Goal 2 means that instantiating an object from a dtype will create a scalar, so that type(np.int32(0)) is np.dtype('int32').

Edit: item 2 is not a goal.

@mattip (Member Author) commented Dec 29, 2018

The NEP has two alternatives; I am not sure which is easier to follow, or whether there is a better third choice.

class IntDtype(Dtype):
    def __repr__(self):
        if self is IntDescr:
            return type.__repr__(self)
Member:

I don't think this branch is reachable.

Contributor:

Please do replace self with cls here. I think that also answers @eric-wieser's doubt. (Though I think the if statement is reversed in that case.)

Member Author:

It is hit in a = Dtype.Int8Descr; print(a.mro(a))

Member:

I'm confusing IntDescr with IntDtype

            return type.__repr__(self)
        return 'dtype(%s%d)' % (self.kind, self.itemsize)

class GenericDescr(type, metaclass=Dtype):
Member:

This looks off to me - do you really want a second layer of metaclasses? I think type here should be object

Member Author (@mattip, Dec 29, 2018):

Then the descriptor is not a type, it is an object with a __call__ method. Does that make more sense?

>>> a = dtype.Dtype(np.uint8)
>>> a.mro()
[dtype(uint8), <class 'dtype.IntDescr'>, <class 'dtype.GenericDescr'>, <class 'object'>]
>>> a.__call__
<bound method Dtype.__call__ of dtype(uint8)>

Contributor:

I think I am confused like @eric-wieser -- subclassing object makes oneself a class, which I think is what we want here: the GenericDescr class is meant (eventually) to produce instances, not other classes. I think the mro that you show makes sense.

Member Author:

ok

@mhvk (Contributor) commented Dec 29, 2018

@mattip - thanks for putting this together! I think for someone like me it would be very helpful to provide a bit more background (which will likely end up straight in the documentation, so worth writing!). If I understand correctly, dtype subclasses type in order to generate new classes, not instances of itself. And you use it for classes based on GenericDescr (maybe just call it Descriptor?), with the idea being that eventually those might instantiate scalars (and that, e.g., IntDescr could be registered with numbers.Integral).

What confuses me, however, is the parallel chain of subclasses. Why is there an IntDtype sub-metaclass?

Contributor @mhvk left a review comment:

Some more detailed comments. It may be worth mentioning that a disadvantage of the "array.dtype is a descriptor instance" alternative would be that the future extension of making numpy scalars instances of the descriptor becomes much harder.

        else:
            return super().__new__(cls, obj, *args, **kwargs)

    def __call__(self, *args, **kwargs):
Contributor:

Maybe replace self with cls here, since it is the class that is calling this. Also, for clarity, might it make more sense to define this as GenericDescr.__new__ rather than as a __call__ method here? As written, it ensures none of the descriptor classes can ever define a useful __new__.

Member Author:

This is where the scalar instance is created: a = dtype.Int8Descr; s = a(10)

Contributor:

Yes, but it took me a long time to realize that is what was happening. Which is why I suggested it were done with GenericDescr.__new__ instead. It seems something the class should do, not the metaclass.


class Dtype(type):

    def __new__(cls, obj, *args, **kwargs):
Contributor:

Maybe write mcls to avoid confusion (by me, at least...)

Member Author:

+1

            # Dtype('int8') or Dtype('S10') or record descr
            return create_new_descr(cls, obj, *args, **kwargs)
        else:
            return super().__new__(cls, obj, *args, **kwargs)
Contributor:

This would be the place where someone is subclassing, say, IntDescr, correct?

Member Author:

It is hit when instantiating the IntDescr class itself during module import (creating the IntDescr class type)

Contributor:

OK, maybe good to add a comment...


This NEP proposes a different class hierarchy. Objects of ``np.dtype`` will
become type objects with a hierarchical ``mro`` like scalars. They will support
subclassing. A future NEP may propose that instantiating a dtype type object
Contributor:

I think this might need clarification: if I understand correctly, one would subclass a descriptor to create something new, one would generally not subclass np.dtype itself, since that is the metaclass.

Member Author:

Correct. Is it clearer to say "ndarray descriptors will have np.dtype as their metaclass, with a hierarchical ..."? I am not sure how to phrase the idea.

which is more difficult to reason about than object instances. For instance,
note that in the ``Colors`` example, we did not instantiate an object of the
``Colors`` type, but rather used that type directly in the ndarray creation. Also,
the ``format`` function is not a bound method of a class instance, but rather an
Contributor:

I think this can be circumvented by thinking of it as __format__, so that what the print machinery would actually do (once we allow instances to be scalars) is Color(array_scalar).__format__().

Member Author:

The idea is that arr.dtype.format would replace the formatfunction in arrayprint. I don't think we should be requiring scalar instantiation in each iteration of a loop to stringify array values.

This is a precursor to more extensively using method overloading to replace many of the case-based lookups scattered around the code, which is key to making subclasses of dtypes work.

Contributor:

(I agree we would not need to instantiate; I only meant it as an example of what would be possible.) Note that Color.__format__(array_scalar) would do exactly the same thing, so my real question is why we are introducing a new class method here when a standard python one would do. Especially since surely eventually we are going to allow passing on format strings, just like for __format__...

p.s. The discussion with @njsmith really needs a solution before moving too much further. In his scheme, I think one would format using dtype.type.__format__(value)

        else:
            return super().__new__(cls, obj, *args, **kwargs)

    def __call__(self, args, kwargs):
Contributor:

This can be removed.

        return super().__call__(self.__name__, args, kwargs)

class GenericDescr(type, metaclass=Dtype):
    def __new__(cls, *args, **kwargs):
Contributor:

This should also be removed, as it doesn't do anything. Why is GenericDescr a metaclass?

Probably, I just do not understand this example...

Member Author:

I removed __new__. GenericDescr is the base class of the descriptor hierarchy; it is not meant to be a metaclass.

        return self.typeobj(*args, **kwargs)

class IntDescr(GenericDescr):
    def format(value):
Contributor:

Now it would be format(self, value), no?

        return 'dtype(%s%d)' % (self.kind, self.itemsize)


class UInt8Descr(IntDescr):
Contributor:

In this example, wouldn't uint8descr be an instance of IntDescr, i.e., uint8descr = IntDescr(<something>)?

This NEP proposes a different class hierarchy. Objects of ``np.dtype`` will
become type objects with a hierarchical ``mro`` like scalars. They will support
subclassing. A future NEP may propose that instantiating a dtype type object
will produce a scalar reflecting that dtype, but that is not a goal of this NEP.
Contributor:

It may be nice to clarify this a bit further: the descriptors will then be similar to built-in classes like float and int.

@njsmith (Member) commented Dec 30, 2018

I don't think I quite follow all the ideas here (only did a quick skim), but I think it's helpful to clarify one point: the use of metaclasses here is AFAICT totally orthogonal to the way I've been proposing to use metaclasses for dtypes.

Specifically: in this proposal, dtype objects are themselves types, i.e. isinstance(arr.dtype, type) will be true.

In my proposal, dtype objects are regular instances of the np.dtype class-or-subclasses, and these objects have methods to do things like boxing and unboxing. The wrinkle though is that we want these to be C-level methods, which means we need to add new C-level slots to dtype and its subclasses, which means that dtype and its subclasses need to have a custom metaclass. So something like:

class DtypeMeta(type):
    cdef int (*unbox_method)(PyObject* self, PyObject* source, char* dest)
    cdef PyObject* (*box_method)(PyObject* self, char* source)
    ...

class dtype(metaclass=DtypeMeta):
    ...

class categorical(dtype):
    ...

np.array(..., dtype=categorical(...))

@mattip (Member Author) commented Dec 30, 2018

@njs Are we talking about something beyond naming? The code in the NEP does define Dtype as a metaclass, and GenericDescr as the base class for descriptors. So you are suggesting I change the name of Dtype to DtypeMeta and GenericDescr to dtype, or is there something else I am missing?

The way I have it laid out, np.dtype(np.uint8) goes to the __new__ function on the metaclass; if we change the names, it will go to the metaclass __call__ instead, which might complicate the eventual unification of scalars and descriptors but could be doable.

Note that categorical should inherit from np.dtype(np.uint8).

@mhvk (Contributor) commented Dec 30, 2018

@mattip - @njsmith's proposal is a bit more like what your alternative should look like: dtype itself is a class that derives from object but which has a different metaclass than type. In this alternative, the metaclass is never used directly by the user, but does two things: it ensures that the standard dtype instances are singletons, and it ensures that dtype subclasses that users may create can be used (where necessary; I don't claim to understand box_method etc.)

I actually also wondered about this scheme as it is what we use for units in astropy, to ensure that, say, the meter is only defined once; see https://github.com/astropy/astropy/blob/master/astropy/units/core.py#L1758 (note that looking at this again, if I had written it, some of what is in the __call__ of the metaclass would be Unit.__new__).

@mhvk (Contributor) commented Dec 30, 2018

Perhaps the main difference between descriptors as classes and instances is the future extension of using them to create scalars. If descriptors are classes, those scalars could be their instances, created via __new__ and __init__, while if descriptors are dtype instances, it could only happen via __call__ or dtype.type.

So, the question is whether it is useful to eventually move to a state where np.dtype('i4') is np.int32 and thus behaves like Python's int. There do seem to be advantages to that, but it does change the API quite a bit. For instance, the dtype classes get lots of new methods (like conjugate). This is not in itself bad - those could be used while iterating over an array - but could be rather confusing. And in the descriptor-as-instance case, those could still be easily accessed via dtype.type.conjugate.

@njsmith (Member) commented Dec 30, 2018

@mattip

Are we talking about something beyond naming?

Well, naming is important :-). But, take an actual dtype object. By that I mean, an object you might find looking at the .dtype attribute on an array. We're committed to that being a well defined thing, no matter what name we call it.

In my proposal, these are always true: isinstance(arr.dtype, np.dtype), not isinstance(arr.dtype, type). At least one of those is false in this proposal, right? So the difference is not "just names"?

unification of scalars and dtypes

I should probably say clearly: the idea of unifying scalars and dtypes strikes me as an awful, conceptually incoherent idea.

Now, the problem is, scalars themselves are the most confusing and incoherent part of the classic numpy design, but we have to do something with them, so I'm willing to hear arguments that unifying them with dtypes is the least awful option :-).

But conceptually, scalars are duck arrays with a fixed dtype and shape. So unifying them with dtypes requires violating the classic split between array/ufunc/dtype. It doesn't make any more sense a priori than it would to, I don't know, unify ufuncs and arrays.

@mattip (Member Author) commented Dec 30, 2018

At least one of those is false in this proposal

In the proposal as stated both of those are True since dtype is the metaclass that inherits from type.

>>> a = Dtype(np.uint8)
>>> isinstance(a, type)
True
>>> isinstance(a, Dtype)
True
>>> a.mro()
[dtype('uint8'), <class 'IntDescr'>, <class 'GenericDescr'>, <class 'object'>]

The problem is not how to define slot functions, but the need to move away from code like arrayprint that knows about all possible dtypes.

Could you explain more about categorical in your proposal? I propose it inherits from dtype(np.uint8), not from dtype, and overrides a format function that turns a memory location into a string.

We can leave the discussion about scalars for later, I will remove the __call__ function from consideration in this NEP.

@mhvk (Contributor) commented Dec 30, 2018

I can see the logic that, since a dtype defines the meaning of a single element, and should really also define how, e.g., repr and str work for those single elements, the dtype instances might as well be such elements. Yet overall it does seem nicer in many ways if arr.dtype is an instance, with methods that describe the dtype and not the single element. This also implies a much smaller API change.

More generally, it might be helpful to have a clearer definition of the problem that we are trying to solve. I thought it was mainly to make it possible to define a new dtype in Python, but above you also mention interfacing with arrayprint (i.e., repr, str -- still seems dtype.type.__str__(value) could/should cover that...).

For subclassing, right now, there are C implementations of rational and quaternion dtypes [1]; could we use those as more concrete examples of something that one might want to do in pure Python, giving perhaps a skeleton of what that should look like? It might well be that this brings up the scalar version again as well, since presumably those need repr/str as well.

[1] https://github.com/numpy/numpy-dtypes

@njsmith (Member) commented Dec 30, 2018

Could you explain more about categorical in your proposal? I propose it inherits from dtype(np.uint8), not from dtype, and overrides a format function that turns a memory location into a string.

There are a few things to say here :-)

First, well this isn't the point but... FWIW categorical shouldn't inherit from any kind of int-related type, for the same reasons numpy bool doesn't. Categorical and bool both use an int-compatible storage layout in memory, but semantically they behave really differently. Anyway, that's not the point, let's move on.

Whatever design we end up with is going to need a distinction between a concrete, fully specified dtype like string-of-length-7 or categorical-with-options-setosa/virginica/versicolor, and an abstract category of dtypes with shared implementation, like fixed-length-string or categorical.

This is basically the same as the instance-versus-type distinction in Python. To make it more confusing, the traditional name for the instance objects is "dtype", which shares 4 letters with the python word "type", even though from the python perspective they're not types, they're instances. So I've been calling the instance objects "dtypes" and the new type-of-dtype objects "dtypetypes", which is silly terminology but at least it makes the distinction.

Compared to how numpy works now, dtype objects remain dtype objects, and they're still instances of np.dtype. But right now, the dtypetype is kind of vague: it's there in "type codes" and the "ArrFuncs" struct, but there's no single python object corresponding to a dtypetype. To fix that, we make np.dtype an abstract base class. A dtypetype then is a concrete subclass of np.dtype, its methods are a rationalized version of the old "ArrFuncs", and every dtype instance is an instance of one of these dtypetypes.

np.dtype(np.int8) returns an object that you can use directly in an array, so that means it's "dtype", not a "dtypetype". So when I read "inherits from np.dtype(int8)", I get confused, because it's like I just read "inherits from the number 7". If we wanted to make a specialized version of int8, then we should inherit from np.dtypetypes.Int8Dtype or whatever we end up calling it, not np.dtype(int8).

For arrayprint, we'll want some kind of formatting method that's defined as part of the dtypetypes and called on dtypes. (We actually have a bunch of these pseudo-methods already, in the ArrFuncs struct; I don't know why arrayprint doesn't use one already. Well, looking at the revision comments at the top of arrayprint.py, I guess that code is actually older than dtypes...)

@mattip (Member Author) commented Dec 30, 2018

Ok, I think I get it. Then np.dtype will use its __new__ to intercept calls like np.dtype(np.uint8) and return the proper subclass instance.

The only reason we need a metaclass in this layout is to set up the layout of the dtype-specific slots.

Does this make sense to the others in the conversation?

It seems in the C code we refer to instances of dtypes as array descriptors.

@charris (Member) commented Dec 30, 2018

The only reason we need a metaclass in this layout is to set up the layout of the dtype-specific slots.

I think it also facilitates subclassing of dtypes.

EDIT: And possibly registration. Are we going to do that?

@mhvk (Contributor) commented Dec 31, 2018

@mattip - the meta-class can both register specific instances and intercept them. In a way, that gives a nice separation of concerns - dtype.__new__ itself can then do what its name implies: create a new dtype instance.

@mattip (Member Author) commented Dec 31, 2018

I pushed a revised version, which incorporates the suggestions. I am not sure, but I think circleci publishes it somewhere. Thanks for the reviews, it is much simpler now. Should I close this PR and resubmit the NEP as a new PR?

@eric-wieser (Member):

@mattip: Can you revert the last change, and make a new NEP with the alternate proposal you submitted in that change? To me it looks like a completely different design, but it would be nice to have a record of the first design, even if we end up rejecting the NEP.

@eric-wieser (Member):

np.dtype(np.int8) returns an object that you can use directly in an array

I don't understand what you mean by this

so that means it's "dtype", not a "dtypetype". So when I read "inherits from np.dtype(int8)", I get confused, because it's like I just read "inherits from the number 7".

I don't really buy this argument - it seems to be just "I'm used to dtype not being a type, it will surprise me when it becomes one". One proposal that has been mentioned before is just np.dtype(np.int8) == np.int8 - i.e., np.dtype.__new__ is just a function that looks up an existing dtype instance (which is itself a type).

@eric-wieser (Member) commented Jan 3, 2019

So I've been calling the instance objects "dtypes" and the new type-of-dtype objects "dtypetypes", which is silly terminology but at least it makes the distinction.

I think this is a useful way of thinking about it, but it's sort of orthogonal to @mattip's original proposal. There are sort of three levels here:

values (x):                  1 · timedelta64("m")(100) · structured([("a", int), ("b", int)])(1, 2)
dtypes (type(x)):            int8 · timedelta64("m") · structured([("a", int), ("b", int)])
dtype-types (type(type(x))): np.dtype_types.integral · np.dtype_types.timedelta64 · np.dtype_types.structured

The dtype-types would have C level slots as follows:

// this is the bit we currently disagree on - but it's orthogonal
#ifdef dtype_is_metaclass
    # define DTYPE_BASE PyTypeObject 
#else
    # define DTYPE_BASE PyObject
#endif

// the current PyArray_Descr
struct DtypeObject {
    DTYPE_BASE head;
    int itemsize;
    int alignment;
    // push the other slots to subclasses
};

// the current PyArray_DescrType
PyTypeObject DtypeType = { .tp_name = "dtype", .tp_basicsize = sizeof(DtypeObject) };

// new subclasses of `np.dtype` with C level slots

struct IntegerDtypeObject {
    DtypeObject head;  // C-level inheritance
    int min_value;
    int max_value;   // could include a bunch of stuff currently in `iinfo` here if we wanted
    int bits;  // or a getset descriptor returning `8 * itemsize` in the line below
};
PyTypeObject IntegerDtypeType = {
    .tp_base = &DtypeType,
    .tp_name = "dtype_types.integral",
    .tp_basicsize = sizeof(IntegerDtypeObject),
};

struct TimeDeltaDtypeObject {
    DtypeObject head;
    int max_value;
};
PyTypeObject TimeDeltaDtypeType = {
    .tp_base = &DtypeType,
    .tp_name = "dtype_types.timedelta",
    .tp_basicsize = sizeof(TimeDeltaDtypeObject),
};


struct StructuredDtypeObject {
    DtypeObject head;
    struct {
        char const *name;
        DtypeObject *format;
        int offset;
    } *fields;
};
PyTypeObject StructuredDtypeType = {
    .tp_base = &DtypeType,
    .tp_name = "dtype_types.structured",
    .tp_basicsize = sizeof(StructuredDtypeObject),
};

I think we're on the same page about this part of the change - but note that it does not require the introduction of any new metaclasses.

I'll perhaps write more about the other half of this (values vs dtypes) later.

Contributor @mhvk left a review comment:

@mattip - I think the class ordering now makes a lot more sense. What still would help is to make a better case. I think for this NEP, that should not necessarily be printing (I don't actually think a dtype should be able format anything except a scalar). Instead, I think it would be most useful to focus on subclassing, which this would enable more immediately. So, perhaps in the Color example, override one of those internal C methods like setting or interpreting memory?

@njsmith (Member) commented Jan 3, 2019

np.dtype(np.int8) returns an object that you can use directly in an array

I don't understand what you mean by this

I mean that the object returned by calling np.dtype(np.int8) can be passed as the dtype= argument of the ndarray constructor, can appear in the arr.dtype attribute and at the C level can appear in the ndarray ->descr field.

I don't really buy this argument - it seems to be just "I'm used to dtype not being a type, it will surprise me when it becomes one".

Well, sure, obviously we could decide to make dtype objects also be instances of type, and invent some semantics for that. I'm not sure what exactly those would be, because there's really nothing like that in numpy's current design to guide us, but we certainly could do it. But Matti was specifically talking about defining a new class of dtypes, and to do that you will want to inherit from a class of dtypes, not a specific dtype instance.

One proposal that has been mentioned before is just np.dtype(np.int8) == np.int8 - ie, np.dtype.__new__ is just a function that looks up an existing dtype instance (which is itself a type).

There is a real tricky question about what to do about the scalar types.

Conceptually, the way it works historically is:

  • We have dtype objects = PyArrayDescr objects
  • We have dtypetypes, which are groups of dtypes that have shared methods and some identity. But, these use a simple ad hoc type system in C (methods = ArrFuncs struct, identity = integer typecode), not the Python type system.
  • We also have scalars, which are specialized duck array types. They only work for a specific dtype and shape=() but in return they're really fast. There are also a few semantic quirks [1].
  • Then, we needed a way to check relationships amongst dtypes and dtypetypes. The original ad hoc type system has a way to do this using "kind" codes, but these are pretty limited and have some counterintuitive properties. Really we want a way to do isinstance between a dtype and a dtypetype, and issubclass between two dtypetypes. But of course, to do that we'd have to port dtypetypes to use the Python type system.
  • Oh hey but look, we also have this hierarchy of specialized duck array types. They're just supposed to be an optimization, but they do use the Python type system. So, maybe we can get away with being lazy: instead of fixing dtypes so that isinstance/issubclass work, we'll use these duck array types as placeholders for the corresponding dtypes, and whenever we want to do isinstance/issubclass on dtypetypes, we'll translate that into an isinstance/issubclass on the corresponding duck array types. And thus was born issubdtype, and the scalar hierarchy's abstract base classes, and several generations of numpy users grew up with docs that systematically confused dtypes and scalar types.

(This is an impressionistic caricature, no one literally sat down and made this decision, but that's roughly how it turned out.)

So...... It's just a really tangled mess. If we were designing a system from scratch, we wouldn't have scalars; we'd just have a sensible dtype system, + we'd optimize ndarray so zero-dim arrays were fast enough to make scalars unnecessary. And the current design causes real problems. Obviously it's super complicated and confusing. And, it's totally possible, today, to have a dtype that doesn't have a corresponding scalar type; if this happens the core code simply returns a zero-dim ndarray whenever it would otherwise return a scalar. And this is a good thing, because asking everyone who implements a custom dtype to also implement a custom scalar duck array type that's exactly like ndarray(dtype=theirdtype), but different... this is unreasonable. But, this breaks all the machinery we have around using the scalar hierarchy to reason about dtypes.

OTOH it's not like we can simply del np.float64 and friends; that would break everyone's code. We might be able to replace them with something else, but they've become so intertwined with other parts of numpy that it's hard to know what changes are even possible. At the least we can't break code like isinstance(arr[0], np.float64) (but maybe this could use some __instancecheck__ hack?).

I think Matti is right and we should avoid touching scalars for now. We already have a lot of tricky questions to answer just to clean up the core dtype machinery, and I don't think any of it will block future changes to scalars, if all we're doing is rationalizing the existing architecture.

[1] Arithmetic operations aren't quite the same as ndarrays (e.g. int wraparound is handled differently), they're read-only and auto-copy on inplace operators, and the float64 duck-array type also inherits from float.

@mhvk (Contributor) commented Jan 4, 2019

@njsmith - While I agree it is probably best not to consider numpy scalars too much, there are things related to them that a descriptor probably has to provide (if perhaps indirectly). In particular, since it is what describes what a piece of memory actually means, it makes sense that it would also provide some way to implement converters for a single element, whether to standard python scalars via __float__, etc., or to strings via __str__ and __repr__. It would seem some form of scalars makes sense for this, since that form can provide the appropriate methods (rather than inventing a new set as in the present NEP). If so, the question is whether arr.dtype does that directly (by it being a class whose instance is the scalar) or via a helper scalar class.

In thinking about class or instance, perhaps it is useful to think of DType as similar to struct.Struct? Both are descriptions of the meaning of a set of bytes. If so, arr.dtype would more logically be an instance. Though arguably float and int are also descriptions of the meaning of a set of bytes, in which case Struct is more similar to a StructuredDType... So, perhaps not so helpful...

@eric-wieser - One thing that might help is if the proposal of arr.dtype being a class were sharpened up a little. E.g., even if it is, it seems to me its instances should be rather unlike the current numpy scalars in that they should not be some misshapen duck array with sum methods, etc., but rather be similar to the python int and float instances.

@njsmith (Member) commented Jan 4, 2019

While I agree it is probably best not to consider numpy scalars too much, there are things related to them that a descriptor probably has to provide (if perhaps indirectly). In particular, since it is what describes what a piece of memory actually means, it makes sense that it would also provide some way to implement converters for a single element, whether to standard python scalars via __float__, etc., or to strings via __str__ and __repr__. It would seem some form of scalars makes sense for this, since that form can provide the appropriate methods (rather than inventing a new set as in the present NEP). If so, the question is whether arr.dtype does that directly (by it being a class whose instance is the scalar) or via a helper scalar class.

Right now dtype objects already have this: PyArray_ArrFuncs->setitem converts a Python object to raw memory, and PyArray_ArrFuncs->getitem goes the other way. These are absolutely something we should have as (slot) methods on new-style dtypes – in fact this is the same as my unbox and box methods above :-). (Traditional VM terminology: a Python int is "boxed", while a raw machine int is "unboxed". The analogy is that the PyObject struct is like a box that we put the machine int inside.) But this doesn't particularly involve scalars, except that a scalar is one kind of Python object that the dtype has to be prepared to handle.

(There are some subtleties to figure out around the different ways you can convert raw memory into a Python object – scalar objects vs. builtin objects vs. 0d arrays – and which one you want to use in which situations, but I'm sure we can figure it out.)

@mattip (Member Author) commented Jan 4, 2019

It would seem some form of scalars makes sense for this, since that form can provide the appropriate methods (rather than inventing a new set as in the present NEP)

I am not proposing to invent a new set, but rather to refactor the existing code:

if dtype is int:
    do something
elif dtype is float:
    do something else

with method lookup on the dtypes (not on scalars). For examples, see arrayprint.

@mattip (Member Author) commented Jan 4, 2019

As requested, I moved the new design to PR #12660 and removed those changes from this PR

@mattip (Member Author) commented Jan 4, 2019

@eric-wieser said

I think we're on the same page about this part of the change - but note that it does not require the introduction of any new metaclasses.

The only reason I was looking at metaclasses is because of the strange NumPy syntax for creating dtypes, np.dtype('i'). But I agree metaclasses are not needed, and they actually complicate the code. I can work around it by continuing to (ab)use the __new__ class method arraydescr_new.

@mhvk (Contributor) commented Jan 4, 2019

As requested, I moved the new design to PR #12660 and removed those changes from this PR
Ah, now I see why I got e-mails with my own comments...

I know @eric-wieser requested it, but I think with this move we risk losing what little cohesion there is to the discussion... I wonder if it is now time to move over to #12660 altogether, and have the design here as the Alternative there, so we can discuss in comparison? @eric-wieser - what do you think would be most helpful?

@eric-wieser (Member):

@mhvk: Perhaps it would have been better to do the inverse, and move the old revision over to a spin-off PR, leaving the evolving one here. I'd like to see if I can evolve the old revision to be orthogonal to this one. Maybe I should just create a local branch at that revision, and revive a competitor NEP in the next few days, rather than leaving mattip to manage both.

@eric-wieser (Member) commented Jan 4, 2019

An argument for making dtype instances themselves types - it allows us to use typing.NamedTuple-like syntax without surprising behavior. Right now we can do:

import numpy as np

def dtype_struct(name, bases, ns):
    # build a throwaway scalar type, then wrap it and the annotated
    # fields into a structured dtype
    cls = type(name, (np.void,), {})
    return np.dtype((cls, list(ns['__annotations__'].items())))

class X(metaclass=dtype_struct):
    a: np.int8
    b: np.int32

Which actually works, producing X = dtype((__main__.X, [('a', '<i4')]))

However, it will confuse users to no end that isinstance(X, type) is false, since that's a fundamental assumption of what the class keyword does. Also, the __main__.X within that repr doesn't actually refer to the final X

@mhvk (Contributor) commented Jan 4, 2019

[snip] revive a competitor NEP in the next few days, rather than leaving mattip to manage both.

@eric-wieser - in principle, we can work with this branch: as long as @mattip allows edits (which is the default), you can just push to his branch.

@mattip (Member Author) commented Jan 7, 2019

I understand the next steps in this are

@mattip (Member Author) commented Jan 21, 2019

Ping. I would like to move this forward, but am hesitant to do so before the last comment is addressed.

@mattip (Member Author) commented Jan 24, 2019

@seberg see the comment which was never replied to

@mattip (Member Author) commented Jan 31, 2019

@eric-wieser can you confirm the re-shuffling of the PRs is what you intended?

@eric-wieser (Member):

I'm afraid I don't have the context in my cache right now, and am low on cycles to refresh it. I'll try to find time this weekend to reread. Ultimately - how you manage the PRs to best preserve discussion is up to you, and I think the way my suggestion fragmented it was a mistake.

I think there are two orthogonal NEPs to be developed here though

@mattip (Member Author) commented Jan 31, 2019

I think we need more informal discussion before diving into a NEP. I threw up some thoughts here; maybe that format is more conducive to interchange? Anyone can edit (there are revision checkpoints), and there is also a comment feature.

The first thing I think needs to be made clear is: what do we mean by np.dtype('uint8')? Is np.dtype a class? It behaves more like a class factory that produces instances. I think that is the source of my confusion about what the class Dtype in this NEP is doing when you call its __new__.

@mattip (Member Author) commented Aug 19, 2019

Closing; it seems starting over will be easier than resuscitating this. Hopefully the discussion will have been useful as a starting point for a replacement.

@mattip mattip closed this Aug 19, 2019
@mattip mattip deleted the dtype-nep2 branch June 8, 2020 06:57