8000 WIP, NEP: add dtype design NEP - DtypeMeta, Dtype, and hierarchy by mattip · Pull Request #12660 · numpy/numpy · GitHub
[go: up one dir, main page]

Skip to content

WIP, NEP: add dtype design NEP - DtypeMeta, Dtype, and hierarchy #12660

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from
Closed
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
8000
Diff view
Diff view
286 changes: 286 additions & 0 deletions doc/neps/nep-0029-dtype-as-type.rst
< 3D80 td class="blob-code blob-code-addition js-file-line"> char: str
Original file line number Diff line number Diff line change
@@ -0,0 +1,286 @@
=============================
NEP 29 — Refactor Dtype Class
=============================

:Author: Matti Picus
:Status: Draft
:Type: Standards Track
:Created: 2018-12-27


Abstract
========

NumPy's `dtype <http://www.numpy.org/devdocs/reference/generated/numpy.dtype.html>`_
describes a class used to create a descriptor, which provides the methods
to convert the strided memory in an ndarray into objects that can be
manipulated, printed, and used in loops. In this NEP, **dtype** is used as the
name of the class, and **descriptor** is used as the name of the instance. A
descriptor is what is held by a ndarray object, in addition to a block of
memory and information about the strides and shape of elements in that memory.
Creating an instance of ``dtype`` *i.e.* ``a = np.dtype('int8')`` will result
in a descriptor, which is a python object of type ``dtype``.

The ``dtype`` obect instance ``a`` has attributes, among them ``a.type``, which
is a class object. Instantiating that class object ``a.type(3)`` produces a
Num`py `scalar <http://www.numpy.org/devdocs/reference/arrays.scalars.html>`_.

This NEP proposes a class heirarchy for dtypes. The ``np.dtype`` class will
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hierarchy spelled wrong (here and below), also abstrace->abstract

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

become an abstrace base class, and a number of new classes will be created with
a heirarchy like scalars. They will support subclassing. A future NEP may
propose unifying the scalar and dtype type systems, but that is not a goal of
this NEP. The changed dtype will:

- facilitate extending dtypes, typically for things like categoricals, novel
representations like datetime or IP addresses, or adding attributes like
units directly on the dtypes.
- Allow replacement of the switch-type code we have in places like arrayprint_,
PyArray_AdaptFlexibleDtype_, and PyArray_CanCastSafely_ (just a few of many),
with method-based overrides, simplifying internal logic and allowing
subclassed dtypes to override the functionality.

.. _arrayprint: https://github.com/numpy/numpy/blob/v1.16.0rc1/numpy/core/arrayprint.py#L418
.. _PyArray_AdaptFlexibleDtype: https://github.com/numpy/numpy/blob/v1.16.0rc1/numpy/core/src/multiarray/convert_datatype.c#L164
.. _PyArray_CanCastSafely: https://github.com/numpy/numpy/blob/v1.16.0rc1/numpy/core/src/multiarray/convert_datatype.c#L398

Overall Design
--------------

The ``Dtype`` class (and any subclass without ``itemsize``) is by definition an
abstract base class. A metaclass ``DtypeMeta`` is used to add slots for
converting memory chunks into objects and back, to provide datatype-specific
functionality, and to provide the casting functions to convert data.

A prototype without error checking, without options handling, and describing
only ``np.dtype(np.uint8)`` together with an overridable ``get_format_function``
for ``arrayprint`` looks like::

import numpy as np

class DtypeMeta(type):
# Add slot methods to the base Dtype to handle low-level memory
# conversion to/from char[itemsize] to int/float/utf8/whatever
# In cython this would look something like
#cdef int (*unbox_method)(PyObject* self, PyObject* source, char* dest)
#cdef PyObject* (*box_method)(PyObject* self, char* source)

def __call__(cls, *args, **kwargs):
# This is reached for Dtype(np.uint8, ...).
# Does not yet handle align, copy positional arguments
if len(args) > 0:
obj = args[0]
if isinstance(obj, int):
return dtype_int_dict[obj]
elif isinstance(obj, type) and issubclass(obj, np.generic):
return dtype_scalar_dict[obj]
else:
# Dtype('int8') or Dtype('S10') or record descr
return create_new_descr(cls, *args, **kwargs)
else:
# At import, when creating Dtype and subclasses
return type.__call__(cls, *args, **kwargs)

class Dtype():
def __new__(cls, *args, **kwargs):
# Do not allow creating instances of abstract base classes
if not hasattr(cls, 'itemsize'):
raise ValueError("cannot create instances of "
f"abstract class {cls!r}")
return super().__new__(cls, *args, **kwargs)

class GenericDescr(Dtype, metaclass=DtypeMeta):
pass

class IntDescr(GenericDescr):
def __repr__(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This __repr__ could move to GenericDtype, as it would be the same for float, string (and that gives it some purpose!)

# subclass of IntDescr
return f"dtype('{_kind_to_stem[self.kind]}{self.itemsize:d}')"

def get_format_function(self, data, **options):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mhvk noted: Still feel this would logically be on dtype.type.format; maybe that can be the default in the upper class?

Or, perhaps better, just omit it altogether and focus on a method on a dtype that currently exists, overriding that in Color.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

formatting exists on dtypes, it is based on a switch statement to get the proper formatting function in arrayprint. I am proposing to make that a method-based lookup on the dtype.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean the format function including the part where it determines, e.g., for integer the maximum size the string will be? I guess that is indeed logical to have on the dtype. Still, for the purpose of the NEP, it seems too complicated; if the point is to be able to override a dtype, I feel it would be better to have an example that is closer to what a dtype normally does, like interpreting a bit of memory.

# replaces switch on dtype found in _get_format_function
# (in arrayprint), **options details missing
from np.core.arrayprint import IntegerFormat
return IntegerFormat(data)

class UInt8Descr(IntDescr):
kind = 'u'
itemsize = 8
type = np.uint8
# The C-based methods for sort, fill, cast, clip, ... not exposed to
# Python
#ArrFuncs = int8_arrayfuncs


dtype_int_dict = {1: UInt8Descr()}
dtype_scalar_dict = {np.uint8: UInt8Descr()}
_kind_to_stem = {
'u': 'uint',
'i': 'int',
'c': 'complex',
'f': 'float',
'b': 'bool',
'V': 'void',
'O': 'object',
'M': 'datetime',
'm': 'timedelta',
'S': 'bytes',
'U': 'str',
}

At NumPy startup, as we do today, we would generate the builtin set of
descriptor instances, and fill in ``dtype_int_dict`` and ``dtype_scalar_type``
so that the built-in descriptors would continue to be singletons. ``Void``,
``Byte`` and ``Unicode`` descriptors would be constructed on demand, as is done
today. The magic that returns a singleton or a new descriptor happens in
``DtypeMeta.__call__``.

All descriptors would inherit from ``Dtype``::

>>> a = np.dtype(np.uint8)
>>> type(a).mro()
[<class 'UInt8Descr'>, <class 'IntDescr'>, <class 'GenericDescr'>,
<class 'Dtype'>, <class 'object'>]

>>> isinstance(a, np.dtype):
True

Note that the ``repr`` of ``a`` is compatibility with NumPy::

>>> repr(a)
"dtype('uint8')"

Each class will have its own set of ArrFuncs (``clip``, ``fill``,
``cast``).

Downstream users of NumPy can subclass these type classes. Creating a categorical
dtype would look like this (without error checking for out-of-bounds values)::

class Colors(Dtype):
itemsize = 8
colors = ['red', 'green', 'blue']
def get_format_function(self, data, **options):
class Format():
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mhvk noted: Instead of formatting, I'd pick something that actually exists, perhaps the interpretation of the memory (either setting or getting).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purpose of a categorical dtype, as far as I can tell, is to give textual meaning to the values in the ndarray. The only place that has meaning is when printing them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Presumably, should be able to set that type with a string as well? Indeed, even accessing it should give, I would think, an ColorType scalar that is just an int with a different __str__ (heck, scalars come back to haunt us immediately!).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overriding any of the PyArray_Descr functions or ufuncs requires C code. I was looking for an easy way to demonstrate a subclass, but it seems I only confused reviewers.

def __init__(self, data):
pass
def __call__(self, x):
return self.colors[x]
return Format(data)

c = np.array([0, 1, 1, 0, 2], dtype=Colors)

Additional code would be needed to neutralize the slot functions.

There is a level of indirection between ``Dtype`` and ``IntDescr`` so that
downstream users could create their own duck-descriptors that do not use
``DtypeMeta.__call__`` at all, but could still answer ``True`` to
``isintance(mydtype, Dtype)``.

Advantages
==========

It is very difficult today to override dtype behaviour. Internally
descriptor objects are all instances of a generic dtype class and internally
behave as containers more than classes with method overrides. Giving them a
class heirarchy with overrideable methods will reduce explicit branching in
code (at the expense of a dictionary lookup) and allow downstream users to
more easily define new dtypes. We could re-examine interoperability with
pandas_ typesystem.

Disadvantages
=============

The new code will be incompatible with old code and will at least
require a recompile of C-extension modules, including updating cython. We
should continue with `PR 12284`_ to vendor our own numpy.pxd in order to make the
transition less painful. We should not break working dtype-subclasses like
`quaterions`_.

Future Extensions
=================

Note the descriptor holds a ``typeobj`` which is a scalar class. A call like
``np.dtype('int8')(10)`` could theoretically create a scalar object.
This would make the descriptor more like the ``int`` or ``float`` type. However
allowing instantiating scalars from descriptors is not a goal of this NEP.

A further extension would be to refactor ``numpy.datetime64`` to use the new
heirarchy.

Appendix
========

References
----------

- pandas `ExtensionArray interface`_
- Dtype `brainstorming session <https://github.com/numpy/numpy/wiki/Dtype-Brainstorming>`_
from SciPy

.. _pandas: https://github.com/pandas-dev/pandas/blob/master/pandas/core/dtypes/base.py#L148
.. _`ExtensionArray interface`: https://github.com/pandas-dev/pandas/blob/master/pandas/core/dtypes/base.py#L148
.. _`PR 12284`: https://github.com/numpy/numpy/pull/12284
.. _`quaterions`: https://github.com/moble/quaternion

The current interface of dtypes in NumPy
----------------------------------------

.. code-block:: python

class DescrFlags(IntFlags):
# The item must be reference counted when it is inserted or extracted.
ITEM_REFCOUNT = 0x01
# Same as needing REFCOUNT
ITEM_HASOBJECT = 0x01
# Convert to list for pickling
LIST_PICKLE = 0x02
# The item is a POINTER
ITEM_IS_POINTER = 0x04
# memory needs to be initialized for this data-type
NEEDS_INIT = 0x08
# operations need Python C-API so don't give-up thread.
NEEDS_PYAPI = 0x10
# Use f.getitem when extracting elements of this data-type
USE_GETITEM = 0x20
# Use f.setitem when setting creating 0-d array from this data-type
USE_SETITEM = 0x40
# A sticky flag specifically for structured arrays
ALIGNED_STRUCT = 0x80

class current_dtype(object):
itemsize: int
alignment: int

byteorder: str
flags: DescrFlags
metadata: ... # unknown

# getters
hasobject: bool
isalignedstruct: bool
isbuiltin: bool
isnative: bool


def newbyteorder(self) -> current_dtype: ...

# to move to a structured dtype subclass
names: Tuple[str]
fields: Dict[str, Union[
Tuple[current_dtype, int],
Tuple[current_dtype, int, Any]
]]

# to move to a subarray dtype subclass
subdtype: Optional[Tuple[dtype, Tuple[int,...]]]
shape: Tuple[int]
base: current_dtype

# to deprecate
type: Type # merge with cls
kind: str
num: int
str: str
name: str
descr: List[...]

0