|
| 1 | +=============================================== |
| 2 | +NEP 29 — Refactor Dtypes to become Type Objects |
| 3 | +=============================================== |
| 4 | + |
| 5 | +:Author: Matti Picus |
| 6 | +:Status: Draft |
| 7 | +:Type: Standards Track |
| 8 | +:Created: 2018-12-27 |
| 9 | + |
| 10 | + |
| 11 | +Abstract |
| 12 | +======== |
| 13 | + |
| 14 | +NumPy's `dtype <http://www.numpy.org/devdocs/reference/generated/numpy.dtype.html>`_ |
| 15 | +is a Python class with the simple ``mro`` ``[np.dtype, object]``. Creating an |
| 16 | +instance of ``dtype``, *i.e.* ``a = np.dtype('int8')``, will result in a Python |
| 17 | +object of type ``dtype``. The ``dtype`` object instance has attributes, among |
| 18 | +them ``a.type``, which is a class object. Instantiating that class object |
| 19 | +``a.type(3)`` produces a NumPy `scalar |
| 20 | +<http://www.numpy.org/devdocs/reference/arrays.scalars.html>`_. |
| 21 | + |
| 22 | +This NEP proposes a different class hierarchy. Objects of ``np.dtype`` will |
| 23 | +become type objects with a hierarchical ``mro`` like scalars. They will support |
| 24 | +subclassing. A future NEP may propose that instantiating a dtype type object |
| 25 | +will produce a scalar reflecting that dtype, but that is not a goal of this NEP. |
| 26 | + |
| 27 | +The changed dtype will: |
| 28 | + |
| 29 | +- Facilitate extending dtypes, typically for things like categoricals, novel |
| 30 | +  representations like datetime or IP addresses, or adding attributes like |
| 31 | +  units. |
| 32 | +- Simplify the code around ``__repr__`` and method lookup. |
| 33 | + |
| 34 | +Overall Design |
| 35 | +-------------- |
| 36 | + |
| 37 | +In pure Python (without error checking):: |
| 38 | + |
| 39 | +    import numpy as np |
| 40 | + |
| 41 | +    class Dtype(type): |
| 42 | + |
| 43 | +        def __new__(cls, obj, *args, **kwargs): |
| 44 | +            if isinstance(obj, int): |
| 45 | +                return dtype_int_dict[obj] |
| 46 | +            elif isinstance(obj, type) and issubclass(obj, np.generic): |
| 47 | +                return dtype_scalar_dict[obj] |
| 48 | +            elif len(args) < 1: |
| 49 | +                # Dtype('int8') or Dtype('S10') or record descr |
| 50 | +                return create_new_descr(cls, obj, *args, **kwargs) |
| 51 | +            else: |
| 52 | +                return super().__new__(cls, obj, *args, **kwargs) |
| 53 | + |
| 54 | +        def __call__(self, *args, **kwargs): |
| 55 | +            return self.typeobj(*args, **kwargs) |
| 56 | + |
| 57 | +    class IntDtype(Dtype): |
| 58 | +        def __repr__(self): |
| 59 | +            if self is IntDescr: |
| 60 | +                return type.__repr__(self) |
| 61 | +            return 'dtype(%s%d)' %(self.kind, self.itemsize) |
| 62 | + |
| 63 | +    class GenericDescr(type, metaclass=Dtype): |
| 64 | +        pass |
| 65 | + |
| 66 | +    class IntDescr(GenericDescr, metaclass=IntDtype): |
| 67 | +        def format(value): |
| 68 | +            return '%d' % value |
| 69 | + |
| 70 | +    class UInt8Descr(IntDescr): |
| 71 | +        kind = 'uint' |
| 72 | +        itemsize = 8 |
| 73 | +        typeobj = np.uint8 |
| 74 | +        # sort, fill, cast, clip, ... |
| 75 | +        ArrFuncs = int8_arrayfuncs |
| 76 | + |
| 77 | +    dtype_int_dict = {1: UInt8Descr} |
| 78 | +    dtype_scalar_dict = {np.uint8: UInt8Descr} |
| 79 | + |
| 80 | +At NumPy startup, as we do today, we would generate the builtin set of |
| 81 | +descriptor classes, and fill in ``dtype_int_dict`` and ``dtype_scalar_dict`` |
| 82 | +so that the built-in descriptors would continue to be singletons. ``Void``, |
| 83 | +``Byte`` and ``Unicode`` descriptors would be constructed on demand, as is done |
| 84 | +today. |
| 85 | + |
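The start-up registration described above can be sketched without NumPy. This is only an illustration under stated assumptions: ``FakeUInt8`` is a hypothetical placeholder for the scalar type ``np.uint8``, ``lookup_descr`` is a hypothetical helper that mirrors the first two branches of ``Dtype.__new__``, and the type number ``1`` is copied from the listing above rather than taken from NumPy's real numbering.

```python
class GenericDescr:
    """Hypothetical stand-in for the proposed descriptor base class."""


class UInt8Descr(GenericDescr):
    kind = "uint"
    itemsize = 8


class FakeUInt8(int):
    """Placeholder for the scalar type np.uint8 (keeps the sketch NumPy-free)."""


# Registries filled once at start-up; both keys map to the *same* class
# object, so every lookup returns one shared singleton descriptor.
dtype_int_dict = {1: UInt8Descr}
dtype_scalar_dict = {FakeUInt8: UInt8Descr}


def lookup_descr(obj):
    """Return the registered descriptor class for a type number or scalar type."""
    if isinstance(obj, type):
        return dtype_scalar_dict[obj]
    if isinstance(obj, int):
        return dtype_int_dict[obj]
    raise TypeError("expected a type number or a scalar type")


by_number = lookup_descr(1)
by_scalar = lookup_descr(FakeUInt8)
print(by_number is by_scalar)
```

Because the registries store classes rather than instances, identity comparison (``is``) is enough to check that two lookups resolved to the same built-in descriptor.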
| 86 | +All dtype instances would inherit from ``GenericDescr`` which inherits from |
| 87 | +``type``, making them instances of ``type``:: |
| 88 | + |
| 89 | +    >>> a = np.dtype(np.uint8) |
| 90 | +    >>> a.mro(a) |
| 91 | +    [dtype(uint8), <class 'dtype.IntDescr'>, <class 'dtype.GenericDescr'>, \ |
| 92 | +    <class 'type'>, <class 'object'>] |
| 93 | + |
| 94 | +Each descr class will have its own set of ArrFuncs (``clip``, ``fill``, |
| 95 | +``cast``). The ``format`` function is what ``array_print`` will call to turn a |
| 96 | +memory location into a string. |
| 97 | + |
| 98 | +Downstream users of NumPy could subclass these type classes. Creating a categorical |
| 99 | +dtype would look like this (without error checking for out-of-bounds values):: |
| 100 | + |
| 101 | +    class Colors(UInt8Descr): |
| 102 | +        colors = ['red', 'green', 'blue'] |
| 103 | +        def format(value): |
| 104 | +            return Colors.colors[value] |
| 105 | +        ArrFuncs = null_arrayfuncs |
| 106 | + |
| 107 | +    c = np.array([0, 1, 1, 0, 2], dtype=Colors) |
| 108 | + |
| 109 | +Additional code would be needed to neutralize the ``tp_as_number`` slot functions. |
| 110 | + |
| 111 | +Advantages |
| 112 | +========== |
| 113 | + |
| 114 | +It is very difficult today to override dtype behaviour, since internally |
| 115 | +descriptor objects are not true type instances, but rather containers for the |
| 116 | +``ArrayDescrObject`` struct. |
| 117 | + |
| 118 | +Disadvantages |
| 119 | +============= |
| 120 | + |
| 121 | +Making descriptors into type objects requires thinking about type classes, |
| 122 | +which is more difficult to reason about than object instances. For instance, |
| 123 | +note that in the ``Colors`` example, we did not instantiate an object of the |
| 124 | +``Colors`` type, rather used that type directly in the ndarray creation. Also |
| 125 | +the ``format`` function is not a bound method of a class instance, rather an |
| 126 | +unbound function on a type class (no ``self`` argument is used). |
| 127 | + |
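The two points above (the class itself is used as the dtype, and ``format`` takes no ``self``) can be tried out in plain Python. The following is a minimal sketch, not the proposed implementation: ``DescrMeta`` and ``print_values`` are hypothetical names, ``print_values`` merely stands in for ``array_print``, and ``format`` is written as an explicit ``staticmethod`` to make the missing ``self`` visible.

```python
class DescrMeta(type):
    """Hypothetical metaclass controlling how descriptor *classes* print."""

    def __repr__(cls):
        # Abstract bases have no `kind`; fall back to the default class repr.
        if not hasattr(cls, "kind"):
            return super().__repr__()
        return "dtype(%s%d)" % (cls.kind, cls.itemsize)


class GenericDescr(metaclass=DescrMeta):
    pass


class UInt8Descr(GenericDescr):
    kind = "uint"
    itemsize = 8

    @staticmethod
    def format(value):
        return "%d" % value


class Colors(UInt8Descr):
    colors = ["red", "green", "blue"]

    @staticmethod
    def format(value):
        # No bounds checking, as in the example in the text.
        return Colors.colors[value]


def print_values(values, descr):
    """Stand-in for array_print: calls format on the class, never an instance."""
    return "[" + ", ".join(descr.format(v) for v in values) + "]"


print(repr(UInt8Descr))
print(print_values([0, 1, 1, 0, 2], Colors))
print(print_values([0, 1, 1, 0, 2], UInt8Descr))
```

Note that ``Colors`` is never instantiated: the class object carries all the information, and its ``repr`` comes from the metaclass rather than from a method defined on ``Colors``.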
| 128 | +Future Extensions |
| 129 | +================= |
| 130 | + |
| 131 | +Note the descriptor holds a parallel ``typeobj`` which is a scalar class. A |
| 132 | +call like ``np.dtype('int8')(10)`` will now create a scalar object. The next |
| 133 | +step will be to replace the scalar classes with the descriptor classes, so |
| 134 | +that looking up a scalar's corresponding descriptor type becomes ``type(scalar)``. |
| 135 | + |
| 136 | +We could refactor ``numpy.datetime64`` to use the new hierarchy, inheriting from |
| 137 | +``np.dtype(uint64)``. |
| 138 | + |
| 139 | +Alternatives |
| 140 | +============ |
| 141 | + |
| 142 | +Descriptors as Instances |
| 143 | +------------------------ |
| 144 | + |
| 145 | +It is confusing that descriptors are classes, not class instances. We could |
| 146 | +define them slightly differently as instances (note the call in the value of |
| 147 | +``dtype_int_dict`` and that ``__repr__`` is now a bound class method of |
| 148 | +``IntDescr``):: |
| 149 | + |
| 150 | +    import numpy as np |
| 151 | + |
| 152 | +    class Dtype(type): |
| 153 | + |
| 154 | +        def __new__(cls, obj, *args, **kwargs): |
| 155 | +            if isinstance(obj, int): |
| 156 | +                return dtype_int_dict[obj] |
| 157 | +            elif isinstance(obj, type) and issubclass(obj, np.generic): |
| 158 | +                return dtype_scalar_dict[obj] |
| 159 | +            elif len(args) < 1: |
| 160 | +                # Dtype('int8') or Dtype('S10') or record descr |
| 161 | +                return create_new_descr(cls, obj, *args, **kwargs) |
| 162 | +            else: |
| 163 | +                return super().__new__(cls, obj, *args, **kwargs) |
| 164 | + |
| 165 | +        def __call__(self, args, kwargs): |
| 166 | +            return super().__call__(self.__name__, args, kwargs) |
| 167 | + |
| 168 | +    class GenericDescr(type, metaclass=Dtype): |
| 169 | +        def __new__(cls, *args, **kwargs): |
| 171 | +            return type.__new__(cls, *args, **kwargs) |
| 172 | + |
| 173 | +        def __call__(self, *args, **kwargs): |
| 174 | +            return self.typeobj(*args, **kwargs) |
| 175 | + |
| 176 | +    class IntDescr(GenericDescr): |
| 177 | +        def format(value): |
| 178 | +            return '%d' % value |
| 179 | +        def __repr__(self): |
| 180 | +            return 'dtype(%s%d)' %(self.kind, self.itemsize) |
| 181 | + |
| 182 | + |
| 183 | +    class UInt8Descr(IntDescr): |
| 184 | +        kind = 'uint' |
| 185 | +        itemsize = 8 |
| 186 | +        typeobj = np.uint8 |
| 187 | +        # sort, fill, cast, clip, ... |
| 188 | +        #ArrFuncs = int8_arrayfuncs |
| 189 | + |
| 190 | +    # Create singletons of builtin descriptors via Dtype.__call__ |
| 191 | +    dtype_int_dict = {1: UInt8Descr()} |
| 192 | +    dtype_scalar_dict = {np.uint8: dtype_int_dict[1]} |
| 193 | + |
| 194 | + |
| 195 | + |
| 196 | +Appendix |
| 197 | +======== |
| 198 | + |
| 199 | +References |
| 200 | +---------- |
| 201 | + |
| 202 | +- pandas `ExtensionArray interface <https://github.com/pandas-dev/pandas/blob/5b0610b875476a6f3727d7e9bedb90d370c669b5/pandas/core/arrays/base.py>`_ |
| 203 | +- Dtype `brainstorming session <https://github.com/numpy/numpy/wiki/Dtype-Brainstorming>`_ |
| 204 | +  from SciPy |
| 205 | + |
| 206 | +The current interface of dtypes in NumPy |
| 207 | +---------------------------------------- |
| 208 | + |
| 209 | +.. code-block:: python |
| 210 | + |
| 211 | +    class DescrFlags(IntFlag): |
| 212 | +        # The item must be reference counted when it is inserted or extracted. |
| 213 | +        ITEM_REFCOUNT = 0x01 |
| 214 | +        # Same as needing REFCOUNT |
| 215 | +        ITEM_HASOBJECT = 0x01 |
| 216 | +        # Convert to list for pickling |
| 217 | +        LIST_PICKLE = 0x02 |
| 218 | +        # The item is a POINTER |
| 219 | +        ITEM_IS_POINTER = 0x04 |
| 220 | +        # memory needs to be initialized for this data-type |
| 221 | +        NEEDS_INIT = 0x08 |
| 222 | +        # operations need Python C-API so don't give-up thread. |
| 223 | +        NEEDS_PYAPI = 0x10 |
| 224 | +        # Use f.getitem when extracting elements of this data-type |
| 225 | +        USE_GETITEM = 0x20 |
| 226 | +        # Use f.setitem when creating a 0-d array from this data-type |
| 227 | +        USE_SETITEM = 0x40 |
| 228 | +        # A sticky flag specifically for structured arrays |
| 229 | +        ALIGNED_STRUCT = 0x80 |
| 230 | + |
| 231 | +    class current_dtype(object): |
| 232 | +        itemsize: int |
| 233 | +        alignment: int |
| 234 | + |
| 235 | +        byteorder: str |
| 236 | +        flags: DescrFlags |
| 237 | +        metadata: ... # unknown |
| 238 | + |
| 239 | +        # getters |
| 240 | +        hasobject: bool |
| 241 | +        isalignedstruct: bool |
| 242 | +        isbuiltin: bool |
| 243 | +        isnative: bool |
| 244 | + |
| 245 | + |
| 246 | +        def newbyteorder(self) -> current_dtype: ... |
| 247 | + |
| 248 | +        # to move to a structured dtype subclass |
| 249 | +        names: Tuple[str] |
| 250 | +        fields: Dict[str, Union[ |
| 251 | +            Tuple[current_dtype, int], |
| 252 | +            Tuple[current_dtype, int, Any] |
| 253 | +        ]] |
| 254 | + |
| 255 | +        # to move to a subarray dtype subclass |
| 256 | +        subdtype: Optional[Tuple[dtype, Tuple[int,...]]] |
| 257 | +        shape: Tuple[int] |
| 258 | +        base: current_dtype |
| 259 | + |
| 260 | +        # to deprecate |
| 261 | +        type: Type # merge with cls |
| 262 | +        kind: str |
| 263 | +        num: int |
| 264 | +        str: str |
| 265 | +        name: str |
| 266 | +        char: str |
| 267 | +        descr: List[...] |
| 268 | + |
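As a small illustration of how the flag values listed above combine, the sketch below rebuilds ``DescrFlags`` with the standard-library ``enum.IntFlag``. The flag values are copied from the listing; the ``object_like`` combination and the ``needs_python_api`` helper are hypothetical and chosen only for demonstration, not taken from NumPy.

```python
from enum import IntFlag


class DescrFlags(IntFlag):
    ITEM_REFCOUNT = 0x01
    ITEM_HASOBJECT = 0x01  # same value, so IntFlag treats it as an alias
    LIST_PICKLE = 0x02
    ITEM_IS_POINTER = 0x04
    NEEDS_INIT = 0x08
    NEEDS_PYAPI = 0x10
    USE_GETITEM = 0x20
    USE_SETITEM = 0x40
    ALIGNED_STRUCT = 0x80


# Hypothetical flag set for a dtype that stores Python objects.
object_like = (
    DescrFlags.ITEM_REFCOUNT | DescrFlags.NEEDS_INIT | DescrFlags.NEEDS_PYAPI
)


def needs_python_api(flags):
    """Report whether operations on this dtype must hold the interpreter lock."""
    return bool(flags & DescrFlags.NEEDS_PYAPI)


print(int(object_like))
print(needs_python_api(object_like))
print(needs_python_api(DescrFlags.ALIGNED_STRUCT))
```

Since ``ITEM_REFCOUNT`` and ``ITEM_HASOBJECT`` share the value ``0x01``, the enum exposes them as two names for a single member, which matches the "same as needing REFCOUNT" comment in the listing.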