NEP: add dtype design NEP · numpy/numpy@9b2631b · GitHub
NEP: add dtype design NEP
1 parent 5e1a891 commit 9b2631b
doc/neps/nep-0029-dtype-as-type.rst

Lines changed: 268 additions & 0 deletions
@@ -0,0 +1,268 @@
===============================================
NEP 29 — Refactor Dtypes to become Type Objects
===============================================

:Author: Matti Picus
:Status: Draft
:Type: Standards Track
:Created: 2018-12-27


Abstract
========

NumPy's `dtype <http://www.numpy.org/devdocs/reference/generated/numpy.dtype.html>`_
is a Python class with the simple ``mro`` ``[np.dtype, object]``. Creating an
instance of ``dtype``, *i.e.* ``a = np.dtype('int8')``, results in a Python
object of type ``dtype``. The ``dtype`` object instance has attributes, among
them ``a.type``, which is a class object. Instantiating that class object with
``a.type(3)`` produces a NumPy `scalar
<http://www.numpy.org/devdocs/reference/arrays.scalars.html>`_.

This NEP proposes a different class hierarchy. Objects of ``np.dtype`` will
become type objects with a hierarchical ``mro`` like scalars. They will support
subclassing. A future NEP may propose that instantiating a dtype type object
will produce a scalar reflecting that dtype, but that is not a goal of this NEP.

The changed dtype will:

- Facilitate extending dtypes, typically for things like categoricals, novel
  representations like datetime or IP addresses, or adding attributes like
  units.
- Simplify the code around ``__repr__`` and method lookup.

Overall Design
--------------

In pure Python (without error checking)::

    import numpy as np

    class Dtype(type):

        def __new__(cls, obj, *args, **kwargs):
            if isinstance(obj, int):
                return dtype_int_dict[obj]
            elif isinstance(obj, type) and issubclass(obj, np.generic):
                return dtype_scalar_dict[obj]
            elif len(args) < 1:
                # Dtype('int8') or Dtype('S10') or record descr
                return create_new_descr(cls, obj, *args, **kwargs)
            else:
                return super().__new__(cls, obj, *args, **kwargs)

        def __call__(self, *args, **kwargs):
            return self.typeobj(*args, **kwargs)

    class IntDtype(Dtype):
        def __repr__(self):
            if self is IntDescr:
                return type.__repr__(self)
            return 'dtype(%s%d)' % (self.kind, self.itemsize)

    class GenericDescr(type, metaclass=Dtype):
        pass

    class IntDescr(GenericDescr, metaclass=IntDtype):
        def format(value):
            return '%d' % value

    class UInt8Descr(IntDescr):
        kind = 'uint'
        itemsize = 8
        typeobj = np.uint8
        # sort, fill, cast, clip, ...
        ArrFuncs = int8_arrayfuncs

    dtype_int_dict = {1: UInt8Descr}
    dtype_scalar_dict = {np.uint8: UInt8Descr}

At NumPy startup, as we do today, we would generate the built-in set of
descriptor classes, and fill in ``dtype_int_dict`` and ``dtype_scalar_dict``
so that the built-in descriptors would continue to be singletons. ``Void``,
``Byte`` and ``Unicode`` descriptors would be constructed on demand, as is done
today.

All dtype instances would inherit from ``GenericDescr`` which inherits from
``type``, making them instances of ``type``::

    >>> a = np.dtype(np.uint8)
    >>> a.mro(a)
    [dtype(uint8), <class 'dtype.IntDescr'>, <class 'dtype.GenericDescr'>, \
    <class 'type'>, <class 'object'>]

Each descr class will have its own set of ArrFuncs (``clip``, ``fill``,
``cast``). The ``format`` function is what ``array_print`` will call to turn a
memory location into a string.

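The design above cannot be run as written because ``create_new_descr`` and
``int8_arrayfuncs`` are not defined in this document. The following is a
minimal sketch of the same idea that runs without NumPy. It assumes a stand-in
scalar class ``FakeUInt8`` in place of ``np.uint8``, folds ``__repr__`` into a
single metaclass, and uses a hypothetical ``lookup`` helper in place of the
first two branches of ``Dtype.__new__``; none of these names are part of the
proposal.

```python
class FakeUInt8(int):
    """Stand-in for np.uint8 (assumption: lets the sketch run without NumPy)."""


class Dtype(type):
    """Metaclass: every descriptor class is an instance of Dtype."""

    def __call__(cls, *args, **kwargs):
        # Calling a descriptor class forwards to its scalar type.
        return cls.typeobj(*args, **kwargs)

    def __repr__(cls):
        # Abstract descriptors define no ``kind`` and keep the default repr.
        if 'kind' not in vars(cls):
            return type.__repr__(cls)
        return 'dtype(%s%d)' % (cls.kind, cls.itemsize)


class GenericDescr(type, metaclass=Dtype):
    pass


class IntDescr(GenericDescr):
    def format(value):
        # Plain function looked up on the class; no ``self`` is involved.
        return '%d' % value


class UInt8Descr(IntDescr):
    kind = 'uint'
    itemsize = 8
    typeobj = FakeUInt8


# Registries filled once at startup so built-in descriptors stay singletons.
dtype_int_dict = {1: UInt8Descr}
dtype_scalar_dict = {FakeUInt8: UInt8Descr}


def lookup(obj):
    """Hypothetical helper: resolve a type number or scalar type."""
    if isinstance(obj, type):
        return dtype_scalar_dict[obj]
    return dtype_int_dict[obj]


a = lookup(FakeUInt8)
print(a)                      # the descriptor is a class with a custom repr
print(isinstance(a, type))    # descriptors are type objects
print(a.__mro__)
print(a(10), type(a(10)).__name__)
```

Both registry lookups return the same class object, which is what keeps the
built-in descriptors singletons in this sketch.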
Downstream users of NumPy could subclass these type classes. Creating a categorical
dtype would look like this (without error checking for out-of-bounds values)::

    class Colors(UInt8Descr):
        colors = ['red', 'green', 'blue']
        def format(value):
            return Colors.colors[value]
        ArrFuncs = null_arrayfuncs

    c = np.array([0, 1, 1, 0, 2], dtype=Colors)

Additional code would be needed to neutralize the ``tp_as_number`` slot functions.

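The ``Colors`` example skips bounds checking. As a rough illustration of what
that check might look like, here is a sketch that runs without NumPy. The
``Colors`` class below is a plain stand-in holding only the lookup table, and
``format_codes`` is a hypothetical helper imitating what array printing would
do for each element; neither is an existing NumPy API.

```python
class Colors:
    """Stand-in for the ``Colors`` descriptor; holds only the lookup table."""
    colors = ['red', 'green', 'blue']

    @staticmethod
    def format(value):
        # Reject codes outside the table. Without this check a negative code
        # would silently index from the end of the list.
        if not 0 <= value < len(Colors.colors):
            raise ValueError('invalid category code: %r' % (value,))
        return Colors.colors[value]


def format_codes(descr, codes):
    """Hypothetical helper: format each stored code via the descriptor."""
    return [descr.format(code) for code in codes]


print(format_codes(Colors, [0, 1, 1, 0, 2]))
# ['red', 'green', 'green', 'red', 'blue']
```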
Advantages
==========

It is very difficult today to override dtype behaviour, since internally
descriptor objects are not true type instances but rather containers for the
``ArrayDescrObject`` struct.

Disadvantages
=============

Making descriptors into type objects requires thinking about type classes,
which are more difficult to reason about than object instances. For instance,
note that in the ``Colors`` example, we did not instantiate an object of the
``Colors`` type but rather used that type directly in the ndarray creation. Also,
the ``format`` function is not a bound method of a class instance but rather an
unbound function on a type class (no ``self`` argument is used).

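The remark about ``format`` having no ``self`` argument can be seen in plain
Python. The sketch below uses hypothetical classes ``Descr`` and
``StaticDescr`` that are unrelated to NumPy. It shows that such a function
works when looked up on the class, fails when looked up on an instance, and
that ``staticmethod`` is one way to make both spellings behave the same.

```python
class Descr:
    def format(value):
        # No ``self``: intended to be called on the class itself.
        return '%d' % value


print(Descr.format(3))        # looked up on the class: a plain function

try:
    Descr().format(3)         # looked up on an instance: the instance is bound
except TypeError as exc:      # as ``value`` and 3 becomes an extra argument
    print('TypeError:', exc)


class StaticDescr:
    @staticmethod
    def format(value):
        # staticmethod disables binding, so class and instance calls agree.
        return '%d' % value


print(StaticDescr.format(3), StaticDescr().format(3))
```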
Future Extensions
=================

Note the descriptor holds a parallel ``typeobj`` which is a scalar class. A
call like ``np.dtype('int8')(10)`` will now create a scalar object. The next
step will be to replace the scalar classes with the descriptor classes, so
that looking up a scalar's corresponding descriptor type becomes ``type(scalar)``.

We could refactor ``numpy.datetime64`` to use the new hierarchy, inheriting from
``np.dtype(uint64)``.

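The two steps described at the start of this section can be contrasted in a
small pure-Python sketch. All class names here are hypothetical stand-ins
chosen for illustration (``ScalarUInt8`` plays the role of ``np.uint8``), and
the second half is only one possible reading of "replace the scalar classes
with the descriptor classes".

```python
class ForwardingDtype(type):
    """Step 1 (this NEP): calling a descriptor forwards to ``typeobj``."""

    def __call__(cls, *args, **kwargs):
        return cls.typeobj(*args, **kwargs)


class ScalarUInt8(int):
    """Stand-in for the existing scalar class."""


class UInt8Descr(metaclass=ForwardingDtype):
    typeobj = ScalarUInt8


x = UInt8Descr(10)
# The scalar's type is the parallel scalar class, so finding the descriptor
# from a scalar still needs a separate table such as dtype_scalar_dict.
print(type(x) is ScalarUInt8)


class MergedDtype(type):
    """Step 2 (possible future): the descriptor is itself the scalar class."""


class MergedUInt8(int, metaclass=MergedDtype):
    pass


y = MergedUInt8(10)
# Now the descriptor is recovered directly as type(scalar).
print(type(y) is MergedUInt8)
```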
Alternatives
============

Descriptors as Instances
------------------------

It is confusing that descriptors are classes, not class instances. We could
define them slightly differently as instances (note the call in the value of
``dtype_int_dict`` and that ``__repr__`` is now a bound class method of
``IntDescr``)::

    import numpy as np

    class Dtype(type):

        def __new__(cls, obj, *args, **kwargs):
            if isinstance(obj, int):
                return dtype_int_dict[obj]
            elif isinstance(obj, type) and issubclass(obj, np.generic):
                return dtype_scalar_dict[obj]
            elif len(args) < 1:
                # Dtype('int8') or Dtype('S10') or record descr
                return create_new_descr(cls, obj, *args, **kwargs)
            else:
                return super().__new__(cls, obj, *args, **kwargs)

        def __call__(self, *args, **kwargs):
            return super().__call__(self.__name__, args, kwargs)

    class GenericDescr(type, metaclass=Dtype):
        def __new__(cls, *args, **kwargs):
            return type.__new__(cls, *args, **kwargs)

        def __call__(self, *args, **kwargs):
            return self.typeobj(*args, **kwargs)

    class IntDescr(GenericDescr):
        def format(value):
            return '%d' % value
        def __repr__(self):
            return 'dtype(%s%d)' % (self.kind, self.itemsize)


    class UInt8Descr(IntDescr):
        kind = 'uint'
        itemsize = 8
        typeobj = np.uint8
        # sort, fill, cast, clip, ...
        #ArrFuncs = int8_arrayfuncs

    # Create singletons of builtin descriptors via Dtype.__call__
    dtype_int_dict = {1: UInt8Descr()}
    dtype_scalar_dict = {np.uint8: dtype_int_dict[1]}



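To check how the instance-based alternative behaves, the following sketch
reduces it to the parts that run without NumPy. It assumes a stand-in scalar
class ``FakeUInt8`` in place of ``np.uint8`` and leaves out the lookup
branches of ``Dtype.__new__``; it is an illustration, not the proposed
implementation.

```python
class FakeUInt8(int):
    """Stand-in for np.uint8 (assumption: lets the sketch run without NumPy)."""


class Dtype(type):
    def __call__(cls, *args, **kwargs):
        # Instantiating a descriptor class goes through type.__call__ with
        # (name, bases, namespace), producing an object whose type is ``cls``.
        return super().__call__(cls.__name__, args, kwargs)


class GenericDescr(type, metaclass=Dtype):
    def __call__(self, *args, **kwargs):
        # Calling a descriptor *instance* creates a scalar.
        return self.typeobj(*args, **kwargs)


class IntDescr(GenericDescr):
    def __repr__(self):
        # Bound method: ``self`` is the descriptor instance.
        return 'dtype(%s%d)' % (self.kind, self.itemsize)


class UInt8Descr(IntDescr):
    kind = 'uint'
    itemsize = 8
    typeobj = FakeUInt8


# Singletons are created once and shared between both registries.
dtype_int_dict = {1: UInt8Descr()}
dtype_scalar_dict = {FakeUInt8: dtype_int_dict[1]}

a = dtype_int_dict[1]
print(repr(a))                     # dtype(uint8)
print(type(a) is UInt8Descr)       # the descriptor is an instance of its class
print(isinstance(a(10), FakeUInt8))
```

Note that the instance is still a type object, because ``UInt8Descr`` inherits
from ``type``; only the level at which ``kind`` and ``__repr__`` are looked up
changes.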
Appendix
========

References
----------

- pandas `ExtensionArray interface <https://github.com/pandas-dev/pandas/blob/5b0610b875476a6f3727d7e9bedb90d370c669b5/pandas/core/arrays/base.py>`_
- Dtype `brainstorming session <https://github.com/numpy/numpy/wiki/Dtype-Brainstorming>`_
  from SciPy

The current interface of dtypes in NumPy
----------------------------------------

.. code-block:: python

    from __future__ import annotations
    from enum import IntFlag
    from typing import Any, Dict, List, Optional, Tuple, Type, Union

    class DescrFlags(IntFlag):
        # The item must be reference counted when it is inserted or extracted.
        ITEM_REFCOUNT   = 0x01
        # Same as needing REFCOUNT
        ITEM_HASOBJECT  = 0x01
        # Convert to list for pickling
        LIST_PICKLE     = 0x02
        # The item is a POINTER
        ITEM_IS_POINTER = 0x04
        # memory needs to be initialized for this data-type
        NEEDS_INIT      = 0x08
        # operations need Python C-API so don't give-up thread.
        NEEDS_PYAPI     = 0x10
        # Use f.getitem when extracting elements of this data-type
        USE_GETITEM     = 0x20
        # Use f.setitem when creating a 0-d array from this data-type
        USE_SETITEM     = 0x40
        # A sticky flag specifically for structured arrays
        ALIGNED_STRUCT  = 0x80

    class current_dtype(object):
        itemsize: int
        alignment: int

        byteorder: str
        flags: DescrFlags
        metadata: ...  # unknown

        # getters
        hasobject: bool
        isalignedstruct: bool
        isbuiltin: bool
        isnative: bool


        def newbyteorder(self) -> current_dtype: ...

        # to move to a structured dtype subclass
        names: Tuple[str, ...]
        fields: Dict[str, Union[
            Tuple[current_dtype, int],
            Tuple[current_dtype, int, Any]
        ]]

        # to move to a subarray dtype subclass
        subdtype: Optional[Tuple[current_dtype, Tuple[int, ...]]]
        shape: Tuple[int, ...]
        base: current_dtype

        # to deprecate
        type: Type  # merge with cls
        kind: str
        num: int
        str: str
        name: str
        char: str
        descr: List[...]

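As a brief illustration of how the flag values listed above combine, the
sketch below repeats the ``DescrFlags`` definition using the standard
library's ``enum.IntFlag``. The particular combination stored in ``flags`` is
made up for the example and is not claimed to match any real dtype.

```python
from enum import IntFlag


class DescrFlags(IntFlag):
    ITEM_REFCOUNT = 0x01
    ITEM_HASOBJECT = 0x01      # same value, so this is an alias of ITEM_REFCOUNT
    LIST_PICKLE = 0x02
    ITEM_IS_POINTER = 0x04
    NEEDS_INIT = 0x08
    NEEDS_PYAPI = 0x10
    USE_GETITEM = 0x20
    USE_SETITEM = 0x40
    ALIGNED_STRUCT = 0x80


# Hypothetical combination chosen only to demonstrate the bit operations.
flags = DescrFlags.ITEM_REFCOUNT | DescrFlags.NEEDS_INIT | DescrFlags.NEEDS_PYAPI

print(int(flags))                                           # 25 (0x01 | 0x08 | 0x10)
print(DescrFlags.NEEDS_PYAPI in flags)                      # True
print(DescrFlags.ALIGNED_STRUCT in flags)                   # False
print(DescrFlags.ITEM_HASOBJECT is DescrFlags.ITEM_REFCOUNT)  # True
```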
