ENH: Rewrite of array-coercion to support new dtypes #16200
Merged

Commits (57, changes from all commits)
b5dc1ed
WIP: Rework array coercion
seberg f5df08c
WIP: Further steps toward new coercion, start with making discovery p…
seberg 63bb417
Close to the first working setup
seberg 28c8b39
WIP: Some cleanup/changes?
seberg b204379
WIP: Make things work by using AdaptFlexibleDType (without obj) for now
seberg 5bd5847
Use new mechanism for np.asarray, and hopefully get void right, har
seberg 9e03d8d
First version mainly working
seberg efbe979
Further fixes, make max-dims reached more logical and enter obj arrays
seberg a552d2a
TST: Small test adjustments
seberg 2cfcf56
WIP: Seems pretty good, but needs cleaning up...
seberg 302813c
Smaller cleanups, better errors mainly?
seberg cec10fb
Fixup for scalar kind, and ensure OBJECT is special for assignment
seberg 1eaca02
Use PyArray_Pack in a few other places
seberg 3f5e4a2
Some micro-optimization tries (should probably be largely reverted)
seberg 1896813
Optimize away filling all dims with -1 at the start
seberg c7e7dd9
Other smallre changes, some optimization related.
seberg 60fa9b9
Small bug fixup and rebase on master
seberg e20dded
Fixups/comments for compiler warnings
seberg 4e0029d
update some comments, remove outdated old code path
seberg ad31a32
Small fixups/comment changes
seberg ca09045
BUG: Make static declaration safe (may be an issue on msvc mostly)
seberg 9ceeb97
Replace AdaptFlexibleDType with object and delete some datetime thing…
seberg 4a04e89
Add somewhat disgusting hacks for datetime support
seberg 08a4687
MAINT: Remove use of PyArray_GetParamsFromObject from PyArray_CopyObject
seberg a1ee25a
MAINT: Delete legacy dtype discovery
seberg 1405a30
Allow returning NULL for dtype when there is no object to discover from
seberg a7c5a59
BUG: Smaller fixes in object-array parametric discovery
seberg 75a728f
BUG: remove incorrect assert
seberg b09217c
BUG: When filling an array from the cache, store original for objects
seberg b28b2a1
BUG: Fix discovery for empty lists
seberg 7a343c6
BUG: Add missing DECREF
seberg 7d1489a
Fixups: Some smaller fixups and comments to ensure we have tests
seberg 946edc8
BUG: Add missing error check
seberg 002fa2f
BUG: Reorder dimension fix/check and promotion
seberg 29f1515
BUG: Add missing cache free...
seberg ba0a6d0
BUG: Fixup for PyArray_Pack
seberg b3544a1
BUG: Fix use after free in PyArray_CopyObject
seberg bcd3320
BUG: Need to set the base field apparently and swap promotion
seberg 454d785
MAINT: Use flag to indicate that dtype discovery is not necessary
seberg 68cd028
MAINT: Fixups (some based on new tests), almost finished
seberg 1035c3f
MAINT: Use macros/functions instead of direct slot access
seberg e30cbfb
MAINT: Delete PyArray_AssignFromSequence
seberg 56c63d8
MAINT: Undo change of how 0-D array-likes are handled as scalars
seberg 605588c
MAINT: Undo some header changes...
seberg 4eb9cfd
MAINT: Try to clean up headers a bit
seberg 4ac514f
TST: Add test for too-deep non-object deprecation
seberg 8a7f0e6
MAINt: Add assert for an unreachable exception path
seberg 7012ef7
TST: Adapt coercion-tests to the new situation
seberg 3ccf696
DOC: Add release notes for array-coercion changes
seberg 6ff4d48
MAINT: Remove weakref from mapping (for now) and rename
seberg e3f091e
Update numpy/core/src/multiarray/array_coercion.c
seberg 4fe0ad2
MAINT: Put a hack in place to allow datetime64 -> string assignment w…
seberg d39953c
Update doc/release/upcoming_changes/16200.compatibility.rst
seberg b36750b
TST: datetime64 test_scalar_coercion does not fail anymore
seberg 0f78129
Update doc/release/upcoming_changes/16200.compatibility.rst
mattip aee13e0
DOC,STY: Use bitshift intsead of powers of two and fix comments
seberg 22ee971
TST: Add test for astype to stringlength tests
seberg
doc/release/upcoming_changes/16200.compatibility.rst (new file)

@@ -0,0 +1,64 @@
NumPy Scalars are cast when assigned to arrays
----------------------------------------------

When creating or assigning to arrays, in all relevant cases NumPy
scalars will now be cast identically to NumPy arrays. In particular
this changes the behaviour in some cases which previously raised an
error::

    np.array([np.float64(np.nan)], dtype=np.int64)

will succeed at this time (this may change) and return an undefined result
(usually the smallest possible integer). This also affects assignments::

    arr[0] = np.float64(np.nan)

Note that this already happened for ``np.array(np.float64(np.nan), dtype=np.int64)``
and that the behaviour is unchanged for ``np.nan`` itself, which is a Python
float.
To avoid backward compatibility issues, assignment from a ``datetime64``
scalar to strings of too short length remains supported at this time.
This means that ``np.asarray(np.datetime64("2020-10-10"), dtype="S5")``
succeeds now, when it failed before. In the long term this may be
deprecated, or the unsafe cast may be allowed generally to make assignment
of arrays and scalars behave consistently.

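As a short, hedged sketch of the above (not documented API; the integer
results are undefined and may differ by platform)::

    import numpy as np

    # NumPy scalars now take the same (unsafe) cast path as arrays,
    # so neither line raises; the integer produced is undefined.
    arr = np.array([np.float64(np.nan)], dtype=np.int64)
    arr[0] = np.float64(np.nan)

    # Unchanged: np.nan is a Python float, so this still raises ValueError.
    # np.array([np.nan], dtype=np.int64)

    # datetime64 scalar -> too-short string now succeeds (it failed before):
    np.asarray(np.datetime64("2020-10-10"), dtype="S5")
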
Array coercion changes when strings and other types are mixed
--------------------------------------------------------------

When strings and other types are mixed, such as::

    np.array(["string", np.float64(3.)], dtype="S")

the results will change, which may lead to string dtypes with longer strings
in some cases. In particular, if ``dtype="S"`` is not provided, any numerical
value will lead to a string result long enough to hold all possible numerical
values (e.g. "S32" for floats). Note that you should always provide
``dtype="S"`` when converting non-strings to strings.

If ``dtype="S"`` is provided, the results will be largely identical to before,
but NumPy scalars (not a Python float like ``1.0``) will still enforce
a uniform string length::

    np.array([np.float64(3.)], dtype="S")  # gives "S32"
    np.array([3.0], dtype="S")             # gives "S3"

while previously the first version gave the same result as the second.

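For concreteness, a minimal check of the widths quoted above (the "S32" and
"S3" values are taken directly from this note)::

    import numpy as np

    # A NumPy float64 scalar enforces the maximum width for its type:
    np.array([np.float64(3.)], dtype="S").dtype   # dtype('S32')
    # A Python float only needs the length of its string form, "3.0":
    np.array([3.0], dtype="S").dtype              # dtype('S3')
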
Array coercion restructure
--------------------------

Array coercion has been restructured. In general, this should not affect
users. In extremely rare corner cases where array-likes are nested::

    np.array([array_like1])

things will now be more consistent with::

    np.array([np.array(array_like1)])

which could potentially change output subtly for badly defined array-likes.
We are not aware of any such case where the results were not clearly
incorrect previously.
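
As an illustration, ``MyArrayLike`` below is a hypothetical, well-behaved
array-like; for such objects nothing changes, and both spellings now go
through the same path::

    import numpy as np

    class MyArrayLike:
        # Hypothetical array-like used only for illustration.
        def __array__(self, dtype=None):
            return np.array([1.0, 2.0], dtype=dtype)

    a = MyArrayLike()
    # Nested array-likes are handled consistently, so both lines
    # produce the same shape (1, 2) float64 array:
    np.array([a])
    np.array([np.array(a)])
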
numpy/core/src/multiarray/abstractdtypes.c (new file)

@@ -0,0 +1,168 @@
#define PY_SSIZE_T_CLEAN
#include <Python.h>
#include "structmember.h"

#define NPY_NO_DEPRECATED_API NPY_API_VERSION
#define _MULTIARRAYMODULE
#include "numpy/ndarraytypes.h"
#include "numpy/arrayobject.h"

#include "abstractdtypes.h"
#include "array_coercion.h"
#include "common.h"


static PyArray_Descr *
discover_descriptor_from_pyint(
        PyArray_DTypeMeta *NPY_UNUSED(cls), PyObject *obj)
{
    assert(PyLong_Check(obj));
    /*
     * We check whether long is good enough. If not, check longlong and
     * unsigned longlong before falling back to `object`.
     */
    long long value = PyLong_AsLongLong(obj);
    if (error_converting(value)) {
        PyErr_Clear();
    }
    else {
        if (NPY_MIN_LONG <= value && value <= NPY_MAX_LONG) {
            return PyArray_DescrFromType(NPY_LONG);
        }
        return PyArray_DescrFromType(NPY_LONGLONG);
    }

    unsigned long long uvalue = PyLong_AsUnsignedLongLong(obj);
    if (uvalue == (unsigned long long)-1 && PyErr_Occurred()) {
        PyErr_Clear();
    }
    else {
        return PyArray_DescrFromType(NPY_ULONGLONG);
    }

    return PyArray_DescrFromType(NPY_OBJECT);
}

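A sketch of how this discovery chain surfaces at the Python level (assuming
a platform where the C `long` is 64 bits; Windows, with a 32-bit `long`,
reports differently for the first line):

    import numpy as np

    np.array(1).dtype      # NPY_LONG, e.g. int64 on most 64-bit Unix
    np.array(2**63).dtype  # too large for int64 -> uint64 (NPY_ULONGLONG)
    np.array(2**64).dtype  # too large for uint64 -> object fallback
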
static PyArray_Descr *
discover_descriptor_from_pyfloat(
        PyArray_DTypeMeta *NPY_UNUSED(cls), PyObject *obj)
{
    assert(PyFloat_CheckExact(obj));
    return PyArray_DescrFromType(NPY_DOUBLE);
}


static PyArray_Descr *
discover_descriptor_from_pycomplex(
        PyArray_DTypeMeta *NPY_UNUSED(cls), PyObject *obj)
{
    assert(PyComplex_CheckExact(obj));
    return PyArray_DescrFromType(NPY_COMPLEX128);
}

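In contrast to ints, the float and complex discovery above never inspects
the value; a one-line check at the Python level:

    import numpy as np

    np.array(3.0).dtype   # always dtype('float64')    (NPY_DOUBLE)
    np.array(1.0j).dtype  # always dtype('complex128') (NPY_COMPLEX128)
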
NPY_NO_EXPORT int
initialize_and_map_pytypes_to_dtypes()
{
    PyArrayAbstractObjDTypeMeta_Type.tp_base = &PyArrayDTypeMeta_Type;
    if (PyType_Ready(&PyArrayAbstractObjDTypeMeta_Type) < 0) {
        return -1;
    }
    ((PyTypeObject *)&PyArray_PyIntAbstractDType)->tp_base = &PyArrayDTypeMeta_Type;
    PyArray_PyIntAbstractDType.scalar_type = &PyLong_Type;
    if (PyType_Ready((PyTypeObject *)&PyArray_PyIntAbstractDType) < 0) {
        return -1;
    }
    ((PyTypeObject *)&PyArray_PyFloatAbstractDType)->tp_base = &PyArrayDTypeMeta_Type;
    PyArray_PyFloatAbstractDType.scalar_type = &PyFloat_Type;
    if (PyType_Ready((PyTypeObject *)&PyArray_PyFloatAbstractDType) < 0) {
        return -1;
    }
    ((PyTypeObject *)&PyArray_PyComplexAbstractDType)->tp_base = &PyArrayDTypeMeta_Type;
    PyArray_PyComplexAbstractDType.scalar_type = &PyComplex_Type;
    if (PyType_Ready((PyTypeObject *)&PyArray_PyComplexAbstractDType) < 0) {
        return -1;
    }

    /* Register the new DTypes for discovery */
    if (_PyArray_MapPyTypeToDType(
            &PyArray_PyIntAbstractDType, &PyLong_Type, NPY_FALSE) < 0) {
        return -1;
    }
    if (_PyArray_MapPyTypeToDType(
            &PyArray_PyFloatAbstractDType, &PyFloat_Type, NPY_FALSE) < 0) {
        return -1;
    }
    if (_PyArray_MapPyTypeToDType(
            &PyArray_PyComplexAbstractDType, &PyComplex_Type, NPY_FALSE) < 0) {
        return -1;
    }

    /*
     * Map str, bytes, and bool, for which we do not need abstract versions,
     * to the NumPy DTypes. This is done here using the `is_known_scalar_type`
     * function.
     * TODO: The `is_known_scalar_type` function is considered preliminary,
     *       the same could be achieved e.g. with additional abstract DTypes.
     */
    PyArray_DTypeMeta *dtype;
    dtype = NPY_DTYPE(PyArray_DescrFromType(NPY_UNICODE));
    if (_PyArray_MapPyTypeToDType(dtype, &PyUnicode_Type, NPY_FALSE) < 0) {
        return -1;
    }

    dtype = NPY_DTYPE(PyArray_DescrFromType(NPY_STRING));
    if (_PyArray_MapPyTypeToDType(dtype, &PyBytes_Type, NPY_FALSE) < 0) {
        return -1;
    }
    dtype = NPY_DTYPE(PyArray_DescrFromType(NPY_BOOL));
    if (_PyArray_MapPyTypeToDType(dtype, &PyBool_Type, NPY_FALSE) < 0) {
        return -1;
    }

    return 0;
}

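The str/bytes/bool mappings registered above pair each Python type with a
concrete NumPy DType; for the flexible string dtypes, the item size is then
discovered per value. A quick sketch:

    import numpy as np

    np.array(True).dtype     # dtype('bool')
    np.array("abc").dtype    # dtype('<U3')  (size discovered from the value)
    np.array(b"abcd").dtype  # dtype('S4')
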
/* Note: This is currently largely not used, but will be required eventually. */
NPY_NO_EXPORT PyTypeObject PyArrayAbstractObjDTypeMeta_Type = {
    PyVarObject_HEAD_INIT(NULL, 0)
    .tp_name = "numpy._AbstractObjDTypeMeta",
    .tp_basicsize = sizeof(PyArray_DTypeMeta),
    .tp_flags = Py_TPFLAGS_DEFAULT,
    .tp_doc = "Helper MetaClass for value based casting AbstractDTypes.",
};

NPY_NO_EXPORT PyArray_DTypeMeta PyArray_PyIntAbstractDType = {{{
        PyVarObject_HEAD_INIT(&PyArrayAbstractObjDTypeMeta_Type, 0)
        .tp_basicsize = sizeof(PyArray_DTypeMeta),
        .tp_name = "numpy._PyIntBaseAbstractDType",
    },},
    .abstract = 1,
    .discover_descr_from_pyobject = discover_descriptor_from_pyint,
    .kind = 'i',
};

NPY_NO_EXPORT PyArray_DTypeMeta PyArray_PyFloatAbstractDType = {{{
        PyVarObject_HEAD_INIT(&PyArrayAbstractObjDTypeMeta_Type, 0)
        .tp_basicsize = sizeof(PyArray_DTypeMeta),
        .tp_name = "numpy._PyFloatBaseAbstractDType",
    },},
    .abstract = 1,
    .discover_descr_from_pyobject = discover_descriptor_from_pyfloat,
    .kind = 'f',
};

NPY_NO_EXPORT PyArray_DTypeMeta PyArray_PyComplexAbstractDType = {{{
        PyVarObject_HEAD_INIT(&PyArrayAbstractObjDTypeMeta_Type, 0)
        .tp_basicsize = sizeof(PyArray_DTypeMeta),
        .tp_name = "numpy._PyComplexBaseAbstractDType",
    },},
    .abstract = 1,
    .discover_descr_from_pyobject = discover_descriptor_from_pycomplex,
    .kind = 'c',
};

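These abstract DTypes are the hook for value-based promotion of Python
scalars: at the time of this PR (pre-NEP 50), a Python int participates in
promotion by value, while a dtype participates by kind. A sketch:

    import numpy as np

    np.result_type(np.int8, 1)         # dtype('int8'): the value 1 fits
    np.result_type(np.int8, np.int64)  # dtype('int64'): dtypes promote by kind
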
numpy/core/src/multiarray/abstractdtypes.h (new file)

@@ -0,0 +1,19 @@
#ifndef _NPY_ABSTRACTDTYPES_H
#define _NPY_ABSTRACTDTYPES_H

#include "dtypemeta.h"

/*
 * These are mainly needed for value based promotion in ufuncs. It
 * may be necessary to make them (partially) public, to allow user-defined
 * dtypes to perform value based casting.
 */
NPY_NO_EXPORT extern PyTypeObject PyArrayAbstractObjDTypeMeta_Type;
NPY_NO_EXPORT extern PyArray_DTypeMeta PyArray_PyIntAbstractDType;
NPY_NO_EXPORT extern PyArray_DTypeMeta PyArray_PyFloatAbstractDType;
NPY_NO_EXPORT extern PyArray_DTypeMeta PyArray_PyComplexAbstractDType;

NPY_NO_EXPORT int
initialize_and_map_pytypes_to_dtypes();

#endif  /* _NPY_ABSTRACTDTYPES_H */
How hard would it be to make any conversion of `nan` or `inf` to any int raise? I understand there might be benchmark concerns, which could be checked with an appropriate benchmark.

Well, the question is what we want: do we want NumPy scalars (which are almost like 0-D arrays) to behave like arrays, or do we want them to behave specially? If we do not want special casing, my code is probably nicer.

That would mean adding a range check to the actual float casting code. I did not check how much speed that costs; for large arrays probably none (memory speeds should be much slower).

I was thinking that in all cases (arrays, 0-d arrays, and scalars) we do not want to convert `nan` and `inf` to ints by default. If someone actually wants this (for arrays and 0-d arrays), they could do it via view -> copy.

Well, I don't much like 0-D arrays being special. And scalars and 0-D arrays are almost the same (because 0-D arrays tend to be converted to scalars). But yes, you could define element setting as being special...

In general that does not work, though for most of our dtypes it does: `__int__`, `__float__`, `__complex__`, `__str__` are fine, and that works mostly. It currently creates bugs, because `complex128` thinks it can reasonably use `__complex__`, which it must not for a `float128`. It also means that 0-D objects tend to behave specially, because instead of aligning how the two work, we define them as two different operations. And that can be confusing (`str(array-like)` is well defined after all; we currently solve that by refusing to call `str(obj)` for any sequence). Also, within `float128`/`complex128` we (incompletely) hack around these by checking for 0-D arrays.

So yes, I can make single element assignment special (in most cases; I have my doubts that real consistency can be achieved). But we can only do that for some of our own dtypes, where Python has well defined protocols.

So either I need to flag these dtypes to behave differently (let them always just use `__float__`, `__int__`, etc., except for times and float128/complex128), or do some other dance, which feels like it makes dtype creation unnecessarily complicated. `DType.setitem` would have to call/signal back into NumPy in some cases.

Just for the sake of it: as for the current behaviour, I cannot say it is illogical, but I wonder whether it is the best solution to cement item assignment as being special, and whether it is really easier on users.

But again, yes, I can put these behaviours back in place; they just cannot generalize to arbitrary user-defined dtypes. And maybe we have to discuss it on Wednesday or the mailing list...

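To make the two paths in this discussion concrete, a sketch of the behaviour at the time of this thread (the Python protocol refuses NaN, while the unsafe casting path does not):

    import numpy as np

    # Protocol path: float.__int__ raises for NaN.
    try:
        int(np.float64(np.nan))
    except ValueError:
        pass  # "cannot convert float NaN to integer"

    # Casting path: an unsafe cast silently yields an undefined integer.
    np.array([np.nan]).astype(np.int64)
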
In my opinion, all of the examples of assignment should fail. Any reasonable user-defined dtype that converts from float to int should also fail when the conversion makes no sense. Hopefully we will not get the blame if that happens.

I do not disagree, but in that case casting should probably fail in general (or at least warn), and not just in the 0-D case, right?

So it is IMO not an argument for adding an extra path for `setitem`, except possibly for performance reasons. The question is whether we want to special-case these 0-D item assignments because the behaviour is better, or just because it is somewhat faster than doing proper casting?

From the sidelines: agreed that casting should fail in those cases (or, more generally, that setting should behave identically to casting the value to the dtype of the array being set).

So for me right now, this feels like adding a fast path, since I don't like the difference between the two paths (casting vs. item assignment). But of course we can argue that if the long-term goal is to align casting with current item assignment behaviour, we can add those fast paths right now. (It must be a fast/additional path, since Python protocols are not a generic solution.)

I don't mind adding the fast path now; I feel it is one more thing to think about, but it does not need to be exposed to user-defined dtypes right away. They are unlikely to make use of it, except maybe a rational dtype using "as integer ratio". (I am not 100% sure there cannot be reasons to diverge from the Python scalar protocols.)

I am unsure item assignment + casting is actually worth fast paths, but maybe that does not matter. They currently lead to weird things (e.g. a DNA dtype, which is much like a string and a sequence, could never be assigned to a NumPy string dtype!). However, those issues are resolvable now in either case.

Is there an open issue or PR to make any conversion from `inf` or `nan` to `int` raise an error or at least warn?

#16624