8000 ENH: Rewrite of array-coercion to support new dtypes by seberg · Pull Request #16200 · numpy/numpy · GitHub

ENH: Rewrite of array-coercion to support new dtypes #16200


Merged: 57 commits, merged Jul 9, 2020

Changes from 1 commit

Commits (57):
b5dc1ed
WIP: Rework array coercion
seberg Mar 7, 2020
f5df08c
WIP: Further steps toward new coercion, start with making discovery p…
seberg Mar 16, 2020
63bb417
Close to the first working setup
seberg Mar 27, 2020
28c8b39
WIP: Some cleanup/changes?
seberg Mar 30, 2020
b204379
WIP: Make things work by using AdaptFlexibleDType (without obj) for now
seberg May 5, 2020
5bd5847
Use new mechanism for np.asarray, and hopefully get void right, har
seberg May 5, 2020
9e03d8d
First version mainly working
seberg May 6, 2020
efbe979
Further fixes, make max-dims reached more logical and enter obj arrays
seberg May 6, 2020
a552d2a
TST: Small test adjustments
seberg May 8, 2020
2cfcf56
WIP: Seems pretty good, but needs cleaning up...
seberg May 8, 2020
302813c
Smaller cleanups, better errors mainly?
seberg May 8, 2020
cec10fb
Fixup for scalar kind, and ensure OBJECT is special for assignment
seberg May 9, 2020
1eaca02
Use PyArray_Pack in a few other places
seberg May 10, 2020
3f5e4a2
Some micro-optimization tries (should probably be largely reverted)
seberg May 12, 2020
1896813
Optimize away filling all dims with -1 at the start
seberg May 12, 2020
c7e7dd9
Other smallre changes, some optimization related.
seberg May 12, 2020
60fa9b9
Small bug fixup and rebase on master
seberg May 28, 2020
e20dded
Fixups/comments for compiler warnings
seberg May 28, 2020
4e0029d
update some comments, remove outdated old code path
seberg May 28, 2020
ad31a32
Small fixups/comment changes
seberg May 29, 2020
ca09045
BUG: Make static declaration safe (may be an issue on msvc mostly)
seberg May 29, 2020
9ceeb97
Replace AdaptFlexibleDType with object and delete some datetime thing…
seberg May 30, 2020
4a04e89
Add somewhat disgusting hacks for datetime support
seberg Jun 1, 2020
08a4687
MAINT: Remove use of PyArray_GetParamsFromObject from PyArray_CopyObject
seberg Jun 3, 2020
a1ee25a
MAINT: Delete legacy dtype discovery
seberg Jun 4, 2020
1405a30
Allow returning NULL for dtype when there is no object to discover from
seberg Jun 4, 2020
a7c5a59
BUG: Smaller fixes in object-array parametric discovery
seberg Jun 10, 2020
75a728f
BUG: remove incorrect assert
seberg Jun 10, 2020
b09217c
BUG: When filling an array from the cache, store original for objects
seberg Jun 11, 2020
b28b2a1
BUG: Fix discovery for empty lists
seberg Jun 11, 2020
7a343c6
BUG: Add missing DECREF
seberg Jun 13, 2020
7d1489a
Fixups: Some smaller fixups and comments to ensure we have tests
seberg Jun 15, 2020
946edc8
BUG: Add missing error check
seberg Jun 15, 2020
002fa2f
BUG: Reorder dimension fix/check and promotion
seberg Jun 16, 2020
29f1515
BUG: Add missing cache free...
seberg Jun 16, 2020
ba0a6d0
BUG: Fixup for PyArray_Pack
seberg Jun 16, 2020
b3544a1
BUG: Fix use after free in PyArray_CopyObject
seberg Jun 16, 2020
bcd3320
BUG: Need to set the base field apparently and swap promotion
seberg Jun 16, 2020
454d785
MAINT: Use flag to indicate that dtype discovery is not necessary
seberg Jun 16, 2020
68cd028
MAINT: Fixups (some based on new tests), almost finished
seberg Jun 16, 2020
1035c3f
MAINT: Use macros/functions instead of direct slot access
seberg Jun 16, 2020
e30cbfb
MAINT: Delete PyArray_AssignFromSequence
seberg Jun 18, 2020
56c63d8
MAINT: Undo change of how 0-D array-likes are handled as scalars
seberg Jun 18, 2020
605588c
MAINT: Undo some header changes...
seberg Jun 18, 2020
4eb9cfd
MAINT: Try to clean up headers a bit
seberg Jun 18, 2020
4ac514f
TST: Add test for too-deep non-object deprecation
seberg Jun 18, 2020
8a7f0e6
MAINt: Add assert for an unreachable exception path
seberg Jun 18, 2020
7012ef7
TST: Adapt coercion-tests to the new situation
seberg Jun 19, 2020
3ccf696
DOC: Add release notes for array-coercion changes
seberg Jun 19, 2020
6ff4d48
MAINT: Remove weakref from mapping (for now) and rename
seberg Jun 24, 2020
e3f091e
Update numpy/core/src/multiarray/array_coercion.c
seberg Jun 25, 2020
4fe0ad2
MAINT: Put a hack in place to allow datetime64 -> string assignment w…
seberg Jun 25, 2020
d39953c
Update doc/release/upcoming_changes/16200.compatibility.rst
seberg Jun 25, 2020
b36750b
TST: datetime64 test_scalar_coercion does not fail anymore
seberg Jun 25, 2020
0f78129
Update doc/release/upcoming_changes/16200.compatibility.rst
mattip Jun 30, 2020
aee13e0
DOC,STY: Use bitshift intsead of powers of two and fix comments
seberg Jun 30, 2020
22ee971
TST: Add test for astype to stringlength tests
seberg Jul 8, 2020
WIP: Seems pretty good, but needs cleaning up...
seberg committed Jul 8, 2020
commit 2cfcf56cd1f9465871e355620465b0b734f1f56c
3 changes: 1 addition & 2 deletions numpy/core/include/numpy/ndarraytypes.h
@@ -1550,8 +1550,7 @@ PyArray_GETITEM(const PyArrayObject *arr, const char *itemptr)
static NPY_INLINE int
PyArray_SETITEM(PyArrayObject *arr, char *itemptr, PyObject *v)
{
return ((PyArrayObject_fields *)arr)->descr->f->setitem(
v, itemptr, arr);
return ((PyArrayObject_fields *)arr)->descr->f->setitem(v, itemptr, arr);
}

#else
22 changes: 0 additions & 22 deletions numpy/core/src/multiarray/abstractdtypes.c
@@ -63,25 +63,6 @@ discover_descriptor_from_pycomplex(
}


static int
unicode_is_known_scalar(PyArray_DTypeMeta *NPY_UNUSED(cls), PyObject *obj)
{
return PyUnicode_CheckExact(obj);
}

static int
bytes_is_known_scalar(PyArray_DTypeMeta *NPY_UNUSED(cls), PyObject *obj)
{
return PyBytes_CheckExact(obj);
}

static int
bool_is_known_scalar(PyArray_DTypeMeta *NPY_UNUSED(cls), PyObject *obj)
{
return PyBool_Check(obj);
}


NPY_NO_EXPORT int
initialize_abstract_dtypes_and_map_others()
{
@@ -121,18 +102,15 @@ initialize_abstract_dtypes_and_map_others()
*/
PyArray_DTypeMeta *dtype;
dtype = NPY_DTYPE(PyArray_DescrFromType(NPY_UNICODE));
dtype->is_known_scalar = unicode_is_known_scalar;
if (_PyArray_MapPyTypeToDType(dtype, &PyUnicode_Type, NPY_FALSE) < 0) {
return -1;
}

dtype = NPY_DTYPE(PyArray_DescrFromType(NPY_STRING));
dtype->is_known_scalar = bytes_is_known_scalar;
if (_PyArray_MapPyTypeToDType(dtype, &PyBytes_Type, NPY_FALSE) < 0) {
return -1;
}
dtype = NPY_DTYPE(PyArray_DescrFromType(NPY_BOOL));
dtype->is_known_scalar = bool_is_known_scalar;
if (_PyArray_MapPyTypeToDType(dtype, &PyBool_Type, NPY_FALSE) < 0) {
return -1;
}
113 changes: 105 additions & 8 deletions numpy/core/src/multiarray/array_coercion.c
@@ -2,9 +2,11 @@
#define _UMATHMODULE
#define _MULTIARRAYMODULE

#include "Python.h"

#include "lowlevel_strided_loops.h"
#include "numpy/arrayobject.h"

#include "Python.h"
#include "descriptor.h"
#include "convert_datatype.h"
#include "dtypemeta.h"
@@ -249,8 +251,8 @@ discover_dtype_from_pyobject(
* 4. NULL in case of an error.
*/
if ((Py_TYPE(obj) == fixed_DType->scalar_type) ||
(fixed_DType->is_known_scalar != NULL &&
fixed_DType->is_known_scalar(fixed_DType, obj))) {
(fixed_DType->is_known_scalar != NULL &&
fixed_DType->is_known_scalar(fixed_DType, obj))) {
/*
* There are some corner cases, where we want to make sure a
* sequence is considered a scalar. In particular tuples with
@@ -270,7 +272,7 @@
/*
* At this point we have not found a clear mapping, but mainly for
* backward compatibility we have to make some further attempts at
* interpreting the input correctly.
* interpreting the input as a known scalar type.
*/
PyArray_Descr *legacy_descr;
if (PyArray_IsScalar(obj, Generic)) {
@@ -290,7 +292,7 @@
}

if (legacy_descr != NULL) {
DType = (PyArray_DTypeMeta *)Py_TYPE(legacy_descr);
DType = NPY_DTYPE(legacy_descr);
Py_INCREF(DType);
Py_DECREF(legacy_descr);
// TODO: Do not add new warning for now...
@@ -440,6 +442,77 @@ find_scalar_descriptor(
}


/**
* Assign a single element in an array from a python value
*
* @param descr
* @param item
* @param value
* @return 0 on success -1 on failure.
*/
NPY_NO_EXPORT int
PyArray_Pack(PyArray_Descr *descr, char *item, PyObject *value)
Member:

I am quite sure this is not a very important use case, but it doesn't look like it handles cases like

x[0] = np.datetime64("2020-01-01") 

where x is a string array.

seberg (Member, Author):

Hmm, can you elaborate on this (or maybe I just need to step through the code)? This should hit this code with the changes; if it does not, I missed a case where dtype->setitem should be replaced with PyArray_Pack(), since that's now the defined way to put a scalar into an array (I find it pretty fun that we never had a truly correct way to do it).

I left out a few cases on purpose; they seemed either corner cases or have well-defined input, but this should not be one of them.

Member:

After this code change:

>>> x = np.array(["qwdqwdqwd", "wqdqwd"])
>>> x[0] = np.datetime64("2020-01-01")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: The string provided for NumPy ISO datetime formatting was too short, with length 9

on numpy 1.18.5

>>> import numpy as np
>>> x = np.array(["qwdqwdqwd", "wqdqwd"])
>>> x[0] = np.datetime64("2020-01-01")
>>> x
array(['2020-01-0', 'wqdqwd'], dtype='<U9')

Member:

Aah, I didn't really pay attention to the error message. I guess this is considered one of the fixed bugs of this PR!?

seberg (Member, Author):

The result is now consistent with `np.datetime64("2020-01-01").astype("U5")`; whether the casting behaviour is ideal is another question...

I guess this is reversed to the float64(NaN) case though, because it's actually more strict now, hmmmm. I could fix that by defining that a datetime64 is a type known to string (and thus handled by string->setitem, which calls str(...) itself). Or just do that for all NumPy scalar -> string conversions, since at least the __str__ dunder is pretty well defined...

seberg (Member, Author):

Oh, and I have to relax it for np.array(np.datetime64("2020-01-01"), "U9") as well...

Member:

The result is now consistent with `np.datetime64("2020-01-01").astype("U5")`; whether the casting behaviour is ideal is another question...

aah got it.

I could fix that by defining that a datetime64 is a type known to string (and thus handled by string->setitem, which calls str(...) itself).

Sorry, not familiar with all the code here. How will this change look on the PyArray_Pack side? Currently it calls DATETIME_setitem in this line

      if (tmp_descr->f->setitem(value, data, &arr_fields) < 0) {

will this change somehow?

Or just do that for all NumPy scalar -> string conversions, since at least the __str__ dunder is pretty well defined...

This sounds reasonable to me. So np.datetime64("2020-01-01").astype("U5") will also not error now, right? Not sure about the backward compatibility and deprecation cycle though, since this is a behavior change.

seberg (Member, Author), Jun 25, 2020:

I just pushed that change. I am starting to be unwilling to squabble about whether or not this is small enough to just document as a change. Normally I would say deprecate, but I am starting to wear down; there are too many of these tiny things making it hard to focus on the big picture. It is super difficult to fix the big picture while retaining behaviour that was never well defined :/.

(not to say it isn't important to undo it for now to be on the safe side)

Sorry, and as the explanation: for PyArray_Pack() the change in the code path is that it will hit:

String->is_known_scalar_type(datetime64)

which, unlike before, now returns True, so it calls String->setitem(datetime64) instead of using

Datetime->setitem(...).astype("S6")

internally (i.e. using the Datetime setitem to convert the scalar into an "array", so to say).

anirudh2290 (Member), Jun 25, 2020:

Thanks for the explanation, this sounds good to me, and agreed we should get rid of the inconsistency with np.datetime64("2020-01-01").astype("U5") in the future.
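To make the outcome of this thread concrete, here is a small Python check of the post-PR semantics: a datetime64 scalar assigned into a string array goes through the string dtype's setitem (datetime64 is now "known" to the string dtype), so it behaves like str(scalar) followed by the usual string cast. The array contents are taken from the examples above, with a wide-enough string dtype so the result does not depend on truncation rules:

```python
import numpy as np

# '<U10' array: wide enough to hold the full ISO date string.
x = np.array(["qwdqwdqwdq", "wqdqwd"])
x[0] = np.datetime64("2020-01-01")
print(x[0])   # 2020-01-01

# Whether a too-short string dtype truncates or errors has shifted across
# NumPy versions (see the discussion above), so it is not shown here.
```

With the change discussed above, this matches `np.datetime64("2020-01-01").astype(x.dtype)` rather than erroring mid-conversion.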

{
static PyArrayObject_fields arr_fields = {
.ob_base.ob_refcnt = 1,
.ob_base.ob_type = &PyArrayDescr_Type,
.flags = NPY_ARRAY_BEHAVED,
};
enum _dtype_discovery_flags flags = GAVE_SUBCLASS_WARNING;
PyArray_DTypeMeta *DType = discover_dtype_from_pyobject(
value, &flags, NPY_DTYPE(descr));

if (DType == NPY_DTYPE(descr) || DType == (PyArray_DTypeMeta *)Py_None ||
DType == NULL) {
Py_XDECREF(DType);
/* We can set the element directly (luckily) */
arr_fields.descr = descr;
return descr->f->setitem(value, item, &arr_fields);
}
PyArray_Descr *tmp_descr;
tmp_descr = DType->discover_descr_from_pyobject(DType, value);
Py_DECREF(DType);
if (tmp_descr == NULL) {
return -1;
}

char *data = PyObject_Malloc(tmp_descr->elsize);
if (data == NULL) {
PyErr_NoMemory();
Py_DECREF(tmp_descr);
return -1;
}
if (PyDataType_FLAGCHK(tmp_descr, NPY_NEEDS_INIT)) {
memset(data, 0, tmp_descr->elsize);
}
arr_fields.descr = tmp_descr;
if (tmp_descr->f->setitem(value, data, &arr_fields) < 0) {
PyObject_Free(data);
Py_DECREF(tmp_descr);
return -1;
}

int needs_api = 0;
PyArray_StridedUnaryOp *stransfer;
NpyAuxData *transferdata;
if (PyArray_GetDTypeTransferFunction(
0, 0, 0, tmp_descr, descr, 1, &stransfer, &transferdata,
&needs_api) < 0) {
PyObject_Free(data);
Py_DECREF(tmp_descr);
return -1;
}
stransfer(item, 0, data, 0, 1, tmp_descr->elsize, transferdata);
NPY_AUXDATA_FREE(transferdata);
PyObject_Free(data);
Py_DECREF(tmp_descr);
if (PyErr_Occurred()) {
return -1;
}
return 0;
}
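As a rough, hypothetical Python model of the two paths in PyArray_Pack above: when the discovered dtype already matches the target, the value is stored directly (the setitem fast path); otherwise it is first packed into a temporary buffer of its own dtype and then cast into the target (the setitem-then-transfer fallback). `pack_sketch` and its dtype-discovery stand-in are illustrative names, not real NumPy APIs:

```python
import numpy as np

def pack_sketch(arr, index, value):
    """Model of PyArray_Pack: direct setitem vs. pack-then-cast fallback."""
    target = arr.dtype
    discovered = np.asarray(value).dtype  # stand-in for dtype discovery
    if discovered == target:
        arr[index] = value                    # direct setitem path
    else:
        tmp = np.empty((), dtype=discovered)  # temporary 0-d buffer
        tmp[()] = value                       # tmp_descr->f->setitem
        arr[index] = tmp.astype(target)[()]   # strided transfer / cast

a = np.zeros(3)          # float64 target
pack_sketch(a, 0, 7)     # int value takes the cast fallback path
print(a[0])              # 7.0
```

The real C code additionally handles reference counting, NPY_NEEDS_INIT zeroing of the buffer, and error propagation from the transfer function.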


static int
update_shape(int curr_ndim, int *max_ndim,
npy_intp out_shape[NPY_MAXDIMS], int new_ndim,
@@ -491,7 +564,7 @@
NPY_NO_EXPORT int
npy_new_coercion_cache(
PyObject *converted_obj, PyObject *arr_or_sequence, npy_bool sequence,
coercion_cache_obj ***next_ptr)
coercion_cache_obj ***next_ptr, int ndim)
{
coercion_cache_obj *cache = PyArray_malloc(sizeof(coercion_cache_obj));
if (cache == NULL) {
@@ -502,6 +575,7 @@
Py_INCREF(arr_or_sequence);
cache->arr_or_sequence = arr_or_sequence;
cache->sequence = sequence;
cache->depth = ndim;
cache->next = NULL;
**next_ptr = cache;
*next_ptr = &(cache->next);
@@ -676,7 +750,7 @@ PyArray_DiscoverDTypeAndShape_Recursive(
* This is an array object which will be added to the cache, keeps
* the a reference to the array alive.
*/
if (npy_new_coercion_cache(obj, (PyObject *)arr, 0, coercion_cache_tail_ptr) < 0) {
if (npy_new_coercion_cache(obj, (PyObject *)arr, 0, coercion_cache_tail_ptr, curr_dims) < 0) {
Py_DECREF(arr);
return -1;
}
@@ -789,7 +863,7 @@ PyArray_DiscoverDTypeAndShape_Recursive(
}
return -1;
}
if (npy_new_coercion_cache(obj, seq, 1, coercion_cache_tail_ptr) < 0) {
if (npy_new_coercion_cache(obj, seq, 1, coercion_cache_tail_ptr, curr_dims) < 0) {
Py_DECREF(seq);
return -1;
}
@@ -883,6 +957,7 @@ PyArray_DiscoverDTypeAndShape(
PyArray_Descr **out_descr)
{
*out_descr = NULL;
coercion_cache_obj **original_coercion_cache_ptr = coercion_cache;
*coercion_cache = NULL;
for (int i = 0; i < max_dims; i++) {
out_shape[i] = -1;
@@ -953,6 +1028,28 @@
"setting an array element with a sequence");
goto fail;
}

/*
* If the array is ragged, the cache may be too deep so prune it.
* The cache is left at the same depth as the array though.
*/
coercion_cache_obj **next_ptr = original_coercion_cache_ptr;
coercion_cache_obj *current = *original_coercion_cache_ptr; /* item to check */
while (current != NULL) {
if (current->depth > ndim) {
/* delete "next" cache item and advanced it (unlike later) */
Py_DECREF(current->arr_or_sequence);
coercion_cache_obj *_old = current;
current = current->next;
PyArray_free(_old);
continue;
}
/* advance both prev and next, and set prev->next to new item */
*next_ptr = current;
next_ptr = &(current->next);
current = current->next;
}
*next_ptr = NULL;
}
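The pruning loop above can be sketched in Python as a linked-list filter: walk the cache, unlink every node whose depth exceeds the final ndim, and relink the survivors in order. `CacheNode` and `prune_cache` are hypothetical names, not NumPy code:

```python
class CacheNode:
    """Minimal model of coercion_cache_obj: payload, depth, next pointer."""
    def __init__(self, obj, depth):
        self.obj = obj
        self.depth = depth
        self.next = None

def prune_cache(head, ndim):
    # 'slot' plays the role of next_ptr: where the next kept node is linked.
    dummy = CacheNode(None, -1)
    dummy.next = head
    slot = dummy
    current = head
    while current is not None:
        if current.depth > ndim:
            current = current.next   # drop this node (C code also DECREFs/frees)
            continue
        slot.next = current          # keep: relink, advance both pointers
        slot = current
        current = current.next
    slot.next = None                 # terminate the pruned list
    return dummy.next
```

As in the C version, nodes deeper than the ragged-array cutoff are discarded while the relative order of the remaining cache entries is preserved.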
/* We could check here for max-ndims being reached as well */

6 changes: 5 additions & 1 deletion numpy/core/src/multiarray/array_coercion.h
@@ -12,13 +12,17 @@ typedef struct coercion_cache_obj {
PyObject *arr_or_sequence;
struct coercion_cache_obj *next;
npy_bool sequence;
int depth; /* the dimension at which this object was found. */
} coercion_cache_obj;


NPY_NO_EXPORT int
_PyArray_MapPyTypeToDType(
PyArray_DTypeMeta *DType, PyTypeObject *pytype, npy_bool userdef);

NPY_NO_EXPORT int
PyArray_Pack(PyArray_Descr *descr, char *item, PyObject *value);

NPY_NO_EXPORT int
PyArray_DiscoverDTypeAndShape(
PyObject *obj, int max_dims,
@@ -39,7 +43,7 @@
/* Create a new cache object */
NPY_NO_EXPORT int npy_new_coercion_cache(
PyObject *converted_obj, PyObject *arr_or_sequence, npy_bool sequence,
coercion_cache_obj ***next_ptr);
coercion_cache_obj ***next_ptr, int ndim);


/* Frees the coercion cache object. */
4 changes: 2 additions & 2 deletions numpy/core/src/multiarray/arraytypes.c.src
@@ -145,7 +145,7 @@ MyPyLong_AsUnsigned@Type@ (PyObject *obj)
return -1;
}
ret = PyLong_AsUnsigned@Type@(num);
if (PyErr_Occurred()) {
if (error_converting(ret)) {
PyErr_Clear();
ret = PyLong_As@Type@(num);
}
@@ -219,7 +219,7 @@ static int
else {
temp = (@type@)@func2@(op);
}
if (PyErr_Occurred()) {
if (error_converting(temp)) {
PyObject *type, *value, *traceback;
PyErr_Fetch(&type, &value, &traceback);
if (PySequence_NoString_Check(op)) {