8000 DEP: Deprecate promotion to strings by seberg · Pull Request #15925 · numpy/numpy · GitHub

DEP: Deprecate promotion to strings #15925


Closed
wants to merge 2 commits

Conversation

seberg
Member
@seberg seberg commented Apr 7, 2020

This may need a bit of discussion, and unfortunately I think there was a reason why I did not open the PR before, but I do not remember what it was. So there might be a hole in the PR somewhere (as of now it mainly needs deprecation tests, although the existing tests probably effectively cover most things).

The basic idea is that np.promote_types() and thus also np.result_type() of any datatype with a string should be deprecated. It is possible to cast an integer to a string (whether that is safe is up for discussion), but the two do not have a reasonable "common type" operation. It is not obvious that concatenating a string array and an integer array should produce a string array; it could be an object array, but object arrays have enough quirks that I would never consider promoting to them by default a good idea (except when "object" is already involved).
Maybe you could talk about allowing that with an np.allow_object singleton or kwarg, in which case, when promotion returns an error, you just use object instead.

It is possible that my hesitation was the fact that an np.allow_object singleton might be necessary for downstream compatibility.

Closes gh-15897


EDIT: Ah, there is a bit of a mess here with the type resolution errors that I had not cleaned up.

@seberg
Member Author
seberg commented Apr 7, 2020

To be a bit more clear as to what this does, the main thing is that code such as:

np.promote_types("int64", "S")
np.array(["asdf", 12.0])
np.concatenate([np.array([1]), np.array(["string"])])

will all be deprecated, forcing you, for example, to write np.array(["asdf", 12.0], dtype="S").
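To make the intended replacement concrete, here is a small sketch of the two explicit spellings that avoid the ambiguous promotion (the "U32" width and the values are arbitrary choices for this illustration):

```python
import numpy as np

# The explicit-dtype spelling the deprecation pushes users toward.
# dtype="U32" is an arbitrary width chosen for this demo.
explicit = np.array(["asdf", 12.0], dtype="U32")
print(explicit.dtype.kind)   # 'U': the float was formatted as text on purpose

# Opting in to a heterogeneous result instead keeps the float as a float.
hetero = np.array(["asdf", 12.0], dtype=object)
print(type(hetero[1]))
```

Either way the user states up front whether they want a text array or a heterogeneous one, instead of NumPy guessing.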

@seberg
Member Author
seberg commented Apr 7, 2020

Ah, one problem is that the "Default Type Resolver" will not honor this, since it uses safe-casting and not type promotion. But that is probably a non-issue since in practice we simply do not have string loops; it is just an example of how safe-promotion is IMO not a future-proof way to handle things.

@seberg seberg force-pushed the deprecate-promotion-to-string branch from e8da69f to 85178c6 Compare April 7, 2020 18:27
@seberg seberg force-pushed the deprecate-promotion-to-string branch from 85178c6 to c39cbb0 Compare April 7, 2020 20:15
@seberg
Member Author
seberg commented Apr 7, 2020

I came back to it because I would prefer to fix this (I think it actually simplifies things to have this for a new DType system), but it is a bit of a rabbit hole. It is probably converging, but there are two things to figure out:

  1. That it does not break the world (e.g. pandas)
  2. Unless we also enforce that dtype=object or dtype=np.allow_object is necessary, there is a clear inconsistency: np.array([1, "string", None]) will be deprecated, while np.array([None, 1, "string"]) (object first) will not fail.

So, to be honest, this may only make sense if we are willing to do the full blown object deprecation.
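The order-dependence in point 2 can be illustrated with a toy sketch of pairwise promotion with an object fallback (this is not NumPy's actual implementation, just a model of the behavior being discussed):

```python
import numpy as np

# Toy model: find a common dtype by promoting pairwise, falling back to
# object on failure. Once object enters, everything after it is absorbed
# without any further check -- which is why element order matters.
def toy_common_dtype(values):
    common = None
    for v in values:
        dt = np.dtype(object) if v is None else np.asarray(v).dtype
        if common is None:
            common = dt
        elif np.dtype(object) in (common, dt):
            common = np.dtype(object)      # object absorbs everything
        else:
            try:
                common = np.promote_types(common, dt)
            except TypeError:
                common = np.dtype(object)  # the fallback under discussion
    return common
```

With object first (e.g. from a leading None), the int and string elements are never checked against each other at all, so deprecating only the promotion step leaves the two orderings behaving differently.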

@seberg seberg added the triage review Issue/PR to be discussed at the next triage meeting label Apr 7, 2020
@@ -81,6 +82,12 @@ _array_find_python_scalar_type(PyObject *op)
* PyArray_DTypeFromObject encountered a string type, and that the recursive
* search must be restarted so that string representation lengths can be
* computed for all scalar types.
*
* DEPRECATED NumPy 19.0, 2020-03
Suggested change
* DEPRECATED NumPy 19.0, 2020-03
* DEPRECATED NumPy 1.19.0, 2020-03

@eric-wieser
Member

it could be an object array, but object arrays have enough quirks that I would not ever consider promoting to them by default a good idea (except when "object" is already involved).

I'm not 100% sold on this. A nice feature of object arrays is that users can pass a list of random things into a numpy function like np.random.choice and, ignoring nested sequences, will get one of their objects back out.

The main issues I can think of with object arrays are:

  • users are repeatedly confused by object arrays containing arrays
  • extracting a scalar from an object array throws away dtype information (fixed by leave_wrapped or similar)

Neither of which seems particularly problematic here.
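The "nd-list" round-trip described above can be made concrete (the particular items are arbitrary choices for this illustration):

```python
import numpy as np

# Arbitrary Python objects pass through a NumPy function once they are
# wrapped in an object array; one of the original objects comes back out.
items = np.array([{"a": 1}, (2, 3), "text"], dtype=object)
picked = np.random.choice(items)
print(type(picked))   # one of: dict, tuple, str
```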

@seberg
Member Author
seberg commented Apr 8, 2020

True, the question is really if np.concatenate([string_array, integer_array]) should be an object array. The other question is whether we generally return object if there is a promotion error during array coercion. Those are two different issues and I am OK with the array coercion part, but not really with the concatenation part I guess.

We currently have an annoying try/except in our code which was giving me a bit of headache. But the fix is probably that this try/except is at the wrong place: It should be around the promotion itself only (plus around trying to convert to a sequence).

Arrays containing arrays are an issue, but it is for the most part the same one as the ragged-array issue.

So right, I would be happy to keep the try/except around the promotion (effectively going to object for coercion), but feel fairly strongly that np.result_type("S5", "int64") must be an error. In other words: we should do this deprecation, but within np.array(...) it would instead be a FutureWarning.
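A minimal sketch of that placement (a hypothetical helper, not the actual C code): promotion itself raises, and only the coercion-side path catches the failure and substitutes object:

```python
import numpy as np

# Hypothetical coercion-side helper: np.promote_types raises on failure,
# and only here -- during array coercion -- is that turned into object.
def coercion_common_dtype(dtypes):
    common = dtypes[0]
    for dt in dtypes[1:]:
        try:
            common = np.promote_types(common, dt)
        except TypeError:
            return np.dtype(object)  # fallback lives only in coercion
    return common
```

Under this split, np.result_type and np.concatenate keep erroring on incompatible dtypes, while np.array([...]) can still coerce mixed input to object.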

@seberg
Member Author
seberg commented Apr 8, 2020

Anyway, that sounds like a good idea (going to object for array coercion), and moving the try/except block, that should be doable, so I think I will explore that.

@eric-wieser
Member
eric-wieser commented Apr 8, 2020

True, the question is really if np.concatenate([string_array, integer_array]) should be an object array.

For users who only care about ndarray as an nd-list (and to whom dtypes are just a convenience for more efficient storage), then returning an object array is the answer that gets in their way the least. Perhaps that type of user is uncommon though.

Can you think of a gotcha case where concatenating to object there is likely to cause bugs or confusion?

@seberg
Member Author
seberg commented Apr 8, 2020

Right, but it's also an incorrect answer, for example:

common_dtype = np.result_type(arr1, arr2)

arr1_cdt = arr1.astype(common_dtype)
arr2_cdt = arr2.astype(common_dtype)

eq1 = arr1 == arr2
eq2 = arr1_cdt == arr2_cdt
np.array_equal(eq1, eq2)  # Not guaranteed to be true!

Edit: similarly for the concatenate result, where that part behaves differently from the previous part.
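A concrete instance of this trap, with the common dtype picked by hand ("U21" is just a wide-enough width chosen for this demo, so the example does not depend on promotion behavior):

```python
import numpy as np

# As Python scalars, 1 == "1" is False; after casting both operands to a
# common string dtype, both become "1" and the comparison flips to True.
arr1 = np.array([1])
arr2 = np.array(["1"])
common = np.dtype("U21")  # hand-picked wide string dtype for this demo

eq_cast = arr1.astype(common) == arr2.astype(common)
print(eq_cast)  # the cast changed the answer of the comparison
```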

@seberg
Member Author
seberg commented Apr 8, 2020

Oops, I shouldn't type while chatting; I was still thinking of the current strings... Those of course would produce the correct/same answer in most cases, although certain functions would not work or may not work the same (e.g. np.isnan, arr.imag). I guess I am not quite sold on the idea that silently producing object arrays is actually very useful, and it seems to me that they potentially create traps.

@seberg
Member Author
seberg commented Apr 8, 2020

Ok, I suppose you could argue that the object ufunc should work on an array containing the scalar object (albeit possibly very slowly), and that the fact that some of them do not is simply a current shortcoming. arr.imag is a bit of an outlier, because it returns a view.

Maybe I mainly dislike the fact that object is a heterogeneously typed array while any other NumPy array is homogeneously typed. I cannot currently find a major argument against defining this by a cast to object, except for the general quirks that object arrays tend to have.

The main argument is maybe that it is not necessarily a good generic fallback. It is in theory possible for two types that currently have no common dtype to have one added later on. But I am not sure that disallowing the common-type/promotion actually helps with such a transition.

@seberg seberg removed the triage review Issue/PR to be discussed at the next triage meeting label Apr 22, 2020
@seberg
Member Author
seberg commented Apr 22, 2020

I will probably bring this up on the list some time (so expect pretty much a copy of this there); I would like to reach a consensus on the long-term plan here. Independent of what np.array([...]) does, mainly also on whether np.concatenate([...]) or np.result_type() should ever promote to object.
np.array() is only really important because, if we want to remove the general "object" fallback, we would end up changing it twice.

Falling back to object technically should give the correct result, but possibly in a slow and confusing way. The main danger is code like:

np.multiply(string_arr, int_arr)

returning an object array. In many ways the result will be correct, but it is just slightly off due to the unexpected dtype. This might mainly happen if there is a mismatch between the ufunc and the scalar behaviour, but I do not think the issue is limited to that. And even if it is just a mismatch, an error is better than a potential future subtle change when fixing the result dtype.
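To see what a silent object fallback would produce, here is what the object loop already does today when both inputs are explicitly object arrays; a fallback would give this result without the explicit dtype=object opt-in:

```python
import numpy as np

# With object inputs, np.multiply applies Python's `*` element-wise,
# so string * int means repetition rather than an error.
s = np.array(["ab"], dtype=object)
n = np.array([3], dtype=object)
out = np.multiply(s, n)
print(out, out.dtype)  # repetition result, in an object array
```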

We currently explicitly disallow a fallback to object loops in ufuncs and generally do not have any in type promotion to object (e.g. np.result_type) which also makes this an error:

datetime_arr = np.array(["2019-05-05 12:03"], dtype="M8")
int_arr = np.array([1])
np.concatenate([datetime_arr, int_arr])
# TypeError: invalid type promotion

Which arguably should also fall back to object if string and int do.

Member
@anirudh2290 anirudh2290 left a comment


Although I may be missing some context, overall I agree that np.array([..]) with string and int dtypes should error unless a dtype is explicitly provided. I also checked the pandas repo tests, and they seem to expect a failure for the str + np.array operation in pandas/tests/arithmetic/test_object.py.
There are places in the tests, though, which do things like np.array([2, "foo"]) and may fail after this change.

EDIT: It won't fail since it's a deprecation, but it will emit a DeprecationWarning.

* since it would give a spurious DeprecationWarning.
*/
PyErr_Format(PyExc_TypeError,
"No loop matching the specified signature and "
Member


Making this consistent with the other ufunc-loop error message may avoid changes on the pandas side.

Member Author


Thanks; now that you mention it, I actually tried to see what happens to pandas, and fixing this a bit was part of that. It produced a lot of warnings, though, and I wasn't sure how to turn them into errors quickly during the run to see whether or not they actually matter.

Fixing tests like np.array([2, "foo"]) does not matter; they are quick and easy to fix. But if it is relied on somewhere deeper, we may have to see.

Member


It produced a lot of warnings, though, and I wasn't sure how to turn them into errors quickly during the run to see whether or not they actually matter.

I just monkey patched your code to return -1 irrespective of deprecate call result 😅

@mattip
Member
mattip commented Nov 12, 2020

@seberg is this meant to be part of the ufunc reworking for 1.20 or do we let it slide to a future release?

@seberg
Member Author
seberg commented Nov 12, 2020

Let it slide. It's important on its own IMO, and it would have helped if we had it in 1.19, but at this point it doesn't really matter. Marking this as draft, only to remind myself. This has to be redone in the new code and with a different ugly kludge.

@seberg seberg marked this pull request as draft November 12, 2020 18:09
@charris
Member
charris commented Dec 18, 2020

@seberg Is this still needed?

@seberg
Member Author
seberg commented Dec 18, 2020

Yeah, definitely... But I just started looking at this, and basically nothing from this PR is still useful, so I am closing it.

I probably have to make this a FutureWarning given that:

  1. It will raise an error in most cases.
  2. Array coercion uses try: return promote_types(a, b) except TypeError: return np.dtype(object). I could work around that, but it requires a special case somewhere, so let's see what everyone thinks.

@seberg seberg closed this Dec 18, 2020
Successfully merging this pull request may close these issues.

DEP: Disallow np.promote_types(string, other) -> string
5 participants