-
-
Notifications
You must be signed in to change notification settings - Fork 10.9k
ENH: change list-of-array to tuple-of-array returns (Numba compat) #25570
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes a lot of sense to me!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for looking into this! I would agree that we can probably get away with these as in many contexts tuple vs. list will hopefully not matter. And I do think many of the lists results seem just historically due to the result being created using list.append()
...
sub-arrays : list of ndarrays | ||
A list of sub-arrays as views into `ary`. | ||
sub-arrays : tuple of ndarrays | ||
A tuple of sub-arrays as views into `ary`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am OK with changing this. Although somehow to me the split functions would make very slightly more sense to keep it as a list, and I wonder if there is actually a gain for numba.
Maybe that is because the input seems very natural to be an arbitrary integer value (i.e. not a constant).
Thanks for writing the patch @rgommers and thanks for reviewing it @mhvk and @seberg! With regards to more details about why tuples are preferred as a return type over lists, it largely comes down to two things:
This issue is in part about type stability and making sure that the return type of a function can be derived entirely from the input types (not run time values!). A function that sometimes returns a one thing and sometimes another is fine so long as the return type can be statically resolved from the input types. Tuples are a good vehicle for promoting type stability during development as they are compile time constant. Returning a tuple forces implementors to consider what is going to go into that tuple at the construction site, whereas e.g. a list can just be declared and "collect" its contents dynamically throughout an implementation.
The second part of the issue isn't specifically that lists cannot be returned, it's more that Numba has no list implementation that can support heterogeneous data types. Numba's lists are designed to be homogeneous in data type for speed, aiding type stability (which helps type inference) and for ease of implementation. Contrary to this, Numba's tuple implementation (and other static-fixed-sized containers) do support heterogeneous data types, hence tuples are generally a more suitable return type for cases where a function returns "a heterogeneous collection of items". Looking specifically at an example where a NumPy function might currently return a list of arrays. As Numba's type system is quite fine grained with regards to the type representation of NumPy arrays, a NumPy function that returns a list of arrays is not necessarily returning something suitable for use in a homogeneous list even if all the items are arrays. For example, returning a list of
with only It is of course possible for Numba to just ignore the list return type prescribed by NumPy and make such functions return a tuple, however it would break Numba's aim of behavioural parity with NumPy. Essentially Numba attempts to have "JIT-transparency", which manifests as user code behaving the same way whether the JIT compiler is enabled or not, this to aid development workflow, debugging etc, as a result this not an option for implementations present in Numba. Ideal future-world side-note: It would be great if NumPy had type inference mechanism such that the output type of a function could be derived from its input types (ufuncs have gained some elements of this ability recently). Were such a system to exist, unit tests could enforce API type stability and NumPy re-implementations, like Numba, could just query the type inference mechanism to find out what a function is expected to return. Essentially, the NumPy type inference mechanism becomes the API/type source of truth and the NumPy implementation a reference implementation for that API. I hope this helps with the decision making process. If there are any questions please do ask. Also, if ever needed, |
…pat) Functions in NumPy that return lists of arrays are problematic for Numba. See numba/numba#8008 for a detailed discussion on that. This changes the return types to tuples, which are easier to support for Numba, because tuples are immutable. This change is not backwards-compatible. Estimated impact: - idiomatic end user code should continue to work unchanged, - code that attempts to append or otherwise mutate the list will start raising an exception, which should be easy to fix, - user code that does `if isinstance(..., list):` on the return value of a function like `atleast1d` will break. This should be rare, but since it may not result in a clean error it is probably the place with the highest impact. - user code with explicit `list[NDArray]` type annotations will need to be updated. [skip cirrus] [skip circle]
Also some small docstring improvements to make clearer what is returned.
6461197
to
02f6b7f
Compare
Thanks for the write-up @stuartarchibald, much appreciated!
That sounds a little like the "meta" backend for PyTorch. It was recently also asked about for the array API standard: data-apis/array-api#728. For that subset of functions it's maybe doable. Already a big lift probably though. For all of NumPy, it seems very hard to get there. I agree though that it'd be really nice to have.
It's still possible to get non-equal splits. For example: >>> res = np.split(np.arange(10).reshape((5,2)), (2, 4))
>>> res[0]
array([[0, 1],
[2, 3]])
>>> res[1]
array([[4, 5],
[6, 7]])
>>> res[2]
array([[8, 9]])
>>> res[0].flags['F_CONTIGUOUS']
False
>>> res[2].flags['F_CONTIGUOUS']
True I think doing it for all functions is easier. And then it's also easier to enforce the rule in the future (e.g., searching for I addressed all the existing comments and completed (I hope, identifying all functions that do this is a bit manual) the implementation. |
You are cherry-picking the wrong contiguity. All chunks are C-contiguous. The last one just happens to also be valid F-contiguous (because it is effectively 1-D). Unless the input is a runtime constant, you cannot infer the lengths of these chunks ahead of time (they depend both on the input shape and the splitting constants). So after rethinking it I really don't think splitting violates the typing guidelines at all, unless @stuartarchibald thinks otherwise? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found one left-over list in a docstring, but otherwise this looks all OK to me.
I think the consistency of getting a |
Fair, somehow it really is different for me (maybe because all others an unpacking seems obvious, not sure). EDIT: Maybe also because of More important comment though: All of the types have to be |
Yep, that's what I'd prefer - it's easy to reason about, easy to explain to users, and possible to write a test for. I don't reallly want to think about "shape is non-uniform and flags are not identical, but this case may be okay for Numba anyway".
Argh, good catch! I am signing off now, so will go over the whole thing carefully again later. Updating the type annotations is a little tricky, because you have to update both the stub file and the |
Also fix two remaining docstrings for list/tuple change [skip cirrus]
Okay, I went through everything again and I think this is happy now. Probably best to squash-merge. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, let's get this in.
RE ^, in the case described, were the return type a tuple of arrays, I think Numba would just end up "seeing" this as a >>> from numba import typeof
>>> import numpy as np
>>> typeof(tuple(np.split(np.arange(10).reshape((5, 2)), (2, 4))))
UniTuple(Array(int64, 2, 'C', False, aligned=True), 3) the runtime size/length/shape is not known to type inference; it is the dimensionality of the arrays that is of concern when determining the type. I think in the case of |
Seems a double edged sword, then? Numba following NumPy now could break some programs that work by relying on the list return (undefined length). Although, unlike things like I am not sure what that gives us: that keeping the list return is probably better for numba? |
I think that this indeed may be the case, on the condition that the list is going to comprise arrays that are all of the same type (and I think they have to be for
I agree with your remarks about I guess the intent of the original issue has been eroded a little by the passage of time and it's good to be revisiting it. Apologies for any confusion caused. |
Thanks for the input @stuartarchibald and @jakevdp!
I don't think there are other cases like this beyond |
Thanks @rgommers, apologies for causing the partial-revert, though I do think it provides a good outcome for Numba based on the (re)considerations above. Thanks again! |
This reverts a part of numpygh-25570. It turns out that the `split` functions are exceptional, and returning a tuple of arrays isn't better (perhaps worse) for Numba. For JAX it didn't matter. Given there's certainly no gain, we decided in the post-merge discussion on numpygh-25570 to revert the changes to the `split` APIs.
PR for that is gh-25800. |
Many thanks @rgommers. |
Functions in NumPy that return lists of arrays are problematic for Numba. See numba/numba#8008 for a detailed discussion on that. This changes the return types to tuples, which are easier to support for Numba, because tuples are immutable.
This change is not backwards-compatible. Estimated impact:
if isinstance(..., list):
on the return value of a function likeatleast1d
will break. This should be rare, but since it may not result in a clean error it is probably the place with the highest impact.list[NDArray]
type annotations will need to be updated.This is not a complete set of changes yet, but before spending more time on it I thought I'd open a PR for discussion. There may be a few more functions (they're not easy to automatically identify), and type annotations aren't fully updated yet. It should be enough to see what changes though.
This was discussed previously and put on the project board for 2.0. However, there is backwards compatibility impact from this, so we should only do this if the benefit for Numba is large enough. From numba/numba#8008 that does seem to be the case, but it's hard to judge. So I'm hoping @stuartarchibald and @seibert can jump in with more detail (not sure in what form - maybe things that start working in Numba code, or effort avoided to support lists, or ...).
Other thoughts:
nonzero
. It's probably random whether the author at the time choselist
ortuple
as the return typetuple
overlist
,tuple
was known, but it was forgotten about in one place:broadcast_arrays
return type is documented aslist
.I tried this branch on SciPy, and there were a lot of failures but all caused by the same line of code - the full diff to adapt to this change was: