8000 ENH: change list-of-array to tuple-of-array returns (Numba compat) by rgommers · Pull Request #25570 · numpy/numpy · GitHub
[go: up one dir, main page]

Skip to content

ENH: change list-of-array to tuple-of-array returns (Numba compat) #25570

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Jan 21, 2024

Conversation

rgommers
Copy link
Member

Functions in NumPy that return lists of arrays are problematic for Numba. See numba/numba#8008 for a detailed discussion on that. This changes the return types to tuples, which are easier to support for Numba, because tuples are immutable.

This change is not backwards-compatible. Estimated impact:

  • idiomatic end user code should continue to work unchanged,
  • code that attempts to append or otherwise mutate the list will start raising an exception, which should be easy to fix,
  • user code that does if isinstance(..., list): on the return value of a function like atleast1d will break. This should be rare, but since it may not result in a clean error it is probably the place with the highest impact.
  • user code with explicit list[NDArray] type annotations will need to be updated.

This is not a complete set of changes yet, but before spending more time on it I thought I'd open a PR for discussion. There may be a few more functions (they're not easy to automatically identify), and type annotations aren't fully updated yet. It should be enough to see what changes though.

This was discussed previously and put on the project board for 2.0. However, there is backwards compatibility impact from this, so we should only do this if the benefit for Numba is large enough. From numba/numba#8008 that does seem to be the case, but it's hard to judge. So I'm hoping @stuartarchibald and @seibert can jump in with more detail (not sure in what form - maybe things that start working in Numba code, or effort avoided to support lists, or ...).

Other thoughts:

  • NumPy already has functions that return tuple-of-arrays, e.g., nonzero. It's probably random whether the author at the time chose list or tuple as the return type
  • It'd be interesting to know if JAX (@jakevdp?) or another JIT compiler has similar implementation issues as Numba and would prefer tuple over list,
  • In the array API standard the preference for tuple was known, but it was forgotten about in one place: broadcast_arrays return type is documented as list.

I tried this branch on SciPy, and there were a lot of failures but all caused by the same line of code - the full diff to adapt to this change was:

--- a/scipy/stats/_distn_infrastructure.py
+++ b/scipy/stats/_distn_infrastructure.py
@@ -598,7 +598,7 @@ def argsreduce(cond, *args):
 
     # np.atleast_1d returns an array if only one argument, or a list of arrays
     # if more than one argument.
-    if not isinstance(newargs, list):
+    if not isinstance(newargs, (list, tuple)):
         newargs = [newargs, ]
 
     if np.all(cond):

@rgommers rgommers added 54 - Needs decision 62 - Python API Changes or additions to the Python API. Mailing list should usually be notified. Numpy 2.0 API Changes labels Jan 11, 2024
@rgommers rgommers marked this pull request as draft January 11, 2024 15:05
@seberg seberg added the 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes label Jan 11, 2024
Copy link
Contributor
@mhvk mhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes a lot of sense to me!

Copy link
Member
@seberg seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking into this! I would agree that we can probably get away with these as in many contexts tuple vs. list will hopefully not matter. And I do think many of the lists results seem just historically due to the result being created using list.append()...

sub-arrays : list of ndarrays
A list of sub-arrays as views into `ary`.
sub-arrays : tuple of ndarrays
A tuple of sub-arrays as views into `ary`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am OK with changing this. Although somehow to me the split functions would make very slightly more sense to keep it as a list, and I wonder if there is actually a gain for numba.

Maybe that is because the input seems very natural to be an arbitrary integer value (i.e. not a constant).

@stuartarchibald
Copy link
Contributor

Thanks for writing the patch @rgommers and thanks for reviewing it @mhvk and @seberg!

With regards to more details about why tuples are preferred as a return type over lists, it largely comes down to two things:

  1. Promoting type stability.

This issue is in part about type stability and making sure that the return type of a function can be derived entirely from the input types (not run time values!). A function that sometimes returns a one thing and sometimes another is fine so long as the return type can be statically resolved from the input types. Tuples are a good vehicle for promoting type stability during development as they are compile time constant. Returning a tuple forces implementors to consider what is going to go into that tuple at the construction site, whereas e.g. a list can just be declared and "collect" its contents dynamically throughout an implementation.

  1. Numba's dynamically sized container implementations are homogeneous in type.

The second part of the issue isn't specifically that lists cannot be returned, it's more that Numba has no list implementation that can support heterogeneous data types. Numba's lists are designed to be homogeneous in data type for speed, aiding type stability (which helps type inference) and for ease of implementation. Contrary to this, Numba's tuple implementation (and other static-fixed-sized containers) do support heterogeneous data types, hence tuples are generally a more suitable return type for cases where a function returns "a heterogeneous collection of items".

Looking specifically at an example where a NumPy function might currently return a list of arrays. As Numba's type system is quite fine grained with regards to the type representation of NumPy arrays, a NumPy function that returns a list of arrays is not necessarily returning something suitable for use in a homogeneous list even if all the items are arrays. For example, returning a list of [1d array, 2d array], would not be considered homogeneous, and neither would returning [2d C-order array, 2d F-order array]. In fact, Numba's type system array type specialises on:

  • dimensionality
  • dtype
  • order
  • aligned-ness
  • mutability

with only order being coercible, in the case of a mixed order, to the less performant A (for any) order. The result of all this is that it's quite likely that what seems like acceptable contents for a list to be returned (e.g. a number of arrays) is in practice often not something that Numba can support. This does open the possibility that if NumPy were to want to continue to return lists of arrays, if guarantees could be provided around them being homogeneous in type that could also work for Numba.

It is of course possible for Numba to just ignore the list return type prescribed by NumPy and make such functions return a tuple, however it would break Numba's aim of behavioural parity with NumPy. Essentially Numba attempts to have "JIT-transparency", which manifests as user code behaving the same way whether the JIT compiler is enabled or not, this to aid development workflow, debugging etc, as a result this not an option for implementations present in Numba.

Ideal future-world side-note: It would be great if NumPy had type inference mechanism such that the output type of a function could be derived from its input types (ufuncs have gained some elements of this ability recently). Were such a system to exist, unit tests could enforce API type stability and NumPy re-implementations, like Numba, could just query the type inference mechanism to find out what a function is expected to return. Essentially, the NumPy type inference mechanism becomes the API/type source of truth and the NumPy implementation a reference implementation for that API.

I hope this helps with the decision making process. If there are any questions please do ask. Also, if ever needed, numba.typeof is the function to use to find out what type in Numba's type system a Python object will have.

…pat)

Functions in NumPy that return lists of arrays are problematic for
Numba. See numba/numba#8008 for a detailed
discussion on that. This changes the return types to tuples, which
are easier to support for Numba, because tuples are immutable.

This change is not backwards-compatible. Estimated impact:

- idiomatic end user code should continue to work unchanged,
- code that attempts to append or otherwise mutate the list will start
  raising an exception, which should be easy to fix,
- user code that does `if isinstance(..., list):` on the return value of
  a function like `atleast1d` will break. This should be rare, but since
  it may not result in a clean error it is probably the place with the
  highest impact.
- user code with explicit `list[NDArray]` type annotations will need to
  be updated.

[skip cirrus] [skip circle]
Also some small docstring improvements to make clearer what
is returned.
@rgommers rgommers force-pushed the numba-compat-tuples branch from 6461197 to 02f6b7f Compare January 19, 2024 18:09
@rgommers
Copy link
Member Author

Thanks for the write-up @stuartarchibald, much appreciated!

Ideal future-world side-note: It would be great if NumPy had type inference mechanism such that the output type of a function could be derived from its input types (ufuncs have gained some elements of this ability recently). Were such a system to exist, unit tests could enforce API type stability and NumPy re-implementations, like Numba, could just query the type inference mechanism to find out what a function is expected to return. Essentially, the NumPy type inference mechanism becomes the API/type source of truth and the NumPy implementation a reference implementation for that API.

That sounds a little like the "meta" backend for PyTorch. It was recently also asked about for the array API standard: data-apis/array-api#728. For that subset of functions it's maybe doable. Already a big lift probably though. For all of NumPy, it seems very hard to get there. I agree though that it'd be really nice to have.

I am OK with changing this. Although somehow to me the split functions would make very slightly more sense to keep it as a list, and I wonder if there is actually a gain for numba.

It's still possible to get non-equal splits. For example:

>>> res = np.split(np.arange(10).reshape((5,2)), (2, 4))
>>> res[0]
array([[0, 1],
       [2, 3]])
>>> res[1]
array([[4, 5],
       [6, 7]])
>>> res[2]
array([[8, 9]])
>>> res[0].flags['F_CONTIGUOUS']
False
>>> res[2].flags['F_CONTIGUOUS']
True

I think doing it for all functions is easier. And then it's also easier to enforce the rule in the future (e.g., searching for list[npt.NDArray should be clean).

I addressed all the existing comments and completed (I hope, identifying all functions that do this is a bit manual) the implementation.

@rgommers rgommers added this to the 2.0.0 release milestone Jan 19, 2024
@rgommers rgommers marked this pull request as ready for review January 19, 2024 18:11
@rgommers rgommers changed the title WIP: MAINT: change list-of-array to tuple-of-array returns (Numba compat) ENH: change list-of-array to tuple-of-array returns (Numba compat) Jan 19, 2024
@rgommers rgommers removed 54 - Needs decision 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes 25 - WIP labels Jan 19, 2024
@seberg
Copy link
Member
seberg commented Jan 19, 2024

It's still possible to get non-equal splits.

You are cherry-picking the wrong contiguity. All chunks are C-contiguous. The last one just happens to also be valid F-contiguous (because it is effectively 1-D).

Unless the input is a runtime constant, you cannot infer the lengths of these chunks ahead of time (they depend both on the input shape and the splitting constants).

So after rethinking it I really don't think splitting violates the typing guidelines at all, unless @stuartarchibald thinks otherwise?

Copy link
Contributor
@mhvk mhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found one left-over list in a docstring, but otherwise this looks all OK to me.

@mhvk
Copy link
Contributor
mhvk commented Jan 19, 2024

I think the consistency of getting a tuple for any function that produces more than one array is in itself worth having, so I'd go with tuple also for split.

@seberg
Copy link
Member
seberg commented Jan 19, 2024

I think the consistency of getting a tuple for any function that produces more than one array is in itself worth having, so I'd go with tuple also for split.

Fair, somehow it really is different for me (maybe because all others an unpacking seems obvious, not sure). EDIT: Maybe also because of string.split() being lists.

More important comment though: All of the types have to be tuple[ndarray, ...], otherwise it is hardcoding a single item tuple!

@rgommers
Copy link
Member Author

I think the consistency of getting a tuple for any function that produces more than one array is in itself worth having, so I'd go with tuple also for split.

Yep, that's what I'd prefer - it's easy to reason about, easy to explain to users, and possible to write a test for. I don't reallly want to think about "shape is non-uniform and flags are not identical, but this case may be okay for Numba anyway".

All of the types have to be tuple[ndarray, ...], otherwise it is hardcoding a single item tuple!

Argh, good catch! I am signing off now, so will go over the whole thing carefully again later. Updating the type annotations is a little tricky, because you have to update both the stub file and the reveal tests together; it's easy to forget a stub file when updating the implementation, and then you won't find the test for it either - it will just continue to pass.

Also fix two remaining docstrings for list/tuple change

[skip cirrus]
@rgommers
Copy link
Member Author

Okay, I went through everything again and I think this is happy now. Probably best to squash-merge.

Copy link
Contributor
@mhvk mhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, let's get this in.

@stuartarchibald
Copy link
Contributor

It's still possible to get non-equal splits.

You are cherry-picking the wrong contiguity. All chunks are C-contiguous. The last one just happens to also be valid F-contiguous (because it is effectively 1-D).

Unless the input is a runtime constant, you cannot infer the lengths of these chunks ahead of time (they depend both on the input shape and the splitting constants).

So after rethinking it I really don't think splitting violates the typing guidelines at all, unless @stuartarchibald thinks otherwise?

RE ^, in the case described, were the return type a tuple of arrays, I think Numba would just end up "seeing" this as a UniTuple type of 3 x (2d C-ordered int64 arrays).

>>> from numba import typeof
>>> import numpy as np
>>> typeof(tuple(np.split(np.arange(10).reshape((5, 2)), (2, 4))))
UniTuple(Array(int64, 2, 'C', False, aligned=True), 3)

the runtime size/length/shape is not known to type inference; it is the dimensionality of the arrays that is of concern when determining the type.

I think in the case of np.split, it might be an anomalous case with respect to the general preference of type inference towards tuple return types. In this case, the function takes indices_or_sections which can be an integer or array-like, if this value has a type that does not carry compile time constant information that will allow the size of the output tuple to be determined, then the function cannot be compiled. In the examples above where indices_or_sections is given as a tuple, this would compile in Numba as tuples are sized at compile time, similarly a compile time constant literal integer would also work. Were however an integer supplied at runtime, or the array-like to be an array (size wouldn't be known at compile time), then it couldn't be compiled. Numba often has to restrict support over input types to functions in this way, however this may be one of the rare cases where a list output would lessen the restriction.

@seberg
Copy link
Member
seberg commented Jan 26, 2024

Seems a double edged sword, then?

Numba following NumPy now could break some programs that work by relying on the list return (undefined length). Although, unlike things like str.split() (where my intuition likely comes from), we sometimes do have a well defined constant return type.
And for those cases where things still work it may allow more things later if the size is attached!

I am not sure what that gives us: that keeping the list return is probably better for numba?

@stuartarchibald
Copy link
Contributor

Seems a double edged sword, then?

Numba following NumPy now could break some programs that work by relying on the list return (undefined length). Although, unlike things like str.split() (where my intuition likely comes from), we sometimes do have a well defined constant return type. And for those cases where things still work it may allow more things later if the size is attached!

I think that this indeed may be the case, on the condition that the list is going to comprise arrays that are all of the same type (and I think they have to be for split, they are all views into the original array and have the same dimensionality?). The original Numba issue that prompted this was probably from cases where heterogeneously typed lists were being returned from functions where a tuple could have been returned instead, the result of which was Numba just couldn't support them.

I am not sure what that gives us: that keeping the list return is probably better for numba?

I agree with your remarks about str.split() and a "well defined constant return type", that is the important thing, it's often hard to ascertain what that is and what the trade-offs may be without a type inference engine running! Perhaps the thing to do here is for cases like split, vsplit and hsplit (are there any more?), returning a list that is guaranteed homogeneous in the type of array is the right thing to do as it is type stable and seems more flexible for users?

I guess the intent of the original issue has been eroded a little by the passage of time and it's good to be revisiting it. Apologies for any confusion caused.

@jakevdp
Copy link
Contributor
jakevdp commented Jan 26, 2024

@rgommers I just noticed you tagged me here: JAX can handle arbitrary python collections (and even nested collections!) as function return values. Essentially it does this by treating all function inputs and outputs as pytrees, and flattening them to a canonical form.

@rgommers
Copy link
Member Author

Thanks for the input @stuartarchibald and @jakevdp!

Perhaps the thing to do here is for cases like split, vsplit and hsplit (are there any more?), returning a list that is guaranteed homogeneous in the type of array is the right thing to do as it is type stable and seems more flexible for users?

I don't think there are other cases like this beyond *split. It sounds like a partial revert is proposed/desired here. I'll do that soon unless there are more comments.

@stuartarchibald
Copy link
Contributor

Thanks @rgommers, apologies for causing the partial-revert, though I do think it provides a good outcome for Numba based on the (re)considerations above. Thanks again!

rgommers added a commit to rgommers/numpy that referenced this pull request Feb 9, 2024
This reverts a part of numpygh-25570. It turns out that the `split`
functions are exceptional, and returning a tuple of arrays isn't
better (perhaps worse) for Numba. For JAX it didn't matter.
Given there's certainly no gain, we decided in the post-merge discussion
on numpygh-25570 to revert the changes to the `split` APIs.
@rgommers
Copy link
Member Author
rgommers commented Feb 9, 2024

It sounds like a partial revert is proposed/desired here. I'll do that soon unless there are more comments.

PR for that is gh-25800.

@stuartarchibald
Copy link
Contributor

Many thanks @rgommers.

eliegoudout added a commit to eliegoudout/numpy-atleast_nd-fill that referenced this pull request Feb 26, 2024
eliegoudout added a commit to eliegoudout/numpy-atleast_nd-fill that referenced this pull request Feb 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
62 - Python API Changes or additions to the Python API. Mailing list should usually be notified. Numpy 2.0 API Changes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants
0