Array API implementation by tomwhite · Pull Request #8750 · dask/dask · GitHub

Array API implementation #8750


Open · wants to merge 6 commits into base: main

Conversation

tomwhite
Contributor

This is a Dask implementation of the Python Array API standard, discussed in dask/community#109.

It is provided in the dask.array_api namespace, so it is independent of the existing dask.array implementation.

There are a few parts of the API that are not implemented since they don't exist in Dask (such as sorting functions and in-place operations), and these are excluded when running the test suite. Apart from these exclusions, there are two failures: one (test_array_object.py::test_getitem_masking) looks like a shape problem in the test suite itself; the other should be fixed by #8676. (Note that more edge-case failures will be found when we get Hypothesis to try more examples.)

I've included a GitHub action to run the array API test suite, but I've configured it to be triggered manually for the time being since not all tests are passing yet.
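For illustration, a manually-triggered workflow of this kind might look roughly like the sketch below. This is a hypothetical reconstruction, not the workflow file from this PR: the job and step names are invented, though `ARRAY_API_TESTS_MODULE` is the environment variable the array-api-tests suite actually reads to pick the implementation under test.

```yaml
# Hypothetical sketch of a manually-triggered array API test workflow;
# the real workflow in this PR may differ in names and steps.
name: Array API tests

on:
  workflow_dispatch:  # manual trigger only, since not all tests pass yet

jobs:
  array-api-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/checkout@v2
        with:
          repository: data-apis/array-api-tests
          path: array-api-tests
      - name: Run the test suite
        run: |
          export ARRAY_API_TESTS_MODULE=dask.array_api
          cd array-api-tests
          pytest
```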

@jakirkham
Member

cc @dask/array

@tomwhite
Contributor Author

All array API tests are passing (https://github.com/tomwhite/dask/runs/5321391257?check_suite_focus=true), now that #8676 and data-apis/array-api-tests#103 have been merged.

pushd array-api-tests

# Skip testing functions with known issues
cat << EOF >> skips.txt
Member

Does this need to be passed to pytest somehow?


The array-api-tests repo automatically uses skips.txt to skip tests https://github.com/data-apis/array-api-tests/blob/master/conftest.py#L84. Ideally you could just pass these XFAILS on the command line, but that doesn't seem to be possible (https://stackoverflow.com/questions/68764967/xfail-pytest-tests-from-the-command-line)
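The mechanism described here can be sketched in simplified form: a plain-text file lists test IDs, and the suite's `conftest.py` turns those into expected failures at collection time. The function name below is hypothetical; the real logic lives in array-api-tests' own conftest.

```python
# Simplified sketch of the skips.txt mechanism (hypothetical helper name;
# the real implementation is in array-api-tests' conftest.py).
def load_skip_ids(path):
    """Return the test IDs listed in a skips file, one per line,
    ignoring blank lines and '#' comments."""
    ids = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#"):
                ids.append(line)
    return ids

# In conftest.py, a pytest_collection_modifyitems hook can then attach an
# xfail marker to every collected test whose ID appears in this list,
# which is why the IDs never need to be passed on the command line.
```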

@asmeurer

Exciting to see this. Let me know if you have any questions around the test suite or the numpy.array_api implementation.

By the way, I don't know if you use type hints in dask, but if you want to add them, you can get the official type hints for the array API functions from here https://github.com/data-apis/array-api/tree/main/spec/API_specification/signatures.
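A spec-style annotated signature looks roughly like the sketch below, adapted from the signatures published in the array-api repository. The `Array` class here is a placeholder for the implementing library's array type (e.g. `dask.array.Array`), not a real type from the spec package.

```python
# Hedged sketch of a spec-style annotated signature; `Array` is a
# placeholder for the implementing library's array type.
from typing import Optional, Tuple, Union

class Array:  # placeholder type, for illustration only
    ...

def mean(
    x: Array,           # positional-only, per the standard
    /,
    *,
    axis: Optional[Union[int, Tuple[int, ...]]] = None,  # keyword-only
    keepdims: bool = False,
) -> Array:
    ...
```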

@jakirkham
Member

We do make use of typing in Dask so that's good to know. Thanks for sharing 🙂


def mean(x, /, *, axis=None, keepdims=False):
    if x.dtype not in _floating_dtypes:
        raise TypeError("Only floating-point dtypes are allowed in mean")
@asmeurer asmeurer Feb 24, 2022


FYI, this sort of thing doesn't hurt to have, but it also isn't required by the spec. Type combinations outside of those that say "should" in the spec are just considered out of scope. The NumPy implementation is particularly strict and disallows anything not explicitly required by the spec, but it is not required that other implementations do this. The test suite will only check for required combinations (i.e., this line won't even be covered by array-api-tests). There are some cases where an exception is required by the spec, and those are covered by the test suite.

OTOH, mixed-type promotion and cross-type casting in particular are out of scope in the spec specifically because they are considered bad for various reasons. So it's also not so bad to have this sort of thing in there just to make the API cleaner. Some other things, like limited indexing semantics, are only out of scope because they aren't implemented consistently across array libraries, or aren't always possible on accelerators, so it's not bad to allow them anyway if you already have them.

Contributor Author

That's good to know. I was following the NumPy implementation and being strict. In general, it's easier to start out strict and loosen an implementation later than the other way round, due to backwards-compatibility guarantees. However, this is probably an area where we might want to just use the existing Dask implementation. I'd be interested to see what other @dask/array folks think.

BTW I wonder if the array API spec should use capitalised "SHOULD" rather than "should" to make it a bit clearer it's using RFC 2119 levels?


def ceil(x, /):
    if x.dtype in _integer_dtypes:
        # Note: The return dtype of ceil is the same as the input


FYI the "# Note" comments in the numpy.array_api implementation are there to carefully document the differences from the mainline NumPy.

@tomwhite
Contributor Author

Thanks @asmeurer for taking a look!

By the way, I don't know if you use type hints in dask, but if you want to add them, you can get the official type hints for the array API functions from here https://github.com/data-apis/array-api/tree/main/spec/API_specification/signatures.

I hadn't seen the type signatures for the array API had been collected in one place so thanks for pointing that out. I will add them to the implementation here. (Either in this PR or a follow-on one.)

@honno
honno commented Mar 1, 2022

Were you going to implement the (optional) linalg extension here or in a follow-up PR? Curious if you missed it because the test suite skips testing extension functions if not found. I'm wondering if we should actually test them by default and then ask the user to explicitly disable them via our --disable-extension flag.

@asmeurer
asmeurer commented Mar 1, 2022

Presumably not. I think dask was one of the motivating libraries for making linalg optional.

@tomwhite
Contributor Author
tomwhite commented Mar 2, 2022

Were you going to implement the (optional) linalg extension here or in a follow-up PR? Curious if you missed it because the test suite skips testing extension functions if not found

I left it out because it's optional, and I wanted to get the core part working first.

@honno
honno commented Mar 22, 2022

We've resolved data-apis/array-api-tests#94 in the latest release of the test suite... finally 😅 For this branch, I got failures for the following with --max-examples=1:

Failing special cases
floor_divide(x1_i is +infinity and isfinite(x2_i) and x2_i > 0) -> +infinity
floor_divide(x1_i is +infinity and isfinite(x2_i) and x2_i < 0) -> -infinity
floor_divide(x1_i is -infinity and isfinite(x2_i) and x2_i > 0) -> -infinity
floor_divide(x1_i is -infinity and isfinite(x2_i) and x2_i < 0) -> +infinity
floor_divide(isfinite(x1_i) and x1_i > 0 and x2_i is -infinity) -> -0
floor_divide(isfinite(x1_i) and x1_i < 0 and x2_i is +infinity) -> -0
__floordiv__(x1_i is +infinity and isfinite(x2_i) and x2_i > 0) -> +infinity
__floordiv__(x1_i is +infinity and isfinite(x2_i) and x2_i < 0) -> -infinity
__floordiv__(x1_i is -infinity and isfinite(x2_i) and x2_i > 0) -> -infinity
__floordiv__(x1_i is -infinity and isfinite(x2_i) and x2_i < 0) -> +infinity
__floordiv__(isfinite(x1_i) and x1_i > 0 and x2_i is -infinity) -> -0
__floordiv__(isfinite(x1_i) and x1_i < 0 and x2_i is +infinity) -> -0
__iadd__(x1_i is +infinity and x2_i is -infinity) -> NaN
__iadd__(x1_i is -infinity and x2_i is +infinity) -> NaN
__iadd__(isfinite(x1_i) and x2_i is +infinity) -> +infinity
__iadd__(isfinite(x1_i) and x2_i is -infinity) -> -infinity
__iadd__(x1_i is -0 and x2_i is +0) -> +0
__iadd__((x1_i is +0 or x1_i == -0) and isfinite(x2_i) and x2_i != 0) -> x2_i
__iadd__(isfinite(x1_i) and x1_i != 0 and x2_i == -x1_i) -> +0
__itruediv__((x1_i is +infinity or x1_i == -infinity) and (x2_i is +infinity or x2_i == -infinity)) -> NaN
__itruediv__((x1_i is +0 or x1_i == -0) and (x2_i is +0 or x2_i == -0)) -> NaN
__itruediv__(x1_i is +0 and x2_i < 0) -> -0
__itruediv__(x1_i is -0 and x2_i < 0) -> +0
__itruediv__(x1_i > 0 and x2_i is +0) -> +infinity
__itruediv__(x1_i > 0 and x2_i is -0) -> -infinity
__itruediv__(x1_i < 0 and x2_i is +0) -> -infinity
__itruediv__(x1_i < 0 and x2_i is -0) -> +infinity
__itruediv__(x1_i is +infinity and isfinite(x2_i) and x2_i < 0) -> -infinity
__itruediv__(x1_i is -infinity and isfinite(x2_i) and x2_i < 0) -> +infinity
__itruediv__(isfinite(x1_i) and x1_i > 0 and x2_i is +infinity) -> +0
__itruediv__(isfinite(x1_i) and x1_i > 0 and x2_i is -infinity) -> -0
__itruediv__(isfinite(x1_i) and x1_i < 0 and x2_i is +infinity) -> -0
__itruediv__(isfinite(x1_i) and x1_i < 0 and x2_i is -infinity) -> +0
__itruediv__(copysign(1, x1_i) == copysign(1, x2_i) and isfinite(x1_i) and x1_i != 0 and isfinite(x2_i) and x2_i != 0) -> positive sign
__ifloordiv__((x1_i is +infinity or x1_i == -infinity) and (x2_i is +infinity or x2_i == -infinity)) -> NaN
__ifloordiv__((x1_i is +0 or x1_i == -0) and (x2_i is +0 or x2_i == -0)) -> NaN
__ifloordiv__(x1_i is +0 and x2_i < 0) -> -0
__ifloordiv__(x1_i is -0 and x2_i < 0) -> +0
__ifloordiv__(x1_i > 0 and x2_i is +0) -> +infinity
__ifloordiv__(x1_i > 0 and x2_i is -0) -> -infinity
__ifloordiv__(x1_i < 0 and x2_i is +0) -> -infinity
__ifloordiv__(x1_i < 0 and x2_i is -0) -> +infinity
__ifloordiv__(x1_i is +infinity and isfinite(x2_i) and x2_i < 0) -> -infinity
__ifloordiv__(x1_i is -infinity and isfinite(x2_i) and x2_i < 0) -> +infinity
__ifloordiv__(isfinite(x1_i) and x1_i > 0 and x2_i is +infinity) -> +0
__ifloordiv__(isfinite(x1_i) and x1_i > 0 and x2_i is -infinity) -> -0
__ifloordiv__(isfinite(x1_i) and x1_i < 0 and x2_i is +infinity) -> -0
__ifloordiv__(isfinite(x1_i) and x1_i < 0 and x2_i is -infinity) -> +0
__ifloordiv__(copysign(1, x1_i) == copysign(1, x2_i) and isfinite(x1_i) and x1_i != 0 and isfinite(x2_i) and x2_i != 0) -> positive sign
__imul__((x1_i is +infinity or x1_i == -infinity) and (x2_i is +0 or x2_i == -0)) -> NaN
__imul__((x1_i is +0 or x1_i == -0) and (x2_i is +infinity or x2_i == -infinity)) -> NaN
__ipow__(not x1_i == 1 and x2_i is NaN) -> NaN
__ipow__(x2_i is +0) -> 1
__ipow__(x2_i is -0) -> 1
__ipow__(abs(x1_i) > 1 and x2_i is +infinity) -> +infinity
__ipow__(abs(x1_i) > 1 and x2_i is -infinity) -> +0
__ipow__(abs(x1_i) < 1 and x2_i is -infinity) -> +infinity
__ipow__(x1_i is +infinity and x2_i < 0) -> +0
__ipow__(x1_i is -infinity and x2_i > 0 and not (x2_i.is_integer() and x2_i % 2 == 1)) -> +infinity
__ipow__(x1_i is -infinity and x2_i < 0 and x2_i.is_integer() and x2_i % 2 == 1) -> -0
__ipow__(x1_i is -infinity and x2_i < 0 and not (x2_i.is_integer() and x2_i % 2 == 1)) -> +0
__ipow__(x1_i is +0 and x2_i < 0) -> +infinity
__ipow__(x1_i is -0 and x2_i > 0 and not (x2_i.is_integer() and x2_i % 2 == 1)) -> +0
__ipow__(x1_i is -0 and x2_i < 0 and x2_i.is_integer() and x2_i % 2 == 1) -> -infinity
__ipow__(x1_i is -0 and x2_i < 0 and not (x2_i.is_integer() and x2_i % 2 == 1)) -> +infinity
__ipow__(x1_i < 0 and isfinite(x1_i) and isfinite(x2_i) and not x2_i.is_integer()) -> NaN

Interestingly, it was the case with NumPy too that in-place operators tended to deviate. Special cases seem fairly low priority in general, but it's also noteworthy that floor_divide has some of its special cases noted as "libraries may not implement this".

@tomwhite
Contributor Author

Thanks for the update @honno!

In-place operations are not supported in Dask (#5199), so I included them in skips.txt. Similarly, I've skipped special-case tests too for the first iteration of this code.

I wonder if there's interest in merging this? There are still cases to fix, but as it stands this is passing a good chunk of the array API tests.

@jakirkham
Member

@dask/maintenance would be great if a few folks took a look at this and let us know what they thought 🙂

@jsignell jsignell self-requested a review April 7, 2022 12:17
@tomwhite
Contributor Author

I just came across this interesting discussion about the feasibility of making the main NumPy namespace converge to the array API: scikit-learn/scikit-learn#22352 (comment). (TL;DR: it's feasible, but may take a while.)

As the current PR shows, Dask's implementation is pretty close already, so I'd like to explore closing the gap via incremental changes to dask.array, rather than introducing a new namespace dask.array_api at this stage. I think we could get to a good level of compliance with bug fixes and compatible changes. Breaking changes (listed here: https://numpy.org/devdocs/reference/array_api.html), could be left until later, probably following what NumPy chooses to do. Of course, it would always be possible to add a separate namespace at a later point if necessary.

@jrbourbeau
Member

@dask/maintenance would be great if a few folks took a look at this and let us know what they thought 🙂

Apologies for the lack of response. I'll give this a review over the next couple of days

@jsignell
Member

As the current PR shows, Dask's implementation is pretty close already, so I'd like to explore closing the gap via incremental changes to dask.array, rather than introducing a new namespace dask.array_api at this stage. I think we could get to a good level of compliance with bug fixes and compatible changes. Breaking changes (listed here: https://numpy.org/devdocs/reference/array_api.html), could be left until later, probably following what NumPy chooses to do. Of course, it would always be possible to add a separate namespace at a later point if necessary.

I really like the idea of not introducing a new namespace.

@jakirkham
Member

FWIW NumPy is also introducing a new namespace for this and it may be unavoidable for some kinds of functionality (for example how unknown dimensions in shape are handled). Longer term NumPy plans to replace their current API with the new one created in this separate namespace. We may benefit from following a similar strategy.

Co-authored-by: Pavithra Eswaramoorthy <pavithraes@outlook.com>
@jakirkham
Member

Any more thoughts on this?

@tomwhite
Contributor Author

Any more thoughts on this?

I still think what I said here #8750 (comment), that we shouldn't introduce a new namespace lightly, and we should evolve the current namespace to be as compatible as possible first.

So in a sense this is superseded by #8917.

        with:
          repository: data-apis/array-api-tests
          path: array-api-tests
          ref: 2022.02.24
Member

Suggested change:
-          ref: 2022.02.24
+          ref: 2022.05.18

Member

There's a newer version available, apologies for not reviewing this sooner!


@jakirkham
Member

Any more thoughts on this?

I still think what I said here #8750 (comment), that we shouldn't introduce a new namespace lightly, and we should evolve the current namespace to be as compatible as possible first.

So in a sense this is superseded by #8917.

@tomwhite, did you have thoughts on this comment ( #8750 (comment) )?

@ian-r-rose
Collaborator

Any more thoughts on this?

I still think what I said here #8750 (comment), that we shouldn't introduce a new namespace lightly, and we should evolve the current namespace to be as compatible as possible first.
So in a sense this is superseded by #8917.

@tomwhite, did you have thoughts on this comment ( #8750 (comment) )?

I agree (I think) with @jakirkham that a separate namespace may be unavoidable. I'm particularly thinking of all of the keyword-only and positional-only arguments in the spec, which are great, but are difficult to bolt onto the existing API in a backwards-compatible way.

Though I would be in favor of making array_api a submodule of dask.array.
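The difficulty with bolting the spec's positional-only and keyword-only arguments onto the existing API can be seen with a toy pair of signatures (hypothetical names, for illustration only):

```python
# The spec makes `x` positional-only and `axis` keyword-only; an existing
# dask.array-style function typically also accepts `axis` positionally.
# Changing the signature in place would therefore break existing callers.
def mean_spec(x, /, *, axis=None, keepdims=False):
    return ("spec", axis)

def mean_legacy(x, axis=None, keepdims=False):
    return ("legacy", axis)

# mean_legacy(data, 0) is valid today, but the spec-conformant signature
# rejects the same positional call with a TypeError.
```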

@tomwhite
Contributor Author

Any more thoughts on this?

I still think what I said here #8750 (comment), that we shouldn't introduce a new namespace lightly, and we should evolve the current namespace to be as compatible as possible first.
So in a sense this is superseded by #8917.

@tomwhite, did you have thoughts on this comment ( #8750 (comment) )?

Initially, I was all for adding a new namespace (hence this PR), but the comment by @rgommers here scikit-learn/scikit-learn#22352 (comment) gave me pause.

Honestly, I'm not sure what the best way to go is: either (i) add a new namespace, or (ii) try to converge the existing one. I think it depends on what the wider NumPy ecosystem is aiming to do (and even here there are differences, with CuPy adding a new namespace and PyTorch evolving the existing one), and what Dask maintainers are comfortable with (there is a variety of views, as @jsignell and @ian-r-rose have demonstrated in this PR).

@jakirkham
Member

Outside of the divergences in APIs already noted, I think the real issue is maintainability.

If we disrupt or cause issues in normal Dask Array behavior by adopting this API on top of the existing one, that would be a bad user experience and quite stressful for maintainers here. Sandboxing it in another module that users must opt into provides some protection for both users and maintainers. Also, users of a new experimental API are more prepared for rough edges.

I understand the ease of adoption argument, but I think that is putting the cart before the horse. We need to do this in a safe way that doesn't accidentally harm users today. A separate module does that. Over time this can be gotten rid of as the ecosystem moves forward.

@leofang
leofang commented Jul 20, 2022

and PyTorch evolving the existing one

Isn't it because PyTorch has a tradition of breaking compatibility in every single release, so they can be as adventurous as they want? 😄

@rgommers
Contributor

I understand the ease of adoption argument, but I think that is putting the cart before the horse. We need to do this in a safe way that doesn't accidentally harm users today. A separate module does that. Over time this can be gotten rid of as the ecosystem moves forward.

This isn't quite it. FYI, last week in the NumPy dev meeting we decided to put converging the main numpy namespace towards the array API standard on the NumPy roadmap. The timeline is TBD for backwards-compatibility-breaking changes, but there aren't that many (a lot of it is adding aliases).

The separate module isn't actually all that useful. One strictly conforming implementation is needed for testing that downstream libraries/users do not use library-specific features, and numpy.array_api can serve that purpose; it's probably not ideal to have multiple copies of it. In the consortium meeting two days ago, we kind of resolved that we need a separate package that is basically a "compat between NumPy and the array API" module, kind of a superset of what scikit-learn and SciPy are finding they need to support both NumPy and all array-API-compliant implementations. @jakirkham did that discussion change your perspective on this topic?

@tomwhite
Contributor Author
tomwhite commented Aug 2, 2022

Thanks for the update @rgommers.

In the consortium meeting of two days ago, we kind of resolved that we need a separate package that is basically a "compat between NumPy and the array API" module, kind of a superset of what scikit-learn and SciPy are finding they need to support both NumPy and all array-API-compliant implementations.

Is there somewhere to track this work?

@honno
honno commented Feb 17, 2023

Is there somewhere to track this work?

This now lives in https://github.com/data-apis/array-api-compat, which has a compat layer for both NumPy and CuPy, and PyTorch soon too.

@asmeurer

Yes, please try out array-api-compat and let me know if you find any issues. Or if there are any ideas from this work that aren't done in array-api-compat, let me know and we can see if adding them there makes sense. The idea is for it to become a general wrapper library that things like Dask can use to better make use of the array API.
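For context, the consumer-side usage pattern for array-api-compat looks roughly like the sketch below (using `array_namespace`, as documented for NumPy/CuPy/PyTorch at the time; the helper name `column_means` is invented, and Dask support is exactly what was still under discussion here):

```python
# Hedged sketch of the array-api-compat usage pattern for a consumer
# library; `column_means` is a hypothetical example function.
try:
    from array_api_compat import array_namespace
except ImportError:  # compat layer not installed; this is a sketch only
    array_namespace = None

def column_means(x):
    """Compute per-column means using only standard array API calls."""
    xp = array_namespace(x)   # the compliant namespace wrapping x's library
    return xp.mean(x, axis=0)  # same code works for NumPy, CuPy, PyTorch...
```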

@tomwhite
Contributor Author

Thanks @honno and @asmeurer.

It looks like array-api-compat could be extended to provide a compatibility layer for Dask too? (It could also be used by Dask to provide uniform access to the underlying array library - NumPy or CuPy, for example - but that is a different thing to what's being discussed here.)

@asmeurer

I opened an issue where we can discuss it data-apis/array-api-compat#17. It makes sense if consumer libraries will want to use Dask alongside other array API compliant libraries like NumPy, CuPy, and PyTorch, and also Dask itself doesn't want to necessarily update all its APIs to be conformant right away (e.g., due to backwards compatibility concerns).

For Dask itself using the compat library, if you end up using it, please let me know how it works out and if you run into any problems.

@jakirkham
Member

Am curious how we should look at this in relation to the configurable backend work?

Maybe Array API could be a different kind of Array backend or an additional configuration on existing backends?

@rjzamora do you have any thoughts on this? 🙂

Labels: array, needs attention (It's been a while since this was pushed on. Needs attention from the owner or a maintainer.)
10 participants