Array API implementation #8750
Conversation
cc @dask/array
Force-pushed from 9245ae7 to 1475623
All array API tests are passing (https://github.com/tomwhite/dask/runs/5321391257?check_suite_focus=true), now that #8676 and data-apis/array-api-tests#103 have been merged.
```sh
pushd array-api-tests

# Skip testing functions with known issues
cat << EOF >> skips.txt
```
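For illustration, the entries appended to `skips.txt` are pytest node IDs, one per line; the IDs below are hypothetical examples rather than the PR's actual list:

```
# in-place operators and sorting are not implemented in Dask
array_api_tests/test_sorting_functions.py::test_sort
array_api_tests/test_operators_and_elementwise_functions.py::test_add
```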
Does this need to be passed to pytest somehow?
The array-api-tests repo automatically uses skips.txt to skip tests (https://github.com/data-apis/array-api-tests/blob/master/conftest.py#L84). Ideally you could just pass these XFAILs on the command line, but that doesn't seem to be possible (https://stackoverflow.com/questions/68764967/xfail-pytest-tests-from-the-command-line).
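For the curious, that mechanism can be approximated with a small pytest hook; this is a simplified sketch, not array-api-tests' exact code:

```python
# conftest.py (sketch): skip any collected test whose node ID is
# listed in skips.txt, mirroring the mechanism described above.
import pytest

def pytest_collection_modifyitems(config, items):
    with open("skips.txt") as f:
        skip_ids = {
            line.strip() for line in f
            if line.strip() and not line.startswith("#")
        }
    skip_marker = pytest.mark.skip(reason="listed in skips.txt")
    for item in items:
        if item.nodeid in skip_ids:
            item.add_marker(skip_marker)
```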
Exciting to see this. Let me know if you have any questions around the test suite or the numpy.array_api implementation. By the way, I don't know if you use type hints in dask, but if you want to add them, you can get the official type hints for the array API functions from here: https://github.com/data-apis/array-api/tree/main/spec/API_specification/signatures.
We do make use of type hints.
```python
def mean(x, /, *, axis=None, keepdims=False):
    if x.dtype not in _floating_dtypes:
        raise TypeError("Only floating-point dtypes are allowed in mean")
```
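The hunk above is truncated; for context, a natural continuation might delegate to the existing dask.array reduction. This is a sketch, not necessarily the PR's exact code, and `_floating_dtypes` is assumed to be a dtype set defined in the same module:

```python
import dask.array as da

def mean(x, /, *, axis=None, keepdims=False):
    if x.dtype not in _floating_dtypes:  # assumed module-level dtype set
        raise TypeError("Only floating-point dtypes are allowed in mean")
    # Delegate the actual reduction to dask.array's existing implementation.
    return da.mean(x, axis=axis, keepdims=keepdims)
```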
FYI, this sort of thing doesn't hurt to have, but it also isn't required by the spec. Type combinations outside of those that say "should" in the spec are just considered out of scope. The NumPy implementation is particularly strict and disallows anything not explicitly required by the spec, but it is not required that other implementations do this. The test suite will only check for required combinations (i.e., this line won't even be covered by array-api-tests). There are some cases where an exception is required by the spec, and those are covered by the test suite.
OTOH, mixed-type promotion and cross-type casting in particular are out of scope in the spec specifically because they are considered bad for various reasons. So it's also not so bad to have this sort of thing in there just to make the API cleaner. Some other things, like limited indexing semantics, are only out of scope because they aren't implemented consistently across array libraries, or aren't always possible on accelerators, so it's not bad to allow them anyway if you already have them.
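A quick demonstration of the difference in strictness, assuming NumPy >= 1.22 where the `numpy.array_api` reference module is available:

```python
import numpy as np
import numpy.array_api as xp  # the strict reference implementation

a = np.arange(5)
print(np.mean(a))  # mainline NumPy happily averages integers -> 2.0

x = xp.arange(5)
try:
    # The strict implementation rejects integer dtypes, since the spec
    # only requires mean() for floating-point arrays.
    xp.mean(x)
except TypeError as exc:
    print(exc)  # Only floating-point dtypes are allowed in mean
```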
That's good to know. I was following the NumPy implementation and being strict. In general, it's easier to start out strict and loosen an implementation later than the other way round, due to backwards-compatibility guarantees. However, this is probably an area where we might want to just use the existing Dask implementation. Interested to hear what other @dask/array folks think.
BTW I wonder if the array API spec should use capitalised "SHOULD" rather than "should" to make it a bit clearer it's using RFC 2119 levels?
```python
def ceil(x, /):
    if x.dtype in _integer_dtypes:
        # Note: The return dtype of ceil is the same as the input
```
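The hunk is cut off above. In numpy.array_api the integer branch simply returns the input unchanged (integer values are already integral), so a Dask version might look like this sketch, assuming `_integer_dtypes` is defined alongside:

```python
import dask.array as da

def ceil(x, /):
    if x.dtype in _integer_dtypes:  # assumed module-level dtype set
        # Note: The return dtype of ceil is the same as the input;
        # integers are already integral, so ceil is a no-op here.
        return x
    return da.ceil(x)
```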
FYI the "# Note" comments in the numpy.array_api implementation are there to carefully document the differences from mainline NumPy.
Thanks @asmeurer for taking a look!
I hadn't seen that the type signatures for the array API had been collected in one place, so thanks for pointing that out. I will add them to the implementation here (either in this PR or a follow-on one).
Were you going to implement the (optional) `linalg` extension?
Presumably not. I think dask was one of the motivating libraries for making linalg optional.
I left it out because it's optional, and I wanted to get the core part working first.
We've resolved data-apis/array-api-tests#94 in the latest release of the test suite... finally 😅 For this branch I got failures for the following:

Failing special cases
Interestingly, it was the case with NumPy too that in-place operators tended to deviate. Special cases seem fairly low priority generally, but it's also noteworthy that floordiv has some of its special cases annotated as "libraries may not implement this".
Thanks for the update @honno! In-place operations are not supported in Dask (#5199), so I included them in skips.txt.

I wonder if there's interest in merging this? There are still cases to fix, but as it stands this is passing a good chunk of the array API tests.
@dask/maintenance it would be great if a few folks took a look at this and let us know what they think 🙂
I just came across this interesting discussion about the feasibility of making the main NumPy namespace converge to the array API: scikit-learn/scikit-learn#22352 (comment). (TL;DR: it's feasible, but may take a while.) As the current PR shows, Dask's implementation is pretty close already, so I'd like to explore closing the gap via incremental changes to `dask.array`.
Apologies for the lack of response. I'll give this a review over the next couple of days.
I really like the idea of not introducing a new namespace.
FWIW NumPy is also introducing a new namespace for this, and it may be unavoidable for some kinds of functionality (for example, how unknown dimensions in shape are handled). Longer term, NumPy plans to replace their current API with the new one created in this separate namespace. We may benefit from following a similar strategy.
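One concrete instance of that "unknown dimensions" point, using current Dask behaviour:

```python
import numpy as np
import dask.array as da

x = da.from_array(np.arange(10), chunks=5)
m = x[x > 3]     # boolean masking produces unknown chunk sizes
print(m.shape)   # (nan,): Dask encodes unknown dimensions as NaN,
# while the spec types shape as Tuple[Optional[int], ...] with None
# for unknown dimensions, a change that would be hard to make in
# dask.array without breaking existing code.
```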
Co-authored-by: Pavithra Eswaramoorthy <pavithraes@outlook.com>
Any more thoughts on this?
I still think what I said here (#8750 (comment)): that we shouldn't introduce a new namespace lightly, and we should evolve the current namespace to be as compatible as possible first. So in a sense this is superseded by #8917.
```yaml
with:
  repository: data-apis/array-api-tests
  path: array-api-tests
  ref: 2022.02.24
```
Suggested change:

```diff
-  ref: 2022.02.24
+  ref: 2022.05.18
```
There's a newer version available, apologies for not reviewing this sooner!
@tomwhite, did you have thoughts on this comment (#8750 (comment))?
I agree (I think) with @jakirkham that a separate namespace may be unavoidable. I'm particularly thinking of all of the keyword-only and positional-only arguments in the spec, which are great, but are difficult to bolt onto the existing API in a backwards-compatible way. Though I would be in favor of making `dask.array` as compliant as we can.
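To make the retrofit problem concrete, here is an illustrative sketch: the spec's use of `/` and `*` outlaws call styles that work against today's `dask.array` signatures:

```python
# Spec-style signature: x is positional-only; axis and keepdims
# are keyword-only.
def mean(x, /, *, axis=None, keepdims=False):
    return 0.0  # placeholder body, for illustration only

mean([1.0, 2.0], axis=0)   # fine under the spec signature
# mean([1.0, 2.0], 0)      # TypeError: axis can no longer be positional
# mean(x=[1.0, 2.0])       # TypeError: x can no longer be a keyword
```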
Initially, I was all for adding a new namespace (hence this PR), but the comment by @rgommers here scikit-learn/scikit-learn#22352 (comment) gave me pause. Honestly, I'm not sure which way is best: either i) add a new namespace, or ii) try to converge the existing one. I think it depends on what the wider NumPy ecosystem is aiming to do (and even here there are differences, with CuPy adding a new namespace and PyTorch evolving the existing one), and on what Dask maintainers are comfortable with (there is a variety of views, as @jsignell and @ian-r-rose have demonstrated in this PR).
Outside of the divergences in APIs already noted, I think the real issue is maintainability. If we disrupt or cause issues in normal Dask Array behavior by adopting this API on top of the existing one, that would be a bad user experience and quite stressful for maintainers here. Sandboxing it in another module that users must opt into provides some protection for those users and for maintainers. Also, users of a new experimental API are more prepared for rough edges. I understand the ease-of-adoption argument, but I think that is putting the cart before the horse. We need to do this in a safe way that doesn't accidentally harm users today. A separate module does that. Over time it can be retired as the ecosystem moves forward.
Isn't that because PyTorch has a tradition of breaking compatibility in every single release, so they can be as adventurous as they want? 😄
This isn't quite it. FYI, last week in the NumPy dev meeting we decided to put converging the main `numpy` namespace to the array API standard on the NumPy roadmap. The separate module isn't actually all that useful. One strictly conforming implementation is needed for testing that downstream libraries/users do not use library-specific features.
Thanks for the update @rgommers.
Is there somewhere to track this work?
This now lives in https://github.com/data-apis/array-api-compat, which has a compat layer for both NumPy and CuPy, and soon PyTorch too.
Yes, please try out array-api-compat and let me know if you find any issues. Or if there are any ideas from this work that aren't in array-api-compat, let me know and we can see if adding them there makes sense. The idea is for it to become a general wrapper library that things like Dask can use to make better use of the array API.
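A minimal consumer-side sketch of the pattern array-api-compat enables, using its `array_namespace` helper (the `standardize` function is a made-up example):

```python
import numpy as np
from array_api_compat import array_namespace

def standardize(x):
    # Obtain the spec-conformant namespace for whatever array came in
    # (NumPy, CuPy, ...), then use only array API functions on it.
    xp = array_namespace(x)
    return (x - xp.mean(x, axis=0)) / xp.std(x, axis=0)

print(standardize(np.asarray([[1.0, 2.0], [3.0, 4.0]])))
```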
It looks like array-api-compat could be extended to provide a compatibility layer for Dask too? (It could also be used by Dask to provide uniform access to the underlying array library, NumPy or CuPy for example, but that is a different thing from what's being discussed here.)
I opened an issue where we can discuss it: data-apis/array-api-compat#17. It makes sense if consumer libraries will want to use Dask alongside other array-API-compliant libraries like NumPy, CuPy, and PyTorch, and if Dask itself doesn't necessarily want to update all its APIs to be conformant right away (e.g., due to backwards-compatibility concerns). As for Dask itself using the compat library: if you end up using it, please let me know how it works out and whether you run into any problems.
I'm curious how we should look at this in relation to the configurable backend work. Maybe the array API could be a different kind of array backend, or an additional configuration on existing backends? @rjzamora do you have any thoughts on this? 🙂
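For reference, a sketch of how the configurable-backend machinery surfaces to users, assuming the `array.backend` config option from that work; treating the array API as a backend-like opt-in would be analogous:

```python
import dask
import dask.array as da

# Creation functions dispatch on the configured backend, so an
# "array API" mode could conceivably be selected the same way.
with dask.config.set({"array.backend": "cupy"}):
    x = da.ones((1000, 1000))  # chunks would be cupy arrays, not numpy
```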
- Passes `pre-commit run --all-files`
This is a Dask implementation of the Python Array API standard, discussed in dask/community#109. It is provided in the `dask.array_api` namespace, so it is independent of the existing `dask.array` implementation.

There are a few parts of the API that are not implemented since they don't exist in Dask (such as sorting functions and in-place operations), and these are excluded when running the test suite. Apart from these, there are two failures: one (`test_array_object.py::test_getitem_masking`) looks like a shape problem in the test suite; the other should be fixed by #8676. (Note that more edge-case failures will be found when we get Hypothesis to try more examples.)

I've included a GitHub action to run the array API test suite, but I've configured it to be triggered manually for the time being, since not all tests are passing yet.
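Finally, a minimal usage sketch of the new namespace, assuming this branch is installed (`asarray` and `add` are required by the spec, so they are part of the implementation):

```python
import dask.array_api as xp  # the namespace this PR adds (branch-only)

x = xp.asarray([1.0, 2.0, 3.0])
y = xp.add(x, xp.asarray([10.0, 20.0, 30.0]))
print(y)  # a lazy, dask-backed array following the array API
```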