Add percentile support for NEP-35 #7162


Merged · 10 commits into dask:master · Feb 5, 2021

Conversation

pentschev (Member)

This PR is a slim version of #6738 with two purposes:

  1. Unblock rapidsai/cudf#7289 ([BUG] dask-cudf .describe() broken with NumPy 1.20) for NumPy >= 1.20;
  2. Serve as a first push of NEP-35 into Dask, hopefully a much easier review task than the full Support for NEP-35 (#6738).

Given this is a blocker for RAPIDS 0.18, to be released in two weeks, it would be great if it could make its way into the upcoming Dask release on Friday.
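(For readers unfamiliar with NEP-35: NumPy 1.20 adds a like= argument to array-creation functions so that creation dispatches on the type of a reference array. A minimal sketch of the idea, illustrative only and not code from this PR:)

```python
import numpy as np

# NEP-35 (NumPy >= 1.20): `like=` dispatches array creation to the library
# that owns the reference array, so with a CuPy reference the result would
# be a CuPy array instead of a NumPy one.
ref = np.empty(())                        # imagine a Dask array's _meta here
out = np.asarray([25, 50, 75], like=ref)  # created "like" the reference
```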

@jakirkham (Member) left a comment

Thanks Peter! Had a few questions below

@shwina (Contributor) commented Feb 3, 2021

Confirmed that this fixes rapidsai/cudf#7289 with both NumPy 1.20 and 1.19. Thanks @pentschev!

@jakirkham (Member) left a comment

One last suggestion. Otherwise LGTM

@jrbourbeau (Member) left a comment

I pushed a commit to update the pytest.mark.skipif condition (hope that's okay @pentschev). Also left a couple of small comments, but they shouldn't block merging this PR

```python
    return da_func(a, **kwargs)
elif isinstance(a, Array):
    if _is_cupy_type(a._meta):
        a = a.compute(scheduler="sync")
```
Member

I'm curious why we're using the synchronous scheduler here. I would have expected us to dispatch to the default scheduler. This is what we do in other situations where we need to trigger a compute, like computing len of a Dask DataFrame:

dask/dask/dataframe/core.py, lines 550 to 553 at 2632bbc:

```python
    def __len__(self):
        return self.reduction(
            len, np.sum, token="len", meta=int, split_every=False
        ).compute()
```

Member Author

I had to dig for some old comments, because I remember having a similar conversation with @mrocklin back in the early days of __array_function__ support; please see #4543 (comment). That was the case for which Matt ultimately suggested we use the synchronous scheduler to compute _meta. But maybe I'm ignorant of this situation and perhaps we really don't need the sync scheduler here, given we're computing the actual CuPy data array and not just an empty one. I'm happy to do it either way; I'm just not confident I can make the best decision on my own for this particular use case.

Member

Thanks for the additional context @pentschev. Given that we're, as you mentioned, computing a full array and not just a small _meta array, this use case seems more in line with other implicit computes like len(ddf) and ddf.to_parquet(...), where we use the default scheduler. As a user, if I were using a distributed cluster, I would be surprised if a large implicit compute was triggered and it didn't use the cluster. @jakirkham do you have thoughts on this topic?

Member

Yeah, for context, Peter and I discussed this a bit upthread (#7162 (comment)).

Generally I must confess I still don't think using .compute is ideal (this can have high overhead as Matt also notes in the thread Peter referenced), but haven't been able to come up with a better solution atm. Peter has also since limited this to CuPy arrays, which avoids affecting other cases.

To the point of "sync" specifically, I think in the cases where we are using this function above (NumPy/CuPy arrays, empty lists, scalars, etc.) this seems like the right choice as they are small in-memory objects that we are merely testing out to find the expected result. So using "sync" seems like the right choice, but I could be overlooking something.

Personally I think we should flag this use of .compute in an issue and rework this after the release to avoid it (if at all possible). The bug it fixes is unfortunately pretty critical for NumPy 1.20 support. Anyways, this is just my opinion. Feel free to push back.
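(To make the sync-vs-default trade-off discussed above concrete, here is a minimal sketch, illustrative only and not the code in this PR:)

```python
import dask.array as da

x = da.arange(10, chunks=5)

# Synchronous scheduler: runs in the current thread, no cluster round trip.
# Reasonable for small, in-memory-sized results such as probing a _meta.
small = x[:2].compute(scheduler="sync")

# Default scheduler: uses whatever is configured (threads, processes, or a
# distributed cluster). Preferable when the implicit compute may be large,
# like len(ddf) or ddf.to_parquet(...).
full = x.compute()
```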

Member

> Generally I must confess I still don't think using .compute is ideal (this can have high overhead as Matt also notes in the thread Peter referenced), but haven't been able to come up with a better solution atm. Peter has also since limited this to CuPy arrays, which avoids affecting other cases.

Agreed. I like your idea of flagging this as an issue and trying to improve things where possible after the release.

> this seems like the right choice as they are small in-memory objects that we are merely testing out to find the expected result

That's a good point. The only case I could see where we might trigger a compute on a potentially large Dask array is when the first two arguments of da.percentile are both Dask arrays backed by CuPy arrays. That said, I'm happy to merge as-is and revisit post-release.
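(For concreteness, a rough sketch of that case; it assumes cupy is available, and whether da.percentile handles a Dask-array q in exactly this form is an assumption here — the point is only that both inputs are CuPy-backed Dask arrays:)

```python
import cupy
import dask.array as da

# Both the data and the requested percentiles are Dask arrays whose chunks
# (and _meta) are CuPy arrays.
a = da.from_array(cupy.random.random(1_000_000), chunks=100_000)
q = da.from_array(cupy.asarray([25.0, 50.0, 75.0]), chunks=3)

# In this scenario `q` (a potentially large, CuPy-backed Dask array) would
# be computed implicitly inside da.percentile -- the case discussed above.
result = da.percentile(a, q).compute()
```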

Member

Filed issue (#7181) on revisiting the compute call.

@pentschev (Member Author)

Thanks also @jrbourbeau for catching my mistake and fixing it in 29ab436.

@jakirkham requested a review from jrbourbeau on February 5, 2021 at 15:31
@jakirkham merged commit 996b506 into dask:master on Feb 5, 2021
@jakirkham (Member)

Went ahead and merged since, based on James' recent comment, it sounds like we are ok going ahead here 👍

Thanks Peter! Also thanks everyone for the reviews! 😄

Will open an issue about the compute call and we can follow up on that after the release 🙂

@jakirkham (Member)

> Will open an issue about the compute call and we can follow up on that after the release 🙂

Filed as issue (#7181). Let's follow up on this point there.

@pentschev (Member Author)

Thanks everyone for reviews and @jakirkham for merging! 😄
