Add percentile support for NEP-35 #7162


Merged · 10 commits into dask:master · Feb 5, 2021

Conversation

pentschev (Member)

This PR is a slim version of #6738 with two purposes:

  1. Unblock rapidsai/cudf#7289 ([BUG] dask-cudf .describe() broken with NumPy 1.20) for NumPy >= 1.20;
  2. Serve as a first push of NEP-35 into Dask, hopefully a much easier review task than the full Support for NEP-35 (#6738).

Given this is a blocker for RAPIDS 0.18, to be released in two weeks, it would be great if it could make its way into the upcoming Dask release on Friday.
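(For readers unfamiliar with NEP-35: NumPy 1.20 adds a like= argument to array-creation functions so that creation dispatches on the type of a reference array. A minimal sketch of the idea, illustrative only and not code from this PR:)

```python
import numpy as np

# NEP-35 (NumPy >= 1.20): `like=` dispatches array creation to the library
# that owns the reference array, so with a CuPy reference the result would
# be a CuPy array instead of a NumPy one.
ref = np.empty(())                        # imagine a Dask array's _meta here
out = np.asarray([25, 50, 75], like=ref)  # created "like" the reference
```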

@jakirkham (Member) left a comment

Thanks Peter! Had a few questions below

@shwina (Contributor) commented Feb 3, 2021

Confirmed that this fixes rapidsai/cudf#7289 with both NumPy 1.20 and 1.19. Thanks @pentschev!

@jakirkham (Member) left a comment

One last suggestion. Otherwise LGTM

@jrbourbeau (Member) left a comment

I pushed a commit to update the pytest.mark.skipif condition (hope that's okay @pentschev). Also left a couple of small comments, but they shouldn't block merging this PR

```python
    return da_func(a, **kwargs)
elif isinstance(a, Array):
    if _is_cupy_type(a._meta):
        a = a.compute(scheduler="sync")
```
Member

I'm curious why we're using the synchronous scheduler here. I would have expected us to dispatch to the default scheduler. This is what we do in other situations where we need to trigger a compute, like computing len of a Dask DataFrame:

dask/dask/dataframe/core.py, lines 550 to 553 at 2632bbc:

```python
    def __len__(self):
        return self.reduction(
            len, np.sum, token="len", meta=int, split_every=False
        ).compute()
```

Member Author

I had to dig for some old comments, because I remember having a similar conversation with @mrocklin back in the early days of __array_function__ support; please see #4543 (comment). That was the case for which Matt ultimately suggested we use the synchronous scheduler to compute _meta. But maybe I'm ignorant of this situation and perhaps we really don't need the sync scheduler here, given we're computing the actual CuPy data array and not just an empty one. I'm happy to do it either way; I'm just not confident I can make the best decision on my own for this particular use case.

Member

Thanks for the additional context @pentschev. Given that we're, as you mentioned, computing a full array and not just a small _meta array, this use case seems more in line with other implicit computes like len(ddf) and ddf.to_parquet(...), where we use the default scheduler. As a user, if I were using a distributed cluster, I would be surprised if a large implicit compute was triggered and it didn't use the cluster. @jakirkham do you have thoughts on this topic?

Member

Yeah, for context, Peter and I discussed this a bit upthread (#7162 (comment)).

Generally I must confess I still don't think using .compute is ideal (this can have high overhead as Matt also notes in the thread Peter referenced), but haven't been able to come up with a better solution atm. Peter has also since limited this to CuPy arrays, which avoids affecting other cases.

To the point of "sync" specifically, I think in the cases where we are using this function above (NumPy/CuPy arrays, empty lists, scalars, etc.) this seems like the right choice as they are small in-memory objects that we are merely testing out to find the expected result. So using "sync" seems like the right choice, but I could be overlooking something.

Personally I think we should flag this use of .compute in an issue and rework this after the release to avoid it (if at all possible). The bug it fixes is unfortunately pretty critical for NumPy 1.20 support. Anyways, this is just my opinion. Feel free to push back.
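(To make the sync-vs-default trade-off discussed above concrete, here is a minimal sketch, illustrative only and not the code in this PR:)

```python
import dask.array as da

x = da.arange(10, chunks=5)

# Synchronous scheduler: runs in the current thread, no cluster round trip.
# Reasonable for small, in-memory-sized results such as probing a _meta.
small = x[:2].compute(scheduler="sync")

# Default scheduler: uses whatever is configured (threads, processes, or a
# distributed cluster). Preferable when the implicit compute may be large,
# like len(ddf) or ddf.to_parquet(...).
full = x.compute()
```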

Member

> Generally I must confess I still don't think using .compute is ideal (this can have high overhead as Matt also notes in the thread Peter referenced), but haven't been able to come up with a better solution atm. Peter has also since limited this to CuPy arrays, which avoids affecting other cases.

Agreed. I like your idea of flagging this as an issue and trying to improve things where possible after the release.

> this seems like the right choice as they are small in-memory objects that we are merely testing out to find the expected result

That's a good point. The only case I could see where we might trigger a compute on a potentially large Dask array is when the first two arguments of da.percentile are both Dask arrays backed by CuPy arrays. That said, I'm happy to merge as-is and revisit post-release.
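(For concreteness, a rough sketch of that case; it assumes cupy is available, and whether da.percentile handles a Dask-array q in exactly this form is an assumption here — the point is only that both inputs are CuPy-backed Dask arrays:)

```python
import cupy
import dask.array as da

# Both the data and the requested percentiles are Dask arrays whose chunks
# (and _meta) are CuPy arrays.
a = da.from_array(cupy.random.random(1_000_000), chunks=100_000)
q = da.from_array(cupy.asarray([25.0, 50.0, 75.0]), chunks=3)

# In this scenario `q` (a potentially large, CuPy-backed Dask array) would
# be computed implicitly inside da.percentile -- the case discussed above.
result = da.percentile(a, q).compute()
```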

Member

Filed issue (#7181) on revisiting the compute call.

@pentschev (Member Author)

Thanks also @jrbourbeau for catching my mistake and fixing it in 29ab436.

@jakirkham requested a review from jrbourbeau on February 5, 2021 at 15:31
@jakirkham merged commit 996b506 into dask:master on Feb 5, 2021
@jakirkham (Member)

Went ahead and merged since, based on James' recent comment, it sounds like we are ok going ahead here 👍

Thanks Peter! Also thanks everyone for the reviews! 😄

Will open an issue about the compute call and we can follow up on that after the release 🙂

@jakirkham (Member)

> Will open an issue about the compute call and we can follow up on that after the release 🙂

Filed as issue (#7181). Let's follow up on this point there.

@pentschev (Member Author)

Thanks everyone for reviews and @jakirkham for merging! 😄
