Add percentile support for NEP-35 #7162
Conversation
Thanks Peter! Had a few questions below
Confirm that this fixes rapidsai/cudf#7289 with both NumPy 1.20 and 1.19. Thanks @pentschev!
One last suggestion. Otherwise LGTM
I pushed a commit to update the `pytest.mark.skipif` condition (hope that's okay @pentschev). Also left a couple of small comments, but they shouldn't block merging this PR.
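For context, such a marker typically looks roughly like the sketch below. This is a minimal illustration only; the exact helper and condition used in the PR may differ (Dask has its own NumPy version helpers).

```python
import numpy as np
import pytest
from packaging.version import parse as parse_version

# Hypothetical condition: skip when NumPy is older than 1.20, where the
# NEP-35 `like=` keyword is not available.
NUMPY_GE_120 = parse_version(np.__version__) >= parse_version("1.20")


@pytest.mark.skipif(not NUMPY_GE_120, reason="NEP-35 requires NumPy >= 1.20")
def test_percentile_nep35():
    ...
```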
```python
    return da_func(a, **kwargs)
elif isinstance(a, Array):
    if _is_cupy_type(a._meta):
        a = a.compute(scheduler="sync")
```
I'm curious why we're using the synchronous scheduler here. I would have expected us to dispatch to the default scheduler. This is what we do in other situations where we need to trigger a compute, like computing `len` of a Dask DataFrame:
Lines 550 to 553 in 2632bbc
```python
def __len__(self):
    return self.reduction(
        len, np.sum, token="len", meta=int, split_every=False
    ).compute()
```
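For readers following along, a small illustration of the implicit compute being referenced (not code from this PR): `len(ddf)` runs on whichever scheduler is currently configured, e.g. a distributed client if one is active.

```python
import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(pd.DataFrame({"x": range(10)}), npartitions=2)

# This triggers an implicit compute via the default scheduler (threads by
# default, or a distributed cluster if a Client has been created).
n = len(ddf)
```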
I had to dig for some old comments, because I remember having a similar conversation with @mrocklin back in the early days of `__array_function__` support. Please see #4543 (comment). That was the case for which Matt suggested, in the end, that we use the synchronous scheduler to compute `_meta`. But maybe I'm missing something about this situation, and perhaps we really don't need the sync scheduler here, given that we're computing the actual CuPy data array and not just an empty one. I'm happy to do it either way; I'm just not confident I can make the best decision on my own for this particular use case.
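To make the distinction in this thread concrete, a rough sketch (illustrative only; the array and slicing here are made up and not taken from the PR):

```python
import dask.array as da

x = da.ones((1000, 1000), chunks=(100, 100))

# Computing a tiny/empty slice, analogous to materializing `_meta`: the
# synchronous scheduler keeps this cheap and local, which was the earlier
# rationale for scheduler="sync".
empty = x[:0, :0].compute(scheduler="sync")

# Computing the full array: a potentially large implicit compute, where
# using the default (possibly distributed) scheduler matters.
full = x.compute()
```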
Thanks for the additional context @pentschev. Given that we're, as you mentioned, computing a full array and not just a small `_meta` array, this use case seems more in line with other implicit computes like `len(ddf)` and `ddf.to_parquet(...)`, where we use the default scheduler. As a user, if I were using a distributed cluster, I would be surprised if a large implicit compute was triggered and it didn't use the cluster. @jakirkham do you have thoughts on this topic?
Yeah, for context, Peter and I discussed this a bit upthread (#7162 (comment)).
Generally I must confess I still don't think using `.compute` is ideal (it can have high overhead, as Matt also notes in the thread Peter referenced), but I haven't been able to come up with a better solution atm. Peter has also since limited this to CuPy arrays, which avoids affecting other cases.
On the point of `"sync"` specifically: in the cases where we use this function above (NumPy/CuPy arrays, empty `list`s, scalars, etc.), these are small in-memory objects that we are merely testing to find the expected result. So using `"sync"` seems like the right choice, but I could be overlooking something.
Personally I think we should flag this use of `.compute` in an issue and rework it after the release to avoid it (if at all possible). The bug it fixes is unfortunately pretty critical for NumPy 1.20 support. Anyway, this is just my opinion; feel free to push back.
> Generally I must confess I still don't think using `.compute` is ideal (it can have high overhead, as Matt also notes in the thread Peter referenced), but I haven't been able to come up with a better solution atm. Peter has also since limited this to CuPy arrays, which avoids affecting other cases.
Agreed. I like your idea of flagging this as an issue and trying to improve things where possible after the release.
> these are small in-memory objects that we are merely testing to find the expected result
That's a good point. The only case I could see where we might trigger a compute on a potentially large Dask array is when the first two arguments of `da.percentile` are both Dask arrays backed by CuPy arrays. That said, I'm happy to merge as-is and revisit post-release.
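As a rough sketch of that corner case (assumes a CUDA-capable environment with CuPy installed; not code from this PR):

```python
import cupy
import dask.array as da

# Both the data and the requested percentiles are Dask arrays backed by
# CuPy, which is the situation where an implicit compute of `q` could
# involve more than a trivial amount of data.
a = da.from_array(cupy.random.random(10_000), chunks=1_000)
q = da.from_array(cupy.array([25.0, 50.0, 75.0]), chunks=3)

result = da.percentile(a, q)
print(result.compute())
```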
Filed an issue (#7181) on revisiting the `compute` call.
Thanks also @jrbourbeau for catching my mistake and fixing it in 29ab436.
Went ahead and merged, since based on James' recent comment it sounds like we are ok going ahead here 👍 Thanks Peter! Also thanks everyone for the reviews! 😄 Will open an issue about the `compute` call.
Filed as issue (#7181). Let's follow up on this point there.
Thanks everyone for the reviews, and @jakirkham for merging! 😄
This PR is a slim version of #6738 with two purposes:
- Fix .describe() broken with NumPy 1.20 (rapidsai/cudf#7289) with NumPy >= 1.20;

Given this is a blocker for RAPIDS 0.18, to be released in two weeks, it would be great if it could make its way into the upcoming Dask release on Friday.
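For readers unfamiliar with NEP-35, here is a minimal illustration of the mechanism the PR builds on (not code from this PR; requires NumPy >= 1.20):

```python
import numpy as np

reference = np.ones(3)

# With NEP-35, array creation functions accept a `like=` argument so the
# output type follows the reference array. If `reference` were a CuPy
# array, this would return a CuPy array instead of a NumPy one.
out = np.arange(5, like=reference)
```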