-
-
Notifications
You must be signed in to change notification settings - Fork 10.8k
Feature request: Allow np.argmax to output top K maximum values #15128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Are you looking for argpartition? |
Hi, I just looked at np.argpartition. This method returns the kth smallest element. I was looking for a family of methods that will return top K smallest/largest elements, and top K smallest/largest indices in sorted order. It is possible to do it in numpy using argsort.
This stackoverflow question is asking for the same thing. |
This answer to the stackoverflow question that you linked explains how to use |
Hi, both the argsort() and the argpartition() methods allow me to extract top K elements. |
|
It seems like the desired behavior is achievable with existing tools. The next step in trying to push this forward would be to take the feature proposal to the mailing list (with a summary of the proposal, alternatives, and a link to this issue). |
I support top k as an argument for readability. While the argpartition method works, it's rather convoluted. |
Returning indices of K min or max elements would be great addition |
Hi, I'd like to take this issue–– will create a branch and PR as per the contrib docs. |
Please see gh-19117 and the associated mailing list discussion before starting any work. |
Okay so from what I've read in the mailing list there is consensus around creating 2 entirely new function: top_k(...) and argtop_k(...), which return the largest k values in order and their corresponding indices, respectively. With respect to @quarrying 's work in https://github.com/numpy/numpy/pull/19117, it seems like the last comment by @eric-wieser suggested breaking top_k into the aforementioned functions. Am I correct? If so I'll proceed with the edits. |
I'm linking the discussion on array-api for It seems that the consensus for such a function could look something like this: def top_k(
x: array,
k: int,
/,
*,
axis: Optional[int] = None,
mode: Literal["largest", "smallest"] = "largest",
) -> Tuple[array, array] I don't mind drafting an implementation for this if it helps further the discussion. Some notes about I think
Currently the default for
From experimenting with In addition to the current API, following torch.topk and tf.math.top_k, we can consider the def top_k(
x: array,
k: int,
/,
*,
axis: Optional[int] = -1,
mode: Literal["largest", "smallest"] = "largest",
sorted: bool = True,
out: Optional[array] = None
) -> Tuple[array, array] |
Given that the new function returns two arrays, I think this would have to be (like pytorch) an optional tuple of two arrays. Note neither |
What would also be quite useful is to add tests for the new function in https://github.com/data-apis/array-api-tests/, which can easily (even in CI) be run against PyTorch, JAX, etc. so that it's easy to verify which semantics (e.g., treatment of |
Following previous discussion at numpy#15128. I made a small change to the interface in the previous discussion by changing the `mode` keyword into a `largest` bool flag. This follows API such as from [torch.topk](https://pytorch.org/docs/stable/generated/torch.topk.html). Carrying from the previous discussion, a parameter might be useful is `sorted`. This is also implemented in `torch.topk`, and follows from previous work at numpy#19117. Co-authored-by: quarrying
Can we please clarify the NaN policy or at least voice opinions on it before adding this API? IMO there are two consistent choices:
Unfortunately, I really can't say I like the third choice as considering NaNs the largest element, which The current choices want to go with sorting (
(Yes, I understand that pushing NaNs to the end is annoying if you want to implement "top-k", because partition has no descending sort right now.) |
I'm personally fine with considering |
To add another point: The way I am suggesting to define it is also the way that dataframe libraries define |
Would that look like adding a kwarg that's similar to |
We discussed this today, and I think the leaning was that the pandas behavior is the useful one: It would be best to sort NaNs to the end (i.e. omit if possible) for both largest and smallest. EDIT: To be clear, I may be convinced by "it's too hard", but right now I don't know if it is. |
If we're special casing a bunch of types a possible implementation can be: def top_k(a, k, /, *, axis=-1, largest=True):
if k <= 0:
raise ValueError(f'k(={k}) provided must be positive.')
_arr = np.asanyarray(a)
nan_containing = [
np.clongdouble, np.complex128, np.complex64,
np.float16, np.float32, np.float64, np.longdouble]
to_negate = largest and _arr.dtype in nan_containing
if to_negate:
topk_values, topk_indices = top_k(-_arr, k, axis=axis, largest=False)
return -topk_values, topk_indices
# Actual implementation
... However, this will ignore a bunch of situations like |
Currently np.argmax, np.argmin, np.max and np.min return the maximum and minimum values respectively.
numpy.argmax(a, axis=None, out=None)
Often times we need to get the topKmax or topKmin values, mostly in the case of machine learning workloads where you need top K classes for a given input.
Requesting support for these scenarios in numpy. This scenario could be supported by np.max (and other functions), or there could be a separate family of methods np.maxK, minK etc?
The text was updated successfully, but these errors were encountered: