8000 Feature request: Allow np.argmax to output top K maximum values · Issue #15128 · numpy/numpy · GitHub
[go: up one dir, main page]

Skip to content

Feature request: Allow np.argmax to output top K maximum values #15128

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
regstrtn opened this issue Dec 19, 2019 · 21 comments
Open

Feature request: Allow np.argmax to output top K maximum values #15128

regstrtn opened this issue Dec 19, 2019 · 21 comments
Labels
01 - Enhancement 62 - Python API Changes or additions to the Python API. Mailing list should usually be notified.

Comments

@regstrtn
Copy link

Currently np.argmax, np.argmin, np.max and np.min return the maximum and minimum values respectively.
numpy.argmax(a, axis=None, out=None)
Often times we need to get the topKmax or topKmin values, mostly in the case of machine learning workloads where you need top K classes for a given input.
Requesting support for these scenarios in numpy. This scenario could be supported by np.max (and other functions), or there could be a separate family of methods np.maxK, minK etc?

@eric-wieser
Copy link
Member

Are you looking for argpartition?

@regstrtn
Copy link
Author
regstrtn commented Dec 19, 2019

Hi, I just looked at np.argpartition. This method returns the kth smallest element. I was looking for a family of methods that will return top K smallest/largest elements, and top K smallest/largest indices in sorted order. It is possible to do it in numpy using argsort.

In [1]: import numpy as np

In [2]: arr = np.array([1, 3, 2, 4, 5])

In [3]: arr.argsort()[-3:][::-1]
Out[3]: array([4, 3, 1])

This stackoverflow question is asking for the same thing.
https://stackoverflow.com/questions/6910641/how-do-i-get-indices-of-n-maximum-values-in-a-numpy-array

@WarrenWeckesser
Copy link
Member
WarrenWeckesser commented Dec 19, 2019

This answer to the stackoverflow question that you linked explains how to use argpartition to get the top K values. It also shows how to sort the top K values once you've found them. Does that not answer your question?

@regstrtn
Copy link
Author

Hi, both the argsort() and the argpartition() methods allow me to extract top K elements.
My request was that since this is a common scenario, could we support this using a single method call? User simply calls np.argmaxK(arr, K) & np.maxK(arr, K) and gets the K largest elements in sorted order.

@mproszewska
Copy link
Contributor
mproszewska commented Dec 24, 2019

argpartition can also be used instead of current argmax, so I don't think that existence of argpartition is an argument against adding proposed parameter to argmax. The question is, whether this is needed addition.
If so, It would probably require changes in all argmax/argmin functions (in numpy/core/src/multiarray/arraytypes.c). I don't think it's worth it, but If numpy team thinks differently, I could probably create PR.
Maybe different approach? e.g. calling argmax/argmin multiple times, changes on Python level?.

@rossbar rossbar changed the title Allow np.argmax to output top K maximum values Feature request: Allow np.argmax to output top K maximum values Jul 18, 2020
@rossbar rossbar added 01 - Enhancement 62 - Python API Changes or additions to the Python API. Mailing list should usually be notified. 57 - Close? Issues which may be closable unless discussion continued labels Jul 18, 2020
@rossbar
Copy link
Contributor
rossbar commented Jul 18, 2020

It seems like the desired behavior is achievable with existing tools. The next step in trying to push this forward would be to take the feature proposal to the mailing list (with a summary of the proposal, alternatives, and a link to this issue).

@andrewjong
Copy link

I support top k as an argument for readability. While the argpartition method works, it's rather convoluted.

@skrbnv
Copy link
skrbnv commented Jun 24, 2021

Returning indices of K min or max elements would be great addition

@akorman128
Copy link

Hi, I'd like to take this issue–– will create a branch and PR as per the contrib docs.

@rkern
Copy link
Member
rkern commented Apr 12, 2022

Please see gh-19117 and the associated mailing list discussion before starting any work.

@akorman128
Copy link

Okay so from what I've read in the mailing list there is consensus around creating 2 entirely new function: top_k(...) and argtop_k(...), which return the largest k values in order and their corresponding indices, respectively.

With respect to @quarrying 's work in https://github.com/numpy/numpy/pull/19117, it seems like the last comment by @eric-wieser suggested breaking top_k into the aforementioned functions.

Am I correct? If so I'll proceed with the edits.

@JuliaPoo
Copy link
Contributor
JuliaPoo commented Jun 11, 2024

I'm linking the discussion on array-api for top_k and related functions at data-apis/array-api#722, and the discussions on the mailing list.

It seems that the consensus for such a function could look something like this:

def top_k(
    x: array,
    k: int,
    /,
    *,
    axis: Optional[int] = None,
    mode: Literal["largest", "smallest"] = "largest",
) -> Tuple[array, array]

I don't mind drafting an implementation for this if it helps further the discussion. Some notes about top_k specifically for numpy.

I think top_k implementation can simply call argpartition internally, and we can let that decide a bunch of details about top_k. With reference to the discussion on array-api:

Should we be more strict in specifying what should happen when k exceeds the number of elements?

argpartition does raise a ValueError when that happens.

Are we okay with None being the default for axis, where the default behavior is searching over a flattened array?

Currently the default for argparitition is axis=-1, which performs the partitioning along the last axis rather than flattening. I think top_k for now can have a default axis=-1 similarly until consensus is reached.

Something that seems missing from this spec is to specify the handling of NaN values.

From experimenting with np.max and np.min it seems numpy returns false for any comparison with nan. Since argpartition uses this comparison as well it doesn't seem like argpartition is particularly intentional in its treatment of nan.

In addition to the current API, following torch.topk and tf.math.top_k, we can consider the out keyword to specify an output array, and the sorted keyword. Putting it together can look something like:

def top_k(
    x: array,
    k: int,
    /,
    *,
    axis: Optional[int] = -1,
    mode: Literal["largest", "smallest"] = "largest",
    sorted: bool = True,
    out: Optional[array] = None
) -> Tuple[array, array]

@mattip
Copy link
Member
mattip commented Jun 11, 2024

we can consider the out keyword to specify an output array, ...

Given that the new function returns two arrays, I think this would have to be (like pytorch) an optional tuple of two arrays. Note neither partition nor argpartition take a out kwarg (they are part of the "sorting" family of functions, not ufuncs), so it might be unwieldy to try to implement an out kwarg.

@rgommers
Copy link
Member

Thanks @JuliaPoo, having a concrete implementation to try out will be helpful here. There may also be something to reuse from gh-19117, e.g. the __array_function__ dispatcher.

@rgommers
Copy link
Member

What would also be quite useful is to add tests for the new function in https://github.com/data-apis/array-api-tests/, which can easily (even in CI) be run against PyTorch, JAX, etc. so that it's easy to verify which semantics (e.g., treatment of nan values), or defaulting to largest/smallest, are consistent across libraries and which ones aren't.

JuliaPoo added a commit to JuliaPoo/numpy that referenced this issue Jun 12, 2024
Following previous discussion at numpy#15128. I made a small change
to the interface in the previous discussion by changing the
`mode` keyword into a `largest` bool flag. This follows API
such as from
[torch.topk](https://pytorch.org/docs/stable/generated/torch.topk.html).

Carrying from the previous discussion, a parameter might be useful is
`sorted`. This is also implemented in `torch.topk`, and follows from
previous work at numpy#19117.

Co-authored-by: quarrying
@seberg
Copy link
Member
seberg commented Jul 1, 2024

Can we please clarify the NaN policy or at least voice opinions on it before adding this API? IMO there are two consistent choices:

  • Go with sorting, and push NaNs to the end, so that NaNs are not included in the result. That also gives the same result as nanmax/nanmin with N=1.
  • Go with min/max and always include NaNs, so that the function with N=1 gives the same result as max/argmax.

Unfortunately, I really can't say I like the third choice as considering NaNs the largest element, which top_k does in any of the suggested implementations. Conceptually, I cannot think of any logic by which that would make sense.
And sure, one can purposefully decide it is undefined, although users will rely on it whether we like it or not.

The current choices want to go with sorting (nanmax), but we don't define what descending sorts should do (i.e. finding maximum values first):

  • descending_sort(x) == ascending_sort(x)[::-1] except this fails for stable sorts. This would give the conceptually inconsistent behavior that NaNs are not pushed to the end for being unordered but rather defined as the "largest value".
  • -descending_sort(-x) == ascending_sort(x) (pushes NaNs to the end always; not actually a valid implementation for integers due to -x behavior)

(Yes, I understand that pushing NaNs to the end is annoying if you want to implement "top-k", because partition has no descending sort right now.)

@JuliaPoo
Copy link
Contributor
JuliaPoo commented Jul 2, 2024

I'm personally fine with considering np.nan as the largest element, mostly because it follows the sort order in other numpy apis like np.sort/np.argsort and np.partition/np.argpartition, and users might reasonably expect top_k to match the output of sort.

@seberg
Copy link
Member 8000
seberg commented Jul 2, 2024

To add another point: The way I am suggesting to define it is also the way that dataframe libraries define top_k (e.g. nsmallest and nlargest in pandas) and sort order.
And those library authors care about NaN (or NULL value) semantics.

@JuliaPoo
Copy link
Contributor
JuliaPoo commented Jul 2, 2024

Would that look like adding a kwarg that's similar to na_position in pd.sort_values? Or maybe something like ignore_nan:bool? Specifying the np.nan position/ignoring them will probably take slightly longer than the current implementation but shouldn't affect it too much. Though idt there's a python way to do this in a performant way so it'll have to be a C implementation.

@seberg
Copy link
Member
seberg commented Jul 3, 2024

We discussed this today, and I think the leaning was that the pandas behavior is the useful one: It would be best to sort NaNs to the end (i.e. omit if possible) for both largest and smallest.
I don't know how simple that is in the sorting code, although, you can probably work around in python semi-reasonably (by special casing types that can contain NaNs).

EDIT: To be clear, I may be convinced by "it's too hard", but right now I don't know if it is.

@JuliaPoo
Copy link
Contributor
JuliaPoo commented Jul 4, 2024

If we're special casing a bunch of types a possible implementation can be:

def top_k(a, k, /, *, axis=-1, largest=True):
    if k <= 0:
        raise ValueError(f'k(={k}) provided must be positive.')

    _arr = np.asanyarray(a)

    nan_containing = [
        np.clongdouble, np.complex128, np.complex64,
        np.float16, np.float32, np.float64, np.longdouble]
    to_negate = largest and _arr.dtype in nan_containing
    if to_negate:
        topk_values, topk_indices = top_k(-_arr, k, axis=axis, largest=False)
        return -topk_values, topk_indices

    # Actual implementation
    ...

However, this will ignore a bunch of situations like np.array([np.uint8(1), np.nan], dtype=object) where np.nan is again treated as the biggest element. I personally don't see any way one could come across these special situations though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
01 - Enhancement 62 - Python API Changes or additions to the Python API. Mailing list should usually be notified.
Projects
None yet
Development

No branches or pull requests

0