8000 DOC: Add clarifications np.argpartition by JuliaPoo · Pull Request #26716 · numpy/numpy · GitHub
[go: up one dir, main page]

Skip to content

DOC: Add clarifications np.argpartition #26716

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Jul 1, 2024

Conversation

JuliaPoo
Copy link
Contributor
@JuliaPoo JuliaPoo commented Jun 17, 2024

Linking the PR for np.top_k #26666, since it is helpful for the docs for np.top_k to specify explicitly the treatment of np.nan and the stability of the returned indices, and the behaviour of np.top_k relies on the implementation of np.argpartition, I suggest adding similar clarifications to np.argpartition docs.

With regards to the treatment of np.nan, currently np.argpartition does not explicitly handle np.nan and by the nature of its implementation, np.nan is treated like np.inf. This seems like something that can change in the future so I wrote that the treatment of np.nan is undefined. If this is not desirable, perhaps one can consider adding a np.nanargpartition and np.nanpartition variant.

Making this change also allows np.top_k to justly say that it follows np.argpartition's specifications.

is unstable, and hence the returned indices are not guaranteed
to be the earliest/latest occurrence of the element.

The treatment of ``np.nan`` in the input array is undefined.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going to comment on the other issue also, but I don't think this is right. We sort them to the end and I am not aware of any plan to change that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure? Doesn't look like it to me:

>>> x = np.array([1, 3, 2, np.nan, 4, 0, np.nan, 3.5])
>>> np.partition(x, 1)
array([0. , 1. , 2. , nan, 4. , 3. , nan, 3.5])
>>> x[np.argpartition(x, 1)]
array([0. , 1. , 2. , nan, 4. , 3. , nan, 3.5])

Without having looked at how it's implemented under the hood, it looks to me like there are no guarantees at least the unsorted elements to the right of the k-th element.

In addtion, this looks like a bug:

>>> np.partition(x, 5)
array([0. , 1. , 2. , 3. , 3.5, 4. , nan, nan])
>&
8000
gt;> np.partition(x, 6)
array([0. , 1. , 2. , 3. , 3.5, 4. , nan, nan])
>>> np.partition(x, 7)  # Incorrect result when `kth == x.size-1`
array([1. , 3. , 2. , nan, 4. , 0. , 3.5, nan])

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I am sure as your last examples proof: The NaNs are all after or at the partition position.

And the last one is perfectly correct, since partition(x, 7) means that only the last entry (position 7) has any defined order at all.

Copy link
Member
@rgommers rgommers Jun 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we're talking past each other - "after the partition position" isn't the same as "we sort them to the end". The former looks correct indeed.

So the documented behavior should be something like "nan are treated as the largest possible values, larger than inf even - they always end up to the right of the kth partition element (unless there are more nan's in the array than fit to the right of kth)."

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

8000

Well, "unspecified" would mean that they could occur completely randomly in the result.
The fact that they are random within each half is already clearly specified by partition itself! So yes, I think saying that their sort-order sorts them to the end is sufficient (which should be defined in sort, but maybe is not very clearly defined there).

But, if you prefer you can also add the longer explanation, it might just become incorrect conceptually when thinking about sorting in reverse order.
(I.e. I still think that the sort-order being defined to sort them to the end is the better specification, although maybe not the easier explanation)

@JuliaPoo
Copy link
Contributor Author

I've added a note about the sort order of np.nan (that its greater than np.inf). Sometimes np.partition will default to sorting which has sort order of np.nan defined so I was a bit suspicious it is indeed defined for np.partition, but from my own testing it does seem to be.

I've also added some tests to verify this sort order.

to be the earliest/latest occurrence of the element.

The sort order of ``np.nan`` is bigger than ``np.inf``.

See `partition` for notes on the different selection algorithms.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this comment. Aren't we in the partition function docstring? Does this docstring show up in a different function as well?

Copy link
Contributor Author
@JuliaPoo JuliaPoo Jul 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sort order is defined in the docstring for np.sort, but since np.partition uses a different implementation, I thought it would be helpful to also define the sort order of np.partition.

I've removed notes regarding the sort order from np.argpartition and instead linked it to the docs for np.partition in the latest commit, this follows what np.sort and np.argsort did.

JuliaPoo and others added 2 commits July 1, 2024 14:37
@mattip mattip merged commit c21ac10 into numpy:main Jul 1, 2024
67 of 68 checks passed
@mattip
Copy link
Member
mattip commented Jul 1, 2024

Thanks @JuliaPoo

@rgommers rgommers added this to the 2.1.0 release milestone Jul 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants
0