-
-
Notifications
You must be signed in to change notification settings - Fork 11.1k
DOC: Add clarifications np.argpartition #26716
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
numpy/_core/fromnumeric.py
Outdated
is unstable, and hence the returned indices are not guaranteed | ||
to be the earliest/latest occurrence of the element. | ||
|
||
The treatment of ``np.nan`` in the input array is undefined. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Going to comment on the other issue also, but I don't think this is right. We sort them to the end and I am not aware of any plan to change that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure? Doesn't look like it to me:
>>> x = np.array([1, 3, 2, np.nan, 4, 0, np.nan, 3.5])
>>> np.partition(x, 1)
array([0. , 1. , 2. , nan, 4. , 3. , nan, 3.5])
>>> x[np.argpartition(x, 1)]
array([0. , 1. , 2. , nan, 4. , 3. , nan, 3.5])
Without having looked at how it's implemented under the hood, it looks to me like there are no guarantees at least the unsorted elements to the right of the k-th element.
In addtion, this looks like a bug:
>>> np.partition(x, 5)
array([0. , 1. , 2. , 3. , 3.5, 4. , nan, nan])
>&
8000
gt;> np.partition(x, 6)
array([0. , 1. , 2. , 3. , 3.5, 4. , nan, nan])
>>> np.partition(x, 7) # Incorrect result when `kth == x.size-1`
array([1. , 3. , 2. , nan, 4. , 0. , 3.5, nan])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I am sure as your last examples proof: The NaNs are all after or at the partition position.
And the last one is perfectly correct, since partition(x, 7)
means that only the last entry (position 7) has any defined order at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we're talking past each other - "after the partition position" isn't the same as "we sort them to the end". The former looks correct indeed.
So the documented behavior should be something like "nan
are treated as the largest possible values, larger than inf
even - they always end up to the right of the kth
partition element (unless there are more nan
's in the array than fit to the right of kth
)."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
8000Well, "unspecified" would mean that they could occur completely randomly in the result.
The fact that they are random within each half is already clearly specified by partition
itself! So yes, I think saying that their sort-order sorts them to the end is sufficient (which should be defined in sort
, but maybe is not very clearly defined there).
But, if you prefer you can also add the longer explanation, it might just become incorrect conceptually when thinking about sorting in reverse order.
(I.e. I still think that the sort-order being defined to sort them to the end is the better specification, although maybe not the easier explanation)
I've added a note about the sort order of I've also added some tests to verify this sort order. |
numpy/_core/fromnumeric.py
Outdated
to be the earliest/latest occurrence of the element. | ||
|
||
The sort order of ``np.nan`` is bigger than ``np.inf``. | ||
|
||
See `partition` for notes on the different selection algorithms. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this comment. Aren't we in the partition
function docstring? Does this docstring show up in a different function as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The sort order is defined in the docstring for np.sort
, but since np.partition
uses a different implementation, I thought it would be helpful to also define the sort order of np.partition
.
I've removed notes regarding the sort order from np.argpartition
and instead linked it to the docs for np.partition
in the latest commit, this follows what np.sort
and np.argsort
did.
Co-authored-by: Matti Picus <matti.picus@gmail.com>
Thanks @JuliaPoo |
Linking the PR for
np.top_k
#26666, since it is helpful for the docs fornp.top_k
to specify explicitly the treatment ofnp.nan
and the stability of the returned indices, and the behaviour ofnp.top_k
relies on the implementation ofnp.argpartition
, I suggest adding similar clarifications tonp.argpartition
docs.With regards to the treatment of
np.nan
, currentlynp.argpartition
does not explicitly handlenp.nan
and by the nature of its implementation,np.nan
is treated likenp.inf
. This seems like something that can change in the future so I wrote that the treatment ofnp.nan
is undefined. If this is not desirable, perhaps one can consider adding anp.nanargpartition
andnp.nanpartition
variant.Making this change also allows
np.top_k
to justly say that it followsnp.argpartition
's specifications.