-
-
Notifications
You must be signed in to change notification settings - Fork 11k
ENH: Add an "axis" kwarg to numpy.unique #3584
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
# a structured array, but only works if array is contiguous) | ||
# Inspired by <http://stackoverflow.com/a/16973510/325565> | ||
new_itemsize = ar.dtype.itemsize * ar.shape[1] | ||
consolidated = ar.view(np.dtype((np.void, new_itemsize))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have an opinion about the idea as such, it should probably be brought up on the list. This is too simple however. Inexact types/objects cannot be compared bitwise. If you want to do optimizations by using void, do it only for types where you know that bitwise equality is the same as value equality. (I guess all integer types should be safe, but not even booleans)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@seberg - Thanks for the catch on the bitwise comparisons! I clearly didn't think it through quite clearly enough.
I'll run my suggestion by the list and then update it to fall back on a structured array for dtypes other than int, float, etc. (Float bitwise equality should be safe, beyond the usual caveats with float equality, right?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, luckily, np.unique is mostly useful for ints in the first place ;). Ignoring possible problems with NaNs (at least the python fallback is not consistent here either), there is still the issue of -0.
and 0.
not being the same. Not sure if that can be worked around somehow; np.longdouble
certainly cannot be trusted, though (longdouble is not standardized, can be a composite, include random bits...).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for clarification, I (finally) got around to doing what I said I would do and implemented this correctly and sent an e-mail out to the list. I squashed the commits back together for easier viewing, thus the single commit and outdated diff.
The void optimization is now only used for ints and strings. All other dtypes (including all floats) fall back to a structured array, which is slower, but should be correct.
Thanks again for catching that!
@seberg Want to finish this off ;) |
Yeah, I don't think there was a lot of interest in adding this, based on the reaction on the list. It arguably would cause more confusion than good. I'm still all for adding it, but I'm biased, obviously! :) |
I didn't mean kill it. I meant take it to conclusion, whatever that was ;) |
Ah! Sorry for the confusion! |
@@ -159,6 +170,45 @@ def unique(ar, return_index=False, return_inverse=False): | |||
array([1, 2, 6, 4, 2, 3, 2]) | |||
|
|||
""" | |||
if axis is None or ar.ndim == 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe just remove the or? If axis is not None, it might still be wrong, and it should be an error.
There is also a merge conflict, probably style changes in the tests or so. Generally looks good to me, and considering that this kind of thing seems to crop up more often. The only real thing I am wondering about is whether someone has a great idea for how to call the argument ;). |
General travis error, closing and reopening. |
Nvm, they seem to ignore that, but on the other hand travis now has a bigger button to restart all jobs ;). |
@seberg - Sorry for the delay! (and for the failing pull request) I've been in the middle of a move over the past few weeks and got sidetracked for awhile. I fixed the failing tests (sorry again for those), but I think I'm going to go ahead and close this pull request. It didn't have much support on the mailing list in its current form. Having this functionality as a separate function seemed to have broader support, but namespace pollution is always a concern. I may run it back by the list as a "unique_rows" function, but in that case it's probably cleaner to make another pull requests. Any thoughts? Thanks again for looking over things, regardless! |
Well, I am not quite sure about keyword vs. name (and what exactly). But the thing seems to crop up once in a while, so it might make sense, and those homebrew versions are likely to forget the subtleties you fixed... |
I, for one, would really like to see this feature in numpy 1.10! Were there any show-stopping concerns about this implementation? I get the impression from the comments that the outstanding issues were mostly API and documentation-related. |
@alimuldal, I do not remember the discussion on the list. Maybe you can have a look at this. We probably had most of the techincal issues already addressed it would seem, so yes. Maybe all this needs is another revival (i.e. from you), but maybe first have a look at the old discussion and see what the result was (if there was any), or just revive it on the mailing list. |
@alimuldal & @seberg Here's the list discussion: http://numpy-discussion.10968.n7.nabble.com/Adding-an-axis-argument-to-numpy-unique-td34841.html If I recall correctly, it was mostly just ambiguity around naming vs. kwargs. I didn't want to push something ahead too hard without a clear consensus. Also related (along with other functionality): |
@joferkington Thanks, I'll take a look |
Related issue from a couple of years ago: #2871 |
I would also like to see this feature in numpy. |
@joferkington As I was saying the other day in another thread, I think it would be a great functionality to have. I personally support the idea of just adding an "axis" parameter. I don't think it would be ambiguous: in all the functions that have one, the given axis is the one that ''disappears'' in after the operation, so it would be quite easy to remember. Are you still interested in taking this forward? Otherwise I may look into it. |
Description
Adds an
axis
kwarg tonumpy.unique
so thatunique
can easily be used to find unique rows/columns/sub-arrays/etc of a larger ndarray.Rationale
There is often a need to find unique rows/columns/etc of an array. However,
numpy.unique
currently operates only on the flattened array, and returns unique elements. It is possible to work around this in several ways (e.g. viewing each row/column/etc as a structured array or a void dtype the size of the row/column in question). However, each of these work-arounds requires a fairly in-depth knowledge of numpy. Having anaxis
kwarg tounique
makes this operation much simpler.Examples