ENH: Add an "axis" kwarg to numpy.unique by joferkington · Pull Request #3584 · numpy/numpy · GitHub

ENH: Add an "axis" kwarg to numpy.unique #3584


Closed

Conversation

joferkington
Contributor

Description

Adds an axis kwarg to numpy.unique so that it can easily be used to find the unique rows/columns/sub-arrays/etc. of a larger ndarray.

Rationale

There is often a need to find the unique rows/columns/etc. of an array. However, numpy.unique currently operates only on the flattened array and returns unique elements. It is possible to work around this in several ways (e.g. viewing each row/column/etc. as a structured array, or as a single void item the size of the row/column in question). However, each of these work-arounds requires a fairly in-depth knowledge of numpy. Having an axis kwarg on unique makes this operation much simpler.
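For context, this is roughly what one such workaround looks like (an illustrative sketch assuming a 2-D, C-contiguous array with a plain integer dtype; the variable names here are only for illustration):

import numpy as np

a = np.array([[1, 0, 1], [0, 1, 0], [0, 1, 0], [0, 1, 0]])

# View each row as a single opaque np.void item so whole rows can be
# compared at once, then map the unique void items back to rows.
row_view = np.ascontiguousarray(a).view(
    np.dtype((np.void, a.dtype.itemsize * a.shape[1])))
_, idx = np.unique(row_view, return_index=True)
unique_rows = a[np.sort(idx)]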

Examples

In [1]: import numpy as np

In [2]: a = np.array([[1, 0, 1], [0, 1, 0], [0, 1, 0], [0, 1, 0]])

In [3]: a
Out[3]: 
array([[1, 0, 1],
       [0, 1, 0],
       [0, 1, 0],
       [0, 1, 0]])

In [4]: np.unique(a)
Out[4]: array([0, 1])

In [5]: np.unique(a, axis=0)
Out[5]: 
array([[0, 1, 0],
       [1, 0, 1]])

In [6]: np.unique(a, axis=1)
Out[6]: 
array([[0, 1],
       [1, 0],
       [1, 0],
       [1, 0]])

In [7]: b = np.dstack([a]*3).T

In [8]: b
Out[8]: 
array([[[1, 0, 0, 0],
        [0, 1, 1, 1],
        [1, 0, 0, 0]],

       [[1, 0, 0, 0],
        [0, 1, 1, 1],
        [1, 0, 0, 0]],

       [[1, 0, 0, 0],
        [0, 1, 1, 1],
        [1, 0, 0, 0]]])

In [9]: np.unique(b, axis=0)
Out[9]: 
array([[[1, 0, 0, 0],
        [0, 1, 1, 1],
        [1, 0, 0, 0]]])

# a structured array, but only works if array is contiguous)
# Inspired by <http://stackoverflow.com/a/16973510/325565>
new_itemsize = ar.dtype.itemsize * ar.shape[1]
consolidated = ar.view(np.dtype((np.void, new_itemsize)))
Member

I don't have an opinion about the idea as such; it should probably be brought up on the list. This is too simple, however. Inexact types/objects cannot be compared bitwise. If you want to optimize by using void, do it only for types where you know that bitwise equality is the same as value equality. (I guess all integer types should be safe, but not even booleans.)

Contributor Author

@seberg - Thanks for the catch on the bitwise comparisons! I clearly didn't think it through carefully enough.

I'll run my suggestion by the list and then update it to fall back on a structured array for dtypes other than int, float, etc. (Float bitwise equality should be safe, beyond the usual caveats with float equality, right?)

Member

Well, luckily, np.unique is mostly useful for ints in the first place ;). Ignoring possible problems with NaNs (at least the python fallback is not consistent here either), there is still the issue that -0. and 0. are not the same bitwise. Not sure if that can be worked around somehow; np.longdouble certainly cannot be trusted, though (longdouble is not standardized, can be a composite, include random bits...).
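To make that concrete, here is a small illustration (not from the patch) of how value comparison and byte-level comparison of floats disagree in both directions:

import numpy as np

x = np.array([0.0, -0.0, np.nan, np.nan])

# Value semantics: 0.0 == -0.0 is True, nan == nan is False.
print(x[0] == x[1], x[2] == x[3])           # True False

# Byte-level semantics give the opposite answers:
print(x[0].tobytes() == x[1].tobytes())     # False (sign bit differs)
print(x[2].tobytes() == x[3].tobytes())     # True  (same NaN bit pattern)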

Contributor Author

Just for clarification, I (finally) got around to doing what I said I would: I implemented this correctly and sent an e-mail out to the list. I squashed the commits back together for easier viewing, hence the single commit and outdated diff.

The void optimization is now only used for ints and strings. All other dtypes (including all floats) fall back to a structured array, which is slower, but should be correct.

Thanks again for catching that!
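For anyone following along, a rough sketch of what a structured-array fallback of this kind can look like (illustrative only; this is not the exact code in the branch, and the helper name is hypothetical):

import numpy as np

def unique_rows_structured(ar):
    # View each row of a 2-D array as one structured element with a field
    # per column, so comparisons use each column's normal value semantics
    # (e.g. -0.0 == 0.0 for floats) rather than raw bytes.
    ar = np.ascontiguousarray(ar)
    fields = [('f%d' % i, ar.dtype) for i in range(ar.shape[1])]
    structured = ar.view(fields).ravel()
    _, idx = np.unique(structured, return_index=True)
    return ar[np.sort(idx)]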

@charris
Member
charris commented Aug 29, 2013

@seberg Want to finish this off ;)

@joferkington
Contributor Author

Yeah, I don't think there was a lot of interest in adding this, based on the reaction on the list.

It arguably would cause more confusion than good. I'm still all for adding it, but I'm biased, obviously! :)

@charris
Member
charris commented Aug 30, 2013

I didn't mean kill it. I meant take it to conclusion, whatever that was ;)

@joferkington
Contributor Author

Ah! Sorry for the confusion!

@@ -159,6 +170,45 @@ def unique(ar, return_index=False, return_inverse=False):
array([1, 2, 6, 4, 2, 3, 2])

"""
if axis is None or ar.ndim == 1:
Member

Maybe just remove the or? If axis is not None, it might still be wrong, and it should be an error.
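Concretely, the suggested behaviour would be something like the following hypothetical check (a sketch, not code from this PR): an explicit axis is always validated, even for a 1-d input, instead of being silently treated like axis=None.

def _validate_axis(ar, axis):
    # Only axis=None takes the flattened fast path; an explicit axis is
    # checked against ar.ndim rather than falling through when ar.ndim == 1.
    if axis is None:
        return None
    if not (-ar.ndim <= axis < ar.ndim):
        raise ValueError("axis %d is out of bounds for an array of "
                         "dimension %d" % (axis, ar.ndim))
    return axis % ar.ndim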

@seberg
Member
seberg commented Sep 7, 2013

There is also a merge conflict, probably from style changes in the tests or so. Generally this looks good to me, and this kind of thing seems to crop up fairly often. The only real thing I am wondering about is whether someone has a great idea for what to call the argument ;).

@seberg
Member
seberg commented Sep 24, 2013

General Travis error, closing and reopening.

@seberg seberg closed this Sep 24, 2013
@seberg seberg reopened this Sep 24, 2013
@seberg
Member
seberg commented Sep 24, 2013

Never mind, they seem to ignore that, but on the other hand Travis now has a bigger button to restart all jobs ;).

@joferkington
Contributor Author

@seberg - Sorry for the delay! (and for the failing pull request) I've been in the middle of a move over the past few weeks and got sidetracked for a while.

I fixed the failing tests (sorry again for those), but I think I'm going to go ahead and close this pull request.

It didn't have much support on the mailing list in its current form. Having this functionality as a separate function seemed to have broader support, but namespace pollution is always a concern.

I may run it back by the list as a "unique_rows" function, but in that case it's probably cleaner to make another pull request.

Any thoughts? Thanks again for looking over things, regardless!

@seberg
Member
seberg commented Sep 25, 2013

Well, I am not quite sure about a keyword vs. a separately named function (and what name exactly). But the thing seems to crop up once in a while, so it might make sense, and those homebrew versions are likely to forget the subtleties you fixed...

@alimuldal
Contributor

I, for one, would really like to see this feature in numpy 1.10! Were there any show-stopping concerns about this implementation? I get the impression from the comments that the outstanding issues were mostly API and documentation-related.

@seberg
Member
seberg commented Apr 21, 2015

@alimuldal, I do not remember the discussion on the list. Maybe you can have a look at it. It would seem we had already addressed most of the technical issues, so yes. Maybe all this needs is another revival (i.e. from you), but maybe first have a look at the old discussion and see what the result was (if there was any), or just revive it on the mailing list.

@joferkington
Contributor Author

@alimuldal & @seberg Here's the list discussion: http://numpy-discussion.10968.n7.nabble.com/Adding-an-axis-argument-to-numpy-unique-td34841.html

If I recall correctly, it was mostly just ambiguity around naming vs. kwargs. I didn't want to push something ahead too hard without a clear consensus.

Also related (along with other functionality):
http://numpy-discussion.10968.n7.nabble.com/New-function-count-unique-to-generate-contingency-tables-td38363.html
(and the pull request in question: #4958)

@alimuldal
Contributor

@joferkington Thanks, I'll take a look

@joferkington
Contributor Author

Related issue from a couple of years ago: #2871

@breedlun

I would also like to see this feature in numpy.

@micmitch
micmitch commented Feb 4, 2016

This would be very useful.

I'm currently using the unique_rows function from @bfroehle, which was posted in #2871.

@martinosorb

@joferkington As I was saying the other day in another thread, I think this would be great functionality to have. I personally support the idea of just adding an "axis" parameter. I don't think it would be ambiguous: in all the functions that have one, the given axis is the one that "disappears" after the operation, so it would be quite easy to remember. Are you still interested in taking this forward? Otherwise I may look into it.
