8000 argsort/sort with O(n) algorithms: counting-sort, radix-sort · Issue #6050 · numpy/numpy · GitHub
[go: up one dir, main page]

Skip to content

argsort/sort with O(n) algorithms: counting-sort, radix-sort #6050

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree 8000 to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
d1manson opened this issue Jul 6, 2015 · 5 comments · Fixed by #12586
Closed

argsort/sort with O(n) algorithms: counting-sort, radix-sort #6050

d1manson opened this issue Jul 6, 2015 · 5 comments · Fixed by #12586

Comments

@d1manson
Copy link
d1manson commented Jul 6, 2015

You can already sort with couting-sort using numpy: as stated in an SO answer, it is simply:

np.repeat(np.arange(1+x.max()), np.bincount(x))

But there is no way (as far as I can tell) to do an argsort.

I think it would be fairly easy to implement and slot into the argsort/sort documentation. The benefit of counting sort is that it's O(n), although it only works for not-too-large non-negative integers...however those are actually quite common in the real world! Pandas uses it behind the scenes apparently.

@jaimefrio
Copy link
Member

I've been toying with the idea of adding some of the O(n) sorting algorithms to NumPy for some time now: counting sort for u/int8s and bools, radix sort for larger ints, timsort for object arrays...

The most difficult thing to do is figure out how to adapt the current NumPy API to fit the new functionality without cluttering it to much. At some point I was thinking on adding to the current options of quicksort, mergesort, heapsort, that sort and argsort provide, two more options: fastest and stable, probably with better names. The idea is that these two new selectors do not have to point to new implementations (you could map fastest to quicksort and stable to mergesort and be done with it), and if they do, they do so in a more or less concealed manner, so that the internals can be changed at will without breaking any user code.

@njsmith
Copy link
Member
njsmith commented Jul 6, 2015

If thinking about such changes then it would also be good to keep in mind
the possibility of moving sort to become a gufunc, and possibly
generalizing the api so we don't have exactly 3 sorts which every type
implements.
On Jul 6, 2015 10:48 AM, "Jaime" notifications@github.com wrote:

I've been toying with the idea of adding some of the O(n) sorting
algorithms to NumPy for some time now: counting sort for u/int8s and bools,
radix sort for larger ints, timsort for object arrays...

The most difficult thing to do is figure out how to adapt the current
NumPy API to fit the new functionality without cluttering it to much. At
some point I was thinking on adding to the current options of quicksort,
mergesort, heapsort, that sort and argsort provide, two more options:
fastest and stable, probably with better names. The idea is that these two
new selectors do not have to point to new implementations (you could map
fastest to quicksort and stable to mergesort and be done with it), and if
they do, they do so in a more or less concealed manner, so that the
internals can be changed at will without breaking any user code.


Reply to this email directly or view it on GitHub
#6050 (comment).

@d1manson d1manson changed the title argsort implementation using counting-sort argsort/sort with O(n) algorithms: counting-sort, radix-sort Jul 6, 2015
@d1manson
Copy link
Author
d1manson commented Jul 6, 2015

Everyone loves a super fast sort!

In terms of API, why not keep the old kind kwarg for backward compatibility, but then add other keywords to control each requirement/hint. Any/all hint kwargs could be ignored in initial/future implementations:

  • stable - True/False.
  • nearly_sorted - True/False
  • small_int_range - True/False
  • small_float_set - True/False
    etc.

@hameerabbasi
Copy link
Contributor
hameerabbasi commented Apr 22, 2018

I realize this is an old issue, but this will be super useful. I have an application where things to be sorted are integers and sorting takes 70+% of the time.

  • stable is useful.
  • nearly_sorted: is useless, introsort is O(n log n) and is what's actually used with kind=quicksort.
  • small_*_*: Again, useless, for the most part.

I'd much rather prefer that for ints, kind is ignored and it defaults to counting/radix sort.

@hameerabbasi
Copy link
Contributor

There are good implementations of radix/counting sort/argsort in this repo: https://github.com/eloj/radix-sorting

O(n) integer sort is important for PyData/Sparse as a lot of time is spent on integer sorting in many operations. See pydata/sparse#52. It’s also important to Zarr (zarr-developers/zarr-python#236)

Ideally, radix sort should be the default sort for integer arrays. The reason is that it’s O(n) instead of O(n log n). Benchmarks in the repository I linked show as much as a 7x speed up for moderately sized arrays.

Radix sort is also stable, which means it doesn’t change the relative ordering of any equal elements.

The only downside to radix sort is that it isn’t in-place: it needs a temporary array to get the work done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants
0