ENH: Implement radix sort #12586
Conversation
tests and benchmarks come before making radix the default for integers. |
Looks like I need to implement a generic |
Will you also support radix sort for floats? One can make this work by deriving keys from the floats, as documented here: https://github.com/eloj/radix-sorting#float-keys |
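The trick documented at that link can be sketched in a few lines of C: derive an unsigned key whose unsigned ordering matches the float ordering, then radix-sort the keys. The function name below is illustrative, not code from this PR, and NaNs are not handled.

```c
#include <stdint.h>
#include <string.h>

/* Map a 32-bit float to a uint32 key whose unsigned order matches the
 * float's order (negatives before positives), so an unsigned radix
 * sort on the keys sorts the floats.  NaNs are ignored for brevity. */
static uint32_t float32_radix_key(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);   /* reinterpret the bits without UB */
    /* Negative values: flip all bits (reverses their order and puts
     * them below the positives).  Non-negative values: flip only the
     * sign bit (puts them above the negatives). */
    return u ^ ((u >> 31) ? 0xFFFFFFFFu : 0x80000000u);
}
```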
I can do |
You would have to check the dtypes in |
I still need to check for bool and float/double. Also, I need to decide the exact method based on the exact dtype. |
You attach the sort methods to the dtype, so this line will return NULL if that dtype cannot do radix sort |
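For context, the lookup being described is roughly the following, sketched against the NumPy 1.x C API rather than copied from the exact line; `lookup_sort` is an illustrative name.

```c
#include <Python.h>
#include <numpy/arrayobject.h>

/* Each dtype carries a PyArray_ArrFuncs table with one sort function
 * pointer per sort kind; a NULL entry means that kind is unavailable
 * for the dtype, and the caller must fall back to a generic sort. */
static PyArray_SortFunc *
lookup_sort(PyArrayObject *arr, NPY_SORTKIND which)
{
    return PyArray_DESCR(arr)->f->sort[which];
}
```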
Next, I think, comes attaching … It'd also be nice to know where the default sort is decided, if the benchmarks show an improvement and we want to change it. |
Grep for |
Might need to zero-initialize the `cnt` array before use.
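In other words, the counting buckets must start at zero before the histogram pass. A tiny illustration of the point (names are illustrative, not from this PR):

```c
#include <stddef.h>
#include <stdint.h>

/* Histogram of the low byte of each key; cnt must be cleared first,
 * otherwise counts from a previous call would leak into this one. */
static void histogram_low_byte(const uint8_t *keys, size_t n, size_t cnt[256])
{
    for (size_t b = 0; b < 256; b++) {
        cnt[b] = 0;             /* zero-initialize before counting */
    }
    for (size_t i = 0; i < n; i++) {
        cnt[keys[i]]++;
    }
}
```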
Okay, while this works, it's massively slower than quicksort (by about a factor of 3). I went back to the reference repo, https://github.com/eloj/radix-sorting, and it seems he's doing two things differently: … |
Can you do this once for each int dtype and store it somewhere? |
It's the exact same size as the input array, so unfortunately not. EDIT: Here's the original implementation. Everything besides the auxiliary array is on the stack, so we avoid allocations. |
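The link to the original implementation did not survive this excerpt. As a hedged illustration of the shape being described, all counting tables on the stack, the histograms for every byte built in a single pass, and one heap-allocated auxiliary buffer the same size as the input, here is a minimal LSB radix sort for `uint32_t` keys. The function and its structure are an editor's sketch, not the PR's code.

```c
#include <stdint.h>
#include <stdlib.h>

/* Minimal LSB (byte-at-a-time) radix sort for uint32 keys.  All four
 * 256-entry histograms live on the stack; the only heap allocation is
 * the auxiliary buffer, which has the same size as the input. */
static int radix_sort_u32(uint32_t *arr, size_t n)
{
    uint32_t *aux = malloc(n * sizeof(*aux));
    if (aux == NULL) {
        return -1;
    }

    size_t cnt[4][256] = {{0}};        /* zero-initialized histograms */

    /* Single pass over the data to build all four byte histograms. */
    for (size_t i = 0; i < n; i++) {
        uint32_t k = arr[i];
        cnt[0][(k >> 0) & 0xFF]++;
        cnt[1][(k >> 8) & 0xFF]++;
        cnt[2][(k >> 16) & 0xFF]++;
        cnt[3][(k >> 24) & 0xFF]++;
    }

    uint32_t *src = arr, *dst = aux;
    for (int pass = 0; pass < 4; pass++) {
        /* Exclusive prefix sum turns counts into destination offsets. */
        size_t sum = 0;
        for (size_t b = 0; b < 256; b++) {
            size_t c = cnt[pass][b];
            cnt[pass][b] = sum;
            sum += c;
        }
        /* Stable scatter by the current byte. */
        for (size_t i = 0; i < n; i++) {
            dst[cnt[pass][(src[i] >> (8 * pass)) & 0xFF]++] = src[i];
        }
        uint32_t *tmp = src; src = dst; dst = tmp;
    }

    /* An even number of passes leaves the sorted data back in `arr`,
     * so no final copy is needed. */
    free(aux);
    return 0;
}
```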
Usually we should avoid using a … The next level of abstraction, which hides memory management from the user but still affords reuse, is to wrap such an allocation-free inner function with one that uses a memory pool to allocate workspace/output buffers. This makes memory allocation very fast for frequently used buffer sizes. Unfortunately, I am not familiar enough with numpy internals to know whether these design patterns are already in use elsewhere. Also, this is a nice C/C++ idiom for zero-initialization of a heap-allocated array:
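The snippet that originally followed this sentence is not preserved in the excerpt. One common such idiom in C is `calloc` (in C++ the analogue would be value-initialization, `new T[n]()`); whether this is the exact idiom the comment meant is an assumption, and the function names below are illustrative.

```c
#include <stdlib.h>
#include <string.h>

/* Zero-initialized heap array of n counters via calloc; the allocator
 * can often skip the explicit clearing pass when the OS hands back
 * already-zeroed pages. */
static size_t *zeroed_counters(size_t n)
{
    return calloc(n, sizeof(size_t));
}

/* The equivalent spelled out with malloc + memset. */
static size_t *zeroed_counters_manual(size_t n)
{
    size_t *cnt = malloc(n * sizeof(*cnt));
    if (cnt != NULL) {
        memset(cnt, 0, n * sizeof(*cnt));   /* explicit clear */
    }
    return cnt;
}
```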
It might generate the same code as [edited for clarity] |
- Would be nice to have multi-core support so that all available memory bandwidth can be utilized.
- Would be nice to try 11-bit (2048-entry) counters in radix sort instead of 8-bit (256-entry). These require 3 passes for 32-bit keys instead of 4, but still fit in L1 cache.
- Would be nice to have integer and float merge-sorts that use SIMD instructions, and to compare them with radix sort.

This is a lot of stuff; I am just putting it on the radar. Cheers. |
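For concreteness, the arithmetic behind the 11-bit item above, as an illustrative calculation rather than benchmark data:

```c
#include <stdio.h>

/* Worked numbers for the digit-width trade-off: passes = ceil(32 / w),
 * buckets = 2^w, table size = buckets * sizeof(counter). */
int main(void)
{
    const int key_bits = 32;
    const int widths[2] = {8, 11};

    for (int i = 0; i < 2; i++) {
        int w = widths[i];
        int passes = (key_bits + w - 1) / w;
        int buckets = 1 << w;
        printf("%2d-bit digits: %d passes, %4d counters (%zu-byte table)\n",
               w, passes, buckets, (size_t)buckets * sizeof(unsigned));
    }
    return 0;
}
```

This prints 4 passes with a 1 KB table for 8-bit digits and 3 passes with an 8 KB table for 11-bit digits; both tables fit comfortably in a typical 32 KB L1 data cache, which is the trade-off the comment describes.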
From what I understand the pymalloc allocator already does this. @mattip, do we have the … EDIT: Transposing two loops and getting rid of some unnecessary arrays leads to a huge speedup: 2x over quicksort on a … I'll also add in some other optimisations from this implementation, which is the one the author uses in the benchmark. |
I'm not sure how much effort this would take, but the user doesn't usually manage buffers in NumPy, the exception being
In general, NumPy doesn't use multi-core anywhere in its codebase.
Apparently, the author of the original repo tried this and found it wasn't all it was hyped up to be. I'm assuming it has to do with byte alignment being more efficient.
This would either require different compiler flags or ASM code, which has no precedent in NumPy.
Done! |
@jschueller probably can give you some tips on how to structure the C code to get the compiler to make use of SIMD instructions. |
I added more optimisations:
Edit: They do work. 😄 Apparently, Jupyter Lab needs a … Edit 2: CI is broken because of warnings; the tests pass. |
After some digging I found https://github.com/numpy/numpy/blob/master/numpy/core/src/umath/simd.inc.src
It uses the Intel intrinsics, which are basically a C API for SIMD instructions: … Here is a paper that describes how to sort numbers using the 128-bit SSE intrinsics: … So a SIMD-accelerated merge-sort appears possible by adding the vector sort primitives documented in the paper to umath/simd.inc.src. |
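The links after the colons did not survive the excerpt. As a small illustration of what a "vector sort primitive" looks like, and only a sketch, not NumPy code, SSE4.1 provides packed min/max intrinsics from which in-register sorting networks (and hence SIMD merge kernels) can be built:

```c
#include <smmintrin.h>   /* SSE4.1: _mm_min_epi32 / _mm_max_epi32 */

/* One compare-exchange step of an in-register sorting network: after
 * the call, *lo holds the lane-wise minima and *hi the lane-wise
 * maxima of the two input vectors (4 x int32 each).  Bitonic merge
 * networks for SIMD merge-sort are built from steps like this plus
 * lane shuffles. */
static inline void minmax_epi32(__m128i *lo, __m128i *hi)
{
    __m128i mn = _mm_min_epi32(*lo, *hi);
    __m128i mx = _mm_max_epi32(*lo, *hi);
    *lo = mn;
    *hi = mx;
}
```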
I see. But NumPy already depends on multi-threaded BLAS libraries. I actually don't see why NumPy would not also enable multi-threading elsewhere using the same OpenMP-style environment variables that BLAS libraries use for threading control. Default to 1 thread for backwards compatibility (otherwise users who have already built multi-processing into their applications will be impacted). Edit: #11826 seeks a numpy API to control the number of BLAS threads, so that applications tuned for multi-processing can force it to 1 (the trick is that there is no standard BLAS API for setting the number of threads, and no way to know which BLAS library numpy is using). Of course this same API could also control the number of threads available to any numpy-internal functions. I have yet to see that the lack of threading in numpy internals is a policy; rather, historically most of the heavy lifting has been deferred to BLAS libraries, which are themselves multi-threaded. |
cc @mattip: I think this is ready and a review would be awesome. I still have to update the docs and changelog, but otherwise it's done. |
The test failures don't seem to be related to this PR, as a best guess. |
Benchmarks above. :-) It'd be nice to have it as an option, I agree. But the issue is that it's impossible to add a sort kind without breaking the ABI; currently this PR replaces mergesort (which is now timsort) for the relevant … |
(I still have the branch around; I'm just doing some housekeeping on PRs. I'd be willing to work on this if NumPy decided it was okay to break the ABI.) |
Not for this, I'd say. However, you can turn it around too: when we eventually break the ABI, that's a good moment to add radixsort. |
I added it to the list: https://github.com/numpy/numpy/wiki/Backwards-incompatible-ideas-for-a-major-release |
A test that seems to be missing from the benchmarks is a simple random array. I noted that when I put them in, so maybe we should add one. The current tests are for arrays that are already ordered to some extent, which is what timsort is designed to deal with. |
Opened a PR for this, #13287. |
Here are the benchmarks for randomly-shuffled arrays. This branch: …
Which shows that radix sort is faster for truly random data (reversed is its worst case). Now, there are two ways to look at this: …
Which is true? There is absolutely no way to know without some usage data. My instinct leans towards the latter, as I think stable sorts are usually used to sort in succession (or to preserve some kind of partial order), and so the data does tend to have sorted areas. In any case, this should be discussed. |
Here's the full set of benchmarks. This branch: … |
I also think this is most likely, so the best default is probably timsort. That said, I do really like the idea of having an option for radix sort. It's too bad that this seems to break C API compatibility (I don't really understand the nuance of that yet). |
We might also want to test int8. I don't know how often people sort that type, but it would seem to offer the best chance of good performance for radixsort. We should also fix the problems with the |
It looks like there are already some compelling speedups, e.g., by up to 10x for random int16 data. I don't think we need to test it on random int8 data to believe this would speed it up. |
That's what I meant... I couldn't articulate it too well. Benchmarks (just for …): This PR: …
So you're not far off. 🙂 |
Not the API, but rather the ABI. It doesn't mean people have to change their code, but that code compiled for older versions of NumPy will no longer work for newer ones. This is because the positioning of some function pointers changed in a struct. |
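To make the ABI point concrete: in the NumPy 1.x C API, each dtype carries a function table (`PyArray_ArrFuncs`) whose sort slots are fixed-size arrays indexed by the sort-kind enum. The following is simplified from the public headers, not a verbatim copy:

```c
/* Simplified from NumPy 1.x headers (not a verbatim copy). */
typedef enum {
    NPY_QUICKSORT = 0,
    NPY_HEAPSORT = 1,
    NPY_MERGESORT = 2,
    NPY_STABLESORT = NPY_MERGESORT,   /* alias added with timsort */
    NPY_NSORTS = NPY_MERGESORT + 1
} NPY_SORTKIND;

typedef struct {
    /* ... other function pointers ... */
    PyArray_SortFunc *sort[NPY_NSORTS];       /* one slot per sort kind */
    PyArray_ArgSortFunc *argsort[NPY_NSORTS];
    /* ... more members follow ... */
} PyArray_ArrFuncs;

/* Adding a new NPY_SORTKIND value grows NPY_NSORTS, which enlarges the
 * sort/argsort arrays and shifts the offsets of every later member.
 * Extensions compiled against the old layout would then read the wrong
 * fields; hence the ABI break, and hence this PR reusing the existing
 * stable-sort slot instead of adding a new kind. */
```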
Looks to me that a good case can be made for radix sort for at least some of the dtypes. |
Force-pushed from d4824a0 to 512e7dd
I have rebased, fixed the doc error/warning, and incorporated @charris's change about using this only for small dtypes. |
Testing:
- np.int8/np.uint8
- np.int16/np.uint16
- np.int32/np.uint32
- np.int64/np.uint64
- np.float32/np.float64
- np.float16/np.longdouble
close/reopen |
Thanks Hameer. |
Thanks for working on this, Hameer. Glad to see it in. I missed the last part about limiting radix sort to some smaller types; which types is it being used for? |
It is this line which will choose radix sort for the first 5 types: |
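The line itself is not preserved in this excerpt. As a hedged sketch of what such a selection looks like, assuming "the first 5 types" means bool plus the 1- and 2-byte integers in NumPy's type-enum order, and using assumed function names rather than the actual NumPy dispatch:

```c
#include <Python.h>
#include <numpy/arrayobject.h>

/* Assumed prototypes with the PyArray_SortFunc signature; the real
 * names in the NumPy sources differ. */
int radixsort(void *start, npy_intp num, void *arr);
int timsort(void *start, npy_intp num, void *arr);

/* Illustrative only: resolve the stable-sort slot to radix sort for
 * the small fixed-width types (tiny counting tables), and to timsort
 * for everything else. */
static PyArray_SortFunc *
select_stable_sort(int type_num)
{
    switch (type_num) {
        case NPY_BOOL:
        case NPY_BYTE:  case NPY_UBYTE:
        case NPY_SHORT: case NPY_USHORT:
            return radixsort;
        default:
            return timsort;
    }
}
```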
Fixes #6050
This PR implements radix/counting sort for integers.
cc: @mattip A review would be nice, along with next steps. Feel free to force-push if need be.