-
Notifications
You must be signed in to change notification settings - Fork 24.2k
Add x86-simd-sort accelerated sorting #149362
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/149362
Note: Links to docs will display an error until the docs builds have been completed. ✅ You can merge normally! (1 Unrelated Failure)As of commit 86e6f39 with merge base 7a0781e ( UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Considering it had a previously caused a pretty significant regression and authors weren't very responsive in debugging the problem, I want to see regression test added that would reproduce the failure using previous versions of the module, but would work fine with this one
Unfortunately, it will probably not be possible to build a regression test for this bug. Because of this, I wouldn't know how to test it; furthermore, it seems the compiler used to build wheels has upgraded from So it's quite possible that the bug has been fixed with that compiler change, so even the pre-workaround patch wouldn't replicate the failure. |
Could I get the CI rerun so the rebased patch can be tested? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, it will probably not be possible to build a regression test for this bug.
Can you write a C++ test that would crashes if compiled by g++9, but passes if compiled by gcc-11+?
Also, if there is a compiler bug, than CMake should disable the integration if compiler that contains a bug is used.
Also, please reference gcc's bugzilla report links in PR description
Thanks for getting back to me so quickly! Also sorry this is so long, I wanted to cover the issue and my reasoning about it fairly thoroughly. Since a mitigation was added to x86-simd-sort intel/x86-simd-sort#183, this patch should be safe on all compilers. With regards to replicating it, unfortunately I have not been able to replicate this behavior within Pytorch itself. Here is the code I used to replicate the failures. This is based off the reproduction made by cmsxbc. Code
With the pre-mitigation version and when compiled with GCC 9.5.0, it fails with Luckily, I was able to extract a backtrace for the crash in Pytorch, and we can compare it to the crash in the reproduction above: Pytorch bad_alloc trace
Reproduction bad_alloc trace
These seem to be the same failure, and this failure is resolved by the mitigation. There seem to have been a few variants of this bug in GCC, but I can't pin down a bug report for exactly this version (using MMX registers to avoid spilling AVX512 masks to the stack). Somes examples of MMX without EMMS bugs are GCC bugs 117926, 91533, and 80799. Even when using GCC 9.5.0, when building Pytorch itself the compiler is choosing not to use MMX instructions for some reason, which is why I haven't been able to replicate this failure within Pytorch. But I hope the reproduction is convincing in showing that this was the source of the issue, and this issue is mitigated now in x86-simd-sort. |
Pinging @malfet and @mingfeima, could I get the CI ran again so the rebased code can be tested? |
sure. |
This is a new pull request for the same feature as #127936; the issue affecting that patch has been resolved. That patch is still closed and doesn't seem to be getting responses, so hopefully to get more attention I'm submitting this new patch; please tell me if this is a problem.
This patch adds x86-simd-sort as a submodule to accelerate sorting for 32-bit and 64-bit datatypes when AVX2 or AVX512 are available.
For contiguous data, this can be over a 10x speedup for large arrays. For discontiguous data, it can give over a 4x speedup with larger arrays. These benchmarks were gathered on a Skylake system (7900x), limited to 8 threads.
Contiguous Benchmarks
Note that the int32_t sort is accelerated by FBGEMM's radix sort for larger arrays, but this only handles contiguous data and in one sorting direction.
Discontiguous Benchmarks
And single threaded performance on the same 7900x system.
Single Core Performance
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @leslie-fang-intel @r-devulap