Add x86-simd-sort accelerated sorting by sterrettm2 · Pull Request #149362 · pytorch/pytorch · GitHub

Open · wants to merge 6 commits into main
Conversation

@sterrettm2 (Contributor) commented Mar 17, 2025

This is a new pull request for the same feature as #127936; the issue affecting that patch has been resolved. Since that PR is closed and no longer receiving responses, I'm submitting this new patch in the hope of getting more attention; please tell me if this is a problem.

This patch adds x86-simd-sort as a submodule to accelerate sorting for 32-bit and 64-bit datatypes when AVX2 or AVX512 are available.

For contiguous data, this can give over a 10x speedup for large arrays. For discontiguous data, it can give over a 4x speedup with larger arrays. These benchmarks were gathered on a Skylake-X system (Core i9-7900X), limited to 8 threads.

Contiguous Benchmarks
float32, normally distributed (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512 
16             7.150844336    6.886271477    7.132277489    1.038420335    1.002603214    
128            9.208030939    8.478154898    7.846915245    1.086089019    1.173458697    
1024           37.79037627    23.60707456    16.44122627    1.600807257    2.298513241    
10000          714.7355628    203.9921844    105.5683001    3.503739934    6.770361577    
100000         8383.074408    721.6333354    465.3709247    11.61680593    18.01374766    
1000000        97124.31945    5632.054572    3920.148401    17.24491803    24.77567416    
10000000       1161974.907    86070.48988    71533.82301    13.50027063    16.24371323
    
int32_t, uniformly distributed (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512 
16             7.203208685    6.92212224     7.014458179    1.040606975    1.026908779    
128            8.972388983    8.195516348    7.592543125    1.094792396    1.18173698     
1024           32.77489477    23.6874548     15.36617105    1.383639359    2.132925285    
10000          607.8824128    193.3402024    99.25090471    3.144107667    6.124703997    
100000         523.9384684    608.1836536    442.3166784    0.861480682    1.184532472    
1000000        5211.348627    5271.598405    3518.861883    0.988570871    1.480975611    
10000000       133853.6263    81463.05084    67852.97394    1.643120714    1.972700952 

Note that the int32_t sort is accelerated by FBGEMM's radix sort for larger arrays, but this only handles contiguous data and in one sorting direction.

Discontiguous Benchmarks
float32, normally distributed, discontiguous in sorted dimension (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512 
16             3.836543679    4.011214256    3.84376061     0.956454439    0.99812243     
128            5.755310194    5.755723127    4.820394962    0.999928257    1.193949923    
1024           49.46946019    24.78790785    15.47874362    1.995709379    3.195960952    
10000          665.2505291    236.6165959    143.9490662    2.811512551    4.621429974    
100000         4328.002203    1329.001212    818.3516414    3.256582586    5.288682743    
1000000        47651.5018     16693.72045    11827.39551    2.854456677    4.028909133    
10000000       556655.1288    236252.6258    184215.9828    2.356185998    3.021752621   
 
int32_t, uniformly distributed, discontiguous in sorted dimension  (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512 
16             3.817994356    3.878117442    3.770039797    0.984496837    1.012719908    
128            5.578731397    5.577152082    4.716770534    1.000283176    1.182743862    
1024           43.3412619     23.61275801    14.55446819    1.835501887    2.977866408    
10000          634.3997478    224.4322851    133.9518324    2.826686667    4.736028889    
100000         4084.358152    1292.363303    781.7867576    3.16037924     5.22438902     
1000000        46262.20465    16608.35284    11367.51817    2.785478192    4.06968381     
10000000       541231.9104    235185.1861    180249.9294    2.301301028    3.002674742 
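The tables above distinguish data that is contiguous in the sorted dimension from data that is strided. A minimal sketch of how such timings can be gathered with `torch.sort` (the sizes, repeat count, and harness here are illustrative assumptions, not the exact script used for these tables):

```python
import timeit

import torch

torch.set_num_threads(8)  # match the 8-thread limit used for the tables above

n = 100_000

# Contiguous case: sort along the last dimension, where neighboring elements
# in the sorted dimension are adjacent in memory (stride 1).
x = torch.randn(4, n)  # float32, normally distributed
t_contig = timeit.timeit(lambda: torch.sort(x, dim=-1), number=10)

# Discontiguous case: sort along dim 0 of an (n, 4) tensor, where neighboring
# elements in the sorted dimension are 4 floats apart in memory.
y = torch.randn(n, 4)
t_strided = timeit.timeit(lambda: torch.sort(y, dim=0), number=10)

print(f"contiguous: {t_contig:.4f}s, discontiguous: {t_strided:.4f}s")
```

Both calls sort the same number of length-`n` sequences, so the two timings isolate the effect of memory layout.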

And single threaded performance on the same 7900x system.

Single Core Performance

float32, normally distributed  (in microseconds)
size           default        avx2           avx512         Default/AVX2   Default/AVX512
16             7.113132954    7.125889063    6.855771542    0.998209892    1.03753938     
128            9.120340586    8.584395647    7.56901145     1.06243246     1.204957959    
1024           36.27155249    24.53012899    15.79697341    1.478653149    2.296107714    
10000          711.9155329    200.382199     108.2926268    3.552788305    6.573998194    
100000         8399.78071     2366.537676    1330.463447    3.54939657     6.313424639    
1000000        100915.9743    28517.82126    17892.53366    3.538698604    5.640116497    
10000000       1204376.316    372791.338     258797.0257    3.230698231    4.653748678    

int32_t, uniformly distributed  (in microseconds)
size           default        avx2           avx512         Default/AVX2   Default/AVX512
16             6.839853764    6.9264884      6.681355715    0.987492272    1.023722438    
128            8.356203556    8.445468426    7.25971818     0.989430442    1.151036907    
1024           30.88020962    23.73411948    14.40595382    1.30108933     2.143572721    
10000          598.6316072    191.3458307    99.9496872     3.128532276    5.989329471    
100000         1971.655619    2248.225125    1253.185778    0.87698318     1.57331471     
1000000        24533.7907     27625.80853    16539.86351    0.888075029    1.483312766    
10000000       361025.8579    358125.9727    248421.4783    1.008097389    1.453279565    
                                                                                                                                                       
                                                                                          
float32, normally distributed, discontiguous in sorted dimension (in microseconds)
size           default        avx2           avx512         Default/AVX2   Default/AVX512
16             3.9883219      3.897530437    3.803153276    1.023294613    1.048688183    
128            5.797074333    5.687333666    4.795829393    1.019295627    1.208774095    
1024           49.77498938    25.21366438    16.05679234    1.974127546    3.099933556    
10000          670.7694155    244.0156184    145.6871839    2.748879026    4.604175863    
100000         8045.512319    2731.892052    1707.214788    2.945033027    4.712653836    
1000000        96954.93258    32101.35607    21151.68938    3.020275292    4.583791433    
10000000       1159710.248    427844.882     316131.2342    2.710585769    3.668445642    

int32_t, uniformly distributed discontiguous in sorted dimension  (in microseconds)
size           default        avx2           avx512         Default/AVX2   Default/AVX512
16             3.780948997    3.872428179    3.718787193    0.97637679     1.016715612    
128            5.341575543    5.529783332    4.779936273    0.965964708    1.117499322    
1024           39.1874838     23.01476824    15.89414877    1.702710337    2.465528942    
10000          555.9280075    225.5575979    137.2813291    2.464683135    4.049552922    
100000         6663.585735    2620.158211    1609.420934    2.543199761    4.140362284    
1000000        79281.4539     31679.51566    20372.97304    2.502609407    3.891501439    
10000000       961423.1586    417279.1243    305512.3885    2.304028893    3.146920369

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @leslie-fang-intel @r-devulap

@pytorch-bot added the module: cpu and module: inductor labels Mar 17, 2025
pytorch-bot commented Mar 17, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/149362

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 86e6f39 with merge base 7a0781e:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@sanchitintel added the ciflow/trunk and topic: not user facing labels Mar 18, 2025
@pytorch-bot removed the ciflow/trunk label Mar 18, 2025
@leslie-fang-intel (Collaborator) commented

Thanks for the fix. I am not sure how we can test the reported failures in #140590 to confirm the issue has been fixed, so pinging @atalman.

@malfet (Contributor) left a comment

Considering this previously caused a pretty significant regression and the authors weren't very responsive in debugging the problem, I want to see a regression test added that reproduces the failure with previous versions of the module but works fine with this one.

@janeyx99 added the triaged label (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Mar 20, 2025
@sterrettm2 (Contributor, Author) commented

Unfortunately, it will probably not be possible to build a regression test for this bug.
The issue was caused by a compiler bug, and the fix is a workaround for that bug.

Because of this, I wouldn't know how to test it. Furthermore, it seems the compiler used to build the wheels has been upgraded from
GCC: (GNU) 9.3.1 20200408 (Red Hat 9.3.1-2)
to
GCC: (GNU) 11.2.1 20210728 (Red Hat 11.2.1-1)

It's quite possible that the bug was fixed by that compiler change, in which case even the pre-workaround patch wouldn't replicate the failure.
Sorry about the delay in figuring out the issue; given its nature, it took a long time to determine why it wouldn't replicate locally. I intend to be more timely in the future.

@sterrettm2 (Contributor, Author) commented

Could I get the CI rerun so the rebased patch can be tested?
Pinging @malfet and @mingfeima, it would be great if you could review the patch/rerun the CI if possible, thanks!

@malfet (Contributor) left a comment

> Unfortunately, it will probably not be possible to build a regression test for this bug.

Can you write a C++ test that crashes when compiled with g++-9 but passes when compiled with gcc-11+?

Also, if there is a compiler bug, then CMake should disable the integration when a compiler containing the bug is used.

Also, please reference the GCC Bugzilla report links in the PR description.
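A sketch of what such a CMake gate might look like (the `USE_X86_SIMD_SORT` option name and the GCC 10 cutoff are assumptions for illustration, not the actual build logic in this PR):

```cmake
# Sketch: disable the x86-simd-sort integration on compilers known to carry
# the MMX-without-EMMS miscompilation. Option name and version cutoff are
# assumptions, not the real PyTorch build configuration.
if(CMAKE_CXX_COMPILER_ID STREQUAL "GNU"
   AND CMAKE_CXX_COMPILER_VERSION VERSION_LESS 10.0)
  message(STATUS "Disabling x86-simd-sort: GCC ${CMAKE_CXX_COMPILER_VERSION} "
                 "may emit MMX instructions without EMMS")
  set(USE_X86_SIMD_SORT OFF)
endif()
```

With a gate like this the integration degrades gracefully instead of shipping a miscompiled sort path.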

@sterrettm2 (Contributor, Author) commented

Thanks for getting back to me so quickly! Also sorry this is so long, I wanted to cover the issue and my reasoning about it fairly thoroughly.

Since a mitigation was added to x86-simd-sort (intel/x86-simd-sort#183), this patch should be safe on all compilers.

With regards to replicating it, unfortunately I have not been able to replicate this behavior within Pytorch itself.
However, I have been able to replicate it outside of Pytorch, and it shows an identical bad_alloc failure, which is fixed by the mitigation.

Here is the code I used to replicate the failure; it is based on the reproduction made by cmsxbc.

Code

```cpp
#include <cstddef>
#include <unordered_map>
#include "../x86-simd-sort/src/x86simdsort-static-incl.h"

double calc(size_t size) {
    auto ptr = new double[size]{0};
    auto indexes = new long[size]{0};
    // The AVX512 key-value sort path here is what gets miscompiled.
    x86simdsortStatic::keyvalue_qsort(ptr, indexes, size);
    auto ret = ptr[indexes[0]];
    delete[] ptr;
    delete[] indexes;
    return ret;
}

int main() {
    std::unordered_map<size_t, size_t> amap;
    amap.clear();
    calc(65); // sizes in [65, 256] take the affected code path
    amap.rehash(1); // <- crash here: throws std::bad_alloc
    return 0;
}
```
This is compiled with the flags `-O3 -std=c++17 -march=skylake-avx512`. With the mitigation added in x86-simd-sort, this code always works correctly.

With the pre-mitigation version, compiled with GCC 9.5.0, it fails with std::bad_alloc.
With GCC 13.3.0, it succeeds with or without the mitigation.

Luckily, I was able to extract a backtrace for the crash in Pytorch, and we can compare it to the crash in the reproduction above:

Pytorch bad_alloc trace

```
#0  __cxxabiv1::__cxa_throw (obj=0x55555f9292e0, tinfo=0x7ffff3beefb8 <typeinfo for std::bad_alloc>,
    dest=0x7ffff3ad3020 <std::bad_alloc::~bad_alloc()>) at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:80
#1  0x00007ffff3ace38a in std::__throw_bad_alloc () at ../../../../../libstdc++-v3/src/c++11/functexcept.cc:54
#2  0x00007fff4c72dcd6 in std::__detail::_Hashtable_alloc<std::allocator<std::__detail::_Hash_node<std::pair<std::type_index const, std::vector<bool (*)(_object*, void*&), std::allocator<bool (*)(_object*, void*&)> > >, false> > >::_M_allocate_buckets(unsigned long) [clone .isra.0] ()
   from /home/sterrettm/miniforge3/envs/build_binary/lib/python3.13/site-packages/torch/lib/libtorch_python.so
#3  0x00007fff4c784f8f in std::_Hashtable<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, pybind11::object>, std::allocator<std::pair<std::basic_string<char, std::char_traits<char>, std::allocator<char> > const, pybind11::object> >, std::__detail::_Select1st, std::equal_to<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_rehash(unsigned long, unsigned long const&) ()
```
Reproduction bad_alloc trace

```
#0  0x00007ffff7cbb35a in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#1  0x00007ffff7ca90db in std::__throw_bad_alloc() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#2  0x0000555555555476 in __gnu_cxx::new_allocator<std::__detail::_Hash_node_base*>::allocate (
    this=<synthetic pointer>, __n=18446744073709551557) at /usr/include/c++/9/ext/new_allocator.h:105
#3  std::allocator_traits<std::allocator<std::__detail::_Hash_node_base*> >::allocate (__a=<synthetic pointer>...,
    __n=18446744073709551557) at /usr/include/c++/9/bits/alloc_traits.h:443
#4  std::__detail::_Hashtable_alloc<std::allocator<std::__detail::_Hash_node<std::pair<unsigned long const, unsigned long>, false> > >::_M_allocate_buckets (this=0x7fffffffdef0, __n=18446744073709551557)
    at /usr/include/c++/9/bits/hashtable_policy.h:2134
```

These seem to be the same failure, and this failure is resolved by the mitigation.
In the reproduction case, I have disassembled the binary and found that indeed, the compiler is adding MMX instructions without EMMS, which it should never do.
Doing the same with one of the broken Pytorch nightly wheels shows the same situation, use of MMX instructions without EMMS.

There seem to have been a few variants of this bug in GCC, but I can't pin down a bug report for exactly this variant (using MMX registers to avoid spilling AVX512 masks to the stack). Some examples of MMX-without-EMMS bugs are GCC bugs 117926, 91533, and 80799.

Even when using GCC 9.5.0, the compiler chooses not to use MMX instructions when building PyTorch itself for some reason, which is why I haven't been able to replicate this failure within PyTorch. But I hope the reproduction is convincing in showing that this was the source of the issue, and that it is now mitigated in x86-simd-sort.

@sterrettm2 sterrettm2 requested a review from malfet April 10, 2025 16:01
@sterrettm2 (Contributor, Author) commented

Pinging @malfet and @mingfeima, could I get the CI run again so the rebased code can be tested?

@mingfeima added the ciflow/trunk label May 13, 2025
@mingfeima (Collaborator) commented

> Pinging @malfet and @mingfeima, could I get the CI run again so the rebased code can be tested?

sure.

Labels
ciflow/inductor · ciflow/trunk · module: cpu · module: inductor · open source · topic: not user facing · triaged
7 participants