Parallelize sort using libstdc++ parallel mode #150195

annop-w · 2025-03-28T16:33:37Z

Previously, #149505 used libstdc++ parallel mode by enabling -D_GLIBCXX_PARALLEL. However, mixing source files compiled with and without parallel mode can lead to undefined behavior (see https://gcc.gnu.org/onlinedocs/libstdc++/manual/parallel_mode_using.html). We switch to using the specific parallel sort from <parallel/algorithm> when compiling with GCC. Note that using a std::execution policy has a dependency on libtbb, so we decided to avoid that.
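For illustration only, the approach described above could look roughly like the following minimal sketch. It assumes libstdc++ with OpenMP enabled for the parallel path and falls back to std::sort everywhere else. The helper name `sort_maybe_parallel` and the macro `SKETCH_HAS_GNU_PARALLEL` are hypothetical and are not taken from the PR.

```cpp
#include <algorithm>
#include <functional>
#include <vector>

// Pull in the libstdc++ parallel-mode algorithms only when the header exists
// and OpenMP is enabled (the same guard the diff in this PR uses).
#if __has_include(<parallel/algorithm>) && defined(_OPENMP)
#include <parallel/algorithm>
#define SKETCH_HAS_GNU_PARALLEL 1
#else
#define SKETCH_HAS_GNU_PARALLEL 0
#endif

// Hypothetical helper: sorts `data` with `comp`, calling the explicitly
// parallel __gnu_parallel::sort when available and std::sort otherwise.
// No -D_GLIBCXX_PARALLEL is needed, so std::sort itself is left untouched.
template <typename T, typename Compare = std::less<T>>
void sort_maybe_parallel(std::vector<T>& data, Compare comp = Compare()) {
#if SKETCH_HAS_GNU_PARALLEL
  __gnu_parallel::sort(data.begin(), data.end(), comp);
#else
  std::sort(data.begin(), data.end(), comp);
#endif
}
```

Calling `sort_maybe_parallel(v)` sorts ascending, and `sort_maybe_parallel(v, std::greater<int>())` sorts descending; whether the parallel path is actually taken depends on the compiler, standard library and build flags.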

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @malfet @snadampal @milpuz01 @aditew01 @nikhil-arm @fadara01

pytorch-bot · 2025-03-28T16:33:41Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/150195

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 25fd67c with merge base 7243c69:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

annop-w · 2025-03-28T16:34:32Z

@pytorchbot label "topic: not user facing"

nikhil-arm · 2025-03-28T16:37:36Z

@pytorchbot label "ciflow/linux-aarch64"

annop-w · 2025-03-28T16:39:33Z

@malfet I hope this is an acceptable solution since you had concerns with adding tbb in #142391

cyyever · 2025-03-31T10:48:45Z

Can we use c++17 parallel sort?

annop-w · 2025-03-31T10:51:44Z

@cyyever That has a dependency on TBB, which we would like to avoid adding.

cyyever · 2025-03-31T10:52:50Z

@annop-w I see, but is it possible to use them for Clang and MSVC?

annop-w · 2025-03-31T10:58:54Z

@cyyever It works for Clang and GCC. I have not tried MSVC. But again, there is a dependency on TBB, and if we would like to avoid adding a libtbb dependency to PyTorch, then we aren't able to use the parallel execution policy.

cyyever · 2025-03-31T14:24:37Z

aten/src/ATen/native/cpu/SortingKernel.cpp

@@ -146,6 +149,25 @@ static inline void sort_kernel_impl(const value_accessor_t& value_accessor,
  auto composite_accessor = CompositeRandomAccessorCPU<
    value_accessor_t, indices_accessor_t
  >(value_accessor, indices_accessor);
+#if __has_include(<parallel/algorithm>) && defined(_OPENMP)


Can we store the comparator into a variable by descending to reduce the number of branches?

I think that is a bit out of scope for this PR. It can be done easily in a separate PR though :)

I think this is a valid ask. Let's figure out the sorting function first and just call it once?
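One possible shape of the suggestion above, as a hedged sketch rather than the code from this PR: because ascending and descending comparators are normally distinct types, a single comparator object that carries the direction as a runtime flag lets the sort be called exactly once. `DirectionalLess` and `sort_with_direction` are hypothetical names, and NaN handling (which the real kernel comparators have to deal with) is deliberately left out.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical comparator: the sort direction is stored once, at construction,
// instead of being selected by an if/else around each sort call.
template <typename T>
struct DirectionalLess {
  bool descending;
  bool operator()(const T& a, const T& b) const {
    // Strict weak ordering in both directions: a < b or b < a.
    return descending ? (b < a) : (a < b);
  }
};

// Hypothetical helper: builds the comparator once and calls the sort once.
template <typename T>
void sort_with_direction(std::vector<T>& data, bool descending) {
  const DirectionalLess<T> comp{descending};
  std::sort(data.begin(), data.end(), comp);
}
```

The trade-off is that the direction check moves into every comparison. It is a loop-invariant, well-predicted branch, but whether that is cheaper than two separately instantiated sort calls would need to be measured.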

cyyever · 2025-03-31T15:09:45Z

Can you provide benchmark numbers before and after changing?

annop-w · 2025-03-31T15:40:05Z

> Can you provide benchmark numbers before and after changing?

I see a ~9.7x speedup with 16 threads on Neoverse V1 when sorting ~50000 elements.

annop-w · 2025-04-07T15:59:30Z

@pytorchbot rebase

pytorchmergebot · 2025-04-07T16:01:01Z

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

pytorchmergebot · 2025-04-07T16:01:06Z

Successfully rebased fix_sort onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout fix_sort && git pull --rebase)

annop-w · 2025-04-08T07:04:55Z

@malfet Could I please get a review?

annop-w · 2025-04-15T20:06:38Z

@digantdesai Could you please help take a look? The checks passed, and the one failing looks unrelated to me.

annop-w · 2025-04-28T09:27:12Z

@malfet Could you please review this one? Thank you.

annop-w · 2025-05-01T10:28:51Z

@cyyever @digantdesai Could you please review again? Thank you.

cyyever · 2025-05-01T10:36:58Z

@malfet Please have a look, because the performance gain from a proper C++17 parallel sort is desirable.

malfet · 2025-05-01T11:58:17Z

> @malfet Please have a look, because the performance gain from a proper C++17 parallel sort is desirable.

But this PR is not using c++17 parallel sort, but rather some g++ extension?

Also, all 3 issues that the PR claims to fix have been closed, but I also struggle to understand how it could have fixed them in the first place.

malfet · 2025-05-01T12:07:42Z

> @cyyever It works for Clang and GCC. I have not tried MSVC. But again, there is a dependency on TBB, and if we would like to avoid adding a libtbb dependency to PyTorch, then we aren't able to use the parallel execution policy.

@annop-w Can you please reference the doc? To the best of my knowledge, functions that are part of the standard should rely only on the C++ runtime shipped with the compiler.

If this is not the case, then I think one should write an in-house implementation using an at::parallel primitive.

annop-w · 2025-05-01T12:23:00Z

@malfet
Here is the doc for GNU: https://gcc.gnu.org/onlinedocs/libstdc++/manual/parallel_mode_using.html.
I will look for docs for Clang.

The 3 issues are closed because #149505 got reverted. I explained how this PR would also solve the issues here:

> Previously, #149505 used libstdc++ parallel mode by enabling -D_GLIBCXX_PARALLEL. However, mixing source files compiled with and without parallel mode can lead to undefined behavior

cyyever · 2025-05-01T12:24:15Z

@annop-w @malfet For your reference, the libstdc++ documentation says:

> Note 3: The Parallel Algorithms have an external dependency on Intel TBB 2018 or later. If the <execution> header is included then -ltbb must be used to link to TBB.

Currently, libc++ has no full Parallelism TS support, but I don't know the coverage of std::sort.
MSVC says:

> C++17's parallel algorithms library is complete. Complete doesn't mean that every algorithm is parallelized in every case. The most important algorithms have been parallelized. Execution policy signatures are provided even where the implementation doesn't parallelize algorithms. The central internal header, <yvals_core.h>, contains the following "Parallel Algorithms Notes": C++ allows an implementation to implement parallel algorithms as calls to the serial algorithms. This implementation parallelizes several common algorithm calls, but not all.
>
> The following algorithms are parallelized:
>
> adjacent_difference, adjacent_find, all_of, any_of, count, count_if, equal, exclusive_scan, find, find_end, find_first_of, find_if, find_if_not, for_each, for_each_n, inclusive_scan, is_heap, is_heap_until, is_partitioned, is_sorted, is_sorted_until, mismatch, none_of, partition, reduce, remove, remove_if, replace, replace_if, search, search_n, set_difference, set_intersection, sort, stable_sort, transform, transform_exclusive_scan, transform_inclusive_scan, transform_reduce

malfet

Requesting changes because:

We should avoid using random compiler extensions in the codebase (especially when they work for ascending but not descending).
The previous attempt to enable the libstdc++ parallel sort resulted in a correctness issue, and this PR neither adds a unit test nor provides a satisfactory explanation of why this use of the extension will not suffer the same fate. (The fact that something might lead to undefined behavior is, in my mind, not sufficient; my strong suspicion is that the results were unstable compared to the sequential sort when there were duplicated indices.)
Last but not least, this is a performance optimization, but no script is provided that one can use to validate that.

To conclude, it looks like if one wants to enable parallel sort, they cannot rely on C++17 primitives, as their implementation, at least on Linux, is currently flawed, and should instead write something that relies on the at::parallel_for primitive.
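As a rough sketch of what such an in-house approach might look like (an assumption for illustration, not code from PyTorch): split the data into chunks, sort the chunks in parallel, then merge neighbouring sorted runs. Here an OpenMP pragma stands in for at::parallel_for so the example stays self-contained; without OpenMP the pragma is ignored and the loop runs serially. The name `chunked_sort` is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper: chunked sort followed by pairwise merges.
// std::stable_sort and std::inplace_merge are both stable, so equal elements
// keep their original relative order regardless of the number of chunks.
template <typename T, typename Compare = std::less<T>>
void chunked_sort(std::vector<T>& data, std::size_t num_chunks,
                  Compare comp = Compare()) {
  const std::size_t n = data.size();
  if (n < 2) return;
  num_chunks = std::max<std::size_t>(1, std::min(num_chunks, n));
  const std::size_t chunk = (n + num_chunks - 1) / num_chunks;  // ceil(n / num_chunks)
  const long long runs = static_cast<long long>((n + chunk - 1) / chunk);
  T* base = data.data();

  // Phase 1: sort each chunk independently. This loop is the part that a
  // parallel-for primitive would distribute across threads.
  #pragma omp parallel for
  for (long long r = 0; r < runs; ++r) {
    const std::size_t begin = static_cast<std::size_t>(r) * chunk;
    const std::size_t end = std::min(n, begin + chunk);
    std::stable_sort(base + begin, base + end, comp);
  }

  // Phase 2: merge neighbouring sorted runs, doubling the run width each pass
  // (kept serial here for brevity).
  for (std::size_t width = chunk; width < n; width *= 2) {
    for (std::size_t lo = 0; lo + width < n; lo += 2 * width) {
      const std::size_t hi = std::min(n, lo + 2 * width);
      std::inplace_merge(base + lo, base + lo + width, base + hi, comp);
    }
  }
}
```

The merge phase is the obvious bottleneck in this sketch; a real implementation would likely parallelize the merges within each pass as well and would need benchmarking against the sequential sort.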

annop-w · 2025-05-01T13:12:06Z

@malfet I am sorry, but I do not agree with your arguments.

> We should avoid using random compiler extensions in the codebase (especially when they work for ascending but not descending).

This is not just some random extension. It is just the GNU version of the C++ standard library, which is also compatible with Clang. For Clang, the C++ standard library implementation can be selected with the -stdlib= flag. The __gnu_parallel version has the same function signature and accepts a comparator, i.e. it works for both ascending and descending sort.

> The previous attempt to enable the libstdc++ parallel sort resulted in a correctness issue, and this PR neither adds a unit test nor provides a satisfactory explanation of why this use of the extension will not suffer the same fate. (The fact that something might lead to undefined behavior is, in my mind, not sufficient; my strong suspicion is that the results were unstable compared to the sequential sort when there were duplicated indices.)

The documentation describes this as follows:

> Note that the _GLIBCXX_PARALLEL define may change the sizes and behavior of standard class templates such as std::search, and therefore one can only link code compiled with parallel mode and code compiled without parallel mode if no instantiation of a container is passed between the two translation units. Parallel mode functionality has distinct linkage, and cannot be confused with normal mode symbols.

I validated the change in this PR with the script in #150094, and it passed the accuracy test. So, tests already exist.
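To make the stability concern concrete, here is a small sketch of the kind of check an accuracy test could include. It is not the script from #150094, which is not reproduced in this thread. It builds a stable ascending argsort as a reference and verifies that elements with equal values appear in increasing index order. `stable_argsort` and `is_stable_ascending` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical reference: ascending argsort in which equal values keep their
// original relative order.
std::vector<std::size_t> stable_argsort(const std::vector<float>& values) {
  std::vector<std::size_t> idx(values.size());
  std::iota(idx.begin(), idx.end(), std::size_t{0});
  std::stable_sort(idx.begin(), idx.end(),
                   [&](std::size_t a, std::size_t b) { return values[a] < values[b]; });
  return idx;
}

// Returns true if `idx` orders `values` ascending and breaks ties by
// increasing index. Assumes `idx` is a permutation of 0..values.size()-1.
bool is_stable_ascending(const std::vector<float>& values,
                         const std::vector<std::size_t>& idx) {
  for (std::size_t i = 1; i < idx.size(); ++i) {
    const float prev = values[idx[i - 1]];
    const float cur = values[idx[i]];
    if (cur < prev) return false;                           // not sorted
    if (cur == prev && idx[i] < idx[i - 1]) return false;   // tie order flipped
  }
  return true;
}
```

Running a candidate parallel sort on inputs with many duplicate values and passing its index output to `is_stable_ascending` would show whether it matches the order a stable sequential sort produces.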

> Last but not least, this is a performance optimization, but no script is provided that one can use to validate that.

The benchmarking script is already provided in #142391, and I paste it here again for reference:

import torch
import torch.autograd.profiler as profiler

torch.manual_seed(0)

N = 50000
x = torch.randn(N, dtype=torch.float)

with profiler.profile(with_stack=True, profile_memory=False, record_shapes=True) as prof:
    for i in range(1000):
        _, _ = torch.sort(x)

print(prof.key_averages(group_by_input_shape=True).table(sort_by='self_cpu_time_total', row_limit=10))

> if one wants to enable parallel sort, they cannot rely on C++17 primitives, as their implementation, at least on Linux, is currently flawed, and should instead write something that relies on the at::parallel_for primitive

I do not think the implementation is flawed. If there exists a working parallel sort in the C++ standard library, why reinvent the wheel?

fadara01 · 2025-05-13T11:08:12Z

@pytorchbot rebase

pytorchmergebot · 2025-05-13T11:09:56Z

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

Resolve pytorch#149977, pytorch#149979, pytorch#150094. Previously, pytorch#149505 used libstdc++ parallel mode by enabling -D_GLIBCXX_PARALLEL. However, mixing source files compiled with and without parallel mode can lead to undefined behavior (see https://gcc.gnu.org/onlinedocs/libstdc++/manual/parallel_mode_using.html). We switch to using the specific parallel sort from <parallel/algorithm> when compiling with GCC. Note that using a std::execution policy has a dependency on libtbb, so we decided to avoid that.

pytorchmergebot · 2025-05-13T11:10:01Z

Successfully rebased fix_sort onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout fix_sort && git pull --rebase)

fadara01 · 2025-05-13T11:12:54Z

@pytorchbot label "ciflow/linux-aarch64"

malfet

> This is not just some random extension. It is just the GNU version of the C++ standard library, which is also compatible with Clang. For Clang, the C++ standard library implementation can be selected with the -stdlib= flag. The __gnu_parallel version has the same function signature and accepts a comparator, i.e. it works for both ascending and descending sort.

It's not part of the C++ standard, therefore we should not rely on it.

> > The previous attempt to enable the libstdc++ parallel sort resulted in a correctness issue, and this PR neither adds a unit test nor provides a satisfactory explanation of why this use of the extension will not suffer the same fate. (The fact that something might lead to undefined behavior is, in my mind, not sufficient; my strong suspicion is that the results were unstable compared to the sequential sort when there were duplicated indices.)
>
> I validated the change in this PR with the script in #150094, and it passed the accuracy test. So, tests already exist.

Can you please explain what changed in the approach between this PR and #149505 that makes the test pass now when it failed previously?

> The benchmarking script is already provided in #142391, and I paste it here again for reference

Please add it to the PR description and share the numbers that you've observed before and after the change.

> I do not think the implementation is flawed. If there exists a working parallel sort in the C++ standard library, why reinvent the wheel?

Didn't you mention that it works for the ascending but not the descending case? (Though I see you have a workaround for it now.)

malfet · 2025-05-13T23:11:26Z

aten/src/ATen/native/cpu/SortingKernel.cpp

@@ -19,6 +19,9 @@
 #ifdef USE_FBGEMM
 #include <fbgemm/Utils.h>
 #endif
+#if __has_include(<parallel/algorithm>) && defined(_OPENMP)
+#include <parallel/algorithm>


Is there a C++ standard that says this header will be available in C++20? Or C++23? If not, please don't rely on this header

annop-w · 2025-05-21T11:18:48Z

We will be looking at implementing a solution based on at::parallel_for.

pytorch-bot bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label Mar 28, 2025

pytorch-bot bot added the topic: not user facing topic category label Mar 28, 2025

pytorch-bot bot added the ciflow/linux-aarch64 linux aarch64 CI workflow label Mar 28, 2025

pytorchbot added the open source label Mar 28, 2025

annop-w force-pushed the fix_sort branch from 40871d6 to d695a2b Compare March 28, 2025 17:01

pytorch-bot bot removed the ciflow/linux-aarch64 linux aarch64 CI workflow label Mar 28, 2025

malfet added the ciflow/trunk Trigger trunk jobs on your pull request label Mar 29, 2025

annop-w force-pushed the fix_sort branch from d695a2b to 55f6252 Compare March 31, 2025 10:21

pytorch-bot bot removed the ciflow/trunk Trigger trunk jobs on your pull request label Mar 31, 2025

cyyever reviewed Mar 31, 2025

annop-w force-pushed the fix_sort branch from 55f6252 to 13eb689 Compare March 31, 2025 13:28

cyyever reviewed Mar 31, 2025

annop-w changed the title ~~Parallelize sort for GCC build~~ Parallelize sort using libstdc++ parallel mode Mar 31, 2025

cyyever requested a review from Skylion007 March 31, 2025 16:08

drisspg added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Apr 3, 2025

pytorchmergebot force-pushed the fix_sort branch from 13eb689 to 5ec1947 Compare April 7, 2025 16:01

cyyever added ciflow/mps Run MPS tests (subset of trunk) ciflow/s390 s390x-related CI jobs labels Apr 8, 2025

cyyever requested a review from malfet April 16, 2025 00:33

annop-w force-pushed the fix_sort branch from 5ec1947 to 8419798 Compare May 1, 2025 10:27

pytorch-bot bot removed ciflow/mps Run MPS tests (subset of trunk) ciflow/s390 s390x-related CI jobs labels May 1, 2025

annop-w requested a review from digantdesai May 1, 2025 10:38

malfet requested changes May 1, 2025

annop-w requested a review from malfet May 1, 2025 20:21

pytorchmergebot force-pushed the fix_sort branch from 8419798 to 25fd67c Compare May 13, 2025 11:10

pytorch-bot bot added the ciflow/linux-aarch64 linux aarch64 CI workflow label May 13, 2025

fadara01 added module: arm Related to ARM architectures builds of PyTorch. Includes Apple M1 and removed ciflow/linux-aarch64 linux aarch64 CI workflow labels May 13, 2025

malfet requested changes May 13, 2025

annop-w closed this May 21, 2025
