8000 Parallelize sort using libstdc++ parallel mode by annop-w · Pull Request #150195 · pytorch/pytorch · GitHub
[go: up one dir, main page]

Skip to content

Parallelize sort using libstdc++ parallel mode #150195

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

annop-w
Copy link
Contributor
@annop-w annop-w commented Mar 28, 2025

Fixes #149977, #149979, #150094.

Previously, #149505 used libstdc++ parallel mode by enabling -D_GLIBCXX_PARALLEL. However, mixing source files compiled with and without parallel mode can lead to undefined behavior (See https://gcc.gnu.org/onlinedocs/libstdc++/manual/parallel_mode_using.html) We switch to using the specific paralell sort from <parallel/algorithm> when compiled with GCC compiler. Note that use of std::execution policy has dependency on libtbb and we thus decide to avoid that.

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @malfet @snadampal @milpuz01 @aditew01 @nikhil-arm @fadara01

Copy link
pytorch-bot bot commented Mar 28, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/150195

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 25fd67c with merge base 7243c69 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label Mar 28, 2025
@annop-w
Copy link
Contributor Author
annop-w commented Mar 28, 2025

@pytorchbot label "topic: not user facing"

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Mar 28, 2025
@nikhil-arm
Copy link
Collaborator

@pytorchbot label "ciflow/linux-aarch64"

@pytorch-bot pytorch-bot bot added the ciflow/linux-aarch64 linux aarch64 CI workflow label Mar 28, 2025
@annop-w
Copy link
Contributor Author
annop-w commented Mar 28, 2025

@malfet I hope this is an acceptable solution since you had concerns with adding tbb in #142391

@pytorch-bot pytorch-bot bot removed the ciflow/linux-aarch64 linux aarch64 CI workflow label Mar 28, 2025
@malfet malfet added the ciflow/trunk Trigger trunk jobs on your pull request label Mar 29, 2025
@pytorch-bot pytorch-bot bot removed the ciflow/trunk Trigger trunk jobs on your pull request label Mar 31, 2025
@cyyever
Copy link
Collaborator
cyyever commented Mar 31, 2025

Can we use c++17 parallel sort?

@annop-w
Copy link
Contributor Author
annop-w commented Mar 31, 2025

@cyyever That has dependency on tbb which we would like to avoid adding.

@cyyever
Copy link
Collaborator
cyyever commented Mar 31, 2025

@annop-w I see, but is it possible to use them for Clang and MSVC?

@annop-w
Copy link
Contributor Author
annop-w commented Mar 31, 2025

@cyyever It works for clang and gcc. I have not tried MSVC. But again, there is dependency on tbb and if we would like to avoid adding libtbb dependency in PyTorch, then we aren't able to use the parallel execution policy.

@@ -146,6 +149,25 @@ static inline void sort_kernel_impl(const value_accessor_t& value_accessor,
auto composite_accessor = CompositeRandomAccessorCPU<
value_accessor_t, indices_accessor_t
>(value_accessor, indices_accessor);
#if __has_include(<parallel/algorithm>) && defined(_OPENMP)
Copy link
Collaborator
@cyyever cyyever Mar 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we store the comparator into a variable by descending to reduce the number of branches?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that is a bit out of scope for this PR. It can be done easily in a separate PR though :)

Copy link
Contributor
@digantdesai digantdesai Apr 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a valid ask. let's figure out the sorting function first and just call it once?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@cyyever
Copy link
Collaborator
cyyever commented Mar 31, 2025

Can you provide benchmark numbers before and after changing?

@annop-w annop-w changed the title Parallelize sort for GCC build Parallelize sort using libstdc++ parallel mode Mar 31, 2025
@annop-w
Copy link
Contributor Author
annop-w commented Mar 31, 2025

Can you provide benchmark numbers before and after changing?

I see ~9.7x speedup with 16 threads on NEOVERSE V1 when sorting ~50000 elements.

@cyyever cyyever requested a review from Skylion007 March 31, 2025 16:08
@drisspg drisspg added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Apr 3, 2025
@annop-w
Copy link
Contributor Author
annop-w commented Apr 7, 2025

@pytorchbot rebase

@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Copy link
Collaborator

Successfully rebased fix_sort onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout fix_sort && git pull --rebase)

@annop-w
Copy link
Contributor Author
annop-w commented Apr 8, 2025

@malfet Could I please get a review ?

@cyyever cyyever added ciflow/mps Run MPS tests (subset of trunk) ciflow/s390 s390x-related CI jobs labels Apr 8, 2025
@annop-w
Copy link
Contributor Author
annop-w commented Apr 15, 2025

@digantdesai Could you please help taking a look ? The checks passed and the failing one looks unrelated to me.

@cyyever cyyever requested a review from malfet April 16, 2025 00:33
@annop-w
Copy link
Contributor Author
annop-w commented Apr 28, 2025

@malfet Could you please review this one ? Thank you.

@pytorch-bot pytorch-bot bot removed ciflow/mps Run MPS tests (subset of trunk) ciflow/s390 s390x-related CI jobs labels May 1, 2025
@annop-w
Copy link
Contributor Author
annop-w commented May 1, 2025

@cyyever @digantdesai Could you please review again ? Thank you.

@cyyever
Copy link
Collaborator
cyyever commented May 1, 2025

@malfet Have a look because performance gain fro proper C++17 parallel is desirable.

@annop-w annop-w requested a review from digantdesai May 1, 2025 10:38
@malfet
F438 Copy link
Contributor
malfet commented May 1, 2025

@malfet Have a look because performance gain fro proper C++17 parallel is desirable.

But this PR is not using c++17 parallel sort, but rather some g++ extension?

Also, all 3 issues that PR claims to fix has been closed, but I also struggle to understand how it could have fixed them in the first place

@malfet
Copy link
Contributor
malfet commented May 1, 2025

@cyyever It works for clang and gcc. I have not tried MSVC. But again, there is dependency on tbb and if we would like to avoid adding libtbb dependency in PyTorch, then we aren't able to use the parallel execution policy.

@annop-w Can you please reference the doc? To the best of my knowledge functions which are part of the standard should rely only on c++ runtime shipped with compiler.

If this is not the case, then I think one should write an in-house implementation using at::parallel primitive

@annop-w
Copy link
Contributor Author
annop-w commented May 1, 2025

@malfet
Here is the doc for GNU https://gcc.gnu.org/onlinedocs/libstdc++/manual/parallel_mode_using.html.
I will look for docs for Clang.

The 3 issues are closed because #149505 got reverted. I explained how this PR would also solve the issues here

Previously, #149505 used libstdc++ parallel mode by enabling -D_GLIBCXX_PARALLEL. However, mixing source files compiled with and without parallel mode can lead to undefined behavior

@cyyever
Copy link
Collaborator
cyyever commented May 1, 2025

@annop-w @malfet For your reference, libstdc++ document says

 Note 3: The Parallel Algorithms have an external dependency on Intel TBB 2018 or later. 
If the <execution> header is included then -ltbb must be used to link to TBB. 

Currently, libc++ has no full Parallelism TS support, but I don't know the coverage of std::sort.
MSVC says

C++17's parallel algorithms library is complete. 
Complete doesn't mean that every algorithm is parallelized in every case. 
The most important algorithms have been parallelized. 
Execution policy signatures are provided even where the implementation doesn't 
parallelize algorithms. 
The central internal header, <yvals_core.h>, contains the following 
"Parallel Algorithms Notes": C++ allows an implementation to implement parallel 
algorithms as calls to the serial algorithms. This implementation parallelizes 
several common algorithm calls, but not all.

The following algorithms are parallelized:

    adjacent_difference, adjacent_find, all_of, any_of, count, 
count_if, equal, exclusive_scan, find, find_end, find_first_of, 
find_if, find_if_not, for_each, for_each_n, inclusive_scan, 
is_heap, is_heap_until, is_partitioned, is_sorted, 
is_sorted_until, mismatch, none_of, partition, reduce, 
remove, remove_if, replace, replace_if, search, search_n, 
set_difference, set_intersection, sort, stable_sort, transform, 
transform_exclusive_scan, transform_inclusive_scan, transform_reduce

Copy link
Contributor
@malfet malfet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requesting changes because:

  • We should avoid using random compiler extensions in the codebase (especially when they work for ascending but not descending)
  • Previous attempt to enable libstdc++ enable parallel sort resulted in correctness issue and this PR adds no unit test nor provide a satisfactory explanation(fact that something might lead to undefined behavior, in my mind is not sufficient, my strong suspicious on what happened are results were unstable compared to sequential sort when there were duplicated indices) why this use of extension will not suffer from the same fate
  • Last but not least, this is a performance optimization, but no script are provided that one can use to validate that

To conclude, if looks like if one wants to enable parallel sort, they could not rely on C++17 primitives, as their implementation at least on Linux is currently flawed, but instead should write something that relies on at::parallel_for primitive

@annop-w
Copy link
Contributor Author
annop-w commented May 1, 2025

@malfet I am sorry but I do not agree with your arguments

We should avoid using random compiler extensions in the codebase (especially when they work for ascending but not descending)

This is not just some random extensions. It is just GNU version of C++ standard library which is also compatible with Clang. For Clang, the C++ std. library implementation can be selected with -stdlib= flag. The__gnu_parallel version has the same function signature and accepts a comparator, i.e. it works for both ascending and descending sort.

Previous attempt to enable libstdc++ enable parallel sort resulted in correctness issue and this PR adds no unit test nor provide a satisfactory explanation(fact that something might lead to undefined behavior, in my mind is not sufficient, my strong suspicious on what happened are results were unstable compared to sequential sort when there were duplicated indices) why this use of extension will not suffer from the same fate

This is documented in the documentation as

Note that the _GLIBCXX_PARALLEL define may change the sizes and behavior of standard class templates such as std::search, and therefore one can only link code compiled with parallel mode and code compiled without parallel mode if no instantiation of a container is passed between the two translation units. Parallel mode functionality has distinct linkage, and cannot be confused with normal mode symbols.

I validated the change in this PR with script in #150094 and it passed the accuracy test. So, tests already exist.

Last but not least, this is a performance optimization, but no script are provided that one can use to validate that

Benchmarking script is already provided in #142391 and I paste it here again for reference

import torch
import torch.autograd.profiler as profiler

torch.manual_seed(0)

N = 50000
x = torch.randn(N, dtype=torch.float)

with profiler.profile(with_stack=True, profile_memory=False, record_shapes=True) as prof:
    for i in range(1000):
        _, _ = torch.sort(x)

print(prof.key_averages(group_by_input_shape=True).table(sort_by='self_cpu_time_total', row_limit=10))

if one wants to enable parallel sort, they could not rely on C++17 primitives, as their implementation at least on Linux is currently flawed, but instead should write something that relies on at::parallel_for primitive

I do not think the implementation is flawed. If there exists a working parallel sort in the C++ std. library, why reinvent the wheel ?

@annop-w annop-w requested a review from malfet May 1, 2025 20:21
@fadara01
Copy link
Collaborator

@pytorchbot rebase

@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

Resolve pytorch#149977, pytorch#149979, pytorch#150094.
Previously, pytorch#149505 used libstdc++ parallel mode by enabling
-D_GLIBCXX_PARALLEL. However, mixing source files compiled with
and without parallel mode can lead to undefined behavior (See
https://gcc.gnu.org/onlinedocs/libstdc++/manual/parallel_mode_using.html)
We switch to using the specific paralell sort from <parallel/algorithm>
when compiled with GCC compiler. Note that use of std::execution policy
has dependency on libtbb and we thus decide to avoid that.
@pytorchmergebot
Copy link
Collaborator

Successfully rebased fix_sort onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout fix_sort && git pull --rebase)

@fadara01
Copy link
Collaborator

@pytorchbot label "ciflow/linux-aarch64"

@pytorch-bot pytorch-bot bot added the ciflow/linux-aarch64 linux aarch64 CI workflow label May 13, 2025
@fadara01 fadara01 added module: arm Related to ARM architectures builds of PyTorch. Includes Apple M1 and removed ciflow/linux-aarch64 linux aarch64 CI workflow labels May 13, 2025
Copy link
Contributor
@malfet malfet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not just some random extensions. It is just GNU version of C++ standard library which is also compatible with Clang. For Clang, the C++ std. library implementation can be selected with -stdlib= flag. The__gnu_parallel version has the same function signature and accepts a comparator, i.e. it works for both ascending and descending sort.

It's not part of C++ standard, therefore we should not rely on that.

Previous attempt to enable libstdc++ enable parallel sort resulted in correctness issue and this PR adds no unit test nor provide a satisfactory explanation(fact that something might lead to undefined behavior, in my mind is not sufficient, my strong suspicious on what happened are results were unstable compared to sequential sort when there were duplicated indices) why this use of extension will not suffer from the same fate

I validated the change in this PR with script in #150094 and it passed the accuracy test. So, tests already exist.

Can you please explain what changed in the approach between this PR and #149505 that make test pass now but fail previously?

Benchmarking script is already provided in #142391 and I paste it here again for reference

Please add it to the PR description and share the numbers that you've observed before and after the change.

I do not think the implementation is flawed. If there exists a working parallel sort in the C++ std. library, why reinvent the wheel ?

Didn't you mention that it works for ascending but not descending case (though I see you have a workaround for it now)

@@ -19,6 +19,9 @@
#ifdef USE_FBGEMM
#include <fbgemm/Utils.h>
#endif
#if __has_include(<parallel/algorithm>) && defined(_OPENMP)
#include <parallel/algorithm>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a C++ standard that says this header will be available in C++20? Or C++23? If not, please don't rely on this header

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
module: arm Related to ARM architectures builds of PyTorch. Includes Apple M1 module: cpu CPU specific problem (e.g., perf, algorithm) open source topic: not user facing topic category triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module
Projects
None yet
Development

Successfully merging this pull request may close these issues.

UNSTABLE inductor / linux-jammy-cpu-py3.9-gcc11-inductor / test (cpu_inductor_torchbench)
9 participants
0