-
Notifications
You must be signed in to change notification settings - Fork 24.2k
Parallelize sort using libstdc++ parallel mode #150195
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/150195
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit 25fd67c with merge base 7243c69 ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
@pytorchbot label "topic: not user facing" |
@pytorchbot label "ciflow/linux-aarch64" |
Can we use c++17 parallel sort? |
@cyyever That has dependency on tbb which we would like to avoid adding. |
@annop-w I see, but is it possible to use them for Clang and MSVC? |
@cyyever It works for clang and gcc. I have not tried MSVC. But again, there is dependency on tbb and if we would like to avoid adding libtbb dependency in PyTorch, then we aren't able to use the parallel execution policy. |
@@ -146,6 +149,25 @@ static inline void sort_kernel_impl(const value_accessor_t& value_accessor, | |||
auto composite_accessor = CompositeRandomAccessorCPU< | |||
value_accessor_t, indices_accessor_t | |||
>(value_accessor, indices_accessor); | |||
#if __has_include(<parallel/algorithm>) && defined(_OPENMP) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we store the comparator into a variable by descending
to reduce the number of branches?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that is a bit out of scope for this PR. It can be done easily in a separate PR though :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a valid ask. let's figure out the sorting function first and just call it once?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Can you provide benchmark numbers before and after changing? |
I see ~9.7x speedup with 16 threads on NEOVERSE V1 when sorting ~50000 elements. |
@pytorchbot rebase |
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here |
Successfully rebased |
13eb689
to
5ec1947
Compare
@malfet Could I please get a review ? |
@digantdesai Could you please help taking a look ? The checks passed and the failing one looks unrelated to me. |
@malfet Could you please review this one ? Thank you. |
@cyyever @digantdesai Could you please review again ? Thank you. |
@malfet Have a look because performance gain fro proper C++17 parallel is desirable. |
But this PR is not using c++17 parallel sort, but rather some g++ extension? Also, all 3 issues that PR claims to fix has been closed, but I also struggle to understand how it could have fixed them in the first place |
@annop-w Can you please reference the doc? To the best of my knowledge functions which are part of the standard should rely only on c++ runtime shipped with compiler. If this is not the case, then I think one should write an in-house implementation using at::parallel primitive |
@malfet The 3 issues are closed because #149505 got reverted. I explained how this PR would also solve the issues here
|
@annop-w @malfet For your reference, libstdc++ document says
Currently, libc++ has no full Parallelism TS support, but I don't know the coverage of
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Requesting changes because:
- We should avoid using random compiler extensions in the codebase (especially when they work for ascending but not descending)
- Previous attempt to enable libstdc++ enable parallel sort resulted in correctness issue and this PR adds no unit test nor provide a satisfactory explanation(fact that something might lead to undefined behavior, in my mind is not sufficient, my strong suspicious on what happened are results were unstable compared to sequential sort when there were duplicated indices) why this use of extension will not suffer from the same fate
- Last but not least, this is a performance optimization, but no script are provided that one can use to validate that
To conclude, if looks like if one wants to enable parallel sort, they could not rely on C++17 primitives, as their implementation at least on Linux is currently flawed, but instead should write something that relies on at::parallel_for
primitive
@malfet I am sorry but I do not agree with your arguments
This is not just some random extensions. It is just GNU version of C++ standard library which is also compatible with Clang. For Clang, the C++ std. library implementation can be selected with
This is documented in the documentation as
I validated the change in this PR with script in #150094 and it passed the accuracy test. So, tests already exist.
Benchmarking script is already provided in #142391 and I paste it here again for reference
I do not think the implementation is flawed. If there exists a working parallel sort in the C++ std. library, why reinvent the wheel ? |
@pytorchbot rebase |
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here |
Resolve pytorch#149977, pytorch#149979, pytorch#150094. Previously, pytorch#149505 used libstdc++ parallel mode by enabling -D_GLIBCXX_PARALLEL. However, mixing source files compiled with and without parallel mode can lead to undefined behavior (See https://gcc.gnu.org/onlinedocs/libstdc++/manual/parallel_mode_using.html) We switch to using the specific paralell sort from <parallel/algorithm> when compiled with GCC compiler. Note that use of std::execution policy has dependency on libtbb and we thus decide to avoid that.
Successfully rebased |
@pytorchbot label "ciflow/linux-aarch64" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not just some random extensions. It is just GNU version of C++ standard library which is also compatible with Clang. For Clang, the C++ std. library implementation can be selected with
-stdlib=
flag. The__gnu_parallel version has the same function signature and accepts a comparator, i.e. it works for both ascending and descending sort.
It's not part of C++ standard, therefore we should not rely on that.
Previous attempt to enable libstdc++ enable parallel sort resulted in correctness issue and this PR adds no unit test nor provide a satisfactory explanation(fact that something might lead to undefined behavior, in my mind is not sufficient, my strong suspicious on what happened are results were unstable compared to sequential sort when there were duplicated indices) why this use of extension will not suffer from the same fate
I validated the change in this PR with script in #150094 and it passed the accuracy test. So, tests already exist.
Can you please explain what changed in the approach between this PR and #149505 that make test pass now but fail previously?
Benchmarking script is already provided in #142391 and I paste it here again for reference
Please add it to the PR description and share the numbers that you've observed before and after the change.
I do not think the implementation is flawed. If there exists a working parallel sort in the C++ std. library, why reinvent the wheel ?
Didn't you mention that it works for ascending but not descending case (though I see you have a workaround for it now)
@@ -19,6 +19,9 @@ | |||
#ifdef USE_FBGEMM | |||
#include <fbgemm/Utils.h> | |||
#endif | |||
#if __has_include(<parallel/algorithm>) && defined(_OPENMP) | |||
#include <parallel/algorithm> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a C++ standard that says this header will be available in C++20? Or C++23? If not, please don't rely on this header
Fixes #149977, #149979, #150094.
Previously, #149505 used libstdc++ parallel mode by enabling
-D_GLIBCXX_PARALLEL
. However, mixing source files compiled with and without parallel mode can lead to undefined behavior (See https://gcc.gnu.org/onlinedocs/libstdc++/manual/parallel_mode_using.html) We switch to using the specific paralell sort from<parallel/algorithm>
when compiled with GCC compiler. Note that use ofstd::execution
policy has dependency on libtbb and we thus decide to avoid that.cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @malfet @snadampal @milpuz01 @aditew01 @nikhil-arm @fadara01