Extend vectorization with SVE(ARM) with Torch Compile (Inductor) #134672
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/134672
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit fd63fbd with merge base f69bf00.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "module: arm"
cc: @maajidkhann
@pytorchbot label "ciflow/linux-aarch64"
Can't add following labels to PR: ciflow/linux-aarch64. Please ping one of the reviewers for help.
Force-pushed from b87e57d to 67707f3.
@aditew01 Thanks for the PR. This enables the compile flow with SVE, and overall the changes look good. We are also running some models from TorchBench with your changes to see the gains compared to the compile() + NEON flow. Can you look into the comments on your PR, where there are a few suggestions from reviewers, and push the updated changes? Once done, I can cherry-pick your commit onto the original PR (#119571). The original PR is not merged yet, and it would be good to have all the changes, including yours, in one place; it makes things easier for reviewers, as they have the context there. We were also internally working on enabling compile with SVE and have identified places where we need to implement more operators in the Vec backend for SVE. Today they fall back to NEON, as they don't have an SVE implementation. Once this change is done and verified internally, we will add it to our main PR on top of your commit.
@maajidkhann thanks for the ack.
Referencing pytorch/torch/_inductor/cpu_vec_isa.py, line 166 in b728e3c.
Writing it here for full context and visibility.
Yeah, right now our whole development was done and tested on Graviton 3 (which is SVE256). We are currently working on a follow-up PR/commit that would extend the SVE Vec backend support to SVE128 and SVE512. For SVE128 we are using Grace/Graviton 4 CPUs to validate the tests, and for SVE512 we are using Fugaku instances that come with SVE512. This PR doesn't require any extra SVE code to be added, as SVE code is vector-length agnostic (VLA); we just have to add support for SVE128 and SVE512 and register the kernels for those lengths. The PR is almost ready, but currently we are facing a segfault when validating the changes on Graviton 2 (non-SVE) machines. We have to maintain backward compatibility with machines without SVE support as well. We expect to get this issue fixed soon and will push a follow-up commit. Currently the ARM CPU market only offers three different SVE vector lengths (SVE128/256/512), and we don't forecast any CPU coming with a much bigger VL, though up to 2048 bits is technically possible. So with this PR we can cater the SVE backend to all ARM CPUs in the market.
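To illustrate the vector-length-agnostic point above: the same SVE source compiles once and runs correctly at any hardware vector length, because the loop step and the tail predicate are derived from the VL at run time. A minimal sketch, not code from this PR (the function name is illustrative; compile with `-march=armv8-a+sve`):

```cpp
// Vector-length-agnostic element-wise add: the same binary runs on
// SVE128, SVE256, or SVE512 hardware without recompilation.
#include <arm_sve.h>
#include <cstdint>

void vla_add(const float* a, const float* b, float* out, int64_t n) {
    // svcntw() returns how many 32-bit lanes this CPU's vectors hold.
    for (int64_t i = 0; i < n; i += svcntw()) {
        svbool_t pg = svwhilelt_b32(i, n);        // predicate masks the tail lanes
        svfloat32_t va = svld1(pg, a + i);        // predicated loads
        svfloat32_t vb = svld1(pg, b + i);
        svst1(pg, out + i, svadd_x(pg, va, vb));  // predicated add + store
    }
}
```

This is why extending the backend to SVE128/SVE512 is mostly a matter of registering the kernels for those lengths rather than writing new vector code.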
@maajidkhann ack. Thanks for the detailed reply, it clarifies the roadmap.
Yes, it should be a simple change to enable torch.compile for SVE128 later on.
Overall I think you are doing the right implementation of a dispatch here, but do you mind breaking it down into two PRs: one that adds the SVE backend and another that enables it in torch.compile?
[Edit] I see you're already extending #119571 here, so perhaps let's land them in order...
CMakeLists.txt (Outdated)
```cmake
if(CMAKE_SYSTEM_PROCESSOR MATCHES "aarch64")
  include(CheckCSourceCompiles)
  set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -march=armv8-a+sve")
  check_c_source_compiles("#include <arm_sve.h>
    int main() {
      svfloat64_t a;
      a = svdup_n_f64(0);
      return 0;
    }" COMPILER_HAS_ARM_SVE)

  if(COMPILER_HAS_ARM_SVE)
    string(APPEND CMAKE_CXX_FLAGS " -DCOMPILER_HAS_ARM_SVE")
  endif()
  set(CMAKE_C_FLAGS ${ORIGINAL_CMAKE_C_FLAGS})
endif()
```
This is a compile-time check and is somewhat irrelevant during the compile, isn't it?
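For context on the distinction this comment raises: the CMake probe above only verifies that the toolchain can compile SVE intrinsics; whether the machine the build later runs on has SVE, and at what vector length, is a separate runtime question. A sketch of the runtime side (not from this PR; requires a compiler targeting `-march=armv8-a+sve`):

```cpp
// Query the hardware SVE vector length at run time.
#include <arm_sve.h>
#include <cstdio>

int main() {
    // svcntb() returns the vector length in bytes: 16 on SVE128 hardware
    // (e.g. Grace), 32 on SVE256 (e.g. Graviton 3), 64 on SVE512.
    std::printf("SVE vector length: %llu bits\n",
                static_cast<unsigned long long>(svcntb() * 8));
    return 0;
}
```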
Force-pushed from b728e3c to 2493b1f.
@pytorchbot label "ciflow/linux-aarch64"
Can't add following labels to PR: ciflow/linux-aarch64. Please ping one of the reviewers for help.
@aditew01, please rebase, and please notice that the compile speedup dropped after the SVE eager changes landed, probably because eager is now faster. Let's rebase and merge those changes to make sure the performance speedups are still there.
@aditew01 The main OSS PR (#119571) is now merged. The SVE backend commits that this PR was based on were changed in the original PR (#119571): there was some reordering, and additional commits went in later. I think the easier option would be to just cherry-pick your 3 commits on top of the latest PyTorch main and force-push the changes into this PR.
Force-pushed from 9aba52f to d047382.
@pytorchbot merge
Merge failed. Reason: Approvers from one of the following sets are needed:
@malfet can this be merged now?
Change-Id: I2a65d40bfdb843e426f2763f980f69f0f6a9f5bf
```diff
@@ -28,7 +28,7 @@
 #include <c10/util/TypeCast.h>
 #include <torch/csrc/inductor/aoti_torch/c/shim.h>

-#if defined(CPU_CAPABILITY_AVX512) || defined(CPU_CAPABILITY_AVX2) || defined(CPU_CAPABILITY_ZVECTOR) || defined(CPU_CAPABILITY_NEON) || defined(CPU_CAPABILITY_VSX)
+#if defined(CPU_CAPABILITY_AVX512) || defined(CPU_CAPABILITY_AVX2) || defined(CPU_CAPABILITY_ZVECTOR) || defined(CPU_CAPABILITY_NEON) || defined(CPU_CAPABILITY_VSX) || defined(CPU_CAPABILITY_SVE256)
```
It feels a bit weird to define CPU_CAPABILITY_SVE256 just for this macro, but sure, why not.
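For background on what the guard gates: when one of the `CPU_CAPABILITY_*` macros is defined, inductor's generated C++ can use PyTorch's `at::vec::Vectorized` wrappers instead of scalar fallbacks. A rough sketch of the kind of kernel this enables (illustrative only, not inductor's actual output; assumes a PyTorch source build, where `Vectorized<float>` holds 8 lanes under CPU_CAPABILITY_SVE256):

```cpp
// Illustrative at::vec kernel: y += s * x, vectorized at the width of
// whichever CPU_CAPABILITY_* ISA this translation unit was built for.
#include <ATen/cpu/vec/vec.h>
#include <cstdint>

void scaled_add(const float* x, float* y, float s, int64_t n) {
    using Vec = at::vec::Vectorized<float>;
    int64_t i = 0;
    for (; i + Vec::size() <= n; i += Vec::size()) {
        Vec v = Vec::loadu(x + i) * Vec(s) + Vec::loadu(y + i);
        v.store(y + i);
    }
    for (; i < n; ++i) {  // scalar tail for the remainder
        y[i] += s * x[i];
    }
}
```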
@pytorchbot merge -f "Lint + aarch64 builds are green"
Merge started
Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
```diff
@@ -338,7 +356,10 @@ def valid_vec_isa_list() -> List[VecISA]:
     elif arch == "ppc64le":
         isa_list.append(VecVSX())
     elif arch == "aarch64":
-        isa_list.append(VecNEON())
+        if torch.cpu._is_arm_sve_supported():
+            isa_list.append(VecSVE())
```
Is this check sufficient? Do we need to add a check like `cpuinfo_get_max_arm_sve_length() == 256` (see https://github.com/pytorch/pytorch/pull/119571/files#diff-54c373491da67eb31c3777457d7b043a49dd3966412edfd928ffd2013e4d6a54R39-R47), since the macros of `VecSVE` are `"CPU_CAPABILITY_SVE", "CPU_CAPABILITY_SVE256", "AT_BUILD_ARM_VEC256_WITH_SLEEF"`?
Yes, I believe so. I think using `if torch.cpu._is_arm_sve256_supported():` would be appropriate.
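To make the concern concrete: detecting SVE alone is not enough for an ISA named SVE256; the hardware vector length must be exactly 256 bits, or SVE128 machines such as Grace would wrongly select it. A hedged C++ sketch of the stricter check, assuming the cpuinfo helpers referenced in the linked diff from #119571:

```cpp
// Stricter capability check: require SVE *and* a 256-bit vector length.
#include <cpuinfo.h>

bool supports_sve256() {
    return cpuinfo_initialize()                      // must be called before any query
        && cpuinfo_has_arm_sve()                     // any SVE at all?
        && cpuinfo_get_max_arm_sve_length() == 256;  // vector length in bits
}
```

This is essentially the check the follow-up discussion converged on; the crashes fixed later in #146207 came from advertising SVE256 on SVE128 hardware.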
…4672)
**Motivation** Enable SVE vectorization with `torch.compile`. Extends PR: #119571.
* This PR enables vectorization for the codegen part using SVE-256 (vec length)
* The changes can be extended to other SVE vec lengths

I've done some comparisons against the existing NEON implementation with the SVE vectorization route enabled for `torch.compile`. Test results are for 8 cores on ARM Neoverse_V1 (screenshot: https://github.com/user-attachments/assets/6961fbea-8285-4ca3-b92e-934a2db50ee2). It's worth mentioning that for the standalone `SiLU` op there's a ~1.8x speedup with `torch.compile`.

Pull Request resolved: #134672
Approved by: https://github.com/jgong5, https://github.com/malfet
This seems to also cause accuracy issues when running `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=6 python test/inductor/test_torchinductor_opinfo.py TestInductorOpInfoCUDA.test_comprehensive_new_full_cuda_float16` on Grace+H100.
This PR removes `torch.cpu._is_arm_sve_supported()` and replaces it with the stable `torch.backends.cpu.get_cpu_capability()`. I should have reviewed #134672 more thoroughly, because it introduced a duplicate but slightly different API for detecting CPU architectures, which resulted in runtime crashes on systems that support SVE128 rather than SVE256. Fixes #145441.
Pull Request resolved: #146207
Approved by: https://github.com/angelayi
Motivation
Enable SVE vectorization with `torch.compile`. Extends PR: #119571.
I've done some comparisons against the existing NEON implementation with the SVE vectorization route enabled for `torch.compile`. Test results are for 8 cores on ARM Neoverse_V1.
It's worth mentioning that for the standalone `SiLU` op there's a ~1.8x speedup with `torch.compile`.
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @malfet @snadampal @milpuz01 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @rec