
Extend vectorization with SVE(ARM) with Torch Compile (Inductor) #134672


Closed. aditew01 wants to merge 4 commits from the aditew01/torchcompile_sve branch.

Conversation

@aditew01 (Collaborator) commented Aug 28, 2024

Motivation
Enable SVE vectorization with torch.compile
Extends PR: #119571

  • This PR enables vectorization for the codegen part using SVE-256 (vector length)
  • The changes can be extended to other SVE vector lengths

I've compared the SVE-vectorization-enabled route for torch.compile against the existing NEON implementation.
Test results are for 8 cores on an ARM Neoverse-V1.

[Screenshot: NEON vs. SVE benchmark results, 2024-08-28]

It's worth mentioning that for the standalone SiLU op there's a ~1.8x speedup with torch.compile.
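
For context, here is a minimal sketch of the kind of eager-vs-compile comparison described above; the tensor shape, iteration count, and timing loop are illustrative assumptions, not the actual benchmark setup:

```python
# Illustrative micro-benchmark sketch; shape and iteration count are
# assumptions, not the setup used for the numbers above.
import time
import torch

x = torch.randn(1024, 1024)
silu = torch.nn.SiLU()
compiled_silu = torch.compile(silu)

def bench(fn, iters=100):
    fn(x)  # warm-up; triggers Inductor codegen for the compiled variant
    start = time.perf_counter()
    for _ in range(iters):
        fn(x)
    return (time.perf_counter() - start) / iters

print(f"eager:    {bench(silu) * 1e6:.1f} us/iter")
print(f"compiled: {bench(compiled_silu) * 1e6:.1f} us/iter")
```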

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @malfet @snadampal @milpuz01 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @rec

pytorch-bot (bot) added the module: cpu, module: inductor, and release notes: sparse labels Aug 28, 2024
pytorch-bot (bot) commented Aug 28, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/134672

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit fd63fbd with merge base f69bf00:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@aditew01 (Collaborator, Author)

@pytorchbot label "module: arm"

pytorch-bot (bot) added the module: arm label Aug 28, 2024
@aditew01 (Collaborator, Author)

cc: @maajidkhann

@aditew01 (Collaborator, Author)

@pytorchbot label "ciflow/linux-aarch64"

pytorch-bot (bot) commented Aug 28, 2024

Can't add following labels to PR: ciflow/linux-aarch64. Please ping one of the reviewers for help.

@aditew01 force-pushed the aditew01/torchcompile_sve branch from b87e57d to 67707f3 on August 28, 2024 15:52
@maajidkhann (Contributor)

> cc: @maajidkhann

@aditew01 Thanks for the PR. This enables the compile flow with SVE, and overall the changes look good. We are also running some models from TorchBench with your changes to see the gains compared to the compile() + NEON flow.

Can you look into the comments on your PR, where there are a few suggestions from reviewers, and push the updated changes? Once done, I can cherry-pick your commit onto the original PR (#119571).

The original PR is not merged yet, and it would be good to have all the changes, including yours, in one place. That makes it easy for reviewers, since they have the context there.

We were also internally working on enabling compile with SVE and have identified places where we need to implement more operators in the Vec backend for SVE. Today they fall back to NEON, as they don't have an SVE implementation. Once this change is done and verified internally, we will add it to our main PR on top of your commit.

@aditew01 (Collaborator, Author)

@maajidkhann thanks for the ack.
I had a question about the different SVE vector lengths we may want to enable. Currently I see only 256-bit supported:

#if defined(CPU_CAPABILITY_SVE256)

Are there any plans for supporting other SVE vector lengths? Some of the code, like the following, could be made more generic in that case:

#if defined(__aarch64__) && !defined(C10_MOBILE) && !defined(__CUDACC__) && defined(CPU_CAPABILITY_SVE)

class VecSVE(VecISA):

Writing it here for full context and visibility.

@maajidkhann (Contributor) commented Aug 29, 2024

> @maajidkhann thanks for the ack. I had a question about the different SVE vector lengths we may want to enable. Currently I see only 256-bit supported:
>
> #if defined(CPU_CAPABILITY_SVE256)
>
> Are there any plans for supporting other SVE vector lengths? Some of the code, like the following, could be made more generic in that case:
>
> #if defined(__aarch64__) && !defined(C10_MOBILE) && !defined(__CUDACC__) && defined(CPU_CAPABILITY_SVE)
>
> class VecSVE(VecISA):
>
> Writing it here for full context and visibility.

Yeah, right now all of our development and testing has been on Graviton 3 (which is SVE256). We are currently working on a follow-up PR/commit that would extend SVE Vec backend support to SVE128 and SVE512.

For SVE128 we are using Grace/Graviton 4 CPUs to validate the tests, and for SVE512 we are using Fugaku instances, which come with SVE512.

This PR doesn't require any extra SVE code to be added, since SVE code is vector-length agnostic (VLA); we just have to add support for SVE128 and SVE512 and register the kernels for that. The PR is almost ready, but we are currently facing a segfault when validating the changes on Graviton 2 (non-SVE) machines. We have to maintain backward compatibility with non-SVE machines as well. We expect to fix this issue soon and will push a follow-up commit.

Currently the ARM CPU market offers only three different SVE lengths (SVE-128/256/512), and we don't forecast any CPU coming with a much bigger VL, though up to 2048 bits is technically possible. So with this PR we can cater the SVE backend to all ARM CPUs on the market.
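
As a rough sketch of how the detected vector capability surfaces at runtime, the stable API that comes up later in this thread can be queried directly; the exact strings returned on each CPU are assumptions here, not documented guarantees:

```python
# Sketch: check which CPU capability this PyTorch build detected at runtime.
# The returned strings for ARM machines ("SVE256" on Graviton 3, "DEFAULT"
# on non-SVE aarch64) are assumptions based on the discussion in this thread.
import torch

cap = torch.backends.cpu.get_cpu_capability()
if cap == "SVE256":
    print("256-bit SVE detected; the VecSVE/Inductor SVE path can be used")
else:
    print(f"capability: {cap}; falling back to NEON vectorization on aarch64")
```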

@aditew01 (Collaborator, Author)

@maajidkhann ack. Thanks for the detailed reply; it clarifies the roadmap.
I believe once the SVE128 commit is up, we can make the torch.compile route agnostic to the vector length and integrate SVE128 without many code changes.

@maajidkhann (Contributor) commented Aug 29, 2024

> @maajidkhann ack. Thanks for the detailed reply; it clarifies the roadmap. I believe once the SVE128 commit is up, we can make the torch.compile route agnostic to the vector length and integrate SVE128 without many code changes.

Yes, it should be a simple change to enable torch.compile for SVE128 later on.
It should just be a one-line change here:
67707f3#diff-39665ae5ca878523e2f73397eec7080a5eff86c46c6bd01377c0be235d97109cR169

@janeyx99 added the triaged label Aug 30, 2024
@malfet (Contributor) left a comment


Overall I think you are doing the right implementation of a dispatch here, but do you mind breaking it down into two PRs: one that adds the SVE backend and another that enables it in torch.compile?

[Edit] I see you're already extending #119571 here, so perhaps let's land them in order...

CMakeLists.txt (outdated)
Comment on lines 1153 to 1168
if(CMAKE_SYSTEM_PROCESSOR MATCHES "aarch64")
  include(CheckCSourceCompiles)
  set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -march=armv8-a+sve")
  check_c_source_compiles("#include <arm_sve.h>
  int main() {
    svfloat64_t a;
    a = svdup_n_f64(0);
    return 0;
  }" COMPILER_HAS_ARM_SVE)

  if(COMPILER_HAS_ARM_SVE)
    string(APPEND CMAKE_CXX_FLAGS " -DCOMPILER_HAS_ARM_SVE")
  endif()
  set(CMAKE_C_FLAGS ${ORIGINAL_CMAKE_C_FLAGS})
endif()

Contributor:


This is a compile-time check and is somewhat irrelevant at runtime, isn't it?

Contributor:


I have made the changes and pushed them to the main SVE PR:
#119571

Here's the exact commit with the change:
47dcc02

@aditew01 force-pushed the aditew01/torchcompile_sve branch from b728e3c to 2493b1f on September 4, 2024 13:29
linux-foundation-easycla (bot) commented Sep 4, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@cfRod (Collaborator) commented Sep 12, 2024

@pytorchbot label "ciflow/linux-aarch64"

pytorch-bot (bot) commented Sep 12, 2024

Can't add following labels to PR: ciflow/linux-aarch64. Please ping one of the reviewers for help.

@malfet (Contributor) commented Sep 19, 2024

@aditew01, please rebase, and please notice that the compile speedup dropped after the SVE eager changes were landed, probably because eager is now faster. Let's rebase and merge those changes to make sure the performance speedups are still there.

@maajidkhann (Contributor)

> @aditew01, please rebase, and please notice that the compile speedup dropped after the SVE eager changes were landed, probably because eager is now faster. Let's rebase and merge those changes to make sure the performance speedups are still there.

@aditew01 The main OSS PR (#119571) is now merged.

The SVE backend commits this PR was based on were changed in the original PR (#119571): there was some reordering, and additional commits went in later.

I think the easier option would be to just cherry-pick your 3 commits on top of the latest PyTorch main and force-push the changes into this PR.
[Screenshot: the commits to cherry-pick]

@aditew01 force-pushed the aditew01/torchcompile_sve branch from 9aba52f to d047382 on October 1, 2024 11:21
@aditew01 (Collaborator, Author) commented Oct 3, 2024

@jgong5 @malfet can I please get a review

@aditew01 (Collaborator, Author) commented Oct 9, 2024

@pytorchbot merge

pytorch-bot (bot) added the ciflow/trunk label Oct 9, 2024
@pytorchmergebot (Collaborator)

Merge failed

Reason: Approvers from one of the following sets are needed:

  • superuser (pytorch/metamates)
  • Core Reviewers (mruberry, lezcano, Skylion007, ngimel, peterbell10, ...)
  • Core Maintainers (soumith, gchanan, ezyang, dzhulgakov, malfet, ...)
Details for Dev Infra team: raised by workflow job

Failing merge rule: Core Maintainers

@aditew01 (Collaborator, Author) commented Oct 9, 2024

@malfet can this be merged now?

Change-Id: I2a65d40bfdb843e426f2763f980f69f0f6a9f5bf
@@ -28,7 +28,7 @@
 #include <c10/util/TypeCast.h>
 #include <torch/csrc/inductor/aoti_torch/c/shim.h>

-#if defined(CPU_CAPABILITY_AVX512) || defined(CPU_CAPABILITY_AVX2) || defined(CPU_CAPABILITY_ZVECTOR) || defined(CPU_CAPABILITY_NEON) || defined(CPU_CAPABILITY_VSX)
+#if defined(CPU_CAPABILITY_AVX512) || defined(CPU_CAPABILITY_AVX2) || defined(CPU_CAPABILITY_ZVECTOR) || defined(CPU_CAPABILITY_NEON) || defined(CPU_CAPABILITY_VSX) || defined(CPU_CAPABILITY_SVE256)
Contributor:


It feels a bit weird to define CPU_CAPABILITY_SVE256 just for this macro, but sure, why not

@malfet (Contributor) commented Oct 10, 2024

@pytorchbot merge -f "Lint + aarch64 builds are green"

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

@@ -338,7 +356,10 @@ def valid_vec_isa_list() -> List[VecISA]:
     elif arch == "ppc64le":
         isa_list.append(VecVSX())
     elif arch == "aarch64":
         isa_list.append(VecNEON())
+        if torch.cpu._is_arm_sve_supported():
+            isa_list.append(VecSVE())
@CaoE (Collaborator) commented Oct 12, 2024


Is this check sufficient? Do we need to add a check like cpuinfo_get_max_arm_sve_length() == 256 (https://github.com/pytorch/pytorch/pull/119571/files#diff-54c373491da67eb31c3777457d7b043a49dd3966412edfd928ffd2013e4d6a54R39-R47), since the macros for VecSVE are "CPU_CAPABILITY_SVE", "CPU_CAPABILITY_SVE256", and "AT_BUILD_ARM_VEC256_WITH_SLEEF"?

@aditew01 (Collaborator, Author)


Yes, I believe so. I think using if torch.cpu._is_arm_sve256_supported(): would be appropriate.
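
A hypothetical sketch of what that stricter registration check could look like; `_is_arm_sve256_supported()` did not exist at the time, so the sketch probes for it defensively rather than calling it directly:

```python
# Hypothetical sketch of the stricter check proposed above. The
# _is_arm_sve256_supported() name is an assumption from this thread,
# so we probe for it instead of assuming the API exists.
import platform
import torch

def should_register_vec_sve() -> bool:
    if platform.machine() != "aarch64":
        return False
    # VecSVE is compiled with CPU_CAPABILITY_SVE256, so registering it from
    # a generic "has SVE" check could mis-dispatch on SVE128 hardware.
    probe = getattr(torch.cpu, "_is_arm_sve256_supported", None)
    return bool(probe and probe())
```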

Contributor:


@CaoE @aditew01 Thanks for your suggestions; I agree. I will include this check in my next SVE PR (add support for SVE128/SVE512). I had to implement the API from here: aten/src/ATen/cpu/Utils.cpp

Here's a snapshot of the logic (I have verified it's working):
[Screenshot: detection logic in aten/src/ATen/cpu/Utils.cpp]

jackzhxng pushed a commit that referenced this pull request Oct 16, 2024

Pull Request resolved: #134672
Approved by: https://github.com/jgong5, https://github.com/malfet
@nWEIdia (Collaborator) commented Jan 25, 2025

This seems to also cause accuracy issues when running:

PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=6 python test/inductor/test_torchinductor_opinfo.py TestInductorOpInfoCUDA.test_comprehensive_new_full_cuda_float16

on Grace+H100.

malfet added a commit that referenced this pull request Jan 31, 2025
This PR removes `torch.cpu._is_arm_sve_supported()` and replaces it with the stable `torch.backends.cpu.get_cpu_capability()`

I should have reviewed #134672 more thoroughly, because it introduced a duplicate but slightly different API for detecting CPU architectures, which resulted in runtime crashes on systems that support SVE128 rather than SVE256

Fixes #145441

ghstack-source-id: d9e42d1
Pull Request resolved: #146207
malfet added a commit that referenced this pull request Jan 31, 2025 (ghstack-source-id: 2ee05eb; Pull Request resolved: #146207)

pytorchmergebot pushed a commit that referenced this pull request Feb 1, 2025 (Pull Request resolved: #146207; Approved by: https://github.com/angelayi)

mori360 pushed a commit to mori360/pytorch that referenced this pull request Feb 6, 2025
Labels
ciflow/inductor, ciflow/linux-aarch64, ciflow/trunk, Merged, module: arm, module: cpu, module: dynamo, module: inductor, open source, release notes: sparse, triaged