
Extend vectorization with SVE(ARM) with Torch Compile (Inductor) #134672


Closed. aditew01 wants to merge 4 commits from the aditew01/torchcompile_sve branch.

Conversation

@aditew01 (Collaborator) commented Aug 28, 2024

Motivation
Enable SVE vectorization with torch.compile
Extends PR: #119571

  • This PR enables vectorization for the codegen part using SVE-256 (vector length)
  • The changes can be extended to other SVE vector lengths

I've compared the SVE-vectorization-enabled route for torch.compile against the existing NEON implementation.
Test results are for 8 cores on an ARM Neoverse-V1.

[Screenshot: NEON vs. SVE benchmark results, 2024-08-28]

It's worth mentioning that for the standalone SiLU op there's a ~1.8x speedup with torch.compile.
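
For context, here is a minimal sketch of the kind of eager-vs-compile comparison described above; the tensor shape, iteration count, and timing loop are illustrative assumptions, not the actual benchmark setup:

```python
# Illustrative micro-benchmark sketch; shape and iteration count are
# assumptions, not the setup used for the numbers above.
import time
import torch

x = torch.randn(1024, 1024)
silu = torch.nn.SiLU()
compiled_silu = torch.compile(silu)

def bench(fn, iters=100):
    fn(x)  # warm-up; triggers Inductor codegen for the compiled variant
    start = time.perf_counter()
    for _ in range(iters):
        fn(x)
    return (time.perf_counter() - start) / iters

print(f"eager:    {bench(silu) * 1e6:.1f} us/iter")
print(f"compiled: {bench(compiled_silu) * 1e6:.1f} us/iter")
```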

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @malfet @snadampal @milpuz01 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @rec

pytorch-bot (bot) added the module: cpu, module: inductor, and release notes: sparse labels Aug 28, 2024
pytorch-bot (bot) commented Aug 28, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/134672

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit fd63fbd with merge base f69bf00:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@aditew01 (Collaborator, Author)

@pytorchbot label "module: arm"

pytorch-bot (bot) added the module: arm label Aug 28, 2024
@aditew01 (Collaborator, Author)

cc: @maajidkhann

@aditew01 (Collaborator, Author)

@pytorchbot label "ciflow/linux-aarch64"

pytorch-bot (bot) commented Aug 28, 2024

Can't add following labels to PR: ciflow/linux-aarch64. Please ping one of the reviewers for help.

@aditew01 force-pushed the aditew01/torchcompile_sve branch from b87e57d to 67707f3 on August 28, 2024 15:52
@maajidkhann (Contributor)

> cc: @maajidkhann

@aditew01 Thanks for the PR. This enables the compile flow with SVE, and overall the changes look good. We are also running some models from TorchBench with your changes to see the gains compared to the compile() + NEON flow.

Can you look into the comments on your PR, where there are a few suggestions from reviewers, and push the updated changes? Once done, I can cherry-pick your commit onto the original PR (#119571).

The original PR is not merged yet, and it would be good to have all the changes, including yours, in one place. That makes it easy for reviewers, since they have the context there.

We were also internally working on enabling compile with SVE and have identified places where we need to implement more operators in the Vec backend for SVE. Today they fall back to NEON, as they don't have an SVE implementation. Once this change is done and verified internally, we will add it to our main PR on top of your commit.

@aditew01 (Collaborator, Author)

@maajidkhann thanks for the ack.
I had a question about the different SVE vector lengths we may want to enable. Currently I see only 256-bit supported:

#if defined(CPU_CAPABILITY_SVE256)

Are there any plans for supporting other SVE vector lengths? Some of the code, like the following, could be made more generic in that case:

#if defined(__aarch64__) && !defined(C10_MOBILE) && !defined(__CUDACC__) && defined(CPU_CAPABILITY_SVE)

class VecSVE(VecISA):

Writing it here for full context and visibility.

@maajidkhann (Contributor) commented Aug 29, 2024

> @maajidkhann thanks for the ack. I had a question about the different SVE vector lengths we may want to enable. Currently I see only 256-bit supported:
>
> #if defined(CPU_CAPABILITY_SVE256)
>
> Are there any plans for supporting other SVE vector lengths? Some of the code, like the following, could be made more generic in that case:
>
> #if defined(__aarch64__) && !defined(C10_MOBILE) && !defined(__CUDACC__) && defined(CPU_CAPABILITY_SVE)
>
> class VecSVE(VecISA):
>
> Writing it here for full context and visibility.

Yeah, right now all of our development and testing has been on Graviton 3 (which is SVE256). We are currently working on a follow-up PR/commit that would extend SVE Vec backend support to SVE128 and SVE512.

For SVE128 we are using Grace/Graviton 4 CPUs to validate the tests, and for SVE512 we are using Fugaku instances, which come with SVE512.

This PR doesn't require any extra SVE code to be added, since SVE code is vector-length agnostic (VLA); we just have to add support for SVE128 and SVE512 and register the kernels for that. The PR is almost ready, but we are currently facing a segfault when validating the changes on Graviton 2 (non-SVE) machines. We have to maintain backward compatibility with non-SVE machines as well. We expect to fix this issue soon and will push a follow-up commit.

Currently the ARM CPU market offers only three different SVE lengths (SVE-128/256/512), and we don't forecast any CPU coming with a much bigger VL, though up to 2048 bits is technically possible. So with this PR we can cater the SVE backend to all ARM CPUs on the market.
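
As a rough sketch of how the detected vector capability surfaces at runtime, the stable API that comes up later in this thread can be queried directly; the exact strings returned on each CPU are assumptions here, not documented guarantees:

```python
# Sketch: check which CPU capability this PyTorch build detected at runtime.
# The returned strings for ARM machines ("SVE256" on Graviton 3, "DEFAULT"
# on non-SVE aarch64) are assumptions based on the discussion in this thread.
import torch

cap = torch.backends.cpu.get_cpu_capability()
if cap == "SVE256":
    print("256-bit SVE detected; the VecSVE/Inductor SVE path can be used")
else:
    print(f"capability: {cap}; falling back to NEON vectorization on aarch64")
```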

@aditew01 (Collaborator, Author)

@maajidkhann ack. Thanks for the detailed reply; it clarifies the roadmap.
I believe once the SVE128 commit is up, we can make the torch.compile route agnostic to the vector length and integrate SVE128 without many code changes.

@maajidkhann (Contributor) commented Aug 29, 2024

> @maajidkhann ack. Thanks for the detailed reply; it clarifies the roadmap. I believe once the SVE128 commit is up, we can make the torch.compile route agnostic to the vector length and integrate SVE128 without many code changes.

Yes, it should be a simple change to enable torch.compile for SVE128 later on.
It should just be a one-line change here:
67707f3#diff-39665ae5ca878523e2f73397eec7080a5eff86c46c6bd01377c0be235d97109cR169

@janeyx99 added the triaged label Aug 30, 2024
@malfet (Contributor) left a comment


Overall I think you are doing the right implementation of a dispatch here, but do you mind breaking it down into two PRs: one that adds the SVE backend and another that enables it in torch.compile?

[Edit] I see you're already extending #119571 here, so perhaps let's land them in order...

CMakeLists.txt (outdated)
Comment on lines 1153 to 1168
if(CMAKE_SYSTEM_PROCESSOR MATCHES "aarch64")
  include(CheckCSourceCompiles)
  set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -march=armv8-a+sve")
  check_c_source_compiles("#include <arm_sve.h>
  int main() {
    svfloat64_t a;
    a = svdup_n_f64(0);
    return 0;
  }" COMPILER_HAS_ARM_SVE)

  if(COMPILER_HAS_ARM_SVE)
    string(APPEND CMAKE_CXX_FLAGS " -DCOMPILER_HAS_ARM_SVE")
  endif()
  set(CMAKE_C_FLAGS ${ORIGINAL_CMAKE_C_FLAGS})
endif()

Contributor:


This is a compile-time check and is somewhat irrelevant at runtime, isn't it?

Contributor:


I have made the changes and pushed them to the main SVE PR:
#119571

Here's the exact commit with the change:
47dcc02

@aditew01 force-pushed the aditew01/torchcompile_sve branch from b728e3c to 2493b1f on September 4, 2024 13:29
linux-foundation-easycla (bot) commented Sep 4, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@cfRod (Collaborator) commented Sep 12, 2024

@pytorchbot label "ciflow/linux-aarch64"

pytorch-bot (bot) commented Sep 12, 2024

Can't add following labels to PR: ciflow/linux-aarch64. Please ping one of the reviewers for help.

@malfet (Contributor) commented Sep 19, 2024

@aditew01, please rebase, and please notice that the compile speedup dropped after the SVE eager changes were landed, probably because eager is now faster. Let's rebase and merge those changes to make sure the performance speedups are still there.

@maajidkhann (Contributor)

> @aditew01, please rebase, and please notice that the compile speedup dropped after the SVE eager changes were landed, probably because eager is now faster. Let's rebase and merge those changes to make sure the performance speedups are still there.

@aditew01 The main OSS PR (#119571) is now merged.

The SVE backend commits this PR was based on were changed in the original PR (#119571): there was some reordering, and additional commits went in later.

I think the easier option would be to just cherry-pick your 3 commits on top of the latest PyTorch main and force-push the changes into this PR.
[Screenshot: the commits to cherry-pick]

@aditew01 force-pushed the aditew01/torchcompile_sve branch from 9aba52f to d047382 on October 1, 2024 11:21
@aditew01 (Collaborator, Author) commented Oct 3, 2024

@jgong5 @malfet can I please get a review

@aditew01 (Collaborator, Author) commented Oct 9, 2024

@pytorchbot merge

pytorch-bot (bot) added the ciflow/trunk label Oct 9, 2024
@pytorchmergebot (Collaborator)

Merge failed

Reason: Approvers from one of the following sets are needed:

  • superuser (pytorch/metamates)
  • Core Reviewers (mruberry, lezcano, Skylion007, ngimel, peterbell10, ...)
  • Core Maintainers (soumith, gchanan, ezyang, dzhulgakov, malfet, ...)
Details for Dev Infra team: raised by workflow job

Failing merge rule: Core Maintainers

@aditew01 (Collaborator, Author) commented Oct 9, 2024

@malfet can this be merged now?

Change-Id: I2a65d40bfdb843e426f2763f980f69f0f6a9f5bf
@@ -28,7 +28,7 @@
 #include <c10/util/TypeCast.h>
 #include <torch/csrc/inductor/aoti_torch/c/shim.h>

-#if defined(CPU_CAPABILITY_AVX512) || defined(CPU_CAPABILITY_AVX2) || defined(CPU_CAPABILITY_ZVECTOR) || defined(CPU_CAPABILITY_NEON) || defined(CPU_CAPABILITY_VSX)
+#if defined(CPU_CAPABILITY_AVX512) || defined(CPU_CAPABILITY_AVX2) || defined(CPU_CAPABILITY_ZVECTOR) || defined(CPU_CAPABILITY_NEON) || defined(CPU_CAPABILITY_VSX) || defined(CPU_CAPABILITY_SVE256)
Contributor:


It feels a bit weird to define CPU_CAPABILITY_SVE256 just for this macro, but sure, why not

@malfet (Contributor) commented Oct 10, 2024

@pytorchbot merge -f "Lint + aarch64 builds are green"

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

@@ -338,7 +356,10 @@ def valid_vec_isa_list() -> List[VecISA]:
     elif arch == "ppc64le":
         isa_list.append(VecVSX())
     elif arch == "aarch64":
         isa_list.append(VecNEON())
+        if torch.cpu._is_arm_sve_supported():
+            isa_list.append(VecSVE())
@CaoE (Collaborator) commented Oct 12, 2024


Is this check sufficient? Do we need to add a check like cpuinfo_get_max_arm_sve_length() == 256 (https://github.com/pytorch/pytorch/pull/119571/files#diff-54c373491da67eb31c3777457d7b043a49dd3966412edfd928ffd2013e4d6a54R39-R47), since the macros for VecSVE are "CPU_CAPABILITY_SVE", "CPU_CAPABILITY_SVE256", and "AT_BUILD_ARM_VEC256_WITH_SLEEF"?

@aditew01 (Collaborator, Author)


Yes, I believe so. I think using if torch.cpu._is_arm_sve256_supported(): would be appropriate.
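
A hypothetical sketch of what that stricter registration check could look like; `_is_arm_sve256_supported()` did not exist at the time, so the sketch probes for it defensively rather than calling it directly:

```python
# Hypothetical sketch of the stricter check proposed above. The
# _is_arm_sve256_supported() name is an assumption from this thread,
# so we probe for it instead of assuming the API exists.
import platform
import torch

def should_register_vec_sve() -> bool:
    if platform.machine() != "aarch64":
        return False
    # VecSVE is compiled with CPU_CAPABILITY_SVE256, so registering it from
    # a generic "has SVE" check could mis-dispatch on SVE128 hardware.
    probe = getattr(torch.cpu, "_is_arm_sve256_supported", None)
    return bool(probe and probe())
```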

Contributor:


@CaoE @aditew01 Thanks for your suggestions; I agree. I will include this check in my next SVE PR (add support for SVE128/SVE512). I had to implement the API from here: aten/src/ATen/cpu/Utils.cpp

Here's a snapshot of the logic (I have verified it's working):
[Screenshot: detection logic in aten/src/ATen/cpu/Utils.cpp]

jackzhxng pushed a commit that referenced this pull request Oct 16, 2024

Pull Request resolved: #134672
Approved by: https://github.com/jgong5, https://github.com/malfet
@nWEIdia (Collaborator) commented Jan 25, 2025

This seems to also cause accuracy issues when running:

PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=6 python test/inductor/test_torchinductor_opinfo.py TestInductorOpInfoCUDA.test_comprehensive_new_full_cuda_float16

on Grace+H100.

malfet added a commit that referenced this pull request Jan 31, 2025
This PR removes `torch.cpu._is_arm_sve_supported()` and replaces it with the stable `torch.backends.cpu.get_cpu_capability()`

I should have reviewed #134672 more thoroughly, because it introduced a duplicate but slightly different API for detecting CPU architectures, which resulted in runtime crashes on systems that support SVE128 rather than SVE256

Fixes #145441

ghstack-source-id: d9e42d1
Pull Request resolved: #146207
malfet added a commit that referenced this pull request Jan 31, 2025 (ghstack-source-id: 2ee05eb; Pull Request resolved: #146207)

pytorchmergebot pushed a commit that referenced this pull request Feb 1, 2025 (Pull Request resolved: #146207; Approved by: https://github.com/angelayi)

mori360 pushed a commit to mori360/pytorch that referenced this pull request Feb 6, 2025
Labels
ciflow/inductor, ciflow/linux-aarch64, ciflow/trunk, Merged, module: arm, module: cpu, module: dynamo, module: inductor, open source, release notes: sparse, triaged