[Intel GPU] Enable SDPA on XPU #147614
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/147614
Note: Links to docs will display an error until the docs builds have been completed.
❌ 3 New Failures, 2 Unrelated Failures as of commit 6e0b7d2 with merge base af720cd.
NEW FAILURES - The following jobs have failed:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "topic: not user facing"
Attention! native_functions.yaml was changed
If you are adding a new function or defaulted argument to native_functions.yaml, you cannot use it from pre-existing Python frontend code until our FC window passes (two weeks). Split your PR into two PRs, one which adds the new C++ functionality, and one that makes use of it from Python, and land them two weeks apart. See https://github.com/pytorch/pytorch/wiki/PyTorch's-Python-Frontend-Backward-and-Forward-Compatibility-Policy#forwards-compatibility-fc for more info.
Attention! One of the PyTorch C-stable API files was changed
You MUST NOT change existing function declarations in this file, as this header defines a stable C ABI. If you need to change the signature of a function, introduce a new v2 version of the function and modify code generation to target the new version.
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here
Successfully rebased 5ce9f89 to 2b35316.
@pytorchbot rebase -b main
@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here
Successfully rebased 2b35316 to 199187d.
For Intel GPU, this PR can significantly improve performance for key workloads like Stable Diffusion, benefiting Torch users, especially on client devices. In terms of timeline, we are strongly aiming to catch up with the upcoming PyTorch release. We would greatly appreciate it if @albanD, @jansel, and @desertfire could prioritize the review of this PR.
.lintrunner.toml (Outdated)
@@ -1251,6 +1251,7 @@ exclude_patterns = [
    'test/test_testing.py',
    'test/test_torch.py',
    'test/test_transformers.py',
    'test/xpu/test_transformers.py',
Why is this excluded?
Just merged test/xpu/test_transformers.py into test/test_transformers.py
@@ -244,6 +244,7 @@ def convert_return(typ: BaseType, val: str) -> str:
    "_scaled_dot_product_flash_attention",
    "_scaled_dot_product_efficient_attention",
    "_scaled_dot_product_cudnn_attention",
    "_scaled_dot_product_fused_attention_overrideable",
Why add this here?
Per my understanding, AOTI needs to check whether an operation may return an empty tensor and then generate c_shim code accordingly. Currently, I think `_scaled_dot_product_fused_attention_overrideable` should behave similarly to the other `_scaled_dot_product_*` operations, i.e. it may return an empty tensor.
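To illustrate the point, here is a rough sketch of the idea only, not the actual torchgen code; `MAY_RETURN_UNDEFINED_TENSORS` and `emit_return_conversion` are hypothetical names. Ops on this list may return undefined tensors (for example the debug mask when `return_debug_mask=False`), so the generated c_shim has to guard each returned tensor before wrapping it in a handle.

```python
# Hypothetical sketch only: why an op appears on this list.  Ops listed here
# may return undefined tensors, so the generated c_shim must null-check each
# returned tensor before creating a handle for it.
MAY_RETURN_UNDEFINED_TENSORS = {
    "_scaled_dot_product_flash_attention",
    "_scaled_dot_product_efficient_attention",
    "_scaled_dot_product_cudnn_attention",
    "_scaled_dot_product_fused_attention_overrideable",
}


def emit_return_conversion(op_name: str, ret_expr: str, idx: int) -> str:
    """Emit a C statement converting the idx-th returned tensor to a handle."""
    if op_name in MAY_RETURN_UNDEFINED_TENSORS:
        # Guard against undefined tensors before creating a tensor handle.
        return (
            f"ret{idx} = {ret_expr}.defined() "
            f"? new_tensor_handle(std::move({ret_expr})) : nullptr;"
        )
    return f"ret{idx} = new_tensor_handle(std::move({ret_expr}));"
```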
@@ -14866,6 +14867,7 @@
- func: _scaled_dot_product_fused_attention_overrideable(Tensor query, Tensor key, Tensor value, Tensor? attn_bias=None, float dropout_p=0.0, bool is_causal=False, bool return_debug_mask=False, *, float? scale=None) -> (Tensor output, Tensor logsumexp, Tensor cum_seq_q, Tensor cum_seq_k, SymInt max_q, SymInt max_k, Tensor philox_seed, Tensor philox_offset, Tensor debug_attn_mask)
  dispatch:
    CompositeExplicitAutograd: _scaled_dot_product_fused_attention_overrideable
    XPU: _scaled_dot_product_fused_attention_overrideable_xpu
@drisspg is this the right place to override it?
I don't think so, at least not from within core. This is meant as a generic op that different backends can register to through PrivateUse1.
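For context, a minimal sketch of what such an out-of-tree registration could look like, assuming a recent PyTorch build where this op schema exists; `my_backend_sdpa_forward` is a hypothetical placeholder kernel, while `torch.library.Library` and its `impl` method are the standard registration entry points:

```python
# Sketch: an out-of-tree backend registering its own kernel for the
# overrideable SDPA op under the PrivateUse1 dispatch key.
import torch

lib = torch.library.Library("aten", "IMPL")


def my_backend_sdpa_forward(
    query, key, value, attn_bias=None, dropout_p=0.0,
    is_causal=False, return_debug_mask=False, *, scale=None,
):
    # A real backend would call its fused attention kernel here and return
    # the full 9-element tuple described in native_functions.yaml:
    # (output, logsumexp, cum_seq_q, cum_seq_k, max_q, max_k,
    #  philox_seed, philox_offset, debug_attn_mask).
    raise NotImplementedError("placeholder for the backend's fused kernel")


# Route calls on PrivateUse1 tensors to the backend's implementation.
lib.impl(
    "_scaled_dot_product_fused_attention_overrideable",
    my_backend_sdpa_forward,
    "PrivateUse1",
)
```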
Confirmed things with Alban: it is okay to add the XPU dispatch to any of the available SDPA APIs. Whichever one most closely aligns with what you need for forward/backward makes the most sense.
In that context, this change seems fine.
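As a usage note (a sketch, not part of this diff): with the XPU dispatch entry in place, the ordinary Python entry point should route to the new kernel for XPU tensors, roughly as follows, assuming an XPU-enabled build and an available device:

```python
# Sketch: calling SDPA on XPU tensors once the XPU dispatch entry exists.
import torch
import torch.nn.functional as F

device = "xpu"
q = torch.randn(2, 8, 128, 64, device=device, dtype=torch.float16)
k = torch.randn(2, 8, 128, 64, device=device, dtype=torch.float16)
v = torch.randn(2, 8, 128, 64, device=device, dtype=torch.float16)

# Dispatches to the XPU kernel registered in native_functions.yaml.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 128, 64])
```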
test/xpu/test_transformers.py (Outdated)
instantiate_device_type_tests(
    TestSDPAXpuOnly, globals(), only_for="xpu", allow_xpu=True
)
Can't we re-use the existing test?
I did try to reuse the existing `test/test_transformers.py`, and I was not sure it would be good, as there are too many things to be generalized.
- For example, `test_onednn_attention_fail_d256` is derived from `test/test_transformers.py:test_cudnn_attention_fail_d128`, and obviously we need to use a different backend.
- Another example is that `test_scaled_dot_product_attention_fused_kernels_packed` tests nested tensors, which are not supported on XPU.
- `test/test_transformers.py::TestTransformers::test_scaled_dot_product_attention` tests `float` and `double` in a loop, while XPU does not currently support double.
- In addition, all cases with grad are not applicable on XPU, as training will be supported later.

Actually, there are only 2 cases in `test/test_transformers.py::TestSDPA`, but 8 cases in `TestSDPACpuOnly` and 28 cases in `TestSDPACudaOnly`. I think it would be a mess after merging the XPU tests into them.
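For a sense of what an XPU-only case looks like, here is a simplified sketch in the style of the new test file; the class name matches the diff above, but the test body and tolerances are illustrative, not the actual tests added in this PR:

```python
# Simplified sketch of an XPU-only SDPA test; the real tests in
# test/xpu/test_transformers.py are more thorough.
import torch
import torch.nn.functional as F
from torch.testing._internal.common_device_type import (
    instantiate_device_type_tests,
)
from torch.testing._internal.common_utils import TestCase, run_tests


class TestSDPAXpuOnly(TestCase):
    def test_sdpa_matches_math_reference(self, device):
        q, k, v = (torch.randn(2, 4, 64, 32, device=device) for _ in range(3))
        actual = F.scaled_dot_product_attention(q, k, v)
        # Reference result computed on CPU.
        expected = F.scaled_dot_product_attention(q.cpu(), k.cpu(), v.cpu())
        self.assertEqual(actual.cpu(), expected, atol=1e-3, rtol=1e-3)


instantiate_device_type_tests(
    TestSDPAXpuOnly, globals(), only_for="xpu", allow_xpu=True
)

if __name__ == "__main__":
    run_tests()
```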
@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here
Successfully rebased 06da205 to 6e0b7d2.
@albanD, may I know if we have addressed your comments?
@pytorchbot merge -i
Merge started
Your change will be merged while ignoring the following 5 checks: pull / linux-focal-cpu-py3.10-gcc9-bazel-test / filter, xpu / linux-jammy-xpu-2025.0-py3.9 / test (default, 1, 4, linux.idc.xpu), xpu / linux-jammy-xpu-2025.0-py3.9 / test (default, 2, 4, linux.idc.xpu), xpu / linux-jammy-xpu-2025.0-py3.9 / test (default, 3, 4, linux.idc.xpu), xpu / linux-jammy-xpu-2025.0-py3.9 / test (default, 4, 4, linux.idc.xpu)
Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
@albanD, I will merge this PR first, as we are preparing an internal full validation for the upcoming branch cut. We will continue to refine the changes if we have not addressed the comments well.
Motivation
This PR is part of the plan of OneDNN Upstreaming, as #114848 (comment) stated. The support of SDPA is via the overrideable variant on the XPU backend. Besides the added `Attention.cpp` file, `Graph.h` is added to hold utils for OneDNN graph, including those for kernel/compiled-graph caching. In addition, a selection of test cases in `test/test_transformers.py` are copied into the new `test/xpu/test_transformers.py` and modified accordingly to provide additional tests beyond `./third_party/torch-xpu-ops/test/xpu/test_ops_xpu.py`.

Depends on OneDNN version v3.7 upgrade in #147498
Depends on BUILD_GRAPH switch in #147608
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov
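As an aside, a rough Python-level sketch of the kernel/compiled-graph caching idea that `Graph.h` is described as providing; this is illustrative only and does not reflect the actual C++ implementation or the OneDNN Graph API:

```python
# Illustrative-only sketch of compiled-graph caching keyed on the attributes
# that determine the kernel; the real cache in Graph.h is C++ and operates on
# OneDNN Graph partitions, not Python objects.
from functools import lru_cache


@lru_cache(maxsize=None)
def get_compiled_sdpa_graph(dtype, num_heads, head_dim, is_causal, has_attn_bias):
    # In the real code this step would build and compile a graph partition
    # for the given configuration; recompilation is skipped when the same
    # key is seen again.
    return ("compiled-sdpa-graph", dtype, num_heads, head_dim, is_causal,
            has_attn_bias)


# Repeated calls with the same configuration reuse the cached entry.
g1 = get_compiled_sdpa_graph("float16", 8, 64, True, False)
g2 = get_compiled_sdpa_graph("float16", 8, 64, True, False)
assert g1 is g2
```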