[Intel GPU][Inductor] Fallback embedding_dense_backward on XPU #151637
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/151637
Note: Links to docs will display an error until the docs builds have been completed.
❗ 1 Active SEV: there is 1 currently active SEV. If your PR is affected, please view it below.
✅ You can merge normally! (1 Unrelated Failure)
As of commit 4631d97 with merge base 7412b33. FLAKY: the following job failed but was likely due to flakiness present on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "topic: not user facing"
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased 6bd1e22 to 48e3443 (Compare)
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased 48e3443 to 0edf49e (Compare)
torch/_decomp/decompositions.py
Outdated
if grad_output.is_xpu:
    return NotImplemented
@jianyizh, I'm considering moving this logic into Inductor, since computation behavior differs between Intel GPU generations (SKUs).
done
torch/_decomp/decompositions.py
Outdated
@@ -1474,7 +1476,7 @@ def _addmm_activation(
 ):
     out = addmm(self, mat1, mat2, beta, alpha)
     if use_gelu:
-        if self.is_cuda:
+        if self.is_cuda or self.is_xpu:
This change is irrelevant to this PR, right?
Yes, I just want to align with CUDA. I don't want to open another PR for such a small change.
Please exclude these changes from this PR.
Hi @jianyizh, could you elaborate on the motivation behind this PR in the description? It would also be helpful to clarify what issues or limitations would arise without this PR. Additionally, if possible, please consider adding a test case to validate the changes.
@guangyey I have updated the PR description. For testing, we already have UTs for embedding_dense_backward itself in PyTorch. I'm not sure how and where to add a UT that checks the op fallback in Inductor.
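One possible approach (a rough sketch, not taken from this PR): compile the embedding forward/backward and inspect the generated code with torch._inductor.utils.run_and_get_code; the exact marker string asserted on is an assumption.

```python
# Hypothetical sketch of an Inductor fallback check: compile an embedding
# forward/backward on XPU and assert that embedding_dense_backward shows up
# in the generated code as a fallback call rather than a decomposed kernel.
import torch
from torch._inductor.utils import run_and_get_code


def check_embedding_dense_backward_fallback():
    if not torch.xpu.is_available():
        return  # nothing to check without an XPU device

    emb = torch.nn.Embedding(128, 64, device="xpu")
    idx = torch.randint(0, 128, (32,), device="xpu")

    compiled = torch.compile(lambda i: emb(i).sum())

    def train_step(i):
        loss = compiled(i)
        loss.backward()
        return emb.weight.grad

    _, codes = run_and_get_code(train_step, idx)
    # The marker below is an assumption; the intent is that the op appears
    # as an extern/fallback kernel instead of a generated Triton kernel.
    assert any("embedding_dense_backward" in code for code in codes)
```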
    padding_idx: int,
    scale_grad_by_freq: bool,
):
    if grad_output.is_xpu:
If the next-generation Intel GPU provides strong performance for atomic operations and no longer requires a fallback path, how can we further improve or ensure compatibility with the current implementation?
I think currently the only way is to distinguish by device name; we do not have an API that reports atomic-op capability. For the next generation we would also need to change the eager-op implementation. The fallback itself does not hurt performance much (it only costs one fusion). In any case, we need this fallback for at least a year.
Let's add a TODO and comments here; we should find an elegant way to keep backward compatibility for future architectures.
@jianyizh, it would be better to add a check here. For example, fall back to ATen only if the GPU arch is not XE4.
Currently torch.xpu.get_device_properties().architecture does not return a meaningful result; it returns 13136561920 on Max 1550. Maybe we can query the device architecture as described in https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_device_architecture.asciidoc#querying-the-device-architecture-with-the-information-descriptor
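As a concrete illustration of the device-name approach discussed above (a sketch under stated assumptions; the helper name and the name substrings are hypothetical and not part of this PR):

```python
# Hypothetical helper: decide whether to fall back to the ATen kernel based
# on the reported device name, since the architecture query is not reliable
# today on XPU.
import torch


def _xpu_needs_embedding_backward_fallback(device_index: int = 0) -> bool:
    if not torch.xpu.is_available():
        return False
    props = torch.xpu.get_device_properties(device_index)
    # Assumed heuristic: current Max/Flex parts benefit from the sort-based
    # eager kernel; a future architecture with fast atomics could be removed
    # from this list.
    slow_atomics_markers = ("Max", "Flex")
    return any(marker in props.name for marker in slow_atomics_markers)
```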
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased cb94c1d to 4023e26 (Compare)
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased d02a287 to 4631d97 (Compare)
To add the ciflow label, the pending workflows need to be approved first. This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.
@jianyizh, could you help share the performance data here?
if torch.xpu.is_available():
    make_fallback(
        aten.embedding_dense_backward, warn=False
    )  # (XPU-only and faster than decomp)
This seems like a point patch. I'm considering whether we need to provide a device-specific fallback design. Meanwhile, @jianyizh, please add another section dedicated to XPU besides 1) Easy 2) Medium ...
https://github.com/pytorch/pytorch/pull/151637/files#diff-a1b077971cddfabfa0071c5162265066e867bc07721816d95b9cbe58431c38e3R2618
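One possible shape for such a device-specific fallback design is sketched below; the table and helper names are hypothetical and do not exist in torch._inductor today:

```python
# Hypothetical sketch of a device-keyed fallback table for Inductor lowerings.
import torch
from torch._inductor.lowering import make_fallback

aten = torch.ops.aten

DEVICE_FALLBACKS = {
    # Ops that should stay on their ATen kernels when compiling for XPU.
    "xpu": [aten.embedding_dense_backward],
}


def register_device_fallbacks(device_type: str) -> None:
    for op in DEVICE_FALLBACKS.get(device_type, []):
        make_fallback(op, warn=False)


if torch.xpu.is_available():
    register_device_fallbacks("xpu")
```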
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job has failed, the first few of them are: Apply lint suggestions. Details for Dev Infra team: raised by workflow job.
This reopens #146888; the modification now only affects the XPU device. We do not want to decompose embedding_dense_backward for torch.compile, because current XPU devices have hardware limitations on atomic ops. By falling back to eager, we can use the sort-based implementation of this op. hf_T5 amp bf16 training in TorchBench gets a 2x improvement on Max 1550.
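For context, a minimal sketch of the affected pattern (illustrative only; the shapes and the device guard are assumptions, and the 2x figure above comes from the TorchBench run, not from this snippet):

```python
# The pattern this PR targets: the backward of nn.Embedding under
# torch.compile. With the Inductor fallback, the compiled backward graph
# calls the ATen embedding_dense_backward kernel (sort-based on XPU) instead
# of a generated kernel that relies on atomic adds.
import torch

device = "xpu" if torch.xpu.is_available() else "cpu"

emb = torch.nn.Embedding(30522, 768, device=device)


@torch.compile
def forward(indices):
    return emb(indices).sum()


idx = torch.randint(0, 30522, (8, 512), device=device)
loss = forward(idx)
loss.backward()  # grad for emb.weight comes from embedding_dense_backward
```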
I also align with CUDA on the GELU decomposition in _addmm_activation.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @gujinghui @fengyuan14 @guangyey