[Intel GPU] Fallback embedding_dense_backward on XPU by jianyizh · Pull Request #146888 · pytorch/pytorch · GitHub

[Intel GPU] Fallback embedding_dense_backward on XPU #146888


Closed
wants to merge 1 commit

Conversation

jianyizh
Contributor
@jianyizh jianyizh commented Feb 11, 2025

Do not decompose embedding_dense_backward for torch.compile on XPU. Current XPU devices have hardware limitations on atomic ops, so we fall back to eager, where the op is implemented with a sort. hf_T5 amp bf16 training in TorchBench gains a 2x improvement on Max 1550. This PR also aligns with CUDA on the gelu decomposition in _addmm_activation.
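The sort-based strategy the eager kernel relies on can be illustrated in plain Python (a simplified sketch, not the actual XPU kernel, and the function name is hypothetical): instead of scatter-adding each gradient row into the weight gradient with atomic ops, sort the indices so that all contributions to the same embedding row become contiguous and can be reduced sequentially.

```python
# Simplified sketch of a sort-based embedding_dense_backward.
# Real kernels operate on device tensors; here rows are plain lists
# of floats so the control flow is easy to follow.

def embedding_dense_backward_sorted(indices, grad_rows, num_weights):
    """Return grad_weight where grad_weight[i] is the sum of all
    grad_rows[j] with indices[j] == i, accumulated via sorting
    instead of atomic scatter-add."""
    dim = len(grad_rows[0]) if grad_rows else 0
    grad_weight = [[0.0] * dim for _ in range(num_weights)]

    # Sort gradient rows by their target index. On hardware, each
    # contiguous run of equal indices can then be reduced by a single
    # work-item, with no atomics needed.
    order = sorted(range(len(indices)), key=lambda j: indices[j])

    for j in order:
        row = indices[j]
        for d in range(dim):
            grad_weight[row][d] += grad_rows[j][d]
    return grad_weight
```

For example, indices `[1, 0, 1]` with gradient rows `[[1, 1], [2, 2], [3, 3]]` accumulate to `[2, 2]` for row 0 and `[4, 4]` for row 1.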

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov

pytorch-bot bot commented Feb 11, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/146888

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 2 Unrelated Failures

As of commit d3eeab2 with merge base 229fb0b:

NEW FAILURE - The following job has failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@jianyizh
Contributor Author

@pytorchbot label "topic: not user facing"

@jianyizh jianyizh marked this pull request as ready for review February 11, 2025 06:29
@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Feb 11, 2025
Comment on lines +122 to +124
if torch.xpu.is_available():
    decomps_to_exclude.append(aten.embedding_dense_backward)

Collaborator

@jianyizh, could it be refined to check the device architecture? 16-bit atomics will be supported on future devices.

Contributor Author

I think embedding_dense_backward will always decompose to an fp32 calculation.

Contributor

Won't this change the behavior of CPU on machines with an XPU installed? This seems not ideal.

I think we might need a wrapper decomp that checks the device.

Contributor Author
@jianyizh jianyizh Feb 13, 2025

I agree it's not ideal. We don't know the device at the decomposition-registration stage. I see other decompositions use `return NotImplemented` to fall back, but that does not work here.

Collaborator

Won't this change the behavior of CPU on machines with an XPU installed? This seems not ideal.

I think we might need a wrapper decomp that checks the device.

A wrapper decomp sounds like a good idea. @jianyizh, @etaf, let's implement the decomp wrapper to check the device.
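The control flow being proposed here could look roughly like the following (a minimal, dependency-free sketch; the function names and the `device` parameter are stand-ins for reading the device off the input tensors in the real torch._decomp registry, and note that jianyizh later reported the `NotImplemented` route did not work in this case):

```python
# Hypothetical sketch of a device-checking wrapper decomposition:
# keep the generic decomposition for CPU/CUDA, but decline on XPU so
# the compiler falls back to the eager (sort-based) kernel there.

def generic_embedding_dense_backward(grad, indices, num_weights):
    # Stand-in for the generic decomposition body.
    return ("decomposed", grad, indices, num_weights)

def embedding_dense_backward_wrapper(grad, indices, num_weights,
                                     device="cpu"):
    if device == "xpu":
        # Returning NotImplemented is the convention some
        # decompositions use to signal "fall back to eager".
        return NotImplemented
    return generic_embedding_dense_backward(grad, indices, num_weights)
```

The point of the wrapper is that the device check happens when the decomposition runs on concrete inputs, not when it is registered, so an XPU build no longer changes behavior for CPU tensors on the same machine.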

@EikanWang EikanWang added the ciflow/xpu Run XPU CI tasks label Feb 11, 2025
pytorch-bot bot commented Feb 11, 2025

To add the ciflow label ciflow/xpu please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot pytorch-bot bot removed the ciflow/xpu Run XPU CI tasks label Feb 11, 2025
@EikanWang EikanWang added the ciflow/xpu Run XPU CI tasks label Feb 11, 2025
@zou3519 zou3519 added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Feb 11, 2025

@EikanWang
Collaborator

@jianyizh , @etaf , any update?

@etaf
Collaborator
etaf commented Feb 24, 2025

@jianyizh , @etaf , any update?

Sorry for missing that, we'll update this PR ASAP.

@jianyizh
Contributor Author

After discussion, I don't think a wrapper decomp that checks the device is feasible.

@jianyizh jianyizh closed this Feb 25, 2025
@EikanWang
Collaborator

@jianyizh, there should be alternative approaches to achieve the goal.

pytorchmergebot pushed a commit that referenced this pull request May 19, 2025
Reopen #146888; now the modification only affects the xpu device. We do not want to decompose embedding_dense_backward for torch.compile. Current XPU devices have hardware limitations on atomic ops, so we fall back to eager and use sort to implement this op. hf_T5 amp bf16 training in TorchBench gains a 2x improvement on Max 1550. ~~I also align with cuda on gelu decomposition in _addmm_activation~~

Pull Request resolved: #151637
Approved by: https://github.com/guangyey, https://github.com/etaf, https://github.com/jansel, https://github.com/EikanWang
Labels
ciflow/xpu Run XPU CI tasks · module: inductor · open source · topic: not user facing topic category · triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
6 participants