[Intel GPU][Inductor] Fallback embedding_dense_backward on XPU by jianyizh · Pull Request #151637 · pytorch/pytorch · GitHub

[Intel GPU][Inductor] Fallback embedding_dense_backward on XPU #151637


Open
jianyizh wants to merge 6 commits into main from jianyi/fallback_embedding_bwd
Conversation

Contributor
@jianyizh commented Apr 18, 2025

This reopens #146888; the modification now only affects the XPU device. We do not want to decompose embedding_dense_backward for torch.compile, because current XPU devices have hardware limitations on atomic ops. By falling back to eager we can use a sort-based implementation of this op. hf_T5 amp bf16 training in TorchBench gets a 2x improvement on Max 1550. I also align with CUDA on the gelu decomposition in _addmm_activation.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @gujinghui @fengyuan14 @guangyey
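
For illustration, here is a minimal sketch of the workload pattern this PR targets: the backward of nn.Embedding under torch.compile on an XPU device. The shapes are hypothetical (loosely T5-like) and assume a PyTorch build with XPU support; they are not taken from the PR.

import torch
import torch.nn as nn

# Hypothetical sizes, roughly T5-like; purely illustrative.
emb = nn.Embedding(32128, 512).to("xpu", dtype=torch.bfloat16)

@torch.compile
def step(idx):
    # The backward of this lookup is aten.embedding_dense_backward, which this
    # PR keeps as an Inductor fallback (the eager, sort-based kernel) on XPU
    # instead of letting Inductor decompose it into atomic-add scatters.
    return emb(idx).sum()

idx = torch.randint(0, 32128, (8, 512), device="xpu")
step(idx).backward()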

pytorch-bot bot commented Apr 18, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/151637

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ You can merge normally! (1 Unrelated Failure)

As of commit 4631d97 with merge base 7412b33:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@jianyizh
Contributor Author

@pytorchbot label "topic: not user facing"

@jianyizh
Contributor Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased jianyi/fallback_embedding_bwd onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout jianyi/fallback_embedding_bwd && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the jianyi/fallback_embedding_bwd branch from 6bd1e22 to 48e3443 on April 18, 2025 03:49
@mengfei25
Contributor

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased jianyi/fallback_embedding_bwd onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout jianyi/fallback_embedding_bwd && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the jianyi/fallback_embedding_bwd branch from 48e3443 to 0edf49e on April 18, 2025 07:35
Comment on lines 1239 to 1240
    if grad_output.is_xpu:
        return NotImplemented
Collaborator

@jianyizh, I'm considering moving the logic into Inductor because the computation behavior differs between Intel GPU SKUs from generation to generation.

Contributor Author

done
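
For readers following the thread: the two approaches discussed here differ mainly in where the device check lives. A rough sketch (abridged; not the exact code from either version of the PR):

# Earlier approach: short-circuit the decomposition itself, as in the snippet
# quoted above. Returning NotImplemented signals that the decomposition should
# not be used, so the original aten op is kept.
def embedding_dense_backward(grad_output, indices, num_weights, padding_idx, scale_grad_by_freq):
    if grad_output.is_xpu:
        return NotImplemented
    ...  # existing decomposition for other devices

# Approach taken after this review: keep the decomposition device-agnostic and
# register the op as a fallback in Inductor's lowering instead (see the
# make_fallback hunk further down in this conversation).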

@@ -1474,7 +1476,7 @@ def _addmm_activation(
 ):
     out = addmm(self, mat1, mat2, beta, alpha)
     if use_gelu:
-        if self.is_cuda:
+        if self.is_cuda or self.is_xpu:
Collaborator

This change is irrelevant to this PR, right?

Contributor Author

Yes, I just want to align with CUDA. I don't want to open another PR for such a small change.

Collaborator

Please help exclude these changes.
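
For context on the gelu alignment discussed in this thread, the decomposition around the changed line has roughly the following shape. This is a simplified sketch, not the verbatim source; the "tanh" approximation on CUDA matches the cuBLASLt gelu epilogue, and the change makes XPU take the same branch.

import torch

def _addmm_activation_sketch(self, mat1, mat2, beta=1, alpha=1, use_gelu=False):
    # Simplified stand-in for the real decomposition's addmm helper.
    out = torch.addmm(self, mat1, mat2, beta=beta, alpha=alpha)
    if use_gelu:
        if self.is_cuda or self.is_xpu:  # after this PR, XPU aligns with CUDA
            return torch.nn.functional.gelu(out, approximate="tanh")
        return torch.nn.functional.gelu(out)
    return torch.relu(out)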

@guangyey guangyey moved this to Pre-Review Required in PyTorch Intel Apr 20, 2025
@jianyizh jianyizh marked this pull request as ready for review April 24, 2025 01:39
Collaborator
@guangyey commented Apr 25, 2025

Hi @jianyizh, could you elaborate on the motivation behind this PR in the description? It would also be helpful to clarify what issues or limitations would arise without this PR. Additionally, if possible, please consider adding a test case to validate the changes.

Contributor Author
@jianyizh commented Apr 25, 2025

Hi @jianyizh, could you elaborate on the motivation behind this PR in the description? It would also be helpful to clarify what issues or limitations would arise without this PR. Additionally, if possible, please consider adding a test case to validate the changes.

@guangyey I have updated the PR description. For tests, we already have UTs covering embedding_dense_backward itself in PyTorch; I'm not sure how or where to add a UT that checks the op fallback in Inductor.
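
One possible way to exercise the fallback from the Inductor side (a hypothetical sketch, not an existing test or an established convention in the suite) is to compile a small embedding backward and check that the aten op still appears in the generated code, e.g. via torch._inductor.utils.run_and_get_code:

import torch
from torch._inductor.utils import run_and_get_code

def check_embedding_bwd_falls_back():
    # Hypothetical helper name and structure; shapes are arbitrary.
    weight = torch.randn(100, 16, device="xpu", requires_grad=True)
    idx = torch.randint(0, 100, (32,), device="xpu")

    def fn(w, i):
        return torch.nn.functional.embedding(i, w).sum()

    compiled = torch.compile(fn)
    _, codes = run_and_get_code(lambda: compiled(weight, idx).backward())
    # If the op is an Inductor fallback, the generated backward code should
    # reference the aten kernel rather than an inlined (decomposed) version.
    assert any("embedding_dense_backward" in code for code in codes)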

    padding_idx: int,
    scale_grad_by_freq: bool,
):
    if grad_output.is_xpu:
Collaborator

If the next-generation Intel GPU provides strong performance for atomic operations and no longer requires a fallback path, how can we further improve or ensure compatibility with the current implementation?

Contributor Author
@jianyizh Apr 25, 2025

I think currently the only way is to distinguish by device name; we do not have an API to query atomic-op capability. For the next generation we would also need to change the implementation of the eager op. The fallback itself does not hurt performance much (it only loses one fusion). In any case, we need this fallback for at least a year.

Collaborator

Let's add a TODO and comments here; we should find an elegant method to keep backward compatibility for future architectures.

Collaborator

@jianyizh , it would be better to add a check here. For example, fall back to aten if the GPU arch is not XE4.

Contributor Author

Currently torch.xpu.get_device_properties().architecture does not return a meaningful result; it returns 13136561920 on Max 1550. Maybe we can query the device architecture as described in https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_device_architecture.asciidoc#querying-the-device-architecture-with-the-information-descriptor
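
Until the architecture query is usable, one possibility (purely illustrative; the helper name and the name-matching heuristic are assumptions, not anything in this PR) would be to gate the fallback on the device name reported by torch.xpu.get_device_properties:

import torch

def xpu_prefers_embedding_bwd_fallback(device_index: int = 0) -> bool:
    # Illustrative heuristic only: there is no portable API today for querying
    # atomic-op capability, so key off the reported device name instead.
    if not torch.xpu.is_available():
        return False
    name = torch.xpu.get_device_properties(device_index).name.lower()
    # Assumed: current Data Center GPU Max parts want the fallback; a future
    # architecture with fast atomics could be excluded here once identifiable.
    return "max" in name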

@guangyey
Collaborator

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased jianyi/fallback_embedding_bwd onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout jianyi/fallback_embedding_bwd && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the jianyi/fallback_embedding_bwd branch from cb94c1d to 4023e26 on April 28, 2025 02:23
@guangyey guangyey added the ciflow/xpu, release notes: xpu, and module: xpu labels Apr 28, 2025
@guangyey guangyey added the keep-going label and removed the ciflow/inductor label Apr 28, 2025
@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased jianyi/fallback_embedding_bwd onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout jianyi/fallback_embedding_bwd && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the jianyi/fallback_embedding_bwd branch from d02a287 to 4631d97 on May 15, 2025 02:12
@pytorch-bot pytorch-bot bot removed the ciflow/trunk and ciflow/xpu labels May 15, 2025
@etaf etaf added the ciflow/xpu label May 15, 2025
pytorch-bot bot commented May 15, 2025

To add the ciflow label ciflow/xpu please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot pytorch-bot bot removed the ciflow/xpu label May 15, 2025
@etaf etaf added the ciflow/xpu label May 15, 2025
@EikanWang EikanWang moved this from Pre-Review Required to Review Required in PyTorch Intel May 15, 2025
@EikanWang
Collaborator

@jianyizh , could you help share the performance data here?

@EikanWang EikanWang requested a review from jansel May 15, 2025 12:20
Comment on lines +2624 to +2627
if torch.xpu.is_available():
    make_fallback(
        aten.embedding_dense_backward, warn=False
    )  # (XPU-only and faster than decomp)
Collaborator

It seems like a patch. I'm considering that we may need to provide a device-specific fallback design. Meanwhile, @jianyizh, please add another section dedicated to XPU besides 1) Easy, 2) Medium, ...
https://github.com/pytorch/pytorch/pull/151637/files#diff-a1b077971cddfabfa0071c5162265066e867bc07721816d95b9cbe58431c38e3R2618

@jianyizh
Contributor Author

@jianyizh , could you help share the performance data here?
I have updated the PR description.

@EikanWang EikanWang moved this from Review Required to Approved in PyTorch Intel May 17, 2025
@EikanWang
Collaborator

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label May 17, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 job has failed: Apply lint suggestions

Details for Dev Infra team · Raised by workflow job

Labels
ciflow/inductor, ciflow/trunk (Trigger trunk jobs on your pull request), ciflow/xpu (Run XPU CI tasks), keep-going (Don't stop on first failure, keep running tests until the end), module: inductor, module: xpu (Intel XPU related issues), open source, release notes: xpu (release notes category), topic: not user facing (topic category), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
Projects
Status: Approved
Development

Successfully merging this pull request may close these issues.

10 participants