[Intel GPU] Fallback embedding_dense_backward on XPU by jianyizh · Pull Request #146888 · pytorch/pytorch · GitHub

[Intel GPU] Fallback embedding_dense_backward on XPU #146888


Closed
wants to merge 1 commit

Conversation

jianyizh
Contributor
@jianyizh jianyizh commented Feb 11, 2025

Do not decompose embedding_dense_backward for torch.compile on XPU. Current XPU devices have hardware limitations on atomic ops, so we fall back to eager, where the op is implemented with a sort. hf_T5 amp bf16 training in TorchBench gains a 2x improvement on Max 1550. This PR also aligns with CUDA on the gelu decomposition in _addmm_activation.
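The sort-based strategy the eager kernel relies on can be illustrated in plain Python (a simplified sketch, not the actual XPU kernel, and the function name is hypothetical): instead of scatter-adding each gradient row into the weight gradient with atomic ops, sort the indices so that all contributions to the same embedding row become contiguous and can be reduced sequentially.

```python
# Simplified sketch of a sort-based embedding_dense_backward.
# Real kernels operate on device tensors; here rows are plain lists
# of floats so the control flow is easy to follow.

def embedding_dense_backward_sorted(indices, grad_rows, num_weights):
    """Return grad_weight where grad_weight[i] is the sum of all
    grad_rows[j] with indices[j] == i, accumulated via sorting
    instead of atomic scatter-add."""
    dim = len(grad_rows[0]) if grad_rows else 0
    grad_weight = [[0.0] * dim for _ in range(num_weights)]

    # Sort gradient rows by their target index. On hardware, each
    # contiguous run of equal indices can then be reduced by a single
    # work-item, with no atomics needed.
    order = sorted(range(len(indices)), key=lambda j: indices[j])

    for j in order:
        row = indices[j]
        for d in range(dim):
            grad_weight[row][d] += grad_rows[j][d]
    return grad_weight
```

For example, indices `[1, 0, 1]` with gradient rows `[[1, 1], [2, 2], [3, 3]]` accumulate to `[2, 2]` for row 0 and `[4, 4]` for row 1.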

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov

pytorch-bot bot commented Feb 11, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/146888

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 2 Unrelated Failures

As of commit d3eeab2 with merge base 229fb0b:

NEW FAILURE - The following job has failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@jianyizh
Contributor Author

@pytorchbot label "topic: not user facing"

@jianyizh jianyizh marked this pull request as ready for review February 11, 2025 06:29
@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Feb 11, 2025
Comment on lines +122 to +124
if torch.xpu.is_available():
    decomps_to_exclude.append(aten.embedding_dense_backward)

Collaborator

@jianyizh, could it be refined to check the device architecture? 16-bit atomics will be supported on future devices.

Contributor Author

I think embedding_dense_backward will always decompose to an fp32 calculation.

Contributor

Won't this change the behavior of CPU on machines with an XPU installed? This seems not ideal.

I think we might need a wrapper decomp that checks the device.

Contributor Author
@jianyizh jianyizh Feb 13, 2025

I agree it's not ideal. We don't know the device at the decomposition-registration stage. I see other decompositions use `return NotImplemented` to fall back, but that does not work here.

Collaborator

Won't this change the behavior of CPU on machines with an XPU installed? This seems not ideal.

I think we might need a wrapper decomp that checks the device.

A wrapper decomp sounds like a good idea. @jianyizh, @etaf, let's implement the decomp wrapper to check the device.
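The control flow being proposed here could look roughly like the following (a minimal, dependency-free sketch; the function names and the `device` parameter are stand-ins for reading the device off the input tensors in the real torch._decomp registry, and note that jianyizh later reported the `NotImplemented` route did not work in this case):

```python
# Hypothetical sketch of a device-checking wrapper decomposition:
# keep the generic decomposition for CPU/CUDA, but decline on XPU so
# the compiler falls back to the eager (sort-based) kernel there.

def generic_embedding_dense_backward(grad, indices, num_weights):
    # Stand-in for the generic decomposition body.
    return ("decomposed", grad, indices, num_weights)

def embedding_dense_backward_wrapper(grad, indices, num_weights,
                                     device="cpu"):
    if device == "xpu":
        # Returning NotImplemented is the convention some
        # decompositions use to signal "fall back to eager".
        return NotImplemented
    return generic_embedding_dense_backward(grad, indices, num_weights)
```

The point of the wrapper is that the device check happens when the decomposition runs on concrete inputs, not when it is registered, so an XPU build no longer changes behavior for CPU tensors on the same machine.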

@EikanWang EikanWang added the ciflow/xpu Run XPU CI tasks label Feb 11, 2025
pytorch-bot bot commented Feb 11, 2025

To add the ciflow label ciflow/xpu please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot pytorch-bot bot removed the ciflow/xpu Run XPU CI tasks label Feb 11, 2025
@EikanWang EikanWang added the ciflow/xpu Run XPU CI tasks label Feb 11, 2025
@zou3519 zou3519 added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Feb 11, 2025

@EikanWang
Collaborator

@jianyizh , @etaf , any update?

@etaf
Collaborator
etaf commented Feb 24, 2025

@jianyizh , @etaf , any update?

Sorry for missing that, we'll update this PR ASAP.

@jianyizh
Contributor Author

After discussion, I don't think a wrapper decomp that checks the device is feasible.

@jianyizh jianyizh closed this Feb 25, 2025
@EikanWang
Collaborator

@jianyizh, there should be alternative approaches to achieve the goal.

pytorchmergebot pushed a commit that referenced this pull request May 19, 2025
Reopen #146888; now the modification only affects the xpu device. We do not want to decompose embedding_dense_backward for torch.compile. Current XPU devices have hardware limitations on atomic ops, so we fall back to eager and use sort to implement this op. hf_T5 amp bf16 training in TorchBench gains a 2x improvement on Max 1550. ~~I also align with cuda on gelu decomposition in _addmm_activation~~

Pull Request resolved: #151637
Approved by: https://github.com/guangyey, https://github.com/etaf, https://github.com/jansel, https://github.com/EikanWang
Labels
ciflow/xpu Run XPU CI tasks · module: inductor · open source · topic: not user facing topic category · triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
6 participants