[Intel GPU][Inductor] Fallback embedding_dense_backward on XPU #151637
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/151637
Note: Links to docs will display an error until the docs builds have been completed.
❗ 1 Active SEV: there is 1 currently active SEV. If your PR is affected, please view it below.
✅ You can merge normally! (1 Unrelated Failure)
As of commit 4631d97 with merge base 7412b33. FLAKY: the following job failed but was likely due to flakiness present on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "topic: not user facing"
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased 6bd1e22 to 48e3443 (Compare)
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased 48e3443 to 0edf49e (Compare)
torch/_decomp/decompositions.py
Outdated
if grad_output.is_xpu:
    return NotImplemented
@jianyizh, I'm considering moving this logic into Inductor, since computation behavior differs between Intel GPU generations (SKUs).
done
torch/_decomp/decompositions.py
Outdated
@@ -1474,7 +1476,7 @@ def _addmm_activation(
 ):
     out = addmm(self, mat1, mat2, beta, alpha)
     if use_gelu:
-        if self.is_cuda:
+        if self.is_cuda or self.is_xpu:
This change is irrelevant to this PR, right?
Yes, I just want to align with CUDA. I don't want to open another PR for such a small change.
Please exclude these changes from this PR.
Hi @jianyizh, could you elaborate on the motivation behind this PR in the description? It would also be helpful to clarify what issues or limitations would arise without this PR. Additionally, if possible, please consider adding a test case to validate the changes.
@guangyey I have updated the PR description. For testing, we already have UTs for embedding_dense_backward itself in PyTorch. I'm not sure how and where to add a UT that checks the op fallback in Inductor.
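One possible approach (a rough sketch, not taken from this PR): compile the embedding forward/backward and inspect the generated code with torch._inductor.utils.run_and_get_code; the exact marker string asserted on is an assumption.

```python
# Hypothetical sketch of an Inductor fallback check: compile an embedding
# forward/backward on XPU and assert that embedding_dense_backward shows up
# in the generated code as a fallback call rather than a decomposed kernel.
import torch
from torch._inductor.utils import run_and_get_code


def check_embedding_dense_backward_fallback():
    if not torch.xpu.is_available():
        return  # nothing to check without an XPU device

    emb = torch.nn.Embedding(128, 64, device="xpu")
    idx = torch.randint(0, 128, (32,), device="xpu")

    compiled = torch.compile(lambda i: emb(i).sum())

    def train_step(i):
        loss = compiled(i)
        loss.backward()
        return emb.weight.grad

    _, codes = run_and_get_code(train_step, idx)
    # The marker below is an assumption; the intent is that the op appears
    # as an extern/fallback kernel instead of a generated Triton kernel.
    assert any("embedding_dense_backward" in code for code in codes)
```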
    padding_idx: int,
    scale_grad_by_freq: bool,
):
    if grad_output.is_xpu:
If the next-generation Intel GPU provides strong performance for atomic operations and no longer requires a fallback path, how can we further improve or ensure compatibility with the current implementation?
I think currently the only way is to distinguish by device name; we do not have an API that reports atomic-op capability. For the next generation we would also need to change the eager-op implementation. The fallback itself does not hurt performance much (it only costs one fusion). In any case, we need this fallback for at least a year.
Let's add a TODO and comments here; we should find an elegant way to keep backward compatibility for future architectures.
@jianyizh, it would be better to add a check here. For example, fall back to ATen only if the GPU arch is not XE4.
Currently torch.xpu.get_device_properties().architecture does not return a meaningful result; it returns 13136561920 on Max 1550. Maybe we can query the device architecture as described in https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_device_architecture.asciidoc#querying-the-device-architecture-with-the-information-descriptor
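As a concrete illustration of the device-name approach discussed above (a sketch under stated assumptions; the helper name and the name substrings are hypothetical and not part of this PR):

```python
# Hypothetical helper: decide whether to fall back to the ATen kernel based
# on the reported device name, since the architecture query is not reliable
# today on XPU.
import torch


def _xpu_needs_embedding_backward_fallback(device_index: int = 0) -> bool:
    if not torch.xpu.is_available():
        return False
    props = torch.xpu.get_device_properties(device_index)
    # Assumed heuristic: current Max/Flex parts benefit from the sort-based
    # eager kernel; a future architecture with fast atomics could be removed
    # from this list.
    slow_atomics_markers = ("Max", "Flex")
    return any(marker in props.name for marker in slow_atomics_markers)
```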
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased cb94c1d to 4023e26 (Compare)
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased d02a287 to 4631d97 (Compare)
To add the ciflow label, the pending workflows need to be approved first. This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.
@jianyizh, could you help share the performance data here?
if torch.xpu.is_available():
    make_fallback(
        aten.embedding_dense_backward, warn=False
    )  # (XPU-only and faster than decomp)
This seems like a point patch. I'm considering whether we need to provide a device-specific fallback design. Meanwhile, @jianyizh, please add another section dedicated to XPU besides 1) Easy 2) Medium ...
https://github.com/pytorch/pytorch/pull/151637/files#diff-a1b077971cddfabfa0071c5162265066e867bc07721816d95b9cbe58431c38e3R2618
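One possible shape for such a device-specific fallback design is sketched below; the table and helper names are hypothetical and do not exist in torch._inductor today:

```python
# Hypothetical sketch of a device-keyed fallback table for Inductor lowerings.
import torch
from torch._inductor.lowering import make_fallback

aten = torch.ops.aten

DEVICE_FALLBACKS = {
    # Ops that should stay on their ATen kernels when compiling for XPU.
    "xpu": [aten.embedding_dense_backward],
}


def register_device_fallbacks(device_type: str) -> None:
    for op in DEVICE_FALLBACKS.get(device_type, []):
        make_fallback(op, warn=False)


if torch.xpu.is_available():
    register_device_fallbacks("xpu")
```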
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job has failed, the first few of them are: Apply lint suggestions. Details for Dev Infra team: raised by workflow job.
This reopens #146888; the modification now only affects the XPU device. We do not want to decompose embedding_dense_backward for torch.compile, because current XPU devices have hardware limitations on atomic ops. By falling back to eager, we can use the sort-based implementation of this op. hf_T5 amp bf16 training in TorchBench gets a 2x improvement on Max 1550.
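For context, a minimal sketch of the affected pattern (illustrative only; the shapes and the device guard are assumptions, and the 2x figure above comes from the TorchBench run, not from this snippet):

```python
# The pattern this PR targets: the backward of nn.Embedding under
# torch.compile. With the Inductor fallback, the compiled backward graph
# calls the ATen embedding_dense_backward kernel (sort-based on XPU) instead
# of a generated kernel that relies on atomic adds.
import torch

device = "xpu" if torch.xpu.is_available() else "cpu"

emb = torch.nn.Embedding(30522, 768, device=device)


@torch.compile
def forward(indices):
    return emb(indices).sum()


idx = torch.randint(0, 30522, (8, 512), device=device)
loss = forward(idx)
loss.backward()  # grad for emb.weight comes from embedding_dense_backward
```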
I also align with CUDA on the GELU decomposition in _addmm_activation.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @gujinghui @fengyuan14 @guangyey