[Intel GPU] Fallback embedding_dense_backward on XPU #146888
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/146888
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure, 2 Unrelated Failures
As of commit d3eeab2 with merge base 229fb0b:
NEW FAILURE - The following job has failed:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "topic: not user facing"
if torch.xpu.is_available():
    decomps_to_exclude.append(aten.embedding_dense_backward)
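For context, a minimal sketch of how such an exclusion sits in torch/_inductor/decomposition.py; the surrounding entries and the call to remove_decompositions are paraphrased from that module, not copied from this PR's diff.

```python
import torch
from torch._inductor.decomposition import decompositions, remove_decompositions

aten = torch.ops.aten

# Illustrative exclusion list; only the embedding_dense_backward entry is
# the subject of this PR, the other entry is a placeholder.
decomps_to_exclude = [
    aten._unsafe_index,
]
if torch.xpu.is_available():
    # Keep embedding_dense_backward as a single op on XPU so Inductor falls
    # back to the eager kernel, which avoids the atomic-add path.
    decomps_to_exclude.append(aten.embedding_dense_backward)

remove_decompositions(decompositions, decomps_to_exclude)
```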
@jianyizh, could this be refined to check the device architecture? 16-bit atomics will be supported on future devices.
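A hedged sketch of what such a capability check could look like: has_atomic64 is a real field of torch.xpu.get_device_properties, while a 16-bit-atomics flag (the case raised here) is hypothetical and may not exist yet.

```python
import torch

def xpu_supports_needed_atomics(device=None):
    # Sketch only: gate the decomposition exclusion on device capabilities
    # instead of torch.xpu.is_available() alone.
    props = torch.xpu.get_device_properties(device)
    # has_atomic64 exists today; "has_atomic16" is a hypothetical attribute
    # standing in for whatever flag future devices expose for 16-bit atomics.
    return getattr(props, "has_atomic64", False) and getattr(props, "has_atomic16", False)
```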
I think embedding_dense_backward will always decompose to an fp32 calculation.
Won't this change the behavior of CPU on machines that also have an XPU installed? That seems not ideal.
I think we might need a wrapper decomp that checks the device.
I agree it's not ideal, but we don't know the device at decomposition-registration time. I see other decompositions return NotImplemented to fall back, but that does not work here.
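For reference, the wrapper-decomp idea would have looked roughly like the sketch below (hypothetical code, not part of the PR). The NotImplemented return is exactly the fallback path that was found not to work for this op, and the reference-decomposition import assumes the existing function in torch._decomp.decompositions.

```python
import torch
from torch._decomp.decompositions import embedding_dense_backward as ref_decomp

def embedding_dense_backward_wrapper(
    grad_output, indices, num_weights, padding_idx, scale_grad_by_freq
):
    # Hypothetical wrapper: dispatch on the runtime device and bail out on
    # XPU so the eager kernel (sort-based, no atomics) runs instead.
    if grad_output.device.type == "xpu":
        # Returning NotImplemented is how some decomps signal "fall back",
        # but as noted above it does not take effect for this op.
        return NotImplemented
    # Everywhere else, defer to the stock decomposition.
    return ref_decomp(grad_output, indices, num_weights, padding_idx, scale_grad_by_freq)
```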
To add the ciflow label, please first approve the workflows that are awaiting approval. This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.
After discussion, I don't think a wrapper decomp that checks the device is feasible.
@jianyizh, there should be other approaches that could achieve the goal.
Reopen of #146888; the modification now only affects the XPU device. We do not want to decompose embedding_dense_backward for torch.compile: current XPU devices have hardware limitations on atomic ops, so we fall back to eager, where the op can be implemented with a sort. hf_T5 amp bf16 training in torchbench gets a 2x improvement on Max 1550. ~~I also align with CUDA on the gelu decomposition in _addmm_activation~~ Pull Request resolved: #151637. Approved by: https://github.com/guangyey, https://github.com/etaf, https://github.com/jansel, https://github.com/EikanWang
Do not decompose embedding_dense_backward for torch.compile. Current XPU devices have hardware limitations on atomic ops, so we fall back to eager, where the op is implemented with a sort. hf_T5 amp bf16 training in torchbench gets a 2x improvement on Max 1550. I also align with CUDA on the gelu decomposition in _addmm_activation.
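For intuition, here is a pure-PyTorch sketch of the sort-based accumulation idea (not the actual XPU eager kernel; the function name and padding handling are illustrative): sorting the indices makes all gradient rows for the same embedding row contiguous, so each output row can be produced by one segment sum instead of many atomic adds.

```python
import torch

def embedding_dense_backward_by_sort(grad_output, indices, num_weights, padding_idx=-1):
    # Flatten to (num_lookups, embedding_dim) and sort by embedding index.
    flat_idx = indices.reshape(-1)
    flat_grad = grad_output.reshape(flat_idx.numel(), -1).float()
    sorted_idx, perm = torch.sort(flat_idx)
    sorted_grad = flat_grad[perm]

    # Reduce each contiguous segment of rows that share an index; every
    # destination row is written exactly once, so no atomics are required.
    unique_idx, counts = torch.unique_consecutive(sorted_idx, return_counts=True)
    segment_sums = torch.stack(
        [seg.sum(dim=0) for seg in torch.split(sorted_grad, counts.tolist())]
    )

    grad_weight = flat_grad.new_zeros(num_weights, flat_grad.size(1))
    grad_weight[unique_idx] = segment_sums
    if padding_idx >= 0:
        grad_weight[padding_idx].zero_()
    return grad_weight.to(grad_output.dtype)
```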
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov