softmax: add device check for xpu with half_to_float #150278
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/150278
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 Unrelated Failure)
As of commit 6bc08b1 with merge base 004dad4:
FLAKY - The following job failed but was likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This PR is pending on intel/torch-xpu-ops#1516; please do not merge it until that third-party PR has merged.
To add the ciflow label, please first approve the workflows that are awaiting approval. This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.
Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
Ready for review, as PR intel/torch-xpu-ops#1516 has landed. Please help review again. @guangyey
I think this PR depends on a torch-xpu-ops commit pin update.
Additionally, you need to add a UT to ensure the half_to_float functionality works as expected.
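A minimal sketch of what such a UT could look like (assuming an XPU build and device are available; the test name and tolerances are illustrative, not the actual test added to the PR):

```python
import torch
import torch.nn.functional as F


def test_softmax_half_to_float_xpu():
    # Hypothetical sketch: compare the fused half->float softmax path on XPU
    # against a float32 reference computed with an explicit upcast.
    x = torch.randn(128, 256, dtype=torch.half, device="xpu")
    # Passing dtype=torch.float should dispatch to
    # at::_softmax(x, dim, /*half_to_float=*/True) on devices that support
    # the fused path, which this PR enables for XPU.
    out = F.softmax(x, dim=-1, dtype=torch.float)
    ref = F.softmax(x.float(), dim=-1)
    assert out.dtype == torch.float32
    torch.testing.assert_close(out, ref, rtol=1e-3, atol=1e-3)
```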
…into xpu-softmax
@weishi-deng, I suppose the
By the way, please check the test case failure.
These are cases with strided loads and strided stores (dim=0). In the kernel implementation, we set the vectorization length to 16 bytes / sizeof(input scalar), so it grows from float-4 to half-8 when the input is half. For the strided cases, this increases the number of memory load instructions issued per thread along the stride. As synced with @xytintel, we think the perf drop for these cases is expected.
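As a rough illustration of the heuristic described above (a conceptual Python sketch, not the actual kernel code; the 16-byte vector width is taken from the comment itself):

```python
import torch

VEC_BYTES = 16  # vector width in bytes, per the heuristic above


def vec_len(dtype: torch.dtype) -> int:
    # Elements per vectorized access: 16 bytes / sizeof(element).
    return VEC_BYTES // torch.empty((), dtype=dtype).element_size()


print(vec_len(torch.float))  # 4 -> float-4 vectorized accesses
print(vec_len(torch.half))   # 8 -> half-8: in the strided (dim=0) cases each
                             # thread covers a wider footprint, so more memory
                             # load instructions are issued per thread.
```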
Have you evaluated the optimization heuristic rule on Arc B-series?
Is there any data to support the conclusion?
Frankly speaking, it should not be treated as expected; in general, we don't expect any performance regression. Let's collect more E2E performance data.
We improved the kernel implementation for softmax with half input and float output for Intel GPUs (intel/torch-xpu-ops#1516). The optimized kernel fuses the data type casting and the softmax computation. To apply the optimization, we need to allow `XPU` to be dispatched to `at::_softmax` with `half_to_float` being `true`, say `at::_softmax(input_, dim_, true)`.
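For reference, a minimal sketch of how the fused path would be exercised from Python (shapes are illustrative, assuming an XPU device is available):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1024, 1024, dtype=torch.half, device="xpu")
# Requesting a float32 output from a half input maps to
# at::_softmax(input_, dim_, /*half_to_float=*/true); with this PR it reaches
# the fused cast+softmax kernel on XPU instead of failing the device check.
y = F.softmax(x, dim=-1, dtype=torch.float)
assert y.dtype == torch.float32
```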
Performance Improvement:
cc @gujinghui @EikanWang @fengyuan14 @guangyey