softmax: add device check for xpu with half_to_float by weishi-deng · Pull Request #150278 · pytorch/pytorch

softmax: add device check for xpu with half_to_float #150278


Open · wants to merge 49 commits into main
Conversation

weishi-deng (Contributor) commented Mar 31, 2025

We improved the kernel implementation for softmax with half input and float output for Intel GPUs (intel/torch-xpu-ops#1516). The optimized kernel fuses the data-type cast into the softmax computation. To apply the optimization, we need to allow XPU to dispatch to at::_softmax with half_to_float set to true, i.e., at::_softmax(input_, dim_, true).
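The fusion can be sketched numerically with NumPy (a hedged illustration, not the actual XPU kernel): the fused path widens each half element to float32 as it is consumed, rather than writing an intermediate float32 tensor, and must match an explicit cast followed by a float32 softmax.

```python
import numpy as np

def softmax_cast_then_compute(x_half, axis):
    x = x_half.astype(np.float32)              # separate cast kernel materializes a float32 tensor
    x = x - x.max(axis=axis, keepdims=True)    # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def softmax_fused(x_half, axis):
    # Each half element is widened to float32 as it is consumed, which is
    # what the half_to_float=True path of at::_softmax enables.
    m = x_half.max(axis=axis, keepdims=True).astype(np.float32)
    e = np.exp(x_half.astype(np.float32) - m)
    return e / e.sum(axis=axis, keepdims=True)

x = np.random.randn(64, 64).astype(np.float16)
out = softmax_fused(x, axis=1)
assert out.dtype == np.float32
assert np.allclose(out, softmax_cast_then_compute(x, axis=1))
```

Since half-to-float conversion is exact, the two paths agree; the win on the device is purely in memory traffic and kernel-launch count.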

Performance Improvement:

| single ut: softmax(shape, dim) | perf % (PR vs main) |
|---|---|
| [64, 64], dim=0 | 105% |
| [64, 64], dim=1 | 247% |
| [8192, 8192], dim=0 | 145% |
| [8192, 8192], dim=1 | 134% |
| [64, 8192], dim=0 | 110% |
| [64, 8192], dim=1 | 151% |
| [8192, 64], dim=0 | 76% |
| [8192, 64], dim=1 | 244% |
| [1024, 1024], dim=0 | 71% |
| [1024, 1024], dim=1 | 174% |

| OOB model (end-to-end) | perf % (PR vs main) |
|---|---|
| timm_models_amp_fp16_inference_cait_m36_384_eager | 136% |
| timm_models_amp_fp16_inference_cait_m36_384_inductor | 100% |

cc @gujinghui @EikanWang @fengyuan14 @guangyey

pytorch-bot commented Mar 31, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/150278

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 6bc08b1 with merge base 004dad4:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@weishi-deng weishi-deng marked this pull request as draft March 31, 2025 03:31
weishi-deng (Contributor, Author)

This PR depends on intel/torch-xpu-ops#1516; please do not merge until that third-party PR has landed.

@guangyey guangyey moved this to Pre-Review Required in PyTorch Intel Apr 1, 2025
@guangyey guangyey added the ciflow/xpu Run XPU CI tasks label Apr 1, 2025
pytorch-bot commented Apr 1, 2025

To add the ciflow label ciflow/xpu please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot pytorch-bot bot removed the ciflow/xpu Run XPU CI tasks label Apr 1, 2025
@guangyey guangyey added ciflow/xpu Run XPU CI tasks release notes: xpu release notes category labels Apr 1, 2025
Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
@pytorch-bot pytorch-bot bot removed the ciflow/xpu Run XPU CI tasks label Apr 8, 2025
weishi-deng and others added 4 commits April 8, 2025 14:11
Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
@weishi-deng weishi-deng requested a review from guangyey April 25, 2025 01:59
@weishi-deng weishi-deng marked this pull request as ready for review April 25, 2025 01:59
weishi-deng (Contributor, Author)

Ready for review now that intel/torch-xpu-ops#1516 has landed. Please review again, @guangyey.

@guangyey guangyey added the module: xpu Intel XPU related issues label Apr 25, 2025
guangyey (Collaborator) commented Apr 25, 2025

I think this PR depends on a torch-xpu-ops commit-pin update. Please sync with @xytintel and make sure the next pin update includes intel/torch-xpu-ops#1516.

@guangyey guangyey marked this pull request as draft April 25, 2025 02:38
guangyey (Collaborator)

Additionally, you need to add a unit test to ensure the half_to_float functionality works as expected.
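Such a unit test could take roughly the following shape (a hedged sketch: the test name and default device are hypothetical, and on real hardware the device would be "xpu"). It checks both the output dtype and numerical agreement with an explicit cast-then-softmax reference.

```python
import torch

def test_softmax_half_to_float(device="cpu"):
    # Hypothetical test sketch; on Intel GPU hardware, device would be "xpu".
    x = torch.randn(64, 128, dtype=torch.half, device=device)
    # dtype=torch.float exercises the half_to_float=True path of at::_softmax.
    out = torch.softmax(x, dim=1, dtype=torch.float)
    ref = torch.softmax(x.float(), dim=1)
    assert out.dtype == torch.float32
    torch.testing.assert_close(out, ref)

test_softmax_half_to_float()
```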

@etaf etaf added the ciflow/xpu Run XPU CI tasks label May 13, 2025
@pytorch-bot pytorch-bot bot removed the ciflow/xpu Run XPU CI tasks label May 14, 2025
@ZhiweiYan-96 ZhiweiYan-96 added the ciflow/xpu Run XPU CI tasks label May 14, 2025
pytorch-bot commented May 14, 2025 with the same ciflow approval notice as above.

@pytorch-bot pytorch-bot bot removed the ciflow/xpu Run XPU CI tasks label May 14, 2025
@ZhiweiYan-96 ZhiweiYan-96 added ciflow/trunk Trigger trunk jobs on your pull request ciflow/xpu Run XPU CI tasks labels May 14, 2025
pytorch-bot commented May 14, 2025 with the same approval notices for the ciflow/trunk and ciflow/xpu labels.

@pytorch-bot pytorch-bot bot removed ciflow/trunk Trigger trunk jobs on your pull request ciflow/xpu Run XPU CI tasks labels May 14, 2025
@EikanWang EikanWang added the ciflow/xpu Run XPU CI tasks label May 14, 2025
EikanWang (Collaborator)

@weishi-deng, I suppose for perf % (PR vs main) higher is better, right? If so, do you know why the following two shapes regress?

| softmax(shape, dim) | perf % (PR vs main) |
|---|---|
| [8192, 64], dim=0 | 76% |
| [1024, 1024], dim=0 | 71% |

EikanWang (Collaborator)

By the way, please check the test case failure.

weishi-deng (Contributor, Author)

> @weishi-deng, I suppose for perf % (PR vs main) higher is better, right? If so, do you know why the following two shapes have regression?
>
> [8192, 64], dim=0: 76%
> [1024, 1024], dim=0: 71%

These are the cases with strided loads and strided stores (dim=0). In the kernel implementation, we set the vectorization length to 16 bytes / sizeof(input scalar), so it grows from 4 elements (float) to 8 elements (half) when we enable half input. For the strided instances, this increases the number of strided memory-load instructions per thread. As synced with @xytintel, we think the perf drop for these cases is expected.
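The arithmetic behind this reply can be made concrete (a hedged sketch; the 16-byte vector width is taken from the comment above, not from the kernel source):

```python
VEC_BYTES = 16  # assumed vectorized-load width, per the explanation above

def vec_len(elem_size_bytes):
    # Elements moved per vectorized memory instruction.
    return VEC_BYTES // elem_size_bytes

print(vec_len(4))  # float32: 4 elements per load
print(vec_len(2))  # float16: 8 elements per load
# With dim=0 (strided access), a wider vector means each thread gathers
# more strided locations per logical row, which is the suggested source
# of the regression for the two shapes above.
```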

@pytorch-bot pytorch-bot bot removed the ciflow/xpu Run XPU CI tasks label May 15, 2025
@etaf etaf added the ciflow/xpu Run XPU CI tasks label May 15, 2025
@chuanqi129 chuanqi129 added the keep-going Don't stop on first failure, keep running tests until the end label May 15, 2025
EikanWang (Collaborator)

> In the kernel implementation, we set the vectorization length to 16 bytes / sizeof(input scalar), so it grows from float-4 to half-8 when we enable half input.

Have you evaluated this optimization heuristic on Arc B-series?

> For the strided instances, this increases the number of strided memory-load instructions per thread.

Is there any data to support this conclusion?

> As synced with @xytintel, we think the perf drop for these cases is expected.

Frankly speaking, it should not be treated as expected. In general, we do not accept any performance regression. Let's collect more E2E performance data.
