softmax: add device check for xpu with half_to_float by weishi-deng · Pull Request #150278 · pytorch/pytorch · GitHub

softmax: add device check for xpu with half_to_float #150278


Open

wants to merge 51 commits into base: main

Conversation

weishi-deng
Contributor
@weishi-deng weishi-deng commented Mar 31, 2025

We improved the kernel implementation of softmax with half input and float output for Intel GPUs (intel/torch-xpu-ops#1516). The optimized kernel fuses the data-type cast and the softmax computation. To apply the optimization, we need to allow XPU to be dispatched to at::_softmax with half_to_float set to true, i.e. at::_softmax(input_, dim_, true).
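
For reference, a minimal sketch of how this path is exercised from the Python API (assuming an XPU-enabled build; the shape here is illustrative):

```python
import torch

# Half-precision input on an XPU device; requesting a float32 output dtype
# is the Python-level entry point that can map to
# at::_softmax(input, dim, /*half_to_float=*/true) in the native dispatch.
x = torch.randn(1024, 1024, device="xpu", dtype=torch.half)

# With this PR, the cast and the softmax run as one fused XPU kernel
# instead of an explicit .float() cast followed by a float32 softmax.
y = torch.softmax(x, dim=1, dtype=torch.float32)

assert y.dtype == torch.float32
```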

Performance Improvement:

| reduction | shape | PR improvement % |
| --- | --- | --- |
| dim=0 | [64, 64] | 175% |
| dim=0 | [8192, 8192] | 160% |
| dim=0 | [64, 8192] | 104% |
| dim=0 | [8192, 64] | 163% |
| dim=0 | [1024, 1024] | 208% |
| dim=0 | [16384, 16384] | 156% |
| dim=1 | [64, 64] | 252% |
| dim=1 | [8192, 8192] | 151% |
| dim=1 | [64, 8192] | 224% |
| dim=1 | [8192, 64] | 214% |
| dim=1 | [1024, 1024] | 234% |
| dim=1 | [16384, 16384] | 151% |
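
A micro-benchmark sketch of the kind that could produce numbers like these (the warm-up count, iteration count, and timing method are assumptions, not the authors' actual harness):

```python
import time
import torch

def bench_softmax(shape, dim, iters=100, warmup=10):
    # Time the half->float softmax path on XPU.
    x = torch.randn(*shape, device="xpu", dtype=torch.half)
    for _ in range(warmup):
        torch.softmax(x, dim=dim, dtype=torch.float32)
    torch.xpu.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        torch.softmax(x, dim=dim, dtype=torch.float32)
    torch.xpu.synchronize()
    return (time.perf_counter() - start) / iters

for dim in (0, 1):
    for shape in [(64, 64), (8192, 8192), (64, 8192),
                  (8192, 64), (1024, 1024), (16384, 16384)]:
        print(f"dim={dim} shape={list(shape)}: {bench_softmax(shape, dim):.6f}s")
```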

cc @gujinghui @EikanWang @fengyuan14 @guangyey

pytorch-bot bot commented Mar 31, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/150278

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 0de4614 with merge base 4585c33:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@weishi-deng weishi-deng marked this pull request as draft March 31, 2025 03:31
@weishi-deng
Contributor Author

This PR is pending on intel/torch-xpu-ops#1516; please do not merge until that third-party PR is merged.

@guangyey guangyey moved this to Pre-Review Required in PyTorch Intel Apr 1, 2025
@guangyey guangyey added the ciflow/xpu Run XPU CI tasks label Apr 1, 2025
pytorch-bot bot commented Apr 1, 2025

To add the ciflow label ciflow/xpu please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot pytorch-bot bot removed the ciflow/xpu Run XPU CI tasks label Apr 1, 2025
@guangyey guangyey added ciflow/xpu Run XPU CI tasks release notes: xpu release notes category labels Apr 1, 2025
Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
@pytorch-bot pytorch-bot bot removed the ciflow/xpu Run XPU CI tasks label Apr 8, 2025
weishi-deng and others added 4 commits April 8, 2025 14:11
Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
@weishi-deng weishi-deng requested a review from guangyey April 25, 2025 01:59
@weishi-deng weishi-deng marked this pull request as ready for review April 25, 2025 01:59
@weishi-deng
Contributor Author

Ready for review, as intel/torch-xpu-ops#1516 has landed. Please review again. @guangyey

@guangyey guangyey added the module: xpu Intel XPU related issues label Apr 25, 2025
@guangyey
Collaborator
guangyey commented Apr 25, 2025

I think this PR depends on a torch-xpu-ops commit pin update.
Please sync with @xytintel and make sure the next pin update contains intel/torch-xpu-ops#1516.

@guangyey guangyey marked this pull request as draft April 25, 2025 02:38
@guangyey
Collaborator

Additionally, you need to add a UT to ensure the half_to_float functionality works as expected.
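
A minimal sketch of such a test (the test name and its placement, e.g. in test/test_xpu.py, are assumptions): the fused half→float path should match casting first and then running softmax in float32.

```python
import torch
from torch.testing import assert_close

def test_softmax_half_to_float_xpu():
    x = torch.randn(128, 256, device="xpu", dtype=torch.half)
    # Fused path: half input, float32 output in one kernel.
    fused = torch.softmax(x, dim=1, dtype=torch.float32)
    # Reference path: explicit cast, then float32 softmax.
    reference = torch.softmax(x.float(), dim=1)
    assert fused.dtype == torch.float32
    assert_close(fused, reference, rtol=1e-3, atol=1e-3)
```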

@etaf etaf added the ciflow/xpu Run XPU CI tasks label May 13, 2025
@pytorch-bot pytorch-bot bot removed the ciflow/xpu Run XPU CI tasks label May 14, 2025
@ZhiweiYan-96 ZhiweiYan-96 added the ciflow/xpu Run XPU CI tasks label May 14, 2025
pytorch-bot bot commented May 14, 2025

To add the ciflow label ciflow/xpu please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot pytorch-bot bot removed the ciflow/xpu Run XPU CI tasks label May 14, 2025
@ZhiweiYan-96 ZhiweiYan-96 added ciflow/trunk Trigger trunk jobs on your pull request ciflow/xpu Run XPU CI tasks labels May 14, 2025
pytorch-bot bot commented May 14, 2025

To add the ciflow label ciflow/trunk please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

pytorch-bot bot commented May 14, 2025

To add the ciflow label ciflow/xpu please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot pytorch-bot bot removed ciflow/trunk Trigger trunk jobs on your pull request ciflow/xpu Run XPU CI tasks labels May 14, 2025
@EikanWang EikanWang added the ciflow/xpu Run XPU CI tasks label May 14, 2025
@EikanWang
Collaborator

@weishi-deng, I suppose that for the perf % (PR vs main), higher is better, right? If so, do you know why the following two shapes regress?

| single UT with softmax(shape, dim) | perf % (PR vs main) |
| --- | --- |
| [8192, 64], dim=0 | 76% |
| [1024, 1024], dim=0 | 71% |

@EikanWang
Collaborator

By the way, please check the test case failure.

@weishi-deng
Contributor Author

@weishi-deng, I suppose that for the perf % (PR vs main), higher is better, right? If so, do you know why the following two shapes regress?

| single UT with softmax(shape, dim) | perf % (PR vs main) |
| --- | --- |
| [8192, 64], dim=0 | 76% |
| [1024, 1024], dim=0 | 71% |

These are cases with strided loads and strided stores (dim=0). In the kernel implementation, we set the vectorized length to 16 bytes / sizeof(input scalar), so the vectorization length grows from float-4 to half-8 when we enable half input. For the strided cases, this increases the number of strided memory load instructions per thread. As synced with @xytintel, we think the perf drop for these cases is expected.
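
A sketch of the heuristic as described (the constant and function names are illustrative, not the kernel's actual identifiers):

```python
# Vector width is derived from a fixed 16-byte load, so halving the element
# size doubles the number of elements per vectorized access.
VEC_BYTES = 16

def vec_len(itemsize: int) -> int:
    return VEC_BYTES // itemsize

assert vec_len(4) == 4  # float32 -> float-4 vectorization
assert vec_len(2) == 8  # float16 -> half-8 vectorization
# Per the explanation above, for dim=0 (strided) reductions the wider half-8
# vector means each thread issues more strided loads, which is the suggested
# cause of the [8192, 64] and [1024, 1024] regressions.
```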

@pytorch-bot pytorch-bot bot removed the ciflow/xpu Run XPU CI tasks label May 15, 2025
@etaf etaf added the ciflow/xpu Run XPU CI tasks label May 15, 2025
@chuanqi129 chuanqi129 added the keep-going Don't stop on first failure, keep running tests until the end label May 15, 2025
@EikanWang
Collaborator

In the kernel implementation, we set the vectorized length to 16 bytes / sizeof(input scalar), so the vectorization length grows from float-4 to half-8 when we enable half input.

Have you evaluated the optimization heuristic rule on Arc B-series?

For the strided cases, this increases the number of strided memory load instructions per thread.

Is there any data to support the conclusion?

As synced with @xytintel, we think the perf drop for these cases is expected.

Frankly speaking, it should not be treated as expected. In general, we don't expect any performance regression. Let's collect more E2E performance data.

@pytorch-bot pytorch-bot bot removed the ciflow/xpu Run XPU CI tasks label May 19, 2025
@weishi-deng
Contributor Author

The perf statistics are updated now that the softmax implementation has been further optimized in intel/torch-xpu-ops#1781.

@guangyey
Collaborator

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Rebase failed due to Command git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/viable/strict pull/150278/head returned non-zero exit code 1

Rebasing (1/42)
Rebasing (2/42)
Rebasing (3/42)
Rebasing (4/42)
Rebasing (5/42)
Rebasing (6/42)
Rebasing (7/42)
Rebasing (8/42)
Auto-merging test/test_xpu.py
CONFLICT (content): Merge conflict in test/test_xpu.py
error: could not apply a9b352b5be4... update unit test
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Could not apply a9b352b5be4... update unit test

Raised by https://github.com/pytorch/pytorch/actions/runs/15872719592

Labels
keep-going Don't stop on first failure, keep running tests until the end module: xpu Intel XPU related issues open source release notes: xpu release notes category
Projects
Status: Pre-Review Required