softmax: add device check for xpu with half_to_float #150278
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/150278
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (1 Unrelated Failure)
As of commit 6bc08b1 with merge base 004dad4:
FLAKY - The following job failed but was likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This PR is pending on intel/torch-xpu-ops#1516; please do not merge it until that third-party PR has merged.
To add the ciflow label, please first approve the workflows that are awaiting approval. This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.
Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
Ready for review, as PR intel/torch-xpu-ops#1516 has landed. Please help review again. @guangyey
I think this PR depends on a torch-xpu-ops commit pin update.
Additionally, you need to add a UT to ensure the half_to_float functionality works as expected.
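A minimal sketch of what such a UT could look like (assuming an XPU build and device are available; the test name and tolerances are illustrative, not the actual test added to the PR):

```python
import torch
import torch.nn.functional as F


def test_softmax_half_to_float_xpu():
    # Hypothetical sketch: compare the fused half->float softmax path on XPU
    # against a float32 reference computed with an explicit upcast.
    x = torch.randn(128, 256, dtype=torch.half, device="xpu")
    # Passing dtype=torch.float should dispatch to
    # at::_softmax(x, dim, /*half_to_float=*/True) on devices that support
    # the fused path, which this PR enables for XPU.
    out = F.softmax(x, dim=-1, dtype=torch.float)
    ref = F.softmax(x.float(), dim=-1)
    assert out.dtype == torch.float32
    torch.testing.assert_close(out, ref, rtol=1e-3, atol=1e-3)
```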
…into xpu-softmax
@weishi-deng, I suppose the
By the way, please check the test case failure.
These are cases with strided loads and strided stores (dim=0). In the kernel implementation, we set the vectorization length to 16 bytes / sizeof(input scalar), so it grows from float-4 to half-8 when the input is half. For the strided cases, this increases the number of memory load instructions issued per thread along the stride. As synced with @xytintel, we think the perf drop for these cases is expected.
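As a rough illustration of the heuristic described above (a conceptual Python sketch, not the actual kernel code; the 16-byte vector width is taken from the comment itself):

```python
import torch

VEC_BYTES = 16  # vector width in bytes, per the heuristic above


def vec_len(dtype: torch.dtype) -> int:
    # Elements per vectorized access: 16 bytes / sizeof(element).
    return VEC_BYTES // torch.empty((), dtype=dtype).element_size()


print(vec_len(torch.float))  # 4 -> float-4 vectorized accesses
print(vec_len(torch.half))   # 8 -> half-8: in the strided (dim=0) cases each
                             # thread covers a wider footprint, so more memory
                             # load instructions are issued per thread.
```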
Have you evaluated the optimization heuristic rule on Arc B-series?
Is there any data to support the conclusion?
Frankly speaking, it should not be treated as expected; in general, we don't expect any performance regression. Let's collect more E2E performance data.
We improved the kernel implementation for softmax with half input and float output for Intel GPUs (intel/torch-xpu-ops#1516). The optimized kernel fuses the data type casting and the softmax computation. To apply the optimization, we need to allow `XPU` to be dispatched to `at::_softmax` with `half_to_float` being `true`, say `at::_softmax(input_, dim_, true)`.
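For reference, a minimal sketch of how the fused path would be exercised from Python (shapes are illustrative, assuming an XPU device is available):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1024, 1024, dtype=torch.half, device="xpu")
# Requesting a float32 output from a half input maps to
# at::_softmax(input_, dim_, /*half_to_float=*/true); with this PR it reaches
# the fused cast+softmax kernel on XPU instead of failing the device check.
y = F.softmax(x, dim=-1, dtype=torch.float)
assert y.dtype == torch.float32
```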
Performance Improvement:
cc @gujinghui @EikanWang @fengyuan14 @guangyey