[ATen][CUDA] Optimize 128 bit vectorization #148320


Closed · Aidyn-A wants to merge 2 commits from the cuda_vectorize_for_sm90+ branch

Conversation

@Aidyn-A (Collaborator) commented Mar 3, 2025

Fixes #147376.
As per request: #145746 (review)
This PR prevents sm80 and older architectures from using vec8 kernels, due to long compilation times and large binary size.
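
For context, "vec8" here means kernels that load and store 8 elements per thread; for 2-byte dtypes such as half/bfloat16 that is a single 128-bit memory transaction. A minimal illustrative sketch of such a vectorized access (not the actual ATen code; `aligned_vector` and `copy_vec8` are simplified stand-ins):

```cuda
#include <cstdint>

// Simplified stand-in for ATen's aligned vector type: alignas() lets the
// compiler emit a single 128-bit (LDG.128/STG.128) access when
// sizeof(scalar_t) * vec_size == 16 bytes, e.g. 8 x half.
template <typename scalar_t, int vec_size>
struct alignas(sizeof(scalar_t) * vec_size) aligned_vector {
  scalar_t val[vec_size];
};

// Hypothetical vectorized copy: each thread moves 8 contiguous elements
// with one 128-bit load and one 128-bit store.
template <typename scalar_t>
__global__ void copy_vec8(const scalar_t* in, scalar_t* out, int64_t n) {
  using vec_t = aligned_vector<scalar_t, 8>;
  int64_t i = (blockIdx.x * static_cast<int64_t>(blockDim.x) + threadIdx.x) * 8;
  if (i + 8 <= n) {
    *reinterpret_cast<vec_t*>(out + i) = *reinterpret_cast<const vec_t*>(in + i);
  }
}
```

Every such vec_size variant is instantiated for each elementwise functor and dtype, which is what inflates compile time and binary size; hence the restriction to newer architectures in this PR.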

cc @ptrblck @msaroufim @eqy @jerryzh168 @manuelcandales @SherlockNoMad @angelayi

@Aidyn-A added the `module: cuda` (Related to torch.cuda, and CUDA support in general) and `module: core aten` (Related to change to the Core ATen opset) labels on Mar 3, 2025
@Aidyn-A requested review from malfet, atalman and ngimel on March 3, 2025 10:15
@Aidyn-A self-assigned this on Mar 3, 2025
@Aidyn-A requested review from eqy and syed-ahmed as code owners on March 3, 2025 10:15
pytorch-bot bot commented Mar 3, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/148320

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 1 New Failure, 1 Cancelled Job, 50 Pending, 2 Unrelated Failures

As of commit 00da106 with merge base ce94b21:

NEW FAILURE - The following job has failed:

CANCELLED JOB - The following job was cancelled. Please retry:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot added the `release notes: cuda` (release notes category) label on Mar 3, 2025
@@ -215,6 +250,11 @@ static inline void launch_vectorized_kernel(
if constexpr (sizeof(cpp_type) < 2) {
vec_size = std::min<uint16_t>(vec_size, 4);
}
// Since we are not compiling vec8 kernel on sm<90, we are not calling it.
auto dprop = at::cuda::getCurrentDeviceProperties();
if (dprop->major < 9) {
@Skylion007 (Collaborator) commented Mar 3, 2025

How much does it increase compilation times? This is a very hot codepath and is used for primitives including copying. Is the original PR description wrong? This won't affect compile times since it's not constexpr.

After examining where this call is used, if I am not mistaken, it's only used for vectorized copy of fp8 and lower dtypes. Why would this decrease/increase compilation time? Wouldn't those float8 kernels just not be built on lower SMs? Except SM89?

Do we actually not want to allow this on FP8-supporting consumer GPUs like SM89?! We may want to update the vec8 kernels if we are not compiling them on SM89.

Suggested change:
- if (dprop->major < 9) {
+ if (dprop->major < 9 || (dprop->major == 8 && dprop->minor >= 9)) {

Collaborator Author (Aidyn-A):

> How much does it increase compilation times?

I would like to ask @atalman about this.

> it's only used for vectorized copy of fp8 and lower dtypes

Yes, it is for float16, bfloat16 and fp8.

Collaborator:

@Aidyn-A Maybe I am mistaken, but I thought float16 and bfloat16 had their own NVIDIA primitives they use and do not go through this codepath.

@Aidyn-A (Collaborator, Author) commented Mar 3, 2025

In general, TensorIterator is used for most elementwise ops (relu, exp, log, sigmoid, sin, cos, etc.) on all dtypes. Though, not all ops go through TensorIterator.
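
For readers less familiar with the machinery, a rough sketch of that pattern (simplified; `my_op_kernel_cuda` is a hypothetical name, not an existing op): an elementwise CUDA op builds a TensorIterator and hands a per-element lambda to `gpu_kernel`, which is what eventually reaches `launch_vectorized_kernel` and the vec_size selection changed in this PR.

```cpp
#include <ATen/Dispatch.h>
#include <ATen/native/TensorIterator.h>
#include <ATen/native/cuda/Loops.cuh>

namespace at::native {

// Hypothetical elementwise kernel following the common ATen pattern: the op
// author only writes the per-element lambda, while gpu_kernel() picks the
// vectorization width (vec2/vec4/vec8) underneath.
void my_op_kernel_cuda(TensorIteratorBase& iter) {
  AT_DISPATCH_FLOATING_TYPES_AND2(
      at::ScalarType::Half, at::ScalarType::BFloat16,
      iter.common_dtype(), "my_op_cuda", [&]() {
        gpu_kernel(iter, [] GPU_LAMBDA(scalar_t a) -> scalar_t {
          return a * a;  // the elementwise body
        });
      });
}

} // namespace at::native
```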

Contributor:

@Aidyn-A triggered the Windows builds, let's see if there is an improvement.

Contributor:

Hi @Aidyn-A, I don't see an improvement over the nightly build time (Build PyTorch Binary):

This PR:
https://github.com/pytorch/pytorch/actions/runs/13677007625/job/38239672663?pr=148320

Nightly:
https://github.com/pytorch/pytorch/actions/runs/13759231430/job/38471718241

Roughly: cuda 11.8 - cuda 12.6: 3h40m-3h55m, and cuda 12.8: 4h20m

This is a run of #147455:
https://github.com/pytorch/pytorch/actions/runs/13413313017/job/37468143832?pr=147455
cuda 11.8 - cuda 12.6: 3h20m-3h45m

// vectorized memory access
if constexpr (vec_size == 8) {
// To save some build time on CUDA, we are going to utilize vec8 only on SM90+ devices.
#if defined(USE_ROCM) || ((defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 900))
Collaborator:

Ah, I see. This is where the build time is changed. Please allow SM89 devices too if they support them.

Collaborator:

Just don't instantiate this kernel with vec_size == 8 on older arches, don't add 2 similar codeblocks to the kernel itself

Collaborator:

here, right

Collaborator Author (Aidyn-A):

> Just don't instantiate this kernel with vec_size == 8 on older arches, don't add 2 similar codeblocks to the kernel itself

I wish I could do that; the issue is that __CUDA_ARCH__ is available only inside kernels (device code). The place @eqy pointed to is host-side code.
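
To illustrate the constraint (a hedged sketch, not code from this PR): __CUDA_ARCH__ is only defined during the device-compilation passes, so a host-side launcher has to fall back on a runtime property query, which is what the launch_vectorized_kernel change does.

```cpp
#include <cuda_runtime.h>

__global__ void arch_guarded_kernel() {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 900
  // This branch exists only in the SM90+ entries of the fatbin.
#endif
}

void host_side_dispatch() {
  // Host compilation never defines __CUDA_ARCH__, so the architecture must be
  // queried at runtime, e.g. via cudaGetDeviceProperties (ATen wraps this in
  // at::cuda::getCurrentDeviceProperties()).
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, /*device=*/0);
  const bool allow_vec8 = prop.major >= 9;  // mirrors the host-side clamp above
  (void)allow_vec8;
  arch_guarded_kernel<<<1, 1>>>();
}
```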

Collaborator:

Are the binary size savings really worth it? This adds considerable complexity and maintenance burden to the kernel

Collaborator:

@ngimel this was raised because it was adding many (dozens?) of minutes to the Windows build time, not due to binary size concerns.

@janeyx99 added the `triaged` label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) on Mar 3, 2025
@atalman added the `ciflow/binaries_wheel` label (Trigger binary build and upload jobs for wheel on the PR) on Mar 5, 2025
@Aidyn-A force-pushed the cuda_vectorize_for_sm90+ branch from ea89139 to 4e0accc on March 11, 2025 17:53
@Aidyn-A marked this pull request as draft on March 12, 2025 14:29
@Aidyn-A force-pushed the cuda_vectorize_for_sm90+ branch from 2b243a5 to b0c21f7 on April 26, 2025 16:45
@Aidyn-A marked this pull request as ready for review on April 28, 2025 14:50
@malfet added the `ciflow/periodic` label (Trigger jobs ran periodically on master (periodic.yml) on the PR) on May 1, 2025
@atalman (Contributor) left a comment

lgtm

@atalman added this to the 2.7.1 milestone on May 1, 2025
@Aidyn-A (Collaborator, Author) commented May 1, 2025

ROCm and CPU test failures are unrelated
@pytorchbot merge -i

pytorch-bot added the `ciflow/trunk` label (Trigger trunk jobs on your pull request) on May 1, 2025
@pytorchmergebot (Collaborator):

Merge failed

Reason: 1 job has failed, first few of them are: trunk / linux-focal-rocm-py3.10 / test (default, 1, 2, linux.rocm.gpu.2)

Details for Dev Infra team: raised by workflow job

@Aidyn-A (Collaborator, Author) commented May 2, 2025

@pytorchbot rebase

@pytorchmergebot (Collaborator):

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot force-pushed the cuda_vectorize_for_sm90+ branch from b0c21f7 to c68d7e5 on May 2, 2025 08:32
@pytorchmergebot (Collaborator):

Successfully rebased cuda_vectorize_for_sm90+ onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout cuda_vectorize_for_sm90+ && git pull --rebase)

@malfet (Contributor) commented May 2, 2025

@pytorchbot merge -f "This looks fine"

@pytorchmergebot (Collaborator):

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort, and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here

@malfet (Contributor) commented May 6, 2025

@pytorchbot cherry-pick --onto release/2.7 -c regression

pytorchbot pushed a commit that referenced this pull request May 6, 2025
Fixes #147376.
As per request: #145746 (review)
This PR prevents sm80 and older architectures from using vec8 kernels, due to long compilation times and large binary size.

Pull Request resolved: #148320
Approved by: https://github.com/eqy, https://github.com/malfet, https://github.com/atalman

(cherry picked from commit 72337bd)
@pytorchbot (Collaborator):

Cherry picking #148320

The cherry pick PR is at #152967 and it is recommended to link a regression cherry pick PR with an issue. The following tracker issues are updated:

Details for Dev Infra team: raised by workflow job

malfet pushed a commit that referenced this pull request May 7, 2025
[ATen][CUDA] Optimize 128 bit vectorization (#148320)

Fixes #147376.
As per request: #145746 (review)
This PR prevents sm80 and older architectures from using vec8 kernels, due to long compilation times and large binary size.

Pull Request resolved: #148320
Approved by: https://github.com/eqy, https://github.com/malfet, https://github.com/atalman

(cherry picked from commit 72337bd)

Co-authored-by: Aidyn-A <31858918+Aidyn-A@users.noreply.github.com>
Labels
ciflow/binaries_wheel: Trigger binary build and upload jobs for wheel on the PR
ciflow/periodic: Trigger jobs ran periodically on master (periodic.yml) on the PR
ciflow/trunk: Trigger trunk jobs on your pull request
Merged
module: core aten: Related to change to the Core ATen opset
module: cuda: Related to torch.cuda, and CUDA support in general
open source
release notes: cuda: release notes category
triaged: This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Nightly Windows builds started to time out around Jan 31, 2025
9 participants