[ATen][CUDA] Optimize 128 bit vectorization #152967

pytorchbot · 2025-05-06T18:22:16Z

Fixes #147376.
As per request: #145746 (review)
This PR excludes sm80 and older architectures from using vec8 kernels because of their long compilation time and large binary size.

cc @ptrblck @msaroufim @eqy @jerryzh168 @manuelcandales @SherlockNoMad @angelayi

Pull Request resolved: #148320
Approved by: https://github.com/eqy, https://github.com/malfet, https://github.com/atalman
(cherry picked from commit 72337bd)

pytorch-bot · 2025-05-06T18:22:20Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/152967

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 4 Pending

As of commit 26f4937 with merge base 924a247:

NEW FAILURES - The following jobs have failed:

pull / cuda12.4-py3.10-gcc9-sm75 / test (pr_time_benchmarks, 1, 1, linux.g4dn.metal.nvidia.gpu) (gh)
ERROR: Installation has failed. Please see the file '/var/log/nvidia
pull / linux-jammy-py3-clang12-executorch / test (executorch, 1, 1, ephemeral.linux.2xlarge) (gh)
Process completed with exit code 1.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

atalman · 2025-05-28T15:09:29Z

Hm, I don't see a large decrease in size here after merging the cherry-pick:

2.7.0 Release

Nightly build seems better:

It looks like the improvement on nightly happened on May 01. It must be one of these changes:
8c7f928

Nightly CUDA arches: 5.0;6.0;7.0;7.5;8.0;8.6,9.0
Release CUDA arches: 5.0;6.0;7.0;7.5;8.0;8.6,9.0

When the original PR landed on May 02 and was included in the May 03 nightly, I see only a slight improvement:
torch-2.8.0.dev20250502+cu126-cp310-cp310-manylinux_2_28_x86_64.whl - 799.7 MB
vs
torch-2.8.0.dev20250503+cu126-cp310-cp310-manylinux_2_28_x86_64.whl - 787.2 MB
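
To put the two nightly wheel sizes above in perspective, the difference can be worked out directly. This is only a minimal sketch: the helper name `size_reduction` is hypothetical, and the only inputs are the two sizes quoted in this comment.

```python
def size_reduction(before_mb: float, after_mb: float) -> tuple[float, float]:
    """Return (absolute saving in MB, saving as a percentage of the original size)."""
    saved = before_mb - after_mb
    return saved, 100.0 * saved / before_mb


# Wheel sizes quoted above for the May 02 and May 03 cu126 nightlies.
saved_mb, saved_pct = size_reduction(799.7, 787.2)
print(f"saved {saved_mb:.1f} MB ({saved_pct:.2f}%)")
```

That is a saving of about 12.5 MB, or roughly 1.6% of the wheel, which matches the "slight improvement" described here.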

ngimel · 2025-05-28T17:58:32Z

Should we revert this PR then, if it doesn't improve build time/binary size? It adds considerable complexity, and we shouldn't add complexity for nothing.

pytorchbot requested review from eqy and syed-ahmed as code owners May 6, 2025 18:22

This was referenced May 6, 2025

[v2.7.1] Release Tracker #152627

Closed

[ATen][CUDA] Optimize 128 bit vectorization #148320

Closed

pytorch-bot bot added the release notes: cuda release notes category label May 6, 2025

pytorchbot added the open source label May 6, 2025

malfet approved these changes May 7, 2025

malfet merged commit 24b0c4a into release/2.7 May 7, 2025
179 of 187 checks passed
