[ATen][CUDA] Optimize 128 bit vectorization by pytorchbot · Pull Request #152967 · pytorch/pytorch · GitHub

[ATen][CUDA] Optimize 128 bit vectorization #152967

Merged: 1 commit, May 7, 2025

pytorchbot
Collaborator

Fixes #147376.
As per request: #145746 (review)
This PR excludes sm80 and older architectures from using vec8 kernels because of their long compilation time and large binary size.

cc @ptrblck @msaroufim @eqy @jerryzh168 @manuelcandales @SherlockNoMad @angelayi

Pull Request resolved: #148320
Approved by: https://github.com/eqy, https://github.com/malfet, https://github.com/atalman

(cherry picked from commit 72337bd)
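
The gating described in this PR can be pictured with a small model. This is a minimal sketch, not the PR's actual code: the helper `pick_vec_size`, the 4-element fallback cap, and the reading of "sm80 or older" as compute capability <= 8.0 are all assumptions made for illustration. It shows why a 128-bit access corresponds to 8 elements for 2-byte types, and how an architecture check could fall back to a narrower width.

```python
VECTOR_BITS = 128        # width of one vectorized memory access
GATED_MAX_SM = 80        # assumption: "sm80 or older" means capability <= 8.0
FALLBACK_MAX_VEC = 4     # assumption: cap used where vec8 is not compiled


def pick_vec_size(sm: int, elem_bytes: int) -> int:
    """Hypothetical helper: elements per thread per vectorized access."""
    if elem_bytes <= 0:
        raise ValueError("elem_bytes must be positive")
    # Elements that fit in one 128-bit access, e.g. 16 // 2 = 8 for float16.
    full_width = max(1, (VECTOR_BITS // 8) // elem_bytes)
    # Gated architectures never get more than the fallback width.
    cap = FALLBACK_MAX_VEC if sm <= GATED_MAX_SM else 8
    return min(full_width, cap)


if __name__ == "__main__":
    for sm in (75, 80, 90):
        print(f"sm{sm}: float16 -> vec{pick_vec_size(sm, 2)}, "
              f"float32 -> vec{pick_vec_size(sm, 4)}")
```

Under these assumptions only the newer targets would instantiate the vec8 variant, so fewer kernels are compiled for the older architectures, which is the compile-time and binary-size saving the PR description is after.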
pytorch-bot bot commented May 6, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/152967

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 4 Pending

As of commit 26f4937 with merge base 924a247 (image):

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot bot added the "release notes: cuda" label on May 6, 2025
malfet merged commit 24b0c4a into release/2.7 on May 7, 2025
179 of 187 checks passed
Contributor
atalman commented May 28, 2025

Hm, I don't see a large decrease in size here after merging the cherry-pick:
[image: Screenshot 2025-05-28 at 11 06 24 AM]

2.7.0 Release:
[image: Screenshot 2025-05-28 at 11 13 13 AM]

The nightly build seems better:
[image: Screenshot 2025-05-28 at 11 09 04 AM]

It looks like the improvement on nightly happened on May 01. It must be one of these changes:
8c7f928

Nightly CUDA arches: 5.0;6.0;7.0;7.5;8.0;8.6;9.0
Release CUDA arches: 5.0;6.0;7.0;7.5;8.0;8.6;9.0

When the original PR landed on May 02 and was included in the May 03 nightly, I see only a slight improvement:
torch-2.8.0.dev20250502+cu126-cp310-cp310-manylinux_2_28_x86_64.whl - 799.7 MB
vs
torch-2.8.0.dev20250503+cu126-cp310-cp310-manylinux_2_28_x86_64.whl - 787.2 MB
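
To put the two wheel sizes in perspective, the difference can be worked out directly. A quick sketch using only the two sizes quoted above (the variable names are illustrative):

```python
before_mb = 799.7   # torch-2.8.0.dev20250502+cu126 wheel
after_mb = 787.2    # torch-2.8.0.dev20250503+cu126 wheel

saved_mb = before_mb - after_mb
saved_pct = 100.0 * saved_mb / before_mb

print(f"saved {saved_mb:.1f} MB ({saved_pct:.2f}%)")
```

That is a reduction of about 12.5 MB, or roughly 1.6% of the wheel, which matches the "slight improvement" described here.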

Collaborator
ngimel commented May 28, 2025

Should we revert this PR then, if it doesn't improve build time/binary size? It adds considerable complexity, and we shouldn't add complexity for nothing.
