Fixes #147376. As per the request in #145746 (review), this PR prevents sm80 and older architectures from using the vec8 kernels, due to long compilation times and large binary size.
cc @ptrblck @msaroufim @eqy @jerryzh168 @manuelcandales @SherlockNoMad @angelayi
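For illustration only, here is a minimal standalone sketch of the gating idea described above; the names (`vectorized_copy`, `launch_copy`) and structure are hypothetical and are not the actual PyTorch diff. The host queries the device's compute capability and only launches the 8-element specialization on sm90 and newer.

```cuda
// Hypothetical sketch, not the actual PyTorch code: gate the vec8 launch
// on the device's SM version so that sm80 and older stay on the vec4 path.
#include <cuda_runtime.h>
#include <cstdio>

template <typename scalar_t, int vec_size>
__global__ void vectorized_copy(const scalar_t* __restrict__ src,
                                scalar_t* __restrict__ dst, int n) {
  // Each thread handles vec_size elements (real code uses aligned vector loads).
  int base = (blockIdx.x * blockDim.x + threadIdx.x) * vec_size;
#pragma unroll
  for (int i = 0; i < vec_size; ++i) {
    int idx = base + i;
    if (idx < n) dst[idx] = src[idx];
  }
}

template <typename scalar_t>
void launch_copy(const scalar_t* src, scalar_t* dst, int n) {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, /*device=*/0);
  const int threads = 256;
  if (prop.major >= 9) {  // Hopper (sm90) and newer: 8 elements per thread
    int blocks = (n + threads * 8 - 1) / (threads * 8);
    vectorized_copy<scalar_t, 8><<<blocks, threads>>>(src, dst, n);
  } else {                // sm80 and older: keep 4 elements per thread
    int blocks = (n + threads * 4 - 1) / (threads * 4);
    vectorized_copy<scalar_t, 4><<<blocks, threads>>>(src, dst, n);
  }
}

int main() {
  const int n = 1 << 20;
  float *src, *dst;
  cudaMalloc(&src, n * sizeof(float));
  cudaMalloc(&dst, n * sizeof(float));
  launch_copy(src, dst, n);
  cudaDeviceSynchronize();
  printf("done\n");
  cudaFree(src);
  cudaFree(dst);
  return 0;
}
```

Note that a runtime gate like this still compiles both specializations for every architecture; any compile-time or binary-size savings on older architectures would have to come from not emitting the vec8 body for those architecture passes at all, which is what the `__CUDA_ARCH__` discussion later in the thread is about.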
Note: Links to docs will display an error until the docs builds have been completed.
There is 1 currently active SEV. If your PR is affected, please view it below:
As of commit 00da106 with merge base ce94b21:
Unable to download artifact(s): Unable to download and extract artifact: Artifact download failed after 5 retries.
A task was canceled.
File doesn't exist
👉 Rebase onto the `viable/strict` branch to avoid these failures
inductor/test_cpu_repro.py::CPUReproTests::test_tanh_atan2_use_decompose_tanh
This comment was automatically generated by Dr. CI and updates every 15 minutes.
How much does it increase compilation times? This is a very hot code path and is used for primitives including copying.
Is the original PR description wrong? This won't affect compile times since it's not constexpr.
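As an aside on the constexpr remark, here is a minimal standalone sketch (hypothetical names, not the PR's code) of why a plain runtime branch over the vector size does not save compile time, while a compile-time guard can prevent a body from being generated at all.

```cpp
// Hypothetical sketch: runtime vs compile-time selection of a vectorized path.
#include <cstdio>

template <int vec_size>
void copy_impl(const float* src, float* dst, int n) {
  for (int i = 0; i + vec_size <= n; i += vec_size)
    for (int j = 0; j < vec_size; ++j)
      dst[i + j] = src[i + j];
}

void copy_rt(const float* src, float* dst, int n, bool use_vec8) {
  if (use_vec8)                 // runtime check: copy_impl<8> is still
    copy_impl<8>(src, dst, n);  // instantiated and compiled regardless
  else
    copy_impl<4>(src, dst, n);
}

template <bool allow_vec8>
void copy_ct(const float* src, float* dst, int n) {
  if constexpr (allow_vec8)     // compile-time check: the <8> body is not
    copy_impl<8>(src, dst, n);  // instantiated when allow_vec8 is false
  else
    copy_impl<4>(src, dst, n);
}

int main() {
  float a[16] = {}, b[16] = {};
  copy_rt(a, b, 16, false);
  copy_ct<false>(a, b, 16);
  printf("ok\n");
  return 0;
}
```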
After examining where this call is used, if I am not mistaken, it's only used for the vectorized copy of fp8 and lower dtypes. That is the reason why.
Why would this decrease/increase compilation time? Wouldn't those float8 kernels just not be built on lower SMs? Except SM89?
Do we actually not want to allow this on FP8-supporting consumer GPUs like SM89? We may want to update the vec8 kernels if we are not compiling them on SM89.
> How much does it increase compilation times?
I would like to ask @atalman about this.
> it's only used for vectorized copy of fp8 and lower dtypes
Yes, it is for float16, bfloat16 and fp8.
@Aidyn-A Maybe I am mistaken, but I thought float16 and bfloat16 had their own NVIDIA primitives and did not go through this code path.
In general, TensorIterator is used for most elementwise ops (relu, exp, log, sigmoid, sin, cos, etc.) on all dtypes. Though, not all ops go through TensorIterator.
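For context, a rough sketch of that pattern as it appears inside ATen (illustrative only: the op name is made up, include paths are approximate, and this builds only inside the PyTorch source tree). An elementwise CUDA op builds a TensorIterator and hands a lambda to `gpu_kernel`, which is the code path where the vectorized (vec4/vec8) load/store width discussed in this PR is chosen.

```cpp
// Illustrative only: the usual shape of an elementwise CUDA kernel in ATen.
// "my_unary_op" is a hypothetical name; include paths are approximate.
#include <ATen/Dispatch.h>
#include <ATen/TensorIterator.h>
#include <ATen/native/cuda/Loops.cuh>

namespace at::native {

void my_unary_op_kernel_cuda(TensorIteratorBase& iter) {
  AT_DISPATCH_FLOATING_TYPES_AND2(
      at::ScalarType::Half, at::ScalarType::BFloat16,
      iter.common_dtype(), "my_unary_op_cuda", [&]() {
        // gpu_kernel dispatches the elementwise functor over the iterator;
        // this is where the vectorized load/store path is selected.
        gpu_kernel(iter, [] GPU_LAMBDA(scalar_t a) -> scalar_t {
          return a + scalar_t(1);
        });
      });
}

} // namespace at::native
```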
@Aidyn-A triggered the Windows builds, let's see if there is an improvement.
Hi @Aidyn-A, I don't see an improvement over the nightly build time (Build PyTorch Binary):
This PR: https://github.com/pytorch/pytorch/actions/runs/13677007625/job/38239672663?pr=148320
Nightly https://github.com/pytorch/pytorch/actions/runs/13759231430/job/38471718241
Roughly, CUDA 11.8 - CUDA 12.6: 3h40m-3h55m, and CUDA 12.8: 4h20m.
This is a run of #147455: https://github.com/pytorch/pytorch/actions/runs/13413313017/job/37468143832?pr=147455 (CUDA 11.8 - CUDA 12.6: 3h20m-3h45m).
Ah, I see. This is where the build time is changed. Please allow SM89 devices too if they support them.
Just don't instantiate this kernel with `vec_size == 8` on older arches; don't add two similar code blocks to the kernel itself.
here, right?
pytorch/aten/src/ATen/native/cuda/CUDALoops.cuh (line 220 in 0bd2caa)
I wish I could do that; the issue is that `__CUDA_ARCH__` is available only inside kernels (device code). The place @eqy pointed to is host-side code.
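To illustrate the constraint (a minimal standalone sketch, not PyTorch code): `__CUDA_ARCH__` is defined only during the device compilation passes, so it cannot drive a host-side launch decision; on the host the architecture is only known at runtime, for example via `cudaGetDeviceProperties`.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void which_arch() {
#if defined(__CUDA_ARCH__)
  // Device pass: __CUDA_ARCH__ is e.g. 800 for sm80, 900 for sm90, so the
  // preprocessor can strip code per architecture *inside* the kernel body.
  printf("compiled for __CUDA_ARCH__ = %d\n", __CUDA_ARCH__);
#endif
}

int main() {
#if defined(__CUDA_ARCH__)
  // Never taken: __CUDA_ARCH__ is not defined in host code at all.
#endif
  // Host side: the architecture is only known at runtime, via the driver.
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);
  printf("running on sm%d%d\n", prop.major, prop.minor);
  which_arch<<<1, 1>>>();
  cudaDeviceSynchronize();
  return 0;
}
```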
Are the binary size savings really worth it? This adds considerable complexity and maintenance burden to the kernel.
@ngimel this was raised because it was adding many (dozens of?) minutes to the Windows build time, not due to binary size concerns.
ea89139
4e0accc
2b243a5
b0c21f7
lgtm
ROCm and CPU test failures are unrelated.
@pytorchbot merge -i
Your change will be merged while ignoring the following 5 checks:
- periodic / linux-focal-rocm-py3.10 / test (distributed, 2, 3, linux.rocm.gpu.4, module:rocm, oncall:distributed)
- periodic / linux-focal-rocm-py3.10 / test (distributed, 1, 3, linux.rocm.gpu.4, module:rocm, oncall:distributed)
- periodic / linux-focal-rocm-py3.10 / test (distributed, 3, 3, linux.rocm.gpu.4, module:rocm, oncall:distributed)
- periodic / linux-focal-cuda12.6-py3-gcc11-slow-gradcheck / test (default, 3, 8, ephemeral.linux.g5.4xlarge.nvidia.gpu, module:slowgradcheck)
- s390x-periodic / linux-manylinux-2_28-py3-cpu-s390x / test (default, 9, 10, linux.s390x)
Learn more about merging in the wiki.
Questions? Feedback? Please reach out to the PyTorch DevX Team
Reason: 1 job has failed: trunk / linux-focal-rocm-py3.10 / test (default, 1, 2, linux.rocm.gpu.2)
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here
check the sm version
c68d7e5
Successfully rebased cuda_vectorize_for_sm90+ onto refs/remotes/origin/viable/strict; please pull locally before adding more changes (for example, via `git checkout cuda_vectorize_for_sm90+ && git pull --rebase`).
Cleanup and fix minor errors
00da106
@pytorchbot merge -f "This looks fine"
Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f only as a last resort and instead consider -i/--ignore-current to continue the merge while ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.
72337bd
@pytorchbot cherry-pick --onto release/2.7 -c regression
[ATen][CUDA] Optimize 128 bit vectorization (#148320)
26f4937
Fixes #147376. As per request: #145746 (review) This PR omits sm80 or older of using vec8 kernels due to long compilation and large binary size. Pull Request resolved: #148320 Approved by: https://github.com/eqy, https://github.com/malfet, https://github.com/atalman (cherry picked from commit 72337bd)
The cherry-pick PR is at #152967, and it is recommended to link a regression cherry-pick PR with an issue. The following tracker issues are updated:
[ATen][CUDA] Optimize 128 bit vectorization (#152967)
24b0c4a
[ATen][CUDA] Optimize 128 bit vectorization (#148320) Fixes #147376. As per request: #145746 (review) This PR omits sm80 or older of using vec8 kernels due to long compilation and large binary size. Pull Request resolved: #148320 Approved by: https://github.com/eqy, https://github.com/malfet, https://github.com/atalman (cherry picked from commit 72337bd) Co-authored-by: Aidyn-A <31858918+Aidyn-A@users.noreply.github.com>
Skylion007 left review comments
eqy approved these changes
malfet approved these changes
atalman approved these changes
Awaiting requested review from ngimel
Awaiting requested review from syed-ahmed (code owner)