[ROCm] Improvements for vectorized elementwise kernels by jerrymannil · Pull Request #143269 · pytorch/pytorch · GitHub

[ROCm] Improvements for vectorized elementwise kernels #143269


Closed · wants to merge 13 commits into from

Conversation

@jerrymannil (Contributor) commented Dec 15, 2024
  • Make the io_size calculation the minimum of the input and output element sizes, rather than the summation of all sizes (see the sketch below this list)
    • e.g., for torch.add() on half dtypes (bfloat16/float16), calc_io_size() returns 6, causing elems_per_thread to be 4
    • But elems_per_thread = 8 works better on half dtypes for AMD GPUs
  • Enable the *_load_dwordx4 ISA for 16-bit and 8-bit dtypes on AMD GPUs by using vector sizes of 8 and 16, respectively
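
To make the intent concrete, here is a minimal standalone sketch of the min-versus-sum io_size calculation and the resulting per-thread element counts. Only calc_io_size() and elems_per_thread are names taken from this PR; the helper signatures, the operand-size arguments, and the exact io_size-to-elems_per_thread mapping below are illustrative assumptions, not the merged ATen implementation.

```cpp
// Hypothetical, self-contained sketch -- not the merged ATen code.
// It contrasts the old sum-based io_size with the new min-based one and
// shows how a 16-byte-per-thread target maps to the vector sizes above.
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

// Previous behaviour (per the description): sum the element sizes of all
// operands. torch.add on half dtypes: 2 B + 2 B inputs + 2 B output = 6.
int calc_io_size_sum(const std::vector<int>& in_sizes,
                     const std::vector<int>& out_sizes) {
  return std::accumulate(in_sizes.begin(), in_sizes.end(), 0) +
         std::accumulate(out_sizes.begin(), out_sizes.end(), 0);
}

// This PR (ROCm path): take the minimum element size across inputs and
// outputs instead. Same torch.add example -> 2.
int calc_io_size_min(const std::vector<int>& in_sizes,
                     const std::vector<int>& out_sizes) {
  int in  = *std::min_element(in_sizes.begin(), in_sizes.end());
  int out = *std::min_element(out_sizes.begin(), out_sizes.end());
  return std::min(in, out);
}

// Illustrative mapping: choose elems_per_thread so each thread moves 16 bytes,
// which is what allows the compiler to emit *_load_dwordx4 / *_store_dwordx4.
int elems_per_thread_rocm(int io_size) {
  switch (io_size) {
    case 1:  return 16;  // 8-bit dtypes:  16 x 1 B = 16 B per thread
    case 2:  return 8;   // 16-bit dtypes:  8 x 2 B = 16 B per thread
    default: return 4;   // wider dtypes keep the smaller per-thread count
  }
}

int main() {
  // bfloat16/float16 torch.add: two 2-byte inputs, one 2-byte output.
  std::vector<int> in_sizes{2, 2}, out_sizes{2};
  int io_sum = calc_io_size_sum(in_sizes, out_sizes);
  int io_min = calc_io_size_min(in_sizes, out_sizes);
  std::printf("sum-based io_size = %d (old behaviour, elems_per_thread = 4)\n", io_sum);
  std::printf("min-based io_size = %d -> elems_per_thread = %d\n",
              io_min, elems_per_thread_rocm(io_min));
  return 0;
}
```

Compiled as plain C++, this prints io_size 6 (sum) versus 2 (min) for the half-precision add example, matching the numbers in the bullets above; in the actual vectorized loops the final choice also depends on pointer alignment and the vector sizes the kernel supports.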

Co-author: @akadutta

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd

pytorch-bot (bot) commented Dec 15, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/143269

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (3 Unrelated Failures)

As of commit b779f2b with merge base 95b41d2:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pruthvistony added the topic: not user facing, ciflow/periodic, rocm, ciflow/unstable, ciflow/rocm, and ciflow/inductor-rocm labels on Dec 15, 2024
@facebook-github-bot (Contributor)

@Mellonta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@pruthvistony (Collaborator)

@Mellonta, can you please give an update on the status of the internal build?

@Mellonta (Contributor)

@Mellonta, can you please give an update on the status of the internal build?

Our internal builds include all tests in OSS CI. Could you please fix them?

@pytorchmergebot (Collaborator)

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot (Collaborator)

Rebase failed due to Command git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/viable/strict pull/143269/head returned non-zero exit code 1

Rebasing (1/10)
Rebasing (2/10)
Auto-merging aten/src/ATen/cuda/jiterator.cu
Auto-merging aten/src/ATen/native/cuda/CUDAJitLoops.cuh
CONFLICT (content): Merge conflict in aten/src/ATen/native/cuda/CUDAJitLoops.cuh
error: could not apply 9d57ba757f9... Add support for thread_work_size and vec_size of 8 and 16
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Could not apply 9d57ba757f9... Add support for thread_work_size and vec_size of 8 and 16

Raised by https://github.com/pytorch/pytorch/actions/runs/12540729651

@jerrymannil (Contributor, Author)

@pytorchbot rebase

@pytorchmergebot (Collaborator)

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot (Collaborator)

Rebase failed due to Command git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/viable/strict pull/143269/head returned non-zero exit code 1

Rebasing (1/15)
Rebasing (2/15)
Auto-merging aten/src/ATen/cuda/jiterator.cu
Auto-merging aten/src/ATen/native/cuda/CUDAJitLoops.cuh
CONFLICT (content): Merge conflict in aten/src/ATen/native/cuda/CUDAJitLoops.cuh
error: could not apply 9d57ba757f9... Add support for thread_work_size and vec_size of 8 and 16
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Could not apply 9d57ba757f9... Add support for thread_work_size and vec_size of 8 and 16

Raised by https://github.com/pytorch/pytorch/actions/runs/12754592663

@jerrymannil (Contributor, Author)

@Mellonta, can you import the updated PR?

@jerrymannil (Contributor, Author)

@pytorchbot rebase

@pytorchmergebot (Collaborator)

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot (Collaborator)

Rebase failed due to Command git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/viable/strict pull/143269/head returned non-zero exit code 1

Rebasing (1/10)
Rebasing (2/10)
Auto-merging aten/src/ATen/cuda/jiterator.cu
Auto-merging aten/src/ATen/native/cuda/CUDAJitLoops.cuh
CONFLICT (content): Merge conflict in aten/src/ATen/native/cuda/CUDAJitLoops.cuh
error: could not apply 9d57ba757f... Add support for thread_work_size and vec_size of 8 and 16
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Could not apply 9d57ba757f... Add support for thread_work_size and vec_size of 8 and 16

Raised by https://github.com/pytorch/pytorch/actions/runs/12760808972

@jerrymannil (Contributor, Author)

@Mellonta, the CI is passing now, except for two flaky tests. Can you import the PR again?

@facebook-github-bot (Contributor)

@Mellonta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@jerrymannil (Contributor, Author)

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: Check the merge workflow status here.

pruthvistony added a commit to ROCm/pytorch that referenced this pull request Jan 31, 2025
… (#1874)

* Make io_size calculation the minimum of the input and output element sizes, rather than the summation of all sizes
  * e.g., for torch.add() on half dtypes (bfloat16/float16), calc_io_size() returns 6, causing elems_per_thread to be 4
  * But elems_per_thread = 8 works better on half dtypes for AMD GPUs
* Enable *_load_dwordx4 ISA for 16-bit and 8-bit dtypes on AMD GPUs by using vector sizes of 8 and 16, respectively

Co-author: @akadutta

Pull Request resolved: pytorch#143269
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony

Co-authored-by: Pruthvi Madugundu <pruthvigithub@gmail.com>
jerrymannil added a commit to ROCm/pytorch that referenced this pull request Feb 7, 2025
* Make io_size calculation the minimum of the input and output element sizes, rather than the summation of all sizes
  * e.g., for torch.add() on half dtypes (bfloat16/float16), calc_io_size() returns 6, causing elems_per_thread to be 4
  * But elems_per_thread = 8 works better on half dtypes for AMD GPUs
* Enable *_load_dwordx4 ISA for 16-bit and 8-bit dtypes on AMD GPUs by using vector sizes of 8 and 16, respectively

Co-author: @akadutta

Pull Request resolved: pytorch#143269
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony

Co-authored-by: Pruthvi Madugundu <pruthvigithub@gmail.com>
pruthvistony added a commit to ROCm/pytorch that referenced this pull request Feb 21, 2025
… (#1924)

* Make io_size calculation the minimum of the input and output element sizes, rather than the summation of all sizes
  * e.g., for torch.add() on half dtypes (bfloat16/float16), calc_io_size() returns 6, causing elems_per_thread to be 4
  * But elems_per_thread = 8 works better on half dtypes for AMD GPUs
* Enable *_load_dwordx4 ISA for 16-bit and 8-bit dtypes on AMD GPUs by using vector sizes of 8 and 16, respectively

Co-author: @akadutta

Pull Request resolved: pytorch#143269
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony

Co-authored-by: Pruthvi Madugundu <pruthvigithub@gmail.com>
dnikolaev-amd pushed a commit to ROCm/pytorch that referenced this pull request Apr 17, 2025
… (#1874)

* Make io_size calculation the minimum of the input and output element sizes, rather than the summation of all sizes
  * e.g., for torch.add() on half dtypes (bfloat16/float16), calc_io_size() returns 6, causing elems_per_thread to be 4
  * But elems_per_thread = 8 works better on half dtypes for AMD GPUs
* Enable *_load_dwordx4 ISA for 16-bit and 8-bit dtypes on AMD GPUs by using vector sizes of 8 and 16, respectively

Co-author: @akadutta

Pull Request resolved: pytorch#143269
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony

Co-authored-by: Pruthvi Madugundu <pruthvigithub@gmail.com>
(cherry picked from commit 4686828)
dnikolaev-amd pushed a commit to ROCm/pytorch that referenced this pull request Apr 24, 2025
… (#1874)

* Make io_size calculation the minimum of the input and output element sizes, rather than the summation of all sizes
  * e.g., for torch.add() on half dtypes (bfloat16/float16), calc_io_size() returns 6, causing elems_per_thread to be 4
  * But elems_per_thread = 8 works better on half dtypes for AMD GPUs
* Enable *_load_dwordx4 ISA for 16-bit and 8-bit dtypes on AMD GPUs by using vector sizes of 8 and 16, respectively

Co-author: @akadutta

Pull Request resolved: pytorch#143269
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony

Co-authored-by: Pruthvi Madugundu <pruthvigithub@gmail.com>
(cherry picked from commit 4686828)
dnikolaev-amd pushed a commit to ROCm/pytorch that referenced this pull request Apr 24, 2025
… (#1874)

* Make io_size calculation the minimum of the input and output element sizes, rather than the summation of all sizes
  * e.g., for torch.add() on half dtypes (bfloat16/float16), calc_io_size() returns 6, causing elems_per_thread to be 4
  * But elems_per_thread = 8 works better on half dtypes for AMD GPUs
* Enable *_load_dwordx4 ISA for 16-bit and 8-bit dtypes on AMD GPUs by using vector sizes of 8 and 16, respectively

Co-author: @akadutta

Pull Request resolved: pytorch#143269
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony

Co-authored-by: Pruthvi Madugundu <pruthvigithub@gmail.com>
(cherry picked from commit 4686828)
Labels
ciflow/inductor-rocm · ciflow/periodic · ciflow/rocm · ciflow/trunk · ciflow/unstable · Merged · module: rocm · open source · rocm · topic: not user facing · triaged