[ROCm] Improvements for vectorized elementwise kernels by jerrymannil · Pull Request #143269 · pytorch/pytorch · GitHub

[ROCm] Improvements for vectorized elementwise kernels #143269


Closed · wants to merge 13 commits into from

Conversation

@jerrymannil (Contributor) commented Dec 15, 2024
  • Make the io_size calculation the minimum of the input and output element sizes, rather than the summation of all sizes (see the sketch below this list)
    • e.g., for torch.add() on half dtypes (bfloat16/float16), calc_io_size() returns 6, causing elems_per_thread to be 4
    • But elems_per_thread = 8 works better on half dtypes for AMD GPUs
  • Enable the *_load_dwordx4 ISA for 16-bit and 8-bit dtypes on AMD GPUs by using vector sizes of 8 and 16, respectively
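
To make the intent concrete, here is a minimal standalone sketch of the min-versus-sum io_size calculation and the resulting per-thread element counts. Only calc_io_size() and elems_per_thread are names taken from this PR; the helper signatures, the operand-size arguments, and the exact io_size-to-elems_per_thread mapping below are illustrative assumptions, not the merged ATen implementation.

```cpp
// Hypothetical, self-contained sketch -- not the merged ATen code.
// It contrasts the old sum-based io_size with the new min-based one and
// shows how a 16-byte-per-thread target maps to the vector sizes above.
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

// Previous behaviour (per the description): sum the element sizes of all
// operands. torch.add on half dtypes: 2 B + 2 B inputs + 2 B output = 6.
int calc_io_size_sum(const std::vector<int>& in_sizes,
                     const std::vector<int>& out_sizes) {
  return std::accumulate(in_sizes.begin(), in_sizes.end(), 0) +
         std::accumulate(out_sizes.begin(), out_sizes.end(), 0);
}

// This PR (ROCm path): take the minimum element size across inputs and
// outputs instead. Same torch.add example -> 2.
int calc_io_size_min(const std::vector<int>& in_sizes,
                     const std::vector<int>& out_sizes) {
  int in  = *std::min_element(in_sizes.begin(), in_sizes.end());
  int out = *std::min_element(out_sizes.begin(), out_sizes.end());
  return std::min(in, out);
}

// Illustrative mapping: choose elems_per_thread so each thread moves 16 bytes,
// which is what allows the compiler to emit *_load_dwordx4 / *_store_dwordx4.
int elems_per_thread_rocm(int io_size) {
  switch (io_size) {
    case 1:  return 16;  // 8-bit dtypes:  16 x 1 B = 16 B per thread
    case 2:  return 8;   // 16-bit dtypes:  8 x 2 B = 16 B per thread
    default: return 4;   // wider dtypes keep the smaller per-thread count
  }
}

int main() {
  // bfloat16/float16 torch.add: two 2-byte inputs, one 2-byte output.
  std::vector<int> in_sizes{2, 2}, out_sizes{2};
  int io_sum = calc_io_size_sum(in_sizes, out_sizes);
  int io_min = calc_io_size_min(in_sizes, out_sizes);
  std::printf("sum-based io_size = %d (old behaviour, elems_per_thread = 4)\n", io_sum);
  std::printf("min-based io_size = %d -> elems_per_thread = %d\n",
              io_min, elems_per_thread_rocm(io_min));
  return 0;
}
```

Compiled as plain C++, this prints io_size 6 (sum) versus 2 (min) for the half-precision add example, matching the numbers in the bullets above; in the actual vectorized loops the final choice also depends on pointer alignment and the vector sizes the kernel supports.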

Co-author: @akadutta

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd

pytorch-bot (bot) commented Dec 15, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/143269

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (3 Unrelated Failures)

As of commit b779f2b with merge base 95b41d2:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pruthvistony added the topic: not user facing, ciflow/periodic, rocm, ciflow/unstable, ciflow/rocm, and ciflow/inductor-rocm labels on Dec 15, 2024
@facebook-github-bot (Contributor)

@Mellonta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@pruthvistony (Collaborator)

@Mellonta, can you please give an update on the status of the internal build?

@Mellonta (Contributor)

@Mellonta, can you please give an update on the status of the internal build?

Our internal builds include all tests in OSS CI. Could you please fix them?

@pytorchmergebot (Collaborator)

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot (Collaborator)

Rebase failed due to Command git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/viable/strict pull/143269/head returned non-zero exit code 1

Rebasing (1/10)
Rebasing (2/10)
Auto-merging aten/src/ATen/cuda/jiterator.cu
Auto-merging aten/src/ATen/native/cuda/CUDAJitLoops.cuh
CONFLICT (content): Merge conflict in aten/src/ATen/native/cuda/CUDAJitLoops.cuh
error: could not apply 9d57ba757f9... Add support for thread_work_size and vec_size of 8 and 16
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Could not apply 9d57ba757f9... Add support for thread_work_size and vec_size of 8 and 16

Raised by https://github.com/pytorch/pytorch/actions/runs/12540729651

@jerrymannil (Contributor, Author)

@pytorchbot rebase

@pytorchmergebot (Collaborator)

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot (Collaborator)

Rebase failed due to Command git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/viable/strict pull/143269/head returned non-zero exit code 1

Rebasing (1/15)
Rebasing (2/15)
Auto-merging aten/src/ATen/cuda/jiterator.cu
Auto-merging aten/src/ATen/native/cuda/CUDAJitLoops.cuh
CONFLICT (content): Merge conflict in aten/src/ATen/native/cuda/CUDAJitLoops.cuh
error: could not apply 9d57ba757f9... Add support for thread_work_size and vec_size of 8 and 16
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Could not apply 9d57ba757f9... Add support for thread_work_size and vec_size of 8 and 16

Raised by https://github.com/pytorch/pytorch/actions/runs/12754592663

@jerrymannil (Contributor, Author)

@Mellonta, can you import the updated PR?

@jerrymannil (Contributor, Author)

@pytorchbot rebase

@pytorchmergebot (Collaborator)

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot (Collaborator)

Rebase failed due to Command git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/viable/strict pull/143269/head returned non-zero exit code 1

Rebasing (1/10)
Rebasing (2/10)
Auto-merging aten/src/ATen/cuda/jiterator.cu
Auto-merging aten/src/ATen/native/cuda/CUDAJitLoops.cuh
CONFLICT (content): Merge conflict in aten/src/ATen/native/cuda/CUDAJitLoops.cuh
error: could not apply 9d57ba757f... Add support for thread_work_size and vec_size of 8 and 16
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Could not apply 9d57ba757f... Add support for thread_work_size and vec_size of 8 and 16

Raised by https://github.com/pytorch/pytorch/actions/runs/12760808972

@jerrymannil (Contributor, Author)

@Mellonta, the CI is passing now, except for two flaky tests. Can you import the PR again?

@facebook-github-bot (Contributor)

@Mellonta has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@jerrymannil (Contributor, Author)

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: Check the merge workflow status here.

pruthvistony added a commit to ROCm/pytorch that referenced this pull request Jan 31, 2025
… (#1874)

* Make io_size calculation the minimum of the input and output element sizes, rather than the summation of all sizes
  * e.g., for torch.add() on half dtypes (bfloat16/float16), calc_io_size() returns 6, causing elems_per_thread to be 4
  * But elems_per_thread = 8 works better on half dtypes for AMD GPUs
* Enable *_load_dwordx4 ISA for 16-bit and 8-bit dtypes on AMD GPUs by using vector sizes of 8 and 16, respectively

Co-author: @akadutta

Pull Request resolved: pytorch#143269
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony

Co-authored-by: Pruthvi Madugundu <pruthvigithub@gmail.com>
jerrymannil added a commit to ROCm/pytorch that referenced this pull request Feb 7, 2025
* Make io_size calculation the minimum of the input and output element sizes, rather than the summation of all sizes
  * e.g., for torch.add() on half dtypes (bfloat16/float16), calc_io_size() returns 6, causing elems_per_thread to be 4
  * But elems_per_thread = 8 works better on half dtypes for AMD GPUs
* Enable *_load_dwordx4 ISA for 16-bit and 8-bit dtypes on AMD GPUs by using vector sizes of 8 and 16, respectively

Co-author: @akadutta

Pull Request resolved: pytorch#143269
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony

Co-authored-by: Pruthvi Madugundu <pruthvigithub@gmail.com>
pruthvistony added a commit to ROCm/pytorch that referenced this pull request Feb 21, 2025
… (#1924)

* Make io_size calculation the minimum of the input and output element sizes, rather than the summation of all sizes
  * e.g., for torch.add() on half dtypes (bfloat16/float16), calc_io_size() returns 6, causing elems_per_thread to be 4
  * But elems_per_thread = 8 works better on half dtypes for AMD GPUs
* Enable *_load_dwordx4 ISA for 16-bit and 8-bit dtypes on AMD GPUs by using vector sizes of 8 and 16, respectively

Co-author: @akadutta

Pull Request resolved: pytorch#143269
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony

Co-authored-by: Pruthvi Madugundu <pruthvigithub@gmail.com>
dnikolaev-amd pushed a commit to ROCm/pytorch that referenced this pull request Apr 17, 2025
… (#1874)

* Make io_size calculation the minimum of the input and output element sizes, rather than the summation of all sizes
  * e.g., for torch.add() on half dtypes (bfloat16/float16), calc_io_size() returns 6, causing elems_per_thread to be 4
  * But elems_per_thread = 8 works better on half dtypes for AMD GPUs
* Enable *_load_dwordx4 ISA for 16-bit and 8-bit dtypes on AMD GPUs by using vector sizes of 8 and 16, respectively

Co-author: @akadutta

Pull Request resolved: pytorch#143269
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony

Co-authored-by: Pruthvi Madugundu <pruthvigithub@gmail.com>
(cherry picked from commit 4686828)
dnikolaev-amd pushed a commit to ROCm/pytorch that referenced this pull request Apr 24, 2025
… (#1874)

* Make io_size calculation the minimum of the input and output element sizes, rather than the summation of all sizes
  * e.g., for torch.add() on half dtypes (bfloat16/float16), calc_io_size() returns 6, causing elems_per_thread to be 4
  * But elems_per_thread = 8 works better on half dtypes for AMD GPUs
* Enable *_load_dwordx4 ISA for 16-bit and 8-bit dtypes on AMD GPUs by using vector sizes of 8 and 16, respectively

Co-author: @akadutta

Pull Request resolved: pytorch#143269
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony

Co-authored-by: Pruthvi Madugundu <pruthvigithub@gmail.com>
(cherry picked from commit 4686828)
dnikolaev-amd pushed a commit to ROCm/pytorch that referenced this pull request Apr 24, 2025
… (#1874)

* Make io_size calculation the minimum of the input and output element sizes, rather than the summation of all sizes
  * e.g., for torch.add() on half dtypes (bfloat16/float16), calc_io_size() returns 6, causing elems_per_thread to be 4
  * But elems_per_thread = 8 works better on half dtypes for AMD GPUs
* Enable *_load_dwordx4 ISA for 16-bit and 8-bit dtypes on AMD GPUs by using vector sizes of 8 and 16, respectively

Co-author: @akadutta

Pull Request resolved: pytorch#143269
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony

Co-authored-by: Pruthvi Madugundu <pruthvigithub@gmail.com>
(cherry picked from commit 4686828)
Labels
ciflow/inductor-rocm · ciflow/periodic · ciflow/rocm · ciflow/trunk · ciflow/unstable · Merged · module: rocm · open source · rocm · topic: not user facing · triaged