[ROCm] CK Flash Attention Backend #143695
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/143695
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure as of commit 9f5531f with merge base bb5e439. The following job has failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@xw285cornell has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@jithunnair-amd the CUDA build keeps timing out, do you know what's going on?
A 128-thread build on a ROCm 6.3 stack on an EPYC Milan system that is otherwise idle:
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
…e during git clone due to long file paths, e.g. https://github.com/pytorch/test-infra/actions/runs/12376044163/job/34542289317#step:6:447
Successfully rebased. Force-pushed from b94f160 to 223aa3d.
@albanD any chance you can give an exception to this PR? It adds the SDPA instances into the codebase (we take a similar approach for NVIDIA's flash attention), and we'll move to a pre-built binary rather than building from source (for OSS) in the near future.
@xw285cornell has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
This PR makes me sad, but it is probably fine as a temporary solution.
@pytorchbot merge -i (Initiating merge automatically since the Phabricator diff has merged; merging with -i because OSS signals were bypassed internally)
Merge started. Your change will be merged while ignoring the following 1 check: Lint / pr-sanity-checks. Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Me too. I thought this PR included a new CK-based FA implementation for Navi cards. Even getting the old howiejay implementation in here would have been nice.
Replaces #138947 for re-import.
Replaces ROCm#1592
This PR contains the initial implementation of SDPA with the composable_kernel (CK) backend. The CK path can be forced by calling torch.backends.cuda.preferred_rocm_fa_library("ck"); similarly, you can force the incumbent aotriton implementation by passing "aotriton" or "default". As you'd expect, leaving this option unset results in aotriton being used as the backend. With CK selected, if PyTorch deems flash attention usable, it will use the CK path in all the same places aotriton would have been used. This PR makes no changes to the heuristics that select which attention scheme to use (i.e. flash attention vs. memory-efficient attention vs. math, etc.); the CK path is only invoked when flash attention is both enabled (via USE_FLASH_ATTENTION) and selected at runtime by the existing heuristics.
Files located in pytorch/aten/src/ATen/native/transformers/hip/flash_attn/ck/mha* have been pulled from https://github.com/Dao-AILab/flash-attention, courtesy of @tridao's hard work; he is credited as a co-author.
NOTE: In order to use this backend, the user MUST set USE_CK_FLASH_ATTENTION=1 in their environment when they build PyTorch.
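For context, a minimal sketch of how the runtime selection described above is expected to be used (the tensor shapes and call sequence here are illustrative, not taken from this PR's tests):

```python
# Sketch only: assumes a ROCm build of PyTorch compiled with
# USE_FLASH_ATTENTION=1 and USE_CK_FLASH_ATTENTION=1, running on a supported GPU.
import torch
import torch.nn.functional as F

# Force the CK flash-attention path; "aotriton" or "default" restores the incumbent backend.
torch.backends.cuda.preferred_rocm_fa_library("ck")

# Illustrative shapes: (batch, heads, seq_len, head_dim) in fp16.
q, k, v = (torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16) for _ in range(3))

# The existing SDPA heuristics still choose the attention scheme; CK is used only
# where flash attention would previously have dispatched to aotriton.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 1024, 64])
```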
cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd @albanD @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames