[cuDNN][SDPA] Match query's memory layout ordering for output in cuDNN SDPA #138354
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138354
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (2 Unrelated Failures) As of commit 27360a9 with merge base 2ce2e4d.
BROKEN TRUNK - The following jobs failed but were present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
CC @ngimel @Skylion007
Not a problem for the backport, as we use a much lower version of cuDNN there?
# Summary
Currently we have a `cudnn_order` that says, on H100 with a new enough cuDNN backend (we ship a 9.1 version in OSS), to try to run cuDNN attention first. We have already encountered a few bugs with the release of 2.5:
1. #138529
2. huggingface/diffusers#9704
3. #138354

In light of the above, we are going to make the cuDNN backend opt-in. This can be done easily with the context manager for choosing backends, i.e.:
```Python
from torch.nn.attention import sdpa_kernel, SDPBackend

with sdpa_kernel(SDPBackend.CUDNN_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v)
```
This PR puts the cuDNN backend at the lowest precedence in the backend list, meaning that the Math backend will always be chosen unless disabled (which is done via the context manager).

Cc @atalman

Pull Request resolved: #138522
Approved by: https://github.com/ngimel, https://github.com/eqy, https://github.com/malfet
(cherry picked from commit 9a9a0ab)
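As a usage sketch (assumptions: a Hopper-class GPU, PyTorch 2.5 with the shipped cuDNN 9.x, and illustrative fp16 tolerances), one way to opt in to the cuDNN backend and sanity-check it against the math backend:
```Python
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

q, k, v = (torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
           for _ in range(3))

# Opt in explicitly: with this change, cuDNN attention is no longer tried first.
with sdpa_kernel(SDPBackend.CUDNN_ATTENTION):
    out_cudnn = F.scaled_dot_product_attention(q, k, v)

# Reference result from the math backend.
with sdpa_kernel(SDPBackend.MATH):
    out_math = F.scaled_dot_product_attention(q, k, v)

torch.testing.assert_close(out_cudnn, out_math, atol=2e-3, rtol=2e-3)
```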
Successfully rebased `e2c811b` to `27360a9`.
The potential copy due to
@pytorchmergebot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@StrongerXi how?
I have no clue; I saw it on my PR, which was odd, and then I saw that it's also failing on main (the above link).
…cuDNN SDPA (pytorch#138354)

For pytorch#138340

~~We might consider more sophisticated logic here, but the corresponding logic in other backends doesn't seem to do anything fancy for non-BSHD/BHSD cases: https://github.com/pytorch/pytorch/blob/ea8ea2f33fc65b33dc562f4b0430f8c79eb81d8d/aten/src/ATen/native/transformers/cuda/attention.cu#L1145~~

Ended up going with a more general approach that handles more or less arbitrary layouts.

Pull Request resolved: pytorch#138354
Approved by: https://github.com/drisspg
Okay, the cuDNN upgrade is available in the CUDA 12.6 binaries. Feel free to add the gate in a new PR.
Update the `cuDNN SDPA` meta registration to match the memory layout behavior in #138354.

Pull Request resolved: #148921
Approved by: https://github.com/drisspg, https://github.com/jbschlosser
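To illustrate the idea behind that follow-up (a hedged sketch using a hypothetical custom op named `mylib::sdpa_like`, not PyTorch's actual cuDNN SDPA meta registration): the fake/meta kernel should allocate its output with the same stride ordering the real kernel produces, so that tracing (e.g. under torch.compile) sees consistent layouts.
```Python
import torch
import torch.nn.functional as F
from torch.library import custom_op, register_fake

@custom_op("mylib::sdpa_like", mutates_args=())
def sdpa_like(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # Real kernel: allocate the output in q's stride ordering, then fill it.
    order = sorted(range(q.dim()), key=q.stride, reverse=True)
    out = torch.empty_permuted(q.shape, order, dtype=q.dtype, device=q.device)
    out.copy_(F.scaled_dot_product_attention(q, k, v))
    return out

@register_fake("mylib::sdpa_like")
def _(q, k, v):
    # Fake/meta kernel: mirror the real kernel's layout choice so compiled
    # graphs see the same output strides as eager execution.
    order = sorted(range(q.dim()), key=q.stride, reverse=True)
    return torch.empty_permuted(q.shape, order, dtype=q.dtype, device=q.device)
```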
For #138340

~~We might consider more sophisticated logic here, but the corresponding logic in other backends doesn't seem to do anything fancy for non-BSHD/BHSD cases: pytorch/aten/src/ATen/native/transformers/cuda/attention.cu, line 1145 at ea8ea2f~~

Ended up going with a more general approach that handles more or less arbitrary layouts.
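To make "matching the query's memory layout ordering" concrete, here is a minimal Python sketch (not the actual ATen implementation; `empty_like_query_layout` is a hypothetical helper) that recovers the query's physical dimension ordering from its strides and allocates the output in that same layout:
```Python
import torch

def empty_like_query_layout(query: torch.Tensor) -> torch.Tensor:
    # Recover the physical dimension ordering of `query` by sorting its
    # logical dims by stride, outermost (largest stride) first. This treats
    # BHSD, BSHD, and more or less arbitrary permutations uniformly.
    order = sorted(range(query.dim()), key=query.stride, reverse=True)
    # Allocate a dense tensor with query's logical shape but laid out in
    # memory according to `order`.
    return torch.empty_permuted(query.shape, order,
                                dtype=query.dtype, device=query.device)

# Example: physically BSHD, viewed logically as (B, H, S, D).
q = torch.randn(2, 128, 8, 64).transpose(1, 2)   # shape (2, 8, 128, 64)
out = empty_like_query_layout(q)
assert out.shape == q.shape
assert out.stride() == q.stride()   # holds here because q is dense
```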
cc @csarofeen @ptrblck @xwang233 @msaroufim @drisspg @mikaylagawarecki