[ROCm] [Inductor] Nightly torch.compile assert_size_stride AssertionError: wrong number of dimensions #137414
Comments
Hi @OrenLeung, I can reproduce this. As a workaround, if we use the math backend for the scaled_dot_product_attention call then the code works, e.g.:
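(The original snippet is not shown here; below is a minimal sketch of that workaround, assuming a plain SDPA call. The helper name sdpa_math_only is illustrative.)

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

def sdpa_math_only(q, k, v, is_causal=True):
    # Restrict SDPA to the math backend so the flash-attention kernel
    # (the one tripping Inductor on ROCm) is never selected.
    with sdpa_kernel(SDPBackend.MATH):
        return F.scaled_dot_product_attention(q, k, v, is_causal=is_causal)
```

It seems we are hitting some issues with flash attention under torch.compile; if I run the generated code independently, we see this: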
Hi @OrenLeung, I've confirmed ROCm#1428 fixes the issue. I'll re-ping here when we have upstreamed this.
Thanks for looking into this! I will look into using the math backend for SDPA, or maybe just explicitly graph-breaking on flash SDPA.
Workaround I am doing until #137717 gets into nightly:

```python
import torch
import torch.nn.functional as F

def disable_torch_compile_if_amd(func):
    # Apply the torch.compiler.disable decorator only when running on an AMD MI300X
    if torch.cuda.is_available() and "MI300X" in torch.cuda.get_device_name():
        return torch.compiler.disable()(func)
    else:
        return func

@disable_torch_compile_if_amd
def scaled_dot_product_attention_wrapper(q_BHTD, k_BHTD, v_BHTD, dropout_p=0.0, is_causal=True):
    # with torch.nn.attention.sdpa_kernel(
    #     enable_math=True,
    #     enable_flash=False,
    #     enable_mem_efficient=False
    # ):
    o_BHTD = F.scaled_dot_product_attention(q_BHTD, k_BHTD, v_BHTD, dropout_p=dropout_p, is_causal=is_causal)
    return o_BHTD
```
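For illustration, a hedged usage sketch (tensor shapes and the attn_block name are hypothetical, not from the original comment): calling the wrapper from compiled code makes Dynamo skip it on MI300X, so flash SDPA runs eagerly via a graph break.

```python
# Hypothetical usage of the workaround above: on MI300X the decorated wrapper is
# skipped by Dynamo, so the SDPA call runs eagerly instead of being traced into
# the Inductor graph.
@torch.compile
def attn_block(q, k, v):
    return scaled_dot_product_attention_wrapper(q, k, v)

q = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.bfloat16)
k, v = torch.randn_like(q), torch.randn_like(q)
out = attn_block(q, k, v)  # same shape as q
```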
The logsumexp tensor was considered for internal use only but is apparently exposed to unit tests and Inductor. The stream should be selected after picking the current device; otherwise the code checks the default device's architecture. Fixes #131316 #137414. Pull Request resolved: #137717. Approved by: https://github.com/drisspg. Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com>
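As a rough illustration of the ordering issue the commit message describes (the actual fix is in the ROCm flash-attention backend in C++; this Python-level sketch is only an analogy, and the helper name is hypothetical):

```python
import torch

def query_stream_and_arch(device_index: int):
    # Analogy only: select the target device first, then ask for its stream and
    # properties; querying before selecting would report device 0's architecture
    # on a multi-GPU node.
    with torch.cuda.device(device_index):
        stream = torch.cuda.current_stream()
        props = torch.cuda.get_device_properties(device_index)
    return stream, props.name
```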
Hi @hliuca @jataylo @xinyazhang, I can confirm this fixed the issue in my internal codebase. Thank you for the fix! Closing this issue as fixed in PR #137717.
Thank you @OrenLeung
🐛 Describe the bug
Hi @hliuca,
ROCm nightly performance has greatly improved since the F.Linear fix, but unfortunately torch.compile does not work on ROCm even though it works on CUDA.
I am hitting an assert_size_stride AssertionError in ROCm Inductor. I am guessing the bug is in the CausalSelfAttention layer. I have attached a repro in this issue. cc: @hongxiayang
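The attached script is not reproduced here; for orientation only, a hedged sketch of the kind of layer involved (module, dims, and shapes are hypothetical, not the author's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    # Hypothetical minimal module, not the attached repro: a qkv projection
    # followed by flash SDPA, the path where Inductor's assert_size_stride
    # check fails on ROCm nightly.
    def __init__(self, dim=1024, n_heads=16):
        super().__init__()
        self.n_heads = n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x):  # x: (B, T, D)
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, T, self.n_heads, -1).transpose(1, 2)  # (B, H, T, d)
        k = k.view(B, T, self.n_heads, -1).transpose(1, 2)
        v = v.view(B, T, self.n_heads, -1).transpose(1, 2)
        o = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        o = o.transpose(1, 2).reshape(B, T, D)
        return self.proj(o)

model = torch.compile(CausalSelfAttention().cuda().to(torch.bfloat16))
y = model(torch.randn(4, 256, 1024, device="cuda", dtype=torch.bfloat16))
```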
Eager Command (this works without crashing)
Compile Command that causes the error
Error Trace
Repro Script
Versions
ROCm
```
# pip list | grep torch
pytorch-triton-rocm  3.1.0+cf34004b8a
torch                2.6.0.dev20241006+rocm6.2
torchvision          0.18.0a0+68ba7ec
```
Nvidia
Versions where this works on NVIDIA include the 24.07 NGC container, 24.08, and the PyPI nightly.
cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd @ezyang @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire