triggered internal assert in matmul · Issue #153172 · pytorch/pytorch
Open
Angramme opened this issue May 8, 2025 · 2 comments
Labels
module: cuda (Related to torch.cuda, and CUDA support in general)
module: CUDACachingAllocator
needs reproduction (Someone else needs to try reproducing the issue given the instructions. No action needed from user.)
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module.)

Comments

@Angramme commented May 8, 2025

🐛 Describe the bug

The error triggers when calling matmul (at the line marked HERE below):

https://github.com/huggingface/transformers/blob/d23aae2b8c8738a12ab1b6710e60ae5866beaf9d/src/transformers/models/qwen2/modeling_qwen2.py#L116

# code taken from transformers/models/qwen2/modeling_qwen2.py
# (repeat_kv is defined in the same module)
from typing import Optional

import torch
from torch import nn
def eager_attention_forward(
    module: nn.Module,
    query: torch.Tensor,
    key: torch.Tensor,
    value: torch.Tensor,
    attention_mask: Optional[torch.Tensor],
    scaling: float,
    dropout: float = 0.0,
    **kwargs,
):
    key_states = repeat_kv(key, module.num_key_value_groups)
    value_states = repeat_kv(value, module.num_key_value_groups)

    ##### HERE
    attn_weights = torch.matmul(query, key_states.transpose(2, 3)) * scaling
    #########
    if attention_mask is not None:
        causal_mask = attention_mask[:, :, :, : key_states.shape[-2]]
        attn_weights = attn_weights + causal_mask

    attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query.dtype)
    attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training)
    attn_output = torch.matmul(attn_weights, value_states)
    attn_output = attn_output.transpose(1, 2).contiguous()

    return attn_output, attn_weights

RuntimeError: !handles_.at(i) INTERNAL ASSERT FAILED at "/pytorch/c10/cuda/CUDACachingAllocator.cpp":396, please report a bug to PyTorch.

I apologise in advance: since the tensors are quite large, I think it would be difficult to include them here.
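
For reference, here is a minimal standalone sketch of the failing operation using random tensors. The shapes, dtype, and sequence length below are guesses in the style of a Qwen2-sized model, not values captured from the failing run:

# hypothetical repro sketch; shapes, dtype, and device are assumptions,
# not values taken from the failing run
import torch

batch, num_heads, seq_len, head_dim = 1, 28, 8192, 128
scaling = head_dim ** -0.5

query = torch.randn(batch, num_heads, seq_len, head_dim, device="cuda", dtype=torch.bfloat16)
# key_states as it looks after repeat_kv, i.e. already expanded to num_heads
key_states = torch.randn(batch, num_heads, seq_len, head_dim, device="cuda", dtype=torch.bfloat16)

# the call that triggers the internal assert in the report above
attn_weights = torch.matmul(query, key_states.transpose(2, 3)) * scaling
torch.cuda.synchronize()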

pip freeze

accelerate==1.6.0
aiohappyeyeballs==2.6.1
aiohttp==3.11.18
aiosignal==1.3.2
asttokens==3.0.0
async-timeout==5.0.1
attrs==25.3.0
bitsandbytes==0.45.5
certifi==2025.4.26
charset-normalizer==3.4.2
-e git+ssh://git@github.com/McLavish/cnlp_icl.git@93c0ce1948b19d9592a94bcdd8793d3c458992eb#egg=cnlp_icl
comm==0.2.2
contourpy==1.3.2
cycler==0.12.1
datasets==3.5.1
debugpy==1.8.14
decorator==5.2.1
dill==0.3.8
exceptiongroup==1.2.2
executing==2.2.0
filelock==3.18.0
fonttools==4.57.0
frozenlist==1.6.0
fsspec==2025.3.0
huggingface-hub==0.30.2
idna==3.10
ipykernel==6.29.5
ipython==8.36.0
jedi==0.19.2
Jinja2==3.1.6
joblib==1.5.0
jupyter_client==8.6.3
jupyter_core==5.7.2
kiwisolver==1.4.8
MarkupSafe==3.0.2
matplotlib==3.10.1
matplotlib-inline==0.1.7
mpmath==1.3.0
multidict==6.4.3
multiprocess==0.70.16
nest-asyncio==1.6.0
networkx==3.4.2
numpy==2.2.5
nvidia-cublas-cu12==12.6.4.1
nvidia-cuda-cupti-cu12==12.6.80
nvidia-cuda-nvrtc-cu12==12.6.77
nvidia-cuda-runtime-cu12==12.6.77
nvidia-cudnn-cu12==9.5.1.17
nvidia-cufft-cu12==11.3.0.4
nvidia-cufile-cu12==1.11.1.6
nvidia-curand-cu12==10.3.7.77
nvidia-cusolver-cu12==11.7.1.2
nvidia-cusparse-cu12==12.5.4.2
nvidia-cusparselt-cu12==0.6.3
nvidia-nccl-cu12==2.26.2
nvidia-nvjitlink-cu12==12.6.85
nvidia-nvtx-cu12==12.6.77
packaging==25.0
pandas==2.2.3
parso==0.8.4
pexpect==4.9.0
pillow==11.2.1
platformdirs==4.3.7
prompt_toolkit==3.0.51
propcache==0.3.1
psutil==7.0.0
ptyprocess==0.7.0
pure_eval==0.2.3
pyarrow==20.0.0
Pygments==2.19.1
pyparsing==3.2.3
python-dateutil==2.9.0.post0
pytz==2025.2
PyYAML==6.0.2
pyzmq==26.4.0
regex==2024.11.6
requests==2.32.3
safetensors==0.5.3
scikit-learn==1.6.1
scipy==1.15.2
seaborn==0.13.2
six==1.17.0
stack-data==0.6.3
sympy==1.14.0
threadpoolctl==3.6.0
tokenizers==0.21.1
torch==2.7.0
tornado==6.4.2
tqdm==4.67.1
traitlets==5.14.3
transformers==4.51.3
triton==3.3.0
typing_extensions==4.13.2
tzdata==2025.2
urllib3==2.4.0
wcwidth==0.2.13
xxhash==3.5.0
yarl==1.20.0

Versions

[pip3] numpy==2.2.5
[pip3] nvidia-cublas-cu12==12.6.4.1
[pip3] nvidia-cuda-cupti-cu12==12.6.80
[pip3] nvidia-cuda-nvrtc-cu12==12.6.77
[pip3] nvidia-cuda-runtime-cu12==12.6.77
[pip3] nvidia-cudnn-cu12==9.5.1.17
[pip3] nvidia-cufft-cu12==11.3.0.4
[pip3] nvidia-curand-cu12==10.3.7.77
[pip3] nvidia-cusolver-cu12==11.7.1.2
[pip3] nvidia-cusparse-cu12==12.5.4.2
[pip3] nvidia-cusparselt-cu12==0.6.3
[pip3] nvidia-nccl-cu12==2.26.2
[pip3] nvidia-nvjitlink-cu12==12.6.85
[pip3] nvidia-nvtx-cu12==12.6.77
[pip3] torch==2.7.0
[pip3] triton==3.3.0
[conda] Could not collect

cc @ptrblck @msaroufim @eqy @jerryzh168

@jbschlosser added the module: cuda, triaged, and module: CUDACachingAllocator labels on May 8, 2025
@jbschlosser (Contributor) commented

Hey @Angramme, is it at all possible to distill this down into a small reproduction script demonstrating the problem (e.g. with random, similarly-sized tensors)? That will help us investigate this. It seems possible to me this is a result of an OOM.

@jbschlosser added the needs reproduction label on May 8, 2025
@Angramme (Author) commented

Hey, sorry for the delay. Yes, this occurs interchangeably with OOM errors. I will try to put together a reproducible example soon.
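
Since the assert shows up interchangeably with OOM errors, one possible way to gather extra context while building the repro (a sketch that assumes the failing call site is editable, not a confirmed diagnosis) is to log allocator state around the matmul:

# hypothetical instrumentation sketch; the commented-out matmul stands in
# for the actual call in eager_attention_forward
import torch

def log_cuda_memory(tag: str) -> None:
    # memory_allocated/memory_reserved are standard torch.cuda APIs; values in bytes
    print(f"[{tag}] allocated={torch.cuda.memory_allocated()} "
          f"reserved={torch.cuda.memory_reserved()}")

log_cuda_memory("before matmul")
# attn_weights = torch.matmul(query, key_states.transpose(2, 3)) * scaling
log_cuda_memory("after matmul")

torch.cuda.memory_summary() gives a more detailed per-pool breakdown if the two counters above are not enough.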
