MultiheadAttention returns NaNs when need_weights=False for long sequences with a mask that ignores old tokens #127055


Closed
twoertwein opened this issue May 24, 2024 · 10 comments
Labels
intel (This tag is for PR from Intel), module: cpu (CPU specific problem, e.g., perf, algorithm)
Milestone

Comments

twoertwein (Contributor) commented May 24, 2024

🐛 Describe the bug

It works as expected for shorter sequences and when all past tokens are allowed.

import torch

model = torch.nn.MultiheadAttention(embed_dim=2, num_heads=1)

n = 600
sequence = torch.ones(n, 2)

# do not attend to the future and very old tokens
full = torch.full((n, n), float("-inf"))
mask = torch.triu(full, diagonal=1) + torch.tril(full, diagonal=-10)

print(model(sequence, sequence, sequence, attn_mask=mask, need_weights=False)[0])
#tensor([[0.0519, 0.1435],
#        [0.0519, 0.1435],
#        [0.0519, 0.1435],
#        ...,
#        [   nan,    nan],
#        [   nan,    nan],
#        [   nan,    nan]], grad_fn=<SqueezeBackward1>)
print(model(sequence, sequence, sequence, attn_mask=mask, need_weights=True)[0])
#tensor([[0.0519, 0.1435],
#        [0.0519, 0.1435],
#        [0.0519, 0.1435],
#        ...,
#        [0.0519, 0.1435],
#        [0.0519, 0.1435],
#        [0.0519, 0.1435]], grad_fn=<SqueezeBackward1>)

Versions

Collecting environment information...
PyTorch version: 2.3.0
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 14.4.1 (arm64)
GCC version: Could not collect
Clang version: 15.0.0 (clang-1500.3.9.4)
CMake version: Could not collect
Libc version: N/A

Python version: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:51:49) [Clang 16.0.6 ] (64-bit runtime)
Python platform: macOS-14.4.1-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Apple M2 Pro

Versions of relevant libraries:
[pip3] numpy==1.24.3
[pip3] onnx==1.16.0
[pip3] tf2onnx==1.16.1
[pip3] torch==2.3.0
[pip3] torchaudio==2.3.0
[conda] numpy 1.24.3 pypi_0 pypi
[conda] torch 2.3.0 pypi_0 pypi
[conda] torchaudio 2.3.0 pypi_0 pypi

cc @jbschlosser @bhosmer @cpuhrsch @erichan1 @drisspg @mikaylagawarecki @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10

drisspg added the intel (This tag is for PR from Intel) label May 30, 2024
drisspg (Contributor) commented May 30, 2024

@Valentine233 would you mind taking a look at this? I was able to narrow it down to `_scaled_dot_product_flash_attention_cpu` returning NaNs in the forward pass.
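
A minimal sketch of how to hit that kernel directly (not from the original comment; it assumes the flash backend is the one selected for CPU tensors under `sdpa_kernel`) is to call `torch.nn.functional.scaled_dot_product_attention` with the same banded mask:

import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

n, head_dim = 600, 2
# (batch, num_heads, seq_len, head_dim), the shapes the MHA fast path would use
q = k = v = torch.ones(1, 1, n, head_dim)

# Same banded mask as the report: block the future and tokens older than 10 steps.
full = torch.full((n, n), float("-inf"))
mask = torch.triu(full, diagonal=1) + torch.tril(full, diagonal=-10)

# Restricting SDPA to the flash backend should exercise
# _scaled_dot_product_flash_attention_cpu on a CPU tensor.
with sdpa_kernel([SDPBackend.FLASH_ATTENTION]):
    out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
print(out.isnan().any())  # expected on an affected 2.3.x build: tensor(True)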

drisspg added the module: cpu (CPU specific problem, e.g., perf, algorithm) label May 30, 2024
Valentine233 (Collaborator) commented Jun 6, 2024

The values of the attention mask are mostly -inf in this case. NaNs are generated in flash attention because of the computation of `std::exp((-inf) - (-inf))` and `inf * 0` in the lazy softmax.
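
As a plain-Python illustration of that arithmetic (a sketch of the failure mode, not the kernel code): when every score a query row sees in a block is masked to -inf, the row maximum is also -inf, and the subtraction inside the exponential already produces NaN:

import math

neg_inf = float("-inf")

# A block of attention scores that is fully masked for one query row.
scores = [neg_inf, neg_inf, neg_inf]

row_max = max(scores)  # -inf, because everything is masked
# exp(score - row_max) evaluates exp((-inf) - (-inf)) = exp(nan) = nan
weights = [math.exp(s - row_max) for s in scores]
print(weights)  # [nan, nan, nan]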

I also checked CUDA, and it goes into the math SDPA for the following reasons:

  • Memory efficient kernel not used because: Mem efficient attention requires last dimension of inputs to be divisible by 4. Got Query.size(-1): 2, Key.size(-1): 2, Value.size(-1): 2 instead.
  • Flash attention kernel not used because: Flash Attention does not support non-null attn_mask.
  • CuDNN attention kernel not used because: The CuDNN backend needs to be enabled by setting the environment variable TORCH_CUDNN_SDPA_ENABLED=1.

@drisspg Is this the expected behavior for the CUDA path, and do you have any suggestions for the issue? I suppose that CUDA would encounter the same issue if it went into the fused SDPA.

Valentine233 (Collaborator) commented

@drisspg Hi, have you got any comments or suggestions for the issue? Thanks!

drisspg (Contributor) commented Jun 15, 2024

Hey, sorry, I was on PTO.

I still think this is an issue with the FlashAttention implementation on CPU:

import torch
from torch.nn.attention import sdpa_kernel, SDPBackend

device = "cpu"
embed_dim = 4

model = torch.nn.MultiheadAttention(embed_dim=embed_dim, num_heads=1).to(device)

n = 640
sequence = torch.ones(n, embed_dim, device=device)

# do not attend to the future and very old tokens
full = torch.full((n, n), float("-inf"), device=device)
mask = torch.triu(full, diagonal=1) + torch.tril(full, diagonal=-10)

with sdpa_kernel([SDPBackend.FLASH_ATTENTION, SDPBackend.EFFICIENT_ATTENTION]):
    print(model(sequence, sequence, sequence, attn_mask=mask, need_weights=False)[0])
print(model(sequence, sequence, sequence, attn_mask=mask, need_weights=True)[0])

❯ python /home/drisspg/meta/scripts/sdpa/nan_mha.py
tensor([[ 0.3151, -0.3888,  0.0733,  0.1281],
        [ 0.3151, -0.3888,  0.0733,  0.1281],
        [ 0.3151, -0.3888,  0.0733,  0.1281],
        ...,
        [    nan,     nan,     nan,     nan],
        [    nan,     nan,     nan,     nan],
        [    nan,     nan,     nan,     nan]], grad_fn=<SqueezeBackward1>)
tensor([[ 0.3151, -0.3888,  0.0733,  0.1281],
        [ 0.3151, -0.3888,  0.0733,  0.1281],
        [ 0.3151, -0.3888,  0.0733,  0.1281],
        ...,
        [ 0.3151, -0.3888,  0.0733,  0.1281],
        [ 0.3151, -0.3888,  0.0733,  0.1281],
        [ 0.3151, -0.3888,  0.0733,  0.1281]], grad_fn=<SqueezeBackward1>)

While with the CUDA device:

import torch
from torch.nn.attention import sdpa_kernel, SDPBackend

device = "cuda"
embed_dim = 4

model = torch.nn.MultiheadAttention(embed_dim=embed_dim, num_heads=1).to(device)

n = 640
sequence = torch.ones(n, embed_dim, device=device)

# do not attend to the future and very old tokens
full = torch.full((n, n), float("-inf"), device=device)
mask = torch.triu(full, diagonal=1) + torch.tril(full, diagonal=-10)

with sdpa_kernel([SDPBackend.FLASH_ATTENTION, SDPBackend.EFFICIENT_ATTENTION]):
    print(model(sequence, sequence, sequence, attn_mask=mask, need_weights=False)[0])
print(model(sequence, sequence, sequence, attn_mask=mask, need_weights=True)[0])

❯ python /home/drisspg/meta/scripts/sdpa/nan_mha.py
tensor([[ 0.1694, -0.4467,  0.2550, -0.0582],
        [ 0.1694, -0.4467,  0.2550, -0.0582],
        [ 0.1694, -0.4467,  0.2550, -0.0582],
        ...,
        [ 0.1694, -0.4467,  0.2550, -0.0582],
        [ 0.1694, -0.4467,  0.2550, -0.0582],
        [ 0.1694, -0.4467,  0.2550, -0.0582]], device='cuda:0',
       grad_fn=<SqueezeBackward1>)
tensor([[ 0.1694, -0.4467,  0.2550, -0.0582],
        [ 0.1694, -0.4467,  0.2550, -0.0582],
        [ 0.1694, -0.4467,  0.2550, -0.0582],
        ...,
        [ 0.1694, -0.4467,  0.2550, -0.0582],
        [ 0.1694, -0.4467,  0.2550, -0.0582],
        [ 0.1694, -0.4467,  0.2550, -0.0582]], device='cuda:0',
       grad_fn=<SqueezeBackward1>)

Valentine233 (Collaborator) commented Jun 18, 2024

@drisspg Thanks! I ran your code and the CUDA MHA went into the efficient attention, so this is now clearly a CPU-specific issue.
I'm curious how CUDA deals with the following case: NaNs are generated in flash attention by the computation of `std::exp((-inf) - (-inf))` and `+/-inf * 0` in the lazy softmax. We hit this whenever the piece of the attention mask being processed by the lazy softmax consists entirely of -inf values.
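
To make the lazy-softmax part concrete, here is a toy pure-Python online-softmax accumulator (a sketch, not the actual kernel): when the running maximum is still -inf after a fully masked block, both the `exp((-inf) - (-inf))` rescale and the resulting `0 * nan` product poison the running sum, and later unmasked blocks cannot recover it:

import math

neg_inf = float("-inf")

def online_softmax_denominator(blocks):
    # Toy "lazy"/online softmax for one query row, processed block by block.
    running_max, running_sum = neg_inf, 0.0
    for block in blocks:
        new_max = max(running_max, max(block))
        # With a fully masked first block, running_max == new_max == -inf, so the
        # rescale factor is exp((-inf) - (-inf)) = nan and the sum becomes nan.
        running_sum = running_sum * math.exp(running_max - new_max)
        running_sum += sum(math.exp(s - new_max) for s in block)
        running_max = new_max
    return running_sum

# First block fully masked, second block has one visible token.
print(online_softmax_denominator([[neg_inf, neg_inf], [0.0, neg_inf]]))  # nan, should be ~1.0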

thecopeeninja commented

I ran the reproduction above on version 2.2.2 and it worked fine on both CUDA and CPU.

nvidia-nccl-cu12-2.19.3 
torch-2.2.2 
triton-2.2.0

It failed on PyTorch versions 2.3.0 and 2.3.1.

drisspg (Contributor) commented Jul 1, 2024

@Valentine233

// Only when bias is enabled, it's possible that all the first values
// of attention are masked to `-inf`. In that case we want to avoid
// `nan = exp2f(-inf - (-inf))` so we temporarily set `mi` to 0
if (kSupportsBias &&
    mi[id] == -cutlass::platform::numeric_limits<accum_t>::infinity()) {
  restore_mi_to_minus_inf = true;
  mi[id] = 0.0f;
}
out_rescale[id] = 1.0f;

I think this might be the relevant piece of code you are looking for.
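
In other words (a minimal Python rendering of that guard, not the actual kernel code): when a row's maximum over a block is -inf because everything in it is masked, temporarily treating the max as 0 turns `exp((-inf) - (-inf))` into `exp(-inf - 0) = 0`, which is the kind of handling the CPU kernel needs as well:

import math

neg_inf = float("-inf")

def block_softmax_weights(scores):
    row_max = max(scores)
    # Guard analogous to the CUDA kernel: a fully masked block would otherwise
    # lead to exp((-inf) - (-inf)) = nan, so pretend the max is 0 for this block.
    if row_max == neg_inf:
        row_max = 0.0
    return [math.exp(s - row_max) for s in scores]

print(block_softmax_weights([neg_inf, neg_inf]))  # [0.0, 0.0] instead of [nan, nan]
print(block_softmax_weights([0.5, neg_inf]))      # normal path: [1.0, 0.0]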

xuhancn pushed a commit to xuhancn/pytorch that referenced this issue Jul 25, 2024
Fixes pytorch#127055.

NaNs are generated in flash attention because of the computation of `std::exp((-inf) - (-inf))` and `+/-inf * 0` in the lazy softmax. We fix the issue by avoiding the related calculation.

Pull Request resolved: pytorch#130014
Approved by: https://github.com/jgong5, https://github.com/drisspg
twoertwein (Contributor, Author) commented

Which version of PyTorch will have this fix? I just upgraded to 2.4 (on an ARM Mac, via pip) and I still have the same issue.

drisspg (Contributor) commented Jul 28, 2024

We are planning to add this to the next patch release. To access it now, you could run:
pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cpu
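
For anyone stuck on an affected release in the meantime, one possible workaround (an untested sketch, assuming the MHA call routes through SDPA here) is to keep the call off the CPU flash kernel by restricting SDPA to the math backend:

import torch
from torch.nn.attention import sdpa_kernel, SDPBackend

model = torch.nn.MultiheadAttention(embed_dim=2, num_heads=1)

n = 600
sequence = torch.ones(n, 2)
full = torch.full((n, n), float("-inf"))
mask = torch.triu(full, diagonal=1) + torch.tril(full, diagonal=-10)

# The math backend computes the full softmax per row; since every row can still
# attend to at least itself, no row is all -inf and no NaNs should appear.
with sdpa_kernel([SDPBackend.MATH]):
    out, _ = model(sequence, sequence, sequence, attn_mask=mask, need_weights=False)
print(out.isnan().any())  # expected: tensor(False)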

atalman added this to the 2.4.1 milestone Aug 15, 2024
pytorchbot pushed a commit that referenced this issue Aug 15, 2024
Fixes #127055.

NaNs are generated in flash attention because of the computation of `std::exp((-inf) - (-inf))` and `+/-inf * 0` in the lazy softmax. We fix the issue by avoiding the related calculation.

Pull Request resolved: #130014
Approved by: https://github.com/jgong5, https://github.com/drisspg

(cherry picked from commit 868d9a4)
atalman pushed a commit that referenced this issue Aug 21, 2024
[cpu][flash attention] fix nan issue (#130014)

Fixes #127055.

NaNs are generated in flash attention because of the computation of `std::exp((-inf) - (-inf))` and `+/-inf * 0` in the lazy softmax. We fix the issue by avoiding the related calculation.

Pull Request resolved: #130014
Approved by: https://github.com/jgong5, https://github.com/drisspg

(cherry picked from commit 868d9a4)

Co-authored-by: Valentine233 <xuan.liao@intel.com>
atalman (Contributor) commented Aug 29, 2024

Confirmed fixed in the final RC for 2.4.1:

tensor([[ 0.5036,  0.2297,  0.6542, -0.5339],
        [ 0.5036,  0.2297,  0.6542, -0.5339],
        [ 0.5036,  0.2297,  0.6542, -0.5339],
        ...,
        [ 0.5036,  0.2297,  0.6542, -0.5339],
        [ 0.5036,  0.2297,  0.6542, -0.5339],
        [ 0.5036,  0.2297,  0.6542, -0.5339]], grad_fn=<SqueezeBackward1>)
tensor([[ 0.5036,  0.2297,  0.6542, -0.5339],
        [ 0.5036,  0.2297,  0.6542, -0.5339],
        [ 0.5036,  0.2297,  0.6542, -0.5339],
        ...,
        [ 0.5036,  0.2297,  0.6542, -0.5339],
        [ 0.5036,  0.2297,  0.6542, -0.5339],
        [ 0.5036,  0.2297,  0.6542, -0.5339]], grad_fn=<SqueezeBackward1>)
