MultiheadAttention returns NaNs when need_weights=False for long sequences with a mask that ignores old tokens #127055


Closed
twoertwein opened this issue May 24, 2024 · 10 comments
Labels
intel (This tag is for PR from Intel), module: cpu (CPU specific problem, e.g., perf, algorithm)
Milestone

Comments

twoertwein (Contributor) commented May 24, 2024

🐛 Describe the bug

It works as expected for shorter sequences and when all past tokens are allowed.

import torch

model = torch.nn.MultiheadAttention(embed_dim=2, num_heads=1)

n = 600
sequence = torch.ones(n, 2)

# do not attend to the future and very old tokens
full = torch.full((n, n), float("-inf"))
mask = torch.triu(full, diagonal=1) + torch.tril(full, diagonal=-10)

print(model(sequence, sequence, sequence, attn_mask=mask, need_weights=False)[0])
#tensor([[0.0519, 0.1435],
#        [0.0519, 0.1435],
#        [0.0519, 0.1435],
#        ...,
#        [   nan,    nan],
#        [   nan,    nan],
#        [   nan,    nan]], grad_fn=<SqueezeBackward1>)
print(model(sequence, sequence, sequence, attn_mask=mask, need_weights=True)[0])
#tensor([[0.0519, 0.1435],
#        [0.0519, 0.1435],
#        [0.0519, 0.1435],
#        ...,
#        [0.0519, 0.1435],
#        [0.0519, 0.1435],
#        [0.0519, 0.1435]], grad_fn=<SqueezeBackward1>)

Versions

Collecting environment information...
PyTorch version: 2.3.0
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 14.4.1 (arm64)
GCC version: Could not collect
Clang version: 15.0.0 (clang-1500.3.9.4)
CMake version: Could not collect
Libc version: N/A

Python version: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:51:49) [Clang 16.0.6 ] (64-bit runtime)
Python platform: macOS-14.4.1-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Apple M2 Pro

Versions of relevant libraries:
[pip3] numpy==1.24.3
[pip3] onnx==1.16.0
[pip3] tf2onnx==1.16.1
[pip3] torch==2.3.0
[pip3] torchaudio==2.3.0
[conda] numpy 1.24.3 pypi_0 pypi
[conda] torch 2.3.0 pypi_0 pypi
[conda] torchaudio 2.3.0 pypi_0 pypi

cc @jbschlosser @bhosmer @cpuhrsch @erichan1 @drisspg @mikaylagawarecki @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10

drisspg added the intel (This tag is for PR from Intel) label May 30, 2024
drisspg (Contributor) commented May 30, 2024

@Valentine233 would you mind taking a look at this? I was able to narrow it down to `_scaled_dot_product_flash_attention_cpu` returning NaNs in the forward pass.
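
A minimal sketch of how to hit that kernel directly (not from the original comment; it assumes the flash backend is the one selected for CPU tensors under `sdpa_kernel`) is to call `torch.nn.functional.scaled_dot_product_attention` with the same banded mask:

import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

n, head_dim = 600, 2
# (batch, num_heads, seq_len, head_dim), the shapes the MHA fast path would use
q = k = v = torch.ones(1, 1, n, head_dim)

# Same banded mask as the report: block the future and tokens older than 10 steps.
full = torch.full((n, n), float("-inf"))
mask = torch.triu(full, diagonal=1) + torch.tril(full, diagonal=-10)

# Restricting SDPA to the flash backend should exercise
# _scaled_dot_product_flash_attention_cpu on a CPU tensor.
with sdpa_kernel([SDPBackend.FLASH_ATTENTION]):
    out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
print(out.isnan().any())  # expected on an affected 2.3.x build: tensor(True)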

drisspg added the module: cpu (CPU specific problem, e.g., perf, algorithm) label May 30, 2024
Valentine233 (Collaborator) commented Jun 6, 2024

The values of the attention mask are mostly -inf in this case. NaNs are generated in flash attention because of the computation of `std::exp((-inf) - (-inf))` and `inf * 0` in the lazy softmax.
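
As a plain-Python illustration of that arithmetic (a sketch of the failure mode, not the kernel code): when every score a query row sees in a block is masked to -inf, the row maximum is also -inf, and the subtraction inside the exponential already produces NaN:

import math

neg_inf = float("-inf")

# A block of attention scores that is fully masked for one query row.
scores = [neg_inf, neg_inf, neg_inf]

row_max = max(scores)  # -inf, because everything is masked
# exp(score - row_max) evaluates exp((-inf) - (-inf)) = exp(nan) = nan
weights = [math.exp(s - row_max) for s in scores]
print(weights)  # [nan, nan, nan]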

I also checked CUDA, and it goes into the math SDPA for the following reasons:

  • Memory efficient kernel not used because: Mem efficient attention requires last dimension of inputs to be divisible by 4. Got Query.size(-1): 2, Key.size(-1): 2, Value.size(-1): 2 instead.
  • Flash attention kernel not used because: Flash Attention does not support non-null attn_mask.
  • CuDNN attention kernel not used because: The CuDNN backend needs to be enabled by setting the environment variable TORCH_CUDNN_SDPA_ENABLED=1.

@drisspg Is this the expected behavior for the CUDA path, and do you have any suggestions for the issue? I suppose that CUDA would encounter the same issue if it went into the fused SDPA.

Valentine233 (Collaborator) commented

@drisspg Hi, have you got any comments or suggestions for the issue? Thanks!

drisspg (Contributor) commented Jun 15, 2024

Hey, sorry, I was on PTO.

I still think this is an issue with the FlashAttention implementation on CPU:

import torch
from torch.nn.attention import sdpa_kernel, SDPBackend

device = "cpu"
embed_dim = 4

model = torch.nn.MultiheadAttention(embed_dim=embed_dim, num_heads=1).to(device)

n = 640
sequence = torch.ones(n, embed_dim, device=device)

# do not attend to the future and very old tokens
full = torch.full((n, n), float("-inf"), device=device)
mask = torch.triu(full, diagonal=1) + torch.tril(full, diagonal=-10)

with sdpa_kernel([SDPBackend.FLASH_ATTENTION, SDPBackend.EFFICIENT_ATTENTION]):
    print(model(sequence, sequence, sequence, attn_mask=mask, need_weights=False)[0])
print(model(sequence, sequence, sequence, attn_mask=mask, need_weights=True)[0])

❯ python /home/drisspg/meta/scripts/sdpa/nan_mha.py
tensor([[ 0.3151, -0.3888,  0.0733,  0.1281],
        [ 0.3151, -0.3888,  0.0733,  0.1281],
        [ 0.3151, -0.3888,  0.0733,  0.1281],
        ...,
        [    nan,     nan,     nan,     nan],
        [    nan,     nan,     nan,     nan],
        [    nan,     nan,     nan,     nan]], grad_fn=<SqueezeBackward1>)
tensor([[ 0.3151, -0.3888,  0.0733,  0.1281],
        [ 0.3151, -0.3888,  0.0733,  0.1281],
        [ 0.3151, -0.3888,  0.0733,  0.1281],
        ...,
        [ 0.3151, -0.3888,  0.0733,  0.1281],
        [ 0.3151, -0.3888,  0.0733,  0.1281],
        [ 0.3151, -0.3888,  0.0733,  0.1281]], grad_fn=<SqueezeBackward1>)

While with the CUDA device:

import torch
from torch.nn.attention import sdpa_kernel, SDPBackend

device = "cuda"
embed_dim = 4

model = torch.nn.MultiheadAttention(embed_dim=embed_dim, num_heads=1).to(device)

n = 640
sequence = torch.ones(n, embed_dim, device=device)

# do not attend to the future and very old tokens
full = torch.full((n, n), float("-inf"), device=device)
mask = torch.triu(full, diagonal=1) + torch.tril(full, diagonal=-10)

with sdpa_kernel([SDPBackend.FLASH_ATTENTION, SDPBackend.EFFICIENT_ATTENTION]):
    print(model(sequence, sequence, sequence, attn_mask=mask, need_weights=False)[0])
print(model(sequence, sequence, sequence, attn_mask=mask, need_weights=True)[0])

❯ python /home/drisspg/meta/scripts/sdpa/nan_mha.py
tensor([[ 0.1694, -0.4467,  0.2550, -0.0582],
        [ 0.1694, -0.4467,  0.2550, -0.0582],
        [ 0.1694, -0.4467,  0.2550, -0.0582],
        ...,
        [ 0.1694, -0.4467,  0.2550, -0.0582],
        [ 0.1694, -0.4467,  0.2550, -0.0582],
        [ 0.1694, -0.4467,  0.2550, -0.0582]], device='cuda:0',
       grad_fn=<SqueezeBackward1>)
tensor([[ 0.1694, -0.4467,  0.2550, -0.0582],
        [ 0.1694, -0.4467,  0.2550, -0.0582],
        [ 0.1694, -0.4467,  0.2550, -0.0582],
        ...,
        [ 0.1694, -0.4467,  0.2550, -0.0582],
        [ 0.1694, -0.4467,  0.2550, -0.0582],
        [ 0.1694, -0.4467,  0.2550, -0.0582]], device='cuda:0',
       grad_fn=<SqueezeBackward1>)

Valentine233 (Collaborator) commented Jun 18, 2024

@drisspg Thanks! I ran your code and the CUDA MHA went into the efficient attention, so this is now clearly a CPU-specific issue.
I'm curious how CUDA deals with the following case: NaNs are generated in flash attention by the computation of `std::exp((-inf) - (-inf))` and `+/-inf * 0` in the lazy softmax. We hit this whenever the piece of the attention mask being processed by the lazy softmax consists entirely of -inf values.
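
To make the lazy-softmax part concrete, here is a toy pure-Python online-softmax accumulator (a sketch, not the actual kernel): when the running maximum is still -inf after a fully masked block, both the `exp((-inf) - (-inf))` rescale and the resulting `0 * nan` product poison the running sum, and later unmasked blocks cannot recover it:

import math

neg_inf = float("-inf")

def online_softmax_denominator(blocks):
    # Toy "lazy"/online softmax for one query row, processed block by block.
    running_max, running_sum = neg_inf, 0.0
    for block in blocks:
        new_max = max(running_max, max(block))
        # With a fully masked first block, running_max == new_max == -inf, so the
        # rescale factor is exp((-inf) - (-inf)) = nan and the sum becomes nan.
        running_sum = running_sum * math.exp(running_max - new_max)
        running_sum += sum(math.exp(s - new_max) for s in block)
        running_max = new_max
    return running_sum

# First block fully masked, second block has one visible token.
print(online_softmax_denominator([[neg_inf, neg_inf], [0.0, neg_inf]]))  # nan, should be ~1.0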

thecopeeninja commented

I ran the reproduction above on version 2.2.2 and it worked fine on both CUDA and CPU.

nvidia-nccl-cu12-2.19.3 
torch-2.2.2 
triton-2.2.0

It failed on PyTorch versions 2.3.0 and 2.3.1.

drisspg (Contributor) commented Jul 1, 2024

@Valentine233

// Only when bias is enabled, it's possible that all the first values
// of attention are masked to `-inf`. In that case we want to avoid
// `nan = exp2f(-inf - (-inf))` so we temporarily set `mi` to 0
if (kSupportsBias &&
    mi[id] == -cutlass::platform::numeric_limits<accum_t>::infinity()) {
  restore_mi_to_minus_inf = true;
  mi[id] = 0.0f;
}
out_rescale[id] = 1.0f;

I think this might be the relevant piece of code you are looking for.
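
In other words (a minimal Python rendering of that guard, not the actual kernel code): when a row's maximum over a block is -inf because everything in it is masked, temporarily treating the max as 0 turns `exp((-inf) - (-inf))` into `exp(-inf - 0) = 0`, which is the kind of handling the CPU kernel needs as well:

import math

neg_inf = float("-inf")

def block_softmax_weights(scores):
    row_max = max(scores)
    # Guard analogous to the CUDA kernel: a fully masked block would otherwise
    # lead to exp((-inf) - (-inf)) = nan, so pretend the max is 0 for this block.
    if row_max == neg_inf:
        row_max = 0.0
    return [math.exp(s - row_max) for s in scores]

print(block_softmax_weights([neg_inf, neg_inf]))  # [0.0, 0.0] instead of [nan, nan]
print(block_softmax_weights([0.5, neg_inf]))      # normal path: [1.0, 0.0]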

xuhancn pushed a commit to xuhancn/pytorch that referenced this issue Jul 25, 2024
Fixes pytorch#127055.

NaNs are generated in flash attention because of the computation of `std::exp((-inf) - (-inf))` and `+/-inf * 0` in the lazy softmax. We fix the issue by avoiding the related calculation.

Pull Request resolved: pytorch#130014
Approved by: https://github.com/jgong5, https://github.com/drisspg
twoertwein (Contributor, Author) commented

Which version of PyTorch will have this fix? I just upgraded to 2.4 (on an ARM Mac, via pip) and I still have the same issue.

drisspg (Contributor) commented Jul 28, 2024

We are planning to add this to the next patch release. To access it now, you could run:
pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cpu
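
For anyone stuck on an affected release in the meantime, one possible workaround (an untested sketch, assuming the MHA call routes through SDPA here) is to keep the call off the CPU flash kernel by restricting SDPA to the math backend:

import torch
from torch.nn.attention import sdpa_kernel, SDPBackend

model = torch.nn.MultiheadAttention(embed_dim=2, num_heads=1)

n = 600
sequence = torch.ones(n, 2)
full = torch.full((n, n), float("-inf"))
mask = torch.triu(full, diagonal=1) + torch.tril(full, diagonal=-10)

# The math backend computes the full softmax per row; since every row can still
# attend to at least itself, no row is all -inf and no NaNs should appear.
with sdpa_kernel([SDPBackend.MATH]):
    out, _ = model(sequence, sequence, sequence, attn_mask=mask, need_weights=False)
print(out.isnan().any())  # expected: tensor(False)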

atalman added this to the 2.4.1 milestone Aug 15, 2024
pytorchbot pushed a commit that referenced this issue Aug 15, 2024
Fixes #127055.

NaNs are generated in flash attention because of the computation of `std::exp((-inf) - (-inf))` and `+/-inf * 0` in the lazy softmax. We fix the issue by avoiding the related calculation.

Pull Request resolved: #130014
Approved by: https://github.com/jgong5, https://github.com/drisspg

(cherry picked from commit 868d9a4)
atalman pushed a commit that referenced this issue Aug 21, 2024
[cpu][flash attention] fix nan issue (#130014)

Fixes #127055.

NaNs are generated in flash attention because of the computation of `std::exp((-inf) - (-inf))` and `+/-inf * 0` in the lazy softmax. We fix the issue by avoiding the related calculation.

Pull Request resolved: #130014
Approved by: https://github.com/jgong5, https://github.com/drisspg

(cherry picked from commit 868d9a4)

Co-authored-by: Valentine233 <xuan.liao@intel.com>
atalman (Contributor) commented Aug 29, 2024

Confirmed fixed in the final RC for 2.4.1:

tensor([[ 0.5036,  0.2297,  0.6542, -0.5339],
        [ 0.5036,  0.2297,  0.6542, -0.5339],
        [ 0.5036,  0.2297,  0.6542, -0.5339],
        ...,
        [ 0.5036,  0.2297,  0.6542, -0.5339],
        [ 0.5036,  0.2297,  0.6542, -0.5339],
        [ 0.5036,  0.2297,  0.6542, -0.5339]], grad_fn=<SqueezeBackward1>)
tensor([[ 0.5036,  0.2297,  0.6542, -0.5339],
        [ 0.5036,  0.2297,  0.6542, -0.5339],
        [ 0.5036,  0.2297,  0.6542, -0.5339],
        ...,
        [ 0.5036,  0.2297,  0.6542, -0.5339],
        [ 0.5036,  0.2297,  0.6542, -0.5339],
        [ 0.5036,  0.2297,  0.6542, -0.5339]], grad_fn=<SqueezeBackward1>)
