ROCm SDPA: Ensure attn_mask has the same dtype with q #143242
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/143242
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 1489d35 with merge base a1ae8fa.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "topic: not user facing"
@pytorchbot rebase
Please seek CI approval before scheduling CIFlow labels
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased 60b555e to 29bf7a8
@pytorchbot rebase
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.
Successfully rebased 29bf7a8 to 857b9c3
@xinyazhang The flex_attention failures look legit.
However, I have another concern. According to pytorch/aten/src/ATen/native/transformers/cuda/attention.cu, lines 1331 to 1336 at e56768f, CUDA's ME also assumes q.dtype() == attn_bias.dtype(). How does the CUDA backend make it work?
Update: Okay, I found the differences. (Although I'm still not sure how fp16 qkv + fp32 attn_mask works on NVIDIA.)
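For context, the scenario under discussion can be sketched at the Python level. The snippet below is a hypothetical, minimal illustration (not taken from the PR or the llama2 training setup): fp16 q/k/v with an fp32 additive attn_mask, explicitly routed to the memory-efficient (ME) SDPA backend. Shapes, values, and the explicit backend selection are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# Hypothetical repro sketch: fp16 q/k/v with an fp32 additive mask.
q = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
attn_mask = torch.zeros(1, 1, 128, 128, device="cuda", dtype=torch.float32)

# Restrict dispatch to the memory-efficient backend so the q/attn_mask
# dtype mismatch is exercised there rather than in flash or math.
with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)

# Per the PR description, this dtype combination could yield NaNs on ROCm
# before the fix; the question above is how the NVIDIA path handles it.
print(out.isnan().any())
```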
@pytorchbot merge -f "ROCm CI passed. Change only impacts ROCm"
Merge started
Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@pytorchbot cherry-pick --onto release/2.6
❌ 🤖 pytorchbot command failed.
@pytorchbot cherry-pick --onto release/2.6 -c critical
Cherry picking #143242
The cherry pick PR is at #144398 and it is recommended to link a critical cherry pick PR with an issue. The following tracker issues are updated:
Details for Dev Infra team: Raised by workflow job
ROCm SDPA: Ensure attn_mask has the same dtype with q (#143242)
This is required by the current AOTriton backend. Fixes NaN when calling the SDPA ME backend with `q.dtype() != attn_mask.dtype()` when training llama2 using transformers+deepspeed+pytorch.
Corresponding CUDA check seems to be here: https://github.com/pytorch/pytorch/blob/708ce3c0082d670d9eaff84bc3c43cad4554a75d/aten/src/ATen/native/transformers/cuda/attention.cu#L1331-L1336
Pull Request resolved: #143242
Approved by: https://github.com/jeffdaily
(cherry picked from commit 3068ce0)
Co-authored-by: Xinya Zhang <Xinya.Zhang@amd.com>
This is required by the current AOTriton backend.
Fixes NaN when calling the SDPA ME backend with `q.dtype() != attn_mask.dtype()` when training llama2 using transformers+deepspeed+pytorch.
The corresponding CUDA check appears to be in pytorch/aten/src/ATen/native/transformers/cuda/attention.cu, lines 1331 to 1336 at 708ce3c.
cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd
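As a user-level illustration of the same-dtype requirement (a sketch only, not code from this PR; the helper name below is hypothetical), a floating-point attn_mask can be pre-cast to the query dtype before calling SDPA:

```python
import torch.nn.functional as F

def sdpa_with_matched_mask(q, k, v, attn_mask=None, **kwargs):
    # Hypothetical wrapper: cast a floating-point attn_mask to q's dtype up
    # front, so the ME backend never sees q.dtype() != attn_mask.dtype().
    if attn_mask is not None and attn_mask.is_floating_point() and attn_mask.dtype != q.dtype:
        attn_mask = attn_mask.to(dtype=q.dtype)
    return F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask, **kwargs)
```

This mirrors, from the caller's side, the constraint that the ROCm ME backend enforces here because of AOTriton's requirements.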