bug fix: ensure 4d input in _scaled_dot_product_attention_math_mps #146623
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/146623
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 71b52ac with merge base 8a4dd76.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
auto final_out = (sq ? out.squeeze(0) : out);
auto final_attn = (sq ? attn.squeeze(0) : attn);

return {final_out, final_attn};
Suggested change:
- return {final_out, final_attn};
+ return {std::move(final_out), std::move(final_attn)};
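For context, here is the suggested epilogue as a self-contained sketch; the function name and signature are illustrative, not the actual kernel code.

```
#include <ATen/ATen.h>
#include <tuple>
#include <utility>

// Illustrative standalone version of the epilogue above (hypothetical
// signature): squeeze the leading batch dim back off when the inputs were
// originally 3d, then move the locals into the returned tuple so no extra
// Tensor copies (refcount bumps) are made.
std::tuple<at::Tensor, at::Tensor> sdpa_epilogue(const at::Tensor& out,
                                                 const at::Tensor& attn,
                                                 bool sq) {
  auto final_out = sq ? out.squeeze(0) : out;
  auto final_attn = sq ? attn.squeeze(0) : attn;
  return {std::move(final_out), std::move(final_attn)};
}
```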
@hellopahe thank you for the PR, can you please sign the CLA?
And last but not least, this PR could benefit from a test, so that it would not regress again.
I wonder if for the 2.6.1 milestone one could land a smaller fix that just falls back to the Math implementation when ndim is 3 (cc: @manuelcandales).
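As a rough illustration of that fallback idea (placeholder callables only, not the real ATen entry points), the guard could look something like this:

```
#include <ATen/ATen.h>
#include <tuple>

// Hedged sketch of the smaller-fix idea: route 3d queries to the generic math
// implementation instead of the MPS-specific kernel. `mps_kernel` and
// `math_fallback` are placeholders standing in for the real dispatch targets.
template <typename KernelFn, typename FallbackFn>
std::tuple<at::Tensor, at::Tensor> sdpa_with_3d_fallback(
    const at::Tensor& query,
    const at::Tensor& key,
    const at::Tensor& value,
    KernelFn mps_kernel,
    FallbackFn math_fallback) {
  if (query.dim() == 3) {
    // 3d (num_heads, seq_len, head_dim) input: the generic math path already
    // handles this shape the same way CPU/CUDA do.
    return math_fallback(query, key, value);
  }
  return mps_kernel(query, key, value);
}
```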
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
…tential dangling reference
@pytorchbot merge -f "Lint + MPS are green"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
This PR addresses an issue in the MPS backend's `_scaled_dot_product_attention_math_mps`: a 3d input of shape (num_heads, seq_len, query_dim) is not automatically treated as (1, num_heads, seq_len, query_dim), even though CPU and CUDA infer this. The fix adds a utility function that ensures a 4d shape. The issue was found in hiyouga/LLaMA-Factory#6835: in [transformers qwen2_vl](https://github.com/huggingface/transformers/blob/1590c664306766f32ba68c50e67f14d61b16925d/src/transformers/models/qwen2_vl/modeling_qwen2_vl.py#L373C14-L373C93), 3d q/k/v tensors were passed into the SDPA function, which led to an error. For consistency, and since this pattern might pop up elsewhere in the transformers codebase, I think it makes more sense to keep the same behavior across all platforms.

---

reproduce code:

```
import torch
import torch.nn.functional as F

head_num, seq_len, embed_dim = 16, 16, 80
bsz = 1

q = torch.randn(head_num, seq_len, embed_dim)
k = torch.randn(head_num, seq_len, embed_dim)
v = torch.randn(head_num, seq_len, embed_dim)
attention_mask = torch.ones(1, seq_len, seq_len)

oo_cpu = F.scaled_dot_product_attention(
    q.to("cpu"),
    k.to("cpu"),
    v.to("cpu"),
    attention_mask.to("cpu"),
    dropout_p=0.0
)

if torch.backends.mps.is_available():
    oo_mps = F.scaled_dot_product_attention(
        q.to("mps"),
        k.to("mps"),
        v.to("mps"),
        attention_mask.to("mps"),
        dropout_p=0.0
    )
    assert torch.allclose(oo_cpu, oo_mps.to("cpu"), atol=1e-5)
```

error outputs:

```
Traceback (most recent call last):
  File "/opt/homebrew/Caskroom/miniconda/base/envs/torch-dev/lib/python3.10/site-packages/IPython/core/interactiveshell.py", line 3577, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-2-5169b8d2c5dd>", line 21, in <module>
    oo_mps = F.scaled_dot_product_attention(
IndexError: Dimension out of range (expected to be in range of [-3, 2], but got 3)
```

hardware and envs:

```
torch 2.6.0
apple m3 max
```

---

Pull Request resolved: #146623
Approved by: https://github.com/malfet
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
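To make the shape handling concrete, here is a minimal standalone sketch of the 4d-ensuring idea described above; `ensure_4d` is a hypothetical name for illustration and may differ from the helper actually added in this PR.

```
#include <ATen/ATen.h>
#include <iostream>

// Promote a 3d (num_heads, seq_len, head_dim) tensor to 4d by adding a leading
// batch dimension, so the attention math can always assume
// (batch, num_heads, seq_len, head_dim). 4d inputs pass through unchanged.
static at::Tensor ensure_4d(const at::Tensor& t) {
  return t.dim() == 3 ? t.unsqueeze(0) : t;
}

int main() {
  auto q = at::randn({16, 16, 80});      // (num_heads, seq_len, head_dim)
  auto q4 = ensure_4d(q);                // (1, num_heads, seq_len, head_dim)
  std::cout << q4.sizes() << std::endl;  // prints [1, 16, 16, 80]
  // After the kernel runs on 4d tensors, the leading dimension is squeezed
  // away again when the original input was 3d (cf. the out.squeeze(0)
  // epilogue in the review snippet above).
  return 0;
}
```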