add sbgemv dispatch in torch cpu flash attention by taoye9 · Pull Request #151108 · pytorch/pytorch · GitHub

add sbgemv dispatch in torch cpu flash attention #151108


Open · taoye9 wants to merge 2 commits into main from spda_gemv

Conversation

@taoye9 taoye9 commented Apr 11, 2025

Summary

This PR introduces a dispatch to the OpenBLAS sbgemv kernel in the PyTorch CPU Flash Attention kernel when the query sequence length is 1.

Motivation

During the decoding phase of transformer models (e.g., autoregressive inference), the query tensor often has sequence length = 1. Currently, this leads to dispatching A(m, k) * B(k, n) to the general sbgemm kernel even when the operation is effectively a matrix-vector multiplication. This PR optimizes such cases by dispatching to sbgemv, which is better suited and shows measurable performance improvements.

Heuristic Consideration

Our heuristic ensures that the matmul is dispatched to sbgemv only when matrix A is multiplied by a vector B, which is the intended use case for GEMV operations. We also limit the dispatch to transb == NoTranspose, because when transb == Transpose the leading dimension (lda) might not be 1, which would force the sbgemv kernel to operate on non-contiguous memory, where it performs poorly.
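As a rough illustration, the dispatch condition can be sketched as follows. This is hypothetical Python pseudocode, not the actual CPUBlas dispatch code, and all names are illustrative:

def should_dispatch_to_sbgemv(transb, n):
    # A(m, k) * B(k, n) degenerates to a matrix-vector product when n == 1,
    # e.g. a query of sequence length 1 during decoding.
    if n != 1:
        return False
    # Restrict to the non-transposed B case: with transb == Transpose the
    # vector's elements may not be contiguous in memory, and sbgemv performs
    # poorly on non-contiguous input.
    return transb == "NoTranspose"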

Benchmark result

Benchmarked using torch.nn.functional.scaled_dot_product_attention on Neoverse™ V1.

Configuration:

  • OMP_NUM_THREADS=16
  • Tensor shapes:
    • Query: [1, 16, 1, 32]
    • Key: [1, 16, 1500, 32]
    • Value: [1, 16, 1500, 32]
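With these shapes, each head's q · kᵀ multiply is (1 × 32) · (32 × 1500) and the subsequent attn · v multiply is (1 × 1500) · (1500 × 32), so both degenerate to vector-matrix products; this is exactly the case the new sbgemv dispatch targets.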

Results:

Kernel   Latency (µs)   Speedup
sbgemm   121.700        (baseline)
sbgemv   104.663        ~16%

Benchmark script

import torch
import time
import numpy as np
from torch.profiler import profile, record_function, ProfilerActivity

class SimpleAttentionModel(torch.nn.Module):
    def __init__(self, query, key, value):
        super().__init__()
        self.query = query
        self.key = key
        self.value = value

    def forward(self, attn_mask=None):
        return torch.nn.functional.scaled_dot_product_attention(
            self.query,
            self.key,
            self.value,
            attn_mask=attn_mask)


# Benchmark mirroring the shapes used by BertSdpaSelfAttention.
def bench_sdpa(batch_size=1, num_attention_heads=16, sequence_length=142,
               query_sequence_length=142, hidden_size=1024, precision=torch.float32):
    with torch.no_grad():
        attention_head_size = hidden_size // num_attention_heads

        query = torch.rand(size=(batch_size, num_attention_heads, query_sequence_length, attention_head_size), dtype=precision)
        key = torch.rand(size=(batch_size, num_attention_heads, sequence_length, attention_head_size), dtype=precision)
        value = torch.rand(size=(batch_size, num_attention_heads, sequence_length, attention_head_size), dtype=precision)

        model = SimpleAttentionModel(query, key, value)
        model.eval()

        # Warm-up runs.
        for _ in range(100):
            model()

        # Timed runs; time_ns deltas are converted to microseconds.
        times = []
        n_iters = 10000
        for _ in range(n_iters):
            s = time.time_ns()
            model()
            times.append((time.time_ns() - s) / 1e3)

        print(f"Min Times = {np.min(times)} us")
        print(f"Mean Times = {np.mean(times)} us")


if __name__ == "__main__":
    batch_size = 1
    num_attention_heads = 16
    sequence_length = 1500
    query_sequence_length = 1
    hidden_size = 512

    print("BF16 mode:")
    with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
        with record_function("model_inference"):
            bench_sdpa(batch_size=batch_size,
                       num_attention_heads=num_attention_heads,
                       sequence_length=sequence_length,
                       query_sequence_length=query_sequence_length,
                       hidden_size=hidden_size,
                       precision=torch.bfloat16)
    profile_data = prof.key_averages(group_by_input_shape=True).table(sort_by="cpu_time_total")
    print(profile_data)

cc @malfet @snadampal @milpuz01 @aditew01 @nikhil-arm @fadara01

pytorch-bot bot commented Apr 11, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/151108

Note: Links to docs will display an error until the docs builds have been completed.


✅ No Failures

As of commit c52849f with merge base d759a51:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

linux-foundation-easycla bot commented Apr 11, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@soulitzer soulitzer requested a review from drisspg April 11, 2025 23:08
@soulitzer soulitzer added the "triaged" label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) Apr 11, 2025
@soulitzer
Contributor

@drisspg do you know who should look at this?

@aditew01 aditew01 added the "module: arm" label (Related to ARM architecture builds of PyTorch. Includes Apple M1) Apr 14, 2025
@fadara01
Collaborator

@pytorchbot label "ciflow/linux-aarch64"

@pytorch-bot pytorch-bot bot added the "ciflow/linux-aarch64" label (linux aarch64 CI workflow) Apr 14, 2025
@taoye9
Author
taoye9 commented Apr 14, 2025

This PR should remain pending until OpenBLAS 0.3.30 is released.

@nikhil-arm nikhil-arm requested a review from aditew01 April 14, 2025 10:16
@fadara01
Collaborator

@pytorchbot label "topic: not user facing"

@pytorch-bot pytorch-bot bot added the "topic: not user facing" label (topic category) Apr 14, 2025
@aditew01
Collaborator

Changes LGTM!
Can we make sure there's a UT in place for this, or that it will be covered by existing test cases?

@taoye9
Author
taoye9 commented Apr 14, 2025

Changes LGTM! Can we make sure there's a UT in place for this, or that it will be covered by existing test cases?

Yes, that's what I'd like to do, but could someone point out the proper place to add such UTs?

@aditew01
Collaborator
aditew01 commented Apr 14, 2025

Yes, that's what I'd like to do, but could someone point out the proper place to add such UTs?

Check test_linalg.py. I think we can add a relevant case there (if not already present).

@drisspg
Contributor
drisspg commented Apr 14, 2025

+1 on UT

Collaborator
@aditew01 aditew01 left a comment


Can we make the description clearer and add a comment explaining the heuristic?
Also, as discussed above, possibly add a UT.

@pytorch-bot pytorch-bot bot removed the "ciflow/linux-aarch64" label Apr 15, 2025
@parametrize("m", [32, 35, 36, 40, 64, 128])
@parametrize("k", [32, 35, 36, 40, 64, 128])
# NOTE: This is intended to cover sbgemv_ testcase in CPUBlas.cpp.
def test_lowprecision_gemv_cpu(self, device, dtype, m, k):
Collaborator


Thanks! That's very thoughtful, I like that it covers both the transposed and non-transposed case. A minor comment.

Question: Would it make sense to split the transposed and non-transposed cases into separate tests for clarity and easier debugging?
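For context, here is a minimal sketch of what the body of such a test might look like, assuming the parametrized dtype is torch.bfloat16; the tolerances and the use of torch.mv as the entry point are assumptions for illustration, not the code under review:

a = torch.rand((m, k), dtype=torch.bfloat16, device=device)
x = torch.rand((k,), dtype=torch.bfloat16, device=device)
# Non-transposed case: A(m, k) @ x should take the sbgemv path.
torch.testing.assert_close(
    torch.mv(a, x),
    torch.mv(a.float(), x.float()).to(torch.bfloat16),
    atol=1e-2, rtol=1e-2)
# Transposed case: A.t() @ y exercises the transb == Transpose branch.
y = torch.rand((m,), dtype=torch.bfloat16, device=device)
torch.testing.assert_close(
    torch.mv(a.t(), y),
    torch.mv(a.t().float(), y.float()).to(torch.bfloat16),
    atol=1e-2, rtol=1e-2)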

Collaborator
@lezcano lezcano left a comment


Can you provide benchmarks of the before / after for a few relevant shapes?

@aditew01
Collaborator

An alternative approach is to enable this path in OpenBLAS. @taoye9 already has a PR for that; linking it here for visibility:
OpenMathLib/OpenBLAS#5260

@taoye9
Author
taoye9 commented May 13, 2025

Can you provide benchmarks of the before / after for a few relevant shapes?

Hi, sorry, we have been holding this PR for a while to further investigate which approach is best, i.e. implementing this inside OpenBLAS or inside PyTorch.

@malfet malfet added the "ciflow/trunk" label (Trigger trunk jobs on your pull request) May 14, 2025
Contributor
@malfet malfet left a comment


LGTM if it passes CI, but it would also be good to enable this directly for the torch.mv call as well.
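For reference, a bfloat16 torch.mv call that such an extension could cover; the shapes here are arbitrary:

import torch

mat = torch.rand(128, 64, dtype=torch.bfloat16)  # A(m, k)
vec = torch.rand(64, dtype=torch.bfloat16)       # x(k)
out = torch.mv(mat, vec)                         # bf16 matrix-vector product on CPU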

@fadara01
Collaborator

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased spda_gemv onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout spda_gemv && git pull --rebase)

@pytorch-bot pytorch-bot bot removed the "ciflow/trunk" label May 14, 2025