8000 ggml: aarch64: Implement SVE F32 kernels for Mamba Model by vineelabhinav · Pull Request #13602 · ggml-org/llama.cpp · GitHub
Open · wants to merge 4 commits into base: master
Conversation

@vineelabhinav commented May 17, 2025

This PR adds SVE kernel support for the F32 datatype, specific to the Mamba model, on the ARM architecture.
Major code changes:

  1. Add SVE support for ggml_vec_dot_f32() function.
  2. Add SVE support for ggml_compute_forward_ssm_scan_f32() function.
  3. Add SVE support for ggml_vec_mad_f32() function.
  4. Add SVE support for ggml_vec_scale_f32() function.

Performance

This PR improves performance by ~1.3x compared to the previous NEON-based implementation.
Model: falcon-mamba-7B-F32.gguf
Command: ./build/bin/llama-bench -m falcon-mamba-7B-F32.gguf -t 8,16,32,64 -p 128,1024 -n 0

  • Task1: Prompt Length: 128 tokens, Generated Tokens: 1 token

| Threads | NEON (tokens/sec) | SVE (tokens/sec) | Ratio |
| --- | --- | --- | --- |
| 8 | 9.21 | 12.52 | 1.36 |
| 16 | 17.89 | 23.85 | 1.33 |
| 32 | 32.3 | 41.59 | 1.29 |
| 64 | 53.08 | 62.94 | 1.19 |

  • Task2: Prompt Length: 1024 tokens, Generated Tokens: 1 token

| Threads | NEON (tokens/sec) | SVE (tokens/sec) | Ratio |
| --- | --- | --- | --- |
| 8 | 8.95 | 11.66 | 1.30 |
| 16 | 17.3 | 21.97 | 1.27 |
| 32 | 31.07 | 38.48 | 1.24 |
| 64 | 50.73 | 58.99 | 1.16 |
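As a quick cross-check, the Ratio column above is simply SVE throughput divided by NEON throughput; a small sketch (the sample values in the comments below are copied from the Task1 table):

```c
#include <assert.h>
#include <math.h>

// Ratio column of the tables above: SVE tokens/sec divided by NEON tokens/sec.
// e.g. 8 threads:  12.52 / 9.21  -> ~1.36
//      64 threads: 62.94 / 53.08 -> ~1.19
static double speedup(double neon_tps, double sve_tps) {
    return sve_tps / neon_tps;
}
```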

Perplexity

There is no change in model accuracy as a result of this PR.
Command: ./build/bin/llama-perplexity -s 0 -np 128 -t 64 -m falcon-mamba-7B-F32.gguf -c 128 -b 128 --chunks 16 -f scripts/wikitext-2-raw/wiki.test.raw

| NEON | SVE |
| --- | --- |
| 7.6153 +/- 0.66890 | 7.6153 +/- 0.66890 |

Contributor: Vineel Abhinav Gottala

cc: @Vithulep

@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label May 17, 2025
@abhijain1204fujitsu commented

Hi @ggerganov, please help review this PR.

Member @ggerganov left a comment:

It's better to split the PR in 2 parts. First part with:

  • Add SVE support for ggml_vec_dot_f32() function.
  • Add SVE support for ggml_vec_mad_f32() function.
  • Add SVE support for ggml_vec_scale_f32() function.

The second part with Mamba-specific changes.

For the first part I need to see what the improvement is over the existing GGML_SIMD implementation (using ARM_NEON, for example), which AFAIK should always be available when SVE is available.

Comment on lines +152 to +186
#if defined(__ARM_FEATURE_SVE)

    const int sve_register_length = ggml_cpu_get_sve_cnt() * 8;
    const int ggml_f32_epr = sve_register_length / 32;//8;//svcntw(); // SVE128:4, SVE256:8, SVE512:16
    const int ggml_f32_step = 2 * ggml_f32_epr;
    GGML_F32_VEC vx = GGML_F32_VEC_SET1(v);

    const int np = (n & ~(ggml_f32_step - 1));
    svfloat32_t ax1,ax2;
    svfloat32_t ay1,ay2;
    for ( int i = 0; i < np; i += ggml_f32_step) {
        ax1 = GGML_F32_VEC_LOAD(x + i);
        ay1 = GGML_F32_VEC_LOAD(y + i);
        ay1 = GGML_F32_VEC_FMA(ax1, vx, ay1);

        GGML_F32_VEC_STORE(y + i, ay1);

        ax2 = GGML_F32_VEC_LOAD(x + i + 1*ggml_f32_epr);
        ay2 = GGML_F32_VEC_LOAD(y + i + 1*ggml_f32_epr);
        ay2 = GGML_F32_VEC_FMA(ax2, vx, ay2);

        GGML_F32_VEC_STORE(y + i + 1*ggml_f32_epr, ay2);
    }
    // leftovers
    // maximum number of leftover elements will be less that ggml_f32_epr. Apply predicated svmad on available elements only
    if(np<n) {
        svbool_t pg =svwhilelt_b32(np, n);
        ax1 = svld1_f32(pg, x + np);
        ay1 = svld1_f32(pg, y + np);
        ay1 = svmad_f32_m(pg, ax1, vx, ay1);

        svst1_f32(pg, y + np, ay1);
    }
#else
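For readers following the diff: the operation being vectorized here is an in-place multiply-add, y[i] += x[i] * v. A minimal scalar reference (a sketch for clarity, not code from this PR):

```c
#include <assert.h>

// Scalar reference for what ggml_vec_mad_f32 computes: y[i] += x[i] * v.
// The SVE kernel processes 2*ggml_f32_epr floats per iteration, then uses
// an svwhilelt_b32 predicate so the tail elements are handled without a
// scalar cleanup loop.
static void vec_mad_f32_ref(const int n, float * y, const float * x, const float v) {
    for (int i = 0; i < n; ++i) {
        y[i] += x[i] * v;
    }
}
```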
Member:
How big is the benefit from these special-cased implementations compared to using the GGML_SIMD abstraction? If the benefit is not significant, it's better to use the existing implementation and avoid this extra code.

Author @vineelabhinav commented May 22, 2025:
Hi @ggerganov,
The function ggml_vec_mad_f32() is called from the flash attention operation. That operation is currently supported in models such as phi-3-4k, falcon-7B and Qwen2-7B, all of which I have converted to F32 GGUF format. However, none of these models use ggml_compute_forward_flash_attn_back_f32(); instead they use ggml_compute_forward_flash_attn_ext_f16(), which does not call ggml_vec_mad_f32(). For this reason I couldn't show performance results. Still, I expect a good benefit over the NEON version for any model that does use this function, since we saw a speedup for ggml_vec_dot_f32(), which is similar.
Will it be fine to proceed with pushing this function?

Author:

Hi @ggerganov,
Following up on the previous comment. Please let me know if you’re okay with proceeding.

Member @ggerganov left a comment:

Separate the changes in ops.cpp in a separate PR.

Comment on lines +321 to +322
svfloat32_t ay1,ay2;
for ( int i = 0; i < np; i += ggml_f32_step) {
Member:

Suggested change
- svfloat32_t ay1,ay2;
- for ( int i = 0; i < np; i += ggml_f32_step) {
+ svfloat32_t ay1;
+ svfloat32_t ay2;
+ for (int i = 0; i < np; i += ggml_f32_step) {

}
// leftovers
// maximum number of leftover elements will be less that ggml_f32_epr. Apply predicated svmad on available elements only
if(np<n) {
Member:

Suggested change
- if(np<n) {
+ if (np < n) {

// maximum number of leftover elements will be less that ggml_f32_epr. Apply predicated svmad on available elements only
if(np<n) {
svbool_t pg = svwhilelt_b32(np, n);
ay1 = svld1_f32(pg, y+np);
Member:

Suggested change
- ay1 = svld1_f32(pg, y+np);
+ ay1 = svld1_f32(pg, y + np);

}
// leftovers
// maximum number of leftover elements will be less that ggml_f32_epr. Apply predicated svmad on available elements only
if(np<n) {
Member:

Suggested change
- if(np<n) {
+ if (np < n) {

Comment on lines +160 to +162
svfloat32_t ax1,ax2;
svfloat32_t ay1,ay2;
for ( int i = 0; i < np; i += ggml_f32_step) {
Member:

Suggested change
- svfloat32_t ax1,ax2;
- svfloat32_t ay1,ay2;
- for ( int i = 0; i < np; i += ggml_f32_step) {
+ svfloat32_t ax1, ax2;
+ svfloat32_t ay1, ay2;
+ for (int i = 0; i < np; i += ggml_f32_step) {

Comment on lines +80 to +81
if(np2<n){
svbool_t pg = svwhilelt_b32(np2,n);
Member:

Suggested change
- if(np2<n){
- svbool_t pg = svwhilelt_b32(np2,n);
+ if (np2 < n){
+ svbool_t pg = svwhilelt_b32(np2, n);

// leftovers
// Since 8 unrolls are done in above loop, leftovers lie in range [0, ggml_f32_step] which is handled in below loop
const int np2 = (n & ~(ggml_f32_epr - 1));
for ( int i = np; i < np2; i += ggml_f32_epr) {
Member:

Suggested change
- for ( int i = np; i < np2; i += ggml_f32_epr) {
+ for (int i = np; i < np2; i += ggml_f32_epr) {

svfloat32_t sum8 = svdup_n_f32(0.0f);
svfloat32_t ax1,ax2,ax3,ax4,ax5,ax6,ax7,ax8;
svfloat32_t ay1,ay2,ay3,ay4,ay5,ay6,ay7,ay8;
for ( int i = 0; i < np; i += ggml_f32_step) {
Member:

Suggested change
- for ( int i = 0; i < np; i += ggml_f32_step) {
+ for (int i = 0; i < np; i += ggml_f32_step) {

Comment on lines +21 to +23
#if defined(__ARM_FEATURE_SVE)

const int sve_register_length = ggml_cpu_get_sve_cnt() * 8;
Member:

Suggested change
- #if defined(__ARM_FEATURE_SVE)
-
- const int sve_register_length = ggml_cpu_get_sve_cnt() * 8;
+ #if defined(__ARM_FEATURE_SVE)
+ const int sve_register_length = ggml_cpu_get_sve_cnt() * 8;
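For context, ggml_cpu_get_sve_cnt() appears to return the SVE vector length in bytes, so multiplying by 8 gives the length in bits, and dividing by 32 gives the number of F32 lanes per register. A sketch of that arithmetic (illustrative, not the PR's code):

```c
#include <assert.h>

// F32 lanes per SVE register for a given vector length in bits;
// matches the comment in the diff: SVE128:4, SVE256:8, SVE512:16.
static int f32_lanes_for_bits(int vector_bits) {
    return vector_bits / 32;
}
```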

Comment on lines +28 to +43
#define GGML_F32xt svfloat32_t
#define GGML_F32xt_ZERO svdup_n_f32(0.0f)
#define GGML_F32xt_SET1(x) svdup_n_f32(x)
#define GGML_F32xt_LOAD_IMPL(pg, a, ...) svld1_f32(pg, a)
#define GGML_F32xt_LOAD(...) GGML_F32xt_LOAD_IMPL(DEFAULT_PG, __VA_ARGS__)
#define GGML_F32xt_STORE_IMPL(pg,a,b) svst1_f32(pg, a, b)
#define GGML_F32xt_STORE(...) GGML_F32xt_STORE_IMPL(DEFAULT_PG, __VA_ARGS__)
#define GGML_F32xt_FMA_IMPL(pg, a, b, c) svmad_f32_m(pg, a, b, c)
#define GGML_F32xt_FMA(...) GGML_F32xt_FMA_IMPL(DEFAULT_PG, __VA_ARGS__)
#define GGML_F32xt_ADD_IMPL(pg, a, b) svadd_f32_m(pg, a, b)
#define GGML_F32xt_ADD(...) GGML_F32xt_ADD_IMPL(DEFAULT_PG, __VA_ARGS__)
#define GGML_F32xt_MUL_IMPL(pg, a, b) svmul_f32_m(pg, a, b)
#define GGML_F32xt_MUL(...) GGML_F32xt_MUL_IMPL(DEFAULT_PG, __VA_ARGS__)
#define GGML_F32xt_REDUCE_ONE_IMPL(pg, a) svaddv(pg, a)
#define GGML_F32xt_REDUCE_ONE(...) GGML_F32xt_REDUCE_ONE_IMPL(DEFAULT_PG, __VA_ARGS__)
#define GGML_F32xt_REDUCE_IMPL(pg, res, sum1, sum2, sum3, sum4, sum5, sum6, sum7, sum8) \
Member:

Vertical-align these
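The macro family above uses a two-layer pattern: an _IMPL macro takes the predicate explicitly, and a variadic front-end injects a default predicate (DEFAULT_PG). A minimal predicate-free sketch of the same wrapper trick (names are illustrative, not from the PR):

```c
#include <assert.h>

// Two-layer macro pattern: the _IMPL variant takes an explicit first
// argument; the public variant injects a default via __VA_ARGS__.
#define MY_DEFAULT_K 10
#define SCALE_IMPL(k, x) ((k) * (x))
#define SCALE(...)       SCALE_IMPL(MY_DEFAULT_K, __VA_ARGS__)
```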
