ggml-cpu: replace AArch64 NEON assembly with intrinsics in ggml_gemv_q4_0_4x4_q8_0() #10567
Conversation
Force-pushed from 69dd941 to bde1b96
Using intrinsics is definitely preferred.
I also don't observe a significant performance change. Here are the results on M2 Ultra, using the following patch to force the Q4_0_4_4 repack:
```diff
diff --git a/ggml/src/ggml-cpu/ggml-cpu-aarch64.c b/ggml/src/ggml-cpu/ggml-cpu-aarch64.c
index 6d2c0adc3..cac45278b 100644
--- a/ggml/src/ggml-cpu/ggml-cpu-aarch64.c
+++ b/ggml/src/ggml-cpu/ggml-cpu-aarch64.c
@@ -3812,7 +3812,7 @@ enum ggml_type ggml_aarch64_get_optimal_repack_type(const struct ggml_tensor * c
         return GGML_TYPE_Q4_0_8_8;
     }
     if (ggml_cpu_has_neon() && ggml_cpu_has_matmul_int8()) {
-        return GGML_TYPE_Q4_0_4_8;
+        //return GGML_TYPE_Q4_0_4_8;
     }
     if (ggml_cpu_has_neon() && ggml_cpu_has_dotprod()) {
         return GGML_TYPE_Q4_0_4_4;
```
```shell
make -j && ./bin/llama-bench -m ../models/qwen2.5-1.5b-coder/ggml-model-q4_0.gguf -t 1,2,4,8,16 -p 0 -n 64 -fa 1
```
master
model | size | params | backend | threads | fa | test | t/s |
---|---|---|---|---|---|---|---|
qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU | 1 | 1 | tg64 | 37.44 ± 0.19 |
qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU | 2 | 1 | tg64 | 65.51 ± 0.19 |
qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU | 4 | 1 | tg64 | 117.41 ± 0.35 |
qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU | 8 | 1 | tg64 | 158.53 ± 0.30 |
qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU | 16 | 1 | tg64 | 159.32 ± 1.79 |
build: 7281cf1 (4211)
PR
model | size | params | backend | threads | fa | test | t/s |
---|---|---|---|---|---|---|---|
qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU | 1 | 1 | tg64 | 36.92 ± 0.12 |
qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU | 2 | 1 | tg64 | 66.53 ± 0.55 |
qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU | 4 | 1 | tg64 | 114.92 ± 1.64 |
qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU | 8 | 1 | tg64 | 158.25 ± 0.17 |
qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU | 16 | 1 | tg64 | 157.57 ± 4.43 |
build: bde1b96 (4210)
Force-pushed from bde1b96 to 7091a85
Also, I replicated the current C/asm code, but I would have done something more like this:

```diff
-            ret = vdotq_laneq_s32(ret, b0 << 4, a0, 0);
-            ret = vdotq_laneq_s32(ret, b1 << 4, a0, 1);
-            ret = vdotq_laneq_s32(ret, b2 << 4, a0, 2);
-            ret = vdotq_laneq_s32(ret, b3 << 4, a0, 3);
+            ret = vdotq_laneq_s32(ret, b0 >> 4, a0, 0);
+            ret = vdotq_laneq_s32(ret, b1 >> 4, a0, 1);
+            ret = vdotq_laneq_s32(ret, b2 >> 4, a0, 2);
+            ret = vdotq_laneq_s32(ret, b3 >> 4, a0, 3);

-            ret = vdotq_laneq_s32(ret, b0 & 0xf0U, a1, 0);
-            ret = vdotq_laneq_s32(ret, b1 & 0xf0U, a1, 1);
-            ret = vdotq_laneq_s32(ret, b2 & 0xf0U, a1, 2);
-            ret = vdotq_laneq_s32(ret, b3 & 0xf0U, a1, 3);
+            ret = vdotq_laneq_s32(ret, b0 & 0xfU, a1, 0);
+            ret = vdotq_laneq_s32(ret, b1 & 0xfU, a1, 1);
+            ret = vdotq_laneq_s32(ret, b2 & 0xfU, a1, 2);
+            ret = vdotq_laneq_s32(ret, b3 & 0xfU, a1, 3);

-            acc = vfmaq_f32(acc, vcvtq_n_f32_s32(ret, 4),
+            acc = vfmaq_f32(acc, vcvtq_f32_s32(ret),
```

If anyone has an explanation for why it was done this way, I'm interested.
This does not seem to produce correct output:

```shell
make -j && ./bin/llama-cli -m ../models/llama-3.2-3b-instruct/ggml-model-q4_0.gguf -p "I believe the meaning of life is" -n 32 -s 1
```

```
I believe the meaning of life is anekloatbounceракahnurnجوistle Scene ERA Ordinary spherecord Cheat Spherebenchbenchogle leaguesipogl ordin usefulnessecutenchhec batchjaxrigerrig
```
I had only tested with
I'll dig later :)
@ggerganov I believe the model you're using …

This is what I get with the …
```c
// Think of an example:
A = 0b1111nnnn;
A[high] == Q4 == 0b1111
// Because it uses signed int4:
A[high] == -1
// Now if you do:
A >> 4   => 0b00001111
// the result is a value of +15.
A & 0xF0 => 0b11110000
// the result is a value of -16 = -1 * 2^4, so keep it like that
// if you don't want to deal with negative Q4 values.
```

Note: I hope I didn't make too many errors in my "math" 😎
Force-pushed from 7587b42 to 6b6b98f
…q4_0_4x4_q8_0()

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
Force-pushed from 6b6b98f to aaa6682
I ran several tests today with many models (like …). Just for your information, I also wrote the code to match the assembly version and not the C version, which slightly differs.

C version:

NEON assembly:
Nice. Btw, to clarify: my #10567 (comment) was about the change that you suggested in #10567 (comment). Without this change (i.e. using the PR as it is), everything works on my end. If I apply the change, the output becomes incorrect.
I tested this on M2 Max and Snapdragon X-Elite.
…q4_0_4x4_q8_0() (ggml-org#10567)

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
This PR improves code readability and lays the groundwork for potential optimizations in the future.
For now, I have limited the changes to a single function to ensure this approach is OK for everyone.
I did not observe any significant performance differences using `llama-bench`.