ggml-cpu: replace AArch64 NEON assembly with intrinsics in ggml_gemv_q4_0_4x4_q8_0() #10567 · ggml-org/llama.cpp


Conversation

@angt (Contributor) commented on Nov 28, 2024

This PR improves code readability and lays the groundwork for potential optimizations in the future.
For now, I have limited the changes to a single function to ensure this approach is OK for everyone.
I did not observe any significant performance differences using llama-bench.
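
For context, here is a minimal sketch of the pattern involved (illustrative only, not the PR's actual code; the helper name gemv_4x4_step and the operand layout are assumptions). The point is that per-block dot products expressed with intrinsics such as vdotq_laneq_s32 stay visible to the compiler, unlike an opaque inline-asm block:

// Illustrative sketch only, not the PR's code. Assumes AArch64 with the
// dotprod extension (__ARM_FEATURE_DOTPROD); helper name and layout are hypothetical.
#include <arm_neon.h>

// For each weight vector wN, dot every 4-byte group of wN with the N-th
// 4-byte group of the activation vector, accumulating into the 4 lanes of acc.
static inline int32x4_t gemv_4x4_step(int32x4_t acc,
                                      int8x16_t w0, int8x16_t w1,
                                      int8x16_t w2, int8x16_t w3,
                                      int8x16_t act) {
    acc = vdotq_laneq_s32(acc, w0, act, 0);
    acc = vdotq_laneq_s32(acc, w1, act, 1);
    acc = vdotq_laneq_s32(acc, w2, act, 2);
    acc = vdotq_laneq_s32(acc, w3, act, 3);
    return acc;
}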

@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Nov 28, 2024
@angt angt force-pushed the ggml-cpu-replace-aarch64-neon-assembly-with-intrinsics branch from 69dd941 to bde1b96 Compare November 28, 2024 14:53
@ggerganov (Member) left a comment

Using intrinsics is definitely preferred.

I also don't observe a significant performance change. Here are the results on M2 Ultra, using the following patch to force the Q4_0_4_4 repack:

diff --git a/ggml/src/ggml-cpu/ggml-cpu-aarch64.c b/ggml/src/ggml-cpu/ggml-cpu-aarch64.c
index 6d2c0adc3..cac45278b 100644
--- a/ggml/src/ggml-cpu/ggml-cpu-aarch64.c
+++ b/ggml/src/ggml-cpu/ggml-cpu-aarch64.c
@@ -3812,7 +3812,7 @@ enum ggml_type ggml_aarch64_get_optimal_repack_type(const struct ggml_tensor * c
             return GGML_TYPE_Q4_0_8_8;
         }
         if (ggml_cpu_has_neon() && ggml_cpu_has_matmul_int8()) {
-            return GGML_TYPE_Q4_0_4_8;
+            //return GGML_TYPE_Q4_0_4_8;
         }
         if (ggml_cpu_has_neon() && ggml_cpu_has_dotprod()) {
             return GGML_TYPE_Q4_0_4_4;

make -j && ./bin/llama-bench -m ../models/qwen2.5-1.5b-coder/ggml-model-q4_0.gguf -t 1,2,4,8,16 -p 0 -n 64 -fa 1

master

| model           | size       | params | backend | threads | fa | test | t/s           |
| --------------- | ---------- | ------ | ------- | ------- | -- | ---- | ------------- |
| qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU     | 1       | 1  | tg64 | 37.44 ± 0.19  |
| qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU     | 2       | 1  | tg64 | 65.51 ± 0.19  |
| qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU     | 4       | 1  | tg64 | 117.41 ± 0.35 |
| qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU     | 8       | 1  | tg64 | 158.53 ± 0.30 |
| qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU     | 16      | 1  | tg64 | 159.32 ± 1.79 |

build: 7281cf1 (4211)

PR

| model           | size       | params | backend | threads | fa | test | t/s           |
| --------------- | ---------- | ------ | ------- | ------- | -- | ---- | ------------- |
| qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU     | 1       | 1  | tg64 | 36.92 ± 0.12  |
| qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU     | 2       | 1  | tg64 | 66.53 ± 0.55  |
| qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU     | 4       | 1  | tg64 | 114.92 ± 1.64 |
| qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU     | 8       | 1  | tg64 | 158.25 ± 0.17 |
| qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU     | 16      | 1  | tg64 | 157.57 ± 4.43 |

build: bde1b96 (4210)

@angt angt force-pushed the ggml-cpu-replace-aarch64-neon-assembly-with-intrinsics branch from bde1b96 to 7091a85 Compare November 28, 2024 15:23
@angt (Contributor, Author) commented on Nov 28, 2024

Also, I replicated the current C/asm code as-is, but I would have done something more like this:

-                ret = vdotq_laneq_s32(ret, b0 << 4, a0, 0);
-                ret = vdotq_laneq_s32(ret, b1 << 4, a0, 1);
-                ret = vdotq_laneq_s32(ret, b2 << 4, a0, 2);
-                ret = vdotq_laneq_s32(ret, b3 << 4, a0, 3);
+                ret = vdotq_laneq_s32(ret, b0 >> 4, a0, 0);
+                ret = vdotq_laneq_s32(ret, b1 >> 4, a0, 1);
+                ret = vdotq_laneq_s32(ret, b2 >> 4, a0, 2);
+                ret = vdotq_laneq_s32(ret, b3 >> 4, a0, 3);
 
-                ret = vdotq_laneq_s32(ret, b0 & 0xf0U, a1, 0);
-                ret = vdotq_laneq_s32(ret, b1 & 0xf0U, a1, 1);
-                ret = vdotq_laneq_s32(ret, b2 & 0xf0U, a1, 2);
-                ret = vdotq_laneq_s32(ret, b3 & 0xf0U, a1, 3);
+                ret = vdotq_laneq_s32(ret, b0 & 0xfU, a1, 0);
+                ret = vdotq_laneq_s32(ret, b1 & 0xfU, a1, 1);
+                ret = vdotq_laneq_s32(ret, b2 & 0xfU, a1, 2);
+                ret = vdotq_laneq_s32(ret, b3 & 0xfU, a1, 3);
 
-                acc = vfmaq_f32(acc, vcvtq_n_f32_s32(ret, 4),
+                acc = vfmaq_f32(acc, vcvtq_f32_s32(ret),

If anyone has an explanation for why it was done this way, I'm interested.

@ggerganov (Member) commented

This does not seem to produce correct output:

make -j && ./bin/llama-cli -m ../models/llama-3.2-3b-instruct/ggml-model-q4_0.gguf -p "I believe the meaning of life is" -n 32 -s 1

I believe the meaning of life is anekloatbounceракahnurnجوistle Scene ERA Ordinary spherecord Cheat Spherebenchbenchogle leaguesipogl ordin usefulnessecutenchhec batchjaxrigerrig

@angt (Contributor, Author) commented on Nov 28, 2024

I had only tested with Llama-3.2-1B-Instruct-Q4_0_4_4.gguf, and the output was the same:

I believe the meaning of life is the concept of finding one's purpose and fulfilling a person's potential.
The meaning of life is indeed a profound question that has puzzled philosophers, theologians, and

I'll dig into it later :)

@angt (Contributor, Author) commented on Nov 28, 2024

@ggerganov I believe the model you're using (llama-3.2-3b-instruct/ggml-model-q4_0.gguf) doesn't use the function ggml_gemv_q4_0_4x4_q8_0 modified by this PR, but ggml_gemv_q4_0_4x8_q8_0 instead.

This is what I get with the master branch:

./llama-cli -m /home/ubuntu/models/Llama-3.2-3B-Instruct-Q4_0.gguf -p "I believe the meaning of life is" -n 32 -s 1
I believe the meaning of life isเJOftionfortujiartuftuftuftipfmeapsujiathuIquiKpstuqf

@Djip007 (Contributor) commented on Nov 28, 2024

If anyone has an explanation for why it was done this way, I'm interested.

// Think of an example:
A = 0b1111nnnn;
A[high] == Q4 == 0b1111
// because it uses signed int4:
A[high] == -1
// now if you do:
A >> 4   => 0b00001111
// result: a value of +15
A & 0xF0 => 0b11110000
// result: a value of -16 = -1 * 2^4

So keep it like that if you don't want to deal with negative Q4 values.

Note: I hope I didn't make too many errors with my "math" 😎
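
To make the "math" concrete, here is a small scalar C example (the byte value is made up for illustration; the division by 16 stands in for the vcvtq_n_f32_s32(ret, 4) fixed-point conversion already present in the kernel):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    // One packed byte: high nibble = 0b1111 (-1 as signed int4),
    // low nibble = 0b0011 (+3 as signed int4).
    uint8_t packed = 0xF3;

    // Keeping each nibble in the HIGH bits of an int8 lets the byte's sign bit
    // sign-extend the 4-bit value for free; the extra factor of 16 (2^4) is
    // removed later by the fixed-point conversion vcvtq_n_f32_s32(ret, 4).
    int8_t hi_kept = (int8_t)(packed & 0xF0);  // 0b11110000 -> -16 == -1 * 16
    int8_t lo_kept = (int8_t)(packed << 4);    // 0b00110000 -> +48 == +3 * 16

    // A plain right shift of the unsigned byte loses the sign of the high nibble.
    int hi_shifted = packed >> 4;              // 0b00001111 -> +15, not -1

    printf("high nibble kept high: %4d -> /16 = %d\n", hi_kept, hi_kept / 16);
    printf("low  nibble kept high: %4d -> /16 = %d\n", lo_kept, lo_kept / 16);
    printf("high nibble shifted  : %4d (sign lost)\n", hi_shifted);
    return 0;
}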

@angt angt force-pushed the ggml-cpu-replace-aarch64-neon-assembly-with-intrinsics branch 2 times, most recently from 7587b42 to 6b6b98f Compare November 29, 2024 10:36
…q4_0_4x4_q8_0()

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
@angt angt force-pushed the ggml-cpu-replace-aarch64-neon-assembly-with-intrinsics branch from 6b6b98f to aaa6682 Compare November 29, 2024 14:59
@angt (Contributor, Author) commented on Nov 29, 2024

I ran several tests today with many models (like Qwen2.5-3B-Instruct-Q4_0_4_4.gguf and Llama-3.2-1B-Instruct-Q4_0_4_4.gguf) on both the master branch and this one, and I confirm that the output is identical each time.

Just for your information, I wrote the code to match the assembly version rather than the C version, which differs slightly:

C version:

./llama-cli -m /home/ubuntu/models/Llama-3.2-1B-Instruct-Q4_0_4_4.gguf -p "I believe the meaning of life is" -n 48 -s 1
I believe the meaning of life is the concept of finding one's purpose and pursuing happiness through personal growth and self-improvement. It's the idea that, by overcoming and overcoming one's limitations, one can discover a sense of fulfillment and satisfaction that comes from achieving their goals

NEON assembly:

./llama-cli -m /home/ubuntu/models/Llama-3.2-1B-Instruct-Q4_0_4_4.gguf -p "I believe the meaning of life is" -n 48 -s 1
I believe the meaning of life is the concept of finding one's purpose and fulfilling a person's potential.
The meaning of life is indeed a profound question that has puzzled philosophers, theologians, and thinkers for centuries. While there may not be a single, definitive answer, I

@ggerganov (Member) commented

Nice. Btw, to clarify: my #10567 (comment) was about the change that you suggested in #10567 (comment). Without that change (i.e. using the PR as it is), everything works on my end. If I apply the change, the output becomes incorrect.

@max-krasnyansky (Collaborator) commented

I tested this on M2 Max and Snapdragon X-Elite.
Performance numbers are about the same, and the outputs match (given the same seed).

@max-krasnyansky max-krasnyansky merged commit 0c39f44 into ggml-org:master Nov 30, 2024
50 checks passed
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Dec 20, 2024
…q4_0_4x4_q8_0() (ggml-org#10567)

Signed-off-by: Adrien Gallouët <angt@huggingface.co>