ggml-cpu: replace AArch64 NEON assembly with intrinsics in ggml_gemv_q4_0_4x4_q8_0() #10567
Conversation
Force-pushed from 69dd941 to bde1b96
Using intrinsics is definitely preferred.
I also don't observe a significant performance change. Here are the results on M2 Ultra, using the following patch to force the Q4_0_4_4 repack:
```diff
diff --git a/ggml/src/ggml-cpu/ggml-cpu-aarch64.c b/ggml/src/ggml-cpu/ggml-cpu-aarch64.c
index 6d2c0adc3..cac45278b 100644
--- a/ggml/src/ggml-cpu/ggml-cpu-aarch64.c
+++ b/ggml/src/ggml-cpu/ggml-cpu-aarch64.c
@@ -3812,7 +3812,7 @@ enum ggml_type ggml_aarch64_get_optimal_repack_type(const struct ggml_tensor * c
         return GGML_TYPE_Q4_0_8_8;
     }
     if (ggml_cpu_has_neon() && ggml_cpu_has_matmul_int8()) {
-        return GGML_TYPE_Q4_0_4_8;
+        //return GGML_TYPE_Q4_0_4_8;
     }
     if (ggml_cpu_has_neon() && ggml_cpu_has_dotprod()) {
         return GGML_TYPE_Q4_0_4_4;
```
```shell
make -j && ./bin/llama-bench -m ../models/qwen2.5-1.5b-coder/ggml-model-q4_0.gguf -t 1,2,4,8,16 -p 0 -n 64 -fa 1
```
master
model | size | params | backend | threads | fa | test | t/s |
---|---|---|---|---|---|---|---|
qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU | 1 | 1 | tg64 | 37.44 ± 0.19 |
qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU | 2 | 1 | tg64 | 65.51 ± 0.19 |
qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU | 4 | 1 | tg64 | 117.41 ± 0.35 |
qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU | 8 | 1 | tg64 | 158.53 ± 0.30 |
qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU | 16 | 1 | tg64 | 159.32 ± 1.79 |
build: 7281cf1 (4211)
PR
model | size | params | backend | threads | fa | test | t/s |
---|---|---|---|---|---|---|---|
qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU | 1 | 1 | tg64 | 36.92 ± 0.12 |
qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU | 2 | 1 | tg64 | 66.53 ± 0.55 |
qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU | 4 | 1 | tg64 | 114.92 ± 1.64 |
qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU | 8 | 1 | tg64 | 158.25 ± 0.17 |
qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | CPU | 16 | 1 | tg64 | 157.57 ± 4.43 |
build: bde1b96 (4210)
Force-pushed from bde1b96 to 7091a85
Also, I replicated the current C/asm code, but I would have done something more like this:

```diff
-            ret = vdotq_laneq_s32(ret, b0 << 4, a0, 0);
-            ret = vdotq_laneq_s32(ret, b1 << 4, a0, 1);
-            ret = vdotq_laneq_s32(ret, b2 << 4, a0, 2);
-            ret = vdotq_laneq_s32(ret, b3 << 4, a0, 3);
+            ret = vdotq_laneq_s32(ret, b0 >> 4, a0, 0);
+            ret = vdotq_laneq_s32(ret, b1 >> 4, a0, 1);
+            ret = vdotq_laneq_s32(ret, b2 >> 4, a0, 2);
+            ret = vdotq_laneq_s32(ret, b3 >> 4, a0, 3);

-            ret = vdotq_laneq_s32(ret, b0 & 0xf0U, a1, 0);
-            ret = vdotq_laneq_s32(ret, b1 & 0xf0U, a1, 1);
-            ret = vdotq_laneq_s32(ret, b2 & 0xf0U, a1, 2);
-            ret = vdotq_laneq_s32(ret, b3 & 0xf0U, a1, 3);
+            ret = vdotq_laneq_s32(ret, b0 & 0xfU, a1, 0);
+            ret = vdotq_laneq_s32(ret, b1 & 0xfU, a1, 1);
+            ret = vdotq_laneq_s32(ret, b2 & 0xfU, a1, 2);
+            ret = vdotq_laneq_s32(ret, b3 & 0xfU, a1, 3);

-            acc = vfmaq_f32(acc, vcvtq_n_f32_s32(ret, 4),
+            acc = vfmaq_f32(acc, vcvtq_f32_s32(ret),
```

If anyone has an explanation for why it was done this way, I'm interested.
This does not seem to produce correct output:

```shell
make -j && ./bin/llama-cli -m ../models/llama-3.2-3b-instruct/ggml-model-q4_0.gguf -p "I believe the meaning of life is" -n 32 -s 1
```

```
I believe the meaning of life is anekloatbounceракahnurnجوistle Scene ERA Ordinary spherecord Cheat Spherebenchbenchogle leaguesipogl ordin usefulnessecutenchhec batchjaxrigerrig
```
I had only tested with
I'll dig later :)
@ggerganov I believe the model you're using …

This is what I get with the …
```c
// Think of an example:
A = 0b1111nnnn;
A[high] == Q4 == 0b1111
// Because it uses signed int4:
A[high] == -1
// Now if you do:
A >> 4   => 0b00001111
// the result is a value of +15.
A & 0xF0 => 0b11110000
// the result is a value of -16 = -1 * 2^4, so keep it like that
// if you don't want to deal with negative Q4 values.
```

Note: I hope I didn't make too many errors in my "math" 😎
Force-pushed from 7587b42 to 6b6b98f
…q4_0_4x4_q8_0()

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
Force-pushed from 6b6b98f to aaa6682
I ran several tests today with many models (like …). Just for your information, I also wrote the code to match the assembly version and not the C version, which slightly differs.

C version:

NEON assembly:
Nice. Btw, to clarify: my #10567 (comment) was about the change that you suggested in #10567 (comment). Without this change (i.e. using the PR as it is), everything works on my end. If I apply the change, the output becomes incorrect.
I tested this on M2 Max and Snapdragon X-Elite.
…q4_0_4x4_q8_0() (ggml-org#10567)

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
This PR improves code readability and lays the groundwork for potential optimizations in the future.
For now, I have limited the changes to a single function to ensure this approach is OK for everyone.
I did not observe any significant performance differences using `llama-bench`.