OpenCL: Performance comparison depending on gpu_offloads · Issue #12810 · ggml-org/llama.cpp

Closed
sparkleholic opened this issue Apr 8, 2025 · 6 comments
@sparkleholic
Contributor

I expected that offloading more layers to the GPU would yield better performance (tokens/sec), but the benchmark results showed otherwise.

The following was executed on a QCS8550 with this model: https://huggingface.co/LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct-GGUF/blob/main/EXAONE-3.5-2.4B-Instruct-Q4_K_M.gguf

llama-bench -m ./EXAONE-3.5-2.4B-Instruct-Q4_K_M.gguf  -ngl 0,5,10,15,20,31
ggml_opencl: selecting platform: 'QUALCOMM Snapdragon(TM)'
ggml_opencl: selecting device: 'QUALCOMM Adreno(TM) 740'
ggml_opencl: OpenCL driver: OpenCL 3.0 QUALCOMM build: commit unknown Compiler E031.42.20.00
ggml_opencl: vector subgroup broadcast support: false
ggml_opencl: device FP16 support: true
ggml_opencl: mem base addr align: 128
ggml_opencl: max mem alloc size: 256 MB
ggml_opencl: SVM coarse grain buffer support: true
ggml_opencl: SVM fine grain buffer support: true
ggml_opencl: SVM fine grain system support: false
ggml_opencl: SVM atomics support: true
ggml_opencl: flattening quantized weights representation as struct of arrays (GGML_OPENCL_SOA_Q)
ggml_opencl: using kernels optimized for Adreno (GGML_OPENCL_USE_ADRENO_KERNELS)
ggml_opencl: A_q_d buffer size reduced from 311164928 to 268435456 due to device limitations.
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| exaone ?B Q4_K - Medium        |   1.39 GiB |     2.41 B | OpenCL     |   0 |         pp512 |         18.92 ± 0.18 |
| exaone ?B Q4_K - Medium        |   1.39 GiB |     2.41 B | OpenCL     |   0 |         tg128 |          3.90 ± 0.10 |
| exaone ?B Q4_K - Medium        |   1.39 GiB |     2.41 B | OpenCL     |   5 |         pp512 |         16.97 ± 0.03 |
| exaone ?B Q4_K - Medium        |   1.39 GiB |     2.41 B | OpenCL     |   5 |         tg128 |          3.37 ± 0.02 |
| exaone ?B Q4_K - Medium        |   1.39 GiB |     2.41 B | OpenCL     |  10 |         pp512 |         16.23 ± 0.02 |
| exaone ?B Q4_K - Medium        |   1.39 GiB |     2.41 B | OpenCL     |  10 |         tg128 |          3.12 ± 0.02 |
| exaone ?B Q4_K - Medium        |   1.39 GiB |     2.41 B | OpenCL     |  15 |         pp512 |         15.87 ± 0.03 |
| exaone ?B Q4_K - Medium        |   1.39 GiB |     2.41 B | OpenCL     |  15 |         tg128 |          2.93 ± 0.01 |
| exaone ?B Q4_K - Medium        |   1.39 GiB |     2.41 B | OpenCL     |  20 |         pp512 |         15.22 ± 0.02 |
| exaone ?B Q4_K - Medium        |   1.39 GiB |     2.41 B | OpenCL     |  20 |         tg128 |          2.80 ± 0.01 |
| exaone ?B Q4_K - Medium        |   1.39 GiB |     2.41 B | OpenCL     |  31 |         pp512 |         13.81 ± 0.03 |
| exaone ?B Q4_K - Medium        |   1.39 GiB |     2.41 B | OpenCL     |  31 |         tg128 |          2.95 ± 0.14 |

@sparkleholic
Contributor Author

@lhez , @max-krasnyansky
I've checked the performance via llama-bench on QCS8550 with OpenCL enabled. The odd part is that offloading more layers to the GPU does not improve performance.

@kizuna0487

Try Q4_0? I tried Q4_K_M in the past and the performance was not good either.
https://github.com/ggml-org/llama.cpp/blob/master/docs/backend/OPENCL.md

@sparkleholic
Contributor Author

> Try Q4_0? I tried Q4_K_M in the past and the performance was not good either. https://github.com/ggml-org/llama.cpp/blob/master/docs/backend/OPENCL.md

Thanks. I'll test with Q4_0 as well.
I hadn't noticed that Adreno 740 (QCS8550) is missing from the verified device list, and that OPENCL.md doesn't mention Q4_K_M support either.

@lhez
Contributor
lhez commented Apr 8, 2025

@sparkleholic - currently only Q4_0 is optimized, so you will need to use --pure when quantizing the model to Q4_0. Without --pure, some layers will be quantized as Q6_K, resulting in worse performance. Q4_K is not currently supported, so when you run Q4_K_M models, the Q4_K layers fall back to the CPU, resulting in even worse performance.

Adreno 740 should work just fine. Feel free to reply back if you see any issue with 740.
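For reference, producing a pure Q4_0 model as suggested above could look like this (a sketch; the F16 input filename is assumed, and the argument order follows llama.cpp's quantize tool, `input output type`):

```shell
# --pure disables k-quant mixtures so every tensor is quantized
# to the requested type (Q4_0 here) instead of mixing in Q6_K.
# Input filename is hypothetical; use your own F16/F32 GGUF.
./llama-quantize --pure \
    EXAONE-3.5-2.4B-Instruct-F16.gguf \
    EXAONE-3.5-2.4B-Instruct-Q4_0.gguf \
    Q4_0
```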

@sparkleholic
Contributor Author

@lhez, @kizuna0487
Thanks for the info.
I've verified that Q4_0 works well on QCS8550 (Adreno 740): more n-gpu-layers now yields better benchmark results.

EXAONE-3.5-2.4B-Instruct-Q4_0.gguf

| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| exaone ?B Q4_0                 |   1.32 GiB |     2.41 B | OpenCL     |   0 |         pp512 |         16.46 ± 0.23 |
| exaone ?B Q4_0                 |   1.32 GiB |     2.41 B | OpenCL     |   0 |         tg128 |          5.29 ± 0.13 |
| exaone ?B Q4_0                 |   1.32 GiB |     2.41 B | OpenCL     |   5 |         pp512 |         17.98 ± 0.06 |
| exaone ?B Q4_0                 |   1.32 GiB |     2.41 B | OpenCL     |   5 |         tg128 |          4.31 ± 0.09 |
| exaone ?B Q4_0                 |   1.32 GiB |     2.41 B | OpenCL     |  10 |         pp512 |         21.16 ± 0.04 |
| exaone ?B Q4_0                 |   1.32 GiB |     2.41 B | OpenCL     |  10 |         tg128 |          5.16 ± 0.13 |
| exaone ?B Q4_0                 |   1.32 GiB |     2.41 B | OpenCL     |  15 |         pp512 |         25.82 ± 0.05 |
| exaone ?B Q4_0                 |   1.32 GiB |     2.41 B | OpenCL     |  15 |         tg128 |          6.20 ± 0.13 |
| exaone ?B Q4_0                 |   1.32 GiB |     2.41 B | OpenCL     |  20 |         pp512 |         33.12 ± 0.06 |
| exaone ?B Q4_0                 |   1.32 GiB |     2.41 B | OpenCL     |  20 |         tg128 |          7.85 ± 0.13 |
| exaone ?B Q4_0                 |   1.32 GiB |     2.41 B | OpenCL     |  31 |         pp512 |         75.02 ± 0.03 |
| exaone ?B Q4_0                 |   1.32 GiB |     2.41 B | OpenCL     |  31 |         tg128 |         11.85 ± 0.05 |

@github-actions github-actions bot added the stale label May 9, 2025

This issue was closed because it has been inactive for 14 days since being marked as stale.
