OpenCL: Performance comparison depending on gpu_offloads · Issue #12810 · ggml-org/llama.cpp

Closed
sparkleholic opened this issue Apr 8, 2025 · 6 comments
@sparkleholic
Contributor

I expected that offloading more layers to the GPU would yield better performance (tokens/sec), but the benchmark results showed otherwise.

The following was executed on a QCS8550 with this model: https://huggingface.co/LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct-GGUF/blob/main/EXAONE-3.5-2.4B-Instruct-Q4_K_M.gguf

llama-bench -m ./EXAONE-3.5-2.4B-Instruct-Q4_K_M.gguf  -ngl 0,5,10,15,20,31
ggml_opencl: selecting platform: 'QUALCOMM Snapdragon(TM)'
ggml_opencl: selecting device: 'QUALCOMM Adreno(TM) 740'
ggml_opencl: OpenCL driver: OpenCL 3.0 QUALCOMM build: commit unknown Compiler E031.42.20.00
ggml_opencl: vector subgroup broadcast support: false
ggml_opencl: device FP16 support: true
ggml_opencl: mem base addr align: 128
ggml_opencl: max mem alloc size: 256 MB
ggml_opencl: SVM coarse grain buffer support: true
ggml_opencl: SVM fine grain buffer support: true
ggml_opencl: SVM fine grain system support: false
ggml_opencl: SVM atomics support: true
ggml_opencl: flattening quantized weights representation as struct of arrays (GGML_OPENCL_SOA_Q)
ggml_opencl: using kernels optimized for Adreno (GGML_OPENCL_USE_ADRENO_KERNELS)
ggml_opencl: A_q_d buffer size reduced from 311164928 to 268435456 due to device limitations.
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| exaone ?B Q4_K - Medium        |   1.39 GiB |     2.41 B | OpenCL     |   0 |         pp512 |         18.92 ± 0.18 |
| exaone ?B Q4_K - Medium        |   1.39 GiB |     2.41 B | OpenCL     |   0 |         tg128 |          3.90 ± 0.10 |
| exaone ?B Q4_K - Medium        |   1.39 GiB |     2.41 B | OpenCL     |   5 |         pp512 |         16.97 ± 0.03 |
| exaone ?B Q4_K - Medium        |   1.39 GiB |     2.41 B | OpenCL     |   5 |         tg128 |          3.37 ± 0.02 |
| exaone ?B Q4_K - Medium        |   1.39 GiB |     2.41 B | OpenCL     |  10 |         pp512 |         16.23 ± 0.02 |
| exaone ?B Q4_K - Medium        |   1.39 GiB |     2.41 B | OpenCL     |  10 |         tg128 |          3.12 ± 0.02 |
| exaone ?B Q4_K - Medium        |   1.39 GiB |     2.41 B | OpenCL     |  15 |         pp512 |         15.87 ± 0.03 |
| exaone ?B Q4_K - Medium        |   1.39 GiB |     2.41 B | OpenCL     |  15 |         tg128 |          2.93 ± 0.01 |
| exaone ?B Q4_K - Medium        |   1.39 GiB |     2.41 B | OpenCL     |  20 |         pp512 |         15.22 ± 0.02 |
| exaone ?B Q4_K - Medium        |   1.39 GiB |     2.41 B | OpenCL     |  20 |         tg128 |          2.80 ± 0.01 |
| exaone ?B Q4_K - Medium        |   1.39 GiB |     2.41 B | OpenCL     |  31 |         pp512 |         13.81 ± 0.03 |
| exaone ?B Q4_K - Medium        |   1.39 GiB |     2.41 B | OpenCL     |  31 |         tg128 |          2.95 ± 0.14 |

@sparkleholic
Contributor Author

@lhez , @max-krasnyansky
I've checked the performance via llama-bench on QCS8550 with OpenCL enabled. The odd part is that offloading more layers to the GPU does not improve performance.

@kizuna0487

Try Q4_0? I tried Q4_K_M in the past and the performance was not good either.
https://github.com/ggml-org/llama.cpp/blob/master/docs/backend/OPENCL.md

@sparkleholic
Contributor Author

> Try Q4_0? I tried Q4_K_M in the past and the performance was not good either. https://github.com/ggml-org/llama.cpp/blob/master/docs/backend/OPENCL.md

Thanks. I'll test with Q4_0 as well.
I hadn't noticed that Adreno 740 (QCS8550) is missing from the verified device list, and that OPENCL.md doesn't mention Q4_K_M support either.

@lhez
Contributor
lhez commented Apr 8, 2025

@sparkleholic - currently only Q4_0 is optimized, so you will need to use --pure when quantizing the model to Q4_0. Without --pure, some layers will be quantized as Q6_K, resulting in worse performance. Q4_K is not currently supported, so when you run Q4_K_M models, the Q4_K layers fall back to the CPU, resulting in even worse performance.

Adreno 740 should work just fine. Feel free to reply back if you see any issue with 740.
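For reference, producing a pure Q4_0 model as suggested above could look like this (a sketch; the F16 input filename is assumed, and the argument order follows llama.cpp's quantize tool, `input output type`):

```shell
# --pure disables k-quant mixtures so every tensor is quantized
# to the requested type (Q4_0 here) instead of mixing in Q6_K.
# Input filename is hypothetical; use your own F16/F32 GGUF.
./llama-quantize --pure \
    EXAONE-3.5-2.4B-Instruct-F16.gguf \
    EXAONE-3.5-2.4B-Instruct-Q4_0.gguf \
    Q4_0
```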

@sparkleholic
Contributor Author

@lhez, @kizuna0487
Thanks for the info.
I've verified that Q4_0 works well on QCS8550 (Adreno 740): more n-gpu-layers now yields better benchmark results.

EXAONE-3.5-2.4B-Instruct-Q4_0.gguf

| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| exaone ?B Q4_0                 |   1.32 GiB |     2.41 B | OpenCL     |   0 |         pp512 |         16.46 ± 0.23 |
| exaone ?B Q4_0                 |   1.32 GiB |     2.41 B | OpenCL     |   0 |         tg128 |          5.29 ± 0.13 |
| exaone ?B Q4_0                 |   1.32 GiB |     2.41 B | OpenCL     |   5 |         pp512 |         17.98 ± 0.06 |
| exaone ?B Q4_0                 |   1.32 GiB |     2.41 B | OpenCL     |   5 |         tg128 |          4.31 ± 0.09 |
| exaone ?B Q4_0                 |   1.32 GiB |     2.41 B | OpenCL     |  10 |         pp512 |         21.16 ± 0.04 |
| exaone ?B Q4_0                 |   1.32 GiB |     2.41 B | OpenCL     |  10 |         tg128 |          5.16 ± 0.13 |
| exaone ?B Q4_0                 |   1.32 GiB |     2.41 B | OpenCL     |  15 |         pp512 |         25.82 ± 0.05 |
| exaone ?B Q4_0                 |   1.32 GiB |     2.41 B | OpenCL     |  15 |         tg128 |          6.20 ± 0.13 |
| exaone ?B Q4_0                 |   1.32 GiB |     2.41 B | OpenCL     |  20 |         pp512 |         33.12 ± 0.06 |
| exaone ?B Q4_0                 |   1.32 GiB |     2.41 B | OpenCL     |  20 |         tg128 |          7.85 ± 0.13 |
| exaone ?B Q4_0                 |   1.32 GiB |     2.41 B | OpenCL     |  31 |         pp512 |         75.02 ± 0.03 |
| exaone ?B Q4_0                 |   1.32 GiB |     2.41 B | OpenCL     |  31 |         tg128 |         11.85 ± 0.05 |

@github-actions github-actions bot added the stale label May 9, 2025

This issue was closed because it has been inactive for 14 days since being marked as stale.
