something with llama_server? slow vs llama_cli · Issue #13560 · ggml-org/llama.cpp · GitHub


Open
bitcandy opened this issue May 15, 2025 · 2 comments

Comments

bitcandy commented May 15, 2025

Name and Version

server version: 5392 (c753d7b)

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-server

Problem description & steps to reproduce

Hello, after testing I found that llama-server is slower than llama-cli: token generation speed drops by roughly 5% to 10% with llama-server compared to llama-cli. Is this expected?

Compared to ollama, llama-server lands somewhere in between these two results, but ollama runs with Flash Attention enabled. However, when I enable Flash Attention in llama.cpp, I observe an additional ~5% drop in performance.

system: 3 old GPUs (1× 1080 Ti, 2× 1070)
-sm layer

system_info: n_threads = 2 (n_threads_batch = 2) / 4 | CUDA : ARCHS = 610 | FORCE_CUBLAS = 1 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: yes
ggml_cuda_init: found 3 CUDA devices:
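For reference, a comparison along these lines can be reproduced roughly as follows. This is a sketch: the model path, prompt, and port are placeholders, and the `grep` just pulls the `predicted_per_second` figure out of the server's `/completion` timings.

```shell
# Baseline: llama-cli prints tokens/s in its timing summary at the end
./llama-cli -m model.gguf -ngl 99 -sm layer -p "Hello" -n 256

# Same settings through llama-server, measured via the /completion endpoint
./llama-server -m model.gguf -ngl 99 -sm layer --port 8080 &
curl -s http://127.0.0.1:8080/completion \
  -d '{"prompt": "Hello", "n_predict": 256}' \
  | grep -o '"predicted_per_second":[0-9.]*'

# Repeat both runs with -fa added to see the Flash Attention effect
```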

Regards

@bitcandy bitcandy changed the title Misc. bug: something with llama_server? slow vs llama_cli May 15, 2025
@VickyReal

#9013 (comment)

@sivansh11

Facing the same issue, and the difference is huge:
0.11 ms vs 24.59 ms
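Assuming those figures are per-token generation times, a quick back-of-envelope conversion puts the gap in throughput terms:

```python
def tokens_per_second(ms_per_token: float) -> float:
    """Convert per-token latency in milliseconds to tokens per second."""
    return 1000.0 / ms_per_token

print(tokens_per_second(0.11))   # ~9091 t/s
print(tokens_per_second(24.59))  # ~40.7 t/s
```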

commenting here as the other issue has gone stale
