something with llama_server? slow vs llama_cli · Issue #13560 · ggml-org/llama.cpp · GitHub


Open
bitcandy opened this issue May 15, 2025 · 2 comments

Comments

bitcandy commented May 15, 2025

Name and Version

server version: 5392 (c753d7b)

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-server

Problem description & steps to reproduce

Hello, after testing I found that llama-server is slower than llama-cli: token generation speed drops by roughly 5% to 10% with llama-server compared to llama-cli. Is this expected?

Compared to ollama, llama-server lands somewhere in between these two results, but ollama runs with Flash Attention enabled. However, when I enable Flash Attention in llama.cpp, I observe an additional ~5% drop in performance.

system: 3 old GPUs (1× 1080 Ti, 2× 1070)
-sm layer

system_info: n_threads = 2 (n_threads_batch = 2) / 4 | CUDA : ARCHS = 610 | FORCE_CUBLAS = 1 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: yes
ggml_cuda_init: found 3 CUDA devices:
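For reference, a comparison along these lines can be reproduced roughly as follows. This is a sketch: the model path, prompt, and port are placeholders, and the `grep` just pulls the `predicted_per_second` figure out of the server's `/completion` timings.

```shell
# Baseline: llama-cli prints tokens/s in its timing summary at the end
./llama-cli -m model.gguf -ngl 99 -sm layer -p "Hello" -n 256

# Same settings through llama-server, measured via the /completion endpoint
./llama-server -m model.gguf -ngl 99 -sm layer --port 8080 &
curl -s http://127.0.0.1:8080/completion \
  -d '{"prompt": "Hello", "n_predict": 256}' \
  | grep -o '"predicted_per_second":[0-9.]*'

# Repeat both runs with -fa added to see the Flash Attention effect
```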

Regards

@bitcandy bitcandy changed the title Misc. bug: something with llama_server? slow vs llama_cli May 15, 2025
@VickyReal

#9013 (comment)

@sivansh11

Facing the same issue, and the difference is huge:
0.11 ms vs 24.59 ms
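Assuming those figures are per-token generation times, a quick back-of-envelope conversion puts the gap in throughput terms:

```python
def tokens_per_second(ms_per_token: float) -> float:
    """Convert per-token latency in milliseconds to tokens per second."""
    return 1000.0 / ms_per_token

print(tokens_per_second(0.11))   # ~9091 t/s
print(tokens_per_second(24.59))  # ~40.7 t/s
```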

commenting here as the other issue has gone stale
