Name and Version
server version: 5392 (c753d7b)
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Problem description & steps to reproduce
Hello, after testing I found that llama-server is slower than llama-cli: token generation speed is about 5% to 10% lower with llama-server. Is this expected?
Compared to ollama, llama-server lands somewhere between those two results, though ollama runs with Flash Attention enabled. However, when I enable Flash Attention in llama.cpp, I see an additional ~5% drop in performance.
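For reproduction, this is roughly how the two were compared (the model path, prompt, token count, and port below are placeholders rather than my exact values; the server-side speed is read from the timings object that llama-server returns in the /completion response, if I read that response correctly):

```bash
# Rough comparison sketch; paths, prompt, -ngl, and port are placeholders.
MODEL=/path/to/model.gguf
PROMPT="Write a short story about a robot."

# llama-cli prints its own eval timings (tokens/s) when generation finishes.
./llama-cli -m "$MODEL" -ngl 99 -p "$PROMPT" -n 256

# llama-server: start it, request the same amount of generation, and
# read the timings block from the JSON response for the generation tokens/s.
./llama-server -m "$MODEL" -ngl 99 --port 8080 &
sleep 10   # wait for the model to load
curl -s http://127.0.0.1:8080/completion \
  -d "{\"prompt\": \"$PROMPT\", \"n_predict\": 256}" | jq '.timings'
```

The gap shows up in the generation rate reported by each tool for the same model and the same number of predicted tokens.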
System: 3 older GPUs (one GTX 1080 Ti, two GTX 1070)
Split mode: -sm layer
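The launch flags were roughly as follows (the model path and the -ngl value are placeholders):

```bash
# Baseline run, layers split across the three GPUs:
./llama-server -m /path/to/model.gguf -ngl 99 -sm layer

# Same run with Flash Attention enabled (this is the case that loses another ~5%):
./llama-server -m /path/to/model.gguf -ngl 99 -sm layer -fa
```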
Regards