Eval bug: swa_full = true is slower than false #13683
Comments
I noticed the same thing with both the 12B and 27B Gemma 3 models. (Btw, your markdown is broken; the text of your post is inside a code block.)
This is the expected behaviour. With swa_full = true, the KV cache of the sliding-window-attention layers is kept at the full context size instead of just the attention window, so it uses more memory and is slower. The benefit is that keeping the full cache allows old context states to be restored, which is what features like cache reuse and speculative decoding rely on.
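(For anyone who wants to reproduce the comparison themselves: recent builds expose this as a --swa-full flag in llama-cli; the model path and prompt below are placeholders, so treat this as a sketch rather than the exact commands from this issue.)
./build/bin/llama-cli -m model.gguf -c 8192 -n 256 -p "your prompt"            # default: swa_full = false, window-sized SWA cache
./build/bin/llama-cli -m model.gguf -c 8192 -n 256 -p "your prompt" --swa-full # full-size SWA cache: more memory, slower decode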
Thanks for your reply. I find that only the server has speculative decoding and cache reuse, so there is no reason for people to use swa-full = true with llama-cli and llama-bench? I tried b5427, which supposedly doesn't have iSWA, and it produces the same result as swa-full = true. I suppose this means it is not a bug.
Yes, for llama-cli and llama-bench there is no reason to enable it. swa-full = true is mainly useful for llama-server, where speculative decoding and cache reuse can take advantage of the full cache.
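(As a sketch of where --swa-full is actually useful, assuming llama-server's standard options; the --cache-reuse value and the draft model path here are illustrative, not taken from this issue:)
./build/bin/llama-server -m model.gguf --swa-full --cache-reuse 256     # keep the full SWA cache so old context chunks can be reused
./build/bin/llama-server -m model.gguf --swa-full -md draft-model.gguf  # speculative decoding with a smaller draft model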
Name and Version
./build/bin/llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
version: 5439 (3398305)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
Operating systems
Linux
GGML backends
CUDA
Hardware
1 x 3090
Models
google_gemma-3-12b-it-Q4_K_M.gguf
Problem description & steps to reproduce
I tried to study whether swa_full = true would be faster for contexts longer than 1024 with Gemma 3 12B, so I ran
./build/bin/llama-bench -m ~/gguf/google_gemma-3-12b-it-Q4_K_M.gguf -p 2048 -n 1024 -d 8192 -b 64,128,256,512,1024,2048,4096,8192
and got this table:
First Bad Commit
No response
Relevant log output