Eval bug: swa_full = true is slower than false #13683

Open
ymcki opened this issue May 21, 2025 · 4 comments
ymcki commented May 21, 2025

Name and Version

./build/bin/llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
version: 5439 (3398305)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu

Operating systems

Linux

GGML backends

CUDA

Hardware

1 x 3090

Models

google_gemma-3-12b-it-Q4_K_M.gguf

Problem description & steps to reproduce

I wanted to find out whether swa_full = true is faster for contexts longer than 1024 with gemma 3 12B, so I ran

./build/bin/llama-bench -m ~/gguf/google_gemma-3-12b-it-Q4_K_M.gguf -p 2048 -n 1024 -d 8192 -b 64,128,256,512,1024,2048,4096,8192

and got this table:

| model                          |       size |     params | backend    | ngl | n_batch |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |      64 |  pp2048 @ d8192 |      1629.05 ± 11.72 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |      64 |  tg1024 @ d8192 |         60.33 ± 0.85 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |     128 |  pp2048 @ d8192 |       1869.86 ± 2.57 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |     128 |  tg1024 @ d8192 |         58.23 ± 1.14 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |     256 |  pp2048 @ d8192 |      2048.31 ± 74.21 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |     256 |  tg1024 @ d8192 |         55.84 ± 1.16 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |     512 |  pp2048 @ d8192 |       2015.36 ± 1.87 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |     512 |  tg1024 @ d8192 |         53.38 ± 1.62 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |    1024 |  pp2048 @ d8192 |      1898.70 ± 20.55 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |    1024 |  tg1024 @ d8192 |         51.62 ± 0.93 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |    2048 |  pp2048 @ d8192 |     1767.68 ± 125.60 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |    2048 |  tg1024 @ d8192 |         52.08 ± 0.47 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |    4096 |  pp2048 @ d8192 |      1626.16 ± 83.40 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |    4096 |  tg1024 @ d8192 |         47.14 ± 1.53 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |    8192 |  pp2048 @ d8192 |      1506.47 ± 24.72 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |    8192 |  tg1024 @ d8192 |         43.39 ± 1.57 |

First Bad Commit

No response

Relevant log output

Since I can't pass swa_full = true to llama-bench, I manually hard-coded swa_full = true at llama-bench.cpp line 994 and ran it again.
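
Roughly, the hard-coded change looks like this (a sketch only: the helper and surrounding assignments are paraphrased from memory, and the exact location differs between versions):

// llama-bench.cpp: the benchmark builds its context parameters in a helper
// along these lines (sketch, not the verbatim source)
llama_context_params to_llama_cparams() const {
    llama_context_params cparams = llama_context_default_params();
    // ... existing assignments: n_ctx, n_batch, type_k, type_v, ...
    cparams.swa_full = true; // hard-coded to true instead of the value llama-bench normally uses
    return cparams;
}

With that change, the second run gives: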

| model                          |       size |     params | backend    | ngl | n_batch |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |      64 |  pp2048 @ d8192 |       1148.70 ± 9.49 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |      64 |  tg1024 @ d8192 |         44.73 ± 2.50 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |     128 |  pp2048 @ d8192 |      1321.45 ± 10.14 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |     128 |  tg1024 @ d8192 |         41.82 ± 1.68 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |     256 |  pp2048 @ d8192 |      1406.80 ± 50.92 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |     256 |  tg1024 @ d8192 |         40.66 ± 2.84 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |     512 |  pp2048 @ d8192 |       1377.80 ± 1.09 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |     512 |  tg1024 @ d8192 |         41.10 ± 1.41 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |    1024 |  pp2048 @ d8192 |      1375.06 ± 13.44 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |    1024 |  tg1024 @ d8192 |         41.29 ± 1.41 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |    2048 |  pp2048 @ d8192 |      1374.82 ± 31.24 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |    2048 |  tg1024 @ d8192 |         41.18 ± 1.84 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |    4096 |  pp2048 @ d8192 |      1391.40 ± 54.52 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |    4096 |  tg1024 @ d8192 |         40.63 ± 2.08 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |    8192 |  pp2048 @ d8192 |      1366.34 ± 27.35 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |    8192 |  tg1024 @ d8192 |         40.20 ± 1.56 |

It is significantly slower across the board. Why is that? I expected context caching to kick in and make things faster with swa_full = true.

I noticed the first run uses around 9.5 GB of VRAM while the second one uses 12.5 GB, so I believe iSWA is indeed off in the second run.
ddh0 commented May 21, 2025

I noticed the same thing with both the 12B and 27B gemma 3 models. (Btw, your markdown is broken; the text of your post is inside a code block.)

ggerganov (Member) commented:

This is the expected behaviour. With --swa-full you need more memory and the computation is slower, but you can branch from older points of the context for free and do things like speculative decoding and cache reuse.
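
For example, this is the kind of llama-server setup where --swa-full pays off, combining prompt-cache reuse with a draft model for speculative decoding (the draft model file name is only a placeholder, and the flags are quoted from the server help as I recall them, so double-check against your build):

./build/bin/llama-server -m ~/gguf/google_gemma-3-12b-it-Q4_K_M.gguf -c 8192 --swa-full --cache-reuse 256 -md ~/gguf/gemma-3-1b-it-Q4_K_M.gguf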

ymcki commented May 22, 2025

> This is the expected behaviour. With --swa-full you need more memory and the computation is slower, but you can branch from older points of the context for free and do things like speculative decoding and cache reuse.

Thanks for your reply. I find that only llama-server supports speculative decoding and cache reuse. So is there no reason to use --swa-full with llama-cli and llama-bench?

I tried b5427, which supposedly doesn't have iSWA. It produces essentially the same results as swa_full = true, so I suppose this means it is not a bug.

| model                          |       size |     params | backend    | ngl | n_batch |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |      64 |  pp2048 @ d8192 |       1154.67 ± 9.03 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |      64 |  tg1024 @ d8192 |         46.29 ± 0.54 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |     128 |  pp2048 @ d8192 |      1335.12 ± 21.89 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |     128 |  tg1024 @ d8192 |         44.15 ± 0.90 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |     256 |  pp2048 @ d8192 |      1439.33 ± 54.96 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |     256 |  tg1024 @ d8192 |         42.01 ± 3.45 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |     512 |  pp2048 @ d8192 |       1435.01 ± 8.44 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |     512 |  tg1024 @ d8192 |         42.50 ± 0.86 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |    1024 |  pp2048 @ d8192 |       1490.45 ± 9.74 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |    1024 |  tg1024 @ d8192 |         42.09 ± 1.65 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |    2048 |  pp2048 @ d8192 |       1406.48 ± 1.25 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |    2048 |  tg1024 @ d8192 |         42.58 ± 0.73 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |    4096 |  pp2048 @ d8192 |      1427.27 ± 41.79 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |    4096 |  tg1024 @ d8192 |         41.22 ± 0.82 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |    8192 |  pp2048 @ d8192 |      1473.78 ± 27.69 |
| gemma3 12B Q4_K - Medium       |   6.79 GiB |    11.77 B | CUDA       |  99 |    8192 |  tg1024 @ d8192 |         41.10 ± 1.50 |

ggerganov (Member) commented:

Yes, for llama-cli there is no use for --swa-full unless you want to do context shift (i.e. when the context becomes full, discard part of the oldest tokens and shift the rest in their place). llama-bench does not even allow you to use --swa-full.
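
For example, a long interactive chat where the context eventually fills up and gets shifted would be the one llama-cli case for it (assuming --swa-full is exposed as a regular CLI flag in this build, which was my understanding from the iSWA change):

./build/bin/llama-cli -m ~/gguf/google_gemma-3-12b-it-Q4_K_M.gguf -c 4096 --swa-full -cnv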
