CUDA performance bug when two cards are visible and only one is used · Issue #12838 · ggml-org/llama.cpp

Closed

cmp-nct opened this issue Apr 9, 2025 · 1 comment

cmp-nct (Contributor) commented Apr 9, 2025

Name and Version

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
version: 5109 (66d17c5a)
built with MSVC 19.40.33811.0 for x64

Operating systems

Windows

GGML backends

CUDA

Hardware

4090+3090

Models

Qwen 14B Q5KM

Problem description & steps to reproduce

Testing:
-m "Qwen.Qwen2.5-14B-Instruct-1M.Q5_K_M.gguf" -ts 1,0 -b 4000 -p "Hello world" -ngl 100 --verbose-prompt -st -n 128 --ignore-eos -fa -c 4096 -mg 0

With this command, which already forces all computation onto GPU 0 via -ts 1,0 and -mg 0, token generation speed on my 4090 is 55 tokens/sec.

Additionally setting
$env:CUDA_VISIBLE_DEVICES = "0";
so that only GPU 0 is visible boosts token generation speed to 65 tokens/sec.
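
For clarity, the two runs side by side as invoked from PowerShell (llama-cli.exe is a placeholder for the actual binary name; the flags and model are exactly those above):

# both GPUs visible, all layers forced onto GPU 0 -> ~55 tokens/sec
.\llama-cli.exe -m "Qwen.Qwen2.5-14B-Instruct-1M.Q5_K_M.gguf" -ts 1,0 -b 4000 -p "Hello world" -ngl 100 --verbose-prompt -st -n 128 --ignore-eos -fa -c 4096 -mg 0

# second GPU hidden from the CUDA runtime, same command -> ~65 tokens/sec
$env:CUDA_VISIBLE_DEVICES = "0"
.\llama-cli.exe -m "Qwen.Qwen2.5-14B-Instruct-1M.Q5_K_M.gguf" -ts 1,0 -b 4000 -p "Hello world" -ngl 100 --verbose-prompt -st -n 128 --ignore-eos -fa -c 4096 -mg 0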

The log shows in both cases that only one card is used (see the log output below).

Something slows the CUDA backend down significantly when a second GPU is visible, even if nothing is offloaded to it and it is just a secondary idle card.

First Bad Commit

No response

Relevant log output

load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors:        CUDA0 model buffer size =  9505.88 MiB
load_tensors:   CPU_Mapped model buffer size =   510.47 MiB
..........................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 4000
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: freq_base     = 10000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (1010000) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     0.58 MiB
init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 48, can_shift = 1
init:      CUDA0 KV buffer size =   768.00 MiB
llama_context: KV self size  =  768.00 MiB, K (f16):  384.00 MiB, V (f16):  384.00 MiB
llama_context: pipeline parallelism enabled (n_copies=4)
llama_context:      CUDA0 compute buffer size =   379.02 MiB
llama_context:  CUDA_Host compute buffer size =    42.02 MiB
llama_context: graph nodes  = 1591
llama_context: graph splits = 2

Here is the difference when making only one card visible:

llama_context: KV self size  =  768.00 MiB, K (f16):  384.00 MiB, V (f16):  384.00 MiB
llama_context:      CUDA0 compute buffer size =   307.00 MiB
llama_context:  CUDA_Host compute buffer size =    18.01 MiB
llama_context: graph nodes  = 1591
llama_context: graph splits = 2

So "pipeline parallelism" is enabled and causes internal delays despite nothing being offloaded to the second GPU.
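
Judging by the n_copies=4 line in the log, the scheduler allocates extra input copies for pipeline parallelism. As a rough, untested sketch (assuming GGML_SCHED_MAX_COPIES is still the ggml CMake option that controls this), building with it set to 1 might show whether those copies are responsible for the slowdown:

# assumption: GGML_SCHED_MAX_COPIES is the CMake option controlling the scheduler's pipeline copies
cmake -B build -DGGML_CUDA=ON -DGGML_SCHED_MAX_COPIES=1
cmake --build build --config Release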

Update
Manually adding -sm none solves the problem, but that is not something most people using llama.cpp would get to on their own.
Any GPU that is not being offloaded to should not be used by default.
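
The same test command with that workaround applied (again, llama-cli.exe stands in for the actual binary name):

# both GPUs visible, but split mode disabled -> avoids the slowdown
.\llama-cli.exe -m "Qwen.Qwen2.5-14B-Instruct-1M.Q5_K_M.gguf" -ts 1,0 -b 4000 -p "Hello world" -ngl 100 --verbose-prompt -st -n 128 --ignore-eos -fa -c 4096 -mg 0 -sm none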

cmp-nct changed the title from "Eval bug: CUDA performance bug when two cards are visible and only one is used" to "CUDA performance bug when two cards are visible and only one is used" on Apr 9, 2025

github-actions bot added the stale label on May 10, 2025

This issue was closed because it has been inactive for 14 days since being marked as stale.
