Name and Version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
version: 5109 (66d17c5a)
built with MSVC 19.40.33811.0 for x64
Operating systems
Windows
GGML backends
CUDA
Hardware
4090+3090
Models
Qwen2.5-14B-Instruct-1M Q5_K_M
Problem description & steps to reproduce
Testing:
-m "Qwen.Qwen2.5-14B-Instruct-1M.Q5_K_M.gguf" -ts 1,0 -b 4000 -p "Hello world" -ngl 100 --verbose-prompt -st -n 128 --ignore-eos -fa -c 4096 -mg 0
Token generation speed on my 4090 is 55 tokens/sec when using this command (which forces computation onto GPU 0). Additionally hiding the second card with
$env:CUDA_VISIBLE_DEVICES = "0";
boosts token generation speed to 65 tokens/sec.
The log shows in BOTH cases that only one card is used (see the relevant log output below).
Something is significantly slowing down the CUDA backend when a second GPU is visible, even though nothing is offloaded to it and it is just an idle secondary card.
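For reference, a minimal PowerShell reproduction of the two runs could look like this (the llama-cli.exe name is an assumption; substitute whatever binary you actually run):

# Run 1: both GPUs visible, all layers forced onto GPU 0 via -ts 1,0 / -mg 0 (~55 tokens/sec)
.\llama-cli.exe -m "Qwen.Qwen2.5-14B-Instruct-1M.Q5_K_M.gguf" -ts 1,0 -b 4000 -p "Hello world" -ngl 100 --verbose-prompt -st -n 128 --ignore-eos -fa -c 4096 -mg 0

# Run 2: additionally hide the 3090 so only device 0 is visible to CUDA (~65 tokens/sec)
$env:CUDA_VISIBLE_DEVICES = "0"
.\llama-cli.exe -m "Qwen.Qwen2.5-14B-Instruct-1M.Q5_K_M.gguf" -ts 1,0 -b 4000 -p "Hello world" -ngl 100 --verbose-prompt -st -n 128 --ignore-eos -fa -c 4096 -mg 0

# Remove the variable again to make both GPUs visible in this session
Remove-Item Env:CUDA_VISIBLE_DEVICES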
First Bad Commit
No response
Relevant log output
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors: CUDA0 model buffer size = 9505.88 MiB
load_tensors: CPU_Mapped model buffer size = 510.47 MiB
..........................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 4000
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: freq_base = 10000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (1010000) -- the full capacity of the model will not be utilized
llama_context: CUDA_Host output buffer size = 0.58 MiB
init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 48, can_shift = 1
init: CUDA0 KV buffer size = 768.00 MiB
llama_context: KV self size = 768.00 MiB, K (f16): 384.00 MiB, V (f16): 384.00 MiB
llama_context: pipeline parallelism enabled (n_copies=4)
llama_context: CUDA0 compute buffer size = 379.02 MiB
llama_context: CUDA_Host compute buffer size = 42.02 MiB
llama_context: graph nodes = 1591
llama_context: graph splits = 2
Here is the difference when making only one card visible:
llama_context: KV self size = 768.00 MiB, K (f16): 384.00 MiB, V (f16): 384.00 MiB
llama_context: CUDA0 compute buffer size = 307.00 MiB
llama_context: CUDA_Host compute buffer size = 18.01 MiB
llama_context: graph nodes = 1591
llama_context: graph splits = 2
So "pipeline parallelism" is enabled and causes internal delays despite having nothing offloaded on the 2nd gpu
Update
Manually adding -sm none solves the problem. But that's not something most people using llama.cpp would think to do.
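For example, the same command with split mode disabled (again assuming a llama-cli.exe binary):

# Split mode none: only the main GPU selected with -mg is used
.\llama-cli.exe -m "Qwen.Qwen2.5-14B-Instruct-1M.Q5_K_M.gguf" -sm none -ts 1,0 -b 4000 -p "Hello world" -ngl 100 --verbose-prompt -st -n 128 --ignore-eos -fa -c 4096 -mg 0

With -sm none only the main GPU is used, and presumably that is why the "pipeline parallelism enabled (n_copies=4)" path from the log above is avoided.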
Any GPU that is not being offloaded to should not be used by default.