Description
Name and Version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
version: 5109 (66d17c5a)
built with MSVC 19.40.33811.0 for x64
Operating systems
Windows
GGML backends
CUDA
Hardware
4090+3090
Models
Qwen2.5-14B-Instruct-1M (Q5_K_M)
Problem description & steps to reproduce
Testing:
-m "Qwen.Qwen2.5-14B-Instruct-1M.Q5_K_M.gguf" -ts 1,0 -b 4000 -p "Hello world" -ngl 100 --verbose-prompt -st -n 128 --ignore-eos -fa -c 4096 -mg 0
With this command, which forces all computation onto GPU 0, token generation on my 4090 runs at 55 tokens/sec. I then hid the second GPU from CUDA before launching (PowerShell):
$env:CUDA_VISIBLE_DEVICES = "0";
Setting this variable boosts token generation speed to 65 tokens/sec.
In BOTH cases the log shows that only one card is used (see the log output below).
Something slows the CUDA backend down significantly when a second GPU is visible, even if nothing is offloaded to it and it is just an idle secondary card.
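To put the two throughput figures in proportion, here is a small arithmetic sketch. It uses only the 55 and 65 tokens/sec numbers reported above; the variable names are illustrative.

```python
# Throughput figures reported above, in tokens/sec.
both_visible = 55.0    # both GPUs visible to CUDA
single_visible = 65.0  # CUDA_VISIBLE_DEVICES restricted to GPU 0

# Slowdown measured against the faster single-GPU run.
slowdown_pct = (single_visible - both_visible) / single_visible * 100

# The same gap expressed as a speedup over the slower two-GPU run.
speedup_pct = (single_visible - both_visible) / both_visible * 100

print(f"slowdown with second GPU visible: {slowdown_pct:.1f}%")
print(f"speedup after hiding second GPU:  {speedup_pct:.1f}%")
```

So merely having the idle card visible costs roughly 15% of generation throughput in this test.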
First Bad Commit
No response
Relevant log output
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors:        CUDA0 model buffer size =  9505.88 MiB
load_tensors:   CPU_Mapped model buffer size =   510.47 MiB
..........................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 4000
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: freq_base     = 10000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (1010000) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     0.58 MiB
init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 48, can_shift = 1
init:      CUDA0 KV buffer size =   768.00 MiB
llama_context: KV self size  =  768.00 MiB, K (f16):  384.00 MiB, V (f16):  384.00 MiB
llama_context: pipeline parallelism enabled (n_copies=4)
llama_context:      CUDA0 compute buffer size =   379.02 MiB
llama_context:  CUDA_Host compute buffer size =    42.02 MiB
llama_context: graph nodes  = 1591
llama_context: graph splits = 2
Here is the difference when only one card is made visible:
llama_context: KV self size  =  768.00 MiB, K (f16):  384.00 MiB, V (f16):  384.00 MiB
llama_context:      CUDA0 compute buffer size =   307.00 MiB
llama_context:  CUDA_Host compute buffer size =    18.01 MiB
llama_context: graph nodes  = 1591
llama_context: graph splits = 2
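So "pipeline parallelism" is enabled (n_copies=4) whenever the second GPU is visible, and the compute buffers grow with it (CUDA0: 379.02 MiB vs 307.00 MiB). This appears to cause internal delays even though nothing is offloaded to the 2nd GPU.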
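The comparison of the two logs can be sketched as a short script. This is only an illustration: the log excerpts are copied from the output above, while the regular expression and the `summarize` helper are my own and not part of llama.cpp.

```python
import re

# Excerpt of the log with both GPUs visible (copied from above).
LOG_BOTH_VISIBLE = """\
llama_context: pipeline parallelism enabled (n_copies=4)
llama_context:      CUDA0 compute buffer size =   379.02 MiB
llama_context:  CUDA_Host compute buffer size =    42.02 MiB
"""

# Excerpt of the log with only GPU 0 visible (copied from above).
LOG_ONE_VISIBLE = """\
llama_context:      CUDA0 compute buffer size =   307.00 MiB
llama_context:  CUDA_Host compute buffer size =    18.01 MiB
"""

# Matches lines such as "llama_context:  CUDA0 compute buffer size = 379.02 MiB".
BUFFER_RE = re.compile(
    r"^llama_context:\s+(\S+) compute buffer size =\s+([\d.]+) MiB", re.M
)


def summarize(log):
    """Return whether pipeline parallelism was enabled and the buffer sizes."""
    buffers = {name: float(size) for name, size in BUFFER_RE.findall(log)}
    return {
        "pipeline_parallelism": "pipeline parallelism enabled" in log,
        "compute_buffers_mib": buffers,
    }


both = summarize(LOG_BOTH_VISIBLE)
one = summarize(LOG_ONE_VISIBLE)

# Extra compute buffer memory allocated when the second GPU is visible.
extra = {
    name: round(size - one["compute_buffers_mib"][name], 2)
    for name, size in both["compute_buffers_mib"].items()
}
print(both["pipeline_parallelism"], one["pipeline_parallelism"], extra)
```

The only differences between the two runs are the pipeline parallelism line and the larger compute buffers. The model and KV buffers are on CUDA0 in both.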
Update
Manually adding -sm none solves the problem, but that is not something most llama.cpp users would think to try.
By default, a GPU that nothing is offloaded to should not be used at all.
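For anyone scripting the comparison, the two workarounds from this report (hiding the second GPU via CUDA_VISIBLE_DEVICES, or adding -sm none) could be wrapped roughly as follows. This is a sketch under assumptions: the binary name `llama-cli` and the helper `build_run` are hypothetical and do not appear in the report. The argument list is the one from the test command above.

```python
import os

# Assumption: the executable name is not stated in the report.
LLAMA_BIN = "llama-cli"

# Arguments exactly as given in the test command above.
REPORT_ARGS = [
    "-m", "Qwen.Qwen2.5-14B-Instruct-1M.Q5_K_M.gguf",
    "-ts", "1,0", "-b", "4000", "-p", "Hello world", "-ngl", "100",
    "--verbose-prompt", "-st", "-n", "128", "--ignore-eos",
    "-fa", "-c", "4096", "-mg", "0",
]


def build_run(args, hide_other_gpus=False, disable_split=False, base_env=None):
    """Return (command, environment) for one benchmark run.

    hide_other_gpus: expose only device 0 to CUDA (first workaround).
    disable_split:   append "-sm none" (second workaround).
    """
    env = dict(os.environ if base_env is None else base_env)
    cmd = [LLAMA_BIN, *args]
    if hide_other_gpus:
        env["CUDA_VISIBLE_DEVICES"] = "0"
    if disable_split:
        cmd += ["-sm", "none"]
    return cmd, env


# Build the second workaround; pass both values to subprocess.run to execute:
#   subprocess.run(cmd, env=env, check=True)
cmd, env = build_run(REPORT_ARGS, disable_split=True)
print(" ".join(cmd))
```

Neither workaround changes which layers are offloaded. Both only stop the idle second card from influencing how the context is set up.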