Name and Version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
version: 5109 (66d17c5a)
built with MSVC 19.40.33811.0 for x64
Operating systems
Windows
GGML backends
CUDA
Hardware
4090+3090
Models
Qwen2.5-14B-Instruct-1M (Q5_K_M)
Problem description & steps to reproduce
Testing:
-m "Qwen.Qwen2.5-14B-Instruct-1M.Q5_K_M.gguf" -ts 1,0 -b 4000 -p "Hello world" -ngl 100 --verbose-prompt -st -n 128 --ignore-eos -fa -c 4096 -mg 0
Token generation speed on my 4090 is 55 tokens/sec when using this command. Adding the following before it (to force computation onto GPU 0) boosts token generation speed to 65 tokens/sec:
$env:CUDA_VISIBLE_DEVICES = "0";
The log shows in BOTH cases that only one card is used (see the relevant log output below).
Something is slowing the CUDA backend down significantly when a second GPU is visible, even if nothing is offloaded to it and it is just an idle secondary card.
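For reference, a minimal sketch of the two runs being compared (PowerShell). The flags are copied verbatim from the command above; the llama-cli binary name is my assumption about which executable was used.
# Run 1: both GPUs visible (default) -- reported ~55 tokens/sec
.\llama-cli.exe -m "Qwen.Qwen2.5-14B-Instruct-1M.Q5_K_M.gguf" -ts 1,0 -b 4000 -p "Hello world" -ngl 100 --verbose-prompt -st -n 128 --ignore-eos -fa -c 4096 -mg 0
# Run 2: only the 4090 (device 0) visible -- reported ~65 tokens/sec
$env:CUDA_VISIBLE_DEVICES = "0"
.\llama-cli.exe -m "Qwen.Qwen2.5-14B-Instruct-1M.Q5_K_M.gguf" -ts 1,0 -b 4000 -p "Hello world" -ngl 100 --verbose-prompt -st -n 128 --ignore-eos -fa -c 4096 -mg 0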
First Bad Commit
No response
Relevant log output
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors: CUDA0 model buffer size = 9505.88 MiB
load_tensors: CPU_Mapped model buffer size = 510.47 MiB
..........................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 4000
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: freq_base = 10000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (1010000) -- the full capacity of the model will not be utilized
llama_context: CUDA_Host output buffer size = 0.58 MiB
init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 48, can_shift = 1
init: CUDA0 KV buffer size = 768.00 MiB
llama_context: KV self size = 768.00 MiB, K (f16): 384.00 MiB, V (f16): 384.00 MiB
llama_context: pipeline parallelism enabled (n_copies=4)
llama_context: CUDA0 compute buffer size = 379.02 MiB
llama_context: CUDA_Host compute buffer size = 42.02 MiB
llama_context: graph nodes = 1591
llama_context: graph splits = 2

Here is the difference when only one card is visible: "pipeline parallelism" is no longer enabled.
So "pipeline parallelism" is enabled and causes internal delays despite having nothing offloaded on the 2nd gpu
Update
Manually adding -sm none solves the problem, but that is not something most people using llama.cpp would think to do. Any GPU that is not being offloaded to should not be used by default.
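For anyone else hitting this, the full workaround under the same assumptions as the sketches above (the only change relative to the original command is -sm none):
# -sm none keeps the whole model on the main GPU (-mg 0), so the second visible
# device is never involved and the pipeline-parallelism path is not taken
.\llama-cli.exe -m "Qwen.Qwen2.5-14B-Instruct-1M.Q5_K_M.gguf" -sm none -ts 1,0 -b 4000 -p "Hello world" -ngl 100 --verbose-prompt -st -n 128 --ignore-eos -fa -c 4096 -mg 0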