CUDA performance bug when two cards are visible and only one is used · Issue #12838 · ggml-org/llama.cpp
CUDA performance bug when two cards are visible and only one is used #12838
Closed
@cmp-nct

Description


Name and Version

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
version: 5109 (66d17c5a)
built with MSVC 19.40.33811.0 for x64

Operating systems

Windows

GGML backends

CUDA

Hardware

4090+3090

Models

Qwen 14B Q5KM

Problem description & steps to reproduce

Testing:
-m "Qwen.Qwen2.5-14B-Instruct-1M.Q5_K_M.gguf" -ts 1,0 -b 4000 -p "Hello world" -ngl 100 --verbose-prompt -st -n 128 --ignore-eos -fa -c 4096 -mg 0

Token generation speed on my 4090 is 55 tokens/sec with the command above (which forces all computation onto GPU 0 via -ts 1,0 and -mg 0). Setting
$env:CUDA_VISIBLE_DEVICES = "0";
before launching boosts token generation to 65 tokens/sec.
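
For anyone reproducing this, a sketch of the two runs in PowerShell (the binary name .\llama-cli.exe is an assumption; the flags are the ones quoted above, and Remove-Item clears the variable again for the baseline run):

# Baseline: both cards visible, all layers kept on GPU 0 via -ts 1,0 / -mg 0 (~55 t/s)
Remove-Item Env:\CUDA_VISIBLE_DEVICES -ErrorAction SilentlyContinue
.\llama-cli.exe -m "Qwen.Qwen2.5-14B-Instruct-1M.Q5_K_M.gguf" -ts 1,0 -b 4000 -p "Hello world" -ngl 100 --verbose-prompt -st -n 128 --ignore-eos -fa -c 4096 -mg 0

# Second run: hide the 3090 from the process entirely (~65 t/s)
$env:CUDA_VISIBLE_DEVICES = "0"
.\llama-cli.exe -m "Qwen.Qwen2.5-14B-Instruct-1M.Q5_K_M.gguf" -ts 1,0 -b 4000 -p "Hello world" -ngl 100 --verbose-prompt -st -n 128 --ignore-eos -fa -c 4096 -mg 0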

The log output below shows that in BOTH cases only one card is used.

Something slows the CUDA backend down significantly when a second GPU is visible, even though nothing is offloaded to it and it is just a secondary, idle card.
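
To confirm that the second card really does stay idle during generation, its utilization can be watched in a second terminal while the test runs (standard nvidia-smi query options):

# Poll index, utilization and memory of both cards once per second
nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used --format=csv -l 1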

First Bad Commit

No response

Relevant log output

load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors:        CUDA0 model buffer size =  9505.88 MiB
load_tensors:   CPU_Mapped model buffer size =   510.47 MiB
..........................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 4000
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: freq_base     = 10000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (1010000) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     0.58 MiB
init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 48, can_shift = 1
init:      CUDA0 KV buffer size =   768.00 MiB
llama_context: KV self size  =  768.00 MiB, K (f16):  384.00 MiB, V (f16):  384.00 MiB
llama_context: pipeline parallelism enabled (n_copies=4)
llama_context:      CUDA0 compute buffer size =   379.02 MiB
llama_context:  CUDA_Host compute buffer size =    42.02 MiB
llama_context: graph nodes  = 1591
llama_context: graph splits = 2

Here is the difference when only one card is made visible:

llama_context: KV self size  =  768.00 MiB, K (f16):  384.00 MiB, V (f16):  384.00 MiB
llama_context:      CUDA0 compute buffer size =   307.00 MiB
llama_context:  CUDA_Host compute buffer size =    18.01 MiB
llama_context: graph nodes  = 1591
llama_context: graph splits = 2

So "pipeline parallelism" is enabled and causes internal delays despite having nothing offloaded on the 2nd gpu

Update
Manually adding -sm none solves the problem, but that is not something most people using llama.cpp would think to do. Any GPU that receives no offloaded layers should not be used by default.
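
For reference, a sketch of the full workaround command (the binary name .\llama-cli.exe is an assumption; everything else is the original test command plus -sm none):

# -sm none disables layer splitting, so only the main GPU (-mg 0) takes part
.\llama-cli.exe -m "Qwen.Qwen2.5-14B-Instruct-1M.Q5_K_M.gguf" -ts 1,0 -b 4000 -p "Hello world" -ngl 100 --verbose-prompt -st -n 128 --ignore-eos -fa -c 4096 -mg 0 -sm none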
