Description
Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
Provide a command-line argument that lets the user select whether CPU layers are assigned first or last when splitting tensors across devices.
Motivation
Background: I am testing out hybrid layer quants for GGUFs #13040. The basic idea is to use smaller quants at early layers and bigger quants at cortex layers. The goal is to reduce GGUF size so that either more layers can be offloaded to the GPU(s) or space opens up for a bigger KV cache in the GPU(s), while maintaining high performance at the same time.
Problem: when using CPU + GPU, the tensor split is currently hardcoded to assign the CPU to the first layers of the GGUF. This is exactly the opposite of optimal for a GGUF whose first layers are smaller because they use smaller quants (Murphy's law wins again). By loading the smaller early layers onto the GPU instead, more layers can be offloaded. As an example, the Q2_K_H Llama Scout hybrid quant:
https://huggingface.co/steampunque/Llama-4-Scout-17B-16E-Instruct-GGUF/resolve/main/Llama-4-Scout-17B-16E-Instruct.Q2_K_H.gguf
will only offload 34 layers onto 3x 4070 in a local LAN RPC setup. The bigger cortex layers are allocated to the GPUs, so there is less room for more layers or KV:
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer 0 assigned to device CPU, is_swa = 1
load_tensors: layer 1 assigned to device CPU, is_swa = 1
load_tensors: layer 2 assigned to device CPU, is_swa = 1
load_tensors: layer 3 assigned to device CPU, is_swa = 0
load_tensors: layer 4 assigned to device CPU, is_swa = 1
load_tensors: layer 5 assigned to device CPU, is_swa = 1
load_tensors: layer 6 assigned to device CPU, is_swa = 1
load_tensors: layer 7 assigned to device CPU, is_swa = 0
load_tensors: layer 8 assigned to device CPU, is_swa = 1
load_tensors: layer 9 assigned to device CPU, is_swa = 1
load_tensors: layer 10 assigned to device CPU, is_swa = 1
load_tensors: layer 11 assigned to device CPU, is_swa = 0
load_tensors: layer 12 assigned to device CPU, is_swa = 1
load_tensors: layer 13 assigned to device CPU, is_swa = 1
load_tensors: layer 14 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer 15 assigned to device RPC[192.9.200.5:50052], is_swa = 0
load_tensors: layer 16 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer 17 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer 18 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer 19 assigned to device RPC[192.9.200.5:50052], is_swa = 0
load_tensors: layer 20 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer 21 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer 22 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer 23 assigned to device RPC[192.9.200.5:50052], is_swa = 0
load_tensors: layer 24 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer 25 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer 26 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer 27 assigned to device RPC[192.9.200.4:50052], is_swa = 0
load_tensors: layer 28 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer 29 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer 30 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer 31 assigned to device RPC[192.9.200.4:50052], is_swa = 0
load_tensors: layer 32 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer 33 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer 34 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer 35 assigned to device RPC[192.9.200.4:50052], is_swa = 0
load_tensors: layer 36 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer 37 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer 38 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer 39 assigned to device CUDA0, is_swa = 0
load_tensors: layer 40 assigned to device CUDA0, is_swa = 1
load_tensors: layer 41 assigned to device CUDA0, is_swa = 1
load_tensors: layer 42 assigned to device CUDA0, is_swa = 1
load_tensors: layer 43 assigned to device CUDA0, is_swa = 0
load_tensors: layer 44 assigned to device CUDA0, is_swa = 1
load_tensors: layer 45 assigned to device CUDA0, is_swa = 1
load_tensors: layer 46 assigned to device CUDA0, is_swa = 1
load_tensors: layer 47 assigned to device CUDA0, is_swa = 0
load_tensors: layer 48 assigned to device CPU, is_swa = 0
load_tensors: tensor 'token_embd.weight' (q3_K) (and 198 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
load_tensors: offloading 34 repeating layers to GPU
load_tensors: offloaded 34/49 layers to GPU
Proposal #1: a control flag to select whether the CPU is offloaded to first or last. In llama-model.cpp, changing this one line results in the last layers going to the CPU and the early layers going to the GPU devices:
// const int i_gpu_start = std::max((int) hparams.n_layer - n_gpu_layers, (int) 0);
const int i_gpu_start = 0;
This change allows 38 layers to be offloaded instead of 34, or opens up more room in the GPUs for a bigger KV cache (a standalone sketch of the proposed switch follows the log below):
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer 0 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer 1 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer 2 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer 3 assigned to device RPC[192.9.200.5:50052], is_swa = 0
load_tensors: layer 4 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer 5 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer 6 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer 7 assigned to device RPC[192.9.200.5:50052], is_swa = 0
load_tensors: layer 8 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer 9 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer 10 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer 11 assigned to device RPC[192.9.200.5:50052], is_swa = 0
load_tensors: layer 12 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer 13 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer 14 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer 15 assigned to device RPC[192.9.200.4:50052], is_swa = 0
load_tensors: layer 16 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer 17 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer 18 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer 19 assigned to device RPC[192.9.200.4:50052], is_swa = 0
load_tensors: layer 20 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer 21 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer 22 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer 23 assigned to device RPC[192.9.200.4:50052], is_swa = 0
load_tensors: layer 24 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer 25 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer 26 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer 27 assigned to device CUDA0, is_swa = 0
load_tensors: layer 28 assigned to device CUDA0, is_swa = 1
load_tensors: layer 29 assigned to device CUDA0, is_swa = 1
load_tensors: layer 30 assigned to device CUDA0, is_swa = 1
load_tensors: layer 31 assigned to device CUDA0, is_swa = 0
load_tensors: layer 32 assigned to device CUDA0, is_swa = 1
load_tensors: layer 33 assigned to device CUDA0, is_swa = 1
load_tensors: layer 34 assigned to device CUDA0, is_swa = 1
load_tensors: layer 35 assigned to device CUDA0, is_swa = 0
load_tensors: layer 36 assigned to device CUDA0, is_swa = 1
load_tensors: layer 37 assigned to device CUDA0, is_swa = 1
load_tensors: layer 38 assigned to device CPU, is_swa = 1
load_tensors: layer 39 assigned to device CPU, is_swa = 0
load_tensors: layer 40 assigned to device CPU, is_swa = 1
load_tensors: layer 41 assigned to device CPU, is_swa = 1
load_tensors: layer 42 assigned to device CPU, is_swa = 1
load_tensors: layer 43 assigned to device CPU, is_swa = 0
load_tensors: layer 44 assigned to device CPU, is_swa = 1
load_tensors: layer 45 assigned to device CPU, is_swa = 1
load_tensors: layer 46 assigned to device CPU, is_swa = 1
load_tensors: layer 47 assigned to device CPU, is_swa = 0
load_tensors: layer 48 assigned to device CPU, is_swa = 0
load_tensors: tensor 'token_embd.weight' (q3_K) (and 142 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
load_tensors: offloading 38 repeating layers to GPU
load_tensors: offloaded 38/49 layers to GPU
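A minimal standalone sketch of what the proposed switch could look like, assuming a hypothetical cpu_layers_last flag (the name and the simplified start/end arithmetic are mine, not existing llama.cpp code; the real loader also has to handle the output layer and per-device splits):
#include <cstdio>

int main() {
    const int  n_layer         = 49;    // repeating layers + output layer, as in the log above
    const int  n_gpu_layers    = 38;
    const bool cpu_layers_last = true;  // proposed switch; false = current behaviour

    // Current behaviour: GPU layers start at n_layer - n_gpu_layers, so the CPU
    // keeps the first layers. With the switch set, GPU layers start at 0 and the
    // CPU keeps the last layers instead.
    const int i_gpu_start = cpu_layers_last ? 0 : n_layer - n_gpu_layers;
    const int i_gpu_end   = i_gpu_start + n_gpu_layers;

    for (int il = 0; il < n_layer; ++il) {
        const bool on_gpu = il >= i_gpu_start && il < i_gpu_end;
        std::printf("layer %2d -> %s\n", il, on_gpu ? "GPU" : "CPU");
    }
}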
Proposal #2: first sort the layers by size, then apply the proposal #1 control switch so that smaller layers go into the GPU first. With hybrid quants the layer size differences can be significant (2x or more). This could result in non-sequential layers being assigned to a device, though, and I am not sure about the implications of that.
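A small illustrative sketch of proposal #2 (the per-layer sizes are made up; in practice they would come from the GGUF tensor metadata): rank the layers by size and offload the smallest first, which also shows how non-sequential layers could end up on a device:
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
    // Hypothetical per-layer sizes in MiB for a hybrid quant: mostly small early
    // layers and large cortex layers, but not strictly ordered.
    std::vector<size_t> layer_size_mib = {300, 310, 900, 320, 950, 980, 330, 1020};
    const int n_gpu_layers = 5;

    // Sort layer indices by ascending size so the smallest layers are offloaded first.
    std::vector<int> order(layer_size_mib.size());
    std::iota(order.begin(), order.end(), 0);
    std::stable_sort(order.begin(), order.end(),
        [&](int a, int b) { return layer_size_mib[a] < layer_size_mib[b]; });

    std::vector<bool> on_gpu(layer_size_mib.size(), false);
    for (int i = 0; i < n_gpu_layers && i < (int) order.size(); ++i) {
        on_gpu[order[i]] = true;
    }

    for (size_t il = 0; il < layer_size_mib.size(); ++il) {
        std::printf("layer %zu (%4zu MiB) -> %s\n", il, layer_size_mib[il],
                    on_gpu[il] ? "GPU" : "CPU");
    }
}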
Possible Implementation
--tensor-split-cpu-last flag controls whether the CPU layers are assigned last.
--tensor-split-layer-sort flag controls whether the layers are first sorted by size before the tensor split is applied.
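For example (hypothetical usage; -m and -ngl are existing options, the two new flags are only proposed here):
llama-cli -m Llama-4-Scout-17B-16E-Instruct.Q2_K_H.gguf -ngl 38 --tensor-split-cpu-last --tensor-split-layer-sort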