Description
Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
Provide a command-line argument that lets the user select whether CPU layers are assigned first or last when splitting tensors across devices.
Motivation
Background: I am testing out hybrid layer quants for GGUFs #13040. The basic idea is to use smaller quants at early layers and bigger quants at cortex layers. The goal is to reduce GGUF size so that either more layers can be offloaded to the GPU(s) or space opens up for a bigger KV cache in the GPU(s), while maintaining high performance at the same time.
Problem: when using CPU + GPU, the tensor split is currently hardcoded to assign the CPU to the first layers of the GGUF. This is exactly the opposite of optimal for a GGUF whose first layers are smaller because they use smaller quants (Murphy's law wins again). By loading the smaller early layers onto the GPU instead, more layers can be offloaded. As an example, the Q2_K_H Llama Scout hybrid quant:
https://huggingface.co/steampunque/Llama-4-Scout-17B-16E-Instruct-GGUF/resolve/main/Llama-4-Scout-17B-16E-Instruct.Q2_K_H.gguf
will only offload 34 layers onto 3x 4070 in a local LAN RPC setup. The bigger cortex layers are allocated to the GPUs, so there is less room for more layers or KV:
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer 0 assigned to device CPU, is_swa = 1
load_tensors: layer 1 assigned to device CPU, is_swa = 1
load_tensors: layer 2 assigned to device CPU, is_swa = 1
load_tensors: layer 3 assigned to device CPU, is_swa = 0
load_tensors: layer 4 assigned to device CPU, is_swa = 1
load_tensors: layer 5 assigned to device CPU, is_swa = 1
load_tensors: layer 6 assigned to device CPU, is_swa = 1
load_tensors: layer 7 assigned to device CPU, is_swa = 0
load_tensors: layer 8 assigned to device CPU, is_swa = 1
load_tensors: layer 9 assigned to device CPU, is_swa = 1
load_tensors: layer 10 assigned to device CPU, is_swa = 1
load_tensors: layer 11 assigned to device CPU, is_swa = 0
load_tensors: layer 12 assigned to device CPU, is_swa = 1
load_tensors: layer 13 assigned to device CPU, is_swa = 1
load_tensors: layer 14 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer 15 assigned to device RPC[192.9.200.5:50052], is_swa = 0
load_tensors: layer 16 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer 17 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer 18 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer 19 assigned to device RPC[192.9.200.5:50052], is_swa = 0
load_tensors: layer 20 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer 21 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer 22 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer 23 assigned to device RPC[192.9.200.5:50052], is_swa = 0
load_tensors: layer 24 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer 25 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer 26 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer 27 assigned to device RPC[192.9.200.4:50052], is_swa = 0
load_tensors: layer 28 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer 29 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer 30 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer 31 assigned to device RPC[192.9.200.4:50052], is_swa = 0
load_tensors: layer 32 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer 33 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer 34 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer 35 assigned to device RPC[192.9.200.4:50052], is_swa = 0
load_tensors: layer 36 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer 37 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer 38 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer 39 assigned to device CUDA0, is_swa = 0
load_tensors: layer 40 assigned to device CUDA0, is_swa = 1
load_tensors: layer 41 assigned to device CUDA0, is_swa = 1
load_tensors: layer 42 assigned to device CUDA0, is_swa = 1
load_tensors: layer 43 assigned to device CUDA0, is_swa = 0
load_tensors: layer 44 assigned to device CUDA0, is_swa = 1
load_tensors: layer 45 assigned to device CUDA0, is_swa = 1
load_tensors: layer 46 assigned to device CUDA0, is_swa = 1
load_tensors: layer 47 assigned to device CUDA0, is_swa = 0
load_tensors: layer 48 assigned to device CPU, is_swa = 0
load_tensors: tensor 'token_embd.weight' (q3_K) (and 198 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
load_tensors: offloading 34 repeating layers to GPU
load_tensors: offloaded 34/49 layers to GPU
Proposal #1: a control flag to select whether the CPU is offloaded to first or last. In llama-model.cpp, changing this one line results in the last layers going to the CPU and the early layers going to the GPU devices:
// const int i_gpu_start = std::max((int) hparams.n_layer - n_gpu_layers, (int) 0);
const int i_gpu_start = 0;
This change allows 38 layers to be offloaded instead of 34, or opens up more room in the GPUs for a bigger KV cache (a standalone sketch of the proposed switch follows the log below):
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer 0 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer 1 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer 2 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer 3 assigned to device RPC[192.9.200.5:50052], is_swa = 0
load_tensors: layer 4 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer 5 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer 6 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer 7 assigned to device RPC[192.9.200.5:50052], is_swa = 0
load_tensors: layer 8 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer 9 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer 10 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer 11 assigned to device RPC[192.9.200.5:50052], is_swa = 0
load_tensors: layer 12 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer 13 assigned to device RPC[192.9.200.5:50052], is_swa = 1
load_tensors: layer 14 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer 15 assigned to device RPC[192.9.200.4:50052], is_swa = 0
load_tensors: layer 16 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer 17 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer 18 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer 19 assigned to device RPC[192.9.200.4:50052], is_swa = 0
load_tensors: layer 20 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer 21 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer 22 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer 23 assigned to device RPC[192.9.200.4:50052], is_swa = 0
load_tensors: layer 24 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer 25 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer 26 assigned to device RPC[192.9.200.4:50052], is_swa = 1
load_tensors: layer 27 assigned to device CUDA0, is_swa = 0
load_tensors: layer 28 assigned to device CUDA0, is_swa = 1
load_tensors: layer 29 assigned to device CUDA0, is_swa = 1
load_tensors: layer 30 assigned to device CUDA0, is_swa = 1
load_tensors: layer 31 assigned to device CUDA0, is_swa = 0
load_tensors: layer 32 assigned to device CUDA0, is_swa = 1
load_tensors: layer 33 assigned to device CUDA0, is_swa = 1
load_tensors: layer 34 assigned to device CUDA0, is_swa = 1
load_tensors: layer 35 assigned to device CUDA0, is_swa = 0
load_tensors: layer 36 assigned to device CUDA0, is_swa = 1
load_tensors: layer 37 assigned to device CUDA0, is_swa = 1
load_tensors: layer 38 assigned to device CPU, is_swa = 1
load_tensors: layer 39 assigned to device CPU, is_swa = 0
load_tensors: layer 40 assigned to device CPU, is_swa = 1
load_tensors: layer 41 assigned to device CPU, is_swa = 1
load_tensors: layer 42 assigned to device CPU, is_swa = 1
load_tensors: layer 43 assigned to device CPU, is_swa = 0
load_tensors: layer 44 assigned to device CPU, is_swa = 1
load_tensors: layer 45 assigned to device CPU, is_swa = 1
load_tensors: layer 46 assigned to device CPU, is_swa = 1
load_tensors: layer 47 assigned to device CPU, is_swa = 0
load_tensors: layer 48 assigned to device CPU, is_swa = 0
load_tensors: tensor 'token_embd.weight' (q3_K) (and 142 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
load_tensors: offloading 38 repeating layers to GPU
load_tensors: offloaded 38/49 layers to GPU
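A minimal standalone sketch of what the proposed switch could look like, assuming a hypothetical cpu_layers_last flag (the name and the simplified start/end arithmetic are mine, not existing llama.cpp code; the real loader also has to handle the output layer and per-device splits):
#include <cstdio>

int main() {
    const int  n_layer         = 49;    // repeating layers + output layer, as in the log above
    const int  n_gpu_layers    = 38;
    const bool cpu_layers_last = true;  // proposed switch; false = current behaviour

    // Current behaviour: GPU layers start at n_layer - n_gpu_layers, so the CPU
    // keeps the first layers. With the switch set, GPU layers start at 0 and the
    // CPU keeps the last layers instead.
    const int i_gpu_start = cpu_layers_last ? 0 : n_layer - n_gpu_layers;
    const int i_gpu_end   = i_gpu_start + n_gpu_layers;

    for (int il = 0; il < n_layer; ++il) {
        const bool on_gpu = il >= i_gpu_start && il < i_gpu_end;
        std::printf("layer %2d -> %s\n", il, on_gpu ? "GPU" : "CPU");
    }
}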
Proposal #2: first sort the layers by size, then apply the proposal #1 control switch so that smaller layers go into the GPU first. With hybrid quants the layer size differences can be significant (2x or more). This could result in non-sequential layers being assigned to a device, though, and I am not sure about the implications of that.
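A small illustrative sketch of proposal #2 (the per-layer sizes are made up; in practice they would come from the GGUF tensor metadata): rank the layers by size and offload the smallest first, which also shows how non-sequential layers could end up on a device:
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
    // Hypothetical per-layer sizes in MiB for a hybrid quant: mostly small early
    // layers and large cortex layers, but not strictly ordered.
    std::vector<size_t> layer_size_mib = {300, 310, 900, 320, 950, 980, 330, 1020};
    const int n_gpu_layers = 5;

    // Sort layer indices by ascending size so the smallest layers are offloaded first.
    std::vector<int> order(layer_size_mib.size());
    std::iota(order.begin(), order.end(), 0);
    std::stable_sort(order.begin(), order.end(),
        [&](int a, int b) { return layer_size_mib[a] < layer_size_mib[b]; });

    std::vector<bool> on_gpu(layer_size_mib.size(), false);
    for (int i = 0; i < n_gpu_layers && i < (int) order.size(); ++i) {
        on_gpu[order[i]] = true;
    }

    for (size_t il = 0; il < layer_size_mib.size(); ++il) {
        std::printf("layer %zu (%4zu MiB) -> %s\n", il, layer_size_mib[il],
                    on_gpu[il] ? "GPU" : "CPU");
    }
}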
Possible Implementation
--tensor-split-cpu-last flag controls whether the CPU layers are assigned last.
--tensor-split-layer-sort flag controls whether the layers are first sorted by size before the tensor split is applied.
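For example (hypothetical usage; -m and -ngl are existing options, the two new flags are only proposed here):
llama-cli -m Llama-4-Scout-17B-16E-Instruct.Q2_K_H.gguf -ngl 38 --tensor-split-cpu-last --tensor-split-layer-sort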