Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
llama.cpp currently uses a hardcoded minimum batch size of 32 for op offloading, and there is no option to disable offload_op unless the user manually sets -ub to 16 or less. It would be great if the user could disable offload_op explicitly without reducing -ub.
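For context, the batch-size threshold lives in the CUDA backend's offload_op hook. A minimal sketch of what that check looks like (paraphrased from ggml-cuda.cu; exact function and helper names may differ between versions):

```cpp
// Sketch of the CUDA backend's offload_op policy (paraphrased, not verbatim):
// the backend only volunteers to run a host-resident op on the GPU when the
// op's batch dimension is at least 32 rows.
static bool ggml_backend_cuda_device_offload_op(ggml_backend_dev_t dev, const ggml_tensor * op) {
    const int min_batch_size = 32;  // hardcoded threshold this issue asks to make configurable/bypassable
    return get_op_batch_size(op) >= min_batch_size;
    GGML_UNUSED(dev);
}
```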
Motivation
With the introduction of --override-tensor, it has become practical to offload experts to host DRAM in large MoEs while keeping the dense tensors on a GPU with relatively small VRAM. However, in the current implementation, prompt processing performance is not ideal in some configurations due to offload_op being used.
For example, when running llama4 400B with -ot exps=CPU on code from the master branch, prompt processing performance is extremely poor at the default -ub 512. Setting -ub 16 bypasses offload_op in the CUDA backend, but performance is still not on par with -ub 512 when offload_op is disabled in the source code.
llama-bench -m ~/models/llama4-400b-q4_0.gguf -ngl 999 -fa 1 -ctk q8_0 -ctv q8_0 -ot 'exps=CPU' -mmp 0 -ub 16,512
model | size | params | backend | ngl | n_ubatch | type_k | type_v | fa | ot | mmap | test | t/s |
---|---|---|---|---|---|---|---|---|---|---|---|---|
llama4 17Bx128E (Maverick) Q4_0 | 211.18 GiB | 400.71 B | ROCm,RPC | 999 | 16 | q8_0 | q8_0 | 1 | exps=CPU | 0 | pp512 | 93.68 ± 0.70 |
llama4 17Bx128E (Maverick) Q4_0 | 211.18 GiB | 400.71 B | ROCm,RPC | 999 | 16 | q8_0 | q8_0 | 1 | exps=CPU | 0 | tg128 | 26.68 ± 0.07 |
llama4 17Bx128E (Maverick) Q4_0 | 211.18 GiB | 400.71 B | ROCm,RPC | 999 | 512 | q8_0 | q8_0 | 1 | exps=CPU | 0 | pp512 | 23.14 ± 0.01 |
llama4 17Bx128E (Maverick) Q4_0 | 211.18 GiB | 400.71 B | ROCm,RPC | 999 | 512 | q8_0 | q8_0 | 1 | exps=CPU | 0 | tg128 | 26.61 ± 0.16 |
With offload_op changed to always return false in the CUDA backend, there is a roughly 10x prompt processing boost (a sketch of this kind of local patch follows the table below).
./build/bin/llama-bench -m ~/models/llama4-400b-q4_0.gguf -ngl 999 -fa 1 -ctk q8_0 -ctv q8_0 -ot 'exps=CPU' -mmp 0
model | size | params | backend | ngl | type_k | type_v | fa | ot | mmap | test | t/s |
---|---|---|---|---|---|---|---|---|---|---|---|
llama4 17Bx128E (Maverick) Q4_0 | 211.18 GiB | 400.71 B | ROCm,RPC | 999 | q8_0 | q8_0 | 1 | exps=CPU | 0 | pp512 | 233.66 ± 1.31 |
llama4 17Bx128E (Maverick) Q4_0 | 211.18 GiB | 400.71 B | ROCm,RPC | 999 | q8_0 | q8_0 | 1 | exps=CPU | 0 | tg128 | 26.91 ± 0.10 |
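The numbers above were produced with a local, experiment-only patch along these lines (a sketch, assuming the CUDA offload_op hook shown earlier; not a proposed upstream change):

```cpp
// Local experiment only: make the CUDA backend never volunteer to offload
// host-resident ops, so large -ub values no longer trigger offload_op copies.
static bool ggml_backend_cuda_device_offload_op(ggml_backend_dev_t dev, const ggml_tensor * op) {
    GGML_UNUSED(dev);
    GGML_UNUSED(op);
    return false;
}
```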
Possible Implementation
In ggml-backend.cpp, add an option and an additional check around the following offload_op call:
```cpp
// check if a backend with higher prio wants to offload the op
if (src_backend_id == sched->n_backends - 1 && ggml_backend_buffer_is_host(src->buffer)) {
    for (int b = 0; b < src_backend_id; b++) {
        if (ggml_backend_supports_op(sched->backends[b], tensor) && ggml_backend_offload_op(sched->backends[b], tensor)) {
            SET_CAUSE(tensor, "1.off");
            return b;
        }
    }
}
```
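One possible shape for this, as a sketch only: thread a boolean through the scheduler and consult it before asking backends whether they want to offload. The sched->op_offload field and the --no-op-offload flag mentioned below are assumptions for illustration, not existing API:

```cpp
// Sketch: gate the offload check behind a scheduler-level option.
// "op_offload" is a hypothetical new field on ggml_backend_sched, set from a
// hypothetical user-facing flag (e.g. --no-op-offload) at context creation.

// check if a backend with higher prio wants to offload the op
if (sched->op_offload && src_backend_id == sched->n_backends - 1 && ggml_backend_buffer_is_host(src->buffer)) {
    for (int b = 0; b < src_backend_id; b++) {
        if (ggml_backend_supports_op(sched->backends[b], tensor) && ggml_backend_offload_op(sched->backends[b], tensor)) {
            SET_CAUSE(tensor, "1.off");
            return b;
        }
    }
}
```

Defaulting the option to true would keep today's behavior unchanged, while passing the new flag would skip the offload_op query entirely and leave host-resident ops (such as the CPU-offloaded experts) on the CPU regardless of -ub.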