Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
llama.cpp currently uses a hardcoded minimum batch size of 32 for op offloading, and there is no option to disable offload_op unless the user manually sets -ub to 16 or less. It would be great if the user could disable offload_op explicitly without reducing -ub.
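For context, the batch-size threshold lives in the CUDA backend's offload_op hook. A minimal sketch of what that check looks like (paraphrased from ggml-cuda.cu; exact function and helper names may differ between versions):

```cpp
// Sketch of the CUDA backend's offload_op policy (paraphrased, not verbatim):
// the backend only volunteers to run a host-resident op on the GPU when the
// op's batch dimension is at least 32 rows.
static bool ggml_backend_cuda_device_offload_op(ggml_backend_dev_t dev, const ggml_tensor * op) {
    const int min_batch_size = 32;  // hardcoded threshold this issue asks to make configurable/bypassable
    return get_op_batch_size(op) >= min_batch_size;
    GGML_UNUSED(dev);
}
```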
Motivation
With the introduction of --override-tensor, it has become practical to offload experts to host DRAM in large MoEs while keeping the dense tensors on a GPU with relatively small VRAM. However, in the current implementation, prompt processing performance is not ideal in some configurations due to offload_op being used.
For example, when running llama4 400B with -ot exps=CPU on code from the master branch, prompt processing performance is extremely poor at the default -ub 512. Setting -ub 16 bypasses offload_op in the CUDA backend, but performance is still not on par with -ub 512 when offload_op is disabled in the source code.
llama-bench -m ~/models/llama4-400b-q4_0.gguf -ngl 999 -fa 1 -ctk q8_0 -ctv q8_0 -ot 'exps=CPU' -mmp 0 -ub 16,512
model | size | params | backend | ngl | n_ubatch | type_k | type_v | fa | ot | mmap | test | t/s |
---|---|---|---|---|---|---|---|---|---|---|---|---|
llama4 17Bx128E (Maverick) Q4_0 | 211.18 GiB | 400.71 B | ROCm,RPC | 999 | 16 | q8_0 | q8_0 | 1 | exps=CPU | 0 | pp512 | 93.68 ± 0.70 |
llama4 17Bx128E (Maverick) Q4_0 | 211.18 GiB | 400.71 B | ROCm,RPC | 999 | 16 | q8_0 | q8_0 | 1 | exps=CPU | 0 | tg128 | 26.68 ± 0.07 |
llama4 17Bx128E (Maverick) Q4_0 | 211.18 GiB | 400.71 B | ROCm,RPC | 999 | 512 | q8_0 | q8_0 | 1 | exps=CPU | 0 | pp512 | 23.14 ± 0.01 |
llama4 17Bx128E (Maverick) Q4_0 | 211.18 GiB | 400.71 B | ROCm,RPC | 999 | 512 | q8_0 | q8_0 | 1 | exps=CPU | 0 | tg128 | 26.61 ± 0.16 |
With offload_op changed to always return false in the CUDA backend, there is a roughly 10x prompt processing boost (a sketch of this kind of local patch follows the table below).
./build/bin/llama-bench -m ~/models/llama4-400b-q4_0.gguf -ngl 999 -fa 1 -ctk q8_0 -ctv q8_0 -ot 'exps=CPU' -mmp 0
model | size | params | backend | ngl | type_k | type_v | fa | ot | mmap | test | t/s |
---|---|---|---|---|---|---|---|---|---|---|---|
llama4 17Bx128E (Maverick) Q4_0 | 211.18 GiB | 400.71 B | ROCm,RPC | 999 | q8_0 | q8_0 | 1 | exps=CPU | 0 | pp512 | 233.66 ± 1.31 |
llama4 17Bx128E (Maverick) Q4_0 | 211.18 GiB | 400.71 B | ROCm,RPC | 999 | q8_0 | q8_0 | 1 | exps=CPU | 0 | tg128 | 26.91 ± 0.10 |
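The numbers above were produced with a local, experiment-only patch along these lines (a sketch, assuming the CUDA offload_op hook shown earlier; not a proposed upstream change):

```cpp
// Local experiment only: make the CUDA backend never volunteer to offload
// host-resident ops, so large -ub values no longer trigger offload_op copies.
static bool ggml_backend_cuda_device_offload_op(ggml_backend_dev_t dev, const ggml_tensor * op) {
    GGML_UNUSED(dev);
    GGML_UNUSED(op);
    return false;
}
```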
Possible Implementation
In ggml-backend.cpp, add an option and an additional check around the following offload_op call:
```cpp
// check if a backend with higher prio wants to offload the op
if (src_backend_id == sched->n_backends - 1 && ggml_backend_buffer_is_host(src->buffer)) {
    for (int b = 0; b < src_backend_id; b++) {
        if (ggml_backend_supports_op(sched->backends[b], tensor) && ggml_backend_offload_op(sched->backends[b], tensor)) {
            SET_CAUSE(tensor, "1.off");
            return b;
        }
    }
}
```
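One possible shape for this, as a sketch only: thread a boolean through the scheduler and consult it before asking backends whether they want to offload. The sched->op_offload field and the --no-op-offload flag mentioned below are assumptions for illustration, not existing API:

```cpp
// Sketch: gate the offload check behind a scheduler-level option.
// "op_offload" is a hypothetical new field on ggml_backend_sched, set from a
// hypothetical user-facing flag (e.g. --no-op-offload) at context creation.

// check if a backend with higher prio wants to offload the op
if (sched->op_offload && src_backend_id == sched->n_backends - 1 && ggml_backend_buffer_is_host(src->buffer)) {
    for (int b = 0; b < src_backend_id; b++) {
        if (ggml_backend_supports_op(sched->backends[b], tensor) && ggml_backend_offload_op(sched->backends[b], tensor)) {
            SET_CAUSE(tensor, "1.off");
            return b;
        }
    }
}
```

Defaulting the option to true would keep today's behavior unchanged, while passing the new flag would skip the offload_op query entirely and leave host-resident ops (such as the CPU-offloaded experts) on the CPU regardless of -ub.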