Feature Request: Allow disabling `offload_op` for backends by user · Issue #13241 · ggml-org/llama.cpp · GitHub

Feature Request: Allow disabling offload_op for backends by user #13241
@hjc4869

Description

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

llama.cpp currently uses a hardcoded minimum batch size of 32 for op offloading, and there is no option to disable offload_op unless the user manually specifies -ub 16 or less. It would be great if the user could disable offload_op explicitly without reducing -ub.
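
For context, the behavior described above boils down to something like the following in the CUDA backend's offload_op hook (a simplified paraphrase; the upstream function and helper names differ):

```cpp
// Simplified paraphrase of the CUDA backend's offload_op heuristic (not the
// exact upstream code): an op whose weights sit in host memory is offloaded to
// the GPU only when its batch dimension reaches a hardcoded threshold. This is
// why -ub 16 happens to bypass it, and why there is currently no way to turn
// it off at larger ubatch sizes without editing the source.
static bool cuda_offload_op_heuristic(const struct ggml_tensor * op) {
    const int min_batch_size = 32;      // hardcoded threshold this request asks to expose
    return op->ne[1] >= min_batch_size; // batch-dimension check (simplified)
}
```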

Motivation

With the introduction of --override-tensor, it has become practical to keep the expert tensors of large MoE models in host DRAM while keeping the dense tensors on a GPU with relatively small VRAM. However, in the current implementation, prompt processing performance is not ideal in some of these configurations because offload_op kicks in.

For example, when running llama4 400B with -ot exps=CPU on current master, prompt processing performance is extremely poor with the default -ub 512. Setting -ub 16 bypasses offload_op in the CUDA backend, but the performance is still not on par with -ub 512 with offload_op disabled in the source code.

llama-bench -m ~/models/llama4-400b-q4_0.gguf -ngl 999 -fa 1 -ctk q8_0 -ctv q8_0 -ot 'exps=CPU' -mmp 0 -ub 16,512

| model | size | params | backend | ngl | n_ubatch | type_k | type_v | fa | ot | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama4 17Bx128E (Maverick) Q4_0 | 211.18 GiB | 400.71 B | ROCm,RPC | 999 | 16 | q8_0 | q8_0 | 1 | exps=CPU | 0 | pp512 | 93.68 ± 0.70 |
| llama4 17Bx128E (Maverick) Q4_0 | 211.18 GiB | 400.71 B | ROCm,RPC | 999 | 16 | q8_0 | q8_0 | 1 | exps=CPU | 0 | tg128 | 26.68 ± 0.07 |
| llama4 17Bx128E (Maverick) Q4_0 | 211.18 GiB | 400.71 B | ROCm,RPC | 999 | 512 | q8_0 | q8_0 | 1 | exps=CPU | 0 | pp512 | 23.14 ± 0.01 |
| llama4 17Bx128E (Maverick) Q4_0 | 211.18 GiB | 400.71 B | ROCm,RPC | 999 | 512 | q8_0 | q8_0 | 1 | exps=CPU | 0 | tg128 | 26.61 ± 0.16 |

With offload_op changed to always return false in the CUDA backend, prompt processing is roughly 10x faster (a sketch of this local change follows the table below).
./build/bin/llama-bench -m ~/models/llama4-400b-q4_0.gguf -ngl 999 -fa 1 -ctk q8_0 -ctv q8_0 -ot 'exps=CPU' -mmp 0

| model | size | params | backend | ngl | type_k | type_v | fa | ot | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama4 17Bx128E (Maverick) Q4_0 | 211.18 GiB | 400.71 B | ROCm,RPC | 999 | q8_0 | q8_0 | 1 | exps=CPU | 0 | pp512 | 233.66 ± 1.31 |
| llama4 17Bx128E (Maverick) Q4_0 | 211.18 GiB | 400.71 B | ROCm,RPC | 999 | q8_0 | q8_0 | 1 | exps=CPU | 0 | tg128 | 26.91 ± 0.10 |
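
A minimal sketch of the local change used for the numbers above (same paraphrased hook as in the earlier sketch; this is a benchmark hack, not an existing option):

```cpp
// Local benchmark hack, not an upstream option: decline offloading
// unconditionally so the expert matmuls run on the CPU backend, next to
// their weights in host memory.
static bool cuda_offload_op_heuristic(const struct ggml_tensor * op) {
    (void) op;
    return false; // was: op->ne[1] >= 32
}
```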

Possible Implementation

In ggml-backend.cpp, additional options and checks could be added around the following offload_op check (a rough sketch follows the snippet):

// check if a backend with higher prio wants to offload the op
if (src_backend_id == sched->n_backends - 1 && ggml_backend_buffer_is_host(src->buffer)) {
    for (int b = 0; b < src_backend_id; b++) {
        if (ggml_backend_supports_op(sched->backends[b], tensor) && ggml_backend_offload_op(sched->backends[b], tensor)) {
            SET_CAUSE(tensor, "1.off");
            return b;
        }
    }
}
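
One possible shape for this, assuming a hypothetical user-facing switch (the environment variable and helper below are placeholders, not existing API; a real implementation would more likely plumb a flag through the scheduler and the common CLI arguments), is to consult the switch before polling backends:

```cpp
#include <cstdlib> // std::getenv

// Hypothetical helper: decide whether op offloading is enabled at all.
// "GGML_SCHED_DISABLE_OFFLOAD_OP" is a placeholder name, not an existing variable.
static bool sched_offload_op_enabled(void) {
    const char * v = std::getenv("GGML_SCHED_DISABLE_OFFLOAD_OP");
    return v == nullptr || v[0] == '0';
}

// check if a backend with higher prio wants to offload the op
if (sched_offload_op_enabled() &&
    src_backend_id == sched->n_backends - 1 &&
    ggml_backend_buffer_is_host(src->buffer)) {
    for (int b = 0; b < src_backend_id; b++) {
        if (ggml_backend_supports_op(sched->backends[b], tensor) && ggml_backend_offload_op(sched->backends[b], tensor)) {
            SET_CAUSE(tensor, "1.off");
            return b;
        }
    }
}
```

Alternatively, the hardcoded minimum batch size itself could be exposed as a tunable instead of a boolean switch.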

Metadata

Assignees: No one assigned
Labels: enhancement (New feature or request)
