Worse speed and GPU load than pure llama-cpp · abetlen llama-cpp-python · Discussion #1831 · GitHub

Worse speed and GPU load than pure llama-cpp #1831

Answered by Mushoz
Mushoz asked this question in Q&A

Managed to find the answer myself. For some reason the logits_all parameter defaults to true and tanks performance. Setting it to false brings the performance on par with pure llama-cpp. Not sure if that's a sensible default, but at least I managed to solve the problem. GPU load is also back to 100% again.
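For anyone hitting the same issue, here is a minimal sketch of passing the parameter explicitly when constructing the model. The model path and the other constructor values below are only illustrative; adjust them for your own setup.

```python
from llama_cpp import Llama

# Load the model with logits_all explicitly disabled, so logits are only
# computed for the last token rather than for every token in the prompt.
llm = Llama(
    model_path="./models/model.gguf",  # hypothetical path, replace with your own
    n_gpu_layers=-1,   # offload all layers to the GPU
    n_ctx=4096,        # context window, adjust to your model
    logits_all=False,  # avoid computing logits for every prompt token
)

# Simple completion call to confirm generation speed and GPU load.
output = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
print(output["choices"][0]["text"])
```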

Replies: 1 comment, 2 replies (@ExtReMLapin, @gl2007)
Answer selected by Mushoz
Category: Q&A · Labels: None yet · 3 participants