Eval bug: RWKV inference issue with llama-server #13018

Open
blakkd opened this issue Apr 19, 2025 · 0 comments
blakkd commented Apr 19, 2025

Name and Version

build b5155

~/l/b/bin ❯❯❯ ./llama-server --version
version: 5155 (64082100)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

Operating systems

Linux

GGML backends

CUDA

Hardware

RTX 3090 24GB

Models

LatentWanderer/featherless-ai_Qwerky-QwQ-32B-gguf

Problem description & steps to reproduce

llama-cli works as intended, but with llama-server only the first generation works fine.
Once that first generation finishes (or is cancelled), the server crashes on any new generation attempt.

What exactly I did:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j 16
cd build/bin
./llama-server -m /home/user/Downloads/featherless-ai_Qwerky-QwQ-32B-Q4_K_M.gguf -ngl 65 -c 2048 --port 8082 -n 50
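
To exercise the crash, it is enough to send two consecutive requests to the server's OpenAI-compatible /v1/chat/completions endpoint; the second one triggers the assertion. The prompt and parameters below are only an illustrative sketch, not the exact requests from the log:

curl http://127.0.0.1:8082/v1/chat/completions -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}]}'

# sending the same request a second time makes the server abort
curl http://127.0.0.1:8082/v1/chat/completions -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}]}'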

First Bad Commit

I can't tell exactly right now, but with a much older version, for example b4616, the bug is not encountered.

Relevant log output

Here are 2 consecutive generation requests:

./llama-server -m /home/user/Downloads/featherless-ai_Qwerky-QwQ-32B-Q4_K_M.gguf -ngl 65 -c 2048 --port 8082 -n 50

.
.
.

main: server is listening on http://127.0.0.1:8082 - starting the main loop
srv  update_slots: all slots are idle
srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 2048, n_keep = 0, n_prompt_tokens = 20
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 20, n_tokens = 20, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 20, n_tokens = 20
slot      release: id  0 | task 0 | stop processing: n_past = 69, truncated = 0
slot print_timing: id  0 | task 0 | 
prompt eval time =     110.29 ms /    20 tokens (    5.51 ms per token,   181.35 tokens per second)
       eval time =    1884.07 ms /    50 tokens (   37.68 ms per token,    26.54 tokens per second)
      total time =    1994.36 ms /    70 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 51 | processing task
slot update_slots: id  0 | task 51 | new prompt, n_ctx_slot = 2048, n_keep = 0, n_prompt_tokens = 20
slot update_slots: id  0 | task 51 | need to evaluate at least 1 token to generate logits, n_past = 20, n_prompt_tokens = 20
slot update_slots: id  0 | task 51 | kv cache rm [0, end)
slot update_slots: id  0 | task 51 | prompt processing progress, n_past = 20, n_tokens = 20, progress = 1.000000
slot update_slots: id  0 | task 51 | prompt done, n_past = 20, n_tokens = 20
/home/user/llama.cpp-b5155/src/llama-kv-cache.cpp:599: GGML_ASSERT(empty_cell.is_empty()) failed
Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
github-actions bot added the stale label May 20, 2025