Eval bug: RWKV inference issue with llama-server #13018

Open
blakkd opened this issue Apr 19, 2025 · 0 comments
blakkd commented Apr 19, 2025

Name and Version

build b5155

~/l/b/bin ❯❯❯ ./llama-server --version
version: 5155 (64082100)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

Operating systems

Linux

GGML backends

CUDA

Hardware

RTX 3090 24GB

Models

LatentWanderer/featherless-ai_Qwerky-QwQ-32B-gguf

Problem description & steps to reproduce

llama-cli works as intended, but with llama-server only the first generation works fine.
Once that first generation finishes (or is cancelled), the server crashes on any new generation attempt.

What exactly I did:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j 16
cd build/bin
./llama-server -m /home/user/Downloads/featherless-ai_Qwerky-QwQ-32B-Q4_K_M.gguf -ngl 65 -c 2048 --port 8082 -n 50
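
To exercise the crash, it is enough to send two consecutive requests to the server's OpenAI-compatible /v1/chat/completions endpoint; the second one triggers the assertion. The prompt and parameters below are only an illustrative sketch, not the exact requests from the log:

curl http://127.0.0.1:8082/v1/chat/completions -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}]}'

# sending the same request a second time makes the server abort
curl http://127.0.0.1:8082/v1/chat/completions -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}]}'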

First Bad Commit

I can't tell exactly right now, but with a much older version, for example b4616, the bug is not encountered.

Relevant log output

Here are 2 consecutive generation requests:

./llama-server -m /home/user/Downloads/featherless-ai_Qwerky-QwQ-32B-Q4_K_M.gguf -ngl 65 -c 2048 --port 8082 -n 50

.
.
.

main: server is listening on http://127.0.0.1:8082 - starting the main loop
srv  update_slots: all slots are idle
srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 2048, n_keep = 0, n_prompt_tokens = 20
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 20, n_tokens = 20, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 20, n_tokens = 20
slot      release: id  0 | task 0 | stop processing: n_past = 69, truncated = 0
slot print_timing: id  0 | task 0 | 
prompt eval time =     110.29 ms /    20 tokens (    5.51 ms per token,   181.35 tokens per second)
       eval time =    1884.07 ms /    50 tokens (   37.68 ms per token,    26.54 tokens per second)
      total time =    1994.36 ms /    70 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 51 | processing task
slot update_slots: id  0 | task 51 | new prompt, n_ctx_slot = 2048, n_keep = 0, n_prompt_tokens = 20
slot update_slots: id  0 | task 51 | need to evaluate at least 1 token to generate logits, n_past = 20, n_prompt_tokens = 20
slot update_slots: id  0 | task 51 | kv cache rm [0, end)
slot update_slots: id  0 | task 51 | prompt processing progress, n_past = 20, n_tokens = 20, progress = 1.000000
slot update_slots: id  0 | task 51 | prompt done, n_past = 20, n_tokens = 20
/home/user/llama.cpp-b5155/src/llama-kv-cache.cpp:599: GGML_ASSERT(empty_cell.is_empty()) failed
Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
github-actions bot added the stale label May 20, 2025