Name and Version
Image: https://github.com/ggml-org/llama.cpp/pkgs/container/llama.cpp/421179669?tag=server-cuda-b5452
Operating systems
Linux
GGML backends
CUDA
Hardware
NVIDIA L40S GPU (AWS g6e.xlarge instance)
Models
https://huggingface.co/lmstudio-community/gemma-3-27b-it-GGUF/resolve/main/gemma-3-27b-it-Q4_K_M.gguf
Problem description & steps to reproduce
Command:
docker run -p 8080:8080 -it --gpus all ghcr.io/ggml-org/llama.cpp:server-cuda --port 8080 --host 0.0.0.0 --parallel 16 --ctx-size 49152 --cont-batching --slot-prompt-similarity 0.3 --n-gpu-layers 100000000 --flash-attn --no-warmup --jinja --lora-init-without-apply --lora-scaled ... -m ...
After high-load requests (50 concurrent users), the server responds with empty content even though completion_tokens is non-zero.
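To make the reproduction concrete, here is a minimal load-generator sketch that fires 50 concurrent requests and counts responses showing the symptom (non-zero completion_tokens but empty content). The endpoint path and payload shape assume llama.cpp server's OpenAI-compatible /v1/chat/completions API; the server address, request count, and prompt text are illustrative assumptions.

```python
# Sketch of a 50-CCU load test against a local llama.cpp server.
# Assumed: server listening on localhost:8080 with the OpenAI-compatible API.
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

SERVER = "http://localhost:8080/v1/chat/completions"  # assumed address


def looks_like_the_bug(resp: dict) -> bool:
    """True when the response claims completion tokens were generated
    but the returned message content is empty."""
    content = resp["choices"][0]["message"]["content"]
    tokens = resp["usage"]["completion_tokens"]
    return tokens > 0 and not content.strip()


def one_request(i: int) -> dict:
    # Vary the prompt slightly so slot prompt-similarity reuse still kicks in.
    payload = json.dumps({
        "messages": [{"role": "user", "content": f"Question {i}: say hello"}],
        "max_tokens": 64,
    }).encode()
    req = urllib.request.Request(
        SERVER, data=payload, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as r:
        return json.load(r)


if __name__ == "__main__":
    # 50 concurrent users, as in the report.
    with ThreadPoolExecutor(max_workers=50) as pool:
        results = list(pool.map(one_request, range(200)))
    bad = sum(looks_like_the_bug(r) for r in results)
    print(f"{bad}/{len(results)} responses had empty content")
```

On the working build (b5428) the final count should stay at zero; on the affected build it climbs once the server has been under sustained load.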
This is the difference in the logs:
The error version:
The previous version (works well):
Example response error:
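The actual response body is not reproduced above; purely for illustration, a response exhibiting the symptom would look roughly like the following (all field values are hypothetical): a well-formed completion whose content is empty despite usage reporting generated tokens.

```json
{
  "choices": [
    {
      "index": 0,
      "message": { "role": "assistant", "content": "" },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 128,
    "completion_tokens": 42,
    "total_tokens": 170
  }
}
```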
First Bad Commit
This image is still working fine: https://github.com/ggml-org/llama.cpp/pkgs/container/llama.cpp/419241486?tag=server-cuda-b5428
Relevant log output