Name and Version
$ ./llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
version: 0 (unknown)
built with gcc (GCC) 13.3.0 for x86_64-unknown-linux-gnu
(Actually version 5161)
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server
Command line
./llama-server -m gemma-3-27b-pt-q4_0.gguf -ngl 9999 --host 127.0.0.1 --port 8000 --threads-http 1
# ...
curl -v --request POST \
--url http://127.0.0.1:8000/completion \
--header "Content-Type: application/json" \
--data '{"prompt": "Five, Four, Three, Two, One, '$RANDOM'\n\n\n\nThe countdown","n_predict": 256, "n_probs":10, "temperature":0,"stream":true}'
Problem description & steps to reproduce
It looks like the server sometimes tries to generate responses for HTTP requests that were queued but whose clients have since disconnected.
I can reproduce the problem as follows (a script automating these steps is sketched after the list):
1. Start the server.
2. Start the curl command above. The key aspect is that it must be long-running (i.e. n_predict is high).
3. While it's still running, in another terminal, start and then immediately cancel (with Ctrl+C) the same command a few times, in quick succession.
4. Start the curl command once more.
5. Cancel the original curl command from step 2.
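For convenience, here is a minimal bash sketch (my own, not part of the original commands) that automates steps 2-5. It assumes the server started with the command line above is listening on 127.0.0.1:8000, and it uses a plain kill (SIGTERM) to emulate the client-side Ctrl+C, since what matters to the server is only that the client disconnects:

#!/usr/bin/env bash
# Hypothetical helper, not from the report: automates repro steps 2-5 against
# the llama-server instance started with the command line above.

send() {
  # Same request as the curl command above. `exec` makes the background PID
  # be curl itself, so killing that PID closes the client connection, which
  # is what Ctrl+C does in the manual steps.
  exec curl -s --request POST --url http://127.0.0.1:8000/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "Five, Four, Three, Two, One, '$RANDOM'\n\n\n\nThe countdown","n_predict": 256, "n_probs":10, "temperature":0,"stream":true}' \
    --output /dev/null
}

# Step 2: start the long-running "original" request.
send & orig=$!

# Step 3: start and almost immediately cancel the same request a few times.
for _ in 1 2 3; do
  send & sleep 0.3; kill $!
done

# Step 4: start the request once more; this one should be answered promptly.
send & last=$!

# Step 5: cancel the original request from step 2.
kill "$orig"

# Time how long the step-4 request takes to complete.
time wait "$last"

On an affected build, the timing of the final wait should show the hang described below rather than the time needed for a single 256-token completion.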
Expected behavior: The server should immediately start replying to the request from step 4.
Actual behavior: The server appears to hang, because it is pointlessly generating replies to the requests canceled in step 3.
I tried to force the server to handle one request at a time with the --threads-http 1 option, but it doesn't seem to make a difference.
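As a side note for anyone trying to confirm what the server is doing while it appears to hang: polling the slots endpoint shows whether a slot is still generating for one of the stale requests. This is my own suggestion rather than something from the report above, and depending on the build the endpoint may need to be enabled by starting llama-server with --slots:

# Hypothetical check (not part of the report): poll the server while the
# step-4 request appears hung. A slot still shown as busy at this point is
# working on one of the requests canceled in step 3.
while true; do
  curl -s http://127.0.0.1:8000/slots
  echo
  sleep 1
done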
First Bad Commit
This seems to be a regression, but it was introduced about a year ago, so the exact commit that introduced it is probably not relevant.