Description
Name and Version
llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
version: 5172 (eb1776b)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
Operating systems
Linux
GGML backends
CUDA
Hardware
Epyc 7642 + 3x RTX 3090
Models
Mistral-Small-3.1-24B-Instruct-2503-Q8_0.gguf + Mistral-Small-3.1-DRAFT-0.5B.Q8_0.gguf
Problem description & steps to reproduce
Running the following command:
/models/llama.cpp/llama-server -m /models/Mistral-Small-3.1-24B-Instruct-2503-Q8_0.gguf \
  -md /models/Mistral-Small-3.1-DRAFT-0.5B.Q8_0.gguf \
  -fa -sm row --no-mmap \
  -ngl 99 -ngld 99 --port 9009 -c 65536 \
  --draft-max 16 --draft-min 5 --draft-p-min 0.5 \
  --device CUDA2,CUDA1 --device-draft CUDA1 --tensor-split 0,1,1 \
  --slots --metrics --numa distribute -t 40 --no-warmup
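For reference, a request along the following lines should be representative of what triggers the issue; the prompt and max_tokens are arbitrary placeholders, only the port matches the command above:

curl http://localhost:9009/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 32}'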
The models load fine. As soon as the first request is received, a CUDA out-of-memory error occurs:
srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 65536, n_keep = 0, n_prompt_tokens = 110
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 110, n_tokens = 110, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 110, n_tokens = 110
/home/ali/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:75: CUDA error
CUDA error: out of memory
  current device: 1, in function alloc at /home/ali/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:450
  cuMemCreate(&handle, reserve_size, &prop, 0)
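The failing call appears to be the cuMemCreate() inside the CUDA VMM pool allocator (all three devices report VMM: yes above). If it helps with triage, building without the VMM pool might isolate that path; this is only a guess, and it assumes the GGML_CUDA_NO_VMM CMake option is still present on this revision:

cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_NO_VMM=ON
cmake --build build --config Release -j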
Despite the error, llama-server does not terminate. The process stays running and the API still responds to GET /v1/models. Clients can submit new completion requests, but no response is ever returned:
srv log_server_r: request: GET /v1/models 192.168.1.207 200
srv log_server_r: request: GET /v1/models 192.168.1.207 200
srv params_from_: Chat format: Content-only
srv log_server_r: request: GET /v1/models 192.168.1.207 200
srv log_server_r: request: GET /v1/models 192.168.1.207 200
srv params_from_: Chat format: Content-only
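The stuck state can be probed from the client side; a sketch, using llama-server's standard endpoints and an arbitrary 10-second timeout: /v1/models answers immediately, while a completion request never comes back:

curl -s http://localhost:9009/v1/models    # returns immediately with HTTP 200
curl -s --max-time 10 http://localhost:9009/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "ping"}]}'    # hits the timeout, no response ever arrives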
nvtop shows no GPU activity, though the models are still loaded.
First Bad Commit
No response
Relevant log output
srv params_from_: Chat format: Content-only
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 65536, n_keep = 0, n_prompt_tokens = 110
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 110, n_tokens = 110, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 110, n_tokens = 110
/home/ali/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:75: CUDA error
CUDA error: out of memory
current device: 1, in function alloc at /home/ali/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:450
cuMemCreate(&handle, reserve_size, &prop, 0)