Description
Name and Version
llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
version: 576 (af04481)
built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
Operating systems
Linux
GGML backends
CUDA
Hardware
CPU: 13th Gen Intel(R) Core(TM) i7-13700T (24) @ 4.90 GHz
GPU: NVIDIA GeForce RTX 3090 [Discrete]
Memory: 6.51 GiB / 125.51 GiB (5%)
Models
nomic-embed-text-v1.5-Q8_0.gguf
Problem description & steps to reproduce
When I run it this way, the model insists on running on the GPU:
/usr/local/bin/llama-server -v -c 8192 -ub 8192 --embedding --log-timestamps --host 192.168.1.68 --port 9999 -m /mnt/nvme0n1/LLM/nomic-ai/quantized/nomic-embed-text-v1.5-Q8_0.gguf
and when I run it this way, the model runs on the CPU:
/usr/local/bin/llama-server -v -c 8192 --embedding --log-timestamps --host 192.168.1.68 --port 9999 -m /mnt/nvme0n1/LLM/nomic-ai/quantized/nomic-embed-text-v1.5-Q8_0.gguf
IMHO, -ub 8192 works better, but I can't be sure.
Strangely, I can actually see it running on the GPU even though I did not specify -ngl.
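For what it's worth, a sketch of how the offload could be made explicit instead of relying on defaults (assuming a recent llama-server build where -ngl / --n-gpu-layers is available; model path and flags copied from the commands above, commands are only echoed here):

```shell
#!/bin/sh
# Hypothetical explicit-offload variants of the commands above.
MODEL=/mnt/nvme0n1/LLM/nomic-ai/quantized/nomic-embed-text-v1.5-Q8_0.gguf

# Force everything onto the CPU (offload zero layers):
echo /usr/local/bin/llama-server --embedding -c 8192 -ngl 0 -m "$MODEL"

# Explicitly offload all layers to the GPU:
echo /usr/local/bin/llama-server --embedding -c 8192 -ngl 99 -m "$MODEL"
```

With -ngl given explicitly on both runs, any remaining CPU/GPU difference would come from the -ub setting alone rather than from default layer placement.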
First Bad Commit
No response
Relevant log output
llama_context: n_ctx_per_seq = 8192
llama_context: n_batch = 2048
llama_context: n_ubatch = 2048
llama_context: causal_attn = 0
llama_context: flash_attn = 0
llama_context: freq_base = 1000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (8192) > n_ctx_train (2048) -- possible training context overflow
set_abort_callback: call
llama_context: CPU output buffer size = 0.00 MiB
llama_context: n_ctx = 8192
llama_context: n_ctx = 8192 (padded)
init: kv_size = 8192, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 12, can_shift = 1
init: layer 0: n_embd_k_gqa = 768, n_embd_v_gqa = 768, dev = CPU
init: layer 1: n_embd_k_gqa = 768, n_embd_v_gqa = 768, dev = CPU
init: layer 2: n_embd_k_gqa = 768, n_embd_v_gqa = 768, dev = CPU
init: layer 3: n_embd_k_gqa = 768, n_embd_v_gqa = 768, dev = CPU
init: layer 4: n_embd_k_gqa = 768, n_embd_v_gqa = 768, dev = CPU
init: layer 5: n_embd_k_gqa = 768, n_embd_v_gqa = 768, dev = CPU
init: layer 6: n_embd_k_gqa = 768, n_embd_v_gqa = 768, dev = CPU
init: layer 7: n_embd_k_gqa = 768, n_embd_v_gqa = 768, dev = CPU
init: layer 8: n_embd_k_gqa = 768, n_embd_v_gqa = 768, dev = CPU
init: layer 9: n_embd_k_gqa = 768, n_embd_v_gqa = 768, dev = CPU
init: layer 10: n_embd_k_gqa = 768, n_embd_v_gqa = 768, dev = CPU
init: layer 11: n_embd_k_gqa = 768, n_embd_v_gqa = 768, dev = CPU
init: CPU KV buffer size = 288.00 MiB
llama_context: KV self size = 288.00 MiB, K (f16): 144.00 MiB, V (f16): 144.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
llama_context: max_nodes = 65536
llama_context: n_tokens = 2048, n_seqs = 1, n_outputs = 2048
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 256.00 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 268435456
llama_init_from_model: failed to initialize the context: failed to allocate compute pp buffers
common_init_from_params: failed to create context with model '/mnt/nvme0n1/LLM/nomic-ai/quantized/nomic-embed-text-v1.5-Q8_0.gguf'
srv load_model: failed to load model, '/mnt/nvme0n1/LLM/nomic-ai/quantized/nomic-embed-text-v1.5-Q8_0.gguf'
srv operator(): operator(): cleaning up before exit...
terminate called without an active exception
main: exiting due to model loading error
Aborted
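As a sanity check, the buffer sizes in the log are self-consistent; a quick sketch of the arithmetic (assuming f16 = 2 bytes per element, and the n_layer = 12, kv_size = 8192, n_embd_k_gqa = n_embd_v_gqa = 768 values reported above):

```shell
#!/bin/sh
# KV cache: K and V tensors, one pair per layer, f16 precision.
layers=12; kv_size=8192; n_embd=768; bytes_f16=2
kv_total=$(( 2 * layers * kv_size * n_embd * bytes_f16 ))      # K + V, in bytes
echo "KV cache: $(( kv_total / 1024 / 1024 )) MiB"             # matches the 288.00 MiB in the log

# The allocation that fails is reported in raw bytes:
echo "Failed CUDA0 buffer: $(( 268435456 / 1024 / 1024 )) MiB" # matches the 256.00 MiB cudaMalloc failure
```

So the KV cache lands on the CPU as the log shows, and what actually fails is a modest 256 MiB CUDA compute buffer, despite the 3090's 24 GiB of VRAM.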