Eval bug: with -ub 8192, llama-server insists on running the model on the GPU #12675
Closed
@gnusupport

Description

Name and Version

llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
version: 576 (af04481)
built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu

Operating systems

Linux

GGML backends

CUDA

Hardware

CPU: 13th Gen Intel(R) Core(TM) i7-13700T (24) @ 4.90 GHz
GPU: NVIDIA GeForce RTX 3090 [Discrete]
Memory: 6.51 GiB / 125.51 GiB (5%)

Models

nomic-embed-text-v1.5-Q8_0.gguf

Problem description & steps to reproduce

When I run it this way, the model insists on running on the GPU:

/usr/local/bin/llama-server -v -c 8192 -ub 8192 --embedding --log-timestamps --host 192.168.1.68 --port 9999 -m /mnt/nvme0n1/LLM/nomic-ai/quantized/nomic-embed-text-v1.5-Q8_0.gguf
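(For context: -ub sets n_ubatch, the physical micro-batch size processed in a single forward pass; its default is 512, and larger values enlarge the compute buffers that must be reserved up front. Note also that n_ubatch appears to be clamped to n_batch, which would explain why the log below reports n_ubatch = 2048 despite -ub 8192.)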

and when I run it this way (the only difference being that -ub 8192 is omitted), the model runs on the CPU:

/usr/local/bin/llama-server -v -c 8192  --embedding --log-timestamps --host 192.168.1.68 --port 9999 -m /mnt/nvme0n1/LLM/nomic-ai/quantized/nomic-embed-text-v1.5-Q8_0.gguf

IMHO, -ub 8192 works better, but I can't be sure.

Strangely, I can actually see it running on the GPU even though I did not specify -ngl.
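As a sanity check that the CUDA backend is what grabs the memory, one option is to hide the GPU from the process entirely. CUDA_VISIBLE_DEVICES with an empty value is standard CUDA behavior (no devices visible); --device none only exists in newer llama.cpp builds, so treat the second variant as an assumption:

# mask the GPU at the driver level (standard CUDA environment variable)
CUDA_VISIBLE_DEVICES= /usr/local/bin/llama-server -v -c 8192 -ub 8192 --embedding --log-timestamps --host 192.168.1.68 --port 9999 -m /mnt/nvme0n1/LLM/nomic-ai/quantized/nomic-embed-text-v1.5-Q8_0.gguf

# or, on builds that support device selection:
/usr/local/bin/llama-server --device none -v -c 8192 -ub 8192 --embedding --log-timestamps --host 192.168.1.68 --port 9999 -m /mnt/nvme0n1/LLM/nomic-ai/quantized/nomic-embed-text-v1.5-Q8_0.gguf

If the server then starts and computes embeddings on the CPU, the failure is specific to the CUDA compute-buffer reservation.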

First Bad Commit

No response

Relevant log output

llama_context: n_ctx_per_seq = 8192
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 2048
llama_context: causal_attn   = 0
llama_context: flash_attn    = 0
llama_context: freq_base     = 1000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (8192) > n_ctx_train (2048) -- possible training context overflow
set_abort_callback: call
llama_context:        CPU  output buffer size =     0.00 MiB
llama_context: n_ctx = 8192
llama_context: n_ctx = 8192 (padded)
init: kv_size = 8192, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 12, can_shift = 1
init: layer   0: n_embd_k_gqa = 768, n_embd_v_gqa = 768, dev = CPU
init: layer   1: n_embd_k_gqa = 768, n_embd_v_gqa = 768, dev = CPU
init: layer   2: n_embd_k_gqa = 768, n_embd_v_gqa = 768, dev = CPU
init: layer   3: n_embd_k_gqa = 768, n_embd_v_gqa = 768, dev = CPU
init: layer   4: n_embd_k_gqa = 768, n_embd_v_gqa = 768, dev = CPU
init: layer   5: n_embd_k_gqa = 768, n_embd_v_gqa = 768, dev = CPU
init: layer   6: n_embd_k_gqa = 768, n_embd_v_gqa = 768, dev = CPU
init: layer   7: n_embd_k_gqa = 768, n_embd_v_gqa = 768, dev = CPU
init: layer   8: n_embd_k_gqa = 768, n_embd_v_gqa = 768, dev = CPU
init: layer   9: n_embd_k_gqa = 768, n_embd_v_gqa = 768, dev = CPU
init: layer  10: n_embd_k_gqa = 768, n_embd_v_gqa = 768, dev = CPU
init: layer  11: n_embd_k_gqa = 768, n_embd_v_gqa = 768, dev = CPU
init:        CPU KV buffer size =   288.00 MiB
llama_context: KV self size  =  288.00 MiB, K (f16):  144.00 MiB, V (f16):  144.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
llama_context: max_nodes = 65536
llama_context: n_tokens = 2048, n_seqs = 1, n_outputs = 2048
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 256.00 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 268435456
llama_init_from_model: failed to initialize the context: failed to allocate compute pp buffers
common_init_from_params: failed to create context with model '/mnt/nvme0n1/LLM/nomic-ai/quantized/nomic-embed-text-v1.5-Q8_0.gguf'
srv    load_model: failed to load model, '/mnt/nvme0n1/LLM/nomic-ai/quantized/nomic-embed-text-v1.5-Q8_0.gguf'
srv    operator(): operator(): cleaning up before exit...
terminate called without an active exception
main: exiting due to model loading error
Aborted
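For reference, the failed allocation matches the reported buffer size exactly: 256 MiB = 256 × 1024 × 1024 bytes = 268,435,456 bytes. In other words, the server tries to reserve the 256 MiB prompt-processing compute buffer on CUDA0 even though every KV layer in the log above is placed on the CPU.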
