Description
Name and Version
llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
version: 576 (af04481)
built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
Operating systems
Linux
GGML backends
CUDA
Hardware
CPU: 13th Gen Intel(R) Core(TM) i7-13700T (24) @ 4.90 GHz
GPU: NVIDIA GeForce RTX 3090 [Discrete]
Memory: 6.51 GiB / 125.51 GiB (5%)
Models
nomic-embed-text-v1.5-Q8_0.gguf
Problem description & steps to reproduce
When I run it this way, the model insists on running on the GPU:
/usr/local/bin/llama-server -v -c 8192 -ub 8192 --embedding --log-timestamps --host 192.168.1.68 --port 9999 -m /mnt/nvme0n1/LLM/nomic-ai/quantized/nomic-embed-text-v1.5-Q8_0.gguf
and when I run it this way, the model runs on the CPU:
/usr/local/bin/llama-server -v -c 8192 --embedding --log-timestamps --host 192.168.1.68 --port 9999 -m /mnt/nvme0n1/LLM/nomic-ai/quantized/nomic-embed-text-v1.5-Q8_0.gguf
IMHO, -ub 8192 works better, but I can't be sure.
Strangely, I can actually see it running on the GPU even though I did not specify -ngl.
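For what it's worth, a sketch of how the offload could be made explicit instead of relying on defaults (assuming a recent llama-server build where -ngl / --n-gpu-layers is available; model path and flags copied from the commands above, commands are only echoed here):

```shell
#!/bin/sh
# Hypothetical explicit-offload variants of the commands above.
MODEL=/mnt/nvme0n1/LLM/nomic-ai/quantized/nomic-embed-text-v1.5-Q8_0.gguf

# Force everything onto the CPU (offload zero layers):
echo /usr/local/bin/llama-server --embedding -c 8192 -ngl 0 -m "$MODEL"

# Explicitly offload all layers to the GPU:
echo /usr/local/bin/llama-server --embedding -c 8192 -ngl 99 -m "$MODEL"
```

With -ngl given explicitly on both runs, any remaining CPU/GPU difference would come from the -ub setting alone rather than from default layer placement.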
First Bad Commit
No response
Relevant log output
llama_context: n_ctx_per_seq = 8192
llama_context: n_batch = 2048
llama_context: n_ubatch = 2048
llama_context: causal_attn = 0
llama_context: flash_attn = 0
llama_context: freq_base = 1000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (8192) > n_ctx_train (2048) -- possible training context overflow
set_abort_callback: call
llama_context: CPU output buffer size = 0.00 MiB
llama_context: n_ctx = 8192
llama_context: n_ctx = 8192 (padded)
init: kv_size = 8192, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 12, can_shift = 1
init: layer 0: n_embd_k_gqa = 768, n_embd_v_gqa = 768, dev = CPU
init: layer 1: n_embd_k_gqa = 768, n_embd_v_gqa = 768, dev = CPU
init: layer 2: n_embd_k_gqa = 768, n_embd_v_gqa = 768, dev = CPU
init: layer 3: n_embd_k_gqa = 768, n_embd_v_gqa = 768, dev = CPU
init: layer 4: n_embd_k_gqa = 768, n_embd_v_gqa = 768, dev = CPU
init: layer 5: n_embd_k_gqa = 768, n_embd_v_gqa = 768, dev = CPU
init: layer 6: n_embd_k_gqa = 768, n_embd_v_gqa = 768, dev = CPU
init: layer 7: n_embd_k_gqa = 768, n_embd_v_gqa = 768, dev = CPU
init: layer 8: n_embd_k_gqa = 768, n_embd_v_gqa = 768, dev = CPU
init: layer 9: n_embd_k_gqa = 768, n_embd_v_gqa = 768, dev = CPU
init: layer 10: n_embd_k_gqa = 768, n_embd_v_gqa = 768, dev = CPU
init: layer 11: n_embd_k_gqa = 768, n_embd_v_gqa = 768, dev = CPU
init: CPU KV buffer size = 288.00 MiB
llama_context: KV self size = 288.00 MiB, K (f16): 144.00 MiB, V (f16): 144.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
llama_context: max_nodes = 65536
llama_context: n_tokens = 2048, n_seqs = 1, n_outputs = 2048
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 256.00 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 268435456
llama_init_from_model: failed to initialize the context: failed to allocate compute pp buffers
common_init_from_params: failed to create context with model '/mnt/nvme0n1/LLM/nomic-ai/quantized/nomic-embed-text-v1.5-Q8_0.gguf'
srv load_model: failed to load model, '/mnt/nvme0n1/LLM/nomic-ai/quantized/nomic-embed-text-v1.5-Q8_0.gguf'
srv operator(): operator(): cleaning up before exit...
terminate called without an active exception
main: exiting due to model loading error
Aborted
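As a sanity check, the buffer sizes in the log are self-consistent; a quick sketch of the arithmetic (assuming f16 = 2 bytes per element, and the n_layer = 12, kv_size = 8192, n_embd_k_gqa = n_embd_v_gqa = 768 values reported above):

```shell
#!/bin/sh
# KV cache: K and V tensors, one pair per layer, f16 precision.
layers=12; kv_size=8192; n_embd=768; bytes_f16=2
kv_total=$(( 2 * layers * kv_size * n_embd * bytes_f16 ))      # K + V, in bytes
echo "KV cache: $(( kv_total / 1024 / 1024 )) MiB"             # matches the 288.00 MiB in the log

# The allocation that fails is reported in raw bytes:
echo "Failed CUDA0 buffer: $(( 268435456 / 1024 / 1024 )) MiB" # matches the 256.00 MiB cudaMalloc failure
```

So the KV cache lands on the CPU as the log shows, and what actually fails is a modest 256 MiB CUDA compute buffer, despite the 3090's 24 GiB of VRAM.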