llama : rework embeddings logic by ggerganov · Pull Request #14208 · ggml-org/llama.cpp

Merged

merged 6 commits into master on June 16, 2025
Conversation

@ggerganov (Member) commented June 16, 2025

fix #14204

  • Support embeddings with causal attention
  • Support logits + embeddings output
  • llama-server now requires the --embeddings flag to compute embeddings/reranking
  • Auto-fill embedding batches with output[i] == true when batch.logits == NULL (see the sketch below)
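
The last two items touch the llama_batch output mask in the public C API. The following is a minimal, hypothetical sketch (not code from this PR) of how a caller might request both logits and embeddings from a single llama_decode() call after this change; the model path and token values are placeholders:

```cpp
// Sketch only: assumes a local "model.gguf" and dummy tokens.
#include "llama.h"

#include <cstdio>
#include <vector>

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_model_load_from_file("model.gguf", mparams);
    if (model == nullptr) {
        return 1;
    }

    llama_context_params cparams = llama_context_default_params();
    cparams.embeddings = true; // ask the context to extract embeddings together with logits

    llama_context * ctx = llama_init_from_model(model, cparams);

    // Placeholder tokens; a real caller would use llama_tokenize() on a prompt.
    std::vector<llama_token> tokens = { 1, 2, 3 };

    llama_batch batch = llama_batch_init((int32_t) tokens.size(), 0, 1);
    batch.n_tokens = (int32_t) tokens.size();
    for (int32_t i = 0; i < batch.n_tokens; ++i) {
        batch.token   [i]    = tokens[i];
        batch.pos     [i]    = i;
        batch.n_seq_id[i]    = 1;
        batch.seq_id  [i][0] = 0;
        batch.logits  [i]    = true; // explicit output mask; per this PR, leaving
                                     // batch.logits == NULL on an embedding batch
                                     // auto-fills output[i] == true instead
    }

    if (llama_decode(ctx, batch) == 0) {
        const int32_t last = batch.n_tokens - 1;
        const float * logits = llama_get_logits_ith(ctx, last);      // next-token logits
        const float * embd   = llama_get_embeddings_ith(ctx, last);  // token embedding
        printf("got logits: %s, embeddings: %s\n",
               logits ? "yes" : "no", embd ? "yes" : "no");
    }

    llama_batch_free(batch);
    llama_free(ctx);
    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```

This is a guess at intended usage based on the PR summary above. Note that the server path is gated separately: llama-server computes embeddings/reranking only when launched with --embeddings.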

@ggerganov marked this pull request as ready for review June 16, 2025 09:30
@ggerganov requested a review from ngxson as a code owner June 16, 2025 09:30
@ggerganov merged commit d3e64b9 into master on June 16, 2025
55 checks passed
@ggerganov deleted the gg/llama-rework-embeddings branch June 16, 2025 11:14
Development

Successfully merging this pull request may close these issues.

Eval bug: Error in trying to use llama-server with Qwen3-Embedding-0.6B-GGUF