Eval bug: Error when load `bge-reranker-v2-gemma` model · Issue #13041 · ggml-org/llama.cpp · GitHub
Eval bug: Error when load bge-reranker-v2-gemma model #13041


Open
congson1293 opened this issue Apr 21, 2025 · 0 comments
Name and Version

build/bin/llama-cli --version
version: 5162 (2016f07)
built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin24.4.0

Operating systems

Mac, Linux

GGML backends

Metal, CUDA

Hardware

Nvidia L40 and Macbook Pro M3 Pro

Models

https://huggingface.co/RichardErkhov/BAAI_-_bge-reranker-v2-gemma-gguf/blob/main/bge-reranker-v2-gemma.Q8_0.gguf

Problem description & steps to reproduce

When I run llama-server with the command:

./build/bin/llama-server -m ./bge-m3/bge-reranker-v2-gemma-Q8_0.gguf \
--host 0.0.0.0 \
--ctx-size 8192 \
--batch-size 8192 \
--ubatch-size 8192 \
--n-gpu-layers 99 \
--flash-attn \
--n-predict 8192 \
--threads-http -1 \
--timeout 60 \
--cont-batching \
--rerank
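For context, if the model loaded successfully, the server's reranking endpoint would be exercised with a request like the sketch below. This is a hedged illustration: the `/rerank` route and the `query`/`documents` field names are my reading of the llama-server README, and the example texts are made up.

```python
import json

# Hypothetical request body for llama-server's /rerank endpoint:
# one query plus the candidate documents to score against it.
payload = {
    "query": "What is a panda?",
    "documents": [
        "The giant panda is a bear species endemic to China.",
        "pandas is a Python library for data analysis.",
    ],
}

body = json.dumps(payload)
# e.g. POST http://localhost:8080/rerank with Content-Type: application/json
print(body)
```

The server would respond with a relevance score per document, which is exactly the step that cannot work here because model loading aborts first.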

It fails to load the model and exits, with the detailed log below:

First Bad Commit

No response

Relevant log output

build: 5162 (2016f07b) with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin24.4.0
system info: n_threads = 5, n_threads_batch = 5, total_threads = 11

system_info: n_threads = 5 (n_threads_batch = 5) / 11 | Metal : EMBED_LIBRARY = 1 | CPU : ARM_FMA = 1 | FP16_VA = 1 | MATMUL_INT8 = 1 | DOTPROD = 1 | ACCELERATE = 1 | AARCH64_REPACK = 1 | 

main: binding port with default address family
main: HTTP server is listening, hostname: 0.0.0.0, port: 8080, http threads: 10
main: loading model
srv    load_model: loading model './bge-m3/bge-reranker-v2-gemma-Q8_0.gguf'
llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 12287 MiB free
llama_model_loader: loaded meta data with 36 key-value pairs and 164 tensors from ./bge-m3/bge-reranker-v2-gemma-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Gemma 2b
llama_model_loader: - kv   3:                       general.organization str              = Google
llama_model_loader: - kv   4:                           general.basename str              = gemma
llama_model_loader: - kv   5:                         general.size_label str              = 2B
llama_model_loader: - kv   6:                            general.license str              = apache-2.0
llama_model_loader: - kv   7:                               general.tags arr[str,3]       = ["transformers", "sentence-transforme...
llama_model_loader: - kv   8:                          general.languages arr[str,1]       = ["multilingual"]
llama_model_loader: - kv   9:                       gemma.context_length u32              = 8192
llama_model_loader: - kv  10:                     gemma.embedding_length u32              = 2048
llama_model_loader: - kv  11:                          gemma.block_count u32              = 18
llama_model_loader: - kv  12:                  gemma.feed_forward_length u32              = 16384
llama_model_loader: - kv  13:                 gemma.attention.head_count u32              = 8
llama_model_loader: - kv  14:              gemma.attention.head_count_kv u32              = 1
llama_model_loader: - kv  15:     gemma.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  16:                 gemma.attention.key_length u32              = 256
llama_model_loader: - kv  17:               gemma.attention.value_length u32              = 256
llama_model_loader: - kv  18:                          general.file_type u32              = 7
llama_model_loader: - kv  19:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  20:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  21:                      tokenizer.ggml.tokens arr[str,256000]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  22:                      tokenizer.ggml.scores arr[f32,256000]  = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  24:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  25:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  26:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  27:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  28:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  29:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  30:             tokenizer.ggml.prefix_token_id u32              = 67
llama_model_loader: - kv  31:             tokenizer.ggml.suffix_token_id u32              = 69
llama_model_loader: - kv  32:             tokenizer.ggml.middle_token_id u32              = 68
llama_model_loader: - kv  33:                tokenizer.ggml.eot_token_id u32              = 107
llama_model_loader: - kv  34:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  35:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   37 tensors
llama_model_loader: - type q8_0:  127 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 2.48 GiB (8.50 BPW) 
load: control-looking token:    107 '<end_of_turn>' was not control-type; this is probably a bug in the model. its type will be overridden
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 5
load: token to piece cache size = 1.6014 MB
print_info: arch             = gemma
print_info: vocab_only       = 0
print_info: n_ctx_train      = 8192
print_info: n_embd           = 2048
print_info: n_layer          = 18
print_info: n_head           = 8
print_info: n_head_kv        = 1
print_info: n_rot            = 256
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 256
print_info: n_embd_head_v    = 256
print_info: n_gqa            = 8
print_info: n_embd_k_gqa     = 256
print_info: n_embd_v_gqa     = 256
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 16384
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 8192
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 2B
print_info: model params     = 2.51 B
print_info: general.name     = Gemma 2b
print_info: vocab type       = SPM
print_info: n_vocab          = 256000
print_info: n_merges         = 0
print_info: BOS token        = 2 '<bos>'
print_info: EOS token        = 1 '<eos>'
print_info: EOT token        = 107 '<end_of_turn>'
print_info: UNK token        = 3 '<unk>'
print_info: PAD token        = 0 '<pad>'
print_info: LF token         = 227 '<0x0A>'
print_info: FIM PRE token    = 67 '<unused60>'
print_info: FIM SUF token    = 69 '<unused62>'
print_info: FIM MID token    = 68 '<unused61>'
print_info: EOG token        = 1 '<eos>'
print_info: EOG token        = 107 '<end_of_turn>'
print_info: max token length = 93
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 18 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 19/19 layers to GPU
load_tensors: Metal_Mapped model buffer size =  2539.67 MiB
load_tensors:   CPU_Mapped model buffer size =   531.25 MiB
.............................................................
common_init_from_params: warning: vocab does not have a  SEP token, reranking will not work
srv    load_model: failed to load model, './bge-m3/bge-reranker-v2-gemma-Q8_0.gguf'
srv    operator(): operator(): cleaning up before exit...
main: exiting due to model loading error
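The failure appears to trace back to the warning just above it: `common_init_from_params` reports that the vocab has no SEP token, and since `--rerank` was requested, loading aborts. Consistent with that, the metadata dump earlier in the log lists BOS/EOS/UNK/padding token IDs but no `tokenizer.ggml.sep_token_id`. One way to check a GGUF file for that key is to walk its metadata header directly. The sketch below parses a tiny in-memory GGUF blob following the GGUF layout (magic, version, tensor count, KV count, then key-value pairs); it only handles the two value types it needs, and the blob contents are made up for illustration.

```python
import struct

# GGUF value-type codes (subset); see ggml's GGUF spec for the full enum.
GGUF_TYPE_UINT32 = 4
GGUF_TYPE_STRING = 8

def read_str(buf, off):
    # GGUF strings: uint64 little-endian length followed by raw UTF-8 bytes.
    (n,) = struct.unpack_from("<Q", buf, off)
    off += 8
    return buf[off:off + n].decode("utf-8"), off + n

def gguf_metadata_keys(buf):
    """Return the metadata keys of a GGUF blob (values are skipped;
    only uint32 and string values are supported in this sketch)."""
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", buf, 0)
    assert magic == b"GGUF"
    off, keys = 24, []
    for _ in range(n_kv):
        key, off = read_str(buf, off)
        (vtype,) = struct.unpack_from("<I", buf, off)
        off += 4
        if vtype == GGUF_TYPE_UINT32:
            off += 4
        elif vtype == GGUF_TYPE_STRING:
            _, off = read_str(buf, off)
        else:
            raise NotImplementedError(f"value type {vtype}")
        keys.append(key)
    return keys

def kv_u32(key, val):
    # Encode one uint32 key-value pair in GGUF layout.
    k = key.encode("utf-8")
    return struct.pack("<Q", len(k)) + k + struct.pack("<II", GGUF_TYPE_UINT32, val)

# A minimal fake GGUF header: version 3, zero tensors, one KV pair.
blob = struct.pack("<4sIQQ", b"GGUF", 3, 0, 1) + kv_u32("tokenizer.ggml.bos_token_id", 2)
print("tokenizer.ggml.sep_token_id" in gguf_metadata_keys(blob))  # False: no SEP token
```

For a real file, the `gguf` Python package shipped with llama.cpp can do the same inspection without hand-rolled parsing; the point here is only that a reranker converted with the `gemma` architecture has no SEP token in its vocab, which this server build requires for `--rerank`.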
@github-actions github-actions bot added the stale label May 22, 2025