bge-reranker-v2-gemma
Name and Version

build/bin/llama-cli --version
version: 5162 (2016f07)
built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin24.4.0
Operating systems

Mac, Linux
GGML backends

Metal, CUDA
Hardware

Nvidia L40 and MacBook Pro M3 Pro
Models

https://huggingface.co/RichardErkhov/BAAI_-_bge-reranker-v2-gemma-gguf/blob/main/bge-reranker-v2-gemma.Q8_0.gguf
Problem description & steps to reproduce

When I run llama-server with the command:

./build/bin/llama-server -m ./bge-m3/bge-reranker-v2-gemma-Q8_0.gguf \
  --host 0.0.0.0 \
  --ctx-size 8192 \
  --batch-size 8192 \
  --ubatch-size 8192 \
  --n-gpu-layers 99 \
  --flash-attn \
  --n-predict 8192 \
  --threads-http -1 \
  --timeout 60 \
  --cont-batching \
  --rerank

it throws an exception; the detailed log is in the Relevant log output section below.
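For context, the goal is to serve this model as a reranker. Once the server is up, the intent is to call it roughly like this (a sketch based on llama-server's /v1/rerank endpoint; the query and documents are illustrative placeholders, the port matches the log below):

curl http://localhost:8080/v1/rerank \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bge-reranker-v2-gemma",
    "query": "what is a panda?",
    "top_n": 2,
    "documents": ["hi", "The giant panda is a bear species endemic to China."]
  }'

The server never gets that far, though: it exits during model loading.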
First Bad Commit

No response
Relevant log output

build: 5162 (2016f07b) with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin24.4.0
system info: n_threads = 5, n_threads_batch = 5, total_threads = 11
system_info: n_threads = 5 (n_threads_batch = 5) / 11 | Metal : EMBED_LIBRARY = 1 | CPU : ARM_FMA = 1 | FP16_VA = 1 | MATMUL_INT8 = 1 | DOTPROD = 1 | ACCELERATE = 1 | AARCH64_REPACK = 1 |
main: binding port with default address family
main: HTTP server is listening, hostname: 0.0.0.0, port: 8080, http threads: 10
main: loading model
srv  load_model: loading model './bge-m3/bge-reranker-v2-gemma-Q8_0.gguf'
llama_model_load_from_file_impl: using device Metal (Apple M3 Pro) - 12287 MiB free
llama_model_loader: loaded meta data with 36 key-value pairs and 164 tensors from ./bge-m3/bge-reranker-v2-gemma-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gemma
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Gemma 2b
llama_model_loader: - kv 3: general.organization str = Google
llama_model_loader: - kv 4: general.basename str = gemma
llama_model_loader: - kv 5: general.size_label str = 2B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.tags arr[str,3] = ["transformers", "sentence-transforme...
llama_model_loader: - kv 8: general.languages arr[str,1] = ["multilingual"]
llama_model_loader: - kv 9: gemma.context_length u32 = 8192
llama_model_loader: - kv 10: gemma.embedding_length u32 = 2048
llama_model_loader: - kv 11: gemma.block_count u32 = 18
llama_model_loader: - kv 12: gemma.feed_forward_length u32 = 16384
llama_model_loader: - kv 13: gemma.attention.head_count u32 = 8
llama_model_loader: - kv 14: gemma.attention.head_count_kv u32 = 1
llama_model_loader: - kv 15: gemma.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 16: gemma.attention.key_length u32 = 256
llama_model_loader: - kv 17: gemma.attention.value_length u32 = 256
llama_model_loader: - kv 18: general.file_type u32 = 7
llama_model_loader: - kv 19: tokenizer.ggml.model str = llama
llama_model_loader: - kv 20: tokenizer.ggml.pre str = default
llama_model_loader: - kv 21: tokenizer.ggml.tokens arr[str,256000] = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv 22: tokenizer.ggml.scores arr[f32,256000] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,256000] = [3, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 24: tokenizer.ggml.bos_token_id u32 = 2
llama_model_loader: - kv 25: tokenizer.ggml.eos_token_id u32 = 1
llama_model_loader: - kv 26: tokenizer.ggml.unknown_token_id u32 = 3
llama_model_loader: - kv 27: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 28: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 29: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 30: tokenizer.ggml.prefix_token_id u32 = 67
llama_model_loader: - kv 31: tokenizer.ggml.suffix_token_id u32 = 69
llama_model_loader: - kv 32: tokenizer.ggml.middle_token_id u32 = 68
llama_model_loader: - kv 33: tokenizer.ggml.eot_token_id u32 = 107
llama_model_loader: - kv 34: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 35: general.quantization_version u32 = 2
llama_model_loader: - type f32: 37 tensors
llama_model_loader: - type q8_0: 127 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 2.48 GiB (8.50 BPW)
load: control-looking token: 107 '<end_of_turn>' was not control-type; this is probably a bug in the model. its type will be overridden
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 5
load: token to piece cache size = 1.6014 MB
print_info: arch = gemma
print_info: vocab_only = 0
print_info: n_ctx_train = 8192
print_info: n_embd = 2048
print_info: n_layer = 18
print_info: n_head = 8
print_info: n_head_kv = 1
print_info: n_rot = 256
print_info: n_swa = 0
print_info: n_swa_pattern = 1
print_info: n_embd_head_k = 256
print_info: n_embd_head_v = 256
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 256
print_info: n_embd_v_gqa = 256
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 16384
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 8192
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 2B
print_info: model params = 2.51 B
print_info: general.name = Gemma 2b
print_info: vocab type = SPM
print_info: n_vocab = 256000
print_info: n_merges = 0
print_info: BOS token = 2 '<bos>'
print_info: EOS token = 1 '<eos>'
print_info: EOT token = 107 '<end_of_turn>'
print_info: UNK token = 3 '<unk>'
print_info: PAD token = 0 '<pad>'
print_info: LF token = 227 '<0x0A>'
print_info: FIM PRE token = 67 '<unused60>'
print_info: FIM SUF token = 69 '<unused62>'
print_info: FIM MID token = 68 '<unused61>'
print_info: EOG token = 1 '<eos>'
print_info: EOG token = 107 '<end_of_turn>'
print_info: max token length = 93
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 18 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 19/19 layers to GPU
load_tensors: Metal_Mapped model buffer size = 2539.67 MiB
load_tensors: CPU_Mapped model buffer size = 531.25 MiB
.............................................................
common_init_from_params: warning: vocab does not have a SEP token, reranking will not work
srv  load_model: failed to load model, './bge-m3/bge-reranker-v2-gemma-Q8_0.gguf'
srv  operator(): operator(): cleaning up before exit...
main: exiting due to model loading error
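For anyone triaging: the fatal warning above ("vocab does not have a SEP token") can be cross-checked against the file's metadata. A minimal sketch, assuming the gguf-dump utility that ships with the gguf Python package (pip install gguf):

# Dump the GGUF metadata and look for a separator token id.
gguf-dump ./bge-m3/bge-reranker-v2-gemma-Q8_0.gguf | grep -i token_id
# A rerank-capable vocab is expected to carry a tokenizer.ggml.seperator_token_id
# entry ('seperator' is the spelling the GGUF key uses). The KV dump in the log
# above only defines bos/eos/unknown/padding token ids, i.e. the conversion did
# not record a SEP token, which matches the loader warning.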