Auto chat format not loading chat eos_token and bos_token for mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf · Issue #1202 · abetlen/llama-cpp-python
Closed
@cgarciga

Description

Hello,

I am using llama-cpp-python 0.2.44. I run the following code:

from llama_cpp import Llama
llm = Llama(model_path="models/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf")

I get the following output:

llama_model_loader: loaded meta data with 26 key-value pairs and 995 tensors from models/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mixtral-8x7b-instruct-v0.1
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:                         llama.expert_count u32              = 8
llama_model_loader: - kv  10:                    llama.expert_used_count u32              = 2
llama_model_loader: - kv  11:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  12:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  13:                          general.file_type u32              = 15
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  16:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  20:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  22:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  23:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type  f16:   32 tensors
llama_model_loader: - type q8_0:   64 tensors
llama_model_loader: - type q4_K:  833 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 8
llm_load_print_meta: n_expert_used    = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 46.70 B
llm_load_print_meta: model size       = 24.62 GiB (4.53 BPW) 
llm_load_print_meta: general.name     = mistralai_mixtral-8x7b-instruct-v0.1
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.38 MiB
llm_load_tensors:        CPU buffer size = 25215.87 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =    64.00 MiB
llama_new_context_with_model: KV self size  =   64.00 MiB, K (f16):   32.00 MiB, V (f16):   32.00 MiB
llama_new_context_with_model:        CPU input buffer size   =    10.01 MiB
llama_new_context_with_model:        CPU compute buffer size =   114.53 MiB
llama_new_context_with_model: graph splits (measure): 1
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | 
Model metadata: {'tokenizer.chat_template': "{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}", 'tokenizer.ggml.add_eos_token': 'false', 'tokenizer.ggml.padding_token_id': '0', 'tokenizer.ggml.unknown_token_id': '0', 'tokenizer.ggml.eos_token_id': '2', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'llama', 'general.file_type': '15', 'general.architecture': 'llama', 'llama.rope.freq_base': '1000000.000000', 'tokenizer.ggml.add_bos_token': 'true', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '14336', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.dimension_count': '128', 'tokenizer.ggml.bos_token_id': '1', 'llama.attention.head_count': '32', 'llama.block_count': '32', 'llama.attention.head_count_kv': '8', 'llama.expert_count': '8', 'llama.context_length': '32768', 'general.name': 'mistralai_mixtral-8x7b-instruct-v0.1', 'llama.expert_used_count': '2'}
Using chat template: {{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}
Using chat eos_token: 
Using chat bos_token: 

Please note the last two lines, in which there is no chat eos_token or bos_token.
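
For what it's worth, the token ids themselves do seem to be available. Here is a quick check (just a sketch, assuming the `Llama.metadata` dict and the `token_bos()` / `token_eos()` helpers in 0.2.44) that prints them straight from the loaded model:

from llama_cpp import Llama

llm = Llama(model_path="models/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf")

# The ids are present both in the GGUF metadata and in the loaded vocab
# (1 -> '<s>', 2 -> '</s>' per the llm_load_print_meta output above),
# so the gap appears to be in resolving them to strings for the chat format.
print("metadata bos id:", llm.metadata.get("tokenizer.ggml.bos_token_id"))
print("metadata eos id:", llm.metadata.get("tokenizer.ggml.eos_token_id"))
print("vocab bos id:", llm.token_bos())
print("vocab eos id:", llm.token_eos())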

If I then run:

llm.create_chat_completion(
      messages = [
          {"role": "user","content": "Q: Name the planets in the solar system? A: "}
      ]
)

I get an empty response:

llama_print_timings:        load time =    1252.31 ms
llama_print_timings:      sample time =       0.41 ms /     1 runs   (    0.41 ms per token,  2463.05 tokens per second)
llama_print_timings: prompt eval time =    1252.16 ms /    21 tokens (   59.63 ms per token,    16.77 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =    1255.73 ms /    22 tokens
{'id': 'chatcmpl-14e36e5b-7360-4428-a7ae-49f2b3a5d91b',
 'object': 'chat.completion',
 'created': 1708461563,
 'model': 'models/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf',
 'choices': [{'index': 0,
   'message': {'role': 'assistant', 'content': ''},
   'finish_reason': 'stop'}],
 'usage': {'prompt_tokens': 21, 'completion_tokens': 1, 'total_tokens': 22}}
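
To see what prompt the auto path would end up with when the tokens come back empty, here is a rough sketch that renders a trimmed-down copy of the embedded template with plain jinja2 (not llama-cpp-python's internal code path; the raise_exception branches are omitted):

from jinja2 import Template

# Trimmed copy of the chat template stored in the GGUF metadata.
chat_template = (
    "{{ bos_token }}{% for message in messages %}"
    "{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}"
    "{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token }}"
    "{% endif %}{% endfor %}"
)

messages = [{"role": "user", "content": "Q: Name the planets in the solar system? A: "}]

# With bos_token/eos_token empty, the leading <s> simply disappears from the prompt:
print(Template(chat_template).render(messages=messages, bos_token="", eos_token=""))
# [INST] Q: Name the planets in the solar system? A:  [/INST]

Whether that alone explains the empty completion I can't say, but it at least shows the tokens never make it into the rendered prompt.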

On the other hand, if I run (where the only change I've made is adding the chat_format arg):

from llama_cpp import Llama
llm = Llama(model_path="models/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf", chat_format="mistral-instruct")
llm.create_chat_completion(
      messages = [
          {"role": "user","content": "Q: Name the planets in the solar system? A: "}
      ]
)

I get the expected result:

llama_print_timings:        load time =    1281.79 ms
llama_print_timings:      sample time =      55.94 ms /   126 runs   (    0.44 ms per token,  2252.37 tokens per second)
llama_print_timings: prompt eval time =    1281.64 ms /    22 tokens (   58.26 ms per token,    17.17 tokens per second)
llama_print_timings:        eval time =   19087.61 ms /   125 runs   (  152.70 ms per token,     6.55 tokens per second)
llama_print_timings:       total time =   20728.50 ms /   147 tokens
{'id': 'chatcmpl-df42bfe7-c7a3-4f91-93ba-d7e0d801adf0',
 'object': 'chat.completion',
 'created': 1708461626,
 'model': 'models/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf',
 'choices': [{'index': 0,
   'message': {'role': 'assistant',
    'content': ' Sure, I\'d be happy to help with that! The solar system consists of eight planets. Here they are, listed in order of their proximity to the Sun:\n\n1. Mercury\n2. Venus\n3. Earth\n4. Mars\n5. Jupiter\n6. Saturn\n7. Uranus\n8. Neptune\n\nIt\'s worth noting that Pluto was once considered the ninth planet in our solar system, but it was reclassified as a "dwarf planet" by the International Astronomical Union in 2006.'},
   'finish_reason': 'stop'}],
 'usage': {'prompt_tokens': 22, 'completion_tokens': 125, 'total_tokens': 147}}

Is this possibly related to a bug in da003d8?
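
In the meantime, one possible workaround is to build the chat handler by hand from the template in the metadata and pass the token strings explicitly. This is only a sketch; it assumes Jinja2ChatFormatter from llama_cpp.llama_chat_format (taking template, eos_token, and bos_token, with a to_chat_handler() method) and that chat_handler can be assigned after construction:

from llama_cpp import Llama
from llama_cpp.llama_chat_format import Jinja2ChatFormatter

llm = Llama(model_path="models/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf")

# Reuse the template shipped in the GGUF, but supply the BOS/EOS strings
# that the auto chat format failed to pick up.
formatter = Jinja2ChatFormatter(
    template=llm.metadata["tokenizer.chat_template"],
    eos_token="</s>",
    bos_token="<s>",
)
llm.chat_handler = formatter.to_chat_handler()

llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "Q: Name the planets in the solar system? A: "}
    ]
)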
