Name and Version
.\llama-mtmd-cli.exe --version
load_backend: loaded RPC backend from D:\Models\llama.cpp\llama-b5453-bin-win-cpu-x64\ggml-rpc.dll
load_backend: loaded CPU backend from D:\Models\llama.cpp\llama-b5453-bin-win-cpu-x64\ggml-cpu-haswell.dll
version: 5453 (6b56a64)
built with clang version 18.1.8 for x86_64-pc-windows-msvc
Operating systems
Windows
Which llama.cpp modules do you know to be affected?
Other (Please specify in the next section)
Command line
.\llama-mtmd-cli.exe -m ".\Models\SmolVLM2-500M-Video-Instruct\SmolVLM2-500M-Video-Instruct-Q8_0.gguf" --mmproj ".\Models\SmolVLM2-500M-Video-Instruct\mmproj-SmolVLM2-500M-Video-Instruct-Q8_0.gguf" --image ".\Images\img_01.png" --image ".\Images\img_02.png" --image ".\Images\img_03.png" --image ".\Images\img_04.png" --image ".\Images\img_05.png" -p "Describe this video sequence, as depicted by these images." --temp 0.2 --seed 42
Problem description & steps to reproduce
Using llama-mtmd-cli with multiple images as input, only the first image appears to be used for inference.
Reading the code (llama.cpp/tools/mtmd/mtmd-cli.cpp, line 292 in 6b56a64), it does appear that support for multiple images was intended.
However, running the given command line while excluding img_05, then img_04, then img_03, then img_02 always ends with the same output answer. The answer also does not appear to contain any information about the later images.
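For reference, a minimal PowerShell sketch of this reproduction; everything except the out_N.txt output files (which exist only for this sketch) comes from the command line above:

$imgs = 1..5 | ForEach-Object { ".\Images\img_0$($_).png" }
foreach ($n in 5..1) {
    # Keep only the first $n images and rerun the exact same command.
    $imgArgs = $imgs[0..($n - 1)] | ForEach-Object { @("--image", $_) }
    .\llama-mtmd-cli.exe -m ".\Models\SmolVLM2-500M-Video-Instruct\SmolVLM2-500M-Video-Instruct-Q8_0.gguf" `
        --mmproj ".\Models\SmolVLM2-500M-Video-Instruct\mmproj-SmolVLM2-500M-Video-Instruct-Q8_0.gguf" `
        @imgArgs `
        -p "Describe this video sequence, as depicted by these images." `
        --temp 0.2 --seed 42 > "out_$n.txt" 2>&1
}
# Comparing out_5.txt .. out_1.txt: the generated descriptions come out essentially identical.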
When using llama-mtmd-cli in chat mode and inputting the same images, multiple images are supported: I can ask questions about the last image that cannot be answered from the first image alone.
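A sketch of how this chat-mode comparison can be run (assumption: chat mode is entered when neither -p nor --image is given, and images are then loaded with the /image command from the chat-mode help; the session shown is only approximate):

.\llama-mtmd-cli.exe -m ".\Models\SmolVLM2-500M-Video-Instruct\SmolVLM2-500M-Video-Instruct-Q8_0.gguf" `
    --mmproj ".\Models\SmolVLM2-500M-Video-Instruct\mmproj-SmolVLM2-500M-Video-Instruct-Q8_0.gguf" `
    --temp 0.2 --seed 42
# then, inside the chat session:
#   /image .\Images\img_01.png
#   ...
#   /image .\Images\img_05.png
#   Describe this video sequence, as depicted by these images.

In such a session the encode/decode log block appears once per loaded image, and the answer reflects all of them.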
I can also use llama-server with the same images as input, and multiple images are supported there as well.
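For the llama-server comparison, a rough PowerShell sketch of the kind of multi-image request that works there (assumptions: the server is started with the same -m/--mmproj, listens on the default port 8080, and accepts the OpenAI-compatible /v1/chat/completions payload with base64 image_url entries):

$imgParts = 1..5 | ForEach-Object {
    $bytes = [IO.File]::ReadAllBytes((Resolve-Path ".\Images\img_0$($_).png").Path)
    $b64   = [Convert]::ToBase64String($bytes)
    @{ type = "image_url"; image_url = @{ url = "data:image/png;base64,$b64" } }
}
$body = @{
    messages = @(@{
        role    = "user"
        content = @(@{ type = "text"; text = "Describe this video sequence, as depicted by these images." }) + $imgParts
    })
} | ConvertTo-Json -Depth 8
Invoke-RestMethod -Uri "http://localhost:8080/v1/chat/completions" -Method Post -ContentType "application/json" -Body $body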
So it seems that something is wrong with the llama-mtmd-cli non-chat-mode implementation. I have been unable to find the issue in the code myself.
Additionally, with the above data, llama-mtmd-cli in chat mode is about 5 times slower than in non-chat mode. This also matches the terminal output: in non-chat mode I only see
encoding image or slice...
image/slice encoded in 1215 ms
decoding image batch 1/1, n_tokens_batch = 64
image decoded (batch 1/1) in 182 ms
once, while in chat mode I see it number_of_images times.
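As a rough sanity check on those numbers: encoding plus decoding one image takes about 1.2 s + 0.2 s ≈ 1.4 s here, so doing that work once per image would make the image-processing portion roughly 5 × 1.4 s ≈ 7 s for five images instead of ~1.4 s for one, which is consistent with the encode/decode block appearing number_of_images times in chat mode but only once in non-chat mode.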
First Bad Commit
No response
Relevant log output
PS D:\Models\llama.cpp\llama-b5453-bin-win-cpu-x64> .\llama-mtmd-cli.exe -m ".\Models\SmolVLM2-500M-Video-Instruct\SmolVLM2-500M-Video-Instruct-Q8_0.gguf" --mmproj ".\Models\SmolVLM2-500M-Video-Instruct\mmproj-SmolVLM2-500M-Video-Instruct-Q8_0.gguf" --image ".\Images\img_01.png" --image ".\Images\img_02.png" --image ".\Images\img_03.png" --image ".\Images\img_04.png" --image ".\Images\img_05.png" -p "Describe this video sequence, as depicted by these images." --temp 0.2 --seed 42
load_backend: loaded RPC backend from D:\Models\llama.cpp\llama-b5453-bin-win-cpu-x64\ggml-rpc.dll
load_backend: loaded CPU backend from D:\Models\llama.cpp\llama-b5453-bin-win-cpu-x64\ggml-cpu-haswell.dll
build: 5453 (6b56a646) with clang version 18.1.8 for x86_64-pc-windows-msvc
llama_model_loader: loaded meta data with 75 key-value pairs and 291 tensors from .\Models\SmolVLM2-500M-Video-Instruct\SmolVLM2-500M-Video-Instruct-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = SmolVLM2 500M Video Instruct
llama_model_loader: - kv 3: general.finetune str = Video-Instruct
llama_model_loader: - kv 4: general.basename str = SmolVLM2
llama_model_loader: - kv 5: general.size_label str = 500M
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = SmolVLM 500M Instruct
llama_model_loader: - kv 9: general.base_model.0.organization str = HuggingFaceTB
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/HuggingFaceTB/...
llama_model_loader: - kv 11: general.dataset.count u32 = 12
llama_model_loader: - kv 12: general.dataset.0.name str = The_Cauldron
llama_model_loader: - kv 13: general.dataset.0.organization str = HuggingFaceM4
llama_model_loader: - kv 14: general.dataset.0.repo_url str = https://huggingface.co/HuggingFaceM4/...
llama_model_loader: - kv 15: general.dataset.1.name str = Docmatix
llama_model_loader: - kv 16: general.dataset.1.organization str = HuggingFaceM4
llama_model_loader: - kv 17: general.dataset.1.repo_url str = https://huggingface.co/HuggingFaceM4/...
llama_model_loader: - kv 18: general.dataset.2.name str = LLaVA OneVision Data
llama_model_loader: - kv 19: general.dataset.2.organization str = Lmms Lab
llama_model_loader: - kv 20: general.dataset.2.repo_url str = https://huggingface.co/lmms-lab/LLaVA...
llama_model_loader: - kv 21: general.dataset.3.name str = M4 Instruct Data
llama_model_loader: - kv 22: general.dataset.3.organization str = Lmms Lab
llama_model_loader: - kv 23: general.dataset.3.repo_url str = https://huggingface.co/lmms-lab/M4-In...
llama_model_loader: - kv 24: general.dataset.4.name str = Finevideo
llama_model_loader: - kv 25: general.dataset.4.organization str = HuggingFaceFV
llama_model_loader: - kv 26: general.dataset.4.repo_url str = https://huggingface.co/HuggingFaceFV/...
llama_model_loader: - kv 27: general.dataset.5.name str = MAmmoTH VL Instruct 12M
llama_model_loader: - kv 28: general.dataset.5.organization str = MAmmoTH VL
llama_model_loader: - kv 29: general.dataset.5.repo_url str = https://huggingface.co/MAmmoTH-VL/MAm...
llama_model_loader: - kv 30: general.dataset.6.name str = LLaVA Video 178K
llama_model_loader: - kv 31: general.dataset.6.organization str = Lmms Lab
llama_model_loader: - kv 32: general.dataset.6.repo_url str = https://huggingface.co/lmms-lab/LLaVA...
llama_model_loader: - kv 33: general.dataset.7.name str = Video STaR
llama_model_loader: - kv 34: general.dataset.7.organization str = Orrzohar
llama_model_loader: - kv 35: general.dataset.7.repo_url str = https://huggingface.co/orrzohar/Video...
llama_model_loader: - kv 36: general.dataset.8.name str = Vript
llama_model_loader: - kv 37: general.dataset.8.organization str = Mutonix
llama_model_loader: - kv 38: general.dataset.8.repo_url str = https://huggingface.co/Mutonix/Vript
llama_model_loader: - kv 39: general.dataset.9.name str = VISTA 400K
llama_model_loader: - kv 40: general.dataset.9.organization str = TIGER Lab
llama_model_loader: - kv 41: general.dataset.9.repo_url str = https://huggingface.co/TIGER-Lab/VIST...
llama_model_loader: - kv 42: general.dataset.10.name str = MovieChat 1K_train
llama_model_loader: - kv 43: general.dataset.10.organization str = Enxin
llama_model_loader: - kv 44: general.dataset.10.repo_url str = https://huggingface.co/Enxin/MovieCha...
llama_model_loader: - kv 45: general.dataset.11.name str = ShareGPT4Video
llama_model_loader: - kv 46: general.dataset.11.organization str = ShareGPT4Video
llama_model_loader: - kv 47: general.dataset.11.repo_url str = https://huggingface.co/ShareGPT4Video...
llama_model_loader: - kv 48: general.tags arr[str,1] = ["image-text-to-text"]
llama_model_loader: - kv 49: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 50: llama.block_count u32 = 32
llama_model_loader: - kv 51: llama.context_length u32 = 8192
llama_model_loader: - kv 52: llama.embedding_length u32 = 960
llama_model_loader: - kv 53: llama.feed_forward_length u32 = 2560
llama_model_loader: - kv 54: llama.attention.head_count u32 = 15
llama_model_loader: - kv 55: llama.attention.head_count_kv u32 = 5
llama_model_loader: - kv 56: llama.rope.freq_base f32 = 100000.000000
llama_model_loader: - kv 57: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 58: llama.attention.key_length u32 = 64
llama_model_loader: - kv 59: llama.attention.value_length u32 = 64
llama_model_loader: - kv 60: llama.vocab_size u32 = 49280
llama_model_loader: - kv 61: llama.rope.dimension_count u32 = 64
llama_model_loader: - kv 62: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 63: tokenizer.ggml.pre str = smollm
llama_model_loader: - kv 64: tokenizer.ggml.tokens arr[str,49280] = ["<|endoftext|>", "<|im_start|>", "<|...
llama_model_loader: - kv 65: tokenizer.ggml.token_type arr[i32,49280] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 66: tokenizer.ggml.merges arr[str,48900] = ["─á t", "─á a", "i n", "h e", "─á ─á...
llama_model_loader: - kv 67: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 68: tokenizer.ggml.eos_token_id u32 = 49279
llama_model_loader: - kv 69: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 70: tokenizer.ggml.padding_token_id u32 = 2
llama_model_loader: - kv 71: tokenizer.chat_template str = <|im_start|>{% for message in message...
llama_model_loader: - kv 72: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 73: general.quantization_version u32 = 2
llama_model_loader: - kv 74: general.file_type u32 = 7
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q8_0: 226 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 414.86 MiB (8.50 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 145
load: token to piece cache size = 0.3199 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 8192
print_info: n_embd = 960
print_info: n_layer = 32
print_info: n_head = 15
print_info: n_head_kv = 5
print_info: n_rot = 64
print_info: n_swa = 0
print_info: n_swa_pattern = 1
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = 3
print_info: n_embd_k_gqa = 320
print_info: n_embd_v_gqa = 320
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 2560
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 100000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 8192
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 8B
print_info: model params = 409.25 M
print_info: general.name = SmolVLM2 500M Video Instruct
print_info: vocab type = BPE
print_info: n_vocab = 49280
print_info: n_merges = 48900
print_info: BOS token = 1 '<|im_start|>'
print_info: EOS token = 49279 '<end_of_utterance>'
print_info: EOT token = 0 '<|endoftext|>'
print_info: UNK token = 0 '<|endoftext|>'
print_info: PAD token = 2 '<|im_end|>'
print_info: LF token = 198 '─è'
print_info: FIM REP token = 4 '<reponame>'
print_info: EOG token = 0 '<|endoftext|>'
print_info: EOG token = 2 '<|im_end|>'
print_info: EOG token = 4 '<reponame>'
print_info: EOG token = 49279 '<end_of_utterance>'
print_info: max token length = 162
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 0 repeating layers to GPU
load_tensors: offloaded 0/33 layers to GPU
load_tensors: CPU_Mapped model buffer size = 414.86 MiB
...............................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 100000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.19 MiB
llama_kv_cache_unified: CPU KV buffer size = 160.00 MiB
llama_kv_cache_unified: size = 160.00 MiB ( 4096 cells, 32 layers, 1 seqs), K (f16): 80.00 MiB, V (f16): 80.00 MiB
llama_context: CPU compute buffer size = 135.51 MiB
llama_context: graph nodes = 1158
llama_context: graph splits = 1
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
Failed to generate tool call example: Value is not callable: null at row 1, column 72:
<|im_start|>{% for message in messages %}{{message['role'] | capitalize}}{% if message['content'][0]['type'] == 'image' %}{{':'}}{% else %}{{': '}}{% endif %}{% for line in message['content'] %}{% if line['type'] == 'text' %}{{line['text']}}{% elif line['type'] == 'image' %}{{ '<image>' }}{% endif %}{% endfor %}<end_of_utterance>
^
{% endfor %}{% if add_generation_prompt %}{{ 'Assistant:' }}{% endif %}
at row 1, column 42:
<|im_start|>{% for message in messages %}{{message['role'] | capitalize}}{% if message['content'][0]['type'] == 'image' %}{{':'}}{% else %}{{': '}}{% endif %}{% for line in message['content'] %}{% if line['type'] == 'text' %}{{line['text']}}{% elif line['type'] == 'image' %}{{ '<image>' }}{% endif %}{% endfor %}<end_of_utterance>
^
{% endfor %}{% if add_generation_prompt %}{{ 'Assistant:' }}{% endif %}
at row 1, column 42:
<|im_start|>{% for message in messages %}{{message['role'] | capitalize}}{% if message['content'][0]['type'] == 'image' %}{{':'}}{% else %}{{': '}}{% endif %}{% for line in message['content'] %}{% if line['type'] == 'text' %}{{line['text']}}{% elif line['type'] == 'image' %}{{ '<image>' }}{% endif %}{% endfor %}<end_of_utterance>
^
{% endfor %}{% if add_generation_prompt %}{{ 'Assistant:' }}{% endif %}
at row 1, column 13:
<|im_start|>{% for message in messages %}{{message['role'] | capitalize}}{% if message['content'][0]['type'] == 'image' %}{{':'}}{% else %}{{': '}}{% endif %}{% for line in message['content'] %}{% if line['type'] == 'text' %}{{line['text']}}{% elif line['type'] == 'image' %}{{ '<image>' }}{% endif %}{% endfor %}<end_of_utterance>
^
{% endfor %}{% if add_generation_prompt %}{{ 'Assistant:' }}{% endif %}
at row 1, column 1:
<|im_start|>{% for message in messages %}{{message['role'] | capitalize}}{% if message['content'][0]['type'] == 'image' %}{{':'}}{% else %}{{': '}}{% endif %}{% for line in message['content'] %}{% if line['type'] == 'text' %}{{line['text']}}{% elif line['type'] == 'image' %}{{ '<image>' }}{% endif %}{% endfor %}<end_of_utterance>
^
{% endfor %}{% if add_generation_prompt %}{{ 'Assistant:' }}{% endif %}
clip_ctx: CLIP using CPU backend
mtmd_cli_context: chat template example:
<|im_start|>You are a helpful assistant
User: Hello<end_of_utterance>
Assistant: Hi there<end_of_utterance>
User: How are you?<end_of_utterance>
Assistant:
clip_model_loader: model name: SmolVLM2 500M Video Instruct
clip_model_loader: description:
clip_model_loader: GGUF version: 3
clip_model_loader: alignment: 32
clip_model_loader: n_tensors: 198
clip_model_loader: n_kv: 66
load_hparams: projector: idefics3
load_hparams: n_embd: 768
load_hparams: n_head: 12
load_hparams: n_ff: 3072
load_hparams: n_layer: 12
load_hparams: projection_dim: 960
load_hparams: image_size: 512
load_hparams: patch_size: 16
load_hparams: has_llava_proj: 0
load_hparams: minicpmv_version: 0
load_hparams: proj_scale_factor: 4
load_hparams: n_wa_pattern: 0
load_hparams: ffn_op: gelu
load_hparams: model size: 103.73 MiB
load_hparams: metadata size: 0.07 MiB
alloc_compute_meta: CPU compute buffer size = 60.00 MiB
main: loading model: .\Models\SmolVLM2-500M-Video-Instruct\SmolVLM2-500M-Video-Instruct-Q8_0.gguf
encoding image or slice...
image/slice encoded in 1168 ms
decoding image batch 1/1, n_tokens_batch = 64
image decoded (batch 1/1) in 188 ms
The video sequence depicts a close-up view of a person's eye, focusing on the iris and pupil. The eye is partially visible, with the iris appearing to be a light brown color. The pupil is dark, and the surrounding skin is pale. The eye is set against a blurred background, which suggests that the person is wearing a dark-colored shirt. The lighting in the scene is soft and diffused, with no harsh shadows or highlights. The overall color tone of the image is muted, with the eye being the main focus. The video sequence appears to be a still shot, capturing a moment in time without any movement or action.
llama_perf_context_print: load time = 254.19 ms
llama_perf_context_print: prompt eval time = 1446.16 ms / 87 tokens ( 16.62 ms per token, 60.16 tokens per second)
llama_perf_context_print: eval time = 2121.79 ms / 131 runs ( 16.20 ms per token, 61.74 tokens per second)
llama_perf_context_print: total time = 3798.00 ms / 218 tokens