Name and Version
.\llama-mtmd-cli.exe --version
load_backend: loaded RPC backend from D:\Models\llama.cpp\llama-b5453-bin-win-cpu-x64\ggml-rpc.dll
load_backend: loaded CPU backend from D:\Models\llama.cpp\llama-b5453-bin-win-cpu-x64\ggml-cpu-haswell.dll
version: 5453 (6b56a64)
built with clang version 18.1.8 for x86_64-pc-windows-msvc
Operating systems
Windows
Which llama.cpp modules do you know to be affected?
Other (Please specify in the next section)
Command line
.\llama-mtmd-cli.exe -m ".\Models\SmolVLM2-500M-Video-Instruct\SmolVLM2-500M-Video-Instruct-Q8_0.gguf" --mmproj ".\Models\SmolVLM2-500M-Video-Instruct\mmproj-SmolVLM2-500M-Video-Instruct-Q8_0.gguf" --image ".\Images\img_01.png" --image ".\Images\img_02.png" --image ".\Images\img_03.png" --image ".\Images\img_04.png" --image ".\Images\img_05.png" -p "Describe this video sequence, as depicted by these images." --temp 0.2 --seed 42
Problem description & steps to reproduce
Using llama-mtmd-cli with multiple images as input, only the first image appears to be used for inference.
Reading the code (llama.cpp/tools/mtmd/mtmd-cli.cpp, line 292 in 6b56a64), it does appear that support for multiple images was intended.
However, running the given command line while excluding img_05, then img_04, then img_03, then img_02 always ends with the same output answer. The answer also does not appear to contain any information about the later images.
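For reference, a minimal PowerShell sketch of this reproduction; everything except the out_N.txt output files (which exist only for this sketch) comes from the command line above:

$imgs = 1..5 | ForEach-Object { ".\Images\img_0$($_).png" }
foreach ($n in 5..1) {
    # Keep only the first $n images and rerun the exact same command.
    $imgArgs = $imgs[0..($n - 1)] | ForEach-Object { @("--image", $_) }
    .\llama-mtmd-cli.exe -m ".\Models\SmolVLM2-500M-Video-Instruct\SmolVLM2-500M-Video-Instruct-Q8_0.gguf" `
        --mmproj ".\Models\SmolVLM2-500M-Video-Instruct\mmproj-SmolVLM2-500M-Video-Instruct-Q8_0.gguf" `
        @imgArgs `
        -p "Describe this video sequence, as depicted by these images." `
        --temp 0.2 --seed 42 > "out_$n.txt" 2>&1
}
# Comparing out_5.txt .. out_1.txt: the generated descriptions come out essentially identical.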
When using llama-mtmd-cli in chat mode and inputting the same images, multiple images are supported: I can ask questions about the last image that cannot be answered from the first image alone.
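A sketch of how this chat-mode comparison can be run (assumption: chat mode is entered when neither -p nor --image is given, and images are then loaded with the /image command from the chat-mode help; the session shown is only approximate):

.\llama-mtmd-cli.exe -m ".\Models\SmolVLM2-500M-Video-Instruct\SmolVLM2-500M-Video-Instruct-Q8_0.gguf" `
    --mmproj ".\Models\SmolVLM2-500M-Video-Instruct\mmproj-SmolVLM2-500M-Video-Instruct-Q8_0.gguf" `
    --temp 0.2 --seed 42
# then, inside the chat session:
#   /image .\Images\img_01.png
#   ...
#   /image .\Images\img_05.png
#   Describe this video sequence, as depicted by these images.

In such a session the encode/decode log block appears once per loaded image, and the answer reflects all of them.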
I can also use llama-server with the same images as input, and multiple images are supported there as well.
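For the llama-server comparison, a rough PowerShell sketch of the kind of multi-image request that works there (assumptions: the server is started with the same -m/--mmproj, listens on the default port 8080, and accepts the OpenAI-compatible /v1/chat/completions payload with base64 image_url entries):

$imgParts = 1..5 | ForEach-Object {
    $bytes = [IO.File]::ReadAllBytes((Resolve-Path ".\Images\img_0$($_).png").Path)
    $b64   = [Convert]::ToBase64String($bytes)
    @{ type = "image_url"; image_url = @{ url = "data:image/png;base64,$b64" } }
}
$body = @{
    messages = @(@{
        role    = "user"
        content = @(@{ type = "text"; text = "Describe this video sequence, as depicted by these images." }) + $imgParts
    })
} | ConvertTo-Json -Depth 8
Invoke-RestMethod -Uri "http://localhost:8080/v1/chat/completions" -Method Post -ContentType "application/json" -Body $body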
So it seems that something is wrong with the llama-mtmd-cli non-chat-mode implementation. I have been unable to find the issue in the code myself.
Additionally, with the above data, llama-mtmd-cli in chat mode is about 5 times slower than in non-chat mode. This also matches the terminal output: in non-chat mode I only see
encoding image or slice...
image/slice encoded in 1215 ms
decoding image batch 1/1, n_tokens_batch = 64
image decoded (batch 1/1) in 182 ms
once, while in chat mode I see it number_of_images times.
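As a rough sanity check on those numbers: encoding plus decoding one image takes about 1.2 s + 0.2 s ≈ 1.4 s here, so doing that work once per image would make the image-processing portion roughly 5 × 1.4 s ≈ 7 s for five images instead of ~1.4 s for one, which is consistent with the encode/decode block appearing number_of_images times in chat mode but only once in non-chat mode.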
First Bad Commit
No response
Relevant log output
PS D:\Models\llama.cpp\llama-b5453-bin-win-cpu-x64> .\llama-mtmd-cli.exe -m ".\Models\SmolVLM2-500M-Video-Instruct\SmolVLM2-500M-Video-Instruct-Q8_0.gguf" --mmproj ".\Models\SmolVLM2-500M-Video-Instruct\mmproj-SmolVLM2-500M-Video-Instruct-Q8_0.gguf" --image ".\Images\img_01.png" --image ".\Images\img_02.png" --image ".\Images\img_03.png" --image ".\Images\img_04.png" --image ".\Images\img_05.png" -p "Describe this video sequence, as depicted by these images." --temp 0.2 --seed 42
load_backend: loaded RPC backend from D:\Models\llama.cpp\llama-b5453-bin-win-cpu-x64\ggml-rpc.dll
load_backend: loaded CPU backend from D:\Models\llama.cpp\llama-b5453-bin-win-cpu-x64\ggml-cpu-haswell.dll
build: 5453 (6b56a646) with clang version 18.1.8 for x86_64-pc-windows-msvc
llama_model_loader: loaded meta data with 75 key-value pairs and 291 tensors from .\Models\SmolVLM2-500M-Video-Instruct\SmolVLM2-500M-Video-Instruct-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = SmolVLM2 500M Video Instruct
llama_model_loader: - kv 3: general.finetune str = Video-Instruct
llama_model_loader: - kv 4: general.basename str = SmolVLM2
llama_model_loader: - kv 5: general.size_label str = 500M
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = SmolVLM 500M Instruct
llama_model_loader: - kv 9: general.base_model.0.organization str = HuggingFaceTB
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/HuggingFaceTB/...
llama_model_loader: - kv 11: general.dataset.count u32 = 12
llama_model_loader: - kv 12: general.dataset.0.name str = The_Cauldron
llama_model_loader: - kv 13: general.dataset.0.organization str = HuggingFaceM4
llama_model_loader: - kv 14: general.dataset.0.repo_url str = https://huggingface.co/HuggingFaceM4/...
llama_model_loader: - kv 15: general.dataset.1.name str = Docmatix
llama_model_loader: - kv 16: general.dataset.1.organization str = HuggingFaceM4
llama_model_loader: - kv 17: general.dataset.1.repo_url str = https://huggingface.co/HuggingFaceM4/...
llama_model_loader: - kv 18: general.dataset.2.name str = LLaVA OneVision Data
llama_model_loader: - kv 19: general.dataset.2.organization str = Lmms Lab
llama_model_loader: - kv 20: general.dataset.2.repo_url str = https://huggingface.co/lmms-lab/LLaVA...
llama_model_loader: - kv 21: general.dataset.3.name str = M4 Instruct Data
llama_model_loader: - kv 22: general.dataset.3.organization str = Lmms Lab
llama_model_loader: - kv 23: general.dataset.3.repo_url str = https://huggingface.co/lmms-lab/M4-In...
llama_model_loader: - kv 24: general.dataset.4.name str = Finevideo
llama_model_loader: - kv 25: general.dataset.4.organization str = HuggingFaceFV
llama_model_loader: - kv 26: general.dataset.4.repo_url str = https://huggingface.co/HuggingFaceFV/...
llama_model_loader: - kv 27: general.dataset.5.name str = MAmmoTH VL Instruct 12M
llama_model_loader: - kv 28: general.dataset.5.organization str = MAmmoTH VL
llama_model_loader: - kv 29: general.dataset.5.repo_url str = https://huggingface.co/MAmmoTH-VL/MAm...
llama_model_loader: - kv 30: general.dataset.6.name str = LLaVA Video 178K
llama_model_loader: - kv 31: general.dataset.6.organization str = Lmms Lab
llama_model_loader: - kv 32: general.dataset.6.repo_url str = https://huggingface.co/lmms-lab/LLaVA...
llama_model_loader: - kv 33: general.dataset.7.name str = Video STaR
llama_model_loader: - kv 34: general.dataset.7.organization str = Orrzohar
llama_model_loader: - kv 35: general.dataset.7.repo_url str = https://huggingface.co/orrzohar/Video...
llama_model_loader: - kv 36: general.dataset.8.name str = Vript
llama_model_loader: - kv 37: general.dataset.8.organization str = Mutonix
llama_model_loader: - kv 38: general.dataset.8.repo_url str = https://huggingface.co/Mutonix/Vript
llama_model_loader: - kv 39: general.dataset.9.name str = VISTA 400K
llama_model_loader: - kv 40: general.dataset.9.organization str = TIGER Lab
llama_model_loader: - kv 41: general.dataset.9.repo_url str = https://huggingface.co/TIGER-Lab/VIST...
llama_model_loader: - kv 42: general.dataset.10.name str = MovieChat 1K_train
llama_model_loader: - kv 43: general.dataset.10.organization str = Enxin
llama_model_loader: - kv 44: general.dataset.10.repo_url str = https://huggingface.co/Enxin/MovieCha...
llama_model_loader: - kv 45: general.dataset.11.name str = ShareGPT4Video
llama_model_loader: - kv 46: general.dataset.11.organization str = ShareGPT4Video
llama_model_loader: - kv 47: general.dataset.11.repo_url str = https://huggingface.co/ShareGPT4Video...
llama_model_loader: - kv 48: general.tags arr[str,1] = ["image-text-to-text"]
llama_model_loader: - kv 49: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 50: llama.block_count u32 = 32
llama_model_loader: - kv 51: llama.context_length u32 = 8192
llama_model_loader: - kv 52: llama.embedding_length u32 = 960
llama_model_loader: - kv 53: llama.feed_forward_length u32 = 2560
llama_model_loader: - kv 54: llama.attention.head_count u32 = 15
llama_model_loader: - kv 55: llama.attention.head_count_kv u32 = 5
llama_model_loader: - kv 56: llama.rope.freq_base f32 = 100000.000000
llama_model_loader: - kv 57: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 58: llama.attention.key_length u32 = 64
llama_model_loader: - kv 59: llama.attention.value_length u32 = 64
llama_model_loader: - kv 60: llama.vocab_size u32 = 49280
llama_model_loader: - kv 61: llama.rope.dimension_count u32 = 64
llama_model_loader: - kv 62: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 63: tokenizer.ggml.pre str = smollm
llama_model_loader: - kv 64: tokenizer.ggml.tokens arr[str,49280] = ["<|endoftext|>", "<|im_start|>", "<|...
llama_model_loader: - kv 65: tokenizer.ggml.token_type arr[i32,49280] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 66: tokenizer.ggml.merges arr[str,48900] = ["─á t", "─á a", "i n", "h e", "─á ─á...
llama_model_loader: - kv 67: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 68: tokenizer.ggml.eos_token_id u32 = 49279
llama_model_loader: - kv 69: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 70: tokenizer.ggml.padding_token_id u32 = 2
llama_model_loader: - kv 71: tokenizer.chat_template str = <|im_start|>{% for message in message...
llama_model_loader: - kv 72: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 73: general.quantization_version u32 = 2
llama_model_loader: - kv 74: general.file_type u32 = 7
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q8_0: 226 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 414.86 MiB (8.50 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 145
load: token to piece cache size = 0.3199 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 8192
print_info: n_embd = 960
print_info: n_layer = 32
print_info: n_head = 15
print_info: n_head_kv = 5
print_info: n_rot = 64
print_info: n_swa = 0
print_info: n_swa_pattern = 1
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = 3
print_info: n_embd_k_gqa = 320
print_info: n_embd_v_gqa = 320
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 2560
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 100000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 8192
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 8B
print_info: model params = 409.25 M
print_info: general.name = SmolVLM2 500M Video Instruct
print_info: vocab type = BPE
print_info: n_vocab = 49280
print_info: n_merges = 48900
print_info: BOS token = 1 '<|im_start|>'
print_info: EOS token = 49279 '<end_of_utterance>'
print_info: EOT token = 0 '<|endoftext|>'
print_info: UNK token = 0 '<|endoftext|>'
print_info: PAD token = 2 '<|im_end|>'
print_info: LF token = 198 '─è'
print_info: FIM REP token = 4 '<reponame>'
print_info: EOG token = 0 '<|endoftext|>'
print_info: EOG token = 2 '<|im_end|>'
print_info: EOG token = 4 '<reponame>'
print_info: EOG token = 49279 '<end_of_utterance>'
print_info: max token length = 162
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 0 repeating layers to GPU
load_tensors: offloaded 0/33 layers to GPU
load_tensors: CPU_Mapped model buffer size = 414.86 MiB
...............................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 100000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.19 MiB
llama_kv_cache_unified: CPU KV buffer size = 160.00 MiB
llama_kv_cache_unified: size = 160.00 MiB ( 4096 cells, 32 layers, 1 seqs), K (f16): 80.00 MiB, V (f16): 80.00 MiB
llama_context: CPU compute buffer size = 135.51 MiB
llama_context: graph nodes = 1158
llama_context: graph splits = 1
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
Failed to generate tool call example: Value is not callable: null at row 1, column 72:
<|im_start|>{% for message in messages %}{{message['role'] | capitalize}}{% if message['content'][0]['type'] == 'image' %}{{':'}}{% else %}{{': '}}{% endif %}{% for line in message['content'] %}{% if line['type'] == 'text' %}{{line['text']}}{% elif line['type'] == 'image' %}{{ '<image>' }}{% endif %}{% endfor %}<end_of_utterance>
^
{% endfor %}{% if add_generation_prompt %}{{ 'Assistant:' }}{% endif %}
at row 1, column 42:
<|im_start|>{% for message in messages %}{{message['role'] | capitalize}}{% if message['content'][0]['type'] == 'image' %}{{':'}}{% else %}{{': '}}{% endif %}{% for line in message['content'] %}{% if line['type'] == 'text' %}{{line['text']}}{% elif line['type'] == 'image' %}{{ '<image>' }}{% endif %}{% endfor %}<end_of_utterance>
^
{% endfor %}{% if add_generation_prompt %}{{ 'Assistant:' }}{% endif %}
at row 1, column 42:
<|im_start|>{% for message in messages %}{{message['role'] | capitalize}}{% if message['content'][0]['type'] == 'image' %}{{':'}}{% else %}{{': '}}{% endif %}{% for line in message['content'] %}{% if line['type'] == 'text' %}{{line['text']}}{% elif line['type'] == 'image' %}{{ '<image>' }}{% endif %}{% endfor %}<end_of_utterance>
^
{% endfor %}{% if add_generation_prompt %}{{ 'Assistant:' }}{% endif %}
at row 1, column 13:
<|im_start|>{% for message in messages %}{{message['role'] | capitalize}}{% if message['content'][0]['type'] == 'image' %}{{':'}}{% else %}{{': '}}{% endif %}{% for line in message['content'] %}{% if line['type'] == 'text' %}{{line['text']}}{% elif line['type'] == 'image' %}{{ '<image>' }}{% endif %}{% endfor %}<end_of_utterance>
^
{% endfor %}{% if add_generation_prompt %}{{ 'Assistant:' }}{% endif %}
at row 1, column 1:
<|im_start|>{% for message in messages %}{{message['role'] | capitalize}}{% if message['content'][0]['type'] == 'image' %}{{':'}}{% else %}{{': '}}{% endif %}{% for line in message['content'] %}{% if line['type'] == 'text' %}{{line['text']}}{% elif line['type'] == 'image' %}{{ '<image>' }}{% endif %}{% endfor %}<end_of_utterance>
^
{% endfor %}{% if add_generation_prompt %}{{ 'Assistant:' }}{% endif %}
clip_ctx: CLIP using CPU backend
mtmd_cli_context: chat template example:
<|im_start|>You are a helpful assistant
User: Hello<end_of_utterance>
Assistant: Hi there<end_of_utterance>
User: How are you?<end_of_utterance>
Assistant:
clip_model_loader: model name: SmolVLM2 500M Video Instruct
clip_model_loader: description:
clip_model_loader: GGUF version: 3
clip_model_loader: alignment: 32
clip_model_loader: n_tensors: 198
clip_model_loader: n_kv: 66
load_hparams: projector: idefics3
load_hparams: n_embd: 768
load_hparams: n_head: 12
load_hparams: n_ff: 3072
load_hparams: n_layer: 12
load_hparams: projection_dim: 960
load_hparams: image_size: 512
load_hparams: patch_size: 16
load_hparams: has_llava_proj: 0
load_hparams: minicpmv_version: 0
load_hparams: proj_scale_factor: 4
load_hparams: n_wa_pattern: 0
load_hparams: ffn_op: gelu
load_hparams: model size: 103.73 MiB
load_hparams: metadata size: 0.07 MiB
alloc_compute_meta: CPU compute buffer size = 60.00 MiB
main: loading model: .\Models\SmolVLM2-500M-Video-Instruct\SmolVLM2-500M-Video-Instruct-Q8_0.gguf
encoding image or slice...
image/slice encoded in 1168 ms
decoding image batch 1/1, n_tokens_batch = 64
image decoded (batch 1/1) in 188 ms
The video sequence depicts a close-up view of a person's eye, focusing on the iris and pupil. The eye is partially visible, with the iris appearing to be a light brown color. The pupil is dark, and the surrounding skin is pale. The eye is set against a blurred background, which suggests that the person is wearing a dark-colored shirt. The lighting in the scene is soft and diffused, with no harsh shadows or highlights. The overall color tone of the image is muted, with the eye being the main focus. The video sequence appears to be a still shot, capturing a moment in time without any movement or action.
llama_perf_context_print: load time = 254.19 ms
llama_perf_context_print: prompt eval time = 1446.16 ms / 87 tokens ( 16.62 ms per token, 60.16 tokens per second)
llama_perf_context_print: eval time = 2121.79 ms / 131 runs ( 16.20 ms per token, 61.74 tokens per second)
llama_perf_context_print: total time = 3798.00 ms / 218 tokens