[transformers x vLLM] standardize processors #37915
Conversation
Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the "Ready for review" button.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Nice! 🤗
if return_mm_token_type_ids:
    array_ids = np.array(text_inputs["input_ids"])
    mm_token_type_ids = np.zeros_like(text_inputs["input_ids"])
    mm_token_type_ids[array_ids == self.image_token_id] = 1
    text_inputs["mm_token_type_ids"] = mm_token_type_ids.tolist()
I am guessing numpy for no torch deps?
Yep, I know we don't have any new processors that need to support JAX or TF, so torch is probably already installed for users. I did this just for consistency with all the other processors.
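For anyone skimming the thread, here is a minimal standalone sketch of that masking pattern with made-up token ids and a made-up image_token_id, showing that plain numpy is enough here:

```python
import numpy as np

# Toy inputs: the token ids and image_token_id below are made up for illustration.
image_token_id = 32000
input_ids = [[101, 32000, 32000, 2023, 102]]

array_ids = np.array(input_ids)
mm_token_type_ids = np.zeros_like(array_ids)
mm_token_type_ids[array_ids == image_token_id] = 1  # mark image positions with 1, text with 0

print(mm_token_type_ids.tolist())  # [[0, 1, 1, 0, 0]]
```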
@@ -29,6 +30,7 @@
    ImageInput,
    make_flat_list_of_images,
)
from ..got_ocr2.image_processing_got_ocr2 import get_optimal_tiled_canvas
Have not checked this one, but is it pre-computed? If not, we probably should!
It is computed based on the input image, similar to what llava-next does: resize and divide into patches while keeping as much information as possible. I moved the patching-related logic to the image processor class now.
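For readers unfamiliar with that step, here is a rough sketch of the tiling idea, not the actual get_optimal_tiled_canvas implementation; the max_tiles parameter and the grid-selection criterion are assumptions for illustration: pick the tile grid whose aspect ratio best matches the input image, then resize to that canvas and cut it into patches.

```python
# Hedged sketch only; the real get_optimal_tiled_canvas in got_ocr2 differs in details.
def pick_tile_grid(height: int, width: int, max_tiles: int = 6) -> tuple[int, int]:
    """Choose the (rows, cols) grid whose aspect ratio is closest to the image's."""
    image_ratio = width / height
    best_grid, best_diff = (1, 1), float("inf")
    for rows in range(1, max_tiles + 1):
        for cols in range(1, max_tiles + 1):
            if rows * cols > max_tiles:
                continue
            diff = abs(cols / rows - image_ratio)
            if diff < best_diff:
                best_grid, best_diff = (rows, cols), diff
    return best_grid  # resize to (rows * tile_size, cols * tile_size), then split into tiles


print(pick_tile_grid(1080, 1920))  # (1, 2) for a wide 16:9 image
```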
self.row_col_ids = [
    tokenizer.convert_tokens_to_ids(f"<row_{i + 1}_col_{j + 1}>") for i in range(6) for j in range(6)
]
I would rather just hardcode them if there are only 6!
It is already hardcoded to be 6. Do you mean in the Hub configs, or somewhere else?
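For reference, the precomputed list above covers the 6 x 6 grid of tile-position tokens; the token strings look like this (the ids themselves depend on the tokenizer):

```python
# Token strings whose ids are precomputed in self.row_col_ids above.
row_col_tokens = [f"<row_{i + 1}_col_{j + 1}>" for i in range(6) for j in range(6)]
print(len(row_col_tokens))  # 36
print(row_col_tokens[:3])   # ['<row_1_col_1>', '<row_1_col_2>', '<row_1_col_3>']
```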
array_ids = np.array(inputs["input_ids"])
mm_token_type_ids = np.zeros_like(array_ids)
for i, seq_lengths in enumerate(batch_image_seq_lengths):
    image_start_positions = np.where(array_ids[i] == self.fake_image_token_id)[0]
    j = 0
    for seq_len in seq_lengths:
        if j >= len(image_start_positions):
            break
        start = image_start_positions[j]
        end = start + seq_len
        mm_token_type_ids[i, start:end] = 1
        j = np.searchsorted(image_start_positions, end)

inputs["mm_token_type_ids"] = mm_token_type_ids.tolist()
This can definitely be simplified, no? It's weird!
I know, I would love to simply mask out certain token ids. The issue is that the idefics special image tokens include \n, and we can't mask that by id. So the expanded sequence is something like {image_id * N}{fake_wrapper_id}\n\n{next_col_id}{image_id * N}... We could get away with masking only the image ids, but that means vLLM chunked prefill will fail: vLLM assumes it gets contiguous positions for a single image.
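To make that concrete, here is a toy illustration (all token ids made up) of why masking by id alone breaks contiguity, while marking whole [start, start + seq_len) spans keeps each image block contiguous:

```python
import numpy as np

# Made-up ids: 5 = image patch, 9 = fake image wrapper, 2 = "\n", 7 = row/col token, 1 = text.
ids = np.array([[1, 9, 5, 5, 2, 7, 5, 5, 9, 1]])

# Masking only the image-patch id leaves holes inside the image block:
print((ids == 5).astype(int).tolist())  # [[0, 0, 1, 1, 0, 0, 1, 1, 0, 0]]

# Marking the whole span gives the contiguous positions vLLM expects for one image:
mm_token_type_ids = np.zeros_like(ids)
start, seq_len = 1, 8  # hypothetical: block starts at the wrapper and spans 8 tokens
mm_token_type_ids[0, start:start + seq_len] = 1
print(mm_token_type_ids.tolist())  # [[0, 1, 1, 1, 1, 1, 1, 1, 1, 0]]
```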
Though for idefics specifically, the inference freezes forever when the input length is higher than "max-tokens-allowed-per-batch". I will look into it more with Harry.
Oke, I verified that the inference works for all models, unless I forgot about some new ones. Here is the list I tested. A few models (blip2, gotOcr, gemma3) won't be supported in the first release. Gemma3 is already planned after we merge the first version of the integration; it requires bigger changes for us to make bidirectional attention work with token_type_ids.

model_example_map = {
    "aria": run_aria,
    "aya_vision": run_aya_vision,
    "chameleon": run_chameleon,  # NOTE: DONE but needs to add suppress token in hub generation config
    "emu3": run_emu,
    "fuyu": run_fuyu,  # Almost there, needs new attn interface for Persimmon LM backend in new PR
    "got_ocr": run_got_ocr,  # More complex as it needs to add boxes/etc. Might support later
    "idefics3": run_idefics3,
    "internvl_chat": run_internvl,
    "llava": run_llava,
    "pixtral": run_pixtral,
    "llava_next": run_llava_next,
    "llava_onevision": run_llava_onevision,
    "mllama": run_mllama,  # Cross attn not yet supported
    "mistral3": run_mistral3,
    "paligemma": run_paligemma,
    "paligemma2": run_paligemma2,
    "qwen2_vl": run_qwen2_vl,
    "qwen2_5_vl": run_qwen2_5_vl,
    "vipllava": run_vipllava,
}

I will do the cleanup of this PR and open a subsequent PR with the rest of the changes for the modeling code. That's pretty much all that is left.
@@ -131,6 +132,11 @@ def __init__(
    self.img_line_break_token = img_line_break_token
    self.tile_token = tile_token
    self.tile_global_token = tile_global_token
    self.image_token_id = tokenizer.convert_tokens_to_ids(self.img_patch_token)
Aya vision was implemented to have different placeholder tokens in the prompt and after processing: image_token is replaced by img_patch_token, which should not happen. We make a lot of assumptions in the codebase that the input placeholder is the same one that the model uses.
So the two options are:
- Refactor it to be correct and use only img_patch_token, which means we need to update the Hub chat template. Might be very breaking.
- Current solution: not intuitive, but image_token_id isn't used anywhere, thus not breaking.
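As a rough illustration of the current solution (hypothetical token strings and a stub tokenizer, not the real Aya Vision vocabulary): the prompt-side placeholder stays as-is, while the id stored on the processor is derived from the patch token that actually appears in the expanded sequence.

```python
# Hypothetical sketch; the token strings and StubTokenizer are made up for illustration.
class StubTokenizer:
    vocab = {"<image>": 100, "<img_patch>": 101}

    def convert_tokens_to_ids(self, token: str) -> int:
        return self.vocab[token]


tokenizer = StubTokenizer()
image_token = "<image>"          # placeholder written in the prompt / chat template
img_patch_token = "<img_patch>"  # token the processor expands each image into

# Current solution: image_token_id is taken from the patch token, since that is what
# ends up in the processed sequence; the prompt placeholder is left untouched.
image_token_id = tokenizer.convert_tokens_to_ids(img_patch_token)
print(image_token_id)  # 101
```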
LGTM. One comment is that I would rather use explicit names: when it's not really multimodal, I would not call the dict multimodal, just for explicitness!
Otherwise nice abstraction 🤗
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
What does this PR do?
Part of #37780. The design was tested on different model types:
Oke, I verified that the inference works for all models, unless I forgot about some new ones. Here is the list I tested. A few models (blip2, gotOcr, gemma3) won't be supported in the first release. Gemma3 is already planned after we merge the first version of the integration; it requires bigger changes for us to make bidirectional attention work with token_type_ids.
I will do a subsequent PR with the rest of the changes for the modeling code. That's pretty much all that is left.