[video processors] support frame sampling within processors #38105
Conversation
Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the `Ready for review` button.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
```diff
@@ -916,7 +916,6 @@ class Qwen2_5_VLProcessorKwargs(ProcessingKwargs, total=False):
         "text_kwargs": {
             "padding": False,
         },
-        "videos_kwargs": {"fps": 2.0},
```
Otherwise we're always sampling, even when asked not to sample. The `fps` should match the actual fps of the video, and will be returned from the video processor.
```python
min_pixels = 128 * 28 * 28
max_pixels = 28 * 28 * 768
```
Video defaults are different from image defaults in their implementation; we just never cared before. Now that we can, it's better to use the correct defaults.
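For reference, `min_pixels`/`max_pixels` roughly bound the number of 28×28 patches each frame is resized to. A simplified sketch of how such a pixel budget is typically turned into a target size (assumed logic, loosely adapted from Qwen-style smart resizing; not the exact code in this PR):

```python
import math


def fit_to_pixel_budget(height, width, factor=28, min_pixels=128 * 28 * 28, max_pixels=28 * 28 * 768):
    """Pick a (height, width) that is a multiple of `factor`, keeps the aspect
    ratio, and whose area falls inside [min_pixels, max_pixels]."""
    h = round(height / factor) * factor
    w = round(width / factor) * factor
    if h * w > max_pixels:
        # Too many pixels: scale down until the budget is respected.
        scale = math.sqrt(height * width / max_pixels)
        h = math.floor(height / scale / factor) * factor
        w = math.floor(width / scale / factor) * factor
    elif h * w < min_pixels:
        # Too few pixels: scale up to reach the minimum budget.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / factor) * factor
        w = math.ceil(width * scale / factor) * factor
    return h, w
```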
```python
def sample_frames(
    self,
    video: "torch.Tensor",
    frame_factor: int,
    min_frames: int,
    max_frames: int,
    metadata: Optional[Union[VideoMetadata, dict]] = None,
    num_frames: Optional[int] = None,
    fps: Optional[int] = None,
):
    """
```
Copied from the Qwen2-VL repo; same as above, we couldn't make it work with image processors at the time of the first release.
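For context, a rough sketch of what such a sampling helper does (an illustrative re-implementation under assumed metadata keys, not the exact code from this diff):

```python
from typing import Optional

import torch


def sample_frames(
    video: torch.Tensor,              # (num_frames, channels, height, width)
    frame_factor: int,                # e.g. the temporal patch size
    min_frames: int,
    max_frames: int,
    metadata: Optional[dict] = None,  # assumed to carry the video's own fps
    num_frames: Optional[int] = None,
    fps: Optional[int] = None,
) -> torch.Tensor:
    total_frames = video.shape[0]

    if num_frames is None:
        if fps is None:
            raise ValueError("Provide either `num_frames` or `fps` to sample frames.")
        if metadata is None:
            raise ValueError("Sampling by `fps` requires the video metadata (its original fps).")
        # Keep roughly `fps` frames per second of the original video.
        num_frames = total_frames / metadata["fps"] * fps

    # Clamp to the supported range and round to a multiple of `frame_factor`.
    num_frames = max(min(num_frames, max_frames, total_frames), min_frames)
    num_frames = int(round(num_frames / frame_factor)) * frame_factor

    # Pick evenly spaced frame indices across the whole clip.
    indices = torch.linspace(0, total_frames - 1, num_frames).round().long()
    return video[indices]
```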
if "video_sampling" in kwargs: | ||
self.num_frames = kwargs["video_sampling"]["max_frames"] | ||
self.fps = kwargs["video_sampling"]["fps"] | ||
self.size = get_size_dict(kwargs["video_sampling"]["video_size"], default_to_square=self.default_to_square) |
It would be nice to update the official configs though and use an actual `video_preprocessor_config.json`. I will handle updating the hub repos later and open PRs for all VLMs.
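For illustration, a hypothetical legacy config entry that this backward-compatibility branch would pick up (the values below are made up, not taken from any hub repo):

```python
# Made-up legacy kwargs, shaped like an old SmolVLM-style `preprocessor_config.json`.
legacy_kwargs = {
    "video_sampling": {
        "max_frames": 64,
        "fps": 1,
        "video_size": {"longest_edge": 384},
    }
}

# After the branch above runs, the new-style attributes would hold:
#   self.num_frames = 64
#   self.fps = 1
#   self.size = get_size_dict({"longest_edge": 384}, default_to_square=self.default_to_square)
```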
""" | ||
return conversation | ||
|
||
@deprecate_kwarg("video_fps", version="4.58", new_name="fps") |
Somehow we ended up using both args. Deprecating `video_fps` in favor of `fps`, because `fps` has been a valid kwarg in Qwen2 since way before we added chat templates.
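For readers unfamiliar with the pattern, a minimal sketch of what such a kwarg-deprecation decorator does (a simplified stand-in, not the actual `deprecate_kwarg` implementation in transformers):

```python
import functools
import warnings


def deprecate_kwarg(old_name, version, new_name=None):
    """Warn when `old_name` is passed and forward its value under `new_name`."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if old_name in kwargs:
                warnings.warn(
                    f"`{old_name}` is deprecated and will be removed in v{version}; "
                    f"use `{new_name or old_name}` instead.",
                    FutureWarning,
                )
                value = kwargs.pop(old_name)
                if new_name is not None:
                    # Only forward if the caller didn't also pass the new name.
                    kwargs.setdefault(new_name, value)
            return func(*args, **kwargs)

        return wrapper

    return decorator
```

With a decorator like this applied, a call such as `apply_chat_template(..., video_fps=2)` keeps working during the deprecation window and is forwarded as `fps=2`.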
```python
@require_av
@require_torch
def test_apply_chat_template_video_special_processing(self):
    """
    Tests that models can use their own preprocessing to preprocess conversations.
    """
```
No need for this anymore after this cleanup.
Ready for review!
Hey! This looks very nice! I'm just mostly not a fan of the `do_sample_frames` arg everywhere; IMO it should be directly inferred from the presence of the other related args, even though we have similar `do_xxx` args for other transforms. Any particular reason to have it except the fact that it's there for other transforms?
```python
do_sample_frames (`bool`, *optional*, defaults to `self.do_sample_frames`):
    Whether to sample frames from the video before processing or to process the whole video.
```
Do we actually need this `do_sample_frames` arg everywhere? We can infer it from the presence of either `num_frames` or `fps` directly, no? (Or `metadata` as well sometimes?) I know we already have similar args such as `do_resize` etc., but they are all very weird to me.

If a user explicitly passes `fps`, they should not need to add `do_sample_frames=True` as well, WDYT?
Yeah, some models (e.g. Qwen2-VL) have a default `fps` saved in their config. I believe all models would be interested in saving a default value, so videos get sampled the same way the model was tuned.

Adding a flag allows users to turn off the model-enforced sampling if for any reason they want the whole video, or if they have pre-sampled frames and just want to process them.
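For illustration, the call patterns being discussed might look like this (a hypothetical sketch; the checkpoint name and exact kwarg spellings are assumptions for the example):

```python
from transformers import AutoVideoProcessor

video_processor = AutoVideoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# 1. Default: sample with the fps saved in the model's config.
out = video_processor(videos=video, video_metadata=metadata, return_tensors="pt")

# 2. Override the sampling rate for this call only.
out = video_processor(videos=video, video_metadata=metadata, fps=1, return_tensors="pt")

# 3. Frames were already sampled upstream: turn config-driven sampling off.
out = video_processor(videos=presampled_frames, do_sample_frames=False, return_tensors="pt")
```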
I see, in such a case having the user explicitly pass `fps=None` instead of `do_sample=False` would work as well, no? But maybe a little less intuitive, WDYT?
Yeah, I feel like this is less intuitive given that other transforms happen under a `do_xxx` flag. That could also be kind of breaking for users, since we have a default `fps=2.0` in the current version.

IIUC, you don't like adding more `do_xxx` flags. Hmm, let me think.
I don't think there's a way to do it without breaking BC. I realized SmolVLM also has defaults for `fps`/`num_frames`, and personally I find checking it as follows sloppy:

```python
if num_frames is not None and fps is not None and video_metadata is not None:
    self.sample_frames(video)
```
Alright, LGTM! Let's merge it 🤗
[video processors] support frame sampling within processors (huggingface#38105)

* apply updates smolVLM (still needs workaround for chat template)
* add other models
* dump qwen omni for now, come back later
* port qwen omni from their impl
* wait, all qwens sample videos in same way!
* clean up
* make smolvlm backwards compatible and fix padding
* dix some tests
* fox smolvlm tests
* more clean up and test fixing
* delete unused arg
* fix
* address comments
* style
* fix test
Follow-up (huggingface#39134):

* [video processors] Support float fps for precise frame sampling

  Enable fractional fps values (e.g., 1.5, 29.97) in video processors for more precise frame sampling control.
  - Change fps type from int to float across all video processors
  - Maintain backward compatibility with integer values

  Extends: #38105

* [video processors] Refine fps typing to Union[int, float]

  Change fps type from Optional[float] to Optional[Union[int, float]] for more explicit type information about supporting both integer and floating-point frame rates.
  - Update type hints and docstrings across 8 files
  - Maintain backward compatibility
  - Clarify support for both int and float values

  Extends: #38105

* Revert "[video processors] Support float fps for precise frame sampling"

  This reverts commit 7360d6e.
What does this PR do?
Now that we have video processors separate from image processors, the next step is to keep all video-specific processing there. This PR refactors how we sample video frames.

- Before: sampling happens in `apply_chat_template` and requires one to pass a callable `sampling_fn` for model-specific cases. It cannot handle non-common kwargs.
- Now: sampling happens in `self.video_processor`. Each model can define its own logic and use all kwargs defined in the video processing config.

Note: For SmolVLM this is quite difficult to implement without breaking, because the model never had a `video_token` and treated videos as a sequence of images. To keep BC we would have to update the chat template for all models on the hub, which is not possible. So an ugly workaround is to keep the default chat template in code.
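As a rough sketch of the new flow described above (the checkpoint name, message format, and exact kwargs here are illustrative assumptions, not taken from the diff), frame-sampling kwargs now flow through the processor into the video processor instead of going through a custom `sampling_fn` passed to `apply_chat_template`:

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "my_video.mp4"},
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Sampling kwargs such as `fps` / `num_frames` / `do_sample_frames` are now
# forwarded to the model's video processor, which samples frames itself.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    fps=2,
)
```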