[video processors] support frame sampling within processors by zucchini-nlp · Pull Request #38105 · huggingface/transformers · GitHub

[video processors] support frame sampling within processors #38105


Merged

Conversation

zucchini-nlp
Member
@zucchini-nlp zucchini-nlp commented May 13, 2025

What does this PR do?

Now that we have video processors separate from image processors, the next step is to keep all video-specific processing there. This PR refactors how we sample video frames.

Before:

  • Frames can be sampled only with apply_chat_template, which requires passing a callable sampling_fn for model-specific cases and cannot handle non-common kwargs.
  • Users have to sample frames themselves if they prefer not to use chat templates.

Now:

  • Video sampling is now the first step in self.video_processor. Each model can define its own logic and use all kwargs defined in the video processing config.
  • Users can pass a whole video and expect the processor to sample it in the way the model expects.
  • The chat template code is cleaned up: it now only loads the whole video/image/audio and formats the text. Everything else is done by the respective processors.
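The new flow can be sketched with a toy processor where sampling is the first preprocessing step, driven by defaults saved in the model config (hypothetical names throughout; this is not the actual transformers API):

```python
class ToyVideoProcessor:
    """Toy sketch of the new flow: frame sampling happens first, inside
    the processor, using defaults from the model's config. Hypothetical
    class, illustrating the control flow only."""

    def __init__(self, num_frames=8, do_sample_frames=True):
        self.num_frames = num_frames
        self.do_sample_frames = do_sample_frames

    def sample_frames(self, video, num_frames):
        # Evenly spaced frame selection over the whole clip.
        step = max(len(video) // num_frames, 1)
        return video[::step][:num_frames]

    def __call__(self, video, do_sample_frames=None, num_frames=None):
        # Call-time kwargs override the saved defaults.
        if do_sample_frames is None:
            do_sample_frames = self.do_sample_frames
        num_frames = num_frames or self.num_frames
        if do_sample_frames:
            video = self.sample_frames(video, num_frames)
        # ...resize / rescale / normalize would follow here...
        return video
```

A user can pass a whole video and get model-appropriate sampling by default, or pass do_sample_frames=False to keep pre-sampled frames untouched.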

Note: For SmolVLM this is quite difficult to implement without breaking, because the model never had a video_token and treated videos as a sequence of images. To keep BC we would have to update the chat template for all models on the Hub, which is not possible. So an ugly workaround is to keep the default chat template in code.

@github-actions github-actions bot marked this pull request as draft May 13, 2025 14:10
Contributor

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the Ready for review button (at the bottom of the PR page). This will assign reviewers and trigger CI.

@zucchini-nlp zucchini-nlp marked this pull request as ready for review May 13, 2025 20:31
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@@ -916,7 +916,6 @@ class Qwen2_5_VLProcessorKwargs(ProcessingKwargs, total=False):
"text_kwargs": {
"padding": False,
},
"videos_kwargs": {"fps": 2.0},
Member Author

Otherwise we are always sampling, even when asked not to sample. The fps should match the actual fps of the video, and will be returned from the video processor.

Comment on lines +107 to +108
min_pixels = 128 * 28 * 28
max_pixels = 28 * 28 * 768
Member Author

Video defaults differ from image defaults in the original implementation; we just never accounted for that before. Now that we can, it is better to use the correct defaults.

Comment on lines +122 to +132
def sample_frames(
self,
video: "torch.Tensor",
frame_factor: int,
min_frames: int,
max_frames: int,
metadata: Optional[Union[VideoMetadata, dict]] = None,
num_frames: Optional[int] = None,
fps: Optional[int] = None,
):
"""
Member Author

Copied from the Qwen2-VL repo; same as above, we couldn't make this work with image processors at the time of the first release.
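The sampling logic behind a signature like the one above can be sketched roughly as follows (a hypothetical, illustrative body, not the actual Qwen2-VL implementation; only the argument names mirror the signature):

```python
import math

def sample_frame_indices(total_frames, video_fps, frame_factor=2,
                         min_frames=4, max_frames=16,
                         num_frames=None, fps=None):
    """Pick which frame indices to keep. Illustrative sketch of
    fps/num_frames-driven sampling; not the real implementation."""
    if num_frames is None and fps is not None:
        # Keep `fps` frames per second of source video, clamped to range.
        num_frames = int(total_frames / video_fps * fps)
        num_frames = min(max(num_frames, min_frames), max_frames)
    if num_frames is None:
        num_frames = max_frames
    # Round up to a multiple of frame_factor (e.g. the temporal patch size).
    num_frames = math.ceil(num_frames / frame_factor) * frame_factor
    num_frames = min(num_frames, total_frames)
    # Evenly spaced indices across the whole clip.
    step = total_frames / num_frames
    return [min(int(i * step), total_frames - 1) for i in range(num_frames)]
```

Metadata (duration, source fps) is what makes the fps-based path possible, which is why the signature accepts it alongside num_frames and fps.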

Comment on lines +150 to +153
if "video_sampling" in kwargs:
self.num_frames = kwargs["video_sampling"]["max_frames"]
self.fps = kwargs["video_sampling"]["fps"]
self.size = get_size_dict(kwargs["video_sampling"]["video_size"], default_to_square=self.default_to_square)
Member Author

It would be nice to update the official configs though and use an actual video_preprocessor_config.json. I will handle updating the Hub repos later and open PRs for all VLMs.
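The legacy-config handling above amounts to flattening a nested video_sampling sub-dict into top-level keys. A minimal sketch (hypothetical helper name; the real code also normalizes the size via get_size_dict, omitted here):

```python
def flatten_legacy_video_config(config: dict) -> dict:
    """Hypothetical sketch: map a legacy `video_sampling` sub-dict
    (as found in older preprocessor_config.json files) onto the flat
    keys a video processor expects."""
    config = dict(config)  # don't mutate the caller's dict
    if "video_sampling" in config:
        video_sampling = config.pop("video_sampling")
        config["num_frames"] = video_sampling["max_frames"]
        config["fps"] = video_sampling["fps"]
        # Real code would pass this through get_size_dict(...).
        config["size"] = video_sampling["video_size"]
    return config
```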

"""
return conversation

@deprecate_kwarg("video_fps", version="4.58", new_name="fps")
Member Author

Somehow we ended up using both args. Deprecating video_fps in favor of fps, because fps has been a valid kwarg in Qwen2 since well before we added chat templates.
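The decorator pattern used above can be sketched as follows (a minimal illustrative version; the real deprecate_kwarg helper lives in transformers and is more featureful):

```python
import functools
import warnings

def deprecate_kwarg(old_name, version, new_name):
    """Minimal sketch of a kwarg-deprecation decorator: forwards the
    old kwarg to the new name and warns. Illustrative only."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if old_name in kwargs:
                warnings.warn(
                    f"`{old_name}` is deprecated and will be removed in "
                    f"v{version}; use `{new_name}` instead.",
                    FutureWarning,
                )
                # Only forward if the new name wasn't also passed.
                kwargs.setdefault(new_name, kwargs.pop(old_name))
                kwargs.pop(old_name, None)
            return func(*args, **kwargs)
        return wrapper
    return decorator
```

With this in place, callers using the old video_fps kwarg keep working through the deprecation window while seeing a FutureWarning.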

Comment on lines -1027 to -1032
@require_av
@require_torch
def test_apply_chat_template_video_special_processing(self):
"""
Tests that models can use their own preprocessing to preprocess conversations.
"""
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for this anymore after this clean-up.

@zucchini-nlp
Member Author

Ready for review!

@zucchini-nlp zucchini-nlp requested a review from qubvel May 14, 2025 16:32
@zucchini-nlp zucchini-nlp changed the title Video sampling from processor [video processors] support frame sampling within processors May 14, 2025
@zucchini-nlp
Member Author

Since @qubvel will be off for a while, cc @molbap as well for initial review

Member
@Cyrilvallez Cyrilvallez left a comment

Hey! This looks very nice! I'm just mostly not a fan of the do_sample_frames arg everywhere; IMO it should be inferred directly from the presence of the other related args, even though we have similar do_xxx args for other transforms. Any particular reason to have it, other than the fact that it's there for other transforms?

Comment on lines +126 to +127
do_sample_frames (`bool`, *optional*, defaults to `self.do_sample_frames`):
Whether to sample frames from the video before processing or to process the whole video.
Member

Do we actually need this do_sample_frames arg everywhere? We can infer it from the presence of either num_frames or fps directly, no? (Or metadata as well, sometimes?) I know we already have similar args such as do_resize etc., but they all seem very weird to me.

If a user explicitly passes fps, they should not need to add do_sample_frames=True as well WDYT?

Member Author

Yeah, some models (e.g. Qwen2-VL) have a default fps saved in their config. I believe all models would be interested in saving a default value, so videos get sampled the same way the model was tuned.

Adding a flag allows users to turn off the model-enforced sampling if, for any reason, they want the whole video, or if they have pre-sampled frames and just want to process them.

Member

I see; in such a case, having the user explicitly pass fps=None instead of do_sample=False would work as well, no? But maybe a little less intuitive, wdyt?

Member Author

Yeah, I feel this is less intuitive, given that other transforms happen under a do_xxx flag. It could also be somewhat breaking for users, since we have a default of fps=2.0 in the current version.

IIUC, you don't like adding more do_xxx flags. Hmm, let me think.

Member Author

I don't think there's a way to do this without breaking BC. I realized SmolVLM also has defaults for fps/num_frames, and personally I find checking as follows sloppy:

if num_frames is not None and fps is not None and video_metadata is not None:
    self.sample_frames(video)
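For contrast, the explicit-flag design being defended in this thread can be sketched like this (hypothetical helper; one boolean gate instead of inferring intent from which sampling kwargs happen to be None):

```python
def maybe_sample(frames, do_sample_frames=True, num_frames=None):
    """Illustrative sketch: a single explicit flag gates sampling, so
    config-saved defaults (e.g. fps=2.0) can be bypassed without
    forcing users to pass fps=None. Hypothetical helper, not the
    actual transformers code."""
    if not do_sample_frames:
        # User wants the whole video, or already has pre-sampled frames.
        return frames
    num_frames = num_frames or 4
    step = max(len(frames) // num_frames, 1)
    return frames[::step][:num_frames]
```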

Member
@Cyrilvallez Cyrilvallez left a comment

Alright, LGTM! Let's merge it 🤗

@zucchini-nlp zucchini-nlp enabled auto-merge (squash) June 12, 2025 09:23
@zucchini-nlp zucchini-nlp merged commit 2745902 into huggingface:main Jun 12, 2025
20 checks passed
lmarshall12 pushed a commit to lmarshall12/transformers that referenced this pull request Jun 12, 2025
…ace#38105)

* apply updates smolVLM (still needs workaround for chat template)

* add other models

* dump qwen omni for now, come back later

* port qwen omni from their impl

* wait, all qwens sample videos in same way!

* clean up

* make smolvlm backwards compatible and fix padding

* fix some tests

* fix smolvlm tests

* more clean up and test fixing

* delete unused arg

* fix

* address comments

* style

* fix test
zrohyun added a commit to zrohyun/transformers that referenced this pull request Jun 30, 2025
Enable fractional fps values (e.g., 1.5, 29.97) in video processors
for more precise frame sampling control.

- Change fps type from int to float across all video processors
- Maintain backward compatibility with integer values

Extends: huggingface#38105
zrohyun added a commit to zrohyun/transformers that referenced this pull request Jul 1, 2025
Change fps type from Optional[float] to Optional[Union[int, float]]
for more explicit type information about supporting both integer
and floating-point frame rates.

- Update type hints and docstrings across 8 files
- Maintain backward compatibility
- Clarify support for both int and float values

Extends: huggingface#38105
zucchini-nlp pushed a commit that referenced this pull request Jul 7, 2025
* [video processors] Support float fps for precise frame sampling

Enable fractional fps values (e.g., 1.5, 29.97) in video processors
for more precise frame sampling control.

- Change fps type from int to float across all video processors
- Maintain backward compatibility with integer values

Extends: #38105

* [video processors] Refine fps typing to Union[int, float]

Change fps type from Optional[float] to Optional[Union[int, float]]
for more explicit type information about supporting both integer
and floating-point frame rates.

- Update type hints and docstrings across 8 files
- Maintain backward compatibility
- Clarify support for both int and float values

Extends: #38105

* Revert "[video processors] Support float fps for precise frame sampling"

This reverts commit 7360d6e.
rjgleaton pushed a commit to rjgleaton/transformers that referenced this pull request Jul 17, 2025
…ingface#39134)

* [video processors] Support float fps for precise frame sampling

Enable fractional fps values (e.g., 1.5, 29.97) in video processors
for more precise frame sampling control.

- Change fps type from int to float across all video processors
- Maintain backward compatibility with integer values

Extends: huggingface#38105

* [video processors] Refine fps typing to Union[int, float]

Change fps type from Optional[float] to Optional[Union[int, float]]
for more explicit type information about supporting both integer
and floating-point frame rates.

- Update type hints and docstrings across 8 files
- Maintain backward compatibility
- Clarify support for both int and float values

Extends: huggingface#38105

* Revert "[video processors] Support float fps for precise frame sampling"

This reverts commit 7360d6e.