[video processors] support frame sampling within processors #38105
Conversation
Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the `Ready for review` button.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
```diff
@@ -916,7 +916,6 @@ class Qwen2_5_VLProcessorKwargs(ProcessingKwargs, total=False):
         "text_kwargs": {
             "padding": False,
         },
-        "videos_kwargs": {"fps": 2.0},
```
Otherwise we're always sampling, even when asked not to sample. The `fps` should match the actual fps of the video, and will be returned from the video processor.
```python
min_pixels = 128 * 28 * 28
max_pixels = 28 * 28 * 768
```
Video defaults are different from image defaults in their implementation; we just never cared before. Now that we can, it's better to use the correct defaults.
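For reference, `min_pixels`/`max_pixels` roughly bound the number of 28×28 patches each frame is resized to. A simplified sketch of how such a pixel budget is typically turned into a target size (assumed logic, loosely adapted from Qwen-style smart resizing; not the exact code in this PR):

```python
import math


def fit_to_pixel_budget(height, width, factor=28, min_pixels=128 * 28 * 28, max_pixels=28 * 28 * 768):
    """Pick a (height, width) that is a multiple of `factor`, keeps the aspect
    ratio, and whose area falls inside [min_pixels, max_pixels]."""
    h = round(height / factor) * factor
    w = round(width / factor) * factor
    if h * w > max_pixels:
        # Too many pixels: scale down until the budget is respected.
        scale = math.sqrt(height * width / max_pixels)
        h = math.floor(height / scale / factor) * factor
        w = math.floor(width / scale / factor) * factor
    elif h * w < min_pixels:
        # Too few pixels: scale up to reach the minimum budget.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / factor) * factor
        w = math.ceil(width * scale / factor) * factor
    return h, w
```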
```python
def sample_frames(
    self,
    video: "torch.Tensor",
    frame_factor: int,
    min_frames: int,
    max_frames: int,
    metadata: Optional[Union[VideoMetadata, dict]] = None,
    num_frames: Optional[int] = None,
    fps: Optional[int] = None,
):
    """
```
Copied from the Qwen2-VL repo; same as above, we couldn't make it work with image processors at the time of the first release.
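For context, a rough sketch of what such a sampling helper does (an illustrative re-implementation under assumed metadata keys, not the exact code from this diff):

```python
from typing import Optional

import torch


def sample_frames(
    video: torch.Tensor,              # (num_frames, channels, height, width)
    frame_factor: int,                # e.g. the temporal patch size
    min_frames: int,
    max_frames: int,
    metadata: Optional[dict] = None,  # assumed to carry the video's own fps
    num_frames: Optional[int] = None,
    fps: Optional[int] = None,
) -> torch.Tensor:
    total_frames = video.shape[0]

    if num_frames is None:
        if fps is None:
            raise ValueError("Provide either `num_frames` or `fps` to sample frames.")
        if metadata is None:
            raise ValueError("Sampling by `fps` requires the video metadata (its original fps).")
        # Keep roughly `fps` frames per second of the original video.
        num_frames = total_frames / metadata["fps"] * fps

    # Clamp to the supported range and round to a multiple of `frame_factor`.
    num_frames = max(min(num_frames, max_frames, total_frames), min_frames)
    num_frames = int(round(num_frames / frame_factor)) * frame_factor

    # Pick evenly spaced frame indices across the whole clip.
    indices = torch.linspace(0, total_frames - 1, num_frames).round().long()
    return video[indices]
```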
if "video_sampling" in kwargs: | ||
self.num_frames = kwargs["video_sampling"]["max_frames"] | ||
self.fps = kwargs["video_sampling"]["fps"] | ||
self.size = get_size_dict(kwargs["video_sampling"]["video_size"], default_to_square=self.default_to_square) |
It would be nice to update the official configs though and use an actual `video_preprocessor_config.json`. I will handle updating the hub repos later and open PRs for all VLMs.
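For illustration, a hypothetical legacy config entry that this backward-compatibility branch would pick up (the values below are made up, not taken from any hub repo):

```python
# Made-up legacy kwargs, shaped like an old SmolVLM-style `preprocessor_config.json`.
legacy_kwargs = {
    "video_sampling": {
        "max_frames": 64,
        "fps": 1,
        "video_size": {"longest_edge": 384},
    }
}

# After the branch above runs, the new-style attributes would hold:
#   self.num_frames = 64
#   self.fps = 1
#   self.size = get_size_dict({"longest_edge": 384}, default_to_square=self.default_to_square)
```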
""" | ||
return conversation | ||
|
||
@deprecate_kwarg("video_fps", version="4.58", new_name="fps") |
Somehow we ended up using both args. Deprecating `video_fps` in favor of `fps`, because `fps` has been a valid kwarg in Qwen2 since way before we added chat templates.
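For readers unfamiliar with the pattern, a minimal sketch of what such a kwarg-deprecation decorator does (a simplified stand-in, not the actual `deprecate_kwarg` implementation in transformers):

```python
import functools
import warnings


def deprecate_kwarg(old_name, version, new_name=None):
    """Warn when `old_name` is passed and forward its value under `new_name`."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if old_name in kwargs:
                warnings.warn(
                    f"`{old_name}` is deprecated and will be removed in v{version}; "
                    f"use `{new_name or old_name}` instead.",
                    FutureWarning,
                )
                value = kwargs.pop(old_name)
                if new_name is not None:
                    # Only forward if the caller didn't also pass the new name.
                    kwargs.setdefault(new_name, value)
            return func(*args, **kwargs)

        return wrapper

    return decorator
```

With a decorator like this applied, a call such as `apply_chat_template(..., video_fps=2)` keeps working during the deprecation window and is forwarded as `fps=2`.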
```python
@require_av
@require_torch
def test_apply_chat_template_video_special_processing(self):
    """
    Tests that models can use their own preprocessing to preprocess conversations.
    """
```
No need for this anymore after this cleanup.
Ready for review!
Hey! This looks very nice! I'm just mostly not a fan of the `do_sample_frames` arg everywhere; IMO it should be directly inferred from the presence of the other related args, even though we have similar `do_xxx` args for other transforms. Any particular reason to have it except the fact that it's there for other transforms?
```python
do_sample_frames (`bool`, *optional*, defaults to `self.do_sample_frames`):
    Whether to sample frames from the video before processing or to process the whole video.
```
Do we actually need this `do_sample_frames` arg everywhere? We can infer it from the presence of either `num_frames` or `fps` directly, no? (Or `metadata` as well sometimes?) I know we already have similar args such as `do_resize` etc., but they are all very weird to me.

If a user explicitly passes `fps`, they should not need to add `do_sample_frames=True` as well, WDYT?
Yeah, some models (e.g. Qwen2-VL) have a default `fps` saved in their config. I believe all models would be interested in saving a default value, so videos get sampled the same way the model was tuned.

Adding a flag allows users to turn off the model-enforced sampling if for any reason they want the whole video, or if they have pre-sampled frames and just want to process them.
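For illustration, the call patterns being discussed might look like this (a hypothetical sketch; the checkpoint name and exact kwarg spellings are assumptions for the example):

```python
from transformers import AutoVideoProcessor

video_processor = AutoVideoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# 1. Default: sample with the fps saved in the model's config.
out = video_processor(videos=video, video_metadata=metadata, return_tensors="pt")

# 2. Override the sampling rate for this call only.
out = video_processor(videos=video, video_metadata=metadata, fps=1, return_tensors="pt")

# 3. Frames were already sampled upstream: turn config-driven sampling off.
out = video_processor(videos=presampled_frames, do_sample_frames=False, return_tensors="pt")
```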
I see, in such a case having the user explicitly pass `fps=None` instead of `do_sample=False` would work as well, no? But maybe a little less intuitive, WDYT?
Yeah, I feel like this is less intuitive given that other transforms happen under a `do_xxx` flag. That could also be kind of breaking for users, since we have a default `fps=2.0` in the current version.

IIUC, you don't like adding more `do_xxx` flags. Hmm, let me think.
I don't think there's a way to do it without breaking BC. I realized SmolVLM also has defaults for `fps`/`num_frames`, and personally I find checking it as follows sloppy:

```python
if num_frames is not None and fps is not None and video_metadata is not None:
    self.sample_frames(video)
```
Alright, LGTM! Let's merge it 🤗
[video processors] support frame sampling within processors (huggingface#38105)

* apply updates smolVLM (still needs workaround for chat template)
* add other models
* dump qwen omni for now, come back later
* port qwen omni from their impl
* wait, all qwens sample videos in same way!
* clean up
* make smolvlm backwards compatible and fix padding
* dix some tests
* fox smolvlm tests
* more clean up and test fixing
* delete unused arg
* fix
* address comments
* style
* fix test
Follow-up (huggingface#39134):

* [video processors] Support float fps for precise frame sampling

  Enable fractional fps values (e.g., 1.5, 29.97) in video processors for more precise frame sampling control.
  - Change fps type from int to float across all video processors
  - Maintain backward compatibility with integer values

  Extends: #38105

* [video processors] Refine fps typing to Union[int, float]

  Change fps type from Optional[float] to Optional[Union[int, float]] for more explicit type information about supporting both integer and floating-point frame rates.
  - Update type hints and docstrings across 8 files
  - Maintain backward compatibility
  - Clarify support for both int and float values

  Extends: #38105

* Revert "[video processors] Support float fps for precise frame sampling"

  This reverts commit 7360d6e.
What does this PR do?
Now that we have video processors separate from image processors, the next step is to keep all video-specific processing there. This PR refactors how we sample video frames.

- Before: sampling happens in `apply_chat_template` and requires one to pass a callable `sampling_fn` for model-specific cases. It cannot handle non-common kwargs.
- Now: sampling happens in `self.video_processor`. Each model can define its own logic and use all kwargs defined in the video processing config.

Note: For SmolVLM this is quite difficult to implement without breaking, because the model never had a `video_token` and treated videos as a sequence of images. To keep BC we would have to update the chat template for all models on the hub, which is not possible. So an ugly workaround is to keep the default chat template in code.
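As a rough sketch of the new flow described above (the checkpoint name, message format, and exact kwargs here are illustrative assumptions, not taken from the diff), frame-sampling kwargs now flow through the processor into the video processor instead of going through a custom `sampling_fn` passed to `apply_chat_template`:

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "my_video.mp4"},
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Sampling kwargs such as `fps` / `num_frames` / `do_sample_frames` are now
# forwarded to the model's video processor, which samples frames itself.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    fps=2,
)
```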