DOI: 10.1145/3512527.3531371

Video2Subtitle: Matching Weakly-Synchronized Sequences via Dynamic Temporal Alignment

Published: 27 June 2022

Abstract

This paper investigates a new research task in multimedia analysis, dubbed Video2Subtitle. The goal of this task is to find the most plausible subtitle from a large pool for a query video clip, assuming that the temporal duration of each sentence in a subtitle is unknown. Compared with existing cross-modal matching tasks, Video2Subtitle poses several new challenges. In particular, the video frames and the subtitle sentences are each temporally ordered, yet no precise synchronization between the two sequences is available. This casts Video2Subtitle as a problem of matching weakly-synchronized sequences. Our technical contributions are two-fold. First, we construct a large-scale benchmark for the Video2Subtitle task. It consists of about 100K video clip / subtitle pairs with a total duration of 759 hours, all automatically trimmed from the conversational parts of movies and YouTube videos. Second, an ideal algorithm for Video2Subtitle requires not only temporal synchronization of the visual and textual sequences but also strong semantic consistency between the two modalities. To this end, we propose a novel algorithm whose key traits are heterogeneous multi-cue fusion and dynamic temporal alignment. The proposed method demonstrates excellent performance in comparison with several state-of-the-art cross-modal matching methods. Additionally, we illustrate a few interesting applications of Video2Subtitle, such as re-generating subtitles for given videos.
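To make the alignment idea concrete, the sketch below shows a minimal dynamic-time-warping (DTW) style recursion for scoring how well an ordered sequence of frame embeddings matches an ordered sequence of sentence embeddings when no per-sentence timestamps are given. This is an illustrative assumption, not the paper's actual model: the embeddings (video_feats, text_feats) and the cosine-distance cost are stand-ins, and the heterogeneous multi-cue fusion step is omitted.

    import numpy as np

    def dtw_alignment_cost(video_feats, text_feats):
        """Order-preserving alignment cost between two embedding sequences.
        video_feats: (T, D) array, one row per frame (or frame chunk).
        text_feats:  (S, D) array, one row per subtitle sentence.
        """
        # Pairwise cosine distance between every frame and every sentence.
        v = video_feats / np.linalg.norm(video_feats, axis=1, keepdims=True)
        t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
        cost = 1.0 - v @ t.T                    # shape (T, S)

        T, S = cost.shape
        acc = np.full((T + 1, S + 1), np.inf)   # accumulated-cost table
        acc[0, 0] = 0.0
        for i in range(1, T + 1):
            for j in range(1, S + 1):
                # Monotonic DTW transitions: diagonal match, stretch the
                # current sentence over more frames, or stack several
                # sentences onto the current frame.
                acc[i, j] = cost[i - 1, j - 1] + min(
                    acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]
                )
        return acc[T, S]

    # Retrieval would rank every candidate subtitle by this cost (lower is
    # better), e.g.:
    #   best = min(pool, key=lambda s: dtw_alignment_cost(clip_emb, embed(s)))

Ranking candidates this way respects the temporal ordering of both sequences, which is exactly what a plain pooled-embedding similarity ignores.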

Supplementary Material

MP4 File (ICMR22-fp101.mp4)
Presentation video. It introduces the Video2Subtitle task, covering the task definition, the dataset collection procedure, and the temporal multi-cue fusion retrieval method. It also shows some interesting examples of our subtitle re-generation application.



Published In

ICMR '22: Proceedings of the 2022 International Conference on Multimedia Retrieval
June 2022, 714 pages
ISBN: 9781450392389
DOI: 10.1145/3512527

Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. cross-modal matching
  2. deep neural networks
  3. temporal alignment

Qualifiers

  • Research-article

Funding Sources

  • Beijing Natural Science Foundation
  • Science and Technology Innovation 2030 - New Generation Artificial Intelligence of China

Conference

ICMR '22

Acceptance Rates

Overall Acceptance Rate 254 of 830 submissions, 31%
