Video2Subtitle: Matching Weakly-Synchronized Sequences via Dynamic Temporal Alignment

Published: 27 June 2022


This paper investigates a new research task in multimedia analysis, dubbed Video2Subtitle. The goal of this task is to find the most plausible subtitle from a large pool for a query video clip. We assume that the temporal duration of each sentence in a subtitle is unknown. Compared with existing cross-modal matching tasks, Video2Subtitle poses several new challenges. In particular, video frames and subtitle sentences are each temporally ordered, yet no precise synchronization between them is available. This casts Video2Subtitle as a problem of matching weakly-synchronized sequences. Our technical contributions are two-fold. First, we construct a large-scale benchmark for the Video2Subtitle task. It consists of about 100K video clip / subtitle pairs with a total duration of 759 hours. All data are automatically trimmed from conversational sub-parts of movies and YouTube videos. Second, an ideal algorithm for tackling Video2Subtitle requires both temporal synchronization of the visual and textual sequences and strong semantic consistency between the two modalities. To this end, we propose a novel algorithm with the key traits of heterogeneous multi-cue fusion and dynamic temporal alignment. The proposed method demonstrates excellent performance in comparison with several state-of-the-art cross-modal matching methods. We also depict a few interesting applications of Video2Subtitle, such as re-generating subtitles for given videos.
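To make the idea of matching weakly-synchronized sequences concrete, the following is a minimal sketch of classical dynamic time warping (DTW) used to rank candidate subtitles for a query clip. It is not the algorithm proposed in the paper, whose details are not given in this abstract. The function names (`cosine_distance`, `dtw_cost`, `retrieve`), the use of cosine distance, and the toy 2-D "embeddings" are all assumptions made for illustration. Embeddings are assumed to be non-zero vectors of equal length.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def dtw_cost(frames, sentences):
    """Classical DTW: minimum cumulative cost of an order-preserving alignment."""
    n, m = len(frames), len(sentences)
    inf = float("inf")
    # acc[i][j] = best cost of aligning the first i frames with the first j sentences
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = cosine_distance(frames[i - 1], sentences[j - 1])
            acc[i][j] = d + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]

def retrieve(query_frames, subtitle_pool):
    """Return the index of the pool entry with the lowest DTW cost."""
    costs = [dtw_cost(query_frames, sub) for sub in subtitle_pool]
    return min(range(len(costs)), key=costs.__getitem__)

# Hypothetical toy data: four frame vectors and two candidate subtitles
# that contain the same two "sentences" in different orders.
video = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
pool = [
    [[0.0, 1.0], [1.0, 0.0]],  # same content, reversed order
    [[1.0, 0.0], [0.0, 1.0]],  # same content, matching order
]
best = retrieve(video, pool)
```

In this toy case the second candidate is selected: both candidates contain the same sentences, but only one preserves the temporal order of the frames, and the alignment cost reflects that even though the duration of each sentence is never specified.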

Supplementary Material

MP4 File (ICMR22-fp101.mp4)
Presentation video. It introduces the Video2Subtitle task in terms of task definition, dataset collection procedure, and the temporal-based multi-cue fusion retrieval method. It also shows some interesting examples of our subtitle re-generation application.




Published In

ICMR '22: Proceedings of the 2022 International Conference on Multimedia Retrieval
June 2022
714 pages


Author Tags

  1. cross-modal matching
  2. deep neural networks
  3. temporal alignment


Funding Sources

  • Beijing Natural Science Foundation
  • Science and Technology Innovation 2030 - New Generation Artificial Intelligence of China



