DOI: 10.1145/3512527.3531371

Video2Subtitle: Matching Weakly-Synchronized Sequences via Dynamic Temporal Alignment

Published: 27 June 2022

Abstract

This paper investigates a new research task in multimedia analysis, dubbed Video2Subtitle. The goal of this task is to find the most plausible subtitle from a large pool for a query video clip, assuming that the temporal duration of each sentence in a subtitle is unknown. Compared with existing cross-modal matching tasks, Video2Subtitle poses several new challenges. In particular, the video frames and the subtitle sentences are each temporally ordered, yet no precise synchronization between the two sequences is available. This casts Video2Subtitle as a problem of matching weakly-synchronized sequences. Our technical contributions are two-fold. First, we construct a large-scale benchmark for the Video2Subtitle task. It consists of about 100K video clip / subtitle pairs with a total duration of 759 hours, all automatically trimmed from the conversational parts of movies and YouTube videos. Second, an ideal algorithm for Video2Subtitle requires not only temporal synchronization of the visual and textual sequences but also strong semantic consistency between the two modalities. To this end, we propose a novel algorithm whose key traits are heterogeneous multi-cue fusion and dynamic temporal alignment. The proposed method demonstrates excellent performance in comparison with several state-of-the-art cross-modal matching methods. Additionally, we illustrate a few interesting applications of Video2Subtitle, such as re-generating subtitles for given videos.
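To make the alignment idea concrete, the sketch below shows a minimal dynamic-time-warping (DTW) style recursion for scoring how well an ordered sequence of frame embeddings matches an ordered sequence of sentence embeddings when no per-sentence timestamps are given. This is an illustrative assumption, not the paper's actual model: the embeddings (video_feats, text_feats) and the cosine-distance cost are stand-ins, and the heterogeneous multi-cue fusion step is omitted.

    import numpy as np

    def dtw_alignment_cost(video_feats, text_feats):
        """Order-preserving alignment cost between two embedding sequences.
        video_feats: (T, D) array, one row per frame (or frame chunk).
        text_feats:  (S, D) array, one row per subtitle sentence.
        """
        # Pairwise cosine distance between every frame and every sentence.
        v = video_feats / np.linalg.norm(video_feats, axis=1, keepdims=True)
        t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
        cost = 1.0 - v @ t.T                    # shape (T, S)

        T, S = cost.shape
        acc = np.full((T + 1, S + 1), np.inf)   # accumulated-cost table
        acc[0, 0] = 0.0
        for i in range(1, T + 1):
            for j in range(1, S + 1):
                # Monotonic DTW transitions: diagonal match, stretch the
                # current sentence over more frames, or stack several
                # sentences onto the current frame.
                acc[i, j] = cost[i - 1, j - 1] + min(
                    acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]
                )
        return acc[T, S]

    # Retrieval would rank every candidate subtitle by this cost (lower is
    # better), e.g.:
    #   best = min(pool, key=lambda s: dtw_alignment_cost(clip_emb, embed(s)))

Ranking candidates this way respects the temporal ordering of both sequences, which is exactly what a plain pooled-embedding similarity ignores.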

Supplementary Material

MP4 File (ICMR22-fp101.mp4)
Presentation video. It introduces the Video2Subtitle task, covering the task definition, the dataset collection procedure, and the temporal multi-cue fusion retrieval method. It also shows some interesting examples of our subtitle re-generation application.



Published In

ICMR '22: Proceedings of the 2022 International Conference on Multimedia Retrieval
June 2022, 714 pages
ISBN: 9781450392389
DOI: 10.1145/3512527

Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. cross-modal matching
  2. deep neural networks
  3. temporal alignment

Qualifiers

  • Research-article

Funding Sources

  • Beijing Natural Science Foundation
  • Science and Technology Innovation 2030 - New Generation Artificial Intelligence of China

Conference

ICMR '22

Acceptance Rates

Overall Acceptance Rate 254 of 830 submissions, 31%
