Research article | Open access
DOI: 10.1145/3633637.3633643

FTAN: Exploring Frame-Text Attention for lightweight Video Captioning

Published: 28 February 2024

Abstract

Traditional video captioning approaches employ an LSTM as a lightweight decoder. However, these methods focus on fully extracting visual features while paying less attention to textual information, resulting in captions of relatively low quality. Recent transformer-based methods achieve more accurate results, but at the cost of excessive computing resources. In this paper, we propose a lightweight model for video captioning named Frame-Text Attention Network (FTAN), which aims to make full use of both visual and textual features to obtain more accurate captions. We develop a novel text attention module in FTAN that uses the hidden state of the LSTM as the query to generate attentive text features. The attentive text features are then merged with the visual features, and the merged features serve as input to the LSTM to generate more accurate captions. To the best of our knowledge, we are the first in video captioning to introduce an attention mechanism that extracts the textual information hidden in the LSTM architecture. Extensive experiments demonstrate the effectiveness of FTAN: it outperforms the state-of-the-art LSTM-based method on the MSVD dataset by 0.8 CIDEr-D while using about one-fourth as many parameters as transformer-based methods.
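The abstract describes the decoder only at a high level; the sketch below is one plausible PyTorch rendering of a single decoding step, not the paper's actual implementation. The module names (TextAttention, DecoderStep), the dot-product attention scoring, and the use of concatenation to merge attentive text features with visual features are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextAttention(nn.Module):
    """Hypothetical text attention module: the LSTM hidden state is
    projected and used as the query over the embeddings of previously
    generated words, yielding 'attentive text features'."""

    def __init__(self, hidden_dim: int, embed_dim: int):
        super().__init__()
        self.q_proj = nn.Linear(hidden_dim, embed_dim)

    def forward(self, hidden: torch.Tensor, word_embeds: torch.Tensor) -> torch.Tensor:
        # hidden:      (B, hidden_dim)   current LSTM hidden state
        # word_embeds: (B, T, embed_dim) embeddings of words generated so far
        query = self.q_proj(hidden).unsqueeze(1)                 # (B, 1, embed_dim)
        scores = torch.bmm(query, word_embeds.transpose(1, 2))  # (B, 1, T)
        weights = F.softmax(scores, dim=-1)                     # attention over words
        return torch.bmm(weights, word_embeds).squeeze(1)       # (B, embed_dim)


class DecoderStep(nn.Module):
    """One decoding step: attentive text features are merged with the
    visual features (here by concatenation, an assumption) and fed to
    an LSTM cell that produces the next hidden state."""

    def __init__(self, visual_dim: int, embed_dim: int, hidden_dim: int):
        super().__init__()
        self.text_attn = TextAttention(hidden_dim, embed_dim)
        self.lstm = nn.LSTMCell(visual_dim + embed_dim, hidden_dim)

    def forward(self, visual_feat, word_embeds, state):
        h, c = state
        attentive_text = self.text_attn(h, word_embeds)          # (B, embed_dim)
        merged = torch.cat([visual_feat, attentive_text], dim=-1)
        return self.lstm(merged, (h, c))                         # new (h, c)
```

In the full model, the new hidden state h would typically pass through a vocabulary projection and softmax to predict the next word; that step, and how frame-level visual features are aggregated into visual_feat, follow the paper rather than this sketch.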


Published In

ICCPR '23: Proceedings of the 2023 12th International Conference on Computing and Pattern Recognition
October 2023, 589 pages
ISBN: 9798400707988
DOI: 10.1145/3633637
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. Attention Mechanism
2. Textual Information
3. Video Captioning

Funding Sources

• National Key Research and Development Program of China in the 14th Five-Year Plan

Conference

ICCPR 2023
