Abstract
Image captioning aims to describe the content of an image with a complete and natural sentence. Recently, image captioning methods built on the encoder-decoder architecture have made great progress, with the LSTM becoming the dominant decoder for generating word sequences. However, in the decoding stage the input vector stays the same at every step and is largely uncorrelated with the previously attended visual regions or the previously generated words. In this paper, we propose an attentional LSTM (ALSTM) and show how to integrate it into state-of-the-art automatic image captioning frameworks. Unlike the traditional LSTM used in existing models, ALSTM learns to refine its input vector from the network hidden states and sequential context information. ALSTM can therefore attend to more relevant features, such as spatial regions and visual relations, and concentrate on the most relevant context words. Moreover, ALSTM is employed as the decoder in several classical frameworks, demonstrating how effective visual/context attention can be obtained to update the input vector. Extensive quantitative and qualitative evaluations of the modified networks on the Flickr30K and MSCOCO datasets illustrate the superiority of ALSTM: ALSTM-based methods generate high-quality descriptions by combining sequential context and visual relations.
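The exact ALSTM formulation is not reproduced in this abstract, so the following is only a minimal, hypothetical PyTorch sketch of the general idea it describes: at each decoding step the previous hidden state attends over image region features, and the attended vector refines the cell input alongside the current word embedding. All names (ALSTMSketch, regions, etc.) are illustrative assumptions, not the paper's actual interface.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ALSTMSketch(nn.Module):
    """Hypothetical sketch of an attention-refined LSTM decoding step."""
    def __init__(self, embed_dim, feat_dim, hidden_dim):
        super().__init__()
        self.att_feat = nn.Linear(feat_dim, hidden_dim)    # project region features
        self.att_hid = nn.Linear(hidden_dim, hidden_dim)   # project previous hidden state
        self.att_score = nn.Linear(hidden_dim, 1)          # scalar attention score per region
        self.cell = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)

    def step(self, word_emb, regions, state):
        # word_emb: (batch, embed_dim); regions: (batch, num_regions, feat_dim)
        h, c = state
        # attention over regions, conditioned on the previous hidden state
        scores = self.att_score(torch.tanh(self.att_feat(regions) + self.att_hid(h).unsqueeze(1)))
        alpha = F.softmax(scores, dim=1)            # (batch, num_regions, 1)
        refined = (alpha * regions).sum(dim=1)      # attended visual vector
        # the refined visual vector updates the cell input together with the word embedding
        x = torch.cat([word_emb, refined], dim=-1)
        h, c = self.cell(x, (h, c))
        return h, (h, c), alpha.squeeze(-1)

# usage with random tensors, purely for illustration
m = ALSTMSketch(embed_dim=300, feat_dim=512, hidden_dim=512)
h0, c0 = torch.zeros(4, 512), torch.zeros(4, 512)
out, state, alpha = m.step(torch.randn(4, 300), torch.randn(4, 36, 512), (h0, c0))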




Acknowledgements
This research was supported by the National Science and Technology Major Project (Grant No. 2020YFA0713504), the CERNET Innovation Project (Grant No. NGII20180309), and the Scientific Research Fund of Hunan Provincial Education Department (Grant No. 210153).
Cite this article
Xiao, F., Xue, W., Shen, Y. et al. A New Attention-Based LSTM for Image Captioning. Neural Process Lett 54, 3157–3171 (2022). https://doi.org/10.1007/s11063-022-10759-z