Abstract
Image captioning aims to describe the content of an image with a complete and natural sentence. Recently, image captioning methods built on the encoder-decoder architecture have made great progress, with the LSTM becoming the dominant decoder for generating word sequences. However, in the decoding stage the input vector stays the same at every step and is largely uncorrelated with the previously attended visual regions or the previously generated words. In this paper, we propose an attentional LSTM (ALSTM) and show how to integrate it into state-of-the-art automatic image captioning frameworks. Unlike the traditional LSTM used in existing models, ALSTM learns to refine its input vector from the network hidden states and sequential context information. ALSTM can therefore attend to more relevant features, such as spatial regions and visual relations, and concentrate on the most relevant context words. Moreover, ALSTM is employed as the decoder in several classical frameworks, demonstrating how effective visual/context attention can be obtained to update the input vector. Extensive quantitative and qualitative evaluations of the modified networks on the Flickr30K and MSCOCO datasets illustrate the superiority of ALSTM: ALSTM-based methods generate high-quality descriptions by combining sequential context and visual relations.
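The exact ALSTM formulation is not reproduced in this abstract, so the following is only a minimal, hypothetical PyTorch sketch of the general idea it describes: at each decoding step the previous hidden state attends over image region features, and the attended vector refines the cell input alongside the current word embedding. All names (ALSTMSketch, regions, etc.) are illustrative assumptions, not the paper's actual interface.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ALSTMSketch(nn.Module):
    """Hypothetical sketch of an attention-refined LSTM decoding step."""
    def __init__(self, embed_dim, feat_dim, hidden_dim):
        super().__init__()
        self.att_feat = nn.Linear(feat_dim, hidden_dim)    # project region features
        self.att_hid = nn.Linear(hidden_dim, hidden_dim)   # project previous hidden state
        self.att_score = nn.Linear(hidden_dim, 1)          # scalar attention score per region
        self.cell = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)

    def step(self, word_emb, regions, state):
        # word_emb: (batch, embed_dim); regions: (batch, num_regions, feat_dim)
        h, c = state
        # attention over regions, conditioned on the previous hidden state
        scores = self.att_score(torch.tanh(self.att_feat(regions) + self.att_hid(h).unsqueeze(1)))
        alpha = F.softmax(scores, dim=1)            # (batch, num_regions, 1)
        refined = (alpha * regions).sum(dim=1)      # attended visual vector
        # the refined visual vector updates the cell input together with the word embedding
        x = torch.cat([word_emb, refined], dim=-1)
        h, c = self.cell(x, (h, c))
        return h, (h, c), alpha.squeeze(-1)

# usage with random tensors, purely for illustration
m = ALSTMSketch(embed_dim=300, feat_dim=512, hidden_dim=512)
h0, c0 = torch.zeros(4, 512), torch.zeros(4, 512)
out, state, alpha = m.step(torch.randn(4, 300), torch.randn(4, 36, 512), (h0, c0))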




Acknowledgements
This research was supported by the National Science and Technology Major Project (Grant No. 2020YFA0713504), the CERNET Innovation Project (Grant No. NGII20180309), and the Scientific Research Fund of Hunan Provincial Education Department (Grant No. 210153).
Cite this article
Xiao, F., Xue, W., Shen, Y. et al. A New Attention-Based LSTM for Image Captioning. Neural Process Lett 54, 3157–3171 (2022). https://doi.org/10.1007/s11063-022-10759-z