Abstract
Transformer-based models have recently gained popularity in the field of image captioning. The Transformer's global attention mechanism facilitates the integration of region and grid features, leading to a significant improvement in accuracy. However, combining the two feature types through direct fusion introduces unavoidable semantic noise, caused by the lack of synergy between region and grid features; meanwhile, the additional detector required to extract region features also reduces the model's efficiency. In this paper, we introduce a novel position-shift alignment network (PSNet) to exploit the advantages of both feature types. Concretely, we embed a simple detector, DETR, into the model and extract region features directly from grid features to improve efficiency. Moreover, we propose a P-shift alignment module to address the semantic noise caused by the non-synergy between region and grid features. To validate our model, we conduct extensive experiments and visualizations on the MS-COCO dataset, and the results show that PSNet is qualitatively competitive with existing models under comparable experimental conditions.
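The abstract's central idea is that region features can be derived from grid features by a DETR-style decoder rather than a separate heavyweight detector. The following is a minimal sketch of that idea only, not the authors' implementation: the module names, dimensions, number of queries, and the naive fusion step are all illustrative assumptions.

```python
# Minimal sketch (assumed, not from the paper): DETR-style learned queries
# cross-attend over flattened grid features to produce region features,
# so no external detector such as Faster R-CNN is needed.
import torch
import torch.nn as nn


class GridToRegion(nn.Module):
    def __init__(self, d_model=512, num_queries=50, num_layers=3, nhead=8):
        super().__init__()
        # Learned object queries, as in DETR.
        self.queries = nn.Parameter(torch.randn(num_queries, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)

    def forward(self, grid_feats):
        # grid_feats: (B, H*W, d_model), a flattened backbone feature map.
        b = grid_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        # Queries cross-attend to the grid; outputs act as region features.
        return self.decoder(q, grid_feats)


if __name__ == "__main__":
    grid = torch.randn(2, 49, 512)              # e.g. a 7x7 grid, flattened
    regions = GridToRegion()(grid)               # (2, 50, 512)
    fused = torch.cat([grid, regions], dim=1)    # naive fusion, for illustration only
    print(fused.shape)
```

In the paper the two feature streams are not fused naively as in the last line; the proposed P-shift alignment module is applied first to suppress the semantic noise arising from their lack of synergy.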





Research Data Policy and Data Availability Statements
The datasets generated and/or analyzed during the current study are available from the corresponding author on reasonable request. The code that supports the findings of this study is also available from the corresponding author upon reasonable request.
Acknowledgements
We sincerely thank the anonymous reviewers for their helpful comments and suggestions, which improved this paper. This work was supported in part by the National Key R&D Program of China (Grant Number 2020YFC1512601) and the National Natural Science Foundation of China (Grant Numbers 62106064 and U20B2044).
Author information
Authors and Affiliations
Contributions
L.X.: Conceptualization, Methodology, Software; A.Z.: Software, Writing; R.W.: Visualization, Investigation; J.Y.: Writing, Editing.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare no competing interests relevant to this article.
Ethical approval
All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards. Informed consent was obtained from all individual participants included in the study.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Xue, L., Zhang, A., Wang, R. et al. PSNet: position-shift alignment network for image caption. Int J Multimed Info Retr 12, 42 (2023). https://doi.org/10.1007/s13735-023-00307-3