Abstract
Transformer-based models have recently gained popularity in the field of image captioning. The Transformer's global attention mechanism facilitates the integration of region and grid features, leading to a significant improvement in accuracy. However, combining the two feature types through direct fusion introduces unavoidable semantic noise, caused by the lack of synergy between region and grid features; meanwhile, the additional detector required to extract region features also reduces the model's efficiency. In this paper, we introduce a novel position-shift alignment network (PSNet) to exploit the advantages of both feature types. Concretely, we embed a simple detector, DETR, into the model and extract region features directly from grid features to improve efficiency. Moreover, we propose a P-shift alignment module to address the semantic noise caused by the non-synergy between region and grid features. To validate our model, we conduct extensive experiments and visualizations on the MS-COCO dataset, and the results show that PSNet is qualitatively competitive with existing models under comparable experimental conditions.
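The abstract's central idea is that region features can be derived from grid features by a DETR-style decoder rather than a separate heavyweight detector. The following is a minimal sketch of that idea only, not the authors' implementation: the module names, dimensions, number of queries, and the naive fusion step are all illustrative assumptions.

```python
# Minimal sketch (assumed, not from the paper): DETR-style learned queries
# cross-attend over flattened grid features to produce region features,
# so no external detector such as Faster R-CNN is needed.
import torch
import torch.nn as nn


class GridToRegion(nn.Module):
    def __init__(self, d_model=512, num_queries=50, num_layers=3, nhead=8):
        super().__init__()
        # Learned object queries, as in DETR.
        self.queries = nn.Parameter(torch.randn(num_queries, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)

    def forward(self, grid_feats):
        # grid_feats: (B, H*W, d_model), a flattened backbone feature map.
        b = grid_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        # Queries cross-attend to the grid; outputs act as region features.
        return self.decoder(q, grid_feats)


if __name__ == "__main__":
    grid = torch.randn(2, 49, 512)              # e.g. a 7x7 grid, flattened
    regions = GridToRegion()(grid)               # (2, 50, 512)
    fused = torch.cat([grid, regions], dim=1)    # naive fusion, for illustration only
    print(fused.shape)
```

In the paper the two feature streams are not fused naively as in the last line; the proposed P-shift alignment module is applied first to suppress the semantic noise arising from their lack of synergy.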





Research Data Policy and Data Availability Statements
The datasets generated and/or analyzed during the current study are available from the corresponding author on reasonable request. The code that supports the findings of this study is also available from the corresponding author upon reasonable request.
Acknowledgements
We sincerely thank the anonymous reviewers for their helpful comments and suggestions, which improved this paper. This work was supported in part by the National Key R&D Program of China (Grant Number 2020YFC1512601) and the National Natural Science Foundation of China (Grant Numbers 62106064 and U20B2044).
Author information
Authors and Affiliations
Contributions
L.X.: Conceptualization, Methodology, Software; A.Z.: Software, Writing; R.W.: Visualization, Investigation; J.Y.: Writing, Editing.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare no competing interests relevant to this article.
Ethical approval
All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards. Informed consent was obtained from all individual participants included in the study.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Xue, L., Zhang, A., Wang, R. et al. PSNet: position-shift alignment network for image caption. Int J Multimed Info Retr 12, 42 (2023). https://doi.org/10.1007/s13735-023-00307-3