PSNet: position-shift alignment network for image caption

  • Regular Paper
  • Published in: International Journal of Multimedia Information Retrieval

Abstract

Recently, Transformer-based models have gained increasing popularity in the field of image captioning. The global attention mechanism of the Transformer facilitates the integration of region and grid features, leading to a significant improvement in accuracy. However, combining the two feature types through direct fusion introduces unavoidable semantic noise, caused by the lack of synergy between region and grid features; meanwhile, the additional detector needed to extract region features also decreases the efficiency of the model. In this paper, we introduce a novel position-shift alignment network (PSNet) to exploit the advantages of both feature types. Concretely, we embed a simple DETR detector into the model and extract region features directly from grid features to improve model efficiency. Moreover, we propose a P-shift alignment module to suppress the semantic noise arising from the lack of synergy between region and grid features. To validate our model, we conduct extensive experiments and visualizations on the MS-COCO dataset, and the results show that PSNet is qualitatively competitive with existing models under comparable experimental conditions.
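The abstract describes the pipeline only at a high level: grid features from a backbone, a lightweight DETR-style decoder that derives region features from those grids, and an alignment step that reconciles the two sources before fusion. The following PyTorch sketch is purely illustrative and is not the authors' implementation; the module names (RegionFromGrid, ShiftAlignFusion), dimensions, and the use of cross-attention for the alignment step are assumptions.

```python
import torch
import torch.nn as nn


class RegionFromGrid(nn.Module):
    """DETR-style decoder sketch: learned object queries cross-attend to grid
    features, yielding region-level features without a separate detector."""

    def __init__(self, dim=512, num_queries=50, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def forward(self, grid):                                   # grid: (B, N_grid, dim)
        q = self.queries.unsqueeze(0).expand(grid.size(0), -1, -1)
        regions, _ = self.cross_attn(q, grid, grid)            # (B, num_queries, dim)
        return regions + self.ffn(regions)


class ShiftAlignFusion(nn.Module):
    """Hypothetical alignment step: region features attend back to the grid so
    the two sources agree before they are concatenated for the caption decoder."""

    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.align_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, regions, grid):
        aligned, _ = self.align_attn(regions, grid, grid)      # pull grid context into each region
        regions = self.norm(regions + aligned)
        return torch.cat([grid, regions], dim=1)               # fused visual tokens: (B, N_grid + N_q, dim)


# Usage: derive region features from backbone grid features, then fuse.
grid_feats = torch.randn(2, 49, 512)                           # e.g. a flattened 7x7 feature map
visual_tokens = ShiftAlignFusion()(RegionFromGrid()(grid_feats), grid_feats)
print(visual_tokens.shape)                                     # torch.Size([2, 99, 512])
```

Deriving regions from the same grid features removes the separately trained detector from the inference pipeline, which is the efficiency argument made in the abstract; the cross-attention alignment here merely stands in for the P-shift module, whose exact formulation is not given on this page.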


Research Data Policy and Data Availability Statements

The datasets generated and/or analyzed during the current study, as well as the code that supports the findings of this study, are available from the corresponding author upon reasonable request.


Acknowledgements

We express our sincere thanks to the anonymous reviewers for their helpful comments and suggestions, which helped raise the standard of this paper. This work was partially supported by the National Key R&D Program of China (Grant Number 2020YFC1512601) and the National Natural Science Foundation of China (Grant Numbers 62106064 and U20B2044).

Author information

Contributions

L.X.: Conceptualization, Methodology, Software; A.Z.: Software, Writing; R.W.: Visualization, Investigation; J.Y.: Writing, Editing.

Corresponding author

Correspondence to Juan Yang.

Ethics declarations

Conflict of interest

The authors declare that they have no competing interests relevant to this article.

Ethical approval

All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards. Informed consent was obtained from all individual participants included in the study.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Cite this article

Xue, L., Zhang, A., Wang, R. et al. PSNet: position-shift alignment network for image caption. Int J Multimed Info Retr 12, 42 (2023). https://doi.org/10.1007/s13735-023-00307-3
