Abstract
Most existing image captioning methods are based on the encoder-decoder framework, which translates visual features directly into sentences without exploiting the commonsense knowledge available in knowledge graphs. Inspired by the success of information retrieval and question answering systems that leverage prior knowledge, we explore a knowledge embedding approach for image captioning. In this paper, we propose a Knowledge Embedding with Attention on Attention (KE-AoA) method, which judges whether and how strongly objects are related and augments the semantic correlations and constraints between them. KE-AoA combines a knowledge base method (TransE) with a text method (Skip-gram), adding external knowledge graph information (triplets) to the language model as a regularization term that guides the learning of word vectors. It then employs the AoA module to model the relations among different objects. As more inherent relations and commonsense knowledge are learned, the model generates better image descriptions. Experiments on the MSCOCO dataset show significant improvements over existing methods and validate the effectiveness of our prior knowledge-based approach.
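The abstract compresses two technical components: a word-embedding objective that mixes Skip-gram with a TransE-style triplet regularizer, and the AoA gating module. The following is a minimal PyTorch sketch of both pieces, not the authors' implementation: the negative-sampling form of the Skip-gram loss, the L2 TransE distance, and the weighting factor `lam` are assumptions about how the combination could be wired, while the AoA gate follows the formulation of Huang et al. (2019).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KnowledgeEmbedding(nn.Module):
    """Skip-gram word vectors regularized by TransE over KG triplets.

    Illustrative sketch only: `dim` and `lam` are hypothetical choices,
    not values reported in the paper.
    """
    def __init__(self, vocab_size, num_relations, dim=300, lam=0.1):
        super().__init__()
        self.word = nn.Embedding(vocab_size, dim)    # center-word / entity vectors
        self.ctx = nn.Embedding(vocab_size, dim)     # context-word vectors
        self.rel = nn.Embedding(num_relations, dim)  # relation vectors (TransE)
        self.lam = lam                               # regularization weight

    def skipgram_loss(self, center, context, negatives):
        # Negative-sampling objective (Mikolov et al., 2013).
        c = self.word(center)                                 # (B, d)
        pos = (c * self.ctx(context)).sum(-1)                 # (B,)
        neg = torch.bmm(self.ctx(negatives),                  # (B, K, d)
                        c.unsqueeze(-1)).squeeze(-1)          # (B, K)
        return -(F.logsigmoid(pos) + F.logsigmoid(-neg).sum(-1)).mean()

    def transe_reg(self, head, relation, tail):
        # TransE constraint ||h + r - t||: pulls words that the knowledge
        # graph relates toward their KG-implied positions.
        h, r, t = self.word(head), self.rel(relation), self.word(tail)
        return (h + r - t).norm(p=2, dim=-1).mean()

    def forward(self, center, context, negatives, head, relation, tail):
        return (self.skipgram_loss(center, context, negatives)
                + self.lam * self.transe_reg(head, relation, tail))

class AoAGate(nn.Module):
    """Attention-on-Attention gate (Huang et al., 2019): a learned sigmoid
    gate filters an information vector, both computed from the query and
    the attended result."""
    def __init__(self, dim):
        super().__init__()
        self.info = nn.Linear(2 * dim, dim)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, query, attended):
        x = torch.cat([query, attended], dim=-1)
        return torch.sigmoid(self.gate(x)) * self.info(x)
```

In use, the knowledge-regularized vectors would initialize the captioner's word embeddings, and the AoA gate would wrap each attention call over the detected object features; both integration points are inferences from the abstract rather than details given here.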
References
Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., Yakhnenko, O.: Translating embeddings for modeling multi-relational data. In: Advances in Neural Information Processing Systems, pp. 2787–2795 (2013)
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. In: ICLR (2013)
Farhadi, A., et al.: Every picture tells a story: generating sentences from images. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6314, pp. 15–29. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15561-1_2
Li, S., Kulkarni, G., Berg, T.L., et al.: Composing simple image descriptions using web-scale n-grams. In: Proceedings of the Fifteenth Conference on Computational Natural Language Learning, pp. 220–228 (2011)
Kulkarni, G., Premraj, V., Ordonez, V., et al.: BabyTalk: understanding and generating simple image descriptions. IEEE Trans. Pattern Anal. Mach. Intell. 35(12), 2891–2903 (2013)
Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: a neural image caption generator. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3156–3164. IEEE (2015)
Jia, X., Gavves, E., Fernando, B., Tuytelaars, T.: Guiding the long-short term memory model for image caption generation. In: IEEE International Conference on Computer Vision (ICCV) (2015)
Wang, C., Yang, H., Bartz, C., Meinel, C.: Image captioning with deep bidirectional LSTMs. In: Proceedings of the 24th ACM International Conference on Multimedia (2016)
Mao, J., Wei, X., Yi, Y., Jiang, W., Huang, Z., Yuille, A.: Deep captioning with multimodal recurrent neural networks (m-RNN). In: ICLR (2015)
Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
Xu, K., et al.: Show, attend and tell: neural image caption generation with visual attention. In: ICML (2015)
Anderson, P., He, X., Buehler, C., et al.: Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6077–6086 (2018)
Mrkšić, N., Vulić, I., Séaghdha, D.Ó., et al.: Semantic specialization of distributional word vector spaces using monolingual and cross-lingual constraints. Trans. Assoc. Comput. Linguist. 5, 309–324 (2017)
Ahn, S., Choi, H., Pärnamaa, T., et al.: A neural knowledge language model. arXiv preprint arXiv:1608.00318 (2016)
Ghaddar, A., Langlais, P.: Robust lexical features for improved neural network named-entity recognition. arXiv preprint arXiv:1806.03489 (2018)
Vashishth, S., Joshi, R., Prayaga, S.S., et al.: RESIDE: improving distantly-supervised neural relation extraction using side information. arXiv preprint arXiv:1812.04361 (2018)
Yang, X., Tang, K., Zhang, H., et al.: Auto-encoding scene graphs for image captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10685–10694 (2019)
Huang, L., Wang, W., Chen, J., et al.: Attention on attention for image captioning. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4634–4643 (2019)
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119 (2013)
Xu, C., et al.: RC-NET: a general framework for incorporating knowledge into word representations. In: Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, pp. 1219–1228. ACM (2014)
Ren, S., He, K., Girshick, R., et al.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems, pp. 91–99 (2015)
Vaswani, A., Shazeer, N., Parmar, N., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)
He, K., Zhang, X., Ren, S., et al.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv preprint arXiv:1607.06450 (2016)
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015)
Papineni, K., Roukos, S., Ward, T., Zhu, W.-J.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp. 311–318. Association for Computational Linguistics (2002)
Denkowski, M., Lavie, A.: Meteor universal: language specific translation evaluation for any target language. In: Proceedings of the Ninth Workshop on Statistical Machine Translation, pp. 376–380 (2014)
Lin, C.-Y.: ROUGE: a package for automatic evaluation of summaries. In: Text Summarization Branches Out (2004)
Vedantam, R., Lawrence Zitnick, C., Parikh, D.: CIDEr: consensus-based image description evaluation. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4566–4575 (2015)
Anderson, P., Fernando, B., Johnson, M., Gould, S.: SPICE: semantic propositional image caption evaluation. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9909, pp. 382–398. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46454-1_24
Deng, J., Dong, W., Socher, R., et al.: ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009)
Vendrov, I., Kiros, R., Fidler, S., et al.: Order-embeddings of images and language. arXiv preprint arXiv:1511.06361 (2015)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Lu, J., Xiong, C., Parikh, D., Socher, R.: Knowing when to look: adaptive attention via a visual sentinel for image captioning. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
Rennie, S.J., Marcheret, E., Mroueh, Y., Ross, J., Goel, V.: Self-critical sequence training for image captioning. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
Acknowledgment
This work was supported by the National Key Research and Development Program of China (Grant No. 2020AAA0106600) and the National Natural Science Foundation of China (Grant Nos. 61976021, 61672100, and U1811262).
Copyright information
© 2021 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Song, D., Peng, C., Yang, H., Liao, L. (2021). Exploiting Knowledge Embedding to Improve the Description for Image Captioning. In: Chen, H., Liu, K., Sun, Y., Wang, S., Hou, L. (eds.) Knowledge Graph and Semantic Computing: Knowledge Graph and Cognitive Intelligence. CCKS 2020. Communications in Computer and Information Science, vol 1356. Springer, Singapore. https://doi.org/10.1007/978-981-16-1964-9_25
DOI: https://doi.org/10.1007/978-981-16-1964-9_25
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-1963-2
Online ISBN: 978-981-16-1964-9