Generating a natural language description from an image is an emerging interdisciplinary problem at the intersection of computer vision, natural language processing, and artificial intelligence (AI). This task, often referred to as image or visual captioning, forms the technical foundation of many important applications, such as semantic visual search, visual intelligence in chatting robots, photo and video sharing on social media, and aids that help visually impaired people perceive surrounding visual content. Thanks to recent advances in deep learning, the AI research community has witnessed tremendous progress in visual captioning in recent years. In this article, we first summarize this exciting emerging area of visual captioning. We then analyze the key developments and the major progress the community has made, their impact on both research and industrial deployment, and what lies ahead in future breakthroughs.
Introduction
It has long been envisioned that machines will one day understand the visual world at a human level of intelligence. Thanks to the progress in deep learning [15], [36], [59], [60], [69], researchers can now build very deep convolutional neural networks (CNNs) and achieve an impressively low error rate for tasks like large-scale image classification [9], [15], [23]. In these tasks, one way for researchers to train a model to predict the category of a given image is to first annotate each image in a training set with a label from a predefined set of categories. Through such fully supervised training, the computer learns how to classify an image.

However, in tasks like image classification, the content of an image is usually simple, containing a predominant object to be classified. The situation can be much more challenging when we want computers to understand complex scenes. Image captioning is one such task. The challenges come from two perspectives. First, to generate a semantically meaningful and syntactically fluent caption, the system needs to detect salient semantic concepts in the image, understand the relationships among them, and compose a coherent description of the overall content of the image, which involves language and common-sense knowledge
Figure 2. An illustration of a deep CNN such as the AlexNet [15]. The CNN is trained for a 1,000-class image classification task on the large-scale ImageNet data set [41]. The last layer of the AlexNet contains 1,000 nodes, each corresponding to a category. The output of the second-to-last fully connected layer is usually extracted as the global visual feature vector, representing the semantic content of the overall image.
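To make this concrete, below is a minimal sketch (not the authors' code) of extracting such a global visual feature vector with a pretrained AlexNet from torchvision. It assumes a recent torchvision release (the weights enum API), and the image path is a placeholder.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Load an AlexNet pretrained on ImageNet and switch to inference mode.
alexnet = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
alexnet.eval()

# Drop the final 1,000-way classification layer so the forward pass
# returns the second-to-last fully connected activations (4,096-D)
# instead of class logits.
alexnet.classifier = torch.nn.Sequential(*list(alexnet.classifier.children())[:-1])

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    global_feature = alexnet(image)   # shape: (1, 4096)
```

Deeper backbones such as ResNet [9] can be used the same way by stripping their final classification layer.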
...

Figure 4. An illustration of the attention mechanism in the image caption generation process. At each word-generation step, the decoder attends over the visual vectors of the image subregions and combines them into an attention context vector, which complements the global visual vector.
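The following is a small, self-contained sketch of the attention computation the figure depicts: given the visual vectors of image subregions and the decoder's current hidden state, it produces attention weights and the attention context vector. The tensor sizes and the additive scoring function are illustrative assumptions, not the exact formulation of any particular paper.

```python
import torch
import torch.nn.functional as F

def attention_context(region_feats, hidden, W_r, W_h, v):
    """Additive attention over subregion visual vectors.

    region_feats: (num_regions, feat_dim) visual vectors for subregions
    hidden:       (hid_dim,) current decoder hidden state
    W_r, W_h, v:  learned projection parameters (plain tensors here)
    Returns the attention weights and the weighted context vector.
    """
    scores = v @ torch.tanh(W_r @ region_feats.T + (W_h @ hidden)[:, None])  # (num_regions,)
    weights = F.softmax(scores, dim=0)
    context = weights @ region_feats          # (feat_dim,) attention context vector
    return weights, context

# Toy usage with random tensors standing in for CNN region features
# and an LSTM hidden state.
regions = torch.randn(49, 512)    # e.g., a 7x7 grid of 512-D vectors
h_t = torch.randn(256)
W_r = torch.randn(128, 512)
W_h = torch.randn(128, 256)
v = torch.randn(128)
alpha, ctx = attention_context(regions, h_t, W_r, W_h, v)
```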
In [5], the detected tags are then fed into an n-gram-based max-entropy language model to generate a list of caption hypotheses. Each hypothesis is a full sentence that covers certain tags and is regularized by the syntax modeled by the language model, which defines the probability distribution over word sequences.

All of these hypotheses are then reranked by a linear combination of features computed over the entire sentence and the whole image, including sentence length, language model scores, and the semantic similarity between the overall image and the entire caption hypothesis. Among them, the image-caption semantic similarity is computed by a deep multimodal similarity model (DMSM), which consists of a pair of neural networks, one for mapping each input modality, image and language, to vectors in a common semantic space. The image-caption semantic similarity is then defined as the cosine similarity between the two vectors.
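As an illustration of the DMSM idea, here is a minimal two-tower sketch: one network embeds the global image feature and another embeds a caption (represented here, purely for simplicity, as a bag-of-words vector) into a common semantic space, and relevance is scored by cosine similarity. The layer sizes and the text encoder are assumptions, not the configuration used in [5].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTowerSimilarity(nn.Module):
    """DMSM-style sketch: map image and caption into a shared semantic
    space and score relevance by cosine similarity."""

    def __init__(self, img_dim=4096, vocab_size=10000, sem_dim=300):
        super().__init__()
        self.image_tower = nn.Sequential(
            nn.Linear(img_dim, 1024), nn.Tanh(), nn.Linear(1024, sem_dim))
        self.text_tower = nn.Sequential(
            nn.Linear(vocab_size, 1024), nn.Tanh(), nn.Linear(1024, sem_dim))

    def forward(self, image_feat, caption_bow):
        img_vec = F.normalize(self.image_tower(image_feat), dim=-1)
        txt_vec = F.normalize(self.text_tower(caption_bow), dim=-1)
        # Cosine similarity between the image and caption embeddings.
        return (img_vec * txt_vec).sum(dim=-1)

model = TwoTowerSimilarity()
score = model(torch.randn(2, 4096), torch.rand(2, 10000))  # (batch,) similarities
```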
Compared to the end-to-end framework, the compositional approach provides better flexibility and scalability in system development and deployment, and it facilitates exploiting various data sources to optimize the performance of different modules more effectively, rather than learning all of the models on limited image-caption paired data. On the other hand, the end-to-end model usually has a simpler architecture and can optimize the overall system jointly for better performance.
More recently, a class of models has been proposed to integrate explicit semantic-concept detection in an encoder-decoder framework. A general diagram of this class of models is illustrated in Figure 6. For example, [1] applied retrieved sentences as additional semantic information to guide the LSTM when generating captions, while [31] and [33] applied a semantic-concept-detection process before generating sentences. In [7], a semantic compositional network is constructed based on the probability of detected semantic concepts for composing captions.
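To illustrate the general idea of feeding detected semantic concepts into the decoder, the sketch below initializes an LSTM caption decoder from a vector of concept (tag) probabilities. This is a simplified stand-in: the semantic compositional network of [7] goes further and composes the LSTM parameters themselves from the concept probabilities.

```python
import torch
import torch.nn as nn

class ConceptConditionedDecoder(nn.Module):
    """Sketch of conditioning a caption decoder on detected semantic
    concepts: tag probabilities initialize the LSTM state before
    word-by-word generation (an illustrative simplification)."""

    def __init__(self, num_concepts=1000, vocab_size=10000, emb=256, hid=512):
        super().__init__()
        self.init_h = nn.Linear(num_concepts, hid)
        self.init_c = nn.Linear(num_concepts, hid)
        self.embed = nn.Embedding(vocab_size, emb)
        self.lstm = nn.LSTM(emb, hid, batch_first=True)
        self.out = nn.Linear(hid, vocab_size)

    def forward(self, concept_probs, caption_tokens):
        h0 = torch.tanh(self.init_h(concept_probs)).unsqueeze(0)  # (1, B, hid)
        c0 = torch.tanh(self.init_c(concept_probs)).unsqueeze(0)
        states, _ = self.lstm(self.embed(caption_tokens), (h0, c0))
        return self.out(states)  # per-step word logits

decoder = ConceptConditionedDecoder()
logits = decoder(torch.rand(2, 1000), torch.randint(0, 10000, (2, 12)))
```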
Other related work
Other related work also learns a joint embedding of visual features and associated captions, including [5] for image captioning and [21] for video captioning. Most recently, [27] has looked into generating dense image captions for individual regions in images. In addition, a variational autoencoder was developed in [22] for image captioning. Motivated by the recent success of reinforcement learning, researchers have also proposed a set of reinforcement learning-based algorithms to directly optimize the model for specific rewards. For example, [67] proposed a self-critical sequence training algorithm. It uses the REINFORCE algorithm to optimize a particular evaluation metric that is usually not differentiable and therefore not easy to optimize by conventional gradient-based methods. In [65], within the actor-critic framework, a policy network and a value network are learned to generate the caption by optimizing a visual semantic reward, which measures the similarity between the image and the generated caption.
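The core of self-critical sequence training [67] can be written in a few lines: the reward of a sampled caption, minus the reward of the greedily decoded caption used as a baseline, weights the REINFORCE gradient. The sketch below assumes that the per-caption log-probabilities and the non-differentiable metric rewards (e.g., CIDEr scores from an external scorer) have already been computed.

```python
import torch

def self_critical_loss(sample_logprobs, sample_reward, greedy_reward):
    """Self-critical REINFORCE objective (sketch, not the authors' code).

    sample_logprobs: summed per-word log-probabilities of sampled captions
    sample_reward:   metric reward (e.g., CIDEr) of the sampled captions
    greedy_reward:   metric reward of the greedily decoded captions (baseline)
    """
    advantage = sample_reward - greedy_reward
    # Minimizing this loss raises the log-probability of sampled captions
    # that score better than the greedy baseline, and lowers it otherwise.
    return -(advantage.detach() * sample_logprobs).mean()

# Toy usage: per-example summed log-probs and scalar rewards for a batch.
logp = torch.randn(4, requires_grad=True)
loss = self_critical_loss(logp,
                          torch.tensor([1.1, 0.7, 0.9, 1.3]),
                          torch.tensor([0.9, 0.9, 0.9, 0.9]))
loss.backward()
```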
Relevant to image-caption generation, models based on generative adversarial networks (GANs) have recently been proposed for text generation. Among them, SeqGAN [68] models the generator as a stochastic policy in reinforcement learning for discrete outputs like text, while RankGAN [66] proposed a ranking-based loss for the discriminator, which gives a better assessment of the quality of the generated text and therefore leads to a better generator.
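A rough sketch of how a GAN-style objective plugs into this picture: a discriminator scores complete captions, and its probability that a generated caption is real can serve as the sequence-level reward in the same kind of policy-gradient update shown above. The tiny mean-pooled discriminator below is only illustrative and is not the architecture of SeqGAN [68] or RankGAN [66].

```python
import torch
import torch.nn as nn

class CaptionDiscriminator(nn.Module):
    """Toy caption discriminator: mean-pooled word embeddings scored by a
    linear layer; its output can be used as a reward for the generator."""

    def __init__(self, vocab_size=10000, emb=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.score = nn.Linear(emb, 1)

    def forward(self, tokens):                        # tokens: (batch, length)
        pooled = self.embed(tokens).mean(dim=1)       # (batch, emb)
        return torch.sigmoid(self.score(pooled)).squeeze(-1)  # P("real")

disc = CaptionDiscriminator()
reward = disc(torch.randint(0, 10000, (4, 12)))       # rewards for 4 sampled captions
```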
Metrics
The quality of the automatically generated captions is evaluated and reported in the literature using both automatic metrics and human studies. Commonly used automatic metrics include BLEU [45], METEOR [44], CIDEr [46], and SPICE [47]. BLEU [45] is widely used in machine translation and measures the fraction of n-grams (up to four-grams) that are in common between a hypothesis and a reference or set of references. METEOR [44] instead measures unigram precision and recall, extending exact word matches to include similar words based on WordNet synonyms and stemmed tokens. CIDEr [46] also measures the n-gram match between the caption hypothesis and the references, with the n-grams weighted by term frequency–inverse document frequency (TF-IDF). SPICE [47], on the other hand, measures the F1 score of the semantic propositional content contained in image captions given the references, and it therefore has the best correlation with human judgment [47]. These automatic metrics can be computed efficiently and can greatly speed up the development of image captioning algorithms. However, all of these automatic metrics are known to only roughly correlate with human judgment [50].
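As a concrete example of what these metrics compute, the following sketch implements BLEU's central quantity, the clipped (modified) n-gram precision of a hypothesis against a set of references; the full BLEU score additionally combines the precisions for n = 1 to 4 geometrically and applies a brevity penalty.

```python
from collections import Counter

def modified_ngram_precision(hypothesis, references, n):
    """Clipped n-gram precision of a tokenized hypothesis against a set of
    tokenized references (the core quantity inside BLEU)."""
    hyp_ngrams = Counter(zip(*[hypothesis[i:] for i in range(n)]))
    max_ref_counts = Counter()
    for ref in references:
        ref_ngrams = Counter(zip(*[ref[i:] for i in range(n)]))
        for g, c in ref_ngrams.items():
            max_ref_counts[g] = max(max_ref_counts[g], c)
    clipped = sum(min(c, max_ref_counts[g]) for g, c in hyp_ngrams.items())
    total = max(sum(hyp_ngrams.values()), 1)
    return clipped / total

hyp = "a man riding a horse on the beach".split()
refs = ["a man is riding a horse on a beach".split(),
        "a person rides a horse along the shore".split()]
print(modified_ngram_precision(hyp, refs, n=2))
```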
Benchmarks
Researchers have created many data sets to facilitate research on image captioning, including the Flickr data set [49] and the PASCAL sentence data set [48]. More recently, Microsoft sponsored the creation of the Common Objects in Context (COCO) data set [51], the largest image captioning data set available to the public today. The availability of these large-scale data sets has significantly promoted research on image captioning in the last several years.

In 2015, approximately 15 groups participated in the COCO Captioning Challenge [52]. The entries in the challenge were evaluated by human judgment. The five human evaluation metrics are listed in Table 1. In the competition, all entries were assessed based on the results of metric 1 (M1) and metric 2 (M2); the other metrics were used for diagnosis and interpretation of the results. Specifically, in evaluation, each task presents a human judge with an image and two captions: one is automatically generated, and the other is a human caption. For M1, the judge is asked to select which caption better describes the image, or to choose the "same" option when they are of equal quality. For M2, the judge is asked to tell which of the two captions is generated by a human. If the judge chooses the automatically generated caption, or chooses the "cannot tell" option, the system is considered to have passed the Turing test.

Table 1. Human evaluation metrics in the 2015 COCO Captioning Challenge.

Metric  Comment
M1      Percentage of captions that are evaluated as better than or equal to the human caption.
M2      Percentage of captions that pass the Turing test.
M3      Average correctness of the captions on a scale from one to five (incorrect–correct).
M4      Average amount of detail of the captions on a scale from one to five (lack of details–very detailed).
M5      Percentage of captions that are similar to the human description.

Table 2 tabulates the results of the 15 entries in the 2015 COCO Captioning Challenge. Among them, the Microsoft Research entry (MSR) achieves the best performance on the Turing test metric (M2), while the Google team outperforms others in the percentage of captions that were as good as or better than the human captions (M1). Overall, Microsoft Research and Google jointly received first prize in the 2015 COCO Image Captioning Challenge. The results of two special systems, human and random, are also included for reference.

More systems have been developed since the 2015 COCO competition. However, due to the high cost, human judging was no longer performed. Instead, the organizers of the COCO benchmark set up an automatic evaluation server. The server receives the captions generated by a new system and then evaluates and reports the results on the blind test set in automatic metrics. Table 3 summarizes the top 24 entries plus the human system as of August 2017, ranked by SPICE, using 40 references per image [52]. Note that these systems outperform the human system in all automatic metrics except SPICE. However, in human judgment, it is likely that the human system still has a lead, given that in Table 2 there is a huge gap between the best systems and a human.

Industrial deployment
Given the fast progress in the research community, industry has started deploying image captioning services. In March 2016, Microsoft released the first public image captioning application programming interface as a cloud service [53]. To showcase the functionality, it deployed a web application called CaptionBot (http://CaptionBot.ai), which captions arbitrary pictures uploaded by users [33]. The service also supports applications like Seeing AI, designed for the low-vision community, that narrate the world around people who are blind or visually impaired [71]. More recently, Microsoft further deployed the caption service in its widely used Office products, specifically Word and PowerPoint, for automatically generating alt-text, i.e., text descriptions of pictures, for accessibility [61]. Facebook released an automatic image captioning tool that provides a list of objects and scenes identified in a photo [34]. Meanwhile, although the service has not yet been deployed publicly, Google open sourced its image captioning system for the community [35]. With all of these industrial-scale deployments and open-source projects, a massive number of images and user feedback in real-world scenarios are collected and serve as training data to continuously improve the captioning services.
Table 3. The state-of-the-art image captioning systems in automatic metrics (as of 8 December 2016).

Entry  CIDEr-D  METEOR  BLEU-4  SPICE (×10)  Date
Watson Multimodal 1.123 0.268 0.344 0.204 16 November 2016
DONOT_FAIL_AGAIN 1.01 0.262 0.32 0.199 22 November 2016
Human 0.854 0.252 0.217 0.198 23 March 2015
MSM@MSRA 1.049 0.266 0.343 0.197 25 October 2016
MetaMind/VT_GT 1.042 0.264 0.336 0.197 1 December 2016
ATT-IMG (MSM@MSRA) 1.023 0.262 0.34 0.193 13 June 2016
G-RMI(PG-SPIDEr-TAG) 1.042 0.255 0.331 0.192 11 November 2016
DLTC@MSR 1.003 0.257 0.331 0.19 4 September 2016
Postech_CV 0.987 0.255 0.321 0.19 13 June 2016
G-RMI (PG-BCMR) 1.013 0.257 0.332 0.187 30 October 2016
feng 0.986 0.255 0.323 0.187 6 November 2016
THU_MIG 0.969 0.251 0.323 0.186 3 June 2016
MSR 0.912 0.247 0.291 0.186 8 April 2015
reviewnet 0.965 0.256 0.313 0.185 24 October 2016
Dalab_Master_Thesis 0.96 0.253 0.316 0.183 28 November 2016
ChalLS 0.955 0.252 0.309 0.183 21 May 2016
ATT_VC_REG 0.964 0.254 0.317 0.182 3 December 2016
AugmentCNNwithDe 0.956 0.251 0.315 0.182 29 March 2016
AT 0.943 0.25 0.316 0.182 29 October 2015
Google 0.943 0.254 0.309 0.182 29 May 2015
TsinghuaBigeye 0.939 0.248 0.314 0.181 9 May 2016
…visually grounded dialog [56], and image synthesis from text description [57], [72]. The progress in multimodal intelligence is critical for building more general AI abilities in the future, and we hope the overview provided in this article can encourage students and researchers to enter and contribute to this exciting AI area.

Authors
Xiaodong He (xiaohe@microsoft.com) received his bachelor's degree from Tsinghua University, Beijing, China, in 1996, his M.S. degree from the Chinese Academy of Sciences, Beijing, in 1999, and his Ph.D. degree from the University of Missouri–Columbia in 2003. He is a principal researcher in the Deep Learning Group of Microsoft Research, Redmond, Washington. He is also an affiliate professor in the Department of Electrical Engineering and Computer Engineering at the University of Washington, Seattle. His research interests are mainly in artificial intelligence areas including deep learning, natural language processing, computer vision, speech, information retrieval, and knowledge representation. He received several awards, including the Outstanding Paper Award at the 2015 Conference of the Association for Computational Linguistics (ACL). He has held editorial positions on several IEEE journals, was an area chair for the 2015 Conference of the North American Chapter of the ACL, and has served on the organizing and program committees of major speech and language processing conferences. He is a Senior Member of the IEEE.

Li Deng (l.deng@ieee.org) received his Ph.D. degree from the University of Wisconsin-Madison in 1987. He was an assistant professor (1989–1992), tenured associate professor (1992–1996), and full professor (1996–1999) at the University of Waterloo, Ontario, Canada. In 1999, he joined Microsoft Research, Redmond, Washington, where he currently leads the research and development of deep learning as a partner research manager of its Deep Learning Technology Center, and where he is a chief scientist of artificial intelligence. Since 2000, he has also been an affiliate full professor and graduate committee member at the University of Washington, Seattle. He is a Fellow of the IEEE, the Acoustical Society of America, and the International Speech Communication Association. He served on the Board of Governors of the IEEE Signal Processing Society (SPS) (2008–2010) and as editor-in-chief of IEEE Signal Processing Magazine.

[4] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, "Long-term recurrent convolutional networks for visual recognition and description," in Proc. Conf. Computer Vision and Pattern Recognition, 2015, pp. 2625–2634.
[5] H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt, et al., "From captions to visual concepts and back," in Proc. Conf. Computer Vision and Pattern Recognition, 2015, pp. 1473–1482.
[6] A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth, "Every picture tells a story: Generating sentences from images," in Proc. European Conf. Computer Vision, 2010.
[7] Z. Gan, C. Gan, X. He, Y. Pu, K. Tran, J. Gao, L. Carin, and L. Deng, "Semantic compositional networks for visual captioning," in Proc. Conf. Computer Vision and Pattern Recognition, 2017.
[8] R. Girshick, "Fast R-CNN," in Proc. Int. Conf. Computer Vision, 2015, pp. 1440–1448.
[9] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. Conf. Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[10] M. Hodosh, P. Young, and J. Hockenmaier, "Framing image description as a ranking task: Data, models and evaluation metrics," J. Artif. Intell. Res., vol. 47, pp. 853–899, 2013.
[11] X. Jia, E. Gavves, B. Fernando, and T. Tuytelaars, "Guiding the long-short term memory model for image caption generation," in Proc. Int. Conf. Computer Vision, 2015, pp. 2407–2415.
[12] A. Karpathy and L. Fei-Fei, "Deep visual-semantic alignments for generating image descriptions," in Proc. Conf. Computer Vision and Pattern Recognition, 2015, pp. 3128–3137.
[13] R. Kiros, R. Salakhutdinov, and R. S. Zemel, "Multimodal neural language models," in Proc. Int. Conf. Machine Learning, 2014.
[14] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al., "Visual Genome: Connecting language and vision using crowdsourced dense image annotations," arXiv preprint, arXiv:1602.07332, 2016.
[15] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Conf. Neural Information Processing Systems, 2012.
[16] G. Kulkarni, V. Premraj, V. Ordonez, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg, "BabyTalk: Understanding and generating simple image descriptions," in Proc. Conf. Computer Vision and Pattern Recognition, 2011, pp. 1601–1608.
[17] S. Li, G. Kulkarni, T. L. Berg, A. C. Berg, and Y. Choi, "Composing simple image descriptions using web-scale n-grams," in Proc. 15th Conf. Computational Natural Language Learning, 2011, pp. 220–228.
[18] C. Liu, J. Mao, F. Sha, and A. Yuille, "Attention correctness in neural image captioning," arXiv preprint, arXiv:1605.09553, 2016.
[19] J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille, "Deep captioning with multimodal recurrent neural networks (m-RNN)," in Proc. Int. Conf. Learning Representations, 2015.
[20] V. Ordonez, G. Kulkarni, and T. L. Berg, "Im2Text: Describing images using 1 million captioned photographs," in Proc. Conf. Neural Information Processing Systems, 2011.
[21] Y. Pan, T. Mei, T. Yao, H. Li, and Y. Rui, "Jointly modeling embedding and translation to bridge video and language," in Proc. Conf. Computer Vision and Pattern Recognition, 2016, pp. 4594–4602.
[22] Y. Pu, Z. Gan, R. Henao, X. Yuan, C. Li, A. Stevens, and L. Carin, "Variational autoencoder for deep learning of images, labels and captions," in Proc. Conf. Neural Information Processing Systems, 2016.
[40] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[41] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. Conf. Computer Vision and Pattern Recognition, 2009, pp. 248–255.
[42] H. O. Song, R. Girshick, S. Jegelka, J. Mairal, Z. Harchaoui, and T. Darrell, "On learning to localize objects with minimal supervision," in Proc. Int. Conf. Machine Learning, 2014.
[43] C. Zhang, J. C. Platt, and P. A. Viola, "Multiple instance boosting for object detection," in Proc. Conf. Neural Information Processing Systems, 2005.
[44] S. Banerjee and A. Lavie, "METEOR: An automatic metric for MT evaluation with improved correlation with human judgments," in Proc. ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005.
[45] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: A method for automatic evaluation of machine translation," in Proc. 40th Annu. Meeting Association Computational Linguistics, 2002, pp. 311–318.
[46] R. Vedantam, C. L. Zitnick, and D. Parikh, "CIDEr: Consensus-based image description evaluation," in Proc. Conf. Computer Vision and Pattern Recognition, 2015, pp. 4566–4575.
[47] P. Anderson, B. Fernando, M. Johnson, and S. Gould, "SPICE: Semantic propositional image caption evaluation," in Proc. European Conf. Computer Vision, 2016.
[64] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang, "Bottom-up and top-down attention for image captioning and VQA," arXiv preprint, arXiv:1707.07998, 2017.
[65] Z. Ren, X. Wang, N. Zhang, X. Lv, and L.-J. Li, "Deep reinforcement learning-based image captioning with embedding reward," in Proc. Conf. Computer Vision and Pattern Recognition, 2017.
[66] K. Lin, D. Li, X. He, Z. Zhang, and M.-T. Sun, "Adversarial ranking for language generation," arXiv preprint, arXiv:1705.11001, 2017.
[67] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel, "Self-critical sequence training for image captioning," in Proc. Conf. Computer Vision and Pattern Recognition, 2017.
[68] L. Yu, W. Zhang, J. Wang, and Y. Yu, "SeqGAN: Sequence generative adversarial nets with policy gradient," in Proc. Association Advancement Artificial Intelligence, 2017.
[69] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA: MIT Press, 2016.
[70] Q. Wu, D. Teney, P. Wang, C. Shen, A. Dick, and A. van den Hengel, "Visual question answering: A survey of methods and data sets," Computer Vision and Image Understanding, 2017.
[71] Seeing AI. [Online]. Available: https://www.microsoft.com/en-us/seeing-ai/
[72] S. Reed, Z. Akata, X. Yan, L. Logeswaran, H. Lee, and B. Schiele, "Generative adversarial text to image synthesis," in Proc. Int. Conf. Machine Learning, 2016.