
Deep Learning for Visual Understanding

Xiaodong He and Li Deng

Deep Learning for Image-to-Text Generation


A technical overview

Generating a natural language description from an image is an emerging interdisciplinary problem at the intersection of computer vision, natural language processing, and artificial intelligence (AI). This task, often referred to as image or visual captioning, forms the technical foundation of many important applications, such as semantic visual search, visual intelligence in chatting robots, photo and video sharing in social media, and aid for visually impaired people to perceive surrounding visual content. Thanks to the recent advances in deep learning, the AI research community has witnessed tremendous progress in visual captioning in recent years. In this article, we will first summarize this exciting emerging visual captioning area. We will then analyze the key developments and the major progress the community has made, their impact in both research and industry deployment, and what lies ahead in future breakthroughs.

Introduction
It has long been envisioned that machines will one day understand the visual world at a human level of intelligence. Thanks to the progress in deep learning [15], [36], [59], [60], [69], researchers can now build very deep convolutional neural networks (CNNs) and achieve an impressively low error rate for tasks like large-scale image classification [9], [15], [23]. In these tasks, one way for researchers to train a model to predict the category of a given image is to first annotate each image in a training set with a label from a predefined set of categories. Through such fully supervised training, the computer learns how to classify an image.
However, in tasks like image classification, the content of an image is usually simple, containing a predominant object to be classified. The situation could be much more challenging when we want computers to understand complex scenes. Image captioning is one such task. The challenges come from two perspectives. First, to generate a semantically meaningful and syntactically fluent caption, the system needs to detect salient semantic concepts in the image, understand the relationships among them, and compose a coherent description of the overall content of the image, which involves language and common-sense knowledge modeling beyond object recognition.


In addition, due to the complexity of scenes in the image, it is difficult to represent all of their fine-grained, subtle differences with a simple category attribute. The supervision for training image captioning models is a full description of the content of the image in natural language, which is sometimes ambiguous and lacks fine-grained alignments between the subregions in the image and the words in the description.
Moreover, unlike image classification tasks, where we can easily tell whether the classification output is correct or wrong after comparing it to the ground truth, there are multiple valid ways to describe the content of an image. It is not easy to tell whether, or to what degree, a generated caption is correct. In practice, human studies are often employed to judge the quality of the caption given an image. However, since human evaluation is costly and time-consuming, many automatic metrics have been proposed, which serve as a proxy mainly for speeding up the development cycle of the system.
Early approaches to image captioning can be roughly divided into two families. The first is based on template matching [6], [16], [17]. These approaches start by detecting objects, actions, scenes, and attributes in images and then fill them into a hand-designed, rigid sentence template. The captions generated by these approaches are not always fluent and expressive. The second family is grounded on retrieval-based approaches, which first select a set of visually similar images from a large database and then transfer the captions of the retrieved images to fit the query image [10], [20]. There is little flexibility to modify words based on the content of the query image, since these approaches directly rely on captions of training images and cannot generate new captions.
Deep neural networks can potentially address both of these issues by generating fluent and expressive captions, which can also generalize beyond those in the training set. In particular, recent successes of neural networks in image classification [9], [15], [23] and object detection [8] have motivated strong interest in using neural networks for visual captioning.

Major deep-learning paradigms for image captioning

The end-to-end framework

Vector-to-sequence learning
Motivated by the recent success of sequence-to-sequence learning in machine translation [37]–[39], researchers have studied an end-to-end encoder-decoder framework for image captioning [2]–[4], [12], [26]. Figure 1 illustrates a typical encoder-decoder-based captioning system [26].

Figure 1. An illustration of the CNN-RNN-based image captioning framework: a CNN encodes the image into a global visual vector, and an RNN decodes it into a caption (e.g., "a baby holding a toothbrush in its mouth").

In such a framework, the raw image is first encoded into a global visual feature vector, which represents the overall semantic information of the image, via a deep CNN. As illustrated in Figure 2, a CNN consists of several convolutional, max-pooling, response-normalization, and fully connected layers. This architecture has been very successful for large-scale image classification [21], and the learned features have been shown to transfer to a broad variety of vision tasks [40]. Usually, given a raw image, the activation values at the second-to-last fully connected layer are extracted as the global visual feature vector.

Figure 2. An illustration of a deep CNN such as the AlexNet [15]. The CNN is trained for a 1,000-class image classification task on the large-scale ImageNet data set [41]. The last layer of the AlexNet contains 1,000 nodes, each corresponding to a category. The second-to-last fully connected dense layer is usually extracted as the global visual feature vector, representing the semantic content of the overall image.
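As a concrete illustration of this encoding step, the following is a minimal PyTorch sketch (assuming the torchvision AlexNet, whose layer layout matches Figure 2) that extracts the activations of the second-to-last fully connected layer as the global visual feature vector; the preprocessing constants are the standard ImageNet values, not settings taken from the article.

import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# AlexNet pretrained on ImageNet; its classifier head is
# fc6 (4,096) -> fc7 (4,096) -> fc8 (1,000 classes).
cnn = models.alexnet(pretrained=True)
cnn.eval()

# Standard ImageNet preprocessing (resize, crop, normalize).
preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def global_visual_vector(image_path):
    # Returns the 4,096-dimensional fc7 activations used as the
    # global visual feature vector.
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        feats = cnn.features(x)                 # convolutional trunk
        feats = cnn.avgpool(feats).flatten(1)   # pooled feature map
        for layer in list(cnn.classifier.children())[:-1]:
            feats = layer(feats)                # stop before the 1,000-way layer
    return feats.squeeze(0)                     # shape: (4096,)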


Once the global visual vector is extracted, it is fed into a recurrent neural network (RNN)-based decoder for caption generation, as illustrated in Figure 3. In practice, a long short-term memory (LSTM) network [40] or gated recurrent unit (GRU) [39] variant of the RNN is often used; both have been shown to be more efficient and effective than vanilla RNNs in training and in capturing long-span language dependencies [38], [39], and both have found successful applications in action recognition tasks [62], [63].

Figure 3. An illustration of an RNN-based caption decoder. At the initial step, the global visual vector, which represents the overall semantic meaning of the image, is fed into the RNN to compute the hidden layer at the first step, while the sentence-start symbol <s> is used as the input to the hidden layer at the first step. The first word is then generated from the hidden layer. Continuing this process, the word generated in the previous step becomes the input to the hidden layer at the next step to generate the next word. This generation process continues until the sentence-end symbol, </s>, is generated.
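To make this decoding loop concrete, here is a minimal PyTorch sketch of an LSTM caption decoder with greedy decoding. The dimensions, the linear projection of the image feature, and the single-layer LSTM are illustrative assumptions, not the exact configuration of any cited system.

import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    def __init__(self, feat_dim=4096, embed_dim=512, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, embed_dim)  # map the image feature into the word-embedding space
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def greedy_decode(self, img_feat, bos_id, eos_id, max_len=20):
        # Step 0: the global visual vector conditions the LSTM state.
        img_in = self.img_proj(img_feat).view(1, 1, -1)
        _, state = self.lstm(img_in)
        # Then feed <s>, and afterwards each generated word, until </s>.
        word = torch.tensor([[bos_id]])
        caption = []
        for _ in range(max_len):
            hidden, state = self.lstm(self.embed(word), state)
            next_id = self.out(hidden[:, -1]).argmax(dim=-1).item()
            if next_id == eos_id:
                break
            caption.append(next_id)
            word = torch.tensor([[next_id]])
        return caption  # list of word ids; map back to strings with the vocabulary

At training time, the same network is typically optimized with cross-entropy over the ground-truth caption words (teacher forcing); the greedy loop above corresponds to the generation process of Figure 3.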
Representative studies using the aforementioned end-to-end framework include [2]–[4], [7], [11]–[13], [19], and [26] for image captioning and [1], [21], [24], [25], and [32] for video captioning. The differences among the various methods mainly lie in the types of CNN architectures and RNN-based language models. For example, the vanilla RNN was used in [12] and [19], while the LSTM was used in [26]. The visual feature vector was fed into the RNN only once, at the first time step, in [26], while it was used at each time step of the RNN in [19].

The attention mechanism
Most recently, [29] utilized an attention-based mechanism to learn where to focus in the image during caption generation. The attention architecture is illustrated in Figure 4. Different from the simple encoder-decoder approach, the attention-based approach first uses the CNN to generate not only a global visual vector but also a set of visual vectors for subregions of the image. These subregion vectors can be extracted from a lower convolutional layer of the CNN. Then, in language generation, at each step of generating a new word, the RNN refers to these subregion vectors and determines the likelihood that each subregion is relevant to the current state for generating the word. Eventually, the attention mechanism forms a context vector, which is a sum of the subregional visual vectors weighted by the likelihood of relevance, for the RNN to decode the next new word.

Figure 4. An illustration of the attention mechanism in the image caption generation process: the CNN produces visual vectors for subregions in addition to the global visual vector, and an attention-weighted context vector guides the RNN at each decoding step.
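The weighting step can be sketched in a few lines of PyTorch. The additive (tanh) scoring function below is one common choice in the spirit of [29]; the dimensions and the exact form of the score are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    # Computes a context vector as a relevance-weighted sum of subregion vectors.
    def __init__(self, region_dim=512, hidden_dim=512, attn_dim=256):
        super().__init__()
        self.w_region = nn.Linear(region_dim, attn_dim)
        self.w_state = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, regions, decoder_state):
        # regions: (num_regions, region_dim), e.g., a grid of lower-layer CNN activations
        # decoder_state: (hidden_dim,), the RNN hidden state at the current step
        e = self.score(torch.tanh(self.w_region(regions) + self.w_state(decoder_state)))
        alpha = F.softmax(e.squeeze(-1), dim=0)               # relevance weight per subregion
        context = (alpha.unsqueeze(-1) * regions).sum(dim=0)  # weighted sum: (region_dim,)
        return context, alpha

The returned context vector is then combined with the current word embedding (for example, by concatenation) as the input to the RNN when decoding the next word.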

This work was followed by [30], which introduced a "review" module to improve the attention mechanism, and further by [18], which proposed a method to improve the correctness of visual attention. More recently, a bottom-up attention model based on object detection was proposed in [64], which demonstrated state-of-the-art performance on image captioning. In the end-to-end framework, all of the model parameters, including the CNN, the RNN, and the attention model, are trained jointly in an end-to-end fashion; hence the name end to end.

A compositional framework
Different from the end-to-end encoder-decoder framework described previously, a separate class of image-to-text approaches uses an explicit semantic-concept-detection process for caption generation. The detection model and the other modules are often trained separately. Figure 5 illustrates the semantic-concept-detection-based compositional approach proposed by Fang et al. [5].
In this framework, the first step in the caption generation pipeline detects a set of semantic concepts, known as tags or attributes, that are likely to be part of the image's description. These tags may belong to any part of speech, including nouns, verbs, and adjectives. Unlike in image classification, standard supervised learning techniques are not directly applicable for learning the detectors, since the supervision only contains the whole image and the human-annotated caption of the whole image, while the bounding boxes corresponding to the words are unknown. To address this issue, [5] proposed learning the detectors using the weakly supervised approach of multiple instance learning (MIL) [42], [43], while in [33] this problem is treated as a multilabel classification task.

Figure 5. An illustration of the semantic-concept-detection-based compositional approach of [5]: a CNN detects semantic tags (e.g., "baby," "holding," "mouth"), a maximum-entropy language model generates N-best caption hypotheses, and a deep multimodal similarity model (DMSM) reranks them using the global visual vector.
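The MIL step can be summarized with a noisy-OR aggregation over image regions, the formulation used in [5], [43]: an image is positive for a tag if at least one of its regions is. The NumPy sketch below shows only this aggregation step; the per-region probabilities are hypothetical values standing in for the outputs of a region-level CNN classifier.

import numpy as np

def image_level_tag_probability(region_probs):
    # Noisy-OR: p(tag in image) = 1 - prod over regions of (1 - p(tag | region)).
    region_probs = np.asarray(region_probs, dtype=float)
    return 1.0 - np.prod(1.0 - region_probs)

# Hypothetical per-region scores for the tag "toothbrush"; the weak label
# only says whether the word appears in the image's caption.
p = image_level_tag_probability([0.05, 0.10, 0.85, 0.02])
print(round(p, 2))  # 0.87 -> the image-level detector fires for this tag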

In [5], the detected tags are then fed into an n-gram-based maximum-entropy language model to generate a list of caption hypotheses. Each hypothesis is a full sentence that covers certain tags and is regularized by the syntax modeled by the language model, which defines the probability distribution over word sequences.
All of these hypotheses are then reranked by a linear combination of features computed over the entire sentence and the whole image, including the sentence length, language model scores, and the semantic similarity between the overall image and the entire caption hypothesis. Among these features, the image-caption semantic similarity is computed by a deep multimodal similarity model (DMSM), which consists of a pair of neural networks, one for mapping each input modality, image and language, to vectors in a common semantic space. The image-caption semantic similarity is then defined as the cosine similarity between their vectors.
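The reranking score can be written down in a few lines. The sketch below assumes two already-trained embedding networks whose outputs are passed in as vectors, in the spirit of the DMSM; the feature set and weights are illustrative placeholders rather than the values used in [5].

import numpy as np

def cosine_similarity(u, v):
    # DMSM-style image-caption similarity: cosine between the vectors
    # produced by the image-side and text-side embedding networks.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def rerank(hypotheses, image_vec, weights):
    # hypotheses: list of dicts with 'caption', 'length', 'lm_score', 'text_vec'
    # weights: per-feature weights of the linear reranker (tuned in practice)
    def score(h):
        return (weights["length"] * h["length"]
                + weights["lm"] * h["lm_score"]
                + weights["dmsm"] * cosine_similarity(image_vec, h["text_vec"]))
    return max(hypotheses, key=score)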
Compared to the end-to-end framework, the compositional approach provides better flexibility and scalability in system development and deployment, and it facilitates exploiting various data sources to optimize the performance of the different modules more effectively, rather than learning all of the models from limited image-caption paired data. On the other hand, the end-to-end model usually has a simpler architecture and can optimize the overall system jointly for better performance.
More recently, a class of models has been proposed to integrate explicit semantic-concept detection into an encoder-decoder framework. A general diagram of this class of models is illustrated in Figure 6. For example, [1] applied retrieved sentences as additional semantic information to guide the LSTM when generating captions, while [31] and [33] applied a semantic-concept-detection process before generating sentences. In [7], a semantic compositional network is constructed based on the probabilities of the detected semantic concepts for composing captions.

Figure 6. An illustration of integrating explicit semantic-concept detection into an encoder-decoder framework: the detected semantic tags (e.g., "baby," "mouth," "holding") are fed into the RNN decoder together with the global visual vector.

Other related work
Other related work also learns a joint embedding of visual features and associated captions, including [5] for image captioning and [21] for video captioning. Most recently, [27] has looked into generating dense image captions for individual regions in images. In addition, a variational autoencoder was developed in [22] for image captioning. Also motivated by its recent success, researchers have proposed a set of reinforcement-learning-based algorithms to directly optimize the model for specific rewards. For example, [67] proposed a self-critical sequence training algorithm. It uses the REINFORCE algorithm to optimize a particular evaluation metric that is usually not differentiable and therefore not easy to optimize by conventional gradient-based methods. In [65], within the actor-critic framework, a policy network and a value network are learned to generate the caption by optimizing a visual-semantic reward, which measures the similarity between the image and the generated caption.
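The self-critical objective of [67] can be sketched as follows: the reward of a sampled caption is baselined by the reward of the greedily decoded caption, and the non-differentiable metric enters only through this advantage. The helpers cider_reward, sample_caption, and greedy_caption are assumed stand-ins here, not functions of any particular library.

import torch

def self_critical_loss(model, image_feat, references,
                       cider_reward, sample_caption, greedy_caption):
    # cider_reward(token_ids, references) -> float            (assumed helper)
    # sample_caption(model, feat) -> (token_ids, log_probs)   (assumed helper)
    # greedy_caption(model, feat) -> token_ids                (assumed helper)
    sampled_ids, log_probs = sample_caption(model, image_feat)   # stochastic rollout
    with torch.no_grad():
        baseline_ids = greedy_caption(model, image_feat)         # test-time behavior as baseline
        advantage = (cider_reward(sampled_ids, references)
                     - cider_reward(baseline_ids, references))
    # Policy-gradient surrogate: raise the log probability of the sampled
    # caption in proportion to how much it beats the greedy baseline.
    return -(advantage * log_probs.sum())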
Relevant to image-caption generation, models based on generative adversarial networks (GANs) have recently been proposed for text generation. Among them, SeqGAN [68] models the generator as a stochastic policy in reinforcement learning for discrete outputs like text, while RankGAN [66] proposed a ranking-based loss for the discriminator, which gives a better assessment of the quality of the generated text and therefore leads to a better generator.

Metrics
The quality of automatically generated captions is evaluated and reported in the literature with both automatic metrics and human studies. Commonly used automatic metrics include BLEU [45], METEOR [44], CIDEr [46], and SPICE [47]. BLEU [45] is widely used in machine translation and measures the fraction of n-grams (up to four-grams) that are in common between a hypothesis and a reference or set of references. METEOR [44] instead measures unigram precision and recall, but extends exact word matches to include similar words based on WordNet synonyms and stemmed tokens. CIDEr [46] also measures the n-gram match between the caption hypothesis and the references, while the n-grams are weighted by term frequency–inverse document frequency (TF-IDF). On the other hand, SPICE [47] measures the F1 score of semantic propositional content contained in image captions given the references, and therefore it has the best correlation with human judgment [47]. These automatic metrics can be computed efficiently, and they greatly speed up the development of image captioning algorithms. However, all of these automatic metrics are known to only roughly correlate with human judgment [50].
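As an example of what these metrics compute, the short Python sketch below implements the core of BLEU, clipped n-gram precision against a set of references; the brevity penalty and the geometric mean over n = 1, ..., 4 are omitted for brevity.

from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_ngram_precision(hypothesis, references, n):
    # Fraction of hypothesis n-grams that also appear in some reference,
    # with counts clipped by the maximum reference count (as in BLEU).
    hyp_counts = Counter(ngrams(hypothesis, n))
    max_ref_counts = Counter()
    for ref in references:
        for gram, c in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], c)
    clipped = sum(min(c, max_ref_counts[gram]) for gram, c in hyp_counts.items())
    total = sum(hyp_counts.values())
    return clipped / total if total else 0.0

hyp = "a baby holding a toothbrush in its mouth".split()
refs = ["a baby is holding a toothbrush in its mouth".split(),
        "a small child chews on a toothbrush".split()]
print(clipped_ngram_precision(hyp, refs, 2))  # bigram precision of the hypothesis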



Benchmarks
Researchers have created many data sets to facilitate the research of image captioning. The Flickr data set [49] and the PASCAL sentence data set [48] were created for this purpose. More recently, Microsoft sponsored the creation of the Common Objects in Context (COCO) data set [51], the largest image captioning data set available to the public today. The availability of these large-scale data sets has significantly prompted research in image captioning in the last several years.
In 2015, approximately 15 groups participated in the COCO Captioning Challenge [52]. The entries in the challenge were evaluated by human judgment. The five human-evaluation metrics are listed in Table 1. In the competition, all entries were assessed based on the results from metric 1 (M1) and metric 2 (M2); the other metrics were used for diagnosis and interpretation of the results. Specifically, in evaluation, each task presents a human judge with an image and two captions: one is automatically generated, and the other is a human caption. For M1, the judge is asked to select which caption better describes the image, or to choose the "same" option when they are of equal quality. For M2, the judge is asked to tell which of the two captions is generated by a human. If the judge chooses the automatically generated caption, or chooses the "cannot tell" option, the generated caption is considered to have passed the Turing test. Table 2 tabulates the results of the 15 entries in the 2015 COCO Captioning Challenge. Among them, the Microsoft Research entry (MSR) achieves the best performance on the Turing test metric, while the Google team outperforms the others in the percentage of captions that were as good as or better than human captions. Overall, Microsoft Research and Google jointly received first prize in the 2015 COCO Image Captioning Challenge. The results of two special systems, human and random, are also included for reference.

Table 1. Human evaluation metrics in the 2015 COCO Captioning Challenge.
Metric  Comment
M1      Percentage of captions that are evaluated as better than or equal to the human caption.
M2      Percentage of captions that pass the Turing test.
M3      Average correctness of the captions on a scale from one to five (incorrect–correct).
M4      Average amount of detail of the captions on a scale from one to five (lack of details–very detailed).
M5      Percentage of captions that are similar to the human description.

More systems have been developed since the 2015 COCO competition. However, due to the high cost, human judging was no longer performed. Instead, the organizers of the COCO benchmark set up an automatic evaluation server. The server receives the captions generated by a new system and then evaluates and reports the results on the blind test set in automatic metrics. Table 3 summarizes the top 24 entries plus the human system as of August 2017, ranked by SPICE, using 40 references per image [52]. Note that these systems outperform the human system in all automatic metrics except SPICE. However, in human judgment it is likely that the human system still has a lead, given that in Table 2 there is a huge gap between the best systems and a human.

Industrial deployment
Given the fast progress in the research community, industry has started deploying image captioning services. In March 2016, Microsoft released the first public image captioning application programming interface as a cloud service [53]. To showcase the functionality, it deployed a web application called CaptionBot (http://CaptionBot.ai), which captions arbitrary pictures uploaded by users [33]. The service also supports applications like Seeing AI, designed for the low-vision community, which narrates the world around people who are blind or visually impaired [71]. More recently, Microsoft further deployed the caption service in its widely used Office products, specifically Word and PowerPoint, for automatically generating alt-text, i.e., text descriptions of pictures, for accessibility [61]. Facebook released an automatic image captioning tool that provides a list of objects and scenes identified in a photo [34]. Meanwhile, although the service has not yet been deployed publicly, Google open sourced its image captioning system for the community [35]. With all of these industrial-scale deployments and open-source projects, a massive number of images and a large amount of user feedback in real-world scenarios are collected and serve as training data to continuously improve the performance of the system and to stimulate new research in deep visual understanding.
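For readers who want to try such a service, the hedged Python sketch below shows the typical shape of a call to the Computer Vision API's describe operation [53]. The endpoint region, API version, and response field names are assumptions based on the v1.0 schema of that era and may differ today; consult the current service documentation.

import requests

# Illustrative values only; see [53] for the current endpoint and schema.
ENDPOINT = "https://westus.api.cognitive.microsoft.com/vision/v1.0/describe"
SUBSCRIPTION_KEY = "<your-subscription-key>"

def caption_image(image_url):
    response = requests.post(
        ENDPOINT,
        headers={"Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY,
                 "Content-Type": "application/json"},
        json={"url": image_url},
    )
    response.raise_for_status()
    captions = response.json()["description"]["captions"]  # candidate captions with confidences
    return captions[0]["text"] if captions else None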



Table 2. Human evaluation results of entries in the 2015 COCO Captioning Challenge.
Entry M1 M2 M3 M4 M5 Date
Human 0.638 0.675 4.836 3.428 0.352 23 March 2015
Google 0.273 0.317 4.107 2.742 0.233 29 May 2015
MSR 0.268 0.322 4.137 2.662 0.234 8 April 2015
Montreal/Toronto 0.262 0.272 3.932 2.832 0.197 14 May 2015
MSR Captivator 0.25 0.301 4.149 2.565 0.233 28 May 2015
Berkeley LRCN 0.246 0.268 3.924 2.786 0.204 25 April 2015
m-RNN 0.223 0.252 3.897 2.595 0.202 30 May 2015
Nearest Neighbor 0.216 0.255 3.801 2.716 0.196 15 May 2015
PicSOM 0.202 0.25 3.965 2.552 0.182 26 May 2015
Brno University 0.194 0.213 3.079 3.482 0.154 29 May 2015
m-RNN (Baidu/UCLA) 0.19 0.241 3.831 2.548 0.195 26 May 2015
MIL 0.168 0.197 3.349 2.915 0.159 29 May 2015
MLBL 0.167 0.196 3.659 2.42 0.156 10 April 2015
NeuralTalk 0.166 0.192 3.436 2.742 0.147 15 April 2015
ACVT 0.154 0.19 3.516 2.599 0.155 26 May 2015
Tsinghua Bigeye 0.1 0.146 3.51 2.163 0.116 23 April 2015
Random 0.007 0.02 1.084 3.247 0.013 29 May 2015

Table 3. The state-of-the-art image captioning systems in automatic metrics (as of 8 December 2016).
Entry CIDEr-D METEOR BLEU-4 SPICE (×10) Date
Watson Multimodal 1.123 0.268 0.344 0.204 16 November 2016
DONOT_FAIL_AGAIN 1.01 0.262 0.32 0.199 22 November 2016
Human 0.854 0.252 0.217 0.198 23 March 2015
MSM@MSRA 1.049 0.266 0.343 0.197 25 October 2016
MetaMind/VT_GT 1.042 0.264 0.336 0.197 1 December 2016
ATT-IMG (MSM@MSRA) 1.023 0.262 0.34 0.193 13 June 2016
G-RMI(PG-SPIDEr-TAG) 1.042 0.255 0.331 0.192 11 November 2016
DLTC@MSR 1.003 0.257 0.331 0.19 4 September 2016
Postech_CV 0.987 0.255 0.321 0.19 13 June 2016
G-RMI (PG-BCMR) 1.013 0.257 0.332 0.187 30 October 2016
feng 0.986 0.255 0.323 0.187 6 November 2016
THU_MIG 0.969 0.251 0.323 0.186 3 June 2016
MSR 0.912 0.247 0.291 0.186 8 April 2015
reviewnet 0.965 0.256 0.313 0.185 24 October 2016
Dalab_Master_Thesis 0.96 0.253 0.316 0.183 28 November 2016
ChalLS 0.955 0.252 0.309 0.183 21 May 2016
ATT_VC_REG 0.964 0.254 0.317 0.182 3 December 2016
AugmentCNNwithDe 0.956 0.251 0.315 0.182 29 March 2016
AT 0.943 0.25 0.316 0.182 29 October 2015
Google 0.943 0.254 0.309 0.182 29 May 2015
TsinghuaBigeye 0.939 0.248 0.314 0.181 9 May 2016



Outlook
Image-to-text generation is an important interdisciplinary area across computer vision and natural language processing. It also forms the technical foundation of many important applications. Thanks to deep-learning technologies, we have seen significant progress in this area in recent years. In this article, we have reviewed the key developments that the community has made and their impact in both research and industry deployment. Looking forward, image captioning will be a key subarea in the image–natural language multimodal intelligence field. A number of new problems in this field have been proposed lately, including visual question answering [54], [55], [70], visual storytelling [58], visually grounded dialog [56], and image synthesis from text description [57], [72]. The progress in multimodal intelligence is critical for building more general AI abilities in the future, and we hope the overview provided in this article can encourage students and researchers to enter and contribute to this exciting AI area.

Authors
Xiaodong He (xiaohe@microsoft.com) received his bachelor's degree from Tsinghua University, Beijing, China, in 1996, his M.S. degree from the Chinese Academy of Sciences, Beijing, in 1999, and his Ph.D. degree from the University of Missouri–Columbia in 2003. He is a principal researcher in the Deep Learning Group of Microsoft Research, Redmond, Washington. He is also an affiliate professor in the Department of Electrical Engineering and Computer Engineering at the University of Washington, Seattle. His research interests are mainly in artificial intelligence areas including deep learning, natural language processing, computer vision, speech, information retrieval, and knowledge representation. He received several awards, including the Outstanding Paper Award at the 2015 Conference of the Association for Computational Linguistics (ACL). He has held editorial positions on several IEEE journals, was the area chair for the North American Chapter of the 2015 Conference of the ACL, and served on the organizing committee/program committee of major speech and language processing conferences. He is a Senior Member of the IEEE.
Li Deng (l.deng@ieee.org) received the Ph.D. degree from the University of Wisconsin-Madison in 1987. He was an assistant professor (1989–1992), tenured associate professor (1992–1996), and full professor (1996–1999) at the University of Waterloo, Ontario, Canada. In 1999, he joined Microsoft Research, Redmond, Washington, where he currently leads the research and development of deep learning as a partner research manager of its Deep Learning Technology Center, and where he is a chief scientist of artificial intelligence. Since 2000, he has also been an affiliate full professor and graduate committee member at the University of Washington, Seattle. He is a Fellow of the IEEE, the Acoustical Society of America, and the International Speech Communication Association. He served on the Board of Governors of the IEEE Signal Processing Society (SPS) (2008–2010) and as editor-in-chief of IEEE Signal Processing Magazine (2009–2011), which earned the highest impact factor in 2010 and 2011 among all IEEE publications and for which he received the 2012 IEEE SPS Meritorious Service Award. He recently joined Citadel as its chief artificial intelligence officer.

References
[1] N. Ballas, L. Yao, C. Pal, and A. Courville, "Delving deeper into convolutional networks for learning video representations," in Proc. Int. Conf. Learning Representations, 2016.
[2] X. Chen and C. Lawrence Zitnick, "Mind's eye: A recurrent visual representation for image caption generation," in Proc. Conf. Computer Vision and Pattern Recognition, 2015, pp. 2422–2431.
[3] J. Devlin, H. Cheng, H. Fang, S. Gupta, L. Deng, X. He, G. Zweig, and M. Mitchell, "Language models for image captioning: The quirks and what works," in Proc. 53rd Annu. Meeting Association Computational Linguistics and the 7th Int. Joint Conf. Natural Language Processing, Beijing, China, 2015, pp. 100–105.
[4] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, "Long-term recurrent convolutional networks for visual recognition and description," in Proc. Conf. Computer Vision and Pattern Recognition, 2015, pp. 2625–2634.
[5] H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt, et al., "From captions to visual concepts and back," in Proc. Conf. Computer Vision and Pattern Recognition, 2015, pp. 1473–1482.
[6] A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth, "Every picture tells a story: Generating sentences from images," in Proc. European Conf. Computer Vision, 2010.
[7] Z. Gan, C. Gan, X. He, Y. Pu, K. Tran, J. Gao, L. Carin, and L. Deng, "Semantic compositional networks for visual captioning," in Proc. Conf. Computer Vision and Pattern Recognition, 2017.
[8] R. Girshick, "Fast R-CNN," in Proc. Int. Conf. Computer Vision, 2015, pp. 1440–1448.
[9] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. Conf. Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[10] M. Hodosh, P. Young, and J. Hockenmaier, "Framing image description as a ranking task: Data, models and evaluation metrics," J. Artif. Intell. Res., vol. 47, pp. 853–899, 2013.
[11] X. Jia, E. Gavves, B. Fernando, and T. Tuytelaars, "Guiding the long-short term memory model for image caption generation," in Proc. Int. Conf. Computer Vision, 2015, pp. 2407–2415.
[12] A. Karpathy and L. Fei-Fei, "Deep visual-semantic alignments for generating image descriptions," in Proc. Conf. Computer Vision and Pattern Recognition, 2015, pp. 3128–3137.
[13] R. Kiros, R. Salakhutdinov, and R. S. Zemel, "Multimodal neural language models," in Proc. Int. Conf. Machine Learning, 2014.
[14] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al., "Visual genome: Connecting language and vision using crowdsourced dense image annotations," arXiv Preprint, arXiv:1602.07332, 2016.
[15] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Conf. Neural Information Processing Systems, 2012.
[16] G. Kulkarni, V. Premraj, V. Ordonez, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg, "Babytalk: Understanding and generating simple image descriptions," in Proc. Conf. Computer Vision and Pattern Recognition, 2011, pp. 1601–1608.
[17] S. Li, G. Kulkarni, T. L. Berg, A. C. Berg, and Y. Choi, "Composing simple image descriptions using web-scale n-grams," in Proc. 15th Conf. Computational Natural Language Learning, 2011, pp. 220–228.
[18] C. Liu, J. Mao, F. Sha, and A. Yuille, "Attention correctness in neural image captioning," arXiv Preprint, arXiv:1605.09553, 2016.
[19] J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille, "Deep captioning with multimodal recurrent neural networks (m-RNN)," in Proc. Int. Conf. Learning Representations, 2015.
[20] V. Ordonez, G. Kulkarni, and T. L. Berg, "Im2text: Describing images using 1 million captioned photographs," in Proc. Conf. Neural Information Processing Systems, 2011.
[21] Y. Pan, T. Mei, T. Yao, H. Li, and Y. Rui, "Jointly modeling embedding and translation to bridge video and language," in Proc. Conf. Computer Vision and Pattern Recognition, 2016, pp. 4594–4602.
[22] Y. Pu, Z. Gan, R. Henao, X. Yuan, C. Li, A. Stevens, and L. Carin, "Variational autoencoder for deep learning of images, labels and captions," in Proc. Conf. Neural Information Processing Systems, 2016.



[23] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Proc. Comput. Sci. Conf., 2014.
[24] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko, "Translating videos to natural language using deep recurrent neural networks," in Proc. Conf. North American Chapter Association Computational Linguistics: Human Language Technologies, 2015, pp. 1494–1505.
[25] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko, "Sequence to sequence-video to text," in Proc. Int. Conf. Computer Vision, 2015, pp. 4534–4542.
[26] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, "Show and tell: A neural image caption generator," in Proc. Conf. Computer Vision and Pattern Recognition, 2015, pp. 3156–3164.
[27] J. Johnson, A. Karpathy, and L. Fei-Fei, "DenseCap: Fully convolutional localization networks for dense captioning," in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2015, pp. 4565–4574.
[28] Q. Wu, C. Shen, L. Liu, A. Dick, and A. van den Hengel, "What value do explicit high level concepts have in vision to language problems?" in Proc. Conf. Computer Vision and Pattern Recognition, 2016, pp. 203–212.
[29] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio, "Show, attend and tell: Neural image caption generation with visual attention," in Proc. Int. Conf. Machine Learning, 2015.
[30] Z. Yang, Y. Yuan, Y. Wu, R. Salakhutdinov, and W. W. Cohen, "Review networks for caption generation," in Proc. Conf. Neural Information Processing Systems, 2016.
[31] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo, "Image captioning with semantic attention," in Proc. Conf. Computer Vision and Pattern Recognition, 2016, pp. 4651–4659.
[32] H. Yu, J. Wang, Z. Huang, Y. Yang, and W. Xu, "Video paragraph captioning using hierarchical recurrent neural networks," in Proc. Conf. Computer Vision and Pattern Recognition, 2016, pp. 4584–4593.
[33] K. Tran, X. He, L. Zhang, J. Sun, C. Carapcea, C. Thrasher, C. Buehler, and C. Sienkiewicz, "Rich image captioning in the wild," in Proc. Conf. Computer Vision and Pattern Recognition Workshops (Deep Vision Workshop), 2016, pp. 434–441.
[34] S. Wu, J. Wieland, O. Farivar, and J. Schiller, "Automatic alt-text: Computer-generated image descriptions for blind users on a social network service," in Proc. 20th ACM Conf. Computer Supported Cooperative Work and Social Computing, 2017.
[35] C. Shallue. (2016). Open-source code on show and tell: A neural image caption generator. [Online]. Available: https://github.com/tensorflow/models/tree/master/im2txt
[36] L. Deng and D. Yu, Deep Learning: Methods and Applications. NOW Publishers, 2014.
[37] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Proc. Conf. Neural Information Processing Systems, 2014.
[38] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in Proc. Int. Conf. Learning Representations, 2015.
[39] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Gated feedback recurrent neural networks," in Proc. Int. Conf. Machine Learning, 2015.
[40] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[41] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. Conf. Computer Vision and Pattern Recognition, 2009, pp. 248–255.
[42] H. O. Song, R. Girshick, S. Jegelka, J. Mairal, Z. Harchaoui, and T. Darrell, "On learning to localize objects with minimal supervision," in Proc. Int. Conf. Machine Learning, 2014.
[43] C. Zhang, J. C. Platt, and P. A. Viola, "Multiple instance boosting for object detection," in Proc. Conf. Neural Information Processing Systems, 2005.
[44] S. Banerjee and A. Lavie, "METEOR: An automatic metric for MT evaluation with improved correlation with human judgments," in Proc. ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005.
[45] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: A method for automatic evaluation of machine translation," in Proc. 40th Annu. Meeting Association Computational Linguistics, 2002, pp. 311–318.
[46] R. Vedantam, C. Lawrence Zitnick, and D. Parikh, "CIDEr: Consensus-based image description evaluation," in Proc. European Conf. Computer Vision, 2015, pp. 4566–4575.
[47] P. Anderson, B. Fernando, M. Johnson, and S. Gould, "SPICE: Semantic propositional image caption evaluation," in Proc. European Conf. Computer Vision, 2016.
[48] C. Rashtchian, P. Young, M. Hodosh, and J. Hockenmaier, "Collecting image annotations using Amazon's Mechanical Turk," in Proc. NAACL HLT Workshop Creating Speech and Language Data with Amazon's Mechanical Turk, 2010.
[49] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, "From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions," in Proc. Association Computational Linguistics, vol. 2, 2014, pp. 67–78.
[50] D. Elliott and F. Keller, "Comparing automatic evaluation measures for image description," in Proc. 52nd Annu. Meeting Association Computational Linguistics, 2014, pp. 452–457.
[51] T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. Lawrence Zitnick, and P. Dollár, "Microsoft COCO: Common objects in context," in Proc. European Conf. Computer Vision, 2015.
[52] Y. Cui, M. R. Ronchi, T.-Y. Lin, P. Dollár, and L. Zitnick. (2015). COCO captioning challenge. [Online]. Available: http://mscoco.org/dataset/#captions-challenge
[53] Microsoft Cognitive Services Computer Vision API. [Online]. Available: https://www.microsoft.com/cognitive-services/en-us/computer-vision-api
[54] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola, "Stacked attention networks for image question answering," in Proc. Conf. Computer Vision and Pattern Recognition, 2016, pp. 21–29.
[55] A. Agrawal, J. Lu, S. Antol, M. Mitchell, L. Zitnick, D. Batra, and D. Parikh, "VQA: Visual question answering," in Proc. Int. Conf. Computer Vision, 2015, pp. 2425–2433.
[56] A. Das, S. Kottur, K. Gupta, A. Singh, D. Yadav, J. M. F. Moura, D. Parikh, and D. Batra, "Visual dialog," in Proc. Conf. Computer Vision and Pattern Recognition, 2017.
[57] H. Zhang, T. Xu, H. Li, S. Zhang, X. Huang, X. Wang, and D. Metaxas, "StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks," in Proc. Int. Conf. Computer Vision, 2017.
[58] T.-H. (K.) Huang, F. Ferraro, N. Mostafazadeh, I. Misra, A. Agrawal, J. Devlin, R. Girshick, X. He, P. Kohli, D. Batra, C. Lawrence Zitnick, D. Parikh, L. Vanderwende, M. Galley, and M. Mitchell, "Visual storytelling," in Proc. 2016 Conf. North American Chapter Association Computational Linguistics: Human Language Technologies, 2016, pp. 1233–1239.
[59] G. Dahl, D. Yu, L. Deng, and A. Acero, "Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition," IEEE Trans. Audio, Speech, Lang. Process., vol. 20, pp. 30–42, Jan. 2012.
[60] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, et al., "Deep neural networks for acoustic modeling in speech recognition," IEEE Signal Process. Mag., vol. 29, pp. 82–97, Dec. 2012.
[61] K. Koenigsbauer. (2016). Microsoft Office Blogs. [Online]. Available: https://blogs.office.com/2016/12/20/new-to-office-365-in-december-accessibility-updates-and-more/
[62] R. R. Varior, B. Shuai, J. Lu, D. Xu, and G. Wang, "A Siamese long short-term memory architecture for human re-identification," in Proc. European Conf. Computer Vision, 2016.
[63] J. Liu, A. Shahroudy, D. Xu, and G. Wang, "Spatio-temporal LSTM with trust gates for 3D human action recognition," in Proc. European Conf. Computer Vision, 2016.
[64] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang, "Bottom-up and top-down attention for image captioning and VQA," arXiv Preprint, arXiv:1707.07998.
[65] Z. Ren, X. Wang, N. Zhang, X. Lv, and L.-J. Li, "Deep reinforcement learning-based image captioning with embedding reward," in Proc. Conf. Computer Vision and Pattern Recognition, 2017.
[66] K. Lin, D. Li, X. He, Z. Zhang, and M.-T. Sun, "Adversarial ranking for language generation," arXiv Preprint, arXiv:1705.11001.
[67] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel, "Self-critical sequence training for image captioning," in Proc. Conf. Computer Vision and Pattern Recognition, 2017.
[68] L. Yu, W. Zhang, J. Wang, and Y. Yu, "SeqGAN: Sequence generative adversarial nets with policy gradient," in Proc. Association Advancement Artificial Intelligence, 2017.
[69] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA: MIT Press, 2016.
[70] Q. Wu, D. Teney, P. Wang, C. Shen, A. Dick, and A. van den Hengel, "Visual question answering: A survey of methods and data sets," in Computer Vision and Image Understanding. Elsevier, 2017.
[71] Seeing AI. [Online]. Available: https://www.microsoft.com/en-us/seeing-ai/
[72] S. Reed, Z. Akata, X. Yan, L. Logeswaran, H. Lee, and B. Schiele, "Generative adversarial text to image synthesis," in Proc. Int. Conf. Machine Learning, 2016.
