Image Caption Generator
Abstract — In the modern era, image captioning has become one of the most widely required tools. There are applications that generate and provide captions for a given image, and all of this is done with the help of deep neural network models. The process of generating a description of an image is called image captioning. It requires recognizing the important objects, their attributes, and the relationships among the objects in an image, and it generates syntactically and semantically correct sentences. In this paper, we present a deep learning model to describe images and generate captions using computer vision and machine translation. This paper aims to detect the different objects found in an image, recognize the relationships between those objects, and generate captions. The dataset used is Flickr8k, the programming language used is Python3, and an ML technique called transfer learning is implemented with the help of the Xception model to demonstrate the proposed experiment. This paper also elaborates on the functions and structure of the various neural networks involved. Generating image captions is an important aspect of Computer Vision and Natural Language Processing. Image caption generators find applications in image segmentation, as used by Facebook and Google Photos, and their use can be extended to video frames. They can automate the job of a person who has to interpret images, and they have immense scope in helping visually impaired people.

Keywords — Image, Caption, CNN, Xception, RNN, LSTM, Neural Networks
I. INTRODUCTION
Making a computer system detect objects and describe them using natural language processing (NLP) is an age-old problem of Artificial Intelligence, one that computer vision researchers long considered impossible. With the growing advancements in deep learning techniques, the availability of vast datasets, and increased computational power, models can now be built that generate captions for an image. Image caption generation is a task that involves image processing and natural language processing concepts to recognize the context of an image and describe it in a natural language such as English. While human beings are able to do this easily, it takes a strong algorithm and a lot of computational power for a computer system to do so. Many attempts have been made to simplify this problem and break it down into various simpler problems such as object detection, image classification, and text generation. A computer system takes input images as two-dimensional arrays, and a mapping is learned from images to captions or descriptive sentences.

Figure 1: Our model is based on a deep learning neural network that consists of a vision CNN followed by a language-generating RNN. It generates complete sentences as an output.

In recent years, a lot of attention has been drawn towards the task of automatically generating captions for images. However, while new datasets often spur considerable innovation, benchmark datasets also require fast, accurate, and competitive evaluation metrics to encourage rapid progress. Being able to automatically describe the content of a picture using properly formed English sentences is a very challenging task, but it could have an excellent impact, for example by helping visually impaired people better understand the content of images online. This task is significantly harder than, for instance, the well-studied image classification or visual perception tasks, which are a main focus within the computer vision community. Deep learning methods have demonstrated state-of-the-art results on caption generation problems. What is most impressive about these methods is that a single end-to-end model can be defined to predict a caption, given a photograph, rather than requiring sophisticated data preparation or a pipeline of specifically designed models.
Deep learning has attracted a lot of attention because it is particularly good at a kind of learning that has the potential to be very useful for real-world applications. The ability to learn from unlabeled or unstructured data is a huge benefit for those interested in real-world applications. [1-4]
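To make the vision-CNN-plus-language-RNN design of Figure 1 concrete, a minimal sketch of such a model in Keras is given below. This is an illustration rather than the exact implementation used here: vocab_size and max_length are hypothetical placeholders that would come from the tokenized training captions, while the 2048-dimensional image input matches the pooled feature size of the Xception encoder.

    # Minimal sketch of a CNN-encoder / LSTM-decoder caption model in Keras.
    # vocab_size and max_length are assumptions; in practice they come from
    # the tokenized Flickr8k captions.
    from tensorflow.keras.layers import (Input, Dense, Dropout, Embedding,
                                         LSTM, add)
    from tensorflow.keras.models import Model

    vocab_size = 7577   # assumption: distinct words in the caption corpus
    max_length = 32     # assumption: longest caption length, in tokens

    # Image branch: a 2048-d feature vector from a pretrained CNN (Xception).
    img_input = Input(shape=(2048,))
    img_embed = Dense(256, activation='relu')(Dropout(0.5)(img_input))

    # Text branch: the partial caption generated so far.
    seq_input = Input(shape=(max_length,))
    seq_embed = Embedding(vocab_size, 256, mask_zero=True)(seq_input)
    seq_encode = LSTM(256)(Dropout(0.5)(seq_embed))

    # Merge both branches and predict the next word of the caption.
    merged = add([img_embed, seq_encode])
    output = Dense(vocab_size, activation='softmax')(
        Dense(256, activation='relu')(merged))

    model = Model(inputs=[img_input, seq_input], outputs=output)
    model.compile(loss='categorical_crossentropy', optimizer='adam')

At each step the model receives the image features together with the words generated so far and predicts a probability distribution over the next word.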
Revised Manuscript Received on January 10, 2021.
* Correspondence Author
Megha J Panicker*, Department of Computer Science and Engineering, Delhi Technical Campus, GGSIPU, Delhi, India. meghajp7@gmail.com
Vikas Upadhayay, Department of Computer Science and Engineering, Delhi Technical Campus, GGSIPU, Delhi, India. vikas84uu@gmail.com
Gunjan Sethi, Department of Computer Science and Engineering, Delhi Technical Campus, GGSIPU, Delhi, India. g.sethi@delhitechnicalcampus.ac.in
Vrinda Mathur, Department of Computer Science and Engineering, Delhi Technical Campus, GGSIPU, Delhi, India. vrindamathur1428@gmail.com
© The Authors. Published by Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

II. PROBLEM STATEMENT
The development of image description models began with object detection using static object-class libraries in the image, modelled using statistical language models. [5]
V. RESULT / ANALYSIS
For simplicity, only three images have been subjected to
testing, and the results can be seen in the following images:
1. Path of Img 1:
Flicker8k_Dataset/111537222_07e56d5a30.jpg
Table 1: Comparison between original and predicted values
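For reference, the inference step that produces a predicted caption for a test image such as Img 1 can be sketched as follows. This is a hypothetical illustration, not the authors' exact code: it assumes the trained caption model and fitted tokenizer from the training stage (not shown in this excerpt), along with assumed startseq/endseq caption boundary tokens.

    # Sketch: extract Xception features for one test image and greedily
    # decode a caption. Assumes `model` and `tokenizer` come from a prior
    # training step (not shown here).
    import numpy as np
    from tensorflow.keras.applications.xception import (Xception,
                                                        preprocess_input)
    from tensorflow.keras.preprocessing.image import load_img, img_to_array
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    feature_extractor = Xception(include_top=False, pooling='avg')

    def extract_features(path):
        img = img_to_array(load_img(path, target_size=(299, 299)))
        img = preprocess_input(np.expand_dims(img, axis=0))
        return feature_extractor.predict(img)        # shape (1, 2048)

    def generate_caption(model, tokenizer, photo, max_length=32):
        text = 'startseq'                            # assumed start token
        for _ in range(max_length):
            seq = tokenizer.texts_to_sequences([text])[0]
            seq = pad_sequences([seq], maxlen=max_length)
            probs = model.predict([photo, seq], verbose=0)
            word = tokenizer.index_word.get(int(np.argmax(probs)))
            if word is None or word == 'endseq':     # assumed end token
                break
            text += ' ' + word
        return text.replace('startseq', '').strip()

    photo = extract_features('Flicker8k_Dataset/111537222_07e56d5a30.jpg')
    print(generate_caption(model, tokenizer, photo))

Greedy decoding, as sketched here, simply takes the most probable word at each step; beam search is a common alternative when higher-quality captions are needed.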
VI. CONCLUSION
Based on the results obtained, we can see that the deep learning methodology used here bore successful results. The CNN and the LSTM worked together in proper synchronization and were able to find the relations between objects in images.
To measure the accuracy of the predicted captions, we can compare them with the target captions in our Flickr8k test dataset using the BLEU (Bilingual Evaluation Understudy) score [5, 8]. BLEU scores are used in text translation for evaluating translated text against one or more reference translations.
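As a sketch of this evaluation step, BLEU scores against the Flickr8k reference captions can be computed with NLTK's corpus_bleu; the captions below are hypothetical placeholders rather than actual model outputs.

    # Sketch: scoring predicted captions against Flickr8k references with
    # BLEU. The example captions are hypothetical placeholders.
    from nltk.translate.bleu_score import corpus_bleu

    references = [  # each test image has several human reference captions
        [['a', 'dog', 'runs', 'through', 'the', 'grass'],
         ['a', 'brown', 'dog', 'is', 'running', 'outside']],
    ]
    predictions = [['a', 'dog', 'is', 'running', 'in', 'the', 'grass']]

    print('BLEU-1: %.3f' % corpus_bleu(references, predictions,
                                       weights=(1.0, 0, 0, 0)))
    print('BLEU-2: %.3f' % corpus_bleu(references, predictions,
                                       weights=(0.5, 0.5, 0, 0)))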
Over the years, several other neural network technologies have been used to create hybrid image caption generators similar to the one proposed here, e.g. the VGG16 model instead of the Xception model, or GRU cells instead of LSTM cells. Furthermore, the BLEU score can be used to draw comparisons between these models, to see which one provides maximum accuracy. This paper introduced us to various new developments in the field of machine learning and AI, and how vast this field is. In fact, several topics within this paper are open to further research and development, while the paper itself tries to cover the basic essentials needed to create an image caption generator.

AUTHORS' PROFILES

Megha J Panicker, Graduate in Computer Science and Engineering (2021) from Delhi Technical Campus, GGSIPU, Delhi. Completed high school education at St. Paul's School, New Delhi. Has done several projects in the field of Machine Learning and Artificial Intelligence, including emotion detection from textual data using NLP, image classification, and sentiment analysis.

Vikas Upadhayay, Graduate in B.Tech Computer Science and Engineering (2021) from Delhi Technical Campus, Guru Gobind Singh Indraprastha University, Dwarka, New Delhi. Completed high school education at HMDAV School, New Delhi. A certified Machine Learning enthusiast who has done many projects on image and text classification using deep neural networks (CNN, RNN, and LSTM), and is also interested in JavaScript-based front-end development.

Vrinda Mathur, Graduate in B.Tech Computer Science and Engineering from Delhi Technical Campus, GGSIPU, Dwarka, New Delhi (2021). Completed high school education at Delhi Public School, Mathura Road, New Delhi (2017). Fields of interest are Machine Learning, NLP, Deep Learning, and AI. Previously worked on projects involving sentiment analysis, web scraping, CNNs, and neural networks.

Gunjan Sethi, an experienced Assistant Professor with a demonstrated history of working in the higher education industry. Skilled in Python, Oracle Database, C++, etc. Strong education professional with an M.Tech and B.Tech focused in CSE, and pursuing a PhD.
REFERENCES
1. Haoran Wang, Yue Zhang, and Xiaosheng Yu, "An Overview of Image Caption Generation Methods", Computational Intelligence and Neuroscience (CIN), 2020.
2. B. Krishnakumar, K. Kousalya, S. Gokul, R. Karthikeyan, and D. Kaviyarasu, "Image Caption Generator Using Deep Learning", International Journal of Advanced Science and Technology, 2020.
3. MD. Zakir Hossain, Ferdous Sohel, Mohd Fairuz Shiratuddin, and Hamid Laga, "A Comprehensive Survey of Deep Learning for Image Captioning", ACM, 2019.
4. Rehab Alahmadi, Chung Hyuk Park, and James Hahn, "Sequence-to-sequence image caption generator", ICMV, 2018.
5. Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan, "Show and Tell: A Neural Image Caption Generator", CVPR, 2015.
6. Priyanka Kalena, Nishi Malde, Aromal Nair, Saurabh Parkar, and Grishma Sharma, "Visual Image Caption Generator Using Deep Learning", ICAST, 2019.
7. Pranay Mathur, Aman Gill, Aayush Yadav, Anurag Mishra, and Nand Kumar Bansode, "Camera2Caption: A Real-Time Image Caption Generator", International Conference on Computational Intelligence in Data Science (ICCIDS), 2017.
8. K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, et al., "Show, attend and tell: Neural image caption generation with visual attention", Proceedings of the International Conference on Machine Learning (ICML), 2015.
9. J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
10. D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate", arXiv:1409.0473, 2014.
11. Saad Albawi, Tareq Abed Mohammed, and Saad Al-Zawi, "Understanding of a convolutional neural network", IEEE, 2017.