
International Journal of Innovative Technology and Exploring Engineering (IJITEE)

ISSN: 2278-3075 (Online), Volume-10 Issue-3, January 2021

Image Caption Generator


Megha J Panicker, Vikas Upadhayay, Gunjan Sethi, Vrinda Mathur

Abstract: In the modern era, image captioning has become one of the most widely required tools. Moreover, there are inbuilt applications that generate and provide a caption for a certain image; all of this is done with the help of deep neural network models. The process of generating a description of an image is called image captioning. It requires recognizing the important objects, their attributes, and the relationships among the objects in an image, and generating syntactically and semantically correct sentences. In this paper, we present a deep learning model to describe images and generate captions using computer vision and machine translation. This paper aims to detect different objects found in an image, recognize the relationships between those objects, and generate captions. The dataset used is Flickr8k, the programming language used was Python3, and an ML technique called Transfer Learning is implemented with the help of the Xception model to demonstrate the proposed experiment. This paper also elaborates on the functions and structure of the various neural networks involved. Generating image captions is an important aspect of Computer Vision and Natural Language Processing. Image caption generators can find applications in image segmentation as used by Facebook and Google Photos, and their use can be extended to video frames. They will easily automate the job of a person who has to interpret images. Not to mention, they have immense scope in helping visually impaired people.

Keywords: Image, Caption, CNN, Xception, RNN, LSTM, Neural Networks

I. INTRODUCTION

Making a computer system detect objects and describe them using natural language processing (NLP) is an age-old problem of Artificial Intelligence. This was considered an impossible task by computer vision researchers till now. With the growing advancements in deep learning techniques, the availability of vast datasets, and computational power, models can now be built that generate captions for an image. Image caption generation is a task that involves image processing and natural language processing concepts to recognize the context of an image and describe it in a natural language like English or any other language. While human beings are able to do it easily, it takes a strong algorithm and a lot of computational power for a computer system to do so. Many attempts have been made to simplify this problem and break it down into various simpler problems such as object detection, image classification, and text generation. A computer system takes input images as two-dimensional arrays, and a mapping is done from images to captions or descriptive sentences.

Figure 1: Our model is based on a deep learning neural network that consists of a vision CNN followed by a language-generating RNN. It generates complete sentences as an output.

In recent years a lot of attention has been drawn towards the task of automatically generating captions for images. However, while new datasets often spur considerable innovation, benchmark datasets also require fast, accurate, and competitive evaluation metrics to encourage rapid progress. Being able to automatically describe the content of a picture using properly formed English sentences is a very challenging task, but it could have an excellent impact, for example by helping visually impaired people better understand the content of images online. This task is significantly harder than, for instance, the well-studied image classification or visual perception tasks, which are a main focus within the computer vision community. Deep learning methods have demonstrated advanced results on caption generation problems. What is most impressive about these methods is that a single end-to-end model can be defined to predict a caption, given a photograph, rather than requiring sophisticated data preparation or a pipeline of specifically designed models. Deep learning has attracted a lot of attention because it is particularly good at a kind of learning that has the potential to be very useful for real-world applications. The ability to learn from unlabeled or unstructured data is a huge benefit for those curious about real-world applications. [1 - 4]

Revised Manuscript Received on January 10, 2021.
* Correspondence Author
Megha J Panicker*, Department of Computer Science and Engineering, Delhi Technical Campus GGSIPU, Delhi, India. meghajp7@gmail.com
Vikas Upadhayay, Department of Computer Science and Engineering, Delhi Technical Campus GGSIPU, Delhi, India. vikas84uu@gmail.com
Gunjan Sethi, Department of Computer Science and Engineering, Delhi Technical Campus GGSIPU, Delhi, India. g.sethi@delhitechnicalcampus.ac.in
Vrinda Mathur, Department of Computer Science and Engineering, Delhi Technical Campus GGSIPU, Delhi, India. vrindamathur1428@gmail.com
© The Authors. Published by Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Retrieval Number: 100.1/ijitee.C83830110321 | DOI: 10.35940/ijitee.C8383.0110321 | Journal Website: www.ijitee.org

II. PROBLEM STATEMENT

The main problem in the development of image description started with object detection using static object class libraries in the image, modelled using statistical language models. [5]


A. Making use of a CNN: It is a Deep Learning algorithm that takes a 2D matrix image as input, assigns importance (learnable weights and biases) to different aspects/objects in the image, and is intelligent enough to be able to differentiate one from the other.
B. This model was advantageous in naming the objects in an image, but it could not tell us the relationships among them (that is plain image classification).
C. In this paper, we present a generative model built on a deep recurrent architecture that unites recent advances in computer vision and machine translation and that can effectively generate meaningful sentences.
D. Making use of an RNN: They are networks with loops in them, allowing information to persist. LSTMs are a particular kind of RNN, capable of learning long-term dependencies. [5]

III. PROPOSED METHODOLOGY


A. Task
The task is to build a system that will take an image input in the form of a two-dimensional array and generate an output consisting of a sentence that describes the image and is syntactically and grammatically correct.

B. Corpus
We have used the Flickr8k dataset as the corpus. The dataset consists of 8000 images, and for every image there are 5 captions. The 5 captions for a single image help in understanding all the various possible scenarios. The dataset has a predefined training dataset Flickr_8k.trainImages.txt (6,000 images), development dataset Flickr_8k.devImages.txt (1,000 images), and test dataset Flickr_8k.testImages.txt (1,000 images). The images are chosen from six different Flickr groups and do not contain any well-known personalities or places. However, they were manually selected to show a variety of scenes. [1]
These datasets (size 1 GB) can be directly downloaded from the given links (thanks to Jason Brownlee):
Image Dataset: https://github.com/jbrownlee/Datasets/releases/download/Flickr8k/Flickr8k_Dataset.zip
Text Dataset: https://github.com/jbrownlee/Datasets/releases/download/Flickr8k/Flickr8k_text.zip

Figure 2: Glimpse of the Flickr8k Image Dataset

Figure 3: Glimpse of Flickr8k Text File
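As an illustration, a minimal sketch of loading the caption file and the predefined training split might look as follows. The file names follow the dataset files listed above; the helper names and the tab-separated "image.jpg#0" caption format are our assumptions about the downloaded text archive.

```python
# Hypothetical sketch: load Flickr8k captions and the predefined train split,
# assuming the text archive above was unzipped into "Flickr8k_text/".
def load_captions(token_path):
    """Map each image file name to its list of 5 reference captions."""
    captions = {}
    with open(token_path, encoding="utf-8") as f:
        for line in f:
            parts = line.strip().split("\t")           # "img.jpg#0<TAB>caption"
            if len(parts) != 2:
                continue
            image_id, caption = parts
            image_id = image_id.split("#")[0]           # drop the '#0'..'#4' suffix
            captions.setdefault(image_id, []).append(caption)
    return captions

def load_split(split_path):
    """Read Flickr_8k.trainImages.txt / devImages / testImages into a set of ids."""
    with open(split_path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

captions = load_captions("Flickr8k_text/Flickr8k.token.txt")
train_ids = load_split("Flickr8k_text/Flickr_8k.trainImages.txt")
train_captions = {img: caps for img, caps in captions.items() if img in train_ids}
```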
C. Preprocessing
Data preprocessing is done in two parts: the images and the corresponding captions are cleaned and pre-processed separately. Image preprocessing is done by feeding the input data to the Xception application of the Keras API running on top of TensorFlow. Xception is pre-trained on ImageNet; this helped us train on the images faster with the help of Transfer Learning. The descriptions are cleaned using the Tokenizer class in Keras, which vectorizes the text corpus and stores it in a separate dictionary. Then each word of the vocabulary is mapped to a unique index value.

D. Model
Deep learning carries out the machine learning process using an artificial neural network that is composed of several levels arranged in a hierarchy. The model is based on deep networks where the flow of information starts from the initial level, where the model learns something simple; the output of this level is passed to the second level of the network, where the input is combined into something a bit more complex and passed on to the third level. This process continues as each level in the network produces something more complex from the input it received from the preceding level. [1]

Convolutional Neural Networks (CNN)
Convolutional Neural Networks are specialized deep neural networks that can process data that has an input shape like a 2D matrix. Images can easily be represented as a 2D matrix.


CNN is crucial in working with images. It takes as input an image, assigns importance (weights and biases) to various aspects/objects in the image, and differentiates one from the other. The CNN makes use of filters (also known as kernels) which help in feature learning (detecting abstract concepts like blurring, edge detection, sharpening, etc.), much the same as a human brain identifying objects in time and space. The architecture performs a better fit to the image dataset due to the reduction in the number of parameters involved (from 2048 to 256) and the reusability of weights.

Figure 4: Architecture of Convolutional Neural Networks for object classification. [11]

Recurrent Neural Networks (RNN)
The human brain has evolved in such a way as to make sense of previous words and, keeping these in mind, generate the next words, thus forming a perfect sentence. Basic neural networks do not have the ability to do this. However, advancements in recurrent neural networks address this issue. They are networks with loops in them, allowing information to persist for a while by making use of their internal states, thus creating a feedback loop. [6]

Figure 5: Loop in RNN. [6]

Long Short Term Memory networks, usually just called "LSTMs", are a special kind of RNN, capable of learning long-term dependencies. Remembering information for long periods is practically their default behaviour, and this behaviour is controlled with the help of "gates". While RNNs process single data points, LSTMs can process entire sequences. Not only that, they can learn which points in the data hold importance and which can be thrown away. Hence, only the relevant information is passed on to the next layer.
The 3 main gates involved are the input gate, the output gate, and the forget gate. These gates decide whether to forget the current cell value, read a value into the cell, or output the cell value. The hidden states play an important role since the previous hidden states are passed to the next step of the sequence. The hidden state acts as the neural network's memory, as it stores the data that the neural network has seen before. Thus it allows the neural network to function like a human brain trying to form sentences. [6, 7]

Figure 6: The recurrent module in LSTM contains four interconnected layers. [7]

Architecture
We utilize a CNN + LSTM to take an image as input and output a caption. An "encoder" RNN maps the source sentence (which is of variable length) and transforms it into a fixed-length vector representation, which in turn is used as the initial hidden state of a "decoder" RNN which ultimately generates the final meaningful sentence as a prediction. [7]

Figure 7: CNN-LSTM structure. [5]

However, we are going to replace this encoder RNN with a deep CNN - since it can produce a rich representation of the input image by embedding it into a fixed-length vector - by first pre-training it for an image classification task and using the last hidden layer as an input to the RNN decoder that generates sentences. [5 - 7]

IV. EVALUATION

Execution of the entire program takes place in 5 major steps. The implementation of the five major modules is as follows:

A. Data Cleaning and Preprocessing:
1. For a comfortable and fast working experience, we use Google Colaboratory: a tool which provides free GPU/TPU processing power, instead of our local machines, which can take several hours to do a task that a GPU will do in a few minutes.
2. Our program starts with loading both the text file and the image file into separate variables; the text file is stored in a string.
3. This string is used and manipulated in such a way as to create a dictionary that maps each image to a list of 5 descriptions.


4. The main task of data cleaning involves removing punctuation marks, converting the whole text to lowercase, removing stop words, and removing words that contain numbers.
5. Further, a vocabulary of all unique words from all the descriptions is created, which will later be used to generate captions for test images.
6. Another aspect of preprocessing the data involves tokenizing our vocabulary with a unique index value. This is because a computer will not understand regular English words, hence they need to be represented using numbers. The tokens are then stored in a pickle file, i.e. in the form of a character stream, but with all the information necessary to reconstruct them into the original object type.
7. The above two preprocessing tasks can be achieved manually or by using the keras.preprocessing module for ease of writing the code.
8. We proceed to append the <start> and <end> identifiers to each caption, since these will act as indicators for our LSTM to understand where a caption starts and where it ends.
9. We will proceed with calculating the number of words in our vocabulary and finding the maximum length of the descriptions, which will be used in later phases.
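A minimal sketch of these cleaning and tokenizing steps might look as follows, assuming the train_captions dictionary from the earlier corpus sketch; startseq/endseq are used as stand-ins for the <start>/<end> identifiers because the Keras Tokenizer strips angle brackets by default, and the cleaning rules shown are illustrative.

```python
# Hedged sketch of steps 4-9: clean captions, build the vocabulary, tokenize, and pickle.
import pickle
import string

from tensorflow.keras.preprocessing.text import Tokenizer

def clean_caption(text):
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))   # remove punctuation
    words = [w for w in text.split() if len(w) > 1 and w.isalpha()]    # drop short/numeric tokens
    return " ".join(words)

# Wrap each cleaned caption with start/end markers for the LSTM.
cleaned = {img: ["startseq " + clean_caption(c) + " endseq" for c in caps]
           for img, caps in train_captions.items()}

all_captions = [c for caps in cleaned.values() for c in caps]
tokenizer = Tokenizer(oov_token="<unk>")
tokenizer.fit_on_texts(all_captions)                  # word -> unique index mapping

vocab_size = len(tokenizer.word_index) + 1            # +1 for the padding index 0
max_length = max(len(c.split()) for c in all_captions)

with open("tokenizer.p", "wb") as f:                  # persisted for the testing notebook
    pickle.dump(tokenizer, f)
```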

Figure 8: Text file after performing data cleaning

B. Extraction of feature vectors:
1. A feature vector (or simply a feature) is a numerical value in matrix form, containing information about an object's important characteristics, e.g. the intensity value of each pixel of the image in our case. These vectors we will ultimately store in a pickle file.
2. In our model we will be using Transfer Learning, which simply means using a pretrained model (in our case the Xception model) to extract features.
3. The Xception model is a Convolutional Neural Network that is 71 layers deep. It is trained on the famous ImageNet dataset, which has millions of images and over 1000 different classes to classify from.
4. Python makes using this model in our code extremely easy with the keras.applications.xception module. To use it in our code, we drop the classification layer from it, and hence obtain the 2048-dimensional feature vector.
5. Hence weights will be downloaded for each image, and then image names will be mapped with their respective feature arrays.
6. This process can take a few hours depending on your processor.
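A minimal sketch of this feature-extraction step, assuming the images were unzipped into a local Flicker8k_Dataset/ folder and that global average pooling is used to obtain the 2048-dimensional vector, might be:

```python
# Hedged sketch: extract a 2048-dimensional Xception feature vector per image.
import os
import pickle

import numpy as np
from tensorflow.keras.applications.xception import Xception, preprocess_input
from tensorflow.keras.preprocessing.image import img_to_array, load_img

# Xception without its classification head; global average pooling gives 2048 values.
extractor = Xception(include_top=False, pooling="avg", weights="imagenet")

features = {}
image_dir = "Flicker8k_Dataset"
for name in os.listdir(image_dir):
    img = load_img(os.path.join(image_dir, name), target_size=(299, 299))
    x = preprocess_input(np.expand_dims(img_to_array(img), axis=0))
    features[name] = extractor.predict(x, verbose=0)[0]   # shape (2048,)

with open("features.p", "wb") as f:                        # image name -> feature vector
    pickle.dump(features, f)
```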
Figure 9: Glimpse of extracted features with corresponding image names, that we'll store in a pickle file

C. Layering the CNN-RNN model:

Figure 10: Structure of the Neural Network

To stack the model, we'll use the Keras Model from the Functional API. The structure will consist of 3 parts:
1. Feature Extractor: It will be used to reduce the dimensions from 2048 to 256. We'll make use of a Dropout layer; one of these will be added to the CNN branch and the LSTM branch each. We have pre-processed the photos with the Xception model (without the output layer) and will use the extracted features predicted by this model as input.
2. Sequence Processor: This Embedding layer will handle the textual input, followed by the LSTM layer.
3. Decoder: We will merge the output from the above two layers and use a Dense layer to make the final predictions. Both the feature extractor and the sequence processor output a fixed-length vector. These are merged together and processed by a Dense layer. The number of nodes in the final layer will be the same as the size of our vocabulary. [6, 7]
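A sketch of this three-part structure with the Keras Functional API might look as follows, assuming the vocab_size and max_length values computed during preprocessing; the layer widths and dropout rate are illustrative choices consistent with the description above, not necessarily the exact configuration used.

```python
# Hedged sketch of the feature extractor + sequence processor + decoder model.
from tensorflow.keras.layers import LSTM, Dense, Dropout, Embedding, Input, add
from tensorflow.keras.models import Model

def define_model(vocab_size, max_length):
    # Feature extractor: 2048-d Xception features squeezed down to 256.
    image_input = Input(shape=(2048,))
    fe1 = Dropout(0.5)(image_input)
    fe2 = Dense(256, activation="relu")(fe1)

    # Sequence processor: embedded caption prefix fed to an LSTM.
    text_input = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 256, mask_zero=True)(text_input)
    se2 = Dropout(0.5)(se1)
    se3 = LSTM(256)(se2)

    # Decoder: merge both branches and predict the next word over the vocabulary.
    merged = add([fe2, se3])
    hidden = Dense(256, activation="relu")(merged)
    outputs = Dense(vocab_size, activation="softmax")(hidden)

    model = Model(inputs=[image_input, text_input], outputs=outputs)
    model.compile(loss="categorical_crossentropy", optimizer="adam")
    return model
```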


D. Training the model:

Figure 11: Model under Training

1. We'll be training our model on 6000 images, each having a 2048-dimensional feature vector.
2. Since it is not possible to hold all this data in memory at the same time, we'll make use of a data generator. This will help us create batches of the data and will improve the speed.
3. Along with this, we'll be defining the number of epochs (i.e. iterations over the training dataset) the model has to complete during its training. This number has to be selected in such a way that our model is neither underfitted nor overfitted.
4. The model.fit_generator() method will be used, and this whole process will take some time depending on the processor.
5. The maximum length of descriptions calculated earlier will be used as a parameter value here. It will also take as input the clean and tokenized data.
6. We'll also create a sequence creator, which will play the role of predicting the next word based on the previous words and the feature vector of the image.
7. While training our model, we can use the development dataset (it is provided with the rest of the files) to monitor the performance of the model and decide when to save the model version to file.
8. We will proceed to save several models, out of which the final one will be used for testing in the future.
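As a sketch, and assuming the features, cleaned captions, tokenizer, vocab_size, max_length, and define_model helpers from the earlier sketches, the generator and training loop might be organized roughly like this; batching one image's caption prefixes per step is an illustrative choice, not necessarily the exact scheme used.

```python
# Hedged sketch: progressive-loading data generator plus the training loop.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

def data_generator(captions, features, tokenizer, max_length, vocab_size):
    while True:                                            # Keras expects an endless generator
        for image_id, caps in captions.items():
            feature = features[image_id]
            X1, X2, y = [], [], []
            for cap in caps:
                seq = tokenizer.texts_to_sequences([cap])[0]
                for i in range(1, len(seq)):               # every prefix predicts the next word
                    X1.append(feature)
                    X2.append(pad_sequences([seq[:i]], maxlen=max_length)[0])
                    y.append(to_categorical([seq[i]], num_classes=vocab_size)[0])
            yield [np.array(X1), np.array(X2)], np.array(y)

model = define_model(vocab_size, max_length)
epochs = 10                                                # chosen so the model neither under- nor overfits
steps = len(cleaned)                                       # one image's prefixes per step
for i in range(epochs):
    generator = data_generator(cleaned, features, tokenizer, max_length, vocab_size)
    model.fit_generator(generator, epochs=1, steps_per_epoch=steps, verbose=1)
    model.save("model_" + str(i) + ".h5")                  # keep several checkpoints
```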

Figure 12: Model details

E. Testing the model:
1. A separate Python notebook can be created, or the same one can be used to perform testing. Either way, we'll load the trained model that we saved in the previous step and generate predictions.
2. The sequence generator will come into play at this stage, besides the tokenizer file we created.
3. The primary step of feature extraction for the particular image under observation will be performed.
4. The path of one of the images from the remaining 2000 test images is passed to the function manually. You can also iterate through the test dataset and store the prediction for each image in a dictionary or a list.
5. The actual functioning behind caption generation involves using the start sequence and the end sequence, and calling the model recursively to generate meaningful sentences.
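A minimal sketch of this recursive generation step, assuming the trained model, the pickled tokenizer, max_length, and a 2048-dimensional Xception feature for the test image (with the startseq/endseq markers from the cleaning sketch), might be:

```python
# Hedged sketch: greedy, word-by-word caption generation for one test image.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_caption(model, tokenizer, photo_feature, max_length):
    index_word = {i: w for w, i in tokenizer.word_index.items()}
    in_text = "startseq"
    for _ in range(max_length):
        seq = tokenizer.texts_to_sequences([in_text])[0]
        seq = pad_sequences([seq], maxlen=max_length)
        yhat = model.predict([np.array([photo_feature]), seq], verbose=0)
        word = index_word.get(int(np.argmax(yhat)))
        if word is None or word == "endseq":               # stop at the end marker
            break
        in_text += " " + word
    return in_text.replace("startseq", "").strip()
```

Greedy argmax decoding is used here for simplicity; beam search is a common alternative when slightly better captions are worth the extra computation.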

V. RESULT / ANALYSIS

For simplicity, only three images have been subjected to testing, and the results can be seen in the following figures:
1. Path of Img 1: Flicker8k_Dataset/111537222_07e56d5a30.jpg
Output:
Figure 13: Caption generated using deep neural network for input Image 1
2. Path of Img 2: Flicker8k_Dataset/256085101_2c2617c5d0.jpg
Output:
Figure 14: Caption generated using deep neural network for input Image 2
3. Path of Img 3: Flicker8k_Dataset/3344233740_c010378da7.jpg
Output:
Figure 15: Caption generated using deep neural network for input Image 3

Table 1: Comparison between original and predicted values


VI. CONCLUSION

Based on the results obtained, we can see that the deep learning methodology used here bore successful results. The CNN and the LSTM worked together in proper synchronization; they were able to find the relation between objects in images.
To compare the accuracy of the predicted captions, we can compare them with the target captions in our Flickr8k test dataset, using the BLEU (Bilingual Evaluation Understudy) score [5, 8]. BLEU scores are used in text translation for evaluating translated text against one or more reference translations. Over the years, several other neural network technologies have been used to create hybrid image caption generators similar to the one proposed here, e.g. the VGG16 model instead of the Xception model, or the GRU model instead of the LSTM model. Furthermore, the BLEU score can be used to draw comparisons between these models, to see which one provides maximum accuracy. This paper introduced us to various new developments in the field of machine learning and AI, and how vast this field is. In fact, several topics within this paper are open to further research and development, while this paper itself tries to cover the basic essentials needed to create an image caption generator.
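For instance, assuming NLTK is installed and the helpers from the earlier sketches are available (with test_ids loaded the same way as the training split), corpus-level BLEU scores over the test split could be computed roughly as follows:

```python
# Hedged sketch: BLEU-1 and BLEU-2 over the Flickr8k test split with NLTK.
from nltk.translate.bleu_score import corpus_bleu

references, hypotheses = [], []
for image_id in test_ids:                                   # test split loaded like the train split
    refs = [clean_caption(c).split() for c in captions[image_id]]
    pred = generate_caption(model, tokenizer, features[image_id], max_length).split()
    references.append(refs)
    hypotheses.append(pred)

print("BLEU-1: %.3f" % corpus_bleu(references, hypotheses, weights=(1.0, 0, 0, 0)))
print("BLEU-2: %.3f" % corpus_bleu(references, hypotheses, weights=(0.5, 0.5, 0, 0)))
```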
REFERENCES
1. Haoran Wang, Yue Zhang, and Xiaosheng Yu, "An Overview of Image Caption Generation Methods", (CIN, 2020).
2. B. Krishnakumar, K. Kousalya, S. Gokul, R. Karthikeyan, and D. Kaviyarasu, "Image Caption Generator Using Deep Learning", (International Journal of Advanced Science and Technology, 2020).
3. MD. Zakir Hossain, Ferdous Sohel, Mohd Fairuz Shiratuddin, and Hamid Laga, "A Comprehensive Survey of Deep Learning for Image Captioning", (ACM, 2019).
4. Rehab Alahmadi, Chung Hyuk Park, and James Hahn, "Sequence-to-sequence image caption generator", (ICMV, 2018).
5. Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan, "Show and Tell: A Neural Image Caption Generator", (CVPR, 2015).
6. Priyanka Kalena, Nishi Malde, Aromal Nair, Saurabh Parkar, and Grishma Sharma, "Visual Image Caption Generator Using Deep Learning", (ICAST, 2019).
7. Pranay Mathur, Aman Gill, Aayush Yadav, Anurag Mishra, and Nand Kumar Bansode, "Camera2Caption: A Real-Time Image Caption Generator", International Conference on Computational Intelligence in Data Science (ICCIDS), 2017.
8. K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, et al., "Show, attend and tell: Neural image caption generation with visual attention", Proceedings of the International Conference on Machine Learning (ICML), 2015.
9. J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
10. D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate", arXiv:1409.0473, 2014.
11. Saad Albawi, Tareq Abed Mohammed, and Saad Al-Zawi, "Understanding of a convolutional neural network", IEEE, 2017.

AUTHORS' PROFILES

Megha J Panicker, Graduate in Computer Science and Engineering (2021) from Delhi Technical Campus, GGSIPU, Delhi. Completed high school education at St. Paul's School, New Delhi. Has done several projects in the field of Machine Learning and Artificial Intelligence; some of the projects are Emotion Detection from textual data using NLP, Image Classification, and Sentiment Analysis.

Vikas Upadhayay, Graduate in B.Tech Computer Science and Engineering (2021) from Delhi Technical Campus, Guru Gobind Singh Indraprastha University, Dwarka, New Delhi. Completed high school education at HMDAV School, New Delhi. A certified Machine Learning enthusiast who has done many projects based on image and text classification using deep neural networks (CNN, RNN, and LSTM); also interested in JavaScript-based front-end development.

Vrinda Mathur, Graduate in B.Tech Computer Science and Engineering from Delhi Technical Campus, GGSIPU, Dwarka, New Delhi (2021). Completed high school education at Delhi Public School, Mathura Road, New Delhi (2017). Fields of interest are Machine Learning, NLP, Deep Learning, and AI. Previously worked on projects involving Sentiment Analysis, Web Scraping, CNNs, and Neural Networks.

Gunjan Sethi is an experienced Assistant Professor with a demonstrated history of working in the higher education industry. Skilled in Python, Oracle Database, C++, etc. A strong education professional with an M.Tech and B.Tech focused in CSE, and currently pursuing a PhD.

