Final Report
BACHELOR OF TECHNOLOGY
IN
COMPUTER ENGINEERING
CERTIFICATE
The work is submitted in partial fulfilment of the requirements for the award
of the degree of Bachelor of Technology in Computer Engineering under my
guidance. The matter embodied in this project work has not been submitted
earlier for the award of any degree or diploma to the best of my knowledge
and belief.
ACKNOWLEDGMENT
We would like to extend special gratitude to the assistants and lab
coordinators of the Department for providing us with the infrastructural
facilities necessary to sustain the project.
ABSTRACT
Developing a sign language application for deaf people can be very important,
as it enables them to communicate easily even with those who do not understand
sign language. Our project aims at taking a basic step towards bridging the
communication gap between deaf and dumb people, who use sign language, and
those who do not understand it.
The main focus of this work is to create a vision-based system to identify sign
language gestures from video sequences. The reason for choosing a vision-based
system is that it provides a simpler and more intuitive way of communication
between a human and a computer. In this report, 46 different gestures have been
considered.
Video sequences contain both temporal and spatial features, so we have used two
different models to learn them. To learn the spatial features of the video
sequences we have used the Inception model [14], a deep convolutional neural
network (CNN), trained on the frames extracted from the training videos. To
learn the temporal features we have used a recurrent neural network (RNN). The
trained CNN model was used to make predictions for individual frames, giving
either a sequence of class predictions or a sequence of pool layer outputs for
each video. This sequence was then given to the RNN to train on the temporal
features. The data set [7] used consists of Argentinian Sign Language (LSA)
gestures, with around 2300 videos belonging to 46 gesture categories. Using the
CNN predictions as input to the RNN, an accuracy of 93.3% was obtained; using
the pool layer outputs as input to the RNN, an accuracy of 95.217% was
obtained.
TABLE OF CONTENTS
1 Introduction 8
2 Literature Survey 10
3.2.3.2 Exploding Gradient 26
3.2.4 Long Short Term Memory Units 26
3.2.5 Our RNN Model 27
4 Experimental Design 28
5 Results 33
LIST OF FIGURES
1. INTRODUCTION
________________________________________________
Motion of any body part, such as the face or a hand, is a form of gesture.
Here, for gesture recognition, we use image processing and computer vision.
Gesture recognition enables computers to understand human actions and also acts
as an interpreter between the computer and the human. This gives humans the
potential to interact naturally with computers, without any physical contact
with mechanical devices. The deaf and dumb community performs gestures to
communicate in sign language. This community uses sign language when
broadcasting audio is impossible, or typing and writing are difficult, but
vision is possible; in such situations sign language is the only way of
exchanging information between people. Sign language may be used by anyone who
does not want to speak, but for the deaf and dumb community it is the only
means of communication. Sign language serves the same purpose as spoken
language. It is used by the deaf and dumb community all over the world, but in
regional forms such as ISL and ASL. Sign language gestures can be performed
with either one hand or two hands, and sign language is of two types: isolated
sign language and continuous sign language. Isolated sign language consists of
a single gesture representing a single word, while continuous sign language is
a sequence of gestures that generates a meaningful sentence. In this report we
perform isolated sign language gesture recognition.
Deaf people around the world communicate using sign language, which is distinct
from the spoken language of their everyday surroundings: it is a visual
language that uses a system of manual, facial and body movements as the means
of communication. Sign language is not a universal language; different sign
languages are used in different countries, like the many spoken languages all
over the world. Some countries, such as Belgium, the UK, the USA or India, may
have more than one sign language. Hundreds of sign languages are in use around
the world, for instance Japanese Sign Language, British Sign Language (BSL),
Spanish Sign Language and Turkish Sign Language.
Fingerspelling: used to spell words letter by letter.
Word level sign vocabulary: used for the majority of communication.
Non-manual features: facial expressions and tongue, mouth and body position.
2. LITERATURE SURVEY
________________________________________________
In recent years, there has been tremendous research on hand sign language
gesture recognition. The main technology used for gesture recognition is
described below.
2.1 Vision-based
In vision-based methods a computer camera is the input device for observing the
information of the hands or fingers. Vision-based methods require only a
camera, thus realizing a natural interaction between humans and computers
without the use of any extra devices. These systems tend to complement
biological vision by describing artificial vision systems that are implemented
in software and/or hardware. This poses a challenging problem, as these systems
need to be background invariant, lighting insensitive, and person and camera
independent in order to achieve real-time performance. Moreover, such systems
must be optimized to meet requirements including accuracy and robustness.
Vision-based analysis is based on the way human beings perceive information
about their surroundings, yet it is probably the most difficult to implement in
a satisfactory way. Several different approaches have been tested so far.
2. The second approach is to capture the image using a camera, extract features
from it, and use those features as input to a classification algorithm.
2.1.2 Automatic Indian Sign Language Recognition for Continuous
Video Sequence [2]
Euclidean distance, correlation, Manhattan distance, city block distance, etc.
A comparative analysis of their proposed scheme was performed with these
various distance classifiers. From this analysis they found that correlation
and Euclidean distance give better accuracy than the other classifiers.
2.1.4 Recognition of isolated Indian Sign Language Gesture in Real
Time [4]
3. ALGORITHMS
________________________________________________
CNNs have repetitive blocks of neurons that are applied across space (for
images) or time (for audio signals, etc.). For images, these blocks of neurons
can be interpreted as 2D convolutional kernels, repeatedly applied over each
patch of the image. For speech, they can be seen as 1D convolutional kernels
applied across time windows. At training time, the weights for these repeated
blocks are 'shared', i.e. the weight gradients learned over various image
patches are averaged.
There are four main steps in a CNN: convolution, subsampling, activation and
full connectedness.
3.1.1.1 Convolution
The first layers that receive an input signal are called convolution filters.
Convolution is a process where the network tries to label the input signal by
referring to what it has learned in the past. If the input signal looks like
previous cat images it has seen before, the “cat” reference signal will be mixed
into, or convolved with, the input signal. The resulting output signal is then
passed on to the next layer.
Fig 9: Convolving Wally with a circle filter. The circle filter responds
strongly to the eyes.
For example, suppose we convolve a 32x32x3 input (a 32x32 image with 3
channels, R, G and B) with a 5x5x3 filter. We take the 5x5x3 filter, slide it
over the complete image, and along the way take the dot product between the
filter and chunks of the input image.
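This sliding dot product can be written out directly. The following is a
minimal sketch (array names and random values are illustrative, not the
report's code), assuming stride 1 and no padding, which gives a 28x28
activation map:

```python
import numpy as np

# Sliding-window convolution of a 32x32x3 input with one 5x5x3 filter.
image = np.random.rand(32, 32, 3)    # hypothetical input image
kernel = np.random.rand(5, 5, 3)     # one randomly initialised filter

out_size = 32 - 5 + 1                # 28
activation_map = np.zeros((out_size, out_size))

for y in range(out_size):
    for x in range(out_size):
        patch = image[y:y + 5, x:x + 5, :]             # 5x5x3 chunk of the input
        activation_map[y, x] = np.sum(patch * kernel)  # dot product with the filter
```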
Fig 10: Dot Product of Filter with single chunk of Input Image[12]
Fig 11: Dot Product or Convolve over all possible 5x5 spatial location in
Input Image[12]
Fig 12: Input Image Convolving with a Convolutional layer of 6
independent filters[12]
The CNN may consist of several convolutional layers, each of which can have the
same or a different number of independent filters. For example, the following
diagram shows the effect of two convolutional layers having 6 and 10 filters
respectively.
Fig 13: Input Image Convolving with two Convolutional layers having 6
and 10 filters respectively[12]
All these filters are initialized randomly and become our parameters which
will be learned by the network subsequently.
3.1.1.2 Subsampling
Fig 14: Sub sampling Wally by 10 times. This creates a lower resolution
image.
3.1.1.3 Pooling
Fig 16: Max Pooling [12]
3.1.1.4 Activation
The activation layer controls how the signal flows from one layer to the next,
emulating how neurons are fired in our brain. Output signals which are
strongly associated with past references would activate more neurons,
enabling signals to be propagated more efficiently for identification.
The last layers in the network are fully connected, meaning that neurons of
preceding layers are connected to every neuron in subsequent layers. This
mimics high level reasoning where all possible pathways from the input to
output are considered.
When training the neural network, there is an additional layer called the loss
layer. This layer provides feedback to the neural network on whether it
identified inputs correctly, and if not, how far off its guesses were. This
helps to guide the neural network to reinforce the right concepts as it trains.
It is always the last layer during training.
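As an illustration of "how far off its guesses were" (consistent with the
categorical cross-entropy loss used later in this report, though not spelled
out here), the loss for one example with one-hot label y and predicted class
probabilities p is:

$$ L = -\sum_{c} y_c \log p_c $$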
3.1.2 Implementation
Algorithms used in training a CNN are analogous to studying for exams using
flashcards. First, you draw several flashcards and check whether you have
mastered the concepts on each card. Cards with concepts that you already know
are discarded; cards with concepts that you are unsure of are put back into the
pile. Repeat this process until you are fairly certain that you know enough
concepts to do well in the exam. This method allows you to focus on less
familiar concepts by revisiting them often. Formally, these algorithms are
gradient descent algorithms.
Modern deep learning algorithms use a variation called stochastic gradient
descent, where instead of drawing the flashcards sequentially, you draw them at
random. If similar topics are drawn in sequence, the learner might overestimate
how well they know the topic. The random approach helps to minimize any form of
bias in the learning of topics.
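As a concrete, hypothetical illustration of the "draw at random" idea, the
sketch below applies stochastic gradient descent to a simple linear model; the
dataset, learning rate and squared-error loss are placeholders, not the
report's actual training setup:

```python
import numpy as np

# Minimal stochastic gradient descent on a linear model with squared error.
def sgd(X, y, lr=0.01, epochs=10):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in np.random.permutation(len(X)):   # draw examples in random order
            grad = (X[i] @ w - y[i]) * X[i]        # gradient of the squared error
            w -= lr * grad                         # step against the gradient
    return w
```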
CNNs are complex to implement from scratch. Today, machine learning
practitioners often use toolboxes such as Caffe, Torch, MatConvNet and
TensorFlow for their work.
We have used the Inception v3 model with the TensorFlow library. Inception is a
huge image classification model with millions of parameters that can
differentiate a large number of kinds of images. We only trained the final
layer of that network, so training ends in a reasonable amount of time.
Inception v3 was trained for the ImageNet Large Scale Visual Recognition
Challenge using the data from 2012, where it reached a top-5 error rate as low
as 3.46%.
Fig 17: Inception v3 model Architecture
The kinds of information that make it possible for the model to differentiate
among 1,000 classes are also useful for distinguishing other objects. By using
this pretrained network, we are using that information as input to the final
classification layer that distinguishes our dataset.
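A sketch of this "retrain only the final layer" idea is shown below using the
Keras InceptionV3 application. The report itself used the TensorFlow Inception
v3 retraining workflow, so the exact API, image size and 46-class head here are
assumptions for illustration only:

```python
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras import layers, models

# Freeze a pretrained Inception network and train only a new final
# classification layer for the 46 gesture classes (assumed setup).
base = InceptionV3(weights="imagenet", include_top=False, pooling="avg",
                   input_shape=(299, 299, 3))
base.trainable = False                       # keep the pretrained weights fixed

model = models.Sequential([
    base,                                    # emits a 2048-d pool vector per frame
    layers.Dense(46, activation="softmax"),  # new final layer for the 46 gestures
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_frames, train_labels, ...)  # placeholders for the frame data
```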
Humans don’t start their thinking from scratch every second; we don’t throw
everything away and start thinking from scratch again. Our thoughts have
persistence. Traditional neural networks can’t do this, but recurrent neural
networks can. There is information in the sequence itself, and recurrent nets
use it to perform tasks that feedforward networks can’t. Recurrent networks are
distinguished from feedforward networks by the fact that they have a feedback
loop, ingesting their own outputs moment after moment as input. They are
especially useful with sequential data because each neuron or unit can use its
internal memory to maintain information about the previous input.
As you read through a sentence, even as a human, you pick up the context of
each word from the words before it.
An RNN has loops in it that allow information to be carried across neurons
while reading in input. In the following diagram, a chunk of recurrent neural
network, A, looks at some input x_t and outputs a value h_t. The loop allows
information to be passed from one step of the network to the next. The decision
a recurrent net reached at time step t-1 affects the decision it will reach one
moment later at time step t. So recurrent networks have two sources of input,
the present and the recent past, which combine to determine how they respond to
new data.
This chain-like nature reveals that recurrent neural networks are intimately
related to sequences and lists. They are the natural neural network
architecture to use for such data. The sequential information is preserved in
the recurrent network’s hidden state, which manages to span many time steps as
it cascades forward to affect the processing of each new example.
3.2.2 How Memory of Previous Inputs Is Carried Forward
The hidden state at time step t is h_t. It is a function of the input at the
same time step, x_t, modified by a weight matrix W, added to the hidden state
of the previous time step, h_{t-1}, multiplied by its own
hidden-state-to-hidden-state matrix U, called the transition matrix. The weight
matrices are filters that determine how much importance to accord to both the
present input and the past hidden state. The error they generate can be used to
adjust their weights using Backpropagation Through Time (BPTT). The sum of the
weighted input and hidden state is squashed by the function φ, either a
logistic sigmoid function or tanh.
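In symbols, the update described above is the standard simple-RNN step
(notation as in the paragraph above):

$$ h_t = \varphi\left(W x_t + U h_{t-1}\right) $$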
Because this feedback loop occurs at every time step in the series, each hidden
state contains traces not only of the previous hidden state, but also of all
those that preceded h_{t-1}, for as long as memory can persist.
3.2.3 Exploding and Vanishing Gradient Problem
The gradients of the network's output with respect to the parameters in the
early layers become extremely small. In other words, even a large change in the
value of the parameters of the early layers does not have a big effect on the
output, so the network cannot learn these parameters effectively.
This happens because the activation functions (sigmoid or tanh) squash their
input into a very small output range in a very non-linear fashion. For example,
the sigmoid maps the real number line onto the "small" range [0, 1]. As a
result, there are large regions of the input space which are mapped to an
extremely small output range. In these regions, even a large change in the
input produces only a small change in the output; hence the gradient is small.
This becomes much worse when we stack multiple layers of such non-linearities
on top of each other. For instance, the first layer will map a large input
region to a smaller output region, which will be mapped to an even smaller
region by the second layer, which will be mapped to an even smaller region by
the third layer, and so on. As a result, even a large change in the parameters
of the first layer does not change the output much.
In Fig _ we can see the effects of applying a sigmoid function over and
over again. The data is flattened until, for large stretches, it has no detectable
slope. This is analogous to a gradient vanishing as it passes through many
layers.
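This flattening can also be checked numerically. The sketch below (illustrative
values, not from the report) chains the sigmoid's derivative through ten
stacked layers; since each factor is at most 0.25, the product shrinks rapidly:

```python
import numpy as np

# Chain-rule product of sigmoid derivatives through 10 stacked layers.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x, grad = 1.5, 1.0
for _ in range(10):
    s = sigmoid(x)
    grad *= s * (1.0 - s)   # local derivative of the sigmoid, at most 0.25
    x = s                   # this layer's output feeds the next layer
print(grad)                 # roughly 1e-7: the gradient has effectively vanished
```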
3.2.4 Long Short Term Memory Units
LSTMs help preserve the error that can be backpropagated through time and
layers. By maintaining a more constant error, they allow recurrent nets to
continue to learn over many time steps (over 1000), thereby opening a channel
to link causes and effects remotely.
3.2.5 Our RNN Model
We have created an RNN model based on LSTMs. The first layer is an input layer
used to feed input to the following layers; its size is determined by the size
of the input being fed. Our model is a wide network consisting of a single
layer of 256 LSTM units. This layer is followed by a fully connected layer with
softmax activation; in a fully connected layer, every neuron is connected to
every neuron of the previous layer. The fully connected layer has as many
neurons as there are categories/classes. Finally, a regression layer applies a
regression (linear or logistic) to the provided input. We used Adam [8]
(Adaptive Moment Estimation), a stochastic optimizer, as the gradient descent
optimizer to minimize the provided loss function, categorical cross-entropy
(which calculates the error).
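A sketch of this architecture is given below using the tflearn API, which
matches the input/LSTM/fully-connected/regression vocabulary used above. The
sequence length and feature dimension are placeholders, since they depend on
which of the two approaches (prediction vectors or pool layer outputs) is used:

```python
import tflearn

# Wide LSTM network: input -> 256 LSTM units -> fully connected softmax -> regression.
num_frames, feature_dim = 50, 2048   # placeholder shapes, not the report's exact values

net = tflearn.input_data(shape=[None, num_frames, feature_dim])
net = tflearn.lstm(net, 256)
net = tflearn.fully_connected(net, 46, activation='softmax')  # one neuron per gesture class
net = tflearn.regression(net, optimizer='adam',
                         loss='categorical_crossentropy')
model = tflearn.DNN(net)
# model.fit(X_train, y_train, validation_set=(X_test, y_test), n_epoch=...)
```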
We also tried a wider RNN network with 512 LSTM units and a deeper RNN network
with three layers of 64 LSTM units each. We tested these on a sample of the
dataset and found that the wide model with 256 LSTM units performed the best,
and therefore only the wide model was used for training and testing on the
complete dataset.
4. EXPERIMENTAL DESIGN
________________________________________________
We have used two approaches to train the model on the temporal and the spatial
features. The approaches differ in the inputs given to the RNN to train it on
the temporal features.
The data set [7] used for both approaches consists of Argentinian Sign Language
(LSA) gestures, with around 2300 videos belonging to 46 gesture categories. Ten
non-expert subjects each performed 5 repetitions of every gesture, thereby
producing 50 videos per category or gesture.
Out of the 50 videos per category, 40 (80%) were used for training and 10 (20%)
were used for testing.
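A sketch of this per-category split is shown below; the helper name and random
shuffling are illustrative only, as the report does not detail how the 40/10
selection was made:

```python
import random

# Split the 50 videos of each gesture into 40 training and 10 test videos.
def split_videos(videos_by_gesture, n_test=10, seed=0):
    rng = random.Random(seed)
    train, test = {}, {}
    for gesture, videos in videos_by_gesture.items():
        shuffled = rng.sample(videos, len(videos))
        test[gesture] = shuffled[:n_test]
        train[gesture] = shuffled[n_test:]
    return train, test
```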
4.2.1 Methodology
● First, we extract the frames from the multiple video sequences of each
gesture.
● Next, the noise in each frame, i.e. the background and any body parts other
than the hands, is removed so that more relevant features can be extracted from
the frame.
● Frames of the training data are given to the CNN model for training on the
spatial features. We have used the Inception model, a deep neural net, for this
purpose.
● The train and test frame predictions are stored, using the model obtained in
the previous step to make the predictions for the frames.
● The predictions for the training data are then given to the RNN model for
training on the temporal features. We have used an LSTM-based model for this
purpose.
Fig 23
In further subsections of this section, each step of the methodology has been
shown diagrammatically for better understanding of that step.
4.2.1.1 Frame Extraction and Background Removal
Each gesture video is broken down into a sequence of frames. The frames are
then processed to remove all noise from the image, that is, everything except
the hands.
The final image is a grayscale image of the hands, to avoid colour-specific
learning by the model.
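A rough sketch of this step using OpenCV is shown below; the skin-colour
threshold values are illustrative assumptions and not the report's actual
segmentation method:

```python
import cv2

# Break a gesture video into frames and keep only the hand regions,
# returning grayscale images (hypothetical thresholds, for illustration).
def extract_hand_frames(video_path):
    frames = []
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        mask = cv2.inRange(hsv, (0, 40, 60), (25, 255, 255))   # crude skin mask
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        frames.append(cv2.bitwise_and(gray, gray, mask=mask))  # hands only, grayscale
    cap.release()
    return frames
```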
Fig 26
The first row in the illustration below is the video of the gesture 'Elephant'.
The second row shows the set of frames extracted from it. The third row shows
the sequence of predictions made for each frame by the CNN after training it.
Fig 27
4.2.1.3 Training RNN (Temporal Features)
Fig 28
4.2.2 Limitations
In this approach we have used the CNN to train the model on the spatial
features and have given the output of the pool layer, before it is turned into
a prediction, to the RNN. The pool layer gives us a 2048-dimensional vector
that represents the convolutional features of the image, but not a class
prediction.
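Under the same Keras assumptions as the earlier sketch, extracting this
2048-dimensional pool vector for every frame of a video might look like the
following; frame preprocessing and shapes are illustrative:

```python
import numpy as np
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.applications.inception_v3 import preprocess_input

# Turn one video's frames (resized to 299x299x3) into the sequence of
# 2048-d pool-layer vectors that is fed to the RNN (illustrative sketch).
extractor = InceptionV3(weights="imagenet", include_top=False, pooling="avg")

def video_to_sequence(frames):
    batch = preprocess_input(np.stack(frames).astype("float32"))
    return extractor.predict(batch)     # shape: (num_frames, 2048)
```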
The rest of the steps of this approach are the same as in the first approach;
the two approaches differ only in the inputs given to the RNN.
6. RESULTS
________________________________________________
Fig 29
The average accuracy obtained using the first approach (CNN predictions as RNN input) is 93.3333%.
With the second approach (pool layer outputs as RNN input), 438 of the 460
gestures (10 per category) used for testing were recognized correctly, giving
an average accuracy of 95.217%.
Fig 30: Accuracy
The second approach provided better accuracy than the first because in the
first approach the input to the RNN was a sequence of 46-dimensional prediction
vectors, while in the second approach the RNN was given a sequence of
2048-dimensional pool layer outputs. This gave the RNN many more feature points
with which to distinguish among different videos.
7. CONCLUSION AND FUTURE WORK
________________________________________________
Hand gestures are a powerful means of human communication, with many potential
applications in the area of human-computer interaction. Vision-based hand
gesture recognition techniques have many proven advantages compared with
traditional devices. However, hand gesture recognition is a difficult problem,
and the current work is only a small contribution towards achieving the results
needed in the field of sign language gesture recognition. This report presented
a vision-based system able to interpret isolated hand gestures from the
Argentinian Sign Language (LSA).
Videos are difficult to classify because they contain both temporal and spatial
features. We have used two different models to classify on the spatial and
temporal features: a CNN was used to classify on the spatial features, whereas
an RNN was used to classify on the temporal features. We obtained an accuracy
of 95.217%. This shows that a CNN along with an RNN can be successfully used to
learn spatial and temporal features and classify sign language gestures.
We have used two approaches to solve our problem; they differ only in the
inputs given to the RNN, as explained in the methodology above.
8. REFERENCES
________________________________________________
[2] Singha, Joyeeta, and Karen Das. "Automatic Indian Sign Language
Recognition for Continuous Video Sequence." ADBU Journal of
Engineering Technology 2, no. 1 (2015).
[8] Kingma, Diederik, and Jimmy Ba. "Adam: A method for stochastic
optimization." arXiv preprint arXiv:1412.6980 (2014).
[10] Hahnloser, Richard H. R., Rahul Sarpeshkar, Misha A. Mahowald,
Rodney J. Douglas, and H. Sebastian Seung. "Digital selection and
analogue amplification coexist in a cortex-inspired silicon circuit."
Nature 405, no. 6789 (2000): 947-951.
[11] Bottou, Léon. "Large-scale machine learning with stochastic gradient
descent." In Proceedings of COMPSTAT'2010, pp. 177-186. Physica-Verlag
HD, 2010.
[12] https://medium.com/technologymadeeasy/the-best-explanation-of-convolutional-neural-networks-on-the-internet-fbb8b1ad5df8
[13] https://www.quora.com/What-is-an-intuitive-explanation-of-Convolutional-Neural-Networks
[15] Cooper, Helen, Brian Holt, and Richard Bowden. "Sign language
recognition." In Visual Analysis of Humans, pp. 539-562. Springer
London, 2011.