Sign Language to Text Conversion – A Survey
Abhishek Kulkarni, Vishwatej Harer, Deep Thombare
Under the guidance of Prof. Kopal Gangrade
Department of Computer Science and Engineering,
Pune Institute Of Computer Technology, Pune, MH
Abstract. Sign languages are languages that use the visual-manual modality to convey meaning. They are communication
systems based on gestures, used primarily by deaf and mute people to communicate with each other and with others. Since sign
language is not widely known, only people well versed in it can interpret the gestures and communicate with those who sign.
Hence, a need arises to bridge this gap, and that is our aim. We plan to develop a web application that reads in sign language
and converts it to text that most people understand. Most of the techniques available today have drawbacks such as low
accuracy and sensitivity to skin tone, motion gestures, clutter and variability. Our main aim is to develop a web application
that converts sign language to text while also trying to mitigate these drawbacks to some extent.
Keywords: Sign to Text Conversion, Convolutional neural networks, Deep learning, Web application.
I. Introduction
Sign language detection and conversion is a multi-step process that includes object detection, image processing and feature extraction.
Object detection is a computer vision technique that locates objects such as hand signs, faces, etc. in an image. Image processing is a set
of methods that perform operations on a given image to enhance it or to extract information useful to us. Once an object is detected, we
apply image processing techniques to remove noise and clutter and obtain a simplified version of the image. Feature extraction is a
process by which we obtain relevant information from data and represent it in a lower-dimensional space. Once we have the enhanced
image, we apply feature extraction techniques to it to get useful information.
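To make this pipeline concrete, the following is a minimal sketch of preprocessing and feature extraction with OpenCV; the 64x64 image size, the Otsu thresholding step and the sample file name are illustrative assumptions rather than the exact choices discussed in this survey.

```python
# Minimal preprocessing/feature-extraction sketch (illustrative only; the file
# name, image size and thresholding choices are assumptions, not prescriptions).
import cv2
import numpy as np

def extract_features(frame_bgr):
    """Denoise a captured frame, segment the hand region, and return a
    low-dimensional feature vector."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)        # drop colour channels
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)               # suppress sensor noise/clutter
    _, mask = cv2.threshold(blurred, 0, 255,
                            cv2.THRESH_BINARY + cv2.THRESH_OTSU)  # binarise hand vs. background
    small = cv2.resize(mask, (64, 64))                         # fixed size for the classifier
    return (small.astype(np.float32) / 255.0).ravel()          # flatten to a feature vector

frame = cv2.imread("gesture_sample.jpg")                       # hypothetical input frame
features = extract_features(frame)
print(features.shape)                                          # (4096,)
```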
Once feature extraction is done, we use this relevant information to train a deep learning model. Deep learning is a branch of machine
learning and artificial intelligence in which multi-layer neural networks learn representations directly from data. These networks can be
Convolutional Neural Networks, Recurrent Neural Networks, Generative Adversarial Networks, etc. A CNN is the most suitable choice for
converting sign language to text, as it is a type of neural network that uses stacked layers of perceptrons and convolutional filters to analyze
data and train a model. It applies to image processing, natural language processing and many other tasks related to cognitive capabilities.
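As a rough illustration of such a classifier, the sketch below defines a small Keras CNN with a softmax output trained with the ADAM optimizer; the layer sizes, the 64x64 grayscale input and the 26-class alphabet output are assumptions made for the example, not the architecture of any surveyed system.

```python
# Minimal CNN sketch in Keras for classifying fixed-size gesture images into
# sign classes; layer sizes and the 26-class output are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(num_classes=26, input_shape=(64, 64, 1)):
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation="relu", input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),                      # subsample spatial features
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),  # one probability per sign
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_model()
model.summary()
# model.fit(train_images, train_labels, epochs=10, validation_split=0.1)
```

The commented model.fit call indicates where labelled gesture images and their class indices would be supplied.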
Once a trained model is obtained, it is deployed on the cloud, and an interface has to be developed that uses the model to detect and
understand gestures. We also need a camera to record the gestures and an output device to display the text. Since most people use
smartphones and laptops, developing a web application is a very convenient way to achieve this. A web application is developed in
which we record the sign language gestures and pass them on to the model on the cloud for conversion. The resulting text can then
be displayed on the screen.
The proposed system would work like this:
Fig 1: The proposed system to convert sign language to text.
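On the server side, a minimal sketch of such a cloud-hosted conversion endpoint is shown below using Flask; the /predict route, the sign_model.h5 file and the alphabet label list are hypothetical names used only for illustration.

```python
# Sketch of a server endpoint that accepts an uploaded frame and returns the
# predicted sign as text; endpoint name, model path and labels are assumptions.
import cv2
import numpy as np
import tensorflow as tf
from flask import Flask, request, jsonify

app = Flask(__name__)
model = tf.keras.models.load_model("sign_model.h5")        # hypothetical trained model
LABELS = [chr(ord("A") + i) for i in range(26)]            # assumed alphabet classes

@app.route("/predict", methods=["POST"])
def predict():
    raw = np.frombuffer(request.files["frame"].read(), np.uint8)
    img = cv2.imdecode(raw, cv2.IMREAD_GRAYSCALE)           # decode the uploaded frame
    img = cv2.resize(img, (64, 64)).astype(np.float32) / 255.0
    probs = model.predict(img[np.newaxis, ..., np.newaxis]) # shape (1, 64, 64, 1)
    return jsonify({"text": LABELS[int(probs.argmax())]})

if __name__ == "__main__":
    app.run()
```

The browser side would capture webcam frames, POST them to this endpoint and display the returned text on the page.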
II. Related Works
Bantupalli et al. [1] have used a convolutional neural network for spatial feature extraction, long short-term memory (LSTM) recurrent neural
networks for temporal feature extraction, and then the ADAM optimizer with a softmax layer for prediction. Modi et al. [2] have developed a system
that obtains frames from a video every 4 seconds and matches them against a database to find the entry with the least error. Abdulla et al. [3] have used
sensors with RF transmitters that send signals based on hand movement to a receiver that translates them into the Arabic language. Dutta et al. [4] have
developed a system that calculates eigenvalues for images, takes the pre-processed image as input and checks for the maximum match.
Padmavathi et al. [5] have developed a system which converts a video into frames and applies HSI color model based segmentation to each; a neural
network is then used to predict the character. Anand et al. [6] have developed a system which creates an image feature vector through binarization,
noise removal and hand detection in the image, and then compares it with an existing database. For speech to sign conversion, noise is removed from
the audio, which is converted to text and then compared with the existing database. Ong et al. [7] have developed a system which detects hand signs,
hand shape and the position of the hand across all positions and scales in the given image after removing erroneous values. Bhat et al. [8] have
developed a system which uses sensors in gloves to pick up gestures, converts them to text with an analog-to-digital converter and microcontrollers,
and sends the text to a phone via Bluetooth, which then converts the text to speech. Pramada et al. [9] have developed a system which captures an
image, performs RGB color detection, converts it to a binary image and then performs pattern matching and text-to-speech conversion. Huang et al.
[10] have developed a system which uses a Microsoft Kinect as the input device, providing color and depth video streams over five inputs; their CNN
takes 9 frames as input and has 8 layers, with convolution and subsampling performed multiple times. Madhuri et al. [11] have developed a system in
which an image is obtained using a mobile camera, image processing is performed to extract and match the hand sign, and the corresponding audio
file is then played. Wu et al. [12] have developed a system in which a classifier is trained on both positive and negative data; in each run the weak
classifier with the lowest error rate is chosen, and all the classifiers are then combined into a strong classifier to detect the meaning of the gestures.
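Several of the approaches above ([5], [9]) rely on color-model based segmentation of the hand region. The following is a hedged sketch of that idea using OpenCV's HSV color space as a stand-in for HSI; the skin-tone bounds are assumptions and would typically need tuning per camera and lighting.

```python
# Illustrative colour-model based hand segmentation in the spirit of [5] and [9];
# the skin-tone bounds below are assumed values, not those used by the authors.
import cv2
import numpy as np

def segment_hand(frame_bgr):
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    lower = np.array([0, 40, 60], dtype=np.uint8)      # assumed lower skin-tone bound
    upper = np.array([25, 255, 255], dtype=np.uint8)   # assumed upper skin-tone bound
    mask = cv2.inRange(hsv, lower, upper)               # binary hand mask
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN,
                            np.ones((5, 5), np.uint8))  # remove small speckles
    return cv2.bitwise_and(frame_bgr, frame_bgr, mask=mask)
```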
Jarndal et al. [13] have developed two systems for conversion to dual-language (English and Arabic) text and voice: a vision based system and a
wireless-interfaced glove based system. Hays et al. [14] have developed a mobile application for real-time sign language to text conversion from a
video input using the classification algorithms Locality Preserving Projections (LPP) and Support Vector Machine (SVM). Vijayalakshmi et al. [15]
have developed a flex sensor, tactile sensor and accelerometer based, HMM driven sign language to text and speech conversion model.
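Where a classical classifier is applied to extracted feature vectors, as in [14], that step can be sketched with scikit-learn's SVM; the synthetic data below merely stands in for real gesture features, and the LPP dimensionality reduction from [14] is omitted.

```python
# Sketch of SVM classification over gesture feature vectors; the random data
# is a placeholder for real extracted features and labels.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((500, 256))             # placeholder feature vectors
y = rng.integers(0, 26, size=500)      # placeholder alphabet labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = SVC(kernel="rbf", C=1.0)         # RBF-kernel support vector classifier
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```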
III. Comparison of different sign language translation methods
The following Table 1 gives an idea of the different methods used in the field of sign language detection and translation by different
authors. It also illustrates some of the recommendations we think could be implemented.
Table 1. Comparison of different methods used for sign language conversion

[1] Bantupalli et al.
Objective: Create a vision based application that offers sign language translation to text, aiding communication.
Methodology: A CNN named 'Inception' is used for spatial feature extraction from the video; an LSTM/RNN model then extracts temporal features from the outputs of the softmax and pool layers of the CNN.
Results/Outcome: Accuracy was higher for the softmax layer than for the pool layer across various sample sizes.
Advantage: As the CNN and RNN are trained independently, the cross-entropy cost function is minimized using ADAM.
Disadvantage: The model faced problems while testing with different skin tones; accuracy dropped if it had not been trained on a particular skin tone.

[2] Modi et al.
Objective: A method to enable translating sign language finger-spellings to English text and to enable finger-spelling to digital, audio or text conversion.
Methodology: A frame is extracted from the video every 4 seconds and processed; the extracted features are compared with a database of finger-spellings. The error is calculated, and the entry with the minimum error is the best match.
Results/Outcome: This approach gives a clear comparison of each finger-spelling with all other database images. It resulted in 96% accuracy.
Advantage: Simple mechanism to detect the finger-spelling. Easy to implement, with the desired output probability of 0.96.
Disadvantage: More gestures and features need to be added so that motion is supported too.

[3] Abdulla et al.
Objective: To develop a device which uses gloves to convert sign language to Arabic text.
Methodology: Five flex sensors detect the bending of each finger. An Arduino NANO interfaced with an RF transmitter transmits the signals to a receiver; the received signals generate Arabic letters, which are displayed on an LCD screen.
Results/Outcome: When a person wearing the smart glove makes an Arabic letter gesture, the LCD displays the letter and the speaker outputs the voice when the sound button is clicked.
Advantage: Low cost, and it can be used to represent a wider range of words.
Disadvantage: Two gloves need to be combined instead of one, since a single glove does not cover a wide range of signs, and the use of smart gloves is a compulsion.

[4] Dutta et al.
Objective: To develop a system trained to convert single and double handed sign language to text and later to speech.
Methodology: The Min Eigenvalue method is applied to 5 images of each alphabet and to the pre-processed input image. Interesting points are extracted and checked for the maximum match.
Results/Outcome: The test image and the database images are compared by matching the feature points, and the matching database image is displayed.
Advantage: It is carried out with bare hands and the results are background and person independent.
Disadvantage: -

[5] Padmavathi et al.
Objective: To convert Indian sign language hand gestures to appropriate text messages.
Methodology: The video is converted to frames and HSI color model based segmentation is applied to each. Features like the centroid of the hand are extracted and fed to a neural network to recognize the particular character.
Results/Outcome: The accuracy obtained is 99%, with precision 89.47%, recall 89.78% and specificity 97.54%.
Advantage: Accuracy is high, and the approach gave better results with a sigmoid transfer function.
Disadvantage: Improper segmentation when hands are overlapped, which results in varying robustness.

[6] Anand et al.
Objective: Ease communication between deaf/dumb and normal people without the use of sophisticated devices like data gloves.
Methodology: Create an image feature vector through noise removal and hand detection in the image, and compare it with an existing database. For speech to sign conversion, remove noise from the audio, convert it to text and compare with the existing database.
Results/Outcome: Not implemented yet.
Advantage: A convenient way of communication between deaf/dumb and normal people with two-way translation.
Disadvantage: Should be extended to words and sentences. Difficult to implement the image processing technique on mobile phones.

[7] Ong et al.
Objective: Train a detector to recognize the human hand in an image and also classify the hand shape.
Methodology: Exhaustive detection across all positions and scales, image thresholding, and connected component analysis to detect the position of the hand. The sub-image in the area of the detected hand is given to hand shape detectors.
Results/Outcome: 99.8% success rate on hand detection and 97.4% success rate on hand shape classification.
Advantage: Unsupervised approach trained using the K-medoid algorithm; very efficient on gray-level images.
Disadvantage: No motion or background models; accuracy not evaluated in environments with more clutter and variability.

[8] Bhat et al.
Objective: Improve communication in Indian Sign Language using flex sensor technology.
Methodology: Sensors in gloves pick up gestures, which are converted to text with an ADC and microcontrollers and sent to a phone via Bluetooth, which then converts the text to speech.
Results/Outcome: Successfully converts Indian Sign Language, numbers and symbols to text and displays them on a mobile phone.
Advantage: Reliable, user independent and portable system that consumes less power compared to other systems.
Disadvantage: -
IV. Conclusion
Sign language is a visual language. Visual information is the most important type of information perceived, processed and
interpreted by the human brain. Digital image processing, as a computer-based technology, has applications in a variety of
fields such as image sharpening, restoration, medical imaging and remote sensing. Likewise, deep learning, as a subfield of
machine learning, tries to imitate the workings of the human brain and is used in fields like speech and image recognition.
After going through the papers listed above on sign language conversion, we feel it is safe to say that there are many
techniques for converting sign language to various output types, each with its own advantages and drawbacks. Our plan is to
develop a web application that helps convert sign language to text output.
V. Acknowledgment
The authors express their gratitude to the mentors and faculty members who guided us throughout this research and helped us
achieve the desired results.
VI. References
1. Kshitij Bantupalli, Ying Xie: “American Sign Language Recognition using Deep Learning and Computer Vision”, 2018 IEEE Conference
on Big Data.
2. Krishna Modi, Amrita More: “Translation of Sign Language Finger-Spelling to Text using Image Processing”, International Journal of
Computer Applications, 11 September 2013, vol. 77.
3. Dalal Abdulla, Shahrazad Abdulla, Rameesa Manaf, Anwar H. Jarndal: “Design and Implementation of A Sign to Speech/Text System
for Deaf and Dumb People”, 2016 Fifth International Conference on Electronic Devices, Systems and Applications (ICEDSA).
4. Kusumika Krori Dutta, Satheesh Kumar Raju K, Anil Kumar G S, Sunny Arokia Swamy B: “Double Handed Indian Sign Language to
Speech and Text”, 2015 Third International Conference on Image Information Processing.
5. Padmavathi. S, Saipreethy M S, Valliammai V: “Indian Sign Language character recognition using Neural Networks”, IJCA Special Issue
on Recent Trends in Pattern Recognition and Image Analysis (RTPRIA).
6. M Suresh Anand, A. Kumaresan, Dr. N Mohan Kumar: “An integrated two way ISL (Indian Sign Language) translation system - A new
approach”, International Journal of Advanced Research in Computer Science, Jan/Feb 2013, Vol. 4 Issue 1, pp. 7-12.
7. Eng-Jon Ong, Richard Bowden: “A Boosted Classifier tree for Hand Shape Detection”, Sixth IEEE International Conference on
Automatic Face and Gesture Recognition, 2004.
8. Sachin Bhat, Amruthesh M, Ashik Chidanandas, Sujith: “Translating Indian Sign Language to text and voice messages using flex
sensors”, International Journal of Advanced Research in Computer and Communication Engineering, May 2015, Vol. 4 Issue 5.
9. Sawaant Pramada, Deshpande Saylee, Naale Pranita, Nerkar Samiksha, Mrs. Archana S Vaidya: “Intelligent Sign Language recognition
using Image Processing”, IOSR Journal of Engineering, Feb 2013, Vol. 3 Issue 2, pp 45-51.
10. Jie Huang, Wengang Zhou, Houqiang Li, Weiping Li: “Sign Language Recognition using 3D Convolutional Neural Networks”, 2015
IEEE Conference on Multimedia and Expo (ICME), 1-6, 2015.
11. Yellapu Madhuri, Anitha G, Anburajan M: “Vision-based Sign Language Translation Device”, 2013 International Conference on
Information Communication and Embedded Systems (ICICES), 565-568, 2013.
12. Shuqiong Wu, Hiroshi Nagahashi: “Real-time 2D hands detection and tracking for Sign Language Recognition”, Proceedings of the 2013
8th International Conference on System of Systems Engineering, Maui, Hawaii, USA, Jun 2-6, 2013.
13. Anwar Jarndal, Ahmed Al-Maflehi: “On Design and Implementation of A Sign-to-Speech/Text System”, 2017 International Conference
on Electrical, Electronics, Communication, Computer and Optimization Techniques (ICEECCOT).
14. Philip Hays, Raymond Ptucha, Roy Melton: “Mobile Device to Cloud co-processing of ASL Finger Spelling to Text Conversion”, 2013
IEEE Western New York Image Processing Workshop (WNYIPW), 22-23 Nov, 2013.
15. Vijayalakshmi P, Aarthi M: “Sign Language to Speech Conversion”, 2016 International Conference on Recent Trends in Information
Technology, 8-9 April, 2016.