Sign Language Recognition Using LSTM and Media Pipe

P. A. Sujasri
Department of Computer Science and Engineering
Gokaraju Rangaraju Institute of Engineering and Technology
Hyderabad, India
psujasri181@gmail.com

Soma Anitha
Department of Computer Science and Engineering
Gokaraju Rangaraju Institute of Engineering and Technology
Hyderabad, India
somaanitha193@gmail.com

Ramavath Alivela
Department of Computer Science and Engineering
Gokaraju Rangaraju Institute of Engineering and Technology
Hyderabad, India
ramavathalivela99@gmail.com
Abstract— Learning aids are available for people who are deaf or have trouble speaking or hearing, but they are rarely used. In the proposed system, live sign motions are handled via image processing, and the system operates in real time. Classifiers are then employed to distinguish between distinct signs, and the translated output is shown as text. Machine learning algorithms are trained on the dataset. With effective algorithms, high-quality datasets, and improved sensors, the system aims to improve on existing systems in terms of response time and accuracy. Because they rely solely on image processing, current systems identify movements with considerable latency. In this project, our research aims to create a cognitive system that is sensitive and reliable enough for persons with hearing and speech impairments to use in day-to-day applications.

Keywords— Image Processing, sign motions, sensors, speaking or hearing

I. INTRODUCTION

It can be extremely difficult to talk to persons who have hearing loss. The use of hand gestures in sign language by the deaf and mute makes it difficult for non-disabled persons to understand what they are saying. As a result, there is a need for systems that can identify various signs and tell everyday people what they mean. It is crucial to create dedicated sign language applications for the deaf and mute so they may easily communicate with others who do not understand sign language. The major goal of our initiative is to start closing the communication gap between hearing individuals and sign language users who are deaf or mute. The main goal of our research is to create a vision-based system that can recognize sign language motions in action or video sequences. The technique extracts the temporal and spatial properties of the sign language gestures from the video sequences; both temporal and spatial aspects are learned by the models. An LSTM model, a variant of the recurrent neural network, was trained using the spatial information from the video series.

The proposed system provides an efficient way to translate sign language into text with good performance. The system can be used in many applications, such as engaging young children with computers through sign-language understanding.

II. LITERATURE STUDY
1. Sign Language Recognition (SLR), which tries to translate Sign Language (SL) into speech or text, aims to enhance communication between hearing-impaired people and able-bodied people. Because sign language is intricate and varies between individuals, this problem is challenging and has a large social impact [1].

2. Many different sign language recognition (SLR) algorithms have been developed by researchers, but they can only distinguish isolated sign motions. In this research, a modified long short-term memory (LSTM) model is proposed for continuous sequences of gestures, also known as continuous SLR, that can recognize a collection of related gestures [2].

3. Systems that comprehend sign language instantly translate signs in video feeds to text. Using convolutional neural networks (CNNs), feature pooling modules, and long short-term memory networks (LSTMs), a novel isolated sign language recognition model is developed in this study [3, 4].

4. Deep learning methods can be applied to overcome communication barriers. To identify and detect words in a person's gestures, the model discussed in this paper makes use of deep learning [5, 6].

5. Hand and body gestures are used to express the vocabulary of dynamic sign language. This approach uses a combination of Media Pipe and RNN models to address problems with dynamic sign language detection. The position, shape, and orientation of the hands were determined by extracting important hand, body, and facial parts with Media Pipe [7, 8].

6. Using sign language datasets and the human key points deduced from the face, hands, and other body parts, a sign language recognition system is developed [9, 10].

7. It is still challenging for non-sign-language speakers to communicate with sign language users, despite the fact that sign language has lately gained more popularity. Recent developments in deep learning and computer vision have led to promising results in motion and gesture recognition [11, 12].
III. FRAMEWORK DEVELOPMENT

A recurrent neural network (RNN) combined with long short-term memory (LSTM) layers is used by the system to predict sequences. Numerous sign language recognition (SLR) systems have been developed by researchers, but they can only identify isolated sign motions. In the present study, we propose a modified long short-term memory model (continuous SLR) for continuous sequences of gestures that can recognize a collection of related gestures. LSTM networks were studied and employed for the classification of gesture data because they can learn both long- and short-term associations. The resulting model demonstrated the potential of LSTM-based neural networks for sign language translation, with a classification accuracy of 98% for 26 motions.
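The paper does not give the exact network configuration; the following Keras sketch shows one plausible stacked-LSTM classifier for fixed-length sequences of MediaPipe Holistic keypoints. The sequence length, layer sizes, and training settings are assumptions for illustration, not the authors' reported configuration.

```python
# Minimal sketch of an LSTM gesture classifier (assumed hyperparameters).
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

SEQ_LEN = 30        # frames per gesture clip (assumption)
N_FEATURES = 1662   # 33*4 pose + 468*3 face + 2*21*3 hands from MediaPipe Holistic
N_CLASSES = 26      # number of gestures reported in the paper

model = Sequential([
    LSTM(64, return_sequences=True, activation="relu",
         input_shape=(SEQ_LEN, N_FEATURES)),
    LSTM(128, return_sequences=True, activation="relu"),
    LSTM(64, return_sequences=False, activation="relu"),
    Dense(64, activation="relu"),
    Dense(32, activation="relu"),
    Dense(N_CLASSES, activation="softmax"),
])
model.compile(optimizer="Adam",
              loss="categorical_crossentropy",
              metrics=["categorical_accuracy"])

# X: (num_clips, SEQ_LEN, N_FEATURES) keypoint sequences, y: one-hot labels.
# model.fit(X, y, epochs=200)
```

Each training example is a fixed-length sequence of flattened keypoints, and the final softmax layer maps the sequence to one of the 26 gesture labels.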
A. Methodology

The method we propose can identify a variety of motions by recording video and converting it into distinct sign language labels. The proposed system recognizes various actions in video recordings and separates them into discrete frames. Hand pixels are segmented, matched to the generated picture, and then compared against a trained model, which makes the system accurate at assigning specific character and text labels. Collaborative communication, which enables users to communicate successfully, is a feature of the suggested system: it contains an embedded voice module with a user-friendly interface to overcome language and speech barriers. The fundamental benefit of the proposed system is that it may be used for communication by both sign language users and verbal speakers. The suggested system is written in Python and uses the YOLOv5 algorithm. It includes modules such as a graphical user interface for simplicity, a training module to train CNN models, a gesture module that enables users to create their own gestures, a word-formation module that enables users to create words by combining gestures, and a speech module that turns the converted text into speech. Our suggested approach is intended to alleviate the issues that deaf people in India confront and to translate each word that is received.

Here, the collection of gestures is recognized using OpenCV in a real-time scenario: OpenCV captures the sequence of frames continuously and produces the desired output, as sketched below.
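The capture loop itself is not listed in the paper; the sketch below shows the kind of real-time OpenCV loop the methodology describes. The `predict_sign` function is a hypothetical placeholder for landmark extraction plus the trained classifier.

```python
# Sketch of the real-time capture loop described above (OpenCV).
import cv2

def predict_sign(frame):
    """Hypothetical placeholder: extract landmarks and classify the gesture."""
    return "HELLO"

cap = cv2.VideoCapture(0)           # the webcam is the hardware source
while cap.isOpened():
    ok, frame = cap.read()          # grab the next frame
    if not ok:
        break
    label = predict_sign(frame)     # translated output for this frame window
    cv2.putText(frame, label, (10, 40),
                cv2.FONT_HERSHEY_SIMPLEX, 1.2, (0, 255, 0), 2)
    cv2.imshow("Sign Language Recognition", frame)
    if cv2.waitKey(10) & 0xFF == ord("q"):   # press 'q' to stop
        break
cap.release()
cv2.destroyAllWindows()
```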
B. Data Collection

The Media Pipe Holistic pipeline is used to combine the separate pose, face, and hand models, each of which is individually optimized. The pose estimation model treats its input as a video frame at a lower, fixed resolution (256x256). The landmarks used are:

1. Hand Landmarks: each hand has 21 landmarks.
2. Pose Landmarks: there are 33 pose landmarks in total.
3. Face Landmarks: there are 468 face landmarks in total.
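As an illustration of how these landmark counts become training data, the following sketch extracts Holistic landmarks from a single frame and flattens them into one feature vector. The feature layout and the zero-filling of missing detections are assumed conventions, not something specified in the paper; the input file name is a placeholder.

```python
# Sketch: extract and flatten MediaPipe Holistic landmarks from one BGR frame.
import cv2
import numpy as np
import mediapipe as mp

mp_holistic = mp.solutions.holistic

def extract_keypoints(results):
    """Flatten pose (33x4), face (468x3) and both hands (21x3 each) into one vector."""
    pose = (np.array([[p.x, p.y, p.z, p.visibility] for p in results.pose_landmarks.landmark]).flatten()
            if results.pose_landmarks else np.zeros(33 * 4))
    face = (np.array([[p.x, p.y, p.z] for p in results.face_landmarks.landmark]).flatten()
            if results.face_landmarks else np.zeros(468 * 3))
    lh = (np.array([[p.x, p.y, p.z] for p in results.left_hand_landmarks.landmark]).flatten()
          if results.left_hand_landmarks else np.zeros(21 * 3))
    rh = (np.array([[p.x, p.y, p.z] for p in results.right_hand_landmarks.landmark]).flatten()
          if results.right_hand_landmarks else np.zeros(21 * 3))
    return np.concatenate([pose, face, lh, rh])   # 1662 values per frame

with mp_holistic.Holistic(min_detection_confidence=0.5,
                          min_tracking_confidence=0.5) as holistic:
    frame = cv2.imread("sample_frame.jpg")        # placeholder: any webcam frame
    results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    keypoints = extract_keypoints(results)        # shape: (1662,)
```

Stacking 30 such vectors gives one input sequence for the LSTM model described earlier.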
i. Hand Landmarks

Understanding the shape and motion of hands can enhance the user experience across a range of technical platforms and disciplines. Media Pipe Hands is a high-precision hand and finger tracking solution. In contrast to current state-of-the-art methodologies, which usually require powerful desktop workstations for inference, this solution offers real-time performance and scales to many hands.

Fig 1. Hand Landmarks

ii. Pose Landmarks

A background segmentation mask and 33 3D landmarks are extracted from RGB video frames using the machine learning (ML) technique known as BlazePose. We only consider landmarks at the 17 significant locations of the COCO topology. The Media Pipe Pose Landmark Model predicts 33 pose landmarks.

Fig 2. Pose Landmarks
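The paper does not state which 17 of the 33 BlazePose landmarks correspond to the COCO topology; the index list below is one plausible selection based on the published BlazePose landmark ordering and should be read as an assumption.

```python
# Sketch: keep only 17 COCO-style keypoints out of the 33 BlazePose landmarks.
import numpy as np

# Assumed BlazePose indices for the COCO-17 keypoints:
# nose, eyes, ears, shoulders, elbows, wrists, hips, knees, ankles.
COCO17_FROM_BLAZEPOSE = [0, 2, 5, 7, 8, 11, 12, 13, 14, 15, 16,
                         23, 24, 25, 26, 27, 28]

def to_coco17(pose_landmarks):
    """pose_landmarks: array of shape (33, 4) -> array of shape (17, 4)."""
    return np.asarray(pose_landmarks)[COCO17_FROM_BLAZEPOSE]
```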
iii. Face Landmarks

Using Media Pipe Face Mesh technology, 468 3D face landmarks are estimated in real-world coordinates. Machine learning (ML) is used to build the 3D facial surface; it does not require a separate depth sensor and needs only a single camera input. The face transform data consists of typical 3D primitives such as a triangular face mesh and a face pose transformation matrix. The Face Transform module bridges the gap between facial landmark estimation and precise real-time augmented reality (AR) applications. The 3D landmark network receives a cropped video frame as its input, and the model outputs the 3D point coordinates and the probability that a face is present and correctly aligned in the input. By adjusting predictions and iteratively bootstrapping, the stability and accuracy of the model are increased. These face landmarks are used to identify and represent the key points of the human face and are useful while capturing frames with OpenCV.
Experimental Design

Figure 4 describes the system architecture employed in the experiments in detail.

Fig 4. System Architecture

A. Image Acquisition

Image acquisition is the process of retrieving an image from a source, usually a hardware-based one, for image processing. The hardware source in our project is a web camera. Because no processing can be done without an image, acquisition is the initial stage in the workflow sequence. The image that is obtained has not undergone any kind of processing.

B. Segmentation

Segmentation is a method of separating objects from the background details of a recorded image. The segmentation procedure makes use of edge detection, skin colour detection, and background subtraction. In order to recognize gestures, the motion and position of the hand must be classified as well as identified. Edge-based segmentation is used in this project, as sketched below.
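The segmentation step is not listed as code in the paper; the snippet below is a minimal sketch of the skin-colour masking and edge detection it mentions, with the HSV thresholds and Canny parameters chosen as illustrative assumptions.

```python
# Sketch: skin-colour mask plus Canny edges for hand segmentation (assumed thresholds).
import cv2
import numpy as np

def segment_hand(frame_bgr):
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    lower_skin = np.array([0, 30, 60], dtype=np.uint8)     # illustrative lower bound
    upper_skin = np.array([20, 150, 255], dtype=np.uint8)  # illustrative upper bound
    skin_mask = cv2.inRange(hsv, lower_skin, upper_skin)   # skin-colour detection
    skin_only = cv2.bitwise_and(frame_bgr, frame_bgr, mask=skin_mask)
    edges = cv2.Canny(cv2.cvtColor(skin_only, cv2.COLOR_BGR2GRAY), 50, 150)  # edge-based segmentation
    return skin_mask, edges
```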
C. Preprocessing

Images need to be processed before they can be used by models for inference and training. This includes, but is not limited to, changes in colour, size, and orientation. Additionally, preprocessing can shorten training and speed up inference: shrinking extremely large input images greatly shortens the training period without significantly affecting model performance. The following are the stages of preprocessing:
1. Morphological transform:
Morphological operations create an output image of the same size by using the structural characteristics of an input image. Each pixel in the output image is assigned a value by comparing the corresponding pixel of the input image with its neighbours. Dilation and erosion are two basic types of morphological transformation.

2. Blurring:
One example is blurring a picture with a low-pass filter. In computer vision, a "low-pass filter" is a method for reducing noise in an image while keeping the majority of its content. Blurring must be finished before tackling harder tasks such as edge detection.
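A minimal OpenCV sketch of these two preprocessing stages, with assumed kernel sizes:

```python
# Sketch: dilation/erosion and low-pass (Gaussian) blurring with OpenCV.
import cv2
import numpy as np

def preprocess(gray_image):
    kernel = np.ones((5, 5), np.uint8)                        # assumed 5x5 structuring element
    dilated = cv2.dilate(gray_image, kernel, iterations=1)    # morphological dilation
    eroded = cv2.erode(dilated, kernel, iterations=1)         # morphological erosion
    blurred = cv2.GaussianBlur(eroded, (5, 5), 0)             # low-pass filtering
    return blurred
```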
3. Recognition:
Children who have hearing loss are at a disadvantage because they find it difficult to follow lectures shown on a screen. American Sign Language (ASL) was developed to assist such children in managing their schooling and to make daily life easier for them. To assist these children in learning, we came up with a model that lets them make ASL motions to the camera, which then interprets them and gives feedback on what was understood. To do this, we combined Media Pipe Holistic with OpenCV to determine the essential indicators of the signer, with all the collected values trained on the Long Short-Term Memory network.
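The recognition stage can be pictured as a rolling window of per-frame keypoint vectors fed to the trained LSTM. The sketch below assumes the `extract_keypoints` helper and 30-frame window from the earlier examples and a hypothetical `gesture_labels` list; none of these names come from the paper.

```python
# Sketch: classify the most recent 30-frame window of Holistic keypoints.
from collections import deque
import numpy as np

SEQ_LEN = 30
window = deque(maxlen=SEQ_LEN)                     # rolling buffer of per-frame keypoint vectors
gesture_labels = ["hello", "thanks", "iloveyou"]   # hypothetical label set

def recognize(keypoints, model, threshold=0.7):
    """Append one frame's keypoints; return a label once the window is full and confident."""
    window.append(keypoints)
    if len(window) < SEQ_LEN:
        return None
    probs = model.predict(np.expand_dims(np.array(window), axis=0))[0]
    return gesture_labels[int(np.argmax(probs))] if probs.max() > threshold else None
```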
4. Text Output:
Diverse postures and body gestures are recognized and converted into text in order to better understand human behaviour.

The state chart diagram describes the states the system passes through and the events that brought about those changes. There are mainly two special states in a state chart diagram: the initial state and the final state. The following are a few of the elements of a state chart diagram:

Fig 5. State Chart Diagram

State: a situation or stage in an object's life cycle in which it encounters a recurring condition, performs an action, or waits for an outcome.

Transition: a transition between two states illustrates how an object in the first state acts before moving to the second state or event.

Event: a description of a significant occurrence that happens at a certain time and location.

The state chart diagram given in Figure 5 illustrates how the video frame captured from the webcam is processed: it covers extracting the image frame and identifying the sign represented by the image.
The system occasionally confuses visually similar signs, such as U and W. However, it does not have to work perfectly, since using an orthography corrector or a word predictor would increase translation accuracy. The next stage is to analyse the responses and look for ways to improve the system.
V. FUTURE SCOPE

We can create a model for the recognition of sign language words and sentences; a system that can recognize changes in the temporal space will be needed for this. By creating a comprehensive offering, we can bridge the communication gap for those who are deaf or hard of hearing. To be able to translate spoken language into sign language and vice versa, the system's image processing component needs to be improved, and we will look for further motion-related cues. We will also focus on translating the sequence of movements into text, words, and sentences, and then translating that text into audible speech.
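The text-to-speech step mentioned above is not described further in the paper; a minimal sketch of turning the recognized text into audible speech, assuming the pyttsx3 offline text-to-speech library, is shown below.

```python
# Sketch: speak the recognized text with an offline TTS engine (pyttsx3 assumed).
import pyttsx3

def speak(text):
    engine = pyttsx3.init()   # initialize the system's default TTS engine
    engine.say(text)          # queue the recognized word or sentence
    engine.runAndWait()       # block until speech finishes

# speak("hello")              # e.g. after a gesture is recognized
```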
VI. ACKNOWLEDGEMENT

We are extremely grateful to Mr. G. Mallikarjuna Rao, our project guide, for his unwavering patience and leadership throughout our project work. We also sincerely thank everyone who has assisted us, directly or indirectly, in the growth and success of our project work.
VII. REFERENCES

[1] Liu, Tao, Wengang Zhou, and Houqiang Li. "Sign language recognition with long short-term memory." 2016 IEEE International Conference on Image Processing (ICIP). IEEE, 2016.
[2] Mittal, Anshul, et al. "A modified LSTM model for continuous sign language recognition using leap motion." IEEE Sensors Journal 19.16 (2019): 7056-7063.
[3] Jimmy Jiménez-Salas and Mario Chacón-Rivas. "A Systematic Mapping of Computer Vision-Based Sign Language Recognition." 2022 International Conference on Inclusive Technologies and Education (CONTIE), pp. 1-11, 2022.
[4] Ozge Mercanoglu Sincan and Hacer Yalim Keles. "Using Motion History Images With 3D Convolutional Networks in Isolated Sign Language Recognition." IEEE Access, vol. 10, pp. 18608-18618, 2022.
[5] Wadhawan, A., and Kumar, P. "Sign language recognition systems: A decade systematic literature review." Archives of Computational Methods in Engineering 28 (2021): 785-813.
[6] Kothadiya, D., Bhatt, C., Sapariya, K., Patel, K., Gil-González, A.B., and Corchado, J.M. "Deepsign: Sign language detection and recognition using deep learning." Electronics 11.11 (2022): 1780.
[7] Samaan, Gerges H., et al. "MediaPipe's Landmarks with RNN for Dynamic Sign Language Recognition." Electronics 11.19 (2022): 3228.
[8] Wadie, A.R., Attia, A.K., Asaad, A.M., Kamel, A.E., Slim, S.O., Abdallah, M.S., and Cho, Y.I. "MediaPipe's Landmarks with RNN for Dynamic Sign Language Recognition." Electronics 11.19 (2022): 3228.
[9] C. Dong, M. C. Leu, and Z. Yin. "American sign language alphabet recognition using Microsoft Kinect." 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 44-52, June 2015.
[10] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724-1734, 2014.
[11] Bantupalli, Kshitij, and Ying Xie. "American sign language recognition using deep learning and computer vision." 2018 IEEE International Conference on Big Data (Big Data). IEEE, 2018.
[12] Wei, Chengcheng, et al. "Deep grammatical multi-classifier for continuous sign language recognition." 2019 IEEE Fifth International Conference on Multimedia Big Data (BigMM). IEEE, 2019.