Reference No. 06
DOI: 10.32604/cmes.2023.045731
REVIEW
ABSTRACT
Sign language, a visual-gestural language used by the deaf and hard-of-hearing community, plays a crucial
role in facilitating communication and promoting inclusivity. Sign language recognition (SLR), the process of
automatically recognizing and interpreting sign language gestures, has gained significant attention in recent years
due to its potential to bridge the communication gap between the hearing impaired and the hearing world. The
emergence and continuous development of deep learning techniques have provided inspiration and momentum
for advancing SLR. This paper presents a comprehensive and up-to-date analysis of the advancements, challenges,
and opportunities in deep learning-based sign language recognition, focusing on the past five years of research.
We explore various aspects of SLR, including sign data acquisition technologies, sign language datasets, evaluation
methods, and different types of neural networks. Convolutional Neural Networks (CNN) and Recurrent Neural
Networks (RNN) have shown promising results in fingerspelling and isolated sign recognition. However, the
continuous nature of sign language poses challenges, leading to the exploration of advanced neural network
models such as the Transformer model for continuous sign language recognition (CSLR). Despite significant
advancements, several challenges remain in the field of SLR. These challenges include expanding sign language
datasets, achieving user independence in recognition systems, exploring different input modalities, effectively
fusing features, modeling co-articulation, and improving semantic and syntactic understanding. Additionally,
developing lightweight network architectures for mobile applications is crucial for practical implementation. By
addressing these challenges, we can further advance the field of deep learning for sign language recognition and
improve communication for the hearing-impaired community.
KEYWORDS
Sign language recognition; deep learning; artificial intelligence; computer vision; gesture recognition
1 Introduction
Effective communication is essential for individuals to express their thoughts, feelings, and needs.
However, for individuals with hearing impairments, spoken language may not be accessible. In such
cases, sign language serves as a vital mode of communication. Sign language is a visual-gestural
language that utilizes hand movements, facial expressions, and body postures to convey meaning.
This unique language has a rich history and has evolved to become a distinct and complex system of
communication. Sign languages differ across regions and countries, with each having its own grammar
and vocabulary.
Stokoe W. C. made a significant contribution to the understanding of sign language by recognizing
its structural similarities to spoken languages. Like spoken languages, sign language has a phonological
system. Signs can be broken down into smaller linguistic units [1]. As shown in Fig. 1, sign language
can be categorized into manual and non-manual features. Manual features can be further divided
into handshape, orientation, position, and movement. Non-manual features include head and body
postures, and facial expressions. These features work together to convey meaning and enable effective
communication in sign language.
Figure 1: The features of sign language: manual features (handshape, orientation, position, movement) and non-manual features (head posture, body posture, facial expression)
According to the World Health Organization, there are over 466 million people globally with
disabling hearing loss, and this number is expected to increase in the coming years. For individuals who
are deaf or hard of hearing, sign language is often their primary mode of communication. However, the
majority of the population does not understand sign language, leading to significant communication
barriers and exclusion for the deaf community. Sign language recognition (SLR) refers to the process of
automatically interpreting and understanding sign language gestures and movements through various
technological means, such as computer vision and machine learning algorithms. By enabling machines
to understand and interpret sign language, we can bridge the communication gap between the deaf
community and the hearing world. SLR technology has the potential to revolutionize various sectors,
including education, healthcare, and communication, by empowering deaf individuals to effectively
communicate and access information, services, and opportunities that were previously limited [2,3]. In
addition, SLR technology can be expanded to other areas related to gesture commands, such as traffic
sign recognition, military gesture recognition, and smart appliance control [4–8].
Research on SLR dates back to the 1990s. Based on the nature of the signs, these techniques were
categorized into fingerspelling recognition, isolated sign language recognition, and continuous sign
language recognition, as depicted in Fig. 2.
Figure 2: The three categories of sign language recognition: fingerspelling recognition, isolated sign language recognition, and continuous sign language recognition
Static signs, such as alphabet and digit signs, primarily belong to the category of fingerspelling
recognition. This type of recognition involves analyzing and interpreting the specific hand shapes and
positions associated with each sign. Although it is important to acknowledge that certain static signs
may involve slight movements or variations in hand shape, they are generally regarded as static because
their main emphasis lies in the configuration and positioning of the hands rather than continuous
motion.
On the other hand, dynamic signs can be further classified into isolated sign recognition and
continuous sign recognition systems. Isolated sign gesture recognition aims to recognize individual
signs or gestures performed in isolation. It involves identifying and classifying the hand movements,
facial expressions, and other relevant cues associated with each sign. In contrast, continuous sign
recognition systems aim to recognize complete sentences or phrases in sign language. They go beyond
recognizing individual signs and focus on understanding the context, grammar, and temporal sequence
of the signs. This type of recognition is crucial for facilitating natural and fluid communication in sign
language.
In the field of sign language recognition, traditional machine learning methods have played signifi-
cant roles. These methods have been utilized for feature extraction, classification, and modeling of sign
language. However, traditional machine learning approaches often face certain limitations and have
reached a bottleneck. These limitations include the need for manual feature engineering, which can
be time-consuming and may not capture all the relevant information in the data. Additionally, these
methods may struggle with handling complex and high-dimensional data, such as the spatiotemporal
information present in sign language gestures. In recent years, deep learning methods have outperformed
previous state-of-the-art machine learning techniques in many areas, especially computer vision
and natural language processing [9]. Deep learning techniques have brought significant advancements
to sign language recognition [10–14], leading to a surge in research papers published on deep learning-
based SLR. As the field continues to evolve, it is crucial to conduct updated literature surveys.
Therefore, this paper aims to provide a comprehensive review and classification of the current state of
research in deep learning-based SLR.
This review delves into various aspects and technologies related to SLR using deep learning,
covering the latest advancements in the field. It also discusses publicly available datasets commonly
used in related research. Additionally, the paper addresses the challenges encountered in SLR and
identifies potential research directions. The remaining sections of the paper are organized as follows:
Section 2 describes the collection and quantitative analysis of literature related to SLR. Section 3
describes different techniques for acquiring sign language data. Section 4 discusses sign language
datasets and evaluation methods. Section 5 explores deep learning techniques relevant to SLR. In
Section 6, advancements and challenges of various techniques employed in SLR are compared and
discussed. Finally, Section 7 summarizes the development directions in this field.
sign language recognition (CSLR) and gloss-to-text translation. After eliminating irrelevant papers,
our study encompassed 346 relevant papers. The PRISMA chart depicting our selection process is
presented in Fig. 3.
Figure 3: The PRISMA flow diagram for identifying relevant documents included in this review
A comprehensive literature analysis was performed on various aspects of SLR using deep
learning, including annual publication volume, publishers, sign language subjects, main technologies,
and architectures. Fig. 4 demonstrates a consistent increase in the number of publications each
year, indicating the growing interest and continuous development in this field. Fig. 5 highlights the
prominent publishers in the domain of deep learning-based SLR. Notably, IEEE leads with the highest
number of publications, accounting for 37.57% of the total, followed by Springer Nature with 19.36%
and MDPI with 10.41%. Table 1 displays the primary sign language subjects for research, encompassing
American SL, Indian SL, Chinese SL, German SL, and Arabic SL. It is important to note that this data
is derived from the experimental databases utilized in the papers. In cases where a paper conducted
experiments using multiple databases, each database is counted individually. For instance, experiments
were conducted on two test datasets: the RWTH-PHOENIX-Weather multi-signer dataset and a
Chinese SL (CSL) dataset [15]. Therefore, German SL and Chinese SL are each counted once. Table 2
presents the main technologies and architectures employed in deep learning-based SLR. In Section 5,
we will focus on elucidating key technological principles to facilitate comprehension for readers new
to this field. The statistical data in Table 2 were obtained by first preprocessing and normalizing the
keywords in the literature and then analyzing and counting them with the VOSviewer software.
Figure 4: The number of publications per year on deep learning-based SLR, 2018–2023 (linear trend: y = 7.5429x + 31.267, R² = 0.5349)
Figure 5: The number of publications by publisher: IEEE, Springer Nature, MDPI, Elsevier, Association for Computing Machinery, Science and Information (SAI) Organization Ltd., and others
Table 1: The main sign language subjects on sign language recognition in deep learning (No. ≥ 5)

Sign language    No.    Sign language    No.
American SL      54     Turkish SL       6
Indian SL        35     British SL       6
German SL        33     Japanese SL      6
Chinese SL       26     Pakistan SL      5
Arabic SL        25     Russian SL       5
Korean SL        11     Bangla SL        5
Table 2: The main technologies or architectures of sign language recognition in deep learning (No. ≥ 5)

Technologies (Architectures)     No.    Technologies (Architectures)        No.
CNN                              91     ResNet                              8
Transfer learning                33     VGG                                 8
Attention                        27     CNN-LSTM                            8
Transformer                      20     CNN-HMM                             7
LSTM                             16     YOLO                                7
3D-CNN                           15     Ensemble learning                   6
Inception                        11     Generative adversarial networks     5
RNN                              11     Lightweight network                 5
Bi-LSTM                          8      MobileNet                           5
Graph convolutional network      8      GRU                                 5
Dias et al. [22] developed an instrumented glove with five flex sensors, an inertial sensor, and two contact sensors for recognizing the Brazilian sign
language alphabet. Wen et al. [23] utilized gloves configured with 15 triboelectric sensors to track and
record hand motions such as finger bending, wrist motions, touch with fingertips, and interaction with
the palm.
Figure 7: (a) System structure. (b) IMU collecting the hand motion data. (c) Bending sensor collecting
the hand shape data
Leap Motion Controller (LMC) is a small, motion-sensing device that allows users to interact
with their computer using hand and finger gestures. It uses infrared sensors and cameras to track the
movement of hands and fingers in 3D space with high precision and accuracy. In the field of sign
language recognition, by tracking the position, orientation, and movement of hands and fingers, the
Leap Motion Controller can provide real-time data that can be used to recognize and interpret sign
language gestures [24–26].
Some studies have utilized commercially available devices such as the Myo armband [27–29],
which are worn below the elbow and equipped with sEMG and inertial sensors. The sEMG sensors
can measure the electrical potentials produced by muscles. By placing these sensors on the forearm
over key muscle groups, specific hand and finger movements can be identified and recognized [30–32].
Li et al. [27] used a wearable Myo armband to collect human arm surface electromyography (sEMG)
signals for improving SLR accuracy. Pacifici et al. [28] built a comprehensive dataset that includes
EMG and IMU data captured with the Myo Gesture Control Armband. This data was collected while
performing the complete set of 26 gestures representing the alphabet of the Italian Sign Language.
Mendes Junior et al. [29] demonstrated the classification of a series of alphabet gestures in Brazilian
Sign Language (Libras) through the utilization of sEMG obtained from a MyoTM armband.
Recent studies have highlighted the potential of WiFi sensing to accurately
identify hand and finger gestures through channel state information [33–36]. The advantage of
WiFi signals is their non-intrusive nature: no device needs to be attached to the user's hand or fingers,
enabling seamless recognition. Zhang et al. [33] introduced a WiFi-based SLR system called
Wi-Phrase, which applies principal component analysis (PCA) projection to eliminate noise and
transform cleaned WiFi signals into a spectrogram. Zhang et al. [35] proposed WiSign, which
recognizes continuous sentences of American Sign Language (ASL) using existing WiFi infrastructure.
Additionally, RF sensors provide a pathway for SLR [37–41].
Sensor-based devices offer the benefit of minimizing reliance on computer vision techniques for
signer body detection and segmentation. This allows the recognition system to identify sign gestures
with minimal processing power. Moreover, these devices can track the signer’s movements, providing
valuable spatial and temporal information about the executed signs. However, it is worth noting that
certain devices, such as digital gloves, require the signer to wear the sensor device while signing, limiting
their applicability in real-time scenarios.
The Microsoft Kinect is a motion-sensing device that provides synchronized RGB and depth data streams and can track the skeletal movements of users. In the field of sign language recognition, the Kinect has been widely utilized [42–45].
Raghuveera et al. [46] captured hand gestures through Microsoft Kinect. Gangrade et al. [47,48] have
leveraged the 3D depth information obtained from hand motions, which is generated by Microsoft’s
Kinect sensor.
Multi-camera and 3D systems can mitigate certain environmental limitations but introduce
a higher computational burden, which can be effectively addressed due to the rapid progress in
computing technologies. Kraljević et al. [49] proposed a high-performance sign recognition module
that utilizes the 3DCNN network. They employ the StereoLabs ZED M stereo camera to capture
real-time RGB and depth information of signs.
In SLR systems, vision-based techniques are more suitable compared to sensor-based approaches.
These techniques utilize video cameras instead of sensors, eliminating the need for attaching sensors
to the signer’s body and overcoming the limited operating range of sensor-based devices. Vision-
based devices, however, provide raw video streams that often require preprocessing for convenient
feature extraction, such as signer detection, background removal, and motion tracking. Furthermore,
computer vision systems must address the significant variability and sources of errors inherent in
their operation. These challenges include noise and environmental factors resulting from variations
in illumination, viewpoint, orientation, scale, and occlusion.
Table 3 (continued)

Dataset | Year | Language | Type | Signs/vocabulary | Signers | Samples/videos | Modality | Data_link
MS-ASL-500 [59] | 2018 | ASL | Signs | 500 | 222 | 17823 | RGB | [60]
MS-ASL-1000 [59] | 2018 | ASL | Signs | 1000 | 222 | 25513 | RGB | [60]
CSL-500 [61] | 2019 | CSL | Signs | 500 | 50 | 125000 | RGB, D, skeleton | [62]
INCLUDE [63] | 2020 | ISL | Signs | 263 | – | 4287 | RGB | [64]
WLASL100 [65] | 2020 | ASL | Signs | 100 | 97 | 2038 | RGB | [66]
WLASL300 [65] | 2020 | ASL | Signs | 300 | 109 | 5117 | RGB | [66]
WLASL1000 [65] | 2020 | ASL | Signs | 1000 | 116 | 13168 | RGB | [66]
WLASL2000 [65] | 2020 | ASL | Signs | 2000 | 119 | 21083 | RGB | [66]
AUTSL [67] | 2020 | TuSL | Signs | 226 | 42 | 38336 | RGB | [68]
Libras [69] | 2021 | BrSL | Signs | 20 | – | 1200 | RGB, D, body points, face information | [70]
KArSL [71] | 2021 | ArSL | Signs | 502 | 3 | – | RGB, D, skeleton | [72]
BdSLW-11 [73] | 2022 | BdSL | Signs | 11 | – | 1105 | RGB | [74]
PHOENIX [75] | 2012 | GSL | Sentences | 911 | 7 | 1980 | RGB | [52]
PHOENIX 14 [76] | 2014 | GSL | Sentences | 1080 | 9 | 6841 | RGB | [52]
PHOENIX 14T [77] | 2018 | GSL | Sentences | 1066 | 9 | 8257 | RGB | [52]
CSL [78] | 2018 | CSL | Sentences | 178 | 50 | 25000 | RGB, D, body joints | [79]
SIGNUM [80] | 2009 | GSL | Sentences | 455 | 25 | 33210 | RGB | [81]
HKSL [11] | 2022 | HKSL | Sentences | 50 | 6 | – | RGB, D, smart watch data | –
Fingerspelling datasets primarily focus on sign language alphabets and/or digits. Some exclude
letters that involve motion, such as ‘j’ and ‘z’ in American Sign Language [50]. The impact of signer
variability on recognition systems is minimal in fingerspelling databases since most images only display
the signer’s hands. These datasets mainly consist of static images as the captured signs do not involve
motion [61,64,65].
Isolated sign datasets are the most widely used type of sign language datasets. They encompass
isolated sign words performed by one or more signers. Unlike fingerspelling databases, these databases
contain motion-based signs that require more training for non-expert signers. Vision-based techniques,
such as video cameras, are commonly used to capture these signs [66,72,82]. However, other devices, such as the Kinect, output multiple data streams for collecting sign words [69,71].
Continuous sign language databases comprise a collection of sign language sentences, where each
sentence consists of a continuous sequence of signs. This type of database presents more challenges
compared to the previous types, resulting in a relatively limited number of available databases.
Currently, only PHOENIX14 [76], PHOENIX14T [77] and CSL Database [78] are used regularly.
The scarcity of sign language datasets suitable for CSLR can be attributed to the time-consuming and
complex nature of dataset collection, the diversity of sign languages, and the difficulty of annotating
the data.
Upon analysis, it is evident that current sign language datasets have certain limitations.
In CSLR, a higher WER indicates lower recognition accuracy, and a lower WER indicates higher accuracy.
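Concretely, WER is the minimum number of word-level substitutions, deletions, and insertions needed to turn the recognized sentence into the reference sentence, divided by the number of reference words. The sketch below computes it with a standard edit-distance dynamic program; the gloss sequences in the example are invented for illustration only.

```python
# A minimal sketch of word error rate (WER) computation for CSLR output.
# WER = (substitutions + deletions + insertions) / number of reference words,
# obtained here with a standard Levenshtein (edit-distance) dynamic program.
def wer(reference, hypothesis):
    n, m = len(reference), len(hypothesis)
    # dp[i][j]: minimum edits turning the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[n][m] / max(n, 1)

# One substitution and one deletion over four reference glosses -> WER = 0.5
print(wer(["I", "GO", "SCHOOL", "TOMORROW"], ["I", "GO", "HOME"]))
```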
Figure 8: The basic structure of a CNN: input layer, convolutional layers, pooling layers, fully connected hidden layers, and output layer
The calculation of the first two values of the feature map is as follows:

$$\text{feature map}(1) = 1 \times 1 + 0 \times 1 + 1 \times 1 + 0 \times 0 + 1 \times 1 + 0 \times 1 + 1 \times 0 + 0 \times 0 + 1 \times 1 = 4$$

$$\text{feature map}(2) = 1 \times 1 + 0 \times 1 + 1 \times 0 + 0 \times 1 + 1 \times 1 + 0 \times 1 + 1 \times 0 + 0 \times 1 + 1 \times 1 = 3$$

Fig. 9 illustrates the convolution operation on an input. Assuming the input image has a shape of $H_{in} \times W_{in}$ and the convolution kernel has a shape of $K_h \times K_w$, the kernel is slid over the 2D input and the element-wise products are summed at each position, producing a 2D output of shape $H_{out} \times W_{out}$. The size of the output feature map is calculated as follows:

$$H_{out} = \frac{H_{in} - K_h + 2P_h}{S_h} + 1 \tag{7}$$

$$W_{out} = \frac{W_{in} - K_w + 2P_w}{S_w} + 1 \tag{8}$$

Here, $S_h$ and $S_w$ represent the stride in the vertical and horizontal directions, respectively, and $P_h$ and $P_w$ denote the padding size in the vertical and horizontal directions, respectively.
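As a quick sanity check of Eqs. (7) and (8), the short sketch below computes the output size for an illustrative configuration (the numbers are arbitrary, not taken from any model in this review):

```python
# Output feature-map size of a 2D convolution, following Eqs. (7) and (8).
def conv2d_output_size(h_in, w_in, k_h, k_w, s_h=1, s_w=1, p_h=0, p_w=0):
    h_out = (h_in - k_h + 2 * p_h) // s_h + 1
    w_out = (w_in - k_w + 2 * p_w) // s_w + 1
    return h_out, w_out

# Example: a 224 x 224 input, 3 x 3 kernel, stride 2, padding 1 -> (112, 112)
print(conv2d_output_size(224, 224, 3, 3, s_h=2, s_w=2, p_h=1, p_w=1))
```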
Unlike traditional 2D CNNs, which are primarily employed for image analysis, 3D CNNs are
designed specifically for the analysis of volumetric data. This could include video sequences or medical
scans. A 3D CNN uses 3D convolutions to extract features along both the spatial and temporal dimensions, allowing the model to capture motion information encoded within multiple adjacent frames [83]. This enables richer feature representation and more accurate decision-making in
tasks such as action recognition [10,84,85], video segmentation [86], and medical image analysis [87].
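As a small illustration (not a model from the cited works), a single 3D convolution over a short RGB clip can be written in PyTorch as follows; the clip size and channel counts are arbitrary:

```python
import torch
import torch.nn as nn

# A 3D convolution over a video clip: input shape is
# (batch, channels, frames, height, width).
conv3d = nn.Conv3d(in_channels=3, out_channels=16,
                   kernel_size=(3, 3, 3), stride=1, padding=1)

clip = torch.randn(1, 3, 16, 112, 112)  # 16 RGB frames of 112 x 112 pixels
features = conv3d(clip)                 # spatiotemporal feature maps
print(features.shape)                   # torch.Size([1, 16, 16, 112, 112])
```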
Activation Function: The output of the convolutional layer is passed through a non-linear
transformation using an activation function, commonly ReLU (Rectified Linear Unit), to introduce
non-linearity.
Pooling Layer: This layer performs downsampling on the feature maps, reducing their dimensions
while retaining important features.
Max pooling and average pooling are two common types of pooling used in deep learning models
as shown in Fig. 10. Max pooling is a pooling operation that selects the maximum value from a specific
region of the input data. Average pooling, on the other hand, calculates the average value of a specific
region of the input data [88].
Figure 10: The pooling operation (max pooling and average pooling)
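A small numeric illustration of the two pooling operations (the input values are arbitrary):

```python
import torch
import torch.nn.functional as F

# A 1 x 1 x 4 x 4 feature map; 2 x 2 pooling with stride 2 halves each spatial dimension.
x = torch.tensor([[[[1., 3., 2., 4.],
                    [5., 6., 1., 2.],
                    [7., 2., 9., 0.],
                    [1., 4., 3., 8.]]]])

print(F.max_pool2d(x, kernel_size=2))  # [[6., 4.], [7., 9.]]
print(F.avg_pool2d(x, kernel_size=2))  # [[3.75, 2.25], [3.50, 5.00]]
```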
Fully Connected Layer: The output of the pooling layer is connected to a fully connected neural
network, where feature fusion and classification are performed.
Output Layer: Depending on the task type, the output layer can consist of one or multiple neurons,
used for tasks such as classification, detection, or segmentation.
The structure diagram of CNNs can be adjusted and expanded based on specific network
architectures and task requirements. The provided diagram is a basic representation. In practical
applications, additional layers such as batch normalization and dropout can be added to improve
the model’s performance and robustness.
Unrolling this recurrent structure results in the structure shown on the right side of Fig. 11: Xt−1 to Xt+1 represent
sequentially input data at different time steps, and each time step’s input generates a corresponding
hidden state S. This hidden state at each time step is used not only to produce the output at that time
step but also participates in calculating the next time step’s hidden state.
RNNs are neural networks that excel at processing sequential data. However, they suffer from
the vanishing or exploding gradient problem, which hinders learning. To address this, variants like
Long Short-Term Memory Networks (LSTM) and Gated Recurrent Units (GRU) have been developed,
incorporating gating mechanisms to control information flow.
The input gate determines how much new information should be added to the memory cell. It
takes into account the current input and the previous output to make this decision. The forget gate
controls the amount of information that should be discarded from the memory cell. It considers the
current input and the previous output as well. Finally, the output gate determines how much of the
memory cell’s content should be outputted to the next step in the sequence. By using these gates, LSTM
networks can selectively remember or forget information at each time step, allowing them to capture
long-term dependencies in the data. This is particularly useful in tasks such as speech recognition,
machine translation, and sentiment analysis, where understanding the context of the entire sequence
is crucial.
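In standard notation, the three gates and the cell update described above are:

$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i), \qquad f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f), \qquad o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$$

$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c), \qquad c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad h_t = o_t \odot \tanh(c_t)$$

where $x_t$ is the current input, $h_{t-1}$ is the previous output, $c_t$ is the memory cell, $\sigma$ is the sigmoid function, and $\odot$ denotes element-wise multiplication.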
Another important characteristic of LSTM is its ability to handle gradient flow during training.
The vanishing gradient problem occurs when gradients become extremely small as they propagate
backward through time in traditional RNNs. LSTM addresses this issue by using a constant error
carousel, which allows gradients to flow more freely and prevents them from vanishing or exploding.
In terms of training, BiRNN typically employs backpropagation through time (BPTT) or gradient
descent algorithms to optimize the network parameters. However, the bidirectional nature of BiRNN
introduces challenges in training, as information from both directions needs to be synchronized. To
address this issue, techniques such as sequence padding and masking are commonly used. The basic
unit of a BiRNN can be a standard RNN, as well as a GRU or LSTM unit. In practice, for many
Natural Language Processing (NLP) problems involving text, the most used type of bidirectional RNN
model is the one with LSTM units.
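As a minimal illustration (the feature dimension and sequence length are arbitrary), a bidirectional LSTM over a sequence of per-frame sign features in PyTorch:

```python
import torch
import torch.nn as nn

# A bidirectional LSTM over per-frame features (e.g., keypoints or CNN features
# extracted from a sign video); forward and backward outputs are concatenated.
bilstm = nn.LSTM(input_size=128, hidden_size=256,
                 num_layers=2, batch_first=True, bidirectional=True)

frames = torch.randn(4, 30, 128)      # batch of 4 sequences, 30 time steps each
outputs, (h_n, c_n) = bilstm(frames)  # outputs: (4, 30, 512) = forward + backward
print(outputs.shape)
```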
5.3.1 VGGNet
VGGNet [99] was developed by the Visual Geometry Group at the University of Oxford and
has achieved remarkable performance in the ImageNet Large Scale Visual Recognition Challenge
(ILSVRC). One of VGGNet’s notable characteristics is its uniform architecture. It comprises multiple
stacked convolutional layers, allowing for deeper networks with 11 to 19 layers, enabling the network
to learn intricate features and patterns.
Among the variants of VGGNet, VGG-16 is particularly popular. As shown in Fig. 14, VGG-16
consists of 16 layers, including 13 convolutional layers and 3 fully connected layers. It utilizes 3 × 3
convolutional filters, followed by max-pooling layers with a 2 × 2 window and stride of 2. The number
of filters gradually increases from 64 to 512. The network also incorporates three fully connected layers
with 4096 units each, employing a ReLU activation function. The final output layer consists of 1000
units representing the classes in the ImageNet dataset, utilizing a softmax activation function.
While VGGNet has demonstrated success, it has limitations. The deep architecture of VGGNet
results in computationally expensive and memory-intensive operations, demanding substantial com-
putational resources. However, the transfer learning capability of VGGNet is a significant advan-
tage. With pre-trained VGG models, which have been trained on large datasets such as ImageNet,
researchers and practitioners can conveniently utilize them as a starting point for other computer
vision tasks. This has greatly facilitated research and development in the field.
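A common way to exploit this transfer-learning capability is to freeze the pre-trained convolutional backbone and replace only the classification head; the sketch below assumes torchvision ≥ 0.13 and an illustrative 26-class fingerspelling task:

```python
import torch.nn as nn
from torchvision import models

# Load VGG-16 pre-trained on ImageNet and reuse it as a fixed feature extractor.
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
for param in vgg.features.parameters():
    param.requires_grad = False          # freeze the convolutional backbone

# Replace the 1000-class ImageNet head with a task-specific classifier,
# e.g., 26 fingerspelling classes; only this layer is trained.
vgg.classifier[6] = nn.Linear(4096, 26)
```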
Figure 14: The architecture of VGG-16, from input to output: two 3 × 3 conv-64 layers; pooling; two 3 × 3 conv-128 layers; pooling; three 3 × 3 conv-256 layers; pooling; three 3 × 3 conv-512 layers; pooling; three 3 × 3 conv-512 layers; pooling; FC-4096; FC-4096; FC-1000; softmax
Over time, the Inception architecture of GoogLeNet evolved with versions such as Inception-v2 [101], Inception-
v3 [102], Inception-v4 [103], and Inception-ResNet [103]. These versions introduced various enhance-
ments, including batch normalization [101], optimized intermediate layers, label smoothing, and the
combination of Inception with ResNet’s residual connections. These improvements led to higher
accuracy, faster convergence, and better computational efficiency.
5.3.3 ResNet
ResNet, short for Residual Network, was introduced by Kaiming He and his team from Microsoft
Research in 2015 [104]. It was specifically designed to address the problem of degradation in very deep
neural networks.
Traditional deep neural networks face challenges in effectively learning transformations as they
become deeper. This is due to the vanishing or exploding gradients during backpropagation, making
it difficult to optimize the weights of deep layers. ResNet tackles this issue by introducing residual
connections, which learn the residual mapping—the difference between the input and output of a
layer [104]. The architecture of residual connections is illustrated in Fig. 16. The input passes through
convolutional layers and residual blocks. Each residual block contains multiple convolutional layers,
with the input added to the block’s output through a skip connection. This allows the gradient to flow
directly to earlier layers, addressing the vanishing gradient problem.
Mathematically, the residual connection is represented as H(x) = F(x) + x. Here, x is the input
to a layer, F(x) is the layer’s transformation, and H(x) is the output. The residual connection adds the
input x to the transformed output F(x), creating the residual mapping H(x). The network learns to
optimize this mapping during training. ResNet’s architecture has inspired the development of other
residual-based models like ResNeXt [105], Wide ResNet [106], and DenseNet [107], which have further
improved performance in various domains.
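A minimal residual block illustrating H(x) = F(x) + x (a simplified sketch, not the exact block used in the original ResNet variants):

```python
import torch.nn as nn

# A basic residual block: the body learns F(x), and the skip connection adds
# the input back, so the block outputs H(x) = F(x) + x.
class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)   # skip connection
```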
Figure 17: Comparison between standard convolutional layer and depthwise separable convolutions
MobileNet has multiple variations, including MobileNetV1 [108], MobileNetV2 [113], and
MobileNetV3 [114]. Each version improves upon the previous one by introducing new techniques
to further enhance efficiency and accuracy. MobileNetV2, for example, introduces inverted residual
blocks and linear bottleneck layers to achieve better performance. MobileNetV3 adds squeeze-and-excitation (channel attention) modules, the hard-swish activation, and platform-aware neural architecture search to further improve both speed and accuracy.
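The depthwise separable convolution at the core of the MobileNet family can be sketched as follows (a simplified block; the published architectures differ in details such as expansion layers and activations):

```python
import torch.nn as nn

# Depthwise separable convolution: a per-channel (depthwise) 3x3 convolution
# followed by a 1x1 pointwise convolution that mixes channels, replacing a
# standard 3x3 convolution at a fraction of the parameters and computation.
def depthwise_separable(in_ch: int, out_ch: int, stride: int = 1) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                  padding=1, groups=in_ch, bias=False),       # depthwise
        nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),  # pointwise
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )
```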
5.3.5 Transformer
The transformer model, proposed by Vaswani et al. in 2017, has emerged as a breakthrough in
the field of deep learning. Its self-attention mechanism, parallelizable computation, ability to handle
variable-length sequences, and interpretability have propelled it to the forefront of research in natural
language processing and computer vision [115].
The Transformer model architecture consists of two main components: the encoder and the
decoder. These components are composed of multiple layers of self-attention and feed-forward neural
networks, as shown in Fig. 18.
The encoder takes an input sequence and processes it to obtain a representation that captures the
contextual information of each element in the sequence. The input sequence is first embedded into a
continuous representation, which is then passed through a stack of identical encoder layers.
Each encoder layer in the Transformer model architecture has two sub-layers: a multi-head self-
attention mechanism and a feed-forward neural network. The self-attention mechanism allows the
model to attend to different parts of the input sequence when processing each element, capturing the
relationships and dependencies between elements. The feed-forward neural network applies a non-
linear transformation to each element independently, enhancing the model’s ability to capture complex
patterns in the data.
The decoder, on the other hand, generates an output sequence based on the representation
obtained from the encoder. It also consists of a stack of identical layers, but with an additional sub-
layer that performs multi-head attention over the encoder’s output. This allows the decoder to focus
on relevant parts of the input sequence when generating each element of the output sequence.
In addition to the self-attention mechanism, the Transformer model architecture incorporates
positional encodings to handle the order of elements in the input sequence. These positional encodings
are added to the input embeddings, providing the model with information about the relative positions
of elements. This enables the model to differentiate between different positions in the sequence.
At the core of the Transformer architecture is a form of attention called
“scaled dot-product attention”. This mechanism computes the attention weights between elements in
the sequence by taking the dot product of their query and key representations and scaling the result
by the square root of the representation dimension. The attention weights are then used to compute a
weighted sum of the value representations, which forms the output of the attention mechanism.
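In matrix form, this is commonly written as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where $Q$, $K$, and $V$ are the query, key, and value matrices derived from the input representations, and $d_k$ is the dimension of the keys.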
The impact of the Transformer architecture is evident in its state-of-the-art performance across
various domains, establishing it as a fundamental building block in modern deep learning models.
These new models, including GPT [116], BERT [117], T5, ViT [118], and DeiT, are all based on
the Transformer architecture and have achieved remarkable performance in a wide range of tasks
in natural language processing and computer vision, making significant contributions to these fields.
Table 4 (continued)

Abbreviations    Full names
DA               Data augmentation
DBN              Deep belief net
DR               Dropout techniques
DRL              Deep reinforcement learning
GSL              German sign language
H-GANs           Hyperparameter-based optimized generative adversarial networks
HP               Hand pose
IMU              Inertial measurement unit
IMUs             Inertial measurement units
ISL              Indian sign language
JSL              Japanese sign language
KD               Knowledge distillation
KSU-ArSL         King Saud University Arabic sign language dataset
LRN              Local response normalization
LSE              Spanish sign language
LSTM             Long short-term memory
MC-LSTMs         Multi-cue long short-term memory networks
MHA              Multi-head attention
MoSL             Moroccan sign language
PSL              Persian sign language
ReLU             Rectified linear unit
RKD              Random knowledge distillation strategy
RSL              Russian sign language
RST              Relative sign transformer
RTS              Rotated, translated and scaled
SA               Statistical attention
OF               Optical flow
SaSL             Saudi sign language
sEMG             Surface electromyography
SF               Scene flow
SLVM             Sign language video in museums dataset
SMKD             Self-mutual knowledge distillation
SP               Stochastic pooling
SSD              Single shot detector
STFE-Net         Spatial-temporal feature extraction network
ST-GCNs          Spatial-temporal graph convolutional networks
STMC             Spatial-temporal multi-cue
SVAE             Stacked variational auto-encoders
PHOENIX          RWTH-PHOENIX Weather
PHOENIX14        RWTH-PHOENIX Weather-2014
PHOENIX14T       RWTH-PHOENIX Weather-2014-T
TaSL             Tactical sign language
TFSL             Thai finger-spelling sign language
TuSL             Turkish sign language
ViT              Vision transformer
WER              Word error rate
WLASL            Word-level American sign language
ZSL              Zero-shot learning
Table 5 (continued)

Paper | Year | Language | Modality | Methods | Performance (Acc.)
[128] | 2022 | ASL | Image (RGB) | Ensemble learning (feature extraction: LeNet, AlexNet, VGGNet, GoogleNet, and ResNet; classification: ARS-MA) | 98.83%
[129] | 2022 | CSL | Data glove | CNN | 99.50%
[130] | 2022 | ASL | IMU sensor | Feature extraction: time/time-frequency domain/angle-based features; classification: CTC; recognition: encoder-decoder | Nearly 100% (within-user), 74.8% (cross-user)
[131] | 2022 | ASL | Image (RGB) | CNN | 99.38%
[132] | 2023 | BdSL | Image (RGB) | Deep transfer learning + random forest classifier | 91.67%
[133] | 2023 | ASL | Image (RGB) | MobileNetV2 | 98.77%
[134] | 2023 | ArSL | Image (RGB) | MobileNet | 94.46%
[135] | 2023 | AsSL | Image (RGB) | MediaPipe | 99%
[136] | 2023 | ISL | Image (RGB) | Transformer | 99.29%
[137] | 2023 | ISL | Image (RGB) | CNN (data augmentation, BN, dropout, stochastic pooling, diffGrad optimizer) | 99.76%
[138] | 2023 | ASL, BdSL | Image (RGB) | Attention + MobileNetV2 | 99.95%, 92.1%
In recent years, there has been rapid development in the field of deep transfer learning and
ensemble learning, and a set of pre-trained models has been applied to fingerspelling recognition.
Sandler et al. [113] introduced two methods for automatic recognition of the BdSL alphabet, utilizing
conventional transfer learning and contemporary zero-shot learning (ZSL) to identify both seen
and unseen data. Through extensive quantitative experiments on 18 CNN architectures and 21
classifiers, the pre-trained DenseNet201 architecture demonstrated exceptional performance as a
feature extractor. The top-performing classifier, identified as Linear Discriminant Analysis, achieved
an impressive overall accuracy of 93.68% on the extensive dataset used in the study. Podder et al. [127]
compared the classification performance with and without background images to determine the
optimal working model for BdSL alphabet classification. Three pre-trained CNN models, namely
ResNet18 [104], MobileNet_V2 [113], and EfficientNet_B1 [111], were used for classification. It was
found that ResNet18 achieved the highest accuracy of 99.99%. Ma et al. [128] proposed an ASL
recognition system based on ensemble learning, utilizing multiple pre-trained CNN models including
LeNet, AlexNet, VGGNet, GoogleNet, and ResNet for feature extraction. The system incorporated
accuracy-based weighted voting (ARS-MA) to improve the recognition performance. Das et al. [132]
proposed a hybrid model combining a deep transfer learning-based CNN with a random forest
classifier for automatic recognition of BdSL alphabet.
Some models have combined two or more approaches in order to boost the recognition accuracy.
Aly et al. [120] presented a novel user-independent recognition system for the ASL alphabet. This
system utilized the PCANet, a principal component analysis network, to extract features from depth
images captured by the Microsoft Kinect depth sensor. The extracted features were then classified
using a linear support vector machine (SVM) classifier. Rivera-Acosta et al. [126] proposed a novel
approach to address the accuracy loss when training models to interpret completely unseen data.
The model presented in this paper consists of two primary data processing stages. In the first stage,
YOLO was employed for handshape segmentation and classification. In the second stage, a Bi-LSTM
was incorporated to enhance the system with spelling correction functionality, thereby increasing its robustness to completely unseen data.
Some SLR works have been deployed on embedded systems and edge devices, such as mobile
devices and the Raspberry Pi. Nareshkumar et al. [133] utilized MobileNetV2 on terminal devices to
achieve fast and accurate recognition of letters in ASL, reaching an accuracy of 98.77%. MobileNet
was utilized to develop a model for recognizing the Arabic language’s alphabet signs, with a recognition
accuracy of 94.46% [134]. Zhang et al. [138] introduced a novel lightweight network model for alphabet
recognition, incorporating an attention mechanism. Experimental results on the ASL dataset and
BdSL dataset demonstrated that the proposed model outperformed existing methods in terms of
performance. Ang et al. [139] implemented a fingerspelling recognition model for Filipino Sign
Language using Raspberry Pi. They used YOLO-Lite for hand detection and MobileNetV2 for
classification, achieving an average accuracy of 93.29% in differentiating 26 hand gestures representing
FSL letters. Siddique et al. [140] developed an automatic Bangla sign language (BSL) detection system
using deep learning approaches and a Jetson Nano edge device.
utilized VGG-19 for spatial feature extraction and employed BiLSTM for temporal feature extraction.
Experimental results demonstrated that the proposed HCBSLR system achieved an average accuracy
of 87.67%.
Due to limited storage and computing capacities on mobile phones, the implementation of
SLR applications is often restricted. To address this issue, Abdallah et al. [156] proposed the use
of lightweight deep neural networks with advanced processing for real-time dynamic sign language
recognition (DSLR). The application leveraged two robust deep learning models, namely the GRU
and the 1D CNN, in conjunction with the MediaPipe framework. Experimental results demonstrated
that the proposed solution could achieve extremely fast and accurate recognition of dynamic signs, even
in real-time detection scenarios. The DSLR application achieved high accuracies of 98.8%, 99.84%,
and 88.40% on the DSL-46, LSA64, and LIBRAS-BSL datasets, respectively. Li et al. [153] presented
MyoTac, a user-independent real-time tactical sign language classification system. The network was
made lightweight through knowledge distillation by designing tactical CNN and BiLSTM to capture
spatial and temporal features of the signals. Soft targets were extracted using knowledge distillation
to compress the neural network scale nearly four times without affecting the accuracy.
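The general recipe behind this kind of distillation-based compression (a generic sketch, not the exact MyoTac training procedure) is to train the small student network against the teacher's softened outputs in addition to the ground-truth labels; the temperature and weighting below are illustrative:

```python
import torch.nn.functional as F

# Generic knowledge-distillation loss: the student matches the teacher's
# softened output distribution (soft targets) as well as the true labels.
def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```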
Most studies on SLR have traditionally focused on manual features extracted from the shape of the
dominant hand or the entire frame. However, it is important to consider facial expressions and body
gestures. Shaik et al. [147] proposed an isolated SLR framework that utilized Spatial-Temporal Graph
Convolutional Networks (ST-GCNs) [151,152] and Multi-Cue Long Short-Term Memories (MC-
LSTMs) to leverage multi-articulatory information (such as body, hands, and face) for recognizing
sign glosses.
using a deep BiLSTM. Deep belief net (DBN) was applied to the field of wearable-sensor-based CSL
recognition [144]. To obtain multi-view deep features for recognition, Shaik et al. [147] proposed using
an end-to-end trainable multi-stream CNN with late feature fusion. The fused multi-view features are
then fed into a two-layer dense network and a softmax layer for decision-making. Eunice et al. [161]
proposed a novel approach for gloss prediction using the Sign2Pose Gloss prediction transformer.
Table 7 (continued)

Paper | Year | Language | Modality1 | Database (sentences/signs/signers) | Methods | Components | Performance
[164] | 2019 | GSL | RGB | PHOENIX14; SIGNUM | CNN + stacked temporal fusion + BiLSTM + iterative optimization | Full + OF | 22.86% (WER); 2.8% (WER)
[164] | 2019 | GSL; CSL | RGB | PHOENIX14; CSL Dataset | Feature learning: 3D-ResNet; sequence modelling: encoder-decoder with LSTM and CTC | Full | 36.7% (WER); 32.7% (WER)
[12] | 2019 | ISL | Leap motion sensor | 157/35/6 | Sub_units + 2DCNN + modified LSTM | H | 72.3% (Acc.)
[166] | 2019 | ISL | Sensor | 20/–/10 | CapsNet | H | 94% (Acc.)
[167] | 2020 | GSL; GSL; CSL | RGB | PHOENIX14; PHOENIX14T; CSL | Video Encoder (CNN + stacked 1D temporal convolution layers + BiLSTM) + Text Encoder (LSTM) + Latent Space Alignment + Decoder | Full | 24.0% (WER); 24.3% (WER); 2.4% (WER, Split I)
[168] | 2020 | GSL | RGB | PHOENIX 2014 T | Multi-Stream CNN-LSTM-HMMs | H | 73.4% (Acc.)
[169] | 2020 | GSL; CSL | RGB | PHOENIX14; CSL | CNN-TCN visual encoder, sequential model and text encoder, with cross modality augmentation | Full | 21.9% (WER); 24.5% (WER)
[170] | 2021 | GSL | RGB | PHOENIX 2014 T | Spatiotemporal feature extractor with iteratively fine-tuned sequence model (BiLSTM + CTC) | Full | 34.4% (WER)
[171] | 2021 | GSL | RGB | PHOENIX 2014 T | GRU-RST | Full | 23.5% (WER)
[172] | 2021 | GSL | RGB | PHOENIX14 | H-GAN (LSTM + 3DCNN) | Full | 20.7% (WER)
[173] | 2021 | GSL | RGB | PHOENIX14; PHOENIX14T | SMKD + CTC | Full | 21.0% (WER); 22.4% (WER)
[174] | 2021 | GSL; CSL; HKSL | RGB | PHOENIX14; CSL Dataset; HKSL | SignBERT (BERT + ResNet) | Full | 20.2% (WER); 23.3% (WER); 12.35% (WER)
[11] | 2022 | CSL; GSL; GrSL; HKSL | RGB; RGB + smart watch data | CSL Dataset; PHOENIX14; GrSL; HKSL | CA-SignBERT (BERT + cross-attention + CNN + BiLSTM + CTC loss) | Full | 19.8% (WER); 18.6% (WER); 31.15% (WER); 7.19% (WER)
[175] | 2022 | CSL | Sensor | 60/–/– | DeepSLR (attention-based encoder-decoder model + multi-channel CNN) | H | 10.8% (WER)
[176] | 2022 | GSL; CSL; GSL | RGB | PHOENIX14; CSL Dataset; PHOENIX14T | STMC (SMC + TMC + Encoder + Decoder) | H+F+B | 20.7% (WER); 28.6% (WER); 21.0% (WER)
[177] | 2022 | GSL; CSL | RGB | PHOENIX14; CSL | Two-stream ResNet34 + Transformer | Full+H+F | 16.72% (WER); 87.1% (Acc.)
[178] | 2022 | CSL | RGB | CSL | 3D-MobileNetV2 + RKD | Full | 2.2% (WER, Split I)
[179] | 2022 | GSL; CSL | RGB | PHOENIX14; CSL | Multilingual SLR framework: CNN-TCN visual feature extractor, language-independent BLSTM-CTC branches, together with a shared BLSTM initialized with language embeddings | Full | 20.9% (WER); 18.1% (WER)
[180] | 2023 | CSL | Sensor | OH-Sentence (723/–/24); TH-Sentence (182/–/14) | SeeSign: Transformer + SA + CA | Full | 18.34% (WER); 22.08% (WER)
[181] | 2023 | ISL | Sensor | 40/–/– | CNN + BiLSTM + CTC + transfer learning | Full | 15.14 ± 1.59
[182] | 2023 | CSL | RGB | CSL | TrCLR (Transformer) | Full | 96.6% (Acc.)
[183] | 2023 | CSL | RGB | 60//21000/– | STFE-Net (Bi-GRU + Transformer) | Full | –
[184] | 2023 | CSL | RGB, skeleton data | CSL | Spatial-temporal graph attention network + BLSTM | Full | 1.59% (WER)
[185] | 2023 | – | RGB | PHOENIX14; PHOENIX14 | Self-supervised pre-training + downstream fine-tuning + multi-level masked modeling strategies | H+HP | 20.0% (WER); 19.9% (WER)

Note: 1 Some databases contain multiple data modalities, such as RGB and Depth. However, not all of them are used in the algorithms. The table only shows the modalities used in the algorithms.
The two-stream ResNet34 + Transformer model [177] has shown promising performance, with a WER of 16.72%. Zhang et al. [180] proposed SeeSign, a
multimodal fusion transformer framework for SLR. SeeSign incorporated two attention mechanisms,
namely statistical attention and contrastive attention, to thoroughly investigate the intra-modal and
inter-modal correlations present in surface Electromyography (sEMG) and inertial measurement unit
(IMU) signals, and effectively fuse the two modalities. The experimental results showed that SeeSign
achieved a WER of 18.34% and 22.08% on the OH-Sentence and TH-Sentence datasets, respectively.
Jiang et al. [182] presented TrCLR, a novel Transformer-based model for CSLR. To extract features,
they employed the CLIP4Clip video retrieval method, while the overall model architecture adopts an
end-to-end Transformer structure. The CSL dataset, consisting of sign language data, is utilized for
this experiment. The experimental results demonstrated that TrCLR achieved an accuracy of 96.3%.
Hu et al. [183] presented a spatial-temporal feature extraction network (STFE-Net) for continuous
sign language translation (CSLT). The spatial feature extraction network (SFE-Net) selected 53 key
points related to sign language from the 133 key points in the COCO-WholeBody dataset. The
temporal feature extraction network (TFE-Net) utilized a Transformer to implement temporal feature
extraction, incorporating relative position encoding and position-aware self-attention optimization.
The proposed model achieved BLEU-1 = 77.59, BLEU-2 = 75.62, BLEU-3 = 74.25, and BLEU-4 =
72.14 on a Chinese continuous sign language dataset collected by the researchers themselves.
BERT (Bidirectional Encoder Representations from Transformers) is based on the Transformer
architecture and is pre-trained on a large corpus of unlabeled text data [117]. This pre-training
allows BERT to be fine-tuned for various NLP tasks, achieving remarkable performance across
multiple domains. Zhou et al. [174] developed a deep learning framework called SignBERT. SignBERT
combined the BERT with the ResNet to effectively model underlying sign languages and extract
spatial features for CSLR. In another study, Zhou et al. [11] developed a BERT-based deep learning
framework named CA-SignBERT for CSLR. The proposed CA-SignBERT framework consisted of the
cross-attention mechanism and the weight control module. Experimental results demonstrated that
the CA-SignBERT framework attained the lowest WER in both the validation set (18.3%) and test set
(18.6%) of the PHOENIX14.
X3Ds to create compact and fast spatiotemporal models for continuous sign language tasks. In
order to enhance their performance, they also implemented a random knowledge distillation strategy
(RKD).
Furthermore, user-independency and generalization across different sign languages and users are
crucial for CSLR systems. Developing models that can adapt to different signing styles, regional
variations, and individual preferences is a complex task that requires extensive training data and robust
algorithms.
One of the major challenges in SLR is capturing and understanding the complex spatio-temporal
nature of sign language. On one hand, sign language involves dynamic movements that occur over
time, and recognizing and interpreting these temporal dependencies is essential for understanding the
meaning of signs. On the other hand, SLR also relies on spatial information, particularly the precise
hand shape, hand orientation, and hand trajectory. To address these challenges, researchers in the
field are actively exploring advanced deep learning techniques, to effectively capture and model the
spatio-temporal dependencies in sign language data.
(4) Limited contextual information
Capturing and utilizing contextual information in CSLR systems remains a challenge. Under-
standing the meaning of signs in the context of a sentence is crucial for accurate recognition.
Incorporating linguistic knowledge and language modeling techniques can help interpret signs in the
context of the sentence, improving accuracy and reducing ambiguity.
(5) Real-time processing and latency
Achieving real-time and low-latency CSLR systems while maintaining high accuracy poses
computational challenges. Developing efficient algorithms and optimizing computational resources
can enable real-time processing. Techniques like parallel processing, model compression, and hardware
acceleration should be explored to minimize latency and ensure a seamless user experience.
(6) Generalization to new users and sign languages
Generalizing CSLR models to new users and sign languages is complex. Adapting models to
different users’ signing styles and accommodating new sign languages require additional training data
and adaptation techniques. Transfer learning can be employed to generalize CSLR models across
different sign languages, reducing the need for extensive language-specific training data. Exploring
multilingual CSLR models that can recognize multiple sign languages simultaneously can also improve
generalization.
By addressing these limitations, the field can make significant progress in advancing the accuracy, efficiency, and generalization capabilities of SLR systems.
(7) The contradiction between model accuracy and computational power
As models become more complex and accurate, they often require a significant amount of
computational power to train and deploy. This can limit their practicality and scalability in real-world
applications. To address this contradiction, several approaches can be considered:
a) Explore techniques to optimize and streamline the model architecture to reduce computational requirements without sacrificing accuracy. This can include techniques like model compression, pruning, or quantization, which aim to reduce the model size and computational complexity while maintaining performance (a minimal quantization sketch follows this list).
b) Develop lightweight network architectures specifically designed for SLR. These architectures
aim to reduce the number of parameters and operations required for inference while maintain-
ing a reasonable level of accuracy, such as [187–190].
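For instance, post-training dynamic quantization of an already trained recognition model is a one-line operation in PyTorch (a generic sketch; the model and the quantized layer types are placeholders):

```python
import torch
import torch.nn as nn

# Post-training dynamic quantization: weights of the listed layer types are
# stored in int8, shrinking the model and speeding up CPU inference.
def quantize_for_mobile(model: nn.Module) -> nn.Module:
    return torch.quantization.quantize_dynamic(
        model, {nn.Linear, nn.LSTM}, dtype=torch.qint8)
```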
Acknowledgement: Thanks to three anonymous reviewers and the editors of this journal for providing
valuable suggestions for the paper.
Funding Statement: This work was supported by the National Philosophy and Social Sciences Foundation (Grant No. 20BTQ065).
Author Contributions: The authors confirm contribution to the paper as follows: study conception and
design: Yanqiong Zhang, Xianwei Jiang; data collection: Yanqiong Zhang; analysis and interpretation
of results: Yanqiong Zhang, Xianwei Jiang; draft manuscript preparation: Yanqiong Zhang. All
authors reviewed the results and approved the final version of the manuscript.
Availability of Data and Materials: All the reviewed research literature and used data in this manuscript
includes scholarly articles, conference proceedings, books, and reports that are publicly available. The
references and citations can be found in the reference list of this manuscript and are accessible through
online databases, academic libraries, or by contacting the publishers directly.
Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the
present study.
References
1. Stokoe, W. C. (1960). Sign language structure. Buffalo: University of Buffalo Press.
2. Batnasan, G., Gochoo, M., Otgonbold, M. E., Alnajjar, F., Shih, T. K. (2022). ArSL21L: Ara-
bic sign language letter dataset benchmarking and an educational avatar for metaverse applications.
2022 IEEE Global Engineering Education Conference (EDUCON), pp. 1814–1821. Tunis, Tunisia.
https://doi.org/10.1109/EDUCON52537.2022.9766497
3. Marzouk, R., Alrowais, F., Al-Wesabi, F. N., Hilal, A. M. (2022). Atom search optimization with deep
learning enabled arabic sign language recognition for speaking and hearing disability persons. Healthcare,
10(9), 1606. https://doi.org/10.3390/healthcare10091606
4. Amrani, N. E. A., Abra, O. E. K., Youssfi, M., Bouattane, O. (2019). A new interpreta-
tion technique of traffic signs, based on deep learning and semantic web. 2019 Third Interna-
tional Conference on Intelligent Computing in Data Sciences (ICDS), pp. 1–6. Maui, HI, USA.
https://doi.org/10.1109/ICDS47004.2019.8942319
5. Zhu, Y., Liao, M., Yang, M., Liu, W. (2018). Cascaded segmentation-detection networks for text-
based traffic sign detection. IEEE Transactions on Intelligent Transportation Systems, 19(1), 209–219.
https://doi.org/10.1109/TITS.2017.2768827
6. Canese, L., Cardarilli, G. C., Di Nunzio, L., Fazzolari, R., Ghadakchi, H. F. et al. (2022). Sensing
and detection of traffic signs using CNNs: An assessment on their performance. Sensors, 22(22), 8830.
https://doi.org/10.3390/s22228830
7. Manoharan, Y., Saxena, S., D., R. (2022). A vision-based smart human computer
interaction system for hand-gestures recognition. 2022 1st International Conference on
Computational Science and Technology (ICCST), pp. 321–324. Sharjah, United Arab Emirates.
https://doi.org/10.1109/ICCST55948.2022.10040464
8. Hmida, I., Romdhane, N. B. (2022). Arabic sign language recognition algorithm based on deep learning
for smart cities. The 3rd International Conference on Distributed Sensing and Intelligent Systems (ICDSIS
2022), pp. 119–127. Sharjah, United Arab Emirates. https://doi.org/10.1049/icp.2022.2426
9. Voulodimos, A., Doulamis, N., Doulamis, A., Protopapadakis, E. (2018). Deep learning for
computer vision: A brief review. Computational Intelligence and Neuroscience, 2018, 7068349.
https://doi.org/10.1155/2018/7068349
10. Liang, Z. J., Liao, S. B., Hu, B. Z. (2018). 3D convolutional neural networks for dynamic sign language
recognition. Computer Journal, 61(11), 1724–1736. https://doi.org/10.1093/comjnl/bxy049
11. Zhou, Z., Tam, V. W. L., Lam, E. Y. (2022). A cross-attention BERT-based framework
for continuous sign language recognition. IEEE Signal Processing Letters, 29, 1818–1822.
https://doi.org/10.1109/LSP.2022.3199665
12. Mittal, A., Kumar, P., Roy, P. P., Balasubramanian, R., Chaudhuri, B. B. (2019). A modified LSTM model
for continuous sign language recognition using leap motion. IEEE Sensors Journal, 19(16), 7056–7063.
https://doi.org/10.1109/JSEN.2019.2909837
13. Sharma, S., Kumar, K. (2021). ASL-3DCNN: American sign language recognition technique
using 3-D convolutional neural networks. Multimedia Tools and Applications, 80(17), 26319–26331.
https://doi.org/10.1007/s11042-021-10768-5
14. Luqman, H., El-Alfy, E. S. M. (2021). Towards hybrid multimodal manual and non-manual
Arabic sign language recognition: mArSL database and pilot study. Electronics, 10(14), 1739.
https://doi.org/10.3390/electronics10141739
15. Xue, C., Yu, M., Yan, G., Qin, M., Liu, Y. et al. (2022). A multi-modal fusion framework for continuous
sign language recognition based on multi-layer self-attention mechanism. Journal of Intelligent & Fuzzy
Systems, 43(4), 4303–4316. https://doi.org/10.3233/JIFS-211697
16. Kudrinko, K., Flavin, E., Zhu, X., Li, Q. (2021). Wearable sensor-based sign language
recognition: A comprehensive review. IEEE Reviews in Biomedical Engineering, 14, 82–97.
https://doi.org/10.1109/RBME.2020.3019769
17. Ahmed, M. A., Zaidan, B. B., Zaidan, A. A., Salih, M. M., Lakulu, M. M. B. (2018). A review on systems-
based sensory gloves for sign language recognition state of the art between 2007 and 2017. Sensors, 18(7),
2208. https://doi.org/10.3390/s18072208
18. Lu, C., Amino, S., Jing, L. (2023). Data glove with bending sensor and inertial sensor
based on weighted DTW fusion for sign language recognition. Electronics, 12(3), 613.
https://doi.org/10.3390/electronics12030613
19. Alzubaidi, M. A., Otoom, M., Abu Rwaq, A. M. (2023). A novel assistive glove to convert arabic sign
language into speech. ACM Transactions on Asian and Low-Resource Language Information Processing,
22(2), 1–16. https://doi.org/10.1145/3545113
20. DelPreto, J., Hughes, J., D’Aria, M., de Fazio, M., Rus, D. (2022). A wearable smart glove and its
application of pose and gesture detection to sign language classification. IEEE Robotics and Automation
Letters, 7(4), 10589–10596. https://doi.org/10.1109/LRA.2022.3191232
21. Oz, C., Leu, M. C. (2011). American sign language word recognition with a sensory glove
using artificial neural networks. Engineering Applications of Artificial Intelligence, 24(7), 1204–1213.
https://doi.org/10.1016/j.engappai.2011.06.015
22. Dias, T. S., Alves Mendes Junior, J. J., Pichorim, S. F. (2022). An instrumented glove for
recognition of Brazilian Sign Language Alphabet. IEEE Sensors Journal, 22(3), 2518–2529.
https://doi.org/10.1109/JSEN.2021.3136790
23. Wen, F., Zhang, Z., He, T., Lee, C. (2021). AI enabled sign language recognition and VR space
bidirectional communication using triboelectric smart glove. Nature Communications, 12(1), Article 1.
https://doi.org/10.1038/s41467-021-25637-w
24. Lee, C. K. M., Ng, K. K. H., Chen, C. H., Lau, H. C. W., Chung, S. Y. et al. (2021). American sign language
recognition and training method with recurrent neural network. Expert Systems with Applications, 167,
114403. https://doi.org/10.1016/j.eswa.2020.114403
25. Abdullahi, S. B., Chamnongthai, K. (2022). American sign language words recognition of
skeletal videos using processed video driven multi-stacked deep LSTM. Sensors, 22(4), 1406.
https://doi.org/10.3390/s22041406
26. Abdullahi, S. B., Chamnongthai, K. (2022). American sign language words recognition using spatio-
temporal prosodic and angle features: A sequential learning approach. IEEE Access, 10, 15911–15923.
https://doi.org/10.1109/ACCESS.2022.3148132
27. Li, J., Zhong, J., Wang, N. (2023). A multimodal human-robot sign language interaction framework applied
in social robots. Frontiers in Neuroscience, 17, 1168888. https://doi.org/10.3389/fnins.2023.1168888
28. Pacifici, I., Sernani, P., Falcionelli, N., Tomassini, S., Dragoni, A. F. (2020). A surface electromyography
and inertial measurement unit dataset for the Italian Sign Language alphabet. Data in Brief, 33, 106455.
https://doi.org/10.1016/j.dib.2020.106455
29. Mendes Junior, J. J. A., Freitas, M. L. B., Campos, D. P., Farinelli, F. A., Stevan, S. L. et al. (2020). Analysis
of influence of segmentation, features, and classification in sEMG processing: A case study of recognition
of Brazilian sign language alphabet. Sensors, 20(16), 4359. https://doi.org/10.3390/s20164359
30. Gu, Y., Zheng, C., Todoh, M., Zha, F. (2022). American sign language translation using wearable
inertial and electromyography sensors for tracking hand movements and facial expressions. Frontiers in
Neuroscience, 16, 962141. https://doi.org/10.3389/fnins.2022.962141
31. Tateno, S., Liu, H., Ou, J. (2020). Development of sign language motion recognition system for hearing-
impaired people using electromyography signal. Sensors, 20(20), 5807. https://doi.org/10.3390/s20205807
32. Khomami, S. A., Shamekhi, S. (2021). Persian sign language recognition using IMU and surface EMG
sensors. Measurement, 168, 108471. https://doi.org/10.1016/j.measurement.2020.108471
33. Zhang, N., Zhang, J., Ying, Y., Luo, C., Li, J. (2022). Wi-Phrase: Deep residual-multihead model
for WiFi sign language phrase recognition. IEEE Internet of Things Journal, 9(18), 18015–18027.
https://doi.org/10.1109/JIOT.2022.3164243
34. Thariq Ahmed, H. F., Ahmad, H., Phang, S. K., Harkat, H., Narasingamurthi, K. (2021). Wi-Fi CSI based
human sign language recognition using LSTM network. 2021 IEEE International Conference on Industry
4.0, Artificial Intelligence, and Communications Technology (IAICT), pp. 51–57. Bandung, Indonesia.
https://doi.org/10.1109/IAICT52856.2021.9532548
35. Zhang, L., Zhang, Y., Zheng, X. (2020). WiSign: Ubiquitous American sign language recognition
using commercial Wi-Fi devices. ACM Transactions on Intelligent Systems and Technology, 11(3), 1–24.
https://doi.org/10.1145/3377553
36. Chen, H., Feng, D., Hao, Z., Dang, X., Niu, J. et al. (2022). Air-CSL: Chinese sign language recognition
based on the commercial WiFi devices. Wireless Communications and Mobile Computing, 2022, 5885475.
https://doi.org/10.1155/2022/5885475
37. Gurbuz, S. Z., Rahman, M. M., Kurtoglu, E., Malaia, E., Gurbuz, A. C. et al. (2022). Multi-frequency
RF sensor fusion for word-level fluent ASL recognition. IEEE Sensors Journal, 22(12), 11373–11381.
https://doi.org/10.1109/JSEN.2021.3078339
38. Hameed, H., Usman, M., Khan, M. Z., Hussain, A., Abbas, H. et al. (2022). Privacy-preserving
British sign language recognition using deep learning. 2022 44th Annual International Conference of the
IEEE Engineering in Medicine & Biology Society (EMBC), pp. 4316–4319. Glasgow, Scotland, UK.
https://doi.org/10.1109/EMBC48229.2022.9871491
39. Kulhandjian, H., Sharma, P., Kulhandjian, M., D’Amours, C. (2019). Sign language gesture recognition
using Doppler radar and deep learning. 2019 IEEE Globecom Workshops (GC Wkshps), pp. 1–6. Waikoloa,
HI, USA. https://doi.org/10.1109/GCWkshps45667.2019.9024607
40. McCleary, J., García, L. P., Ilioudis, C., Clemente, C. (2021). Sign language recognition using
micro-Doppler and explainable deep learning. 2021 IEEE Radar Conference (RadarConf21), pp. 1–6. Atlanta,
GA, USA. https://doi.org/10.1109/RadarConf2147009.2021.9455257
41. Rahman, M. M., Mdrafi, R., Gurbuz, A. C., Malaia, E., Crawford, C. et al. (2021). Word-level sign
language recognition using linguistic adaptation of 77 GHz FMCW radar data. 2021 IEEE Radar
Conference (RadarConf21), pp. 1–6. https://doi.org/10.1109/RadarConf2147009.2021.9455190
42. Cerna, L. R., Cardenas, E. E., Miranda, D. G., Menotti, D., Camara-Chavez, G. (2021). A multimodal
LIBRAS-UFOP Brazilian sign language dataset of minimal pairs using a Microsoft Kinect sensor. Expert
Systems with Applications, 167, 114179. https://doi.org/10.1016/j.eswa.2020.114179
43. Lee, G. C., Yeh, F. H., Hsiao, Y. H. (2016). Kinect-based Taiwanese sign-language recognition system.
Multimedia Tools and Applications, 75(1), 261–279. https://doi.org/10.1007/s11042-014-2290-x
44. Sun, C., Zhang, T., Xu, C. (2015). Latent support vector machine modeling for sign language
recognition with Kinect. ACM Transactions on Intelligent Systems and Technology, 6(2), 1–20.
https://doi.org/10.1145/2629481
45. Ansari, Z. A., Harit, G. (2016). Nearest neighbour classification of Indian sign language gestures
using Kinect camera. Sadhana-Academy Proceedings in Engineering Sciences, 41(2), 161–182.
https://doi.org/10.1007/s12046-015-0405-3
46. Raghuveera, T., Deepthi, R., Mangalashri, R., Akshaya, R. (2020). A depth-based Indian sign language
recognition using Microsoft Kinect. Sādhanā, 45(1), 34. https://doi.org/10.1007/s12046-019-1250-6
47. Gangrade, J., Bharti, J., Mulye, A. (2022). Recognition of Indian sign language using ORB
with bag of visual words by Kinect sensor. IETE Journal of Research, 68(4), 2953–2967.
https://doi.org/10.1080/03772063.2020.1739569
48. Yang, H. D. (2015). Sign language recognition with the Kinect sensor based on conditional random fields.
Sensors, 15(1), 135–147. https://doi.org/10.3390/s150100135
49. Kraljević, L., Russo, M., Pauković, M., Šarić, M. (2020). A dynamic gesture recognition
interface for smart home control based on Croatian sign language. Applied Sciences, 10(7), 2300.
https://doi.org/10.3390/app10072300
50. Pugeault, N., Bowden, R. (2011). Spelling it out: Real-time ASL fingerspelling recognition. IEEE
International Conference on Computer Vision Workshops, ICCV 2011 Workshops, Barcelona, Spain.
51. Wikipedia (2023). ASL fingerspelling. https://en.wikipedia.org/wiki/American_manual_alphabet
(accessed on 14/10/2023)
52. Camgoz, N. C., Hadfield, S., Koller, O., Ney, H. (2014). RWTH-PHOENIX-Weather 2014 T: Parallel
corpus of sign language video, gloss and translation. https://www-i6.informatik.rwth-aachen.de/
~koller/RWTH-PHOENIX/ (accessed on 20/10/2023)
53. Paudyal, P. (2018). “American sign language (ASL) Fingerspelling dataset for Myo Sensor”, Mendeley
Data. https://doi.org/10.17632/dbymbhhpk9.1
54. Shi, B., Del Rio, A. M., Keane, J., Michaux, J. et al. (2018). American sign language fingerspelling
recognition in the wild. 2018 IEEE Workshop on Spoken Language Technology (SLT 2018), pp. 145–152.
Athens, Greece.
55. Shi, B., Del Rio, A. M., Keane, J., Michaux, J., Brentari, D. et al. (2018).
Chicago Fingerspelling in the Wild Data Sets (ChicagoFSWild, ChicagoFSWild+).
https://home.ttic.edu/~klivescu/ChicagoFSWild.htm (accessed on 23/10/2023)
56. Gao, Y., Zhang, Y., Jiang, X. (2022). An optimized convolutional neural network with combination blocks
for Chinese sign language identification. Computer Modeling in Engineering & Sciences, 132(1), 95–117.
https://doi.org/10.32604/cmes.2022.019970
57. Latif, G., Mohammad, N., Alghazo, J., AlKhalaf, R., AlKhalaf, R. (2019). ArASL: Arabic alphabets sign
language dataset. Data in Brief, 23, 103777. https://doi.org/10.1016/j.dib.2019.103777
58. Munkhjargal, G. (2018). ArSL. https://www.kaggle.com/datasets/alaatamimi/arsl2018 (accessed on
21/10/2023)
59. Joze, H. R. V., Koller, O. (2018). MS-ASL: A large-scale data set and benchmark for understanding
American sign language. https://doi.org/10.48550/arXiv.1812.01053
60. Koller, O. (2019). Papers with code—MS-ASL dataset. https://paperswithcode.com/dataset/ms-asl
(accessed on 21/10/2023)
61. Huang, J., Zhou, W., Li, H., Li, W. (2019). Attention-based 3D-CNNs for large-vocabulary sign
language recognition. IEEE Transactions on Circuits and Systems for Video Technology, 29(9), 2822–2832.
https://doi.org/10.1109/TCSVT.2018.2870740
62. Huang, J., Zhou, W., Li, H., Li, W. (2019). Papers with code—CSL dataset. https://paperswithcode.com/
dataset/csl (accessed on 21/10/2023)
63. Sridhar, A., Ganesan, R. G., Kumar, P., Khapra, M. (2020). INCLUDE: A large scale dataset for
Indian sign language recognition. Proceedings of the 28th ACM International Conference on Multimedia,
pp. 1366–1375. Seattle, WA, USA. https://doi.org/10.1145/3394171.3413528
64. Sridhar, A., Ganesan, R. G., Kumar, P., Khapra, M. (2020). INCLUDE.
https://zenodo.org/records/4010759 (accessed on 14/10/2023)
65. Li, D., Opazo, C. R., Yu, X., Li, H. (2020). Word-level deep sign language recognition from
video: A new large-scale dataset and methods comparison. 2020 IEEE Winter Conference on
Applications of Computer Vision (WACV), pp. 1448–1458. Snowmass, CO, USA.
https://doi.org/10.1109/WACV45572.2020.9093512
66. Dongxu (2023). WLASL: A large-scale dataset for Word-Level American sign language (WACV '20 Best
Paper Honourable Mention). https://github.com/dxli94/WLASL (accessed on 21/10/2023)
67. Sincan, O. M., Keles, H. Y. (2020). AUTSL: A large scale multi-modal Turkish sign language dataset and
baseline methods. IEEE Access, 8, 181340–181355. https://doi.org/10.1109/ACCESS.2020.3028072
68. Sincan, O. M., Keles, H. Y. (2020). AUTSL Dataset. http://cvml.ankara.edu.tr/datasets/ (accessed on
21/10/2023)
69. Rezende, T. M., Moreira Almeida, S. G., Guimaraes, F. G. (2021). Development and validation of a
Brazilian sign language database for human gesture recognition. Neural Computing & Applications, 33(16),
10449–10467. https://doi.org/10.1007/s00521-021-05802-4
70. Rezende, T. M., Moreira Almeida, S. G., Guimaraes, F. G. (2021). Libras. https://dataportal.asia/
dataset/212582112_libras-movement (accessed on 21/10/2023)
71. Sidig, A. A. I., Luqman, H., Mahmoud, S., Mohandes, M. (2021). KArSL: Arabic sign language
database. ACM Transactions on Asian and Low-Resource Language Information Processing, 20(1), 1–19.
https://doi.org/10.1145/3423420
72. Sidig, A. A. I., Luqman, H., Mahmoud, S., Mohandes, M. (2021). KArSL.
https://github.com/Hamzah-Luqman/KArSL (accessed on 14/10/2023)
73. Islam, M. D. M., Uddin, M. D. R., Ferdous, M. J., Akter, S., Nasim Akhtar, M. D. (2022). BdSLW-11:
Dataset of Bangladeshi sign language words for recognizing 11 daily useful BdSL words. Data in Brief,
45, 108747. https://doi.org/10.1016/j.dib.2022.108747
74. Islam, M. D. M., Uddin, M. D. R., Ferdous, M. J., Akter, S., Nasim Akhtar, M. D. (2022). BdSLW-11:
A Bangladeshi sign language words dataset for recognizing 11 daily useful BdSL words—Mendeley data.
https://data.mendeley.com/datasets/523d6dxz4n/4 (accessed on 21/10/2023)
75. Forster, J., Schmidt, C., Hoyoux, T., Koller, O., Zelle, U. et al. (2012). RWTH-PHOENIX-Weather: A
large vocabulary sign language recognition and translation corpus. Proceedings of the 8th International
Conference on Language Resources and Evaluation (LREC 2012). Istanbul, Turkey.
76. Forster, J., Schmidt, C., Koller, O., Bellgardt, M., Ney, H. (2014). Extensions of the sign language
recognition and translation corpus RWTH-PHOENIX-Weather. 9th International Conference on Language
Resources and Evaluation, Reykjavik, Iceland.
77. Camgoz, N. C., Hadfield, S., Koller, O., Ney, H., Bowden, R. (2018). Neural sign language translation.
2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7784–7793. Salt Lake City,
UT, USA. https://doi.org/10.1109/CVPR.2018.00812
78. Huang, J., Zhou, W., Zhang, Q., Li, H., Li, W. (2018). Video-based sign language recognition without
temporal segmentation. Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, no. 1, New
Orleans, Louisiana, USA. https://doi.org/10.1609/aaai.v32i1.11903
79. Huang, J., Zhou, W., Zhang, Q., Li, H., Li, W. (2018). Csl_daily. https://ustc-slr.github.io/
datasets/2021_csl_daily/ (accessed on 21/10/2023)
80. Agris, U. V., Knorr, M., Kraiss, K. F. (2009). The significance of facial features for automatic sign language
recognition. IEEE International Conference on Automatic Face & Gesture Recognition. Amsterdam,
Netherlands.
81. Agris, U. V., Knorr, M., Kraiss, K. F. (2009). SIGNUM database–ELRA catalogue. (n.d.).
https://catalogue.elra.info/en-us/repository/browse/ELRA-S0300/ (accessed on 21/10/2023)
82. Joze, H. R. V. (2018). MS-ASL: A large-scale data set and benchmark for understanding American sign
language. https://doi.org/10.48550/arXiv.1812.01053
83. Ma, Y., Xu, T., Kim, K. (2022). A digital sign language recognition based on a 3D-CNN system with an
attention mechanism. 2022 IEEE International Conference on Consumer Electronics-Asia (ICCE-Asia),
pp. 1–4. Yeosu, Korea. https://doi.org/10.1109/ICCE-Asia57006.2022.9954810
84. Ji, S., Xu, W., Yang, M., Yu, K. (2013). 3D convolutional neural networks for human action
recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1), 221–231.
https://doi.org/10.1109/TPAMI.2012.59
85. Lu, Z., Qin, S., Li, X., Li, L., Zhang, D. (2019). One-shot learning hand gesture recognition based
on modified 3D convolutional neural networks. Machine Vision and Applications, 30(7–8), 1157–1180.
https://doi.org/10.1007/s00138-019-01043-7
86. Boukdir, A., Benaddy, M., Ellahyani, A., El Meslouhi, O., Kardouchi, M. (2022). 3D gesture
segmentation for word-level Arabic sign language using large-scale RGB video sequences
and autoencoder convolutional networks. Signal Image and Video Processing, 16(8), 2055–2062.
https://doi.org/10.1007/s11760-022-02167-6
87. Ren, X., Xiang, L., Nie, D., Shao, Y., Zhang, H. et al. (2018). Interleaved 3D-CNNs for joint
segmentation of small-volume structures in head and neck CT images. Medical Physics, 45(5), 2063–2075.
https://doi.org/10.1002/mp.12837
88. Ling, N. Z. (2019). Convolutional neural network (CNN) detailed explanation.
https://www.cnblogs.com/LXP-Never/p/9977973.html (accessed on 03/07/2023)
89. Liu, C. (2019). RNN. https://blog.csdn.net/qq_32505207/article/details/105227028 (accessed on
30/08/2023)
90. Hochreiter, S., Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
91. Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F. et al. (2014).
Learning phrase representations using RNN encoder-decoder for statistical machine translation.
http://arxiv.org/abs/1406.1078
92. Chung, J., Gulcehre, C., Cho, K., Bengio, Y. (2014). Empirical evaluation of gated recurrent neural
networks on sequence modeling. http://arxiv.org/abs/1412.3555
93. Schuster, M., Paliwal, K. K. (1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal
Processing, 45(11), 2673–2681.
94. Schuster, M. (2021). Bidirectional RNN. https://blog.csdn.net/csdn_xmj/article/details/118195670
(accessed on 18/07/2023)
95. Mitchell, T. (1997). Machine learning. McGraw Hill.
96. Pan, S. J., Tsang, I. W., Kwok, J. T., Yang, Q. (2011). Domain adaptation via transfer component analysis.
IEEE Transactions on Neural Networks, 22(2), 199–210. https://doi.org/10.1109/TNN.2010.2091281
97. Tan, C., Sun, F., Kong, T., Zhang, W., Yang, C. et al. (2018). A survey on deep transfer learning. 27th
International Conference on Artificial Neural Networks, Rhodes, Greece.
98. Ganin, Y., Lempitsky, V. (2014). Unsupervised domain adaptation by backpropagation. Proceedings of the
32nd International Conference on Machine Learning, pp. 1180–1189.
99. Simonyan, K., Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition.
https://doi.org/10.48550/arXiv.1409.1556
100. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S. et al. (2014). Going deeper with convolutions.
https://doi.org/10.48550/arXiv.1409.4842
101. Ioffe, S., Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal
covariate shift. https://doi.org/10.48550/arXiv.1502.03167
102. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z. (2015). Rethinking the inception architecture
for computer vision. https://doi.org/10.48550/arXiv.1512.00567
103. Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A. (2016). Inception-v4, Inception-ResNet and the impact of
residual connections on learning. http://arxiv.org/abs/1602.07261
104. He, K., Zhang, X., Ren, S., Sun, J. (2016). Deep residual learning for image recognition. 2016 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. Las Vegas, NV, USA.
https://doi.org/10.1109/CVPR.2016.90
105. Xie, S., Girshick, R., Dollar, P., Tu, Z., He, K. (2017). Aggregated residual transformations for deep neural
networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5987–5995.
106. Zagoruyko, S., Komodakis, N. (2016). Wide residual networks. https://doi.org/10.48550/arXiv.1605.07146
107. Huang, G., Liu, Z., van der Maaten, L., Weinberger, K. Q. (2018). Densely connected convolutional
networks. https://doi.org/10.48550/arXiv.1608.06993
108. Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W. et al. (2017). MobileNets: Efficient
convolutional neural networks for mobile vision applications. https://doi.org/10.48550/arXiv.1704.04861
109. Zhang, X., Zhou, X., Lin, M., Sun, J. (2017). ShuffleNet: An extremely efficient convolutional neural
network for mobile devices. http://arxiv.org/abs/1707.01083
110. Iandola, F. N., Han, S., Moskewicz, M. W., Ashraf, K., Dally, W. J. et al. (2016). SqueezeNet: AlexNet-level
accuracy with 50x fewer parameters and <0.5MB model size. https://doi.org/10.48550/arXiv.1602.07360
111. Tan, M., Le, Q. V. (2020). EfficientNet: Rethinking model scaling for convolutional neural networks.
https://doi.org/10.48550/arXiv.1905.11946
112. Chollet, F. (2017). Xception: Deep learning with depthwise separable convolutions.
http://arxiv.org/abs/1610.02357
113. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L. C. (2019). MobileNetV2: Inverted residuals
and linear bottlenecks. https://doi.org/10.48550/arXiv.1801.04381
114. Howard, A., Sandler, M., Chu, G., Chen, L. C., Chen, B. et al. (2019). Searching for MobileNetV3.
https://doi.org/10.48550/arXiv.1905.02244
115. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L. et al. (2017). Attention is all you need.
https://doi.org/10.48550/arXiv.1706.03762
116. Radford, A., Narasimhan, K., Salimans, T., Sutskever, I. (2018). Improving language understanding by
generative pre-training. https://www.mikecaptain.com/resources/pdf/GPT-1.pdf (accessed on 21/10/2023)
117. Devlin, J., Chang, M. W., Lee, K., Toutanova, K. (2018). BERT: Pre-training of deep bidirectional
transformers for language understanding. https://doi.org/10.48550/arXiv.1810.04805
118. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X. et al. (2021). An image is worth
16x16 words: Transformers for image recognition at scale. https://doi.org/10.48550/arXiv.2010.11929
119. Jiang, X., Zhang, Y. D. (2019). Chinese sign language fingerspelling recognition via six-layer convolutional
neural network with leaky rectified linear units for therapy and rehabilitation. Journal of Medical Imaging
and Health Informatics, 9(9), 2031–2038. https://doi.org/10.1166/jmihi.2019.2804
120. Aly, W., Aly, S., Almotairi, S. (2019). User-independent American sign language alphabet
recognition based on depth image and PCANet features. IEEE Access, 7, 123138–123150.
https://doi.org/10.1109/ACCESS.2019.2938829
121. Jiang, X., Lu, M., Wang, S. H. (2020). An eight-layer convolutional neural network with stochastic pooling,
batch normalization and dropout for fingerspelling recognition of Chinese sign language. Multimedia Tools
and Applications, 79(21–22), 15697–15715. https://doi.org/10.1007/s11042-019-08345-y
122. Nihal, R. A., Rahman, S., Broti, N. M., Deowan, S. A. (2021). Bangla sign alphabet
recognition with zero-shot and transfer learning. Pattern Recognition Letters, 150, 84–93.
https://doi.org/10.1016/j.patrec.2021.06.020
123. Martinez-Martin, E., Morillas-Espejo, F. (2021). Deep learning techniques for Spanish sign language
interpretation. Computational Intelligence and Neuroscience, 2021, 5532580. https://doi.org/10.1155/2021/5532580
124. Aksoy, B., Salman, O. K. M., Ekrem, O. (2021). Detection of Turkish sign language using
deep learning and image processing methods. Applied Artificial Intelligence, 35(12), 952–981.
https://doi.org/10.1080/08839514.2021.1982184
125. Pariwat, T., Seresangtakul, P. (2021). Multi-stroke Thai finger-spelling sign language recognition system
with deep learning. Symmetry, 13(2), 262. https://doi.org/10.3390/sym13020262
126. Rivera-Acosta, M., Ruiz-Varela, J. M., Ortega-Cisneros, S., Rivera, J., Parra-Michel, R. et al. (2021).
Spelling correction real-time American sign language alphabet translation system based on YOLO network
and LSTM. Electronics, 10(9), 1035. https://doi.org/10.3390/electronics10091035
127. Podder, K. K., Chowdhury, M. E. H., Tahir, A. M., Mahbub, Z. B., Khandakar, A. et al. (2022). Bangla
Sign Language (BdSL) alphabets and numerals classification using a deep learning model. Sensors, 22(2),
574. https://doi.org/10.3390/s22020574
128. Ma, Y., Xu, T., Han, S., Kim, K. (2022). Ensemble learning of multiple deep CNNs using accuracy-based
weighted voting for ASL recognition. Applied Sciences, 12(22), 11766. https://doi.org/10.3390/app122211766
129. Zhang, Y., Xu, W., Zhang, X., Li, L. (2022). Sign annotation generation to alphabets via integrating visual
data with somatosensory data from flexible strain sensor-based data glove. Measurement, 202, 111700.
https://doi.org/10.1016/j.measurement.2022.111700
130. Gu, Y., Sherrine, S., Wei, W., Li, X., Yuan, J. et al. (2022). American sign language alphabet
recognition using inertial motion capture system with deep learning. Inventions, 7(4), 112.
https://doi.org/10.3390/inventions7040112
131. Kasapbaşi, A., Elbushra, A. E. A., Al-hardanee, O., Yilmaz, A. (2022). DeepASLR: A CNN based human
computer interface for American sign language recognition for hearing-impaired individuals. Computer
Methods and Programs in Biomedicine Update, 2, 100048. https://doi.org/10.1016/j.cmpbup.2021.100048
132. Das, S., Imtiaz, Md S., Neom, N. H., Siddique, N., Wang, H. (2023). A hybrid approach for Bangla sign
language recognition using deep transfer learning model with random forest classifier. Expert Systems
with Applications, 213, 118914. https://doi.org/10.1016/j.eswa.2022.118914
133. Nareshkumar, M. D., Jaison, B. (2023). A light-weight deep learning-based architecture for
sign language classification. Intelligent Automation and Soft Computing, 35(3), 3501–3515.
https://doi.org/10.32604/iasc.2023.027848
134. Aljuhani, R., Alfaidi, A., Alshehri, B., Alwadei, H., Aldhahri, E. et al. (2023). Arabic sign language
recognition using convolutional neural network and MobileNet. Arabian Journal for Science and Engineering,
48(2), 2147–2154. https://doi.org/10.1007/s13369-022-07144-2
135. Bora, J., Dehingia, S., Boruah, A., Chetia, A. A., Gogoi, D. (2023). Real-time Assamese sign language
recognition using MediaPipe and deep learning. International Conference on Machine Learning and Data
Engineering, 218, 1384–1393. https://doi.org/10.1016/j.procs.2023.01.117
136. Kothadiya, D. R., Bhatt, C. M., Saba, T., Rehman, A., Bahaj, S. A. (2023). SIGNFORMER:
DeepVision transformer for sign language recognition. IEEE Access, 11, 4730–4739.
https://doi.org/10.1109/ACCESS.2022.3231130
137. Nandi, U., Ghorai, A., Singh, M. M., Changdar, C., Bhakta, S. et al. (2023). Indian sign language alphabet
recognition system using CNN with diffGrad optimizer and stochastic pooling. Multimedia Tools and
Applications, 82(7), 9627–9648. https://doi.org/10.1007/s11042-021-11595-4
138. Zhang, L., Tian, Q., Ruan, Q., Shi, Z. (2023). A simple and effective static gesture recognition method
based on attention mechanism. Journal of Visual Communication and Image Representation, 92, 103783.
https://doi.org/10.1016/j.jvcir.2023.103783
139. Ang, M. C., Taguibao, K. R. C., Manlises, C. O. (2022). Hand gesture recognition for Filipino
sign language under different backgrounds. 2022 IEEE International Conference on Artificial
Intelligence in Engineering and Technology (IICAIET), pp. 1–6. Kota Kinabalu, Malaysia.
https://doi.org/10.1109/IICAIET55139.2022.9936801
140. Siddique, S., Islam, S., Neon, E. E., Sabbir, T., Naheen, I. T. et al. (2023). Deep learning-based
Bangla sign language detection with an edge device. Intelligent Systems with Applications, 18, 200224.
https://doi.org/10.1016/j.iswa.2023.200224
141. Jiang, X., Satapathy, S. C., Yang, L., Wang, S. H., Zhang, Y. D. (2020). A survey on artificial intelligence
in Chinese sign language recognition. Arabian Journal for Science and Engineering, 45(12), 9859–9894.
https://doi.org/10.1007/s13369-020-04758-2
142. Lee, B. G., Chong, T. W., Chung, W. Y. (2020). Sensor fusion of motion-based sign language interpretation
with deep learning. Sensors, 20(21), 6256. https://doi.org/10.3390/s20216256
143. Aly, S., Aly, W. (2020). DeepArSLR: A novel signer-independent deep learning framework
for isolated Arabic sign language gestures recognition. IEEE Access, 8, 83199–83212.
https://doi.org/10.1109/ACCESS.2020.2990699
144. Yu, Y., Chen, X., Cao, S., Zhang, X., Chen, X. (2020). Exploration of Chinese sign language recognition
using wearable sensors based on deep Belief Net. IEEE Journal of Biomedical and Health Informatics,
24(5), 1310–1320. https://doi.org/10.1109/JBHI.2019.2941535
145. Rastgoo, R., Kiani, K., Escalera, S. (2020). Video-based isolated hand sign language recognition
using a deep cascaded model. Multimedia Tools and Applications, 79(31–32), 22965–22987.
https://doi.org/10.1007/s11042-020-09048-5
146. Venugopalan, A., Reghunadhan, R. (2021). Applying deep neural networks for the automatic recognition
of sign language words: A communication aid to deaf agriculturists. Expert Systems with Applications, 185,
115601. https://doi.org/10.1016/j.eswa.2021.115601
147. Shaik, A. A., Mareedu, V. D. P., Polurie, V. V. K. (2021). Learning multiview deep features from skeletal
sign language videos for recognition. Turkish Journal of Electrical Engineering and Computer Sciences,
29(2), 1061–1076. https://doi.org/10.3906/elk-2005-57
148. Abdul, W., Alsulaiman, M., Amin, S. U., Faisal, M., Muhammad, G. et al. (2021). Intelligent real-time
Arabic sign language classification using attention-based inception and BiLSTM. Computers and Electrical
Engineering, 95, 107395. https://doi.org/10.1016/j.compeleceng.2021.107395
149. Rastgoo, R., Kiani, K., Escalera, S. (2021). Hand pose aware multimodal isolated sign language
recognition. Multimedia Tools and Applications, 80(1), 127–163. https://doi.org/10.1007/s11042-020-09700-0
150. Boukdir, A., Benaddy, M., Ellahyani, A., El Meslouhi, O., Kardouchi, M. (2022). Isolated video-based
Arabic sign language recognition using convolutional and recursive neural networks. Arabian Journal for
Science and Engineering, 47(2), 2187–2199. https://doi.org/10.1007/s13369-021-06167-5
151. Kothadiya, D., Bhatt, C., Sapariya, K., Patel, K., Gil-Gonzalez, A. B. et al. (2022). Deepsign:
Sign language detection and recognition using deep learning. Electronics, 11(11), 1780.
https://doi.org/10.3390/electronics11111780
152. Guney, S., Erkus, M. (2022). A real-time approach to recognition of Turkish sign language
by using convolutional neural networks. Neural Computing & Applications, 34(5), 4069–4079.
https://doi.org/10.1007/s00521-021-06664-6
153. Li, H., Zhang, Y., Cao, Q. (2022). MyoTac: Real-time recognition of tactical sign language
based on lightweight deep neural network. Wireless Communications and Mobile Computing, 2022, 2774430.
https://doi.org/10.1155/2022/2774430
154. Venugopalan, A., Reghunadhan, R. (2023). Applying hybrid deep neural network for the recognition of
sign language words used by the deaf COVID-19 patients. Arabian Journal for Science and Engineering,
48(2), 1349–1362. https://doi.org/10.1007/s13369-022-06843-0
155. Balaha, M. M., El-Kady, S., Balaha, H. M., Salama, M., Emad, E. et al. (2023). A vision-based deep
learning approach for independent-users Arabic sign language interpretation. Multimedia Tools and
Applications, 82(5), 6807–6826. https://doi.org/10.1007/s11042-022-13423-9
156. Abdallah, M. S. S., Samaan, G. H. H., Wadie, A. R. R., Makhmudov, F., Cho, Y. I. (2023). Light-weight
deep learning techniques with advanced processing for real-time hand gesture recognition. Sensors, 23(1),
2. https://doi.org/10.3390/s23010002
157. Gupta, R., Bhatnagar, A. S., Singh, G. (2023). A weighted deep ensemble for Indian sign language
recognition. IETE Journal of Research. https://doi.org/10.1080/03772063.2023.2175057
158. Das, S., Biswas, S. K., Purkayastha, B. (2023). A deep sign language recognition system for Indian sign
language. Neural Computing & Applications, 35(2), 1469–1481. https://doi.org/10.1007/s00521-022-07840-y
159. Ozdemir, O., Baytas, I. M., Akarun, L. (2023). Multi-cue temporal modeling for skeleton-based sign
language recognition. Frontiers in Neuroscience, 17, 1148191. https://doi.org/10.3389/fnins.2023.1148191
160. Miah, A. S. M., Shin, J., Hasan, M. A. M., Rahim, M. A., Okuyama, Y. (2023). Rotation, translation and
scale invariant sign word recognition using deep learning. Computer Systems Science and Engineering,
44(3), 2521–2536. https://doi.org/10.32604/csse.2023.029336
161. Eunice, J., Andrew, J., Sei, Y., Hemanth, D. J. (2023). Sign2Pose: A pose-based approach for gloss
prediction using a transformer model. Sensors, 23(5), 2853. https://doi.org/10.3390/s23052853
162. Rajalakshmi, E., Elakkiya, R., Prikhodko, A. L., Grif, M. G., Bakaev, M. A. et al. (2023). Static and
dynamic isolated Indian and Russian sign language recognition with spatial and temporal feature detection
using hybrid neural network. ACM Transactions on Asian and Low-Resource Language Information
Processing, 22(1), 26. https://doi.org/10.1145/3530989
163. Koller, O., Zargaran, S., Ney, H., Bowden, R. (2018). Deep sign: Enabling robust statistical continuous
sign language recognition via hybrid CNN-HMMs. International Journal of Computer Vision, 126(12),
1311–1325. https://doi.org/10.1007/s11263-018-1121-3
164. Cui, R., Liu, H., Zhang, C. (2019). A deep neural framework for continuous sign language
recognition by iterative training. IEEE Transactions on Multimedia, 21(7), 1880–1891.
https://doi.org/10.1109/TMM.2018.2889563
165. Pu, J., Zhou, W., Li, H. (2019). Iterative alignment network for continuous sign language recognition.
2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019), pp. 4160–4169.
https://doi.org/10.1109/CVPR.2019.00429
166. Suri, K., Gupta, R. (2019). Continuous sign language recognition from wearable IMUs using
deep capsule networks and game theory. Computers & Electrical Engineering, 78, 493–503.
https://doi.org/10.1016/j.compeleceng.2019.08.006
167. Papastratis, I., Dimitropoulos, K., Konstantinidis, D., Daras, P. (2020). Continuous sign language
recognition through cross-modal alignment of video and text embeddings in a joint-latent space. IEEE Access,
8, 91170–91180. https://doi.org/10.1109/ACCESS.2020.2993650
168. Koller, O., Camgoz, N. C., Ney, H., Bowden, R. (2020). Weakly supervised learning with multi-stream
CNN-LSTM-HMMs to discover sequential parallelism in sign language videos. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 42(9), 2306–2320. https://doi.org/10.1109/TPAMI.2019.2911077
169. Pu, J., Zhou, W., Hu, H., Li, H. (2020). Boosting continuous sign language recognition via cross
modality augmentation. Proceedings of the 28th ACM International Conference on Multimedia, pp. 1497–1505.
https://doi.org/10.1145/3394171.3413931
170. Koishybay, K., Mukushev, M., Sandygulova, A. (2021). Continuous sign language recognition with
iterative spatiotemporal fine-tuning. 2020 25th International Conference on Pattern Recognition (ICPR),
pp. 10211–10218. Milan, Italy. https://doi.org/10.1109/ICPR48806.2021.9412364
171. Aloysius, N., G., M., Nedungadi, P. (2021). Incorporating relative position information in
transformer-based sign language recognition and translation. IEEE Access, 9, 145929–145942.
https://doi.org/10.1109/ACCESS.2021.3122921
172. Elakkiya, R., Vijayakumar, P., Kumar, N. (2021). An optimized generative adversarial network
based continuous sign language classification. Expert Systems with Applications, 182, 115276.
https://doi.org/10.1016/j.eswa.2021.115276
173. Hao, A., Min, Y., Chen, X. (2021). Self-mutual distillation learning for continuous sign language
recognition. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 11283–11292. Montreal,
QC, Canada. https://doi.org/10.1109/ICCV48922.2021.01111
174. Zhou, Z., Tam, V. W. L., Lam, E. Y. (2021). SignBERT: A BERT-based deep learning
framework for continuous sign language recognition. IEEE Access, 9, 161669–161682.
https://doi.org/10.1109/ACCESS.2021.3132668
175. Wang, Z., Zhao, T., Ma, J., Chen, H., Liu, K. et al. (2022). Hear sign language: A real-time end-
to-end sign language recognition system. IEEE Transactions on Mobile Computing, 21(7), 2398–2410.
https://doi.org/10.1109/TMC.2020.3038303
176. Zhou, H., Zhou, W., Zhou, Y., Li, H. (2022). Spatial-temporal multi-cue network for
sign language recognition and translation. IEEE Transactions on Multimedia, 24, 768–779.
https://doi.org/10.1109/TMM.2021.3059098
177. Chen, Y., Mei, X., Qin, X. (2022). Two-stream lightweight sign language transformer. Machine Vision and
Applications, 33(5), 79. https://doi.org/10.1007/s00138-022-01330-w
178. Han, X., Lu, F., Tian, G. (2022). Efficient 3D CNNs with knowledge transfer for sign language recognition.
Multimedia Tools and Applications, 81(7), 10071–10090. https://doi.org/10.1007/s11042-022-12051-7
179. Hu, H., Pu, J., Zhou, W., Li, H. (2022). Collaborative multilingual continuous sign language recognition: A
unified framework. IEEE Transactions on Multimedia, 1–12. https://doi.org/10.1109/TMM.2022.3223260
180. Zhang, J., Wang, Q., Wang, Q., Zheng, Z. (2023). Multimodal fusion framework based on statistical
attention and contrastive attention for sign language recognition. IEEE Transactions on Mobile Computing,
1–13. https://doi.org/10.1109/TMC.2023.3235935
181. Sharma, S., Gupta, R., Kumar, A. (2023). Continuous sign language recognition using isolated signs data
and deep transfer learning. Journal of Ambient Intelligence and Humanized Computing, 14(3), 1531–1542.
https://doi.org/10.1007/s12652-021-03418-z
182. Jiang, S., Liu, Y., Jia, H., Lin, P., He, Z. et al. (2023). Research on end-to-end continuous
sign language sentence recognition based on transformer. 2023 15th International Conference
on Computer Research and Development (ICCRD), pp. 220–226. Hangzhou, China.
https://doi.org/10.1109/ICCRD56364.2023.10080216
183. Hu, J., Liu, Y., Lam, K. M., Lou, P. (2023). STFE-Net: A spatial-temporal feature
extraction network for continuous sign language translation. IEEE Access, 11, 46204–46217.
https://doi.org/10.1109/ACCESS.2023.3234743
184. Guo, Q., Zhang, S., Li, H. (2023). Continuous sign language recognition based on spatial-
temporal graph attention network. Computer Modeling in Engineering & Sciences, 134(3), 1653–1670.
https://doi.org/10.32604/cmes.2022.021784
185. Hu, H., Zhao, W., Zhou, W., Li, H. (2023). SignBERT+: Hand-model-aware self-supervised pre-training
for sign language understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(9),
11221–11239. https://doi.org/10.1109/TPAMI.2023.3269220
186. da Silva, D. R. B., de Araujo, T. M. U., do Rego, T. G., Brandao, M. A. C., Goncalves, L. M. G. (2023). A
multiple stream architecture for the recognition of signs in Brazilian sign language in the context of health.
Multimedia Tools and Applications. https://doi.org/10.1007/s11042-023-16332-7
187. AlKhuraym, B. Y., Ben Ismail, M. M., Bchir, O. (2022). Arabic sign language recognition using lightweight
CNN-based architecture. International Journal of Advanced Computer Science and Applications, 13(4),
319–328.
188. Amrutha, K., Prabu, P., Poonia, R. C. (2023). LiST: A lightweight framework for continuous Indian sign
language translation. Information, 14(2), 79. https://doi.org/10.3390/info14020079
189. Sun, S., Han, L., Wei, J., Hao, H., Huang, J. et al. (2023). ShuffleNetv2-YOLOv3: A real-time recognition
method of static sign language based on a lightweight network. Signal Image and Video Processing, 17(6),
2721–2729. https://doi.org/10.1007/s11760-023-02489-z
190. Wang, F., Zhang, L., Yan, H., Han, S. (2023). TIM-SLR: A lightweight network for video
isolated sign language recognition. Neural Computing & Applications, 35(30), 22265–22280.
https://doi.org/10.1007/s00521-023-08873-7