Thesis
by
Imam Hossain Rafi
ID: CSE-02207156
Supervised by
Shafayet Nur
Lecturer
The thesis titled 'Enhancing Bangla Sign Language Recognition Through Batch Size and Optimizer Variations in Deep Learning', submitted by ID: CSE-02207156, Session, has been accepted as satisfactory in fulfilment of the requirement for the degree of Bachelor of Science in Computer Science & Engineering to be awarded by the Port City International University.
Shafayet Nur
Lecturer
Department of Computer Science & Engineering
Port City International University
Email:
Cell:
DEDICATION
DECLARATION OF ORIGINALITY
This certifies that Imam Hossain Rafi CSE 02207156 is the research author, and that
neither the research nor any portion of it has been submitted to any other university for
credit toward a degree. To the best of our knowledge, my research does not violate
any copyright or proprietary rights, and all ideas, methods, quotations, or other material
from other people’s work that is included in our thesis—published or not—is properly
credited in accordance with accepted referencing guidelines. I am also aware that the
Department of CSE, PCIU, may take legal and disciplinary action against me if any
copyright infringement is discovered, whether intentional or not. Without the Depart-
ment of CSE, PCIU’s permission, any reproduction or use of this thesis work in any form
or by any means whatsoever is forbidden. I hereby transfer all rights in the copyright of
this thesis work to them.
ABSTRACT
The identification and categorization of Bangla Sign Language (BdSL) numbers and
letters has become more significant for the purpose of providing assistance to the com-
munity of hearing-impaired individuals as the demand for effective communication tools
that are accessible to all individuals continues to rise. Utilizing the Shongket dataset,
which consists of 10 number classes and 36 letter classes, this paper presents a comprehensive analysis of the identification of Bangla Sign Language (BdSL) hand signs. During preprocessing, background elimination and binarization techniques were utilized
to increase the clarity of hand sign pictures, permitting improved feature extraction.
The research assesses the performance of Machine Learning, Deep Learning, and Hy-
brid Models for identifying these hand signals. Specifically, four distinct optimiza-
tion algorithms—Adam, Nadam, Adamax, and RMSprop—were evaluated, with varying
batch sizes (16, 32, 64, and 128) to measure their influence on model training and accu-
racy. Comprehensive experiments and comparative studies were undertaken to discover
the ideal combinations of optimizers and batch sizes for boosting classification perfor-
mance. The results give useful insights into the strengths and drawbacks of each tech-
nique, leading to developments in sign language recognition technology and enabling
the creation of more robust and efficient systems for BdSL identification.
Keywords: optimizer-based Bangla sign language identification, Bangla sign language classification, sign language detection, background removal using rembg
ACKNOWLEDGEMENT
TABLE OF CONTENTS
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
ACKNOWLEDGEMENT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
CHAPTER 1 INTRODUCTION 1
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Sign language: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Application of Bangla sign language identification and classification methods 5
1.5 Key Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
CHAPTER 3 METHODOLOGY 13
3.1 Dataset preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.1.1 Background removal using rembg . . . . . . . . . . . . . . . . . . 15
3.1.2 Gray Scale conversion . . . . . . . . . . . . . . . . . . . . . . . . 15
3.1.3 Binarization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.3 Optimizer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.3.1 Adam (Adaptive Moment Estimation) . . . . . . . . . . . . . . . . 18
3.3.2 Nadam (Nesterov-accelerated Adaptive Moment Estimation) . . . . 19
3.3.3 RMSProp (Root Mean Square Propagation): . . . . . . . . . . . . 20
3.3.4 Adamax : . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.4 Model description: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.4.1 Convolutional Neural Network (CNN) . . . . . . . . . . . . . . . . 21
3.4.2 Long Short-Term Memory . . . . . . . . . . . . . . . . . . . . . . 22
3.4.3 Bidirectional LSTM . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.4.4 Resnet 50 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.4.5 Xception . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.4.6 VGG16 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.4.7 VGG-19 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.4.8 InceptionV3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4.9 CNN-LSTM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4.10 CNN-VGG16 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4.11 CNN-VGG19 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.4.12 CNN-InceptionV3 . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
CHAPTER 5 CONCLUSION 71
5.1 Future Work: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
REFERENCES 72
LIST OF FIGURES
3.1 Steps in the process of identifying Bangla sign language and classification 14
3.2 Sample image of before and after background elimination. . . . . . . . . . 15
3.3 Steps involves in preprocessing . . . . . . . . . . . . . . . . . . . . . . . 16
3.4 Visual overview of dataset for digit . . . . . . . . . . . . . . . . . . . . . . 17
3.5 Visual overview of dataset for letter . . . . . . . . . . . . . . . . . . . . . 18
3.6 Architecture of CNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.7 Architecture of LSTM . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.8 Architecture of BiLSTM . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.25 Evaluation of VGG16 Optimizers and Batch Sizes (Digit) . . . . . . . . . 59
4.26 Evaluation of VGG16 Optimizers and Batch Sizes (Letter) . . . . . . . . . 59
4.27 VGG16(Digit) Accuracy and Loss Curve For Adam optimizer with Batch
Size 16 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.28 VGG16(Letter) Accuracy and Loss Curve For Adam optimizer with Batch
Size 16 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.29 Classification report for Adam optimizer batch size 16(Digit) . . . . . . . . 61
4.30 Classification report for Adam optimizer batch size 16(Letter) . . . . . . . 62
4.31 Confusion Matrix For Adam optimizer batch size 16 (Digit) . . . . . . . . 63
4.32 Confusion Matrix For Adam optimizer batch size 16 (Letter) . . . . . . . . 63
4.33 Visualization of VGG19(Digit) Optimizers Across Different Batch Sizes . 66
4.34 Visualization of VGG19(Letter) Optimizers Across Different Batch Sizes . 67
4.35 VGG19(Digit) Accuracy and Loss Curve For Adam optimizer with Batch
Size 16 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.36 VGG19(Letter) Accuracy and Loss Curve For Adam optimizer with Batch
Size 16 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.37 Classification report of VGG19(Digit) for Adam optimizer batch size 16 . . 68
4.38 Classification report of VGG19(Letter) for Adam optimizer batch size 16 . 69
4.39 Confusion Matrix For Adam optimizer batch size 16 (Digit) . . . . . . . . 70
4.40 Confusion Matrix For Adam optimizer batch size 16 (Letter) . . . . . . . . 70
LIST OF TABLES
CHAPTER 1
INTRODUCTION
Sign language serves as a vital bridge between deaf, hard of hearing, and speech-disabled
individuals and the broader world, functioning as an essential communication tool that
facilitates interaction. The term "deaf community" refers to individuals who are deaf,
hard of hearing, or speech-disabled, along with their families and allies, who often share
a unique culture, language, and set of experiences related to communication challenges.
As awareness of this community grows, the demand for sign language is steadily in-
creasing. In Bangladesh, it is estimated that nearly 13 million people experience varying
degrees of hearing loss, with approximately 3 million suffering from severe to profound
hearing loss, significantly impacting their daily lives and interactions. The challenges
faced by deaf and speech-disabled individuals underscore the importance of sign lan-
guage as a crucial communication aid. Due to the barriers created by hearing loss or
speech impairments, many people in these communities encounter difficulties in ex-
pressing themselves and understanding others, leading to feelings of isolation. In re-
sponse to these challenges, researchers are actively exploring innovative methods and
techniques to develop machines capable of interpreting sign language more efficiently.
This ongoing research is vital for bridging the communication gap between deaf, speech-
disabled, and hearing individuals. Various approaches, including machine learning,
deep learning, and hybrid models, are being employed to enhance the effectiveness of
sign language recognition systems. From a review by [Khatun et al., 2021], about 60%
of the reviewed papers use both single and double hand signs in BdSL, with CNN be-
ing the most popular technique. [Yasir et al., 2017] present an approach to Bangla Sign
Language recognition using CNN. A hand-tracking device known as the Leap Motion
Controller was used to detect hand gestures. CNN is also used by [Islalm et al., 2019]
for detecting Bangla Sign Language. A large dataset was developed, comprising both
alphabets and numerals, with 7,052 sample images of 10 numerals and 23,864 images
of 35 basic alphabet characters. Their results showed 99.83% accuracy on numerals,
nearly 100% accuracy on alphabets, and 99.80% accuracy overall. [Hasan et al., 2016]
employed a machine learning approach to give voice to speech-disabled individuals.
Sign language was identified through hand gestures and converted into text, which was
then transformed into voice. HOG was used for feature extraction and SVM for classifi-
cation, with 16 BdSL sign expressions being recognized. [Podder et al., 2022] addressed the
population suffering from deafness or hearing disabilities in Bangladesh. To promote
the study of Bangla Sign Language, they prepared two robust datasets for BdSL alpha-
bets and numerals, which were classified using deep learning approaches. Models using
images with and without backgrounds were compared, with CNN performing better on
images with backgrounds. ResNet-18 achieved the highest accuracy at 99.99%.

This thesis focuses on developing an effective approach for identifying Bangla Sign Language.
The dataset used contains 10 digit classes and 36 character classes. To obtain optimized
results, the input images undergo preprocessing, which involves background elimina-
tion and thresholding algorithms. These steps help clarify the input images by removing
background noise. Subsequently, deep learning and hybrid machine learning models are
employed for recognition tasks. The performance of these models is evaluated using four
distinct optimization algorithms: Adam, Nadam, Adamax, and RMSprop. Additionally,
the impact of different batch sizes—16, 32, 64, and 128—on the training process and
model performance is examined. This comparative study aims to identify the most ef-
fective combinations of optimizers and batch sizes for the task. The findings from this
thesis provide valuable insights into the recognition processes of Bangla sign language
digits and characters, contributing to the development of more accurate and efficient
methods for sign language identification.
1.1 Overview
Bangla Sign Language (BdSL) detection presents several challenges due to variations
in hand gestures, skin color, and complex or similar hand shapes. Additionally, incon-
sistent lighting conditions and cluttered backgrounds further complicate the detection
process. To address these challenges, various preprocessing techniques are employed
to enhance image quality. These include background elimination methods such as re-
gion of interest (ROI) extraction, thresholding techniques (Otsu’s thresholding, adap-
tive thresholding), OpenPose for keypoint detection, HSV color space adjustments, and
skin color detection algorithms. These methods help isolate hand signs from the back-
ground and reduce interference from skin tones or environmental noise, thereby improv-
ing model performance in recognizing distinct hand gestures. In this study, I focused on
BdSL detection using the Shongket dataset, which consists of two distinct subsets: one
for digits and one for letters. The digit dataset contains 10 classes with 150 images per
class, while the letter dataset consists of 36 classes with 120 images per class. A range of
deep learning models were employed for this task, including VGG16, VGG19, LSTM,
InceptionV3, Xception, and hybrid models such as CNN+LSTM. Preprocessing was
essential to address the problem of detecting hands with varying skin tones and distin-
guishing between gestures that are visually similar. The process began with background
removal using rembg, followed by thresholding methods like Otsu’s thresholding to bi-
narize the images, making the hand signs more distinguishable. Skin color detection and
HSV color space adjustments were also applied to enhance the clarity of the hand signs
and minimize distractions from similar backgrounds or complex hand gestures. Among
the models tested, the CNN+LSTM hybrid model achieved the highest performance on
the digit dataset, with an accuracy of 98.6%. On the letter dataset, the highest accuracy
of 92.6% was obtained using the Xception model. However, the increased complexity
of the letter dataset, with more classes and visually similar gestures, resulted in slightly
lower accuracy. These findings suggest areas for future improvement, such as diver-
sifying the dataset to include a wider range of skin tones, refining model architectures
to better handle complex gestures, or optimizing preprocessing techniques. This study
contributes to the advancement of BdSL recognition systems, with the goal of improv-
ing communication accessibility for the hearing-impaired community through accurate
sign language detection.
researchers are creating tools that can interpret and translate sign language in real time.
These technologies not only enhance communication for the deaf and hard of hearing
but also support educators and family members in interacting effectively with the deaf
community. By prioritizing research in sign language technologies, we can promote
accessibility and understanding, ensuring that everyone has a voice and can be under-
stood. Bangla Sign Language (BdSL) serves as the primary mode of communication
for the deaf and hard-of-hearing community in Bangladesh. Like other sign languages
around the world, BdSL involves a combination of hand gestures, facial expressions,
and body movements to convey meaning. It is a fully developed language with its own
grammatical rules, vocabulary, and syntax, distinct from the spoken Bengali language.
BdSL incorporates signs for the Bengali script, including 10 digit characters and 36 let-
ters, representing the vowels and consonants of Bengali. This linguistic structure allows
individuals who are deaf or hard of hearing to communicate effectively with each other
and those who understand the language, bridging communication gaps in a predomi-
nantly spoken language society.
1.3 Motivation
• Limited Resources: BdSL has fewer technological resources and less research than other sign languages such as ASL, which have a plethora of tools at their disposal. By contributing to the creation of dedicated tools and models for Bangla Sign Language recognition, this thesis aims to close that gap.
• Technological Developments: Sign language recognition is now more practical and accurate due to advances in deep learning and computer vision. By utilizing these technologies, this research aims to improve on current techniques by developing reliable systems for BdSL gesture recognition.
1.4 Application of Bangla sign language identification and classification methods

The Bangla sign language identification and classification methods devised in this thesis have numerous practical applications across various disciplines. Here are some important areas where these techniques can be effectively applied:
• Media and Entertainment: BdSL detection may be integrated into digital plat-
forms to provide real-time sign language interpretation for deaf viewers.
• Assistive Devices: The technology may be used to create wearable devices that
convert BdSL into text or voice output, promoting independent communication
among the deaf community.
• Sign Language Learning Tools: Detection models can enable learning platforms
and applications for BdSL learners, making the language more accessible to hear-
ing people and increasing its adoption.
1.5 Key Contribution

• The study involved the application of a varied set of models for sign classification, ranging from deep learning to traditional machine learning methods. Convolutional neural networks (CNNs) were employed for their ability to learn complex features from images, while long short-term memory (LSTM) networks and bidirectional LSTM (BiLSTM) networks were utilized for sequence modeling tasks. Pretrained models such as VGG16, VGG19, ResNet50, Xception, and MobileNet were fine-tuned for the recognition tasks. Additionally, hybrid models combining CNNs with other architectures, as well as standard machine learning models like Support Vector Machines (SVMs), were studied to evaluate their effectiveness in handling the hand sign images.
• The work used systematic comparison and analysis to identify the optimum combinations of models, optimizers, and batch sizes for classifying Bangla sign language digits and characters. This analysis provides useful direction for future research and development, helping to identify the most successful approaches for recognizing and categorizing BdSL hand signs.
CHAPTER 2
LITERATURE REVIEW
The recognition and detection of Bangla Sign Language (BdSL) has received a lot of
interest as there is a rising demand for accessible communication tools for Bangladesh’s
deaf and hard-of-hearing communities. The creation of efficient BdSL identification
systems is essential for promoting social inclusion and communication. This area of
study is essential for improving accessibility in public services, healthcare, and educa-
tion, as well as for removing obstacles to communication.
To increase the clarity and quality of BdSL images—which are essential for precise
model training and recognition—a broad range of preprocessing methods have been
used. Preprocessing techniques that are often used include Otsu’s thresholding, Sauvola’s
adaptive thresholding, Niblack’s thresholding, and background eradication utilizing tools
like rembg. To further separate hand motions from noisy backdrops, morphological
techniques including dilation and erosion, area of interest (ROI) extraction, and skin
color recognition algorithms are employed. Techniques like Gaussian filtering and me-
dian filtering are also utilized to remove noise and boost image quality before the iden-
tification phase.
Many deep learning models have been applied to the field of BdSL detection, ranging
from straightforward designs to intricate, pretrained networks. Because Convolutional
Neural Networks (CNNs) are so good at capturing spatial characteristics in pictures, they
continue to be one of the most often used techniques. Advanced architectures such as VGG16, VGG19, InceptionV3, ResNet50, and Xception, which achieve high accuracy in sign language identification applications, have been widely employed. Furthermore, hybrid
models that integrate temporal and spatial information—both necessary for continuous
hand gesture recognition—like CNN paired with LSTM or CNN-XGBoost have demon-
strated potential.
Other machine learning methods, such as Support Vector Machines (SVM), Random Forests
(RF), and K-Nearest Neighbors (KNN), have also been used in addition to CNN-based
models, especially in previous research. While deep learning models have demonstrated
better performance recently, these models have laid the groundwork for BdSL detection
when paired with features collected using techniques such as Histogram of Oriented
Gradients (HOG).
The best preprocessing methods and model architectures for BdSL identification will
be critically examined in this literature review, along with the difficulties caused by
problems with gesture similarity, skin tones, and uneven illumination. The review will
also look at how these issues have been resolved and detection accuracy increased by
developments in hybrid and deep learning models.
This study seeks to fill in knowledge gaps and suggest future avenues for research by
analyzing the benefits and drawbacks of present approaches. In the end, this chapter
will advance knowledge on how to create more reliable and accurate BdSL recognition systems, significantly enhancing communication accessibility for Bangladesh's hearing-impaired community.
• Comprising 1,800 grayscale pictures of 36 Bangla letters, [Islam et al., 2018] created the first entirely open-access dataset, Ishara-Lipi, for isolated characters in Bangla Sign Language (BdSL). The images were preprocessed using grayscale conversion and Otsu thresholding. A 9-layer Convolutional Neural Network (CNN) tuned using the ADAM optimizer was then trained on the dataset, attaining 92.65% accuracy on the training set and 94.74% on the validation set. The dataset was intended specifically to close the resource gap for machine-learning-based BdSL identification. One constraint was the rather limited size of the dataset; future plans call for expanding it to improve model performance. The authors note that Ishara-Lipi will be a valuable resource for sign language research and development.
• [Das et al., 2023] proposed a composite approach for the recognition of Bangla
Sign Language (BSL) that employs a Random Forest (RF) classifier and a Convo-
lutional Neural Network (CNN). The transfer learning process was combined with
a background elimination algorithm that employs morphological operations and
an adaptive Gaussian thresholding technique to obtain optimal results. The CNN
models (VGG16, VGG19, InceptionV3, Xception, ResNet50) were pre-trained
on ImageNet. Two public datasets, Ishara-Bochon (digits) and Ishara-Lipi (char-
acters), were used for training. The system performed well, obtaining an accuracy
of 91.67% for character recognition and 97.33% for digit recognition. Future en-
hancements could focus on data augmentation and real-time optimization.
• [Shurid et al., 2020] suggested a novel architecture, the Concatenated BdSL Net-
work, intended for recognizing Bangla Sign Language (BdSL) by combining a
CNN for visual feature extraction and OpenPose for hand keypoint estimation.
This method handles the challenges posed by the subtle differences between sim-
ilar BdSL gestures. The model got a test accuracy of 91.51%, surpassing other
CNN-based models. However, misclassifications happened in symbols with nearly
identical hand gestures. The authors mentioned future improvements could in-
volve training a custom pose estimation model specific to BdSL. Experiments
were performed using Google Colaboratory with limited computational resources,
and further development may focus on real-time recognition and larger datasets.
• [Abedin et al., 2023] proposed a novel model called the ”Concatenated BdSL Net-
work” for Bangla Sign Language (BdSL) recognition, combining a CNN and a
pose estimation network. The CNN, composed of 10 convolutional layers, was
responsible for extracting visual features from the images, while OpenPose es-
timated hand keypoints, addressing the challenge of differentiating subtle hand
gestures. The model used two separate inputs: for the CNN, images were con-
verted to grayscale, while for pose estimation, the images were converted to BGR
format. The features extracted from both the CNN and pose estimation network,
flattened into layers, were combined by passing them through two fully connected
layers with ReLU activation. These combined features were then passed through
three additional fully connected layers to produce the final output. The model
achieved a test accuracy of 91.51%, outperforming previous methods. However,
certain pairs of visually similar characters were frequently misclassified. The authors suggested that training a custom pose estimation network specifically for BdSL, rather than relying on a pre-trained one, would likely yield better results.
• [Podder et al., 2022] developed a deep learning-based system for real-time Bangla
Sign Language (BdSL) recognition, focused on alphabets and numerals. Utilizing
the largest dataset created for BdSL, they trained pre-trained CNN models such
as ResNet18 and MobileNet_V2, getting high accuracy in both approaches: with
background and after background removal. ResNet18 provided the best perfor-
mance, with 99.99% accuracy, precision, and sensitivity. The study addressed
challenges related to skin tone, hand orientation, and background, finding that
models trained with backgrounds performed slightly better than those without.
This study made significant progress in BdSL recognition and provided publicly
available datasets to support future studies.
• [Tasmere and Ahmed, 2020] proposed a hand gesture recognition framework us-
ing a deep convolutional neural network (CNN) for Bangla Sign Language (BSL),
getting a high accuracy of 99.22%. Their system relied on a dataset of 3,219 im-
ages collected from six different people, representing 37 Bangla sign characters.
The authors applied a combination of HSV and YCbCr color spaces for hand de-
tection, followed by a CNN architecture with four convolutional layers and a 40%
dropout layer to improve performance. Although the model achieved strong re-
sults, it was limited to recognizing static images, suggesting potential for future
improvements by incorporating dynamic gestures and real-time recognition.
The model obtained 95.13% accuracy on the testing set, beating individual CNN
models and other cutting-edge architectures like ResNet50. While the technique
had potential, its dependence on MediaPipe for skeletal feature extraction and em-
phasis on static indicators posed limits. The authors advised that future research
should investigate dynamic indications and build more robust real-time solutions.
• [Hadiuzzaman et al., 2024] built the BAUST Lipi dataset, comprising 18,000 im-
ages representing 36 Bangla Sign Language (BdSL) alphabets. They introduced a
CNN-LSTM hybrid model that employs CNN layers for spatial feature extraction
and LSTM for temporal sequence learning, obtaining a high accuracy of 97.28%
on the testing set. The dataset was acquired from 15 participants under diverse
conditions, enhancing its robustness for machine learning tasks. However, the
model is limited to static signs, implying future work could investigate dynamic
gestures and real-time recognition.
CHAPTER 3
METHODOLOGY
Bangla Sign Language helps those with hearing disabilities bridge the communication
gap, allowing them to engage more successfully in society. As the need for sign lan-
guage recognition systems rises, the task of creating efficient and accurate models for
detecting Bangla hand signals becomes more pressing. This study addresses these prob-
lems by presenting a thorough approach for detecting and classifying Bangla sign lan-
guage, with an emphasis on the Shongket dataset, which comprises digit and letter hand
sign pictures. The approach begins with important preprocessing processes that im-
prove the clarity and uniformity of the hand sign pictures. Using a backdrop removal
method, rembg, all unnecessary features are removed, enabling the hand signals to stand
out. The photos are then transformed to grayscale to minimize computing complexity,
followed by binarization to provide crisp black-and-white images with uniformity and
increased contrast for further processing. After preprocessing, the photos are sent into
a succession of machine learning, deep learning, and hybrid models for categorization.
These models are charged with classifying the hand signals into two main sets: 10-digit
and 36-letter classes. To achieve high classification accuracy, a range of cutting-edge
techniques are used, including hybrid models that integrate convolutional neural net-
works (CNNs) with long short-term memory networks (LSTMs) and sophisticated deep
learning architectures such as Xception. To improve the performance of these models,
four different optimizers—Adam, Nadam, RMSprop, and Adamax—are evaluated with
different batch sizes (16, 32, 64, and 128). This study aims to determine the most suc-
cessful combinations of optimizers and batch sizes, with an emphasis on model correct-
ness and convergence speed. This study gives useful insights into the best optimization
approaches for detecting Bangla Sign Language by carefully comparing the outcomes
of each optimizer with varied batch sizes. The findings seek to improve the accuracy
and efficiency of sign language recognition systems, hence facilitating the creation of
more accessible and inclusive communication tools.
Figure 3.1: Steps in the process of identifying Bangla sign language and classification
Figure 3.1 shows the steps of our proposed model that involve Bangla sign language
identification and classification.
3.1 Dataset preprocessing

Preprocessing refers to the set of operations performed on unprocessed data before feed-
ing it into a machine learning or deep learning model. This phase is crucial in trans-
forming the data into a more suitable format, enhancing its quality, and making it more
informative for the learning algorithms. The objective of preprocessing is to ensure
that the data is clean, standardized, and in a form that the model can efficiently work
with, thereby increasing its accuracy and performance. The efficacy and accuracy of the
training process are significantly improved by preprocessing, which is a critical step in
the development of any machine learning or deep learning model. Preprocessing is the
foundation for the dataset to be optimized for performance in this study on Bangla Sign
Language identification, which includes two categories—digits and letters. Raw images
are converted to a more suitable format for model training through the use of grayscale
conversion, binarization, and background removal in the preprocessing pipeline. Fig-
ure 3.3 illustrates the processes involved in the preprocessing stage. The primary ob-
jective of the initial stage of preprocessing is to remove any undesirable components
from the images in order to mitigate noise or any inconsistencies that could impede
the recognition process. The gesture’s shape is accurately defined as the background
is eliminated, thereby concentrating solely on the hand sign. Subsequently, the images
are converted to grayscale in order to reduce the computational burden and simplify the
data. Lastly, Otsu’s binarization technique is implemented to transform these images
into crisp black-and-white representations, thereby emphasizing the essential attributes
required for precise sign classification.
3.1.1 Background removal using rembg
Figure 3.2 shows the difference between an image before and after background elimination.
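Since the preprocessing code is not listed in the thesis, the following is a minimal sketch of the background-removal step using the rembg Python package (which applies a pretrained segmentation model under the hood); the file names are illustrative only.

    from rembg import remove
    from PIL import Image

    # Load a raw hand sign image (hypothetical path) and strip its background.
    input_image = Image.open("hand_sign.jpg")
    output_image = remove(input_image)

    # Save as PNG so the transparent background produced by rembg is preserved.
    output_image.save("hand_sign_no_bg.png")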
3.1.2 Gray Scale conversion

After removing the background, the images were converted into grayscale images. Grayscale conversion is the process of changing a color image into a single-channel image, where each pixel represents the intensity or brightness of the corresponding pixel in the original image. In a grayscale picture, shades of gray spanning from black to white are utilized to depict varying degrees of intensity, with darker shades suggesting lower intensity and lighter shades indicating higher intensity. Grayscale pictures feature only one intensity channel, simplifying processing compared to color images, which generally comprise three channels (red, green, and blue). This minimizes computational complexity and memory requirements, making grayscale pictures easier to work with. Aside from that, grayscale pictures provide a clearer depiction of the visual content.
3.1.3 Binarization
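As noted in the preprocessing overview above, Otsu's binarization converts the grayscale images into crisp black-and-white representations that emphasize the hand sign. The thesis does not list the code for the grayscale and binarization steps; the sketch below shows one plausible implementation with OpenCV, assuming the background-removed image from the previous step.

    import cv2

    # Read the background-removed image and reduce it to a single grayscale channel.
    img = cv2.imread("hand_sign_no_bg.png")
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Otsu's method picks the threshold automatically (the 0 below is ignored),
    # yielding a crisp black-and-white image of the hand sign.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    cv2.imwrite("hand_sign_binary.png", binary)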
3.2 Dataset
The Shongket dataset is a cutting-edge resource for recognizing Bangla Sign Language
(BdSL) using machine learning and computer vision. It is one of the largest datasets
in this field and attempts to close the communication gap by offering a well-organized
and varied set of hand gesture photos to both the general public and those with hearing
and speech impairments. The Shongket dataset, which is specifically focused on Bangla
Sign Language, includes a wide range of examples of both alphabetic and numeric hand
motions. It gives researchers and developers working on sign language recognition mod-
els a useful tool by providing a huge number of hand gesture photographs for each class,
allowing advances in system effectiveness and accuracy. This dataset encourages the
advancement of assistive technology, which enhances communication accessibility and
inclusivity.
There are two key sections to the dataset:
Digit Classes: It has 10 classes representing Bangla digits (0–9), with 150 hand gesture photos per class, totaling 1,500 images.
Letter Classes: The collection also comprises 36 classes corresponding to the Bangla alphabet, with 120 hand gesture photos per class, resulting in 4,320 images.
In all, Shongket features 5,820 photos, captured under varied situations and displaying
diverse hand movements. This makes it a crucial resource for furthering research and
applications in Bangla Sign Language detection.
Figure 3.4 shows the full set of classes for Bangla numbers in the Shongket dataset and
visually represents the hand motion corresponding to each Bangla digit within those
classes. The hand movements and their corresponding numerical values are precisely mapped out, with each class clearly associated with one of the 10 Bangla numerals (0–9).
Figure 3.5: Visual overview of dataset for letter
In a similar vein, Figure 3.5 offers a thorough summary of the Bangla letter classes found
in the Shongket dataset. It shows the hand motion that goes with each of the 36 Bangla
letters. The hand motions connected to each character in the Bangla alphabet are clearly
illustrated in each class, which is matched to a particular letter.
3.3 Optimizer
3.3.1 Adam (Adaptive Moment Estimation)

Adam is a well-known optimization algorithm that combines ideas from RMSProp and Momentum. It computes adjusted learning rates for each parameter, making it effective across a wide range of optimization problems. Adam keeps track of two moving averages: m_t, the exponentially decaying average of past gradients, and v_t, the exponentially decaying average of past squared gradients.

Default Learning Rate (α): 0.001
Default Decay Rates (β1 and β2): 0.9 and 0.999 respectively
Epsilon (ε) Default Value: 1 × 10⁻⁷

The rule for updating Adam's parameters is:

    m_t = β1 · m_{t−1} + (1 − β1) · g_t
    v_t = β2 · v_{t−1} + (1 − β2) · g_t²
    m̂_t = m_t / (1 − β1^t),   v̂_t = v_t / (1 − β2^t)
    θ_{t+1} = θ_t − α · m̂_t / (√v̂_t + ε)

where g_t is the gradient at time step t, m̂_t and v̂_t are the bias-corrected moment estimates, and θ_t denotes the model parameters.
3.3.2 Nadam (Nesterov-accelerated Adaptive Moment Estimation)

Nadam combines Adam with Nesterov momentum, applying the momentum step ahead of the gradient update, which often yields slightly faster convergence than Adam. Its update rule can be written as:

    θ_{t+1} = θ_t − (α / (√v̂_t + ε)) · (µ · m̂_t + ((1 − µ) · g_t) / (1 − µ^t))

where µ is the Nesterov momentum coefficient, and m̂_t and v̂_t are the bias-corrected moment estimates defined as in Adam.
3.3.3 RMSProp (Root Mean Square Propagation)

RMSProp is an adaptive learning rate optimization approach. The learning rate for each parameter is adjusted according to a moving average of the recent magnitudes of its gradients, which overcomes the issue of rapidly declining learning rates and accelerates convergence. Its update rule is:

    E[g²]_t = γ · E[g²]_{t−1} + (1 − γ) · g_t²
    θ_{t+1} = θ_t − α · g_t / √(E[g²]_t + ε)

Default Learning Rate (α): 0.001
Default Decay Rate (γ): 0.9
Epsilon (ε) Default Value: 1 × 10⁻⁷
3.3.4 Adamax

Adamax is a variant of the Adam optimizer based on the infinity norm. It is more stable and effective when the gradients are large, as it uses the maximum of the past gradients' magnitudes rather than an average of their squares. This makes it more robust in certain cases, especially when dealing with large gradients.

Default Learning Rate (α): 0.002
Beta 1 (β1) Default Value: 0.9
Beta 2 (β2) Default Value: 0.999
Epsilon (ε) Default Value: 1 × 10⁻⁸

The update rule for Adamax is:

    m_t = β1 · m_{t−1} + (1 − β1) · g_t
    u_t = max(β2 · u_{t−1}, |g_t|)
    θ_{t+1} = θ_t − (α / (1 − β1^t)) · m_t / u_t

where u_t is the exponentially weighted infinity norm of the past gradients and ε is a tiny constant added to prevent division by zero.
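As a reference for how the four optimizers and the default hyperparameters above map onto code, the following is a sketch using the Keras API; the actual training scripts of this study may differ.

    import tensorflow as tf

    # The four optimizers compared in this study, instantiated with the
    # default hyperparameters listed in Sections 3.3.1-3.3.4.
    optimizers = {
        "Adam": tf.keras.optimizers.Adam(
            learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-7),
        "Nadam": tf.keras.optimizers.Nadam(
            learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-7),
        "RMSprop": tf.keras.optimizers.RMSprop(
            learning_rate=0.001, rho=0.9, epsilon=1e-7),
        "Adamax": tf.keras.optimizers.Adamax(
            learning_rate=0.002, beta_1=0.9, beta_2=0.999, epsilon=1e-8),
    }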
3.4 Model description
3.4.1 Convolutional Neural Network (CNN)

Figure 3.6: Architecture of CNN

Figure 3.6 illustrates the architecture of the CNN for better understanding.
3.4.2 Long Short-Term Memory

The Long Short-Term Memory (LSTM) model is a specialized type of Recurrent Neural Network (RNN) architecture, designed to capture long-term dependencies in sequential data. Unlike conventional RNNs, LSTMs are capable of remembering information for extended periods and avoiding issues such as vanishing gradients, making them ideal for sequence prediction tasks that involve temporal relationships. This includes applications like time series analysis, natural language processing, and image sequence classification.
In this case, an LSTM model is used for classifying Bangla Sign Language digit sequences. The model processes images by treating each image as a sequence of rows, where each row (28 pixels wide) is a time step, and each pixel within the row is a feature. The input images are first preprocessed by converting them to grayscale, resizing them to 28x28 pixels (similar to the MNIST dataset), and normalizing the pixel values to lie between 0 and 1. The dataset is divided into training and testing sets, and the labels are one-hot encoded for classification into 10 or 36 output classes.
The architecture of the LSTM model is as follows:
Initial Layer: The input layer anticipates input sequences with a shape of (28, 28), rep-
resenting 28 time steps (rows) with 28 features (pixels) at each time step.
LSTM Layer: The first LSTM layer consists of 256 units and returns sequences, mean-
ing it outputs a sequence for each time step, which is passed on to the next LSTM layer.
This layer helps capture the temporal dependencies within the input sequences.
Dropout Layer: A dropout layer with a 0.2 dropout rate is implemented to reduce overfitting. Dropout works by randomly turning off a fraction of neurons during training, encouraging the model to generalize better.
Second LSTM Layer: Another LSTM layer with 256 units follows, but this time it only
returns the final output (not the entire sequence) to provide a summary of the learned
temporal features.
Final Output Layer: The model concludes with a Dense layer of 10 or 36 units (for the digit and letter classification tasks, respectively), using the softmax activation function to output probabilities for each class. A minimal sketch of this architecture is shown below.
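The following is a minimal Keras sketch of the LSTM architecture just described; exact details of the original implementation may differ.

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import LSTM, Dropout, Dense

    num_classes = 10  # 10 for the digit dataset, 36 for the letter dataset

    model = Sequential([
        # 28 rows as time steps, 28 pixel features per row.
        LSTM(256, return_sequences=True, input_shape=(28, 28)),
        Dropout(0.2),
        LSTM(256),  # returns only the final output, summarizing the sequence
        Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])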
Figure 3.7 illustrates the architecture of the LSTM, detailing all the layers utilized in this model.
3.4.3 Bidirectional LSTM

Figure 3.8: Architecture of BiLSTM

Figure 3.8 illustrates the architecture of the BiLSTM, detailing all the layers utilized in this model.
3.4.4 Resnet 50
The ResNet50 model is used for classifying Bangla Sign Language digits and letters. Initially, the images are preprocessed by resizing them to 224x224 pixels (ResNet50's required input size) and normalizing pixel values. The dataset is then split into training and test-
ing sets, and the labels are one-hot encoded to facilitate classification across 10 classes
and 36 classes. ResNet50, pre-trained on the ImageNet dataset, is imported without
its fully connected layers, and the pre-trained layers are frozen to retain their learned
weights. On top of ResNet50, custom classification layers are added, including a global
average pooling layer to convert the feature maps into a single vector, followed by a
dense layer with 1024 neurons for feature extraction, a dropout layer to prevent over-
fitting, and a final dense layer with 10 and 36 units for the digit and letter classification
task. The model is compiled and trained using different batch sizes over 50 epochs, with
a checkpoint mechanism implemented to save the best model based on validation accu-
racy. After training, the model is evaluated on the test dataset, and predictions are made
to assess its performance.
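A sketch of the ResNet50 transfer-learning setup described above, assuming the Keras applications API; the dropout rate is an assumption, as the text does not state it.

    from tensorflow.keras.applications import ResNet50
    from tensorflow.keras.layers import GlobalAveragePooling2D, Dense, Dropout
    from tensorflow.keras.models import Model

    # Pre-trained base without its fully connected layers, frozen so the
    # ImageNet weights stay intact.
    base = ResNet50(weights="imagenet", include_top=False,
                    input_shape=(224, 224, 3))
    base.trainable = False

    x = GlobalAveragePooling2D()(base.output)     # feature maps -> single vector
    x = Dense(1024, activation="relu")(x)
    x = Dropout(0.5)(x)                           # rate assumed; not stated in the text
    outputs = Dense(10, activation="softmax")(x)  # 10 digit classes (36 for letters)
    model = Model(inputs=base.input, outputs=outputs)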
3.4.5 Xception
The ‘Xception’ model is an enhanced version of the Inception architecture that incorpo-
rates depthwise separable convolutions, which consist of a depthwise convolution ap-
plied individually to each channel followed by a pointwise 1x1 convolution to combine
the channels. This innovation enables Xception to maintain the efficiency of Inception
while improving performance by making better use of parameters. Xception is particu-
larly suited for image classification tasks, having exhibited strong results on benchmarks
such as ImageNet. The model consists of an entry flow, middle flow blocks, and an exit flow, all utilizing depthwise separable convolutions. These convolutions, combined
with residual connections, reduce the number of parameters and computation while al-
lowing the network to extract intricate features. The use of residual connections also
helps to address the vanishing gradient problem, making it simpler to train deep net-
works. Xception’s structure is optimal for transfer learning, where pre-trained models
can be fine-tuned for specific applications with improved accuracy and faster training
times.
For the task of Bangla Sign Language letter classification, the Xception model has been
adapted as follows:
Data Preprocessing:
• Images are resized to 299x299 pixels (as required by Xception) and converted
from grayscale to RGB by repeating the single grayscale channel.
• The dataset is divided into training and testing sets, and the labels are one-hot
encoded for multi-class classification.
Model Architecture:
• A dense layer with 128 neurons and a dropout layer (0.2) is added to prevent
overfitting.
• A final dense layer with 36 units (one for each class) and softmax activation is
used for multi-class classification.
• The training is conducted with batch sizes of 16, 32, 64, and 128 for 50 epochs, with a checkpoint to save the best model based on validation accuracy. A sketch of this setup follows below.
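Putting the bullets above together, the following shows one plausible Keras setup for the Xception classifier; the global average pooling layer is an assumption, since the text lists only the dense head, and the placeholder grayscale batch stands in for the real preprocessed data.

    import numpy as np
    from tensorflow.keras.applications import Xception
    from tensorflow.keras.layers import GlobalAveragePooling2D, Dense, Dropout
    from tensorflow.keras.models import Model

    # Grayscale images replicated across three channels to satisfy Xception's
    # RGB input requirement (placeholder batch; shape (N, 299, 299) assumed).
    x_gray = np.zeros((4, 299, 299), dtype="float32")
    x_rgb = np.repeat(x_gray[..., np.newaxis], 3, axis=-1)

    base = Xception(weights="imagenet", include_top=False,
                    input_shape=(299, 299, 3))
    base.trainable = False

    h = GlobalAveragePooling2D()(base.output)   # pooling layer assumed
    h = Dense(128, activation="relu")(h)
    h = Dropout(0.2)(h)
    out = Dense(36, activation="softmax")(h)    # 36 letter classes
    model = Model(base.input, out)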
3.4.6 VGG16
The VGG16 deep learning model, devised by Karen Simonyan and Andrew Zisserman
at the University of Oxford [Simonyan and Zisserman, 2014], is renowned for its sim-
plicity yet powerful feature extraction capability in image classification tasks. With 16
weight layers, the architecture consists of compact 3x3 convolutional filters followed by
max-pooling layers to reduce spatial dimensions. This design has 138 million param-
eters, making it computationally intensive but highly effective for tasks such as image
recognition and feature extraction, particularly in large datasets like ImageNet. The
VGG16 model operates by passing an input image (typically 224x224 RGB) through its convolutional layers. These layers detect low-level features like edges
and textures, which are then refined through multiple convolution and pooling opera-
tions. Max-pooling layers help reduce the feature map’s dimensions, preserving essen-
tial information while preventing overfitting. The extracted high-level features are then
transmitted to fully connected layers for classification. In this architecture, two fully
connected layers have 4096 neurons each, followed by a final layer with 1000 neu-
rons (corresponding to 1000 ImageNet classes) using softmax activation to predict the
image class probabilities. For this specific task, VGG16 has been adapted for Bangla
Sign Language letter classification. The input images are resized to 50x50 pixels and
converted from grayscale to RGB by replicating the grayscale channel. Using the pre-
trained VGG16 model (without the top fully connected layers), the model is customized
by adding a flattening layer, a dense layer with 128 neurons, a dropout layer to prevent
overfitting, and a final output layer for the 36-letter classes and 10 digit classes.
The stages for this modified VGG16 model include:
Data Preprocessing: Images are read in grayscale, resized to 50x50, and normalized
to the range [0, 1]. The images are then divided into training and testing sets, with one-
hot encoding applied to the labels for multi-class classification.
Model Architecture:
• A dense layer with 128 neurons is included, followed by a dropout layer (0.2) to reduce overfitting.
• The final layer has 10 or 36 units (one for each class), with softmax activation for
multi-class classification.
Training the Model: The model is compiled with each of the four optimizers and categorical cross-entropy loss. It is trained using batch sizes of 16, 32, 64, and 128 for 50 epochs, with a checkpoint to save the best model based on validation accuracy.
This modified VGG16 model leverages transfer learning to accomplish robust feature
extraction while minimizing computational cost by freezing the pre-trained layers and
only training the added custom layers.
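A sketch of the modified VGG16 model described above, assuming the Keras API; it freezes the pre-trained convolutional base and trains only the custom head.

    from tensorflow.keras.applications import VGG16
    from tensorflow.keras.layers import Flatten, Dense, Dropout
    from tensorflow.keras.models import Model

    base = VGG16(weights="imagenet", include_top=False, input_shape=(50, 50, 3))
    for layer in base.layers:
        layer.trainable = False   # only the added custom layers are trained

    h = Flatten()(base.output)
    h = Dense(128, activation="relu")(h)
    h = Dropout(0.2)(h)
    out = Dense(36, activation="softmax")(h)  # 36 letter classes (10 for digits)
    model = Model(base.input, out)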
3.4.7 VGG-19
The VGG19 model is a deep Convolutional Neural Network (CNN) architecture renowned
for its efficacy in image recognition tasks. Developed by the Visual Geometry Group
(VGG) at Oxford [Simonyan and Zisserman, 2014], VGG19 comprises 19 layers, in-
cluding 16 convolutional layers and 3 fully connected layers, employing small 3x3 con-
volutional filters applied sequentially. This design enables the model to capture fine de-
tails efficiently while progressively reducing spatial dimensions through max-pooling
layers, which retain crucial features. In implementation, a pre-trained VGG19 model
is leveraged for classifying images from the Bangla Sign Language digit dataset. Ini-
tially, the dataset was preprocessed by importing grayscale images, resizing them to
50x50 pixels, and converting them to RGB format. The pixel values are normalized to
a range of 0 to 1, and the dataset is divided into training and testing sets, with labels
one-hot encoded for multi-class classification. The VGG19 model was imported with ImageNet weights while excluding the fully connected layers so that custom classification layers could be added. The model's output is flattened into a 1D vector, followed by a dense layer with 128 neurons and ReLU activation, and a dropout layer to mitigate overfitting.
The final output layer employs softmax activation to derive class probabilities for the
10 digit classes and 36 letter classes. To enhance training efficiency, the weights of the VGG19 layers were frozen, ensuring that only the custom layers learn during training.
The model is compiled with the optimizer (like: Adam, Nadam, Adamax and RMSprop)
and categorical cross-entropy loss, then trained using a checkpoint callback to save the
best-performing model based on validation accuracy.
After training, the model was evaluated on the test set, making predictions and calculating metrics such as F1 score, precision, and recall. The model's performance is also visualized using a confusion matrix, and a classification report is generated for detailed insights. Lastly, the training and validation accuracy and loss curves are plotted to observe the model's learning progress throughout the epochs.
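The checkpointed training loop described above might look like the following sketch, which assumes the compiled model and the preprocessed arrays x_train and y_train from the earlier steps; the file name is illustrative.

    from tensorflow.keras.callbacks import ModelCheckpoint

    # Save only the best-performing weights, judged by validation accuracy.
    checkpoint = ModelCheckpoint("best_vgg19.keras", monitor="val_accuracy",
                                 save_best_only=True, mode="max")

    history = model.fit(x_train, y_train,
                        validation_split=0.2,
                        epochs=50,
                        batch_size=16,        # also run with 32, 64, and 128
                        callbacks=[checkpoint])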
3.4.8 InceptionV3
3.4.9 CNN-LSTM
3.4.10 CNN-VGG16
testing sets, and labels are one-hot encoded for the 36-class classification problem. A pre-trained VGG16 model is used as the base, with its layers frozen to retain the learned features from the ImageNet dataset. Additional CNN layers are added on top of VGG16 for further feature extraction, with dropout layers incorporated between the convolutional layers to prevent overfitting. A GlobalMaxPooling2D layer then reduces each feature map to its strongest activation, producing a compact feature vector. The model is compiled using each of the four optimizers and trained for 50 epochs with batch sizes of 16, 32, 64, and 128. The best model is saved using a model checkpoint mechanism based on validation accuracy.
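A sketch of the CNN-VGG16 hybrid just described; the filter counts and dropout rate are assumptions, since the text does not specify them.

    from tensorflow.keras.applications import VGG16
    from tensorflow.keras.layers import Conv2D, Dropout, GlobalMaxPooling2D, Dense
    from tensorflow.keras.models import Model

    base = VGG16(weights="imagenet", include_top=False, input_shape=(50, 50, 3))
    base.trainable = False   # keep the ImageNet features fixed

    # Extra CNN layers on top of the frozen VGG16 output (filter counts assumed).
    h = Conv2D(256, (3, 3), padding="same", activation="relu")(base.output)
    h = Dropout(0.3)(h)
    h = Conv2D(128, (3, 3), padding="same", activation="relu")(h)
    h = GlobalMaxPooling2D()(h)   # keeps the strongest activation per feature map
    out = Dense(36, activation="softmax")(h)
    model = Model(base.input, out)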
3.4.11 CNN-VGG19
The CNN-VGG19 model classifies Bangla Sign Language letters using a hybrid architecture that integrates VGG19 as a feature extractor with additional custom CNN layers. The grayscale images in the dataset are resized to 50x50 and con-
verted to 3-channel RGB format to meet the input requirements of VGG19. The dataset
is split into training and testing sets, and labels are one-hot encoded for the 36-class
classification task. The VGG19 model, pre-trained on ImageNet, is used as the base
model with its fully connected layers removed. The layers of VGG19 are frozen to re-
tain the learned features from the original dataset. On top of this base, custom CNN
layers are added, consisting of two convolutional layers followed by max-pooling op-
erations to capture additional features from the images. These convolutional layers are
then flattened, and fully connected layers are added to perform the final classification.
3.4.12 CNN-InceptionV3
layers. Training is performed for 50 epochs with the different optimizers and batch sizes, and the model is monitored for validation accuracy.
CHAPTER 4
Precision: The precision of the model measures its capacity to accurately identify positive instances among the total predicted positives; its main focus is the correctness of positive predictions, and it is calculated as the ratio of true positives to the sum of true positives and false positives. Recall measures the proportion of actual positive samples that are correctly identified, and the F1-score is the harmonic mean of precision and recall:
    Precision = TP / (TP + FP)                               (4.1.1)

    Recall = TP / (TP + FN)                                  (4.1.2)

    F1 = 2 · (Precision · Recall) / (Precision + Recall)     (4.1.3)
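In practice these metrics can be computed with scikit-learn, as in the sketch below; the macro averaging and the variable names (model, x_test, y_test) are assumptions.

    import numpy as np
    from sklearn.metrics import (precision_score, recall_score, f1_score,
                                 classification_report)

    # Undo the one-hot encoding to recover integer class labels.
    y_true = np.argmax(y_test, axis=1)
    y_pred = np.argmax(model.predict(x_test), axis=1)

    precision = precision_score(y_true, y_pred, average="macro")
    recall = recall_score(y_true, y_pred, average="macro")
    f1 = f1_score(y_true, y_pred, average="macro")
    print(classification_report(y_true, y_pred))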
In order to effectively build and evaluate the CNN model, the dataset is split into training
and testing sets. The digit dataset features 1,500 single-channel (grayscale) images,
which are divided into 10 classes. Each class contains 150 images. Each of the 36 classes
in the letter dataset consists of 120 images, totaling 4,320 single-channel images. All images are resized to 50x50 pixels and normalized. A well-balanced split is maintained by reserving 20% of the data for testing, while approximately 80% is utilized for training. This arrangement enables the model to be trained on the majority of the data while retaining a significant portion for evaluation. The CNN model is trained over a period of 50 epochs, with each epoch representing a complete pass through the training dataset. During training, the model minimizes the categorical cross-entropy loss using gradient descent and backpropagation to update its weights at each epoch. To help avoid overfitting, the model's performance is tracked on a subset of the training data using a 20% validation split. This allows tracking and visualizing the learning curves, including both accuracy and loss for the training and validation sets.
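The split-and-train procedure just described corresponds to the following sketch; the stratification and random seed are assumptions, as are the variable names images, labels, and model.

    from sklearn.model_selection import train_test_split

    # 80/20 train-test split; a further 20% of the training data is held out
    # for validation during fitting, as described above.
    x_train, x_test, y_train, y_test = train_test_split(
        images, labels, test_size=0.2, stratify=labels, random_state=42)

    history = model.fit(x_train, y_train,
                        epochs=50,
                        validation_split=0.2,
                        batch_size=16)   # batch size varied: 16, 32, 64, 128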
The table below summarizes the performance of the Convolutional Neural Network
(CNN) model using various optimizers and batch sizes. The metrics evaluated include
accuracy, precision, recall, and F1-score, providing a comprehensive assessment of the
model’s performance.
Performance table of Digit Dataset:
Performance table of Letter dataset:
From the CNN performance Tables 4.1 and 4.2 we can see that:
Adam Optimizer: The best accuracy is 0.961 on the digit dataset and 0.888 on the letter dataset, both achieved with batch size 16. The very high accuracy, recall, and F1-score on the digit dataset demonstrate excellent model performance.
Nadam Optimizer: Also performs exceptionally, maintaining an accuracy of 0.97 on the digit dataset and 0.894 on the letter dataset over a range of batch sizes. Precision, recall, and F1-score remain near-perfect or very close to it.
Adamax Optimizer: Achieves high accuracy, 0.945 on the digit dataset and almost 0.90 on the letter dataset, especially with batch size 16.
RMSprop Optimizer: Shows excellent and reliable performance, with accuracy values of 0.966 on the digit dataset and 0.902 on the letter dataset.
The below bar charts demonstrate the performance of the CNN-based image classifica-
tion model for both the letter and digit datasets. The performance is evaluated across
several optimizers (Adam, Nadam, Adamax, RMSprop) and varying batch sizes (16, 32, 64, and 128). Each optimizer is represented by a different color, while the x-axis denotes the batch size and the y-axis represents the accuracy percentage.

Figure 4.1: Evaluation of CNN (Digit) Optimizers and Batch Sizes
For the CNN (Letter) Model: Figure 4.2 demonstrates how the model's accuracy changes with different optimizers and batch sizes. Across all batch sizes, the Adam and Nadam optimizers demonstrate generally stable and excellent performance. The Adamax optimizer likewise performs well but slightly lags behind the others in some circumstances. RMSprop exhibits competitive accuracy as the batch size increases, though its performance declines at the largest batch size of 128. This shows that Adam, Nadam, and Adamax are generally effective for training the CNN on the letter dataset, while RMSprop may require careful tuning.
For the CNN (Digit) Model: Figure 4.1 highlights the performance variations of the CNN model on the digit dataset with different optimizers and batch sizes. Adam, Nadam, and RMSprop consistently perform well across all batch sizes, achieving near-perfect accuracy. Adamax shows slightly lower performance, especially at larger batch sizes (64 and 128). RMSprop remains competitive, though it shows a slight drop at larger batch sizes. Across both models, the analysis suggests that Adam and Nadam are the most reliable optimizers, producing consistently high accuracy, while Adamax and RMSprop may require careful tuning based on the batch size.
4.2.1.3 Training Curves for CNN Model with Different Optimizers and Batch
Sizes.
The training curve shows how the model learns over the course of training by tracking its performance metric (such as accuracy) over a series of epochs.
The validation curve tracks the model's performance metric on a separate validation set throughout training, demonstrating how effectively the model generalizes to new data.
The loss curve shows how the model's loss function decreases across epochs, demonstrating the model's capacity to reduce errors and improve prediction accuracy during training.
Figure 4.3: CNN(Digit) Accuracy and Loss Curve For Nadam optimizer with Batch Size 32
The training accuracy rapidly increases in the first few epochs, reaching 97% accuracy on the digit dataset within 50 epochs. The validation accuracy similarly exhibits a quick increase during the initial epochs, though not as steep as the training accuracy. The validation loss remains low and reasonably consistent after the initial fall, with slight fluctuations.
Figure 4.4: CNN(Letter) Accuracy and Loss Curve For RMSprop optimizer with Batch Size 16
The training accuracy rapidly increases in the first few epochs, reaching 90.2% accuracy
on the letter dataset within 50 epochs.
Figure 4.5: Classification report for Nadam optimizer batch size 32 (Digit Dataset)
Figure 4.6: Classification report for RMSprop optimizer batch size 16 (Letter Dataset)
In the digit dataset, class 1 in the first classification report has a support of 35, meaning that 35 samples actually belong to class 1 in the ground truth. It has a precision of 1.00, indicating that all 35 samples predicted to be in class 1 are indeed correctly
classified with no false positives. The recall is 1.00, showing that the model accurately
identified all 35 true class 1 samples, leaving no false negatives. With a perfect F1-score
of 1.00, this class demonstrates a flawless balance between precision and recall. In the
second classification report of letter dataset, class 1 has a support of 19, meaning that
there are 19 actual samples for class 1. The model’s precision is 0.95, meaning that most
samples predicted to be class 1 were correct, but a few were false positives. The recall
is also 0.95, indicating that the model correctly predicted 18 out of the 19 true class 1
samples, missing 1 sample as a false negative. The F1-score of 0.95 shows that class 1
has a very good balance between precision and recall, with only minor inaccuracies.
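For reference, metrics like those quoted above can be reproduced with scikit-learn's classification report; in this small sketch the label arrays are toy values, not the thesis data.

```python
from sklearn.metrics import classification_report

# Toy labels standing in for the thesis data.
y_true = [1, 1, 1, 0, 0, 2, 2, 2, 2]
y_pred = [1, 1, 1, 0, 0, 2, 2, 1, 2]

# For each class: precision = TP / (TP + FP), recall = TP / (TP + FN),
# F1 = 2 * precision * recall / (precision + recall),
# support = number of true samples of that class.
print(classification_report(y_true, y_pred))
```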
Figure 4.7: Confusion Matrix For Nadam optimizer batch size 32 (Digit Dataset)
Using the Nadam optimizer with batch size 32, the Convolutional Neural Network (CNN) model for the digit dataset identified all 10 classes with remarkable accuracy. This result reflects the CNN's robustness and efficiency in recognizing complex patterns across a variety of classes, and its strong recall and precision suggest it is well suited to real-world scenarios involving complex categorization tasks.
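A confusion matrix such as the one in Figure 4.7 can be computed and drawn as follows; here the labels are toy stand-ins for the model's test-set predictions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Toy labels standing in for the test set and the model's predictions.
y_true = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 2, 2, 2, 2])

cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class
ConfusionMatrixDisplay(cm).plot()
plt.show()

# With a trained Keras model, predictions come from the class with the
# highest softmax score: np.argmax(model.predict(x_test), axis=1).
```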
Figure 4.8: Confusion Matrix For RMSprop optimizer batch size 16 (Letter Dataset)
The highlighted diagonal cells show that the CNN model for the letter dataset, trained with the RMSprop optimizer and batch size 16, correctly classified samples across all 36 classes.
4.2.2 LSTM Model
Optimizer   Batch Size   Precision   Recall   F1-Score   Accuracy
Adam        16           78.9        78.4     77.8       78
            32           83.2        83.5     82.9       84
            64           83.7        83.9     83.4       84
            128          80.8        80.7     80.2       80
Nadam       16           80.9        80.8     79.9       80
            32           84.2        84.4     84.0       84
            64           83.4        83.7     83.1       83
            128          79.2        79.6     78.9       80
Adamax      16           80.5        80.7     79.9       81
            32           81.8        81.6     81.2       81
            64           78.7        78.3     78.0       78
            128          76.3        76.0     75.7       76
RMSprop     16           81.0        81.3     80.8       81
            32           78.9        78.5     78.2       79
            64           77.0        76.7     76.5       77
            128          74.5        74.8     74.1       75
Tables 4.3 and 4.4 summarize the performance of the LSTM model using various optimizers and batch sizes. The metrics evaluated include accuracy, precision, recall, and F1-score, providing a comprehensive assessment of the model's performance.
Adam Optimizer: Overall, the Adam optimizer with batch size 16 performed best on the digit dataset, achieving 95% accuracy, 95.6% precision, 95% recall, and a 95.2% F1-score. On the letter dataset, Adam with batch sizes 32 and 64 reached 84% accuracy.
Nadam Optimizer: The best result for Nadam came at batch size 16, reaching 93.7% accuracy, 93.4% precision, 93.4% recall, and an F1-score of 93. As the batch size increased, performance declined slightly: batch size 32 produced similar results, while batch sizes 64 and 128 dropped to 89% and 76% accuracy, respectively.
Adamax Optimizer: Batch size 32 produced the best result for Adamax, with 87.3% accuracy, 86.1% precision, 86.1% recall, and an F1-score of 86. Batch size 16 gave slightly lower accuracy at 86.2%, while the larger batch sizes of 64 and 128 saw considerable decreases, with accuracy dropping to 78% and 53.8%.
RMSprop Optimizer: For RMSprop, batch size 16 performed best, achieving 91.9% accuracy, 91.3% precision, 91.4% recall, and an F1-score of 92. Performance declined as the batch size increased: batch size 32 dropped to 88% accuracy, and batch sizes 64 and 128 fell further to 85% and 84%, respectively.
Figure 4.10: Evaluation of LSTM Optimizers and Batch Sizes (Letter)
For the LSTM (Digit) Model: accuracy declines for every optimizer as the batch size grows toward 128. RMSprop also displays consistent performance but with a steeper drop-off at batch size 128. Overall, Adam and Nadam emerge as the most dependable optimizers, while Adamax and RMSprop require more careful tuning as the batch size increases.
For the LSTM (Letter) Model: The graph demonstrates how the model's accuracy changes with the choice of optimizer and batch size. The Adam, Nadam, and Adamax optimizers show remarkably similar performance, consistently reaching high accuracy across all batch sizes. RMSprop also performs well but exhibits a slight drop at larger batch sizes, particularly at batch size 128. While all optimizers work well, Adam and Nadam appear slightly more consistent, especially as batch sizes increase.
In both models, the Adam and Nadam optimizers are highly effective, consistently delivering strong results across batch sizes. Adamax and RMSprop work well but show increasing sensitivity to larger batch sizes, particularly on the digit dataset.
4.2.2.4 Training Curves for LSTM Model with Different Optimizers and Batch
Sizes.
The training, validation, and loss curves are defined in Section 4.2.1.3.
Figure 4.11: LSTM(Digit) Accuracy and Loss Curve For Adam optimizer with Batch Size 16.
The graphs show that the model learns quickly on the digit dataset in the first few epochs, with accuracy rising in both training and validation while loss decreases. The close alignment of the training and validation accuracy and loss curves indicates good generalization, although mild overfitting may begin to appear near the end of training. Overall, the model performs well, with strong learning and generalization.
Figure 4.12: LSTM (Letter) Accuracy and Loss Curve For Nadam with Batch Size 32.
The graphs show the accuracy and loss curves for the letter dataset with the Nadam optimizer and batch size 32, reaching 84.4% accuracy within 50 epochs.
A classification report provides key metrics, namely precision, recall, F1-score, and support, for each class, giving a detailed overview of a classification model's performance. Precision indicates how many of the model's positive predictions are correct, whereas recall evaluates its ability to detect all relevant instances. The F1-score balances precision and recall, offering insight into performance when both false positives and false negatives matter. Support is the number of true examples of each class in the dataset. The classification report helps identify model strengths and weaknesses across classes, allowing focused improvements to the model's predictive capability.
Figure 4.13: Classification report for Adam optimizer batch size 16 (Digit)
Figure 4.14: Classification report for Nadam optimizer batch size 32 (Letter)
The classification reports in Figures 4.13 and 4.14 indicate that the model performs exceptionally well across most classes, with precision, recall, and F1-scores generally high on the digit dataset. The letter dataset shows slightly lower metrics, suggesting some difficulty in accurate prediction. The overall accuracies of 95% and 84% highlight the model's effectiveness in classifying instances correctly. For the digit dataset, both the macro and weighted averages of precision, recall, and F1-score are 0.97, confirming balanced and consistent performance across all classes. This demonstrates the model's robustness and reliability in handling the dataset.
Figure 4.15: Confusion Matrix For Adam optimizer batch size 16 (Digit)
The LSTM model for the digit dataset, with the Adam optimizer and batch size 16, identified all 10 classes in the sample with remarkable accuracy.
Figure 4.16: Confusion Matrix For Nadam optimizer batch size 32(Letter)
4.2.3 BiLSTM Model
The BiLSTM image classification model uses Bidirectional Long Short-Term Memory (BiLSTM) networks to categorize images into 36 classes for the letter dataset and 10 classes for the digit dataset. The input images are first scaled to 28x28 pixels and converted to grayscale, with pixel values normalized to the range 0 to 1 for stable training. The architecture consists of two BiLSTM layers, each with 256 units, with dropout regularization after each layer to prevent overfitting. The final output layer uses softmax activation to classify the images. For training, the images are split into training and testing sets, with labels one-hot encoded into 36 categories for the letter dataset and 10 categories for the digit dataset. The model is compiled with the Nadam optimizer at a learning rate of 0.001 and categorical cross-entropy as the loss function, then trained for 50 epochs with a batch size of 16 and a 20% validation split. A model checkpoint callback saves the best-performing model during training. After training, the model's performance is tested on the test set, and key measures, including accuracy, F1-score, precision, and recall, are calculated.
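A minimal Keras sketch of this setup follows; the dropout rate is an assumption, since the text does not give one, and the checkpoint file name is illustrative.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

num_classes = 36  # 36 for the letter dataset, 10 for the digit dataset

model = models.Sequential([
    # Each 28x28 grayscale image is read as a sequence of 28 rows of 28 pixels.
    layers.Input(shape=(28, 28)),
    layers.Bidirectional(layers.LSTM(256, return_sequences=True)),
    layers.Dropout(0.2),  # dropout rate assumed; the text does not specify it
    layers.Bidirectional(layers.LSTM(256)),
    layers.Dropout(0.2),
    layers.Dense(num_classes, activation="softmax"),
])

model.compile(
    optimizer=tf.keras.optimizers.Nadam(learning_rate=0.001),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

# Save the best-performing model seen during training.
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    "best_bilstm.keras", monitor="val_accuracy", save_best_only=True
)
# model.fit(x_train, y_train, epochs=50, batch_size=16,
#           validation_split=0.2, callbacks=[checkpoint])
```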
Performance table of the Letter Dataset:
Tables 4.5 and 4.6 summarize the performance of the BiLSTM model using various optimizers and batch sizes. The metrics evaluated include accuracy, precision, recall, and F1-score, providing a comprehensive assessment of the model's performance.
Adam Optimizer: Adam performed well for batch sizes from 16 to 128. Accuracy on the digit dataset ranged from 88% to 93%, with F1-scores between 88.1 and 92.7. The letter dataset performed slightly worse, with F1-scores from 80.7 to 84.3 and accuracy from 81% to 85%. Precision and recall followed a similar pattern, with smaller batch sizes giving the best results.
Nadam Optimizer: Nadam performed similarly to Adam. On the digit dataset, accuracy ranged from 81% to 94%, with F1-scores from 81.0 to 93.6, and it slightly outperformed Adam at smaller batch sizes (e.g., batch size 16). On the letter dataset, performance was likewise consistent, with accuracy between 81% and 87% and F1-scores from 81.3 to 86.5; smaller batch sizes again produced the best outcomes.
Adamax Optimizer: Performance decreased more noticeably with Adamax at larger batch sizes. F1-scores on the digit dataset ranged from 63.3 to 89.3, with accuracy from 64% to 89%. Similar patterns appeared in the letter dataset, where F1-scores ranged from 76.4 to 81.9 and accuracy from 77% to 82%. This optimizer performed best at smaller batch sizes but degraded rapidly as the batch size increased.
RMSprop Optimizer: RMSprop showed stable performance across batch sizes. On the digit dataset, accuracy ranged from 83% to 94%, with F1-scores between 84.1 and 93.7. On the letter dataset, performance was lower but consistent, with accuracy from 78% to 81% and F1-scores from 77.9 to 81.3. It performed best at smaller batch sizes, especially 16 and 32, on both datasets.
The bar charts show the performance of the Bi-LSTM model on both the letter and digit datasets, evaluated with four optimizers (Adam, Nadam, Adamax, RMSprop) and varying batch sizes (16, 32, 64, 128). The x-axis shows the batch size, the y-axis the accuracy in percent, and each optimizer is color-coded.
For the Bi-LSTM (Digit) Model: Figure 4.18 shows how the different optimizers perform on the digit dataset. Adam, Nadam, and RMSprop achieve high accuracy across most batch sizes, exceeding 80% at the lower batch sizes. Adamax, however, loses some accuracy as the batch size grows, especially at 64 and 128. RMSprop similarly drops at batch size 128 while remaining competitive at other levels.
For the Bi-LSTM (Letter) Model: The accompanying figure shows the accuracy trends for the different optimizers and batch sizes. Adam and Nadam consistently deliver high, stable accuracy across all batch sizes, reaching over 80%. Adamax also performs well but falls slightly behind Adam and Nadam at larger batch sizes, though it remains competitive. RMSprop is comparable at smaller batch sizes but shows a slight reduction in accuracy as the batch size grows, particularly at 128. This indicates that Adam and Nadam are the stronger optimizers for the Bi-LSTM (Letter) model, whereas Adamax and RMSprop may need fine-tuning depending on the batch size. Overall, Adam and Nadam are the most consistent optimizers for the Bi-LSTM (Digit) model, while Adamax and RMSprop show performance variations that may require further adjustment.
4.2.3.4 Training Curves for BiLSTM Model with Different Optimizers and Batch
Sizes.
The training, validation, and loss curves are defined in Section 4.2.1.3.
The combined pattern of rising training accuracy and falling training loss over the epochs indicates that the model is successfully learning to classify the classes, gradually improving its ability to assign the correct categories to the training data.
Figure 4.19: Bi-LSTM Accuracy and Loss Curve For Nadam with Batch Size 16 (Digit)
Figure 4.20: Bi-LSTM Accuracy and Loss Curve For Nadam with Batch Size 16 (Letter)
The training and validation curves for the Adam and RMSprop optimizers demonstrate good performance across all tested measures, with consistent, rising trajectories indicating effective model training and generalization. In comparison, the curves associated with the Adagrad optimizer show significantly poorer performance, with erratic trends and minimal progress over the epochs.
Figure 4.22: Classification report for Nadam batch size 16 (Letter)
According to the classification results above, the Bi-LSTM model performs well across most classes, with precision and recall values usually above 0.90 on the digit dataset and frequently reaching 0.85 on the letter dataset. This indicates a high degree of precision in identifying positive samples and a high recall in finding all positive samples for these classes. Performance is noticeably worse for some classes, however, with precision and recall scores closer to 0.82 on the digit dataset or 0.47 on the letter dataset. This may mean that certain classes are harder for the model to distinguish, or that they are underrepresented in the data. For the majority of classes, the F1-score, the harmonic mean of precision and recall, stays high, indicating balanced performance between precision and recall on both datasets. The best accuracy was 94% for digits and 87% for letters.
4.2.3.6 Confusion Matrix
Figure 4.23: Confusion Matrix For Nadam optimizer batch size 16 (Digit)
Figure 4.24: Confusion Matrix For Nadam optimizer batch size 16 (Letter)
4.2.4 VGG-16
This model uses transfer learning with the VGG16 architecture, pretrained on ImageNet, to classify images. Two datasets are used: one for digits with 10 classes and another for letters with 36 classes. The input images are scaled to 50x50 pixels and preprocessed to match VGG16's input requirements; grayscale images are converted to RGB by replicating the single channel. The model's custom classification head includes fully connected layers, dropout regularization, and a softmax layer for class prediction.
The VGG16 base layers are frozen, and the Adam and Nadam optimizers are used in independent experiments. The model is trained with a batch size of 16 for 50 epochs and a validation split of 20%.
For both datasets, the model is evaluated using classification measures including accuracy, F1-score, precision, recall, and confusion matrices. These metrics illustrate the model's effectiveness in detecting Bangla Sign Language characters across the different classes.
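A minimal sketch of this transfer-learning setup is shown below; the size of the dense head and the dropout rate are assumptions, since the text does not specify them.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

num_classes = 10  # 10 for the digit dataset, 36 for the letter dataset

base = VGG16(weights="imagenet", include_top=False, input_shape=(50, 50, 3))
base.trainable = False  # freeze the pretrained convolutional base

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),  # head size assumed
    layers.Dropout(0.5),                   # dropout rate assumed
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])

def gray_to_rgb(x):
    # Convert (N, 50, 50) grayscale images to RGB by replicating the channel.
    return np.repeat(x[..., np.newaxis], 3, axis=-1)

# model.fit(gray_to_rgb(x_train), y_train, epochs=50,
#           batch_size=16, validation_split=0.2)
```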
Optimizer   Batch Size   Precision   Recall   F1-Score   Accuracy
Adam        16           83.6        83.7     83.0       84
            32           83.9        83.6     83.1       84
            64           81.7        81.9     81.3       82
            128          79.3        79.8     79.0       80
Nadam       16           83.7        83.8     82.9       83
            32           83.5        83.6     83.2       83
            64           81.6        81.8     81.1       82
            128          79.0        79.5     78.8       80
Adamax      16           77.7        78.1     77.2       78
            32           76.6        76.9     76.1       77
            64           74.9        75.2     74.3       75
            128          69.6        69.8     68.8       70
RMSprop     16           83.7        83.8     83.3       84
            32           84.1        83.9     83.7       84
            64           80.5        80.0     79.5       80
            128          79.7        79.8     78.7       79
4.2.4.2 Explanation of VGG16 Performance Table
The table above summarizes the performance of the VGG16 model using various optimizers and batch sizes. The metrics evaluated include accuracy, precision, recall, and F1-score, providing a comprehensive assessment of the model's performance.
Adam Optimizer: Adam shows strong performance across batch sizes, with accuracy ranging from 84% down to 80% on the letter dataset and from 96% down to 93% on the digit dataset. Precision, recall, and F1-scores also remain consistently good, especially at smaller batch sizes: on the letter dataset, scores hover around 84% for batch sizes 16 and 32, while on the digit dataset the same measures reach 96% at batch size 16.
Nadam Optimizer: Nadam performs similarly to Adam, with accuracy between 83% and 80% for letters and between 95% and 94% for digits. Its precision, recall, and F1-scores are close to Adam's, with batch sizes of 16 and 32 frequently scoring above 83% for letters and 94% to 95% for digits, making it a strong choice for both datasets.
Adamax Optimizer: Adamax delivers somewhat lower performance than Adam and Nadam. For the letter dataset, accuracy ranges from 78% down to 70%, while for the digit dataset it reaches up to 92% at the smaller batch sizes. Precision, recall, and F1-scores follow similar patterns, with smaller batch sizes producing better outcomes; on the digit dataset, precision is 92% at batch size 16 but drops to 86% at larger batch sizes.
RMSprop Optimizer: RMSprop gives excellent performance, particularly at batch sizes of 32 and 64, where accuracy reaches 84% for letters and 96% to 94% for digits. Precision, recall, and F1-scores are generally strong, with letter dataset metrics around 83% to 84% and digit dataset metrics peaking at 94%, indicating accurate and reliable classification results.
The bar charts show the performance of the VGG16 model on both the letter and digit datasets, evaluated using the four optimizers (Adam, Nadam, Adamax, RMSprop) and varying batch sizes (16, 32, 64, 128). The x-axis shows the batch size, and the y-axis the accuracy percentage.
Figure 4.25: Evaluation of VGG16 Optimizers and Batch Sizes (Digit)
For the VGG16 (Letter) Model: performance on the letter dataset shows consistent trends across the four optimizers. Adam and Nadam achieve relatively good accuracy at all batch sizes, with little difference between them. Adamax also performs reasonably, though it falls noticeably behind Adam and Nadam, especially at batch sizes of 32 and 128. RMSprop performs satisfactorily at lower batch sizes but drops more visibly at batch size 128. Overall, Adam and Nadam appear the most reliable for sustaining high accuracy on the letter dataset, while RMSprop may require fine-tuning at larger batch sizes.
For the VGG16 (Digit) Model: the digit dataset follows the same pattern as the letter dataset. Adam, Nadam, and RMSprop show consistent, strong accuracy at all batch sizes, though RMSprop decreases slightly at the larger ones. Adamax falls behind the other optimizers in some cases, particularly at batch sizes 32 and 64; nevertheless, the differences among the optimizers are small in terms of overall performance. Once again, Adam and Nadam stand out as the strongest performers, delivering high accuracy consistently, whereas Adamax and RMSprop may require further adjustment, particularly at larger batch sizes. These findings suggest that Adam and Nadam are generally the most effective optimizers for the VGG16 model on both datasets, while Adamax and RMSprop may need more attention to maintain comparable performance as the batch size grows.
4.2.4.4 Training Curves for VGG16 Model with Different Optimizers and Batch
Sizes.
The training, validation, and loss curves are defined in Section 4.2.1.3.
Figure 4.27: VGG16(Digit) Accuracy and Loss Curve For Adam optimizer with Batch Size 16
Figure 4.28: VGG16(Letter) Accuracy and Loss Curve For Adam optimizer with Batch Size 16
Figure 4.29: Classification report for Adam optimizer batch size 16(Digit)
Figure 4.30: Classification report for Adam optimizer batch size 16(Letter)
4.2.4.6 Confusion Matrix
Figure 4.31: Confusion Matrix For Adam optimizer batch size 16 (Digit)
Figure 4.32: Confusion Matrix For Adam optimizer batch size 16 (Letter)
4.2.5 VGG-19
The VGG19 model was utilized for image classification on two datasets: a 36-class letter dataset and a 10-class digit dataset. The images were preprocessed with background removal, resized to 50x50 pixels, converted to RGB, and normalized. The VGG19 model was initialized with pre-trained weights from ImageNet, excluding the fully connected layers, and custom classification layers were added. For the letter dataset, Nadam was used as the optimizer, while Adamax was employed for the digit dataset. The models were trained with batch sizes of 16 over 50 epochs, and early stopping with model checkpointing was applied.
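A minimal sketch of this preprocessing pipeline follows, assuming the rembg library is used for the background-removal step (as the thesis keywords suggest); the file path is illustrative.

```python
import numpy as np
from PIL import Image
from rembg import remove

def preprocess(path):
    """Background-remove, resize to 50x50 RGB, and normalize to [0, 1]."""
    img = Image.open(path)
    img = remove(img)  # strip the background (returns an RGBA image)
    img = img.convert("RGB").resize((50, 50))
    return np.asarray(img, dtype=np.float32) / 255.0

# x = preprocess("sign_sample.png")  # -> (50, 50, 3) array ready for VGG19
```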
Training and validation accuracy and loss curves were plotted to assess performance.
On the test set, the models were evaluated using metrics such as loss, accuracy, preci-
sion, recall, and F1 score. Additionally, confusion matrices were generated to visualize
the classification performance.
Optimizer   Batch Size   Precision   Recall   F1-Score   Accuracy
Adam        16           83.6        83.7     83.0       84
            32           83.9        83.6     83.1       84
            64           81.7        81.9     81.3       82
            128          79.3        79.7     79.0       80
Nadam       16           83.7        83.8     82.9       83
            32           83.5        83.6     83.2       83
            64           81.6        81.8     81.1       82
            128          79.0        79.5     78.8       80
Adamax      16           77.7        78.1     77.2       78
            32           76.6        76.9     76.1       77
            64           74.6        74.8     73.9       75
            128          69.6        69.8     68.8       70
RMSprop     16           83.7        83.8     83.3       84
            32           84.1        83.9     83.7       84
            64           80.5        80.0     79.5       80
            128          79.7        79.8     78.7       79
4.2.5.2 Explanation of VGG19 Performance Table
The VGG19 model was evaluated with multiple optimizers across different batch sizes. With the Adam optimizer, accuracy ranged from 94% down to 88%, precision from 94.8% to 88.7%, recall from 94.3% to 87.7%, and F1-score from 94.4% to 87.8% as the batch size increased from 16 to 128. The Nadam optimizer produced comparable results, with accuracy from 92% to 86% and little variation in precision, recall, and F1-score across batch sizes. The Adamax optimizer showed a more pronounced decline, with accuracy falling from 91% at batch size 16 to 74% at batch size 128; its precision, recall, and F1-scores followed similar trends, reflecting the model's reduced ability to generalize at larger batch sizes. In contrast, RMSprop maintained fairly constant performance, with accuracy from 92% to 82%, showing less variability than the Adam and Nadam optimizers.
The graphs show the performance of the VGG19 (Letter) and VGG19 (Digit) models under the four optimizers (Adam, Nadam, Adamax, RMSprop) and varying batch sizes (16, 32, 64, 128). Batch size is shown on the x-axis, and each optimizer's accuracy percentage on the y-axis.
Figure 4.34: Visualization of VGG19 (Letter) Optimizers Across Different Batch Sizes
VGG19 (Letter) Model: this graph illustrates how the different optimizers and batch sizes affect the model's accuracy. Adam and Nadam show consistently strong, stable results across all batch sizes. The Adamax optimizer performs noticeably worse than Adam and Nadam, particularly at larger batch sizes. RMSprop generally produces competitive results but performs considerably worse when the batch size increases to 128. This indicates that Adam and Nadam are good choices for this dataset, while Adamax and RMSprop may need careful batch-size tuning for best results.
VGG19 (Digit) Model: this graph examines how the model performs with the various optimizers and batch sizes on the digit dataset. Once more, Adam and Nadam provide excellent, consistent accuracy at all batch sizes, and RMSprop also produces competitive values. Adamax performs slightly worse, especially as the batch size increases. As with the VGG19 (Letter) model, RMSprop remains competitive but declines slightly at the largest batch size (128).
Across both models, Adam and Nadam are the most effective optimizers, consistently producing high accuracy at all batch sizes. Adamax and RMSprop show greater variability and may need further adjustment depending on the batch size to achieve optimal performance.
4.2.5.4 Training Curves for VGG19 Model with Different Optimizers and Batch
Sizes.
Figure 4.35: VGG19(Digit) Accuracy and Loss Curve For Adam optimizer with Batch Size 16
Figure 4.36: VGG19(Letter) Accuracy and Loss Curve For Adam optimizer with Batch Size 16
Figure 4.37: Classification report of VGG19(Digit) for Adam optimizer batch size 16
Figure 4.38: Classification report of VGG19(Letter) for Adam optimizer batch size 16
4.2.5.6 Confusion Matrix
Figure 4.39: Confusion Matrix For Adam optimizer batch size 16 (Digit)
Figure 4.40: Confusion Matrix For Adam optimizer batch size 16 (Letter)
CHAPTER 5
CONCLUSION
This thesis addressed the recognition and classification of Bangla Sign Language (BdSL) hand signs, spanning 10 digit classes and 36 letter classes, with the goal of supporting communication for the hearing-impaired community. Preprocessing steps such as background removal and binarization improved the clarity of the hand-sign images, which proved crucial for effective feature extraction and classification. Various models, including CNNs, LSTMs, BiLSTMs, and pre-trained architectures such as VGG16 and VGG19, were applied to the classification task. The research investigated four optimization algorithms (Adam, Nadam, Adamax, and RMSprop) and varying batch sizes, identifying Adam and Nadam as the most successful owing to their adaptive learning rates. RMSprop also performed well, particularly at smaller batch sizes, whereas Adamax proved more sensitive to larger batch sizes and required more careful tuning.
While this thesis has made substantial progress in BdSL recognition, there is still room for further research. Future work might apply more advanced deep learning methods to improve recognition quality, and expanding the dataset, for example with compound characters, would give the models a more complete training set. Building on these results could enable more robust and efficient recognition systems, ultimately improving communication accessibility for the hearing-impaired community.