Thesis

Enhancing Bangla Sign Language Recognition Through Batch Size and Optimizer
Variations in Deep Learning
Submitted in partial fulfillment of the requirements

for the Degree of Bachelor of Science
in Computer Science & Engineering
by
Imam Hossain Rafi
ID: CSE-02207156
Supervised by
Shafayet Nur
Lecturer
Department of Computer Science & Engineering

Port City International University
7, Nikunja Housing Society, South Khulshi, Chattogram - 4202, Bangladesh
APPROVAL FOR SUBMISSION
The thesis titled ’Enhancing Bangla Sign Language Recognition Through Batch Size
and Optimizer Variations in Deep Learning’submitted by ID: CSE-02207156 , Ses-
sion has been accepted as satisfactory in fulfilment of the requirement for the degree of
Bachelor of Science in Computer Science & Engineering to be awarded by the Port City
International University.
Shafayet Nur
Lecturer
Department of Computer Science & Engineering
Email:
Cell:
i
DEDICATION
This thesis is a heartfelt tribute to my beloved parents and respected educators.
ii
DECLARATION OF ORIGINALITY
This certifies that Imam Hossain Rafi CSE 02207156 is the research author, and that
neither the research nor any portion of it has been submitted to any other university for
credit toward a degree. To the best of our knowledge, my research does not violate
any copyright or proprietary rights, and all ideas, methods, quotations, or other material
from other people’s work that is included in our thesis—published or not—is properly
credited in accordance with accepted referencing guidelines. I am also aware that the
Department of CSE, PCIU, may take legal and disciplinary action against me if any
copyright infringement is discovered, whether intentional or not. Without the Depart-
ment of CSE, PCIU’s permission, any reproduction or use of this thesis work in any form
or by any means whatsoever is forbidden. I hereby transfer all rights in the copyright of
this thesis work to them.
Signature of the candidate

Imam Hossain Rafi
ID: CSE-02207156
Department of CSE
iii
ABSTRACT
The identification and categorization of Bangla Sign Language (BdSL) numbers and
letters has become more significant for the purpose of providing assistance to the com-
munity of hearing-impaired individuals as the demand for effective communication tools
that are accessible to all individuals continues to rise. Utilizing the Shongket dataset,
which consists of 10 number classes and 36 letter classes, this paper gives a complete
analysis into the identification of Burmese Sign Language (BdSL) hand signals. Dur-
ing preprocessing, background elimination and binarization techniques were utilized
to increase the clarity of hand sign pictures, permitting improved feature extraction.
The research assesses the performance of Machine Learning, Deep Learning, and Hy-
brid Models for identifying these hand signals. Specifically, four distinct optimiza-
tion algorithms—Adam, Nadam, Adamax, and RMSprop—were tried, with variable
batch sizes (16, 32, 64, and 128) to measure their influence on model training and accu-
racy. Comprehensive experiments and comparative studies were undertaken to discover
the ideal combinations of optimizers and batch sizes for boosting classification perfor-
mance. The results give useful insights into the strengths and drawbacks of each tech-
nique, leading to developments in sign language recognition technology and enabling
the creation of more robust and efficient systems for BdSL identification.
Keywords: optimizer based Bangla sign language identification, Bangla sign language
classifiaction, Sign language detection , Background remove using rembg
iv
ACKNOWLEDGEMENT
We extend our sincere appreciation to a number of individuals whose steadfast support

and efforts have enabled the realisation of this research. Above all, we express our heart-
felt gratitude to our supervisor, Ms.Tahmina Akter, Senior Lec- Lurer, Dept. of CSE,
PCIU for her consistent mentorship, unshakable backing, and priceless aid through-
out our research expedition. Her boundless inspiration, unwavering patience, profound
comprehension, and eagerness to encourage autonomous exploration have played a cru-
cial role in determining the course of this project. We are profoundly grateful for her
unwavering support and the abundance of thoughtprovoking concepts she offered. We
would like to express our genuine gratitude to the members of our defence committee,
whose invaluable guidance and constructive criticism greatly enhanced the calibre of
this research. We express our gratitude to the author , Chandan Biswas for providing us
the dataset. The combined endeavours of these individuals have played a crucial role in
defining our research, and their assistance is sincerely valued.
v
TABLE OF CONTENTS
Title Page No.
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
ACKNOWLEDGEMENT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
CHAPTER 1 INTRODUCTION 1
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Sign language: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Application of Bangla sign language identification and classification methods 5
1.5 Key Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
CHAPTER 2 LITERATURE REVIEW 8

2.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
CHAPTER 3 METHODOLOGY 13
3.1 Dataset preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.1.1 Background removal using rembg . . . . . . . . . . . . . . . . . . 15
3.1.2 Gray Scale conversion . . . . . . . . . . . . . . . . . . . . . . . . 15
3.1.3 Binarization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.3 Optimizer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.3.1 Adam (Adaptive Moment Estimation) . . . . . . . . . . . . . . . . 18
3.3.2 Nadam (Nesterov-accelerated Adaptive Moment Estimation) . . . . 19
3.3.3 RMSProp (Root Mean Square Propagation): . . . . . . . . . . . . 20
3.3.4 Adamax : . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.4 Model description: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.4.1 Convolutional Neural Network (CNN) . . . . . . . . . . . . . . . . 21
3.4.2 Long Short-Term Memory . . . . . . . . . . . . . . . . . . . . . . 22
3.4.3 Bidirectional LSTM . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.4.4 Resnet 50 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.4.5 Xception . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.4.6 VGG16 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.4.7 VGG-19 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.4.8 InceptionV3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4.9 CNN-LSTM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
vi
3.4.10 CNN-VGG16 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4.11 CNN-VGG19 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.4.12 CNN-InceptionV3 . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
CHAPTER 4 RESULTS AND PERFORMANCE ANALYSIS 32

4.1 Evaluation metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2 Result Analysis and Discussion: . . . . . . . . . . . . . . . . . . . . . . . 32
4.2.1 CNN Model : . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2.2 LSTM Model : . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2.3 BiLSTM Model : . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2.4 VGG-16 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.2.5 VGG-19 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
CHAPTER 5 CONCLUSION 71
5.1 Future Work: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
REFERENCES 72
vii
LIST OF FIGURES
3.1 Steps in the process of identifying Bangla sing language and classification 14
3.2 Sample image of before and after background elimination. . . . . . . . . . 15
3.3 Steps involves in preprocessing . . . . . . . . . . . . . . . . . . . . . . . 16
3.4 Visual overview of dataset for digit . . . . . . . . . . . . . . . . . . . . . . 17
3.5 Visual overview of dataset for letter . . . . . . . . . . . . . . . . . . . . . 18
3.6 Architecture of CNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.7 Architecture of LSTM . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.8 Architecture of BiLSTM . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.1 Evaluation of CNN (Digit) Optimizers and Batch Sizes . . . . . . . . . . . 35

4.2 Evaluation of CNN (Letter) Optimizers and Batch Sizes . . . . . . . . . . . 35
4.3 CNN(Digit) Accuracy and Loss Curve For Nadam optimizer with Batch
Size 32 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.4 CNN(Letter) Accuracy and Loss Curve For RMSprop optimizer with Batch
Size 16 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.5 Classification report for Nadam optimizer batch size 32 (Digit Dataset) . . 37
4.6 Classification report for RMSprop optimizer batch size 16 (Letter Dataset) . 38
4.7 Confusion Matrix For Nadam optimizer batch size 32 (Digit Dataset) . . . . 39
4.8 Confusion Matrix For Adam optimizer batch size 32 . . . . . . . . . . . . 40
4.9 Evaluation of LSTM Optimizers and Batch Sizes (Digit) . . . . . . . . . . 43
4.10 Figure 4.13: Evaluation of LSTM Optimizers and Batch Sizes(Letter) . . . 43
4.11 LSTM(Digit) Accuracy and Loss Curve For Adam optimizer with Batch
Size 16. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.12 LSTM (Letter) Accuracy and Loss Curve For Nadam with Batch Size 32. . 45
4.13 Classification report for Adam optimizer batch size 16 (Digit) . . . . . . . 46
4.14 Classification report for Nadam optimizer batch size 32 (Letter) . . . . . . 46
4.15 Confusion Matrix For Adam optimizer batch size 16 (Digit) . . . . . . . . 47
4.16 Confusion Matrix For Nadam optimizer batch size 32(Letter) . . . . . . . . 47
4.17 Evaluation of Bi-LSTM Optimizers and Batch Sizes(Digit) . . . . . . . . . 50
4.18 Evaluation of Bi- LSTM Optimizers and Batch Sizes(Letter) . . . . . . . . 50
4.19 Bi-LSTM Accuracy and Loss Curve For Nadam with Batch Size 16 (Digit) 52
4.20 Bi-LSTM Accuracy and Loss Curve For Nadam with Batch Size 16 (Letter) 52
4.21 Classification report for Nadam batch size 16 (Digit) . . . . . . . . . . . . 52
4.22 Classification report for Nadam batch size 16 (Letter) . . . . . . . . . . . . 53
4.23 Confusion Matrix For Nadam optimizer batch size 16 (Digit) . . . . . . . . 54
4.24 Confusion Matrix For Nadam optimizer batch size 16 (Letter) . . . . . . . 55
viii
4.25 Evaluation of VGG16 Optimizers and Batch Sizes (Digit) . . . . . . . . . 59
4.26 Evaluation of VGG16 Optimizers and Batch Sizes (Letter) . . . . . . . . . 59
4.27 VGG16(Digit) Accuracy and Loss Curve For Adam optimizer with Batch
Size 16 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.28 VGG16(Letter) Accuracy and Loss Curve For Adam optimizer with Batch
Size 16 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.29 Classification report for Adam optimizer batch size 16(Digit) . . . . . . . . 61
4.30 Classification report for Adam optimizer batch size 16(Letter) . . . . . . . 62
4.32 Confusion Matrix For Adam optimizer batch size 16 (Letter) . . . . . . . . 63
4.33 Visualization of VGG19(Letter) Optimizers Across Different Batch Sizes . 66
4.34 Visualization of VGG19(Letter) Optimizers Across Different Batch Sizes . 67
4.35 VGG19(Digit) Accuracy and Loss Curve For Adam optimizer with Batch
Size 16 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.36 VGG19(Letter) Accuracy and Loss Curve For Adam optimizer with Batch
Size 16 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.37 Classification report of VGG19(Digit) for Adam optimizer batch size 19 . . 68
4.38 Classification report of VGG19(Letter) for Adam optimizer batch size 16 . 69
4.40 Confusion Matrix For Adam optimizer batch size 1 (Letter) . . . . . . . . . 70
ix
LIST OF TABLES
4.1 CNN Performance Table (Digit) . . . . . . . . . . . . . . . . . . . . . . . 33

4.2 CNN Performance Table (Letter) . . . . . . . . . . . . . . . . . . . . . . . 34
4.3 LSTM Model Results (Digit) . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.4 LSTM Model Results (Letter) . . . . . . . . . . . . . . . . . . . . . . . . 42
4.5 BiLSTM(Digit) Model Results . . . . . . . . . . . . . . . . . . . . . . . . 48
4.6 : BiLSTM(Letter) Model Results . . . . . . . . . . . . . . . . . . . . . . . 49
4.7 VGG16 Model Results (Digit) . . . . . . . . . . . . . . . . . . . . . . . . 56
4.8 VGG16 Model Results (Letter) . . . . . . . . . . . . . . . . . . . . . . . . 57
4.9 VGG19 Model Results (Digit) . . . . . . . . . . . . . . . . . . . . . . . . 64
4.10 VGG19 Model Results (letter) . . . . . . . . . . . . . . . . . . . . . . . . 65
x
CHAPTER 1
INTRODUCTION
Sign language serves as a vital bridge between deaf, hard of hearing, and speech-disabled
individuals and the broader world, functioning as an essential communication tool that
facilitates interaction. The term ”deaf community” refers to individuals who are deaf,
hard of hearing, or speech-disabled, along with their families and allies, who often share
a unique culture, language, and set of experiences related to communication challenges.
As awareness of this community grows, the demand for sign language is steadily in-
creasing. In Bangladesh, it is estimated that nearly 13 million people experience varying
degrees of hearing loss, with approximately 3 million suffering from severe to profound
hearing loss, significantly impacting their daily lives and interactions. The challenges
faced by deaf and speech-disabled individuals underscore the importance of sign lan-
guage as a crucial communication aid. Due to the barriers created by hearing loss or
speech impairments, many people in these communities encounter difficulties in ex-
pressing themselves and understanding others, leading to feelings of isolation. In re-
sponse to these challenges, researchers are actively exploring innovative methods and
techniques to develop machines capable of interpreting sign language more efficiently.
This ongoing research is vital for bridging the communication gap between deaf, speech-
disabled, and hearing individuals. Various approaches, including machine learning,
deep learning, and hybrid models, are being employed to enhance the effectiveness of
sign language recognition systems. From a review by [Khatun et al., 2021], about 60%
of the reviewed papers use both single and double hand signs in BdSL, with CNN be-
ing the most popular technique. [Yasir et al., 2017]present an approach to Bangla Sign
Language recognition using CNN. A hand-tracking device known as the Leap Motion
Controller was used to detect hand gestures. CNN is also used by [Islalm et al., 2019]
for detecting Bangla Sign Language. A large dataset was developed, comprising both
alphabets and numerals, with 7,052 sample images of 10 numerals and 23,864 images
of 35 basic alphabet characters. Their results showed 99.83% accuracy on numerals,
nearly 100% accuracy on alphabets, and 99.80% accuracy overall. [Hasan et al., 2016]
employed a machine learning approach to give voice to speech-disabled individuals.
Sign language was identified through hand gestures and converted into text, which was
then transformed into voice. HOG was used for feature extraction and SVM for classifi-
cation, with 16 BdSL sign expressions being recogniz[Podder et al., 2022] addressed the
population suffering from deafness or hearing disabilities in Bangladesh. To promote
the study of Bangla Sign Language, they prepared two robust datasets for BdSL alpha-
bets and numerals, which were classified using deep learning approaches. Models using
1
images with and without backgrounds were compared, with CNN performing better on
images with backgrounds. ResNet-18 achieved the highest accuracy at 99.99%.This the-
sis focuses on developing an effective approach for identifying Bangla Sign Language.
The dataset used contains 10 digit classes and 36 character classes. To obtain optimized
results, the input images undergo preprocessing, which involves background elimina-
tion and thresholding algorithms. These steps help clarify the input images by removing
background noise. Subsequently, deep learning and hybrid machine learning models are
employed for recognition tasks.The performance of these models is evaluated using four
distinct optimization algorithms: Adam, Nadam, Adamax, and RMSprop. Additionally,
the impact of different batch sizes—16, 32, 64, and 128—on the training process and
model performance is examined. This comparative study aims to identify the most ef-
fective combinations of optimizers and batch sizes for the task. The findings from this
thesis provide valuable insights into the recognition processes of Bangla sign language
digits and characters, contributing to the development of more accurate and efficient
methods for sign language identification.
1.1 Overview
Bangla Sign Language (BdSL) detection presents several challenges due to variations
in hand gestures, skin color, and complex or similar hand shapes. Additionally, incon-
sistent lighting conditions and cluttered backgrounds further complicate the detection
process. To address these challenges, various preprocessing techniques are employed
to enhance image quality. These include background elimination methods such as re-
gion of interest (ROI) extraction, thresholding techniques (Otsu’s thresholding, adap-
tive thresholding), OpenPose for keypoint detection, HSV color space adjustments, and
skin color detection algorithms. These methods help isolate hand signs from the back-
ground and reduce interference from skin tones or environmental noise, thereby improv-
ing model performance in recognizing distinct hand gestures. In this study, I focused on
BdSL detection using the Shonket dataset, which consists of two distinct subsets: one
for digits and one for letters. The digit dataset contains 10 classes with 150 images per
class, while the letter dataset consists of 36 classes with 120 images per class. A range of
deep learning models were employed for this task, including VGG16, VGG19, LSTM,
InceptionV3, Xception, and hybrid models such as CNN+LSTM. Preprocessing was
essential to address the problem of detecting hands with varying skin tones and distin-
guishing between gestures that are visually similar. The process began with background
removal using rembg, followed by thresholding methods like Otsu’s thresholding to bi-
narize the images, making the hand signs more distinguishable. Skin color detection and
HSV color space adjustments were also applied to enhance the clarity of the hand signs
and minimize distractions from similar backgrounds or complex hand gestures. Among
the models tested, the CNN+LSTM hybrid model achieved the highest performance on
2
the digit dataset, with an accuracy of 98.6%. On the letter dataset, the highest accuracy
of 92.6% was obtained using the Xception model. However, the increased complexity
of the letter dataset, with more classes and visually similar gestures, resulted in slightly
lower accuracy. These findings suggest areas for future improvement, such as diver-
sifying the dataset to include a wider range of skin tones, refining model architectures
to better handle complex gestures, or optimizing preprocessing techniques. This study
contributes to the advancement of BdSL recognition systems, with the goal of improv-
ing communication accessibility for the hearing-impaired community through accurate
sign language detection.
1.2 Sign language:
Sign language is a visual-manual form of communication primarily used by the deaf

and hard-of-hearing communities. It involves the use of hand gestures, facial expres-
sions, and body movements to convey meaning, rather than relying on spoken words
[Sultan et al., 2022]. Each gesture or sign corresponds to a specific word or concept,
and, similar to spoken languages, different regions have their own unique sign lan-
guages. For example, American Sign Language (ASL) is distinct from British Sign
Language (BSL) or Bangla Sign Language (BdSL), even though they all serve the same
purpose of communication. Sign language provides an essential communication bridge
for individuals who are unable to hear or speak. It enables them to interact effectively
with others, whether communicating simple day-to-day needs or engaging in complex
conversations. For many deaf individuals, sign language is their first language, allow-
ing them to connect with their community, access education, and participate in various
social settings without barriers. Additionally, sign language plays an important role in
early childhood development for children born deaf or hard of hearing. It facilitates
language acquisition in crucial developmental years, enabling children to develop cog-
nitive skills and express themselves freely. Beyond the deaf community, sign language
also assists individuals with speech disorders, autism, and other conditions that make
spoken communication challenging. The need for sign language arises from the ne-
cessity of ensuring inclusive, accessible, and equitable communication for everyone,
regardless of their hearing abilities. Without it, millions would face severe communi-
cation barriers, leading to social isolation, limited access to education, and difficulties
accessing essential services. Sign language is more than just a tool for communica-
tion; it is a fundamental right that supports the dignity and independence of the deaf and
hard-of-hearing communities. In an increasingly interconnected world, the recognition
and promotion of sign language foster more inclusive societies. Research is crucial in
developing technologies such as sign language detection systems and translation ser-
vices that bridge the gap between hearing and non-hearing communities. By leveraging
advancements in computer vision, machine learning, and natural language processing,
3
researchers are creating tools that can interpret and translate sign language in real time.
These technologies not only enhance communication for the deaf and hard of hearing
but also support educators and family members in interacting effectively with the deaf
community. By prioritizing research in sign language technologies, we can promote
accessibility and understanding, ensuring that everyone has a voice and can be under-
stood. Bangla Sign Language (BdSL) serves as the primary mode of communication
for the deaf and hard-of-hearing community in Bangladesh. Like other sign languages
around the world, BdSL involves a combination of hand gestures, facial expressions,
and body movements to convey meaning. It is a fully developed language with its own
grammatical rules, vocabulary, and syntax, distinct from the spoken Bengali language.
BdSL incorporates signs for the Bengali script, including 10 digit characters and 36 let-
ters, representing the vowels and consonants of Bengali. This linguistic structure allows
individuals who are deaf or hard of hearing to communicate effectively with each other
and those who understand the language, bridging communication gaps in a predomi-
nantly spoken language society.
1.3 Motivation
The motivations of this study are:
• Overcoming Communication Barriers: Millions of individuals in the deaf and

hard-of-hearing populations rely on sign language for communication. Develop-
ing reliable BdSL detection technologies can help break down communication
barriers and enable engagement between the hearing and non-hearing groups in
Bangladesh.
• Inclusion and Accessibility: By enhancing accessibility in public services, ed-

ucation, and daily life, sign language recognition technologies can help the deaf
community become more socially integrated. The goal of this thesis is to help
create a society that is more inclusive.
• Preserving Cultural and Linguistic Heritage: BdSL is an integral aspect of the

cultural identity of the deaf population in Bangladesh. Through the development
of technology able to identify and understand BdSL, this study helps to conserve
and advance the language for next generations.
• Limited Resources: BdSL has less technological resources and less research
than other sign languages like ASL, which have a plethora of instruments at their
disposal. By contributing to the creation of specific instruments and models for
Bangla Sign Language recognition, this thesis aims to close that gap.
4
• Technological Developments: Sign language recognition is now more practical
and accurate because to developments in deep learning and computer vision. By
utilizing these technologies, this research hopes to enhance current techniques by
developing reliable systems for BdSL gesture recognition.
• Enhancing BdSL identification Accuracy: Complicated hand movements,

background noise, and changes in skin tone can all cause problems for current
BdSL identification systems. Through the use of sophisticated machine learning
models and efficient preprocessing techniques, this thesis seeks to increase the
accuracy of BdSL identification.
• Effect on Education and Learning: By developing resources for BdSL instruc-

tion, this project will facilitate the learning and communication of BdSL by mem-
bers of the hearing and deaf populations. Additionally, it can support systems for
translating text into sign language that are used in public spaces, educational in-
stitutions, and online.
1.4 Application of Bangla sign language identification and classification methods
The Bangla sign language identification and classification methods devised in this thesis
have numerous practical applications across various disciplines. Here are some impor-
tant areas where these techniques can be effectively applied:
• Communication Tools: Real-time sign language translation is made possible by

the created BdSL recognition technologies, which may be included into online and
mobile applications to help people who are deaf and hard of hearing communicate.
• Educational Platforms: : BdSL recognition systems may be used in hospitals,

government offices, and customer service centers, enabling deaf persons to inter-
act without an interpreter.
• Public Services: BdSL recognition systems may be used in hospitals, govern-

ment offices, and customer service centers, enabling deaf persons to interact with-
out an interpreter.
• Media and Entertainment: BdSL detection may be integrated into digital plat-
forms to provide real-time sign language interpretation for deaf viewers.
• Assistive Devices: The technology may be used to create wearable devices that
convert BdSL into text or voice output, promoting independent communication
among the deaf community.
5
• Sign Language Learning Tools: Detection models can enable learning platforms
and applications for BdSL learners, making the language more accessible to hear-
ing people and increasing its adoption.
1.5 Key Contribution
• In this study, a comprehensive preprocessing pipeline was designed to handle

the specific issues provided by ancient, deteriorated Bangla printed manuscripts.
Skew correction techniques were applied to remedy any skew present in the docu-
ments, ensuring that text lines are horizontally aligned for accurate segmentation
and identification. Denoising methods were performed to eliminate undesirable
noise and artifacts from the document pictures, boosting the clarity of the text.
Additionally, morphological techniques such as dilation and erosion were used to
improve the text’s structure and remove unnecessary pieces. Various thresholding
techniques were also applied to binarize the photos efficiently, isolating the text
from the backdrop and improving contrast for greater readability.
• A meticulous endeavor was made to compile a comprehensive character dataset

comprising characters commonly encountered in Bangla printed documents. This
dataset was curated to incorporate characters from diverse categories, encompass-
ing the extensive range of characters present in Bangla script. Each character in
the dataset was meticulously labeled and organized, creating a structured dataset
appropriate for machine learning model assessment and training
• The study involved the application of a varied set of models for character clas-
sification, ranging from deep learning to traditional machine learning methods.
Convolutional neural networks (CNNs) were employed for their ability to learn
complex features from images, while long-short-term memory (LSTM) networks
and bidirectional LSTM (BiLSTM) were utilized for sequence modeling tasks.
Pretrained models such as VGG16, VGG19, ResNet50, Xception, and MobileNet
were fine-tuned for character recognition tasks. Additionally, hybrid models mix-
ing CNNs with other designs, as well as standard machine learning models like
Support Vector Machines (SVMs), were studied to evaluate their effectiveness in
handling degraded document pictures.
• Extensive experimentation was conducted to evaluate the efficacy of various op-

timization algorithms, including Adam, Nadam, Adagrad, SGD, and RMSprop.
The study also investigated the influence of varying group sizes (16, 32, 64, and
128) on the training procedure and the resulting model performance. This analy-
sis provided valuable insights into the most suitable optimization algorithms and
batch sizes for effectively training models on degraded Bangla document datasets.
6
• The work used systematic comparison and analysis to identify the optimum com-
binations of models, optimizers, and batch sizes for character classification tasks
on degraded Bangla texts. This analysis provided useful direction for future re-
search and development activities, assisting in identifying the most successful
ways for recovering and categorizing characters from old printed texts.
7
CHAPTER 2
LITERATURE REVIEW
The recognition and detection of Bangla Sign Language (BdSL) has received a lot of
interest as there is a rising demand for accessible communication tools for Bangladesh’s
deaf and hard-of-hearing communities. The creation of efficient BdSL identification
systems is essential for promoting social inclusion and communication. This area of
study is essential for improving accessibility in public services, healthcare, and educa-
tion, as well as for removing obstacles to communication.
To increase the clarity and quality of BdSL images—which are essential for precise
model training and recognition—a broad range of preprocessing methods have been
used. Preprocessing techniques that are often used include Otsu’s thresholding, Sauvola’s
adaptive thresholding, Niblack’s thresholding, and background eradication utilizing tools
like rembg. To further separate hand motions from noisy backdrops, morphological
techniques including dilation and erosion, area of interest (ROI) extraction, and skin
color recognition algorithms are employed. Techniques like Gaussian filtering and me-
dian filtering are also utilized to remove noise and boost image quality before the iden-
tification phase.
Many deep learning models have been applied to the field of BdSL detection, ranging
from straightforward designs to intricate, pretrained networks. Because Convolutional
Neural Networks (CNNs) are so good at capturing spatial characteristics in pictures, they
continue to be one of the most often used techniques. Advanced architectures with great
accuracy in sign language identification applications, including as VGG16, VGG19, In-
ceptionV3, ResNet50, and Xception, have been widely employed. Furthermore, hybrid
models that integrate temporal and spatial information—both necessary for continuous
hand gesture recognition—like CNN paired with LSTM or CNN-XGBoost have demon-
strated potential.
A range of optimization methods, including Adam, Nadam, RMSprop, Adagrad, and

SGD, have also been tested by researchers. These techniques have been essential in
helping to increase the convergence and accuracy of models during training, along with
different batch sizes. Performance has been further enhanced by the use of transfer learn-
ing, since huge datasets like as ImageNet have allowed for the fine-tuning of pretrained
models for BdSL detection tasks.
Other machine learning methods, such Support Vector Machines (SVM), Random Forests
8
(RF), and K-Nearest Neighbors (KNN), have also been used in addition to CNN-based
models, especially in previous research. While deep learning models have demonstrated
better performance recently, these models have laid the groundwork for BdSL detection
when paired with features collected using techniques such as Histogram of Oriented
Gradients (HOG).
The best preprocessing methods and model architectures for BdSL identification will
be critically examined in this literature review, along with the difficulties caused by
problems with gesture similarity, skin tones, and uneven illumination. The review will
also look at how these issues have been resolved and detection accuracy increased by
developments in hybrid and deep learning models.
This study seeks to fill in knowledge gaps and suggest future avenues for research by
analyzing the benefits and drawbacks of present approaches. In the end, this chapter
will advance knowledge on how to create more reliable and accurate BdSL recogni-
tion systems, which will significantly enhance the community of Bangladesh’s hearing-
impaired members’ accessibility to communication.
2.1 Related Work
• Shongket, a comprehensive dataset for Bangla sign language identification, was

generated by [Hasan et al., 2021] Initially, a huge number of hand sign gesture
photographs were taken, and low-quality images were filtered out. The data were
then labeled into 10 classes for digits and 36 classes for letters. For the digit
classes, 150 photographs per class were retained, while for the letter classes, 120
images per class were kept, resulting in a total of 5,820 images. To increase model
performance, the image size was lowered to 128x128 pixels. Preprocessing oc-
curred after the dataset had been prepared. After determining that color had no
effect on sign gestures, the photos were converted to grayscale and background
noise was removed using binary thresholding. Once the data was separated, CNN,
KNN, SVM, Random Forest, and Decision Tree classifiers were used. The KNN
model obtained 95The CNN model had four convolutional layers and was trained
over 50 epochs with the ADAM optimizer.
• Comprising 1800 grayscale pictures of 36 Bangla letters, [Islam et al., 2018] cre-
ated the first entirely open-access dataset, Ishara-Lipi, for isolated characters in
Bangla Sign Language (BdSL). Otsu thresholding and grayscale conversion helped
the photos to be preprocessed. Following that, a 9-layer Convolutional Neural
Network (CNN) tuned using the ADAM optimizer was trained using the dataset
and attained 92.65% accuracy on the training set and 94.74% on the validation set.
The aim of the dataset was especially to close the resource gap for BdSL identifi-
9
cation based on machine learning. The rather limited size of the dataset was one
of the constraints; while future plans call for growing it to improve model perfor-
mance. Ishara-Lipi will be a great tool for sign language study and development,
the report notes (Ishara-Lipi).
• [Das et al., 2023] proposed a composite approach for the recognition of Bangla
Sign Language (BSL) that employs a Random Forest (RF) classifier and a Convo-
lutional Neural Network (CNN). The transfer learning process was combined with
a background elimination algorithm that employs morphological operations and
an adaptive Gaussian thresholding technique to obtain optimal results. The CNN
models (VGG16, VGG19, InceptionV3, Xception, ResNet50) were pre-trained
on ImageNet. Two public datasets, Ishara-Bochon (digits) and Ishara-Lipi (char-
acters), were used for training. The system performed well, obtaining an accuracy
of 91.67% for character recognition and 97.33% for digit recognition. Future en-
hancements could focus on data augmentation and real-time optimization.
• [Shurid et al., 2020] suggested a novel architecture, the Concatenated BdSL Net-
work, intended for recognizing Bangla Sign Language (BdSL) by combining a
CNN for visual feature extraction and OpenPose for hand keypoint estimation.
This method handles the challenges posed by the subtle differences between sim-
ilar BdSL gestures. The model got a test accuracy of 91.51%, surpassing other
CNN-based models. However, misclassifications happened in symbols with nearly
identical hand gestures. The authors mentioned future improvements could in-
volve training a custom pose estimation model specific to BdSL. Experiments
were performed using Google Colaboratory with limited computational resources,
and further development may focus on real-time recognition and larger datasets.
• [Abedin et al., 2023] proposed a novel model called the ”Concatenated BdSL Net-
work” for Bangla Sign Language (BdSL) recognition, combining a CNN and a
pose estimation network. The CNN, composed of 10 convolutional layers, was
responsible for extracting visual features from the images, while OpenPose es-
timated hand keypoints, addressing the challenge of differentiating subtle hand
gestures. The model used two separate inputs: for the CNN, images were con-
verted to grayscale, while for pose estimation, the images were converted to BGR
format. The features extracted from both the CNN and pose estimation network,
flattened into layers, were combined by passing them through two fully connected
layers with ReLU activation. These combined features were then passed through
three additional fully connected layers to produce the final output. The model
achieved a test accuracy of 91.51%, outperforming previous methods. However,
certain symbols, such as [�-�] and [�-�], were frequently misclassified. The
10
authors suggested that training a custom pose estimation network specifically for
BdSL, rather than relying on a pre-trained one, would likely yield better results
• [Podder et al., 2022] developed a deep learning-based system for real-time Bangla
Sign Language (BdSL) recognition, focused on alphabets and numerals. Utilizing
the largest dataset created for BdSL, they trained pre-trained CNN models such
as ResNet18 and MobileNet_V2, getting high accuracy in both approaches: with
background and after background removal. ResNet18 provided the best perfor-
mance, with 99.99% accuracy, precision, and sensitivity. The study addressed
challenges related to skin tone, hand orientation, and background, finding that
models trained with backgrounds performed slightly better than those without.
This study made significant progress in BdSL recognition and provided publicly
available datasets to support future studies.
• An integrated CNN-LSTM modelwas proposed by [Basnin et al., 2021] for rec-

ognizing Bangla Sign Language (BSL), getting 90% training accuracy and 88.5%
testing accuracy. The collection consists of 13,400 images across 36 classes, with
each image processed using background subtraction, grayscale conversion, mor-
phological erosion, median filtering, and resizing. The model was trained using
the SGD optimizer with a learning rate of 0.001. Compared to CNN-based archi-
tectures such as VGG16, VGG9, and MobileNet, the CNN-LSTM model showed
better performance due to the integration of LSTM, which optimized temporal
feature learning. This model also overcame problems of overfitting seen in other
architectures. However, it was limited to identifying individual signs. Future im-
provements could focus on recognizing sign sequences and expanding the dataset
to include more different situations
• [Tasmere and Ahmed, 2020] proposed a hand gesture recognition framework us-
ing a deep convolutional neural network (CNN) for Bangla Sign Language (BSL),
getting a high accuracy of 99.22%. Their system relied on a dataset of 3,219 im-
ages collected from six different people, representing 37 Bangla sign characters.
The authors applied a combination of HSV and YCbCr color spaces for hand de-
tection, followed by a CNN architecture with four convolutional layers and a 40%
dropout layer to improve performance. Although the model achieved strong re-
sults, it was limited to recognizing static images, suggesting potential for future
improvements by incorporating dynamic gestures and real-time recognition.
• [Shams et al., 2024] proposed a multi-modal ensemble strategy for recognizing

Bangla Sign Language, concentrating on spatial, skeletal, and edge information
using lightweight CNN models. Their approach trained three different CNN mod-
els on these modalities, using unique ensemble strategies to improve accuracy.
11
The model obtained 95.13% accuracy on the testing set, beating individual CNN
models and other cutting-edge architectures like ResNet50. While the technique
had potential, its dependence on MediaPipe for skeletal feature extraction and em-
phasis on static indicators posed limits. The authors advised that future research
should investigate dynamic indications and build more robust real-time solutions.
• [Rafi et al., 2019] created a Convolutional Neural Network (CNN)-based system

that recognizes the Bengali Sign Language (BdSL) alphabet. Their study ad-
dressed the Deaf and Dumb community’s interpretation problem by developing
a method for automatically recognizing 38 BdSL alphabets using the VGG19
model. The system was trained using a dataset of 12,581 photographs compiled
in partnership with the National Federation of the Deaf. They obtained 89.6%
accuracy by using data augmentation approaches and changing the VGG19 archi-
tecture. Despite the encouraging results, the system had misclassification diffi-
culties with comparable hand gestures and performance decreases in complicated
environments. The authors proposed further enhancements, such as training sec-
ondary networks and expanding to word recognition, in the hopes that their work
will improve communication tools for the Deaf and Dumb community.
• [Hadiuzzaman et al., 2024] built the BAUST Lipi dataset, comprising 18,000 im-
ages representing 36 Bangla Sign Language (BdSL) alphabets. They introduced a
CNN-LSTM hybrid model that employs CNN layers for spatial feature extraction
and LSTM for temporal sequence learning, obtaining a high accuracy of 97.28%
on the testing set. The dataset was acquired from 15 participants under diverse
conditions, enhancing its robustness for machine learning tasks. However, the
model is limited to static signs, implying future work could investigate dynamic
gestures and real-time recognition.
12
CHAPTER 3
METHODOLOGY
Bangla Sign Language helps those with hearing disabilities bridge the communication
gap, allowing them to engage more successfully in society. As the need for sign lan-
guage recognition systems rises, the task of creating efficient and accurate models for
detecting Bangla hand signals becomes more pressing. This study addresses these prob-
lems by presenting a thorough approach for detecting and classifying Bangla sign lan-
guage, with an emphasis on the Shonket dataset, which comprises digit and letter hand
sign pictures. The approach begins with important preprocessing processes that im-
prove the clarity and uniformity of the hand sign pictures. Using a backdrop removal
method, rembg, all unnecessary features are removed, enabling the hand signals to stand
out. The photos are then transformed to grayscale to minimize computing complexity,
followed by binarization to provide crisp black-and-white images with uniformity and
increased contrast for further processing. After preprocessing, the photos are sent into
a succession of machine learning, deep learning, and hybrid models for categorization.
These models are charged with classifying the hand signals into two main sets: 10-digit
and 36-letter classes. To achieve high classification accuracy, a range of cutting-edge
techniques are used, including hybrid models that integrate convolutional neural net-
works (CNNs) with long short-term memory networks (LSTMs) and sophisticated deep
learning architectures such as Xception. To improve the performance of these models,
four different optimizers—Adam, Nadam, RMSprop, and Adamax—are evaluated with
different batch sizes (16, 32, 64, and 128). This study aims to determine the most suc-
cessful combinations of optimizers and batch sizes, with an emphasis on model correct-
ness and convergence speed. This study gives useful insights into the best optimization
approaches for detecting Bangla Sign Language by carefully comparing the outcomes
of each optimizer with varied batch sizes. The findings seek to improve the accuracy
and efficiency of sign language recognition systems, hence facilitating the creation of
more accessible and inclusive communication tools.
13
Figure 3.1: Steps in the process of identifying Bangla sing language and classification
Figure 3.1 shows the steps of our proposed model that involve Bangla sign language
identification and classification.
3.1 Dataset preprocessing
Preprocessing refers to the set of operations performed on unprocessed data before feed-
ing it into a machine learning or deep learning model. This phase is crucial in trans-
forming the data into a more suitable format, enhancing its quality, and making it more
informative for the learning algorithms. The objective of preprocessing is to ensure
that the data is clean, standardized, and in a form that the model can efficiently work
with, thereby increasing its accuracy and performance. The efficacy and accuracy of the
training process are significantly improved by preprocessing, which is a critical step in
the development of any machine learning or deep learning model. Preprocessing is the
foundation for the dataset to be optimized for performance in this study on Bangla Sign
Language identification, which includes two categories—digits and letters. Raw images
are converted to a more suitable format for model training through the use of grayscale
conversion, binarization, and background removal in the preprocessing pipeline. Fig-
ure 3.3 illustrates the processes involved in the preprocessing stage. The primary ob-
jective of the initial stage of preprocessing is to remove any undesirable components
from the images in order to mitigate noise or any inconsistencies that could impede
the recognition process. The gesture’s shape is accurately defined as the background
is eliminated, thereby concentrating solely on the hand sign. Subsequently, the images
are converted to grayscale in order to reduce the computational burden and simplify the
data. Lastly, Otsu’s binarization technique is implemented to transform these images
into crisp black-and-white representations, thereby emphasizing the essential attributes
required for precise sign classification.
14
3.1.1 Background removal using rembg
Background removal is an important part of preprocessing since it removes any non-

essential features from a picture while keeping the hand sign. This phase substantially
reduces noise and enhances the precision of the dataset, ensuring that the hand gesture
is the primary focus of the model. During the early stage of preprocessing, a problem
developed while directly applying binarization to the raw photos. Without correctly
eliminating the background, the binarization process struggled to distinguish between
the hand sign and its surrounds, resulting in deformed forms of the hand movements.
These distortions harmed the model’s performance, resulting in reduced accuracy in
sign identification. To remedy this issue, the background removal program, rembg, was
used. This method quickly finds and isolates the principal object—in this example,
the hand gesture—while removing any background features that may cause confusion
during categorization. The usage of rembg guarantees that the photos are cleaned up
before further modification. Rembg is a popular backdrop removal application that uses
deep learning models to separate items from their backgrounds. It operates by assessing
the image’s visual elements to determine the separation between the foreground object
and its surrounds. The algorithm then isolates the item, leaving a translucent or solid-
colored backdrop that simplifies the image and prepares it for future processing. This
phase guarantees that when the pictures are binarized, they retain the geometry of the
hand motion without interference from undesirable aspects.
(a) Before Background elimmination (b) After Background elimmination
Figure 3.2: Sample image of before and after background elimination.
Figure 3.2 shows the difference between before and after backgoud elimination.
3.1.2 Gray Scale conversion
After removing background the the images were converted into Grayscale images. Grayscale
conversion is the process of changing a color image into a single-channel image, where
15
each pixel represents the intensity or brightness of the corresponding pixel in the original
image. In a grayscale picture, shades of gray spanning from black to white are utilized
to depict varying degrees of intensity, with darker shades suggesting lower intensity and
lighter colors indicating more intensity. Grayscale pictures feature only one intensity
channel, simplifying processing compared to color images, which generally comprise
three channels (red, green, and blue). This minimizes computational complexity and
memory constraints, making grayscale pic tures easier to deal with. Aside from that,
grayscale pictures provide us with a clearer depiction of visual content.
3.1.3 Binarization
After grayscale conversion, Otsu’s binarization technique is used to further enhance

the pictures. Otsu’s approach is an adaptive thresholding technique that automatically
selects the best threshold value to separate the foreground (hand motion) from the back-
ground, yielding a black-and-white picture. The capacity of Otsu’s binarization to dy-
namically alter the threshold based on the intensity distribution of the grayscale picture is
its primary benefit, making it extremely successful for a wide range of lighting situations
and image characteristics. This binarization procedure increases the contrast between
the hand sign and its background, making the gesture’s contour clearer and simpler for
the model to perceive. By transforming the picture to binary format, the procedure guar-
antees that only the most important characteristics are retained, considerably improving
classification model performance during training.
Figure 3.3: Steps involves in preprocessing
3.2 Dataset
The Shongket dataset is a cutting-edge resource for recognizing Bangla Sign Language
(BdSL) using machine learning and computer vision. It is one of the largest datasets
in this field and attempts to close the communication gap by offering a well-organized
and varied set of hand gesture photos to both the general public and those with hearing
and speech impairments. The Shongket dataset, which is specifically focused on Bangla
16
Sign Language, includes a wide range of examples of both alphabetic and numeric hand
motions. It gives researchers and developers working on sign language recognition mod-
els a useful tool by providing a huge number of hand gesture photographs for each class,
allowing advances in system effectiveness and accuracy. This dataset encourages the
advancement of assistive technology, which enhances communication accessibility and
inclusivity.
There are two key sections to the dataset:
Digit Classes: It has 10 classes representing Bangla digits (0-9), with 150 hand motion
photos each class, totaling 1,500 images.
Letter Classes: The collection also comprises 36 classes matching to the Bangla alpha-
bet, with 120 hand motion photos per class, resulting in 4,320 images.
In all, Shongket features 5,820 photos, captured under varied situations and displaying
diverse hand movements. This makes it a crucial resource for furthering research and
applications in Bangla Sign Language detection.
Figure 3.4 shows the full set of classes for Bangla numbers in the Shongket dataset and
Figure 3.4: Visual overview of dataset for digit
visually represents the hand motion corresponding to each Bangla digit within those
classes. The hand movements and their corresponding numerical values are precisely
mapped out, with each class being clearly related with one of the 10 Bangla numerals
(0–9).
17
Figure 3.5: Visual overview of dataset for letter
In a similar vein, Figure 3.5 offers a thorough summary of the Bangla letter classes found
in the Shongket dataset. It shows the hand motion that goes with each of the 36 Bangla
letters. The hand motions connected to each character in the Bangla alphabet are clearly
illustrated in each class, which is matched to a particular letter.
3.3 Optimizer
An optimizer is an algorithm or method that adjusts a model’s parameters, such as

weights and biases, to minimize the error or loss function during training [Sun et al., 2019].
The optimizer improves the model’s accuracy by iteratively changing the parameters us-
ing the gradients computed from the loss function. It works by determining the optimal
values for these parameters in order to eliminate mistakes or accomplish a desired out-
come, hence improving the model’s performance and dependability.
3.3.1 Adam (Adaptive Moment Estimation)
Adam is a well-known optimization algorithm that takes ideas from both RMSProp
and Momentum and mixes them. It figures out the adjusted learning rates for each
parameter, which makes it possible to optimize problems in a lot of different ways.
There are two moving averages that Adam keeps track of. The first is mt , which is
the exponentially decaying average of past gradients. The second is vt , which is the
exponentially decaying average of past squared gradients
Default Learning Rate: (α ): 0.001
Default Decay Rates: (β1 and β2 ): 0.9 and 0.999 respectively
Epsilon: (ε ) Default Value: 1e − 7
18
The rule for updating Adam’s parameters is this:
mt = β1 · mt−1 + (1 − β1 ) · ∇J(θt ) (3.3.1)

vt = β2 · vt−1 + (1 − β2 ) · (∇J(θt ))2 (3.3.2)
mt
m̂t = (3.3.3)
1 − β1t
vt
v̂t = (3.3.4)
1 − β2t
m̂t
θt+1 = θt − α · √ (3.3.5)
v̂t + ε
where:
• θ represents the parameters.
• α is the learning rate.
• mt and vt are the first and second moment estimates respectively.
• β1 and β2 are decay rates for the moment estimates.
• t represents the time step.
• ε is a small constant to avoid division by zero.
3.3.2 Nadam (Nesterov-accelerated Adaptive Moment Estimation)

NAdam stands for ”Nesterov-accelerated Adaptive Moment Estimation.” It is an addi-
tion to Adam that uses Nesterov momentum. It fixes the momentum during the gradient
descent step, which speeds up convergence and might lead to better generalization.
Default Decay Rates: (β1 and β2 ): 0.9 and 0.999 respectively
Epsilon: (ε ) Default Value: 1 × 10−7
Nesterov Momentum: (µ ) Default Value: 0.9
The updating rule for Nadam is the same as Adam’s, but it adds an extra momentum
term:
mt = β1 · mt−1 + (1 − β1 ) · ∇J(θt ) (3.3.6)

vt = β2 · vt−1 + (1 − β2 ) · (∇J(θt ))2 (3.3.7)
mt
m̂t = (3.3.8)
1 − β1t
vt
v̂t = (3.3.9)
1 − β2t
mt ∇J(θt )
θt+1 = θt − α · √ −α ·µ · √ (3.3.10)
v̂t + ε v̂t + ε
19
where µ is the Nesterov momentum coefficient.
3.3.3 RMSProp (Root Mean Square Propagation):
RMSProp is a rate optimization approach for adaptive learning. The learning rates for
every parameter are adjusted according to the average of the gradients’ most recent mag-
nitudes. Accelerating convergence is the result of this overcoming the issue of declining
learning rates. Default Learning Rate: (α ): 0.001
Default Decay Rate: (γ ): 0.9
Epsilon; (ε )Default Value: 1 × 10−7
Gt = γ · Gt−1 + (1 − γ ) · (∇J(θt ))2 (3.3.11)

∇J(θt )
θt+1 = θt − α · √ (3.3.12)
Gt + ε
where γ is a decay rate.
3.3.4 Adamax :
Adamax is a variant of the Adam optimizer based on the infinity norm. It is more stable
and effective when the gradients are large, as it uses the maximum of the past squared
gradients rather than their sum. This makes it more robust in certain cases, especially
when dealing with large gradients.
Beta 1: (β1 ) Default Value: 0.9
Beta 2: (β2 ) Default Value: 0.999
Epsilon: (ε ) Default Value: 1 × 10−8 The update rule for Adamax is:
mt = β1 · mt−1 + (1 − β1 ) · ∇J(θt ) (3.3.13)

ut = max(β2 · ut−1 , |∇J(θt )|) (3.3.14)
α
θt+1 = θt − · mt (3.3.15)
ut + ε
where ε is a tiny constant to prevent division by zero and Gt is the sum of the squares
of the gradients up to time step t.
20
3.4 Model description:
A throughout description of employed models are given below.
3.4.1 Convolutional Neural Network (CNN)
A Convolutional Neural Network (CNN) is a deep learning model that is particularly

effective for visual data, such as images. It acquires the ability to identify a variety of
features from the input image through a sequence of layers. The network commences
by obtaining raw pixel values in the input layer, which are subsequently passed through
convolutional layers that apply filters to identify significant patterns, such as edges or
textures. The model is able to capture more complex patterns by passing the output from
these layers through activation functions such as ReLU, which introduce non-linearity.
Pooling layers, including max pooling, are employed to decrease the size of the feature
maps, thereby increasing the efficiency and speed of the model. The ultimate classifica-
tion is achieved by fully connected layers at the network’s conclusion, which combine
the learned features. The final layer employs softmax or sigmoid functions to generate
probabilities for each class. CNNs are highly effective at tasks such as image classifi-
cation, object detection, and medical imaging, thanks to their ability to automatically
extract key features with minimal preprocessing. The CNN model was used to classify
images from a Bangla Sign Language digit and letter dataset. Initially, the dataset was
preprocessed by, resizing it to 28x28 pixels, and normalizing the pixel values to fall
within the range of 0 to 1. The images are reshaped to suit the CNN’s input format, and
the data is split into training and testing sets. Using the CNN model, the convolutional
layers that are followed by max-pooling layers are utilizeed to downsample the feature
maps and reduce computational complexity. Dropout layers are implemented after each
pooling layer to prevent overfitting by randomly deactivating some neurons during train-
ing. The network is composed of three convolutional blocks with escalating numbers of
filters (32, 64, and 128) to capture more complex features as the layers deepen. After the
convolutional layers, the data is flattened and passed through a completely connected
dense layer with 256 neurons. Another dropout layer is applied before the final softmax
layer, which outputs probabilities for each of the 10 or 36 classes. The model is trained
with checkpoints that save the best-performing version based on validation accuracy.
21
Figure 3.6: Architecture of CNN
The figure 3.6 illusrate the architecture of cnn for better understanding.
3.4.2 Long Short-Term Memory
The Long Short-Term Memory (LSTM) model is a specialized type of Recurrent Neural
Network (RNN) architecture, designed to capture long-term dependencies in sequential
data. Unlike conventional RNNs, LSTMs are capable of remembering information for
extended periods and avoiding issues such as vanishing gradients, making them ideal
for sequence prediction tasks that involve temporal relationships. This includes ap-
plications like time series analysis, natural language processing, and image sequence
22
classification.
In this case, a LSTM model is used for classifying Bangla Sign Language digit se-
quences. The model processes images by treating each image as a sequence of rows,
where each row (28 pixels broad) is a time step, and each pixel within the row is a fea-
ture. The input images are first preprocessed by converting them to grayscale, resizing
them to 28x28 pixels (similar to the MNIST dataset), and normalizing the pixel values
to lie between 0 and 1. The dataset is divided into training and testing sets, and the labels
are one-hot encoded for classification into 10 or 36 output classes.
The architecture of the LSTM model is as follows:
Initial Layer: The input layer anticipates input sequences with a shape of (28, 28), rep-
resenting 28 time steps (rows) with 28 features (pixels) at each time step.
LSTM Layer: The first LSTM layer consists of 256 units and returns sequences, mean-
ing it outputs a sequence for each time step, which is passed on to the next LSTM layer.
This layer helps capture the temporal dependencies within the input sequences.
Dropout Layer:A dropout layer with a 0.2 dropout rate is implemented to reduce over-
fitting. Dropout works by arbitrarily turning off a fraction of neurons during training,
encouraging the model to generalize better.
Second LSTM Layer: Another LSTM layer with 256 units follows, but this time it only
returns the final output (not the entire sequence) to provide a summary of the learned
temporal features.
Final Output Layer: The model concludes with a Dense layer consisting of 10 and
36 units (for 10 and 36 output classes), using the softmax activation function to output
probabilities for each class.
Figure 3.7: Architecture of LSTM
23
The figure 3.7 illustrate the architecture of LSTM . The architecture details all the layers
that have been utilized in this model.
3.4.3 Bidirectional LSTM
A Bidirectional LSTM (BiLSTM) network is particularly well-suited for sequence pre-

diction tasks, as it efficiently encompasses temporal dependencies in both forward and
backward directions. BiLSTM networks are effective tools for analyzing fixed-length
input sequences and can be customized to accommodate a variety of output classes. The
architecture and functionality of the employed model are explained below:
The ‘input layer’ processes image sequences in the shape of (28, 28). This means that
each sequence contains 28 time steps, with 28 features representing the pixel values at
each stage. This configuration is optimal for handling image data where each row of
pixels in the image is treated as a time step. The model is fundamentally composed
of a Bidirectional LSTM layer that contains 256 LSTM units. BiLSTMs analyze in-
put sequences in both forward and backward orientations, in contrast to conventional
LSTMs, which exclusively examine past time steps. This bidirectional approach allows
the model to capture more meaningful temporal relationships by taking into account
both past and future contexts within the sequence. Following this, a Dropout layer with
a dropout rate of 0.2 is implemented to prevent overfitting by randomly dropping neu-
rons during training, which encourages the network to learn more robust representations.
After the first BiLSTM layer, a second Bidirectional LSTM layer with 256 units is in-
cluded, which operates without returning sequences, meaning it provides a single output
per sequence. Another dropout layer is placed after this BiLSTM layer to further reduce
the danger of overfitting. Finally, the model includes a Dense layer as the output layer,
with 10 and 36 units corresponding to the number of output classes (representing the
digit and the letters in Bangla Sign Language) and a softmax activation function. This
layer allocates probabilities to each of the classes, allowing the model to predict which
class the input sequence belongs to. Overall, this BiLSTM model architecture is well-
suited for sequence classification tasks, such as classifying letters in sign language based
on the temporal patterns extracted from the input image data. The use of BiLSTM lay-
ers ensures that the model can comprehend the sequential nature of the input from both
directions, leading to more accurate and comprehensive learning of the data’s temporal
structure.
24
Figure 3.8: Architecture of BiLSTM
The figure 3.8 illustrate the architecture of BiLSTM . The architecture details all the
layers that have been utilized in this model.
3.4.4 Resnet 50
The ResNet50 model for classifying Bangla Sign Language digits and letters. Initially,
the images are preprocessed by resizing them to 224x224 pixels (ResNet50’s required
input size) and normalizing pixel values. The dataset is then split into training and test-
ing sets, and the labels are one-hot encoded to facilitate classification across 10 classes
and 36 classes. ResNet50, pre-trained on the ImageNet dataset, is imported without
its fully connected layers, and the pre-trained layers are frozen to retain their learned
weights. On top of ResNet50, custom classification layers are added, including a global
average pooling layer to convert the feature maps into a single vector, followed by a
dense layer with 1024 neurons for feature extraction, a dropout layer to prevent over-
fitting, and a final dense layer with 10 and 36 units for the digit and letter classification
task. The model is compiled and trained using different batch size over 50 epochs, with
a checkpoint mechanism implemented to save the best model based on validation accu-
racy. After training, the model is evaluated on the test dataset, and predictions are made
to assess its performance.
3.4.5 Xception
The ‘Xception’ model is an enhanced version of the Inception architecture that incorpo-
rates depthwise separable convolutions, which consist of a depthwise convolution ap-
plied individually to each channel followed by a pointwise 1x1 convolution to combine
25
the channels. This innovation enables Xception to maintain the efficiency of Inception
while improving performance by making better use of parameters. Xception is particu-
larly suited for image classification tasks, having exhibited strong results on benchmarks
such as ImageNet. The model consists of an input flow, intermediate flow blocks, and an
exit flow, all utilizing depthwise separable convolutions. These convolutions, combined
with residual connections, reduce the number of parameters and computation while al-
lowing the network to extract intricate features. The use of residual connections also
helps to address the vanishing gradient problem, making it simpler to train deep net-
works. Xception’s structure is optimal for transfer learning, where pre-trained models
can be fine-tuned for specific applications with improved accuracy and faster training
times.
For the task of Bangla Sign Language letter classification, the Xception model has been
adapted as follows:
Data Preprocessing:
• Images are resized to 299x299 pixels (as required by Xception) and converted
from grayscale to RGB by repeating the single grayscale channel.
• Pixel values are normalized to the range [0, 1].
• The dataset is divided into training and testing sets, and the labels are one-hot
encoded for multi-class classification.
Model Architecture:
• The pre-trained Xception model is supplied with ImageNet weights, excluding

the top fully connected layers.
• A GlobalAveragePooling2D layer is added to reduce the dimensionality of the

feature maps.
• A dense layer with 128 neurons and a dropout layer (0.2) is added to prevent
overfitting.
• A final dense layer with 36 units (one for each class) and softmax activation is
used for multi-class classification.
Training the Model:
• The model is compiled with a different optimizer and categorical cross-entropy

loss.
26
• The training is conducted with a batch size of 16, 32, 64, and 128 for 50 epochs,
with a checkpoint to save the best model based on validation accuracy.
The combination of depthwise separable convolutions and residual connections makes

Xception a potent and efficient model for Bangla Sign Language classification, offering
a balance of computational efficiency and high accuracy.
3.4.6 VGG16
The VGG16 deep learning model, devised by Karen Simonyan and Andrew Zisserman
at the University of Oxford [Simonyan and Zisserman, 2014], is renowned for its sim-
plicity yet powerful feature extraction capability in image classification tasks. With 16
weight layers, the architecture consists of compact 3x3 convolutional filters followed by
max-pooling layers to reduce spatial dimensions. This design has 138 million param-
eters, making it computationally intensive but highly effective for tasks such as image
recognition and feature extraction, particularly in large datasets like ImageNet. The
VGG16 model operates through looking at an input image (typically 224x224 RGB)
with its convolutional layers. These layers detect low-level features like boundaries
and textures, which are then refined through multiple convolution and pooling opera-
tions. Max-pooling layers help reduce the feature map’s dimensions, preserving essen-
tial information while preventing overfitting. The extracted high-level features are then
transmitted to fully connected layers for classification. In this architecture, two fully
connected layers have 4096 neurons each, followed by a final layer with 1000 neu-
rons (corresponding to 1000 ImageNet classes) using softmax activation to predict the
image class probabilities. For this specific task, VGG16 has been adapted for Bangla
Sign Language letter classification. The input images are resized to 50x50 pixels and
converted from grayscale to RGB by replicating the grayscale channel. Using the pre-
trained VGG16 model (without the top fully connected layers), the model is customized
by adding a flattening layer, a dense layer with 128 neurons, a dropout layer to prevent
overfitting, and a final output layer for the 36-letter classes and 10 digit classes.
The stages for this modified VGG16 model include:
Data Preprocessing: Images are read in grayscale, resized to 50x50, and normalized
to the range [0, 1]. The images are then divided into training and testing sets, with one-
hot encoding applied to the labels for multi-class classification.
Model Architecture:
• A flattening layer is applied to convert the 3D output of the convolutional layers

into a 1D vector.
• A dense layer with 128 neurons is included, followed by a dropout layer (0.2) to
27
reduce overfitting.
• A dense layer with 128 neurons and a dropout layer (0.2) is added to prevent
overfitting.
• The final layer has 10 or 36 units (one for each class), with softmax activation for
multi-class classification.
Training the Model: The model is compiled with the different optimizer and categori-
cal cross-entropy loss. It is trained using a batch size of 16,32,64 and 128 for 50 epochs,
with a checkpoint to save the best model based on validation accuracy.
This modified VGG16 model leverages transfer learning to accomplish robust feature
extraction while minimizing computational cost by freezing the pre-trained layers and
only training the added custom layers.
3.4.7 VGG-19
The VGG19 model is a deep Convolutional Neural Network (CNN) architecture renowned
for its efficacy in image recognition tasks. Developed by the Visual Geometry Group
(VGG) at Oxford [Simonyan and Zisserman, 2014], VGG19 comprises 19 layers, in-
cluding 16 convolutional layers and 3 fully connected layers, employing small 3x3 con-
volutional filters applied sequentially. This design enables the model to capture fine de-
tails efficiently while progressively reducing spatial dimensions through max-pooling
layers, which retain crucial features. In implementation, a pre-trained VGG19 model
is leveraged for classifying images from the Bangla Sign Language digit dataset. Ini-
tially, the dataset was preprocessed by importing grayscale images, resizing them to
50x50 pixels, and converting them to RGB format. The pixel values are normalized to
a range of 0 to 1, and the dataset is divided into training and testing sets, with labels
one-hot encoded for multi-class classification. The VGG19 model was imported with
ImageNet weights while excluding the fully connected layers to add my custom classi-
fication layers. The model’s output is compressed into a 1D vector, followed by a dense
layer with 128 neurons and ReLU activation, and a dropout layer to mitigate overfitting.
The final output layer employs softmax activation to derive class probabilities for the
10 digit classes and 36 letter classes. To enhance training efficiency, the weights of the
VGG19 layers were paused, ensuring that only the custom layers learn during training.
The model is compiled with the optimizer (like: Adam, Nadam, Adamax and RMSprop)
and categorical cross-entropy loss, then trained using a checkpoint callback to save the
best-performing model based on validation accuracy.
After training, the model was evaluate on the test set, making predictions and calculating
metrics such as F1 score, precision, and recall. The model’s performance is also visu-
alize using a confusion matrix and generate a classification report for detailed insights.
28
Lastly, the training and validation accuracy and loss curves are illustrated to observe the
model’s learning progress throughout the epochs.
3.4.8 InceptionV3
Researchers at Google developed Inception V3, a deep convolutional neural network

architecture [Szegedy et al., 2016]. The dataset consists of images that are first prepro-
cessed by resizing them to 299x299 pixels (as required by InceptionV3), converting the
grayscale images to RGB, and normalizing pixel values. The data is divided into training
and testing sets, and the labels are one-hot encoded for multi-class classification with 36
classes. The pre-trained InceptionV3 model, which is loaded with ImageNet weights,
is used as the base model, with its fully connected layers removed. A global average
pooling layer is added to reduce the dimensionality of the feature maps, followed by a
dense layer with 128 neurons and a dropout layer (0.2) to prevent overfitting. The final
dense layer has 36 units with softmax activation, which is used to classify the images
into one of the 36 classes. During the training process, the pre-trained layers of the
InceptionV3 model are locked, and the custom classification layers are trained on the
new dataset. The training is conducted with a batch size of 16 over 50 epochs, with a
model checkpointing mechanism in place to save the best-performing model based on
validation accuracy.
3.4.9 CNN-LSTM
The CNN+VGG16 implements a model based on VGG16 architecture combined with

custom CNN layers for classifying Bangla Sign Language letters. Images are loaded,
converted to grayscale, resized to 64x64 pixels, and normalized. Grayscale images are
converted to RGB format before being processed. The dataset is split into training and
testing sets, and labels are one-hot encoded for the 36-class classification problem. A
pre-trained VGG16 model is used as the base, with its layers frozen to retain the learned
features from the ImageNet dataset. Additional CNN layers are added on top of VGG16
for further feature extraction. Dropout layers are incorporated between convolutional
layers to prevent overfitting. The GlobalMaxPooling2D layer is used to reduce the fea-
ture maps without completely shrinking them. The model is compiled using the differ-
ent optimizer and trained for 50 epochs with a batch size of 16,32,64 and 128. The best
model is saved using a model checkpoint mechanism based on validation accuracy.
3.4.10 CNN-VGG16
The CNN+VGG16 implements a model based on VGG16 architecture combined with

custom CNN layers for classifying Bangla Sign Language letters. Images are loaded,
converted to grayscale, resized to 64x64 pixels, and normalized. Grayscale images are
converted to RGB format before being processed. The dataset is split into training and
29
testing sets, and labels are one-hot encoded for the 36-class classification problem. A
pre-trained VGG16 model is used as the base, with its layers frozen to retain the learned
features from the ImageNet dataset. Additional CNN layers are added on top of VGG16
for further feature extraction. Dropout layers are incorporated between convolutional
layers to prevent overfitting. The GlobalMaxPooling2D layer is used to reduce the fea-
ture maps without completely shrinking them. The model is compiled using the differ-
ent optimizer and trained for 50 epochs with a batch size of 16,32,64 and 128. The best
model is saved using a model checkpoint mechanism based on validation accuracy.
3.4.11 CNN-VGG19
The CNN and VGG19 builds a hybrid model to classify Bangla Sign Language letters
using a hybrid architecture that integrates VGG19 as a feature extractor with additional
custom CNN layers. The grayscale images in the dataset are resized to 50x50 and con-
verted to 3-channel RGB format to meet the input requirements of VGG19. The dataset
is split into training and testing sets, and labels are one-hot encoded for the 36-class
classification task. The VGG19 model, pre-trained on ImageNet, is used as the base
model with its fully connected layers removed. The layers of VGG19 are frozen to re-
tain the learned features from the original dataset. On top of this base, custom CNN
layers are added, consisting of two convolutional layers followed by max-pooling op-
erations to capture additional features from the images. These convolutional layers are
then flattened, and fully connected layers are added to perform the final classification.
3.4.12 CNN-InceptionV3
The CNN-InceptionV3 constructs a hybrid model that integrates InceptionV3 as a fea-

ture extractor with additional custom CNN layers for classifying Bangla Sign Language
letters. The images are loaded in grayscale format, resized to 299x299 to align with
the input size required by InceptionV3, and converted to RGB by merging grayscale
channels. The dataset is split into training and testing sets, and the labels are one-hot
encoded to fit a 36-class classification task. InceptionV3 is used as the base model with-
out its fully connected layers, leveraging its pre-trained feature extraction capabilities
from ImageNet. The layers of InceptionV3 are frozen to retain these learned features.
On top of this, additional CNN layers are built, consisting of four convolutional layers
with progressively increasing filter sizes (32, 64, 128, and 256), extracting more com-
plex features from the images. Max pooling is applied after each convolution to reduce
spatial dimensions, with dropout incorporated to prevent overfitting. The outputs from
the custom CNN layers are flattened and merged with the globally averaged output of
InceptionV3. This merged output is processed through a dense layer to perform the fi-
nal classification. This architecture is designed to efficiently combine the generalizable
features of InceptionV3 with the task-specific features extracted by the custom CNN
30
layers. Training is performed for 50 epochs with different optimizer and batch size, and
the model is monitored for validation accuracy.
31
CHAPTER 4
RESULTS AND PERFORMANCE ANALYSIS
This Chapter focuses on analyzing the parameters in terms of accuracy,precision,recall,f1

score and training-validation curve and measure the performance comparative analysis
among the five state of- the art Optimizers, namely Adam, Nadam, Adamax and RM-
Sprop on different size of input images.
4.1 Evaluation metrics
Precision: The precision of the model measures its capacity to accurately identify posi-
tive instances from the total predicted positives. The accuracy of optimistic predictions
is the main focus. The ratio of true positive to the total of true positive and false positive
is used to calculate precision.
TP
Precision = (4.1.1)
T P + FP
Recall: Recall is the percentage of positively expected observations to all observations

in the real class. Another names for it include Hit Rate, True Positive Rate, or Sensitivity.
It gauges how comprehensive a classifier is. Recall that is poor suggests a lot of false
negatives.
TP
Recall = (4.1.2)
T P + FN
F1 Score:Precision and recall combined to a weighted average is the F1 Score. This

score thus considers both false positives and false negatives. Usually, especially in
cases when the class distribution is unequal, it is more helpful than precision.
Precision · Recall
F1 = 2 · (4.1.3)
Precision + Recall
4.2 Result Analysis and Discussion:
4.2.1 CNN Model :
In order to effectively build and evaluate the CNN model, the dataset is split into training
and testing sets. The digit dataset features 1,500 single-channel (grayscale) images,
which are divided into 10 classes. Each class contains 150 images. Each of the 36 classes
in the letter dataset consists of 120 images, totaling 4,320 single-channel images. All
32
images are resized and normalized to 50x50 pixels. A well-balanced split is maintained
by reserving 20% of the data for testing, while approximately 80% is utilized for training.
This arrangement enables the model to be trained on the preponderance of the data while
retaining a significant portion for evaluation. The CNN model is trained over a period
of 50 epochs, with each epoch representing a complete journey through the training
dataset. During training, the model minimizes the categorical cross-entropy loss using
gradient descent and backpropagation to revise its weights at each epoch. In an effort
to avert overfitting, the model’s performance is tracked on a subset of the training data
using a 20% validation split. This allows tracking and visualizing the learning curves,
including both accuracy and loss for the training and validation sets.
4.2.1.1 Explanation of CNN Performance Table
The table below summarizes the performance of the Convolutional Neural Network
(CNN) model using various optimizers and batch sizes. The metrics evaluated include
accuracy, precision, recall, and F1-score, providing a comprehensive assessment of the
model’s performance.
Performance table of Digit Dataset:
Optimizer Batch Size Precision Recall F1-Score Accuracy

Adam 16 96.1 95.6 95.6 96
32 95.9 95.1 95.3 96
64 95.9 94.3 94.5 95
128 93.6 93.1 93.1 93
Nadam 16 96.3 96 96.1 96
32 97.0 96.7 96.8 97
64 97.0 96.7 96.8 97
128 90.9 90.6 90.6 91
Adamax 16 94.5 94.2 94.2 94
32 90.8 90.2 90.3 91
64 82.7 82 81.9 83
128 70.9 70.2 69.5 71
RMSprop 16 95.2 95.3 95.1 95
32 96.6 96.0 96.2 96
64 95.3 94.5 94.7 95
128 94.2 93.8 93.7 94
Table 4.1: CNN Performance Table (Digit)
33
Performance table of Letter dataset:

Adam 16 88.8 88.2 87.7 88
32 87.6 87.5 87.2 87
64 88.8 88.0 88.1 88
128 87.7 88.2 87.6 88
Nadam 16 88.3 88.4 88.0 88
32 87.6 87.9 87.1 88
64 88.5 88.1 87.9 88
128 89.4 89.3 89.0 89
Adamax 16 89.7 89.9 89.4 89
32 87.1 87.6 87.1 87
64 86.4 87.0 86.2 87
128 81.8 82.0 81.5 82
RMSprop 16 90.2 89.7 89.5 90
32 87.5 87.0 86.6 87
64 89.5 89.2 88.7 89
128 85.0 85.8 84.8 85
Table 4.2: CNN Performance Table (Letter)
From CNN Performance Table 4.1 and 4.2 we can see that:
Adam Optimizer: The best accuracy is 0.961 on the digit dataset, and 0.888 is the best
accuracy on the letter dataset, especially with batch size 16 on both datasets. The very
high accuracy, recall, and F1-score on the Digit dataset demonstrate excellent model
performance.
Nadam Optimizer : Has exceptional performance as well, maintaining an accuracy
of 0.97 on the digit dataset and 0.894 on the letter dataset over a range of batch sizes.
Recall, F1-score, and precision are still flawless or very close to it.
Adamax Optimizer: Achieves high accuracy 0.945 on the digit dataset and almost 0.90
on the letter dataset., especially with batch sizes 16 .
RMSprop Optimizer:Shows excellent and reliable performance, with accuracy values
0.966 on the digit dataset and 0.902 on the letter dataset.
4.2.1.2 Visualization of CNN Optimizers Across Different Batch Sizes
The below bar charts demonstrate the performance of the CNN-based image classifica-
tion model for both the letter and digit datasets. The performance is evaluated across
several optimizers (Adam, Nadam, Adamax, RMSprop) and varying batch sizes (16, 32,
34
Figure 4.1: Evaluation of CNN (Digit) Optimizers and Batch Sizes
Figure 4.2: Evaluation of CNN (Letter) Optimizers and Batch Sizes
64, 128). Each optimizer is represented by a different color, while the x-axis denotes
the batch size, and the y-axis represents the accuracy percentage.
For the CNN (Letter) Model: The figure 4.2 demonstrates how the model’s accu-
racy changes with different optimizers and batch sizes. Across all batch sizes, Adam
and Nadam optimizers demonstrate generally stable and excellent performance. The
Adamax optimizer likewise performs well but slightly lags behind the others in some
circumstances. RMSprop exhibits competitive accuracy for increasing batch sizes, how-
ever its performance reduces for the largest batch size of 128. This shows that Adam,
Nadam, and Adamax are generally useful for training the CNN on the letter dataset, but
RMSprop may require careful adjustment.
For the CNN (Digit) Model: This figure 4.1 shows highlights the performance varia-
tions of the CNN model on the digit dataset with different optimizers and batch sizes.
Adam, Nadam, and RMSprop consistently perform well across all batch sizes, achieving
35
near-perfect accuracy. Adamax shows slightly slower performance, especially at big-
ger batch sizes (64 and 128). RMSprop’s performance remains competitive, though it
shows a slight fall at bigger batch sizes. In both models, the analysis suggests that Adam
and Nadam are the most trustworthy optimizers, producing consistently high accuracy,
while Adamax and RMSprop may require careful tuning based on the batch size.
4.2.1.3 Training Curves for CNN Model with Different Optimizers and Batch
Sizes.
The training curve: shows how the model learns over the course of training by chang-
ing its performance metric (such as accuracy) throughout a series of epochs.
The validation curve:Tracks the model’s performance measure on a different valida-
tion set throughout training to demonstrate how effectively the model generalizes to new
data.
The loss curve : shows how the model’s loss function decreases across epochs, demon-
strating its capacity to reduce mistakes and boost prediction accuracy while being trained.
Figure 4.3: CNN(Digit) Accuracy and Loss Curve For Nadam optimizer with Batch Size 32
The training accuracy rapidly increases in the first few epochs, reaching 97% accuracy
on the digit dataset within 50 epochs. The validation accuracy similarly exhibits a quick
increase during the initial epochs, though not as steep as the training accuracy. The
validation loss remain limited and reasonably consistent after the initial fall with slight
fluctuations.
36
Figure 4.4: CNN(Letter) Accuracy and Loss Curve For RMSprop optimizer with Batch Size 16
The training accuracy rapidly increases in the first few epochs, reaching 90.2% accuracy
on the letter dataset within 50 epochs.
4.2.1.4 Classification Report
A classification report presents important metrics including accuracy, recall, F1-score,

and support for each class to give a thorough overview of a classification model’s per-
formance.
Figure 4.5: Classification report for Nadam optimizer batch size 32 (Digit Dataset)
37
Figure 4.6: Classification report for RMSprop optimizer batch size 16 (Letter Dataset)
In the digit dataset, Class 1 from the first classification report has a support of 35, mean-
ing that 35 samples in actuality match the ground truth for class 1. It has a precision
of 1.00, indicating that all 35 samples predicted to be in class 1 are indeed correctly
classified with no false positives. The recall is 1.00, showing that the model accurately
identified all 35 true class 1 samples, leaving no false negatives. With a perfect F1-score
of 1.00, this class demonstrates a flawless balance between precision and recall. In the
second classification report of letter dataset, class 1 has a support of 19, meaning that
there are 19 actual samples for class 1. The model’s precision is 0.95, meaning that most
samples predicted to be class 1 were correct, but a few were false positives. The recall
38
is also 0.95, indicating that the model correctly predicted 18 out of the 19 true class 1
samples, missing 1 sample as a false negative. The F1-score of 0.95 shows that class 1
has a very good balance between precision and recall, with only minor inaccuracies.
4.2.1.5 Confusion Matrix
A table used to show how effectively a categorization system performs is dubbed a

confusion matrix. For every sample in the test set, it compares the real class with the
predicted class. The confusion matrix offers insight into the types of errors being created
by the classifier. It helps in assessing whether the model is perplexing two groups more
than others. Analyzing the confusion matrix helps drive improvements to the model and
its training process. Additionally, it highlights accuracy and recall measurements, which
are crucial for evaluating model performance. By evaluating the confusion matrix, one
may also assess the balance of classes and reveal any potential bias in the predictions.
Figure 4.7: Confusion Matrix For Nadam optimizer batch size 32 (Digit Dataset)
The Convolutional Neural Network (CNN) model for Digit dataset, Nadam optimizer
batch size 32 identified all 10 classes in the sample with remarkable accuracy. This out-
standing accomplishment reveals CNN’s resilience and efficiency in identifying com-
plex patterns in a variety of classes. Its excellent recall and precision highlight how
well-suited it is for a variety of real-world scenarios involving complex categorization
tasks.
39
Figure 4.8: Confusion Matrix For Adam optimizer batch size 32
The diagonal cells (highlighted) represent The Convolutional Neural Network (CNN)
model for Letter dataset, RMSprop optimizer batch size 16 identified all 36 classes.
40
4.2.2 LSTM Model :
Before observing an image, the LSTM-based image classification model resizes it to

28x28 pixels and transforms it to grayscale. The pixel values are then normalized and
placed inside the range [0, 1]. For LSTM input preparation, the images are transformed
to a 3D format with the shape (28, 28), indicating (time steps, features). The model
architecture comprises of two LSTM layers. The first LSTM layer has 256 units and is
set to return sequences, enabling it to record temporal correlations across the image’s
pixel rows. A 0.2 dropout rate is utilized to prevent overfitting. The second LSTM layer
also includes 256 units, followed by another 0.2 dropout layer. The output is processed
through a dense layer with softmax activation, categorizing the images into one of 36
classes for the letter dataset and 10 classes for the digit dataset. To optimize training,
the model utilizes the Nadam optimizer with a learning rate of 0.001. The training
procedure comprises monitoring the validation accuracy and saving the best model using
a ModelCheckpoint callback. Various hyperparameters, such as batch sizes (16, 32) and
different optimizers (Nadam for the letter dataset and Adam for the digit dataset), are
examined in order to discover the ideal training setting.
4.2.2.1 LSTM Performance Table
Performance table of Digit Dataset: Performance table of Letter Dataset:

Adam 16 95.6 95.0 95.2 95
32 90.8 90.4 90.5 91
64 90.3 89.7 89.8 90
128 84.8 84.7 84.6 85
Nadam 16 93.7 93.4 93.4 93
32 93.4 93.5 93.4 93
64 89.8 89.1 89.3 89
128 76.8 75.5 75.5 76
Adamax 16 86.2 85.9 85.7 86
32 87.3 86.1 86.1 86
64 78.8 78.1 78.2 78
128 53.8 51.8 48.6 52
RMSprop 16 91.9 91.3 91.4 92
32 88.2 87.5 87.7 88
64 85.6 84.8 85.0 85
128 84.4 84.0 83.9 84
Table 4.3: LSTM Model Results (Digit)
41
Adam 16 78.9 78.4 77.8 78
32 83.2 83.5 82.9 84
64 83.7 83.9 83.4 84
128 80.8 80.7 80.2 80
Nadam 16 80.9 80.8 79.9 80
32 84.2 84.4 84.0 84
64 83.4 83.7 83.1 83
128 79.2 79.6 78.9 80
Adamax 16 80.5 80.7 79.9 81
32 81.8 81.6 81.2 81
64 78.7 78.3 78.0 78
128 76.3 76.0 75.7 76
RMSprop 16 81.0 81.3 80.8 81
32 78.9 78.5 78.2 79
64 77.0 76.7 76.5 77
128 74.5 74.8 74.1 75
Table 4.4: LSTM Model Results (Letter)
4.2.2.2 Explanation of LSTM Performance Table
The table 4.3 and 4.4 below summarizes the performance of the LSTM model using
various optimizers and batch sizes. The metrics evaluated include accuracy, precision,
recall, and F1-score, providing a comprehensive assessment of the model’s performance.
Adam Optimizer:Overall, the Adam optimizer batch size 16 performed the best for the
digit dataset. It obtained 95% accuracy, 0.956 precision, 0.95 recall, and 0.952 F1-score
with a batch size of 16. On the other hand, Adam optimizer batch sizes 32 and 64 achieve
84% accuracy for the letter dataset.
Nadam Optimizer :The best result for the Nadam optimizer was with batch size 16,
reaching 93.7% accuracy, 93.4% precision, 93.4% recall, and an F1-score of 93. As
the batch size increased, performance slightly decreased, with batch size 32 maintain-
ing similar findings, but batch sizes 64 and 128 witnessing decreases to 89% and 76%
accuracy, respectively.
Adamax Optimizer: Batch size 32 generated the best result for Adamax, with 87.3%
accuracy, 86.1% precision, 86.1% recall, and an F1-score of 86. However, utilizing
batch size 16 resulted in slightly reduced accuracy at 86.2%, while bigger batch sizes
64 and 128 observed considerable decreases, with the accuracy dropping to 78% and
53.8%.
RMSprop Optimizer:For RMSprop, batch size 16 performed best, achieving 91.9%
accuracy, 91.3% precision, 91.4% recall, and an F1-score of 92. Performance declined
42
as the batch size increased, with batch size 32 dropping accuracy to 88%, and batch
sizes 64 and 128 showing additional decreases to 85% and 84%, respectively.
4.2.2.3 Visualization of LSTM Optimizers Across Different Batch Sizes
Figure 4.9: Evaluation of LSTM Optimizers and Batch Sizes (Digit)
Figure 4.10: Figure 4.13: Evaluation of LSTM Optimizers and Batch Sizes(Letter)
The graphs provided demonstrate the performance of an LSTM-based classification

model for both letter and digit datasets. The performance is evaluated across multiple
optimizers (Adam, Nadam, Adamax, RMSprop) and varying batch sizes (16, 32, 64,
128). Each optimizer is represented by a unique color, with the x-axis showing the
batch size and the y-axis representing accuracy in percentage.
For the LSTM (Digit) Model: The second graph displays the LSTM model’s perfor-
mance on the digit dataset with the same optimizers and batch sizes. Again, Adam and
Nadam are very competitive, keeping good accuracy across all batch sizes. Adamax,
despite working well for smaller batch sizes, exhibits a significant drop at batch size
43
128. RMSprop also displays a consistent performance but with a steeper drop-off at
batch size 128. Overall, Adam and Nadam emerge as the most dependable optimizers,
while Adamax and RMSprop require more careful modifying when the batch size in-
creases.
For the LSTM (Letter) Model: The graph demonstrates how the model’s accuracy
changes according on the choice of optimizer and batch size. Adam, Nadam, and Adamax
optimizers show remarkably similar performance, consistently reaching excellent accu-
racy across all batch sizes. RMSprop also performs well but exhibits a little drop in
performance at increasing batch sizes, particularly at batch size 128. It means that while
all optimizers work well, Adam and Nadam appear to be slightly more consistent, espe-
cially as batch sizes increase.
In both models, Adam and Nadam optimizers are highly successful, consistently deliv-
ering outstanding results across different batch sizes. Adamax and RMSprop work well
but exhibit increasing susceptibility to bigger batch sizes, particularly in the case of the
digit dataset.
4.2.2.4 Training Curves for LSTM Model with Different Optimizers and Batch
Sizes.
ing its performance metric (such as accuracy) throughout a series of epochs. The val-
idation curve:Tracks the model’s performance measure on a different validation set
throughout training to demonstrate how effectively the model generalizes to new data.
Figure 4.11: LSTM(Digit) Accuracy and Loss Curve For Adam optimizer with Batch Size 16.
44
The graphs show that the model learns well in the first few epochs, boosting accuracy
in training and validation while lowering loss for digit dataset. Good generalization is
shown by the alignment of the training and validation accuracy and loss curves, while
moderate overfitting may begin to show near the end of the training period. All things
considered, the model performs well and has strong learning and generalization skills.
Figure 4.12: LSTM (Letter) Accuracy and Loss Curve For Nadam with Batch Size 32.
The graphs show that the letter dataset accuracy and loss curve for Nadam optimizer
with batch size 32. Reaching 84.4% accuracy within 50 epochs.
A classification report offers crucial metrics including accuracy, recall, F1-score, and
support for each class to give a detailed overview of a classification model’s perfor-
mance. Accuracy shows the overall accuracy of the model, whereas recall evaluates the
capacity to detect all relevant occurrences. The F1-score establishes a balance between
accuracy and recall, offering insight into the model’s performance when both false posi-
tives and false negatives are essential. Support refers to the number of true examples for
each class in the dataset. The categorization report aids in assessing model strengths and
weaknesses across multiple classes, allowing for focused improvements in the model’s
predictive capabilities
45
Figure 4.13: Classification report for Adam optimizer batch size 16 (Digit)
Figure 4.14: Classification report for Nadam optimizer batch size 32 (Letter)
46
The classification reports 4.13 and 4.14indicate that the model performs exceptionally
well across most classes, with precision, recall, and F1-scores generally in a digit dataset.
However, the Letter dataset has slightly lower metrics, suggesting some difficulty in ac-
curate prediction. The overall accuracy of 95% and 84% highlights the model’s effec-
tiveness in classifying instances correctly. Both macro and weighted averages of pre-
cision, recall, and F1-score are 0.97, confirming balanced and consistent performance
across all classes. This demonstrates the model’s robustness and reliability in handling
the dataset.
Figure 4.15: Confusion Matrix For Adam optimizer batch size 16 (Digit)
The LSTM model for Digit dataset, Adam optimizer batch size 10 identified all 10
classes in the sample with remarkable accuracy.
Figure 4.16: Confusion Matrix For Nadam optimizer batch size 32(Letter)
47
4.2.3 BiLSTM Model :
The BiLSTM image classification model utilizes Bidirectional Long Short-Term Mem-
ory (BiLSTM) networks to categorize images into 36 classes for the letter dataset and 10
classes for the digit dataset. The input images are first scaled to 28x28 pixels and trans-
formed to grayscale, with pixel values adjusted between 0 and 1 for optimal training.
The model architecture consists of two BiLSTM layers, each with 256 units. To prevent
overfitting, dropout regularization is used after each BiLSTM layer. The final output
layer utilizes softmax activation to classify the images. For training, the images are sep-
arated into training and testing sets, with labels one-hot encoded into 36 categories for
the letter dataset and 10 categories for the digit dataset. The model is created using the
Nadam optimizer with a learning rate of 0.001 and categorical cross-entropy as the loss
function. It is trained across 50 epochs, utilizing batch sizes of 16 and a 20% validation
split. A model checkpoint callback is implemented to save the best-performing model
during training. After training, the model’s performance is tested on the test set. Key
measures, including accuracy, F1-score, precision, and recall, are calculated.
4.2.3.1 BiLSTM Performance Table

Adam 16 93.1 92.4 92.6 93
32 93.0 92.6 92.7 93
64 91.6 90.4 90.6 91
128 88.4 87.9 88.1 88
Nadam 16 93.9 93.4 93.6 94
32 92.8 92.5 92.6 93
64 90.1 89.8 89.8 90
128 81.4 81.1 81.0 81
Adamax 16 89.5 89.3 89.3 89
32 85.2 85.1 85.0 85
64 82.3 81.2 81.1 81
128 64.5 64.9 63.3 64
RMSprop 16 91.1 90.8 90.8 91
32 93.8 93.6 93.7 94
64 86.7 84.1 84.1 83
128 88.9 88.9 88.8 89
Table 4.5: BiLSTM(Digit) Model Results
48
Performance table of Letter Dataset:

Adam 16 84.5 84.8 84.3 85
32 84.5 85.0 84.3 85
64 82.7 83.3 82.6 83
128 80.9 81.3 80.7 81
Nadam 16 86.5 86.9 86.5 87
32 84.8 84.9 84.5 85
64 84.4 84.9 84.4 85
128 81.7 81.7 81.3 81
Adamax 16 82.2 82.5 81.9 82
32 80.9 81.4 80.9 81
64 79.5 80.0 79.2 80
128 77.2 77.0 76.4 77
RMSprop 16 81.4 81.9 81.3 81
32 80.7 81.7 80.9 81
64 79.8 79.8 78.7 79
128 78.4 78.5 77.9 78
Table 4.6: : BiLSTM(Letter) Model Results
The table 4.5 and 4.6 summarizes the performance of the BiLSTM model using various
optimizers and batch sizes. The metrics evaluated include accuracy, precision, recall,
and F1-score, providing a comprehensive assessment of the model’s performance.
4.2.3.2 Explanation of BiLSTM Performance Table
Adam Optimizer::Adam performed well for batch sizes ranging from 16 to 128. Accu-
racy with the digit dataset range from 88% to 93%, and F1-scores held constant between
88.1 and 92.7. The letter dataset performed a little bit worse, with F1-scores ranging
from 80.7 to 84.3 and accuracy ranging from 81% to 85%. The values of precision and
recall showed a similar pattern, with lower batch sizes providing the highest results.
Nadam Optimizer : Nadam performed similarly to Adam. With the digit dataset, accu-
racy ranged from 81% to 94%, with F1-scores from 81.0 to 93.6. It outperformed Adam
almost for smaller batch sizes (e.g., batch size of 16). In the letter dataset, the perfor-
mance was also constant, with accuracy between 81% and 87% and F1-scores ranging
from 81.3 to 86.5. Smaller batch sizes once again saw the best outcomes.
Adamax Optimizer:Performance decreased more noticeably with Adamax at larger
batch sizes. The F1-scores in the digit dataset ranged from 63.3 to 89.3, while the ac-
curacy ranged from 64% to 89%. Similar patterns were observed in the letter dataset,
49
where F1-scores ranged from 76.4 to 81.9 and accuracy from 77% to 82%. The optimal
performance of this optimizer was highest at lower batch sizes, but it rapidly decreased
as batch sizes increased
RMSprop Optimizer::RMSprop shows stable performance throughout batch sizes. In
the digit dataset, accuracy ranged from 83% to 94%, with F1-scores between 84.1 and
93.7. For the letter dataset, performance was lower but constant, with accuracy ranging
from 78% to 81% and F1-scores from 77.9 to 81.3. This optimizer performed well at
smaller batch sizes, especially with batch sizes of 16 and 32 in both datasets.
4.2.3.3 Visualization of BiLSTM Optimizers Across Different Batch Sizes
Figure 4.17: Evaluation of Bi-LSTM Optimizers and Batch Sizes(Digit)
Figure 4.18: Evaluation of Bi- LSTM Optimizers and Batch Sizes(Letter)
The bar charts shows the performance of a Bi-LSTM-based model applied to both let-
ter and digit datasets. The performance is evaluated using four different optimizers
(Adam, Nadam, Adamax, RMSprop) with variable batch sizes (16, 32, 64, 128). The
50
x-axis demonstrates the batch sizes, while the y-axis indicates the accuracy in percent-
age. Each optimizer is color-coded.
For the Bi-LSTM (Digit) Model: The figure 4.18 shows how different optimizers
perform when applied to the digit dataset. Adam, Nadam and RMSprop show high per-
formance across most batch sizes, achieving over 80% accuracy for lower batch sizes.
Adamax, however, exhibits a small loss in performance as batch size expands, especially
for batch sizes of 64 and 128. RMSprop similarly experiences a reduction in accuracy
at batch size 128, while it stays competitive at other levels. For the Bi-LSTM (Letter)
Model: The accuracy trends for different optimizers and batch sizes are shown in fig-
ure ??. Adam and Nadam optimizers frequently show high and stable accuracy across
all batch sizes, reaching over 80Adamax also performs well but shows a slight drop
compared to Adam and Nadam at greater batch sizes, but its performance remains com-
petitive. RMSprop while comparable in performance for smaller batch sizes, exhibits
a little reduction in accuracy as batch size grows, particularly at 128. This shows that
Adam and Nadam are stronger optimizers for the Bi-LSTM (Letter) model, whereas
Adamax and RMSprop may need fine-tuning depending on the batch size. Overall,
Adam and Nadam optimizers are the most consistent for the Bi-LSTM (Digit) model,
while Adamax and RMSprop indicate performance variances that would require more
changes.
4.2.3.4 Training Curves for BiLSTM Model with Different Optimizers and Batch
Sizes.
The model seems to be successfully learning to categorize the Class, based on the
combined observations of growing training accuracy and decreasing training loss over
epochs. Its capacity to recognize the appropriate categories for the class in the training
data is gradually becoming better.
51
Figure 4.19: Bi-LSTM Accuracy and Loss Curve For Nadam with Batch Size 16 (Digit)
Figure 4.20: Bi-LSTM Accuracy and Loss Curve For Nadam with Batch Size 16 (Letter)
The training and validation curves for Adam and RMSprop optimizers demonstrate good
performance across all tested measures, with consistent and rising trajectories indicat-
ing effective model training and generalization. In comparison, the curves linked with
the Adagrad optimizer show significantly poorer performance, with erratic trends and
minimal progress over epochs.
Figure 4.21: Classification report for Nadam batch size 16 (Digit)
52
Figure 4.22: Classification report for Nadam batch size 16 (Letter)
According to the aforementioned classification findings, the Bi-LSTM model has gen-
erally performed well across most classes, with accuracy and recall values on the digit
dataset usually above 0.90 and on the letter dataset frequently reaching 0.85. This in-
dicates that the model has a high degree of precision in identifying positive samples
and a high recall level in locating all of the positive samples for these classes. But the
performance is significantly worse for other classes, such as those with accuracy and
recall scores closer to 0.82 on the digit dataset or 0.47 on the letter dataset. This could
mean that certain specific classes are harder for the model to identify, or that the classes
are underrepresented in the data. For the majority of classes, the F1-score, a weighted
average of accuracy and recall, stays high, indicating a balanced performance between
precision and recall in both datasets. For numbers, the best accuracy was 94%, while
for letters, it was 87%.
53
Figure 4.23: Confusion Matrix For Nadam optimizer batch size 16 (Digit)
54
Figure 4.24: Confusion Matrix For Nadam optimizer batch size 16 (Letter)
55
4.2.4 VGG-16
This model utilizes transfer learning with the VGG16 architecture, which is pretrained
on ImageNet, to classify images. Two datasets are used: one for digits with 10 classes
and another for letters with 36 classes. The input images are scaled to 50x50 pixels and
processed to match the VGG16 standards. Grayscale images are transformed to RGB
by doubling the single channel. The model’s custom classification head includes fully
connected layers, dropout regularization, and a softmax layer for class prediction.
The VGG16 base layers are frozen, and Adam and Nadam optimizers are utilized in
independent tests. The model is trained using a batch size of 16 for 50 epochs, and a
validation split of 20
For both datasets, the model is evaluated using classification measures including ac-
curacy, F1-score, precision, recall, and confusion matrices. These metrics illustrate
the model’s effectiveness in detecting Bangla sign language characters across different
classes.
4.2.4.1 VGG16 Performance Table
Performance table of Digit Dataset: Performance table of Letter Dataset:

Adam 16 83.6 83.7 83.0 84
32 83.9 83.6 83.1 84
64 81.7 81.9 81.3 82
128 79.3 79.8 79.0 80
Nadam 16 83.7 83.8 82.9 83
32 83.5 83.6 83.2 83
64 81.6 81.8 81.1 82
128 79.0 79.5 78.8 80
Adamax 16 77.7 78.1 77.2 78
32 76.6 76.9 76.1 77
64 74.9 75.2 74.3 75
128 69.6 69.8 68.8 70
RMSprop 16 83.7 83.8 83.3 84
32 84.1 83.9 83.7 84
64 80.5 80.0 79.5 80
128 79.7 79.8 78.7 79
Table 4.7: VGG16 Model Results (Digit)
56
Adam 16 83.6 83.7 83.0 84
32 83.9 83.6 83.1 84
64 81.7 81.9 81.3 82
128 79.3 79.8 79.0 80
Nadam 16 83.7 83.8 82.9 83
32 83.5 83.6 83.2 83
64 81.6 81.8 81.1 82
128 79.0 79.5 78.8 80
Adamax 16 77.7 78.1 77.2 78
32 76.6 76.9 76.1 77
64 74.9 75.2 74.3 75
128 69.6 69.8 68.8 70
RMSprop 16 83.7 83.8 83.3 84
32 84.1 83.9 83.7 84
64 80.5 80.0 79.5 80
128 79.7 79.8 78.7 79
Table 4.8: VGG16 Model Results (Letter)
57
4.2.4.2 Explanation of VGG 16 Performance Table
The table below summarizes the performance of the VGG16 model using various opti-
mizers and batch sizes. The metrics evaluated include accuracy, precision, recall, and
F1-score, providing a comprehensive assessment of the model’s performance. Adam
Optimizer:: The Adam optimizer shows strong performance across various batch sizes,
with accuracy values ranging from 84% to 80% for the letter dataset and 96% to 93% for
the digit dataset. Precision, recall, and F1-scores also stay consistently good, especially
for smaller batch sizes. On the letter dataset, scores hover around 84% for batch sizes
of 16 and 32, while on the digit dataset, same measures reach 96% with batch sizes of
16.
Nadam Optimizer :The Nadam optimizer performs similarly to Adam, producing ac-
curacy scores between 83% and 80% for letters and between 95% and 94% for digits.
Precision, recall, and F1-scores are close to Adam’s values, with batch sizes of 16 and
32 frequently scoring above 83% for letters and 94% to 95% for digits, making it a great
choice for both datasets.
Adamax Optimizer: Adamax provides some lower performance compared to Adam
and Nadam. For the letter dataset, accuracy ranges from 78% to 70%, while for the digit
dataset, it gets up to 92% for lower batch sizes. Precision, recall, and F1-scores follow
similar patterns, with lower batch sizes producing better outcomes. For the digit dataset,
it obtains 92% precision at a batch size of 16 but drops to 86% at larger batch sizes.
RMSprop Optimizer:RMSprop gives excellent performance, particularly with batch
sizes of 32 and 64, where accuracy reaches 84% for letters and 96% to 94% for num-
bers. Precision, recall, and F1-scores are generally strong, with letter dataset metrics
ranging around 83% to 84% and digit dataset metrics highest on 94%, indicating accu-
rate and reliable classification results.
4.2.4.3 Visualization of VGG16 Optimizers Across Different Batch Sizes
The bar charts show the performance of the Bi-LSTM model for both the letter and digit
datasets, evaluated using multiple optimizers (Adam, Nadam, Adamax, RMSprop) and
varying batch sizes (16, 32, 64, 128). The x-axis shows the batch size, while the y-axis
demonstrates the accuracy percentage. For the Bi-LSTM (Letter) Model: The per-
formance of the Bi-LSTM model on the letter dataset shows consistent trends across
the four optimizers. The Adam and Nadam optimizers achieve relatively good accuracy
across all batch sizes, with little difference between them. Adamax also works well,
though it tends to significantly underperform compared to Adam and Nadam, especially
58
Figure 4.25: Evaluation of VGG16 Optimizers and Batch Sizes (Digit)
Figure 4.26: Evaluation of VGG16 Optimizers and Batch Sizes (Letter)
at batch sizes of 32 and 128. RMSprop performs satisfactorily for lower batch sizes but
demonstrates a more obvious drop at batch size 128. Overall, Adam and Nadam ap-
pear to be the most trustworthy for sustaining high accuracy on the letter dataset, while
RMSprop may require fine-tuning for bigger batch sizes. For the Bi-LSTM (Digit)
Model: The digit dataset performance follows the same pattern to the letter dataset.
Adam, Nadam, and RMSprop demonstrate consistent and strong accuracy at all batch
sizes, however RMSprop shows a small decrease in performance at bigger batch sizes.
Adamax falls behind the other optimizers in some instances, particularly for batch sizes
32 and 64. Nevertheless, the variations amongst the optimizers are small in terms of
total performance. Once again, Adam and Nadam stand out as the strongest perform-
ers, generating excellent accuracy consistently, whereas Adamax and RMSprop require
further changes, particularly for bigger batch sizes. These findings imply that Adam
and Nadam are generally the most effective optimizers for the Bi-LSTM model on both
the letter and digit datasets, but Adamax and RMSprop may require more attention to
maintain equivalent performance, particularly as batch size grows.
59
4.2.4.4 Training Curves for VGG16 Model with Different Optimizers and Batch
Sizes.
Figure 4.27: VGG16(Digit) Accuracy and Loss Curve For Adam optimizer with Batch Size 16
Figure 4.28: VGG16(Letter) Accuracy and Loss Curve For Adam optimizer with Batch Size 16
60
Figure 4.29: Classification report for Adam optimizer batch size 16(Digit)
61
Figure 4.30: Classification report for Adam optimizer batch size 16(Letter)
62
Figure 4.32: Confusion Matrix For Adam optimizer batch size 16 (Letter)
63
4.2.5 VGG-19
The VGG19 model was utilized for image classification on two datasets: a 36-class letter
dataset and a 10-class digit dataset. The images were preprocessed with background re-
moval, resized to 50x50 pixels, converted to RGB, and normalized. The VGG19 model
was initialized with pre-trained weights from ImageNet, excluding the fully connected
layers, and custom classification layers were added. For the letter dataset, Nadam was
used as the optimizer, while Adamax was employed for the digit dataset. The mod-
els were trained with batch sizes of 16 over 50 epochs, and early stopping with model
checkpointing was applied.
Training and validation accuracy and loss curves were plotted to assess performance.
On the test set, the models were evaluated using metrics such as loss, accuracy, preci-
sion, recall, and F1 score. Additionally, confusion matrices were generated to visualize
the classification performance.
4.2.5.1 VGG19 Performance Table

Adam 16 94.8 94.3 94.4 94
32 93.4 92.8 92.9 93
64 91.6 91.1 91.2 91
128 88.7 87.7 87.8 88
Nadam 16 92.8 91.9 92.2 92
32 93.2 92.5 92.7 93
64 91.7 91.1 91.3 91
128 87.7 85.8 86.3 86
Adamax 16 91.6 90.9 91.0 91
32 89.5 88.9 89.0 89
64 82.9 82.2 82.2 82
128 74.5 74.1 73.4 74
RMSprop 16 92.5 91.1 91.5 92
32 93.4 92.6 92.8 93
64 91.2 89.7 90.0 90
128 86.0 82.1 82.4 82
Table 4.9: VGG19 Model Results (Digit)
Performance table of Letter Dataset:
64
Adam 16 83.6 83.7 83.0 84
32 83.9 83.6 83.1 84
64 81.7 81.9 81.3 82
128 79.3 79.7 79.0 80
Nadam 16 83.7 83.8 82.9 83
32 83.5 83.6 83.2 83
64 81.6 81.8 81.1 82
128 79.0 79.5 78.8 80
Adamax 16 77.7 78.1 77.2 78
32 76.6 76.9 76.1 77
64 74.6 74.8 73.9 75
128 69.6 69.8 68.8 70
RMSprop 16 83.7 83.8 83.3 84
32 84.1 83.9 83.7 84
64 80.5 80.0 79.5 80
128 79.7 79.8 78.7 79
Table 4.10: VGG19 Model Results (letter)
65
4.2.5.2 Explanation of VGG 16 Performance Table
The VGG19 model was evaluated using multiple optimizers across different batch sizes.
Using the Adam optimizer, the model achieved an accuracy range of 94% to 88%, with
precision ranging from 94.8% to 88.7%, recall from 94.3% to 87.7%, and F1 score from
94.4% to 87.8% as the batch size increased from 16 to 128. The Nadam optimizer re-
sulted in comparable results with accuracy ranging from 92% to 86% with precision,
recall, and F1 score showing little variation among the batch sizes. The Adamax op-
timizer showed a more obvious decrease in performance, with accuracy falling from
91% at a batch size of 16 to 74% at a batch size of 128. Precision, recall, and F1
scores followed similar trends, showing the model’s decreased capacity to generalize
at increasing batch sizes. In contrast, RMSprop maintained very constant performance,
with accuracy ranging from 92% to 82%, demonstrating less variability compared to the
Adam and Nadam optimizers.
4.2.5.3 Visualization of VGG16 Optimizers Across Different Batch Sizes
Figure 4.33: Visualization of VGG19(Letter) Optimizers Across Different Batch Sizes
The performance of the CNN-based image classification models for the VGG19 (Letter)
and VGG19 (Digit) datasets is shown in the graphs. Several optimizers (Adam, Nadam,
Adamax, RMSprop) and varying batch sizes (16, 32, 64, 128) are used to assess the per-
formance. Batch size is shown on the x-axis, and each optimizer’s accuracy percentage
is shown on the y-axis. VGG19 (Letter) Model: This graph illustrates how batch sizes
and different optimizers impact the model’s accuracy. The Adam and Nadam optimiz-
ers show consistently strong and stable results across all batch sizes. When compared
to Adam and Nadam, the Adamax optimizer performs significantly worse, particularly
for bigger batch sizes. In general, RMSprop produces competitive results; however,
66
Figure 4.34: Visualization of VGG19(Letter) Optimizers Across Different Batch Sizes
when the batch size is increased to 128 it performs considerably worse. This indicates
that while Adamax and RMSprop could need careful batch size improving for best re-
sults, Adam and Nadam seem like good choices for this dataset.may require fine-tuning
for bigger batch sizes. VGG19 (Digit) Model: This graph analyzes whether the CNN
model performs with various optimizers and batch sizes on the digit dataset. Once more,
Adam and Nadam provide excellent, constant accuracy for all batch sizes, and RMSprop
also produces values that are competitive. Adamax performs slightly worse, especially
when batch sizes increase. Similar to the VGG16 (Letter) model, RMSprop’s perfor-
mance is still competitive but slightly decreases at greater batch sizes (128).
The most efficient optimizers in both models are Adam and Nadam, who consistently
produce excellent accuracy for all batch sizes. In order to achieve optimal performance,
Adamax and RMSprop may need further adjustment depending on the batch size, as
they show greater variability.
4.2.5.4 Training Curves for VGG19 Model with Different Optimizers and Batch
Sizes.
67
Figure 4.35: VGG19(Digit) Accuracy and Loss Curve For Adam optimizer with Batch Size 16
Figure 4.36: VGG19(Letter) Accuracy and Loss Curve For Adam optimizer with Batch Size 16
Figure 4.37: Classification report of VGG19(Digit) for Adam optimizer batch size 19
68
Figure 4.38: Classification report of VGG19(Letter) for Adam optimizer batch size 16
69
Figure 4.40: Confusion Matrix For Adam optimizer batch size 1 (Letter)
70
CHAPTER 5
CONCLUSION
This thesis addresses the difficulty of recovering and identifying characters from de-
graded printed Bangla documents for preserving historical and cultural legacies. By
applying standard image processing methods to modern deep learning models, consid-
erable advances were made in boosting document readability and accessibility. The
preprocessing step involves methods including skew correction and morphological pro-
cesses, crucial for successful segmentation and classification. Various models, includ-
ing CNNs, LSTMs, BiLSTMs, and numerous pre-trained architectures such as VGG16,
VGG19, ResNet50, Xception, and MobileNet, as well as hybrid models like VGG19-
BiLSTM, CNN-XGBoost, and semi-supervised VGG16-RF, were applied for character
classification. The research investigated five optimization algorithms (Adam, Nadam,
Adagrad, SGD, and RMSprop) and varying batch sizes, indicating Adam and Nadam as
the most successful owing to their adaptable learning rates. RMSprop also performed
well, especially for non-convex optimization problems, whereas SGD and AdaGrad
failed with fixed or falling learning rates.
5.1 Future Work:
While this thesis has made substantial progress in restoring and detecting damaged
Bangla papers, there is still need for more research. Future study might look at the
use of more complex deep learning algorithms, such as generative adversarial networks
(GANs), to improve restoration quality. Expanding the dataset to include compound
characters would also be useful because it would offer the models with a more com-
plete training set. Furthermore, we intend to create a more robust optical character
recognition (OCR) technology that is especially designed for the study of old, deterio-
rated documents. This upgraded OCR technology has the potential to greatly increase
text extraction accuracy, allowing for greater preservation and accessibility of historical
texts. By adding these sophisticated methodologies and growing the dataset, we may
continue to enhance our methods’ efficacy and generalizability, eventually assisting in
the preservation and study of historical documents.
71
REFERENCES
[Abedin et al., 2023] Abedin, T., Prottoy, K. S., Moshruba, A., and Hakim, S. B. (2023).
Bangla sign language recognition using a concatenated bdsl network. In Computer
Vision and Image Analysis for Industry 4.0, pages 76–86. Chapman and Hall/CRC.
[Basnin et al., 2021] Basnin, N., Nahar, L., and Hossain, M. S. (2021). An integrated
cnn-lstm model for bangla lexical sign language recognition. In Proceedings of Inter-
national Conference on Trends in Computational and Cognitive Engineering: Pro-
ceedings of TCCE 2020, pages 695–707. Springer.
[Das et al., 2023] Das, S., Imtiaz, M. S., Neom, N. H., Siddique, N., and Wang, H.
(2023). A hybrid approach for bangla sign language recognition using deep trans-
fer learning model with random forest classifier. Expert Systems with Applications,
213:118914.
[Hadiuzzaman et al., 2024] Hadiuzzaman, M., Ali, M. S., Sultana, T., Shafi, A. R.,
Miah, A. S. M., and Shin, J. (2024). Baust lipi: A bdsl dataset with deep learning
based bangla sign language recognition. arXiv preprint arXiv:2408.10518.
[Hasan et al., 2016] Hasan, M., Sajib, T. H., and Dey, M. (2016). A machine learning
based approach for the detection and recognition of bangla sign language. In 2016
international conference on medical engineering, health informatics and technology
(MediTec), pages 1–5. IEEE.
[Hasan et al., 2021] Hasan, S. N., Hasan, M. J., and Alam, K. S. (2021). Shongket: A
comprehensive and multipurpose dataset for bangla sign language detection. In 2021
International Conference on Electronics, Communications and Information Technol-
ogy (ICECIT), pages 1–4. IEEE.
[Islalm et al., 2019] Islalm, M. S., Rahman, M. M., Rahman, M. H., Arifuzzaman, M.,
Sassi, R., and Aktaruzzaman, M. (2019). Recognition bangla sign language using
convolutional neural network. In 2019 international conference on innovation and
intelligence for informatics, computing, and technologies (3ICT), pages 1–6. IEEE.
[Islam et al., 2018] Islam, M. S., Mousumi, S. S. S., Jessan, N. A., Rabby, A. S. A.,
and Hossain, S. A. (2018). Ishara-lipi: The first complete multipurposeopen access
dataset of isolated characters for bangla sign language. In 2018 International Con-
ference on Bangla Speech and Language Processing (ICBSLP), pages 1–4. IEEE.
[Khatun et al., 2021] Khatun, A., Shahriar, M. S., Hasan, M. H., Das, K., Ahmed, S.,
and Islam, M. S. (2021). A systematic review on the chronological development of
72
bangla sign language recognition systems. In 2021 Joint 10th International Con-
ference on Informatics, Electronics & Vision (ICIEV) and 2021 5th International
Conference on Imaging, Vision & Pattern Recognition (icIVPR), pages 1–9. IEEE.
[Podder et al., 2022] Podder, K. K., Chowdhury, M. E., Tahir, A. M., Mahbub, Z. B.,
Khandakar, A., Hossain, M. S., and Kadir, M. A. (2022). Bangla sign language
(bdsl) alphabets and numerals classification using a deep learning model. Sensors,
22(2):574.
[Rafi et al., 2019] Rafi, A. M., Nawal, N., Bayev, N. S. N., Nima, L., Shahnaz, C., and
Fattah, S. A. (2019). Image-based bengali sign language alphabet recognition for deaf
and dumb community. In 2019 IEEE global humanitarian technology conference
(GHTC), pages 1–7. IEEE.
[Shams et al., 2024] Shams, K. A., Reaz, M. R., Rafi, M. R. U., Islam, S., Rahman,
M. S., Rahman, R., Reza, M. T., Parvez, M. Z., Chakraborty, S., Pradhan, B., et al.
(2024). Multimodal ensemble approach leveraging spatial, skeletal, and edge features
for enhanced bangla sign language recognition. IEEE Access.
[Shurid et al., 2020] Shurid, S. A., Amin, K. H., Mirbahar, M. S., Karmaker, D.,
Mahtab, M. T., Khan, F. T., Alam, M. G. R., and Alam, M. A. (2020). Bangla
sign language recognition and sentence building using deep learning. In 2020 IEEE
Asia-Pacific Conference on Computer Science and Data Engineering (CSDE), pages
1–9. IEEE.
[Simonyan and Zisserman, 2014] Simonyan, K. and Zisserman, A. (2014). Very

deep convolutional networks for large-scale image recognition. arXiv preprint
arXiv:1409.1556.
[Sultan et al., 2022] Sultan, A., Makram, W., Kayed, M., and Ali, A. A. (2022). Sign
language identification and recognition: A comparative study. Open Computer Sci-
ence, 12(1):191–210.
[Sun et al., 2019] Sun, S., Cao, Z., Zhu, H., and Zhao, J. (2019). A survey of optimiza-
tion methods from a machine learning perspective. IEEE transactions on cybernetics,
50(8):3668–3681.
[Szegedy et al., 2016] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z.
(2016). Rethinking the inception architecture for computer vision. In Proceedings of
the IEEE conference on computer vision and pattern recognition, pages 2818–2826.
[Tasmere and Ahmed, 2020] Tasmere, D. and Ahmed, B. (2020). Hand gesture recog-
nition for bangla sign language using deep convolution neural network. In 2020 2nd
73
International Conference on Sustainable Technologies for Industry 4.0 (STI), pages
1–5. IEEE.
[Yasir et al., 2017] Yasir, F., Prasad, P., Alsadoon, A., Elchouemi, A., and Sreedharan,
S. (2017). Bangla sign language recognition using convolutional neural network. In
2017 international conference on intelligent computing, instrumentation and control
technologies (ICICICT), pages 49–53. IEEE.
74

Thesis

Uploaded by

Document Informationclick to expand document informationThesis

Document Informationclick to expand document information

Copyright:

Available Formats

Thesis

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Thesis

Uploaded by

Copyright:

Available Formats

Enhancing Bangla Sign Language Recognition Through Batch Size and Optimizer

Variations in Deep Learning

Submitted in partial fulfillment of the requirements

Department of Computer Science & Engineering

This thesis is a heartfelt tribute to my beloved parents and respected educators.

Signature of the candidate

We extend our sincere appreciation to a number of individuals whose steadfast support

Title Page No.

CHAPTER 2 LITERATURE REVIEW 8

CHAPTER 4 RESULTS AND PERFORMANCE ANALYSIS 32

4.1 Evaluation of CNN (Digit) Optimizers and Batch Sizes . . . . . . . . . . . 35

4.1 CNN Performance Table (Digit) . . . . . . . . . . . . . . . . . . . . . . . 33

1.2 Sign language:

Sign language is a visual-manual form of communication primarily used by the deaf

The motivations of this study are:

• Overcoming Communication Barriers: Millions of individuals in the deaf and

• Inclusion and Accessibility: By enhancing accessibility in public services, ed-

• Preserving Cultural and Linguistic Heritage: BdSL is an integral aspect of the

• Enhancing BdSL identification Accuracy: Complicated hand movements,

• Effect on Education and Learning: By developing resources for BdSL instruc-

1.4 Application of Bangla sign language identification and classification methods

• Communication Tools: Real-time sign language translation is made possible by

• Educational Platforms: : BdSL recognition systems may be used in hospitals,

• Public Services: BdSL recognition systems may be used in hospitals, govern-

1.5 Key Contribution

• In this study, a comprehensive preprocessing pipeline was designed to handle

• A meticulous endeavor was made to compile a comprehensive character dataset

• Extensive experimentation was conducted to evaluate the efficacy of various op-

A range of optimization methods, including Adam, Nadam, RMSprop, Adagrad, and

2.1 Related Work

• Shongket, a comprehensive dataset for Bangla sign language identification, was

• An integrated CNN-LSTM modelwas proposed by [Basnin et al., 2021] for rec-

• [Shams et al., 2024] proposed a multi-modal ensemble strategy for recognizing

• [Rafi et al., 2019] created a Convolutional Neural Network (CNN)-based system

3.1 Dataset preprocessing

Background removal is an important part of preprocessing since it removes any non-

(a) Before Background elimmination (b) After Background elimmination

Figure 3.2: Sample image of before and after background elimination.

3.1.2 Gray Scale conversion

After grayscale conversion, Otsu’s binarization technique is used to further enhance

Figure 3.3: Steps involves in preprocessing

Figure 3.4: Visual overview of dataset for digit

An optimizer is an algorithm or method that adjusts a model’s parameters, such as

3.3.1 Adam (Adaptive Moment Estimation)

mt = β1 · mt−1 + (1 − β1 ) · ∇J(θt ) (3.3.1)

• θ represents the parameters.

• α is the learning rate.

• mt and vt are the first and second moment estimates respectively.

• β1 and β2 are decay rates for the moment estimates.

• t represents the time step.

• ε is a small constant to avoid division by zero.

3.3.2 Nadam (Nesterov-accelerated Adaptive Moment Estimation)

mt = β1 · mt−1 + (1 − β1 ) · ∇J(θt ) (3.3.6)

3.3.3 RMSProp (Root Mean Square Propagation):

Gt = γ · Gt−1 + (1 − γ ) · (∇J(θt ))2 (3.3.11)

where γ is a decay rate.

mt = β1 · mt−1 + (1 − β1 ) · ∇J(θt ) (3.3.13)