Project Report
on
Title
Submitted in partial fulfillment for completion of
AI-training
SUBMITTED TO
7/01/2025-9/03/2025
Submitted By:
Name
ID
College Name & Address
Affiliated to Dr. A.P.J. Abdul Kalam Technical University, Lucknow
TABLE OF CONTENTS
1. GOAL
2. INTRODUCTION
2.1 PROBLEM FORMULATION
2.2 LIBRARY USED
2.3 TASK
3. LITERATURE SURVEY
3.1 DEEP LEARNING-BASED MODELS
3.2 IOT-BASED MODELS USING ML
3.3 DNN-BASED MODELS
3.4 SELF-SUPERVISED LEARNING-BASED MODELS
4. DATA COLLECTION AND PREPROCESSING
4.1 DATA COLLECTION
4.2 DATA PREPROCESSING
4.3 DATA VISUALIZATION
5. EDA (EXPLORATORY DATA ANALYSIS)
5.1 INTRODUCTION TO EDA
5.2 EMOTION DATASET ANALYSIS
5.3 CRY DATASET ANALYSIS
5.4 FEATURE ANALYSIS
5.5 CORRELATION AND PATTERNS
5.6 CHALLENGES IDENTIFIED IN EDA
5.7 CONCLUSION OF EDA
6. MODEL SELECTION AND TRAINING
6.1 MODEL SELECTION FOR EMOTION RECOGNITION
6.2 MODEL SELECTION FOR CRY CLASSIFICATION
6.3 TRAINING THE MODELS
6.4 CONCLUSION
7. MODEL EVALUATION
7.1 EVALUATION METRICS FOR EMOTION RECOGNITION
7.2 EVALUATION METRICS FOR CRY CLASSIFICATION
7.3 CROSS-VALIDATION & TESTING
7.4 CONCLUSION
8. CONCLUSION
8.1 KEY ACHIEVEMENTS
8.2 IMPLICATIONS AND APPLICATIONS
8.3 FUTURE DIRECTIONS
8.4 FINAL THOUGHTS
LIST OF FIGURES
CHAPTER NO.  TITLE
2            Figure 2.1
4            Figure 4.1 (Datasets)
6            Figure 6.1 (SVC)
6            Figure 6.2 (DeepFace Working)
7            Figure 7.1 (Confusion Matrix)
7            Figure 7.2 (Accuracy of Model)
9            Figure 9.1 (Prediction)
CHAPTER 1
ABSTRACT
This project applies machine learning to two tasks:
• Real-time emotion recognition from facial expressions in images and videos, using the DeepFace library for deep learning-based emotion analysis.
• Infant cry classification from audio data, using a Support Vector Classifier (SVC) to identify cry types such as "hungry" and "discomfort."
• The system is implemented with a Gradio interface for easy, real-time interaction with video and audio inputs.
• Applications include mental health monitoring and smart parenting tools.
• Results show promising initial performance, with further improvements needed in dataset size and model accuracy.
CHAPTER 2
INTRODUCTION
The integration of machine learning into daily life has opened doors to innovative applications in
healthcare, human-computer interaction, and monitoring systems. This project, titled Emotion
and Cry Classification, focuses on the development of a system that combines two key tasks:
real-time emotion recognition and infant cry classification.
Emotion recognition plays an essential role in understanding human behavior and emotional
states, with potential uses in mental health monitoring, customer service, and user experience
enhancement. The system developed in this project leverages deep learning techniques to
analyze facial expressions from both images and videos, providing accurate emotion detection in
real-time. This is achieved using the DeepFace library, which is renowned for its high
performance in facial analysis tasks.
In parallel, infant cry classification is a critical task that can assist caregivers in understanding an
infant's needs when they are unable to communicate verbally. The cry classification module
developed in this project uses a Support Vector Classifier (SVC) model to distinguish between
different types of cries, such as those indicating hunger or discomfort, based on audio recordings.
Accurate classification of infant cries can be particularly beneficial in smart parenting tools and
baby monitoring systems, helping caregivers respond to an infant’s needs more effectively.
The project is implemented with a user-friendly Gradio interface, which allows users to interact
with the system seamlessly, either by uploading images/videos for emotion detection or audio
files for cry classification. Real-time webcam integration further enhances the practical
application of emotion recognition.
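To make this concrete, the following is a minimal sketch of how such a Gradio front end could be wired up in Python. It assumes the DeepFace, librosa, scikit-learn, and Gradio packages are installed; the model file svc_cry.joblib and the 40-coefficient MFCC setting are illustrative placeholders rather than the project's actual artifacts.

import gradio as gr
import joblib
import librosa
from deepface import DeepFace

def detect_emotion(image):
    # enforce_detection=False avoids an exception on frames with no clear face
    result = DeepFace.analyze(image, actions=["emotion"], enforce_detection=False)
    return result[0]["dominant_emotion"]

def extract_mfcc(audio_path, n_mfcc=40):
    # Average the MFCCs over time to obtain a fixed-length feature vector
    signal, sr = librosa.load(audio_path, sr=None)
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc).mean(axis=1)

cry_model = joblib.load("svc_cry.joblib")  # hypothetical trained SVC

def classify_cry(audio_path):
    return str(cry_model.predict([extract_mfcc(audio_path)])[0])

demo = gr.TabbedInterface(
    [
        gr.Interface(detect_emotion, gr.Image(), gr.Label()),
        gr.Interface(classify_cry, gr.Audio(type="filepath"), gr.Label()),
    ],
    tab_names=["Emotion Recognition", "Cry Classification"],
)

if __name__ == "__main__":
    demo.launch()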
This system addresses the growing need for automated tools that can provide insights into human
emotional states and the well-being of infants, offering a step forward in healthcare and human-
computer interaction technologies.
Figure 2.1
CHAPTER 3
LITERATURE SURVEY
3.1 Deep Learning-Based Models
Tusty Nadia et al. [1] proposed a model that combines CNN and RNN, evaluated with 5-fold cross-validation on 200 training and 50 validation samples. The model achieved a success rate of 95%, but the dataset was small. Yun-Chia et al. [2] applied the deep learning algorithms CNN, LSTM, and ANN, trained on 1,607 infant cry samples represented by MFCC features. This approach also achieved a 95% success rate, but deploying the model in real time poses technical challenges. Omnia et al. [3] proposed a two-stage system of cry detection followed by classification, in which MFCC features are extracted from the samples and passed to a 2D CNN. The model reaches 99.59% accuracy, but the dataset may contain noise signals that make classification difficult. Bing Lv et al. [4] proposed an infant cry recognition system based on a multiscale CNN combined with a Bi-LSTM, trained on 5,242 infant cry samples; it achieved an emotion recognition rate of only 83%. Chuan-Yu et al. [5] used a CNN and a 1D CNN to classify infant crying; the method achieves a high crying detection rate but does not handle variability in infants' cries.
3.2 IoT-Based Models Using ML
Erwin Sutanto et al. [1] proposed analyzing the sound power of WAV files before converting them into 2D patterns, which helps determine the baby's actual condition with 85% accuracy.
3.3 DNN-Based Models
Daniele Ferretti et al. [1] propose an algorithm for detecting infant cries that uses an OMLSA post-filter to reduce the effect of interference, with voice activity detection based on the long-term spectral divergence (LTSD) over a dataset of infant cries. Boon, Fei, et al. [2] note that infant cry recognition is difficult; using MFCC features, they stack Restricted Boltzmann Machines with a DNN to form a Deep Belief Network (DBN), reporting 78.6% accuracy across five cry types. K. S. Alishamol et al. [3] propose a system that identifies an infant's emotions by using MFCC features and a DNN to classify the infant's cries into different categories, with the result sent to the parents' mobile phone via a GSM module.
3.4 Self-Supervised Learning-Based Models
Cem Subakan et al. [1] used self-supervised learning (SSL) to analyze a large-scale database of cry recordings from more than a thousand newborns, showing that SSL pre-training with a contrastive loss (SimCLR) performed significantly better than supervised pre-training for both neurological-injury detection and cry triggers. Arsenii Gorin et al. [2] showed that SSL improves the performance of CNNs for analyzing infant cries, reducing the need for labeled data and improving clinical solutions. The performance gains transfer across SSL domains, and using SSL-based pre-training for adaptation to cry sounds further decreases the overall system's need for labeled data.
CHAPTER 4
DATA COLLECTION AND PREPROCESSING
The emotion recognition module uses the FER2013 facial-expression dataset, a public collection of labeled face images spanning seven emotion categories. The cry classification module uses a set of labeled infant cry recordings covering categories such as "hungry" and "discomfort."
Data preprocessing is a crucial step to prepare the dataset for the machine learning
model. This step ensures that the data is clean, consistent, and ready for analysis.
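As an illustration, the sketch below shows typical preprocessing steps in Python, assuming FER2013-style 48x48 grayscale face images and WAV cry recordings; the target size, sampling rate, and normalization choices are assumptions for the example, not the project's confirmed configuration.

import numpy as np
import cv2
import librosa

def preprocess_image(path, size=(48, 48)):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)   # FER2013 images are grayscale
    img = cv2.resize(img, size)                    # enforce a consistent size
    return img.astype("float32") / 255.0           # scale pixel values to [0, 1]

def preprocess_audio(path, sr=16000):
    signal, _ = librosa.load(path, sr=sr)          # resample to a common rate
    signal, _ = librosa.effects.trim(signal)       # strip leading/trailing silence
    return signal / (np.max(np.abs(signal)) + 1e-9)  # peak-normalize the amplitude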
Figure 4.1 (Datasets)
CHAPTER 5
EDA (EXPLORATORY DATA ANALYSIS)
5.1. Introduction to EDA
• Definition and importance of EDA in the context of emotion and cry classification.
• Objectives of the EDA for this project.
5.2. Emotion Dataset Analysis
• 2.1 Data Overview
o Total number of images in the FER2013 dataset.
o Distribution of images across different emotion categories.
• 2.2 Visualizations
o Histograms showing the frequency of each emotion.
o Sample images representing each emotion category.
• 2.3 Image Quality Assessment
o Checking for duplicates or corrupted images.
o Analysis of image dimensions and color channels.
• 2.4 Insights on Demographics
o Examination of the diversity in age, gender, and ethnicity represented in the dataset.
5.3. Cry Dataset Analysis
• 3.1 Data Overview
o Total number of audio samples collected.
o Distribution of audio samples across different cry categories (e.g., "hungry,"
"discomfort").
• 3.2 Visualizations
o Pie charts showing the percentage of each cry type in the dataset.
o Waveform plots and spectrograms for sample audio files (see the plotting sketch after this section).
• 3.3 Audio Quality Assessment
o Analysis of background noise levels in recordings.
o Checking for consistency in recording conditions (e.g., volume levels).
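The plotting sketch below shows how the visualizations listed above could be produced with librosa and matplotlib; the files_by_label mapping (cry label to list of audio paths) is a hypothetical structure assumed for the example.

import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

def plot_cry_eda(files_by_label, sample_path):
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))

    # Class balance: share of each cry type in the dataset
    counts = {label: len(paths) for label, paths in files_by_label.items()}
    axes[0].pie(list(counts.values()), labels=list(counts.keys()), autopct="%1.1f%%")
    axes[0].set_title("Cry type distribution")

    # Waveform of one sample recording
    y, sr = librosa.load(sample_path, sr=None)
    librosa.display.waveshow(y, sr=sr, ax=axes[1])
    axes[1].set_title("Waveform")

    # Log-frequency spectrogram of the same recording
    S = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
    librosa.display.specshow(S, sr=sr, x_axis="time", y_axis="log", ax=axes[2])
    axes[2].set_title("Spectrogram (dB)")

    plt.tight_layout()
    plt.show()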
5.4. Feature Analysis
• 4.1 Emotion Recognition Features
o Overview of features extracted using the DeepFace library (e.g., facial landmarks,
emotion scores).
• 4.2 Cry Classification Features
o Description of MFCC features extracted from audio samples (a minimal extraction sketch follows this section).
o Comparison of different feature extraction techniques (if applicable).
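A minimal extraction sketch for these MFCC features is shown below; the 40-coefficient setting and the mean/standard-deviation summary over time frames are assumptions for illustration, not the project's confirmed configuration.

import numpy as np
import librosa

def mfcc_features(path, n_mfcc=40):
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape: (n_mfcc, frames)
    # Summarize the variable-length frame sequence as one fixed-size vector
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])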
5.5. Correlation and Patterns
• 5.1 Emotion Correlation
o Analysis of potential correlations between different emotions (e.g., confusion between happy and surprise).
• 5.2 Cry Classification Patterns
o Examination of common characteristics in cries (e.g., pitch, duration) that may
correlate with specific emotions or needs.
5.6. Challenges Identified in EDA
• Discussion of any limitations in the datasets (e.g., class imbalance, noise in audio).
• Observations that may require data augmentation or additional data collection.
5.7. Conclusion of EDA
• Summary of key findings from the EDA.
• Implications of these findings for the subsequent modeling and evaluation phases of the
project.
CHAPTER 6
MODEL SELECTION AND TRAINING
The success of the Emotion and Cry Classification project heavily relies on the selection and
training of appropriate machine learning models for the two core tasks: emotion recognition and
cry classification. Each task necessitated a tailored approach to ensure optimal performance and
accuracy.
6.1. Model Selection for Emotion Recognition
For the emotion recognition component, the project utilized the DeepFace library, which is
specifically designed for facial analysis and emotion detection. DeepFace provides several pre-
trained deep learning models, such as VGG-Face, Google FaceNet, OpenFace, and Facebook
DeepFace, which have demonstrated high accuracy in various facial recognition tasks. Given the
complexity of human emotions and the nuances in facial expressions, the pre-trained models
within the DeepFace library were advantageous, as they were trained on large and diverse
datasets, allowing for better feature extraction and generalization.
The selection of DeepFace was influenced by several factors:
• Performance: DeepFace has been benchmarked against various datasets, showing state-
of-the-art performance in emotion recognition tasks.
• Ease of Use: The library simplifies the implementation of complex models, enabling
rapid development and iteration.
• Flexibility: DeepFace supports real-time analysis, which aligns with the project's objective of providing immediate feedback during emotion detection (a minimal usage snippet follows this list).
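As a usage illustration, the snippet below shows a minimal DeepFace emotion query; the image path is a placeholder, and recent DeepFace versions return a list containing one result dictionary per detected face.

from deepface import DeepFace

result = DeepFace.analyze(
    img_path="face.jpg",       # placeholder input image
    actions=["emotion"],       # restrict the analysis to emotion scores
    enforce_detection=False,   # do not raise an error if no face is found
)
print(result[0]["dominant_emotion"])  # e.g., "happy"
print(result[0]["emotion"])           # per-emotion confidence scores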
6.2. Model Selection for Cry Classification
For the cry classification module, a Support Vector Classifier (SVC) was chosen due to its
effectiveness in handling high-dimensional feature spaces. The cry classification process began
with the extraction of Mel-frequency cepstral coefficients (MFCC) from audio recordings,
which provided a robust representation of the audio signals. MFCC features capture the timbral
characteristics of the cries, making them suitable for classification tasks.
The SVC model was selected for several reasons:
• Robustness to Overfitting: SVC is particularly effective in high-dimensional spaces and
is less prone to overfitting, especially when using a limited dataset.
• Binary and Multiclass Classification: SVC can handle both binary and multiclass
classification problems, which is essential for distinguishing between various cry types.
• Versatility: The model's hyperparameters can be tuned to optimize performance,
allowing for better accuracy in cry classification.
6.3. Training the Models
Once the models were selected, the training process commenced. For the emotion recognition
module, the training involved fine-tuning the DeepFace models on the FER2013 dataset. The
training process included the following steps:
• Data Augmentation: Techniques such as random rotations, flips, and brightness adjustments were employed to enhance the dataset and improve model generalization (see the sketch after this list).
• Training Procedure: The pre-trained model was fine-tuned using a smaller learning rate
to adapt to the specific characteristics of the FER2013 dataset. The training was
monitored using validation metrics to prevent overfitting.
• Evaluation: Performance was evaluated using metrics such as accuracy, precision, recall,
and F1-score on a separate test set.
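Because DeepFace is primarily an inference library, the sketch below shows the general Keras recipe for the augmentation and fine-tuning steps described above; the ResNet50 backbone, the 1e-4 learning rate, and the three-channel 48x48 input are illustrative assumptions rather than the project's exact setup.

import tensorflow as tf

# Random rotations, flips, and brightness adjustments, as described above
augment = tf.keras.Sequential([
    tf.keras.layers.RandomRotation(0.1),
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomBrightness(0.2),
])

# Fine-tune a pre-trained backbone with a small learning rate (assumed backbone)
base = tf.keras.applications.ResNet50(include_top=False, pooling="avg",
                                      input_shape=(48, 48, 3))
model = tf.keras.Sequential([
    augment,
    base,
    tf.keras.layers.Dense(7, activation="softmax"),  # 7 FER2013 emotion classes
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])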
For the cry classification model, the training process involved:
• Feature Extraction: MFCC features were computed from the audio recordings,
converting the raw audio signals into a feature set suitable for training.
• Model Training: The SVC model was trained on the extracted features, and hyperparameter tuning was performed using techniques like grid search or random search to identify the best-performing parameters (see the sketch after this list).
• Cross-Validation: Cross-validation was employed to assess the model’s performance on
different subsets of the data, ensuring the robustness of the classification results.
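A minimal scikit-learn sketch of this training and search procedure follows; the feature matrix and labels are synthetic placeholders standing in for the extracted MFCC vectors, and the parameter grid is an assumed starting point.

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(60, 80))                      # placeholder MFCC vectors
y_train = rng.choice(["hungry", "discomfort"], size=60)  # placeholder cry labels

# Scale features, then search SVC hyperparameters with 5-fold cross-validation
pipe = make_pipeline(StandardScaler(), SVC())
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": ["scale", 0.01, 0.001],
    "svc__kernel": ["rbf", "linear"],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)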
6.4. Conclusion
In conclusion, the model selection and training phases for the Emotion and Cry Classification
project were critical in establishing a strong foundation for accurate predictions. The integration
of the DeepFace library for emotion recognition and the SVC model for cry classification
enabled the project to effectively address its objectives. Continuous evaluation and iteration
during the training process ensured that the models were optimized for performance, setting the
stage for successful implementation in real-time applications.
Figure 6.1 (SVC)
Figure 6.2 (DeepFace Working)
CHAPTER 7
MODEL EVALUATION
Model evaluation is a critical component of the machine learning workflow, as it assesses the
performance of the trained models and determines their effectiveness in making accurate
predictions. For the Emotion and Cry Classification project, the evaluation process involved
rigorous testing of both the emotion recognition model based on the DeepFace library and the cry
classification model using a Support Vector Classifier (SVC). The evaluation was conducted
using various metrics to ensure that the models meet the desired accuracy and reliability standards.
7.1. Evaluation Metrics for Emotion Recognition
The primary evaluation metrics used for the emotion recognition model included accuracy,
precision, recall, F1-score, and confusion matrix analysis.
• Accuracy: This metric indicates the overall proportion of correct predictions made by the
model. It is calculated as the ratio of correctly predicted instances to the total instances in
the test dataset. Accuracy is a fundamental metric that provides an initial assessment of the
model's performance.
• Precision: Precision measures the proportion of true positive predictions relative to the total
predicted positives. It is especially useful in scenarios where false positives carry
significant consequences, such as misinterpreting a neutral expression as happy.
• Recall: Recall, also known as sensitivity, measures the proportion of true positive
predictions relative to the actual positives in the dataset. This metric is crucial when the
goal is to minimize false negatives, as in the case of missing an emotional response.
• F1-Score: The F1-score is the harmonic mean of precision and recall, providing a balanced
measure that considers both false positives and false negatives. It is particularly useful
when dealing with imbalanced classes, which is common in emotion datasets.
• Confusion Matrix: A confusion matrix provides a detailed breakdown of the model's performance across different emotion categories. It displays true positive, false positive, true negative, and false negative counts, allowing for a deeper understanding of where the model performs well and where it struggles. This analysis helps identify specific emotions that may be confused with one another, guiding future improvements (a short computation example follows this list).
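The snippet below sketches how these metrics can be computed with scikit-learn; the label lists are placeholders standing in for the test-set ground truth and the model's predictions.

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

y_true = ["happy", "sad", "neutral", "happy", "surprise", "sad"]    # placeholder
y_pred = ["happy", "sad", "happy", "happy", "surprise", "neutral"]  # placeholder

print("Accuracy:", accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred))  # per-class precision, recall, F1
print(confusion_matrix(y_true, y_pred))       # rows: true label, cols: predicted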
7.2. Evaluation Metrics for Cry Classification
For the cry classification model, similar metrics were employed, but with a focus on aspects
relevant to audio classification.
• Accuracy: The overall accuracy of the cry classification model was calculated to assess the
proportion of correctly classified cries in relation to the total number of cries in the test
dataset.
• Precision and Recall: These metrics were computed for each cry category (e.g., "hungry,"
"discomfort") to understand the model's ability to correctly identify each type of cry. High
precision indicates that most of the cries classified into a category are indeed of that
category, while high recall indicates that most of the actual cries of that category were
correctly identified.
• F1-Score: The F1-score for each cry category was calculated to provide a balanced view
of the model's performance, especially in the presence of class imbalance.
• Confusion Matrix: The confusion matrix for the cry classification model was analyzed to
identify specific cries that were often misclassified. This matrix facilitated a detailed
analysis of model performance and highlighted areas for improvement, such as adjusting
the feature extraction process or augmenting the dataset.
7.3. Cross-Validation and Testing
To ensure the robustness of the evaluation results, cross-validation techniques were applied during
model training. The dataset was split into multiple subsets, with each subset serving as a test set
while the others were used for training. This process provided a more reliable estimate of the
model’s performance and helped mitigate the risk of overfitting.
Additionally, a separate holdout test set, which was not used during the training phase, was
employed to evaluate the final performance of both models. This test set included a balanced
representation of all emotion and cry categories to ensure comprehensive assessment.
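A minimal sketch of this stratified cross-validation and holdout procedure is shown below; the data arrays are synthetic placeholders standing in for the real feature vectors and labels.

import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 80))                       # placeholder feature vectors
y = rng.choice(["hungry", "discomfort"], size=100)   # placeholder labels

# Hold out a stratified test set that is never used during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# 5-fold stratified cross-validation on the training portion
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(SVC(), X_train, y_train, cv=cv)
print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

# Final check on the untouched holdout set
model = SVC().fit(X_train, y_train)
print("Holdout accuracy:", model.score(X_test, y_test))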
Figure 7.1 (Confusion Matrix)
7.4. Conclusion
In conclusion, the model evaluation process for the Emotion and Cry Classification project was
vital for determining the effectiveness and accuracy of the developed models. By employing a
variety of evaluation metrics such as accuracy, precision, recall, F1-score, and confusion matrices,
the evaluation provided insights into the strengths and weaknesses of each model. This thorough
assessment ensures that the models can reliably perform their intended functions in real-world
applications, paving the way for further enhancements and refinements as needed.
Figure 7.2 (Accuracy of Model)
CHAPTER 8
CONCLUSION
The Emotion and Cry Classification project has successfully developed a comprehensive
system that leverages advanced machine-learning techniques to analyze and interpret human
emotions and infant cries. Through the integration of the DeepFace library for emotion
recognition and a Support Vector Classifier (SVC) for cry classification, the project has
demonstrated its capability to accurately identify emotional states and assess the needs of infants
based on their vocalizations.
8.1. Key Achievements
Throughout the project, several significant achievements were accomplished:
• Effective Emotion Recognition: By employing pre-trained models within the DeepFace
library, the system has shown promising accuracy in classifying a range of emotions from
facial expressions. The use of extensive visualizations, including confusion matrices and
precision-recall curves, provided clear insights into the model's performance and areas for
improvement.
• Accurate Cry Classification: The implementation of the SVC model for cry
classification has enabled the identification of different types of infant cries, such as
"hungry" and "discomfort." By extracting relevant features from audio recordings, the
model has been able to distinguish between cries effectively, providing valuable support
for caregivers.
• Exploratory Data Analysis: Through comprehensive exploratory data analysis, the
project has highlighted the importance of understanding dataset distributions, identifying
potential biases, and ensuring robust model training and evaluation.
8.2. Implications and Applications
The implications of this project extend beyond academic research, offering practical applications
in various fields:
• Healthcare and Caregiving: The emotion and cry classification system can serve as a
valuable tool for healthcare professionals and caregivers, assisting in the monitoring of
emotional well-being and responding to infants' needs promptly.
• Early Intervention: By providing insights into emotional development and identifying
specific needs, the system has the potential to contribute to early intervention strategies
for children with developmental concerns.
8.3. Future Directions
As outlined in the future scope section, there are numerous avenues for further exploration and
enhancement of the project. Continued efforts to expand the dataset, optimize model
performance, and integrate contextual information will strengthen the system's applicability and
effectiveness. Additionally, addressing ethical considerations surrounding privacy and data
protection will be paramount as the technology evolves.
8.4. Final Thoughts
In conclusion, the Emotion and Cry Classification project represents a significant step toward
utilizing technology to understand human emotions better and address the needs of infants. The
successful implementation of machine learning models in this context highlights the potential for
innovation in emotional recognition and support systems. As the project progresses, ongoing
research and development will be essential to refine the models and expand their impact,
ultimately enhancing the quality of care and emotional well-being for individuals and families.
REFERENCES