Final HLD Document
For
Customer Review Prediction
Table of Contents
1. Introduction
2. General Description
Product perspective
Problem statement
Proposed solutions
Future Improvements
Tools used
3. Architecture Overview
System Components
Data Flow Diagram
4. Data Processing Techniques
Categorical Features Handling
Text Data Processing
Numerical Features Scaling
5. Model Training
Algorithm Selection
Hyperparameter Tuning
6. Evaluation Metrics
Performance Metrics
Confusion Matrix Analysis
7. Integration and Deployment
Model Deployment
Integration with Existing Systems
8. Security Considerations
Data Privacy
Model Security
9. Scalability and Performance
Handling Large Datasets
Performance Optimization
10. Maintenance and Monitoring
Model Maintenance
Monitoring and Alerts
11. Conclusion
Abstract
1.2. Scope
The scope of this document encompasses the entire lifecycle of the customer
satisfaction prediction system, from data preprocessing to model deployment
and monitoring. It includes detailed discussions on data processing
techniques, model training methodologies, evaluation metrics, integration
and deployment strategies, security considerations, scalability and
performance considerations, maintenance, and monitoring procedures.
Additionally, it outlines the roles and responsibilities of key contributors and
provides guidelines for effective project management.
1.3. Definitions
Customer Satisfaction Prediction System: A machine learning-based
system for predicting customer satisfaction levels.
High-Level Design (HLD) Document: Provides an overview of system
architecture and components.
Data Processing Techniques: Methods for preprocessing and
transforming raw data.
Model Training Approach: Methodology for training machine learning
models.
Evaluation Metrics: Metrics for assessing model performance.
Integration and Deployment: Process of integrating and deploying the
model.
Security Considerations: Measures to ensure data and model security.
Scalability and Performance: System's ability to handle increasing data
and traffic.
Maintenance and Monitoring: Activities for maintaining and monitoring
the system.
2. General Description
2.1 Product Perspective:
This project aims to assist companies in evaluating the satisfaction levels of
their customers, thereby providing valuable insights for optimizing
advertising strategies and other critical functions. By leveraging advanced
machine learning techniques, our solution enables companies to gauge
customer satisfaction based on purchase information. This understanding
helps businesses refine their marketing efforts, enhance customer
experiences, and make data-driven decisions that support sustainable
growth.
Data Validation:
After receiving the data files from the e-commerce website, the Data
Validation component performs thorough cleaning and validation on the
seven input datasets. It ensures data integrity and consistency by
identifying and correcting errors, inconsistencies, or missing values.
Additionally, the Data Validation component establishes connections
between each set of data to ensure coherence and completeness.
Once the data has been validated, a final dataset is prepared, containing
only the required information for further analysis. This dataset is structured
with various columns and records, facilitating exploratory data analysis (EDA)
steps and model training. Finally, a new file is created, consolidating the
cleaned and validated data for seamless integration into subsequent stages
of the project.
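The validation pass described above can be sketched in a few lines. This is a minimal illustration, not the actual pipeline: the field names (order_id, customer_id, review_score) are assumptions standing in for the real schema of the seven source files.

```python
# Illustrative validation pass; field names are hypothetical placeholders
# for the actual e-commerce schema.
REQUIRED_FIELDS = {"order_id", "customer_id", "review_score"}

def validate_records(records):
    """Keep only records with every required field present and non-null,
    and drop duplicate order_ids (first occurrence wins)."""
    seen = set()
    clean = []
    for rec in records:
        if any(rec.get(f) is None for f in REQUIRED_FIELDS):
            continue  # missing value -> reject
        if rec["order_id"] in seen:
            continue  # duplicate -> reject
        seen.add(rec["order_id"])
        clean.append(rec)
    return clean

raw = [
    {"order_id": 1, "customer_id": "a", "review_score": 5},
    {"order_id": 1, "customer_id": "a", "review_score": 5},    # duplicate
    {"order_id": 2, "customer_id": "b", "review_score": None}, # missing value
    {"order_id": 3, "customer_id": "c", "review_score": 4},
]
cleaned = validate_records(raw)
print(len(cleaned))  # → 2
```

A production pipeline would typically express these same checks with pandas (drop_duplicates, dropna) and join the seven datasets on their shared keys before writing the consolidated file.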
Data Transformation:
The data transformation process addresses various challenges in handling
categorical features, text data, and numerical features to prepare the data
for model training.
Categorical Features:
Ordinal or one-hot encoding is applied to categorical features with few
categories. Response coding is used for features with many categories,
mitigating the sparsity that one-hot encoding would otherwise introduce.
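Response coding replaces each category with the vector of class probabilities observed for that category in the training data. A minimal sketch, assuming binary labels and Laplace smoothing (the smoothing constant alpha is an assumption, not taken from the actual experiments):

```python
from collections import defaultdict

def response_coding(categories, labels, num_classes=2, alpha=1.0):
    """Encode each category as its Laplace-smoothed class-probability
    vector. alpha is an illustrative smoothing constant."""
    counts = defaultdict(lambda: [0] * num_classes)
    for cat, y in zip(categories, labels):
        counts[cat][y] += 1
    encoding = {}
    for cat, cls_counts in counts.items():
        total = sum(cls_counts) + alpha * num_classes
        encoding[cat] = [(c + alpha) / total for c in cls_counts]
    return encoding

# Hypothetical payment-method feature with binary satisfaction labels.
cats = ["credit_card", "credit_card", "voucher", "credit_card"]
ys   = [1, 1, 0, 0]
enc = response_coding(cats, ys)
# credit_card: class counts [1, 2] -> [(1+1)/5, (2+1)/5] = [0.4, 0.6]
```

Because each category maps to a fixed-length probability vector, the encoded width stays constant no matter how many distinct categories exist, which is exactly the sparsity advantage over one-hot encoding.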
Handling Text Data:
The "sentence-transformers/paraphrase-multilingual-mpnet-base-v2" model
is utilized to convert comments and messages into dense vector
representations, enabling tasks like clustering and semantic search.
Text data is preprocessed by removing stop words and applying regex
cleaning before vectorization.
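The preprocessing step before vectorization can be sketched as follows; the stop-word list here is a small illustrative sample, not the full list used in the project:

```python
import re

# Small illustrative stop-word list; the real pipeline would use a fuller one.
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "it"}

def clean_text(text):
    """Lowercase, strip non-alphanumeric characters via regex,
    and drop stop words before vectorization."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # regex cleaning
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)

cleaned = clean_text("The delivery was fast, and the product is great!!!")
print(cleaned)  # → "delivery fast product great"

# The cleaned text would then be embedded with the model named above, e.g.:
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer(
#     "sentence-transformers/paraphrase-multilingual-mpnet-base-v2")
# vectors = model.encode([cleaned])  # dense vectors for clustering / search
```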
Scaling Numerical Features:
Numerical features are standardized to ensure they are on the same scale,
enhancing model performance. Standardization is preferred over
normalization due to the presence of outliers in the dataset.
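Standardization rescales each feature to zero mean and unit variance (a z-score). A minimal sketch with sample values (the feature name and numbers are illustrative):

```python
from statistics import mean, pstdev

def standardize(values):
    """Z-score standardization: subtract the mean and divide by the
    population standard deviation, putting features on a common scale."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

prices = [10.0, 20.0, 30.0]   # hypothetical numerical feature
z = standardize(prices)       # mean becomes 0, std becomes 1
```

Unlike min-max normalization, whose output range is set entirely by the minimum and maximum values, standardization does not compress the bulk of the data into a narrow band when a few extreme outliers are present, which is why it was preferred here.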
Concatenation and Final Check:
All necessary features are concatenated to create the final dataframe.
The shape of the final dataframe is verified to ensure all features are
included.
These feature engineering techniques optimize the data for model training,
contributing to improved model performance and accuracy.
Model Trainer:
The XGBoost classifier from the XGBoost library was selected to train the model on the
transformed data. This choice was made after thorough consideration, including testing various
algorithms and hyperparameter tuning. XGBoost is known for its efficiency and effectiveness in
handling structured data and has been widely adopted in machine learning competitions and
real-world applications.
Model Evaluation:
The trained model is evaluated using multiple metrics, including:
ROC AUC Score: This metric measures the area under the receiver operating
characteristic (ROC) curve, which plots the true positive rate against the
false positive rate. It provides an overall assessment of the model's ability to
distinguish between classes.
Accuracy Score: This metric calculates the proportion of correctly classified
instances out of the total number of instances. It gives an indication of the
overall correctness of the model's predictions.
F1 Score: This metric is the harmonic mean of precision and recall, providing
a balance between the two. It is particularly useful in imbalanced datasets
where one class is much more prevalent than the other.
These metrics are tracked using MLflow tracking, integrated with Dagshub, to
monitor the model's performance and facilitate collaboration and version
control.
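All three metrics can be computed from first principles for a binary problem, which also makes their definitions concrete. A minimal sketch (the sample predictions are illustrative; in practice these values come from sklearn.metrics and are logged to MLflow):

```python
def evaluate(y_true, y_pred, y_score):
    """Accuracy and F1 from confusion counts; ROC AUC as the probability
    that a random positive is scored above a random negative."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)

    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    pairs = [(p, n) for p in pos for n in neg]
    auc = sum(1.0 if p > n else 0.5 if p == n else 0.0
              for p, n in pairs) / len(pairs)
    return accuracy, f1, auc

# Illustrative predictions, not real model output.
y_true  = [1, 1, 0, 0]
y_pred  = [1, 0, 0, 1]
y_score = [0.9, 0.4, 0.3, 0.6]
acc, f1, auc = evaluate(y_true, y_pred, y_score)
# Each metric would then be logged, e.g.:
# mlflow.log_metric("roc_auc", auc)
```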
Deployment:
For user convenience, a web application has been developed to allow users
to input custom parameters and predict reviews based on the provided
inputs. The web application is deployed on Amazon Web Services (AWS),
ensuring accessibility and scalability. Users can interact with the application
through a user-friendly interface, making it easy to input parameters and
receive predictions in real-time.