Project Report
Submitted By:
Saksham Dura (5-2-19-657-2020)
Sandesh Regmi (5-2-19-661-2020)
Renjil Sunar (5-2-19-649-2020)
Submitted To:
Department of Science and Information Technology,
Birendra Multiple Campus, Bharatpur
Under Supervision of
Er. Binod Sharma
December 17, 2024
SUPERVISOR’S RECOMMENDATION
I hereby recommend that the project work prepared under my supervision by the team of
students listed below, entitled “Sentiment Analysis Using Machine Learning”, be accepted
in partial fulfillment of the requirements for the four-year Bachelor's Degree in
Computer Science and Information Technology. To the best of my knowledge, this is an
original work in Computer Science and Information Technology.
Student's Name
…………………….
Supervisor
Er. Binod Sharma
Birendra Multiple Campus, Bharatpur
Department of B.Sc.CSIT
LETTER OF APPROVAL
This is to certify that this project, prepared by Saksham Dura, Sandesh Regmi, and Renjil
Sunar and entitled “Sentiment Analysis Using Machine Learning”, in partial fulfillment of
the requirements for the degree of B.Sc. in Computer Science and Information Technology,
has been well studied. In our opinion, it is satisfactory in scope and quality as a project
for the required degree.
.............................. ........................
Er. Binod Sharma Er. Binod Sharma
Signature of Project Supervisor Signature of HOD/Co-Ordinator
.......................... .........................
Signature of Internal Examiner Signature of External Examiner
ACKNOWLEDGEMENT
We would like to express our heartfelt gratitude to all those who have contributed to the
successful completion of this project.
First and foremost, we are deeply thankful to our project advisor, Er. Binod Sharma, for
their unwavering guidance, invaluable insights, and constant encouragement throughout this
journey. Their expertise and mentorship played a pivotal role in shaping the direction and
quality of this project.
We extend our appreciation to the faculty members of the Birendra Multiple Campus for
their dedication to imparting knowledge and for providing us with a conducive learning
environment.
We are also indebted to our teammates who worked tirelessly, sharing their knowledge and
experiences. Our collaborative efforts were instrumental in achieving the project's goals.
We extend our gratitude to our family for their unwavering support and understanding
during the demanding phases of this project.
Lastly, we would like to express our thanks to all the resources, books, and online materials
that were instrumental in enhancing our understanding and knowledge in the field of
computer science and information technology.
This project would not have been possible without the collective efforts and support of all
these individuals and resources. We are truly grateful for their contributions.
ABSTRACT
Sentiment analysis has become an important task in natural language processing (NLP),
finding its way into various areas like business intelligence, social media monitoring, and
customer feedback reviews. This project applies machine learning and deep learning
techniques to classify sentiment in text. We first explored three classic
machine learning algorithms, Support Vector Machines (SVM), Logistic Regression, and
Multinomial Naive Bayes (MNB), using a dataset from Kaggle. These models offered
decent initial results, but fine-tuning DistilBERT, a pre-trained
transformer-based language model, led to much better accuracy. The fine-tuned
DistilBERT model proved to be more effective in picking up subtle language cues and
context within the data. Comparing these approaches shows the limitations of traditional
methods and the promising advantages of using transfer learning models for sentiment
analysis. The project wraps up with a discussion on the challenges encountered during the
implementation and some thoughts on where future research could go.
TABLE OF CONTENTS
SUPERVISOR’S RECOMMENDATION
LETTER OF APPROVAL
ACKNOWLEDGEMENT
ABSTRACT
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
CHAPTER 1: INTRODUCTION
1.1. Introduction
1.2. Problem Statement
1.3. Objectives
1.4. Scope and Limitation
1.4.1 Scope
1.4.2 Limitations
1.5. Development Methodology
1.6. Report Organization
CHAPTER 2: BACKGROUND STUDY AND LITERATURE REVIEW
2.1. Background Study
2.2. Literature Review
CHAPTER 3: SYSTEM ANALYSIS
3.1. System Analysis
3.1.1. Requirements Analysis
3.1.2. Feasibility Analysis
3.1.3. Analysis
CHAPTER 4: SYSTEM DESIGN
4.1. Design
4.1.1. Class / Object / State / Sequence / Activity
4.1.2. Component Diagrams
4.2. Algorithm Details
CHAPTER 5: IMPLEMENTATION AND TESTING
5.1 Implementation
5.1.1 Tools Used (CASE tools, Programming languages, Database platforms)
5.1.2. Implementation Details of Modules
5.2. Testing
5.2.1. Test Cases for Unit Testing
5.2.2. Test Cases for Integration Testing
5.3. Result Analysis
CHAPTER 6: CONCLUSION AND FUTURE RECOMMENDATIONS
6.1. Conclusion
6.2. Future Recommendations
REFERENCES
LIST OF TABLES
Table 1: Unit Testing Results
Table 2: Integration Testing Results
Table 3: Accuracy across models
LIST OF FIGURES
LIST OF ABBREVIATIONS
CHAPTER 1: INTRODUCTION
1.1. Introduction
Sentiment analysis is a sub-field of natural language processing that has gained a lot of
attention in recent years. It extracts opinions from textual data, and its popularity has
grown with the increasing availability of large online data sources such as social media,
e-commerce product reviews, and forums. It serves two purposes: tracking how people feel
about different brands, and analyzing customer reviews to improve services or guide future
decisions. Traditional machine learning methods have been applied to these tasks with a
high level of success; however, because they capture little context and struggle with
linguistic idiosyncrasies, they often perform somewhat poorly. Recent advances in deep
learning, specifically transformer models such as DistilBERT, have led to substantial
improvements in sentiment classification through better understanding of semantics and
context. In this project, we apply both the traditional and the state-of-the-art methods
and compare them on Kaggle data.
1.3. Objectives
The objectives of this project are:
4. To identify the limitations of each method and recommend suitable use cases for
sentiment analysis.
1.4.2 Limitations
• Dataset Dependency: The models are trained and evaluated on a single
dataset sourced from Kaggle. As a result, the findings may not
generalize well to datasets from other domains.
1. Dataset Preprocessing: Text data from the Kaggle dataset is cleaned, tokenized, and
vectorized using techniques like TF-IDF for traditional models and word
embeddings for transformers.
3. Transformer Fine-Tuning: The DistilBERT model is fine-tuned using the
preprocessed dataset to capture linguistic nuances.
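As a rough illustration of the TF-IDF vectorization mentioned in step 1, the following sketch computes TF-IDF weights in plain Python. It is not the project's code: the function name `tfidf`, the whitespace tokenization, and the sample sentences are assumptions made for this example. The smoothed IDF formula follows the default used by scikit-learn's TfidfVectorizer, without its final L2 normalization.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document (hypothetical helper)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts within this document
        vectors.append({term: count * idf[term] for term, count in tf.items()})
    return vectors


# Made-up sentences for illustration only.
docs = ["good product", "bad product", "good good service"]
vectors = tfidf(docs)
```

Terms that occur in fewer documents (such as "bad" here) receive a higher weight than terms shared across documents (such as "product"), which is what makes the representation useful for the traditional classifiers.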
• Chapter 6: Conclusion and Future Work: Summarizes the results and outlines achievable
future work.
CHAPTER 2
3. Machine Learning (ML): A method of data analysis that automates the building
of analytical models. ML is widely employed for sentiment classification tasks.
6. DistilBERT: A lighter, faster, and smaller version of BERT optimized for resource-
constrained environments while retaining competitive accuracy.
These theories and tools provide the foundation for modern sentiment analysis techniques,
enabling precise sentiment classification across diverse datasets.
2.2. Literature Review
CHAPTER 3
SYSTEM ANALYSIS
1. User Input Module: An interface for user input is required. The input can be text
or an unlabeled CSV file.
2. Sentiment Detection: The system is required to classify input data into positive,
negative, or neutral sentiments using machine learning algorithms.
3. Visualization Module: Results are displayed graphically (e.g., pie charts or bar
graphs) to provide an intuitive understanding of sentiment distribution.
1. Performance: The system should process input data and deliver sentiment
analysis results within 5 seconds for typical input sizes.
5. Security: User data must be encrypted and protected from unauthorized access.
2. SSD: 256GB
Python is used as the base programming language for this project, with libraries such
as NumPy and Pandas for data handling, NLTK for text preprocessing, and scikit-learn
and Hugging Face Transformers for training the models.
iii. Economic Feasibility
Given the availability of open-source libraries (e.g., Hugging Face for BERT
models) and affordable cloud computing solutions (e.g., Google Colab), the
development and operational costs are minimal. This ensures the project remains
cost-effective.
The estimated duration of this project is about one and a half months. The Gantt chart
for the project is provided in Figure 2 below:
3.1.3. Analysis
3. VisualizationModule: Handles the creation of graphical representations like pie
charts and bar graphs for sentiment distribution.
Figure 4 : Object Diagram SentimentAnalyzer, DatabaseHandler, Visualization
classes
State diagrams for this project illustrate the various states the system transitions
through during operation. Key states include:
Transitions between these states are triggered by user actions or the completion of
system tasks.
Figure 5 : State Diagram
Sequence diagrams for this project depict interactions between system components
in the order they occur. For sentiment analysis, the sequence may involve:
Figure 6 : Sequence Diagram
Activity diagrams provide a high-level view of the workflows within the system.
Key activities include:
Figure 7 : Activity Diagram
This analysis provides a detailed representation of the system's structure, behavior, and
workflows, ensuring clarity in design and implementation for effective sentiment analysis.
CHAPTER 4
SYSTEM DESIGN
4.1. Design
4.1.1. Class / Object / State / Sequence / Activity
The SentimentAnalyzer class is further refined to include specific methods such as
tokenizer(), vectorizer(), and analyze_sentiment(). The VisualizationModule includes
methods for generating visual outputs using the matplotlib library, through the
matplotlib.pyplot module. The user's input text and CSV file are taken from the
st.text_input() and st.file_uploader() methods provided by the Streamlit library. The
input is then normalized by the normalizer() function, which removes stopwords and applies
stemming and lemmatization. Next, the normalized text is passed through vectorizer() or
tokenizer() for preprocessing, depending on the model. The preprocessed input is passed to
the selected model through the analyze_sentiment() method, which returns the label for the
input. analyze_sentiment() uses models from the scikit-learn library and from Hugging Face
Transformers. The activity sequence includes interactions such as:
Figure 8 : Flowchart
Data Storage Component: Handles database management for storing inputs, outputs, and
logs.
Figure 9 : Component Diagram
2. Tokenizer
• Word Segmentation: The tokenizer divides the input text into individual
words or subwords, known as tokens.
• Positional Encoding: A crucial component of the Transformer architecture that
helps the model understand the order of words in a sequence. Positional
encoding uses sine and cosine functions to generate a unique encoding
for each position in the input sequence. This allows the model to capture
the relative positions of tokens, which helps it interpret and generate text.
• Self-Attention: A mechanism that allows the model to weigh the importance
of different parts of the input sequence. It computes scaled dot products
between token representations (queries and keys) to obtain attention weights.
Multi-head attention runs several such attention operations in parallel.
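The sine and cosine scheme described above can be sketched in a few lines of Python. This follows the formulation in the original Transformer paper (reference [5]); the function name `positional_encoding` and the small sizes used here are chosen only for illustration and are not taken from the project's code.

```python
import math


def positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal position encodings."""
    table = []
    for pos in range(seq_len):
        row = []
        for k in range(d_model):
            # Dimensions 2i and 2i+1 share the frequency 1 / 10000^(2i / d_model).
            angle = pos / (10000 ** ((2 * (k // 2)) / d_model))
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


# Illustrative sizes only; real models use much larger values.
pe = positional_encoding(seq_len=4, d_model=6)
```

Each row is added to the embedding of the token at that position, so two identical words at different positions end up with different input vectors.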
3. LogisticRegression
\[ F(x) = \frac{1}{1 + e^{-x}} \]
As in linear regression, the input values are combined linearly using weights
(coefficients) to produce a score. Unlike linear regression, however, the score is
passed through F(x), which maps it to a value between 0 and 1, so the modeled output
is a binary class (0 or 1) rather than a continuous numeric value.
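A minimal sketch of this computation is shown below. The weights, bias, and the 0.5 decision threshold are made-up values for illustration, and `predict` is a hypothetical helper rather than the project's code.

```python
import math


def sigmoid(x):
    """Logistic function F(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def predict(features, weights, bias, threshold=0.5):
    """Combine inputs linearly, squash with the sigmoid, then threshold."""
    z = sum(w * f for w, f in zip(weights, features)) + bias
    probability = sigmoid(z)
    return (1 if probability >= threshold else 0), probability


# Made-up feature values and coefficients for illustration.
label, prob = predict(features=[2.0, 0.0], weights=[1.5, -2.0], bias=-1.0)
```

Here the linear score is 2.0, so the sigmoid gives a probability above 0.5 and the predicted class is 1.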
Figure 11: Bayesian Probability Formula
The estimated prior 𝒑(𝐶ₖ) and likelihood 𝒑(𝑾 | 𝐶ₖ) contribute proportionally
to the posterior probability that the input 𝑾 belongs to class 𝐶ₖ.
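The way the prior and the likelihood combine can be sketched with a small Multinomial Naive Bayes classifier in plain Python. This is an illustrative sketch under stated assumptions, not the project's implementation: the helper names `train_nb` and `predict_nb`, the add-one (Laplace) smoothing, the whitespace tokenization, and the four training sentences are all assumptions.

```python
import math
from collections import Counter, defaultdict


def train_nb(samples):
    """samples: list of (text, label). Returns class counts, word counts, vocabulary."""
    class_counts = Counter(label for _, label in samples)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in samples:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(text, class_counts, word_counts, vocab):
    """Pick the class maximizing log p(C_k) + sum_i log p(w_i | C_k)."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothed likelihood of the word given the class.
            count = word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training sentences for illustration only.
samples = [
    ("great product love it", "positive"),
    ("absolutely great service", "positive"),
    ("horrible product", "negative"),
    ("terrible horrible experience", "negative"),
]
model = train_nb(samples)
prediction = predict_nb("great service", *model)
```

Working in log space turns the product of word likelihoods into a sum, which avoids numerical underflow on longer texts.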
5. Transformer Architecture
• Input Embeddings:
The input text is converted into embeddings, which are continuous vector
representations of words or tokens.
• Output: The output from the decoder is passed through a linear layer and
softmax function to generate the final prediction, such as translated text
or classified sentiment.
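The final softmax step can be sketched as follows. The label order and the raw scores are hypothetical values chosen for illustration; in practice the scores come from the model's linear output layer.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical label order and made-up scores for illustration.
labels = ["negative", "neutral", "positive"]
logits = [0.2, 1.0, 3.1]
probs = softmax(logits)
predicted = labels[probs.index(max(probs))]
```

The class with the largest raw score always receives the largest probability, so the predicted label is the same whether it is chosen before or after the softmax; the softmax is needed when calibrated probabilities are reported.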
CHAPTER 5
5.1 Implementation
5.1.1 Tools Used (CASE tools, Programming languages, Database
platforms)
1. Python Programming Language:
2. Jupyter Notebook:
This library was utilized for implementing pre-trained transformer models such as
BERT and DistilBERT, enabling efficient sentiment classification.
4. Scikit-learn:
These libraries were used for data manipulation and numerical computations,
ensuring efficient handling of large datasets.
These visualization tools enabled the creation of graphs and charts to represent
sentiment distributions and analysis results intuitively.
7. PyTorch:
This library is used for fine-tuning transformer models and implementing deep
learning-based sentiment analysis pipelines.
8. Google Colab:
NLTK was employed for text preprocessing tasks, such as tokenization, stopword
removal, and stemming.
Code Implementation
def normalizer(tweet):
    # Lowercase the cleaned text, then split it into word tokens.
    only_letters = only_letters.lower()
    only_letters = only_letters.split()
    return lemmas
tokenizer = DistilBertTokenizer.from_pretrained('./Fine-TunedModel')
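For readers who want a self-contained picture of the normalization step, the sketch below shows one way such a function might look using only the Python standard library. It is an assumption-laden illustration, not the project's normalizer(): the name `normalize_text`, the regular expression, and the tiny stopword set are hypothetical, and lemmatization is left out because it depends on NLTK's WordNet data.

```python
import re

# Tiny illustrative stopword list; a real pipeline would use a fuller one,
# for example the English list shipped with NLTK.
STOPWORDS = {"a", "an", "the", "is", "it", "this", "was", "i", "and", "to", "of"}


def normalize_text(tweet):
    """Keep letters only, lowercase, split into words, and drop stopwords."""
    only_letters = re.sub(r"[^a-zA-Z]", " ", tweet)  # replace non-letters with spaces
    only_letters = only_letters.lower()
    tokens = only_letters.split()
    return [token for token in tokens if token not in STOPWORDS]


# Made-up input for illustration.
cleaned = normalize_text("I LOVED this product!!! The best one :)")
```

The resulting token list is what would then be handed to a vectorizer for the traditional models.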
2. Model fine-tuning
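Code Implementation
import torch.nn as nn
from transformers import DistilBertModel

class CustomDistilBertForSequenceClassification(nn.Module):
    def __init__(self, num_labels=3):  # ASSUMPTION: three sentiment classes
        super(CustomDistilBertForSequenceClassification, self).__init__()
        self.distilbert = DistilBertModel.from_pretrained('distilbert-base-uncased')
        # ASSUMPTION: layer definitions are not in the listing; 768 is DistilBERT's hidden size.
        self.pre_classifier = nn.Linear(768, 768)
        self.dropout = nn.Dropout(0.3)
        self.classifier = nn.Linear(768, num_labels)

    def forward(self, input_ids, attention_mask):
        distilbert_output = self.distilbert(input_ids=input_ids, attention_mask=attention_mask)
        # ASSUMPTION: pool by taking the hidden state of the first ([CLS]) token.
        pooled_output = distilbert_output[0][:, 0]
        pooled_output = self.pre_classifier(pooled_output)
        pooled_output = nn.ReLU()(pooled_output)
        pooled_output = self.dropout(pooled_output)  # ASSUMPTION: dropout applied here
        logits = self.classifier(pooled_output)
        return logits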
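model.to(device)
model.train()
# ASSUMPTION: loop header is not in the listing; train_loader is a hypothetical DataLoader.
for i, batch in enumerate(train_loader):
    input_ids = batch['input_ids'].to(device)
    attention_mask = batch['attention_mask'].to(device)
    labels = batch['labels'].to(device)
    optimizer.zero_grad()
    # ASSUMPTION: forward pass and loss are not in the listing;
    # criterion is a hypothetical loss such as nn.CrossEntropyLoss().
    logits = model(input_ids, attention_mask)
    loss = criterion(logits, labels)
    loss.backward()
    optimizer.step()
    if (i + 1) % 100 == 0:
        print(f"Step {i + 1}, loss {loss.item():.4f}")  # ASSUMPTION: progress logging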
3. Evaluation
5.2. Testing
The test cases for this project are categorized into:
• Data Preprocessing
• Model-Specific Behavior
• Integration Testing
• Edge Cases
Test Environment:
• Framework: unittest
• Mock Data: A mix of positive, negative, and neutral sentences, including edge
cases such as empty strings, special characters, and extremely long text.
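A test suite of this kind might be organized as in the sketch below, which uses Python's unittest framework. The `classify` function is a hypothetical rule-based stand-in for a trained model, included only so that the example runs on its own; the keyword sets and fixed probabilities are made up, and the test names are illustrative.

```python
import unittest


def classify(text):
    """Hypothetical stand-in for a trained model; returns (label, probabilities)."""
    positive = {"love", "great", "amazing"}
    negative = {"terrible", "worst", "horrible"}
    words = set(text.lower().replace("!", " ").replace(".", " ").split())
    pos, neg = len(words & positive), len(words & negative)
    if pos > neg:
        return "Positive", {"Positive": 0.8, "Neutral": 0.1, "Negative": 0.1}
    if neg > pos:
        return "Negative", {"Positive": 0.1, "Neutral": 0.1, "Negative": 0.8}
    return "Neutral", {"Positive": 0.2, "Neutral": 0.6, "Negative": 0.2}


class SentimentTests(unittest.TestCase):
    def test_positive(self):
        self.assertEqual(classify("I love this product!")[0], "Positive")

    def test_negative(self):
        self.assertEqual(classify("Terrible experience!")[0], "Negative")

    def test_empty_string(self):
        # Edge case: empty input should still yield a valid label.
        self.assertIn(classify("")[0], {"Positive", "Neutral", "Negative"})

    def test_probabilities_valid(self):
        _, probs = classify("Amazing day!")
        self.assertTrue(all(0.0 <= p <= 1.0 for p in probs.values()))
        self.assertAlmostEqual(sum(probs.values()), 1.0)


# Run the suite programmatically so the result can be inspected.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(SentimentTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

In a real suite, `classify` would be replaced by a call to the model under test, with one test class per model.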
Test ID | Description | Input | Expected Output | Status
--------|-------------|-------|-----------------|-------
SVM01 | Correct prediction for positive | "I love this product!" | Positive | Passed
SVM02 | Correct prediction for negative | "Terrible experience!" | Negative | Passed
SVM03 | Handle neutral sentiment | "It's okay." | Neutral | Passed
SVM04 | Ensure probability output is valid | "Amazing day!" | Probabilities in [0, 1] | Passed
2) Logistic Regression
Test ID | Description | Input | Expected Output | Status
--------|-------------|-------|-----------------|-------
LR01 | Positive sentiment classification | "What a fantastic app!" | Positive | Passed
LR02 | Negative sentiment classification | "Worst service ever!" | Negative | Passed
LR03 | Neutral sentiment classification | "It's fine." | Neutral | Passed
3) Multinomial Naive Bayes
Test ID | Description | Input | Expected Output | Status
--------|-------------|-------|-----------------|-------
NB01 | Predict positive sentiment | "Absolutely great!" | Positive | Passed
NB02 | Predict negative sentiment | "Horrible product." | Negative | Passed
NB03 | Test class distribution handling | "Mediocre experience" | Class prediction succeeds | Passed
4) Fine-Tuned DistilBERT
Test ID | Description | Input | Expected Output | Status
--------|-------------|-------|-----------------|-------
DBERT01 | Predict positive sentiment | "Outstanding service!" | Positive | Passed
DBERT02 | Predict negative sentiment | "Completely dissatisfied." | Negative | Passed
DBERT03 | Handle mixed sentiment | "Good food, bad service." | Accurate probabilities | Passed
DBERT04 | Long text handling | Long reviews | Valid prediction | Passed
5.3. Result Analysis
Each model was evaluated on its ability to classify sentiments correctly. The models
exhibited consistent performance across test cases with slight variances in edge cases and
specific scenarios:
• Logistic Regression:
Handles class imbalances slightly better than SVM but struggles with non-linear
patterns.
• Fine-Tuned DistilBERT:
CHAPTER 6
6.1. Conclusion
This project on “Sentiment Analysis Using Machine Learning” demonstrates a robust
and well-validated pipeline capable of processing raw text and accurately classifying
sentiment using multiple machine learning and deep learning models. The implemented
test cases validate key functionalities, from preprocessing to final predictions, ensuring
high reliability and adaptability. The inclusion of DistilBERT further enhances the system’s
ability to capture nuanced sentiment, making it suitable for diverse applications such as
product reviews, social media analysis, and customer feedback.
The modular testing approach guarantees that each component, from data preprocessing to
model inference, functions as expected and contributes to the overall performance of the
system.
4. Incorporate Transfer Learning:
REFERENCES
[1] M. Hu and B. Liu, "Mining and summarizing customer reviews," Proceedings of the
10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,
2004. [Online]. Available: https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
[2] B. Pang, L. Lee, and S. Vaithyanathan, "Thumbs up? Sentiment classification using
machine learning techniques," Proceedings of the ACL-02 Conference on Empirical
Methods in Natural Language Processing (EMNLP), vol. 10, pp. 79–86, 2002. [Online].
Available: https://www.cs.cornell.edu/home/llee/papers/sentiment.pdf
[5] A. Vaswani et al., "Attention is all you need," Advances in Neural Information
Processing Systems (NeurIPS), vol. 30, 2017. [Online]. Available:
https://arxiv.org/abs/1706.03762
[6] A. Dashtipour et al., "A Study of Sentiment Analysis: Concepts, Techniques, and
Challenges," International Journal of Advanced Computer Science and Applications
(IJACSA), vol. 7, no. 1, 2016. [Online]. Available:
https://www.researchgate.net/publication/332451019_A_Study_of_Sentiment_Analysis_Concepts_Techniques_and_Challenges
[7] M. A. Khan et al., "Sentiment analysis of social media content using artificial
intelligence: A comprehensive review," Data and Information Management, vol. 6, no. 2,
pp. 77–94, 2022. [Online]. Available:
https://www.sciencedirect.com/science/article/pii/S2590005622000224
[8] M. Yazdavar et al., "Psychological Stress Detection Using Social Media Data and
Machine Learning Techniques," Frontiers in Psychology, vol. 13, 2022. [Online]. Available:
https://www.frontiersin.org/articles/10.3389/fpsyg.2022.906061/full
[9] A. Tulla, "Transformer Architecture Explained," Medium, Jul. 12, 2023. [Online].
Available: https://medium.com/@amanatulla1606/transformer-architecture-explained-2c49e2257b4c
[Accessed: Dec. 16, 2024].
[15] The Python Community, "Python Libraries: NumPy, Pandas, Matplotlib, Scikit-learn,
and Transformers." [Online]. Available: Respective library websites.