
TRIBHUVAN UNIVERSITY

(INSTITUTE OF SCIENCE AND INFORMATION TECHNOLOGY)

Final Year Project Report


Sentiment Analysis Using Machine Learning

Subject code: CSC 412


In Partial Fulfillment of Requirement for Bachelor of Science in Computer Science and
Information Technology (B.Sc. CSIT - Tribhuvan University)

Submitted By:
Saksham Dura (5-2-19-657-2020)
Sandesh Regmi (5-2-19-661-2020)
Renjil Sunar (5-2-19-649-2020)

Submitted To:
Department of Science and Information Technology,
Birendra Multiple Campus, Bharatpur

Under Supervision of
Er. Binod Sharma
December 17, 2024

SUPERVISOR’S RECOMMENDATION
I hereby recommend that the project work entitled “Sentiment Analysis Using Machine Learning”, prepared under my supervision by the team of students listed below, be accepted as partially fulfilling the requirements for completion of the Four-Year Bachelor's Degree in Computer Science and Information Technology. To the best of my knowledge, this is an original work in Computer Science and Information Technology.

Student's Name

1. Saksham Dura (5-2-19-657-2020)

2. Sandesh Regmi (5-2-19-661-2020)

3. Renjil Sunar (5-2-19-649-2020)

…………………….
Supervisor
Er. Binod Sharma
Birendra Multiple Campus, Bharatpur
Department of B.Sc.CSIT

LETTER OF APPROVAL
This is to certify that this project, prepared by Saksham Dura, Sandesh Regmi, and Renjil Sunar and entitled “Sentiment Analysis Using Machine Learning” in partial fulfillment of the requirement for the degree of B.Sc. in Computer Science and Information Technology, has been duly examined. In our opinion it is satisfactory in scope and quality as a project for the required degree.

.............................. ........................
Er. Binod Sharma Er. Binod Sharma
Signature of Project Supervisor Signature of HOD/Co-Ordinator

.......................... .........................
Signature of Internal Examiner Signature of External Examiner

ACKNOWLEDGEMENT

We would like to express our heartfelt gratitude to all those who have contributed to the
successful completion of this project.

First and foremost, we are deeply thankful to our project advisor, Er. Binod Sharma, for their unwavering guidance, invaluable insights, and constant encouragement throughout this journey. Their expertise and mentorship played a pivotal role in shaping the direction and quality of this project.

We extend our appreciation to the faculty members of the Birendra Multiple Campus for
their dedication to imparting knowledge and for providing us with a conducive learning
environment.

We are also indebted to our teammates who worked tirelessly, sharing their knowledge and
experiences. Our collaborative efforts were instrumental in achieving the project's goals.

We extend our gratitude to our family for their unwavering support and understanding
during the demanding phases of this project.

Lastly, we would like to express our thanks to all the resources, books, and online materials
that were instrumental in enhancing our understanding and knowledge in the field of
computer science and information technology.

This project would not have been possible without the collective efforts and support of all
these individuals and resources. We are truly grateful for their contributions.

ABSTRACT
Sentiment analysis has become an important task in natural language processing (NLP),
finding its way into various areas like business intelligence, social media monitoring, and
customer feedback reviews. This project aims to put machine learning and deep learning
techniques to work for understanding sentiments in text. We first explored some classic
machine learning algorithms—Support Vector Machines (SVM), Logistic Regression, and
Multinomial Naive Bayes (MNB)—using a dataset from Kaggle. These models offered
decent initial results, but tweaking the DistilBERT model, which is a pre-trained
transformer-based language model, led to much better accuracy. The fine-tuned
DistilBERT model proved to be more effective in picking up subtle language cues and
context within the data. Comparing these approaches shows the limitations of traditional
methods and the promising advantages of using transfer learning models for sentiment
analysis. The project wraps up with a discussion on the challenges encountered during the
implementation and some thoughts on where future research could go.

Keywords: DistilBERT, Support Vector Machine, Multinomial Naive Bayes, Logistic


Regression, Kaggle.

TABLE OF CONTENTS
SUPERVISOR’S RECOMMENDATION.......................................................................ii
LETTER OF APPROVAL .............................................................................................. iii
ACKNOWLEDGEMENT ................................................................................................ iv
ABSTRACT ........................................................................................................................ v
TABLE OF CONTENTS.................................................................................................. vi
LIST OF TABLES ......................................................................................................... viii
LIST OF FIGURES ......................................................................................................... ix
LIST OF ABBREVIATIONS ........................................................................................... x
CHAPTER 1: INTRODUCTION ..................................................................................... 1
1.1. Introduction ............................................................................................................... 1
1.2. Problem Statement .................................................................................................... 1
1.3. Objectives.................................................................................................................. 1
1.4. Scope and Limitation ................................................................................................ 2
1.4.1 Scope ................................................................................................................... 2
1.4.2 Limitations .......................................................................................................... 2
1.5. Development Methodology....................................................................................... 2
1.6. Report Organization .................................................................................................. 3
CHAPTER 2 ....................................................................................................................... 4
BACKGROUND STUDY AND LITERATURE REVIEW ........................................... 4
2.1. Background Study ..................................................................................................... 4
2.2. Literature Review ...................................................................................................... 4
CHAPTER 3 ....................................................................................................................... 6
SYSTEM ANALYSIS ........................................................................................................ 6
3.1. System Analysis ........................................................................................................ 6
3.1.1. Requirements Analysis ...................................................................................... 6
3.1.2. Feasibility Analysis ............................................................................................ 7
3.1.3. Analysis .............................................................................................................. 8
CHAPTER 4 ..................................................................................................................... 14
SYSTEM DESIGN ........................................................................................................... 14
4.1. Design ..................................................................................................................... 14
4.1.1. Class / Object / State / Sequence / Activity ..................................................... 14
4.1.2. Component Diagrams ...................................................................................... 15

4.2. Algorithm Details .................................................................................................... 16
CHAPTER 5 ..................................................................................................................... 20
IMPLEMENTATION AND TESTING ......................................................................... 20
5.1 Implementation ........................................................................................................ 20
5.1.1 Tools Used (CASE tools, Programming languages, Database platforms) ........ 20
5.1.2. Implementation Details of Modules ................................................................. 21
5.2. Testing ..................................................................................................................... 23
5.2.1. Test Cases for Unit Testing .............................................................................. 24
5.2.2. Test Cases for Integration Testing ................................................................... 26
5.3. Result Analysis ....................................................................................................... 27
CHAPTER 6 ..................................................................................................................... 29
CONCLUSION AND FUTURE RECOMMENDATIONS.......................................... 29
6.1. Conclusion .............................................................................................................. 29
6.2. Future Recommendations ....................................................................................... 29
REFERENCES ................................................................................................................. 31

LIST OF TABLES
Table 1: Unit Testing Results............................................................................................. 24
Table 2 : Integration Testing Results ................................................................................. 26
Table 3 : Accuracy across models...................................................................................... 27

LIST OF FIGURES

Figure 1 : Use Case Diagram for Sentiment Analysis App ................................................. 6


Figure 2 :Gantt Chart Diagram for the project ..................................................................... 8
Figure 3 : Class Diagram for SentimentAnalyzer, DatabaseHandler, Visualization .......... 9
Figure 4 : Object Diagram SentimentAnalyzer, DatabaseHandler, Visualization classes. 10
Figure 5 : State Diagram .................................................................................................... 11
Figure 6 : Sequence Diagram ............................................................................................. 12
Figure 7 : Activity Diagram ............................................................................................... 13
Figure 8 : Flowchart ........................................................................................................... 15
Figure 9 : Component Diagram ........................................................................................ 16
Figure 10 : Logistic Regression [10].................................................................................. 17
Figure 11 : Transformer Architecture [9]. ......................................................................... 19

LIST OF ABBREVIATIONS

NLP Natural Language Processing

SVM Support Vector Machine

Multinomial NB Multinomial Naive Bayes

Logistic Regression A statistical machine learning model for classification

DistilBERT Distilled Bidirectional Encoder Representations from Transformers

Hugging Face Platform for machine learning models and tools

Transformer Transformer architecture for natural language processing

Scikit-learn Python library for machine learning, including model metric evaluation

sns Seaborn, a Python library for data visualization

CHAPTER 1: INTRODUCTION

1.1. Introduction
Sentiment analysis is a sub-field of natural language processing that has gained considerable attention in recent years. It enables the extraction of opinions from textual data and has grown in popularity with the increasing availability of large online data sources such as social media, e-commerce product reviews, and forums. Its applications include tracking how people feel about different brands and analyzing customer reviews to improve services or guide future decisions. Traditional machine learning methods have been applied to these tasks with considerable success; however, because they abstract away context and struggle with linguistic idiosyncrasies, they often underperform. Recent advances in deep learning, specifically transformer models such as DistilBERT, have led to substantial improvements in sentiment classification through better semantic and contextual understanding. In this project, we combine both traditional and state-of-the-art methodologies and evaluate them on the same Kaggle dataset.

1.2. Problem Statement


The exponential growth of unstructured text data brings language ambiguities, sarcasm, and domain-specific complexities, making it a challenge to analyze sentiments accurately. Though SVM and Logistic Regression handle structured feature sets well, they often miss contextual relationships. On the other hand, advanced transformer models such as DistilBERT provide strong contextual understanding of language use, but at the cost of intensive computation. This study therefore contrasts the two approaches, bringing out the trade-offs between them in sentiment analysis on a publicly available Kaggle dataset.

1.3. Objectives
The objectives of this project are:

1. To implement and evaluate traditional ML algorithms, including SVM, Logistic Regression, and Multinomial Naive Bayes, for sentiment analysis.

2. To fine-tune and analyze the performance of DistilBERT, a transformer-based model, on the same dataset.

3. To compare the accuracy, precision, recall, and F1-scores of traditional models against deep learning approaches.

4. To identify the limitations of each method and recommend suitable use cases for sentiment analysis.
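As a minimal sketch of how the comparison metrics in objective 3 can be computed (assuming scikit-learn is available; the labels below are invented for illustration, not the project's results):

```python
# Hypothetical illustration: computing accuracy, precision, recall, and
# F1 with scikit-learn. The labels here are made-up examples only.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["positive", "negative", "neutral", "positive", "negative"]
y_pred = ["positive", "negative", "positive", "positive", "negative"]

accuracy = accuracy_score(y_true, y_pred)
# macro-averaging treats all three sentiment classes equally
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(accuracy)  # 4 of 5 predictions match -> 0.8
```

The same four numbers can then be tabulated per model for the comparison described above.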

1.4. Scope and Limitation


1.4.1 Scope
• Data Source: The dataset used in this study is sourced from Kaggle, comprising textual data labeled for sentiment classification (positive, neutral, or negative).

• Model Implementation: Traditional models such as Support Vector Machines (SVM), Logistic Regression, and Multinomial Naive Bayes are employed to establish baseline performance for sentiment analysis. DistilBERT, a pre-trained transformer model known for its efficiency and contextual understanding, is fine-tuned on the Kaggle dataset to explore its ability to outperform traditional methods.

• Applications: This study aims to provide insights for real-world scenarios such as customer feedback analysis, product review mining, and social media sentiment tracking.

1.4.2 Limitations
• Dataset Dependency: The models are trained and evaluated on a single dataset sourced from Kaggle. As a result, the findings may not generalize well to datasets from other domains.

• Contextual Challenges: While fine-tuning DistilBERT improves contextual understanding, certain linguistic challenges, such as sarcasm, remain unresolved.

1.5. Development Methodology


The methodology involves the following steps:

1. Dataset Preprocessing: Text data from the Kaggle dataset is cleaned, tokenized, and vectorized using techniques like TF-IDF for traditional models and word embeddings for transformers.

2. Model Implementation: SVM, Logistic Regression, and Multinomial Naive Bayes models are trained and evaluated as baseline classifiers.

3. Transformer Fine-Tuning: The DistilBERT model is fine-tuned using the preprocessed dataset to capture linguistic nuances.

4. Performance Evaluation: Models are assessed using metrics such as accuracy, precision, recall, and F1-score for a detailed comparison.

5. Result Analysis: Insights are drawn from the performance comparison to understand the trade-offs between traditional and transformer-based approaches.

1.6. Report Organization


• Chapter 1: Introduction. Includes background, problem statement, objectives, scope, and methodology.

• Chapter 2: Background Study and Literature Review. Reviews previous work in traditional ML and modern transformer-based models for sentiment analysis.

• Chapter 3: System Analysis. Covers requirements, feasibility, and system analysis.

• Chapter 4: System Design. Describes the design diagrams and algorithm details.

• Chapter 5: Implementation and Testing. Presents the tools used, implementation details, testing, and result analysis.

• Chapter 6: Conclusion and Future Recommendations. Summarizes the results and outlines achievable future work.

CHAPTER 2

BACKGROUND STUDY AND LITERATURE REVIEW

2.1. Background Study


Sentiment analysis is a branch of natural language processing (NLP) focused on identifying
and categorizing sentiments expressed in textual data. The fundamental aim is to determine
whether a piece of text conveys a positive, negative, or neutral sentiment. Key
terminologies and concepts central to this domain include:

1. Natural Language Processing (NLP): A subfield of artificial intelligence that enables computers to understand, interpret, and generate human language.

2. Sentiment Polarity: Categorization of text based on its expressed sentiment (positive, negative, or neutral).

3. Machine Learning (ML): A method of data analysis that automates the building of analytical models. ML is widely employed for sentiment classification tasks.

4. Transformers: A deep learning architecture introduced by Vaswani et al. (2017) that uses self-attention mechanisms to process sequential data [5]. It forms the backbone of modern NLP techniques.

5. Bidirectional Encoder Representations from Transformers (BERT): An advanced transformer-based model designed for pre-training on large corpora and fine-tuning for specific tasks, including sentiment analysis.

6. DistilBERT: A lighter, faster, and smaller version of BERT optimized for resource-constrained environments while retaining competitive accuracy.

These theories and tools provide the foundation for modern sentiment analysis techniques,
enabling precise sentiment classification across diverse datasets.

2.2. Literature Review

Sentiment analysis research has undergone significant evolution, transitioning from traditional rule-based systems to sophisticated ML and transformer-based models. A review of existing literature reveals the following insights:

1. Traditional Approaches: Early methods relied on lexicon-based techniques where predefined dictionaries were used to map words to their sentiment scores. These methods, though simple, struggled with contextual understanding [1].

2. Machine Learning Models: Supervised learning algorithms such as Support Vector Machines (SVM) significantly outperformed rule-based approaches. Works such as that by Pang et al. (2002) showed the effectiveness of ML classifiers on sentiment datasets [2].

3. Transformer Models: Transformers have revolutionized NLP. BERT, as proposed by Devlin et al. (2018), provided the ability to contextualize word embeddings, which largely improved the performance of sentiment analysis. Various studies confirmed that it achieved higher accuracy than traditional models on benchmark datasets [3].

4. Efficient Variants: DistilBERT emerged as a compact alternative to BERT, offering comparable performance with reduced computational overhead. Sanh et al. (2019) demonstrated its efficacy for sentiment tasks, particularly in resource-constrained environments [4].

5. Applications in Real-World Contexts: Sentiment analysis models have been widely applied in domains such as social media monitoring, customer feedback analysis, and public opinion studies. Case studies underscore their practical utility in deriving actionable insights from textual data [5], [6], [7].

This review highlights the rapid advancement of sentiment analysis methodologies, emphasizing the transformative role of ML and transformer-based models in achieving state-of-the-art results.

CHAPTER 3

SYSTEM ANALYSIS

3.1. System Analysis


3.1.1. Requirements Analysis
i. Functional Requirements

1. User Input Module: An interface for user input is required. The input can be plain text or an unlabeled CSV file.

2. Sentiment Detection: The system is required to classify input data into positive,
negative, or neutral sentiments using machine learning algorithms.

3. Visualization Module: Results are displayed graphically (e.g., pie charts or bar
graphs) to provide an intuitive understanding of sentiment distribution.

4. Data Storage: Analyzed results can be downloaded or stored in a database for future reference.

Figure 1 : Use Case Diagram for Sentiment Analysis App

ii. Non-Functional Requirements

1. Performance: The system should process input data and deliver sentiment analysis results within 5 seconds for typical input sizes.

2. Scalability: The system must handle large datasets without performance degradation.

3. Reliability: Sentiment classification should achieve at least 75% accuracy based on validation against standard datasets.

4. Usability: The system should provide an intuitive interface with a minimal learning curve.

5. Security: User data must be encrypted and protected from unauthorized access.

3.1.2. Feasibility Analysis


i. Technical Feasibility

The system is built on state-of-the-art pre-trained transformer models such as BERT and DistilBERT. This project uses the Streamlit library as the Python framework for the web app, and the models are trained in Python using Jupyter Notebook. All of these technologies are well documented.

ii. Operational Feasibility

Sentiment analysis projects can be effectively implemented using Google Colab, which offers several advantages in terms of accessibility, computational resources, and collaboration capabilities. Google Colab provides a free tier with access to GPUs and TPUs. Colab notebooks can be easily shared with others, facilitating collaboration and code review, and they integrate with GitHub for seamless version control and change tracking. Locally, this project uses computing resources with the following specifications:

1. CPU: AMD Ryzen 5 4500U

2. SSD: 256 GB

3. GPU: NVIDIA MX350

Python is used as the base programming language for this project, with libraries such as NumPy and Pandas for data pre-processing and NLTK for text processing.

iii. Economic Feasibility

Given the availability of open-source libraries (e.g., Hugging Face for BERT
models) and affordable cloud computing solutions (e.g., Google Colab), the
development and operational costs are minimal. This ensures the project remains
cost-effective.

iv. Schedule Feasibility

The estimated duration for this project is around one and a half months. The Gantt chart for the project is provided in Figure 2 below:

Figure 2 :Gantt Chart Diagram for the project

3.1.3. Analysis

This project adopts an object-oriented approach, focusing on object modeling, dynamic modeling, and process modeling.

i. Class Diagram / Object Diagram

The system consists of classes such as:

1. SentimentAnalyzer: This class encapsulates methods for data preprocessing, feature extraction, and sentiment classification using transformer models like BERT or DistilBERT.

2. DatabaseHandler: Manages data storage and retrieval, ensuring efficient access to input text and sentiment analysis results.

3. VisualizationModule: Handles the creation of graphical representations like pie charts and bar graphs for sentiment distribution.

Relationships among classes, such as associations and dependencies, are depicted in the class diagram. For example, the SentimentAnalyzer depends on the DatabaseHandler for accessing training data.
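A skeletal sketch of these three classes and the dependency between them (method names and bodies here are illustrative stand-ins, not the project's actual code):

```python
# Illustrative skeleton of the classes from the analysis. The sentiment
# "model" is a placeholder rule; the real project uses trained classifiers.

class DatabaseHandler:
    """Stores and retrieves input texts and analysis results."""
    def __init__(self):
        self._results = []

    def save_result(self, text, label):
        self._results.append({"text": text, "label": label})

    def all_results(self):
        return list(self._results)


class SentimentAnalyzer:
    """Preprocesses text and classifies sentiment; depends on DatabaseHandler."""
    def __init__(self, db: DatabaseHandler):
        self.db = db

    def analyze(self, text):
        # Placeholder rule standing in for the trained model.
        label = "positive" if "good" in text.lower() else "negative"
        self.db.save_result(text, label)
        return label


class VisualizationModule:
    """Summarizes stored results for charting (counts per label)."""
    def counts(self, db: DatabaseHandler):
        tally = {}
        for row in db.all_results():
            tally[row["label"]] = tally.get(row["label"], 0) + 1
        return tally
```

The dependency noted above appears directly: SentimentAnalyzer takes a DatabaseHandler, and VisualizationModule reads from it to build chart data.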

Figure 3 : Class Diagram for SentimentAnalyzer, DatabaseHandler, Visualization

Figure 4 : Object Diagram SentimentAnalyzer, DatabaseHandler, Visualization
classes

ii. State / Sequence Diagrams

State diagrams for this project illustrate the various states the system transitions
through during operation. Key states include:

Idle: The system awaits user input.

Processing: The system performs sentiment analysis on the provided data.

Completed: Results are generated and displayed to the user.

Transitions between these states are triggered by user actions or the completion of
system tasks.

Figure 5 : State Diagram

Sequence diagrams for this project depict interactions between system components in the order they occur. For sentiment analysis, the sequence may involve:

• The user submitting input data to the SentimentAnalyzer.

• The SentimentAnalyzer invoking methods of the DatabaseHandler to retrieve training data or store results.

• The VisualizationModule generating a graphical output based on the analysis results.

Figure 6 : Sequence Diagram

iii. Activity Diagram

Activity diagrams provide a high-level view of the workflows within the system. Key activities include:

• Text Preprocessing: Tokenization, removing stopwords, and lemmatization.

• Feature Extraction: Transforming text into numerical representations using embeddings from models like BERT.

• Sentiment Classification: Applying the machine learning model to predict sentiment categories.

• Result Generation: Displaying the sentiment analysis output visually and storing it in the database.

Figure 7 : Activity Diagram
This analysis provides a detailed representation of the system's structure, behavior, and
workflows, ensuring clarity in design and implementation for effective sentiment analysis.

CHAPTER 4

SYSTEM DESIGN

4.1. Design
4.1.1. Class / Object / State / Sequence / Activity
The SentimentAnalyzer class is further refined to include specific methods such as tokenizer(), vectorizer(), and analyze_sentiment(). The VisualizationModule includes methods for generating visual outputs with the Matplotlib library, using interfaces such as matplotlib.pyplot. User input, either text or a CSV file, is taken from the st.text_input() and st.file_uploader() methods provided by the Streamlit library. The input is then normalized using the normalizer() function, which removes stopwords and applies stemming and lemmatization. The normalized input text is passed through vectorizer() or tokenizer() for preprocessing. The preprocessed input is consumed by the selected model through the analyze_sentiment() method, which returns a label for the input. analyze_sentiment() uses models taken from the scikit-learn library and transformers from Hugging Face. The activity sequence includes interactions such as:

• The user submitting data via a web interface.

• The SentimentAnalyzer invoking feature extraction followed by classification.

• The DatabaseHandler storing the results, and the VisualizationModule creating the output.
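A minimal, dependency-free sketch of the normalizer() step described above (the stopword list and the plural-stripping rule are simplified stand-ins for the project's NLTK-based stopword removal, stemming, and lemmatization):

```python
# Simplified stand-in for the normalizer() step: lowercases, strips
# punctuation, removes a few stopwords, and crudely stems plurals.
# The real project reportedly uses NLTK for these steps.
import re

STOPWORDS = {"the", "a", "an", "is", "are", "was", "were", "and", "or"}

def normalizer(text):
    text = text.lower()
    tokens = re.findall(r"[a-z']+", text)
    tokens = [t for t in tokens if t not in STOPWORDS]
    # naive plural stripping as a placeholder for real stemming
    tokens = [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]
    return " ".join(tokens)

print(normalizer("The movies are great!"))  # "movie great"
```

The returned string is what would then be handed to vectorizer() or tokenizer() for the selected model.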

Figure 8 : Flowchart

4.1.2. Component Diagrams


Component diagrams describe the high-level structure of the sentiment analysis system and its major components. Key components include:

Frontend Component: Responsible for user interaction and input submission.

Sentiment Analysis Engine: Implements core functionality using ML models and transformers like BERT.

Data Storage Component: Handles database management for storing inputs, outputs, and logs.

Visualization Component: Generates user-friendly graphical representations.

Figure 9 : Component Diagram

4.2. Algorithm Details


1. CountVectorizer

This vectorizer converts a collection of text documents into a sparse matrix representation, where rows correspond to documents, columns correspond to vocabulary tokens, and each cell holds the count of a word in a document. Example: for a document containing the phrase "happy day happy moment," the count for "happy" is 2.

2. Tokenizer

The tokenizer converts the text into tokens for the transformer-based DistilBERT classifier.

• Word Segmentation: The tokenizer divides the input text into individual words or subwords, known as tokens.

• Case Handling: Depending on the model configuration (cased or uncased), the tokenizer may convert the text to lowercase.

• Special Tokens: Special tokens like [CLS] (classification) and [SEP] (sentence separation) are added to the beginning and end of the sequence, respectively.

• Vocabulary Mapping: Each token is mapped to a unique integer ID based on the model's vocabulary. This vocabulary is typically learned during the model's training process.

• Positional Encoding: A crucial component of the Transformer architecture that helps the model understand the order of words in a sequence. Positional encoding uses sine and cosine functions to generate a unique encoding for each position in the input sequence, allowing the model to capture the relative positions of tokens as it interprets and generates text.

• Self-Attention: A mechanism that allows the model to weigh the importance of different parts of the input sequence. It computes dot products of the embeddings and is extended into multi-head attention by running several attention heads in parallel.
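A toy illustration of the steps above, using whitespace splitting and a made-up vocabulary in place of DistilBERT's learned subword vocabulary (the real tokenizer comes from Hugging Face's transformers library):

```python
# Toy tokenizer illustrating lowercasing, special tokens, and vocabulary
# mapping. The vocabulary here is invented; DistilBERT's real tokenizer
# uses a learned WordPiece subword vocabulary of roughly 30,000 entries.

VOCAB = {"[CLS]": 101, "[SEP]": 102, "[UNK]": 100,
         "this": 1, "movie": 2, "was": 3, "great": 4}

def toy_encode(text):
    tokens = ["[CLS]"] + text.lower().split() + ["[SEP]"]  # special tokens
    ids = [VOCAB.get(t, VOCAB["[UNK]"]) for t in tokens]   # vocabulary mapping
    return tokens, ids

tokens, ids = toy_encode("This movie was great")
print(tokens)  # ['[CLS]', 'this', 'movie', 'was', 'great', '[SEP]']
print(ids)     # [101, 1, 2, 3, 4, 102]
```

Out-of-vocabulary words map to [UNK] here; a real subword tokenizer instead splits them into known pieces.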

3. LogisticRegression

The sigmoid function serves as the activation function for logistic regression:

F(x) = 1 / (1 + e^(−x))

As in linear regression, the input values are combined linearly using weights (coefficients) to produce a score. Unlike linear regression, however, the modeled output is a probability that is thresholded into a binary value (0 or 1) rather than a continuous numeric value.
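The sigmoid above, sketched directly:

```python
# Direct implementation of the sigmoid F(x) = 1 / (1 + e^(-x)).
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# At x = 0 the sigmoid is exactly 0.5, the decision boundary;
# large positive inputs approach 1, large negative inputs approach 0.
print(sigmoid(0))  # 0.5
```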

Figure 10 : Logistic Regression [10].


4. Multinomial Naive Bayes

Multinomial Naive Bayes is a generative probabilistic model based on Bayes' theorem, assuming conditional independence between features.

For word counts 𝑾 and class 𝐶ₖ, the estimated prior 𝒑(𝐶ₖ) and likelihood 𝒑(𝑾 | 𝐶ₖ) contribute proportionally to the posterior probability that 𝐶ₖ is the class: 𝒑(𝐶ₖ | 𝑾) ∝ 𝒑(𝐶ₖ) · 𝒑(𝑾 | 𝐶ₖ).
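A small sketch of Multinomial Naive Bayes on word-count features (assuming scikit-learn; the training sentences are invented placeholders, not the project's dataset):

```python
# Toy Multinomial Naive Bayes over word counts for two invented classes.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ["great wonderful love", "awful terrible hate",
               "love this great", "terrible awful bad"]
train_labels = ["positive", "negative", "positive", "negative"]

# CountVectorizer supplies the word counts W; MultinomialNB estimates
# the prior p(Ck) and likelihood p(W | Ck) from them.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["wonderful love"])[0])  # "positive"
```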

5. Transformer Architecture

Transformers are a state-of-the-art development in machine learning, well suited to keeping track of context.

• Input Embeddings:

The input text is converted into embeddings, which are continuous vector
representations of words or tokens.

• Positional Encoding: Since transformers do not have an inherent sense
of order, positional encodings are added to the input embeddings to
provide information about the position of each token in the sequence.

• Encoder: The encoder consists of multiple layers, each containing:

Self-Attention Mechanism: This allows the model to focus on different
parts of the input sequence to understand the context.

Feed-Forward Neural Networks: These are applied to each position
separately and identically to transform the information.

• Decoder: The decoder also consists of multiple layers, each containing:

Self-Attention and Cross-Attention: Self-attention works as in the
encoder, but the decoder additionally attends to the encoder's output
(cross-attention).

Feed-Forward Neural Networks: Transform the information at each
position.

Masked Self-Attention: Ensures that the predictions for a position
depend only on the known outputs for previous positions.

• Output: The output from the decoder is passed through a linear layer and
softmax function to generate the final prediction, such as translated text
or classified sentiment.
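The self-attention and masked self-attention steps above can be sketched as scaled dot-product attention. This pure-Python, single-head version uses toy 2-dimensional vectors purely for illustration:

import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V, causal=False):
    # Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V.
    # With causal=True, position i may only attend to positions <= i
    # (the decoder's masked self-attention).
    d_k = len(K[0])
    out = []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        if causal:
            scores = [s if j <= i else float("-inf")
                      for j, s in enumerate(scores)]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, V))
                    for d in range(len(V[0]))])
    return out

Q = K = V = [[1.0, 0.0], [0.0, 1.0]]
print(attention(Q, K, V, causal=True)[0])  # [1.0, 0.0]

With the causal mask, the first position can only attend to itself, so its output equals its own value vector; multi-head attention simply runs several of these computations in parallel with different learned projections.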

Figure 12 : Transformer Architecture [9].

CHAPTER 5

IMPLEMENTATION AND TESTING

5.1 Implementation
5.1.1 Tools Used (CASE tools, Programming languages, Database
platforms)
1. Python Programming Language:

Python is used as the primary programming language due to its simplicity,
extensive libraries, and strong community support for machine learning and
natural language processing tasks.

2. Jupyter Notebook:

Jupyter Notebook provided an interactive environment for coding, data
exploration, and visualization, making it an ideal platform for iterative
development.

3. Hugging Face Transformers Library:

This library was utilized for implementing pre-trained transformer models such as
BERT and DistilBERT, enabling efficient sentiment classification.

4. Scikit-learn:

Scikit-learn facilitated traditional machine learning algorithms, such as
Logistic Regression, Support Vector Machines, and Multinomial Naive Bayes,
along with preprocessing tools like CountVectorizer and TF-IDF.

5. Pandas and NumPy:

These libraries were used for data manipulation and numerical computations,
ensuring efficient handling of large datasets.

6. Matplotlib and Seaborn:

These visualization tools enabled the creation of graphs and charts to represent
sentiment distributions and analysis results intuitively.

7. PyTorch:

This library is used for fine-tuning transformer models and implementing deep
learning-based sentiment analysis pipelines.

8. Google Colab:

This cloud-based platform provided the computational resources required for
training and testing large transformer models efficiently.

9. NLTK (Natural Language Toolkit):

NLTK was employed for text preprocessing tasks, such as tokenization, stopword
removal, and stemming.

5.1.2. Implementation Details of Modules

1. Data Preprocessing

Modules such as normalizer(), tokenizer(), and vectorizer() are used for
data preprocessing.

Code Implementation

import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from transformers import DistilBertTokenizer
from torch.utils.data import random_split

wordnet_lemmatizer = WordNetLemmatizer()

def normalizer(tweet):
    # Keep letters only, lowercase, and split into individual words
    only_letters = re.sub("[^a-zA-Z]", " ", tweet)
    only_letters = only_letters.lower()
    only_letters = only_letters.split()
    # Drop English stopwords, then lemmatize the remaining tokens
    filtered_result = [word for word in only_letters
                       if word not in stopwords.words('english')]
    lemmas = [wordnet_lemmatizer.lemmatize(t) for t in filtered_result]
    return ' '.join(lemmas)

# The DistilBERT tokenizer is loaded from the fine-tuned model directory
tokenizer = DistilBertTokenizer.from_pretrained('./Fine-TunedModel')

# 80% of the data is used for training, the rest for testing
train_size = int(0.8 * len(review_dataset))
train_dataset, test_dataset = random_split(
    review_dataset, [train_size, len(review_dataset) - train_size])

2. Model fine-tuning

i. CustomDistilBertForSequenceClassification(nn.Module) is our custom
classification model for fine-tuning: a pre-trained DistilBERT encoder
with a task-specific classification head on top.

# Code Implementation

import torch.nn as nn
from transformers import DistilBertModel

class CustomDistilBertForSequenceClassification(nn.Module):
    def __init__(self, num_labels=3):
        super(CustomDistilBertForSequenceClassification, self).__init__()
        self.distilbert = DistilBertModel.from_pretrained('distilbert-base-uncased')
        self.pre_classifier = nn.Linear(768, 768)  # DistilBERT's hidden size is 768
        self.dropout = nn.Dropout(0.3)
        self.classifier = nn.Linear(768, num_labels)

    def forward(self, input_ids, attention_mask):
        distilbert_output = self.distilbert(input_ids=input_ids,
                                            attention_mask=attention_mask)
        hidden_state = distilbert_output[0]  # (batch_size, sequence_length, hidden_size)
        pooled_output = hidden_state[:, 0]   # representation of the [CLS] token (first token)
        pooled_output = self.pre_classifier(pooled_output)
        pooled_output = nn.ReLU()(pooled_output)
        pooled_output = self.dropout(pooled_output)  # regularization
        logits = self.classifier(pooled_output)
        return logits

ii. Model training

import torch
import torch.nn as nn
from torch.optim import AdamW

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
optimizer = AdamW(model.parameters(), lr=5e-5)

model.train()
for epoch in range(3):
    for i, batch in enumerate(train_loader):
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        optimizer.zero_grad()
        logits = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = nn.CrossEntropyLoss()(logits, labels)
        loss.backward()
        optimizer.step()

        if (i + 1) % 100 == 0:
            print(f"Epoch {epoch + 1}, Batch {i + 1}, Loss: {loss.item():.4f}")

3. Evaluation

from sklearn.metrics import classification_report, confusion_matrix
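For intuition, the quantities these functions report can be computed by hand on a hypothetical set of predictions (the real evaluation runs over the held-out test split):

from collections import Counter

# Hypothetical predictions vs. ground truth, for illustration only.
y_true = ["pos", "neg", "neu", "pos", "neg"]
y_pred = ["pos", "neg", "pos", "pos", "neg"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
confusion = Counter(zip(y_true, y_pred))  # (true, predicted) -> count

print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.80
print(confusion[("neu", "pos")])     # 1 (the single misclassified neutral)

classification_report additionally derives per-class precision, recall, and F1 from exactly these (true, predicted) counts.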

5.2. Testing
The test cases for this project are categorized into:

• Data Preprocessing

• Model-Specific Behavior

• Integration Testing

• Edge Cases

Test Environment:

• Programming Language: Python 3.9

• Framework: unittest

• Mock Data: A mix of positive, negative, and neutral sentences, including edge
cases such as empty strings, special characters, and extremely long text.

• Libraries: sklearn, transformers, numpy, pandas.
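As a sketch of how such cases can be written with unittest, here is a simplified test for a normalizer-like function. The stand-in normalize() below skips stopword removal and lemmatization, so it is illustrative rather than the project's exact normalizer():

import re
import unittest

def normalize(text):
    # Simplified stand-in for the project's normalizer():
    # keep letters only, lowercase, collapse whitespace.
    return " ".join(re.sub("[^a-zA-Z]", " ", text).lower().split())

class TestPreprocessing(unittest.TestCase):
    def test_special_characters_removed(self):
        self.assertEqual(normalize("wow!! Amazing Product!"), "wow amazing product")

    def test_empty_input(self):
        self.assertEqual(normalize(""), "")

    def test_lowercasing(self):
        self.assertEqual(normalize("AMAZING EXPERIENCE!"), "amazing experience")

Running `python -m unittest` in the test directory discovers and executes such cases automatically.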

5.2.1. Test Cases for Unit Testing


Table 1: Unit Testing Results

Test ID | Description | Input | Expected Output | Status
DP01 | Remove stopwords, special characters | "wow!! Amazing Product!" | "amazing product" | Passed
DP02 | Handle empty input string | "" | "" | Passed
DP03 | Lowercase text | "AMAZING EXPERIENCE!" | "amazing experience" | Passed
DP04 | Tokenization accuracy | "Great product, highly recommended!" | ["great", "product", "highly", "recommend"] | Passed
Model Testing:

1) Support Vector Machine

Test ID | Description | Input | Expected Output | Status
SVM01 | Correct prediction for positive | "I love this product!" | Positive | Passed
SVM02 | Correct prediction for negative | "Terrible experience!" | Negative | Passed
SVM03 | Handle neutral sentiment | "It's okay." | Neutral | Passed
SVM04 | Ensure probability output is valid | "Amazing day!" | Probabilities in [0, 1] | Passed

2) Logistic Regression

Test ID | Description | Input | Expected Output | Status
LR01 | Positive sentiment classification | "What a fantastic app!" | Positive | Passed
LR02 | Negative sentiment classification | "Worst service ever!" | Negative | Passed
LR03 | Neutral sentiment classification | "It's fine." | Neutral | Passed
3) Multinomial Naive Bayes

Test ID | Description | Input | Expected Output | Status
NB01 | Predict positive sentiment | "Absolutely great!" | Positive | Passed
NB02 | Predict negative sentiment | "Horrible product." | Negative | Passed
NB03 | Test class distribution handling | "Mediocre experience" | Class prediction succeeds | Passed
4) Fine-Tuned DistilBERT

Test ID | Description | Input | Expected Output | Status
DBERT01 | Predict positive sentiment | "Outstanding service!" | Positive | Passed
DBERT02 | Predict negative sentiment | "Completely dissatisfied." | Negative | Passed
DBERT03 | Handle mixed sentiment | "Good food, bad service." | Accurate probabilities | Passed
DBERT04 | Long text handling | Long reviews | Valid prediction | Passed

5.2.2. Test Cases for Integration Testing

Table 2: Integration Testing Results

Test ID | Description | Input | Expected Output | Status
INT01 | Process raw text through pipeline | "Amazing phone! Bad camera." | Full sentiment pipeline outputs result | Passed
INT02 | Ensure compatibility between components | Preprocessed text to all models | Consistent predictions across models | Passed
INT03 | Batch processing performance | 100 sentences | Execution time within threshold | Passed
5.3. Result Analysis

Each model was evaluated on its ability to classify sentiments correctly. The models
exhibited consistent performance across test cases with slight variances in edge cases and
specific scenarios:

Table 3: Accuracy across models

Model | Accuracy (Test Cases) | Notes
SVM | 100% | Performs well with balanced datasets.
Logistic Regression | 100% | Handles linear relationships effectively.
Multinomial Naive Bayes | 100% | Fast but slightly sensitive to feature representation.
Fine-Tuned DistilBERT | 100% | Superior in handling context, edge cases, and long texts.
Model-specific observations:

• Support Vector Machine (SVM):

Performs well on balanced datasets.

May require tuning for high-dimensional datasets with sparse features.

• Logistic Regression:

Simple and interpretable results.

Handles class imbalances slightly better than SVM but struggles with non-linear
patterns.

• Multinomial Naive Bayes:

Fast and memory-efficient but sensitive to feature scaling.

May underperform on datasets with overlapping classes.

• Fine-Tuned DistilBERT:

Most accurate and adaptable to complex inputs.

Superior at capturing context and sentiment in long, nuanced sentences.

Computationally heavier compared to other models.

CHAPTER 6

CONCLUSION AND FUTURE RECOMMENDATIONS

6.1. Conclusion
This project on “Sentiment Analysis Using Machine Learning” demonstrates a robust
and well-validated pipeline capable of processing raw text and accurately classifying
sentiment using multiple machine learning and deep learning models. The implemented
test cases validate key functionalities, from preprocessing to final predictions, ensuring
high reliability and adaptability. The inclusion of DistilBERT further enhances the system’s
ability to capture nuanced sentiment, making it suitable for diverse applications such as
product reviews, social media analysis, and customer feedback.

The modular testing approach guarantees that each component, from data preprocessing to
model inference, functions as expected and contributes to the overall performance of the
system.

6.2. Future Recommendations

1. Expand Multilingual Support:

• Incorporate multilingual transformers such as mBERT or XLM-R to handle
non-English text effectively.
• Train or fine-tune models on diverse datasets representing different
languages and cultural contexts.

2. Enhance Feature Engineering:

• For traditional models (SVM, Logistic Regression, Naive Bayes), explore
advanced feature representations like TF-IDF bigrams and sentiment
lexicons to improve accuracy on complex datasets.

3. Optimize DistilBERT for Production:

• Apply quantization or model pruning techniques to reduce inference time
and memory usage for large-scale, real-time applications.

4. Incorporate Transfer Learning:

• Experiment with task-specific fine-tuning to improve DistilBERT's
accuracy for domain-specific tasks, such as financial or medical sentiment
analysis.

5. Add Explainability Tools:

• Integrate libraries like LIME or SHAP to provide insights into model
predictions, enhancing trust and interpretability for end-users.

6. Develop an End-to-End Deployment Pipeline:

• Package the system into a user-friendly API or web interface.
• Ensure scalability with cloud deployment and containerization using tools
like Docker and Kubernetes.

REFERENCES

[1] M. Hu and B. Liu, "Mining and summarizing customer reviews," Proceedings of the
10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,
2004. [Online]. Available: https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html

[2] B. Pang, L. Lee, and S. Vaithyanathan, "Thumbs up? Sentiment classification using
machine learning techniques," Proceedings of the ACL-02 Conference on Empirical
Methods in Natural Language Processing (EMNLP), vol. 10, pp. 79–86, 2002. [Online].
Available: https://www.cs.cornell.edu/home/llee/papers/sentiment.pdf

[3] J. Devlin, M. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep
bidirectional transformers for language understanding," Proceedings of the 2019
Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, pp. 4171–4186, 2019. [Online]. Available:
https://arxiv.org/abs/1810.04805

[4] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, "DistilBERT, a distilled version of
BERT: smaller, faster, cheaper and lighter," Proceedings of the 5th Workshop on Energy
Efficient Machine Learning and Cognitive Computing (NeurIPS), 2019. [Online].
Available: https://arxiv.org/abs/1910.01108

[5] A. Vaswani et al., "Attention is all you need," Advances in Neural Information
Processing Systems (NeurIPS), vol. 30, 2017. [Online]. Available:
https://arxiv.org/abs/1706.03762

[6] A. Dashtipour et al., "A Study of Sentiment Analysis: Concepts, Techniques, and
Challenges," International Journal of Advanced Computer Science and Applications
(IJACSA), vol. 7, no. 1, 2016. [Online]. Available:
https://www.researchgate.net/publication/332451019_A_Study_of_Sentiment_Analysis_
Concepts_Techniques_and_Challenges

[7] M. A. Khan et al., "Sentiment analysis of social media content using artificial
intelligence: A comprehensive review," Data and Information Management, vol. 6, no. 2,
pp. 77–94, 2022. [Online]. Available:
https://www.sciencedirect.com/science/article/pii/S2590005622000224

[8] M. Yazdavar et al., "Psychological Stress Detection Using Social Media Data and
Machine Learning Techniques," Frontiers in Psychology, vol. 13, 2022. [Online].
Available: https://www.frontiersin.org/articles/10.3389/fpsyg.2022.906061/full

[9] A. Tulla, "Transformer Architecture Explained," Medium, Jul. 12, 2023. [Online].
Available: https://medium.com/@amanatulla1606/transformer-architecture-explained-
2c49e2257b4c [Accessed: Dec. 16, 2024].

[10] "What is Logistic Regression?" Spiceworks. [Online]. Available:
https://www.spiceworks.com/tech/artificial-intelligence/articles/what-is-logistic-
regression/#lg=1&slide=0 [Accessed: Dec. 16, 2024].

[11] A. Shrivastava, "Sentiment Analysis Dataset," Kaggle. [Online]. Available:
https://www.kaggle.com/datasets/abhi8923shriv/sentiment-analysis-dataset/data.

[12] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, "DistilBERT: A distilled version of
BERT," Hugging Face. [Online]. Available: https://huggingface.co/distilbert/distilbert-
base-uncased.

[13] Python Software Foundation, "Python programming language." [Online]. Available:
https://www.python.org.

[14] Google, "Google Colab: Collaboratory." [Online]. Available:
https://colab.research.google.com.

[15] The Python Community, "Python Libraries: NumPy, Pandas, Matplotlib, Scikit-learn,
and Transformers." [Online]. Available: Respective library websites.

[16] Visual Paradigm, "Online Diagram Tool." [Online]. Available: https://online.visual-
paradigm.com/. [Accessed: Dec. 16, 2024].

[17] OpenAI, "ChatGPT: Language Model by OpenAI." [Online]. Available:
https://chat.openai.com/. [Accessed: Dec. 16, 2024].
