
Twitter Sentiment Analysis

* Note: a project based on NLP and a web scraper

1st Devansh Gupta


SCOPE
VIT Chennai
Chennai, India
devansh2102003@gmail.com

Abstract—This paper investigates the sentiment expressed on Twitter. We employ Natural Language Processing (NLP) techniques to classify a large dataset of scraped tweets into three categories: positive, negative, and neutral. The gathered tweets undergo pre-processing to remove noise and prepare them for sentiment classification. A machine learning model then automatically categorizes each tweet based on the sentiment it conveys. This analysis provides valuable insights into the emotional landscape on Twitter. The findings can inform various stakeholders, including businesses, policymakers, and researchers, by offering a quantitative measure of positive, negative, and neutral sentiment on the platform. We also increase the accuracy of the CardiffNLP BERT model by optimising it on the TweetEval dataset.

Index Terms—[yet to be filled]
I. INTRODUCTION

Social media platforms like Twitter offer a real-time pulse of public sentiment. This paper delves into the emotional landscape on Twitter. We leverage the power of Natural Language Processing (NLP) to analyze a vast collection of tweets. By classifying them as positive, negative, or neutral, we aim to extract valuable insights into how users feel. This analysis goes beyond simply gauging public opinion; it sheds light on the emotional undercurrents of conversation on Twitter. Our findings provide a quantitative measure of sentiment, offering valuable information for businesses, policymakers, and researchers alike. We use techniques such as the focal loss function and the NAdam optimiser to optimise the model.

A. Statement of Problem/Challenges

The proliferation of social media platforms, particularly Twitter, has generated an enormous amount of user-generated content in the form of tweets. Analyzing this vast pool of data presents a significant challenge, especially when attempting to discern the sentiments expressed by users. The research question at the core of this study is: how can sentiment analysis be effectively applied to tweets on Twitter to categorize them into positive, negative, and neutral sentiments? This research aims to address the complexities and challenges associated with sentiment analysis in the context of short-form, informal communication on social media.

B. Background, Context, and Significance of Study

Social media platforms have become integral channels for individuals to express opinions, emotions, and reactions. Twitter, with its brevity and real-time nature, offers a unique dataset for understanding public sentiment. Sentiment analysis, a subfield of natural language processing, plays a crucial role in extracting insights from such data. The study's background stems from the need to comprehend and harness the sentiments expressed on Twitter, considering its influence on public opinion and decision-making processes.

C. Problem Statement

The surge in user-generated content on Twitter has created a vast and dynamic landscape of short-form expressions, presenting a challenge for sentiment analysis. The brevity of tweets, coupled with informal language, abbreviations, and the rapid evolution of trending topics, makes it difficult to accurately categorize sentiments as positive, negative, or neutral. Existing sentiment analysis methods, designed for longer and more formal texts, may fall short in capturing the nuances and contextual intricacies inherent in Twitter data. Consequently, there is a pressing need to develop and refine sentiment analysis techniques tailored specifically for tweets, addressing the unique linguistic characteristics and real-time nature of this social media platform. This research seeks to unravel the complexities associated with sentiment analysis on Twitter, aiming to enhance the accuracy and effectiveness of sentiment classification in this dynamic and fast-paced communication environment.

D. Research Challenges

1) Informal Language and Abbreviations: Tweets often use informal language, abbreviations, and slang, making sentiment analysis challenging. Deciphering the sentiment behind these expressions requires addressing the nuances inherent in casual online communication.
2) Contextual Ambiguity: Twitter conversations are often contextually ambiguous, with users assuming a shared understanding of topics. Disentangling the sentiment in such situations requires a nuanced approach that considers the broader conversation context.
3) Handling Emoji and Emoticons: Users frequently incorporate emojis and emoticons to convey emotions, which may not be adequately captured by traditional sentiment analysis methods. Developing techniques to interpret and integrate these visual elements is crucial for comprehensive sentiment classification.
4) Trending Topics and Virality: Twitter is characterized by rapidly changing trends and viral content. Adapting sentiment analysis to dynamically evolving topics poses a challenge, as the sentiment associated with a keyword or hashtag may shift over time.
5) User Biases and Irony: Users may employ irony, sarcasm, or express sentiments contrary to their literal meaning. Detecting such nuances and accounting for potential biases in sentiment interpretation is a significant challenge.
6) Real-time Processing: Twitter operates in real time, demanding swift and efficient sentiment analysis. Developing algorithms capable of processing large volumes of tweets in near real time is essential for keeping pace with the platform's dynamic nature.
7) Generalization Across Domains: Sentiment analysis models trained on one domain may struggle to generalize to the diverse topics discussed on Twitter. Addressing this challenge involves creating robust models that can adapt to the varied content present in user-generated tweets.

E. Optimising the Pre-trained Model

Sentiment analysis requires a balanced dataset for correct training, together with hyper-parameter optimisation, to reach optimum accuracy.

F. Research Objective

The primary objective of this research is to develop and implement an effective sentiment analysis framework for categorizing tweets on Twitter into positive, negative, and neutral sentiments. The goal is to overcome the challenges posed by the brevity, informal language, and dynamic nature of Twitter content, thereby enhancing the accuracy and applicability of sentiment classification in this social media context, and to find the model that delivers the best accuracy on the dataset.

II. METHODOLOGY

A. Modules/Algorithms/Functionalities/Protocols

The project consists of five main modules:
1) Data Scraping: Selenium is used to automate the login, search, and scraping of tweets.
2) ML Analysis: A RoBERTa-base NLP model, implemented with support from the Transformers library, analyses the fetched tweets; PyTorch is used as a supporting library.
3) User Interface and GUI: The Streamlit library renders the UI, and the Matplotlib library visualizes the analysed data.
4) Database Module: The collected scraped data is stored in a CSV file.
5) Outcome and Result Showcase Module: An integrated part of the GUI and user interface that reflects the outcome of the model analysis.

A minimal sketch of how these modules could be orchestrated is shown below.
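The following is a self-contained sketch of the data flow between the five modules; the function names and the CSV file name are illustrative placeholders, not the project's actual API, and the scraping and classification steps are reduced to stubs so the flow is visible (the real implementations are described in the sections that follow).

import pandas as pd

def scrape_tweets(hashtag: str) -> list[dict]:
    # Placeholder for the Selenium scraping module (Section II-B).
    return [{"username": "user1", "tweet": f"sample tweet about #{hashtag}"}]

def classify(clean_text: str) -> str:
    # Placeholder for the RoBERTa sentiment model (Section II-C).
    return "neutral"

def run_pipeline(hashtag: str, csv_path: str = "tweets.csv") -> pd.DataFrame:
    df = pd.DataFrame(scrape_tweets(hashtag))          # 1) Data Scraping
    df["clean_tweet"] = df["tweet"].str.lower()        # simplified pre-processing
    df["sentiment"] = df["clean_tweet"].map(classify)  # 2) ML Analysis
    df.to_csv(csv_path, index=False)                   # 4) Database module (CSV)
    return df                                          # 3)/5) rendered in the Streamlit UI

if __name__ == "__main__":
    print(run_pipeline("ai"))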
B. Data Collection Approaches/Strategies

We employ Selenium to retrieve tweets directly from Twitter. This involves conducting targeted searches for user-specified hashtags within the Twitter platform. Subsequently, we use BeautifulSoup to scrape the HTML code of the page, extracting tweets along with their associated usernames. The extracted information is then systematically stored in a CSV file for further analysis and reference; a sketch of this step follows.
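The sketch below condenses this collection step under stated assumptions: it omits the automated login handled by the project, and the element selectors are illustrative placeholders because Twitter's markup changes frequently.

import csv
import time
from bs4 import BeautifulSoup
from selenium import webdriver

def scrape_hashtag(hashtag: str, out_csv: str = "tweets.csv") -> None:
    # Sketch: fetch a hashtag search page with Selenium, parse it with
    # BeautifulSoup, and store username/tweet pairs in a CSV file.
    driver = webdriver.Edge()  # the project automates the Edge driver
    try:
        driver.get(f"https://twitter.com/search?q=%23{hashtag}&f=live")
        time.sleep(5)  # crude wait for tweets to render; login is omitted here
        soup = BeautifulSoup(driver.page_source, "html.parser")

        rows = []
        for article in soup.find_all("article"):       # placeholder selector
            user = article.find("a")                   # placeholder: first link ~ profile
            text = article.get_text(" ", strip=True)
            rows.append([user.get("href", "") if user else "", text])

        with open(out_csv, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["username", "tweet"])
            writer.writerows(rows)
    finally:
        driver.quit()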
1) Advantages of the Strategy: The advantage of the described data extraction strategy lies in its ability to dynamically fetch tweets from Twitter using Selenium, allowing for real-time and targeted information retrieval. This strategy enables the collection of relevant data based on user-specified hashtags, providing a tailored approach to gathering information from the platform. It also allows a significant reduction in costs by eliminating the need to invest in expensive Twitter fetching APIs.
2) Limitations of the Strategy: One limitation of this strategy is its reliance on web scraping techniques, which can be susceptible to changes in the Twitter website's structure. If Twitter modifies its HTML or implements anti-scraping measures, it may disrupt the data extraction process and require continuous adaptation to ensure the strategy's effectiveness.
3) Potential Risks: The potential risks associated with this method include the violation of Twitter's terms of service. Automated data scraping may be against the platform's policies, leading to the suspension or restriction of the account performing the scraping. Additionally, the strategy may face legal implications if not conducted in compliance with data protection regulations or if it involves scraping personal information without explicit consent.
4) Ethical Issues Concerning the Subjects/Participants: There are ethical considerations related to the collection of tweets and usernames from Twitter users. Privacy concerns arise, as individuals may not be aware that their publicly available tweets are being systematically gathered and stored. Transparency and consent become crucial in ensuring ethical data collection, and it is important to respect users' rights and preferences regarding the use of their information. Clear communication and adherence to ethical guidelines are essential to mitigate any ethical concerns associated with this data extraction method.

C. Data Analysis Approaches

The gathered data, post web scraping, is stored in a CSV file. Subsequently, the encoded tweets within the CSV file undergo a preprocessing stage encompassing tokenization, removal of non-alphabetic characters, elimination of excess spaces, and other relevant transformations. The resultant refined strings, termed "clean tweets," are then stored in another CSV file. Following this preprocessing phase, the tweets are analysed using the RoBERTa BERT model, which categorizes each tweet into positive, negative, or neutral sentiments based on its learned representations and contextual understanding. This multi-step process ensures the extraction of meaningful insights from the collected Twitter data; we use the optimised version of the CardiffNLP model. A sketch of the cleaning step is given below.
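As a rough illustration of the pre-processing described above (the exact transformations and column names used in the project may differ), the following sketch keeps only alphabetic tokens, collapses extra whitespace, and writes the result to a separate "clean tweets" CSV file.

import re
import pandas as pd

def clean_tweet(text: str) -> str:
    # Strip URLs, @-mentions, and '#' symbols, then keep alphabetic tokens only.
    text = re.sub(r"http\S+|@\w+|#", " ", str(text))
    tokens = re.findall(r"[A-Za-z]+", text)
    return " ".join(tokens).lower().strip()   # collapse excess spaces, lowercase

df = pd.read_csv("tweets.csv")                 # output of the scraping module
df["clean_tweet"] = df["tweet"].map(clean_tweet)
df.to_csv("clean_tweets.csv", index=False)     # stored for the analysis stage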

III. PROPOSED SYSTEM

A. Proposed System Introduction

• The proposed system employs a sophisticated approach for Twitter data extraction, utilizing Selenium for dynamic fetching based on user-specified hashtags. This strategy enables real-time and targeted information retrieval, contributing to a more tailored data collection process. The extracted data, encompassing tweets and associated usernames, is then processed through BeautifulSoup for structured analysis.
• To enhance cost-effectiveness, the system eschews the acquisition of expensive Twitter fetching APIs. Instead, it leverages web scraping techniques for direct data extraction from the Twitter platform. This cost-saving measure is particularly advantageous, given the flexibility and adaptability inherent in the Selenium and BeautifulSoup combination.
• Following data extraction, the collected information is systematically stored in a CSV file. Subsequently, a preprocessing pipeline is applied to the encoded tweets within the CSV file. This process involves tokenization, removal of non-alphabetic characters, elimination of excess spaces, and other relevant transformations. The resultant refined strings, denoted "clean tweets," are then stored in a separate CSV file.
• In the final stage of the proposed system, the clean tweets undergo analysis using the RoBERTa BERT model. This model, renowned for its proficiency in natural language processing tasks, categorizes each tweet into positive, negative, or neutral sentiments. This comprehensive approach ensures not only efficient data collection but also meaningful sentiment analysis, providing valuable insights into the sentiments expressed within the extracted Twitter data.

Fig. 3. Proposed system diagram.

The original CardiffNLP model is trained with the cross-entropy loss function and the Adam optimizer; the optimised model instead uses the focal loss function and the NAdam optimizer.

Fig. 1. Loss function for different algorithms.

Fig. 2. Accuracy comparison of different algorithms.

A minimal sketch of this optimizer and loss swap is given below.
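The sketch assumes the cardiffnlp/twitter-roberta-base-sentiment checkpoint and illustrative hyper-parameter values; it is not the paper's exact training code, only one way the swap could look in PyTorch.

import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "cardiffnlp/twitter-roberta-base-sentiment"   # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
optimizer = torch.optim.NAdam(model.parameters(), lr=2e-5)   # NAdam in place of Adam

def training_step(texts: list[str], labels: torch.Tensor, gamma: float = 2.0) -> float:
    # One fine-tuning step with a focal-style loss instead of plain cross-entropy.
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    logits = model(**enc).logits
    ce = F.cross_entropy(logits, labels, reduction="none")   # -log p_t per example
    pt = torch.exp(-ce)                                      # probability of the true class
    loss = ((1.0 - pt) ** gamma * ce).mean()                 # focal loss (gamma assumed)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()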
B. List of Modules

1) User Interface and GUI Module
2) Data Extraction Module
3) Data Analysis Module
4) Database Module
5) Outcomes and Results Showcase Module

1) User Interface and GUI Module: Streamlit serves as the graphical user interface (GUI) framework for the project, using this streamlined Python library to enhance user interaction and simplify data manipulation. Designed with a focus on user-friendliness, it provides a seamless experience. Two key inputs, the hashtag and the refresh rate, play a pivotal role in exploring Twitter data. The hashtag string guides the system to relevant tweets, shaping a targeted analysis. The refresh rate not only determines how frequently new tweets are fetched but also governs the entire workflow. The synergy between user inputs and system response ensures a fine-tuned and adaptive approach to data collection. The integration of hashtag specificity and refresh-rate control forms the backbone of the module, enabling a tailored and agile analysis process. The Results tab in the Streamlit interface serves as a polished showcase of outcomes and statistics post-analysis; it not only presents results but also enhances the user experience by delivering comprehensive insights in a visually appealing and understandable manner. In summary, Streamlit elevates the project's interface to a level that aligns with industry standards, seamlessly combining user inputs, data processing, and results presentation into a professional and comprehensive solution for effective data analysis and visualization. A minimal sketch of this interface is given below.
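The widget labels, tab names, and CSV file name in the following Streamlit sketch are illustrative assumptions rather than the project's exact code; it shows only how the two inputs and the Results view could be wired together.

import pandas as pd
import streamlit as st

st.title("Twitter Sentiment Analysis")

hashtag = st.text_input("Hashtag to search", value="#example")
refresh_rate = st.slider("Refresh rate (seconds)", 30, 600, 120)

show_tab, results_tab = st.tabs(["Show", "Results"])

with show_tab:
    st.write(f"Fetching tweets for {hashtag} every {refresh_rate} s ...")
    # In the full system the scraping and analysis pipeline would run here.

with results_tab:
    try:
        df = pd.read_csv("clean_tweets.csv")               # produced by earlier modules
        st.dataframe(df)                                    # per-tweet model output
        if "sentiment" in df.columns:
            st.bar_chart(df["sentiment"].value_counts())    # summary statistics
    except FileNotFoundError:
        st.info("No analysed data available yet.")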
2) Data Extraction Module: The procedural steps commence with the identification and isolation of hashtags within a provided string. Following this, the Selenium library is used to launch an automated Edge driver, which executes the web scraping script. Leveraging the capabilities of the Beautiful Soup module, the HTML of the page returned by targeted tweet searches is acquired. Subsequently, both the username and the tweet content are extracted and saved in a CSV file. Notably, this data extraction methodology proves advantageous by circumventing the need for the official Twitter API. This not only streamlines the process but also translates into tangible cost savings. By sidestepping reliance on the official API, this approach offers a flexible and resource-efficient means of gathering relevant Twitter data for further analysis or applications, contributing to a more versatile and cost-effective workflow.

3) Data Analysis Module: Following the extraction of data and its storage in a CSV file, a subsequent step transforms the UTF-8 encoded tweets into clean text. This process includes tasks such as removing extra spaces and retaining only alphabetic characters. The refined tweets are then fed into a sentiment analysis model. Specifically, the model used in this context is the RoBERTa BERT model, imported from the transformers module by Hugging Face; PyTorch library support is essential for the proper functioning of this model. The results of the sentiment analysis are appended in a separate column within the CSV file, providing valuable additional insights alongside the original tweet data. A sketch of this step is shown below.
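A condensed sketch of this module follows; the checkpoint name, the label order, and the column names are assumptions for illustration, and batching over large files is omitted for brevity.

import pandas as pd
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "cardiffnlp/twitter-roberta-base-sentiment"   # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
LABELS = ["negative", "neutral", "positive"]           # assumed label order

def predict(texts: list[str]) -> list[str]:
    # Classify a batch of cleaned tweets and return human-readable labels.
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    return [LABELS[i] for i in logits.argmax(dim=-1).tolist()]

df = pd.read_csv("clean_tweets.csv")
df["sentiment"] = predict(df["clean_tweet"].fillna("").tolist())
df.to_csv("clean_tweets.csv", index=False)             # results appended as a new column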
The focal loss function, unlike standard cross-entropy, introduces a focusing parameter that modulates the loss based on the predicted probabilities of the classes. This parameter down-weights the contribution of well-classified examples (where the predicted probabilities are high) and amplifies the loss for hard, misclassified examples (where the predicted probabilities are low), as expressed below.
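A common formulation, for the predicted probability p_t of the true class and a focusing parameter gamma >= 0 (the specific gamma and any class weighting alpha_t used in the paper are not stated), is:

% Focal loss for the true-class probability p_t; gamma is the focusing
% parameter and alpha_t an optional class-balancing weight.
\mathrm{FL}(p_t) = -\,\alpha_t\,(1 - p_t)^{\gamma}\,\log(p_t)
% gamma = 0 recovers (weighted) cross-entropy; larger gamma down-weights
% well-classified examples, i.e. those with p_t close to 1.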
4) Database Module: CSV files serve as the storage medium for the information collected. Initially, the database created from data scraping contains details such as usernames and their tweets. Following the analysis, the database is expanded to include additional fields: username, password, the refined text of the tweet, and the outcomes generated by the analytical model.

5) Outcomes and Results Module: This module is part of the user interface; in the Show tab of the Streamlit interface we can see the results of the analysis: the model's output for every tweet as well as statistics over the database and its analysis. The SciPy and Matplotlib libraries are used here, as sketched below.
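The following small sketch shows one way the statistics view could be produced with Matplotlib; the column name and file name are assumptions carried over from the earlier sketches.

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("clean_tweets.csv")        # database produced by the pipeline
counts = df["sentiment"].value_counts()     # tweets per sentiment class

fig, ax = plt.subplots()
counts.plot.bar(ax=ax)                      # sentiment distribution
ax.set_xlabel("Sentiment")
ax.set_ylabel("Number of tweets")
ax.set_title("Sentiment distribution of scraped tweets")
plt.tight_layout()
plt.show()                                  # inside Streamlit: st.pyplot(fig)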
IV. RESULTS AND OUTPUTS

Fig. 4. Web application home page.

Fig. 5. Extracted tweets and usernames.

Fig. 6. Sentiment analysis visualization.

Fig. 7. Tweet and username based on a particular word it contains (in this case 'bomb').

Fig. 8. Accuracy of the CardiffNLP model on the TweetEval dataset.

Fig. 9. Improved accuracy of the custom model on the TweetEval dataset.

V. CONCLUSION

We have achieved the following in the implementation of an API-less content moderation web application:
- Implemented content moderation capabilities without the need for an API.
- Developed the ability to understand and flag tweets that are harmful to social media platforms.
- Optimized the CardiffNLP BERT pre-trained model by utilizing a focal loss function instead of the cross-entropy loss and by employing the NAdam optimizer as opposed to the Adam optimizer.

This optimization was carried out on the TweetEval dataset, resulting in improved performance and model efficiency.