ASS7 Pyspark1
Istanbul, Turkey.
3 Centre for Industrial Software, University of Southern Denmark,
Sønderborg, Denmark.
Abstract
Online social media platforms have recently become integral to our society and
daily routines. Every day, users worldwide spend a couple of hours on such platforms,
expressing their sentiments and emotional states and contacting each other.
Analyzing the huge amounts of data from these platforms can provide clear insight
into public sentiment and help detect users' mental states. The early identification
of such health risks may assist in preventing or reducing suicidal ideation and
potentially save people's lives. Traditional techniques have become ineffective in
processing such streams and large-scale datasets. Therefore, this paper proposes a
new methodology based on a big data architecture to predict suicidal ideation from
social media content. The proposed approach analyzes social media data in two
phases: batch processing and real-time streaming prediction. The batch dataset was
collected from the Reddit forum and used for model building and training, while
streaming big data was extracted using the Twitter streaming API and used for
real-time prediction. After the raw data was preprocessed, the extracted features
were fed to multiple Apache Spark ML classifiers: NB, LR, LinearSVC, DT, RF, and MLP.
We conducted experiments using several feature-extraction techniques under
different testing scenarios. The experimental results of the batch processing phase
showed that the (Unigram + Bigram) + CV-IDF features with the MLP classifier
provided the best performance for classifying suicidal ideation, with an accuracy
of 93.47%; this model was then applied in the real-time streaming prediction phase.
Keywords: Big Data, Suicidal ideation, Apache Spark, Apache Kafka, Social Media
1 Introduction
Suicidal ideation is a serious public health concern. The number of people
experiencing suicidal ideation is increasing at an alarming rate every year.
According to a report issued by the World Health Organization (WHO), more than
703,000 people die by suicide annually, which means roughly one person dies every
45 seconds due to suicide. In addition, there are 25 suicide attempts for each
suicide, and many more people have had serious thoughts about suicide [1]. Suicidal
ideation has continuously been linked to emotional states such as depression and
hopelessness [2]. The early detection of suicidal ideation may help to prevent many
suicide attempts and identify individuals needing psychosocial support.
Traditional methods and programs for suicide prevention are still reactive and
require patients to take the initiative to seek medical help. However, many patients
are not highly motivated to seek the necessary support. Owing to their anonymity,
online social media platforms have become an alternative space for people to express
their honest feelings or thoughts about their suffering or healthcare issues without
fearing stigma or revealing their real identities, as they would in face-to-face
conversation [3].
This content is considered a valuable source for detecting high-risk instances of
suicidal ideation and uncovering these dangerous intentions before they become
irreversible or the sufferers end their lives. Sufferers may show suicidal
intentions online through brief ideas or detailed planning. Social media have been
successfully leveraged to assist in detecting physical and mental illnesses more
easily [4]. Therefore, researchers have begun using online postings to detect
suicidal ideation manually or with the help of machine learning techniques [5].
Manual identification of suicidal ideation has become more challenging due to the
vast amount of content on social media platforms.
Moreover, social media posts are generated as streaming data in real time, and
real-time systems require direct input and rapid processing capability to make
decisions in a short time [6]. Several problems must be addressed before developing
a real-time analytics system. The first is to provide a reliable and efficient
framework for distributing data without losing accuracy; Apache Spark¹ and Apache
Kafka² are examples of such frameworks. Most big-data research in healthcare focuses
on the technical aspects of big data. Another problem is that streaming data
involves high-velocity, continuous data generation. Hence, processing such a huge
data stream in real time using a traditional system environment may result in
system bottlenecks.
¹ https://spark.apache.org/
² https://kafka.apache.org/
The presented work aimed to build an effective real-time model using a big data
analytics system to predict a person's suicidal ideation at an early stage based on
their social media posts. We focused primarily on a social media platform where
people talk about different mental health issues and offer each other help. Reddit's
topic-specific subreddits were used as the source of the historical data used to
train the batch processing phase classifiers. The Reddit data contains postings from
subreddits titled “Suicide Watch” and “Teenagers.” In addition, we extracted
streaming tweets using the Twitter API for real-time prediction. The API offers a
collection of ways to communicate with the application. We employed Spark Structured
Streaming to handle the streaming of tweets in real time.
Some notable contributions made by this paper include the following:
• This paper proposed a scalable predictive system that can analyze large volumes and
high-velocity streaming data in real-time using “big data” architecture to predict
suicidal ideation cases that require special attention.
• We conducted various experiments with multiple Apache Spark ML algorithms using
three feature extraction techniques: TF-IDF, N-gram, and CountVectorizer, with
various combinations and testing scenarios.
• We performed optimization techniques to achieve high prediction accuracy. The
proposed system achieved significant performance on both batch and real-time
streaming phases of suicidal ideation prediction.
The remainder of the paper is structured as follows: Section 2 briefly reviews the
literature on detecting suicidal ideation. Section 3 describes the proposed methodology
and big data architecture that is used for detecting suicidal ideation. Section 4 presents
the experimental setup, including a performance analysis comparison of the supervised
classification algorithms used with various testing-type scenarios and an exploration
of the data analysis. Section 5 discusses the results obtained and the notable findings
of the work. Finally, Section 6 concludes the paper and describes the future work.
2 Literature Review
Sentiment analysis has attracted attention as a research topic in various fields
such as finance [7], public health [8], product reviews [9], voting behavior [10],
politics [11], and social events [12]. Although approaches, methods, and models vary
across domains, sentiment analysis and prediction tasks often produce useful and
interesting results. From the perspective of monitoring suicidal ideation and mental
state, several studies have analyzed social media data using natural language
processing (NLP) and sentiment analysis, investigating different aspects
[13, 14, 15].
In the study conducted by S. Jain et al.[13], two datasets were used to develop
a machine learning-based method for predicting suicidal behaviors depending on the
depression stage. The first dataset was collected by creating a questionnaire from
students and parents and then classifying the depression according to five severity-
based stages. The XGBoost classifier reported a maximum accuracy of 83.87% in this
dataset. The second dataset has been extracted from Twitter. Tweets were classified
according to whether the user had depression. They found that the Logistic Regression
algorithm exhibited the highest performance and achieved an accuracy of 86.45%.
N. Wang et al. [16] proposed a deep-learning (DL) architecture and evaluated three
additional machine learning (ML) models that analyze an individual's content to
automatically identify whether the person will attempt suicide, considering periods
from 30 days to 6 months before the attempt. They created and extracted three
handcrafted feature sets to detect suicide risk using the three-phase theory of
suicide and earlier work on emotions and pronouns among people who exhibit suicidal
thoughts.
R. Sawhney et al. [14] proposed a new supervised method for identifying suicidal
thoughts using a manually annotated Twitter dataset. They used a set of features to
train linear and ensemble classification algorithms. Comparisons were also made with
baseline models applying different methodologies, including LSTMs, negation
resolution, and rule-based approaches. The most significant contribution of their
work was the performance enhancement of the Random Forest algorithm, which
outperformed the other classifiers and baselines.
Similarly, M. Chatterjee et al. [17] analyzed Twitter content and identified
features that can hold signs of suicidal ideation. Multiple ML algorithms were
applied, including LR, RF, SVM, and XGBoost, to evaluate the effectiveness of the
suggested approach. The study extracted and combined multiple features from Twitter
data, including sentiment analysis, emoticons, linguistic and statistical features,
TF-IDF, N-gram, temporal features, and topic-based features (LDA). The empirical
findings showed that the Logistic Regression classifier achieved an accuracy of 87%.
A. E. Aladağ et al. [18] applied text mining to post titles and bodies to build a
classification model that differentiated between suicidal and non-suicidal postings.
The needed features were extracted using various techniques, including TF-IDF, word
count, linguistic inquiry, and sentiment analysis of the titles and bodies of the
posts. In addition, several classification algorithms were applied. The suicidality
of posts was correctly classified using Logistic Regression (LR) and Support Vector
Machine (SVM) classifiers, which achieved an accuracy of 80% and an F1 score of 92%.
By using data collected from electronic medical records in mental hospitals, N. J.
Carson et al. [19] built and evaluated an NLP-based machine-learning approach to
detect suicidal behaviors and thoughts among young people.
A. Roy et al. [20] evaluated psychological weight factors, including depression,
hopelessness, loneliness, stress, anxiety, burdensomeness, and insomnia. Furthermore,
the sentiment polarity and Random Forest (RF) algorithm were applied with ten
estimated psychological measures for predicting SI within tweets and achieved an 88%
AUC score.
On the other hand, V. Desu et al. [15] proposed an approach that uses various ML and
DL algorithms, such as XGBoost, SVM, and ANN, implemented on a Spark cluster with
multiple nodes, to detect individuals who suffer from depression and suicidal
thoughts and require urgent assistance or support by analyzing their social media
content. The proposed ANN model outperformed all other baseline algorithms and
registered the best accuracy rate of 76.80%.
M. J. Vioules et al.[21] developed a novel method that uses Twitter data to identify
suicide warning signs in users and detect postings containing suicidal behaviors. The
key contribution of their method is its ability to detect sudden changes in users’
online behavior. To identify these changes, they employed NLP algorithms with a
martingale framework to collect behavioral and textual features. The experimental
results demonstrated that their text-scoring method could detect warning signs in a
text more effectively than standard machine learning classifiers.
W. Jung et al. [22] designed multiple machine-learning models and analyzed
suicidality using Twitter data. The models were trained using 1097 suicidal and 1097
nonsuicidal tweets. They explored metadata and text-feature extraction to construct
efficient prediction models and trained the classifiers using Random Forest and
Gradient-boosted tree (GBT) algorithms. The experiments were conducted using
multiple features to construct a robust classifier. The model performed well,
achieving an F1-Score of 84.6%.
M. M. Tadesse et al. [23] used NLP techniques to identify depressive content
generated by users on the Reddit social website. The study primarily focused on
deploying and evaluating several feature extraction approaches, such as LIWC,
N-grams, and topic modeling using LDA, to achieve the highest performance. The
authors applied several classification algorithms, including LR, SVM, RF, Adaptive
Boosting (AB), and Multilayer Perceptron (MLP), to evaluate the risk of depression
among users. The experimental results were recorded in a confusion matrix to measure
the models' performance. The Multilayer Perceptron (MLP) model with the combination
of LIWC, Bi-gram, and LDA features gave the best performance for depression
identification, with 91% accuracy and an F1 score of 93%.
M. Birjali et al. [24] provided a method for building a suicide vocabulary to
address the lack of lexical resources. To improve their analysis, they investigated
the use of Weka as a data mining tool, applying machine learning methods to gain
valuable insights from Twitter data gathered via Twitter4J. The dataset was built
using 892 tweets. They also introduced an algorithm that uses WordNet to perform
semantic analysis on the training set and the tweets' data set, allowing practical
semantic similarity computation. They demonstrated the efficacy of using a
machine-learning-based technique on Twitter content as a suicide prevention method.
The empirical findings showed that the Naïve Bayes algorithm achieved a Precision of
87.50%, a Recall of 78.8%, and an F1 value of 82.9%.
N. A. Baghdadi et al. [5] presented a detailed framework for text content
classification, specifically for Twitter content. The trained model was employed to
classify tweets as “Suicide” or “Normal.” The dataset contains 14,576 tweets.
Additionally, the study provided preprocessing methods specifically designed for
Arabic tweets. The dataset was annotated by multiple annotators, and the framework's
effectiveness was evaluated using various assessment methods, including a Weighted
Scoring Model (WSM). Both USE and BERT classifier models were also explored. The WSM
models registered the highest weighted sum of 80.20%.
3 Proposed Methodology
Real-time streaming analysis of social media content can provide helpful and
up-to-date information on individuals with mental health problems. Current analytics
methods that analyze massive volumes of social media content offline are not robust
or responsive enough to support real-time decision-making under critical conditions.
Thus, analysis methods must be built to provide effective real-time stream
prediction. The methodology comprises two phases: batch processing and real-time
streaming prediction.
Our system methodology was built on four primary components: the input source
system, where the system obtains the stream data (Apache Kafka); the stream data
processing, where the stream data are processed (Apache Spark Structured Streaming);
the classification algorithms (Apache Spark ML); and the sink node, where the final
results are analyzed and visualized (Power BI). We built several Apache Spark ML
models using multiple feature extraction techniques. We also compared the
classification performance of these models using various evaluation methods to
determine the optimal architecture for predicting suicide-related posts from
real-time Twitter streaming data. Figure 1 provides an overview of the proposed
methodology and the experimental workflow used in this work.
Fig. 1 Proposed methodology for predicting suicidal ideation on social media content
content with high velocity in real-time streaming data using a distributed big data
environment.
which were collected using the “Pushshift” API. The dataset comprised 232,074 posts
collected between Dec. 16, 2008, and Jan. 2, 2021, of which 116,037 were classified
as suicidal and 116,037 as nonsuicidal. We cleaned and preprocessed the dataset to
remove duplicate posts, empty rows, and unnecessary columns. After the preprocessing
step, the dataset contained 232,042 rows, including 116,028 suicidal and 116,014
nonsuicidal instances. For our analysis, we used only the post content and target
columns. Some batch data samples are presented in Table 1.
3.2.4 Tokenizing
The tokenization step is essential for any natural language processing (NLP)
pipeline and has a considerable influence on the remaining phases of the pipeline.
It breaks down the text data into individual, more meaningful units, including
words, punctuation marks, symbols, and abbreviations, to make data exploration more
accessible. Each unit produced by this process is known as a token [29]. These
tokens were then used as input data for the processing pipeline.
Fig. 2 The primary steps in preprocessing the raw dataset
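The tokenization step described above can be illustrated with a minimal pure-Python sketch. The regular expression `TOKEN_PATTERN` and the function `tokenize` are hypothetical names chosen for this illustration; the paper does not state which tokenizer implementation or pattern was used, so this is only one plausible way to split a post into word, number, and symbol tokens.

```python
import re

# Hypothetical pattern (an assumption, not the paper's): a word with optional
# internal apostrophes, a run of digits, or any single non-space symbol.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*|\d+|[^\w\s]")


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word, number, and symbol tokens."""
    return TOKEN_PATTERN.findall(text.lower())


tokens = tokenize("I can't sleep, and I feel alone.")
print(tokens)
```

Punctuation is kept as separate tokens here so that a later cleaning step can decide whether to drop it.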
3.2.6 Lemmatizing
The input data was lemmatized at this step. Lemmatizing removes inflectional endings
and returns each word in the dataset to its base or dictionary form. It requires a
comprehensive vocabulary and morphological analysis of the words. Among various
lemmatization methods, we focused on rule-based approaches using
“WordNetLemmatizer.” It employs a pre-established set of morphological and syntactic
rules to find the lemma of each word within the input text. Lemmatization helps to
reduce the dimensionality and the vocabulary size of textual data, which leads to
improved performance of analytical techniques.
3.2.7 Dataset Splitting
To train and evaluate the classification models, it is necessary to split the
dataset. Therefore, we divided the entire historical Reddit data into two subsets:
80% of the dataset was used as training data, and the remaining 20% was held out as
unseen testing data. The classification models were trained and optimized using the
training data to determine the most accurate features. The testing data (unseen
data), on the other hand, was used to assess the effectiveness of the classification
models. Table 2 provides descriptive statistics for the testing and training sets.
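As a rough illustration of the 80/20 split, the following pure-Python sketch shuffles a list of labeled posts with a fixed seed and cuts it at 80%. The function name `train_test_split`, the seed value, and the toy `posts` list are assumptions made for this example. The paper does not say which routine performed the split; in PySpark, `DataFrame.randomSplit([0.8, 0.2], seed=...)` would be one option.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle the rows reproducibly and cut them into training and testing parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy data (hypothetical): 100 (text, label) pairs with alternating labels.
posts = [(f"post {i}", i % 2) for i in range(100)]
train, test = train_test_split(posts)
print(len(train), len(test))
```

Fixing the seed keeps the split reproducible across runs, which matters when several classifiers are compared on the same held-out data.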
\[
\mathrm{IDF}(t) = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents that contain the term } t}\right) \tag{2}
\]
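Equation 2 can be made concrete with a small pure-Python sketch that builds unigram and bigram terms, counts document frequencies, and applies log(N / df). The helper names `ngrams` and `idf_table`, the toy documents, and the choice of the natural logarithm are assumptions for illustration only. Note also that Spark ML's `IDF` uses a smoothed variant, log((N + 1) / (df + 1)), so its values differ slightly from the unsmoothed formula shown here.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def idf_table(documents):
    """Map each term to log(N / df), following Equation 2 (natural log assumed)."""
    n_docs = len(documents)
    doc_freq = Counter()
    for terms in documents:
        doc_freq.update(set(terms))  # count each term once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


# Toy corpus (hypothetical), already tokenized.
docs = [
    ["i", "feel", "alone"],
    ["i", "feel", "fine"],
    ["see", "you", "soon"],
    ["feel", "free", "to", "ask"],
]

# Unigram + bigram terms for every document.
features = [tokens + ngrams(tokens, 2) for tokens in docs]
idf = idf_table(features)

# Count-based weighting of the first document: raw term count times IDF.
weights = {term: count * idf[term] for term, count in Counter(features[0]).items()}
print(weights)
```

Terms that occur in many documents, such as "feel" in this toy corpus, receive a lower weight than rare terms such as "alone".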
Decision Tree Classifier (DT): Decision Tree algorithm is a common machine-
learning method categorized as a non-parametric supervised algorithm [37]. It is a
hierarchical model designed as a tree structure. DT is typically composed of multiple
levels beginning from the root node. Every interior node holds at least one child,
representing the evaluation of an input feature or variable. Based on the results of a
decision test, the branching procedure will repeat itself, directing the corresponding
child node along the suitable path, and this process continues until the last leaf node.
The optimal tree is the shortest tree that can correctly categorize all data points and
has the fewest splits.
TP: The model predicted the positive class for a post, and the actual post class is
also positive.
TN: The model predicted the negative class for a post, and the actual post class is
also negative.
FP: The model predicted the positive class for a post, whereas the actual post class
is negative.
FN: The model predicted the negative class for a post, whereas the actual post class
is positive.
Accuracy: It is the most popular and straightforward way of measuring the model’s
performance. Accuracy is the ratio of samples that have been properly classified
compared to the whole number of samples, as shown in Equation 4:
\[
\mathrm{Accuracy} = \frac{TP + TN}{TP + FN + TN + FP} \times 100\% \tag{4}
\]
F1-Score: It is the harmonic mean of the precision and recall scores. Using the
F1-score, we can evaluate an ML classifier's performance on all data classes. The
F1-Score is defined in Equation 7:
\[
\mathrm{F1\text{-}Score} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \times 100\% \tag{7}
\]
                                  Actual values
Predicted values      Actual-Pos.      Actual-Neg.
Pos. predicted        TP               FP
Neg. predicted        FN               TN
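The evaluation metrics can be computed directly from the four confusion-matrix counts, as in the following sketch. The function name `classification_metrics` and the example counts are hypothetical and are not results from the paper. Precision and recall use their standard definitions, TP / (TP + FP) and TP / (TP + FN), and the sketch assumes that no denominator is zero.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall, and F1-score as percentages."""
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {
        "accuracy": 100 * accuracy,
        "precision": 100 * precision,
        "recall": 100 * recall,
        "f1": 100 * f1,
    }


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=90, fp=10, fn=5, tn=95)
print(metrics)
```

Because the F1-score is a harmonic mean, it stays low whenever either precision or recall is low, which makes it more informative than accuracy when the two error types matter differently.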
3.2.11 Model Saving
This stage is the last step of the batch processing phase. It saves the
highest-performing model from the batch processing phase so that it can be used as
the predictive model in the real-time streaming prediction phase.
The collected tweets were then ingested as data streams into the Apache Kafka input
topic. Spark Structured Streaming consumed the stream of tweets from the Kafka topic
in real time into an unbounded table. We implemented several preprocessing steps to
refine the tweet stream effectively. These steps involve removing irrelevant
information, reducing noise, and extracting the appropriate stream data. After
preprocessing and cleaning the streaming tweets, we generated a feature vector and
fed it into the most accurate model developed and trained in the batch processing
phase to predict suicidal ideation in real time. The prediction results were then
pushed to and buffered in a Kafka output topic, “Predicted-tweets,” before being
consumed by the Power BI application to visualize the final prediction results in
real time.
³ https://developer.twitter.com/en/docs/tutorials/consuming-streaming-data
⁴ https://docs.tweepy.org/en/latest/index.html
4 Experimental Setup and Performance Analysis
4.1 Experimental Setup
The proposed Apache Spark-based architecture was implemented using the “PySpark”
library to build the classification algorithms: NB, LR, LinearSVC, DT, RF, and MLP.
The Apache Spark cluster was installed on a laptop with 64 GB of RAM, a 1 TB SSD,
and an Intel Core i7 CPU (14 cores, 20 logical processors). In addition, we
integrated multiple API libraries for the implementation. The ML library of Apache
Spark was used to develop the classification algorithms. Apache Kafka version
“2.0.2” was deployed as the input system for ingesting data streams from Twitter,
and Tweepy version “4.10.0” was used to connect to the Twitter API. Spark Structured
Streaming was applied to receive and process stream tweets from Kafka topics, and
the Power BI application was used to visualize the real-time streaming prediction
results.
Fig. 3 Word cloud representation of suicidal-related postings
Fig. 4 Word cloud representation of non-suicidal-related postings
(LR), Linear Support Vector classifier (LinearSVC), Decision Tree (DT), Random
Forest (RF), and Multilayer Perceptron (MLP). The algorithms were trained and
evaluated using data from the Reddit forum with three different strategies for
feature extraction: TF-IDF, N-gram, and the CountVectorizer technique. Multiple
combinations of these feature extraction methods were implemented to extract the
essential features. A hyperparameter tuning strategy was adopted to find the optimal
parameter settings for each model configuration. Two methods are commonly employed
for hyperparameter tuning: random search and grid search.
In this work, we used grid search as the hyperparameter tuning technique in the
experiments. The grid search process aims to find the most suitable parameter values
for each classifier to enhance the overall performance.
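Grid search can be sketched in a few lines of pure Python: every combination of candidate values is scored, and the best one is kept. The names `grid_search`, `param_grid`, and `toy_score`, as well as the candidate values, are hypothetical; the paper does not list the grids it searched. In Spark ML, `ParamGridBuilder` together with `CrossValidator` provides this functionality, although the source does not confirm that these classes were used.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Score every parameter combination and return the best one with its score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid; the parameter names mirror common MLP settings.
param_grid = {"maxIter": [50, 100, 200], "stepSize": [0.01, 0.03, 0.1]}


def toy_score(params):
    """Stand-in for a validation score; it peaks at maxIter=100, stepSize=0.03."""
    return -abs(params["maxIter"] - 100) / 100 - abs(params["stepSize"] - 0.03)


best_params, best_score = grid_search(param_grid, toy_score)
print(best_params)
```

In practice `score_fn` would train a classifier with the given parameters and return its cross-validated score, so the cost grows with the product of the grid sizes.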
Furthermore, we used 10-fold cross-validation, a widespread and reliable technique
for minimizing overfitting, enhancing the validity and reliability of the
classification models, and balancing bias and variance. With the 10-fold
cross-validation strategy, the given data were randomly subdivided into ten subsets
of the same size; in each round, one subset was used for validation, while the other
nine subsets were used for training. Cross-validation was executed ten times, with
each of the ten subsets used for validation exactly once. To get a final estimate,
the results were averaged across the ten folds. Table 4 and Figures 5, 6, 7, and 8
illustrate the experimental results and comparative performance assessment of
multiple Spark ML classifiers using a binary classification evaluator.
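The 10-fold procedure can be sketched as follows. The functions `k_fold_indices` and `cross_validate`, the seed, and the toy evaluator are assumptions for illustration and do not reproduce the paper's implementation. When the number of samples is not divisible by k, the fold sizes in this sketch differ by at most one.

```python
import random


def k_fold_indices(n_samples, k=10, seed=42):
    """Yield (training, validation) index lists; each fold validates exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


def cross_validate(n_samples, evaluate, k=10):
    """Average the score returned by `evaluate` over the k folds."""
    scores = [evaluate(train, val) for train, val in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


# Toy evaluator (hypothetical): returns the validation share of each round.
mean_score = cross_validate(100, lambda train, val: len(val) / (len(train) + len(val)))
print(mean_score)
```

A real evaluator would fit the classifier on the training indices and return a metric such as accuracy or AUC on the validation indices.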
From all experimental results, we found that the Multilayer Perceptron (MLP)
classifier outperformed the other classification algorithms, achieving the highest
accuracy of 93.47% and an AUC score of 98.12%. The Logistic Regression (LR)
classifier also performed well and achieved the second-best performance, with an
accuracy of 92.14%. In addition, the results showed no significant performance
difference between the Linear Support Vector classifier (LinearSVC) and Naïve Bayes
(NB). Unexpectedly, we found that Decision Tree (DT) and Random Forest (RF)
underperformed the other classifiers used in this work despite their efficacy in
numerous machine-learning scenarios.
The experimental results also showed that most classifier models that used
N-gram + CV-IDF as their feature extraction approach performed better than those
that used the N-gram + TF-IDF approach. The classifier algorithms were also
evaluated using another metric, the Area Under the Curve (AUC). The metric takes a
value ranging from 0 to 1, and a value closer to 1 indicates better classification
results. Figures 9, 10, 11, and 12 display the AUC comparison of all the
classification methods.
Table 4 Performance Comparison of Classification Algorithms on testing dataset
Fig. 5 Comparison of performance results of all classification algorithms with Unigram + TF-IDF features
Fig. 6 Comparison of performance results of all classification algorithms with Unigram + CV-IDF features
classifier models in the batch processing phase, the classifier with the greatest
performance in our experiments, MLP with the (Unigram + Bigram) + CV-IDF feature
extraction combination, was applied to predict suicidal ideation-related Twitter
content in real time. We collected streaming tweets using the Twitter API with
multiple keywords, including “feel,” “want to die,” and “kill myself,” which were
then pushed into the Apache Kafka input topic. These streams of tweets were consumed
by Apache Spark Structured Streaming from the Kafka input topic, preprocessed as a
data stream, and used to generate a feature vector. The best pre-trained model
developed in the batch processing phase then analyzed the stream of preprocessed
tweets and predicted whether each tweet was suicidal or normal content in real time.
The prediction results were then pushed to a Kafka output topic for buffering and
consumed by the Power BI application to visualize the prediction results in real
time. In our work, a total of 764 tweets were collected as a data stream to examine
the prediction ability of the real-time streaming prediction phase. The results
indicated that 9.29% of the tweets were predicted as suicidal, whereas 90.71% were
predicted as non-suicidal. Figure 13 shows the results obtained in the real-time
streaming prediction phase.
Fig. 7 Comparison of performance results of all classification algorithms with Bigram + CV-IDF features
Fig. 8 Comparison of performance results of all classification algorithms with (Unigram + Bigram) + CV-IDF features
Fig. 9 Comparison of ROC-AUC of all classification algorithms with Unigram + TF-IDF features method
Fig. 10 Comparison of ROC-AUC of all classification algorithms with Unigram + CV-IDF features method
Fig. 11 Comparison of ROC-AUC of all classification algorithms with Bigram + CV-IDF features method
Fig. 12 Comparison of ROC-AUC of all classification algorithms with (Unigram + Bigram) + CV-IDF features method
5 Discussions
In this study, we proposed a big data approach to predict suicidal ideation based
on data collected from social media platforms. The proposed methodology comprised
two phases: batch processing and real-time streaming prediction. The system used six
Spark ML algorithms to build the classification models and compared their
performance. In the streaming data pipeline, live streams of tweets were collected
from Twitter using the keywords “feel”, “want to die” and “kill myself” and then
sent to the Kafka topic. Spark Structured Streaming received the stream data from
the Kafka topic, extracted the optimal features, and then sent batches of
preprocessed data to the real-time streaming prediction model to predict whether
each tweet contained indications of suicidal ideation.
This work used three feature extraction methods, TF-IDF, N-gram, and
CountVectorizer, in different combination scenarios to extract the optimal features
from the input data. The experimental results of the six classification models
showed that the MLP classifier had the highest accuracy, 93.47%, with features
extracted using the (Unigram + Bigram) + CV-IDF scenario. The MLP classifier also
obtained a high accuracy of 93.33% with Unigram + CV-IDF features. In addition, MLP
provided the best accuracy among the classifiers using Unigram + TF-IDF features, at
92.66%.
Comparing our experimental results with related works, we noticed that the highest
accuracy obtained from the MLP classifier is higher than the XGBoost and Logistic
Regression accuracies of 83.87% and 86.45%, respectively, achieved by S. Jain et al.
[13]. It is also higher than the accuracy of 80% achieved by A. E. Aladağ et al.
[18], who reported an F1 score of 92%. Furthermore, our methodology outperformed the
accuracy of 76.80% recorded by V. Desu et al. [15]. In addition, our experimental
results registered a higher performance than the Naïve Bayes algorithm of M. Birjali
et al. [24], which achieved a Precision of 87.50%, a Recall of 78.8%, and an F1
value of 82.9%. Therefore, we adopted the MLP classifier with the (Unigram + Bigram)
+ CV-IDF feature combination scenario to predict suicidal ideation in the second
phase, real-time streaming prediction, using Twitter streaming data.
That being said, further improvements can be made to extend this study. The first
improvement can be achieved by increasing the number of features of the textual data
using additional data such as emoticons, special characters, and symbols to extract
optimal features and reduce the misclassification results. Moreover, the dataset can
be expanded by gathering additional textual data from other social media platforms
to make our data more representative and varied.
References
[1] W.H. Organization. World Health Organization. URL
https://www.who.int/news-room/events/detail/2022/09/10/default-calendar/world-suicide-prevention-day-2022
[2] M.W. Gijzen, S.P. Rasing, D.H. Creemers, F. Smit, R.C. Engels, D. De Beurs,
Suicide ideation as a symptom of adolescent depression. a network analysis. Journal
of Affective Disorders 278, 68–77 (2021)
[3] A. Roy, K. Nikolitch, R. McGinn, S. Jinah, W. Klement, Z.A. Kaminsky, A
machine learning approach predicts future risk to suicidal ideation from social
media data. NPJ digital medicine 3(1), 1–12 (2020)
[4] T.H. Aldhyani, S.N. Alsubari, A.S. Alshebami, H. Alkahtani, Z.A. Ahmed,
Detecting and analyzing suicidal ideation on social media using deep learning and
machine learning models. International journal of environmental research and
public health 19(19), 12635 (2022)
[5] N.A. Baghdadi, A. Malki, H.M. Balaha, Y. AbdulAzeem, M. Badawy, M. Elhosseini,
An optimized deep learning approach for suicide detection through Arabic
tweets. PeerJ Computer Science 8, e1070 (2022)
[6] S.A. Senthilkumar, B.K. Rai, A.A. Meshram, A. Gunasekaran,
S. Chandrakumarmangalam, Big data in healthcare management: a review of literature.
American Journal of Theoretical and Applied Business 4(2), 57–69 (2018)
[7] S. Ayvaz, M.O. Shiha, A scalable streaming big data architecture for real-time
sentiment analysis, in Proceedings of the 2018 2nd international conference on
cloud and big data computing (2018), pp. 47–51
[8] A.H. Alamoodi, B.B. Zaidan, A.A. Zaidan, O.S. Albahri, K.I. Mohammed, R.Q.
Malik, E.M. Almahdi, M.A. Chyad, Z. Tareq, A.S. Albahri, et al., Sentiment
analysis and its applications in fighting COVID-19 and infectious diseases: A
systematic review. Expert Systems with Applications 167, 114155 (2021)
[9] G. Agarwal, S.K. Dinkar, A. Agarwal, Binarized spiking neural networks
optimized with nomadic people optimization-based sentiment analysis for social
product recommendation. Knowledge and Information Systems 66(2), 933–958
(2024)
[10] P. Rita, N. António, A.P. Afonso, Social media discourse and voting decisions
influence: sentiment analysis in tweets during an electoral period. Social Network
Analysis and Mining 13(1), 46 (2023)
[11] N. Öztürk, S. Ayvaz, Sentiment analysis on Twitter: A text mining approach to
the Syrian refugee crisis. Telematics and Informatics 35(1), 136–147 (2018)
[12] M.A. Allayla, S. Ayvaz, A Hybrid and Scalable Sentiment Analysis Framework:
Case of Russo-Ukrainian War, in 2023 3rd International Scientific Conference of
Engineering Sciences (ISCES) (IEEE, 2023), pp. 13–18
[13] S. Jain, S.P. Narayan, R.K. Dewang, U. Bhartiya, N. Meena, V. Kumar, A
machine learning based depression analysis and suicidal ideation detection
system using questionnaires and Twitter, in 2019 IEEE Students Conference on
Engineering and Systems (SCES) (IEEE, 2019), pp. 1–6
[14] R. Sawhney, P. Manchanda, R. Singh, S. Aggarwal, A computational approach to
feature extraction for identification of suicidal ideation in tweets, in Proceedings
of ACL 2018, Student Research Workshop (2018), pp. 91–98
[15] V. Desu, N. Komati, S. Lingamaneni, F. Shaik, Suicide and Depression Detection
in Social Media Forums, in Smart Intelligent Computing and Applications,
Volume 2: Proceedings of Fifth International Conference on Smart Computing and
Informatics (SCI 2021) (Springer, 2022), pp. 263–270
[16] N. Wang, F. Luo, Y. Shivtare, V.D. Badal, K.P. Subbalakshmi, R. Chandramouli,
E. Lee, Learning models for suicide prediction from social media posts. arXiv
preprint arXiv:2105.03315 (2021)
[17] M. Chatterjee, P. Kumar, P. Samanta, D. Sarkar, Suicide ideation detection from
online social media: A multi-modal feature based technique. International Journal
of Information Management Data Insights 2(2), 100103 (2022)
[18] A.E. Aladağ, S. Muderrisoglu, N.B. Akbas, O. Zahmacioglu, H.O. Bingol,
Detecting suicidal ideation on forums: proof-of-concept study. Journal of Medical
Internet Research 20(6), e9840 (2018)
[19] N.J. Carson, B. Mullin, M.J. Sanchez, F. Lu, K. Yang, M. Menezes, B.L. Cook,
Identification of suicidal behavior among psychiatrically hospitalized adolescents
using natural language processing and machine learning of electronic health
records. PloS one 14(2), e0211116 (2019)
[20] A. Roy, K. Nikolitch, R. McGinn, S. Jinah, W. Klement, Z.A. Kaminsky, A
machine learning approach predicts future risk to suicidal ideation from social
media data. NPJ digital medicine 3(1), 1–12 (2020)
[21] M.J. Vioules, B. Moulahi, J. Azé, S. Bringay, Detection of suicide-related posts
in Twitter data streams. IBM Journal of Research and Development 62(1), 1–7
(2018)
[22] W. Jung, D. Kim, S. Nam, Y. Zhu, Suicidality detection on social media using
metadata and text feature extraction and machine learning. Archives of suicide
research pp. 1–16 (2021)
[23] M.M. Tadesse, H. Lin, B. Xu, L. Yang, Detection of depression-related posts in
Reddit social media forum. IEEE Access 7, 44883–44893 (2019)
[24] M. Birjali, A. Beni-Hssane, M. Erritali, Machine learning and semantic sentiment
analysis based algorithms for suicide sentiment prediction in social networks.
Procedia Computer Science 113, 65–72 (2017)
[25] E. Shaikh, I. Mohiuddin, Y. Alufaisan, I. Nahvi, Apache Spark: A big data
processing engine, in 2019 2nd IEEE Middle East and North Africa COMMunications
Conference (MENACOMM) (IEEE, 2019), pp. 1–6
[26] M. Junaid, S. Ali, I.F. Siddiqui, C. Nam, N.M.F. Qureshi, J. Kim, D.R.
Shin, Performance Evaluation of Data-driven Intelligent Algorithms for Big
data Ecosystem. Wireless Personal Communications 126(3), 2403–2423 (2022).
https://doi.org/10.1007/s11277-021-09362-7
[27] K. Deshpande, M. Rao, in Inventive Computation and Information Technologies
(Springer, 2022), pp. 607–630
[28] N. Komati. Suicide and Depression Detection. URL
https://www.kaggle.com/datasets/nikhileswarkomati/suicide-watch
[29] S. Vijayarani, M.J. Ilamathi, M. Nithya, Preprocessing techniques for text
mining - an overview. International Journal of Computer Science and Communication
Networks 5(1), 7–16 (2015)
[30] S.F.C. Haviana, B.S.W. Poetro, Deep learning model for sentiment analysis on
short informal texts. Indonesian Journal of Electrical Engineering and Informatics
(IJEEI) 10(1), 82–89 (2022)
[31] W. Shang, T. Underwood, Improving Measures of Text Reuse in English Poetry:
A TF–IDF Based Method, in International Conference on Information (Springer,
2021), pp. 469–477
[32] R. Vijaya Prakash, Machine Learning Approach To Forecast the Word in Social
Media. Social Network Analysis: Theory and Applications pp. 133–147 (2022)
[33] J. Brownlee, Deep learning for natural language processing: develop deep learning
models for your natural language problems (Machine Learning Mastery, 2017)
[34] R. Mehmood, B. Bhaduri, I. Katib, I. Chlamtac, Smart Societies,
Infrastructure, Technologies and Applications: First International Conference, SCITA 2017,
Jeddah, Saudi Arabia, November 27–29, 2017, Proceedings, vol. 224 (Springer,
2018)
[35] E.M.K. Reddy, A. Gurrala, V.B. Hasitha, K.V.R. Kumar, Introduction to Naive
Bayes and a Review on Its Subtypes with Applications. Bayesian Reasoning and
Gaussian Processes for Machine Learning Applications pp. 1–14 (2022)
[36] A. Goel, J. Gautam, S. Kumar, Real time sentiment analysis of tweets using Naive
Bayes, in 2016 2nd International Conference on Next Generation Computing
Technologies (NGCT) (IEEE, 2016), pp. 257–261
[37] M. Jena, R.K. Behera, S. Dehuri, in Advances in Machine Learning for Big Data
Analysis (Springer, 2022), pp. 223–239
[38] L. Breiman, Random Forests. Machine Learning 45(1), 5–32 (2001).
https://doi.org/10.1023/A:1010933404324
[39] N. Syam, R. Kaul, in Machine Learning and Artificial Intelligence in Marketing
and Sales (Emerald Publishing Limited, 2021)
[40] N. Jalal, A. Mehmood, G.S. Choi, I. Ashraf, A novel improved random forest for
text classification using feature ranking and optimal number of trees. Journal of
King Saud University-Computer and Information Sciences (2022)