A Big Data Analytics System for Predicting Suicidal Ideation in Real-Time Based on Social Media Streaming Data

Mohamed A. Allayla1,2 and Serkan Ayvaz2,3*

1 Dams and Water Resources Research Center, University of Mosul, Mosul, Iraq.
2 Department of Computer Engineering, Yildiz Technical University, Istanbul, Turkey.
3 Centre for Industrial Software, University of Southern Denmark, Sønderborg, Denmark.

*Corresponding author(s). E-mail(s): seay@mmmi.sdu.dk;
Contributing authors: mohamed.abdulstar@uomosul.edu.iq;

arXiv:2404.12394v1 [cs.LG] 19 Mar 2024

Abstract
Online social media platforms have recently become integral to our society and daily routines. Every day, users worldwide spend a couple of hours on such platforms, expressing their sentiments and emotional state and contacting each other. Analyzing the huge amounts of data from these platforms can provide clear insight into public sentiment and help detect users' mental state. The early identification of these health risks may assist in preventing or reducing suicidal ideation and potentially save people's lives. Traditional techniques have become ineffective in processing such streams and large-scale datasets. Therefore, this paper proposes a new methodology based on a big data architecture to predict suicidal ideation from social media content. The proposed approach provides a practical analysis of social media data in two phases: batch processing and real-time streaming prediction. The batch dataset was collected from the Reddit forum and used for model building and training, while streaming big data was extracted using the Twitter streaming API and used for real-time prediction. After the raw data was preprocessed, the extracted features were fed to multiple Apache Spark ML classifiers: NB, LR, LinearSVC, DT, RF, and MLP. We conducted experiments using several feature-extraction techniques and different testing scenarios. The experimental results of the batch processing phase showed that the (Unigram + Bigram) + CV-IDF features with the MLP classifier provided high performance for classifying suicidal ideation, with an accuracy of 93.47%; this model was then applied in the real-time streaming prediction phase.

Keywords: Big Data, Suicidal ideation, Apache Spark, Apache Kafka, Social Media

1 Introduction
Suicidal ideation is a serious public health concern. The number of cases of suicidal ideation is increasing at an alarming rate every year. According to a report issued by the World Health Organization (WHO), more than 703,000 people die by suicide annually, which means roughly one person dies every 45 seconds due to suicide. In addition, there are 25 suicide attempts for each suicide, and many more people have had serious thoughts about suicide [1]. Suicidal ideation has consistently been linked to emotional states such as depression and hopelessness [2]. The early detection of suicidal ideation may help to prevent many suicide attempts and identify individuals needing psychosocial support.
Traditional methods and programs for suicide prevention are still reactive and
require patients to take the initiative to seek medical help. However, many patients
are not highly motivated to receive the necessary support. Owing to the anonymity of online social media platforms, these platforms have become an alternative space for people to express their honest feelings or thoughts about their suffering or healthcare issues without fearing stigma or revealing their real identities, as they would in face-to-face conversation [3].
Such content is considered a valuable source for detecting high-risk instances of suicidal ideation and uncovering these dangerous intentions before they become irreversible or the sufferers end their lives. People experiencing suicidal ideation may express their intentions online through brief ideas or detailed planning. Social media have been successfully leveraged to assist in detecting physical and mental illnesses more easily [4]. Therefore,
researchers have begun using online postings to detect suicidal ideation manually or
with the help of machine learning techniques [5]. Manual identification of suicidal
ideation has become more challenging due to the vast amount of content on social
media platforms.
Moreover, social media posts are generated as streaming data in real time, and real-time systems require direct input and rapid processing capability to make decisions in a short time [6]. Several problems must be addressed before developing a real-time analytics system. The first is to provide a reliable and efficient framework for distributing data without losing accuracy; Apache Spark¹ and Apache Kafka² are examples of such frameworks. Most big-data research in healthcare focuses on the technical aspects of big data. Another problem is that streaming data involves high-velocity, continuous data generation. Hence, processing such a huge data stream in real time using a traditional system environment may result in system bottlenecks.

¹ https://spark.apache.org/
² https://kafka.apache.org/

This work aimed to build an effective real-time model using a big data analytics system to predict a person's suicidal ideation at an early stage based on their social media posts. We focused primarily on a social media platform where people discuss different mental health issues and offer each other help. Reddit's topic-specific subreddits were used as the source of the historical data for training the classifiers in the batch processing phase. The Reddit data contains postings from the subreddits titled “Suicide Watch” and “Teenagers.” In addition, we extracted streaming tweets using the Twitter API for real-time prediction. The API offers a collection of ways to communicate with the application. We employed Spark Structured Streaming to handle the streaming of tweets in real time.
Some notable contributions made by this paper include the following:
• This paper proposes a scalable predictive system that can analyze large volumes of high-velocity streaming data in real time using a “big data” architecture to predict suicidal ideation cases that require special attention.
• We conducted experiments with multiple Apache Spark ML algorithms using three feature extraction techniques (TF-IDF, N-gram, and CountVectorizer) with various combinations and testing scenarios.
• We applied optimization techniques to achieve high prediction accuracy. The proposed system achieved strong performance in both the batch and real-time streaming phases of suicidal ideation prediction.
The remainder of the paper is structured as follows: Section 2 briefly reviews the
literature on detecting suicidal ideation. Section 3 describes the proposed methodology
and big data architecture that is used for detecting suicidal ideation. Section 4 presents
the experimental setup, including a performance analysis comparison of the supervised
classification algorithms used with various testing-type scenarios and an exploration
of the data analysis. Section 5 discusses the results obtained and the notable findings
of the work. Finally, Section 6 concludes the paper and describes the future work.

2 Literature Review
Sentiment analysis has attracted attention as a research topic in various fields, such as finance [7], public health [8], product reviews [9], voting behavior [10], politics [11], and social events [12]. Although approaches, methods and models vary across
domains, it has been observed that sentiment analysis and prediction tasks often
produce useful and interesting results. From the perspective of monitoring suicidal
ideation and mental state, there are some studies that analyze social media data using
natural language processing (NLP) and sentiment analysis by investigating different
aspects [13, 14, 15].
In the study conducted by S. Jain et al. [13], two datasets were used to develop a machine learning-based method for predicting suicidal behaviors depending on the depression stage. The first dataset was collected through a questionnaire completed by students and parents, and depression was then classified into five severity-based stages. The XGBoost classifier reported a maximum accuracy of 83.87% on this dataset. The second dataset was extracted from Twitter. Tweets were classified according to whether the user had depression. They found that the Logistic Regression algorithm exhibited the highest performance and achieved an accuracy of 86.45%.
N. Wang et al. [16] proposed a deep-learning (DL) architecture and evaluated three further machine learning (ML) models that analyze an individual's content to automatically identify suicide risk within 30 days to 6 months before an attempt. They created and extracted three handcrafted feature sets to detect suicide risk, drawing on the three-phase theory of suicide and earlier work on emotions and pronouns among people who exhibit suicidal thoughts.
R. Sawhney et al. [14] proposed a new supervised method for identifying suicidal thoughts using a manually annotated Twitter dataset. They used a set of features to train linear and ensemble classification algorithms. Comparisons were also made with baseline models applying different methodologies, including LSTMs, negation resolution, and rule-based approaches. The most significant finding of their work was that the Random Forest algorithm outperformed the other classifiers and baselines.
Similarly, M. Chatterjee et al. [17] analyzed Twitter content and identified features that can hold signs of suicidal ideation. Multiple ML algorithms, including LR, RF, SVM, and XGBoost, were applied to evaluate the effectiveness of the suggested approach. The study extracted and combined multiple features from Twitter data, including sentiment analysis, emoticons, statistics, TF-IDF, N-gram, temporal features, and topic-based features (LDA). The empirical findings showed that the Logistic Regression classifier achieved an accuracy of 87%.
A. E. Aladağ et al. [18] applied text mining to post titles and bodies to build a classification model that differentiated between suicidal and non-suicidal postings. The needed features were extracted using various techniques, including TF-IDF, word count, linguistic inquiry, and sentiment analysis of the titles and bodies of the posts. In addition, several classification algorithms were applied. The suicidality of posts was correctly classified using Logistic Regression (LR) and Support Vector Machine (SVM) classifiers, with an accuracy of 80% and an F1 score of 92%.
By using data collected from electronic medical records in mental hospitals, N. J.
Carson et al. [19] built and evaluated an NLP-based machine-learning approach to
detect suicidal behaviors and thoughts among young people.
A. Roy et al. [20] evaluated psychological weight factors, including depression,
hopelessness, loneliness, stress, anxiety, burdensomeness, and insomnia. Furthermore,
sentiment polarity and the Random Forest (RF) algorithm were applied with ten estimated psychological measures to predict suicidal ideation (SI) within tweets, achieving an AUC score of 88%.
On the other hand, V. Desu et al. [15] proposed an approach that utilizes various ML and DL algorithms, such as XGBoost, SVM, and ANN, implemented on a multi-node Spark cluster, to detect individuals who suffer from depression and suicidal thoughts and require urgent assistance or support by analyzing their social media content. The proposed ANN model outperformed all other baseline algorithms and registered the best accuracy rate of 76.80%.
M. J. Vioules et al.[21] developed a novel method that uses Twitter data to identify
suicide warning signs in users and detect postings containing suicidal behaviors. The
key contribution of their method is its ability to detect sudden changes in users’
online behavior. To identify these changes, they employed NLP algorithms with a
martingale framework to collect behavioral and textual features. The experimental
results demonstrated that their text-scoring method could detect warning signs in a
text more effectively than standard machine learning classifiers.
W. Jung et al. [22] designed multiple machine-learning models and analyzed sui-
cidality using Twitter data. The models were trained using 1097 suicidal and 1097
nonsuicidal tweets. They explored metadata and text-feature extraction to construct
efficient prediction models. They trained the classifier models using Random Forest
and Gradient-boosted tree (GBT). The experiments were conducted using multiple
features to construct a robust classifier. The model achieved good accuracy, with an
F1-Score of 84.6%.
M. M. Tadesse et al. [23] used NLP techniques to identify depressive content generated by users on the Reddit social website. The study primarily focused on deploying and evaluating several feature extraction approaches, such as LIWC, N-grams, and topic modeling with LDA, to achieve the highest performance. The authors applied several classification algorithms, including LR, SVM, RF, Adaptive Boosting (AB), and Multilayer Perceptron (MLP), to evaluate the risk of depression among users. The experimental results were summarized in a confusion matrix to measure the models' performance. The MLP model with the combination of LIWC, Bi-gram, and LDA features gave the best performance for depression identification, with 91% accuracy and an F1 score of 93%.
M. Birjali et al. [24] provided a method for building a suicide vocabulary to solve
the lack of current lexical resources. To improve their analysis, they proposed inves-
tigating the use of Weka as a data mining tool by using machine learning methods
to gain valuable insights from Twitter data gathered via Twitter4J. The dataset was
built using 892 tweets. They also introduced an algorithm that utilizes WordNet to
perform semantic analysis regarding the train set and the tweets’ data set, allowing
practical semantic similarity computation. They have demonstrated the efficacy of
using a machine-learning-based technique on Twitter content as a suicide prevention method. The empirical findings showed that the Naïve Bayes algorithm achieved a Precision of 87.50%, a Recall of 78.8%, and an F1 score of 82.9%.
N. A. Baghdadi et al. [5] presented a detailed framework for text content classifica-
tion, specifically for Twitter content. The trained model was employed to identify the
tweets as “Suicide” or “Normal.” The dataset contains 14,576 tweets. Additionally,
the study provided preprocessing methods specifically designed for Arabic tweets. The
dataset was annotated through multiple annotators, and the framework’s effectiveness
was evaluated using various assessment methods. Valuable insights were gained
through the Weighted Scoring Model (WSM). Both USE and BERT classifier models
were also explored. The WSM models registered the highest-weighted sum of 80.20%.

3 Proposed Methodology
Real-time streaming analysis of social media content can provide helpful and up-to-date information on individuals with mental health problems. Current analytics methods that analyze massive volumes of social media content offline are not robust or responsive enough to support real-time decision-making under critical conditions. Thus, analysis methods must be built to provide effective real-time stream prediction.
The methodology comprises two phases: batch processing and real-time streaming
prediction.
Our system methodology was built based on four primary components: the input
source system, where the system obtains the stream data (Apache Kafka); the stream
data processing, where the stream data are processed (Apache Spark Structured
Streaming); building the classification algorithms (Apache Spark ML); and the sink
node, where the final results are analyzed and visualized (Power BI). We built several Apache Spark ML models using multiple feature extraction techniques. We also compared the classification performance of these models using various evaluation methods to determine the optimal architecture for predicting suicide-related posts from real-time Twitter streaming data. Figure 1 provides an overview of the proposed methodology and the experimental workflow used in this work.

Fig. 1 Proposed methodology for predicting suicidal ideation on social media content

3.1 Big Data Architecture

This section describes the big data architecture applied in this work. Our proposed methodology was developed to efficiently analyze massive volumes of high-velocity social media content as real-time streaming data using a distributed big data environment.

3.1.1 Apache Spark

Apache Spark is used in the proposed methodology as the data processing engine. It is an analytics platform that supports batch and stream data processing [25]. Spark is an open-source cluster computing system with various scalable, distributed built-in ML libraries [26]. A key feature of Spark is its scalability, which enables building Spark clusters with several nodes. It employs a master-slave design consisting of a Driver program that operates as the cluster's master node and a set of executors that act as worker nodes. The core components of Spark include Spark SQL, which is used for querying structured data with SQL, and Spark Streaming, which is used to process stream data. Spark Structured Streaming is built on top of Spark SQL. Structured Streaming executes incrementally and continuously, updating the final output whenever new streaming data is received.

3.1.2 Apache Kafka

Apache Kafka is used to develop the real-time prediction pipeline and for stream data messaging. Kafka is an open-source, widely used, and powerful ingestion system employed primarily in big data applications [27]. It is a low-latency, high-throughput system for managing and transferring massive, high-velocity data in a streaming manner. The Producer and Consumer APIs are the two primary components of the Kafka architecture. The Producer API allows the system to send data to Kafka topics. The Consumer API provides access to Kafka topics and processes the data streams in real time.

3.2 Batch Data Processing Phase

The experiments performed during the batch processing phase aimed to develop and train multiple Spark ML models with different feature extraction techniques and testing scenarios. The model with the highest performance was then applied in the real-time streaming prediction phase. The batch processing phase consists of seven primary stages: (i) Data Collection, (ii) Data Preprocessing, (iii) Dataset Splitting, (iv) Feature Engineering, (v) Model Development, (vi) Model Evaluation, and (vii) Model Saving. The following subsections describe each stage in detail.

3.2.1 Datasets Collection

Datasets play an essential role in any text-data analysis. The dataset required for our experiments in the batch processing phase was gathered from the Reddit social media platform. The primary source of the batch dataset is Kaggle, a website that hosts publicly accessible benchmark datasets for various applications [28]. The obtained dataset was utilized to train and assess the classifier models during the batch processing phase. The dataset was provided as a CSV file and contained posts from the Reddit subreddits titled “Suicide Watch” and “Teenagers Forum,” which were collected using the Pushshift API. The dataset comprised 232,074 posts collected between Dec. 16, 2008, and Jan. 2, 2021, of which 116,037 were classified as suicidal and 116,037 as nonsuicidal. We cleaned and preprocessed the dataset to remove duplicate posts, empty rows, and unnecessary columns. After the preprocessing step, the dataset contained 232,042 rows, including 116,028 suicidal and 116,014 nonsuicidal instances. For our analysis, we used only the post content and target columns. Some batch data samples are presented in Table 1.

Table 1 Samples of the Batch Dataset Postings

Class type    Postings

Suicide       I need help just help me im crying so hard.
              I have nothing to live for. My life is so bleak.
              Suicidal tics and intrusive anxiety...
Non-suicide   I just got a Russian Hardbass song in my Spotify...
              I wish I could change my name to Seymour...
              My life is not a joke Jokes have meaning.

3.2.2 Data Preprocessing

Text analysis performance can be improved by selecting a proper data preprocessing strategy, since input data collected from social media may contain many meaningless words or characters, which increase the complexity of the analysis. Hence, we aimed to prepare and refine the raw data into a suitable format for each classifier model. Some preprocessing methods are standard for text analysis tasks, while others depend on the complexity of the data and affect the final result. We preprocessed the dataset using Natural Language Processing (NLP) techniques before passing it to the feature extraction and training stages. The preprocessing steps used to prepare the raw data are illustrated in Figure 2.

3.2.3 Filtering Data

In this step, we filtered the collected posts to remove duplicate content, URL links (“https://”, “http://”), punctuation (e.g., “?”, “!”), special symbols (e.g., “$”, “%”), and the hashtag symbol (“#”). The filtering step also includes case folding and expanding contractions into their complete forms (e.g., “let’s” into “let us” and “didn’t” into “did not”). This step improves the effectiveness of the classifiers because it reduces the complexity of the dataset.
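
As an illustration of this filtering step, the following is a minimal pure-Python sketch, not the implementation used in this work. The `CONTRACTIONS` table, the regular expression, and the function names `filter_text` and `filter_posts` are hypothetical, and replacing punctuation with spaces is a simplifying assumption.

```python
import re
import string

# Hypothetical, deliberately tiny contraction table (assumption for illustration).
CONTRACTIONS = {"let's": "let us", "didn't": "did not"}

URL_RE = re.compile(r"https?://\S+")
# Map every ASCII punctuation character (including "#", "$", "%") to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def filter_text(text):
    """Apply case folding, URL removal, contraction expansion, and symbol removal."""
    # Case folding, plus normalising curly apostrophes to straight ones.
    text = text.lower().replace("\u2019", "'")
    text = URL_RE.sub(" ", text)
    for short, full in CONTRACTIONS.items():
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text)
    text = text.translate(PUNCT_TABLE)
    return " ".join(text.split())  # collapse repeated whitespace


def filter_posts(posts):
    """Clean every post and drop duplicates while preserving the original order."""
    return list(dict.fromkeys(filter_text(p) for p in posts))
```

Contractions are expanded before punctuation is stripped so that the apostrophe is still available for matching.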

3.2.4 Tokenizing
The tokenization step is essential for any natural language processing (NLP) pipeline and has a considerable influence on the remaining stages of the pipeline. It breaks the text down into individual, meaningful units, including words, punctuation marks, symbols, and abbreviations, to make data exploration easier. The resulting units are known as tokens [29]. These tokens were then used as input data for the processing pipeline.

Fig. 2 The primary steps in preprocessing the raw dataset

3.2.5 Stopword Removing

This step excludes words that carry no sentiment in the dataset. Stop words are among the most frequently used terms in a document. In this stage, we eliminated the most frequent stop words, such as pronouns like “she” and “he”; articles like “the,” “a,” and “an”; prepositions like “on,” “of,” “to,” and “for”; and conjunctions like “and” and “but.” In this way, we aimed to reduce the size and complexity of the dataset.
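
The tokenizing and stop-word removal steps can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions: tokens are split on whitespace, the `STOP_WORDS` set contains only the examples listed above, and the function names are hypothetical. Spark ML offers `Tokenizer` and `StopWordsRemover` transformers for the same purpose, although this section does not state which components were used.

```python
# Illustrative stop-word list built only from the examples named in the text.
STOP_WORDS = {"she", "he", "and", "the", "a", "an", "on", "of", "to", "but", "for"}


def tokenize(text):
    """Split lower-cased text into whitespace-delimited tokens (simplifying assumption)."""
    return text.lower().split()


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop-word set."""
    return [token for token in tokens if token not in stop_words]
```

For example, `remove_stop_words(tokenize("He went to the store for a gift"))` keeps only the content-bearing tokens `went`, `store`, and `gift`.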

3.2.6 Lemmatizing
The input data was lemmatized at this step. Lemmatizing removes inflectional endings and returns each word in the dataset to its base or dictionary form, which requires a comprehensive vocabulary and morphological analysis. Among various lemmatization methods, we focused on the rule-based approach using “WordNetLemmatizer.” It employs a pre-established set of morphological and syntactic rules to find the lemma of each word within the input text. Lemmatization helps to reduce the dimensionality and the vocabulary size of textual data, which leads to improved performance of analytical techniques.

3.2.7 Dataset Splitting
To train and evaluate the classification models, it is necessary to split the dataset. Therefore, we divided the entire historical Reddit dataset into two subsets: 80% of the dataset was used as training data, and the remaining 20% was held out as unseen testing data. The classification models were trained and optimized using the training data to determine the most accurate features. The testing data (unseen data) was employed to assess the effectiveness of the classification models. Table 2 provides descriptive statistics for the training and testing sets.

Table 2 Training and Testing Dataset Statistics

Data subset   Class type    No. of postings

Train set     Suicide       92,726
              Non-suicide   92,704
Test set      Suicide       23,302
              Non-suicide   23,310
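
A random 80/20 split of this kind can be sketched as follows. The function name `train_test_split`, the fixed `seed`, and the shuffle-then-slice strategy are assumptions made for illustration; the paper does not describe how the split was drawn. In PySpark, `DataFrame.randomSplit` plays a similar role, with the difference that its proportions are approximate rather than exact.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle the rows reproducibly and hold out `test_fraction` of them for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # local RNG, so global random state is untouched
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)
```

Fixing the seed makes the partition repeatable, so that every classifier is trained and tested on the same rows.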

3.2.8 Feature Engineering

Once we had a clean data corpus from the previous stages, it was fed into the different
feature engineering methods. Our goal was to find the optimal features that provide
the highest classification performance, reduce the complexity, and speed up the data
transformation. In this stage, we have used three feature engineering techniques to
obtain and extract the dataset’s essential features, including N-gram, TF-IDF, and
CountVectorizer (CV) with multiple combinations.

N-gram: N-gram is a feature extraction method that identifies groups of N successive words within a text [30]. This method is widely used as a feature extraction and analysis tool in NLP and text mining. It involves converting the input data into sequences of n consecutive tokens. In our work, the most important features are represented using Unigrams (single words) and Bi-grams (pairs of adjacent words, which can take on a different meaning when combined) with the help of the PySpark library. Also, we assigned high importance to N-grams that appear more than four times in the document.
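
The N-gram construction can be illustrated with a short pure-Python sketch. The function names are hypothetical, and joining the words of each N-gram with a single space is a convention chosen here (Spark ML's `NGram` transformer also emits space-delimited strings).

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a single space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def unigrams_and_bigrams(tokens):
    """Concatenate unigram and bigram features, as in the (Unigram + Bigram) setting."""
    return ngrams(tokens, 1) + ngrams(tokens, 2)
```

A sequence shorter than n yields no N-grams, so very short posts contribute only unigram features.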

TF-IDF: TF-IDF is a statistical method to extract relevant features from textual input data. TF-IDF builds a vector matrix that reflects a word's importance in a document. A word that occurs in fewer documents is more useful for classification: TF-IDF assigns a lower score to terms that appear in most documents and a higher score to terms that are frequent in a document but rare across the corpus [31] [32]. The Spark ML API provides two methods for calculating term frequencies: HashingTF and CountVectorizer (CV). TF-IDF is calculated using Equations 1, 2, and 3 below.

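$$\mathrm{TF}(t) = \frac{\text{No. of times term } t \text{ appears in a document}}{\text{Total No. of terms in the document}} \tag{1}$$

$$\mathrm{IDF}(t) = \log\left(\frac{\text{Total No. of documents}}{\text{No. of documents that contain the term } t}\right) \tag{2}$$

$$\mathrm{TFIDF}(t) = \mathrm{TF}(t) \times \mathrm{IDF}(t) \tag{3}$$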
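
Equations 1 to 3 translate directly into code. The sketch below is illustrative only: the natural logarithm is an assumption (the equations do not fix the base), returning 0 for a term that occurs in no document is a choice made here, and the function names are hypothetical. Spark ML's `IDF` estimator uses a smoothed variant, log((N + 1) / (df + 1)), so its scores differ slightly from these.

```python
import math
from collections import Counter


def tf(term, doc):
    """Equation 1: occurrences of `term` in `doc` divided by the length of `doc`."""
    return Counter(doc)[term] / len(doc)


def idf(term, docs):
    """Equation 2: log of (number of documents / number of documents containing `term`)."""
    containing = sum(1 for doc in docs if term in doc)
    if containing == 0:
        return 0.0  # choice made here: a term unseen in the corpus gets no weight
    return math.log(len(docs) / containing)


def tf_idf(term, doc, docs):
    """Equation 3: product of term frequency and inverse document frequency."""
    return tf(term, doc) * idf(term, docs)
```

A term that occurs in every document has an IDF of log(1) = 0, so it contributes nothing to the feature vector no matter how often it is repeated.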

CountVectorizer (CV): It is a basic method for tokenizing data and generating a numerical representation of a word list [33]. It builds one column for each unique word in the vocabulary and represents each row by replacing words with their frequencies. CV can be employed when a prior dictionary is unavailable, because it extracts the vocabulary and builds the required dictionary itself [34]. As part of this study, we conducted the experiments using the following combinations of feature extraction methods:
• Unigram + TF-IDF
• Unigram + CV-IDF
• Bigram + CV-IDF
• (Unigram + Bigram) + CV-IDF
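
The idea behind CountVectorizer can be sketched in plain Python: learn a vocabulary from the corpus, then map each document to a vector of term counts. The names `fit_vocabulary`, `count_vector`, and `min_df` are hypothetical, and ordering the vocabulary alphabetically is an assumption made here for determinism (Spark ML's `CountVectorizer` orders terms by corpus frequency instead).

```python
from collections import Counter


def fit_vocabulary(docs, min_df=1):
    """Map each term that appears in at least `min_df` documents to a column index."""
    doc_freq = Counter(term for doc in docs for term in set(doc))
    terms = sorted(term for term, count in doc_freq.items() if count >= min_df)
    return {term: index for index, term in enumerate(terms)}


def count_vector(doc, vocab):
    """Represent `doc` as a list of term counts, one entry per vocabulary column."""
    vector = [0] * len(vocab)
    for term, count in Counter(doc).items():
        if term in vocab:  # terms outside the learned vocabulary are ignored
            vector[vocab[term]] = count
    return vector
```

Feeding these count vectors into the IDF weighting of Equation 2 gives the "CV-IDF" features; passing unigram and bigram tokens together corresponds to the (Unigram + Bigram) + CV-IDF combination.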

3.2.9 Models Development

In our proposed methodology, we built the classification models using multiple Spark
ML algorithms, namely Naïve Bayes (NB), Logistic Regression (LR), Linear Support Vector Classifier (LinearSVC), Decision Tree (DT), Random Forest (RF), and Multilayer Perceptron (MLP) classifiers. The classifier models were trained and tested with
various parameter and feature extraction combinations until the best performance
values were achieved.

Naïve Bayes Classifier (NB): NB is a well-known supervised machine learning classification algorithm. The NB classifier assumes that the attributes are independent of each other, so that the presence or absence of one attribute does not affect the other attributes. The Naïve Bayes algorithm is based on Bayes' theorem [35]. The NB classifier is often used for text classification tasks due to its simplicity and speed [36].

Logistic Regression Classifier (LR): The LR algorithm is commonly employed for classification problems and belongs to the generalized linear model category. Another term for logistic regression is the maximum entropy algorithm. LR can calculate the likelihood of assigning a new sample to a particular category in binary or multiclass classification tasks. The algorithm performs well on linearly separable datasets and can be applied to determine the correlations among dataset attributes.
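
How a trained binary LR model turns features into a class likelihood can be sketched as below. The weights, bias, and the 0.5 decision threshold are placeholders chosen for illustration, not parameters from this study, and the function names are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function, mapping any real score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, features):
    """Probability of the positive class for one feature vector."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)


def predict(weights, bias, features, threshold=0.5):
    """Assign class 1 when the predicted probability reaches the threshold."""
    return 1 if predict_proba(weights, bias, features) >= threshold else 0
```

The decision boundary is the set of points where the linear score equals zero, which is why the model suits linearly separable data.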

Linear Support Vector Classifier (LinearSVC): The LinearSVC classifier is a standard algorithm often used for large-scale classification tasks. Despite its flexibility, it is mainly used in ML for classification tasks. LinearSVC is a non-probabilistic classification model that needs an extensive training set. It uses a hyperplane that optimally separates the classes in a high-dimensional feature space. LinearSVC is widely known for its practical strengths in dealing with real-world data, which include a solid theoretical basis and insensitivity to high-dimensional data.

Decision Tree Classifier (DT): The Decision Tree algorithm is a common machine-learning method categorized as a non-parametric supervised algorithm [37]. It is a hierarchical model designed as a tree structure, composed of multiple levels beginning from the root node. Every interior node has at least one child and represents the evaluation of an input feature or variable. Based on the result of each decision test, the procedure branches to the corresponding child node along the suitable path, and this process repeats until a leaf node is reached. The optimal tree is the shortest tree that can correctly categorize all data points and has the fewest splits.

Random Forest Classifier (RF): It is a popular and widely applied ML method that can be used for both classification and regression. It was introduced by L. Breiman [38]. The RF algorithm is sometimes called a “forest of decision trees.” RF decreases the prediction variance of a single decision tree and improves its performance. For this purpose, many decision trees are combined using a bagging aggregation technique [39]. RF learns in parallel from numerous randomly constructed decision trees, each trained on a different data sample and using different features to reach its individual decision. RF is more accurate and reliable than a single decision tree since the final decision depends on the average of the decision trees' outputs [40].

Multilayer Perceptron Classifier (MLP): It is a form of feedforward neural network
trained with backpropagation, a supervised learning approach. An MLP includes three sets
of nodes: input-layer neurons, hidden-layer neurons, and output-layer neurons, which
represent the final results of the system. Each neuron applies an activation function to
the weighted sum of its inputs. A classic perceptron uses a hard threshold, whereas the
neurons of a Multilayer Perceptron can use other activation functions, such as a sigmoid
or ReLU.
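A forward pass through the three layer types can be sketched in plain Python. The weights, biases, and the choice of ReLU for the hidden layer and sigmoid for the output layer are illustrative assumptions, not the configuration of the paper's model.

```python
import math
from typing import Callable, List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def relu(x: float) -> float:
    return max(0.0, x)


def dense(inputs: List[float], weights: List[List[float]],
          biases: List[float], activation: Callable[[float], float]) -> List[float]:
    """One fully connected layer: activation(weighted sum + bias) per neuron."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(x, hidden_w, hidden_b, out_w, out_b) -> List[float]:
    """Input layer -> hidden layer (ReLU) -> output layer (sigmoid)."""
    hidden = dense(x, hidden_w, hidden_b, relu)
    return dense(hidden, out_w, out_b, sigmoid)


# Toy network with 2 inputs, 2 hidden neurons, 1 output neuron (made-up weights).
output = forward(
    [2.0, 3.0],
    hidden_w=[[1.0, 0.0], [0.0, -1.0]], hidden_b=[0.0, 0.0],
    out_w=[[0.5, 1.0]], out_b=[-1.0],
)
```

Backpropagation would adjust the weights by propagating the output error backwards through these same layers; only the forward direction is shown here.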

3.2.10 Models Evaluation

The performance of the proposed architecture was evaluated using several metrics:
Accuracy (ACC.) (Equation 4), Precision (PRE.) (Equation 5), Recall (REC.) (Equation 6),
F1-score (F1) (Equation 7), and ROC-AUC. Furthermore, k-fold cross-validation was
employed to check that the models neither overfit nor underfit. Each classifier was
evaluated by averaging its accuracy over 10-fold cross-validation. Table 3 depicts the
confusion matrix for binary classification. Its cells hold the numbers of posts that were
True Positives (TP), False Positives (FP), True Negatives (TN), or False Negatives (FN),
where:

TP: The model predicted the positive class for a post, and the actual class of the post
is also positive.
TN: The model predicted the negative class for a post, and the actual class of the post
is also negative.
FP: The model predicted the positive class for a post, whereas the actual class of the
post is negative.
FN: The model predicted the negative class for a post, whereas the actual class of the
post is positive.

Accuracy: It is the most popular and straightforward way of measuring the model’s
performance. Accuracy is the ratio of samples that have been properly classified
compared to the whole number of samples, as shown in Equation 4:
$$ \text{Accuracy} = \frac{TP + TN}{TP + FN + TN + FP} \times 100\% \tag{4} $$

Precision: It is the ratio of correctly predicted positive samples (TP) to the overall
number of positively predicted samples (TP + FP). It is also known as the positive
predictive value (PPV). It is calculated as in Equation 5:

$$ \text{Precision} = \frac{TP}{TP + FP} \times 100\% \tag{5} $$

Recall (Sensitivity): It is the ratio of correctly identified positive observations (TP)
to the overall number of actual positive observations (TP + FN). Recall is sometimes
referred to as TPR (True Positive Rate). It is calculated as in Equation 6:

$$ \text{Recall} = \text{Sensitivity} = \frac{TP}{TP + FN} \times 100\% \tag{6} $$

F1-Score: It is the harmonic mean of the precision and recall scores. Using the F1-score,
we can evaluate an ML classifier’s performance on all data classes. The F1-Score is
defined in Equation 7:

$$ \text{F1-Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \times 100\% \tag{7} $$
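As a worked illustration of Equations 4 to 7, the sketch below computes the four metrics from confusion-matrix counts. The counts are hypothetical and are not results from the paper; precision and recall are kept as fractions internally and converted to percentages once at the end.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall and F1 (in percent) from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": round(accuracy * 100, 2),
        "precision": round(precision * 100, 2),
        "recall": round(recall * 100, 2),
        "f1": round(f1 * 100, 2),
    }


# Hypothetical counts for 200 posts (not taken from the paper).
metrics = classification_metrics(tp=90, fp=10, fn=5, tn=95)
```

With these counts, precision (90/100) is lower than recall (90/95), and the F1-score falls between the two, closer to the smaller value.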

ROC-AUC: The ROC curve is a two-dimensional graph that plots TPR against FPR to display
the ability of a classifier across decision thresholds. In the graph, the X-axis shows
the FPR, whereas the Y-axis shows the TPR. The AUC is the area under this curve; a higher
AUC score indicates a more effective classification model. Ideally, the ROC curve should
reach the top-left corner of the graph, resulting in an AUC score approaching 1. In
practice, the AUC score ranges from 0.5 (unsatisfactory classifier, equivalent to random
guessing) to 1.0 (outstanding classifier).
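The AUC can also be read as the probability that a randomly chosen positive post receives a higher score than a randomly chosen negative post. The sketch below computes it that way in plain Python; the labels and scores are made up for illustration, and this pairwise form is a stand-in for, not a copy of, the Spark evaluator.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Four hypothetical posts: three of the four positive/negative pairs are ranked correctly.
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

The pairwise form is quadratic in the number of samples, so it suits small examples; sorting-based implementations are used on large datasets.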

Table 3 Confusion-Matrix Structure

                                  Actual values
Predicted values      Actual-Pos.      Actual-Neg.
Pos. predicted        TP               FP
Neg. predicted        FN               TN

3.2.11 Model Saving
This stage is the last process in the batch processing phase: the highest-performing
model is saved so that it can be used as the predictive model in the real-time streaming
prediction phase.

3.3 Real-Time Streaming Prediction Phase

The primary aim of the real-time streaming prediction phase is to build a framework for
analyzing high-velocity streaming data arriving each second in real time. Our methodology
has four main components: data collection, data ingestion system, stream processing, and
results visualization, as shown in Figure 1. To check the proposed architecture’s ability
to identify suicidal ideation in real-time scenarios, we used the Twitter API to retrieve
real-time streaming tweets from Twitter. Twitter Streaming API 3 is the basic method for
accessing Twitter data. It allows real-time access to a limited sample of approximately
1% of all tweets. Furthermore, Tweepy 4 allows us to search tweets using hashtags,
keywords, trends, geolocation, or timelines. Our methodology used keyword searches to
retrieve the tweets. We employed rules to retrieve only English tweets and filtered out
all duplicate tweets created by retweets. A total stream of 764 tweets was retrieved
using multiple keywords related to suicidal ideation, including “feel,” “want to die,”
and “kill myself”. The retrieved tweets included multiple columns, including tweet
content, retweet counts, and usernames. Only the “tweet” column was used in our work; the
other columns were removed from the collected data. Apache Kafka was utilized to develop
the real-time pipelines and ingest the stream data. The key benefits of Kafka are its
ability to handle huge amounts of real-time data with low latency, its fault tolerance,
and its scalability for ingesting large data streams. We created an input topic,
“Source-tweets,” on the Kafka system.
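The collection rules described above (keyword match, English only, no retweet duplicates, keep only the tweet text) can be sketched as a plain-Python filter. The dictionary keys `text` and `lang`, the `RT @` prefix check, and the sample tweets are assumptions for illustration; the paper applies these rules through the Twitter API and Tweepy rather than through this function.

```python
from typing import Dict, Iterable, List

KEYWORDS = ("feel", "want to die", "kill myself")  # keywords named in the paper


def filter_tweets(tweets: Iterable[Dict[str, str]]) -> List[str]:
    """Keep the text of English, non-retweet, non-duplicate keyword matches."""
    seen = set()
    kept = []
    for tweet in tweets:
        text = tweet["text"]
        if tweet.get("lang") != "en":
            continue                      # English tweets only
        if text.startswith("RT @"):
            continue                      # assumed retweet marker
        lowered = text.lower()
        if not any(keyword in lowered for keyword in KEYWORDS):
            continue                      # must match at least one keyword
        if lowered in seen:
            continue                      # drop case-insensitive duplicates
        seen.add(lowered)
        kept.append(text)                 # only the tweet text is retained
    return kept


# Made-up sample records.
sample_tweets = [
    {"text": "I feel tired today", "lang": "en"},
    {"text": "RT @someone: I feel tired today", "lang": "en"},
    {"text": "I FEEL tired today", "lang": "en"},
    {"text": "Je me sens bien", "lang": "fr"},
    {"text": "Nice weather this morning", "lang": "en"},
]
kept_tweets = filter_tweets(sample_tweets)
```

In the described pipeline, each retained text would then be published to the “Source-tweets” Kafka topic.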

The collected tweets were then ingested as data streams into the Apache Kafka input
topic. Spark Structured Streaming consumes the stream of tweets from the Kafka topic in
real time into an unbounded table. We implemented several preprocessing steps to refine
the tweet stream effectively. These steps involve removing irrelevant information,
reducing the noise, and extracting appropriate stream data. After preprocessing and
cleaning the streaming tweets, we generated a feature vector and fed it into the most
accurate model previously developed and trained in the batch processing phase to predict
suicidal ideation in real time. The prediction results were then pushed and buffered in a
Kafka output “Predicted-tweets” topic before being consumed by the Power BI application
to visualize the final prediction results in real time.
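A noise-removal step of the kind described above might look like the following sketch. The specific rules (lowercasing, stripping URLs, user mentions, and non-letter characters) are assumptions chosen for illustration; the exact preprocessing steps are those defined in the batch processing phase.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def clean_tweet(text: str) -> str:
    """Lowercase a tweet and strip URLs, mentions, and non-letter characters."""
    text = text.lower()
    text = URL_RE.sub(" ", text)         # remove links
    text = MENTION_RE.sub(" ", text)     # remove @user mentions
    text = NON_LETTER_RE.sub(" ", text)  # remove digits, punctuation, symbols
    return " ".join(text.split())        # collapse repeated whitespace


cleaned = clean_tweet("@sam I feel SO tired today!!! https://example.com #mood")
```

The same cleaning function must be applied in both phases so that the streaming feature vectors match what the model saw during training.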

3 https://developer.twitter.com/en/docs/tutorials/consuming-streaming-data
4 https://docs.tweepy.org/en/latest/index.html

4 Experimental Setup and Performance Analysis
4.1 Experimental Setup
The proposed Apache Spark-based architecture was implemented using the “PySpark” library
to build the classification algorithms: NB, LR, LinearSVC, DT, RF, and MLP. The Apache
Spark cluster was installed on a laptop with 64 GB of RAM, a 1 TB SSD disk drive, and an
Intel Core i7 CPU (14 cores, 20 logical processors). In addition, we integrated multiple
API libraries for the implementation. The ML library of Apache Spark was used to develop
the classification algorithms. Apache Kafka version “2.0.2” was deployed as the input
system for ingesting data streams from Twitter. Tweepy version “4.10.0” was used for
connecting to the Twitter API. Spark Structured Streaming was applied for receiving and
processing stream tweets from Kafka topics, and the Power BI application was used for
visualizing the real-time streaming prediction results.

4.2 Exploratory Data Analysis

We used word clouds to explore the dataset utilized in this work. Word clouds serve as
visual representations of the most frequently appearing terms within the dataset; the
font size represents the frequency of each word. Figures 3 and 4 show that most suicidal
postings contain the words “want,” “friend,” and “think” in large font sizes. In the word
cloud of non-suicidal posts, on the other hand, the most frequently repeated words shown
in larger fonts include “though,” “feel,” and “die.”

Fig. 3 Word cloud representation of suicidal-related postings
Fig. 4 Word cloud representation of non-suicidal-related postings

4.3 Evaluation of Batch Processing Phase

This section presents and discusses the experiments conducted in the batch processing
phase for identifying suicidal ideation in individuals based on their social media posts.
Our primary objective was to determine the most efficient model with the highest
performance to adopt for the real-time streaming prediction phase. We used multiple
Apache Spark ML algorithms in this work, including Naïve Bayes (NB), Logistic Regression
(LR), Linear Support Vector Classifier (LinearSVC), Decision Tree (DT), Random Forest
(RF), and Multilayer Perceptron (MLP). The algorithms were trained and evaluated using
data from the Reddit forum, using three different strategies for feature extraction:
TF-IDF, N-gram, and the CountVectorizer technique. Multiple combinations of these feature
extraction methods were implemented to extract the essential features. A hyperparameter
tuning strategy was adopted to find the optimal parameter settings for each model
configuration. Two methods are commonly employed for hyperparameter tuning: random search
and grid search.
In this work, we utilized grid search as the hyperparameter tuning technique in the
experiments. The grid search process aims to find the most suitable parameter values for
each classifier to enhance the overall performance.
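Grid search simply evaluates every combination of candidate parameter values and keeps the best one. The sketch below shows the idea in plain Python; the parameter names `regParam` and `maxIter`, the candidate values, and the toy scoring function are illustrative assumptions, not the grids used in the experiments.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple


def grid_search(param_grid: Dict[str, List],
                evaluate: Callable[[Dict], float]) -> Tuple[Dict, float]:
    """Score every parameter combination and return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)  # e.g. mean cross-validated accuracy
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params: Dict) -> float:
    """Stand-in for model training; peaks at regParam=0.1, maxIter=100."""
    return -(params["regParam"] - 0.1) ** 2 - (params["maxIter"] - 100) ** 2 / 1e4


grid = {"regParam": [0.01, 0.1, 1.0], "maxIter": [50, 100]}
best_params, best_score = grid_search(grid, toy_score)
```

The cost grows with the product of the grid sizes (here 3 x 2 = 6 evaluations), which is the main reason random search is sometimes preferred for large grids.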
Furthermore, we made use of 10-fold cross-validation, a widespread and reliable technique
for minimizing overfitting, enhancing the validity and reliability of the classification
models, and balancing bias and variance. With the 10-fold cross-validation strategy, the
given data were subdivided randomly into ten subsets of the same size; in each round, one
subset was used for validation, while the other nine subsets were used for training.
Cross-validation was executed ten times, with each of the ten subsets used as the
validation set exactly once. To get a final estimate, the results were averaged across
the ten folds. Table 4 and Figures 5, 6, 7, and 8 illustrate the experimental results and
comparative performance assessment of multiple Spark ML classifiers using a binary
classification evaluator.
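The fold construction described above can be sketched as follows: the sample indices are shuffled, dealt into k folds, and each fold serves as the validation set exactly once. The function names and the fixed seed are illustrative assumptions.

```python
import random
from typing import Callable, Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int,
                   seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (training, validation) index lists for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k folds of near-equal size
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


def cross_validate(n_samples: int, k: int,
                   fit_and_score: Callable[[List[int], List[int]], float]) -> float:
    """Average the per-fold scores to obtain the final estimate."""
    scores = [fit_and_score(train, val)
              for train, val in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


splits = list(k_fold_indices(100, 10))
```

Here `fit_and_score` is a placeholder for training a classifier on the training indices and returning its accuracy on the validation indices.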
From all experimental results, we found that the Multilayer Perceptron (MLP) classifier
outperformed the other classification algorithms and achieved the highest accuracy rate
of 93.47% and an AUC score of 98.12%. The Logistic Regression (LR) classifier also
performed well, though somewhat below the MLP classifier, and achieved the second-highest
performance, with an accuracy rate of 92.14%. In addition, the results showed no
significant performance difference between the Linear Support Vector Classifier
(LinearSVC) and Naïve Bayes (NB). Unexpectedly, we found that Decision Tree (DT) and
Random Forest (RF) underperformed the other classifiers utilized in this work despite
their efficacy in numerous machine-learning scenarios.
Also, the experimental results show that most classifier models that used N-gram + CV-IDF
as their feature extraction approach performed better than those that used the N-gram +
TF-IDF feature approach. The classifier algorithms were also evaluated using another
metric known as the Area Under the Curve (AUC). The metric provides a value ranging from
0 to 1; a value closer to 1 indicates better classification results. Figures 9, 10, 11,
and 12 display the AUC comparison of all the classification methods.
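To make the (Unigram + Bigram) + CV-IDF combination more concrete, the sketch below builds unigram and bigram counts per document and re-weights them by inverse document frequency. It reads “CV-IDF” as term counts followed by IDF weighting, which is an assumption, and it uses a smoothed IDF of the form log((N + 1) / (df + 1)); the two sample documents are made up.

```python
import math
from collections import Counter
from typing import Dict, List


def ngrams(tokens: List[str], n: int) -> List[str]:
    """All contiguous n-token sequences, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_vectors(docs: List[str]) -> List[Counter]:
    """Unigram + bigram term counts for each whitespace-tokenized document."""
    vectors = []
    for doc in docs:
        tokens = doc.split()
        vectors.append(Counter(ngrams(tokens, 1) + ngrams(tokens, 2)))
    return vectors


def idf_weights(counts: List[Counter]) -> Dict[str, float]:
    """Smoothed inverse document frequency for every term."""
    n_docs = len(counts)
    doc_freq = Counter(term for vector in counts for term in vector)
    return {term: math.log((n_docs + 1) / (df + 1)) for term, df in doc_freq.items()}


def tf_idf(counts: List[Counter], idf: Dict[str, float]) -> List[Dict[str, float]]:
    """Multiply each term count by the term's IDF weight."""
    return [{term: tf * idf[term] for term, tf in vector.items()} for vector in counts]


docs = ["i feel tired", "i feel fine"]
counts = count_vectors(docs)
weights = tf_idf(counts, idf_weights(counts))
```

Terms that occur in every document, such as the bigram “i feel” here, receive a weight of zero, while terms that distinguish the documents keep a positive weight.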

Table 4 Performance Comparison of Classification Algorithms on testing dataset

Model      Feature Extraction Combination    ACC.   PRE.   REC.   F1.    AUC.
NB         Unigram + TF-IDF                  88.02  88.66  88.02  87.97  95.41
NB         Unigram + CV-IDF                  89.49  90.21  89.49  89.44  96.41
NB         Bigram + CV-IDF                   75.86  81.07  75.86  74.81  94.60
NB         (Unigram + Bigram) + CV-IDF       90.36  91.09  90.36  90.32  96.97
LR         Unigram + TF-IDF                  91.40  91.64  91.40  91.38  97.17
LR         Unigram + CV-IDF                  91.98  92.20  91.98  91.96  97.55
LR         Bigram + CV-IDF                   87.56  88.50  87.56  87.48  94.54
LR         (Unigram + Bigram) + CV-IDF       92.14  92.36  92.13  92.12  97.67
LinearSVC  Unigram + TF-IDF                  90.58  91.01  90.58  90.56  96.69
LinearSVC  Unigram + CV-IDF                  91.59  92.01  91.59  91.57  97.45
LinearSVC  Bigram + CV-IDF                   86.36  88.05  86.36  86.21  94.62
LinearSVC  (Unigram + Bigram) + CV-IDF       90.90  91.54  90.89  90.86  97.59
DT         Unigram + TF-IDF                  86.05  86.02  86.05  86.03  87.70
DT         Unigram + CV-IDF                  86.46  86.60  86.46  86.44  87.81
DT         Bigram + CV-IDF                   72.92  77.87  72.92  71.66  73.82
DT         (Unigram + Bigram) + CV-IDF       86.46  86.60  86.45  86.44  87.81
RF         Unigram + TF-IDF                  86.25  86.22  86.25  86.22  93.71
RF         Unigram + CV-IDF                  86.47  86.80  86.47  86.44  93.96
RF         Bigram + CV-IDF                   79.77  82.31  79.77  79.37  88.03
RF         (Unigram + Bigram) + CV-IDF       85.86  86.27  85.86  85.82  93.52
MLP        Unigram + TF-IDF                  92.66  92.66  92.66  92.66  97.70
MLP        Unigram + CV-IDF                  93.33  93.33  93.33  93.33  97.99
MLP        Bigram + CV-IDF                   88.84  88.93  88.84  88.84  94.48
MLP        (Unigram + Bigram) + CV-IDF       93.47  93.47  93.47  93.47  98.12

Fig. 5 Comparison of performance results of all classification algorithms with Unigram + TF-IDF features
Fig. 6 Comparison of performance results of all classification algorithms with Unigram + CV-IDF features
Fig. 7 Comparison of performance results of all classification algorithms with Bigram + CV-IDF features
Fig. 8 Comparison of performance results of all classification algorithms with (Unigram + Bigram) + CV-IDF features
Fig. 9 Comparison of ROC-AUC of all classification algorithms with Unigram + TF-IDF features method
Fig. 10 Comparison of ROC-AUC of all classification algorithms with Unigram + CV-IDF features method
Fig. 11 Comparison of ROC-AUC of all classification algorithms with Bigram + CV-IDF features method
Fig. 12 Comparison of ROC-AUC of all classification algorithms with (Unigram + Bigram) + CV-IDF features method

4.4 Evaluation of Real-Time Streaming Prediction Phase

The real-time streaming prediction phase used the classifier models already developed
and pre-trained during the batch processing phase to evaluate their ability to predict
suicidal ideation from Twitter streaming data. After designing and assessing the
classifier models in the batch processing phase, the classifier with the greatest
performance in our experiment, MLP with the (Unigram + Bigram) + CV-IDF feature
extraction combination, was applied for predicting suicidal ideation-related Twitter
content in real time. We collected streaming tweets using the Twitter API with multiple
keywords, including “feel,” “want to die,” and “kill myself”, which were then pushed into
the Apache Kafka input topic. These streams of tweets were consumed by Apache Spark
Structured Streaming from the Kafka input topic, preprocessed as a data stream, and used
to generate a feature vector. The best pre-trained model from the batch processing phase
then analyzed the stream of preprocessed tweets and predicted whether these tweets were
suicidal or normal content in real time. The prediction results were then pushed to a
Kafka output topic for buffering and consumed by the Power BI application to visualize
the prediction results in real time. In our work, a total of 764 tweets were collected as
a data stream to examine the prediction ability in the real-time streaming prediction
phase. The results indicated that 9.29% of the tweets were predicted as suicidal, whereas
90.71% were predicted as non-suicidal. Figure 13 shows the results obtained in the
real-time streaming prediction phase.
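As a quick consistency check on the reported percentages, the short sketch below infers the tweet counts that would produce them. The counts of 71 and 693 are derived here from the rounded percentages and the total of 764; they are an inference, not figures stated in the paper.

```python
TOTAL_TWEETS = 764            # total stream size reported in the paper
REPORTED_SUICIDAL_PCT = 9.29  # reported share predicted as suicidal

# Nearest whole number of tweets matching the reported percentage (inferred).
suicidal = round(TOTAL_TWEETS * REPORTED_SUICIDAL_PCT / 100)
non_suicidal = TOTAL_TWEETS - suicidal

suicidal_pct = round(100 * suicidal / TOTAL_TWEETS, 2)
non_suicidal_pct = round(100 * non_suicidal / TOTAL_TWEETS, 2)
```

Under this inference, the two shares round to the reported 9.29% and 90.71% and sum to 100%.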

Fig. 13 Power BI visualization of real-time streaming prediction phase results

5 Discussions
In this study, we proposed a big data approach to predict suicidal ideation based on data
collected from social media platforms. The proposed methodology comprised two phases:
batch processing and real-time streaming prediction. The system utilized six Spark ML
algorithms to build the classification models and compared their performance. In the
streaming data pipeline, live streams of tweets are collected from Twitter using the
keywords “feel”, “want to die” and “kill myself” and then sent to the Kafka topic. Spark
Structured Streaming receives the stream data from the Kafka topic, extracts the optimal
features, and then sends batches of preprocessed data to the real-time streaming
prediction model to predict whether each tweet contains indications of suicidal ideation.
This work used three feature extraction methods, TF-IDF, N-gram, and CountVectorizer, in
different combination scenarios to extract the optimal features from the input data. The
experimental results of the six classification models showed that the MLP classifier had
the highest accuracy of 93.47% with the features extracted using the (Unigram + Bigram) +
CV-IDF feature extraction scenario. The MLP classifier also obtained a high accuracy of
93.33% with features extracted using Unigram + CV-IDF. In addition, MLP provided the best
accuracy among the classifiers using Unigram + TF-IDF, at 92.66%.
In comparing our experimental results with related works, we noticed that the highest
accuracy obtained from the MLP classifier is higher than the XGBoost and logistic
regression accuracy rates of 83.87% and 86.45%, respectively, achieved by S. Jain et al.
[13]. It is also higher than the accuracy and F1 score rates of 80% and 92%,
respectively, achieved by A. E. Aladağ et al. [18]. Furthermore, our methodology
outperformed the accuracy rate of 76.80% that was recorded by V. Desu et al. [15]. In
addition, our experimental results registered a higher performance than the Naïve Bayes
algorithm of M. Birjali et al. [24], which achieved a Precision of 87.50%, a Recall of
78.8%, and an F1 value of 82.9%. Therefore, we adopted the MLP classifier with the
(Unigram + Bigram) + CV-IDF feature combination scenario to predict suicidal ideation in
the second phase of real-time streaming prediction using Twitter streaming data.
That being said, further improvements can be made to extend this study. The first
improvement can be achieved by increasing the number of features of the textual data
using additional data such as emoticons, special characters, and symbols to extract
optimal features and reduce the misclassification results. Moreover, the dataset can
be expanded by gathering additional textual data from other social media platforms
to make our data more representative and varied.

6 Conclusion and Future work

In conclusion, this paper proposed a real-time streaming prediction system for predicting
suicidal ideation from users’ posts on social networks using a big data analytics
environment. The methodology analyzes social media content in two phases: batch
processing and real-time streaming prediction. Our system used two types of datasets:
historical Reddit data for model building and streaming Twitter data for real-time
prediction. Our proposed methodology for building binary classification models was
evaluated using various assessment metrics and showed high levels of accuracy and AUC
scores with stable Recall and Precision. The experimental results of the batch processing
phase revealed that the MLP classifier achieved the highest classification accuracy of
93.47% on an unseen dataset and was used for the real-time streaming prediction phase.
According to the results of various testing scenarios, we can conclude that the features
retrieved from stream data could accurately determine the suicidal ideation of users in
real time. The developed system might also assist public health professionals with
limited resources in identifying and managing suicidal ideation and preparing
preventative steps to save lives. For future work, multiple languages, such as Turkish
and Arabic, can be added. To deal with such datasets, which require sequential
information and local feature engineering, we may use ensemble LSTM and CNN models for
better performance. We also plan to develop a web or mobile interface as a text-analysis
tool to detect an individual’s health status.

References
[1] W.H. Organization. World Health Organization. URL
https://www.who.int/news-room/events/detail/2022/09/10/default-calendar/world-suicide-prevention-day-2022
[2] M.W. Gijzen, S.P. Rasing, D.H. Creemers, F. Smit, R.C. Engels, D. De Beurs, Suicide
ideation as a symptom of adolescent depression. a network analysis. Journal of Affective
Disorders 278, 68–77 (2021)
[3] A. Roy, K. Nikolitch, R. McGinn, S. Jinah, W. Klement, Z.A. Kaminsky, A machine
learning approach predicts future risk to suicidal ideation from social media data. NPJ
digital medicine 3(1), 1–12 (2020)
[4] T.H. Aldhyani, S.N. Alsubari, A.S. Alshebami, H. Alkahtani, Z.A. Ahmed, Detecting and
analyzing suicidal ideation on social media using deep learning and machine learning
models. International journal of environmental research and public health 19(19), 12635
(2022)
[5] N.A. Baghdadi, A. Malki, H.M. Balaha, Y. AbdulAzeem, M. Badawy, M. Elhosseini, An
optimized deep learning approach for suicide detection through Arabic tweets. PeerJ
Computer Science 8, e1070 (2022)
[6] S.A. Senthilkumar, B.K. Rai, A.A. Meshram, A. Gunasekaran, S. Chandrakumarmangalam,
Big data in healthcare management: a review of literature. American Journal of
Theoretical and Applied Business 4(2), 57–69 (2018)
[7] S. Ayvaz, M.O. Shiha, A scalable streaming big data architecture for real-time
sentiment analysis, in Proceedings of the 2018 2nd international conference on cloud and
big data computing (2018), pp. 47–51
[8] A.H. Alamoodi, B.B. Zaidan, A.A. Zaidan, O.S. Albahri, K.I. Mohammed, R.Q. Malik,
E.M. Almahdi, M.A. Chyad, Z. Tareq, A.S. Albahri, et al., Sentiment analysis and its
applications in fighting covid-19 and infectious diseases: A systematic review. Expert
systems with applications 167, 114155 (2021)
[9] G. Agarwal, S.K. Dinkar, A. Agarwal, Binarized spiking neural networks optimized with
nomadic people optimization-based sentiment analysis for social product recommendation.
Knowledge and Information Systems 66(2), 933–958 (2024)
[10] P. Rita, N. António, A.P. Afonso, Social media discourse and voting decisions
influence: sentiment analysis in tweets during an electoral period. Social Network
Analysis and Mining 13(1), 46 (2023)
[11] N. Öztürk, S. Ayvaz, Sentiment analysis on twitter: A text mining approach to the
syrian refugee crisis. Telematics and Informatics 35(1), 136–147 (2018)
[12] M.A. Allayla, S. Ayvaz, A Hybrid and Scalable Sentiment Analysis Framework: Case of
Russo-Ukrainian War, in 2023 3rd International Scientific Conference of Engineering
Sciences (ISCES) (IEEE, 2023), pp. 13–18
[13] S. Jain, S.P. Narayan, R.K. Dewang, U. Bhartiya, N. Meena, V. Kumar, A machine
learning based depression analysis and suicidal ideation detection system using
questionnaires and twitter, in 2019 IEEE Students Conference on Engineering and Systems
(SCES) (IEEE, 2019), pp. 1–6
[14] R. Sawhney, P. Manchanda, R. Singh, S. Aggarwal, A computational approach to feature
extraction for identification of suicidal ideation in tweets, in Proceedings of ACL 2018,
Student Research Workshop (2018), pp. 91–98
[15] V. Desu, N. Komati, S. Lingamaneni, F. Shaik, Suicide and Depression Detection in
Social Media Forums, in Smart Intelligent Computing and Applications, Volume 2:
Proceedings of Fifth International Conference on Smart Computing and Informatics (SCI
2021) (Springer, 2022), pp. 263–270
[16] N. Wang, F. Luo, Y. Shivtare, V.D. Badal, K.P. Subbalakshmi, R. Chandramouli, E.
Lee, Learning models for suicide prediction from social media posts. arXiv preprint
arXiv:2105.03315 (2021)
[17] M. Chatterjee, P. Kumar, P. Samanta, D. Sarkar, Suicide ideation detection from
online social media: A multi-modal feature based technique. International Journal of
Information Management Data Insights 2(2), 100103 (2022)
[18] A.E. Aladağ, S. Muderrisoglu, N.B. Akbas, O. Zahmacioglu, H.O. Bingol, Detecting
suicidal ideation on forums: proof-of-concept study. Journal of medical Internet research
20(6), e9840 (2018)
[19] N.J. Carson, B. Mullin, M.J. Sanchez, F. Lu, K. Yang, M. Menezes, B.L. Cook,
Identification of suicidal behavior among psychiatrically hospitalized adolescents using
natural language processing and machine learning of electronic health records. PloS one
14(2), e0211116 (2019)
[20] A. Roy, K. Nikolitch, R. McGinn, S. Jinah, W. Klement, Z.A. Kaminsky, A machine
learning approach predicts future risk to suicidal ideation from social media data. NPJ
digital medicine 3(1), 1–12 (2020)
[21] M.J. Vioules, B. Moulahi, J. Azé, S. Bringay, Detection of suicide-related posts in
Twitter data streams. IBM Journal of Research and Development 62(1), 1–7 (2018)
[22] W. Jung, D. Kim, S. Nam, Y. Zhu, Suicidality detection on social media using
metadata and text feature extraction and machine learning. Archives of suicide research
pp. 1–16 (2021)
[23] M.M. Tadesse, H. Lin, B. Xu, L. Yang, Detection of depression-related posts in
reddit social media forum. IEEE Access 7, 44883–44893 (2019)
[24] M. Birjali, A. Beni-Hssane, M. Erritali, Machine learning and semantic sentiment
analysis based algorithms for suicide sentiment prediction in social networks. Procedia
Computer Science 113, 65–72 (2017)
[25] E. Shaikh, I. Mohiuddin, Y. Alufaisan, I. Nahvi, Apache spark: A big data processing
engine, in 2019 2nd IEEE Middle East and North Africa COMMunications Conference
(MENACOMM) (IEEE, 2019), pp. 1–6
[26] M. Junaid, S. Ali, I.F. Siddiqui, C. Nam, N.M.F. Qureshi, J. Kim, D.R. Shin,
Performance Evaluation of Data-driven Intelligent Algorithms for Big data Ecosystem.
Wireless Personal Communications 126(3), 2403–2423 (2022).
https://doi.org/10.1007/s11277-021-09362-7
[27] K. Deshpande, M. Rao, in Inventive Computation and Information Technologies
(Springer, 2022), pp. 607–630
[28] Nikhileswar Komati. Suicide and Depression Detection. URL
https://www.kaggle.com/datasets/nikhileswarkomati/suicide-watch
[29] S. Vijayarani, M.J. Ilamathi, M. Nithya, Preprocessing techniques for text mining-an
overview. International Journal of Computer Science and Communication Networks 5(1), 7–16
(2015)
[30] S.F.C. Haviana, B.S.W. Poetro, Deep learning model for sentiment analysis on short
informal texts. Indonesian Journal of Electrical Engineering and Informatics (IJEEI)
10(1), 82–89 (2022)
[31] W. Shang, T. Underwood, Improving Measures of Text Reuse in English Poetry: A TF–IDF
Based Method, in International Conference on Information (Springer, 2021), pp. 469–477
[32] R. Vijaya Prakash, Machine Learning Approach To Forecast the Word in Social Media.
Social Network Analysis: Theory and Applications pp. 133–147 (2022)
[33] J. Brownlee, Deep learning for natural language processing: develop deep learning
models for your natural language problems (Machine Learning Mastery, 2017)
[34] R. Mehmood, B. Bhaduri, I. Katib, I. Chlamtac, Smart Societies, Infrastructure,
Technologies and Applications: First International Conference, SCITA 2017, Jeddah, Saudi
Arabia, November 27–29, 2017, Proceedings, vol. 224 (Springer, 2018)
[35] E.M.K. Reddy, A. Gurrala, V.B. Hasitha, K.V.R. Kumar, Introduction to Naive Bayes
and a Review on Its Subtypes with Applications. Bayesian Reasoning and Gaussian Processes
for Machine Learning Applications pp. 1–14 (2022)
[36] A. Goel, J. Gautam, S. Kumar, Real time sentiment analysis of tweets using Naive
Bayes, in 2016 2nd International Conference on Next Generation Computing Technologies
(NGCT) (IEEE, 2016), pp. 257–261
[37] M. Jena, R.K. Behera, S. Dehuri, in Advances in Machine Learning for Big Data
Analysis (Springer, 2022), pp. 223–239
[38] L. Breiman, Random Forests. Machine Learning 45(1), 5–32 (2001).
https://doi.org/10.1023/A:1010933404324
[39] N. Syam, R. Kaul, in Machine Learning and Artificial Intelligence in Marketing and
Sales (Emerald Publishing Limited, 2021)
[40] N. Jalal, A. Mehmood, G.S. Choi, I. Ashraf, A novel improved random forest for text
classification using feature ranking and optimal number of trees. Journal of King Saud
University-Computer and Information Sciences (2022)

Priyanka RDC 2
No ratings yet
Priyanka RDC 2
26 pages
Projectsysnopsis
No ratings yet
Projectsysnopsis
7 pages
Depression Detection via BERT on Social Media
No ratings yet
Depression Detection via BERT on Social Media
4 pages
1 s2.0 S1877050923001412 Main
No ratings yet
1 s2.0 S1877050923001412 Main
9 pages
Department of Information Technology: Data Structure Semester IV (4IT01) Question Bank Prepared by Prof. Ankur S. Mahalle
100% (1)
Department of Information Technology: Data Structure Semester IV (4IT01) Question Bank Prepared by Prof. Ankur S. Mahalle
13 pages
Database Test 1
No ratings yet
Database Test 1
2 pages
User'S Manual: Rugged Mobile Computing Solutions
No ratings yet
User'S Manual: Rugged Mobile Computing Solutions
148 pages
JavaScript - BOM Concept
No ratings yet
JavaScript - BOM Concept
44 pages
Game Devs: Customize Dino Adventure
No ratings yet
Game Devs: Customize Dino Adventure
5 pages
BIM in Hospital Building Management
No ratings yet
BIM in Hospital Building Management
4 pages
Avenza Map Documento PDF
No ratings yet
Avenza Map Documento PDF
37 pages
Dwarka Computer Services.: Office No B228, Bharat Vihar Near Dwarka Sector 14 Metro Station Dwarka New Delhi 110078
No ratings yet
Dwarka Computer Services.: Office No B228, Bharat Vihar Near Dwarka Sector 14 Metro Station Dwarka New Delhi 110078
2 pages
Licensing Guide PLT Windows Server 2025
No ratings yet
Licensing Guide PLT Windows Server 2025
32 pages
Oop Notes
No ratings yet
Oop Notes
75 pages
Ceper Foundry's Industry 4.0 Readiness
No ratings yet
Ceper Foundry's Industry 4.0 Readiness
4 pages
SMART Line HMI V5 Manual - Operating Instructions
No ratings yet
SMART Line HMI V5 Manual - Operating Instructions
15 pages
TranHaiThien B2203580 CT104H Lab03
No ratings yet
TranHaiThien B2203580 CT104H Lab03
9 pages
Outdoor Fingerprint & Card Reader/Controller: IP65 Waterproof
No ratings yet
Outdoor Fingerprint & Card Reader/Controller: IP65 Waterproof
2 pages
Smart Door Lock System
No ratings yet
Smart Door Lock System
13 pages
TK3163 Tutorial 7 TK3163 2023 Top-Down
No ratings yet
TK3163 Tutorial 7 TK3163 2023 Top-Down
3 pages
DNSSec Tutorial 4 - Phil Regnauld and Hervey Allen PDF
No ratings yet
DNSSec Tutorial 4 - Phil Regnauld and Hervey Allen PDF
9 pages
Rem 610
No ratings yet
Rem 610
52 pages
Working Instructions: - Mechanical
No ratings yet
Working Instructions: - Mechanical
30 pages
Omada Controller Software 3.0.2 - UG PDF
No ratings yet
Omada Controller Software 3.0.2 - UG PDF
146 pages
Softing-DB ECU TEST E
No ratings yet
Softing-DB ECU TEST E
2 pages
Java Array Operations Guide
No ratings yet
Java Array Operations Guide
14 pages
Zrzut Ekranu 2023-12-04 o 18.28.20
No ratings yet
Zrzut Ekranu 2023-12-04 o 18.28.20
1 page
DE1-SoC User Manual
No ratings yet
DE1-SoC User Manual
113 pages
Twitter 2 Comparativestudy
No ratings yet
Twitter 2 Comparativestudy
13 pages
Cyber Security Awareness Among Malaysian Pre-University Students
No ratings yet
Cyber Security Awareness Among Malaysian Pre-University Students
198 pages
Duc Thinh Phan - LinkedIn
No ratings yet
Duc Thinh Phan - LinkedIn
1 page
Wit13 01 Rms 20230817
No ratings yet
Wit13 01 Rms 20230817
27 pages
Easergy Micom P139: Feeder Management and Bay Control
No ratings yet
Easergy Micom P139: Feeder Management and Bay Control
1,284 pages
Veeam Backup For Microsoft Azure Short Deck
No ratings yet
Veeam Backup For Microsoft Azure Short Deck
14 pages
