Ensemble Machine Learning Models in Predicting
2023 International Conference on Advances in Computing, Communication and Applied Informatics (ACCAI) | 979-8-3503-1590-5/23/$31.00 ©2023 IEEE | DOI: 10.1109/ACCAI58221.2023.10199294
Department of Computer Science and Engineering Amrita School of Computing, Amrita Vishwa Vidyapeetham, Chennai, India
Abstract- Personality prediction refers to the use of machine learning techniques to predict an individual's personality traits based on various sources of data, such as text, images, and social media usage. Personality traits refer to persistent patterns of behaviors, thoughts, and feelings that differentiate one individual from another. This work predicts people's personality traits from their social media posts using various machine learning models. With the help of this model, a person's personality can be classified into one of the 16 Myers-Briggs personality types. With the availability of a huge amount of data on human behavior and personality traits, it is possible to train a machine learning model and predict the personality trait of a person. The ML model assesses the person based on their social media posts. The data consists of the posts from social media and the personality type to which a person belongs. The NLTK library is used to assess and pre-process the data. Four machine learning models have been built: logistic regression, support vector machines (SVM), naive Bayes, and random forest. Finally, we compare the results of the machine learning models to determine which one is best based on evaluation metrics (accuracy score, geometric mean score, ROC-AUC score). Furthermore, this can be used in the personalization of online advertisements and campaigns. It can also be used by social media companies to attract users based on their personality traits and preferences.

Keywords- Personality prediction, Myers-Briggs personality types, NLTK, logistic regression, SVM, Naive Bayes, Random Forest.

I. INTRODUCTION

Personality is defined as the unique set of thinking, feeling, and behavioral patterns that vary between individuals. In modern times, it is common for people to have multiple social media accounts and post a large number of messages daily. Social media is often used as a platform for expressing one's views on personal matters such as family, psychological well-being, financial issues, interactions with society and the environment, and politics. In this work, an attempt was made to predict the personality of a person based on social media content. For this, we use the Myers-Briggs Type Indicator (MBTI) to classify people's personalities under various categories. The MBTI is one of the most popular personality tests in the world. The dataset contains 16 personality types across 4 characteristic traits, namely, introversion (I) and extraversion (E), intuition (N) and sensing (S), thinking (T) and feeling (F), and judging (J) and perceiving (P). When a person is found to be introverted, intuitive, thinking, and judging, he or she will be labelled as having an INTJ personality type.

Data pre-processing is the most important part of this project, since we are required to find important features from the huge dataset and train the machine learning models with them. We need to pre-process the data and extract important features from the social media posts given in the dataset. The data pre-processing step includes tokenization, lemmatization, sentiment analysis, part-of-speech (POS) tagging, and vectorization. Furthermore, a comparison is made between different ML models, including logistic regression, SVM, random forest, and naive Bayes. We then find the best-performing model and use it on the testing data.

This model can be implemented for various applications. Some of them are the personalization of advertisement campaigns for products sold by companies; use in recommender systems, since personality traits are closely linked with user preferences; and use by social media companies to attract users based on their personality traits and
Authorized licensed use limited to: St. Petersburg State University. Downloaded on January 31,2025 at 19:41:03 UTC from IEEE Xplore. Restrictions apply.
preferences. In the upcoming sections, we discuss the implementation of the ML model, along with the challenges faced and how they were solved.

II. LITERATURE REVIEW

The work in [1] investigates the potential of using machine learning methods to predict a person's personality traits through their digital footprints, that is, the data collected from social media and online forums. The Random Forest algorithm demonstrated the highest performance, with an average accuracy of 62.5% across all five personality traits [1]. To enhance the precision of personality prediction, a deep learning model is utilized that considers both textual and visual characteristics of social media data. The study used a dataset comprising self-reported personality traits of 861 participants and a deep neural network model known as a "Convolutional Recurrent Neural Network" (CRNN) to forecast the personality traits [2]. The analysis of personality through social media data can transform the domain of personality psychology by providing customized content suggestions and precise advertising, and can inform further investigations into personality prediction based on social media data. It can also aid in the formulation of ethical standards for the use of personal data in both research and commercial applications. Social media data analysis could thus have a far-reaching impact on the field of personality psychology [3]. It is possible to forecast certain aspects of an individual's personality using smartphone data. The research, based on information gathered from 123 participants, indicates that behavioral patterns concerning activity level, social rhythm, and mobility demonstrate a noteworthy correlation with personality traits like conscientiousness, openness, and extraversion. The study suggests that smartphone data can be used effectively to make predictions about an individual's personality [4]. Certain specific personality traits are more effective in forecasting job performance than general personality traits. This finding is important for the design of selection assessments and training programs in professional settings. By identifying the specific personality traits that correlate with job performance, organizations can tailor their recruitment processes and training initiatives accordingly [5]. The language patterns of 2,800 Twitter users were used to predict their personality traits using the Big Five personality traits model. The results showed that linguistic features predicted personality traits with moderate accuracy, with extraversion being the most strongly predicted trait [6]. The study in [7] explores the potential for predicting personality traits of individuals by analyzing their language patterns in email messages through text mining. A sample of 379 individuals was used, and the Big Five personality traits model was applied to the collected email data. Machine learning algorithms were then used to predict personality traits based on email features. Results showed that personality traits could be predicted with moderate accuracy using email text data [7]. Data from social media, blogs, and essays were used to predict personality traits through a combination of linguistic and machine learning methods. The authors employed a dimensional model of language, which allowed for a more nuanced analysis of language use beyond just the frequency of specific words [8]. Convolutional neural networks (CNNs) were used to predict personality traits based on images posted on Instagram. The authors collected data from 80 Instagram users and applied the Big Five personality traits model to analyze the content of their images. The CNN-based approach yielded good prediction accuracy for the extraversion and openness traits, with accuracies of 69.9% and 74.2%, respectively [9]. Specific aspects of personality were predicted from Facebook status updates with higher accuracy than previous studies conducted on other digital media. The algorithm was able to predict scores on all five personality traits accurately, based solely on Facebook status updates [10]. Deep learning techniques were used to predict personality traits from short texts, with a convolutional neural network (CNN) to learn representations of words and phrases and a Long Short-Term Memory (LSTM) network to model the sequences of these learned representations. The model was tested on two datasets: a set of tweets and a set of restaurant reviews. Results demonstrated that the deep learning model outperformed traditional machine learning models, such as Support Vector Machines (SVMs), in predicting personality traits [11]. It is possible to predict some aspects of personality from Twitter language use, although the accuracy varies depending on the trait. Using a regression model to predict each of the Big Five traits, Twitter usage data could predict personality traits with moderate accuracy [12]. Topic modeling is a statistical technique used to identify hidden topics or keywords in a document
collection. Techniques like LSA, pLSA, and LDA are used, with LDA commonly used in multi-document summarization, providing better results than LSA when the number of features for sentence selection is increased [13]. Text document analysis shows potential for content summarization and involves two stages: text abstraction and text summarization. Automated text summarization uses NLP to extract significant information from related documents, and the study proposes a novel technique of ensemble topic vector clustering using SA for efficient processing and summarization [14]. AutoML automates the development of machine learning models, with the aim of increasing productivity and reducing development time. The study in [15] proposes a genetic algorithm-based AutoML model for network architecture search, with an evaluation in binary classification and regression scenarios resulting in 98% accuracy [15]. While machine learning has shown increased accuracy in classification, the quality of the features used has a significant impact on the predictive model outcomes. The study in [16] explores the impact of feature quality on heart disease prediction by employing RFECV with SVM, LR, DT, and RF algorithms. It was found that RF outperformed the other models, achieving a predictive accuracy of 99.7% [16].

III. METHODOLOGY

imbalance is left unnoticed, then it will lead to biased results since the data is skewed. This class imbalance can be visualized in Fig. 1, which clearly shows that the class "INFP" has about 1832 rows of data, but the class "ESTJ" has only about 39 rows.

Fig. 1. Visualization of class imbalance

To overcome this problem, we separate these 16 personality types into 4 binary classes, namely introversion (I), sensing (S), thinking (T), and judging (J). A person is either an introvert or an extrovert; this is represented in the binary class "introversion" as 1 if the person is an introvert and 0 if the person is an extrovert. The same holds for the remaining three classes. We can then classify based on these 4 binary classes and reduce the class imbalance to some extent.
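The split of the 16 MBTI types into four binary classes can be illustrated with a short sketch. It assumes each label is a four-letter code such as "INFP" and encodes each axis as 1 for the letter named in the text (I, S, T, J) and 0 otherwise. The function name `mbti_to_binary` and the toy label list are hypothetical and are not taken from the authors' implementation.

```python
from collections import Counter

# Axis name -> the letter that maps to 1, following the text:
# introversion (I), sensing (S), thinking (T), judging (J).
AXES = (("introversion", "I"), ("sensing", "S"), ("thinking", "T"), ("judging", "J"))


def mbti_to_binary(mbti_type: str) -> dict:
    """Turn a four-letter MBTI code such as 'INFP' into four 0/1 labels."""
    code = mbti_type.strip().upper()
    if len(code) != 4:
        raise ValueError(f"expected a four-letter MBTI code, got {mbti_type!r}")
    # Position i of the code belongs to axis i (I/E, N/S, T/F, J/P).
    return {name: int(code[i] == letter) for i, (name, letter) in enumerate(AXES)}


# Hypothetical toy labels, only to show how a 16-way label becomes four binary targets.
labels = ["INFP", "INFP", "INTJ", "ESTJ", "ENFP"]
rows = [mbti_to_binary(t) for t in labels]

# Class counts per binary axis; each axis is far less skewed than the 16-way split.
per_axis_counts = {name: Counter(r[name] for r in rows) for name, _ in AXES}
print(mbti_to_binary("INFP"))
print(per_axis_counts["introversion"])
```

Each of the four resulting columns can then be treated as a separate binary classification target, which is what allows a rare type such as "ESTJ" to contribute examples to four better-populated classes.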
C. Sentiment Scoring and Analysis
Sentiment analysis, which is sometimes called "opinion mining," is an NLP technique used to identify the emotional tone of a piece of text. Companies frequently use this technique to analyze customer opinions and classify them according to a particular product, service, or concept. Sentiment analysis involves utilizing machine learning (ML), data mining, and artificial intelligence (AI) to extract subjective information and analyze text for sentiment. We loaded the cleaned dataset obtained from the lemmatization step and proceeded with sentiment scoring. We used NLTK's VADER module to find the sentiment scores. We received four distinct scores: a compound score, a positive score, a negative score, and a neutral score. Since we were also using the Naive Bayes model, which cannot handle negative values, we had to rescale and normalize the data with a min-max scaler.

speech, such as nouns, verbs, adverbs, adjectives, pronouns, conjunctions, and their subclasses, to each word in the text. After POS tagging, the tagged words are grouped based on the 12 important categories of the Stanford list. For each row in the dataset, we calculate the average value for these 12 POS tags.

E. Vectorizing
As we are dealing with textual data, we need to convert it into a numerical format so that the ML model can be trained on the dataset. To achieve this, we make use of two well-known vectorizing methods, the "TF-IDF vectorizer" and the "Count vectorizer." The reason for choosing both is to find the vectorizing method with which the ML model performs best. Both vectorizers are provided by the scikit-learn library.
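To make the difference between the two vectorizing methods concrete, the following is a minimal pure-Python sketch rather than the scikit-learn implementation used in this work. It assumes whitespace tokenization and a smoothed inverse document frequency, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization; these are common textbook choices and may differ in detail from the library defaults. The function names and the mini-corpus are hypothetical.

```python
import math
from collections import Counter


def count_vectorize(docs):
    """Bag-of-words counts: one row per document, one column per vocabulary term."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[term] for term in vocab])
    return vocab, rows


def tfidf_vectorize(docs):
    """Re-weight the counts by a smoothed inverse document frequency, then L2-normalize."""
    vocab, counts = count_vectorize(docs)
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1.0 for d in df]
    rows = []
    for row in counts:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted)) or 1.0
        rows.append([v / norm for v in weighted])
    return vocab, rows


# Hypothetical mini-corpus standing in for the cleaned social media posts.
posts = ["i love quiet evenings", "i love loud parties", "quiet quiet please"]
vocab, count_rows = count_vectorize(posts)
_, tfidf_rows = tfidf_vectorize(posts)
print(vocab)
print(count_rows[2])
print([round(v, 3) for v in tfidf_rows[0]])
```

The count vectorizer treats every occurrence equally, whereas the TF-IDF weighting gives more weight to terms that appear in few posts, which is one plausible reason the two methods can lead to different classifier performance.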
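The rescaling applied to the sentiment scores in Section C can be sketched as follows. Min-max scaling maps each value x to (x - min) / (max - min), so a score that may be negative, such as the VADER compound score in [-1, 1], ends up in [0, 1]. The helper `min_max_scale` and the sample scores are hypothetical; the paper itself uses a library min-max scaler, and this sketch only illustrates the arithmetic.

```python
def min_max_scale(values, feature_range=(0.0, 1.0)):
    """Linearly map values so the smallest becomes range[0] and the largest range[1]."""
    lo, hi = min(values), max(values)
    a, b = feature_range
    if hi == lo:
        # A constant column carries no information; map it to the lower bound.
        return [a for _ in values]
    return [a + (v - lo) * (b - a) / (hi - lo) for v in values]


# Hypothetical compound sentiment scores for five posts (may be negative).
compound = [-0.8, -0.2, 0.0, 0.5, 0.9]
scaled = min_max_scale(compound)
print([round(v, 4) for v in scaled])
```

Because the mapping is linear and increasing, the ordering of the posts by sentiment is preserved while every feature value becomes non-negative, which is the property the Naive Bayes model requires.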
IV. IMPLEMENTATION

In the implementation phase, we discuss how we train and test the different machine learning models. We use four algorithms: logistic regression, support vector machine (SVM), naive Bayes, and random forest. Our first step is to split the cleaned dataset into training and testing sets. We allocate 90% of the data for training and 10% for testing. As discussed earlier, the dataset suffers from class imbalance, which needs to be reduced to a certain extent before training the models. Imbalanced datasets are those where there is a severe skew in the class distribution, such as a ratio of 1:100 or 1:1000 examples in the minority class to the majority class. This is a problem, as it is typically the minority class for which predictions are most important. One approach to addressing class imbalance is to randomly resample the training dataset. The Myers-Briggs Twitter dataset has undergone data pre-processing, including various text pre-processing techniques for feature extraction. Feature selection was carried out using k-best feature selection. The resulting features are used to train several machine learning models, including Logistic Regression, SVM, Multinomial Naive Bayes, and Random Forest Classifier. The ultimate goal is to use these models to predict personality traits, as shown in Fig. 4.

Fig. 4. Flow diagram of the proposed method

F. Random Under Sampling and k-Best Feature Selection
There are two primary methods for randomly resampling an imbalanced dataset: undersampling, which involves deleting examples from the majority class, and oversampling, which involves duplicating examples from the minority class. In our implementation, we have used the random under sampler provided by the imblearn library to reduce the class imbalance.

We have a lot of features and columns in the dataset after all these pre-processing steps. As a result, we use the SelectKBest module to select the k best features, where k is set to 10 by default. Finally, with the help of make_pipeline from scikit-learn, we make an ML pipeline consisting of the MinMaxScaler and SelectKBest functions. Applying data transformations, such as scaling or vectorizing, is a simple process when all input variables are of the same type. However, it can become challenging when the dataset contains mixed types and we need to selectively apply data transformations to certain input features but not all.

V. RESULT AND DISCUSSION

The accuracy scores for each class were computed on the cross-validation dataset, which helps us choose the best of the eight pipelines; these scores are shown in Table 1. We found that the ML pipelines with TfidfVectorizer performed better than those with CountVectorizer, based on the classification metrics, accuracy scores, and the ROC-AUC scores shown in Table 2. Of the four pipelines, the one with the logistic regression classifier performed best overall. The accuracy scores of logistic regression on the testing data and the coefficient values of the model are shown in Table 3 and Fig. 5.

TABLE 1. ACCURACY SCORES OF EACH CLASSIFIER ON TRAINING DATA
Personality type         Logistic Regression   SVM    Multinomial Naive Bayes   Random Forest Classifier
Introvert vs Extrovert   0.67                  0.66   0.63                      0.62
Intuition vs Sensing     0.68                  0.67   0.72                      0.61
Thinking vs Feeling      0.80                  0.80   0.76                      0.72
Judging vs Perceiving    0.64                  0.64   0.60                      0.56

TABLE 2. ROC-AUC SCORES OF EACH CLASSIFIER ON TRAINING DATA
Personality type         Logistic Regression   SVM    Multinomial Naive Bayes   Random Forest Classifier
Introvert vs Extrovert   0.73                  0.70   0.70                      0.67
Intuition vs Sensing     0.71                  0.71   0.72                      0.61
Thinking vs Feeling      0.89                  0.89   0.85                      0.80
Judging vs Perceiving    0.68                  0.68   0.67                      0.58

TABLE 3. ACCURACY SCORES OF LOGISTIC REGRESSION ON TESTING DATA
Personality type         Accuracy score
Introvert vs Extrovert   0.6820
Intuition vs Sensing     0.6831
Thinking vs Feeling      0.7811
Judging vs Perceiving    0.6359

Fig. 5. Words with the highest coefficient values in each personality class type.

VI. CONCLUSION AND DISCUSSION

The results of these studies suggest that machine learning models can accurately predict certain aspects of personality from MBTI data, such as extraversion, openness, and agreeableness. However, predicting other aspects of personality, such as neuroticism, may be more challenging. Additionally, the accuracy of these models may be affected by factors such as the size and quality of the dataset, the choice of features and algorithms, and individual differences in personality expression. For training and testing, the logistic regression classification model is chosen, and it is also later used for predicting the MBTI personality types in the web app. The class imbalance problem has been handled by converting the 16 classes into 4 binary classes and by using random under-sampling. Performance and accuracy scores could be improved further with deep learning models. The use of machine learning techniques for personality prediction using the MBTI dataset has the potential to provide valuable insights into individuals' personality traits, which can be useful for a variety of applications such as career counselling, team building, and personal development. Social media companies can use this model to attract users based on their personality traits and preferences. Companies can advertise their products through personalized ads based on users' personality traits and behavioural preferences. Further research is needed to improve the accuracy and generalizability of these models and to explore their potential for real-world applications.

REFERENCES
[1] Wu, C., Lu, H., & Zhu, Y. (2021). Predicting personality traits from digital footprints using machine learning.
[2] Choi, Y., Jo, J., & Choi, S. (2019). Deep learning-based personality prediction using social media data.
[3] Farnadi, G., Sitaraman, G., & Moens, M. (2016). Personality prediction using Facebook data: A comprehensive review.
[4] Saeb, S., Lonini, L., Jayaraman, A., Mohr, D. C., & Kording, K. P. (2016). Predicting personality from patterns of behaviour collected with smartphones.
[5] Tett, R. P., Jackson, D. N., & Rothstein, M. (1991). Personality and job performance: The importance of narrow traits.
[6] Golbeck, J., Robles, C., & Turner, K. (2011). Predicting personality from Twitter.
[7] Quercia, D., Kosinski, M., Stillwell, D., & Crowcroft, J. (2011). Personality prediction based on text mining of email messages.
[8] Park, G., & Schwartz, H. A. (2015). Predicting personality from text using dimensional models of language. Journal of Personality, 83(3), 243-256.
[9] You, Q., Jin, H., & Luo, J. (2015). Predicting personality traits from Instagram images using convolutional neural networks. Proceedings of the ACM International Conference on Multimedia, 131-140.
[10] Kosinski, M., Stillwell, D., & Graepel, T. (2013). Predicting the Big 5 personality traits using Facebook status updates. Psychological Science, 24(4), 1-8.
[11] Dhingra, B., & Cohen, W. W. (2016). Deep learning for personality trait extraction from short texts. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 1572-1583.
[12] Golbeck, J., Robles, C., & Turner, K. (2011). Predicting personality traits from Twitter usage. Proceedings of the International Conference on Weblogs and Social Media, 1-10.
[13] Bharathi Mohan, G., & Prasanna Kumar, R. (2021). A comprehensive survey on topic modeling in text summarization. 5th International Conference on Micro-Electronics and Telecommunication Engineering, Springer book series "Lecture Notes in Networks and Systems".
[14] Bharathi Mohan, G., & Prasanna Kumar, R. (2023). Survey of Text Document Summarization Based on Ensemble Topic Vector Clustering Model. In: Joby, P.P., Balas, V.E., Palanisamy, R. (eds) IoT Based Control Networks and Intelligent Systems. Lecture Notes in Networks and Systems, vol 528. Springer, Singapore. https://doi.org/10.1007/978-981-19-5845-8_60
[15] C. Spandana, I. V. Srisurya, S. Aasha Nandhini, R. P. Kumar, G. Bharathi Mohan, & P. Srinivasan (2023). An Efficient Genetic Algorithm based Auto ML Approach for Classification and Regression. 2023 International Conference on Intelligent Data Communication Technologies and Internet of Things (IDCIoT), Bengaluru, India, 371-376. doi: 10.1109/IDCIoT56793.2023.10053442
[16] Tsehay Admassu Assegie, Prasanna Kumar Rangarajan, Napa Komal Kumar, & Dhamodaran Vigneswari (2022). An empirical study on machine learning algorithms for heart disease prediction. International Journal of Artificial Intelligence (IJ-AI), 11(3), 1066-1073. https://doi.org/10.11591/ijai.v11.i3.pp1066-1073