Heart Disease Prediction Using Machine Learning
Abstract—Cardiovascular disease refers to any critical condition that impacts the heart. Because heart diseases can be life-threatening, researchers are focusing on designing smart systems to accurately diagnose them based on electronic health data, with the aid of machine learning algorithms. This work presents several machine learning approaches for predicting heart disease, using data on major health factors from patients. The paper demonstrates four classification methods: Multilayer Perceptron (MLP), Support Vector Machine (SVM), Random Forest (RF), and Naïve Bayes (NB), to build the prediction models. Data preprocessing and feature selection steps were performed before building the models. The models were evaluated based on accuracy, precision, recall, and F1-score. The SVM model performed best, with 91.67% accuracy.

Keywords—heart disease prediction, machine learning, support vector machine, multilayer perceptron, naïve bayes, random forest.

I. INTRODUCTION

Cardiovascular Disease (CVD), commonly referred to as heart disease, encompasses a wide range of conditions that affect the heart, with the two most common being ischemic heart disease and stroke. The World Health Organization lists the most significant behavioural risk factors for CVD as an unhealthy diet, a sedentary lifestyle, tobacco use, and excessive consumption of alcohol. Prolonged exposure to these risk factors can present itself through the initial signs of CVD, which include elevated blood pressure, elevated blood glucose, raised blood lipids, and obesity. Warning signs listed by the American Heart Association include one or more of the following: shortness of breath, persistent coughing or wheezing, swelling of the ankles and feet, constant fatigue, lack of appetite, and impaired thinking [1]. Moreover, Coronavirus may cause heart disease [2]–[4]. Efficient early diagnosis can substantially reduce the risk and global burden of CVD by initiating treatment rapidly to prevent further health deterioration. Thus, there is an urgent need to develop machine learning models that can predict the probability of developing CVD from the risk factors present.

Recently, machine learning models have been applied successfully in diverse cases in the medical field [5]. They have been effective in analyzing, evaluating, and predicting different medical conditions [6]. In this paper, we propose a machine learning approach to predict the presence of cardiovascular disease in patients based on major health data.

This paper is organized as follows: Section II covers related works where machine learning was used for heart disease prediction. Section III explains the methodology: the dataset is described, preprocessed, and split; the applied algorithms and the corresponding model design parameters are given; and the evaluation metrics selected to assess model performance are described. Section IV discusses the experimental results. Finally, Section V presents the remarks and conclusions of this work.

II. RELATED WORK

Heart disease prediction has been addressed in the literature using several methods. In [7], Naïve Bayes, SVM, and Functional Trees were used to predict the possibility of heart disease with an accuracy of 84.5%, using measurements from wearable mobile technologies with the same inputs used in our work. Furthermore, Naïve Bayes alone was used in [8], with a slightly better accuracy of 86.4%, using the same dataset.

Another work [9] used several algorithms (Logistic Regression, KNN, NN, SVM, NB, Decision Tree, and RF) with three feature selection algorithms (Relief, mRMR, and LASSO) to predict the existence of heart disease with the same dataset used in this work. The Logistic Regression algorithm had the best performance and yielded predictions with an accuracy as high as 89%.

Moreover, a work done in 2020 [10] applied four algorithms, with a very high accuracy of 90.8% for the KNN model and a minimum accuracy of 80.3% for the other models.

In [11], a hybrid Random Forest and Naïve Bayes model achieved an accuracy of 84.16% using 10 features, which were selected using the Recursive Feature Elimination and Gain Ratio algorithms.

In a recent work done in 2021 [12], Logistic Regression, Random Forest, and KNN were used for the prediction. The maximum accuracy was 87.5%.

All of the above is very promising for the future of heart disease and heart failure prediction, especially with the current advances in portable electronic measurement devices.

III. METHODOLOGY

A. Data Collection

The dataset was collected from Kaggle [13]. It contains a total of 303 instances with 13 attributes, as described in Table I.
Authorized licensed use limited to: SRM Institute of Science and Technology. Downloaded on July 04,2023 at 11:16:17 UTC from IEEE Xplore. Restrictions apply.
TABLE I. HEART DISEASE DATASET DESCRIPTION

Attribute | Description                                        | Type | Values                                            | Notes
Exang     | Exercise induced angina                            | Bi   | 0: None; 1: Produced                              | -
Oldpeak   | ST depression induced by exercise relative to rest | Num  | 0-6.2                                             | Right-skewed; the majority of the population is between 0 and 0.5
Slope     | The slope of the peak exercise ST segment          | Nom  | 0: Upsloping; 1: Flat; 2: Down-sloping            | -
Ca        | Number of major vessels                            | Nom  | 0/1/2/3/4                                         | -
Thal      | Defect type                                        | Nom  | 1: Fixed defect; 2: Normal; 3: Reversable defect  | There is one outlier of category 0
Target    | Diagnosis of heart disease                         | Bi   | 0: No disease; 1: Disease                         | -

Num: Numerical, Bi: Binary, Nom: Nominal.
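A minimal pandas sketch of the first checks applied to this dataset (instance count, missing values, and balance of the target column). The column name "target" follows Table I; loading from the Kaggle CSV file is an assumption and is shown only as a comment.

```python
# Sketch: basic dataset checks described in Section III-B.
# The helper works on any DataFrame with a "target" column (Table I naming).
import pandas as pd

def basic_checks(df: pd.DataFrame) -> dict:
    """Return instance count, total missing values, and target class counts."""
    return {
        "n_instances": len(df),
        "n_missing": int(df.isna().sum().sum()),
        "class_counts": df["target"].value_counts().to_dict(),
    }

# Hypothetical usage, assuming the Kaggle file is saved locally:
# df = pd.read_csv("heart.csv")
# print(basic_checks(df))   # the paper reports 303 instances and no missing values
```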
B. Data Preprocessing

The performance of a machine learning model is greatly determined by the quality of the data used to build it, which makes data preprocessing very important. Data preprocessing includes cleaning the data by removing corrupted or missing data points and outliers, in addition to transforming the data, resampling it, and applying feature selection.

1) Data Visualization and Cleaning

First, we checked for missing values and none were found. Second, we checked for outliers and found some, as reported in Table II.

TABLE II. LIST OF OUTLIERS

Attribute | Outlier values
Age       | None
Chol      | 417, 564, 394, 407, 409
Trestbps  | 172, 178, 180, 180, 200, 174, 192, 178, 180
Thalach   | 71
Oldpeak   | 4.2, 6.2, 5.6, 4.2, 4.4

Because the mild outliers contribute to the final diagnosis, only the extreme outliers were removed. The extreme outliers were detected using (1) and (2), where IQR is the interquartile range, a measure of the dispersion of the data, and Q1, Q3 are the lower and upper quartiles respectively.

Q3 + 3 × IQR (1)

Q1 − 3 × IQR (2)

The data points greater than expression (1) were removed. Similarly, the data points less than expression (2) were removed. As a result, 2 of the 303 instances were removed.

Then, the correlation coefficient matrix was obtained to observe the relation between the different attributes and the output. Fig. 1 illustrates the correlation matrix, where the coefficient indicates both the strength of the relationship between the variables and its direction (positive or negative correlation).
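The outlier-removal rule in (1) and (2) can be sketched as follows. This is a numpy illustration of the described procedure, not the authors' RStudio code; in the paper the fences are applied per attribute.

```python
# Sketch of extreme-outlier removal using the fences in (1) and (2):
# keep only points inside [Q1 - 3*IQR, Q3 + 3*IQR].
import numpy as np

def remove_extreme_outliers(values):
    """Drop values above Q3 + 3*IQR or below Q1 - 3*IQR."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])  # lower and upper quartiles
    iqr = q3 - q1                             # interquartile range
    upper = q3 + 3 * iqr                      # expression (1)
    lower = q1 - 3 * iqr                      # expression (2)
    return values[(values >= lower) & (values <= upper)]
```

Applied to each attribute of the dataset, this rule is what removed 2 of the 303 instances.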
Fig. 1. Correlation coefficient matrix

2) Checking for Imbalances

Imbalance in the output can distort the prediction accuracy. Therefore, the balance of the output "target" was verified as shown in Fig. 2. After inspection, the data turned out to be balanced, with a 9:11 ratio between the two categories. Thus, there was no need to resample the data.

3) Data Transformation

Transformation is applied when the dataset includes data of different formats, or when different datasets are combined. In this case, the nominal features were transformed into factors so that they could be used in RStudio.

4) Dimensionality Reduction

In machine learning, dimensionality reduction refers to the process of reducing the number of features to decrease the complexity and prevent overfitting, by either feature selection or feature extraction.

Feature selection is done by selecting a subset of features from the original set, using methods such as CFS (Correlation-based Feature Selection), the Chi-squared test, and ridge regression. In this paper, the feature selection method used was CfsSubsetEval, which evaluates the worth of a subset of attributes by considering both the individual predictive ability of each feature and the degree of redundancy between them. Weka software was used for feature selection, as it offers several attribute evaluators to test and use.

Slightly different from feature selection, feature extraction generates a new set of features from the original set. Principal Component Analysis (PCA) is widely used; it calculates the projection of the original data into a smaller-dimensional space.

5) Data Splitting

In machine learning, the data is usually split into training and testing sets, where the training set is used to train the model, and the testing set is used to test it and predict the output. Hold-out was used in this work, with 80% of the data used for training and 20% for testing.

C. Applied Algorithms

1) Naïve Bayes (NB)

Naïve Bayes is a supervised learning algorithm based on Bayes' theorem, which assumes that all features are independent and contribute equally to the target class. Bayes' theorem calculates the posterior probability of an event A given an event B, as in (3):

P(A|B) = P(B|A)P(A) / P(B) (3)

2) Random Forest (RF)

Random Forest is an ensemble of decision trees whose individual predictions are combined by majority voting, as illustrated in Fig. 3.

Fig. 3. Random Forest prediction method [14]

When training the model in RStudio, the following parameters were fixed: 500 decision trees and 3 variables tried at each split, in classification mode.

3) Neural Networks (NN)

Neural Networks are universal approximators that can map any relation between the inputs and outputs of a system, regardless of its complexity, as shown in Fig. 4. Their working principle mimics the human brain: during training, they assign a weight (w) to each input to indicate its significance to the output.
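The paper itself built and trained the models in RStudio and Weka; as a rough scikit-learn re-sketch of the same setup — the 80/20 hold-out split, an RF with 500 trees and 3 variables per split, an MLP with 5 sigmoid hidden nodes, a linear-kernel SVM, and Naïve Bayes — the configuration could look like the following. The Gaussian NB variant and the `random_state` values are assumptions.

```python
# Rough scikit-learn re-sketch of the models described above.
# Hyperparameters follow the text where stated; everything else
# (Gaussian NB, random_state, max_iter) is an assumption.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

def build_models():
    return {
        "RF": RandomForestClassifier(n_estimators=500, max_features=3,
                                     random_state=0),        # 500 trees, 3 vars/split
        "MLP": MLPClassifier(hidden_layer_sizes=(5,), activation="logistic",
                             max_iter=2000, random_state=0), # 5 sigmoid hidden nodes
        "SVM": SVC(kernel="linear"),                         # linear kernel
        "NB": GaussianNB(),
    }

def hold_out(X, y, test_size=0.2):
    """80/20 hold-out split as used in the paper (stratification assumed)."""
    return train_test_split(X, y, test_size=test_size,
                            random_state=0, stratify=y)
```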
Fig. 4. Neural Network Diagram [15]

Each node is called a neuron and has its own activation function. The number of neurons, network layers, and activation functions all depend on the application and influence the performance of the model. In this work, the MLP network was chosen to have 5 hidden nodes with a sigmoid activation function.

4) Support Vector Machine (SVM)

SVM is a supervised learning method used for classification, regression, and outlier detection. It seeks to establish a decision boundary between different classes, to label predictions using one or more feature vectors, as shown in Fig. 5. This decision boundary, known as the hyperplane, is oriented to be as far away as possible from the nearest data points. Those nearest points are referred to as support vectors.

D. Evaluation Metrics

A True Positive (TP) is the number of correctly classified instances of the positive class (in this case, the number of correctly diagnosed heart diseases). Similarly, a True Negative (TN) is the number of correctly classified instances of the negative class (in this case, the count of correctly predicted absences of heart disease).

Accuracy: the percentage of the total number of predictions that were classified correctly, obtained from the confusion matrix as:

A = (TP + TN) / (TP + TN + FP + FN)

Precision: the percentage of the predicted positive cases that were classified correctly, obtained from the confusion matrix as:

PR = TP / (TP + FP)

Sensitivity or Recall: the percentage of the actual positive cases that were classified correctly, obtained from the confusion matrix as:

RE = TP / (TP + FN)

F1 Score: when the target is to balance precision and recall, the F measure is the best choice, as it provides the harmonic mean of the recall and precision values in a classification problem; it is obtained from the confusion matrix as:

F1 = 2TP / (2TP + FP + FN)
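The four formulas above can be collected into a small helper; a plain-Python sketch of these standard definitions:

```python
# Sketch: compute the four evaluation metrics directly from
# confusion-matrix counts (TP, TN, FP, FN).
def metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)  # harmonic mean of precision and recall
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

For example, with a hypothetical confusion matrix of TP=8, TN=9, FP=1, FN=2, this yields an accuracy of 17/20 = 0.85 and an F1 of 16/19 ≈ 0.842.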
Fig. 6. Confusion matrices before removing the outliers

Table IV shows the results of each model after removing the extreme outliers. The metrics were calculated in RStudio based on the confusion matrices shown in Fig. 7.

TABLE IV. RESULTS OF THE PREDICTION MODELS AFTER REMOVING EXTREME OUTLIERS

Metric    | MLP    | SVM    | RF     | NB
Accuracy  | 81.67% | 88.33% | 86.67% | 86.67%
Precision | 92.31% | 95.65% | 95.45% | 91.67%
Recall    | 54.55% | 78.57% | 75.00% | 78.57%
F1 Score  | 68.67% | 86.27% | 83.99% | 84.61%

Fig. 8. Selected attributes based on WEKA software

According to the previous evaluation, the attributes Cp, Thalach, Exang, Oldpeak, Slope, Ca, and Thal were chosen as the most important inputs. Hence, Age, Sex, Fbs, Chol, Restecg, and Trestbps are the least important attributes. However, the decision was not to remove all of the least important attributes, but only the three least significant (Fbs, Restecg, Sex), which are indicated in RStudio as shown in Fig. 9.

After feature selection, the best model performed as follows:

Model | Accuracy | Precision | Recall | F1 Score
SVM   | 91.67%   | 92.31%    | 88.89% | 90.56%
V. CONCLUSION

In this work, heart disease prediction models were built after preprocessing the data through cleaning, transformation, and dimensionality reduction techniques, and finally splitting it using hold-out. The model was trained and tested for each machine learning algorithm. The SVM algorithm with a linear kernel had the best results, with 91.67% accuracy, 92.31% precision, 88.89% recall, and an F1 score of 90.56%. The algorithms used were able to extract the complex relations between the symptoms and the disease. Machine learning algorithms can also be applied to other types of diseases, especially with the generation of more accurate datasets in the medical field in the future.

This work can be enhanced by applying more extensive data analysis and trying additional algorithms to reach the maximum possible accuracy.
REFERENCES
[1] S. Rehman, E. Rehman, M. Ikram, and Z. Jianglin,
“Cardiovascular disease (CVD): assessment, prediction and policy
implications,” BMC Public Health, vol. 21, no. 1, p. 1299, 2021,
doi: 10.1186/s12889-021-11334-2.
[2] O. Atef, A. B. Nassif, M. A. Talib, and Q. Nassir, “Death/Recovery
Prediction for Covid-19 Patients using Machine Learning,” 2020.
[3] A. B. Nassif, I. Shahin, M. Bader, A. Hassan, and N. Werghi,
“COVID-19 Detection Systems Using Deep-Learning Algorithms
Based on Speech and Image Data,” Mathematics, 2022.
[4] H. Hijazi, M. Abu Talib, A. Hasasneh, A. Bou Nassif, N. Ahmed,
and Q. Nasir, “Wearable Devices, Smartphones, and Interpretable
Artificial Intelligence in Combating COVID-19,” Sensors, vol. 21,
no. 24, 2021, doi: 10.3390/s21248424.
[5] O. T. Ali, A. B. Nassif, and L. F. Capretz, “Business intelligence
solutions in healthcare a case study: Transforming OLTP system
to BI solution,” in 2013 3rd International Conference on
Communications and Information Technology, ICCIT 2013, 2013,
pp. 209–214, doi: 10.1109/ICCITechnology.2013.6579551.
[6] A. Nassif, O. Mahdi, Q. Nasir, M. Abu Talib, and M. Azzeh,
“Machine Learning Classifications of Coronary Artery Disease.”
Jan. 2018.
[7] A. F. Otoom, E. E. Abdallah, Y. Kilani, A. Kefaye, and M. Ashour,
“Effective diagnosis and monitoring of heart disease,” Int. J. Softw.
Eng. its Appl., vol. 9, no. 1, pp. 143–156, 2015, doi:
10.14257/IJSEIA.2015.9.1.12.
[8] K. Vembandasamyp, R. R. Sasipriyap, and E. Deepap, “Heart
Diseases Detection Using Naive Bayes Algorithm,” IJISET-
International J. Innov. Sci. Eng. Technol., vol. 2, no. 9, 2015,
Accessed: Dec. 11, 2021. [Online]. Available: www.ijiset.com.
[9] A. U. Haq, J. P. Li, M. H. Memon, S. Nazir, R. Sun, and I. Garciá-
Magarinõ, “A hybrid intelligent system framework for the
prediction of heart disease using machine learning algorithms,”
Mob. Inf. Syst., vol. 2018, 2018, doi: 10.1155/2018/3860146.
[10] D. Shah, S. Patel, and S. K. Bharti, “Heart Disease Prediction using Machine Learning Techniques,” SN Comput. Sci., vol. 1, p. 345, 2020, doi: 10.1007/s42979-020-00365-y.
[11] K. Pahwa and R. Kumar, “Prediction of heart disease using hybrid
technique for selecting features,” 2017 4th IEEE Uttar Pradesh
Sect. Int. Conf. Electr. Comput. Electron. UPCON 2017, vol.
2018-January, pp. 500–504, Jun. 2017, doi:
10.1109/UPCON.2017.8251100.
[12] H. Jindal, S. Agrawal, R. Khera, R. Jain, and P. Nagrath, “Heart
disease prediction using machine learning algorithms,” doi:
10.1088/1757-899X/1022/1/012072.
[13] “Heart Disease UCI | Kaggle.”
https://www.kaggle.com/ronitf/heart-disease-uci (accessed Jan.
10, 2022).
[14] D. Murphy, “Using Random Forest Machine Learning Methods to
Identify Spatiotemporal Patterns of Cheatgrass Invasion through
Landsat Land Cover Classification in the Great Basin from 1984 -
2011,” 2019.
[15] S. Liu, Z. Fang, and L. Zhang, “Research on Urban Short-term Traffic Flow Forecasting Model,” J. Phys. Conf. Ser., vol. 1237, no. 5, Jul. 2019, doi: 10.1088/1742-6596/1237/5/052026.
[16] “Support Vector Machines (SVM) | LearnOpenCV #.” https://learnopencv.com/support-vector-machines-svm/ (accessed Jan. 10, 2022).