International Conference on Recent Trends in Data Science and its Applications
DOI: rp-9788770040723.030

Cardio Vascular Disease Prediction Using Multiple Machine Learning Algorithms

Venkata Sai Ashrith Kona, Department of Data Science and Business Systems, SRM Institute of Science and Technology, Chennai, kv6209@srmist.edu.in
Maithili Saran Reddy Lingala, Department of Data Science and Business Systems, SRM Institute of Science and Technology, Chennai, ls2204@srmist.edu.in
Hruday Vuppala, Department of Data Science and Business Systems, SRM Institute of Science and Technology, Chennai, hv6819@srmist.edu.in
Sravya Adapa, Department of Data Science and Business Systems, SRM Institute of Science and Technology, Chennai, na8385@srmist.edu.in
Rajasekar P, Department of Data Science and Business Systems, SRM Institute of Science and Technology, Chennai, rajasekp@srmist.edu.in

Abstract: Cardiovascular disease is one of the most serious health issues of the present day, and analysing cardiovascular disease survivors has become a massive challenge. Machine learning is a component of artificial intelligence that is used to address several issues in data science. A very frequent application of machine learning is predicting results from past data: the machine identifies patterns in previous data and applies these patterns to new data to predict the outcome. The health business generates enormous amounts of raw data, which data mining transforms into meaningful information that might aid in making decisions. Decision Tree (DT), Adaptive Boosting classifier (AdaBoost), Logistic Regression (LR), Random Forest (RF), Gradient Boosting classifier (GBM), and K-Nearest Neighbor (KNN) are the classification methods used in this study.

Keywords: Cardiovascular disease, Machine learning, Random Forest, Decision Tree, Adaptive boosting classifier, Gradient Boosting classifier, KNN

I. INTRODUCTION
The biggest cause of death worldwide, as reported by the WHO, is heart disease. According to estimates, cardiac conditions account for 24% of deaths from noncommunicable diseases in India. Heart disease is the cause of one-third of all fatalities worldwide, and it is to blame for 50 percent of mortality in the United States and other industrialised nations. Every year, around 17 million (1 crore 70 lakh) people worldwide die from cardiovascular disease (CVD).

CVD can be difficult to identify because of several contributing factors, including high blood pressure, high cholesterol, diabetes, irregular pulse rate, and several other illnesses. The symptoms of CVD may also vary with a person's gender. For instance, male patients are most likely to have chest pain, whereas a female patient may also suffer nausea, severe tiredness, and shortness of breath in addition to chest pain. Researchers have investigated a variety of ways to predict cardiac diseases, but prediction at an early stage has not been particularly successful for a variety of reasons, including complexity, execution time, and method accuracy. Consequently, effective diagnosis and treatment can save a lot of lives. Between healthcare services, medications, and lost productivity as a result of death, heart disease cost roughly $219 billion annually in 2014 and 2015. Heart failure, which can result in death, can also be avoided with early detection.

Although angiography is thought to be the most exact and accurate procedure for predicting coronary artery disease (CAD), it is quite expensive, making it less accessible to families with limited financial resources. Heart disease is very complicated, and errors in physical examination can even lead to the death of some patients, so the utmost care is required. Machine learning based expert systems can help to diagnose cardiovascular disease (CVD) effectively.
Data mining plays a major role in many fields, such as engineering, business, and education, where it is used to extract data and find interesting patterns in it. Examining data to find hidden information that will be useful for taking important decisions in the future is a process called "data mining". A wide range of machine learning techniques have been used to decrease the error between actual and predicted results and to capture the complex, non-linear interplay between several factors. Because the amount of medical data is always growing, medical experts must employ ML and AI algorithms to analyse data and draw exact and detailed diagnostic judgments. Different classification algorithms are used in data mining of medical data to predict patients' CVD and deaths from heart attack.

II. LITERATURE SURVEY
[1] Melillo et al. proposed a system that automatically distinguishes high-risk patients from low-risk individuals. Classification and regression tree (CART) performed best in their investigation (93.3% sensitivity, 63.5% specificity). Only 12 low-risk and 34 high-risk patients were examined. To determine whether their suggested method is beneficial, a larger dataset must be examined.

[2] Guidi et al. examined a clinical decision support system (CDSS) for heart failure analysis. The model provided outputs such as heart failure (HF) sensitivity. They conducted their study using various machine learning classifiers and compared the results. Of all the classifiers, random forest and CART performed best, with 87.6% accuracy.

[3] Parthiban and Srivatsa conducted an extensive study to detect heart disease in patients who have diabetes. They used many predictive features, such as blood pressure, blood sugar, and age. The dataset is imbalanced, and the authors did not employ any strategy to address this issue.
They were able to achieve an accuracy of 94.60% using a support vector machine (SVM) classifier.

[4] Al Rahhal et al. used a novel approach based on a deep neural network (DNN). They applied an unsupervised learning technique, stacked denoising autoencoders (SDAEs), to raw ECG data to learn the highest-level features. They allowed expert engagement throughout each training cycle, which can introduce bias.

[5] Muthukaruppan and Er proposed a fuzzy expert system for the identification of CVD that is based on Particle Swarm Optimization (PSO). Rules were extracted from a decision tree and converted into fuzzy rules. Their accuracy using the fuzzy expert system was 93.27%. Only a few rules were extracted from the small dataset used in their investigation.

[6,7] Alizadehsani et al. utilised an ensemble learning strategy. For their study they used a dataset of 303 cases acquired from the "Rajaie Cardiovascular Medical and Research Centre". For CVD prediction, the authors employed a basic C4.5 ensemble learning approach. Left circumflex (LCX), left anterior descending (LAD), and right coronary artery (RCA) stenosis were identified with accuracies of 68.96%, 61.46%, and 79.54%, respectively. Using an SVM model, a new team of researchers improved the results and reached "80.50% accuracy for RCA, 86.14% accuracy for LAD, and 83.17% accuracy for LCX".

[8] Tama et al. presented the idea of a two-tier ensemble paradigm, where certain classifiers serve as the basis for another ensemble. The suggested stacking architecture is constructed using class labels from Random Forest (RF), Gradient Boosting Machine (GBM), and Extreme Gradient Boosting (XGBoost). Four different datasets are used to evaluate their suggested detection model. Moreover, they employed feature selection methods based on particle swarm optimization.
Their suggested model performed better in k-fold cross-validation with k = 10. The authors considered only the stacking of tree-based models. Additional statistical and regression-based techniques might be used to improve model results.

[9] Abdar et al. established the N2Genetic optimizer, a novel optimization approach. The patients were then identified as having CHD or not using nuSVM. On the Z-Alizadeh Sani dataset, the proposed detection approach achieved an accuracy of 93.08%, which compares favourably with earlier works. Raza proposed an ensemble architecture with majority voting. To forecast heart illness in a patient, it combined logistic regression, multilayer perceptron, and naive Bayes. A classification accuracy of 88.88% was attained, surpassing all of the base classifiers.

[10] Mohan et al. developed a hybrid approach that combines a linear model with a random forest to predict cardiac disease (HRFLM). On the Cleveland dataset, the suggested technique raised performance levels and had an accuracy rate of 88.7%.

[11] Soni and Vyas used WARM, and their degree of confidence was 79.5%. Their research assigned weights depending on age, smoking behaviour, BMI range, and hypertension. Soni et al., on the other hand, gave each attribute a weight depending on the advice they obtained from medical experts. Using a weighted associative classifier, they demonstrated an intelligent and effective cardiovascular attack prediction system, attaining a maximum confidence of 80%.

[12] Ganna, Magnusson, and team. Efforts to use machine learning algorithms to identify cardiovascular heart disease have had a substantial effect on this work. In this paper, a summary of the literature is presented. Effective prediction of cardiovascular disease has been achieved using a variety of methods, among them Logistic Regression, KNN, and Random Forest Classifier.
The findings indicate that every algorithm is capable of meeting the given objectives, with KNN displaying the greatest performance (88.52%).

III. PROPOSED SYSTEM
In this paper we propose multiple machine learning algorithms to detect whether or not a person has cardiovascular disease. Building, training, testing, and validating the architecture for a specified challenge is a complex process. Decision Tree, Adaptive Boosting classifier, Logistic Regression, Random Forest, Gradient Boosting classifier, and K-Nearest Neighbor are the classification methods used in this study. Google Colab was used to run the experiment. The data was collected from 1025 patients, including both healthy people and patients suffering from cardiovascular disease. The dataset contains a total of 14 attributes, and we use attributes such as age, sex, chol, and cp (chest pain) to predict whether a person is healthy or suffering from cardiovascular disease. The above-mentioned algorithms are well suited to predicting cardiovascular disease because they are all supervised learning algorithms.

Overview of architecture
Fig. 1 shows the overall architecture of cardiovascular disease prediction using multiple ML models. The main parts of this architecture are data collection, data preprocessing, and prediction using the given algorithms. Our technique uses a patient's data to predict whether or not the patient has heart disease. These predictions are made by the best of all the ML algorithms used, and the model is trained beforehand on a genuine dataset in order to make accurate predictions.

Fig. 1. Flowchart for Heart disease prediction

A. Related Work
Apurv Garg et al. [15] proposed a model that predicts the chance of getting heart disease using two machine learning algorithms, KNN and Random Forest. They compared the two models to obtain the best possible accuracy: KNN yielded a prediction accuracy of 86.88%, whereas RF yielded an accuracy of 81.96%. Archana Singh and Ramesh Kumar [16] proposed a prediction model with multiple machine learning algorithms (SVM, DT, LR, and KNN), of which KNN yielded the highest accuracy, 87%. They first collected the data, selected the required attributes, preprocessed the data, and then balanced it; they used the UCI repository dataset. In this study we used similar approaches and were able to obtain better accuracies for the models by using a different dataset with a larger number of data points. We achieved better accuracy with our proposed model, KNN.

B. Data Collection
We took our data, free of charge, from the Kaggle website [13]. The dataset is called heart.csv and contains 1025 patient records. Of these 1025 people, 499 are normal and 526 have heart disease. The dataset has 14 attributes. There are 713 male and 312 female patients, and of the people who have heart disease, 300 are male and 226 are female.

Fig. 2. Data values and the attributes

C. Data Preprocessing
In this step we use the one-hot encoding technique to transform categorical values into numerical values and drop unnecessary variables. We then separate the features, normalize the data using the min-max method, and split the data into training and testing sets in an 80:20 ratio. The data is then ready to be used in any model.

IV. MACHINE LEARNING ALGORITHMS
A. Logistic Regression (LR)
Logistic Regression is one of the strongest classifiers available among the supervised ML algorithms.
It is an extension of the general regression model that reflects the likelihood of the occurrence or non-occurrence of a certain instance. Logistic regression is used to describe data and the connection between a dependent variable and one or more independent variables. Nominal, ordinal, or interval types are all acceptable for the independent variables. LR determines the likelihood that a new observation will fall into a particular class; because the result is a probability, it lies between 0 and 1.

B. Decision Tree (DT)
Decision tree is one of the oldest ML algorithms. It is a supervised algorithm that can handle both classification and regression problems, although it is most often used for classification. It is a tree-shaped classifier in which the root node is the top node and the others are child nodes. A decision tree is made up of decision nodes and leaf nodes: internal (decision) nodes represent the features of the dataset, while leaf nodes hold the result. A decision node makes a decision and has several branches, whereas a leaf node cannot make any further decisions.

C. K-Nearest Neighbor (KNN)
KNN is among the oldest statistical learning techniques. In KNN, K represents the total number of nearest neighbors used, which is specified directly in the object constructor. Related cases are classified similarly, and a new instance is classified by comparing it to each of the existing examples. When a new sample is provided, the KNN method searches the pattern space for the k training samples closest to it. Two distinct methods are available for converting distance into a weight, so that the predictions from the test instance's neighbors can be weighted according to their distance.
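The neighbor search and distance-weighted voting described for KNN can be sketched as follows. This is a minimal illustration, not the implementation used in the study: the Euclidean metric, the inverse-distance weighting scheme, the function names, and the toy data are all assumptions made for the example.

```python
import math
from collections import defaultdict


def euclidean(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_X, train_y, query, k=3, weighted=False):
    """Classify `query` by a vote among its k nearest training samples.

    weighted=False: every neighbor counts equally.
    weighted=True:  each neighbor votes with weight 1 / distance
                    (an assumed weighting scheme, chosen for illustration).
    """
    # KNN builds no model: it ranks the stored samples at prediction time.
    ranked = sorted(
        (euclidean(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = defaultdict(float)
    for dist, label in ranked[:k]:
        # The small epsilon avoids division by zero for exact matches.
        votes[label] += 1.0 / (dist + 1e-9) if weighted else 1.0
    return max(votes, key=votes.get)


# Made-up data: two min-max normalized features per patient, 1 = disease.
train_X = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.25), (0.8, 0.9), (0.9, 0.8)]
train_y = [0, 0, 0, 1, 1]

print(knn_predict(train_X, train_y, (0.2, 0.2), k=3))
print(knn_predict(train_X, train_y, (0.85, 0.85), k=3, weighted=True))
```

Because every prediction scans the full training set, the cost grows with the number of stored samples, and the result depends on both k and the chosen distance metric.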
D. AdaBoost
AdaBoost, also known as Adaptive Boosting, is an ensemble method in machine learning. The most common base estimator used with AdaBoost is a decision tree with one level, that is, a decision tree with only one split. AdaBoost first creates a model in which all the data points are given the same weight. It then gives more weight to the points that were classified incorrectly. The following model accords greater importance to all of the points with higher weights. It continues training models until a sufficiently low error is obtained.

E. Random Forest
Random forest is an ML technique that uses many decision trees to make a decision. It is an ensemble learning technique. During the training stage it produces many trees, which together form a forest of decision trees. During the testing period, each tree in the forest predicts a class label for every instance. The model takes the class with the most votes as its prediction. Each individual tree makes its own class prediction, and a large number of relatively independent tree models working together gives the best result.

F. Gradient Boosting
Using boosting, weak learners may become strong learners. Each new tree created by boosting is fitted to a modified version of the initial dataset. It is anticipated that, when merged with the older models, the new model will produce predictions with lower error rates. The main idea is to set the target outcomes for the next model so as to reduce the error. Gradient boosting trains many models in a gradual, additive, and sequential fashion. The name "gradient boosting" arose because the target outcome for each case is determined by the gradient of the error with respect to the prediction. Each new model takes a step in the direction that reduces the prediction error.
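The additive, residual-fitting idea behind gradient boosting can be sketched in a few lines. This is a simplified illustration under stated assumptions rather than the classifier used in the study: it uses squared-error loss (for which the negative gradient is simply the residual), a single numeric feature, one-split regression stumps as weak learners, and hypothetical function names and toy data.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regressor (stump) that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not right:
            continue  # a split must leave samples on both sides
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left) + sum(
            (r - right_mean) ** 2 for r in right
        )
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # all xs identical: fall back to a constant
        mean = sum(residuals) / len(residuals)
        return lambda x: mean
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Start from the mean, then repeatedly add stumps fit to residuals."""
    base = sum(ys) / len(ys)
    stumps = []

    def predict(x):
        return base + learning_rate * sum(stump(x) for stump in stumps)

    for _ in range(n_rounds):
        # For squared loss, the negative gradient is the residual y - F(x).
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict


def mse(model, xs, ys):
    """Mean squared error of `model` on the given samples."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up data: one feature, target 1 for the larger values.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]

one_round = gradient_boost(xs, ys, n_rounds=1)
many_rounds = gradient_boost(xs, ys, n_rounds=20)
print(mse(one_round, xs, ys), mse(many_rounds, xs, ys))
```

Each round corrects part of what the current ensemble still gets wrong, so the training error shrinks as rounds are added; the learning rate controls how large each correction is.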
Proposed Algorithm
In this study the best of all the algorithms is KNN, which achieved an accuracy of 97%. It is considered one of the best supervised classification algorithms, and it is also simple. KNN is non-parametric and lazy, which means that it does not assume anything about the distribution of the underlying data and does not create a model from the training set. Instead, it memorises the full training dataset and uses it to make predictions when presented with fresh test cases. For many applications KNN is a straightforward and efficient method, although it can have large computing costs and be sensitive to the choice of K and to the distance metric used to compare instances. KNN is a flexible technique that may be used to solve a variety of problems, since it can be applied to both classification and regression tasks.

V. EVALUATION
There are several approaches to evaluating the performance of machine learning models, and combining several assessment measures is expected to support analytical research. Four fundamental measures (accuracy, precision, recall, and F-Score) are examined in this study to see how the machine learning algorithms differ from one another. The four measures can be computed from the confusion matrix, whose constituents are True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). In medical data, the most important quantity to track is the number of false negatives (FN). The performance metrics are given below.

$$\text{Accuracy} = \frac{\text{Number of correctly classified predictions}}{\text{Total predictions}} \tag{1}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{2}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{3}$$

$$\text{F-Score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}$$

The full set of features in the heart disease dataset was used in a comparative analysis of supervised machine learning classifiers. Some classifiers performed well on the evaluation measures, whereas others did not.
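The four measures in Eqs. (1) to (4) can be computed directly from the confusion matrix counts, as the short sketch below shows. The function names and the example labels are hypothetical and are not taken from the study's data; returning 0.0 when a denominator is zero is also an assumption made for the example.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for a binary classification task."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1  # a diseased patient predicted as healthy
            else:
                tn += 1
    return tp, tn, fp, fn


def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F-Score as in Eqs. (1) to (4)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }


# Made-up labels for ten patients: 1 = heart disease, 0 = healthy.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]

counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(*counts)
print(counts, metrics)
```

In this toy example one diseased patient is missed (FN = 1), which lowers recall; in a medical setting that is usually the error to watch most closely.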
In order to predict heart failure survival, this work employed tree-based, statistical-based, and regression-based models. The DT and RF ensemble models are tree-based, and AdaBoost and GBM are two tree-based boosting methods. KNN is a statistical-based model, whereas LR is a regression-based model.

Fig. 3. Different accuracy comparison

As the comparison shows, KNN has the best accuracy at 97.02%, followed by Random Forest with 90.16%, Gradient Boosting with 88.7%, LR with 82.43%, and AdaBoost with 81.46%. The Decision Tree algorithm has the lowest accuracy, at 79%.

TABLE 1. PRECISION, RECALL AND F-MEASURES
Algorithm            Precision  Recall  F-1
K Nearest Neighbor   0.97       0.97    0.97
Random forest        0.91       0.90    0.90
Gradient Boosting    0.89       0.89    0.89
Logistic Regression  0.84       0.82    0.82
Ada boost            0.82       0.81    0.81
Decision Tree        0.80       0.79    0.79

TABLE 2. VALUE OF AREA UNDER ROC
Algorithm            AUROC
K Nearest Neighbor   0.99
Random forest        0.96
Gradient Boosting    0.95
Logistic Regression  0.91
Ada boost            0.87
Decision Tree        0.86

In conclusion, a dataset on heart illness was gathered and preprocessed as needed, and analysis was then carried out to better understand the dataset. After applying six machine learning algorithms (AdaBoost, LR, Gradient Boosting, KNN, DT, and RF), we assessed the predictions using the F-1 measure, ROC curve, recall, accuracy, and precision. We found that all of the algorithms used performed well, with KNN demonstrating the greatest performance at 97% accuracy, which shows that it is the most effective of these algorithms at predicting cardiac disease.

Fig. 4. Correlation matrix of variables

Fig. 5. ROC and confusion matrix of KNN, the best algorithm

VI. CONCLUSION
Processing raw heart health data with machine learning algorithms can help save the lives of heart patients.
By identifying risk factors for heart failure, preventive steps can be taken to lower mortality rates. In this study, a machine learning based technique for predicting the survival of heart patients is suggested. The following machine learning methods are used: LR, AdaBoost, RF, GBM, DT, and KNN. KNN achieved the highest accuracy of all the algorithms, 97%, with a precision of 0.97, recall of 0.97, F-1 of 0.97, and AUROC of 0.99. The work done here has the potential to advance the medical field and help doctors forecast how long a patient with heart failure will live. Additionally, it will help doctors to concentrate on key risk factors if a heart failure patient survives. In the future, the research can employ a range of machine learning model combinations to gain from their combined advantages. Better feature selection methods may be created to improve the efficiency of ML models. Because these feature selection problems are NP-hard, meta-heuristics can be used.

REFERENCES
[1] P. Melillo, N. De Luca, M. Bracale and L. Pecchia, "Classification tree for risk assessment in patients suffering from congestive heart failure via long-term heart rate variability", IEEE J. Biomed. Health Informat., vol. 17, no. 3, pp. 727-733, May 2013.
[2] G. Guidi, M. C. Pettenati, P. Melillo and E. Iadanza, "A machine learning system to improve heart failure patient assistance", IEEE J. Biomed. Health Informat., vol. 18, no. 6, pp. 1750-1756, Nov. 2014.
[3] G. Parthiban and S. K. Srivatsa, "Applying machine learning methods in diagnosing heart disease for diabetic patients", Int. J. Appl. Inf. Syst., vol. 3, no. 7, pp. 25-30, Aug. 2012.
[4] M. M. A. Rahhal, Y. Bazi, H. AlHichri, N. Alajlan, F. Melgani and R. R. Yager, "Deep learning approach for active classification of electrocardiogram signals", Inf. Sci., vol. 345, pp. 340-354, Jun. 2016.
[5] S. Muthukaruppan and M. J. Er, "A hybrid particle swarm optimization based fuzzy expert system for the diagnosis of coronary artery disease", Expert Syst. Appl., vol. 39, no. 14, pp. 11657-11665, Oct. 2012.
[6] Z. Sani, R. Alizadehsani, J. Habibi, H. Mashayekhi, R. Boghrati, A. Ghandeharioun, et al., "Diagnosing coronary artery disease via data mining algorithms by considering laboratory and echocardiography features", Res. Cardiovascular Med., vol. 2, no. 3, pp. 133, 2013.
[7] R. Alizadehsani, M. H. Zangooei, M. J. Hosseini, J. Habibi, A. Khosravi, M. Roshanzamir, et al., "Coronary artery disease detection using computational intelligence methods", Knowl.-Based Syst., vol. 109, pp. 187-197, Oct. 2016.
[8] M. Rajesh and R. Sitharthan, "Introduction to the special section on cyber-physical system for autonomous process control in industry 5.0", Computers and Electrical Engineering, vol. 104, 108481, 2022.
[9] M. Abdar, W. Książek, U. R. Acharya, R. S. Tan, V. Makarenkov and P. Pławiak, "A new machine learning technique for an accurate diagnosis of coronary artery disease", Computer Methods and Programs in Biomedicine, vol. 179, article 104992, 2019.
[10] S. Mohan, C. Thirumalai and G. Srivastava, "Effective heart disease prediction using hybrid machine learning techniques", IEEE Access, vol. 7, pp. 81542–81554, 2019.
[11] R. Sitharthan, S. Vimal, A. Verma, M. Karthikeyan, S. S. Dhanabalan, N. Prabaharan, ... and T. Eswaran, "Smart microgrid with the internet of things for adequate energy management and analysis", Computers and Electrical Engineering, vol. 106, 108556, 2023.
[12] A. Ganna, P. K. Magnusson, N. L. Pedersen, U. de Faire, M. Reilly, J. Ärnlöv and E. Ingelsson, "Multilocus genetic risk scores for coronary heart disease prediction", Arteriosclerosis, Thrombosis, and Vascular Biology, vol. 33, no. 9, pp. 2267-7, 2013.
[13] https://www.kaggle.com/datasets/johnsmith88/heart-disease-datas
[14] Syed Nawaz Pasha, Dadi Ramesh, Sallauddin Mohmmad, A. Harshavardhan and Shabana, "Cardiovascular disease prediction using deep learning techniques", IOP Conference Series: Materials Science and Engineering, vol. 981, International Conference on Recent Advancements in Engineering and Management (ICRAEM-2020), 9-10 October 2020, Warangal, India. Citation: Syed Nawaz Pasha et al 2020 IOP Conf. Ser.: Mater. Sci. Eng. 981 022006.
[15] Apurv Garg, Bhartendu Sharma and Rizwan Khan, "Heart disease prediction using machine learning techniques", IOP Conference Series: Materials Science and Engineering, vol. 1022, 012046, 2021. doi:10.1088/1757-899X/1022/1/012046
[16] A. Singh and R. Kumar, "Heart Disease Prediction Using Machine Learning Algorithms", 2020 International Conference on Electrical and Electronics Engineering (ICE3), 2020. doi:10.1109/ice348803.2020.9122958