Final Project ML
Classification: Heart Disease Prediction
IBM Machine Learning Professional Certificate
Course 03: Supervised Machine Learning: Classification
By Shubham Bohra
Contents
• Dataset description
• Main objectives of the analysis
• Applying various classification models
• Machine learning analysis and findings
• Model flaws and advanced steps
Supervised Machine Learning: Classification 2
Data Description Section
Introduction
Predicting and diagnosing heart disease is one of the major challenges in medicine because the diagnosis depends on
several factors, including a physical examination and the symptoms and signs the patient presents. Heart disease
is considered one of the deadliest diseases in the world: a failing heart cannot pump the amount of blood that the
other organs need to perform their regular functions. Factors that affect heart disease include, but are not
limited to, cholesterol levels, smoking habits, obesity, family history of disease, blood pressure, and work
environment. Today, machine learning (ML) algorithms play an essential role in heart disease prediction. Rapid
advances in technology allow machine learning to be combined with Big Data tools to manage the exponentially
growing volume of unstructured data, which includes medical data for patients around the world. Heart disease can
be predicted from attributes such as age, sex, and heart rate, and an early prediction can in turn help reduce the
death rate among heart patients. In this report we use machine learning algorithms and Python to build such a
predictor.
Dataset Description 01
• age: The person’s age in years
• sex: The person’s sex (1 = male, 0 = female)
• cp: chest pain type:
Value 0: asymptomatic
Value 1: atypical angina
Value 2: non-anginal pain
Value 3: typical angina
• trestbps: The person’s resting blood pressure (mm Hg on admission to the hospital).
• chol: The person’s cholesterol measurement in mg/dl.
• fbs: The person’s fasting blood sugar (> 120 mg/dl, 1 = true; 0 = false).
Dataset Description 02
• restecg: resting electrocardiographic results
Value 0: showing probable or definite left ventricular hypertrophy by Estes’ criteria
Value 1: normal
Value 2: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV).
• thalach: The person’s maximum heart rate achieved.
• exang: Exercise induced angina (1 = yes; 0 = no)
• oldpeak: ST depression induced by exercise relative to rest (‘ST’ relates to positions on the ECG plot).
• slope: the slope of the peak exercise ST segment
0: upsloping; 1: flat; 2: downsloping
• ca: The number of major vessels (0–3)
• thal: A blood disorder called thalassemia
Value 0: NULL (dropped from the dataset previously)
Value 1: fixed defect (no blood flow in some part of the heart)
Value 2: normal blood flow
Value 3: reversible defect (a blood flow is observed but it is not normal)
• target: Heart disease (0 = no, 1 = yes)
Dataset Description 03
Checking for Null values
Great, there are no missing values in our features!
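A null check of this kind can be sketched with pandas as shown here. The small DataFrame is made-up stand-in data, not the real heart dataset, and the column subset is chosen only for illustration.

```python
import pandas as pd

# Hypothetical stand-in rows; the real dataset has 303 rows and 14 columns.
df = pd.DataFrame({
    "age": [63, 37, 41, 56],
    "sex": [1, 1, 0, 1],
    "chol": [233, 250, 204, 236],
    "target": [1, 1, 1, 0],
})

# Count missing values per column; all zeros means there is nothing to impute.
null_counts = df.isnull().sum()
print(null_counts)

# Single flag summarising the whole table.
has_missing = bool(null_counts.any())
print("Any missing values:", has_missing)
```

If any count were non-zero, the affected rows would need to be dropped or imputed before modelling.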
Data Analysis Section
Main Objective of the analysis:
In this section I show the correlation between the features to find the
features that most influence our target, Target (heart disease existence).
After that I build different classification models using techniques such as
grid search, ML pipelines, and hyperparameter tuning to get the best
predictive model in terms of accuracy, and I discuss the flaws of each model.
Data Analysis 01
- Identifying categorical features and continuous features:
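One common way to separate the two groups is to count the distinct values in each column. The sketch below assumes that rule of thumb; the threshold of 10 and the sample rows are illustrative choices, not values taken from the report.

```python
import pandas as pd

# Hypothetical stand-in rows using column names from the dataset description.
df = pd.DataFrame({
    "age": [63, 37, 41, 56, 57, 44, 52, 58, 54, 48, 49, 64],
    "sex": [1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1],
    "cp": [3, 2, 1, 1, 0, 1, 2, 2, 2, 0, 1, 3],
    "chol": [233, 250, 204, 236, 354, 263, 199, 283, 239, 275, 266, 211],
})

# Assumed rule of thumb: a column with few distinct values is categorical.
MAX_CATEGORIES = 10  # illustrative threshold, not from the report

categorical = [c for c in df.columns if df[c].nunique() <= MAX_CATEGORIES]
continuous = [c for c in df.columns if df[c].nunique() > MAX_CATEGORIES]

print("categorical:", categorical)
print("continuous:", continuous)
```

On the full dataset the threshold should be checked against the feature descriptions, since a numeric column with few observed values could be misclassified.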
Data Analysis 02
Viewing the status of people in the dataset:
We have 165 people with heart disease and 138 healthy people, so the
target variable we want to predict is roughly balanced.
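The class balance can be checked with a value count. This sketch rebuilds only the target column from the two counts reported here, so the series itself is a reconstruction rather than the original data.

```python
import pandas as pd

# Rebuild the target column from the reported counts (1 = disease, 0 = healthy).
target = pd.Series([1] * 165 + [0] * 138, name="target")

# Absolute counts per class.
counts = target.value_counts()

# Share of each class, rounded to three decimals.
shares = target.value_counts(normalize=True).round(3)

print(counts)
print(shares)
```

The shares come out at about 54.5% and 45.5%, which supports treating accuracy as a usable headline metric here.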
Data Analysis 03
Study of the relationship of categorical
features and heart disease:
cp (chest pain type): People with cp values of 1, 2, or 3 tend to have
heart disease more often than people with cp = 0 (asymptomatic).
restecg (resting ECG results): People with a value of 1 (coded as
normal in the dataset description) are more likely to have heart
disease in this dataset than people with the other values.
exang (exercise-induced angina): People without exercise-induced
angina (value 0) are more likely to have heart disease than people
with exercise-induced angina (value 1).
Data Analysis 04
slope (slope of the peak exercise ST segment): People with a
slope value of 2 (downsloping, a sign of an unhealthy heart) are
more likely to have heart disease than people with a value of 0
(upsloping) or 1 (flat: minimal change, typical of a healthy heart).
ca (number of major vessels, 0-3): More blood flow is better for
the heart, so people with ca equal to 0 are more likely to have
heart disease.
thal (a blood disorder called thalassemia): People with a thal
value of 2 are more likely to have heart disease.
Data Analysis 05
Study of the relationship of continuous
features and heart disease:
trestbps: A resting blood pressure above 130-140 mm Hg is a cause
for concern.
chol: A cholesterol level above 200 mg/dl is a strong warning sign,
as shown in the graphic for this feature.
thalach: People with a maximum heart rate above 140 are more likely
to have heart disease.
Data Analysis 06
- Studying the correlations between features using a heat map
The goal of this matrix is to show the relationships between
features, which is useful for feature engineering. What matters
most in this analysis is the relationship between the target
variable (whether a person has heart disease or not) and the
rest of the features, so our focus is on the last row of the matrix.
1. fbs and chol are the features least correlated with the
target variable.
2. All other features show a noticeably stronger correlation
with the target variable.
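The target row of the correlation matrix can also be read off numerically. The sketch below uses synthetic stand-in data in which a "thalach" column is deliberately built to track the target and a "chol" column is pure noise, so the numbers it prints are not the report's results.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 303  # same number of rows as the heart dataset

# Synthetic stand-in data: "thalach" depends on the target, "chol" does not.
target = rng.integers(0, 2, size=n)
df = pd.DataFrame({
    "thalach": 140 + 15 * target + rng.normal(0, 10, size=n),
    "chol": rng.normal(245, 50, size=n),
    "target": target,
})

# Pearson correlation matrix: these are the values a heat map would colour.
corr = df.corr()

# Keep only the target column, drop its self-correlation, sort by strength.
target_corr = corr["target"].drop("target").sort_values(key=abs, ascending=False)
print(target_corr)
```

To draw the heat map itself, `corr` could be passed to a plotting call such as `seaborn.heatmap(corr, annot=True)`.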
Feature Engineering 01
Converting categorical features into numerical features:
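The slide does not say which encoding was used. One common option is one-hot encoding with `pd.get_dummies`, sketched below on made-up rows; the choice of encoder and of the columns to encode are assumptions.

```python
import pandas as pd

# Hypothetical stand-in rows; cp and thal are multi-valued categorical codes.
df = pd.DataFrame({
    "age": [63, 37, 41, 56],
    "cp": [3, 2, 1, 0],
    "thal": [1, 2, 2, 3],
    "target": [1, 1, 1, 0],
})

# Assumed choice: one-hot encode the multi-valued categorical columns.
categorical_cols = ["cp", "thal"]
encoded = pd.get_dummies(df, columns=categorical_cols, dtype=int)

print(encoded.columns.tolist())
```

Each original column is replaced by one indicator column per observed value, for example `cp_0` to `cp_3`, so that models do not read the codes as an ordered scale.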
Machine Learning
Analysis & Findings
Machine Learning Analysis & Findings
The following analysis compares four classification models, Logistic Regression,
KNN, SVM, and XGBoost, in terms of predicting heart disease.
I use the following techniques to help develop robust models:
standard scaling, cross-validation, grid search, and evaluation metrics such as
accuracy, precision, recall, and F1 score.
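The listed techniques fit together naturally in a scikit-learn pipeline. The following is a minimal sketch under stated assumptions: the data is a synthetic stand-in generated with `make_classification`, and the parameter grid is illustrative, not the grid used in the report.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the heart data (303 rows, 13 features).
X, y = make_classification(
    n_samples=303, n_features=13, n_informative=6, random_state=42
)

# Scaling lives inside the pipeline, so it is re-fit on each training fold
# and never sees the validation fold.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Illustrative grid over the regularisation strength.
param_grid = {"clf__C": [0.01, 0.1, 1, 10]}

# 5-fold cross-validated grid search, scored by accuracy.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

best_C = search.best_params_["clf__C"]
best_cv_accuracy = search.best_score_
print(best_C, round(best_cv_accuracy, 3))
```

Putting the scaler in the pipeline rather than scaling the whole table up front avoids leaking validation-fold statistics into training.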
Supervised Machine Learning: Classification 19
Machine Learning Analysis 01
Data Splitting:
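A split of this kind might look like the sketch below. The data is synthetic, and the 30% test share and the stratification are assumptions; a 30% share would be consistent with the support of 91 reported with the model results, since 30% of 303 rows rounds up to 91.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart data (303 rows, 13 features).
X, y = make_classification(n_samples=303, n_features=13, random_state=42)

# Assumed settings: 30% held out for testing, class proportions preserved.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
```

Fixing `random_state` makes the split reproducible, which matters when comparing several models on the same test set.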
Supervised Machine Learning: Classification 20
Machine Learning Analysis 02
Logistic Regression Model:
Supervised Machine Learning: Classification 21
Machine Learning Analysis 03
Logistic Regression Model with penalty = L1:
Supervised Machine Learning: Classification 22
Machine Learning Analysis 04
Logistic Regression Model with penalty = L2:
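The L1 and L2 variants can be compared side by side as in this sketch. It runs on synthetic stand-in data, and the solver, the value of `C`, and the split settings are assumptions, so the scores it prints are not the report's results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the heart data (303 rows, 13 features).
X, y = make_classification(
    n_samples=303, n_features=13, n_informative=6, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

scores = {}
for penalty in ("l1", "l2"):
    # liblinear supports both penalties for binary problems.
    model = make_pipeline(
        StandardScaler(),
        LogisticRegression(penalty=penalty, solver="liblinear", C=1.0),
    )
    model.fit(X_train, y_train)
    scores[penalty] = accuracy_score(y_test, model.predict(X_test))

print(scores)
```

L1 tends to drive some coefficients to exactly zero, which acts as feature selection, while L2 shrinks all coefficients without removing any.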
Supervised Machine Learning: Classification 23
Analysis & Findings
Logistic Regression Models Findings:
The best model in terms of prediction performance is Logistic
Regression with penalty = L2
Accuracy : 80%
Precision : 80%
Recall : 80%
F1-score : 80%
Support : 91 samples
Supervised Machine Learning: Classification 24
Machine Learning Analysis 05
KNN Algorithm
Accuracy : 84%
Precision : 85%
Recall : 84%
F1-score : 83%
Support : 91 samples
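A KNN model tuned over the number of neighbours could be sketched as follows. The data is a synthetic stand-in and the grid of odd k values is an assumption, so the scores will differ from those listed here.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the heart data (303 rows, 13 features).
X, y = make_classification(
    n_samples=303, n_features=13, n_informative=6, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# KNN is distance based, so features are scaled before the classifier.
pipe = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier())])

# Illustrative grid: odd k values from 1 to 21.
k_values = list(range(1, 22, 2))
search = GridSearchCV(pipe, {"knn__n_neighbors": k_values}, cv=5)
search.fit(X_train, y_train)

best_k = search.best_params_["knn__n_neighbors"]
report = classification_report(y_test, search.predict(X_test), output_dict=True)
print(best_k, report["weighted avg"])
```

In the report dictionary, `support` is the number of test samples behind each row, which is why it is a count rather than a percentage.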
Supervised Machine Learning: Classification 25
Machine Learning Analysis 06
Support Vector Machine Model:
Accuracy : 80%
Precision : 80%
Recall : 80%
F1-score : 80%
Support : 91 samples
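A support vector classifier can be tuned in the same way. In this sketch the data is synthetic and the kernel and `C` grid are illustrative assumptions, not the settings behind the scores listed here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the heart data (303 rows, 13 features).
X, y = make_classification(
    n_samples=303, n_features=13, n_informative=6, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# SVMs are sensitive to feature scale, so scaling is part of the pipeline.
pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])

# Illustrative grid over kernel type and regularisation strength.
param_grid = {"svc__kernel": ["linear", "rbf"], "svc__C": [0.1, 1, 10]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# Accuracy of the refit best model on the held-out test set.
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, round(test_accuracy, 3))
```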
Supervised Machine Learning: Classification 26
Machine Learning Analysis 07
XGBoost Algorithm
Supervised Machine Learning: Classification 27
Machine Learning Analysis 08
XGBoost Algorithm
Accuracy : 80%
Precision : 82%
Recall : 80%
F1-score : 80%
Support : 91 samples
Supervised Machine Learning: Classification 28
Machine Learning Analysis 09
XGBoost Algorithm
Accuracy : 80%
Precision : 82%
Recall : 80%
F1-score : 80%
Support : 91%
Supervised Machine Learning: Classification 29
Machine Learning Analysis 10
Models Comparison
As shown in the previous analysis, all the models provide very good prediction results, and these
results are very close to each other. In the end we must choose one model for our dataset,
and we choose the one with the highest scores.
The models, ordered from best to worst:
1. KNN
2. XGBoost
3. Logistic Regression with L2
4. Support Vector Machine
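The ordering can be reproduced from the metrics reported on the preceding slides. The sketch below sorts by accuracy and then by precision; using precision as the tie-breaker is an assumption, since the report does not state how ties were resolved.

```python
# Metrics as reported on the preceding slides (percentages on the test set).
results = {
    "Logistic Regression (L2)": {"accuracy": 80, "precision": 80, "recall": 80, "f1": 80},
    "KNN": {"accuracy": 84, "precision": 85, "recall": 84, "f1": 83},
    "SVM": {"accuracy": 80, "precision": 80, "recall": 80, "f1": 80},
    "XGBoost": {"accuracy": 80, "precision": 82, "recall": 80, "f1": 80},
}

# Sort by accuracy, then precision (assumed tie-breaker), highest first.
# sorted() is stable, so models tied on both keys keep their listed order.
ranking = sorted(
    results,
    key=lambda name: (results[name]["accuracy"], results[name]["precision"]),
    reverse=True,
)

for position, name in enumerate(ranking, start=1):
    print(position, name, results[name])
```

Logistic Regression and SVM are tied on every reported metric, so their relative order reflects only the listing order, not a measured difference.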
Models flaws and strengths
and advanced steps
Machine Learning Analysis 11
Model flaws, strengths, and further suggestions:
In terms of simplicity, Logistic Regression provided high predictive results while being the
simplest and fastest model in terms of parameters and training. KNN provided the best results,
but it is slower at prediction time because classifying a single point requires calculating its
distance to the points in the training set.
XGBoost also performed very well, but in contrast to KNN its training took longer, since we used
grid search to find the best-fitting parameters. In the end it is a tradeoff: with a bigger
dataset such models can be expected to perform better, but the training process will take longer.
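The prediction cost of KNN can be made concrete with a brute-force version written in NumPy. This is an illustrative sketch on made-up points, not the implementation used in the report; library implementations such as scikit-learn's can also use tree-based indexes instead of scanning every point.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify one point by majority vote among its k nearest neighbours."""
    # One distance per training point: this is the per-prediction cost.
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Majority vote over the neighbours' labels.
    votes = np.bincount(y_train[nearest])
    return int(np.argmax(votes))

# Made-up training points: class 0 near the origin, class 1 near (1, 1).
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_train = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_train, y_train, np.array([0.05, 0.1]))
pred_b = knn_predict(X_train, y_train, np.array([1.0, 0.95]))
print(pred_a, pred_b)
```

There is no training step beyond storing the data, so the work is deferred to prediction time and grows with the size of the training set.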
Thank you
IBM Machine Learning Professional Certificate
Supervised Machine Learning: Classification