
Final Project ML

Classification: Heart Disease Prediction

IBM Machine Learning Professional Certificate


Course 03: Supervised Machine Learning: Classification

By Shubham Bohra
Contents
• Dataset Description
• Main objectives of the analysis
• Applying various classification models
• Machine learning analysis and findings
• Model flaws and advanced steps

Supervised Machine Learning: Classification 2


Data Description Section



Introduction

Predicting and diagnosing heart disease is one of the major challenges in the medical industry, because it depends on
several factors, including physical examination and the various signs and symptoms present in the patient. Heart disease
is considered one of the deadliest diseases in the world: the heart becomes unable to pump the amount of blood that
the other organs need to perform their regular functions. Factors affecting heart disease include, but are not
limited to, cholesterol levels, smoking habits, obesity, family history of disease, blood pressure, and work
environment. Today, ML algorithms play an essential role in heart disease prediction. Rapid advances in technology
allow machine learning to integrate with Big Data tools to manage exponentially growing unstructured data, including
medical data for patients around the world. Heart disease can be predicted from attributes such as age, sex, and
heart rate, which in turn can help reduce the death rate among heart patients. In this report we use machine
learning algorithms and Python to do that.



Dataset Description 01

• age: The person’s age in years

• sex: The person’s sex (1 = male, 0 = female)

• cp: Chest pain type:
  Value 0: asymptomatic
  Value 1: atypical angina
  Value 2: non-anginal pain
  Value 3: typical angina

• trestbps: The person’s resting blood pressure (mm Hg on admission to the hospital)

• chol: The person’s cholesterol measurement in mg/dl

• fbs: The person’s fasting blood sugar (> 120 mg/dl; 1 = true, 0 = false)
Dataset Description 02
• restecg: Resting electrocardiographic results:
  Value 0: showing probable or definite left ventricular hypertrophy by Estes’ criteria
  Value 1: normal
  Value 2: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)

• thalach: The person’s maximum heart rate achieved

• exang: Exercise-induced angina (1 = yes; 0 = no)

• oldpeak: ST depression induced by exercise relative to rest (‘ST’ relates to positions on the ECG plot)

• slope: The slope of the peak exercise ST segment (0: upsloping; 1: flat; 2: downsloping)

• ca: The number of major vessels (0–3)

• thal: A blood disorder called thalassemia:
  Value 0: NULL (dropped from the dataset previously)
  Value 1: fixed defect (no blood flow in some part of the heart)
  Value 2: normal blood flow
  Value 3: reversible defect (a blood flow is observed but it is not normal)

• target: Heart disease (0 = no, 1 = yes)

Dataset Description 03
Checking for Null values

There are no missing values in any of our features.
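
A null check of this kind can be sketched with pandas as follows. The small table is a made-up stand-in for the heart-disease data, not the real file, and the variable names are illustrative.

```python
import pandas as pd

# Toy stand-in for the heart-disease table (values are made up for illustration).
df = pd.DataFrame({
    "age": [63, 37, 41, 56],
    "chol": [233, 250, 204, 236],
    "target": [1, 1, 1, 0],
})

# Count missing values per column; all zeros means nothing needs imputing.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Single yes/no answer for the whole table.
has_missing = bool(missing_per_column.any())
```

On the real dataset the same two lines would be applied to the loaded DataFrame.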



Data Analysis Section



Main Objective of the analysis:
In this section I show the correlation between the features to find the
features that most influence our target (heart disease existence).

After that I build different classification models using techniques such as
grid search, ML pipelines, and hyperparameter tuning to get the best predictive
model in terms of accuracy, and I discuss the flaws of each model.



Data Analysis 01
- Identifying categorical features and continuous features:
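
One common way to separate the two groups is by the number of distinct values per column. The sketch below assumes that approach; the threshold `MAX_CATEGORIES` and the toy values are illustrative choices, not taken from the original notebook.

```python
import pandas as pd

# Toy rows using a few of the column names from the dataset description.
df = pd.DataFrame({
    "age": [63, 37, 41, 56, 57, 44, 52, 61, 48, 59, 50, 66],
    "sex": [1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1],
    "cp": [3, 2, 1, 1, 0, 0, 2, 0, 3, 1, 2, 0],
    "chol": [233, 250, 204, 236, 354, 192, 294, 263, 199, 168, 239, 275],
    "target": [1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0],
})

# Assumed rule: a column with at most this many distinct values is categorical.
MAX_CATEGORIES = 10

feature_cols = [c for c in df.columns if c != "target"]
categorical = [c for c in feature_cols if df[c].nunique() <= MAX_CATEGORIES]
continuous = [c for c in feature_cols if df[c].nunique() > MAX_CATEGORIES]

print("categorical:", categorical)
print("continuous:", continuous)
```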



Data Analysis 02
Viewing the status of people in the data set:

We have 165 people with heart disease and 138 healthy people, so the
target variable we want to predict is roughly balanced.
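
The balance can be checked from the class counts. The sketch below rebuilds a target column from the two counts stated above (165 and 138) rather than loading the real file.

```python
import pandas as pd

# Target column rebuilt from the counts reported in this slide.
target = pd.Series([1] * 165 + [0] * 138, name="target")

counts = target.value_counts()                         # absolute counts per class
share = target.value_counts(normalize=True).round(3)   # class proportions

print(counts)
print(share)
```

A split close to 55% / 45% means plain accuracy is still a reasonable headline metric.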



Data Analysis 03
Study of the relationship between the categorical
features and heart disease:

cp (chest pain): people with chest pain of type cp: [1, 2, 3] tend to have
heart disease more often than people without any chest pain (cp: 0).

restecg (resting ECG results): people with a value of 1 (having an abnormal
heart rhythm, which can range from mild symptoms to severe problems) are more
likely to have heart disease.

exang (exercise-induced angina): people without exercise-induced angina
(value 0) are more likely to have heart disease than those with
exercise-induced angina (value 1).



Data Analysis 04
slope (slope of the ST segment at peak exercise): people with a downsloping
ST segment (value 2) show signs of an unhealthy heart and are therefore more
likely to have heart disease than people with an upsloping segment (value 0)
or a flat segment (value 1: minimal change, typical of a healthy heart).

ca (number of major blood vessels, 0-3): the more blood flow, the better for
the heart; in this data, people with ca equal to 0 are more likely to have
heart disease.

thal (a blood disorder called thalassemia): people with a thal value of 2 are
more likely to have heart disease.
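
Comparisons like the ones above can be read from a row-normalised cross-tabulation of a categorical feature against the target. This is a sketch on made-up rows; the original slides used plots, and the numbers here are not the real rates.

```python
import pandas as pd

# Made-up rows: chest pain type and heart-disease label.
df = pd.DataFrame({
    "cp":     [0, 0, 0, 0, 1, 1, 2, 2, 2, 3],
    "target": [0, 0, 0, 1, 1, 1, 1, 1, 0, 1],
})

# Each row sums to 1: the share of healthy (0) and diseased (1) people per cp value.
rates = pd.crosstab(df["cp"], df["target"], normalize="index")

# Column 1 is the fraction of people with heart disease for each cp value.
disease_rate = rates[1]
print(disease_rate)
```

The same call works for restecg, exang, slope, ca, and thal by swapping the column name.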



Data Analysis 05
Study of the relationship between the continuous
features and heart disease:

trestbps: resting blood pressure above 130-140 mm Hg is a cause for concern.

chol: cholesterol above 200 mg/dL is a very dangerous indicator, as shown in
the graphic above.

thalach: people with a maximum heart rate above 140 are more likely to have
heart disease.



Data Analysis 06
- Studying the correlations between features using a heat map

The goal of this matrix is to show the relationships between features, which
is useful for feature engineering. What matters most in this analysis,
however, is the relationship between the target variable (whether a person
has heart disease or not) and the rest of the features, so our focus is on
the last row of the matrix.

 1. fbs and chol are the features least related to the target variable.

 2. All other features show a noticeably stronger correlation with the
    target variable.
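
The "last row" of the correlation matrix can also be pulled out and ranked directly. The sketch below uses synthetic data in which one feature is built to depend on the target and one is not; the column names are borrowed from the dataset, but the values are generated, not real.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 300
target = rng.integers(0, 2, size=n)

# Synthetic features: "thalach" shifts with the target, "fbs" is pure noise.
df = pd.DataFrame({
    "thalach": 140 + 15 * target + rng.normal(0, 10, n),
    "fbs": rng.integers(0, 2, size=n),
    "target": target,
})

# Pearson correlation of every feature with the target (the target row of the matrix).
corr_with_target = df.corr()["target"].drop("target")

# Rank by absolute strength, strongest first.
ranked = corr_with_target.abs().sort_values(ascending=False)
print(ranked)
```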

Feature Engineering 01
Converting Categorical features into Numerical features :
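
A typical way to do this conversion is one-hot encoding. The sketch below assumes `pandas.get_dummies` on the multi-valued categorical columns; the slides do not show which encoder was actually used, and the rows are made up.

```python
import pandas as pd

# Toy rows with two multi-valued categorical columns from the dataset description.
df = pd.DataFrame({
    "age": [63, 37, 41],
    "cp": [3, 2, 1],
    "thal": [1, 2, 2],
    "target": [1, 1, 0],
})

# Replace each listed column with one 0/1 indicator column per observed category.
encoded = pd.get_dummies(df, columns=["cp", "thal"], dtype=int)
print(encoded.columns.tolist())
```

Binary columns such as sex, fbs, and exang are already 0/1 and need no encoding.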

Machine Learning
Analysis & Findings



Machine Learning Analysis & Findings

The following analysis compares four classification models, Logistic
Regression, KNN, SVM, and XGBoost, in terms of predicting heart disease.
I use the following techniques to help develop robust models:

Standard scaling, cross-validation, grid search, and evaluation metrics such
as accuracy, precision, and F1 score.



Machine Learning Analysis 01
Data Splitting:
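
A split of this kind can be sketched with scikit-learn. The 30% test size is an assumption: it is consistent with the 303 people in the data set and the support value of 91 reported with the results, but the slides do not state it. The features are random placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Placeholder features; the labels use the class counts reported earlier (165 / 138).
X = rng.normal(size=(303, 5))
y = np.array([1] * 165 + [0] * 138)

# Assumed 70/30 split; stratify keeps the class ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

print(len(X_train), len(X_test))
```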



Machine Learning Analysis 02
Logistic Regression Model:
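
A baseline of this kind might look like the sketch below: standard scaling followed by logistic regression in one pipeline. The data comes from `make_classification` as a stand-in for the heart-disease table, so the score it prints is not the result reported in this deck.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 303 rows and 13 features, like the shape of the real table.
X, y = make_classification(n_samples=303, n_features=13, n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Scaling lives inside the pipeline, so it is fitted on the training data only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

predictions = model.predict(X_test)
test_accuracy = model.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```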



Machine Learning Analysis 03
Logistic Regression Model with penalty = L1:



Machine Learning Analysis 04
Logistic Regression Model with penalty = L2:
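
The L1 and L2 variants can be compared in a single grid search over the penalty type and the regularisation strength C. This is a sketch on synthetic data; the grid values and the `liblinear` solver are assumptions chosen because that solver supports both penalties, not settings taken from the slides.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the heart-disease features and labels.
X, y = make_classification(n_samples=303, n_features=13, n_informative=6, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression(solver="liblinear"))

# Step names created by make_pipeline are the lowercased class names.
param_grid = {
    "logisticregression__penalty": ["l1", "l2"],
    "logisticregression__C": [0.01, 0.1, 1, 10],
}

# 5-fold cross-validated accuracy for every penalty / C combination.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

best_penalty = search.best_params_["logisticregression__penalty"]
print(search.best_params_, round(search.best_score_, 3))
```

L1 tends to zero out weak coefficients, while L2 shrinks all of them; which one wins depends on the data.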



Analysis & Findings
Logistic Regression Models Findings:

The best model in terms of prediction performance is Logistic
Regression with penalty = L2:

• Accuracy: 80%
• Precision: 80%
• Recall: 80%
• F1-score: 80%
• Support: 91 (test samples)
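
For reference, the metrics listed above can be computed from true and predicted labels as in the sketch below. The ten labels are made up so the arithmetic is easy to follow; note that support is a count of samples per class, not a percentage.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Made-up labels: 5 positives and 5 negatives, with one mistake in each class.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)

# Per-class values, ordered as labels=[0, 1]; support is the number of true samples per class.
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=[0, 1], average=None
)

print(accuracy, precision, recall, f1, support)
```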



Machine Learning Analysis 05
KNN Algorithm

• Accuracy: 84%
• Precision: 85%
• Recall: 84%
• F1-score: 83%
• Support: 91 (test samples)



Machine Learning Analysis 06
Support Vector Machine Model:

• Accuracy: 80%
• Precision: 80%
• Recall: 80%
• F1-score: 80%
• Support: 91 (test samples)
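
An SVM can be tuned the same way. The kernels and C values in this sketch are assumptions for illustration, and the data is synthetic rather than the heart-disease table.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=303, n_features=13, n_informative=6, random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])

# Assumed grid: two kernels and three regularisation strengths.
param_grid = {"svc__kernel": ["linear", "rbf"], "svc__C": [0.1, 1, 10]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

best_kernel = search.best_params_["svc__kernel"]
print(search.best_params_, round(search.best_score_, 3))
```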



Machine Learning Analysis 07
XGBoost Algorithm



Machine Learning Analysis 08
XGBoost Algorithm

• Accuracy: 80%
• Precision: 82%
• Recall: 80%
• F1-score: 80%
• Support: 91 (test samples)



Machine Learning Analysis 09
XGBoost Algorithm

• Accuracy: 80%
• Precision: 82%
• Recall: 80%
• F1-score: 80%
• Support: 91 (test samples)



Machine Learning Analysis 10
Models Comparison

As shown in the previous analysis, all the models give good prediction results that are very close
to each other. In the end we must choose one model for our dataset, and the choice depends on the
highest score.
Below, the models are listed in descending order of performance:
1. KNN
2. XGBoost
3. Logistic Regression with L2
4. Support Vector Machine
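
A ranking like this can be produced by scoring every model with the same cross-validation folds. The sketch below covers three of the four models on synthetic data; XGBoost is left out only to keep the example free of extra dependencies, and the default hyperparameters are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=303, n_features=13, n_informative=6, random_state=0)

models = {
    "Logistic Regression (L2)": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
}

# Mean 5-fold accuracy for each model, with scaling refitted inside every fold.
scores = {
    name: cross_val_score(make_pipeline(StandardScaler(), model), X, y, cv=5).mean()
    for name, model in models.items()
}

# Model names ordered from best to worst mean accuracy.
ranking = sorted(scores, key=scores.get, reverse=True)
for name in ranking:
    print(f"{name}: {scores[name]:.3f}")
```

When scores are as close as in this deck, the order can change with a different split or seed.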



Model flaws and strengths
and advanced steps



Machine Learning Analysis 11
Model flaws, strengths, and further suggestions:

In terms of simplicity, Logistic Regression gave high predictive results while being the simplest
and fastest model in terms of parameters and training. KNN gave the best results, but it is slower
at prediction time because it has to calculate the distance from each new point to all the points
in the training set in order to classify it.

XGBoost also performed very well, but in contrast to KNN it takes longer to train, since we used
grid search to find the best-fitting parameters. In the end it is a tradeoff: with a bigger
dataset, such models will likely perform better, but the training process will take longer.
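
The prediction cost of KNN described above is easy to see in a brute-force version. This is a minimal NumPy sketch of the idea, not the scikit-learn implementation; the function name and the toy points are made up.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify one point by majority vote among its k nearest training points."""
    # One distance per training row: this is the per-prediction cost.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(distances)[:k]
    votes = np.bincount(y_train[nearest])
    return int(np.argmax(votes))

# Two small, well-separated clusters.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_train = np.array([0, 0, 0, 1, 1, 1])

pred_near_origin = knn_predict(X_train, y_train, np.array([0.05, 0.05]))
pred_near_one = knn_predict(X_train, y_train, np.array([1.0, 0.95]))
print(pred_near_origin, pred_near_one)
```

Every call scans the whole training set, so prediction time grows with the number of stored rows, while logistic regression only evaluates a fixed set of coefficients.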



Thank you
IBM Machine Learning Professional Certificate
Supervised Machine Learning: Classification
