Diabetes Prediction Report

This document discusses a project that aims to predict diabetes using machine learning models. It analyzes a dataset of family members, some with diabetes and some without, to train models. The models take in inputs like pregnancies, glucose, skin thickness, age, blood pressure, and BMI to predict whether a person has diabetes. A literature review discusses previous research applying machine learning algorithms like KNN, SVM, Naive Bayes and logistic regression to clinical datasets, achieving prediction accuracies between 73-98%. The document then outlines the methodology, which applies supervised algorithms KNN, SVM, Naive Bayes and logistic regression to the dataset.


DATA ANALYTICS (CS-055, CS-065)

Project report
Diabetes prediction using machine learning
Submitted by:
AREESHA ABASS (CS-055)
MAHAM (CS-065)

ABSTRACT
Diabetes mellitus is one of the major diseases of the modern era. Contributing factors include
carelessness about health, such as unhealthy eating, irregular sleep, and lack of exercise, as
well as increasing age, heredity, and high blood pressure. People with diabetes are at high risk
of complications such as heart disease, kidney disease, eye problems, and nerve damage. The
current hospital procedure is to collect the information required for a diabetes diagnosis
through various tests and then give suitable therapy depending on the diagnosis. Big data
analytics is extremely important in the healthcare industry, which holds vast datasets: it can be
used to examine large datasets, uncover hidden information and trends, extract knowledge from
the data, and anticipate outcomes. The current prevalence of type 2 diabetes mellitus in Pakistan
is 11.77%. In males, the prevalence is 11.20% and in females 9.19%. The mean prevalence in Sindh
province is 16.2% in males and 11.70% in females. Previous research used datasets of hospitalized
patients, whereas this report uses a dataset of our family members, some of whom have diabetes
and some of whom are free from the disease. Our diabetes prediction model, built with machine
learning in Python, takes a member's data as input and predicts whether that member has diabetes
mellitus or is free from the disease. The input data includes pregnancies, glucose, skin
thickness, age, blood pressure, and BMI; after these values are entered, the model predicts
whether or not the person is a diabetic patient.

INTRODUCTION
Diabetes is a major challenge for the healthcare community worldwide, and its impact is severe.
According to the World Health Organization (WHO), it was the seventh leading cause of death in
2016, and approximately 1.6 million people die of diabetes each year. On World Diabetes Day
2018, WHO joined partners from around the world to highlight the impact of diabetes. The issue
is getting worse every day; according to a WHO report, one in every three women is overweight.
Diabetes is also considered a key cause of heart attacks, kidney failure, stroke, and blindness.
The National Institute of Diabetes and Digestive and Kidney Diseases provided the data to which
we apply several machine learning algorithms, including KNN, Naïve Bayes, SVM, and LR. The main
goal of this project is to diagnose whether or not a patient has diabetes based on previous
screening measurements, so that the model predicts diabetic patients with high accuracy.

LITERATURE REVIEW
The research in [1], presented in September 2018, focuses on big data analytics, predictive
analytics, and machine learning in healthcare. It uses the PIMA Indian dataset from the UCI
Machine Learning Repository, which contains the records of 768 women with 9 features. The main
goal was to predict whether or not a person is a diabetic patient from previous measurements.
The machine learning algorithms SVM, KNN, LR, DT, RF, and NB are applied to the PIMA Indian
dataset to assess their accuracy in predicting diabetes.

In [2], the main goal is to decide whether or not a person is diabetic. To answer this question,
the authors use a convolutional neural network and conventional machine learning methods,
including Support Vector Machine and Random Forest, among others. These are applied to a
dataset of 768 instances with 8 attributes, including pregnancy count, plasma glucose
concentration, diastolic blood pressure, skin thickness, and serum insulin, and the reported
accuracy is around 73.94 percent.

The research in [3] works on big data, which plays an important role in uncovering hidden
values in a dataset. Its challenge is to predict diabetes with high accuracy; the accuracy is
compared against results on a previous dataset, and a pipeline model is used to obtain accurate
results. The algorithms considered include supervised, unsupervised, and semi-supervised
learning.

The research in [4] predicts diabetes with a deep neural network applied to the Pima Indian
Dataset (PID), reporting an accuracy of 97.11% and a sensitivity of 96.25%. Its steps are data
collection, data preparation, implementation of the deep network, and evaluation against the
chosen metrics; the study also reports an accuracy as high as 98.35%.

The main aim of [5] is a model that predicts diabetes with maximum accuracy. The authors apply
the machine learning algorithms decision tree, SVM, and Naive Bayes to the PIDD (Pima Indian
Diabetes Database), which is collected from UCI. The study reports the precision, F-measure,
and accuracy of the model; the accuracy is around 76.30 percent. These results are also verified
with Receiver Operating Characteristic (ROC) curves.

The authors of [6] focus on machine learning and data mining methods, namely support vector
machine, artificial neural network, decision tree, and Naive Bayes classifier, to predict
diabetes. Their steps include selecting the target dataset, pre-processing, feature
extraction, and splitting the data into training and testing sets. The results are evaluated
with metrics including precision, recall, and F1-score.

Machine learning offers many classifiers that help with real-world problems. The main challenge
addressed in [7] is the accurate and robust prediction of diabetes mellitus. In this work, the
models are built from different classifiers (K-Nearest Neighbor, Decision Trees, Random Forest,
AdaBoost, Naive Bayes, and XGBoost), each weighted by its area under the ROC curve (AUC), and
all are applied to the Pima Indian dataset. The resulting framework is reported with performance
values of 0.789, 0.934, and 0.092, and with an improvement of 2.00 percent in AUC.

Data plays an important role worldwide, and data mining is equally significant in this era. The
paper [8] considers both data and data mining techniques and reports results on diabetic
patients' data concerning diabetic complications, prediction, and patient background. This is
done with machine learning algorithms such as SVM (Support Vector Machine); supervised
approaches account for 85 percent of the available data and unsupervised approaches for 15
percent.

Machine learning algorithms are used on complex real-world problems. The main purpose of [9] is
to find the machine learning algorithm that best predicts diabetes from clinical data. The
algorithms, trained on a different dataset, are K-Nearest Neighbor (KNN), Random Forest (RF),
Gradient Boosting (GB), Logistic Regression (LR), and Support Vector Machine (SVM). The authors
also increase accuracy with pre-processing techniques such as label encoding and normalization,
and they compare their models with earlier ones, reporting accuracy differences of around 2.71
to 13.13 percent. Finally, they deploy the model in a smart web application developed in Python,
which can predict diabetes mellitus with higher accuracy.

Diabetes has become a very widespread disease. The research in [10] proposes a system that
determines with high accuracy whether a patient has type 1 or type 2 diabetes. The prediction
uses parameters including pregnancies, skin thickness, blood pressure, insulin, and BMI, and is
made with machine learning algorithms such as SVM, ANN, DT, and LR applied to the dataset to
improve the accuracy of diabetes prediction.

METHODOLOGY
FLOW CHART OF DIABETES PREDICTION METHODOLOGY

Data mining plays a significant role in the era of data: industries, companies, offices,
hospitals, and banking systems all rely on data, and various techniques, including machine
learning, are used to predict disease. This research therefore works on diabetes prediction
using several machine learning algorithms. Machine learning provides many tools and techniques
for transforming raw data into a useful dataset. There are many types of machine learning
algorithms, but in this report we focus only on supervised algorithms: K-Nearest Neighbors
(KNN), Support Vector Machine (SVM), Naive Bayes, and Logistic Regression (LR). These
algorithms are applied to the PIMA Indian dataset. The steps of the diabetes prediction
methodology are illustrated in the flow chart in Figure 1.

Figure 1: Diabetes Prediction Methodology Flow Chart

Previous models used three machine learning algorithms:

1. KNN (K-Nearest Neighbor)
2. Naïve Bayes
3. SVM (Support Vector Machine)

This report adds a fourth algorithm:

4. LR (Logistic Regression)

1. KNN (K-NEAREST NEIGHBOR)
• KNN is a supervised machine learning algorithm, which means the dataset used to build the
model is labeled. It can solve both classification and regression problems.
• K is the number of nearest neighbors consulted when predicting the class of an unknown data
point.
• KNN is a distance-based algorithm: it finds the classes of the nearest neighbors around the
unknown data point.

Take the example of a cats-and-dogs dataset in which 4 cats and 1 dog surround an unknown data
point. KNN calculates the distance from the unknown point to every labeled point and picks the
points with the shortest distances. In Figure 2 the value of K is 5, and because most of the 5
nearest neighbors are cats, the algorithm predicts that the unknown point is a cat.
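
The cats-and-dogs example can be sketched in a few lines of Python. This is only a minimal illustration, not the code used in this report: the 2-D coordinates are made up, and the function names `euclidean` and `knn_predict` are hypothetical.

```python
import math
from collections import Counter

def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_predict(train_points, train_labels, query, k=5):
    """Return the majority label among the k training points nearest to query."""
    # Sort the labelled points by their distance to the unknown point.
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: euclidean(pair[0], query),
    )
    # Vote among the k closest neighbours.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 2-D toy data: four cats and one dog near the unknown point,
# plus two dogs farther away.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.3), (1.1, 1.1), (1.4, 1.2),
          (5.0, 5.0), (5.5, 4.8)]
labels = ["cat", "cat", "cat", "cat", "dog", "dog", "dog"]

prediction = knn_predict(points, labels, query=(1.0, 1.1), k=5)
print(prediction)
```

With K = 5 the five closest points are four cats and one dog, so the vote goes to "cat"; a very small K makes the result depend on a single neighbor.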

Figure 2: This figure is used as an example of the KNN Algorithm

Mathematical Representation of KNN Algorithm:



For a given value of K, the KNN algorithm finds the K nearest neighbors of an unseen data point
and then assigns to that point the class that holds the most data points among its K neighbors.

For the distance metric, we use the Euclidean metric.

Finally, the input x is assigned to the class with the largest probability.
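
The two quantities described above are usually written as follows. These are the standard textbook forms, given here as an assumption about the notation: x and x' are two data points with n features, A is the set of the K nearest neighbors of x, and I(.) is the indicator function that equals 1 when its argument is true and 0 otherwise.

```latex
d(x, x') = \sqrt{\sum_{i=1}^{n} \left( x_i - x'_i \right)^2}

P(y = j \mid X = x) = \frac{1}{K} \sum_{i \in \mathcal{A}} I\left( y^{(i)} = j \right)
```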

2. NAÏVE BAYES ALGORITHM

• This algorithm is based on Bayes' theorem, together with the assumption that all the
attributes used to predict the target value are independent of each other.
• It calculates the probability of each class and then selects the class with the highest
probability.
• It works well for natural language processing (NLP) problems.

Bayes' theorem gives the probability of an event based on prior knowledge of conditions that may
be related to the event.
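
The idea of scoring each class and keeping the most probable one can be sketched as below. This is a minimal illustration under stated assumptions: it uses the Gaussian variant of Naïve Bayes (the report does not say which variant was used), the readings are made up, and the function names are hypothetical.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(rows, labels):
    """Estimate per-class priors and per-feature mean/variance."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        prior = len(members) / len(rows)
        stats = []
        for column in zip(*members):
            mean = sum(column) / len(column)
            # Small constant keeps the variance strictly positive.
            var = sum((v - mean) ** 2 for v in column) / len(column) + 1e-9
            stats.append((mean, var))
        model[label] = (prior, stats)
    return model

def predict_gaussian_nb(model, row):
    """Pick the class with the highest log posterior (up to a constant)."""
    best_label, best_score = None, -math.inf
    for label, (prior, stats) in model.items():
        score = math.log(prior)
        for value, (mean, var) in zip(row, stats):
            # Log of the Gaussian density; features are treated as independent.
            score += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical (glucose, BMI) readings; 1 = diabetic, 0 = not diabetic.
X = [(85, 22.0), (90, 24.5), (95, 23.1), (160, 33.0), (170, 35.2), (155, 31.8)]
y = [0, 0, 0, 1, 1, 1]
model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(model, (165, 34.0)))
print(predict_gaussian_nb(model, (88, 23.0)))
```

Working with log probabilities avoids numerical underflow when many per-feature probabilities are multiplied together.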

Mathematical Representation of Naïve Bayes Algorithm:
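
In standard notation (assumed here, with A and B two events, c a class, and x_1, ..., x_n the attribute values of one record), Bayes' theorem and the resulting Naïve Bayes decision rule can be written as:

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}

\hat{y} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
```

The product in the second line is where the independence assumption enters: each attribute contributes its own factor.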



3. SVM (SUPPORT VECTOR MACHINE)

• A support vector machine is also a supervised learning algorithm, used for both
classification and regression problems.
• The goal of the SVM is to find a separating boundary, called the hyperplane, that divides the
different classes. Two further lines run parallel to the hyperplane; the distance between
these two lines is the margin, and the points nearest to these lines are called support
vectors. The SVM is illustrated in Figure 3:

Figure 3: SVM Illustration

Mathematical Representation of SVM Algorithm:

The SVM margin is built on the distance of a point from a line: for a line ax + by + c = 0 and a
given point (x0, y0), this distance is denoted d.

The distance of a point from the hyperplane is defined in the same way, with the Euclidean norm
giving the length of the weight vector w.
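
The standard forms of these three quantities are given below. The notation for the hyperplane is an assumption: it is written as w · x + w_0 = 0, where w_0 is the offset term (often called b elsewhere, renamed here to avoid a clash with the line coefficient b).

```latex
d = \frac{\lvert a x_0 + b y_0 + c \rvert}{\sqrt{a^2 + b^2}}

d = \frac{\lvert w \cdot x + w_0 \rvert}{\lVert w \rVert}

\lVert w \rVert = \sqrt{w_1^2 + w_2^2 + \dots + w_n^2}
```

The first formula is the two-dimensional special case of the second, with w = (a, b) and w_0 = c.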

4. LR (LOGISTIC REGRESSION ALGORITHM)

• Logistic regression is also a supervised algorithm (it needs a labeled dataset) used to
estimate or predict a target value. The target, or dependent variable, is dichotomous, which
means there are only two possible classes.

Mathematical Representation of LR Algorithm:

• Mathematically, logistic regression models estimate P(Y=1) as a function of x. It is one of
the simplest machine learning algorithms and is commonly used for problems such as diabetes
prediction, spam detection, and cancer detection.
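
How P(Y=1) is obtained from x can be sketched as follows. This is a minimal illustration that assumes the usual logistic (sigmoid) form; the coefficients are made up and not fitted to any data, and the function names are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """P(Y=1 | x) = sigmoid(bias + sum_i weights[i] * features[i])."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def predict(features, weights, bias, threshold=0.5):
    """Return 1 (diabetic) when the estimated probability reaches the threshold."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0

# Hypothetical coefficients for (glucose, BMI); not fitted to any real data.
weights = [0.04, 0.09]
bias = -8.0

print(predict([160, 34.0], weights, bias))
print(predict([90, 22.0], weights, bias))
```

The threshold of 0.5 is a common default; lowering it trades more false positives for fewer missed diabetic cases.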

EXPLORATORY DATA ANALYSIS (EDA):


Suppose you want to build a data science project and need a dataset that is free from mistakes.
Is there a way to find out the details of the dataset first? Yes: Exploratory Data Analysis
(EDA) is a method for examining the details of the dataset that your project depends on. It is
also used:
• To conduct preliminary data analysis and discover variations
• To identify outliers
• To test hypotheses
• To validate assumptions using statistical results and visualization
TECHNIQUES OF EDA:
• UNIVARIATE NON-GRAPHICAL
This is the simplest EDA technique: only a single variable of the data is examined. Because
there is only one variable, the analyst does not deal with relationships, and the technique does
not show the complete picture of the data.
• UNIVARIATE GRAPHICAL
In univariate analysis, we pick one variable (column) from the dataset and examine the output
on that basis. Because the output classes often overlap for a single variable, we move from
univariate to bivariate analysis. In univariate graphical analysis, analysts also use graphs
such as stem-and-leaf plots, histograms, and box plots.
• BIVARIATE ANALYSIS
In this technique, analysts take two variables and examine the output. It deals with the
relationship between the two variables.
• MULTIVARIATE NON-GRAPHICAL
This technique involves more than two variables; non-graphical multivariate analysis shows the
relationships among them with the help of statistics and cross-tabulation.
• MULTIVARIATE GRAPHICAL
This technique involves more than two variables and shows the relationships among them
graphically. The graphs can be a bar chart, heat map, bar plot, bubble chart, run chart,
scatter plot, or multivariate chart.
EDA TECHNIQUES APPLYING TO DIABETES DATABASE

1. Univariate Analysis:

When applying the univariate technique to the diabetes dataset, I select only one variable and
then examine the output.

2. Bivariate Analysis

In this technique, I take two variables and determine the result from the relationship between
them.
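
The univariate and bivariate steps can be sketched without any plotting library. This is a minimal illustration with made-up values; the column names and the helper functions `summarize` and `pearson` are hypothetical.

```python
import math
import statistics

def summarize(values):
    """Univariate, non-graphical summary of one column."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long columns."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical values for two columns of a diabetes dataset.
glucose = [85, 90, 95, 155, 160, 170]
bmi = [22.0, 24.5, 23.1, 31.8, 33.0, 35.2]

glucose_summary = summarize(glucose)
glucose_bmi_corr = pearson(glucose, bmi)
print(glucose_summary)
print(round(glucose_bmi_corr, 3))
```

A correlation close to 1 or -1 suggests a strong linear relationship between the two variables, while a value near 0 suggests little linear relationship.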

3. Multivariate Analysis

In this technique, we determine the relationships among more than two variables and view the
output as a graph.

4. Missing Values:

In this step, we find out which columns (variables) of the dataset contain missing values and
then handle them by filling in the missing entries so that each column is complete.
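
One common way to fill such gaps is to replace each missing entry with the median of the observed values in the same column. The sketch below is only an illustration and not necessarily the method used in this report; it assumes that both `None` and 0 mark a missing reading, and the function names are hypothetical.

```python
import statistics

def count_missing(column, missing_markers=(None, 0)):
    """Number of entries that are treated as missing."""
    return sum(1 for v in column if v in missing_markers)

def fill_missing_with_median(column, missing_markers=(None, 0)):
    """Replace missing entries of a column with the median of the observed ones."""
    observed = [v for v in column if v not in missing_markers]
    median = statistics.median(observed)
    return [median if v in missing_markers else v for v in column]

# Hypothetical skin-thickness readings; None and 0 stand for "not recorded".
skin_thickness = [35, 0, 29, None, 23, 32]

print(count_missing(skin_thickness))
print(fill_missing_with_median(skin_thickness))
```

The median is less sensitive to outliers than the mean, which is why it is often preferred for clinical measurements.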

RESULTS
This research works on diabetes prediction by applying several machine learning algorithms,
namely KNN, NB, SVM, and Logistic Regression, to the PIMA Indian dataset, and the models are
compared on accuracy. The dataset is divided into 75% for training and 25% for testing. The
figure below shows the features (columns) of the dataset and how they relate to each other.
The feature heat map in Figure 4 presents the attributes graphically, with the values of the
dataset shown through colors, which helps you see your dataset in more depth.
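
The 75%/25% division into training and testing data can be sketched as follows. This is a minimal illustration using only the standard library; the function `train_test_split` defined here is a hypothetical helper, the seed is arbitrary, and the rows are placeholders rather than real records.

```python
import random

def train_test_split(rows, labels, test_fraction=0.25, seed=0):
    """Shuffle the rows and hold out test_fraction of them for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = ([rows[i] for i in train_idx], [labels[i] for i in train_idx])
    test = ([rows[i] for i in test_idx], [labels[i] for i in test_idx])
    return train, test

# 768 placeholder records, matching the size of the PIMA Indian dataset.
rows = [[i] for i in range(768)]
labels = [i % 2 for i in range(768)]

(train_X, train_y), (test_X, test_y) = train_test_split(rows, labels)
print(len(train_X), len(test_X))
```

Shuffling before splitting avoids a test set that contains only the last records of the file, and fixing the seed makes the split reproducible.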

Figure 4: Featured Heat map

Figure 5 represents the accuracy of the diabetes prediction after applying the machine learning
algorithms. In the confusion matrix, cell (0,0) holds the true negatives and cell (1,1) the true
positives. Cell (0,1) holds the cases where a person is negative but the predicted value is
positive (false positives), and cell (1,0) the cases where the person is positive but the
predicted value is negative (false negatives). The models already predict accurately, and the
predictions could become more accurate still if multiple models were combined in an ensemble.
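
The four cells of the confusion matrix and the accuracy derived from them can be computed as in the sketch below. The labels are made up for illustration and are not the results of this report; the function names are hypothetical.

```python
def confusion_matrix(actual, predicted):
    """2x2 matrix indexed as matrix[actual][predicted] for labels 0 and 1."""
    matrix = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix

def accuracy(matrix):
    """Share of correct predictions: (TN + TP) / all predictions."""
    correct = matrix[0][0] + matrix[1][1]
    total = sum(sum(row) for row in matrix)
    return correct / total

# Hypothetical labels; 1 = diabetic, 0 = not diabetic.
actual    = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
predicted = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

cm = confusion_matrix(actual, predicted)
print(cm)
print(accuracy(cm))
```

Here cell (0,0) counts 5 true negatives, (0,1) 1 false positive, (1,0) 1 false negative, and (1,1) 3 true positives, giving an accuracy of 8 out of 10.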

Figure 5: Accuracy of Algorithms



REFERENCES

[1] Sarwar, M.A., Kamal, N., Hamid, W. and Shah, M.A., 2018, September. Prediction of
diabetes using machine learning algorithms in healthcare. In 2018 24th international conference
on automation and computing (ICAC) (pp. 1-6). IEEE.
[2] Yahyaoui, A., Jamil, A., Rasheed, J., and Yesiltepe, M., 2019, November. A decision support
system for diabetes prediction using machine learning and deep learning techniques. In 2019 1st
International Informatics and Software Engineering Conference (UBMYK) (pp. 1-4). IEEE.

[3] Mujumdar, A. and Vaidehi, V., 2019. Diabetes prediction using machine learning
algorithms. Procedia Computer Science, 165, pp.292-299.

[4] Ayon, S.I. and Islam, M.M., 2019. Diabetes prediction: a deep learning
approach. International Journal of Information Engineering and Electronic Business, 12(2),
p.21.

[5] Sisodia, D. and Sisodia, D.S., 2018. Prediction of diabetes using classification
algorithms. Procedia computer science, 132, pp.1578-1585.

[6] Sonar, P. and JayaMalini, K., 2019, March. Diabetes prediction using different machine
learning approaches. In 2019 3rd International Conference on Computing Methodologies and
Communication (ICCMC) (pp. 367-371). IEEE.

[7] M. K. Hasan, M. A. Alam, D. Das, E. Hossain, and M. Hasan, "Diabetes Prediction Using
Ensembling of Different Machine Learning Classifiers," in IEEE Access, vol. 8, pp. 76516-
76531, 2020, DOI: 10.1109/ACCESS.2020.2989857.

[8] Kavakiotis, I., Tsave, O., Salifoglou, A., Maglaveras, N., Vlahavas, I. and Chouvarda, I.,
2017. Machine learning and data mining methods in diabetes research. Computational and
structural biotechnology journal, 15, pp.104-116.

[9] Ahmed, N., Ahammed, R., Islam, M.M., Uddin, M.A., Akhter, A., Talukder, M.A.A. and
Paul, B.K., 2021. Machine learning-based diabetes prediction and development of smart web
application. International Journal of Cognitive Computing in Engineering, 2, pp.229-241.

[10] Naz, H. and Ahuja, S., 2020. Deep learning approach for diabetes prediction using PIMA
Indian dataset. Journal of Diabetes & Metabolic Disorders, 19(1), pp.391-403.

[11] M. A. Sarwar, N. Kamal, W. Hamid, and M. A. Shah, "Prediction of Diabetes Using Machine
Learning Algorithms in Healthcare," 2018 24th International Conference on Automation and
Computing (ICAC), 2018, pp. 1-6, DOI: 10.23919/IConAC.2018.8748992.

[12] Bano Farhana et al 2021 J. Phys.: Conf. Ser. 2089 012002
