Diabetes Prediction Using Machine Learning Techniques
5 authors, including:
V. Jithendra
Madanapalle Institute of Technology & Science
All content following this page was uploaded by V. Jithendra on 14 July 2023.
Abstract
Nowadays, because of hectic schedules and sedentary lifestyles, many people do not follow a
proper diet. A poor diet may lead to diabetes, which can result in various health issues such
as heart attacks, strokes, renal failure, and nerve damage. When diabetes is accurately detected
at an early stage, it can be treated effectively, and machine learning (ML) approaches can
greatly improve early detection and prediction. The objective of this research is to predict
diabetes using various supervised machine learning methods. Seven algorithms are compared to
determine which performs best: Logistic Regression, Random Forest, Decision Tree, K-Nearest
Neighbor, Support Vector Machine, Naïve Bayes, and Gradient Boosting. The evaluation results
show that Logistic Regression is more accurate than the other algorithms for the given data
set, with an accuracy of 82%. After the most accurate ML model was selected, a user interface
was developed in which users can enter new data and get results; the results are forwarded to
the user through WhatsApp along with some suggestions and precautions.
Keywords: Logistic Regression, Random Forest, Decision Tree, K-Nearest Neighbor, Support
Vector Machine, Naïve Bayes, Gradient Boosting, Machine Learning (ML), Diabetes.
Journal of Artificial Intelligence and Capsule Networks, June 2023, Volume 5, Issue 2, Pages 190-206 190
DOI: https://doi.org/10.36548/jaicn.2023.2.008
Received: 03.05.2023, received in revised form: 03.06.2023, accepted: 18.06.2023, published: 28.06.2023
© 2022 Inventive Research Organization. This is an open access article under the Creative Commons Attribution-NonCommercial International (CC BY-NC 4.0) License
V Jithendra, R M Sai Mohit, M Madhusudhan, B Jagadeesh, S Kusuma
1. Introduction
To lessen the impact of diabetes and manage the condition, one must focus
on people who are at high risk. The World Health Organization (WHO) specifies
additional risk factors for diabetes mellitus as follows:
• Fasting blood sugar levels are persistently above the normal range, or
blood glucose is elevated above normal levels (impaired fasting glucose, IFG).
2. Literature Survey
Glucose level, blood pressure, insulin, body mass index (BMI), and age are
considered when developing the model for diabetes prediction. The method is then
evaluated on a set of real-world data. The authors achieved an accuracy of 72%
for the decision tree and 76.5% for the random forest [3]. Furthermore,
researchers used Support Vector Machine (SVM) and Random Forest (RF)
classifiers to predict diabetes. The data set is first obtained from a clinic and
then pre-processed.
Duplicate and incomplete records are removed during the pre-processing
of the dataset. Features are then selected from the pre-processed data, and the
dataset is prepared for dimensionality reduction. The data set is split into two
portions, i.e., training and testing, and an SVM classifier and a Random Forest
model are trained on it. The two models are then evaluated; the evaluation
results showed 81.4% and 83% accuracy, respectively [4].
In another work, the dataset is split into 70% for training and 30% for testing.
To predict diabetes, seven machine learning algorithms are deployed. For this
dataset, Random Forest achieves the maximum accuracy, 79.9% on the test data
set. On this basis, the authors concluded that the Random Forest algorithm is
the most effective [5]. Other researchers primarily take into account three
data sets, for heart disease (HD), liver disease, and diabetes, and use the same
diabetes dataset as the other researchers for prediction.
The proposed model has 97% prediction accuracy. To determine the highest
accuracy, the authors used logistic regression (LR), and the article used a genetic
algorithm (GA) to forecast by combining the effects of a large number of
independent variables.
3. Proposed Work
3.1 Methodology
Data pre-processing is the most important aspect of this project. Missing
values and other unwanted values might reduce the effectiveness of the results, so
data pre-processing is performed to increase the accuracy and effectiveness of the
results. The two-step pre-processing followed in this work is as follows.
1) Missing Values Removal: Remove any instance that has a null or 0 value.
Such a feature can never have a value of zero, so the instance is not valid.
Feature subsets are then created by eliminating irrelevant features and instances,
which reduces the dimensionality of the data and speeds up the work.
2) Splitting of Dataset: Splitting the data helps us estimate how the model
will perform on new data. The dataset is split into two portions: 80% for training
and 20% for testing. This step is carried out after the missing values removal
step.
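The two pre-processing steps can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the code used in this work: the helper names `remove_invalid_rows` and `split_dataset`, the field names, the random seed, and the sample records are all made up for the example.

```python
import random

def remove_invalid_rows(rows, required_fields):
    """Drop every row in which a required field is None or 0."""
    return [
        row for row in rows
        if all(row.get(field) not in (None, 0) for field in required_fields)
    ]

def split_dataset(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Made-up illustrative records (not taken from the dataset used in the paper).
records = [
    {"glucose": 148, "bmi": 33.6, "age": 50, "outcome": 1},
    {"glucose": 85, "bmi": 26.6, "age": 31, "outcome": 0},
    {"glucose": 0, "bmi": 23.3, "age": 32, "outcome": 1},     # zero glucose: dropped
    {"glucose": 89, "bmi": 28.1, "age": 21, "outcome": 0},
    {"glucose": 137, "bmi": None, "age": 33, "outcome": 1},   # missing BMI: dropped
    {"glucose": 116, "bmi": 25.6, "age": 30, "outcome": 0},
    {"glucose": 78, "bmi": 31.0, "age": 26, "outcome": 1},
]

# The outcome column is excluded from the check because 0 is a valid label.
clean = remove_invalid_rows(records, ["glucose", "bmi", "age"])
train, test = split_dataset(clean, test_fraction=0.2)
```

The split is done after cleaning, matching the order described above, so invalid rows never reach the test portion.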
Logistic regression is a statistical method used to analyse data and make
predictions about the likelihood of an event occurring. This kind of regression
analysis is employed when the dependent variable is categorical or binary. The aim
of logistic regression is to determine how the variables are related to each other.
It uses a logistic function to transform the input values into a probability value
between 0 and 1. The logistic regression model estimates the coefficients of the
independent variables that maximize the likelihood of the observed data. These
coefficients are then used to determine the probability that the dependent variable
takes a particular value.
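As a small illustration of the logistic function described above, the sketch below maps a weighted sum of feature values to a probability between 0 and 1. The function names and the example coefficients are hypothetical; in practice the coefficients would be estimated from the training data.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_probability(features, coefficients, intercept):
    """Probability of the positive class for one sample."""
    z = intercept + sum(w * x for w, x in zip(coefficients, features))
    return sigmoid(z)

# Hypothetical coefficients for two standardized features.
p = predict_probability([1.2, -0.4], coefficients=[0.8, 0.5], intercept=-0.1)
label = 1 if p >= 0.5 else 0  # 0.5 is a common, but not mandatory, threshold
```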
b) K-Nearest Neighbor
d) Naive Bayes
e) Decision Tree
In a decision tree, each internal node represents a test on a feature, each
branch represents the test's result, and each leaf node represents a class name
or a numerical value in this tree-like structure. Decision trees are simple to
use and can manage both categorical and numerical data.
f) Random Forest
A Random Forest is constructed from multiple decision trees and combines their
outputs into one final prediction. Each decision tree is grown using a random
subset of features and data samples, which makes the forest less prone to
overfitting. The final prediction of the Random Forest is the majority vote (for
classification) or the average prediction (for regression) of all the individual
trees. The algorithm is frequently used in classification and regression tasks. It
can handle both categorical and continuous data, and it also handles missing
values. It can identify feature importance, making it useful for feature selection
and data visualization. Random Forest is efficient and can handle huge datasets.
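The aggregation step can be illustrated with a short sketch. It shows only how per-tree outputs are combined, not how the trees are trained; the function names and the example tree outputs are hypothetical.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Classification: return the class predicted by the most trees.

    Ties are resolved in favour of the class that appears first in the list.
    """
    return Counter(tree_predictions).most_common(1)[0][0]

def average_prediction(tree_predictions):
    """Regression: return the mean of the per-tree predictions."""
    return sum(tree_predictions) / len(tree_predictions)

# Hypothetical outputs of five trees for one patient.
votes = ["Diabetic", "Non-Diabetic", "Diabetic", "Diabetic", "Non-Diabetic"]
final_class = majority_vote(votes)
```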
g) Gradient Boosting
Accuracy
Accuracy alone may not be a reliable performance metric, particularly if the target
variable classes in the sample are uneven. Mathematically, it can be represented as:
𝐴 = ( 𝑇𝑃 + 𝑇𝑁 ) / ( 𝑇𝑃 + 𝑇𝑁 + 𝐹𝑃 + 𝐹𝑁 ) (1)
Precision
It is a metric for how accurate a binary classification algorithm is. It is defined as the
ratio of the model's correct positive predictions to its total positive predictions. Mathematically
it can be represented as:
𝑃 = 𝑇𝑃 / ( 𝑇𝑃 + 𝐹𝑃 ) (2)
Recall
𝑅 = 𝑇𝑃 / ( 𝑇𝑃 + 𝐹𝑁 ) (3)
F1-score
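As a rough illustration, the metrics in this section can be computed directly from the four confusion-matrix counts. Accuracy, precision, and recall follow equations (1) to (3); the F1-score uses its standard definition as the harmonic mean of precision and recall. The function name and the example counts are made up for the sketch and are not results from this work.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up counts for illustration only.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

Returning 0.0 when a denominator is zero is one possible convention; other implementations raise an error or emit a warning instead.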
To provide a simple and interactive interface for the user, we built the GUI
application using the Tkinter package.
Step 1: Install the Tkinter package and import it into the Python script; it
provides the classes and functions required for creating a GUI.
Step 2: Create the main window and customize it by modifying attributes
such as the title and size.
Step 3: After creating the main window, create widgets such as labels,
buttons, and entry fields and add them to it.
Here we created 6 labels and 6 entry fields for the user’s name, phone number,
and features, along with Predict and Report buttons.
Step 4: Position the created widgets using a layout manager (Grid
Manager).
After the user clicks the Predict button, the application displays the result,
“Diabetic” or “Non-Diabetic”. Similarly, after the user clicks the Report button,
it sends the report to the user’s mobile number through WhatsApp.
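The four steps above might look roughly like the following sketch. It is not the application built in this work: the four feature names in `FIELDS`, the window size, and the helpers `result_label` and `build_window` are assumptions made for illustration, and the Report button is left as a placeholder.

```python
def result_label(prediction):
    """Map a binary model output (1 or 0) to the text shown in the GUI."""
    return "Diabetic" if prediction == 1 else "Non-Diabetic"

# Name and phone number, followed by four feature names (assumed).
FIELDS = ["Name", "Phone number", "Glucose", "Blood pressure", "BMI", "Age"]

def build_window(predict_fn):
    """Build the main window; predict_fn maps a list of floats to 0 or 1."""
    # Step 1: import Tkinter (done here so result_label works without a display).
    import tkinter as tk

    # Step 2: create and customize the main window.
    root = tk.Tk()
    root.title("Diabetes Prediction")
    root.geometry("400x320")

    # Steps 3 and 4: create one label and one entry per field, placed with grid().
    entries = {}
    for row, field in enumerate(FIELDS):
        tk.Label(root, text=field).grid(row=row, column=0, padx=8, pady=4, sticky="w")
        entry = tk.Entry(root)
        entry.grid(row=row, column=1, padx=8, pady=4)
        entries[field] = entry

    result = tk.Label(root, text="")
    result.grid(row=len(FIELDS) + 1, column=0, columnspan=2)

    def on_predict():
        # Only the feature fields (not name and phone number) go to the model.
        try:
            values = [float(entries[f].get()) for f in FIELDS[2:]]
        except ValueError:
            result.config(text="Please enter numeric feature values")
            return
        result.config(text=result_label(predict_fn(values)))

    tk.Button(root, text="Predict", command=on_predict).grid(
        row=len(FIELDS), column=0, pady=8)
    # Placeholder: the real handler would send the report through WhatsApp.
    tk.Button(root, text="Report", command=lambda: None).grid(
        row=len(FIELDS), column=1, pady=8)
    return root

# Usage (needs a display); the lambda is a dummy model that always predicts 0:
# build_window(lambda values: 0).mainloop()
```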
Step 1: Install the PyWhatKit package and import it into the Python script.
where:
ph_no: The phone number of the user entered in the GUI, including the
country code.
Hour and minute: These are set to two minutes (00:02) after the moment
the user clicks the Report button.
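A hedged sketch of how these parameters could be prepared is shown below. The helper names `schedule_time` and `send_report` are hypothetical. The call to `pywhatkit.sendwhatmsg` passes the phone number, the message, and the hour and minute at which to send; it is assumed here that the machine has a browser with an active WhatsApp Web session.

```python
from datetime import datetime, timedelta

def schedule_time(now, delay_minutes=2):
    """Return the (hour, minute) that lies delay_minutes after now."""
    send_at = now + timedelta(minutes=delay_minutes)
    return send_at.hour, send_at.minute

def send_report(ph_no, message, now=None):
    """Schedule a WhatsApp message two minutes from now."""
    # Third-party package, imported lazily because it needs a desktop session.
    import pywhatkit

    hour, minute = schedule_time(now or datetime.now())
    # ph_no must include the country code, for example "+91...".
    pywhatkit.sendwhatmsg(ph_no, message, hour, minute)

# Usage (opens WhatsApp Web in the default browser):
# send_report(ph_no, "Result: Non-Diabetic")
```

Using `timedelta` rather than adding 2 to the minute value keeps the time valid when the delay crosses an hour or midnight boundary.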
4. Results
5. Conclusion
References
[1] Arwatki Chen L, Nurul Amin, & Soumen Moulik (2020). Diabetes Disease
Prediction Using Machine Learning Algorithms. IEEE EMBS Conference
on Biomedical Engineering and Sciences (IECBES).
[2] Le, T. M., Vo, T. M., Pham, T. N., & Dao, S. V.T. (2021). A Novel Wrapper
Based Feature Selection for Early Diabetes Prediction Enhanced with a
Metaheuristic. IEEE Access, 9, 7869–7884.
doi:10.1109/access.2020.3047942.
[3] P, Anirudh Hebbar; M V, Manoj Kumar; H A, Sanjay (2019). DRAP:
Decision Tree and Random Forest Based Classification Model to Predict
Diabetes. 2019 1st International Conference on Advances in Information
Technology (ICAIT), Chikmagalur, India, 271–276.
doi:10.1109/icait47043.2019.8987277
[4] Sivaranjani, S., Ananya, S., Aravinth, J., & Karthika, R. (2021). Diabetes
Prediction using Machine Learning Algorithms with Feature Selection and
Dimensionality Reduction. 2021 7th International Conference on
Advanced Computing and Communication Systems (ICACCS).
doi:10.1109/icaccs51430.2021.9441935
[5] Barhate, Rahul; Kulkarni, Pradnya (2018). Analysis of Classifiers for
Prediction of Type II Diabetes Mellitus. 2018 Fourth International
Conference on Computing Communication Control and Automation
(ICCUBEA), Pune, India, 1–6. doi:10.1109/ICCUBEA.2018.8697856
[6] Chaudhuri, A. K., & Das, A. (2020). Variable Selection in Genetic Algorithm
Model with Logistic Regression for Prediction of Progression to Diseases.
2020 IEEE International Conference for Innovation in
Technology (INOCON). doi:10.1109/inocon50539.2020.9298372
[7] Ahmed, Hager; Younis, Eman M.G.; Ali, Abdelmgeid A. (2020). [IEEE 2020
International Conference on Innovative Trends in Communication and
Computer Engineering (ITCE) - Aswan, Egypt (2020.2.8-2020.2.9)]
2020 International Conference on Innovative Trends in Communication
and Computer Engineering (ITCE) - Predicting Diabetes using Distributed
[12] Nonso Nnamoko, Abir Hussain, David England, "Predicting Diabetes Onset:
an Ensemble Supervised Learning Approach". IEEE Congress on
Evolutionary Computation (CEC), 2018.
[13] Deeraj Shetty, Kishor Rit, Sohail Shaikh, Nikita Patil, "Diabetes Disease
Prediction Using Data Mining". International Conference on
Innovations in Information, Embedded and Communication Systems
(ICIIECS), 2017.
[14] Nahla B., Andrew et al., "Intelligible support vector machines for diagnosis
of diabetes mellitus". IEEE Transactions on Information Technology in
Biomedicine, 14 (July 2010), 1114-20.