Disease Prediction Model An Efficient Machine Lear
Research Article
Keywords: DNA sequences, Classifiers, Feature extraction, Healthcare, Machine Learning
DOI: https://doi.org/10.21203/rs.3.rs-4248864/v1
License: This work is licensed under a Creative Commons Attribution 4.0 International License.
Abstract
When it comes to health care, everyone is eager to identify diseases in their early stages, but
doing so can be difficult because of the lack of knowledge about the patterns of specific diseases.
Since DNA contains most of the genetic blueprint, DNA sequence classification can be used to
predict the existence of certain conditions accurately. There are several machine-learning
techniques available to classify DNA sequences. Traits from known diseases are extracted to
train the model for new, unknown diseases. The expansion of patients' access to digital platforms
for early disease diagnosis through knowledge transfer to artificial neural networks eliminates
the need for clinical equipment. To analyze the model, DNA samples of well-known conditions,
including human respiratory viruses, lung cancer, and human papillomaviruses (HPV), are
gathered from GenBank (NCBI). These samples are then compared with five existing methods
using eight evaluation parameters: specificity, accuracy, Matthews correlation coefficient, recall,
precision, F1-score, area under the receiver operating characteristic (ROC) curve (AUROC), and
area under the Precision-Recall (PRC) curve (AUPRC). The outcome demonstrates that the
proposed work provides significantly better precision and accuracy than the prior best results:
precision has increased by more than 5.124% and accuracy by about 15.9%.
1. Introduction
Since DNA sequences include all the genetic data about organisms, using them to extract disease
patterns can be very advantageous [1]: researchers can utilize this genetic data to develop novel
medications and to diagnose diseases at an early stage. Numerous machine-learning techniques
are available for the classification of DNA sequences. Features are extracted from known
diseases to train the model for new, unknown diseases. For this purpose, DNA data of various
diseases are collected from NCBI, which has the largest collection of DNA sequences and is
growing exponentially; the year-wise distribution of sequences stored in GenBank (NCBI) from
2013 to 2022 can be seen in Figure 1. In this work, a new approach is proposed in which the data
given to traditional machine-learning techniques is first preprocessed to increase the efficiency of
classification. Pre-processing of DNA data is done using a matrix named "hot vector", in which
each nucleotide is converted to a binary 0 or 1 according to its position in the FASTA sequence.
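As a rough illustration of this kind of hot-vector preprocessing, the sketch below one-hot encodes a DNA string position by position; the mapping and function names are illustrative and are not taken from the paper.

```python
import numpy as np

# Illustrative sketch (not the paper's exact encoding): each nucleotide of a
# FASTA sequence is mapped to a binary indicator row, giving a (length x 4) matrix.
NUCLEOTIDES = "ACGT"

def one_hot_encode(sequence: str) -> np.ndarray:
    """Return a len(sequence) x 4 binary matrix for a DNA sequence."""
    matrix = np.zeros((len(sequence), len(NUCLEOTIDES)), dtype=np.int8)
    for i, base in enumerate(sequence.upper()):
        if base in NUCLEOTIDES:              # unknown bases such as 'N' stay all-zero
            matrix[i, NUCLEOTIDES.index(base)] = 1
    return matrix

print(one_hot_encode("ACGTN"))
```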
[Figure 1. Year-wise distribution of sequences stored in GenBank (NCBI)]
These calculated hot vectors are then given as input to a traditional convolutional neural network
to extract features and identify the pattern of a particular disease. Whenever a newly infected
person visits a doctor, their DNA can be matched with the database to identify an unknown
disease. Figure 2 displays the total number of publications related to healthcare research from
2005 to 2023, which makes obvious how crucial it is to understand the healthcare system and its
security.
Figure 2. A graph on the total number of publications related to studies on healthcare from 2005 to 2023
1.1. Composition of DNA data
Since it is commonly recognized that biological information about an organism or human is
contained in its DNA, understanding DNA sequences is essential for understanding organisms.
The basic structure of the DNA is shown in Figure 3 [2].
The DNA sequence [3][4] is represented textually in FASTA, which only employs a single letter
to represent each DNA base pair. Any organism's structure can be modeled by the arrangement
of its nucleotides. As seen in the example below, the description of the DNA sequence, or
identifier, comes first on a FASTA representation line before the sequence data.
The special symbol ">" marks the identifier line and separates it from the DNA sequence. The
word or phrase immediately after ">" is the identifier of the sequence, and the optional remaining
text, separated from the identifier by a white space or tab, gives the description of the sequence.
The actual DNA sequence starts on the line that follows the header line. Another ">" symbol
marks the beginning of a new header line and thus the beginning of a new sequence.
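A minimal reader for this format can be sketched as follows; the file name is hypothetical, and real pipelines would typically use an established parser such as Biopython's SeqIO instead.

```python
# Minimal FASTA reader, assuming the plain-text layout described above:
# a ">" header line (identifier plus optional description) followed by
# sequence lines until the next ">" or the end of the file.
def read_fasta(path):
    records = {}
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    records[header] = "".join(chunks)   # close the previous record
                header, chunks = line[1:], []
            else:
                chunks.append(line)
        if header is not None:
            records[header] = "".join(chunks)
    return records

# records = read_fasta("sequences.fasta")   # hypothetical file name
```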
• Decision Trees:
The term "decision tree" refers to a tree-like structure in which each internal node represents a
test applied to a predictive feature, the branches represent attribute values, and the leaf nodes
represent the class distribution [13]. To classify an unlabeled object, one starts at the root node
and moves down the tree according to the object's predictive attribute values (see the sketch
following this list). A familiar real-world example of a decision tree is a sequence of tests that
determines whether a number is larger or smaller than another number.
• CNN (Convolutional neural network)
A convolutional neural network is a type of deep neural network that uses convolution layers to
extract features from the input data. It is modeled after biological neurons, where features from
the previous convolution layer are combined into higher-level feature abstractions [14].
• Support Vector Machines:
A support vector machine (SVM) is a supervised machine-learning algorithm, introduced in
1995, that maximizes the margin between different classes while minimizing the empirical
classification error [15]. High-dimensional data can be classified with little difficulty using SVM
kernels such as the radial basis function [16]. The data set is usually divided into training and
testing subsets: the classifier is fitted on the training set and its effectiveness is then evaluated on
the test set.
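To make the two classical classifiers above concrete, the following scikit-learn sketch trains a decision tree and an RBF-kernel SVM on a bundled toy dataset and evaluates them on a held-out test split; the dataset and hyper-parameters are illustrative, not those used in this work.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Toy features and labels; in this work the inputs would be encoded DNA sequences.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "decision tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "SVM (RBF kernel)": SVC(kernel="rbf", gamma="scale", C=1.0),
}
for name, model in models.items():
    model.fit(X_train, y_train)                          # learn from the training set
    print(name, "test accuracy:", model.score(X_test, y_test))
```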
The organization of the paper is as follows: section 1 is about the basic overview of DNA
composition and machine learning techniques, section 2 covers the literature survey and the
proposed model is discussed in section 3. Section 4 is about the result analysis and discussion,
and finally, section 5 concludes the paper.
2. Literature review
The early detection of viruses and diseases has become everyone's top concern in the wake of the
COVID-19 epidemic, because early detection can help with the development of medicines and
save more lives. Figure 4 shows the overall number of articles on DNA classification for disease
prediction released each year, broken down by type of publication.
But detecting diseases early is difficult due to the scarcity of known patterns for various ailments.
The classification of DNA sequences, which can be considered one possible response, is crucial
to computational biology. Table 1 summarizes some earlier studies on the classification of DNA
sequences:
Figure 4. The number of publications published per year, based on publication type
[18] Vector-space classification: A novel technique for identifying intron-exon junctions,
together with PCA for classifying DNA sequences, is described by the author in this work.
Limitation: the strategy was constrained to a plan that required prior knowledge of the genomic
DNA's structure in the form of a training set, i.e., a description of the contents of exons and
introns.
[19] Directed acyclic word graphs: The "words" present in the unknown sequence are compared
and matched against a dictionary of words, and the sequence is subsequently categorized into the
three fundamental classes of DNA. Directed acyclic word graphs (DAWGs) can be used to
increase labor efficiency even further; with this technique, the failure rate is lowered to 4%, with
94% of test sequences identified successfully. Limitation: the three-state prediction method and
the sequence recognition system need to be tested more thoroughly.
[20] Microarray technology: To properly represent the difficulties surrounding gene
classification, the author has emphasized the use of fuzzy logic in this work. Limitation: the
method was constrained to a plan that required prior knowledge of the genomic DNA's structure
in the form of a training set.
[21] Block-based approach: The classification of protein sequences is the subject of the author's
investigation. The five approaches were applied only to related proteins found in the "PROSITE"
catalog to analyze the work. Limitation: the block-based strategy is only appropriate when amino
acids are present in blocks, and a decent classification result can be obtained only by combining
all of the above methods.
[22] Machine learning techniques: The author has provided a fundamental explanation of DNA,
in addition to the use of machine-learning algorithms for DNA sequence mining. Limitation:
issues were observed during sequence mining.
[23] CNN technique: Convolution neurons use the features collected from the preceding layer to
extract high-level features. Here, the DNA sequence is first converted to an equivalent textual
format, and then CNN is applied to it in the same way as it is applied to textual data. Limitation:
preprocessing of the data is required.
[24] CapsNets: An improved machine-learning architecture for classifying tumors from MRI
images is proposed in this paper. Limitation: not effective for significant input modifications.
Research Gaps
The majority of the literature recommends the classification of DNA datasets for pattern
recognition when machine-learning approaches are used to identify diseases. Some of the
methods for classifying DNA include the Expectation-Maximization algorithm, vector-space
classification, DAWGs, CNN, and block-based approaches. The drawback of these classification
methods is that they were constrained to approaches requiring prior information on the genomic
DNA's structure in the form of a training set. Additionally, to eliminate the drawbacks of the
earlier approaches, a newer classification strategy known as "CapsNets" is described in [24]: a
machine-learning architecture that successfully classifies diseases from the given datasets.
However, this method does not perform well for large input datasets. To overcome CapsNets'
limitations, a new classifier is presented here that not only works well for large datasets but also
increases the accuracy of classification by preprocessing the data using a hot vector before
feeding it to a conventional CNN. This method can also be used to train the neural network to
recognize the pattern of a particular disease.
3. Proposed Model
[Figure: an example dictionary D of five words and the corresponding 2-D hot vector matrix for a region size of 2]
In the above example, D is a dictionary that consists of 5 words with their equivalent numeric
values in a 2-D matrix named the hot vector matrix. For any sentence s and a region size of m, a
hot vector matrix can be created by looking up each region in the dictionary of words. Once the
sentence is converted into its equivalent 2-D hot vector matrix, it can be given as an input to a
conventional CNN for classification purposes. The model is named sequential CNN, or the CNN
model for sequences.
Here four consecutive words are selected after every stride of 1 and added to the destination
sequence. One such dictionary of 256 words and their hot vector representation is shown in
Figure 6 and Figure 7.
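The exact construction used in Figures 6 and 7 is not reproduced here, but a plausible sketch of the idea, assuming a fixed word-to-index dictionary and overlapping regions taken with a stride of 1, is the following; the dictionary, region size, and function names are illustrative.

```python
import numpy as np

# Illustrative sketch of a hot-vector matrix built from a word dictionary.
D = {"w1": 0, "w2": 1, "w3": 2, "w4": 3, "w5": 4}       # example 5-word dictionary

def hot_vector_matrix(words, dictionary, region_size=2, stride=1):
    """One-hot encode overlapping regions of `words` against `dictionary`."""
    rows = []
    for start in range(0, len(words) - region_size + 1, stride):
        region = words[start:start + region_size]
        row = np.zeros(len(dictionary), dtype=np.int8)
        for w in region:
            if w in dictionary:
                row[dictionary[w]] = 1                   # mark every dictionary word in the region
        rows.append(row)
    return np.vstack(rows)

sentence = ["w1", "w3", "w2", "w5"]
print(hot_vector_matrix(sentence, D, region_size=2, stride=1))
```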
In this work, a framework has been proposed, as shown in Figure 8: when a patient reaches a
doctor for treatment, their samples are collected and matched with our database. The DNA data
collected from NCBI are securely stored using blockchain so that they cannot be changed or
attacked by an intruder. Researchers with authorized access can use this data for classification
purposes.
Numerous machine-learning techniques are available for the classification of DNA sequences,
and features extracted from known diseases are used to train the model for new, unknown
diseases. For this purpose, we have proposed a new approach in which the data given to
traditional machine-learning techniques is first preprocessed to increase the efficiency of
classification. Pre-processing of DNA data is done using the "hot vector" matrix, in which each
nucleotide is converted to a binary 0 or 1 according to its position in the FASTA sequence. The
CNN then extracts features from the resulting matrices, which are further used to train models
for various kinds of diseases. These trained models are evaluated on random testing datasets and
can be applied to unknown DNA sequences for disease prediction.
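A minimal sketch of such a sequential CNN, written with TensorFlow/Keras [26], is shown below; the input shape, layer sizes, and number of classes are placeholders rather than the configuration actually used in this work.

```python
import tensorflow as tf

# Placeholder shapes: sequences encoded as SEQ_LEN x DICT_SIZE hot-vector matrices,
# classified into NUM_CLASSES disease classes. All sizes are illustrative.
SEQ_LEN, DICT_SIZE, NUM_CLASSES = 100, 256, 2

model = tf.keras.Sequential([
    tf.keras.Input(shape=(SEQ_LEN, DICT_SIZE)),
    tf.keras.layers.Conv1D(64, kernel_size=3, activation="relu"),    # local feature extraction
    tf.keras.layers.MaxPooling1D(pool_size=2),
    tf.keras.layers.Conv1D(32, kernel_size=3, activation="relu"),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),        # disease classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
# model.fit(hot_vectors, labels, epochs=10, validation_split=0.2)    # hot_vectors: (N, SEQ_LEN, DICT_SIZE)
```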
When a dataset is normalized, every value is transformed so that the dataset's mean is 0 and its
standard deviation is 1. This procedure is known as "Z-score normalization". Z-score
normalization is applied to each value x in a dataset using the following formula, where μ is the
mean and σ is the standard deviation:
z = (x − μ) / σ ……………………..(1)
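In code, this normalization amounts to subtracting the per-feature mean and dividing by the per-feature standard deviation, as in the small NumPy sketch below (scikit-learn's StandardScaler performs the same transformation).

```python
import numpy as np

def z_score(X: np.ndarray) -> np.ndarray:
    """Z-score normalize each column of a 2-D feature matrix."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0                  # avoid division by zero for constant features
    return (X - mean) / std              # result has mean 0 and standard deviation 1 per column

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
print(z_score(X))
```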
Confusion matrix:
                              Actual Positive    Actual Negative
Predicted Positive Class            TP                 FP
Predicted Negative Class            FN                 TN
4.3.2. Accuracy
The percentage of instances that are correctly classified is calculated by:
Accuracy = (TP + TN) / (TP + TN + FP + FN) …………………….(2)
4.3.3. Error Rate
The percentage of predicted values that are incorrectly categorized is determined by:
Error rate = 1- Accuracy ………………………………(3)
4.3.4. True Positive Rate
This metric measures the proportion of actual positive instances that are correctly identified. The
TPR is obtained using:
True positive rate = TP / (TP + FN) ………………………..(4)
4.3.5. False positive rate
This metric measures the proportion of actual negative instances that are incorrectly classified as
positive. The FPR can be found by:
False positive rate = FP / (FP + TN) ………………….(5)
4.3.6. True negative rate
This metric measures the proportion of actual negative instances that are correctly identified. The TNR is obtained by:
True negative rate = 1- FPR …………………(6)
4.3.7. False Negative Rate
This metric measures the proportion of actual positive instances that are incorrectly classified as negative. The FNR can be acquired by:
False Negative rate = 1-TPR ………………….(7)
4.3.8. Precision
Precision measures the proportion of predicted positive instances that are actually positive. It is
obtained by:
Precision = TP / (TP + FP) ……………………….(8)
4.3.9. F-Measure
The F-measure is the harmonic mean of precision and recall. It is obtained by:
F-measure = 2 × (precision × recall) / (precision + recall) ………………….(9)
4.3.10. Matthews’s correlation coefficient
It is employed for measuring the quality of binary classifications. The MCC is obtained by:
Matthews correlation coefficient = ((TN × TP) − (FN × FP)) / √((FN + TP)(FP + TP)(FN + TN)(FP + TN)) …………..(10)
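The metrics in equations (2) to (10) can be computed directly from the four confusion-matrix counts, as in the sketch below; the TP/FP/FN/TN values are made-up illustrative counts, not taken from the result tables, and scikit-learn's metrics module offers equivalent ready-made functions.

```python
import math

# Illustrative counts for a binary confusion matrix (not taken from the paper's tables).
TP, FP, FN, TN = 50, 10, 5, 35

accuracy  = (TP + TN) / (TP + TN + FP + FN)                  # eq. (2)
error     = 1 - accuracy                                     # eq. (3)
tpr       = TP / (TP + FN)                                   # eq. (4), recall / sensitivity
fpr       = FP / (FP + TN)                                   # eq. (5)
tnr       = 1 - fpr                                          # eq. (6), specificity
fnr       = 1 - tpr                                          # eq. (7)
precision = TP / (TP + FP)                                   # eq. (8)
f_measure = 2 * precision * tpr / (precision + tpr)          # eq. (9)
mcc = (TN * TP - FN * FP) / math.sqrt(
    (FN + TP) * (FP + TP) * (FN + TN) * (FP + TN))           # eq. (10)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} F1={f_measure:.3f} MCC={mcc:.3f}")
```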
Moreover, the datasets applied for evaluation purposes are discussed in this section. A
comparison of five well-known classification techniques has been carried out on random datasets
and tested with a Python-based tool that can perform feature extraction and classification of
DNA sequences in FASTA file format. In this work, the feature descriptor "Kmer" with a size of
3 is used; for clustering, the K-means method with 3 clusters is applied; the Z-score method is
used for feature normalization; and the Chi-square method is used for feature selection, with 100
features selected.
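A hypothetical reimplementation of this pipeline is sketched below on randomly generated toy sequences: 3-mer counts are computed, chi-square feature selection is applied to the raw non-negative counts (scikit-learn's chi2 requires non-negative input), features are Z-score normalized, and K-means with 3 clusters is run. Note that only 64 distinct 3-mers exist, so the sketch caps the number of selected features accordingly; the paper's tool and exact settings are not reproduced here.

```python
from itertools import product
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import StandardScaler

KMERS = ["".join(p) for p in product("ACGT", repeat=3)]      # all 64 possible 3-mers

def kmer_counts(seq: str) -> np.ndarray:
    """Count occurrences of each 3-mer in a DNA sequence."""
    counts = dict.fromkeys(KMERS, 0)
    for i in range(len(seq) - 2):
        kmer = seq[i:i + 3]
        if kmer in counts:                                   # skip k-mers containing N etc.
            counts[kmer] += 1
    return np.array([counts[k] for k in KMERS], dtype=float)

rng = np.random.default_rng(0)
sequences = ["".join(rng.choice(list("ACGT"), size=200)) for _ in range(30)]   # toy "DNA"
labels = rng.integers(0, 2, size=30)                                           # toy class labels

X = np.vstack([kmer_counts(s) for s in sequences])
X = SelectKBest(chi2, k=min(100, X.shape[1])).fit_transform(X, labels)   # chi-square selection
X = StandardScaler().fit_transform(X)                                    # Z-score normalization
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(X.shape, np.bincount(clusters))
```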
For the analysis, the following evaluation parameters are calculated and compared: Sn
(sensitivity), Sp (specificity), Acc (accuracy), MCC (Matthews correlation coefficient), Pre
(precision), F1-score, ROC (receiver operating characteristic), AUROC (area under the ROC
curve), PRC (Precision-Recall curve), and AUPRC (area under the Precision-Recall curve) [28]
[29] [30] [31] [32]. The AUROC parameter is evaluated from the ROC curve and ranges between
0 and 1, and the AUPRC parameter is evaluated from the precision-recall curve.
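Both areas can be computed from the true labels and the model's positive-class scores, for example with scikit-learn as sketched below; the label and score arrays are placeholders, not real model output.

```python
import numpy as np
from sklearn.metrics import (average_precision_score, precision_recall_curve,
                             roc_auc_score, roc_curve)

# Placeholder labels and predicted positive-class scores.
y_true  = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.3])

fpr, tpr, _ = roc_curve(y_true, y_score)                    # points of the ROC curve
precision, recall, _ = precision_recall_curve(y_true, y_score)
print("AUROC:", roc_auc_score(y_true, y_score))             # area under the ROC curve, in [0, 1]
print("AUPRC:", average_precision_score(y_true, y_score))   # area under the precision-recall curve
```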
If a model predicts a positive result that is actually positive, it is termed a "true positive"; if it
predicts a negative result that is actually negative, it is termed a "true negative". Similarly,
incorrect predictions are termed false positives and false negatives, respectively. Here "i"
represents a sample of the i-th class. A comparison of the proposed model with the other
approaches on the testing datasets has been carried out based on these parameters.
4.4. Analysis
Table 10. The dataset of human respiratory syncytial virus and human orthopneumovirus appraised with decision tree
Overall instances: 300
Correctly classified instances: 187 (62.5%)
Incorrectly classified instances: 113 (37.5%)
Accuracy: 62.5%   TPR: 0.625   Precision: 0.913   MCC: 0.5211   F-Measure: 0.6604   ROC: 0.8015   PRC: 0.836
Confusion matrix (classes a, b):
132   33
80    55
Table 12. The dataset of human respiratory syncytial virus and human orthopneumovirus appraised with MLP
Overall instances: 300
Correctly classified instances: 234 (78%)
Incorrectly classified instances: 66 (22%)
Accuracy: 78%   TPR: 0.78   Precision: 0.79   MCC: 0.5687   F-Measure: 0.7848   ROC: 0.8255   PRC: 0.8347
Confusion matrix (classes a, b):
220   16
50    14
Table 16. The dataset of human respiratory syncytial virus and human orthopneumovirus appraised with SVM
• Precision-Recall (PRC) curves
PRC curves of all the discussed classification techniques, plotted between precision and recall
for the 300-instance dataset with 5-fold cross-validation, are shown below.
Figure 10. CNN PRC curve
Figure 11. Decision tree PRC curve
Figure 12. RNN PRC curve
Figure 13. SVM PRC curve
Figure 14. MLP PRC curve
Figure 15. Proposed Method PRC curve
Figure 16. CNN true-false positive rate
Figure 17. Decision tree true-false positive rate
Figure 18. RNN true-false positive rate
Figure 19. SVM true-false positive rate
Figure 20. MLP true-false positive rate
Figure 21. Proposed Method true-false positive rate
Table 22. The dataset of Breast cancer appraised with decision tree
Table 30. The dataset of Breast cancer appraised with proposed method
• Precision-Recall (PRC) curves
PRC curves of all the discussed classification techniques, plotted between precision and recall
for the 300-instance dataset with 5-fold cross-validation, are shown below.
Figure 22. CNN PRC curve
Figure 23. Decision tree PRC curve
Figure 24. RNN PRC curve
Figure 25. SVM PRC curve
Figure 26. MLP PRC curve
Figure 27. Proposed Method PRC curve
Figure 28. CNN true-false positive rate
Figure 29. Decision tree true-false positive rate
Figure 30. RNN true-false positive rate
Figure 31. SVM true-false positive rate
Figure 32. MLP true-false positive rate
Figure 33. Proposed Method true-false positive rate
• Accuracy and precision
Figure 34 shows the accuracy and precision of CNN, decision tree, MLP, RNN, SVM, and the
proposed approach obtained on 300 instances of the human respiratory syncytial virus and human
orthopneumovirus dataset. The proposed method achieved a weighted average precision of 97.3
and accuracy of 97.3, the highest among the compared methods.
Figure 34. Accuracy and precision graph of the human respiratory syncytial virus and human orthopneumovirus dataset
Figure 35 shows the accuracy and precision of CNN, decision tree, MLP, RNN, SVM, and the
proposed approach obtained on 300 instances of the breast cancer dataset. The proposed method
achieved a weighted average precision of 97.3 and accuracy of 93.9, the highest among the
compared methods.
Figure 35. Accuracy and precision graph of the breast cancer dataset
Figure 36 shows the accuracy and precision of CNN, decision tree, MLP, RNN, SVM, and the
proposed approach obtained on 300 instances of the lung cancer dataset. The proposed method
achieved a weighted average precision of 97.3 and accuracy of 93.9, the highest among the
compared methods.
Figure 36. Accuracy and precision graph of the lung cancer dataset
Table 37 demonstrates that the proposed method provides significantly better precision and
accuracy than the prior best results: precision has increased by over 5.124%, while accuracy has
increased by about 15.9%. Because the work employs a better sequence representation (the hot-
vector representation) together with a strong feature selection method, the proposed method
shows a large improvement over previous methods.
Table 37. Comparison of best past results with proposed work results
Parameter Best past result Proposed work result Improvement
[Bar chart: average accuracy and average precision (%) of the compared techniques]
The main drawback of this study is the empirical selection of hyper-parameters such as word
size, region size, and the network architecture configuration. Through the findings of multiple
experiments, these hyper-parameters were identified as having a substantial impact on the
model's predictive ability. Consequently, more research on this issue is required.
On datasets that were easy to classify, the suggested technique yielded nearly perfect
classification results, and it could be a reliable tool for facilitating research involving this type of
data. Although advances have been made for histone datasets, performance there remains
inadequate; predictions based on the suggested model should therefore be regarded only as a
starting point for future research on these types of data. Because other sequence data in
bioinformatics, such as amino acid sequences, are likewise sequences of successive letters, the
proposed model might also be applied to these data.
Data availability: The datasets generated during and/or analyzed during the current study are
not publicly available due to security reasons but are available from the corresponding author
on reasonable request.
Declarations:
Competing interests
The authors declare that they have no known competing financial interests or personal
relationships that could have appeared to influence the work reported in this paper.
References
1. S. Shadab, M. T. Alam Khan, N. A. Neezi, S. Adilina, and S. Shatabda. (2020). DeepDBP: deep neural
networks for the identification of DNA-binding proteins. Informatics in Medicine Unlocked. vol. 19, article
100318.
2. Garima Mathur, Anjana Pandey and Sachin Goyal, "A Novel Approach to Compress and Secure Human
Genome Sequence", In: Saroj Hiranwal and Garima Mathur (eds), Artificial Intelligence and Communication
Technologies, SCRS, India, 2022, pp. 305-317. https://doi.org/10.52458/978-81-955020-5-9-31
3. Mathur, G., Pandey, A., & Goyal, S. (2023). Blockchain Solutions, Challenges, and Opportunities for DNA
Classification and Secure Storage for the E-Healthcare Sector: A Useful Review. In A. Tyagi (Ed.), Handbook
of Research on Quantum Computing for Smart Environments (pp. 453-473). IGI Global.
https://doi.org/10.4018/978-1-6684-6697-1.ch024
4. Mathur, G., Pandey, A. & Goyal, S. A review on blockchain for DNA sequence: security issues, application in
DNA classification, challenges and future trends. Multimed Tools Appl 83, 5813–5835 (2024).
https://doi.org/10.1007/s11042-023-15857-1
5. M. Anthimopoulos, S. Christodoulidis, L. Ebner, A. Christe, and S. Mougiakakou, "Lung pattern classification
for interstitial lung diseases using a deep convolutional neural network", IEEE Trans. Med. Imag., vol. 35, no.
5, pp. 1207-1216, May 2016.
6. Z. Yan et al., "Multi-instance deep learning: Discover discriminative local anatomies for body part
recognition", IEEE Trans. Med. Imag., vol. 35, no. 5, pp. 1332-1343, May 2016.
7. W. Shen, M. Zhou, F. Yang, C. Yang, and J. Tian, "Multi-scale convolutional neural networks for lung nodule
classification", Proc. Int. Conf. Inf. Process. Med. Imag., pp. 588-599, 2015.
8. J. Schlemper, J. Caballero, J. V. Hajnal, A. Price and D. Rueckert, "A deep cascade of convolutional neural
networks for MR image reconstruction", Proc. Int. Conf. Inf. Process. Med. Imag., pp. 647-658, 2017.
9. J. Mehta and A. Majumdar, "Rodeo: Robust de-aliasing autoencoder for real-time medical image
reconstruction", Pattern Recognit., vol. 63, pp. 499-510, 2017.
10. M. Havaei et al., "Brain tumor segmentation with deep neural networks", Med. Image Anal., vol. 35, pp. 18-31,
2017.
11. K. Bourzac, "The computer will see you now", Nature, vol. 502, no. 3, pp. S92-S94, 2013.
12. Michalski, R.S., Carbonell, J.G. and Mitchell, T.M. eds., 2013. Machine learning: An artificial intelligence
approach. Springer Science & Business Media
13. Tettamanzi, A.G. and Tomassini, M., 2013. Soft computing: integrating evolutionary, neural, and fuzzy
systems. Springer Science & Business Media.
14. N. A. Kassim and A. Abdullah. (2017). Classification of DNA sequences using convolutional neural network
approach. UTM Computing Proceedings Innovations in Computing Technology and Applications, vol. 2, pp. 1–6.
15. Boser, B. E., I. Guyon, and V. Vapnik (1992). A training algorithm for optimal margin classifiers. In
Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pp. 144–152. ACM Press.
16. V. Vapnik. The Nature of Statistical Learning Theory. NY: Springer Verlag. 1995.
17. Ma, Q., Wang, J. T. L., Shasha, D., and Wu, C. H. (2001). DNA sequence classification via an expectation
maximization algorithm and neural networks: a case study. IEEE Trans. Syst. 31, 468–475. doi:
10.1109/5326.983930.
18. Müller, H. M., and Koonin, S. E. (2003). Vector space classification of DNA sequences. J. Theor. Biol. 223,
161–169. doi: 10.1016/S0022-5193(03)00082-1
19. Levy, S., and Stormo, G. D. (1997). “DNA sequence classification using DAWGs,” in Structures in Logic and
Computer Science, eds J. Mycielski, G. Rozenberg, and A. Salomaa (Berlin: Springer), 339–352. doi:
10.1007/3-540-63246-8_21.
20. Ohno-Machado L, Vinterbo S, Weber G (2002) Classification of gene expression data using fuzzy logic. J Intell
Fuzzy Syst 12(1):19–24
21. J. T. L. Wang, T. G. Marr, D. Shasha, B. A. Shapiro, G. Chirn, and T. Y. Lee, “Complementary classification
approaches for protein sequences,” Protein Eng., vol. 9, no. 5, pp. 381–386, 1996.
22. Yang A, Zhang W, Wang J, Yang K, Han Y, Zhang L. Review on the Application of Machine Learning
Algorithms in the Sequence Data Mining of DNA. Front Bioeng Biotechnol. 2020 Sep 4;8:1032. doi:
10.3389/fbioe.2020.01032. PMID: 33015010; PMCID: PMC7498545.
23. Nguyen, N., Tran, V., Ngo, D., Phan, D., Lumbanraja, F., Faisal, M., Abapihi, B., Kubo, M. and Satou, K.
(2016). DNA Sequence Classification by Convolutional Neural Network. Journal of Biomedical Science and
Engineering, 9, 280-286. doi: 10.4236/jbise.2016.95021.
24. P. Afshar, A. Mohammadi, and K. N. Plataniotis, “Brain tumor type classification via capsule networks,” in
Proc. 25th IEEE Int. Conf. Image Process., 2018, pp. 3129–3133.
25. Szegedy C, Toshev A, Erhan D. Deep neural networks for object detection. In: Burges CJC, Bottou L, Welling
M, Ghahramani Z, Weinberger KQ, eds. Advances in neural information processing systems. Red Hook, NY:
Curran Associates, 2013; 2553–2561.
26. Abadi M, Agarwal A, Barham P, et al. TensorFlow: large-scale machine learning on heterogeneous distributed
systems. Cornell University Library website. http://arxiv.org/abs/1603.04467. Published 2016. Accessed
October 2016.
27. Mathur, G., Pandey, A. & Goyal, S. A comprehensive tool for rapid and accurate prediction of disease using
DNA sequence classifier. J Ambient Intell Human Comput (2022). https://doi.org/10.1007/s12652-022-04099-y.
28. V. Vapnik. The Nature of Statistical Learning Theory. NY: Springer Verlag. 1995.
29. Celebi, M.E., Kingravi, H.A. and Vela, P.A., 2013. A comparative study of efficient initialization methods for
the k-means clustering algorithm. Expert Systems with Applications, 40(1), pp.200-210.
30. Sweety Bakyarani, E., Srimathi, H. and Bagavandas, M. (2019). A Survey of Machine Learning Algorithms in
Health Care. International Journal of Scientific & Technology Research, vol. 8, no. 11, pp. 2288-2292.
31. H. Kim, D. C. Jung, and B. W. Choi, “Exploiting the vulnerability of deep learning-based artificial intelligence
models in medical imaging: Adversarial attacks,” J. Korean Soc. Radiol., vol. 80, no. 2, pp. 259–273, 2019.
32. J. Zhang and E. Bareinboim, “Fairness in decision-making—The causal explanation formula,” in Proc. 32nd
AAAI Conf. Artif. Intell., 2018.