
Disease prediction model: An efficient machine learning-based DNA classifier


GARIMA MATHUR

Rajiv Gandhi Technical University

Research Article

Keywords: DNA sequences, Classifiers, Feature extraction, Healthcare, Machine Learning

Posted Date: August 9th, 2024

DOI: https://doi.org/10.21203/rs.3.rs-4248864/v1

License: This work is licensed under a Creative Commons Attribution 4.0 International License.

Additional Declarations: No competing interests reported.


Disease prediction model: An efficient machine learning-
based DNA classifier
GARIMA MATHUR
Department of Computer Science and Engineering, SISTec Gandhi Nagar, Bhopal
garimamathur@sistec.ac.in

Abstract
In health care, everyone is eager to identify diseases in their early stages, but doing so can be difficult because of the lack of knowledge about the patterns of specific diseases. Since DNA contains most of an organism's genetic blueprint, DNA sequence classification can be used to predict the existence of certain conditions accurately. Several machine-learning techniques are available to classify DNA sequences: traits extracted from known diseases are used to train the model for new, unknown diseases. Expanding patients' access to digital platforms for early disease diagnosis, through knowledge transfer to artificial neural networks, reduces the need for clinical equipment. To analyze the model, DNA samples of four well-known virus groups, human respiratory viruses, breast cancer viruses, lung cancer viruses, and papillomaviruses (HPV), are gathered from GenBank (NCBI). These samples are then compared with five existing methods using eight different parameters, specificity, accuracy, Matthews correlation coefficient, recall, precision, F1-score, area under the receiver operating characteristic (ROC) curve (AUROC), and area under the Precision-Recall (PRC) curve (AUPRC), to facilitate the analysis of the model. The outcome demonstrates that the proposed work provides significantly better precision and accuracy than the prior best results: precision has increased by more than 5.124% and accuracy by about 15.9%.

Keywords: DNA sequences, Classifiers, Feature extraction, Healthcare, Machine Learning.

1. Introduction
Since DNA sequences include all the genetic data about organisms, using them to extract disease patterns can be very advantageous [1]: researchers can utilize this genetic data to develop novel medications to treat or diagnose diseases at an early stage. Numerous machine-learning techniques are available for the classification of DNA sequences. Features are extracted from known diseases to train the model for new, unknown diseases. For this purpose, DNA data for various diseases are collected from NCBI, which has the largest collection of DNA sequences and is also growing exponentially; the year-wise distribution of sequences stored in GenBank (NCBI) from 2013 to 2022 can be seen in Figure 1. In this work, a new approach is proposed in which the data given to traditional machine-learning techniques is first preprocessed to increase the efficiency of classification. Pre-processing of DNA data is done using a matrix named "hot vector", in which each nucleotide is encoded as binary 0s and 1s that represent its position in the FASTA sequence.
[Pie chart: year-wise GenBank sequence counts for 2013-2022]

Figure 1. Year-wise distribution of sequences in GenBank

These calculated hot vectors are then given as input to a traditional convolutional neural network to extract features and identify the pattern of a particular disease. Whenever a newly infected person visits a doctor, their DNA can be matched against the database to identify an unknown disease. Figure 2 displays the total number of publications relating to research on healthcare from 2005 to 2023, and it makes clear how crucial it is to understand the healthcare system and its security.

[Chart: number of publications related to studies on healthcare per year, 2005-2023]

Figure 2. Total number of publications related to studies on healthcare, 2005 to 2023
1.1. Composition of DNA data
Since it is commonly recognized that biological information about an organism or human is
contained in its DNA, understanding DNA sequences is essential for understanding organisms.
The basic structure of the DNA is shown in Figure 3 [2].

Figure 3. The basic structure of DNA

The DNA sequence [3][4] is represented textually in FASTA format, which employs a single letter to represent each DNA base. Any organism's structure can be modeled by the arrangement of its nucleotides. As seen in the example below, the description of the DNA sequence, or identifier, comes first on a FASTA representation line, before the sequence data.

>U02993.1 Human cytochrome P450 (Cyp1A2) gene, 5' region


GGTACCTTGAGAAAGGAACACAACAGGGACTTCTTGGATGCTTATGATGTCC
TTGATTAGAGCTGGTTA

The special symbol ">" separates the identifier from the DNA sequence. The word that immediately follows ">" is the sequence identifier, and the optional remaining text, separated from the identifier by a white space or tab, is the sequence description. The actual DNA sequence starts on the line after the description line. Another ">" symbol marks the beginning of the next line group and a new sequence.
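A minimal sketch of reading this format may help; the `parse_fasta` helper below is hypothetical and not part of the paper:

```python
# Minimal FASTA parser sketch: a record starts with ">" carrying the
# identifier and optional description; the sequence follows on subsequent
# lines until the next ">" or end of file.
def parse_fasta(text):
    records = {}
    identifier = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # the identifier is the first token after ">"; the rest is description
            identifier = line[1:].split()[0]
            records[identifier] = ""
        elif identifier is not None:
            records[identifier] += line
    return records

fasta = """>U02993.1 Human cytochrome P450 (Cyp1A2) gene, 5' region
GGTACCTTGAGAAAGGAACACAACAGGGACTTCTTGGATGCTTATGATGTCC
TTGATTAGAGCTGGTTA"""
seqs = parse_fasta(fasta)
```

After parsing, `seqs["U02993.1"]` holds the full sequence with line breaks removed, ready for the windowing step described later.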

1.2. Machine Learning


Machine learning now significantly impacts our daily lives and is unavoidable. Machine-learning models are well known for their outstanding accomplishments in healthcare, including the classification of lung diseases [5], the identification of body organs through medical images [6], the detection of lung nodules [7], the reconstruction of medical images [8, 9], and the segmentation of brain tumors [10]. They have already made a name for themselves in several industries, including manufacturing and transportation. These powerful tools are now commonly used by radiologists and clinicians to assess patients [11]. One instance of a machine learning and deep learning model used in medicine without any human input is the FDA-cleared method for diagnosing an eye-related condition caused by diabetes, recognized using only eye images.
In machine learning techniques for text classification, previous experiences are utilized to
educate the system to learn something new. In a machine learning-based system, the process of
converting textual data to its numerical value and then showing that value using a vector matrix
is known as feature extraction.
Machine learning is the AI methodology in which algorithms improve with each experience gained from previous ones [12]. Such algorithms extract patterns from the input data and use them to build a model that predicts the values of fresh data. These extracted patterns are useful for forecasting, diagnosis, and treatment. The broad categories of ML approaches are:

• Decision Trees:
A decision tree is a tree-like structure in which each internal node represents a test applied to a predictive feature and each branch represents an attribute value [13]. Leaf nodes represent the class distribution. To classify an unlabeled object, one starts at the root node and moves down the tree according to the object's predictive attribute values. A familiar real-world analogue is a sequence of tests determining whether a number is larger or smaller than other numbers.
• CNN (Convolutional neural network):
A convolutional neural network, a type of deep neural network, uses convolution layers to extract features from the input data. It is modeled on the idea of biological neurons: features from the previous convolution layer are combined into higher-level feature abstractions [14].
• Support Vector Machines:
A support vector machine (SVM) is a supervised machine-learning algorithm, first introduced in 1995, that maximizes the margin between different classes while minimizing the empirical classification error [15]. High-dimensional data can be classified with little difficulty using SVM kernels such as the radial basis function [16]. The data set is usually divided into training and testing subsets; the classifier is fitted on the training set, and its effectiveness is then evaluated on the test set.
The organization of the paper is as follows: section 1 is about the basic overview of DNA
composition and machine learning techniques, section 2 covers the literature survey and the
proposed model is discussed in section 3. Section 4 is about the result analysis and discussion,
and finally, section 5 concludes the paper.
2. Literature review
The early detection of viruses and diseases is now everyone's top concern in the wake of the COVID-19 epidemic, because early detection can help with the development of medicines and save more lives. Figure 4 shows the number of articles released each year, broken down by publication type, for research on DNA classification for disease prediction.

But finding them becomes difficult due to the scarcity of patterns for various ailments. The
classification of DNA sequence, which can be considered as one possible response, is crucial to
computational biology. Table 1 shows some earlier studies in the classification of DNA
sequences:
[Bar chart: number of publications published per year by publication type (Patent, Grey Literature, Conference proceedings, Journal), 2015-2021]
Figure 4. The number of publications published per year, based on publication type

Table 1. Limitations of classification algorithms according to taxonomy

Ref. [17], Expectation-maximization algorithm
Description: The author puts forth an entirely novel approach for identifying E. coli promoters in DNA sequences and determining whether they are present.
Limitation: The sequence recognition system has to be tested more thoroughly.

Ref. [18], Vector-space classification
Description: The author describes a novel technique for identifying intron-exon junctions and a PCA plan for classifying DNA sequences.
Limitation: The strategy is constrained to a plan requiring prior knowledge of the genomic DNA's structure in the form of a training set, i.e., a description of the contents of exons and introns.

Ref. [19], Directed acyclic word graphs
Description: The "words" in the unknown sequence are compared and matched against a dictionary of words, and the sequence is then categorized into three fundamental classes of DNA. Directed acyclic word graphs (DAWGs) can increase labor efficiency even further; with this technique, the failure rate is lowered to 4%, with 94% of test identifications successful.
Limitation: The three-state prediction method and the sequence recognition system need to be tested more thoroughly.

Ref. [20], Microarray technology
Description: The author emphasizes the use of fuzzy logic to properly represent the difficulties surrounding gene classification.
Limitation: The method is constrained to a plan requiring prior knowledge of the genomic DNA's structure in the form of a training set.

Ref. [21], Block-based approach
Description: The author investigates the classification of protein sequences, applying five approaches to related proteins found in the "PROSITE" catalog.
Limitation: The block-based strategy is only appropriate when amino acids are present in blocks; a decent classification result requires combining all of the above methods.

Ref. [22], Machine learning techniques
Description: The author provides a fundamental explanation of DNA together with the use of machine-learning algorithms for DNA sequence mining.
Limitation: Issues were observed during sequence mining.

Ref. [23], CNN technique
Description: Convolution neurons use the features collected from the preceding layer to extract high-level features. The DNA sequence is first converted to an equivalent textual format, and CNN is then applied to it in the same way as to textual data.
Limitation: Preprocessing of the data is required.

Ref. [24], CapsNets
Description: An improved machine-learning architecture for classifying tumors from MRI images is proposed.
Limitation: Not effective for significant input modifications.

Ref. [25][26], Deep learning
Description: Deep learning overcomes the constraints of conventional neural networks and is currently the fastest-growing area of research; older neural networks contain fewer layers, typically less than five, than those used today.
Limitation: Costs must be decreased.

Ref. [27], Modified CNN
Description: This research suggests a strategy for categorizing DNA sequences to reveal the patterns of various diseases; these patterns are then compared with samples from newly infected patients, which can help with early disease diagnosis.
Limitation: Accuracy can still be increased.

Research Gaps
Most of the literature recommends classifying DNA datasets for pattern recognition when using machine-learning approaches to identify diseases. Methods for classifying DNA include the Expectation-Maximization algorithm, vector-space classification, DAWGs, CNNs, and block-based approaches. The drawback of these classification methods is that they are constrained to approaches requiring prior information on the genomic DNA's structure in the form of a training set. Additionally, a newer classification strategy known as "CapsNets" is described in [24], a novel machine-learning architecture that successfully classifies disease from given datasets and eliminates some drawbacks of earlier approaches; however, such methods do not perform well for large input datasets. To overcome CapsNets' limitations, a new classifier is presented that not only works well for large datasets but also increases classification accuracy by preprocessing the data with a hot vector before applying it to a conventional CNN. This method can also be used to train the neural network to recognize a particular disease's pattern.

3. Proposed Model

3.1. Hot vector representation for textual data


Convolutional neural networks have already proven themselves on image data and can now also be applied to textual data. The only difference is that images are 2-D matrices of numerical values, whereas textual data is a 1-D sequence of letters. To use a CNN on textual data, the text must first be converted to an equivalent numerical representation.
A look-up table mapping words to numeric values, known as a "word vector", is maintained. These look-up tables have limitations that are removed in the hot vector representation. An example of a hot vector matrix is shown in Figure 5, with D as a dictionary of words.

D = {I; car; bike; bought; a};  Sentence s = "I bought a car";  Region size = 2

[Matrix illustration: each dictionary word maps to a distinct 5-element 0/1 column; the hot vector matrix stacks these columns for the sentence's consecutive word pairs]

Figure 5. Hot vector representation of words

In the above example, D is a dictionary of 5 words, each with an equivalent numeric value in a 2-D matrix named the hot vector matrix. For any sentence s with a region size of m, a hot vector matrix can be created by referring to the dictionary of words for each pair. Once the sentence is converted into its equivalent 2-D hot vector matrix, it can be given as input to a conventional CNN for classification. The model is named sequential CNN, or the CNN model for sequences.
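As a minimal sketch of this encoding, ignoring the region-size pairing for brevity (the `hot_vector` helper is illustrative, not the paper's implementation):

```python
# One-hot ("hot vector") encoding sketch for the Figure 5 example.
D = ["I", "car", "bike", "bought", "a"]        # dictionary of words
index = {word: i for i, word in enumerate(D)}  # word -> row position

def hot_vector(word):
    """Return a |D|-length 0/1 vector for one word."""
    vec = [0] * len(D)
    vec[index[word]] = 1
    return vec

sentence = "I bought a car".split()
matrix = [hot_vector(w) for w in sentence]     # one row per word
# "I" maps to [1, 0, 0, 0, 0]; "bought" maps to [0, 0, 0, 1, 0]; and so on.
```

Stacking these vectors yields the 2-D matrix that the sequential CNN consumes.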

3.2. The proposed method for DNA Sequence


The FASTA representation of a DNA sequence is a text file with no spaces between letters, so there is no concept of a "word". To apply sequential CNN to DNA data, the given sequence must be converted into words in such a way that each DNA nucleotide holds its original position in the sequence. In the example below, the region size (window size) is 4, and the window slides with a fixed stride of 1. Each time the window reads four continuous letters, they are treated as a word; in the end, a sequence of words is derived from the given DNA sequence. These word sequences can then be given as input to the CNN for classification. Since the FASTA format for DNA uses only four letters, a window size of 4 yields 256 (4^4) possible words in the dictionary, which means every word can be represented by a hot vector of size 256. A 2-dimensional matrix created from these generated words, holding the exact positions of the DNA nucleotides as in the original sequence, can then be given as input to the sequential CNN.
GGCATCTGAGACCAGTGAGAA
(window positions 1, 2, 3, ... slide one letter at a time along the sequence)

Here four consecutive letters are selected after every stride of 1 and added as a word to the destination sequence. One such dictionary of words and its hot vector representation are shown in Figures 6 and 7.

[Dictionary of 256 words: AAAA, AAAC, AAAG, AAAT, ..., TTTT; each word maps to a distinct 256-element 0/1 hot vector]
Figure 6. Dictionary of words

GGCA GCAT CATC ATCT ... (generated sequence of words from the DNA)

[Matrix illustration: each generated word becomes a 256-element 0/1 column; together the columns form the 2-D hot vector matrix]
Figure 7. Hot vector representation of FASTA DNA sequence
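The preprocessing described above can be sketched in Python; `to_words` and `hot_vectors` are hypothetical helper names, not the paper's actual code:

```python
from itertools import product

# Slide a window of size 4 with stride 1 over the FASTA string, then map each
# 4-mer word to a 256-element hot vector via a fixed dictionary.
BASES = "ACGT"
DICTIONARY = ["".join(p) for p in product(BASES, repeat=4)]  # 4^4 = 256 words
POSITION = {word: i for i, word in enumerate(DICTIONARY)}

def to_words(seq, k=4, stride=1):
    """Generate overlapping k-letter words from a DNA string."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

def hot_vectors(seq):
    """Build the 2-D 0/1 matrix (one 256-length row per word)."""
    rows = []
    for word in to_words(seq):
        vec = [0] * len(DICTIONARY)
        vec[POSITION[word]] = 1
        rows.append(vec)
    return rows

words = to_words("GGCATCTGAGACCAGTGAGAA")
# the first generated words are GGCA, GCAT, CATC, ATCT, ...
```

The resulting matrix preserves each nucleotide's original position, which is what the sequential CNN expects as input.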

3.3. Proposed Framework


Detection of diseases and viruses at an early stage can not only prevent outbreaks but also help in designing drugs. However, due to the limited availability of patterns for various diseases, identifying them becomes complicated. Using DNA sequences to extract disease patterns can be very beneficial, because DNA carries all the genetic information about organisms, which researchers can use to identify or treat a disease at an early stage by designing new drugs.

In this work, a framework has been proposed, shown in Figure 8: when a patient reaches a doctor for treatment, their samples are collected and matched against our database. DNA data collected from NCBI are securely stored using blockchain so that they cannot be changed or attacked by an intruder. Researchers with authorized access can use this data for classification purposes.

For the classification of DNA sequences, numerous machine-learning techniques are available. Features are extracted from known diseases to train the model for new, unknown diseases. For this purpose, we have proposed a new approach in which the data given to traditional machine-learning techniques is first preprocessed to increase the efficiency of classification. Pre-processing of DNA data is done using a matrix named "hot vector", in which each nucleotide is encoded as binary 0s and 1s that represent its position in the FASTA sequence. A CNN then extracts features from the given matrices, which are used to train models for various kinds of diseases. These trained models are evaluated on random testing datasets and can be applied to unknown DNA sequences for disease prediction.

Figure 8. Proposed Framework
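As a toy end-to-end sketch of this pipeline, with a nearest-neighbour comparison standing in for the CNN stage and made-up sequences (none of this is the paper's actual code):

```python
# Toy pipeline: k-mer preprocessing followed by a simple nearest-pattern
# classifier that stands in for the trained CNN.
def kmer_counts(seq, k=4):
    """Count overlapping k-mer words in a DNA string."""
    counts = {}
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        counts[word] = counts.get(word, 0) + 1
    return counts

def similarity(a, b):
    # number of shared k-mer occurrences between two count dictionaries
    return sum(min(a.get(w, 0), b.get(w, 0)) for w in a)

def predict(unknown, training):
    """Return the label of the known pattern closest to the unknown sample."""
    feats = kmer_counts(unknown)
    return max(training, key=lambda label: similarity(feats, training[label]))

training = {
    "disease_A": kmer_counts("GGCATCTGAGACCAGTGAGAA"),
    "disease_B": kmer_counts("TTTTACGTACGTTTTTACGT"),
}
label = predict("GGCATCTGAGACC", training)  # shares k-mers with disease_A
```

In the proposed framework, the `predict` step would instead be the trained sequential CNN operating on hot vector matrices.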


4. Result Analysis and Discussions
This section discusses the result analysis of existing and proposed machine-learning techniques for DNA sequence classification. Moreover, the proposed technique is analyzed and compared with existing machine-learning techniques based on several evaluation parameters below.

4.1. Data set


To examine how well the proposed model performs, DNA data from the NCBI GenBank virus category has been collected: specifically, datasets of human respiratory viruses, breast cancer viruses, lung cancer viruses, and papillomaviruses (HPV). Some of these sample data are listed below with their GenBank names, descriptions, and sequence lengths.

4.1.1. Human respiratory syncytial virus Dataset


Infections of the respiratory tract are frequently caused by the respiratory syncytial virus, also known as the human respiratory syncytial virus or human orthopneumovirus. The dataset of human respiratory syncytial virus sequences collected from NCBI is shown in Table 2.

Table 2. Human respiratory syncytial virus dataset

GenBank Name   Description                                                                        Sequence Length
OK500269.1     Human respiratory syncytial virus A isolate Patient_C_D0                           15224
OK500268.1     Human respiratory syncytial virus A isolate Patient_A_D7                           15255
OK500267.1     Human respiratory syncytial virus A isolate Patient_E_D17                          15239
OK500264.1     Human respiratory syncytial virus B isolate Patient_F_D4                           15205
OM857396       Human respiratory syncytial virus B strain hRSV/B/Australia/WM0421262/2020        15217
OP783839.1     Severe acute respiratory syndrome coronavirus 2 isolate
               SARS-CoV-2/human/USA/TX-CDC-QDX42813690/2022                                       29718

4.1.2. Breast Cancer Dataset


Breast cancer develops in the tissue of the breast. Men are much less likely than women to
develop breast cancer. Breast lumps, bloody nipple discharge, and changes in the nipple's or
breast's shape or texture are all indications of breast cancer. The dataset of breast cancer
collected from NCBI is shown in Table 3.
Table 3. Breast cancer dataset

GenBank Name   Description                                                                        Sequence Length
BD292773.1     Human endogenous retrovirus in breast cancer                                       104
BD292765.1     Human endogenous retrovirus in breast cancer                                       461
AF073469.1     HIV-like human cancer virus, isolation source patient 7,
               breast cancer-associated antigen (RAK alpha) gene                                  142
AF073468.1     HIV-like human cancer virus, isolation source patient 6,
               breast cancer-associated antigen (RAK alpha) gene                                  142
AF073467.1     HIV-like human cancer virus, isolation source patient 5,
               breast cancer-associated antigen (RAK alpha) gene                                  142
AF073466.1     HIV-like human cancer virus, isolation source patient 4,
               breast cancer-associated antigen (RAK alpha) gene                                  142

4.1.3. Lung Cancer Dataset


Lung cancer most frequently affects smokers and has its origins in the lungs. The two main kinds
of lung cancer are small cell and non-small cell. Smoking, exposure to secondhand smoke,
certain chemicals, and family history are some of the factors that contribute to lung cancer. The
dataset of lung cancer collected from NCBI is shown in Table 4.
Table 4. Lung cancer dataset

GenBank Name   Description                                                                        Sequence Length
AB610651.2     Merkel cell polyomavirus VP1 gene for major capsid protein, partial cds,
               isolation_source: lung cancer in Japanese patients                                 311
AB610650.1     Merkel cell polyomavirus STAg gene for small T antigen, partial cds,
               large T antigen (LTAg) intron, partial sequence,
               isolation_source: lung cancer in Japanese patients                                 267
MW176106.1     Human lung cancer-associated polyomavirus strain lungcat1, complete genome        5232

4.1.4. Papillomavirus Dataset


The human papillomavirus (HPV) causes a prevalent sexually transmitted infection of the same name, which most sexually active individuals encounter at some point. Nearly 79 million Americans have HPV, and doctors detect approximately 14 million new cases annually. HPV comes in a variety of forms, some of which can increase the risk of cancer. In the United States, cancers caused by HPV affect about 19,400 females and 12,100 males annually. The dataset of papillomavirus sequences collected from NCBI is shown in Table 5.
Table 5. Papillomavirus dataset.

GenBank Name   Description                                                                        Sequence Length
MH254917.1     Human papillomavirus type 16 isolate SL-030 L1-like (L1) gene, partial sequence   75
MH254916.1     Human papillomavirus type 16 isolate SL-029 L1-like (L1) gene, partial sequence   75
MH254915.1     Human papillomavirus type 33 isolate SL-028 L1-like (L1) gene, partial sequence   75
MH254914.1     Human papillomavirus type 16 isolate SL-027 L1-like (L1) gene, partial sequence   75

4.2. Data Pre-processing


For efficient prediction, normalization and transformation procedures are performed on the applied datasets during data pre-processing. This technique helps uncover structure in the data, which in turn improves the accuracy of the model. For the model analysis, 300 DNA samples are used (200 training and 100 testing sequences), with z-score feature normalization and 3 clusters.

When a dataset is normalized with Z-score normalization, every value is rescaled so that the dataset's mean is 0 and its standard deviation is 1. Z-score normalization applies the following formula to each value in the dataset:

New value = (x - μ) / σ   .....(1)

where:
x: the original value
μ: mean of the data
σ: standard deviation of the data
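As an illustrative sketch, equation (1) can be applied to a list of values as follows; the `z_score` helper and the sample lengths are assumptions for the example, not the paper's code:

```python
import statistics

# Z-score normalization (equation 1): new_value = (x - mu) / sigma.
def z_score(values):
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    return [(x - mu) / sigma for x in values]

lengths = [15224, 15255, 15239, 15205, 15217]  # sample sequence lengths
normalized = z_score(lengths)
# the normalized values have mean 0 and standard deviation 1
```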

4.3.Efficiency appraisal metrics


To validate the outcomes, various efficiency-appraisal quantities, including FP, FN, TN, and TP, are used. Different researchers use these evaluation metrics as measuring statistics to obtain their results.

4.3.1. Confusion Matrix (CM)


The confusion matrix is used to find the correlation between the actual and predicted classes. It is also a helpful statistic for estimating AUC and ROC curves, specificity, precision, recall, and accuracy. Table 6 displays the confusion matrix.

Table 6. Confusion Matrix (CM)

                           Actual Positive Class   Actual Negative Class
Predicted Positive Class   TP                      FP
Predicted Negative Class   FN                      TN

4.3.2. Accuracy
The percentage of instances that are correctly classified:
Accuracy = (TP + TN) / (TP + TN + FP + FN)   .....(2)

4.3.3. Error Rate
The percentage of predicted values that are incorrectly classified:
Error rate = 1 - Accuracy   .....(3)

4.3.4. True Positive Rate
This metric measures the proportion of actual positives that are correctly identified:
True positive rate = TP / (TP + FN)   .....(4)

4.3.5. False Positive Rate
The proportion of actual negatives that are incorrectly classified as positive:
False positive rate = FP / (FP + TN)   .....(5)

4.3.6. True Negative Rate
The proportion of actual negatives that are correctly identified:
True negative rate = 1 - FPR   .....(6)

4.3.7. False Negative Rate
The proportion of actual positives that are incorrectly classified as negative:
False negative rate = 1 - TPR   .....(7)

4.3.8. Precision
The proportion of predicted positives that are actually positive:
Precision = TP / (TP + FP)   .....(8)

4.3.9. F-Measure
The harmonic mean of precision and recall:
F-measure = 2 × (precision × recall) / (precision + recall)   .....(9)

4.3.10. Matthews correlation coefficient
It is employed for measuring the quality of binary classifications:
Matthews correlation coefficient = (TN × TP - FN × FP) / √((FN + TP)(FP + TP)(FN + TN)(FP + TN))   .....(10)
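The metrics of equations (2) through (10) can be sketched together; the `metrics` helper and the TP/FP/FN/TN counts below are illustrative assumptions, not the paper's implementation:

```python
import math

# Metrics of equations (2)-(10) computed from one confusion matrix.
def metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)        # eq. (2)
    tpr = tp / (tp + fn)                              # eq. (4), recall
    fpr = fp / (fp + tn)                              # eq. (5)
    precision = tp / (tp + fp)                        # eq. (8)
    f1 = 2 * precision * tpr / (precision + tpr)      # eq. (9)
    mcc = (tn * tp - fn * fp) / math.sqrt(
        (fn + tp) * (fp + tp) * (fn + tn) * (fp + tn))  # eq. (10)
    return {"accuracy": accuracy, "error": 1 - accuracy,  # eq. (3)
            "tpr": tpr, "fpr": fpr,
            "tnr": 1 - fpr,                           # eq. (6)
            "fnr": 1 - tpr,                           # eq. (7)
            "precision": precision, "f1": f1, "mcc": mcc}

# made-up confusion matrix counts for demonstration only
m = metrics(tp=90, fp=10, fn=5, tn=95)
```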

4.3.11. AUC and ROC


AUC, or Area Under the Curve, measures the classifier's ability to distinguish among its classes, whereas the ROC, or receiver operating characteristic, shows the graphical performance of the model at different operating points.

Moreover, the datasets applied for evaluation are discussed in this section. A comparison of five well-known classification techniques has been carried out on random datasets, tested with a Python-based tool that performs feature extraction and classification of DNA sequences in FASTA file format. In this work, the feature descriptor "Kmer" with a size of 3 is used; the K-means method with 3 clusters is used for clustering; the Z-score method is used for feature normalization; and the Chi-square method is used for feature selection, with 100 features selected.

For the analysis, the evaluation parameters calculated and compared include Sn, Sp, Acc, MCC, Pre, F1-score, ROC, AUROC, PRC, and AUPRC, i.e., sensitivity, specificity, accuracy, Matthews correlation coefficient, precision, F1-score, receiver operating characteristic, area under the receiver operating characteristic, Precision-Recall, and area under the Precision-Recall curve [28][29][30][31][32]. The AUROC parameter is evaluated using the ROC curve, whose value ranges between 0 and 1, and the AUPRC parameter is evaluated using the precision-recall curve.

Accuracy for multiclass classification can be defined as:

Accuracy (Acc) = 1 - (false negatives(i) + false positives(i)) / (true positives(i) + true negatives(i))   .....(11)

If a model predicts an exact positive result, it is termed a "true positive"; prediction of an exact negative result is termed a "true negative". Similarly, incorrect predictions are termed false positives and false negatives, respectively. Here "i" represents a sample of the ith class. Based on these parameters, the proposed model has been compared with the other approaches on the classified testing datasets.

4.4. Analysis

4.4.1. Set distribution


The train/test approach is a way to gauge how accurate a model is. The data set is divided into two sets, a training set and a testing set, hence the name train/test: 80% is used for training and 20% for testing. The model is built by training it on the training set and is then tested, i.e., its accuracy is evaluated, on the testing set. The training-set distribution of 200 sequences and the testing-set distribution of 100 sequences are shown via their descriptor values in Figures 9a and 9b.

Figure 9 a) Testing set Figure 9 b) Training set
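The set distribution described above can be sketched in Python; the `train_test_split` helper and the `seq_i` sample names are illustrative stand-ins, not the paper's actual tooling:

```python
import random

# Reproducible train/test split; test_size=100 mirrors the paper's
# 300-sample setup (200 training, 100 testing sequences).
def train_test_split(samples, test_size, seed=42):
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)  # deterministic shuffle
    return shuffled[:-test_size], shuffled[-test_size:]

samples = [f"seq_{i}" for i in range(300)]  # stand-ins for 300 DNA samples
train, test = train_test_split(samples, test_size=100)
```

Fixing the seed makes the split repeatable, so the same training and testing sets can be reused across the compared classifiers.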

4.4.2. Comparison of evaluation metrics
The evaluation metrics of the various classification algorithms on the following datasets are compared below.

a) The dataset of Human respiratory syncytial virus


This section appraises the performance characteristics of CNN, decision tree, MLP, RNN, SVM,
and the proposed work on the human respiratory syncytial virus and human orthopneumovirus
dataset. Tables 7 and 8 depict the performance of the CNN algorithm, which attained an
accuracy of 91.288%, a precision of .913, and a TPR of .912 on the given dataset. Tables 9 and
10 show the decision tree algorithm, which attained an accuracy of 62.5%, a precision of .913,
and a TPR of .625. Tables 11 and 12 show the MLP algorithm, which attained an accuracy of
78%, a precision of .79, and a TPR of .78. Tables 13 and 14 show the RNN algorithm, which
attained an accuracy of 92.176%, a precision of .92, and a TPR of .922. Tables 15 and 16 show
the SVM algorithm, which attained an accuracy of 78%, a precision of .79, and a TPR of .81.
Tables 17 and 18 show the proposed algorithm, which attained the highest accuracy of 97.3%, a
precision of .97, and a TPR of .97 for the given dataset.
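The per-fold rows and "Mean" rows in the tables below come from 5-fold evaluation; a minimal sketch of the fold bookkeeping (helper names are hypothetical, contiguous folds assumed):

```python
def kfold_indices(n, k=5):
    # Split sample indices 0..n-1 into k contiguous, nearly equal folds;
    # each fold serves once as the test set while the others train.
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def mean_metric(per_fold_values):
    # The "Mean" row of a table: plain average of the per-fold values.
    return sum(per_fold_values) / len(per_fold_values)
```

With 300 instances and k=5, each fold holds 60 samples.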

Table 7. CNN Evaluation metrics

Sn Sp Pre Acc MCC F1 AUROC AUPRC
Fold 0 60.0 95.0 92.31 92.31 0.5871 0.7273 0.7575 0.827
Fold 1 60.0 95.0 92.31 92.31 0.5871 0.7273 0.8525 0.8702
Fold 2 45.0 90.0 81.82 81.82 0.3919 0.5806 0.8375 0.836
Fold 3 45.0 90.0 81.82 81.82 0.3919 0.5806 0.8375 0.836
Fold 4 50.0 100.0 100.0 100.0 0.5774 0.6667 0.795 0.8574
Mean 52.0 95.0 91.3 91.288 0.5211 0.6604 0.8015 0.836

Table 8. The dataset of human respiratory syncytial virus and human orthopneumovirus appraised with CNN
Overall Instances 300
Instances which are classified precisely 273 (91.288%)
Instances which are classified wrongly 27 (8.712%)
Accuracy TPR Precision Mcc F-Measure ROC PRC
91.288% 0.912 0.913 0.5211 0.6604 0.8015 0.836
CM
A B
213 8
19 60

Table 9. Decision Tree Evaluation metrics

Sn Sp Pre Acc MCC F1 AUROC AUPRC
Fold 0 50.0 55.0 92.31 52.5 0.0501 0.5128 0.525 0.6382
Fold 1 70.0 65.0 92.31 67.5 0.3504 0.6829 0.675 0.7583
Fold 2 55.0 70.0 81.82 62.5 0.2529 0.5946 0.625 0.711
Fold 3 70.0 60.0 81.82 65.0 0.3015 0.6667 0.65 0.7432
Fold 4 65.0 65.0 100.0 65.0 0.3 0.65 0.65 0.7375
Mean 62.0 63.0 91.288 62.5 0.251 0.6214 0.625 0.7176

Table 10. The dataset of human respiratory syncytial virus and human orthopneumovirus appraised with decision tree
Overall Instances 300
Instances which are classified precisely 187 (62.5%)
Instances which are classified wrongly 113(37.5%)
Accuracy TPR Precision Mcc F-Measure ROC PRC
62.5% 0.625 0.913 0.5211 0.6604 0.8015 0.836
CM
A B
132 33
80 55

Table 11. MLP Evaluation metrics


Sn Sp Pre Acc MCC F1 AUROC AUPRC
Fold 0 70.0 90.0 87.5 80.0 0.6124 0.7778 0.8075 0.8183
Fold 1 80.0 70.0 72.73 75.0 0.5025 0.7619 0.845 0.8608
Fold 2 80.0 95.0 94.12 87.5 0.7586 0.8649 0.8975 0.9232
Fold 3 85.0 55.0 65.38 70.0 0.4193 0.7391 0.755 0.7375
Fold 4 80.0 75.0 76.19 77.5 0.5507 0.7805 0.8225 0.8339
Mean 79.0 77.0 79.184 78.0 0.5687 0.7848 0.8255 0.8347

Table 12. The dataset of human respiratory syncytial virus and human orthopneumovirus appraised with MLP
Overall Instances 300
Instances which are classified precisely 234 (78%)
Instances which are classified wrongly 66(22%)
Accuracy TPR Precision Mcc F-Measure ROC PRC
78% 0.78 0.79 0.5687 0.7848 0.8255 0.8347
CM
a B
220 16
50 14

Table 13. RNN Evaluation metrics

Sn Sp Pre Acc MCC F1 AUROC AUPRC
Fold 0 45.0 95.0 90.0 90.0 0.4619 0.6 0.815 0.8662
Fold 1 60.0 95.0 92.31 92.31 0.5871 0.7273 0.885 0.8647
Fold 2 55.0 85.0 78.57 78.57 0.4193 0.6471 0.8975 0.8864
Fold 3 30.0 100.0 100.0 100.0 0.4201 0.4615 0.785 0.8125
Fold 4 25.0 100.0 100.0 100.0 0.378 0.4 0.7575 0.7998
Mean 43.0 95.0 92.176 92.176 0.4533 0.5672 0.828 0.8459


Table 14. The dataset of human respiratory syncytial virus and human orthopneumovirus appraised with RNN
Overall Instances 300
Instances which are classified precisely 276 (92.176%)
Instances which are classified wrongly 24(7.8%)
Accuracy TPR Precision Mcc F-Measure ROC PRC
92.176% 0.922 0.92 0.4533 0.5672 0.828 0.8459
CM
a b
261 9
15 15

Table 15. SVM Evaluation metrics

Sn Sp Pre Acc MCC F1 AUROC AUPRC
Fold 0 0.0 100.0 87.5 80.0 0.4619 0.6 0.1225 0.324
Fold 1 0.0 100.0 72.73 75.0 0.5871 0.7273 0.115 0.3222
Fold 2 0.0 100.0 94.12 87.5 0.4193 0.6471 0.1 0.3203
Fold 3 0.0 100.0 65.38 70.0 0.4201 0.4615 0.2475 0.3607
Fold 4 0.0 100.0 76.19 77.5 0.378 0.4 0.195 0.3427
Mean 0.0 100.0 79.184 78.0 0.4533 0.5672 0.156 0.334

Table 16. The dataset of human respiratory syncytial virus and human orthopneumovirus appraised with SVM

Overall Instances 300
Instances which are classified precisely 234 (78%)
Instances which are classified wrongly 66(22%)
Accuracy TPR Precision Mcc F-Measure ROC PRC
78% 0.81 0.79 0.4533 0.5672 0.156 0.334
CM
a b
220 16
50 14

Table 17. Proposed Method Evaluation metrics


Sn Sp Pre Acc MCC F1 AUROC AUPRC
Fold 0 60.0 95.0 98.31 98.31 0.5871 0.7273 0.7575 0.827
Fold 1 60.0 95.0 98.31 98.31 0.5871 0.7273 0.8525 0.8702
Fold 2 45.0 90.0 90 90 0.3919 0.5806 0.8375 0.836
Fold 3 45.0 95.0 100.0 100.0 0.4619 0.6 0.765 0.7893
Fold 4 50.0 100.0 100.0 100.0 0.5774 0.6667 0.795 0.8574
Mean 52.0 95.0 97.3 97.3 0.5211 0.6604 0.8015 0.836
Table 18. The dataset of human respiratory syncytial virus and human orthopneumovirus appraised with proposed method
Overall Instances 300
Instances which are classified precisely 291 (97.3%)
Instances which are classified wrongly 9 (2.7%)
Accuracy TPR Precision Mcc F-Measure ROC PRC
97.3% 0.97 0.97 0.5211 0.6606 0.8015 0.836
CM
a b
281 2
7 10

• Precision-Recall (PRC) curve
Graphs of all the discussed classification techniques, drawn between the precision and recall values of the 300 data sets with 5 folds, are shown below.
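Each of these curves is traced by sweeping the decision threshold from the highest score downward and recording (recall, precision) at every step. A minimal sketch (illustrative; binary labels assumed, no tie handling):

```python
def pr_curve(labels, scores):
    # Visit samples from highest score to lowest, updating the running
    # true/false positive counts; each step contributes one curve point.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(labels)
    tp = fp = 0
    points = []
    for i in order:
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        points.append((tp / n_pos, tp / (tp + fp)))  # (recall, precision)
    return points
```

The AUPRC reported in the tables is the area under this curve.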

Figure 10. CNN PRC curve Figure 11. Decision tree PRC curve

Figure 12. RNN PRC curve Figure 13. SVM PRC curve
Figure 14. MLP PRC curve Figure 15. Proposed Method PRC curve

• True positive-false positive rate

Graphs of all the discussed classification techniques, drawn between the true positive rate and false positive rate of the 300 data sets with 5 folds, are shown below.
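The ROC curve behind these plots comes from the same kind of threshold sweep, but records (false positive rate, true positive rate) instead. A short illustrative sketch (binary labels, no tie handling):

```python
def roc_points(labels, scores):
    # Sweep thresholds from high to low; each sample flips from
    # "predicted negative" to "predicted positive" as it is passed.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    pts = [(0.0, 0.0)]
    for i in order:
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        pts.append((fp / n_neg, tp / n_pos))  # (FPR, TPR)
    return pts
```

The AUROC values in the tables are the areas under such curves.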

Figure 16. CNN True- False positive rate Figure 17. Decision tree True- False positive rate

Figure 18. RNN True- False positive rate Figure 19. SVM True- False positive rate
Figure 20. MLP True- False positive rate Figure 21. Proposed Method True- False
positive rate

b) The dataset of Breast Cancer patients


This section appraises the performance characteristics of CNN, decision tree, MLP, RNN, SVM,
and the proposed work on the breast cancer dataset. Tables 19 and 20 depict the performance of
the CNN algorithm, which attained an accuracy of 91.288%, a precision of .88, and a TPR of
.912 on the given dataset. Tables 21 and 22 show the decision tree algorithm, which attained an
accuracy of 62.5%, a precision of .63, and a TPR of .625. Tables 23 and 24 show the MLP
algorithm, which attained an accuracy of 78.1%, a precision of .79, and a TPR of .78. Tables 25
and 26 show the RNN algorithm, which attained an accuracy of 92.0%, a precision of .92, and a
TPR of .922. Tables 27 and 28 show the SVM algorithm, which attained an accuracy of 78%, a
precision of .79, and a TPR of .81. Tables 29 and 30 show the proposed algorithm, which
attained an accuracy of 93.9%, the highest precision of .98, and a TPR of .94 for the given
dataset.

Table 19. CNN Evaluation metrics

Sn Sp Pre Acc MCC F1 AUROC AUPRC
Fold 0 60.0 95.0 91 92.31 0.5871 0.7273 0.7575 0.827
Fold 1 60.0 95.0 92 92.31 0.5871 0.7273 0.8525 0.8702
Fold 2 45.0 90.0 81 81.82 0.3919 0.5806 0.8375 0.836
Fold 3 45.0 90.0 81 81.82 0.3919 0.5806 0.8375 0.836
Fold 4 50.0 100.0 95 100.0 0.5774 0.6667 0.795 0.8574
Mean 52.0 95.0 88 91.288 0.5211 0.6604 0.8015 0.836
Table 20. The dataset of Breast cancer appraised with CNN
Overall Instances 300
Instances which are classified precisely 273 (91.288%)
Instances which are classified wrongly 27 (8.712%)
Accuracy TPR Precision Mcc F-Measure ROC PRC
91.288% 0.912 0.88 0.5211 0.6604 0.8015 0.836
CM
A b
213 8
19 60

Table 21. Decision Tree Evaluation metrics


Sn Sp Pre Acc MCC F1 AUROC AUPRC
Fold 0 50.0 55.0 52.63 52.5 0.5 0.5128 0.525 0.6382
Fold 1 70.0 65.0 66.67 67.5 0.3504 0.6829 0.675 0.7583
Fold 2 55.0 70.0 64.71 62.5 0.2529 0.5946 0.625 0.711
Fold 3 70.0 60.0 63.64 65.0 0.3015 0.6667 0.65 0.7432
Fold 4 65.0 65.0 65.0 65.0 0.3 0.65 0.65 0.7375
Mean 62.0 63.0 62.53 62.5 0.34 0.6214 0.625 0.7176

Table 22. The dataset of Breast cancer appraised with decision tree

Overall Instances 300


Instances which are classified precisely 187 (62.5%)
Instances which are classified wrongly 113(37.5%)
Accuracy TPR Precision Mcc F-Measure ROC PRC
62.5% 0.625 0.63 0.34 0.6604 0.8015 0.836
CM
A b
132 33
80 55

Table 23. MLP Evaluation metrics


Sn Sp Pre Acc MCC F1 AUROC AUPRC
Fold 0 70.0 90.0 87.5 80.1 0.6124 0.7778 0.8075 0.8183
Fold 1 80.0 70.0 72.73 75.0 0.5025 0.7619 0.845 0.8608
Fold 2 80.0 95.0 94.12 87.5 0.7586 0.8649 0.8975 0.9232
Fold 3 85.0 70.0 65.38 70.0 0.4193 0.7391 0.755 0.7375
Fold 4 80.0 75.0 76.19 77.5 0.5507 0.7805 0.8225 0.8339
Mean 79.0 80.0 79.184 78.1 0.5687 0.7848 0.8255 0.8347


Table 24. The dataset of Breast cancer appraised with MLP

Overall Instances 300


Instances which are classified precisely 234 (78.1%)
Instances which are classified wrongly 66(22%)
Accuracy TPR Precision Mcc F-Measure ROC PRC
78.1% 0.78 0.79 0.5687 0.7848 0.8255 0.8347
CM
A b
220 16
50 14

Table 25. RNN Evaluation metrics

Sn Sp Pre Acc MCC F1 AUROC AUPRC
Fold 0 95.0 95.0 90.0 90.0 0.4619 0.6 0.815 0.8662
Fold 1 95.0 95.0 92.31 92.0 0.5871 0.7273 0.885 0.8647
Fold 2 85.0 85.0 78.57 78.0 0.4193 0.6471 0.8975 0.8864
Fold 3 100.0 100.0 100.0 100.0 0.4201 0.4615 0.785 0.8125
Fold 4 100.0 100.0 100.0 100.0 0.378 0.4 0.7575 0.7998
Mean 95.0 95.0 92.176 92.0 0.4533 0.5672 0.828 0.8459

Table 26. The dataset of Breast cancer appraised with RNN

Overall Instances 300


Instances which are classified precisely 276 (92%)
Instances which are classified wrongly 24(7.8%)
Accuracy TPR Precision Mcc F-Measure ROC PRC
92% 0.922 0.92 0.4533 0.5672 0.828 0.8459
CM
A b
261 9
15 15

Table 27. SVM Evaluation metrics

Sn Sp Pre Acc MCC F1 AUROC AUPRC
Fold 0 0.0 100.0 87.5 87.0 0.4619 0.6 0.1225 0.324
Fold 1 0.0 100.0 72.73 72.0 0.5871 0.7273 0.115 0.3222
Fold 2 0.0 100.0 94.12 94.0 0.4193 0.6471 0.1 0.3203
Fold 3 0.0 100.0 65.38 67 0.4201 0.4615 0.2475 0.3607
Fold 4 0.0 100.0 76.19 76.0 0.378 0.4 0.195 0.3427
Mean 0.0 100.0 79.184 79.2 0.4533 0.5672 0.156 0.334

Table 28. The dataset of Breast cancer appraised with SVM


Overall Instances 300
Instances which are classified precisely 237 (79%)
Instances which are classified wrongly 63(21%)
Accuracy TPR Precision Mcc F-Measure ROC PRC

78% 0.81 0.79 0.4533 0.5672 0.156 0.334


CM
A b
221 16
47 16
Table 29. Proposed Method Evaluation metrics

Sn Sp Pre Acc MCC F1 AUROC AUPRC
Fold 0 95.0 95.0 98.31 93.5 0.68 0.92 0.7575 0.827
Fold 1 95.0 95.0 98.31 92.5 0.75 0.92 0.8525 0.8702
Fold 2 90.0 90.0 90 97.5 0.80 0.8 0.8375 0.836
Fold 3 95.0 95.0 100.0 91.0 0.72 0.8 0.765 0.7893
Fold 4 100.0 100.0 100.0 95.0 0.69 0.85 0.795 0.8574
Mean 95.0 95.0 97.3 93.9 0.728 0.86 0.8015 0.836

Table 30. The dataset of Breast cancer appraised with proposed method

Overall Instances 300


Instances which are classified precisely 282 (93.9%)
Instances which are classified wrongly 18 (6.1%)
Accuracy TPR Precision Mcc F-Measure ROC PRC
93.9% 0.939 0.98 0.728 0.86 0.8015 0.836
CM
A B
270 6
12 12

• Precision-Recall (PRC) curve
Graphs of all the discussed classification techniques, drawn between the precision and recall values of the 300 data sets with 5 folds, are shown below.

Figure 22. CNN PRC curve Figure 23. Decision tree PRC curve
Figure 24. RNN PRC curve Figure 25. SVM PRC curve

Figure 26. MLP PRC curve Figure 27. Proposed Method PRC curve

• True positive-false positive rate

Graphs of all the discussed classification techniques, drawn between the true positive rate and false positive rate of the 300 data sets with 5 folds, are shown below.

Figure 28. CNN True- False positive rate Figure 29. Decision tree True- False positive rate
Figure 30. RNN True- False positive rate Figure 31. SVM True- False positive rate

Figure 32. MLP True- False positive rate Figure 33. Proposed Method True- False
positive rate
• Accuracy and precision
Figure 34 shows the accuracy and precision of CNN, decision tree, MLP, RNN, SVM, and the
proposed approach obtained on 300 instances of the human respiratory syncytial virus and
human orthopneumovirus dataset. The proposed method achieved a weighted average precision
and accuracy of 97.3 and 97.3, respectively, the highest among the compared methods.


Figure 34. Accuracy and precision graph of the human respiratory syncytial virus and human orthopneumovirus dataset
Figure 35 shows the accuracy and precision of CNN, decision tree, MLP, RNN, SVM, and the
proposed approach obtained on 300 instances of the breast cancer dataset. The proposed method
achieved a weighted average precision and accuracy of 97.3 and 93.9, respectively, the highest
among the compared methods.


Figure 35. Accuracy and precision graph of breast cancer dataset

Figure 36 shows the accuracy and precision of CNN, decision tree, MLP, RNN, SVM, and the
proposed approach obtained on 300 instances of the lung cancer dataset. The proposed method
achieved a weighted average precision and accuracy of 97.3 and 93.9, respectively, the highest
among the compared methods.


Figure 36. Accuracy and precision graph of Lung cancer dataset

The accompanying Table 37 demonstrates that the proposed method delivers significantly better
precision and accuracy than the prior best results: precision improved by 5.124 percentage
points, and accuracy improved by about 15.9 percentage points. The proposed method improves
markedly on previous methods because it employs a better sequence representation (one-hot
vector representation) together with the best-performing feature selection method.

Table 37. Comparison of best past results with proposed work results
Parameter Best past result Proposed work result Improvement
Precision 92.176 97.3 5.124
Accuracy 78.0 93.9 15.9


The outcomes demonstrate the value of the features extracted by the proposed strategy for
classifying a sequence into its true category.


Figure 37. Accuracy comparison of classification techniques


In this study, a convolutional neural network is used for its strong ability to model complex
problems. Additionally, sequences are represented using one-hot vectors, which preserves the
precise positional details of each nucleotide in the sequences. The algorithm produced almost
flawless classification results on easy-to-classify datasets and could be a dependable tool to
facilitate research involving these kinds of data. Although there have been improvements for
histone datasets, the performance is still poor; predictions from the proposed model should
therefore be considered only a starting point for research on these types of data. Because other
sequence data in bioinformatics, such as amino acid sequences, are likewise sequences of
successive letters, the model might also be applied to those data.
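As a concrete illustration of the one-hot representation described above (a sketch only; the actual channel ordering of A/C/G/T used in the paper may differ):

```python
def one_hot(seq):
    # Each nucleotide maps to a 4-bit indicator vector, so the encoding
    # preserves the exact position of every base in the sequence.
    table = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0],
             "G": [0, 0, 1, 0], "T": [0, 0, 0, 1]}
    return [table[base] for base in seq]
```

A length-L sequence becomes an L x 4 matrix, which is a natural input for a convolutional layer.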

This study's main drawback is the empirical selection of hyper-parameters such as word size,
region size, and network architecture configuration. Through multiple experiments, these
hyper-parameters were found to have a substantial impact on the model's predictive ability;
consequently, more research on this issue is required.

5. Conclusion and Future work


Since most of the genetic instructions needed for an organism's growth, survival, and
reproduction are found in its DNA, DNA sequence classification offers a precise way to identify
when genetic disorders occur. Various machine-learning approaches are available for classifying
DNA sequences. To train the model for new, unknown diseases, features are extracted from
recognized diseases; a newly afflicted person's ailment can then be identified by comparing it
against these derived patterns. Thanks to knowledge transfer to artificial neural networks,
patients now have more access to digital platforms for early disease diagnosis without the need
for clinical equipment. For analysis, the proposed technique was compared to the existing
classifiers using seven distinct assessment metrics. Applied to different datasets, the proposed
technique achieved the highest accuracy and precision among the compared classifiers. It
provides significantly better precision and accuracy than the prior best results: precision
increased by more than 5.124%, and accuracy increased by about 15.9%.

With datasets that were easy to classify, the suggested technique yielded nearly perfect
classification results and could be a reliable tool for facilitating research involving this type
of data. Although advances have been made for histone datasets, performance remains
inadequate; predictions based on the suggested model should therefore be regarded simply as
a starting point for future research on these types of data. Because other sequence data in
bioinformatics, such as amino acid sequences, are likewise sequences of successive letters,
the proposed model might be applied to these data.

Data availability: The datasets generated during and/or analyzed during the current study are
not publicly available due to security reasons but are available from the corresponding author
on reasonable request.

Declarations:
Competing interests
The authors declare that they have no known competing financial interests or personal
relationships that could have appeared to influence the work reported in this paper.

References
1. S. Shadab, M. T. Alam Khan, N. A. Neezi, S. Adilina, and S. Shatabda. (2020). DeepDBP: deep neural
networks for the identification of DNA-binding proteins. Informatics in Medicine Unlocked. vol. 19, article
100318.
2. Garima Mathur, Anjana Pandey and Sachin Goyal, "A Novel Approach to Compress and Secure Human
Genome Sequence", In: Saroj Hiranwal and Garima Mathur (eds), Artificial Intelligence and Communication
Technologies, SCRS, India, 2022, pp. 305-317. https://doi.org/10.52458/978-81-955020-5-9-31
3. Mathur, G., Pandey, A., & Goyal, S. (2023). Blockchain Solutions, Challenges, and Opportunities for DNA
Classification and Secure Storage for the E-Healthcare Sector: A Useful Review. In A. Tyagi (Ed.), Handbook
of Research on Quantum Computing for Smart Environments (pp. 453-473). IGI Global.
https://doi.org/10.4018/978-1-6684-6697-1.ch024
4. Mathur, G., Pandey, A. & Goyal, S. A review on blockchain for DNA sequence: security issues, application in
DNA classification, challenges and future trends. Multimed Tools Appl 83, 5813–5835 (2024).
https://doi.org/10.1007/s11042-023-15857-1
5. M. Anthimopoulos, S. Christodoulidis, L. Ebner, A. Christe, and S. Mougiakakou, "Lung pattern classification
for interstitial lung diseases using a deep convolutional neural network", IEEE Trans. Med. Imag., vol. 35, no.
5, pp. 1207-1216, May 2016.
6. Z. Yan et al., "Multi-instance deep learning: Discover discriminative local anatomies for body part
recognition", IEEE Trans. Med. Imag., vol. 35, no. 5, pp. 1332-1343, May 2016.
7. W. Shen, M. Zhou, F. Yang, C. Yang, and J. Tian, "Multi-scale convolutional neural networks for lung nodule
classification", Proc. Int. Conf. Inf. Process. Med. Imag., pp. 588-599, 2015.
8. J. Schlemper, J. Caballero, J. V. Hajnal, A. Price and D. Rueckert, "A deep cascade of convolutional neural
networks for MR image reconstruction", Proc. Int. Conf. Inf. Process. Med. Imag., pp. 647-658, 2017.
9. J. Mehta and A. Majumdar, "Rodeo: Robust de-aliasing autoencoder for real-time medical image
reconstruction", Pattern Recognit., vol. 63, pp. 499-510, 2017.
10. M. Havaei et al., "Brain tumor segmentation with deep neural networks", Med. Image Anal., vol. 35, pp. 18-31,
2017.
11. K. Bourzac, "The computer will see you now", Nature, vol. 502, no. 3, pp. S92-S94, 2013.
12. Michalski, R.S., Carbonell, J.G. and Mitchell, T.M. eds., 2013. Machine learning: An artificial intelligence
approach. Springer Science & Business Media
13. Tettamanzi, A.G. and Tomassini, M., 2013. Soft computing: integrating evolutionary, neural, and fuzzy
systems. Springer Science & Business Media.
14. N. A. Kassim and A. Abdullah. (2017). Classification of DNA sequences using convolutional neural network
approach. UTM Computing Proceedings Innovations in Computing Technology and Applications. vol. 2, pp. 1–6.
15. Boser, B. E., I. Guyon, and V. Vapnik (1992). A training algorithm for optimal margin Classifiers. In
Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pp.144 -152. ACM Press. 1992.
16. V. Vapnik. The Nature of Statistical Learning Theory. NY: Springer Verlag. 1995.
17. Ma, Q., Wang, J. T. L., Shasha, D., and Wu, C. H. (2001). DNA sequence classification via an expectation
maximization algorithm and neural networks: a case study. IEEE Trans. Syst. 31, 468–475. doi:
10.1109/5326.983930.
18. Müller, H. M., and Koonin, S. E. (2003). Vector space classification of DNA sequences. J. Theor. Biol. 223,
161–169. doi: 10.1016/S0022-5193(03)00082-1
19. Levy, S., and Stormo, G. D. (1997). “DNA sequence classification using DAWGs,” in Structures in Logic and
Computer Science, eds J. Mycielski, G. Rozenberg, and A. Salomaa (Berlin: Springer), 339–352. doi:
10.1007/3-540-63246-8_21.
20. Ohno-Machado L, Vinterbo S, Weber G (2002) Classification of gene expression data using fuzzy logic. J Intell
Fuzzy Syst 12(1):19–24
21. J. T. L. Wang, T. G. Marr, D. Shasha, B. A. Shapiro, G. Chirn, and T. Y. Lee, “Complementary classification
approaches for protein sequences,” Protein Eng., vol. 9, no. 5, pp. 381–386, 1996.
22. Yang A, Zhang W, Wang J, Yang K, Han Y, Zhang L. Review on the Application of Machine Learning
Algorithms in the Sequence Data Mining of DNA. Front Bioeng Biotechnol. 2020 Sep 4;8:1032. doi:
10.3389/fbioe.2020.01032. PMID: 33015010; PMCID: PMC7498545.
23. Nguyen, N. , Tran, V. , Ngo, D. , Phan, D. , Lumbanraja, F. , Faisal, M. , Abapihi, B. , Kubo, M. and Satou, K.
(2016) DNA Sequence Classification by Convolutional Neural Network. Journal of Biomedical Science and
Engineering, 9, 280-286. doi: 10.4236/jbise.2016.95021.
24. P. Afshar, A. Mohammadi, and K. N. Plataniotis, “Brain tumor type classification via capsule networks,” in
Proc. 25th IEEE Int. Conf. Image Process., 2018, pp. 3129–3133.
25. Szegedy C, Toshev A, Erhan D. Deep neural networks for object detection. In: Burges CJC, Bottou L, Welling
M, Ghahramani Z, Weinberger KQ, eds. Advances in neural information processing systems. Red Hook, NY:
Curran Associates, 2013; 2553–2561.
26. Abadi M, Agarwal A, Barham P, et al. TensorFlow: large-scale machine learning on heterogeneous distributed
systems. Cornell University Library website. http://arxiv.org/abs/1603.04467. Published 2016. Accessed
October 2016.
27. Mathur, G., Pandey, A. & Goyal, S. A comprehensive tool for rapid and accurate prediction of disease using
DNA sequence classifier. J Ambient Intell Human Comput (2022). https://doi.org/10.1007/s12652-022-04099-y.
28. V. Vapnik. The Nature of Statistical Learning Theory. NY: Springer Verlag. 1995.
29. Celebi, M.E., Kingravi, H.A. and Vela, P.A., 2013. A comparative study of efficient initialization methods for
the k-means clustering algorithm. Expert Systems with Applications, 40(1), pp.200-210.
30. Sweety Bakyarani, E., Srimathi, H., and Bagavandas, M. (November 2019). A survey of machine learning
algorithms in health care. International Journal of Scientific & Technology Research, vol. 8, issue 11, pp. 2288–2292.
31. H. Kim, D. C. Jung, and B. W. Choi, “Exploiting the vulnerability of deep learning-based artificial intelligence
models in medical imaging: Adversarial attacks,” J. Korean Soc. Radiol., vol. 80, no. 2, pp. 259–273, 2019.
32. J. Zhang and E. Bareinboim, “Fairness in decision-making—The causal explanation formula,” in Proc. 32nd
AAAI Conf. Artif. Intell., 2018.
