CSE35 Project Report
SATHYABAMA
INSTITUTE OF SCIENCE AND TECHNOLOGY
(DEEMED TO BE UNIVERSITY)
Accredited with Grade “A” by NAAC
(Established under Section 3 of UGC Act, 1956)
JEPPIAAR NAGAR, RAJIV GANDHI SALAI, CHENNAI-600119
www.sathyabama.ac.in
BONAFIDE CERTIFICATE
This is to certify that this Project Report is the bonafide work of AVULA
VENKATA SRIANDH REDDY (REG. NO. 38110061), BODDU PRASANTH
REDDY (REG. NO. 38110422) who carried out the project entitled “Detection of
Attacks (DoS, Probe) Using Genetic Algorithm” under our supervision from Nov
2021 to April 2022.
Internal Guide
Dr. L. SUJIHELEN, M.E., Ph.D.,
DECLARATION
DATE:
PLACE: CHENNAI SIGNATURE OF THE CANDIDATE
ACKNOWLEDGEMENT
ABSTRACT
TABLE OF CONTENT
Network Intrusion Detection System
2.4 Intrusion Preventing System using Intrusion Detection System Decision Tree Data Mining   24
3 Experimental Methods and Algorithms Used   29
3.1 Machine Learning Scope   29
3.1.1 Supervised Machine Learning   29
3.1.2 Unsupervised Machine Learning   30
3.2 Decision Tree   30
List of Figures
Figure no.   Title
3.2.1 Decision Tree
3.3.1 Genetic Algorithm steps
3.3.2 Genetic Algorithm Application
3.4.1 Block diagram
3.4.2 Flow Diagram
4.2.1 Module Flow Chart
4.2.2 Data Collection
4.2.3 Training Set
4.2.4 Validation
4.2.5 Prediction
4.3.1 Intrusion detection Application
4.3.2 Training dataset
4.3.3 Test All attacks dataset
4.3.4 All attacks Plot Graph
4.3.5 Test Normal Attack Dataset
4.3.6 Normal Attack Plot Graph
List of Tables
Table no.   Title
CHAPTER 1
INTRODUCTION
Approaches for intrusion detection can be broadly divided into two types: misuse
detection and anomaly detection. In a misuse detection system, all known types of
attacks (intrusions) can be detected by looking for predefined intrusion patterns
in system audit traffic. In anomaly detection, the system first learns a normal
activity profile and then flags all system events that do not match the already
established profile. The main advantage of misuse detection is its capability for a
high detection rate, with the drawback of difficulty in finding new or unforeseen attacks. The
advantage of anomaly detection lies in its ability to identify novel (or unforeseen)
attacks, at the expense of a high false positive rate. Network monitoring-based machine
learning techniques have been applied in diverse fields. Using bi-directional long-
short-term-memory neural networks, a social media network monitoring system is
proposed for analysing and detecting traffic accidents.
1.1 BACKGROUND
The proposed method retrieves traffic-related information from social media
(Facebook and Twitter) using query-based crawling: this process collects sentences
related to any traffic events, such as jams, road closures, etc. Subsequently, several
pre-processing techniques are carried out, such as stemming, tokenization, POS
tagging and segmentation, in order to transform the retrieved data into structured
form. Thereafter, the data are automatically labelled as ’traffic’ or ’non-traffic’, using a
latent Dirichlet allocation (LDA) algorithm. Traffic-labelled data are analysed into
three types: positive, negative, and neutral. The output from this stage is a sentence
labelled according to whether it is traffic or non-traffic, and with the polarity of that
traffic sentence (positive, negative or neutral). Then, using the bag-of-words (BoW)
technique, each sentence is transformed into a one-hot encoding representation in
order to feed it to the Bi-directional LSTM neural network (Bi-LSTM). After the
learning process, the neural networks perform multi-class classification using the
softmax layer in order to classify the sentence in terms of location, traffic event and
polarity types. The proposed method compares different classical machine learning
and advanced deep learning approaches in terms of accuracy, F-score and other
criteria.
1.4 OBJECTIVE
The primary purposes for an IDS deployment are to reduce risk, identify error,
optimize network use, provide insight into threat levels, and change user behavior.
Thus, an IDS provides more than just the detection of intrusions.
Multi-objective Genetic Algorithms (GAs) are developed specifically for problems with
multiple objectives. They differ from traditional GAs primarily by using specialized
fitness functions and by introducing methods to promote solution diversity.
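To make the GA loop concrete, the following minimal sketch (a simplified illustration, not the exact design used in this project) evolves a population of bit-strings; in an intrusion detection setting, each bit-string could encode a candidate detection rule or feature subset, and fitness() is a placeholder to be replaced by a score such as detection rate minus false-alarm rate on labelled traffic.

    import random

    CHROMOSOME_LEN, POP_SIZE, GENERATIONS = 41, 50, 100  # illustrative values
    MUTATION_RATE = 0.01

    def fitness(chromosome):
        # Placeholder fitness: replace with, e.g., detection rate minus
        # false-alarm rate of the encoded rule on labelled training traffic.
        return sum(chromosome)

    def tournament(population, k=3):
        # Tournament selection: the fittest of k random individuals wins.
        return max(random.sample(population, k), key=fitness)

    def crossover(a, b):
        point = random.randint(1, CHROMOSOME_LEN - 1)  # one-point crossover
        return a[:point] + b[point:]

    def mutate(chromosome):
        # Flip each bit with a small probability.
        return [bit ^ 1 if random.random() < MUTATION_RATE else bit
                for bit in chromosome]

    population = [[random.randint(0, 1) for _ in range(CHROMOSOME_LEN)]
                  for _ in range(POP_SIZE)]
    for _ in range(GENERATIONS):
        population = [mutate(crossover(tournament(population),
                                       tournament(population)))
                      for _ in range(POP_SIZE)]
    best = max(population, key=fitness)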
The goal of the decision tree algorithm is to create a model that predicts the value of a
target variable. The decision tree uses a tree representation to solve the problem, in
which each leaf node corresponds to a class label and attributes are represented on
the internal nodes of the tree.
CHAPTER 2
AIM AND SCOPE OF THE PRESENT INVESTIGATION
The introduction of network intrusion detection systems relies on the increasing
demand for improved performance on different
types of network. However, software defined network (SDN) implementation of the
network-based intrusion detection system (NIDS) has opened a frontier for its
deployment, considering the increasing scope and typology of security risks of
modern networks. The rapid growth in the volume of network data and connected
devices carries inherent security risks. The adoption of technologies such as the
Internet of Things (IoT), artificial intelligence (AI), and quantum computing, has
increased the threat level, making network security challenging and necessitating a
new paradigm in its implementation. Various attacks have overwhelmed previous
approaches (classified into signature-based intrusion detection systems and anomaly-
based intrusion detection systems), increasing the need for advanced, adaptable and
resilient security implementation. For this reason, the traditional network design
platform is being transformed into the evolving SDN implementation. Monitoring data
and analysing it over time are essential to the process of predicting future events,
such as risks, attacks and diseases. The more details that are discovered and
documented through analysing very large-scale data, the more resources are saved
and the more stable the working environment remains. Big data
analytics (BDA) research in the supply chain has become a key protective tool for
managing and preventing risks. BDA for humanitarian supply chains can aid the
donors in their decision of what is appropriate in situations such as disasters, where it
can improve the response and minimize human suffering and deaths. BDA and data
monitoring using machine learning can help in identifying and understanding the
interrelationships between the reasons, difficulties, obstacles and barriers that guide
organizations in taking the most efficient and accurate decisions in risk management
processes. This could impact entire organizations and countries, producing a hugely
significant improvement in the process. Network monitoring-based machine learning
techniques have been involved in diverse fields. Using bi-directional long-short-term-
memory neural networks, a social media network monitoring system is proposed for
analysing and detecting traffic accidents. The proposed method retrieves traffic-
related information from social media (Facebook and Twitter) using query-based
crawling: this process collects sentences related to any traffic events, such as jams,
road closures, etc. Subsequently, several pre-processing techniques are carried out,
such as stemming, tokenization, POS tagging and segmentation, in order to transform
the retrieved data into structured form. Thereafter, the data are automatically labelled
as ’traffic’ or ’non-traffic’, using a latent Dirichlet allocation (LDA) algorithm. Traffic-
labelled data are analysed into three types: positive, negative, and neutral. The output
from this stage is a sentence labelled according to whether it is traffic or non-traffic,
and with the polarity of that traffic sentence (positive, negative or neutral). Then, using
the bag-of-words (BoW) technique, each sentence is transformed into a one-hot
encoding representation in order to feed it to the Bidirectional LSTM neural network
(Bi-LSTM). After the learning process, the neural networks perform multi-class
classification using the softmax layer in order to classify the sentence in terms of
location, traffic event and polarity types. The proposed method compares different
classical machine learning and advanced deep learning approaches in terms of
accuracy, F-score and other criteria. Many initiatives and workshops have been
conducted in order to improve and develop healthcare systems using machine
learning, such as [12]. In these workshops, several machine learning algorithms
have been used, such as K-Nearest-Neighbours, logistic regression, K-means
clustering, Random Forest (RF) etc., together with deep learning algorithms such as
CNN, RNN, fully connected layer and auto-encoder. These varieties of techniques
allow the researchers to deal with several data types, such as medical imaging,
history, medical notes, video data, etc. Therefore, different topics and applications are
introduced, with significant performance results, such as causal inference in
investigations of Covid-19, and disease prediction for disorders and heart diseases.
Using intelligent ensemble deep learning methods, healthcare monitoring is carried
out for prediction of heart diseases. Real-time health status monitoring can prevent
and predict any heart attacks before occurrence. For disease prediction, the proposed
ensemble deep learning approach achieved a brilliant accuracy performance score of
98.5%. The proposed model takes two types of data that are transferred and saved
on an online cloud database. The first is the data transferred from the sensors; these
sensors have been placed in different places on the body in order to extract more
than 10 different types of medical data. The second type is the daily electronic
medical records from doctors, which includes various types of data, such as smoking
history, family diseases, etc. The features are fused using the feature fusion
Framingham Risk factors technique, which executes two tasks at a time, fusing the
data together, and then extracting a fused and informative feature from this data.
Then different pre-processing techniques are used to transform the data into a
structured and well-prepared form, such as normalization, missing values filtering and
feature weighting. Subsequently, an ensemble deep learning algorithm starts which
learns from the data in order to predict whether a heart disease will occur or the threat
is absent. IDS refers to a mechanism capable of identifying or detecting intrusive
activities. In a broader view, this encompasses all the processes used in the
discovery of unauthorized uses of network devices or computers. This is achieved
through software designed specifically to detect unusual or abnormal activities. IDS
can be classified according to several surveys and sources in the literature into four
types (HIDS, NIDS, WIDS, NBA). NIDS is an inline or passive-based intrusion
detection technique. The scope of its detection targets network and host levels. The
only architecture that fits and works with NIDS is the managed network. The
advantage of using NIDS is that it costs less and is quicker in response, since there is
no need to maintain sensor programming at the host level. The performance of
monitoring the traffic is close to real-time; NIDS can detect attacks as they occur.
However, it has the following limitations. It does not indicate whether such attacks are
successful or not; it has restricted visibility inside the host machine. There is also no
effective way to analyse encrypted network traffic to detect the type of attack.
Moreover, NIDS may have difficulty capturing all packets in a large or busy network.
Thus, it may fail to recognize an attack launched during a period of high traffic. SDN
provides a novel means of network implementation, stimulating the development of a
new type of network security application. It adopts the concept of programmable
networks through the deployment of logically centralized management. The network
deployment and configuration are virtualized to simplify complex processes, such as
orchestration, network optimization, and traffic engineering. It creates a scalable
architecture that allows sufficient and reliable services based on certain types of
traffic. The global view approach to a network enhances flow-level control of the
underlying layers. Implementing NIDS over SDN becomes a major effective security
defence mechanism for detecting network attacks from the network entry point. NIDS
has been implemented and investigated for decades to achieve optimal efficiency. It
represents an application or device for monitoring network traffic for suspicious or
malicious activity with policy violations. Such activities include malware attacks,
untrustworthy users, security breaches, and DDoS. NIDS focuses on identifying
anomalous network traffic or behaviour; its efficiency means that network anomaly is
adequately implemented as part of the security implementation. Since it is nearly
impossible to prevent threats and attacks, NIDS will ensure early detection and
mitigation. However, the advancement in NIDS has not instilled sufficient confidence
among practitioners, since most solutions still use less capable, signature-based
techniques. This study aims to increase the focus on several points:
Choosing the right algorithm for the right tasks depends on the data types, size and
network behaviour and needs.
Implementing the optimized development process by preparing and selecting the
benchmark dataset in order to build a promising system in NIDS.
Analysing the data, finding, shaping, and engineering the important features, using
several pre-processing techniques by stacking them together with an intelligent order
to find the best accuracy with the lowest amount of data representation and size.
Proposing an integrated and complete development process, using those algorithms
and techniques, from the selection of the dataset to the evaluation of the algorithms
using different metrics, which can be extended to other NIDS applications.
Integrating machine learning algorithms into SDN has attracted significant attention.
In, a solution was proposed that solved the issues in KDD Cup 99 by performing an
extensive experimental study, using the NSL-KDD dataset to achieve the best
accuracy in intrusion detection. The experimental study was conducted on five
popular and efficient machine learning algorithms (RF, J48, SVM, CART, and Naïve
Bayes). The correlation feature selection algorithm was used to reduce the complexity
of features, resulting in 13 features only in the NSL-KDD dataset. This study tests the
NSL-KDD dataset’s performance for real-world anomaly detection in network
behaviour. Five classic machine learning models (RF, J48, SVM, CART, and Naïve
Bayes) were trained on all 41 features against the five classes (normal and the four
attack types DoS, Probe, U2R, and R2L) to achieve average accuracies of 97.7%, 83%, 94%, 85%, and
70% for each algorithm, respectively. The same models were trained again using the
reduced 13 features to achieve average accuracies of 98%, 85%, 95%, 86%, and
73% for each model. In, a deep neural network model was proposed to find and
detect intrusions in the SDN. The NSL-KDD dataset was used to train and test the
model. The neural network was constructed with five primary layers, one input layer
with six inputs, three hidden layers with (12, 6, 3) neurons, and one output layer with
two output dimensions. The proposed method was trained on six features chosen from 41
features in the NSL-KDD dataset, which are basic and traffic features that can easily
be obtained from the SDN environment. The proposed method calculates the
accuracy, precision and recall, achieving an F1-score of 0.75. A second evaluation
was conducted on seven classic machine learning models (RF, NB, NB Tree, J48,
DT, MLP, and SVM) proposed in and the model achieved sixth place out of eight. The
same author extended the approach using a gated recurrent unit neural network
(GRU-RNN) for SDN anomaly detection, achieving accuracy up to 89%. In addition,
the min-max normalization technique is used for feature scaling to improve and boost
the learning process. The SVM classifier, integrated with the principal component
analysis (PCA) algorithm, was used for an intrusion detection application. The NSL-
KDD dataset is used in this approach to train and optimize the model for detecting
abnormal patterns. A Min-Max normalization technique was proposed to solve the
diversity data scale ranges with the lowest misclassification errors. The PCA
algorithm is selected as a statistical technique to reduce the NSL-KDD dataset’s
complexity, reducing the number of trainable parameters that needed to be learned.
The nonlinear radial basis function kernel was chosen for SVM optimization.
Detection rate (DR), false alarm rate (FAR), and correlation coefficient metrics were
chosen to evaluate the proposed model, with an overall average accuracy of 95%
using 31 features in the dataset. In [32], an extreme gradient-boosting (XGBoost)
classifier was used to distinguish between two attacks, i.e., normal and DoS. The
detection method was analysed and conducted over POX SDN, as a controller, which
is an SDN open-source platform for prototyping and developing a technique based on
SDN. Mininet was used to emulate the network topology to simulate real-time SDN-
based cloud detection. Logistic regression was selected as a learning algorithm, with
a regularization term penalty to prevent overfitting. The XGBoost term was added and
combined with the logistic regression algorithm to boost the computations by
constructing structure trees. The dataset used in this approach was KDD Cup 1999,
while 400 K samples were selected for constructing the training set. Two types of
normalization techniques were used; one with a logarithmic-based technique and one
with a Min-Max-based technique. The average overall accuracy for XGBoost,
compared to RF and SVM, was 98%, 96%, and 97% respectively. Based on DDoS
attack characteristics, a detection system was simulated with the Mininet and FL
floodlight platform using the SVM algorithm [5]. The proposed method categorizes the
characteristics into six tuples, which are calculated from the packet network. These
characteristics are the speed of the source IP (SSIP), the speed of the source port,
the standard deviation of FL flow packets, the deviation of FL flow bytes (SDFB), the
speed of flow entries, and the ratio of pair-FL flow. Based on the calculated statistics
from the SVM classifier’s six characteristics, the current network state is classified as normal or
attack. Attack flow (AF), DR, and FAR were chosen as evaluation metrics, achieving an average accuracy
of 95%. In TSDL, a model with two stages of deep neural networks was designed and
proposed for NIDS, using a stacked auto-encoder, integrated with softmax in the
output layer as a classifier. TSDL was designed and implemented for multi-class
classification of attack detection. Down-sampling and other pre-processing
techniques were performed over different datasets in order to improve the detection
rate, as well as the monitoring efficiency. The detection accuracy for UNSW-NB15
was 89.134%. Different models of neural networks, such as the variational auto-encoder,
seq2seq structures using Long-Short-term-Memory (LSTM) and fully connected
networks were proposed in [34] for NIDS. The proposed approach was designed and
implemented to differentiate between normal and attack packets in the network, using
several datasets, such as NSL-KDD, UNSW NB15, KYOTO-HONEYPOT, and
MAWILAB. A variety of pre-processing techniques have been used, such as one-hot-
encoding, normalization, etc., for data preparation, feature manipulation and selection
and smooth training in neural networks. Those factors are designed mainly, but not
only, to enable the neural networks to learn complex features from different scopes of
a single packet. Using 4 hidden layers, a deep neural network model [35] was
illustrated and implemented on KDD Cup 99 for monitoring intrusion attacks. Feature
scaling and encoding were used for data pre-processing and lower data usage. More
than 50 features were used to perform this task on different datasets. Therefore,
complex hardware GPUs were used in order to handle this huge number of features
with lower training time. A supervised [36] adversarial auto-encoder neural network
was proposed for NIDS. It combined GANs and a variational auto-encoder. A GAN
consists of two different neural networks competing with each other, known as the
generator and the discriminator. The result of the competition is to minimize the
objective function as much as possible, using the Jensen Shannon minimization
algorithm. The generator tries to generate fake data packets, while the discriminator
determines whether this data is real or fake; in other words, it checks if that packet is
an attack or normal. In addition, the proposed method integrates the regularization
penalty with the model structure for overfitting control behaviour. The results were
reasonable in the detection rate of U2R and R2L, but lower in others. Multi-channel
deep learning of features for NIDS was presented in [37], using AE involving CNN,
two fully connected layers and the output to the softmax classifier. The evaluation is
done over three different datasets; KDD cup99, UNSWNB15 and CICIDS, with an
average accuracy of 94%. The proposed model provides effective results; however,
the structure and the characteristics of the attack were not highlighted clearly. The
proposed method enhances the implementation of NIDS by deploying machine
learning over SDN. It introduces a machine learning algorithm for network monitoring
within the NIDS implementation on the central controller of the SDN. In this paper,
enhanced tree-based machine learning algorithms are proposed for anomaly
detection. Using only five features, a multi-class classification task is conducted by
detecting whether there is an attack or not and classifying the type of attack.
In this section, we discuss and explain each component and its role in the NIDS
architecture. As shown, the SDN architecture can be divided into three main layers,
as follows:
System Architecture Layers: the NIDS component architecture is constructed in three
main parts, as follows:
The infrastructure layer consists of two main parts: hardware and software
components. The hardware components are devices such as routers and switches.
The software components are those components that interface with the hardware,
such as Open Flow switches.
The control layer is an intelligent network controller, such as an SDN controller. The
control layer is the layer responsible for regulating actions and traffic data
management by establishing or denying every network flow.
The application layer is the one that performs all network management tasks. These
tasks can be performed using an SDN controller and NIDS.
Attacks are created by an attacker and delivered through the internet. NIDS is
deployed over the SDN controller. As NIDS listens to the network and actively
compares all traffic against predefined attack signatures, it detects the attacker’s
scanning attempts. It sends an alert to administrators through its control, and the
connections will be blocked due to specific rules in the firewall or routers.
This section presents a generalized flowchart of the proposed method. The dataset,
pre-processing techniques, and proposed machine learning algorithms will be
presented and discussed.
In this subsection, a generalized block diagram is presented and discussed. As
shown, the NSL-KDD dataset is used. Data analysis, feature engineering, and other
pre-processing techniques are conducted to train the model, using the best hyper-
parameters, with only five features. Tree-based algorithms are used for the multi-
class classification task. The processed data enter the algorithm and are classified as
to whether they constitute an attack or are normal; then, the type of attack will be
analysed to see which category it belongs to, and action is taken accordingly.
The KDD Cup is the leading data mining competition in the world. The NSL-KDD
dataset was proposed to solve many issues represented in the KDD Cup 1999
dataset. Many researchers have used the NSL-KDD dataset to develop and evaluate
the NIDS problem. The dataset includes all types of attacks. The dataset has 41
features, categorized into three main types (basic feature, content-based, and traffic-
based features) and labelled as either normal or attack, with the attack type precisely
categorized. The categories can be classified into four main groups, with a brief
description of each attack type and its impact.
As stated in the previous subsection, the dataset has 41 features labelled as either
normal or attack with the precise attack category. After experimental trials, five
features were selected out of the 41 features in the NSL-KDD dataset, which have the
most impact and effect on algorithm learning performance. The five selected
features are presented with a brief description.
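The report selected these features through experimental trials; as one hypothetical illustration of how such a ranking could be automated, the sketch below scores the 41 features with a random forest's importances (the file name, column name, and the use of importances rather than this report's trials are assumptions).

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    # Hypothetical: rank the 41 NSL-KDD features and keep the five most important.
    df = pd.read_csv("KDDTrain+.csv")            # assumed pre-encoded numeric data
    X, y = df.drop(columns=["label"]), df["label"]

    rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    importances = pd.Series(rf.feature_importances_, index=X.columns)
    print(importances.sort_values(ascending=False).head(5))  # candidate subset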
To evaluate the performance of the NIDS in terms of accuracy (AC), different metrics
were used: precision (P), recall (R), and F-measure (F). These metrics can be
calculated using confusion matrix parameters: true positive (the number of anomalous
instances that are correctly classified); false positive (the number of normal instances
that are incorrectly classified as anomalous); true negative (the number of normal
instances that are correctly classified); and false negative (the number of anomalous
instances that are incorrectly classified as normal). A good NIDS must achieve a high
DR and a low FAR. Accuracy (AC) is the percentage of correctly classified network
activities. Precision (P) is the percentage of predicted anomalous instances that are
actual anomalous instances; the higher the P, the lower the FAR. Recall (R) is the percentage
of predicted attack instances versus all attack instances presented. F-measure (F)
measures the performance of the NIDS using the harmonic mean of P and R. We
aim to achieve a high F-score. We compare XGBoost against the other two tree-
based methods, RF and DT, using the test set, which includes the four types of
attacks discussed above. Three evaluation metrics are computed: F-score,
precision and recall. XGBoost ranked first in the evaluation, with an F1-score of
95.55%, while RF and DT achieved 94.6% and 94.5%, respectively. For precision,
XGBoost outperformed RF and DT with a score of 92%, while RF and DT scored 90%
and 90.2% respectively. Finally, for Recall, our proposed method with XGBoost
proves its stability with a score of 98% while for RF and DT, the results were 82%,
and 85%, respectively. From these results, the proposed model with XGBoost
performs with both high precision and high recall: the classifier returns
accurate results (high precision) while, at the same time, returning the majority of all
positive instances (when there is an attack, the classifier detects it), which
means high recall.
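The comparison above can be reproduced along the lines of the following sketch; the stand-in data and hyper-parameters are assumptions, to be replaced by the pre-processed five-feature NSL-KDD splits.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier
    from xgboost import XGBClassifier
    from sklearn.metrics import precision_score, recall_score, f1_score

    # Stand-in data with five features; replace with the NSL-KDD splits.
    rng = np.random.default_rng(0)
    X = rng.random((1000, 5))
    y = rng.integers(0, 5, 1000)      # five classes: normal + four attack types
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    models = {
        "DT": DecisionTreeClassifier(random_state=0),
        "RF": RandomForestClassifier(n_estimators=100, random_state=0),
        "XGBoost": XGBClassifier(),
    }
    for name, model in models.items():
        y_pred = model.fit(X_train, y_train).predict(X_test)
        print(name,
              "P=%.3f" % precision_score(y_test, y_pred, average="weighted"),
              "R=%.3f" % recall_score(y_test, y_pred, average="weighted"),
              "F1=%.3f" % f1_score(y_test, y_pred, average="weighted"))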
Finally, we evaluate the proposed method using an accuracy analysis against seven
classical machine learning algorithms, in addition to the deep neural network. The
proposed method achieves an accuracy of 95.55%, while the second-best accuracy
performance is 82.02% for the NB Tree, showing a significant difference between the
accuracy of our proposed method and the other approaches. This evaluation confirms
that the proposed method is accurate and robust, even compared against other
algorithms. It shows that the steps in our approach are reliable and
effective. We conclude that the proposed method achieves a
verifiable result using several techniques. For the precise literature and comparison,
we carefully chose the NSL-KDD data set, which is considered one of the most
powerful benchmark datasets. Several procedures of data statistics, cleaning and
verification are performed on the dataset, which are very important in order to
produce a smooth learning process with no obstacles, such as over- or under-fitting
issues. This stage ensures that the proposed model has unified data and increases
the value of data, which helps in decision-making. Feature normalization and
selection clarifies the path for clear selection and intelligent preferences, using only 5
features. Subsequently, more detailed exploration and various comparisons are
carried out, based on three machine learning algorithms, i.e., DT, RF, and XGBoost,
in order to test their performance with different criteria and then select the best
performing algorithm for our task. This shows that the selection is dependably proven
and technically verified.
NIDS based on machine learning algorithms in SDN has attracted significant attention in
the last two decades because of the datasets and various algorithms proposed in
machine learning, using only limited features for better detection of anomalies
and more efficient network security. In this study, the benchmarking dataset NSL-
KDD is used for training and testing. Feature normalization, feature selection and
data pre-processing techniques are used in order to improve and optimize the
algorithm’s performance for accurate prediction, as well as to facilitate a smooth
training process with minimal time and resources. To select the appropriate algorithm,
we compare three classical tree-based machine learning algorithms; Random Forest,
Decision Trees and XGBoost. We examine them using a variety of evaluation metrics
to find the disadvantages and advantages of using one or more. Using six different
evaluation metrics, the proposed XGBoost model outperformed more than seven
algorithms used in NIDS. The proposed method focused on detecting anomalies and
protecting the SDN platform from attacks in real-time scenarios. The proposed
methods performed two tasks simultaneously: to detect whether there is an attack or not,
and to determine the type of attack (DoS, Probe, U2R, R2L). In future studies, more
evaluation metrics will be carried out. We plan to implement the approach using
several deep neural network algorithms, such as Auto-Encoder, Generative
Adversarial Networks, and Recurrent neural networks, such as GRU and LSTM.
These techniques have been proven in the literature to allow convenient anomaly
detection approaches in NIDS applications. Also, we plan to compare these
algorithms against each other and integrate one or more neural network architectures
to extract more details of how we can implement an efficient anomaly detection
system in NIDS, with lower consumption of time and resources. In addition, for a
more solid basis for comparison, experiments on several benchmarking cyber security
datasets, such as NSL-KDD, UNSW-NB15, and CIC-IDS2017, will be conducted, in order to make
sure that the selection of the proposed algorithm is not biased in any situation. These
various datasets are generated in different environments and conditions, so more
complex features will be available, more generalized attacks will be covered and the
accuracy of the proposed algorithm will significantly increase, which could lead to a
state-of-the-art approach.
NIDS. Immense efforts are required to produce such a labelled dataset from the raw
network traffic traces collected over a period or in real-time. Additionally, to preserve
the confidentiality of the internal organizational network structure as well as the
privacy of various users, network administrators are reluctant towards reporting any
intrusion that might have occurred in their networks. Various machine learning
techniques have been used to develop ADNIDSs, such as Artificial Neural Networks
(ANN), Support Vector Machines (SVM), Naive Bayesian (NB), Random Forests
(RF), and Self-Organized Maps (SOM). The NIDSs are developed as classifiers to
differentiate the normal traffic from the anomalous traffic. Many NIDSs perform a
feature selection task to extract a subset of relevant features from the traffic dataset
to enhance classification results. Feature selection helps in the elimination of the
possibility of incorrect training through the removal of redundant features and noises.
Recently, deep learning based methods have been successfully applied in audio,
image, and speech processing applications. These methods aim to learn a good
feature representation from a large amount of unlabelled data and subsequently apply
these learned features on a limited amount of labelled data in a supervised
classification. The labelled and unlabelled data may come from different distributions.
However, they must be relevant to each other. It is envisioned that the deep learning
based approaches can help to overcome the challenges of developing an efficient
NIDS. We can collect unlabelled network traffic data from different network sources,
and a good feature representation can be obtained from these datasets using deep
learning techniques. These features can then be applied for supervised
classification to a small, but labelled traffic dataset consisting of normal as well as
anomalous traffic records. The traffic data for labelled dataset can be collected in a
confined, isolated and private network environment. With this motivation, we use self-
taught learning, a deep learning technique based on sparse auto encoder and soft-
max regression, to develop an NIDS. We verify the usability of the self-taught learning
based NIDS by applying it on the NSL-KDD intrusion dataset, an improved version of the
benchmark dataset for various NIDS evaluations - KDD Cup 99. We provide a
comparison of our current work with other techniques as well. Towards this end, our
paper is organized into four sections. In Section 2, we discuss a few closely related
work. Section 3 presents an overview of self-taught learning and the NSL-KDD
dataset. We discuss our results and comparative analysis in Section 4 and finally
conclude our paper with future work direction.
This section presents various recent accomplishments in this area. It should be noted
that we only discuss the works that have used the NSL-KDD dataset for their
performance benchmarking. Therefore, any dataset referred to from this point forward
should be considered as NSL-KDD. This approach allows a more accurate
comparison of our work with others found in the literature. Another limitation is the use of
training data for both training and testing by most works. Finally, we discuss a few
deep learning based approaches that have been tried so far for similar kinds of work.
One of the earliest works found in the literature used ANN with enhanced resilient back-
propagation for the design of such an IDS. This work used only the training dataset
for training (70%), validation (15%) and testing (15%). As expected, use of unlabelled
data for testing resulted in a reduction of performance. A more recent work used J48
decision tree classifier with 10-fold cross-validation for testing on the training dataset.
This work used a reduced feature set of 22 features instead of the full set of 41
features. A similar work evaluated various popular supervised tree-based classifiers
and found that Random Tree model performed best with the highest degree of
accuracy along with a reduced false alarm rate. Many 2-level classification
approaches have also been proposed. One such work used Discriminative
Multinomial Naive Bayes (DMNB) as a base classifier and Nominal-to-Binary
supervised filtering at the second level along with 10-fold cross validation for testing.
This work was further extended to use Ensembles of Balanced Nested Dichotomies
(END) at the first level and Random Forest at the second level. As expected, this
enhancement resulted in an improved detection rate and a lower false positive rate.
Another 2-level implementation used principal component analysis (PCA) for
feature set reduction and then SVM (using a Radial Basis Function kernel) for final
classification, resulting in a high detection accuracy with only the training dataset and
the full 41-feature set. A reduction of the feature set to 23 resulted in even better detection
accuracy in some of the attack classes, but the overall performance was reduced.
The authors improved their work by using information gain to rank the features and
then a behavior-based feature selection to reduce the feature set to 20. This resulted
in an improvement in reported accuracy using the training dataset. The second
category to look at, used both the training and test dataset. An initial attempt in this
category used fuzzy classification with genetic algorithm and resulted in a detection
accuracy of 80%+ with a low false positive rate. Another important work used
unsupervised clustering algorithms and found that the performance using only the
training data was reduced drastically when test data was also used. A similar
implementation using the k-point algorithm resulted in a slightly better detection
accuracy and lower false positive rate, using both training and test datasets. Another
less popular technique, OPF (optimum path forest) which uses graph partitioning for
feature classification, was found to demonstrate a high detection accuracy within one-
third of the time compared to the SVM-RBF method. We observed a deep learning
approach with Deep Belief Network (DBN) as a feature selector and SVM as a
classifier in. This approach resulted in an accuracy of 92.84% when applied on
training data. Our current work can be directly compared to this work, since our
approach enhances it and uses both the training and test datasets.
A similar, though semi-supervised, learning approach has been used in.
The authors used real world trace for training and evaluated their approach on real-
world and KDD Cup 99 traces. Our approach is different from them in the sense that
we use NSL-KDD dataset to find deep learning applicability in NIDS implementation.
Moreover, the feature learning task is completely unsupervised and based on sparse
auto encoder in our approach. We recently observed a sparse auto encoder based
deep learning approach for network traffic identification in. The authors performed
TCP based unknown protocols identification in their work instead of network intrusion
detection.
Self-taught Learning (STL) is a deep learning approach that consists of two stages for
the classification. First, a good feature representation is learnt from a large collection
of unlabelled data, Xu, termed as Unsupervised Feature Learning (UFL). In the
second stage, this learnt representation is applied to labelled data, XL, and used for
the classification task. Although the unlabelled and labelled data may come from
different distributions, there must be relevance among them. Figure 1 shows the
architecture diagram of STL. There are different approaches used for UFL, such as
Sparse Auto encoder, Restricted Boltzmann Machine (RBM), K-Means Clustering,
and Gaussian Mixtures. We use sparse auto encoder based feature learning for our
work due to its relatively easier implementation and good performance. A sparse auto
encoder is a neural network consisting of an input layer, a hidden layer, and an output
layer. The input and output layers contain $N$ nodes, and the hidden layer contains $K$
nodes. The target values at the output layer are set equal to the input values, i.e.,
$\hat{x}_i = x_i$. The sparse auto encoder network finds the optimal values for the weight
matrices, $W \in \mathbb{R}^{K \times N}$ and $V \in \mathbb{R}^{N \times K}$, and the bias vectors, $b_1 \in \mathbb{R}^{K \times 1}$ and $b_2 \in \mathbb{R}^{N \times 1}$,
using the back-propagation algorithm, while trying to learn an approximation of the
identity function, i.e., an output $\hat{x}$ similar to $x$ [18]. The sigmoid function,
$g(z) = \frac{1}{1 + e^{-z}}$, is used for the activation $h_{W,b}$ of the nodes in the hidden and output
layers:

$$h_{W,b}(x) = g(Wx + b) \quad (1)$$

The cost function to be minimized in the sparse auto encoder using back-propagation
is represented by Eq. (2):

$$J = \frac{1}{2m} \sum_{i=1}^{m} \lVert x_i - \hat{x}_i \rVert^2 + \frac{\lambda}{2} \left( \sum_{k,n} W_{k,n}^2 + \sum_{n,k} V_{n,k}^2 + \sum_{k} b_{1,k}^2 + \sum_{n} b_{2,n}^2 \right) + \beta \sum_{j=1}^{K} \mathrm{KL}(\rho \,\|\, \hat{\rho}_j) \quad (2)$$

The first term is the average of the sum-of-squares error over all $m$ input data. The
second term is a weight decay term, with $\lambda$ as the weight decay parameter, to avoid
over-fitting in training. The last term is a sparsity penalty term that constrains the
hidden layer to maintain low average activation values, expressed as the
Kullback-Leibler (KL) divergence shown in Eq. (3):

$$\mathrm{KL}(\rho \,\|\, \hat{\rho}_j) = \rho \log \frac{\rho}{\hat{\rho}_j} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_j} \quad (3)$$

where $\rho$ is a sparsity constraint parameter ranging from 0 to 1 and $\beta$ controls the
sparsity penalty term. $\mathrm{KL}(\rho \,\|\, \hat{\rho}_j)$ attains its minimum when $\rho = \hat{\rho}_j$, where $\hat{\rho}_j$
denotes the average activation of hidden unit $j$ over all training inputs $x$. Once we
learn the optimal values of $W$ and $b_1$ by applying the sparse auto encoder to the
unlabelled data, $X_u$, we evaluate the feature representation $a = h_{W,b_1}(X_L)$ for the
labelled data, $(X_L, y)$. We use this new feature representation, $a$, together with the
label vector, $y$, for the classification task in the second stage, using soft-max
regression.
As discussed earlier, we used the NSL-KDD dataset in our work. The dataset is an
improved and reduced version of the KDD Cup 99 dataset. The KDD Cup dataset
was prepared using the network traffic captured by 1998 DARPA IDS evaluation
program. The network traffic includes normal and different kinds of attack traffic, such
as DoS, Probing, user-to-root (U2R), and remote-to-local (R2L). The network traffic for
training was collected for seven weeks followed by two weeks of traffic collection for
testing in raw TCP dump format. The test data contains many attacks that were not
injected during the training data collection phase to make the intrusion detection task
realistic. It is believed that most of the novel attacks can be derived from the known
attacks. Finally, the training and test data were processed into the datasets of five
million and two million TCP/IP connection records, respectively. The KDD Cup
dataset has been widely used as a benchmark dataset for many years in the
evaluation of NIDS. One of the major drawbacks of the dataset is that it contains an
enormous amount of redundant records both in the training and test data. It was
observed that almost 78% and 75% records are redundant in the training and test
dataset, respectively. This redundancy makes the learning algorithms biased towards
the frequent attack records and leads to poor classification results for the infrequent,
but harmful records. The training and test data were classified with the minimum
accuracy of 98% and 86% respectively using a very simple machine learning
algorithm. This made the comparison task difficult for various IDSs based on different
learning algorithms. NSL-KDD was proposed to overcome the limitation of KDD Cup
dataset. The dataset is derived from the KDD Cup dataset. It improved the previous
dataset in two ways. First, it eliminated all the redundant records from the training and
test data. Second, it partitioned all the records in the KDD Cup dataset into various
difficulty levels based on the number of learning algorithms that can correctly classify
the records. Further, it selected the records by random sampling of the distinct
records from different difficulty levels in a fraction that is inversely proportional to their
fractions in the distinct records. The multi-step processing of the KDD Cup dataset made
the total record statistics reasonable in the NSL-KDD dataset. Moreover, these
enhancements made the evaluation of various machine learning techniques realistic.
Each record in the NSL-KDD dataset consists of 41 features and is labelled with
either normal or a particular kind of attack. These features include basic features
derived directly from a TCP/IP connection, traffic features accumulated in a window
interval, either time, e.g. two seconds, or a number of connections, and content
features extracted from the application layer data of connections. Out of 41 features,
three are nominal, four are binary, and the remaining 34 are continuous. The training data
contains 23 traffic classes that include 22 classes of attack and one normal class. The
test data contains 38 traffic classes that include 21 attack classes from the training
data, 16 novel attacks, and one normal class. These attacks are also grouped into
four categories based on the purpose they serve. These categories are DoS, Probing,
U2R, and R2L. Table-1 shows the statistics of records for the training and test data
for normal and different attack classes.
As discussed in the previous section, the dataset contains different kinds of attributes
with different values. We pre-process the dataset before applying self-taught learning
on it. Nominal attributes are converted into discrete attributes using 1-to-n encoding.
In addition, there is one attribute, num_outbound_cmds, in the dataset whose value is
always 0 for all the records in the training and test data. We eliminated this attribute
from the dataset. The total number of attributes becomes 121 after performing the
steps mentioned above. The values in the output layer during the feature learning
phase are computed by the sigmoid function, which gives values between 0 and 1. Since
the output layer values are identical to the input layer values in this phase, this requires
normalization of the values at the input layer to the range [0, 1]. To obtain this, we
perform max-min normalization on the new attribute list. With the new attributes, we
use the NSL-KDD training data without labels for feature learning using sparse auto
encoder for the first stage of self-taught learning. In the second stage, we apply the
newly learned features representation on the training data itself for the classification
using soft-max regression. In our implementation, both the unlabelled and labelled
data for feature learning and classifier training come from the same source, i.e., NSL-
KDD training data.
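These pre-processing steps can be sketched as follows in Python (the file name is an assumption; protocol_type, service and flag are the three nominal NSL-KDD attributes, and num_outbound_cmds is the constant attribute mentioned above).

    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler

    df = pd.read_csv("KDDTrain+.csv")                    # assumed file name
    df = df.drop(columns=["num_outbound_cmds"])          # always 0, so removed
    df = pd.get_dummies(df, columns=["protocol_type", "service", "flag"])  # 1-to-n
    features = df.drop(columns=["label"])
    X = MinMaxScaler().fit_transform(features)           # max-min scaling to [0, 1]
    print(X.shape)                                       # 121 attributes, per the text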
We implemented the NIDS for three different types of classification: a) normal and
anomaly (2-class), b) normal and four different attack categories (5-class), and c)
normal and 22 different attacks (23-class). We evaluated classification accuracy
for all types. However, precision, recall, and F-measure values are evaluated only in the
case of 2-class and 5-class classification. We computed the weighted values of
these metrics in the case of 5-class classification.
As discussed, there are two approaches applied for the evaluation of NIDSs. In the
most widely used approach, the training data is used for both training and testing
either using n-fold cross-validation or splitting the training data into training, cross-
validation, and test sets. NIDSs based on this approach achieve very high accuracy
and low false-alarm rates. The second approach uses the training and test data
separately for training and testing. Since the training and test data were collected
in different environments, the accuracy obtained using the second approach is not as
high as in the first approach. Therefore, we emphasize the results of the second
approach in our work for an accurate evaluation of the NIDS. However, we present the
results of the first approach as well for completeness. We describe our NIDS
implementation before discussing the results.
We proposed a deep learning based approach for developing an efficient and flexible
NIDS. A sparse auto encoder and soft-max regression based NIDS was
implemented. We used the benchmark network intrusion dataset - NSL-KDD to
evaluate anomaly detection accuracy. We observed that the proposed NIDS
performed very well compared to previously implemented NIDSs for the
normal/anomaly detection when evaluated on the test data. The performance can be
further enhanced by applying techniques such as Stacked Auto encoder, an
extension of sparse auto encoder in deep belief nets, for unsupervised feature
learning, and NB-Tree, Random Tree, or J48 for further classification. It was noted
that the latter techniques performed well when applied directly on the dataset. In
future, we plan to implement a real-time NIDS for actual networks using deep learning
techniques. Additionally, on-the-go feature learning on raw network traffic headers
instead of derived features can be another high-impact research direction in this area.
mining. The method used to generate the rules is classification with the ID3 decision
tree algorithm, which is efficient and optimized for building the filtering rules in the firewall.
Log files:
Log files can give an idea about what the different parts of the system are doing. Logs
can show what is going right and what is going wrong. Log files can provide a useful
profile of activity. From a security standpoint, it is crucial to be able to distinguish normal
activity from the activity of someone attempting to attack a server or network. Log files
are useful for three reasons:
Log files help with troubleshooting system problems and understanding what is
happening on the system
Logs serve as an early warning for both system and security events
Logs can be indispensable in reconstructing events, whether determining that an intrusion
has occurred, performing the follow-up forensic investigation, or just profiling
normal activity
A decision tree is a technique in the classification branch of data mining for learning
patterns from data and using these patterns for classification. Decision trees are
structures used to classify data with common attributes; each decision tree
represents a rule, which categorizes data according to these attributes.
Each internal (non-leaf) node denotes a test on an attribute, each branch
represents an outcome of the test, and each leaf (terminal) node holds a class
label. The topmost node in a tree is the root node. A decision tree classifier is one of
the most widely used supervised learning methods for data exploration. It is
easy to interpret, can be represented as if-then-else rules, and works
well on noisy data. A decision tree aids in data exploration in the following manner:
It reduces the volume of data by transformation into a more compact form that
preserves the essential characteristics and provides an accurate summary
It discovers whether the data contains well-separated classes of objects, such that the
classes can be interpreted meaningfully in the context of a substantive theory
It maps data in the form of a tree from the root to the leaves, which may be used
to predict the outcome for new data or queries.
This research uses the decision tree, a data mining and machine learning technique, to find
intrusion characteristics for intrusion detection. The ID3 algorithm is used to construct
the decision tree. Network traffic logs serve as training data describing human
behaviour in network traffic as intrusive or normal activities. The decision tree training
yields rules describing intrusion characteristics, and these rules are then
implemented in the firewall rules as prevention. Determining the occurrence of intrusive or
normal activities in a network traffic log can be conducted in two ways:
Manually observe network traffic activities in the log files. Examples of logging software
are syslog, syslog_ng, tcpdump and others. Intrusion patterns can be seen in the logs
in simple forms; for example, repeated failed attempts to access with a login or
password, port scans, abundant pings, and repeated delivery of abundant packets
Use software that functions as a Network Intrusion Detection System (NIDS) and is
able to determine intrusion activities or normal activities, for example the Snort software
Log files of intrusive and normal activities are collected and extracted into five
parameters that serve as attributes, together with a class label ‘Yes’ or ‘No’ for intrusiveness,
to form the training data for the decision tree. The parameters are source IP address,
destination IP address, source port, destination port and protocol. Applying the decision
tree to find intrusion characteristics: suppose we train a decision tree on this data, as sketched below.
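A minimal sketch of this training step is shown here, using scikit-learn's entropy-criterion decision tree as an ID3-like stand-in; the log records below are hypothetical.

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Hypothetical log records with the five parameters and the intrusive class.
    records = pd.DataFrame({
        "src_ip":   ["10.0.0.5", "10.0.0.5", "192.168.1.2", "10.0.0.9"],
        "dst_ip":   ["10.0.0.1", "10.0.0.1", "10.0.0.1", "10.0.0.2"],
        "src_port": [4444, 4445, 51000, 52000],
        "dst_port": [22, 22, 80, 443],
        "protocol": ["TCP", "TCP", "TCP", "TCP"],
        "intrusive": ["Yes", "Yes", "No", "No"],
    })
    X = pd.get_dummies(records.drop(columns=["intrusive"]))  # encode categories
    y = records["intrusive"]
    tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
    print(export_text(tree, feature_names=list(X.columns)))  # rules for the firewall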
The rules extracted from the decision tree, representing intrusion characteristics, can be
implemented as firewall rules. Note that every rule here involves the TCP
protocol. The firewall policy rules above represent preventive action, where every
network packet matching criteria like the firewall rules above will be DROPped.
Network traffic logs describe patterns of behaviour in network traffic as
intrusive or normal activity. The decision tree technique is well suited to extracting the intrusion
characteristics of the network traffic logs for an IDS, implemented in the firewall as
prevention. The combination of the two is called an IPS. In addition, this technique
also yields efficient and optimized firewall rules, for example by avoiding
redundancy.
CHAPTER 3
EXPERIMENTAL METHODS AND ALGORITHMS USED
3.1 MACHINE LEARNING SCOPE
Machine learning is a promising approach to achieve human-computer integration
and can be applied in many computer fields. Machine learning is not a single method;
it contains many different computer algorithms. These algorithms aim to solve
different machine learning tasks. Ultimately, all the algorithms can help the computer to
act more like a human. Machine learning is already applied in many fields, for
instance, pattern recognition, artificial intelligence, computer vision, data mining,
text categorization and so on. Machine learning gives a new way to develop the
intelligence of machines. It also becomes an easier way to help people to
analyse data from huge data sets. A learning method is a complicated topic which
has many different kinds of forms. Everyone has different methods to study, and so does
the machine. We can categorize various machine learning systems by different
conditions. In general, we can separate learning problems into two main categories:
supervised learning and unsupervised learning.
3.1.1 SUPERVISED LEARNING
Supervised learning is a commonly used machine learning algorithm which appears
in many different fields of computer science. In the supervised learning method, the
computer can establish a learning model based on the training data set.
According to this learning model, a computer can use the algorithm to predict or
analyze new information. By using special algorithms, a computer can find the best
result and reduce the error rate all by itself. Supervised learning is mainly used for
two different patterns: classification and regression.
In supervised learning, when a developer gives the computer some samples, each
sample is always attached with some classification information. The computer
analyzes these samples to gain learning experience so that the error rate is
reduced when the classifier performs recognition on new patterns.
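A toy illustration of this idea (with made-up numbers): labelled samples train a classifier, which then predicts the class of unseen data.

    from sklearn.neighbors import KNeighborsClassifier

    X_train = [[0, 0], [0, 1], [5, 5], [6, 5]]            # samples
    y_train = ["normal", "normal", "attack", "attack"]    # attached class labels

    clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
    print(clf.predict([[5, 6]]))                          # -> ['attack']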
3.1.2 UNSUPERVISED LEARNING
Unsupervised learning is also used for classification of original data. The classifier in
the unsupervised learning method aims to find the classification information for
unlabeled samples. The objective of unsupervised learning is to let the computer learn
it by itself. We do not teach the computer how to do it. The computer is supposed to
analyze the given samples on its own. In unsupervised learning, the computer is not
able to find the best result to take and also the computer does not know if the result is
correct or not. When the computer receives the original data, it can find the potential
regularities within the information automatically, and then the computer will apply these
regularities to new cases. That is the difference between supervised learning
and unsupervised learning. In some cases, this method is more powerful than
supervised learning. That is because there is no need to do the classification for
samples in advance. Sometimes, our classification method may not be the best one.
On the other hand, a computer may find out the best method after it learns it from
samples again and again.
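A toy illustration with made-up numbers: no labels are given, and the algorithm finds the grouping on its own.

    from sklearn.cluster import KMeans

    X = [[0, 0], [0, 1], [5, 5], [6, 5]]       # unlabeled samples
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(km.labels_)                          # e.g., [0 0 1 1]: two groups found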
Decision Nodes – the nodes we get after splitting the root node are called decision
nodes
Leaf Nodes – the nodes where further splitting is not possible are called leaf nodes or
terminal nodes
Sub-tree – just as a small portion of a graph is called a sub-graph, a
subsection of a decision tree is called a sub-tree
Pruning – cutting down some nodes to stop overfitting
Example of a decision tree.
Let’s understand decision trees with the help of an example.
Decision trees are upside down, which means the root is at the top, and this root is
split into several nodes. In layman's terms, decision trees are nothing but a bunch of if-else
statements. The tree checks if a condition is true, and if it is, it moves to
the next node attached to that decision, as the sketch below illustrates.
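Since the original flowchart figure is not reproduced here, the following sketch captures its logic as plain if-else statements, with the weather attributes assumed from the discussion that follows.

    # A small decision tree written as if-else statements (attributes assumed).
    def play_decision(outlook, humidity, wind):
        if outlook == "cloudy":
            return "play"                     # pure node: no further split needed
        elif outlook == "sunny":
            return "play" if humidity == "normal" else "don't play"
        else:                                 # rainy
            return "play" if wind == "weak" else "don't play"

    print(play_decision("cloudy", "high", "strong"))      # -> play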
Did you notice anything in the flowchart above? We see that if the weather is cloudy,
then we must go and play. Why didn't it split further? Why did it stop there?
To answer this question, we need to know a few more concepts, such as entropy,
information gain, and the Gini index. In simple terms, I can say here that the output for
the training dataset is always yes for cloudy weather; since there is no disorderliness
here, we don't need to split the node further.
The goal of machine learning is to decrease uncertainty, or disorder, in the dataset,
and for this we use decision trees.
Now you must be wondering: how do I know what the root node should be? What
should the decision nodes be? When should I stop splitting? To decide this there is a
metric called entropy, which measures the amount of uncertainty in the dataset.
Entropy:
Entropy is the uncertainty in our dataset, a measure of disorder. Let me try to explain
this with an example.
Suppose a group of friends is deciding which movie to watch together on Sunday.
There are two choices, "Lucy" and "Titanic", and everyone has to state their
preference. After everyone answers, "Lucy" gets 4 votes and "Titanic" gets 5. Which
movie do we watch now? Isn't it hard to choose, given that the votes for the two
movies are roughly equal?
This is exactly what we call disorder: with an equal number of votes for both movies
we cannot really decide which one to watch. It would have been much easier if the
votes were 8 for "Lucy" and 2 for "Titanic"; then we could easily say that the majority
of votes are for "Lucy", so everyone will be watching that movie.
In a decision tree, the output is mostly "yes" or "no". The formula for entropy is
E(S) = - Σ_i p_i log2(p_i),
where p_i is the proportion of samples in S that belong to class i.
How do Decision Trees use Entropy?
Now that we know what entropy is and what its formula looks like, we need to see
how it is used in this algorithm.
Entropy essentially measures the impurity of a node. Impurity is the degree of
randomness; it tells how random our data is. A pure sub-split means that you get
either all "yes" or all "no".
Suppose feature 1 has 8 yes and 4 no; after the split, feature 2 gets 5 yes and 2 no,
whereas feature 3 gets 3 yes and 2 no.
The split is not pure, because we can still see some negative classes under both
features. To build a decision tree, we need to calculate the impurity of each split, and
when the purity is 100% we make the node a leaf node. To check the impurity of
features 2 and 3, we take the help of entropy.
We can see from the tree itself that feature 2 has lower entropy, i.e. more purity, than
feature 3, since feature 2 has more "yes" answers and a decision is easier there.
Always remember: the higher the entropy, the lower the purity and the higher the
impurity.
As mentioned earlier, the goal of machine learning is to decrease the uncertainty, or
impurity, in the dataset. Entropy gives us the impurity of a particular node, but it does
not by itself tell us whether the parent's entropy has decreased after a split.
For that, we bring in a new metric called information gain, which tells us how much
the parent entropy has decreased after splitting on some feature.
Information Gain:
Information gain measures the reduction in uncertainty obtained by splitting on some
feature, and it is the deciding factor for which attribute should be selected as a
decision node or the root node. It is simply the entropy of the full dataset minus the
weighted entropy of the dataset given that feature:
IG(S, A) = E(S) - Σ_v (|S_v| / |S|) * E(S_v),
where v ranges over the values of feature A and S_v is the subset of S taking value v.
Using these two quantities we can decide which feature becomes the root node and
which features are placed after each split; a short illustrative sketch follows.
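Both quantities are easy to compute directly. The sketch below is illustrative only; it
uses NumPy together with the movie votes and the 8-yes/4-no split from the text:

import numpy as np

def entropy(labels):
    # E(S) = -sum(p_i * log2(p_i)) over the classes present in S
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(parent, children):
    # IG = E(parent) - weighted average entropy of the child splits
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

votes = ["Lucy"] * 4 + ["Titanic"] * 5                    # near 50/50: high disorder
print(round(entropy(votes), 3))                           # ~0.991
print(round(entropy(["Lucy"] * 8 + ["Titanic"] * 2), 3))  # ~0.722, purer
# the 8-yes/4-no node split into (5 yes, 2 no) and (3 yes, 2 no)
print(round(information_gain([1]*8 + [0]*4, [[1]*5 + [0]*2, [1]*3 + [0]*2]), 3))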
When to stop splitting?
You must be asking yourself: when do we stop growing the tree? Real-world datasets
usually have a large number of features, which results in a large number of splits and,
in turn, a huge tree. Such trees take time to build and can lead to overfitting: the tree
will give very good accuracy on the training dataset but bad accuracy on test data.
There are many ways to tackle this problem through hyperparameter tuning. We can
set the maximum depth of our decision tree using the max_depth parameter: the
higher the value of max_depth, the more complex the tree. The training error will of
course decrease if we increase the max_depth value, but when the test data comes
into the picture we will get very bad accuracy. You therefore need a value that neither
overfits nor underfits the data, and for this you can use GridSearchCV.
Another way is to set the minimum number of samples for each split, denoted by
min_samples_split. Here we specify the minimum number of samples required to
perform a split. For example, with a minimum of 10 samples per decision, any node
with fewer than 10 samples stops splitting and becomes a leaf node. There are more
hyperparameters, such as:
min_samples_leaf – the minimum number of samples required to be in a leaf node.
The smaller this number, the greater the possibility of overfitting.
max_features – the number of features to consider when looking for the best split.
A short tuning sketch over these hyperparameters follows.
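As a sketch of such tuning, assuming a generic labelled dataset rather than the
project's KDD files, GridSearchCV can search these hyperparameters with
cross-validation:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
grid = {
    "max_depth": [2, 3, 5, None],        # deeper trees are more complex
    "min_samples_split": [2, 10, 20],    # larger values stop splitting earlier
    "min_samples_leaf": [1, 5, 10],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), grid, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))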
Pruning
Pruning is another method that can help us avoid overfitting. It improves the
performance of the tree by cutting off nodes or sub-nodes that are not significant.
There are mainly two ways of pruning:
Pre-pruning – we stop growing the tree early, pruning (removing) a node while the
tree is being grown if it has low importance.
Post-pruning – once the tree is built to its full depth, we start pruning nodes based
on their significance. A post-pruning sketch follows.
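In scikit-learn, post-pruning is available through cost-complexity pruning (the
ccp_alpha parameter); a small illustrative sketch, not tied to the project's data:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# grow the full tree, then prune subtrees whose cost-complexity gain is below alpha
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
for alpha in path.ccp_alphas[::5]:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)
    print(f"alpha={alpha:.4f}  leaves={tree.get_n_leaves()}  test={tree.score(X_te, y_te):.3f}")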
Endnotes
To summarize, in this article we learned about decision trees: on what basis a tree
splits its nodes, how we can stop overfitting, and why linear regression doesn't work
for classification problems. In the next article I will explain random forests, another
technique for avoiding overfitting.
3.3 GENETIC ALGORITHM
I suppose you would now be wondering whether there is any use for such tough
tasks. I will not answer this question yet; instead, let us look at an implementation
using the TPOT library, and then you can decide.
Implementation using TPOT library
First, let's take a quick look at TPOT (Tree-based Pipeline Optimization Tool), which
is built on top of the scikit-learn library. In a typical machine learning pipeline, the
stages from feature preprocessing through feature selection to model and parameter
selection are what TPOT automates, and this automation is achieved using a genetic
algorithm. So, without going deeper into the theory, let's try to implement it directly.
To use the TPOT library, you first have to install some existing Python libraries on
which TPOT is built. So let us quickly install them.
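A minimal usage sketch follows; the dataset, generation count and population size
are illustrative choices, not the project's settings:

# pip install numpy scipy scikit-learn deap update_checker tqdm stopit tpot
from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(*load_digits(return_X_y=True),
                                                    random_state=0)

# the genetic algorithm evolves whole pipelines over several generations
tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2, random_state=0)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('best_pipeline.py')   # writes the winning pipeline as plain Python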
Applications in the Real World:
Engineering design
Robotics
End Notes
I hope you have now gained enough understanding of what a genetic algorithm is
and how to implement it using the TPOT library. But this knowledge is of little use if
you don't apply it somewhere, so try to implement it in a real-world application or a
data science competition.
3.4 Block Diagram and Flow Diagram
CHAPTER 4
RESULTS, DISCUSSION AND PERFORMANCE ANALYSIS
4.1 Requirements
Hardware requirements:
System: Pentium i3 processor
Hard disk: 500 GB
Monitor: 15'' LED
Input devices: keyboard, mouse
RAM: 2 GB
Software requirements:
Operating system: Windows 10
Language: Python
4.2 MODULES
What is the machine learning Model?
The machine learning model is nothing but a piece of code that an engineer or data
scientist makes smart through training with data. So, if you give garbage to the
model, you will get garbage in return: the trained model will give false or wrong
predictions.
Fig.4.2.1 Module Flow chart
Data Collection
Data Pre-processing
Feature Extraction
Model Training
Testing Model
Evaluation
Prediction
Data Collection
Collecting data allows us to capture a record of past events so that data analysis can
be used to find recurring patterns.
KDD datasets:
The KDD data set is a well-known benchmark in the research of intrusion detection
techniques. While much work is going into improving intrusion detection strategies,
research on the data used for training and testing the detection model is of equal
concern, because better data quality can improve offline intrusion detection. This
section presents an analysis of the KDD data set with respect to the four classes into
which all data attributes can be categorized: Basic, Content, Traffic and Host. A
loading sketch follows.
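As an illustrative sketch (the file name kddcup_sample.csv is a placeholder, not the
project's exact file), such a dataset can be loaded with pandas:

import pandas as pd

# hypothetical CSV export of the KDD data: feature columns plus a class label
kdd = pd.read_csv("kddcup_sample.csv", header=None)
X = kdd.iloc[:, :-1]          # Basic/Content/Traffic/Host attributes
y = kdd.iloc[:, -1]           # 'normal', 'dos', 'probe', ...
print(X.shape, y.value_counts().head())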
Fig.4.2.2 Data Collection
Data Pre-Processing
Data pre-processing is a process of cleaning the raw data i.e. the data is collected in
the real world and is converted to a clean data set. In other words, whenever the data
is gathered from different sources it is collected in a raw format and this data isn’t
feasible for the analysis. Therefore, certain steps are executed to convert the data into
a small clean data set, this part of the process is called as data pre-processing.
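A sketch of such pre-processing, assuming a hypothetical CSV with symbolic and
numeric columns (not the project's exact pipeline): encode the symbolic fields and
scale the numeric ones.

import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

df = pd.read_csv("kddcup_sample.csv")            # hypothetical raw file
df = df.drop_duplicates().dropna()               # basic cleaning

for col in df.select_dtypes(include="object"):   # e.g. protocol/service fields
    df[col] = LabelEncoder().fit_transform(df[col])

num_cols = df.columns[:-1]                       # assumes the label is the last column
df[num_cols] = MinMaxScaler().fit_transform(df[num_cols])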
Feature Extraction
This is done to reduce the number of attributes in the dataset, which speeds up
training and can improve accuracy.
Model training
A training model is a dataset used to train an ML algorithm. It consists of sample
output data and the corresponding sets of input data that influence the output.
Training set:
The training set is the material through which the computer learns how to process
information. Machine learning uses algorithms to perform the training part. The
training set is the data used for learning, that is, for fitting the parameters of the
classifier.
Validation set:
Cross-validation is primarily used in applied machine learning to estimate the skill of
a machine learning model on unseen data: a set of data held out from the training
data is used to tune the parameters of the classifier.
Fig.4.2.4 Validation
Once the data is divided into the given segments, we can start the training process.
A training set is used to build the model, while a test (or validation) set is used to
validate the model built; data points in the training set are excluded from the test
(validation) set. Usually, a data set is divided into a training set and a validation set
(some people use 'test set' instead) in each iteration, or into a training set, a
validation set and a test set in each iteration. The model is whichever one we chose
in step 3, and once it is trained we use the same trained model to predict on the
testing data, i.e. the unseen data. Once this is done we can build a confusion matrix,
which tells us how well our model is trained. A confusion matrix has four kinds of
entries: true positives, true negatives, false positives and false negatives. We prefer
to get more values in the true positives and true negatives, which gives a more
accurate model. The size of the confusion matrix depends entirely on the number of
classes.
The final step is to tune the model to improve the accuracy, using the confusion
matrix to try to increase the number of true positives and true negatives; a sketch
follows.
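A sketch of producing such a confusion matrix with scikit-learn, on a generic dataset
rather than the project's:

from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

pred = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).predict(X_te)
print(confusion_matrix(y_te, pred))   # size grows with the number of classes
print(accuracy_score(y_te, pred))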
4.3 RESULT
Fig.4.3.2 Train Data Set
Fig.4.3.3 Testing All Attacks
Fig.4.3.5 Normal Attacks Testing
Fig.4.3.6 Normal Attacks Plot Graph
4.4 Code
Dataset.py
import pyshark
import time
import random

class Packet:
    packet_list = list()

    def initiating_packets(self):
        self.packet_list.clear()
        capture = pyshark.LiveCapture(interface="Wi-Fi")
        for packet in capture.sniff_continuously(packet_count=25):
            try:
                if "<UDP Layer>" in str(packet.layers) and "<IP Layer>" in str(packet.layers):
                    self.packet_list.append(packet)
                elif "<TCP Layer>" in str(packet.layers) and "<IP Layer>" in str(packet.layers):
                    self.packet_list.append(packet)
            except:
                print(f"No Attribute name 'ip' {packet.layers}")

    def udp_packet_attributes(self, packet):
        attr_list = list()
        a1 = packet.ip.ttl
        a2 = packet.ip.proto
        a3 = self.__get_service(packet.udp.port, packet.udp.dstport)
        a4 = packet.ip.len
        a5 = random.randrange(0, 1000)
        a6 = self.__get_land(packet, a2)
        a7 = 0
        a8, a10, a11 = self.__get_count_with_same_and_diff_service_rate(packet.udp.dstport, a3)  # 23, 29, 30
        a9, a12 = self.__get_srv_count_and_srv_diff_host_rate(packet.ip.dst, a3)  # 24, 31
        a13, a15, a16 = self.__get_dst_host_count(packet.ip.dst, a3)  # 32, 34, 35
        a14, a17, a18 = self.__get_dst_host_srv_count(packet.udp.port, packet.udp.dstport, packet.ip.dst)  # 33, 36, 37
        attr_list.extend((a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, a14, a15, a16, a17, a18))
        return self.get_all_float(attr_list)

    def tcp_packet_attributes(self, packet):
        attr_list = list()
        a1 = packet.ip.ttl  # duration
        a2 = packet.ip.proto  # protocol
        a3 = self.__get_service(packet.tcp.port, packet.tcp.dstport)  # service
        a4 = packet.ip.len
        a5 = random.randrange(0, 1000)
        a6 = self.__get_land(packet, a2)
        a7 = packet.tcp.urgent_pointer
        a8, a10, a11 = self.__get_count_with_same_and_diff_service_rate(packet.tcp.dstport, a3)  # 23, 29, 30
        a9, a12 = self.__get_srv_count_and_srv_diff_host_rate(packet.ip.dst, a3)  # 24, 31
        a13, a15, a16 = self.__get_dst_host_count(packet.ip.dst, a3)  # 32, 34, 35
        a14, a17, a18 = self.__get_dst_host_srv_count(packet.tcp.port, packet.tcp.dstport, packet.ip.dst)  # 33, 36, 37
        attr_list.extend((a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, a14, a15, a16, a17, a18))
        return self.get_all_float(attr_list)

    def __get_service(self, src_port, dst_port):
        services = [80, 443, 53]
        if int(src_port) in services:
            return int(src_port)
        elif int(dst_port) in services:
            return int(dst_port)
        else:
            return 53
    def __get_land(self, packet, proto):
        # only the final "else: return 0" of this method survives the page breaks in
        # the source; the condition below assumes the standard KDD 'land' feature
        # (1 when source and destination address are the same)
        if packet.ip.src == packet.ip.dst:
            return 1
        else:
            return 0

    def __get_srv_count_and_srv_diff_host_rate(self, dst_ip, service):  # 24, 31
        # signature, initialisation and UDP branch reconstructed to mirror the TCP
        # branch that survives in the source
        service_count = 0
        diff_dst_ip = 0
        for p in self.packet_list:
            if "<UDP Layer>" in str(p.layers):
                if (self.__get_service(p.udp.port, p.udp.dstport) == service):
                    service_count += 1
                    if not (p.ip.dst == dst_ip):  # not added
                        diff_dst_ip += 1
            elif "<TCP Layer>" in str(p.layers):
                if (self.__get_service(p.tcp.port, p.tcp.dstport) == service):
                    service_count += 1
                    if not (p.ip.dst == dst_ip):  # not added
                        diff_dst_ip += 1
        srv_diff_host_rate = 0.0
        if not (service_count == 0):
            srv_diff_host_rate = ((diff_dst_ip * 100) / service_count) / 100
        return (service_count, srv_diff_host_rate)

    def __get_dst_host_srv_count(self, src_port, dst_port, dst_ip):  # 33, 36, 37
        dst_host_srv_count = 0
        same_src_port = 0
        diff_dst_ip = 0
        for p in self.packet_list:
            if "<UDP Layer>" in str(p.layers):
                if (p.udp.dstport == dst_port):  # same destination port
                    dst_host_srv_count += 1
                if (p.udp.port == src_port):  # same src port
                    same_src_port += 1
                if not (p.ip.dst == dst_ip):  # different destination IP
                    diff_dst_ip += 1
            # the TCP branch and the closing rate computation are lost to a page
            # break in the source

    def get_all_float(self, l):
        # convert every attribute to float, rounded to one decimal place
        all_float = list()
        for x in l:
            all_float.append(round(float(x), 1))
        return all_float
GAAlgorithm.py
import Population
import random

class GAAlgorithm():
    # population, population_size and mutation_rate are expected to be set on the
    # instance elsewhere; the constructor is not included in this listing

    def initialization(self):
        self.population.initialize_population()

    def calculate_fitness(self):
        self.population.calculate_fitness()

    def selection(self):
        # pair an index from the first half of the population with one from the second half
        parents = list()
        end = int(self.population_size / 2)
        no_of_parents = int(self.population_size / 2)
        for x in range(no_of_parents):
            p1 = random.randint(0, end - 1)
            p2 = random.randint(end, self.population_size - 1)
            parents.append([p1, p2])
        return parents

    def cross_over(self, parents):
        self.population.cross_over(parents)

    def mutation(self):
        self.population.mutation(self.mutation_rate)

    def clear_population(self):
        self.population.clear_population()
Individual.py
import random
import string
import pandas
from classifier import DecisionTree

class Individual:
    chromosome = list()
    fitness = 0

    def __init__(self, train_dataset, test_dataset, gene_length=18):
        self.gene_length = int(gene_length)
        # random bit string: bit x decides whether feature x is used by the classifier
        self.chromosome = [random.randint(0, 1) for x in range(self.gene_length)]
        self.train_dataset = train_dataset
        self.test_dataset = test_dataset

    def calculate_fitness(self):
        # columns are named a..s; the 19th column (header[18]) is the class label
        header = list(string.ascii_lowercase[0:(self.gene_length + 1)])
        kdd_train = pandas.read_csv(self.train_dataset, names=header)
        kdd_test = pandas.read_csv(self.test_dataset, names=header)
        selected_index = [header[x] for x, y in enumerate(self.chromosome) if y == 1]
        var_train, res_train = kdd_train[selected_index], kdd_train[header[18]]
        var_test, res_test = kdd_test[selected_index], kdd_test[header[18]]
        # fitness = accuracy (%) of a decision tree trained on the selected features
        self.fitness = self.__get_fitness(var_train, res_train, var_test, res_test) * 100
Packet.py
import pyshark
import random

class Packet:
    packet_list = list()  # list of captured packets shared by the feature helpers

    def initiating_packets(self):
        self.packet_list.clear()
        capture = pyshark.LiveCapture(interface="Wi-Fi")
        for packet in capture.sniff_continuously(packet_count=25):
            try:
                if "<UDP Layer>" in str(packet.layers) and "<IP Layer>" in str(packet.layers):
                    self.packet_list.append(packet)
                elif "<TCP Layer>" in str(packet.layers) and "<IP Layer>" in str(packet.layers):
                    self.packet_list.append(packet)
            except:
                print(f"No Attribute name 'ip' {packet.layers}")

    def udp_packet_attributes(self, packet):
        attr_list = list()
        a1 = packet.ip.ttl
        a2 = packet.ip.proto
        a3 = self.__get_service(packet.udp.port, packet.udp.dstport)
        a4 = packet.ip.len
        a5 = random.randrange(0, 1000)
        a6 = self.__get_land(packet, a2)
        a7 = 0  # the urgent pointer does not exist in the UDP layer
        a8, a10, a11 = self.__get_count_with_same_and_diff_service_rate(packet.udp.dstport, a3)  # 23, 29, 30
        a9, a12 = self.__get_srv_count_and_srv_diff_host_rate(packet.ip.dst, a3)  # 24, 31
        a13, a15, a16 = self.__get_dst_host_count(packet.ip.dst, a3)  # 32, 34, 35
        a14, a17, a18 = self.__get_dst_host_srv_count(packet.udp.port, packet.udp.dstport, packet.ip.dst)  # 33, 36, 37
        attr_list.extend((a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, a14, a15, a16, a17, a18))
        return self.get_all_float(attr_list)

    def tcp_packet_attributes(self, packet):
        attr_list = list()
        a1 = packet.ip.ttl  # duration
        a2 = packet.ip.proto  # protocol
        a3 = self.__get_service(packet.tcp.port, packet.tcp.dstport)  # service
        a4 = packet.ip.len  # src_bytes
        a5 = random.randrange(0, 1000)  # dst_bytes
        a6 = self.__get_land(packet, a2)  # land
        a7 = packet.tcp.urgent_pointer  # urgent pointer
        a8, a10, a11 = self.__get_count_with_same_and_diff_service_rate(packet.tcp.dstport, a3)  # 23, 29, 30
        a9, a12 = self.__get_srv_count_and_srv_diff_host_rate(packet.ip.dst, a3)  # 24, 31
        a13, a15, a16 = self.__get_dst_host_count(packet.ip.dst, a3)  # 32, 34, 35
        a14, a17, a18 = self.__get_dst_host_srv_count(packet.tcp.port, packet.tcp.dstport, packet.ip.dst)  # 33, 36, 37
        attr_list.extend((a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, a14, a15, a16, a17, a18))
        return self.get_all_float(attr_list)  # convert every attribute to float

    def __get_service(self, src_port, dst_port):
        services = [80, 443, 53]
        if int(src_port) in services:
            return int(src_port)
        elif int(dst_port) in services:
            return int(dst_port)
        else:
            return 53
    def __get_count_with_same_and_diff_service_rate(self, dst_port, service):  # 23, 29, 30
        # signature reconstructed; it is lost to a page break in the source
        count = 0
        packet_with_same_service = 0
        for p in self.packet_list:
            if "<UDP Layer>" in str(p.layers):
                if (p.udp.dstport == dst_port):  # same destination port
                    count += 1
                if (self.__get_service(p.udp.port, p.udp.dstport) == service):  # same service
                    packet_with_same_service += 1
            elif "<TCP Layer>" in str(p.layers):
                if (p.tcp.dstport == dst_port):
                    count += 1
                if (self.__get_service(p.tcp.port, p.tcp.dstport) == service):
                    packet_with_same_service += 1
        same_service_rate = 0.0
        diff_service_rate = 1.0
        if not count == 0:
            same_service_rate = ((packet_with_same_service * 100) / count) / 100
            diff_service_rate = diff_service_rate - same_service_rate
        return (count, same_service_rate, diff_service_rate)

    def __get_srv_count_and_srv_diff_host_rate(self, dst_ip, service):  # 24, 31
        # signature, initialisation and UDP branch reconstructed to mirror the TCP
        # branch that survives in the source
        service_count = 0
        diff_dst_ip = 0
        for p in self.packet_list:
            if "<UDP Layer>" in str(p.layers):
                if (self.__get_service(p.udp.port, p.udp.dstport) == service):
                    service_count += 1
                    if not (p.ip.dst == dst_ip):  # different destination IP
                        diff_dst_ip += 1
            elif "<TCP Layer>" in str(p.layers):
                if (self.__get_service(p.tcp.port, p.tcp.dstport) == service):
                    service_count += 1
                    if not (p.ip.dst == dst_ip):  # different destination IP if TCP
                        diff_dst_ip += 1
        srv_diff_host_rate = 0.0
        if not (service_count == 0):
            srv_diff_host_rate = ((diff_dst_ip * 100) / service_count) / 100
        return (service_count, srv_diff_host_rate)

    def __get_dst_host_srv_count(self, src_port, dst_port, dst_ip):  # 33, 36, 37
        dst_host_srv_count = 0
        same_src_port = 0
        diff_dst_ip = 0
        for p in self.packet_list:
            if "<UDP Layer>" in str(p.layers):
                if (p.udp.dstport == dst_port):  # same destination port
                    dst_host_srv_count += 1
                if (p.udp.port == src_port):  # same source port
                    same_src_port += 1
                if not (p.ip.dst == dst_ip):  # different destination IP
                    diff_dst_ip += 1
            # the TCP branch and the closing rate computation are lost to a page
            # break in the source

    def get_all_float(self, l):
        # convert every attribute to float, rounded to one decimal place
        all_float = list()
        for x in l:
            all_float.append(round(float(x), 1))
        return all_float
ABNIDS.py
# Change testing panel to avoid segmentation fault
from PyQt5 import QtCore, QtGui, QtWidgets
from PyQt5.QtGui import QIcon, QPixmap
from PyQt5.QtWidgets import (qApp, QFileDialog, QMessageBox, QMainWindow,
                             QDialog, QDialogButtonBox, QVBoxLayout, QHeaderView)
import os
import time
import pyshark
import matplotlib.pyplot as plt
import threading
import packet as pack
import GAAlgorithm
import Preprocess as data
import classifier

class Ui_MainWindow(object):
    def __init__(self):
        self.tree_classifier = classifier.DecisionTree()
        self.packet = pack.Packet()
        self.trained = False
        self.stop = False
        self.threadActive = False
        self.pause = False

    def plot_graph(self):
        # bar chart of how many packets were classified Normal / DoS / Probe
        x = ['Normal', 'DoS', 'Prob']
        normal, dos, prob = self.tree_classifier.get_class_count()
        y = [normal, dos, prob]
        plt.bar(x, y, width=0.3, label="BARCHART")
        plt.xlabel('Classes')
        plt.ylabel('Count')
        plt.title('Graph Plotting')
        plt.legend()
        plt.show()
    def train_model(self):
        try:
            train_dataset, train_dataset_type = QFileDialog.getOpenFileName(
                MainWindow, "Select Training Dataset", "", "All Files (*);;CSV Files (*.csv)")
            if train_dataset:
                os.chdir(os.path.dirname(train_dataset))
                test_dataset, test_dataset_type = QFileDialog.getOpenFileName(
                    MainWindow, "Select Testing Dataset", "", "All Files (*);;CSV Files (*.csv)")
                if train_dataset and test_dataset:
                    generation = 0
                    train_dataset = data.Dataset.refine_dataset(train_dataset, "Train Preprocess.txt")
                    # (the construction of the GA object `ga` falls on a page break
                    #  and is missing from the source listing)
                    ga.calculate_fitness()
                    # evolve until a chromosome reaches 93% fitness or the generation limit
                    while (ga.population.max_fitness < 93 and generation < 1):
                        print(f"Generation = {generation}")
                        generation += 1
                        parents = ga.selection()
                        ga.cross_over(parents)
                        ga.mutation()
                        ga.calculate_fitness()
                    max_fitest = ga.population.max_fittest
                    max_fitness = round(ga.population.max_fitness, 1)
                    self.tree_classifier.train_classifier(train_dataset, max_fitest)
                    self.trained = True
                    ga.clear_population()
                    self.progressBar.setProperty("value", 100)
                    self.showdialog('Model train', 'Model trained successfully', 1)
        except:
            try:
                ga.clear_population()
            except:
                print("Err 00")
            finally:
                self.showdialog('Model train', 'Model trained unsuccessfully', 2)
    def static_testing(self):
        if self.isModelTrained():
            if (self.threadActive):
                self.showdialog('Warning', 'Please stop currently testing', 3)
            else:
                test_dataset, train_dataset_type = QFileDialog.getOpenFileName(
                    MainWindow, "Select Testing Dataset", "", "All Files (*);;CSV Files (*.csv)")
                if test_dataset:
                    try:
                        test_dataset = data.Dataset.refine_dataset(test_dataset, "Test Dataset.txt")
                        t1 = threading.Thread(target=self.static_testing_thread,
                                              name='Static testing', args=(test_dataset,))
                        t1.start()
                        self.threadActive = True
                    except:
                        self.showdialog('Error', 'Invalid Dataset', 2)
        else:
            self.showdialog('Warning', 'Model not trained', 3)

    def static_testing_thread(self, dataset):
        # classify the dataset line by line, streaming results into the GUI tables
        row = 0
        self.reset_all_content()
        with open(dataset, "r") as file:
            for line in file.readlines():
                try:
                    line = line.split(',')
                    result, result_type = self.tree_classifier.test_dataset(line)
                    self.insert_data(line, result, result_type, row)
                    row += 1
                    if self.pause:
                        while (self.pause):
                            pass
                    if self.isStop():
                        self.stop = False
                        break
                    time.sleep(0.05)
                except:
                    print("Err")
        self.threadActive = False
    def realtime_testing(self):
        if self.isModelTrained():
            if (self.threadActive):
                self.showdialog('Warning', 'Please stop currently testing', 3)
            else:
                t2 = threading.Thread(target=self.realtime_testing_thread, name='Realtime testing')
                t2.start()
                self.threadActive = True
        else:
            self.showdialog('Warning', 'Model not trained', 3)

    def realtime_testing_thread(self):
        self.reset_all_content()
        self.packet.initiating_packets()
        t1 = time.time()
        attr_list = list()
        capture = pyshark.LiveCapture(interface='Wi-Fi')
        row = 0
        try:
            for p in capture.sniff_continuously():
                try:
                    if "<UDP Layer>" in str(p.layers) and "<IP Layer>" in str(p.layers):
                        attr_list = self.packet.udp_packet_attributes(p)
                        result, result_type = self.tree_classifier.test_dataset(attr_list)
                        self.insert_data(attr_list, result, result_type, row)
                        print(attr_list)
                        row += 1
                    elif "<TCP Layer>" in str(p.layers) and "<IP Layer>" in str(p.layers):
                        attr_list = self.packet.tcp_packet_attributes(p)
                        result, result_type = self.tree_classifier.test_dataset(attr_list)
                        self.insert_data(attr_list, result, result_type, row)
                        print(attr_list)
                        row += 1
                    if (time.time() - t1) > 5 and not self.isStop():  # refresh every 5 seconds
                        print("Updating List")
                        self.packet.initiating_packets()
                        t1 = time.time()
                    if self.pause:
                        while (self.pause):
                            pass
                    if self.isStop():
                        self.stop = False
                        break
                except:
                    print("Err")
        except:
            print("Error in loop")
    def pause_resume(self):
        if self.pause:
            self.pause = False
            self.btn_start.setText("Pause")
        else:
            self.pause = True
            self.btn_start.setText("Resume")

    def save_log_file(self):
        log = self.tree_classifier.get_log()
        url = QFileDialog.getSaveFileName(None, 'Save Log', 'untitled',
                                          "Text file (*.txt);;All Files (*)")
        if url[0]:
            try:
                name = url[1]
                url = url[0]
                with open(url, 'w') as file:
                    file.write(log)
                self.showdialog('Saved', f'File saved as {url}', 1)
            except:
                self.showdialog('Error', 'File not saved', 2)

    def stop_capturing_testing(self):
        if self.pause:
            self.pause = False
            self.btn_start.setText('Pause')
        if not self.stop:
            self.stop = True
        if self.threadActive:
            self.threadActive = False
    def reset_all_content(self):
        if self.pause:
            self.pause = False
            self.btn_start.setText('Pause')
        self.stop = False
        self.tree_classifier.reset_class_count()
        self.panel_capturing.clearContents()
        self.panel_capturing.setRowCount(0)
        self.panel_result.clearContents()
        self.panel_result.setRowCount(0)
        self.panel_testing.clear()

    def insert_data(self, line, result, result_type, row):
        # show the first four attributes of every packet in the capture panel
        self.panel_capturing.insertRow(row)
        for column, item in enumerate(line[0:4:1]):
            self.panel_capturing.setItem(row, column, QtWidgets.QTableWidgetItem(str(item)))
        self.panel_capturing.scrollToBottom()
        self.panel_testing.clear()
        self.panel_testing.addItem(str(line[0:4:1]))
        if not result == 0:
            # a non-zero result means an attack was detected; log it in the result panel
            result_row = self.panel_result.rowCount()
            self.panel_result.insertRow(result_row)
            x = [row + 1, line[1], line[2], result_type]
            for column, item in enumerate(x):
                self.panel_result.setItem(result_row, column, QtWidgets.QTableWidgetItem(str(item)))
            self.panel_result.scrollToBottom()

    def clickexit(self):
        buttonReply = QMessageBox.question(MainWindow, 'Exit', "Are you sure to exit?",
                                           QMessageBox.Yes | QMessageBox.No, QMessageBox.No)
        if buttonReply == QMessageBox.Yes:
            if self.threadActive:
                self.pause = False
                self.stop = True
            qApp.quit()
        else:
            print('No clicked.')
    def isStop(self):
        return self.stop

    def showdialog(self, title, text, icon_type):
        msg = QMessageBox()
        if icon_type == 1:
            msg.setIcon(QMessageBox.Information)
        elif icon_type == 2:
            msg.setIcon(QMessageBox.Critical)
        elif icon_type == 3:
            msg.setIcon(QMessageBox.Warning)
        msg.setText(text)
        msg.setWindowTitle(title)
        msg.setStandardButtons(QMessageBox.Ok)
        msg.buttonClicked.connect(self.msgbtn)
        retval = msg.exec_()

    def msgbtn(self):
        self.progressBar.setProperty("value", 0)

    def isModelTrained(self):
        return self.trained

    def setupUi(self, MainWindow):
        MainWindow.setObjectName("MainWindow")
        path = os.path.dirname(os.path.abspath(__file__))
        MainWindow.setWindowIcon(QtGui.QIcon(os.path.join(path, 'icon.png')))
        MainWindow.resize(908, 844)
        sizePolicy = QtWidgets.QSizePolicy(QtWidgets.QSizePolicy.Fixed, QtWidgets.QSizePolicy.Preferred)
        sizePolicy.setHorizontalStretch(0)
        sizePolicy.setVerticalStretch(0)
        sizePolicy.setHeightForWidth(MainWindow.sizePolicy().hasHeightForWidth())
        MainWindow.setSizePolicy(sizePolicy)
        MainWindow.setIconSize(QtCore.QSize(30, 30))
        self.centralwidget = QtWidgets.QWidget(MainWindow)
        self.centralwidget.setObjectName("centralwidget")
        self.gridLayout = QtWidgets.QGridLayout(self.centralwidget)
        self.gridLayout.setObjectName("gridLayout")
        spacerItem = QtWidgets.QSpacerItem(10, 10, QtWidgets.QSizePolicy.Expanding, QtWidgets.QSizePolicy.Minimum)
        self.gridLayout.addItem(spacerItem, 1, 0, 1, 1)
        spacerItem1 = QtWidgets.QSpacerItem(20, 20, QtWidgets.QSizePolicy.Minimum, QtWidgets.QSizePolicy.Maximum)
        self.gridLayout.addItem(spacerItem1, 4, 1, 1, 1)
        spacerItem2 = QtWidgets.QSpacerItem(20, 10, QtWidgets.QSizePolicy.Minimum, QtWidgets.QSizePolicy.Fixed)
        self.gridLayout.addItem(spacerItem2, 6, 1, 1, 1)
        self.horizontalLayout_2 = QtWidgets.QHBoxLayout()
        self.horizontalLayout_2.setObjectName("horizontalLayout_2")
        spacerItem3 = QtWidgets.QSpacerItem(15, 10, QtWidgets.QSizePolicy.Ignored, QtWidgets.QSizePolicy.Minimum)
        self.horizontalLayout_2.addItem(spacerItem3)
        self.btn_start = QtWidgets.QPushButton(self.centralwidget)
        self.btn_start.setObjectName("btn_start")
        self.btn_start.setText('Pause')
        self.btn_start.clicked.connect(self.pause_resume)
        self.horizontalLayout_2.addWidget(self.btn_start)
        # ####################################################
        self.btn_pause = QtWidgets.QPushButton(self.centralwidget)
        self.btn_pause.setText("Stop Capturing/Testing")
        self.btn_pause.setObjectName("btn_pause")
        self.btn_pause.clicked.connect(self.stop_capturing_testing)
        self.horizontalLayout_2.addWidget(self.btn_pause)
        self.gridLayout.addLayout(self.horizontalLayout_2, 8, 1, 1, 1)
        self.horizontalLayout = QtWidgets.QHBoxLayout()
        self.horizontalLayout.setObjectName("horizontalLayout")
        # #####################################################
        self.btn_modeltrain = QtWidgets.QPushButton(self.centralwidget)
        self.btn_modeltrain.setText("Train Model")
        self.btn_modeltrain.setObjectName("btn_modeltrain")
        self.btn_modeltrain.clicked.connect(self.train_model)
        self.horizontalLayout.addWidget(self.btn_modeltrain)
        # ######################################################
        self.btn_statictesting = QtWidgets.QPushButton(self.centralwidget)
        self.btn_statictesting.setText("Static Testing")
        self.btn_statictesting.setObjectName("btn_statictesting")
        self.btn_statictesting.clicked.connect(self.static_testing)
        self.horizontalLayout.addWidget(self.btn_statictesting)
        # ######################################################
        self.btn_realtimetesting = QtWidgets.QPushButton(self.centralwidget)
        self.btn_realtimetesting.setText("Realtime Testing")  # text and object name are blank in the source listing
        self.btn_realtimetesting.setObjectName("btn_realtimetesting")
        self.btn_realtimetesting.clicked.connect(self.realtime_testing)
        self.horizontalLayout.addWidget(self.btn_realtimetesting)
        # ######################################################
        self.btn_savelog = QtWidgets.QPushButton(self.centralwidget)
        self.btn_savelog.setText("Save Log")
        icon5 = QtGui.QIcon()
        self.btn_savelog.setObjectName("btn_savelog")
        self.btn_savelog.clicked.connect(self.save_log_file)
        self.horizontalLayout.addWidget(self.btn_savelog)
        # ######################################################
        self.btn_graph = QtWidgets.QPushButton(self.centralwidget)
        self.btn_graph.setText("Plot Graph")
        self.btn_graph.setObjectName("btn_graph")
        self.btn_graph.clicked.connect(self.plot_graph)
        self.horizontalLayout.addWidget(self.btn_graph)
        # ######################################################
        self.btn_exit = QtWidgets.QPushButton(self.centralwidget)
        self.btn_exit.setText("Exit")
        self.btn_exit.setObjectName("btn_exit")
        self.btn_exit.clicked.connect(self.clickexit)
        self.horizontalLayout.addWidget(self.btn_exit)
        # ######################################################
        self.gridLayout.addLayout(self.horizontalLayout, 3, 1, 1, 2)
        spacerItem4 = QtWidgets.QSpacerItem(20, 10, QtWidgets.QSizePolicy.Minimum, QtWidgets.QSizePolicy.Fixed)
        self.gridLayout.addItem(spacerItem4, 8, 1, 1, 1)
        spacerItem5 = QtWidgets.QSpacerItem(20, 10, QtWidgets.QSizePolicy.Minimum, QtWidgets.QSizePolicy.Fixed)
        self.gridLayout.addItem(spacerItem5, 0, 1, 1, 1)
        self.panel_capturing = QtWidgets.QTableWidget(self.centralwidget)
        sizePolicy = QtWidgets.QSizePolicy(QtWidgets.QSizePolicy.Preferred, QtWidgets.QSizePolicy.Preferred)
        sizePolicy.setHorizontalStretch(10)
        sizePolicy.setVerticalStretch(0)
        sizePolicy.setHeightForWidth(self.panel_capturing.sizePolicy().hasHeightForWidth())
        self.panel_capturing.setSizePolicy(sizePolicy)
        self.panel_capturing.setRowCount(0)
        self.panel_capturing.setColumnCount(4)
        self.panel_capturing.setObjectName("panel_capturing")
        item = QtWidgets.QTableWidgetItem()
        self.panel_capturing.setHorizontalHeaderItem(0, item)
        item = QtWidgets.QTableWidgetItem()
        self.panel_capturing.setHorizontalHeaderItem(1, item)
        item = QtWidgets.QTableWidgetItem()
        self.panel_capturing.setHorizontalHeaderItem(2, item)
        item = QtWidgets.QTableWidgetItem()
        self.panel_capturing.setHorizontalHeaderItem(3, item)
        self.gridLayout.addWidget(self.panel_capturing, 4, 1, 4, 1)
        self.label = QtWidgets.QLabel(self.centralwidget)
        sizePolicy = QtWidgets.QSizePolicy(QtWidgets.QSizePolicy.Fixed, QtWidgets.QSizePolicy.Fixed)
        sizePolicy.setHorizontalStretch(0)
        sizePolicy.setVerticalStretch(0)
        sizePolicy.setHeightForWidth(self.label.sizePolicy().hasHeightForWidth())
        self.label.setSizePolicy(sizePolicy)
        self.label.setLayoutDirection(QtCore.Qt.LeftToRight)
        self.label.setAutoFillBackground(False)
        self.label.setText("")
        path = os.path.dirname(os.path.abspath(__file__))
        path = path + r'\icons'
        self.label.setPixmap(QtGui.QPixmap(os.path.join(path, 'logo.jpg')))
        self.label.setScaledContents(True)
        self.label.setAlignment(QtCore.Qt.AlignCenter)
        self.label.setObjectName("label")
        self.gridLayout.addWidget(self.label, 1, 1, 1, 1)
        spacerItem6 = QtWidgets.QSpacerItem(10, 20, QtWidgets.QSizePolicy.Minimum, QtWidgets.QSizePolicy.Fixed)
        self.gridLayout.addItem(spacerItem6, 2, 1, 1, 1)
        self.panel_testing = QtWidgets.QListWidget(self.centralwidget)
        sizePolicy = QtWidgets.QSizePolicy(QtWidgets.QSizePolicy.Expanding, QtWidgets.QSizePolicy.Preferred)
        sizePolicy.setHorizontalStretch(0)
        sizePolicy.setVerticalStretch(0)
        sizePolicy.setHeightForWidth(self.panel_testing.sizePolicy().hasHeightForWidth())
        self.panel_testing.setSizePolicy(sizePolicy)
        self.panel_testing.setVerticalScrollBarPolicy(QtCore.Qt.ScrollBarAlwaysOn)
        self.panel_testing.setHorizontalScrollBarPolicy(QtCore.Qt.ScrollBarAsNeeded)
        self.panel_testing.setObjectName("panel_testing")
        self.gridLayout.addWidget(self.panel_testing, 9, 1, 1, 1)
        self.progressBar = QtWidgets.QProgressBar(self.centralwidget)
        self.progressBar.setProperty("value", 0)
        self.progressBar.setObjectName("progressBar")
        self.gridLayout.addWidget(self.progressBar, 10, 1, 1, 2)
        # ----------------------------------------------------------------- #
        self.panel_result = QtWidgets.QTableWidget(self.centralwidget)
        sizePolicy = QtWidgets.QSizePolicy(QtWidgets.QSizePolicy.Preferred, QtWidgets.QSizePolicy.Preferred)
        sizePolicy.setHorizontalStretch(10)
        sizePolicy.setVerticalStretch(0)
        sizePolicy.setHeightForWidth(self.panel_result.sizePolicy().hasHeightForWidth())
        self.panel_result.setSizePolicy(sizePolicy)
        self.panel_result.setRowCount(0)
        self.panel_result.setColumnCount(4)
        self.panel_result.setObjectName("panel_result")
        item = QtWidgets.QTableWidgetItem()
        self.panel_result.setHorizontalHeaderItem(0, item)
        item = QtWidgets.QTableWidgetItem()
        self.panel_result.setHorizontalHeaderItem(1, item)
        item = QtWidgets.QTableWidgetItem()
        self.panel_result.setHorizontalHeaderItem(2, item)
        item = QtWidgets.QTableWidgetItem()
        self.panel_result.setHorizontalHeaderItem(3, item)
        self.gridLayout.addWidget(self.panel_result, 4, 2, 6, 1)
        # ----------------------------------------------------------------- #
        MainWindow.setCentralWidget(self.centralwidget)
        self.menubar = QtWidgets.QMenuBar(MainWindow)
        self.menubar.setGeometry(QtCore.QRect(0, 0, 908, 26))
        self.menubar.setObjectName("menubar")
        self.menuFile = QtWidgets.QMenu(self.menubar)
        self.menuFile.setObjectName("menuFile")
        self.menuAbout = QtWidgets.QMenu(self.menubar)
        self.menuAbout.setObjectName("menuAbout")
        MainWindow.setMenuBar(self.menubar)
        self.statusbar = QtWidgets.QStatusBar(MainWindow)
        self.statusbar.setObjectName("statusbar")
        MainWindow.setStatusBar(self.statusbar)
        self.actionNew = QtWidgets.QAction(MainWindow)
        self.actionNew.setObjectName("actionNew")
        self.actionOpen = QtWidgets.QAction(MainWindow)
        self.actionOpen.setObjectName("actionOpen")
        self.actionExit = QtWidgets.QAction(MainWindow)
        self.actionExit.setObjectName("actionExit")
        self.actionHelp = QtWidgets.QAction(MainWindow)
        self.actionHelp.setObjectName("actionHelp")
        self.menuFile.addAction(self.actionNew)
        self.menuFile.addAction(self.actionOpen)
        self.menuFile.addSeparator()
        self.menuFile.addAction(self.actionExit)
        self.actionExit.triggered.connect(qApp.quit)
        self.menuAbout.addAction(self.actionHelp)
        self.menubar.addAction(self.menuFile.menuAction())
        self.menubar.addAction(self.menuAbout.menuAction())
        self.retranslateUi(MainWindow)
        QtCore.QMetaObject.connectSlotsByName(MainWindow)
    def retranslateUi(self, MainWindow):
        # the start of this method (window title and the capture-panel headers)
        # falls on a page break and is missing from the source listing
        _translate = QtCore.QCoreApplication.translate
        item = self.panel_result.horizontalHeaderItem(0)
        item.setText(_translate("MainWindow", "Packet #"))
        item = self.panel_result.horizontalHeaderItem(1)
        item.setText(_translate("MainWindow", "Protocol"))
        item = self.panel_result.horizontalHeaderItem(2)
        item.setText(_translate("MainWindow", "Service"))
        item = self.panel_result.horizontalHeaderItem(3)
        item.setText(_translate("MainWindow", "Class"))
        # ---------------------------------------------------- #
        self.menuFile.setTitle(_translate("MainWindow", "File"))
        self.menuAbout.setTitle(_translate("MainWindow", "About"))
        self.actionNew.setText(_translate("MainWindow", "New"))
        self.actionOpen.setText(_translate("MainWindow", "Open"))
        self.actionExit.setText(_translate("MainWindow", "Exit"))
        self.actionHelp.setText(_translate("MainWindow", "Help"))

if __name__ == "__main__":
    import sys
    app = QtWidgets.QApplication(sys.argv)
    MainWindow = QtWidgets.QMainWindow()
    ui = Ui_MainWindow()
    ui.setupUi(MainWindow)
    MainWindow.show()
    sys.exit(app.exec_())
Conclusion:
Network traffic logs describe patterns of behaviour in network traffic, whether
intrusive or normal. The decision tree technique is well suited to extracting the
intrusion characteristics of network traffic logs for an IDS, and it is implemented
together with the genetic algorithm for prevention. This technique is also efficient at
optimizing firewall rules, for example by avoiding redundancy.
REFERENCES:
[1] R. M. A. Ujjan, Z. Pervez, K. Dahal, A. K. Bashir, R. Mumtaz, and J. González,
"Towards sFlow and adaptive polling sampling for deep learning based DDoS
detection in SDN," Future Generation Computer Systems, vol. 111, pp. 763-779,
2020, doi: 10.1016/j.future.2019.10.015.
[2] "Software Defined Networking Definition."
https://www.opennetworking.org/sdn-definition/ (accessed March 2, 2020).
[3] S. Garg, K. Kaur, N. Kumar, and J. J. P. C. Rodrigues, "Hybrid Deep-Learning-
Based Anomaly Detection Scheme for Suspicious Flow Detection in SDN: A Social
Multimedia Perspective," IEEE Transactions on Multimedia, vol. 21, no. 3, pp. 566-
578, 2019, doi: 10.1109/tmm.2019.2893549.
[4] M. Nobakht, V. Sivaraman, and R. Boreli, "A Host-Based Intrusion Detection and
Mitigation Framework for Smart Home IoT Using OpenFlow," presented at the 2016
11th International Conference on Availability, Reliability and Security (ARES), 2016.
[5] M. S. Elsayed, N. Le-Khac, S. Dev, and A. D. Jurcut, "Machine Learning
Techniques for Detecting Attacks in SDN," in 2019 IEEE 7th International Conference
on Computer Science and Network Technology (ICCSNT), 19-20 Oct. 2019,
pp. 277-281, doi: 10.1109/ICCSNT47585.2019.8962519.
Publication
INTRUSION DETECTION IN SOFTWARE DEFINED NETWORK USING MACHINE LEARNING
Submission date: 02-Mar-2022 09:59AM (UTC-0600)
Submission ID: 1649798424
File name: CTION_IN_SOFTWARE_DEFINED_NETWORK_USING_MACHINE_LEARNING_1.docx (365.3K)
Word count: 2100
Character count: 11888
INTRUSION DETECTION IN SOFTWARE DEFINED NETWORK USING MACHINE LEARNING
ORIGINALITY REPORT
Similarity index: 2% (internet sources 1%, publications 1%, student papers 1%)
PRIMARY SOURCES
1. Submitted to Aston University (student paper): 1%
2. opus.lib.uts.edu.au (internet source): 1%
3. Peng Cui, "A Tighter Analysis of Set Cover Greedy Algorithm for Test Set",
Lecture Notes in Computer Science, 2007 (publication): <1%
4. Bambang Susilo, Riri Fitri Sari, "Intrusion Detection in Software Defined Network
Using Deep Learning Approach", 2021 IEEE 11th Annual Computing and
Communication Workshop and Conference (CCWC), 2021 (publication): <1%
5. thesai.org (internet source): <1%
Exclude quotes: On. Exclude matches: Off.
Detection of Attacks (DoS, Probe) Using Genetic Algorithm
ABSTRACT:
The intrusion detection system (IDS) is currently of great interest as an important part of
system security. The IDS collects traffic data from the network or host and then uses it to
improve security. Attacks are usually very difficult and time-consuming to separate from
normal activities: to monitor the network connection, the analyst must review all of the
data, which is large and wide-ranging. Consequently, a network analysis technique is
needed to determine the character of the traffic. In this study, a new method for building
IDS detectors was developed by applying data mining techniques on a computing
platform. The rules are derived by a decision tree classifier; they can be used to determine
the nature of an attack and are then applied in the genetic algorithm for prevention, so
that besides detecting an attack it is possible to take steps to prevent and deny it.
Keywords- Intrusion detection, K-Nearest Neighbor, Naive Bayes, Decision Trees, Support
Vector Machine, Prediction
Proposed System:
Genetic algorithms are one of the most widely used search techniques. Some of the most
widely used algorithms:
• K-Nearest Neighbour
• Decision tree / Random forest
• Support vector machines
• Intercession

HARDWARE REQUIREMENTS:
System: Pentium i3 processor
HDD: 500 GB
Screen: 15'' LED
Devices: keyboard, mouse
RAM: 2 GB

SOFTWARE REQUIREMENTS:
Software: Windows 10
Language: Python

BLOCK DIAGRAM:
FLOW DIAGRAM:

Decision tree
Introduction
So far we have gone back and forth, and it has been hard to understand. Now let us start
with the decision tree; I promise you it may well be the simplest algorithm in machine
learning. There is not much to it. It is one of the most widely used and practical methods
in machine learning, since it is easy to use and to explain.
What is a Decision Tree?
It is a tool with applications in many different places. A decision tree can be used for
classification problems. The name itself suggests that it uses a tree-like plan to show
predictions, following the order in which the data is divided. It starts at the root and
ends with a decision at the leaves. Before we study the decision tree, let us review a
few terms.
Root Node – the node at the top, at the start of the decision tree, where the population
begins to be divided according to various features.
Decision Nodes – the nodes we see after splitting the root.
Leaf Nodes – indivisible nodes, also called leaves.
Sub-tree – a subsection of the decision tree, in the way that part of a graph is a sub-graph.
Pruning – cutting off nodes to stop overfitting.

MODULES:
Dataset collection
Data Cleaning
Feature Extraction
Model training
Testing model
Performance Evaluation
Prediction

Dataset collection:
Data collection helps you find ways of tracking past events, using data analysis to record
them. This allows you to predict trends and build predictive models with machine learning
tools to anticipate future changes. Since a predictive model is only as good as the data it
receives, the best way to collect data is to improve its quality. The data should be faultless
(free of garbage and outliers) and should contain information relevant to the work you are
doing. For example, a non-performing loan may not be informative through the amount
received, yet may be informative through fuel costs over time. In this module we collect
the data from the Kaggle dataset; these figures contain data on yearly differences.

Data cleaning:
Data cleaning is an important part of all machine learning work. The data cleaning in this
module prepares the information by removing and transforming wrong, incomplete,
misleading or deceptive data. You can use it to search through the data and discover what
cleaning can be done.

Feature Extraction:
Feature extraction reduces the raw data to informative properties (also called feature
vectors). The first part is called feature selection: the selected features should contain
enough information about the input data that they can fulfil the desired role using this
representation rather than the complete data.