
CSE35 Project Report

This document describes a project submitted for a Bachelor's degree in computer science and engineering. The project aims to develop a method for detecting denial of service (DoS) and probe attacks on a network using a genetic algorithm. It was completed by two students, Avula Venkata Srinadh Reddy and Boddu Prasanth Reddy, under the supervision of Dr. L. Sujihelen from the Department of Computer Science and Engineering at Sathyabama Institute of Science and Technology. The project involved reviewing relevant literature, describing the experimental methods and algorithms used, presenting the results and performance analysis, and writing code for an intrusion detection application.

Uploaded by

Srinadh Reddy

Detection of Attacks (DoS, Probe) Using Genetic Algorithm

Submitted in partial fulfillment of the requirements for


The award of
Bachelor of Engineering Degree in Computer Science and Engineering
By
AVULA VENKATA SRINADH REDDY (Reg. No. 38110061)
BODDU PRASANTH REDDY (Reg. No. 38110422)

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


SCHOOL OF COMPUTING
SATHYABAMA
INSTITUTE OF SCIENCE AND TECHNOLOGY
(DEEMED TO BE UNIVERSITY)
Accredited with Grade “A” by NAAC
JEPPIAAR NAGAR, RAJIV GANDHI SALAI,
CHENNAI - 600 119
April – 2022

SATHYABAMA
INSTITUTE OF SCIENCE AND TECHNOLOGY
(DEEMED TO BE UNIVERSITY)
Accredited with Grade “A” by NAAC
(Established under Section 3 of UGC Act, 1956)
JEPPIAAR NAGAR, RAJIV GANDHI SALAI, CHENNAI-600119
www.sathyabama.ac.in

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

BONAFIDE CERTIFICATE

This is to certify that this Project Report is the bonafide work of AVULA
VENKATA SRINADH REDDY (REG. NO. 38110061) and BODDU PRASANTH
REDDY (REG. NO. 38110422), who carried out the project entitled “Detection of
Attacks (DoS, Probe) Using Genetic Algorithm” under our supervision from Nov
2021 to April 2022.

Internal Guide
Dr. L. SUJIHELEN, M.E., Ph.D.,

Head of the Department


Dr. S. VIGNESHWARI M.E., Ph.D.,

Submitted for Viva voce Examination held on

Internal Examiner External Examiner

DECLARATION

We, AVULA VENKATA SRINADH REDDY (Reg. No. 38110061) and BODDU
PRASANTH REDDY (Reg. No. 38110422), hereby declare that the Project
Report entitled “Detection of Attacks (DoS, Probe) Using Genetic Algorithm”,
done by us under the guidance of Dr. L. SUJI HELEN, M.E., Ph.D., is
submitted in partial fulfilment of the requirements for the award of the Bachelor of
Engineering degree in Computer Science and Engineering.

DATE:
PLACE: CHENNAI SIGNATURE OF THE CANDIDATE

ACKNOWLEDGEMENT

We are pleased to express our sincere thanks to the Board of Management of
SATHYABAMA for their kind encouragement in doing this project and for
completing it successfully. We are grateful to them.

We convey our thanks to Dr. T. SASIKALA, M.E., Ph.D., Dean, School of
Computing, and Dr. S. VIGNESHWARI, M.E., Ph.D., Head of the Department,
Department of Computer Science and Engineering, for providing us with the necessary
support and details at the right time during the progressive reviews.

We would like to express our sincere and deep sense of gratitude to our Project
Guide, Dr. L. SUJI HELEN, M.E., Ph.D., whose valuable guidance, suggestions
and constant encouragement paved the way for the successful completion of our
project work.

We wish to express our thanks to all Teaching and Non-teaching staff members of
the Department of COMPUTER SCIENCE AND ENGINEERING who were
helpful in many ways for the completion of the project.

ABSTRACT

Intrusion detection systems (IDSs) are currently drawing a great amount of
interest as a key part of system defence. IDSs collect network traffic information
from some point on the network or computer system and then use this
information to secure the network. Distinguishing intrusive from normal network
traffic is very difficult and time-consuming: an analyst must review a large and
wide-ranging volume of data to find the sequence of an intrusion on a network
connection. A method is therefore needed that can detect network intrusions in a
way that reflects current network traffic. In this study, a novel method is proposed
for finding intrusion characteristics for an IDS using a genetic algorithm, a machine
learning and data mining technique. Rules are generated by classification using a
genetic algorithm together with a decision tree. These rules determine the intrusion
characteristics and are then implemented in the genetic algorithm as prevention, so
that, besides detecting the existence of an intrusion, the system can also act by
denying it.

TABLE OF CONTENTS

Chapter No.   Title

              Abstract
              List of Figures
              List of Tables
1             Introduction
1.1           Background
1.2           Existing System
1.3           Problem Statement
1.4           Objective
1.5           Evaluation and Results
2             Aim and Scope of Present Investigation
2.1           Literature Review
2.2           Designing a Network Intrusion Detection System Based on Machine
              Learning for Software Defined Networks
2.3           A Deep Learning Approach for Network Intrusion Detection System
2.4           Intrusion Preventing System using Intrusion Detection System
              Decision Tree Data Mining
3             Experimental Methods and Algorithms Used
3.1           Machine Learning Scope
3.1.1         Supervised Machine Learning
3.1.2         Unsupervised Machine Learning
3.2           Decision Tree
3.3           Genetic Algorithm
3.4           Block and Flow Diagram
4             Results, Discussion and Performance Analysis
4.1           Requirements
4.2           Modules
4.3           Results
4.4           Code
              Conclusion
              References
              Publication

List of Figures

Figure no.    Title
3.2.1         Decision Tree
3.3.1         Genetic Algorithm steps
3.3.2         Genetic Algorithm Application
3.4.1         Block diagram
3.4.2         Flow Diagram
4.2.1         Module Flow Chart
4.2.2         Data Collection
4.2.3         Training Set
4.2.4         Validation
4.2.5         Prediction
4.3.1         Intrusion detection Application
4.3.2         Training dataset
4.3.3         Test All attacks dataset
4.3.4         All attacks Plot Graph
4.3.5         Test Normal Attack Dataset
4.3.6         Normal Attack Plot Graph

List of Tables

Table no.     Title
4.3.1         Count of Attacks

CHAPTER 1
INTRODUCTION
Approaches for intrusion detection can be broadly divided into two types: misuse
detection and anomaly detection. In a misuse detection system, all known types of
attacks (intrusions) can be detected by looking for predefined intrusion patterns
in system audit traffic. In anomaly detection, the system first learns a normal
activity profile and then flags all system events that do not match the
established profile. The main advantage of misuse detection is its high detection
rate for known attacks; its drawback is the difficulty of finding new or unforeseen
attacks. The advantage of anomaly detection lies in its ability to identify novel (or
unforeseen) attacks, at the expense of a high false positive rate. Network monitoring
based on machine learning techniques has been applied in diverse fields. For example,
a social media network monitoring system using bi-directional long short-term memory
neural networks has been proposed for analysing and detecting traffic accidents.
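
The contrast between misuse detection and anomaly detection described in this chapter can be illustrated with a minimal Python sketch. The signature names, the single "connections per second" feature and the three-standard-deviation threshold are all hypothetical choices made for illustration; they are not taken from the project's implementation.

```python
from statistics import mean, stdev

# Hypothetical signature database for misuse detection (illustrative names only).
KNOWN_SIGNATURES = {"syn_flood", "port_sweep"}

def misuse_detect(event_pattern):
    """Flag an event only if it matches a predefined intrusion pattern."""
    return event_pattern in KNOWN_SIGNATURES

def build_profile(normal_values):
    """Learn a normal activity profile as (mean, standard deviation)."""
    return mean(normal_values), stdev(normal_values)

def anomaly_detect(value, profile, k=3.0):
    """Flag any event that deviates more than k standard deviations from the profile."""
    mu, sigma = profile
    return abs(value - mu) > k * sigma

# Assumed training data: connections per second observed during normal operation.
profile = build_profile([10, 12, 11, 9, 13, 10, 11])

print(misuse_detect("syn_flood"))    # known pattern: detected
print(misuse_detect("new_exploit"))  # unseen pattern: missed by misuse detection
print(anomaly_detect(95, profile))   # far from the normal profile: flagged
print(anomaly_detect(12, profile))   # within the normal profile: not flagged
```

The sketch shows the trade-off stated above: the signature lookup cannot flag a pattern it has never seen, while the profile-based check flags anything unusual, including unusual but harmless activity.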
1.1 BACKGROUND
The proposed method retrieves traffic-related information from social media
(Facebook and Twitter) using query-based crawling: this process collects sentences
related to any traffic events, such as jams, road closures, etc. Subsequently, several
pre-processing techniques are carried out, such as stemming, tokenization, POS
tagging and segmentation, in order to transform the retrieved data into structured
form. Thereafter, the data are automatically labelled as 'traffic' or 'non-traffic', using a
latent Dirichlet allocation (LDA) algorithm. Traffic-labelled data are classified into
three polarity types: positive, negative, and neutral. The output from this stage is a
sentence labelled according to whether it is traffic or non-traffic, and with the polarity
of that traffic sentence (positive, negative or neutral). Then, using the bag-of-words
(BoW) technique, each sentence is transformed into a one-hot encoding representation
in order to feed it to the bi-directional LSTM neural network (Bi-LSTM). After the
learning process, the neural networks perform multi-class classification using the
softmax layer in order to classify the sentence in terms of location, traffic event and
polarity types. The proposed method compares different classical machine learning
and advanced deep learning approaches in terms of accuracy, F-score and other
criteria.
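
The bag-of-words encoding step mentioned above can be sketched as follows. This is a simplified illustration that assumes whitespace tokenization and a binary (presence or absence) vector per sentence; the function names and example sentences are hypothetical, and the stemming, POS tagging and LDA labelling stages are not reproduced.

```python
def tokenize(sentence):
    """Lower-case the sentence and split it on whitespace."""
    return sentence.lower().split()

def build_vocabulary(sentences):
    """Assign each distinct token an index in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for token in tokenize(sentence):
            vocab.setdefault(token, len(vocab))
    return vocab

def bag_of_words(sentence, vocab):
    """Return a binary vector marking which vocabulary tokens occur in the sentence."""
    vector = [0] * len(vocab)
    for token in tokenize(sentence):
        if token in vocab:
            vector[vocab[token]] = 1
    return vector

corpus = ["road closed near bridge", "heavy traffic jam on main road"]
vocab = build_vocabulary(corpus)
print(bag_of_words("traffic jam near bridge", vocab))
```

Each resulting vector has one position per vocabulary word, which is the fixed-length form a neural network input layer expects.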

1.2 EXISTING SYSTEM


Today, the network has become an essential part of public infrastructure with the
inception of public and private cloud computing. The traditional networking approach
has become too complex. This complexity has created a barrier to new and innovative
services within a single data centre, difficulties in interconnecting data centres and
enterprises, and an even bigger barrier to the continued growth of the Internet in
general.

1.3 PROBLEM STATEMENT

• Distinguishing intrusive from normal network traffic is very difficult and
time-consuming.
• An analyst must review a large and wide-ranging volume of data to find the sequence
of an intrusion on a network connection.
• A method is needed that can detect network intrusions in a way that reflects current
network traffic.
• Combining an IDS with a firewall yields a so-called intrusion prevention system (IPS),
which, besides detecting the existence of an intrusion, can also act by denying it.

1.4 OBJECTIVE
• The primary purposes of an IDS deployment are to reduce risk, identify errors,
optimize network use, provide insight into threat levels, and change user behavior.
Thus, an IDS provides more than just intrusion detection.
• Multi-objective genetic algorithms (GAs) are developed specifically for problems with
multiple objectives. They differ from traditional GAs primarily by using specialized
fitness functions and introducing methods to promote solution diversity.
• The goal of the decision tree algorithm is to create a model that predicts the value of
a target variable. The decision tree uses a tree representation to solve the problem, in
which each leaf node corresponds to a class label and attributes are represented on
the internal nodes of the tree.
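
To make the two ideas in the objectives concrete, the following minimal sketch uses a basic single-objective genetic algorithm (not the multi-objective variant mentioned above) to evolve the threshold of a one-node decision tree. The toy data, the single "connections per second" attribute, and all parameter values are hypothetical and chosen only for illustration; this is not the project's actual code.

```python
import random

# Hypothetical toy records: (connections_per_second, label), 1 = attack, 0 = normal.
DATA = [(3, 0), (5, 0), (8, 0), (12, 0), (15, 0),
        (40, 1), (55, 1), (70, 1), (90, 1), (120, 1)]

def classify(threshold, value):
    """One-node decision tree: the internal node tests the attribute,
    and each leaf returns a class label."""
    return 1 if value > threshold else 0

def fitness(threshold):
    """Fitness function: fraction of records classified correctly."""
    return sum(classify(threshold, x) == y for x, y in DATA) / len(DATA)

def evolve(generations=30, pop_size=20, mutation_scale=5.0, seed=0):
    """Evolve a threshold with selection, crossover and mutation."""
    rng = random.Random(seed)
    population = [rng.uniform(0, 150) for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the fitter half, so the best individual always survives.
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = (a + b) / 2                     # arithmetic crossover
            child += rng.gauss(0, mutation_scale)   # Gaussian mutation
            children.append(child)
        population = parents + children
    return max(population, key=fitness)

best = evolve()
print(best, fitness(best))
```

Any threshold between the largest normal value and the smallest attack value separates this toy data perfectly, so the search has an easy target; real traffic features overlap, which is where a richer rule encoding and fitness function would matter.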

1.5 EVALUATION & RESULTS


Similar to most existing deep learning research, our proposed classification model
(Section 4.2) was implemented using TensorFlow. All of our evaluations were
performed using GPU-enabled TensorFlow running on a 64-bit Ubuntu 16.04 LTS
PC with an Intel Xeon 3.60 GHz processor, 16 GB RAM and an NVIDIA GTX 750
GPU.
To perform our evaluations, we have used the KDD Cup'99 and NSL-KDD datasets.
Both of these datasets are considered benchmarks within NIDS research.
Furthermore, using these datasets assists in drawing comparisons with existing
methods and research. Throughout this section, we will be using the metrics defined
below:
• True Positive (TP) - Attack data that is correctly classified as an attack.
• False Positive (FP) - Normal data that is incorrectly classified as an attack.
• True Negative (TN) - Normal data that is correctly classified as normal.
• False Negative (FN) - Attack data that is incorrectly classified as normal.

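The four counts defined above can be computed directly from true and predicted labels. The sketch below assumes labels are encoded as 1 for attack and 0 for normal; the derived ratios (accuracy, detection rate, false alarm rate, precision) use their standard definitions and are included as an illustration, since the report itself lists only the four counts at this point.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN, assuming labels are 1 (attack) or 0 (normal)."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        elif actual == 0 and predicted == 0:
            tn += 1
        else:  # actual == 1 and predicted == 0
            fn += 1
    return tp, fp, tn, fn

def ratio(numerator, denominator):
    """Return 0.0 instead of raising when a class is absent."""
    return numerator / denominator if denominator else 0.0

def summary_metrics(tp, fp, tn, fn):
    """Standard ratios derived from the four counts."""
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "detection_rate": ratio(tp, tp + fn),    # also called recall
        "false_alarm_rate": ratio(fp, fp + tn),
        "precision": ratio(tp, tp + fp),
    }

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
print(counts, summary_metrics(*counts))
```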
CHAPTER 2
AIM AND SCOPE OF THE PRESENT INVESTIGATION

2.1 LITERATURE REVIEW


Designing a Network Intrusion Detection System Based on Machine Learning for
Software Defined Networks
Software-defined networking (SDN) has recently emerged as a feasible and
promising solution for the future of the Internet. Networks are managed,
centralized, monitored and adjusted using SDN. These advantages, however,
bring security threats such as network breakdowns, system paralysis, online
banking fraud, and robbery. These issues can have a detrimental effect on
families, organizations, and the economy. Accuracy, high performance, and
real-time operation are essential to countering them. The extension of intelligent
machine learning algorithms into the network intrusion detection system (NIDS)
through a software-defined network (SDN) has attracted considerable interest over
the previous decade. The availability of data, the diversity of data analysis
techniques, and the many advances in machine learning algorithms help to build
a better, more reliable, and dependable system for detecting the various types of
network attacks. This work is reviewed here as part of the survey of NIDS in
SDN.
2.2 Designing a Network Intrusion Detection System Based on Machine
Learning for Software Defined Networks
Software-defined Networking (SDN) has recently developed and been put forward as
a promising and encouraging solution for future internet architecture. Using SDN, the
managed, centralized and controlled network has become more flexible and visible.
On the other hand, these advantages bring us a more vulnerable environment and
dangerous threats, causing network breakdowns, systems paralysis, online banking
frauds and robberies. These issues have a significantly destructive impact on
organizations, companies or even economies. Accuracy, high performance and real-
time systems are essential to achieve this goal successfully. Extending intelligent
machine learning algorithms in a network intrusion detection system (NIDS) through a
software defined network (SDN) has attracted considerable attention in the last
decade. Big data availability, the diversity of data analysis techniques, and the
massive improvement in the machine learning algorithms enable the building of an
effective, reliable and dependable system for detecting different types of attacks that
frequently target networks. This study demonstrates the use of machine learning
algorithms for traffic monitoring to detect malicious behaviour in the network as part of
NIDS in the SDN controller. Different classical and advanced tree-based machine
learning techniques, Decision Tree, Random Forest and XGBoost are chosen to
demonstrate attack detection. The NSL-KDD dataset is used for training and testing
the proposed methods; it is considered a benchmarking dataset for several state-of-
the-art approaches in NIDS. Several advanced pre-processing techniques are
performed on the dataset in order to extract the best form of the data, which produces
outstanding results compared to other systems. Using just five out of 41 features of
NSL-KDD, a multi-class classification task is conducted by detecting whether there is
an attack and classifying the type of attack (DDoS, PROBE, R2L, and U2R),
accomplishing an accuracy of 95.95%.
A network intrusion detection system is a process for discovering the existence of
malicious or unwanted packets in the network. This process is done using real-time
traffic monitoring to find out if any unusual behaviour is present in the network or not.
Big data, powerful computation facilities, and the expansion of the network size
increase the demand for the required tasks that should be carried out simultaneously
in real-time. Therefore, NIDS should be careful, accurate, and precise in monitoring,
which has not been the case in the traditional methods. On the other hand, the rapid
increase in the accuracy of machine learning algorithms is highly impressive. Its
introduction relies on the increasing demand for improved performance on different
types of network. However, software defined network (SDN) implementation of the
network-based intrusion detection system (NIDS) has opened a frontier for its
deployment, considering the increasing scope and typology of security risks of
modern networks. The rapid growth in the volume of network data and connected
devices carries inherent security risks. The adoption of technologies such as the
Internet of Things (IoT), artificial intelligence (AI), and quantum computing, has
increased the threat level, making network security challenging and necessitating a
new paradigm in its implementation. Various attacks have overwhelmed previous
approaches (classified into signature-based and anomaly-based intrusion detection
systems), increasing the need for advanced, adaptable and resilient security
implementation. For this reason, the traditional network design platform is being
transformed into the evolving SDN implementation.

Monitoring data and analysing it over time are essential to the process of predicting
future events, such as risks, attacks and diseases. The more details are formed,
discovered and documented by analysing very large-scale data, the more resources are
saved and the more likely the working environment is to remain normal, without
variations. Big data analytics (BDA) research in the supply chain has become key to
managing and preventing risks. BDA for humanitarian supply chains can aid the
donors in their decision of what is appropriate in situations such as disasters, where it
can improve the response and minimize human suffering and deaths. BDA and data
monitoring using machine learning can help in identifying and understanding the
interrelationships between the reasons, difficulties, obstacles and barriers that guide
organizations in taking the most efficient and accurate decisions in risk management
processes. This could impact entire organizations and countries, producing a hugely
significant improvement in the process.

Network monitoring based on machine learning techniques has been applied in diverse
fields. Using bi-directional long short-term memory neural networks, a social media
network monitoring system is proposed for analysing and detecting traffic accidents.
The proposed method retrieves traffic-related information from social media (Facebook
and Twitter) using query-based crawling: this process collects sentences related to any
traffic events, such as jams, road closures, etc. Subsequently, several pre-processing
techniques are carried out, such as stemming, tokenization, POS tagging and
segmentation, in order to transform the retrieved data into structured form. Thereafter,
the data are automatically labelled as 'traffic' or 'non-traffic', using a latent Dirichlet
allocation (LDA) algorithm. Traffic-labelled data are classified into three polarity types:
positive, negative, and neutral. The output from this stage is a sentence labelled
according to whether it is traffic or non-traffic, and with the polarity of that traffic
sentence (positive, negative or neutral). Then, using the bag-of-words (BoW) technique,
each sentence is transformed into a one-hot encoding representation in order to feed it
to the bi-directional LSTM neural network (Bi-LSTM). After the learning process, the
neural networks perform multi-class classification using the softmax layer in order to
classify the sentence in terms of location, traffic event and polarity types. The proposed
method compares different classical machine learning and advanced deep learning
approaches in terms of accuracy, F-score and other criteria.

Many initiatives and workshops have been conducted in order to improve and develop
healthcare systems using machine learning, such as [12]. In these workshops several
machine learning algorithms have been used, such as K-Nearest Neighbours, logistic
regression, K-means clustering, Random Forest (RF) etc., together with deep learning
algorithms such as CNN, RNN, fully connected layers and auto-encoders. This variety of
techniques allows researchers to deal with several data types, such as medical imaging,
history, medical notes, video data, etc. Therefore, different topics and applications have
been introduced with significant performance results, such as causal inference in
investigations of Covid-19 and the prediction of diseases such as disorders and heart
diseases.
Using intelligent ensemble deep learning methods, healthcare monitoring is carried
out for the prediction of heart diseases. Real-time health status monitoring can predict
and prevent heart attacks before they occur. For disease prediction, the proposed
ensemble deep learning approach achieved an accuracy of 98.5%. The proposed model
takes two types of data that are transferred and saved on an online cloud database. The
first is the data transferred from sensors placed at different locations on the body in
order to extract more than 10 different types of medical data. The second type is the
daily electronic medical records from doctors, which include various types of data, such
as smoking history, family diseases, etc. The features are fused using the feature fusion
Framingham Risk factors technique, which executes two tasks at a time: fusing the
data together, and then extracting a fused and informative feature from this data.
Then different pre-processing techniques, such as normalization, missing values
filtering and feature weighting, are used to transform the data into a structured and
well-prepared form. Subsequently, an ensemble deep learning algorithm learns from the
data in order to predict whether a heart disease will occur or the threat is absent.

IDS refers to a mechanism capable of identifying or detecting intrusive
activities. In a broader view, this encompasses all the processes used in the
discovery of unauthorized uses of network devices or computers. This is achieved
through software designed specifically to detect unusual or abnormal activities. IDS
can be classified according to several surveys and sources in the literature into four
types (HIDS, NIDS, WIDS, NBA). NIDS is an inline or passive-based intrusion
detection technique. The scope of its detection targets network and host levels. The
only architecture that fits and works with NIDS is the managed network. The
advantage of using NIDS is that it costs less and is quicker in response, since there is
no need to maintain sensor programming at the host level. The performance of
monitoring the traffic is close to real-time; NIDS can detect attacks as they occur.
However, it has the following limitations. It does not indicate whether such attacks are
successful, because it has restricted visibility inside the host machine. There is also no
effective way to analyse encrypted network traffic to detect the type of attack.
Moreover, NIDS may have difficulty capturing all packets in a large or busy network.
Thus, it may fail to recognize an attack launched during a period of high traffic.

SDN
provides a novel means of network implementation, stimulating the development of a
new type of network security application. It adopts the concept of programmable
networks through the deployment of logically centralized management. The network
deployment and configuration are virtualized to simplify complex processes, such as
orchestration, network optimization, and traffic engineering. It creates a scalable
architecture that allows sufficient and reliable services based on certain types of
traffic. The global view approach to a network enhances flow-level control of the
underlying layers. Implementing NIDS over SDN becomes a major effective security
defence mechanism for detecting network attacks from the network entry point. NIDS
has been implemented and investigated for decades to achieve optimal efficiency. It
represents an application or device for monitoring network traffic for suspicious or
malicious activity with policy violations. Such activities include malware attacks,
untrustworthy users, security breaches, and DDoS. NIDS focuses on identifying
anomalous network traffic or behaviour; its efficiency means that network anomaly is
adequately implemented as part of the security implementation. Since it is nearly
impossible to prevent threats and attacks, NIDS will ensure early detection and
mitigation. However, the advancement in NIDS has not instilled sufficient confidence
among practitioners, since most solutions still use less capable, signature-based
techniques. This study aims to increase the focus on several points:
• Choosing the right algorithm for the right task, depending on the data types and size
and on the network's behaviour and needs.
• Implementing an optimized development process by preparing and selecting the
benchmark dataset in order to build a promising NIDS.
• Analysing the data and finding, shaping, and engineering the important features, using
several pre-processing techniques stacked together in an intelligent order to find the
best accuracy with the smallest data representation and size.
• Proposing an integrated and complete development process using those algorithms
and techniques, from the selection of the dataset to the evaluation of the algorithms
using different metrics, which can be extended to other NIDS applications.

This study enhances the implementation of NIDS by deploying different machine
learning algorithms over SDN. Tree-based machine learning algorithms (XGBoost,
random forest (RF), and decision tree (DT)) were implemented to enhance the
monitoring and accuracy performance of NIDS. The proposed method is trained
on network traffic packet data, collected from large-scale resources and servers
and known as the NSL-KDD dataset, to perform two tasks at a time: detecting
whether there is an attack, and classifying the type of attack.
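
The two tasks described above (attack or not, and which type of attack) amount to deriving two targets from each record's raw label. The sketch below assumes the attack names commonly documented for the NSL-KDD training set and their usual grouping into DoS, Probe, R2L and U2R; the function name and the "unknown" fallback are illustrative choices, not part of the reviewed study.

```python
# Usual grouping of NSL-KDD training-set attack names into four categories.
ATTACK_CATEGORY = {
    "back": "DoS", "land": "DoS", "neptune": "DoS",
    "pod": "DoS", "smurf": "DoS", "teardrop": "DoS",
    "ipsweep": "Probe", "nmap": "Probe", "portsweep": "Probe", "satan": "Probe",
    "ftp_write": "R2L", "guess_passwd": "R2L", "imap": "R2L", "multihop": "R2L",
    "phf": "R2L", "spy": "R2L", "warezclient": "R2L", "warezmaster": "R2L",
    "buffer_overflow": "U2R", "loadmodule": "U2R", "perl": "U2R", "rootkit": "U2R",
}

def label_tasks(raw_label):
    """Return (is_attack, category) for one raw record label.

    Task 1 is binary detection; task 2 is multi-class attack classification.
    Labels missing from the table are reported as "unknown" (an assumption).
    """
    if raw_label == "normal":
        return False, "normal"
    return True, ATTACK_CATEGORY.get(raw_label, "unknown")

for label in ["normal", "neptune", "satan", "guess_passwd", "rootkit"]:
    print(label, label_tasks(label))
```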

Integrating machine learning algorithms into SDN has attracted significant attention.
In one study, a solution was proposed that solved the issues in KDD Cup 99 by
performing an extensive experimental study, using the NSL-KDD dataset to achieve the
best accuracy in intrusion detection. The experimental study was conducted on five
popular and efficient machine learning algorithms (RF, J48, SVM, CART, and Naïve
Bayes). The correlation feature selection algorithm was used to reduce the complexity
of features, resulting in only 13 features in the NSL-KDD dataset. This study tests the
NSL-KDD dataset's performance for real-world anomaly detection in network
behaviour. The five classic machine learning models (RF, J48, SVM, CART, and Naïve
Bayes) were trained on all 41 features against the five classes (normal and the four
attack types DoS, probe, U2R, and R2L), achieving average accuracies of 97.7%, 83%,
94%, 85%, and 70%, respectively. The same models were trained again using the
reduced 13 features, achieving average accuracies of 98%, 85%, 95%, 86%, and
73%, respectively.

In another study, a deep neural network model was proposed to detect intrusions in
the SDN. The NSL-KDD dataset was used to train and test the model. The neural
network was constructed with five primary layers: one input layer with six inputs, three
hidden layers with (12, 6, 3) neurons, and one output layer with two dimensions. The
proposed method was trained on six features chosen from the 41 features in the
NSL-KDD dataset, which are basic and traffic features that can easily be obtained from
the SDN environment. The proposed method calculates the accuracy, precision and
recall, achieving an F1-score of 0.75. A second evaluation was conducted against seven
classic machine learning models (RF, NB, NB Tree, J48, DT, MLP, and SVM) proposed
in prior work, and the model achieved sixth place out of eight. The same author
extended the approach using a gated recurrent unit neural network (GRU-RNN) for
SDN anomaly detection, achieving accuracy up to 89%. In addition, the Min-Max
normalization technique is used for feature scaling to improve and boost the learning
process.

The SVM classifier, integrated with the principal component analysis (PCA) algorithm,
was used for an intrusion detection application. The NSL-KDD dataset is used in this
approach to train and optimize the model for detecting abnormal patterns. A Min-Max
normalization technique was proposed to handle the diverse scale ranges of the data
with the lowest misclassification errors. The PCA algorithm is selected as a statistical
technique to reduce the NSL-KDD dataset's complexity, reducing the number of
trainable parameters that need to be learned.
The nonlinear radial basis function kernel was chosen for SVM optimization.
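
The Min-Max normalization mentioned in the studies above rescales each feature to the range [0, 1] using x' = (x - min) / (max - min). A minimal sketch follows; the function names and the sample values are hypothetical, and mapping a constant column to zeros is an assumption made here to avoid division by zero.

```python
def min_max_scale(column):
    """Rescale a list of numbers to [0, 1] with x' = (x - min) / (max - min)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Constant feature: no spread to normalize (assumed convention: all zeros).
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

def scale_features(rows):
    """Apply Min-Max scaling to every column of a list of equal-length rows."""
    columns = [min_max_scale(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*columns)]

# Hypothetical records: (duration in seconds, bytes sent).
records = [(0, 100), (5, 300), (10, 500)]
print(scale_features(records))
```

Scaling matters for these datasets because features such as byte counts and durations span very different ranges, and distance-based or gradient-based learners are sensitive to that difference.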
Detection rate (DR), false alarm rate (FAR), and correlation coefficient metrics were
chosen to evaluate the proposed model, with an overall average accuracy of 95%
using 31 features in the dataset. In [32], an extreme gradient-boosting (XGBoost)
classifier was used to distinguish between two attacks, i.e., normal and DoS. The
detection method was analysed and conducted over POX SDN, as a controller, which
is an SDN open-source platform for prototyping and developing a technique based on
SDN. Mininet was used to emulate the network topology to simulate real-time SDN-
based cloud detection. Logistic regression was selected as a learning algorithm, with
a regularization term penalty to prevent overfitting. The XGBoost term was added and
combined with the logistic regression algorithm to boost the computations by
constructing structure trees. The dataset used in this approach was KDD Cup 1999,
while 400 K samples were selected for constructing the training set. Two types of
normalization techniques were used; one with a logarithmic-based technique and one
with a Min-Max-based technique. The average overall accuracy for XGBoost,
compared to RF and SVM, was 98%, 96%, and 97% respectively. Based on DDoS
attack characteristics, a detection system was simulated with the Mininet and FL
floodlight platform using the SVM algorithm [5]. The proposed method categorizes the
characteristics into six tuples, which are calculated from the packet network. These
characteristics are the speed of the source IP (SSIP), the speed of the source port,
the standard deviation of FL flow packets, the deviation of FL flow bytes (SDFB), the
speed of flow entries, and the ratio of pair-FL flow. Based on the calculated statistics
from the SVM classifier’s six characteristics, the current network state is normal or

11
attack. Attack flow (AF), DR, and FAR were chosen to achieve an average accuracy
of 95%. In TSDL, a model with two stages of deep neural networks was designed and
proposed for NIDS, using a stacked auto-encoder integrated with softmax in the
output layer as a classifier. TSDL was designed and implemented for multi-class
classification of attack detection. Down-sampling and other pre-processing
techniques were performed over different datasets in order to improve the detection
rate, as well as the monitoring efficiency. The detection accuracy for UNSW-NB15
was 89.134%. Different models of neural networks, such as variational auto-encoders,
seq2seq structures using Long Short-Term Memory (LSTM) and fully connected
networks were proposed in [34] for NIDS. The proposed approach was designed and
implemented to differentiate between normal and attack packets in the network, using
several datasets, such as NSL-KDD, UNSW-NB15, KYOTO-HONEYPOT, and
MAWILAB. A variety of pre-processing techniques have been used, such as one-hot-
encoding, normalization, etc., for data preparation, feature manipulation and selection
and smooth training in neural networks. Those factors are designed mainly, but not
only, to enable the neural networks to learn complex features from different scopes of
a single packet. Using 4 hidden layers, a deep neural network model [35] was
illustrated and implemented on KDD Cup 99 for monitoring intrusion attacks. Feature
scaling and encoding were used for data pre-processing and lower data usage. More
than 50 features were used to perform this task on different datasets. Therefore,
GPUs were used in order to handle this large number of features
with lower training time. A supervised adversarial auto-encoder neural network [36]
was proposed for NIDS. It combined GANs and a variational auto-encoder. A GAN
consists of two different neural networks competing with each other, known as the
generator and the discriminator. The aim of the competition is to minimize the
objective function as much as possible, using Jensen-Shannon divergence
minimization. The generator tries to generate fake data packets, while the discriminator
determines whether this data is real or fake; in other words, it checks whether a packet is
an attack or normal. In addition, the proposed method integrates a regularization
penalty into the model structure to control overfitting. The results were
reasonable in the detection rate of U2R and R2L but lower in others. Multi-channel
deep learning of features for NIDS was presented in [37], using an AE involving a CNN,
two fully connected layers and a softmax classifier at the output. The evaluation was
done over three different datasets: KDD Cup 99, UNSW-NB15 and CICIDS, with an
average accuracy of 94%. The proposed model provides effective results; however,
the structure and the characteristics of the attack were not highlighted clearly. The
proposed method enhances the implementation of NIDS by deploying machine
learning over SDN. It introduces a machine learning algorithm for network monitoring
within the NIDS implementation on the central controller of the SDN. In this paper,
enhanced tree-based machine learning algorithms are proposed for anomaly
detection. Using only five features, a multi-class classification task is conducted by
detecting whether there is an attack or not and classifying the type of attack.
In this section, we discuss and explain each component and its role in the NIDS
architecture. The NIDS component architecture is built on the SDN architecture, which
can be divided into three main layers, as follows:
• The infrastructure layer consists of two main parts: hardware and software
components. The hardware components are devices such as routers and switches.
The software components are those components that interface with the hardware,
such as OpenFlow switches.
• The control layer is an intelligent network controller, such as an SDN controller. The
control layer is responsible for regulating actions and managing traffic
data by establishing or denying every network flow.
• The application layer is the one that performs all network management tasks. These
tasks can be performed using an SDN controller and NIDS.
Attacks are created by an attacker and delivered through the internet. NIDS is
deployed over the SDN controller. As NIDS listens to the network and actively
compares all traffic against predefined attack signatures, it detects the attacker’s
scanning attempts. It sends an alert to administrators through the controller, and the
connections are blocked by specific rules in the firewall or routers.
This section presents a generalized flowchart of the proposed method. The dataset,
pre-processing techniques, and proposed machine learning algorithms will be
presented and discussed.
In this subsection, a generalized block diagram is presented and discussed. As
shown in the block diagram, the NSL-KDD dataset is used. Data analysis, feature engineering, and other
pre-processing techniques are conducted to train the model, using the best hyper-
parameters, with only five features. Tree-based algorithms are used for the multi-
class classification task. The processed data enter the algorithm and are classified as
to whether they constitute an attack or are normal; then, the type of attack will be
analysed to see which category it belongs to, and action is taken accordingly.
The KDD Cup is the leading data mining competition in the world. The NSL-KDD
dataset was proposed to solve many issues represented in the KDD Cup 1999
dataset. Many researchers have used the NSL-KDD dataset to develop and evaluate
the NIDS problem. The dataset includes all types of attacks. The dataset has 41
features, categorized into three main types (basic feature, content-based, and traffic-
based features) and labelled as either normal or attack, with the attack type precisely
categorized. The categories can be classified into four main groups, with a brief
description of each attack type and its impact.
As stated in the previous subsection, the dataset has 41 features labelled as either
normal or attack with the precise attack category. After experimental trials, five
features were selected out of the 41 features in the NSL-KDD dataset, which have the
most impact and effect on algorithm learning performance. The five selected
features are presented with a brief description of each.
To evaluate the performance of NIDS in terms of accuracy (AC), different metrics
were used: precision (P), recall (R), and F-measure (F). These metrics can be
calculated using confusion matrix parameters: true positive (the number of anomalous
instances that are correctly classified); false positive (the number of normal instances
that are incorrectly classified as anomalous); true negative (the number of normal
instances that are correctly classified); and false negative (the number of anomalous
instances that are incorrectly classified as normal). A good NIDS must achieve a high
DR and a low FAR. Accuracy (AC): the percentage of correctly classified network
activities. Precision (P): the percentage of predicted anomalous instances that are
actual anomalous instances; the higher P, the lower FAR. Recall (R): the percentage
of all attack instances present that are correctly predicted as attacks. F-measure (F):
the harmonic mean of P and R, used to measure the performance of NIDS. We
aim to achieve a high F-score. We compare XGBoost against the other two tree-
based methods, RF and DT, using the test set, which includes the four types of
attacks as discussed. Three different evaluation metrics are computed: F-score,
precision and recall. XGBoost ranked first in the evaluation, with an F1-score of
95.55%, while RF and DT achieved 94.6% and 94.5%, respectively. For precision,
XGBoost outperformed RF and DT with a score of 92%, while RF and DT scored 90%
and 90.2%, respectively. Finally, for recall, our proposed method with XGBoost
proves its stability with a score of 98%, while for RF and DT the results were 82%
and 85%, respectively. From these results, the proposed model with XGBoost
achieves both high precision and high recall. The classifier returns
accurate results, which means high precision, while at the same time returning a majority of all
positive results (an attack occurs and the classifier detects that it is an attack), which
means high recall.
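The metric definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the report's evaluation code: the function name `nids_metrics` is hypothetical, and it assumes the common conventions that DR equals recall and that FAR is FP / (FP + TN), which the text does not spell out.

```python
def nids_metrics(tp, fp, tn, fn):
    """Compute NIDS metrics from confusion-matrix counts.

    Assumptions (not stated in the report): DR is taken to be recall,
    and FAR is taken to be FP / (FP + TN).
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0  # detection rate (DR)
    far = fp / (fp + tn) if (fp + tn) else 0.0     # false alarm rate
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "far": far,
        "f_measure": f_measure,
    }


# Illustrative counts only; these are not results from the report.
print(nids_metrics(tp=90, fp=10, tn=80, fn=20))
```

Because the F-measure is a harmonic mean, it stays low whenever either precision or recall is low, which is why it is reported alongside the two individual scores.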
Finally, we evaluate the proposed method using an accuracy analysis against seven
classical machine learning algorithms, in addition to the deep neural network. The
proposed method achieves an accuracy of 95.55%, while the second-best accuracy
performance is 82.02% for the NB tree, showing a significant difference between the
accuracy of our proposed method and the other approaches. This evaluation confirms
that the proposed method is accurate and robust, even compared against other
algorithms. This shows how the unambiguous steps in our approach are reliable,
effective and authoritative. We conclude that the proposed method achieves a
verifiable result using several techniques. For the precise literature and comparison,
we carefully chose the NSL-KDD data set, which is considered one of the most
powerful benchmark datasets. Several procedures of data statistics, cleaning and
verification are performed on the dataset, which are very important in order to
produce a smooth learning process with no obstacles, such as over- or under-fitting
issues. This stage ensures that the proposed model has unified data and increases
the value of data, which helps in decision-making. Feature normalization and
selection clarifies the path for clear selection and intelligent preferences, using only 5
features. Subsequently, more detailed exploration and various comparisons are
carried out, based on three machine learning algorithms, i.e., DT, RF, and XGBoost,
in order to test their performance with different criteria and then select the best
performing algorithm for our task. This shows that the selection is dependably proven
and technically verified.
Machine-learning-based NIDS in SDN has attracted significant attention in
the last two decades because of the datasets and various algorithms proposed in
machine learning, which use only limited features for better detection of anomalies
and more efficient network security. In this study, the benchmarking dataset NSL-
KDD is used for training and testing. Feature normalization, feature selection and
data pre-processing techniques are used in order to improve and optimize the
algorithm’s performance for accurate prediction, as well as to facilitate a smooth
training process with minimal time and resources. To select the appropriate algorithm,
we compare three classical tree-based machine learning algorithms: Random Forest,
Decision Trees and XGBoost. We examine them using a variety of evaluation metrics
to find the advantages and disadvantages of each. Using six different
evaluation metrics, the proposed XGBoost model outperformed more than seven
algorithms used in NIDS. The proposed method focused on detecting anomalies and
protecting the SDN platform from attacks in real-time scenarios. The proposed
method performed two tasks simultaneously: detecting whether there is an attack,
and determining the type of attack (DoS, Probe, U2R, R2L). In future studies, more
evaluation metrics will be used. We plan to implement the approach using
several deep neural network algorithms, such as Auto-Encoder, Generative
Adversarial Networks, and Recurrent neural networks, such as GRU and LSTM.
These techniques have been proven in the literature to allow convenient anomaly
detection approaches in NIDS applications. Also, we plan to compare these
algorithms against each other and integrate one or more neural network architectures
to extract more details of how we can implement an efficient anomaly detection
system in NIDS, with lower consumption of time and resources. In addition, for a
more solid basis for comparison, several benchmarking cyber security datasets, such
as NSL-KDD, UNSW-NB15, and CIC-IDS2017, will be used, in order to make
sure that the selection of the proposed algorithm is not biased in any situation. These
various datasets are generated in different environments and conditions, so more
complex features will be available, more generalized attacks will be covered and the
accuracy of the proposed algorithm is expected to increase, which could lead to a
state-of-the-art approach.

2.3 A Deep Learning Approach for Network Intrusion Detection System


Network Intrusion Detection Systems (NIDSs) are essential tools for the network
system administrators to detect various security breaches inside an organization’s
network. An NIDS monitors and analyses the network traffic entering into or exiting
from the network devices of an organization and raises alarms if an intrusion is
observed. Based on the methods of intrusion detection, NIDSs are categorized into
two classes: i) signature (misuse) based NIDS (SNIDS), and ii) anomaly detection
based NIDS (ADNIDS). In SNIDS, e.g., Snort, attack signatures are preinstalled in the
NIDS. Pattern matching is performed on the traffic against the installed signatures
to detect an intrusion in the network. In contrast, an ADNIDS classifies network traffic
as an intrusion when it observes a deviation from the normal traffic pattern. SNIDS is
effective in the detection of known attacks and shows high detection accuracy with
lower false-alarm rates. However, its performance suffers during detection of unknown
or new attacks due to the limitation of attack signatures that can be installed
beforehand in an IDS. ADNIDS, on the other hand, is well-suited for the detection of
unknown and new attacks. Although ADNIDS produces high false-positive rates, its
theoretical potential in the identification of novel attacks has caused its wide
acceptance among the research community. There are primarily two challenges that
arise while developing an efficient and flexible NIDS for unknown future attacks. First,
proper feature selection from the network traffic dataset for anomaly detection is
difficult. The features selected for one class of attack may not work well for other
categories of attacks due to continuously changing and evolving attack scenarios.
Second, labelled traffic datasets from real networks are unavailable for developing an
NIDS. Immense efforts are required to produce such a labelled dataset from the raw
network traffic traces collected over a period or in real-time. Additionally, to preserve
the confidentiality of the internal organizational network structure as well as the
privacy of various users, network administrators are reluctant to report any
intrusion that might have occurred in their networks. Various machine learning
techniques have been used to develop ADNIDSs, such as Artificial Neural Networks
(ANN), Support Vector Machines (SVM), Naive Bayes (NB), Random Forests
(RF), and Self-Organizing Maps (SOM). The NIDSs are developed as classifiers to
differentiate the normal traffic from the anomalous traffic. Many NIDSs perform a
feature selection task to extract a subset of relevant features from the traffic dataset
to enhance classification results. Feature selection helps in the elimination of the
possibility of incorrect training through the removal of redundant features and noises.
Recently, deep learning based methods have been successfully applied in audio,
image, and speech processing applications. These methods aim to learn a good
feature representation from a large amount of unlabelled data and subsequently apply
these learned features on a limited amount of labelled data in a supervised
classification. The labelled and unlabelled data may come from different distributions.
However, they must be relevant to each other. It is envisioned that the deep learning
based approaches can help to overcome the challenges of developing an efficient
NIDS. We can collect unlabelled network traffic data from different network sources
and obtain a good feature representation from these datasets using deep learning
techniques. These features can then be applied for supervised
classification to a small but labelled traffic dataset consisting of normal as well as
anomalous traffic records. The traffic data for the labelled dataset can be collected in a
confined, isolated and private network environment. With this motivation, we use self-
taught learning, a deep learning technique based on sparse auto encoder and soft-
max regression, to develop an NIDS. We verify the usability of the self-taught learning
based NIDS by applying it to the NSL-KDD intrusion dataset, an improved version of the
benchmark dataset for various NIDS evaluations, KDD Cup 99. We provide a
comparison of our current work with other techniques as well. Towards this end, our
paper is organized into four sections. In Section 2, we discuss some closely related
work. Section 3 presents an overview of self-taught learning and the NSL-KDD
dataset. We discuss our results and comparative analysis in Section 4 and finally
conclude our paper with future work direction.
This section presents various recent accomplishments in this area. It should be noted
that we only discuss work that has used the NSL-KDD dataset for
performance benchmarking. Therefore, any dataset referred to from this point forward
should be considered as NSL-KDD. This approach allows a more accurate
comparison of our work with others found in the literature. Another limitation is the use of
training data for both training and testing by most works. Finally, we discuss a few
deep learning based approaches that have been tried so far for similar kinds of work.
One of the earliest works found in the literature used ANN with enhanced resilient back-
propagation for the design of such an IDS. This work used only the training dataset
for training (70%), validation (15%) and testing (15%). As expected, use of unlabelled
data for testing resulted in a reduction of performance. A more recent work used J48
decision tree classifier with 10-fold cross-validation for testing on the training dataset.
This work used a reduced feature set of 22 features instead of the full set of 41
features. A similar work evaluated various popular supervised tree-based classifiers
and found that Random Tree model performed best with the highest degree of
accuracy along with a reduced false alarm rate. Many 2-level classification
approaches have also been proposed. One such work used Discriminative
Multinomial Naive Bayes (DMNB) as a base classifier and Nominal-to-Binary
supervised filtering at the second level along with 10-fold cross validation for testing.
This work was further extended to use Ensembles of Balanced Nested Dichotomies
(END) at the first level and Random Forest at the second level. As expected, this
enhancement resulted in an improved detection rate and a lower false positive rate.
Another 2-level implementation used principal component analysis (PCA) for the
feature set reduction and then SVM (using Radial Basis Function) for final
classification, which resulted in a high detection accuracy with only the training dataset and
the full 41-feature set. A reduction of the feature set to 23 resulted in even better detection
accuracy in some of the attack classes, but the overall performance was reduced.
The authors improved their work by using information gain to rank the features and
then a behavior-based feature selection to reduce the feature set to 20. This resulted
in an improvement in reported accuracy using the training dataset. The second
category to look at used both the training and test datasets. An initial attempt in this
category used fuzzy classification with a genetic algorithm and resulted in a detection
accuracy of 80%+ with a low false positive rate. Another important work used
unsupervised clustering algorithms and found that the performance using only the
training data was reduced drastically when test data was also used. A similar
implementation using the k-point algorithm resulted in a slightly better detection
accuracy and lower false positive rate, using both training and test datasets. Another
less popular technique, OPF (optimum path forest), which uses graph partitioning for
feature classification, was found to demonstrate a high detection accuracy within one-
third of the time compared to the SVM-RBF method. We observed a deep learning
approach with Deep Belief Network (DBN) as a feature selector and SVM as a
classifier. This approach resulted in an accuracy of 92.84% when applied on
training data. Our current work can be easily compared to this work due to the
enhancement of approach over this work and the use of both the training and test datasets
in our work. A similar, but semi-supervised, learning approach has also been used.
The authors used real-world traces for training and evaluated their approach on real-
world and KDD Cup 99 traces. Our approach differs from theirs in the sense that
we use the NSL-KDD dataset to find deep learning applicability in NIDS implementation.
Moreover, the feature learning task is completely unsupervised and based on sparse
auto encoder in our approach. We recently observed a sparse auto encoder based
deep learning approach for network traffic identification. The authors performed
TCP based unknown protocols identification in their work instead of network intrusion
detection.
Self-taught Learning (STL) is a deep learning approach that consists of two stages for
the classification. First, a good feature representation is learnt from a large collection
of unlabelled data, $X_u$, termed as Unsupervised Feature Learning (UFL). In the
second stage, this learnt representation is applied to labelled data, $X_L$, and used for
the classification task. Although the unlabelled and labelled data may come from
different distributions, there must be relevance among them. Figure 1 shows the
architecture diagram of STL. There are different approaches used for UFL, such as
Sparse Auto encoder, Restricted Boltzmann Machine (RBM), K-Means Clustering,
and Gaussian Mixtures. We use sparse auto encoder based feature learning for our
work due to its relatively easier implementation and good performance. A sparse auto
encoder is a neural network consisting of an input, a hidden, and an output layer. The
input and output layers contain N nodes, and the hidden layer contains K nodes. The
target values at the output layer are set equal to the input values, i.e., $\hat{x}_i = x_i$. The
sparse auto encoder network finds the optimal values for weight matrices, $W \in \mathbb{R}^{K \times N}$
and $V \in \mathbb{R}^{N \times K}$, and bias vectors, $b_1 \in \mathbb{R}^{K \times 1}$ and $b_2 \in \mathbb{R}^{N \times 1}$, using the back-propagation
algorithm while trying to learn the approximation of the identity function, i.e., output $\hat{x}$
similar to $x$ [18]. The sigmoid function, $g(z) = \frac{1}{1 + e^{-z}}$, is used for the activation, $h_{W,b}$, of
the nodes in the hidden and output layers:

$$h_{W,b}(x) = g(Wx + b) \tag{1}$$

$$J = \frac{1}{2m} \sum_{i=1}^{m} \lVert x_i - \hat{x}_i \rVert^2 + \frac{\lambda}{2} \Big( \sum_{k,n} W^2 + \sum_{n,k} V^2 + \sum_{k} b_1^2 + \sum_{n} b_2^2 \Big) + \beta \sum_{j=1}^{K} KL(\rho \,\|\, \hat{\rho}_j) \tag{2}$$

The cost function to be minimized in sparse auto encoder using back-propagation
is represented by Eq. (2). The first term is the average of sum-of-square error terms
for all m input data. The second term is a weight decay term, with λ as weight decay
parameter, to avoid over-fitting in training. The last term in the equation is a sparsity
penalty term that puts a constraint on the hidden layer to maintain low average
activation values, and is expressed as the Kullback-Leibler (KL) divergence shown in Eq.
(3):

$$KL(\rho \,\|\, \hat{\rho}_j) = \rho \log \frac{\rho}{\hat{\rho}_j} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_j} \tag{3}$$

where ρ is a sparsity constraint parameter that ranges from 0 to 1 and β controls the sparsity penalty term. The
$KL(\rho \,\|\, \hat{\rho}_j)$ attains a minimum value when $\rho = \hat{\rho}_j$, where $\hat{\rho}_j$ denotes the average
activation value of hidden unit j over all training inputs x. Once we learn optimal
values for $W$ and $b_1$ by applying the sparse auto encoder on unlabelled data, $X_u$, we
evaluate the feature representation $a = h_{W,b_1}(X_L)$ for the labelled data, $(X_L, y)$. We
use this new feature representation, $a$, with the label vector, $y$, for the classification
task in the second stage. We use soft-max regression for the classification task.
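To make the three terms of the cost function concrete, the following is a minimal pure-Python sketch of Eqs. (1) to (3) for a single-hidden-layer sparse auto encoder. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, samples are plain lists, the biases are flat lists rather than column vectors, and no training (back-propagation) step is included.

```python
import math


def sigmoid(z):
    """Activation g(z) = 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer(weights, bias, x):
    """Eq. (1): g(Wx + b) for one input vector x; weights is a list of rows."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def kl_divergence(rho, rho_hat):
    """Eq. (3): KL divergence between target and average activation."""
    return (rho * math.log(rho / rho_hat)
            + (1 - rho) * math.log((1 - rho) / (1 - rho_hat)))


def sum_of_squares(rows):
    """Sum of squared entries of a matrix given as a list of rows."""
    return sum(w * w for row in rows for w in row)


def sparse_autoencoder_cost(X, W, V, b1, b2, lam, beta, rho):
    """Eq. (2): reconstruction error + weight decay + sparsity penalty.

    X is a list of m samples (each of length N); W is K x N, V is N x K.
    """
    m = len(X)
    hidden = [layer(W, b1, x) for x in X]        # K activations per sample
    outputs = [layer(V, b2, a) for a in hidden]  # reconstruction x_hat
    reconstruction = sum(
        sum((xi - oi) ** 2 for xi, oi in zip(x, o))
        for x, o in zip(X, outputs)
    ) / (2 * m)
    decay = (lam / 2) * (sum_of_squares(W) + sum_of_squares(V)
                         + sum(b * b for b in b1) + sum(b * b for b in b2))
    # Average activation of each hidden unit j over all samples.
    rho_hat = [sum(a[j] for a in hidden) / m for j in range(len(W))]
    sparsity = beta * sum(kl_divergence(rho, r) for r in rho_hat)
    return reconstruction + decay + sparsity
```

Once `W` and `b1` have been learnt, the feature representation for a labelled record `x` is simply `layer(W, b1, x)`, which is what the second stage passes to soft-max regression.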
As discussed earlier, we used NSL-KDD dataset in our work. The dataset is an
improved and reduced version of the KDD Cup 99 dataset. The KDD Cup dataset
was prepared using the network traffic captured by the 1998 DARPA IDS evaluation
program. The network traffic includes normal and different kinds of attack traffic, such
as DoS, Probing, user-to-root (U2R), and remote-to-local (R2L). The network traffic for
training was collected for seven weeks followed by two weeks of traffic collection for
testing in raw TCP dump format. The test data contains many attacks that were not
injected during the training data collection phase to make the intrusion detection task
realistic. It is believed that most of the novel attacks can be derived from the known
attacks. Finally, the training and test data were processed into the datasets of five
million and two million TCP/IP connection records, respectively. The KDD Cup
dataset has been widely used as a benchmark dataset for many years in the
evaluation of NIDS. One of the major drawbacks of the dataset is that it contains an
enormous amount of redundant records both in the training and test data. It was
observed that almost 78% and 75% of the records are redundant in the training and test
dataset, respectively. This redundancy makes the learning algorithms biased towards
the frequent attack records and leads to poor classification results for the infrequent,
but harmful records. The training and test data were classified with the minimum
accuracy of 98% and 86% respectively using a very simple machine learning
algorithm. It made the comparison task difficult for various IDSs based on different
learning algorithms. NSL-KDD was proposed to overcome the limitation of KDD Cup
dataset. The dataset is derived from the KDD Cup dataset. It improved the previous
dataset in two ways. First, it eliminated all the redundant records from the training and
test data. Second, it partitioned all the records in the KDD Cup dataset into various
difficulty levels based on the number of learning algorithms that can correctly classify
the records. Further, it selected the records by random sampling of the distinct
records from different difficulty levels in a fraction that is inversely proportional to their
fractions in the distinct records. The multi-step processing of the KDD Cup dataset made
the total records statistics reasonable in the NSL-KDD dataset. Moreover, these
enhancements made the evaluation of various machine learning techniques realistic.
Each record in the NSL-KDD dataset consists of 41 features and is labelled with
either normal or a particular kind of attack. These features include basic features
derived directly from a TCP/IP connection, traffic features accumulated in a window
interval, either time, e.g., two seconds, or a number of connections, and content
features extracted from the application layer data of connections. Out of 41 features,
three are nominal, four are binary, and the remaining 34 are continuous. The training data
contains 23 traffic classes that include 22 classes of attack and one normal class. The
test data contains 38 traffic classes that include 21 attack classes from the training
data, 16 novel attacks, and one normal class. These attacks are also grouped into
four categories based on the purpose they serve. These categories are DoS, Probing,
U2R, and R2L. Table-1 shows the statistics of records for the training and test data
for normal and different attack classes.
As discussed in the previous section, the dataset contains different kinds of attributes
with different values. We pre-process the dataset before applying self-taught learning
on it. Nominal attributes are converted into discrete attributes using 1-to-n encoding.
In addition, there is one attribute, num_outbound_cmds, in the dataset whose value is
always 0 for all the records in the training and test data. We eliminated this attribute
from the dataset. The total number of attributes becomes 121 after performing the
steps mentioned above. The values in the output layer during the feature learning
phase are computed by the sigmoid function, which gives values between 0 and 1. Since
the output layer values are identical to the input layer values in this phase, the
values at the input layer must be normalized to the range [0, 1]. To obtain this, we
perform max-min normalization on the new attributes list. With the new attributes, we
use the NSL-KDD training data without labels for feature learning using sparse auto
encoder for the first stage of self-taught learning. In the second stage, we apply the
newly learned features representation on the training data itself for the classification
using soft-max regression. In our implementation, both the unlabelled and labelled
data for feature learning and classifier training come from the same source, i.e., NSL-
KDD training data.
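The pre-processing steps described above (dropping the constant attribute, 1-to-n encoding of nominal attributes, and max-min normalization to [0, 1]) can be sketched as follows. This is a minimal illustration on toy records, not the authors' code: the helper names are hypothetical, the values are made up, and mapping a constant column to 0.0 is an assumption chosen to avoid division by zero.

```python
def one_hot(values):
    """1-to-n encode a nominal column; categories are sorted for a stable order."""
    categories = sorted(set(values))
    encoded = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    return encoded, categories


def min_max(column):
    """Scale a numeric column to [0, 1]; a constant column maps to 0.0 (assumption)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def preprocess(records, nominal, drop):
    """Return (matrix, column_names) for a list of dict records."""
    columns, names = [], []
    for name in records[0]:
        if name in drop:
            continue
        values = [r[name] for r in records]
        if name in nominal:
            encoded, categories = one_hot(values)
            for j, category in enumerate(categories):
                columns.append([row[j] for row in encoded])
                names.append(f"{name}={category}")
        else:
            columns.append(min_max([float(v) for v in values]))
            names.append(name)
    matrix = [list(row) for row in zip(*columns)]
    return matrix, names


# Toy records with a few NSL-KDD style attribute names; values are illustrative.
records = [
    {"duration": 0, "protocol_type": "tcp", "src_bytes": 100, "num_outbound_cmds": 0},
    {"duration": 5, "protocol_type": "udp", "src_bytes": 300, "num_outbound_cmds": 0},
    {"duration": 10, "protocol_type": "tcp", "src_bytes": 200, "num_outbound_cmds": 0},
]
matrix, names = preprocess(records, nominal={"protocol_type"},
                           drop={"num_outbound_cmds"})
print(names)
print(matrix)
```

The count of 121 attributes is consistent with this procedure if the three nominal attributes expand to 3, 70 and 11 categories, as is commonly reported for the NSL-KDD training data: 41 - 1 - 3 + (3 + 70 + 11) = 121.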
We implemented the NIDS for three different types of classification: a) Normal and
anomaly (2-class), b) Normal and four different attack categories (5-class), and c)
Normal and 22 different attacks (23-class). We have evaluated classification accuracy
for all types. However, precision, recall, and f-measure values are evaluated only in the
case of 2-class and 5-class classification. We have computed the weighted values of
these metrics in the case of 5-class classification.
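The text does not define how the weighted values are computed. A common convention, assumed here, is to compute each metric per class and average them with weights proportional to the number of true instances of each class. A minimal sketch under that assumption, with a hypothetical function name and made-up labels:

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Support-weighted precision, recall and F-measure over all true classes.

    Assumption: "weighted" means each class is weighted by its share of
    true instances; the report does not state the exact scheme.
    """
    support = Counter(y_true)
    total = len(y_true)
    result = {"precision": 0.0, "recall": 0.0, "f_measure": 0.0}
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        precision = tp / predicted if predicted else 0.0
        recall = tp / count
        if precision + recall:
            f_measure = 2 * precision * recall / (precision + recall)
        else:
            f_measure = 0.0
        weight = count / total
        result["precision"] += weight * precision
        result["recall"] += weight * recall
        result["f_measure"] += weight * f_measure
    return result


# Illustrative labels only; these are not records from NSL-KDD.
y_true = ["normal", "normal", "dos", "dos", "probe"]
y_pred = ["normal", "dos", "dos", "dos", "normal"]
print(weighted_metrics(y_true, y_pred))
```

Under this weighting, the weighted recall equals the overall accuracy, so the weighted precision and F-measure are the values that add information in the 5-class case.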

As discussed, there are two approaches applied for the evaluation of NIDSs. In the
most widely used approach, the training data is used for both training and testing
either using n-fold cross-validation or splitting the training data into training, cross-
validation, and test sets. NIDSs based on this approach achieved very high accuracy
and lower false-alarm rates. The second approach uses the training and test data
separately for the training and testing. Since the training and test data were collected
in different environments, the accuracy obtained using the second approach is not as
high as in the first approach. Therefore, we emphasize the results of the second
approach in our work for accurate evaluation of NIDS. However, we present the
results of the first approach as well for completeness. We describe our NIDS
implementation before discussing the results.
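The first evaluation approach, n-fold cross-validation on the training data alone, can be sketched as an index split. This is a minimal illustration with a hypothetical helper name; it assumes contiguous, unshuffled folds, whereas in practice the records would normally be shuffled first.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) pairs for n-fold cross-validation.

    Folds are contiguous and differ in size by at most one sample.
    """
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        test_indices = list(range(start, start + size))
        train_indices = (list(range(0, start))
                         + list(range(start + size, n_samples)))
        yield train_indices, test_indices
        start += size


# Every record of the training data is tested exactly once across the folds.
for train_idx, test_idx in k_fold_indices(n_samples=10, n_folds=3):
    print(len(train_idx), test_idx)
```

In the second approach no such split is needed: the model is fitted on the whole training file and scored once on the separate test file, which also contains attack types absent from training.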
We proposed a deep learning based approach for developing an efficient and flexible
NIDS. A sparse auto encoder and soft-max regression based NIDS was
implemented. We used the benchmark network intrusion dataset - NSL-KDD to
evaluate anomaly detection accuracy. We observed that the proposed NIDS
performed very well compared to previously implemented NIDSs for the
normal/anomaly detection when evaluated on the test data. The performance can be
further enhanced by applying techniques such as Stacked Auto encoder, an
extension of sparse auto encoder in deep belief nets, for unsupervised feature
learning, and NB-Tree, Random Tree, or J48 for further classification. It was noted
that the latter techniques performed well when applied directly on the dataset. In the
future, we plan to implement a real-time NIDS for actual networks using deep learning
techniques. Additionally, on-the-go feature learning on raw network traffic headers
instead of derived features can be another high-impact research direction in this area.

2.4 Intrusion Preventing System using Intrusion Detection System Decision Tree Data Mining

Problem statement: Distinguishing intrusion from normal activity in network traffic
is very difficult and time consuming. An analyst must review
all of the large and wide-ranging data to find the sequence of intrusion on the network
connection. Therefore, a method is needed that can detect network intrusion and reflect the
current network traffic.
Approach: In this study, a novel method to find intrusion characteristics for IDS using
decision tree machine learning, a data mining technique, was proposed. The rules are
generated by classification with the ID3 decision tree algorithm.
Results: These rules can determine intrusion characteristics, which are then implemented in
the firewall policy rules as prevention.
Conclusion: The combination of IDS and firewall is the so-called IPS, which, besides
detecting the existence of an intrusion, can also act on it by denying the intrusion as
prevention.
With global Internet connectivity, network security has gained significant attention
in the research and industrial communities. Due to the increasing threat of network
attacks, firewalls have become important elements of security policy in general.
A firewall can allow or deny a network packet access, but it cannot detect an
intrusion or attack. Intrusion detection is therefore needed, and its findings are then
implemented in the firewall's access control as prevention. Intrusion detection is also
considered a complementary solution to firewall technology because it recognizes attacks
against the network that are missed by the firewall. Firewall and IDS are long-established
terms in the field of IT security. A firewall is good at protecting a system and
network and can minimize the risk of attack on the network. An IDS can detect the
existence of an intrusion or attack. The combined ability of an IDS and a firewall is
called an IPS: a tool that detects an intrusion and then denies it at the firewall as
prevention. For each type of network traffic, there are one or more different rules. Every
network packet that arrives at the firewall must be checked against the defined rules until
a matching rule is found. The packet is then allowed or denied access to the network,
depending on the action specified in the matching rule. Each rule identifies a specific
type of network traffic. Characteristics that reflect current network traffic can be
observed in network traffic logs through human pattern recognition. This study focuses on
methods for preventing intrusion attempts: finding intrusion characteristics in the
network traffic as an IDS, then implementing them in the firewall policy rules as
prevention. The rules of intrusion characteristics are found using decision tree machine
learning, a data mining technique. The rules are generated by classification with the ID3
decision tree algorithm, which is an efficient and optimized way to produce filtering
rules for the firewall.

Intrusion Detection System (IDS): Intrusion detection can be performed manually or
automatically. Manual intrusion detection might take place by examining log files or
other evidence for signs of intrusions, including network traffic. A system that
performs automated intrusion detection is called an Intrusion Detection System (IDS).
IDSs play a vital role in ensuring the security of modern computer installations. Such
systems are needed in order to detect hostile activity and to respond appropriately. As
networks continue to expand and become more exposed to a diversity of sources,
both hostile and benign, IDSs need to be able to deal with a large and ever-increasing
flow of alerts and events. Therefore, automatic procedures for detecting and
responding to intrusion are becoming increasingly essential.
A firewall security policy is a list of ordered filtering rules that define the actions
performed on packets that satisfy specific conditions. Before developing filtering rules
for a packet filter, the extent of the restrictions to be applied must be considered,
because each additional restriction increases the search time and space requirements of
the packet filtering process [1] and progressively degrades performance. This is
because every incoming and outgoing network packet is checked against the rules in turn
until a matching rule is found in the firewall. Firewall rules can limit access to a
connection according to parameters such as source IP, destination IP,
source port, destination port and protocol [8, 10]. The following is an example of a
firewall rule, written in iptables syntax:
iptables -A INPUT -s 203.230.206.5 -p tcp -d 10.10.15.7 --dport 80 -j DROP
This rule is appended to the end of the chain (-A) for traffic incoming to the firewall
(INPUT). Packets from source IP address (-s) 203.230.206.5 with protocol (-p) TCP to
destination IP address (-d) 10.10.15.7 and destination port (--dport) 80 are handled by
the action (-j) DROP, that is, they are dropped by the firewall.
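The first-match behaviour described above (each packet is checked against the ordered rules until one matches) can be sketched in Python. This is a minimal illustration and not part of the reviewed study: the dictionary layout, the function names `matches` and `filter_packet`, and the default ACCEPT policy are assumptions made for the example.

```python
def matches(rule, packet):
    """A rule matches when every field it specifies equals the packet's field."""
    return all(packet.get(field) == value for field, value in rule["match"].items())


def filter_packet(rules, packet, default_action="ACCEPT"):
    """Return the action of the first matching rule (assumed default policy otherwise)."""
    for rule in rules:  # rules are evaluated in order, so longer lists cost more time
        if matches(rule, packet):
            return rule["action"]
    return default_action


# The DROP rule from the text, expressed as a dictionary (hypothetical layout).
rules = [
    {
        "match": {
            "src_ip": "203.230.206.5",
            "dst_ip": "10.10.15.7",
            "protocol": "tcp",
            "dst_port": 80,
        },
        "action": "DROP",
    }
]

blocked = {"src_ip": "203.230.206.5", "dst_ip": "10.10.15.7", "protocol": "tcp", "dst_port": 80}
allowed = dict(blocked, src_ip="198.51.100.9")

print(filter_packet(rules, blocked))
print(filter_packet(rules, allowed))
```

Because evaluation stops at the first match, the order of the rules and the number of rules both affect the filtering time, which is the performance concern raised in the text.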

Log files:
Log files can give an idea of what the different parts of a system are doing. Logs
can show what is going right and what is going wrong. Log files can provide a useful
profile of activity. From a security standpoint, it is crucial to be able to distinguish
normal activity from the activity of someone attacking a server or network. Log files are
useful for three reasons:
• Log files help with troubleshooting system problems and understanding what is
happening on the system
• Logs serve as an early warning for both system and security events
• Logs can be indispensable in reconstructing events, whether determining that an intrusion
has occurred and performing the follow-up forensic investigation or just profiling
normal activity
A decision tree is a classification technique in data mining for learning
patterns from data and using these patterns for classification. Decision trees are
structures used to classify data with common attributes. Each
decision tree represents a rule, which categorizes data according to these attributes.
Each internal (non-leaf) node denotes a test on an attribute, each branch
represents an outcome of the test and each leaf node (terminal node) holds a class
label. The topmost node in a tree is the root node. A decision tree classifier is one of
the most widely used supervised learning methods for data exploration. It is
easy to interpret and can be represented as if-then-else rules. This classifier works
well on noisy data. A decision tree aids in data exploration in the following manner:
• It reduces a volume of data by transformation into a more compact form that
preserves the essential characteristics and provides an accurate summary
• It discovers whether the data contains well-separated classes of objects, such that the
classes can be interpreted meaningfully in the context of a substantive theory
• It maps data in the form of a tree from the root to its leaves. This may be used to
predict the outcome for a new data item or query.

This research uses decision trees, a data mining and machine learning technique, to find
the intrusion characteristics for intrusion detection. The ID3 algorithm is used to
construct the decision tree. Network traffic logs serve as training data that describe
human behaviour in network traffic as intrusive or normal activities. Decision tree
training yields rules of intrusion characteristics, which are then implemented in the
firewall rules as prevention. Whether activity in a network traffic log is an intrusion or
normal can be determined in two ways:
• Manually observing network traffic activity in log files. Examples of logging software
are syslog, syslog-ng, tcpdump and others. Patterns that indicate intrusion can be seen
directly in the log, for example repeated failed attempts to log in with a login or
password, port scans, excessive pings, or repeated delivery of large numbers of
packets
• Using software that functions as a Network Intrusion Detection
System (NIDS) and can distinguish intrusive from normal activities, for example
Snort

Log files of intrusive and normal activities are collected and reduced to five
parameters that serve as attributes, and each record belongs to the class ‘Yes’ or ‘No’
(intrusive or not) in the training data of the decision tree. The parameters are source IP
address, destination IP address, source port, destination port and protocol. Applying
Decision Tree to Find Intrusion Characteristic: Suppose a decision tree is trained on
these data.
The rules extracted from the decision tree, which represent the characteristics of an
intrusion, can be implemented as firewall rules. Note that every rule includes the TCP
protocol. These firewall policy rules represent preventive action: every
network packet that meets the criteria of such a rule is dropped (DROP).
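As a rough illustration of how an extracted rule could be turned into a firewall rule, the sketch below formats a rule over the five attributes as an iptables DROP command string. The function name `to_iptables`, the dictionary keys and the sample rule are hypothetical and chosen only for this example; the study itself does not show this code. Only standard iptables options (-A, -p, -s, -d, --sport, --dport, -j) are used.

```python
def to_iptables(rule):
    """Format one learned intrusion rule as an iptables DROP command (sketch).

    `rule` is a dict with optional keys src_ip, dst_ip, src_port, dst_port
    and protocol; the protocol defaults to tcp, as every rule in the text
    includes the TCP protocol.
    """
    # -p must come before --sport/--dport, which belong to the protocol match.
    parts = ["iptables", "-A", "INPUT", "-p", rule.get("protocol", "tcp")]
    options = (
        ("src_ip", "-s"),
        ("dst_ip", "-d"),
        ("src_port", "--sport"),
        ("dst_port", "--dport"),
    )
    for key, flag in options:
        if key in rule:
            parts += [flag, str(rule[key])]
    parts += ["-j", "DROP"]
    return " ".join(parts)


# Hypothetical rule: traffic from one source to port 80 of one host is intrusive.
learned_rule = {"src_ip": "203.230.206.5", "dst_ip": "10.10.15.7", "dst_port": 80}
print(to_iptables(learned_rule))
```

The function only builds the command text; applying it to a live firewall is a separate, privileged step.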
Network traffic logs describe patterns of behaviour in network traffic associated with
intrusive or normal activity. The decision tree technique is well suited to finding
intrusion characteristics in network traffic logs for an IDS, which are then implemented
in the firewall as prevention. The combination of the two is called an IPS. In addition,
this technique yields efficient and optimized firewall rules, for example by avoiding
redundancy.

CHAPTER 3
EXPERIMENTAL METHODS AND ALGORITHMS USED
3.1 MACHINE LEARNING SCOPE
Machine learning is a very promising approach to achieving human-computer integration
and can be applied in many computer fields. Machine learning is not a single method;
it comprises many different computer algorithms. Different algorithms aim to solve
different machine learning tasks. Ultimately, all of these algorithms help the computer to
act more like a human. Machine learning is already applied in many fields, for
instance pattern recognition, artificial intelligence, computer vision, data mining and
text categorization. Machine learning gives a new way to develop the
intelligence of machines. It also makes it easier for people to
analyse data from huge data sets. Learning is a complicated topic that
takes many different forms. Everyone has different methods of studying, and so does
the machine. We can categorize machine learning systems by different
conditions. In general, we can separate learning problems into two main categories:
supervised learning and unsupervised learning.

3.1.1 SUPERVISED LEARNING
Supervised learning is a commonly used machine learning approach that appears
in many different fields of computer science. In supervised learning, the
computer establishes a learning model based on the training data set.
Using this learning model, the computer can predict or
analyze new information. By using special algorithms, the computer can search for the best
result and reduce the error rate by itself. Supervised learning is mainly used for
two kinds of tasks: classification and regression.
In supervised learning, when a developer gives the computer some samples, each
sample is always labelled with its classification information. The computer
analyzes these samples to gain learning experience so that the error rate is
reduced when a classifier recognizes each pattern.

3.1.2 UNSUPERVISED LEARNING
Unsupervised learning is used to group original data that carry no labels. The classifier
in the unsupervised learning method aims to find the classification information for
unlabeled samples. The objective of unsupervised learning is to let the computer learn
by itself; we do not teach the computer how to do it. The computer is expected to
analyze the given samples. In unsupervised learning, the computer is not
told which result is best and does not know whether a result is
correct. When the computer receives the original data, it automatically finds the potential
regularities within the information and then applies these
regularities to new cases. That is the difference between supervised learning
and unsupervised learning. In some cases, this method is more powerful than
supervised learning, because there is no need to classify the
samples in advance. Sometimes our own classification may not be the best one,
whereas a computer may find a better one after it learns from the
samples again and again.

3.2 Decision tree


Compared with linear regression and logistic regression, decision trees are probably among
the easiest algorithms in machine learning to understand. There is not much mathematics
involved. Since they are easy to use and interpret, they are one of the most widely
used and practical methods in machine learning.
What is a Decision Tree?
A decision tree is a tool that has applications spanning several different areas. Decision
trees can be used for classification as well as regression problems. As the name suggests,
it uses a flowchart-like tree structure to show the predictions that result from a
series of feature-based splits. It starts with a root node and ends with a decision made
at the leaves. Before learning more about decision trees, let’s get familiar with some of
the terminology.
• Root node – the node at the beginning of a decision tree; from this node
the population starts dividing according to various features.
• Decision nodes – the nodes we get after splitting the root node are called decision
nodes.
• Leaf nodes – the nodes where further splitting is not possible are called leaf nodes or
terminal nodes.
• Sub-tree – just as a small portion of a graph is called a sub-graph, a
subsection of a decision tree is called a sub-tree.
• Pruning – cutting down some nodes to stop overfitting.
Example of a decision tree.
Let’s understand decision trees with the help of an example.
Decision trees are drawn upside down, which means the root is at the top and is then
split into several nodes. In layman’s terms, a decision tree is nothing but a bunch of
if-else statements. It checks whether a condition is true and, if it is, goes to
the next node attached to that decision.
Did you notice anything in the above flowchart? We see that if the weather is cloudy
then we must go to play. Why didn’t it split more? Why did it stop there?
To answer this question, we need to know a few more concepts such as entropy,
information gain, and Gini index. In simple terms, the output in
the training dataset is always yes for cloudy weather; since there is no disorder
here, we don’t need to split the node further.
The goal of machine learning is to decrease uncertainty or disorder in the dataset,
and for this we use decision trees.
Now you must be wondering: how do I know what should be the root node? What should
be the decision node? When should I stop splitting? To decide this, there is a metric
called “Entropy”, which is the amount of uncertainty in the dataset.
Entropy:
Entropy is the uncertainty in our dataset, a measure of disorder. Let me try
to explain this with the help of an example.
Suppose a group of friends is deciding which movie to watch
together on Sunday. There are 2 choices, “Lucy” and
“Titanic”, and everyone has to state their choice. After everyone gives their answer
we see that “Lucy” gets 4 votes and “Titanic” gets 5 votes. Which movie do we watch
now? Isn’t it hard to choose one movie when the votes for both movies are
nearly equal?
This is exactly what we call disorder: the number of votes for the two
movies is almost equal, and we can’t really decide which movie to watch. It would have
been much easier if “Lucy” had 8 votes and “Titanic” had 2. Then we
could easily say that the majority of votes are for “Lucy”, hence everyone will be
watching this movie.
In a decision tree, the output is mostly “yes” or “no”. The formula for the entropy of
such a two-class node S is shown below:
$$E(S) = -p_{\text{yes}} \log_2 p_{\text{yes}} - p_{\text{no}} \log_2 p_{\text{no}}$$
where $p_{\text{yes}}$ and $p_{\text{no}}$ are the fractions of “yes” and “no” samples in S.
How do Decision Trees use Entropy?
Now that we know what entropy is and what its formula is, we need to see how
exactly it works in this algorithm.
Entropy measures the impurity of a node. Impurity is the degree of
randomness; it tells how random our data is. A pure sub-split means that you
get either only “yes” or only “no”.
Suppose the node for feature 1 has 8 yes and 4 no; after the split, the node for feature 2
gets 5 yes and 2 no whereas the node for feature 3 gets 3 yes and 2 no.
We see here that the split is not pure. Why? Because we can still see some negative
classes in both nodes. To build a decision tree, we need to calculate the
impurity of each split, and when the purity is 100% we make the node a leaf node. To
check the impurity of feature 2 and feature 3 we use entropy.
Feature 2 has lower entropy, and thus more purity, than feature 3, since a larger share
of feature 2 is “yes”, which makes it easier to make a decision there.
Always remember that the higher the entropy, the lower the purity and the
higher the impurity.
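The comparison between feature 2 and feature 3 can be checked numerically. The sketch below uses the standard two-class entropy with a base-2 logarithm; the function name `entropy` and the variable names are chosen here for illustration only.

```python
import math


def entropy(yes, no):
    """Two-class entropy in bits for a node with `yes` and `no` samples."""
    total = yes + no
    result = 0.0
    for count in (yes, no):
        if count:  # a class with zero samples contributes nothing
            p = count / total
            result -= p * math.log2(p)
    return result


feature_2 = entropy(5, 2)  # 5 yes, 2 no
feature_3 = entropy(3, 2)  # 3 yes, 2 no
print(round(feature_2, 3), round(feature_3, 3))  # 0.863 0.971
```

Feature 2 has the lower entropy, which agrees with the statement that it is the purer of the two nodes. A perfectly balanced node such as `entropy(4, 4)` gives 1.0, and a pure node such as `entropy(5, 0)` gives 0.0.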
As mentioned earlier, the goal of machine learning is to decrease the uncertainty or
impurity in the dataset. Entropy gives us the impurity of a
feature or a particular node, but it does not tell us whether the parent entropy
has decreased.
For this, we introduce a new metric called “Information gain”, which tells us how much the
parent entropy has decreased after splitting on some feature.
Information Gain:
Information gain measures the reduction of uncertainty given some feature, and it is
also the deciding factor for which attribute should be selected as a decision node or root
node. It is the entropy of the full dataset minus the weighted entropy of the dataset
given some feature.
Let’s see how our decision tree will be made using these 2 features. We’ll use
information gain to decide which feature should be the root node and which feature
should be placed after the split.
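A minimal sketch of the information gain calculation follows, applied to the counts used earlier (a parent node with 8 yes and 4 no, split into nodes with 5 yes/2 no and 3 yes/2 no). The function names and the tuple layout `(yes, no)` are assumptions made for this example.

```python
import math


def entropy(counts):
    """Entropy in bits of a node given its per-class sample counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Parent: 8 yes, 4 no. Children after the split: (5 yes, 2 no) and (3 yes, 2 no).
gain = information_gain((8, 4), [(5, 2), (3, 2)])
print(round(gain, 4))
```

The gain for this split is small (about 0.01 bits) because both child nodes are still mixed. A split that separates the classes perfectly, for example `information_gain((4, 4), [(4, 0), (0, 4)])`, gives the full parent entropy of 1 bit. The attribute with the largest gain is the one chosen for the node.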
When to stop splitting?
A natural question is when to stop growing the tree.
Usually, real-world datasets have a large number of features, which results in a
large number of splits, which in turn gives a huge tree. Such trees take time to build
and can lead to overfitting. That means the tree will give very good accuracy on the
training dataset but bad accuracy on test data.
There are many ways to tackle this problem through hyperparameter tuning. We can
set the maximum depth of our decision tree using the max_depth parameter. The
larger the value of max_depth, the more complex the tree will be. The training error
will of course decrease if we increase the max_depth value, but when the test data
comes into the picture, accuracy can become very poor. Hence we need a value that
neither overfits nor underfits the data, and for this we can use GridSearchCV.
Another way is to set the minimum number of samples for each split, denoted by
min_samples_split. Here we specify the minimum number of samples required to make a
split. For example, we can require a minimum of 10 samples to reach a decision. That
means that if a node has fewer than 10 samples, this parameter stops the
further splitting of the node and makes it a leaf node. There are more hyperparameters
such as:
min_samples_leaf – the minimum number of samples required to be in a
leaf node. The more you increase this number, the greater the possibility of underfitting.
max_features – helps us decide how many features to consider when looking
for the best split.
Pruning
Pruning is another method that can help us avoid overfitting. It improves the
performance of the tree by cutting the nodes or sub-nodes that are not significant.
There are mainly 2 ways of pruning:
Pre-pruning – we can stop growing the tree earlier, which means we can
prune/remove/cut a node if it has low importance while growing the tree.
Post-pruning – once the tree is built to its full depth, we can start pruning the nodes
based on their significance.
Endnotes
To summarize, this section described decision trees: on what basis the tree
splits the nodes and how overfitting can be stopped.

Fig.3.2.1 Decision Tree

3.3 Genetic Algorithm


Let’s get back to the example we discussed above and summarize what we did.
Firstly, we defined our initial population as our countrymen.
We defined a function to classify whether a person is good or bad.
Then we selected good people for mating to produce their offspring.
And finally, these offspring replace the bad people in the population, and this
process repeats.
This is how a genetic algorithm actually works: it basically tries to mimic human
evolution to some extent. To formalize the definition, a genetic algorithm is
an optimization technique that tries to find the input values that
give the best output values or results. The working of a genetic algorithm is
also derived from biology, as shown in the image below.

Fig.3.3.1 Genetic Algorithm Steps


Steps Involved in Genetic Algorithm:
• Initialisation
• Fitness Function
• Selection
• Crossover
• Mutation
Application of Genetic Algorithm: Feature Selection
Every time you participate in a data science competition, how do you select the features
that are important in predicting the target variable? Typically you look at the feature
importance of some model, manually decide a threshold, and select the
features whose importance is above that threshold.
Is there a better way to deal with this kind of situation? One of the most
advanced approaches to feature selection is the genetic algorithm.
The method here is the same as the one used for the knapsack problem.
We again start with a population of chromosomes, where each chromosome is
a binary string: 1 denotes “inclusion” of a feature in the model and 0 denotes
“exclusion” of the feature from the model.
The difference is that the fitness function changes. The
fitness function here is the accuracy metric of the competition. The more accurate
a chromosome’s feature set is in predicting the value, the fitter it is.
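The five steps listed earlier (initialisation, fitness function, selection, crossover, mutation) can be sketched for binary feature selection as follows. This is a toy illustration under stated assumptions: the fitness function is a made-up stand-in for model accuracy that rewards a hypothetical set of useful feature indices, and all names and parameter values (`evolve`, `useful`, population size, mutation rate) are chosen for the example rather than taken from the project.

```python
import random


def fitness(chromosome, useful):
    """Toy stand-in for accuracy: reward useful features, penalise extra ones."""
    hits = sum(1 for i, bit in enumerate(chromosome) if bit and i in useful)
    extras = sum(1 for i, bit in enumerate(chromosome) if bit and i not in useful)
    return hits - 0.5 * extras


def evolve(n_features, useful, pop_size=20, generations=40, mutation_rate=0.05, seed=0):
    rng = random.Random(seed)
    # Initialisation: random binary strings, 1 = feature included, 0 = excluded.
    population = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]
    best_history = []
    for _ in range(generations):
        # Fitness function and selection: the fitter half survives as parents.
        population.sort(key=lambda c: fitness(c, useful), reverse=True)
        best_history.append(fitness(population[0], useful))
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)  # single-point crossover
            child = a[:cut] + b[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - bit if rng.random() < mutation_rate else bit for bit in child]
            children.append(child)
        population = parents + children
    population.sort(key=lambda c: fitness(c, useful), reverse=True)
    return population[0], best_history


useful = {0, 3, 7}  # hypothetical indices of the informative features
best, history = evolve(n_features=10, useful=useful)
print(best, fitness(best, useful))
```

Because the parents are carried over unchanged, the best fitness never decreases from one generation to the next. In a real setting the toy fitness function would be replaced by the validation accuracy of a model trained on the selected features.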

You may now be wondering whether such a tough task is of any use. Rather than
answering this question now, let us look at an implementation using the TPOT
library, and then you can decide.
Implementation using TPOT library
First, let’s take a quick look at TPOT (Tree-based Pipeline Optimization
Tool), which is built upon the scikit-learn library.
A basic pipeline structure is shown in the image below.

Fig.3.3.2 Genetic Algorithm Application

The highlighted grey section in the image above is automated using TPOT. This
automation is achieved using a genetic algorithm. Without going deep into this, let’s
directly try to implement it. To use the TPOT library, you first have to install some
existing Python libraries on which TPOT is built. So let us quickly install them.
Applications in Real World:
• Engineering Design
• Robotics
End Notes
I hope that you have now gained enough understanding of what a genetic algorithm is
and how to implement it using the TPOT library. But this knowledge is not enough if
you don’t apply it somewhere, so try to implement it in a real-world
application or in a data science competition.

3.4 Block diagram And Flow diagram

Fig.3.4.1 Block Diagram

Fig.3.4.2 Flow Diagram

CHAPTER 4
RESULTS, DISCUSSION AND PERFORMANCE ANALYSIS

4.1 Requirements
Hardware requirements:
• System: Pentium i3 Processor
• Hard Disk: 500 GB
• Monitor: 15" LED
• Input Devices: Keyboard, Mouse
• RAM: 2 GB
Software requirements:
• Operating System: Windows 10
• Coding Language: Python

4.2 MODULES
What is the machine learning Model?
A machine learning model is a piece of code that an engineer or data
scientist makes smart through training with data. So, if you give garbage to the
model, you will get garbage in return, i.e. the trained model will provide false or wrong
predictions.
Fig.4.2.1 Module Flow chart
• Data Collection
• Data pre-processing
• Feature Extraction
• Model Training
• Testing Model
• Evaluation
• Prediction

Data Collection

Collecting data allows us to capture a record of past events so that we can use data
analysis to find recurring patterns.
KDD datasets:
The KDD data set is a well-known benchmark in research on intrusion
detection techniques. A lot of work is going on to improve intrusion
detection strategies, while research on the data used for training and testing the
detection model is equally important because better data quality can improve
offline intrusion detection. The attributes of the KDD data set can be categorized into
four classes: Basic, Content, Traffic and Host.

Fig.4.2.2 Data Collection

Data Pre-Processing
Data pre-processing is the process of cleaning raw data, i.e. data collected in
the real world is converted to a clean data set. Whenever data
is gathered from different sources it is collected in a raw format, which isn’t
suitable for analysis. Therefore, certain steps are executed to convert the data into
a small clean data set; this part of the process is called data pre-processing.

Feature Extraction
This is done to reduce the number of attributes in the dataset, which provides
advantages such as faster training and improved accuracy.

Model training
Training data is a dataset that is used to train an ML algorithm. It consists of
sample output data and the corresponding sets of input data that have an influence on
the output.
Training set:
The training set is the material through which the computer learns how to process
information. Machine learning uses algorithms to perform the training. It is the set of
data used for learning, that is, to fit the parameters of the classifier.

Fig.4.2.3 Training dataset

Validation set:
Cross-validation is primarily used in applied machine learning to estimate the skill of a
machine learning model on unseen data. A set of data held out from the
training data is used to tune the parameters of a classifier.

Fig.4.2.4 Validation
Once the data is divided into the three segments (training, validation and test), we can
start the training process. A training set is used to build a model, while a test (or
validation) set is used to validate the model built. Data points in the training set are
excluded from the test (validation) set. Usually, a data set is divided into a training set
and a validation set (some people use ‘test set’ instead) in each iteration, or into a
training set, a validation set and a test set in each iteration. The model is one
of the models chosen earlier. Once the model is trained, we can
use the same trained model to predict on the testing data, i.e. the unseen data.
Once this is done, we can build a confusion matrix, which tells us how well our model
is trained. For two classes, a confusion matrix has 4 entries: ‘True Positives’, ‘True
Negatives’, ‘False Positives’ and ‘False Negatives’. We prefer to have more values
in the true negatives and true positives, which means a more accurate model. The size of
the confusion matrix depends on the number of classes.
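The division into training, validation and test segments can be sketched as follows. The 70/15/15 proportions, the fixed seed and the function name `split_dataset` are assumptions made for illustration; the report does not state the proportions it used.

```python
import random


def split_dataset(records, train=0.70, validation=0.15, seed=42):
    """Shuffle the records and cut them into train, validation and test parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    train_part = shuffled[:n_train]
    val_part = shuffled[n_train:n_train + n_val]
    test_part = shuffled[n_train + n_val:]  # whatever remains is the test set
    return train_part, val_part, test_part


train_set, val_set, test_set = split_dataset(range(100))
print(len(train_set), len(val_set), len(test_set))
```

Each record lands in exactly one segment, so the data points used for training are excluded from the validation and test sets, as required above.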

Fig .4.2.5 Prediction


True positives: These are cases in which we predicted TRUE and the actual output
is TRUE.
True negatives: We predicted FALSE and the actual output is FALSE.
False positives: We predicted TRUE, but the actual output is FALSE.
False negatives: We predicted FALSE, but the actual output is TRUE. We
can also find the accuracy of the model using the confusion matrix. Accuracy =
(True Positives + True Negatives) / (Total number of samples), i.e. for an example with
100 true positives, 50 true negatives and 165 samples in total:
Accuracy = (100 + 50) / 165 = 0.9090 (90.9% accuracy)
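The four confusion matrix entries and the accuracy can be computed directly from lists of actual and predicted labels. The sketch below uses a small made-up label set; in the final line, the split of the 15 misclassified samples into 10 false positives and 5 false negatives is an assumption, since the text gives only the totals.

```python
def confusion_counts(actual, predicted):
    """Count TP, TN, FP and FN for boolean labels (True = attack)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)
    tn = sum(1 for a, p in pairs if not a and not p)
    fp = sum(1 for a, p in pairs if not a and p)
    fn = sum(1 for a, p in pairs if a and not p)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Correct predictions divided by the total number of samples."""
    return (tp + tn) / (tp + tn + fp + fn)


# Made-up labels for illustration only.
actual = [True, True, True, False, False, False, True, False]
predicted = [True, True, False, False, False, True, True, False]

counts = confusion_counts(actual, predicted)
print(counts, accuracy(*counts))

# The worked example from the text: 100 TP, 50 TN, 165 samples in total.
print(round(accuracy(100, 50, 10, 5), 4))
```

The denominator is the number of samples, which is why the example divides by 165.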
Testing model
In this module we test the trained machine learning model using the test dataset.
Quality assurance is required to make sure that the software system works according
to the requirements. Were all the features implemented as agreed? Does the program
behave as expected? All the parameters that the program is tested against should be
stated in the technical specification document.
Performance Evaluation
In this module, we evaluate the performance of the trained machine learning model using
performance evaluation criteria such as F1 score, accuracy and classification error.
Accuracy is the fraction of records classified correctly and classification error is the
fraction classified incorrectly, so the two sum to 1. The F1 score is the harmonic mean of
precision and recall and is useful when the classes are unbalanced.
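The evaluation criteria named above can be computed from the confusion matrix counts. The sketch below is illustrative only: the function names and the sample counts (80 TP, 90 TN, 10 FP, 20 FN) are hypothetical and are not results of this project.

```python
def precision(tp, fp):
    """Fraction of predicted attacks that really were attacks."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Fraction of real attacks that were detected."""
    return tp / (tp + fn) if tp + fn else 0.0


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if p + r else 0.0


def classification_error(tp, tn, fp, fn):
    """Fraction of samples classified incorrectly (1 - accuracy)."""
    return (fp + fn) / (tp + tn + fp + fn)


# Hypothetical counts used only to exercise the functions.
tp, tn, fp, fn = 80, 90, 10, 20
f1 = f1_score(tp, fp, fn)
error = classification_error(tp, tn, fp, fn)
print(round(f1, 4), error)
```

F1 ignores the true negatives, so it stays informative when normal traffic heavily outnumbers attacks, whereas accuracy and classification error can look good simply because most records are normal.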
Prediction
The algorithm will generate probable values for an unknown variable for each record
in the new data, allowing the model builder to identify what that value will most likely
be. The word “prediction” can be misleading. In some cases, it really does mean that
you are predicting a future outcome, such as when you’re using machine learning to
determine the next best action in a marketing campaign.
Evaluation
Model evaluation is an integral part of the model development process. It helps to find
the best model that represents our data and to estimate how well the chosen model will
work in the future. To improve the model we might tune its hyperparameters
to improve the accuracy, and also look at the confusion matrix to try to
increase the number of true positives and true negatives.

4.3 RESULT

S.NO   Types of Attacks   Count
1      All Attacks        4019
2      Normal Attacks     1589
Table.4.3.1 Count of Attacks

Fig.4.3.1 Intrusion Detection Application

Fig.4.3.2 Train Data Set
Fig.4.3.3 Testing All Attacks

Fig.4.3.4 All Attacks Plot graph

Fig.4.3.5 Normal Attacks Testing

Fig.4.3.6 Normal Attacks Plot Graph

4.4 Code

Dataset.py

import pyshark
import time
import random


class Packet:
    # Packets kept from the most recent capture (class-level list, as in the original).
    packet_list = list()

    def initiating_packets(self):
        """Capture 25 live packets and keep only those with an IP layer over UDP or TCP."""
        self.packet_list.clear()
        capture = pyshark.LiveCapture(interface="Wi-Fi")
        for packet in capture.sniff_continuously(packet_count=25):
            try:
                if "<UDP Layer>" in str(packet.layers) and "<IP Layer>" in str(packet.layers):
                    self.packet_list.append(packet)
                elif "<TCP Layer>" in str(packet.layers) and "<IP Layer>" in str(packet.layers):
                    self.packet_list.append(packet)
            except AttributeError:
                print(f"No Attribute name 'ip' {packet.layers}")
    def udp_packet_attributes(self, packet):
        """Build the 18-value attribute list for a UDP packet."""
        attr_list = list()
        a1 = packet.ip.ttl
        a2 = packet.ip.proto
        a3 = self.__get_service(packet.udp.port, packet.udp.dstport)
        a4 = packet.ip.len
        a5 = random.randrange(0, 1000)  # random value in [0, 1000)
        a6 = self.__get_land(packet, a2)
        a7 = 0
        # Features 23, 29, 30
        a8, a10, a11 = self.__get_count_with_same_and_diff_service_rate(packet.udp.dstport, a3)
        # Features 24, 31
        a9, a12 = self.__get_srv_count_and_srv_diff_host_rate(packet.ip.dst, a3)
        # Features 32, 34, 35
        a13, a15, a16 = self.__get_dst_host_count(packet.ip.dst, a3)
        # Features 33, 36, 37
        a14, a17, a18 = self.__get_dst_host_srv_count(
            packet.udp.port, packet.udp.dstport, packet.ip.dst)
        attr_list.extend((a1, a2, a3, a4, a5, a6, a7, a8, a9, a10,
                          a11, a12, a13, a14, a15, a16, a17, a18))
        return self.get_all_float(attr_list)

    def tcp_packet_attributes(self, packet):
        """Build the 18-value attribute list for a TCP packet."""
        attr_list = list()
        a1 = packet.ip.ttl  # duration
        a2 = packet.ip.proto  # protocol
        a3 = self.__get_service(packet.tcp.port, packet.tcp.dstport)  # service
        a4 = packet.ip.len
        a5 = random.randrange(0, 1000)  # random value in [0, 1000)
        a6 = self.__get_land(packet, a2)
        a7 = packet.tcp.urgent_pointer
        # Features 23, 29, 30
        a8, a10, a11 = self.__get_count_with_same_and_diff_service_rate(packet.tcp.dstport, a3)
        # Features 24, 31
        a9, a12 = self.__get_srv_count_and_srv_diff_host_rate(packet.ip.dst, a3)
        # Features 32, 34, 35
        a13, a15, a16 = self.__get_dst_host_count(packet.ip.dst, a3)
        # Features 33, 36, 37
        a14, a17, a18 = self.__get_dst_host_srv_count(
            packet.tcp.port, packet.tcp.dstport, packet.ip.dst)
        attr_list.extend((a1, a2, a3, a4, a5, a6, a7, a8, a9, a10,
                          a11, a12, a13, a14, a15, a16, a17, a18))
        return self.get_all_float(attr_list)

    def __get_service(self, src_port, dst_port):
        # Map the packet to one of the known service ports; default to 53.
        services = [80, 443, 53]
        if int(src_port) in services:
            return int(src_port)
        elif int(dst_port) in services:
            return int(dst_port)
        else:
            return 53

    def __get_land(self, packet, protocol):
        # land = 1 when source and destination address and port are identical.
        if int(protocol) == 6:  # TCP
            if (packet.ip.dst == packet.ip.src and packet.tcp.port == packet.tcp.dstport):
                return 1
            else:
                return 0
        elif int(protocol) == 17:  # UDP
            if (packet.ip.dst == packet.ip.src and packet.udp.port == packet.udp.dstport):
                return 1
            else:
                return 0

    def __get_count_with_same_and_diff_service_rate(self, dst_port, service):  # KDD features 23, 29, 30
        count = 0
        packet_with_same_service = 0
        for p in self.packet_list:
            if "<UDP Layer>" in str(p.layers):
                if p.udp.dstport == dst_port:  # same destination port
                    count += 1
                    if self.__get_service(p.udp.port, p.udp.dstport) == service:  # same service
                        packet_with_same_service += 1
            elif "<TCP Layer>" in str(p.layers):
                if p.tcp.dstport == dst_port:
                    count += 1
                    if self.__get_service(p.tcp.port, p.tcp.dstport) == service:
                        packet_with_same_service += 1
        same_service_rate = 0.0
        diff_service_rate = 1.0
        if not count == 0:  # avoid a zero division error
            same_service_rate = ((packet_with_same_service * 100) / count) / 100
            diff_service_rate = diff_service_rate - same_service_rate
        return (count, same_service_rate, diff_service_rate)

    def __get_srv_count_and_srv_diff_host_rate(self, dst_ip, service):  # KDD features 24, 31
        diff_dst_ip = 0
        service_count = 0
        for p in self.packet_list:
            if "<UDP Layer>" in str(p.layers):
                if self.__get_service(p.udp.port, p.udp.dstport) == service:  # same service
                    service_count += 1
                    if not (p.ip.dst == dst_ip):  # different destination IP (UDP)
                        diff_dst_ip += 1
            elif "<TCP Layer>" in str(p.layers):
                if self.__get_service(p.tcp.port, p.tcp.dstport) == service:
                    service_count += 1
                    if not (p.ip.dst == dst_ip):  # different destination IP (TCP)
                        diff_dst_ip += 1
        srv_diff_host_rate = 0.0
        if not (service_count == 0):
            srv_diff_host_rate = ((diff_dst_ip * 100) / service_count) / 100
        return (service_count, srv_diff_host_rate)

    def __get_dst_host_count(self, dst_ip, service):  # KDD features 32, 34, 35
        same_dst_ip = 0
        same_service = 0
        for p in self.packet_list:
            if p.ip.dst == dst_ip:  # same destination IP
                same_dst_ip += 1
                if "<UDP Layer>" in str(p.layers):
                    if self.__get_service(p.udp.port, p.udp.dstport) == service:  # same service (UDP)
                        same_service += 1
                elif "<TCP Layer>" in str(p.layers):
                    if self.__get_service(p.tcp.port, p.tcp.dstport) == service:  # same service (TCP)
                        same_service += 1
        dst_host_same_srv_rate = 0.0
        dst_host_diff_srv_rate = 1.0
        if not same_dst_ip == 0:
            dst_host_same_srv_rate = ((same_service * 100) / same_dst_ip) / 100
            dst_host_diff_srv_rate = dst_host_diff_srv_rate - dst_host_same_srv_rate

        return (same_dst_ip, dst_host_same_srv_rate, dst_host_diff_srv_rate)

    def __get_dst_host_srv_count(self, src_port, dst_port, dst_ip):  # KDD features 33, 36, 37
        dst_host_srv_count = 0
        same_src_port = 0
        diff_dst_ip = 0
        for p in self.packet_list:
            if "<UDP Layer>" in str(p.layers):
                if p.udp.dstport == dst_port:  # same destination port
                    dst_host_srv_count += 1
                    if p.udp.port == src_port:  # same source port
                        same_src_port += 1
                    if not (p.ip.dst == dst_ip):  # different destination IP
                        diff_dst_ip += 1
            elif "<TCP Layer>" in str(p.layers):
                if p.tcp.dstport == dst_port:  # same destination port
                    dst_host_srv_count += 1
                    if p.tcp.port == src_port:  # same source port
                        same_src_port += 1
                    if not (p.ip.dst == dst_ip):  # different destination IP
                        diff_dst_ip += 1
        dst_host_same_src_port_rate = 0.0
        dst_host_srv_diff_host_rate = 0.0
        if not dst_host_srv_count == 0:
            dst_host_same_src_port_rate = ((same_src_port * 100) / dst_host_srv_count) / 100
            dst_host_srv_diff_host_rate = ((diff_dst_ip * 100) / dst_host_srv_count) / 100
        return (dst_host_srv_count, dst_host_same_src_port_rate, dst_host_srv_diff_host_rate)

    def get_all_float(self, l):
        # Convert every attribute to a float rounded to one decimal place.
        all_float = list()
        for x in l:
            all_float.append(round(float(x), 1))
        return all_float
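
The rate features computed in the listing above are plain ratios over the recently captured packets. As a minimal sketch (the helper name `same_and_diff_rate` and the sample counts are illustrative assumptions, not part of the project code), the "same" and "different" rates can be derived from two counters as follows:

```python
def same_and_diff_rate(matching, total):
    """Return (same_rate, diff_rate) for `matching` hits out of `total` packets.

    Follows the convention used in the listing: when no packet matched the
    outer condition, the same-rate is 0.0 and the diff-rate is 1.0.
    """
    if total == 0:
        return 0.0, 1.0
    same_rate = matching / total
    return same_rate, 1.0 - same_rate


# Illustrative counts: 3 of the 4 packets sent to one port share the service.
same, diff = same_and_diff_rate(3, 4)
print(same, diff)
```

Keeping the zero-count case explicit avoids a division by zero when the packet window contains no packet for the destination port in question.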

GAAlgorithm.py
import Population
import random


class GAAlgorithm():

    def __init__(self, train_dataset, test_dataset, population_size, mutation_rate, gene_length=18):
        self.train_dataset = train_dataset
        self.test_dataset = test_dataset
        self.population_size = population_size
        self.mutation_rate = mutation_rate
        self.gene_length = int(gene_length)
        self.population = Population.Population(self.train_dataset, self.test_dataset,
                                                self.population_size, self.gene_length)

    def initialization(self):
        self.population.initialize_population()

    def calculate_fitness(self):
        self.population.calculate_fitness()

    def selection(self):
        # Pair one index from the first half of the population with one
        # index from the second half, once per pair of parents.
        parents = list()
        end = int(self.population_size / 2)
        no_of_parents = int(self.population_size / 2)
        for x in range(no_of_parents):
            p1 = random.randint(0, end - 1)
            p2 = random.randint(end, self.population_size - 1)
            parents.append([p1, p2])
        return parents

    def cross_over(self, parents):
        self.population.cross_over(parents)

    def mutation(self):
        self.population.mutation(self.mutation_rate)

    def clear_population(self):
        self.population.clear_population()
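
To make the behaviour of `selection()` concrete, the following standalone sketch reproduces its pairing rule outside the class. The function name `select_parent_pairs` and the optional `rng` argument are assumptions added for illustration; whether the two halves of the population carry any meaning (for example an ordering by fitness) depends on the `Population` class, which is not shown in this listing.

```python
import random


def select_parent_pairs(population_size, rng=random):
    """Return index pairs [p1, p2]: p1 from the first half, p2 from the second."""
    half = population_size // 2
    pairs = []
    for _ in range(half):
        p1 = rng.randint(0, half - 1)                # index in the first half
        p2 = rng.randint(half, population_size - 1)  # index in the second half
        pairs.append([p1, p2])
    return pairs


# A seeded generator makes the illustration repeatable.
pairs = select_parent_pairs(6, random.Random(0))
print(pairs)
```

With a population of 6 this always yields 3 pairs, and the two parents of a pair can never be the same individual because they are drawn from disjoint index ranges.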

Individual.py
import random
import string
import pandas
from classifier import DecisionTree


class Individual:

    chromosome = list()
    fitness = 0

    def __init__(self, train_dataset, test_dataset, gene_length=18):
        self.gene_length = int(gene_length)
        # One bit per feature: 1 selects the feature, 0 drops it.
        self.chromosome = [random.randint(0, 1) for x in range(self.gene_length)]
        self.train_dataset = train_dataset
        self.test_dataset = test_dataset

    def calculate_fitness(self):
        # Columns are named 'a', 'b', ...; the last column holds the class label.
        header = list(string.ascii_lowercase[0:(self.gene_length + 1)])
        kdd_train = pandas.read_csv(self.train_dataset, names=header)
        kdd_test = pandas.read_csv(self.test_dataset, names=header)
        selected_index = [header[x] for x, y in enumerate(self.chromosome) if y == 1]
        var_train, res_train = kdd_train[selected_index], kdd_train[header[self.gene_length]]
        var_test, res_test = kdd_test[selected_index], kdd_test[header[self.gene_length]]
        self.fitness = self.__get_fitness(var_train, res_train, var_test, res_test) * 100

    def __get_fitness(self, var_train, res_train, var_test, res_test):
        return DecisionTree.get_fitness(var_train, res_train, var_test, res_test)
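
The way a chromosome selects dataset columns in `calculate_fitness` can be seen in isolation with a short sketch. The helper `selected_columns` and the five-gene example are illustrative assumptions; the project itself uses 18 genes and reads the columns with pandas.

```python
import string


def selected_columns(chromosome):
    """Map a bit-mask chromosome to (feature column names, label column name)."""
    # One letter per gene plus one extra letter for the class label column.
    header = list(string.ascii_lowercase[:len(chromosome) + 1])
    features = [header[i] for i, bit in enumerate(chromosome) if bit == 1]
    label = header[len(chromosome)]
    return features, label


features, label = selected_columns([1, 0, 1, 1, 0])
print(features, label)
```

Because column names come from the 26 lowercase letters, this naming scheme only works for chromosomes of up to 25 genes.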

ABNIDS.py
# Change testing panel to avoid segmentation fault
from PyQt5 import QtCore, QtGui, QtWidgets
from PyQt5.QtGui import QIcon, QPixmap
from PyQt5.QtWidgets import (qApp, QFileDialog, QMessageBox, QMainWindow, QDialog,
                             QDialogButtonBox, QVBoxLayout, QHeaderView)
import os
import time
import pyshark
import matplotlib.pyplot as plt
import threading
import packet as pack
import GAAlgorithm
import Preprocess as data
import classifier


class Ui_MainWindow(object):
    def __init__(self):
        self.tree_classifier = classifier.DecisionTree()
        self.packet = pack.Packet()
        self.trained = False
        self.stop = False
        self.threadActive = False
        self.pause = False

    def plot_graph(self):
        # Bar chart of how many records fell into each class.
        x = ['Normal', 'DoS', 'Prob']
        normal, dos, prob = self.tree_classifier.get_class_count()
        y = [normal, dos, prob]
        plt.bar(x, y, width=0.3, label="BARCHART")
        plt.xlabel('Classes')
        plt.ylabel('Count')
        plt.title('Graph Plotting')
        plt.legend()
        plt.show()

    def train_model(self):
        try:
            train_dataset, train_dataset_type = QFileDialog.getOpenFileName(
                MainWindow, "Select Training Dataset", "", "All Files (*);;CSV Files (*.csv)")
            if train_dataset:
                os.chdir(os.path.dirname(train_dataset))
                test_dataset, test_dataset_type = QFileDialog.getOpenFileName(
                    MainWindow, "Select Testing Dataset", "", "All Files (*);;CSV Files (*.csv)")
                if train_dataset and test_dataset:
                    generation = 0
                    train_dataset = data.Dataset.refine_dataset(train_dataset, "Train Preprocess.txt")
                    test_dataset = data.Dataset.refine_dataset(test_dataset, "Test Preprocess.txt")
                    # Start the genetic algorithm
                    ga = GAAlgorithm.GAAlgorithm(train_dataset, test_dataset, population_size=5, mutation_rate=65)
                    # If an invalid dataset causes an error, the population must be
                    # cleared so that a later run does not append to the old one.
                    ga.initialization()
                    ga.calculate_fitness()
                    while ga.population.max_fitness < 93 and generation < 1:
                        print(f"Generation = {generation}")
                        generation += 1
                        parents = ga.selection()
                        ga.cross_over(parents)
                        ga.mutation()
                        ga.calculate_fitness()
                    max_fitest = ga.population.max_fittest
                    max_fitness = round(ga.population.max_fitness, 1)
                    self.tree_classifier.train_classifier(train_dataset, max_fitest)
                    self.trained = True
                    ga.clear_population()
                    self.progressBar.setProperty("value", 100)
                    self.showdialog('Model train', 'Model trained successfully', 1)

        except Exception:
            try:
                ga.clear_population()
            except Exception:
                print("Err 00")
            finally:
                self.showdialog('Model train', 'Model training failed', 2)

    def static_testing(self):
        if self.isModelTrained():
            if self.threadActive:
                self.showdialog('Warning', 'Please stop the current test first', 3)
            else:
                test_dataset, train_dataset_type = QFileDialog.getOpenFileName(
                    MainWindow, "Select Testing Dataset", "", "All Files (*);;CSV Files (*.csv)")
                if test_dataset:
                    try:
                        test_dataset = data.Dataset.refine_dataset(test_dataset, "Test Dataset.txt")
                        t1 = threading.Thread(target=self.static_testing_thread,
                                              name='Static testing', args=(test_dataset,))
                        t1.start()
                        self.threadActive = True
                    except Exception:
                        self.showdialog('Error', 'Invalid Dataset', 2)
        else:
            self.showdialog('Warning', 'Model not trained', 3)

    def static_testing_thread(self, dataset):
        row = 0
        self.reset_all_content()
        with open(dataset, "r") as file:
            for line in file.readlines():
                try:
                    line = line.split(',')
                    result, result_type = self.tree_classifier.test_dataset(line)
                    self.insert_data(line, result, result_type, row)

                    row += 1
                    while self.pause:
                        time.sleep(0.05)  # wait for resume without spinning the CPU
                    if self.isStop():
                        self.stop = False
                        break
                    time.sleep(0.05)
                except Exception:
                    print("Err")
        self.threadActive = False

    def realtime_testing(self):
        if self.isModelTrained():
            if self.threadActive:
                self.showdialog('Warning', 'Please stop the current test first', 3)
            else:
                t2 = threading.Thread(target=self.realtime_testing_thread, name='Realtime testing')
                t2.start()
                self.threadActive = True
        else:
            self.showdialog('Warning', 'Model not trained', 3)

    def realtime_testing_thread(self):
        self.reset_all_content()
        self.packet.initiating_packets()
        t1 = time.time()
        attr_list = list()
        capture = pyshark.LiveCapture(interface='Wi-Fi')
        row = 0
        try:
            for p in capture.sniff_continuously():
                try:
                    if "<UDP Layer>" in str(p.layers) and "<IP Layer>" in str(p.layers):
                        attr_list = self.packet.udp_packet_attributes(p)
                        result, result_type = self.tree_classifier.test_dataset(attr_list)
                        self.insert_data(attr_list, result, result_type, row)
                        print(attr_list)
                        row += 1
                    elif "<TCP Layer>" in str(p.layers) and "<IP Layer>" in str(p.layers):
                        attr_list = self.packet.tcp_packet_attributes(p)
                        result, result_type = self.tree_classifier.test_dataset(attr_list)
                        self.insert_data(attr_list, result, result_type, row)
                        print(attr_list)
                        row += 1
                    # Refresh the reference packet list every 5 seconds.
                    # isStop must be called; the bare method object is always truthy.
                    if (time.time() - t1) > 5 and not self.isStop():
                        print("Updating list")
                        self.packet.initiating_packets()
                        t1 = time.time()
                    while self.pause:
                        time.sleep(0.05)  # wait for resume without spinning the CPU
                    if self.isStop():
                        self.stop = False
                        break
                except Exception:
                    print("Err")
        except Exception:
            print("Error in loop")

    def pause_resume(self):
        if self.pause:
            self.pause = False
            self.btn_start.setText("Pause")
        else:
            self.pause = True
            self.btn_start.setText("Resume")

    def save_log_file(self):
        log = self.tree_classifier.get_log()
        # getSaveFileName returns a (path, selected filter) tuple.
        url = QFileDialog.getSaveFileName(None, 'Save Log', 'untitled', "Text file (*.txt);;All Files (*)")
        if url[0]:
            try:
                url = url[0]
                with open(url, 'w') as file:
                    file.write(log)
                self.showdialog('Saved', f'File saved as {url}', 1)
            except Exception:
                self.showdialog('Error', 'File not saved', 2)

    def stop_capturing_testing(self):
        if self.pause:
            self.pause = False
            self.btn_start.setText('Pause')
        if not self.stop:
            self.stop = True
        if self.threadActive:
            self.threadActive = False

    def reset_all_content(self):
        if self.pause:
            self.pause = False
            self.btn_start.setText('Pause')
        self.stop = False
        self.tree_classifier.reset_class_count()
        self.panel_capturing.clearContents()
        self.panel_capturing.setRowCount(0)
        self.panel_result.clearContents()
        self.panel_result.setRowCount(0)
        self.panel_testing.clear()

    def insert_data(self, line, result, result_type, row):
        # Show the first four attributes of every record in the capture panel.
        self.panel_capturing.insertRow(row)
        for column, item in enumerate(line[0:4]):
            self.panel_capturing.setItem(row, column, QtWidgets.QTableWidgetItem(str(item)))
        self.panel_capturing.scrollToBottom()
        self.panel_testing.clear()
        self.panel_testing.addItem(str(line[0:4]))
        # A non-zero result is an attack: add it to the result panel as well.
        if not result == 0:
            result_row = self.panel_result.rowCount()
            self.panel_result.insertRow(result_row)
            x = [row + 1, line[1], line[2], result_type]
            for column, item in enumerate(x):
                self.panel_result.setItem(result_row, column, QtWidgets.QTableWidgetItem(str(item)))
            self.panel_result.scrollToBottom()

    def clickexit(self):
        buttonReply = QMessageBox.question(MainWindow, 'Exit', "Are you sure you want to exit?",
                                           QMessageBox.Yes | QMessageBox.No, QMessageBox.No)
        if buttonReply == QMessageBox.Yes:
            if self.threadActive:
                self.pause = False
                self.stop = True
            qApp.quit()
        else:
            print('No clicked.')

    def isStop(self):
        return self.stop

    def showdialog(self, title, text, icon_type):
        # icon_type: 1 = information, 2 = critical, 3 = warning
        msg = QMessageBox()
        if icon_type == 1:
            msg.setIcon(QMessageBox.Information)
        elif icon_type == 2:
            msg.setIcon(QMessageBox.Critical)
        elif icon_type == 3:
            msg.setIcon(QMessageBox.Warning)
        msg.setText(text)
        msg.setWindowTitle(title)
        msg.setStandardButtons(QMessageBox.Ok)
        msg.buttonClicked.connect(self.msgbtn)
        retval = msg.exec_()

    def msgbtn(self):
        self.progressBar.setProperty("value", 0)

    def isModelTrained(self):
        return self.trained

    def setupUi(self, MainWindow):
        MainWindow.setObjectName("MainWindow")
        path = os.path.dirname(os.path.abspath(__file__))
        MainWindow.setWindowIcon(QtGui.QIcon(os.path.join(path, 'icon.png')))
        MainWindow.resize(908, 844)
        sizePolicy = QtWidgets.QSizePolicy(QtWidgets.QSizePolicy.Fixed, QtWidgets.QSizePolicy.Preferred)
        sizePolicy.setHorizontalStretch(0)
        sizePolicy.setVerticalStretch(0)
        sizePolicy.setHeightForWidth(MainWindow.sizePolicy().hasHeightForWidth())
        MainWindow.setSizePolicy(sizePolicy)
        MainWindow.setIconSize(QtCore.QSize(30, 30))
        self.centralwidget = QtWidgets.QWidget(MainWindow)
        self.centralwidget.setObjectName("centralwidget")
        self.gridLayout = QtWidgets.QGridLayout(self.centralwidget)
        self.gridLayout.setObjectName("gridLayout")
        spacerItem = QtWidgets.QSpacerItem(10, 10, QtWidgets.QSizePolicy.Expanding, QtWidgets.QSizePolicy.Minimum)
        self.gridLayout.addItem(spacerItem, 1, 0, 1, 1)
        spacerItem1 = QtWidgets.QSpacerItem(20, 20, QtWidgets.QSizePolicy.Minimum, QtWidgets.QSizePolicy.Maximum)
        self.gridLayout.addItem(spacerItem1, 4, 1, 1, 1)
        spacerItem2 = QtWidgets.QSpacerItem(20, 10, QtWidgets.QSizePolicy.Minimum, QtWidgets.QSizePolicy.Fixed)
        self.gridLayout.addItem(spacerItem2, 6, 1, 1, 1)
        self.horizontalLayout_2 = QtWidgets.QHBoxLayout()
        self.horizontalLayout_2.setObjectName("horizontalLayout_2")
        spacerItem3 = QtWidgets.QSpacerItem(15, 10, QtWidgets.QSizePolicy.Ignored, QtWidgets.QSizePolicy.Minimum)
        self.horizontalLayout_2.addItem(spacerItem3)
        # Pause/Resume button
        self.btn_start = QtWidgets.QPushButton(self.centralwidget)
        self.btn_start.setObjectName("btn_start")
        self.btn_start.setText('Pause')
        self.btn_start.clicked.connect(self.pause_resume)
        self.horizontalLayout_2.addWidget(self.btn_start)

        # Stop button
        self.btn_pause = QtWidgets.QPushButton(self.centralwidget)
        self.btn_pause.setText("Stop Capturing/Testing")
        self.btn_pause.setObjectName("btn_pause")
        self.btn_pause.clicked.connect(self.stop_capturing_testing)
        self.horizontalLayout_2.addWidget(self.btn_pause)

        self.gridLayout.addLayout(self.horizontalLayout_2, 8, 1, 1, 1)
        self.horizontalLayout = QtWidgets.QHBoxLayout()
        self.horizontalLayout.setObjectName("horizontalLayout")
        # Train Model button
        self.btn_modeltrain = QtWidgets.QPushButton(self.centralwidget)
        self.btn_modeltrain.setText("Train Model")
        self.btn_modeltrain.setObjectName("btn_modeltrain")
        self.btn_modeltrain.clicked.connect(self.train_model)
        self.horizontalLayout.addWidget(self.btn_modeltrain)
        # Static Testing button
        self.btn_statictesting = QtWidgets.QPushButton(self.centralwidget)
        self.btn_statictesting.setText("Static Testing")
        self.btn_statictesting.setObjectName("btn_statictesting")
        self.btn_statictesting.clicked.connect(self.static_testing)
        self.horizontalLayout.addWidget(self.btn_statictesting)
        # Realtime Testing button (text and object name were empty in the listing)
        self.btn_realtimetesting = QtWidgets.QPushButton(self.centralwidget)
        self.btn_realtimetesting.setText("Realtime Testing")
        self.btn_realtimetesting.setObjectName("btn_realtimetesting")
        self.btn_realtimetesting.clicked.connect(self.realtime_testing)
        self.horizontalLayout.addWidget(self.btn_realtimetesting)

        # Save Log button
        self.btn_savelog = QtWidgets.QPushButton(self.centralwidget)
        self.btn_savelog.setText("Save Log")
        self.btn_savelog.setObjectName("btn_savelog")
        self.btn_savelog.clicked.connect(self.save_log_file)
        self.horizontalLayout.addWidget(self.btn_savelog)

        # Plot Graph button
        self.btn_graph = QtWidgets.QPushButton(self.centralwidget)
        self.btn_graph.setText("Plot Graph")
        self.btn_graph.setObjectName("btn_graph")
        self.btn_graph.clicked.connect(self.plot_graph)
        self.horizontalLayout.addWidget(self.btn_graph)

        # Exit button
        self.btn_exit = QtWidgets.QPushButton(self.centralwidget)
        self.btn_exit.setText("Exit")
        self.btn_exit.setObjectName("btn_exit")
        self.btn_exit.clicked.connect(self.clickexit)
        self.horizontalLayout.addWidget(self.btn_exit)

        self.gridLayout.addLayout(self.horizontalLayout, 3, 1, 1, 2)
        spacerItem4 = QtWidgets.QSpacerItem(20, 10, QtWidgets.QSizePolicy.Minimum, QtWidgets.QSizePolicy.Fixed)
        self.gridLayout.addItem(spacerItem4, 8, 1, 1, 1)
        spacerItem5 = QtWidgets.QSpacerItem(20, 10, QtWidgets.QSizePolicy.Minimum, QtWidgets.QSizePolicy.Fixed)
        self.gridLayout.addItem(spacerItem5, 0, 1, 1, 1)
        # Table of captured packets
        self.panel_capturing = QtWidgets.QTableWidget(self.centralwidget)
        sizePolicy = QtWidgets.QSizePolicy(QtWidgets.QSizePolicy.Preferred, QtWidgets.QSizePolicy.Preferred)
        sizePolicy.setHorizontalStretch(10)
        sizePolicy.setVerticalStretch(0)

        sizePolicy.setHeightForWidth(self.panel_capturing.sizePolicy().hasHeightForWidth())
        self.panel_capturing.setSizePolicy(sizePolicy)
        self.panel_capturing.setRowCount(0)
        self.panel_capturing.setColumnCount(4)
        self.panel_capturing.setObjectName("panel_capturing")
        item = QtWidgets.QTableWidgetItem()
        self.panel_capturing.setHorizontalHeaderItem(0, item)
        item = QtWidgets.QTableWidgetItem()
        self.panel_capturing.setHorizontalHeaderItem(1, item)
        item = QtWidgets.QTableWidgetItem()
        self.panel_capturing.setHorizontalHeaderItem(2, item)
        item = QtWidgets.QTableWidgetItem()
        self.panel_capturing.setHorizontalHeaderItem(3, item)
        self.gridLayout.addWidget(self.panel_capturing, 4, 1, 4, 1)
        # Logo label
        self.label = QtWidgets.QLabel(self.centralwidget)
        sizePolicy = QtWidgets.QSizePolicy(QtWidgets.QSizePolicy.Fixed, QtWidgets.QSizePolicy.Fixed)
        sizePolicy.setHorizontalStretch(0)
        sizePolicy.setVerticalStretch(0)
        sizePolicy.setHeightForWidth(self.label.sizePolicy().hasHeightForWidth())
        self.label.setSizePolicy(sizePolicy)
        self.label.setLayoutDirection(QtCore.Qt.LeftToRight)
        self.label.setAutoFillBackground(False)
        self.label.setText("")
        # os.path.join keeps the icon path portable across operating systems.
        path = os.path.dirname(os.path.abspath(__file__))
        path = os.path.join(path, 'icons')
        self.label.setPixmap(QtGui.QPixmap(os.path.join(path, 'logo.jpg')))
        self.label.setScaledContents(True)
        self.label.setAlignment(QtCore.Qt.AlignCenter)
        self.label.setObjectName("label")
        self.gridLayout.addWidget(self.label, 1, 1, 1, 1)

        spacerItem6 = QtWidgets.QSpacerItem(10, 20, QtWidgets.QSizePolicy.Minimum, QtWidgets.QSizePolicy.Fixed)
        self.gridLayout.addItem(spacerItem6, 2, 1, 1, 1)
        # List showing the record currently under test
        self.panel_testing = QtWidgets.QListWidget(self.centralwidget)
        sizePolicy = QtWidgets.QSizePolicy(QtWidgets.QSizePolicy.Expanding, QtWidgets.QSizePolicy.Preferred)
        sizePolicy.setHorizontalStretch(0)
        sizePolicy.setVerticalStretch(0)
        sizePolicy.setHeightForWidth(self.panel_testing.sizePolicy().hasHeightForWidth())
        self.panel_testing.setSizePolicy(sizePolicy)
        self.panel_testing.setVerticalScrollBarPolicy(QtCore.Qt.ScrollBarAlwaysOn)
        self.panel_testing.setHorizontalScrollBarPolicy(QtCore.Qt.ScrollBarAsNeeded)
        self.panel_testing.setObjectName("panel_testing")
        self.gridLayout.addWidget(self.panel_testing, 9, 1, 1, 1)
        self.progressBar = QtWidgets.QProgressBar(self.centralwidget)
        self.progressBar.setProperty("value", 0)
        self.progressBar.setObjectName("progressBar")
        self.gridLayout.addWidget(self.progressBar, 10, 1, 1, 2)
        # ----------------------------------------------------------------- #
        # Table of detected attacks
        self.panel_result = QtWidgets.QTableWidget(self.centralwidget)
        sizePolicy = QtWidgets.QSizePolicy(QtWidgets.QSizePolicy.Preferred, QtWidgets.QSizePolicy.Preferred)
        sizePolicy.setHorizontalStretch(10)
        sizePolicy.setVerticalStretch(0)
        sizePolicy.setHeightForWidth(self.panel_result.sizePolicy().hasHeightForWidth())
        self.panel_result.setSizePolicy(sizePolicy)
        self.panel_result.setRowCount(0)
        self.panel_result.setColumnCount(4)
        self.panel_result.setObjectName("panel_result")
        item = QtWidgets.QTableWidgetItem()

        self.panel_result.setHorizontalHeaderItem(0, item)
        item = QtWidgets.QTableWidgetItem()
        self.panel_result.setHorizontalHeaderItem(1, item)
        item = QtWidgets.QTableWidgetItem()
        self.panel_result.setHorizontalHeaderItem(2, item)
        item = QtWidgets.QTableWidgetItem()
        self.panel_result.setHorizontalHeaderItem(3, item)
        self.gridLayout.addWidget(self.panel_result, 4, 2, 6, 1)
        # ----------------------------------------------------------------- #
        MainWindow.setCentralWidget(self.centralwidget)
        self.menubar = QtWidgets.QMenuBar(MainWindow)
        self.menubar.setGeometry(QtCore.QRect(0, 0, 908, 26))
        self.menubar.setObjectName("menubar")
        self.menuFile = QtWidgets.QMenu(self.menubar)
        self.menuFile.setObjectName("menuFile")
        self.menuAbout = QtWidgets.QMenu(self.menubar)
        self.menuAbout.setObjectName("menuAbout")
        MainWindow.setMenuBar(self.menubar)
        self.statusbar = QtWidgets.QStatusBar(MainWindow)
        self.statusbar.setObjectName("statusbar")
        MainWindow.setStatusBar(self.statusbar)
        self.actionNew = QtWidgets.QAction(MainWindow)
        self.actionNew.setObjectName("actionNew")
        self.actionOpen = QtWidgets.QAction(MainWindow)
        self.actionOpen.setObjectName("actionOpen")
        self.actionExit = QtWidgets.QAction(MainWindow)
        self.actionExit.setObjectName("actionExit")
        self.actionHelp = QtWidgets.QAction(MainWindow)
        self.actionHelp.setObjectName("actionHelp")
        self.menuFile.addAction(self.actionNew)
        self.menuFile.addAction(self.actionOpen)
        self.menuFile.addSeparator()
        self.menuFile.addAction(self.actionExit)
        self.actionExit.triggered.connect(qApp.quit)
        self.menuAbout.addAction(self.actionHelp)
        self.menubar.addAction(self.menuFile.menuAction())
        self.menubar.addAction(self.menuAbout.menuAction())

        self.retranslateUi(MainWindow)
        QtCore.QMetaObject.connectSlotsByName(MainWindow)

    def retranslateUi(self, MainWindow):
        _translate = QtCore.QCoreApplication.translate
        MainWindow.setWindowTitle(_translate("MainWindow", "Intrusion Detection"))
        self.btn_start.setStatusTip(_translate("MainWindow", "Pause/Resume"))
        self.btn_pause.setStatusTip(_translate("MainWindow", "Stop"))
        self.btn_modeltrain.setStatusTip(_translate("MainWindow", "Train Model"))
        self.btn_statictesting.setToolTip(_translate("MainWindow", "Static Testing"))
        self.btn_statictesting.setStatusTip(_translate("MainWindow", "Static Testing"))
        # The listing attached these two tips to btn_savelog; the text
        # describes the realtime testing button, so they are set there.
        self.btn_realtimetesting.setToolTip(_translate("MainWindow", "Real Time Capturing"))
        self.btn_realtimetesting.setStatusTip(_translate("MainWindow", "Real Time Capturing"))
        self.btn_savelog.setStatusTip(_translate("MainWindow", "Save Log"))
        self.btn_graph.setStatusTip(_translate("MainWindow", "Graph"))
        self.btn_exit.setStatusTip(_translate("MainWindow", "Exit"))
        item = self.panel_capturing.horizontalHeaderItem(0)
        item.setText(_translate("MainWindow", "Duration"))
        item = self.panel_capturing.horizontalHeaderItem(1)
        item.setText(_translate("MainWindow", "Protocol"))
        item = self.panel_capturing.horizontalHeaderItem(2)
        item.setText(_translate("MainWindow", "Service"))
        item = self.panel_capturing.horizontalHeaderItem(3)
        item.setText(_translate("MainWindow", "Src_Bytes"))
        # ---------------------------------------------------- #
        item = self.panel_result.horizontalHeaderItem(0)
        item.setText(_translate("MainWindow", "Packet #"))
        item = self.panel_result.horizontalHeaderItem(1)
        item.setText(_translate("MainWindow", "Protocol"))
        item = self.panel_result.horizontalHeaderItem(2)
        item.setText(_translate("MainWindow", "Service"))
        item = self.panel_result.horizontalHeaderItem(3)
        item.setText(_translate("MainWindow", "Class"))
        # ---------------------------------------------------- #
        self.menuFile.setTitle(_translate("MainWindow", "File"))
        self.menuAbout.setTitle(_translate("MainWindow", "About"))
        self.actionNew.setText(_translate("MainWindow", "New"))
        self.actionOpen.setText(_translate("MainWindow", "Open"))
        self.actionExit.setText(_translate("MainWindow", "Exit"))
        self.actionHelp.setText(_translate("MainWindow", "Help"))

if __name__ == "__main__":
    import sys
    app = QtWidgets.QApplication(sys.argv)
    # MainWindow is module-level: the dialog helpers above use it as parent.
    MainWindow = QtWidgets.QMainWindow()
    ui = Ui_MainWindow()
    ui.setupUi(MainWindow)
    MainWindow.show()
    sys.exit(app.exec_())
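
The testing threads in ABNIDS.py coordinate with the GUI through the plain boolean flags `pause` and `stop`, which the worker polls in a loop. One possible alternative, shown here only as a sketch, is to use `threading.Event` objects so that a paused worker blocks instead of polling. The class `WorkerControl` and the function `worker` are hypothetical names introduced for this illustration and do not appear in the project.

```python
import threading
import time


class WorkerControl:
    """Pause/resume/stop signalling for one worker thread (illustrative)."""

    def __init__(self):
        self._resume = threading.Event()
        self._resume.set()               # start in the running state
        self._stop = threading.Event()

    def pause(self):
        self._resume.clear()

    def resume(self):
        self._resume.set()

    def stop(self):
        self._stop.set()
        self._resume.set()               # wake a paused worker so it can exit

    def should_continue(self):
        self._resume.wait()              # blocks while paused, no busy loop
        return not self._stop.is_set()


def worker(control, items, out):
    """Process items one by one until they run out or stop() is called."""
    for item in items:
        if not control.should_continue():
            break
        out.append(item)
        time.sleep(0.001)


control = WorkerControl()
processed = []
thread = threading.Thread(target=worker, args=(control, range(1000), processed))
thread.start()
control.pause()
control.resume()
control.stop()
thread.join()
print(len(processed))
```

Calling `stop()` also sets the resume event, so a worker that is currently paused wakes up, sees the stop request, and leaves its loop.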

Conclusion:
Network traffic logs describe patterns of behaviour in network traffic that are
associated with either intrusive or normal activity. The decision tree technique is
well suited to capturing the intrusion characteristics of network traffic logs for
an IDS, and it is implemented here together with the genetic algorithm as a
preventive measure. In addition, the technique is efficient and helps to optimize
firewall rules, for example by avoiding redundancy.
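
The combination of genetic algorithm and classifier summarised above follows the loop used in `train_model`: initialise a population of feature masks, score each one, then repeat selection, crossover and mutation until the fitness target or the generation limit is reached. The sketch below is a self-contained illustration under stated assumptions: it uses single-point crossover, bit-flip mutation and keeps the best mask seen so far, and it replaces the decision-tree accuracy with a toy fitness function. None of these choices is confirmed by the listing, because the `Population` and `classifier` modules are not shown here.

```python
import random


def evolve(fitness, gene_length=8, population_size=6, mutation_rate=0.1,
           target=1.0, max_generations=50, seed=0):
    """Evolve bit-mask chromosomes; return (best chromosome, generations used)."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(gene_length)]
                  for _ in range(population_size)]
    best = max(population, key=fitness)
    generation = 0
    while fitness(best) < target and generation < max_generations:
        generation += 1
        children = []
        while len(children) < population_size:
            mother, father = rng.sample(population, 2)
            cut = rng.randint(1, gene_length - 1)        # single-point crossover
            child = mother[:cut] + father[cut:]
            # Bit-flip mutation: each gene flips with probability mutation_rate.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = children
        best = max(population + [best], key=fitness)     # keep the best seen so far
    return best, generation


# Toy fitness (assumption): fraction of genes that agree with a wanted mask.
# In the project, the fitness is the decision-tree accuracy on the test set.
wanted = [1, 0, 1, 1, 0, 0, 1, 0]


def toy_fitness(chromosome):
    return sum(a == b for a, b in zip(chromosome, wanted)) / len(wanted)


best, generations = evolve(toy_fitness)
print(best, generations, toy_fitness(best))
```

The loop stops for the same two reasons as in `train_model`: either the best fitness reaches the target, or the generation limit is exhausted.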


Publication

INTRUSION DETECTION IN
SOFTWARE DEFINED NETWORK USING MACHINE LEARNING

Submission date: 02-Mar-2022 09:59AM (UTC-0600)
Submission ID: 1649798424
File name: CTION_IN_SOFTWARE_DEFINED_NETWORK_USING_MACHINE_LEARNING_1.docx (365.3K)
Word count: 2100
Character count: 11888

INTRUSION DETECTION IN SOFTWARE DEFINED NETWORK USING MACHINE LEARNING

ORIGINALITY REPORT

Similarity index: 2%
Internet sources: 1%
Publications: 1%
Student papers: 1%

PRIMARY SOURCES

1. Submitted to Aston University (Student Paper): 1%
2. opus.lib.uts.edu.au (Internet Source): 1%
3. Peng Cui. "A Tighter Analysis of Set Cover Greedy Algorithm for Test Set", Lecture Notes in Computer Science, 2007 (Publication): <1%
4. Bambang Susilo, Riri Fitri Sari. "Intrusion Detection in Software Defined Network Using Deep Learning Approach", 2021 IEEE 11th Annual Computing and Communication Workshop and Conference (CCWC), 2021 (Publication): <1%
5. thesai.org (Internet Source): <1%
Exclude quotes: On
Exclude matches: Off

Detection of Attacks (DoS, Probe) Using Genetic Algorithm

A. Venkata Srinadh Reddy¹, B. Prasanth Reddy², L. Sujihelen³

¹,²,³ Department of Computer Science and Engineering,
Sathyabama Institute of Science and Technology, Chennai.

ABSTRACT:

The intrusion detection system (IDS) is currently of great interest as an important part of
system security. An IDS collects traffic information from the network or system and then
uses it to improve security. Attacks are usually very difficult and time-consuming to
separate from normal traffic activity. To monitor network connections, an analyst must
review all the data, which is large and wide-ranging. A network traffic analysis method is
therefore needed to determine traffic patterns. In this study, a new method for finding IDS
signatures was developed using data mining techniques from machine learning. The technique
used to generate the rules is the decision tree classification algorithm. These rules can be
used to determine the nature of an attack and are then applied in the genetic algorithm for
prevention, so that, in addition to detecting an attack, it is possible to take steps to
prevent and deny it.

Keywords- Intrusion detection, K-Nearest Neighbor, Naive Bayes, Decision Trees, Support
Vector Machine, Prediction

INTRODUCTION

Intrusion detection techniques can be divided into two types: misuse detection and anomaly detection. In misuse detection, a wide range of known attacks can be detected by checking system activity against known intrusion patterns. In anomaly detection, the system first learns the normal profile and then records all the components of the system that do not match the established profile. The main advantage of misuse detection is its high detection rate for known attacks, but it has difficulty identifying new or unexpected attacks. The advantage of anomaly detection is its ability to recognize new (or unexpected) attacks.

Techniques based on machine learning pipelines are used in many other domains. As a general example, consider the analysis of traffic information and the detection of road accidents using social networking data. That method is based on extracting traffic information from social media (Facebook and Twitter): it collects sentences related to any traffic activity, such as traffic jams or road closures. A number of preprocessing steps are then applied (stemming, tokenization, POS tagging, segmentation, and so on) to transform the collected data into a structured form. The data is then automatically labelled as "traffic" or "non-traffic" using the latent Dirichlet allocation (LDA) algorithm. Traffic-labelled data is divided into three types: positive, negative, and neutral. The output of this stage is a sentence labelled with its polarity (positive, negative, or neutral) and with whether or not it concerns traffic. The bag-of-words (BoW) model is then used to convert each sentence into a one-hot encoding that is fed to a bidirectional LSTM network (Bi-LSTM). After training, the multi-layer neural network uses softmax to classify sentences by location, traffic event, and polarity type. The method compares classical machine learning and advanced deep learning approaches in terms of accuracy, F-score, and other criteria.

LITERATURE REVIEW

Designing a Network Intrusion Detection System Based on
Machine Learning for Software Defined Networks

Software-defined networking (SDN) has recently emerged as a feasible and promising solution for the future of the Internet. With SDN, networks are managed, centralized, monitored, and adapted. These advantages, however, also expose us to threats such as network crashes, system disablement, online banking fraud, and theft. These problems can harm families, businesses, and the economy. Accuracy, high performance, and a real-time system are essential to addressing them. The integration of intelligent machine learning algorithms into the network intrusion detection system (NIDS) through SDN has attracted great interest over the past decade. The availability of data, the variety of data analysis techniques, and the many advances in machine learning algorithms help us build a better, more reliable, and more robust system for detecting the various kinds of network attacks. The study is a survey of NIDS for SDN.

A Deep Learning Approach for Network Intrusion Detection System

Network intrusion detection systems (NIDSs) are an important tool for network system administrators to assess network security. A NIDS monitors and analyzes the traffic entering and leaving the network devices of an organization and raises an alarm if an intrusion is detected. In terms of detection method, NIDSs fall into two categories: i) signature (misuse) based NIDS (SNIDS), and ii) anomaly detection based NIDS (ADNIDS). In an SNIDS, attack signatures are installed in the NIDS in advance, and traffic is matched against these signatures to detect an intrusion in the network. In contrast, an ADNIDS flags network traffic as an intrusion when it deviates from the normal traffic pattern. An SNIDS is effective at detecting known attacks. However, it has great difficulty detecting unknown or new attacks, because only the signatures of attacks installed in the IDS beforehand are available. An ADNIDS, on the other hand, is well suited to detecting unknown and new attacks. Although an ADNIDS tends to produce more false positives, its ability to detect new attacks has led to its wide acceptance.

Two issues arise in developing an effective and flexible NIDS. First, selecting the right features from the network traffic dataset for anomaly detection is difficult. Because attack scenarios change and evolve constantly, the features selected for one class of attack may not be suitable for other attack classes. Second, labelled traffic datasets from real networks are lacking for NIDS development. It takes a great deal of work to produce such a labelled dataset from raw network traffic traces collected over a period of time or in real time.

Intrusion Preventing System using Intrusion Detection System Decision Tree Data Mining

With worldwide connectivity, network security has become an increasingly active area of research and development. As the number of attacks grows, the firewall has become an important part of security policy worldwide. A firewall can allow or deny traffic on the network, but because a firewall cannot itself detect an attack, logging intrusions and applying the findings to the firewall is a way to prevent them. Intrusion detection technology is regarded as a complementary solution for detecting intrusions in a network that a firewall alone does not catch. Firewalls and IDSs are both established technologies in information security. A firewall is good at protecting systems and networks and reduces the risk of network attacks. An IDS can detect an intrusion or attack. The ability to connect an IDS and a firewall is called an intrusion prevention system (IPS), which can both detect and prevent attacks. There is at least one distinct
rule for every firewall. Each network packet that reaches the firewall is tested against the defined rules until a matching rule is found. According to the matching rule, the packet is allowed or denied from reaching the network. Each rule specifies a particular kind of traffic. The details of network activity can be seen in the traffic logs from the analyst's perspective. This study aims to prevent intrusions by analysing logs for intrusive behaviour, as an IDS does, and then enforcing firewall rules such as blocking. The approach relies on data mining and machine learning for security. The technique used to generate the rules is decision tree classification with the ID3 algorithm. The resulting rules are effective and well suited to enforcement in firewalls.

A Deep Learning Approach to Network Intrusion Detection

The network intrusion detection system (NIDS) plays an important part in protecting computer networks. However, there are concerns about the feasibility and sustainability of current approaches in meeting the requirements of modern networks. In particular, these concerns relate to the increasing level of human interaction required and the decreasing level of detection accuracy. The paper presents a new deep learning technique to address these issues. It details a nonsymmetric deep autoencoder (NDAE) for unsupervised feature learning. It also proposes a deep learning classification model built using the NDAE. The proposal was implemented in GPU-enabled TensorFlow and evaluated using the KDD Cup '99 benchmark and the NSL-KDD dataset.

EXISTING SYSTEM:

• Today, networks have become an important part of public infrastructure and of computing in public or private clouds.
• Managing traditional networks has become a challenge.
• These difficulties have hindered the introduction of new and innovative services within the same network, making it hard to interconnect networks, business organizations, and the Internet as a whole.

Problem Statement:

• Attacks are usually very difficult and time-consuming to separate from normal traffic activity.
• Analysts need to review large and wide-ranging data to monitor network connections.
• A network traffic analysis method is therefore needed to determine traffic patterns.
• Connecting a firewall to an IDS, known as an IPS, makes it possible not only to detect an attack but also to prevent it.

Proposed System:

• Genetic algorithms are among the most widely used and readily available machine learning techniques.
• In a decision tree, each internal node tests one of the attributes of a given instance, while each leaf gives the outcome: whether the instance is normal or an attack (possibly a probable attack).
• A new method for finding IDS signatures using a decision tree, a machine learning technique, is presented. The technique used to generate the rules is the decision tree classification algorithm.

Advantages:

• Attack detection can be done manually or automatically.
• The IDS should be able to adapt to growth and exposure over time.
• Using a decision tree is important, because understanding automated attacks and how to respond to them is becoming increasingly important.

HARDWARE REQUIREMENTS:
• System : Pentium i3 Processor
• HDD : 500 GB
• Screen : 15'' LED
• Devices : Keyboard, Mouse
• Random Access Memory : 2 GB

SOFTWARE REQUIREMENTS:

• Software : Windows 10
• Language : Python

BLOCK DIAGRAM:

FLOW DIAGRAM:

Some of the most widely used algorithms:
• K-Nearest Neighbor
• Decision tree/Random forest
• Support vector machines
• Intercession

Decision tree

Introduction

So far, the material covered has been hard to follow. The decision tree, by contrast, is probably one of the simplest algorithms in machine learning. There is not much to it. It is one of the most widely used and practical machine learning methods because it is easy to use and to explain.

What is a Decision Tree?

It is a tool with applications in many different areas. Decision trees can be used for classification as well as regression problems. As the name suggests, it uses a tree-like structure to show the predictions that result from a sequence of splits. It starts at the root and ends with the decision made at the leaves.
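
The root-to-leaf walk described above can be sketched in a few lines. This is a hand-built toy tree, not the tree learned in this project: the feature names are borrowed from the KDD Cup '99 feature set, while the thresholds, the labels at the leaves, and the helper names `classify` and `extract_rules` are illustrative assumptions. The second helper shows how each root-to-leaf path can be read off as an IF ... THEN rule, which is the form in which decision tree output is typically handed to a rule-based component.

```python
from typing import Any, Dict, List, Tuple

Node = Dict[str, Any]

# Toy tree: thresholds and structure are made up for illustration only
TREE: Node = {
    "feature": "count", "threshold": 100,
    "low": {
        "feature": "diff_srv_rate", "threshold": 0.5,
        "low": {"label": "normal"},
        "high": {"label": "probe"},
    },
    "high": {
        "feature": "serror_rate", "threshold": 0.5,
        "low": {"label": "normal"},
        "high": {"label": "dos"},
    },
}


def classify(node: Node, record: Dict[str, float]) -> str:
    """Walk from the root to a leaf and return the leaf's label."""
    while "label" not in node:
        branch = "low" if record[node["feature"]] <= node["threshold"] else "high"
        node = node[branch]
    return node["label"]


def extract_rules(node: Node, conditions: Tuple[str, ...] = ()) -> List[str]:
    """Turn every root-to-leaf path into an IF ... THEN rule string."""
    if "label" in node:
        cond = " AND ".join(conditions) or "TRUE"
        return [f"IF {cond} THEN {node['label']}"]
    f, t = node["feature"], node["threshold"]
    return (extract_rules(node["low"], conditions + (f"{f} <= {t}",))
            + extract_rules(node["high"], conditions + (f"{f} > {t}",)))


record = {"count": 250, "serror_rate": 0.9, "diff_srv_rate": 0.1}
print(classify(TREE, record))
for rule in extract_rules(TREE):
    print(rule)
```

In practice the tree would be learned from labelled records rather than written by hand; the traversal and the rule extraction stay the same.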

Before we study the decision tree, let us look at a few terms.

Root Node - The node at the start of the decision tree, from which the population begins to divide according to different features.
Decision Nodes - The nodes obtained after splitting the root node.
Leaf Nodes - Nodes that cannot be split further, also called terminal nodes.
Sub-tree - A sub-section of the decision tree.
Pruning - Removing nodes from the tree to prevent overfitting.

MODULES:

• Dataset collection
• Data Cleaning
• Feature Extraction
• Model training
• Testing model
• Performance Evaluation
• Prediction

Dataset collection:

Collecting data lets you capture a record of past events so that data analysis can be used to find recurring patterns. From these patterns, predictive models can be built with machine learning tools to anticipate future changes. Since a predictive model is only as good as the data it is built from, good data collection practice is the best way to improve performance. The data should be free of errors (garbage in, garbage out) and should contain information relevant to the task at hand. For example, a loan default model would not benefit from unrelated figures, but might benefit from gas prices over time. In this module, we collect data from a Kaggle dataset. The dataset contains records collected over previous years.

Data cleaning:

Data cleaning is an important part of all machine learning work. In this module, data cleaning prepares the
data by removing or correcting values that are wrong, incomplete, or misleading. Exploring the data helps identify which cleaning operations are needed.

Feature Extraction:

This is done to reduce the number of features in the dataset, which speeds up training and increases efficiency. In machine learning, pattern recognition, and image processing, feature extraction starts from an initial set of measured data and builds derived values (features) intended to be informative and non-redundant, facilitating the subsequent learning and generalization steps, and in some cases leading to better interpretability. Feature extraction is related to dimensionality reduction. When the input to an algorithm is too large to be processed and is suspected to be redundant (for example, the same measurement in both feet and meters, or the repetitiveness of images represented as pixels), it can be transformed into a reduced set of features (also called a feature vector). Determining a subset of the initial features is called feature selection. The selected features should contain the relevant information from the input data, so that the desired task can be performed using this reduced representation instead of the complete data.

Model training:

A training model is a dataset used to train an ML algorithm. It consists of sample output data and the corresponding sets of input data that influence the output. The training model is used to run the input data through the algorithm and correlate the processed output with the sample output. The result of this correlation is used to adjust the model. This iterative process is called "model fitting". The accuracy of the training dataset or the validation dataset is important for the precision of the model. Model training in machine learning is the process of feeding data to the ML algorithm to help it identify and learn
good values for each of its features. There are many kinds of machine learning, the most common of which are supervised and unsupervised learning.

Testing model:

In this module, we test the trained machine learning model using the test dataset. Quality assurance is needed to make sure the software system works as required. Were all the features implemented as agreed? Does the program behave as expected? All the criteria the program is tested against should be stated in the technical specification. In addition, software testing can reveal the defects and flaws introduced during development. Once the application is released, you do not want your clients coming to you with complaints about bugs. Different kinds of tests make it possible to catch errors that appear only at run time.

Performance Evaluation:

In this module, we evaluate the performance of the trained machine learning model using performance evaluation measures such as F1 score, accuracy, and classification error. When the model performs poorly, we tune the machine learning algorithm to improve performance.

Performance evaluation is also defined as a formal and productive procedure for measuring an employee's work based on job responsibilities. It is used to gauge the value an employee adds in terms of increased business revenue, compared with industry standards and overall return on investment (ROI). Organizations that have learned the art of "mutual benefit" rely on a systematic performance evaluation process to assess and evaluate the performance of their employees regularly. Ideally, employees are evaluated annually on their work anniversaries, on which basis they receive promotions or salary increases. Performance evaluation also plays a direct part in giving
feedback to employees so that they better understand their performance standards.

Prediction:

"Prediction" refers to the output of an algorithm after it has been trained on a historical dataset and applied to new data to forecast the likelihood of a particular outcome, such as whether a customer will stay for 30 days. The algorithm generates probable values for an unknown variable for each new record, allowing the model builder to identify what that value is most likely to be.

The word "prediction" can be misleading. Sometimes it really does mean forecasting the future, such as using machine learning to determine the next course of action. In other cases, the "prediction" concerns something that has already happened, for example whether a transaction that has already taken place was legitimate. In this case, the event has already occurred, but the prediction gives an informed judgement on whether it was acceptable and allows appropriate action to be taken.

In this module, we use the trained and optimized machine learning algorithm to predict whether a connection record is normal or an attack.

RESULT:

Train and Test the dataset

After completion of training and testing on the KDD dataset, the dataset that contains the attacks undergoes static testing. After
completion of testing, the attack data is shown in a plot.

Next, static testing is performed on the normal dataset for attack classification.

CONCLUSION:

Network traffic logs describe patterns of behaviour that occur during intrusions and normal activity. The decision tree technique is well suited to the operation of the IDS and is applied in the genetic algorithm for prevention. In addition, this technique works efficiently and avoids redundant rules, such as in firewalls.

REFERENCES:

[1] R. M. A. Ujjan, Z. Pervez, K. Dahal, A. K. Bashir, R. Mumtaz, and J. González, "Towards sFlow and adaptive polling sampling for deep learning based DDoS detection in SDN," Future Generation Computer Systems, vol. 111, pp. 763-779, 2020, doi: 10.1016/j.future.2019.10.015.
[2] "Software Defined Networking Definition." https://www.opennetworking.org/sdn-definition/ (accessed Mar. 2, 2020).
[3] S. Garg, K. Kaur, N. Kumar, and J. J. P. C. Rodrigues, "Hybrid Deep-Learning-Based Anomaly Detection Scheme for Suspicious Flow Detection in SDN: A Social Multimedia Perspective," IEEE Transactions on Multimedia, vol. 21, no. 3, pp. 566-578, 2019, doi: 10.1109/tmm.2019.2893549.
[4] M. Nobakht, V. Sivaraman, and R. Boreli, "A Host-Based Intrusion Detection and Mitigation Framework for Smart Home IoT Using OpenFlow," presented at the 2016 11th International Conference on Availability, Reliability and Security (ARES), 2016.
[5] M. S. Elsayed, N. Le-Khac, S. Dev, and A. D. Jurcut, "Machine Learning Techniques for Detecting Attacks in SDN," in 2019 IEEE 7th International Conference on Computer Science and Network Technology (ICCSNT), 19-20 Oct. 2019, pp. 277-281, doi: 10.1109/ICCSNT47585.2019.8962519.
