DEEP LEARNING APPROACHES FOR NETWORK INTRUSION DETECTION
by
Gabriel Fernandez
THESIS
Presented to the Graduate Faculty of
The University of Texas at San Antonio
In Partial Fulfillment
Of the Requirements
For the Degree of
COMMITTEE MEMBERS:
Shouhuai Xu, Ph.D., Chair
Greg White, Ph.D.
Wenbo Wu, Ph.D.
I would like to dedicate this thesis to my wife, for all of her love and support.
ACKNOWLEDGEMENTS
First of all, I would like to thank my wife and family for all the love and support they’ve
provided as I’ve embarked on my career and postgraduate education.
I would also like to thank my colleagues at USAA for their guidance and mentorship as I’ve
worked through the M.S. Computer Science program, including Chuck Oakes, Dr. Barrington
Young, Dr. Michael Gaeta, Maland Mortensen, Debra Casillas, Brad McNary, and many others.
In addition, I would like to thank my advisor Dr. Shouhuai Xu, committee members Dr. Greg
White and Dr. Wenbo Wu, and my colleagues in the Laboratory for Cybersecurity Dynamics at
UTSA for their guidance, fellowship, and collaboration as I worked toward completion of this
research, including Dr. Marcus Pendelton, Richard Garcia-LeBron, Jose Mireles, Eric Ficke, and
Dr. Sajad Khorsandroo. I would also like to thank all of my professors throughout my master’s
curriculum at both UTSA and TAMU-CC Departments of Computer Science for their dedication
and excellence in teaching. Furthermore, I’d like to thank my colleagues at Texas A&M – Corpus
Christi, including Dr. Liza Wisner, Dr. Julie Joffray, Dr. Laura Rosales, and Lori Blades for
assisting in my growth and professional development as part of the ELITE team. Lastly, I would
like to thank my uncle, Dr. John Fernandez, for his guidance and mentorship during this endeavor.
May 2019
DEEP LEARNING APPROACHES FOR NETWORK INTRUSION DETECTION
As the scale of cyber attacks and the volume of network data increase exponentially, organizations
must develop new ways of keeping their networks and data secure from dynamic, evolving threat
actors. With more security tools and sensors being deployed within the modern-day enterprise
network, the amount of security event and alert data being generated continues to increase, making
it more difficult to find the needle in the haystack. Organizations must rely on new techniques to
assist and augment human analysts in monitoring, preventing, detecting, and responding to
cybersecurity events and potential attacks on their networks.
The focus of this Thesis is on classifying network traffic flows as benign or malicious. The
contribution of this work is two-fold. First, a feedforward fully connected Deep Neural Network
(DNN) is used to train a Network Intrusion Detection System (NIDS) via supervised learning.
Second, an autoencoder is used to detect and classify attack traffic via unsupervised learning in the
absence of labeled malicious traffic. Deep neural network models are trained using two recent
intrusion detection datasets that overcome the limitations of older datasets commonly used in the
field. Using these more recent datasets, deep neural networks are shown to be highly effective at
supervised detection and classification of modern-day cyber attacks, achieving high accuracy, a
high detection rate, and a low false positive rate. In addition, an autoencoder is shown to be
effective for anomaly detection.
TABLE OF CONTENTS
Acknowledgements
Abstract
List of Figures
Chapter 1: Introduction
1.1 Thesis Contribution
1.2 Thesis Organization
3.2.4 Case Study I: ISCX IDS 2012 Dataset
3.2.5 Case Study II: CIC IDS 2017 Dataset
3.3 Case Study Limitations
3.4 Summary
Bibliography
Vita
LIST OF TABLES
LIST OF FIGURES
Figure 2.1 Relationship between Artificial Intelligence, Machine Learning, and Deep Learning (Figure adapted from [27])
Figure 2.2 Example convex optimization function (computed on Wolfram|Alpha)
Figure 2.3 Simple neural network
Figure 2.4 Comprehensive neural network representation
Figure 2.5 Simplified neural network representation
Figure 2.6 Scale drives deep learning performance (Figure adapted from [81])
Figure 2.7 Sigmoid vs. ReLU activation functions
Figure 2.8 Illustration of the iterative process for using Machine Learning in practice (Figure adapted from [81])
Figure 2.9 Confusion Matrix
Figure 2.10 ISCX IDS 2012 Dataset: Number of flows per day
Figure 2.11 ISCX IDS 2012 Dataset: Number of attacks per day
Figure 2.12 CIC IDS 2017 Dataset: Number of flows per day
Figure 2.13 CIC IDS 2017 Dataset: Number of attacks per day
Figure 3.11 CIC IDS 2017 Confusion Matrix using Embeddings without IP Address
Figure 3.12 CIC IDS 2017 Confusion Matrix - First 3 Octets of IP address
Figure 3.13 Confusion Matrix - Multinomial Classification using full IP address
Figure 3.14 Confusion Matrix - Multinomial Classification using first 3 octets
Chapter 1: INTRODUCTION
The Internet has revolutionized society — with more and more people connecting every day, it is
fast becoming a necessity of daily life and a mainstay for conducting day-to-day business. Continued
growth in both network access and the speed of network connectivity has facilitated widespread
adoption by the world at large. While the growth of the Internet continues to enable breakthrough
innovations and life-changing benefits to society, it also opens the possibility for adversaries to
conduct malicious activity in this digital arena. These adversaries primarily consist of three groups:
nation-states, cybercriminals, and activist groups (e.g. Anonymous). Their motivations include
espionage, political and ideological interests, and financial gain [10]. While their motivations may
be varied, their aims are the same: leverage the connectivity of society through the Internet to carry
out a malicious goal. These goals range from theft of intellectual property, denial of service,
disruption of business, theft of personally identifiable information (PII) or payment card information
(PCI), financial fraud, demanding a ransom (i.e. ransomware), and destruction of physical property
(e.g. the Conficker worm), to other nefarious purposes.
Given the opportunity that exists for bad actors to conduct this malicious activity, it is imperative
that all cyber infrastructure be secured and protected from misuse. Among the many cyber infrastructure
systems that exist (e.g. critical infrastructure, cyber-physical systems, SCADA systems,
the Internet of Things, etc.), this Thesis focuses on protecting the networks maintained and operated
by enterprises, both large and small, from being exploited by bad actors. Often, bad actors seek to
gain access to enterprise networks for a variety of reasons, including but not limited to theft of
intellectual property, access to trade secrets, insider information for illegal stock trading, disruption
of business (e.g. the Sony attacks in 2014), and theft of financial information (e.g. the Target breach in 2013).
To combat bad actors, a wide array of approaches has been developed to stay
one step ahead of the adversary. The best overall approach for tackling this problem consists of
a defense-in-depth strategy, whereby various security tools, techniques, and mechanisms are employed
throughout an organization's ecosystem, both horizontally and vertically at different levels.
It is commonly understood that there is no such thing as 100% security. The aim instead is to
manage risk and reduce the surface area available for attack. Security is an intractable problem,
as it is impossible to anticipate all the possible ways an attacker may break through the defenses. A
preferred strategy, as suggested by the MIT Computer Science and Artificial Intelligence Laboratory
(CSAIL) [85], is to minimize the attack surface and manage risk by employing a defense-in-depth
strategy. Various techniques can be used to reduce the surface area vulnerable to attack (e.g., by
enforcing access control, multi-factor authentication, network segmentation, and continuous patching),
as well as to reduce risk by deploying tools at various stages, from the exterior-facing network to the
interior network, and down to the individual host-level workstations on the network.
A key challenge in protecting information systems and networks from compromise is the
fact that they are built upon complex layers of software. Due to its growing complexity, software
often contains vulnerabilities that can be found and exploited by an attacker. Even if the software
of a given security tool uses proven algorithms and standards for security, it can still suffer from
a flawed implementation that leaves a security hole. These risks can be mitigated by putting in place
strategies and best practices such as continuous patching, bug bounty programs, red/blue team
exercises, threat hunting, honeypots, honeynets, moving target defense, and vulnerability management
programs, to name a few. However, attackers continually try new avenues to compromise
defenses by altering their attack strategies and using never-before-seen techniques. Commonly
referred to as zero-day attacks, these attacks can be very damaging and frustrating to the
defender, as the attacker has developed a new exploit that bypasses the defenses of a given system
or software. Therefore, as mentioned previously, the best strategy is to implement defense-in-depth
and expect that, with enough time and resources, an attacker will inevitably gain access to the
network somewhere along the way. It is paramount that when this occurs, the attack is discovered
promptly and quarantined or eliminated before any material harm is done.
One of the most effective ways to protect the confidentiality, integrity, and availability of information
and enterprise systems once an attacker has compromised their defenses is to deploy Intrusion
Detection Systems (IDS). Intrusion Detection Systems are defined by the National Institute
of Standards and Technology (NIST) as "software or hardware systems that automate the process of
monitoring the events occurring in a computer system or network, analyzing them for signs of security
problems" [17]. Intrusion Detection is the art and science of finding attackers that have bypassed
preventive defense mechanisms such as firewalls, access control, and other protection mechanisms
further up or down the stack. More formally, Intrusion Detection is defined by NIST as the "process
of monitoring the events occurring in a computer system or network and analyzing them for signs
of possible incidents, which are violations or imminent threats of violation of computer security
policies, acceptable use policies, or standard security practices" [83]. There are two main types of
Intrusion Detection Systems: host-based and network-based. Host-based intrusion detection systems
monitor and control data coming from an individual workstation using tools and techniques
such as host-based firewalls, anti-virus/anti-malware agents, data-loss prevention agents, and
monitoring of system call trees. Network-based defenses monitor and control network traffic flows
via firewalls, anti-virus, proxies, and network intrusion detection techniques. Network Intrusion
Detection Systems (NIDSs) are essential security tools that help increase the security posture of a
computer network. NIDSs have become necessary, alongside firewalls, anti-virus, access control,
and other common defense-in-depth strategies, in helping cyber threat operations teams become
aware of attacks, security incidents, and potential breaches occurring on their networks. The
focus of this research is on advancing NIDSs by leveraging recent advances in deep learning.
There are two main types of Network Intrusion Detection Systems: signature/misuse based and
anomaly based. Signature based systems generate alarms when a known misuse or bad activity occurs.
These systems use techniques to measure the difference between input events and signatures
of known bad intrusions. If an input event shares patterns of similarity with known bad intrusions,
then the system flags the event as malicious. These systems are effective at finding known bad
attacks, and can flag them with a low false positive rate. The downside to these systems is that they
are not able to detect novel attacks [35]. Anomaly based systems trigger alarms when observed
events behave substantially differently from previously defined known good patterns. The
advantage of these systems is that, unlike signature based systems, they are able to detect novel
and evolving attacks. An anomaly, by definition, is anything that deviates from what is considered
standard, normal, or expected; anomalies are rare in occurrence. The goal of an anomaly detection
system is to identify any event, or series of events, that falls outside a predefined set of normal
behaviors. It is important to note that not all anomalies are necessarily malicious; they are simply
deviations from expected normal behavior. Once an event or pattern is deemed an anomaly, it can
be further labeled as either benign or malicious. Therefore, one of the main challenges in anomaly
based systems is that they can generate a high rate of false positives, as well as false negatives.
As described, there are numerous host-based and network-based security tools put in place at
different layers to detect attacks — these tools generate security events, which must then be evaluated
either systematically or by a human analyst. Oftentimes, these security events are centralized
in a Security Information and Event Management (SIEM) system, where they can be triaged by a team
of security analysts. These SIEMs contain a combination of signatures, rules, and anomaly detection
modules that correlate the myriad security events, triggering alerts to be worked by a
cybersecurity analyst. A common problem faced by these SIEMs is a high rate of false positives.
The sheer number of events, and thus alerts generated, can overwhelm a security operations team.
This results in "alert fatigue" [69] and ultimately makes it increasingly difficult to triage the alerts.
As a result, true positive alerts can become buried in a sea of false positives, allowing an attacker
to fly under the radar and go undetected until it is too late and material harm has been done.
Thus, an ongoing cyber attack campaign can go undetected, leading to a multitude of negative
outcomes.
This Thesis approaches the challenge of detecting attacks using network intrusion detection in a
two-fold manner. First, a fully connected Deep Neural Network (DNN) is used to train a NIDS
with supervised learning using labeled benign and malicious network traffic data. Newer benchmark
datasets produced by the Canadian Institute for Cybersecurity at the University of New
Brunswick (CIC UNB) are used, which are more representative of modern-day network traffic and
attacks [94,95] and do not have the drawbacks of previous datasets commonly used in the field [96,97].
After learning these patterns of malicious and benign traffic by training a fully connected neural network,
the system can reliably and effectively detect and classify modern attack traffic with a high degree
of accuracy, a high rate of recall, and a low false positive rate. This is considered a form
of pattern-based detection because the system is trained on known good and known bad patterns
and taught to detect these patterns in future, unseen network flows.
Second, an alternative deep learning approach, known as an autoencoder, is used to detect and
classify attack traffic in the case where there is no labeled malicious training data. This approach
is important because, in practice, it may be difficult to obtain labeled training data with which
to train a supervised deep learning algorithm on malicious and benign traffic. In addition,
adversaries are constantly evolving and attempting new attacks, against which
a pattern-based system may not be effective, since new attacks may have patterns that are vastly
different from what has been seen historically. This second approach is considered an unsupervised
anomaly based approach, as the learning algorithm clusters the traffic, whereby
anomalous activity (i.e. outliers) will stand out from the normal traffic.
In the following chapters, experiments will be outlined that implement deep learning approaches
for network intrusion detection, in an attempt to detect and classify malicious traffic. Chapter 2
reviews the preliminary knowledge in the field of machine learning as well as the subfield of deep
learning and how it can be used for network intrusion detection. Then the evaluation metrics are
described, and the two datasets that will be used in experiments are introduced. Chapter 3 presents
a case study on using a fully connected feedforward neural network to perform a classification task
on network traffic flows on the two datasets. Chapter 4 describes another case study which utilizes
an Autoencoder to perform anomaly detection. Chapter 5 reviews and compares with related prior
work. This work is concluded in Chapter 6 with a discussion on the use of these deep learning
approaches to network intrusion detection, a review of the insights gained, and suggestions for
future work in this field.
Chapter 2: PRELIMINARIES AND DATASETS
This chapter provides background on the fields of artificial intelligence, machine learning, and deep
learning and how they relate to the problem of network intrusion detection. In addition, the metrics
that will be used to evaluate the deep learning algorithms in this work are reviewed. Lastly, the
datasets that will be used for our experiments are described in detail, as well as how features are
set up for the learning algorithm.
2.1 Preliminaries
Deep learning sits nestled within the field of machine learning, and machine learning is a subset
of Artificial Intelligence (Figure 2.1). Deep learning is a subfield of machine learning that deals
with utilizing neural networks containing a large number of parameters and layers. Therefore,
a background on artificial intelligence and machine learning concepts will be reviewed, then a
framework for applying deep learning for network intrusion detection will be discussed.
The field of Artificial Intelligence (AI) was born in the 1950s when computer scientists set out
to determine if computers could "think" like humans. Artificial Intelligence is defined by MIT's
Marvin Minsky as "the science of making machines do things that would require intelligence if
done by men" [18]. Another similar definition, provided by Chollet in [27], is that artificial
intelligence is simply "the effort to automate intellectual tasks normally performed by
humans." Therefore, artificial intelligence not only encompasses the subfields of machine learning
and deep learning, but also many other approaches for automating intellectual
tasks normally performed by humans. These other approaches, however, do not involve having
the computer learn; they rely on rules that are explicitly programmed
by humans, and are known as symbolic artificial intelligence. These systems perform well for
solving well-defined logical tasks such as playing chess; however, they are ill-equipped to deal with
more complex tasks such as image classification and language translation. Thus, a new approach
to artificial intelligence, termed machine learning, gained traction over the previous approaches of
symbolic AI.
Figure 2.1: Relationship between Artificial Intelligence, Machine Learning, and Deep Learning.
(Figure adapted from [27])
The field of intrusion detection is another area where existing approaches often rely on rules
programmed by humans. While there is a place for these existing intrusion detection systems, and
they do work well for enforcing specific parameters and blocking known attack signatures, they
struggle to adapt to new, unseen attacks that do not fall within the defined
ruleset. A machine learning based system can learn what patterns constitute benign and
ruleset defined. A machine learning based system can learn what patterns constitute benign and
malicious traffic, and when new traffic comes through, the learning model can determine whether
this new traffic looks benign or looks similar to an attack, based on what it has learned about what
constitutes benign versus malicious, based on the complex patterns it has learned from the data.
Alan Turing in his seminal 1950 paper “Computing Machinery and Intelligence” [101] came to the
conclusion that general purpose computers could learn and be capable of originality. This opened
up questions of whether computers could learn on their own to perform a specific task — can
8
computers learn rules by looking at data, instead of having humans input the rules manually? These
questions gave rise to the subfield of machine learning. Machine learning algorithms are learning
algorithms that learn and adjust from data. Instead of manually programming the computer and
telling it explicitly what to do, machine learning algorithms enable the program to learn what
output to produce, implicitly based on examples and data. By learning based on examples and
data, this allows the computer to make decisions and perform tasks on new inputs it has never seen
before.
Mitchell [74] defines a learning algorithm as follows: "A computer program is said to learn
from experience E with respect to some class of tasks T and performance measure P, if its
performance at tasks in T, as measured by P, improves with experience E." In the context of this
work, the task is for the learning-based algorithm to classify a network flow as being either benign
or malicious. In this case, the individual network flow is an example input to the machine learning
based algorithm for a classification task. Each example is represented by a set of features in certain
quantitative or categorical measurements, leading to a vector x ∈ R^n where each entry x_i in
the vector corresponds to an individual feature (assuming only numeric features are present). The
features for describing netflow examples are described in more detail in Section 2.2.
While there are several types of tasks a machine learning algorithm can accomplish, we focus
on the task of classification, which has two variants: binary and multiclass (also called multinomial).
In either type of classification, the learning algorithm's goal is to determine a
function f : R^n → {1, ..., k}, where k = 2 for binary classification and k ≥ 3 for multinomial
classification. For a function y = f(x), the machine-learned model assigns to a given input x a
numerical value y, representing the output class. When k = 2, the output class y indicates that the
netflow represented by input x is either benign or malicious; when k ≥ 3, the output class y indicates
that the netflow represented by input x is either one of a particular set of attacks, such as
Denial of Service (DoS), Probe, Remote-to-Local (R2L), and User-to-Root (U2R), or the network
flow is of a benign class. Machine learning algorithms can be placed into two main categories,
supervised and unsupervised, which will be described in the upcoming sections.
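To make the classification function f concrete, the toy sketch below classifies a flow's numeric feature vector with a hand-picked linear rule. The feature values, weights, and bias are invented for illustration (nothing here is learned from data), and the label set is written as {0, 1}, with 0 = benign and 1 = malicious, for convenience:

```python
from typing import Sequence

def classify(x: Sequence[float], weights: Sequence[float], bias: float) -> int:
    """Toy linear classifier f : R^n -> {0, 1} (0 = benign, 1 = malicious).

    The weights and bias are illustrative placeholders, not learned values.
    """
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score > 0 else 0

# Hypothetical 3-feature flow vector (e.g., scaled duration, bytes, packets).
flow = [0.2, 0.9, 0.4]
label = classify(flow, weights=[0.5, 1.0, -0.3], bias=-0.8)
print(label)
```

A trained model would replace the hand-picked weights with parameters learned from labeled flows, but the shape of the function — feature vector in, class label out — is the same.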
2.1.2.1 Supervised Learning
Supervised learning involves the program (system) observing many examples of a given input
vector x along with an associated output vector y containing corresponding labels. The learning
algorithm learns how to predict y when given x, often by estimating p(y | x). Supervised learning
uses an algorithm to learn a function that maps the input to the output, in the form Y = f (X).
The goal is to learn the mapping function in such a way that when a new input sample (x) is run
through the function (f ), it can predict an output (Y ) that is correct. This process is referred to as
supervised learning because the process can be thought of as a teacher supervising a student. As the
‘student’ iteratively makes predictions, the ‘teacher’ supervises and informs the ‘student’ whether
these predictions are correct. The learning algorithm adjusts itself until it reaches an acceptable
level of performance.
Fundamentally, supervised machine learning and deep learning are based on this concept of
conditional probability, following the equation

P(E | F)    (2.1)

where E is the label (in this case benign or malicious), and F represents the various attributes or
features that describe the example or entity for which we are predicting E. A common application
of conditional probabilities is what is known as Bayes' Theorem, which is defined for any two
events A and B as:

P(A | B) = P(B | A) P(A) / P(B)    (2.2)

where
• P(B | A) is the probability of event/data B, given that the hypothesis A was true.
• P(A) is the probability of hypothesis A being true (regardless of associated event/data). This
is referred to as the prior probability of A.
Probability is a centerpiece of neural networks and deep learning due to how it enables feature
extraction and classification [87].
At its foundation, machine learning is built on linear algebra and the solving of systems
of linear equations. The most basic form of these equations is:

Ax = b    (2.3)
Data is represented in this equation using scalars, vectors, matrices, and tensors. A tensor is the
basic data structure for modern machine learning systems. Fundamentally, a tensor is a container
for data. A matrix can be described as being a two-dimensional tensor. Simply put, tensors are
a generalization of matrices that can be extended to an arbitrary number of dimensions. In the
equation above, A is a matrix containing all of the input examples as row vectors with the different
features as scalar values, and b is a column vector which has the output labels for each training
example, or vector, in the A matrix. The goal in this example is to solve for the coefficient x,
in this case a parameter vector, which produces the desired output b. This is accomplished by
changing the values in this parameter vector iteratively until the equation generates a desirable
outcome as close to the known output b as possible. These parameters are adjusted in a weight
matrix iteratively after a loss function calculates the loss between the calculated output, and the
actual value, also referred to as ground truth. The goal when solving this system of linear equations
is to minimize the error, or loss, via optimization.
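The iterative process described above can be sketched in a few lines: starting from a zero parameter vector, gradient descent on the squared error repeatedly nudges x until Ax matches b. The matrix A and target vector b below are small made-up values, chosen so the system has the exact solution x = (2, 3):

```python
# Minimal sketch of solving Ax = b iteratively (illustrative made-up data).
# A holds 3 training examples (rows) with 2 features each; b holds their labels.
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [2.0, 3.0, 5.0]
x = [0.0, 0.0]   # parameter vector, adjusted iteratively
lr = 0.1         # learning rate

for _ in range(500):
    # gradient of the squared-error loss sum_j (A[j]*x - b[j])^2 w.r.t. each x[i]
    grads = [
        sum(2 * (sum(A[j][k] * x[k] for k in range(2)) - b[j]) * A[j][i]
            for j in range(3))
        for i in range(2)
    ]
    for i in range(2):
        x[i] -= lr * grads[i]

print([round(v, 6) for v in x])  # converges to the exact solution [2.0, 3.0]
```

In a real deep learning system the "parameter vector" becomes weight matrices across many layers, but the loop — compute loss, compute gradient, adjust parameters — is the same.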
A common method for optimization when solving this system of linear equations (a way in
which the algorithm learns) is based on the iterative process called Stochastic Gradient Descent
(SGD). When performing optimization, the learning algorithm is searching through the hypothesis
space for which parameter values (coefficients) map the input to the output with the least amount of
error, or loss. There is a fine balance in this process, as the learning algorithm should neither underfit
nor overfit the training data. Instead, the parameters should be optimized so that the learned
function generalizes well to the overall population of data for the given problem set.
As discussed, there are three primary levers at play within machine learning optimization: (1)
the parameterization, which translates input to output for determining a probability for a classification
or regression task; (2) the loss function, which measures how well the parameters classify
(reduce error, minimize loss) at each training step; and (3) the optimization function, which adjusts
the parameters toward a point of minimized error.
Convex optimization is one type of optimization that deals with a convex cost function. In
3-dimensional space, this can be imagined as a sheet held high at each of its four
corners, with each corner sloping down to form a cup shape at the bottom. See Figure 2.2 for a
visual representation of this cost function. The bottom of the convex shape represents the global
minimum, or zero cost.
Gradient descent is one optimization algorithm that can determine the slope of the valley (or
hill) at a given point based on the weight parameter of the cost function. The gradient descent
algorithm then adjusts the weights (parameters) of the function toward reaching a lower cost.
It determines which direction to adjust the weights based on the direction of the slope that it
calculates, with the goal of reaching a zero cost. Gradient descent is able to calculate the
slope of the function by taking the derivative of the cost function; the derivative of a function is
equivalent to its slope. For a two-dimensional loss function, for example, the
derivative (or slope) of the parabolic function y = 4x^2 is 8x, which gives the tangent
at any point on the parabola. The gradient descent algorithm takes the derivative of the
loss function to determine the gradient. This gradient provides the direction of the slope and thus
informs the algorithm on how to adjust its weights (parameters) in order to calculate a loss that
approaches zero on each successive step.
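For the parabolic loss y = 4x^2 from the text, gradient descent can be written out directly; the starting point and learning rate are arbitrary illustrative choices:

```python
# Gradient descent on the loss y = 4x^2; its derivative is dy/dx = 8x.
x = 5.0    # arbitrary starting point
lr = 0.05  # learning rate (step size)

for step in range(100):
    grad = 8 * x       # slope of the cost function at the current x
    x -= lr * grad     # step in the direction opposite the slope

print(x)  # x has been driven toward the minimum at 0
```

Each update multiplies x by (1 - lr * 8) = 0.6, so the loss shrinks geometrically toward the global minimum at x = 0.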
A variant of gradient descent is Stochastic Gradient Descent (SGD), where the gradient is
calculated after each training example is run. This variant is commonly used, as it has been shown
to increase training speed, and its computation can also be distributed via parallel computing. An
alternative is to perform mini-batch SGD, which takes in a small number of training examples for
each iteration of the loss calculation. This has been shown to be more effective than computing the
gradient update only one training example at a time.
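A minimal mini-batch SGD loop might look like the following; the one-parameter linear model, synthetic data, and hyperparameters are invented for illustration. Each update uses the gradient averaged over a small shuffled batch rather than the full dataset or a single example:

```python
import random

# Illustrative mini-batch SGD fitting y = w * x on synthetic data (true w = 3).
random.seed(0)
data = [(x, 3.0 * x) for x in [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]]
w = 0.0
lr = 0.05
batch_size = 2

for epoch in range(200):
    random.shuffle(data)  # stochastic: visit examples in a new order each epoch
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # gradient of the mean squared error over the mini-batch w.r.t. w
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        w -= lr * grad

print(round(w, 3))  # recovers the true slope, 3.0
```

The batch size trades off between the noisy-but-cheap single-example update and the smooth-but-expensive full-batch gradient.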
The main difference between supervised learning and unsupervised learning is that with supervised
learning there are given labels or targets corresponding to the input data, and with unsupervised
learning the algorithm is given no corresponding labels for its input data. Unsupervised learning is
commonly used within the data analytics space and is often used as a means to better understand a
dataset prior to using it within a supervised learning algorithm through dimensionality reduction.
In the field of exploratory data analysis and data visualization, since humans can only comprehend data represented in at most three dimensions, when a given dataset has 50 or 100 dimensions it becomes impossible to visualize and make sense of the data directly. This is
why dimensionality reduction is so useful, as it enables humans to visualize high-dimensional data
and discover patterns and clusters within the data. Another benefit to dimensionality reduction
is that if the size of the data can be reduced, it can be processed faster. In addition, reducing
dimensions also minimizes noise present in the data. When the data is compressed to a smaller
number of dimensions, the amount of room available to represent that data is limited, therefore
removing unnecessary noise.
As previously described, unsupervised learning techniques can be leveraged in order to help
separate the signal from the noise in a given dataset. The hypothesis is that by reducing the dimen-
sions and removing noise from the signal, a supervised deep learning classifier can perform better,
as it will mainly be learning from the signal, without additional noise getting in the way. Chap-
ter 4 explores the use of unsupervised deep learning techniques, namely autoencoders, to perform
dimensionality reduction and train a neural network to reconstruct its inputs, instead of predicting
a class label as in supervised learning. By learning the representation of the input data for nor-
mal network flows, a reconstruction error is calculated on never-before-seen test inputs, and any input whose reconstruction error exceeds a set threshold is flagged as anomalous. In this way, unsu-
pervised deep learning, and specifically autoencoders, can be a powerful engine for an anomaly
detection system.
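A minimal sketch of this thresholding scheme, using a linear PCA reconstruction as a stand-in for a trained autoencoder (the synthetic data and the 99th-percentile threshold are illustrative assumptions, not the configuration used in Chapter 4):

```python
import numpy as np

rng = np.random.default_rng(1)

# "Normal" training flows lie near a 2-D subspace of a 10-D feature space.
basis = rng.normal(size=(2, 10))
train = rng.normal(size=(500, 2)) @ basis + 0.01 * rng.normal(size=(500, 10))

# Fit a linear "autoencoder": the top-2 principal directions via SVD.
mean = train.mean(axis=0)
_, _, vt = np.linalg.svd(train - mean, full_matrices=False)
components = vt[:2]

def reconstruction_error(x):
    code = (x - mean) @ components.T      # encode down to 2 dimensions
    recon = code @ components + mean      # decode back to 10 dimensions
    return np.sum((x - recon) ** 2, axis=-1)

# Threshold set from the training-set errors (e.g., the 99th percentile).
threshold = np.percentile(reconstruction_error(train), 99)

normal_flow = rng.normal(size=(1, 2)) @ basis          # on the learned subspace
anomalous_flow = rng.normal(size=(1, 10)) * 5.0        # far off the subspace

flagged = reconstruction_error(anomalous_flow)[0] > threshold
```

A flow resembling the training distribution reconstructs with low error, while the off-subspace flow exceeds the threshold and is flagged as anomalous.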
Deep learning, a subfield of machine learning, excels in generalizing to new examples when the
data is complex in nature and contains a high level of dimensionality [45]. In addition, deep
learning enables the scalable training of nonlinear models on large datasets [45]. This is important
in the domain of network intrusion detection because not only is it dealing with a large amount of
data, but the model generated by the deep learning system will need to be capable of generalizing
to new forms of attacks not specifically represented in the currently available labeled data. Ideally,
the model could generalize and be effective in new, never-before-seen network environments, or
at a minimum be leveraged in a machine learning pipeline as part of a transfer learning step when
used with data from a different computer network environment.
While deep learning has gained popularity in recent years, it has been around for a long time
and its origin dates back to the 1940s [45]. Through its history, it has gone by different names such
as “cybernetics” in the 1940-1960 timeframe, and “connectionism” in the 1980-1990s, to what it is known by today as “deep learning,” with renewed interest starting in 2006. Some
of the early algorithms in deep learning were biologically inspired by computational models of
the human brain, thereby popularizing algorithms with names such as artificial neural networks
(ANNs) and by describing computational nodes as neurons. While the neuroscientific perspective
is considered an important source of inspiration for deep learning, it is no longer the primary basis
for the field — there simply does not yet exist a full understanding of the inner workings and
algorithms run by the brain. This is an active and ongoing area of research being conducted within
the field of “computational neuroscience.” While models of the brain such as the perceptron and
neuron have influenced the architecture and direction of deep learning over the years, it is by no
means a rigid guide. Modern deep learning instead is based more on the principle of multiple levels
of composition [45].
One of the catalysts of the resurgence of deep learning in the 2000s was the combination of an increase in computational power with an increase in available data. Deep learning excels when there exists a large amount of data from which the algorithm can learn. According to [45], the general rule of thumb as of 2016 is that supervised deep learning algorithms will
achieve good performance with at least 5,000 labeled examples per category. They will also ex-
ceed human performance when they are trained with a dataset that has at least 10 million labeled
examples. In the field of network intrusion detection, the most common benchmark datasets that
have been used in the past, such as the NSL-KDD ’99 dataset, are smaller in size, containing a
total of 148,517 training examples, with 77,054 being benign, and 71,463 being attack. The newer
benchmark datasets used in this work such as ISCX IDS 2012 and CIC IDS 2017 are much larger.
The ISCX IDS 2012 dataset contains over 2.54M examples, with over 2.47M being benign, and
68,910 being malicious. Similarly, the CIC IDS 2017 dataset contains over 2.83M examples with
over 2.27M being benign, and 557,646 being malicious. These larger datasets have many more
examples for the neural network to learn from, and therefore can be used to experiment on the ef-
fectiveness of using deep neural network architectures for classifying flows as benign or malicious.
While these datasets don’t quite have 10M examples, they are much larger than any IDS datasets
used in the past, and can be used to experiment and determine the effectiveness of deep learning
architectures and algorithms as applied to the domain of network intrusion detection. The amount
of data available in practice in an enterprise network is enormous and highly dimensional, often
not only containing raw PCAP and/or network flow data, but also including application event logs,
host-based logs, security event data, and a myriad of other log data from workstations, servers, sen-
sors, and other appliances spread throughout the network. Furthermore, there exist expert human analysts who can provide ongoing feedback to a learning-based system. This immense amount
of data is suited well for utilization by deep learning technologies to help find malicious activity
buried within the haystack of network traffic and log data on an enterprise network.
As described earlier, the underlying technology and algorithms in deep learning are based on
the utilization of neural network architectures consisting of multiple layers of neurons. In the next
section, we will provide some background on neural networks, then describe distinctions inherent
within deep neural network architectures.
2. Learning algorithm, or the way in which values for weights on the communication links are
determined
Each individual neuron maintains its own internal state, which is determined by the activation
function applied to its inputs. The neuron sends its activation to all the other neurons to which it is
directly connected downstream in the next layer of the network. Common activation functions
for a neuron include the sigmoid function, and more recently the rectified linear unit (ReLU)
function. These two activation functions are shown in Figure 2.7.
An example of a very simple neural network is shown in Figure 2.3. More specifically, this
architecture is representative of a single-layer perceptron, which was invented in 1957 at the Cor-
nell Aeronautical Laboratory by Frank Rosenblatt, and funded by the U.S. Office of Naval Re-
search [86]. In this example, the neuron Y receives inputs x_1, x_2, and x_3. Each of the three connections from the inputs x_1, x_2, x_3 to Y is represented by a weight variable in this first hidden layer: w_1^{[1]}, w_2^{[1]}, and w_3^{[1]}. Therefore, the input y_in to neuron Y is calculated as the sum of the three weighted input signals x_1, x_2, and x_3:
y_in = Σ_{i=1}^{n} x_i w_i^{[1]}    (2.5)
In this example, it can be supposed that the activation function on the hidden layer neuron will
be a sigmoid function:
σ(z) = 1 / (1 + e^{−z})    (2.6)
Therefore, after neuron Y performs its activation function on the input, it has an output or
activation of y_out. In this example, neuron Y is connected to output neurons Z_1 and Z_2, each having weights w_1^{[2]} and w_2^{[2]} respectively. Neuron Y then sends the output of its activation function to all
neurons in the next layer, in this case neurons Z1 and Z2 . The values that these two downstream
neurons receive as input will vary, as each of the connections to these neurons has a different associated weight, w_1^{[2]} and w_2^{[2]}.
At this final output layer, Z_1 and Z_2 have their own activation functions, which generate the final outputs from the network, ŷ_1 and ŷ_2 respectively. In the case of a classification problem, ŷ_1 and ŷ_2 can be the respective probabilities that the given input X is of class 0 or 1.
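The forward pass just described can be traced numerically; the input and weight values below are illustrative, not taken from the text:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Inputs and hidden-layer weights (illustrative values)
x = [0.5, -1.0, 2.0]
w_hidden = [0.4, 0.3, -0.2]       # w_1^[1], w_2^[1], w_3^[1]

# Weighted sum into neuron Y (Equation 2.5), then its sigmoid activation
y_in = sum(xi * wi for xi, wi in zip(x, w_hidden))
y_out = sigmoid(y_in)

# Y's activation fans out to Z1 and Z2, each connection with its own weight
w_out = [0.7, -0.5]               # w_1^[2], w_2^[2]
y1_hat = sigmoid(y_out * w_out[0])
y2_hat = sigmoid(y_out * w_out[1])
```

Here y_in = 0.5·0.4 + (−1.0)·0.3 + 2.0·(−0.2) = −0.5, and the two sigmoid outputs can be read as class probabilities.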
At each training step, the neural network adjusts the weights on its connections in an effort to minimize the loss of the cost function. Updating weights is the primary way in which the neural network learns. The neural network adjusts the weights using an algorithm called backpropagation learning. The backpropagation learning technique uses gradient descent (described earlier) on the weight values of the neural network connections in order to minimize the error on the output generated by the network.
The underpinning of the backpropagation algorithm is the chain rule from calculus, which tells us that a composition of functions can be differentiated using the following identity: (f(g(x)))′ = f′(g(x)) · g′(x).
Therefore, a neural network can update the weights for each neuron by using the backpropagation algorithm, which starts with the final loss value and works backward from the top layer down to the bottom layers of the network. At each backward step, the chain rule is applied to determine the contribution that each individual parameter (at each neuron) had in determining the loss value. Using gradient descent, the weights are updated accordingly, with the goal of optimizing the loss value at the end of the training cycle [86].
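As a hedged sketch of this idea for a single weight (the input, target, and weight values are arbitrary), the chain-rule gradient of a squared-error loss through a sigmoid neuron can be checked against a finite-difference estimate:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# One-neuron chain: loss L(w) = (sigmoid(w * x) - target)^2
x, target = 1.5, 0.0

def loss(w):
    return (sigmoid(w * x) - target) ** 2

def grad(w):
    # Chain rule: dL/dw = dL/da * da/dz * dz/dw
    z = w * x
    a = sigmoid(z)
    dL_da = 2 * (a - target)
    da_dz = a * (1 - a)          # derivative of the sigmoid
    dz_dw = x
    return dL_da * da_dz * dz_dw

# The analytic gradient agrees with a finite-difference estimate
w = 0.8
eps = 1e-6
numeric = (loss(w + eps) - loss(w - eps)) / (2 * eps)
```

Backpropagation applies exactly this decomposition layer by layer, from the loss back to every weight in the network.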
Deep neural networks (DNNs), also called deep feedforward networks or feedforward neural net-
works or multilayer perceptrons (MLPs), are a powerful mechanism for supervised learning. DNNs
are one type of deep learning architecture, in addition to recurrent deep neural networks (RNNs)
and convolutional deep neural networks (CNNs). This research focuses on the use of DNNs for
the task of network intrusion detection. DNNs can represent functions of increasing complexity,
by inclusion of more layers and more units per layer in a neural network [45]. In the context of
NIDSs, DNNs can be used to discover patterns of benign and malicious traffic hidden within large
amounts of structured log data. According to [82], a neural network is considered deep if it con-
tains more than three layers, including input and output layers. Therefore, any network with at
least two hidden layers is considered a deep neural network.
An example of standard deep learning representations can be seen in Figures 2.4 and 2.5. The former shows a deep, fully connected neural network, in which each neuron in a layer is connected to every neuron in the successive layer. The latter is a more simplified representation of a two-layer fully connected neural network. These figures convey common notation used for representing deep neural networks [81], which is the notation followed for the rest of this work. Nodes represent inputs, and edges represent weights or biases. The superscript (i) denotes the ith training example, and the superscript [l] denotes the lth layer.
The basic technical approach of deep learning for neural networks has been around for decades, so why has this area gained so much attention in recent years? The main reason is the increase in scale of both the amount of data and the computational power available. A larger amount of available data, combined with larger neural networks, has led to an increase in the performance of deep neural network learning algorithms, specifically in the context of supervised learning [81]. This concept is depicted in Figure 2.6.
Figure 2.4: Comprehensive neural network representation
Figure 2.6: Scale drives deep learning performance (Figure adapted from [81])
Figure 2.7: Sigmoid vs. ReLU activation functions
Computational hardware has also become much faster, allowing for training on much larger networks within a reasonable amount of time. This
faster computation is important because the training of neural networks is an iterative process. As
illustrated in Figure 2.8, a machine learning practitioner first has an idea for a particular neural
network architecture. Subsequently, they implement their idea in code. Finally, they run an ex-
periment to determine how well the neural network performs. Based on the performance of the
experiment, the practitioner modifies the architecture, hyperparameters, and/or code, and runs an-
other experiment. This iterative process is run repeatedly until best results are achieved. If each
experiment takes an exorbitant amount of time to run, the productivity of the practitioner is impacted, thus inhibiting their ability to achieve useful results for their machine learning application.
Therefore, the ability to iterate quickly with larger neural networks, coupled with larger amounts
of data, and new algorithmic innovations has led to higher performance than previously possible.
One of the hallmarks of deep learning is its ability to take complex raw data and create higher-
order features automatically in order to make the task of generating a classification or regression
output simpler [86]. In the field of cybersecurity and specifically network intrusion detection, the
amount of data being generated is continually increasing. Coupled with the continued growth
in computational power, deep neural networks can be an effective tool in performing supervised
learning for the task of network intrusion detection.
Figure 2.8: Illustration of the iterative process for using Machine Learning in practice (Figure
adapted from [81])
Autoencoders and Restricted Boltzmann Machines (RBMs) are two types of neural network ar-
chitectures that are considered building blocks of larger deep networks [86]. Often, these types of
networks are used in a pretraining phase to extract features and pretrain weight parameters for
a follow-on network(s). They are considered unsupervised because they do not use labels (ground-
truth) as part of their training. A common use case for using unsupervised pretraining is when there
exists a lot of unlabeled data, along with a relatively smaller set of labeled training data [86]. This
is a common scenario for enterprise network security use cases, as oftentimes there is a subset of labeled training data that has been reviewed, processed, and labeled by a human analyst, yet there
is still an enormous amount of unlabeled data for which there is not enough manpower to review.
The downside to having this pretraining step is the extra amount of overhead in terms of network
tuning and added training time.
Autoencoders are useful in cases where there are lots of examples of what normal data looks
like, yet it is difficult to explain what represents anomalous activity. For this reason, autoencoders can be powerful when used in anomaly detection systems. Autoencoders applied to network intru-
sion detection are described in more depth and used in experiments in Chapter 4.
2.1.4 Evaluation Metrics
The primary goal of a classification algorithm in the context of network intrusion detection is
to achieve the highest level of accuracy with the lowest number of false positives. In addition,
the True Positive Rate (TPR) (also referred to as Detection Rate, Recall, or Sensitivity) is an
important metric for network intrusion detection as it indicates the number of malicious examples
that are correctly identified. A number of common metrics are used to evaluate the effectiveness of
the deep learning approaches in this work [24, 88]. The basic terminology will be described first.
• True Positives (TP) are the number of samples that are correctly predicted as positive (e.g.
ground truth is ‘malicious’ and the prediction is also ‘malicious’).
• True Negatives (TN) are the number of samples that are correctly predicted as negative (e.g.
ground truth is ‘benign’ and the prediction is also ‘benign’).
• False Positives (FP) are the number of samples that are negative but predicted as positive
(e.g. ground truth is ‘benign’ and prediction is ‘malicious’).
• False Negatives (FN) are the number of samples that are positive but are predicted as nega-
tive (e.g. ground truth is ‘malicious’ and prediction is ‘benign’).
The performance of a supervised learning classification algorithm can be depicted via a confusion
matrix, an example of which is shown in Figure 2.9. The rows indicate the ground truth and the
columns indicate predicted class.
In the context of network intrusion detection, metrics can be defined as follows [24, 38, 88].
Note: True Positive Rate, or Recall, is also referred to in other studies as Detection Rate; therefore, results in this work are compared to these other studies where appropriate using the term Detection Rate, interchangeably with True Positive Rate.
Figure 2.9: Confusion Matrix
• Accuracy is the ratio of the number of total correct predictions made, TP + TN, to all predictions made, TP + FP + FN + TN, namely

Accuracy = (TP + TN) / (TP + FP + FN + TN)    (2.9)
• Precision, which is also known as Bayesian Detection Rate or Positive Predictive Value, is the ratio of the total number of correctly predicted positive classes, TP, to the total number of positive predictions made, TP + FP, namely

Precision = TP / (TP + FP)    (2.10)
• F1 Score is the harmonic mean of the Precision and the Recall (i.e., TPR), namely

F1 Score = (2 × Recall × Precision) / (Recall + Precision)    (2.11)
• The Receiver Operating Characteristic (ROC) curve is a plot of the True Positive Rate (TPR) on the y-axis against the False Positive Rate (FPR) on the x-axis.
As shown above, there are several metrics that can be used for evaluating the performance of a given machine learning classifier. The F1 Score, which combines precision and recall, is the single evaluation metric focused on in order to measure and compare the performance and effectiveness of the neural network classifiers used in this work. In addition, the True Positive Rate (Recall, Detection Rate) is another key metric focused on when comparing the deep learning approaches in this thesis to other works in the field.
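The metrics above can be computed directly from the four confusion-matrix counts (the counts below are purely illustrative):

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the evaluation metrics used in this work from raw counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)              # True Positive Rate / Detection Rate
    f1 = 2 * recall * precision / (recall + precision)
    return accuracy, precision, recall, f1

# Illustrative confusion-matrix counts for a binary IDS classifier
acc, prec, rec, f1 = classification_metrics(tp=90, tn=880, fp=20, fn=10)
```

With these counts, accuracy is 0.97, recall (Detection Rate) is 0.90, precision is about 0.818, and the F1 Score is their harmonic mean, about 0.857.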
2.2 Datasets
A primary and ongoing challenge in the field of network intrusion detection is the lack of publicly
available, labeled datasets that can be used for effective testing, evaluation, and comparison of
techniques [84, 96]. Oftentimes, the most useful datasets for network intrusion detection are those containing captures of real network environments. These datasets are not easily shared with the
public, as they contain details of an organization’s network topology, and more importantly sensi-
tive information about the traffic activity of the users on the respective network. Furthermore, the
effort required to create a labeled dataset from the raw network traces is an immense undertaking.
As a consequence, researchers often resort to suboptimal datasets, or datasets that cannot be
shared amongst the research community. Granted, publicly available labeled datasets exist, such as
CAIDA [51], DARPA/Lincoln Labs packet traces [66, 67], KDD ’99 Dataset [1], and Lawrence
Berkeley National Laboratory (LBNL) and ICSI Enterprise Tracing Project [14]; however, these
datasets are mostly anonymized and do not contain valuable payload information, making them
less useful for research purposes [96]. While these datasets have proven useful, there are some
arguments as to the validity of using them in present day research — they may be better suited for
the purposes of providing additional validation and cross-checking of a novel technique [97].
This work focuses on using newer benchmark datasets that have recently become available to
the research community, specifically the UNB ISCX IDS 2012 and UNB CIC IDS 2017 datasets,
which will be described in more detail in the following sections.
One of the datasets analyzed in this thesis was provided by Lashkari, on behalf of the authors
of [96], who with the University of New Brunswick’s Information Security Center of Excellence
(ISCX) developed a systematic approach to generating benchmark datasets for network intrusion
detection. Their approach creates datasets by first statistically modeling a given network environ-
ment, and then creating agents that replay that activity on a testbed network. Using this systematic
approach, Lashkari et al. created the UNB ISCX IDS 2012 dataset, which consists of network
traffic generated on a testbed environment in their laboratory. The testbed network consists of
21 interconnected Windows workstations. Windows was chosen for the workstations because of
the availability of exploits for running attacks. These workstations were divided into four distinct
LANs in order to represent a real network configuration. A fifth LAN was set up containing both
Linux (Ubuntu 10.04) and Windows (Server 2003) servers for providing web, email, DNS, and
Network Address Translation (NAT) services. When compared with the widely used, but more
outdated datasets [1, 66, 67], this dataset has the following characteristics [96]: (i) realistic network configuration because of the use of a real testbed; (ii) realistic traffic because of the use of real
attacks/exploits (to the extent of the specified profiles); (iii) labeled dataset with ground truth of
benign and malicious traffic; (iv) total interaction capture of communications; (v) diverse/multiple
attack scenarios are involved. The reader is directed to [96] for full details of the testbed configu-
ration.
For generating the dataset, two kinds of profiles, dubbed α-profiles and β-profiles, were used [96]. The α-profiles reflect attacks by specifying attack scenarios in a clear format, easily interpretable and reproducible by a human agent. The β-profiles reflect benign traffic by specifying
statistical distributions or behaviors, represented as procedures with pre- and post-conditions (e.g., the number of packets per flow, specific patterns in a payload, protocol packet size distribution, and other encapsulated entity distributions). The β-profiles are represented by a procedure in a programming language and executed by an agent, either systematic or human. This profile generation methodology was created in an attempt to resolve issues seen in other network security datasets. The main objective of Shiravi et al. in [96] was to establish a systematic approach for generating a dataset containing background traffic (β-profiles) reflective of benign traffic while being complementary to malicious traffic generated from executing legitimate attack scenarios (α-profiles).
The UNB ISCX IDS 2012 dataset therefore contains properties that make it useful as a benchmark
dataset, and resolves issues seen in other intrusion detection datasets.
Table 2.1 gives an overview of the dataset. This dataset contains over 2.5 million flows. The
dataset contains labels for both benign and malicious traffic flows, which are described in an XML
file. After processing the XML flow records, the number of flows and attacks per day can be seen
in Figures 2.10 and 2.11 respectively.
The dataset is made up of captured traffic for a seven-day period starting at 00:01:06 on Friday,
June 11th, 2012 and running continuously until 00:01:06 on Friday June 18th, 2012, while noting
that attack activities occurred on days two through seven only (i.e., no attacks on day one). The
dataset contains the following types of attacks [96]:
1. Brute-force against SSH: This attack attempts to gain SSH access by running a brute-force
Figure 2.10: ISCX IDS 2012 Dataset: Number of flows per day
Figure 2.11: ISCX IDS 2012 Dataset: Number of attacks per day
dictionary attack by guessing username/password combinations. The brutessh tool [5] was
used for this attack, and ran for a period of 30 minutes until successfully obtaining superuser
credentials. Those credentials were used to download /etc/passwd and /etc/shadow
files from a server.
2. Infiltrating the network from the inside: This attack attempts to gain access to a host on
the inside (due to a buffer overflow exploit on a PDF file) and establishes a reverse shell. The
attacker pivots from the host machine, scanning other internal servers for vulnerabilities and
installing a backdoor.
3. HTTP denial of service (DoS) attacks: This is a stealthy, low-bandwidth denial of service attack that does not need to flood the network. The Slowloris tool [4] was used to perform this attack; it holds a TCP connection open with a webserver by sending valid but incomplete HTTP requests, keeping open sockets from closing. Leveraging a backdoor established by the aforementioned attack of infiltrating the network from the inside, Slowloris was deployed on multiple hosts to perform the attack.
4. Distributed denial of service (DDoS) using IRC bots: Leveraging the backdoor estab-
lished by the aforementioned attack of infiltrating the network from the inside, an Internet
Relay Chat (IRC) server is installed, and IRC bots are deployed to infected machines on the
network. Within a period of 30 minutes, bots installed on seven users’ machines connect to
the IRC server awaiting commands. These bots are then instructed to download a program
for making HTTP GET requests, and are then commanded to flood an Apache Web server
with requests for a period of 60 minutes causing a distributed denial of service.
Table 2.2 lists the features for the ISCX IDS 2012 Dataset. Each feature is discrete or continuous, depending on the type of information it contains. A discrete feature (also called categorical) is one which has a finite or countably infinite number of states. Discrete features can be either integers or named states represented as strings that do not have a numerical value. A continuous feature is one that can be represented as a real number. In the context of network intrusion detection, it is important to configure each feature correctly based on domain knowledge. For example, while certain features such as source/destination port appear numerical and potentially continuous in nature, they should be set up as categorical variables, since the value ‘80’ corresponds to the HTTP protocol, and not the continuous value of 80. The full description and configuration of features for the ISCX IDS 2012 dataset is shown in Table 2.2.
Each flow record in the original dataset is represented by these 14 high-level features. Of these
14 high-level features, seven of them are categorical and seven are continuous. Therefore, the
actual number of feature columns expands to a maximum of the sum of all the unique categories
existent for each categorical feature. For example, the high-level feature of ‘SrcIP’ contains 2,478
unique source IP addresses, while the ‘DstIP’ feature consists of 34,552 distinct destination IP
addresses. The ‘SrcPort’ feature contains 64,482 unique values, and ‘DstPort’ feature contains
24,238 distinct values. As a result, if each categorical feature is expanded out based on the unique
possible values, there ends up being a total of 2,478 + 34,552 + 64,482 + 24,238 + 107 + 4 + 6 = 125,867 possible features. In the case study section of Chapter 3, a number of experiments
are conducted using the maximum number of features, as well as a subset of these expanded
features. The features can be reduced by first removing IP address and ports completely from
the dataset, but experiments are also conducted that used a dense vector representation of the
features for IP addresses and ports. This methodology and its results are expanded upon in Chapter
3. The ‘AppName’ feature contains 107 unique values, and consists of values such as ‘SSH’,
‘HTTPWeb’, ‘IMAP’, ‘DNS’, ‘FTP’, etc. corresponding to the type of application traffic traversing
between source and destination IP/port for a given flow record. The ‘Direction’ feature contains
four unique values, consisting of ‘L2L’, ‘L2R’, ‘R2L’, and ‘R2R’ which stand for local-to-local,
local-to-remote, remote-to-local, and remote-to-remote respectively. The ‘IP Protocol’ feature
consists of six unique values of ‘tcp_ip’, ‘udp_ip’, ‘icmp_ip’, ‘ip’, ‘igmp’, ‘ipv6icmp’, which
indicate the type of protocol used for the given flow record. The remaining features in the dataset
are continuous and are statistics of the flow record, including total source and destination bytes,
as well as total number of source and destination packets that occurred for a given flow. The
two features of ‘TotalBytes’ and ‘TotalPackets’ are engineered features not present in the original
dataset, which are a sum of the source and destination bytes and source and destination packets
respectively. In addition, this dataset contains labels of benign and malicious for each flow record
example. The class label benign is represented with the numerical value 0, while malicious is
represented with the numerical value 1.
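The expansion of categorical flow features into binary columns can be sketched with a toy vocabulary (three flows and three of the dataset's categorical features, rather than the full 125,867-column expansion described above):

```python
# Toy flow records with three of the dataset's categorical features
flows = [
    {"AppName": "SSH",     "Direction": "R2L", "Protocol": "tcp_ip"},
    {"AppName": "HTTPWeb", "Direction": "L2R", "Protocol": "tcp_ip"},
    {"AppName": "DNS",     "Direction": "L2L", "Protocol": "udp_ip"},
]

# Collect the unique values observed for each categorical feature
vocab = {}
for flow in flows:
    for feature, value in flow.items():
        vocab.setdefault(feature, set()).add(value)

# One expanded column per (feature, unique value) pair
columns = [f"{feat}={val}" for feat in sorted(vocab) for val in sorted(vocab[feat])]

def one_hot(flow):
    # A 1 in each column matching the flow's value for that feature, else 0
    active = {f"{feat}={val}" for feat, val in flow.items()}
    return [1 if col in active else 0 for col in columns]
```

Here the expanded width is the sum of unique values across the categorical features (3 + 3 + 2 = 8 columns), mirroring how the dataset's 14 high-level features expand to 125,867 columns.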
This second dataset analyzed in this work was also provided by Lashkari, on behalf of the authors
of [95]. It comes from a collaboration between the Canadian Institute for Cybersecurity (CIC) and
University of New Brunswick’s Information Security Center of Excellence (ISCX). The dataset
was created in 2017 and published for the research community to use in 2018. In their work, they
study and compare eleven available datasets that have been used for the research and development
of intrusion detection and intrusion prevention algorithms, including the ISCX IDS 2012 dataset.
They outline some of the same points that have been discussed in Section 2.2 in regards to the issue
of existing datasets being out of date and not fit for current research and future advancement of
the field of network intrusion detection. This dataset is improved from previous datasets in that it
contains more recent attack traffic from seven different attack methods, along with benign traffic.
In addition, the published dataset includes not only the raw PCAP data, but also pre-processed
netflow data from the PCAP data that was processed using publicly available CICFlowMeter soft-
ware [55]. This dataset was generated over a period of five days, Monday through Friday. The
distribution of benign and malicious flows can be seen in Table 2.3. It turns out that there are a
total of 2,830,743 flows generated over the five days, with 557,646 of those flows being attack flows. This results in 19.70% of the flows being malicious traffic, a much larger share than the ISCX IDS 2012 dataset, which had 2.71% of its flows labeled as malicious traffic.
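The malicious-traffic shares quoted above follow directly from the flow counts (the ISCX totals are the approximate figures given earlier in this chapter):

```python
# CIC IDS 2017: attack share of all flows
cic_total, cic_attacks = 2_830_743, 557_646
cic_share = cic_attacks / cic_total        # about 0.1970, i.e., 19.70%

# ISCX IDS 2012: approximate totals (over 2.47M benign + 68,910 attack flows)
iscx_attacks = 68_910
iscx_total = 2_470_000 + iscx_attacks
iscx_share = iscx_attacks / iscx_total     # about 0.0271, i.e., 2.71%
```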
This pre-processed netflow data is provided as CSV files that can be more easily fed into the
machine learning pipeline, as opposed to having to start with the raw PCAP files (or in the case of
ISCX IDS 2012, a custom XML file). Furthermore, the pre-processed netflow data has 83 columns
(plus one label column and one flow ID column) that can be used as potential features, which
is advantageous for evaluating various features within deep learning approaches for NIDSs. As
mentioned, there are seven different attack types that are labeled as such, which enables exper-
imentation with a multinomial classifier, as opposed to just a binary classifier in the case of the
labeled ISCX IDS 2012 dataset. The other main limitation of the ISCX IDS 2012 dataset is that it contains no HTTPS traffic, which is an important point since over 70% of traffic on the Internet now uses the HTTPS protocol [95]. In addition, as stated in [95, 96],
the distribution of the simulated attacks in the ISCX IDS 2012 dataset is not based on real world
statistics.
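Since the pre-processed flow records are distributed as CSV files, the statistics above can be recomputed directly with standard data tooling. The sketch below uses a tiny stand-in DataFrame rather than the actual files (whose names and 85 columns are documented in [95]), so the column names here are illustrative:

```python
import pandas as pd

# Tiny stand-in for one of the pre-processed CIC IDS 2017 flow CSVs;
# the real files hold millions of rows and 85 columns.
flows = pd.DataFrame({
    "DestinationPort": [80, 443, 22, 80],
    "FlowDuration": [1200, 3400, 560, 90],
    "Label": ["BENIGN", "BENIGN", "SSH-Patator", "BENIGN"],
})

# Any label other than BENIGN counts as attack traffic
attack_fraction = (flows["Label"] != "BENIGN").mean()
print(f"{attack_fraction:.2%} attack traffic")  # 25.00% attack traffic
```

On the full dataset, the same computation over the ‘Label’ column yields the 19.70% figure quoted above.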
In order to generate this benchmark dataset, Sharafaldin et al. implemented a comprehensive
testbed that consisted of two networks which they named the Attack-Network and Victim-Network.
The Victim-Network was built to represent a modern-day highly secure network environment,
complete with routers, firewalls, switches, and different versions of modern operating systems,
including Linux, Windows, and MacOS. In addition, the Victim-Network had an agent that per-
formed the benign behaviors on each workstation on the network. The Attack-Network was built
on a completely separate network infrastructure, complete with router and switch and various PCs
on multiple public IPs. These PCs were loaded with varying operating systems and software nec-
essary for launching malicious attacks. The reader is directed to Figure 1 in [95] for a detailed
diagram of the testbed architecture. The way in which the actual PCAP data was captured for
this dataset was via a span/mirror port setup on the Victim-Network to record all send and receive
network traffic.
An important component of this dataset is the degree to which it represents real network traffic
that would be naturally generated by a live network, commonly referred to as background traffic.
In order to accomplish this, Sharafaldin et al. created a benign profile agent, based on their
previously proposed β-profile system in [94]. Similar to the background traffic generation in ISCX
IDS 2012, this system profiles normal human interactions on the network in order to be used for
benign background traffic generation at a later time. Using the β-profile system for this dataset,
the behavior of 25 users was profiled based on their use of HTTP, HTTPS, FTP, SSH, and email (SMTP). As
detailed in Section 2.2.1, the β-profiles are created using machine learning and statistical analysis
techniques to obtain distributions of packets, packet sizes, protocol use, etc. These generated β-
profiles are then used by an agent written in Java in order to create realistic background traffic on
the Victim-Network based on the 25 real users' prior behavior.
As detailed in Table 2.3, the CIC IDS 2017 dataset contains over 2.8 million flows. This dataset
also aims to cover an up-to-date and diverse set of attacks that are seen in modern day networks.
Therefore, this dataset contains the following seven types of attack profiles and scenarios:
1. Brute Force Attack: A common attack where an attacker repeatedly attempts to guess a
password by ‘brute-forcing’ a large number of attempts, one after another until they succeed
with a correct username/password combination. Not only is this technique used for creden-
tials, but it can also be used to ‘brute-force’ a web application or server, trying to find hidden
pages or directories.
Table 2.3: CIC IDS 2017 Dataset Overview
Day                    Total Flows    Attack Flows    Attack Type(s)
Thursday Morning       -              21              Web Attack - SQL Injection
Thursday Afternoon     288,602        36              Infiltration
Friday Morning         191,033        1,966           Bot
Friday Afternoon 1     286,467        158,930         PortScan
Friday Afternoon 2     225,745        128,027         DDoS using IRC bots
Friday (all periods)   703,245        288,923
Total                  2,830,743      557,646         19.70% attack traffic
Figure 2.12: CIC IDS 2017 Dataset: Number of flows per day (Monday: 529,918; Tuesday: 445,909; Wednesday: 692,703; Thursday: 458,968; Friday: 703,245)
Figure 2.13: CIC IDS 2017 Dataset: Number of attacks per day (Monday: 0; Tuesday: 13,835; Wednesday: 252,672; Thursday: 2,216; Friday: 288,923)
2. Heartbleed Attack: An attack that exploits the Heartbleed vulnerability in the OpenSSL
library, allowing an attacker to read portions of a server's memory and thereby
steal sensitive information such as private key material, which could later be used to decrypt
confidential information.
3. Botnet: A botnet consists of a large number of ‘zombie’ hosts that have been infected with
a piece of malware, whereby a Command and Control (C&C) server can send instructions
to the bots to perform a specific command or series of commands.
4. DoS Attack: A Denial of Service attack is one in which the attacker intends to hinder
the availability piece of the CIA (confidentiality, integrity, availability) triangle, causing the
service to go down often by flooding the system with an abundance of requests beyond which
the system can respond to.
5. DDoS Attack: A Distributed Denial of Service attack is similar to a DoS attack, with the
only difference being that the attack is now carried out by multiple, distributed hosts (often
facilitated via a botnet). These are more difficult to contain, as the source of the attack is not
concentrated.
6. Web Attack: Web attacks can take a variety of forms, and they are always evolving. In this
dataset, some of the most popular forms of web attacks are performed, including SQL In-
jection, Cross-Site Scripting (XSS), and Brute-force password guessing. SQL Injection is a
type of injection attack where the attacker injects (or appends) additional string values into a
form field that, if not properly checked for by the web application, would trigger the database
to perform commands it was not intended to perform. This is commonly a scenario in which
many web applications leak sensitive data inadvertently. Cross-Site Scripting (XSS) occurs
when a web application contains form fields that don't properly sanitize the input, allowing
an attacker to inject malicious scripts that are then executed in other users' browsers. Brute-force
password guessing is similar to Brute-force SSH attacks, however they are run over the HTTP/S
protocol against a web application/server.
7. Infiltration Attack: This is a dangerous attack where an external bad actor is able to gain
unauthorized access to the internal network. This is often accomplished via a social engi-
neering attack where the attacker will send a phishing email to a victim, and convince the
victim to click a link that leads to a malicious website that launches an exploit, or has the
victim open a malicious attachment that contains a zero-day attack that allows the bad actor
to compromise the victim’s computer via establishing a back door. Through this backdoor,
the attacker can run commands remotely, which can be anything from performing recon-
naissance on the topology of the network, looking for vulnerable services to perform lateral
movement, or anything else they desire.
Each flow in this dataset contains 85 columns, including one column for the label, and another
column for the FlowID. Therefore, there are a total of 83 available features. This dataset contains
more features than the previous dataset because the authors provided not only raw PCAP, but
also a CSV of the PCAP that had already been converted to flow records using the CICFlowMeter
tool [33,55]. For the ISCX IDS 2012 dataset, the flow records were provided as an XML document
for which 14 main features were extracted, as well as a label indicating whether the flow was benign
or malicious. While they do also provide the raw PCAP data for ISCX IDS 2012, the labeled flow
record version did not provide the breadth of features that can be made available when converting
from PCAP to flows. For the CIC IDS 2017 dataset, when the authors converted the PCAP to flow
records using the CICFlowMeter tool, they output the full 85 columns of the flow record made
available by the tool. In addition, this output includes a column with a label indicating whether
the flow is benign, or one of 14 different attack types, which fall into one of the seven attack
categories described previously and shown in Table 2.3. For this reason, this dataset
is more useful to the research community as a benchmark dataset. The previous ISCX IDS 2012
dataset only provided labeled data in the form of an XML file, generated from the IBM QRadar
SIEM appliance. Therefore, that data is not as rich in features, since many of the network protocol
features that could be generated from raw PCAP were not provided in the labeled data set. Granted,
ISCX IDS 2012 also does provide raw PCAP data which could be converted to flow records using
the CICFlowMeter tool to acquire the same 83 features of the flow record. However, those flow
records would then need to be matched up with the XML flow records to acquire labels of ground
truth for the generated flow records. Results could vary from one researcher to the next once PCAP
has been converted to flow records and combined with labels from the given XML file. Therefore,
for the ISCX IDS 2012 dataset, experiments in this work use the given XML flow records with
labels, thus having a total of 14 primary features available for use in learning.
All of the 14 primary features used in the ISCX IDS 2012 dataset are also present and used in
the CIC IDS 2017 dataset. Five of the features are categorical, and the remaining 78 features are
continuous. The five categorical features are ‘SourceIP’, ‘DestinationIP’, ‘SourcePort’, ‘Destina-
tionPort’, and ‘Protocol’. These overlap with the ISCX IDS 2012 dataset as expected, as they are
common and necessary elements in network flow records.
There are 17,002 unique values for ‘SourceIP’, and 19,112 unique values for ‘DestinationIP’.
For ‘SourcePort’, there are a total of 64,638 unique values, and for ‘DestinationPort’ there are a
total of 53,791 unique values. For the last categorical feature ‘Protocol’ there are three unique
values of ‘6.0’, ‘0.0’, and ‘17.0’, which map to internet protocol types as defined by the Internet
Assigned Numbers Authority (IANA) [7]. A value of ‘6.0’ indicates TCP traffic, a value of ‘17.0’
indicates UDP traffic, and while ‘0.0’ translates to the IPv6 ‘Hop-by-Hop Option’, it is assumed that a
value of ‘0.0’ is undefined since this dataset deals with IPv4 traffic. In addition, the number of
flows of UDP type is 99,476, and the number of flows of TCP traffic is 1,826,704, leaving
only 1,696 flows with protocol label ‘0.0’. Therefore, with these five categorical features, the CIC
IDS 2017 dataset can be expanded to a maximum of 17,002 + 19,112 + 64,638 + 53,791 + 3 =
154,546 features. The full list of features in the CIC IDS 2017 dataset is detailed in Table A.1
within Appendix A.
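These cardinalities can be computed directly from the flow records. The sketch below uses toy data with the same column names (the real columns yield 17,002, 19,112, 64,638, 53,791, and 3 unique values, respectively):

```python
import pandas as pd

# Toy flow records; column names follow the CIC IDS 2017 CSVs
flows = pd.DataFrame({
    "SourceIP": ["10.0.0.1", "10.0.0.2", "10.0.0.1"],
    "DestinationIP": ["8.8.8.8", "8.8.4.4", "8.8.8.8"],
    "SourcePort": [51000, 51001, 51000],
    "DestinationPort": [80, 443, 80],
    "Protocol": [6.0, 17.0, 6.0],
})

categorical = ["SourceIP", "DestinationIP", "SourcePort",
               "DestinationPort", "Protocol"]
cardinalities = {c: flows[c].nunique() for c in categorical}

# One-hot expanding all five categorical features would add this many columns
print(sum(cardinalities.values()))  # 10
```

On the full dataset, the same sum gives the 154,546 expanded-feature figure computed above.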
In this chapter, the background and preliminaries regarding machine learning, deep learning,
evaluation metrics, as well as details on the NID datasets used in this Thesis, were provided.
In the next chapter, the details of the experiments, implementation, and their results will be re-
viewed.
Chapter 3: USING DEEP NEURAL NETWORKS FOR NETWORK
INTRUSION DETECTION
3.1 Introduction
In this chapter a series of case studies will be performed using a deep fully connected feedforward
neural network to perform classification on two different network intrusion detection benchmark
datasets.
In this chapter, the following contributions are provided:
1. Application of a fully connected deep neural network to network intrusion detection. Through
experimentation, the set of hyperparameters and network configuration that produces
optimal results in terms of evaluation metrics is determined. In addition, a comparison is performed
between including and excluding IP addresses as features.
2. Validation of the approach by evaluating results using two benchmark intrusion detection
datasets. The two benchmark datasets used in this research are recent and contain modern-
day attacks. Using these datasets with the deep neural network approach in this chapter
shows that this technique can be applied to modern-day enterprise networks.
3. Evaluation and comparison of this deep learning approach for network intrusion against
other previous techniques. There are a number of studies that have now been performed
using the two intrusion detection benchmark datasets used in this Thesis. Therefore, the
results achieved in this work are compared with other approaches in the field.
For this case study, experiments are performed with the previously mentioned two datasets – ISCX
IDS 2012 benchmark dataset [96] and CIC IDS 2017 benchmark dataset [95]. Each of these
datasets contains ground truth labels, which are necessary for carrying out supervised learning. While
the version of the ISCX IDS 2012 dataset used in this work has binary class labels for benign and
malicious, the CIC IDS 2017 dataset contains multinomial class labels for each of the attack types
carried out in the flow records. Each dataset is used within a common machine learning workflow
to apply the deep neural network approach to network intrusion detection. The results of each
of the experiments will be analyzed and evaluated, then the effectiveness of this approach against
previous approaches in the field will be compared.
To implement and evaluate the deep learning algorithms in this work, experiments were performed
on the Texas Advanced Computing Center (TACC) resources. To perform the deep learning tasks,
experiments leverage the popular open source TensorFlow machine learning framework [11], along
with the Keras high level framework [28] which uses TensorFlow on the backend. The experiments
were run on a Dell R630 server with 24 cores across dual-socket Intel Xeon E5-2670 v3 "Haswell"
processors (each with 12 cores @ 2.3GHz) and 128 GB of RAM.
Figure 3.1 highlights a supervised machine learning pipeline. This workflow is utilized (minus the
model deployment step) in the experiments with each of the benchmark datasets.
3.2.3 Network Traffic Data: PCAP vs NetFlow
Network traffic data is often collected at network switches or routers in the form of raw packet cap-
ture (PCAP). PCAP data contains the full TCP/IP packet data transmitted or received on a given
network device. While full packet capture information can be helpful in certain scenarios, it does
bring with it a high cost in terms of space. An alternative (or complement) to PCAP data is Net-
Flow, which serves to summarize the PCAP data in terms of higher level network flows. NetFlow
data offers the advantages of more breadth and lower disk space requirements, whereas PCAP data
has more depth, but is more expensive in terms of disk space. NetFlow records typically consist
of source/destination IP, source/destination port, source interface, protocol, and number of bytes
transmitted. It can have many more fields extracted from the PCAP based on the configuration of
software used for converting PCAP to NetFlow records.
The machine learning algorithms used in this work utilize NetFlow records as input. Therefore,
if there is any PCAP input data, it must first be converted to NetFlow. In practice, running an
algorithm that takes NetFlow as input is ideal — the NetFlow protocol was created by Cisco,
and is commonly used by many network switches, routers, and firewalls to export raw traffic data
into flow data for further analysis. An emerging IETF standard called IPFIX, short for IP Flow
Information Export, is similar to the NetFlow standard, but allows for additional information that
is normally sent to syslog, as well as variable length fields for collecting information such as
URLs, messages, HTTP hosts, and more [2, 6]. For the purposes of this work, the summarized
traffic records will simply be referred to as ‘flow data’.
In order to convert raw PCAP data to flow data, there are a number of tools available. One
tool is called YAF, short for Yet Another Flowmeter. YAF was created by the Computer Emergency
Response Team (CERT) and is a part of their Network Situational Awareness (NetSA) Security
Suite of tools used for monitoring large-scale networks. YAF processes packet data from PCAP
into bidirectional flows, then exports the flows to an IPFIX-based file format.
Another tool that can be used for converting PCAP data to network flow data is the
CICFlowMeter, created by the University of New Brunswick [33, 55]. For the CIC IDS 2017 Dataset,
the authors provide the dataset in the form of both PCAP as well as computed flow records via
CICFlowMeter. The CICFlowMeter is a network traffic flow generator written in Java that takes
raw PCAP as input and generates bidirectional flows where the first packet determines the forward
(source to destination) and backward (destination to source) directions. It generates 77 statistical
features (the features available for CIC IDS 2017 specifically are detailed in Table A.1), such as
duration, number of packets, number of bytes, length of packets, which are also calculated sep-
arately in the forward and backwards directions. Along with these statistical features for each
flow, CICFlowMeter provides the corresponding source/destination IP address and port, protocol
number, timestamp, label, and a unique FlowID for each flow record.
The following subsections detail a deep learning approach applied in experiments using the ISCX
IDS 2012 benchmark intrusion detection dataset.
The ISCX IDS 2012 Dataset contains raw PCAP files of the network traffic, as well as flow records
in the form of an XML file that was processed through an IBM QRadar network security appliance.
Each of the flows in the XML file was labeled as either benign or malicious. For this experiment,
instead of utilizing a tool such as YAF or CICFlowMeter to convert the PCAP data into flow
records, the provided XML flow records are used as a starting point. The reason for this is that
the labels for each of the network flows are only provided through the XML files. Therefore, a
reliable way to obtain ground truth labels from this dataset is to use these XML flows, as opposed
to generating the network flow records from the raw PCAP files. In order to process the XML
records, a custom parser was developed that iterates through all the XML records and outputs
structured CSV files that can more easily be used as input into a machine learning pipeline. The
labeled XML flows were only provided for days two through seven, for which there was attack
traffic. Therefore, for this experiment, day one is being omitted and only days two through seven
are being used.
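A minimal sketch of such a parser is shown below. The element names and sample values are hypothetical stand-ins, since the actual ISCX IDS 2012 XML schema is not reproduced here; the structure (iterate over flow elements, emit one CSV row each) is what matters:

```python
import csv
import io
import xml.etree.ElementTree as ET

# Illustrative stand-in for the ISCX IDS 2012 labeled XML flow records;
# the real schema uses different element names and many more fields.
xml_data = """<dataroot>
  <Flow><source>192.168.2.107</source><destination>198.164.30.2</destination>
        <totalSourceBytes>1024</totalSourceBytes><Tag>Normal</Tag></Flow>
  <Flow><source>192.168.2.112</source><destination>192.168.5.122</destination>
        <totalSourceBytes>64</totalSourceBytes><Tag>Attack</Tag></Flow>
</dataroot>"""

fields = ["source", "destination", "totalSourceBytes", "Tag"]
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(fields)
# Iterate over every flow element and write one structured CSV row per flow
for flow in ET.fromstring(xml_data).iter("Flow"):
    writer.writerow([flow.findtext(f) for f in fields])

print(out.getvalue())
```

In practice the same loop would stream over the multi-gigabyte XML files and write the CSV output to disk for the machine learning pipeline.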
3.2.4.2 Data Cleaning and Preprocessing
Once the data is in the form of flow data, a process is performed to further clean, transform, and
prepare the data for use in the deep learning algorithm. This step includes common tasks such as
making sure there are no erroneous characters in the data, removing fields that contain all zero or
null values, and removing or modifying NaN (not a number) values. Often, a large amount of time
can be spent in this stage to clean and prepare the data. This is due not only to the various formats
and locations of target data, but also to the need to ensure the accuracy and effectiveness of the model that
is being trained on this data.
During this stage, it is common practice to also normalize or scale the continuous values
amongst all the features, so that the machine learning algorithm trains on data that is all within
the same feature space. There are two main types of normalization that are commonly used within
the machine learning pipeline: Standardization (or Z-score normalization) and Min-Max scaling.
Standardization results in the features being rescaled so that they will have the properties of
a standard normal distribution with µ = 0 and σ = 1, where µ is the mean (average) and σ is
the standard deviation from mean. The standard scores (also called z scores) of the features are
calculated as follows:
z = (x − µ) / σ    (3.1)

Min-Max scaling instead rescales each feature into a fixed range, typically [0, 1], and is calculated as follows:

X_norm = (X − X_min) / (X_max − X_min)    (3.2)
In this work, Min-Max scaling is used to normalize the features during this step.
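As a small numerical sketch of both schemes on toy values (plain NumPy is used here, though a library scaler would work equally well):

```python
import numpy as np

# Toy continuous features, e.g. duration and total bytes for four flows
X = np.array([[1.0, 200.0],
              [4.0, 800.0],
              [2.0, 400.0],
              [5.0, 1000.0]])

# Z-score standardization (Eq. 3.1): each feature gets mean 0, std 1
z = (X - X.mean(axis=0)) / X.std(axis=0)

# Min-Max scaling (Eq. 3.2), the scheme used in this work:
# each feature is rescaled into [0, 1]
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
# First column becomes [0, 0.75, 0.25, 1]
print(X_norm)
```

Note that the min and max (or mean and std) must be computed on the training split only and then applied to the validation and test splits, to avoid leaking test-set statistics into training.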
3.2.4.3 Feature Selection
The number of features available in this first dataset of converted flow records is relatively small,
at a total of 14. As described in Section 2.2.1, the main features provided in this dataset after pro-
cessing the labeled XML flows are as follows: Source IP, Destination IP, Source Port, Destination
Port, App Name, Direction, Protocol, Duration, Total Source Bytes, Total Destination Bytes, Total
Bytes, Total Source Packets, Total Destination Packets, Total Packets. Of these features, Source IP,
Destination IP, Source Port, Destination Port, App Name, Direction, and Protocol are categorical
features. As described earlier in Section 2.2.1, the number of unique possible values across all of
these categorical features is 2,478 + 34,552 + 64,482 + 24,238 + 107 + 4 + 6 = 125,867. There-
fore, to experiment with using these high dimensional categorical features, they must be converted
to floating point values for use in the neural network. There are a number of different approaches
for accomplishing this, ranging from simple to more complex. The following techniques for work-
ing with these high-cardinality categorical variables are described: one-hot encoding, one-hot hash
trick, and entity embedding.
One-hot encoding is a common technique for converting categorical feature variables into con-
tinuous features for use with a machine learning algorithm. This process works by creating a new
variable for every unique value possible for the original categorical feature. An example of how
this works is if there is a categorical feature named ‘animal’ with three different possible values,
such as ‘cat’, ‘dog’, and ‘bird’. Since the machine learning algorithm only works with continuous
floating point values, one-hot encoding can encode this ‘animal’ feature as shown in Tables 3.1 and
3.2.
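As a brief sketch, the ‘animal’ encoding from Table 3.1 can be reproduced in one call with pandas (column order is alphabetical here):

```python
import pandas as pd

# The 'animal' example from Table 3.1, one-hot encoded: each unique value
# becomes its own 0/1 column
df = pd.DataFrame({"animal": ["cat", "cat", "bird", "dog", "bird", "cat"]})
onehot = pd.get_dummies(df["animal"]).astype(int)
print(onehot.columns.tolist())  # ['bird', 'cat', 'dog']
```

Each row now contains exactly one 1, marking the category that row belonged to.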
Using one-hot encoding in the context of neural networks has two main problems [32]:
1. One-hot encoding of high cardinality features can be inefficient and require a large amount
of computational power to complete.
2. One-hot encoding ignores the relationships between categorical features, and treats them
as completely independent of one another.
Table 3.1: One-hot encoding example - before encoding
No. [Other columns] animal [Other columns]
1 ··· cat ···
2 ··· cat ···
3 ··· bird ···
4 ··· dog ···
5 ··· bird ···
6 ··· cat ···
··· ··· ··· ···
n ··· dog ···
A variant of one-hot encoding is a technique called the one-hot hashing trick. This is useful
when the number of possible values for the categorical variable is large. Instead of assigning
an index to each value and keeping track of it in a vector as is done with one-hot encoding, the
hashing trick uses a simple hash function to hash each possible value into a vector of a fixed size.
The main advantage of this method is that there is no longer the need to maintain an explicit index
of possible values (a “word index” when used in the context of natural language processing and
text). This provides a savings in terms of memory, and also allows for on-the-fly creation of hash
values (“tokens”) before needing to see the entirety of possible values for the categorical variable.
The main disadvantage of this technique is that if the size of the hashing table is not large enough to
represent the possible number of values for the categorical variable, the problem of hash collisions
occurs. In a hash collision, two different values map to the same index. This is a problem
for machine learning, as the algorithm would see these two distinct values as being the same. One
way to prevent hash collisions is to make sure the dimensionality of the hashing space is much
larger than the maximum number of values for a given categorical variable [28].
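A minimal sketch of the hashing trick follows; the dimension of 1,000 is an arbitrary illustrative choice, and a stable hash is used so indices are reproducible across runs:

```python
import zlib

import numpy as np


def one_hot_hash(value: str, dim: int = 1000) -> np.ndarray:
    """Map a categorical value into a fixed-size one-hot vector via hashing.

    zlib.crc32 is used instead of Python's built-in hash(), which is
    salted per process and therefore not reproducible.
    """
    vec = np.zeros(dim)
    vec[zlib.crc32(value.encode()) % dim] = 1.0
    return vec


# No explicit index of all possible IPs needs to be maintained, but two
# distinct values can collide if dim is too small for the cardinality.
v1 = one_hot_hash("192.168.2.107")
v2 = one_hot_hash("192.168.2.112")
print(v1.argmax(), v2.argmax())
```

Increasing `dim` well beyond the expected number of unique values lowers the collision probability, at the cost of a larger (though still sparse) input vector.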
Entity embedding is a technique that can overcome the shortcomings of one-hot encoding and
one-hot hashing trick. It is useful when the number of unique values for a categorical feature is
of high cardinality [48]. Using embeddings for categorical variables is a key technique for lever-
aging the power of deep learning when working with tabular data [100]. This idea draws from
word embedding techniques, such as Word2Vec, where similar words are grouped together. Using
this technique, relationships between different categories can be captured. Embeddings map cate-
gorical values to low-dimensional real vectors in such a way that similar values are close to each
other [46]. The lower-dimensional space preserves semantic relationships. The structure in which
embeddings are mapped is geared toward what is trying to be accomplished with the underlying
categorical variables. The key categorical variables in network intrusion detection consist of the
source/destination IP addresses and Ports. With embeddings, vectors for similar categorical values
are closer together, and unrelated values have vectors that are further away. An example for word
embeddings would be ‘puppy’, ‘kitten’, ‘fawn’, and ‘mountain’. In this example, the vectors that
represent the first three baby animals would be closer together, and the vector for the embedding
that represents ‘mountain’ would be further away.
Each category is represented by a dense vector of floating point numbers. The values in the
vector are learned. Embeddings are able to capture richer, more meaningful and complex relation-
ships, as opposed to raw categories. Embeddings are more useful than one-hot encoding as the
embedding technique provides an extra level of relationship information and can group values to-
gether into a smaller number of categories [100]. In addition, embeddings allow for the reduction
of the dimensionality of the feature space for the categorical variables, which helps in reducing
overfitting [99].
According to Cheng Guo and Felix Berkhahn [48], embeddings can help models to generalize
better when data is sparse and statistics are unknown. They mention that this technique is especially
useful for datasets which contain high cardinality features, where other methods would tend to
overfit. For the datasets used in this work, this is especially applicable, as there are thousands of
possible values for each of the source IP, destination IP, source port, and destination port features.
Therefore in this work, the embedding technique is used for these high cardinality features. The
initial embedding dimensions used are according to the following rule of thumb [46]:
dimensions = (possible values)^(1/4)    (3.3)
Another rule of thumb for number of reduced dimensions used by [100] is as follows:
dimensions = min(50, possible values / 2)    (3.4)
Therefore, for the ISCX IDS 2012 dataset, the high cardinality features are reduced in dimen-
sion using entity embeddings, following the first rule of thumb, as shown in Table 3.3.
Table 3.3: Embedding Categorical Variables - ISCX IDS 2012 Dataset
Feature            Possible Values    Embedded Dimensions
Source IP          2,478              7
Destination IP     34,552             13
Source Port        64,482             16
Destination Port   24,238             12

In effect, this borrows from the technique used in word embedding, and performs IP embed-
ding and Port embedding to utilize these features within a neural network. Specifically, the high
cardinality categorical variables are each mapped to an integer index. Then, each of the categor-
ical values is represented by a dense vector of size (possible values)^(1/4). When these categorical
variables are embedded, they are then represented by a dense vector of floating point numbers
according to the dimensions that were calculated. The parameters (weights) for the dense vector
representation are initialized using a random uniform distribution in the range [−0.05, 0.05]. An
example of the embedded representation of sample source IP addresses into an 8-dimensional vec-
tor is shown in Table 3.4. This dense representation is not only more computationally efficient, but
also holds inherent relationship information between the various IP addresses, the other features
in the dataset, and the label. As these vectors are inputs into the first layer of the neural network,
their weights are updated during the back-propagation step at each epoch.
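A minimal Keras sketch of one such embedding input is shown below, using the Source IP figures from Table 3.3; the layer and input names are illustrative, and the full model (Figure 3.2) would concatenate several such embedded vectors with the continuous features:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Source IP has 2,478 unique values, embedded into 7 dimensions (Table 3.3).
# The 'uniform' initializer draws weights from [-0.05, 0.05]; they are then
# updated by back-propagation as the network trains.
src_ip_input = keras.Input(shape=(1,), name="source_ip_index")
embedded = layers.Embedding(input_dim=2478, output_dim=7,
                            embeddings_initializer="uniform")(src_ip_input)
src_ip_vector = layers.Flatten()(embedded)
model = keras.Model(src_ip_input, src_ip_vector)

# Two integer-indexed IP addresses map to two dense 7-dimensional vectors
vectors = model.predict(np.array([[0], [42]]), verbose=0)
print(vectors.shape)  # (2, 7)
```

The same pattern, with the dimensions from Table 3.3, applies to the destination IP and both port features.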
Following best practices, the dataset is separated into training, validation, and test
sets. First, the dataset is split with 33% of the data put into a test set, and 67% into a
training set. In addition, since this dataset is highly imbalanced, it is important to ensure there is
the same proportion of malicious flows in each training and test sample as there are in the dataset
as a whole. In order to accomplish this, the dataset is stratified to ensure the distribution of benign
and malicious traffic is equivalent in both training and test data sets. In addition to the standard
training and test split, a separate hold out validation set is used during the training iterations. This
is a standard methodology used when there is plenty of data available [28]. Therefore, the training
split is further separated out into a validation set according to the same parameters as the original
training and test split, including stratification. This results in 67% of the training dataset set aside
for training, and 33% of the training dataset used for validation during the training iterations.
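The two stratified splits described above can be sketched with scikit-learn on toy labels that mimic the class imbalance (the exact counts and random seed here are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data with class imbalance: 90 benign (0) and 10 malicious (1) samples
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

# 67/33 train/test split, stratified so both sides keep ~10% malicious
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=42)

# The training split is further divided 67/33 into train and validation sets
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.33, stratify=y_train, random_state=42)

print(len(X_tr), len(X_val), len(X_test))
```

The `stratify` argument is what preserves the benign/malicious proportions in every split.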
The reason for having a validation set is to help in tuning the configuration of the model. For
example, modifying the number of layers, or units per layer can have an effect on the performance
of the model. This tuning is performed as a feedback loop using the model’s performance on the
validation dataset. Once the model performs well on the validation set, it is run against the separate
test dataset which it has never before seen. This ensures the model is not overfitting the training
and validation set. The knobs that are tuned during this training and validation cycle are called
hyperparameters, which will be covered in the next section.
3.2.4.5 Hyperparameters
Hyperparameters refer to the parameters that are set before the learning process begins. These
hyperparameters are tuned based on a feedback loop of the model’s performance on the validation
set. Hyperparameters fall into the following main categories [86]:
• Weight initialization
• Loss functions
• Data normalization scheme
Throughout the training and validation cycle, these hyperparameters are tuned in order to
achieve the best performance for the machine learning model. The most effective neu-
ral network architecture and hyperparameters used in this case study are described next.
The configuration for the neural network as shown in Figure 3.2 gave the best results in the ex-
periments. The first hidden layer for the continuous input features, at 128 neurons, is much
larger than the number of inputs (7). Smaller numbers of neurons were also tested, and
found to achieve comparable results. Therefore, to keep architectures consistent, this same general
configuration is also used in the next case study with CIC IDS 2017, which has many more con-
tinuous input features. It should also be noted that recommended future work includes an
in-depth study performing further hyperparameter optimization to determine the neural network
architecture and combination of features that performs best for each of the given datasets.
Figure 3.2: Deep Neural Network Architecture for ISCX IDS 2012 Dataset
The neural network that was chosen consists of three hidden layers, with 64
units per layer. The activation function on each layer is the ReLU activation function, with the last
output layer having a sigmoid activation function. The optimizer used is the RMSProp optimizer,
with a default learning rate of 0.001. After initially using the Adam optimizer, it was found that
RMSProp performed better at optimizing the loss function, enabling better results with validation
and test sets. The loss function used is the binary crossentropy:
H_p(q) = −(1/N) Σ_{i=1}^{N} [ y_i · log(p(y_i)) + (1 − y_i) · log(1 − p(y_i)) ]    (3.5)
where y is the label (1 for malicious and 0 for benign), and p(y) is the predicted probability of
the given flow being malicious for all N flows. This formula for binary crossentropy shows that
for each malicious point (y = 1), it adds log(p(y)) to the loss, which is the log probability of the
example flow being malicious. Conversely, it also adds log(1 − p(y)), which is the log probability
of the flow being benign, for each benign flow (y = 0).
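Putting the pieces above together, a Keras sketch of the dense portion of this architecture follows. The embedding inputs from Figure 3.2 are omitted for brevity, and the arrangement of a 128-unit first layer for the 7 continuous inputs followed by three 64-unit layers is one plausible reading of the description above:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Dense portion of the described architecture: a 128-unit first hidden layer
# over the 7 continuous inputs, three 64-unit ReLU hidden layers, and a
# sigmoid output for the benign/malicious probability.
model = keras.Sequential([
    keras.Input(shape=(7,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])

# RMSprop (learning rate 0.001) with binary crossentropy loss (Eq. 3.5)
model.compile(optimizer=keras.optimizers.RMSprop(learning_rate=0.001),
              loss="binary_crossentropy",
              metrics=["accuracy"])
print(model.count_params())  # 17,665 trainable parameters
```

Training then proceeds with `model.fit` on the stratified training split, monitoring loss on the validation split each epoch.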
In this section, the results and evaluation of experiments with the ISCX IDS 2012 dataset are de-
scribed. The experiments show that by using deep fully connected feedforward neural networks
for classification of network traffic flows as either benign or malicious, and using the embedding
technique for the high cardinality categorical features, high performance can be achieved in terms
of evaluation metrics. A variety of hyperparameters were explored, along with different combinations of features, in order to arrive at the best results.
As a review, this dataset is highly imbalanced, with only 2.71% of the traffic flows being attack
traffic, as shown in Figure 3.3.
The confusion matrix in Figure 3.4 shows the results of using this neural network architecture
on the ISCX IDS 2012 dataset. In addition, the technique of using embeddings for the categorical
variables was used. The detailed metrics are also shown in Table 3.5.
The comparison of Precision vs Recall is shown in Figure 3.7. A Precision/Recall Curve is the
preferred metric to use when the classes are highly imbalanced, as is the case in this case study.
Figure 3.3: ISCX IDS 2012 Class Distribution
Figure 3.4: ISCX IDS 2012 Confusion Matrix using Embeddings with IP Address
Figure 3.6: ISCX IDS 2012 Confusion Matrix - without IP Address, AppName, Direction
The higher towards the right of the Precision/Recall Curve, the better. As seen in Figure 3.7, the
Precision/Recall curve for this case study is favorable.
In addition, the F1 Measure, one of the primary metrics evaluated in this work, is above 90% at a value of 0.9491.
The false positive rate for this experiment reaches a low of 0.0022, which is lower than other
leading machine learning techniques used on this same dataset. In the work of [16] published in
2017, Atli compared their technique along with other leading machine learning approaches against
the ISCX IDS 2012 dataset. The results of these other techniques in terms of detection rate and
false positive rate is shown in Table 3.6. As can be seen, the method used in this work outperforms
all the others in terms of the False Positive Rate metric, reaching a low of 0.0022. For the average detection rate, also referred to as True Positive Rate or Recall (the proportion of malicious flows that are correctly detected), our method achieves a detection rate of 0.9612.
In looking at the detection rate of other methods, [119], [15], and the last two methods of [16]
have higher detection rates. When looking at [119], the authors only selected incoming mali-
cious and benign packets for certain days for a particular host. Similarly, the authors in [54] only
randomly selected a subset of benign and malicious attack traffic to create their training and test
dataset. For [15], the authors used Snort to generate ground truth labels for flow data. There-
fore, [16] used a subset of the dataset to emulate the dataset version used by [15], and they got
the resulting highest results for detection rate. For this Thesis, the ground truth that was evaluated
was from the creators of the ISCX IDS 2012 dataset itself, which generated flow data from an
IBM QRadar appliance. The main takeaway from this is that while the ISCX IDS 2012 dataset is a
good first step towards a benchmark intrusion detection dataset, there still exists the problem that
a common ground-truth set of flow data was not provided in the form of CSV files with which various studies could perform a true apples-to-apples comparison. Each of the studies in this comparison
table used their own method of converting PCAP to flow records, and furthermore used subsets
of the dataset for performing evaluation and analysis. These issues were addressed when a newer benchmark dataset, CIC IDS 2017, was released. Another point to mention is that in achieving
their top results, [16] used an Extreme Learning Machine (ELM), which is a single hidden layer
feedforward neural network, configured with 200 hidden neurons; this lends further credence to the utility of neural networks for the field of network intrusion detection. Future work should therefore look at performing more in-depth hyperparameter optimization to determine the deep neural network configuration that achieves the best results. While the comparisons
for the ISCX IDS 2012 are not ideal due to the varying ways in which the ground truth data for
each comparative study was generated, the results in this Thesis could be improved further by per-
forming more in-depth hyperparameter optimization. As will be seen in the next case study, it is
found that the performance of the deep neural network increases further by having more malicious
examples available for training.
For this first case study, a minimal number of features were used, which represent what is
commonly available from the NetFlow format and in typical security appliances. In particular,
it did not include detailed flow statistics, which will be used in the next case study. Also, other
experiments were run without using IP Address as features, and also without using the categorical
features of ‘AppName’ and ‘Direction’. These experiments showed that IP address is an important
feature for achieving the best results. When all three of these aforementioned categorical variables
are removed, the performance of the classifier drops sharply.
Insight 1. Despite the smaller number of features in this dataset, the deep neural network still per-
forms well by using the embedding technique to include high-dimensional categorical variables.
Using all the features, the neural network achieves a lower false positive rate than other leading
methods at 0.0022, and has a detection rate of 96.12%. In particular, using the IP address as a
feature is important for achieving best results, especially when there are a smaller number of other
features available. When removing the IP address, the performance of the neural network drops
sharply.
The following subsections detail a deep learning approach applied in experiments using the CIC
IDS 2017 benchmark intrusion detection dataset.
The main difference with the CIC IDS 2017 dataset as compared to the ISCX IDS 2012 dataset
is the number of network flow features available. The ISCX IDS 2012 labeled dataset has 14
total features, while the labeled CIC IDS 2017 dataset has a total of 83 available features. These
additional features primarily consist of detailed flow statistics and calculations regarding the packet
details for each flow, which were generated using the open source CICFlowMeter tool. These
features are discussed in more detail in Chapter 2 and can also be reviewed in Appendix A. Another
main difference in the CIC IDS 2017 dataset is that it is made available in the form of CSV files
with labels, containing pre-processed packet data in the form of flow records. As described earlier,
for the ISCX IDS 2012 dataset, the labeled data was provided in the form of an XML document that
was produced via an IBM QRadar appliance. Therefore, there was some additional XML parsing
that had to be done in order to prepare the ISCX IDS 2012 dataset for use in the machine learning
algorithm.
After inspecting the dataset, it is found that there are a total of 289,960 rows that contain at least one NaN value. Of these, 288,602 rows contain nothing but NaN values. Therefore, all of these rows were dropped from the dataset, leaving a total of 1,358 rows that had at least one NaN value. The only column that contained remaining NaN values was 'Flow Bytes/s'. In addition, there is a value of 'Infinity' in a total of 2,867 rows, specifically in the
column 'Flow Packet/s'. Therefore, to clean the data, the 'Infinity' values are converted to NaN, and the 2,867 rows that contained NaN and/or 'Infinity' values in any column are dropped.
In addition, after reviewing statistics of the features such as the mean, standard deviation, min, and max, there are a number of columns that have a value of 0 for every single example.
Therefore, since these features provide no information to the learning algorithm, the following
columns are dropped from the dataset: Bwd PSH Flags, Bwd URG Flags, Fwd Avg Bytes/Bulk,
Fwd Avg Packets/Bulk, Fwd Avg Bulk Rate, Bwd Avg Bytes/Bulk, Bwd Avg Packets/Bulk, Bwd Avg
Bulk Rate.
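The cleaning steps above can be sketched with pandas (the function name is illustrative; only a few column names from the dataset appear in the example):

```python
import numpy as np
import pandas as pd

def clean_flows(df: pd.DataFrame) -> pd.DataFrame:
    df = df.dropna(how="all")                      # drop rows where every value is NaN
    df = df.replace(["Infinity", np.inf], np.nan)  # treat 'Infinity' as missing
    df = df.dropna(how="any")                      # drop rows with any remaining NaN
    # drop numeric columns that are 0 for every example (no information)
    zero_cols = [c for c in df.columns
                 if pd.api.types.is_numeric_dtype(df[c]) and (df[c] == 0).all()]
    return df.drop(columns=zero_cols)              # e.g. 'Bwd PSH Flags', ...

# tiny illustrative frame: one NaN row, one 'Infinity' row, one all-zero column
df = pd.DataFrame({"Flow Bytes/s": [1.0, np.nan, "Infinity", 2.0],
                   "Bwd PSH Flags": [0, 0, 0, 0],
                   "Flow Duration": [5, 6, 7, 8]})
out = clean_flows(df)  # two rows and two columns remain
```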
As discussed in the previous case study, it is imperative that the continuous feature values all be
normalized into the same feature space. This is important when using features that have different
measurements, and is a general requirement of many machine learning algorithms. Therefore, the
values for this dataset are also normalized using the Min-Max scaling technique, bringing them all
within a range of [0, 1].
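A minimal scikit-learn sketch of this scaling step (the feature values are hypothetical; in practice the scaler should be fit on the training split only and then applied to the validation and test splits):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[10.0, 200.0],
              [20.0, 400.0],
              [30.0, 800.0]])  # hypothetical flow statistics with different scales

scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled = scaler.fit_transform(X)  # column-wise (x - min) / (max - min)
```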
In this dataset, after cleaning the data, there are a total of 74 useable features, with five of them
being categorical. The categorical features are Source IP, Source Port, Destination IP, Destination
Port, and Protocol. As described in Chapter 2, the total number of unique values across these five categorical variables is 17,002 + 19,112 + 64,638 + 53,791 + 3 = 154,546. In
order to use the features, they must be converted to a floating point value. Following the same
technique used for converting categorical variables in the ISCX IDS 2012 dataset, these features
are converted to floating point numbers using the categorical embedding technique. As described,
there are two general techniques for calculating the number of embedded dimensions. For the
experiments, both techniques were utilized and compared to see if it had an effect on the perfor-
mance of the classifier. One technique, referred to here as Embedding Technique A, is dimensions = min(50, (possible values)/2). Another technique, referred to here as Embedding Technique B, reduces dimensions further using the formula dimensions = (possible values)^(1/4), i.e. the fourth root of the number of possible values. Each of
these techniques generates a dense representation of each categorical variable with vectors of re-
duced dimensionality, with the latter reducing even further as shown in Table 3.7. Comparing these
two embedding techniques, it was found that on this dataset they perform nearly identically, with the latter technique of using the fourth root of the possible values slightly outperforming the former technique. In one experiment, the former achieved an F1 measure of 0.9989440, and the
latter achieved an F1 measure of 0.9989576. This leads to the following insight:
Insight 2. The rule of thumb for embedding categorical variables using the fourth root of the number of possible categorical values (Embedding Technique B) performs just slightly better than Embedding Technique A for this given dataset. This allows for a smaller, denser set of features, which reduces the number of parameters in the neural network and thus reduces the
computational overhead for training, while also providing better performance.
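The two rules of thumb can be sketched as follows (function names are illustrative; the cardinalities are those listed above for the five CIC IDS 2017 categorical features):

```python
def dims_a(cardinality: int) -> int:
    # Embedding Technique A: min(50, cardinality / 2)
    return min(50, cardinality // 2)

def dims_b(cardinality: int) -> int:
    # Embedding Technique B: fourth root of the cardinality
    return round(cardinality ** 0.25)

cardinalities = {"Source IP": 17002, "Source Port": 19112,
                 "Destination IP": 64638, "Destination Port": 53791,
                 "Protocol": 3}
a = {k: dims_a(v) for k, v in cardinalities.items()}  # A caps at 50 dimensions
b = {k: dims_b(v) for k, v in cardinalities.items()}  # B yields far fewer, roughly 11-16
```

For high-cardinality features like Destination IP, Technique B shrinks the embedding from 50 dimensions to about 16, which is where the reduction in parameters comes from.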
All the useable numeric features (full list in Appendix A) as well as the five categorical features
of Source IP, Source Port, Destination IP, Destination Port, and Protocol serve as input to the
neural network.
This dataset contains labels for not only benign and malicious, but also indicates the type of
malicious attack that was carried out for each malicious flow. Therefore, this data can be looked
at as either a binary classification problem, or a multinomial classification problem. There are a
total of fifteen different labels: one benign, and fourteen various attack types. For the binary
classification problem, the labels are converted to 0 and 1, with 0 representing the benign class,
and 1 representing the malicious class. For performing multinomial classification, the labels can
be vectorized using the one-hot encoding technique. As described in Chapter 2, one-hot encoding
consists of mapping a unique integer index with each label, then turning a given integer index i into
a binary vector of size n (the number of labels, 15 in this case). For this dataset, one-hot encoding consists of embedding each of the labels into an all-zero vector of size 15, with a single 1 value in
the place of the index for the index corresponding to the label. For example, this dataset contains
the following class labels as strings: BENIGN, FTP-Patator, SSH-Patator, DoS slowloris, DoS
Slowhttptest, DoS Hulk, DoS GoldenEye, Heartbleed, Web Attack Brute Force, Web Attack XSS,
Web Attack Sql Injection, Infiltration, Bot, PortScan, DDoS.
For a multinomial classification problem, this dataset can be converted using one-hot encoding, with the label for the benign class represented as the following vector (tensor): [1,0,0,0,0,0,0,0,0,0,0,0,0,0,0], and one of the attack class labels, 'DoS GoldenEye', represented by the following one-hot encoded vector: [0,0,0,0,0,0,1,0,0,0,0,0,0,0,0]. However, for this first experiment,
this dataset is evaluated as a binary classification problem, where all the malicious flows are con-
verted to the value 1, and all benign flows are converted to the value 0.
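A minimal NumPy sketch of this one-hot encoding (only the first few class labels are listed, so the vectors here are shorter than the full label set; the label-to-index mapping is illustrative):

```python
import numpy as np

labels = ["BENIGN", "FTP-Patator", "SSH-Patator", "DoS slowloris",
          "DoS Slowhttptest", "DoS Hulk", "DoS GoldenEye"]  # truncated label list
index = {name: i for i, name in enumerate(labels)}

def one_hot(name: str, n: int) -> np.ndarray:
    vec = np.zeros(n)
    vec[index[name]] = 1.0  # single 1 at the label's integer index
    return vec

benign = one_hot("BENIGN", len(labels))          # 1 in position 0
goldeneye = one_hot("DoS GoldenEye", len(labels))  # 1 in position 6
```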
Similar to the previous case study, this dataset is split into separate training, validation, and test
sets with the original data split into 67% training and 33% test; then this training set is further
split into 80% for the final training set and 20% for a hold-out validation set. This dataset is also
imbalanced, but not quite as imbalanced as the ISCX IDS 2012 dataset. Instead, there is a total
of 19.70% malicious flows, and 80.30% benign flows. Therefore, when splitting the dataset, it
must be stratified so that there is a consistent proportion of malicious and benign flows within each
training, validation, and test split.
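A scikit-learn sketch of this two-stage stratified split (the features and labels below are placeholders; the 19.7% malicious proportion matches the text):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)   # placeholder features
y = np.array([1] * 197 + [0] * 803)  # ~19.7% malicious labels

# 67% train / 33% test, preserving the class proportions
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=42)

# the training portion is further split 80/20 into final train and validation
X_tr, X_val, y_tr, y_val = train_test_split(
    X_tr, y_tr, test_size=0.20, stratify=y_tr, random_state=42)
```

The `stratify` argument is what keeps the malicious/benign ratio consistent across all three splits.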
3.2.5.5 Hyperparameters
After running several experiments tweaking various parameters, high performance is achieved
using a neural network configuration as shown in Figure 3.8.
Figure 3.8: Deep Neural Network Architecture for CIC IDS 2017 Dataset
3.2.5.6 Evaluation and Results
In this section, evaluation and results of experiments are provided from using a deep feedforward
fully connected neural network classifier on the CIC IDS 2017 dataset. The class distribution for
the CIC IDS 2017 dataset can be seen in Figure 3.9.
A number of experiments and trials were run using the CIC IDS 2017 dataset, similar to the
previous case study. Various neural network configurations and hyperparameters were used in
order to achieve the best results, which are shown in Table 3.8. The best-performing configuration uses a neural network with three hidden layers of 64 units per layer, with the ReLU activation function for each hidden layer and loss calculated using the softmax cross-entropy loss function. The optimizer used in obtaining these results is the Adagrad optimizer.
As can be seen from the results, the deep neural network performs well in its ability to classify network flows as malicious or benign, achieving a low false positive rate of only 0.000127
and achieving an F1 score of 0.9980. This demonstrates that using the detailed flow statistics, in
addition to using the embedding technique for the categorical variables of source/destination IP
address and ports enables this classifier to achieve acceptable results.
Insight 3. When using embeddings for the categorical variables of Source/Destination IP Address
and Ports, the neural network performs exceptionally well at classifying malicious and benign flows.
In addition, in experiments that omit the Source/Destination IP address features but still utilize
detailed flow statistics, the neural network performs nearly as well, achieving an F1 Measure of
0.9730 in comparison to the experiment that included the IP address features, which achieved
the 0.9980 F1 score. Therefore, training a neural network by omitting IP address features, given
it contains detailed flow statistics, may be a viable way to use the trained model on a different
computer network it has not seen before.
Insight 4. When using the IP address as a feature and representing it as a dense embedded vector
at the first layer of the deep neural network, the best results are achieved. As the dense vector representation forms part of the first layer of the neural network, it gets updated at every epoch, in relation to the other features in the dataset as well as the supervised label. Therefore,
it is theorized that the neural network forms a type of memory of the IP address in relation to the
other features and the label, whereby it enables the highest performance for the classification task.
Figure 3.10: CIC IDS 2017 Confusion Matrix using Embeddings with IP Address
Figure 3.11: CIC IDS 2017 Confusion Matrix using Embeddings without IP Address
It should be noted, however, that the IP address feature is highly relevant to the given dataset from
which it originates, and may therefore not generalize well to other datasets.
These results show that using the IP address as a feature enables better performance on the classification task. However, in practice, the IP address can often change due
to DHCP. Most often, the IP address will remain constant for the first three octets in a network
environment, as this is determined by the configuration of routers and switches and is seldom
changed. Instead, it is the last octet that can often change for a workstation in a network environment. Therefore, the key question is: can the IP address be used reliably as a feature in practice? How can the IP address be used while still accounting for the possibility that it changes due to DHCP? One approach is to use only the first three octets instead of the entire IP address.
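This transformation is a simple string operation; a minimal sketch (the function name and example addresses are illustrative):

```python
def first_three_octets(ip: str) -> str:
    """Keep the network portion of a dotted-quad IPv4 address."""
    return ".".join(ip.split(".")[:3])  # "192.168.10.50" -> "192.168.10"

src = first_three_octets("192.168.10.50")  # the truncated string is then embedded
```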
Figure 3.12 shows the results of using only the first three octets of the source and destination IP addresses:
Figure 3.12: CIC IDS 2017 Confusion Matrix - First 3 Octets of IP address
Insight 5. Using only the first three octets of the IP address performs just as well as using the
full IP address. This naturally captures interactions between different local subnets, and between
local and remote hosts. This methodology can be used to reliably include the IP address as a feature and to overcome limitations imposed by DHCP.
In addition to binary classification, the same neural network model was evaluated for multinomial classification. The results of these experiments can be seen in the confusion matrices in Figures 3.13 and 3.14. The multinomial classifier also performs well for those categories for which it has a sufficient number of examples of each class. For categories 7-12, the confusion matrix shows low performance; however, those categories have only 4, 497, 215, 7, 12, and 645 supporting examples in the dataset, respectively (see Table 3.9). Therefore, this
shows that the neural network needs a sufficient number of examples to learn from in order to
perform classification accurately.
As shown in Figure 3.14, the neural network performs just as well for multinomial classification
using just the first 3 octets of the IP address as when using the full IP address. In some cases, such
as for Bot, DoS, and DDoS, it performs just slightly better.
In comparing the two case studies, the performance of the neural network was best for the
Figure 3.14: Confusion Matrix - Multinomial Classification using first 3 octets
second case study which used the CIC IDS 2017 dataset. For the second case study, there was a
larger amount of training data available. In particular, the second case study used a dataset that
had a higher amount of malicious training examples to learn from. The first case study with the
ISCX IDS 2012 dataset contained 2,545,935 total flows, of which 68,910 (2.71%) were malicious. The second case study with the CIC IDS 2017 dataset contained 2,830,743 total flows, of which 557,646 (19.70%) were malicious. In addition, the dataset for the first case study had fewer features (a total of 14, consisting of 7 continuous and 7 categorical), compared to the second case study dataset (a total of 74, consisting of 69 continuous and 5 categorical), with the second
dataset containing more continuous features made up of detailed flow statistics for each training
example. Even when removing features from CIC IDS 2017 to make the two datasets have parity,
the performance of the neural network was higher on CIC IDS 2017 with an F1 Score of 0.9990,
compared to ISCX IDS 2012 with an F1 Score of 0.9491. The primary difference in this case
was that CIC IDS 2017 contained a higher number of attacks to learn from. In addition, the CIC
IDS 2017 dataset itself is more robust in how it was generated, as it adheres to the 11 criteria for building a robust benchmark dataset according to [95].
Insight 6. Given more attack examples for the neural network to train on, the performance of the
neural network increases.
There are a number of studies that have been using the CIC IDS 2017 dataset for evaluating
approaches and algorithms for network intrusion detection. Tables 3.10 and 3.11 compare other
studies’ results to the results achieved in this study.
These experiments show promising results, especially when including the source and destination
IP address as features. In some environments, it may be viable to include the IP address as a
feature; however, in other environments it is often the case that IP addresses change due to DHCP.
The experiments without IP addresses still performed well, but could not reach the level of results
as when IP addresses were included as embedded categorical features. One way to overcome the
Table 3.10: CIC IDS 2017 Result Comparison [13]
Technique Avg. DR Avg. FPR
Hybrid IDS - Decision Tree + Rule-based [13] 0.94475 0.01145
WISARD [30] 0.48175 0.02865
Forest PA [12] 0.92920 0.03550
J48 Consolidated [52] 0.92020 0.06645
LIBSVM [19] 0.54595 0.05130
FURIA [50] 0.90500 0.03165
Random Forest [13] 0.93050 0.01880
REP Tree [13] 0.91640 0.04835
MLP [13] 0.77830 0.07350
Naive Bayes [13] 0.82510 0.33455
Jrip [13] 0.93400 0.04470
J48 [13] 0.91990 0.05040
DNN with IPs 0.9993 0.0003
DNN without IPs 0.9677 0.0052
limitation of IP addresses changing due to DHCP is to use only the first 3 octets of the IP address
as a feature. Experiments show that this methodology performs just as well as using the complete
4 octets of the IP address.
3.4 Summary
Deep Neural Networks are effective when working with tabular data that consists of many examples and categorical variables of high cardinality, both of which are present in the domain of
Cybersecurity and Network Intrusion Detection. The technique for embedding high cardinality
categorical variables, which are common in cybersecurity data, leverages the power of deep neural
networks to achieve better results than other leading techniques without needing to perform much
in the way of hand-engineered features, especially when the amount of training data is larger. Fur-
thermore, it was found that using the IP Address as an embedded categorical feature enabled the
best performance for the deep neural network. It is theorized that by using the embedding tech-
nique and updating the weights of the IP address representation at each epoch, the neural network
forms a type of inherent memory about the IP address feature in relation to the other features and
the given label. Additionally, high performance was still achieved when only the first three octets of the IP address were used, as opposed to the full IP address, allowing for a more robust feature that withstands changes that would typically be caused by updates from a DHCP server.
In this chapter, two case studies were presented that evaluated the use of deep fully connected
neural networks for classifying network flows as benign or malicious. To perform the case studies,
two recent benchmark datasets were used. The experiments showed results that outperformed other
leading techniques when using deep neural networks for classification of network flows as benign
or malicious. Further, it was shown that the use of flow statistics generated by the CICFlowMeter,
in addition to dimensionality reduction of the categorical variables using the embedding technique, enabled the deep neural network to achieve an F1 measure of 0.9990, a detection
rate of 0.9993, and a false positive rate of only 0.0003.
Chapter 4: USING AUTOENCODERS FOR NETWORK INTRUSION
DETECTION
4.1 Introduction
This chapter looks at using unsupervised learning with an autoencoder for detecting anomalies for
the task of network intrusion detection.
In this chapter, the following key contributions are made:
1. Application of unsupervised learning using a deep autoencoder for anomaly detection, applied to the field of network intrusion detection. Using only normal flows, this chapter evaluates how an autoencoder can be trained to detect anomalous flows based on reconstruction
error.
2. Validation of the approach by evaluating results using the CIC IDS 2017 benchmark intrusion
detection dataset. As this benchmark dataset was captured from real network traffic as op-
posed to being simulated or replayed from packet captures, it is representative of normal
(benign) modern day network traffic. Therefore, it is useful for evaluation where the au-
toencoder trains only on benign traffic flows. In addition, this dataset contains malicious
flows generated by modern-day attack scenarios, which are representative of attacks seen on
typical enterprise networks.
This chapter looks at using autoencoders for the task of network intrusion detection. An autoencoder is a type of neural network that is trained to copy its input to its output [45]. One of the main differences between an autoencoder and a multilayer perceptron, or deep neural network, is that the number of input neurons is equal to the number of output neurons. In addition, instead
of generating an output of ŷ (i.e. representing a class of benign or malicious as in the previous case
study), the output generated is X ′ , a reconstruction of the original input X . An autoencoder’s pri-
mary utility is to find a lower dimensional, latent space representation of the input data. It does this
in a non-linear fashion, unlike other popular dimensionality reduction techniques such as Principal Component Analysis (PCA). Therefore, a key application of autoencoders is to use them for pre-
training a neural network, and thus use them to improve the performance of supervised learning.
In addition, they can be used to power an anomaly detection system.
The key idea behind autoencoders is to take an input vector and map it to a latent space (encode)
of lower dimensionality, then from that latent space, map it back to an output (decode) using the
low dimensional latent space as input. The module that is capable of taking this latent space
of lower dimensions and mapping it back to an output with the same dimensionality as the original input is called a decoder. This technique is commonly used for image generation; however,
this chapter looks at applying this technique for the purposes of network intrusion detection, and
classifying network flows as either malicious or benign based on reconstruction error.
Variational autoencoders are one type of autoencoder that are well suited for learning latent
spaces that are well structured, whereby specific directions are able to encode useful axes of variation of the data. Therefore, VAEs can be well suited for the task of network intrusion detection.
Generative Adversarial Networks (GANs) are another type of generative deep learning that maps
an input space of high dimensionality into a lower dimensional latent space; the main difference
with a GAN is that the latent space it creates and uses tends not to have as much structure and continuity as one created by a VAE. This chapter uses a standard autoencoder, which functions to
reconstruct its inputs from a compressed representation.
Figure 4.1 shows an example standard autoencoder neural network architecture, where the
number of input neurons is equal to the number of output neurons. The way an autoencoder works
is that it takes an input vector and maps it to a latent vector space using an encoder module, then
decodes back to an output vector using a decoder module. The decoded output has the same
dimensions as the original input vector. Then, the neural network is trained using target data that is
the same as the original input vector, so that it learns how to reconstruct the original inputs. Using
Figure 4.1: Example neural network structure for Autoencoder
this process, and by placing certain constraints on the encoded output, the autoencoder can learn
interesting latent representations of the original input data. The most common constraint is limiting the encoded output to be low dimensional and sparse (mostly containing zeros). This causes the encoder to compress the input data, resulting in a representation that uses fewer bits.
One way to think about an autoencoder is to consider that it is simply a neural network that
is making predictions about itself, instead of training itself to predict labeled output data. In
other words, instead of calling model.train(X, Y) followed by model.predict(X),
an autoencoder is the process of a neural network learning to model its own inputs, by calling
model.train(X, X).
For the cost function, since the network predicts real-valued inputs rather than a binary class value as in classification, the squared error can be used, treating the task as a regression problem. Alternatively, cross-entropy can also be used as the loss function, with X being a value in the range [0, 1]. For the hidden layer and
output layer, the sigmoid activation function can be used so that the outputs always fall within the range [0, 1]; ReLU is another option for the hidden layer.
For the autoencoder, there are different bias terms for the hidden layer, b_h, and the output layer, b_o. One design choice when using an autoencoder is tied (shared) weights: the weight matrix at the output layer is simply the transpose of the weight matrix at the input layer. This also serves as a type of regularization, as it reduces the number of parameters, which in turn reduces the probability of overfitting.
Z = sigmoid(XW + b_h)    (4.1)
X̂ = sigmoid(ZW^T + b_o)    (4.2)
The objective function for the autoencoder used in this case study will be the squared error.
Written out in terms of weights and inputs, this function is shown in equation 4.3.
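A NumPy sketch of the forward pass in Eqs. (4.1)-(4.2) with tied weights, together with the squared-error objective (the dimensions and random values below are illustrative, not the experiment's configuration):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
D, M = 8, 3                      # input dimensions, hidden (latent) dimensions
W = rng.normal(scale=0.1, size=(D, M))
b_h = np.zeros(M)                # hidden-layer bias
b_o = np.zeros(D)                # output-layer bias

X = rng.random((5, D))           # 5 example flows, features scaled to [0, 1]
Z = sigmoid(X @ W + b_h)         # encode: Eq. (4.1)
X_hat = sigmoid(Z @ W.T + b_o)   # decode with tied weights: Eq. (4.2)
squared_error = np.mean((X - X_hat) ** 2)  # reconstruction loss to minimize
```

Training adjusts W, b_h, and b_o to drive this reconstruction loss down on benign flows.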
In this case study, unsupervised deep learning with an autoencoder is evaluated on the
CIC IDS 2017 dataset. The ability for the autoencoder to be used as an anomaly detection mech-
anism is demonstrated when training only on the benign traffic flows. Autoencoders can learn a
compressed representation of the input data, where the output of the autoencoder is a reconstruc-
tion of the input data in its most efficient form. By minimizing the error when reconstructing
the normal input, the neural network learns to modify its weights for reconstructing benign flows.
Then when the neural network sees malicious flows, the hypothesis is that the reconstruction error
will be higher, as the malicious flows are anomalous when compared to the benign flows in terms
of encoding and decoding.
The configuration of the autoencoder used in this case study is shown in Figure 4.2.
Figure 4.2: CIC IDS 2017 Autoencoder Configuration
The results of this experiment are shown in the tables and figures below. As shown in Figure
4.3, there is a higher reconstruction error for the malicious traffic flows as compared to the benign
flows.
Insight 7. The autoencoder is trained to reconstruct inputs from a latent space representation
based on only benign flows. Therefore, when it receives malicious flows as input, it is more difficult
for the neural network to reconstruct the malicious flows, thus resulting in a higher reconstruction
error. When there is a sufficient number of benign training examples, an autoencoder can be used to detect anomalous flows by setting a threshold on the reconstruction error.
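A minimal sketch of this thresholding step (the toy inputs and the helper name are illustrative, not from the case study):

```python
import numpy as np

def flag_anomalies(X, X_hat, threshold=0.03):
    """Compute per-flow mean squared reconstruction error and flag
    flows above the threshold as anomalous (1) versus benign (0)."""
    errors = np.mean((X_hat - X) ** 2, axis=1)
    return errors, (errors > threshold).astype(int)

# Toy example: the first two flows reconstruct well, the third poorly.
X     = np.array([[0.10, 0.20], [0.30, 0.40], [0.90, 0.10]])
X_hat = np.array([[0.11, 0.19], [0.29, 0.41], [0.50, 0.60]])
errors, preds = flag_anomalies(X, X_hat, threshold=0.03)
# preds → [0, 0, 1]: only the poorly reconstructed flow is flagged.
```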
Looking at this further, when setting a threshold of 0.03 for the reconstruction error, the neural
network predicts malicious and benign flows with the metrics shown in Table 4.1. As can be seen,
a high number of malicious flows are detected as such, with a minimal number of false positives.
However, there is a relatively large number of false negatives, malicious flows that are classified as
benign. This can also be observed in Figure 4.3, where a number of malicious flows have a similar
reconstruction error as benign flows. Notwithstanding, the autoencoder shows effectiveness in
being able to detect malicious flows it has never seen before, and does so with a low false positive
rate. However, it does so at the cost of a high false negative rate: at a given threshold, the neural network classifies a large number of malicious flows as benign.
Insight 8. Autoencoders can be effective as an anomaly detection mechanism for network intrusion detection (with training on normal traffic only). They can be used to detect malicious flows that have never been encountered before with a low false positive rate. Importantly, since the technique exhibits a low false positive rate, it can be useful for bringing anomalies to the attention of a human analyst for review. However, this comes at the cost of a high false negative rate, so autoencoders should be used along with supervised techniques as part of a defense-in-depth strategy.
Figure 4.3: CIC IDS 2017 Autoencoder Reconstruction Error with Threshold
As shown in this case study, autoencoders can be useful for anomaly detection in terms of achieving
a low false positive rate. Using only one class of benign flows, autoencoders can learn how to
reconstruct the benign class. Therefore, when new flows are observed, if the reconstruction loss
is above a set threshold, anomalies can be detected. The limitation here is that there must be existing data known to be normal or benign. In addition, the experiments show that a threshold can be set where there is a minimal number of false positives, yet still a relatively large number of false negatives. However, it may be preferable to have a lower number of false
Table 4.1: Autoencoder Evaluation Results - CIC IDS 2017

Metric                                                      Result
True Negative                                               681,307
False Positive                                              89
False Negative                                              128,065
True Positive                                               38,902
Area Under the Curve                                        0.6164
Accuracy                                                    0.8489
Error Rate                                                  0.1512
True Positive Rate (Sensitivity, Recall, Detection Rate)    0.2330
True Negative Rate (Specificity)                            0.9999
False Positive Rate                                         0.00013
False Negative Rate                                         0.7670
Precision                                                   0.9977
F1 Measure                                                  0.3778
positives than false negatives, as once an anomalous flow is detected, it is a signal of an ongoing attack that is causing multiple malicious flows in the network. In addition, from a practitioner's perspective, it may be preferable for the human analyst to review anomalous alerts that have a high probability of being true positives. Importantly, this shows that autoencoders as an anomaly detection mechanism are just one technique that can be used to surface never-before-seen malicious activity, in combination with other tools within a defense-in-depth strategy. It is also important to note that this case study used only the flow statistics as features. This is one contributing factor to the high false negative rate exhibited in the experiments. In the previous chapter, it was shown that the IP address had a strong impact on the performance of the neural network for supervised learning. Therefore, future work on autoencoders should investigate incorporating categorical variables along with the flow statistics to improve performance, especially in achieving a lower false negative rate.
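The derived metrics in Table 4.1 follow directly from the four confusion counts; recomputing them serves as a quick consistency check:

```python
# Confusion counts from Table 4.1 (CIC IDS 2017 autoencoder experiment).
tn, fp, fn, tp = 681_307, 89, 128_065, 38_902

accuracy  = (tp + tn) / (tn + fp + fn + tp)
tpr       = tp / (tp + fn)        # sensitivity / recall / detection rate
tnr       = tn / (tn + fp)        # specificity
fpr       = fp / (fp + tn)
fnr       = fn / (fn + tp)
precision = tp / (tp + fp)
f1        = 2 * precision * tpr / (precision + tpr)

print(round(accuracy, 4), round(tpr, 4), round(precision, 4), round(f1, 4))
# → 0.8489 0.233 0.9977 0.3778
```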
Insight 9. It is important to note that not all anomalous flows are malicious. Rather, they are simply flows that have an inherently different structure from the benign flows on which the neural network was trained. Regardless, the experiments in this case study resulted in a low false positive rate, and the majority of the anomalous flows were in fact verified as malicious. Incorporating categorical variables is a good next step for improving the performance of the autoencoder in terms of false negative rate.
4.5 Summary
In this chapter, the use of an autoencoder for anomaly detection was evaluated on the CIC IDS 2017 benchmark intrusion detection dataset. It was shown that when training an autoencoder on only benign flows, a threshold can be set on the reconstruction error and anomalies can be detected. These anomalies were verified to be attack traffic (with a low false positive rate) which the neural network had never seen before. This demonstrates a use of an unsupervised deep neural network for the case where there is an abundance of labeled benign network flows, and how never-before-seen malicious traffic flows can then be detected as anomalies based on reconstruction error.
Chapter 5: RELATED WORK
Intrusion detection is an important field of cybersecurity data analytics, which is a core pillar un-
derlying the Cybersecurity Dynamics framework [109–112] that aims to understand, model, char-
acterize and quantify cybersecurity from a holistic perspective. The other two pillars are known
as first-principle modeling and analysis [29, 49, 61, 65, 68, 106, 108, 111, 113–116, 123, 124] and
cybersecurity metrics [20–22, 25, 26, 34, 40, 43, 71, 73, 88].
There are many sub-fields in cybersecurity data analytics. The present Thesis falls into the sub-field of cybersecurity data analytics that aims to detect network attacks, which are then treated properly by the defender; therefore, this sub-field may be called reactive cybersecurity data analytics (Figure 5.1).
Intrusion detection can be host-based or network-based. The present Thesis falls into the lat-
ter scenario. There have been many approaches to Network Intrusion Detection developed over
the years. The general study of anomaly detection dates back as early as the 19th century with
origins in the statistics community [36]. The study of intrusion detection for cybersecurity has
roots in a paper published by Denning in 1987 [31]. Fiore et al. [41] explored the use of a semi-supervised model for network intrusion detection, using a Discriminative Restricted Boltzmann
Machine. They trained their classifier on "normal" traffic only, with the goal of being able to detect
anomalous behaviors that may evolve and change over time. They argue that it is often difficult
to train a model with anomalous training samples a priori. For their experiments, they used the
classic KDD’99 dataset. While this is a common benchmark dataset that has been used in many
research papers in the field, it has also been criticized for its lack of relevancy with current day
attacks. Sommer and Paxson suggest that this dataset may be better suited only for providing additional
validation and cross-checking of a novel technique [97].
Other sub-fields of cybersecurity data analytics (shown in Figure 5.1) aim to understand, characterize, and predict cybersecurity behaviors and events exhibited by cybersecurity data, including: generating datasets [89, 95, 96]; measuring attack-defense structures in complex networks
[21,22,44]; measuring the susceptibility of software to attacks [3,8,9,47,62–64,80,117,118]; mea-
suring capabilities of defense tools [20,34,39,71]; quantifying capabilities of attacks [70,102,103];
malware detection and adversarial malware detection [53, 56–60, 72, 75–79, 92, 93, 102, 104, 120];
predicting or forecasting the cybersecurity situation based on the monitored cybersecurity data
streams [23, 90, 91, 105, 107, 121, 122].
Chapter 6: DISCUSSION AND CONCLUSION
6.1 Conclusion
Deep learning models are able to learn features on their own, provided there is an abundance of labeled training data. Neural networks are capable of extracting features directly from the data itself, instead of relying on hand-engineered features. Hand-engineered features are still important, as they allow a problem to be solved with fewer resources and in a more elegant manner. Good hand-engineered features based on domain knowledge also help when there is far less data available to train on.
In the field of network intrusion detection, there is an abundance of data available from packet
captures of network flows. These packet captures can be converted into network flows that contain
rich metadata about the statistics of each flow, which are composed of the captured packet data.
These flows are structured in the form of tabular data, and contain both continuous and categorical
features. The categorical features for network flows are often high-dimensional, as was
shown in the two case studies in this work. Deep learning can be a powerful tool when used
with tabular data, especially when it contains categorical features of high dimensionality. Using
an embedding technique, these high cardinality categorical variables can be converted to dense
floating point vectors that are orders of magnitude smaller.
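The mechanics of such an embedding lookup can be sketched in NumPy (the vocabulary size, embedding dimension, and random matrix are hypothetical stand-ins for weights that would be learned during training):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical: 65,536 possible destination ports embedded into 8 dimensions.
vocab_size, embed_dim = 65_536, 8
E = rng.normal(0.0, 0.05, (vocab_size, embed_dim))  # learned embedding matrix

ports = np.array([80, 443, 53])   # raw categorical values for a batch of flows
dense = E[ports]                  # lookup: each port maps to an 8-d float vector

# One-hot encoding this batch would be 3 x 65,536; the embedded
# representation is 3 x 8, orders of magnitude smaller.
```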
Using two recent benchmark intrusion detection datasets, the effectiveness of using feedfor-
ward fully connected deep neural networks for classifying malicious and benign flows using both
supervised and unsupervised techniques was explored. It was shown that supervised learning us-
ing deep neural networks is highly effective for network intrusion detection. Using embeddings
for the high dimensional categorical variables of source and destination IP addresses and ports,
the neural network performs exceptionally well at classifying malicious and benign flows using
supervised learning. The neural network was also effective on the UNB ISCX 2012 dataset, which had fewer examples and fewer features per example (specifically, there were far fewer flow statistics available). In particular, it performed better than other leading techniques, as shown in Table 3.6. Furthermore, on the CIC IDS 2017 dataset, which contained more examples
for each class, and many more features composed of flow statistics, the performance of the neural
network increased. Also, a technique was discovered for embedding IP address features using only the first three octets, which overcomes the limitation imposed by DHCP: using only the first three octets performed just as well as using the full IP address.
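The octet-truncation step itself is simple; a sketch (the helper name is illustrative, not from the implementation):

```python
def ip_to_24_prefix(ip: str) -> str:
    """Keep only the first three octets of an IPv4 address, so all
    DHCP-assigned hosts in the same /24 subnet share one embedding
    category."""
    return ".".join(ip.split(".")[:3])

# Hosts whose last octet changes under DHCP collapse to one category:
ip_to_24_prefix("192.168.1.57")   # → "192.168.1"
ip_to_24_prefix("192.168.1.200")  # → "192.168.1"
```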
In this work, it was also shown that unsupervised learning using autoencoders can be a useful application of deep neural networks when there exists an abundance of data for one class. By training the neural network to reconstruct its inputs, it learns a latent representation of the data by minimizing the reconstruction error. When the trained autoencoder receives new inputs it has never seen before, it reconstructs those inputs based on its previously learned representation. If the reconstruction error is above a certain threshold, that input is deemed anomalous. It was shown in Chapter 4 that the autoencoder reconstructed malicious inputs with a high reconstruction error, and these inputs were thus deemed anomalies, with a low false positive rate. Autoencoders can be effective
in modeling the normal flows, and by setting a threshold on the reconstruction error, can alert when
anomalous (malicious) flows are detected due to a high reconstruction error.
Deep neural networks can be used for both supervised and unsupervised learning. In the
case studies performed, it was shown that a deep feedforward fully connected neural network
can achieve excellent results for supervised learning. It was also shown that autoencoders can
be used for anomaly detection when they are trained on benign flows, enabling the discovery of
never-before-seen malicious flows with a low false positive rate.
There are a number of future directions that can be pursued, including the following:
1. Use time domain information as an embedded categorical variable. For example, use differ-
ent resolutions such as day of week, hour of day, or minute of hour as embedded categorical
variables. This is highly dependent on the signature of an attack, but if an adversary chooses
to attack at a certain time, this may help increase precision and recall. In addition, evaluate these features on their ability to improve the performance of the autoencoder.
2. Perform additional strategies for embedding the IP address as a categorical variable. For
environments where the IP address frequently changes, other techniques for including the IP
address as a categorical embedded feature may be useful.
3. Explore Recurrent Neural Network structures and how they can be used to leverage the time
domain for classification of benign and malicious flows.
4. Leverage categorical variables with the autoencoder approach to improve its performance
in detecting anomalous flows. The case study in this Thesis that evaluated an autoencoder utilized only the continuous flow statistics as features.
5. UNB CIC recently released a new dataset in 2018 that was created using the same framework
for building the CICIDS2017 dataset [95]. Therefore, training a deep neural network on one
computer network traffic environment, and determining how well it generalizes to classifying
flows on an entirely different environment will be a good next step. It would be of interest
to determine algorithms and techniques in which a neural network trained on one network
intrusion detection dataset can be generalized to perform effectively on a new dataset coming
from a different computer network environment.
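The time-domain categorical features proposed in the first item could be derived from a flow timestamp along the following lines (the helper and the chosen resolutions are a sketch, not an implemented part of this work):

```python
from datetime import datetime

def time_features(ts: datetime) -> dict:
    """Derive categorical time features at several resolutions; each
    could be fed through its own embedding layer."""
    return {
        "day_of_week":    ts.weekday(),  # 0 = Monday ... 6 = Sunday
        "hour_of_day":    ts.hour,       # 0-23
        "minute_of_hour": ts.minute,     # 0-59
    }

feats = time_features(datetime(2017, 7, 5, 14, 30))
# → {'day_of_week': 2, 'hour_of_day': 14, 'minute_of_hour': 30}
```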
Appendix A: DATASET DETAILS
Table A.1: Continued

No.  Name           Description                                                       Type
19   std_bpktl      Standard deviation size of packet in backward direction           Continuous
31   bpsh_cnt       Number of times the PSH flag was set in packets travelling in
                    the backward direction (0 for UDP)                                Continuous
48   flow_fin       Number of packets with FIN                                        Continuous
65   bAvgBulkRate   Average number of bulk rate in the backward direction             Continuous
77   std_idle       Standard deviation time a flow was idle before becoming active    Continuous
BIBLIOGRAPHY
[2] RFC 7011 - Specification of the IP Flow Information Export (IPFIX) Protocol for the Ex-
change of Flow Information, 2013. https://tools.ietf.org/html/rfc7011.
[6] What is IPFIX - Netflow’s main Contender in Traffic Analysis, Jul 2016. https://www.
pcwdld.com/what-is-ipfix.
[11] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro,
Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Good-
fellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz,
Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore,
Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever,
Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol
Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng.
TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems, 2015. Software
available from https://tensorflow.org/.
[12] Md Nasim Adnan and Md Zahidul Islam. Forest pa: Constructing a decision forest by
penalizing attributes used in previous trees. Expert Systems with Applications, 89:389–403,
2017.
[13] Ahmed Ahmim, Leandros Maglaras, Mohamed Amine Ferrag, Makhlouf Derdour, and
Helge Janicke. A novel hierarchical intrusion detection system based on decision tree and
rules-based models. arXiv preprint arXiv:1812.09059, 2018.
[14] Mark Allman, Mike Bennett, Martin Casado, Scott Crosby, Jason Lee, Boris Nechaev,
Ruoming Pang, Vern Paxson, and Brian Tierney. Lbnl/icsi enterprise tracing project, 2005.
[15] Adel Ammar. A decision tree classifier for intrusion detection priority tagging. Journal of
Computer and Communications, 3(04):52, 2015.
[16] Buse Atli et al. Anomaly-based intrusion detection by modeling probability distributions of
flow characteristics. 2017.
[17] R Bace and P Mell. Intrusion detection systems, national institute of standards and technol-
ogy (nist). Technical Report 800-31, 2001.
[18] M.A. Boden. Artificial Intelligence and Natural Man. Computer science/psychology. Basic
Books, 1977.
[19] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. Available at: http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2001.
[20] John Charlton, Pang Du, Jin-Hee Cho, and Shouhuai Xu. Measuring relative accuracy
of malware detectors in the absence of ground truth. In IEEE Military Communication
Conference (MILCOM 2018), 2018.
[21] Huashan Chen, Jin-Hee Cho, and Shouhuai Xu. Quantifying the security effectiveness of
firewalls and dmzs. In Proceedings of the 5th Annual Symposium and Bootcamp on Hot
Topics in the Science of Security (HoTSoS’2018), pages 9:1–9:11, 2018.
[22] Huashan Chen, Jin-Hee Cho, and Shouhuai Xu. Quantifying the security effectiveness of
network diversity: poster. In Proceedings of the 5th Annual Symposium and Bootcamp on
Hot Topics in the Science of Security (HoTSoS’2018), page 24:1, 2018.
[23] Yu-Zhong Chen, Zi-Gang Huang, Shouhuai Xu, and Ying-Cheng Lai. Spatiotemporal pat-
terns and predictability of cyberattacks. PLoS One, 10(5):e0124472, 05 2015.
[24] J. Cho, S. Xu, P. Hurley, M. Mackay, T. Benjamin, and M. Beaumont. Stram: Measuring the
trustworthiness of computer-based systems. ACM Comput. Surv. (accepted for publication).
[25] J. Cho, S. Xu, P. Hurley, M. Mackay, T. Benjamin, and M. Beaumont. Stram: Measuring
the trustworthiness of computer-based systems. manuscript in submission, 2018.
[26] Jin-Hee Cho, Packtrick Hurley, and Shouhuai Xu. Metrics and measurement of trustworthy
systems. In IEEE Military Communication Conference (MILCOM 2016), 2016.
[27] F. Chollet. Deep Learning with Python. Manning Publications Company, 2018.
[29] Gaofeng Da, Maochao Xu, and Shouhuai Xu. A new approach to modeling and analyzing
security of networked systems. In Proceedings of the 2014 Symposium and Bootcamp on
the Science of Security (HotSoS’14), pages 6:1–6:12, 2014.
[31] D E Denning. An Intrusion-Detection Model. IEEE Transactions on Software Engineering,
SE-13(2):222–232, 1987.
[33] Gerard Draper-Gil, Arash Habibi Lashkari, Mohammad Saiful Islam Mamun, and Ali A.
Ghorbani. Characterization of encrypted and vpn traffic using time-related features. In
ICISS 2016, 2016.
[34] Pang Du, Zheyuan Sun, Huashan Chen, Jin-Hee Cho, and Shouhuai Xu. Statistical estima-
tion of malware detection metrics in the absence of ground truth. IEEE Trans. Information
Forensics and Security, 13(12):2965–2980, 2018.
[35] Sumeet Dua and Xian Du. Data Mining and Machine Learning in Cybersecurity. Auerbach
Publications, Boston, MA, USA, 1st edition, 2011.
[36] F. Y. Edgeworth. XLI. On discordant observations. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 23(143):364–375, 1887.
[37] Laurene V Fausett et al. Fundamentals of neural networks: architectures, algorithms, and
applications, volume 3. Prentice-Hall Englewood Cliffs, 1994.
[38] Tom Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–
874, 2006.
[40] Eric Ficke, Kristin M. Schweitzer, Raymond M. Bateman, and Shouhuai Xu. Characteriz-
ing the effectiveness of network-based intrusion detection systems. In 2018 IEEE Military
Communications Conference, MILCOM 2018, Los Angeles, CA, USA, October 29-31, 2018,
pages 76–81, 2018.
[41] Ugo Fiore, Francesco Palmieri, Aniello Castiglione, and Alfredo De Santis. Network
anomaly detection with the restricted boltzmann machine. Neurocomputing, 122:13–23,
2013.
[42] Gianluigi Folino, Francesco Sergio Pisani, and Pietro Sabatino. A distributed intrusion
detection framework based on evolved specialized ensembles of classifiers. In European
Conference on the Applications of Evolutionary Computation, pages 315–331. Springer,
2016.
[43] Richard Garcia-Lebron, David J. Myers, Shouhuai Xu, and Jie Sun. Node diversification in complex networks by decentralized coloring. Manuscript under review, 2018.
[44] Richard Garcia-Lebron, Kristin Schweitzer, Raymond Bateman, and Shouhuai Xu. A frame-
work for characterizing the evolution of cyber attacker-victim relation graphs. In IEEE
Milcom’2018. 2018.
[45] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
http://www.deeplearningbook.org.
[47] Gustavo Grieco, Guillermo Luis Grinblat, Lucas C. Uzal, Sanjay Rawat, Josselin Feist, and
Laurent Mounier. Toward large-scale vulnerability discovery using machine learning. In
Proceedings of the Sixth ACM on Conference on Data and Application Security and Privacy,
New Orleans, LA, USA, pages 85–96, 2016.
[48] Cheng Guo and Felix Berkhahn. Entity embeddings of categorical variables. arXiv preprint
arXiv:1604.06737, 2016.
[49] Yujuan Han, Wenlian Lu, and Shouhuai Xu. Characterizing the power of moving target
defense via cyber epidemic dynamics. In Proc. 2014 Symposium and Bootcamp on the
Science of Security (HotSoS’14), pages 10:1–10:12, 2014.
[50] Jens Hühn and Eyke Hüllermeier. Furia: an algorithm for unordered fuzzy rule induction.
Data Mining and Knowledge Discovery, 19(3):293–319, 2009.
[51] Young Hyun, Bradley Huffaker, Dan Andersen, Emile Aben, Colleen Shannon, Matthew Luckie, and K Claffy. The CAIDA IPv4 routed/24 topology dataset. http://www.caida.org/data/active/ipv4_routed_24_topology_dataset.xml, 2011.
[52] Igor Ibarguren, Jesús M Pérez, Javier Muguerza, Ibai Gurrutxaga, and Olatz Arbelaitz.
Coverage-based resampling: Building robust consolidated decision trees. Knowledge-Based
Systems, 79:51–67, 2015.
[53] Erhan J. Kartaltepe, Jose Andre Morales, Shouhuai Xu, and Ravi S. Sandhu. Social network-
based botnet command-and-control: Emerging threats and countermeasures. In ACNS,
pages 511–528, 2010.
[54] Gulshan Kumar and Krishan Kumar. Design of an evolutionary approach for intrusion
detection. The Scientific World Journal, 2013, 2013.
[55] Arash Habibi Lashkari, Gerard Draper Gil, Mohammad Saiful Islam Mamun, and Ali A.
Ghorbani. Characterization of tor traffic using time based features. In Proceedings of the
3rd International Conference on Information Systems Security and Privacy - Volume 1:
ICISSP,, pages 253–262. INSTICC, SciTePress, 2017.
[56] L. Chen, S. Hou, Y. Ye, and S. Xu. Droideye: Fortifying security of learning-based clas-
sifier against adversarial android malware attacks. In Proc. 2018 IEEE/ACM International
Conference on Advances in Social Networks Analysis and Mining (ASONAM), pages ??–??,
2018.
[57] Justin Leonard, Shouhuai Xu, and Ravi S. Sandhu. A first step towards characterizing
stealthy botnets. In Proceedings of the The Forth International Conference on Availability,
Reliability and Security, ARES 2009, March 16-19, 2009, Fukuoka, Japan, pages 106–113,
2009.
[58] Justin Leonard, Shouhuai Xu, and Ravi S. Sandhu. A framework for understanding botnets.
In Proceedings of the The Forth International Conference on Availability, Reliability and
Security, ARES 2009, pages 917–922, 2009.
[59] D. Li, Q. Li, Y. Ye, and S. Xu. Enhancing robustness of deep neural networks against
adversarial malware samples: Principles, framework, and aics’2019 challenge. In AAAI-
2019 Workshop on Artificial Intelligence for Cyber Security (AICS’2019).
[60] Deqiang Li, Ramesh Baral, Tao Li, Han Wang, Qianmu Li, and Shouhuai Xu. Hashtran-dnn:
A framework for enhancing robustness of deep neural networks against adversarial malware
samples. CoRR, abs/1809.06498, 2018.
[61] Xiaohu Li, Paul Parker, and Shouhuai Xu. A stochastic model for quantitative security
analyses of networked systems. IEEE Transactions on Dependable and Secure Computing,
8(1):28–43, 2011.
[62] Zhen Li, Deqing Zou, Shouhuai Xu, Hai Jin, Hanchao Qi, and Jie Hu. Vulpecker: An
automated vulnerability detection system based on code similarity analysis. In Proceedings
of the 32nd Annual Conference on Computer Security Applications, Los Angeles, CA, USA,
pages 201–213, 2016.
[63] Zhen Li, Deqing Zou, Shouhuai Xu, Hai Jin, Yawei Zhu, Zhaoxuan Chen, Sujuan Wang, and
Jialai Wang. Sysevr: A framework for using deep learning to detect software vulnerabilities.
CoRR, abs/1807.06756, 2018.
[64] Zhen Li, Deqing Zou, Shouhuai Xu, Xinyu Ou, Hai Jin, Sujuan Wang, Zhijun Deng, and
Yuyi Zhong. Vuldeepecker: A deep learning-based system for vulnerability detection. In
25th Annual Network and Distributed System Security Symposium, NDSS 2018, San Diego,
California, USA, February 18-21, 2018, 2018.
[65] Z. Lin, W. Lu, and S. Xu. Unified preventive and reactive cyber defense dynamics is still
globally convergent. IEEE/ACM Transactions on Networking, 2019 (accepted for publica-
tion).
[66] Richard Lippmann, Robert K Cunningham, David J Fried, Isaac Graf, Kris R Kendall,
Seth E Webster, and Marc A Zissman. Results of the 1998 darpa offline intrusion detec-
tion evaluation. In Proc. Recent Advances in Intrusion Detection, 1999.
[67] Richard Lippmann, Joshua W Haines, David J Fried, Jonathan Korba, and Kumar Das.
The 1999 darpa off-line intrusion detection evaluation. Computer networks, 34(4):579–595,
2000.
[68] Wenlian Lu, Shouhuai Xu, and Xinlei Yi. Optimizing active cyber defense dynamics. In
Proceedings of the 4th International Conference on Decision and Game Theory for Security
(GameSec’13), pages 206–225, 2013.
[69] Freda Lundy. Cyber Threat Alert Fatigue and Reduction Methods. PhD thesis, 2017. Copy-
right - Database copyright ProQuest LLC; ProQuest does not claim copyright in the indi-
vidual underlying works; Last updated - 2018-03-02.
[70] Justin Ma, Lawrence K. Saul, Stefan Savage, and Geoffrey M. Voelker. Learning to detect
malicious urls. ACM TIST, 2(3):30:1–30:24, 2011.
[71] Jose Mireles, Eric Ficke, Jin-Hee Cho, Patrick Hurley, and Shouhuai Xu. Metrics towards
measuring cyber agility. manuscript in submission, 2018.
[72] Jose David Mireles, Jin-Hee Cho, and Shouhuai Xu. Extracting attack narratives from
traffic datasets. In 2016 International Conference on Cyber Conflict, CyCon U.S. 2016,
Washington, DC, USA, October 21-23, 2016, pages 118–123, 2016.
[73] Jose David Mireles, Eric Ficke, Jin-Hee Cho, Patrick Hurley, and Shouhuai Xu. Metrics
towards measuring cyber agility. IEEE Transaction on Information Forensics & Security,
2019 (accepted for publication).
[74] Thomas M. Mitchell. Machine Learning. McGraw-Hill, Inc., New York, NY, USA, 1
edition, 1997.
[75] J.A. Morales, R. Sandhu, and Shouhuai Xu. Evaluating detection and treatment effective-
ness of commercial anti-malware programs. In Malicious and Unwanted Software (MAL-
WARE), 2010 5th International Conference on, pages 31–38, 2010.
[76] Jose Morales, Shouhuai Xu, and Ravi Sandhu. Analyzing malware detection efficiency with
multiple anti-malware programs. In Proceedings of 2012 ASE International Conference on
Cyber Security (CyberSecurity’12), 2012.
[77] Jose Andre Morales, Areej Al-Bataineh, Shouhuai Xu, and Ravi S. Sandhu. Analyzing and
exploiting network behaviors of malware. In SecureComm, pages 20–34, 2010.
[78] Jose Andre Morales, Erhan J. Kartaltepe, Shouhuai Xu, and Ravi S. Sandhu. Symptoms-
based detection of bot processes. In MMM-ACNS, pages 229–241, 2010.
[79] Jose Andre Morales, Michael Main, Weiliang Luo, Shouhuai Xu, and Ravi S. Sandhu.
Building malware infection trees. In 6th International Conference on Malicious and Un-
wanted Software (MALWARE’2011), pages 50–57, 2011.
[80] Stephan Neuhaus, Thomas Zimmermann, Christian Holler, and Andreas Zeller. Predicting
vulnerable software components. In Proceedings of the 2007 ACM Conference on Computer
and Communications Security, Alexandria, Virginia, USA, pages 529–540, 2007.
[82] Chris V. Nicholson and Adam Gibson. Introduction to deep neural networks (deep learning).
[83] SP NIST. 800-94, guide to intrusion detection and prevention systems (idps). Information
Technology Laboratory, National Institute of Standards and Technology, USA, 2007.
[84] Quamar Niyaz, Weiqing Sun, Ahmad Y Javaid, and Mansoor Alam. A deep learning ap-
proach for network intrusion detection system. In Proceedings of the 9th EAI International
Conference on Bio-inspired Information and Communications Technologies (Formerly BIO-
NETICS), BICT-15, volume 15, pages 21–26, 2015.
[85] Bhaskar Pant, Daniela Rus, and Howard Shrobe. Cybersecurity: Technology, application
and policy, Spring 2016.
[86] Josh Patterson and Adam Gibson. Deep Learning: A Practitioner’s Approach. O’Reilly
Media, Inc., 1st edition, 2017.
[87] W. Patterson. Mathematical Cryptology for Computer Scientists and Mathematicians. Row-
man and Littlefield, 1987.
[88] Marcus Pendleton, Richard Garcia-Lebron, Jin-Hee Cho, and Shouhuai Xu. A survey on
systems security metrics. ACM Comput. Surv., 49(4):62:1–62:35, December 2016.
[89] Marcus Pendleton and Shouhuai Xu. A dataset generator for next generation system call host
intrusion detection systems. In 2017 IEEE Military Communications Conference, MILCOM
2017, Baltimore, MD, USA, October 23-25, 2017, pages 231–236, 2017.
[90] Chen Peng, Maochao Xu, Shouhuai Xu, and Taizhong Hu. Modeling and predicting extreme
cyber attack rates via marked point processes. Journal of Applied Statistics, 44(14):2534–
2563, 2017.
[91] Chen Peng, Maochao Xu, Shouhuai Xu, and Taizhong Hu. Modeling multivariate cyberse-
curity risks. Journal of Applied Statistics, 0(0):1–23, 2018.
[92] Moustafa Saleh, Tao Li, and Shouhuai Xu. Multi-context features for detecting malicious
programs. J. Computer Virology and Hacking Techniques, 14(2):181–193, 2018.
[93] Moustafa Saleh, E. Paul Ratazzi, and Shouhuai Xu. A control flow graph-based signature
for packer identification. In 2017 IEEE Military Communications Conference, MILCOM
2017, Baltimore, MD, USA, October 23-25, 2017, pages 683–688, 2017.
[94] Iman Sharafaldin, Amirhossein Gharib, Arash Habibi Lashkari, and Ali A. Ghorbani. Towards a reliable intrusion detection benchmark dataset. Software Networking, 2017(1):177–200, 2017.
[95] Iman Sharafaldin, Arash Habibi Lashkari, and Ali A Ghorbani. Toward Generating a New
Intrusion Detection Dataset and Intrusion Traffic Characterization. In 4th International
Conference on Information Systems Security and Privacy, pages 108–116. SCITEPRESS -
Science and Technology Publications, 2018.
[96] Ali Shiravi, Hadi Shiravi, Mahbod Tavallaee, and Ali A. Ghorbani. Toward developing a
systematic approach to generate benchmark datasets for intrusion detection. Comput. Secur.,
31(3):357–374, May 2012.
[97] Robin Sommer and Vern Paxson. Outside the Closed World: On Using Machine Learning
for Network Intrusion Detection. In 2010 IEEE Symposium on Security and Privacy, pages
305–316. IEEE, 2010.
[98] Zhiyuan Tan, Aruna Jamdagni, Xiangjian He, Priyadarsi Nanda, Ren Ping Liu, and Jiankun
Hu. Detection of denial-of-service attacks based on computer vision techniques. IEEE
transactions on computers, 64(9):2519–2533, 2015.
[99] Florian Teschner. Exploring embeddings for categorical variables with keras. http://
flovv.github.io/Embeddings_with_keras/, 2018.
[100] Rachel Thomas. An introduction to deep learning for tabular data. https://www.fast.
ai/2018/04/29/categorical-embeddings/, 2018.
[101] A.M. Turing. Computing Machinery and Intelligence. Mind: a quarterly review. Blackwell
for the Mind Association, 1950.
[102] Li Xu, Zhenxin Zhan, Shouhuai Xu, and Keying Ye. Cross-layer detection of malicious
websites. In Third ACM Conference on Data and Application Security and Privacy (ACM
CODASPY’13), pages 141–152, 2013.
[104] Li Xu, Zhenxin Zhan, Shouhuai Xu, and Keying Ye. An evasion and counter-evasion study
in malicious websites detection. In IEEE Conference on Communications and Network
Security (CNS’14), pages 265–273, 2014.
[105] M. Xu, K. M. Schweitzer, R. M. Bateman, and S. Xu. Modeling and predicting cyber
hacking breaches. IEEE Transactions on Information Forensics and Security, 13(11):2856–
2871, Nov 2018.
[106] Maochao Xu, Gaofeng Da, and Shouhuai Xu. Cyber epidemic models with dependences.
Internet Mathematics, 11(1):62–92, 2015.
[107] Maochao Xu, Lei Hua, and Shouhuai Xu. A vine copula model for predicting the effectiveness of cyber defense early-warning. Technometrics, 59(4):508–520, 2017.
[108] Maochao Xu and Shouhuai Xu. An extended stochastic model for quantitative security
analysis of networked systems. Internet Mathematics, 8(3):288–320, 2012.
[110] Shouhuai Xu. Cybersecurity dynamics. In Proc. Symposium and Bootcamp on the Science
of Security (HotSoS’14), pages 14:1–14:2, 2014.
[111] Shouhuai Xu. Emergent behavior in cybersecurity. In Proceedings of the 2014 Symposium
and Bootcamp on the Science of Security (HotSoS’14), pages 13:1–13:2, 2014.
[112] Shouhuai Xu. Cybersecurity dynamics: A foundation for the science of cybersecurity. In
Zhuo Lu and Cliff Wang, editors, Proactive and Dynamic Network Defense. Springer New
York, 2018 (to appear).
[113] Shouhuai Xu, Wenlian Lu, and Hualun Li. A stochastic model of active cyber defense
dynamics. Internet Mathematics, 11(1):23–61, 2015.
[114] Shouhuai Xu, Wenlian Lu, and Li Xu. Push- and pull-based epidemic spreading in arbitrary
networks: Thresholds and deeper insights. ACM Transactions on Autonomous and Adaptive
Systems (ACM TAAS), 7(3):32:1–32:26, 2012.
[115] Shouhuai Xu, Wenlian Lu, Li Xu, and Zhenxin Zhan. Adaptive epidemic dynamics in
networks: Thresholds and control. ACM Transactions on Autonomous and Adaptive Systems
(ACM TAAS), 8(4):19, 2014.
[116] Shouhuai Xu, Wenlian Lu, and Zhenxin Zhan. A stochastic model of multivirus dynamics.
IEEE Transactions on Dependable and Secure Computing, 9(1):30–45, 2012.
[117] Fabian Yamaguchi, Markus Lottmann, and Konrad Rieck. Generalized vulnerability extrapolation using abstract syntax trees. In 28th Annual Computer Security Applications Conference, Orlando, FL, USA, pages 359–368, 2012.
[118] Fabian Yamaguchi, Christian Wressnegger, Hugo Gascon, and Konrad Rieck. Chucky:
Exposing missing checks in source code for vulnerability discovery. In 2013 ACM SIGSAC
Conference on Computer and Communications Security, Berlin, Germany, pages 499–510,
2013.
[119] Warusia Yassin, Nur Izura Udzir, Zaiton Muda, Md Nasir Sulaiman, et al. Anomaly-based
intrusion detection through k-means clustering and naives bayes classification. In Proc. 4th
Int. Conf. Comput. Informatics, ICOCI, volume 49, pages 298–303, 2013.
[120] Y. Ye, S. Hou, L. Chen, X. Li, L. Zhao, S. Xu, J. Wang, and Q. Xiong. ICSD: An automatic system for insecure code snippet detection in Stack Overflow over heterogeneous information network. In Proceedings of the 34th Annual Computer Security Applications Conference, ACSAC 2018, pages 131–140, 2018.
[121] Zhenxin Zhan, Maochao Xu, and Shouhuai Xu. Characterizing honeypot-captured cyber
attacks: Statistical framework and case study. IEEE Transactions on Information Forensics
and Security, 8(11):1775–1789, 2013.
[122] Zhenxin Zhan, Maochao Xu, and Shouhuai Xu. Predicting cyber attack rates with extreme
values. IEEE Transactions on Information Forensics and Security, 10(8):1666–1677, 2015.
[123] Ren Zheng, Wenlian Lu, and Shouhuai Xu. Active cyber defense dynamics exhibiting rich
phenomena. In Proc. 2015 Symposium and Bootcamp on the Science of Security (Hot-
SoS’15), pages 2:1–2:12, 2015.
[124] Ren Zheng, Wenlian Lu, and Shouhuai Xu. Preventive and reactive cyber defense dynamics
is globally stable. IEEE Trans. Network Science and Engineering, 5(2):156–170, 2018.
VITA
Gabriel Carlos Fernández is an M.Sc. student in Computer Science at the University of Texas at San Antonio. His research focuses on cybersecurity data analytics, attack and defense, and machine learning.
Gabriel is currently employed at USAA, where he has worked primarily as a Research Engineer focused on cybersecurity and fraud research and, more recently, as a Software Engineer on projects within the same domain. During his tenure at USAA, he has been granted twelve patents to date, with more than five patent applications pending.
He has published two papers: “My brain is my passport. Verify me,” which appeared in the proceedings of the 2016 IEEE International Conference on Consumer Electronics (ICCE); and “Addressing the vulnerabilities of pass-thoughts,” which appeared in the proceedings of Signal Processing, Sensor/Information Fusion, and Target Recognition XXV (SPIE Defense + Security 2016).