Cybersecurity: RL for Data Exfiltration

Abstract—Reinforcement learning (RL), in conjunction with attack graphs and cyber terrain, is used to develop the reward and state representations needed to determine optimal paths for exfiltration of data in enterprise networks. This work builds on previous crown jewels (CJ) identification that focused on the target goal of computing optimal paths that adversaries may traverse toward compromising CJs or hosts within their proximity. This work inverts the previous CJ approach based on the assumption that data has been stolen and now must be quietly exfiltrated from the network. RL is utilized to support the development of a reward function based on the identification of those paths where adversaries desire reduced detection. Results demonstrate promising performance for a sizable network environment.

Index Terms—attack graphs, reinforcement learning, exfiltration paths, penetration testing, cyber terrain

I. INTRODUCTION

The National Institute of Standards and Technology (NIST) special publication 800-53 revision 5 states that exfiltration¹ (also called exfil) is the unauthorized movement of data within a network [1]. Many cyber attacks are considered successful if they exfiltrate data for monetary, disruptive, or competitive gain. Detecting exfiltration poses technical challenges because adversaries routinely encapsulate data within typically allowable protocols (e.g., HTTP(S), DNS), which makes it significantly harder to defend against. Additionally, adversaries have been known to prefer traversing certain network paths for data theft in order to reduce detection and avoid tripping cyber defenses that would raise suspicion.

¹NIST 800-53r5 [1] states specifically that exfiltration lies within security control SC-07(10) for boundary protection to prevent unauthorized data movement (exfiltration).

Heisting data requires two different plans: a plan to get to the data and a plan to exfiltrate the data without getting caught. Much effort in the cybersecurity industry is devoted to identifying and preventing points of weakness that allow unauthorized (i.e., adversarial) entry into a network. The most common exfiltration opportunity is moving data from a local network to an adversary network via the internet. To perform this, an adversary must gain access to the data on an organization's network, then send the data to a place off that network. Most organizations are focused on preventing network access, which leaves gaps in defenses for access from the network to the internet.

Much of the literature on automating penetration testing using RL focuses on the way networks can be accessed (i.e., infiltration [2]–[10]). And while some consider using RL to detect exfiltration [11], [12], RL for conducting post-exploitation activities like exfiltration is under-studied [13]. Maeda and Mimura apply deep RL to perform exfiltration; however, they do not use a standard attack graph construct, but rather define states using an ontological model of the agent and define actions using task automation tools [13]. Their approach has several limitations:

• The RL agent's inputs and outputs are greatly abstracted away from network structure, path structure, and cyber terrain, thereby limiting the ability to anchor agents to the real computer network.
• The exfiltration methodology does not leverage automated frameworks for attack graph construction like MulVal [14] or the vulnerability- and bug-reporting communities (e.g., via the Common Vulnerability Scoring System (CVSS) [15]).
• The output of the RL-based exfiltration method is not easily interpretable in terms of networks, their paths and configurations, and risk preferences regarding their traversal.

Whereas Maeda and Mimura's use of task automation tools and ontologies makes their proposed exfiltration method a highly automated means of actually performing exfiltration, this paper presents an alternative approach more tailored to automating exfiltration path discovery for cyber operator workflows.

This paper presents an RL method for discovering exfiltration paths in attack graphs. This paper proposes and combines:
1) An approach for modeling service-based defensive cyber terrain in dynamic models of attack graphs.
2) An RL-based algorithm for discovering the top-N exfiltration paths in an attack graph.

The presented methodology is aligned with a focus on network structure and configuration, path analysis, and cyber terrain.
It maintains MulVal's focus on scalability and leverages the vulnerability- and bug-reporting communities via CVSS. Its outcomes can be directly understood as paths through networks, as is highlighted in a detailed discussion of the results. To support reproducibility, the RL solution methods, experimental design, and network model are specified in great detail.

This paper is structured as follows: First, background on RL for penetration testing and on constructing Markov decision processes (MDPs) from attack graphs is given. After, the methods for modeling defensive terrain and discovering exfiltration paths are presented. Then, the experimental design is described, results are presented, and findings are discussed. The paper concludes with remarks on modeling decisions, a synopsis, and a statement on future work.

II. RL AND PENETRATION TESTING

A. Reinforcement Learning Preliminaries

RL describes the paradigm of learning by interaction with an environment [16]. This contrasts directly with supervised learning, where an oracle is queried for ground-truth labels. More formally, it describes a set of solution methods for approximate dynamic programming [17]. It also addresses challenges associated with large and complex environments by approximating various aspects of planning and decision-making.

With respect to RL, agents learn by taking actions in environments E and receiving rewards. Commonly, environments E are modeled as MDPs. Finite MDPs are tuples {S, A, Φ, P, R}, where S and A are states and actions, Φ ⊂ S × A are admissible state-action pairs, P : Φ × S → [0, 1] is the probability transition function, and R : Φ → ℝ, where ℝ are the reals, is the expected reward function. An agent interacts with an environment E = {S, A, Φ, P, R} by taking actions a_t and receiving states s_{t+1} and rewards r_{t+1}.

The learning procedure can be described in general terms as follows. Let R_t be the discounted sum of future rewards,

    R_t = Σ_{k=0}^{∞} γ^k r_{t+k},                                        (1)

where γ ∈ (0, 1) is a discount factor. The action value function Q^π(s, a) can then be defined as

    Q^π(s, a) = E[R_t | s_t = s, a],                                      (2)

where π is a policy mapping states and actions (s, a) to the probability of picking action a in state s. The learning procedure aims to find the optimal action value function Q*(s, a),

    Q*(s, a) = max_π Q^π(s, a).                                           (3)

Deep Q-learning (DQN) approximates Q* with a neural network Q(s, a; θ), where θ are the parameters of the neural network [18], [19].

Alternatively, instead of learning the Q function, policies can be parameterized and learned directly. In policy gradient methods, the objective function is defined as

    J(θ) = Σ_s d^π(s) V^π(s) = Σ_s d^π(s) Σ_a π_θ(a|s) Q^π(s, a),         (4)

where d^π(s) denotes the stationary distribution of the Markov chain induced by π_θ. According to the policy gradient theorem, the gradient ∇_θ J(θ) is given by

    ∇_θ J(θ) ∝ Σ_s d^π(s) Σ_a Q^π(s, a) ∇_θ π_θ(a|s).                     (5)

The policy gradient theorem provides a basis for learning a parameterized policy. However, it suffers from high gradient variance and instability. To overcome this, the value of the state V^π(s), the value of using policy π in state s, is introduced as a baseline:

    A^π(s, a) = Q^π(s, a) − V^π(s),                                       (6)

where A^π(s, a) is called the advantage. The gradient is then given as

    ∇_θ J(θ) = E_{s∼ρ^π, a∼π_θ}[∇_θ log π_θ(a|s) A^π(s, a)].              (7)

These gradients serve as the basis for the advantage actor-critic method (A2C), a standard policy gradient method in deep reinforcement learning [20].

B. RL for Penetration Testing

While deep RL has been applied to cybersecurity broadly [21], it has only recently been pursued as a tool for penetration testing [2]–[10]. While approaches and uses vary greatly, many use attack graphs to model the network [22]. Note, attack graphs model the network formed by computer vulnerabilities and exploits, creating an abstraction that does not necessarily match the topology of the physical network, as shown in Figure 1. The use of attack graphs is a distinguishing characteristic of RL for penetration testing relative to RL for cybersecurity broadly [21].

Most frequently, RL is tasked with simply traversing a network from one initial node to one terminal node (i.e., finding paths through networks) [2]–[9]. Gangupantulu et al., in contrast, emphasize a more complex task of using RL to perform CJ analysis [10]. Here, similar to Gangupantulu et al., the presented RL method solves a more complex task and serves as a targeted tool for cyber operators to improve the efficiency of operator workflow in penetration testing. It does not automate exfiltration entirely.

RL for penetration testing has made frequent use of DQN [3], [7]–[10]. Nguyen et al., alternatively, propose an RL-based approach to penetration testing that uses two agents: one for iteratively scanning the network to build a structural model and another for exploiting the constructed model [23]. In our first attempt at performing RL on the network presented later on, we attempted to use the DQN solution method, but it did not converge, leading us to explore alternative agents. We use Nguyen et al.'s double agent method where both agents are A2C. We compare to the standard A2C algorithm.
Fig. 1: RL for penetration testing requires abstracting from real computer networks, described by information such as packet flows, into the mathematical models with which RL agents interact.

III. MDPS FROM ATTACK GRAPHS

There are many solution methods for modeling attack graphs [24]. Key trade-offs relate to scalability, observability, accuracy, and reliability. In particular, partially observable Markov decision processes (POMDPs) are well-argued to be a more realistic representation of computer networks than MDPs [25]. In POMDPs, actions are stochastic and network structure and configuration are uncertain. But POMDPs have not been shown to scale to large networks and require modeling many prior probability distributions [26]. Additionally, while RL for MDPs is well-established, RL for POMDPs is still under fundamental development [27]. Currently, MDPs are the standard in RL for penetration testing [2]–[10].

The CVSS [15], [28] is used as a scalable approach for adding behavior to attack graphs [29], [30]. It is the numerical representation of an information security vulnerability. These scores represent an attempt at providing a standardized way of evaluating the severity of threats posed by a particular vulnerability. This takes into consideration both how easy it is to exploit the vulnerability and how severe the consequences of such an exploit would be.

While some authors in RL for penetration testing use alternative methods [2]–[5], CVSS is emerging as a standard approach to modeling MDPs over attack graphs for RL [6]–[10]. Gangupantulu et al. draw from the literature to define a vanilla CVSS-MDP for point-to-point network traversal [9]. CVSS-MDPs use the attack graph to define the state-action space S × A and CVSS to define the reward R and transition probabilities P [9]. The CVSS-MDP assigns transition probabilities P using the attack complexity, where the ranks low, medium, and high are associated with transition probabilities of 0.9, 0.6, and 0.3, respectively. The reward for arriving at a host is given by

    Base Score + (Exploitability Score / 10).

The agent receives −1 reward for each step and receives 100 reward for arriving at the terminal node. Episodes terminate when the terminal state is reached or after a maximum number of steps (i.e., actions).

While CVSS scores are useful in practice and currently considered an industry standard, it is important to remember that a measure of threat severity is not the same as a measure of risk, and that these scores do not generalize to give information useful for evaluating an entire attack path through a network. From the perspective of an attacker, a greater risk means a greater chance of detection. While the CVSS scores of vulnerabilities do inform the probability of success of any particular exploit in the models here, the real driving force of RL agent behavior should be centered around concepts of terrain [31]. The details of this reward engineering of terrain are given in the following.
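As a concrete reading of the vanilla CVSS-MDP construction above, the following Python sketch assigns transition probabilities from attack complexity ranks and host-arrival rewards from CVSS base and exploitability scores. The host record and helper names are assumptions for illustration, not the paper's code.

# Illustrative sketch of the vanilla CVSS-MDP assignments described above.
# Host records and names are assumed for demonstration; they are not the
# experimental network used in this paper.

# Attack complexity rank -> transition probability (Section III).
TRANSITION_PROB = {"low": 0.9, "medium": 0.6, "high": 0.3}

STEP_REWARD = -1.0       # per-action cost
TERMINAL_REWARD = 100.0  # bonus for reaching the terminal node


def transition_probability(attack_complexity: str) -> float:
    """Probability that an exploit step against the host succeeds."""
    return TRANSITION_PROB[attack_complexity.lower()]


def arrival_reward(base_score: float, exploitability_score: float) -> float:
    """Reward for arriving at a host: Base Score + Exploitability Score / 10."""
    return base_score + exploitability_score / 10.0


def step_reward(arrived_host=None, is_terminal=False) -> float:
    """Total reward for one action in the CVSS-MDP."""
    r = STEP_REWARD
    if arrived_host is not None:
        r += arrival_reward(arrived_host["base_score"],
                            arrived_host["exploitability_score"])
    if is_terminal:
        r += TERMINAL_REWARD
    return r


if __name__ == "__main__":
    # Hypothetical host annotated with CVSS scores for one of its vulnerabilities.
    host = {"base_score": 7.5, "exploitability_score": 8.6,
            "attack_complexity": "low"}
    print(transition_probability(host["attack_complexity"]))   # 0.9
    print(step_reward(arrived_host=host))                       # -1 + 7.5 + 0.86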
IV. METHODS

The following subsections describe the presented methods for adding service-based risk penalties as defensive terrain in CVSS-MDPs and the algorithm for discovering the top-N exfiltration paths in a network, shown in Algorithm 1.

A. Defensive Terrain in CVSS-MDPs

Gangupantulu et al. argue that models of cyber terrain can be layered onto CVSS-MDPs, and do so by adding firewalls between subnets and assigning protocol-specific negative rewards and transition probabilities for traversing firewalls [9]. Gangupantulu et al. later layer on additional notions of cyber terrain by using RL as part of a methodology for modeling footholds and pivot points near the 2-hop network of CJ nodes [10].

We propose a new approach for modeling service-based defensive terrain in CVSS-MDPs. Instead of explicitly defining the defenses in the states of the MDP, we make assumptions similar to those a human attacker would make: even if the attacker cannot detect a defense directly, by their experience they can infer the presence of defenses based on the services available on a given host. Common network defenses can include host-based antivirus and malware detection software, inter-subnet router firewalls, or authentication log tracking.

We engineer rewards for defensive terrain that are additive, or, otherwise put, are layered on top of the CVSS-MDP rewards. A quantified negative reward structure is used to itemize the cost of attacker actions. The criteria of interest are (1) a risk hierarchy applied to service categories and (2) the type of action performed by the agent on a host. Implementing these criteria requires unifying them in a way the agent can enumerate. This is achieved by creating an array of actions and services and applying an individual reward to each combination. The negative reward is assigned using (1) action type, here exploiting or scanning, and (2) services.

Services are derived from four principal categories: authentication, data, security, and common. To create a negative reward, a hierarchy of costs associated with attacking these services was applied. When performing an exploiting action, this hierarchy assigns authentication a reward of -6, data a reward of -4, and both security and common a reward of -2. When performing a scanning action, the reward is increased by 1 (i.e., -5, -3, and -1, respectively). These rewards represent a combination of factors highlighting the risk these services present to organizations. Different organizations or operators may prefer a different scaling. It is important to note that the values of these negative rewards are relative, and as such they can be scaled together as a set to represent different risk preferences. When taking an action on a host with multiple services, the agent applies the highest cost to the action's reward. This decision presumes a leading-practice approach by security practitioners of applying security controls based upon the 'riskiest' service found on a host (i.e., a service known to be at greater exposure to the network edge or to cause greater business loss if exploited). By syncing our rewards to this presumption, the agent calculates a more realistic quantitative measurement of risk as it attempts to converge to an optimal attack path.

B. Discovering Exfiltration Paths with RL

In contrast to Gangupantulu et al.'s CJ analysis method [10], our exfiltration path discovery method uses multiple terminal states corresponding to the various exit nodes of interest and only a single initial node. The agent then interacts with the network in an episodic fashion to learn which is the best exit node with respect to expected reward. To provide a comprehensive path analysis for cyber operators using the tool, the top-N exit nodes are found by iteratively solving the MDP to find the best exit node, removing the best exit node, and solving the MDP again. This algorithm is described in Algorithm 1. Notably, the agent iteratively solves the problem of finding a path to a single exit in the joint set of exits. This avoids the brute force approach of creating an MDP for each exit node, solving each MDP, then ranking the paths.

Algorithm 1 Exfiltration Paths via RL (EP-RL)
Require: MDP, initial node i, exit nodes J, N
Ensure: N paths from the initial node to the top-N exit nodes
  for n in 1..N do
      path ← fRL(MDP, i, J)        ▷ Optimal path i → j, j ∈ J
      paths ← store(path)
      J ← J \ j                    ▷ Remove j from J
  end for
  return paths
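The following Python sketch ties together the service-based penalties of Section IV-A and the top-N loop of Algorithm 1. The penalty constants mirror the hierarchy above, while the solver interface (solve_mdp) and data structures are assumed placeholders standing in for a trained RL agent, not the paper's implementation.

# Illustrative sketch of Section IV-A penalties and the EP-RL loop (Algorithm 1).
# The MDP/solver interface below is assumed for demonstration purposes only.

# Exploit-action penalties per service category (Section IV-A);
# scanning is one unit cheaper (+1).
EXPLOIT_PENALTY = {"authentication": -6.0, "data": -4.0,
                   "security": -2.0, "common": -2.0}
SCAN_DISCOUNT = 1.0


def service_penalty(host_services, action_type, scale=1.0):
    """Terrain penalty for acting on a host: take the costliest (riskiest) service."""
    worst = min(EXPLOIT_PENALTY[s] for s in host_services)  # most negative value
    if action_type == "scan":
        worst += SCAN_DISCOUNT
    # scale (e.g., 0.7, 1.0, 1.3) encodes a risk preference, cf. the sensitivity analysis.
    return scale * worst


def discover_exfiltration_paths(mdp, initial_node, exit_nodes, n, solve_mdp):
    """EP-RL: iteratively find the best exit, record its path, then remove it."""
    remaining = set(exit_nodes)
    ranked_paths = []
    for _ in range(min(n, len(remaining))):
        # solve_mdp stands in for training an RL agent to convergence on the MDP
        # whose terminal states are the remaining exits; it returns the best path.
        path = solve_mdp(mdp, initial_node, remaining)
        ranked_paths.append(path)
        remaining.discard(path[-1])  # remove the exit node just used
    return ranked_paths


if __name__ == "__main__":
    print(service_penalty({"data", "common"}, "exploit"))            # -4.0
    print(service_penalty({"authentication"}, "scan", scale=1.3))    # -6.5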
V. EXPERIMENTAL DESIGN

In the following subsections the network, state-action space, and RL algorithm implementation are described.

A. Network Description

The experimental network where the simulations are run was derived from an architectural leading-practices approach to representing enterprise network configurations and deployments. The network contains:
• Defined Subnets - 9
• Defined Hosts - 26
• Types of Operating Systems - 2
• Privilege Access Levels - 3
• Network Services - 9
• Host Processes - 6
• Network Firewall Rulesets - 39
The network is visualized in Figures 3, 4, and 5.

Subnets are constructed to represent a grouping of hosts with commonly segregated services utilized for enterprise information technology administration, including server services, database services, client workstation networks, edge and DMZ services, and core services that orchestrate least-privilege or zero-trust security (i.e., domain controllers and public-key infrastructure) [32]. Hosts and network firewall rulesets are configured to deliver a representation of common ITS communication requirements between these subnets that allow daily functions of an enterprise ITS department and business operations.

The services within this network are laid out with the presumption of common security controls and monitoring software one would see within an enterprise network. These presumptions include the following expectations:
1) Authentication services are exposed to the internet through a Virtual Private Network (VPN).
2) Web services are exposed to the internet through a secured edge network zone (DMZ).
3) Services exposed to the internet are monitored.
4) Firewalls are monitored at a higher rate than other network devices.
5) Security services have the most inherited security controls.
6) Authentication services and firewall services, if exploited, have the greatest secondary and tertiary impacts to a network's overall security profile.
7) Network security rules only apply allowlists.
8) Host and network assets apply principles of least privilege when authorizing privileges for account access and use.

B. Environment Description

In the environment, each host is represented by a 1D vector that contains its status (compromised, reachable, discovered or not) and its configuration (address, services, operating system, and processes). The environment combines the vectors for all hosts in the network into a single state tensor. Thus, each state contains descriptions of all hosts. The actions are defined as an operation performed on a specific target host. The actions consist of 6 primitive actions for scanning, exploiting, or privilege escalation. The action type and target host configuration must align or the action will fail. For the environment, the initial host for exfiltration is set to (6, 0), while the terminal hosts are set to (1, 0), (2, 0), and (4, 0), which are all connected to the public internet, where (a, b) denotes host b in subnet a. The initial node is set as compromised, reachable, and discovered at the beginning to make it possible for the agent to perform further actions. The exfiltration goal is reached if the agent compromises any one of the terminal hosts and obtains root access. If the goal is reached, the agent is given a high reward (set as 100 for our experiment).

C. RL Implementation

The experiment is conducted with two models: the A2C model and the double agent architecture [23]. Both agents in the double agent use the A2C algorithm. For both, the learning rate is set to 0.001 and the discount factor is set to 0.99. We use Adam as the optimizer of our networks. Both models are trained for 4,000 episodes with a maximum of 3,000 steps in each episode. If the maximum number of steps is reached, the episode terminates and the agent receives 0 terminal reward. Both the A2C model and the structuring agent of the double agent use deep neural networks (DNNs) with three fully connected layers of size 64, 32, and 1, and the exploiting agent of the double agent uses a DNN with two fully connected layers of size 10 and 1. All DNNs use tanh activation functions for non-output layers and softmax for the output layer.

D. Sensitivity Analysis

The experiments run the A2C and double agent algorithms to convergence. To study the effect of the scale of the service-based penalties on the convergence of the agents and on the paths they discover, the exploiting and scanning service-based penalties are scaled by a factor of 1.3, 1.0, and 0.7. These values correspond to risk preferences that we term risk-averse, risk-neutral, and risk-accepting, respectively.

VI. RESULTS

To observe the convergence of our models, we plot the steps and the reward versus episodes; the result is shown in Fig. 2. It can be observed that both of our models converge within 1,000 episodes. It can also be noticed that the double agent converges more slowly than the A2C agent, which is expected considering that the double agent is more complex and contains two A2C models that learn simultaneously.

A. RL Performance

When reviewing the A2C and double agent as they reach convergence, A2C uses a similar number of episodes to reach an optimal path regardless of the scaling factor. In the double agent model, the risk-accepting agent reaches an optimal path much more quickly than the risk-neutral and risk-averse agents. Additionally, the double agent model converges quickly at first and then plateaus. This suggests the double agent model can quickly arrive at near-optimal policies.

B. Agent's Behavior Compared to Human Expectations

The path results from the nine experiments are shown in Table I. The simulations represent the top three paths for all three risk preferences. In addition to the paths, the table shows the number of steps, the reward, and the cumulative probability score. These cumulative probability scores are not directly the CVSS scores of the exploits, but are proprietary scores designed to play a similar role. It is important to note that the ordering of these scores is not the same as the ordering of the rewards. This is in agreement with the expectation that the measure of risk (tracked by the reward) does not have to track linearly with a vulnerability score.

Unbeknownst to the rest of the authors, the cyber security operations expert who crafted the simulated networks for the generated attack paths and network topology included two intentional misconfigurations within the host-service assignments. These misconfigurations simulated real-world experiences resolving enterprise network incidents where exfiltration of data occurred. The primary goal of these misconfigurations is to represent flaws in network design that were exploited by actual attackers for exfiltration in actual enterprise networks. If the agents can deduce (without explicit design) this misconfiguration, it would be a compelling example of how this reward engineering can produce human-like behavior.

The significant misconfiguration within our results was on host (3, 2). The PKI service was extended into subnet (3), the server subnet. In published leading practices for securing PKI [32], this service is included in the most privileged tier of an environment and requires a specific privileged account authorization, 'Cryptographic Operator' [32]. As such, it should only be accessible utilizing hardware and software with enhanced security controls. These leading practices also require this service to reside only in a secure subnet alongside other servers and appliances with similarly privileged security requirements. When this service is allowed to operate as a node within the general server subnet (i.e., regular business applications), it exposes the network firewall rules to exploitation when exfiltrating data from the private key repository database.

In the resulting optimal path diagrams identified by the agent, host (3, 2) was the most traversed node. Reviewing this result shows that the misconfiguration was successfully identified and exploited by the agent when defining an optimal path.
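For concreteness, the per-host state encoding described in Section V-B can be sketched as below. The field layout and sizes are assumptions chosen for illustration; the experimental environment defines its own encoding.

# Illustrative sketch of the per-host state encoding of Section V-B.
# Field layouts and sizes are assumed for demonstration only.
import numpy as np

N_SERVICES = 9   # network services (Section V-A)
N_PROCESSES = 6  # host processes (Section V-A)
N_OS = 2         # operating system types (Section V-A)


def encode_host(status, address, services, operating_system, processes):
    """Concatenate status flags and configuration one-hots into one 1D vector."""
    status_bits = np.array([status["compromised"],
                            status["reachable"],
                            status["discovered"]], dtype=np.float32)
    addr = np.array(address, dtype=np.float32)             # (subnet, host) pair
    svc = np.zeros(N_SERVICES, dtype=np.float32)
    svc[list(services)] = 1.0                               # multi-hot services
    os_vec = np.zeros(N_OS, dtype=np.float32)
    os_vec[operating_system] = 1.0                          # one-hot OS
    proc = np.zeros(N_PROCESSES, dtype=np.float32)
    proc[list(processes)] = 1.0                             # multi-hot processes
    return np.concatenate([status_bits, addr, svc, os_vec, proc])


def encode_state(hosts):
    """Stack all host vectors into the full state tensor seen by the agent."""
    return np.stack([encode_host(**h) for h in hosts])


if __name__ == "__main__":
    initial_host = {  # host (6, 0): compromised, reachable, and discovered at start
        "status": {"compromised": 1, "reachable": 1, "discovered": 1},
        "address": (6, 0), "services": [0, 3],
        "operating_system": 1, "processes": [2],
    }
    print(encode_state([initial_host]).shape)  # (1, 22)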
Fig. 2: Agent learning over episodes. The left plots show the average reward over episodes and the right plots show the average
number of steps taken over episodes. The top plots are the results from running RL using an A2C agent and the bottom plots
show the results when leveraging the double agent methodology [23]. The color of the lines reflects how heavily incentivized
an agent is to avoid detection: blue for risk-accepting, green for risk-neutral, and red for risk-averse.
Path Rank          Scale Factor   Path                                  Steps   Reward   Cumulative Probability Score
Best Path          0.7            (6, 0) → (3, 0) → (2, 0)              12      57.8     2.9 + 2.9 = 5.8
                   1.0            (6, 0) → (3, 2) → (1, 0)              11      62       2.9 + 2.9 = 5.8
                   1.3            (6, 0) → (3, 0) → (1, 0)              5       68.3     1.9 + 2.9 = 4.8
Second Best Path   0.7            (6, 0) → (3, 2) → (1, 0)              19      46.9     2.9 + 4.9 = 7.8
                   1.0            (6, 0) → (3, 0) → (2, 0)              16      24       2.9 + 4.8 = 7.7
                   1.3            (6, 0) → (3, 2) → (1, 0)              19      33.1     1.9 + 2.9 = 4.8
Third Best Path    0.7            (6, 0) → (3, 0) → (1, 1) → (4, 0)     15      41.3     1.9 + 1.9 + 2.4 = 6.2
                   1.0            (6, 0) → (3, 2) → (1, 0) → (4, 0)     24      17       3.9 + 1.9 + 7.5 = 13.3
                   1.3            (6, 0) → (3, 2) → (1, 0) → (4, 0)     22      -6.1     1.9 + 2.9 + 6.3 = 11.1

TABLE I: The top-3 exfiltration paths found by the double agent. Scale factor denotes the risk-accepting (0.7), risk-neutral (1.0), and risk-averse (1.3) scaling of the penalty for services. Path gives the shortest path from the initial node to the exit node, and is derived from the set of actions taken by the converged agent in an episode. Steps and reward, in contrast, refer to the optimal performance of the agent in an episode (i.e., not just the actions taken to form the path). Cumulative probability score reports a custom, CVSS-like vulnerability scoring of the path.
• ... than stealthier actors.
• The optimal path will often be the same for various risk profiles, matching the A2C modeling convergence trends.
• Utilizing misconfigurations of security services within a network has a high likelihood of success for attackers.
• Data exploitation is more likely through servers and services than through client workstations.
• When operating in more secure networks, the agent consistently creates simple exfiltration paths but requires additional unsuccessful scanning actions to achieve this same convergence.

B. Remarks on Payload

Within the current scope of this work, there is no consideration for the size of the payload extracted or the rate at which it is removed. If the payload is a small amount of (critical) data, this simulation can be considered an approximation of reality. If the amount of data exfiltrated becomes large enough that this approximation fails, then additional modeling considerations are needed, such as encoding rates of transfer and the amount of data into the states of the environment. Payload size, while being a calculable statistic for security operations, is often measured for security purposes in a binary manner. If the payload size from one server to another server, or for one firewall at a given time of day, varies sufficiently from the expected thresholds, an alert will trigger for security or network operations. While malicious actions can create this, non-malicious actions can create this as well. Common ITS operations such as database backups, system update downloads, or unexpected network configuration changes can each create a pattern that alerts security or network teams to heavy payloads on a network. Without a way to compensate for these additional variables, the value of adding payload sizes in this work was negligible.

VIII. CONCLUSION

In this paper, we have provided security practitioners and network defenders a quantitative methodology using RL to identify optimal paths for data exfiltration. In our experiments, the presented RL approach identified the most likely hosts and services used when exfiltrating data and captured metrics used in network risk assessments. The strength of this approach was validated through identification of intentional network misconfigurations that mimic real-world vulnerabilities.

Future work should consider integration with other RL for penetration testing tasks. In addition, expanding the risk formalism to increase its sophistication and maturity will drive increased applicability and relevance. Payload size and extraction rates should also be examined in future studies.

IX. ACKNOWLEDGEMENTS

This work was made possible through the collaboration between Dr. Michael Ambroso, Lead for the AI/ML for Cyber Testing Innovation Pipeline group—funded by Deloitte's Cyber Strategic Growth Offering led by Deborah Golden,
and Dr. Laura Freeman, Director of the Hume Center for National Security and Technology's Intelligent Systems Lab at the Virginia Polytechnic Institute and State University.

Fig. 5: Network diagram showing the Third Best Path in Table I. The color of the edge reflects risk preference and the color of the nodes encodes the subnet.

REFERENCES

[1] N. I. of Standards and Technology, "Security and privacy controls for federal information systems and organizations," U.S. Department of Commerce, Washington, D.C., Tech. Rep. NIST Special Publication 800-53 Revision 5, 2020.
[2] M. C. Ghanem and T. M. Chen, "Reinforcement learning for intelligent penetration testing," in 2018 Second World Conference on Smart Trends in Systems, Security and Sustainability (WorldS4). IEEE, 2018, pp. 185–192.
[3] J. Schwartz and H. Kurniawati, "Autonomous penetration testing using reinforcement learning," arXiv preprint arXiv:1905.05965, 2019.
[4] M. C. Ghanem and T. M. Chen, "Reinforcement learning for efficient network penetration testing," Information, vol. 11, no. 1, p. 6, 2020.
[5] S. Chaudhary, A. O'Brien, and S. Xu, "Automated post-breach penetration testing through reinforcement learning," in 2020 IEEE Conference on Communications and Network Security (CNS). IEEE, 2020, pp. 1–2.
[6] M. Yousefi, N. Mtetwa, Y. Zhang, and H. Tianfield, "A reinforcement learning approach for attack graph analysis," in 2018 17th IEEE International Conference On Trust, Security And Privacy In Computing And Communications/12th IEEE International Conference On Big Data Science And Engineering (TrustCom/BigDataSE). IEEE, 2018, pp. 212–217.
[7] A. Chowdhary, D. Huang, J. S. Mahendran, D. Romo, Y. Deng, and A. Sabur, "Autonomous security analysis and penetration testing," in 2020 16th International Conference on Mobility, Sensing and Networking (MSN). IEEE, 2020, pp. 508–515.
[8] Z. Hu, R. Beuran, and Y. Tan, "Automated penetration testing using deep reinforcement learning," in 2020 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW). IEEE, 2020, pp. 2–10.
[9] R. Gangupantulu, T. Cody, P. Park, A. Rahman, L. Eisenbeiser, D. Radke, and R. Clark, "Using cyber terrain in reinforcement learning for penetration testing," arXiv preprint arXiv:2108.07124, 2021.
[10] R. Gangupantulu, T. Cody, A. Rahman, C. Redino, R. Clark, and P. Park, "Crown jewels analysis using reinforcement learning with attack graphs," arXiv preprint arXiv:2108.09358, 2021.
[11] S. Venkatesan, M. Albanese, A. Shah, R. Ganesan, and S. Jajodia, "Detecting stealthy botnets in a resource-constrained environment using reinforcement learning," in Proceedings of the 2017 Workshop on Moving Target Defense, 2017, pp. 75–85.
[12] M. Albanese, S. Jajodia, and S. Venkatesan, "Defending from stealthy botnets using moving target defenses," IEEE Security & Privacy, vol. 16, no. 1, pp. 92–97, 2018.
[13] R. Maeda and M. Mimura, "Automating post-exploitation with deep reinforcement learning," Computers & Security, vol. 100, p. 102108, 2021.
[14] X. Ou, S. Govindavajhala, A. W. Appel et al., "MulVAL: A logic-based network security analyzer," in USENIX Security Symposium, vol. 8. Baltimore, MD, 2005, pp. 113–128.
[15] P. Mell, K. Scarfone, S. Romanosky et al., "A complete guide to the common vulnerability scoring system version 2.0," in Published by FIRST—Forum of Incident Response and Security Teams, vol. 1, 2007, p. 23.
[16] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[17] W. B. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality. John Wiley & Sons, 2007, vol. 703.
[18] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," arXiv preprint arXiv:1312.5602, 2013.
[19] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[20] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, "Asynchronous methods for deep reinforcement learning," in International Conference on Machine Learning. PMLR, 2016, pp. 1928–1937.
[21] T. T. Nguyen and V. J. Reddi, "Deep reinforcement learning for cyber security," arXiv preprint arXiv:1906.05799, 2019.
[22] J. P. McDermott, "Attack net penetration testing," in Proceedings of the 2000 Workshop on New Security Paradigms, 2001, pp. 15–21.
[23] H. V. Nguyen, S. Teerakanok, A. Inomata, and T. Uehara, "The proposal of double agent architecture using actor-critic algorithm for penetration testing," in ICISSP, 2021, pp. 440–449.
[24] T. Gonda, T. Pascal, R. Puzis, G. Shani, and B. Shapira, "Analysis of attack graph representations for ranking vulnerability fixes," in GCAI, 2018, pp. 215–228.
[25] E. Miehling, M. Rasouli, and D. Teneketzis, "A POMDP approach to the dynamic defense of large-scale cyber networks," IEEE Transactions on Information Forensics and Security, vol. 13, no. 10, pp. 2490–2505, 2018.
[26] D. Shmaryahu, G. Shani, J. Hoffmann, and M. Steinmetz, "Constructing plan trees for simulated penetration testing," in The 26th International Conference on Automated Planning and Scheduling, vol. 121, 2016.
[27] P. Zhu, X. Li, P. Poupart, and G. Miao, "On improving deep reinforcement learning for POMDPs," arXiv preprint arXiv:1704.07978, 2017.
[28] H. Joh and Y. K. Malaiya, "Defining and assessing quantitative security risk measures using vulnerability lifecycle and CVSS metrics," in Proceedings of the 2011 International Conference on Security and Management (SAM'11), vol. 1, 2011, pp. 10–16.
[29] L. Gallon and J. J. Bascou, "Using CVSS in attack graphs," in 2011 Sixth International Conference on Availability, Reliability and Security. IEEE, 2011, pp. 59–66.
[30] M. Keramati, A. Akbari, and M. Keramati, "CVSS-based security metrics for quantitative analysis of attack graphs," in ICCKE 2013. IEEE, 2013, pp. 178–183.
[31] G. Conti and D. Raymond, On Cyber: Towards an Operational Art for Cyber Conflict. Kopidion Press, 2018.
[32] M. Corporation. (2021) Implementing least-privilege administrative models. [Online]. Available: https://docs.microsoft.com/en-us/windows-server/identity/ad-ds/plan/security-best-practices/implementing-least-privilege-administrative-models