Discovering Exfiltration Paths Using Reinforcement Learning
Abstract—Reinforcement learning (RL) operating on attack graphs that leverage cyber terrain principles is used to develop the reward and state formulations associated with determining surveillance detection routes (SDR). This work extends previous efforts on developing RL methods for path analysis within enterprise networks. It focuses on building SDR in which the routes explore network services while trying to evade risk. RL is utilized to support the development of these routes through a reward mechanism that helps realize such paths. The RL algorithm is modified with a novel warm-up phase which decides, during initial exploration, which areas of the network are safe to explore based on the rewards and the penalty scale factor.

Index Terms—attack graphs, reinforcement learning, surveillance detection routes, SDR, cyber terrain

I. INTRODUCTION

Reconnaissance (also called recon) in MITRE's Adversarial Tactics, Techniques, and Common Knowledge (ATT&CK®) framework is described as "techniques that involve adversaries actively or passively gathering information that can be used to support targeting [1]." As reconnaissance activities usually precede an exploitation campaign, detection of these efforts benefits cyber defenders by identifying potential targets (e.g., crown jewels) of interest. In this respect, adversarial recon activities strive to maximize visibility of targets while minimizing opportunities of being detected. Critical to this is the identification of paths, termed SDR, traversed by adversaries to gather critical data about targets (e.g., ports, protocols, applications, services).

From a cyber protection standpoint, recon activities disguise serious hostile intent but may be quite difficult to quantitatively differentiate from normal behavior. Malicious intent is difficult to observe, as it may be efficiently designed to hide in the background of normal traffic. Detecting this type of recon is situational in nature and requires meticulous analyses of huge volumes of collected data. Domains apart from cyber are challenged in a similar manner, where evaluation of such SDR requires data analysis to differentiate abnormal traffic from events that are suspicious in proximity to roads and crossings [2].

Modern efforts to detect and respond to adversarial network reconnaissance are a complicated blend of automated and human processes. Automated collection systems are installed on network devices and endpoints to passively and actively monitor network communications, analyze the flow, and aggregate the data for Security Information and Event Management (SIEM) and/or Security Orchestration, Automation and Response (SOAR) systems to analyze. These network tools assist the human component of detection by providing automated security reports and incident alerts, and by executing network protection protocols with a single click from the security operations center (SOC) analyst. The effectiveness of these systems relies on the data collected, the knowledge of current threat behavior, and the human analyst's ability to understand the threat. Naturally, such approaches have blindspots.

Combining current security information (network topology and configuration) with machine learning (ML) analysis allows highlighting weak points missed by automated systems that an attacker may focus on during initial recon. Network traffic behavior analysis, no matter how advanced, relies on active network traffic and does not preempt network, host, or protocol misconfigurations. This paper contributes a deep RL approach to generating SDR in the form of attack graphs from network models consisting of network topology and configuration, thereby extending the suite of automated tools and systems available for cyber defense.

In particular, a Markov decision process (MDP) formulation and a new algorithm, SDR-RL, are proposed that use a warm-up phase and penalty scaling to control the asymmetry between the number of host services scanned and the amount of defensive terrain encountered. This emulates the asymmetry sought by human operators when conducting network reconnaissance generally and SDR in particular. Additionally, this paper extends the double agent architecture (DAA) of Nguyen et al. [3], which originally used standard deep Q-learning (DQN), with actor-critic (A2C) and proximal policy optimization (PPO) algorithms [4].

This paper is structured as follows. First, background on RL and penetration testing is given. Then, the methods used to expose SDR using RL, attack graphs, and cyber terrain are presented. Next, the experimental design for testing the presented methodology is given, including details regarding RL implementation and training. Results are presented, and, before concluding, an in-depth discussion is given in terms of cyber-specific outcomes.
II. RL AND PENETRATION TESTING

A. Reinforcement Learning Preliminaries

Reinforcement learning (RL) problems involve an agent interacting with an environment, transiting from one state to another until it reaches the goal state [5]. The agent receives rewards for taking actions, with the overall goal of achieving maximum cumulative reward. Environments are normally modeled as MDPs, which can be defined by a tuple {S, A, R, P, γ}, where S denotes the set of possible states, A denotes the set of possible actions, R represents the distribution of reward given any state-action pair, P represents the transition probability, and γ is the discount factor. The goal of the agent is to learn an optimal policy π that maps states to actions. That is, the RL algorithm learns an optimal policy that selects actions so as to maximize the expected cumulative discounted reward:

$\pi^* = \arg\max_{\pi} \mathbb{E}\left[\sum_{t>0} \gamma^t r_t \mid \pi\right]$,  (1)

with $s_0 \sim p(s_0)$, $a_t \sim \pi(\cdot \mid s_t)$, $s_{t+1} \sim p(\cdot \mid s_t, a_t)$.

In (deep) Q-learning, the optimal policy can be defined by two terms, the value function and the Q-value function. The value function shows how good a state is and is defined as the expected cumulative reward from following the policy from that state:

$V^{\pi}(s) = \mathbb{E}\left[\sum_{t>0} \gamma^t r_t \mid \pi\right]$.  (2)

The Q-value function, on the other hand, is defined as the expected cumulative reward given both parts of the state-action pair:

$Q^{\pi}(s, a) = \mathbb{E}\left[\sum_{t>0} \gamma^t r_t \mid s_0 = s, a_0 = a, \pi\right]$.  (3)

It satisfies the Bellman equation

$Q^*(s, a) = \mathbb{E}_{s'}\left[r + \gamma \max_{a'} Q^*(s', a') \mid s, a\right]$.  (4)

Thus, the optimal policy π* corresponds to taking the best action in any state as specified by Q*.

For more complex problems, where the state or action space is large enough that computing every state-action pair is infeasible, neural networks become a powerful function estimator in deep Q-learning (DQL) [6], [7]. The optimal Q-values are estimated by a neural network parameterized by θ:

$Q(s, a; \theta) \approx Q^*(s, a)$,  (5)

where $y_i = \mathbb{E}_{s'}\left[r + \gamma \max_{a'} Q^*(s', a'; \theta_{i-1}) \mid s, a\right]$ is the target. During the backward pass, the gradients can be calculated as

$\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{s, a \sim p(\cdot);\, s'}\left[\left(r + \gamma \max_{a'} Q^*(s', a'; \theta_{i-1}) - Q(s, a; \theta_i)\right) \nabla_{\theta_i} Q(s, a; \theta_i)\right]$.  (6)
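As a concrete illustration of the update in Eq. (6), the following is a minimal PyTorch-style Python sketch of the one-step temporal-difference loss whose gradient matches Eq. (6); the network interface, batch layout, and hyperparameters are illustrative assumptions rather than the implementation used in this work.

    import torch
    import torch.nn.functional as F

    def dqn_td_loss(q_net, target_net, batch, gamma=0.99):
        """One-step TD loss for deep Q-learning; its gradient mirrors Eq. (6)."""
        states, actions = batch["states"], batch["actions"]
        rewards, next_states, dones = batch["rewards"], batch["next_states"], batch["dones"]

        # Q(s, a; theta_i) for the actions actually taken.
        q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

        # TD target y_i = r + gamma * max_a' Q(s', a'; theta_{i-1}), held fixed.
        with torch.no_grad():
            q_next = target_net(next_states).max(dim=1).values
            target = rewards + gamma * (1.0 - dones) * q_next

        # Minimizing this squared error produces the gradient of Eq. (6).
        return F.mse_loss(q_sa, target)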
B. The Penetration Testing Environment

Though RL has recently been pursued as a tool for penetration testing, the approaches to modeling the network vary significantly. Alternative methods for modeling penetration testing include hypothesis generation, ontology-based models, attack trees, and attack graphs. The hypothesis generation model [8] is an organizational network representation for cyber defense, while ontology-based models [9] focus more on the semantics of penetration testing. However, neither of them contains any structural information about the network itself. While both attack trees [10] and attack graphs provide structural representations of the network, attack graphs are more general, and attack trees are special cases of attack graphs. As a result, attack graphs are the most recognized method for modeling the penetration testing environment [11].

C. Related Work

The use of deep RL for attack graphs has seen recent development. Other than Ghanem and Chen [12], the authors in the RL-for-attack-graphs literature use fully observable MDPs to model networks. Many authors use the Common Vulnerability Scoring System (CVSS) to furnish their MDPs (CVSS-MDPs). Yousefi et al. provide the earliest work doing so in deep RL for penetration testing [13]. Hu et al. extend the use of the CVSS by proposing to use exploitability scores to weight rewards [14]. Gangupantulu et al. [15], [16] and Cody et al. [17] explicitly extend the methods of Hu et al. with concepts of terrain. Gangupantulu et al. advocate defining models of terrain in terms of the rewards and transition probabilities of MDPs, first in the case of firewalls as obstacles [15], then in the case of lateral pivots near key terrain [16]. Cody et al. apply these concepts to exfiltration [17]. Other authors either handcraft the MDP or do not remark on how its components are estimated.

Many authors apply generic deep Q-learning (DQN) [6], [7] to solve point-to-point network traversal tasks [13]–[15], [18], [19]. Typically the terminal state is unknown and solutions take the form of individual paths. Others develop domain-specific modifications for deep RL, including the double agent architecture [3], a hierarchical action decomposition approach [20], and various improvements to DQN termed NDSPI-DQN [21]. Another line of research focuses on developing more specific penetration testing tasks. A number of authors define such tasks by reward engineering and other modifications to the MDP, including formulations of capture the flag [22], crown jewel analysis [16], and discovering exfiltration paths [17]. This paper extends this line of research with a methodology for exposing SDR.
III. METHODS

The following subsections describe the presented methods for adding service-based risk penalties as defensive terrain in CVSS-MDPs and the algorithm for discovering surveillance paths in a network.

A. Defensive Terrain in CVSS-MDPs

Gangupantulu et al. [15] proposed that cyber terrain can be modeled in CVSS-MDPs by adding transition probabilities for traversing firewalls and negative rewards for different protocols. Cody et al. [17] later modeled service-based defensive terrain in CVSS-MDPs based on the assumption that attackers can infer the presence of defensive terrain from the services running on a host. We adopt their methods and classify the services into four categories: authentication, data, security, and common. Our reward hierarchy assigns -6 for authentication, -4 for data, and -2 for security and common services. Moreover, the type of action affects the reward: n+1 (-1, -3, and -5) is assigned for a scanning action, while n (-2, -4, and -6) is assigned for an exploiting action.
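For concreteness, this reward hierarchy can be expressed as a small lookup, sketched below in Python. The category labels and function interface are illustrative assumptions; the numbers follow the scheme just described, and treating the penalty scale factor as a multiplier on the penalty is an assumption that anticipates the scaling used later in the sensitivity analysis.

    # Illustrative encoding of the service-based defensive terrain penalties.
    EXPLOIT_PENALTY = {
        "authentication": -6,
        "data": -4,
        "security": -2,
        "common": -2,
    }

    def terrain_penalty(service_category, action, scale=1.0):
        """Defensive terrain reward for acting on a service of a given category.

        Scanning is penalized one point less than exploiting (n+1 vs. n),
        and the result is multiplied by the penalty scale factor.
        """
        base = EXPLOIT_PENALTY[service_category]
        if action == "scan":
            base += 1            # e.g., -6 -> -5, -4 -> -3, -2 -> -1
        return scale * base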
B. Discovering Surveillance Paths with RL

In our surveillance detection routes (SDR) algorithm (Algorithm 1), we have a target node of interest for which we want to explore all the services running on it. The goal of SDR is to gain the service information of the target node while also maximizing service information discovery in other areas of the network, being cautious and not triggering any defensive terrain. We give a discovery reward of 100 for the target node when all of its services have been discovered. To encourage service discovery, the agent receives a +1 reward for each service discovered on a node. The algorithm is divided into two phases: a warm-up phase and a training phase.
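The discovery reward just described can be summarized by a short check, sketched in Python below; the bookkeeping of discovered services as sets is an assumption made for illustration, not the environment's actual data structures.

    def discovery_reward(node, newly_discovered, discovered, all_services, target_node):
        """+1 for each newly discovered service on a node, plus the 100-point
        discovery reward once every service on the target node is known."""
        reward = len(newly_discovered)                 # +1 per new service
        discovered[node] |= newly_discovered
        if node == target_node and discovered[node] == all_services[node]:
            reward += 100                              # target fully explored
        return reward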
C. Warmup Phase

In the warmup phase, we update the goal of the RL agent so that it gains information not only about the target node but also about other nodes from which we receive positive reward, indicating that the defense terrain allows us to access that area. The following steps are taken in the warmup phase:
• We define a certain number of episodes for the warmup phase in the training configuration.
• In the warmup episodes the RL agent does not learn (no weight updates) but only monitors positive reward during an episode.
• If the RL agent receives a positive reward from a scanning action, then that node is added to the goal along with the target node, and its reward for compromise is set to 100.
• The algorithm allows only one node (the one that gives the maximum positive reward) from each subnet to be part of the goal, as gaining control of one node in a subnet is enough to gain service information for the entire subnet.

At the end of the warmup phase, we have these dynamic nodes along with the target node, all of whose service information must be gained to reach the goal. The number of dynamic nodes that are part of the goal decreases as the scale value of the defense terrain increases. Therefore, we limit the agent's exploration capability as we increase the scale value of the defense terrain. The overall procedure is summarized in Algorithm 1.

Algorithm 1: SDR via RL (SDR-RL)
  input : MDP M, initial node i, set of target nodes J, number of warmup episodes Nw, RL algorithm fRL : M × i × J → path
  output: SDR that includes initial and target nodes
  for n in range(Nw) do
      for each subnet MS in M do
          for each node j in MS do
              if reward_j > 0 then
                  dynamic_nodes ← dynamic_nodes ∪ {j}
              end
          end
          if dynamic_nodes ≠ ∅ then
              J ← max(dynamic_nodes) ∪ J
          end
      end
  end
  path ← fRL(M, i, J)
  return path
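A minimal Python sketch of this warm-up logic is given below. The environment interface (episodes yielding per-node scan rewards grouped by subnet, and a hook to set a node's compromise reward) is an illustrative assumption rather than the interface of the simulator used in this work.

    def warmup(env, target_nodes, num_warmup_episodes):
        """Warm-up phase of SDR-RL: keep, for each subnet, the node whose scan
        produced the largest positive reward and add it to the goal set.
        No policy updates are performed during these episodes."""
        goal = set(target_nodes)
        best = {}                                   # subnet -> (reward, node)
        for _ in range(num_warmup_episodes):
            for subnet, node, reward in env.random_scan_episode():
                if reward > 0 and reward > best.get(subnet, (0.0, None))[0]:
                    best[subnet] = (reward, node)   # one dynamic node per subnet
        dynamic_nodes = {node for _, node in best.values()}
        for node in dynamic_nodes:
            env.set_compromise_reward(node, 100)    # treat it like a goal node
        return goal | dynamic_nodes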
D. Training Phase

In the training phase, the agent interacts with the network in an episodic fashion to learn the best possible path for gaining the service information of all the target nodes identified during the warmup phase.

E. Scalability

Conventionally, DQN is the most basic algorithm used for modeling RL on penetration testing. Nguyen et al. proposed a method that utilizes two A2C agents: one, called the structuring agent, learns the structural information about subnets, hosts, and firewalls as well as the connections between them; the other, called the exploiting agent, selects the actions to take and their targets. Their method solves the scalability problem of DQN to some extent and has a better capacity for large networks. We improve upon this method by applying proximal policy optimization (PPO) instead of A2C for both agents. PPO is an advanced RL algorithm known for its convergence speed, stability, and sample efficiency. It optimizes the following clipped surrogate objective function to prevent the performance collapse caused by a large policy update:

$L(\theta) = \mathbb{E}\left[\min\left(\rho_t(\theta) A_t,\ \mathrm{clip}\left(\rho_t(\theta), 1-\epsilon, 1+\epsilon\right) A_t\right)\right]$,  (7)

where $\rho_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$ is the probability ratio of the new policy over the old policy. The advantage function $A_t$ is often estimated using generalized advantage estimation [23], truncated after T steps:

$A_t \approx \delta_t + (\gamma\lambda)\delta_{t+1} + \cdots + (\gamma\lambda)^{T-t+1}\delta_{T-1}$,  (8)

where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$.  (9)
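The clipped objective in Eq. (7) and the truncated advantage estimate in Eqs. (8) and (9) can be written compactly as follows. This is a generic PyTorch-style sketch under assumed tensor shapes, not the exact DA-PPO training code.

    import torch

    def gae(rewards, values, gamma=0.99, lam=0.95):
        """Truncated generalized advantage estimation, Eqs. (8)-(9).
        `values` carries one extra entry for the bootstrap value V(s_T)."""
        T = len(rewards)
        adv = torch.zeros(T)
        running = 0.0
        for t in reversed(range(T)):
            delta = rewards[t] + gamma * values[t + 1] - values[t]   # Eq. (9)
            running = delta + gamma * lam * running                  # Eq. (8)
            adv[t] = running
        return adv

    def ppo_clip_loss(new_logp, old_logp, adv, eps=0.2):
        """Clipped surrogate objective of Eq. (7), negated for gradient descent."""
        ratio = torch.exp(new_logp - old_logp)                       # rho_t(theta)
        unclipped = ratio * adv
        clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
        return -torch.min(unclipped, clipped).mean()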
IV. EXPERIMENTAL DESIGN

In the following subsections the network, state-action space, and RL algorithm implementation are described.

A. Network Description

The same network framework as in Cody et al. [17] is used for our experiment, but with different configurations of the defense mechanisms. To simulate real-world network conditions, there are layers of defenses between the Internet and the innermost private network. Systems that require the Internet to provide service (HTTP, email) are the most vulnerable to attack and are typically in the Demilitarized Zone (DMZ), with limited access to private network resources. The private, internal networks are separated from the DMZ by firewalls that apply rules allowing connections only to specific internal servers and services. VPN services to the internal network are protected with VPN management firewalls that apply network rules allowing only authorized and authenticated user traffic to traverse internally over an encrypted connection. Internal network subnets are separated based on access rules and allow traffic to egress to the Internet if it is authorized, as well as applying rules for network traversal between subnets. Finally, access to the innermost subnet is controlled by a firewall that allows only authorized traffic in or out, and only to specific hosts.

These security controls were known and applied by the designers but were intentionally left unknown to the model to provide an accurate simulation of an attacker exploring an unknown network environment.
B. Environment Description

The hosts are each represented as a 1D vector that encodes their status (compromised, reachable, discovered or not) and configuration (address, services, OS, and processes). Our environment then combines the vectors of all hosts in the network into a single state tensor. Our action space contains 6 primitive actions for scanning, exploiting, and privilege escalation.

For these experiments, the initial host is set at (1, 0) while the terminal host is set as (3, 1), (8, 2), or (9, 2). Here, the host (x, y) refers to the host indexed by y in subnet x. The initial host is compromised, reachable, and discovered by default so that the agent can perform further exploration; the simulation thus assumes the attacker has already gained this foothold in the network. The goal of SDR is to explore all the services of the target host, and a high positive reward (100) is assigned if the goal is reached.
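As an illustration of the host encoding described above, a vector for a single host might be assembled as follows. The specific fields, their ordering, and the one-hot/multi-hot choices are assumptions for illustration; the text only lists the status and configuration attributes that are encoded.

    import numpy as np

    def encode_host(host, service_list, os_list, process_list):
        """Illustrative 1D host encoding: status flags followed by address and
        multi-hot configuration fields (services, OS, processes)."""
        status = [float(host["compromised"]), float(host["reachable"]), float(host["discovered"])]
        address = [float(host["subnet"]), float(host["host_id"])]
        services = [1.0 if s in host["services"] else 0.0 for s in service_list]
        os_flags = [1.0 if o == host["os"] else 0.0 for o in os_list]
        procs = [1.0 if p in host["processes"] else 0.0 for p in process_list]
        return np.array(status + address + services + os_flags + procs, dtype=np.float32)

    # The environment state is then the stack of all host vectors, e.g.
    # state = np.stack([encode_host(h, SERVICES, OSES, PROCESSES) for h in hosts])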
C. RL Implementation

The experiment is conducted on the single-agent A2C model and two variants of the double agent (DA) architecture [3]. In the original double agent model, DA-A2C, both the structuring and the exploiting agents are trained using the A2C algorithm. To achieve better sample efficiency and training stability, we combine the PPO algorithm [4] with the DA architecture and build a DA-PPO model, where both RL agents are trained using PPO. We use Adam as the optimizer of our networks. The A2C model is trained for 4000 episodes and the two double agent models are trained for 8000 episodes, with a maximum of 3000 steps in each episode. The episode terminates either when the goal is reached or when the step limit is reached. Both the A2C model and the structuring agent of the DA models use deep neural networks (DNNs) with three fully connected layers of size 64, 32, and 1, and the exploiting agent of the DA models uses a DNN with two fully connected layers of size 10 and 1. All DNNs use tanh activation functions for non-output layers and softmax for the output layer.
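A PyTorch-style sketch of such a network builder is given below. The input and output dimensions depend on the state encoding and action space and are therefore left as parameters; treat this as a reading of the reported layer sizes and activations rather than the authors' code.

    import torch.nn as nn

    def fc_net(input_dim, hidden_sizes, output_dim):
        """Fully connected network with tanh hidden activations and a softmax
        output layer, following the description in the text."""
        layers, prev = [], input_dim
        for width in hidden_sizes:
            layers += [nn.Linear(prev, width), nn.Tanh()]
            prev = width
        layers += [nn.Linear(prev, output_dim), nn.Softmax(dim=-1)]
        return nn.Sequential(*layers)

    # Reported sizes: layers of 64, 32, and 1 for the A2C model and the
    # structuring agent, and layers of 10 and 1 for the exploiting agent.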
D. Sensitivity Analysis

We train the A2C, DA-A2C, and DA-PPO models to convergence. Our surveillance detection algorithm is run with four different scale values, namely 1, 3, 5, and 11, for hosts (9,2), (8,2), and (3,1). For hosts (9,2) and (8,2) we run an extra scale of 15. The scale values were experimental in nature, and each scale defines a certain drop in exploration of the network. When the scale value increases, the agent becomes more risk-averse and thus keeps reducing the amount of exploration.

V. RESULTS

Model convergence is showcased by plotting the steps and reward versus episodes; results are shown in Figure 1. As can be seen from Figure 2, DA-PPO trains significantly faster than DA-A2C on all five penalty scales, both in terms of wall time and number of episodes. Specifically, DA-PPO converges in less than ten minutes compared to more than 30 minutes for DA-A2C (Figure 2a), and it requires far fewer episodes to learn an effective policy (Figure 2b).

The DAA modeling highlights a pattern in which the expected path is taken at all penalty scales for all three different target hosts, minimizing the number of defenses crossed. While the path converges after a relatively higher number of episodes, it thus offers higher scalability when applied to real-world use cases with larger networks.

Across the models, it is observed that as the penalty keeps increasing, the agent reduces the amount of exploration and explores only areas which are deemed to be safe. The simpler A2C model follows the expected path at penalty scales of 1, 11, and 15 for the three different target hosts. However, for scales of 3 and 5 with target host (3,1), the A2C model takes a potentially imperfect path which exposes it to higher risk; this suggests that even for a small test network of this size, a simpler agent scheme fails to find realistic paths.

Fig. 1: Training performance of the DAA and A2C agents with different penalty scales: (a) episode reward of the A2C agent; (b) episode steps of the A2C agent; (c) episode reward of the DAA agent; and (d) episode steps of the DAA agent.

Fig. 2: Training performance of DA-A2C and DA-PPO with different penalty scales. The left plot shows the total rewards in an episode as training time increases. The right plot shows the total number of steps in each training episode.
VI. DISCUSSION

We can observe the paths taken by the agent to achieve the SDR goal when the target hosts are (8,2) and (3,1) for penalty scales 1, 3, and 11 in Fig. 3. In Table I, we can observe that as we adjust the scale factor, the number of services being explored drops off. Furthermore, the number of high penalty hosts controlled goes down from 2 to 1 as we scale the penalty, showcasing that the agent becomes more risk averse.

TABLE I: Number of services and hosts explored with different target hosts and scale factors.

For penalty scale 1, the agent takes a path which, according to the environment and penalty, is acceptable: in its exploration of the network it takes control of hosts which are either low or medium risk except for one high penalty host (i.e., (4,0)), which is a necessity since at least one high-risk host needs to be controlled to reach the goal. Therefore, the path taken by the agent for a low penalty of 1 is acceptable.

For penalty scale 3, when the target is (8,2) the agent takes an acceptable path according to the environment and penalty, as it takes control of only low penalty hosts except for one high penalty host (i.e., (4,0)), which is again a necessity to explore services on the target host. However, when the target host is (3,1), the agent takes a path which is less acceptable according to the environment and penalty, as it takes control of two high penalty nodes (i.e., (4,0) and (3,0)) where only one was required to achieve the goal. The agent had the option to select (3,2), which is a lower penalty host than (3,0).

For penalty scale 11, the agent takes a path which is acceptable according to the state and penalty of the environment and takes control of hosts which are low risk, apart from only one high risk host which is unavoidable in achieving the goal. Associating penalty scores (low to high) with the risk aversion of human actors demonstrates that the agent performs actions that are reasonably in line with expected human adversary behavior.

Actors operating on a condensed timeline, commonly referred to as "smash and grab" operations, or actors without sufficient experience, would be examples of an agent set with a penalty scale of 1.0. Nation-state actors, highly competent APTs (Advanced Persistent Threats), and experienced actors with more time to observe and a higher cost of getting caught would be representative of an agent with a penalty scale of 11.0.

When risk aversion is low, an actor is not worried about performing "noisy" scans. These scans will include enumeration of multiple network services or simultaneously traversing multiple network segments (with each new segment potentially having deeper layers of defense incorporated into it). As observed when the penalty score is set to one (1.0), the agent finds and explores paths that involve a higher level of risk of getting caught and a lower sensitivity to negative consequences. These actions involve scanning or traversing paths with a reasonable presumption of greater security and more rigorously logged and monitored network devices, such as VPNs and firewalls (FW) (high risk hosts), as well as enumerating networks not directly associated with the respective target. The risk-accepting paths also explore several networks unnecessarily and utilize multiple services along the way. Each service used and/or exploited along the way creates another risk and/or protection system that may be in place, thus increasing the log presence and actor footprint. (As stated above, it is presumed that firewalls are monitored at a higher rate, and that security services have the most inherited controls.) While exploitation of these devices yields great impact, the associated risk is also higher. This risky behavior is reflected by the agent, as seen in Table I with the larger number of services and high penalty hosts being exploited, as well as represented by the paths chosen in Figure 3.

When risk aversion is high, the cost of getting caught outweighs the reward of network discovery, so actors, and as observed, the agent, take the most direct path possible. Risk-averse actors naturally choose paths that avoid traversing multiple protocols and services and presumed increased security controls. This lowers the log footprint and network noise generated and reduces the chances of detection. The agent's behavior of taking control of only one high risk host to reach the lower target reflects the lowest risk path that an adversary would naturally gravitate to.
Fig. 3: Network diagram showing the SDR to different target nodes for various penalty scale factors: host (3,1) for the first row and (8,2) for the second row. The nodes in green indicate hosts whose service information is available, while that of the nodes in red is not. The nodes in purple are set as initial nodes and the nodes in yellow are set as target nodes.
VII. CONCLUSION

In this work, we provide an RL realization of automating analysis methods for SDR. Our approach introduces a warm-up phase to perform pre-exploration, emulating an experienced actor's need to find the "safe" areas of a network. The capability and efficiency of our methods are validated by simulations on a custom network configured with defense mechanisms (a simulated cyber range).

The model ran on a network created to simulate the realistic placement of layered network defenses, with increasing security controls as an actor draws closer to the most sensitive information and services. As the model was faced with penalties that increased when transitioning into more secure zones, the effect of the penalty score became more demonstrative. This activity mirrors the expected risk sensitivity of an actor matching the risk profile of the model's penalty score. A more risk/penalty-averse model showed significantly more concise paths to reach the goal, while the less risk-averse model explored around and left more of a footprint on the defense system's logs.

In future work, we expect to run this model on a live network utilizing both internal and external network data. Incorporating internal and external scan information allows for a model that most closely mimics a malicious actor and defensive expectations for their actions. Using live data from a production network should also create greater granularity for the penalty scoring, as vulnerability data for CVSS3 and weakness data for the associated Common Weakness Enumeration (CWE) can be incorporated.
REFERENCES

[1] (2021) MITRE ATT&CK® framework. [Online]. Available: https://attack.mitre.org
[2] R. Schoemaker, R. Sandbrink, and G. van Voorthuijsen, "Intelligent route surveillance," in Unattended Ground, Sea, and Air Sensor Technologies and Applications XI, E. M. Carapezza, Ed., vol. 7333, International Society for Optics and Photonics. SPIE, 2009, pp. 83–90.
[3] H. V. Nguyen, S. Teerakanok, A. Inomata, and T. Uehara, "The proposal of double agent architecture using actor-critic algorithm for penetration testing," in ICISSP, 2021, pp. 440–449.
[4] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017.
[5] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[6] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," arXiv preprint arXiv:1312.5602, 2013.
[7] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[8] C. Weissman, "Penetration testing," Information Security: An Integrated Collection of Essays, vol. 6, pp. 269–296, 1995.
[9] G. Chu and A. Lisitsa, "Ontology-based automation of penetration testing," in ICISSP, 2020, pp. 713–720.
[10] B. Schneier, "Attack trees," Dr. Dobb's Journal, vol. 24, no. 12, pp. 21–29, 1999.
[11] T. Cody, "A layered reference model for penetration testing with reinforcement learning and attack graphs," arXiv preprint arXiv:2206.06934, 2022.
[12] M. C. Ghanem and T. M. Chen, "Reinforcement learning for efficient network penetration testing," Information, vol. 11, no. 1, p. 6, 2020.
[13] M. Yousefi, N. Mtetwa, Y. Zhang, and H. Tianfield, "A reinforcement learning approach for attack graph analysis," in 2018 17th IEEE International Conference on Trust, Security and Privacy in Computing and Communications/12th IEEE International Conference on Big Data Science and Engineering (TrustCom/BigDataSE). IEEE, 2018, pp. 212–217.
[14] Z. Hu, R. Beuran, and Y. Tan, "Automated penetration testing using deep reinforcement learning," in 2020 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW). IEEE, 2020, pp. 2–10.
[15] R. Gangupantulu, T. Cody, P. Park, A. Rahman, L. Eisenbeiser, D. Radke, and R. Clark, "Using cyber terrain in reinforcement learning for penetration testing," Submitted to ACM ASIACCS 2022, 2021.
[16] R. Gangupantulu, T. Cody, A. Rahman, C. Redino, R. Clark, and P. Park, "Crown jewels analysis using reinforcement learning with attack graphs," arXiv preprint arXiv:2108.09358, 2021.
[17] T. Cody, A. Rahman, C. Redino, L. Huang, R. Clark, A. Kakkar, D. Kushwaha, P. Park, P. Beling, and E. Bowen, "Discovering exfiltration paths using reinforcement learning with attack graphs," arXiv preprint arXiv:2201.12416, 2022.
[18] J. Schwartz and H. Kurniawati, "Autonomous penetration testing using reinforcement learning," arXiv preprint arXiv:1905.05965, 2019.
[19] A. Chowdhary, D. Huang, J. S. Mahendran, D. Romo, Y. Deng, and A. Sabur, "Autonomous security analysis and penetration testing," in 2020 16th International Conference on Mobility, Sensing and Networking (MSN). IEEE, 2020, pp. 508–515.
[20] K. Tran, A. Akella, M. Standen, J. Kim, D. Bowman, T. Richer, and C.-T. Lin, "Deep hierarchical reinforcement agents for automated penetration testing," arXiv preprint arXiv:2109.06449, 2021.
[21] S. Zhou, J. Liu, D. Hou, X. Zhong, and Y. Zhang, "Autonomous penetration testing based on improved deep Q-network," Applied Sciences, vol. 11, no. 19, p. 8823, 2021.
[22] F. M. Zennaro and L. Erdodi, "Modeling penetration testing with reinforcement learning using capture-the-flag challenges: trade-offs between model-free learning and a priori knowledge," arXiv preprint arXiv:2005.12632, 2020.
[23] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, "High-dimensional continuous control using generalized advantage estimation," arXiv preprint arXiv:1506.02438, 2015.